---
theme: seriph
title: Scalable Oversight for Complex AI Tasks
titleTemplate: '%s - AI Safety & Oversight'
author: Rossi Stefano
info: |
  ## Methods for Scaling Human Feedback in AI Supervision
keywords: AI Safety, Scalable Oversight, LLMs, Human Feedback, Alignment, AI Debate
mdc: true
hideInToc: false
presenter: true
browserExporter: dev
download: true
exportFilename: scalable-oversight-for-ai
twoslash: false
lineNumbers: true
monaco: false
selectable: false
record: dev
contextMenu: dev
wakeLock: true
overviewSnapshots: false
colorSchema: dark
routerMode: history
aspectRatio: 16/9
canvasWidth: 980
transition: slide-left
background: none
---
Scalable Oversight for Complex AI
Can Human Feedback Keep Up?
Techniques to Align Large Language Models at Scale
Stefano Rossi
2 May 2025
Introduction
- Alignment at scale
- Human limitations
- Model deception
- Recursive techniques
- AI-augmented feedback
- Factored cognition
The Problem with Feedback
Why it fails
- Outputs are too complex for humans to judge correctly
- Deception and hallucinations trick reviewers
- LLMs sycophantically agree rather than pursue truth
Why it matters
- Most alignment methods rely on human supervision
- RLHF breaks down as task scale and complexity grow
- Feedback noise → reward hacking and misalignment
What Is Scalable Oversight?
Scalable oversight techniques aim to empower humans to give accurate feedback on complex tasks.
They leverage structured methods, AI assistants, or recursive mechanisms to extend our cognitive reach.
Techniques Overview
Technique | Description |
---|---|
Iterated Amplification | Task decomposition + model assistance = scalable evaluation |
Recursive Reward Modeling | AI helps humans give better feedback to train better AI |
Constitutional AI | Use fixed rules to guide feedback generation |
Debate | AIs argue, a human judges; opposition surfaces deception |
Weak-to-Strong Generalization | Train powerful models with weaker labels |
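Of these, recursive reward modeling is the only technique not expanded on a later slide, so here is a minimal, hypothetical sketch of one round: an assistant model critiques candidate outputs, an assisted human states preferences, and a reward model is fit to those preferences to train the next agent. Every function (`assistant_explains`, `human_prefers`, `fit_reward_model`) is a placeholder stand-in, not a real training API.

```python
# Hypothetical sketch of one recursive-reward-modeling round.
# All functions here are placeholder stand-ins, not a real training API.

def assistant_explains(output: str) -> str:
    """Placeholder: an existing model critiques the output for the human."""
    return f"critique of: {output!r}"

def human_prefers(a: str, b: str, critique_a: str, critique_b: str) -> str:
    """Placeholder: the human picks the better output, aided by the critiques.
    Toy rule: prefer the longer answer."""
    return a if len(a) >= len(b) else b

def fit_reward_model(preferences: list[tuple[str, str]]) -> dict:
    """Placeholder: fit a reward model to (winner, loser) pairs."""
    return {"num_pairs": len(preferences)}

def rrm_round(candidates: list[str]) -> dict:
    prefs = []
    # Compare candidates pairwise: (0, 1), (2, 3), ...
    for a, b in zip(candidates[::2], candidates[1::2]):
        winner = human_prefers(a, b, assistant_explains(a), assistant_explains(b))
        prefs.append((winner, b if winner == a else a))
    # The fitted reward model would train the next agent, and the loop repeats.
    return fit_reward_model(prefs)

if __name__ == "__main__":
    print(rrm_round(["draft A", "a longer draft B", "draft C", "D"]))
```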
Iterated Amplification (IDA)
Core Idea
- Decompose hard problems
- Evaluate sub-steps
- Train model to replicate the full pipeline
Example
- Book summarization
- Summarize pages → chapters → full book
- Distill into a one-shot summary model
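The decomposition loop can be made concrete with a small sketch; `model_summarize` is a hypothetical stand-in for an LLM call (it just truncates here so the example runs offline), and the chunk sizes are arbitrary.

```python
# Minimal sketch of IDA-style decomposition for summarization.

def model_summarize(text: str, max_chars: int = 200) -> str:
    """Placeholder 'model': in practice an LLM would summarize the text."""
    return text[:max_chars]

def chunk(text: str, size: int) -> list[str]:
    """Split the text into fixed-size chunks ('pages')."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def amplified_summarize(text: str, chunk_size: int = 1000, target_len: int = 300) -> str:
    """Summarize pages -> chapters -> book by recursing on partial summaries."""
    if len(text) <= target_len:
        return text
    # Decompose: each sub-summary is short enough for a human to spot-check.
    partials = [model_summarize(c) for c in chunk(text, chunk_size)]
    # Recombine and recurse until the summary fits the target length.
    return amplified_summarize(" ".join(partials), chunk_size, target_len)

if __name__ == "__main__":
    book = "All work and no play makes Jack a dull boy. " * 500
    print(len(amplified_summarize(book)))  # a distilled model would learn this end-to-end
```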
Debate
Two AIs take opposing views on a complex topic and argue it out. A human (or weaker AI) acts as judge.
- Helps surface flaws and manipulations
- Leverages model’s reasoning capacity
- Challenges: truth ≠ persuasion, collusion risk
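A toy sketch of the protocol, assuming hypothetical `debater` and `judge` stubs in place of prompted models and a human judge:

```python
# Toy sketch of the debate protocol: two stand-in debaters exchange arguments
# for a fixed number of rounds, then a (weak) judge picks a side.

import random

def debater(stance: str, question: str, transcript: list[str]) -> str:
    """Placeholder: in practice an LLM prompted to argue `stance`, given the transcript."""
    return f"[{stance}] argument {len(transcript) // 2 + 1} on: {question}"

def judge(question: str, transcript: list[str]) -> str:
    """Placeholder: in practice a human (or weaker model) reads the transcript."""
    return random.choice(["pro", "con"])

def run_debate(question: str, rounds: int = 3) -> str:
    transcript: list[str] = []
    for _ in range(rounds):
        transcript.append(debater("pro", question, transcript))
        transcript.append(debater("con", question, transcript))
    # The judge never has to solve the task directly, only to compare arguments.
    return judge(question, transcript)

if __name__ == "__main__":
    print(run_debate("Does the cited experiment support the paper's claim?"))
```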
Constitutional AI
Replace human feedback with AI critiques guided by human-written principles.
Used in Anthropic’s Claude models, where rules encode ethical and practical constraints.
- Scalable via automation
- Fewer humans involved, more repeatability
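A hedged sketch of the critique-and-revise loop, with `CONSTITUTION`, `critique`, and `revise` as illustrative placeholders rather than Anthropic's actual pipeline:

```python
# Sketch of a constitutional critique-and-revise pass; the stubs only
# show the control flow, not a real critic or reviser model.

CONSTITUTION = [
    "Avoid giving instructions that could cause physical harm.",
    "Be honest about uncertainty.",
]

def critique(response: str, principle: str) -> str | None:
    """Placeholder: a real critic model would flag violations of `principle`."""
    if "guaranteed" in response and "uncertainty" in principle:
        return "Overstates certainty; hedge the claim."
    return None

def revise(response: str, critique_text: str) -> str:
    """Placeholder: a real model would rewrite the response to address the critique."""
    return response.replace("guaranteed", "likely")

def constitutional_pass(response: str) -> str:
    for principle in CONSTITUTION:
        issue = critique(response, principle)
        if issue is not None:
            response = revise(response, issue)
    # Revised responses become training data, replacing most human feedback.
    return response

if __name__ == "__main__":
    print(constitutional_pass("This approach is guaranteed to work."))
```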
Weak-to-Strong Generalization
- Train GPT-4 using labels from a GPT-2-level supervisor
- The strong student outperforms its weak teacher
- Generalization beats imitation
- Bootstrapping + an auxiliary confidence loss = better performance (loss sketched below)
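One assumed form of that auxiliary confidence loss, a mixture of cross-entropy to the weak labels and cross-entropy to the student's own hardened predictions, sketched with PyTorch on random tensors rather than real models:

```python
# Sketch of an auxiliary confidence loss for weak-to-strong training
# (assumed form; the weighting schedule is omitted for brevity).

import torch
import torch.nn.functional as F

def weak_to_strong_loss(student_logits: torch.Tensor,
                        weak_labels: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    """student_logits: (batch, classes); weak_labels: (batch,) int class labels."""
    # Term 1: imitate the weak supervisor's labels.
    imitation = F.cross_entropy(student_logits, weak_labels)
    # Term 2: reinforce the student's own (hardened) predictions.
    hardened = student_logits.argmax(dim=-1).detach()
    confidence = F.cross_entropy(student_logits, hardened)
    return (1 - alpha) * imitation + alpha * confidence

if __name__ == "__main__":
    logits = torch.randn(8, 3, requires_grad=True)
    labels = torch.randint(0, 3, (8,))
    loss = weak_to_strong_loss(logits, labels)
    loss.backward()
    print(float(loss))
```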
Key Challenges
Failure Risks
- Models appear aligned, but aren’t
- Deceptive behavior during training
- Generalization failures
Limitations
- Factored cognition doesn’t scale universally
- Feedback may be misinterpreted
- Ongoing distributional shift at deployment
Conclusion
Scalable oversight is not a silver bullet, but it is the best shot we have at aligning complex systems to human values, or at least approximations of them.
The future of AI safety depends on teaching AIs to teach themselves, with guardrails we can actually monitor.