---
theme: seriph
title: Scalable Oversight for Complex AI Tasks
titleTemplate: '%s - AI Safety & Oversight'
author: Rossi Stefano
info: |
  ## Methods for Scaling Human Feedback in AI Supervision
keywords: AI Safety, Scalable Oversight, LLMs, Human Feedback, Alignment, AI Debate
mdc: true
hideInToc: false
presenter: true
browserExporter: dev
download: true
exportFilename: scalable-oversight-for-ai
twoslash: false
lineNumbers: true
monaco: false
selectable: false
record: dev
contextMenu: dev
wakeLock: true
overviewSnapshots: false
colorSchema: dark
routerMode: history
aspectRatio: 16/9
canvasWidth: 980
transition: slide-left
background: none
---
Scalable Oversight for Complex AI
Can Human Feedback Keep Up?
Techniques to Align Large Language Models at Scale
Stefano Rossi
2 May 2025
Introduction
- Alignment at scale
- Human limitations
- Model deception
- Recursive techniques
- AI-augmented feedback
- Factored cognition
The Problem with Feedback
Why it fails
- Outputs are too complex for humans to judge correctly
- Deception and hallucinations trick reviewers
- LLMs sycophantically agree rather than pursue truth
Why it matters
- Most alignment methods rely on human supervision
- RLHF breaks down as task scale and complexity grow
- Feedback noise → reward hacking and misalignment
What Is Scalable Oversight?
Scalable oversight techniques aim to empower humans to give accurate feedback on complex tasks.
They leverage structured methods, AI assistants, or recursive mechanisms to extend our cognitive reach.
Techniques Overview
Technique | Description |
---|---|
Iterated Amplification | Task decomposition + model assistance = scalable evaluation |
Recursive Reward Modeling | AI helps humans give better feedback to train better AI |
Constitutional AI | Use fixed rules to guide feedback generation |
Debate | AIs argue, a human judges; opposition surfaces deception |
Weak-to-Strong Generalization | Train powerful models with weaker labels |
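Of these, recursive reward modeling is the only technique not expanded on a later slide, so here is a minimal, hypothetical sketch of one round: an assistant model critiques candidate outputs, an assisted human states preferences, and a reward model is fit to those preferences to train the next agent. Every function (`assistant_explains`, `human_prefers`, `fit_reward_model`) is a placeholder stand-in, not a real training API.

```python
# Hypothetical sketch of one recursive-reward-modeling round.
# All functions here are placeholder stand-ins, not a real training API.

def assistant_explains(output: str) -> str:
    """Placeholder: an existing model critiques the output for the human."""
    return f"critique of: {output!r}"

def human_prefers(a: str, b: str, critique_a: str, critique_b: str) -> str:
    """Placeholder: the human picks the better output, aided by the critiques.
    Toy rule: prefer the longer answer."""
    return a if len(a) >= len(b) else b

def fit_reward_model(preferences: list[tuple[str, str]]) -> dict:
    """Placeholder: fit a reward model to (winner, loser) pairs."""
    return {"num_pairs": len(preferences)}

def rrm_round(candidates: list[str]) -> dict:
    prefs = []
    # Compare candidates pairwise: (0, 1), (2, 3), ...
    for a, b in zip(candidates[::2], candidates[1::2]):
        winner = human_prefers(a, b, assistant_explains(a), assistant_explains(b))
        prefs.append((winner, b if winner == a else a))
    # The fitted reward model would train the next agent, and the loop repeats.
    return fit_reward_model(prefs)

if __name__ == "__main__":
    print(rrm_round(["draft A", "a longer draft B", "draft C", "D"]))
```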
Iterated Amplification (IDA)
Core Idea
- Decompose hard problems
- Evaluate sub-steps
- Train model to replicate the full pipeline
Example
- Book summarization
- Summarize pages → chapters → full book
- Distill into a one-shot summary model
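The decomposition loop can be made concrete with a small sketch; `model_summarize` is a hypothetical stand-in for an LLM call (it just truncates here so the example runs offline), and the chunk sizes are arbitrary.

```python
# Minimal sketch of IDA-style decomposition for summarization.

def model_summarize(text: str, max_chars: int = 200) -> str:
    """Placeholder 'model': in practice an LLM would summarize the text."""
    return text[:max_chars]

def chunk(text: str, size: int) -> list[str]:
    """Split the text into fixed-size chunks ('pages')."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def amplified_summarize(text: str, chunk_size: int = 1000, target_len: int = 300) -> str:
    """Summarize pages -> chapters -> book by recursing on partial summaries."""
    if len(text) <= target_len:
        return text
    # Decompose: each sub-summary is short enough for a human to spot-check.
    partials = [model_summarize(c) for c in chunk(text, chunk_size)]
    # Recombine and recurse until the summary fits the target length.
    return amplified_summarize(" ".join(partials), chunk_size, target_len)

if __name__ == "__main__":
    book = "All work and no play makes Jack a dull boy. " * 500
    print(len(amplified_summarize(book)))  # a distilled model would learn this end-to-end
```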
Debate
Two AIs take opposing views on a complex topic and argue it out. A human (or weaker AI) acts as judge.
- Helps surface flaws and manipulations
- Leverages model’s reasoning capacity
- Challenges: truth ≠ persuasion, collusion risk
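A toy sketch of the protocol, assuming hypothetical `debater` and `judge` stubs in place of prompted models and a human judge:

```python
# Toy sketch of the debate protocol: two stand-in debaters exchange arguments
# for a fixed number of rounds, then a (weak) judge picks a side.

import random

def debater(stance: str, question: str, transcript: list[str]) -> str:
    """Placeholder: in practice an LLM prompted to argue `stance`, given the transcript."""
    return f"[{stance}] argument {len(transcript) // 2 + 1} on: {question}"

def judge(question: str, transcript: list[str]) -> str:
    """Placeholder: in practice a human (or weaker model) reads the transcript."""
    return random.choice(["pro", "con"])

def run_debate(question: str, rounds: int = 3) -> str:
    transcript: list[str] = []
    for _ in range(rounds):
        transcript.append(debater("pro", question, transcript))
        transcript.append(debater("con", question, transcript))
    # The judge never has to solve the task directly, only to compare arguments.
    return judge(question, transcript)

if __name__ == "__main__":
    print(run_debate("Does the cited experiment support the paper's claim?"))
```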
Constitutional AI
Replace human feedback with AI critiques guided by human-written principles.
Used in Anthropic’s Claude models, where rules encode ethical and practical constraints.
- Scalable via automation
- Fewer humans involved, more repeatability
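A hedged sketch of the critique-and-revise loop, with `CONSTITUTION`, `critique`, and `revise` as illustrative placeholders rather than Anthropic's actual pipeline:

```python
# Sketch of a constitutional critique-and-revise pass; the stubs only
# show the control flow, not a real critic or reviser model.

CONSTITUTION = [
    "Avoid giving instructions that could cause physical harm.",
    "Be honest about uncertainty.",
]

def critique(response: str, principle: str) -> str | None:
    """Placeholder: a real critic model would flag violations of `principle`."""
    if "guaranteed" in response and "uncertainty" in principle:
        return "Overstates certainty; hedge the claim."
    return None

def revise(response: str, critique_text: str) -> str:
    """Placeholder: a real model would rewrite the response to address the critique."""
    return response.replace("guaranteed", "likely")

def constitutional_pass(response: str) -> str:
    for principle in CONSTITUTION:
        issue = critique(response, principle)
        if issue is not None:
            response = revise(response, issue)
    # Revised responses become training data, replacing most human feedback.
    return response

if __name__ == "__main__":
    print(constitutional_pass("This approach is guaranteed to work."))
```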
Weak-to-Strong Generalization
- Train GPT-4 using labels from a GPT-2-level supervisor
- The strong student outperforms its weak teacher
- Generalization beats imitation
- Bootstrapping + an auxiliary confidence loss = better performance (loss sketched below)
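One assumed form of that auxiliary confidence loss, a mixture of cross-entropy to the weak labels and cross-entropy to the student's own hardened predictions, sketched with PyTorch on random tensors rather than real models:

```python
# Sketch of an auxiliary confidence loss for weak-to-strong training
# (assumed form; the weighting schedule is omitted for brevity).

import torch
import torch.nn.functional as F

def weak_to_strong_loss(student_logits: torch.Tensor,
                        weak_labels: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    """student_logits: (batch, classes); weak_labels: (batch,) int class labels."""
    # Term 1: imitate the weak supervisor's labels.
    imitation = F.cross_entropy(student_logits, weak_labels)
    # Term 2: reinforce the student's own (hardened) predictions.
    hardened = student_logits.argmax(dim=-1).detach()
    confidence = F.cross_entropy(student_logits, hardened)
    return (1 - alpha) * imitation + alpha * confidence

if __name__ == "__main__":
    logits = torch.randn(8, 3, requires_grad=True)
    labels = torch.randint(0, 3, (8,))
    loss = weak_to_strong_loss(logits, labels)
    loss.backward()
    print(float(loss))
```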
Key Challenges
Failure Risks
- Models appear aligned, but aren’t
- Deceptive behavior during training
- Generalization failures
Limitations
- Factored cognition doesn’t scale universally
- Feedback may be misinterpreted
- Ongoing distributional shift at deployment
Conclusion
Scalable oversight is not a silver bullet, but it is the best shot we have at aligning complex systems to human values, or at least approximations of them.
The future of AI safety depends on teaching AIs to teach themselves, with guardrails we can actually monitor.