---
# theme id, package name, or local path
theme: seriph
title: Scalable Oversight for Complex AI Tasks
titleTemplate: '%s - AI Safety & Oversight'
author: Rossi Stefano
info: |
  ## Methods for Scaling Human Feedback in AI Supervision
keywords: AI Safety, Scalable Oversight, LLMs, Human Feedback, Alignment, AI Debate
mdc: true
hideInToc: false
addons:
  - slidev-addon-rabbit
  - slidev-addon-python-runner
python:
  installs: []
  prelude: ''
  loadPackagesFromImports: true
  suppressDeprecationWarnings: true
  alwaysReload: false
  loadPyodideOptions: {}
presenter: true
browserExporter: dev
download: true
exportFilename: scalable-oversight-for-ai
twoslash: false
lineNumbers: true
monaco: false
selectable: false
record: dev
contextMenu: dev
wakeLock: true
overviewSnapshots: false
colorSchema: dark
routerMode: history
aspectRatio: 16/9
canvasWidth: 980
css:
  - unocss
unocss:
  configFile: './uno.config.ts'
defaults:
  layout: center
drawings:
  enabled: true
  persist: false
  presenterOnly: false
  syncAll: true
htmlAttrs:
  dir: ltr
  lang: en
transition: slide-left
background: none
---
Scalable oversight techniques aim to empower humans to give accurate feedback on complex tasks.
They leverage structured methods, AI assistants, or recursive mechanisms to extend our cognitive reach.
| Technique | Description |
| --- | --- |
| Iterated Amplification | Task decomposition + model assistance = scalable evaluation (see the sketch below) |
| Recursive Reward Modeling | AI helps humans give better feedback to train better AI |
| Constitutional AI | Fixed, human-written rules guide feedback generation |
| Debate | AIs argue, a human judges; opposition surfaces deception |
| Weak-to-Strong Generalization | Train powerful models with weaker labels |
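
A loose sketch of the Iterated Amplification row above: decompose a hard question, answer the pieces with model help, then aggregate. `ask_model` here is a hypothetical placeholder, not any real API.

```python
# Minimal sketch of iterated amplification via task decomposition.
# `ask_model` stands in for a call to an assistant model (assumption, not a real API).

def ask_model(prompt: str) -> str:
    """Placeholder for an LLM call."""
    return f"<model answer to: {prompt[:40]}...>"

def amplify(question: str, depth: int = 2) -> str:
    if depth == 0:
        return ask_model(question)  # base case: small enough to answer directly
    subquestions = ask_model(f"Split into subquestions: {question}").split(";")
    subanswers = [amplify(q.strip(), depth - 1) for q in subquestions]
    # The overseer (human + model) only has to aggregate sub-answers,
    # never evaluate the full task in one step.
    return ask_model(f"Combine {subanswers} to answer: {question}")

print(amplify("Is this 500-page policy proposal safe to deploy?"))
```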
Two AIs take opposing views on a complex topic and argue it out. A human (or weaker AI) acts as judge.
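
A minimal sketch of one debate, assuming a hypothetical `ask_model(role, prompt)` helper; the judge could just as well be a human reading the transcript.

```python
# Sketch of AI-vs-AI debate: two debaters alternate, a weaker judge decides
# from the transcript alone. `ask_model` is a hypothetical stand-in for an LLM call.

def ask_model(role: str, prompt: str) -> str:
    """Placeholder for a call to a debater or judge model."""
    return f"<{role} response to: {prompt[:40]}...>"

def debate(question: str, rounds: int = 3) -> str:
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        for debater in ("Debater A (pro)", "Debater B (con)"):
            argument = ask_model(debater, "\n".join(transcript))
            transcript.append(f"{debater}: {argument}")
    # The judge never sees the underlying task, only the arguments.
    return ask_model("Judge", "\n".join(transcript) + "\nWho argued more truthfully?")

print(debate("Does this code change introduce a security flaw?"))
```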
Replace human feedback with AI critiques guided by human-written principles.
Used to train Anthropic’s Claude models, where a written constitution encodes ethical and practical constraints.
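
A rough sketch of the critique-and-revise loop (not Anthropic’s actual implementation); `ask_model` and the example principles are illustrative assumptions.

```python
# Sketch of constitutional feedback: the model critiques and revises its own
# draft against written principles, with no human rater in the loop.

PRINCIPLES = [
    "Choose the response that is least likely to cause harm.",
    "Choose the response that is most honest about uncertainty.",
]

def ask_model(prompt: str) -> str:
    """Placeholder for a call to the model being supervised."""
    return f"<model output for: {prompt[:40]}...>"

def constitutional_revision(user_prompt: str) -> str:
    draft = ask_model(user_prompt)
    for principle in PRINCIPLES:
        critique = ask_model(f"Critique this reply against the rule '{principle}':\n{draft}")
        draft = ask_model(f"Rewrite the reply to address the critique:\n{critique}\n{draft}")
    return draft  # revised answers become the training signal for the next model

print(constitutional_revision("Summarize the risks of this deployment plan."))
```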
Scalable oversight is not a silver bullet. But it’s the best shot we’ve got at aligning complex systems to human values, or at least to approximations of them.
The future of AI safety depends on teaching AIs to teach themselves, with guardrails we can actually monitor.