---
# theme id, package name, or local path
theme: seriph
title: Scalable Oversight for Complex AI Tasks
titleTemplate: '%s - AI Safety & Oversight'
author: Rossi Stefano
info: |
  ## Methods for Scaling Human Feedback in AI Supervision
keywords: AI Safety, Scalable Oversight, LLMs, Human Feedback, Alignment, AI Debate
mdc: true
hideInToc: false
addons:
  - slidev-addon-rabbit
  - slidev-addon-python-runner
python:
  installs: []
  prelude: ''
  loadPackagesFromImports: true
  suppressDeprecationWarnings: true
  alwaysReload: false
  loadPyodideOptions: {}
presenter: true
browserExporter: dev
download: true
exportFilename: scalable-oversight-for-ai
twoslash: false
lineNumbers: true
monaco: false
selectable: false
record: dev
contextMenu: dev
wakeLock: true
overviewSnapshots: false
colorSchema: dark
routerMode: history
aspectRatio: 16/9
canvasWidth: 980
css:
  - unocss
unocss:
  configFile: './uno.config.ts'
defaults:
  layout: center
drawings:
  enabled: true
  persist: false
  presenterOnly: false
  syncAll: true
htmlAttrs:
  dir: ltr
  lang: en
transition: slide-left
background: none
---

# Scalable Oversight for Complex AI

## Can Human Feedback Keep Up?

Techniques to Align Large Language Models at Scale

Stefano Rossi
2 May, 2025

---

# Introduction

---

# The Problem with Feedback

- Why it fails: humans struggle to give accurate feedback on tasks too complex to evaluate directly
- Why it matters: the quality of that feedback caps how well we can align increasingly capable models

---

# What Is Scalable Oversight?

Scalable oversight techniques aim to empower humans to give accurate feedback on complex tasks.

They leverage structured methods, AI assistants, or recursive mechanisms to extend our cognitive reach.

---

# Techniques Overview

| Technique | Description |
| --- | --- |
| Iterated Amplification | Task decomposition + model assistance = scalable evaluation |
| Recursive Reward Modeling | AI helps humans give better feedback to train better AI |
| Constitutional AI | Use fixed rules to guide feedback generation |
| Debate | AIs argue, a human judges; opposition surfaces deception |
| Weak-to-Strong Generalization | Train powerful models with weaker labels |

---

# Iterated Amplification (IDA)

Core Idea: decompose a hard task into subtasks, let model assistants help solve and evaluate each piece, and aggregate the results so a human can oversee work they could not judge whole.

Example: reviewing a long technical report by checking one section at a time, as in the sketch below.
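
A minimal sketch of one amplification step, assuming a hypothetical `ask_model` helper in place of a real model API:

```python
def ask_model(prompt: str) -> str:
    """Stand-in for a call to an assistant model (hypothetical, not a real API)."""
    return f"<model answer to: {prompt[:60]}>"


def amplify(task: str, depth: int = 2) -> str:
    """Answer a task by decomposing it until each piece is simple enough to check."""
    if depth == 0:
        # Base case: small enough for the overseer (or a trusted model) to evaluate directly.
        return ask_model(task)
    # Decompose the hard task into simpler subtasks.
    subtasks = ask_model(f"List subquestions whose answers settle: {task}").splitlines()
    # Answer each subtask with a smaller amplified process.
    sub_answers = [amplify(s, depth - 1) for s in subtasks]
    # Compose the pieces into an answer the overseer can spot-check.
    return ask_model(f"Combine {sub_answers} to answer: {task}")


print(amplify("Is this 200-page audit report internally consistent?"))
```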

---

# Debate

Two AIs take opposing views on a complex topic and argue it out. A human (or weaker AI) acts as judge.
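
A minimal sketch of the two-debater protocol, assuming a hypothetical `generate` helper in place of real model calls:

```python
def generate(role: str, prompt: str) -> str:
    """Stand-in for sampling from a model conditioned on a role (hypothetical)."""
    return f"[{role}] argument given: {prompt[-60:]}"


def debate(question: str, rounds: int = 2) -> str:
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        # Each debater sees the transcript so far and must rebut the other side.
        for side in ("PRO", "CON"):
            transcript.append(generate(side, "\n".join(transcript)))
    # The judge sees only the transcript, never the full underlying task.
    return generate("JUDGE", "\n".join(transcript) + "\nWhich side argued honestly and won?")


print(debate("Does this pull request introduce a security backdoor?"))
```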

---

# Constitutional AI

Replace human feedback with AI critiques guided by human-written principles.

Used in Anthropic’s Claude models, where rules encode ethical and practical constraints.
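
A minimal sketch of the critique-and-revise loop, with illustrative principles and a hypothetical `ask_model` helper (not Anthropic's actual constitution or API):

```python
PRINCIPLES = [
    "Do not help with harmful or illegal activities.",
    "Prefer honest, clearly hedged answers over confident guesses.",
]


def ask_model(prompt: str) -> str:
    """Stand-in for a call to the model being trained (hypothetical)."""
    return f"<model output for: {prompt[:60]}>"


def constitutional_revision(user_prompt: str) -> str:
    draft = ask_model(user_prompt)
    for principle in PRINCIPLES:
        # The model critiques its own draft against each written principle...
        critique = ask_model(f"Critique this reply against '{principle}':\n{draft}")
        # ...then revises the draft to address that critique, with no human in the loop.
        draft = ask_model(f"Revise the reply to address this critique:\n{critique}\n{draft}")
    return draft  # revised outputs become the training data for the final model
```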

---

# Weak-to-Strong Generalization

Train a strong model on labels produced by a weaker supervisor, and measure how much of the strong model's capability those imperfect labels can elicit.
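
A minimal illustrative sketch using scikit-learn stand-ins: a small "weak" model labels the data, and a larger "strong" model is trained only on those noisy labels, never on ground truth.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_weak, X_task, y_weak, y_task = train_test_split(X, y, test_size=0.5, random_state=0)

# The weak supervisor is fit on a held-out slice that has true labels.
weak = LogisticRegression(max_iter=1000).fit(X_weak, y_weak)

# The strong model never sees ground truth, only the weak supervisor's labels.
strong = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300, random_state=0)
strong.fit(X_task, weak.predict(X_task))

print("weak supervisor accuracy:", round(weak.score(X_task, y_task), 3))
print("strong-on-weak accuracy: ", round(strong.score(X_task, y_task), 3))
```

The question the technique studies is whether the strong model ends up closer to the truth than the weak supervisor that taught it.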

---

# Key Challenges

- Failure risks: assistants and debaters can be persuasive without being correct, and errors can compound through recursive setups
- Limitations: every technique still bottoms out in some human judgment, which remains slow, costly, and fallible

---

# Conclusion

Scalable oversight is not a silver bullet. But it is the best shot we have at aligning complex systems to human values, or at least approximations of them.

The future of AI safety depends on teaching AIs to teach themselves, with guardrails we can actually monitor.

---

Questions?