---
# theme id, package name, or local path
theme: seriph
title: Scalable Oversight for Complex AI Tasks
titleTemplate: '%s - AI Safety & Oversight'
author: Rossi Stefano
info: |
  ## Methods for Scaling Human Feedback in AI Supervision
keywords: AI Safety, Scalable Oversight, LLMs, Human Feedback, Alignment, AI Debate
mdc: true
hideInToc: false
addons:
  - slidev-addon-rabbit
  - slidev-addon-python-runner
python:
  installs: []
  prelude: ''
  loadPackagesFromImports: true
  suppressDeprecationWarnings: true
  alwaysReload: false
  loadPyodideOptions: {}
presenter: true
browserExporter: dev
download: true
exportFilename: scalable-oversight-for-ai
twoslash: false
lineNumbers: true
monaco: false
selectable: false
record: dev
contextMenu: dev
wakeLock: true
overviewSnapshots: false
colorSchema: dark
routerMode: history
aspectRatio: 16/9
canvasWidth: 980
css:
  - unocss
unocss:
  configFile: './uno.config.ts'
defaults:
  layout: center
drawings:
  enabled: true
  persist: false
  presenterOnly: false
  syncAll: true
htmlAttrs:
  dir: ltr
  lang: en
transition: slide-left
background: none
---
Scalable oversight techniques aim to empower humans to give accurate feedback on complex tasks.
They leverage structured methods, AI assistants, or recursive mechanisms to extend our cognitive reach.
| Technique | Description |
| --- | --- |
| Iterated Amplification | Task decomposition + model assistance = scalable evaluation (see the sketch below) |
| Recursive Reward Modeling | AI helps humans give better feedback to train better AI |
| Constitutional AI | Fixed, human-written rules guide feedback generation |
| Debate | AIs argue, a human judges; opposition surfaces deception |
| Weak-to-Strong Generalization | Train powerful models with weaker labels |
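
A loose sketch of the Iterated Amplification row above: decompose a hard question, answer the pieces with model help, then aggregate. `ask_model` here is a hypothetical placeholder, not any real API.

```python
# Minimal sketch of iterated amplification via task decomposition.
# `ask_model` stands in for a call to an assistant model (assumption, not a real API).

def ask_model(prompt: str) -> str:
    """Placeholder for an LLM call."""
    return f"<model answer to: {prompt[:40]}...>"

def amplify(question: str, depth: int = 2) -> str:
    if depth == 0:
        return ask_model(question)  # base case: small enough to answer directly
    subquestions = ask_model(f"Split into subquestions: {question}").split(";")
    subanswers = [amplify(q.strip(), depth - 1) for q in subquestions]
    # The overseer (human + model) only has to aggregate sub-answers,
    # never evaluate the full task in one step.
    return ask_model(f"Combine {subanswers} to answer: {question}")

print(amplify("Is this 500-page policy proposal safe to deploy?"))
```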
Two AIs take opposing views on a complex topic and argue it out. A human (or weaker AI) acts as judge.
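
A minimal sketch of one debate, assuming a hypothetical `ask_model(role, prompt)` helper; the judge could just as well be a human reading the transcript.

```python
# Sketch of AI-vs-AI debate: two debaters alternate, a weaker judge decides
# from the transcript alone. `ask_model` is a hypothetical stand-in for an LLM call.

def ask_model(role: str, prompt: str) -> str:
    """Placeholder for a call to a debater or judge model."""
    return f"<{role} response to: {prompt[:40]}...>"

def debate(question: str, rounds: int = 3) -> str:
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        for debater in ("Debater A (pro)", "Debater B (con)"):
            argument = ask_model(debater, "\n".join(transcript))
            transcript.append(f"{debater}: {argument}")
    # The judge never sees the underlying task, only the arguments.
    return ask_model("Judge", "\n".join(transcript) + "\nWho argued more truthfully?")

print(debate("Does this code change introduce a security flaw?"))
```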
Replace human feedback with AI critiques guided by human-written principles.
Used to train Anthropic’s Claude models, where a written constitution encodes ethical and practical constraints.
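
A rough sketch of the critique-and-revise loop (not Anthropic’s actual implementation); `ask_model` and the example principles are illustrative assumptions.

```python
# Sketch of constitutional feedback: the model critiques and revises its own
# draft against written principles, with no human rater in the loop.

PRINCIPLES = [
    "Choose the response that is least likely to cause harm.",
    "Choose the response that is most honest about uncertainty.",
]

def ask_model(prompt: str) -> str:
    """Placeholder for a call to the model being supervised."""
    return f"<model output for: {prompt[:40]}...>"

def constitutional_revision(user_prompt: str) -> str:
    draft = ask_model(user_prompt)
    for principle in PRINCIPLES:
        critique = ask_model(f"Critique this reply against the rule '{principle}':\n{draft}")
        draft = ask_model(f"Rewrite the reply to address the critique:\n{critique}\n{draft}")
    return draft  # revised answers become the training signal for the next model

print(constitutional_revision("Summarize the risks of this deployment plan."))
```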
Scalable oversight is not a silver bullet. But it’s the best shot we’ve got at aligning complex systems to human values, or at least to approximations of them.
The future of AI safety depends on teaching AIs to teach themselves, with guardrails we can actually monitor.