---
theme: seriph
title: Scalable Oversight for Complex AI Tasks
titleTemplate: '%s - AI Safety & Oversight'
author: Rossi Stefano
info: |
  ## Methods for Scaling Human Feedback in AI Supervision
keywords: AI Safety, Scalable Oversight, LLMs, Human Feedback, Alignment, AI Debate
mdc: true
hideInToc: false
addons:
  - slidev-addon-rabbit
  - slidev-addon-python-runner
python:
  loadPackagesFromImports: true
  suppressDeprecationWarnings: true
  alwaysReload: false
presenter: true
browserExporter: dev
download: true
exportFilename: scalable-oversight-for-ai
twoslash: false
lineNumbers: true
monaco: false
selectable: false
record: dev
contextMenu: dev
wakeLock: true
overviewSnapshots: false
colorSchema: dark
routerMode: history
aspectRatio: 16/9
canvasWidth: 980
css: unocss
unocss:
  configFile: ./uno.config.ts
defaults:
  layout: center
drawings:
  enabled: true
  persist: false
  presenterOnly: false
  syncAll: true
htmlAttrs:
  dir: ltr
  lang: en
transition: slide-left
background: none
---

# Scalable Oversight for Complex AI

## Can Human Feedback Keep Up?

Techniques to Align Large Language Models at Scale

Stefano Rossi
2 May 2025

---

# Introduction

- Alignment at scale
- Human limitations
- Model deception
- Recursive techniques
- AI-augmented feedback
- Factored cognition

---

# The Problem with Feedback

## Why it fails

- Outputs are too complex for humans to judge correctly
- Deception and hallucinations trick reviewers
- LLMs sycophantically agree rather than pursue truth

## Why it matters

- Most alignment methods rely on human supervision
- RLHF breaks down as tasks grow in scale and complexity
- Feedback noise → reward hacking and misalignment (toy sketch below)
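
A toy illustration of that last point (the persuasiveness feature, the noise level, and every number here are invented for the sketch): a simulated reviewer is swayed by surface persuasiveness and noise, so the answer that wins the most comparisons is typically not the answer with the highest true quality.

```python
# Toy sketch: noisy pairwise feedback rewards style over substance,
# and picking the top-scoring answer "hacks" the learned signal.
import random

random.seed(0)

# Each candidate answer has a true quality and a surface persuasiveness.
answers = [{"quality": random.uniform(0, 1),
            "persuasiveness": random.uniform(0, 1)} for _ in range(50)]

def noisy_reviewer_prefers(a, b, style_bias=0.8, noise=0.5):
    """Simulated overloaded reviewer: swayed by persuasiveness and noise."""
    def score(x):
        return x["quality"] + style_bias * x["persuasiveness"] + random.gauss(0, noise)
    return score(a) > score(b)

# Crude "reward model": empirical win counts under noisy comparisons.
wins = [0] * len(answers)
for _ in range(2000):
    i, j = random.sample(range(len(answers)), 2)
    if noisy_reviewer_prefers(answers[i], answers[j]):
        wins[i] += 1
    else:
        wins[j] += 1

picked = max(range(len(answers)), key=lambda i: wins[i])
best = max(range(len(answers)), key=lambda i: answers[i]["quality"])
print("picked by noisy feedback:", answers[picked])
print("actually best answer:    ", answers[best])
```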

---

# What Is Scalable Oversight?

Scalable oversight techniques aim to empower humans to give accurate feedback on complex tasks.

They leverage structured methods, AI assistants, or recursive mechanisms to extend our cognitive reach.


---

# Techniques Overview

| Technique | Description |
| --- | --- |
| Iterated Amplification | Task decomposition + model assistance = scalable evaluation |
| Recursive Reward Modeling | AI helps humans give better feedback to train better AI |
| Constitutional AI | Use fixed rules to guide feedback generation |
| Debate | AIs argue, a human judges; opposition surfaces deception |
| Weak-to-Strong Generalization | Train powerful models with weaker labels |

---

# Iterated Amplification (IDA)

## Core Idea

- Decompose hard problems
- Evaluate sub-steps
- Train a model to replicate the full pipeline

## Example

- Book summarization
- Summarize pages → chapters → full book
- Distill into a one-shot summary model (sketch below)
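
A minimal sketch of the decompose-and-distill loop, assuming a hypothetical `summarize` helper in place of a real LLM call; the collected `(input, output)` traces are what the one-shot summary model would be distilled on.

```python
# Sketch of IDA-style summarization: summarize pages -> chapters -> book,
# recording every sub-task so the full pipeline can be distilled later.
def summarize(text: str, max_words: int = 12) -> str:
    """Stand-in for an LLM summarizer: keeps only the first few words."""
    return " ".join(text.split()[:max_words]) + " ..."

def amplify_summarize(book: list[list[str]]) -> tuple[str, list[tuple[str, str]]]:
    traces = []                          # (input, output) pairs for distillation
    chapter_summaries = []
    for chapter in book:                 # each chapter is a list of pages
        page_summaries = [summarize(page) for page in chapter]
        traces += list(zip(chapter, page_summaries))
        chapter_text = " ".join(page_summaries)
        chapter_summary = summarize(chapter_text)
        traces.append((chapter_text, chapter_summary))
        chapter_summaries.append(chapter_summary)
    book_text = " ".join(chapter_summaries)
    book_summary = summarize(book_text)
    traces.append((book_text, book_summary))
    return book_summary, traces

book = [
    ["Chapter 1, page 1: " + "plot detail " * 40,
     "Chapter 1, page 2: " + "plot detail " * 40],
    ["Chapter 2, page 1: " + "plot detail " * 40],
]
summary, distillation_data = amplify_summarize(book)
print(summary)
print(len(distillation_data), "(input, output) pairs to distill into a one-shot model")
```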

---

# Debate

Two AIs take opposing views on a complex topic and argue it out. A human (or weaker AI) acts as judge, as in the sketch below.

- Helps surface flaws and manipulations
- Leverages models' reasoning capacity
- Challenges: truth ≠ persuasion, collusion risk
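
A hypothetical sketch of the protocol; `debater` and `judge` are canned stubs standing in for LLM calls and the human (or weaker-model) judge, and the toy judging rule is purely illustrative.

```python
# Sketch of a debate round-trip: two stances alternate arguments,
# then a judge reads the whole transcript and picks a side.
def debater(stance: str, question: str, transcript: list[str]) -> str:
    """Stub arguer: a real system would condition an LLM on the transcript."""
    return f"[{stance}] round {len(transcript) // 2 + 1}: evidence for '{stance}' on '{question}'"

def judge(transcript: list[str]) -> str:
    """Stub judge: here, naively side with the last unrebutted argument."""
    return transcript[-1].split("]")[0].lstrip("[")

def run_debate(question: str, rounds: int = 3) -> str:
    transcript = []
    for _ in range(rounds):
        transcript.append(debater("pro", question, transcript))
        transcript.append(debater("con", question, transcript))
    for turn in transcript:
        print(turn)
    return judge(transcript)

print("judge sides with:", run_debate("Is the cited proof of the theorem correct?"))
```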

---

# Constitutional AI

Replace human feedback with AI critiques guided by human-written principles.

Used in Anthropic's Claude models, where rules encode ethical and practical constraints.

- Scalable via automation
- Fewer humans involved, more repeatability (critique-and-revise sketch below)
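
A minimal critique-and-revise sketch. The two principles and the `critique`/`revise` stubs are invented for illustration; they are not Anthropic's actual constitution or pipeline, which relies on model-generated critiques and RLAIF over the resulting revisions.

```python
# Sketch: check a draft against each principle, revise when a critique fires;
# the resulting (prompt, revision) pairs become AI-generated training data.
CONSTITUTION = [
    "Do not provide instructions that facilitate harm.",
    "Prefer honest answers over confident-sounding guesses.",
]

def critique(draft: str, principle: str) -> str | None:
    """Stub critic: a real system would ask a model if the draft violates the principle."""
    if "guaranteed" in draft and "guesses" in principle:
        return "The draft overstates certainty."
    return None

def revise(draft: str, criticism: str) -> str:
    """Stub reviser: rewrite the draft to address the criticism."""
    return draft.replace("guaranteed", "likely") + " (This is not certain.)"

def constitutional_pass(draft: str) -> str:
    for principle in CONSTITUTION:
        problem = critique(draft, principle)
        if problem:
            draft = revise(draft, problem)
    return draft

print(constitutional_pass("This investment is guaranteed to double your money."))
```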

---

# Weak-to-Strong Generalization

- Train GPT-4 using feedback from GPT-2
- GPT-4 ends up outperforming its weak teacher
- Generalization beats imitation
- Bootstrapping + confidence loss = better performance (loss sketched below)
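
A toy rendering of the auxiliary confidence loss behind that last bullet: the strong student is pulled partly toward the weak supervisor's label and partly toward its own hardened prediction. The `alpha` weight, the threshold, and the numbers are illustrative assumptions, not the exact published setup.

```python
# Sketch: blend imitation of the weak label with confidence in the
# student's own hardened prediction (binary case, invented numbers).
import math

def cross_entropy(p: float, target: float) -> float:
    """Binary cross-entropy between a predicted probability and a 0/1 target."""
    eps = 1e-9
    return -(target * math.log(p + eps) + (1 - target) * math.log(1 - p + eps))

def weak_to_strong_loss(student_prob: float, weak_label: int, alpha: float = 0.5) -> float:
    hardened = 1.0 if student_prob > 0.5 else 0.0    # student's own confident guess
    return (1 - alpha) * cross_entropy(student_prob, weak_label) \
           + alpha * cross_entropy(student_prob, hardened)

# The weak teacher mislabels this example (weak_label=0); a confident student
# (prob 0.9) is pulled toward the wrong label far less than under pure imitation.
print("confidence-weighted loss:", round(weak_to_strong_loss(0.9, weak_label=0), 3))
print("pure imitation loss:     ", round(cross_entropy(0.9, 0), 3))
```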

---

# Key Challenges

## Failure Risks

- Models appear aligned, but aren't
- Deceptive behavior during training
- Generalization failures

## Limitations

- Factored cognition doesn't scale universally
- Feedback may be misinterpreted
- Ongoing distributional shift at deployment

---

# Conclusion

Scalable oversight is not a silver bullet. But it's the best shot we've got at aligning complex systems to human values, or at least to approximations of them.

The future of AI safety depends on teaching AIs to teach themselves, with guardrails we can actually monitor.


---

# Questions?