---
# theme id, package name, or local path
theme: seriph
title: Scalable Oversight for Complex AI Tasks
titleTemplate: '%s - AI Safety & Oversight'
author: Rossi Stefano
info: |
  ## Methods for Scaling Human Feedback in AI Supervision
keywords: AI Safety, Scalable Oversight, LLMs, Human Feedback, Alignment, AI Debate
mdc: true
hideInToc: false
addons:
  - slidev-addon-rabbit
  - slidev-addon-python-runner
python:
  installs: []
  prelude: ''
  loadPackagesFromImports: true
  suppressDeprecationWarnings: true
  alwaysReload: false
  loadPyodideOptions: {}
presenter: true
browserExporter: dev
download: true
exportFilename: scalable-oversight-for-ai
twoslash: false
lineNumbers: true
monaco: false
selectable: false
record: dev
contextMenu: dev
wakeLock: true
overviewSnapshots: false
colorSchema: dark
routerMode: history
aspectRatio: 16/9
canvasWidth: 980
css:
  - unocss
unocss:
  configFile: './uno.config.ts'
defaults:
  layout: center
drawings:
  enabled: true
  persist: false
  presenterOnly: false
  syncAll: true
htmlAttrs:
  dir: ltr
  lang: en
transition: slide-left
background: none
---

<!-- INTRO SLIDE -->
<div class="flex flex-col items-center justify-center h-full py-10">
  <h1 class="text-center text-5xl font-bold gradient-text mb-10">Scalable Oversight for Complex AI</h1>
  <h2 class="text-center text-4xl mb-6" style="color: var(--accent-color);">Can Human Feedback Keep Up?</h2>
  <h3 class="text-center text-3xl mb-14 animate-pulse highlight-word">Techniques to Align Large Language Models at Scale</h3>
  <div class="flex w-full justify-between mt-auto">
    <div class="text-left text-xl">Stefano Rossi</div>
    <div class="text-right text-xl">2 May, 2025</div>
  </div>
  <div class="hud-element circle-small"></div>
  <div class="hud-element circle-big"></div>
  <div class="hud-lines"></div>
</div>

---

# Introduction

<div class="grid grid-cols-2 gap-6">
  <div class="panel-info">
    <ul>
      <li><span class="highlight-word">Alignment at scale</span></li>
      <li><span class="highlight-word">Human limitations</span></li>
      <li><span class="highlight-word">Model deception</span></li>
    </ul>
  </div>
  <div class="panel-success">
    <ul>
      <li><span class="highlight-word">Recursive techniques</span></li>
      <li><span class="highlight-word">AI-augmented feedback</span></li>
      <li><span class="highlight-word">Factored cognition</span></li>
    </ul>
  </div>
</div>

<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>

---

# The Problem with Feedback

<div class="two-column">
  <div class="panel-info">
    <h2>Why it fails</h2>
    <ul>
      <li>Outputs become too complex for humans to judge correctly</li>
      <li><span class="highlight-word">Deception</span> and hallucinations trick reviewers</li>
      <li>LLMs <span class="highlight-word">sycophantically agree</span> rather than pursue truth</li>
    </ul>
  </div>
  <div class="panel-warning">
    <h2>Why it matters</h2>
    <ul>
      <li>Most alignment methods rely on human supervision</li>
      <li><span class="highlight-word">RLHF breaks</span> down as scale and complexity grow</li>
      <li>Feedback noise → <span class="highlight-word">reward hacking</span> and misalignment</li>
    </ul>
  </div>
</div>

<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>

---

# What Is Scalable Oversight?

<div class="panel-info">
  <p>Scalable oversight techniques aim to <span class="highlight-word">empower humans</span> to give accurate feedback on complex tasks.</p>
  <p>They leverage structured methods, AI assistants, or recursive mechanisms to <span class="highlight-word">extend our cognitive reach</span>.</p>
</div>

<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>

---

# Techniques Overview

<table class="styled-table hoverable-table">
  <thead>
    <tr>
      <th>Technique</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr><td><span class="highlight-word">Iterated Amplification</span></td><td>Task decomposition + model assistance = scalable evaluation</td></tr>
    <tr><td><span class="highlight-word">Recursive Reward Modeling</span></td><td>AI helps humans give better feedback, which trains better AI</td></tr>
    <tr><td><span class="highlight-word">Constitutional AI</span></td><td>Use a fixed set of written principles to guide feedback generation</td></tr>
    <tr><td><span class="highlight-word">Debate</span></td><td>AIs argue, a human judges; opposition surfaces deception</td></tr>
    <tr><td><span class="highlight-word">Weak-to-Strong Generalization</span></td><td>Train strong models on labels from weaker supervisors</td></tr>
  </tbody>
</table>

<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>

---

# Iterated Amplification (IDA)

<div class="panel-info">
  <h2>Core Idea</h2>
  <ul>
    <li>Decompose hard problems</li>
    <li>Evaluate sub-steps</li>
    <li>Train model to replicate the full pipeline</li>
  </ul>
</div>

<div class="panel-success">
  <h2>Example</h2>
  <ul>
    <li>Book summarization</li>
    <li>Summarize pages → chapters → full book</li>
    <li>Distill into a one-shot summary model</li>
  </ul>
</div>

<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>

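---

# Iterated Amplification: A Sketch

The pages → chapters → full book example can be sketched as a bottom-up summarizer. This is a minimal illustration under stated assumptions, not the method of any particular paper: `first_sentence`, `summarize_book`, and `fan_in` are hypothetical names, and the "summarizer" simply keeps the first sentence where a real system would call a model. The distillation step (training a one-shot model on these outputs) is not shown.

```python
from typing import Callable, List


def first_sentence(text: str) -> str:
    """Stand-in for a model call (assumption): keep only the first sentence."""
    first = text.split(".")[0].strip()
    return first + "." if first else ""


def summarize_book(pages: List[str],
                   summarize: Callable[[str], str],
                   fan_in: int = 2) -> str:
    """Summarize each page, then repeatedly summarize groups of summaries."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 so each level shrinks")
    if not pages:
        return ""
    summaries = [summarize(page) for page in pages]
    while len(summaries) > 1:
        # Group neighbouring summaries (pages -> chapters -> book).
        groups = [summaries[i:i + fan_in]
                  for i in range(0, len(summaries), fan_in)]
        summaries = [summarize(" ".join(group)) for group in groups]
    return summaries[0]


pages = [
    "Alice meets Bob. They talk.",
    "Bob leaves town. Alice stays.",
    "Alice writes a letter. It is long.",
]
book_summary = summarize_book(pages, first_sentence)
```

Each level only ever sees short inputs, which is the point of the decomposition: a reviewer can check one page-level or chapter-level summary without reading the whole book.
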
---

# Debate

<div class="panel-info">
  <p>Two AIs take opposing views on a complex topic and argue it out. A human (or weaker AI) acts as judge.</p>
  <ul>
    <li>Helps surface <span class="highlight-word">flaws and manipulations</span></li>
    <li>Leverages the models’ reasoning capacity</li>
    <li>Challenges: truth ≠ persuasion, collusion risk</li>
  </ul>
</div>

<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>

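---

# Debate: A Sketch

The judge-and-two-debaters setup can be sketched with a toy question: is `n` prime? This is only an illustration under stated assumptions. The debaters `pro` and `con`, the judge `weak_judge`, and the `"factor k"` message format are all hypothetical stand-ins for model calls. The point it shows is that a judge who cannot solve the problem can still verify one piece of evidence that an opposing debater surfaces.

```python
from typing import Callable, Dict, List, Tuple

Transcript = List[Tuple[str, str]]  # (speaker, statement) pairs
Debater = Callable[[int, Transcript], str]


def run_debate(n: int,
               debaters: Dict[str, Debater],
               judge: Callable[[int, Transcript], str],
               rounds: int = 1) -> str:
    """Let debaters speak in turn, then return the judge's chosen winner."""
    transcript: Transcript = []
    for _ in range(rounds):
        for name, speak in debaters.items():
            transcript.append((name, speak(n, transcript)))
    return judge(n, transcript)


def pro(n: int, transcript: Transcript) -> str:
    # Hypothetical debater: asserts primality without checkable evidence.
    return "prime"


def con(n: int, transcript: Transcript) -> str:
    # Hypothetical debater: argues compositeness by exhibiting a factor.
    for k in range(2, n):
        if n % k == 0:
            return f"factor {k}"
    return "no factor found"


def weak_judge(n: int, transcript: Transcript) -> str:
    # The judge does not search for factors; it only verifies a claimed one.
    for speaker, statement in transcript:
        if statement.startswith("factor "):
            k = int(statement.split()[1])
            if 1 < k < n and n % k == 0:
                return speaker
    return "pro"


winner_91 = run_debate(91, {"pro": pro, "con": con}, weak_judge)
winner_97 = run_debate(97, {"pro": pro, "con": con}, weak_judge)
```

The toy case is deliberately easy to verify. The challenges on the previous slide (persuasion beating truth, collusion) arise exactly when the judge has no such cheap check.
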
---

# Constitutional AI

<div class="panel-info">
  <p>Replace human feedback with <span class="highlight-word">AI critiques</span> guided by human-written principles.</p>
  <p>Used in Anthropic’s Claude models, where rules encode ethical and practical constraints.</p>
  <ul>
    <li>Scalable via automation</li>
    <li>Fewer humans involved, more repeatability</li>
  </ul>
</div>

<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>

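---

# Constitutional AI: A Sketch

The critique step can be sketched as a critique-and-revise loop over a list of principles. This is a minimal sketch, not Anthropic's implementation: `critique_and_revise`, `toy_critic`, and `toy_reviser` are hypothetical names, and a "principle" is reduced to a single word to avoid so that the example runs without a model. In a real pipeline both the critic and the reviser would be model calls prompted with a written principle.

```python
import re
from typing import Callable, List, Optional


def critique_and_revise(response: str,
                        principles: List[str],
                        critic: Callable[[str, str], Optional[str]],
                        reviser: Callable[[str, str], str],
                        max_passes: int = 3) -> str:
    """Apply each principle in turn until no critique is raised."""
    for _ in range(max_passes):
        changed = False
        for principle in principles:
            critique = critic(response, principle)
            if critique is not None:
                response = reviser(response, critique)
                changed = True
        if not changed:
            break
    return response


def toy_critic(response: str, principle: str) -> Optional[str]:
    # Stand-in (assumption): flag the response if it contains the word.
    return principle if principle.lower() in response.lower() else None


def toy_reviser(response: str, critique: str) -> str:
    # Stand-in (assumption): remove the flagged word, ignoring case.
    return re.sub(re.escape(critique), "[removed]", response,
                  flags=re.IGNORECASE)


revised = critique_and_revise(
    "You are an Idiot and a fool.",
    ["idiot", "fool"],
    toy_critic,
    toy_reviser,
)
```

No human reviews any individual response in this loop; the human contribution is the list of principles, which is what makes the approach repeatable.
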
---

# Weak-to-Strong Generalization

<div class="panel-info">
  <ul>
    <li>Fine-tune GPT-4 on labels produced by a GPT-2-level supervisor</li>
    <li>The strong student outperforms its weak teacher</li>
    <li><span class="highlight-word">Generalization beats imitation</span></li>
    <li>Bootstrapping and an auxiliary confidence loss improve performance further</li>
  </ul>
</div>

<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>

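---

# Confidence Loss: A Sketch

The confidence-loss idea can be sketched for a single binary example. The form below mixes cross-entropy against the weak label with cross-entropy against the strong model's own hardened prediction. This is a simplified sketch: the function names, the fixed `threshold` of 0.5, and the per-example scalar form are assumptions made for illustration, and a training setup would compute this over batches inside an autodiff framework.

```python
import math


def cross_entropy(target: float, pred: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy between a (possibly soft) target and a prediction."""
    pred = min(max(pred, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def confidence_loss(strong_prob: float,
                    weak_label: float,
                    alpha: float = 0.5,
                    threshold: float = 0.5) -> float:
    """(1 - alpha) * CE(weak label) + alpha * CE(student's hardened guess)."""
    # Harden the student's own prediction into a 0/1 pseudo-label.
    hardened = 1.0 if strong_prob > threshold else 0.0
    weak_term = cross_entropy(weak_label, strong_prob)
    self_term = cross_entropy(hardened, strong_prob)
    return (1.0 - alpha) * weak_term + alpha * self_term


# The student is confident in class 1, the weak teacher says class 0.
plain = cross_entropy(0.0, 0.9)
mixed = confidence_loss(0.9, 0.0, alpha=0.5)
```

With `alpha = 0` this reduces to plain imitation of the weak label. A larger `alpha` reduces the penalty for confidently disagreeing with the teacher, which is how the student can generalize rather than copy the teacher's mistakes.
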
---

# Key Challenges

<div class="two-column">
  <div class="panel-warning">
    <h2>Failure Risks</h2>
    <ul>
      <li>Models can appear aligned when they aren’t</li>
      <li><span class="highlight-word">Deceptive behavior</span> during training</li>
      <li>Generalization failures</li>
    </ul>
  </div>
  <div class="panel-danger">
    <h2>Limitations</h2>
    <ul>
      <li>Factored cognition doesn’t scale universally</li>
      <li>Feedback may be misinterpreted</li>
      <li>Ongoing distributional shift at deployment</li>
    </ul>
  </div>
</div>

<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>

---

# Conclusion

<div class="panel-success">
  <p><strong>Scalable oversight is not a silver bullet.</strong> But it’s the best shot we’ve got at aligning complex systems to human values, or at least to approximations of them.</p>
  <p>The future of AI safety <span class="highlight-word">depends on teaching AIs to teach themselves</span>, with guardrails we can actually monitor.</p>
</div>

<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>

---

<div class="bouncing-box">
  <div class="screensaver-icon ai"><i class="fas fa-brain"></i></div>
  <h1 class="multicolor-text z-10 relative">Questions?</h1>
</div>

<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.4/css/all.min.css">