scalable_oversight/slides.md
2025-07-12 17:16:47 +02:00

---
# theme id, package name, or local path
theme: seriph
title: Scalable Oversight for Complex AI Tasks
titleTemplate: '%s - AI Safety & Oversight'
author: Rossi Stefano
info: |
  ## Methods for Scaling Human Feedback in AI Supervision
keywords: AI Safety, Scalable Oversight, LLMs, Human Feedback, Alignment, AI Debate
mdc: true
hideInToc: false
addons:
  - slidev-addon-rabbit
  - slidev-addon-python-runner
python:
  installs: []
  prelude: ''
  loadPackagesFromImports: true
  suppressDeprecationWarnings: true
  alwaysReload: false
  loadPyodideOptions: {}
presenter: true
browserExporter: dev
download: true
exportFilename: scalable-oversight-for-ai
twoslash: false
lineNumbers: true
monaco: false
selectable: false
record: dev
contextMenu: dev
wakeLock: true
overviewSnapshots: false
colorSchema: dark
routerMode: history
aspectRatio: 16/9
canvasWidth: 980
css:
  - unocss
unocss:
  configFile: './uno.config.ts'
defaults:
  layout: center
drawings:
  enabled: true
  persist: false
  presenterOnly: false
  syncAll: true
htmlAttrs:
  dir: ltr
  lang: en
transition: slide-left
background: none
---
<!-- INTRO SLIDE -->
<div class="flex flex-col items-center justify-center h-full py-10">
<h1 class="text-center text-5xl font-bold gradient-text mb-10">Scalable Oversight for Complex AI</h1>
<h2 class="text-center text-4xl mb-6" style="color: var(--accent-color);">Can Human Feedback Keep Up?</h2>
<h3 class="text-center text-3xl mb-14 animate-pulse highlight-word">Techniques to Align Large Language Models at Scale</h3>
<div class="flex w-full justify-between mt-auto">
<div class="text-left text-xl">Stefano Rossi</div>
<div class="text-right text-xl">2 May, 2025</div>
</div>
<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>
</div>
---
# Introduction
<div class="grid grid-cols-2 gap-6">
<div class="panel-info">
<ul>
<li><span class="highlight-word">Alignment at scale</span></li>
<li><span class="highlight-word">Human limitations</span></li>
<li><span class="highlight-word">Model deception</span></li>
</ul>
</div>
<div class="panel-success">
<ul>
<li><span class="highlight-word">Recursive techniques</span></li>
<li><span class="highlight-word">AI-augmented feedback</span></li>
<li><span class="highlight-word">Factored cognition</span></li>
</ul>
</div>
</div>
<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>
---
# The Problem with Feedback
<div class="two-column">
<div class="panel-info">
<h2>Why it fails</h2>
<ul>
<li>Tasks grow too complex for humans to judge outputs correctly</li>
<li><span class="highlight-word">Deception</span> and hallucinations trick reviewers</li>
<li>LLMs <span class="highlight-word">sycophantically agree</span> rather than pursue truth</li>
</ul>
</div>
<div class="panel-warning">
<h2>Why it matters</h2>
<ul>
<li>Most alignment methods rely on human supervision</li>
<li><span class="highlight-word">RLHF breaks down</span> as tasks grow in scale and complexity</li>
<li>Feedback noise → <span class="highlight-word">reward hacking</span> and misalignment</li>
</ul>
</div>
</div>
<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>
---
# What Is Scalable Oversight?
<div class="panel-info">
<p>Scalable oversight techniques aim to <span class="highlight-word">empower humans</span> to give accurate feedback on complex tasks.</p>
<p>They leverage structured methods, AI assistants, or recursive mechanisms to <span class="highlight-word">extend our cognitive reach</span>.</p>
</div>
<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>
---
# Techniques Overview
<table class="styled-table hoverable-table">
<thead>
<tr>
<th>Technique</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr><td><span class="highlight-word">Iterated Amplification</span></td><td>Task decomposition + model assistance = scalable evaluation</td></tr>
<tr><td><span class="highlight-word">Recursive Reward Modeling</span></td><td>AI helps humans give better feedback to train better AI</td></tr>
<tr><td><span class="highlight-word">Constitutional AI</span></td><td>Use fixed rules to guide feedback generation</td></tr>
<tr><td><span class="highlight-word">Debate</span></td><td>AIs argue opposing sides and a human judges, surfacing deception through opposition</td></tr>
<tr><td><span class="highlight-word">Weak-to-Strong Generalization</span></td><td>Train strong models on labels from weaker supervisors</td></tr>
</tbody>
</table>
<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>
---
# Iterated Amplification (IDA)
<div class="panel-info">
<h2>Core Idea</h2>
<ul>
<li>Decompose hard problems</li>
<li>Evaluate sub-steps</li>
<li>Train model to replicate the full pipeline</li>
</ul>
</div>
<div class="panel-success">
<h2>Example</h2>
<ul>
<li>Book summarization</li>
<li>Summarize pages → chapters → full book</li>
<li>Distill into a one-shot summary model</li>
</ul>
</div>
<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>
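---
# Iterated Amplification: Toy Sketch
<div class="panel-info">
<p>The book example can be sketched as a level-by-level merge. This is a minimal illustration under stated assumptions, not a reference implementation: <code>toy_summarize</code> is a hypothetical stand-in for a human-plus-model summarizer, and <code>chunk</code>, <code>amplify</code>, <code>fan_in</code>, and <code>distillation_pair</code> are names invented for this slide.</p>
</div>

```python
from typing import Callable, List, Tuple


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def toy_summarize(texts: List[str]) -> str:
    """Hypothetical summarizer: keep the first two words of each text."""
    return " ".join(" ".join(t.split()[:2]) for t in texts)


def amplify(pages: List[str],
            summarize: Callable[[List[str]], str],
            fan_in: int = 2) -> str:
    """Merge small groups level by level: pages -> chapters -> book."""
    if not pages or fan_in < 2:
        raise ValueError("need at least one page and fan_in >= 2")
    level = list(pages)
    while len(level) > 1:
        # Each call sees at most `fan_in` short texts, which is the
        # part an overseer could realistically check by hand.
        level = [summarize(group) for group in chunk(level, fan_in)]
    return level[0]


def distillation_pair(pages: List[str]) -> Tuple[str, str]:
    """Build one (full text, amplified summary) training example."""
    return " ".join(pages), amplify(pages, toy_summarize)


pages = ["alpha beta gamma", "delta epsilon zeta",
         "eta theta iota", "kappa lambda mu"]
book_summary = amplify(pages, toy_summarize)
```

<p>Pairs from <code>distillation_pair</code> would be the data for training a one-shot summary model. The training step itself is not shown.</p>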
---
# Debate
<div class="panel-info">
<p>Two AIs take opposing views on a complex topic and argue it out. A human (or weaker AI) acts as judge.</p>
<ul>
<li>Helps surface <span class="highlight-word">flaws and manipulations</span></li>
<li>Leverages the models' reasoning capacity</li>
<li>Challenges: truth ≠ persuasion, collusion risk</li>
</ul>
</div>
<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>
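---
# Debate: Toy Sketch
<div class="panel-info">
<p>The protocol above can be sketched as a turn-taking loop plus a judge. Everything here is a toy assumption: <code>argue_yes</code>, <code>argue_no</code>, <code>toy_judge</code>, and <code>CHECKABLE_FACTS</code> are hypothetical stand-ins, and a real judge would be a human or a weaker model rather than a lookup.</p>
</div>

```python
from typing import Callable, Dict, List, Tuple

Transcript = List[Tuple[str, str]]  # (debater name, argument)
Debater = Callable[[str, Transcript], str]

# Hypothetical set of small claims the judge can verify directly.
CHECKABLE_FACTS = {
    "9 = 3 * 3",
    "a prime has no divisors other than 1 and itself",
}


def run_debate(question: str, debaters: Dict[str, Debater],
               judge: Callable[[str, Transcript], str],
               rounds: int = 2) -> Tuple[str, Transcript]:
    """Alternate turns between debaters, then ask the judge for a winner."""
    transcript: Transcript = []
    for _ in range(rounds):
        for name, debater in debaters.items():
            transcript.append((name, debater(question, transcript)))
    return judge(question, transcript), transcript


def argue_yes(question: str, transcript: Transcript) -> str:
    """Dishonest side: persuasive-sounding but not checkable."""
    return "9 is odd, so it must be prime"


def argue_no(question: str, transcript: Transcript) -> str:
    """Honest side: cite a checkable fact it has not used yet."""
    facts = sorted(CHECKABLE_FACTS)
    used = {arg for _, arg in transcript}
    return next((f for f in facts if f not in used), facts[0])


def toy_judge(question: str, transcript: Transcript) -> str:
    """Pick the side with the most arguments the judge can verify."""
    scores: Dict[str, int] = {}
    for name, arg in transcript:
        scores[name] = scores.get(name, 0) + (arg in CHECKABLE_FACTS)
    return max(scores, key=scores.get)


verdict, transcript = run_debate(
    "Is 9 prime?", {"yes": argue_yes, "no": argue_no}, toy_judge)
```

<p>The judge never solves the question itself; it only checks small claims. That is the intended leverage, and also the weak point when persuasion and truth come apart.</p>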
---
# Constitutional AI
<div class="panel-info">
<p>Replace human feedback with <span class="highlight-word">AI critiques</span> guided by human-written principles.</p>
<p>Used in Anthropic's Claude models, where rules encode ethical and practical constraints.</p>
<ul>
<li>Scalable via automation</li>
<li>Fewer humans involved, more repeatability</li>
</ul>
</div>
<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>
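---
# Constitutional AI: Toy Sketch
<div class="panel-info">
<p>A critique-and-revise pass guided by written principles might look roughly like this. It is a sketch under assumptions: <code>TOY_RULES</code>, <code>toy_generate</code>, <code>toy_critique</code>, and <code>toy_revise</code> are hypothetical stand-ins for model calls, and the later stage that trains on AI-generated preference labels is not shown.</p>
</div>

```python
from typing import Callable, List, Tuple

# Hypothetical principles, each paired with a span the toy critic flags.
TOY_RULES = {
    "Avoid insults.": "You fool, ",
    "Avoid shouting.": "!",
}


def toy_generate(prompt: str) -> str:
    """Stand-in for the model's first draft."""
    return "You fool, the answer is 4!"


def toy_critique(response: str, principle: str) -> str:
    """Return the offending span, or "" if the principle is satisfied."""
    span = TOY_RULES[principle]
    return span if span in response else ""


def toy_revise(response: str, critique: str) -> str:
    """Remove the span named by the critique."""
    return response.replace(critique, "")


def critique_and_revise(prompt: str,
                        generate: Callable[[str], str],
                        critique: Callable[[str, str], str],
                        revise: Callable[[str, str], str],
                        principles: List[str]) -> Tuple[str, str]:
    """Draft once, then apply each principle as critique -> revision."""
    response = generate(prompt)
    for principle in principles:
        feedback = critique(response, principle)
        if feedback:  # empty feedback means nothing to fix
            response = revise(response, feedback)
    return prompt, response  # candidate fine-tuning pair


example = critique_and_revise(
    "What is 2 + 2?", toy_generate, toy_critique, toy_revise,
    list(TOY_RULES))
```

<p>No human reads the draft in this loop; humans only write the principles, which is where the repeatability comes from.</p>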
---
# Weak-to-Strong Generalization
<div class="panel-info">
<ul>
<li>Fine-tune GPT-4 on labels produced by a GPT-2-level supervisor</li>
<li>GPT-4 ends up outperforming its weak teacher</li>
<li><span class="highlight-word">Generalization beats imitation</span></li>
<li>Bootstrapping + confidence loss = better performance</li>
</ul>
</div>
<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>
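---
# Weak-to-Strong: Confidence Loss Sketch
<div class="panel-info">
<p>The confidence-loss idea can be illustrated for a single binary example: mix the loss against the weak label with a loss against the strong model's own hardened prediction. This is a simplified sketch, not the published training recipe; the fixed <code>threshold</code>, the constant <code>alpha</code>, and the function names are assumptions made for this slide.</p>
</div>

```python
import math


def cross_entropy(target: float, pred: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy between a target and a predicted probability."""
    pred = min(max(pred, eps), 1 - eps)  # avoid log(0)
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))


def confidence_loss(strong_prob: float, weak_label: float,
                    alpha: float = 0.5, threshold: float = 0.5) -> float:
    """Blend imitation of the weak label with the model's own confidence.

    strong_prob: strong model's predicted probability of the positive class
    weak_label:  label (0.0 or 1.0) from the weak supervisor
    alpha:       weight on the self-confidence term (assumed constant here)
    """
    hardened = 1.0 if strong_prob > threshold else 0.0
    imitate = cross_entropy(weak_label, strong_prob)
    self_confident = cross_entropy(hardened, strong_prob)
    return (1 - alpha) * imitate + alpha * self_confident


# The strong model is confident the answer is positive, the weak label says no.
plain = cross_entropy(0.0, 0.9)
mixed = confidence_loss(0.9, 0.0)
```

<p>When the two disagree, <code>mixed</code> penalizes the strong model less than <code>plain</code>, so it is not forced to copy the weak supervisor's mistakes. With <code>alpha=0</code> the loss reduces to plain imitation.</p>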
---
# Key Challenges
<div class="two-column">
<div class="panel-warning">
<h2>Failure Risks</h2>
<ul>
<li>Models appear aligned, but aren't</li>
<li><span class="highlight-word">Deceptive behavior</span> during training</li>
<li>Generalization failures</li>
</ul>
</div>
<div class="panel-danger">
<h2>Limitations</h2>
<ul>
<li>Factored cognition doesn't scale universally</li>
<li>Feedback may be misinterpreted</li>
<li>Ongoing distributional shift at deployment</li>
</ul>
</div>
</div>
<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>
---
# Conclusion
<div class="panel-success">
<p><strong>Scalable oversight is not a silver bullet.</strong> But it's the best shot we've got at aligning complex systems to human values, or at least to approximations of them.</p>
<p>The future of AI safety <span class="highlight-word">depends on teaching AIs to teach themselves</span>, with guardrails we can actually monitor.</p>
</div>
<div class="hud-element circle-small"></div>
<div class="hud-element circle-big"></div>
<div class="hud-lines"></div>
---
<div class="bouncing-box">
<div class="screensaver-icon ai"><i class="fas fa-brain"></i></div>
<h1 class="multicolor-text z-10 relative">Questions?</h1>
</div>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.4/css/all.min.css">