--- # theme id, package name, or local path theme: seriph title: Scalable Oversight for Complex AI Tasks titleTemplate: '%s - AI Safety & Oversight' author: Rossi Stefano info: | ## Methods for Scaling Human Feedback in AI Supervision keywords: AI Safety, Scalable Oversight, LLMs, Human Feedback, Alignment, AI Debate mdc: true hideInToc: false addons: - slidev-addon-rabbit - slidev-addon-python-runner python: installs: [] prelude: '' loadPackagesFromImports: true suppressDeprecationWarnings: true alwaysReload: false loadPyodideOptions: {} presenter: true browserExporter: dev download: true exportFilename: scalable-oversight-for-ai twoslash: false lineNumbers: true monaco: false selectable: false record: dev contextMenu: dev wakeLock: true overviewSnapshots: false colorSchema: dark routerMode: history aspectRatio: 16/9 canvasWidth: 980 css: - unocss unocss: configFile: './uno.config.ts' defaults: layout: center drawings: enabled: true persist: false presenterOnly: false syncAll: true htmlAttrs: dir: ltr lang: en transition: slide-left background: none ---

Backdoor Attacks

Hidden Threats in AI Models

Embedding Malicious Behavior in LLMs

Stefano Rossi
09 May, 2025
--- # Introduction

www.reddit.com/r/fakehistoryporn/

--- # Problem Statement

What is a Backdoor Attack?

Why It's a Threat

--- # Exploitation Method

How It Works

Key Insight

--- # Mitigation Strategies
Strategy Description
Data SanitizationScreen training data for malicious inputs
Adversarial TestingProbe model with potential triggers
Model InspectionAnalyze weights for anomalous patterns
Fine-Tune ScrubbingRemove backdoors via retraining
--- # Demo

Live Demonstration

--- # Risk Assessment

Real-World Impact

Threat Scale

---

Political Compass Score

Political Compass

trackingai.org/political-test

--- # Complexity Analysis

Attack Difficulty

Advanced Attacks

--- # Conclusion

Backdoor attacks pose a hidden threat to LLMs.

Mitigation requires robust training and testing.

Next steps: data security, model auditing, and community standards.

---

Questions?