abuses_and_vulnerabilities_.../slides.md
2025-07-10 00:44:37 +02:00

7.1 KiB

theme title titleTemplate author info keywords mdc hideInToc addons python presenter browserExporter download exportFilename twoslash lineNumbers monaco selectable record contextMenu wakeLock overviewSnapshots colorSchema routerMode aspectRatio canvasWidth css unocss defaults drawings htmlAttrs transition background
seriph Scalable Oversight for Complex AI Tasks %s - AI Safety & Oversight Rossi Stefano ## Methods for Scaling Human Feedback in AI Supervision AI Safety, Scalable Oversight, LLMs, Human Feedback, Alignment, AI Debate true false
slidev-addon-rabbit
slidev-addon-python-runner
installs prelude loadPackagesFromImports suppressDeprecationWarnings alwaysReload loadPyodideOptions
true true false
true dev true scalable-oversight-for-ai false true false false dev dev true false dark history 16/9 980
unocss
configFile
./uno.config.ts
layout
center
enabled persist presenterOnly syncAll
true false false true
dir lang
ltr en
slide-left none

Backdoor Attacks

Hidden Threats in AI Models

Embedding Malicious Behavior in LLMs

Stefano Rossi
09 May, 2025

Introduction

  • AI safety faces growing threats
  • Backdoor attacks hide malicious behavior
  • Triggered by specific inputs
  • Context: training vulnerabilities
  • Goal: expose & mitigate
  • Focus: real-world risks

Problem Statement

What is a Backdoor Attack?

  • Malicious behavior embedded during training
  • Triggered by specific inputs (e.g., keywords)
  • Example: Model outputs harmful content on trigger

Why It's a Threat

  • Invisible until activated
  • Bypasses standard testing
  • Compromises trustworthy AI

Exploitation Method

How It Works

  • Poison training data with malicious examples
  • Fine-tune model to respond to triggers
  • Example: Insert "cf" to trigger harmful output
  • Test in controlled environment

Key Insight

  • Training vulnerabilities enable stealthy attacks.

Mitigation Strategies

Strategy Description
Data SanitizationScreen training data for malicious inputs
Adversarial TestingProbe model with potential triggers
Model InspectionAnalyze weights for anomalous patterns
Fine-Tune ScrubbingRemove backdoors via retraining

Demo

Live Demonstration


Risk Assessment

Real-World Impact

  • Targeted attacks on critical systems
  • Misinformation at scale
  • Erosion of trust in AI

Threat Scale

  • Stealthy and hard to detect
  • Exploitable by insiders or adversaries
  • High damage potential

Political Compass Score

Political Compass

trackingai.org/political-test


Complexity Analysis

Attack Difficulty

  • Moderate complexity: Requires training access
  • Needs technical expertise in ML
  • Resources: Data and compute

Advanced Attacks

  • May involve sophisticated triggers or insider threats

Conclusion

Backdoor attacks pose a hidden threat to LLMs.

Mitigation requires robust training and testing.

Next steps: data security, model auditing, and community standards.


Questions?