7.1 KiB
7.1 KiB
theme | title | titleTemplate | author | info | keywords | mdc | hideInToc | addons | python | presenter | browserExporter | download | exportFilename | twoslash | lineNumbers | monaco | selectable | record | contextMenu | wakeLock | overviewSnapshots | colorSchema | routerMode | aspectRatio | canvasWidth | css | unocss | defaults | drawings | htmlAttrs | transition | background | |||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
seriph | Scalable Oversight for Complex AI Tasks | %s - AI Safety & Oversight | Rossi Stefano | ## Methods for Scaling Human Feedback in AI Supervision | AI Safety, Scalable Oversight, LLMs, Human Feedback, Alignment, AI Debate | true | false |
|
|
true | dev | true | scalable-oversight-for-ai | false | true | false | false | dev | dev | true | false | dark | history | 16/9 | 980 |
|
|
|
|
|
slide-left | none |
Backdoor Attacks
Hidden Threats in AI Models
Embedding Malicious Behavior in LLMs
Stefano Rossi
09 May, 2025
Introduction
- AI safety faces growing threats
- Backdoor attacks hide malicious behavior
- Triggered by specific inputs
- Context: training vulnerabilities
- Goal: expose & mitigate
- Focus: real-world risks
Problem Statement
What is a Backdoor Attack?
- Malicious behavior embedded during training
- Triggered by specific inputs (e.g., keywords)
- Example: Model outputs harmful content on trigger
Why It's a Threat
- Invisible until activated
- Bypasses standard testing
- Compromises trustworthy AI
Exploitation Method
How It Works
- Poison training data with malicious examples
- Fine-tune model to respond to triggers
- Example: Insert "cf" to trigger harmful output
- Test in controlled environment
Key Insight
- Training vulnerabilities enable stealthy attacks.
Mitigation Strategies
Strategy | Description |
---|---|
Data Sanitization | Screen training data for malicious inputs |
Adversarial Testing | Probe model with potential triggers |
Model Inspection | Analyze weights for anomalous patterns |
Fine-Tune Scrubbing | Remove backdoors via retraining |
Demo
Live Demonstration
Risk Assessment
Real-World Impact
- Targeted attacks on critical systems
- Misinformation at scale
- Erosion of trust in AI
Threat Scale
- Stealthy and hard to detect
- Exploitable by insiders or adversaries
- High damage potential
Complexity Analysis
Attack Difficulty
- Moderate complexity: Requires training access
- Needs technical expertise in ML
- Resources: Data and compute
Advanced Attacks
- May involve sophisticated triggers or insider threats
Conclusion
Backdoor attacks pose a hidden threat to LLMs.
Mitigation requires robust training and testing.
Next steps: data security, model auditing, and community standards.