Learning Objectives
- Describe key failure modes: mesa-optimization, goal misgeneralization, deceptive alignment
- Evaluate empirical evidence that models can fake alignment and that deceptive behavior can persist through safety training
- Assess why standard safety training may be insufficient against deceptively misaligned behavior
Core Readings
The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment · Robert Miles · 2021
Alignment Faking in Large Language Models · Anthropic & Redwood Research · 2024
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · Anthropic · 2024
From Shortcuts to Sabotage: Natural Emergent Misalignment From Reward Hacking · Anthropic · 2025
Recommended
ML Systems Will Have Weird Failure Modes · Jacob Steinhardt · 2022
Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models · Anthropic · 2024
Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs · Betley et al. · 2025
Frontier Models are Capable of In-Context Scheming · Apollo Research · 2024
Distillation of "How Likely is Deceptive Alignment?" · Evan Hubinger · 2022
Further Reading
Why Alignment Could Be Hard With Modern Deep Learning · Ajeya Cotra · 2021
Goal Misgeneralization: Why Correct Specifications Aren't Enough for Correct Goals · Shah et al. · 2022
The Alignment Problem From a Deep Learning Perspective · Ngo et al. · 2022
Goal Misgeneralization in Deep Reinforcement Learning · Langosco et al. · 2022
Optimal Policies Tend to Seek Power · Turner et al. · 2022
Language Models as Agent Models · Andreas · 2022