Technical Intro Fellowship

Week 3

Inner Alignment

Learning Objectives

Describe key failure modes: mesa-optimization, goal misgeneralization, deceptive alignment
Evaluate empirical evidence that models can fake alignment and that deceptive behavior can persist through safety training
Assess why standard safety training may be insufficient against deceptively misaligned behavior

Core Readings

The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment · Robert Miles · 2021

Alignment Faking in Large Language Models · Anthropic & Redwood Research · 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · Anthropic · 2024

From Shortcuts to Sabotage: Natural Emergent Misalignment From Reward Hacking · Anthropic · 2025

Recommended

ML Systems Will Have Weird Failure Modes · Jacob Steinhardt · 2022

Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models · Anthropic · 2024

Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs · Betley et al. · 2025

Frontier Models are Capable of In-Context Scheming · Apollo Research · 2024

Distillation of "How Likely is Deceptive Alignment?" · Evan Hubinger · 2022

Further Reading

Why Alignment Could Be Hard With Modern Deep Learning · Ajeya Cotra · 2021

Goal Misgeneralization: Why Correct Specifications Aren't Enough for Correct Goals · Shah et al. · 2022

The Alignment Problem From a Deep Learning Perspective · Ngo et al. · 2022

Goal Misgeneralization in Deep Reinforcement Learning · Langosco et al. · 2022

Optimal Policies Tend to Seek Power · Turner et al. · 2022

Language Models as Agent Models · Andreas · 2022