Technical Intro Fellowship

Week 2

Outer Alignment

Learning Objectives

Understand the gap between intended goals and proxy training objectives
Identify major outer alignment failure modes, including specification gaming and reward hacking
Evaluate how RLHF and reward modeling advance but don't fully resolve alignment

Core Readings

Specification Gaming: The Flip Side of AI Ingenuity · DeepMind · 2020

Aligning Language Models to Follow Instructions · OpenAI · 2022

Language Models Learn to Mislead Humans via RLHF · Wen et al. · 2024

Large Language Models Can Strategically Deceive Their Users When Put Under Pressure · Apollo Research · 2023

Recommended

Deep RL From Human Preferences · Christiano et al. · 2017

Playing the Training Game · Kelsey Piper · 2023

Natural Emergent Misalignment From Reward Hacking in Production RL · Anthropic & Redwood Research · 2025

Constitutional AI: Harmlessness From AI Feedback · Anthropic · 2022

Open Problems and Fundamental Limitations of RLHF · Casper et al. · 2023

Further Reading

Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models · Anthropic · 2024

Scaling Laws for Reward Model Overoptimization · OpenAI · 2022

Concrete Problems in AI Safety · Amodei et al. · 2016

Scalable Agent Alignment via Reward Modeling · DeepMind · 2018