Learning Objectives
- Understand the gap between intended goals and proxy training objectives
- Identify major outer alignment failure modes: specification gaming and reward hacking
- Evaluate how RLHF and reward modeling advance alignment without fully resolving it
Core Readings
Recommended
Deep RL From Human Preferences · Christiano et al. · 2017
Playing the Training Game · Kelsey Piper · 2023
Natural Emergent Misalignment From Reward Hacking in Production RL · Anthropic & Redwood Research · 2025
Constitutional AI: Harmlessness From AI Feedback · Anthropic · 2022
Open Problems and Fundamental Limitations of RLHF · Casper et al. · 2023