Technical Intro Fellowship
Week 5

Control & Scalable Oversight

Learning Objectives

  • Explain the AI control framework and how it differs from alignment, including the assumption that the model may be intentionally subversive
  • Describe concrete control techniques (resampling, monitoring, trusted models) and evaluate their strengths and limitations
  • Articulate the core challenge of scalable oversight and assess approaches like weak-to-strong generalization and debate

Further Reading

AI Safety via Debate · Irving, Christiano, Amodei · 2018
Supervising Strong Learners by Amplifying Weak Experts · Christiano, Shlegeris, Amodei · 2018
How to Prevent Collusion When Using Untrusted Models to Monitor Each Other · Shlegeris (Redwood) · 2024
Thoughts on the Conservative Assumptions in AI Control · Shlegeris (Redwood) · 2025
An Overview of Control Measures · Greenblatt (Redwood) · 2025
A Sketch of an AI Control Safety Case · Korbak, Clymer, Hilton, Shlegeris, Irving · 2025
Subversion Strategy Eval: Can Language Models Statelessly Strategize to Subvert Control Protocols? · Mallen et al. · 2024
An Overview of Areas of Control Work · Greenblatt (Redwood) · 2025
Four Places Where You Can Put LLM Monitoring · Shlegeris (Redwood) · 2025
Misalignment and Strategic Underperformance: Sandbagging and Exploration Hacking · Stastny & Shlegeris (Redwood) · 2025
Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs · Mathew et al. (LASR Labs) · 2024
Detecting Strategic Deception Using Linear Probes · Goldowsky-Dill et al. (Apollo Research) · 2025
Debate Helps Supervise Unreliable Experts · Michael et al. · 2023
How to Keep Improving When You're Better Than Any Teacher (IDA) · Robert Miles · 2019
Rob Miles YouTube Explainer of the AI Control Paper · Robert Miles · 2024
AXRP Episode 27: AI Control With Buck Shlegeris and Ryan Greenblatt · Daniel Filan · 2024
80,000 Hours: Buck Shlegeris on Controlling AI That Wants to Take Over · Robert Wiblin · 2024
80,000 Hours: Ryan Greenblatt on AI R&D Automation and Misaligned Takeover · Robert Wiblin · 2025
Detecting Misbehavior in Frontier Reasoning Models · Baker et al. (OpenAI) · 2025
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety · Korbak et al. · 2025
Reasoning Models Don't Always Say What They Think · Chen et al. (Anthropic) · 2025
Monitoring Monitorability · Guan, Wang, Carroll et al. (OpenAI) · 2025
Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity · Meek et al. · 2025
Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps · EMNLP · 2025
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting · Turpin et al. · 2023
Measuring Faithfulness in Chain-of-Thought Reasoning · Lanham et al. · 2023
Chain-of-Thought Reasoning in the Wild: Faithfulness · Arcuschin et al. · 2025
When Chain of Thought Is Necessary, Language Models Struggle to Evade Monitors · Emmons et al. · 2025
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring · Arnav et al. · 2025
LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring · Li, Phuong, Siegel · 2025
Stress Testing Deliberative Alignment for Anti-Scheming Training · Schoen et al. (Apollo Research) · 2025
Auditing Language Models for Hidden Objectives · Marks et al. (Anthropic) · 2025
Preventing Language Models From Hiding Their Reasoning · Roger & Greenblatt (Redwood) · 2023