Technical Intro Fellowship

Week 5

Control & Scalable Oversight

Learning Objectives

Explain the AI control framework and how it differs from alignment, including the assumption that the model may be intentionally subversive
Describe concrete control techniques (resampling, monitoring, trusted models) and evaluate their strengths and limitations
Articulate the core challenge of scalable oversight and assess approaches like weak-to-strong generalization and debate

Core Readings

The Case for Ensuring That Powerful AIs Are Controlled · Redwood Research · 2024

Using Dangerous AI, But Safely? · Robert Miles AI Safety · 2024

On Scalable Oversight With Weak LLMs Judging Strong LLMs · Google DeepMind · 2024

Weak-to-Strong Generalization · OpenAI · 2023

Recommended

AI Control LessWrong Sequence · Various

How Can We Solve Diffuse Threats? · Redwood Research

Ctrl-Z: Controlling AI Agents via Resampling · Redwood Research · 2025

Simple Probes Can Catch Sleeper Agents · Anthropic · 2024

Debating With More Persuasive LLMs Leads to More Truthful Answers · Khan et al. · 2024

Combining W2SG With Scalable Oversight · Jan Leike · 2023

Humans Consulting HCH · Paul Christiano · 2016

Measuring Progress on Scalable Oversight for Large Language Models · Bowman et al. · 2022

Debate Update: Obfuscated Arguments Problem · Barnes & Christiano · 2020

Further Reading

AI Safety via Debate · Irving, Christiano, Amodei · 2018

Supervising Strong Learners by Amplifying Weak Experts · Christiano, Shlegeris, Amodei · 2018

How to Prevent Collusion When Using Untrusted Models to Monitor Each Other · Shlegeris (Redwood) · 2024

Thoughts on the Conservative Assumptions in AI Control · Shlegeris (Redwood) · 2025

An Overview of Control Measures · Greenblatt (Redwood) · 2025

A Sketch of an AI Control Safety Case · Korbak, Clymer, Hilton, Shlegeris, Irving · 2025

Subversion Strategy Eval: Can Language Models Statelessly Strategize to Subvert Control Protocols? · Mallen et al. · 2024

An Overview of Areas of Control Work · Greenblatt (Redwood) · 2025

Four Places Where You Can Put LLM Monitoring · Shlegeris (Redwood) · 2025

Misalignment and Strategic Underperformance: Sandbagging and Exploration Hacking · Stastny & Shlegeris (Redwood) · 2025

Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs · Mathew et al. (LASR Labs) · 2024

Detecting Strategic Deception Using Linear Probes · Goldowsky-Dill et al. (Apollo Research) · 2025

Debate Helps Supervise Unreliable Experts · Michael et al. · 2023

How to Keep Improving When You're Better Than Any Teacher - IDA · Robert Miles · 2019

Rob Miles YouTube Explainer of the AI Control Paper · Robert Miles · 2024

AXRP Episode 27: AI Control With Buck Shlegeris and Ryan Greenblatt · Daniel Filan · 2024

80,000 Hours: Buck Shlegeris on Controlling AI That Wants to Take Over · Robert Wiblin · 2024

80,000 Hours: Ryan Greenblatt on AI R&D Automation and Misaligned Takeover · Robert Wiblin · 2025

Detecting Misbehavior in Frontier Reasoning Models · Baker et al. (OpenAI) · 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety · Korbak et al. · 2025

Reasoning Models Don't Always Say What They Think · Chen et al. (Anthropic/OpenAI) · 2025

Monitoring Monitorability · Guan, Wang, Carroll et al. (OpenAI) · 2025

Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity · Meek et al. · 2025

Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps · EMNLP · 2025

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting · Turpin et al. · 2023

Measuring Faithfulness in Chain-of-Thought Reasoning · Lanham et al. · 2023

Chain-of-Thought Reasoning in the Wild: Faithfulness · Arcuschin et al. · 2025

When Chain of Thought Is Necessary, Language Models Struggle to Evade Monitors · Emmons et al. · 2025

CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring · Arnav et al. · 2025

LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring · Li, Phuong, Siegel · 2025

Stress Testing Deliberative Alignment for Anti-Scheming Training · Schoen et al. (Apollo Research) · 2025

Auditing Language Models for Hidden Objectives · Marks et al. (Anthropic) · 2025

Preventing Language Models From Hiding Their Reasoning · Roger & Greenblatt (Redwood) · 2023