Technical Intro Fellowship

Week 6

Interpretability & Evals

Learning Objectives

Describe at a high level what attribution graphs and linear probes are, and what each lets researchers discover about model internals
Point to at least one concrete finding from modern interpretability research on a frontier model (for example planning, multi-step reasoning, or unfaithful chain of thought)
Distinguish capability evals, propensity evals, dangerous-capability evals, and alignment audits

Core Readings

Tracing the Thoughts of a Large Language Model · Anthropic · 2025

Simple Probes Can Catch Sleeper Agents · Anthropic · 2024

An Early Warning System for Novel AI Risks · Google DeepMind · 2023

Translating Claude's Thoughts into Language · Anthropic

Natural Language Autoencoders: Llama 3.3 70B (Interactive Demo) · Fraser-Taliente, Kantamneni, Ong et al. / Neuronpedia · 2026

Recommended

Difficulties with Evaluating a Deception Detector for AIs · Google DeepMind · 2025

Auditing Language Models for Hidden Objectives · Anthropic · 2025

Every Benchmark is Broken · Jonathan Gabor · 2026

Interpretability Dreams · Chris Olah · 2023

Responsible Scaling Policies · METR / ARC Evals · 2023

Frontier Models Are Capable of In-Context Scheming · Apollo Research · 2024

Measuring AI Ability to Complete Long Tasks · METR · 2025

Demystifying Evals for AI Agents · Anthropic · 2024

EIS III: Broad Critiques of Interpretability Research · Stephen Casper · 2023

Further Reading

Toy Models of Superposition · Elhage et al. · 2022

Towards Monosemanticity · Bricken, Templeton et al. (Anthropic) · 2023

Mapping the Mind of a Large Language Model / Scaling Monosemanticity · Templeton et al. (Anthropic) · 2024

On the Biology of a Large Language Model · Lindsey et al. (Anthropic) · 2025

Refusal in Language Models Is Mediated by a Single Direction · Arditi et al. · 2024

Emergent Misalignment · Betley et al. · 2025

Persona Features Control Emergent Misalignment · Wang et al. (OpenAI) · 2025

AI Control: Improving Safety Despite Intentional Subversion · Greenblatt et al. · 2024

Agentic Misalignment: How LLMs Could Be Insider Threats · Lynch et al. (Anthropic) · 2025

Auditing Language Models for Hidden Objectives (full paper) · Marks et al. (Anthropic) · 2025

Building and Evaluating Alignment Auditing Agents · Bricken, Marks et al. (Anthropic) · 2025

EvilGenie: A Reward Hacking Benchmark · Gabor et al. · 2025

Sabotage Evaluations for Frontier Models · Benton et al. (Anthropic) · 2024

AI Sandbagging · van der Weij et al. · 2024

The WMDP Benchmark · Li et al. · 2024