Learning Objectives
- Describe at a high level what attribution graphs and linear probes are, and what each lets researchers discover about model internals
- Point to at least one concrete finding from modern interpretability research on a frontier model (planning, multi-step reasoning, unfaithful chain of thought, etc.)
- Distinguish capability evals, propensity evals, dangerous-capability evals, and alignment audits
Core Readings
Tracing the Thoughts of a Large Language Model · Anthropic · 2025
Simple Probes Can Catch Sleeper Agents · Anthropic · 2024
An Early Warning System for Novel AI Risks · Google DeepMind · 2023
Translating Claude's Thoughts into Language · Anthropic
Natural Language Autoencoders — Llama 3.3 70B (Interactive Demo) · Fraser-Taliente, Kantamneni, Ong et al. / Neuronpedia · 2026
Recommended
Difficulties with Evaluating a Deception Detector for AIs · Google DeepMind · 2025
Auditing Language Models for Hidden Objectives · Anthropic · 2025
Every Benchmark is Broken · Jonathan Gabor · 2026
Interpretability Dreams · Chris Olah · 2023
Responsible Scaling Policies · METR / ARC Evals · 2023
Frontier Models Are Capable of In-Context Scheming · Apollo Research · 2024
Measuring AI Ability to Complete Long Tasks · METR · 2025
Demystifying Evals for AI Agents · Anthropic · 2024
EIS III: Broad Critiques of Interpretability Research · Stephen Casper · 2023
Further Reading
Toy Models of Superposition · Elhage et al. · 2022
Towards Monosemanticity · Bricken, Templeton et al. (Anthropic) · 2023
Mapping the Mind of a Large Language Model / Scaling Monosemanticity · Templeton et al. (Anthropic) · 2024
On the Biology of a Large Language Model · Lindsey et al. (Anthropic) · 2025
Refusal in Language Models Is Mediated by a Single Direction · Arditi et al. · 2024
Emergent Misalignment · Betley et al. · 2025
Persona Features Control Emergent Misalignment · Wang et al. (OpenAI) · 2025
AI Control: Improving Safety Despite Intentional Subversion · Greenblatt et al. · 2024
Agentic Misalignment: How LLMs Could Be Insider Threats · Lynch et al. (Anthropic) · 2025
Auditing Language Models for Hidden Objectives (full paper) · Marks et al. (Anthropic) · 2025
Building and Evaluating Alignment Auditing Agents · Bricken, Marks et al. (Anthropic) · 2025
EvilGenie: A Reward Hacking Benchmark · Gabor et al. · 2025
Sabotage Evaluations for Frontier Models · Benton et al. (Anthropic) · 2024
AI Sandbagging · van der Weij et al. · 2024
The WMDP Benchmark · Li et al. · 2024