Technical Intro Fellowship
Week 6

Interpretability & Evals

Learning Objectives

  • Describe at a high level what attribution graphs and linear probes are, and what each lets researchers discover about model internals
  • Point to at least one concrete finding from recent interpretability research on a frontier model (e.g., planning, multi-step reasoning, or unfaithful chain of thought)
  • Distinguish capability evals, propensity evals, dangerous-capability evals, and alignment audits