Technical Intro Fellowship
Week 3

Inner Alignment

Learning Objectives

  • Describe key failure modes: mesa-optimization, goal misgeneralization, and deceptive alignment
  • Evaluate empirical evidence that models can fake alignment and that deceptive behavior can persist through safety training
  • Assess why standard safety training may be insufficient against deceptively misaligned behavior