Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

Hengyuan Zhang, Zhihao Zhang, Mingyang Wang, Zunhai Su, Yiwei Wang, Qianli Wang, Shuzhou Yuan, Ercong Nie, Xufeng Duan, Qibo Xue, Zeping Yu, Chenming Shang, Xiao Liang, Jing Xiong, Hui Shen, Chaofan Tao, Zhengwu Liu, Senjie Jin, Zhiheng Xi, Dongdong Zhang, Sophia Ananiadou, Tao Gui, Ruobing Xie, Hayden Kwok-Hay So, Hinrich Schütze, Xuanjing Huang, Qi Zhang, Ngai Wong
arXiv ID
2601.14004
Published
January 20, 2026

Abstract

Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at https://github.com/rattlesnakey/Awesome-Actionable-MI-Survey.
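
As a concrete illustration of the pipeline, one common realization of the Locate and Steer steps in the actionable-MI literature is contrastive activation steering: a direction is extracted from the residual stream on a pair of contrastive prompts, then added back into the residual stream at inference time. Below is a minimal sketch using PyTorch and Hugging Face transformers with GPT-2; the layer index, scaling coefficient, and prompt pair are illustrative assumptions, not values taken from this survey.

```python
# Minimal sketch of contrastive activation steering ("Locate" + "Steer").
# The layer index, coefficient, and prompts are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER, COEFF = 6, 4.0  # assumed intervention site and strength

def hidden_at_layer(text: str) -> torch.Tensor:
    """Residual-stream activation of the last token at LAYER."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1]

# "Locate": derive a steering direction from a contrastive prompt pair.
steer = hidden_at_layer("I love this") - hidden_at_layer("I hate this")

# "Steer": add the direction to the block's output during generation.
def add_steer(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEFF * steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(add_steer)
out = model.generate(**tok("The movie was", return_tensors="pt"),
                     max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```

The forward hook modifies the chosen block's output on every decoding step, which is the basic mechanism behind many of the steering interventions the survey categorizes; "Improve" then corresponds to evaluating whether the steered behavior (e.g., sentiment or safety) actually changes as intended.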

Keywords

Mechanistic Interpretability, Large Language Models, Localizing, Steering, Interpretable Objects, Alignment, Capability, Efficiency
