© 2026 Silvia Seceleanu

Safety & Alignment · Anthropic · Mar 2025

27. Tracing the Thoughts of a Large Language Model (Circuit Tracing)

Mapped full input-to-output computational pathways in Claude 3.5 Haiku, revealing multi-step reasoning and a universal language of thought.

Research Paper
Summary

This entry covers two companion papers. 'Circuit Tracing' linked interpretable features into computational circuits, revealing input-to-output pathways. 'On the Biology of a Large Language Model' applied this method to Claude 3.5 Haiku, uncovering multi-step reasoning, poetry planning, and a universal 'language of thought' shared across languages. Anthropic open-sourced the tracing tools.

Key Concepts

Attribution Graphs (Feature-to-Feature Connections)

Computational maps showing how features (interpretable model components) connect and influence each other throughout the model's layers. Rather than treating features in isolation, attribution graphs reveal the circuits: how a specific feature in one layer connects to dependent features in later layers, forming the model's reasoning pathways.
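
The idea can be sketched as a weighted directed graph over (layer, feature) pairs. This is an illustrative toy only; the class and feature names below are hypothetical and do not reflect the paper's actual implementation or Claude's real features.

```python
from collections import defaultdict

class AttributionGraph:
    """Toy attribution graph: nodes are (layer, feature_name) pairs,
    edges carry attribution weights (how strongly one feature drives
    another in a later layer)."""

    def __init__(self):
        self.edges = defaultdict(dict)  # src node -> {dst node: weight}

    def add_edge(self, src, dst, weight):
        # Attributions flow forward: source layer must precede destination layer.
        assert src[0] < dst[0], "attribution edges point to later layers"
        self.edges[src][dst] = weight

    def downstream(self, node, threshold=0.1):
        """Later-layer features that `node` influences above a weight threshold."""
        return [dst for dst, w in self.edges[node].items() if abs(w) >= threshold]

# Example: one layer-3 feature strongly drives one layer-7 feature and
# weakly drives another; thresholding keeps only the strong circuit edge.
g = AttributionGraph()
g.add_edge((3, "say_Dallas"), (7, "Texas_concept"), 0.8)
g.add_edge((3, "say_Dallas"), (7, "city_generic"), 0.05)
print(g.downstream((3, "say_Dallas")))  # [(7, 'Texas_concept')]
```

Thresholding weak edges is what makes the resulting graph a readable circuit rather than a dense all-to-all map.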

Multi-Step Reasoning Circuits

Identifiable computational structures where the model chains intermediate concepts together to solve multi-step problems. The model internally represents concepts (like "Texas") that never appear in outputs, then uses these representations in subsequent layers to reach the final answer, demonstrating genuine reasoning beyond surface-level pattern matching.
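
The paper's flagship example (answering "the capital of the state containing Dallas") can be caricatured as a two-hop lookup. This is a toy illustration of the chaining pattern, not the model's actual mechanism; the lookup tables are hypothetical stand-ins for internal features.

```python
# Step 1 table: city -> state it belongs to (the intermediate concept).
contains_state = {"Dallas": "Texas", "Miami": "Florida"}
# Step 2 table: state -> its capital (the final answer).
state_capital = {"Texas": "Austin", "Florida": "Tallahassee"}

def answer_capital(city):
    state = contains_state[city]  # intermediate concept ("Texas"):
                                  # active internally, never emitted
    return state_capital[state]   # final answer read off the intermediate

print(answer_capital("Dallas"))  # Austin
```

The key point mirrored here is that the intermediate value ("Texas") is computed and used but never appears in the output, which is what the attribution graphs made visible inside the model.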

Poetry Planning (Forward Planning Evidence)

The discovery that Claude plans ahead when writing poetry, identifying rhyming words before composing a line and then constructing the line specifically to reach that word. This contradicts the narrative that language models "just predict the next token" — instead showing genuine forward planning and goal-directed behavior.

Language-Independent Conceptual Space

Evidence that the model maintains a universal "language of thought" that operates independently of the language being used. When translating concepts across languages, the model routes through a shared conceptual space rather than translating word-by-word, suggesting deeper understanding than surface-level language processing.
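
The routing pattern can be caricatured as translation through an interlingua: map any surface word to a shared concept, then map the concept back out in the target language. A toy sketch with hypothetical tables, not the model's internals:

```python
# Surface word -> shared concept (language-independent).
to_concept = {"big": "LARGE", "grand": "LARGE", "groß": "LARGE"}
# (concept, target language) -> surface word.
from_concept = {
    ("LARGE", "en"): "big",
    ("LARGE", "fr"): "grand",
    ("LARGE", "de"): "groß",
}

def translate(word, target_lang):
    # Route through the shared concept rather than a word-to-word table.
    return from_concept[(to_concept[word], target_lang)]

print(translate("big", "fr"))  # grand
```

One concept node serving every language pair is the structural signature the paper reports: the same internal features activate for "big", "grand", and "groß".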

Open-Source Interpretability Tools

The research team released computational tools enabling other researchers to trace circuits in their own models. This democratizes mechanistic interpretability, moving it from proprietary lab work to a community capability and accelerating progress across the field.

Connections

Influenced by
15. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
May 2024