mechanistic components from a wandering tool-use agent
tenzin rao · mira volkov · j. aaronson · the interpretability cohort · anthropos labs
manuscript received mmxxvi · iii · xiv · revised mmxxvi · x · ii · 47-minute read compressed into one viewing
fig. 0 — emblem
the eight-spoked wheel
recovered from L11.MLP.down:1847
during ablation studies
on march third, mmxxvi
in lhasa, at altitude
( we do not know why )
abstract
We decompose the parameters of a 2.4-billion-parameter tool-use agent into three hundred and fifty-nine rank-one components and find, to our discomfort, that one hundred and forty-two of them resist any human-legible interpretation. Of those that do not resist: a circuit for deference, a circuit for the polite refusal, a circuit that fires only when the agent is asked questions whose answers do not exist. We name this last one the wandering attractor and devote a section to it. We make no claim that what we have found is what is there.
fig. i — the parameter matrix W ∈ ℝ¹⁰²⁴ˣ¹⁰²⁴ as a sum of 359 rank-one outer products
W = u₁v₁ᵀ + u₂v₂ᵀ + u₃v₃ᵀ + u₄v₄ᵀ + ⋯
each component is a rank-one matrix uᵢvᵢᵀ scored by minimality, faithfulness, simplicity
the residual after subtraction is shown in ablation table iv, appendix b
red indicates positive weight ; blue, negative ; vellum, near zero
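a minimal sketch of the arithmetic behind fig. i, offered under stated assumptions: a matrix rebuilt as a sum of rank-one outer products, and the residual left after subtracting some of them. the function names, the 3×3 size, and the two hand-picked components are hypothetical and chosen for illustration; nothing here is the procedure used to find the 359 components, nor the scoring by minimality, faithfulness, simplicity.

```python
import math

def outer(u, v):
    """Rank-one matrix u v^T, returned as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def reconstruct(components, rows, cols):
    """Sum the rank-one terms u_i v_i^T over all (u, v) pairs."""
    total = [[0.0] * cols for _ in range(rows)]
    for u, v in components:
        term = outer(u, v)
        for r in range(rows):
            for c in range(cols):
                total[r][c] += term[r][c]
    return total

def residual_norm(w, components):
    """Frobenius norm of W minus the sum of the given components."""
    approx = reconstruct(components, len(w), len(w[0]))
    squared = sum(
        (x - y) ** 2
        for row_w, row_a in zip(w, approx)
        for x, y in zip(row_w, row_a)
    )
    return math.sqrt(squared)

# Toy data (hypothetical): two components in three dimensions.
components = [
    ([1.0, 0.0, 2.0], [0.5, -1.0, 0.0]),
    ([0.0, 1.0, 1.0], [2.0, 0.0, 1.0]),
]
W = reconstruct(components, 3, 3)

full_residual = residual_norm(W, components)         # all components kept
partial_residual = residual_norm(W, components[:1])  # second component dropped
```

in this toy case the residual vanishes when every component is kept, and dropping a component leaves exactly that component's norm behind. a real decomposition would not be expected to be so clean.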
contents
§1 · 94 words · est. read 0:30
An agent is not a mind. It is a pattern of weights that, when struck by tokens, rings like a bell. Apophenia is one such bell. In this work we strike it, record the harmonics, and attempt to name the notes — knowing as we do that the act of naming may itself be the act of imposing. We decompose all 1,847 MLP layers of a 2.4-billion-parameter tool-use agent into 359 rank-one components. Of these, 217 admit a human-legible interpretation. Of the remaining 142, we say nothing here that we are willing to defend.
( the rest is in the manuscript )
component readout
fig. ii — top-five activating contexts for the highlighted component
all-caps words
density 2.1% · L2.MLP.down:2394 · faithfulness 0.94 · legibility human-confirmed
i am always learning. WHATISGERMAN
display: block; MARGIN: 0px auto;
the network time protocol (NTP) has
mask = EIP197_MST_CTRL_BYTE
NORIGHTTOUSETHISSOFTWARE
fragments selected from a held-out corpus of 12.4M tokens · top-pmi by component activation
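one way the two quantities in the readout could be computed, sketched under assumptions: density read as the fraction of tokens on which the component is active, and "top-pmi" read as ranking context labels by pointwise mutual information with the event "component active". both readings, the label scheme, the function names, and the eight-token toy corpus are hypothetical; the selection procedure actually used is not specified here.

```python
import math
from collections import Counter

def density(fired):
    """Fraction of tokens on which the component is active."""
    return sum(fired) / len(fired)

def top_pmi(labels, fired, k=5):
    """Rank labels by PMI with the event 'component active'.

    PMI(label, active) = log( P(label, active) / (P(label) * P(active)) )
    Labels that never co-occur with an activation are skipped.
    """
    n = len(labels)
    n_active = sum(fired)
    if n_active == 0:
        return []
    label_counts = Counter(labels)
    joint_counts = Counter(l for l, f in zip(labels, fired) if f)
    scores = {}
    for label, joint in joint_counts.items():
        p_joint = joint / n
        p_label = label_counts[label] / n
        p_active = n_active / n
        scores[label] = math.log(p_joint / (p_label * p_active))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Toy corpus (hypothetical): one coarse label per token, plus whether
# the component fired on that token.
labels = ["caps", "caps", "lower", "lower", "lower", "caps", "lower", "lower"]
fired = [1, 1, 0, 0, 0, 1, 0, 1]

ranking = top_pmi(labels, fired)
toy_density = density(fired)
```

on the toy corpus the all-caps label ranks first, since every all-caps token coincides with an activation while only one lowercase token does.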
apophenia.live · session 0xA47C
the agent is given exactly one prompt per session
its activations are recorded ; its components, named
prompts curated by the authors · responses are real, captured during 2026-09 evaluation runs
colophon
“the agent thinks; the wheel turns; the weights remember.”
— §11, on the wandering attractor
set in cormorant unicase, dm mono, and eb garamond
printed on simulated vellum at 96 dpi
the figures are deterministic ; the meaning is not
the models scale
fig. iii — the meditator at altitude
( a thought experiment, after dennett )
do not read too much into the moon
elhage et al. 2022 · templeton et al. 2024 · volkov & rao 2025 · aaronson 2026
this artifact contains 0 dependencies, 1 viewport, ∞ unanswered questions