transformer circuits thread

volume xi · folio iv · mmxxvi

arxiv:2611.00404 [cs.lg]

cc by 4.0

peer review : three rounds

prior version : volume x folio xii

correspondence : t.rao@anthropos.ai

apophenia

on the recovery of three hundred and fifty-nine mechanistic components from a wandering tool-use agent

tenzin rao · mira volkov · j. aaronson · the interpretability cohort · anthropos labs

manuscript received mmxxvi · iii · xiv · revised mmxxvi · x · ii · 47-minute read compressed into one viewing

fig. 0 — emblem

the eight-spoked wheel

recovered from L11.MLP.down:1847

during ablation studies

on march third, mmxxvi

in lhasa, at altitude

( we do not know why )

abstract

We decompose the parameters of a 2.4-billion-parameter tool-use agent into three hundred and fifty-nine rank-one components and find, to our discomfort, that one hundred and forty-two of them resist any human-legible interpretation. Of those that do not resist: a circuit for deference, a circuit for the polite refusal, a circuit that fires only when the agent is asked questions whose answers do not exist. We name this last one the wandering attractor and devote a section to it. We make no claim that what we have found is what is there.

keywords : interpretability, agents, parameter decomposition, apophenia, silence

decomposition

fig. i — the parameter matrix W ∈ ℝ¹⁰²⁴ˣ¹⁰²⁴ as a sum of 359 rank-one outer products

each component is a rank-one matrix uᵢvᵢᵀ scored by minimality, faithfulness, simplicity

the residual after subtraction is shown in ablation table iv, appendix b

red indicates positive weight ; blue, negative ; vellum, near zero
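
As a minimal sketch of what fig. i depicts, the following builds a small matrix from rank-one outer products and measures what is left after subtracting them. The dimensions, the random toy components, and the helper names (`outer`, `sum_of_components`, `residual`, `frobenius_norm`) are assumptions for illustration only; the manuscript's scoring by minimality, faithfulness and simplicity is not reproduced here.

```python
import random


def outer(u, v):
    """Rank-one matrix u v^T, returned as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def sum_of_components(components, n_rows, n_cols):
    """Sum the outer products u_i v_i^T over all (u_i, v_i) pairs."""
    total = [[0.0] * n_cols for _ in range(n_rows)]
    for u, v in components:
        term = outer(u, v)
        total = [[a + b for a, b in zip(row_t, row_c)]
                 for row_t, row_c in zip(total, term)]
    return total


def residual(w, components):
    """W minus the sum of the given rank-one components."""
    approx = sum_of_components(components, len(w), len(w[0]))
    return [[a - b for a, b in zip(row_w, row_a)]
            for row_w, row_a in zip(w, approx)]


def frobenius_norm(m):
    """Square root of the sum of squared entries."""
    return sum(x * x for row in m for x in row) ** 0.5


# Toy stand-in: a 6 x 6 matrix built from 3 components
# (the figure uses 1024 x 1024 and 359 components).
rng = random.Random(0)
DIM, K = 6, 3
components = [
    ([rng.gauss(0, 1) for _ in range(DIM)],
     [rng.gauss(0, 1) for _ in range(DIM)])
    for _ in range(K)
]
w = sum_of_components(components, DIM, DIM)

# Subtracting every component leaves nothing, by construction of this toy.
full = frobenius_norm(residual(w, components))
# Dropping one component leaves exactly that component behind.
partial = frobenius_norm(residual(w, components[:-1]))
print(full, partial)
```

In this toy the full residual vanishes only because the matrix was assembled from the same components; for trained weights the residual is the quantity of interest, and how it is reported is left to the manuscript.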

contents

§1 · 94 words · est. read 0:30

An agent is not a mind. It is a pattern of weights that, when struck by tokens, rings like a bell. Apophenia is one such bell. In this work we strike it, record the harmonics, and attempt to name the notes — knowing as we do that the act of naming may itself be the act of imposing. We decompose all 1,847 MLP layers of a 2.4-billion-parameter tool-use agent into 359 rank-one components. Of these, 217 admit a human-legible interpretation. Of the remaining 142, we say nothing here that we are willing to defend.

( the rest is in the manuscript )

component readout

fig. ii — top-five activating contexts for the highlighted component

all-caps words

density 2.1% · L2.MLP.down:2394
faithfulness 0.94 · legibility human-confirmed

i am always learning. WHAT IS GERMAN

display: block; MARGIN: 0px auto;

the network time protocol (NTP) has

mask = EIP197_MST_CTRL_BYTE

NO RIGHT TO USE THIS SOFTWARE

fragments selected from a held-out corpus of 12.4M tokens · top-pmi by component activation
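
A readout of this kind can be sketched in a few lines, under loud assumptions: density is taken here to mean the fraction of tokens on which the component's activation magnitude exceeds a threshold, and contexts are ranked by raw activation rather than by PMI, since the source does not define either computation. The token list, the activation values, and the function names are hypothetical.

```python
import heapq


def density(activations, threshold=0.0):
    """Fraction of tokens whose activation magnitude exceeds threshold."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if abs(a) > threshold)
    return active / len(activations)


def top_contexts(tokens, activations, k=5, window=3):
    """Return up to k (activation, context) pairs, strongest first.

    Each context is the peak token plus `window` tokens on either side.
    """
    ranked = heapq.nlargest(k, range(len(tokens)),
                            key=lambda i: activations[i])
    results = []
    for i in ranked:
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        results.append((activations[i], " ".join(tokens[start:end])))
    return results


# Toy data: made-up activations that respond to all-caps tokens.
tokens = "the network time protocol ( NTP ) has NO RIGHT to use".split()
activations = [0.0, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0, 0.0, 0.7, 0.8, 0.0, 0.0]

print(density(activations))
for score, context in top_contexts(tokens, activations, k=2, window=1):
    print(score, context)
```

Ranking by raw activation tends to favor frequent tokens; a PMI-style ranking would additionally normalize by how often each context occurs in the corpus, which this sketch leaves out.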

apophenia.live · session 0xA47C

the agent is given exactly one prompt per session

its activations are recorded ; its components, named

prompts curated by the authors · responses are real, captured during 2026-09 evaluation runs

colophon

“the agent thinks; the wheel turns; the weights remember.”

— §11, on the wandering attractor

set in cormorant unicase, dm mono, and eb garamond

printed on simulated vellum at 96 dpi

the figures are deterministic ; the meaning is not

the models scale

fig. iii — the meditator at altitude

( a thought experiment, after dennett )

do not read too much into the moon

elhage et al. 2022 · templeton et al. 2024 · volkov & rao 2025 · aaronson 2026

this artifact contains 0 dependencies, 1 viewport, ∞ unanswered questions
