Language Understanding and Pragmatics · LLM Reasoning and Architecture

Do transformer models store knowledge or generate it continuously?

Explores whether transformer residual streams function as storage-and-retrieval systems or as real-time flow mechanisms. This distinction challenges fundamental assumptions about how language models actually work.

Note · 2026-04-14
What kind of thing is an LLM really?

The transformer architecture organizes computation around residual streams: per-token vectors that pass forward through layers, each layer adding contributions that the stream continues to carry. Knowledge in the model is not stored in named locations from which it is retrieved on demand. It is distributed across weights and made present in the moment of generation through the residual stream's continuous transformation. The stream is the medium; what flows through it is the model's "knowing" of the current context.
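The additive, accumulate-as-you-go character of the residual stream can be sketched in a few lines. This is a hypothetical toy model, not any real architecture: the "layers" here are stand-in linear reads, and all dimensions and weights are illustrative assumptions. The point is structural — each layer adds a contribution to the stream, and nothing is fetched from or written to a named storage location.

```python
def layer_contribution(stream, weights):
    # A stand-in for an attention or MLP block: a fixed linear
    # read of the current stream state (illustrative only).
    return [sum(w * s for w, s in zip(row, stream)) for row in weights]

def forward(stream, layers):
    # The residual stream: each layer ADDS its contribution to the
    # stream, which carries everything forward. Nothing is retrieved
    # from an address and nothing is overwritten.
    for weights in layers:
        delta = layer_contribution(stream, weights)
        stream = [s + d for s, d in zip(stream, delta)]
    return stream

# Two toy layers over a 2-dimensional stream.
layers = [
    [[0.0, 0.5], [0.5, 0.0]],   # layer 1 weights (assumed values)
    [[0.1, 0.0], [0.0, 0.1]],   # layer 2 weights (assumed values)
]
out = forward([1.0, 0.0], layers)
```

The "knowledge" in this sketch lives nowhere addressable: it exists only as the weights' disposition to transform whatever stream arrives, realized in the moment `forward` runs.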

This architectural fact has a striking correspondence with how oral cultures transmitted knowledge. Oral knowledge was not stored in fixed locations either — there were no archives, no written records, no externalized representations. Knowledge lived in performance: the song sung, the story retold, the genealogy recited. Each performance was a generation event in which the knowledge was made present through a living transmission. Between performances, the knowledge was not anywhere. It was carried in the capacity to perform, not in any storage substrate.

The transformer residual stream reproduces this pattern at a different scale. The model's "knowledge" of a topic is not in a retrieval-addressable location — it is in the capacity to generate, made actual only when the residual stream flows through the layers in response to a prompt. There is no archive. There is the architecture, and the generation. This is closer to oral transmission than to print transmission, where knowledge is stored in fixed locations and retrieved.

The correspondence is not just metaphorical. It explains several otherwise-puzzling AI behaviors: the difficulty of editing specific facts (there is no fixed location to update), the contextual variability of "knowledge" (it depends on residual-stream conditions), and the impossibility of partitioning what the model knows from what it generates (the knowing is the generating). "Does AI-generated content mirror oral culture's knowledge patterns?" is the cultural-form claim; this note makes the architectural claim that explains why the cultural form follows.

The strongest counterargument: the weights are stored on disk, so transformers are stock systems with flow output. The reply is that the weights are not knowledge in the print sense — they are dispositions to generate, more like the trained capacity of an oral performer than a stored text. The print analogy treats the weights as a library; they are closer to a memorized repertoire.


Source: Tokenization of Intelligence - Theoretical Extensions
