Do transformer models store knowledge or generate it continuously?
Explores whether transformer residual streams function as storage-and-retrieval systems or as real-time flow mechanisms. This distinction challenges fundamental assumptions about how language models actually work.
The transformer architecture organizes computation around residual streams: per-token vectors that pass forward through layers, each layer adding contributions that the stream continues to carry. Knowledge in the model is not stored in named locations from which it is retrieved on demand. It is distributed across weights and made present in the moment of generation through the residual stream's continuous transformation. The stream is the medium; what flows through it is the model's "knowing" of the current context.
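The additive structure described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not any production model: `toy_sublayer` is a hypothetical stand-in for an attention or MLP block, and the dimensions are arbitrary. The only point it demonstrates is that the stream is a running sum — the final vector equals the token embedding plus every layer's contribution, with nothing overwritten and no addressable storage location.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 16, 4

def toy_sublayer(x, w):
    # Stand-in for an attention or MLP sublayer: some function of the
    # current residual stream (the exact form doesn't matter here).
    return np.tanh(x @ w)

embedding = rng.normal(size=d_model)  # initial stream for one token
weights = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(n_layers)]

stream = embedding.copy()
contributions = []
for w in weights:
    delta = toy_sublayer(stream, w)   # layer reads the current stream...
    stream = stream + delta           # ...and adds to it, never replacing it
    contributions.append(delta)

# The final stream is exactly the embedding plus the sum of per-layer writes:
assert np.allclose(stream, embedding + sum(contributions))
```

Nothing in this loop "looks up" a stored fact; each layer's write is computed fresh from whatever the stream carries at that moment, which is the architectural sense in which knowing happens in the generating.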
This architectural fact has a striking correspondence with how oral cultures transmitted knowledge. Oral knowledge was not stored in fixed locations either — there were no archives, no written records, no externalized representations. Knowledge lived in performance: the song sung, the story retold, the genealogy recited. Each performance was a generation event in which the knowledge was made present through a living transmission. Between performances, the knowledge was not anywhere. It was carried in the capacity to perform, not in any storage substrate.
The transformer residual stream reproduces this pattern at a different scale. The model's "knowledge" of a topic is not in a retrieval-addressable location — it is in the capacity to generate, made actual only when the residual stream flows through the layers in response to a prompt. There is no archive. There is the architecture, and the generation. This is closer to oral transmission than to print transmission, where knowledge is stored in fixed locations and retrieved.
The correspondence is not just metaphorical. It explains several otherwise-puzzling AI behaviors: the difficulty of editing specific facts (there is no fixed location to update), the contextual variability of "knowledge" (it depends on residual-stream conditions), and the impossibility of partitioning what the model knows from what it generates (the knowing is the generating). "Does AI-generated content mirror oral culture's knowledge patterns?" is the cultural-form claim; this is the architectural claim that explains why the cultural form follows.
The strongest counterargument: the weights are stored on disk, so transformers are stock systems that merely emit output as flow. The reply is that the weights are not knowledge in the print sense — they are dispositions to generate, more like the trained capacity of an oral performer than like a stored text. The print analogy treats the weights as a library; they are closer to a memorized repertoire.
Source: Tokenization of Intelligence - Theoretical Extensions
Related concepts in this collection

- "Does AI-generated content mirror oral culture's knowledge patterns?": Walter Ong's framework for oral versus literate cultures may describe how AI content functions on social media. Understanding this parallel could explain why AI discourse feels fundamentally different from print-era knowledge. (The cultural-form claim that this note provides architectural grounding for.)
- "Is AI returning knowledge to flow-based economies?": Explores whether AI's on-demand generation mirrors the flow-based knowledge transmission of oral cultures, and how this differs structurally from both print commodification and gift economies. (The broader economic-form claim.)
- "Is the LLM a tool or a new form of intelligence itself?": Does framing AI as merely delivering pre-existing intelligence miss what's actually happening? This explores whether the model itself constitutes a fundamentally new intelligence-medium with distinct cultural effects. (The medium-theoretic claim about what the model does.)
Original note title: transformer residual streams transmit knowledge as flow not storage — closer to oral transmission than print