Can attractor dynamics compete with input-based probing for characterizing model knowledge?
This explores a contrast between two ways of figuring out what a model 'knows': watching the internal dynamics of its hidden states (the attractor/cycle view) versus poking it from the outside with prompts and inputs (probing), and whether the internal-dynamics view holds up as a serious alternative.
This reads the question as a face-off: can watching what a model's hidden states *do over time* — settling into cycles or attractor-like patterns — tell us as much about its knowledge as feeding it inputs and reading the outputs? The corpus suggests the two aren't really rivals so much as windows onto different things, and that input-based probing has a hard ceiling the dynamics view can see past.
Start with the limit of probing. Prompt optimization can only surface knowledge already in the model: it reorganizes what the training distribution supplied but can't inject anything new (see: Can prompt optimization teach models knowledge they lack?). So if you characterize knowledge only by what inputs can elicit, you measure what's *reachable from outside*, not what's *there*. That's exactly where internal dynamics earn their keep. Distilled reasoning models show roughly five cycles per sample in their hidden-state reasoning graphs versus near-zero in base models, and that cyclicity correlates with accuracy and maps onto documented 'aha moments' where the model reconsiders an intermediate answer (see: Do reasoning cycles in hidden states reveal aha moments?). The cycle is a dynamical signature, a knowledge-processing event you'd never catch by reading only the final token.
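To make the cycle count concrete, here is a minimal Python sketch under stated assumptions. It assumes the hidden states along a reasoning trace have already been clustered into discrete node ids; the clustering step, the function name `count_cycles`, and the toy traces are all hypothetical. It counts one cycle each time the trace returns to a node on its current loop-free path. This is one plausible proxy for cyclicity, not necessarily the definition used in the cited note.

```python
def count_cycles(trajectory):
    """Count returns to a node on the current loop-free path.

    `trajectory` is a sequence of hashable node ids (assumed here to be
    cluster labels of hidden states; that preprocessing is not shown).
    """
    path = []       # current loop-free path of node ids
    position = {}   # node id -> index of that node in `path`
    cycles = 0
    for node in trajectory:
        if path and node == path[-1]:
            continue  # lingering in the same cluster is not a cycle
        if node in position:
            # The trace came back to an earlier node: close one cycle
            # and erase the loop so it is not counted twice.
            cycles += 1
            cut = position[node]
            for dropped in path[cut + 1:]:
                del position[dropped]
            del path[cut + 1:]
        else:
            position[node] = len(path)
            path.append(node)
    return cycles


# Toy traces (illustrative only, not measured data).
backtracking_trace = [0, 1, 2, 1, 3, 4, 3, 5]  # revisits node 1, then node 3
straight_trace = [0, 1, 2, 3, 4]               # never revisits a node

print(count_cycles(backtracking_trace))  # 2
print(count_cycles(straight_trace))      # 0
```

The loop-erasing step is a design choice: without it, a trace that oscillates between two nodes would have every revisit counted against the full history instead of against the path it is currently on.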
There's a deeper reason the dynamics view matters: internal structure and external behavior are decoupled. Models can hit identical accuracy through radically different internal mechanisms (see: What actually happens inside a language model?; What actually happens inside the minds of language models?), which means output-based probing is blind to *how* the answer was reached. Other internal signatures point the same way. Hidden states sparsify in a localized, systematic way under unfamiliar tasks, acting as an adaptive filter rather than a failure (see: Do language models sparsify their activations under difficult tasks?), and post-trained models show 3-4x lower on-policy output entropy as they start treating their own outputs as actions that shape future inputs (see: Do models recognize their own outputs as actions shaping future inputs?). These are dynamical facts about knowledge-in-use that no single prompt reveals.
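Both signatures mentioned above reduce to simple quantities once a next-token distribution and an activation vector are in hand. The sketch below is a hedged illustration: the function names, the near-zero threshold `eps`, and all the toy numbers are assumptions made for the example, and the cited notes may use different sparsity or entropy measures.

```python
import math


def entropy_bits(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def sparsity(activations, eps=1e-3):
    """Fraction of activations whose magnitude is below `eps`.

    `eps` is an arbitrary illustrative threshold, not a value from the source.
    """
    return sum(abs(a) < eps for a in activations) / len(activations)


# Toy next-token distributions (illustrative only).
flat = [0.25, 0.25, 0.25, 0.25]    # maximally uncertain over four tokens
peaked = [0.97, 0.01, 0.01, 0.01]  # confident in one token

# Toy hidden-state vectors (illustrative only).
dense = [0.4, -1.2, 0.7, 0.3]
sparse = [0.0, 0.5, 0.0, -2.0]

print(entropy_bits(flat), entropy_bits(peaked))
print(sparsity(dense), sparsity(sparse))
```

Comparing such numbers across conditions (on-policy versus off-policy text, familiar versus unfamiliar tasks) is what would turn them into the kind of evidence the notes describe; a single value on its own says little.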
But the honest answer is *complement, not compete*. The cleanest claim in the corpus is that representational or dynamical analysis alone finds correlations without causation: you need to locate a candidate feature internally and then verify it causally by intervening through inputs (see: Can we understand LLM mechanisms with only representational analysis?). The strongest case studies do exactly this hybrid. Sparse autoencoders revealed an entity-recognition mechanism that the model uses to track whether it knows a fact, and that internal signal *causally steers* hallucination and refusal, a knowledge characterization that only holds because internal structure and behavioral probe were joined (see: Do models know what they don't know?).
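The locate-then-verify pattern can be sketched on a toy example. This is not the sparse-autoencoder method from the cited note: it swaps in a much simpler difference-of-means direction, applies the intervention directly to a hidden-state vector, and uses a hypothetical `toy_readout` function as a stand-in for the rest of a model. Every name and number below is an assumption made for illustration.

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def difference_of_means(positive, negative):
    """Step 1 (locate): candidate feature direction separating two groups."""
    return [p - q for p, q in zip(mean(positive), mean(negative))]


def steer(hidden, direction, alpha):
    """Step 2 (intervene): push a hidden state along the candidate direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def toy_readout(hidden):
    """Hypothetical stand-in for downstream behavior, not a real model."""
    return "answer" if hidden[0] > 0 else "refuse"


# Toy hidden states for entities the model "knows" versus "does not know".
known = [[1.0, 0.2], [0.8, -0.1]]
unknown = [[-1.0, 0.1], [-0.6, 0.0]]

direction = difference_of_means(known, unknown)

before = toy_readout(unknown[1])
after = toy_readout(steer(unknown[1], direction, alpha=1.0))
print(before, "->", after)
```

If behavior flips under the intervention, the located direction is doing causal work in this toy setup; if it did not, the direction would be only a correlate, which is the gap the paired method is meant to close.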
The quietly surprising payoff: 'characterizing knowledge' may be the wrong frame for both methods. Several notes converge on the idea that much of what looks like newly-probed capability was latent all along. Base models already contain reasoning strategies as pre-existing activation vectors, and RL post-training teaches *when* to deploy them, not how (see: Does RL post-training create reasoning or just deploy it?), with understanding itself arriving in hierarchical tiers where higher-order circuits sit atop older heuristics rather than replacing them (see: Do language models understand in fundamentally different ways?). If knowledge is a layered, latent, deploy-on-demand thing, then attractor dynamics and input probing aren't competing measurements of one quantity: they're measuring deployment versus possession.
Sources (10 notes)
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Distilled reasoning models show ~5 cycles per sample versus near-zero in base models, and cyclicity correlates with accuracy. These cycles in hidden-state reasoning graphs directly map to RL-trained models' documented aha moments—moments when models reconsider intermediate answers.
Research shows that LLMs can achieve the same output through different internal mechanisms, and improvements in one dimension like accuracy reliably degrade others like faithfulness and calibration. Internal structure matters even when behavior appears identical.
LLMs can achieve identical accuracy while maintaining radically different internal representations, and mechanisms that appear interpretable may not causally drive outputs. This decoupling means performance metrics alone mask crucial differences in how models actually work.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.
Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.
Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.