Can knowledge encoded in model representations fail to influence generation?

This explores whether a model can hold something in its internal representations — a correct answer, a piece of context, a latent skill — and still not let it shape the words it actually generates.

Knowledge that demonstrably exists inside a model's hidden states can fail to reach the output, and the corpus says it does so repeatedly, through several distinct mechanisms. The most striking case: models can compute the *correct* answer in their early layers and then actively suppress it before generation. Logit-lens analysis of models trained with hidden chain-of-thought shows the right answer forming in layers 1–3, only to be overwritten in the final layers to produce format-compliant filler tokens. The reasoning stays fully recoverable from lower-ranked predictions but never surfaces in the text (see "Do transformers hide reasoning before producing filler tokens?"). So the gap between "the model knows" and "the model says" isn't hypothetical; it's measurable.
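
The logit-lens reading described above can be sketched on toy data. This is a minimal illustration, not the cited study's code: the vocabulary, the per-layer hidden states, and the identity unembedding matrix are all made-up assumptions, chosen so that the answer token leads in early layers and is outranked by a filler token in the last one.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden_states, unembed, vocab):
    """Project each layer's hidden state through the unembedding matrix.

    hidden_states: one vector per layer (all the same width).
    unembed: one row per vocabulary token, same width as a hidden state.
    Returns, per layer, the vocabulary sorted from most to least likely.
    """
    rankings = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        probs = softmax(logits)
        order = sorted(range(len(vocab)), key=lambda i: -probs[i])
        rankings.append([vocab[i] for i in order])
    return rankings

def rank_of(token, ranking):
    """1-based rank of a token within one layer's ranking."""
    return ranking.index(token) + 1

# Hypothetical toy setup: "7" is the correct answer, "..." is a filler token.
vocab = ["7", "...", "the"]
unembed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # assumed identity
hidden_states = [
    [2.0, 0.1, 0.0],  # layer 1: answer dominates
    [2.5, 0.5, 0.0],  # layer 2
    [1.5, 1.0, 0.0],  # layer 3
    [1.0, 3.0, 0.0],  # final layer: filler overwrites the answer
]

rankings = logit_lens(hidden_states, unembed, vocab)
for layer, ranking in enumerate(rankings, start=1):
    print(f"layer {layer}: top={ranking[0]!r}, rank of '7' = {rank_of('7', ranking)}")
```

In this toy, the answer is the top prediction in layers 1–3 and drops to second place in the final layer, which is the sense in which suppressed reasoning can remain recoverable from lower-ranked predictions.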

A second, more mundane mechanism is interference. When a model's parametric training associations are strong, they can override information sitting right in the context window: the model generates outputs inconsistent with what it was just told, because prior knowledge dominates in-context knowledge. Notably, textual prompting alone can't fix this; the corpus reports that *causal intervention in the representations themselves* is required to make the context win (see "Why do language models ignore information in their context?"). That reframes the question: it's not only that encoded knowledge fails to influence generation, but that competing encoded knowledge can crowd it out.
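
To make "causal intervention in the representations" concrete, here is a toy sketch of one common form of it, adding a steering vector to a hidden state. Everything in it is an illustrative assumption rather than the cited work's method: the two-dimensional hidden state, the `parametric` and `contextual` answer directions, and the steering strength `alpha`.

```python
def readout(hidden, answer_directions):
    """Pick the answer whose direction has the largest dot product with hidden."""
    scores = {
        name: sum(h * d for h, d in zip(hidden, direction))
        for name, direction in answer_directions.items()
    }
    return max(scores, key=scores.get), scores

def steer(hidden, direction, alpha):
    """Intervene on the representation: add alpha * direction to the hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical answer directions in a 2-d hidden space.
answer_directions = {
    "parametric": [1.0, 0.0],  # what training associations favor
    "contextual": [0.0, 1.0],  # what the context window actually says
}

# Both answers are encoded, but the parametric prior is stronger.
hidden = [3.0, 1.0]
before, _ = readout(hidden, answer_directions)

# Steering vector = contextual direction minus parametric direction (assumed).
steering = [-1.0, 1.0]
after, _ = readout(steer(hidden, steering, alpha=1.5), answer_directions)

print(before, "->", after)
```

The point of the toy is that the contextual answer is present in `hidden` the whole time; it only wins the readout once the representation itself is shifted.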

The flip side is just as interesting: much capability is latent and simply *unelicited*. Base models already contain reasoning ability that minimal training unlocks. Five independent methods (RL steering, critique fine-tuning, decoding changes, SAE feature steering, RLVR) all elicit reasoning that was already present in activations, suggesting post-training selects rather than creates (see "Do base models already contain hidden reasoning ability?"). A companion view holds that RL post-training teaches *when* to reason, not *how*: the strategies pre-exist as activation vectors before any RL touches them (see "Does RL post-training create reasoning or just deploy it?"). The bottleneck, in other words, is elicitation: knowledge can sit in the representations indefinitely without a trigger to route it into output.
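
One simple way to picture a strategy that "pre-exists as an activation vector" is a difference-of-means direction between activations recorded on two kinds of inputs. The sketch below is an assumption-laden toy, not the procedure used in the cited notes: the activation values are invented, and difference-of-means is swapped in here only as a familiar way to extract such a direction.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def difference_of_means(positive, negative):
    """Direction pointing from the mean negative activation to the mean positive one."""
    mean_pos = mean_vector(positive)
    mean_neg = mean_vector(negative)
    return [p - q for p, q in zip(mean_pos, mean_neg)]

def dot(a, b):
    """Dot product, used to score how strongly an activation expresses a direction."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical base-model activations (no post-training involved).
reasoning_acts = [[1.0, 2.0], [1.0, 4.0]]  # inputs where the model reasons step by step
plain_acts = [[1.0, 0.0], [1.0, 2.0]]      # inputs where it answers directly

direction = difference_of_means(reasoning_acts, plain_acts)

# Score two held-out activations against the extracted direction.
score_reasoning_like = dot([0.5, 3.5], direction)
score_plain_like = dot([2.0, 0.5], direction)
print(direction, score_reasoning_like, score_plain_like)
```

If such a direction can be read off a base model's activations, then eliciting the behavior becomes a matter of triggering or amplifying it rather than training it in from scratch, which is the claim the paragraph above attributes to the corpus.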

This is why a growing line of work argues that reasoning should be studied as hidden-state trajectory formation, not as surface text. The visible chain-of-thought is only a *partial interface* onto a latent process, and faithfulness tests show the words don't reliably mirror the computation underneath (see "Where does LLM reasoning actually happen during generation?"). There's even an architectural escape hatch: diffusion LLMs with bidirectional attention can refine reasoning embedded directly in masked positions, decoupling it from the left-to-right generation that forces autoregressive models to either emit or bury their intermediate work (see "Can reasoning and answers be generated separately in language models?").

The useful boundary here: prompting and post-training can *reorganize and route* what's already encoded, but neither injects what isn't there. Prompt optimization activates existing knowledge and hits a hard ceiling at the edge of the training distribution (see "Can prompt optimization teach models knowledge they lack?"). So the picture has two failure directions worth holding together: encoded knowledge that exists but stays silent (suppression, interference, missing elicitation), and the absence of knowledge, which no representation-level trick can conjure. The interesting frontier is the first: closing the gap between what the activations know and what the tokens say.

Sources (7 notes)

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Where does LLM reasoning actually happen during generation?

Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.

Can reasoning and answers be generated separately in language models?

ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
