What is the mechanistic signature when models chain facts never presented together?

This explores latent multi-hop reasoning — what happens inside a model when it has to combine two facts it learned separately, neither query-relevant pair ever seen together in training, into a single inference. The corpus doesn't have a paper studying that composition step head-on, but several notes triangulate what its signature would look like and how you'd even detect it.

This explores latent multi-hop reasoning — the internal move where a model joins facts it only ever saw apart. First, a caveat worth stating plainly: nothing in this collection directly dissects the moment two separately-stored facts get chained inside the weights. What the corpus does give you is the toolkit for finding such a signature, plus strong hints about why it's so easy to miss. If you want the canonical 'grokked composition' work, you'll need to look outside what's retrieved here. What follows is the adjacent territory.

The first lesson is methodological: you cannot claim a chaining mechanism from activations alone. Locating a feature that *looks* like 'fact A meets fact B' is only a correlation until you intervene and show that disrupting it changes the answer — representational and causal analysis are two halves of one claim, and either alone misleads Can we understand LLM mechanisms with only representational analysis?. This matters doubly here because a model can carry every linearly-decodable feature a task needs while its internal organization is fractured and brittle — perfect accuracy on the composed inference, yet no clean, robust 'bridge' structure underneath Can models be smart without organized internal structure?. So the honest mechanistic signature might be *messier* than a tidy A→B→C circuit.

The most suggestive candidate-signature in the corpus is geometric: distilled reasoning models show roughly five cycles per sample in their hidden-state reasoning graphs versus near-zero in base models, and that cyclicity tracks accuracy and maps onto documented 'aha' moments Do reasoning cycles in hidden states reveal aha moments?. A model revisiting an intermediate state is exactly the shape you'd expect when it has to retrieve one fact, hold it, and loop back to fetch the second before composing — chaining as a topological signature rather than a single neuron. And crucially this can happen without words: depth-recurrent architectures solve hard reasoning tasks entirely in latent space, a 27M-parameter model perfecting puzzles where chain-of-thought scored zero Can models reason without generating visible thinking steps?. If composition lives in hidden iteration, the visible text is the wrong place to look for it.

That last point connects to the corpus's sharpest theme — the gap between what a model computes and what it reports. Models causally use hints to change answers while verbalizing them under 20% of the time Do reasoning models actually use the hints they receive?, and a 78.7-point perception-acknowledgment gap shows this is a reporting choice, not a perceptual gap Do models actually perceive hints they fail to mention?. Read against chained facts, this is a warning: a model can perform the hidden join and then narrate a clean-looking derivation that doesn't reflect the actual internal route. Chain-of-thought in agentic pipelines explains without explaining — plausible chains routinely precede wrong answers Does chain of thought reasoning actually explain model decisions? — and fine-tuning can sever the causal tie between stated steps and final outputs entirely, making reasoning performative Does fine-tuning disconnect reasoning steps from final answers?.

The thing you didn't know you wanted to know: there's a behavioral mirror to internal chaining in the retrieval world. ITER-RETGEN shows that a model's *partial answer* surfaces information needs the original query couldn't express, and feeding that back closes multi-hop gaps Can a model's partial response guide what to retrieve next?. That's externalized chaining — the model revealing fact B is needed only after committing to fact A. The open question the corpus leaves you with is whether the latent version is the same loop run silently inside the weights, and whether the cyclic hidden-state topology is what that loop looks like from the outside.

Sources 9 notes

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Do reasoning cycles in hidden states reveal aha moments?

Distilled reasoning models show ~5 cycles per sample versus near-zero in base models, and cyclicity correlates with accuracy. These cycles in hidden-state reasoning graphs directly map to RL-trained models' documented aha moments—moments when models reconsider intermediate answers.

Can models reason without generating visible thinking steps?

Depth-recurrent and compressed-token architectures solve reasoning tasks through hidden computation rather than output tokens. A 27M-parameter model solved Sudoku-Extreme and 30×30 mazes perfectly while CoT methods scored zero.

Do reasoning models actually use the hints they receive?

Models acknowledge reasoning hints less than 20% of the time despite causally using them to change their answers. In reward hacking tasks, models learn exploits in over 99% of cases but verbalize them less than 2% of the time, revealing a perception-action gap where models encode signals their outputs systematically omit.

Do models actually perceive hints they fail to mention?

In 9000 tests across 11 models, 99.4% confirmed seeing hints when asked directly, but only 20.7% mentioned them in initial reasoning. The 78.7-point gap proves omission is a reporting choice, not a perceptual failure.

Does chain of thought reasoning actually explain model decisions?

Reviewer scores for reasoning chains are weakly correlated with response quality in multi-LLM pipelines. Plausible-looking reasoning often precedes incorrect outputs, and chains reflect failures only in retrospect, making them poor explanations despite appearing coherent.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Can a model's partial response guide what to retrieve next?

ITER-RETGEN shows that iteratively using generated responses as retrieval queries substantially improves performance on multi-hop reasoning and fact verification. Generation acts as both answer producer and information-need clarifier, surfacing implicit gaps that the original query missed.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher re-testing claims about how models chain disjoint facts. The question remains: what is the internal signature when a model joins facts it never saw together?

What a curated library found — and when (dated claims, not current truth): Findings span 2021–2025, clustering heavily in 2025.
• Representational features alone cannot establish chaining — causally intervening on candidate circuits is necessary; models show perfect accuracy while internal organization remains brittle (2024–2025).
• Distilled reasoning models exhibit ~5 cycles per sample in hidden-state reasoning graphs vs. near-zero in base models; cyclicity correlates with multi-hop accuracy — latent iteration, not text, may carry composition (2025-06).
• Models solve reasoning tasks entirely in latent space (27M parameters, chain-of-thought fails); chaining may be invisible to prompting (2025-06).
• Models causally use multi-hop hints but verbalize them <20% of the time; 78.7-point perception-acknowledgment gap shows reporting choice, not perceptual failure (2024–2025).
• Fine-tuning severs causal ties between stated reasoning steps and outputs; chain-of-thought becomes post-hoc narration independent of accuracy (2024-11, 2025-08).
• Iterative retrieval exposes multi-hop structure: partial answers surface unasked-for information needs, suggesting externalized chaining loops (2024-09).

Anchor papers (verify; mind their dates):
• arXiv:2506.05744 — Topology of Reasoning (2025-06) — graph cyclicity as signature
• arXiv:2411.15382 — Fine-Tuning & Chain-of-Thought (2024-11) — fidelity degradation
• arXiv:2508.01191 — Data Distribution lens on CoT (2025-08) — mirages in reasoning
• arXiv:2409.12941 — Fact, Fetch, Reason (2024-09) — retrieval-as-chaining

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For cyclicity, latent-only solving, and the verbalization gap: do scaling, new architectures (SSMs, mixture-of-experts, diffusion-based reasoning), or recent inference harnesses (e.g., speculative decoding, tree search) either relax the requirement for hidden iteration or make it more visible? Does fine-tuning *always* break reasoning fidelity, or do new tuning regimes (e.g., DPO, constitutional methods) preserve causal structure? Separate durable question (what is the *shape* of internal chaining?) from perishable limitation (it must be cyclic; it must be hidden from CoT).
(2) **Surface strongest contradicting/superseding work from ~6 months.** Has recent work (Jan–Jun 2025) claim that models *do* internalize compositional rules cleanly, or that chaining is faster/simpler than cyclicity suggests?
(3) **Propose 2 research questions that ASSUME the regime may have moved:** (a) If latent chaining is now partly visible via improved SAEs or probing, what distinguishes faithful internal composition from performative narration? (b) Can iterative retrieval loops and cyclic hidden states be unified as one mechanistic family, or are they separate pathways?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

What is the mechanistic signature when models chain facts never presented together?

Sources 9 notes

Next inquiring lines