Why does cross-text analogical reasoning fail when semantics decouple from symbols?
This explores why LLMs stumble at reasoning by analogy across different texts once you strip away the familiar meanings — when the words stop carrying the model's usual associations, the underlying logic doesn't carry over.
The corpus has a sharp answer: these models reason through semantic association, not symbolic logic. The clearest statement comes from work showing that when the logical rules stay intact but the familiar meaning is swapped out, performance collapses, even with the correct rules sitting right there in the prompt (Do large language models reason symbolically or semantically?). The model was never manipulating symbols; it was leaning on parametric commonsense and token associations. Analogy across texts requires mapping abstract structure from one case onto another, and if the machinery is associative rather than structural, that mapping has nothing to grab onto once the meaning changes.
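To make "keep the logical rules intact but swap out the familiar meaning" concrete, here is a minimal sketch of how such a decoupled condition could be constructed. It is an illustration under assumptions, not the cited work's actual setup: the rule format, the example facts, and the helper names `forward_chain` and `decouple` are all hypothetical. The sketch only shows that renaming every predicate and entity to an arbitrary symbol leaves the set of valid conclusions structurally unchanged; it does not query any model.

```python
def forward_chain(facts, rules):
    """Apply unary rules (premise predicates -> conclusion predicate) to a fixpoint.

    facts: set of (predicate, subject) pairs.
    rules: list of (tuple_of_premise_predicates, conclusion_predicate).
    """
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        subjects = {subject for _, subject in derived}
        for premises, conclusion in rules:
            for subject in subjects:
                holds = all((p, subject) in derived for p in premises)
                if holds and (conclusion, subject) not in derived:
                    derived.add((conclusion, subject))
                    changed = True
    return derived


def decouple(facts, rules):
    """Rename every predicate and entity to an arbitrary symbol (hypothetical helper).

    The logical structure is preserved; only the familiar names are removed.
    Returns the renamed facts, renamed rules, and the name -> symbol mapping.
    """
    mapping = {}

    def symbol(kind, name):
        key = (kind, name)
        if key not in mapping:
            prefix = "P" if kind == "pred" else "e"
            mapping[key] = f"{prefix}{len(mapping)}"
        return mapping[key]

    new_rules = [
        (tuple(symbol("pred", p) for p in premises), symbol("pred", conclusion))
        for premises, conclusion in rules
    ]
    new_facts = {(symbol("pred", p), symbol("ent", s)) for p, s in facts}
    return new_facts, new_rules, mapping


# Illustrative (made-up) rule base with familiar semantics.
facts = {("bird", "tweety")}
rules = [(("bird",), "has_wings"), (("has_wings",), "can_fly")]

sym_facts, sym_rules, mapping = decouple(facts, rules)
natural = forward_chain(facts, rules)
abstract = forward_chain(sym_facts, sym_rules)

# Translating the natural-language conclusions through the mapping
# yields exactly the conclusions derived in the symbol-only version.
translated = {(mapping[("pred", p)], mapping[("ent", s)]) for p, s in natural}
print(translated == abstract)  # prints True
```

A purely symbolic reasoner would score identically on both versions, since the derivations are isomorphic. The collapse described above is the gap between a model's accuracy on the familiar version and on the renamed one.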
The corpus suggests this is one face of a deeper pattern: what looks like reasoning is often constrained imitation of reasoning's *form*. Chain-of-thought (CoT), the canonical reasoning method, turns out to reproduce familiar patterns from training rather than perform novel inference. That is why it degrades predictably under distribution shift (Does chain-of-thought reasoning reveal genuine inference or pattern matching?), and why invalid prompts work nearly as well as valid ones and format matters far more than logical content (What makes chain-of-thought reasoning actually work?). If the reasoning trace is scaffolding rather than meaning, as suggested by the finding that systematically irrelevant traces train models about as well as correct ones (Do reasoning traces need to be semantically correct?), then there's no symbolic substrate to transfer when the semantics decouple. The cross-text analogy fails because there was never abstract inference to port.
There's a striking lateral angle here on *why* the associative shortcut dominates: models systematically prefer high-frequency surface forms over rare-but-equivalent paraphrases, tracking statistical mass from pretraining rather than meaning (Do language models really understand meaning or just surface frequency?). Decoupling semantics from symbols essentially forces the model onto low-frequency, low-association terrain, exactly where its primary mechanism has nothing to ride on. Relatedly, reasoning failures track *instance unfamiliarity* rather than task complexity: a chain succeeds if the model saw similar instances in training, regardless of difficulty (Do language models fail at reasoning due to complexity or novelty?). An unfamiliar, semantics-stripped analogy is close to the definition of an out-of-distribution instance.
A useful tension in the collection: not everyone reads these breakdowns as reasoning failures. One line of work argues that the collapses are really *execution* limits: models that know the algorithm can't run it across many steps in text alone, and tool access pushes them past the supposed cliff (Are reasoning model collapses really failures of reasoning?). Separately, local memorization driven by the immediately preceding tokens accounts for up to 67% of CoT errors, and its role grows with complexity and distributional shift (Where do memorization errors arise in chain-of-thought reasoning?). Read together, these don't contradict the semantic story so much as locate it: the model leans on nearby learned associations to carry each step, so when meaning is decoupled the local crutch vanishes and errors avalanche.
A less obvious implication: the same evidence that explains this failure also hints that the verbalized reasoning may be partly a performance. Models can scale reasoning in latent space without producing visible thinking tokens at all (Can models reason without generating visible thinking tokens?), and minimal chains match verbose ones at under 8% of the tokens (Can minimal reasoning chains match full explanations?). So the question of cross-text analogy isn't only "can the model abstract?" but whether the symbolic-looking trace we read was ever where the reasoning lived. If it wasn't, decoupling semantics doesn't break a symbolic engine; it just removes the associative camouflage that made the engine look symbolic in the first place.
Sources (10 notes)
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain succeeds if the model was trained on similar instances, regardless of length.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization, which is based on preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.