Can language models perform genuine symbolic reasoning without semantic grounding?

This explores whether LLMs can manipulate symbols by formal rules alone — the way a logic engine does — or whether they're really leaning on the meanings of words, so that stripping away meaning collapses the reasoning.

This question asks whether an LLM can do real symbolic reasoning, pushing symbols around by formal rules, without depending on what those symbols mean. The most direct answer in the corpus is discouraging: when researchers decouple semantic content from the reasoning task, model performance collapses even when the correct rules are sitting right there in the prompt (Do large language models reason symbolically or semantically?). Models lean on commonsense token associations baked in from training, not on formal manipulation. So the headline finding is that LLMs are semantic reasoners wearing a symbolic costume.
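To see what decoupling means in practice, consider a minimal sketch (not the cited study's protocol). A tiny forward-chaining engine applies one rule to two facts, first with meaningful names and then with every name swapped for an arbitrary token. The rule, the facts, the mapping, and the helper names `forward_chain` and `rename` are all illustrative assumptions.

```python
from itertools import product

# Facts are (relation, subject, object) triples.
# A rule is (premises, conclusion); variables are written as "?x".

def is_var(term):
    return term.startswith("?")

def unify(pattern, fact, bindings):
    """Match one pattern against one fact; return extended bindings or None."""
    new = dict(bindings)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if new.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return new

def substitute(pattern, bindings):
    return tuple(bindings.get(t, t) for t in pattern)

def forward_chain(facts, rules):
    """Apply every rule repeatedly until no new fact is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # list(known) is a snapshot, so adding to `known` below is safe.
            for combo in product(list(known), repeat=len(premises)):
                bindings = {}
                for pattern, fact in zip(premises, combo):
                    bindings = unify(pattern, fact, bindings)
                    if bindings is None:
                        break
                if bindings is None:
                    continue
                derived = substitute(conclusion, bindings)
                if derived not in known:
                    known.add(derived)
                    changed = True
    return known

def rename(item, mapping):
    """Swap meaningful symbols for arbitrary tokens; variables are untouched."""
    return tuple(mapping.get(t, t) for t in item)

# Hypothetical example: one rule, two facts.
semantic_facts = {("parent", "alice", "bob"), ("parent", "bob", "carol")}
semantic_rules = [
    ((("parent", "?x", "?y"), ("parent", "?y", "?z")), ("grandparent", "?x", "?z")),
]

mapping = {"parent": "r1", "grandparent": "r2", "alice": "a", "bob": "b", "carol": "c"}
symbolic_facts = {rename(f, mapping) for f in semantic_facts}
symbolic_rules = [
    (tuple(rename(p, mapping) for p in premises), rename(conclusion, mapping))
    for premises, conclusion in semantic_rules
]

semantic_result = forward_chain(semantic_facts, semantic_rules)
symbolic_result = forward_chain(symbolic_facts, symbolic_rules)

# The two derivations are identical up to renaming.
print({rename(f, mapping) for f in semantic_result} == symbolic_result)
```

A formal engine is indifferent to the renaming. The finding above is that LLM accuracy is not.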

Look one layer deeper and the picture gets more interesting, because semantics doesn't just help; it actively corrupts. When LLMs run syllogisms, they use a content-independent three-stage circuit (recite, suppress the middle term, mediate) that genuinely works across architectures, a real symbolic-ish mechanism. But parallel attention heads carrying world knowledge bias the conclusion toward what's *plausible* rather than what's *valid*, and this contamination gets worse at larger scale (How do language models perform syllogistic reasoning internally?). So grounding isn't a clean scaffold the model could shed; it bleeds into the logic and overrides it.
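The gap between *valid* and *plausible* can be made precise with a brute-force validity check. The sketch below is an illustration, not the analysis used in the cited work: it tests a syllogistic form against every assignment of sets over a three-element universe, which is assumed here to be enough to expose a counterexample for two-premise forms. The function names and the two example forms are hypothetical.

```python
from itertools import combinations, product

UNIVERSE = (0, 1, 2)  # assumption: three elements suffice for two-premise forms

def subsets(items):
    """All subsets of items, as frozensets."""
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

def holds(statement, sets):
    """Evaluate one categorical statement such as ("all", "M", "P")."""
    quantifier, a, b = statement
    A, B = sets[a], sets[b]
    if quantifier == "all":
        return A <= B
    if quantifier == "no":
        return not (A & B)
    if quantifier == "some":
        return bool(A & B)
    if quantifier == "some_not":
        return bool(A - B)
    raise ValueError(quantifier)

def is_valid(premises, conclusion, terms=("S", "M", "P")):
    """Valid: no assignment makes every premise true and the conclusion false."""
    for assignment in product(subsets(UNIVERSE), repeat=len(terms)):
        sets = dict(zip(terms, assignment))
        if all(holds(p, sets) for p in premises) and not holds(conclusion, sets):
            return False
    return True

# "All mammals fly; all whales are mammals; so all whales fly."
# Implausible conclusion, valid form.
barbara = is_valid([("all", "M", "P"), ("all", "S", "M")], ("all", "S", "P"))

# "All flowers need water; all roses need water; so all roses are flowers."
# Plausible conclusion, invalid form (undistributed middle).
undistributed_middle = is_valid([("all", "M", "P"), ("all", "S", "P")], ("all", "S", "M"))

print(barbara, undistributed_middle)
```

The checker never sees the words, only the form, so it accepts the whale argument and rejects the rose argument. A model biased toward plausibility would tend to do the opposite.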

A cluster of work argues that even the visible reasoning is partly theater. Chain-of-thought turns out to be constrained imitation of reasoning *form*, reproducing familiar schemata from training, and it degrades predictably under distribution shift, the signature of pattern-matching rather than genuine inference (Does chain-of-thought reasoning reveal genuine inference or pattern matching?). Reasoning traces themselves are persuasive appearances: invalid logical steps perform almost as well as valid ones, and corrupting a trace barely hurts, which means semantic correctness of the steps isn't what's producing the gains (Do reasoning traces show how models actually think?). If broken symbolic steps work as well as sound ones, the symbols aren't doing the load-bearing work.

But here's the twist worth knowing: some failures that look like reasoning failures are actually *execution* failures. When models hit the supposed reasoning cliff, giving them tools lets them solve the problem: they knew the algorithm, they just couldn't run many steps reliably in text (Are reasoning model collapses really failures of reasoning?). And reasoning accuracy craters as inputs grow longer, at lengths far below the context limit and in a way uncorrelated with language-modeling skill (Does reasoning ability actually degrade with longer inputs?). This complicates the verdict: maybe the symbolic competence is partially there but throttled by the medium of token-by-token generation.
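The execution point is easiest to see with a puzzle whose algorithm is short but whose trace is long. Tower of Hanoi is used here purely as an illustration; the source does not name the task, and the function below is a standard textbook recursion rather than anything taken from the cited work.

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Yield the moves that transfer n disks from source to target."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)
    yield (source, target)
    yield from hanoi(n - 1, spare, target, source)

# The algorithm stays the same size; the number of moves is 2**n - 1.
for n in (3, 10, 15):
    moves = sum(1 for _ in hanoi(n))
    print(n, moves)
```

Knowing the recursion is cheap. Writing out every move as text, without a single slip, is the part that grows exponentially, which is the kind of work a tool call can absorb.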

That last thread points to where the field is trying to escape the trap: by moving reasoning *off* the surface tokens entirely. Latent-reasoning architectures iterate in hidden state without verbalizing steps, suggesting words are a training artifact rather than a requirement for computation (Can models reason without generating visible thinking tokens?). Probing shows that transformers trained with hidden chain-of-thought tokens compute answers in early layers and then overwrite them with format-compliant filler (Do transformers hide reasoning before producing filler tokens?), and pruning reveals that models internally rank symbolic-computation tokens as most important, preserving them while discarding grammar and filler (Which tokens in reasoning chains actually matter most?). Meta's Large Concept Model goes further, reasoning over language-agnostic sentence embeddings before decoding to any language (Can reasoning happen at the sentence level instead of tokens?). The unsettling synthesis: the corpus says LLMs can't currently do ungrounded symbolic reasoning (semantics is both their crutch and their contaminant), yet the most promising research direction is precisely to pull reasoning into an abstract, less-grounded latent space, which is a quiet bet that genuine symbolic computation might be buildable after all.

Sources (10 notes)

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

How do language models perform syllogistic reasoning internally?

LLMs implement a content-independent three-stage reasoning mechanism—recitation, middle-term suppression, mediation—that works across architectures. However, additional attention heads encoding world knowledge systematically bias conclusions toward semantically plausible rather than logically valid answers, with contamination increasing at larger scales.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Do reasoning traces show how models actually think?

LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
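A padding experiment of this kind can be sketched as follows. This is not FLenQA's actual construction: the facts, the filler sentences, and the `build_padded_prompt` helper are hypothetical, and word count stands in for token count as a rough assumption.

```python
import random

def build_padded_prompt(facts, question, filler_sentences, target_words, seed=0):
    """Scatter fixed facts among filler until padding reaches about target_words words."""
    rng = random.Random(seed)
    padding, count = [], 0
    while count < target_words:
        sentence = rng.choice(filler_sentences)
        padding.append(sentence)
        count += len(sentence.split())
    # Insert each fact at a random position; the facts themselves never change.
    for fact in facts:
        padding.insert(rng.randrange(len(padding) + 1), fact)
    return " ".join(padding) + "\n\nQuestion: " + question

# Hypothetical task: the reasoning needed is identical at every length.
facts = ["Mara is taller than Jo.", "Jo is taller than Lin."]
question = "Is Mara taller than Lin?"
filler = ["The harbor was quiet that morning.", "A train left the station on time."]

prompts = {n: build_padded_prompt(facts, question, filler, n) for n in (0, 300, 3000)}
for n, prompt in prompts.items():
    print(n, len(prompt.split()))
```

Because the facts and the question are held constant, any drop in accuracy across the three prompts would be attributable to length alone.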

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
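The logit lens itself is a simple projection: take a hidden state from an intermediate layer and multiply it by the output (unembedding) matrix to see which tokens it favors. The toy below shows only the mechanics. The vocabulary, the matrix, and both hidden states are hand-picked numbers, not measurements, and the final layer norm that a real model applies is omitted.

```python
def logit_lens(hidden, unembed):
    """Project a hidden vector onto vocabulary logits (layer norm omitted)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembed]

def ranked_tokens(hidden, unembed, vocab):
    """Return the vocabulary sorted from highest to lowest logit."""
    logits = logit_lens(hidden, unembed)
    order = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
    return [vocab[i] for i in order]

vocab = ["7", ".", "the"]
# One unembedding row per vocabulary entry; values are illustrative only.
unembed = [
    [1.0, 0.0],    # "7"   (stands in for the answer token)
    [0.0, 1.0],    # "."   (stands in for the filler token)
    [-1.0, -1.0],  # "the"
]
# Hypothetical hidden states at an early layer and at the final layer.
early_layer = [2.0, 0.5]
final_layer = [1.0, 3.0]

early_ranking = ranked_tokens(early_layer, unembed, vocab)
final_ranking = ranked_tokens(final_layer, unembed, vocab)
print(early_ranking)
print(final_ranking)
```

In this toy, the answer token ranks first at the early layer and second at the final layer, behind the filler token. That is the shape of the reported result: the answer is suppressed but still recoverable from a lower-ranked prediction.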

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
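The greedy loop behind this kind of pruning can be sketched in a few lines. In the cited work the score would come from a model's likelihood; here `toy_score` is a hand-written stand-in that hard-codes a preference for symbolic tokens, so the example shows the mechanics of the loop and not the empirical finding. All names are hypothetical.

```python
def greedy_prune(tokens, score, budget):
    """Remove up to `budget` tokens, each time dropping the one whose removal hurts score least."""
    kept = list(tokens)
    removed = []
    for _ in range(min(budget, len(kept))):
        best_index = max(
            range(len(kept)),
            key=lambda i: score(kept[:i] + kept[i + 1:]),
        )
        removed.append(kept.pop(best_index))
    return kept, removed

def is_symbolic(token):
    return token.isdigit() or token in {"+", "-", "*", "/", "="}

def toy_score(tokens):
    """Stand-in for a model likelihood: symbolic tokens count 10, everything else 1."""
    return sum(10 if is_symbolic(t) else 1 for t in tokens)

chain = "so we compute 3 * 4 = 12 then".split()
kept, removed = greedy_prune(chain, toy_score, budget=4)
print(kept)
print(removed)
```

With a real likelihood in place of `toy_score`, which tokens survive becomes an empirical question, and the note above reports that symbolic-computation tokens are the ones preferentially preserved.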

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
