What circuit mechanisms produce belief bias in syllogistic reasoning?

This note explores what is actually happening inside the model when it is fooled by a believable-but-illogical conclusion: the specific circuitry that lets world knowledge override logical form. The clearest answer in the corpus is mechanistic. Syllogistic reasoning runs on a content-independent three-stage circuit: recitation (restating the premises), middle-term suppression (dropping the shared term that links them), and mediation (drawing the conclusion). The same machinery shows up across architectures, so it looks like a genuine reasoning algorithm rather than a memorized trick. Belief bias enters through a *separate* set of attention heads that encode world knowledge and quietly tilt the conclusion toward what is semantically plausible instead of what is logically valid. The unsettling part: this contamination gets *worse* at larger scales, so scaling up does not wash out the bias; it amplifies it (see "How do language models perform syllogistic reasoning internally?").
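The three stages and the competing knowledge signal can be caricatured in a few lines of Python. This is only a symbolic toy, not the neural circuit itself: the `Premise` class, the `plausibility` score, and the `knowledge_weight` knob are all hypothetical names introduced here for illustration, and the additive scoring rule is an assumption rather than a measured mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Premise:
    """Represents a statement of the form 'All <subject> are <predicate>'."""
    subject: str
    predicate: str


def logical_conclusion(p1, p2):
    """Toy three-stage pass over two 'All X are Y' premises."""
    # Stage 1, recitation: restate both premises.
    recited = [p1, p2]
    # Stage 2, middle-term suppression: look for a term that is the
    # predicate of one premise and the subject of the other, and drop it.
    for first, second in (recited, recited[::-1]):
        if first.predicate == second.subject:
            # Stage 3, mediation: link the two remaining terms.
            return Premise(first.subject, second.predicate)
    return None  # no usable middle term, so nothing follows


def accepts(candidate, p1, p2, plausibility, knowledge_weight):
    """Accept or reject a candidate conclusion.

    plausibility is a made-up score in [-1, 1] standing in for the
    world-knowledge heads; knowledge_weight is a hypothetical knob for
    how strongly that signal leaks into the decision.
    """
    logic_score = 1.0 if logical_conclusion(p1, p2) == candidate else -1.0
    return logic_score + knowledge_weight * plausibility > 0


# Invalid but believable: "All roses are plants. All flowers are plants.
# Therefore all roses are flowers."
p1 = Premise("roses", "plants")
p2 = Premise("flowers", "plants")
candidate = Premise("roses", "flowers")

weak_leak = accepts(candidate, p1, p2, plausibility=0.9, knowledge_weight=0.5)
strong_leak = accepts(candidate, p1, p2, plausibility=0.9, knowledge_weight=1.5)
```

With the small weight the invalid conclusion is rejected; with the large one it is accepted even though the logic pass returns nothing. That mirrors the claim above that a stronger knowledge signal can override an otherwise intact logic pathway, under the toy's assumptions.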

Zoom out from the circuit and the behavior matches humans almost eerily well. Models reproduce the human belief-bias signature item-by-item: the same syllogisms that trip people up trip up the model, at comparable error rates, and the same pattern recurs on natural-language inference and the Wason selection task. That isomorphism across three independent tasks is the behavioral shadow of the circuit story: content and logical form are not cleanly separable inside a transformer, they are entangled by architecture (see "Do language models show the same content effects humans do?"). A complementary probe makes the point even sharper. Strip the familiar semantics out of a reasoning problem while leaving the logical rules intact, and performance collapses. The model was never manipulating symbols; it was riding token associations and parametric commonsense (see "Do large language models reason symbolically or semantically?").
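Belief bias of this kind is conventionally measured by crossing validity with believability and comparing acceptance rates across the four cells. The sketch below shows one way to compute that from a list of trials; the function names and the toy data are assumptions made up for illustration, not figures from the studies summarized here.

```python
from collections import defaultdict


def acceptance_rates(trials):
    """Return the acceptance rate for each (valid, believable) cell.

    trials is an iterable of (valid, believable, accepted) booleans.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [accepted, total]
    for valid, believable, accepted in trials:
        cell = counts[(valid, believable)]
        cell[0] += int(accepted)
        cell[1] += 1
    return {key: acc / total for key, (acc, total) in counts.items()}


def belief_effect(rates):
    """Mean acceptance of believable minus unbelievable conclusions."""
    believable = (rates[(True, True)] + rates[(False, True)]) / 2
    unbelievable = (rates[(True, False)] + rates[(False, False)]) / 2
    return believable - unbelievable


def logic_effect(rates):
    """Mean acceptance of valid minus invalid conclusions."""
    valid = (rates[(True, True)] + rates[(True, False)]) / 2
    invalid = (rates[(False, True)] + rates[(False, False)]) / 2
    return valid - invalid


# Made-up trials, two per cell: (valid, believable, accepted)
toy_trials = [
    (True, True, True), (True, True, True),
    (True, False, True), (True, False, False),
    (False, True, True), (False, True, False),
    (False, False, False), (False, False, False),
]
toy_rates = acceptance_rates(toy_trials)
```

A belief effect well above zero means believability shifts acceptance independently of validity. Running the same computation on human and model responses is one simple way to compare the two signatures.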

Where does the bias come from in the first place? Not from instruction tuning. A causal experiment varying random seeds and cross-tuning found that models sharing a pretrained backbone carry similar bias patterns regardless of what finetuning data they saw: biases are planted during pretraining and only nudged afterward (see "Where do cognitive biases in language models come from?"). And it is not unique to syllogisms. The same kind of error, driven by training-data statistics, shows up in causal reasoning, where models reproduce human failures like weak explaining-away and Markov violations on collider networks (see "Do large language models make the same causal reasoning mistakes as humans?"). The common thread is that these are not bugs in a logic engine; they are the predictable output of a system whose 'reasoning' is statistical pattern completion.

That reframing is where it gets interesting for anyone hoping to fix it. If chain-of-thought were genuine inference, you could reason your way past belief bias. But the corpus argues CoT is constrained imitation of reasoning *form*, reproducing familiar schemata rather than performing novel symbolic manipulation, which is why it fails predictably under distribution shift (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" and "Why does chain-of-thought reasoning fail in predictable ways?"). There is a layered geometry to it too: knowledge tends to live in lower network layers and reasoning adjustments in higher ones, which is why pushing harder on reasoning can actively degrade knowledge-heavy behavior (see "Why does reasoning training help math but hurt medical tasks?"). One concrete lever does exist: training judges with RL to actually deliberate during evaluation measurably reduces their susceptibility to surface-feature biases (see "Can reasoning during evaluation reduce judgment bias in LLM judges?"). That suggests belief bias is suppressible through process, even if it is baked in at the source. The thing worth walking away with: the model has a real, transferable logic circuit *and* a world-knowledge circuit running in parallel, and belief bias is what you see when the second one wins.

Sources (9 notes)

How do language models perform syllogistic reasoning internally?

LLMs implement a content-independent three-stage reasoning mechanism—recitation, middle-term suppression, mediation—that works across architectures. However, additional attention heads encoding world knowledge systematically bias conclusions toward semantically plausible rather than logically valid answers, with contamination increasing at larger scales.

Do language models show the same content effects humans do?

LLMs show the same content-sensitivity patterns as humans on NLI, syllogisms, and Wason tasks, with belief-bias signatures matching human error rates item-by-item. This behavioral isomorphism across three independent tasks suggests content and logical form are architecturally inseparable in transformer reasoning.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Do large language models make the same causal reasoning mistakes as humans?

LLMs show weak explaining away and Markov violations in collider networks, matching human error patterns exactly. This suggests shared mechanisms rooted in training data statistics rather than categorical reasoning inferiority.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Why does reasoning training help math but hurt medical tasks?

A two-phase inference model shows that knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Can reasoning during evaluation reduce judgment bias in LLM judges?

Training judges with reinforcement learning to reason about evaluations—by converting judgment tasks into verifiable problems with synthetic data pairs—produces judges that think through their decisions rather than relying on exploitable surface features, directly mitigating authority, verbosity, position, and beauty bias.
