Language Understanding and Pragmatics · LLM Reasoning and Architecture

Do harder reasoning tasks trigger more semantic bias?

Does the difficulty of a logical task determine how much semantic content influences reasoning? This matters because it reveals whether we can isolate 'pure' logical reasoning in benchmarks.

Note · 2026-05-02 · sourced from Linguistics, NLP, NLU

Lampinen et al. observe a difficulty-modulation pattern: content effects are weakest on NLI (a relatively simple inference task), stronger on syllogism validity judgment, and strongest on the Wason selection task, the hardest of the three — even mathematics undergraduates and academic mathematicians score below 50% on its abstract version. The directional claim is clean: as a task's logical demands exceed available working-memory or circuit capacity, the system falls back on semantic priors. Humans and LMs both show this fallback, in the same direction, along the same difficulty axis.
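One way to make the pattern concrete (a hypothetical sketch, not the paper's analysis code; all accuracies are made up) is to define the content effect as the accuracy gap between items whose correct answer agrees with prior beliefs and items where it conflicts. Under difficulty modulation, that gap should widen from NLI to syllogisms to Wason:

```python
# Content effect = accuracy gap attributable to semantic content,
# i.e. how much belief-consistent items outscore belief-inconsistent ones.

def content_effect(consistent_acc: float, inconsistent_acc: float) -> float:
    """Accuracy gap driven by believability rather than logical form."""
    return consistent_acc - inconsistent_acc

# Made-up numbers in rough qualitative agreement with the pattern
# (easy task -> small gap, hard task -> large gap); not the paper's data.
tasks = {
    "NLI":       content_effect(0.92, 0.88),  # small gap
    "syllogism": content_effect(0.85, 0.62),  # larger gap
    "Wason":     content_effect(0.70, 0.25),  # largest gap
}

for name, gap in tasks.items():
    print(f"{name:10s} content effect = {gap:.2f}")
```

The metric is deliberately crude — a single subtraction per task — but it is enough to plot the difficulty axis the note describes.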

The pattern explains a recurring frustration with reasoning benchmarks. Benchmarks designed to test "purely logical" reasoning still show heavy content sensitivity, and benchmark designers often treat this as a confound to be controlled. The Lampinen finding suggests it cannot be: content sensitivity is most pronounced exactly where the benchmark is most demanding. A reasoning benchmark whose items vary in content believability is therefore partly a believability test, not a logic test, and the harder the items, the more this is true.
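One practical response for benchmark designers (a sketch under assumed item metadata; the field names and items below are invented) is to report accuracy per validity × believability cell rather than as one aggregate number, so that belief-driven errors become visible instead of averaging out:

```python
from collections import defaultdict

# Hypothetical items: each tagged with ground-truth validity and
# whether its conclusion sounds believable. Field names are assumptions.
items = [
    {"valid": True,  "believable": True,  "model_says_valid": True},
    {"valid": True,  "believable": False, "model_says_valid": False},
    {"valid": False, "believable": True,  "model_says_valid": True},
    {"valid": False, "believable": False, "model_says_valid": False},
    {"valid": True,  "believable": False, "model_says_valid": True},
]

# Accuracy per (validity, believability) cell; a single aggregate
# score would hide that errors cluster in belief-inconsistent cells.
cells = defaultdict(lambda: [0, 0])  # cell -> [correct, total]
for it in items:
    cell = (it["valid"], it["believable"])
    cells[cell][0] += it["model_says_valid"] == it["valid"]
    cells[cell][1] += 1

for (valid, believable), (correct, total) in sorted(cells.items()):
    print(f"valid={valid!s:5} believable={believable!s:5} acc={correct/total:.2f}")
```

In this toy run, the errors fall in the belief-inconsistent cells (valid-but-unbelievable, invalid-but-believable), which is exactly the signature the aggregate score would mask.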

The connection to Why do LLMs fail at simple deductive reasoning? is partial but illuminating. That note shows LMs and humans diverge on certain reasoning surfaces — long multi-hop versus simple deduction. Lampinen shows they converge on the difficulty-modulation pattern itself, even where their absolute capabilities differ. Both observations can be true: humans and LMs occupy different absolute positions on a difficulty curve, but both slide toward semantic-fallback as difficulty rises.

For False Punditry, the connection is straightforward and uncomfortable. Pundits and LLMs both reach for plausible-sounding content when underlying logic is hard, by the same failure-mode mechanism. The pundit who confidently restates a familiar belief when asked a hard question, and the LLM that confabulates a believable answer when the logic exceeds its circuits, are not analogically similar — they are mechanistically similar. Both are systems whose reasoning capacity has been exceeded and which fall back on a semantic prior that sounds right. Recognizing this similarity is more diagnostically useful than insisting on the difference.


Source: Linguistics, NLP, NLU
Paper: Language models show human-like content effects on reasoning tasks

content effects scale with task difficulty — the harder the abstract task the more semantic content takes over from logical form, in humans and LMs