How can entailment benchmarks separate genuine reasoning from memorization effects?

This explores how to design entailment tests that catch a model genuinely inferring 'the premise supports the hypothesis' versus one just recognizing a hypothesis it saw during training — and what the corpus reveals about telling the two apart.

The corpus has a sharp, specific answer to the core mechanism, and a cluster of adjacent findings that explain *why* the separation is so hard.

The most direct piece is attestation bias (see "Do LLMs predict entailment based on what they memorized?"). McKenna et al. found that LLMs often predict 'entailed' based on whether the hypothesis statement looks familiar from training, regardless of whether the premise actually supports it. The clever diagnostic move is the *random premise* test: pair an attested hypothesis with an unrelated premise. A genuine reasoner should now say 'not entailed', but a memorizer keeps saying 'entailed' because it's responding to the proposition, not the logical relationship. That swap is the heart of any benchmark that wants to separate the two: break the premise–hypothesis link while keeping surface familiarity constant, and watch which models still take the bait.
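The random premise test can be sketched in a few lines. This is a minimal illustration, not McKenna et al.'s actual setup: the names `random_premise_pairs` and `entailed_rate`, the toy sentences, and both toy predictors are assumptions invented here. It also assumes that a premise borrowed from a different item is unrelated to the hypothesis, which a real benchmark would need to verify.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (premise, hypothesis)
Predictor = Callable[[str, str], bool]  # True means the model says 'entailed'


def random_premise_pairs(pairs: List[Pair], seed: int = 0) -> List[Pair]:
    """Keep each hypothesis but swap in a premise taken from a different item."""
    if len(pairs) < 2:
        raise ValueError("need at least two items to swap premises")
    rng = random.Random(seed)
    swapped = []
    for i, (_, hypothesis) in enumerate(pairs):
        j = rng.randrange(len(pairs) - 1)
        if j >= i:  # skip index i so an item never keeps its own premise
            j += 1
        swapped.append((pairs[j][0], hypothesis))
    return swapped


def entailed_rate(predict: Predictor, pairs: List[Pair]) -> float:
    """Fraction of pairs the predictor labels 'entailed'."""
    return sum(predict(p, h) for p, h in pairs) / len(pairs)


# Toy data and toy predictors, invented for illustration only.
PAIRS = [
    ("Ana bought a bakery in Lisbon.", "Ana owns a bakery."),
    ("The river flooded the valley.", "The valley was under water."),
    ("Kofi defeated Lena in the final.", "Kofi played Lena."),
]
ATTESTED = {h for _, h in PAIRS}  # pretend every hypothesis was seen in training
SUPPORTED = set(PAIRS)            # gold premise-hypothesis links


def memorizer(premise: str, hypothesis: str) -> bool:
    return hypothesis in ATTESTED  # ignores the premise entirely


def reasoner(premise: str, hypothesis: str) -> bool:
    return (premise, hypothesis) in SUPPORTED  # stand-in for real inference


swapped = random_premise_pairs(PAIRS)
print("memorizer:", entailed_rate(memorizer, PAIRS), entailed_rate(memorizer, swapped))
print("reasoner: ", entailed_rate(reasoner, PAIRS), entailed_rate(reasoner, swapped))
```

Both toy predictors look identical on the original pairs; only the swapped set separates them, which is the point of holding the hypothesis fixed while breaking the link.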

That trick generalizes into a broader principle the corpus keeps circling: if you want to expose memorization, push the test *off the training distribution*. Several notes show fluent-looking reasoning collapsing the moment you do this. Chain-of-thought degrades predictably under shifts in task, length, and format ("Does chain-of-thought reasoning actually generalize beyond training data?"), and that predictable collapse is itself the signature of imitation rather than capability ("Does chain-of-thought reasoning reveal genuine inference or pattern matching?"). Even simple input-length padding drops reasoning accuracy at lengths far below the context limit ("Does reasoning ability actually degrade with longer inputs?"). So a benchmark can separate reasoning from memorization not just by *what* it asks but by *how far* it perturbs each item from anything seen before.
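One way to make "how far" concrete is to sweep a perturbation strength and record accuracy at each step. The sketch below uses length padding as the perturbation. It is an assumption-laden illustration rather than the FLenQA protocol: it counts whitespace-separated words instead of tokens, always prepends the padding, and the names `pad_premise`, `accuracy_by_length`, and the toy predictor `short_sighted` are hypothetical.

```python
import itertools
from typing import Callable, Dict, List, Sequence, Tuple

Item = Tuple[str, str, bool]  # (premise, hypothesis, gold 'entailed' label)
Predictor = Callable[[str, str], bool]


def pad_premise(premise: str, filler: Sequence[str], target_words: int) -> str:
    """Prepend filler words until the premise has at least target_words words."""
    words = premise.split()
    needed = max(0, target_words - len(words))
    filler_words = " ".join(filler).split()
    if needed and not filler_words:
        raise ValueError("filler must contain at least one word")
    padding = list(itertools.islice(itertools.cycle(filler_words), needed))
    return " ".join(padding + words)


def accuracy_by_length(
    predict: Predictor,
    items: List[Item],
    filler: Sequence[str],
    lengths: Sequence[int],
) -> Dict[int, float]:
    """Accuracy of the predictor at each padded length."""
    results = {}
    for n in lengths:
        correct = sum(
            predict(pad_premise(p, filler, n), h) == gold for p, h, gold in items
        )
        results[n] = correct / len(items)
    return results


# Toy predictor (illustration only): it "reads" just the first five words.
def short_sighted(premise: str, hypothesis: str) -> bool:
    return hypothesis in premise.split()[:5]


ITEMS = [("the cat sat", "cat", True), ("the dog ran", "cat", False)]
FILLER = ["lorem ipsum dolor sit amet"]
curve = accuracy_by_length(short_sighted, ITEMS, FILLER, [3, 10])
print(curve)
```

A flat curve across lengths is what one would hope for from a length-robust reasoner; a curve that falls as padding grows points to reliance on surface position or familiarity.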

The unsettling counterpart is that the 'reasoning' signal itself turns out to be a weak proxy. Logically *invalid* CoT exemplars perform nearly as well as valid ones ("Does logical validity actually drive chain-of-thought gains?"), and systematically irrelevant reasoning traces train models about as well as correct ones ("Do reasoning traces need to be semantically correct?"). This means a benchmark cannot trust the *form* of a reasoning chain as evidence of inference: the chain can be nonsense and the answer still right. The uncomfortable implication for benchmark design is that scoring the visible reasoning trace measures the wrong thing; you have to score behavior under adversarial premise swaps instead.
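A behavior-only scorer might look like the sketch below. It is a hypothetical design, not a metric taken from the cited notes: `Output` and `paired_swap_score` are invented names, and the scorer assumes every original item is gold-entailed while every swapped counterpart is gold not-entailed. The trace is carried along only to show that it plays no part in the score.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Output:
    trace: str      # visible reasoning text; deliberately ignored by the scorer
    entailed: bool  # the model's final label


def paired_swap_score(original: List[Output], swapped: List[Output]) -> float:
    """Fraction of items answered 'entailed' on the original premise
    and 'not entailed' once the premise is swapped."""
    if len(original) != len(swapped):
        raise ValueError("original and swapped outputs must be aligned")
    if not original:
        raise ValueError("need at least one item")
    hits = sum(1 for o, s in zip(original, swapped) if o.entailed and not s.entailed)
    return hits / len(original)


# Toy outputs, invented for illustration.
original = [
    Output(trace="Buying implies owning.", entailed=True),
    Output(trace="Flooding implies water.", entailed=True),
]
swapped = [
    Output(trace="A fluent but irrelevant chain.", entailed=True),   # took the bait
    Output(trace="The premise says nothing about this.", entailed=False),
]
print(paired_swap_score(original, swapped))
```

Scoring the pair rather than each answer alone means a model that always says 'entailed' earns nothing, however plausible its traces read.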

A less obvious question is *where* in the data the two abilities live. Procedural-knowledge analysis shows that genuine reasoning draws on broad, transferable procedures scattered across many documents, while factual recall depends on narrow, document-specific memorization of the exact target fact ("Does procedural knowledge drive reasoning more than factual retrieval?"). That gives an entailment benchmark a deeper design lever than 'novel sentences': construct items whose hypotheses are attested as *facts* somewhere in training but whose *entailment* can only be resolved procedurally. A memorizer rides the attested fact; a reasoner has to run the procedure, and the gap between them is exactly the thing you're trying to measure. The token-level work on where memorization errors originate ("Where do memorization errors arise in chain-of-thought reasoning?") suggests this gap can even be instrumented down to which tokens get copied from nearby context versus genuinely computed.
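The design lever described here amounts to crossing two labels, attested or not and entailed or not, and reporting accuracy per cell. The sketch below assumes the benchmark builder can supply an `attested` flag for each hypothesis; how that flag is obtained is outside this example, and `Item`, `accuracy_by_cell`, and the toy data are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple, Tuple


class Item(NamedTuple):
    premise: str
    hypothesis: str
    attested: bool  # assumed flag: hypothesis appears as a fact in training data
    entailed: bool  # gold label for this premise-hypothesis pair


def accuracy_by_cell(
    predict: Callable[[str, str], bool], items: List[Item]
) -> Dict[Tuple[bool, bool], float]:
    """Accuracy for each (attested, entailed) cell of the 2x2 design."""
    correct: Dict[Tuple[bool, bool], int] = defaultdict(int)
    total: Dict[Tuple[bool, bool], int] = defaultdict(int)
    for it in items:
        cell = (it.attested, it.entailed)
        total[cell] += 1
        correct[cell] += int(predict(it.premise, it.hypothesis) == it.entailed)
    return {cell: correct[cell] / total[cell] for cell in total}


# Toy items and a toy memorizer, invented for illustration.
ITEMS = [
    Item("Ana bought a bakery.", "Ana owns a bakery.", attested=True, entailed=True),
    Item("The river froze.", "Ana owns a bakery.", attested=True, entailed=False),
    Item("Zed sold a zorb.", "Zed had a zorb.", attested=False, entailed=True),
    Item("The river froze.", "Zed had a zorb.", attested=False, entailed=False),
]
SEEN = {it.hypothesis for it in ITEMS if it.attested}


def memorizer(premise: str, hypothesis: str) -> bool:
    return hypothesis in SEEN  # answers from familiarity alone


cells = accuracy_by_cell(memorizer, ITEMS)
for cell, acc in sorted(cells.items()):
    print(cell, acc)
```

The two off-diagonal cells, attested but not entailed and unattested but entailed, are where a familiarity-driven model and a procedure-driven model should diverge most.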

Sources (8 notes)

Do LLMs predict entailment based on what they memorized?

McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, indicating that they respond to memorized propositions rather than premise-hypothesis relationships.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Where do memorization errors arise in chain-of-thought reasoning?

The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization, which is based on preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
