Why does document perplexity stay low while question-answering accuracy drops?
This explores a gap that puzzles a lot of people: a model can read a document with high fluency (low perplexity means it predicts the next word easily) yet still get questions about that document wrong — so what does perplexity actually measure, and why doesn't fluency buy comprehension?
The short version the corpus keeps circling back to: perplexity measures how predictable the text is to the model, token by token, which tells you whether the text *looks* like fluent language, not whether the model has integrated and reasoned over its content. They're different skills, and the research increasingly shows they come apart cleanly.
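To make "what perplexity actually measures" concrete, here is a minimal sketch using the standard definition (the exponential of the mean negative log-likelihood per token). The per-token log-probabilities are made-up illustrative numbers, not output from any real model.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_logprobs: natural-log probabilities the model assigned to each
    token of the document, given the tokens before it.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical values: the model gave every one of 8 tokens probability 0.5.
doc_logprobs = [math.log(0.5)] * 8
doc_ppl = perplexity(doc_logprobs)  # approximately 2.0
```

Note what the function never sees: a question, an answer, or any check that the content was used. It averages over tokens of the document and nothing else.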
The sharpest evidence is that reasoning accuracy decays with input length while language-modeling quality stays flat. One study, FLenQA, padded inputs to just 3,000 tokens, far below the context window, and watched accuracy fall from 92% to 68%, a 24-point drop. It found the drop was "uncorrelated with language modeling performance" and that it survived even chain-of-thought prompting (see "Does reasoning ability actually degrade with longer inputs?"). So the model still predicts each token fine; what degrades is its ability to *use* the buried information. Perplexity simply isn't watching the thing that broke.
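A length-padding experiment of this kind can be sketched roughly as follows. This is not the cited study's code: `build_padded_input`, `accuracy_by_length`, and the `answer_fn` callable (which stands in for whatever model you query) are hypothetical names, and length is counted in whitespace-separated words rather than tokens to keep the sketch dependency-free.

```python
from itertools import cycle

def build_padded_input(key_facts, filler_sentences, target_words):
    """Surround the key facts with irrelevant filler, up to about target_words words."""
    filler = [s for s in filler_sentences if s.split()]  # drop empty sentences
    padding = []
    n_words = sum(len(f.split()) for f in key_facts)
    for sentence in cycle(filler):
        if n_words + len(sentence.split()) > target_words:
            break
        padding.append(sentence)
        n_words += len(sentence.split())
    half = len(padding) // 2  # place the facts in the middle of the filler
    return " ".join(padding[:half] + list(key_facts) + padding[half:])

def accuracy_by_length(examples, answer_fn, filler_sentences, lengths):
    """examples: list of (key_facts, question, gold_answer) tuples.

    answer_fn(context, question) -> str is an assumed stand-in for a model call.
    Returns {target_length: accuracy}.
    """
    results = {}
    for target in lengths:
        correct = 0
        for key_facts, question, gold in examples:
            context = build_padded_input(key_facts, filler_sentences, target)
            correct += answer_fn(context, question) == gold
        results[target] = correct / len(examples)
    return results
```

The design point is that the task stays identical across lengths; only the amount of irrelevant text changes. Scoring the same padded inputs for perplexity alongside accuracy is what would let you see the two curves diverge.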
A second mechanism: even when the document is right there, the model often answers from what it learned in training rather than from what it's reading. When parametric priors are strong, in-context information gets overridden: the model generates outputs inconsistent with its own context, and textual prompting alone can't fix it (see "Why do language models ignore information in their context?"). Perplexity over the document stays low because predicting the document's words is easy; the failure is that the model never let those words change its answer.
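One common way to probe this is a counterfactual-context check: give the model a document that contradicts a well-known fact and see which answer comes back. The sketch below is an illustration under assumptions, not the method of the cited note; `answer_fn` and the probe fields are hypothetical, and exact string matching is a deliberately crude scorer.

```python
def classify_answer(answer, context_answer, prior_answer):
    """Label an answer as following the context, the training prior, or neither."""
    normalized = answer.strip().lower()
    if normalized == context_answer.strip().lower():
        return "context"
    if normalized == prior_answer.strip().lower():
        return "prior"
    return "other"

def context_faithfulness(probes, answer_fn):
    """probes: dicts with keys context, question, context_answer, prior_answer.

    answer_fn(context, question) -> str is an assumed stand-in for a model call.
    Returns the share of answers in each category.
    """
    if not probes:
        raise ValueError("need at least one probe")
    counts = {"context": 0, "prior": 0, "other": 0}
    for probe in probes:
        answer = answer_fn(probe["context"], probe["question"])
        label = classify_answer(answer, probe["context_answer"], probe["prior_answer"])
        counts[label] += 1
    return {label: n / len(probes) for label, n in counts.items()}

# A made-up probe: the document contradicts what the model likely memorized.
probe = {
    "context": "In this story, the capital of France is Lyon.",
    "question": "According to the story, what is the capital of France?",
    "context_answer": "Lyon",
    "prior_answer": "Paris",
}
```

A high "prior" share on probes like this would indicate the override described above, and it can coexist with low perplexity on the very same contexts.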
There's also a quieter version of the same gap on the training side. Supervised fine-tuning can lift benchmark accuracy while cutting Information Gain, a measure of reasoning-step quality, by 38.9%. Models reach correct answers through post-hoc rationalization, and standard metrics miss it because they score only the final answer, not the inferential path (see "Does supervised fine-tuning improve reasoning or just answers?"). The throughline with your question: surface-level metrics (perplexity, final-answer accuracy) and the underlying competence they're proxies for can drift apart in either direction.
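A toy comparison shows how a final-answer metric can hide weak reasoning. This is not the Information Gain metric from the cited note; it assumes you already have a per-step validity judgment for each reasoning trace (the `step_valid` field is hypothetical), and all the data is invented for illustration.

```python
def final_answer_accuracy(traces):
    """Share of traces whose final answer matches the gold answer."""
    return sum(t["answer"] == t["gold"] for t in traces) / len(traces)

def sound_and_correct_rate(traces):
    """Share of traces that are correct AND whose every step was judged valid."""
    return sum(
        t["answer"] == t["gold"] and all(t["step_valid"]) for t in traces
    ) / len(traces)

# Invented traces: three right answers, but only one reached by valid steps.
traces = [
    {"answer": "12", "gold": "12", "step_valid": [True, True]},
    {"answer": "7", "gold": "7", "step_valid": [True, False]},
    {"answer": "3", "gold": "4", "step_valid": [True, True]},
    {"answer": "9", "gold": "9", "step_valid": [False, False]},
]

accuracy = final_answer_accuracy(traces)       # 0.75
sound_rate = sound_and_correct_rate(traces)    # 0.25
```

The gap between the two numbers is the part a final-answer benchmark cannot see.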
What you didn't ask, but is worth knowing: when these models fail, it's often not "reasoning" that breaks at all. Failures cluster at unfamiliar *instances* rather than hard *tasks*. Models fit instance-level patterns, so a chain succeeds or fails on how close it sits to training data, not on difficulty (see "Do language models fail at reasoning due to complexity or novelty?"). And some apparent reasoning collapses are really execution limits: the model knows the algorithm but can't carry it out across many text-only steps (see "Are reasoning model collapses really failures of reasoning?"). All of this is invisible to perplexity, which is exactly why a fluently read document can still produce wrong answers.
Sources (5 notes)
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.