Why might rationales that predict common text patterns fail on hard novel reasoning?

This explores why teaching a model to generate reasoning by predicting likely-looking text can break down precisely when a problem is unfamiliar — the case where real reasoning matters most.

The rationales in question are trained to predict common text patterns, the way models like Quiet-STaR learn to reason as a byproduct of better next-token prediction (Can models learn reasoning from predicting any text?). The short version: predicting familiar text and reasoning through something new are not the same skill, and the notes collected here keep showing where the two come apart.
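To make "judged by predictive accuracy" concrete, here is a minimal sketch of a reward that scores a rationale by how much more likely it makes the text that actually follows. It is a simplification for illustration, not the Quiet-STaR training code; the function name `rationale_reward` and the toy probabilities are assumptions.

```python
import math

def rationale_reward(p_with, p_without):
    """Log-likelihood gain on the true continuation from conditioning on a rationale.

    p_with / p_without: probabilities assigned to each true next token
    with and without the rationale (toy inputs, assumed for illustration).
    """
    lp_with = sum(math.log(p) for p in p_with)
    lp_without = sum(math.log(p) for p in p_without)
    return lp_with - lp_without

# Familiar text: the rationale doubles the probability of each true token.
familiar = rationale_reward([0.6, 0.5], [0.3, 0.25])
# Unfamiliar text: the rationale changes nothing, so it earns no reward.
novel = rationale_reward([0.1, 0.1], [0.1, 0.1])
```

Note what the score measures: agreement with the observed continuation. Nothing in it checks whether the rationale is logically valid, which is why a rationale can be well rewarded on familiar text and still be of no use on a novel problem.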

The clearest evidence is that reasoning failures track *novelty*, not difficulty. Models don't hit a wall at some complexity threshold; they hit it at the edge of what they've seen. A long reasoning chain succeeds if it resembles trained instances, and a short one fails if it doesn't, because the model is fitting instance-level patterns rather than running a general algorithm (Do language models fail at reasoning due to complexity or novelty?). A rationale tuned to predict common text is, almost by definition, an instance-pattern matcher, so the novel case is exactly where its bet stops paying off.
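One way to check the novelty-versus-difficulty claim on your own evaluation logs is to bucket accuracy twice, once by chain length and once by whether a similar instance was seen in training. The sketch below assumes each record carries hypothetical `steps`, `seen_similar`, and `correct` fields; the four records are made-up toy data, not results from any of the notes.

```python
from collections import defaultdict

def accuracy_by(records, key):
    """Return accuracy per distinct value of records[i][key]."""
    totals = defaultdict(lambda: [0, 0])  # value -> [num_correct, num_total]
    for r in records:
        bucket = totals[r[key]]
        bucket[0] += int(r["correct"])
        bucket[1] += 1
    return {k: c / n for k, (c, n) in totals.items()}

# Toy records (invented): correctness follows familiarity, not length.
records = [
    {"steps": "long", "seen_similar": True, "correct": True},
    {"steps": "long", "seen_similar": False, "correct": False},
    {"steps": "short", "seen_similar": True, "correct": True},
    {"steps": "short", "seen_similar": False, "correct": False},
]

by_novelty = accuracy_by(records, "seen_similar")
by_length = accuracy_by(records, "steps")
```

If the novelty account holds, the split by `seen_similar` separates successes from failures while the split by `steps` shows little difference, which is the pattern the toy data is constructed to display.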

Several notes converge on the same diagnosis from different angles: chain-of-thought is *constrained imitation* of reasoning's form, not genuine inference. It works by reproducing familiar schemata from training, which is why performance degrades predictably under distribution shift in task, length, or format (Does chain-of-thought reasoning reveal genuine inference or pattern matching?; Does chain-of-thought reasoning actually generalize beyond training data?; What makes chain-of-thought reasoning actually work?). When you decouple the semantic content from the logic by stripping away the familiar token associations, accuracy collapses even when the correct rules are sitting right there in context, because the model leans on semantic familiarity rather than symbolic manipulation (Do large language models reason symbolically or semantically?). That is the mechanism behind the question in one sentence: a text-pattern predictor has nothing to fall back on once the surface pattern is gone.
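The decoupling described above can be approximated by swapping familiar words for arbitrary symbols while leaving the rules intact. This is a minimal sketch of that idea, not the procedure used in the cited note; `desemanticize`, the symbol scheme `X1, X2, ...`, and the example prompt are all assumptions.

```python
import re

def desemanticize(text, entities):
    """Replace each listed word with an arbitrary symbol, keeping the logical structure."""
    mapping = {word: f"X{i}" for i, word in enumerate(entities, start=1)}
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping

prompt = (
    "If an animal is a sparrow then it is a bird. "
    "If it is a bird then it can fly. Tweety is a sparrow."
)
abstract, mapping = desemanticize(prompt, ["sparrow", "bird", "fly"])
print(abstract)
```

Both versions of the prompt support the same two-step inference. Comparing a model's accuracy on the original and the symbolized version gives a rough read on how much of its answer depends on familiar associations rather than on the rules in context.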

The notes also offer a twist: not every reasoning failure is a *reasoning* failure. Some collapses are execution failures, where the model knows the algorithm but can't carry out enough steps in text-only generation, and tool access pushes the supposed cliff back (Are reasoning model collapses really failures of reasoning?). And startlingly, the rationale text doesn't even have to be *correct* to help: models trained on systematically irrelevant traces keep their accuracy, suggesting traces sometimes act as computational scaffolding rather than meaningful steps (Do reasoning traces need to be semantically correct?). This complicates the simple story. If traces are partly scaffolding, then "predicting common patterns" may be doing two jobs at once, and only one of them generalizes.

The practical takeaway for a curious reader: the signature of imitation is *predictable degradation*. It shows up under longer inputs, where accuracy drops at lengths far below the context window (Does reasoning ability actually degrade with longer inputs?), under structural complexity in language (Why do large language models fail at complex linguistic tasks?), and outside the training distribution generally. A general algorithm should be largely indifferent to how closely a problem resembles the training data; pattern-matching instead loses accuracy along a curve as that resemblance fades. If you want to test whether a rationale is reasoning or reciting, push it somewhere novel and watch the slope.
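"Watching the slope" can be made literal by fitting a line to accuracy measured at increasing levels of distribution shift. The sketch below uses an ordinary least-squares slope; `degradation_slope`, the shift scale, and the accuracy values are hypothetical and invented for illustration.

```python
def degradation_slope(shift_levels, accuracies):
    """Least-squares slope of accuracy against shift level."""
    n = len(shift_levels)
    mean_x = sum(shift_levels) / n
    mean_y = sum(accuracies) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(shift_levels, accuracies))
    var = sum((x - mean_x) ** 2 for x in shift_levels)
    return cov / var

# Toy measurements (invented): 0 = in-distribution, 3 = furthest from training data.
shift_levels = [0, 1, 2, 3]
accuracies = [0.9, 0.8, 0.7, 0.6]

slope = degradation_slope(shift_levels, accuracies)
```

A consistently negative slope across task, length, and format shifts is the pattern the notes associate with imitation. A slope near zero would be the more interesting result, though it would only be as convincing as the shift levels are genuinely novel.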

Sources 10 notes

Can models learn reasoning from predicting any text?

Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

What makes chain-of-thought reasoning actually work?

CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
