Why do readability and style metrics plateau while reasoning improves with scale?
This explores why surface-level qualities of text (how readable or stylish it sounds) hit a ceiling quickly, while reasoning ability keeps climbing as models grow — the corpus suggests style and reasoning are different kinds of skill that saturate on different timelines.
The short version the corpus points to: style is a *pattern-matching* skill that gets cheap fast, while reasoning is a *procedural* skill that keeps absorbing scale. Two papers make the gap almost embarrassingly clear. Imitation models trained to mimic ChatGPT learn its confident, fluent voice well enough to fool human judges, but close *no* capability gap on factuality or novel tasks (Can imitating ChatGPT fool evaluators into thinking models improved?). Style is the part that transfers easily; competence isn't. Separately, GPT-2 can hit 95% accuracy identifying an author purely from style patterns, yet has no framework to explain *why* those choices carry meaning (Can language models truly understand literary style?). Detection saturates early; interpretation doesn't. So the 'plateau' isn't a bug: it's what happens when a task is fully solvable by surface statistics.
Why would style be so statistically shallow? Because models are, at bottom, tracking frequency. Given two ways of saying the same thing, LLMs systematically prefer the higher-frequency phrasing across math, machine translation, commonsense reasoning, and tool calling, which suggests they lean on statistical mass from pretraining rather than on meaning (Do language models really understand meaning or just surface frequency?). Readability and style metrics largely *reward* that frequency-tracking: smooth, common phrasing scores well. Once a model has absorbed the distribution of fluent text, there's nowhere further to go. The metric tops out because the underlying skill tops out.
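To make "higher-frequency phrasing" concrete, here is a minimal sketch that scores two paraphrases by the average log frequency of their words. Everything in it is an assumption for illustration: the tiny corpus, the `mean_log_frequency` helper, and the use of unigram counts as a stand-in for pretraining frequency. The cited note does not say how frequency was measured.

```python
import math
from collections import Counter


def mean_log_frequency(phrase, counts, total):
    """Average log relative frequency of the phrase's words (add-one smoothed)."""
    words = phrase.lower().split()
    vocab_size = len(counts)
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in words
    ) / len(words)


# Hypothetical toy corpus standing in for pretraining text.
corpus = "the cat sat on the mat . the dog sat on the rug . a feline reposed".split()
counts = Counter(corpus)
total = sum(counts.values())

# Two phrasings with roughly the same meaning.
score_common = mean_log_frequency("the cat sat", counts, total)
score_rare = mean_log_frequency("a feline reposed", counts, total)

# The common phrasing scores higher, so a frequency-rewarding metric favors it.
print(score_common > score_rare)
```

A metric built on a score like this has a ceiling by construction: once a model reliably produces the most common phrasing, the score cannot rise further.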
Reasoning is a different animal, and the corpus keeps showing it's not really one skill but a stack of them that scale separately. Chain-of-thought turns out to be format-driven pattern generation more than logic: training format shapes strategy 7.5× more than domain, and invalid chain-of-thought prompts work about as well as valid ones (What makes chain-of-thought reasoning actually work?). The actual learning signal concentrates in a small minority of tokens: only ~20% are high-entropy 'forking points,' and training on just those matches or exceeds full-gradient updates (Do high-entropy tokens drive reasoning model improvements?). Scale keeps paying off here because there's a long tail of these decision points to get right, whereas style has no comparable tail.
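The idea of high-entropy 'forking points' can be sketched as follows: compute the Shannon entropy of each position's next-token distribution and keep only the top fraction. This is a hedged illustration, not the cited work's implementation; the `forking_token_mask` helper, the 20% `keep_fraction` default, and the example distributions are all assumptions made here.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_token_mask(distributions, keep_fraction=0.2):
    """Return a boolean mask; True marks the highest-entropy positions."""
    entropies = [token_entropy(d) for d in distributions]
    n_keep = max(1, round(keep_fraction * len(entropies)))
    # Indices of the n_keep positions with the largest entropy.
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(entropies))]


# Hypothetical next-token distributions for a 5-token sequence.
dists = [
    [0.97, 0.01, 0.01, 0.01],      # near-deterministic continuation
    [0.25, 0.25, 0.25, 0.25],      # maximally uncertain: a forking point
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]
mask = forking_token_mask(dists, keep_fraction=0.2)
print(mask)
```

In a training loop, a mask like this would be multiplied into the per-token loss so that only the forking positions contribute gradient; that step depends on the training framework and is omitted here.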
The twist worth carrying away: even reasoning's apparent gains are partly about something other than 'thinking harder.' Trace length tracks how close a problem is to training data, not its difficulty (Does longer reasoning actually mean harder problems?), and more capable models actually prefer *shorter* chains: accuracy follows an inverted-U in chain length, and the optimal length shrinks as competence rises (Why does chain of thought accuracy eventually decline with length?). Some reported reasoning 'collapses' are really execution limits, where the model knows the algorithm but can't run enough steps in text, and they vanish when you give it tools (Are reasoning model collapses really failures of reasoning?). So both halves of the question dissolve a bit on inspection: style plateaus because it's surface statistics that saturate, and reasoning 'improves' because it's a bundle of procedural and execution capacities with far more headroom, not because the model is getting more eloquent. The two metrics were never measuring the same kind of thing.
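One way to see why an inverted-U with a capability-dependent peak is plausible is a toy model, sketched below. It is purely an assumption for illustration and does not come from the cited notes: splitting a task into more steps makes each step easier, but every extra step adds a fixed chance of a slip. The functions `toy_accuracy` and `best_length`, and the `slip` value, are hypothetical.

```python
def toy_accuracy(n_steps, difficulty, capability, slip=0.02):
    """Toy probability that an n-step chain reaches a correct answer.

    Assumption: each step carries difficulty / n_steps of the work, fails more
    often as that load approaches the model's capability, and also has a
    constant slip probability regardless of load.
    """
    load = difficulty / (n_steps * capability)
    per_step_success = (1.0 - slip) * max(0.0, 1.0 - load)
    return per_step_success ** n_steps


def best_length(difficulty, capability, max_steps=60):
    """Chain length in 1..max_steps that maximizes the toy accuracy."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: toy_accuracy(n, difficulty, capability),
    )


# Under this toy model, a stronger model peaks at a shorter chain,
# and a harder task peaks at a longer one.
weak_model = best_length(difficulty=10, capability=5)
strong_model = best_length(difficulty=10, capability=10)
harder_task = best_length(difficulty=20, capability=5)
print(strong_model < weak_model < harder_task)
```

Too few steps overload each step and too many accumulate slips, so accuracy rises and then falls with length. The exact peak positions depend entirely on the assumed functional form.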
Sources (8 notes)
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
GPT-2 achieves 95% accuracy identifying authorship through style patterns alone, but lacks the evaluative framework to explain why those stylistic choices carry meaning. Detection without interpretation remains cataloguing, not criticism.
LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.