Can thought quality alone be trusted to guide model training?
This explores whether the apparent quality of a model's reasoning — how good its thinking *looks* — is a trustworthy signal to train on, or whether 'good thinking' and 'good outcomes' come apart in ways that fool the trainer.
The corpus's strongest message is that thought quality, on its own, is often not a reliable compass for training, because the *form* of good reasoning and the *substance* of it are surprisingly easy to separate. The most direct evidence: chain-of-thought exemplars that are logically invalid perform nearly as well as valid ones, suggesting models learn the *shape* of reasoning rather than genuine inference (see: Does logical validity actually drive chain-of-thought gains?). Pair that with the finding that supervised fine-tuning raises benchmark accuracy while cutting the information gain of each reasoning step by about 39%: models arrive at correct answers through post-hoc rationalization, and standard metrics never notice because they only check the final answer (see: Does supervised fine-tuning improve reasoning or just answers?). If you trust surface quality, you can train a model that looks like it's thinking better while it's actually thinking worse.
The same trap shows up in imitation: models trained to mimic ChatGPT capture its confident, fluent style and fool human evaluators, yet close no real capability gap (see: Can imitating ChatGPT fool evaluators into thinking models improved?). 'Quality' as judged by appearance is exactly what gets gamed. Even when you try to teach quality explicitly, surface patterns sneak in: fine-tuning on labeled argument-quality examples fails to transfer to new argument types, and models need an explicit theoretical framework to learn principled criteria rather than mimicry (see: Can models learn argument quality from labeled examples alone?). The lesson repeats with clarifying questions: a single quality score works poorly, but *decomposing* quality into named attributes (clarity, relevance, specificity) gives the model something real to optimize (see: Can models learn to ask genuinely useful clarifying questions?). Quality has to be unpacked into mechanisms before it can be trusted.
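The decomposition idea above can be pictured as building one preference pair per named attribute instead of ranking candidates by a single score. The following is a minimal sketch in Python, not the ALFA implementation: the function name `attribute_preference_pairs`, the 1-to-5 integer scores, and the `margin` parameter are all hypothetical choices made for illustration.

```python
from itertools import combinations

# Attribute names taken from the text; the scoring scale is an assumption.
ATTRIBUTES = ("clarity", "relevance", "specificity")


def attribute_preference_pairs(candidates, margin=0):
    """Build (attribute, chosen_text, rejected_text) triples.

    candidates: list of dicts like {"text": str, "scores": {attribute: number}}.
    A pair is emitted for an attribute only when the two candidates differ
    on that attribute by more than `margin` (hypothetical tie threshold).
    """
    pairs = []
    for attr in ATTRIBUTES:
        for a, b in combinations(candidates, 2):
            diff = a["scores"][attr] - b["scores"][attr]
            if abs(diff) <= margin:
                continue  # no clear preference on this attribute
            chosen, rejected = (a, b) if diff > 0 else (b, a)
            pairs.append((attr, chosen["text"], rejected["text"]))
    return pairs


def single_score(candidate):
    """Collapse all attributes into one number (the approach the text warns about)."""
    return sum(candidate["scores"][attr] for attr in ATTRIBUTES)


# Two made-up candidate questions with made-up scores.
q_clear = {"text": "Where is the pain?",
           "scores": {"clarity": 5, "relevance": 2, "specificity": 3}}
q_relevant = {"text": "Any recent changes to medication or dosage?",
              "scores": {"clarity": 2, "relevance": 5, "specificity": 3}}

pairs = attribute_preference_pairs([q_clear, q_relevant])
```

In this toy case the two questions tie on the collapsed score, so a single-score ranker has nothing to learn from them, while the per-attribute view still yields one pair for clarity and one for relevance.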
There's also a quantity wrinkle that undermines naive 'more/better thinking is good' intuitions. Accuracy peaks and then *declines* past a critical thinking-token threshold, with models overthinking easy problems and underthinking hard ones (see: Does more thinking time always improve reasoning accuracy?). And vanilla models actually use extended thinking *counterproductively*, spiraling into self-doubt; it takes RL training to redirect that same machinery into productive analysis (see: Does extended thinking help or hurt model reasoning?). So thinking quality isn't an intrinsic property you can read off and reward; it's something training *mediates*. Even objectively higher-quality teacher data backfires when it exceeds the student's learning frontier; students do better when they filter refinements down to what's compatible with their own profile (see: Does teacher-refined data always improve student model performance?).
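To make the non-monotonic relationship concrete, here is a small sketch that picks the best thinking budget from measured (tokens, accuracy) points and reports how far accuracy falls at the largest budget. The helper names are hypothetical, and the only data used are the two approximate figures reported in the source notes (about 1,100 tokens at 87.3% and about 16K tokens at 70.3%); no intermediate points are assumed.

```python
def best_thinking_budget(measurements):
    """Return the (thinking_tokens, accuracy_percent) point with the highest accuracy.

    Ties on accuracy go to the smaller token budget.
    """
    return max(measurements, key=lambda m: (m[1], -m[0]))


def drop_past_peak(measurements):
    """Accuracy lost (in percentage points) between the peak and the largest budget.

    Returns 0.0 when the peak is already at the largest budget.
    """
    peak_tokens, peak_acc = best_thinking_budget(measurements)
    longer = [(t, a) for t, a in measurements if t > peak_tokens]
    if not longer:
        return 0.0
    _, acc_at_largest = max(longer)  # point with the most thinking tokens
    return round(peak_acc - acc_at_largest, 1)


# Approximate figures from the source notes: (thinking tokens, accuracy %).
reported = [(1100, 87.3), (16000, 70.3)]

peak = best_thinking_budget(reported)
lost = drop_past_peak(reported)
```

On the reported figures the peak sits at the shorter budget, and the longer budget costs 17.0 percentage points, which is why a reward for longer or more elaborate thinking can point the wrong way.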
Here's what you might not expect, though: the corpus also points to signals that *do* track genuine reasoning, if you measure the right thing. The deep-thinking ratio, the fraction of tokens whose predictions get substantially revised across model layers, correlates robustly with accuracy and can guide test-time effort (see: Can we measure how deeply a model actually reasons?). And a model's own answer-span confidence can serve as a reward that strengthens step-by-step reasoning *while* repairing the calibration that RLHF degrades, with no human labels or external verifiers required (see: Can model confidence work as a reward signal for reasoning?). The difference is that these are *internal, mechanistic* signals (what the network actually does layer by layer, how confident it genuinely is) rather than the legible surface of the reasoning trace.
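Both internal signals can be sketched in a few lines, under loud assumptions. The source notes do not give exact formulas, so the definitions below are illustrative stand-ins: a token counts as "deep" when its top-1 prediction only settles on the final layer's answer in the later part of the network, and answer-span confidence is the geometric-mean probability of the answer tokens. The inputs (per-layer top-1 predictions and per-token log-probabilities) are assumed to be extracted already; the function names and the `depth_threshold` parameter are hypothetical.

```python
import math


def settle_layer(preds_for_token):
    """Earliest layer index from which the prediction stays equal to the final layer's."""
    final = preds_for_token[-1]
    layer = len(preds_for_token) - 1
    while layer > 0 and preds_for_token[layer - 1] == final:
        layer -= 1
    return layer


def deep_thinking_ratio(layer_preds, depth_threshold=0.5):
    """Fraction of tokens whose prediction settles late in the network.

    layer_preds[l][t] is the top-1 token id read out at layer l for position t
    (assumed input). depth_threshold is a hypothetical cutoff on relative depth.
    """
    n_layers = len(layer_preds)
    n_tokens = len(layer_preds[0]) if n_layers else 0
    if n_layers < 2 or n_tokens == 0:
        return 0.0
    deep = 0
    for t in range(n_tokens):
        column = [layer_preds[l][t] for l in range(n_layers)]
        if settle_layer(column) / (n_layers - 1) >= depth_threshold:
            deep += 1
    return deep / n_tokens


def answer_span_confidence(token_logprobs, answer_start):
    """Geometric-mean probability of the tokens from answer_start onward."""
    span = token_logprobs[answer_start:]
    if not span:
        raise ValueError("answer span is empty")
    return math.exp(sum(span) / len(span))


def confidence_preference_pair(traces):
    """Return (chosen_text, rejected_text): most vs. least confident answer span.

    traces: list of dicts with "text", "token_logprobs", "answer_start".
    """
    ranked = sorted(
        traces,
        key=lambda tr: answer_span_confidence(tr["token_logprobs"], tr["answer_start"]),
        reverse=True,
    )
    return ranked[0]["text"], ranked[-1]["text"]


# Toy inputs (made up): 4 layers x 4 token positions, and two reasoning traces.
toy_layer_preds = [
    [7, 1, 2, 9],
    [7, 1, 3, 8],
    [7, 4, 3, 5],
    [7, 4, 3, 6],
]
toy_traces = [
    {"text": "trace A", "token_logprobs": [-0.1, -0.2, -0.05, -0.05], "answer_start": 2},
    {"text": "trace B", "token_logprobs": [-0.1, -0.1, -1.0, -2.0], "answer_start": 2},
]

dtr = deep_thinking_ratio(toy_layer_preds)
pair = confidence_preference_pair(toy_traces)
```

Neither function looks at how the reasoning trace reads. Both depend only on what the network computed, which is the property that separates these signals from the surface-level notion of quality discussed earlier.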
The synthesis, then: thought quality alone can't be trusted when 'quality' means the readable appearance of good reasoning, because that's precisely the channel that imitation, rationalization, and invalid-but-well-formed CoT exploit. It becomes trustworthy only when you ground it: decompose it into attributes, anchor it in a framework, or measure it as an internal property rather than a stylistic one. A deeper thread runs underneath all of this: base models already contain latent reasoning that minimal training merely *elicits* rather than creates (see: Do base models already contain hidden reasoning ability?). If training mostly selects from what's already there, then the real job of a quality signal isn't to teach good thinking; it's to reliably *find* it without being fooled by its imitation.
Sources (11 notes)
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.
The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
Deep-thinking ratio (DTR) measures the proportion of tokens whose predictions undergo significant revision across model layers, correlating robustly with accuracy across AIME, HMMT, and GPQA benchmarks. Think@n, a test-time strategy using DTR, matches self-consistency performance while reducing inference costs.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.