Can models learn to select exemplars based on reasoning skills rather than complexity?
This explores whether a model can pick which training or in-context examples to learn from based on the *reasoning skill* they exercise — the procedures, decision points, and forms of inference — rather than by surface difficulty or problem length.
The corpus suggests the question rests on a deeper premise: that "complexity" is the wrong axis entirely. One finding shows that reasoning models don't break down at complexity thresholds; they break down at *instance unfamiliarity*. Models fit instance-level patterns rather than general algorithms, so a long, hairy reasoning chain succeeds if the model has seen similar instances, while a short, simple one fails if it hasn't (Do language models fail at reasoning due to complexity or novelty?). If failure tracks novelty and not difficulty, then selecting exemplars by complexity optimizes the wrong signal in the first place.
So what *is* the load-bearing property of a good exemplar? Several notes converge on a surprising answer: it's the *form* and *procedure* of reasoning, not its correctness or rigor. Logically invalid chain-of-thought prompts perform nearly as well as valid ones, because the model learns the structural shape of reasoning rather than genuine inference (Does logical validity actually drive chain-of-thought gains?). Even deliberately corrupted traces teach as well as correct ones; they act as computational scaffolding rather than meaningful steps (Do reasoning traces need to be semantically correct?). And what generalizes across reasoning isn't memorized facts but transferable *procedural knowledge* drawn from diverse pretraining sources (Does procedural knowledge drive reasoning more than factual retrieval?). Together these suggest that if you wanted to select exemplars by "reasoning skill," the skill that matters is procedural structure, and it can be present even when the example is wrong.
The corpus also shows models *can* learn to route and select based on reasoning demands, not difficulty labels. Thinkless trains a single model to decide when to engage extended thinking versus answer directly, using a decoupled RL objective that learns this self-calibrated routing *without* explicit difficulty labels (Can models learn when to think versus respond quickly?). That's a model selecting based on the reasoning a problem requires rather than a complexity score. In a related vein, the critical learning signal lives in a small minority of high-entropy "forking" tokens, the pivotal decision points, and training on just that ~20% matches full updates (Do high-entropy tokens drive reasoning model improvements?). Reasoning skill, it turns out, is concentrated in specific decision moments, not spread evenly across a problem's surface complexity.
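To make the "forking token" idea concrete, here is a minimal sketch of picking the highest-entropy ~20% of positions in a sequence. It is only an illustration under stated assumptions: the function names `token_entropies` and `forking_token_mask`, the `keep_fraction` parameter, and the toy probability table are all hypothetical, and the cited work applies the idea inside RLVR training rather than to a hand-written table like this one.

```python
import math

def token_entropies(prob_rows):
    """Shannon entropy (in nats) of each position's next-token distribution."""
    return [-sum(p * math.log(p) for p in row if p > 0.0) for row in prob_rows]

def forking_token_mask(prob_rows, keep_fraction=0.2):
    """Return a 0/1 mask marking the highest-entropy positions.

    keep_fraction is a hypothetical knob; 0.2 mirrors the ~20% figure above.
    """
    entropies = token_entropies(prob_rows)
    k = max(1, round(keep_fraction * len(entropies)))
    # Indices sorted from most to least uncertain; keep the first k.
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(entropies))]

# Toy next-token distributions over a 3-word vocabulary, one row per position.
probs = [
    [0.98, 0.01, 0.01],    # confident continuation
    [0.34, 0.33, 0.33],    # near-uniform: a decision point
    [0.90, 0.05, 0.05],
    [0.99, 0.005, 0.005],
    [0.70, 0.20, 0.10],
]
mask = forking_token_mask(probs)
```

In a training loop, a mask like this would typically be multiplied into the per-token loss so that only the uncertain positions contribute gradient; how closely that matches the cited setup is an assumption of this sketch.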
There's a deeper reason skill-based selection should work: the reasoning is often already latent. Five independent mechanisms all elicit reasoning that base models already contain; post-training *selects* rather than creates it (Do base models already contain hidden reasoning ability?). A single well-chosen RLVR example can jump math accuracy from 36% to 73.6%, and test accuracy keeps improving for roughly 1,400 steps after training accuracy has saturated (Can a single training example unlock mathematical reasoning?). If one example can activate a whole capability, then *which* example you select matters enormously, and the activating property is the reasoning behavior it triggers, not how hard it looks.
The sharpest practical lesson comes from work on argument quality and question-asking: models trained on labeled examples alone learn surface patterns, not principled criteria, and only generalize when quality is *decomposed into explicit, theory-grounded attributes* (Can models learn argument quality from labeled examples alone?; Can models learn to ask genuinely useful clarifying questions?). The implication for your question is direct: "reasoning skill" isn't a single scalar a model can sort by; it has to be broken into named sub-skills (the procedure used, the decision points exercised, the instance patterns covered) before selection becomes tractable. Complexity is one number and easy to sort by; reasoning skill is a structured object, which is exactly why decomposition keeps showing up as the thing that makes it learnable.
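The contrast between sorting by one number and selecting over a structured object can be sketched in a few lines. This is a hedged illustration, not a method from any of the cited notes: the `Exemplar` type, the skill tags, the `length` field standing in for complexity, and both selection functions are hypothetical, and the greedy coverage rule is just one simple way to select over named sub-skills.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Exemplar:
    name: str
    skills: frozenset  # hypothetical named sub-skills the example exercises
    length: int        # stand-in for surface complexity

def select_by_skill(pool, needed, budget):
    """Greedily pick exemplars covering the most still-uncovered skills."""
    remaining = set(needed)
    candidates = list(pool)
    chosen = []
    while remaining and candidates and len(chosen) < budget:
        best = max(candidates, key=lambda e: len(e.skills & remaining))
        if not best.skills & remaining:
            break  # nothing left in the pool exercises a needed skill
        chosen.append(best)
        remaining -= best.skills
        candidates.remove(best)
    return chosen, remaining

def select_by_complexity(pool, budget):
    """Baseline: take the longest examples, ignoring what they exercise."""
    return sorted(pool, key=lambda e: e.length, reverse=True)[:budget]

pool = [
    Exemplar("long_arithmetic", frozenset({"carry"}), 900),
    Exemplar("long_arithmetic_2", frozenset({"carry"}), 850),
    Exemplar("short_case_split", frozenset({"case_analysis", "backtrack"}), 120),
    Exemplar("short_unit_check", frozenset({"unit_conversion"}), 80),
]
needed = {"case_analysis", "backtrack", "unit_conversion"}

chosen, uncovered = select_by_skill(pool, needed, budget=2)
longest = select_by_complexity(pool, budget=2)
```

On this toy pool the skill-based selector picks the two short examples and leaves no needed skill uncovered, while the complexity baseline picks the two long arithmetic examples, which exercise none of the needed skills. The hard part in practice is producing the skill tags at all, which is where the decomposition discussed above comes in.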
Sources (10 notes)
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.
Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.
The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.