INQUIRING LINE

Can partial solution traces convert unproductive hard samples into learnable training data?

This explores whether handing a model partial worked-out solutions on problems it can't currently solve turns those 'too-hard' examples — which otherwise waste training compute — into useful learning signal.


This is really a question about salvage: hard problems where the model never stumbles onto a correct answer produce no reward, so RL learns nothing from them. The corpus's most direct answer is yes, and it has a name: GHPO (Can adaptive guidance from solution traces reduce reward sparsity in RL?). GHPO dynamically injects ground-truth solution traces on the problems a model can't crack while letting it explore freely on the ones it can. The traces act as adaptive guidance that converts sparse, all-or-nothing reward into a gradient the model can climb, yielding ~5% gains on math benchmarks. The clever part is that the traces already exist in the training data; the method just decides *when* a sample is too hard to learn from unaided and feeds in scaffolding instead of wasting the rollout.
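The "decide when a sample is too hard, then feed in scaffolding" idea can be sketched roughly as follows. This is a minimal illustration, not GHPO's actual implementation: the escalating `hint_ratios` schedule, the prompt layout, and the names `build_prompt`, `guided_rollouts`, `sample_fn`, and `reward_fn` are all assumptions introduced here.

```python
from typing import Callable, List, Sequence, Tuple


def build_prompt(problem: str, trace_steps: List[str], hint_ratio: float) -> str:
    """Append the first `hint_ratio` fraction of the reference trace to the problem."""
    n_hint = int(len(trace_steps) * hint_ratio)
    if n_hint == 0:
        return problem  # unguided: the model explores freely
    hint = "\n".join(trace_steps[:n_hint])
    return f"{problem}\nPartial solution:\n{hint}"


def guided_rollouts(
    problem: str,
    trace_steps: List[str],
    sample_fn: Callable[[str], str],      # hypothetical: policy sampling
    reward_fn: Callable[[str], float],    # hypothetical: verifiable reward
    group_size: int = 8,
    hint_ratios: Sequence[float] = (0.0, 0.25, 0.5, 0.75),  # assumed schedule
) -> Tuple[float, str, List[float]]:
    """Escalate the hint until at least one rollout in the group earns reward.

    Returns the hint ratio used, the prompt, and the group's rewards.
    If every ratio fails, the last (most heavily hinted) attempt is returned.
    """
    ratio, prompt, rewards = 0.0, problem, []
    for ratio in hint_ratios:
        prompt = build_prompt(problem, trace_steps, ratio)
        completions = [sample_fn(prompt) for _ in range(group_size)]
        rewards = [reward_fn(c) for c in completions]
        if any(r > 0 for r in rewards):
            break  # the sample is now learnable; stop adding scaffolding
    return ratio, prompt, rewards
```

Problems the model already solves exit at ratio 0.0 and are trained on as ordinary RL samples; only the groups that return no reward at all receive part of the trace.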

Why this matters becomes vivid when you see what happens *without* the intervention. Training on near-impossible problems isn't merely unproductive; it's actively corrosive (Do overly hard RLVR samples actually harm model capabilities?). When a model occasionally stumbles into a right answer by luck, group-relative normalization treats that rare success as a high-advantage trajectory and reinforces the shortcut that produced it, such as answer repetition or computation-skipping, which then contaminates skills the model already had. So partial traces aren't just a way to extract free value; they're a defense against hard samples that otherwise teach the wrong lesson. That reframes the question: it's less 'can we salvage waste' and more 'can we stop hard samples from doing damage.'
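A small worked example makes the arithmetic concrete. The sketch below assumes the common GRPO-style normalization (reward minus group mean, divided by group standard deviation); the use of the population standard deviation and the `eps` term are assumptions, and real implementations differ in these details.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)  # population variance
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# A group of 8 rollouts that all fail: every advantage is zero, so no learning signal.
all_fail = group_relative_advantages([0.0] * 8)

# One lucky success among 8: its advantage is about sqrt(7), roughly 2.65,
# regardless of whether the reasoning behind it was sound.
lucky = group_relative_advantages([1.0] + [0.0] * 7)
```

Under these assumptions, a single success in a group of G rollouts gets an advantage of about sqrt(G - 1), so the rarer the success, the harder the update pushes toward whatever produced it.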

A surprising thread complicates what 'good guidance' even means. You'd assume the injected traces must be correct reasoning, but models trained on deliberately corrupted, semantically irrelevant traces perform comparably, and sometimes generalize *better* out of distribution (Do reasoning traces need to be semantically correct?). This suggests traces often function as computational scaffolding, a structure that gives the model room to compute, rather than as meaningful step-by-step logic to imitate. If that holds, then for hard samples the value of a partial trace may be that it bridges the model to a reachable region of the solution space at all, not that it transmits flawless reasoning.

Which raises the deeper subtlety: are these problems actually unsolvable for the model, or just badly explored? Reasoning models frequently abandon viable paths prematurely, wandering into dead ends or switching away from promising lines too early (Why do reasoning models abandon promising solution paths?), and simple decoding-time nudges recover accuracy without any fine-tuning at all. That implies some 'hard' samples are learnable already; the solution exists in the model but gets dropped. Partial traces help precisely here, by pinning down the early structure so the model doesn't wander off it. And how you *select* what to feed matters as much as feeding it: step-level confidence filtering catches reasoning breakdowns that whole-trace averaging hides, and gets the same gains from far fewer traces (Does step-level confidence outperform global averaging for trace filtering?). Quality of guidance beats quantity.
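The difference between step-level filtering and whole-trace averaging can be illustrated with a toy example. This is a sketch under stated assumptions: the per-step confidence scores are taken as given (for instance, a mean token probability per reasoning step), and the sliding-window minimum, the window size, and the 0.6 threshold are illustrative choices rather than the method described in the source note.

```python
from typing import List


def mean_confidence(step_conf: List[float]) -> float:
    """Whole-trace average: one number for the entire trace."""
    return sum(step_conf) / len(step_conf)


def lowest_window_confidence(step_conf: List[float], window: int = 3) -> float:
    """Step-level view: the weakest run of `window` consecutive steps."""
    if len(step_conf) <= window:
        return mean_confidence(step_conf)
    return min(
        sum(step_conf[i:i + window]) / window
        for i in range(len(step_conf) - window + 1)
    )


def keep_trace(step_conf: List[float], threshold: float = 0.6, window: int = 3) -> bool:
    """Keep a trace only if no local stretch falls below the threshold."""
    return lowest_window_confidence(step_conf, window) >= threshold


# Hypothetical scores: a steady trace, and one that collapses mid-way then recovers.
steady = [0.8] * 8
collapsed = [0.95, 0.95, 0.95, 0.2, 0.2, 0.2, 0.95, 0.95]
```

Here `collapsed` has a mean above the threshold, so global averaging would accept it, while the windowed check rejects it because of the three low-confidence steps in the middle. A running version of the same check could also stop generation as soon as a window drops below the threshold.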

The honest caveat the corpus presses is what 'learnable' buys you. Even GRPO-trained models that look like they've mastered a problem class often crater on out-of-distribution variants: RL tends to sharpen template-matching rather than install a transferable procedure (Do fine-tuned language models actually learn optimization procedures?). So partial traces can reliably turn a wasted sample into reward-bearing training data, but whether that yields genuine reasoning or just a more confident memorized template is the open edge. The payoff is real; the ceiling on what kind of learning it produces is still contested.


Sources (6 notes)

Can adaptive guidance from solution traces reduce reward sparsity in RL?

GHPO dynamically provides ground-truth solution traces for hard problems while using standard RL for manageable ones, achieving 5% gains across math benchmarks. This converts wasted compute on impossible problems into learning signal by leveraging traces already present in training data.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, indicating that trace quality matters more than quantity.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
