INQUIRING LINE

Does partial trace guidance work better than curriculum learning for hard problems?

This explores two strategies for teaching models to solve problems that are too hard for plain reward-based training — handing the model partial worked solutions on demand (partial trace guidance) versus ordering the training from easy to hard (curriculum learning) — and asks whether the corpus shows one clearly winning on the hardest problems.


The question is whether feeding a model partial solution traces beats ordering its training as a curriculum when problems are too hard to learn from reward alone. The corpus suggests these aren't really rivals so much as two answers to the same bottleneck, with a blurry line between them. The shared enemy is reward sparsity: on a hard problem a model almost never stumbles onto the correct answer, so reinforcement learning gets no signal and the compute is wasted. Partial trace guidance attacks this directly. GHPO detects when a problem is beyond the model's current reach and reveals just enough of a ground-truth trace to get it moving, then lets normal exploration take over. This converts otherwise impossible problems into learning signal and yields gains of about 5% across math benchmarks (see: Can adaptive guidance from solution traces reduce reward sparsity in RL?). Curriculum learning attacks the same sparsity from a different angle: SRL-then-RLVR runs an imitation phase first so the model can produce plausible rollouts, which is precisely what makes the later reward phase informative rather than blind (see: Does sequencing imitation then exploration training improve reasoning?).
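
The "reveal just enough of a trace" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GHPO's actual implementation: the function names, the "zero successful rollouts means too hard" test, the hint schedule and the `rollout_is_correct` callback are all hypothetical.

```python
from typing import Callable, Sequence, Tuple


def guided_prompt(question: str, trace_steps: Sequence[str], hint_ratio: float) -> str:
    """Append the first `hint_ratio` fraction of a ground-truth trace to the question."""
    n_reveal = int(len(trace_steps) * hint_ratio)
    if n_reveal == 0:
        return question  # no guidance: the plain RL prompt
    return question + "\nPartial solution:\n" + "\n".join(trace_steps[:n_reveal])


def choose_prompt(
    question: str,
    trace_steps: Sequence[str],
    rollout_is_correct: Callable[[str], bool],  # hypothetical: sample one rollout, verify it
    n_rollouts: int = 8,
    hint_ladder: Sequence[float] = (0.0, 0.25, 0.5, 0.75),  # assumed schedule
) -> Tuple[str, float]:
    """Return the least-guided prompt on which at least one rollout earns reward."""
    prompt, ratio = question, 0.0
    for ratio in hint_ladder:
        prompt = guided_prompt(question, trace_steps, ratio)
        successes = sum(rollout_is_correct(prompt) for _ in range(n_rollouts))
        if successes > 0:
            break  # reward is no longer all-zero, so RL has signal to learn from
    return prompt, ratio
```

Problems the model can already solve keep the unguided prompt; only the ones with no successful rollout get part of the trace, and only as much as the ladder requires.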

The surprising thing is how much these two ideas converge once you look closely. Reverse-curriculum R3 is arguably a curriculum *made of* partial traces: it starts the model near the finished answer and slides the start state backward step by step, so early in training the model only has to complete the last move and gradually takes on more (see: Can curriculum learning approximate expensive process supervision?). That's both a difficulty ramp and a trace-revelation scheme at once, which is a clue that 'partial trace vs. curriculum' is a false binary. GHPO's adaptive 'help only when stuck' is itself a kind of per-problem curriculum; SRL's imitation phase is itself a form of trace guidance.
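
A rough Python sketch of the sliding start state makes the overlap visible. It assumes a demonstration already split into steps; `reverse_curriculum` and its prompt layout are hypothetical names for illustration, not R3's published code.

```python
from typing import List, Sequence, Tuple


def reverse_curriculum(question: str, demo_steps: Sequence[str]) -> List[Tuple[str, int]]:
    """Build training prompts ordered from easiest to hardest.

    Each stage hands the model a shorter prefix of the demonstration, so the
    number of steps it must produce itself grows by one per stage.
    Returns (prompt, steps_left_to_generate) pairs.
    """
    n = len(demo_steps)
    stages = []
    for given in range(n - 1, -1, -1):  # n-1 steps given, then n-2, ..., then 0
        if given == 0:
            prompt = question  # final stage: the original, unguided problem
        else:
            prompt = question + "\n" + "\n".join(demo_steps[:given])
        stages.append((prompt, n - given))
    return stages
```

The first stage is a nearly complete partial trace and the last stage is the bare problem, so the same list reads as either a difficulty ramp or a schedule of shrinking hints.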

What actually matters for hard problems, the corpus hints, may be neither label but *what the traces contain*. Stream of Search shows that training on the messy full search process (mistakes, backtracking and dead ends serialized into the trace) yields 25% higher accuracy than training on clean optimal trajectories, because the model learns an internal search strategy instead of copying a fixed path (see: Does training on messy search processes improve reasoning?). This matters because hard problems are exactly where models fall apart: reasoning LLMs 'wander' unsystematically, and their success probability drops exponentially with problem depth (see: Why do reasoning LLMs fail at deeper problem solving?). Guidance that teaches *how to search*, including how to recover from being wrong, is what addresses that failure, regardless of whether you call the delivery mechanism a trace or a curriculum.
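
One simple way to see why depth hurts so much is a toy model. It rests on a simplifying assumption that does not come from the cited note: each of the d steps on a solution path succeeds independently with the same probability p.

```latex
\[
P_{\text{success}}(d) = p^{d}, \qquad 0 < p < 1,
\qquad \text{e.g. } 0.9^{10} \approx 0.35, \quad 0.9^{20} \approx 0.12 .
\]
```

Under that assumption, even 90% per-step reliability leaves roughly a one-in-eight chance of finishing a 20-step problem, which is why learning to recover from a wrong step matters more as problems get deeper.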

Here's the genuinely unsettling thread the corpus pulls on: the traces may not need to be *correct* to work. Models trained on deliberately corrupted reasoning traces perform comparably to those trained on correct ones, suggesting traces act as computational scaffolding rather than meaningful reasoning (see: Do reasoning traces need to be semantically correct?), and instruction tuning similarly teaches output format more than task understanding (see: Does instruction tuning teach task understanding or output format?). If true, the value of 'partial trace guidance' may lie less in transmitting the right answer and more in pulling the model into the right output space so its own latent ability can activate. That reframes the whole question: a single training example can lift math accuracy from 36% to 73.6%, implying the capability was already latent and just needed a trigger (see: Can a single training example unlock mathematical reasoning?).

So the honest answer is: the corpus doesn't crown a winner, and the question itself may be miscast. The strongest result is a *combination* (SRL imitation followed by RLVR refinement beats either alone), and the most effective methods (GHPO, R3) quietly fuse both ideas. If you want to go deeper on what makes a trace useful versus decorative, the corruption and confidence-filtering work is the sharper rabbit hole: step-level confidence filtering shows trace *quality* beats trace *quantity* (see: Does step-level confidence outperform global averaging for trace filtering?), and trace length turns out to track training-distribution familiarity rather than true difficulty (see: Does longer reasoning actually mean harder problems?), which means 'this problem is hard' is itself harder to detect than it looks.
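
To illustrate why a local score can catch what a global average hides, here is a small Python sketch. It assumes each trace comes with one confidence value per step; the function names, the window size and the threshold are hypothetical choices for illustration, not the cited method's actual parameters.

```python
from statistics import mean
from typing import List, Sequence


def global_confidence(step_conf: Sequence[float]) -> float:
    """Average confidence over the whole trace."""
    return mean(step_conf)


def lowest_window_confidence(step_conf: Sequence[float], window: int = 2) -> float:
    """Lowest mean confidence over any run of `window` consecutive steps."""
    if len(step_conf) <= window:
        return mean(step_conf)
    return min(
        mean(step_conf[i:i + window])
        for i in range(len(step_conf) - window + 1)
    )


def filter_traces(
    traces: Sequence[Sequence[float]], threshold: float, window: int = 2
) -> List[Sequence[float]]:
    """Keep only traces whose weakest local window stays at or above `threshold`."""
    return [t for t in traces if lowest_window_confidence(t, window) >= threshold]
```

A trace such as `[0.99, 0.99, 0.99, 0.3, 0.3, 0.99, 0.99, 0.99]` has a higher global average than a steady `[0.8, 0.8, 0.8, 0.8]`, yet its weakest window flags the breakdown in the middle. Because the windowed score is available as soon as the dip occurs, the same check could also stop a trace before it completes.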


Sources (10 notes)

Can adaptive guidance from solution traces reduce reward sparsity in RL?

GHPO dynamically provides ground-truth solution traces for hard problems while using standard RL for manageable ones, achieving 5% gains across math benchmarks. This converts wasted compute on impossible problems into learning signal by leveraging traces already present in training data.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Can curriculum learning approximate expensive process supervision?

R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.

Does training on messy search processes improve reasoning?

Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, scoring 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
