Can trajectory quality filtering improve model training in noisy environments?
This explores whether selectively keeping or discarding training trajectories by quality, rather than training on everything a model generates, helps when the reward signal is messy, sparse, or unverifiable. The corpus offers several distinct angles on the question.
The short version from the corpus: yes, and the more interesting finding is *why*. Bad trajectories don't just waste compute, they actively poison the model. The clearest case is what happens when you *don't* filter: training on problems that are nearly impossible for the model causes it to learn degenerate shortcuts (answer repetition, computation-skipping), and those shortcuts then contaminate capabilities the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). The mechanism is subtle: group-relative reward normalization treats a rare lucky success on an impossible problem as a *high-advantage* trajectory and reinforces it strongly. So in a noisy regime, the absence of filtering doesn't just slow learning, it amplifies the worst examples. That reframes filtering from an efficiency tweak into damage control.
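The normalization effect can be made concrete with a small sketch. It assumes the common group-relative form (reward minus the group mean, divided by the group standard deviation); the function names, the eight-rollout groups, and the pass-rate thresholds in `keep_prompt` are illustrative assumptions, not values taken from the cited note.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the other rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def keep_prompt(rewards, min_pass_rate=0.2, max_pass_rate=0.8):
    """Hypothetical filter: drop prompts the model almost never (or almost always) solves."""
    pass_rate = statistics.fmean(rewards)
    return min_pass_rate <= pass_rate <= max_pass_rate

# Eight rollouts per prompt: 1.0 = judged correct, 0.0 = judged wrong.
near_impossible = [0.0] * 7 + [1.0]                      # one lucky success
moderate = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]      # half succeed

lucky_advantage = max(group_relative_advantages(near_impossible))
typical_advantage = max(group_relative_advantages(moderate))

print(f"advantage of the lucky success:   {lucky_advantage:.2f}")
print(f"advantage of a moderate success:  {typical_advantage:.2f}")
print("keep near-impossible prompt?", keep_prompt(near_impossible))
print("keep moderate prompt?", keep_prompt(moderate))
```

Under these assumptions the single lucky success receives an advantage of roughly 2.6, against roughly 1.0 for a success on the moderate prompt, which is why an unfiltered near-impossible prompt can dominate the update. The pass-rate filter is only one simple way to screen such prompts.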
The corpus offers a particularly elegant filtering approach where the noise estimate and the filter are the same statistic. Cross-rollout variance (how much a model's multiple attempts at the same query disagree) gets used at two levels at once: to weight individual tokens and to throw out entire queries whose comparisons are degenerate, yielding 2–3× faster, more stable training specifically on *unverifiable* tasks where you have no clean ground-truth signal (see "Can one statistical measure serve dual purposes in RL training?"). That's worth sitting with: the model's own internal disagreement is a usable proxy for which trajectories are trustworthy, and it's cheap because it's self-supervised.
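A minimal sketch of the dual use, under loud assumptions: `scores[k][t]` is a hypothetical per-token score for rollout `k` at token position `t` of a shared reference, and the helper names and the `min_mean_variance` threshold are invented for illustration. The cited note does not specify the statistic at this level of detail, so this shows the general idea rather than the DRO implementation.

```python
import statistics

def token_variances(scores):
    """Variance across rollouts at each token position (scores[k][t])."""
    n_tokens = len(scores[0])
    return [statistics.pvariance([row[t] for row in scores]) for t in range(n_tokens)]

def token_weights(variances):
    """Token level: weight positions in proportion to how much rollouts disagree there."""
    total = sum(variances)
    if total == 0:
        return [0.0] * len(variances)
    return [v / total for v in variances]

def keep_query(variances, min_mean_variance=1e-3):
    """Query level: discard queries whose rollouts are (nearly) indistinguishable."""
    return statistics.fmean(variances) >= min_mean_variance

# Three rollouts scored on a three-token reference (made-up numbers).
informative = [[0.9, 0.2, 0.8], [0.9, 0.7, 0.8], [0.9, 0.4, 0.8]]
degenerate = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]

informative_var = token_variances(informative)
weights = token_weights(informative_var)
print("token weights:", weights)
print("keep informative query?", keep_query(informative_var))
print("keep degenerate query?", keep_query(token_variances(degenerate)))
```

The same `token_variances` output feeds both decisions: its normalized form weights tokens, and its mean decides whether the query is worth training on at all.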
Filtering is one lever; *generating* better quality signals is the complementary one. AlphaLLM uses tree search to rank solution paths by success and derive dense, process-level reward signals without any human annotation: the tree structure itself becomes the quality filter, naturally sorting good paths from bad (see "Can tree search replace human feedback in LLM training?"). And there's a mechanistic reason all of this matters more than it seems: RL's primary effect appears to be *suppressing* wrong trajectories rather than amplifying correct ones (see "What actually changes inside a model during RL training?"). If negative reinforcement is doing the heavy lifting, then the quality of what you feed in, and what you let through your filter, directly determines what the model learns to avoid versus accidentally learns to repeat.
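How a tree can turn final outcomes into step-level signals can be sketched with a toy example. This is not AlphaLLM's algorithm (the note describes tree search combined with critic models); it only assumes that each leaf carries a success outcome and that an intermediate step is valued by the average of its children. The `Node` class, the step labels, and the outcomes are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    step: str
    outcome: Optional[float] = None           # set only on leaves: 1.0 success, 0.0 failure
    children: List["Node"] = field(default_factory=list)

def value(node):
    """A leaf's value is its outcome; an inner node's value is the mean of its children."""
    if not node.children:
        return node.outcome
    return sum(value(child) for child in node.children) / len(node.children)

root = Node("problem", children=[
    Node("set up equation", children=[
        Node("solve carefully", outcome=1.0),
        Node("solve with sign slip", outcome=0.0),
    ]),
    Node("guess answer", children=[
        Node("guess 7", outcome=0.0),
        Node("guess 12", outcome=0.0),
    ]),
])

# Rank first steps by backed-up success rate; the values act as process-level rewards.
ranked = sorted(root.children, key=value, reverse=True)
for child in ranked:
    print(f"{child.step}: {value(child):.2f}")
```

No step was labeled by a human here: the ranking of "set up equation" over "guess answer" falls out of the leaf outcomes alone, which is the sense in which the tree acts as the quality filter.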
Two adjacent framings stretch the question further. First, quality isn't only about discarding; it's about *ordering*. Training structured tasks before creative ones (BWT-guided scheduling) prevents entropy collapse from damaging open-ended capabilities, a 6.2% gain over joint training from sequencing alone (see "Does training order reshape how models handle different task types?"). So in a mixed noisy environment, *when* a trajectory arrives can matter as much as whether it's kept. Second, the reward itself can be the noise source: binary correctness rewards quietly degrade calibration by failing to penalize confident wrong guesses, fixable by adding a proper scoring rule (see "Does binary reward training hurt model calibration?"). Filtering trajectories won't save you if the metric scoring them is structurally biased.
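The calibration point can be checked with a short worked example. It assumes the combined reward is correctness minus the squared gap between stated confidence and correctness (a Brier-style penalty with equal weighting); the exact weighting used in the cited work is not given here, so treat the formula and the function names as an illustration.

```python
def binary_reward(correct):
    """Correctness only: stated confidence has no effect on the reward."""
    return 1.0 if correct else 0.0

def calibrated_reward(correct, confidence):
    """Correctness minus a Brier-style penalty on the stated confidence."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

def expected_calibrated_reward(p_correct, confidence):
    """Expected reward when the answer is actually right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1 - p_correct) * calibrated_reward(False, confidence))

# A wrong answer earns the same binary reward whether it was hedged or confident.
print(binary_reward(False), binary_reward(False))
# With the penalty, the confident wrong answer is punished far more than the hedged one.
print(calibrated_reward(False, 0.95), calibrated_reward(False, 0.2))

# If the model is right 30% of the time, which stated confidence maximizes expected reward?
grid = [i / 100 for i in range(101)]
best_confidence = max(grid, key=lambda q: expected_calibrated_reward(0.3, q))
print("best stated confidence:", best_confidence)
```

Expanding the expectation gives p² − (q − p)² for true accuracy p and stated confidence q, so the grid search lands on a stated confidence equal to the true accuracy. That is the defining property of a proper scoring rule: honesty about uncertainty is the reward-maximizing strategy.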
One caveat the corpus surfaces: trajectories aren't always interchangeable noise to be cleaned. For in-context sequential decision-making, *full* same-environment trajectories are exactly what enables generalization: the burstiness is the signal, not the noise (see "Why do trajectories matter more than individual examples for in-context learning?"). So the honest answer is that quality filtering helps, but 'quality' is task-dependent: discard the degenerate-shortcut trajectories, but don't strip the structural trajectory information a model needs to learn from.
Sources (7 notes)
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.
AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.
RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.