Can trajectory quality filtering improve model training in noisy environments?

This explores whether selectively keeping or discarding training trajectories by quality, rather than training on everything a model generates, helps when the reward signal is messy, sparse, or unverifiable. The corpus approaches the question from several distinct angles.

The short version from the corpus: yes, and the more interesting finding is *why*. Bad trajectories don't just waste compute; they actively poison the model. The clearest case is what happens when you *don't* filter: training on problems that are nearly impossible for the model causes it to learn degenerate shortcuts (answer repetition, computation-skipping), and those shortcuts then contaminate capabilities the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). The mechanism is subtle: group-relative reward normalization treats a rare lucky success on an impossible problem as a *high-advantage* trajectory and reinforces it hard. So in a noisy regime, the absence of filtering doesn't just slow learning; it amplifies the worst examples. That reframes filtering from an efficiency tweak into damage control.
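The normalization effect described above can be made concrete with a small sketch. It assumes the common mean/standard-deviation form of group-relative normalization and binary rewards; the function names, the 16-rollout group size, and the `min_pass_rate` threshold are illustrative assumptions, not details taken from the cited note.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (assumed mean/std form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_group(rewards, min_pass_rate=0.1):
    """Hypothetical difficulty filter: drop prompts the model almost never solves."""
    return mean(rewards) >= min_pass_rate


# 16 rollouts on a near-impossible problem: a single lucky success.
hard = [1.0] + [0.0] * 15
# 16 rollouts on a moderately hard problem: half succeed.
medium = [1.0] * 8 + [0.0] * 8

adv_hard = group_relative_advantages(hard)
adv_medium = group_relative_advantages(medium)

print("advantage of the lucky success:", round(adv_hard[0], 2))
print("advantage of a routine success:", round(adv_medium[0], 2))
print("keep hard prompt?  ", keep_group(hard))
print("keep medium prompt?", keep_group(medium))
```

Under these assumptions the single lucky success receives an advantage of about 3.87 (the square root of 15), against 1.0 for a success on the balanced prompt, so the rarest and least trustworthy trajectory is the one pushed hardest. The pass-rate filter is one simple way to keep such groups out of the batch.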

The corpus offers a particularly elegant filtering approach where the noise estimate and the filter are the same statistic. Cross-rollout variance, meaning how much a model's multiple attempts at the same query disagree, gets used at two levels at once: to weight individual tokens and to throw out entire queries whose comparisons are degenerate. The reported result is 2–3× faster, more stable training specifically on *unverifiable* tasks where you have no clean ground-truth signal (see "Can one statistical measure serve dual purposes in RL training?"). That's worth sitting with: the model's own internal disagreement is a usable proxy for which trajectories are trustworthy, and it's cheap because it's self-supervised.
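A minimal sketch of that dual use follows. It is not the DRO implementation: it assumes each rollout yields some per-token score (`scores[k][t]` for rollout `k` at token position `t`), and the helper names and the `min_mean_variance` threshold are hypothetical. The point is only that one variance computation can feed both the token weights and the query filter.

```python
from statistics import pvariance


def token_variances(scores):
    """Variance across rollouts at each token position."""
    return [pvariance(column) for column in zip(*scores)]


def token_weights(scores):
    """Token level: weight positions by their share of cross-rollout variance."""
    var = token_variances(scores)
    total = sum(var)
    if total == 0:
        # No disagreement anywhere: fall back to uniform weights.
        return [1.0 / len(var)] * len(var)
    return [v / total for v in var]


def keep_query(scores, min_mean_variance=1e-3):
    """Query level: drop queries whose rollouts are indistinguishable."""
    var = token_variances(scores)
    return sum(var) / len(var) >= min_mean_variance


# Three rollouts, three token positions; only the middle position disagrees.
informative = [[0.9, 0.2, 0.8], [0.9, 0.7, 0.8], [0.9, 0.4, 0.8]]
# Three rollouts that score identically everywhere.
degenerate = [[0.5, 0.5, 0.5] for _ in range(3)]

print(token_weights(informative))
print(keep_query(informative), keep_query(degenerate))
```

In this toy case all of the weight lands on the one position where the rollouts disagree, and the degenerate query is filtered out because comparing its rollouts carries no information.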

Filtering is one lever; *generating* better quality signals is the complementary one. AlphaLLM uses tree search to rank solution paths by success and derive dense, process-level reward signals without any human annotation. The tree structure itself becomes the quality filter, naturally sorting good paths from bad (see "Can tree search replace human feedback in LLM training?"). And there's a mechanistic reason all of this matters more than it seems: RL's primary effect appears to be *suppressing* wrong trajectories rather than amplifying correct ones (see "What actually changes inside a model during RL training?"). If negative reinforcement is doing the heavy lifting, then the quality of what you feed in, and what you let through your filter, directly determines what the model learns to avoid versus accidentally learns to repeat.
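To illustrate how a search tree can act as its own quality filter, here is a toy sketch. It is not AlphaLLM's method (which the source note describes as combining tree search with three critic models); it simply assumes every leaf carries a success outcome and scores each intermediate step by the average outcome of the leaves below it. The `Node` class and both helpers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    step: str
    outcome: Optional[float] = None          # set on leaves only (1.0 = success)
    children: List["Node"] = field(default_factory=list)


def value(node):
    """Step-level value: a leaf's own outcome, else the mean value of its children."""
    if not node.children:
        return node.outcome
    return sum(value(child) for child in node.children) / len(node.children)


def ranked_children(node):
    """Sort candidate next steps from most to least promising."""
    return sorted(node.children, key=value, reverse=True)


step_a = Node("A", children=[Node("A1", 1.0), Node("A2", 1.0), Node("A3", 0.0)])
step_b = Node("B", children=[Node("B1", 0.0), Node("B2", 0.0)])
root = Node("root", children=[step_b, step_a])

for child in ranked_children(root):
    print(child.step, round(value(child), 2))
```

Each intermediate step ends up with a dense score derived purely from outcomes further down the tree, which is the sense in which the structure ranks paths without a human label.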

Two adjacent framings stretch the question further. First, quality isn't only about discarding; it's about *ordering*. Training structured tasks before creative ones (BWT-guided scheduling) prevents entropy collapse from damaging open-ended capabilities, a 6.2% gain over joint training from sequencing alone (see "Does training order reshape how models handle different task types?"). So in a mixed noisy environment, *when* a trajectory arrives can matter as much as whether it's kept. Second, the reward itself can be the noise source: binary correctness rewards quietly degrade calibration by rewarding confident wrong guesses, fixable by adding a proper scoring rule (see "Does binary reward training hurt model calibration?"). Filtering trajectories won't save you if the metric scoring them is structurally biased.
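The calibration point can be checked numerically. The sketch below assumes the combined reward is the correctness indicator minus the squared error between stated confidence and correctness, which is one way of adding the Brier score named in the source note; the exact form used there is not given, so treat the formula and function names as assumptions.

```python
def expected_binary_reward(p_correct, confidence):
    """Binary reward ignores stated confidence, so its expectation is just p_correct."""
    return p_correct


def expected_brier_augmented_reward(p_correct, confidence):
    """Expectation of: correctness - (confidence - correctness)**2 (assumed form)."""
    reward_if_correct = 1.0 - (1.0 - confidence) ** 2
    reward_if_wrong = 0.0 - confidence ** 2
    return p_correct * reward_if_correct + (1.0 - p_correct) * reward_if_wrong


p_correct = 0.3                       # the answer is right 30% of the time
grid = [i / 100 for i in range(101)]  # candidate stated confidences

best_confidence = max(grid, key=lambda q: expected_brier_augmented_reward(p_correct, q))

print("binary reward at confidence 0.3: ", expected_binary_reward(p_correct, 0.3))
print("binary reward at confidence 0.99:", expected_binary_reward(p_correct, 0.99))
print("best confidence under the Brier-augmented reward:", best_confidence)
```

Under the binary reward, claiming 99% confidence costs nothing. With the squared-error term the expected reward is maximized when stated confidence equals the true probability of being correct, because the derivative with respect to confidence is 2 × (p_correct − confidence).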

One caveat the corpus surfaces: trajectories aren't always interchangeable noise to be cleaned. For in-context sequential decision-making, *full* same-environment trajectories are exactly what enables generalization. The burstiness is the signal, not the noise (see "Why do trajectories matter more than individual examples for in-context learning?"). So the honest answer is that quality filtering helps, but 'quality' is task-dependent: discard the degenerate-shortcut trajectories, but don't strip the structural trajectory information a model needs to learn from.

Sources (7 notes)

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

What actually changes inside a model during RL training?

RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Why do trajectories matter more than individual examples for in-context learning?

In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.
