Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does training on messy search processes improve reasoning?

Can language models learn better problem-solving by observing full exploration trajectories—including mistakes and backtracking—rather than only optimal solutions? This matters because current LMs rarely see the decision-making process itself.

Note · 2026-02-22 · sourced from Question Answer Search
How should we allocate compute budget at inference time? · What kind of thing is an LLM really? · How should researchers navigate LLM reasoning research?

Language models are almost never shown fruitful mistakes during training. They see only the outcome of a decision-making process, not the process itself. Stream of Search (SoS) demonstrates what happens when you change this: train LMs on the full search process — exploration, dead ends, backtracking, pruning — represented as a serialized string.

The results on Countdown (a game in which given numbers must be combined with arithmetic operations to reach a target number): SoS-pretrained models achieve 25% higher accuracy than models trained to predict only the optimal trajectory. The improvement comes from learning to search rather than learning to predict.
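The serialized-trace idea can be made concrete with a toy Countdown search. The sketch below is illustrative and not the paper's actual serialization: a depth-first search over pairwise arithmetic operations that logs every attempt, dead end, and backtrack as plain text, which is the kind of string an SoS model would be trained to predict. The trace vocabulary ("Try", "Dead end", "Backtrack", "Goal") is invented for the example.

```python
# Toy SoS-style trace for Countdown: DFS over pairwise operations,
# recording exploration, dead ends, and backtracking as text lines.
# (Illustrative sketch; marker names are not the paper's format.)

def countdown_trace(numbers, target):
    """Return (solved, trace_lines) for a DFS over pairwise arithmetic ops."""
    trace = []

    def dfs(nums):
        if len(nums) == 1:
            if nums[0] == target:
                trace.append(f"Goal: reached {target}")
                return True
            trace.append(f"Dead end: {nums[0]} != {target}")
            return False
        for i in range(len(nums)):
            for j in range(len(nums)):
                if i == j:
                    continue
                a, b = nums[i], nums[j]
                rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
                candidates = [(a + b, f"{a}+{b}"),
                              (a * b, f"{a}*{b}"),
                              (a - b, f"{a}-{b}")]
                if b != 0 and a % b == 0:
                    candidates.append((a // b, f"{a}/{b}"))
                for value, expr in candidates:
                    trace.append(f"Try: {expr} = {value}, remaining {sorted(rest + [value])}")
                    if dfs(rest + [value]):
                        return True
                    trace.append(f"Backtrack from {expr}")
        return False

    solved = dfs(list(numbers))
    return solved, trace

solved, trace = countdown_trace([4, 6, 10], 100)
```

Serializing the whole `trace` (not just the winning line of play) is the format intervention: the model sees the dead ends and the recoveries, not only the solution.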

SoS systematizes search components into a unified language that captures multiple symbolic search strategies (BFS, DFS, and their variations) in a common serialized format. This is "intrinsic" search — the model learns an internal policy for exploration — unlike "extrinsic" approaches such as Tree of Thoughts (ToT) and Graph of Thoughts (GoT) that use fixed external search strategies and call the LM only for generation and evaluation. The distinction matters: extrinsic methods have high inference costs and fixed strategies, while intrinsic search is learned and adaptive.
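A minimal sketch of what a unified serialization can look like: BFS and DFS differ only in frontier discipline (queue vs stack), so one emitter can produce traces for both in the same vocabulary. The token names ("Expand", "Generate", "Goal") are invented for illustration, not SoS's actual markers.

```python
from collections import deque

def serialize_search(start, neighbors, goal, strategy="bfs"):
    """Emit a text trace of a search; BFS and DFS share one vocabulary."""
    frontier = deque([start])
    seen = {start}
    trace = []
    while frontier:
        # The only strategy-dependent line: queue (BFS) vs stack (DFS).
        state = frontier.popleft() if strategy == "bfs" else frontier.pop()
        trace.append(f"Expand: {state}")
        if state == goal:
            trace.append(f"Goal: {state}")
            return trace
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                trace.append(f"Generate: {nxt}")
                frontier.append(nxt)
    trace.append("Exhausted")
    return trace

def nbrs(n):
    # Toy state space: successors are n+1 and n*2, cut off past the goal.
    return [n + 1, n * 2] if n < 6 else []

bfs_trace = serialize_search(1, nbrs, 6, "bfs")
dfs_trace = serialize_search(1, nbrs, 6, "dfs")
```

Both traces use identical tokens but order expansions differently, which is exactly why the serialization format can steer the learned strategy toward breadth-first or depth-first behavior.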

The most striking finding: SoS models learn internal world models for search. Unlike symbolic search that relies on an explicit environment model, SoS models simulate state transitions themselves. This means the model can generalize its search strategy to novel problems without an explicitly programmed transition function.

This is distinct from the Do reasoning traces need to be semantically correct? finding. That result shows trace CONTENT is dispensable — semantically irrelevant tokens still provide computational scaffolding. SoS shows something different: the search PROCESS itself is valuable training data. It's not that mistakes don't matter (corrupted traces) — it's that the experience of making and recovering from mistakes teaches something that pure success doesn't.

The self-improvement connection is direct: after SoS pretraining, models can improve via STaR (Self-Taught Reasoner) and APA (Advantage-Induced Policy Alignment), optimizing for correctness on top of the learned search capability. This addresses the snowballing-error problem (each wrong step makes subsequent steps more likely to be wrong) by teaching models to BACKTRACK rather than compound errors.
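Schematically, one STaR-style round on top of a search-capable model reduces to sample, verify, fine-tune. In this sketch, `model_sample`, `verify`, and `finetune` are hypothetical stand-ins for the real model, checker, and training calls:

```python
# Schematic STaR-style self-improvement round (hypothetical interfaces):
# sample full search trajectories, keep only verified-correct ones,
# and use them as the next fine-tuning set.

def star_round(problems, model_sample, verify, finetune):
    kept = []
    for problem in problems:
        trajectory = model_sample(problem)   # full search trace, mistakes included
        if verify(problem, trajectory):      # filter on final correctness only
            kept.append((problem, trajectory))
    finetune(kept)                           # reinforce successful searches
    return kept
```

The key point is that only outcomes are filtered: the retained trajectories still contain dead ends and backtracking, so the search capability is reinforced rather than distilled away.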

Since Why do reasoning LLMs fail at deeper problem solving?, SoS provides a potential training solution: if wandering exploration is the problem, training on systematic search processes (including recovery from wrong paths) could teach the systematic search strategy that current reasoning models lack.

SoS is fundamentally a training FORMAT intervention. Since Does training data format shape reasoning strategy more than domain?, representing the search process as serialized strings -- with explicit backtracking markers, dead-end annotations, and pruning decisions -- is a format choice that shapes the resulting reasoning strategy. SoS training on BFS-like exploration vs DFS-like exploration mirrors the MC/FF format distinction: the serialization format determines whether the model learns breadth-first or depth-first search behavior. And since How quickly do errors compound during model self-training?, SoS's inclusion of backtracking in training data directly addresses the avalanching vulnerability -- a model that has learned to recognize dead ends and backtrack from them is structurally less susceptible to compounding errors in self-training loops.




training on the search process including mistakes and backtracking produces better problem-solvers than training on optimal trajectories only