Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does training on messy search processes improve reasoning?

Can language models learn better problem-solving by observing full exploration trajectories—including mistakes and backtracking—rather than only optimal solutions? This matters because current LMs rarely see the decision-making process itself.

Note · 2026-02-22 · sourced from Question Answer Search
How should we allocate compute budget at inference time? · What kind of thing is an LLM really? · How should researchers navigate LLM reasoning research?

Language models are almost never shown fruitful mistakes during training. They see only the outcome of a decision-making process, not the process itself. Stream of Search (SoS) demonstrates what happens when you change this: train LMs on the full search process — exploration, dead ends, backtracking, pruning — represented as a serialized string.

The results on Countdown (a game in which given numbers must be combined with arithmetic operations to reach a target number): SoS-pretrained models achieve 25% higher accuracy than models trained to predict only the optimal trajectory. The improvement comes from learning to search rather than learning to predict.
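The serialized-trace idea can be made concrete with a toy Countdown search. The sketch below is illustrative and not the paper's actual serialization: a depth-first search over pairwise arithmetic operations that logs every attempt, dead end, and backtrack as plain text, which is the kind of string an SoS model would be trained to predict. The trace vocabulary ("Try", "Dead end", "Backtrack", "Goal") is invented for the example.

```python
# Toy SoS-style trace for Countdown: DFS over pairwise operations,
# recording exploration, dead ends, and backtracking as text lines.
# (Illustrative sketch; marker names are not the paper's format.)

def countdown_trace(numbers, target):
    """Return (solved, trace_lines) for a DFS over pairwise arithmetic ops."""
    trace = []

    def dfs(nums):
        if len(nums) == 1:
            if nums[0] == target:
                trace.append(f"Goal: reached {target}")
                return True
            trace.append(f"Dead end: {nums[0]} != {target}")
            return False
        for i in range(len(nums)):
            for j in range(len(nums)):
                if i == j:
                    continue
                a, b = nums[i], nums[j]
                rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
                candidates = [(a + b, f"{a}+{b}"),
                              (a * b, f"{a}*{b}"),
                              (a - b, f"{a}-{b}")]
                if b != 0 and a % b == 0:
                    candidates.append((a // b, f"{a}/{b}"))
                for value, expr in candidates:
                    trace.append(f"Try: {expr} = {value}, remaining {sorted(rest + [value])}")
                    if dfs(rest + [value]):
                        return True
                    trace.append(f"Backtrack from {expr}")
        return False

    solved = dfs(list(numbers))
    return solved, trace

solved, trace = countdown_trace([4, 6, 10], 100)
```

Serializing the whole `trace` (not just the winning line of play) is the format intervention: the model sees the dead ends and the recoveries, not only the solution.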

SoS systematizes search components into a unified language that captures multiple symbolic search strategies (BFS, DFS, and their variations) in a common serialized format. This is "intrinsic" search — the model learns an internal policy for exploration — unlike "extrinsic" approaches such as Tree of Thoughts (ToT) and Graph of Thoughts (GoT) that use fixed external search strategies and call the LM only for generation and evaluation. The distinction matters: extrinsic methods have high inference costs and fixed strategies, while intrinsic search is learned and adaptive.
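A minimal sketch of what a unified serialization can look like: BFS and DFS differ only in frontier discipline (queue vs stack), so one emitter can produce traces for both in the same vocabulary. The token names ("Expand", "Generate", "Goal") are invented for illustration, not SoS's actual markers.

```python
from collections import deque

def serialize_search(start, neighbors, goal, strategy="bfs"):
    """Emit a text trace of a search; BFS and DFS share one vocabulary."""
    frontier = deque([start])
    seen = {start}
    trace = []
    while frontier:
        # The only strategy-dependent line: queue (BFS) vs stack (DFS).
        state = frontier.popleft() if strategy == "bfs" else frontier.pop()
        trace.append(f"Expand: {state}")
        if state == goal:
            trace.append(f"Goal: {state}")
            return trace
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                trace.append(f"Generate: {nxt}")
                frontier.append(nxt)
    trace.append("Exhausted")
    return trace

def nbrs(n):
    # Toy state space: successors are n+1 and n*2, cut off past the goal.
    return [n + 1, n * 2] if n < 6 else []

bfs_trace = serialize_search(1, nbrs, 6, "bfs")
dfs_trace = serialize_search(1, nbrs, 6, "dfs")
```

Both traces use identical tokens but order expansions differently, which is exactly why the serialization format can steer the learned strategy toward breadth-first or depth-first behavior.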

The most striking finding: SoS models learn internal world models for search. Unlike symbolic search that relies on an explicit environment model, SoS models simulate state transitions themselves. This means the model can generalize its search strategy to novel problems without an explicitly programmed transition function.

This is distinct from the Do reasoning traces need to be semantically correct? finding. That result shows trace CONTENT is dispensable — semantically irrelevant tokens still provide computational scaffolding. SoS shows something different: the search PROCESS itself is valuable training data. It's not that mistakes don't matter (corrupted traces) — it's that the experience of making and recovering from mistakes teaches something that pure success doesn't.

The self-improvement connection is direct: after SoS pretraining, models can improve via STaR (Self-Taught Reasoner) and APA (Advantage-Induced Policy Alignment), optimizing for correctness on top of the learned search capability. This addresses the snowballing-error problem (each wrong step makes subsequent steps more likely to be wrong) by teaching models to BACKTRACK rather than compound errors.
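Schematically, one STaR-style round on top of a search-capable model reduces to sample, verify, fine-tune. In this sketch, `model_sample`, `verify`, and `finetune` are hypothetical stand-ins for the real model, checker, and training calls:

```python
# Schematic STaR-style self-improvement round (hypothetical interfaces):
# sample full search trajectories, keep only verified-correct ones,
# and use them as the next fine-tuning set.

def star_round(problems, model_sample, verify, finetune):
    kept = []
    for problem in problems:
        trajectory = model_sample(problem)   # full search trace, mistakes included
        if verify(problem, trajectory):      # filter on final correctness only
            kept.append((problem, trajectory))
    finetune(kept)                           # reinforce successful searches
    return kept
```

The key point is that only outcomes are filtered: the retained trajectories still contain dead ends and backtracking, so the search capability is reinforced rather than distilled away.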

Since Why do reasoning LLMs fail at deeper problem solving?, SoS provides a potential training solution: if wandering exploration is the problem, training on systematic search processes (including recovery from wrong paths) could teach the systematic search strategy that current reasoning models lack.

SoS is fundamentally a training FORMAT intervention. Since Does training data format shape reasoning strategy more than domain?, representing the search process as serialized strings -- with explicit backtracking markers, dead-end annotations, and pruning decisions -- is a format choice that shapes the resulting reasoning strategy. SoS training on BFS-like exploration vs DFS-like exploration mirrors the MC/FF format distinction: the serialization format determines whether the model learns breadth-first or depth-first search behavior. And since How quickly do errors compound during model self-training?, SoS's inclusion of backtracking in training data directly addresses the avalanching vulnerability -- a model that has learned to recognize dead ends and backtrack from them is structurally less susceptible to compounding errors in self-training loops.




training on the search process including mistakes and backtracking produces better problem-solvers than training on optimal trajectories only