Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Can models learn to internalize search as reasoning?

Does training on linearized search traces teach models to implement search algorithms internally, expanding what they can discover during reasoning? This matters because it could unlock entirely new problem-solving modes beyond standard chain-of-thought.

Note · 2026-02-23 · sourced from Inference time scaling
How should we allocate compute budget at inference time? How should researchers navigate LLM reasoning research?

Standard chain-of-thought produces a reasoning trace. Meta-CoT asks a different question: what search process generates that trace? The framework draws from dual-process theory — CoT is System 1 (pattern-completed reasoning), while Meta-CoT is System 2 (deliberate search over reasoning strategies). The claim is that state-of-the-art models like o1 and DeepSeek-R1 already exhibit behaviors consistent with in-context search: they explore multiple paths, backtrack, and select among candidate reasoning chains rather than generating a single trace sequentially.

The training pipeline makes the internalization concrete: (1) generate linearized search traces from MCTS or A* algorithms applied to reasoning problems, (2) instruction-tune on these traces so the model learns the structure of search, (3) apply RL post-training to refine the search behavior. The linearized traces are the key innovation — they convert tree-structured search into sequential token predictions that autoregressive models can learn.
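A minimal sketch of step (1), the linearization idea: a depth-first search over a toy state space where every visit and every abandoned subtree is emitted as an explicit token, turning the tree into a flat sequence an autoregressive model could train on. The token format (`<visit:…>`, `<backtrack:…>`, `<solution>`) and the toy problem are illustrative assumptions, not the paper's actual trace format.

```python
def linearize_dfs(state, goal, expand, trace=None, depth=0, max_depth=4):
    """Depth-first search that records visits and backtracks as tokens,
    producing a flat trace in place of the original search tree."""
    if trace is None:
        trace = []
    trace.append(f"<visit:{state}>")
    if state == goal:
        trace.append("<solution>")
        return trace, True
    if depth < max_depth:
        for child in expand(state):
            _, found = linearize_dfs(child, goal, expand, trace, depth + 1, max_depth)
            if found:
                return trace, True
    # A failed subtree is not discarded: it becomes explicit backtrack tokens,
    # so the model sees dead ends and recovery, not just the winning path.
    trace.append(f"<backtrack:{state}>")
    return trace, False

# Toy state space: states are integers, children are n+1 and n*2, goal is 6.
expand = lambda n: [n + 1, n * 2]
trace, found = linearize_dfs(1, 6, expand)
print(" ".join(trace))
```

Note the design point this makes concrete: unlike standard CoT data, the trace retains exploration that did not pan out, which is exactly the structure the instruction-tuning stage is meant to teach.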

The speculative but important claim: if a model can learn to implement search algorithms in-context, then RL training on such a model constitutes optimization over algorithms rather than specific outputs. This could yield novel modes of problem-solving that neither symbolic tree-search nor standard CoT can achieve, because the model is not constrained by the specific search algorithm it was trained on — it can adapt and combine strategies.

This extends Does RL teach reasoning or just when to use it? in a significant direction. The timing thesis says RL teaches WHEN to reason; Meta-CoT proposes that the "how", the search process itself, can be internalized through exposure to search traces. If both are correct, RL training operates at two levels: activating reasoning (timing) and shaping the reasoning process (search internalization).

However, the tension with Does the choice of RL algorithm actually matter for reasoning? is notable: if the pretrained prior bounds exploration, then internalized search may still be constrained by what the model already knows. Meta-CoT would need to demonstrate that linearized search traces genuinely expand the exploration boundary rather than just reorganizing existing capability.


Source: Inference time scaling
