Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Do base models already contain hidden reasoning ability?

Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.

Note · 2026-02-22 · sourced from Reasoning Architectures

Five convergent findings build a strong case that reasoning capability is primarily a pre-training phenomenon:

Finding 1 (Base Models paper): Base models already spontaneously demonstrate strong reasoning capabilities and "aha moment" self-reflection patterns when sampled sufficiently. Reasoning traces generated by RL-fine-tuned models are already present in base model outputs — they just appear with lower frequency. RL biases generation toward high-reward patterns; it doesn't create new patterns.
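
A minimal sketch of the sampling argument, assuming a Hugging Face base checkpoint and an ad-hoc list of reflection markers (the model name, prompt, and marker list are all illustrative, not from the paper): sample many completions and count how often self-reflection phrases surface. The point of wide sampling is that the claim is about frequency; greedy decoding would usually miss the traces.

```python
# Sketch: estimate how often a *base* (non-RL-tuned) model emits
# self-reflection patterns when sampled many times.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-Math-7B"  # assumption: any base checkpoint works here
REFLECTION_MARKERS = ["wait,", "let me check", "on second thought", "aha"]

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Q: If 3x + 7 = 25, what is x? Show your reasoning.\nA:"
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Sample widely: the claim is that reasoning traces exist at low
# frequency, so a single greedy sample is not enough to observe them.
out = model.generate(
    **inputs, do_sample=True, temperature=0.8, top_p=0.95,
    num_return_sequences=64, max_new_tokens=256,
)
texts = tok.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

hits = sum(any(m in t.lower() for m in REFLECTION_MARKERS) for t in texts)
print(f"{hits}/64 samples contain a self-reflection marker")
```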

Finding 2 (Steering): A hybrid model using base model weights + thinking model steering vectors recovers 91% of the performance gap to thinking models while steering only 12% of tokens. The reasoning mechanisms (backtracking, uncertainty estimation, subgoal-setting) already exist as directions in the base model's activation space.
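
A sketch of the steering mechanism under stated assumptions: a precomputed steering vector (e.g., mean thinking-model activation minus mean base-model activation at one layer) is added to the residual stream via a forward hook, gated so that only a fraction of tokens are steered. The layer choice, scale, and cosine gate are illustrative stand-ins, not the paper's exact recipe.

```python
# Sketch of gated activation steering on a transformers decoder layer.
import torch
import torch.nn.functional as F

def make_steering_hook(vector, scale=4.0, threshold=0.2):
    """vector: [d_model] steering direction (same device/dtype as the model)."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output  # [B, T, d]
        # Gate: steer only tokens already aligned with the reasoning
        # direction -- a stand-in for "steer only ~12% of tokens".
        sim = F.cosine_similarity(hidden, vector.view(1, 1, -1), dim=-1)  # [B, T]
        gate = (sim > threshold).unsqueeze(-1).to(hidden.dtype)
        steered = hidden + scale * gate * vector
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Usage (assumes a LLaMA-style model; layer index is an illustrative choice):
# handle = model.model.layers[20].register_forward_hook(make_steering_hook(v))
# out = model.generate(**inputs, max_new_tokens=512)
# handle.remove()
```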

Finding 3 (CFT/RLVR): Critique Fine-Tuning on a single problem can unlock reasoning potential at RLVR-level effectiveness. By exposing the model to diverse critiques of varied incorrect solutions to one problem, CFT activates reasoning patterns already latent in the base model without requiring hundreds of GPU hours of RL training.
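
A sketch of how single-problem CFT data could be constructed; the problem, candidate solutions, and critiques below are placeholders, and the (prompt, completion) format assumes a standard SFT pipeline. The leverage comes from critique diversity on one problem, not from data scale.

```python
# Sketch of Critique Fine-Tuning data construction for a *single* problem:
# the model is trained to generate a critique given the problem plus a
# candidate (usually incorrect) solution.
problem = "Compute the sum of the first 100 positive integers."

candidate_solutions = [
    "The sum is 100 * 100 / 2 = 5000.",            # wrong: n*n instead of n*(n+1)
    "Adding 1 through 100 one by one gives 5049.",  # wrong: off-by-one arithmetic
]

critiques = [
    "The formula is n(n+1)/2, so the sum is 100*101/2 = 5050, not 5000.",
    "Careful addition (or Gauss's pairing trick) gives 5050; 5049 drops one term.",
]

def to_cft_example(problem, solution, critique):
    # Input: problem + candidate solution; training target: the critique.
    prompt = f"Problem: {problem}\nCandidate solution: {solution}\nCritique:"
    return {"prompt": prompt, "completion": " " + critique}

dataset = [to_cft_example(problem, s, c)
           for s, c in zip(candidate_solutions, critiques)]
# `dataset` then feeds any standard SFT loop (e.g., TRL's SFTTrainer).
```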

Finding 4 (CoT-Decoding): Pre-trained LLMs inherently contain CoT reasoning paths that can be elicited simply by altering the decoding procedure. Rather than decoding greedily, inspecting the top-k alternative first tokens reveals that CoT paths are frequently present in the model's probability distribution. A confidence metric differentiates CoT from non-CoT paths: the model shows markedly higher confidence in its final answer when a CoT reasoning path is present. The method is entirely unsupervised, requiring no prompting, fine-tuning, or training modifications; it is purely a decoding change. CoT-decoding adds a fourth mechanism to the latent-capability evidence: RL, activation steering, CFT/RLVR, and now decoding all unlock reasoning that is already present.
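
A sketch of CoT-decoding under simplifying assumptions: branch on the top-k first tokens, continue each branch greedily, and score each branch by the mean top-1 vs top-2 probability margin. The paper computes this margin over the answer tokens; here it is averaged over all generated tokens, and no KV cache is used, so this is a slow but faithful-in-spirit illustration.

```python
# Sketch of CoT-decoding: the highest-margin branch is often the CoT path.
import torch

@torch.no_grad()
def cot_decode(model, tok, prompt, k=10, max_new_tokens=128):
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    first_logits = model(ids).logits[0, -1]
    topk = torch.topk(first_logits, k).indices  # k alternative first tokens

    branches = []
    for t in topk:
        seq = torch.cat([ids, t.view(1, 1)], dim=-1)
        margins = []
        for _ in range(max_new_tokens):
            probs = torch.softmax(model(seq).logits[0, -1], dim=-1)
            top2 = torch.topk(probs, 2).values
            margins.append((top2[0] - top2[1]).item())  # confidence signal
            nxt = probs.argmax().view(1, 1)
            seq = torch.cat([seq, nxt], dim=-1)
            if nxt.item() == tok.eos_token_id:
                break
        text = tok.decode(seq[0, ids.shape[1]:], skip_special_tokens=True)
        branches.append((sum(margins) / len(margins), text))

    return max(branches)  # (confidence, text) of the most confident branch
```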

Finding 5 (SAE Reasoning Steering): Sparse autoencoders decompose model activations into interpretable features, revealing latent features causally associated with reasoning behavior. Steering a single identified reasoning feature at the first generation step matches or exceeds CoT performance across six model families up to 70B parameters, without any explicit CoT prompting. The reasoning mode triggers early in generation and is robust enough to override prompt-level no_think instructions. This is the most direct mechanistic evidence yet: the capability is not just present (as CoT-decoding shows) but causally controllable through a single latent dimension. See Can we trigger reasoning without explicit chain-of-thought prompts?. Together with CoT-decoding (Finding 4), this establishes five independent elicitation mechanisms: RL, activation steering, CFT/RLVR, decoding, and SAE feature steering, all converging on the same latent capability.
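
A sketch of SAE feature steering, assuming you already have a trained SAE's decoder matrix for one layer and have identified a reasoning feature index; the layer, feature index, and strength below are illustrative assumptions, not values from the paper.

```python
# Sketch: inject one SAE decoder direction into the residual stream on the
# first forward pass of generation only.
import torch

def steer_first_step(model, layer_idx, sae_decoder, feature_idx, strength=8.0):
    direction = sae_decoder[feature_idx]  # [d_model] decoder row for the feature
    state = {"done": False}

    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if not state["done"]:  # only the first forward pass (the prompt pass)
            hidden = hidden + strength * direction
            state["done"] = True
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return model.model.layers[layer_idx].register_forward_hook(hook)

# Usage (hypothetical layer/feature choices; W_dec is the SAE decoder matrix):
# handle = steer_first_step(model, layer_idx=15, sae_decoder=W_dec, feature_idx=4213)
# out = model.generate(**inputs, max_new_tokens=512)  # no CoT prompt needed
# handle.remove()
```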

The synthesis: post-training methods are selectors, not creators. They select which of the base model's latent capabilities to express reliably in context. The implication is that the main bottleneck for reasoning is not capability acquisition (which happens during pre-training on the world's text) but capability elicitation.

RLVR evidence deepens this: Two additional findings from the RLVR literature reinforce the latent-capability thesis. First, 1-shot RLVR achieves a 37.6-point jump on MATH500 (36% → 73.6%) from a single training example. After the model perfectly memorizes its one example, test accuracy continues improving for 1,400 more steps: post-saturation generalization. The data is exhausted, but activation continues. See Can a single training example unlock mathematical reasoning?. Second, spurious rewards (random, incorrect, or format-only) improve Qwen2.5-Math nearly as much as correct rewards (~21-25% improvement), yet the same spurious rewards fail completely for Llama3.1 and OLMo2. The differentiating variable is not reward quality but pretraining: Qwen's code-reasoning pretraining creates latent capability that any optimization pressure can activate. See Why do random rewards improve reasoning for some models but not others?. Together with the pass@k finding that RLVR narrows capability scope rather than expanding it, the evidence converges: RLVR is a catalyst that triggers a phase transition from the broad pretraining distribution to reliable sampling of correct answers.
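
A sketch of the three reward variants that the spurious-rewards comparison contrasts; the \boxed answer convention and the parser are assumptions, and any RLVR trainer (e.g., a GRPO or PPO loop) would consume these reward functions interchangeably.

```python
# Sketch: correct vs spurious reward functions for an RLVR loop.
import random
import re

def extract_answer(text):
    # Hypothetical parser: pull the contents of the last \boxed{...} span.
    matches = re.findall(r"\\boxed\{(.+?)\}", text)
    return matches[-1].strip() if matches else None

def reward_correct(completion, gold):
    return 1.0 if extract_answer(completion) == gold else 0.0

def reward_random(completion, gold):
    return float(random.random() < 0.5)  # ignores the completion entirely

def reward_format_only(completion, gold):
    # Rewards the presence of a boxed answer, not its correctness.
    return 1.0 if re.search(r"\\boxed\{.+?\}", completion) else 0.0
```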

This partially contradicts Can simple rewards alone teach complex domain reasoning? — that note documents genuine capability emergence in domain-specialized contexts (medical, mathematical). The reconciliation: emergence may reflect reliable expression of latent capability, not creation from scratch. The distinction matters for research direction: if capability already exists, the investment in RL may be better directed toward elicitation methods.

The implication for Can prompt optimization teach models knowledge they lack?: the same principle extends to reasoning capability, not just knowledge.


Source: Reasoning Architectures, RLVR, Cognitive Models Latent
