Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Do base models already contain hidden reasoning ability?

Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.

Note · 2026-02-22 · sourced from Reasoning Architectures

Five convergent findings build a strong case that reasoning capability is primarily a pre-training phenomenon:

Finding 1 (Base Models paper): Base models already spontaneously demonstrate strong reasoning capabilities and "aha moment" self-reflection patterns when sampled sufficiently. Reasoning traces generated by RL-fine-tuned models are already present in base model outputs — they just appear with lower frequency. RL biases generation toward high-reward patterns; it doesn't create new patterns.
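
A minimal sketch of the sampling argument, assuming a Hugging Face base checkpoint and an ad-hoc list of reflection markers (the model name, prompt, and marker list are all illustrative, not from the paper): sample many completions and count how often self-reflection phrases surface. The point of wide sampling is that the claim is about frequency; greedy decoding would usually miss the traces.

```python
# Sketch: estimate how often a *base* (non-RL-tuned) model emits
# self-reflection patterns when sampled many times.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-Math-7B"  # assumption: any base checkpoint works here
REFLECTION_MARKERS = ["wait,", "let me check", "on second thought", "aha"]

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Q: If 3x + 7 = 25, what is x? Show your reasoning.\nA:"
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Sample widely: the claim is that reasoning traces exist at low
# frequency, so a single greedy sample is not enough to observe them.
out = model.generate(
    **inputs, do_sample=True, temperature=0.8, top_p=0.95,
    num_return_sequences=64, max_new_tokens=256,
)
texts = tok.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

hits = sum(any(m in t.lower() for m in REFLECTION_MARKERS) for t in texts)
print(f"{hits}/64 samples contain a self-reflection marker")
```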

Finding 2 (Steering): A hybrid model using base model weights + thinking model steering vectors recovers 91% of the performance gap to thinking models while steering only 12% of tokens. The reasoning mechanisms (backtracking, uncertainty estimation, subgoal-setting) already exist as directions in the base model's activation space.
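
A sketch of the steering mechanism under stated assumptions: a precomputed steering vector (e.g., mean thinking-model activation minus mean base-model activation at one layer) is added to the residual stream via a forward hook, gated so that only a fraction of tokens are steered. The layer choice, scale, and cosine gate are illustrative stand-ins, not the paper's exact recipe.

```python
# Sketch of gated activation steering on a transformers decoder layer.
import torch
import torch.nn.functional as F

def make_steering_hook(vector, scale=4.0, threshold=0.2):
    """vector: [d_model] steering direction (same device/dtype as the model)."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output  # [B, T, d]
        # Gate: steer only tokens already aligned with the reasoning
        # direction -- a stand-in for "steer only ~12% of tokens".
        sim = F.cosine_similarity(hidden, vector.view(1, 1, -1), dim=-1)  # [B, T]
        gate = (sim > threshold).unsqueeze(-1).to(hidden.dtype)
        steered = hidden + scale * gate * vector
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Usage (assumes a LLaMA-style model; layer index is an illustrative choice):
# handle = model.model.layers[20].register_forward_hook(make_steering_hook(v))
# out = model.generate(**inputs, max_new_tokens=512)
# handle.remove()
```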

Finding 3 (CFT/RLVR): Critique Fine-Tuning on a single problem can unlock reasoning potential at RLVR-level effectiveness. By exposing the model to diverse critiques of varied incorrect solutions to one problem, CFT activates reasoning patterns already latent in the base model without requiring hundreds of GPU hours of RL training.
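
A sketch of how single-problem CFT data could be constructed; the problem, candidate solutions, and critiques below are placeholders, and the (prompt, completion) format assumes a standard SFT pipeline. The leverage comes from critique diversity on one problem, not from data scale.

```python
# Sketch of Critique Fine-Tuning data construction for a *single* problem:
# the model is trained to generate a critique given the problem plus a
# candidate (usually incorrect) solution.
problem = "Compute the sum of the first 100 positive integers."

candidate_solutions = [
    "The sum is 100 * 100 / 2 = 5000.",            # wrong: n*n instead of n*(n+1)
    "Adding 1 through 100 one by one gives 5049.",  # wrong: off-by-one arithmetic
]

critiques = [
    "The formula is n(n+1)/2, so the sum is 100*101/2 = 5050, not 5000.",
    "Careful addition (or Gauss's pairing trick) gives 5050; 5049 drops one term.",
]

def to_cft_example(problem, solution, critique):
    # Input: problem + candidate solution; training target: the critique.
    prompt = f"Problem: {problem}\nCandidate solution: {solution}\nCritique:"
    return {"prompt": prompt, "completion": " " + critique}

dataset = [to_cft_example(problem, s, c)
           for s, c in zip(candidate_solutions, critiques)]
# `dataset` then feeds any standard SFT loop (e.g., TRL's SFTTrainer).
```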

Finding 4 (CoT-Decoding): Pre-trained LLMs inherently contain CoT reasoning paths that can be elicited simply by altering the decoding procedure. Rather than decoding greedily, inspecting the top-k alternative first tokens reveals that CoT paths are frequently present in the model's probability distribution. A confidence metric differentiates CoT from non-CoT paths: the model shows markedly higher confidence in its final answer when a CoT reasoning path is present. The method is entirely unsupervised, requiring no prompting, fine-tuning, or training modifications; it is purely a decoding change. CoT-decoding adds a fourth mechanism to the latent-capability evidence: RL, activation steering, CFT/RLVR, and now decoding all unlock reasoning that is already present.
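
A sketch of CoT-decoding under simplifying assumptions: branch on the top-k first tokens, continue each branch greedily, and score each branch by the mean top-1 vs top-2 probability margin. The paper computes this margin over the answer tokens; here it is averaged over all generated tokens, and no KV cache is used, so this is a slow but faithful-in-spirit illustration.

```python
# Sketch of CoT-decoding: the highest-margin branch is often the CoT path.
import torch

@torch.no_grad()
def cot_decode(model, tok, prompt, k=10, max_new_tokens=128):
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    first_logits = model(ids).logits[0, -1]
    topk = torch.topk(first_logits, k).indices  # k alternative first tokens

    branches = []
    for t in topk:
        seq = torch.cat([ids, t.view(1, 1)], dim=-1)
        margins = []
        for _ in range(max_new_tokens):
            probs = torch.softmax(model(seq).logits[0, -1], dim=-1)
            top2 = torch.topk(probs, 2).values
            margins.append((top2[0] - top2[1]).item())  # confidence signal
            nxt = probs.argmax().view(1, 1)
            seq = torch.cat([seq, nxt], dim=-1)
            if nxt.item() == tok.eos_token_id:
                break
        text = tok.decode(seq[0, ids.shape[1]:], skip_special_tokens=True)
        branches.append((sum(margins) / len(margins), text))

    return max(branches)  # (confidence, text) of the most confident branch
```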

Finding 5 (SAE Reasoning Steering): Sparse autoencoders decompose model activations into interpretable features, revealing latent features causally associated with reasoning behavior. Steering a single identified reasoning feature at the first generation step matches or exceeds CoT performance across six model families up to 70B parameters, without any explicit CoT prompting. The reasoning mode triggers early in generation and is robust enough to override prompt-level no_think instructions. This is the most direct mechanistic evidence yet: the capability is not just present (as CoT-decoding shows) but causally controllable through a single latent dimension. See Can we trigger reasoning without explicit chain-of-thought prompts?. Together with CoT-decoding (Finding 4), this establishes five independent elicitation mechanisms: RL, activation steering, CFT/RLVR, decoding, and SAE feature steering, all converging on the same latent capability.
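
A sketch of SAE feature steering, assuming you already have a trained SAE's decoder matrix for one layer and have identified a reasoning feature index; the layer, feature index, and strength below are illustrative assumptions, not values from the paper.

```python
# Sketch: inject one SAE decoder direction into the residual stream on the
# first forward pass of generation only.
import torch

def steer_first_step(model, layer_idx, sae_decoder, feature_idx, strength=8.0):
    direction = sae_decoder[feature_idx]  # [d_model] decoder row for the feature
    state = {"done": False}

    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if not state["done"]:  # only the first forward pass (the prompt pass)
            hidden = hidden + strength * direction
            state["done"] = True
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return model.model.layers[layer_idx].register_forward_hook(hook)

# Usage (hypothetical layer/feature choices; W_dec is the SAE decoder matrix):
# handle = steer_first_step(model, layer_idx=15, sae_decoder=W_dec, feature_idx=4213)
# out = model.generate(**inputs, max_new_tokens=512)  # no CoT prompt needed
# handle.remove()
```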

The synthesis: post-training methods are selectors, not creators. They select which of the base model's latent capabilities to express reliably in context. The implication is that the main bottleneck for reasoning is not capability acquisition (which happens during pre-training on the world's text) but capability elicitation.

RLVR evidence deepens this: Two additional findings from the RLVR literature reinforce the latent-capability thesis. First, 1-shot RLVR achieves a 37.6-point jump on MATH500 (36% → 73.6%) from a single training example. After the model perfectly memorizes its one example, test accuracy continues improving for 1,400 more steps: post-saturation generalization. The data is exhausted, but activation continues. See Can a single training example unlock mathematical reasoning?. Second, spurious rewards (random, incorrect, or format-only) improve Qwen2.5-Math nearly as much as correct rewards (~21-25% improvement), yet the same spurious rewards fail completely for Llama3.1 and OLMo2. The differentiating variable is not reward quality but pretraining: Qwen's code-reasoning pretraining creates latent capability that any optimization pressure can activate. See Why do random rewards improve reasoning for some models but not others?. Together with the pass@k finding that RLVR narrows capability scope rather than expanding it, the evidence converges: RLVR is a catalyst that triggers a phase transition from the broad pretraining distribution to reliable sampling of correct answers.
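
A sketch of the three reward variants that the spurious-rewards comparison contrasts; the \boxed answer convention and the parser are assumptions, and any RLVR trainer (e.g., a GRPO or PPO loop) would consume these reward functions interchangeably.

```python
# Sketch: correct vs spurious reward functions for an RLVR loop.
import random
import re

def extract_answer(text):
    # Hypothetical parser: pull the contents of the last \boxed{...} span.
    matches = re.findall(r"\\boxed\{(.+?)\}", text)
    return matches[-1].strip() if matches else None

def reward_correct(completion, gold):
    return 1.0 if extract_answer(completion) == gold else 0.0

def reward_random(completion, gold):
    return float(random.random() < 0.5)  # ignores the completion entirely

def reward_format_only(completion, gold):
    # Rewards the presence of a boxed answer, not its correctness.
    return 1.0 if re.search(r"\\boxed\{.+?\}", completion) else 0.0
```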

This partially contradicts Can simple rewards alone teach complex domain reasoning? — that note documents genuine capability emergence in domain-specialized contexts (medical, mathematical). The reconciliation: emergence may reflect reliable expression of latent capability, not creation from scratch. The distinction matters for research direction: if capability already exists, the investment in RL may be better directed toward elicitation methods.

The implication for Can prompt optimization teach models knowledge they lack?: the same principle extends to reasoning capability, not just knowledge.


Source: Reasoning Architectures, RLVR, Cognitive Models Latent
