Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does RL teach reasoning or just when to use it?

Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach models when to activate capabilities they already have? The distinction matters for understanding where reasoning truly emerges.

Note · 2026-02-22 · sourced from Reasoning Architectures

The standard account of thinking models like DeepSeek R1 and OpenAI's o1 attributes their reasoning gains to RL post-training: in that story, RL teaches models how to reason. The "Base Models Know How to Reason" paper inverts this: pre-training is when reasoning capability is acquired; RL teaches models when to deploy it.

The evidence is direct. A hybrid model that combines a base model's reasoning capabilities with a thinking model's deployment decisions — without any weight updates — recovers up to 91% of the performance gap between base and thinking models while steering only 12% of tokens. The steering uses activation-space vectors: directions in the base model that, when added at the right moments, induce reasoning behaviors like backtracking, uncertainty estimation, and subgoal-setting. The thinking model acts as a controller, deciding which steering vectors to activate and when.
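
To make the mechanism concrete, here is a minimal Python sketch of activation-space steering, assuming a HuggingFace-style decoder-only model. The model name, layer index, scale, steering vector, and the every-eighth-token trigger are illustrative placeholders, not the paper's published setup; in the hybrid described above, the directions are extracted from the base model's own activations and a thinking model acts as the controller.

# Minimal sketch of activation-space steering (assumed setup, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a base model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical "backtracking" direction in the residual stream; a random unit
# vector here, purely for illustration.
hidden_size = model.config.hidden_size
steer_vec = torch.randn(hidden_size)
steer_vec = steer_vec / steer_vec.norm()

LAYER, SCALE = 6, 4.0          # which transformer block to steer, and how hard (assumed)
steering_on = {"active": False}  # a controller would toggle this per token

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # add the steering direction to the last token's residual stream.
    if steering_on["active"]:
        hidden = output[0]
        hidden[:, -1, :] += SCALE * steer_vec.to(hidden.dtype)
        return (hidden,) + output[1:]
    return output

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)

prompt = "Solve step by step: 17 * 24 ="
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(64):
        out = model(ids)
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        # In the hybrid setup, the thinking model decides when to fire which
        # steering vector; here we steer a fixed fraction of tokens instead.
        steering_on["active"] = (ids.shape[1] % 8 == 0)

handle.remove()
print(tok.decode(ids[0]))

The design point mirrors the note's claim: the only trainable-looking ingredient is the decision of when to add a direction already present in the base model; the base model's weights never change.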

This reframes what RL actually does. RL doesn't inject new reasoning skills; it biases token generation toward patterns with high reward. If base models already contain the execution-level skills (which they demonstrably do — sampled sufficiently, they produce reasoning traces already present in thinking model outputs), RL is essentially training an attention-based curriculum: produce the right reasoning at the right moment.
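
One way to see why, in generic notation rather than any specific paper's objective: the on-policy policy-gradient update behind RLVR-style post-training is

\[
\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \big].
\]

The expectation runs only over sequences the current policy already samples, so the update sharpens probability mass on high-reward traces the model can already produce and contributes nothing for traces it never generates. That is exactly biasing token generation toward high-reward patterns, not creating patterns from nothing.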

The implications are uncomfortable for the RL-is-essential narrative. Reasoning capability is largely a pre-training phenomenon. RL is a deployment optimizer, not a capability creator. This connects to Can prompt optimization teach models knowledge they lack? — the same principle operating at the training/inference boundary rather than purely at inference time.

Three RLVR findings reinforce this:

First, pass@k analysis shows that RLVR models have narrower capability boundaries than base models: at high k, base models outperform all six tested RLVR algorithms. RLVR is a sampling-efficiency optimizer, not a capability expander (a pass@k sketch follows this list). See Does RLVR actually expand what models can reason about?.

Second, 1-shot RLVR achieves a 37-point jump (MATH500 36%→73.6%) from a single training example, with generalization continuing for 1,400 steps after the model has perfectly memorized its one example. The data is exhausted but activation continues, because the training signal triggers a phase transition in the model's output distribution. See Can a single training example unlock mathematical reasoning?.

Third, spurious rewards (random, incorrect, or format-only) improve Qwen2.5-Math nearly as much as ground-truth rewards, yet fail for Llama and OLMo. The differentiating variable is pretraining strategy, not reward-signal quality. See Why do random rewards improve reasoning for some models but not others?.
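
For the first finding, the comparison rests on the standard unbiased pass@k estimator (Chen et al., 2021). A small Python sketch follows; the per-problem counts are invented to illustrate the crossover pattern, not real measurements.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n total, of which c are correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Invented per-problem correct counts out of n = 256 samples: the RLVR-style
# model solves fewer distinct problems but solves them reliably; the base
# model solves more distinct problems, each only occasionally.
n = 256
base_counts = [2, 1, 1, 1, 0]      # occasional hits on 4 of 5 problems
rlvr_counts = [200, 180, 0, 0, 0]  # reliable hits on only 2 of 5 problems

for k in (1, 8, 64, 256):
    base = sum(pass_at_k(n, c, k) for c in base_counts) / len(base_counts)
    rlvr = sum(pass_at_k(n, c, k) for c in rlvr_counts) / len(rlvr_counts)
    print(f"pass@{k}: base={base:.3f}  rlvr={rlvr:.3f}")

The toy numbers make the shape of the finding visible: higher per-sample accuracy wins at small k, but coverage of distinct problems dominates at large k.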

The practical implication for reasoning-system design: targeted steering of base models may be a more efficient path to reasoning performance than full RL training, particularly in domains where a verifiable reward signal is hard to define.


Source: Reasoning Architectures
