Does RL improve domain reasoning by adding knowledge or removing it?
When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
The Knowledge or Reasoning paper's KI/InfoGain framework permits a more precise account of what RL training does to domain reasoning than "RL makes models better at reasoning." The specific finding: RL enhances medical reasoning by pruning inaccurate or irrelevant knowledge from reasoning paths, improving both reasoning accuracy and knowledge correctness (an average KI gain of +12.4 points). RL does not appear to add domain facts the model didn't already know; it makes the model less likely to invoke incorrect domain knowledge when reasoning.
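To make the decomposition concrete, here is a minimal sketch of how per-trace KI and InfoGain scores could be computed. The step segmentation, the `verify_fact` judge, and the `answer_confidence` probe are all assumptions for illustration; the paper's actual metric definitions may differ in detail.

```python
from typing import Callable, List

def knowledge_index(steps: List[str],
                    verify_fact: Callable[[str], float]) -> float:
    """Average factual correctness of the knowledge each step invokes.

    verify_fact is a hypothetical verifier (an LLM judge or a lookup
    against a medical knowledge base) returning a score in [0, 1].
    """
    if not steps:
        return 0.0
    return sum(verify_fact(s) for s in steps) / len(steps)

def info_gain(steps: List[str],
              answer_confidence: Callable[[List[str]], float]) -> float:
    """Average per-step change in the model's confidence in the correct
    final answer: how much each step actually advances the solution.

    answer_confidence is a hypothetical probe returning the probability
    assigned to the ground-truth answer given a prefix of the trace.
    """
    if not steps:
        return 0.0
    gains, prev = [], answer_confidence([])
    for i in range(1, len(steps) + 1):
        cur = answer_confidence(steps[:i])
        gains.append(cur - prev)
        prev = cur
    return sum(gains) / len(gains)
```

On this reading, a KI gain means fewer steps assert incorrect facts, which is exactly the pruning claim: the trace changes, not the underlying fact store.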
This is structurally different from the usual framing of RL's role. The standard story is: SFT gives the model capability, RL aligns that capability with desired behavior. But in domain-specific contexts the alignment takes a more specific form: RL suppresses wrong pattern activations during reasoning rather than teaching the model new things.
The medical AI context makes this clear. Medical reasoning is knowledge-dominant — knowledge accuracy correlates more strongly with final accuracy than reasoning quality across benchmarks. SFT raises knowledge levels (KI +6.2% on medical tasks) but also introduces verbose or suboptimal reasoning, reducing InfoGain. RL corrects this: it rewards factual correctness and penalizes paths that introduce inaccurate medical claims, effectively performing a kind of knowledge path surgery on the training distribution.
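In reward terms, the "surgery" can be pictured as a trajectory-level reward that pays for a correct final answer but docks trajectories whose steps assert wrong domain facts. A minimal sketch, assuming a hypothetical `claim_checker` that flags inaccurate medical claims; this illustrates the shaping idea, not the actual training setup of any of these papers.

```python
from typing import Callable, List

def pruning_reward(steps: List[str],
                   final_correct: bool,
                   claim_checker: Callable[[str], bool],
                   penalty: float = 0.5) -> float:
    """Reward correct answers, penalize steps that assert wrong facts.

    claim_checker is a hypothetical judge returning True when a step
    introduces an inaccurate domain claim. Trajectories that reach the
    right answer through wrong facts are down-weighted relative to
    clean ones, so policy optimization shifts mass away from them.
    """
    reward = 1.0 if final_correct else 0.0
    reward -= penalty * sum(claim_checker(s) for s in steps)
    return reward
```

Under a reward like this, two traces with the same final answer are no longer equivalent: the one that routed through a wrong fact is penalized, which is what distinguishes pruning from pure answer-accuracy training.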
This connects to, but is distinct from, Does policy entropy collapse limit reasoning performance in RL?, which describes RL's effect on exploration diversity during training. That insight concerns the training dynamics that limit scaling; this one concerns the mechanism through which RL improves domain-specific output quality. The two claims operate at different levels: collapse is a training-time constraint, pruning is a mechanism-of-action description.
The practical implication: for knowledge-intensive domains, RL is not optional enhancement on top of SFT — it is the correction mechanism that compensates for SFT's tendency to memorize answer patterns rather than reason correctly. Does supervised fine-tuning actually improve reasoning quality? documents the problem RL is solving.
The "RL Squeezes, SFT Expands" paper provides graph-topology evidence for this pruning mechanism. RL training compresses the diversity of reasoning paths while SFT expands them — this compression IS the pruning. RL doesn't add new paths; it removes low-reward ones, concentrating probability mass on the subset of reasoning trajectories that lead to correct outcomes. Since Does negative reinforcement alone outperform full reinforcement learning?, the pruning mechanism may be RL's primary contribution: suppressing wrong paths matters more than reinforcing right ones.
Source: Domain Specialization
Related concepts in this collection
- Does supervised fine-tuning actually improve reasoning quality?
  While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
  (the problem RL is correcting: SFT memorizes paths, RL prunes wrong ones)
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  (related RL dynamics at training scale; this note is about mechanism, not bottleneck)
- Can simple rewards alone teach complex domain reasoning?
  Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
  (RL emergence in domain contexts)
- Does medical AI need knowledge or reasoning more?
  Medical and mathematical domains may require fundamentally different AI training priorities. If medical accuracy depends primarily on factual knowledge while math depends on reasoning quality, should we build and evaluate these systems differently?
  (why pruning matters more in medicine than in math)
- Does negative reinforcement alone outperform full reinforcement learning?
  Can training with only penalty signals for wrong answers match or exceed full RL approaches? This challenges the conventional assumption that reward design requires both positive and negative signals.
  (mechanism: suppression of wrong paths via negative reinforcement is the dominant RL contribution, consistent with pruning)
- Can document count be learned instead of fixed in RAG?
  Standard RAG systems use a fixed number of documents regardless of query complexity. Can an RL agent learn to dynamically select both how many documents and their order based on what helps the generator produce correct answers?
  (the same pruning mechanism applied to document selection: RL learns which documents to include or exclude from the generator's context, suppressing noisy documents the same way it suppresses inaccurate reasoning paths)
- Does supervising retrieval steps outperform final answer rewards?
  Can intermediate feedback on retrieval decisions—which documents to fetch, when to stop—train agentic RAG systems more effectively than rewarding only the final answer? This matters because poor retrieval paths can accidentally succeed or good ones can fail on noisy metrics.
  (extends the pruning principle to retrieval chains: process-level RL supervision prunes bad intermediate retrieval steps, not just bad final answers; the step-level pruning that works for reasoning paths also works for retrieval trajectories)
- Why does SFT-then-RL training follow a predictable three-phase pattern?
  When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.
  (temporal context: CHORD shows that RL pruning operates specifically in the readaptation phase, correcting the capability disruption caused by SFT; pruning is most effective when it targets the overfitting artifacts that SFT introduces)
- Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?
  Explores whether rewarding coherent reasoning patterns during training helps models internalize domain knowledge better than standard fine-tuning approaches that treat all tokens equally.
  (RLAG extends the pruning logic: rather than pruning inaccurate paths post hoc, RLAG uses retrieved augmentation as the reference signal during RL to define which knowledge pathways count as coherent; the pruning and the target are jointly specified by the augmented generation reward)
Original note title: RL improves domain reasoning by pruning inaccurate knowledge from reasoning paths, not by adding capability.