Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does RL improve domain reasoning by adding knowledge or removing it?

When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.

Note · 2026-02-21 · sourced from Domain Specialization
How do you build domain expertise into general AI models? How should we allocate compute budget at inference time? How should researchers navigate LLM reasoning research?

The KI/InfoGain framework from the Knowledge or Reasoning paper supports a more precise account of what RL training does to domain reasoning than "RL makes models better at reasoning." The specific finding: RL improves medical reasoning by pruning inaccurate or irrelevant knowledge from reasoning paths, raising both reasoning accuracy and knowledge correctness (an average KI gain of +12.4 points). RL does not appear to add new domain facts the model didn't know; it makes the model less likely to invoke incorrect domain knowledge while reasoning.
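As a rough illustration of how the two axes separate, here is a minimal sketch of per-step knowledge correctness and information gain. The dataclass fields, grading protocol, and function names are assumptions for illustration, not the paper's exact KI/InfoGain definitions.

```python
from dataclasses import dataclass

@dataclass
class Step:
    facts_correct: int    # domain facts in this step judged correct by a grader (assumed protocol)
    facts_total: int      # all domain facts the step invokes
    p_gold_before: float  # model probability of the gold answer before this step
    p_gold_after: float   # ...and after conditioning on this step

def knowledge_index(steps: list[Step]) -> float:
    # Share of invoked domain facts that are correct across the whole chain.
    total = sum(s.facts_total for s in steps)
    correct = sum(s.facts_correct for s in steps)
    return correct / total if total else 0.0

def info_gain(steps: list[Step]) -> float:
    # Average lift in gold-answer probability contributed per reasoning step.
    if not steps:
        return 0.0
    return sum(s.p_gold_after - s.p_gold_before for s in steps) / len(steps)
```

A chain can score high on info_gain while scoring low on knowledge_index, which is exactly the dissociation the framework is built to expose.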

This is a structurally different claim from the usual framing of RL's role. The standard story is that SFT gives the model capability and RL aligns that capability to desired behavior. In domain-specific contexts, though, the alignment function is more specific: RL suppresses the wrong pattern activations during reasoning; it does not teach the model new things.

The medical AI context makes this clear. Medical reasoning is knowledge-dominant — knowledge accuracy correlates more strongly with final accuracy than reasoning quality across benchmarks. SFT raises knowledge levels (KI +6.2% on medical tasks) but also introduces verbose or suboptimal reasoning, reducing InfoGain. RL corrects this: it rewards factual correctness and penalizes paths that introduce inaccurate medical claims, effectively performing a kind of knowledge path surgery on the training distribution.
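A hedged sketch of how such a reward might be shaped: the outcome term is the standard verifiable-answer signal, while `verifier` and the 0.1 penalty weight are hypothetical stand-ins, not the training setup reported in the paper.

```python
from typing import Callable

def path_reward(answer_correct: bool,
                claims: list[str],
                verifier: Callable[[str], bool],
                penalty_weight: float = 0.1) -> float:
    """Reward a reasoning trajectory for reaching the right answer while
    docking it for each intermediate claim a domain verifier rejects.
    (Illustrative shaping only; weights and verifier are assumptions.)"""
    outcome = 1.0 if answer_correct else 0.0
    wrong_claims = sum(1 for c in claims if not verifier(c))
    return outcome - penalty_weight * wrong_claims
```

Under a reward shaped this way, two trajectories that both reach the correct answer are no longer interchangeable: the one that routed through an inaccurate claim is down-weighted, which is the pruning behavior described above.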

This connects to but is distinct from Does policy entropy collapse limit reasoning performance in RL?, which describes RL's effect on exploration diversity during training. That insight is about the training dynamics that limit scaling; this one is about the mechanism through which RL improves domain-specific output quality. The two claims operate at different levels: collapse is a training-time constraint, pruning is a mechanism-of-action description.

The practical implication: for knowledge-intensive domains, RL is not an optional enhancement on top of SFT; it is the correction mechanism that compensates for SFT's tendency to memorize answer patterns rather than reason correctly. Does supervised fine-tuning actually improve reasoning quality? documents the problem RL is solving.

The "RL Squeezes, SFT Expands" paper provides graph-topology evidence for this pruning mechanism. RL training compresses the diversity of reasoning paths while SFT expands them; this compression is the pruning. RL doesn't add new paths; it removes low-reward ones, concentrating probability mass on the subset of reasoning trajectories that lead to correct outcomes. If the finding in Does negative reinforcement alone outperform full reinforcement learning? holds, the pruning mechanism may be RL's primary contribution: suppressing wrong paths matters more than reinforcing right ones.
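A toy way to see the squeeze, assuming reasoning paths can be sampled repeatedly and hashed into discrete trajectories: entropy over the sampled path distribution drops as probability mass concentrates. This is an illustrative proxy, not the graph-topology metrics the paper actually reports.

```python
import math
from collections import Counter

def path_entropy(sampled_paths: list[tuple[str, ...]]) -> float:
    # Shannon entropy (nats) over distinct reasoning paths in a sample.
    counts = Counter(sampled_paths)
    n = len(sampled_paths)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Illustrative numbers only: after RL, repeated sampling tends to revisit the
# same few high-reward trajectories, so path entropy drops even as accuracy rises.
before_rl = [("A", "B", "ans"), ("A", "C", "ans"), ("D", "ans"), ("A", "B", "ans")]
after_rl  = [("A", "B", "ans")] * 3 + [("A", "C", "ans")]
print(path_entropy(before_rl), ">", path_entropy(after_rl))
```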


Source: Domain Specialization
