Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?
Explores whether rewarding coherent reasoning patterns during training helps models internalize domain knowledge better than standard fine-tuning approaches that treat all tokens equally.
SFT on domain knowledge treats all tokens equally. A training example of a medical question answered correctly does not distinguish between the tokens that encode critical clinical reasoning and the tokens that are boilerplate formatting. CPT (continual pre-training) is worse: it processes entire domain documents without targeting clinically critical information. Both approaches fail at knowledge coherence — the model may learn isolated facts without integrating them into the connected knowledge structures needed for complex reasoning.
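To make the token-equal objective concrete, here is a minimal sketch (not from the source) of the standard SFT cross-entropy loss; the function name `sft_loss` and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Standard next-token cross-entropy: every target position is weighted
    identically, so tokens carrying clinical reasoning and tokens that are
    boilerplate formatting contribute the same gradient signal."""
    # logits: [batch, seq_len, vocab_size]; target_ids: [batch, seq_len]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
    )
```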
RLAG (Reinforcement Learning from Augmented Generation) takes a different approach. For each question, generate two responses: one with retrieved domain context as a prefix, one without. The augmented response is the "preferred" response (the model sees what the correct answer looks like with evidence support); the unaugmented response is what the model can produce from parametric knowledge alone. The reward combines two signals, answer accuracy and explanation rationality: not just whether the final answer is right, but whether the reasoning that produced it is coherent.
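A minimal sketch of the paired generations and the combined reward, assuming hypothetical interfaces (`model.generate` returning a structured response, `judge.score_explanation` scoring rationality in [0, 1]) and a simple product for combining the two signals; the paper's exact reward formulation may differ.

```python
from dataclasses import dataclass

@dataclass
class Response:
    final_answer: str
    explanation: str

def rlag_generations(model, question: str, retrieved_context: str):
    """Generate the paired responses RLAG compares for each question."""
    # Augmented: retrieved domain evidence is prefixed to the question.
    augmented = model.generate(f"{retrieved_context}\n\n{question}")
    # Unaugmented: whatever the model produces from parametric knowledge alone.
    unaugmented = model.generate(question)
    return augmented, unaugmented

def rlag_reward(response: Response, gold_answer: str, judge) -> float:
    """Reward coherent reasoning, not just a correct final answer."""
    accuracy = float(response.final_answer.strip() == gold_answer.strip())
    rationality = judge.score_explanation(response.explanation)  # assumed in [0, 1]
    # A right answer reached through incoherent reasoning earns little reward.
    return accuracy * rationality
```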
The iterative cycle: sample → compute rewards → optimize → repeat. With each cycle the model internalizes the knowledge patterns from retrieved context, gradually reducing the gap between its unaugmented performance and augmented performance. The retrieved context during training becomes scaffolding that the model eventually internalizes.
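A sketch of the sample → compute rewards → optimize cycle, building on the helpers above; `retriever.retrieve` and `policy_update` are placeholders, since the source does not name the retriever or the RL optimizer used.

```python
def rlag_training_loop(model, dataset, retriever, judge, num_cycles: int):
    """Iteratively internalize retrieved knowledge into the model's weights."""
    for _ in range(num_cycles):
        for question, gold_answer in dataset:
            context = retriever.retrieve(question)
            augmented, unaugmented = rlag_generations(model, question, context)
            reward_aug = rlag_reward(augmented, gold_answer, judge)
            reward_unaug = rlag_reward(unaugmented, gold_answer, judge)
            # The augmented response acts as the preferred reference; optimizing
            # toward it gradually closes the gap between unaugmented and
            # augmented performance, so the retrieval scaffolding is internalized.
            policy_update(model,
                          preferred=(augmented, reward_aug),
                          dispreferred=(unaugmented, reward_unaug))
```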
The key difference from SFT: RLAG rewards the model for the quality of its knowledge representations, not just for reproducing training examples. A model that gets the right answer through incoherent reasoning is not rewarded. A model that produces a coherent explanation from genuinely integrated knowledge is.
This adds a new mechanism to the taxonomy in "How do knowledge injection methods trade off flexibility and cost?": RL-from-augmentation is neither purely dynamic (inference-time RAG) nor purely static (SFT/CPT). It uses dynamic context during training to progressively embed what it learns into the weights, producing models that can reason coherently without retrieval at test time.
Source: RAG
Related concepts in this collection
- How do knowledge injection methods trade off flexibility and cost?
When and how should domain knowledge enter an AI system? This explores the speed, training cost, and adaptability trade-offs across four injection paradigms, and when each approach suits different deployment constraints.
RLAG is a hybrid: it uses dynamic retrieval at training time to drive static weight updates, adding a fifth mechanism to the taxonomy
- Does RL improve domain reasoning by adding knowledge or removing it?
When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
Same pruning dynamic: RL removes incoherent knowledge pathways; RLAG uses augmented generation as the reference signal for what coherent pathways look like
- Does supervised fine-tuning actually improve reasoning quality?
While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
RLAG's explanation-rationality reward is a direct response to this SFT failure mode
- Why do specialized models fail outside their domain?
Deep domain optimization creates sharp performance cliffs at domain boundaries. Specialized models generate plausible-sounding but ungrounded responses when queries fall outside their training scope, and often fail to signal their own ignorance.
RLAG's retrieval-augmented training mitigates the cliff by anchoring knowledge to retrieved evidence rather than memorized patterns; models that internalize structured knowledge through RL are less likely to generate plausible-but-wrong outputs at domain boundaries than those trained purely on static data
Original note title
rl from augmented generation embeds domain knowledge more effectively than sft by rewarding coherent knowledge structures