Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation

Paper · arXiv 2509.20162 · Published September 24, 2025
RAG · Domain Specialization · Reward Models

Large language models (LLMs) often exhibit limited performance on domain-specific tasks because specialized information is disproportionately underrepresented in their training data and because these datasets are static. Knowledge scarcity and temporal lag thus create knowledge gaps for domain applications. While post-training on domain datasets can embed knowledge into models, existing approaches have notable limitations. Continual Pre-Training (CPT) treats all tokens in domain documents with equal importance, failing to prioritize critical knowledge points, while supervised fine-tuning (SFT) with question-answer pairs struggles to develop the coherent knowledge structures necessary for complex reasoning tasks. To address these challenges, we propose Reinforcement Learning from Augmented Generation (RLAG). Our approach iteratively cycles between sampling generations and optimizing the model through calculated rewards, effectively embedding critical and contextually coherent domain knowledge. We select the generated outputs with the highest log probabilities as the sampling result, then compute three tailored reward metrics to guide the optimization process. To comprehensively evaluate domain expertise, we assess answer accuracy and the rationality of explanations generated for correctly answered questions. Experimental results across medical, legal, astronomy, and current events datasets demonstrate that our proposed method significantly outperforms baseline approaches. Our code and data are open-sourced at https://github.com/ChaojunNie/RLAG.

Large language models (LLMs) have demonstrated exceptional capabilities in capturing and storing factual knowledge across diverse disciplines, attributed to their comprehensive training corpora (Roberts et al., 2020; Cohen et al., 2023; Hu et al., 2023; Wang et al., 2024). However, foundation models trained on broad datasets inherently underrepresent specialized domains relative to their importance in specific applications, creating knowledge gaps downstream. Because training data are static and it is difficult to anticipate every potential downstream application during development, LLMs often struggle to answer highly specialized questions (Bang et al., 2023; Ji et al., 2023; Zhang et al., 2023a).

In-context learning (ICL) enhances performance on downstream tasks by providing models with exemplars during inference, enabling adaptation without parameter updates (Wang et al., 2023a; Li et al., 2023; Highmore, 2024). Retrieval Augmented Generation (RAG) augments model outputs by integrating relevant information from external knowledge bases, improving factual accuracy and reasoning capabilities (Guu et al., 2020; Lewis et al., 2020; Jiang et al., 2023). Since both ICL and RAG enhance performance through external information at inference time, neither permanently improves the model’s intrinsic capabilities for downstream tasks.

This study focuses on embedding knowledge into model weights. Training on downstream datasets embeds domain-specific knowledge directly into model parameters, enabling autonomous reasoning without external support (Gururangan et al., 2020; Ke et al., 2023; Song et al., 2025). While Continual Pre-Training (CPT) (Ke et al., 2023) processes entire domain corpora, its effectiveness is limited by the uniform importance assigned to tokens during training (Liu et al., 2024; Zhang et al., 2024). Supervised fine-tuning (SFT) (Wei et al., 2021) effectively embeds key information through targeted training; however, models trained exclusively on labeled knowledge pairs often exhibit reduced performance on complex reasoning tasks.
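To make the contrast concrete, here is a minimal sketch of the two training objectives this paragraph alludes to: CPT applies next-token cross-entropy to every token of a domain document, whereas SFT masks the prompt and trains only on the answer span. This is a generic PyTorch illustration under our own assumptions, not code from the RLAG repository.

```python
import torch
import torch.nn.functional as F

def cpt_loss(logits, input_ids):
    """Next-token cross-entropy over ALL tokens: every token weighted equally."""
    return F.cross_entropy(logits[:, :-1].flatten(0, 1), input_ids[:, 1:].flatten())

def sft_loss(logits, input_ids, prompt_len):
    """Same objective, but prompt positions are masked (label -100),
    so only the answer span contributes to the gradient."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100
    return F.cross_entropy(logits[:, :-1].flatten(0, 1),
                           labels[:, 1:].flatten(), ignore_index=-100)

# Dummy usage: batch of 2 sequences, length 8, vocab 100, first 5 tokens are prompt.
logits = torch.randn(2, 8, 100)
input_ids = torch.randint(0, 100, (2, 8))
print(cpt_loss(logits, input_ids).item(), sft_loss(logits, input_ids, prompt_len=5).item())
```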

Inspired by reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022; Rafailov et al., 2023), we introduce Reinforcement Learning from Augmented Generation (RLAG). In our scenario, generation augmented with relevant literature is preferred over unaugmented generation when addressing downstream questions. The core principle is to optimize the model so that it produces the preferred generations on its own, while continuously improving those generations through iterative refinement. Notably, our objective extends beyond enabling models to merely reproduce literature-augmented answers (achievable through SFT); we aim for models to thoroughly assimilate the knowledge contained in domain literature, thereby maintaining robust knowledge capabilities throughout conversations, as shown in Figure 1.
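As a simplified illustration of this preference setup, the sketch below casts the augmented-vs.-unaugmented pair as a DPO-style objective (Rafailov et al., 2023), with the literature-augmented generation playing the role of the preferred response. RLAG's actual three tailored rewards differ; this is only a hedged sketch of the underlying idea, and the function name `preference_loss` is our own.

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_aug, logp_plain, ref_logp_aug, ref_logp_plain, beta=0.1):
    """DPO-style loss treating the snippet-augmented generation as 'chosen'.

    logp_* are summed sequence log probabilities under the policy being trained;
    ref_logp_* are the same quantities under a frozen reference model.
    """
    margin = (logp_aug - ref_logp_aug) - (logp_plain - ref_logp_plain)
    return -F.logsigmoid(beta * margin)

# Example with dummy scalar log probabilities:
loss = preference_loss(torch.tensor(-12.3), torch.tensor(-15.8),
                       torch.tensor(-13.0), torch.tensor(-15.5))
```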

As illustrated in Figure 2, RLAG comprises two principal components: sampling and optimizing. During sampling, we employ a broadcasting operation to concatenate each option with the question, generating two responses differentiated by the presence or absence of retrieved snippets as a prefix. We compute log probabilities for each component from the model’s output logits and select the maximum over the option-specific segment as the prediction (see the sketch below). The optimization phase leverages three predefined reward functions, calculated from the sampling results and retrieved snippets, to update the model. In the next iteration, the updated model is used for sampling and optimization. To further isolate LLMs’ ability to learn new knowledge, we built a dataset covering events after the models’ training cutoff; this current events dataset is sourced from Wikipedia (Wikipedia contributors, 2025).
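The sampling step can be pictured as follows: a minimal sketch, assuming a Hugging Face causal LM, that scores each answer option by the summed log probability of its tokens, with or without a retrieved snippet prefixed to the question. Names such as `score_option` and the stand-in model are our own illustration, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the paper does not prescribe one
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def score_option(question, option, snippet=None):
    """Summed log probability of the option-specific segment, optionally
    conditioning on a retrieved snippet prepended to the question."""
    prefix = ((snippet + "\n") if snippet else "") + question + " "
    prefix_len = tokenizer(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prefix + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logits at position t predict token t+1, hence the shift below.
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only tokens belonging to the option (approximate at the
    # tokenization boundary, which suffices for an illustration).
    return token_lp[0, prefix_len - 1:].sum().item()

def predict(question, options, snippet=None):
    """Index of the option with the highest log probability."""
    return max(range(len(options)),
               key=lambda i: score_option(question, options[i], snippet))
```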