A Survey on Post-training of Large Language Models
The emergence of Large Language Models (LLMs) has fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. However, their pre-trained architectures often reveal limitations in specialized contexts, including restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance. These challenges necessitate advanced post-training language models (PoLMs), such as OpenAI-o1/o3 and DeepSeek-R1 (collectively known as Large Reasoning Models, or LRMs), to address these shortcomings. This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures coherence with ethical standards and human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; Efficiency, which optimizes resource utilization amidst increasing complexity; and Integration and Adaptation, which extend capabilities across diverse modalities while addressing coherence issues. Charting progress from ChatGPT’s foundational alignment strategies in 2022 to DeepSeek-R1’s innovative reasoning advancements in 2025, we illustrate how PoLMs leverage datasets to mitigate biases, deepen reasoning capabilities, and enhance domain adaptability. Our contributions include a pioneering synthesis of PoLM evolution, a structured taxonomy categorizing techniques and datasets, and a strategic agenda emphasizing the role of LRMs in improving reasoning proficiency and domain flexibility. As the first survey of its scope, this work consolidates recent PoLM advancements and establishes a rigorous intellectual framework for future research, fostering the development of LLMs that excel in precision, ethical robustness, and versatility across scientific and societal applications.
Post-training. Post-training refers to the techniques and methodologies employed after a model has undergone pre-training, aiming to refine and adapt the model for specific tasks or user requirements. Following the release of GPT-3 [7], with its 175 billion parameters, the field of post-training experienced a significant surge in interest and innovation. Various approaches emerged to enhance model performance, including fine-tuning [16, 17], which adjusts model parameters using labeled datasets or specific task data; alignment strategies [18, 19, 20], which optimize models to better align with user preferences; knowledge adaptation techniques [21, 22], which enable models to incorporate domain-specific knowledge; and reasoning improvements [23, 24], which enhance a model’s ability to make logical inferences and decisions. Collectively known as Post-training Language Models (PoLMs), these techniques have led to the development of models such as GPT-4 [9], LLaMA-3 [25], Gemini-2.0 [26], and Claude-3.5 [27], marking substantial progress in LLM capabilities. However, post-trained models often struggle to adapt to new tasks without retraining or significant parameter adjustments, making PoLM development an area of active research.
3.2.1 Instruction Tuning
Instruction Tuning [96] is a technique that refines a base LLM by fine-tuning it on specially constructed instruction datasets. This method substantially boosts the model’s ability to generalize across a variety of tasks and domains, improving its flexibility and accuracy. As shown in Fig. 5, the process begins by transforming existing NLP datasets (e.g., those for text classification, translation, and summarization) into natural language instructions that include task descriptions, input examples, expected outputs, and illustrative demonstrations. Techniques like Self-Instruct [86] further enhance the diversity of these datasets by automatically generating additional instruction–output pairs, expanding the model’s exposure to a broader range of tasks. The fine-tuning procedure adapts the model’s parameters to align with these task-specific instructions, resulting in an LLM that performs robustly across both familiar and previously unseen tasks. For instance, InstructGPT [45] and GPT-4 [9] have shown significant improvements in instruction-following capabilities across a wide array of applications.
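To make this format concrete, the sketch below shows what a single converted instruction record might look like; the field names and contents are illustrative rather than drawn from any specific dataset.

```python
# A hypothetical instruction record produced by converting a summarization
# example into the instruction format described above. Field names are
# illustrative, not from any particular dataset.
instruction_record = {
    "instruction": "Summarize the following article in one sentence.",
    "demonstrations": [
        {
            "input": "The city council voted 7-2 on Tuesday to approve ...",
            "output": "The city council approved the new transit budget.",
        }
    ],
    "input": "Researchers announced a new battery chemistry that ...",
    "output": "Scientists unveiled a battery design with higher capacity.",
}

# Fine-tuning then maximizes the likelihood of `output` given the
# concatenated instruction, demonstrations, and input.
```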
The effectiveness of Instruction Tuning largely depends on the quality and breadth of the instruction datasets. High-quality datasets should encompass a wide range of languages, domains, and task complexities to ensure that the model remains broadly applicable [96]. Furthermore, the clarity and organization of instructions play a critical role in enabling the model to interpret and execute tasks effectively. Techniques such as integrating demonstration examples and Chain-of-Thought prompting [47] can significantly improve performance on tasks requiring complex reasoning.
3.2.3 Prompt-Tuning
Prompt-tuning [44, 100] is a method designed to adapt large language models efficiently by optimizing trainable vectors at the input layer rather than modifying the model’s internal parameters. As shown in Fig. 6 (b), this technique builds on discrete prompting methods [101, 102] by introducing soft prompt tokens, which can be structured either in an unrestricted format [44] or as a prefix [100]. These learned prompt embeddings are combined with the input text embeddings before being processed by the model, thereby guiding the model’s output while keeping the pre-trained weights frozen. Two notable implementations are P-tuning [44] and standard prompt-tuning [100]. P-tuning uses a flexible method to combine context, prompt, and target tokens, making it suitable for both understanding and generation tasks, and enhances the learning of soft prompt representations through a bidirectional LSTM architecture. In contrast, standard prompt-tuning employs a simpler design, wherein prefix prompts are prepended to the input and only the prompt embeddings are updated during training based on task-specific supervision.
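A minimal PyTorch sketch of this soft-prompt mechanism is given below, assuming a Hugging Face-style causal LM that exposes `get_input_embeddings()` and accepts `inputs_embeds`; the class name and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class SoftPromptModel(nn.Module):
    """Prepends trainable prompt embeddings to the input; the LM stays frozen."""

    def __init__(self, base_model, num_prompt_tokens=20):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():
            p.requires_grad = False  # pre-trained weights are never updated

        embed_dim = base_model.get_input_embeddings().embedding_dim
        # The only trainable parameters: one vector per soft prompt token.
        self.soft_prompt = nn.Parameter(
            torch.randn(num_prompt_tokens, embed_dim) * 0.02
        )

    def forward(self, input_ids):
        tok_embeds = self.base_model.get_input_embeddings()(input_ids)
        batch = input_ids.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        # Learned prompt embeddings are concatenated before the text
        # embeddings, steering the frozen model's output.
        inputs_embeds = torch.cat([prompt, tok_embeds], dim=1)
        return self.base_model(inputs_embeds=inputs_embeds)
```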
Research has shown that prompt-tuning can match the performance of full-parameter fine-tuning across many tasks, while requiring significantly fewer trainable parameters. However, its success is closely tied to the underlying language model’s capacity, as prompt-tuning only modifies a small number of parameters at the input layer [44]. Building on these advancements, newer approaches such as P-Tuning v2 [99] have demonstrated that prompt-tuning strategies can scale effectively across various model sizes, handling complex tasks previously thought to require full fine-tuning. These findings establish prompt-tuning as a highly efficient alternative to traditional fine-tuning, offering comparable performance with reduced computational and memory costs.
3.3 Reinforcement Fine-Tuning
Reinforcement Fine-Tuning (ReFT) [103] represents an advanced technique that integrates RL with SFT to enhance the model’s ability to solve complex, dynamic problems. Unlike traditional SFT, which typically uses a single CoT annotation for each problem, ReFT enables the model to explore multiple valid reasoning paths, thereby improving its generalization capacity and problem-solving skills. The ReFT process begins with the standard SFT phase, where the model is initially trained on labeled data to learn fundamental task-solving abilities through supervised annotations. Following this initial fine-tuning, the model undergoes further refinement using RL algorithms, such as Proximal Policy Optimization (PPO) [46]. During the reinforcement phase, the model generates multiple CoT annotations for each problem, exploring different potential reasoning paths. These generated paths are evaluated by comparing the model’s predicted answers to the true answers, with rewards assigned for correct outputs and penalties for incorrect ones. This iterative process drives the model to adjust its policy, ultimately improving its reasoning strategy.
As shown in Fig. 7, the ReFT process is executed in two stages. The upper section represents the SFT phase, where the model iterates over the training data to learn the correct CoT annotation for each problem over several epochs. In the lower section, the ReFT phase is introduced: starting from the SFT-trained model, the model generates alternative CoT annotations (e′) based on its current policy and compares its predicted answers (y′) with the true answers (y). Positive rewards are given for correct answers, and negative rewards for incorrect answers, driving the model to improve its performance. These reward signals are then used to update the model’s policy through reinforcement learning, enhancing its ability to generate accurate and diverse CoT annotations.
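The reward-assignment step of this phase can be sketched as follows; `sample_cot` and `extract_answer` are hypothetical stand-ins for the model’s sampling and answer-parsing routines, and the actual method feeds the resulting rewards into a PPO update rather than using them directly.

```python
def reft_reward_phase(sample_cot, extract_answer, problem, true_answer,
                      num_samples=4):
    """Sample multiple CoT annotations and score them against the gold answer.

    sample_cot(problem) -> str draws a chain-of-thought e' from the current
    policy; extract_answer(cot) -> str parses its final answer y'. Both are
    hypothetical helpers standing in for the model's actual routines.
    """
    trajectories = []
    for _ in range(num_samples):
        cot = sample_cot(problem)            # e': one sampled reasoning path
        predicted = extract_answer(cot)      # y': its final answer
        # Positive reward for a correct answer, negative otherwise,
        # mirroring the scheme described above.
        reward = 1.0 if predicted == true_answer else -1.0
        trajectories.append((cot, reward))
    # These (trajectory, reward) pairs feed a policy-gradient update such as
    # PPO, shifting probability mass toward paths that reach correct answers.
    return trajectories
```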
4.2.2 RLAIF Training Pipeline
The RLAIF training pipeline follows several key stages wherein AI-generated feedback is utilized to iteratively refine the model’s behavior. The pipeline facilitates the alignment of LLM outputs with human expectations in a manner that scales across various tasks, as detailed by [108]. The stages are as follows:
AI Feedback Collection. In this phase, the AI system generates feedback based on predefined criteria, which may include task-specific metrics, correctness of responses, or appropriateness of the model’s outputs. Unlike human feedback, which requires interpretation and manual annotation, AI feedback can be consistently generated across a broad range of model outputs. This characteristic enables AI feedback to be continuously provided, scaling the feedback loop significantly.
Reward Model Training. The AI-generated feedback is subsequently used to train or refine a reward model. This model maps input-output pairs to corresponding rewards, aligning the model’s output with the desired outcomes as dictated by the feedback. While traditional RLHF relies on direct human feedback to evaluate outputs, RLAIF utilizes AI-generated labels, which, although potentially introducing issues related to consistency and bias, offer advantages in scalability and independence from human resources.
Policy Update. The final stage involves updating the model’s policy based on the reward model trained in the previous step. Reinforcement learning algorithms are employed to adjust the model’s parameters, optimizing the policy to maximize cumulative reward across a variety of tasks. This process is repeated iteratively, with each cycle of feedback collection, reward modeling, and policy optimization progressively aligning the model’s behavior with the desired outcomes.
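A compressed sketch of one pass through the three stages is shown below; `ai_label_preference`, `train_reward_model`, `ppo_step`, and the `policy.generate` interface are all hypothetical stand-ins, not an actual RLAIF implementation.

```python
def rlaif_iteration(policy, reward_model, prompts,
                    ai_label_preference, train_reward_model, ppo_step):
    """One pass through the RLAIF loop described above (illustrative only).

    The three callables are hypothetical stand-ins for the AI labeler,
    reward-model fitting, and the RL update, respectively.
    """
    # Stage 1: AI feedback collection - an LLM judge, not a human annotator,
    # compares pairs of candidate responses against predefined criteria.
    feedback = []
    for prompt in prompts:
        a, b = policy.generate(prompt), policy.generate(prompt)
        preferred = ai_label_preference(prompt, a, b)  # 0 if a is better, 1 if b
        feedback.append((prompt, a, b, preferred))

    # Stage 2: reward model training - fit r(prompt, response) to the AI
    # preference labels (e.g., with a Bradley-Terry pairwise objective).
    reward_model = train_reward_model(reward_model, feedback)

    # Stage 3: policy update - adjust the policy with RL (e.g., PPO) to
    # maximize the learned reward.
    policy = ppo_step(policy, reward_model, prompts)
    return policy, reward_model
```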
Self-Refine for Reasoning (§5.1), which guides the model to autonomously detect and remedy errors in its own reasoning steps; and Reinforcement Learning for Reasoning (§5.2), which employs reward-based optimization to improve the consistency and depth of the model’s chain-of-thought. These approaches collectively enable more robust handling of long-horizon decision-making, logical proofs, mathematical reasoning, and other challenging tasks.
5.1 Self-Refine for Reasoning
Reasoning remains a core challenge in optimizing LLMs for tasks that demand intricate logical inference and context-dependent decision-making. In this context, self-refine emerges as a powerful mechanism to iteratively pinpoint and correct errors during or after text generation, substantially improving both reasoning depth and overall reliability. As shown in Fig. 12, self-refine methods can be divided into four categories: Intrinsic Self-refine, which relies on the model’s internal reasoning loops; External Self-refine, which incorporates external feedback resources; Fine-tuned Intrinsic Self-refine, which iteratively updates the model’s reasoning processes based on self-generated corrections; and Fine-tuned External Self-refine, which harnesses external signals and fine-tuning to refine reasoning in a more adaptive, long-term manner. Tab. 4 further illustrates how each category fortifies LLM reasoning capacity across various tasks.
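As a simple illustration of the first category, an intrinsic self-refine loop might look like the following, where `llm` is a hypothetical text-generation function and the fixed round limit is a simplification.

```python
def intrinsic_self_refine(llm, problem, max_rounds=3):
    """Generate, self-critique, and revise using only the model's own feedback.

    `llm` is a hypothetical callable mapping a prompt string to generated text.
    """
    answer = llm(f"Solve step by step:\n{problem}")
    for _ in range(max_rounds):
        critique = llm(
            f"Problem:\n{problem}\n\nProposed reasoning:\n{answer}\n\n"
            "List any errors in the reasoning, or reply 'NO ERRORS'."
        )
        if "NO ERRORS" in critique:
            break  # the model judges its own reasoning to be sound
        answer = llm(
            f"Problem:\n{problem}\n\nPrevious attempt:\n{answer}\n\n"
            f"Critique:\n{critique}\n\nProduce a corrected solution."
        )
    return answer
```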
5.2.2 Reward Design for Reasoning
Unlike traditional RL tasks with clear rewards like game scores, reasoning in LLMs demands structured reward designs reflecting correctness, efficiency, and informativeness. Common approaches include binary correctness rewards, assigning $r_T = 1$ for a correct final answer and $r_T = 0$ otherwise, which is simple but introduces high variance due to sparse feedback; step-wise accuracy rewards, offering incremental feedback based on metrics like inference rule validity or intermediate step consistency to guide multi-step reasoning; self-consistency rewards, measuring stability across multiple reasoning paths and assigning higher rewards for agreement to enhance robustness; and preference-based rewards, derived from RLHF or RLAIF, where a model $r_\phi(s_t, a_t)$ trained on human or AI feedback evaluates reasoning quality, providing nuanced guidance for complex tasks.
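The first and third of these designs can be expressed compactly, as sketched below; `final_answer` is a hypothetical parser that extracts the final answer from a reasoning path.

```python
from collections import Counter

def binary_correctness_reward(final_answer, path, gold):
    """r_T = 1 for a correct final answer, 0 otherwise (sparse, high variance)."""
    return 1.0 if final_answer(path) == gold else 0.0

def self_consistency_rewards(final_answer, paths):
    """Reward each sampled path by the fraction of paths agreeing with it,
    so answers reached by many independent chains earn higher reward."""
    answers = [final_answer(p) for p in paths]
    counts = Counter(answers)
    return [counts[a] / len(answers) for a in answers]
```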
5.2.3 Large-Scale RL on Base Model
Large-scale Reinforcement Learning has emerged as a transformative post-training paradigm for enhancing the reasoning capabilities of LLMs, shifting the focus from traditional SFT to dynamic, self-evolving optimization strategies. This approach leverages extensive computational frameworks and iterative reward-based feedback to refine base models directly, bypassing the need for pre-annotated datasets and enabling autonomous development of complex inference skills. By integrating large-scale RL, LLMs can address intricate multi-step reasoning tasks (e.g., mathematical problem-solving, logical deduction, and strategic planning), where conventional SFT often falls short due to its reliance on static, human-curated data [45]. The DeepSeek-R1 model exemplifies this paradigm, employing advanced RL techniques to achieve state-of-the-art reasoning performance while optimizing resource efficiency, as illustrated in Fig. 13. This subsection delineates the key methodologies underpinning DeepSeek-R1’s success, including novel optimization algorithms, adaptive exploration, and trajectory management, which collectively redefine the potential of RL-driven reasoning in LLMs.
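One concrete instance of such an optimization algorithm is Group Relative Policy Optimization (GRPO), which DeepSeek reports using to train R1: rewards for a group of responses sampled from the same prompt are normalized against the group’s own statistics, removing the need for a separate value network. The sketch below shows only this advantage computation, omitting the clipped policy-ratio objective and KL penalty.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: normalize each sampled response's reward
    by the mean and standard deviation of its own group, so no learned
    value function (critic) is required.

    group_rewards: scalar rewards for G responses sampled from one prompt.
    """
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four responses to one math problem, rewarded 1 if correct else 0.
# Correct answers receive positive advantages, incorrect ones negative.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```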