LLM Post-Training: A Deep Dive into Reasoning Large Language Models
Some widely used policy gradient approaches are explained below.
Policy Gradient (REINFORCE). The REINFORCE algorithm [114, 115] improves decision-making by adjusting the model's strategy (policy) based on the rewards its actions receive. Rather than directly learning the best action for every situation, the algorithm refines how likely different actions are to be chosen, gradually improving outcomes over time.
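Concretely, REINFORCE performs gradient ascent on the expected return $J(\theta)$ using the estimator $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$, where $G_t$ is the discounted return. The following is a minimal PyTorch sketch of one such update; `policy_net`, the episode tensors, and the return-normalization step are illustrative assumptions, not details from the paper.

```python
import torch

def reinforce_update(policy_net, optimizer, states, actions, rewards, gamma=0.99):
    """One REINFORCE update from a single sampled episode.

    states:  (T, obs_dim) tensor of observations from a rollout.
    actions: (T,) long tensor of the actions actually taken.
    rewards: list of T per-step scalar rewards.
    """
    # Monte Carlo returns: G_t = r_t + gamma * G_{t+1}
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    # Normalizing returns is a common variance-reduction trick.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # log pi(a_t | s_t) for the actions actually taken
    log_probs = torch.log_softmax(policy_net(states), dim=-1)
    log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Ascend E[log pi * G] by descending its negation.
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```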
Advantage Actor-Critic (A2C/A3C). RL methods like REINFORCE [114] rely solely on Monte Carlo policy gradient estimates, which suffer from high variance, leading to unstable and inefficient learning. Actor-critic methods reduce this variance by pairing the policy (actor) with a learned value function (critic).
Actor updates are guided by the policy gradient theorem, where the advantage function $A(s_t, a_t)$, defined in Sec. 2.1, determines how much better an action $a_t$ is compared to the expected value of the state $s_t$.
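As a concrete illustration, the sketch below uses the one-step temporal-difference form of the advantage, $A(s_t, a_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t)$, as the policy-gradient weight. The `actor` and `critic` networks and the loss weighting are hypothetical placeholders rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def a2c_update(actor, critic, optimizer, states, actions, rewards,
               next_states, dones, gamma=0.99, value_coef=0.5):
    """One synchronous advantage actor-critic (A2C) update.

    rewards and dones are float tensors of shape (T,). The bootstrapped
    TD advantage replaces REINFORCE's raw Monte Carlo return, trading a
    little bias for a large reduction in gradient variance.
    """
    values = critic(states).squeeze(-1)                 # V(s_t)
    with torch.no_grad():
        next_values = critic(next_states).squeeze(-1)   # V(s_{t+1})
        td_target = rewards + gamma * next_values * (1.0 - dones)
    advantage = td_target - values                      # A(s_t, a_t)

    log_probs = torch.log_softmax(actor(states), dim=-1)
    log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Actor: policy gradient weighted by the advantage (detached so the
    # critic is trained only by its own regression loss).
    actor_loss = -(log_probs * advantage.detach()).mean()
    # Critic: regress V(s_t) toward the bootstrapped TD target.
    critic_loss = F.mse_loss(values, td_target)

    loss = actor_loss + value_coef * critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```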
3 Reinforced LLMs
From a methodological perspective, the integration of RL into LLM reasoning typically follows three core steps:
Supervised Fine-Tuning (SFT): Training commences with a pretrained language model that is refined on a supervised dataset of high-quality, human-crafted examples (a minimal loss sketch follows this list). This phase ensures the model acquires baseline compliance with format and style guidelines.
Reward Model (RM) Training: Outputs generated by the fine-tuned model are collected and labeled according to human preference. The reward model is then trained to reproduce these label-based scores or rankings, effectively learning a continuous reward function that maps response text to a scalar value (see the ranking-loss sketch after this list).
RL Fine-Tuning: Finally, the main language model is optimized via a policy gradient algorithm, most commonly PPO, to maximize the reward model's output (sketched below). By iterating this loop, the LLM learns to produce responses that humans find preferable along key dimensions such as accuracy, helpfulness, and stylistic coherence.
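For step 1, SFT reduces to next-token cross-entropy over curated demonstrations. The sketch below assumes a `model` that returns per-token logits and the common convention of masking prompt positions with -100; both are illustrative assumptions.

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Next-token cross-entropy over a human-written demonstration.

    input_ids: prompt + response tokens, shape (batch, seq).
    labels:    the same tokens, with prompt positions set to -100 so
               the loss is computed only on the response.
    """
    logits = model(input_ids)                       # (batch, seq, vocab)
    # Shift so position t predicts token t + 1.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,                          # skip masked prompt tokens
    )
```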
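For step 2, reward models are commonly trained with a pairwise (Bradley-Terry style) ranking loss over preference-labeled response pairs. The sketch below assumes a hypothetical `reward_model` mapping token ids to a scalar score; it is one standard formulation, not necessarily the exact objective of any particular system.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise ranking loss for reward-model training.

    chosen_ids / rejected_ids: token-id tensors for the human-preferred
    and dispreferred responses to the same prompt. The model learns to
    assign a higher scalar score to the preferred response.
    """
    r_chosen = reward_model(chosen_ids)      # (batch,) scalar scores
    r_rejected = reward_model(rejected_ids)
    # Bradley-Terry objective: -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```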
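For step 3, PPO maximizes a clipped surrogate objective so each update stays close to the sampling policy; RLHF-style pipelines typically add a KL penalty against the frozen SFT reference model to prevent drift. The function below is a schematic under those assumptions, with all tensor names illustrative.

```python
import torch

def ppo_loss(log_probs_new, log_probs_old, advantages,
             log_probs_ref=None, clip_eps=0.2, kl_coef=0.1):
    """PPO clipped surrogate loss over generated tokens.

    log_probs_new: log pi_theta(a_t | s_t) under the policy being trained.
    log_probs_old: log probs recorded when the responses were sampled.
    advantages:    advantage estimates derived from reward-model scores.
    log_probs_ref: optional log probs under the frozen SFT reference
                   model, used as a per-token KL penalty.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)    # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()

    if log_probs_ref is not None:
        # Simple estimator penalizing divergence from the reference policy.
        loss = loss + kl_coef * (log_probs_new - log_probs_ref).mean()
    return loss
```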
Reward Modeling and Alignment: Sophisticated reward functions are developed, drawing from human preferences, adversarial feedback, or automated metrics, to guide the model toward outputs that are coherent, safe, and contextually appropriate. These rewards are critical for effective credit assignment across multi-step reasoning processes.
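As a hedged illustration of such composite rewards and credit assignment, the sketch below combines a learned preference score with automated and safety signals, then spreads a terminal reward across intermediate reasoning steps; every weight and signal name here is hypothetical rather than drawn from the paper.

```python
def composite_reward(preference_score, metric_score, safety_penalty,
                     w_pref=1.0, w_metric=0.5, w_safety=1.0):
    """Weighted combination of reward signals (all weights hypothetical).

    preference_score: scalar from a learned human-preference reward model.
    metric_score:     automated metric, e.g. unit-test pass rate.
    safety_penalty:   score from an adversarial or safety classifier.
    """
    return (w_pref * preference_score
            + w_metric * metric_score
            - w_safety * safety_penalty)

def assign_credit(terminal_reward, num_steps, gamma=0.95):
    """Discount a single end-of-sequence reward back over intermediate
    reasoning steps, so earlier steps receive (smaller) credit too."""
    return [terminal_reward * gamma ** (num_steps - 1 - t)
            for t in range(num_steps)]
```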