Tulu 3: Pushing Frontiers in Open Language Model Post-Training
Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce TÜLU 3, a family of fully open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. TÜLU 3, which builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5 Haiku. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). With TÜLU 3, we build a multi-task evaluation scheme for post-training with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks.
The TÜLU 3 training recipe involves multiple stages, with each stage building upon the previous model and focusing on a different type of data: prompt-completion instances for supervised finetuning, preferences for preference tuning, and verifiable rewards for reinforcement learning. Our methodology facilitates identifying skill deficiencies and refining the data mix, methods, and parameters, ensuring balanced performance across core skills throughout the training process. Through rigorous, principled experimentation, we determine the best data mix for supervised finetuning, resulting in the TÜLU 3 SFT checkpoint. Leveraging recent advances in preference tuning, we then train a model on carefully curated on-policy preference data obtained by comparing TÜLU 3 SFT completions against outputs from other language models. Furthermore, we introduce a new final finetuning stage, Reinforcement Learning with Verifiable Rewards (RLVR), which employs a novel RL objective tailored to enhance specific skills with verifiable answers, such as mathematics and precise instruction following.
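To make the RLVR idea concrete, the sketch below shows what a verifiable reward might look like for math-style prompts: the reward is computed by programmatically checking the model's final answer against a known ground truth rather than by querying a learned reward model. This is a minimal illustration only; the helper names (extract_final_answer, verifiable_reward) and the reward values are our own placeholders, not the TÜLU 3 implementation.

# Minimal sketch of a verifiable reward in the spirit of RLVR; helper names
# and reward values are illustrative placeholders, not the TULU 3 implementation.
import re
from typing import Optional

def extract_final_answer(completion: str) -> Optional[str]:
    """Heuristically take the last number in a completion as its final answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Return a positive reward only if the extracted answer matches the ground truth."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth else 0.0

# During RL training, this deterministic score stands in for (or supplements)
# a learned reward model on prompts whose answers can be checked exactly:
# reward = verifiable_reward(sampled_completion, gold_answer)

Analogous checks can be written for other verifiable skills, for example testing whether a completion satisfies an explicit length or formatting constraint in precise instruction following.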
9 Related Work

9.1 The Evolution of Post-training Recipes

Modern “post-training” has its roots in multi-task language model training, in particular instruction tuning [Mishra et al., 2022, Wei et al., 2022a, Sanh et al., 2022, Wang et al., 2022b, Longpre et al., 2023], in which language models are trained on samples pairing task instructions with their corresponding responses, allowing the models to generalize ‘zero-shot’ to new tasks. Early instruction-tuning datasets tended to focus on more traditional NLP tasks (e.g., natural language inference) rather than the more generic tasks that downstream users might perform [Wang et al., 2022a]. With the rise of ChatGPT and chat-based LMs (Claude, Gemini, etc.), post-training techniques evolved beyond instruction tuning to include preference tuning stages, with models undergoing instruction tuning followed by preference tuning (PreFT), commonly referred to as ‘RLHF’ [Ouyang et al., 2022].
Early work in RLHF originated from experiments on deep RL for control [Christiano et al., 2017, Ibarz et al., 2018, Leike et al., 2018] and typically involves first learning a reward model from human preferences and then optimizing a language model within an RL framework using the learned reward [Stiennon et al., 2020, Nakano et al., 2021, Askell et al., 2021, Ouyang et al., 2022]. More recently, approaches have been developed that directly train a language model on such preferences [Rafailov et al., 2024, Zhao et al., 2023], reducing the complexity of incorporating PreFT into training. While early approaches to PreFT were heavily human-centric, relying on tens or hundreds of thousands of human-written instructions and human preference labels, more recent work uses mixtures of human and synthetically generated preference data, along with multiple rounds of training and varied training algorithms [Touvron et al., 2023, Dubey et al., 2024, Gunter et al., 2024].
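For reference, the two paradigms above can be summarized by their standard training objectives (notation follows common usage; the exact formulations and hyperparameters used in any given system, including TÜLU 3, may differ). The reward-model-based RLHF stage optimizes a KL-regularized expected reward, while DPO [Rafailov et al., 2024] optimizes the policy directly on preference pairs:

\[
\max_{\pi_\theta} \ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big],
\]
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
\]

where \(r_\phi\) is the learned reward model, \(\pi_{\mathrm{ref}}\) is the reference (typically SFT) policy, \(\beta\) controls deviation from the reference, \(\sigma\) is the logistic function, and \((y_w, y_l)\) are the preferred and dispreferred completions for prompt \(x\).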