AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

Paper · arXiv 2505.24298 · Published May 30, 2025
Reinforcement Learning · Reasoning Architectures · Training Fine Tuning · Novel Architectures

1 Introduction

Reinforcement learning (RL) has emerged as a new scaling paradigm for enhancing the capabilities of large language models (LLMs) by eliciting thinking abilities [52]. Given a prompt, RL allows an LLM to generate thinking tokens before outputting a final answer, enabling test-time scaling [29, 47]. Such thinking LLMs, known as Large Reasoning Models (LRMs), have shown particularly strong capabilities on challenging reasoning problems, such as math [9, 5, 20], coding [3, 14, 15], logic puzzles [22, 34], and agentic tasks [23, 57].

Effective RL training often requires massive parallelization to produce a large batch of rollouts for sufficient exploration, which is key to obtaining strong model performance. For example, popular RL algorithms such as PPO [42] and GRPO [43] often require an effective training batch of thousands of outputs [60, 61, 53]. Moreover, an LRM can generate tens of thousands of thinking tokens for each input prompt [6], creating a pressing need for an efficient system that can run RL training at large scale.
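
To make the scale concrete, here is a minimal sketch of how a GRPO-style training batch is assembled: each prompt is expanded into a group of sampled rollouts, and rewards are normalized within each group to produce advantages. The function names, the sampler, the reward function, and the group size are illustrative assumptions, not AREAL's actual interfaces.

from statistics import mean, stdev

def grpo_advantages(rewards, eps=1e-6):
    # Group-normalized advantages: A_i = (r_i - mean(r)) / (std(r) + eps).
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

def build_batch(prompts, sample_fn, reward_fn, group_size=16):
    # Expand each prompt into a group of rollouts and flatten into one batch.
    # With 256 prompts and group_size=16, the effective batch already holds
    # 4096 sampled outputs, each potentially tens of thousands of tokens long.
    batch = []
    for prompt in prompts:
        outputs = [sample_fn(prompt) for _ in range(group_size)]
        rewards = [reward_fn(prompt, out) for out in outputs]
        for out, adv in zip(outputs, grpo_advantages(rewards)):
            batch.append({"prompt": prompt, "output": out, "advantage": adv})
    return batch  # len(batch) == len(prompts) * group_size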

However, developing an efficient large-scale RL system is challenging. An RL system must frequently switch between LLM generation and training, which introduces significant system overhead without careful optimization. For LRMs, the output length of the training model varies significantly across prompts and throughout the RL process, resulting in an ever-changing workload for both generation and training. This variability often leaves high-performance hardware idle, wasting computation. Furthermore, classical large-scale RL algorithms like PPO and GRPO typically require on-policy training data, i.e., samples generated by the latest model, to ensure the best model performance, which poses additional system challenges.
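
The cost of this variability under synchronous batching is easy to estimate: if an entire batch must wait for its longest output, generation-phase utilization is roughly the mean output length divided by the maximum. The following back-of-the-envelope sketch (our illustration, not a measurement from the paper) makes the point.

import random

def sync_generation_utilization(lengths):
    # Fraction of generation-phase compute doing useful work when the whole
    # batch waits for its longest output (idealized decode-bound model).
    return sum(lengths) / (len(lengths) * max(lengths))

# Output lengths spread between 1K and 32K thinking tokens, as is common
# for LRMs on hard reasoning prompts.
random.seed(0)
lengths = [random.randint(1_000, 32_000) for _ in range(256)]
print(f"utilization = {sync_generation_utilization(lengths):.0%}")  # roughly 50%

In this idealized model, nearly half of the generation-phase compute is lost to waiting, and the loss grows with the spread of output lengths.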

Consequently, most existing large-scale RL systems are designed in a fully synchronous manner [27, 11, 45, 44], strictly alternating between LLM generation and training, which ensures that the LLM is always trained on the latest outputs for the best practical performance. In such a synchronous design, the generation step must wait for the longest output in a batch to finish. Given the highly variable output lengths of LRMs, a synchronous RL system therefore suffers from severe training inefficiency. Very recently, there have been attempts to parallelize generation and training [30, 24, 49]. These works use outputs generated by a previous model version to update the current model; to preserve performance, the model version used for rollout generation is restricted to being at most one or two steps older than the training model. However, all of these systems still follow a batched generation setting, where all samples within a training batch come from the same model version. Accordingly, the system inefficiency of the generation phase remains unaddressed.
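
In schematic form, the control flow of these two designs looks as follows; generate and train are placeholder callables standing in for a generation engine and a training step, not APIs from the cited systems.

from concurrent.futures import ThreadPoolExecutor

def synchronous_rl(model, prompt_batches, generate, train):
    # Strictly alternating phases: every batch comes from the latest model,
    # but generation blocks on the slowest output and trainers idle meanwhile.
    for prompts in prompt_batches:
        batch = generate(model, prompts)
        model = train(model, batch)
    return model

def one_step_off_rl(model, prompt_batches, generate, train):
    # Pipelined variant: batch k is generated by the model from step k-1, so
    # generation overlaps training, yet batches remain monolithic and each
    # still waits for its slowest output before training can consume it.
    it = iter(prompt_batches)
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(generate, model, next(it))
        for prompts in it:
            batch = pending.result()
            pending = pool.submit(generate, model, prompts)  # pre-update weights
            model = train(model, batch)
        model = train(model, pending.result())
    return model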

To fundamentally resolve these issues in system design, we develop AREAL, a fully Asynchronous RL training system for LRMs that completely decouples generation from training without hurting final performance. AREAL runs LLM generation in a streaming manner: each rollout worker continuously generates new outputs without waiting, leading to high GPU utilization. Meanwhile, the trainer workers in AREAL perform parallel model updates whenever a full training batch has been collected from the rollout workers. After each update, the new model weights are synchronized to every rollout worker. In this asynchronous design, a training batch in AREAL may contain samples generated by different model versions. AREAL therefore incorporates a modified PPO objective that can leverage samples generated by much older model versions without any performance drop.
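
This design maps naturally onto a producer-consumer pattern. The sketch below is written against assumed interfaces rather than AREAL's actual codebase: rollout workers stream version-tagged samples into a shared queue, a trainer consumes fixed-size batches as soon as they fill up, and fresh weights are published back to the workers. Because a batch may mix samples from several behavior versions, the training step must correct for staleness, which is exactly the role of AREAL's modified PPO objective.

import queue
import threading

sample_queue = queue.Queue(maxsize=4096)  # rollouts waiting to be consumed

class WeightStore:
    # Latest weights plus a version counter; rollout workers poll this.
    def __init__(self, weights):
        self._lock = threading.Lock()
        self.weights, self.version = weights, 0

    def publish(self, weights):
        with self._lock:
            self.weights, self.version = weights, self.version + 1

    def snapshot(self):
        with self._lock:
            return self.weights, self.version

def rollout_worker(store, next_prompt, generate, stop):
    # Streams generations continuously; never waits for a batch boundary.
    while not stop.is_set():
        weights, version = store.snapshot()   # pick up the newest weights
        prompt = next_prompt()
        output = generate(weights, prompt)    # one rollout at a time
        sample_queue.put({"prompt": prompt, "output": output,
                          "behavior_version": version})

def trainer(store, train_step, batch_size, num_steps):
    # Consumes a batch as soon as it is full; samples inside may carry
    # different behavior_version tags, so train_step must be staleness-aware
    # (in AREAL, via the modified PPO objective).
    for _ in range(num_steps):
        batch = [sample_queue.get() for _ in range(batch_size)]
        store.publish(train_step(store.snapshot()[0], batch))

The essential property is that no rollout worker ever blocks on another worker or on a batch boundary; the trainer alone defines batch boundaries, and weight synchronization bounds how stale the mixed-version samples can become.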