Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling

Paper · arXiv 2501.11651 · Published January 20, 2025
Test Time Compute · Inference Time Scaling

While reinforcement learning (RL) holds promise for enabling self-exploration and learning from feedback, recent attempts yield only modest improvements in complex reasoning. In this paper, we present T1, which scales RL by encouraging exploration and enables an understanding of inference scaling. We first initialize the LLM using synthesized chain-of-thought data that integrates trial-and-error and self-verification. To scale RL training, we promote increased sampling diversity through oversampling. We further employ an entropy bonus as an auxiliary loss, alongside a dynamic anchor for regularization, to facilitate reward optimization. We demonstrate that T1, built on open LLMs, exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks.
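To make the training objective concrete, here is a minimal PyTorch sketch of an RL loss combining a policy-gradient term, an entropy bonus, and a KL penalty toward a dynamic anchor model. The function name, tensor shapes, and coefficients are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def policy_loss(logits, anchor_logits, actions, advantages,
                entropy_coef=0.01, kl_coef=0.1):
    """Hypothetical RL objective in the spirit of T1.
    logits / anchor_logits: [batch, seq, vocab];
    actions: [batch, seq] sampled token ids;
    advantages: [batch, seq] reward-derived weights."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    # Policy-gradient term on the sampled tokens.
    token_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_loss = -(token_logp * advantages).mean()

    # Entropy bonus: an auxiliary loss that encourages sampling diversity.
    entropy = -(probs * log_probs).sum(-1).mean()

    # KL penalty toward the anchor; the anchor is refreshed during training
    # ("dynamic anchor") rather than frozen at the initial checkpoint.
    anchor_logp = F.log_softmax(anchor_logits, dim=-1)
    kl = (probs * (log_probs - anchor_logp)).sum(-1).mean()

    return pg_loss - entropy_coef * entropy + kl_coef * kl
```

In this sketch, the anchor would be refreshed periodically, e.g. via `anchor.load_state_dict(policy.state_dict())` every few hundred steps, so the regularizer tracks the improving policy instead of pinning it to the starting checkpoint.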

Regarding test-time scaling, existing methods typically rely on repeated sampling (Brown et al., 2024), where multiple outputs are generated from a given policy model and auxiliary verifiers (Snell et al., 2024) are used to select the best response. As a result, their inference costs increase significantly. However, these approaches do not update the policy model itself, and thus fail to fundamentally improve the reasoning ability of LLMs. Repeatedly sampling short responses with verifiers also falls short of the expected inference scaling behavior (OpenAI, 2024). Ideally, deeper thinking and longer generation should directly lead to better performance without relying on external signals. Consequently, improving reasoning through RL and inference scaling remains an under-explored challenge.
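For reference, verifier-guided repeated sampling (best-of-N) can be sketched as below; `generate` and `verifier_score` are hypothetical placeholders for a policy model's sampling call and an external verifier, not a real API.

```python
def best_of_n(prompt, generate, verifier_score, n=16):
    """Hypothetical best-of-N selection with an external verifier."""
    candidates = [generate(prompt) for _ in range(n)]  # n independent samples
    # The policy itself is never updated: an external verifier picks the
    # winner, so inference cost grows with n while the model's underlying
    # reasoning ability stays fixed.
    return max(candidates, key=verifier_score)
```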

We finetune the LLM using synthesized chain-of-thought data with trial-and-error and self-verification, substantially expanding the exploration space before RL training.
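As an illustration of what such a training example might look like, the following sketch assembles a failed attempt, a self-verification step, and a revision into one supervised target. The field names and template are assumptions, not the paper's exact data format.

```python
def build_cot_sample(question, failed_attempt, check, corrected_solution):
    """Hypothetical template for synthesized chain-of-thought data that
    bakes trial-and-error and self-verification into the training target."""
    target = (
        f"Attempt: {failed_attempt}\n"
        f"Verification: {check}\n"          # self-verification of the attempt
        f"Revised solution: {corrected_solution}"
    )
    return {"prompt": question, "completion": target}
```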