Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search

Paper · arXiv 2502.02508 · Published February 4, 2025
Test Time Compute · Self Refinement · Self Consistency · Feedback · Inference Time Scaling

Increasing test-time computation typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite the external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: can we internalize the search capability to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction, post-training LLMs for autoregressive search, i.e., an extended reasoning process with self-reflection and self-exploration of new strategies. To achieve this, we propose Chain-of-Action-Thought (COAT) reasoning and a two-stage training paradigm: (1) a small-scale format-tuning stage to internalize the COAT reasoning format and (2) a large-scale self-improvement stage leveraging reinforcement learning.

Orthogonal to the above work, our study investigates a new direction that equips LLMs with autoregressive search capabilities, i.e., an extended reasoning process with self-reflection and self-exploration of new strategies. Specifically, we introduce the Chain-of-Action-Thought (COAT) mechanism, which enables an LLM to take various meta-actions during problem solving. Unlike conventional post-training, which consists of large-scale supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF), we propose a novel two-stage training paradigm: (1) a small-scale format-tuning (FT) stage to internalize the COAT reasoning format and (2) a large-scale self-improvement stage that uses reinforcement learning with "Restart and Explore" (RAE) techniques.