RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abilities. However, how these methods shape reasoning capabilities remains largely elusive. Going beyond an accuracy-based investigation of how these two components sculpt the reasoning process, this paper introduces a novel analysis framework that quantifies reasoning paths and captures their qualitative changes under each training process (with models of 1.5B, 7B, and 14B parameters on mathematical domains). Specifically, we investigate the reasoning process at two levels of granularity: the trajectory level, which examines complete reasoning outputs, and the step level, which analyzes reasoning graphs whose nodes correspond to individual reasoning steps. Notably, clustering sampled reasoning outputs into unique trajectories reveals complementary effects: RL compresses incorrect trajectories, whereas SFT expands correct ones. Step-level analysis reveals that RL steepens the decay rates of the node visitation frequency, degree, and betweenness centrality distributions in the reasoning graph by roughly a factor of 2.5, whereas SFT flattens them to about one-third. This indicates that RL concentrates reasoning functionality into a small subset of steps, while SFT homogenizes it across many steps. Furthermore, by evaluating the reasoning graph topologies from multiple perspectives, we delineate the shared and distinct characteristics of RL and SFT. Our work presents a novel reasoning-path perspective that explains why the current best practice of two-stage training, SFT followed by RL, is successful, and offers practical implications for data construction and more efficient learning approaches.
It has been suggested that RL with verifiable rewards (RLVR) in LLMs simply incentivizes pre-existing capabilities of the base model (hereafter, Base model) (Liu et al., 2025c; Zhao et al., 2025a; Shah et al., 2025; Gandhi et al., 2025), since it performs Chain-of-Thought reasoning (Wei et al., 2022) over a vast vocabulary space within the constraints of the Base model's prior. Recently, Yue et al. (2025) investigated the Pass@k metric (Chen et al., 2021; Song et al., 2025b; Dang et al., 2025; Wen et al., 2025; Wu et al., 2025), which measures the probability that at least one correct solution is found when sampling k independent solutions from the model (i.e., Best-of-k). They showed that, as k increases, the Base model's Pass@k eventually surpasses that of the RL model trained with RLVR. This observation suggests that Base models already possess the capability to solve the problems that RL models can solve. However, these studies primarily evaluate answer accuracy without investigating the underlying reasoning process. Additionally, current state-of-the-art models for mathematics and coding, such as ProRL (Liu et al., 2025a) and AceReason (Chen et al., 2025d; Liu et al., 2025d), apply RL starting from DeepSeek-R1 (Guo et al., 2025) distillation checkpoints, essentially conducting two-stage training with SFT followed by RL (SFT+RL models). DeepSeek-R1 (Guo et al., 2025) itself likewise incorporates a cold-start SFT stage before RL.
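For reference, Pass@k is commonly computed with the unbiased estimator of Chen et al. (2021), given n sampled solutions to a problem of which c are verified correct. A minimal sketch (the variable names are ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021): the probability that at
    least one of k samples drawn from n total samples (c correct) is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist, so a hit is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 of 64 sampled solutions are correct.
print(pass_at_k(n=64, c=4, k=1))   # 0.0625 (equals plain accuracy)
print(pass_at_k(n=64, c=4, k=16))  # ~0.69  (large k rewards coverage over precision)
```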
Yet, various SFT+RL training approaches are currently developed through trial and error, without a clear grasp of the distinct roles of RL (reinforcement) and SFT (imitation). The important question to ask is then: "How do RL and SFT shape the reasoning process beyond accuracy measurements?"
In this paper, we systematically dive into the reasoning process at two granularities (Figure 1): (1) the trajectory level, where each entire thinking generation is regarded as a single trajectory, and (2) the step level, where each node (vertex) in the latent-space graph (hereafter referred to as the reasoning graph) represents a logical expression (i.e., a sentence), such as a problem setup, a calculation, or a verification.
For the trajectory-level analysis, we sample multiple outputs from the Base, RL, SFT, and SFT+RL models, then identify unique trajectories by clustering similar ones together. We find that RL decreases the number of unique incorrect trajectories, whether starting from the Base or the SFT model, whereas SFT increases the number of unique correct trajectories, suggesting that RL compresses incorrect trajectories while SFT expands correct ones. We also note that SFT alone preserves incorrect trajectories. These results justify the two-stage approach of first creating correct trajectories with SFT and then suppressing incorrect ones with RL. Additionally, RL consistently reduces the number of unique correct trajectories as well, which explains why the Base model's Pass@k converges with that of the RL model at large k.
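The trajectory-level procedure can be sketched as follows; this is an illustrative outline, not the paper's exact implementation, and the embedding model and distance threshold below are assumptions:

```python
# Illustrative sketch of the trajectory-level analysis: embed each sampled
# reasoning output, merge near-duplicates by clustering, and count the resulting
# unique trajectories separately for correct and incorrect answers.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

_embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

def count_unique(trajectories: list[str], threshold: float = 0.3) -> int:
    """Number of clusters (= unique trajectories) among the given outputs."""
    if len(trajectories) <= 1:
        return len(trajectories)
    emb = _embedder.encode(trajectories, normalize_embeddings=True)
    labels = AgglomerativeClustering(
        n_clusters=None, metric="cosine", linkage="average",
        distance_threshold=threshold,
    ).fit_predict(emb)
    return len(set(labels))

# Usage: split the sampled outputs for one problem by answer correctness first,
# then count unique correct and unique incorrect trajectories separately, e.g.
#   n_correct   = count_unique([t for t, ok in samples if ok])
#   n_incorrect = count_unique([t for t, ok in samples if not ok])
```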
At the step level, we construct reasoning graphs by segmenting model outputs into sentences, generating sentence embeddings, and clustering these representations to define nodes in sentence space. We observe that rank plots of node visitation frequency, degree, and betweenness centrality in the reasoning graphs follow exponential laws. Remarkably, analysis of their decay rates reveals that RL raises the decay rate whereas SFT lowers it, suggesting that RL not only compresses the graph but also consolidates functionality (e.g., hubs) into fewer nodes (steps).
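A rough sketch of this step-level pipeline is given below, assuming a generic sentence embedder, a naive sentence splitter, and networkx for the graph statistics; none of these concrete choices are claimed to match the paper's implementation. The decay rate is read off as the slope of a log-linear fit to the rank plot:

```python
# Illustrative step-level sketch: sentences become reasoning-graph nodes via
# embedding + clustering, consecutive sentences within one output define directed
# edges, and the decay rate is the (negative) slope of a log-linear fit to the
# rank plot of a node statistic. All concrete settings below are assumptions.
import numpy as np
import networkx as nx
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

def build_reasoning_graph(outputs: list[str], threshold: float = 0.4) -> nx.DiGraph:
    sentences, spans = [], []
    for out in outputs:
        steps = [s.strip() for s in out.split(". ") if s.strip()]  # naive splitter
        spans.append((len(sentences), len(sentences) + len(steps)))
        sentences.extend(steps)
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(sentences, normalize_embeddings=True)
    node_of = AgglomerativeClustering(            # cluster id = reasoning-graph node
        n_clusters=None, metric="cosine", linkage="average",
        distance_threshold=threshold,
    ).fit_predict(emb)
    G = nx.DiGraph()
    for start, end in spans:                      # one trajectory = one walk over nodes
        path = [int(n) for n in node_of[start:end]]
        G.add_nodes_from(path)
        G.add_edges_from(zip(path[:-1], path[1:]))
    return G

def decay_rate(values) -> float:
    """Slope of log(value) vs. rank on the rank plot (larger = faster decay)."""
    y = np.sort(np.asarray(values, dtype=float))[::-1]
    y = y[y > 0]
    if len(y) < 2:
        return float("nan")
    slope, _ = np.polyfit(np.arange(1, len(y) + 1), np.log(y), 1)
    return -slope

# Toy usage with two sampled outputs for the same problem.
outs = [
    "Let x be the unknown. Then 2x + 3 = 7. So 2x = 4. The answer is x = 2.",
    "Set 2x + 3 = 7. Subtract 3 to get 2x = 4. Divide by 2. The answer is x = 2.",
]
G = build_reasoning_graph(outs)
print(decay_rate([deg for _, deg in G.degree()]))               # degree decay rate
print(decay_rate(list(nx.betweenness_centrality(G).values())))  # betweenness decay rate
```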
Our contributions are summarized as follows:
• Trajectory-level analysis confirms that RL compresses incorrect trajectories while SFT expands correct ones, highlighting why the two-stage approach (SFT then RL) is effective.
• Step-level analysis uncovers that RL also consolidates reasoning graph functionality into fewer steps, whereas SFT expands it across diverse steps. Moreover, through topological metrics, we demonstrate that, while both RL and SFT generate local cyclic structures, they produce distinct global topologies.
• At both the trajectory and step levels, we provide empirical support that RL squeezes and SFT expands the reasoning process. Our findings explain why existing post-training recipes work and suggest directions for developing new training methods and for data curation.