What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
Large reasoning models (LRMs) spend substantial test-time compute on long chain-of-thought (CoT) traces, but what characterizes an effective CoT remains unclear. While prior work reports gains from lengthening CoTs and increasing review (revisiting earlier steps) via appended wait tokens, recent studies suggest that shorter thinking can outperform longer traces. We therefore conduct a systematic evaluation across ten LRMs on math and scientific reasoning. Contrary to the “longer-is-better” narrative, we find that both naive CoT lengthening and increased review are associated with lower accuracy.
As CoT unfolds step by step, token-level metrics can conflate verbosity with process quality. We introduce a graph view of CoT to extract structure and identify a single statistic—the Failed-Step Fraction (FSF), the fraction of steps in abandoned branches—that consistently out-predicts length and review ratio for correctness across models. To probe causality, we design two interventions. First, we rank candidate CoTs by each metric at test time, where FSF yields the largest pass@1 gains; second, we edit CoTs to remove failed branches, which significantly improves accuracy, indicating that failed branches bias subsequent reasoning. Taken together, these results characterize effective CoTs as those that fail less and support structure-aware test-time scaling over indiscriminately generating long CoTs.
Large reasoning models (LRMs) (Jaech et al., 2024; Rastogi et al., 2025) increasingly exploit test-time compute by generating long chain-of-thought (CoT) traces; for challenging prompts, decoding can span hundreds of thousands of tokens. A notable line of work, beginning with S1 (Muennighoff et al., 2025) and reinforced in subsequent papers (Ringel et al., 2025; Jurayj et al., 2025), shows that appending 'wait' to the generation to increase test-time compute can improve reasoning performance. However, it is unclear whether such long reasoning traces are desirable. They not only significantly increase serving costs for those hosting LRMs but also degrade user experience through added latency, especially for questions that intuitively do not require long reasoning. Moreover, recent studies (Wu et al., 2025b; Hassid et al., 2025; Ghosal et al., 2025; Marjanović et al., 2025) report that shorter thoughts are better, and that continuing to append 'wait' can induce oscillatory performance. Furthermore, it remains unclear whether different LRMs exhibit similar reasoning behaviors.
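For concreteness, a minimal sketch of this wait-appending (budget-forcing) idea is shown below; the `generate` interface, parameter names, and defaults are illustrative assumptions rather than the actual S1 implementation.

```python
# Minimal sketch of S1-style 'wait' appending (budget forcing). Assumptions
# (not the actual S1 code): `generate(context, max_new_tokens)` is a generic
# text-completion call returning a continuation string; names and defaults
# are illustrative.
def generate_with_wait(prompt, generate, num_waits=2, max_new_tokens=4096):
    trace = generate(prompt, max_new_tokens)
    for _ in range(num_waits):
        # Suppress the end of thinking and force additional test-time compute.
        trace += "\nWait"
        trace += generate(prompt + trace, max_new_tokens)
    return trace
```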
These conflicting findings motivate a systematic re-examination of how lexical and structural properties of reasoning traces relate to reasoning performance. In this work, we evaluate the effectiveness of reasoning traces along multiple dimensions and uncover consistent patterns across LRMs. We analyze ten reasoning models with accessible reasoning traces on tasks spanning math and scientific reasoning (HARP (Yue et al., 2024) and GPQA-Diamond (Rein et al., 2024)), with the aim of providing systematic insight into what characterizes effective reasoning.
We begin by examining two properties that recent work suggests may drive reasoning performance: CoT length and review behaviors. In the S1 approach, inserting 'wait' increases generation Length and encourages Review behaviors, including checking, verifying, or backtracking over prior steps. These Review behaviors have been shown to be important for reasoning (Gandhi et al., 2025; Chen et al., 2024). We therefore first investigate how Length and Review behaviors contribute to the reasoning improvements observed in Muennighoff et al. (2025). To isolate the effect of Review from Length, we define the Review Ratio as the fraction of Review tokens within a CoT. Using a conditional correlation analysis to control for question-level confounds, we find consistent patterns across models and datasets: within the same question, shorter reasoning traces are associated with higher accuracy, and a lower Review Ratio is likewise associated with higher accuracy.
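As an illustration of these two quantities, the sketch below computes the Review Ratio and a simplified within-question comparison; the trace field names and the mean-difference aggregation are illustrative assumptions, a stand-in for the full conditional correlation analysis rather than the exact pipeline.

```python
# Minimal sketch (field names are illustrative assumptions): each trace is a
# dict with "question_id", "review_token_count", "total_token_count", "correct".
from collections import defaultdict
from statistics import mean

def review_ratio(trace):
    # Review Ratio: fraction of CoT tokens belonging to review behaviors
    # (checking, verifying, or backtracking over prior steps).
    return trace["review_token_count"] / max(trace["total_token_count"], 1)

def within_question_gap(traces, metric):
    # Simplified stand-in for the conditional correlation analysis: compare the
    # metric between correct and incorrect traces of the same question, then
    # average the gaps, so question difficulty does not confound the result.
    by_question = defaultdict(list)
    for t in traces:
        by_question[t["question_id"]].append(t)
    gaps = []
    for group in by_question.values():
        correct = [metric(t) for t in group if t["correct"]]
        wrong = [metric(t) for t in group if not t["correct"]]
        if correct and wrong:
            gaps.append(mean(correct) - mean(wrong))
    return mean(gaps) if gaps else 0.0
```

Under this convention, a negative gap for `review_ratio` (or for raw length) corresponds to the pattern reported above: correct traces tend to be shorter and to contain proportionally less review.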
We further hypothesize that Length and Review Ratio are surface proxies for underlying structural properties of the reasoning (Jiang et al., 2025), and we test one possible cause: the prevalence of failed reasoning branches. We therefore extract a reasoning graph for each CoT. This representation allows graph-level metrics to be evaluated. In particular, we focus on the Failed-Step Fraction (FSF): the fraction of steps belonging to failed exploratory branches.
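A minimal sketch of how FSF can be computed from an extracted reasoning graph is given below; the step-to-children mapping and the designated final step are an assumed, simplified interface, not our actual extraction format.

```python
# Minimal sketch of the Failed-Step Fraction (FSF). Assumed interface:
# `children` maps each step id to the ids of its child steps in the reasoning
# graph (assumed acyclic); `final_step` is the step stating the final answer.
# A step counts as failed if no path from it reaches the final answer, i.e. it
# lies on an abandoned exploratory branch.
def failed_step_fraction(children, final_step):
    memo = {}

    def reaches_final(node):
        if node not in memo:
            memo[node] = node == final_step or any(
                reaches_final(child) for child in children.get(node, ())
            )
        return memo[node]

    all_steps = set(children) | {c for cs in children.values() for c in cs}
    failed = [s for s in all_steps if not reaches_final(s)]
    return len(failed) / max(len(all_steps), 1)
```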
Among graph-level features, FSF emerges as a stronger and more stable predictor of correctness than CoT Length or Review Ratio, with consistent, significant correlations across difficulty strata and across all ten models on both math and scientific reasoning. These findings support measuring reasoning quality via the reasoning graph. Figure 1 illustrates our annotation and the corresponding extracted reasoning graph.
Finally, we design two experiments to test causality. First, we run a test-time intervention on AIME-25 and GPQA-Diamond: for each problem we sample 64 generations, rerank them by each metric, and evaluate top-1 (pass@1) performance. FSF-based selection yields the largest and most consistent gains, with up to 10% accuracy improvement on AIME, while selection by Length or Review Ratio gives smaller benefits. Second, we intervene on the CoT directly via controlled editing: removing failed branches substantially increases accuracy on initially incorrect traces. Together, these results provide causal evidence that FSF is a strong lever for accuracy, that long failed branches bias subsequent exploration, and that current models do not fully “unsee” earlier mistakes when backtracking.
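The reranking intervention can be summarized by the following sketch, which selects one candidate per problem according to a given metric (e.g., FSF) and scores that selection as pass@1; the data layout and names are illustrative assumptions.

```python
# Minimal sketch of the metric-based reranking intervention. Assumed layout:
# `samples_per_problem` maps a problem id to its 64 sampled generations, each
# given as a (metric_value, is_correct) pair, where metric_value could be FSF,
# Length, or Review Ratio (lower is taken to be better here).
def pass_at_1_after_rerank(samples_per_problem):
    solved = 0
    for samples in samples_per_problem.values():
        # Select the single candidate with the lowest metric value and score it
        # as the model's pass@1 answer.
        best = min(samples, key=lambda pair: pair[0])
        solved += int(best[1])
    return solved / max(len(samples_per_problem), 1)
```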