Does failed-step fraction predict reasoning quality better?
Can we use the fraction of abandoned reasoning branches to forecast whether a model will solve a problem correctly? This matters because it could guide more efficient test-time scaling than simply adding more tokens.
Across 10 large reasoning models on math and scientific reasoning tasks, a single structural graph metric — Failed-Step Fraction (FSF) — consistently outperforms CoT length and review ratio as a predictor of correctness.
FSF is defined as the fraction of steps belonging to failed exploratory branches in the reasoning graph. A failed branch is a set of reasoning steps that were explored and then abandoned before reaching the final answer. High FSF means the model spent significant effort on dead ends; low FSF means reasoning was mostly direct.
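A minimal sketch of how FSF could be computed, assuming the trace has already been segmented into steps and each step carries a label saying whether it lies on the branch that reaches the final answer. The `on_final_path` field is an illustrative assumption, not the paper's actual schema (the paper builds a reasoning graph and labels branches automatically):

```python
def failed_step_fraction(steps):
    """Fraction of reasoning steps belonging to abandoned (failed) branches.

    `steps` is a list of dicts with an illustrative 'on_final_path' flag:
    True if the step feeds the final answer, False if it was explored
    and then abandoned.
    """
    if not steps:
        return 0.0
    failed = sum(1 for s in steps if not s["on_final_path"])
    return failed / len(steps)

# A direct trace: every step feeds the final answer, so FSF = 0.0
direct = [{"on_final_path": True}] * 4

# A meandering trace: two of five steps were dead ends, so FSF = 0.4
meandering = [{"on_final_path": f} for f in (True, False, False, True, True)]
```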
Three converging lines of evidence:
- Correlation analysis (conditional on question): shorter reasoning traces are associated with higher accuracy; lower review ratio is associated with higher accuracy; and FSF is the strongest, most stable predictor across difficulty strata and all 10 models.
- Test-time selection: sampling 64 generations per problem and reranking by each metric shows that FSF-based selection yields the largest pass@1 gains (up to 10% on AIME), outperforming length- and review-based selection.
- Causal intervention: directly editing CoT traces to remove failed branches substantially improves accuracy on previously incorrect traces.
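The test-time selection result can be sketched as structure-aware best-of-N: sample several traces per problem, score each by FSF, and return the answer from the trace with the least failed exploration. The function names and data shapes here are illustrative assumptions, not the paper's implementation:

```python
def fsf_select(generations, fsf_of):
    """Structure-aware best-of-N selection.

    `generations` is a list of (trace, answer) pairs sampled for one
    problem (the paper uses 64 samples); `fsf_of` maps a trace to its
    Failed-Step Fraction. Returns the answer from the trace with the
    lowest FSF, i.e. the most direct reasoning path.
    """
    best_trace, best_answer = min(generations, key=lambda g: fsf_of(g[0]))
    return best_answer

# Toy usage with hypothetical precomputed FSF scores per trace:
gens = [("trace_a", "17"), ("trace_b", "42"), ("trace_c", "17")]
scores = {"trace_a": 0.6, "trace_b": 0.1, "trace_c": 0.5}
fsf_select(gens, scores.get)  # picks trace_b, the lowest-FSF trace
```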
The causal mechanism: failed branches do not disappear from the model's context when backtracking occurs. Current models do not fully "unsee" earlier mistakes when exploring new paths. The failed branches bias subsequent exploration, pulling reasoning toward already-rejected directions and compounding errors.
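The causal intervention amounts to pruning: drop the steps that belong to abandoned branches so the model can be re-prompted with a trace containing no dead ends in context. A sketch, again assuming steps labeled with an illustrative `on_final_path` flag:

```python
def prune_failed_branches(steps):
    """Keep only the steps on the path to the final answer.

    The pruned trace can be fed back to the model as a prefix, removing
    failed branches from its context. `on_final_path` is an assumed
    label, not the paper's schema.
    """
    return [s for s in steps if s["on_final_path"]]

# Steps 1 and 3 survive; the abandoned step 2 is removed from context.
trace = [
    {"id": 1, "on_final_path": True},
    {"id": 2, "on_final_path": False},
    {"id": 3, "on_final_path": True},
]
pruned = prune_failed_branches(trace)
```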
This connects to "Which sentences actually steer a reasoning trace?": thought anchors are the positive pivots where reasoning changes direction successfully, while FSF is the corresponding negative measure of how much failed exploration poisons the context.
The practical implication is concrete: structure-aware test-time scaling (select for low FSF) outperforms indiscriminate scaling (add more tokens, encourage more review). Length and review are proxies for FSF — but noisy ones. The graph structure of reasoning is the real signal.
Source: Reasoning Critiques
Related concepts in this collection
- Which sentences actually steer a reasoning trace?
  Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.
  Connection: FSF is the negative measure of the same phenomenon: failed branches vs. successful pivots.
- Why do correct reasoning traces contain fewer tokens?
  In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges the assumption that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
  Connection: FSF explains why: shorter correct traces contain fewer failed branches.
- Does self-revision actually improve reasoning in language models?
  When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.
  Connection: Self-revision is one mechanism that creates failed branches; FSF captures the accumulated damage.
- Why does parallel reasoning outperform single-chain thinking?
  Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
  Connection: Parallel sampling avoids failed-branch contamination by exploring independent paths; FSF explains the advantage.
- Do prior errors in context history amplify future errors?
  When a language model makes mistakes early in a task, do those errors contaminate subsequent predictions? We explore whether error accumulation degrades long-horizon performance through passive context pollution rather than capability limits.
  Connection: Self-conditioning is the mechanism that makes high FSF toxic: failed branches remain in context and passively contaminate subsequent reasoning. FSF quantifies the degree of contamination; self-conditioning explains why it degrades performance.
Original note title: failed-step fraction is a stronger predictor of reasoning quality than trace length or review ratio