LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Does failed-step fraction predict reasoning quality better?

Can we use the fraction of abandoned reasoning branches to forecast whether a model will solve a problem correctly? This matters because it could guide more efficient test-time scaling than simply adding more tokens.

Note · 2026-02-22 · sourced from Reasoning Critiques
How should we allocate compute budget at inference time?

Across 10 large reasoning models on math and scientific reasoning tasks, a single structural graph metric — Failed-Step Fraction (FSF) — consistently outperforms CoT length and review ratio as a predictor of correctness.

FSF is defined as the fraction of steps belonging to failed exploratory branches in the reasoning graph. A failed branch is a set of reasoning steps that were explored and then abandoned before reaching the final answer. High FSF means the model spent significant effort on dead ends; low FSF means reasoning was mostly direct.
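Given that definition, FSF is straightforward to compute once a trace has been segmented into steps and branches. A minimal sketch, assuming an upstream segmentation that tags each step with a hypothetical `on_failed_branch` flag (the flag name and step format are illustrative, not from the source):

```python
def failed_step_fraction(steps):
    """FSF: fraction of reasoning steps that lie on abandoned branches.

    `steps` is a list of dicts, each with a boolean "on_failed_branch"
    flag assumed to come from an upstream step/branch segmentation
    of the chain of thought.
    """
    if not steps:
        return 0.0
    failed = sum(1 for s in steps if s["on_failed_branch"])
    return failed / len(steps)

# Toy 6-step trace where steps 2-3 form one abandoned branch.
trace = [
    {"text": "set up the equation",        "on_failed_branch": False},
    {"text": "try substitution u = x^2",   "on_failed_branch": True},
    {"text": "that leads nowhere; back up","on_failed_branch": True},
    {"text": "factor directly instead",    "on_failed_branch": False},
    {"text": "solve for x",                "on_failed_branch": False},
    {"text": "state the final answer",     "on_failed_branch": False},
]
print(failed_step_fraction(trace))  # 2/6 ≈ 0.333
```

A direct trace with no abandoned branches scores 0.0; a trace that wanders entirely through dead ends before answering approaches 1.0.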

Three converging lines of evidence:

  1. Correlation analysis (conditional on question): shorter reasoning traces are associated with higher accuracy; lower review ratio is associated with higher accuracy; FSF is the strongest and most stable predictor across difficulty strata and all 10 models

  2. Test-time selection: Sampling 64 generations per problem and reranking by each metric shows FSF-based selection yields the largest pass@1 gains (up to 10% on AIME) — outperforming length- or review-based selection

  3. Causal intervention: Directly editing CoT traces to remove failed branches substantially improves accuracy on previously-incorrect traces
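The intervention in point 3 can be sketched as a simple filter: drop every step on an abandoned branch and keep the surviving main path. This is an illustrative reconstruction, not the paper's exact editing procedure, and it reuses the same hypothetical `on_failed_branch` flag from an assumed branch segmentation:

```python
def prune_failed_branches(steps):
    """Return the trace with abandoned-branch steps removed.

    Sketch of the causal intervention: editing the CoT so that failed
    exploratory branches no longer sit in the model's context. Assumes
    each step dict carries a hypothetical "on_failed_branch" flag.
    """
    return [s for s in steps if not s["on_failed_branch"]]

trace = [
    {"text": "set up the equation",          "on_failed_branch": False},
    {"text": "try an inequality bound",      "on_failed_branch": True},
    {"text": "abandon that; factor instead", "on_failed_branch": False},
    {"text": "state the final answer",       "on_failed_branch": False},
]
pruned = prune_failed_branches(trace)
print(len(pruned))  # 3
```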

The causal mechanism: failed branches do not disappear from the model's context when backtracking occurs. Current models do not fully "unsee" earlier mistakes when exploring new paths. The failed branches bias subsequent exploration, pulling reasoning toward already-rejected directions and compounding errors.

This connects to "Which sentences actually steer a reasoning trace?": thought anchors are the positive pivots where reasoning changes direction successfully; FSF is the corresponding negative measure of how much failed exploration is poisoning the context.

The practical implication is concrete: structure-aware test-time scaling (select for low FSF) outperforms indiscriminate scaling (add more tokens, encourage more review). Length and review are proxies for FSF — but noisy ones. The graph structure of reasoning is the real signal.
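Structure-aware selection then amounts to best-of-n reranking by FSF: sample several generations and keep the one with the lowest score. A minimal sketch, assuming candidates arrive with a precomputed FSF (the candidate format and scorer are illustrative):

```python
def select_by_fsf(candidates, fsf_fn):
    """Best-of-n selection: return the candidate whose reasoning trace
    has the lowest Failed-Step Fraction.

    `fsf_fn` is an assumed scorer mapping a candidate to its FSF
    in [0, 1]; ties resolve to the first-sampled candidate.
    """
    return min(candidates, key=fsf_fn)

# Toy pool of three sampled generations with precomputed FSF scores.
candidates = [
    {"answer": "A", "fsf": 0.45},
    {"answer": "B", "fsf": 0.10},
    {"answer": "C", "fsf": 0.30},
]
best = select_by_fsf(candidates, fsf_fn=lambda c: c["fsf"])
print(best["answer"])  # B
```

In practice the pool would be the 64 sampled generations per problem described above, with FSF computed from each trace's reasoning graph rather than supplied by hand.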



Original note title: failed-step fraction is a stronger predictor of reasoning quality than trace length or review ratio