Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does step-level confidence outperform global averaging for trace filtering?

Explores whether measuring confidence at individual reasoning steps—rather than averaging across entire traces—better identifies and filters out low-quality reasoning. Matters because it could dramatically improve both accuracy and compute efficiency in multi-trace reasoning.

Note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

Standard majority voting treats all reasoning traces equally. DeepConf improves on this by filtering traces based on model-internal confidence signals — and the key finding is that local (step-level) confidence is more informative than global confidence averaged across the full trace.
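
To make the contrast concrete, here is a minimal sketch of confidence-filtered majority voting. It assumes each trace already carries a scalar confidence score (e.g., the minimum sliding-window confidence along the trace); the function name, `keep_frac`, and the toy numbers are illustrative, not DeepConf's exact procedure.

```python
from collections import Counter

def filtered_majority_vote(traces, keep_frac=0.4):
    """Majority vote over only the most confident traces.

    traces: list of (final_answer, trace_confidence) pairs, where
    trace_confidence is any scalar quality proxy (e.g., the minimum
    sliding-window confidence along the trace).
    keep_frac: fraction of highest-confidence traces to keep.
    """
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_frac))]
    votes = Counter(answer for answer, _ in kept)
    return votes.most_common(1)[0][0]

# Toy usage: plain majority voting would pick "41" (3 votes vs 2);
# filtering keeps the two high-confidence traces, which both say "42".
answer = filtered_majority_vote(
    [("42", 0.91), ("41", 0.35), ("41", 0.28), ("41", 0.30), ("42", 0.88)])
print(answer)  # "42"
```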

Global confidence fails in two ways: (1) it averages over the entire trace, masking critical reasoning breakdowns at specific intermediate steps; (2) it requires the full trace to be generated before it can be computed, preventing early stopping.
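
A toy illustration of failure mode (1): the per-step confidence scores below average out to a healthy-looking global number even though one step has clearly broken down. The numbers are invented for illustration.

```python
# Hypothetical per-step confidences for one reasoning trace.
step_conf = [0.95, 0.93, 0.94, 0.31, 0.92, 0.96]  # step 4 is a breakdown

global_conf = sum(step_conf) / len(step_conf)  # ~0.835: looks healthy
worst_step = min(step_conf)                    # 0.31: flags the broken step

print(f"global average: {global_conf:.3f}, worst step: {worst_step:.3f}")
```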

Step-level confidence catches local failures as they occur. A single low-confidence step is a signal worth acting on immediately, before it compounds through subsequent reasoning. This enables early termination of low-quality traces, reducing unnecessary token generation while maintaining or improving accuracy.
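
A sketch of what that early termination could look like during streaming decoding, assuming per-token confidence scores are exposed as generation proceeds. `generate_with_early_stop`, the window size, and the threshold are hypothetical choices, not the paper's exact settings.

```python
from collections import deque

def generate_with_early_stop(stream, window=16, threshold=0.6):
    """Abandon a trace as soon as local confidence collapses.

    stream yields (token, token_confidence) pairs during decoding;
    token_confidence can be, e.g., the token's probability. We track
    the mean confidence over a sliding window and stop the moment it
    dips below `threshold`, saving the tokens a doomed trace would
    otherwise consume.
    """
    tokens, recent = [], deque(maxlen=window)
    for token, conf in stream:
        tokens.append(token)
        recent.append(conf)
        if len(recent) == window and sum(recent) / window < threshold:
            return tokens, False  # early-terminated on low local confidence
    return tokens, True           # trace completed normally

# Toy stream whose confidence collapses partway through:
toy = [(f"t{i}", 0.9 if i < 30 else 0.2) for i in range(60)]
out, completed = generate_with_early_stop(toy, window=8, threshold=0.5)
print(len(out), completed)  # stops well before all 60 tokens are generated
```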

The practical payoff: with Qwen3-8B on AIME 2025, pushing accuracy from 68% to 82% via standard majority voting requires 511 additional traces per question. Confidence-aware filtering achieves similar accuracy gains with far fewer traces, making the compute-efficiency case for trace filtering strong.

The implication: trace quality matters more than trace quantity for aggregation, and local confidence is a better quality proxy than either global confidence or trace length.

Self-Evaluation Guided Beam Search as a decoding implementation: The Self-Evaluation approach (Xie et al., 2023) translates step-level confidence into a decoding algorithm. It defines a constraint function C(s_t, s_{1:t-1}) ∈ [0, 1] that outputs the LLM's confidence in the correctness of each reasoning step given the prior context. This confidence guides a stochastic beam search: each "step" in the beam search is a semantic reasoning unit (not a single token), and the self-evaluation score serves as a better-calibrated automatic criterion for pruning the search. Stochastic beam search balances exploitation (following high-confidence paths) and exploration (temperature-controlled randomness to avoid premature convergence). This operationalizes step-level confidence as a search mechanism rather than just a filter.
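
A minimal sketch of that search loop under stated assumptions: `expand` and `self_eval` are hypothetical stand-ins for the LLM's step proposer and the confidence function C, and the interpolation weight `alpha` plus the exp(score/temp) sampling are illustrative choices rather than Xie et al.'s exact parameterization.

```python
import math
import random

def stochastic_beam_search(expand, self_eval, init, beam=4, steps=6,
                           alpha=0.5, temp=0.7, seed=0):
    """Self-evaluation guided stochastic beam search (sketch).

    expand(prefix)          -> list of (next_step, generation_logprob)
    self_eval(prefix, step) -> C(s_t, s_{1:t-1}) in [0, 1]

    Each candidate step is scored by interpolating generation likelihood
    with the log of its self-evaluation confidence; survivors are then
    *sampled* (with replacement) in proportion to exp(score / temp)
    instead of taken as a hard top-k, trading exploitation for exploration.
    """
    rng = random.Random(seed)
    beams = [(0.0, init)]  # (cumulative score, list of reasoning steps)
    for _ in range(steps):
        candidates = []
        for score, prefix in beams:
            for step, logp in expand(prefix):
                c = self_eval(prefix, step)  # step-level confidence
                step_score = (1 - alpha) * logp + alpha * math.log(max(c, 1e-9))
                candidates.append((score + step_score, prefix + [step]))
        # Softmax-style sampling of the next beam; shifting by the best
        # score keeps exp() numerically stable.
        best = max(s for s, _ in candidates)
        weights = [math.exp((s - best) / temp) for s, _ in candidates]
        beams = rng.choices(candidates, weights=weights, k=beam)
    return max(beams, key=lambda b: b[0])  # highest-scoring reasoning chain
```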


Source: Test Time Compute
