Does step-level confidence outperform global averaging for trace filtering?
Explores whether measuring confidence at individual reasoning steps, rather than averaging across entire traces, better identifies and filters out low-quality reasoning traces. Matters because it could dramatically improve both accuracy and compute efficiency in multi-trace reasoning.
Standard majority voting treats all reasoning traces equally. DeepConf improves on this by filtering traces based on model-internal confidence signals — and the key finding is that local (step-level) confidence is more informative than global confidence averaged across the full trace.
Global confidence fails in two ways: (1) it averages over the entire trace, masking critical reasoning breakdowns at specific intermediate steps; (2) it requires the full trace to be generated before it can be computed, preventing early stopping.
Step-level confidence catches local failures as they occur. A single low-confidence step is a signal worth acting on immediately, before it compounds through subsequent reasoning. This enables early termination of low-quality traces, reducing unnecessary token generation while maintaining or improving accuracy.
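The contrast can be sketched in a few lines. This is a minimal illustration, not DeepConf's actual scoring: it assumes per-step confidences are already available (e.g. mean token log-probability per reasoning step), and the threshold and values are made up for the example.

```python
def global_confidence(step_confs):
    """Average confidence over the full trace.
    Only computable after the whole trace is generated."""
    return sum(step_confs) / len(step_confs)

def filter_stepwise(step_confs, threshold=0.5):
    """Scan step confidences in order and reject the trace at the first
    low-confidence step -- the point at which generation could stop early."""
    for i, conf in enumerate(step_confs):
        if conf < threshold:
            return False, i  # rejected at step i
    return True, len(step_confs)

# A trace with one critical breakdown at step 2: the global average
# still looks healthy, but the step-level scan catches the failure.
trace = [0.9, 0.85, 0.2, 0.9, 0.95]
print(round(global_confidence(trace), 2))  # 0.76 -- masks the bad step
print(filter_stepwise(trace))              # (False, 2) -- caught early
```

The global score of 0.76 would pass most thresholds even though one step is nearly certain to be wrong; the step-level scan rejects the trace at that step and saves all subsequent generation.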
The practical payoff: getting from 68% to 82% accuracy on AIME 2025 via standard majority voting requires 511 additional traces per question with Qwen3-8B. Confidence-aware filtering achieves similar accuracy gains with far fewer traces. The compute efficiency argument for trace filtering is strong.
The implication: trace quality is more relevant than trace quantity for aggregation, and local confidence is a better quality proxy than global confidence or trace length.
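Filtering composes directly with majority voting: vote only over traces whose worst step clears a threshold. A hedged sketch under the same illustrative assumptions as above (answer strings and confidence values are invented for the example):

```python
from collections import Counter

def filtered_majority_vote(traces, threshold=0.5):
    """Majority vote over traces whose minimum step confidence clears
    the bar; fall back to a plain vote if every trace is filtered out."""
    kept = [ans for ans, confs in traces if min(confs) >= threshold]
    pool = kept or [ans for ans, _ in traces]
    return Counter(pool).most_common(1)[0][0]

# Two low-quality traces agree on a wrong answer; one clean trace
# has the right one. Filtering flips the vote.
traces = [
    ("42", [0.9, 0.8, 0.85]),   # confident throughout
    ("17", [0.9, 0.3, 0.7]),    # breakdown at step 1
    ("17", [0.6, 0.4, 0.8]),    # breakdown at step 1
]
print(filtered_majority_vote(traces))  # "42"
```

Plain majority voting over the same three traces would return "17"; weighting quality over quantity changes the outcome with no extra samples.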
Self-Evaluation Guided Beam Search as decoding implementation: The Self-Evaluation approach (Xie et al., 2023) translates step-level confidence into a decoding algorithm. It defines a constraint function C(s_t, s_{1:t-1}) ∈ [0, 1] that outputs the LLM's confidence in the correctness of reasoning step s_t given the preceding steps. This confidence guides a stochastic beam search: each "step" in the beam search is a semantic reasoning unit (not a single token), and the self-evaluation score serves as a better-calibrated automatic criterion for pruning the search. Stochastic beam search balances exploitation (following high-confidence paths) against exploration (temperature-controlled randomness to avoid premature convergence). This operationalizes step-level confidence as a search mechanism rather than just a filter.
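A toy sketch of that search loop. In the actual method both the step proposer and the self-evaluation scorer are LLM calls; here both are mocked, and the sampling scheme is a simplified softmax-over-scores stand-in for the paper's stochastic beam selection:

```python
import math
import random

def stochastic_beam_search(propose, evaluate, beam_width=2, steps=3,
                           temperature=0.5, seed=0):
    """Maintain `beam_width` partial reasoning chains. Each round, expand
    every chain with candidate steps, score expansions by accumulated
    log self-evaluation confidence, and sample survivors with softmax
    weights over score/temperature instead of a hard top-k: low temperature
    exploits high-confidence paths, higher temperature explores."""
    rng = random.Random(seed)
    beams = [([], 0.0)]  # (chain of steps, sum of log-confidences)
    for _ in range(steps):
        candidates = []
        for chain, score in beams:
            for step in propose(chain):
                conf = evaluate(step, chain)  # stands in for C(s_t, s_{1:t-1})
                candidates.append((chain + [step], score + math.log(conf)))
        # sample beam_width survivors without replacement
        pool, beams = candidates[:], []
        for _ in range(min(beam_width, len(pool))):
            weights = [math.exp(s / temperature) for _, s in pool]
            pick = rng.choices(range(len(pool)), weights=weights)[0]
            beams.append(pool.pop(pick))
    return max(beams, key=lambda b: b[1])

# Mock proposer/scorer: step "b" is always judged more reliable than "a".
propose = lambda chain: ["a", "b"]
evaluate = lambda step, chain: 0.9 if step == "b" else 0.4
chain, score = stochastic_beam_search(propose, evaluate)
print(chain)
```

The key structural point survives the mocking: pruning decisions happen per semantic step, scored by the evaluator, rather than once per finished trace.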
Source: Test Time Compute
Related concepts in this collection
-
Why does majority voting outperform more complex inference methods?
Simple majority voting across independent samples often matches or beats sophisticated alternatives like Best-of-N and sequential revision. What makes this basic approach so hard to beat for reasoning models?
confidence-aware filtering as an improvement on naive majority voting
-
Do hedging markers actually signal careful thinking in AI?
Explores whether linguistic markers like "alternatively" and "however" in model outputs correlate with accuracy or uncertainty. This matters because users often interpret such language as a sign of trustworthy reasoning.
linguistic confidence signals and internal confidence signals may converge
-
Do only 20 percent of tokens actually matter for reasoning?
Chain-of-thought reasoning might depend on a small minority of high-entropy tokens that act as decision points. If true, could training focus only on these critical tokens match or exceed full-gradient updates?
extends: high-entropy tokens are forking decision points; step-level confidence at those forks is precisely where filtering signal concentrates, so step-level filtering targets the same minority tokens that carry RLVR's training signal
-
Can we measure how deeply a model actually reasons?
What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching?
complements: DTR provides a stronger trace-quality signal than confidence alone (layer-wise stabilization); together with step-level confidence they define a two-channel filtering criterion (computational depth + step certainty)
-
Can reasoning steps be dynamically pruned without losing accuracy?
This explores whether chain-of-thought reasoning contains redundant steps that can be identified and removed during inference. Understanding which steps matter could improve efficiency while maintaining correctness.
extends: PI categorizes step types and shows verification/backtracking steps receive minimal subsequent attention; this gives a structural complement to confidence-based filtering — drop steps that are both low-confidence AND attention-invisible
-
Do reflection tokens carry more information about correct answers?
Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.
grounds: MI peaks identify which tokens carry signal about correctness; step-level confidence converges on the same sparse tokens through a different measurement channel
Original note title
confidence-aware step-level filtering outperforms global confidence averaging for trace selection