Deep Think with Confidence

Paper · arXiv 2508.15260 · Published August 21, 2025
Test Time Compute · Reward Models · Reinforcement Learning · Deep Research

Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods like self-consistency with majority voting. However, this approach often leads to diminishing returns in accuracy and high computational overhead. To address these challenges, we introduce Deep Think with Confidence (DeepConf), a simple yet powerful method that enhances both reasoning efficiency and performance at test time. DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks.

Large Language Models (LLMs) have demonstrated exceptional reasoning capabilities, particularly when equipped with methods that enhance their performance during test-time inference. A prominent technique is self-consistency, which samples multiple reasoning paths and aggregates final answers through majority voting (Wang et al., 2023). This type of approach, also known as parallel thinking, significantly improves reasoning accuracy but incurs substantial computational overhead: inference cost scales linearly with the number of reasoning traces generated per query, limiting practical deployment (Xue et al., 2023). For example, improving pass@1 accuracy from 68% to 82% with standard majority voting on AIME 2025 requires 511 additional reasoning traces per question using Qwen3-8B, consuming 100 million additional tokens.
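As a concrete illustration, self-consistency reduces to a majority vote over the final answers extracted from each sampled trace. A minimal sketch (the sampled answers below are made up for illustration):

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate final answers from multiple sampled reasoning traces
    by simple majority vote, as in self-consistency."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# e.g. five sampled traces yield these final answers
print(majority_vote(["42", "42", "17", "42", "17"]))  # -> 42
```

Note that every trace carries equal weight here, regardless of how confident the model was while producing it; this is precisely the property the following sections examine.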

Moreover, parallel thinking with majority voting exhibits diminishing returns—performance often saturates or degrades as the number of traces increases (Chen et al., 2024a). A key limitation is that standard majority voting treats all reasoning traces equally, ignoring quality variations (Pal et al., 2024; Wang et al., 2025). This can lead to suboptimal performance when low-quality traces dominate the voting process.

Recent work has leveraged next-token distribution statistics to assess reasoning trace quality (Geng et al., 2024; Fadeeva et al., 2024; Kang et al., 2025). Higher prediction confidence typically correlates with lower entropy and reduced uncertainty. By aggregating token-level statistics such as entropy and confidence scores, existing methods compute global confidence measures across an entire trace to identify and filter low-quality traces to improve majority voting performance (Kang et al., 2025).
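To make these token-level statistics concrete, here is a minimal sketch of two common measures, assuming per-token log-probabilities (and top-k alternatives) are available from the serving framework; the function names are ours, not the paper's:

```python
import math

def trace_confidence(token_logprobs):
    """Global confidence of a trace: mean log-probability of the
    sampled tokens. Values closer to 0 mean higher average certainty."""
    return sum(token_logprobs) / len(token_logprobs)

def token_entropy(top_logprobs):
    """Entropy of one next-token distribution, given the log-probabilities
    of its top-k candidates. Lower entropy = lower uncertainty, which
    typically correlates with higher prediction confidence."""
    return -sum(math.exp(lp) * lp for lp in top_logprobs)
```

Filtering then amounts to ranking traces by such a global score and discarding the low-confidence ones before voting.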

However, global confidence measures present several limitations in practice. First, they may obscure confidence fluctuations at local reasoning steps, which themselves provide strong signals for estimating trace quality: averaging over all tokens in a trace can mask critical reasoning breakdowns that occur at specific intermediate steps. Second, global confidence measures require generating complete reasoning traces before they can be computed, which prevents early stopping of low-quality traces.
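The contrast between global and local measures can be sketched with a sliding window over per-token log-probabilities; the window size and the min-over-windows aggregation below are illustrative assumptions, not the paper's exact definitions:

```python
def windowed_confidence(token_logprobs, window=8):
    """Local confidence: mean token log-probability over each sliding
    window of `window` consecutive tokens."""
    n = len(token_logprobs)
    if n < window:
        return [sum(token_logprobs) / n]
    return [sum(token_logprobs[i:i + window]) / window
            for i in range(n - window + 1)]

def lowest_window_confidence(token_logprobs, window=8):
    """The weakest local stretch of a trace. Unlike a whole-trace
    average, this exposes a localized breakdown even when the rest
    of the trace is highly confident, and it can be monitored
    incrementally during generation to stop a trace early."""
    return min(windowed_confidence(token_logprobs, window))
```

For example, a single very uncertain span drags the minimum window score far below the trace's overall mean, flagging the trace even though the global average looks healthy.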

We introduce Deep Think with Confidence (DeepConf), a simple yet effective test-time method that combines parallel thinking with confidence-aware filtering, based on local confidence measurements. DeepConf operates in both offline and online modes, identifying and discarding low-confidence reasoning traces either during or after generation. This approach reduces unnecessary token generation while maintaining or improving final answer accuracy.
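A minimal sketch of the offline mode's voting step, under our assumed interface where each trace arrives with an answer and a precomputed confidence score (the `keep_frac` threshold is an illustrative hyperparameter, not a value from the paper):

```python
from collections import Counter

def confidence_filtered_vote(traces, keep_frac=0.5):
    """Offline confidence-aware filtering (sketch): rank completed
    traces by confidence, keep only the top fraction, then majority-
    vote over the surviving answers. `traces` is a list of
    (answer, confidence) pairs."""
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    kept = ranked[:max(1, int(len(ranked) * keep_frac))]
    counts = Counter(answer for answer, _ in kept)
    return counts.most_common(1)[0][0]
```

In the online mode, the same idea applies during generation: a trace whose local confidence drops below a threshold is terminated early, saving the tokens it would otherwise have consumed.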