Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods

Paper · arXiv 2504.14047 · Published April 18, 2025
Tags: Test Time Compute · Inference-time Scaling · Reasoning Critiques

This work conducts a comprehensive analysis of inference-time scaling methods for both reasoning and non-reasoning models on challenging reasoning tasks. Specifically, we focus on verifier-free inference-time scaling methods because of their generalizability: they require no reward model. We construct the Pareto frontier of quality and efficiency. We find that non-reasoning models, even with an extremely high inference budget, still fall substantially behind reasoning models. For reasoning models, majority voting proves to be a robust inference strategy, generally competitive with or outperforming more sophisticated ITC methods such as best-of-N and sequential revisions, while the additional inference compute offers minimal improvements.
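As a concrete illustration, majority voting simply samples N independent responses, extracts each final answer, and returns the most frequent one. The sketch below assumes the final answers have already been extracted as strings; the sample values are hypothetical.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among N sampled responses.

    Ties are broken by first occurrence, following Counter.most_common ordering.
    """
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers extracted from N = 5 sampled chains of thought.
samples = ["42", "41", "42", "42", "7"]
print(majority_vote(samples))  # -> 42
```

Unlike best-of-N, this requires no reward model to score candidates, which is what makes it verifier-free.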

The landscape of language model reasoning has evolved along two primary dimensions. First, approaches like Chain-of-thought (Wei et al., 2023), self-consistency (Wang et al., 2022), tree-structured sampling (Snell et al., 2024), and mixture of agents (Wang et al., 2025) have emerged as effective techniques for boosting reasoning capabilities during inference without requiring model parameter changes. Second, a new class of ”reasoning models”, explicitly post-trained to solve highly challenging problems, has been introduced, exemplified by models like o1 (OpenAI et al., 2024), Deepseek-R1 (DeepSeek-AI et al., 2025), and QwQ (Team, 2024).

Note that in Figure 1, we added another baseline, "reasoning truncation", which simply truncates the reasoning process (encapsulated in "⟨think⟩" "⟨/think⟩" tokens) and then extends the response to completion via sequence completion. We found that this significantly degrades response quality, and under extreme truncation the curve eventually crosses below that of non-reasoning models.
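A minimal sketch of this truncation step follows, assuming the reasoning span is delimited by literal `<think>`/`</think>` strings and truncating at word granularity (the paper's exact tokenization and the subsequent sequence-completion call are model-specific and omitted here).

```python
def truncate_reasoning(response, keep_frac=0.5,
                       open_tag="<think>", close_tag="</think>"):
    """Keep only the first keep_frac of the reasoning span, then re-close it.

    Returns a truncated prefix; the model would then be asked to complete
    the answer from this prefix (that completion step is not shown).
    """
    start = response.find(open_tag)
    end = response.find(close_tag)
    if start == -1 or end == -1:
        return response  # no reasoning span to truncate
    body = response[start + len(open_tag):end]
    words = body.split()
    kept = " ".join(words[: max(1, int(len(words) * keep_frac))])
    return response[:start] + open_tag + kept + close_tag
```

For example, with `keep_frac=0.5` a four-word reasoning span is cut to its first two words, and everything after the closing tag is discarded so the model can regenerate the answer.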

4.3 Linguistic Markers and Word Frequency Analysis

Reasoning models tend to use certain linguistic markers, especially thinking tokens such as "alternatively" or "however". In this section, we investigate the relationship between such linguistic markers and correctness.

4.3.1 Linguistic markers occur more frequently in incorrect responses

A compelling finding emerges from our linguistic marker analysis: incorrect responses consistently exhibit a higher density and diversity of linguistic markers. Figure 5 provides empirical evidence that both hedging and thinking markers (marker definitions are given in Table 2 of the Appendix) are markedly more prevalent in incorrect responses than in correct ones.
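Such a density comparison could be computed along the following lines. The marker lists below are illustrative placeholders, not the paper's full definitions (those are in Table 2 of its Appendix); density is measured as marker occurrences per 100 words.

```python
import re

# Illustrative marker lists (assumptions, not the paper's exact Table 2 sets).
THINKING_MARKERS = ["alternatively", "however", "wait"]
HEDGING_MARKERS = ["perhaps", "might", "possibly"]

def marker_density(text, markers):
    """Count whole-word marker occurrences per 100 words of text."""
    words = len(text.split())
    hits = sum(
        len(re.findall(r"\b" + re.escape(m) + r"\b", text.lower()))
        for m in markers
    )
    return 100.0 * hits / max(1, words)

# Toy comparison: a response peppered with reconsiderations vs. a direct one.
incorrect = "Wait, alternatively the answer might be 7. However, perhaps not."
correct = "The answer is 7."
print(marker_density(incorrect, THINKING_MARKERS))
print(marker_density(correct, THINKING_MARKERS))
```

Comparing the resulting densities across correct and incorrect response sets would reproduce the kind of contrast shown in Figure 5.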