LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Can intermediate reasoning points yield better answers than final ones?

When reasoning models commit to a single path, they may miss better conclusions available at earlier decision points. Can aggregating completions from intermediate reasoning states recover lost accuracy?

Note · 2026-02-22 · sourced from Reasoning o1 o3 Search

The standard evaluation practice for reasoning models is straightforward: generate a complete trace, extract the final answer. But the final answer may not be the model's best conclusion — it is the conclusion reached by committing to one particular path through reasoning space.

"Beyond the Last Answer" proposes a different approach: segment the reasoning trace into subthoughts based on linguistic cues ("Wait," "Alternatively," "Hmm"), then prompt the model to complete a solution from each intermediate point. Each completion produces a candidate answer. The mode — the most frequent answer across all completions — is significantly more accurate than the final answer alone.

The gains are substantial: up to +13% on AIME2024 and +10% on AIME2025 across various reasoning models. Non-greedy sampling (T=1.0, top-p=0.95) often yields the largest improvements because it explores the reasoning space around each intermediate point more broadly.
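
A sketch of the completion-and-vote step under the sampling settings above. `generate` and `extract_answer` are hypothetical stand-ins for your inference API and answer parser, not real library calls:

```python
from collections import Counter

def mode_answer(prefixes, generate, extract_answer,
                temperature=1.0, top_p=0.95):
    """Complete a solution from each intermediate prefix, then return the
    mode (most frequent answer) across all completions.

    `generate(prompt, ...)` and `extract_answer(text)` are hypothetical
    stand-ins for an inference call and an answer parser.
    """
    candidates = []
    for prefix in prefixes:
        completion = generate(prefix, temperature=temperature, top_p=top_p)
        answer = extract_answer(completion)
        if answer is not None:
            candidates.append(answer)
    if not candidates:
        return None, []
    best, _count = Counter(candidates).most_common(1)[0]
    return best, candidates
```

Returning the full candidate list matters: its spread is reused below as a confidence signal.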

This differs from "Why does parallel reasoning outperform single chain thinking?" in a crucial way. Parallel voting generates independent chains from scratch. Subthought aggregation mines the intermediate states of a single existing chain, treating the reasoning trace as a landscape of potential conclusions rather than a single path to one conclusion. The trace already contains the information; the model just committed too early to a particular continuation.

The consistency signal is equally valuable. High consistency (low entropy) across subthought completions correlates with correct baseline answers. High entropy signals model struggle or likely errors. This makes subthought analysis a confidence estimator that requires no external verifier — the model's own internal consistency across intermediate points is the signal.
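
The same candidate list gives a verifier-free confidence score directly. A sketch using normalized Shannon entropy over the empirical answer distribution (normalizing by the log of the number of distinct answers is my choice, not from the paper):

```python
import math
from collections import Counter

def answer_entropy(candidates: list[str]) -> float:
    """Normalized entropy of the answer distribution across subthought
    completions: 0.0 means perfect agreement (high confidence), 1.0 means
    maximally spread answers (likely struggle or error)."""
    counts = Counter(candidates)
    if len(counts) <= 1:
        return 0.0  # unanimous (or empty) answer set
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))  # divide by maximum entropy
```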

The connection to "Does reflection in reasoning models actually correct errors?" is direct: if most reflection merely confirms the initial direction, then later subthoughts are more likely to confirm than correct. Mining earlier subthoughts, before the confirmatory drift sets in, should recover more diverse (and more accurate) completions. This is exactly what the results show.



Original note title: subthought mode aggregation from intermediate reasoning points yields higher accuracy than the final answer by up to 13 percent