Why does single-shot learning fail in REVTHINK's multi-source reasoning tasks?
This reads 'single-shot learning' as a model trying to reach an answer in one forward pass, and asks why that breaks down when a task requires integrating several reasoning sources at once. The corpus has no note on REVTHINK by name, so I'll synthesize the conceptual territory it sits in rather than retrieve it directly; the collection does have a lot to say about why one-pass reasoning fails and why generating multiple or reversed trajectories rescues it. The honest answer it points to: a single shot commits to one trajectory through the solution space, and the failure modes of that commitment are well documented even if this exact paper isn't.
The sharpest framing comes from work on premature commitment. Reasoning models tend to 'wander' down invalid paths and 'underthink' by abandoning promising ones too early. The fix reported there is structural rather than a matter of more training: decoding-level interventions such as thought-switching penalties improve accuracy without fine-tuning, which suggests viable solutions already exist in the model but get dropped before they pay off [Why do reasoning models abandon promising solution paths?]. A single-shot pass has no mechanism to recover from this. It picks a direction and lives with it. This is why methods that hold open multiple possibilities help: stochastic latent reasoning lets a model represent a distribution over solutions instead of one prediction, which is exactly what you need when a problem admits several valid strategies or sources that must be reconciled [Can stochastic latent reasoning help models explore multiple solutions?].
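To make the decoding-level idea concrete, here is a minimal sketch of a thought-switching penalty, assuming a decoder that exposes raw next-token logits as a plain list. The function names, the switch-token ids, the penalty size, and the minimum thought length are all hypothetical choices for illustration, not the cited work's implementation.

```python
def penalize_switch_tokens(logits, switch_token_ids, steps_since_switch,
                           penalty=3.0, min_thought_len=50):
    """Return a copy of `logits` with switch tokens down-weighted.

    The penalty applies only while the current thought is shorter than
    `min_thought_len` tokens, so the model is nudged to develop a path
    before abandoning it. All parameter values here are assumptions.
    """
    adjusted = list(logits)
    if steps_since_switch >= min_thought_len:
        return adjusted  # thought is long enough; switching is allowed freely
    for token_id in switch_token_ids:
        adjusted[token_id] -= penalty
    return adjusted


def greedy_pick(logits):
    """Pick the index of the largest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary of three tokens; token 1 plays the role of "Alternatively,".
toy_logits = [1.0, 2.5, 2.0]
early = penalize_switch_tokens(toy_logits, {1}, steps_since_switch=5)
late = penalize_switch_tokens(toy_logits, {1}, steps_since_switch=60)
print(greedy_pick(early), greedy_pick(late))
```

Early in a thought the switch token loses to the continuation token; once the thought has run long enough, the original ranking is restored. Nothing about the model's weights changes, which is the sense in which the intervention is structural.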
There's also a scaling argument hiding underneath the question. Depth-only reasoning, meaning one long serial chain, is both slow and brittle, whereas sampling parallel trajectories covers the solution space without the variance penalty of going deeper [Can reasoning systems scale wider instead of only deeper?]. Multi-source tasks are precisely where width beats depth: each source is a different thread, and a single sequential pass forces them into one ordering rather than letting them be sampled independently and combined. Reverse-thinking and multi-pass schemes are, in this light, a way of buying width.
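The width argument can be sketched with a little arithmetic, under the loud assumption that trajectories are independent and each succeeds with the same probability p. Then the chance that at least one of k sampled trajectories succeeds is 1 - (1 - p)^k. The helper names below are hypothetical, and the majority vote is a generic aggregation rule swapped in for illustration, not a method the notes attribute to GRAM or REVTHINK.

```python
from collections import Counter


def coverage(p, k):
    """Probability that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p) ** k


def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


# One pass versus eight parallel passes at a per-pass success rate of 0.3.
print(round(coverage(0.3, 1), 4))  # 0.3
print(round(coverage(0.3, 8), 4))  # 0.9424

# Combining final answers from three hypothetical trajectories.
print(majority_vote(["12", "15", "12"]))  # 12
```

Real trajectories from one model are correlated, so the independence assumption overstates the gain; the point is only that a single pass sits at k = 1.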
The failure also has a generalization face. Single-pass chain-of-thought degrades predictably outside its training distribution, with models reproducing the *form* of reasoning while the underlying logic quietly breaks [Does chain-of-thought reasoning actually generalize beyond training data?]. Reasoning also breaks not at some complexity threshold but at instance-novelty boundaries, because models fit instance patterns rather than general algorithms [Do language models fail at reasoning due to complexity or novelty?]. A multi-source task is almost by definition a novel combination, so a single shot trained to pattern-match individual sources has nothing to fall back on when it has to fuse them.
The quietly surprising thing here: at least one strand of the corpus argues these collapses aren't reasoning failures at all but *execution* failures, in which models that know an algorithm still can't run it across many serial steps in text alone [Are reasoning model collapses really failures of reasoning?]. If that's right, single-shot learning on multi-source tasks may be failing less because the model can't reason and more because one pass gives it no room to externalize, branch, and recombine the work. That reframes the whole fix, from 'teach better reasoning' to 'give reasoning more than one shot.'
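The gap between knowing an algorithm and executing it can be illustrated with a toy case. Tower of Hanoi is my own choice of example here, not one the notes name: the procedure fits in a few lines, yet the number of moves that would have to be written out step by step in text doubles with every added disk.

```python
def hanoi(n, src="A", dst="C", via="B"):
    """Return the full move list for n disks as (disk, from_peg, to_peg)."""
    if n == 0:
        return []
    # Move n-1 disks out of the way, move the largest, then restack.
    return (hanoi(n - 1, src, via, dst)
            + [(n, src, dst)]
            + hanoi(n - 1, via, dst, src))


for disks in (3, 10):
    print(disks, len(hanoi(disks)))
# 3 7
# 10 1023
```

The description of the algorithm stays the same size while the serial trace grows as 2^n - 1, which is the kind of workload where handing execution to a tool, or spreading it across more than one pass, plausibly matters more than better reasoning.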
Sources (6 notes)
[Why do reasoning models abandon promising solution paths?] Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
[Can stochastic latent reasoning help models explore multiple solutions?] GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.
[Can reasoning systems scale wider instead of only deeper?] GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
[Does chain-of-thought reasoning actually generalize beyond training data?] DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning, imitating reasoning form without valid underlying logic.
[Do language models fail at reasoning due to complexity or novelty?] LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
[Are reasoning model collapses really failures of reasoning?] Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.