How does training data format shape whether models reason in parallel or sequentially?
This explores how the *shape* of training data — multiple-choice vs. free-form, format vs. content — pushes a model toward exploring many options at once (parallel) or working through steps one at a time (sequential), and what that trade-off costs.
The *shape* of training data, not its subject matter, steers whether a model spreads out to explore many candidate answers (parallel) or grinds through one chain of intermediate steps (sequential). The headline result in the corpus is blunt: presentation matters far more than topic. Models trained on multiple-choice data learn breadth-first exploration, while free-form training produces depth-first sequential reasoning, and the format effect outweighs the domain effect by roughly 7.5x [Does training data format shape reasoning strategy more than domain?]. So if you train on data that looks like 'pick from these options,' the model learns to fan out; if you train on data that looks like 'show your work,' it learns to walk a line.
That distinction isn't cosmetic, because the two strategies are not interchangeable. On genuinely compositional problems, such as tracing connectivity through a graph where each step depends on the last, sequential chain-of-thought beats parallel voting by an *exponential* margin, simply because the answer requires accumulating intermediate results that no set of short independent guesses can reconstruct [When does sequential reasoning beat parallel voting?]. Width has real advantages too, though. Sampling many parallel latent trajectories lets a system explore the solution space without paying the serial latency cost of going ever deeper [Can reasoning systems scale wider instead of only deeper?]. The format your data implies is, in effect, a bet on which of these regimes your tasks will need.
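To make the mechanism concrete, here is a toy sketch in Python. It is not the setup from the cited note: the path graph, the random-walk "voters", and every function name are assumptions made for illustration. The sequential solver is an ordinary breadth-first search that builds on what it has already reached; the parallel solver takes a majority vote over independent walkers that each get only a short step budget.

```python
import random
from collections import deque

def connected_sequential(adj, src, dst):
    """Breadth-first search: each step extends the set of nodes already reached."""
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def connected_by_vote(adj, src, dst, n_voters, max_steps, rng):
    """Majority vote over independent short random walks (a hypothetical
    stand-in for many short parallel chains). A walker votes yes only if
    it personally reaches dst within max_steps moves."""
    yes = 0
    for _ in range(n_voters):
        node = src
        for _ in range(max_steps):
            if node == dst:
                break
            node = rng.choice(adj[node])
        yes += node == dst
    return yes * 2 > n_voters

def path_graph(n):
    """Undirected path 0 - 1 - ... - (n-1)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

adj = path_graph(30)
rng = random.Random(0)
sequential_answer = connected_sequential(adj, 0, 29)
# 1000 voters, but each may take at most 10 steps; the target is 29 steps away.
voted_answer = connected_by_vote(adj, 0, 29, n_voters=1000, max_steps=10, rng=rng)
```

Because no walker can travel 29 edges in 10 steps, adding voters never helps here, while the search gets there by carrying state forward. The toy shows why independent short chains cannot substitute for accumulated intermediate results; it does not reproduce the exponential gap itself.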
What's striking is how much of 'reasoning ability' turns out to be format acquisition rather than knowledge acquisition. A 1.5B model with only LoRA format adaptation can match far larger RL-trained models, which suggests RL is mostly teaching a model *how to organize its output* rather than handing it new facts [Can small models reason well by just learning output format?]. The same theme shows up from the other direction: within the first epoch, RL post-training tends to collapse onto a single dominant format inherited from pretraining, amplifying one presentation style and suppressing the alternatives. Which format wins depends on model scale, not necessarily on which performs best [Does RL training collapse format diversity in pretrained models?]. Format isn't just learned; it gets locked in, sometimes arbitrarily.
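For a sense of how small a LoRA adaptation is, the sketch below uses the standard low-rank formulation, where the effective weight is W + (alpha / r) * B A with the base weight W frozen. The layer size, rank, and alpha are made-up values for illustration, not the configuration from the cited note.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Illustrative sizes only (assumptions, not taken from the source).
d_out, d_in, rank = 64, 64, 2
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen base weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable, rank x d_in
b = [[0.0] * rank for _ in range(d_out)]                              # trainable, d_out x rank, zero init

full_params = d_out * d_in           # parameters touched by full fine-tuning
lora_params = rank * (d_in + d_out)  # parameters touched by the adapter
w_eff = lora_weight(w, a, b, alpha=16)
```

With B initialized to zero the adapted layer starts out identical to the base layer, and in this toy the adapter trains 256 numbers where full fine-tuning would train 4096. That so small an update can be enough is what makes the "format, not facts" reading plausible.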
Here's the part you might not expect: the format a model *displays* and the computation it actually *runs* can come apart entirely. When models are trained with hidden chain-of-thought tokens, logit-lens analysis shows them computing the correct answer in the earliest layers, then actively overwriting that representation to emit format-compliant filler tokens at the surface [Do transformers hide reasoning before producing filler tokens?]. In other words, the visible output can be a presentation artifact layered on top of an internal computation that has already reached the answer. The form can be hollow in a worse way too: chain-of-thought degrades predictably once you push outside the training distribution in task, length, or format, producing fluent text that imitates the *shape* of reasoning without the underlying logic [Does chain-of-thought reasoning actually generalize beyond training data?]. That fragility tracks the finding that LLMs lean on semantic associations rather than symbolic manipulation, so reasoning stays tethered to the distribution its format was learned on [Do large language models reason symbolically or semantically?].
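If the logit lens is unfamiliar, the idea is to push each layer's hidden state through the model's output (unembedding) matrix and read off which tokens it favors at that depth. The sketch below is a toy: the three-token vocabulary, the identity unembedding, and the hidden states are all invented to mimic the reported pattern, and it skips the final layer norm that real implementations usually apply first.

```python
def logit_lens(hidden_by_layer, unembed, vocab):
    """For each layer, project the hidden state through the unembedding
    matrix and return the vocabulary ranked by logit, highest first."""
    ranked = []
    for h in hidden_by_layer:
        logits = [sum(hi * wi for hi, wi in zip(h, row)) for row in unembed]
        order = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
        ranked.append([vocab[i] for i in order])
    return ranked

# Hypothetical vocabulary: "7" is the correct answer, "." is a filler token.
vocab = ["7", "3", "."]
unembed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one row per token
hidden_by_layer = [   # made-up states, earliest layer first
    [2.0, 0.5, 0.1],
    [2.5, 0.3, 0.4],
    [1.0, 0.2, 1.8],
    [0.6, 0.1, 3.0],
]

ranking = logit_lens(hidden_by_layer, unembed, vocab)
top1_per_layer = [r[0] for r in ranking]
runner_up_at_output = ranking[-1][1]
```

In this toy the answer token leads in the early layers, the filler token takes over by the end, and the answer survives as the runner-up at the final layer, which is the sense in which the cited note calls the reasoning recoverable from lower-ranked predictions.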
The practical upshot is that 'parallel vs. sequential' isn't only a decoding-time choice. It's baked in upstream by how examples are formatted, and it can even be routed dynamically. Decoupled RL methods can train a single model to decide *when* to engage extended sequential thinking versus answer concisely, which is essentially learning to switch reasoning modes per problem rather than committing to one [Can models learn when to think versus respond quickly?]. If you want to go deeper on the failure side, the distribution-bound and semantic-reasoner notes are the doorways; on the 'format is the lever' side, start with the format-vs-domain study and the RL-format-collapse note.
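One way to see why decoupling matters for routing is the arithmetic below. It is an illustration under stated assumptions, not the Thinkless objective: the function names and the `alpha` weight are hypothetical, and clipping, KL terms, and group-normalized advantages are left out. It assumes the mode decision is a single token followed by the answer tokens.

```python
def coupled_loss(mode_logp, answer_logps, advantage):
    """One length-normalized policy-gradient term over the mode token plus answer tokens."""
    tokens = [mode_logp] + answer_logps
    return -advantage * sum(tokens) / len(tokens)

def decoupled_loss(mode_logp, answer_logps, advantage, alpha=1.0):
    """Mode token gets its own fixed weight alpha; answer tokens are averaged separately."""
    mode_term = -alpha * advantage * mode_logp
    answer_term = -advantage * sum(answer_logps) / len(answer_logps)
    return mode_term + answer_term

def mode_share_coupled(n_answer_tokens):
    """Fraction of the coupled loss weight that lands on the single mode token."""
    return 1 / (1 + n_answer_tokens)

short_share = mode_share_coupled(9)    # concise answer: 9 tokens after the mode token
long_share = mode_share_coupled(999)   # extended reasoning: 999 tokens after it
```

Under the coupled form, the mode token's share of the gradient shrinks as the response gets longer, so the 'think' and 'answer directly' choices are trained with very different strength. Giving the mode token a fixed weight removes that dependence on length, which is one plausible reading of what decoupling mode selection from answer refinement buys.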
Sources (9 notes)
Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.
On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.