How much does training composition affect syntactic versus reasoning performance?

This explores whether the mix of data and training signals fed to a model shapes its grip on output *form* (syntax, format, the shape of an answer) differently from its grip on actual *reasoning*. The corpus suggests the two are governed by surprisingly separate levers.

This reads the question as: does training composition pull form and reasoning in different directions? The collection's recurring answer is yes, and the gap is wider than you'd expect. A striking thread is that much of what looks like a 'reasoning' gain is really form being learned. Logically *invalid* chain-of-thought examples produce nearly the same accuracy boost as valid ones (Does logical validity actually drive chain-of-thought gains?), and a related line argues chain-of-thought is constrained imitation of a reasoning *shape* rather than genuine inference (Does chain-of-thought reasoning reveal genuine inference or pattern matching?). So composition that supplies the surface pattern of reasoning reliably teaches its syntax, while the underlying logic stays distribution-bound and degrades the moment you shift task, length, or format (Does chain-of-thought reasoning actually generalize beyond training data?).

What *does* move real reasoning turns out to be a specific composition ingredient: procedural knowledge. An analysis of five million pretraining documents found that reasoning generalization rides on broad, transferable procedural material spread across many sources, whereas factual recall depends on narrow, document-specific memorization (Does procedural knowledge drive reasoning more than factual retrieval?). That's the cleanest 'composition matters' result here: the *kind* of knowledge in the mix, not just its volume, determines whether you get transfer or rote lookup.

The trade-off cuts deeper when you look at where these capabilities live in the network. Knowledge retrieval sits in lower layers and reasoning adjustment in higher ones, which is why training that sharpens reasoning improves math but can actively *degrade* knowledge-heavy domains like medicine (Why does reasoning training help math but hurt medical tasks?). So composition isn't a free lunch: shifting the mix toward reasoning can buy reasoning by spending knowledge.

The form-versus-reasoning split also shows up in *how* you train, not just what's in the data. For small models on function calling, DPO on correct-and-incorrect preference pairs beats plain supervised fine-tuning precisely because the negative examples target rigid *format* failures that SFT leaves unfixed (Can small models match large models on function calling?). And at the token level, only ~20% of tokens, the high-entropy forking points, carry the reasoning learning signal, so RLVR is really reshaping a small reasoning-critical minority while the rest is essentially form (Do high-entropy tokens drive reasoning model improvements?).
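The two training-side mechanisms in the paragraph above can be made concrete with a small sketch. The first function is the standard DPO objective for a single preference pair; the second marks the highest-entropy positions in a sequence of next-token distributions. This is illustrative only and is not the setup used in the cited notes: the function names, the `beta` value, the 20% default, and the toy distributions are all assumptions, and a real pipeline would operate on model logits in a tensor library rather than on Python lists.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    loss = -log(sigmoid(beta * [(chosen - ref_chosen) - (rejected - ref_rejected)]))
    beta=0.1 is an assumed, illustrative value.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == softplus(-m), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def high_entropy_mask(token_dists, keep_fraction=0.2):
    """Return a boolean mask selecting the top `keep_fraction` positions by entropy.

    The 0.2 default mirrors the ~20% figure quoted in the text; it is not a
    tuned value.
    """
    entropies = [token_entropy(d) for d in token_dists]
    k = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]


if __name__ == "__main__":
    # Toy distributions over a two-token vocabulary (hypothetical data).
    dists = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1], [1.0, 0.0], [0.99, 0.01]]
    print(high_entropy_mask(dists))  # only the 50/50 position is kept
    # The loss shrinks when the policy prefers the chosen answer more than the reference does.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0) < dpo_loss(0.0, 0.0, 0.0, 0.0))
```

In this sketch the mask would be multiplied into a per-token loss so that only the forking positions receive gradient, while the DPO term is computed per preference pair; the two are independent and are shown together only because the paragraph discusses both.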

The thing you didn't know you wanted to know: syntactic competence is cheap and composition-robust, so models pick up the *form* of correct output from almost any reasonable mix, even illogical exemplars. Reasoning is expensive, composition-sensitive, and can come at the expense of other domains. Tellingly, when semantic content is decoupled from a task, models fall back on semantic associations from their training distribution rather than manipulating symbols (Do large language models reason symbolically or semantically?), and 'compositional reasoning' often collapses into memorized subgraph matching that shatters on novel combinations (Do transformers actually learn systematic compositional reasoning?). The form is in the data; the reasoning, mostly, is not.

Sources (9 notes)

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Why does reasoning training help math but hurt medical tasks?

A two-phase inference model shows that knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
