
What do models actually learn from chain-of-thought training?

When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.

Note · 2026-02-22 · sourced from Reasoning Critiques

When training on reasoning demonstrations, what actually gets learned? Controlled ablations reveal a striking asymmetry: models are highly tolerant to content errors but highly sensitive to structural disruption.

Two types of perturbation were applied to Long CoT training samples:

- Content perturbations: the model is mostly unaffected.
- Structural perturbations: the model is severely affected.
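
A minimal sketch of how the two perturbation families can be constructed, assuming each trace is stored as an ordered list of step strings. The specific choices here (digit corruption for content, shuffle/delete for structure) are illustrative stand-ins, not the exact protocol of the source experiments.

```python
import random

def perturb_content(steps, digit_rate=0.3, rng=random.Random(0)):
    """Content perturbation: corrupt surface details (here, digits)
    while leaving step boundaries and ordering intact."""
    corrupted = []
    for step in steps:
        chars = [
            str(rng.randint(0, 9)) if c.isdigit() and rng.random() < digit_rate else c
            for c in step
        ]
        corrupted.append("".join(chars))
    return corrupted

def perturb_structure(steps, mode="shuffle", rng=random.Random(0)):
    """Structural perturbation: break the logical architecture by
    shuffling the step order or deleting a contiguous block of steps."""
    steps = list(steps)
    if mode == "shuffle":
        rng.shuffle(steps)
    elif mode == "delete":
        start = rng.randrange(len(steps))
        del steps[start:start + max(1, len(steps) // 3)]
    return steps

# Illustrative trace split into steps.
trace = [
    "Let x be the number of apples, so 3x + 2 = 14.",
    "Subtract 2 from both sides: 3x = 12, so x = 4.",
    "Check: 3 * 4 + 2 = 14, consistent with the problem.",
    "Therefore the answer is 4.",
]
print(perturb_content(trace))    # digits change, architecture preserved
print(perturb_structure(trace))  # architecture broken, content preserved
```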

What models learn from reasoning demonstrations is not what to think but how to structure thinking: the pattern of reflection, backtracking, and self-validation that makes long CoT effective. The specific facts, numbers, and even the correctness of individual steps are secondary. The logical architecture — which steps precede which, how contradiction leads to backtracking, how intermediate validation is structured — is primary.

This partially explains why distillation from a larger reasoning model to a smaller one works even with relatively few samples (17k samples showed substantial gains): the small model is not memorizing the reasoning content; it is acquiring the structural pattern of how reasoning unfolds. Structure is cheap to transmit.

This deepens Does training data format shape reasoning strategy more than domain?, which found that training format (multiple choice vs fill-in) shapes strategy more than domain. The present finding shows the same principle operating at a finer scale: within a Long CoT format, structural coherence matters more than content correctness. Format dominance operates at multiple levels.

The practical implication: generating training data for reasoning models does not require perfect reasoning. It requires structurally coherent reasoning — chains with correct logical architecture, even if specific steps contain errors.
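
As an illustration of that implication, a hypothetical data filter might gate generated traces on structural markers rather than on answer correctness. The marker lists and thresholds below are assumptions chosen for the example, not a validated recipe.

```python
import re

# Illustrative lexical markers for reflection and validation behaviour.
REFLECTION_MARKERS = ("wait", "actually", "on second thought", "let me reconsider")
VALIDATION_MARKERS = ("check", "verify", "plug", "consistent")

def is_structurally_coherent(trace: str, min_steps: int = 3) -> bool:
    """Accept a generated trace if it has enough distinct steps and shows
    at least one reflection move and one validation move. The correctness
    of the final answer is deliberately not part of the criterion."""
    steps = [s for s in re.split(r"\n+", trace) if s.strip()]
    text = trace.lower()
    has_reflection = any(m in text for m in REFLECTION_MARKERS)
    has_validation = any(m in text for m in VALIDATION_MARKERS)
    return len(steps) >= min_steps and has_reflection and has_validation
```

Under this criterion, a trace containing an arithmetic slip but intact reflection and validation structure passes, while a terse, unstructured but numerically correct answer does not.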

FOL-based validation confirms the coherence/validity distinction: Analysis of RLVR-trained models using first-order logic error taxonomy shows that RLVR improves local trace coherence — transitions between adjacent steps become more logically consistent — without guaranteeing global mathematical validity. The models produce traces that read as better reasoning (fewer non-sequiturs, more explicit intermediate steps) but the improvement is structural, not semantic. Local consistency gains should not be mistaken for improved mathematical proof capability. This provides formal grounding for the structural-over-content principle: what RLVR optimizes is the architecture of reasoning, not its truth-preserving properties. See Does RLVR actually improve mathematical reasoning or just coherence?.
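
A rough sketch of the distinction, assuming some step-level entailment checker is available (passed in here as a placeholder callable): local coherence scores adjacent step transitions, while global validity asks whether the conclusion actually follows from the premises.

```python
from typing import Callable, List

def local_coherence(steps: List[str],
                    entails: Callable[[str, str], bool]) -> float:
    """Fraction of adjacent step pairs where the later step follows from
    the earlier one. A high score means few non-sequiturs, not a valid proof."""
    if len(steps) < 2:
        return 1.0
    ok = sum(entails(steps[i], steps[i + 1]) for i in range(len(steps) - 1))
    return ok / (len(steps) - 1)

def globally_valid(premises: List[str], steps: List[str], claim: str,
                   entails: Callable[[str, str], bool]) -> bool:
    """Global validity: does the conclusion follow from the premises?
    A chain can score near 1.0 on local coherence and still fail here."""
    return entails(" ".join(premises + steps), claim)
```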

Molecular bond taxonomy specifies what kind of structure matters: The Molecular Structure of Thought paper decomposes Long CoT structure into three interaction types: Deep-Reasoning (covalent bonds — dense local deduction clusters), Self-Reflection (hydrogen bonds — long-range corrective links), and Self-Exploration (van der Waals forces — weak bridges between distant clusters). This provides the specific structural vocabulary for why coherence matters: effective reasoning requires the right distribution of these bond types, and "semantic isomers" (same semantic content, different bond distributions) from different teachers destabilize learning when mixed — even with matched token statistics. See Does long chain of thought reasoning follow molecular bond patterns?.
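
One way to make the bond-distribution idea concrete is a purely distance-based bucketing of step-to-step dependency links. This is a simplification of the paper's taxonomy; the thresholds are chosen only for illustration.

```python
from collections import Counter

def bond_distribution(links, local_window=2, far_window=8):
    """Bucket step-to-step dependency links by distance, loosely mirroring
    the three interaction types: dense local deduction ('deep reasoning'),
    longer-range corrective references ('self-reflection'), and weak
    bridges between distant clusters ('self-exploration')."""
    counts = Counter()
    for src, dst in links:              # link: step dst refers back to step src
        distance = abs(dst - src)
        if distance <= local_window:
            counts["deep_reasoning"] += 1
        elif distance <= far_window:
            counts["self_reflection"] += 1
        else:
            counts["self_exploration"] += 1
    total = sum(counts.values()) or 1
    return {kind: n / total for kind, n in counts.items()}

# e.g. bond_distribution([(0, 1), (1, 2), (2, 7), (3, 15)])
# -> {'deep_reasoning': 0.5, 'self_reflection': 0.25, 'self_exploration': 0.25}
```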


Source: Reasoning Critiques, RLVR, Novel Architectures

Original note title: long cot learning is driven by structural coherence, not content correctness