
What do models actually learn from chain-of-thought training?

When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.

Note · 2026-02-22 · sourced from Reasoning Critiques

When training on reasoning demonstrations, what actually gets learned? Controlled ablations reveal a striking asymmetry: models are highly tolerant to content errors but highly sensitive to structural disruption.

Two types of perturbation were applied to Long CoT training samples:

- Content perturbations: the model is mostly unaffected.
- Structural perturbations: the model is severely affected.
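
A minimal sketch of how the two perturbation families can be constructed, assuming each trace is stored as an ordered list of step strings. The specific choices here (digit corruption for content, shuffle/delete for structure) are illustrative stand-ins, not the exact protocol of the source experiments.

```python
import random

def perturb_content(steps, digit_rate=0.3, rng=random.Random(0)):
    """Content perturbation: corrupt surface details (here, digits)
    while leaving step boundaries and ordering intact."""
    corrupted = []
    for step in steps:
        chars = [
            str(rng.randint(0, 9)) if c.isdigit() and rng.random() < digit_rate else c
            for c in step
        ]
        corrupted.append("".join(chars))
    return corrupted

def perturb_structure(steps, mode="shuffle", rng=random.Random(0)):
    """Structural perturbation: break the logical architecture by
    shuffling the step order or deleting a contiguous block of steps."""
    steps = list(steps)
    if mode == "shuffle":
        rng.shuffle(steps)
    elif mode == "delete":
        start = rng.randrange(len(steps))
        del steps[start:start + max(1, len(steps) // 3)]
    return steps

# Illustrative trace split into steps.
trace = [
    "Let x be the number of apples, so 3x + 2 = 14.",
    "Subtract 2 from both sides: 3x = 12, so x = 4.",
    "Check: 3 * 4 + 2 = 14, consistent with the problem.",
    "Therefore the answer is 4.",
]
print(perturb_content(trace))    # digits change, architecture preserved
print(perturb_structure(trace))  # architecture broken, content preserved
```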

What models learn from reasoning demonstrations is not what to think but how to structure thinking: the pattern of reflection, backtracking, and self-validation that makes long CoT effective. The specific facts, numbers, and even the correctness of individual steps are secondary. The logical architecture — which steps precede which, how contradiction leads to backtracking, how intermediate validation is structured — is primary.

This partially explains why distillation from a larger reasoning model to a smaller one works even with relatively few samples (17k samples showed substantial gains): the small model is not memorizing the reasoning content; it is acquiring the structural pattern of how reasoning unfolds. Structure is cheap to transmit.

This deepens Does training data format shape reasoning strategy more than domain?, which found that training format (multiple choice vs fill-in) shapes strategy more than domain. The present finding shows the same principle operating at a finer scale: within a Long CoT format, structural coherence matters more than content correctness. Format dominance operates at multiple levels.

The practical implication: generating training data for reasoning models does not require perfect reasoning. It requires structurally coherent reasoning — chains with correct logical architecture, even if specific steps contain errors.
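
As an illustration of that implication, a hypothetical data filter might gate generated traces on structural markers rather than on answer correctness. The marker lists and thresholds below are assumptions chosen for the example, not a validated recipe.

```python
import re

# Illustrative lexical markers for reflection and validation behaviour.
REFLECTION_MARKERS = ("wait", "actually", "on second thought", "let me reconsider")
VALIDATION_MARKERS = ("check", "verify", "plug", "consistent")

def is_structurally_coherent(trace: str, min_steps: int = 3) -> bool:
    """Accept a generated trace if it has enough distinct steps and shows
    at least one reflection move and one validation move. The correctness
    of the final answer is deliberately not part of the criterion."""
    steps = [s for s in re.split(r"\n+", trace) if s.strip()]
    text = trace.lower()
    has_reflection = any(m in text for m in REFLECTION_MARKERS)
    has_validation = any(m in text for m in VALIDATION_MARKERS)
    return len(steps) >= min_steps and has_reflection and has_validation
```

Under this criterion, a trace containing an arithmetic slip but intact reflection and validation structure passes, while a terse, unstructured but numerically correct answer does not.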

FOL-based validation confirms the coherence/validity distinction: Analysis of RLVR-trained models using first-order logic error taxonomy shows that RLVR improves local trace coherence — transitions between adjacent steps become more logically consistent — without guaranteeing global mathematical validity. The models produce traces that read as better reasoning (fewer non-sequiturs, more explicit intermediate steps) but the improvement is structural, not semantic. Local consistency gains should not be mistaken for improved mathematical proof capability. This provides formal grounding for the structural-over-content principle: what RLVR optimizes is the architecture of reasoning, not its truth-preserving properties. See Does RLVR actually improve mathematical reasoning or just coherence?.
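
A rough sketch of the distinction, assuming some step-level entailment checker is available (passed in here as a placeholder callable): local coherence scores adjacent step transitions, while global validity asks whether the conclusion actually follows from the premises.

```python
from typing import Callable, List

def local_coherence(steps: List[str],
                    entails: Callable[[str, str], bool]) -> float:
    """Fraction of adjacent step pairs where the later step follows from
    the earlier one. A high score means few non-sequiturs, not a valid proof."""
    if len(steps) < 2:
        return 1.0
    ok = sum(entails(steps[i], steps[i + 1]) for i in range(len(steps) - 1))
    return ok / (len(steps) - 1)

def globally_valid(premises: List[str], steps: List[str], claim: str,
                   entails: Callable[[str, str], bool]) -> bool:
    """Global validity: does the conclusion follow from the premises?
    A chain can score near 1.0 on local coherence and still fail here."""
    return entails(" ".join(premises + steps), claim)
```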

Molecular bond taxonomy specifies what kind of structure matters: The Molecular Structure of Thought paper decomposes Long CoT structure into three interaction types: Deep-Reasoning (covalent bonds — dense local deduction clusters), Self-Reflection (hydrogen bonds — long-range corrective links), and Self-Exploration (van der Waals forces — weak bridges between distant clusters). This provides the specific structural vocabulary for why coherence matters: effective reasoning requires the right distribution of these bond types, and "semantic isomers" (same semantic content, different bond distributions) from different teachers destabilize learning when mixed — even with matched token statistics. See Does long chain of thought reasoning follow molecular bond patterns?.
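
One way to make the bond-distribution idea concrete is a purely distance-based bucketing of step-to-step dependency links. This is a simplification of the paper's taxonomy; the thresholds are chosen only for illustration.

```python
from collections import Counter

def bond_distribution(links, local_window=2, far_window=8):
    """Bucket step-to-step dependency links by distance, loosely mirroring
    the three interaction types: dense local deduction ('deep reasoning'),
    longer-range corrective references ('self-reflection'), and weak
    bridges between distant clusters ('self-exploration')."""
    counts = Counter()
    for src, dst in links:              # link: step dst refers back to step src
        distance = abs(dst - src)
        if distance <= local_window:
            counts["deep_reasoning"] += 1
        elif distance <= far_window:
            counts["self_reflection"] += 1
        else:
            counts["self_exploration"] += 1
    total = sum(counts.values()) or 1
    return {kind: n / total for kind, n in counts.items()}

# e.g. bond_distribution([(0, 1), (1, 2), (2, 7), (3, 15)])
# -> {'deep_reasoning': 0.5, 'self_reflection': 0.25, 'self_exploration': 0.25}
```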


Source: Reasoning Critiques, RLVR, Novel Architectures

Original note title: long cot learning is driven by structural coherence, not content correctness