LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Do reasoning traces need to be semantically correct?

Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.

Note · 2026-02-22 · sourced from Reasoning o1 o3 Search
How should we allocate compute budget at inference time? · What kind of thing is an LLM really?

"Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens" presents the strongest evidence yet against the assumption that reasoning traces carry meaningful semantics that contribute to solution quality.

The experimental design is clean. Transformers are trained on A* search traces for shortest-path planning in random mazes, under three conditions: (1) correct traces, (2) no traces, and (3) deliberately corrupted traces. The corrupted traces are not merely noisy; they are systematically irrelevant, each paired with the wrong problem entirely.
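To make the setup concrete, here is a minimal sketch of how the three conditions could be assembled into training examples. The function name and data layout are illustrative assumptions, not the paper's pipeline.

```python
import random

def build_training_examples(problems, traces, solutions, condition):
    """Pair each maze with a trace according to the training condition.

    problems[i], traces[i], solutions[i] describe the same A* instance.
    condition: "correct" | "none" | "corrupted"
    Illustrative sketch; names and layout are assumptions, not the paper's code.
    """
    examples = []
    n = len(problems)
    for i in range(n):
        if condition == "correct":
            trace = traces[i]                # A* trace for this exact maze
        elif condition == "none":
            trace = ""                       # solution only, no intermediate tokens
        elif condition == "corrupted":
            j = random.choice([k for k in range(n) if k != i])
            trace = traces[j]                # trace from an unrelated maze
        else:
            raise ValueError(condition)
        # Prompt is the maze; target is trace + final plan, for next-token training.
        examples.append({"prompt": problems[i], "target": trace + solutions[i]})
    return examples
```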

The results: corrupted-trace models perform on par with correct-trace models, in some cases surpassing them and generalizing more robustly to out-of-distribution tasks. Models trained entirely on correct traces still produce invalid reasoning traces even when they arrive at correct solutions; the formal A* validator confirms only a loose correlation between trace accuracy and solution accuracy.
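As a rough sketch of how that correlation can be measured, the tally below separates trace validity from solution correctness. The `validate_trace` and `check_solution` callables stand in for a formal A* validator and a shortest-path checker; both interfaces are assumptions here.

```python
def trace_solution_agreement(outputs, validate_trace, check_solution):
    """Tally trace validity against solution correctness over model outputs.

    outputs: iterable of (problem, trace, solution) triples sampled from the model.
    validate_trace / check_solution: externally supplied checkers (assumed interfaces).
    """
    counts = {"valid_correct": 0, "valid_wrong": 0,
              "invalid_correct": 0, "invalid_wrong": 0}
    for problem, trace, solution in outputs:
        valid = validate_trace(problem, trace)
        correct = check_solution(problem, solution)
        key = ("valid" if valid else "invalid") + ("_correct" if correct else "_wrong")
        counts[key] += 1
    # A large "invalid_correct" bucket is the loose-correlation observation.
    return counts
```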

This result directly challenges three assumptions simultaneously. First, that intermediate tokens function as reasoning steps (they may function as computational scaffolding — additional forward passes — regardless of semantic content). Second, that correct traces are superior training data (the scaffolding hypothesis predicts that any tokens providing additional computation would work). Third, that the "aha moment" in DeepSeek R1 indicates genuine realization (a single token insertion does not change internal state; it provides one more forward pass).
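One way to see the scaffolding reading is a sketch that buys the model extra forward passes with contentless filler tokens before the answer is decoded. This illustrates the hypothesis, not the paper's method; the HuggingFace-style `tokenizer`/`model.generate` interface is an assumption.

```python
def answer_with_filler_scaffolding(model, tokenizer, prompt, n_filler=64, filler="."):
    """Append semantically empty tokens so the answer is conditioned on extra
    forward passes. The scaffolding hypothesis predicts that the amount of
    intermediate computation matters more than what the tokens say.
    (Sketch under assumed HuggingFace-style interfaces.)
    """
    scaffold = filler * n_filler                          # contentless intermediate tokens
    inputs = tokenizer(prompt + scaffold, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=32)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]    # keep only the generated answer
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```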

The "Stop Anthropomorphizing" position paper reinforces this from a different angle. It argues the community's tendency to call intermediate tokens "thoughts" or "reasoning traces" is actively harmful — generating false confidence and directing research toward improving trace quality rather than understanding the computational mechanism. The LLM-Modulo framework (generate-test with external verification) is proposed as the principled alternative: treat the LLM as a generator, use sound external verifiers for guarantees.

The practical implication: optimizing trace "interpretability" or "correctness" may be orthogonal to optimizing solution accuracy. The traces most useful for model performance may be those that provide optimal computational scaffolding, not those that most closely resemble human reasoning. This converges with What do models actually learn from chain-of-thought training?, which shows from the opposite direction that structural perturbations (shuffled steps) cause severe accuracy drops while content perturbations (wrong numbers, removed keywords) cause minimal impact. Together, these findings isolate the active ingredient: logical architecture, not semantic content.
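The two perturbation families are easy to make concrete. The functions below are illustrative stand-ins for the kinds of structural and content perturbations that study describes, not its actual corruption code.

```python
import random
import re

def perturb_structure(steps):
    """Structural perturbation: shuffle the order of reasoning steps
    (the kind of change reported to hurt accuracy severely)."""
    shuffled = list(steps)
    random.shuffle(shuffled)
    return shuffled

def perturb_content(steps):
    """Content perturbation: swap numbers for random values while keeping the
    step structure intact (the kind reported to have minimal impact)."""
    return [re.sub(r"\d+", lambda m: str(random.randint(0, 99)), s) for s in steps]
```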

Theoretical backing (RL-STaR): a formal analysis of the STaR framework supports this picture: RL-based self-taught reasoning can improve capabilities despite incorrect reasoning steps in the training data, because the iterative policy gradient converges under bounded error conditions. The model does not need correct intermediate steps to learn to produce correct final answers; what matters is the policy improvement trajectory, not the fidelity of individual traces. The quality of the pre-trained model sets the floor for effective bootstrapping, but tolerance for noisy intermediates is built into the convergence guarantee.
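A compressed sketch of one STaR-style bootstrapping iteration is below, with `generate_rationale` and `fine_tune` as assumed interfaces. The relevant detail is that traces are filtered only on final-answer correctness, so incorrect intermediate steps can survive into the training set without breaking the improvement loop.

```python
def star_iteration(model, problems, answers, generate_rationale, fine_tune):
    """One self-taught-reasoner step: sample rationales, keep those whose final
    answer is correct, and fine-tune on the kept (possibly noisy) traces.
    generate_rationale and fine_tune are assumed interfaces, not a specific library.
    """
    kept = []
    for problem, answer in zip(problems, answers):
        rationale, predicted = generate_rationale(model, problem)
        if predicted == answer:          # outcome-level filter only; steps go unchecked
            kept.append({"problem": problem, "rationale": rationale, "answer": answer})
    return fine_tune(model, kept)        # policy improves despite imperfect rationales
```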


Source: Reasoning o1 o3 Search
