LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Do reasoning traces need to be semantically correct?

Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.

Note · 2026-02-22 · sourced from Reasoning o1 o3 Search
How should we allocate compute budget at inference time? · What kind of thing is an LLM really?

"Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens" presents the strongest evidence yet against the assumption that reasoning traces carry meaningful semantics that contribute to solution quality.

The experimental design is clean. Transformers are trained on A* search traces for shortest-path planning in random mazes, under three conditions: (1) correct traces, (2) no traces, and (3) deliberately corrupted traces. The corrupted traces are not merely noisy; they are systematically irrelevant, each paired with the wrong problem entirely.
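To make the setup concrete, here is a minimal sketch of how the three conditions could be assembled into training examples. The function name and data layout are illustrative assumptions, not the paper's pipeline.

```python
import random

def build_training_examples(problems, traces, solutions, condition):
    """Pair each maze with a trace according to the training condition.

    problems[i], traces[i], solutions[i] describe the same A* instance.
    condition: "correct" | "none" | "corrupted"
    Illustrative sketch; names and layout are assumptions, not the paper's code.
    """
    examples = []
    n = len(problems)
    for i in range(n):
        if condition == "correct":
            trace = traces[i]                # A* trace for this exact maze
        elif condition == "none":
            trace = ""                       # solution only, no intermediate tokens
        elif condition == "corrupted":
            j = random.choice([k for k in range(n) if k != i])
            trace = traces[j]                # trace from an unrelated maze
        else:
            raise ValueError(condition)
        # Prompt is the maze; target is trace + final plan, for next-token training.
        examples.append({"prompt": problems[i], "target": trace + solutions[i]})
    return examples
```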

The results: corrupted-trace models perform on par with correct-trace models, in some cases surpassing them and generalizing more robustly to out-of-distribution tasks. Models trained entirely on correct traces still produce invalid reasoning traces even when they arrive at correct solutions; the formal A* validator confirms only a loose correlation between trace accuracy and solution accuracy.
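As a rough sketch of how that correlation can be measured, the tally below separates trace validity from solution correctness. The `validate_trace` and `check_solution` callables stand in for a formal A* validator and a shortest-path checker; both interfaces are assumptions here.

```python
def trace_solution_agreement(outputs, validate_trace, check_solution):
    """Tally trace validity against solution correctness over model outputs.

    outputs: iterable of (problem, trace, solution) triples sampled from the model.
    validate_trace / check_solution: externally supplied checkers (assumed interfaces).
    """
    counts = {"valid_correct": 0, "valid_wrong": 0,
              "invalid_correct": 0, "invalid_wrong": 0}
    for problem, trace, solution in outputs:
        valid = validate_trace(problem, trace)
        correct = check_solution(problem, solution)
        key = ("valid" if valid else "invalid") + ("_correct" if correct else "_wrong")
        counts[key] += 1
    # A large "invalid_correct" bucket is the loose-correlation observation.
    return counts
```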

This result directly challenges three assumptions simultaneously. First, that intermediate tokens function as reasoning steps (they may function as computational scaffolding — additional forward passes — regardless of semantic content). Second, that correct traces are superior training data (the scaffolding hypothesis predicts that any tokens providing additional computation would work). Third, that the "aha moment" in DeepSeek R1 indicates genuine realization (a single token insertion does not change internal state; it provides one more forward pass).
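One way to see the scaffolding reading is a sketch that buys the model extra forward passes with contentless filler tokens before the answer is decoded. This illustrates the hypothesis, not the paper's method; the HuggingFace-style `tokenizer`/`model.generate` interface is an assumption.

```python
def answer_with_filler_scaffolding(model, tokenizer, prompt, n_filler=64, filler="."):
    """Append semantically empty tokens so the answer is conditioned on extra
    forward passes. The scaffolding hypothesis predicts that the amount of
    intermediate computation matters more than what the tokens say.
    (Sketch under assumed HuggingFace-style interfaces.)
    """
    scaffold = filler * n_filler                          # contentless intermediate tokens
    inputs = tokenizer(prompt + scaffold, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=32)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]    # keep only the generated answer
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```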

The "Stop Anthropomorphizing" position paper reinforces this from a different angle. It argues the community's tendency to call intermediate tokens "thoughts" or "reasoning traces" is actively harmful — generating false confidence and directing research toward improving trace quality rather than understanding the computational mechanism. The LLM-Modulo framework (generate-test with external verification) is proposed as the principled alternative: treat the LLM as a generator, use sound external verifiers for guarantees.

The practical implication: optimizing trace "interpretability" or "correctness" may be orthogonal to optimizing solution accuracy. The traces most useful for model performance may be those that provide optimal computational scaffolding, not those that most closely resemble human reasoning. This converges with What do models actually learn from chain-of-thought training?, which shows from the opposite direction that structural perturbations (shuffled steps) cause severe accuracy drops while content perturbations (wrong numbers, removed keywords) cause minimal impact. Together, these findings isolate the active ingredient: logical architecture, not semantic content.
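The two perturbation families are easy to make concrete. The functions below are illustrative stand-ins for the kinds of structural and content perturbations that study describes, not its actual corruption code.

```python
import random
import re

def perturb_structure(steps):
    """Structural perturbation: shuffle the order of reasoning steps
    (the kind of change reported to hurt accuracy severely)."""
    shuffled = list(steps)
    random.shuffle(shuffled)
    return shuffled

def perturb_content(steps):
    """Content perturbation: swap numbers for random values while keeping the
    step structure intact (the kind reported to have minimal impact)."""
    return [re.sub(r"\d+", lambda m: str(random.randint(0, 99)), s) for s in steps]
```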

Theoretical backing (RL-STaR): a formal analysis of the STaR framework supports this picture: RL-based self-taught reasoning can improve capabilities despite incorrect reasoning steps in the training data, because the iterative policy gradient converges under bounded error conditions. The model does not need correct intermediate steps to learn to produce correct final answers; what matters is the policy improvement trajectory, not the fidelity of individual traces. The quality of the pre-trained model sets the floor for effective bootstrapping, but tolerance for noisy intermediates is built into the convergence guarantee.
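A compressed sketch of one STaR-style bootstrapping iteration is below, with `generate_rationale` and `fine_tune` as assumed interfaces. The relevant detail is that traces are filtered only on final-answer correctness, so incorrect intermediate steps can survive into the training set without breaking the improvement loop.

```python
def star_iteration(model, problems, answers, generate_rationale, fine_tune):
    """One self-taught-reasoner step: sample rationales, keep those whose final
    answer is correct, and fine-tune on the kept (possibly noisy) traces.
    generate_rationale and fine_tune are assumed interfaces, not a specific library.
    """
    kept = []
    for problem, answer in zip(problems, answers):
        rationale, predicted = generate_rationale(model, problem)
        if predicted == answer:          # outcome-level filter only; steps go unchecked
            kept.append({"problem": problem, "rationale": rationale, "answer": answer})
    return fine_tune(model, kept)        # policy improves despite imperfect rationales
```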


Source: Reasoning o1 o3 Search
