Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens

Paper · arXiv 2505.13775 · Published May 19, 2025

In this paper, we critically examine the common interpretation of intermediate tokens—often anthropomorphized as “thoughts” or reasoning traces, and claimed to display behaviors like backtracking, self-verification, and meta-cognition—by investigating how their semantics actually influence model performance. We train transformer models on formally verifiable reasoning traces and solutions, constraining both intermediate steps and final outputs to align with those of a formal solver. By constructing a formal interpreter of the semantics of our problems and intended algorithm, we systematically evaluate not only solution accuracy but also the correctness of intermediate traces, thus allowing us to assess whether the latter causally influences the former. Our experiments involve training transformer models on traces and solutions generated by A* search. We find that, despite significant improvements over the solution-only baseline, models trained on entirely correct traces still produce invalid reasoning traces when arriving at correct solutions. To further show that trace accuracy is only loosely connected to solution accuracy, we then train models on noisy, corrupted traces that bear no relation to the specific problem each is paired with, and find that performance not only remains largely consistent with that of models trained on correct data, but in some cases improves upon it and generalizes more robustly to out-of-distribution tasks. These results challenge the assumption that intermediate tokens or “Chains of Thought” reflect or induce predictable reasoning behaviors, and caution against anthropomorphizing such outputs or over-interpreting them (despite their mostly correct forms) as evidence of human-like or algorithmic behaviors in language models.
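To make the corrupted-trace condition concrete, the sketch below shows one way such a training set could be built: every problem keeps its own verified solution, but is paired with the search trace of a different, randomly chosen problem. This is a minimal illustration under assumed data-format details (the dictionary keys and the `corrupt_traces` helper are hypothetical), not the paper's exact pipeline.

```python
# Minimal sketch (assumed format, not the paper's pipeline) of the
# "irrelevant trace" manipulation: each problem keeps its own correct
# solution but receives the search trace of an unrelated problem,
# destroying any problem-specific semantics in the intermediate tokens.
import random

def corrupt_traces(dataset, seed=0):
    """dataset: list of dicts with keys 'problem', 'trace', 'solution'."""
    rng = random.Random(seed)
    donors = dataset[:]          # copy so we can permute the trace sources
    rng.shuffle(donors)
    corrupted = []
    for example, donor in zip(dataset, donors):
        corrupted.append({
            "problem":  example["problem"],
            "trace":    donor["trace"],     # trace from an unrelated problem
            "solution": example["solution"],
        })
    # Note: a few examples may keep their own trace by chance; the paper's
    # exact construction may differ in such details.
    return corrupted
```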

While (typically) no optimization pressure is applied to the intermediate tokens [2, 3], empirically it has been observed that language models perform better on many domains if they output such tokens first [4, 5, 6, 7, 8, 1, 9, 10, 11]. While the fact of the performance increase is well-known, the reasons for it are less clear. Previous work has often framed it in anthropomorphic terms, claiming that these models are “thinking” before outputting their answers [4, 12, 1, 13, 3, 14]. Simultaneously, the process of performing more auto-regressive forward passes before outputting the final answer has been credited as an instance of inference-time scaling – that is, these models are assumed to be doing problem-adaptive computation.

Famously, DeepSeek’s R1 paper claimed that one of the most impressive observed behaviors of their trained models was the so-called “aha” moment: as part of the chain of thought it was producing in order to answer some question, the model output the token “aha”, seeming to indicate that it had come upon a sudden realization. While a human may say “aha” to indicate exactly such a sudden internal state change, this interpretation is unwarranted for models that do not have any such internal state, and whose next forward pass will differ from the pre-aha pass only by the inclusion of that single token in their context. Interpreting this token as meaningful in this way requires an additional assumption that has thus far been brushed to the side in discussions of how long-CoT models function and what they do – that the derivational traces they produce are semantically meaningful in the same way that the traces they were trained on were, or at least in the way that a human might expect them to be.

In this paper, we shed some light on the question of whether intermediate traces are semantically meaningful. Following previous work that elucidated important functional aspects of large-scale models through controlled small-scale experiments [15, 16, 17], and working within a sort of “model organism” paradigm, we focus the current work on fully controlled, open, and replicable models trained from scratch. Our models are trained on a simple and well-understood shortest-path planning problem for randomly generated mazes, with our training runs including varying kinds of intermediate traces – from none, to ones generated by the classic A∗ algorithm, to noisy and irrelevant ones. This setup is not only well-understood as a classical computer science problem, but has also grown to be a well-studied domain for trace-augmented transformer training [18, 19, 20, 21].
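As a rough illustration of the kind of trace-generating solver described here, the following sketch runs A∗ on a grid maze and records each node expansion. The function name, grid encoding, and trace format are assumptions for illustration, not the paper's implementation.

```python
# A minimal sketch (assumed representation, not the paper's code) of A* search
# on a grid maze that records its execution trace: the sequence of expanded
# cells plays the role of the "intermediate tokens" a model would be trained on.
import heapq

def astar_with_trace(grid, start, goal):
    """grid: 2D list, 0 = free, 1 = wall; start/goal: (row, col) tuples.
    Returns (trace, path); path is None when the goal is unreachable."""
    def h(cell):  # Manhattan-distance heuristic
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start, [start])]  # (f, g, cell, path so far)
    best_g = {start: 0}
    closed = set()
    trace = []  # expansion order, i.e. the "reasoning trace"

    while frontier:
        f, g, cell, path = heapq.heappop(frontier)
        if cell in closed:
            continue
        closed.add(cell)
        trace.append(cell)          # log this expansion as an intermediate step
        if cell == goal:
            return trace, path      # final plan, i.e. the "solution tokens"
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0
                    and g + 1 < best_g.get(nxt, float("inf"))):
                best_g[nxt] = g + 1
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt, path + [nxt]))
    return trace, None
```

Serializing the maze, the expansion trace, and the returned plan as token sequences would yield training examples of the problem → trace → solution form studied here.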

We approach the problem of understanding intermediate token semantics from three major novel angles, performing empirical evaluations on models we train on small planning tasks. First, we construct a validator for A∗ execution traces and use it to validate and compare trace accuracy to solution accuracy, finding only a loose correlation between the two. Then, we train half-billion-parameter Qwen models on no traces, fully correct traces, and deliberately irrelevant traces. We present a dataset manipulation that – despite the fact that it removes all problem-specific semantics – leads to trained models that perform better on both in-distribution and out-of-distribution tasks. We argue that, if performance is the goal, assuming human-like or algorithm-interpretable trace semantics is not only unnecessary but potentially misleading.
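The full validator is not reproduced in this introduction. As a rough sketch of the kind of checks involved, the functions below (hypothetical names, simplified logic) score the final plan exactly but apply only a necessary condition to the trace, rather than fully replaying A∗ as the paper's interpreter does.

```python
# A simplified stand-in (not the paper's interpreter) for scoring solution
# accuracy and trace plausibility separately on the maze task. The plan check
# is exact; the trace check is only a necessary condition, since a faithful
# validator must replay A*'s frontier and expansion order in full.

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))

def plan_is_valid(grid, start, goal, plan):
    """A plan is valid iff it starts at `start`, ends at `goal`, and every
    step is a unit move onto an in-bounds free cell."""
    if not plan or plan[0] != start or plan[-1] != goal:
        return False
    for a, b in zip(plan, plan[1:]):
        if abs(a[0] - b[0]) + abs(a[1] - b[1]) != 1:
            return False
        if not (0 <= b[0] < len(grid) and 0 <= b[1] < len(grid[0])):
            return False
        if grid[b[0]][b[1]] == 1:
            return False
    return True

def trace_is_plausible(grid, start, trace):
    """Necessary condition only: expansions must begin at `start`, stay on
    free cells, and each new expansion must neighbor an earlier one."""
    if not trace or trace[0] != start:
        return False
    seen = set()
    for cell in trace:
        if grid[cell[0]][cell[1]] == 1:
            return False
        if cell != start and not any(
                (cell[0] + dr, cell[1] + dc) in seen for dr, dc in MOVES):
            return False
        seen.add(cell)
    return True
```

Comparing the fraction of test problems whose generated plan passes `plan_is_valid` with the fraction whose trace also passes a trace check is the kind of measurement behind the loose correlation described above.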