Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!

Paper · arXiv 2504.09762 · Published April 14, 2025

Intermediate token generation (ITG), where a model produces output before the solution, has been proposed as a method to improve the performance of language models on reasoning tasks. These intermediate tokens have been called “reasoning traces” or even “thoughts” – implicitly anthropomorphizing the model, implying these tokens resemble steps a human might take when solving a challenging problem. In this paper, we present evidence that this anthropomorphization isn’t a harmless metaphor, and instead is quite dangerous – it confuses the nature of these models and how to use them effectively, and leads to questionable research.

We take the position that anthropomorphizing intermediate tokens as reasoning/thinking traces (1) is wishful thinking, (2) has little concrete supporting evidence, (3) engenders false confidence, and (4) may be pushing the community into fruitless research directions.

Perhaps the most popular and enduring class of test-time inference ideas involves generating many candidate solutions from an LLM and using some selection procedure to choose the final output. The simplest implementation is known as self-consistency[54]: choose the most common answer. Total time spent is proportional to the number of solutions generated, and while this method can work well in practice, it provides no guarantees that its answers will be more correct. More sophisticated selection procedures attempt to verify that an LLM’s output is correct. When paired with an LLM in this manner, the combined system can be seen as a generate-test framework, which naturally raises questions about the verification process: who does it, and with what guarantees? A variety of approaches have been tried – including using LLMs themselves as verifiers[57] (although this is known to be problematic [49]), learned verifiers[2, 59], and external sound verifiers that come with either full or partial guarantees. In cases where verifiers provide explanations or feedback when a guess is incorrect, these can be passed back to the LLM so it generates better subsequent guesses. Several well-known LLM-based reasoning systems, such as FunSearch [42], AlphaGeometry [52] and AlphaEvolve [1], can all be viewed through this lens. The LLM-Modulo framework[23, 21] provides an umbrella for these types of verification-based approaches, along with their guarantees, which are essential when these systems are deployed in safety-critical applications, or even in conventional applications where wrong answers are unacceptable.
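
To make the contrast concrete, here is a minimal sketch of both selection procedures; sample_completion, extract_answer, and verify are hypothetical stand-ins for an LLM sampling call, an answer parser, and an external sound verifier, not any particular system's API.

```python
from collections import Counter

def self_consistency(sample_completion, extract_answer, prompt, n=16):
    """Sample n candidate solutions and return the most common final answer.

    Cost scales linearly with n, and nothing guarantees that the majority
    answer is correct."""
    answers = [extract_answer(sample_completion(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def generate_test(sample_completion, verify, prompt, max_attempts=16):
    """Generate-test loop: keep sampling until an external verifier accepts.

    If verify is sound, any returned answer inherits its guarantee; feedback
    from failed checks is appended to the prompt to steer later guesses
    (LLM-Modulo-style back-prompting)."""
    context = prompt
    for _ in range(max_attempts):
        candidate = sample_completion(context)
        ok, feedback = verify(candidate)
        if ok:
            return candidate
        context += "\nVerifier feedback: " + feedback
    return None  # no verified answer within the budget
```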

One intuition driving today’s research is that this performance gap is partly because the training data is incomplete. LLMs have soaked up every article, post, and book on the internet but not what it took to produce them – whether internal verbalizations, scratch paper outlines, or typed up but discarded drafts. Perhaps, the hope here goes, if more of these derivational traces were included, this would help LLMs replay versions of the same processes.

While promising, it is far from immediately clear how to source data like this at sufficient scale. There are few if any large collections of generic derivational traces. Not only is it burdensome for people to produce granular step-by-step representations of their own thoughts, but they are unlikely to have direct and explicit access to those processes in the first place. And in those cases where they do, they may deliberately or subconsciously efface their tracks. As Gauss famously remarked when asked to give step-wise intuitions for his proofs: no self-respecting architect leaves the support structure in place once the edifice is complete!

Generating Candidate Derivational Traces: Several trace generation methods have been explored: Human-generated Traces: An obvious way to obtain additional derivational data is to have humans create it. OpenAI paid contractors to write questions and step-by-step solutions to grade school math problems to create GSM8k[30]. While companies have continued to source data like this, it is prohibitively expensive, especially at the data scales necessary for large-scale model training and for the diversity of problems that require supporting derivational data.

Solver-generated Traces: A much more scalable approach is to use formal solvers to automatically generate both solutions and rationales derived from solver-specific intermediate representations. Searchformer[27], Stream of Search[13], as well as DeepMind’s work in [45, 32] use standard search algorithms to produce datasets containing not just answers but also the execution traces generated along the way. For instance, when using A* search to solve a problem, Searchformer’s data generation pipeline records each manipulation of the open and closed lists as a derivational trace. Unfortunately, domain-specific solvers cannot be used to generate traces for arbitrary problems, limiting the generality of this technique.
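
As an illustration of what such a solver-generated trace can look like (the token format here is invented for exposition and is not Searchformer's actual serialization), a minimal A* implementation can simply log every open-list insertion and every expansion alongside the plan it returns:

```python
import heapq
import itertools

def astar_with_trace(start, goal, neighbors, h):
    """A* search that also emits a derivational trace: one line per open-list
    insertion ('create') and one per expansion ('close'). The trace format is
    purely illustrative."""
    trace = [f"create {start} g=0 h={h(start)}"]
    tie = itertools.count()  # tie-breaker so the heap never compares nodes/parents
    open_heap = [(h(start), 0, next(tie), start, None)]  # (f, g, tie, node, parent)
    closed, parents = set(), {}
    while open_heap:
        f, g, _, node, parent = heapq.heappop(open_heap)
        if node in closed:
            continue
        closed.add(node)
        parents[node] = parent
        trace.append(f"close {node} g={g} h={f - g}")
        if node == goal:
            return trace, _reconstruct(parents, goal)
        for nxt, cost in neighbors(node):
            if nxt not in closed:
                ng = g + cost
                trace.append(f"create {nxt} g={ng} h={h(nxt)}")
                heapq.heappush(open_heap, (ng + h(nxt), ng, next(tie), nxt, node))
    return trace, None  # unsolvable instance: a trace but no plan

def _reconstruct(parents, node):
    path = [node]
    while parents[node] is not None:
        node = parents[node]
        path.append(node)
    return path[::-1]
```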

LLM-generated Traces: Rather than creating high-quality traces from the start, an increasingly popular approach is to generate them from an LLM and filter afterwards. This sort of generation is feasible because modern LLMs are pre-trained on data that already contains some derivational traces (e.g., educational web pages, grade school math explanations, and other sources with steps), and outputs that match these styles can be reliably induced, often by merely appending “Let’s think step by step” to the prompt and hoping for traces that might loosely resemble reasoning [24].
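
A minimal sketch of this kind of elicitation; generate is a hypothetical stand-in for whatever sampling interface is available, called with a nonzero temperature so that the k continuations differ:

```python
def elicit_traces(generate, question, k=8):
    """Zero-shot chain-of-thought prompting: append the magic phrase and sample
    k continuations, hoping each loosely resembles a step-by-step derivation."""
    prompt = f"Q: {question}\nA: Let's think step by step."
    return [generate(prompt) for _ in range(k)]
```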

Filtering Traces: Naively LLM-generated traces are often not useful unless they are filtered. Researchers have varied in how they approach this trace selection process, ranging from keeping only those traces that are correct at each step (according to human labelers), to training process reward models that attempt to automate such step-level verification[30], to selecting traces based solely on whether they lead to correct final solutions, without considering the trace content [58, 9].
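
The last of these selection procedures is the easiest to sketch: keep a trace if and only if the separately marked final answer it culminates in checks out, ignoring the intermediate tokens entirely. The final_answer extractor and check_answer routine below are hypothetical placeholders for whatever parsing and automatic grading a given pipeline uses.

```python
def filter_by_final_answer(traces, final_answer, check_answer):
    """Outcome-only rejection sampling: a trace is kept iff the final answer it
    culminates in is verified correct; the intermediate tokens themselves are
    never inspected."""
    kept = []
    for trace in traces:
        answer = final_answer(trace)  # e.g., parse whatever follows a '####' or <answer> marker
        if answer is not None and check_answer(answer):
            kept.append(trace)
    return kept
```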

Improving LLMs Using Derivational Traces: Once derivational traces have been selected, they can be used to further train an LLM. The hope is that, by outputting useful intermediate tokens, the LLM will be more likely to output correct solutions across a wider variety of problems. Early approaches fine-tuned LLMs directly on such traces[58, 27, 13], but more recent advances have pivoted towards using reinforcement learning (RL) instead (although there are questions about the generality of the MDP models used in the current LLMs like DeepSeek R1; see Section 5). The first major successful and publicly understood models trained this way were DeepSeek’s R1-Zero and R1 models[9].

After completing normal LLM pre-training, they begin an RL post-training phase on a new dataset – consisting of questions whose answers can be automatically verified. During this phase, the LLM generates multiple possible completions for each question; these completions take the form of traces culminating in separately marked final answers, and are scored according to the correctness of that final answer. The best completions are then rewarded, adjusting the model parameters to be more likely to output them rather than those completions that did not lead to a correct final answer. In essence, this RL process views the LLM as a token-choosing policy and uses a policy gradient algorithm to iteratively improve its parameters. The “state” here is the context window; the next action is just the token emitted by the policy (see Section 5).
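
Schematically (and only schematically; this is not DeepSeek's actual training implementation, and sample, check_answer, and update are hypothetical stand-ins for the sampling call, the automatic answer checker, and the policy-gradient parameter update), one such post-training step looks like this:

```python
def rl_post_training_step(llm, questions, sample, check_answer, update, k=8):
    """One conceptual iteration of outcome-reward RL post-training: sample k
    completions per question, score each 1/0 on whether its final answer
    verifies, convert scores to group-relative advantages, and nudge the
    policy toward the above-average completions."""
    batch = []
    for q in questions:
        completions = [sample(llm, q) for _ in range(k)]
        rewards = [1.0 if check_answer(q, c) else 0.0 for c in completions]
        mean = sum(rewards) / len(rewards)
        # Completions better than their group's mean get positive weight,
        # worse ones get negative weight.
        advantages = [r - mean for r in rewards]
        batch.extend(zip([q] * k, completions, advantages))
    update(llm, batch)  # advantage-weighted policy-gradient step
    return llm
```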

Conceptually, this RL phase can be considered a two-step process repeated many times: first, generate potential trajectories from the LLM and weight them using an automatically computed success criterion; second, selectively fine-tune the same LLM on its own output. Whether SFT or RL is used to modify the parameters of the base LLM, the resulting model’s architecture is still the same as that of any other LLM. The only difference is in the probability distribution the model captures: one that favors outputting intermediate tokens (which mimic the derivational traces it was trained on) followed by the LLM’s guess at the solution. This reframing makes it clear that pure fine-tuning and RL approaches are not as different as might be initially assumed, a point also supported by [44].

As we discussed, post-training can induce a model to first generate long strings of intermediate tokens before outputting its final answer. There has been a tendency in the field to view these intermediate tokens as the human-like “thoughts” of the model or to see them as reasoning traces which could reflect internal reasoning procedures. This is precisely the tendency our position paper argues against. We start by listing the various (unhealthy) ramifications of this anthropomorphization:

• Viewing intermediate tokens as reasoning/thinking traces has led to a drive to make them “interpretable” to humans in the loop (never mind that interpretability here mostly meant that the traces were in pseudo-English). For example, DeepSeek [9] dabbled in training an RL-only model (R1-Zero) but released a final version (R1) that was trained with additional data and filtering steps specifically to reduce the model’s default tendencies to produce intermediate token sequences that mix English and Chinese!

• It has led to an implicit assumption that the correctness/interpretability of the intermediate tokens has a strong correlation, or even a causal connection, with the correctness of the solution produced. This tendency is so pronounced that a major vendor’s study showing that LRMs’ answers are not always faithful to their intermediate tokens was greeted with surprise [8].

• Viewing intermediate tokens as traces of thinking/reasoning has naturally led to interpreting the length of the intermediate tokens as some sort of meaningful measure of problem difficulty/effort [50, 51], and techniques that increased the length of intermediate tokens were celebrated as “learning to reason” [9]. Simultaneously, there were efforts to shorten the intermediate traces produced and celebrate that as learning to reason efficiently [3].

• There have been attempts to cast intermediate tokens as learning some “algorithm” that generated the training data. For example, the authors of Searchformer [27] claim that their transformer learns to become “more optimal” than A* because it produces shorter intermediate token traces than A*’s derivational trace on the same problem.

These corollaries, in turn, have led to research efforts, which, when viewed under the lens of our position, become questionable enterprises (as we shall discuss in the following sections).

Famously, DeepSeek’s R1 paper claimed that one of the most impressive observed behaviors of their trained models was the so-called “aha” moment: as part of the chain of thought it was producing in order to answer some question, the model output the token “aha”, seeming to indicate that it had come upon a sudden realization. While a human may say “aha” to signal exactly such a sudden internal state change, this interpretation is unwarranted for models which do not have any such internal state, and which on the next forward pass will differ from the pre-“aha” pass only by the inclusion of that single token in their context. Interpreting the “aha” moment as meaningful exemplifies the long-unexamined assumption underlying long CoT models – the false idea that derivational traces are semantically meaningful, whether in their resemblance to algorithm traces or to human reasoning. Further, there have also been works which attribute cognitive behaviors (like backtracking, self-verification, etc.) to the models based on their reasoning traces and try to induce these kinds of behaviors through examples in the hope of improving the models’ performance [12, 41].

One reason this anthropomorphization continues unabated is that it is hard to either prove or disprove the correctness of these generated traces. DeepSeek’s R1, even on very small and simple problems, will babble over 30 pages’ worth of text in response to each and every query, and it is far from clear how to check whether these monologues constitute sound reasoning.

In contrast, we can formally verify the status of traces generated by format-constrained models trained to imitate the derivational traces of domain-specific solvers.
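
For instance, with the toy “create”/“close” trace format sketched earlier, the internal consistency of a generated trace can be checked mechanically; the checker below is a minimal illustration of that idea, not the verification procedure used in [47].

```python
def trace_is_consistent(trace_lines):
    """Check a solver-style trace for internal consistency: every node that is
    'closed' (expanded) must previously have been 'created' (placed on the open
    list) and must not be closed twice. Checking optimality of the expansion
    order against the true g and h values would be the next, stricter step."""
    created, closed = set(), set()
    for line in trace_lines:
        op, rest = line.split(maxsplit=1)
        node = rest.rsplit(" g=", 1)[0]  # strip the cost annotations
        if op == "create":
            created.add(node)
        elif op == "close":
            if node not in created or node in closed:
                return False  # expanded a node never generated, or re-expanded one
            closed.add(node)
        else:
            return False      # malformed step
    return True
```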

Presumably, natural language reasoning follows algorithmic structure, even if it does not correspond to a rigidly defined algorithm. For example, see Polya’s “How to Solve It” [40], which outlines the elements of mathematical problem solving in an algorithmic way, even if they are often left implicit. Accordingly, we argue that algorithmic search traces, such as those studied in [47], serve as a model organism for understanding systems like R1 (analogous to the roles of Drosophila melanogaster or Caenorhabditis elegans in biology). If a technique can learn to produce semantically meaningful reasoning traces for natural language problems, it ought to be able to do so for algorithmic traces as well, and vice versa. It follows that evidence that models trained on algorithmic traces do not learn trace semantics carries over to natural language problems and to the systems trained on them, such as R1.

A similar investigation to test the correlation, and potentially any causation, between intermediate traces and final solution performance was carried out by the authors of [5] in Question-Answering (QA) domains. By decomposing the QA reasoning problems into verifiable sub-problems that can be evaluated at inference time, the authors first generated a Supervised Fine-Tuning (SFT) dataset of correct intermediate traces paired with correct final solutions. To carry out an intervention experiment, they generated another SFT dataset consisting of incorrect intermediate traces again paired with correct final solutions. In the first SFT setting, the results show a large number of false positives, where the fine-tuned models output correct final solutions but incorrect intermediate traces. Interestingly, the intervention experiments with incorrect intermediate traces even outperform the SFT setting with correct intermediate traces.