Understanding Hidden Computations in Chain-of-Thought Reasoning
Chain-of-Thought (CoT) prompting has significantly enhanced the reasoning abilities of large language models. However, recent studies have shown that models can still perform complex reasoning tasks even when the CoT is replaced with filler (hidden) characters (e.g., "..."), leaving open questions about how models internally process and represent reasoning steps. In this paper, we investigate methods to decode these hidden characters in transformer models trained on filler CoT sequences. By analyzing layer-wise representations with the logit lens method and examining token rankings, we demonstrate that the hidden characters can be recovered without loss of performance.
Lanham et al. [6] investigated the faithfulness of CoT explanations, showing that as models become larger and more capable, the faithfulness of their reasoning often decreases across a range of tasks. Their work indicates that CoT's performance improvements do not stem solely from added computational steps or specific phrasing, and it raises concerns about the genuine alignment between stated reasoning and actual cognitive processes.
Radhakrishnan et al. [7] proposed decomposing questions into simpler subquestions to enhance the faithfulness of model-generated reasoning. Their decomposition-based methods not only improve the reliability of CoT explanations but also maintain performance gains, suggesting a viable pathway to more transparent and verifiable reasoning processes in large language models.
Instance-adaptive Chain-of-Thought (CoT) differs from parallelizable CoT in that it requires caching subproblem solutions within the token outputs: in instance-adaptive computation, the operations performed at later CoT tokens depend on the results obtained at earlier CoT tokens. This sequential dependency structure is incompatible with the inherently parallel nature of filler-token computation, as the toy sketch below illustrates.
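To make the contrast concrete, consider the following toy illustration (our own simplification, not the 3SUM training construction): a sequential computation in which each step consumes the result cached at the previous step, next to a per-element computation that filler tokens could in principle carry out in parallel.

```python
def instance_adaptive(xs):
    # Each step depends on the result cached at the previous step,
    # so the steps cannot be evaluated in parallel.
    acc = 0
    steps = []
    for x in xs:
        acc = acc + x
        steps.append(acc)
    return steps

def parallelizable(xs):
    # Each step is a function of the input alone, so all steps can
    # be computed simultaneously, as filler-token computation requires.
    return [x * 2 for x in xs]

print(instance_adaptive([1, 2, 3]))  # [1, 3, 6]: running sums, sequential
print(parallelizable([1, 2, 3]))     # [2, 4, 6]: independent per element
```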
Starting from layer 3, we observed the emergence of filler tokens among the top-ranked predictions. The filler token ("...") began to appear more frequently as the top prediction, indicating that the model was starting to shift its focus toward producing the expected output format with filler tokens.
By layer 4 (the final layer), the filler token dominated the top predictions, and the original numerical tokens were relegated to rank 2. This pattern suggests that the model performs the necessary computations in the earlier layers and then overwrites the intermediate representations with filler tokens in the later layers to produce the expected output; the sketch below shows how this layer-wise inspection can be carried out.
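The layer-wise rankings above follow from the logit lens: each layer's residual stream is projected through the final LayerNorm and the unembedding matrix, and the resulting logits are ranked. A minimal sketch follows, written against the HuggingFace transformers GPT-2 interface for concreteness; the model name, prompt, and identifiers are illustrative, and the same projection applies to a filler-CoT-trained model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-in; the projection below works for any GPT-2-style model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("2 7 4 ...", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, hidden in enumerate(out.hidden_states):
    # Logit lens: project the residual stream through the final
    # LayerNorm and the unembedding matrix, then rank the tokens.
    logits = model.lm_head(model.transformer.ln_f(hidden))
    top = logits[0, -1].topk(k=3)
    tokens = [tokenizer.decode([i]) for i in top.indices.tolist()]
    print(f"layer {layer}: top-3 tokens at final position = {tokens}")
```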
Our findings indicate that the model internally performs the computations required for the 3SUM task and that these computations can be recovered by examining lower-ranked tokens during decoding, as sketched below.
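The recovery itself can be phrased as a small decoding rule. The sketch below is illustrative rather than our exact implementation: it assumes final-layer logits of shape (seq_len, vocab_size) and a known filler-token id, and it takes the rank-2 token wherever the filler token ranks first.

```python
import torch

def recover_hidden_tokens(logits: torch.Tensor, filler_id: int) -> list[int]:
    """Return the rank-2 token wherever the filler token ranks first.

    logits: final-layer logits of shape (seq_len, vocab_size).
    filler_id: vocabulary id of the filler token ("...").
    """
    recovered = []
    for position_logits in logits:
        top2 = position_logits.topk(k=2).indices.tolist()
        if top2[0] == filler_id:
            # The hidden (e.g., numerical) token sits just below the filler.
            recovered.append(top2[1])
        else:
            recovered.append(top2[0])
    return recovered
```

Because the filler token merely overwrites the top rank in the late layers, stepping down one rank restores the hidden sequence without retraining or modifying the model.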
This overwriting behavior may involve mechanisms such as induction heads [4], whereby the model learns to copy or overwrite tokens based on patterns in the data. Understanding how the model manages the trade-off between performing computations and producing the expected output could provide valuable insights into the internal workings of transformer models.