The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think

Paper · arXiv 2505.10185 · Published May 15, 2025
Reasoning Methods · CoT · ToT · Domain Specialization · Discourses

Long chain-of-thought (CoT) is an essential ingredient in the effective use of modern large language models, but our understanding of the reasoning strategies underlying these capabilities remains limited. While some prior works have attempted to categorize CoTs using predefined strategy types, such approaches are constrained by human intuition and fail to capture the full diversity of model behaviors. In this work, we introduce the COT ENCYCLOPEDIA, a bottom-up framework for analyzing and steering model reasoning. Our method automatically extracts diverse reasoning criteria from model-generated CoTs, embeds them into a semantic space, clusters them into representative categories, and derives contrastive rubrics to interpret reasoning behavior. Human evaluations show that this framework produces more interpretable and comprehensive analyses than existing methods. Moreover, we demonstrate that this understanding enables performance gains: we can predict which strategy a model is likely to use and guide it toward more effective alternatives. Finally, we provide practical insights, such as that training data format (e.g., free-form vs. multiple-choice) has a far greater impact on reasoning behavior than data domain, underscoring the importance of format-aware model design.

In this paper, we introduce the COT ENCYCLOPEDIA, a method to systematically analyze and control long CoTs that involve multiple, intertwined reasoning strategies. We do so through a bottom-up, clustering-based framework designed to capture, interpret, and steer diverse reasoning strategies at scale. Rather than relying on predefined categories, our approach begins by prompting a language model to produce free-form explanations of the reasoning strategies used in its own responses. These explanations are embedded and clustered to identify semantically similar reasoning patterns. For each resulting cluster, we generate contrastive rubrics (e.g., Inductive vs. Deductive, Directive vs. Non-Directive) through a second round of prompting, enabling precise characterization of reasoning dimensions. Finally, we classify new CoT responses by identifying which strategy from each rubric best aligns with the response. Human evaluation supports the quality of this pipeline: while top-down strategy labels from previous work [5] are judged as reasonable in only 51% of cases, our bottom-up method achieves perceived reasonableness of 92–97% across stages. See Figure 2 for an overview.
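To make the embed-and-cluster step concrete, here is a minimal sketch assuming a sentence-transformers encoder and k-means clustering; the paper does not pin down these choices, and the explanation strings, model name, and cluster count below are illustrative.

```python
# Minimal sketch: embed free-form strategy explanations and cluster them.
# The encoder, cluster count, and example texts are assumptions, not the
# paper's actual configuration.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Hypothetical free-form strategy explanations elicited from the model.
explanations = [
    "The response reasons deductively from general rules to the specific case.",
    "The response verifies each intermediate result before moving on.",
    "The response enumerates several candidate approaches up front.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(explanations, normalize_embeddings=True)

# Group semantically similar explanations into representative categories;
# each cluster later receives a contrastive rubric via a second prompt.
labels = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(embeddings)

for text, label in zip(explanations, labels):
    print(label, text)
```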

Beyond interpretability, the COT ENCYCLOPEDIA offers two practical benefits. First, it can improve a reasoning model’s performance by guiding it to adopt more effective strategies. This is achieved by (1) training a classifier to predict which strategy a model would use for a given input, (2) applying Bayes’ rule to estimate the likelihood of correctness when using each strategy, and (3) prompting the model to follow the most promising one. Across five benchmarks, we observe performance improvements of 2.5–8.3% in three different reasoning models. To our knowledge, this is the first demonstration that controlling a model’s high-level reasoning strategies can directly enhance accuracy.
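A minimal sketch of step (2), assuming a held-out set that records which strategy each response used and whether it was correct; the strategy names and counts below are invented for illustration.

```python
# Minimal sketch: estimate P(correct | strategy) via Bayes' rule from
# held-out records, then pick the most promising strategy to prompt for.
from collections import Counter

# Hypothetical held-out records: (strategy used, answer was correct?).
records = [("depth_first", True), ("depth_first", False),
           ("breadth_first", True), ("breadth_first", True)]

strategy_counts = Counter(s for s, _ in records)
correct_counts = Counter(s for s, ok in records if ok)
p_correct = sum(ok for _, ok in records) / len(records)

def p_correct_given(strategy: str) -> float:
    """Bayes' rule: P(correct | s) = P(s | correct) * P(correct) / P(s)."""
    p_s = strategy_counts[strategy] / len(records)
    p_s_given_correct = correct_counts[strategy] / max(1, sum(correct_counts.values()))
    return p_s_given_correct * p_correct / p_s

best = max(strategy_counts, key=p_correct_given)
print(f"Steer the model toward: {best}")  # fed into the guiding prompt
```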

Second, we demonstrate how the COT ENCYCLOPEDIA can reveal novel insights about model reasoning abilities: we perform controlled experiments on how training data format fundamentally shapes reasoning strategies, and we show that desired reasoning behaviors can be induced through model merging. Our analysis shows that the domain of the training data (e.g., math vs. commonsense) has little effect on reasoning patterns, with Cohen’s d consistently below 0.2. In contrast, the format, multiple-choice (MC) versus free-form (FF), has a much larger effect, with effect sizes up to 1.5. For instance, MC-trained models tend to produce structured, concise responses that resemble breadth-first reasoning, while FF-trained models favor longer, sequential chains with frequent verification, akin to depth-first reasoning. By linearly interpolating weights between MC- and FF-trained models, we generate models that smoothly transition in strategy, demonstrating controllability without fine-tuning.
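The interpolation itself is simple weight-space averaging; below is a minimal PyTorch sketch assuming both checkpoints share an identical architecture, with tiny `nn.Linear` modules standing in for the actual MC- and FF-trained models.

```python
# Minimal sketch: linear interpolation between two checkpoints' weights.
import torch.nn as nn

def interpolate(state_mc: dict, state_ff: dict, alpha: float) -> dict:
    """Return alpha * MC weights + (1 - alpha) * FF weights, key by key."""
    return {k: alpha * state_mc[k] + (1 - alpha) * state_ff[k] for k in state_mc}

# Tiny stand-ins for the MC- and FF-trained models (same architecture).
model_mc, model_ff, merged = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)

for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    merged.load_state_dict(interpolate(model_mc.state_dict(),
                                       model_ff.state_dict(), alpha))
    # `merged` now realizes a reasoning style between the two parents.
```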

Prior work [5] offered valuable insights by defining four reasoning behaviors—verification, backtracking, subgoal setting, and backward chaining—but such predefined categories struggle to capture the full diversity of emerging or model-specific strategies. To address this gap, we introduce COT ENCYCLOPEDIA, a five-stage framework for identifying, organizing, and analyzing diverse reasoning strategies in CoT outputs. Unlike prior top-down approaches, COT ENCYCLOPEDIA derives reasoning dimensions in a bottom-up, data-driven manner using large language models. As shown in Figure 2, the framework systematically extracts classification criteria, compresses them via semantic clustering, and generates human-interpretable reports on model reasoning behaviors.

The resulting taxonomy defines six reasoning dimensions: Analytical Perspective, Scope of Approach, Reasoning Type, Idea Development, Verification Focus, and Clarification Approach.

As a baseline, we also assess the presence of four predefined cognitive behaviors—verification, backtracking, subgoal setting, and backward chaining—within the same responses [5].

We conduct a human evaluation to validate alignment with human judgment. From model outputs, we sample 100 responses and assign 25 to each of four annotators. For each response, annotators answer four binary questions, assessing: (1) plausibility of fine-grained criteria, (2) coherence of high-level grouping, (3) relevance of predefined-criteria analysis, and (4) relevance of high-level-criteria analysis. As shown in Figure 3, annotators judge both fine- and high-level criteria as reasonable for most cases and find our framework produces more coherent analyses than the baseline. This indicates that our bottom-up method better captures fine-grained reasoning differences and generalizes across tasks and models.

We have shown that each dataset typically has an overall optimal reasoning strategy, indicating opportunities to enhance model performance. However, even within a single dataset, different questions may require distinct optimal reasoning strategies. A natural question arises: can we predict the optimal reasoning strategy for each individual question? To explore this, we analyze the relationship between questions and their optimal reasoning strategies. Specifically, we perform a regression analysis using similarities measured in the embedding space, both between questions and between their corresponding optimal reasoning strategies.

We find that higher similarity between questions corresponds to greater similarity between their reasoning strategies, suggesting that models adopt similar strategies for similar problems. Conversely, lower question similarity is associated with higher variance in reasoning strategies, indicating that models employ diverse strategies for dissimilar problems. These findings suggest the potential to predict effective reasoning strategies for unseen questions based on the strategies used in similar, previously encountered questions.
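A minimal sketch of this regression, assuming cosine similarity in a shared embedding space; the synthetic embeddings below merely stand in for real question and strategy embeddings.

```python
# Minimal sketch: regress pairwise strategy similarity on pairwise
# question similarity. Synthetic embeddings stand in for real ones.
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
q_emb = rng.normal(size=(50, 64))                     # question embeddings
s_emb = q_emb + rng.normal(scale=0.5, size=(50, 64))  # strategy embeddings (correlated stand-in)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

pairs = list(combinations(range(50), 2))
q_sims = np.array([cos(q_emb[i], q_emb[j]) for i, j in pairs])
s_sims = np.array([cos(s_emb[i], s_emb[j]) for i, j in pairs])

slope, intercept, r, p, _ = stats.linregress(q_sims, s_sims)
print(f"slope={slope:.3f}, r={r:.3f}, p={p:.2g}")  # positive slope ~ shared strategies
```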

Our empirical results revealed four key insights: (1) optimal reasoning strategies significantly enhance task performance on both helpfulness and safety benchmarks; (2) these patterns can be predicted from input questions alone, enabling real-time adaptive reasoning control; (3) training data format influences reasoning strategies more substantially than domain; and (4) desired reasoning behaviors can be interpolated through model weight merging without additional training.

We then synthesize a natural language report using an LLM, which selects and composes rubric-specific templates to describe the reasoning pattern of each response. For example: “The response shows a bottom-up reasoning style, combining data-driven verification ...”
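A minimal sketch of this template-composition step, assuming rubric labels have already been assigned to a response; the dimension names come from our taxonomy, but the template strings and label values are illustrative.

```python
# Minimal sketch: compose a report from rubric-specific templates.
# Template strings and label values are illustrative assumptions.
TEMPLATES = {
    ("Reasoning Type", "Inductive"): "a bottom-up reasoning style",
    ("Reasoning Type", "Deductive"): "a top-down reasoning style",
    ("Verification Focus", "Data-Driven"): "data-driven verification",
    ("Verification Focus", "Logic-Driven"): "logic-driven verification",
}

def compose_report(labels: dict) -> str:
    parts = [TEMPLATES[(dim, lab)] for dim, lab in labels.items()
             if (dim, lab) in TEMPLATES]
    return "The response shows " + ", combining ".join(parts) + "."

print(compose_report({"Reasoning Type": "Inductive",
                      "Verification Focus": "Data-Driven"}))
# -> The response shows a bottom-up reasoning style, combining data-driven verification.
```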

To examine how training data characteristics influence reasoning strategies, we compare the effects of data format and domain using Reinforcement Learning with Verifiable Rewards (RLVR). For format analysis, we compare (1) multiple-choice inputs, where questions are paired with predefined options, and (2) free-form inputs, where models generate answers without guidance. Using the NuminaMath dataset [10], which is originally free-form, we synthetically generate multiple-choice versions to control for content while isolating presentation format.
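A minimal sketch of this free-form-to-multiple-choice conversion, assuming a generic `llm` completion callable (a placeholder, not any specific API) that proposes distractors for the gold answer.

```python
# Minimal sketch: wrap a free-form item in a multiple-choice format.
# `llm` is a hypothetical prompt -> text callable.
import random

def to_multiple_choice(question: str, gold: str, llm) -> dict:
    prompt = (f"Question: {question}\nCorrect answer: {gold}\n"
              "Write three plausible but incorrect answers, one per line.")
    distractors = [ln.strip() for ln in llm(prompt).splitlines() if ln.strip()][:3]
    options = distractors + [gold]
    random.shuffle(options)  # content stays fixed; only presentation changes
    return {"question": question,
            "options": options,
            "answer": chr(ord("A") + options.index(gold))}  # gold assumed unique
```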

MC-trained models produce concise, structured answers, while FF-trained models are more verbose and often repeat filler words like ‘wait.’

We find that MC-trained models explore multiple solution paths early on, similar to breadth-first search, whereas FF-trained models follow a single path with iterative verification, resembling depth-first search.