Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning

Paper · arXiv 2505.10182 · Published May 15, 2025
Tags: Data · Reasoning by Reflection · Reasoning Methods (CoT, ToT) · Reinforcement Learning

Large Language Models (LLMs) have demonstrated significant improvements in reasoning capabilities through supervised fine-tuning and reinforcement learning. However, when training reasoning models, these approaches are primarily applicable to specific domains such as mathematics and programming, which imposes fundamental constraints on the breadth and scalability of training data. In contrast, continual pretraining (CPT) offers the advantage of not requiring task-specific signals. Nevertheless, how to effectively synthesize training data for reasoning and how such data affect a wide range of domains remain largely unexplored. This study provides a detailed evaluation of Reasoning CPT, a form of CPT that uses synthetic data to reconstruct the hidden thought processes underlying texts, based on the premise that texts are the result of the author’s thinking process. Specifically, we apply Reasoning CPT to Gemma2-9B using synthetic data with hidden thoughts derived from STEM and Law corpora, and compare it to standard CPT on the MMLU benchmark. Our analysis reveals that Reasoning CPT consistently improves performance across all evaluated domains. Notably, reasoning skills acquired in one domain transfer effectively to others; the performance gap with conventional methods widens as problem difficulty increases, with gains of up to 8 points on the most challenging problems. Furthermore, models trained with hidden thoughts learn to adjust the depth of their reasoning according to problem difficulty.

In contrast, pretraining does not depend on specific domains or explicit reward signals. The challenge lies in constructing suitable training data for reasoning. A natural approach is to mine thinking processes from texts and use them as training data for continual pretraining. The key insight is that every text can be seen as the result of an author’s implicit thought processes. For example, a mathematical proof presents a clean logical flow, but behind it lie many decisions involving trial and error, hypothesis verification, and consideration of counterexamples. Similarly, legal opinions state their conclusions concisely, but behind them judges recall precedents and weigh legal interpretations and social impacts from multiple perspectives. We refer to these hidden internal thought processes as hidden thoughts.

Reconstructing hidden thoughts behind texts with LLMs provides a promising pathway for creating extensive reasoning data across diverse domains. Unlike traditional methods such as SFT or RL, this approach has several potential benefits: (1) it does not require strict correctness checks; (2) it efficiently leverages existing high-quality texts; and (3) it enables training reasoning abilities across various domains. Despite these advantages, the potential of continual pretraining with hidden thought processes remains insufficiently explored.

In this study, we evaluate the effect of incorporating hidden thoughts into continual pretraining across two different domains: STEM and Law. We use LLMs to simulate human hidden thoughts involved in creating expert-quality texts—such as recalling background knowledge, making decisions, verifying steps, and expressing thoughts naturally—to build synthetic datasets. Specifically, we generate hidden thoughts using Gemma2-9B-it [Shao et al., 2024] for specialized texts from two sources: OpenWebMath [Paster et al., 2024] and FreeLaw [Gao et al., 2021], and combine them with the original texts. We compare two continual pretraining settings applied to the base model Gemma2-9B [Shao et al., 2024]: one using the original texts (CPT), and another using synthetic data that combines original texts with hidden thoughts (Reasoning CPT). Our evaluation on the MMLU benchmark reveals several key findings:
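To make the pipeline concrete, here is a minimal sketch of the synthesis step. The prompt wording, function names, and the injected `generate` callable are illustrative assumptions on our part; only the thought tags and the thoughts-then-text layout follow the example data shown later in this section.

```python
from typing import Callable, List

# Illustrative prompt; the paper's exact prompt is not reproduced here.
PROMPT_TEMPLATE = (
    "You are the author of the following expert text. Before writing it, you "
    "went through a hidden thought process: recalling background knowledge, "
    "making decisions, verifying steps, and thinking in a natural, human-like "
    "style. Write out that hidden thought process.\n\nText:\n{text}\n\nHidden thoughts:"
)

def synthesize_example(text: str, generate: Callable[[str], str]) -> str:
    """Build one Reasoning CPT sequence: hidden thoughts followed by the original text."""
    thoughts = generate(PROMPT_TEMPLATE.format(text=text))
    return f"<start_of_thought>\n{thoughts}\n<end_of_thought>\n\n{text}"

def build_corpus(texts: List[str], generate: Callable[[str], str]) -> List[str]:
    # `generate` wraps whatever LLM produces the thoughts (Gemma2-9B-it in the paper).
    return [synthesize_example(t, generate) for t in texts]
```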

• Cross-Domain Transfer of Reasoning Capabilities: Reasoning CPT consistently outperforms standard CPT across all test domains, achieving performance improvements of up to 3.3 points over the base model. A particularly interesting finding is that significant improvements are observed even in domains different from those used for training. For example, models trained with hidden thoughts from the law domain show improvements not only in MMLU social sciences but also in MMLU-STEM domains by 4.3 points.

• Higher Performance on Difficult Problems: The advantages of Reasoning CPT become more pronounced as problem difficulty increases. For the most challenging problems, it achieves accuracy improvements of approximately 8 points compared to standard CPT (Figure 1).

• Automatic Adjustment of Reasoning Length to Problem Difficulty: Our analysis reveals that models trained with Reasoning CPT automatically scale their reasoning length with problem complexity: they generate shorter reasoning for simple problems and longer reasoning for difficult ones. This adaptive behavior likely stems from the training data, in which longer original texts yield longer mined hidden thoughts, and it contributes to more efficient reasoning.

Hidden Thought

Expert-written texts are the result of underlying thought processes that are not explicitly shown. For example, a math proof is the result of a mathematician trying many approaches, drawing on background knowledge, and checking the work, although the final text only shows the end result. Unlike Ruan et al. [2025], which mainly added background knowledge to texts, we focus on recreating the human thinking processes involved in creating text. We build synthetic datasets by combining LLM-generated hidden thoughts (representing thinking not shown in the text) with the original texts. Figure 2 and Figure 3 show examples of our synthetic data.

As shown in this example, the hidden thoughts we generate include four characteristic aspects that capture how domain experts typically think:

  1. Human-like thinking style: Spontaneous expressions such as “Hmm...” or “Aha!” (e.g., “Hmm, what kind of information are we talking about?”)

  2. Background knowledge recall: Internally recalling relevant rules, concepts, or facts mentioned (or presupposed) in the text (e.g., “Let me recall the specific terms of the protective order. It protects sensitive information such as price proposals...”)

  3. Decision-making: Considering an action (e.g., “I need to justify redacting information from the court decision.”)

  4. Self-verification: Checking the reasoning for possible mistakes or omissions (e.g., “Let me double-check if there are any exceptions to the protective order that might apply here.”)

<start_of_thought>

OK, I need to justify redacting information from the court decision.

First, what’s the basis for redaction? It’s likely a protective order. I need to remember the specific terms of that order. Right, it protects sensitive information.

Wait, what kind of information are we talking about? Proposals, prices, and price discussions... this sounds like commercially sensitive data.

Hmm, I could argue that this information is confidential under the protective order, but that’s a bit generic. Let me think about the specific harms of disclosure.

Aha! Releasing this information would give Torres and G4S’s competitors an unfair advantage. They could see how Torres and G4S approach procurements and potentially undercut their bids. That’s a clear competitive disadvantage.

I should focus on this competitive harm argument. It’s more specific and directly addresses the nature of the information being redacted.

I need to make sure I clearly state that this information is protected under the protective order, but the primary justification for redaction is the potential competitive harm.

Let me double-check if there are any exceptions to the protective order that might apply here. I don’t think so, but it’s always good to be thorough.

<end_of_thought>

The information that all three parties agree should be redacted from the court’s decision is protected information as defined by the protective order. Specifically, the disclosure of information contained in Torres’s and G4S’s proposals, total evaluated prices, and price-related discussions would provide Torres, G4S, and their competitors with insight into how Torres and G4S approach procurements of local guard services contracts, putting Torres and G4S at a competitive disadvantage. Accordingly, the court will redact that information from its decision.

The synthetic sequence shown above is created by concatenating the original text with its hidden thoughts.
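For completeness, here is a minimal sketch of how such a sequence might enter continual pretraining as plain next-token prediction. The paper does not state here whether any part of the sequence is masked out of the loss, so training on all tokens is our assumption, and the abbreviated sequence text is for illustration only.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")

# One synthetic sequence (abbreviated). The thought tags are kept as plain text
# here; registering them as special tokens would be a reasonable alternative.
sequence = (
    "<start_of_thought>\nOK, I need to justify redacting information...\n"
    "<end_of_thought>\n\nThe information that all three parties agree should be redacted..."
)

batch = tokenizer(sequence, return_tensors="pt", truncation=True, max_length=4096)
batch["labels"] = batch["input_ids"].clone()  # standard LM loss over the whole sequence
```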

4 Analysis of Reasoning Efficiency

Longer chains of thought improve accuracy but incur greater computational cost and latency [Sui et al., 2025]. The “overthinking phenomenon” (spending too many tokens on simple problems) is a practical challenge for reasoning models. Efficient reasoning, balancing thinking token count against reasoning accuracy, is a key design element in LLM reasoning: if reasoning is too verbose, it wastes computational resources; if too brief, accuracy suffers. In practical scenarios, models must handle problems of varying difficulty, making appropriate adjustment of reasoning depth essential. This section analyzes how much thought each model generates for problems of different difficulty levels and examines the relationship between reasoning efficiency and performance. We focus on two questions: (1) How does reasoning length change with problem difficulty? (2) What creates the difference in reasoning efficiency between standard CPT and Reasoning CPT? To answer these questions, we analyze the relationship between thinking token count and accuracy using the MMLU problems and model outputs from §3.2.
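As a concrete illustration of this measurement, the sketch below bins model outputs by difficulty and averages thinking-token count and accuracy per bin. The record fields are hypothetical names, not the paper’s actual data schema.

```python
from collections import defaultdict
from statistics import mean

def summarize_by_difficulty(records):
    """records: dicts with hypothetical keys 'difficulty', 'thinking_tokens', 'correct'."""
    bins = defaultdict(list)
    for r in records:
        bins[r["difficulty"]].append(r)
    # Per difficulty bin: mean thinking-token count and mean accuracy.
    return {
        level: {
            "avg_thinking_tokens": mean(r["thinking_tokens"] for r in rs),
            "accuracy": mean(float(r["correct"]) for r in rs),
        }
        for level, rs in bins.items()
    }

# e.g. summarize_by_difficulty([
#     {"difficulty": "Very Hard", "thinking_tokens": 512, "correct": True},
#     {"difficulty": "Easy", "thinking_tokens": 64, "correct": True},
# ])
```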

For simpler problems (Very Easy, Easy), Reasoning CPT models match CPT accuracy while using fewer thinking tokens; for more difficult problems (Hard, Very Hard), they substantially increase thought length, with correspondingly large accuracy gains. At Very Easy and Easy difficulty levels, Reasoning CPT uses fewer thinking tokens than CPT in both domains, with little difference in accuracy. At Medium difficulty, Reasoning CPT models substantially increase token count while CPT models show relatively smaller increases. Accordingly, the accuracy gap between Reasoning CPT and CPT begins to widen, with Reasoning CPT scoring 2.6 points higher in STEM and 1.1 points higher in Law. The differences between models become most pronounced at Hard and Very Hard difficulty levels, where Reasoning CPT models continue to lengthen their thoughts and accuracy improves accordingly. For example, at Very Hard difficulty, Reasoning CPT-trained models achieve 51.8% to 52.5% accuracy, while CPT accuracy remains at 43.9% to 44.6%. This approximately 8-point accuracy difference suggests that models trained with Reasoning CPT have acquired the ability to automatically think longer for complex problems.

A plausible driver of this behavior is the positive correlation between the length of the original text and the length of its hidden-thought segment observed in our training corpus. Figure 6 plots original-text token count against hidden-thought token count for each domain, yielding Spearman coefficients of ρ = 0.348 (STEM) and ρ = 0.486 (Law). Although token count is only a coarse proxy for information content, this correlation may teach the model a simple heuristic: continue thinking until sufficient evidence has been accumulated to predict the subsequent answer with high confidence. As a result, easy questions, which reach that confidence threshold quickly, elicit short chains of thought, whereas hard questions trigger longer chains, reproducing the efficiency pattern seen in Figure 5.
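The correlation itself is straightforward to reproduce on any corpus of (text, thought) pairs; here is a sketch, with a hypothetical helper name, using SciPy’s rank correlation:

```python
from scipy.stats import spearmanr

def length_correlation(pairs):
    """pairs: (original_text_token_count, hidden_thought_token_count) tuples."""
    text_lens, thought_lens = zip(*pairs)
    # Spearman's rho over the two length distributions.
    rho, pvalue = spearmanr(text_lens, thought_lens)
    return rho, pvalue

# The paper reports rho = 0.348 (STEM) and rho = 0.486 (Law) on its own corpus.
```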