Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

Paper · arXiv 2605.28713
Context Engineering · Chain-of-Thought and Reasoning Methods · Retrieval-Augmented Generation (RAG) · Inference-Time Scaling

Context compression aims to shorten long context inputs with minimal information loss in order to accelerate LLM inference. While existing methods have shown promise, they typically rely on complex compression modules or compression-specific training, leaving the intrinsic capabilities of LLMs underexplored. In contrast, this work reveals that a thinking model can itself naturally compress long contexts by organizing task-relevant information. We thus derive Thinking as Compression (TaC), a new compression paradigm that treats the thinking trace itself as the compressed context. Without relying on any dedicated compressor, TaC directly prompts the thinking model to generate thinking traces that serve as the shortened context, and already outperforms most representative compression methods. Further, given that raw thinking output may struggle with budget control and exhibit shortcut behaviors, we introduce Thinking as Compression Constrained (TaC-C), which leverages a simple reward-driven optimization framework to elicit intrinsic thinking as a compact and controllable compressed context. Experiments across four long-context QA benchmarks demonstrate that TaC-C consistently outperforms existing baselines. At 4x and 8x compression ratios, it surpasses the strongest competitor by 17.4% and 23.4% in average F1, and by 15.7% and 21.7% in average Exact Match Score (EM), respectively.

Large Language Models (LLMs) are increasingly used in Retrieval-Augmented Generation (RAG) [41, 14], long-document understanding [1, 21], and long-horizon agentic workflows [17, 22, 6]. They must process ever longer input contexts to provide richer evidence, even though such contexts incur substantial inference overhead and introduce redundant or irrelevant information [33, 3, 32]. Context compression is therefore essential for efficient and reliable long-context inference, aiming to retain task-critical information within a much shorter input.

In this work, we explore context compression from the perspective of the model's intrinsic capabilities. Given that thinking is fundamentally a process of distilling and compressing information [38, 24], we propose Thinking as Compression (TaC), a new compression paradigm that treats the thinking traces themselves as compressed context. As illustrated in Figure 1(c), thinking has the potential to compress long contexts by focusing on key information, skipping redundancies, revisiting important evidence, and linking scattered facts into a compact trace for downstream generation. To validate this paradigm, we conduct a pilot study in Section 2.2 demonstrating that raw thinking traces can indeed serve as effective compressed contexts, already outperforming many representative compression methods.
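The prompt-only TaC flow described above can be sketched in a few lines: a thinking model reads the long context, its thinking trace is extracted, and a downstream model answers from the trace alone. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the prompt wording, and the `<think>...</think>` delimiters are all hypothetical; the models are passed in as plain callables mapping a prompt string to an output string so that no particular inference library is assumed.

```python
from typing import Callable, Tuple

# Assumption: a "model" is any callable that maps a prompt string to an output string.
Model = Callable[[str], str]


def split_thinking(raw_output: str,
                   open_tag: str = "<think>",
                   close_tag: str = "</think>") -> Tuple[str, str]:
    """Split raw model output into (thinking trace, final answer).

    Assumption: the trace is wrapped in <think>...</think>. If the tags are
    missing, the trace is empty and the whole output is treated as the answer.
    """
    start = raw_output.find(open_tag)
    end = raw_output.find(close_tag)
    if start == -1 or end == -1 or end < start:
        return "", raw_output.strip()
    trace = raw_output[start + len(open_tag):end].strip()
    answer = raw_output[end + len(close_tag):].strip()
    return trace, answer


def tac_compress(context: str, question: str, thinking_model: Model) -> str:
    """Prompt the thinking model on the full context and keep only its trace."""
    prompt = f"Context:\n{context}\n\nQuestion: {question}\n"
    trace, _ = split_thinking(thinking_model(prompt))
    return trace


def answer_with_compressed(trace: str, question: str, reader_model: Model) -> str:
    """Answer the question using the trace in place of the original context."""
    prompt = f"Context:\n{trace}\n\nQuestion: {question}\nAnswer:"
    return reader_model(prompt).strip()
```

Because the compressed context is ordinary text, the reader model in this sketch need not be the same model that produced the trace.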

In this work, we explore context compression through the intrinsic thinking ability of modern reasoning LLMs. We propose Thinking as Compression (TaC), which treats thinking traces themselves as compressed contexts, avoiding dedicated compression modules. While prompt-only TaC already shows strong potential, raw thinking traces may suffer from poor budget control and shortcut behaviors. To address this, we introduce TaC-C, a simple reward-driven framework that optimizes thinking traces to be compact, controllable, and useful for downstream generation. Experiments show that TaC-C consistently outperforms existing compression baselines under different compression ratios and produces reusable compressed contexts that transfer across downstream models.
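For readers who want to reproduce the kind of numbers reported here, the sketch below shows how a compression ratio, a token budget, and answer-level EM and F1 could be computed. It assumes the common SQuAD-style convention for QA scoring (lowercasing, stripping punctuation and English articles, then comparing token bags); the text above does not state the exact normalization or tokenizer used, so both are assumptions, and the helper names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def compression_ratio(original_tokens: int, compressed_tokens: int) -> float:
    """Ratio of original to compressed length, e.g. 8000 -> 1000 tokens is 8x."""
    if compressed_tokens <= 0:
        raise ValueError("compressed_tokens must be positive")
    return original_tokens / compressed_tokens


def token_budget(original_tokens: int, ratio: int) -> int:
    """Maximum compressed length allowed at a target ratio (at least 1 token)."""
    return max(1, original_tokens // ratio)
```

Under these assumptions, a 4x target on an 8,000-token context corresponds to a budget of 2,000 tokens for the thinking trace, and an 8x target to 1,000 tokens.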