Efficient Reasoning with Hidden Thinking
Chain-of-Thought (CoT) reasoning has become a powerful framework for improving complex problem-solving capabilities in Multimodal Large Language Models (MLLMs). However, the verbose nature of textual reasoning introduces significant inefficiencies. In this work, we propose Heima (as in hidden llama), an efficient reasoning framework that performs CoT reasoning in a hidden latent space. We design the Heima Encoder to condense each intermediate CoT into a compact, higher-level hidden representation using a single thinking token, effectively minimizing verbosity and reducing the overall number of tokens required during the reasoning process. Meanwhile, we design the corresponding Heima Decoder, built on traditional Large Language Models (LLMs), to adaptively interpret the hidden representations into variable-length textual sequences, reconstructing reasoning processes that closely resemble the original CoTs.
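The token saving from condensing each intermediate CoT into a single thinking token can be illustrated with simple arithmetic. The following toy sketch is not the authors' implementation; the stage lengths and helper names are hypothetical, chosen only to show how the generated-token count scales with the number of reasoning stages rather than their textual length.

```python
# Toy sketch (hypothetical numbers, not the Heima implementation):
# compare token counts for textual CoT vs. one thinking token per stage.

def tokens_textual_cot(question_len, cot_lens, answer_len):
    """Verbose reasoning: every CoT stage is emitted in full."""
    return question_len + sum(cot_lens) + answer_len

def tokens_hidden_cot(question_len, cot_lens, answer_len):
    """Hidden reasoning: each CoT stage contributes a single thinking token."""
    return question_len + len(cot_lens) + answer_len

# Hypothetical problem with three reasoning stages of 120, 80, and 150 tokens.
cot_lens = [120, 80, 150]
verbose = tokens_textual_cot(30, cot_lens, 20)  # 30 + 350 + 20 = 400
hidden = tokens_hidden_cot(30, cot_lens, 20)    # 30 + 3 + 20 = 53
print(verbose, hidden)
```

Under these assumed lengths, the generated sequence shrinks from 400 to 53 tokens; the saving grows with the verbosity of each stage, since the hidden variant pays one token per stage regardless of stage length.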
CoT reasoning enables MLLMs to generate intermediate steps that mirror human-like problem-solving processes, breaking down complex tasks into smaller, sequentially manageable components before arriving at the final solution. This approach not only enhances interpretability but also enables more effective multi-step reasoning, equipping MLLMs to address tasks that demand intricate logical understanding and contextual coherence, especially when processing the inherent complexity of visual information.
However, CoT reasoning often requires generating a substantial amount of additional text, particularly for complex problems. This increased verbosity significantly impacts the efficiency of problem-solving, especially in large models with massive parameter counts and high inference costs. Thus, it becomes crucial to reduce the number of tokens generated during CoT reasoning to enhance the efficiency of MLLMs without compromising their reasoning capabilities.
It is also crucial to verify the effectiveness of the hidden representations encapsulated within thinking tokens, to ensure that the model is genuinely learning hidden reasoning processes rather than merely fitting the data. Thus, to interpret the thinking tokens, we design the adaptive Heima Decoder, which exploits the standard next-token prediction paradigm in LLMs to reconstruct variable-length (i.e., adaptive) textual sequences from the thinking tokens.
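The adaptive decoding step can be sketched as an ordinary autoregressive generation loop that is conditioned on a thinking token and runs until an end-of-sequence event, so the reconstructed text naturally has variable length. The sketch below is a toy stand-in, not the Heima Decoder: the transition table plays the role of an LLM's next-token head, and returning no continuation stands in for generating EOS.

```python
# Toy stand-in for adaptive decoding from a thinking token (hypothetical,
# not the Heima Decoder): a lookup table acts as the next-token predictor.

def toy_next_token(context):
    """Hypothetical transition table standing in for an LLM's next-token head."""
    table = {
        ("<think_1>",): "First,",
        ("<think_1>", "First,"): "add",
        ("<think_1>", "First,", "add"): "the",
        ("<think_1>", "First,", "add", "the"): "numbers.",
    }
    return table.get(tuple(context))  # None stands in for generating EOS

def decode(thinking_token, max_len=16):
    """Greedy next-token prediction conditioned on a single thinking token."""
    out = [thinking_token]
    for _ in range(max_len):
        nxt = toy_next_token(out)
        if nxt is None:  # EOS reached: decoded length is adaptive
            break
        out.append(nxt)
    return out[1:]  # the reconstructed textual reasoning sequence

print(decode("<think_1>"))  # → ['First,', 'add', 'the', 'numbers.']
```

The key property the sketch shows is that the loop, not a fixed budget, determines the output length: a thinking token encoding a short rationale terminates early, while one encoding a longer rationale yields more text.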