Divide-or-Conquer? Which Part Should You Distill Your LLM?

Paper · arXiv 2402.15000 · Published February 22, 2024

We devise a similar strategy that breaks down reasoning tasks into a problem-decomposition phase and a problem-solving phase, and show that this strategy outperforms a single-stage solution. Further, we hypothesize that decomposition should be easier to distill into a smaller model than problem solving, because the latter requires large amounts of domain knowledge while the former only requires learning general problem-solving strategies. We propose methods to distill these two capabilities and evaluate their impact on reasoning outcomes and inference cost.

However, the use of gigantic LLMs with hundreds of billions of parameters can be costly during inference, particularly when the generated reasoning chain is lengthy. Additionally, due to the opaque nature of these black-box LLMs, they offer limited adaptation options. There is a need for cheaper and more flexible models that leverage the power of these black-box LLMs while enabling local adaptation and cost-efficient inference. Distilling the large LLMs would seem like a reasonable strategy, but it often results in a significant drop in performance on downstream tasks (Chiang et al., 2023).

Effectively addressing such tasks requires the model to perform two essential capabilities simultaneously: 1) planning and decomposition, which involves breaking down complex objectives into smaller, more manageable subgoals to facilitate efficient handling of intricate tasks; and 2) execution and solving, which involves memorizing vast amounts of knowledge from extensive web training data and effectively recalling this information when needed to carry out the problem-solving process. The first capability, decomposition, typically requires the model to engage in self-reflection on the input query and generate a Chain-of-Thought (CoT)-style reasoning chain (Wei et al., 2022) to tackle the problem.

  1. Is decomposition capability more generalizable than solving capability? We hypothesize that decomposition can sometimes be abstracted into symbolic principles, making it more universally applicable across tasks, datasets, and models. This enables tasks and models to share a common decomposition engine and benefit from each other’s power, reducing the effort and costs involved in distilling a model for each individual task.

A natural question arises: is it possible to distill only the decomposition capability, which drives the long reasoning chain that accounts for most of the inference cost but may be relatively easier to distill? To this end, we propose and evaluate distilling only the decomposition capability from the LLM.
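To make the proposal concrete, the sketch below shows one way distillation data for a decomposer could be assembled: a teacher LLM produces only subquestions (never final answers), and each (question, subquestions) pair becomes a training example for a small student model. `teacher_decompose` and the field names are illustrative placeholders, not the paper's actual implementation.

```python
import json

def teacher_decompose(question: str) -> list[str]:
    # Placeholder for a call to the teacher LLM with the decomposition
    # instruction; a real call would return question-specific subquestions.
    return ["What is X?", "Given X, what is Y?"]

def build_distillation_set(questions: list[str]) -> list[dict]:
    """Turn teacher decompositions into (input, target) fine-tuning pairs."""
    examples = []
    for q in questions:
        subs = teacher_decompose(q)
        examples.append({
            "input": q,
            # The student learns to emit subquestions one per line, or the
            # literal string "No decomposition" when none are needed.
            "target": "\n".join(subs) if subs else "No decomposition",
        })
    return examples

data = build_distillation_set(["Who lived longer, A or B?"])
print(json.dumps(data[0], indent=2))
```

Because the target contains only subquestions, the student never needs the domain knowledge required for solving, which is the asymmetry the hypothesis above relies on.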

We illustrate that the distilled query decomposition model exhibits good generalization across tasks, datasets, and models. However, the distilled solving ability does not generalize well.

Interactive vs. static process. Note that an interactive, dynamic process could be beneficial for certain reasoning tasks. In our experiments on math and QA datasets, the decomposition and solving stages are largely independent, so we did not observe gains from switching to an interactive process. Our primary focus is understanding the impact of distilling task decomposition and solving capabilities, rather than finding the optimal framework, and a static approach gives a cleaner separation between decomposition and solving. The distilled decomposer could also be integrated into more dynamic reasoning processes, enabling iterative solving and refinement based on intermediate outputs.
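The static pipeline described above can be sketched as two independent calls: a (distilled) decomposer proposes subquestions, then a solver receives the question together with all subquestions in a single prompt. `decompose` and `solve` below are hypothetical stand-ins for the two model calls, not the authors' code.

```python
def decompose(question: str) -> list[str]:
    # Stand-in for the distilled decomposition model; a real model would
    # generate subquestions conditioned on the question.
    return [
        "How many eggs are laid per day?",
        "How many eggs are used or sold?",
    ]

def solve(question: str, subquestions: list[str]) -> str:
    # Stand-in for the solver LLM: in the static setting it sees the
    # original question plus the full decomposition at once.
    prompt = question + "\n" + "\n".join(f"- {s}" for s in subquestions)
    return f"<answer derived from {len(subquestions)} subquestions>"

def two_stage_answer(question: str) -> str:
    subs = decompose(question)      # stage 1: decomposition
    if not subs:                    # the "No decomposition" case
        subs = [question]
    return solve(question, subs)    # stage 2: solving

print(two_stage_answer("How many eggs does Janet sell daily?"))
```

An interactive variant would instead alternate the two calls, feeding each intermediate answer back to the decomposer; the static version keeps the stages cleanly separated, as the paragraph above argues.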

Instruction for decomposition: I_decomp

Your task is to break down a given complex question into the most relevant and helpful subquestions, ensuring that no more than three subquestions are formulated for each question. Both the context and the main question will be provided to you. If the question does not need breaking down to be answered, return "No decomposition"; otherwise, list the necessary subquestions. Only return subquestions that directly aid in answering the original question, avoiding any that could be harmful or unhelpful. Question: Q
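In practice, an instruction like I_decomp is a fixed template filled in at inference time. The sketch below assembles such a prompt from the instruction text above; the template constant and helper name are our own illustrative choices.

```python
# Prompt template paraphrasing the I_decomp instruction; {context} and
# {question} are filled in per query.
I_DECOMP = (
    "Your task is to break down a given complex question into the most "
    "relevant and helpful subquestions, ensuring that no more than three "
    "subquestions are formulated for each question. Both the context and "
    "the main question will be provided to you. If the question does not "
    "need breaking down to be answered, return \"No decomposition\"; "
    "otherwise, list the necessary subquestions.\n"
    "Context: {context}\nQuestion: {question}"
)

def build_decomposition_prompt(context: str, question: str) -> str:
    """Fill the decomposition template with a specific context and question."""
    return I_DECOMP.format(context=context, question=question)

print(build_decomposition_prompt(
    "Janet's ducks lay 16 eggs per day.",
    "How many eggs does Janet sell daily?",
))
```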

1) Errors in intermediate steps can influence subsequent steps and affect the final outcomes. 2) The cost of the dynamic pipeline is markedly higher than that of the static pipeline.