Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models

Paper · arXiv 2412.02674 · Published December 3, 2024
Self Refinement · Self Consistency · Feedback · Evolution

Self-improvement is a mechanism used in Large Language Model (LLM) pre-training, post-training, and test-time inference. We explore a framework in which the model verifies its own outputs, filters or reweights data based on this verification, and distills the filtered data. Despite several empirical successes, a fundamental understanding is still lacking. In this work, we initiate a comprehensive, modular, and controlled study of LLM self-improvement. We provide a mathematical formulation of self-improvement, which is largely governed by a quantity we formalize as the generation-verification gap. Through experiments with various model families and tasks, we discover a scaling phenomenon of self-improvement: a variant of the generation-verification gap scales monotonically with model pre-training FLOPs.
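To make the framework concrete, here is a minimal sketch of one verify-filter-distill round. The `generate`, `verify`, and `distill` interfaces are illustrative placeholders of our own, not the exact procedure studied in the paper:

```python
from typing import Callable, List, Tuple

def self_improvement_round(
    prompts: List[str],
    generate: Callable[[str], List[str]],             # sample candidate outputs
    verify: Callable[[str, str], float],              # model's self-verification score
    distill: Callable[[List[Tuple[str, str]]], None], # fine-tune on (prompt, output) pairs
    threshold: float = 0.5,
) -> None:
    # Collect (prompt, generation) pairs that the model itself verifies as correct.
    filtered: List[Tuple[str, str]] = []
    for x in prompts:
        for y in generate(x):
            if verify(x, y) >= threshold:  # keep only self-verified generations
                filtered.append((x, y))
    # Distill the verifier-filtered data back into the model.
    distill(filtered)
```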

4.3 Unimprovable Tasks

The premise of self-improvement rests on the assumption that “verification is easier than generation”. It is therefore worthwhile to consider tasks where this intuition does not hold. One such scenario involves factual tasks that require generating a factually correct answer to a trivia question. We hypothesize that the ability to generate a correct answer depends solely on whether the model was trained on the relevant factual knowledge, so verification provides little additional signal. To test this, we measure gap(f) on the Natural Questions dataset (Kwiatkowski et al., 2019), where u(x, y) = 1 if y is one of the candidate answers to the question x, and u(x, y) = 0 otherwise. Our analysis on a test subset of 3610 questions, presented in Table 1, reveals that although all models achieve non-trivial generation accuracy, the gap remains smaller than 1%, or is even negative, across all models. This suggests that certain tasks may not benefit from the current self-improvement framework.
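For concreteness, below is a minimal sketch of this exact-match utility, together with one plausible way to estimate the gap, here taken as the accuracy of verifier-selected generations minus raw generation accuracy. The paper's formal definition of gap(f) may differ, and `generate`/`verify` are hypothetical interfaces:

```python
from statistics import mean

def u(y: str, candidates: list[str]) -> int:
    # Instantiates u(x, y) given the candidate answers for x:
    # 1 iff the generation exactly matches one candidate answer.
    norm = y.strip().lower()
    return int(any(norm == c.strip().lower() for c in candidates))

def estimate_gap(examples, generate, verify) -> float:
    # examples: iterable of (question, candidate_answers) pairs.
    # generate(x) -> list of sampled answers; verify(x, y) -> score.
    raw_acc, filtered_acc = [], []
    for x, candidates in examples:
        ys = generate(x)
        raw_acc.append(mean(u(y, candidates) for y in ys))
        best = max(ys, key=lambda y: verify(x, y))  # verification-based selection
        filtered_acc.append(u(best, candidates))
    return mean(filtered_acc) - mean(raw_acc)
```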

4.4 Sudoku

Generalized Sudoku is a canonical example where generation (NP-hard) is harder than verification (in P) (Haythorpe, 2016). We consider 4×4 Sudoku puzzles, each with a unique solution, 288 puzzles in total. We task the models to use CoT reasoning for both generation and verification. The results, presented in Table 2 and detailed further in Appendix C.3, reveal a surprising pattern: only the largest models, such as Qwen-1.5/2 72B and Llama 3.1 70B, exhibit non-trivial gaps. For these models, the improvement is substantially larger (a 50%–300% relative improvement in accuracy) than on the math task.

A similar requirement, verifying solutions via CoT reasoning, applies to mathematical tasks; however, most models have likely been exposed to math verification during pre-training, unlike Sudoku verification. Consequently, smaller models may lack the requisite reasoning capabilities to improve on Sudoku tasks. Although our analysis is primarily post-hoc, an interesting avenue for future research would be to develop a metric that predicts a model's “self-improvability” on specific tasks.
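The generation-verification asymmetry is easy to see in code: checking a completed 4×4 grid takes a handful of set comparisons, whereas producing a solution requires search. Note that in our experiments the models verify via CoT reasoning rather than by executing code; the sketch below is purely illustrative:

```python
from itertools import chain

def verify_sudoku_4x4(grid: list[list[int]]) -> bool:
    # Polynomial-time check that a filled 4x4 grid is a valid Sudoku solution:
    # every row, column, and 2x2 box must contain exactly {1, 2, 3, 4}.
    target = {1, 2, 3, 4}
    rows = grid
    cols = [list(col) for col in zip(*grid)]
    boxes = [
        list(chain.from_iterable(row[c:c + 2] for row in grid[r:r + 2]))
        for r in (0, 2) for c in (0, 2)
    ]
    return all(set(unit) == target for unit in rows + cols + boxes)

# Example: a valid 4x4 solution passes the check.
grid = [[1, 2, 3, 4],
        [3, 4, 1, 2],
        [2, 1, 4, 3],
        [4, 3, 2, 1]]
assert verify_sudoku_4x4(grid)
```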

We also examine the “effective diversity” of generations throughout the iterative self-improvement process using the metric pass@k.6 We present the results in Figure 4. We observe when k is small, pass@k increases with the number of rounds of self-improvement, validating the success of the self-improvement process. However, when k is large, pass@k decreases with the number of iterations, indicating that the diversity of the generations is reduced through the self-improvement process. This trend may result from the model’s inability to verify rare, yet correct, answers, potentially leading to convergence on incorrect solutions during the self-improvement process.
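For reference, pass@k can be computed with the standard unbiased estimator: given n samples per problem of which c are correct, the probability that at least one of k drawn samples is correct is 1 − C(n−c, k)/C(n, k). A minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of pass@k from n samples with c correct:
    # 1 minus the probability that all k drawn samples are incorrect.
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```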

LLM-as-a-Judge. LLM-as-a-judge refers to using an LLM to verify the generations of another (or the same) LLM (Chiang et al., 2023; Zheng et al., 2023; Bubeck et al., 2023; Chiang and Lee, 2023; Zhou et al., 2024). Recently, the same idea has been applied to training generative reward models (Ankner et al., 2024; Zhang et al., 2024b). Having a model that can verify its own generations is a key component of self-improvement, and in this work, we perform a fine-grained study of various types of LLM verification mechanisms.

Our results reveal several intriguing properties, such as the scaling behavior of the relative gap, the saturation of iterative self-improvement, and the enhancement of verification via ensemble methods. These insights are likely to have practical implications for improving pre-training, post-training, and test-time inference. Additionally, our research opens several promising avenues for future exploration:

• While our scaling analysis is primarily observational (Ruan et al., 2024), pursuing a more extensive scaling law study (Kaplan et al., 2020) based on our preliminary findings could provide robust empirical guidelines.

• Our results hint that an inference-time scaling law (Wu et al., 2024) may be possible for self-improvement (or for cross-improvement; cf. Section 4.2). Identifying compute-optimal methods for self-improvement across different tasks remains a critical challenge.

• The decline in the effective diversity of generations during iterative self-improvement presents a significant obstacle. Developing strategies to mitigate this issue could offer considerable empirical benefit.

• The distinct non-overlap property of verification mechanisms, despite their functional similarities, suggests that composing complementary verification mechanisms could significantly enhance self-improvement. Exploring this potential further could yield fruitful results.