Tuning Language Models by Proxy

Paper · Source
Training · Fine-Tuning · Domain Specialization

We introduce proxy-tuning, a lightweight decoding-time algorithm that operates on top of a black-box LM to achieve the same end as direct tuning, but by accessing only its predictions over the output vocabulary, not its parameters. Our method tunes a smaller LM, then applies the difference between the predictions of the small tuned and untuned LMs to shift the original predictions of the larger untuned model in the direction of tuning, while retaining the benefits of larger-scale pretraining.
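At each decoding step, this amounts to simple arithmetic on next-token logits: the base model's logits are shifted by the expert/anti-expert difference before sampling. Below is a minimal sketch, not the authors' code; the function names, placeholder logits, and the assumption that all three models share a tokenizer and vocabulary are ours.

```python
import torch
import torch.nn.functional as F

def proxy_tuned_step(base_logits, expert_logits, anti_expert_logits):
    """One proxy-tuning decoding step: shift the base model's next-token
    logits by the (expert - anti-expert) difference, then renormalize."""
    shifted = base_logits + (expert_logits - anti_expert_logits)
    return F.softmax(shifted, dim=-1)

# All three models score the same prefix at every step:
#   base        -> large untuned LM   (e.g. LLAMA2-13B)
#   expert      -> small tuned LM     (e.g. LLAMA2-7B-CHAT)
#   anti_expert -> small untuned LM   (e.g. LLAMA2-7B)
vocab_size = 32000
probs = proxy_tuned_step(
    torch.randn(vocab_size),  # placeholder logits; in practice from model forward passes
    torch.randn(vocab_size),
    torch.randn(vocab_size),
)
next_token = torch.multinomial(probs, num_samples=1)
```

Because only output-vocabulary predictions are needed, the large model can remain a black box: no gradients, weight access, or additional training of the large model are required.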

For instruction-tuning (§3), we contrast the predictions of LLAMA2-7B-CHAT and LLAMA2-7B for guidance. Remarkably, we find that proxy-tuning closes 91% of the performance gap between LLAMA2-13B and its directly tuned CHAT version, and 88% of the gap for the 70B model, when evaluated across knowledge, reasoning, and safety benchmarks. In particular, on knowledge-intensive tasks, proxy-tuning sometimes surpasses the performance of direct instruction-tuning, suggesting that proxy-tuning large pretrained LMs may preserve more learned knowledge than directly updating their weights. Proxy-tuning a larger model also consistently outperforms the small tuned expert, indicating that our method combines the benefits of tuning with larger pretraining scale.

For domain adaptation (§4), we apply proxy-tuning to adapt pretrained models to code. Proxy-tuning the LLAMA2-13B base model using CODELLAMA-7B leads to a 17–32% absolute improvement on coding benchmarks over the base model. Finally, we apply proxy-tuning to achieve task-specific finetuning for question-answering and math problems (§5). On average across the two tasks, proxy-tuning LLAMA2-70B leads to a 31% absolute improvement over the untuned 70B model, and 9% improvement over the tuned 7B task model. Moreover, we find that proxy-tuning can enable untuned models to follow the strict syntactic constraints of the problem at hand, which are learned only by the small expert.

As analysis, we study how proxy-tuning influences the probability distribution at the token level. Specifically, when used for instruction-tuning (§6.1), we find that proxy-tuning has the largest influence in promoting reasoning and stylistic tokens, consistent with other evidence that alignment mainly affects style rather than knowledge (Gudibande et al., 2023; Mitchell et al., 2024). While proxy-tuning does not require tuning any hyperparameters, we next show how a hyperparameter can optionally be introduced into the ensemble (§6.2). Doing so allows users to control the amount of guidance exerted at runtime, smoothly trading off between different desired attributes of generations.
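Concretely, the optional hyperparameter scales the expert/anti-expert difference before it is added to the base logits. A hedged sketch building on the earlier snippet; the name alpha and its default value are illustrative assumptions, not the paper's notation:

```python
def guided_proxy_logits(base_logits, expert_logits, anti_expert_logits, alpha=1.0):
    # alpha = 0 recovers the untuned base model, alpha = 1 is plain proxy-tuning,
    # and larger values push generations further toward the tuned expert's behavior.
    return base_logits + alpha * (expert_logits - anti_expert_logits)
```

Adjusting alpha at decoding time lets users trade off, for example, informativeness against safety, without retraining any of the models involved.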