ARGS: Alignment as Reward-Guided Search

Paper · arXiv 2402.01694 · Published January 23, 2024
Assistants · Personalization · Reward Models · Alignment

We introduce ARGS (Alignment as Reward-Guided Search), a novel framework that integrates alignment into the decoding process, eliminating the need for expensive RL training. By adjusting the model's probabilistic predictions with a reward signal, ARGS generates texts with semantic diversity while remaining aligned with human preferences, offering a promising and flexible solution for aligning language models.

Unlike traditional alignment approaches, our method integrates alignment into the decoding process, enabling quick realignment without retraining the foundation model with PPO. This is especially valuable in today's rapidly changing field of machine learning, ensuring that models remain relevant and responsive to contemporary requirements without extensive overhauls. Specifically, at each decoding step, our key idea is to adjust the model's probabilistic prediction using a reward signal. This adjustment is crucial as it enables the generated text to both (1) maintain semantic relevance with respect to the previous context, and (2) align with the reward criteria and human preference. These two sub-goals can be flexibly traded off by weighting the reward signal appropriately; the method reduces to standard maximum-likelihood decoding when the weight is zero.
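
As a concrete illustration, the following is a minimal PyTorch sketch of one reward-guided decoding step, assuming a Hugging Face-style causal language model and a reward model that returns a scalar score for a token sequence; the names `args_decode_step`, `w`, and `k` are illustrative, not the paper's reference implementation.

```python
import torch

def args_decode_step(model, reward_model, input_ids, w=1.0, k=10):
    """One reward-guided decoding step (sketch).

    input_ids: tensor of shape (1, seq_len) holding the context so far.
    w: reward weight; w = 0 recovers standard maximum-likelihood decoding.
    k: number of top candidate tokens scored by the reward model.
    """
    with torch.no_grad():
        # Log-probabilities the language model assigns to the next token.
        logits = model(input_ids).logits[0, -1]
        log_probs = torch.log_softmax(logits, dim=-1)

        # Only the top-k candidates are scored, keeping reward calls cheap.
        topk_logp, topk_ids = torch.topk(log_probs, k)

        scores = torch.empty(k)
        for i, tok in enumerate(topk_ids):
            # Append the candidate token and evaluate the partial response.
            candidate = torch.cat([input_ids[0], tok.view(1)]).unsqueeze(0)
            reward = reward_model(candidate)  # assumed to return a scalar
            # Reward-guided score: LM log-probability plus weighted reward.
            scores[i] = topk_logp[i] + w * reward

    # Greedy token selection: return the highest-scoring candidate token.
    return topk_ids[torch.argmax(scores)]
```

Token selection over these adjusted scores can be greedy, as sketched above, or stochastic; the token selection methods are detailed in Section 2.2.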

Our method has two main components: (1) reward-guided scoring, which assigns scores to possible continuations of the text, and (2) token selection, which selects a continuation. We detail the reward-guided scoring method in Section 2.1 and the token selection methods in Section 2.2.

2.1 REWARD-GUIDED SCORING

Our goal is to steer the decoded outputs of language models into alignment with human preference. At each decoding step, our key idea is to adjust the model's probabilistic prediction by a reward signal (Figure 1). This adjustment is crucial as it enables the model to generate text that is not only coherent and contextually relevant but also tailored to satisfy specific alignment criteria or objectives. Specifically, a reward model (RM) assigns a scalar reward value to each response. Following Stiennon et al. (2020), reward models are often trained on a dataset of paired comparisons between two responses generated for the same input or prompt. Formally, the reward model $r$ is trained to minimize the pairwise ranking loss

$$\mathcal{L}_{\text{RM}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\big( r(x, y_w) - r(x, y_l) \big) \right],$$

where $\sigma$ is the sigmoid function, and $y_w$ and $y_l$ denote the preferred and dispreferred responses to prompt $x$ in the comparison dataset $\mathcal{D}$.
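
This pairwise loss translates directly into code. Below is a minimal PyTorch sketch, assuming a `reward_model` that maps a tokenized prompt-response sequence to a scalar reward; the helper name `reward_model_loss` is illustrative, not part of any released implementation.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    # Scalar rewards r(x, y_w) and r(x, y_l) for the preferred and
    # dispreferred responses (each input already contains the prompt x).
    r_chosen = reward_model(chosen_ids)
    r_rejected = reward_model(rejected_ids)
    # Negative log-sigmoid of the reward margin: minimizing this drives
    # the preferred response's reward above the dispreferred one's.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```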