Inverse-Q*: Token Level Reinforcement Learning for Aligning Large Language Models Without Preference Data
In this paper, we introduce Inverse-Q*, an innovative framework that transcends traditional RL methods by optimizing token-level reinforcement learning without the need for additional reward or value models. Inverse-Q* leverages direct preference optimization techniques but extends them by estimating the conditionally optimal policy directly from the model’s responses, facilitating more granular and flexible policy shaping. Our approach reduces reliance on human annotation and external supervision, making it especially suitable for low-resource settings. We present extensive experimental results demonstrating that Inverse-Q* not only matches but potentially exceeds the effectiveness of PPO in terms of convergence speed and the alignment of model responses with human preferences.
DPO optimizes the preference objective directly through a reward-model-style loss that widens the probability margin between the preferred and dispreferred responses of each pair. Similar methods such as RSO (Tripathi and Singh, 2020), ReST (Gulcehre et al., 2023), and ReST-EM (Singh et al., 2023) train the policy to fit an optimal prior distribution over a predefined set of responses, avoiding the need for a critic model. However, these methods still require additional supervisory signals, such as a reward model, to improve response quality, which introduces trade-offs between labeling cost and accuracy.
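For concreteness, the standard sequence-level DPO objective that these methods build on can be written as

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
\]

where \(y_w\) and \(y_l\) denote the preferred and dispreferred responses, \(\pi_{\mathrm{ref}}\) is the reference policy, \(\sigma\) is the sigmoid function, and \(\beta\) controls the strength of the implicit KL constraint. The loss depends only on sequence-level log-probability ratios, which motivates the observation below.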
A crucial observation is that direct optimization methods still require the logits of entire response sequences to construct the loss function, owing to the need for differentiability in back-propagation. Because they lack a corresponding model of the advantage function, such constructions do not naturally generalize to token-level process supervision. Based on this insight, we ask: is there a special trajectory estimate whose feedback signal naturally generalizes to dense reward modeling within the token-level MDP, thereby automatically yielding an advantage-function interpretation for each token?
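To make the first observation concrete, the following minimal PyTorch-style sketch shows how a sequence-level objective such as DPO typically consumes response logits: per-token log-probabilities are gathered and then collapsed into a single masked sum per response. The function and tensor names here are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits, response_ids, response_mask):
    """Sum per-token log-probabilities over a response.

    logits:        (batch, seq_len, vocab) policy logits, assumed already
                   shifted so position t scores response_ids[:, t]
    response_ids:  (batch, seq_len) token ids of the sampled response
    response_mask: (batch, seq_len) 1 for response tokens, 0 for prompt/padding
    """
    logprobs = F.log_softmax(logits, dim=-1)
    token_logprobs = torch.gather(
        logprobs, dim=-1, index=response_ids.unsqueeze(-1)
    ).squeeze(-1)
    # Sequence-level losses consume only this masked sum; the per-token terms
    # are never weighted individually, so no token receives its own credit.
    return (token_logprobs * response_mask).sum(dim=-1)
```

Because only the aggregate log-probability enters the loss, gradients assign credit to individual tokens solely through this sum rather than through an explicit per-token advantage, which is the gap the question above targets.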