SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF

Paper · arXiv 2310.05344 · Published October 9, 2023
Reinforcement Learning

Reward models in the RLHF stage commonly rely on single-dimensional feedback rather than explicit, multifaceted signals that indicate attributes such as helpfulness, humor, and toxicity. To address these limitations, we propose STEERLM, a supervised fine-tuning method that empowers end users to control responses during inference. STEERLM conditions responses on an explicitly defined, multi-dimensional set of attributes, thereby enabling a steerable AI capable of generating helpful and high-quality responses while remaining customizable.
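The conditioning idea can be illustrated with a minimal sketch: at inference time, the desired attribute values are written into the prompt so the finetuned model can condition its response on them. The function name, attribute names, and prompt template below are illustrative assumptions, not the exact format used in the paper.

```python
# Minimal sketch of attribute-conditioned prompting in the style of SteerLM.
# The template and attribute names here are hypothetical; the paper defines
# its own attribute set and formatting.

def build_steerable_prompt(user_message: str, attributes: dict) -> str:
    """Prepend an explicit attribute specification to the user message so a
    finetuned model can condition its response on the requested values."""
    attr_str = ",".join(f"{name}:{value}" for name, value in sorted(attributes.items()))
    return f"<attributes>{attr_str}</attributes>\nUser: {user_message}\nAssistant:"

prompt = build_steerable_prompt(
    "Tell me a joke about compilers.",
    {"helpfulness": 4, "humor": 4, "toxicity": 0},
)
print(prompt)
```

Changing the attribute values (e.g. setting `humor` to 0) steers the same model toward a different response style without any retraining, which is the customizability the method targets.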