Building Persona-Consistent Dialogue Agents with Offline Reinforcement Learning
Maintaining a consistent persona is a key quality for any open-domain dialogue system. Current state-of-the-art systems achieve this by training agents with supervised learning or online reinforcement learning (RL). However, systems trained with supervised learning often lack consistency, as they are never punished for uttering contradictions. Additional training with RL can alleviate some of these issues; however, the training process is expensive. Instead, we propose an offline RL framework for improving the persona consistency of dialogue systems. Our framework combines the advantages of previous methods: we can inexpensively train our model on existing data, as in supervised learning, while punishing and rewarding specific utterances, as in RL. We also introduce a simple importance sampling method, Variance-Reducing MLE-Initialized (VaRMI) importance sampling, to reduce the variance of importance weights in offline RL training. Our automatic and human evaluations show that our framework improves both the persona consistency and the dialogue quality of a state-of-the-art social chatbot.
Some work has attempted to resolve the problems with supervised learning through the use of online RL (Song et al., 2019b; Liu et al., 2020). However, online RL training is expensive, since the dialogue model must continuously generate new training samples. Furthermore, online RL methods require accurate critics to evaluate the generated bot utterances. These critics must incentivize persona consistency while also enforcing strong constraints on dialogue fluency; without such constraints, the model degenerates (Verma et al., 2022; Song et al., 2019b). This requires training multiple, separate critic models or employing human critics during training, both of which are expensive. Given these challenges, we propose an offline RL framework to improve the persona consistency of open-domain dialogue systems (Figure 1). Offline RL has several advantages over existing training methods. Unlike supervised learning, offline RL explicitly punishes contradictory utterances during training, which improves persona consistency by making the bot more sensitive to contradictions.
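To make the contrast concrete, the sketch below shows what a classifier-based critic of the kind used in online RL might look like: an NLI model scores persona consistency while a fixed language model penalizes disfluency. This is a minimal illustration, not any cited system's exact setup; the model choices and the weighting scheme are assumptions.

```python
# Hedged sketch of a classifier-based online-RL critic: persona consistency
# from an NLI model, minus a fluency penalty from a fixed LM. Model names
# and the combination below are illustrative assumptions.
import torch
from transformers import pipeline, GPT2LMHeadModel, GPT2TokenizerFast

nli = pipeline("text-classification", model="roberta-large-mnli")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
lm_tok = GPT2TokenizerFast.from_pretrained("gpt2")

def critic_reward(persona: str, utterance: str, fluency_weight: float = 0.5) -> float:
    """Score a generated utterance for persona consistency, penalized by disfluency."""
    # Persona consistency: entailment minus contradiction probability,
    # treating the persona as premise and the utterance as hypothesis.
    scores = nli({"text": persona, "text_pair": utterance}, top_k=None)
    entail = next(d["score"] for d in scores if d["label"] == "ENTAILMENT")
    contra = next(d["score"] for d in scores if d["label"] == "CONTRADICTION")
    consistency = entail - contra  # in [-1, 1]

    # Fluency constraint: mean per-token negative log-likelihood under a
    # frozen LM; high NLL (disfluent text) lowers the reward.
    ids = lm_tok(utterance, return_tensors="pt").input_ids
    with torch.no_grad():
        nll = lm(ids, labels=ids).loss.item()
    return consistency - fluency_weight * nll
```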
Unlike online RL, offline RL does not require the dialogue model to generate new samples during training. Instead, we can inexpensively train our model on the large existing datasets that have been collected or synthesized for supervised learning. We exploit this pre-existing data to train our model on human-annotated reward labels instead of the classifier-based rewards that are common in online RL. Training on human-annotated rewards also reduces the chance of training failures due to policy divergence, which can arise in settings where value function approximation is needed to determine Q-values and may require the use of behavior regularization (van Hasselt et al., 2018; Wu et al., 2019).
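The following minimal sketch shows how such human annotations can be turned into scalar rewards for offline training. The label names and reward values are assumptions chosen for illustration; the exact annotation scheme is described later in the paper.

```python
# Hedged sketch: converting human annotation labels on existing dialogue
# data into (context, utterance, reward) examples for offline RL.
# Label names and reward magnitudes are illustrative assumptions.
LABEL_TO_REWARD = {
    "entails_persona": 1.0,       # utterance is consistent with the persona
    "neutral": 0.0,               # utterance carries no persona information
    "contradicts_persona": -1.0,  # utterance contradicts the persona
}

def build_offline_dataset(annotated_turns):
    """Map (context, utterance, human_label) triples to reward-labeled
    training examples, with no model in the loop at data-creation time."""
    return [
        (ctx, utt, LABEL_TO_REWARD[label])
        for ctx, utt, label in annotated_turns
    ]
```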
Despite these advantages, offline RL training can suffer from high variance due to the need for importance sampling. To alleviate this, we introduce VaRMI, an importance sampling method that reduces the variance of importance weights. The method is not specific to our task and can be applied in any setting that uses policy-gradient offline RL training.
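For intuition, the sketch below shows one policy-gradient step of offline RL with importance sampling on logged data. The `policy(ctx, utt)` interface, the per-token weight approximation, and the VaRMI-style variant (setting the weights of positively-rewarded samples to one on the grounds that MLE initialization already fits that data) are illustrative assumptions; the precise formulation is given in the method section.

```python
# Hedged sketch: importance-weighted policy-gradient (REINFORCE-style) loss
# for offline RL. The interfaces and the VaRMI-style weight choice are
# assumptions for illustration, not the definitive implementation.
import torch

def offline_pg_loss(policy, contexts, utterances, rewards, use_varmi=True):
    """Loss over logged (context, utterance, reward) data. `policy(ctx, utt)`
    is assumed to return per-token log-probs of `utt` given `ctx`."""
    losses = []
    for ctx, utt, r in zip(contexts, utterances, rewards):
        logp = policy(ctx, utt)      # shape: (num_tokens,)
        # Behavior-policy probabilities are unknown for logged data, so the
        # per-token importance weight is approximated from the policy itself.
        w = logp.detach().exp()      # weights in (0, 1]; a high-variance term
        if use_varmi and r > 0:
            # VaRMI-style assumption: the MLE-initialized policy already
            # covers positively-rewarded utterances, so fix those weights
            # to one instead of letting them shrink the gradient.
            w = torch.ones_like(w)
        losses.append(-(w * r * logp).sum())
    return torch.stack(losses).mean()
```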
Prior work has explored the application of offline RL to task-oriented dialogue (Verma et al., 2022; Snell et al., 2022; Jang et al., 2022). Task-oriented dialogue is a natural fit for offline RL, since crafting a reward function is straightforward. Applying offline RL to social dialogue is less straightforward, as there is no obvious reward signal for the policy. We exploit the fact that persona consistency is a key component of open-domain dialogue. Intuitively this makes sense, as humans naturally speak with a persona during a conversation, and prior studies have shown that improving persona consistency also improves the quality of social dialogue (Zhang et al., 2018; Roller et al., 2020).