Training Dialogue Systems by AI Feedback for Improving Overall Dialogue Impression
When training dialogue systems with AI feedback, it is important to evaluate not only each response but also the user's overall impression of the dialogue. For example, improving the consistency, personality, and empathy of a dialogue system's responses improves the user's dialogue experience. The reward model must therefore evaluate the overall dialogue impression using the dialogue context up to that point.
This study investigates how best to build reward models that evaluate the overall dialogue impression and how to tune dialogue models with them. We compare reward models based on prompting with those based on supervised fine-tuning (SFT) using a small dialogue dataset annotated with overall dialogue impression scores. Specifically, using the prepared dialogue evaluation data, we train a single regression model that predicts scores for 12 different overall dialogue impression metrics. We then tune the dialogue models to improve the outputs of these reward models. Our experimental results show that tuning dialogue models with such reward models for overall dialogue impressions achieves the best performance.
Agency: I felt that the system was speaking from its perspective
Attentiveness: The system was interested in me and was actively trying to talk with me
Consistency: The system's utterances were consistent and coherent
Ease: Continuing the dialogue was easy
Empathetic: I was able to empathize with the system's utterances
Emotion: I felt that the system had feelings
Enjoyability: I enjoyed interacting with the system
Humanness: The system's utterances were humanlike and natural
Personality: I could sense the system's personality and character
Respeak: I want to talk with this system again
Topic: I felt that the system had a topic it wanted to discuss
Trust: I felt that what the system said was trustworthy
The reward model evaluates the overall dialogue impression on a scale of 0-10. It takes the dialogue context $C_i$ and its response $R_i$ as input and is trained to predict the evaluation score $S_{i,E_j}$ corresponding to metric $E_j$. Here, $i$ is the index identifying each dialogue sample, and $j$ is the index identifying each evaluation metric. We fine-tune an LLM to regress the score: a linear layer is added on top of the final layer of the original model so that it is trained as a regression model rather than a generative model. We train a single model to evaluate all 12 metrics. In training, the mean squared error was used as the loss function, the number of epochs was set to 10, and the per-device batch size was 16 across eight devices, giving an effective batch size of 128.
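The following is a minimal sketch of this setup using the Hugging Face Transformers API, not the authors' actual code. The backbone model name, the dataset field names, and the choice of a single 12-dimensional regression head (one output per metric) are assumptions made for illustration.

```python
# Sketch: an LLM backbone with a linear regression head over the 12
# impression scores, trained with MSE loss (assumed setup).
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"   # hypothetical 7B backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# num_labels=12 adds a linear layer on top of the final hidden state;
# problem_type="regression" makes Trainer use MSELoss.
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=12, problem_type="regression")
model.config.pad_token_id = tokenizer.pad_token_id

def preprocess(example):
    # Concatenate dialogue context C_i and response R_i as the model input;
    # "labels" holds the 12 annotated impression scores (0-10).
    text = example["context"] + tokenizer.eos_token + example["response"]
    enc = tokenizer(text, truncation=True, max_length=1024)
    enc["labels"] = [float(s) for s in example["scores"]]  # 12 values
    return enc

args = TrainingArguments(
    output_dir="reward_model",
    num_train_epochs=10,             # as described in the text
    per_device_train_batch_size=16,  # 16 x 8 devices = 128 effective
)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=annotated_dialogues.map(preprocess))
# trainer.train()
```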
These results confirmed that it is difficult to evaluate the overall dialogue impression with an LLM that relies on prompting alone, and that SFT for each evaluation metric is necessary. The model trained by 7B SFT is used as the reward model for tuning the dialogue model in the following sections.
In the DPO pre-processing, the dialogue model generates two candidate responses for each dialogue history, and the reward model evaluates both responses given that history. The response with the higher evaluation score is used for training as the accepted (chosen) response, and the one with the lower score as the rejected response.
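A minimal sketch of this pre-processing step follows. The function and method names (e.g., generate, score) and the aggregation of the 12 metric scores into a single value are assumptions, not the authors' implementation.

```python
# Sketch: build one DPO preference pair from a dialogue history by sampling
# two responses, scoring both with the reward model, and labeling the
# higher-scoring one as "chosen" and the other as "rejected".
def build_preference_pair(dialogue_history, policy_model, reward_model):
    # Sample two candidate responses from the current dialogue model
    # (hypothetical generate() interface).
    resp_a = policy_model.generate(dialogue_history, do_sample=True)
    resp_b = policy_model.generate(dialogue_history, do_sample=True)

    # Score each response on overall dialogue impression; here assumed to be
    # a single scalar (e.g., an average over the 12 metrics).
    score_a = reward_model.score(dialogue_history, resp_a)
    score_b = reward_model.score(dialogue_history, resp_b)

    if score_a >= score_b:
        chosen, rejected = resp_a, resp_b
    else:
        chosen, rejected = resp_b, resp_a

    # One training example in the format expected by common DPO trainers.
    return {"prompt": dialogue_history, "chosen": chosen, "rejected": rejected}
```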
However, some issues were also identified: the evaluation model gives high ratings to natural responses even when they are inherently undesirable (e.g., dull responses), and countermeasures are needed to address these issues.