Should We Fine-Tune or RAG? Evaluating Different Techniques to Adapt LLMs for Dialogue
Open-Domain Dialogue (ODD) Models fine-tuned for ODD tend to generate considerably less contextualized responses than models adapted using in-context learning. In particular, fine-tuning reduces contextualization by 40% for Llama2C and by 35% for MistralI. Similarly, fine-tuning reduces appropriateness by 30% compared to the in-context learning counterparts. This contrasts with the automatic evaluation (Table 2), where in-context learning obtained a higher perplexity (i.e., worse results) than fine-tuning.
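For reference, the perplexity used in the automatic evaluation is the exponential of the average negative log-likelihood the model assigns to the reference response tokens. A minimal sketch (the function name and inputs are illustrative, not from the paper):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the response tokens.

    `token_logprobs` holds natural-log probabilities, one per token of the
    reference response. Lower perplexity means the model found the
    reference less surprising.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# If every token gets probability 0.1, perplexity is exactly 10.
print(perplexity([math.log(0.1)] * 5))  # → 10.0
```

Note that a lower perplexity only indicates better next-token prediction of the references, which, as the human evaluation shows, need not translate into more contextualized or appropriate responses.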
Knowledge-Grounded Dialogue (KGD) For KGD, the results are model-dependent. With Llama2C, in-context learning yields 10% more contextualized responses than fine-tuning, regardless of the knowledge source. On the other hand, fine-tuning MistralI on Retrieved Knowledge leads to the highest contextualization (95%).
Task-Oriented Dialogue (TOD) When adapting Llama2C and MistralI to TOD, the results clearly show that fine-tuning is preferable to in-context learning. In particular, comparing the best model for each technique, fine-tuned Llama2C generates 20% more contextualized responses, and fine-tuned MistralI 15% more.
Our study highlights the limitations of currently available automatic metrics and the necessity of conducting human evaluations to advance human-machine dialogue research, as the judgments of human evaluators correlate poorly with automatic metrics. Furthermore, the human evaluations indicate that there is no universal best technique for adapting LLMs to a dialogue type: the performance of each technique depends on the base LLM as well as on the dialogue type. In addition, the correct incorporation of external knowledge depends on various factors, such as the retriever accuracy, the representation of the knowledge, and the presence of noisy (non-gold) documents, as the knowledge can be the least contributing element of the input according to explainability studies.
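The poor agreement between human judges and automatic metrics can be quantified with a rank correlation such as Spearman's rho. A minimal sketch, assuming tie-free scores (the example values are hypothetical, not data from the paper):

```python
def spearman(xs, ys):
    """Spearman rank correlation for tie-free data:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    where d_i is the difference between the ranks of xs[i] and ys[i]."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-system scores: human appropriateness vs. negated
# perplexity (negated so that higher is better for both).
human = [0.62, 0.80, 0.45, 0.71]
neg_ppl = [-9.1, -12.3, -8.7, -10.5]
print(spearman(human, neg_ppl))  # near 0 or negative => poor agreement
```

A rho near 1 would mean the automatic metric ranks systems like the human judges do; values near 0 or below indicate the disagreement reported above. In practice, ties should be handled with a library implementation such as `scipy.stats.spearmanr`.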