Deal, or no deal (or who knows)? Forecasting Uncertainty in Conversations using Large Language Models
Effective interlocutors account for the uncertain goals, beliefs, and emotions of others. But even the best human conversationalist cannot perfectly anticipate the trajectory of a dialogue. How well can language models represent inherent uncertainty in conversations? We propose FortUne Dial, an expansion of the long-standing “conversation forecasting” task: instead of just accuracy, evaluation is conducted with uncertainty-aware metrics, effectively enabling abstention on individual instances. We study two ways in which language models potentially represent outcome uncertainty (internally, using scores, and directly, using tokens) and propose fine-tuning strategies to improve the calibration of both representations. Experiments on eight difficult negotiation corpora demonstrate that our proposed fine-tuning strategies (a traditional supervision strategy and an off-policy reinforcement learning strategy) can calibrate smaller open-source models to compete with pre-trained models 10x their size.
Dialogue models are increasingly fluent, topical, and informative conversationalists, capable of predicting plausible next-utterances given a partial conversation. Yet, the capacity to generate a single plausible utterance is not the same as modeling the uncertainty over all possible next-utterances in a calibrated way; that is, assigning each potential conversation outcome a probability that reflects the randomness we observe in the real world. For example, in negotiations, “Sounds good!” and “No thanks” may be equally fluent/topical/informative next-utterances, but one may be more likely once the goals, beliefs, and emotions of the interlocutors are taken into account. While even the best conversationalists cannot perfectly predict the trajectory of a dialogue, humans often manage uncertainty about social cues appropriately (Druckman and Olekalns, 2008), and demonstrate the ability to both anticipate and affect the likelihood of future conversation outcomes (Ho et al., 2022). Meanwhile, it is not yet clear whether language models possess even the simplest of these capabilities: anticipation of outcome uncertainty.
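To make this notion of calibration concrete, here is a minimal sketch (not this paper's method) of scoring two candidate next-utterances with an off-the-shelf causal language model and normalizing the scores into relative outcome probabilities; the model, dialogue, and candidates are illustrative placeholders.

```python
# Minimal sketch: relative probability of two candidate next-utterances
# under an off-the-shelf causal LM. Illustrative only; not the paper's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sequence_logprob(context: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `context`.

    Assumes tokenizing context + continuation splits cleanly at the boundary.
    """
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # The token at position `pos` is predicted by the logits at `pos - 1`.
    return sum(log_probs[0, pos - 1, full_ids[0, pos]].item()
               for pos in range(ctx_len, full_ids.shape[1]))

context = "Buyer: Would you take $40 for the bike?\nSeller:"
candidates = [" Sounds good!", " No thanks."]
scores = torch.tensor([sequence_logprob(context, c) for c in candidates])
probs = torch.softmax(scores, dim=0)  # relative likelihood of each outcome
print(dict(zip(candidates, probs.tolist())))
```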
Rather than evaluating forecasts on accuracy alone, we account for how well language models represent uncertainty about outcomes by measuring performance with calibration metrics. In effect, these calibration metrics allow models to abstain from predicting on instances where they estimate high uncertainty. Potential applications of models that perform well in this setting include: improved tools for studying the effects of strategy and social structure in negotiations (Curhan and Pentland, 2007), intervening to improve human and machine conversations (Lewis et al., 2017; Zhou et al., 2019; Schluger et al., 2022; Argyle et al., 2023), and assessing trust/heterogeneity in a data source via metrics like entropy (Csáky et al., 2019; Kuhn et al., 2022).
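As a rough illustration of what uncertainty-aware evaluation with abstention can look like (a sketch only; the paper's actual metrics are defined in § 2.1), consider the Brier score, a standard calibration measure, together with a simple selective accuracy in which forecasts close to 0.5 are treated as abstentions. The abstention band below is a hypothetical choice for illustration.

```python
# Hedged sketch of uncertainty-aware evaluation: calibration (Brier score)
# plus abstention on maximally uncertain binary forecasts. Illustrative only.
from typing import List

def brier_score(probs: List[float], outcomes: List[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def selective_accuracy(probs: List[float], outcomes: List[int],
                       abstain_band: float = 0.1) -> float:
    """Accuracy on instances the model is confident enough to answer.

    Forecasts within `abstain_band` of 0.5 count as abstentions.
    """
    answered = [(p, y) for p, y in zip(probs, outcomes)
                if abs(p - 0.5) > abstain_band]
    if not answered:
        return 0.0  # the model abstained on every instance
    return sum((p > 0.5) == bool(y) for p, y in answered) / len(answered)

forecasts = [0.9, 0.52, 0.2, 0.48, 0.7]  # P(deal) for five dialogues
outcomes  = [1,   0,    0,   1,    1]    # what actually happened
print(brier_score(forecasts, outcomes))
print(selective_accuracy(forecasts, outcomes))
```

Under the selective metric, a model that forecasts near 0.5 on a genuinely unpredictable dialogue abstains rather than being scored as wrong.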
Here, we focus on the case of negotiations; this type of conversation is not only particularly sensitive to social uncertainties, but its outcomes are also readily quantified post-hoc. We ask language models questions about the likelihood of deals, decisions, and emotional conflicts in settings like marketplaces, online forums, and courtrooms, totaling eight tasks that test uncertainty modeling in negotiations. Our contributions include:
(1) formalizing the conversation uncertainty modeling task, along with its metrics (§ 2.1);
(2) introducing two methods for representing uncertainty about the outcome of conversations using language models (§ 2.2; a minimal sketch follows this list);
and (3) proposing fine-tuning (§ 2.3, § 2.4) and inference-time strategies (§ 2.5) for improving these representations.
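For intuition about contribution (2), the following is a minimal sketch, under stated assumptions, of the two representations contrasted in this paper: (a) reading an outcome probability off the model's internal token scores, and (b) asking the model to verbalize a probability directly in its output tokens. The model, prompts, and answer format are illustrative placeholders, not the paper's.

```python
# Sketch of two ways an LM can represent outcome uncertainty.
# (a) internal: compare next-token scores for "Yes"/"No".
# (b) direct: have the model state a probability as text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
dialogue = "Buyer: Would you take $40?\nSeller: Hmm, maybe $45.\n"

# (a) Internal scores: renormalize the logits of two answer tokens.
prompt = dialogue + "Will they reach a deal? Answer:"
ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]
yes_id = tokenizer(" Yes").input_ids[0]
no_id = tokenizer(" No").input_ids[0]
p_deal = torch.softmax(logits[[yes_id, no_id]], dim=0)[0].item()

# (b) Direct tokens: the forecast lives in the generated text itself.
prompt = dialogue + "What is the probability (0-100) they reach a deal?"
ids = tokenizer(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=5, do_sample=False,
                     pad_token_id=tokenizer.eos_token_id)
verbalized = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
print(p_deal, verbalized)  # e.g., 0.63 vs. "70" (parsing left to the reader)
```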