Can models learn to abstain when uncertain about predictions?
Explores whether language models can be trained to recognize when they lack sufficient information to forecast conversation outcomes, rather than forcing uncertain predictions into confident-sounding responses.
Generating a single plausible next utterance is not the same as modeling, in a calibrated way, the uncertainty over ALL possible next utterances. In a negotiation, "Sounds good!" and "No thanks" may be equally fluent, topical, and informative responses, but one may be more likely given the goals, beliefs, and emotions of the interlocutors.
FortUne Dial formalizes this as conversation uncertainty modeling, shifting evaluation from pure accuracy to uncertainty-aware metrics that enable abstention on individual instances. When the model estimates high uncertainty about an outcome, it should say "I don't know" rather than forcing a prediction.
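A minimal sketch of that abstention rule, assuming a binary outcome forecast and an arbitrary entropy cutoff (the threshold value and outcome labels below are illustrative assumptions, not taken from FortUne Dial):

```python
import math

ABSTAIN_THRESHOLD = 0.9  # hypothetical cutoff on binary entropy (bits)

def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a Bernoulli outcome with success probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def forecast_or_abstain(p_success: float) -> str:
    """Emit a forecast only when the model's uncertainty is low enough."""
    if binary_entropy(p_success) > ABSTAIN_THRESHOLD:
        return "I don't know"
    return "deal" if p_success >= 0.5 else "no deal"

print(forecast_or_abstain(0.92))  # low entropy -> "deal"
print(forecast_or_abstain(0.55))  # near-chance -> "I don't know"
```

Any per-instance uncertainty score could stand in for entropy here; the point is that abstention becomes a thresholding decision on the model's own uncertainty estimate rather than something baked into the prediction itself.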
Two representations of uncertainty:
- Internal — using model scores (logits, probabilities) as uncertainty estimates
- Direct — using generated tokens to express probability assessments
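A rough sketch of how each representation might be read out of an off-the-shelf causal LM; the model choice, prompt wording, and answer-token handling below are placeholder assumptions for illustration, not FortUne Dial's actual setup:

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any Hugging Face causal LM works the same way.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

dialogue = "A: I can do $450, final offer.\nB: Let me think about it."

# Internal: read uncertainty off the model's own next-token scores.
prompt = f"{dialogue}\nWill they reach a deal? Answer Yes or No:"
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**ids).logits[0, -1]                # next-token scores
yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
no_id = tok(" No", add_special_tokens=False).input_ids[0]
p_yes_internal = torch.softmax(logits[[yes_id, no_id]], dim=0)[0].item()

# Direct: have the model verbalize a probability, then parse its tokens.
prompt = f"{dialogue}\nProbability (0-100) that they reach a deal:"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=5, do_sample=False,
                     pad_token_id=tok.eos_token_id)
reply = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                   skip_special_tokens=True)
m = re.search(r"\d+", reply)
p_yes_direct = int(m.group()) / 100 if m else None     # None if unparseable
```

The internal route yields a probability for free but requires logit access; the direct route works through a text-only interface but depends on the model emitting a parseable number.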
Two fine-tuning strategies improve calibration:
- Traditional supervision — standard supervised fine-tuning with calibration objectives
- Off-policy RL — reinforcement learning strategy for calibration
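For the supervised route, one natural calibration objective is a proper scoring rule such as the Brier score on the model's forecast probability. The fragment below is a generic illustration of that idea, not necessarily the exact loss used:

```python
import torch

def brier_loss(p_pred: torch.Tensor, outcome: torch.Tensor) -> torch.Tensor:
    """Squared error between forecast probabilities and realized binary
    outcomes; a proper scoring rule, so it is minimized in expectation by
    reporting the true probability rather than an overconfident one."""
    return ((p_pred - outcome.float()) ** 2).mean()

# Toy batch: the model's forecasts and what actually happened in each dialogue.
p_pred = torch.tensor([0.9, 0.7, 0.2], requires_grad=True)
outcome = torch.tensor([1, 0, 0])
brier_loss(p_pred, outcome).backward()  # gradients pull forecasts toward calibration
```

The negative of a score like this could also serve as the per-episode reward in an off-policy RL setup, though whether that matches the paper's exact reward design is not established here.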
The practical result: smaller open-source models, once calibrated, can compete with pre-trained models 10x their size on uncertainty-aware forecasting. This suggests that calibration ability is undertrained in standard LLMs — the capability exists but the training signal is absent.
Applications include: studying effects of strategy and social structure in negotiations, intervening to improve human and machine conversations, and assessing trust/heterogeneity in data sources via entropy metrics.
Real-world deployment evidence from CRAFT: When the CRAFT conversational forecasting model was deployed as a prototype moderation tool for Wikipedia editors, moderator feedback revealed critical design dimensions. Score change (trajectory) was more actionable than absolute score — moderators preferred seeing whether a conversation was trending toward derailment rather than a static risk number. Crucially, moderator confidence in predicting derailment varied dramatically: four of nine participants believed they could forecast in any Wikipedia context, four others only in very specific contexts with low confidence, and one only for personally-known participants on familiar topics. This variance means forecasting tools must accommodate heterogeneous human expertise rather than assuming uniform detection ability. A further missing dimension: conversation age. Moderators reported that inactive conversations (>2-3 days since last comment) are unlikely to revive, much less turn uncivil — but the prototype did not surface this temporal signal. The scale problem is stark: even topic-engaged moderators cannot proactively monitor all at-risk conversations, forcing them to rely on random discovery strategies.
As Does reasoning fine-tuning make models worse at declining to answer? argues, calibrated uncertainty and appropriate abstention are capabilities that current training actively degrades. And as Does training objective determine which direction models fail at abstention? shows, the direction of the calibration failure depends on the training regime — a forecasting system built on reasoning-trained models would over-predict, while one built on safety-trained models would refuse to predict. Conversation forecasting requires the opposite of both failure modes: models that know what they don't know about where a conversation is heading.
Additional empirical domain — Instagram hostility forecasting: A separate forecasting study on Instagram demonstrates that hostile comments can be predicted from early conversational signals: AUC 0.82 for predicting hostility presence 10+ hours in the future, and AUC 0.91 for predicting whether a post will receive more than 10 hostile comments vs. only one. Predictive features include the post author's history of receiving hostile comments, user-directed profanity, number of distinct participants, and hostility trends in the conversation so far. This complements the CRAFT deployment evidence above — different platform, similar principle: early conversational dynamics carry forecastable signal about future trajectory.
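Purely as a schematic of that feature-based forecasting setup (synthetic data, invented coefficients, and a plain logistic regression; none of this reproduces the study's actual pipeline or results):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for the early signals named above: author's history of
# receiving hostility, user-directed profanity, distinct participants, and the
# hostility trend so far. Labels are fabricated purely for illustration.
n = 600
X = rng.normal(size=(n, 4))
y = (X @ np.array([1.2, 0.8, 0.3, 1.0]) + rng.normal(size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = X[:480], X[480:], y[:480], y[480:]
clf = LogisticRegression().fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"AUC on held-out posts: {auc:.2f}")
```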
Related concepts in this collection
- Does reasoning fine-tuning make models worse at declining to answer?
  When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
  Relevance: reasoning training degrades exactly the abstention capability conversation forecasting needs.
- Why do language models fail confidently in specialized domains?
  LLMs perform poorly on clinical and biomedical inference tasks while remaining overconfident in their wrong answers. Do standard benchmarks hide this fragility, and can prompting techniques fix it?
  Relevance: overconfidence is the complementary failure to poor calibration.
- Does binary reward training hurt model calibration?
  Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
  Relevance: the calibration fix for RL applies to dialogue forecasting.
- Does training objective determine which direction models fail at abstention?
  Calibration failures might not be universal—different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.
  Relevance: specifies how training objectives differentially break forecasting calibration: reasoning-trained forecasters would over-predict, safety-trained would over-refuse.
- Can conversation structure predict dialogue success better than content?
  Does the geometric shape of how dialogue unfolds—timing, repetition, topic drift—matter as much as what people actually say? This explores whether interactive patterns carry signals that word choice alone misses.
  Relevance: TRACE measures trajectory retrospectively for reward; forecasting uses trajectory prospectively for prediction; same underlying principle that conversation shape carries outcome signal.
- Can opening politeness patterns predict whether conversations will turn hostile?
  Do pragmatic politeness features in first exchanges—hedging, greetings, indirectness—reliably signal whether a conversation will later derail into personal attacks? Understanding early linguistic markers could help identify and prevent online hostility.
  Relevance: politeness strategies identify WHICH early features predict trajectory; forecasting provides HOW to quantify confidence in those predictions.
- Why do LLM judges fail at predicting sparse user preferences?
  When LLMs judge user preferences based on limited persona information, what causes their predictions to become unreliable? Understanding persona sparsity's role in judgment failure could improve personalization systems.
  Relevance: the same calibrated abstention pattern: personalized judges that express uncertainty on sparse persona inputs achieve 80%+ reliability on high-certainty samples, paralleling how calibrated forecasting models improve by abstaining when uncertain rather than forcing predictions.
- Why do users drift away from their original information need?
  When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?
  Relevance: ASK-driven topic drift is a specific conversational trajectory that calibrated forecasting should detect: users in an anomalous knowledge state produce drift patterns detectable with 84% precision, providing a concrete forecasting target for conversation trajectory prediction.
Original note title: conversation forecasting under uncertainty requires calibrated probability estimates — calibrated models should abstain on uncertain predictions rather than forcing outputs