How can agents learn to estimate user satisfaction in real-time during conversation?
This explores how a conversational agent could read whether the user is satisfied *while the conversation is still happening* — not from a post-hoc rating, but from signals it learns to interpret turn by turn.
The corpus suggests the hard part isn't measurement; it's deciding what to treat as the satisfaction signal in the first place. The most direct answer is to make satisfaction (or a proxy for it) the reward an agent optimizes online. Can recommendation metrics train language models directly? shows you can give an LLM a rule-based metric like NDCG or Recall as an RL reward and let it learn directly against that signal, with no SFT distillation needed, which suggests that any scorable proxy for 'did this land' could drive learning. Can unified policy learning improve conversational recommender systems? pushes this further: when the decisions of *what to ask, what to recommend, and when* are learned as one RL policy instead of three separate modules, the agent optimizes the whole conversation trajectory, which is closer to optimizing eventual satisfaction than optimizing any single turn.
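To make "a scorable proxy as reward" concrete, here is a minimal sketch of NDCG@k computed over a ranked list and returned as a scalar reward. The function name `ndcg_reward`, the binary relevance assumption, and the idea of treating the items a user accepted as `relevant_ids` are illustrative assumptions, not the setup used in the cited work.

```python
import math
from typing import Sequence, Set


def ndcg_reward(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k, usable as a scalar reward in [0, 1].

    Assumption: an item counts as relevant (gain 1.0) if the user accepted it.
    """
    # Gain of each of the top-k ranked items (1.0 if relevant, else 0.0).
    gains = [1.0 if item in relevant_ids else 0.0 for item in ranked_ids[:k]]
    # Discounted cumulative gain: later ranks are discounted logarithmically.
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    # Ideal DCG: all relevant items (up to k) placed at the top of the list.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

A ranking that puts the accepted item first scores 1.0; pushing it to second place lowers the reward, which is the gradient of "did this land" that an RL loop could learn against.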
But a reward only tells you if you hit a target you already defined. The more interesting thread is agents that build their own internal estimator. Can models learn to evaluate their own work during training? trains a model to compute its own reward in the unused space after its output — internalizing self-evaluation so the judgment travels with the model at zero inference cost. Can personas evolve in real time to match what users actually want? does the live version of this: it simulates recent interactions against the user's textual feedback *at test time* and updates a persona to match what the user actually seems to want — a working example of adjusting to a satisfaction signal mid-stream rather than waiting for retraining.
Here's the thing you might not expect: much of the corpus argues the richest satisfaction signal isn't a thumbs-up at all — it's the conversational friction the agent should be reading. When should AI agents ask users instead of just searching? borrows from conversation analysis to formalize the moments when a user's intent is drifting and the agent should pause to clarify — dissatisfaction *as it forms*, before a bad answer is even produced. Could proactive dialogue make conversations dramatically more efficient? suggests turn count itself is a readable proxy: needing fewer turns to resolve something is a satisfaction signal hiding in plain sight. And How can proactive agents avoid feeling intrusive to users? warns of the inverse failure — an agent that optimizes for being helpful can read as intrusive, so 'satisfaction' has a civility dimension (timing, boundaries, autonomy) that a naive reward will miss entirely.
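One way to picture "reading friction turn by turn" is a running estimate that rises when the user repeats or corrects themselves and decays otherwise. The sketch below is purely illustrative: the `FrictionTracker` class, the two boolean cues, the decay of 0.7, and the 0.4 threshold are all assumptions, not something the cited notes specify.

```python
from dataclasses import dataclass


@dataclass
class FrictionTracker:
    """Exponential moving average over per-turn friction cues (hypothetical)."""

    decay: float = 0.7      # how much of the past estimate is kept each turn
    friction: float = 0.0   # current friction estimate in [0, 1]
    turns: int = 0          # turns observed so far (turn count is itself a proxy)

    def observe(self, user_repeated_request: bool, user_corrected_agent: bool) -> float:
        # A turn counts as friction if either cue fired.
        signal = 1.0 if (user_repeated_request or user_corrected_agent) else 0.0
        self.friction = self.decay * self.friction + (1.0 - self.decay) * signal
        self.turns += 1
        return self.friction

    def should_clarify(self, threshold: float = 0.4) -> bool:
        # Pause and ask the user once friction has built up past the threshold.
        return self.friction >= threshold
```

A single friction turn stays below the threshold here, while two in a row cross it, so the agent would pause to clarify only when friction persists rather than on every hiccup.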
What does satisfaction even decompose into? How do users mentally model dialogue agent partners? gives the most concrete map: when users judge a dialogue agent, perceived *competence* accounts for nearly half the impression, with human-likeness and communicative flexibility making up the rest. That's a hint about what a real-time estimator should weight most heavily — and a warning that an agent maximizing 'sounding human' is optimizing the smaller share of what users actually care about.
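As a rough illustration of "weight competence most heavily", the sketch below combines three per-turn scores using the variance shares reported for the Partner Modelling Questionnaire (49%, 32%, 19%). Using variance shares directly as linear weights is an assumption made for illustration, and the per-turn scores in [0, 1] are hypothetical inputs that some upstream model would have to supply.

```python
from typing import Mapping

# Variance shares from the Partner Modelling Questionnaire note,
# reused here as weights (an illustrative assumption).
PMQ_WEIGHTS = {
    "competence": 0.49,
    "human_likeness": 0.32,
    "flexibility": 0.19,
}


def weighted_impression(scores: Mapping[str, float]) -> float:
    """Weighted sum of per-dimension scores, each expected in [0, 1]."""
    return sum(weight * scores[name] for name, weight in PMQ_WEIGHTS.items())
```

Under this weighting, an agent that is fully competent but scores zero on the other two dimensions already reaches 0.49, while one that only sounds human tops out at 0.32.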
Two cross-cutting tools round this out. To estimate satisfaction efficiently you need to ask without nagging: Can user preferences be learned from just ten questions? shows roughly ten well-chosen adaptive questions can pin down a personalized reward function, reducing uncertainty fast instead of interrogating the user. And to *train* a satisfaction estimator at all, you need realistic users to practice on — Can controlled latent variables make LLM user simulators realistic? and Can training user simulators reduce persona drift in dialogue? build controllable, drift-resistant user simulators that generate the synthetic dissatisfaction-and-recovery conversations an agent needs to learn from before it ever meets a real person.
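The "ask the most informative question" idea can be sketched with a simplified stand-in: a linear reward with Gaussian uncertainty over user-specific coefficients, where the next question is the one whose answer is currently most uncertain. This is not the PReF algorithm itself; the scalar-answer observation model, the noise variance, and all function names below are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m: Matrix, v: Sequence[float]) -> Vector:
    return [dot(row, v) for row in m]


def predictive_variance(cov: Matrix, d: Sequence[float]) -> float:
    # Uncertainty about the answer to question d under the current posterior.
    return dot(d, mat_vec(cov, d))


def pick_question(cov: Matrix, candidates: Sequence[Sequence[float]]) -> int:
    # Choose the candidate whose answer we are least sure about.
    return max(range(len(candidates)), key=lambda i: predictive_variance(cov, candidates[i]))


def update(mean: Vector, cov: Matrix, d: Sequence[float], answer: float,
           noise_var: float = 0.1) -> Tuple[Vector, Matrix]:
    """Bayesian linear-regression update after observing answer ~ w . d + noise."""
    cov_d = mat_vec(cov, d)
    denom = noise_var + dot(d, cov_d)
    gain = [c / denom for c in cov_d]
    residual = answer - dot(mean, d)
    new_mean = [m + g * residual for m, g in zip(mean, gain)]
    n = len(cov_d)
    new_cov = [[cov[i][j] - gain[i] * cov_d[j] for j in range(n)] for i in range(n)]
    return new_mean, new_cov
```

Each answer shrinks uncertainty along the direction that was asked about, so the selector naturally moves on to a different question next instead of asking the same thing twice.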
Sources (11 notes)
Rec-R1 demonstrates that LLMs can be trained directly on rule-based recommendation metrics like NDCG and Recall as RL reward signals, eliminating the need for SFT distillation from proprietary models while remaining model-agnostic across different retriever architectures.
Research shows that formulating attribute-asking, item-recommending, and timing decisions as a single graph-based RL policy achieves better joint optimization than isolated components. Separating the components prevents their gradient signals from informing one another and fails to optimize the conversation trajectory holistically.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
PersonaAgent uses structured personas to bridge episodic/semantic memory and personalized actions, optimizing them at test time by simulating recent interactions against textual feedback. Learned personas cluster meaningfully in latent space, suggesting genuine user-specific separation beyond standard post-training drift.
Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.
Simulations show proactivity—providing relevant information without being asked—cuts dialogue turns by 60% in medium-complexity domains. This behavior mirrors human conversation and Grice's maxims but is almost entirely absent from AI datasets and research benchmarks.
Intelligence and adaptivity alone create socially blind agents that interrupt poorly and override user direction. The Intelligence-Adaptivity-Civility taxonomy shows civility—respecting boundaries, timing, and autonomy—is essential to making proactivity welcome rather than intrusive.
The Partner Modelling Questionnaire reveals that perceived competence dominates user impressions (49% of variance), followed by human-likeness (32%) and communicative flexibility (19%). This three-factor structure reflects how people evaluate dialogue partners against both functional and social standards.
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce uncertainty over each user's coefficients. Responses can then be personalized to each user via inference-time reward alignment, without modifying model weights.
RecLLM demonstrates that conditioning an LLM simulator on session-level (user profile) and turn-level (user intent) latent variables produces synthetic conversations whose realism can be measured via crowdsourced discrimination, discriminator models, and classifier-ensemble distribution matching.
Inverting the standard RL setup to train user simulators for consistency, with three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, reduces persona drift by over 55%. The metrics capture distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.