Can users detect and correct an AI's mental model of their preferences?
This note explores whether the back-and-forth repair loop actually works. The question is not just whether an AI can build a model of your preferences, but whether you can see that model, catch its mistakes, and correct them. The evidence in the corpus is less encouraging than you'd hope, and the reason is structural: AIs mostly don't show you the model, and when they do respond to pushback they may bend without genuinely updating.
Start with how badly the inference goes in the first place. In the UserBench evaluation, where users reveal goals gradually across a conversation, models reach full alignment with what the user actually wants only about 20% of the time, and even the best ones surface fewer than 30% of preferences through active questioning (see "Why do AI agents miss most of what users actually want?"). The diagnosis there is telling: the failures are passivity and premature assumption-making. The AI quietly commits to a guess instead of checking. That's the worst possible setup for correction, because a model you never see is a model you can't dispute.
One line of work treats this as a fixable conversational habit. Borrowing 'insert-expansions' from how humans actually talk, the idea is that an agent should pause to clarify intent before acting rather than recovering after it has already drifted (see "When should AI agents ask users instead of just searching?"). That reframes correction as prevention: the right moment to fix the AI's model is before it silently chains tools toward the wrong goal. But notice this still keeps the model implicit: the user steers through clarifying questions, not by directly editing what the AI believes about them.
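To make the clarify-before-acting idea concrete, here is a minimal Python sketch. It assumes the agent already knows which details it needs before it can act. The `Request` class, the slot names, and the `next_step` function are all hypothetical names invented for this illustration, not part of any system described in the notes.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical container for what the user has said so far."""
    text: str
    # Details the agent needs; None means the user has not stated it yet.
    slots: dict = field(default_factory=dict)


def next_step(request: Request, required: list[str]) -> tuple[str, str]:
    """Return ("ask", question) if a required detail is missing, else ("act", summary)."""
    missing = [name for name in required if request.slots.get(name) is None]
    if missing:
        # Insert-expansion: pause and clarify before any tool call is made.
        return ("ask", f"Before I search, could you tell me your {missing[0]}?")
    filled = ", ".join(f"{name}={request.slots[name]}" for name in required)
    return ("act", f"Searching with {filled}")


req = Request("find me a flight", {"destination": "Lisbon", "budget": None})
print(next_step(req, ["destination", "budget"]))  # asks about the budget first
```

The gate only prevents acting on a silent guess. As the paragraph above notes, the user still never sees the agent's overall picture of them; they only answer the questions it chooses to ask.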
The harder problem surfaces when you ask whether 'correction' even sticks. Models can describe their own learned behaviors, but their self-reports are unstable, and they shift their stated beliefs under conversational pressure while users over-rely on whatever sounds confident (see "How well do language models understand their own knowledge?"). So a user who pushes back may get apparent agreement that's really just capitulation: the surface changes, the underlying model may not. Correction that the system performs but doesn't internalize is its own failure mode. And it cuts both ways: the same research shows people trust confident outputs regardless of accuracy, so a user may not even detect that the AI's picture of them is wrong.
How the preference model is stored shapes how inspectable it is. Work comparing memory types finds that abstract preference summaries beat replaying specific past interactions for personalization (see "Does abstract preference knowledge outperform specific interaction recall?"). A readable summary ("you prefer concise answers") is something a user could in principle review and edit, whereas a parametric or graph-based store inferred passively from watching you (see "Can agents learn preferences by watching rather than asking?") is far more opaque. The detect-and-correct question, then, is partly an interface-design question hiding inside a modeling question. Finally, flip the mirror: users are simultaneously building their own model of the agent, dominated by perceived competence (see "How do users mentally model dialogue agent partners?"), which means a confidently-wrong AI is exactly the one a user is least likely to challenge. The repair loop the question imagines requires both sides to model each other accurately, and right now neither reliably does.
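The interface-design point about readable summaries can be sketched in a few lines of Python. This is one possible shape for an inspectable store, under the assumption that preferences are kept as plain-language statements tagged with where they came from. `Preference`, `PreferenceProfile`, and the method names are hypothetical, and none of the cited work is claimed to be implemented this way.

```python
from dataclasses import dataclass


@dataclass
class Preference:
    statement: str  # human-readable summary, e.g. "prefers concise answers"
    source: str     # "inferred" (guessed by the agent) or "user" (stated by the user)


class PreferenceProfile:
    """Hypothetical store the user can read and edit directly."""

    def __init__(self) -> None:
        self._prefs: dict[str, Preference] = {}

    def infer(self, key: str, statement: str) -> None:
        # An inference may replace an earlier inference, never a user correction.
        current = self._prefs.get(key)
        if current is None or current.source == "inferred":
            self._prefs[key] = Preference(statement, "inferred")

    def correct(self, key: str, statement: str) -> None:
        # A user correction always wins and is recorded as such.
        self._prefs[key] = Preference(statement, "user")

    def show(self) -> list[str]:
        # The whole model, in a form the user can review line by line.
        return [f"{key}: {p.statement} ({p.source})" for key, p in self._prefs.items()]


profile = PreferenceProfile()
profile.infer("length", "prefers long, detailed answers")
profile.correct("length", "prefers concise answers")
profile.infer("length", "prefers long, detailed answers")  # ignored: the user already corrected this
print(profile.show())
```

Two design choices map onto the failure modes discussed above: `show` makes the model visible so that it can be disputed at all, and the `source` tag keeps a correction from being silently overwritten by a later guess, so the update is stored rather than merely agreed to in conversation.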
Sources (6 notes)
UserBench measured multi-turn interactions where users reveal goals incrementally and found models achieve full intent alignment just 20% of the time. Even top models uncover fewer than 30% of user preferences through active querying, suggesting passivity and premature assumption-making are systematic failures.
Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.
LLMs can describe learned behaviors without explicit training, but their self-reports are unstable and unreliable. Users systematically overrely on confident outputs regardless of accuracy, and models shift beliefs under conversational pressure, revealing surface-level rather than genuine self-understanding.
The PRIME framework shows that semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning outperforms preference-tuning methods.
M3-Agent demonstrates that separating episodic events from semantic knowledge in an entity-centric graph, combined with parallel memorization and control processes, allows agents to infer and act on user preferences without asking. This architecture mirrors human cognitive systems that bind disparate information about individuals across sensory modalities.
The Partner Modelling Questionnaire reveals that perceived competence dominates user impressions (49% of variance), followed by human-likeness (32%) and communicative flexibility (19%). This three-factor structure reflects how people evaluate dialogue partners against both functional and social standards.