Can conversations themselves personalize without user profiles?
Can a conversational AI learn about user traits and adapt in real time by rewarding itself for asking insightful questions, rather than relying on pre-collected profiles or historical data?
Most LLM personalization requires something before the conversation starts — a user profile, historical interactions, preference embeddings, or calibration queries. The curiosity reward approach inverts this: the conversation itself is the personalization mechanism.
The key idea: augment standard RLHF with an auxiliary reward that measures how much each turn improves the model's belief about the user's latent type. The agent is rewarded for reducing its uncertainty about who it's talking to. This creates an intrinsic drive to ask insightful questions, make context-sensitive probes, and adapt responses based on inferred traits — rather than passively responding to stated preferences.
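A minimal sketch of the turn-level intrinsic reward, assuming the agent keeps an explicit probability distribution over a discrete set of candidate user types and that the true type is available (as it is with simulated users); the function name and belief representation are illustrative, not taken from the source:

```python
def curiosity_reward(belief_before: dict[str, float],
                     belief_after: dict[str, float],
                     true_type: str) -> float:
    """Intrinsic reward for one turn: how much the agent's belief about
    the user's latent type improved after its action.

    Operationalized here as the gain in probability assigned to the true
    type, which is known only because the user is simulated. When the
    true type is unavailable, a drop in the belief's entropy is a common
    substitute (an assumption, not something stated in the source).
    """
    return belief_after[true_type] - belief_before[true_type]
```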
The architecture separates two reward channels:
- End-of-conversation sparse reward — standard RLHF signal for overall conversation quality
- Turn-based intrinsic reward — improvement in user type prediction accuracy after each action
This dual signal forces a balance between helpfulness and inquisitiveness. Without the curiosity reward, models default to passive helpfulness (see Why can't conversational AI agents take the initiative?). With it, models learn to strategically gather information about users.
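One plausible way to wire the two channels together, assuming a simple linear trade-off weight `lam` (a hyperparameter introduced here for illustration) and a sparse conversation-quality score paid out only on the final turn:

```python
def total_turn_reward(turn_index: int,
                      num_turns: int,
                      intrinsic: float,
                      conversation_quality: float,
                      lam: float = 0.1) -> float:
    """Dual-channel reward for a single turn.

    The curiosity bonus `intrinsic` arrives every turn, while the
    end-of-conversation RLHF signal `conversation_quality` is sparse
    and only added on the last turn. `lam` sets the balance between
    helpfulness and inquisitiveness.
    """
    is_final_turn = (turn_index == num_turns - 1)
    return lam * intrinsic + (conversation_quality if is_final_turn else 0.0)
```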
Tested in two domains: education (inferring learning style to adapt teaching) and fitness (inferring lifestyle attributes to personalize exercise recommendations). The simulation used 20 user attributes, 5 decision-relevant and 15 background, emulating real-world complexity where most user characteristics are irrelevant noise.
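A rough sketch of how a simulated user in this kind of setup could be sampled; the 20-attribute split (5 decision-relevant, 15 background) comes from the description above, but the attribute names and value levels are hypothetical:

```python
import random

# Hypothetical attribute sets: only the decision-relevant ones should
# affect which personalization choice is correct; the rest are noise.
DECISION_ATTRS = ["learning_style", "pace", "prior_knowledge",
                  "feedback_preference", "session_length"]
BACKGROUND_ATTRS = [f"background_{i}" for i in range(15)]
LEVELS = ["low", "medium", "high"]

def sample_user() -> dict[str, str]:
    """Draw one synthetic user with 20 attributes (5 relevant, 15 noise)."""
    return {attr: random.choice(LEVELS)
            for attr in DECISION_ATTRS + BACKGROUND_ATTRS}
```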
The distinction from prior work is sharp. PReF (reward factorization) requires 10 pre-conversation preference queries. PLUS (text-based summaries) requires historical interaction data. P-RLHF requires user-specific feedback data. The curiosity reward requires nothing — personalization emerges from the conversation dynamics.
This connects to Can AI agents learn when they have something worth saying? — both use intrinsic motivation, but for different purposes. Inner Thoughts drives general social proactivity (10 heuristics from cognitive psychology). Curiosity reward drives personalization-specific proactivity (reducing uncertainty about user type). Together they suggest that intrinsic motivation is a general mechanism for making AI conversationally active, with specific reward signals shaping what the activity targets.
The implication for open-ended dialogue is significant: when there's no clear task, engagement itself becomes the objective. Curiosity-driven agents that encourage users to share naturally may be more enjoyable than those that wait to be asked — and the sharing simultaneously enables better personalization.
Source: Assistants Personalization
Related concepts in this collection
- Can AI agents learn when they have something worth saying?
What if AI proactivity came from modeling intrinsic motivation to participate rather than predicting who speaks next? This explores whether a framework based on human cognitive patterns—internal thought generation parallel to conversation—can make agents genuinely responsive rather than passively reactive.
complementary intrinsic motivation mechanisms: social proactivity vs personalization-specific proactivity
- Why can't conversational AI agents take the initiative?
Explores whether current LLMs lack the structural ability to lead conversations, set goals, or anticipate user needs—and what architectural changes might enable proactive dialogue.
curiosity reward directly addresses structural passivity for personalization
- Why do language models respond passively instead of asking clarifying questions?
Explores whether the reward signals used to train language models might actively discourage them from seeking clarification or taking initiative in conversations, and what alternative training approaches might enable more collaborative dialogue.
curiosity reward IS a multi-turn-aware reward that incentivizes active intent discovery
- Can text summaries condition reward models better than embeddings?
Exploring whether learning interpretable text-based summaries of user preferences outperforms embedding vectors for training personalized reward models in language model alignment.
PLUS requires historical data; curiosity reward requires none
- When should proactive agents push toward their goals versus accommodate users?
Proactive dialogue agents face a tension between reaching their objectives efficiently and keeping users satisfied. This question explores whether these two aims can coexist or require constant negotiation.
curiosity reward enables dynamic estimation of the cooperative degree and satisfaction factors: by reducing uncertainty about user type, the agent can better predict which topics will satisfy vs. alienate, enabling more nuanced goal-weight computation in real time
- Can models learn to ask clarifying questions instead of guessing?
Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
complementary active information-seeking: proactive critical thinking targets task-level missing information; curiosity reward targets user-level missing information; both transform passive agents into active seekers but for different knowledge gaps
Original note title: curiosity reward enables real-time personalization by rewarding the agent for reducing uncertainty about user type during multi-turn conversation