Conversational AI Systems

Can text summaries condition reward models better than embeddings?

Exploring whether interpretable, RL-learned text summaries of user preferences outperform embedding vectors for conditioning personalized reward models in language model alignment.

Note · 2026-02-22 · sourced from Reinforcement Learning

Standard RLHF models the entire user population with a single reward model. Prior pluralistic approaches either condition on embedding vectors (which compress a user's history into a single vector, losing information) or use in-context learning over raw conversation histories (which hurts generalization across topics). PLUS proposes a third path: learn text-based summaries of user preferences via RL, then condition the reward model on these summaries.
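To make the conditioning difference concrete, here is a minimal PyTorch sketch of the two reward-model interfaces. The class and argument names are illustrative rather than taken from the PLUS paper, and the text encoder is assumed to map a batch of strings to a [batch, hidden] tensor.

```python
import torch
import torch.nn as nn


class EmbeddingConditionedRM(nn.Module):
    """Prior approach: condition on a fixed-size user embedding (lossy compression of history)."""

    def __init__(self, hidden_dim: int, user_dim: int):
        super().__init__()
        self.head = nn.Linear(hidden_dim + user_dim, 1)

    def forward(self, response_repr: torch.Tensor, user_embedding: torch.Tensor) -> torch.Tensor:
        # Concatenate the response representation with the opaque user vector.
        return self.head(torch.cat([response_repr, user_embedding], dim=-1)).squeeze(-1)


class SummaryConditionedRM(nn.Module):
    """PLUS-style: prepend a readable text preference summary and score with one encoder."""

    def __init__(self, encoder: nn.Module, hidden_dim: int):
        super().__init__()
        self.encoder = encoder  # assumed: maps list[str] -> tensor of shape [batch, hidden_dim]
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, summaries: list[str], prompts: list[str], responses: list[str]) -> torch.Tensor:
        # The conditioning signal stays in text form, so it remains human-readable and editable.
        texts = [
            f"User preferences: {s}\n\nPrompt: {p}\nResponse: {r}"
            for s, p, r in zip(summaries, prompts, responses)
        ]
        return self.head(self.encoder(texts)).squeeze(-1)
```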

The architecture is a co-adaptation loop. A summarizer is trained with PPO to generate user preference summaries from past conversation histories. A reward model is simultaneously trained to make personalized predictions conditioned on these summaries. The summarizer's reward signal is the reward model's predictive accuracy — so the summarizer learns which aspects of past conversations actually matter for predicting future preferences, rather than which topics were discussed.
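A rough sketch of one joint update under this loop, assuming a pairwise Bradley-Terry loss for the reward model and a PPO-style update for the summarizer; the batch fields and the `ppo_update` method are hypothetical stand-ins, not the authors' interfaces.

```python
import torch


def co_adaptation_step(summarizer, reward_model, rm_optimizer, batch):
    """One joint update: the summarizer's RL reward is the reward model's predictive accuracy."""
    # 1. Summarize each user's past conversations into a text preference summary.
    summaries = summarizer.generate([user.history for user in batch.users])

    # 2. Score held-out (chosen, rejected) responses conditioned on those summaries.
    r_chosen = reward_model(summaries, batch.prompts, batch.chosen)
    r_rejected = reward_model(summaries, batch.prompts, batch.rejected)

    # 3. Update the reward model with a standard Bradley-Terry pairwise loss.
    rm_loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    rm_optimizer.zero_grad()
    rm_loss.backward()
    rm_optimizer.step()

    # 4. Reward the summarizer for predictive accuracy, not topical coverage:
    #    a summary earns reward only if it let the RM rank chosen above rejected.
    summary_reward = (r_chosen > r_rejected).float().detach()
    summarizer.ppo_update(summaries, summary_reward)
```

The key design choice is step 4: because the summarizer is scored on downstream prediction rather than on faithfulness to the transcript, it is pushed toward preference-relevant content.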

The critical finding is that untrained summarizers focus on conversation topics ("the user asked about cats") rather than preference dimensions ("the user values concise, factual information"). RL training shifts attention to the dimensions that matter for prediction. Zero-shot summaries fail because they lack this discriminative signal.

The practical implications are significant: the text summaries are portable (they transfer to GPT-4 for zero-shot personalization), interpretable (users can read and modify them), and concise. This connects to the broader tension between personalization and alignment: as discussed in Does chatbot personalization build trust or expose privacy risks?, PLUS's transparent text summaries may offer a less opaque path to personalization than embedding-based approaches.
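Because the summary is plain text, the transfer amounts to prompt construction. A minimal sketch assuming the OpenAI chat completions client; the system-prompt wording is invented for illustration.

```python
from openai import OpenAI

client = OpenAI()


def personalized_reply(preference_summary: str, user_message: str) -> str:
    """Zero-shot personalization: the learned text summary becomes a system prompt."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": f"Tailor your answers to this user. Known preferences: {preference_summary}",
            },
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content
```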

Complementary approaches form a design space for personalized alignment. PReF (Personalization via Reward Factorization) represents each user's preferences as a weighted sum of base reward functions and infers the per-user weights via active learning from only 10-20 preference queries, with no historical data needed (a sketch follows this paragraph). P-RLHF takes a third approach: a lightweight user model is learned jointly with the LLM, handling both explicit preferences (stated) and implicit preferences (inferred from feedback data) without pre-defined preference dimensions. The curiosity reward approach eliminates pre-conversation calibration entirely; the agent learns about the user during the conversation itself, rewarded for reducing uncertainty about the user's type (see Can conversations themselves personalize without user profiles?). Together, these methods span a spectrum: PLUS requires historical data but produces portable summaries; PReF requires 10-20 active queries but no history; curiosity reward requires nothing upfront but personalizes more slowly. The choice depends on available data and acceptable latency to personalization.
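For contrast, PReF's factorization can be sketched as a weighted sum over fixed base reward functions, with per-user weights fit from a handful of pairwise answers. The helper below uses plain gradient ascent on a Bradley-Terry likelihood; the paper's actual active-learning query selection is not reproduced, and all names are illustrative.

```python
import numpy as np


def factorized_user_reward(base_rewards: np.ndarray, user_weights: np.ndarray) -> np.ndarray:
    """PReF-style reward: r_u(x) = sum_i w_{u,i} * r_i(x) over fixed base reward functions.

    base_rewards has shape [n_responses, n_bases]; user_weights has shape [n_bases].
    """
    return base_rewards @ user_weights


def infer_user_weights(pairwise_queries, n_bases: int, lr: float = 0.1, steps: int = 200) -> np.ndarray:
    """Fit per-user weights from ~10-20 (chosen, rejected) base-reward vectors
    by maximizing a Bradley-Terry log-likelihood with gradient ascent."""
    w = np.zeros(n_bases)
    for _ in range(steps):
        grad = np.zeros(n_bases)
        for chosen, rejected in pairwise_queries:
            diff = chosen - rejected                      # difference of base-reward values
            p = 1.0 / (1.0 + np.exp(-(w @ diff)))         # P(chosen preferred under current w)
            grad += (1.0 - p) * diff                      # gradient of log sigmoid(w @ diff)
        w += lr * grad / max(len(pairwise_queries), 1)
    return w
```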


Source: Reinforcement Learning

Original note title: learned text-based user preference summaries condition reward models more effectively than embedding vectors for pluralistic alignment