Recommender Systems · Conversational AI Systems

Can user preferences be learned from just ten questions?

Explores whether adaptive question selection can efficiently infer user-specific reward coefficients without historical data or fine-tuning. This matters for scaling personalization without per-user model updates.

Note · 2026-02-23 · sourced from Assistants Personalization

Standard RLHF trains a single reward model on aggregated human preferences, assuming a universal preference structure. PReF (Personalization via Reward Factorization) makes a different assumption: user preferences lie in a low-dimensional space and can be represented as weighted sums of a small set of base reward functions.
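
In symbols (the notation here is mine, reconstructed from the description above rather than copied from the paper): user u's reward is a weighted combination of K base reward functions, and pairwise preferences follow a logistic (Bradley-Terry-style) model over the reward difference, which is what makes the logistic-bandit machinery in stage 2 below applicable.

```latex
% Factorized user reward: w_u are user-specific coefficients,
% r_1..r_K are shared base reward functions (notation illustrative).
r_u(x, y) = \sum_{k=1}^{K} w_{u,k}\, r_k(x, y) = \mathbf{w}_u^\top \mathbf{r}(x, y)

% Probability that user u prefers response y_a over y_b for prompt x,
% in the logistic (Bradley-Terry) form assumed by the logistic-bandit view:
P(y_a \succ y_b \mid x, u) = \sigma\!\left( \mathbf{w}_u^\top \left[ \mathbf{r}(x, y_a) - \mathbf{r}(x, y_b) \right] \right)
```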

The three-stage architecture:

  1. Base reward learning — train a set of base reward functions from paired preference data annotated with user identity. Each base function captures one dimension of preference variation (e.g., conciseness vs detail, formality vs casualness).

  2. User coefficient inference — present the new user with a sequence of questions, each paired with two candidate responses, and ask which response they prefer. The questions are selected adaptively using active learning: each question is chosen to maximally reduce uncertainty about the user's coefficients. Results from logistic bandit theory enable efficient uncertainty computation (a sketch of this loop follows the list).

  3. Inference-time alignment — once user-specific coefficients are known, use inference-time methods to generate reward-aligned responses without modifying model weights. This enables scalable per-user adaptation.
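
A minimal sketch of stage 2, assuming each candidate question comes with a precomputed base-reward difference vector and that the coefficient posterior is kept as a Gaussian updated with a Laplace-style step (standard approximate Bayesian logistic regression). The selection rule and all names here are illustrative, not the paper's exact algorithm:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def infer_user_coefficients(candidate_pairs, ask_user, n_questions=10, prior_var=1.0):
    """Adaptively estimate a user's coefficients w from a few pairwise questions.

    candidate_pairs: array of shape (N, K); row i holds the base-reward
        difference vector r(x_i, y_a) - r(x_i, y_b) for candidate question i.
    ask_user: callable(i) -> 1 if the user prefers the first response of
        question i, else 0 (stands in for the real questioning UI).
    Returns the posterior mean over the K coefficients.
    """
    _, K = candidate_pairs.shape
    mu = np.zeros(K)                  # posterior mean of w_u
    Sigma = prior_var * np.eye(K)     # posterior covariance of w_u
    asked = set()

    for _ in range(n_questions):
        # Active selection: ask the question whose outcome is most uncertain
        # under the current posterior (largest variance of the logit d @ w).
        scores = [
            d @ Sigma @ d if i not in asked else -np.inf
            for i, d in enumerate(candidate_pairs)
        ]
        i = int(np.argmax(scores))
        asked.add(i)
        d = candidate_pairs[i]

        label = ask_user(i)           # observed preference for this question

        # Laplace-style online update for Bayesian logistic regression.
        p = sigmoid(mu @ d)
        H = p * (1 - p) * np.outer(d, d)                 # observed information
        Sigma = np.linalg.inv(np.linalg.inv(Sigma) + H)  # shrink uncertainty
        mu = mu + Sigma @ ((label - p) * d)              # move toward the data

    return mu
```

The `d @ Sigma @ d` score is a predictive-variance heuristic; the logistic-bandit results the note mentions would justify a sharper confidence-set construction, but the overall loop (select the most informative pair, ask, update the posterior) has the same shape.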

The practical significance: 10-20 questions suffice. This is dramatically more efficient than approaches requiring historical interaction data or per-user fine-tuning. The active learning component is critical — random question selection would require far more queries because most questions are uninformative for distinguishing between users.

The low-dimensional preference assumption is both the strength and the limitation. If real preferences don't decompose into a small number of base dimensions, the factorization misses important variation. However, the survey evidence from "How do personalization granularity levels trade precision against scalability?" suggests that persona-level personalization (group-based, moderate dimensionality) is often sufficient and that user-level precision trades against data requirements.

The inference-time alignment component connects to "Can decoding-time tuning preserve knowledge better than weight fine-tuning?". Both avoid weight modification per user, but PReF applies a user-specific reward function while proxy tuning applies a task-specific distributional shift. The combination suggests a design space: different axes of adaptation (user preferences, task requirements, domain knowledge) can each be applied at inference time through different mechanisms.
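
On the mechanism side, the simplest concrete instance of this kind of inference-time alignment is best-of-n reranking: sample several candidates and return the one that scores highest under the user's factorized reward. The note does not specify which inference-time method PReF uses, so treat this, and every name in it, as a hypothetical illustration:

```python
import numpy as np

def personalize_response(prompt, generate_candidates, base_rewards, w_user, n=8):
    """Best-of-n reranking with a user-specific factorized reward (illustrative).

    generate_candidates: callable(prompt, n) -> list of n candidate responses
        (hypothetical stand-in for any sampling-based generator).
    base_rewards: callable(prompt, response) -> length-K vector of base
        reward scores for this (prompt, response) pair.
    w_user: length-K coefficient vector inferred for this user in stage 2.
    """
    candidates = generate_candidates(prompt, n)
    # Score with the user's weighted combination of base rewards; no model
    # weights change, so per-user adaptation happens entirely at inference time.
    scores = [float(np.dot(w_user, base_rewards(prompt, c))) for c in candidates]
    return candidates[int(np.argmax(scores))]
```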


Source: Assistants Personalization

Original note: reward factorization represents user-specific preferences as linear combinations of base reward functions — 10 active-learning queries suffice for personalization