Recommender Systems · Conversational AI Systems

Can user preferences be learned from just ten questions?

Explores whether adaptive question selection can efficiently infer user-specific reward coefficients without historical data or fine-tuning. This matters for scaling personalization without per-user model updates.

Note · 2026-02-23 · sourced from Assistants Personalization

Standard RLHF trains a single reward model on aggregated human preferences, assuming a universal preference structure. PReF (Personalization via Reward Factorization) makes a different assumption: user preferences lie in a low-dimensional space and can be represented as weighted sums of a small set of base reward functions.
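
In symbols (the notation here is mine, reconstructed from the description above rather than copied from the paper): user u's reward is a weighted combination of K base reward functions, and pairwise preferences follow a logistic (Bradley-Terry-style) model over the reward difference, which is what makes the logistic-bandit machinery in stage 2 below applicable.

```latex
% Factorized user reward: w_u are user-specific coefficients,
% r_1..r_K are shared base reward functions (notation illustrative).
r_u(x, y) = \sum_{k=1}^{K} w_{u,k}\, r_k(x, y) = \mathbf{w}_u^\top \mathbf{r}(x, y)

% Probability that user u prefers response y_a over y_b for prompt x,
% in the logistic (Bradley-Terry) form assumed by the logistic-bandit view:
P(y_a \succ y_b \mid x, u) = \sigma\!\left( \mathbf{w}_u^\top \left[ \mathbf{r}(x, y_a) - \mathbf{r}(x, y_b) \right] \right)
```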

The three-stage architecture:

  1. Base reward learning — train a set of base reward functions from paired preference data annotated with user identity. Each base function captures one dimension of preference variation (e.g., conciseness vs detail, formality vs casualness).

  2. User coefficient inference — present the new user with a sequence of questions, each paired with two candidate responses, and ask which response they prefer. The questions are selected adaptively using active learning: each question is chosen to maximally reduce uncertainty about the user's coefficients. Results from logistic bandit theory enable efficient uncertainty computation (a sketch of this loop follows the list).

  3. Inference-time alignment — once user-specific coefficients are known, use inference-time methods to generate reward-aligned responses without modifying model weights. This enables scalable per-user adaptation.
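
A minimal sketch of stage 2, assuming each candidate question comes with a precomputed base-reward difference vector and that the coefficient posterior is kept as a Gaussian updated with a Laplace-style step (standard approximate Bayesian logistic regression). The selection rule and all names here are illustrative, not the paper's exact algorithm:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def infer_user_coefficients(candidate_pairs, ask_user, n_questions=10, prior_var=1.0):
    """Adaptively estimate a user's coefficients w from a few pairwise questions.

    candidate_pairs: array of shape (N, K); row i holds the base-reward
        difference vector r(x_i, y_a) - r(x_i, y_b) for candidate question i.
    ask_user: callable(i) -> 1 if the user prefers the first response of
        question i, else 0 (stands in for the real questioning UI).
    Returns the posterior mean over the K coefficients.
    """
    _, K = candidate_pairs.shape
    mu = np.zeros(K)                  # posterior mean of w_u
    Sigma = prior_var * np.eye(K)     # posterior covariance of w_u
    asked = set()

    for _ in range(n_questions):
        # Active selection: ask the question whose outcome is most uncertain
        # under the current posterior (largest variance of the logit d @ w).
        scores = [
            d @ Sigma @ d if i not in asked else -np.inf
            for i, d in enumerate(candidate_pairs)
        ]
        i = int(np.argmax(scores))
        asked.add(i)
        d = candidate_pairs[i]

        label = ask_user(i)           # observed preference for this question

        # Laplace-style online update for Bayesian logistic regression.
        p = sigmoid(mu @ d)
        H = p * (1 - p) * np.outer(d, d)                 # observed information
        Sigma = np.linalg.inv(np.linalg.inv(Sigma) + H)  # shrink uncertainty
        mu = mu + Sigma @ ((label - p) * d)              # move toward the data

    return mu
```

The `d @ Sigma @ d` score is a predictive-variance heuristic; the logistic-bandit results the note mentions would justify a sharper confidence-set construction, but the overall loop (select the most informative pair, ask, update the posterior) has the same shape.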

The practical significance: 10-20 questions suffice. This is dramatically more efficient than approaches requiring historical interaction data or per-user fine-tuning. The active learning component is critical — random question selection would require far more queries because most questions are uninformative for distinguishing between users.

The low-dimensional preference assumption is both the strength and the limitation. If real preferences don't decompose into a small number of base dimensions, the factorization misses important variation. However, the survey evidence from "How do personalization granularity levels trade precision against scalability?" suggests that persona-level personalization (group-based, moderate dimensionality) is often sufficient and that user-level precision trades against data requirements.

The inference-time alignment component connects to "Can decoding-time tuning preserve knowledge better than weight fine-tuning?". Both avoid weight modification per user, but PReF applies a user-specific reward function while proxy tuning applies a task-specific distributional shift. The combination suggests a design space: different axes of adaptation (user preferences, task requirements, domain knowledge) can each be applied at inference time through different mechanisms.
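
On the mechanism side, the simplest concrete instance of this kind of inference-time alignment is best-of-n reranking: sample several candidates and return the one that scores highest under the user's factorized reward. The note does not specify which inference-time method PReF uses, so treat this, and every name in it, as a hypothetical illustration:

```python
import numpy as np

def personalize_response(prompt, generate_candidates, base_rewards, w_user, n=8):
    """Best-of-n reranking with a user-specific factorized reward (illustrative).

    generate_candidates: callable(prompt, n) -> list of n candidate responses
        (hypothetical stand-in for any sampling-based generator).
    base_rewards: callable(prompt, response) -> length-K vector of base
        reward scores for this (prompt, response) pair.
    w_user: length-K coefficient vector inferred for this user in stage 2.
    """
    candidates = generate_candidates(prompt, n)
    # Score with the user's weighted combination of base rewards; no model
    # weights change, so per-user adaptation happens entirely at inference time.
    scores = [float(np.dot(w_user, base_rewards(prompt, c))) for c in candidates]
    return candidates[int(np.argmax(scores))]
```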


Source: Assistants Personalization

Original note: reward factorization represents user-specific preferences as linear combinations of base reward functions — 10 active-learning queries suffice for personalization