INQUIRING LINE

Can user preferences be represented as linear reward combinations?

This explores whether a user's tastes can be captured as a weighted mix of a few shared 'reward' building blocks — and what that buys you versus the alternatives the corpus has tried.


The most direct answer in the corpus is yes, and surprisingly cheaply. PReF (see "Can user preferences be learned from just ten questions?") learns a set of base reward functions from preference data, then represents any individual as a linear combination of those bases. The clever part is how the coefficients are found: instead of needing mountains of data per person, active learning picks the ten most informative questions to pin down where a user sits in that shared space. The result is personalized alignment at inference time, with no model weights touched.
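As a minimal sketch of the idea (not PReF's actual implementation), the snippet below treats a user's reward as a weighted sum of base-reward scores, fits the weights from a few pairwise answers with a Bradley-Terry likelihood, and picks the next question by simple uncertainty sampling. The function names, the two base rewards ("helpfulness" and "brevity"), the toy answers, and the uncertainty-sampling rule are all assumptions made for illustration; the source only says that active learning selects maximally informative questions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def user_reward(weights, base_scores):
    """Linear combination: r_u(y) = sum_k w_k * r_k(y)."""
    return dot(weights, base_scores)

def prefer_prob(weights, scores_a, scores_b):
    """Bradley-Terry probability that the user prefers response A over B."""
    diff = user_reward(weights, scores_a) - user_reward(weights, scores_b)
    return 1.0 / (1.0 + math.exp(-diff))

def fit_weights(answers, dim, lr=0.5, steps=500, l2=0.01):
    """Gradient ascent on an L2-regularized Bradley-Terry log-likelihood.

    answers: list of (scores_a, scores_b, a_preferred), a_preferred in {0, 1}.
    """
    w = [0.0] * dim
    for _ in range(steps):
        grad = [-l2 * wk for wk in w]
        for sa, sb, a_pref in answers:
            err = a_pref - prefer_prob(w, sa, sb)
            for k in range(dim):
                grad[k] += err * (sa[k] - sb[k])
        w = [wk + lr * g / len(answers) for wk, g in zip(w, grad)]
    return w

def most_uncertain_pair(weights, candidate_pairs):
    """Stand-in for active learning: pick the pair closest to a coin flip."""
    return min(candidate_pairs,
               key=lambda p: abs(prefer_prob(weights, p[0], p[1]) - 0.5))

# Hypothetical base rewards: (helpfulness, brevity). Each answer records
# the base-reward scores of two responses and which one the user chose.
answers = [
    ([1.0, 0.0], [0.0, 1.0], 1),
    ([0.8, 0.2], [0.3, 0.9], 1),
    ([0.2, 1.0], [0.9, 0.1], 0),
]
w = fit_weights(answers, dim=2)

candidates = [([1.0, 0.0], [0.0, 1.0]), ([0.5, 0.5], [0.6, 0.6])]
next_question = most_uncertain_pair([1.0, -1.0], candidates)
```

Only the short coefficient vector `w` is user-specific here; the base rewards are shared across users, which is why a handful of answers can be enough to place someone in the space.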

But the interesting thing is *why* the field reached for linear combinations in the first place: it is a reaction against a real failure. A single 'averaged' reward model cannot represent disagreement. Faced with a 51-49 split, it must either leave the 49% permanently unhappy or leave everyone unhappy about half the time (see "Can aggregate reward models satisfy genuinely disagreeing users?"). The standard Bradley-Terry setup makes this worse, collapsing genuinely multi-modal preferences into a centroid that serves nobody (see "Do unimodal reward models actually serve all user preferences?"). VPL's fix there, conditioning the reward on a latent user vector, is a close cousin of factorization: both say a user is a *point in a learned preference space* rather than a single global utility.
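The 51-49 arithmetic can be worked through directly. The sketch below assumes two responses, A and B, a population in which 51% prefer A, and a single Bradley-Terry reward fit by maximum likelihood. Reading "everyone unhappy half the time" as a policy that serves A with the fitted preference probability is an interpretation, not something the source spells out.

```python
import math

share_a = 0.51  # assumed fraction of users who prefer response A

# Under Bradley-Terry, the maximum-likelihood reward gap satisfies
# sigmoid(gap) = share_a, so the gap is the log-odds of the split.
gap = math.log(share_a / (1 - share_a))

def satisfaction(p_serve_a, share_a):
    """Return (group A satisfaction, group B satisfaction, population mean)."""
    sat_a = p_serve_a        # A-preferring users are happy when served A
    sat_b = 1 - p_serve_a    # B-preferring users are happy when served B
    return sat_a, sat_b, share_a * sat_a + (1 - share_a) * sat_b

# Option 1: always serve the higher-reward response.
greedy = satisfaction(1.0, share_a)

# Option 2: serve A with probability sigmoid(gap) = share_a.
sampled = satisfaction(share_a, share_a)
```

The fitted gap is tiny (about 0.04), yet a greedy policy turns it into a rule that never serves the 49%. The sampled policy satisfies each group only about half the time, for a population mean of 0.5002. Neither matches a per-user reward, which would serve every user their preferred response.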

The recommender-systems corner of the corpus arrived at the same idea from a different door, and the tension is worth noticing. AMP-CF argues users aren't a single latent vector at all but several personas, weighted by attention depending on what's being recommended (see "Can attention mechanisms reveal which user taste explains each recommendation?" and "Can modeling multiple user personas improve recommendation accuracy?"). That's still a mixture of basis components, but the weights are *dynamic*, shifting per item, not a fixed coefficient vector. So the open question the linear view raises: are your reward weights a stable fingerprint, or do they move depending on context?
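To make the contrast with a fixed coefficient vector concrete, here is a minimal sketch of candidate-conditional persona weighting. It is an illustration of the general mechanism only, not AMP-CF's architecture: the dot-product attention, the two-dimensional embeddings, and the persona and item names are all assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def persona_weights(personas, item):
    """Attention weights over personas, conditioned on the candidate item."""
    return softmax([dot(p, item) for p in personas])

def score(personas, item):
    """Score an item against the attention-weighted user representation."""
    w = persona_weights(personas, item)
    user_vec = [sum(wk * p[d] for wk, p in zip(w, personas))
                for d in range(len(item))]
    return dot(user_vec, item)

# Hypothetical user with two personas: one "sci-fi" taste, one "cooking" taste.
personas = [[2.0, 0.0], [0.0, 2.0]]
scifi_item = [1.0, 0.0]
cooking_item = [0.0, 1.0]

w_scifi = persona_weights(personas, scifi_item)
w_cooking = persona_weights(personas, cooking_item)
```

The same user gets different mixture weights for the two items: the sci-fi persona dominates `w_scifi` and the cooking persona dominates `w_cooking`. A fixed linear reward combination would use one weight vector for both.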

There are also voices arguing the linear-reward frame is the wrong abstraction entirely. One line shows that text-based preference *summaries* condition reward models more effectively than embedding vectors, and stay interpretable to the user (see "Can text summaries beat embeddings for personalized reward models?"); a related result finds that abstract semantic preference knowledge beats replaying specific past interactions (see "Does abstract preference knowledge outperform specific interaction recall?"). And feedback itself may not be scalar-decomposable: agent signals carry both *evaluative* ('how good was that') and *directive* ('how should it change') information, and a reward number throws the second kind away (see "Can scalar rewards capture all the information in agent feedback?"). If preference has a directional component, a linear reward sum can't hold all of it.

The thing worth walking away with: representing users as linear reward combinations is cheap and elegant, but it inherits a hazard. The moment you give each user their own reward, you lose the averaging that quietly suppressed sycophancy. Personalized reward models can amplify echo chambers and flattery exactly the way recommender feeds did (see "Does personalizing reward models amplify user echo chambers?"). So 'can we?' has a clean yes; 'should we, unguarded?' is where the corpus gets nervous.


Sources (9 notes)

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Models can then be personalized to each user via inference-time reward alignment, without weight modification.

Can aggregate reward models satisfy genuinely disagreeing users?

Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.

Do unimodal reward models actually serve all user preferences?

Standard BTL reward models assume a single utility function, but when preferences are genuinely multi-modal across user groups, maximum-likelihood fitting produces a centroid policy that optimizes nobody's utility. VPL recovers multi-modal distributions using latent user context, enabling user-conditional reward modeling.

Can attention mechanisms reveal which user taste explains each recommendation?

AMP-CF represents each user as multiple latent personas weighted dynamically by candidate item. This makes recommendations both diverse and interpretable—each suggestion traces to the specific persona preference it satisfies—without requiring post-hoc reranking.

Can modeling multiple user personas improve recommendation accuracy?

AMP-CF separates user representation into latent personas weighted by attention to the candidate item. This candidate-conditional approach improves accuracy by adapting the user representation at prediction time and produces inherent explanations for why items were recommended.

Can text summaries beat embeddings for personalized reward models?

PLUS trains summarizers and reward models jointly, producing text-based preference summaries that capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.

Does abstract preference knowledge outperform specific interaction recall?

The PRIME framework shows that semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference-tuning methods.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Does personalizing reward models amplify user echo chambers?

Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.
