Can users modify their preference summaries to steer model behavior?

This explores whether the preference summaries that systems build about you are something you can actually read and edit to change how a model behaves: not hidden weights, but a steering wheel you can grab.

Put differently, the question is whether a preference summary is an editable control surface (a panel you can open, rewrite, and use to redirect a model) rather than an opaque profile the system keeps to itself. The corpus is encouraging here: a cluster of recent work has deliberately moved personalization out of model weights and into natural-language text so that it stays legible and changeable.

The strongest yes comes from systems that treat your stated preferences as a runtime input rather than a training target. Mender conditions a sequential recommender on natural-language preferences and lets you steer results at inference with no fine-tuning. It succeeds on the preference-following cases where conventional recommenders fail, because the preference is something you hand the model at query time, not something baked in during training [Can users steer recommendations with natural language at inference?]. In the same spirit, PLUS learns *text* preference summaries (not embedding vectors) and finds that they condition reward models more effectively, remain interpretable to users, and even transfer to an off-the-shelf model like GPT-4 for zero-shot personalization [Can text summaries beat embeddings for personalized reward models?]. Text is the key design choice: if the summary is a paragraph rather than a vector, you can read it, disagree with it, and rewrite it.
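A minimal sketch of what "preference as a runtime input" could look like in practice. This is not Mender's or PLUS's implementation; the `PreferenceSummary` class, the `build_prompt` helper, and the prompt wording are all hypothetical names chosen for illustration. The only point it makes is that an edit to the text changes what the model receives on the next request, with no retraining step in between.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PreferenceSummary:
    """Hypothetical user-visible, user-editable summary stored as plain text."""
    text: str
    history: List[str] = field(default_factory=list)  # earlier versions, kept for review

    def edit(self, new_text: str) -> None:
        # Keep the previous version so the user can see what changed.
        self.history.append(self.text)
        self.text = new_text.strip()


def build_prompt(summary: PreferenceSummary, request: str) -> str:
    # The summary is passed in at query time; nothing is retrained.
    return (
        "User preference summary (editable by the user):\n"
        f"{summary.text}\n\n"
        f"Request: {request}\n"
        "Follow the preference summary when answering."
    )


summary = PreferenceSummary("Prefers short answers with code examples.")
before = build_prompt(summary, "Explain binary search.")

# The user disagrees with the summary and rewrites it.
summary.edit("Prefers detailed, step-by-step explanations without code.")
after = build_prompt(summary, "Explain binary search.")
```

Here `before` and `after` differ only in the summary text, which is exactly the part the user controls.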

There's also a softer route to steering: not editing the summary directly, but feeding the model the kind of input it can convert into one. With few-shot prompting, LLMs can transform a natural complaint like "this doesn't look good for a date" into a positive, retrievable preference ("prefer more romantic"), so an ordinary critique becomes a steering signal [Can language models bridge the gap between critique and preference?]. And PReF shows you can pin down a personalized reward profile from as few as ten well-chosen questions, adjusting behavior entirely at inference without touching weights [Can user preferences be learned from just ten questions?]. Both suggest the summary is downstream of things you actively control.
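The critique-to-preference step can be sketched as a few-shot prompt builder. This is an assumed prompt format, not the one used in the cited work: the first example pair comes from the text above, while the other two pairs, the instruction line, and the function name `critique_to_preference_prompt` are made up for illustration. The sketch only assembles the prompt string; sending it to a model is left out.

```python
from typing import List, Tuple

FEW_SHOT_EXAMPLES: List[Tuple[str, str]] = [
    # First pair is the example from the text; the others are illustrative only.
    ("this doesn't look good for a date", "prefer more romantic"),
    ("too loud to have a conversation", "prefer quieter places"),
    ("way over my budget", "prefer cheaper options"),
]


def critique_to_preference_prompt(critique: str) -> str:
    """Build a few-shot prompt asking a model to restate a complaint as a preference."""
    lines = ["Rewrite each critique as a positive preference.", ""]
    for complaint, preference in FEW_SHOT_EXAMPLES:
        lines.append(f"Critique: {complaint}")
        lines.append(f"Preference: {preference}")
        lines.append("")
    # The new critique goes last, with the preference left for the model to fill in.
    lines.append(f"Critique: {critique}")
    lines.append("Preference:")
    return "\n".join(lines)


prompt = critique_to_preference_prompt("the portions are tiny")
```

The model's completion would then be a short positive preference that could be appended to the user's summary or used as a retrieval query.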

One finding reframes what "editing" even means. PRIME shows that abstract preference summaries (semantic memory) consistently beat replaying your past interactions (episodic memory) for personalization [Does abstract preference knowledge outperform specific interaction recall?]. That matters for steering: the thing driving behavior is a compressed abstraction of you, so editing that abstraction gives you more leverage than trying to curate your raw history. The summary isn't a log; it's a model of your taste, and models can be corrected.
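To make the semantic/episodic contrast concrete, here is a small sketch of the two kinds of context a personalized system might build. It is not PRIME's code; the log entries, the summary text, and both helper functions are hypothetical. It assumes episodic recall is recency-based (replay the most recent interactions) and that semantic memory is a single text summary.

```python
from typing import List

# Hypothetical interaction log (episodic memory): raw events, newest last.
interaction_log: List[str] = [
    "clicked: thriller novel",
    "skipped: romance novel",
    "clicked: true-crime podcast",
    "clicked: thriller film",
]

# Hypothetical summary (semantic memory): one compressed statement of taste.
preference_summary = "Likes thrillers and true crime; avoids romance."


def episodic_context(log: List[str], k: int) -> str:
    # Recency-based recall: replay the k most recent interactions.
    return "\n".join(log[-k:])


def semantic_context(summary: str) -> str:
    # The summary is used as-is, so whatever the user writes is what the model sees.
    return f"Preference summary: {summary}"


# Steering through the summary is a single edit to one string.
preference_summary = "Likes thrillers; open to romance with a mystery plot."
context = semantic_context(preference_summary)
recent = episodic_context(interaction_log, 2)
```

Changing the behavior driven by `recent` would mean adding or deleting individual log entries, whereas the summary is corrected in one place.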

The thing you didn't know you wanted to know: per-user personalization cuts both ways. Personalized reward models, freed from the averaging effect of a shared model, can quietly learn to flatter you and harden your existing views, the same sycophancy-and-echo-chamber failure seen in recommender systems [Does personalizing reward models amplify user echo chambers?]. So a readable, editable preference summary isn't just a convenience feature; it may be the main safeguard, because it's the one point where you can *see* that the system has decided you only want one kind of answer, and overrule it.

Sources (6 notes)

Can users steer recommendations with natural language at inference?

Mender conditions sequential recommenders on natural-language preferences extracted from reviews, enabling users to steer recommendations at inference without fine-tuning. This approach succeeds on preference-following tasks where traditional recommenders fail because preferences are runtime inputs, not training targets.

Can text summaries beat embeddings for personalized reward models?

PLUS trains summarizers and reward models jointly, and the learned text-based preference summaries capture dimensions that zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.

Can language models bridge the gap between critique and preference?

Few-shot LLM prompting can convert natural negative feedback like "doesn't look good for a date" into positive preferences like "prefer more romantic," enabling retrieval systems to find better-matching recommendations without fine-tuning.

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Personalization to a given user then happens through inference-time reward alignment, without weight modification.

Does abstract preference knowledge outperform specific interaction recall?

The PRIME framework shows that semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference-tuning methods.

Does personalizing reward models amplify user echo chambers?

Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.
