INQUIRING LINE

How should preference channels from historical sessions inform unified policy learning?

This explores how a system should fold what it has learned about a user across past sessions — their stored preference signals — into a single decision-making policy, rather than learning preferences and learning what to do as separate problems.


The corpus's sharpest claim is that unification beats separation at the decision layer: when a conversational recommender treats "what to ask," "what to recommend," and "when to act" as one graph-based RL policy instead of three modules, the gradient signals inform each other and the whole conversation gets optimized as a trajectory rather than as disconnected steps (Can unified policy learning improve conversational recommender systems?). That's the structural argument for why historical preference channels shouldn't be a sidecar feature feeding a frozen policy: they belong inside the same optimization loop.
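As a rough illustration of the unified framing, the sketch below puts asking and recommending into one action space scored by a single softmax, so "when to act" is simply the turn at which a recommend action wins. The `Action` type, the score values, and both function names are hypothetical, and the graph-based state encoder from the cited work is left out entirely.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str    # "ask" or "recommend" (hypothetical labels)
    target: str  # attribute name or item id


def unified_policy(scores: dict[Action, float]) -> dict[Action, float]:
    """Softmax over ONE joint action space instead of three separate modules."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {a: math.exp(s - m) for a, s in scores.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


def discounted_return(rewards: list[float], gamma: float = 0.9) -> float:
    """Return of a whole conversation, so credit flows across turns."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Illustrative scores only; in practice they would come from a learned model.
scores = {
    Action("ask", "genre"): 1.2,
    Action("ask", "price"): 0.4,
    Action("recommend", "item_42"): 0.9,
}
probs = unified_policy(scores)
best = max(probs, key=probs.get)
```

Because asks and recommendations compete in the same distribution, a reward that arrives only when a recommendation is accepted still shapes which questions were asked earlier in the trajectory.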

But the harder question is what *form* those historical preferences should take before they touch the policy, and here the corpus pushes against the obvious answer. The instinct is to retrieve past interactions and let the policy condition on them. The PRIME work argues the opposite: abstracted, semantic preference summaries consistently beat raw episodic recall across models and, tellingly, when past interactions are recalled at all, recency-based recall beats similarity-based retrieval (Does abstract preference knowledge outperform specific interaction recall?). So a session history isn't best consumed as a transcript to search; it's best distilled into compact preference knowledge. A related asymmetry shows up in how trajectories should be stored at all: SkillRL keeps successes as concrete demonstrations and compresses failures into abstracted lessons, which both saves context and learns better than treating every past episode uniformly (Should successful and failed episodes be processed differently?).
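One way to picture "distilled, recency-weighted preference knowledge" is the sketch below. It is an assumption-laden toy: the event tuple layout, the exponential half-life decay, and the names `summarize_preferences` and `store_episode` are all hypothetical, and the `summarize` callable stands in for whatever abstraction step (for example an LLM summarizer) a real system would use.

```python
from collections import defaultdict


def summarize_preferences(events, now, half_life=7.0, top_k=3):
    """Collapse raw session events into a compact, recency-weighted summary.

    events: iterable of (timestamp_days, attribute, signal), signal in [-1, 1].
    Older events count less: the weight halves every `half_life` days.
    """
    scores = defaultdict(float)
    for t, attribute, signal in events:
        age = max(0.0, now - t)
        weight = 0.5 ** (age / half_life)
        scores[attribute] += weight * signal
    ranked = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_k]


def store_episode(memory, steps, succeeded, summarize):
    """Asymmetric storage: keep successes verbatim, compress failures."""
    if succeeded:
        memory["demonstrations"].append(list(steps))
    else:
        memory["lessons"].append(summarize(steps))


events = [(0.0, "jazz", 1.0), (14.0, "rock", 1.0)]
summary = summarize_preferences(events, now=14.0, half_life=7.0)

memory = {"demonstrations": [], "lessons": []}
store_episode(memory, ["ask genre", "recommend item_42"], True, summarize=str)
store_episode(memory, ["recommend item_7"], False,
              summarize=lambda steps: f"avoid: {steps[-1]}")
```

The policy would then condition on `summary` and the short `lessons` list rather than on the full transcript.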

There's a whole family of approaches that let history shape the policy *without* retraining it. AgentFly formalizes the agent as a memory-augmented MDP where credit assignment and policy improvement happen entirely through memory operations, with no weight updates (Can agents learn continuously from experience without updating weights?). PReF takes a parametric route: it learns base reward functions, then infers a user's personal reward coefficients from as few as ten adaptive questions, aligning the policy at inference time rather than fine-tuning (Can user preferences be learned from just ten questions?). And M3-Agent shows preference channels needn't come from explicit asking at all: an entity-centric memory graph can infer them from continuous observation, separating episodic events from semantic knowledge the way human memory binds facts about a person over time (Can agents learn preferences by watching rather than asking?).
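The parametric route can be sketched as follows, under stated assumptions: a user's reward is taken to be a linear mix of fixed base reward features, and the mixing coefficients are fit to a handful of pairwise answers with a Bradley-Terry style logistic likelihood. The function names, the learning rate, and the small L2 penalty are hypothetical choices, and the adaptive selection of maximally informative questions described for PReF is not implemented here; the comparisons are simply given.

```python
import math


def infer_coefficients(comparisons, dim, lr=0.5, steps=200, l2=0.01):
    """Fit per-user reward coefficients from pairwise preferences.

    comparisons: list of (phi_preferred, phi_rejected), each a feature vector
    of base-reward values. The base rewards stay frozen; only `w` is learned.
    """
    w = [0.0] * dim
    n = len(comparisons)
    for _ in range(steps):
        grad = [0.0] * dim
        for phi_a, phi_b in comparisons:
            diff = [a - b for a, b in zip(phi_a, phi_b)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            p = 1.0 / (1.0 + math.exp(-margin))  # P(preferred beats rejected)
            for i, di in enumerate(diff):
                grad[i] += (1.0 - p) * di        # log-likelihood gradient
        w = [wi + lr * (gi / n - l2 * wi) for wi, gi in zip(w, grad)]
    return w


def personal_reward(w, phi):
    """Inference-time reward: weighted sum of the frozen base rewards."""
    return sum(wi * xi for wi, xi in zip(w, phi))


# Three answered questions; features are (base_reward_0, base_reward_1).
comparisons = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 1.0], [0.0, 1.0]),
    ([1.0, 0.0], [1.0, 1.0]),
]
w = infer_coefficients(comparisons, dim=2)
```

Only the coefficient vector changes per user, which is what makes this an inference-time alignment rather than a fine-tuning step.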

The part you might not expect is the warning attached to all of this. Folding per-user history into the reward signal is precisely how you remove the averaging effect that keeps an aggregate model honest: personalized reward models can learn sycophancy and reinforce echo chambers, mirroring the failure modes of recommender systems (Does personalizing reward models amplify user echo chambers?). So "more personal history in the policy" is not monotonically good. A couple of corpus ideas hint at guardrails: POLAR reframes reward modeling as measuring distance from a target policy rather than from absolute preference labels, which gives you a reference point that isn't just "whatever this user liked last" (Can reward models learn by comparing policies instead of judging them?), and hierarchical RL with meta-learning specifically prevents a master policy from collapsing onto one dominant behavior across diverse user types (Can meta-learning prevent dialogue policies from collapsing?).
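To make the "distance from a target policy" idea concrete, here is a deliberately small toy. It scores a candidate policy's action distribution by its negative KL divergence from a chosen target distribution. This is a swapped-in stand-in for illustration only, not POLAR's training procedure, and `reference_reward` is a hypothetical name.

```python
import math


def reference_reward(target, candidate):
    """Score a candidate by closeness to a target policy, not by user labels.

    Both arguments are discrete action distributions of equal length.
    Returns -KL(target || candidate), so 0.0 is the best possible score.
    """
    if len(target) != len(candidate):
        raise ValueError("distributions must have the same length")
    kl = 0.0
    for p, q in zip(target, candidate):
        if p == 0.0:
            continue  # zero-probability target actions contribute nothing
        if q <= 0.0:
            raise ValueError("candidate gives zero mass to a target action")
        kl += p * math.log(p / q)
    return -kl


target = [0.7, 0.2, 0.1]   # reference behavior, chosen independently of the user
near = [0.6, 0.3, 0.1]
far = [0.1, 0.2, 0.7]      # e.g. a policy that drifted toward one user's habits
```

The point of the toy is the anchoring: the score depends on a fixed reference rather than on the most recent thing a single user approved of.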

Put together, the corpus's answer is layered: unify the decision policy so preference signals and actions co-optimize; feed it abstracted, recency-weighted preference knowledge rather than raw episodic logs; prefer inference-time alignment or memory operations over constant retraining; and build in a counter-pressure against the sycophancy that personalization invites. If you want to stress-test any of this without burning real users, the synthetic-user-simulator line, which conditions an LLM on session-level profile and turn-level intent variables, gives you controllable historical channels to experiment against (Can controlled latent variables make LLM user simulators realistic?).
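A simulator of that kind could be driven by something like the sketch below, which only assembles the conditioning prompt and makes no model call. The `SessionProfile` and `TurnIntent` fields, the prompt wording, and the function name are all hypothetical; the cited work's actual variables and prompt format are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class SessionProfile:
    """Session-level latent variable: fixed for the whole conversation."""
    user_id: str
    liked: list[str]
    disliked: list[str]


@dataclass
class TurnIntent:
    """Turn-level latent variable: resampled or set on every turn."""
    name: str
    detail: str = ""


def build_simulator_prompt(profile, intent, history):
    """Compose the text an LLM user simulator would be conditioned on."""
    lines = [
        "You are simulating a user talking to a recommender system.",
        f"Likes: {', '.join(profile.liked) or 'none stated'}",
        f"Dislikes: {', '.join(profile.disliked) or 'none stated'}",
        f"Intent for this turn: {intent.name} {intent.detail}".rstrip(),
        "Conversation so far:",
    ]
    lines.extend(f"- {turn}" for turn in history)
    lines.append("Reply with the user's next message only.")
    return "\n".join(lines)


profile = SessionProfile(user_id="u1", liked=["jazz"], disliked=[])
intent = TurnIntent(name="reject_recommendation", detail="too expensive")
prompt = build_simulator_prompt(profile, intent, ["System: How about item_42?"])
```

Holding `profile` fixed while varying `intent` gives a controllable historical channel: the same simulated user can be replayed against different policies.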


Sources (10 notes)

Can unified policy learning improve conversational recommender systems?

Research shows that formulating attribute-asking, item-recommending, and timing decisions as a single graph-based RL policy achieves better joint optimization than isolated components. Separation prevents gradient signals from informing one another and fails to optimize conversation trajectory holistically.

Does abstract preference knowledge outperform specific interaction recall?

PRIME framework shows semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Users can be personalized via inference-time reward alignment without weight modification.

Can agents learn preferences by watching rather than asking?

M3-Agent demonstrates that separating episodic events from semantic knowledge in an entity-centric graph, combined with parallel memorization and control processes, allows agents to infer and act on user preferences without asking. This architecture mirrors human cognitive systems that bind disparate information about individuals across sensory modalities.

Does personalizing reward models amplify user echo chambers?

Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.

Can reward models learn by comparing policies instead of judging them?

POLAR reframes reward modeling as policy discrimination: RMs assign higher scores to policies similar to a chosen target, eliminating absolute preference labels. Pre-trained 1.8B-7B parameter POLAR RMs substantially outperform non-pre-trained methods and transfer across task formulations.

Can meta-learning prevent dialogue policies from collapsing?

Without MAML, hierarchical RL for Motivational Interviewing phases collapses to a dominant action regardless of user type. Meta-learning enables the master policy to maintain variability and adapt across diverse user profiles.

Can controlled latent variables make LLM user simulators realistic?

RecLLM demonstrates that conditioning an LLM simulator on session-level (user profile) and turn-level (user intent) latent variables produces synthetic conversations measurable as realistic via crowdsource discrimination, discriminator models, and classifier-ensemble distribution matching.
