INQUIRING LINE

How do personalization errors differ from general accuracy problems in summaries?

This explores what makes a summary fail *for a particular person* — getting the wrong things or mis-attributing them — versus failing on plain truthfulness, and why the corpus treats these as two different problems with different fixes.


The corpus draws a sharper line than you might expect between personalization errors (a summary that is true but wrong *for you*) and general accuracy errors (a summary that is simply false or ungrounded). A summary can be factually flawless and still fail an individual. In a user study of meeting summaries, the central complaint was not that the model lied. It was that the system summarized *global* importance instead of what mattered to the specific reader, and that speaker mis-attributions damaged group trust and accountability even when the underlying facts were right (Why do LLM meeting summaries fail to help individuals?). That is the signature of a personalization error: the content is accurate, but its relevance is calibrated to the wrong reader and its "who said what" is pinned on the wrong speaker.

The most striking difference is the *shape* of the error. General accuracy failures tend to get worse the further the model drifts from its evidence: hallucination rises as sources degrade (Can RAG systems refuse to answer without reliable evidence?), and models override their own context when training priors are strong (Why do language models ignore information in their context?). Personalization errors follow the opposite pattern. The PRIME study found a U-shaped error curve in which the worst mistakes come not from a totally wrong user profile but from one that is *almost* right. The model confidently applies nearly-matched preferences, an uncanny-valley effect more damaging than obvious mismatch (Why do similar user profiles produce worse personalization errors?). So accuracy errors scale with distance from the truth, while personalization errors spike with deceptive *closeness* to the right person.

They also live in different layers of the content. Accuracy is about semantic facts; personalization, it turns out, is mostly about style and preference. Profiles built from a user's past *outputs* match or beat full profiles, while profiles built from their *inputs* actively hurt, which suggests that personalization rides on how someone writes and what they prefer, not on the topical content of their queries (Do user outputs outperform inputs for LLM personalization?). Abstracted preference summaries also beat literal recall of past interactions (Does abstract preference knowledge outperform specific interaction recall?). This means a personalization error is not a missing fact you could retrieve. It is a misread of taste, which retrieval alone cannot fix.

Because of that, the *fixes* diverge. Accuracy problems are addressed by grounding and refusal: constrain the model to say only what the evidence supports (Can RAG systems refuse to answer without reliable evidence?). Personalization problems are addressed by aligning the summary to a *downstream goal* or a *learned model of the person*. One approach trains summarizers against ranking rewards so they emphasize the attributes a user actually acts on (Can reinforcement learning align summarization with ranking goals?). Another learns text preference summaries that condition a reward model and capture dimensions a zero-shot summary misses (Can text summaries beat embeddings for personalized reward models?). A grounded-but-generic summary passes the accuracy test and fails the personalization test.
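The last point, that one summary can pass the accuracy test and fail the personalization test, can be made concrete with a toy sketch. Everything here is a hypothetical illustration and not drawn from any of the cited systems: the `Summary` class, the `is_grounded` and `relevance` functions, and the sample meeting data are all invented for the example. Exact string matching stands in for a real grounding check, and topic overlap stands in for a real model of the reader.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    claims: list[str]  # atomic statements the summary makes
    topics: set[str]   # topics the summary spends its space on


def is_grounded(summary: Summary, evidence: set[str]) -> bool:
    """Accuracy check: every claim must appear in the evidence.

    Exact string matching is a placeholder assumption; a real system
    would need an entailment or attribution check here.
    """
    return all(claim in evidence for claim in summary.claims)


def relevance(summary: Summary, reader_interests: set[str]) -> float:
    """Personalization check: share of covered topics this reader cares about."""
    if not summary.topics:
        return 0.0
    return len(summary.topics & reader_interests) / len(summary.topics)


# Hypothetical meeting evidence and a generic, fully grounded summary.
evidence = {"Q3 budget was approved", "Launch moved to May"}
generic = Summary(
    claims=["Q3 budget was approved", "Launch moved to May"],
    topics={"budget", "launch"},
)
reader_interests = {"hiring"}  # this reader only cares about hiring

print(is_grounded(generic, evidence))        # True
print(relevance(generic, reader_interests))  # 0.0
```

The two checks take different inputs: grounding compares the summary to the evidence, while relevance compares it to the reader. Tightening the first does nothing for the second, which is why the fixes diverge.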

The thing worth carrying away: personalization errors are *relational and confidence-amplified* in a way accuracy errors are not. A false fact can be checked against the evidence and flagged as a hallucination. A confidently mis-personalized summary feels authoritative precisely because it is well-formed and almost right. Downstream, people rarely catch it: writers edit AI-generated text only 23% of the time before it reaches an audience (Do writers actually edit AI-generated text before publishing?). The accuracy bug announces itself; the personalization bug hides inside a fluent, trustworthy-looking summary aimed at the wrong reader.


Sources (9 notes)

Why do LLM meeting summaries fail to help individuals?

A user study of seven participants found three critical failures: systems summarize global importance rather than individual relevance, mis-attributions damage group trust and accountability, and one format cannot serve both quick scanning and detailed reference needs.

Why do similar user profiles produce worse personalization errors?

PRIME shows a U-shaped error curve where most-similar profile replacements cause steepest performance drops. The model confidently applies wrong preferences when profiles are nearly but not truly matched, an uncanny valley effect more harmful than obvious mismatch.

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Do user outputs outperform inputs for LLM personalization?

Research shows that user profiles built from outputs alone match or exceed performance of complete profiles across multiple tasks, while input-only profiles degrade performance. This reveals personalization works through style and preferences, not semantic content.

Does abstract preference knowledge outperform specific interaction recall?

PRIME framework shows semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.

Can reinforcement learning align summarization with ranking goals?

ReLSum trains summarizers using downstream relevance scores as RL rewards, producing dense, attribute-focused summaries instead of fluent prose. This alignment to the actual ranking metric improves recall, NDCG, and user engagement in production e-commerce search.

Can text summaries beat embeddings for personalized reward models?

PLUS trains summarizers and reward models jointly, learning that text-based preference summaries capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.

Do writers actually edit AI-generated text before publishing?

Writers edited AI-generated paragraphs only 23% of the time, with edits averaging 96% similarity to the original. This means AI's opinionated and distorted voice propagates with minimal human filtering before publication.
