What metrics capture whether recommendations reflect a user's full taste range?
This explores how you'd actually measure whether a recommendation list mirrors the spread of someone's interests — not just whether it nails their single biggest preference — and the corpus suggests the standard accuracy metrics quietly work against that goal.
The sharpest answer in the collection is that the metrics most systems optimize for don't capture taste range at all. These are top-N ranking scores like NDCG and Recall, the same ones used as reward signals to train language models (Can recommendation metrics train language models directly?). They reward getting the most relevant items to the top, and as Steck's work on calibrated recommendations shows, ranking purely by per-item relevance produces lists dominated by your primary interest even when your history clearly documents secondary ones (Do accuracy-optimized recommendations preserve user interest diversity?). The metric that does capture range is calibration: does the *proportion* of, say, documentaries to thrillers in your recommendations match the proportion in what you've actually watched? A list can score beautifully on accuracy while being badly miscalibrated.
That reframes the question. "Full taste range" isn't one number — it's the gap between what you optimize and what you measure. If accuracy is the only lens, minority interests get crowded out silently, and you won't see it unless you explicitly measure proportional representation against the user's documented interest mix.
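As a rough illustration of measuring proportional representation, the sketch below compares the genre mix of a user's history with the genre mix of a recommendation list using a smoothed KL divergence, in the spirit of the calibration idea described above. The function names, the `genres_of` lookup, the equal split of an item across its genres, and the smoothing constant `alpha` are all assumptions made for this example, not details taken from the corpus.

```python
import math
from collections import Counter


def genre_distribution(items, genres_of):
    """Share of each genre across `items`.

    An item's unit weight is split evenly among its genres
    (an assumption of this sketch).
    """
    totals = Counter()
    for item in items:
        genres = genres_of[item]
        for genre in genres:
            totals[genre] += 1.0 / len(genres)
    total_weight = sum(totals.values())
    if total_weight == 0:
        return {}
    return {genre: w / total_weight for genre, w in totals.items()}


def calibration_kl(history_dist, rec_dist, alpha=0.01):
    """KL divergence from the history mix to the recommendation mix.

    The recommendation distribution is smoothed toward the history
    distribution so that a genre missing from the list does not
    produce a division by zero. 0.0 means perfectly calibrated;
    larger values mean the list drifts further from the history mix.
    """
    kl = 0.0
    for genre, p in history_dist.items():
        if p == 0.0:
            continue
        q = (1 - alpha) * rec_dist.get(genre, 0.0) + alpha * p
        kl += p * math.log(p / q)
    return kl


# Hypothetical catalogue: 70% of the history is thrillers, 30% documentaries.
genres_of = {f"t{i}": ["thriller"] for i in range(10)}
genres_of.update({f"d{i}": ["documentary"] for i in range(10)})

history = [f"t{i}" for i in range(7)] + [f"d{i}" for i in range(3)]
focused = [f"t{i}" for i in range(10)]                               # thrillers only
balanced = [f"t{i}" for i in range(7)] + [f"d{i}" for i in range(3, 6)]

p = genre_distribution(history, genres_of)
print(calibration_kl(p, genre_distribution(focused, genres_of)))
print(calibration_kl(p, genre_distribution(balanced, genres_of)))
```

The thriller-only list can still rank highly relevant items, yet it scores far worse on this measure than the list that keeps the 70/30 mix, which is the kind of gap a ranking metric alone would not reveal.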
The corpus also points to *why* range collapses, which matters because a metric only helps if you know the failure it's catching. One culprit is structural: when embedding dimensions are too small, systems overfit toward popular items to maximize ranking quality, and niche interests starve over time. You can only catch that distortion by tracking long-term exposure of niche items, not by taking a single snapshot of list quality (Does embedding dimensionality secretly drive popularity bias in recommenders?). So genuine range measurement has a temporal dimension: it's about whether secondary tastes keep getting represented across many sessions, not whether one list looks diverse today.
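One way to make the temporal dimension concrete is to log what share of each session's recommendation slots goes to niche items and then look at the trend across sessions. The sketch below is a minimal version of that idea. How "niche" is defined (here, simply a set of item ids passed in by the caller) and the use of a least-squares slope as the trend measure are assumptions of this example, not something the corpus prescribes.

```python
def niche_exposure_by_session(sessions, niche_items):
    """Fraction of each session's recommended slots filled by niche items.

    `sessions` is a list of recommendation lists, ordered oldest first.
    """
    shares = []
    for recs in sessions:
        if not recs:
            shares.append(0.0)
            continue
        hits = sum(1 for item in recs if item in niche_items)
        shares.append(hits / len(recs))
    return shares


def exposure_trend(shares):
    """Least-squares slope of exposure share against session index.

    A negative slope means niche items are losing exposure over time.
    """
    n = len(shares)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(shares) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(shares))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


# Hypothetical log: niche items n1 and n2 fade out over three sessions.
sessions = [
    ["a", "b", "n1", "n2"],
    ["a", "b", "c", "n1"],
    ["a", "b", "c", "d"],
]
shares = niche_exposure_by_session(sessions, niche_items={"n1", "n2"})
print(shares, exposure_trend(shares))
```

Any single list in this log might look acceptable on its own; the decline only shows up when the sessions are read in sequence.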
A different thread suggests the cleaner fix may be representational rather than a post-hoc metric. Modeling a user as multiple attention-weighted personas means each recommendation traces back to the specific facet of taste it satisfies. Diversity becomes legible by construction, and you can read off which persona each item serves instead of measuring spread after the fact (Can attention mechanisms reveal which user taste explains each recommendation?; Can modeling multiple user personas improve recommendation accuracy?). In that view, the "metric" is whether your representation can even express more than one taste at once.
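To show what "reading off which persona each item serves" could look like, here is a toy sketch in which a user is a handful of persona vectors, attention over those personas is conditioned on the candidate item, and each recommended item is attributed to its highest-weighted persona. This is not the AMP-CF architecture itself: the dot-product attention, the argmax attribution rule, and every name below are assumptions chosen to keep the example small.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_with_personas(personas, item_vec):
    """Score one candidate and return the attention weights behind the score.

    The user vector is rebuilt per candidate as an attention-weighted
    sum of persona vectors, so the weights say which facet of taste
    the item is being matched against.
    """
    weights = softmax([dot(p, item_vec) for p in personas])
    user_vec = [
        sum(w * p[d] for w, p in zip(weights, personas))
        for d in range(len(item_vec))
    ]
    return dot(user_vec, item_vec), weights


def persona_coverage(personas, rec_item_vecs):
    """Count recommended items attributed to each persona (argmax attention)."""
    counts = [0] * len(personas)
    for item_vec in rec_item_vecs:
        _, weights = score_with_personas(personas, item_vec)
        counts[weights.index(max(weights))] += 1
    return counts


# Hypothetical 2-d embeddings: persona 0 leans on axis 0, persona 1 on axis 1.
personas = [[1.0, 0.0], [0.0, 1.0]]
recommended = [[2.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
print(persona_coverage(personas, recommended))
```

A persona whose count stays at zero across many lists is a facet of taste the system can represent but is not serving, which gives a coverage check without any post-hoc diversity score.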
The surprising turn: range may not be something a system can measure from your history alone. Social Poisson Factorization finds that friends with *different* tastes, not similar ones, are what surface items outside your usual preferences, meaning the signal for your unexplored range lives in your network rather than your own logs (Can friends with different tastes improve recommendations?). And there's a measurement trap underneath all of this: even the ratings you'd use to define a user's "true" taste distribution are shaped by prior ratings and social dynamics, so the baseline you calibrate against is itself contaminated (Do online ratings actually reflect independent customer opinions?). Capturing full taste range, then, is less about finding the right score and more about noticing that accuracy and range pull in opposite directions, and deciding to measure the one your optimizer is busy ignoring.
Sources (7 notes)
Steck's research shows that ranking by per-item relevance naturally produces lists dominated by a user's primary interest, even when they have documented secondary interests. Enforcing calibration via post-hoc reranking restores proportional representation without sacrificing overall accuracy.
Rec-R1 demonstrates that LLMs can be trained directly on rule-based recommendation metrics like NDCG and Recall as RL reward signals, eliminating the need for SFT distillation from proprietary models while remaining model-agnostic across different retriever architectures.
Research shows that when user/item embedding dimensions are too small, recommender systems overfit toward popular items to maximize ranking quality. This compounds over time as niche items receive insufficient exposure, and cannot be fixed post-hoc without treating dimensionality as a fairness hyperparameter.
AMP-CF represents each user as multiple latent personas weighted dynamically by candidate item. This makes recommendations both diverse and interpretable—each suggestion traces to the specific persona preference it satisfies—without requiring post-hoc reranking.
AMP-CF separates user representation into latent personas weighted by attention to the candidate item. This candidate-conditional approach improves accuracy by adapting the user representation at prediction time and produces inherent explanations for why items were recommended.
Social Poisson Factorization uses friends' diverse tastes to recommend items outside users' usual preferences, outperforming methods that pull friends' representations together. Networks add value through influence on anomalous choices, not taste similarity.
Moe and Trusov decomposed ratings into baseline quality, social-dynamics influence, and error, finding that prior ratings meaningfully affect subsequent ones. These effects have both immediate sales impact and long-term compounding effects through future ratings, though high opinion variance can eventually dampen the distortion.