How can evaluation metrics reflect graded relevance and user attention?
Traditional IR metrics treat relevance as binary, but real user needs involve degrees of relevance and attention patterns. Can evaluation methods capture both graded relevance judgments and the reality that users examine fewer documents further down ranked lists?
Traditional IR evaluation — precision and recall — assumes binary relevance: a document is either relevant or not. Real user information needs are not binary. A document might be highly relevant, marginally relevant, or completely irrelevant, and evaluation should credit systems for surfacing highly-relevant documents earlier in the ranking than marginally-relevant ones.
Järvelin and Kekäläinen's three measures handle this. Cumulative Gain (CG) accumulates relevance scores down the ranked list — a system gets credit for relevant documents anywhere in the ranking. Discounted Cumulative Gain (DCG) applies a position discount that devalues late-retrieved documents, reflecting that users examine fewer documents at lower ranks. Normalized DCG (nDCG) computes DCG as a fraction of the ideal DCG (the DCG of the perfect ranking), giving a 0-to-1 score that is comparable across queries and systems.
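In symbols, using one common formulation (Järvelin and Kekäläinen's original definition allows other discount bases; the log2 variant below is the one most libraries implement), with rel_i the graded relevance of the document at rank i:

$$
\mathrm{CG@}k = \sum_{i=1}^{k} rel_i, \qquad
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad
\mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
$$

where IDCG@k is the DCG@k of the ideal, relevance-sorted ranking, so nDCG@k falls in [0, 1] and reaches 1 only for a perfect ordering.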
The conceptual contribution is binding evaluation to user behavior. Modern users overwhelmed by retrieval results don't examine all of them — they examine top results more than later ones. Evaluation that doesn't reflect this incentivizes systems to put any relevant documents into the result set, regardless of position. DCG-style evaluation incentivizes systems to put highly relevant documents at the top.
nDCG is now standard not just in IR but in recommendation, where the same logic applies: a recommendation list is examined top-down, and the recommendation that matters most is the one at position 1. nDCG@k is the dominant evaluation metric for top-K recommendation precisely because it encodes the user-attention pattern that makes ranking matter.
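As a concrete sketch (not a reference implementation): the helper below computes nDCG@k from hypothetical graded judgments using the common log2(i+1) discount, and normalizes against the ideal reordering of the same judged list. A full offline evaluation would instead normalize against the ideal ranking over all of a user's judged relevant items, not just the retrieved ones.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions.

    relevances: graded relevance of the item at each rank (rank 1 first).
    Uses the common log2(rank + 1) position discount.
    """
    return sum(rel / math.log2(i + 2)   # i is 0-based, so rank i+1 gets log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (relevance-sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgments for a ranked list (3 = highly relevant, 0 = irrelevant).
ranking = [3, 2, 0, 1, 0]
print(ndcg_at_k(ranking, k=5))  # ~0.985: close to 1, but the marginally relevant item sits too low
```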
Source: Recommenders General
Related concepts in this collection
- Why do recommender systems struggle to balance accuracy and diversity?
  Recommender systems treat accuracy and diversity as competing objectives, requiring separate tuning. But what if the conflict is artificial, stemming from how we measure success rather than a fundamental tension?
  extends: nDCG encodes user attention but still allows degenerate top-K composition — accuracy metrics including nDCG miss set-level constraints
- What does Netflix need to optimize in those first 90 seconds?
  Streaming users abandon after 60-90 seconds reviewing 1-2 screens. Does the recommender problem lie in predicting ratings accurately, or in making those limited screens immediately compelling?
  exemplifies in domain: nDCG's position discount captures the user-attention pattern Netflix observed empirically — choice fatigue makes top positions disproportionately important
- Why does multinomial likelihood work better for ranking recommendations?
  Explores whether the choice of likelihood function in VAE-based collaborative filtering matters for matching training objectives to ranking evaluation metrics. Why items should compete for probability mass.
  complements: both align modeling and evaluation with the top-N ranking objective rather than per-item regression
- Why do the same users rate items differently each time?
  User ratings are assumed to be clean preference signals, but do they actually fluctuate unpredictably? This matters because recommender systems rely on ratings as ground truth, yet temporal inconsistency and individual rating styles may contaminate that signal.
  tension with: graded relevance assumes a stable ground-truth grading; rating noise undermines that assumption
Original note title: discounted cumulative gain extends IR evaluation to graded relevance — late-retrieved relevant documents are discounted because users examine fewer of them