Can graded relevance assumptions hold when user ratings are temporally inconsistent?

This explores a tension between two things the corpus treats separately: graded-relevance evaluation (metrics like DCG that assume each item carries a stable relevance grade) and the messy reality that the same user's ratings drift across time and context.

Graded relevance is the assumption behind DCG and nDCG: each document or item carries a relevance *score*, not just a yes/no, and the metric rewards putting high-scored items near the top, where users actually look (see "How can evaluation metrics reflect graded relevance and user attention?"). The whole edifice rests on those grades meaning something consistent. The corpus has a direct strike against that premise: Amatriain et al. found that the same user gives the same item ratings that swing by multiple stars across sessions, driven by mood, anchoring, and personal rating style rather than any change in preference (see "Why do the same users rate items differently each time?"). If the grade itself wobbles by multiple stars, a metric that weights gain by the exact relevance level and discounts it by position is measuring partly real preference and partly noise.
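To make the sensitivity concrete, here is a minimal sketch of DCG and nDCG using a common log2 position discount (one variant among several; the original formulation parameterizes the log base). The two star lists are hypothetical: they stand for the same ranked items re-rated by the same user in two sessions, with up to two stars of drift per item.

```python
import math

def dcg(grades):
    """Discounted cumulative gain for grades listed in ranked order.

    Uses the common grade / log2(rank + 1) discount, with rank starting at 1.
    """
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(grades):
    """DCG divided by the DCG of the ideal (descending-grade) ordering."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# Hypothetical 1-5 star grades for one fixed ranking, rated in two sessions.
session_a = [5, 3, 4, 2, 1]
session_b = [3, 5, 4, 1, 2]  # same items, same ranking, re-rated

print(ndcg(session_a), ndcg(session_b))
```

The ranking never changed between the two calls, yet the score does, so some of the difference between two systems' nDCG values can come from when the grades were collected rather than from the rankings themselves.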

The instability isn't just random jitter you could average away; it has structure that bends the grades in a direction. Online ratings are shaped by the ratings that came before them: Moe and Trusov decompose a rating into baseline quality, social-dynamics influence, and error, and show that prior ratings push later ones, with effects that compound over time (see "Do online ratings actually reflect independent customer opinions?"). So "temporally inconsistent" understates it: the inconsistency is correlated and self-reinforcing, which is exactly the kind of error that averaging can't clean up. A graded relevance label built from such ratings encodes the crowd's history, not a clean preference grade.
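A toy illustration of why averaging fails here, under loudly stated assumptions: this is not Moe and Trusov's model, only a deterministic sketch in which each later rating equals baseline quality plus `influence` times the gap between the running mean of prior ratings and the baseline. The error term is left out so the effect of the social term is visible on its own; `simulate`, `influence`, and all the numbers are hypothetical.

```python
def simulate(n, baseline, first_rating, influence):
    """Return n ratings; each later one is pulled toward the prior running mean.

    influence=0.0 means later raters ignore earlier ratings entirely.
    """
    ratings = [first_rating]
    for _ in range(n - 1):
        prior_mean = sum(ratings) / len(ratings)
        ratings.append(baseline + influence * (prior_mean - baseline))
    return ratings

def mean(xs):
    return sum(xs) / len(xs)

# One enthusiastic early rating (5 stars) for an item whose baseline is 3.
independent = simulate(50, baseline=3.0, first_rating=5.0, influence=0.0)
herded = simulate(50, baseline=3.0, first_rating=5.0, influence=0.8)

print(round(mean(independent), 2), round(mean(herded), 2))
```

With no influence, the early outlier washes out as more ratings arrive. With influence, the average stays well above the baseline after the same number of ratings, because every new rating carries part of the earlier history forward.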

The sharper reframing in the corpus is that not all rating signals are even the same *kind* of thing. Annotation responses decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, distinguishable precisely by whether they stay consistent across measurement conditions (see "Do all annotation responses measure the same underlying thing?"). Read against the graded-relevance question, this is the punchline: temporal inconsistency isn't a flaw in an otherwise-good grade, it's the *diagnostic* that tells you a chunk of your grades were never stable preferences to begin with. Treating all of them as gradable on one scale contaminates whatever you train or evaluate downstream. There's also a cousin failure where trust signals detach from relevance entirely: users prefer answers with more citations whether or not the citations are relevant (see "Do users trust citations more when there are simply more of them?"). It's a reminder that the human judgments we treat as relevance grades are often heuristics wearing relevance's clothes.
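One way to use inconsistency as a diagnostic, sketched under assumptions: if the same user has rated the same item in more than one session, the spread of those ratings gives a crude stability signal. The function name, the `(user, item, session, stars)` record layout, and the one-star threshold are all hypothetical, and a spread check alone cannot tell a non-attitude from a constructed preference; it only separates grades that held steady from grades that didn't.

```python
from collections import defaultdict

def split_by_stability(ratings, max_spread=1):
    """Partition repeated (user, item) ratings into stable and unstable groups.

    ratings: iterable of (user, item, session, stars) tuples.
    A pair is 'stable' if its ratings differ by at most max_spread stars.
    Pairs rated only once are skipped: one rating says nothing about consistency.
    """
    by_pair = defaultdict(list)
    for user, item, _session, stars in ratings:
        by_pair[(user, item)].append(stars)

    stable, unstable = {}, {}
    for pair, stars in by_pair.items():
        if len(stars) < 2:
            continue
        target = stable if max(stars) - min(stars) <= max_spread else unstable
        target[pair] = stars
    return stable, unstable

# Hypothetical re-rating data.
data = [
    ("u1", "i1", 1, 4), ("u1", "i1", 2, 4),  # held steady
    ("u1", "i2", 1, 5), ("u1", "i2", 2, 2),  # swung by three stars
    ("u2", "i1", 1, 3),                      # rated once
]
stable, unstable = split_by_stability(data)
print(stable, unstable)
```

Only the stable group is a plausible candidate for fine-grained graded evaluation; the unstable group is better treated as a coarse or ordinal signal, or set aside.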

Where the corpus points toward a way through is interesting: lean on *order* rather than *level*. In Liang et al.'s variational autoencoders for collaborative filtering, multinomial likelihood beats Gaussian and logistic because it forces items to compete for probability, aligning training with top-N ranking rather than with reproducing each cardinal score (see "Why does multinomial likelihood work better for ranking recommendations?"). Relative ranking is far more robust than absolute graded scores to a user who rates everything a star high today and a star low tomorrow. And if you do keep graded labels, you may need to model the distortions explicitly, the way YouTube's ranker bolts on a position tower to strip selection bias before it becomes a self-amplifying loop (see "Why do ranking systems need to model selection bias explicitly?").
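The robustness of order to a uniform shift can be checked directly. In this sketch the star values are hypothetical: the same three items are rated one star high in one session and one star low in the next. It assumes the shift is uniform and that no rating hits the ends of the scale; clipping at 1 or 5 stars, or item-specific drift, can introduce ties or reorderings.

```python
def ranking(scores):
    """Item ids ordered from highest to lowest score (ties broken by id)."""
    return sorted(scores, key=lambda item: (-scores[item], item))

def mean_abs_diff(a, b):
    """Average absolute difference between two gradings of the same items."""
    return sum(abs(a[k] - b[k]) for k in a) / len(a)

# Hypothetical ratings: everything a star high today, a star low tomorrow.
today = {"a": 4, "b": 3, "c": 5}
tomorrow = {"a": 2, "b": 1, "c": 3}

print(ranking(today), ranking(tomorrow))
print(mean_abs_diff(today, tomorrow))
```

The induced order is identical across the two sessions while every absolute grade has moved by two stars, which is why an objective built on order absorbs this kind of drift and one built on level does not.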

So the honest answer: graded-relevance assumptions hold *weakly* and conditionally. They're fine as a coarse evaluation convenience, but the corpus suggests the grades are a blend of stable preference, rating behavior, and social contagion, and the more finely a metric leans on the exact grade, the more of what it measures is noise. The thing you didn't know you wanted to know: temporal inconsistency is most useful not as something to denoise away, but as a filter for telling which judgments were ever real preferences in the first place.

Sources 7 notes

How can evaluation metrics reflect graded relevance and user attention?

Järvelin and Kekäläinen's DCG and nDCG metrics handle graded relevance by accumulating relevance scores with a position discount factor that devalues late-retrieved documents. This binds evaluation to observed user behavior: users examine top results more carefully than lower-ranked ones, making ranking position matter.

Why do the same users rate items differently each time?

Amatriain et al. found that the same user gives substantially different ratings to the same item across sessions, shifting by multiple stars. This noise stems from temporal inconsistency, rater-specific biases, and anchoring effects—making ratings reflect both preference and rating-behavior rather than stable preference alone.

Do online ratings actually reflect independent customer opinions?

Moe and Trusov decomposed ratings into baseline quality, social-dynamics influence, and error, finding that prior ratings meaningfully affect subsequent ones. These effects have both immediate sales impact and long-term compounding effects through future ratings, though high opinion variance can eventually dampen the distortion.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Do users trust citations more when there are simply more of them?

Analysis of 24,000 Search Arena interactions shows irrelevant citations boost user preference (β=0.273) nearly as much as relevant citations (β=0.285), indicating citation count functions as a decoupled trust heuristic.

Why does multinomial likelihood work better for ranking recommendations?

Liang et al. show that switching VAE likelihoods from Gaussian/logistic to multinomial achieves state-of-the-art results because enforced probability competition between items directly aligns training with top-N ranking objectives. Rebalancing KL regularization further improves performance.

Why do ranking systems need to model selection bias explicitly?

YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.
