How should unobserved items differ from items rated zero preference?

This explores the recommender-system problem of how to treat an item a user simply never saw or clicked versus an item they explicitly disliked — and the corpus says these are not the same kind of signal at all.

The corpus's sharpest answer to the gap between silence and rejection in preference data is that an unobserved item carries *low confidence*, not low preference. The cleanest framing comes from work showing that implicit feedback (watches, clicks, purchases) splits into two separate magnitudes: a preference direction and a confidence in that direction (see "Can implicit feedback reveal both preference and confidence?"). A zero-preference rating is a confident negative: the user looked and said no. An unobserved item is an *absence of evidence*: maybe they'd have loved it, maybe not, but the system has near-zero confidence either way. Collapsing both to the number 0 throws away exactly the dimension that distinguishes them.
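The preference/confidence split can be sketched in a few lines. This is a minimal illustration in the style of the Hu, Koren, and Volinsky mapping (preference is 1 when any interaction was observed, confidence grows linearly with the interaction count). The names `Signal` and `from_implicit_count` are hypothetical, and the default `alpha` is only an illustrative value that should be tuned per dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    preference: float  # direction: 1.0 = evidence of liking, 0.0 = none
    confidence: float  # how much weight that direction deserves


def from_implicit_count(count: float, alpha: float = 40.0) -> Signal:
    """Map a raw interaction count to (preference, confidence).

    Assumed mapping: p = 1 if count > 0 else 0, and c = 1 + alpha * count.
    alpha is an illustrative default, not a recommended setting.
    """
    preference = 1.0 if count > 0 else 0.0
    confidence = 1.0 + alpha * count
    return Signal(preference, confidence)


unobserved = from_implicit_count(0)   # preference 0.0 at the minimum confidence
rewatched = from_implicit_count(3)    # preference 1.0 at much higher confidence
```

Note that the unobserved item still gets preference 0.0 here, but at the lowest possible confidence, which is what keeps it from acting as a hard negative.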

This matters because explicit ratings are themselves far noisier than they look. The same user rates the same item differently across sessions, drifting by multiple stars because of temporal mood, anchoring, and personal rating style (see "Why do the same users rate items differently each time?"). So even a literal zero isn't a stable 'I dislike this'; it's preference tangled with rating behavior. If an observed zero is already shaky, then equating it with the vast sea of unobserved items (which is most of the catalog) is doubly wrong. The practical move is weighting: treat observed negatives as comparatively high-confidence signal and unobserved items as a soft, low-confidence prior rather than hard negatives.
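One way to express that weighting is a per-item weighted squared error, where a missing rating is pulled toward a prior with a small weight instead of being scored as a full-strength zero. The sketch below is an assumption-laden illustration, not a method taken from the cited notes: the function name, the use of `None` for "unobserved", and the weight values are all hypothetical.

```python
from typing import Optional, Sequence


def weighted_squared_error(
    predictions: Sequence[float],
    ratings: Sequence[Optional[float]],
    observed_weight: float = 1.0,     # hypothetical: full weight for real ratings
    unobserved_weight: float = 0.05,  # hypothetical: small weight for missing ones
    prior: float = 0.0,               # hypothetical soft target for missing items
) -> float:
    """Sum of weight * (prediction - target)^2 over all items.

    A rating of None means the user never saw the item; it is pulled
    toward `prior` with a small weight instead of counting as an
    explicit negative.
    """
    total = 0.0
    for pred, rating in zip(predictions, ratings):
        if rating is None:
            target, weight = prior, unobserved_weight
        else:
            target, weight = rating, observed_weight
        total += weight * (pred - target) ** 2
    return total


# Same prediction, different evidence: an explicit zero costs far more
# than a missing entry.
explicit_zero_loss = weighted_squared_error([0.8], [0.0])
unobserved_loss = weighted_squared_error([0.8], [None])
```

With these illustrative weights, predicting 0.8 for an item the user explicitly rated zero is penalized twenty times more heavily than predicting 0.8 for an item they never saw.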

There's a deeper modeling consequence here too. Work on variational autoencoders for collaborative filtering finds that switching the likelihood to a multinomial, which forces items to *compete* for a fixed budget of probability, beats Gaussian or logistic formulations that score each item in isolation (see "Why does multinomial likelihood work better for ranking recommendations?"). That competition is really a statement about unobserved items: ranking is relative, so an item not chosen loses probability mass to the ones that were, without being branded an explicit negative. The structure of the loss function does the 'absence ≠ rejection' work automatically.
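The competition effect is visible in the multinomial log-likelihood itself, which sums `x_i * log(pi_i)` over items with `pi = softmax(scores)`. The sketch below is a simplified, standalone illustration (plain lists rather than a VAE decoder), and the function names are hypothetical. Unclicked items have `x_i = 0`, so they add no term of their own and only enter through the shared normalizer.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one user's item scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def multinomial_log_likelihood(
    scores: Sequence[float], clicks: Sequence[float]
) -> float:
    """sum_i clicks[i] * log softmax(scores)[i].

    Items with clicks[i] == 0 contribute no direct term; they only
    affect the result by taking probability mass in the normalizer.
    """
    log_probs = log_softmax(scores)
    return sum(x * lp for x, lp in zip(clicks, log_probs))


clicks = [1, 0, 0]
baseline = multinomial_log_likelihood([2.0, 0.0, 0.0], clicks)
# Raising the score of an unclicked item lowers the likelihood, because it
# takes probability mass away from the clicked item.
competing = multinomial_log_likelihood([2.0, 2.0, 0.0], clicks)
```

Here `competing` is lower than `baseline` even though no item was ever labeled as a negative: the unclicked item is penalized only relative to the clicked one.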

Worth pulling in laterally: the corpus also shows that not all stated preferences are the same signal. Annotation responses decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, distinguishable only by consistency across conditions (see "Do all annotation responses measure the same underlying thing?"). A 'zero' might be a real distaste or a non-attitude the user invented because you asked, which is another reason a single number underdetermines meaning. And negative feedback isn't even reliably negative information: language models can flip a critique like 'doesn't look good for a date' into the positive preference 'prefer something more romantic' (see "Can language models bridge the gap between critique and preference?"). A disliked item often encodes what the user *does* want, whereas an unobserved item encodes nothing, and that is the asymmetry you'd otherwise miss entirely.

The thing you didn't know you wanted to know: the right design isn't 'how negative should an unobserved item be,' it's 'how *certain* am I.' Once you model confidence as its own axis, the question dissolves — zero-preference items get full weight as evidence, unobserved items get downweighted as guesses, and an explicit dislike can even be mined for the preference hiding inside it.

Sources (5 notes)

Can implicit feedback reveal both preference and confidence?

Hu, Koren, and Volinsky show that implicit signals (watches, purchases, clicks) encode preference and confidence as two distinct dimensions. Explicit ratings collapse these into one number, losing information about certainty in the preference estimate.

Why do the same users rate items differently each time?

Amatriain et al. found that the same user gives substantially different ratings to the same item across sessions, shifting by multiple stars. This noise stems from temporal inconsistency, rater-specific biases, and anchoring effects—making ratings reflect both preference and rating-behavior rather than stable preference alone.

Why does multinomial likelihood work better for ranking recommendations?

Liang et al. show that switching VAE likelihoods from Gaussian/logistic to multinomial achieves state-of-the-art results because enforced probability competition between items directly aligns training with top-N ranking objectives. Rebalancing KL regularization further improves performance.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Can language models bridge the gap between critique and preference?

Few-shot LLM prompting can convert natural negative feedback like "doesn't look good for a date" into positive preferences like "prefer more romantic," enabling retrieval systems to find better-matching recommendations without fine-tuning.
