How do implicit signals like clicks capture preference more reliably than explicit ratings?
This explores why behavioral traces like clicks and watches often capture user preference better than star ratings — and the corpus suggests the real answer is that implicit signals carry an extra dimension ratings throw away, while also dragging in biases you have to model around.
The sharpest reframing in the corpus is that implicit and explicit signals aren't just noisier and cleaner versions of the same thing; they're shaped differently. Implicit feedback splits into *two* paired magnitudes: preference (did they engage or not) and confidence (how much, how often) Can implicit feedback reveal both preference and confidence?. A single star rating collapses both into one number and loses the certainty information. Watching a show ten times and watching it once might both map to 'liked it,' but the implicit signal keeps the volume knob that the rating discards.
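A minimal sketch of that split, following the form used in Hu, Koren, and Volinsky's implicit-feedback model. Preference p_ui is 1 if the user engaged at all. Confidence grows linearly with the raw count, c_ui = 1 + alpha * r_ui. The watch counts, the value alpha = 40, and the function names are illustrative assumptions; alpha in particular is a tuning knob.

```python
# Minimal sketch of the preference/confidence split for implicit feedback.
# Assumptions (illustrative only): r_ui is a raw watch count, confidence is the
# linear form c_ui = 1 + alpha * r_ui, and alpha = 40.0 is a tunable choice.

def split_implicit(counts, alpha=40.0):
    """Map raw counts r_ui to binary preferences p_ui and confidences c_ui."""
    prefs = {key: 1 if r > 0 else 0 for key, r in counts.items()}
    confs = {key: 1.0 + alpha * r for key, r in counts.items()}
    return prefs, confs


def weighted_squared_loss(prefs, confs, preds):
    """Data term sum_ui c_ui * (p_ui - prediction_ui)^2 (regularization omitted).

    Mispredicting a heavily watched item is expensive. Unwatched items still
    contribute, as low-confidence evidence of 'no preference' (c_ui = 1).
    """
    return sum(confs[k] * (prefs[k] - preds[k]) ** 2 for k in prefs)


# Hypothetical watch log for one user: show -> number of times watched.
watch_counts = {"show_a": 10, "show_b": 1, "show_c": 0}
prefs, confs = split_implicit(watch_counts)
print(prefs)  # {'show_a': 1, 'show_b': 1, 'show_c': 0}
print(confs)  # {'show_a': 401.0, 'show_b': 41.0, 'show_c': 1.0}

# A model that is unsure about everything (0.5) is penalized mostly on show_a.
print(weighted_squared_loss(prefs, confs, {k: 0.5 for k in prefs}))  # 110.75
```

Both shows get the same preference. However, a wrong prediction on the show watched ten times costs roughly ten times as much as one on the show watched once. That difference is the 'volume knob' a single rating drops.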
The deeper reason explicit ratings underperform is that a rating isn't always a preference at all. Decomposing annotation responses shows they contain three different things: genuine preferences, non-attitudes (people answering when they don't actually have an opinion), and constructed preferences (opinions invented on the spot by the act of being asked) Do all annotation responses measure the same underlying thing?. Asking forces a number into existence even when no real preference exists, and treating all three kinds of answer the same way contaminates the signal. Clicks largely sidestep this problem: nobody clicks to be polite or to fill in a survey box, so the behavior reflects interest that existed before anyone asked. That is also why agents can often learn more by watching than by asking Can agents learn preferences by watching rather than asking?.
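The decomposition note says these three kinds of response can be distinguished by how consistent they are across measurement conditions. Below is one hypothetical way to operationalize that. It assumes each question is asked twice with identical wording and once with a different framing. The protocol, the decision rules, and the names are illustrative assumptions, not the cited method.

```python
# Hypothetical heuristic (not the cited method): triage one annotator's answers
# to one item by how consistent they are across measurement conditions.
# Assumed protocol: ask once, repeat the identical question ("retest"), and ask
# once more under a different framing ("reframed").

def triage_response(first, retest, reframed):
    """Label a response as genuine, constructed, or a non-attitude.

    - non-attitude: unstable even when the identical question is repeated
    - constructed: stable on an identical repeat but flips with the framing,
      i.e. the answer was shaped by how the question was asked
    - genuine: stable under both the repeat and the reframing
    """
    if first != retest:
        return "non-attitude"
    if first != reframed:
        return "constructed"
    return "genuine"


responses = {
    # item: (first answer, identical retest, reframed question)
    "q1": ("A", "A", "A"),
    "q2": ("A", "A", "B"),
    "q3": ("A", "B", "A"),
}
labels = {item: triage_response(*answers) for item, answers in responses.items()}
print(labels)  # {'q1': 'genuine', 'q2': 'constructed', 'q3': 'non-attitude'}

# Illustrative filtering policy: keep only genuine-looking responses as signal.
usable = [item for item, label in labels.items() if label == "genuine"]
print(usable)  # ['q1']
```

A single retest is a weak test, because a non-attitude can match by chance. Before trusting labels like these, you would want more repeats and more framings.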
But 'more reliable' comes with a catch the corpus is blunt about: implicit signals are reliable about *what users did*, not cleanly about *what they wanted*, because the system itself shaped what they could do. You only click what you were shown. YouTube's ranking work makes this concrete. Its ranker uses a shallow position tower to strip selection bias out of the training data. Without explicitly modeling selection and position bias, recommenders converge on degenerate loops that amplify their own past decisions, mistaking 'we showed it at the top' for 'they preferred it' Why do ranking systems need to model selection bias explicitly?. So the reliability is real but conditional: you have to subtract out the artifacts of exposure.
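The sketch below is a toy version of the idea behind the shallow position tower in YouTube's multi-objective ranker, and it makes heavy simplifying assumptions. A single learned scalar per item stands in for the main ranking model. A single scalar per position stands in for the shallow tower. The two are added in logit space during training, and at serving time only the item score is used. The click log is hypothetical and hand-built so that item A was mostly shown in the top slot.

```python
import math

# Toy sketch (not YouTube's implementation): train on
#   logit = score[item] + bias[position],
# then rank by score[item] alone at serving time.
# Assumptions: position 0 is the reference slot (bias fixed at 0), and the
# click log below is hypothetical.

# (item, position) -> (impressions, clicks)
LOG = {
    ("A", 0): (100, 50), ("A", 1): (10, 2),
    ("B", 0): (10, 7),   ("B", 1): (100, 40),
}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def naive_ctr(log):
    """Click-through rate per item, ignoring where the item was shown."""
    totals = {}
    for (item, _), (n, c) in log.items():
        seen, clicked = totals.get(item, (0, 0))
        totals[item] = (seen + n, clicked + c)
    return {item: c / n for item, (n, c) in totals.items()}


def fit_debiased(log, lr=1.0, steps=5000):
    """Fit logit = score[item] + bias[pos] by gradient ascent on the mean
    Bernoulli log-likelihood of the aggregated click log."""
    items = sorted({item for item, _ in log})
    positions = sorted({pos for _, pos in log})
    score = {item: 0.0 for item in items}
    bias = {pos: 0.0 for pos in positions}
    total = sum(n for n, _ in log.values())
    for _ in range(steps):
        g_score = {item: 0.0 for item in items}
        g_bias = {pos: 0.0 for pos in positions}
        for (item, pos), (n, c) in log.items():
            residual = c - n * sigmoid(score[item] + bias[pos])
            g_score[item] += residual
            g_bias[pos] += residual
        for item in items:
            score[item] += lr * g_score[item] / total
        for pos in positions[1:]:  # position 0 stays fixed as the reference
            bias[pos] += lr * g_bias[pos] / total
    return score, bias


ctr = naive_ctr(LOG)
score, bias = fit_debiased(LOG)
print(sorted(ctr, key=ctr.get, reverse=True))      # naive ranking: ['A', 'B']
print(sorted(score, key=score.get, reverse=True))  # serving ranking: ['B', 'A']
```

Naive click-through rate favors A because A mostly occupied the top slot. Once the position term absorbs the slot effect, B ranks first, since in this log it out-clicks A at every position.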
There's also a modeling subtlety in *how* you treat click data once you trust it. Clicks behave like a competition, not a set of independent yes/no judgments. A multinomial likelihood forces items to compete for a fixed probability budget, and it outperforms Gaussian or logistic treatments, which let a model assign high probability to many items at once. The competition bakes in the top-N ranking objective that implicit feedback is really about Why does multinomial likelihood work better for click prediction?. And the same accuracy-chasing that implicit signals enable can quietly crowd out a user's minority interests unless you rerank for calibration. That reranking is a post-processing step that restores proportional representation without retraining the underlying model Why do accuracy-optimized recommenders crowd out minority interests?.
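Both ideas in this paragraph can be sketched in a few lines.

The first half contrasts the two likelihoods on one user's click vector, in the formulation popularized by Mult-VAE-style models. Under the softmax, raising one item's probability necessarily lowers every other item's, which is the 'fixed probability budget'; under independent sigmoids it does not.

The second half is a greedy calibrated reranker in the style of Steck's calibrated recommendations. It trades summed relevance against the KL divergence between the genre mix of the user's history and that of the list. It is not necessarily the exact algorithm in the cited note.

All of the following are illustrative assumptions:
- genre as the interest category
- the mixing weight alpha = 0.01 that keeps the KL finite
- the trade-off lam = 0.5
- all item names and scores

```python
import math

# --- Part 1: multinomial vs. logistic likelihood for one user's clicks ---
# Assumed setup: `logits` are a model's raw scores for every item and `clicks`
# is the user's 0/1 click vector.

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multinomial_loglik(logits, clicks):
    """sum_i x_i * log(pi_i): all items share one probability budget."""
    return sum(x * math.log(p) for x, p in zip(clicks, softmax(logits)))


def logistic_loglik(logits, clicks):
    """Independent yes/no per item: probabilities need not sum to 1."""
    return sum(
        x * math.log(sigmoid(z)) + (1 - x) * math.log(1 - sigmoid(z))
        for x, z in zip(clicks, logits)
    )


# Raising item 0's logit steals probability from item 1 under the softmax;
# item 1's sigmoid(1.0) would be unchanged.
before, after = softmax([1.0, 1.0, 1.0]), softmax([3.0, 1.0, 1.0])
print(round(before[1], 3), round(after[1], 3))  # 0.333 0.107
print(round(sum(sigmoid(z) for z in [3.0, 3.0, 3.0]), 2))  # 2.86, well above 1

# --- Part 2: greedy calibrated reranking ---
# Assumed setup: each item has one genre, the target mix comes from the genres
# in the user's history, lam trades relevance against miscalibration.

def genre_mix(genres):
    counts = {}
    for g in genres:
        counts[g] = counts.get(g, 0) + 1
    return {g: c / len(genres) for g, c in counts.items()}


def kl_to_target(target, current, alpha=0.01):
    """KL(target || q~) with q~ = (1 - alpha) * current + alpha * target."""
    return sum(
        p * math.log(p / ((1 - alpha) * current.get(g, 0.0) + alpha * p))
        for g, p in target.items()
    )


def calibrated_rerank(scores, genre_of, history_genres, k, lam=0.5):
    """Greedily add the item maximizing (1 - lam) * relevance - lam * KL."""
    target = genre_mix(history_genres)
    chosen = []
    for _ in range(k):
        def objective(item):
            trial = chosen + [item]
            relevance = sum(scores[i] for i in trial)
            mix = genre_mix([genre_of[i] for i in trial])
            return (1 - lam) * relevance - lam * kl_to_target(target, mix)
        remaining = [i for i in sorted(scores) if i not in chosen]
        chosen.append(max(remaining, key=objective))
    return chosen


# Hypothetical catalog: a 70/30 drama/comedy history, dramas scored highest.
scores = {"d1": 0.9, "d2": 0.85, "d3": 0.8, "d4": 0.75, "c1": 0.6, "c2": 0.55}
genre_of = {i: ("drama" if i.startswith("d") else "comedy") for i in scores}
history = ["drama"] * 7 + ["comedy"] * 3
print(sorted(scores, key=scores.get, reverse=True)[:3])  # ['d1', 'd2', 'd3']
print(calibrated_rerank(scores, genre_of, history, k=3))  # ['d1', 'c1', 'd2']
```

The accuracy-only top 3 is all drama. The calibrated list gives up a little score to bring the list's genre mix closer to the user's 70/30 history, and because it only reorders existing scores, the underlying model is untouched.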
The thing you might not have expected to learn: the field is increasingly trying to recover the *richness of explicit feedback without the cost of asking for it*. With few-shot prompting and no fine-tuning, an LLM can transform negative comments like 'doesn't look good for a date' into a positive, retrievable preference ('prefer more romantic') Can language models bridge the gap between critique and preference?. Meanwhile, scalar reward signals are being shown to throw away the directional information (how an action should change) that natural language feedback preserves Can scalar rewards capture all the information in agent feedback?. So the trajectory isn't 'implicit beats explicit'. Rather, the most useful signal carries *both* preference and the surrounding context of how strong it is and which direction it points. Clicks supply the strength abundantly, language critiques supply the direction, and ratings supply both only thinly.
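The sketch below is a hypothetical version of the few-shot setup. It builds a prompt from a handful of critique-to-preference pairs and leaves the new critique's preference blank for the model to complete. Only the date example comes from the text; the other pairs, the instruction wording, and the function name are assumptions. The actual LLM call is omitted rather than guessed at.

```python
# Hypothetical few-shot prompt builder for critique -> preference conversion.
# Only the first pair comes from the source; the others are illustrative.

FEW_SHOT_EXAMPLES = [
    ("doesn't look good for a date", "prefer more romantic"),
    ("way too loud for a quiet dinner", "prefer quieter"),        # hypothetical
    ("too expensive for a weekday lunch", "prefer lower price"),  # hypothetical
]


def build_critique_prompt(critique, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot prompt that ends where the model should answer."""
    lines = [
        "Rewrite each negative critique as a positive preference "
        "that a retrieval system could search for.",
        "",
    ]
    for negative, positive in examples:
        lines.append(f"Critique: {negative}")
        lines.append(f"Preference: {positive}")
        lines.append("")
    lines.append(f"Critique: {critique}")
    lines.append("Preference:")
    return "\n".join(lines)


prompt = build_critique_prompt("the portions are too small")
print(prompt)
```

The model's completion (something along the lines of 'prefer larger portions') is what would then be handed to the retrieval system as a query.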
Sources (8 notes)
Hu, Koren, and Volinsky show that implicit signals (watches, purchases, clicks) encode preference and confidence as two distinct dimensions. Explicit ratings collapse these into one number, losing information about certainty in the preference estimate.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
M3-Agent demonstrates that separating episodic events from semantic knowledge in an entity-centric graph, combined with parallel memorization and control processes, allows agents to infer and act on user preferences without asking. This architecture mirrors human cognitive systems that bind disparate information about individuals across sensory modalities.
YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.
Multinomial likelihood better models click data because it forces items to compete for a fixed probability budget, implicitly optimizing for top-N ranking. Gaussian and logistic likelihoods allow high probability across many items simultaneously, misaligning training with ranking objectives.
Accuracy-optimized models systematically miscalibrate by over-weighting dominant user interests. A post-processing reranking algorithm that enforces calibration constraints can restore proportional representation without retraining the underlying model.
Few-shot LLM prompting can convert natural negative feedback like "doesn't look good for a date" into positive preferences like "prefer more romantic," enabling retrieval systems to find better-matching recommendations without fine-tuning.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.