How does implicit feedback structure differ from explicit ratings mathematically?
This explores the mathematical shape of the signal: explicit ratings are a single scalar per item, while implicit feedback (clicks, watches, purchases) carries more dimensions — and the corpus keeps finding the same pattern, that collapsing feedback into one number throws away information.
An explicit rating is one number per item: a 4 out of 5. Implicit feedback looks like it should be even poorer (you only know someone watched, clicked, or bought), but the foundational result in the corpus is that it actually splits into *two* paired quantities: a binary preference (did they engage or not) and a confidence weight (how much engagement: how many minutes watched, how many repeat purchases). Hu, Koren, and Volinsky's recommender work shows that explicit ratings collapse these two dimensions into one scalar, which silently discards how *certain* you are about each preference (see "Can implicit feedback reveal both preference and confidence?"). So the surprising inversion is that the "weaker" signal is mathematically richer: it's a (preference, confidence) pair, not a point on a line.
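A minimal sketch of that split, assuming the linear confidence form from Hu, Koren, and Volinsky's formulation (confidence grows as 1 + alpha times the raw count). The function names and the value of `alpha` are illustrative choices, not taken from the notes.

```python
def preference_confidence(raw_count, alpha=2.0):
    """Split one raw implicit observation into a (preference, confidence) pair.

    raw_count: how much engagement was observed (views, minutes, purchases).
    alpha: assumed tunable scaling factor; 2.0 is an arbitrary demo value.
    """
    preference = 1.0 if raw_count > 0 else 0.0   # did they engage at all?
    confidence = 1.0 + alpha * raw_count         # how much do we trust that?
    return preference, confidence


def weighted_squared_error(raw_count, predicted, alpha=2.0):
    """Confidence-weighted error for one (user, item) cell.

    predicted: the model's estimated preference for this cell.
    """
    preference, confidence = preference_confidence(raw_count, alpha)
    return confidence * (preference - predicted) ** 2


# One view and twelve views give the same preference but different confidence,
# so the same prediction error costs more on the heavily watched item.
for count in (0, 1, 12):
    print(count, preference_confidence(count), weighted_squared_error(count, 0.5))
```

A single explicit rating has no slot for the second value in the pair: the one-view and twelve-view cases would have to be forced onto the same scale as the preference itself.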
What makes this an Inquiring Line worth pulling on is that the exact same collapse shows up far outside recommender systems, in reinforcement learning. A scalar reward is the RL equivalent of an explicit rating: one number summarizing an action. But natural feedback decomposes into two orthogonal channels: *evaluative* (how good was this?) and *directive* (how should it change?). A scalar captures the first and throws away the second, which is why the two are complementary rather than redundant (see "Can scalar rewards capture all the information in agent feedback?"). Critique-GRPO makes the information loss concrete: models stuck on a numerical-reward plateau can produce correct solutions once they are given chain-of-thought critiques, because the scalar never encoded *why* an answer failed or how to improve it (see "Can natural language feedback overcome numerical reward plateaus?").
There's a sharper, provable version of the same idea around calibration. A binary correctness reward is the most collapsed feedback possible: one bit. Because it doesn't penalize a confident wrong answer differently from a hesitant one, it mathematically incentivizes high-confidence guessing and degrades calibration. Adding a Brier score (a proper scoring rule) as a second reward term restores the missing dimension, confidence, and the result is that accuracy and calibration can be jointly optimized with no trade-off (see "Does binary reward training hurt model calibration?"). That's the same (preference, confidence) decomposition from the recommender paper, reappearing as a guarantee in the reward-design literature.
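The calibration argument can be checked with a small sketch. It assumes the combined reward is correctness minus the squared gap between stated confidence and the outcome; the exact form used in the cited work may differ, and all function names here are hypothetical.

```python
def expected_binary_reward(p_correct, stated_conf):
    """Expected reward when only correctness is scored.

    stated_conf is accepted but never used: the one-bit reward cannot see it.
    """
    return p_correct


def expected_brier_reward(p_correct, stated_conf):
    """Expected reward of correctness minus a Brier penalty (assumed form)."""
    penalty_if_right = (stated_conf - 1.0) ** 2
    penalty_if_wrong = (stated_conf - 0.0) ** 2
    expected_penalty = (p_correct * penalty_if_right
                        + (1.0 - p_correct) * penalty_if_wrong)
    return p_correct - expected_penalty


def best_confidence(reward_fn, p_correct, steps=101):
    """Grid-search the stated confidence that maximizes expected reward."""
    candidates = [i / (steps - 1) for i in range(steps)]
    return max(candidates, key=lambda q: reward_fn(p_correct, q))


p = 0.7  # assumed true chance that the answer is right
# The binary reward is identical whether the model claims 50% or 99%.
print(expected_binary_reward(p, 0.5), expected_binary_reward(p, 0.99))
# With the Brier term, the best stated confidence is the true probability.
print(best_confidence(expected_brier_reward, p))
```

Algebraically the Brier-augmented expectation reduces to 2pq - q^2 for true probability p and stated confidence q, which peaks at q = p. Under the binary reward, nothing pulls the stated confidence toward the truth at all.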
The constructive flip side: when you *keep* the structure instead of collapsing it, you can do things scalars can't. Rich tokenized environment feedback can be converted into dense, per-token credit assignment, letting the policy act as its own process reward model rather than leaning on a single external number (see "Can environment feedback replace scalar rewards in policy learning?"). And a whole strand of late-2025 work is converging on the idea that the explicit reward model, the scalar-emitting box, is optional once you read the richer signal directly out of the policy's own computations (see "Can language models replace reward models with internal signals?").
So the answer to "how do they differ mathematically" is cleaner than the question suggests: explicit ratings are a projection down to one dimension; implicit (and natural, and language) feedback retains at least two — magnitude *and* the confidence or direction attached to it. Nearly every failure mode in the corpus, from degenerate ranking equilibria to truth-indifference under RLHF, traces back to optimizing the projection while pretending it's the whole signal.
Sources (6 notes)
Hu, Koren, and Volinsky show that implicit signals (watches, purchases, clicks) encode preference and confidence as two distinct dimensions. Explicit ratings collapse these into one number, losing information about certainty in the preference estimate.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.
Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.