Can standard accuracy metrics miss the real constraints on user consumption?
This explores whether a single headline number like 'accuracy' can hide the things that actually shape what a user can use, trust, or be served — the constraints on consumption that aggregate scores average away.
The corpus says yes, repeatedly and from several angles: topline accuracy is blind to the real limits on what users can actually consume. The cleanest version is that aggregate accuracy hides confident wrong answers. In medical triage, legal interpretation, and financial planning, errors concentrate in exactly the rare, high-harm cases where a fluent-but-wrong response is most dangerous, yet overall performance still looks strong (note: Why do confident wrong answers hide in standard accuracy metrics?). The constraint on consumption isn't 'how often is it right' but 'can the user tell when to trust it', and accuracy alone never reports that.
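A minimal sketch of how a per-slice breakdown can surface what the headline number hides. The record format, the slice names, the 0.9 confidence threshold, and the synthetic records are all illustrative assumptions, not data or methods from the notes.

```python
from collections import defaultdict


def accuracy_report(records, confidence_threshold=0.9):
    """Summarize accuracy overall and per slice.

    records: iterable of (slice_name, is_correct, confidence) tuples.
    A "confident wrong" answer is an incorrect one whose confidence is at
    or above confidence_threshold (the threshold is an assumption).
    Slice names must not be "overall", which is reserved for the total.
    """
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "confident_wrong": 0})
    for slice_name, is_correct, confidence in records:
        for key in ("overall", slice_name):
            bucket = totals[key]
            bucket["n"] += 1
            bucket["correct"] += int(is_correct)
            if not is_correct and confidence >= confidence_threshold:
                bucket["confident_wrong"] += 1
    return {
        key: {
            "n": b["n"],
            "accuracy": b["correct"] / b["n"],
            "confident_wrong_rate": b["confident_wrong"] / b["n"],
        }
        for key, b in totals.items()
    }


# Synthetic, made-up records: 95 easy cases answered correctly, plus 5 rare
# high-harm cases of which 4 are answered wrongly with high confidence.
records = (
    [("common", True, 0.95)] * 95
    + [("rare_high_harm", True, 0.95)] * 1
    + [("rare_high_harm", False, 0.95)] * 4
)

report = accuracy_report(records)
for name, stats in report.items():
    print(name, stats)
```

On this toy input the overall accuracy is 96 of 100, while the rare slice gets only 1 of 5 right and every one of its errors is a confident one. Reporting the slice and the confident-wrong rate next to the topline figure is one way to make the trust question visible.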
The recommendation side shows the same blind spot as a distributional problem. Optimizing purely for accuracy systematically over-weights a user's dominant interests and crowds out their minority ones, so the feed becomes accurate on average but miscalibrated against the actual breadth of what a person wants; post-hoc reranking has to be bolted on to restore proportional representation (note: Why do accuracy-optimized recommenders crowd out minority interests?). There's a structural reason this gets worse over time, too: real systems have power-law user and item frequencies, so fixed-size hashing piles its collisions onto exactly the heavy users and popular items the model most needs to get right, and the problem grows as new IDs arrive (note: Why do hash collisions hurt recommendation models so much?). The headline metric can rise while the experience of the most active users quietly degrades.
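The reranking idea can be sketched as a greedy post-processing step that trades relevance against divergence from the user's interest mix. The notes do not specify the algorithm, so everything here is an assumption: the KL-based objective, the smoothing constant `alpha`, the `trade_off` weight, the function names, and the toy candidates are hypothetical and chosen only for illustration.

```python
import math


def kl_divergence(target, actual, alpha=0.01):
    """KL(target || actual), with `actual` smoothed toward `target`
    so that it is never zero where `target` is positive."""
    total = 0.0
    for genre, p in target.items():
        if p == 0:
            continue
        q = (1 - alpha) * actual.get(genre, 0.0) + alpha * p
        total += p * math.log(p / q)
    return total


def genre_distribution(items):
    """Share of each genre in a list of (item_id, genre, score) tuples."""
    counts = {}
    for _, genre, _ in items:
        counts[genre] = counts.get(genre, 0) + 1
    return {genre: count / len(items) for genre, count in counts.items()}


def calibrated_rerank(candidates, target, k, trade_off=0.5):
    """Greedily build a list of k items, at each step adding the candidate
    that maximizes (1 - trade_off) * total score
    minus trade_off * KL(target || list's genre distribution)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def objective(item):
            trial = selected + [item]
            relevance = sum(score for _, _, score in trial)
            divergence = kl_divergence(target, genre_distribution(trial))
            return (1 - trade_off) * relevance - trade_off * divergence

        best = max(remaining, key=objective)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up user: 70% action, 30% documentary by past consumption.
target = {"action": 0.7, "documentary": 0.3}
candidates = [(f"a{i}", "action", 0.90 - 0.01 * i) for i in range(6)] + [
    (f"d{i}", "documentary", 0.60 - 0.01 * i) for i in range(3)
]

by_score = sorted(candidates, key=lambda c: c[2], reverse=True)[:5]
reranked = calibrated_rerank(candidates, target, k=5)
print("by score:", [item_id for item_id, _, _ in by_score])
print("reranked:", [item_id for item_id, _, _ in reranked])
```

With these toy scores the pure top-5 is all action, while the reranked list makes room for documentaries. The underlying model is untouched, which is the appeal of doing this as a post-processing step; `trade_off` controls how much relevance is given up for proportional representation.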
A second thread is that 'success' is often not even one thing. Phone agents turn out to have statistically distinct capabilities (completing a task, respecting privacy while doing it, and reusing saved preferences), no model dominates all three, and a success-only ranking predicts neither of the other two (note: Do phone agents succeed at all three critical tasks equally?). Agent evaluation more broadly collapses when squeezed to one score, hiding the trajectory quality, memory hygiene, context efficiency, and verification cost that determine whether the thing is actually usable in deployment (note: What should we actually measure in agent evaluation?). Privacy and preference-fit are real constraints on consumption that a task-success number simply does not contain.
The training-signal notes explain *why* the metric goes blind. Binary correctness rewards don't penalize confident wrong answers, so they actively push models toward high-confidence guessing, which degrades calibration unless you add something like a Brier-score term to optimize accuracy and trustworthiness jointly (note: Does binary reward training hurt model calibration?). On the human-feedback side, annotation responses aren't a single 'preference' signal at all: they decompose into genuine preferences, non-attitudes, and constructed preferences, and treating them as one contaminates the very rewards we then measure against (note: Do all annotation responses measure the same underlying thing?). So the corpus's quiet payoff is this: the fix is rarely 'a better model'; it is measuring the constraint you were averaging away. Whether that's calibration, fairness across interests, privacy, or signal purity, the missing dimension is usually nameable and addressable once you stop trusting the single number.
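To make the reward argument concrete, here is a small sketch comparing a binary correctness reward with one that subtracts a Brier penalty. The exact form `correctness - (confidence - correctness)^2` is an assumption about how the second term is combined; the notes only say that a Brier-score term is added. The function names are hypothetical.

```python
def binary_reward(is_correct: bool) -> float:
    """Reward that looks only at correctness and ignores stated confidence."""
    return 1.0 if is_correct else 0.0


def calibrated_reward(is_correct: bool, confidence: float) -> float:
    """Correctness minus the Brier penalty (confidence - outcome)^2.

    Assumed combination: the two terms are simply subtracted, unweighted.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    outcome = 1.0 if is_correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected calibrated reward when the answer is right with
    probability p_correct and the model states `confidence`."""
    return p_correct * calibrated_reward(True, confidence) + (
        1 - p_correct
    ) * calibrated_reward(False, confidence)


# Under the binary reward, a wrong answer costs the same at any confidence.
print(binary_reward(False))

# Under the calibrated reward, a confident wrong answer costs more.
print(calibrated_reward(False, 0.9), calibrated_reward(False, 0.1))

# For a question the model gets right 60% of the time, search a coarse grid
# for the stated confidence that maximizes expected reward.
grid = [i / 10 for i in range(11)]
best_confidence = max(grid, key=lambda c: expected_reward(0.6, c))
print(best_confidence)
```

Because the Brier score is a proper scoring rule, the expected penalty is smallest when stated confidence equals the true probability of being right, so on the grid above the best confidence for a 60% question is 0.6. The binary reward gives no such signal: it is indifferent to what confidence the model reports.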
Sources (7 notes)
Medical triage, legal interpretation, and financial planning show a consistent pattern: surface heuristics conflict with unstated constraints, producing fluent confident errors that concentrate in rare cases where harm occurs. Aggregate accuracy masks these failures because overall performance looks strong.
Accuracy-optimized models systematically miscalibrate by over-weighting dominant user interests. A post-processing reranking algorithm that enforces calibration constraints can restore proportional representation without retraining the underlying model.
Monolith's empirical work shows that real recommendation systems have power-law distributed frequencies, causing collisions to accumulate precisely on the entities models need most accurate. Fixed-size hashed tables worsen this over time as new IDs arrive.
MyPhoneBench demonstrates that task success, privacy-compliant completion, and saved-preference reuse are statistically distinct capabilities with no model dominating all three. Success-only rankings do not predict privacy or preference performance.
Single-score evaluation collapses multi-dimensional agent behavior and creates false confidence in deployment readiness. Research shows agents need benchmarks for trajectory quality, memory hygiene, context efficiency, and verification cost to reflect actual system performance.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.