Do phone agents succeed at all three critical tasks equally?

Explores whether task success, privacy compliance, and preference reuse develop together in phone-use agents, or whether benchmarking one capability tells you nothing about the others.

Note · 2026-05-18 · sourced from Assistants Personalization

The MyPhoneBench evaluation surfaces a finding with direct deployment consequences: the three properties most relevant for phone-use agent deployment (task success, privacy compliance during completion, and proper use of saved preferences in later sessions) are statistically distinct capabilities. No model dominates all three, and a model's score on one does not predict its score on the others.

The pattern matters because of how benchmarks have been structured. Most agent benchmarks score task success: did the agent complete the task as instructed? Models that score well on this single metric get ranked as "frontier" and get deployed. But when the same models are scored jointly on success-plus-privacy or success-plus-preference-reuse, the ranking reshuffles. A model that wins on success-only may lose on success-with-privacy, because it completes tasks by overfilling optional fields with personal data the task does not require. A model with mediocre success may show better privacy compliance because it stops at minimal disclosure.
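The reshuffling can be shown with a small sketch. Everything here is an assumption made for illustration: the `Episode` record, the model names, and the episode outcomes are invented and are not MyPhoneBench data or its scoring code. The sketch only shows how a joint criterion (success and privacy compliance in the same episode) can invert a success-only ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    model: str
    success: bool     # task completed as instructed
    privacy_ok: bool  # nothing disclosed beyond what the task required


# Hypothetical outcomes, invented for illustration only.
EPISODES = [
    Episode("model_a", True, False),
    Episode("model_a", True, False),
    Episode("model_a", True, True),
    Episode("model_a", True, False),
    Episode("model_b", True, True),
    Episode("model_b", False, True),
    Episode("model_b", True, True),
    Episode("model_b", False, True),
]


def rate(episodes, model, predicate):
    """Fraction of a model's episodes that satisfy the predicate."""
    own = [e for e in episodes if e.model == model]
    return sum(predicate(e) for e in own) / len(own)


def rank(episodes, predicate):
    """Model names ordered from best to worst under the given criterion."""
    models = sorted({e.model for e in episodes})
    return sorted(models, key=lambda m: rate(episodes, m, predicate), reverse=True)


success_only = rank(EPISODES, lambda e: e.success)
success_with_privacy = rank(EPISODES, lambda e: e.success and e.privacy_ok)

print(success_only)          # ['model_a', 'model_b']
print(success_with_privacy)  # ['model_b', 'model_a']
```

In this toy data, model_a completes every task but overdiscloses in three of four episodes, so it leads on success alone and trails once privacy is required in the same episode.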

The deeper observation is that "deployment readiness" is not a scalar. It is a vector across the capabilities the deployment actually requires. For phone-use agents, that vector includes at minimum success, privacy compliance, and longitudinal preference handling. For other agent deployments it would include different combinations. Evaluating on the wrong subset of capabilities produces models that score well on the benchmark and fail in production.

For benchmark designers, this argues for joint evaluation as the default rather than as a research add-on. A benchmark that scores only one capability and ranks models on it is producing rankings that will not generalize to deployment. The methodological move is to evaluate the capability vector and present results as multi-dimensional rather than collapsing to a single score.
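One way to report a capability vector without collapsing it is to list the models that no other model beats on every dimension (a Pareto front). The sketch below is a minimal illustration under stated assumptions: the score values and model names are hypothetical, and Pareto dominance is a reporting choice swapped in here, not something the source attributes to the benchmark.

```python
from typing import Dict, List, Tuple

# (task success, privacy compliance, preference reuse), each in [0, 1]
Vector = Tuple[float, float, float]

# Hypothetical scores, invented for illustration only.
SCORES: Dict[str, Vector] = {
    "model_a": (0.9, 0.4, 0.6),
    "model_b": (0.6, 0.9, 0.5),
    "model_c": (0.7, 0.6, 0.8),
    "model_d": (0.5, 0.5, 0.4),
}


def dominates(x: Vector, y: Vector) -> bool:
    """True if x is at least as good as y everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(x, y)) and any(a > b for a, b in zip(x, y))


def pareto_front(scores: Dict[str, Vector]) -> List[str]:
    """Names of models that no other model dominates."""
    return sorted(
        name
        for name, vec in scores.items()
        if not any(
            dominates(other_vec, vec)
            for other, other_vec in scores.items()
            if other != name
        )
    )


print(pareto_front(SCORES))  # ['model_a', 'model_b', 'model_c']
```

With these made-up numbers, three models remain on the front because each leads on a different capability, which is the situation a single-score leaderboard hides. Only model_d drops out, since model_c is at least as good on every dimension.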

For agent developers, the immediate consequence is this: do not assume that models trained for task success will be privacy-compliant or preference-respecting. These properties need to be selected and trained for, not assumed.

Related concepts in this collection

Why do phone-use agents overfill optional personal data fields? Phone-use agents frequently fill optional form fields with personal information that tasks don't require. Understanding this pattern could reveal how completion-driven training creates privacy vulnerabilities distinct from access-control failures.
same paper, the specific failure mode that produces the capability divergence
Can a two-category privacy boundary actually be auditable? Most privacy frameworks are either too vague or too complex for agent deployment. Can a minimal binary split—LOW versus HIGH data categories—provide enough clarity for both users and automated compliance auditing?
same paper, the contract that makes joint evaluation possible
Do short benchmarks predict how models perform over long workflows? Standard LLM benchmarks measure single-turn performance, but real workflows involve sustained delegation across many turns. The question explores whether top benchmark performers maintain accuracy through longer interaction chains.
adjacent: benchmarks-overstate-deployment-readiness pattern at a different scale axis


Original note title

task success privacy compliance and saved-preference reuse are distinct capabilities in phone-use agents — success-only evaluations overestimate deployment readiness
