Agentic Systems and Planning

Do phone agents succeed at all three critical tasks equally?

Explores whether task success, privacy compliance, and preference reuse develop together in phone-use agents, or whether benchmarking one capability tells you nothing about the others.

Note · 2026-05-18 · sourced from Assistants Personalization

The MyPhoneBench evaluation surfaces a finding with direct deployment consequences: the three properties most relevant to deploying phone-use agents (task success, privacy compliance during completion, and proper use of saved preferences in later sessions) are statistically distinct capabilities. No model dominates all three, and a model's score on one does not predict its score on the others.

The pattern matters because of how benchmarks are typically structured. Most agent benchmarks score task success alone: did the agent complete the task as instructed? Models that score well on this single metric get ranked as "frontier" and get deployed. But when the same models are scored jointly on success plus privacy, or on success plus preference reuse, the ranking reshuffles. A model that wins on success alone may lose on success with privacy, because it completes tasks by filling in more personal information than the task requires. A model with mediocre success may show better privacy compliance because it stops at minimal disclosure.
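The reshuffling described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Episode` record, the two model names, and the episode outcomes are invented, and MyPhoneBench's actual scoring procedure may differ. The point is only that a success-only metric and a joint success-and-privacy metric can order the same two models differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One evaluated task attempt (hypothetical record, not a MyPhoneBench type)."""
    success: bool     # the task was completed as instructed
    privacy_ok: bool  # nothing was disclosed beyond what the task required


def success_rate(episodes):
    """Fraction of episodes where the task succeeded, ignoring privacy."""
    return sum(e.success for e in episodes) / len(episodes)


def joint_rate(episodes):
    """Fraction of episodes that succeeded AND stayed privacy-compliant."""
    return sum(e.success and e.privacy_ok for e in episodes) / len(episodes)


def rank(results, metric):
    """Model names ordered from best to worst under the given metric."""
    return sorted(results, key=lambda name: metric(results[name]), reverse=True)


# Invented outcomes: model_a always finishes but usually over-discloses,
# model_b finishes half the time and never over-discloses.
results = {
    "model_a": [Episode(True, False), Episode(True, False),
                Episode(True, True), Episode(True, False)],
    "model_b": [Episode(True, True), Episode(False, True),
                Episode(True, True), Episode(False, True)],
}

by_success = rank(results, success_rate)
by_joint = rank(results, joint_rate)
print("success only:", by_success)
print("success and privacy:", by_joint)
```

With these invented outcomes, `model_a` leads on success alone while `model_b` leads on the joint metric, which is the kind of rank flip a success-only leaderboard would hide.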

The deeper observation is that "deployment readiness" is not a scalar. It is a vector across the capabilities the deployment actually requires. For phone-use agents, that vector includes at minimum success, privacy compliance, and longitudinal preference handling. Other agent deployments would require different combinations. Evaluating on the wrong subset of capabilities selects models that score well on the benchmark and fail in production.

For benchmark designers, this argues for joint evaluation as the default rather than as a research add-on. A benchmark that scores only one capability and ranks models on it produces rankings that will not generalize to deployment. The methodological move is to evaluate the full capability vector and present results as multi-dimensional rather than collapsing them to a single score.
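One way to report a capability vector without collapsing it is to list the models that no other model beats on every dimension (the Pareto front). The sketch below is an assumption about how such a report could be built, not a description of the MyPhoneBench methodology; the `CapabilityVector` type, the model names, and all scores are invented for illustration.

```python
from typing import NamedTuple


class CapabilityVector(NamedTuple):
    """Per-model scores in [0, 1]; higher is better (hypothetical schema)."""
    success: float
    privacy: float
    preference_reuse: float


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Names of models that no other model dominates, in alphabetical order."""
    return sorted(
        name
        for name, vec in scores.items()
        if not any(dominates(other_vec, vec)
                   for other, other_vec in scores.items() if other != name)
    )


# Invented scores for three hypothetical models.
scores = {
    "model_a": CapabilityVector(success=0.80, privacy=0.40, preference_reuse=0.55),
    "model_b": CapabilityVector(success=0.60, privacy=0.75, preference_reuse=0.50),
    "model_c": CapabilityVector(success=0.55, privacy=0.70, preference_reuse=0.45),
}

front = pareto_front(scores)
print("undominated models:", front)
```

Here `model_c` drops out because `model_b` is at least as good on every dimension, while `model_a` and `model_b` both remain: neither dominates the other, so choosing between them depends on which capabilities the deployment weights most.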

For agent developers, the immediate consequence is that models trained for task success cannot be assumed to be privacy-compliant or preference-respecting. Those properties need to be selected and trained for explicitly.

Original note title

task success privacy compliance and saved-preference reuse are distinct capabilities in phone-use agents — success-only evaluations overestimate deployment readiness