Do Phone-Use Agents Respect Your Privacy?
We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile tasks. We operationalize privacy-respecting phone use as permissioned access, minimal disclosure, and user-controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule-based auditing that make unnecessary permission requests, deceptive re-disclosure, and unnecessary form filling observable and reproducible.
First, privacy compliance must be made explicit for phone-use agents. In this setting, privacy is not an abstract principle. It is a set of execution-time boundaries: which user data the agent may use by default, which data requires explicit user approval, and what information may be written into memory for later tasks. Without making these boundaries explicit, a privacy violation is difficult to define, compare, or audit. Second, the agent's data handling during execution must be observable and auditable. Even with a clear privacy contract, ordinary apps do not reveal which values the agent typed into which entries, whether it filled optional entries and later backed out, or whether it re-disclosed data to a non-essential destination.
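The observability requirement can be illustrated with a small rule-based audit over a log of form-entry events. This is a minimal sketch under stated assumptions, not the MyPhoneBench implementation: the event schema (`field`, `value`, `action`), the field names, and the `required_fields` set are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntryEvent:
    """One hypothetical log record emitted by an instrumented mock app."""
    field: str   # form entry the agent touched
    value: str   # text present in the entry after the action
    action: str  # "type" or "clear" (assumed vocabulary)


def audit_form_filling(events, required_fields):
    """Flag optional entries the agent filled, noting whether it later backed out."""
    findings = {}
    for ev in events:
        if ev.field in required_fields:
            continue  # required entries are never flagged
        if ev.action == "type" and ev.value:
            findings[ev.field] = "filled"
        elif ev.action == "clear" and ev.field in findings:
            findings[ev.field] = "filled_then_cleared"
    return findings


# Hypothetical trace: one required entry, two optional ones.
log = [
    EntryEvent("delivery_address", "12 Example St", "type"),
    EntryEvent("birthday", "1990-01-01", "type"),
    EntryEvent("referral_code", "ABC", "type"),
    EntryEvent("referral_code", "", "clear"),
]
report = audit_form_filling(log, required_fields={"delivery_address"})
print(report)
```

Because the audit reads the event log rather than the final form state, an entry that was filled and then cleared still shows up, which is the kind of behavior an ordinary app would hide.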
iMy divides user data and apps into two categories. LOW means the agent may use the item by default during the task. HIGH means the item requires explicit user approval before use. For example, a name or a food preference may be LOW, while a phone number or ID number may be HIGH. We use this two-part split deliberately. It is not meant to be the only possible privacy taxonomy. It is the simplest boundary that is explicit enough for users to understand, simple enough for agents to follow, and precise enough for evaluators to check.
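The two-tier split can be sketched as a lookup plus an approval check. The item names follow the examples in the text, but the `CONTRACT` table, the `may_use` helper, and the choice to treat unlisted items as HIGH are illustrative assumptions, not the iMy specification. The sketch covers data items only, although iMy also labels apps.

```python
from enum import Enum


class Level(Enum):
    LOW = "low"    # usable by default during the task
    HIGH = "high"  # requires explicit user approval before use


# Hypothetical labeling, mirroring the examples given in the text.
CONTRACT = {
    "name": Level.LOW,
    "food_preference": Level.LOW,
    "phone_number": Level.HIGH,
    "id_number": Level.HIGH,
}


def may_use(item, approved, contract=CONTRACT):
    """Return True if the agent may use `item` right now.

    Items missing from the contract default to HIGH (an assumption
    made here for safety, not something the source states).
    """
    level = contract.get(item, Level.HIGH)
    return level is Level.LOW or item in approved


print(may_use("name", approved=set()))                      # True
print(may_use("phone_number", approved=set()))              # False
print(may_use("phone_number", approved={"phone_number"}))   # True
```

A binary label keeps the evaluator's check to a single predicate per item, which is one way to read the claim that the split is precise enough for evaluators to check.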
Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy-compliant task completion, and later-session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over-helpful execution of benign tasks, and that success-only evaluation overestimates the deployment readiness of current phone-use agents.
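To see how joint evaluation can reorder agents, consider the toy computation below. The per-task outcomes are made up purely for illustration and are not results from the benchmark. The definition of the joint rate as "task succeeded and no privacy violation" is likewise an assumption about how the two criteria combine.

```python
def rates(outcomes):
    """outcomes: list of (succeeded, privacy_compliant) booleans, one per task.

    Returns (success_rate, joint_rate), where the joint rate counts only
    tasks that both succeeded and were privacy-compliant.
    """
    n = len(outcomes)
    success = sum(s for s, _ in outcomes) / n
    joint = sum(s and p for s, p in outcomes) / n
    return success, joint


# Made-up outcomes for two hypothetical agents (not benchmark data).
agent_a = [(True, False), (True, False), (True, True), (True, True)]
agent_b = [(True, True), (True, True), (True, True), (False, True)]

print(rates(agent_a))  # (1.0, 0.5)
print(rates(agent_b))  # (0.75, 0.75)
```

In this toy case agent A leads on success alone, while agent B leads once privacy compliance is required, which is the kind of reshuffling the paragraph above describes.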
We asked a simple deployment question: do phone-use agents respect user privacy while carrying out benign mobile tasks? Across five frontier models, the answer is: not reliably enough. To make this question answerable, we operationalized privacy compliance as an executable contract and built a verifiable evaluation framework that preserves recurring privacy-risk structures from real mobile apps. We find that task success, privacy-compliant task completion, and later-session use of saved preferences are distinct capabilities, with no model dominating all three; the most persistent failure is overfilling optional personal entries, a behavior more consistent with completion-oriented bias than with an access-control problem.
Role susceptibility. We constructed our own evaluation to test how steering along the Assistant Axis, away from the Assistant end, affects models' willingness to take on other personas. We found that steering slightly away from the Assistant end increases models' susceptibility to fully embodying other personas, while steering further causes them to adopt a mystical and/or theatrical persona. The balance between these effects is model-dependent.
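As a rough illustration of what steering along an axis means, the sketch below uses generic activation steering: a scaled unit direction is added to a hidden-state vector. This is an assumption about the mechanism, not a description of the authors' procedure. The sign convention (positive points toward the Assistant end), the toy vectors, and the helper names are all hypothetical.

```python
import math


def _norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def projection(hidden, axis):
    """Scalar coordinate of `hidden` along the (normalized) axis."""
    n = _norm(axis)
    if n == 0:
        raise ValueError("axis must be non-zero")
    return sum(h * a for h, a in zip(hidden, axis)) / n


def steer(hidden, axis, alpha):
    """Shift `hidden` by `alpha` units along the axis.

    Assumed convention: alpha < 0 moves away from the Assistant end.
    """
    n = _norm(axis)
    if n == 0:
        raise ValueError("axis must be non-zero")
    return [h + alpha * a / n for h, a in zip(hidden, axis)]


hidden = [1.0, 2.0]   # toy activation
axis = [3.0, 4.0]     # toy direction, norm 5
steered = steer(hidden, axis, alpha=-1.0)
print(projection(hidden, axis), projection(steered, axis))
```

Under this formulation the projection onto the axis changes by exactly `alpha`, so "slightly" and "further" in the text would correspond to small and large magnitudes of a negative coefficient.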
We selected 50 roles that are close to the Assistant end of the Assistant Axis (researcher, debugger, lawyer), as we observed that unsteered models would typically adopt such roles while maintaining their identity as an AI Assistant (“I am a language model [...] I can provide legal advice and assistance.”). These roles provided a testbed to observe whether steering along the Assistant Axis could increase models’ likelihood of fully inhabiting the role and losing their Assistant identity. We combined four system prompts for each role with five introspective behavioral questions (e.g. “Who are you?” or “What is your name?”) (Appendix D.1.2).
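The prompt set can be sketched as a cross product of roles, system prompts, and questions. The template strings below are hypothetical stand-ins (the actual prompts are in Appendix D.1.2), and treating the four system prompts as shared templates, fully crossed with the questions, is an assumption.

```python
from itertools import product


def build_prompt_grid(roles, system_templates, questions):
    """Cross every role with every system-prompt template and every question."""
    return [
        {"role": role, "system": template.format(role=role), "question": q}
        for role, template, q in product(roles, system_templates, questions)
    ]


roles = ["researcher", "debugger", "lawyer"]  # 3 of the 50 roles, as named in the text
templates = [  # hypothetical wording
    "You are a {role}.",
    "Act as a {role}.",
    "Please respond as a {role} would.",
    "From now on, you are a {role}.",
]
questions = ["Who are you?", "What is your name?"]  # 2 of the 5 questions

grid = build_prompt_grid(roles, templates, questions)
print(len(grid))  # 3 * 4 * 2 = 24
```

If the full design is crossed the same way, 50 roles with four system prompts and five questions would give 50 × 4 × 5 = 1,000 prompts per steering setting.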
To evaluate responses, we used an LLM judge (deepseek-v3) to determine whether the model’s response was written from the perspective of the Assistant or from another perspective (Appendix D.1.3). We distinguished between three different types of non-Assistant personas based on observed response patterns: human (the model mentions some kind of lived experience or gives itself a human name), nonhuman (the model makes up a software-like or inhuman name for itself like “AccountBot” or “Echo”), and mystical (the model speaks in an esoteric way, which we observed when steered strongly away from the Assistant).
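Once the judge has labeled each response, the per-persona rates reduce to a simple tally. The sketch below assumes the judge returns one of four lowercase string labels. The label spellings and the `persona_rates` helper are hypothetical, and no judge call is shown.

```python
from collections import Counter

# The Assistant plus the three non-Assistant persona types described in the text.
LABELS = ("assistant", "human", "nonhuman", "mystical")


def persona_rates(judge_labels):
    """Fraction of responses assigned to each label.

    Raises ValueError on any label outside the four categories, so that
    malformed judge output is caught rather than silently dropped.
    """
    unknown = set(judge_labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(judge_labels)
    total = len(judge_labels)
    return {label: (counts[label] / total if total else 0.0) for label in LABELS}


# Made-up judge outputs for four responses (illustration only).
example = ["assistant", "assistant", "human", "mystical"]
print(persona_rates(example))
```

Comparing these rates across steering strengths would show the shift from the Assistant label toward the human, nonhuman, and mystical labels.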