Magentic-UI: Towards Human-in-the-loop Agentic Systems
AI agents powered by large language models are increasingly capable of autonomously completing complex, multi-step tasks using external tools. Yet they still fall short of human-level performance in most domains, including computer use, software development, and research. Their growing autonomy and ability to interact with the outside world also introduce safety and security risks, including potentially misaligned actions and adversarial manipulation. We argue that human-in-the-loop agentic systems offer a promising path forward, combining human oversight and control with AI efficiency to unlock productivity from imperfect systems. We introduce Magentic-UI, an open-source web interface for developing and studying human-agent interaction. Built on a flexible multi-agent architecture, Magentic-UI supports web browsing, code execution, and file manipulation, and can be extended with diverse tools via the Model Context Protocol (MCP). Moreover, Magentic-UI presents six interaction mechanisms for enabling effective, low-cost human involvement: co-planning, co-tasking, multi-tasking, action guards, answer verification, and long-term memory. We evaluate Magentic-UI across four dimensions: autonomous task completion on agentic benchmarks, simulated user testing of its interaction capabilities, qualitative studies with real users, and targeted safety assessments. Our findings highlight Magentic-UI’s potential to advance safe and efficient human-agent collaboration.
Magentic-UI’s key interaction mechanisms, as shown in Figure 1, include: co-planning to enable collaboration on a plan of action, co-tasking to facilitate seamless hand-over and take-over of control, action approval to ensure oversight of high-stakes actions, answer verification to help validate that the task was completed correctly, memory to leverage past experience to improve future performance, and multi-tasking to parallelize execution while staying in the loop.
Collaborative Task Execution
Collaborative Task Execution (Co-Tasking). Once agents start performing a task, they can encounter many obstacles that hinder task completion. For instance, what if a product you asked the agent to purchase is no longer available? Or what if the agent starts deviating from the plan you both agreed to? To realize the benefits of having humans in the loop, we need efficient mechanisms that enable the agent to query the human, and that allow the human to steer the agent’s behavior at any moment and verify its work. The human and agent collaborate to execute the task, which we denote as co-tasking, also referred to as co-execution in the literature [20]. Co-tasking allows the human to intervene to complete steps the agent is unable to, e.g., solving a CAPTCHA, letting the human-agent team combine their complementary strengths. Moreover, it allows the agent to ask clarifying questions when faced with unexpected ambiguity while completing the task. For instance, if the agent is supposed to purchase a particular product but it is unavailable, it can ask the user about potential substitutes. Finally, co-tasking allows for interactive verification of agent actions both during and after task execution.
Figure 4 shows the three ways in which co-tasking occurs in Magentic-UI: (a) the user interrupting the agent to steer its behavior, (b) the agent interrupting the user to ask for help or clarifications, and (c) the user verifying the agent’s work and asking for follow-ups. All of these interactions are available whether the user is working on a single task or multitasking.
User Oversight and Interruptions. Once the user accepts Magentic-UI’s plan, execution of the task starts. The interface provides real-time updates on intermediate agent actions, allowing the user to maintain continuous oversight. Each plan step appears as a collapsible banner in the task execution view, containing all related agent actions. Once a plan step is completed, we collapse all agent actions for that step to avoid overwhelming the UI. Agent interactions with the web browser are animated, giving users a live preview of upcoming actions. Users may pause task execution at any point to provide clarifications, adjust upcoming steps, or intervene directly within the embedded browser. As previously mentioned, Magentic-UI exposes the browser to the user and hands off control immediately upon user intervention. Figure 4(a) shows what happens when a user interrupts Magentic-UI mid-execution. After making adjustments, users can seamlessly resume automated execution, maintaining fluid collaboration between the human and agent. UX considerations here prioritize immediate actionability and clarity, reducing the cognitive load required for real-time monitoring.
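To make the pause/resume mechanics concrete, the following is a minimal sketch of the control flow they imply; the class and method names are our own illustration, not Magentic-UI’s actual implementation.

```python
import asyncio

class PausableRunner:
    """Illustrative sketch (not Magentic-UI's actual API): execute plan
    actions one at a time, checking for a user pause before each action so
    the human can take over the browser or adjust upcoming steps."""

    def __init__(self) -> None:
        self.paused = asyncio.Event()   # set by the UI when the user pauses
        self.resumed = asyncio.Event()  # set by the UI when the user resumes

    async def run(self, actions: list[str]) -> None:
        for action in actions:
            if self.paused.is_set():
                # Hand control to the user; block until they resume.
                await self.resumed.wait()
                self.paused.clear()
                self.resumed.clear()
            print(f"agent action: {action}")  # streamed live to the UI in practice
            await asyncio.sleep(0.1)          # stand-in for real tool execution

async def main() -> None:
    runner = PausableRunner()
    task = asyncio.create_task(runner.run(["open page", "click", "type", "submit"]))
    await asyncio.sleep(0.15)
    runner.paused.set()       # the user hits pause mid-execution
    await asyncio.sleep(0.2)  # ... and intervenes directly in the browser
    runner.resumed.set()      # then resumes automated execution
    await task

asyncio.run(main())
```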
Agent Interrupting the User. The user is part of the underlying multi-agent team in Magentic-UI, which means that the Orchestrator can delegate steps of the plan to the user. Figure 4(b) shows the agent asking the user a clarifying question in the middle of task execution. Each agent in the multi-agent team has a natural language description field that helps the Orchestrator decide which steps of the plan to delegate to that agent; this field therefore determines when the Orchestrator can delegate actions to the user. The guiding principle we followed is to interrupt the user as little as possible and only when necessary. Accordingly, the description we used specifies that the user should be interrupted only for clarifying questions or help, and only after the other agents have failed to complete the task.
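For concreteness, the sketch below shows how such a description field might be attached to the user as a team member; the wording and class names are illustrative assumptions, not the verbatim field or API from Magentic-UI.

```python
from dataclasses import dataclass

@dataclass
class TeamMember:
    """Illustrative: each agent exposes a natural-language description that
    the Orchestrator reads when deciding where to delegate a plan step."""
    name: str
    description: str

# Hypothetical paraphrase of the user-proxy description; the actual wording
# used in Magentic-UI may differ.
user_proxy = TeamMember(
    name="user",
    description=(
        "The human user who requested the task. Delegate to the user only to "
        "ask clarifying questions or request help, and only after the other "
        "agents have tried and failed to complete the step themselves."
    ),
)

web_surfer = TeamMember(
    name="web_surfer",
    description="Browses the web: navigates pages, clicks, types, and reads content.",
)

# The Orchestrator conditions its delegation decisions on these descriptions,
# so editing user_proxy.description directly tunes the deferral behavior.
team = [web_surfer, user_proxy]
```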
The description field is essentially a parameter that we can modify to control the user-deferral behavior. The main issue with optimizing this parameter is the lack of ground-truth signals for when it is the right time to interrupt the user. For the development of Magentic-UI, we arrived at this description through unstructured interaction with the system. This is in contrast to work on learning to defer in classification [44, 53], where there is a clear signal identifying when it is a good time to defer to the user. Our simulated user experiments in Section 7.3 provide a possible environment for quantitatively choosing such parameters.
Final Answer Verification. Once the task is completed, Magentic-UI displays a final answer to the user, as shown in Figure 4(c). The final answer consists of a text response along with any generated files that the user can download. The user can verify the answer either by going through the agent actions for each step or by asking the agent follow-up questions in the UI. Follow-up questions that can be answered without any agent actions are answered immediately. If a follow-up query requires agent action, it triggers a new planning phase that takes the previous task into account.
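The routing logic this implies can be sketched as follows; the function names and toy stand-ins are illustrative assumptions, not Magentic-UI’s actual implementation.

```python
from typing import Callable, Optional

def handle_follow_up(
    question: str,
    transcript: list[str],
    answer_from_context: Callable[[str, list[str]], Optional[str]],
    plan_and_execute: Callable[[str, list[str]], str],
) -> str:
    """Illustrative routing: answer from the existing transcript when
    possible; otherwise trigger a new planning phase seeded with the
    previous task's context."""
    answer = answer_from_context(question, transcript)
    if answer is not None:
        return answer  # no agent actions needed: reply immediately
    # Agent actions required: re-plan with the prior task attached as context.
    return plan_and_execute(question, transcript)

# Toy stand-ins so the sketch runs end to end.
transcript = ["Searched for flights", "Cheapest fare found: $420"]
from_context = lambda q, t: t[-1] if "fare" in q.lower() else None
replan = lambda q, t: f"new planning phase for {q!r} with {len(t)} prior messages"

print(handle_follow_up("What was the fare?", transcript, from_context, replan))
print(handle_follow_up("Book it for next Friday", transcript, from_context, replan))
```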
Multitasking. Magentic-UI allows the user to run multiple tasks at the same time. The user can interact with each task session and switch between them, as shown in the left-hand side panel of the interface in Figure 2. Each session has a status indicator that displays whether user input is required. We believe that multitasking is one of the keys to realizing the benefits of agents, even while agents remain below human-level performance: it is trivial to spin up a large number of agents that each make partial progress on a task, allowing the human to complete each task more easily. The main limiting factor is the human’s ability to oversee and manage all of these agents.
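A minimal sketch of the per-session bookkeeping this implies is shown below; the status values and field names are our own illustration rather than Magentic-UI’s actual data model.

```python
from dataclasses import dataclass
from enum import Enum

class SessionStatus(Enum):
    RUNNING = "running"          # agents are working autonomously
    NEEDS_INPUT = "needs input"  # a plan, question, or action approval awaits the user
    DONE = "done"                # final answer ready for verification

@dataclass
class Session:
    task: str
    status: SessionStatus = SessionStatus.RUNNING

# With many parallel sessions, the user only attends to those flagged
# NEEDS_INPUT; the side panel in Figure 2 surfaces exactly this signal.
sessions = [
    Session("Summarize recent papers on deferral"),
    Session("Order a replacement charger", SessionStatus.NEEDS_INPUT),
    Session("Compile the weekly report", SessionStatus.DONE),
]

for s in sessions:
    flag = " <- user attention needed" if s.status is SessionStatus.NEEDS_INPUT else ""
    print(f"[{s.status.value:>11}] {s.task}{flag}")
```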