LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, a significant gap remains in benchmarking how effectively AI agents can solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.
Our experiments reveal that even state-of-the-art models struggle with the demands of complex tool orchestration, achieving success rates below 60%. This highlights a significant gap between current agent capabilities and the robustness required for truly autonomous task execution. More importantly, by digging into the error cases across models, we glean useful insights into their agentic capabilities. By carefully analyzing the agent trajectories (Chen et al., 2023), we identify seven common failure modes in frontier models, shedding light on how to further improve these systems. Moreover, we observe a striking log-shaped curve in token efficiency: closed-source models gain rapidly and then plateau, while open-source models fail to turn tokens into reliable evidence. We release this benchmark to accelerate development of tool-augmented AI systems and foster innovations in planning, reasoning, and long-horizon task execution (Xi et al., 2023).
Closed-source models exhibit a mild upward trend with token usage, yet planning quality remains the primary driver of success. Open-source models show two characteristic inefficiencies. Llama variants cluster in the low-token, low-tool region, under-exploring tool affordances and often stopping early, which yields low ARS and TSR. Qwen variants trend toward the opposite extreme, producing longer outputs and invoking more tools without gains commensurate with those of closed-source models. Extended-thinking variants consistently shift the efficiency frontier upward at comparable token budgets, suggesting that the gains come from improved planning and error recovery rather than verbosity.
5.2 FAILURE ANALYSIS
To diagnose failure modes in MCP-based tool use, we carefully analyze execution logs across different models and identify three error categories comprising seven subtypes: tool planning and orchestration errors (1–4), parameter errors (5–6), and output handling errors (7). (1) Ignoring requirement: the agent misses an explicitly stated requirement and does not select any relevant tool. Typical signs include the absence of any corresponding reasoning or tool call, early termination, or a generic final answer that does not address the requirement. This often occurs when the agent fails to extract key requirements from the prompt or loses track of them during execution. (2) Overconfident self-solving: the agent recognizes the requirement but attempts to answer from its own knowledge and reasoning without calling the needed tool. Symptoms include no corresponding tool call, generic or hallucinated answers, and premature termination. (3) Unproductive thinking: the agent acknowledges that a tool is needed and may discuss plans or parameters, but never initiates the call and does not propose any solution that addresses the requirement. It loops in unproductive or verbose thinking and eventually times out or gives up. Symptoms include repeated plan rewrites without execution, token-consuming thinking, and reaching the round limit with zero calls for the requirement. (4) Wrong tool selection: the agent calls a tool but chooses an inappropriate one, leading to erroneous intermediate states or final outputs. This can manifest as a single misselection or as repeated wrong calls until the budget is exhausted. Symptoms include irrelevant responses, repeated mistakes, or missing required fields in outputs. (5) Syntactic errors: parameters provided to a tool are malformed, with incorrect types, missing or misnamed fields, or an invalid schema. These errors prevent the MCP server from correctly parsing the request, leading to failure. (6) Semantic errors: parameters are well-formed but do not match the task intent. Common cases include mis-scoped query strings, wrong identifiers or entity references, and incorrect contextual constraints; these errors often arise from mistakes in the intermediate reasoning used to generate parameters (a minimal sketch contrasting the two parameter-error subtypes follows the observation list below). (7) Output parsing errors: the tool returns a correct result, but the agent mishandles it during parsing, causing incorrect intermediate states or final answers. We further evaluate several popular models spanning a range of capabilities; their core error-type distributions are shown in Figure 7. Several observations can be drawn from the results:
• Semantic errors dominate: even strong models show rates of 16–25%, while smaller ones exceed 40% (e.g., GPT-4.1-mini), pinpointing content grounding and constraint enforcement as the primary bottleneck in live tool use.
• Syntactic errors are negligible for frontier models but catastrophic for Llama-3.3-70B-Instruct (≈48%). A likely cause is limited MCP-specific training, since MCP adoption surged (Ehtesham et al., 2025) only after the Llama-3 release (Meta Llama Team, 2024); this suggests that targeted fine-tuning on MCP function-call schemas could substantially cut such errors and boost overall performance.
• Overconfident self-solving is common in mid-tier models: they often skip tool calls because planning and tool screening remain brittle under large tool pools and long contexts, making reliance on internal knowledge (Chhikara, 2025) seem safer than attempting uncertain tool selection and parameterization.
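To make the distinction between the two parameter-error subtypes concrete, the sketch below contrasts a syntactic error with a semantic error for a hypothetical MCP web-search tool; the tool schema, field names, and example queries are illustrative assumptions and are not part of the benchmark.

```python
# Minimal sketch (not LiveMCP-101 code): contrasting a syntactic parameter
# error (subtype 5) with a semantic parameter error (subtype 6) for a
# hypothetical MCP web-search tool. The schema and queries are assumptions.
import jsonschema

# Hypothetical input schema, as an MCP server might advertise via tools/list.
SEARCH_TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "max_results": {"type": "integer"},
    },
    "required": ["query"],
    "additionalProperties": False,
}

def classify_call(arguments: dict) -> str:
    """Check whether a tool call is at least syntactically valid."""
    try:
        jsonschema.validate(instance=arguments, schema=SEARCH_TOOL_SCHEMA)
    except jsonschema.ValidationError as err:
        # Subtype (5): the server cannot accept the request at all.
        return f"syntactic error: {err.message}"
    # A schema-valid call may still be semantically wrong (subtype 6);
    # that judgment requires comparing the arguments against the task intent.
    return "well-formed; semantic correctness depends on the task intent"

# (5) Syntactic error: misnamed field ("q") and wrong type for max_results.
print(classify_call({"q": "NVIDIA Q2 FY2025 revenue", "max_results": "five"}))

# (6) Semantic error: schema-valid, but the query targets the wrong fiscal year.
print(classify_call({"query": "NVIDIA Q2 FY2019 revenue", "max_results": 5}))
```

In practice, syntactic errors are rejected by the MCP server's schema validation, whereas semantic errors pass validation and only surface downstream when the returned content fails to satisfy the task intent.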