Evaluation and Benchmarking of LLM Agents: A Survey

Paper · arXiv 2507.21504 · Published July 29, 2025

The rise of LLM-based agents has opened new frontiers in AI applications, yet evaluating these agents remains a complex and underdeveloped area. This survey provides an in-depth overview of the emerging field of LLM agent evaluation, introducing a two-dimensional taxonomy that organizes existing work along (1) evaluation objectives—what to evaluate, such as agent behavior, capabilities, reliability, and safety—and (2) evaluation process—how to evaluate, including interaction modes, datasets and benchmarks, metric computation methods, and tooling. In addition to the taxonomy, we highlight enterprise-specific challenges, such as role-based access to data, the need for reliability guarantees, dynamic and long-horizon interactions, and compliance, which are often overlooked in current research. We also identify future research directions, including holistic, more realistic, and scalable evaluation. This work aims to bring clarity to the fragmented landscape of agent evaluation and provide a framework for systematic assessment, enabling researchers and practitioners to evaluate LLM agents for real-world deployment.

The Evaluation Objectives dimension is concerned with the targets of evaluation. The first category in this dimension, Agent Behavior, focuses on outcome-oriented aspects such as task completion and output quality, capturing how well an agent meets end-users’ expectations. Next, Agent Capabilities emphasize process-oriented competencies, including tool use, planning and reasoning, memory and context retention, and multi-agent collaboration. These capabilities provide insights into how agents achieve their goals and how well they meet their design specifications. Reliability assesses whether an agent behaves consistently for the same input and robustly when input varies or the system encounters errors. Finally, Safety and Alignment evaluates the agent’s trustworthiness and security, including fairness, compliance, and the prevention of harmful or unethical behaviors.

The Evaluation Process dimension describes how agents are assessed. Interaction Mode distinguishes between static evaluation, where agents respond to fixed inputs, and interactive assessment, where agents engage with users. Evaluation Data covers both synthetic and real-world datasets, as well as benchmarks tailored to specific domains such as software engineering, healthcare, and finance [23, 62]. Metrics Computation Methods encompasses quantitative measures, such as task success and factual accuracy, as well as qualitative evaluations based on human or LLM judgments. Evaluation Tooling refers to the supporting infrastructure, such as instrumentation frameworks (e.g., LangSmith, Arize AI) and public leaderboards (e.g., Holistic Evaluation of Agents), that enables scalable and reproducible assessment. Lastly, Evaluation Contexts define the environment in which evaluations are conducted, from controlled simulations to open-world settings such as web browsers or APIs.

Several metrics have been proposed to assess these abilities. Invocation Accuracy [54] evaluates whether the agent makes the correct decision about whether to call a tool at all. Tool Selection Accuracy measures whether the proper tool is chosen from a list of options. Retrieval Accuracy focuses on whether the system can retrieve the correct tool from a larger toolset, often measured using rank accuracy@k. For ranking-based evaluation, Mean Reciprocal Rank (MRR) quantifies the position of the correct tool in the ranked list, while Normalized Discounted Cumulative Gain (NDCG) reflects how well the system ranks all relevant tools [54].
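To make these ranking metrics concrete, the following minimal Python sketch computes MRR and a binary-relevance NDCG@k over ranked tool lists; the function names and toy data are illustrative, and the exact formulations used in [54] may differ in detail.

import math

def mean_reciprocal_rank(ranked_tool_lists, gold_tools):
    # MRR over queries: reciprocal rank of the gold tool in each ranked list (0 if absent).
    total = 0.0
    for ranked, gold in zip(ranked_tool_lists, gold_tools):
        reciprocal = 0.0
        for position, tool in enumerate(ranked, start=1):
            if tool == gold:
                reciprocal = 1.0 / position
                break
        total += reciprocal
    return total / len(gold_tools)

def ndcg_at_k(ranked, relevant, k):
    # NDCG@k with binary relevance: a retrieved tool scores 1 if it is in the relevant set.
    dcg = sum(1.0 / math.log2(pos + 1)
              for pos, tool in enumerate(ranked[:k], start=1) if tool in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Two toy queries: the gold tool is ranked 2nd, then 1st -> MRR = (0.5 + 1.0) / 2
print(mean_reciprocal_rank([["search", "calculator"], ["calculator", "search"]],
                           ["calculator", "calculator"]))
print(ndcg_at_k(["search", "calculator", "weather"], {"calculator"}, k=3))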

Parameter-related evaluation involves two aspects. The parameter name F1 score [83] measures the agent’s ability to correctly identify the parameter names required for a given function and to assign correct values to them. While some evaluations rely on the correctness of abstract syntax trees (ASTs) to check whether the tool call is syntactically valid, this approach may miss semantic errors, such as incorrect or hallucinated parameter values, especially for parameters constrained to enumerated types [75]. To address this limitation, recent work, such as the Gorilla paper, has proposed execution-based evaluation, in which the system runs the tool calls and assesses their outcomes, offering a more comprehensive and grounded assessment of tool-use capability [75].
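As a rough illustration of the name-level half of this check, the sketch below computes an F1 score over parameter names for a single predicted tool call against a reference call; the helper name and example calls are hypothetical, and the exact scoring in [83] may differ (for instance, in how values are matched afterwards).

def parameter_name_f1(predicted_call, reference_call):
    # predicted_call / reference_call: dicts mapping parameter name -> value for one tool call.
    predicted_names, gold_names = set(predicted_call), set(reference_call)
    true_positives = len(predicted_names & gold_names)
    precision = true_positives / len(predicted_names) if predicted_names else 0.0
    recall = true_positives / len(gold_names) if gold_names else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One correct name ("location"), one spurious name ("unit"), one missing name ("date") -> F1 = 0.5
predicted = {"location": "Paris", "unit": "celsius"}
reference = {"location": "Paris", "date": "2025-07-29"}
print(parameter_name_f1(predicted, reference))

A value-level check can then compare values only for the matched names, while execution-based evaluation goes further by running the call and scoring its outcome.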

3.2.2 Planning and Reasoning: Planning and reasoning are essential capabilities for LLM-based agents, especially in complex tasks that require multiple steps or decisions under uncertainty. Planning involves selecting the correct set of tools in an appropriate order, while reasoning enables agents to make context-aware decisions, either ahead of time or dynamically during task execution [37]. T-Eval [12] formulated planning evaluation as comparing the set of predicted tools against a reference. Since tool order and dependency also matter, some benchmarks adopt graph-based representations and introduce metrics such as Node F1 for tool selection and Edge F1 or Normalized Edit Distance for assessing tool invocation sequences and structural accuracy [83].
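As a rough sketch of the graph-based comparison, the snippet below treats a plan as a set of tool nodes plus dependency edges and computes Node F1 and Edge F1 with a shared set-level F1 helper; the plan representation and toy plans are assumptions for illustration, not the exact formulation of [83].

def set_f1(predicted, reference):
    # Generic F1 between two sets (used for both tool nodes and dependency edges).
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Plans as graphs: nodes are tool names, edges are (prerequisite, dependent) pairs.
predicted_nodes = {"search", "summarize", "email"}
reference_nodes = {"search", "summarize", "translate"}
predicted_edges = {("search", "summarize"), ("summarize", "email")}
reference_edges = {("search", "summarize"), ("summarize", "translate")}

print("Node F1:", set_f1(predicted_nodes, reference_nodes))  # ~0.67
print("Edge F1:", set_f1(predicted_edges, reference_edges))  # 0.5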

In dynamic environments, agents often need to interleave planning and execution, adapting their actions in response to evolving context [37]. This pattern is illustrated by the ReAct paradigm, where agents alternate between reasoning steps and tool usage [106]. Evaluating such adaptive reasoning requires more than comparing static plans—it demands metrics that reflect decision-making in real time. The T-Eval framework [12] addresses this by introducing a reasoning metric that assesses how closely an agent’s predicted next tool call aligns with the expected one at each step. This captures the agent’s ability to make informed decisions when tool outputs are not known in advance. Similarly, AgentBoard [64] proposes the metric Progress Rate, which compares the agent’s actual trajectory against the expected one, offering a fine-grained measure of how effectively the agent advances toward its goal.
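These two step-level ideas can be sketched as simple ratios: a per-step match rate between predicted and expected next tool calls (a simplified stand-in for T-Eval’s step-level reasoning score) and a milestone-based progress rate in the spirit of AgentBoard’s Progress Rate; the function names and the notion of “milestone” below are illustrative assumptions.

def stepwise_next_call_accuracy(predicted_calls, expected_calls):
    # Fraction of steps where the predicted next tool call matches the expected one.
    if not expected_calls:
        return 0.0
    matches = sum(p == e for p, e in zip(predicted_calls, expected_calls))
    return matches / len(expected_calls)

def progress_rate(trajectory, milestones):
    # Fraction of reference subgoals (milestones) the agent's trajectory has reached so far.
    if not milestones:
        return 0.0
    reached = sum(1 for milestone in milestones if milestone in trajectory)
    return reached / len(milestones)

print(stepwise_next_call_accuracy(["search", "calculator", "email"],
                                  ["search", "summarize", "email"]))   # ~0.67
print(progress_rate(["opened_page", "found_price"],
                    ["opened_page", "found_price", "booked_ticket"]))  # ~0.67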

When agents are instructed to plan by generating complete multi-step programs, evaluation methods from code generation become relevant. Benchmarks like ScienceAgentBench compare the generated plans against annotated references using program similarity metrics [11]. Additionally, the Step Success Rate has been proposed to measure the percentage of steps in the generated plan that are successfully executed, providing a holistic view of planning quality during execution [28].
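A minimal sketch of the Step Success Rate, plus a deliberately crude token-overlap proxy for program similarity (the benchmarks above use richer code-similarity metrics); both helpers and the toy programs are illustrative assumptions.

def step_success_rate(step_results):
    # Share of plan steps that executed successfully, given per-step booleans from running the plan.
    return sum(step_results) / len(step_results) if step_results else 0.0

def token_overlap_similarity(generated_program, reference_program):
    # Crude Jaccard overlap of whitespace-separated tokens; only a stand-in for program similarity metrics.
    generated, reference = set(generated_program.split()), set(reference_program.split())
    return len(generated & reference) / len(generated | reference) if generated | reference else 0.0

print(step_success_rate([True, True, True, False, True]))                               # 0.8
print(token_overlap_similarity("load_csv(); fit_model()", "load_csv(); plot_model()"))  # ~0.33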

3.2.3 Memory and Context Retention: A critical capability for long-running agents is the ability to retain information across many interactions and apply previous context to current requests. Guan et al. [31] categorize memory evaluation in multi-turn conversations by Memory Span (how long information is stored) and Memory Forms (how information is represented). For example, LongEval [43] and SocialBench [9] are benchmarks that test an agent’s context retention in long dialogues (40+ turns). An agent might be given a conversation that spans dozens of exchanges and later asked questions that require recalling details from early in the conversation. Maharana et al. [65] demonstrate evaluation with dialogues spanning hundreds of turns (600+), and Li et al. [50] introduce memory-enhanced evaluation techniques, tracking how well agents maintain consistency in long-horizon tasks. These evaluations often use synthetic or logged conversations as datasets, and metrics include Factual Recall Accuracy or Consistency Score (no contradictions between turns). Memory evaluation may also consider working memory for tool-using agents (i.e., whether the agent keeps track of intermediate results) and forgetting strategies (i.e., whether it appropriately forgets irrelevant details to avoid confusion).
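A minimal sketch of a Factual Recall Accuracy check: probe questions keyed to facts stated early in a long dialogue are scored by exact match against gold answers (real setups often use fuzzy matching or an LLM judge); the function name and toy data are illustrative assumptions.

def factual_recall_accuracy(agent_answers, gold_answers):
    # Share of probe questions about earlier-stated facts answered correctly after a long dialogue.
    if not gold_answers:
        return 0.0
    correct = sum(answer.strip().lower() == gold.strip().lower()
                  for answer, gold in zip(agent_answers, gold_answers))
    return correct / len(gold_answers)

# Three probes about facts from early turns; the agent misremembers one -> accuracy ~0.67
print(factual_recall_accuracy(["Berlin", "blue", "2019"], ["Berlin", "green", "2019"]))

A companion Consistency Score would instead count contradictions between turns, which typically requires an NLI model or an LLM judge rather than exact matching.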

3.2.4 Multi-Agent Collaboration: Evaluating multi-agent collaboration in LLM-based systems requires different methodologies compared to traditional reinforcement learning-driven coordination [7, 48, 89]. Unlike conventional agents that rely on predefined reward structures, LLM agents coordinate through natural language, strategic reasoning, and decentralized problem-solving [32, 33]. These capabilities are crucial in real-world applications such as financial decision-making and structured data analysis, where autonomous agents must exchange information, negotiate, and synchronize decision-making processes efficiently [50, 55]. Autonomous Agents for Collaborative Tasks [55] evaluates Collaborative Efficiency, assessing how well multiple agents share responsibilities and distribute tasks dynamically.