CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

Paper · arXiv 2505.18878 · Published May 24, 2025

Existing benchmarks fall short in realism, data fidelity, agent-user interaction, and coverage across business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic and realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks spanning sales, customer service, and configure, price, and quote (CPQ) processes, in both Business-to-Business and Business-to-Customer scenarios. It also incorporates multi-turn interactions guided by diverse personas and confidentiality awareness assessments. Experiments show that leading LLM agents achieve only around 58% single-turn success rate on CRMArena-Pro, with performance dropping significantly to 35% in multi-turn settings. Among the business skills evaluated, Workflow Execution is notably more tractable, with top-performing agents surpassing an 83% success rate on single-turn tasks, while other skills present greater challenges. Additionally, agents exhibit near-zero inherent confidentiality awareness; this can be improved with targeted prompting, but often at a cost to task performance. These results underscore a significant gap between current LLM capabilities and real-world enterprise demands, highlighting the need for improved multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.

We introduce CRMArena-Pro, an expert-validated benchmark that provides a more comprehensive and realistic evaluation of LLM agents in diverse business contexts. Building upon the sandbox environment and data generation pipeline of CRMArena [6], CRMArena-Pro significantly expands the evaluation framework beyond customer service to encompass crucial sales processes (e.g., distilling insights from sales calls) and CPQ processes (e.g., identifying invalid product configurations on a sales quote). Furthermore, we have enriched the data generation methodology to synthesize realistic enterprise data and interaction scenarios tailored to both B2B and B2C settings. This makes CRMArena-Pro the first benchmark specifically designed to evaluate LLM performance across this broader spectrum of business applications, while also systematically incorporating scenarios that probe critical aspects such as multi-turn conversational ability and confidentiality awareness. An overview of how CRMArena-Pro extends the CRMArena benchmark is shown in Figure 1.

Our findings indicate that LLM agents are generally not well-equipped with many of the skills essential for complex work tasks. Workflow Execution stands out as a notable exception, where strong agents like gemini-2.5-pro achieve success rates above 83%. More importantly, we found that all evaluated models demonstrate near-zero confidentiality awareness. Although targeted prompting can improve this awareness, such interventions often compromise task completion performance.
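The metrics reported above (single-turn vs. multi-turn success rates, broken down by business skill) can be computed by simple aggregation over per-task outcomes. The following is a minimal sketch of that aggregation; the record schema, field names, and toy values are illustrative assumptions, not the benchmark's actual data format.

```python
# Sketch of aggregating per-skill, per-mode success rates, in the spirit of
# the metrics reported for CRMArena-Pro. The record fields ("skill", "mode",
# "success") are hypothetical, not the benchmark's real schema.
from collections import defaultdict

def success_rates(results):
    """Map (skill, mode) pairs to the fraction of successful task runs."""
    totals = defaultdict(lambda: [0, 0])  # (skill, mode) -> [successes, attempts]
    for r in results:
        key = (r["skill"], r["mode"])
        totals[key][0] += int(r["success"])
        totals[key][1] += 1
    return {key: succ / n for key, (succ, n) in totals.items()}

# Toy example: a skill that succeeds single-turn but fails in a multi-turn run.
results = [
    {"skill": "workflow_execution", "mode": "single_turn", "success": True},
    {"skill": "workflow_execution", "mode": "single_turn", "success": True},
    {"skill": "workflow_execution", "mode": "multi_turn", "success": False},
]
rates = success_rates(results)
# rates[("workflow_execution", "single_turn")] -> 1.0
# rates[("workflow_execution", "multi_turn")]  -> 0.0
```

Keying by (skill, mode) makes the single-turn vs. multi-turn gap directly comparable per skill, mirroring how the paper contrasts the two interaction settings.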