Quantifying Human-AI Synergy
We introduce a novel Bayesian Item Response Theory framework to quantify human–AI synergy, separating individual and collaborative ability while controlling for task difficulty in interactive settings. Unlike standard static benchmarks, our approach models human–AI performance as a joint process, capturing both user-specific factors and moment-to-moment fluctuations. We validate the framework by applying it to human–AI benchmark data (n=667) and find significant synergy. We demonstrate that collaborative ability is distinct from individual problem-solving ability: users better able to infer and adapt to others’ perspectives achieve superior collaborative performance with AI, but not when working alone. Moreover, moment-to-moment fluctuations in perspective taking influence AI response quality, highlighting the role of dynamic user factors in collaboration. By providing a principled framework for analyzing data from human–AI collaboration, our approach enables interactive benchmarks to better complement current single-task benchmarks and crowd-assessment methods. This work informs the design and training of language models that transcend static prompt benchmarks to achieve adaptive, socially aware collaboration with diverse and dynamic human partners.
In this paper, we build on the recent emergence of human–AI benchmarks by proposing and validating a principled framework for analyzing human–AI interaction data to quantify and explain human–AI synergy. Viewing these interactions through the lens of teamwork, we apply Item Response Theory and Bayesian shrinkage to account for differences in task difficulty and user ability. Crucially, our framework distinguishes between and estimates each user’s “individual ability” (θ) and “collaborative ability” (κ; see Methods). We validate this approach using data from Chang et al. (2025), in which 667 humans completed tasks with and without AI assistance across math, physics, and moral reasoning. We benchmark two AI models of different capacity and capability, GPT-4o and Llama-3.1-8B, and quantify the extent to which they improve human performance, controlling for variation in user ability and task difficulty. Finally, we examine which users benefit most from AI collaboration and investigate why, focusing on the role of Theory of Mind (ToM).
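To make the distinction between θ and κ concrete, the display below gives one minimal, Rasch-style instantiation of the idea. It is a sketch only; the exact specification (link function, parameterization, and priors) is given in Methods.

$$
\Pr(y_{ij}=1 \mid \text{solo}) = \operatorname{logit}^{-1}\!\big(\theta_i - b_j\big),
\qquad
\Pr(y_{ijm}=1 \mid \text{with AI } m) = \operatorname{logit}^{-1}\!\big(\theta_i + \kappa^{\text{human}}_i + \kappa^{\text{AI}}_m - b_j\big),
$$

where θ_i is user i’s individual ability, κ^human_i and κ^AI_m are the human and AI collaborative-ability terms, b_j is the difficulty of task j, and hierarchical (shrinkage) priors on θ and κ pool estimates across users and models.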
Our paper makes two key contributions. First, we introduce and empirically validate a framework for benchmarking human–AI synergy, a new paradigm for evaluating LLMs that goes beyond static, single-task accuracy metrics and crowd-judgment scoring. Our approach quantifies how much different AI models improve user performance over solo baselines, while controlling for task difficulty and user ability. In doing so, it estimates user-specific performance gains from AI collaboration and separately identifies each user’s individual and collaborative ability. This enables fine-grained comparisons of models’ capacity to enhance the collective intelligence of human–AI teams in realistic problem-solving settings. Second, we identify Theory of Mind as a key cognitive mechanism in human–AI synergy. Users with stronger ToM achieve superior collaborative performance with AI, but not when working alone, and both stable individual differences and moment-to-moment fluctuations in ToM predict AI response quality. These findings suggest that ToM-like capabilities, and the ability to adapt to users’ social-cognitive states, are critical for LLMs intended for interactive, dynamic problem-solving, and they point to a new research and development agenda for building socially aware, adaptive AI partners. Together, these contributions open a path toward designing AI systems that prioritize emergent human–AI synergy over standalone performance.
We can interpret κ^human_i as the collaborative ability of user i (when working jointly with AI). This collaborative ability covers the user’s skill in deciding when and how to delegate tasks to the AI assistant, how to formulate that delegation (i.e., how to write LLM prompts), whether to accept the AI’s response, when to refine or request clarification, and so forth. Similarly, κ^AI_m represents the collaborative capability of the AI model m (with which user i is paired). This parameter is an important object for AI benchmarking purposes, as it quantifies the extent to which different models amplify human performance. Using ChatBench data, for example, we are able to compare κ^AI_GPT-4o and κ^AI_Llama (see Figure 1B).
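As an illustration of how such estimates can be obtained and compared in practice, the sketch below fits a hierarchical Bayesian IRT model of the additive form above with PyMC on synthetic data. The data layout, column names, priors, and model indexing are assumptions made purely for illustration; they do not reflect the actual ChatBench schema or the exact specification in Methods.

```python
# Minimal illustrative sketch (not the paper's exact model): a hierarchical
# Bayesian IRT model separating individual ability (theta), human collaborative
# ability (kappa_human), and model-specific collaborative ability (kappa_ai),
# while controlling for task difficulty (b). Data and column names are synthetic.
import numpy as np
import pandas as pd
import pymc as pm

rng = np.random.default_rng(0)
n_users, n_tasks, n_models, n_obs = 50, 30, 2, 500

# One row per attempt; ai_model = -1 marks a solo attempt.
df = pd.DataFrame({
    "user": rng.integers(0, n_users, n_obs),
    "task": rng.integers(0, n_tasks, n_obs),
    "ai_model": rng.integers(-1, n_models, n_obs),  # -1 solo, 0 GPT-4o, 1 Llama-3.1-8B
    "correct": rng.integers(0, 2, n_obs),
})
with_ai = (df["ai_model"].to_numpy() >= 0).astype(int)
ai_idx = np.where(with_ai == 1, df["ai_model"].to_numpy(), 0)  # dummy index for solo rows

with pm.Model() as irt:
    # Hierarchical (shrinkage) priors pool estimates across users.
    sigma_theta = pm.HalfNormal("sigma_theta", 1.0)
    sigma_kappa = pm.HalfNormal("sigma_kappa", 1.0)
    theta = pm.Normal("theta", 0.0, sigma_theta, shape=n_users)          # individual ability
    kappa_h = pm.Normal("kappa_human", 0.0, sigma_kappa, shape=n_users)  # human collaborative ability
    kappa_ai = pm.Normal("kappa_ai", 0.0, 1.0, shape=n_models)           # AI collaborative ability
    b = pm.Normal("b", 0.0, 1.0, shape=n_tasks)                          # task difficulty

    # Linear predictor: the kappa terms enter only for attempts made with AI assistance.
    eta = (theta[df["user"].to_numpy()]
           - b[df["task"].to_numpy()]
           + with_ai * (kappa_h[df["user"].to_numpy()] + kappa_ai[ai_idx]))
    pm.Bernoulli("y", logit_p=eta, observed=df["correct"].to_numpy())
    trace = pm.sample(1000, tune=1000, chains=2, target_accept=0.9, random_seed=0)

# Compare the collaborative ability of the two AI models (cf. Figure 1B).
kappa_draws = trace.posterior["kappa_ai"]
delta = kappa_draws.sel(kappa_ai_dim_0=0) - kappa_draws.sel(kappa_ai_dim_0=1)
print("posterior mean of kappa_GPT-4o - kappa_Llama:", float(delta.mean()))
print("P(kappa_GPT-4o > kappa_Llama):", float((delta > 0).mean()))
```

In this parameterization, a positive posterior difference κ^AI_GPT-4o − κ^AI_Llama would indicate that GPT-4o lifts user performance more than Llama-3.1-8B, holding user ability and task difficulty fixed.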