The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind

Paper · arXiv 2506.20664 · Published June 25, 2025
Theory of Mind · Role Play

As Large Language Models (LLMs) gain agentic abilities, they will have to navigate complex multi-agent scenarios, interacting with human users and other agents in cooperative and competitive settings. This will require new reasoning skills, chief among them theory of mind (ToM): the ability to reason about the “mental” states of other agents. However, ToM and other multi-agent abilities in LLMs are poorly understood, since existing benchmarks suffer from narrow scope, data leakage, saturation, and lack of interactivity. We thus propose Decrypto, a game-based benchmark for multi-agent reasoning and ToM that draws inspiration from cognitive science, computational pragmatics, and multi-agent reinforcement learning. It is designed to be as easy as possible along every dimension other than the abilities it tests, eliminating the confounding factors commonly found in other benchmarks. To our knowledge, it is also the first platform for designing interactive ToM experiments.

We validate the benchmark design through comprehensive empirical evaluations of frontier LLMs, robustness studies, and human-AI cross-play experiments. We find that LLM game-playing abilities lag behind both humans and simple word-embedding baselines. We then create variants of two classic cognitive science experiments within Decrypto to evaluate three key ToM abilities. Surprisingly, we find that state-of-the-art reasoning models are significantly worse at these tasks than their older counterparts. This demonstrates that Decrypto addresses a crucial gap in current reasoning and ToM evaluations, and paves the way towards better artificial agents.

At the surface level, Decrypto provides a language reasoning challenge that consists of matching hints to either the keywords or the hint history. However, Alice’s hints cannot be too literal, or they will be intercepted, so Decrypto can be formalised as a pragmatic inference game under the Rational Speech Act (RSA) framework (Goodman and Frank, 2016; Degen, 2023), in which the listeners (Bob and Eve) update their beliefs about the intended meaning of the speaker (Alice) via Bayesian inference. We provide such a treatment in Appendix H, explicitly showing that agents must model each other’s beliefs for optimal play. This includes the result that Bob must perform second-order ToM, modelling Alice’s beliefs over Eve’s beliefs, to maximise his chance of guessing correctly.
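
To make the RSA formulation concrete, here is a minimal toy sketch of the standard literal-listener / pragmatic-speaker / pragmatic-listener recursion (not the paper’s Appendix H derivation); the lexicon and vocabulary are illustrative stand-ins for hint–keyword compatibility:

```python
import numpy as np

# Toy RSA setup: "meanings" are candidate keywords, "utterances" are hints.
# lexicon[u][m] = 1 if hint u is literally compatible with keyword m.
meanings = ["star", "planet", "actor"]
utterances = ["celebrity", "orbit", "sky"]
lexicon = np.array([
    [1, 0, 1],  # "celebrity" literally fits "star" and "actor"
    [0, 1, 0],  # "orbit" fits "planet"
    [1, 1, 0],  # "sky" fits "star" and "planet"
], dtype=float)
prior = np.ones(len(meanings)) / len(meanings)  # uniform prior over keywords

def normalize(x, axis):
    return x / x.sum(axis=axis, keepdims=True)

# Literal listener L0(m | u): Bayesian update of the prior with the lexicon.
L0 = normalize(lexicon * prior, axis=1)

# Pragmatic speaker S1(u | m) ∝ L0(m | u)^alpha, normalised over hints u.
alpha = 4.0  # speaker rationality
S1 = normalize(L0 ** alpha, axis=0)

# Pragmatic listener L1(m | u) ∝ S1(u | m) * P(m), normalised over keywords m.
L1 = normalize(S1 * prior, axis=1)

print("L1('celebrity') =", dict(zip(meanings, L1[0].round(3))))
```

In this toy example, the pragmatic listener shifts probability mass from “star” towards “actor” when hearing “celebrity”, because “star” has the alternative hint “sky”; this recursive reasoning about the speaker’s choices is exactly the kind of belief modelling Bob and Eve must perform.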

Ad-hoc Coordination. In this setting, we freeze Eve (e.g., to a rule-based baseline or the strongest available LLM) and instantiate Alice and Bob with different models (i.e., Alice is played by model A and Bob by model B). As in Stone et al. (2010), we are concerned with the ability to “efficiently and robustly collaborate with previously unknown teammates”, such as independently trained LLMs.
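
Concretely, an ad-hoc cross-play evaluation loop might look like the following sketch; the model pool names and `play_episode` are placeholders rather than the benchmark’s actual API:

```python
import random
from itertools import product

# Hypothetical cross-play harness: Eve is frozen while Alice and Bob are
# drawn from pools of independently trained models, so every (Alice, Bob)
# pairing is an ad-hoc team.
alice_pool = ["model-A", "model-B", "model-C"]
bob_pool = ["model-A", "model-B", "model-C"]
frozen_eve = "embedding-baseline"

def play_episode(alice: str, bob: str, eve: str) -> int:
    """Stub for the Decrypto game loop; returns turns survived.
    A real harness would query each model for hints and guesses every turn."""
    return random.randint(1, 8)  # placeholder outcome

avg_turns = {}
for alice, bob in product(alice_pool, bob_pool):
    episodes = [play_episode(alice, bob, frozen_eve) for _ in range(100)]
    avg_turns[(alice, bob)] = sum(episodes) / len(episodes)
```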

A crucial subset of ad-hoc coordination is human-AI coordination, where one of the two agents (Alice or Bob) is played by a human. This setting paves the way towards more social AI agents that seamlessly coordinate with humans and understand their intents.

Metrics. Both settings are subject to the same tension at the core of Decrypto: Alice must provide hints that balance what she knows about Bob, Eve, and the information available to each of them. If the hints are too obscure, Bob will guess wrong, leading to a miscommunication; too obvious, and Eve will intercept; just right, and Alice and Bob survive for another round. The numbers of miscommunications and intercepts are therefore two sides of the same coin, providing granular insight into the failure modes of LLMs. Meanwhile, the average number of turns per episode captures both sides in a single metric, since longer games mean that Alice and Bob balanced the difficulty of their hints well enough to avoid defeat. Game length is also more informative than win rate: we empirically find current LLMs to be much weaker at providing hints than at intercepting, which gives Eve significantly higher win rates in most match-ups (see Figure 7).
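
A minimal sketch of how these three metrics could be computed from per-turn outcomes (the `Turn` log structure is illustrative, not the benchmark’s actual format):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Turn:
    miscommunication: bool  # Bob guessed the code wrong
    intercept: bool         # Eve guessed the code right

def episode_metrics(episodes: list[list[Turn]]) -> dict[str, float]:
    return {
        "miscommunications_per_episode": mean(
            sum(t.miscommunication for t in ep) for ep in episodes),
        "intercepts_per_episode": mean(
            sum(t.intercept for t in ep) for ep in episodes),
        "avg_turns": mean(len(ep) for ep in episodes),
    }

# Example: two short episodes, one lost each way.
logs = [
    [Turn(False, False), Turn(True, False), Turn(True, False)],  # miscommunications
    [Turn(False, True), Turn(False, False), Turn(False, True)],  # interceptions
]
print(episode_metrics(logs))
```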

The first experiment adapts the Smarties Task of Gopnik and Astington (1988), which presents children with a deceptive object (a box of Smarties containing pencils) and studies whether, upon first encountering the object, the child can correctly identify false beliefs, either their own or those of another child. To recreate this task in Decrypto, we substitute the closed Smarties box and the pencils with the game history and the secret keywords. At each turn except the first, we prompt Eve three times independently. Prompt A asks her to predict the four keywords. Prompt B reveals the keywords and asks Eve what she thought the keywords were pre-reveal. Prompt C also reveals the keywords, but asks the model to predict what a “second interceptor” would think the keywords to be, pre-reveal.
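
Paraphrased, the three probes might look as follows (illustrative wording, not the paper’s verbatim prompts):

```python
# Illustrative paraphrases of the three probe prompts; each is issued to
# Eve in an independent context, so answers cannot leak across probes.
PROMPT_A = "Based on the hint history so far, predict the four secret keywords."
PROMPT_B = ("The four keywords were: {keywords}. Before this reveal, "
            "what did you think the keywords were?")
PROMPT_C = ("The four keywords were: {keywords}. What would a second interceptor, "
            "who saw the same hint history but not this reveal, "
            "think the keywords are?")
```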

We generate outputs with temperature 0 (for models that allow it) and only consider turns where the answer to A is an incorrect guess, keeping only cases where Eve holds inaccurate “beliefs” pre-reveal. We compare answers A and B to measure representational change (RC), the ability of the agent to recognise when its belief about the world (but not the world itself) changes due to additional information. Similarly, comparing A and C measures false belief (FB), the ability to represent other agents as holding false beliefs about the world. We distinguish two variants of these tasks. The Weak variant only requires the agent to realise that it, or the second interceptor, could not have known the ground truth, so any answer to B or C is correct as long as it differs from the real keywords. In the Strong variant, the agent passes only if it correctly reproduces its answer to prompt A (i.e. if B = A or C = A). Success here likely requires a self-consistent representation of the keywords, or at least strong counterfactual reasoning. Figure 5 summarises the procedure, and the sketch below makes the scoring explicit.
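
The scoring rules reduce to a few comparisons; a minimal sketch, assuming each answer is parsed into a set of four guessed keywords (the data structures are illustrative):

```python
# Weak/Strong scoring for one turn: `a` is the pre-reveal guess (prompt A),
# `b_or_c` is the post-reveal answer (prompt B or C), `truth` the real keywords.
def score(a, b_or_c, truth):
    if a == truth:
        return None  # turn discarded: Eve's pre-reveal guess was correct
    weak = b_or_c != truth  # agent realises the truth was unknowable pre-reveal
    strong = b_or_c == a    # agent reproduces its own pre-reveal guess (B = A or C = A)
    return {"weak": weak, "strong": strong}

a = {"star", "river", "engine", "mask"}     # answer to prompt A (pre-reveal guess)
b = {"star", "river", "engine", "mask"}     # answer to prompt B (recalled guess)
truth = {"sun", "river", "engine", "mask"}  # revealed keywords

print(score(a, b, truth))  # {'weak': True, 'strong': True}
```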

Decrypto addresses critical limitations of existing ToM benchmarks, such as biases arising from textual translation of embodied scenarios or lack of interactivity. Designed to be future-proof and to eliminate confounding factors known to limit LLM performance, Decrypto fills an important gap in existing benchmarks. Furthermore, our codebase provides a versatile platform for quickly designing interactive ToM experiments inspired by cognitive psychology.

We conduct extensive experiments to evaluate open-source and closed-source LLMs. We find that even state-of-the-art models struggle with the nuanced communication and strategic reasoning that Decrypto requires, often underperforming simple baselines in cooperative and competitive settings. Similarly, our human-AI experiments shed light on the limited ability of recent LLMs to coordinate with humans or understand their communications.

Our experiments provide strong evidence that state-of-the-art models still lack many ToM skills. Moreover, we find newer and more capable reasoning models, such as Claude 3.7 Sonnet and o1 high, to be significantly worse at some ToM tasks than older models, demonstrating the need for better ToM methods and benchmarks.