Beyond Single Models: Enhancing LLM Detection of Ambiguity in Requests through Debate
Abstract: Large Language Models (LLMs) have demonstrated significant capabilities in understanding and generating human language, enabling more natural interactions with complex systems. However, they remain challenged by ambiguity in user requests. To address this challenge, this paper introduces and evaluates a multi-agent debate framework designed to enhance ambiguity detection and resolution beyond what single models achieve. The framework combines three LLM architectures (Llama3-8B, Gemma2-9B, and Mistral-7B variants) with a dataset covering diverse types of ambiguity. The debate framework markedly improved the performance of the Llama3-8B and Mistral-7B variants over their individual baselines, with Mistral-7B-led debates achieving a notable 76.7% success rate and proving particularly effective for complex ambiguities and efficient consensus formation. While models varied in how much they benefited from collaboration, these findings underscore the debate framework’s value as a targeted method for augmenting LLM capabilities. This work offers insights for developing more robust and adaptive language understanding systems by showing how structured debate can improve clarity in interactive systems.
3.3.2. Multi-Agent Debate (Leader-Follower Protocol)
This condition involved a structured debate between three agents: one designated as the “Leader” and the other two as “Followers”. A two-Follower configuration was chosen to establish a stronger consensus mechanism, requiring the Leader to convince two independent agents and thereby reducing the risk of premature agreement on a flawed interpretation. The roles were rotated among the three LLMs so that each model acted as Leader for each ambiguous instruction, mitigating potential biases associated with a fixed Leader. The debate proceeded in rounds, with a predefined maximum number of rounds (e.g., five rounds).
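The rotation scheme can be sketched as follows. This is a minimal illustration, not the paper’s implementation: the model names come from the paper, while the function and its interface are assumptions for exposition.

```python
# Role rotation: for each ambiguous instruction, every model leads one
# debate while the other two models act as Followers.
MODELS = ["Llama3-8B", "Gemma2-9B", "Mistral-7B"]

def role_assignments(models):
    """Yield every (leader, followers) assignment for one instruction,
    so that each model acts as Leader exactly once."""
    for leader in models:
        followers = [m for m in models if m != leader]
        yield leader, followers

assignments = list(role_assignments(MODELS))
# Three debates per instruction, one per Leader, two Followers each.
```

Running all three assignments per instruction is what allows per-model comparisons such as the Mistral-7B-led results reported in the abstract.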
Round 1: Leader’s Initial Proposal The Leader agent received the scenario context and ambiguous instruction, then generated an initial proposal consisting of its reasoning and either a verdict of clarity or a specific clarifying question.
Round 1: Follower Evaluation Each Follower agent received the original context, instruction, and Leader’s proposal. Followers were prompted to state their stance (“Agree” or “Disagree”), provide reasoning, and if disagreeing, they could offer an alternative question. Specifically, if the Leader proposed a question, a disagreeing Follower could provide a different one; if the Leader declared clarity, a disagreeing Follower’s reasoning was expected to explain the ambiguity, optionally proposing a question or indicating no alternative was needed. Agreeing Followers indicated no alternative question.
Consensus Check If both Followers agreed with the Leader’s proposal from Round 1, consensus was considered reached, and the debate for that instruction concluded with the Leader’s proposal as the final outcome.

Subsequent Rounds (up to the maximum limit) If consensus was not reached, the Leader received all Follower feedback. It then synthesized this feedback to generate a new proposal (reasoning and then either a verdict of clarity or a new/revised clarifying question). This new proposal was presented to the Followers for evaluation under the same protocol, with a consensus check following each Follower evaluation phase.
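The round structure above can be sketched as a simple control loop. This is a hedged sketch, not the authors’ code: the `Proposal` and `Feedback` records mirror the fields described in the protocol, while the `propose`, `evaluate`, and `revise` agent interfaces are hypothetical wrappers around the underlying LLM calls.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Proposal:
    reasoning: str
    is_clear: bool                       # Leader's verdict of clarity
    question: Optional[str]              # clarifying question, if any

@dataclass
class Feedback:
    agrees: bool
    reasoning: str
    alternative_question: Optional[str]  # offered only when disagreeing

def run_debate(leader, followers, context, instruction, max_rounds=5):
    """Run the Leader-Follower debate for one ambiguous instruction.

    Returns the final Proposal and the round in which it was produced.
    """
    proposal = leader.propose(context, instruction)
    for round_no in range(1, max_rounds + 1):
        feedback = [f.evaluate(context, instruction, proposal) for f in followers]
        if all(fb.agrees for fb in feedback):  # consensus check
            return proposal, round_no
        if round_no < max_rounds:
            # Leader synthesizes all Follower feedback into a new proposal.
            proposal = leader.revise(context, instruction, proposal, feedback)
    # No consensus within the round limit; the paper does not specify a
    # tie-break, so this sketch simply returns the last proposal.
    return proposal, max_rounds
```

Note that the consensus check requires unanimity among both Followers, matching the two-Follower design rationale: a single disagreeing Follower forces the Leader to revise.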