DialogueReason: Rule-Based RL Sparks Dialogue Reasoning in LLMs
We propose DialogueReason, a reasoning paradigm that uncovers the lost roles in monologue-style reasoning models, aiming to boost the diversity and coherency of the reasoning process. Recent advances in RL-based large reasoning models have led to impressive long chain-of-thought (CoT) capabilities and high performance on math and science benchmarks. However, these reasoning models rely mainly on monologue-style reasoning, which often limits reasoning diversity and coherency, frequently recycling fixed strategies or exhibiting unnecessary shifts in attention. Our work consists of an analysis of monologue reasoning patterns and the development of a dialogue-based reasoning approach. We first introduce the Compound-QA task, which concatenates multiple problems into a single prompt to assess both the diversity and coherency of reasoning. Our analysis shows that Compound-QA exposes weaknesses in monologue reasoning, evidenced by both quantitative metrics and qualitative reasoning traces. Building on this analysis, we propose a dialogue-based reasoning paradigm, named DialogueReason, structured around agents, environment, and interactions.
Typically, these large reasoning models (LRMs) structure their reasoning within a dedicated think block (Guo et al., 2025), where the detailed monologue reasoning process unfolds, followed by an answer block that explicitly provides the final solution. Such LRMs demonstrate impressive capabilities in reflection, self-verification, and critical analysis, achieving state-of-the-art results on mathematics and science tasks. However, these monologue-style LRMs exhibit low diversity and low coherency in their reasoning processes. Low diversity occurs when models persistently apply fixed strategies across diverse problems, degrading performance when problems require different approaches. Low coherency arises from frequent shifts in attention within a single reasoning path, exemplified by repetitive hesitations such as "Wait..." or unnecessary switches to alternative ideas. Consequently, the reasoning process becomes fragmented, difficult to interpret, and often ineffective, swinging between overcommitting to a single strategy and neglecting manifold possibilities (Wang et al., 2025; Chen et al., 2024).
To address these limitations, we first analyze monologue reasoning patterns with a focus on diversity and coherency, and subsequently propose a dialogue-based reasoning pattern. Specifically, inspired by the structure of divergent and convergent thinking, we introduce the Compound Question Answering (Compound-QA) task to systematically evaluate monologue reasoning models. By concatenating multiple independently solvable problems into a single input prompt, the Compound-QA task intrinsically evaluates a model’s ability to employ diverse reasoning strategies while maintaining internal coherency across different reasoning paths. Our analysis further demonstrates that Compound-QA effectively reveals the weaknesses of current monologue reasoning models, both quantitatively through performance metrics and qualitatively through detailed examination of their reasoning traces. Building upon these insights and drawing inspiration from multi-agent simulation, we articulate the design space for our proposed dialogue reasoning pattern. The design space is composed of three core dimensions: individual reasoning agents, the reasoning environment, and interaction settings.
• Reasoning Diversity: When facing heterogeneous sub-problems within a single input, the model should flexibly adopt different solution strategies rather than rigidly applying the same approach to all tasks. For example, combinatorial problems might require breadth-first search (BFS) to enumerate possible solutions, whereas geometric proofs might rely on depth-first search (DFS) for detailed deductive reasoning.
• Reasoning Coherency: Once the model chooses a particular reasoning path, it should consistently and thoroughly explore that path to derive a robust conclusion rather than frequently switching paths and consequently losing track of the correct solution.
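The Compound-QA construction described above can be sketched in a few lines. The header wording, the numbering template, and the sample problems below are illustrative assumptions, not the paper's exact prompt format:

```python
def build_compound_qa(problems):
    """Concatenate independently solvable problems into one Compound-QA prompt.

    The header and per-problem template are illustrative assumptions; the
    paper's exact prompt wording may differ.
    """
    header = "Answer each of the following independent problems.\n\n"
    body = "\n\n".join(
        f"Problem {i + 1}: {p}" for i, p in enumerate(problems)
    )
    return header + body

# Usage: combine a combinatorial and a geometric problem, which ideally
# elicit different reasoning strategies (e.g. BFS-style enumeration vs.
# DFS-style deduction) within a single reasoning pass.
prompt = build_compound_qa([
    "How many ways can 5 books be arranged on a shelf?",
    "Prove that the base angles of an isosceles triangle are equal.",
])
print(prompt)
```

Because each sub-problem is independently solvable, per-problem accuracy under this compound prompt can be compared against accuracy on the same problems asked separately, isolating the cost of maintaining diverse, coherent reasoning paths in one pass.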
To address the limitations of reasoning diversity and coherency, we propose a novel dialogue-based reasoning pattern, named DialogueReason. As illustrated in Figure 4, DialogueReason conceptualizes reasoning as an interactive process between distinct dialogue participants within a defined conversational setting. The figure includes a simple yet effective system prompt to stimulate dialogue-based reasoning. This prompt can be customized by specifying dialogue configurations, such as defining participants (e.g., a teacher and a student), setting the context (e.g., a math class), or choosing the dialogue format (e.g., Socratic dialogue). More systematically, inspired by agent-based simulation methodologies (Macal and North, 2005), we further articulate the design space of our dialogue reasoning pattern through three key dimensions: agents, environment, and interactions:
• Agent dimension: Defines the number of reasoning agents, their designated characters, objectives, and the interests they represent;
• Environment dimension: Specifies environmental functionalities, such as recording and adjusting task progression, introducing emergent events, and maintaining overall task control;
• Interaction dimension:
– Agent-to-agent interactions: Includes conflict resolution, negotiation, supplementation, and prompting among agents, represented via linguistic dialogues;
– Agent-to-environment interactions: Involves agents expressing requirements to the environment and the environment providing feedback to agents, thereby dynamically adjusting task goals and agent characters.
By explicitly configuring the characters and dialogue environment, the pattern encourages diverse reasoning paths tailored to different problem types. Meanwhile, the structured turn-taking and conversational boundaries inherent in dialogue promote coherency, making the reasoning process more interpretable and logically organized.
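A configuration over the three dimensions can be sketched as a small data structure that renders a system prompt. The class names, fields, and prompt wording below are hypothetical illustrations of the design space, not the paper's actual prompt:

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Agent dimension: character, objective, represented interest."""
    name: str
    character: str   # e.g. "teacher", "student"
    objective: str


@dataclass
class DialogueConfig:
    """Hypothetical configuration spanning the three design dimensions."""
    agents: list            # agent dimension
    setting: str            # environment dimension, e.g. "a math class"
    dialogue_format: str    # interaction dimension, e.g. "Socratic dialogue"

    def to_system_prompt(self):
        roster = "; ".join(
            f"{a.name} ({a.character}), whose goal is to {a.objective}"
            for a in self.agents
        )
        return (
            f"Reason by simulating a {self.dialogue_format} in {self.setting}. "
            f"Participants: {roster}. The participants discuss, resolve "
            f"conflicts, and supplement each other until they reach a "
            f"consensus, then state the final answer."
        )


# Usage: the teacher/student math-class example from the text.
config = DialogueConfig(
    agents=[
        Agent("Alice", "teacher", "guide the derivation"),
        Agent("Bob", "student", "question each step"),
    ],
    setting="a math class",
    dialogue_format="Socratic dialogue",
)
print(config.to_system_prompt())
```

Keeping the three dimensions as separate fields mirrors the design space: swapping agents, environment, or interaction format changes the prompt without touching the other dimensions.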
Specifically, the model first sets up a dedicated scene for each question (such as the "Quantum Café") and introduces multiple characters, each with distinct areas of expertise, within that scene. A specific problem to be solved is then posed. Through dialogue, the characters engage in interactions, progressively proposing, discussing, and refining possible solutions until they reach a consensus and summarize the final answer. When transitioning to the next question, the model constructs a new environment based on the background of the new problem (for example, shifting from the "Quantum Café" to the "Theoretical Physics Hall") and introduces a different set of characters distinct from those in the previous discussion. This approach of switching between multiple scenes and characters not only effectively prevents interference between questions and enhances the diversity of reasoning strategies, but also promotes dialogue coherency by maintaining clear character boundaries, thereby providing a more intuitive presentation of the model’s complete thought process.
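The per-question scene- and character-switching loop above can be summarized in pseudocode form. The helpers `new_scene_for`, `cast_for`, and `run_dialogue` stand in for behavior the model produces internally during generation; their names and logic are assumptions for illustration, not an API from the paper:

```python
def new_scene_for(question):
    """Construct a fresh environment themed on the question's background
    (e.g., a quantum problem might yield the "Quantum Café")."""
    return f"scene themed on: {question}"


def cast_for(question, previous_cast):
    """Introduce characters distinct from the previous discussion."""
    pool = ["Ada", "Ben", "Cleo", "Dev", "Eve", "Fay"]  # illustrative names
    fresh = [name for name in pool if name not in previous_cast]
    return fresh[:2]


def run_dialogue(scene, cast, question):
    """Placeholder for the propose-discuss-refine loop that ends in
    consensus; here it just reports who discussed what, and where."""
    return f"[{scene}] {', '.join(cast)} reach a consensus on: {question}"


def solve_compound(questions):
    answers, previous_cast = [], []
    for q in questions:
        scene = new_scene_for(q)            # fresh environment per question
        cast = cast_for(q, previous_cast)   # fresh characters per question
        answers.append(run_dialogue(scene, cast, q))
        previous_cast = cast
    return answers


for line in solve_compound(["entanglement question", "relativity question"]):
    print(line)
```

Resetting both the scene and the cast at each question is what limits cross-question interference: no character carries state from one sub-problem into the next, while the dialogue within each scene keeps turn-taking, and thus coherency, intact.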