MoodAngels: A Retrieval-augmented Multi-agent Framework for Psychiatry Diagnosis

Paper · arXiv 2506.03750 · Published June 4, 2025

The application of AI in psychiatric diagnosis faces significant challenges, including the subjective nature of mental health assessments, symptom overlap across disorders, and privacy constraints limiting data availability. To address these issues, we present MoodAngels, the first specialized multi-agent framework for mood disorder diagnosis. Our approach combines granular-scale analysis of clinical assessments with a structured verification process, enabling more accurate interpretation of complex psychiatric data. Complementing this framework, we introduce MoodSyn, an open-source dataset of 1,173 synthetic psychiatric cases that preserves clinical validity while ensuring patient privacy. Experimental results demonstrate that MoodAngels outperforms conventional methods, with our baseline agent achieving 12.3% higher accuracy than GPT-4o on real-world cases, and our full multi-agent system delivering further improvements. Evaluation of the MoodSyn dataset demonstrates exceptional fidelity: it accurately reproduces both the core statistical patterns and the complex relationships present in the original data while maintaining strong utility for machine learning applications. Together, these contributions provide both an advanced diagnostic tool and a critical research resource for computational psychiatry, bridging important gaps in AI-assisted mental health assessment.

To identify the most relevant questions for mood disorder diagnosis, we computed the Pearson correlation between each question’s score (and total score) and the presence of a mood disorder, selecting the top 5% with the highest correlations. These questions naturally clustered into key symptom groups: depressive mood, loss of interest, anxiety, insomnia, and suicidal tendencies. These groups enhance diagnostic robustness through cross-validation and comprehensive symptom coverage. To further refine our framework, we included clinically significant PHQ-9 questions, such as phq9_Q2 (depressed mood) and phq9_Q1 (loss of interest), even if their correlation scores were slightly below the threshold, ensuring a nuanced and reliable diagnostic process.
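The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `always_include` parameter, and the choice to rank by signed (rather than absolute) correlation are assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation; treat it as uninformative.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_questions(scores, labels, top_fraction=0.05, always_include=()):
    """Rank questions by correlation with the label and keep the top fraction.

    scores: dict mapping question id -> list of per-respondent scores
    labels: list of 0/1 flags (1 = mood disorder present)
    always_include: ids kept regardless of rank (hypothetical parameter,
        standing in for the manually added PHQ-9 items)
    """
    ranked = sorted(scores, key=lambda q: pearson(scores[q], labels), reverse=True)
    k = max(1, math.ceil(top_fraction * len(ranked)))
    selected = ranked[:k]
    for q in always_include:
        if q in scores and q not in selected:
            selected.append(q)
    return selected
```

Calling `select_questions(scores, labels, always_include=("phq9_Q1", "phq9_Q2"))` would keep the two PHQ-9 items even when they fall just below the cutoff, mirroring the manual inclusion described in the text.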

By evaluating response consistency within each symptom group, we derive more accurate inferences about the visitor’s probable condition. For instance, when a visitor reports frequent depressive symptoms on a self-assessment scale but clinicians observe no corresponding depressive signs, this discrepancy directs MoodAngels to investigate additional behavioral markers for diagnostic validation.

2.2 Retrieval Datastore

Since overlapping symptoms may correspond to multiple disorders, we extract and structure diagnostic and differential criteria from the Diagnostic and Statistical Manual of Mental Disorders: DSM-5 [15], a widely recognized authority in psychiatry, to build a retrievable knowledge base. The knowledge base construction process is detailed in Appendix B.3.2.

To prevent MoodAngels from making arbitrary decisions based solely on symptom presentation, we also incorporate clinicians’ diagnostic expertise by including anonymized clinical data for retrieval.

Such precedents are also useful when a visitor’s symptoms are ambiguous, because historical diagnoses may offer additional interpretive insight. The clinical data used in our study consists of anonymized real-world hospital cases, totaling 2,804 entries. We partitioned the dataset such that 80% of the cases are used as historical cases for retrieval, while the remaining 20% serve as the test set. All clients in the dataset have completed scale assessments, although clients without diagnosed conditions do not have medical records available, and our agents are not pre-informed about this distinction. The dataset statistics are summarized in Table 1.
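A partition of this kind might look like the sketch below. The text does not say whether the split was random or stratified, so the seeded random shuffle and the function name `split_cases` are assumptions. The point the sketch makes is that test cases never enter the retrieval pool.

```python
import random


def split_cases(case_ids, retrieval_fraction=0.8, seed=0):
    """Shuffle case ids and split them into a retrieval pool and a test set.

    The two returned lists are disjoint, so no test case can be retrieved
    as a "historical" case for itself or for another test query.
    """
    shuffled = list(case_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    cut = int(retrieval_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]
```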

2.3 Diagnostic Agents

To mitigate overreliance on past cases (which could overlook individual variability in psychiatric diagnoses), we develop three diagnostic variants with differing levels of historical dependence: Angel.R (no reference to previous cases), Angel.D (displays retrieved cases as context), and Angel.C (compares each retrieved case with the current query and returns an analysis as context). Prompts for these Angels are provided in Appendix C.1. By aggregating independent diagnoses from these three agents and facilitating debate among their conclusions, our final diagnosis model, multi-Angels, bridges the gap between computational decision-making and the nuanced understanding essential for accurate psychiatric evaluations. The following parts introduce the main components of our agents.

Symptom Matching. To align client symptoms in medical records with DSM-5 diagnostic criteria, we process records and compute relevance between records and criteria using dense vector encoding. The BGE-M3 embedder [19] is employed for its strong semantic embedding capabilities. We retrieve the top-5 most similar criteria, returning their text, classification, and similarity scores. The tool does not diagnose; it provides results for the agent to analyze, so that decisions integrate quantitative data with clinical expertise rather than relying on a single metric.
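The retrieval step of the symptom matching tool can be illustrated with a small sketch. It assumes the record and each criterion have already been embedded (for example with BGE-M3) and that relevance is cosine similarity; the `Criterion` class and `match_symptoms` function are hypothetical names, and the paper does not state which similarity measure is used.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Criterion:
    text: str            # criterion wording extracted from the knowledge base
    classification: str  # disorder category the criterion belongs to
    vector: List[float]  # precomputed dense embedding (assumed given)


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_symptoms(record_vector, criteria, top_k=5) -> List[Dict]:
    """Return the top_k criteria most similar to the record embedding.

    Each result carries text, classification, and similarity score; no
    diagnosis is made here, the output is only context for the agent.
    """
    scored = [(cosine(record_vector, c.vector), c) for c in criteria]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"text": c.text, "classification": c.classification, "similarity": s}
        for s, c in scored[:top_k]
    ]
```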

For cases with overlapping symptoms, an additional instruction prompts the agent to consider differential diagnosis, guiding systematic evaluation of potential conditions. This enhances the agent’s ability to distinguish between mood disorders and other diseases.

Scale Performance Analysis. We diagnose the presence of mood disorder using 16 key questions selected in Section 2.1 as the most mood-relevant items. Client performances are converted from numeric scores to textual descriptions based on question content and options. For agent interpretability, performances are reorganized into coherent descriptive paragraphs, enhancing analysis effectiveness (examples are provided in Appendix C.3).
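As a rough illustration of the score-to-text conversion, the sketch below maps two PHQ-9 items to sentences grouped by symptom. The item wording and the 0 to 3 response options follow the standard PHQ-9, but the sentence template, the `QUESTIONS` table, and the `describe_scale` function are assumptions made for this example; the authors' actual descriptions are in their Appendix C.3.

```python
# Standard PHQ-9 response options, indexed by numeric score 0..3.
PHQ9_OPTIONS = [
    "not at all",
    "several days",
    "more than half the days",
    "nearly every day",
]

# question id -> (symptom group, item content); only two items shown here.
QUESTIONS = {
    "phq9_Q1": ("loss of interest", "little interest or pleasure in doing things"),
    "phq9_Q2": ("depressive mood", "feeling down, depressed, or hopeless"),
}


def describe_scale(responses):
    """Turn {question id: numeric score} into one descriptive line per symptom group."""
    groups = {}
    for qid, score in responses.items():
        group, content = QUESTIONS[qid]
        groups.setdefault(group, []).append(f"{content} ({PHQ9_OPTIONS[score]})")
    lines = []
    for group, items in groups.items():
        lines.append(f"{group.capitalize()}: the client reports " + "; ".join(items) + ".")
    return "\n".join(lines)
```

Grouping the sentences by symptom keeps related answers next to each other, which is what lets an agent check response consistency within a group.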

Similar Cases Retrieval. To leverage clinical experience from similar cases, we develop two optional tools for retrieving medical records and scales with similar performance. After performing similarity retrieval, our tools generate different outputs tailored to the type of diagnosis agent in use. For Angel.R, this tool is intentionally excluded to minimize potential interference from the diagnostic outcomes of other cases. For Angel.D, the tool directly returns the retrieved cases for reference, enabling the agent to review and draw insights from them. For Angel.C, the tool conducts a detailed comparison of similarities and differences among the retrieved cases and returns an analysis text summarizing the findings. Consistent with the symptom matching tool, we employ BGE-M3 as the retriever. This approach ensures that the tool adapts to the specific needs of each diagnosis agent, enhancing the diagnostic process while maintaining flexibility and precision.
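The per-agent behaviour of this tool can be sketched as a small dispatch function. The `retrieve` and `compare` callables are hypothetical placeholders: in the described system the first would be a BGE-M3 similarity search over historical cases and the second an LLM-written comparison, neither of which is reproduced here.

```python
def similar_cases_context(agent, query, retrieve, compare):
    """Build the similar-case context appropriate to each Angel variant.

    agent:    "Angel.R", "Angel.D", or "Angel.C"
    retrieve: callable(query) -> list of similar historical cases
    compare:  callable(query, cases) -> analysis text of similarities/differences
    """
    if agent == "Angel.R":
        # No historical cases at all, so other diagnoses cannot interfere.
        return None
    cases = retrieve(query)
    if agent == "Angel.D":
        # Retrieved cases are shown directly as context.
        return cases
    if agent == "Angel.C":
        # Only the comparison analysis is returned, not the raw cases.
        return compare(query, cases)
    raise ValueError(f"unknown agent: {agent}")
```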

Multi-agent Diagnosis. To integrate insights from all Angels and improve diagnostics, Angel.R, Angel.D, and Angel.C first provide independent decisions and reasoning. A Judge Agent consolidates their inputs. If consensus is reached, the Judge outputs the diagnosis and reasoning. For disagreements, two Debate Agents are introduced: a Positive Agent, supporting a mood disorder diagnosis, and a Negative Agent, opposing it. Both Debate Agents and the Judge access symptom matching results, scale performances, relevant cases, and the Angels’ diagnosis and reasoning. In each debate round, the Positive Agent speaks first, followed by the Negative Agent. After each round, the Judge evaluates the arguments and decides whether to conclude the debate. If concluded, the Judge delivers the final diagnosis and supporting reasons. More details about the judge and a complete debate example are presented in Appendix C.4.1.
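The control flow of the consensus check and debate can be sketched as below. The agents are passed in as plain callables standing in for LLM calls, and the sketch returns only the binary decision rather than the decision plus reasoning. The `max_rounds` cap and the `force` flag that obliges the Judge to decide after the last round are assumptions; the text does not specify a round limit.

```python
from typing import Callable, Dict, List, Optional, Tuple


def multi_angels(
    angel_outputs: Dict[str, Tuple[bool, str]],
    evidence: str,
    positive_agent: Callable[[str, List[str]], str],
    negative_agent: Callable[[str, List[str]], str],
    judge: Callable[[str, List[str], bool], Optional[bool]],
    max_rounds: int = 3,
) -> bool:
    """Return True if a mood disorder is diagnosed, False otherwise.

    angel_outputs: agent name -> (decision, reasoning) from Angel.R/D/C
    evidence:      shared context (symptom matches, scale text, cases, reasoning)
    judge:         returns True/False to conclude, or None to continue debating
    """
    decisions = {decision for decision, _ in angel_outputs.values()}
    if len(decisions) == 1:
        # Consensus: no debate is needed.
        return decisions.pop()

    transcript: List[str] = []
    for _ in range(max_rounds):
        # The Positive Agent speaks first in every round, then the Negative Agent.
        transcript.append("Positive: " + positive_agent(evidence, transcript))
        transcript.append("Negative: " + negative_agent(evidence, transcript))
        verdict = judge(evidence, transcript, False)
        if verdict is not None:
            return verdict

    # Assumption: after the last round the Judge must commit to a decision.
    final = judge(evidence, transcript, True)
    if final is None:
        raise RuntimeError("judge did not reach a decision")
    return final
```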