IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems
The system pipeline consists of the following steps: (1) The IntellAgent system receives a schema of the system database along with either a chatbot system prompt or a document outlining the company policies. From this input, the system constructs a policy graph (3.1.1). It then samples a list of policies from the graph at varying levels of complexity and generates an event addressing these policies (3.1.2). The event includes a scenario description with a user request, together with corresponding samples for the initial database state that ensure the validity of the user request. (2) The system simulates a dialog between the chatbot and a user agent using the information provided in the event (3.2). (3) Finally, a critique component receives the dialog and produces an analysis of the chatbot's performance with respect to the event's policy list (3.3).
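A minimal sketch of this three-step pipeline is shown below. The names and signatures are illustrative assumptions rather than the actual IntellAgent API; each stage is passed in as a callable so the orchestration logic stands on its own.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    scenario: str                       # user request description
    initial_db: dict                    # rows seeded so the request is valid
    policies: list = field(default_factory=list)

def run_pipeline(sample_policies, generate_event, simulate_dialog, critique,
                 n_events=100):
    """Orchestrate the pipeline: sample policies, generate an event,
    simulate the dialog, and critique it against the event's policies."""
    reports = []
    for _ in range(n_events):
        policies = sample_policies()            # step (1): policy sampling
        event = generate_event(policies)        # step (1): event generation
        dialog = simulate_dialog(event)         # step (2): user-chatbot dialog
        reports.append(critique(dialog, event.policies))  # step (3): critique
    return reports
```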
Next, the initial policy for each event is sampled uniformly across all nodes in the policy graph. For each event, the system then generates a policy path by performing a random walk on the graph; the walk terminates once the cumulative complexity of the visited nodes exceeds the sampled event complexity. An overview of the entire sampling method is provided in Algorithm 1. This approach ensures that the policy lists of the generated events maintain the desired complexity distribution and follow realistic transitions between policies, as determined by the graph structure.
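A minimal sketch of this sampling procedure follows. The dict-based graph encoding, the uniform-neighbor transition, and the exclusion of already-visited nodes are assumptions; Algorithm 1 may, for instance, weight transitions by edge strength.

```python
import random

def sample_policy_path(graph, complexity, target_complexity):
    """Random walk over the policy graph.

    graph:  dict mapping each policy node to its list of neighbors.
    complexity: dict mapping each policy node to its complexity score.
    target_complexity: the event complexity sampled for this event.
    """
    node = random.choice(list(graph))        # uniform initial policy
    path, total = [node], complexity[node]
    # Walk until the cumulative complexity exceeds the target.
    while total <= target_complexity:
        candidates = [n for n in graph[node] if n not in path]
        if not candidates:                   # dead end: stop early
            break
        node = random.choice(candidates)
        path.append(node)
        total += complexity[node]
    return path
```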
Event generator agent. The goal of the event generator agent is to create an event based on a given list of policies. The primary challenge is to generate a valid and consistent initial database state that the chatbot can interact with during the conversation. The agent's architecture is shown in Figure 4.

The agent first expresses the event in symbolic form, using placeholder variables in place of concrete database entries. The agent then iterates over these symbols, instantiating them by inserting the relevant rows into the database and replacing the symbolic variables with the corresponding data. This symbolic representation enables the agent to generate valid and consistent events, even across complex chatbot databases.
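To illustrate the instantiation step, here is a hedged sketch. The placeholder format, the row contents, and all names are invented for illustration and do not reflect the actual implementation.

```python
import itertools

_ids = itertools.count(1)

def instantiate_event(symbols, scenario, db):
    """Insert a row for each symbolic variable, then substitute the
    generated values back into the scenario text.

    symbols:  dict mapping a placeholder (e.g. "{user_1}") to the table
              in which a row must exist for the user request to be valid.
    scenario: event description containing the placeholders.
    db:       dict of table name -> list of rows.
    """
    for placeholder, table in symbols.items():
        row = {"id": f"{table}_{next(_ids)}"}   # minimal concrete row
        db.setdefault(table, []).append(row)    # insert into the database
        scenario = scenario.replace(placeholder, row["id"])
    return scenario, db
```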
To overcome the limitations of turn-level and session-level methods, we propose segment-level preference optimization, called SDPO, which aligns agent behavior in multi-turn interactions by optimizing the turns within specific segments. Specifically, SDPO first identifies the erroneous turn in the negative session and then resamples multiple continuations from the interaction history preceding that turn, thereby generating the positive session. Next, SDPO takes the first differing turn as the starting point, identifies the key segment of the positive session that contributes to its higher score, and forms a data pair by taking the segment of the same length from the negative session. Finally, an adapted DPO loss is calculated over the turns within the segments. We present an overview of the three alignment algorithms at different granularities for social dialogues in Figure 1. Compared to turn-level DPO, SDPO aligns multiple interaction turns, making it more suitable for goal-oriented social dialogues.
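The segment-pairing step can be sketched as follows. This is a hedged illustration, not the paper's code: sessions are modeled as lists of turn strings, and the erroneous-turn identification, resampling, and scoring are assumed to have already produced the positive session.

```python
def build_sdpo_pair(pos_session, neg_session, segment_len):
    """Locate the first differing turn and extract equal-length segments.

    pos_session / neg_session: lists of turns. The positive session was
    resampled from the history preceding the erroneous negative turn,
    so the two sessions share a common prefix (and thus must differ
    somewhere after it).
    """
    # The first differing turn marks the start of both segments.
    start = next(i for i, (p, n) in enumerate(zip(pos_session, neg_session))
                 if p != n)
    history = pos_session[:start]                         # shared context
    pos_segment = pos_session[start:start + segment_len]
    neg_segment = neg_session[start:start + segment_len]  # same length
    return history, pos_segment, neg_segment
```

The adapted DPO loss is then computed over the turns inside the two segments, both conditioned on the shared history, presumably by summing the policy-to-reference log-probability ratios of the segment turns in place of the single-response ratios of standard DPO.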
Social Intelligence Social intelligence can be defined as an agent's ability to understand, adapt to, and respond to the emotions, intentions, and behaviors of others in social interactions. Most research on social intelligence has centered on evaluation. For example, SOCIALIQA (Sap et al., 2019) emphasizes commonsense reasoning about social situations, while SocialIQ (Zadeh et al., 2019) extends the evaluation modality from plain text to video. Shapira et al. (2023) assess large language models (LLMs) using the Faux Pas Test, and SocialBench (Chen et al., 2024) evaluates the sociality of role-playing agents at both the individual and group levels. Additionally, some studies (Le et al., 2019; Shapira et al., 2024) examine models' social intelligence from a theory-of-mind perspective. However, with the advancement of LLMs, LLM-based social agents are now capable of interacting in real social scenarios, and traditional static QA-style benchmarks are no longer sufficient to evaluate their social intelligence. SOTOPIA (Zhou et al., 2024) is currently the only dynamic, interactive social benchmark, providing simulated testing environments for contemporary social agents.
Behavioral cloning using expert data can effectively improve this situation, making the agent more communicative. Llama-8B+BC's goal rate drops when interacting with GPT-4o because the agent becomes more easily persuadable. We also observe that aligned agents achieve simultaneous improvements in both the goal and relationship dimensions. This indicates that alignment methods genuinely enhance the social intelligence of models, rather than achieving goals through behaviors that violate social norms, such as threats or deception.
The DPO-turn trajectory is nearly parallel to the DPO trajectory, indicating that DPO has almost no influence on the probability differences of subsequent turns. In contrast, the SDPO trajectory rises more steeply. These results demonstrate the necessity of explicitly modifying the probability distribution across turns within the entire segment, providing an explanation for the superiority of multi-turn alignment over DPO.
Negative segments may include irrelevant or error-free turns, or may fail to capture all erroneous turns, highlighting the need for more fine-grained control when selecting segments from negative samples. Moreover, we have not yet identified a theoretical framework that effectively supports the alignment of segments of unequal length.