SDPO: Segment-Level Direct Preference Optimization for Social Agents

Paper · arXiv 2501.01821 · Published January 3, 2025

Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex goal-oriented social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across a variety of agent tasks. Existing DPO-based approaches for multi-turn interactions divide into turn-level and session-level methods. Turn-level methods are overly fine-grained, focusing exclusively on individual turns, while session-level methods are too coarse-grained, often introducing training noise. To address these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which focuses on specific key segments within interactions to optimize multi-turn agent behavior while minimizing training noise.
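The abstract does not spell out the exact objective, but the segment-level idea can be sketched against the standard DPO loss. Below is a minimal, hypothetical PyTorch sketch assuming SDPO keeps the usual DPO form while summing policy/reference log-probability ratios only over tokens inside a chosen key segment; the function name, mask-based formulation, and `beta` default are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def segment_dpo_loss(
    policy_logps_chosen: torch.Tensor,    # per-token log-probs under the policy, preferred session
    policy_logps_rejected: torch.Tensor,  # per-token log-probs under the policy, dispreferred session
    ref_logps_chosen: torch.Tensor,       # per-token log-probs under the frozen reference model
    ref_logps_rejected: torch.Tensor,
    segment_mask_chosen: torch.Tensor,    # 1.0 for tokens inside the key segment, 0.0 elsewhere
    segment_mask_rejected: torch.Tensor,
    beta: float = 0.1,                    # assumed inverse-temperature, as in standard DPO
) -> torch.Tensor:
    """Hypothetical segment-level DPO loss: standard DPO, but the
    log-ratio is accumulated only over key-segment tokens rather than a
    single turn (turn-level) or the full session (session-level)."""
    chosen_ratio = ((policy_logps_chosen - ref_logps_chosen) * segment_mask_chosen).sum(-1)
    rejected_ratio = ((policy_logps_rejected - ref_logps_rejected) * segment_mask_rejected).sum(-1)
    # Bradley-Terry preference objective over the segment-level margin.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Under this reading, the segment mask interpolates between the two extremes the abstract criticizes: a mask covering a single turn recovers the turn-level method, and an all-ones mask recovers the session-level method.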

By incorporating identity-specific information, LLM-based agents can simulate human social behaviors, demonstrating basic social intelligence in tasks such as role-playing casual conversations (Wang et al., 2024a; Lu et al., 2024) and navigating simulated social environments (Park et al., 2023). However, recent studies (Zhou et al., 2024) have shown that in more complex, goal-oriented social scenarios, such as negotiation, competition, and cooperation, LLMs still struggle to exhibit the nuanced decision-making abilities characteristic of human social interactions.