Pixel-Level Reasoning Segmentation via Multi-turn Conversations

Paper · arXiv 2502.09447 · Published February 13, 2025
Multimodal

Existing general multimodal large language models (MLLMs) (Bai et al., 2023; Zhu et al., 2023; Liu et al., 2024b) exhibit exceptional visual perception, enabling both image segmentation and textual reasoning, yet they primarily rely on explicit human instructions for region-level grounding. Although some segmentation-specific works have explored grounded reasoning responses (Peng et al., 2023; You et al., 2023; Pi et al., 2023; Zhang et al., 2023a), they depend on user-provided regions to trigger reasoning. Such perception systems still cannot actively comprehend a user's nuanced intent in real-world scenarios. To alleviate this problem, Lai et al. (2023) propose the reasoning segmentation task, which aims to segment targets based on an implicit reasoning query. Recent studies (Ren et al., 2024; Xia et al., 2024; Yuan et al., 2024) have extended this region-level task to multi-object segmentation scenarios. However, these methods have two limitations: 1) they rely on single-turn, ambiguous queries and cannot fully track users' evolving intent; 2) they lack pixel-level segmentation, achieving only region-level segmentation through one-step explanations (e.g., roughly segmenting all ingredients in Figure 1(a)). In contrast, multi-turn interactions can progressively clarify vague, generalized instructions such as "make a bread". As illustrated in Figure 1(b), the system first guides the user through multi-turn interactions to clarify the desired type of bread, provides targeted responses, and ultimately focuses on the user's specific need, achieving pixel-level segmentation in the final turn.
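
To make the interaction pattern concrete, the Python sketch below mimics the Figure 1(b) flow: a vague query ("make a bread") triggers a clarifying question, and only the disambiguated request yields a pixel-level mask. This is a minimal sketch under assumed interfaces; the names (`StubSegmenter`, `Turn`, `run_session`) and the toy mask are illustrative, not the paper's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Turn:
    role: str                      # "user" or "assistant"
    text: str
    mask: Optional[list] = None    # per-pixel mask; only set on the final turn

@dataclass
class Dialogue:
    turns: list = field(default_factory=list)

class StubSegmenter:
    """Toy stand-in for a segmentation MLLM: asks one clarifying
    question, then 'segments' once the request is specific enough.
    A real model would condition on the image and the full dialogue."""
    def respond(self, dialogue: Dialogue) -> Turn:
        last = dialogue.turns[-1].text.lower()
        if "baguette" in last:
            # Intent is now specific: return a (dummy) pixel-level mask.
            return Turn("assistant", "Segmenting the baguette.",
                        mask=[[0, 1], [1, 1]])
        return Turn("assistant", "Which kind of bread would you like to make?")

def run_session(model, first_query: str):
    """Drive a multi-turn session: the model keeps asking clarifying
    questions until the user's intent can be grounded as a mask."""
    dialogue = Dialogue([Turn("user", first_query)])
    while True:
        reply = model.respond(dialogue)
        dialogue.turns.append(reply)
        if reply.mask is not None:       # intent resolved at pixel level
            return reply.mask
        # In a real system the user would answer here; we hard-code a reply.
        dialogue.turns.append(Turn("user", "A baguette."))

mask = run_session(StubSegmenter(), "make a bread")
print(mask)  # [[0, 1], [1, 1]]
```

The key design point the sketch captures is that segmentation is deferred: the mask is produced only after the dialogue has narrowed the initially ambiguous instruction to a specific target.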