Large Multimodal Agents: A Survey
Concurrently, there is an emerging research trend focused on extending these LLM-powered AI agents into the multimodal domain. This extension enables AI agents to interpret and respond to diverse multimodal user queries, thereby handling more intricate and nuanced tasks. In this paper, we conduct a systematic review of LLM-driven multimodal agents, which we refer to as large multimodal agents (LMAs for short). First, we introduce the essential components involved in developing LMAs and categorize the current body of research into four distinct types. Subsequently, we review collaborative frameworks that integrate multiple LMAs to enhance their collective efficacy. One of the critical challenges in this field is the diversity of evaluation methods used across existing studies, which hinders effective comparison among different LMAs.
large multimodal agents (LMAs) in our paper. Typically, they face more sophisticated challenges than language-only agents. Take web searching as an example: an LMA first takes the user's requirements as input and looks up relevant information through a search bar. It then navigates to web pages via mouse clicks and scrolls to browse real-time page content. Finally, the LMA must process multimodal data (e.g., text, videos, and images) and perform multi-step reasoning, extracting key information from web articles, video reports, and social media updates and integrating it to respond to the user's query.
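To make this workflow concrete, the following is a minimal sketch of such a web-searching loop in Python. It is purely illustrative: the component names (search, open_page, multimodal_reason) and the PageContent structure are hypothetical stand-ins rather than the interface of any surveyed system, and the stubbed bodies would be replaced by real browsing tools and a large multimodal model.

```python
# Illustrative sketch of the web-searching workflow described above.
# All component names are hypothetical, not an API from any surveyed system.
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageContent:
    """Multimodal content collected from a single web page."""
    text: str
    image_captions: List[str] = field(default_factory=list)
    video_transcripts: List[str] = field(default_factory=list)


def search(query: str) -> List[str]:
    """Step 1: enter the user's requirements into a search bar (stubbed)."""
    return [f"https://example.com/result-{i}" for i in range(3)]


def open_page(url: str) -> PageContent:
    """Step 2: navigate via clicks/scrolls and browse page content (stubbed)."""
    return PageContent(
        text=f"Article body fetched from {url}",
        image_captions=["chart referenced in the article"],
    )


def multimodal_reason(query: str, pages: List[PageContent]) -> str:
    """Step 3: multi-step reasoning over text, images, and videos (stubbed).
    A real LMA would invoke a large multimodal model here."""
    key_facts = [p.text for p in pages]
    key_facts += [c for p in pages for c in p.image_captions]
    return f"Answer to '{query}' based on {len(key_facts)} extracted facts."


def web_search_agent(query: str) -> str:
    urls = search(query)                    # look up relevant information
    pages = [open_page(u) for u in urls]    # browse real-time page content
    return multimodal_reason(query, pages)  # integrate and respond to the user


if __name__ == "__main__":
    print(web_search_agent("latest news about large multimodal agents"))
```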
Planning. Planners play a central role in LMAs, akin to the function of the human brain. They are responsible for deep reasoning about the current task and for formulating corresponding plans. Compared to language-only agents, LMAs operate in more complex environments, which makes devising reasonable plans more challenging. We detail planners from four perspectives (models, format, inspection & reflection, and planning methods):