Building Cooperative Embodied Agents Modularly with Large Language Models

Paper · arXiv 2307.02485 · Published July 5, 2023

In this work, we address challenging multi-agent cooperation problems with decentralized control, raw sensory observations, costly communication, and multi-objective tasks instantiated in various embodied environments. While previous research either presupposes a cost-free communication channel or relies on a centralized controller with shared observations, we harness the commonsense knowledge, reasoning ability, language comprehension, and text generation prowess of LLMs and seamlessly incorporate them into a cognitive-inspired modular framework that integrates with perception, memory, and execution. We thus build CoELA, a Cooperative Embodied Language Agent that can plan, communicate, and cooperate with others to accomplish long-horizon tasks efficiently. Our experiments on CWAH and TDW-MAT demonstrate that CoELA driven by GPT-4 can surpass strong planning-based methods and exhibit emergent effective communication. Though current open LMs like LLaMA-2 still underperform, we fine-tune CoLLAMA on data collected with our agents and show that it can achieve promising performance. We also conducted a user study on human-agent interaction and found that CoELA, communicating in natural language, can earn more trust from and cooperate more effectively with humans. Our research underscores the potential of LLMs for future research in multi-agent cooperation.

Therefore, this paper investigates how to leverage LLMs to build cooperative embodied agents that can collaborate and communicate efficiently with other agents and humans to accomplish long-horizon, multi-objective tasks in a challenging decentralized setting with costly communication. To this end, we focus on an embodied multi-agent setting as shown in Figure 1, where two decentralized embodied agents must cooperate to finish a multi-objective household task efficiently under complex partial observability. Specifically, communication in our setting takes time, as in real life, so the agents cannot simply chat freely with each other. To succeed in this setting, agents must i) perceive the observation to extract useful information, ii) maintain their memory of the world, the task, and the other agents, iii) decide what and when to communicate for the best efficiency, and iv) plan collaboratively to reach the common goal.

Inspired by prior work on cognitive architectures (Laird, 2019), we present CoELA, a Cooperative Embodied Language Agent: a cognitive architecture with a novel modular framework that harnesses the rich world knowledge, strong reasoning ability, and masterful natural language understanding and generation capabilities of LLMs to plan and communicate with others and cooperatively solve complex embodied tasks. Our framework consists of five modules, each addressing a critical aspect of successful multi-agent cooperation: a Perception Module to perceive the observation and extract useful information; a Memory Module, mimicking human long-term memory, to maintain the agent's understanding of both the physical environment and other agents; a Communication Module to decide what to communicate, utilizing the strong dialogue generation and understanding capability of LLMs; a Planning Module to decide high-level plans, including when to communicate, considering all the information available; and an Execution Module to execute the plan by generating primitive actions using procedures stored in the Memory Module.
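The five-module loop can be sketched in code. The following is a minimal illustrative sketch, not the paper's implementation: all class names, method names, and the memory layout are hypothetical, and the LLM is abstracted as a plain callable. Note how "communicate" is treated as just another high-level plan, so its time cost is weighed by the Planning Module like any other action.

```python
# Hypothetical sketch of CoELA's modular agent loop.
# The LLM backbone is passed in as a callable: prompt -> text.

class CoELAgent:
    def __init__(self, llm):
        self.llm = llm
        # Memory Module: understanding of the world, the task, and other agents
        self.memory = {"world": {}, "task": [], "agents": {}}

    def perceive(self, raw_obs):
        """Perception Module: extract structured facts from raw observations."""
        return {"objects_seen": raw_obs.get("objects", [])}

    def update_memory(self, facts):
        """Memory Module: fold newly perceived facts into long-term memory."""
        for obj in facts["objects_seen"]:
            self.memory["world"][obj] = True

    def plan(self):
        """Planning Module: the LLM chooses a high-level plan, which may
        itself be the decision to communicate."""
        prompt = f"Memory: {self.memory}. Choose the next high-level action."
        return self.llm(prompt)

    def communicate(self):
        """Communication Module: the LLM drafts a message to the partner."""
        return self.llm(f"Memory: {self.memory}. Draft a concise message.")

    def execute(self, plan):
        """Execution Module: expand the high-level plan into primitive
        actions (the paper uses procedures stored in memory)."""
        return [f"primitive:{plan}"]

    def step(self, raw_obs):
        facts = self.perceive(raw_obs)
        self.update_memory(facts)
        plan = self.plan()
        if plan == "communicate":
            return self.communicate()  # communication costs a step, like acting
        return self.execute(plan)
```

A usage example with a stub LLM that always plans "explore":

```python
agent = CoELAgent(llm=lambda prompt: "explore")
actions = agent.step({"objects": ["apple"]})  # perceives, remembers, plans, acts
```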