Adapting LLM Agents with Universal Feedback in Communication
Recent work has also focused on training LLM agents with linguistic feedback and non-linguistic reward signals. Linguistic feedback is typically processed into instruction data for Instruction Fine-tuning (IFT).
Current approaches employ linguistic data for IFT (Li et al., 2023; Micheli & Fleuret, 2021), while reward signals serve only as a filtering criterion.
We propose a universal framework, named Learning through Communication (LTC), to train LLM agents with both linguistic feedback and non-linguistic reward signals. We design a universal buffer to store all the feedback, and an iterative pipeline that enables an LLM agent to explore and update its policy in a given environment. Each iteration of LTC comprises two distinct phases: (1) Exploration: the agent interacts with the environments and other agents to gather diverse trajectories (linguistic) and reward signals (non-linguistic) into the universal buffer. (2) Updating: the agent's model is updated based on the data collected in the universal buffer. For updating, LTC combines the language modeling loss and the PPO loss to strike a balance between language consistency and reward signals. As the pivot of the iterative pipeline, the replay buffer is updated after each exploration phase, and a subset of the buffer is sampled for each updating phase.
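The iterative pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the buffer, the mixing coefficient `beta`, and the stand-in `explore`/`update` callables are all hypothetical names introduced here for clarity.

```python
import random


class UniversalBuffer:
    """Illustrative universal buffer: stores trajectories (linguistic data)
    paired with reward signals (non-linguistic feedback)."""

    def __init__(self):
        self.data = []

    def add(self, trajectory, reward):
        self.data.append((trajectory, reward))

    def sample(self, k):
        # A subset of the buffer is sampled for the updating phase.
        return random.sample(self.data, min(k, len(self.data)))


def combined_loss(lm_loss, ppo_loss, beta=0.5):
    """Weighted sum balancing language consistency (LM loss) against
    reward signals (PPO loss); beta is an assumed mixing coefficient."""
    return lm_loss + beta * ppo_loss


def ltc_iteration(explore, update, buffer, batch_size=4):
    """One LTC iteration: exploration fills the buffer, then the policy
    is updated on a sampled subset."""
    # Exploration phase: gather (trajectory, reward) pairs into the buffer.
    for trajectory, reward in explore():
        buffer.add(trajectory, reward)
    # Updating phase: update the agent's model on sampled data.
    batch = buffer.sample(batch_size)
    return update(batch)
```

A toy `explore` returning fixed trajectories and an `update` that consumes the batch suffice to exercise the loop; in the actual framework these would be the agent's environment interaction and a gradient step on the combined loss.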
To facilitate collecting trajectories with linguistic data and reward signals, we devise three communication patterns: (1) Single-agent Monologue: a single agent collects trajectories containing linguistic data and receives reward signals from the environment. (2) Multi-agent Dialogue: multiple agents interact with each other and with external tools to collect linguistic data, using reward signals provided by the environment. (3) Teacher-student Dialogue: a variant of multi-agent dialogue in which both the linguistic feedback and the non-linguistic reward signals are provided by a teacher agent instead of the environment.
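The three patterns differ mainly in where the linguistic data and the reward come from, which the following sketch makes explicit. All function and callable names here are illustrative assumptions; agents, environments, and teachers are modeled as plain callables rather than the paper's actual components.

```python
def monologue(agent, env):
    """Single-agent Monologue: one agent produces the trajectory;
    the environment supplies the reward signal."""
    messages = [agent("observe")]
    return messages, env(messages)


def multi_agent_dialogue(agents, env):
    """Multi-agent Dialogue: several agents exchange messages to
    collect linguistic data; the reward still comes from the environment."""
    messages = [a(f"turn {i}") for i, a in enumerate(agents)]
    return messages, env(messages)


def teacher_student_dialogue(student, teacher):
    """Teacher-student Dialogue: both the linguistic feedback and the
    reward signal are provided by a teacher agent, not the environment."""
    answer = student("solve")
    feedback, reward = teacher(answer)
    return [answer, feedback], reward
```

Each pattern returns the same `(messages, reward)` shape, so all three can feed the same universal buffer; only the source of the feedback changes.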