Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback
Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration, and hierarchical skill design. Recent works have made encouraging steps by exploiting the prior knowledge of large language models (LLMs). However, these approaches suffer from important limitations: they either do not scale to problems requiring billions of environment samples, because they require LLM annotations for each observation, or they need a diverse offline dataset, which may not exist or may be impossible to collect. In this work, we address these limitations through a combination of algorithmic and systems-level contributions. We propose ONI, a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function using LLM feedback. Our approach annotates the agent's collected experience via an asynchronous LLM server, which is then distilled into an intrinsic reward model.
For example, assigning a reward of +1 for solving the task and 0 otherwise is simple to define and accurately reflects the task goal, but it is difficult to optimize because it provides zero learning signal almost everywhere.
The reward designer can include additional reward shaping terms that reflect task progress or guide the agent towards intermediate goals, creating a denser learning signal. However, designing such intrinsic rewards can be remarkably challenging (Booth et al., 2023; Ibrahim et al., 2024) and places increased demands on human experts to provide task-specific knowledge.
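To make the contrast concrete, the sketch below compares a sparse task reward with a hand-shaped variant in a hypothetical goal-reaching environment; the distance-based shaping term and its coefficient are illustrative assumptions rather than a recipe from this paper.

```python
import numpy as np

def sparse_reward(state: np.ndarray, goal: np.ndarray) -> float:
    # +1 only upon task completion: trivial to specify, faithful to the goal,
    # but provides no learning signal until the goal is first reached.
    return 1.0 if np.array_equal(state, goal) else 0.0

def shaped_reward(state: np.ndarray, prev_state: np.ndarray, goal: np.ndarray) -> float:
    # Hand-designed shaping: reward progress toward the goal to densify the signal.
    # Choosing such terms (and their scale) demands task-specific expertise.
    progress = np.linalg.norm(prev_state - goal) - np.linalg.norm(state - goal)
    return sparse_reward(state, goal) + 0.1 * progress
```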
Generating reward function code with LLMs. A number of methods have been proposed to automatically generate executable code that computes the reward directly (Ma et al., 2023; Xie et al., 2023; Yu et al., 2023; Li et al., 2024). While they have demonstrated success in complex continuous control tasks, they require either access to the environment source code to include in the prompt, or a detailed description of input parameters and reward function templates. Furthermore, they are limited to reward functions that can be compactly expressed as code with explicit logic, and it is unclear how these approaches could easily process high-dimensional state representations such as images, or semantic features such as natural language.
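As an illustration of what such code-as-reward outputs typically look like, the following hypothetical snippet assumes a simple locomotion environment that exposes named numerical state variables; it is not taken from any of the cited methods.

```python
def generated_reward(state: dict, action: list) -> float:
    # Explicit, compact logic over named state variables, of the kind an LLM
    # could plausibly emit for a locomotion task (variable names are assumed).
    forward_progress = state["x_velocity"]
    energy_penalty = 0.01 * sum(a ** 2 for a in action)
    alive_bonus = 1.0 if state["torso_height"] > 0.5 else -1.0
    return forward_progress - energy_penalty + alive_bonus
```

Rewards of this form presuppose a readable, low-dimensional state; they do not extend naturally to image or free-form text observations.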
Generating reward values with LLMs. Motif (Klissarov et al., 2023) is a representative example of this category. It ranks the captions of pairs of observations using an LLM and distills these preferences into a parametric reward model. Motif requires neither access to the environment source code nor a numerical state representation, can process semantic input features, and can scale to problems requiring billions of environment samples. Nevertheless, it suffers from two important limitations. First, it requires a diverse, pre-existing dataset of captioned observations from which preferences are elicited from the LLM. In many situations, such a dataset might not exist, and collecting it can increase the sample complexity. More importantly, collecting a diverse dataset often requires a non-trivial reward function that is feasible to optimize, which is the primary problem we aim to solve with intrinsic reward functions in the first place. Second, it involves a complex three-stage process that sequentially annotates observations using an LLM, trains a reward model, and finally trains an RL agent. This is time-consuming: the LLM annotation stage alone can take several days' worth of GPU hours, and it must finish before the reward model and RL agent can be trained. Alternatively, Chu et al. (2023) query the LLM to directly label observations as having high or low reward at each timestep. However, querying an LLM for every observation is computationally infeasible for many RL applications, which involve millions or billions of observations. As a consequence, it would be desirable to have an integrated solution that offers:
(1) concurrent and fast online learning of both the intrinsic reward and the policy, requiring no external data or auxiliary reward functions,
(2) expressive reward functions that can capture semantic features which are difficult to process with compact executable code.
To this end, we propose ONI, a distributed online intrinsic reward and agent learning system. ONI assumes access to captions of observations, similar to previous work (Klissarov et al., 2023; Chu et al., 2023). The captions of collected observations are annotated by an asynchronous LLM server, and both the policy and the intrinsic reward model are updated simultaneously using LLM feedback. ONI removes the dependency on external datasets beyond the agent's own experience and enables large-scale RL training with ease. Such a learning framework allows us to systematically compare different algorithmic choices for synthesizing LLM feedback. Specifically, we explore three methods: the first is retrieval-based and simply hashes the annotations; the second builds a binary classification model to distill the sentiment labels returned by the LLM; and the third sends pairs of captions to the LLM server for preference labeling and learns a ranking model, similar to Motif (Klissarov et al., 2023).
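As a rough illustration of the first, retrieval-based variant, the sketch below caches LLM annotations for previously seen captions in a hash table and looks them up as intrinsic rewards; the class and method names are our own illustrative assumptions, not ONI's actual implementation.

```python
from typing import Dict, Set

class HashedAnnotationReward:
    """Caches per-caption LLM labels and serves them as intrinsic rewards."""

    def __init__(self):
        self.cache: Dict[str, float] = {}   # caption -> reward derived from LLM feedback
        self.pending: Set[str] = set()      # captions queued for asynchronous annotation

    def intrinsic_reward(self, caption: str) -> float:
        if caption in self.cache:
            return self.cache[caption]
        # Unseen caption: queue it for annotation and return a neutral reward
        # until the asynchronous LLM server responds.
        self.pending.add(caption)
        return 0.0

    def update(self, caption: str, llm_label: bool) -> None:
        # Called when the LLM server returns a label for a queued caption.
        self.cache[caption] = 1.0 if llm_label else 0.0
        self.pending.discard(caption)
```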
The LLM worker awaits new observations to label from the input queue, formats them into prompts, calls into an LLM, and returns the annotations via the output queue. The LLM worker also configures the prompt templates and other LLM options. We use an annotation and message format that supports all the intrinsic reward labels we consider in Section 3.2. Since Sample Factory already utilizes most of the free CPU and GPU capacity of the system, we opted to call into the LLM via an HTTP/REST interface rather than loading it in this process directly. The communication adds minimal overhead and avoids having to coordinate shared GPU memory between the main APPO code and the LLM.
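A minimal sketch of such a worker loop is shown below, assuming the LLM is served behind a simple HTTP endpoint; the URL, JSON schema, and prompt template are illustrative placeholders rather than the actual interface.

```python
import queue
import threading
import requests

LLM_URL = "http://localhost:8000/generate"  # hypothetical local LLM server
PROMPT_TEMPLATE = "The agent observes: '{caption}'. Is this desirable? Answer yes or no."

def llm_worker(input_queue: queue.Queue, output_queue: queue.Queue) -> None:
    while True:
        caption = input_queue.get()  # block until a new caption arrives
        prompt = PROMPT_TEMPLATE.format(caption=caption)
        resp = requests.post(LLM_URL, json={"prompt": prompt}, timeout=60)
        answer = resp.json().get("text", "").strip().lower()
        # Reduce the free-form answer to a binary annotation for the reward model.
        output_queue.put((caption, answer.startswith("yes")))

# Run the worker in its own thread so the RL loop never blocks on LLM calls.
in_q, out_q = queue.Queue(), queue.Queue()
threading.Thread(target=llm_worker, args=(in_q, out_q), daemon=True).start()
```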