External Model Motivated Agents: Reinforcement Learning for Enhanced Environment Sampling

We propose an agent influence framework for RL agents that improves the adaptation efficiency of external models in changing environments without any changes to the agent's rewards. Our formulation is composed of two self-contained modules: an interest field and behavior shaping via that interest field.

Consider an autonomous underground mining robot whose primary task is to dig and search for minerals, but that is also being used to model and monitor whether the environment is safe for humans, looking for dangers such as toxic air or intense temperatures (Nanda et al., 2010). Although a temperature safety model may not be relevant to the robot's mining task, temperatures can pose a danger to humans long before they threaten the robot, so it is critical to human safety that the temperature model remains accurate. While many use cases might not require the task policy and an external model to be trained simultaneously, if an environment transfer occurs during deployment, such as moving to a new depth underground or damage to the robot, both the task policy and the external model must quickly adapt to the change.

The key idea for this work is that there are points of interest that, if observed, would benefit adaptation of the external model. Our main contributions are the following: (1) a framework to motivate agents in a task-reward-agnostic way using two modules: a definition of "interest" and a method to steer agents by leveraging interest; (2) an exemplar implementation of both modules to motivate the agent with the secondary objective of enhanced environment sampling for external model learning; (3) experiments and results demonstrating that our interest-based agent motivation implementations improve the adaptation of external models in the presence of environment change compared to the standard PPO and online DIAYN baselines.

The core technical challenge of this work is influencing the behavior of an RL agent in a changed environment according to a model with an objective unrelated to the agent's reinforcement task. As single-objective RL is poorly suited to handle multi-objective scenarios like these (Hayes et al., 2022), multi-objective reinforcement learning (MORL) is often used to optimize one or more policies according to multiple objectives simultaneously.
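As a concrete illustration of one common MORL strategy (not this paper's mechanism, which leaves the agent's rewards untouched), linear scalarization collapses a vector of per-objective rewards into a single scalar via a weighted sum. The rewards and weights below are hypothetical.

```python
import numpy as np

def scalarize(rewards: np.ndarray, weights: np.ndarray) -> float:
    """Collapse a vector of per-objective rewards into one scalar.

    rewards: shape (k,), one entry per objective (e.g., task reward,
             model-adaptation bonus).
    weights: shape (k,), relative importance of each objective.
    """
    return float(np.dot(weights, rewards))

# Hypothetical two-objective example: mining task reward vs. a
# monitoring/adaptation objective.
r = np.array([1.0, 0.3])   # [task_reward, monitoring_reward]
w = np.array([0.8, 0.2])   # preference over the two objectives
print(scalarize(r, w))     # 0.86
```

The choice of weights encodes a fixed preference over objectives, which is exactly the coupling between task and secondary objective that a reward-agnostic framework avoids.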

Different domains have different definitions of what makes a sample interesting. For example, in the supervised learning setting, active learning examines how to increase the sample efficiency of learning by labeling only the most "interesting" samples from an unlabeled dataset (Ren et al., 2021).
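A minimal sketch of uncertainty sampling, one standard active-learning criterion for "interesting" samples: pick the unlabeled points whose predicted class distributions have the highest entropy. The pool and predictions below are made up for illustration.

```python
import numpy as np

def select_most_interesting(probs: np.ndarray, budget: int) -> np.ndarray:
    """Uncertainty sampling: pick the unlabeled samples whose predicted
    class distributions have the highest entropy.

    probs:  shape (n_samples, n_classes), model predictions on the
            unlabeled pool.
    budget: number of samples to send for labeling.
    Returns the indices of the selected samples.
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[-budget:]

# Hypothetical pool of 4 unlabeled samples over 3 classes; the
# near-uniform row is the most uncertain, hence most "interesting".
pool = np.array([[0.90, 0.05, 0.05],
                 [0.34, 0.33, 0.33],
                 [0.60, 0.30, 0.10],
                 [0.80, 0.10, 0.10]])
print(select_most_interesting(pool, budget=2))
```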

However, in online, agent-based learning like reinforcement learning, there is no dataset of unlabeled samples; an agent can only train its models using samples it observes from interacting with the environment. To make reinforcement learning more sample efficient, exploration methods use uncertainty estimates to direct the agent toward novel or poorly understood regions of the environment.
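As one illustrative instantiation of such an exploration method (not this paper's reward-agnostic approach), an uncertainty estimate can be derived from the disagreement of an ensemble of learned dynamics models and used as an intrinsic bonus. Everything below is a hypothetical sketch.

```python
import numpy as np

def disagreement_bonus(ensemble_preds: np.ndarray, scale: float = 1.0) -> float:
    """Intrinsic exploration bonus from ensemble disagreement.

    ensemble_preds: shape (n_models, obs_dim), each model's prediction
                    of the next observation. High variance across models
                    signals epistemic uncertainty, i.e., a poorly
                    understood region worth visiting.
    """
    variance = ensemble_preds.var(axis=0).mean()  # mean per-dimension variance
    return scale * float(variance)

# Hypothetical ensemble of 3 dynamics models predicting a 2-D next state.
preds = np.array([[0.1, 0.2],
                  [0.5, 0.1],
                  [0.3, 0.4]])
r_int = disagreement_bonus(preds)
# In a typical exploration method this bonus is added to the task
# reward during training: r_total = r_task + beta * r_int
```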

Our framework consists of two separate modules: the Point of Interest Field (Interest Field) and Point of Interest Influence (POI Influence). The interest field is a scalar field over the observation space O that defines how "interesting" an observation is for the policy to visit. The POI influence module uses the interest field to shape the agent's behavior to collect more "interest" throughout the on-policy rollout.
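A minimal sketch of how the two modules could compose, assuming a hypothetical one-step prediction of each action's next observation and an exponential reweighting rule. This is an illustrative reward-agnostic steering scheme, not necessarily the paper's exact POI influence implementation.

```python
import numpy as np

def poi_influenced_action(policy_probs: np.ndarray,
                          candidate_next_obs: np.ndarray,
                          interest_fn,
                          alpha: float = 1.0) -> int:
    """Reward-agnostic behavior shaping: reweight the task policy's
    action distribution by the interest of each action's predicted next
    observation, then sample. The agent's reward signal is untouched.

    policy_probs:       shape (n_actions,), the task policy's action
                        distribution at the current state.
    candidate_next_obs: shape (n_actions, obs_dim), predicted next
                        observation per action (hypothetical one-step
                        model).
    interest_fn:        scalar interest field over observations.
    alpha:              influence strength (alpha=0 recovers the
                        unmodified task policy).
    """
    interest = np.array([interest_fn(o) for o in candidate_next_obs])
    weights = policy_probs * np.exp(alpha * interest)
    weights /= weights.sum()
    return int(np.random.choice(len(policy_probs), p=weights))

# Hypothetical interest field: observations near a region the external
# model has sampled poorly are the most "interesting" (illustrative
# stand-in for a learned interest field).
interest_fn = lambda obs: float(np.exp(-np.linalg.norm(obs - 1.0)))

probs = np.array([0.5, 0.3, 0.2])
next_obs = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
action = poi_influenced_action(probs, next_obs, interest_fn)
```

Because the influence acts only on how actions are sampled during the rollout, the task policy's reward and optimization target remain unchanged, consistent with the reward-agnostic framing above.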