Behavioral Exploration: Learning to Explore via In-Context Adaptation

Paper · arXiv 2507.09041 · Published July 11, 2025
Test Time Compute · Context Engineering

Humans are able to achieve fast online exploration and adaptation, often acquiring new information and skills in only a handful of interactions, while existing algorithmic approaches tend to rely on random exploration and slow, gradient-based behavior updates. How can we endow autonomous agents with capabilities on par with humans? Taking inspiration from recent progress on both in-context learning and large-scale behavioral cloning, in this work we propose behavioral exploration: training agents to internalize what it means to explore and adapt in-context over the space of “expert” behaviors. To achieve this, given access to a dataset of expert demonstrations, we train a long-context generative model to predict expert actions conditioned on a context of past observations and a measure of how “exploratory” the expert’s behaviors are relative to this context. This enables the model not only to mimic the behavior of an expert but also, by feeding its past history of interactions into its context, to select expert behaviors different from those it has previously selected, thereby allowing for fast online adaptation and targeted, “expert-like” exploration. We demonstrate the effectiveness of our method in both simulated locomotion and manipulation settings, as well as on real-world robotic manipulation tasks, illustrating its ability to learn adaptive, exploratory behavior.
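To make the conditioning concrete, here is a minimal, hypothetical sketch of how such training tuples could be assembled from a demonstration trajectory. The `exploration_level` proxy (distance of an expert action from everything already in the context) is our own illustrative stand-in, not the paper's actual measure, and the toy data is random:

```python
import numpy as np

rng = np.random.default_rng(0)

def exploration_level(action, context_obs):
    # Hypothetical proxy: an expert action counts as more "exploratory"
    # the farther it lies from everything already in the context.
    if len(context_obs) == 0:
        return 1.0
    return float(min(np.linalg.norm(action - o) for o in context_obs))

# Toy expert trajectory: (observation, action) pairs in R^2.
trajectory = [(rng.normal(size=2), rng.normal(size=2)) for _ in range(5)]

# Build training tuples (state, context, exploration level, expert action);
# a long-context generative model would be trained to predict the final
# field (the expert action) conditioned on the first three.
examples = []
context = []
for obs, act in trajectory:
    examples.append((obs, list(context), exploration_level(act, context), act))
    context.append(obs)
```

At deployment, the same context slot is filled with the agent's own interaction history, so the model can condition on what has already been tried.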

What would successful exploration and adaptation behavior look like for an autonomous agent? As an example, consider telling a robot “hand me a cup”. In this scenario, the robot should attempt to pick up whichever cups are in the scene until it successfully picks up the right one. It should not, however, move its arm randomly, try to pick up a plate, or repeatedly pick up the same cup when there are other cups in the scene it has not yet attempted to pick up. Instead, the correct behavior requires first leveraging knowledge of the scene and what it means to “hand me a cup” to direct actions to the most relevant features, and then quickly adapting based on the attempts that have already been made in order to continue attempting novel, potentially correct, actions. Such a semantic, informed approach to exploration, coupled with fast online adaptation, is much more akin to how humans behave and dramatically more efficient than the “uninformed” novelty-seeking exploration strategies that are more often considered in the existing literature.

In this work we take steps towards developing autonomous agents that exhibit such behavior, focusing in particular on settings where we have access to expert demonstration datasets that provide a prior on what behaviors may be “reasonable” for our task of interest. We are inspired by the recent success of in-context learning in enabling fast online adaptation in language domains (Brown et al., 2020), and seek to leverage similar in-context learning capabilities in our setting. We propose training a long-context policy to internalize both which actions an expert would take in a given state and which expert actions are most exploratory. To instill the policy with in-context adaptation capabilities, in addition to the current state we condition on a context that includes a history of previous observations and a measure of how exploratory the expert action in this state is given those observations. This enables the model not only to infer which actions an expert is likely to take, but also which expert actions are exploratory; by feeding its past observations into its context, the policy can quickly adapt its behavior online to attempt behaviors different from those it has already tried. Critically, our approach is trained to be exploratory over the space of expert behaviors and thus naturally restricts its exploration to coherent, reasonable behaviors: behaviors an expert is likely to perform, and therefore behaviors likely to solve a given task.
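The resulting inference-time behavior can be sketched with a toy version of the “hand me a cup” scenario. This is not the paper's learned policy; it is a hand-written stand-in where candidate actions are already restricted to expert-like behaviors (cup positions) and a simple novelty score plays the role of the exploration conditioning:

```python
import numpy as np

def novelty(candidate, tried):
    # Hypothetical exploration score: distance from the nearest past attempt.
    if not tried:
        return float("inf")
    return min(float(np.linalg.norm(candidate - t)) for t in tried)

def select_action(expert_candidates, tried):
    # Choose the most exploratory behavior, but only among expert-like
    # candidates, keeping exploration coherent rather than random.
    scores = [novelty(c, tried) for c in expert_candidates]
    return expert_candidates[int(np.argmax(scores))]

# "Hand me a cup": three cup positions an expert might plausibly reach for.
cups = [np.array([0.0, 1.0]), np.array([1.0, 0.0]), np.array([2.0, 2.0])]
tried = []
for _ in range(3):
    tried.append(select_action(cups, tried))  # feed each attempt back in
```

Because attempts are fed back into the history, each step targets a cup not yet tried; after three steps every cup has been attempted exactly once, with no random arm motions and no repeated grasps.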