Can large language models explore in-context?
We investigate the extent to which contemporary Large Language Models (LLMs) can engage in exploration, a core capability in reinforcement learning and decision making. We focus on native performance of existing LLMs, without training interventions. We deploy LLMs as agents in simple multi-armed bandit environments, specifying the environment description and interaction history entirely in-context, i.e., within the LLM prompt.
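The setup described above — a simple bandit environment whose description and full interaction history are rendered as text inside the prompt — can be sketched as follows. This is a minimal illustration, not the paper's actual harness; the function names, the two-armed Bernoulli instance, and the history format are all assumptions for exposition.

```python
import random

def pull(means, arm, rng=random.random):
    """Sample a Bernoulli reward (0 or 1) for the chosen arm."""
    return 1 if rng() < means[arm] else 0

def history_to_prompt(history):
    """Render the raw interaction history as in-context text,
    one line per round, to be appended to the LLM prompt."""
    lines = [f"Round {t}: pulled arm {a}, reward {r}"
             for t, (a, r) in enumerate(history, start=1)]
    return "\n".join(lines)

# Hypothetical 2-armed Bernoulli bandit with a scripted arm sequence.
means = [0.5, 0.6]   # per-arm success probabilities (unknown to the agent)
history = []
random.seed(0)
for arm in [0, 1, 0, 1]:
    history.append((arm, pull(means, arm)))
print(history_to_prompt(history))
```

In the actual experiments the next arm would be chosen by the LLM itself, conditioned on this textual history rather than on any parameter updates.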
….
In-context learning is an important emergent capability of Large Language Models (LLMs) that enables one to use a pre-trained LLM to solve a problem by specifying the problem description and relevant data entirely in-context, i.e., within the LLM prompt, with no updates to the LLM parameters (Brown et al., 2020).
….
Although supervised learning is an important capability, many applications demand the use of ML models for downstream decision making. Thus, in-context reinforcement learning (ICRL) and sequential decision making are a natural next frontier.
….
Decision making agents must possess three core capabilities: generalization (required for supervised learning), exploration (making decisions that may be suboptimal in the short term for the sake of gathering more information), and planning (to account for long-term consequences of decisions).
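To make the exploration capability concrete: a classical baseline that deliberately takes short-term-suboptimal actions to gather information is ε-greedy. The sketch below is a standard reference algorithm, not anything the LLM agents in this work run; the arm means, ε value, and seed are illustrative assumptions.

```python
import random

def eps_greedy_choice(counts, sums, eps, rng):
    """With probability eps, explore a uniformly random arm (possibly
    suboptimal short-term); otherwise exploit the best empirical mean.
    Unpulled arms get +inf so each arm is tried at least once."""
    k = len(counts)
    if rng.random() < eps:
        return rng.randrange(k)
    means = [s / c if c > 0 else float("inf") for s, c in zip(sums, counts)]
    return max(range(k), key=lambda a: means[a])

rng = random.Random(1)
true_means = [0.4, 0.7]          # hypothetical Bernoulli arms; arm 1 is best
counts, sums = [0, 0], [0.0, 0.0]
for _ in range(2000):
    a = eps_greedy_choice(counts, sums, eps=0.1, rng=rng)
    r = 1 if rng.random() < true_means[a] else 0
    counts[a] += 1
    sums[a] += r
# With sustained exploration, play concentrates on the better arm over time.
```

An agent that lacks this capability may lock onto whichever arm looks best early on and never recover, which is exactly the failure mode probed in the experiments below.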
In our experiments, we find that only a single configuration (i.e., a prompt design and LLM pair) results in satisfactory exploratory behavior. All other configurations exhibit exploration failures: with significant probability, they fail to converge to the best decision (arm).
The single configuration that succeeds in our experiments involves a combination of GPT-4 and an “enhanced” prompt that (a) provides a suggestive hint to explore, (b) externally summarizes the history of interaction into per-arm averages, and (c) asks the LLM to use zero-shot chain-of-thought reasoning (Wei et al., 2022; Kojima et al., 2022). This configuration is visualized in Figure 1(b). One can interpret this finding positively: state-of-the-art LLMs do possess the capability to robustly explore, provided that the prompt is carefully designed to elicit this behavior. On the other hand, we find that the same configuration without external summarization fails, which leads to a negative interpretation: LLMs may fail to explore in more complex environments, where externally summarizing the history is a non-trivial algorithm design problem.
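Component (b), external summarization of the history into per-arm averages, can be sketched as below. The exact textual format fed to the model is an assumption here (the function name and wording are hypothetical); the point is that the raw round-by-round history is collapsed into per-arm statistics before it reaches the prompt.

```python
def summarize_history(history, n_arms):
    """Collapse a raw (arm, reward) history into a per-arm summary:
    pull count and empirical average reward for each arm."""
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    for arm, reward in history:
        counts[arm] += 1
        totals[arm] += reward
    lines = []
    for a in range(n_arms):
        avg = totals[a] / counts[a] if counts[a] else 0.0
        lines.append(f"Arm {a}: pulled {counts[a]} times, "
                     f"average reward {avg:.2f}")
    return "\n".join(lines)

print(summarize_history([(0, 1), (0, 0), (1, 1)], n_arms=2))
```

In a bandit this summarization is trivial because per-arm averages are sufficient statistics; in stateful or structured environments, choosing what to summarize is itself the algorithm design problem the negative interpretation refers to.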