RL + Transformer = A General-Purpose Problem Solver

Paper · arXiv 2501.14176 · Published January 24, 2025
LLM Architecture · Reinforcement Learning · Deep Research · Evolution

What if artificial intelligence could not only solve the problems it was trained on but also learn to teach itself to solve new problems (i.e., meta-learn)? In this study, we demonstrate that a pre-trained transformer fine-tuned with reinforcement learning over multiple episodes develops the ability to solve problems it has never encountered before, an emergent ability called In-Context Reinforcement Learning (ICRL). This powerful meta-learner not only excels at solving unseen in-distribution environments with remarkable sample efficiency, but also shows strong performance in out-of-distribution environments. In addition, we show that it is robust to variations in the quality of its training data, seamlessly stitches together behaviors from its context, and adapts to non-stationary environments. These results demonstrate that an RL-trained transformer can iteratively improve upon its own solutions, making it an excellent general-purpose problem solver.

However, the application of RL to the real world has been fraught with challenges. Compared to humans, RL methods suffer from low sample efficiency [Tsividis et al., 2017; Duan et al., 2016], meaning that they require a vast number of interactions with the environment before learning an effective policy. This inefficiency arises because they begin tabula rasa, without any prior knowledge of the environment, and must explore a wide range of possible actions and states to gather enough information to improve their performance.

This ability to generalize and learn new tasks without retraining prompts us to ask: Is it possible to train a transformer to function as a reinforcement learning algorithm, improving its predictions based on a few experiences in its context without any additional weight updates? If so, can it generalize beyond its training data, learn with higher sample efficiency, and solve non-stationary environments—all without additional training?

In this paper we show that: (1) Llama 3.1 8B can teach itself (meta-learn) through in-context experience by training on an RL objective, (2) the ability to meta-learn generalizes beyond the problem space in which it was acquired, (3) the model is robust to variations in the quality of its training data, (4) the model can assemble skills in a piecemeal manner by stitching together behaviors from its context, and (5) it adapts to non-stationary environments.

Prior work has trained transformers to adapt in context, similarly to meta-RL algorithms like RL² [Duan et al., 2016]. In this setup, the model adapts its policy based on the history of interactions within an episode. To address the optimization instabilities often encountered when training transformers for RL tasks, the authors use the T-Fixup initialization [Huang et al., 2020] to stabilize training. Their experiments reveal that transformers trained as in-context learners not only match but sometimes exceed the performance of traditional meta-RL methods. Notably, these transformers exhibit a degree of generalization to tasks slightly out of distribution, highlighting their capacity for rapid adaptation based on observed histories.
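To make this setup concrete, here is a minimal sketch of such an in-context adaptation loop: the policy conditions on the growing within-episode history of (observation, action, reward) transitions rather than on updated weights. The names SequencePolicy and run_episode are hypothetical, and the environment is assumed to follow the Gymnasium reset/step API; this illustrates the idea rather than reproducing the cited authors' code.

```python
from typing import List, Tuple

# One transition of experience: (observation, action, reward).
Transition = Tuple[int, int, float]

class SequencePolicy:
    """Hypothetical stand-in for a transformer that maps a history plus the
    current observation to an action; in ICRL this is where adaptation lives."""

    def act(self, history: List[Transition], obs: int) -> int:
        raise NotImplementedError

def run_episode(env, policy: SequencePolicy, max_steps: int = 100) -> float:
    history: List[Transition] = []  # the in-context "memory"; no weight updates occur
    obs, _ = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy.act(history, obs)  # conditions on all transitions so far
        next_obs, reward, terminated, truncated, _ = env.step(action)
        history.append((obs, action, reward))  # adaptation = appending experience
        total_reward += reward
        obs = next_obs
        if terminated or truncated:
            break
    return total_reward
```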

In parallel, researchers at DeepMind have introduced a transformer-based agent trained with meta-RL that adapts to solve complex tasks on timescales comparable to human learning [Bauer et al., 2023]. Their agent demonstrates sample efficiency akin to that of humans, suggesting that transformers may employ learning strategies similar to those humans use when confronting new challenges. This work underscores the potential of transformers as powerful meta-learners in RL settings.

To explore the capabilities of ICRL, we employ the open-source large language model (LLM) Llama 3.1 8B Instruct [Dubey et al., 2024]. We fine-tune this model using the Deep Q-Network (DQN) reinforcement learning algorithm [Mnih et al., 2013], which enables the model to learn optimal actions through trial and error.
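As a rough illustration of what the DQN objective looks like when an LLM serves as the Q-function, the sketch below implements the standard temporal-difference loss of Mnih et al. [2013] in PyTorch. The wrappers llm_q and target_q (each returning one scalar Q-value per discrete action, e.g., via a small value head) and the discount factor are assumptions for illustration; the paper's exact fine-tuning recipe is not given in this excerpt.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor; an assumed value, not taken from the paper

def dqn_loss(llm_q, target_q, batch):
    """Standard DQN temporal-difference loss [Mnih et al., 2013].

    llm_q:    the model being fine-tuned, mapping observations to (B, num_actions) Q-values
    target_q: a periodically synced frozen copy used to compute bootstrap targets
    """
    obs, actions, rewards, next_obs, done = batch
    q_values = llm_q(obs)                                   # shape (B, num_actions)
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                   # targets are not backpropagated
        next_max = target_q(next_obs).max(dim=1).values     # max over actions of Q_target(s', a')
        targets = rewards + GAMMA * next_max * (1.0 - done)
    return F.smooth_l1_loss(q_taken, targets)
```

In a full training loop, transitions would be drawn from a replay buffer and target_q would be synced to llm_q every fixed number of steps, as in the original DQN recipe.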

Our training data are collected from the parametric game Frozen Lake [Farama Foundation, 2022], a dynamic environment where the game parameters can be changed between episodes. Rather than focusing on solving a single, specific version of Frozen Lake, our objective is to enhance the model’s performance across multiple episodes with varying game configurations. By doing so, we aim to improve the model’s ability to generalize and find better solutions over time, thus highlighting the benefits of the ICRL approach.
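One plausible way to realize this parametric setup with the Gymnasium library is to resample the Frozen Lake map for every episode, so the agent faces a different configuration each time. The map size, hole probability, and slipperiness below are illustrative choices, not values from the paper; the random policy is a stand-in for the fine-tuned model.

```python
import gymnasium as gym
from gymnasium.envs.toy_text.frozen_lake import generate_random_map

def make_episode_env() -> gym.Env:
    # Sample a fresh layout so the game parameters change between episodes.
    desc = generate_random_map(size=4, p=0.8)  # p = probability a tile is frozen
    return gym.make("FrozenLake-v1", desc=desc, is_slippery=True)

for episode in range(3):
    env = make_episode_env()                # a new configuration every episode
    obs, _ = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # stand-in for the model's chosen action
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
    env.close()
```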

While in-context reinforcement learning may not always find the correct answer, the key point is that it can improve its performance by adapting to unforeseen scenarios. This progress indicates that agents capable of human-like adaptability and continuous improvement are within reach.