How Should We Meta-Learn Reinforcement Learning Algorithms?

Paper · arXiv 2507.17668 · Published July 23, 2025
Reinforcement Learning · Evolution · Self-Refinement · Self-Consistency · Feedback

Meta-learning shows particular promise for reinforcement learning (RL), where algorithms are often adapted from supervised or unsupervised learning despite being suboptimal for RL. However, until now there has been a severe lack of comparison between different meta-learning approaches, such as using evolution to optimise over black-box functions or prompting LLMs to propose code. In this paper, we carry out this empirical comparison of the different approaches when applied to a range of meta-learned algorithms which target different parts of the RL pipeline.

The improvement of machine learning algorithms typically relies on manual design, a cumbersome process that is limited by human intuition and only rarely yields breakthroughs. A recent alternative paradigm instead meta-learns the learning algorithm itself from data. In this setting, algorithms are discovered computationally, with only limited need for human intervention in the design of the meta-learning process. This has particular potential for reinforcement learning (Sutton & Barto, 2020, RL), which is prone to instability (Van Hasselt et al., 2018; Achiam et al., 2019; Tang & Berseth, 2024) and often borrows algorithms from supervised and unsupervised learning that require adaptation to RL (e.g., Parisotto et al., 2020; Obando Ceron et al., 2023; Ellis et al., 2024).

In our results, we find that: language models can find effective RL algorithms in a sample-efficient way, so long as there is a good algorithm from which to kickstart meta-training; distillation of learned algorithms into other networks sometimes improves performance without requiring additional samples; and symbolic representations do not scale well to recurrent algorithms or those with many inputs. Based on these findings, we propose several recommendations for better ways to meta-learn new RL algorithms: for instance, many systems could benefit from using LLMs in the loop, and distillation from a black-box algorithm into another network is usually worth trying for a potentially cheap performance boost. We hope that these guidelines can help reduce the cost of research in meta-RL while ensuring that meta-learned algorithms are as capable as possible.
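To make the distillation finding concrete, the sketch below shows the basic recipe: regress a fresh student network onto the outputs of a trained black-box teacher, with no further environment samples. This is a minimal PyTorch illustration, not the paper's code; the teacher, the 8-feature input, and the network sizes are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical teacher: a trained black-box update rule (e.g. an
# OPEN-style learned optimiser mapping 8 scalar features to an update).
teacher = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
# Same-sized student; a smaller student trades capacity for compactness.
student = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))

opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(5_000):
    # Synthetic inputs standing in for features logged during meta-training;
    # no new RL rollouts are needed, which is why distillation is cheap.
    x = torch.randn(256, 8)
    with torch.no_grad():
        target = teacher(x)
    loss = nn.functional.mse_loss(student(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```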

Unlike these works, which present new meta-learned algorithms, we focus on understanding how the meta-learning algorithm affects a number of factors in RL, such as generalisation. This is particularly important given the instability of RL.

Instead of meta-learning black-box algorithms represented by neural networks, some approaches discover symbolic algorithms defined as interpretable mathematical functions. Symbolic algorithms fit naturally into an LLM-based pipeline, since they are easily represented in code. Symbolic programs can be found through symbolic evolution (e.g., Lion; Chen et al., 2023) or by prompting LLMs to improve algorithms over meta-training (e.g., Lehman et al., 2022; Lu et al., 2024; Romera-Paredes et al., 2024). In part of this work, we explore when symbolic algorithms are better than black-box ones, as suggested by Chen et al. (2023).
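The key property here is that a symbolic candidate is just a short expression over named inputs, so it can be mutated by an evolution loop or emitted directly by an LLM. Below is a toy sketch of mutation-only symbolic evolution; the input vocabulary (gradient `g`, momentum `m`), operator set, and fitness function are illustrative assumptions, not the paper's search space.

```python
import random

# Candidates are expression strings over a fixed input vocabulary.
INPUTS = ["g", "m"]
BINOPS = ["+", "-", "*"]

def random_expr(depth=2):
    # Sample a small expression tree, rendered directly as Python code.
    if depth == 0 or random.random() < 0.3:
        return random.choice(INPUTS)
    return f"({random_expr(depth - 1)} {random.choice(BINOPS)} {random_expr(depth - 1)})"

def compile_rule(expr):
    # Each candidate compiles to a callable update rule, which is also
    # exactly the form an LLM-based pipeline can read and rewrite.
    return eval(f"lambda g, m: {expr}")

# One generation against a toy fitness: how closely the rule's output
# matches plain gradient descent (i.e. returning g) on one probe point.
population = [random_expr() for _ in range(20)]
fitness = lambda e: -abs(compile_rule(e)(0.5, 0.1) - 0.5)
best = max(population, key=fitness)
print(best, fitness(best))
```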

In RL, a pioneering meta-learned algorithm is Learned Policy Gradient (Oh et al., 2020, LPG), which replaces the actor-critic update, although there are many other learned RL algorithms (e.g., Kirsch et al., 2020; Jackson et al., 2023; Kirsch & Schmidhuber, 2022; Lan et al., 2024). In addition to LPG, we focus on Learned Policy Optimisation (Lu et al., 2022, LPO), a learned alternative to proximal policy optimisation (Schulman et al., 2017, PPO); and Optimisation for Plasticity, Exploration and Nonstationarity (Goldie et al., 2024, OPEN), a learned optimiser that uses feature engineering for meta-learning. Unlike these papers, which propose new meta-learned algorithms for RL, we instead seek to understand how the meta-learning algorithm itself affects performance.
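To see what "a learned alternative to PPO" means in practice, the sketch below contrasts the handcrafted PPO clipped surrogate with an LPO-style stand-in, where a small network scores each (ratio, advantage) pair in place of the clip. The feature choice and architecture are illustrative assumptions, not the trained LPO function.

```python
import torch
import torch.nn as nn

def ppo_clip_objective(ratio, adv, eps=0.2):
    # Handcrafted PPO clipped surrogate: the objective LPO replaces.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return torch.minimum(unclipped, clipped).mean()

# Illustrative learned objective: a small network over simple features.
drift_net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))

def learned_objective(ratio, adv):
    feats = torch.stack([ratio - 1.0, adv], dim=-1)
    return drift_net(feats).squeeze(-1).mean()

ratio = torch.rand(128) * 0.4 + 0.8   # importance ratios around 1
adv = torch.randn(128)                # advantage estimates
print(ppo_clip_objective(ratio, adv), learned_objective(ratio, adv))
```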

4.5 LLM Proposal

Since the rise of highly capable agentic language models, many researchers have used language models for algorithm discovery (e.g., Lu et al., 2024; Faldor et al., 2024; Romera-Paredes et al., 2024; Hu et al., 2024; Song et al., 2024b). Generally, this research is based on the premise that language models generate intelligent proposals, making them more sample-efficient than symbolic evolution. As such, LLM-driven discovery pipelines generally evaluate on the order of tens of algorithms, rather than thousands, making them much more practical for evaluating directly in RL.

Since prompt tuning can play a large part in LLM performance, we build on an existing system, DiscoPOP (Lu et al., 2024), and warm-start the search from a handcrafted algorithm. The LLM must reason in-context about the performance of previous algorithms to make suggestions for the next one. In our setting, due to a number of unconventional inputs (particularly in the case of OPEN), we provide the LLM with a brief description of all inputs to the learned algorithm. After training, we select the best in-distribution algorithm for evaluation. We use o3-mini (OpenAI, 2025) as our LLM, since it is a highly capable reasoning model with strong coding performance.
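Structurally, this is a simple propose-evaluate loop over code. A minimal sketch follows; the helpers `query_llm` (a chat call to o3-mini) and `evaluate_in_rl` (meta-training score of a candidate) are placeholders standing in for the real pipeline, not an actual API.

```python
def query_llm(prompt: str) -> str:
    raise NotImplementedError  # e.g. one chat-completion call to o3-mini

def evaluate_in_rl(code: str) -> float:
    raise NotImplementedError  # train RL agents with the candidate, return score

def discover(warm_start: str, input_docs: str, iters: int = 30):
    # Warm-start from a handcrafted algorithm, as described above.
    archive = [(warm_start, evaluate_in_rl(warm_start))]
    for _ in range(iters):
        history = "\n\n".join(f"# score={s:.3f}\n{c}" for c, s in archive)
        prompt = (
            f"Inputs to the learned algorithm:\n{input_docs}\n\n"
            f"Previously evaluated algorithms:\n{history}\n\n"
            "Reason about these results, then propose an improved algorithm."
        )
        code = query_llm(prompt)  # in-context reasoning over past scores
        archive.append((code, evaluate_in_rl(code)))
    # Select the best in-distribution algorithm for evaluation.
    return max(archive, key=lambda cs: cs[1])
```

Note that the loop touches the RL environment only tens of times, which is why LLM proposal is so much more sample-efficient than symbolic evolution.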

This does highlight a clear limitation of distillation, though: if the original algorithm is poor, distillation is unlikely to fix it. Symbolic distillation also struggles, likely because the 8 inputs make this a relatively high-dimensional problem for symbolic evolution. Overall, LLM proposal is by far the strongest baseline, both in-distribution and for generalisation.

The LLM likely performs well for a few reasons: gradient-based optimisation is well covered in the LLM's training corpus; all inputs to the optimiser are easy to understand; and the LLM has access to a per-environment learning rate tuned for its initialisation of SGD, which effectively relies on few-shot meta-test evaluation. The use of hyperparameters can be seen as an advantage (for flexibility) or a disadvantage (if meta-test-time samples are expensive).
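That per-environment tuning amounts to a small few-shot sweep at meta-test time, something like the sketch below. Here `train_and_score` is an illustrative placeholder for a short RL run with a candidate learning rate; the grid values are assumptions.

```python
def train_and_score(env_name: str, lr: float) -> float:
    raise NotImplementedError  # short RL training run, return final return

def tune_lr(env_name: str, lrs=(1e-4, 3e-4, 1e-3, 3e-3)):
    # Each grid point costs meta-test samples, which is exactly the
    # trade-off discussed above.
    return max(lrs, key=lambda lr: train_and_score(env_name, lr))
```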

8 Design Recommendations

Based on the results in Section 7, we produce a set of design recommendations for future meta-learning pipelines. These recommendations reflect the current state of the field, meaning they may require adaptation as meta-learning algorithms and capabilities improve. We describe them below; a short sketch codifying them follows the list.

• For a meta-learned algorithm with few inputs, or inputs which are easy to understand (i.e., an LLM can interpret them), prompting an LLM for new algorithms is a sample-efficient way to find new generalisable algorithms. This has three caveats: there must be an easy-to-define, performant function from which to start the search; it must be possible to run hyperparameter tuning for the algorithm in the meta-test environment; and in-distribution performance of the algorithm will likely be worse than learning a black-box function (especially for many meta-samples).

• As long as it is possible to define a warm-start initialisation function, it is almost always better to prompt a language model for algorithm proposals than to apply symbolic distillation. In fact, beyond yielding interpretable functions, symbolic distillation is unlikely to improve performance, contrary to the suggestion of Chen et al. (2023) that symbolic functions should generalise better.

• Black-box distillation can often, but not always, improve generalisation. We recommend applying black-box distillation into a same-sized network for all black-box learned algorithms that are feed-forward or have short recurrent rollouts; given that there is no increased sample cost and training is quick, this can occasionally yield cheap performance gains. On balance, distilling into a smaller network risks larger drops in performance for smaller potential gains.

• Black-box algorithms are practically the only way to meta-learn algorithms which use a large number of features. If a meta-learned algorithm has many inputs, like OPEN, then an LLM is unlikely to propose a performant algorithm which also incorporates all of the input features.
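As promised above, here is a rough decision rule that codifies these recommendations. It is a sketch of the guidance, not an algorithm from the paper; in particular, the input-count threshold is an illustrative assumption.

```python
def choose_meta_learner(n_inputs: int, inputs_interpretable: bool,
                        has_warm_start: bool, can_tune_meta_test: bool) -> str:
    # Codifies the recommendations above; thresholds are illustrative.
    if inputs_interpretable and has_warm_start and can_tune_meta_test:
        return "LLM proposal: sample-efficient and generalises well"
    if n_inputs > 8:
        return "black-box meta-learning: LLMs/symbolic search are unlikely to use all inputs"
    return "black-box meta-learning, plus same-size distillation for a possible cheap boost"

print(choose_meta_learner(n_inputs=8, inputs_interpretable=True,
                          has_warm_start=True, can_tune_meta_test=True))
```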