Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Can transformers learn to solve new problems within episodes?

Explores whether RL-finetuned transformers can develop meta-learning abilities that let them adapt to unseen tasks through in-episode experience alone, without weight updates.

Note · 2026-02-22 · sourced from LLM Architecture
Related: How should we allocate compute budget at inference time? · What kind of thing is an LLM really? · How should researchers navigate LLM reasoning research?

"RL + Transformer = A General-Purpose Problem Solver" (2501.14176) demonstrates that a pre-trained transformer fine-tuned with RL over multiple episodes develops In-Context Reinforcement Learning (ICRL) — an emergent ability to solve problems never encountered during training by learning within the episode context.

Llama 3.1 8B, fine-tuned using DQN on parametric Frozen Lake games, demonstrates these capabilities simultaneously: it solves board configurations never seen during training, and it adapts within a single episode without any weight updates.
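As a reminder of what the DQN objective involves, here is a minimal sketch of the bootstrapped TD target that a fine-tuning loss would regress the model's Q-values toward. The function name and example values are illustrative, not taken from the paper:

```python
def dqn_target(reward, done, next_q_values, gamma=0.99):
    """Standard DQN bootstrapped target: r + gamma * max_a' Q(s', a').

    In the ICRL setting, Q(s', a') would come from the fine-tuned
    transformer conditioned on the episode context; here it is just
    a list of floats.
    """
    if done:
        return reward  # terminal states have no bootstrapped future value
    return reward + gamma * max(next_q_values)

# Hypothetical Q-values over Frozen Lake's four moves (left/down/right/up).
next_q = [0.1, 0.5, -0.2, 0.3]
target = dqn_target(reward=0.0, done=False, next_q_values=next_q)
```

The loss for the fine-tuning step would then be a regression of the predicted Q-value for the taken action toward `target`.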

The mechanism is meta-learning via RL. The model adapts its policy based on the history of interactions within an episode — learning from its own within-episode experience without any weight updates. This parallels DeepMind's finding that transformer-based agents trained with meta-RL adapt to complex tasks within timescales comparable to human learning.
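Concretely, the adaptation described above happens purely through the growing episode context. A toy sketch of that loop, with a hand-written rule standing in for the learned transformer policy (all names are hypothetical; a real ICRL agent conditions a sequence model on the transition history rather than applying an explicit rule):

```python
def act(history, actions):
    """Toy in-context policy: condition on the episode's own history.

    No parameters are updated anywhere; the 'learning' is entirely a
    function of the transitions accumulated so far in the episode.
    """
    failed = {a for (_, a, r) in history if r < 0}  # actions that got negative reward
    for a in actions:
        if a not in failed:
            return a
    return actions[0]  # fall back if everything has failed

history = []                       # grows within the episode, like a context window
actions = ["left", "down", "right", "up"]

a1 = act(history, actions)         # first attempt: "left"
history.append(("s0", a1, -1.0))   # fell in a hole: negative reward recorded
a2 = act(history, actions)         # adapts from experience: avoids "left"
```

The point of the sketch is the signature, not the rule: the policy is a pure function of `history`, which is exactly the property that lets a frozen-weight transformer improve within an episode.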

The critical distinction from standard fine-tuning: ICRL doesn't teach the model to solve specific problems. It teaches the model to learn to solve problems from experience. The training objective (RL over multiple episodes with varying configurations) creates a meta-learning pressure that the transformer architecture can exploit through its context window. Since Why do trajectories matter more than individual examples for in-context learning?, ICRL's multi-episode training naturally provides the trajectory burstiness property that enables sequential decision-making ICL to emerge.
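One way to picture that trajectory-level training data: whole episodes are serialized into the context, so correlated repeats of the same environment appear bursty within a single training sequence, rather than as shuffled i.i.d. examples. A hypothetical packing sketch (the token format is an assumption, not the paper's):

```python
def pack_trajectories(episodes):
    """Serialize whole episodes into one flat, token-like sequence.

    Each episode contributes a delimiter plus its full
    (observation, action, reward) trajectory, preserving the
    within-episode ordering that sequential-decision ICL relies on.
    """
    tokens = []
    for ep in episodes:
        tokens.append("<episode>")
        for obs, action, reward in ep:
            tokens += [f"o:{obs}", f"a:{action}", f"r:{reward}"]
    return tokens

seq = pack_trajectories([
    [("s0", "down", 0.0), ("s1", "right", 1.0)],  # one episode on a map
    [("s0", "down", -1.0)],                       # another episode, same map
])
```

Training on sequences laid out this way is what creates the meta-learning pressure: the cheapest way for the model to predict later transitions is to exploit the earlier ones in the same context.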

Since Does RL teach reasoning or just when to use it?, ICRL extends this principle: RL doesn't just teach when to reason, it teaches when and how to learn within context. The base model already has the capacity for in-context adaptation; RL post-training activates and refines this meta-learning capacity.

Since Do base models already contain hidden reasoning ability?, ICRL suggests that meta-learning capability may be another latent capacity that RL activates rather than creates. The pre-trained model's in-context learning ability is the substrate; RL post-training shapes it into in-context reinforcement learning.

