Can transformers learn to solve new problems within episodes?
Explores whether RL-fine-tuned transformers can develop meta-learning abilities that let them adapt to unseen tasks through in-episode experience alone, without weight updates.
"RL + Transformer = A General-Purpose Problem Solver" (2501.14176) demonstrates that a pre-trained transformer fine-tuned with RL over multiple episodes develops In-Context Reinforcement Learning (ICRL) — an emergent ability to solve problems never encountered during training by learning within the episode context.
Llama 3.1 8B, fine-tuned with DQN on parametrically varied Frozen Lake environments, exhibits several capabilities simultaneously (a minimal sketch of the training loop follows this list):
- Solves unseen in-distribution environments with remarkable sample efficiency
- Shows strong performance on out-of-distribution environments
- Is robust to the quality of its training data
- Stitches together behaviors from its context in a piecemeal fashion
- Adapts to non-stationary environments
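To make the setup concrete, here is a minimal sketch of that training loop. It is a deliberate scale-down of the paper's setup: a tiny from-scratch transformer stands in for Llama 3.1 8B, the Q-network conditions on the whole in-episode history, and each task is a freshly generated Frozen Lake layout. All names (`HistoryQNet`, the sentinel action, the hyperparameters) are illustrative assumptions, not the paper's code.

```python
# Minimal ICRL training sketch (illustrative, not the paper's implementation):
# a history-conditioned Q-network fine-tuned with DQN across many randomly
# generated Frozen Lake layouts, so high reward requires inferring the
# current layout from in-episode experience.
import torch
import torch.nn as nn
import gymnasium as gym
from gymnasium.envs.toy_text.frozen_lake import generate_random_map

class HistoryQNet(nn.Module):
    """Q(a | s, history): a transformer that reads the whole episode so far."""
    def __init__(self, n_states=16, n_actions=4, d=64, max_len=128):
        super().__init__()
        self.s_emb = nn.Embedding(n_states, d)
        self.a_emb = nn.Embedding(n_actions + 1, d)   # +1 = "no action yet"
        self.pos = nn.Embedding(max_len, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.q_head = nn.Linear(d, n_actions)

    def forward(self, states, actions):
        # states, actions: (batch, time) integer tensors of equal length
        t = torch.arange(states.shape[1])
        x = self.s_emb(states) + self.a_emb(actions) + self.pos(t)
        h = self.encoder(x)                 # full history is always visible
        return self.q_head(h[:, -1])        # Q-values for the latest step

qnet = HistoryQNet()
opt = torch.optim.Adam(qnet.parameters(), lr=1e-4)

for task in range(1000):                    # each task = a fresh layout
    env = gym.make("FrozenLake-v1", desc=generate_random_map(size=4),
                   is_slippery=False)
    obs, _ = env.reset()
    states, actions = [obs], [4]            # 4 = "no action yet" sentinel
    done = False
    while not done:
        q = qnet(torch.tensor([states]), torch.tensor([actions]))
        if torch.rand(()) < 0.1:            # epsilon-greedy exploration
            act = int(env.action_space.sample())
        else:
            act = int(q.argmax())
        obs, reward, term, trunc, _ = env.step(act)
        done = term or trunc
        with torch.no_grad():               # one-step TD target
            nxt = qnet(torch.tensor([states + [obs]]),
                       torch.tensor([actions + [act]]))
            target = reward + 0.99 * (0.0 if done else float(nxt.max()))
        loss = (q[0, act] - target) ** 2    # DQN-style TD loss
        opt.zero_grad(); loss.backward(); opt.step()
        states.append(obs); actions.append(act)
```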
The mechanism is meta-learning via RL: the model adapts its policy from the history of interactions within an episode, learning from its own within-episode experience without any weight updates. This parallels DeepMind's finding that transformer-based agents trained with meta-RL can adapt to complex tasks on timescales comparable to human learning.
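Continuing the training sketch above, the evaluation loop makes the frozen-weights point explicit: there is no optimizer and no gradient at test time, so the only state that changes between steps is the context the model conditions on.

```python
# Evaluation on a never-seen layout, reusing qnet from the sketch above.
# No optimizer, no gradients: the only thing that changes from one step
# to the next is the history the frozen model conditions on.
qnet.eval()
env = gym.make("FrozenLake-v1", desc=generate_random_map(size=4),
               is_slippery=False)
obs, _ = env.reset()
states, actions = [obs], [4]                 # 4 = "no action yet" sentinel
done = False
with torch.no_grad():
    while not done:
        q = qnet(torch.tensor([states]), torch.tensor([actions]))
        act = int(q.argmax())
        obs, reward, term, trunc, _ = env.step(act)
        states.append(obs); actions.append(act)   # "learning" = this
        done = term or trunc                       # growing context
```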
The critical distinction from standard fine-tuning: ICRL doesn't teach the model to solve specific problems; it teaches the model to learn to solve problems from experience. The training objective (RL over multiple episodes with varying configurations) creates a meta-learning pressure that the transformer architecture can exploit through its context window. Connecting to "Why do trajectories matter more than individual examples for in-context learning?": ICRL's multi-episode training naturally provides the trajectory-burstiness property that lets sequential decision-making ICL emerge.
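A hedged sketch of what that data property looks like operationally (helper names like `bursty_context` are illustrative, and the random rollout policy is a stand-in for whatever behavior policy generates training data): a training context packs several trajectories from the same sampled task, so earlier trajectories become evidence for later decisions.

```python
# Illustrative construction of a "bursty" training context: several
# trajectories from the SAME sampled layout are packed together, so the
# model is rewarded for inferring task identity from earlier trajectories.
import gymnasium as gym
from gymnasium.envs.toy_text.frozen_lake import generate_random_map

def random_episode(env, max_steps=20):
    """Roll one episode with a random policy; return (state, action, reward) steps."""
    obs, _ = env.reset()
    steps = []
    for _ in range(max_steps):
        act = env.action_space.sample()
        nxt, rew, term, trunc, _ = env.step(act)
        steps.append((obs, int(act), float(rew)))
        obs = nxt
        if term or trunc:
            break
    return steps

def bursty_context(n_episodes=3):
    """One sampled layout = one task; several episodes in it = burstiness."""
    env = gym.make("FrozenLake-v1", desc=generate_random_map(size=4),
                   is_slippery=False)
    ctx = []
    for _ in range(n_episodes):
        ctx.extend(random_episode(env))     # same-task trajectories repeat,
    return ctx                              # making the task inferable
```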
Connecting to "Does RL teach reasoning or just when to use it?": ICRL extends that principle. RL doesn't just teach when to reason; it teaches when and how to learn within context. The base model already has the capacity for in-context adaptation; RL post-training activates and refines this meta-learning capacity.
Connecting to "Do base models already contain hidden reasoning ability?": ICRL suggests that meta-learning may be another latent capacity that RL activates rather than creates. The pre-trained model's in-context learning ability is the substrate; RL post-training shapes it into in-context reinforcement learning.
Source: LLM Architecture
Related concepts in this collection
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  Relation: ICRL extends this note's claim; RL activates meta-learning, not just reasoning.
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  Relation: meta-learning as another latent capability that RL activates.
- Can agents learn from failure without updating their weights?
  Explores whether language models can improve through trial-and-error by storing reflections in memory rather than through gradient-based parameter updates. Tests if environmental feedback alone can drive learning.
  Relation: ICRL is the RL-trained version of episodic learning.
- Why do trajectories matter more than individual examples for in-context learning?
  Can language models learn new sequential decision-making tasks from context alone, and if so, what data properties make this possible? This explores why isolated state-action pairs fail where full trajectories succeed.
  Relation: trajectory burstiness names the data property that enables ICRL. Trajectories from the same level (task) in the training data create the meta-learning pressure that ICRL exploits, and ICRL's generalization to unseen environments depends on having encountered bursty trajectory distributions during RL fine-tuning.
- Why do LLMs struggle with exploration in simple decision tasks?
  This explores why large language models fail at exploration, a core decision-making capability, even when they excel at other tasks, and what specific conditions might help them succeed.
  Relation: ICRL demonstrates successful in-context adaptation via RL, while this note shows exploration failure in vanilla LLM agents. The difference may be that ICRL's RL fine-tuning specifically trains the exploration-exploitation trade-off, whereas vanilla LLMs must approximate it from language patterns alone.
- Can LLMs handle multiple tasks at once during inference?
  Do language models maintain multiple distinct in-context learning tasks simultaneously in their internal representations, and if so, what prevents them from actually generating outputs for more than one task?
  Relation: task superposition provides the representational substrate for ICRL. The model can maintain multiple task interpretations from in-context experience simultaneously, enabling meta-learning across environment variations within a single episode.
Original note title: in-context reinforcement learning enables transformers to meta-learn from episode experience — generalizing to unseen environments without weight updates