Reinforcement Learning for LLMs · LLM Reasoning and Architecture · Agentic and Multi-Agent Systems

Why do LLMs struggle with exploration in simple decision tasks?

This note explores why large language models fail at exploration, a core decision-making capability, even when they excel at other tasks, and what specific conditions let them succeed.

Note · 2026-02-22 · sourced from Reasoning Architectures

Decision-making agents require three core capabilities: generalization (the supervised-learning problem), exploration (making suboptimal short-term decisions to gather information), and planning (accounting for long-term consequences). LLMs have been shown to possess generalization and a limited form of planning; exploration turns out to be the hardest.

In a systematic evaluation across multi-armed bandit environments, one of the simplest exploration settings, only a single LLM/prompt configuration achieves satisfactory exploratory behavior: GPT-4 with an explicit exploration hint, external per-arm history summarization, and zero-shot chain-of-thought. Every other configuration fails, including GPT-4 with the hint alone, or with CoT but without external summarization.
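For concreteness, here is a minimal sketch of this kind of bandit evaluation, assuming a Bernoulli bandit; the class and function names are illustrative, and `choose_arm` stands in for whatever policy is under test (here, an LLM queried with some rendering of the history):

```python
import random

class BernoulliBandit:
    """Each arm pays reward 1 with a fixed, unknown success probability."""

    def __init__(self, means):
        self.means = means  # true success probability per arm

    def pull(self, arm):
        return 1 if random.random() < self.means[arm] else 0

def run_episode(bandit, choose_arm, horizon=100):
    """Run one episode, logging the raw (arm, reward) interaction history."""
    history = []
    for _ in range(horizon):
        arm = choose_arm(history)  # the policy under test, e.g. a prompted LLM
        reward = bandit.pull(arm)
        history.append((arm, reward))
    return history
```

Good exploration here means occasionally pulling arms with poor observed means, since an empirical average over a handful of pulls can be badly misleading.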

The critical factor is external history summarization. Without it, the model must track which arms have been tried and what rewards were obtained purely from the raw interaction history in context. When the history grows long, this becomes, in effect, an in-context computation problem: the model must maintain and update a running average per arm from unstructured text, and LLMs appear unable to perform this computation reliably.
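To make the failure mode concrete, the unsummarized condition presents something like the rendering below (the study's exact prompt wording may differ). After hundreds of rounds, recovering per-arm counts and means from this flat text is a substantial in-context computation:

```python
def render_raw_history(history):
    """Render the log as a no-summarization prompt would: one line per
    round, leaving all aggregation to the model in-context."""
    return "\n".join(
        f"Round {t}: pulled arm {arm}, received reward {r}"
        for t, (arm, r) in enumerate(history)
    )
```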

External summarization converts the unstructured history (a list of (arm, reward) tuples) into structured per-arm aggregates that are trivially readable. With this pre-processing in place, GPT-4 can apply exploratory reasoning correctly.
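A sketch of what such a summarizer does, with an illustrative output format (the paper's exact summary text may differ):

```python
from collections import defaultdict

def summarize_history(history):
    """Collapse raw (arm, reward) tuples into per-arm counts and means."""
    counts, totals = defaultdict(int), defaultdict(float)
    for arm, reward in history:
        counts[arm] += 1
        totals[arm] += reward
    return "\n".join(
        f"Arm {arm}: pulled {counts[arm]} times, "
        f"mean reward {totals[arm] / counts[arm]:.3f}"
        for arm in sorted(counts)
    )
```

Note that the summary's length scales with the number of arms rather than with the horizon, so the reading task stays trivial no matter how long the episode runs.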

The negative interpretation matters: in complex environments, external summarization is itself a non-trivial algorithm-design problem. If the history has thousands of entries with rich structure (state, action, and observation sequences), pre-processing it into the right summary form is hard, so LLM exploration in truly complex environments is likely to remain unreliable.

This connects to "Why do trajectories matter more than individual examples for in-context learning?": both findings show that LLMs' in-context-learning capabilities in sequential decision-making are fragile, hinging on data-presentation choices that are non-trivial to implement.

