Can LLMs design reward functions for reinforcement learning?

Can language models help automate the notoriously difficult task of designing reward shaping functions for sparse-reward RL, and if so, how might we structure that collaboration to work around LLMs' weaknesses in stochastic control?

Note · 2026-05-18 · sourced from Reinforcement Learning

Sparse-reward RL with stochastic transitions is notoriously sample-inefficient. The standard remedy — reward shaping with intrinsic rewards — places the cognitive burden on the human designer. Producing useful shaping functions requires either task-specific domain knowledge or expert demonstrations for each new task, neither of which scales.

MEDIC (2405.15194) replaces the human designer with an LLM, but with an architectural twist that avoids the well-known failure of directly prompting LLMs for control policies. Direct LLM prompting for control is unreliable because LLMs struggle with the stochasticity, partial observability, and reward sparsity that make RL hard in the first place. MEDIC's move is to strip away those difficulties before asking the LLM to plan.

The mechanism has three steps. First, construct a deterministic abstraction of the original RL problem: the same goal, but simplified to remove stochastic transitions and complex state. Second, prompt an LLM to solve this abstracted problem, producing a (possibly suboptimal but valid) plan. The plan serves as a guide policy: it represents what the LLM thinks a good policy looks like in the simplified setting. Third, convert this guide policy into a reward shaping function for the downstream RL agent operating on the original stochastic problem. The shaping rewards encourage the RL agent to follow the LLM's guide policy when it aligns with task progress.
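
The third step can be illustrated with a minimal Python sketch. It is not MEDIC's actual conversion, which this note does not specify. It swaps in standard potential-based shaping, F(s, s') = gamma * phi(s') - phi(s), with the potential set to a state's position along the guide plan. The grid states, the hard-coded `guide_plan` (a stand-in for real LLM output), and the names `build_potential` and `shaping_reward` are all hypothetical.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]

# Hypothetical stand-in for the LLM's plan over the deterministic abstraction:
# the sequence of abstract states visited from start to goal.
guide_plan: List[State] = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]


def build_potential(plan: List[State]) -> Dict[State, float]:
    """Assign each state on the plan its progress index (later = higher)."""
    return {state: float(i) for i, state in enumerate(plan)}


def shaping_reward(
    potential: Dict[State, float],
    s: State,
    s_next: State,
    gamma: float = 0.99,
) -> float:
    """Potential-based shaping term: gamma * phi(s_next) - phi(s).

    States that are not on the guide plan get potential 0 (an assumption
    made for this sketch).
    """
    return gamma * potential.get(s_next, 0.0) - potential.get(s, 0.0)


potential = build_potential(guide_plan)

# A transition that advances along the plan earns a positive bonus ...
forward = shaping_reward(potential, (0, 1), (0, 2))
# ... and a transition that moves back along the plan is penalised.
backward = shaping_reward(potential, (0, 2), (0, 1))
```

In a training loop, the agent would receive the environment reward plus this shaping term on every transition of the original stochastic problem. The potential-based form is chosen here because it is known to leave the optimal policy unchanged, so a suboptimal guide plan can slow learning but should not change what the agent ultimately converges to.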

A model-based feedback critic verifies LLM outputs against the abstract model — catching plans that violate problem constraints — before the plan is converted to shaping rewards. This prevents the LLM's plausible-but-wrong outputs from contaminating the RL training signal.
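
A critic of this kind can be sketched as a function that replays a proposed plan in the deterministic abstract model and reports the first violated constraint. The sketch below assumes a small grid world with walls; the function name `critique_plan`, the action vocabulary, and the feedback strings are hypothetical and not taken from the paper.

```python
from typing import List, Optional, Set, Tuple

State = Tuple[int, int]

# Hypothetical action set of the abstract model: (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def critique_plan(
    plan: List[str],
    start: State,
    goal: State,
    walls: Set[State],
    size: int,
) -> Optional[str]:
    """Replay the plan in the abstract model.

    Returns None if the plan is valid, otherwise a feedback message that
    could be sent back to the LLM for another attempt.
    """
    state = start
    for i, action in enumerate(plan):
        if action not in MOVES:
            return f"step {i}: unknown action {action!r}"
        d_row, d_col = MOVES[action]
        nxt = (state[0] + d_row, state[1] + d_col)
        if not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
            return f"step {i}: {action!r} leaves the grid from {state}"
        if nxt in walls:
            return f"step {i}: {action!r} moves into a wall at {nxt}"
        state = nxt
    if state != goal:
        return f"plan ends at {state}, not at goal {goal}"
    return None


walls = {(0, 1)}
# This plan walks straight into the wall, so the critic rejects it.
bad_feedback = critique_plan(["right", "right", "down", "down"],
                             (0, 0), (2, 2), walls, size=3)
# This plan goes around the wall and reaches the goal, so it is accepted.
good_feedback = critique_plan(["down", "down", "right", "right"],
                              (0, 0), (2, 2), walls, size=3)
```

Only plans for which the critic returns `None` would be passed on to the reward shaping step. A rejected plan's message gives the LLM something concrete to repair.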

The conceptual move is decomposing what was previously a single hard task (design a reward shaping function for stochastic sparse-reward RL) into two easier tasks (design a deterministic abstraction; have the LLM solve it). Each easier task goes to the party equipped for it: humans can specify the deterministic abstraction, and LLMs can produce a plan over that abstraction.

The broader implication: LLMs do not need to be good control policies to contribute to RL. They can be good plan generators over simplified versions of the problem, while the rest of the RL machinery handles the actual stochastic dynamics. This is a different design pattern from later approaches such as those in the notes "Can chain-of-thought reasoning be learned during pretraining itself?" (where LLM thinking IS the policy) and "Can agents learn continuously from experience without updating weights?" (where LLM reasoning operates over a case bank). MEDIC sits earlier in the pipeline: the LLM contributes to RL's reward shaping rather than to its policy or value estimation.


Paper: Efficient Reinforcement Learning via Large Language Model-based Search

Original note title

LLMs can construct reward shaping functions by solving a simpler deterministic abstraction of the original RL problem