Can LLMs design reward functions for reinforcement learning?
Can language models help automate the notoriously difficult task of designing reward shaping functions for sparse-reward RL, and if so, how might we structure that collaboration to work around LLMs' weaknesses in stochastic control?
Sparse-reward RL with stochastic transitions is highly sample-inefficient. The standard remedy, reward shaping with intrinsic rewards, places the cognitive burden on the human designer. Producing useful shaping functions requires either task-specific domain knowledge or expert demonstrations for each new task, and neither scales.
MEDIC (arXiv:2405.15194) replaces the human designer with an LLM, but with an architectural twist that avoids the well-known failure of directly prompting LLMs for control policies. Direct prompting is unreliable because LLMs struggle with the stochasticity, partial observability, and reward sparsity that make RL hard in the first place. MEDIC's move is to strip away those difficulties before asking the LLM to plan.
The mechanism has three steps. First, construct a deterministic abstraction of the original RL problem: the same goal, but simplified to remove stochastic transitions and complex state. Second, prompt an LLM to solve this abstracted problem, producing a valid, though possibly suboptimal, plan. This plan serves as the guide policy: it represents what the LLM thinks a good policy looks like in the simplified setting. Third, convert the guide policy into a reward shaping function for the downstream RL agent operating on the original stochastic problem. The shaping rewards encourage the RL agent to follow the LLM's guide policy when it aligns with task progress.
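The third step can be illustrated with a minimal sketch. This is not MEDIC's actual construction: the grid states, the hard-coded `GUIDE_PLAN` (standing in for an LLM's output on the abstraction), the `bonus` value, and all function names are assumptions made for illustration. The shaping term simply pays a small bonus whenever a transition advances along the plan.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]

# Hypothetical plan an LLM might return for the deterministic abstraction:
# a sequence of abstract states from start to goal. Hard-coded for illustration.
GUIDE_PLAN: List[State] = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]


def plan_progress(plan: List[State]) -> Dict[State, int]:
    """Map each state on the plan to its index along the plan."""
    return {state: i for i, state in enumerate(plan)}


def make_shaping_reward(
    plan: List[State], bonus: float = 0.1
) -> Callable[[State, State], float]:
    """Build an intrinsic reward that pays `bonus` for advancing along the plan."""
    progress = plan_progress(plan)

    def shaping(state: State, next_state: State) -> float:
        # States that are not on the plan get progress -1. Stepping onto the
        # plan from outside therefore counts as an advance; staying off the
        # plan or moving backwards along it earns nothing.
        before = progress.get(state, -1)
        after = progress.get(next_state, -1)
        return bonus if after > before else 0.0

    return shaping


shaping = make_shaping_reward(GUIDE_PLAN)


def shaped_reward(env_reward: float, state: State, next_state: State) -> float:
    """Training signal: sparse environment reward plus the shaping term."""
    return env_reward + shaping(state, next_state)
```

The RL agent would train on `shaped_reward` in the original stochastic environment, so the plan only biases exploration and the environment reward still defines success. A naive bonus like this one can be exploited by leaving and re-entering the plan; a potential-based form is one common way to avoid that.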
A model-based feedback critic verifies LLM outputs against the abstract model — catching plans that violate problem constraints — before the plan is converted to shaping rewards. This prevents the LLM's plausible-but-wrong outputs from contaminating the RL training signal.
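As a rough sketch of what such a check could look like, the code below validates a plan against a toy deterministic model and returns human-readable violations that could be fed back to the LLM. The 3x3 grid, the blocked cell, and the names `is_legal_step` and `critique_plan` are all hypothetical; the paper's critic and abstract model may differ.

```python
from typing import List, Set, Tuple

State = Tuple[int, int]

# Hypothetical deterministic abstraction: a 3x3 grid with one blocked cell.
GRID_SIZE = 3
BLOCKED: Set[State] = {(1, 1)}
START: State = (0, 0)
GOAL: State = (2, 2)


def is_legal_step(state: State, next_state: State) -> bool:
    """Legal means: one orthogonal move onto a free, in-bounds cell."""
    (r, c), (nr, nc) = state, next_state
    in_bounds = 0 <= nr < GRID_SIZE and 0 <= nc < GRID_SIZE
    one_step = abs(r - nr) + abs(c - nc) == 1
    return in_bounds and one_step and next_state not in BLOCKED


def critique_plan(plan: List[State]) -> List[str]:
    """Return constraint violations; an empty list means the plan passed."""
    if not plan:
        return ["plan is empty"]
    problems: List[str] = []
    if plan[0] != START:
        problems.append(f"plan starts at {plan[0]}, expected {START}")
    if plan[-1] != GOAL:
        problems.append(f"plan ends at {plan[-1]}, expected {GOAL}")
    # Check every consecutive pair of states against the abstract model.
    for i, (s, s_next) in enumerate(zip(plan, plan[1:])):
        if not is_legal_step(s, s_next):
            problems.append(f"step {i}: illegal move {s} -> {s_next}")
    return problems
```

Only a plan for which `critique_plan` returns an empty list would be converted into shaping rewards; otherwise the listed violations could be included in a follow-up prompt.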
The conceptual move is decomposing what was previously a single hard task (design a reward shaping function for stochastic sparse-reward RL) into two easier tasks (design a deterministic abstraction; have the LLM solve it). Each easier task suits the party that handles it: humans can specify the deterministic abstraction, and LLMs can produce a plan over that abstraction.
The broader implication: LLMs do not need to be good control policies to contribute to RL. They can be good plan generators over simplified versions of the problem, and the rest of the RL machinery does the work of dealing with the actual stochastic dynamics. This is a different design pattern from later approaches like "Can chain-of-thought reasoning be learned during pretraining itself?" (where LLM thinking IS the policy) or "Can agents learn continuously from experience without updating weights?" (where LLM reasoning operates over a case bank). MEDIC sits earlier in the pipeline: the LLM contributes to RL's reward shaping rather than to its policy or value estimation.
Paper: Efficient Reinforcement Learning via Large Language Model-based Search
Related concepts in this collection
- Can chain-of-thought reasoning be learned during pretraining itself?
  Explores whether reasoning emerges more effectively when models treat thinking as an exploratory action during next-token prediction, rather than only after pretraining through reinforcement learning.
  RLP integrates the LLM directly into the RL signal; MEDIC keeps them separate (LLM produces shaping, RL trains policy)
- Can language modeling close the knowing-doing gap in AI?
  Current LLMs reason well but act poorly in interactive tasks, while RL agents act well but cannot explain themselves. Can reformulating decision-making as language modeling with environmental feedback bridge this fundamental split?
  TiG is the opposite design pattern: LLM IS the policy, refined by RL; MEDIC: LLM informs the reward, RL trains a separate policy
- Can reward models learn by comparing policies instead of judging them?
  What if reward models worked as policy discriminators, measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?
  POLAR's "similarity to target policy" framing is a generalization: the MEDIC guide policy could serve as the target for POLAR-style discrimination
Original note title
LLMs can construct reward shaping functions by solving a simpler deterministic abstraction of the original RL problem