Can LLM-synthesized behavioral heuristics compete with learned policy improvements?
This explores whether the rules and skills an LLM writes for itself — stored in memory, expressed as reward-shaping functions, or injected as text — can match the performance gains you'd otherwise get from updating the model's weights through reinforcement learning.
The most direct "yes" in the corpus is AgentFly, which treats agent learning as memory operations rather than weight updates and reaches 87.88% on GAIA validation without modifying any LLM parameters [Can agents learn continuously from experience without updating weights?]. The result is striking because it skips the thing RL is supposed to provide, internalized policy improvement, and instead gets credit assignment and policy improvement from structured recall across case, subtask, and tool memories.
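To make "learning as memory operations" concrete, here is a minimal sketch of a case memory that is written after each episode and read before the next one. It is not AgentFly's implementation: the names Case, CaseMemory, and similarity are hypothetical, and word-overlap (Jaccard) similarity weighted by reward is an assumed stand-in for whatever retrieval and credit-assignment mechanism the real system uses.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Case:
    task: str      # task description seen in an earlier episode
    plan: str      # what the agent did
    reward: float  # outcome signal attached to that plan


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets (assumed stand-in for a real retriever)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class CaseMemory:
    cases: list[Case] = field(default_factory=list)

    def write(self, task: str, plan: str, reward: float) -> None:
        """The 'learning' step: store the episode instead of updating weights."""
        self.cases.append(Case(task, plan, reward))

    def read(self, task: str, k: int = 1) -> list[Case]:
        """Return the k cases ranked highest by similarity times reward."""
        ranked = sorted(
            self.cases,
            key=lambda c: similarity(task, c.task) * c.reward,
            reverse=True,
        )
        return ranked[:k]


memory = CaseMemory()
memory.write("find the population of Lyon", "search web then read infobox", reward=1.0)
memory.write("find the population of Lyon", "guess from memory", reward=0.0)
memory.write("convert a PDF table to CSV", "use a table extraction tool", reward=1.0)

best = memory.read("find the population of Paris", k=1)[0]
print(best.plan)  # search web then read infobox
```

The frozen model never changes here. Behavior improves only because the retrieved plan that goes into the prompt changes as the memory grows.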
But the sharper finding is that the dichotomy may be softer than it looks. When researchers inspected what RL actually changes inside a model, they found that it updates only 5–30% of parameters, and that those updates land in nearly identical subnetworks across random seeds [Does reinforcement learning update only a small fraction of parameters?]. So "learned policy improvement" is itself a surprisingly localized, almost surgical edit, which makes it more plausible that a well-aimed heuristic could approximate it. The learning signal itself can also be synthesized rather than hand-labeled: MEDIC shows a model writing its own reward-shaping functions by first solving a simplified, deterministic version of the problem [Can LLMs design reward functions for reinforcement learning?], and TRELAWNEY embeds future-information "lookahead" tokens directly into training data to teach planning with no architectural change at all [Can embedding future information in training data improve planning?].
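The sparsity claim is easy to state operationally. The sketch below shows one way it could be measured: compare a checkpoint before and after fine-tuning, count the scalar parameters that moved, and compare the changed sets from two seeds. The flat name-to-list checkpoint format, the function names, and the toy numbers are all assumptions for illustration; they are not taken from the cited study.

```python
from typing import Dict, List, Set, Tuple

Params = Dict[str, List[float]]  # hypothetical flat checkpoint: tensor name -> values
Index = Tuple[str, int]          # (tensor name, position within that tensor)


def changed_indices(before: Params, after: Params, tol: float = 0.0) -> Set[Index]:
    """Collect the parameters whose value moved by more than tol."""
    return {
        (name, i)
        for name, old in before.items()
        for i, (a, b) in enumerate(zip(old, after[name]))
        if abs(a - b) > tol
    }


def updated_fraction(before: Params, after: Params, tol: float = 0.0) -> float:
    """Fraction of scalar parameters that the fine-tuning run changed."""
    total = sum(len(values) for values in before.values())
    return len(changed_indices(before, after, tol)) / total


def mask_overlap(a: Set[Index], b: Set[Index]) -> float:
    """Jaccard overlap between the changed-parameter sets of two runs."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy data: two seeds change the same 2 of 10 parameters by different amounts.
base = {"w0": [0.1, -0.2, 0.3, 0.4, 0.5], "w1": [0.6, 0.7, -0.8, 0.9, 1.0]}
seed_a = {"w0": [0.1, -0.2, 0.35, 0.4, 0.5], "w1": [0.6, 0.7, -0.8, 0.95, 1.0]}
seed_b = {"w0": [0.1, -0.2, 0.33, 0.4, 0.5], "w1": [0.6, 0.7, -0.8, 0.91, 1.0]}

print(updated_fraction(base, seed_a))  # 0.2
print(mask_overlap(changed_indices(base, seed_a), changed_indices(base, seed_b)))  # 1.0
```

A high overlap between seeds is what "structured rather than arbitrary" would look like in this measurement. Real checkpoints would need a tolerance chosen with numerical precision in mind.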
The most honest answer the corpus offers is that this is not a competition but a division of labor across timescales. MetaClaw demonstrates that deployed agents need both rapid skill injection from failures (seconds, zero downtime) and slower gradient-based optimization during idle windows (minutes to hours). Crucially, the two reinforce each other: better policies produce more informative failures, and richer skills enable higher-reward trajectories [Can agents adapt without pausing service to users?]. Heuristics win on speed and reversibility; learned updates win on durability. The interesting claim is that you lose more by choosing than by combining.
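The two-timescale split can be sketched as a small control loop. This is an illustrative skeleton under stated assumptions, not MetaClaw's design: the class TwoTimescaleAgent and its methods are hypothetical, and the gradient update is replaced by a placeholder that only bumps a version counter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TwoTimescaleAgent:
    skills: List[str] = field(default_factory=list)  # fast, reversible text heuristics
    replay: List[dict] = field(default_factory=list)  # trajectories awaiting training
    policy_version: int = 0                           # stand-in for model weights

    def build_prompt(self, task: str) -> str:
        """Inject the current skills as plain text; no weights change."""
        listed = "\n".join(f"- {s}" for s in self.skills)
        return f"Skills:\n{listed}\n\nTask: {task}"

    def record(self, task: str, success: bool, lesson: str = "") -> None:
        """Fast path: log every episode, and on failure add a skill immediately."""
        self.replay.append({"task": task, "success": success})
        if not success and lesson and lesson not in self.skills:
            self.skills.append(lesson)

    def remove_skill(self, lesson: str) -> None:
        """Skills are reversible: deleting the text undoes the change."""
        if lesson in self.skills:
            self.skills.remove(lesson)

    def train_if_idle(self, idle: bool, min_batch: int = 2) -> bool:
        """Slow path: consume logged trajectories only during idle windows."""
        if not idle or len(self.replay) < min_batch:
            return False
        self.replay.clear()       # placeholder for a real gradient update
        self.policy_version += 1
        return True


agent = TwoTimescaleAgent()
agent.record("parse invoice", success=False,
             lesson="Check the currency field before summing totals.")
agent.record("parse receipt", success=True)

prompt = agent.build_prompt("parse invoice")   # the new skill is already in effect
trained_while_busy = agent.train_if_idle(idle=False)
trained_while_idle = agent.train_if_idle(idle=True)
print(trained_while_busy, trained_while_idle, agent.policy_version)  # False True 1
```

The design point is that the fast path never blocks on the slow one: a failure changes the next prompt right away, while the same logged episode waits for an idle window before it can influence the weights.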
There is also a structural cousin to heuristics worth pulling in: wrapping LLM calls in explicit algorithmic control flow, where the algorithm, not the model, decides what context each step sees [Can algorithms control LLM reasoning better than LLMs alone?]. This is heuristic engineering at the orchestration layer rather than the weight layer, and it works around capability and context-window limits much as memory does. Related self-improvement loops that avoid human labels, namely MCTS-derived process rewards [Can tree search replace human feedback in LLM training?] and majority-vote test-time RL [Can models improve themselves using only majority voting?], blur the line further, since they are learned updates bootstrapped from signals the model synthesizes about itself.
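The majority-vote signal mentioned above is simple enough to sketch directly. The version below assumes answers have already been extracted as strings and assigns a binary reward for agreeing with the consensus; the function name and the 0/1 reward scheme are illustrative assumptions rather than details confirmed by the source.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote_rewards(answers: List[str]) -> Tuple[str, List[float]]:
    """Return the consensus answer and a 0/1 reward for each sampled answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # The most common answer acts as a pseudo-label; no ground truth is used.
    consensus, _ = Counter(answers).most_common(1)[0]
    return consensus, [1.0 if a == consensus else 0.0 for a in answers]


samples = ["42", "41", "42", "42", "17"]
label, rewards = majority_vote_rewards(samples)
print(label)    # 42
print(rewards)  # [1.0, 0.0, 1.0, 1.0, 0.0]
```

These rewards would then feed an ordinary RL update, which is why the loop only helps when the consensus answer tends to be correct.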
The caveat: both camps hit the same wall on genuinely hard problems. On constrained optimization, LLMs plateau at roughly 55–60% constraint satisfaction regardless of architecture, parameter count, or training regime [Do larger language models solve constrained optimization better?]. When the ceiling is in the model's reasoning rather than its policy, neither a clever heuristic nor a gradient step moves it. That is the quiet lesson here: heuristics can compete with learned policy improvements wherever the bottleneck is knowing what to do, and neither competes where the bottleneck is being able to do it.
Sources (9 notes)
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
MEDIC shows that LLMs can generate effective reward shaping functions by first solving a deterministic, simplified version of the RL problem, then converting the resulting plan into shaping rewards for the original stochastic task. A model-based critic validates LLM outputs before deployment.
TRELAWNEY augments training data with special tokens encapsulating future information, allowing models to learn goal-conditioned generation using standard infrastructure. Results show improved planning, algorithmic reasoning, and story generation without modifying architecture or training procedures.
MetaClaw demonstrates that deployed agents require both rapid skill injection from failures (seconds, zero downtime) and slower gradient-based optimization during idle windows (minutes to hours). The two mechanisms reinforce each other, with better policies producing more informative failures and richer skills enabling higher-reward trajectories.
LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.
AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.
Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.