Agentic and Multi-Agent Systems

Why do planning and grounding pull against each other in agents?

Planning requires flexibility and error recovery, while grounding demands action accuracy. Do these conflicting optimization requirements force a design choice about how to structure agent architectures?

Note · 2026-05-03 · sourced from Visual GUI Agents

AutoGLM's first key insight from building deployable foundation agents for web browsers and Android is that planning and grounding are not just different sub-tasks: they have opposing optimization requirements, and bundling them in one end-to-end policy means each pulls against the other.

Planning demands flexibility and error recovery. The agent must construct creative paths to goals, abandon failed approaches, and recover when the environment behaves unexpectedly. Optimizing planning means tolerating exploration, allowing the model to consider multiple branches, and rewarding adaptability.

Grounding demands action accuracy. Once the plan is set, the click must hit the right pixel, the form must receive exactly the right text, and the API call must use exactly the right argument. Optimizing grounding means narrowing variability, locking in deterministic behavior, and punishing near-misses.

These two regimes pull in opposite directions during training. A model trained for planning flexibility loses grounding precision; a model trained for grounding accuracy becomes brittle. The intermediate interface is the architectural artifact that separates them, letting each be developed and optimized on its own terms while still composing into a complete agent.
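The factoring can be made concrete with a minimal sketch. Everything here is illustrative, not AutoGLM's actual API: the planner emits abstract actions over semantic targets, the grounder alone resolves them to screen coordinates, and the dataclass between them is the intermediate interface that lets each module be trained or swapped independently.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class AbstractAction:
    """The intermediate interface: semantic intent, no pixels."""
    verb: str        # e.g. "click", "type"
    target: str      # semantic element description, e.g. "search button"
    text: str = ""   # payload for "type" actions

@dataclass
class ConcreteAction:
    """What actually gets executed on the device."""
    verb: str
    x: int           # resolved screen coordinates
    y: int
    text: str = ""

class Planner(Protocol):
    def next_action(self, goal: str, history: list) -> AbstractAction: ...

class Grounder(Protocol):
    def resolve(self, action: AbstractAction, screen: dict) -> ConcreteAction: ...

def step(planner: Planner, grounder: Grounder,
         goal: str, history: list, screen: dict) -> ConcreteAction:
    """One agent step: plan abstractly, then ground concretely.
    Neither module sees the other's internals, only the interface."""
    abstract = planner.next_action(goal, history)
    concrete = grounder.resolve(abstract, screen)
    history.append(abstract)
    return concrete
```

Because the boundary is a plain data type, planning flexibility (which `AbstractAction` to emit) and grounding accuracy (which `(x, y)` it resolves to) are optimized against separate loss surfaces.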

This finding generalizes a pattern visible across the GUI agent literature (Can structured interfaces help language models control GUIs better? for Agent S's ACI, Why do vision-only GUI agents struggle with screen interpretation? for OmniParser's screen parsing layer): the load-bearing design move is not a better single-pass policy but a clean factoring at the right joint. AutoGLM's second insight — that error recovery is crucial for robustness yet difficult to acquire offline, motivating self-evolving online curriculum RL with weak-to-strong progressive training — depends on the first: the curriculum can target planning behaviors specifically because the interface has separated them from grounding behaviors.

The transferable claim: in any agent stack where two sub-capabilities have conflicting optimization requirements, the architecture must factor before training, not the other way around. This is the same principle behind Does separating planning from execution improve reasoning accuracy? — only here the joint is between planning and grounding rather than planning and execution.




foundation GUI agents need an intermediate interface that disentangles planning from grounding — the two have opposing optimization requirements