How should agents split planning from visual grounding?
Agents face a tension between reasoning abstractly about goals and translating those goals into concrete screen coordinates or API calls. Can separating these two concerns at the architectural level improve task performance?
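
To make the question concrete, here is a minimal sketch of what such a split might look like. Everything in it is hypothetical and chosen for illustration: the names `Step`, `Action`, `UIElement`, `plan`, `ground`, and `run`, the canned plan, and the exact-label matching all stand in for components the text does not specify (in practice the planner would likely be a reasoning model and the grounder a vision model). The point is only the interface: the planner never sees coordinates, and the grounder never sees the goal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    """A screen element as a grounder might perceive it (hypothetical)."""
    label: str
    x: int
    y: int


@dataclass(frozen=True)
class Step:
    """Abstract plan step: names its target, carries no coordinates."""
    verb: str
    target: str


@dataclass(frozen=True)
class Action:
    """Concrete action: the same verb, bound to a screen position."""
    verb: str
    x: int
    y: int


def plan(goal: str) -> list[Step]:
    """Toy planner (assumption): a canned lookup standing in for a reasoning model."""
    canned = {
        "log in": [
            Step("click", "Username"),
            Step("click", "Password"),
            Step("click", "Sign in"),
        ],
    }
    return canned.get(goal, [])


def ground(step: Step, screen: list[UIElement]) -> Action:
    """Toy grounder (assumption): case-insensitive label match standing in for a vision model."""
    for element in screen:
        if element.label.lower() == step.target.lower():
            return Action(step.verb, element.x, element.y)
    raise LookupError(f"no element matching {step.target!r}")


def run(goal: str, screen: list[UIElement]) -> list[Action]:
    """Plan first, then ground each step against the current screen."""
    return [ground(step, screen) for step in plan(goal)]


screen = [
    UIElement("Username", 120, 80),
    UIElement("Password", 120, 130),
    UIElement("Sign in", 120, 190),
]
actions = run("log in", screen)
```

Because the two stages meet only at `Step`, either one could be replaced or evaluated on its own, which is one way the question above could be tested empirically.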