Why do production AI agents stay deliberately simple?
Production AI agents are far simpler than the research literature suggests: most execute at most 10 steps per run and avoid third-party frameworks. What explains this gap between research ambition and deployment reality?
"Measuring Agents in Production" (2024) presents the first large-scale systematic study of AI agents deployed in real production environments — 306 practitioners surveyed, 20 in-depth case studies via interviews, across 26 domains.
The findings directly challenge the complexity narrative in agent research:
Simple methods dominate. 70% of deployed agents use off-the-shelf models without weight tuning, relying entirely on prompting. Teams select the most capable, expensive frontier models available because cost and latency remain favorable compared to human baselines. 79% rely heavily on manual prompt construction, and production prompts can exceed 10,000 tokens.
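To make the "prompting-only" pattern concrete, here is a minimal sketch of how such a production setup often looks: all behavior comes from a large, manually authored prompt assembled from maintained sections, with no weight tuning. The section names, task domain, and the chars-per-token heuristic are illustrative assumptions, not details from the paper.

```python
# Hypothetical prompting-only agent setup: behavior is defined entirely by
# manually maintained prompt sections, matching the 79% of teams that rely
# on manual prompt construction. All names below are illustrative.

ROLE = "You are a claims-triage assistant for an insurance workflow."
POLICY = "Never approve a claim; only classify and route it. Escalate ambiguity."
FEW_SHOT = "\n".join([
    "Claim: water damage, basement. -> Route: property, priority=high",
    "Claim: windshield chip. -> Route: auto-glass, priority=low",
])
OUTPUT_SPEC = "Respond with exactly one line: 'Route: <queue>, priority=<level>'."

def build_prompt(claim_text: str) -> str:
    """Concatenate manually maintained sections into one prompt string.
    Prompts built this way can grow past 10,000 tokens in production."""
    sections = [ROLE, POLICY, "Examples:", FEW_SHOT, OUTPUT_SPEC,
                f"Claim: {claim_text}"]
    return "\n\n".join(sections)

prompt = build_prompt("hail damage to roof, filed yesterday")
# Rough size check (~4 chars/token heuristic) to monitor prompt growth.
approx_tokens = len(prompt) // 4
```

The resulting string would be sent to an off-the-shelf frontier model's chat endpoint unchanged; the only "training" is editing these sections.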
Autonomy is deliberately constrained. 68% of production agents execute at most 10 steps before requiring human intervention. 47% execute fewer than 5 steps. This is not a capability limitation — it is a design choice. Organizations constrain autonomy to maintain reliability, the top development challenge.
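The constrained-autonomy pattern can be sketched as an explicit step budget with a human handoff, rather than an open-ended loop. This is a minimal illustration of the design choice, not code from the study; the `step` and `is_done` callables are hypothetical placeholders for a real agent's act/check logic.

```python
# Sketch of a deliberately constrained agent loop: run at most MAX_STEPS
# actions, then hand control back to a human instead of continuing
# autonomously. The step/is_done callables are hypothetical placeholders.

MAX_STEPS = 10  # mirrors the 68% of production agents capped at <= 10 steps

def run_agent(task, step, is_done, max_steps=MAX_STEPS):
    """Execute steps until done or the budget is exhausted.
    Returns (state, status), where status is 'done' or 'needs_human'."""
    state = task
    for _ in range(max_steps):
        state = step(state)
        if is_done(state):
            return state, "done"
    # Budget exhausted: escalate to a person rather than keep acting.
    return state, "needs_human"

# Toy usage: a task that finishes within budget vs. one that does not.
final, status = run_agent(0, step=lambda s: s + 1, is_done=lambda s: s >= 3)
capped, status2 = run_agent(0, step=lambda s: s + 1, is_done=lambda s: s >= 99)
```

The key property is that escalation is the default outcome of running out of budget, which is exactly the reliability posture the survey describes.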
Custom builds over frameworks. 85% of the detailed case studies forgo third-party agent frameworks, building custom agent applications from scratch. This suggests that current frameworks do not match production requirements; as "Why do protocol-based tool systems fail in production agentic workflows?" explores, the preference for custom builds reflects a reliability imperative.
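What a "custom build" often amounts to is small and fully inspectable: an allowlisted tool table and a dispatch function, instead of a framework's generic abstraction. This is a hedged sketch under assumed tool names and routing logic, not an implementation from any case study.

```python
# Sketch of a from-scratch agent scaffold: an explicit, allowlisted tool
# registry plus a dispatch function the team fully controls. Tool names
# and behaviors are illustrative assumptions.

TOOLS = {
    "lookup_order": lambda args: {"order": args["id"], "status": "shipped"},
    "refund": lambda args: {"refunded": args["amount"]},
}

def dispatch(action: dict):
    """Execute one model-proposed action against the allowlist.
    Unknown tools are rejected, not guessed at: fail closed for reliability."""
    name = action.get("tool")
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name](action.get("args", {}))

result = dispatch({"tool": "lookup_order", "args": {"id": "A17"}})
bad = dispatch({"tool": "delete_db"})
```

The design choice is that every behavior is a few lines the team wrote and can audit, which is harder to guarantee when tool selection is mediated by a framework or protocol layer.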
Human evaluation persists. 74% depend primarily on human evaluation. Automated evaluation has not displaced human judgment in production, consistent with "Does setting temperature to zero actually make LLM outputs reliable?": single automated evaluations are insufficient for reliability-critical deployment.
The gap between research and production is stark. Research pushes toward multi-agent systems, complex reasoning chains, and autonomous tool use. Production gravitates toward well-scoped, static workflows with a human in the loop. As the benchmark evidence in "Why do AI agents fail at workplace social interaction?" suggests, the production community has learned this lesson and constrains autonomy accordingly.
The practical implication: "simple yet effective methods already enable agents to deliver impact across diverse industries." Complexity is not required for production value — and may be counterproductive when reliability is the binding constraint.
Source: Agentic Research
Related concepts in this collection

- Why do AI agents fail at workplace social interaction? Explores why current AI agents struggle most with communicating and coordinating with colleagues in realistic workplace settings, despite strong reasoning capabilities in other domains. (benchmark evidence for why production constrains autonomy)
- Why do protocol-based tool systems fail in production agentic workflows? Explores whether standardized tool protocols like MCP introduce non-determinism that undermines reliable agent execution, and what causes ambiguous tool selection in production systems. (the reliability imperative behind custom builds)
- Can small language models handle most agent tasks? Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default. (production data confirms: most agent work is repetitive and scoped)
- Why do capable AI agents still fail in real deployments? Explores whether agent failures stem from insufficient capability or from missing ecosystem conditions like user trust, value clarity, and social norms. Understanding this distinction matters for predicting which agents will succeed. (production agents succeed by satisfying ecosystem conditions, not by maximizing capability)