Can extreme task decomposition enable reliable execution at million-step scale?
Can breaking tasks into maximally atomic subtasks, with voting-based error correction, solve the fundamental reliability problem in long-horizon tasks? The question pits two paths to high-reliability AI systems against each other: better models versus better decomposition.
A system with a 1% per-step error rate is expected to fail after about 100 steps, making a million-step task hopeless. This renders traditional approaches to long-horizon tasks fundamentally infeasible: even improving per-step accuracy from 99% to 99.99% is insufficient for tasks requiring thousands of dependent steps. MAKER, built on Massively Decomposed Agentic Processes (MDAPs), takes a different approach: instead of improving per-step accuracy, decompose until each step is trivially reliable, then apply error correction.
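A quick numeric check of that claim (standard independence arithmetic; the sketch below is illustrative, not code from the paper):

```python
# Chance of finishing N dependent steps with per-step error rate eps,
# assuming independent errors: (1 - eps) ** N. Expected steps to first
# failure is 1 / eps, hence "fails after ~100 steps" at eps = 1%.
for eps in (1e-2, 1e-4):
    for n in (100, 10_000, 1_000_000):
        print(f"eps={eps:.0e}  N={n:>9,}  P(success)={(1 - eps) ** n:.3e}")
```

Even at 99.99% per-step accuracy, the chance of a clean million-step run is on the order of e^-100, effectively zero.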
Three core components (a code sketch of the voting loop follows the list):
- Decomposition into minimal subtasks: Each agent handles a single, tiny "micro-role" rather than an anthropomorphized, human-scale role. Avoiding complex role assignments and instead exploiting the machine-like nature of LLMs makes each subtask solvable with high reliability.
- Error correction via subtask-level voting: Multiple agents independently solve the same subtask; voting identifies the correct answer. This is error correction at the finest possible granularity.
- Red-flagging to reduce correlated errors: Detects situations where voting might fail because errors are correlated across agents, and applies additional verification.
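As referenced above, here is a minimal sketch of subtask-level voting with red-flag filtering. It assumes a simple first-to-k acceptance rule; `solve` and `is_red_flagged` are hypothetical stand-ins, and the paper's actual voting scheme and red-flag criteria may differ.

```python
import collections
from typing import Callable

def vote_on_subtask(
    solve: Callable[[], str],               # hypothetical: one independent LLM attempt at the micro-step
    is_red_flagged: Callable[[str], bool],  # hypothetical: detects suspect outputs (e.g. malformed, overlong)
    k: int = 3,                             # votes an answer needs before it is accepted
    max_samples: int = 25,                  # give up rather than loop forever
) -> str:
    """Sample independent solutions until one answer accumulates k votes.

    Red-flagged outputs are discarded before they can vote, so generations
    that look systematically broken cannot pile correlated errors onto the
    same wrong answer.
    """
    tallies: collections.Counter[str] = collections.Counter()
    for _ in range(max_samples):
        answer = solve()
        if is_red_flagged(answer):
            continue  # drop the suspect sample instead of counting it
        tallies[answer] += 1
        if tallies[answer] >= k:
            return answer  # first answer to reach k votes wins
    raise RuntimeError("no answer reached the vote threshold")
```

Each accepted answer becomes the only state handed to the next micro-step, so a sample that loses the vote never contaminates later context.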
The scaling laws are formalized: success probability and expected cost vary predictably with the total number of steps and the level of decomposition. Under extreme decomposition, scaling to very long tasks is feasible; without it, it is not.
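This note does not reproduce the paper's exact formulas; a back-of-the-envelope version, assuming independent votes and plain majority voting over m samples per step, shows why decomposition plus redundancy scales:

```latex
% Per-vote error eps < 1/2; majority of m independent votes (Hoeffding bound):
P(\text{step fails}) \le \exp\!\left(-2m\left(\tfrac{1}{2}-\varepsilon\right)^{2}\right)
% For all N steps to succeed with probability at least 1 - delta, it suffices
% that each step fails with probability at most delta / N, i.e.
m \ge \frac{\ln(N/\delta)}{2\left(\tfrac{1}{2}-\varepsilon\right)^{2}}
% so votes per step grow only logarithmically in N, and total cost is O(N log N).
```

This is the sense in which decomposition converts an exponentially hard reliability problem into a near-linear cost problem.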
The most counterintuitive finding: state-of-the-art reasoning models are not required. Relatively small non-reasoning models suffice when the decomposition is extreme enough. This inverts the standard approach to hard problems: instead of smarter models, use dumber models on smaller problems.
This extends "Does separating planning from execution improve reasoning accuracy?" to an extreme: not just separating two functions, but decomposing the entire task into maximally atomic units. It also extends "Why does majority voting outperform more complex inference methods?" from answer-level voting to subtask-level voting with formalized scaling properties.
The implication for AI deployment: for tasks requiring very high reliability over many steps (organizational processes, scientific experiments, production pipelines), the path may run through decomposition and redundancy rather than through better models.
Source: Novel Architectures
Related concepts in this collection
- Does separating planning from execution improve reasoning accuracy?
  Explores whether modularizing decomposition and solution into separate models prevents interference and boosts performance compared to monolithic approaches.
  Relation: MAKER takes this principle to its extreme, with maximally atomic decomposition.
- Why does majority voting outperform more complex inference methods?
  Simple majority voting across independent samples often matches or beats sophisticated alternatives like Best-of-N and sequential revision. What makes this basic approach so hard to beat for reasoning models?
  Relation: MAKER applies voting at the subtask level, with formalized scaling laws.
- Do prior errors in context history amplify future errors?
  When a language model makes mistakes early in a task, do those errors contaminate subsequent predictions? We explore whether error accumulation degrades long-horizon performance through passive context pollution rather than capability limits.
  Relation: MAKER addresses this by isolating each step, so no error context propagates.
- Are reasoning model failures really about reasoning ability?
  Explores whether the performance collapse in language reasoning models reflects actual reasoning limitations or merely execution constraints. Tests whether tool access changes the picture.
  Relation: consistent with MAKER; execution can be fixed by decomposition without improving reasoning ability.
- Can recursive subtask trees overcome context window limits?
  Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.
  Relation: MAKER decomposes externally via multiple agents; TIM decomposes internally via recursive subtask trees within a single model, eliminating the coordination overhead while preserving the decomposition principle.
- When does adding more agents actually help systems?
  Multi-agent systems often fail in practice, but the reasons remain unclear. This research investigates whether coordination overhead, task properties, or system architecture determine when agents improve or degrade performance.
  Relation: quantifies when MAKER's extreme decomposition helps versus hurts; token-budget fragmentation under multi-agent coordination trades off against tool complexity, and centralized coordination contains error amplification to 4.4x, versus 17.2x for independent agents.
- Can multi-agent teams automatically remove their weakest members?
  Explores whether agents can score each other's contributions during problem-solving and use those scores to deactivate underperforming teammates in real time, improving overall team efficiency.
  Relation: a contrasting approach. MAKER uses static decomposition with redundancy-based error correction; DyLAN uses dynamic pruning with contribution-based scoring. MAKER optimizes at design time (decomposition level), DyLAN at runtime (agent selection).
Original note title: extreme task decomposition into microagents with voting enables error-free execution at million-step scale