INQUIRING LINE

Can task decomposition into microagents with voting scale to million-step problems?

This explores whether breaking a hard task into tiny single-step subtasks — each handled by a small agent, with voting to catch mistakes — can actually reach the million-step scale, and what makes that work (or fail).


The corpus says yes, and the result is counterintuitive enough to be worth sitting with. The MAKER approach decomposes a problem into the smallest possible steps, votes across redundant attempts at each step, and flags correlated errors, achieving million-step execution with essentially zero errors (Can extreme task decomposition enable reliable execution at million-step scale?). The surprise isn't the scale; it's that you don't need frontier reasoning models to get there. When decomposition is extreme enough, small non-reasoning models suffice, which inverts the usual instinct of throwing the biggest model at the hardest problem.
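To make per-step voting concrete, here is a minimal sketch in Python. It is not MAKER's implementation: the toy task (add 1 to a number), the simulated agent `make_noisy_agent`, and the plain plurality rule in `vote_on_step` are all assumptions made for illustration, and the sketch leaves out the correlated-error flagging entirely.

```python
import random
from collections import Counter
from typing import Callable, List


def vote_on_step(attempt: Callable[[], int], samples: int) -> int:
    """Run `samples` redundant attempts at one step and return the plurality answer."""
    answers = [attempt() for _ in range(samples)]
    return Counter(answers).most_common(1)[0][0]


def make_noisy_agent(x: int, error_rate: float, rng: random.Random) -> Callable[[], int]:
    """Hypothetical microagent: computes x + 1, but errs with probability `error_rate`."""
    def attempt() -> int:
        correct = x + 1
        if rng.random() < error_rate:
            # Return a wrong answer; the offset is never 0, so it never equals `correct`.
            return correct + rng.randint(1, 9)
        return correct
    return attempt


def run_task(inputs: List[int], error_rate: float, samples: int, seed: int = 0) -> List[int]:
    """Solve every single-step subtask independently, voting within each step."""
    rng = random.Random(seed)
    return [vote_on_step(make_noisy_agent(x, error_rate, rng), samples) for x in inputs]


if __name__ == "__main__":
    inputs = list(range(1000))
    voted = run_task(inputs, error_rate=0.1, samples=5)
    wrong = sum(v != x + 1 for v, x in zip(voted, inputs))
    print(f"steps with a wrong voted answer: {wrong} of {len(inputs)}")
```

Each step is solved in isolation here, so a mistake in one attempt can only survive if it wins the vote for that step. That isolation is what the sketch assumes and what a real task would have to guarantee.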

That inversion connects to a broader economic argument: most agentic work consists of repetitive, well-defined language tasks that small language models handle at 10 to 30 times lower cost, making 'small by default, large only when needed' the rational design (Can small language models handle most agent tasks?). Extreme decomposition is what *creates* the conditions for small models to win: it shrinks each unit of work until competence stops being the bottleneck and reliability becomes the only thing that matters. Voting is how you buy that reliability. The same logic shows up in test-time RL, where majority voting across repeated samples produces a usable reward signal because consensus answers tend to be correct (Can models improve themselves using only majority voting?). Per-step voting in MAKER is the same bet, applied to error correction instead of training.
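The arithmetic behind "reliability becomes the only thing that matters" can be sketched as follows. This is a back-of-the-envelope model, not a result from the sources: it assumes every attempt fails independently with the same probability, and it treats a step as failed whenever more than half of the attempts are wrong. The function names and the numbers in the usage lines are illustrative.

```python
from math import comb, exp, log1p


def majority_error(p_err: float, k: int) -> float:
    """Probability that more than half of k independent attempts are wrong (k odd)."""
    return sum(
        comb(k, i) * p_err**i * (1 - p_err) ** (k - i)
        for i in range(k // 2 + 1, k + 1)
    )


def task_success(step_err: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, for 0 <= step_err < 1.

    Computed in log space so that very small per-step error rates are not
    rounded away before being compounded over many steps.
    """
    return exp(n_steps * log1p(-step_err))


if __name__ == "__main__":
    n = 1_000_000
    p = 0.01  # assumed per-attempt error rate, chosen only for illustration
    print("no voting:       ", task_success(p, n))
    for k in (3, 5, 11):
        print(f"{k:>2} votes per step:", task_success(majority_error(p, k), n))
```

Under these assumptions, even a small per-attempt error rate compounds to near-certain failure over a million unvoted steps, while adding votes drives the per-step error down fast enough to make the full run plausible. If errors are correlated across attempts, the independence assumption breaks and the numbers no longer apply.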

But the corpus also hands you the catch, and it's a sharp one. Voting only works when the steps are genuinely independent. On structured problems where each step must build on the accumulated result of the last (graph connectivity is the example), sequential chain-of-thought beats parallel voting by an *exponential* margin, because short parallel chains can't reconstruct intermediate state (When does sequential reasoning beat parallel voting?). So the million-step claim quietly depends on decomposing into steps that *can* be voted on in isolation. Where the task resists that, where the work is irreducibly sequential, the microagent-plus-voting recipe loses its edge.

There's a second failure mode lurking at scale: coordination. When many agents must share information, systems degrade predictably as the network grows, because agents accept neighbors' claims without verification and let errors propagate (Why do multi-agent systems fail to coordinate at scale?). MAKER's defense is to barely coordinate at all: minimal subtasks mean minimal communication surface, and explicit error-flagging substitutes for the verification that distributed agents skip. The scalability comes partly from refusing the coordination problem rather than solving it.

The most interesting tension is whether you even need separate agents. The Thread Inference Model argues that reasoning structured as recursive subtask trees with KV-cache pruning lets a *single* model handle the full recursion internally, replacing multi-agent systems outright (Can recursive subtask trees overcome context window limits?). Read together, these suggest the real lever isn't 'microagents' or 'voting' as such; it's aggressive decomposition into verifiable units, whether you spend that across many small agents (MAKER) or one model's pruned working memory (TIM). And a sobering footnote: research finds roughly 80% of multi-agent performance variance is just token budget, not coordination intelligence (How does test-time scaling work at the agent level?), so before crediting the architecture, ask how much of the win is simply paying for more steps.


Sources (7 notes)

Can extreme task decomposition enable reliable execution at million-step scale?

MAKER solves million-step tasks with zero errors by decomposing into minimal subtasks, applying voting at each step, and flagging correlated errors. Surprisingly, small non-reasoning models suffice when decomposition is extreme enough, inverting the standard approach to hard problems.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Can models improve themselves using only majority voting?

Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.

When does sequential reasoning beat parallel voting?

On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.

Why do multi-agent systems fail to coordinate at scale?

The AgentsNet benchmark shows that agents fail to coordinate strategies, either by agreeing too late or by adopting strategies without informing neighbors. Agents accept neighbor information without verification, which lets errors propagate, though they remain capable of detecting direct conflicts.

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.
