Can domain-expert workflows always decompose into inspectable stages for AI?
This explores whether every expert's work can be broken into discrete, checkable steps an AI can supervise — and where the word 'always' breaks down.
The answer turns on the word 'always.' Decomposing expert work into inspectable stages is one of the most productive moves in the corpus, but it is a property of some domains and some tasks, not a universal guarantee. The most optimistic case is Can algorithms control LLM reasoning better than LLMs alone?: wrap an LLM inside an explicit algorithm, hand each call only its step-relevant context, and complex reasoning becomes a set of modular, debuggable sub-tasks. In the same spirit, Can agents learn reusable sub-task routines from past experience? shows agents can extract reusable sub-task routines and compound them hierarchically, with gains that grow as the gap between training and test tasks widens. That is evidence that stage-level structure transfers where whole-task memorization fails.
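To make the "LLM inside an explicit algorithm" idea concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the implementation from the cited note: `run_program`, `fake_llm`, and the prompt wording are all hypothetical, and the model is replaced by a deterministic stub so the example runs offline. What it shows is that ordinary code owns the control flow, each call receives only its own slice of context, and every intermediate result lands in a trace that can be inspected.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a real model call: takes a prompt, returns text.
LLM = Callable[[str], str]


def run_program(question: str, documents: List[str], llm: LLM) -> Dict[str, object]:
    """Explicit control flow around the model; each call sees only its own slice."""
    trace: List[Dict[str, str]] = []  # every intermediate result is kept for inspection

    # Stage 1: one call per document, with only that document in context.
    notes: List[str] = []
    for i, doc in enumerate(documents):
        prompt = f"Question: {question}\nDocument: {doc}\nExtract the relevant facts."
        out = llm(prompt)
        trace.append({"stage": f"extract[{i}]", "prompt": prompt, "output": out})
        notes.append(out)

    # Stage 2: the combining call sees the notes, never the raw documents.
    prompt = f"Question: {question}\nNotes:\n" + "\n".join(notes) + "\nAnswer."
    answer = llm(prompt)
    trace.append({"stage": "answer", "prompt": prompt, "output": answer})
    return {"answer": answer, "trace": trace}


def fake_llm(prompt: str) -> str:
    # Deterministic stub (assumption) so the sketch runs without a model.
    return f"<{len(prompt)} chars seen>"


result = run_program("Q?", ["alpha doc", "beta doc"], fake_llm)
for entry in result["trace"]:
    print(entry["stage"], "->", entry["output"])
```

Because the stages are plain function calls, a failure can be localized to one trace entry instead of to the whole run, which is the sense in which the sub-tasks are debuggable.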
But the corpus is blunt about why decomposition isn't always available: the bottleneck is the domain, not the model. What makes a research domain suitable for autonomous optimization? argues a domain only yields to staged autonomous work if it has immediate scalar metrics, modular architecture, fast iteration, and version control — and a domain missing any one of these resists decomposition 'regardless of LLM capability.' That's the direct answer to 'always': no. An expert workflow whose quality can't be scored mid-stream, or whose steps don't separate cleanly, doesn't become inspectable just because you point a smarter model at it.
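The four-property test above can be written down as a simple checklist. This is only a sketch: the class name `DomainProfile`, the helper names, and the two example profiles are hypothetical and do not describe any real domain. The only logic taken from the text is that a domain missing any one property fails, whatever the model.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class DomainProfile:
    """The four environmental properties named in the text."""
    immediate_scalar_metric: bool
    modular_architecture: bool
    fast_iteration: bool
    version_control: bool


def missing_properties(profile: DomainProfile) -> List[str]:
    # Names of the properties this domain lacks.
    return [f.name for f in fields(profile) if not getattr(profile, f.name)]


def yields_to_staged_work(profile: DomainProfile) -> bool:
    # All four are required; model capability is deliberately not an input.
    return not missing_properties(profile)


# Hypothetical profiles, for illustration only.
well_structured = DomainProfile(True, True, True, True)
no_midstream_score = DomainProfile(False, True, True, True)

print(yields_to_staged_work(well_structured))
print(missing_properties(no_midstream_score))
```

Note that model quality never appears as a parameter; that omission is the point of the 'regardless of LLM capability' claim.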
The subtler trap is that decomposing into stages and making those stages actually inspectable are two different things. Do frontier LLMs silently corrupt documents in long workflows? found that across long delegated relays, frontier models corrupt about a quarter of document content — and the errors compound silently, never plateauing, precisely because nobody is inspecting the intermediate hand-offs. So the stages existed; the inspection didn't. Where do reasoning agents actually fail during long traces? is the corrective: checking intermediate states rather than final answers lifted task success from 32% to 87%, because most failures were process violations invisible to outcome scoring. Inspectable stages only pay off if you actually verify the stages.
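The difference between having stages and inspecting them can be sketched as a relay in which every hand-off must pass an invariant check before the next stage runs. Everything here is an assumption made for illustration: the function `run_with_checks`, the toy stages, and the line-count invariant are hypothetical and are not the verification method used in the cited studies.

```python
from typing import Callable, List, Tuple

# A stage is (name, transform, check); check compares the input and output of the stage.
Stage = Tuple[str, Callable[[str], str], Callable[[str, str], bool]]


def run_with_checks(doc: str, stages: List[Stage]) -> Tuple[str, List[Tuple[str, bool]]]:
    """Run stages in order, verifying each hand-off before continuing."""
    log: List[Tuple[str, bool]] = []
    current = doc
    for name, step, check in stages:
        candidate = step(current)
        ok = check(current, candidate)
        log.append((name, ok))
        if not ok:
            # Stop at the first violation and keep the last verified state.
            return current, log
        current = candidate
    return current, log


def same_line_count(before: str, after: str) -> bool:
    # Toy invariant (assumption): an edit must not drop or add lines.
    return len(before.splitlines()) == len(after.splitlines())


stages: List[Stage] = [
    ("normalize", lambda s: s.replace("  ", " "), same_line_count),
    # Silently drops the last line, the kind of error outcome scoring can miss.
    ("lossy_edit", lambda s: "\n".join(s.splitlines()[:-1]), same_line_count),
    ("uppercase", lambda s: s.upper(), same_line_count),
]

final_doc, log = run_with_checks("a  b\nc  d\ne  f", stages)
print(log)
print(final_doc)
```

Without the per-stage check, the dropped line would be carried into every later stage and the final document would still look plausible, which is how errors compound silently across hand-offs.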
Where clean stage labels aren't handed to you, the corpus offers a backdoor: infer them from structure. Can trajectory structure replace hand-annotated process rewards? shows you can mine dense step-level signal from a trajectory's own shape — tree topology, expert-aligned actions, tool-call positions — without hand-annotating each stage. And Can knowledge graphs teach models deep domain expertise? builds expertise bottom-up from compositional primitives along knowledge-graph paths, which is another way of saying some expert reasoning can be reconstructed as composable units even when the human expert never articulated them as steps.
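As a toy illustration of mining step-level signal from tree topology alone, the sketch below scores each branch by how its average outcome reward compares with that of its siblings. This is not the Tree-GRPO, Supervised RL, or ToolPO algorithm; the `Node` class, the averaging rule, and the example tree are hypothetical simplifications chosen only to show that rewards attached to final outcomes can be turned into a signal for every intermediate step without hand labels.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    reward: Optional[float] = None  # outcome reward, set only on leaves
    children: List["Node"] = field(default_factory=list)


def value(node: Node) -> float:
    # A leaf is worth its outcome reward; an inner node the mean of its children.
    if not node.children:
        return float(node.reward)
    return sum(value(c) for c in node.children) / len(node.children)


def step_signals(root: Node) -> Dict[str, float]:
    """Signal for each step = its value minus the mean value of its sibling group."""
    signals: Dict[str, float] = {}

    def visit(node: Node) -> None:
        if not node.children:
            return
        vals = [value(c) for c in node.children]
        baseline = sum(vals) / len(vals)
        for child, v in zip(node.children, vals):
            signals[child.name] = v - baseline
            visit(child)

    visit(root)
    return signals


# Hypothetical trajectory tree: only leaves carry rewards.
tree = Node("root", children=[
    Node("A", children=[Node("A1", reward=1.0), Node("A2", reward=0.0)]),
    Node("B", children=[Node("B1", reward=0.0), Node("B2", reward=0.0)]),
])

signals = step_signals(tree)
print(signals)
```

Step A earns a positive signal because one of its continuations succeeded while none of B's did, even though nobody labeled A as a good step.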
So the honest synthesis is: not always, and the reasons are worth knowing. First, decomposability is a structural property of the domain (What makes a research domain suitable for autonomous optimization?). Second, decomposition without active mid-process verification gives you the illusion of inspectability while errors compound underneath (Do frontier LLMs silently corrupt documents in long workflows?, Where do reasoning agents actually fail during long traces?). Third, even where you succeed, the stages, tools, and memory have to be engineered as a pipeline, not assumed: Can you turn an LLM into an agent by just fine-tuning? makes the point that the surrounding harness, not the model, decides whether each action is grounded or hallucinated.
Sources (8 notes)
LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.
Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.
Autonomous research pipelines require immediate scalar metrics, modular architecture, fast iteration cycles, and version control. Domains lacking any property resist autoresearch regardless of LLM capability, because the bottleneck is environmental structure, not model power.
Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.
Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.
Fine-tuning a 32B model on 24,000 reasoning tasks derived from medical knowledge graph paths produces state-of-the-art performance across 15 medical domains, demonstrating that structured knowledge composition matters more than scale.
Converting LLMs to action-capable systems requires four distinct stages: curating action-environment-user datasets, training for action grounding, integrating agent infrastructure with memory and tools, and rigorous safety evaluation. The surrounding system and harness determine whether actions are grounded or hallucinated.