The Missing Layer of AGI: From Pattern Alchemy to Coordination Physics
Influential critiques argue that Large Language Models (LLMs) are a dead end for AGI: “mere pattern matchers” structurally incapable of reasoning or planning. We argue this conclusion misidentifies the bottleneck: it confuses the ocean with the net. Pattern repositories are the necessary System-1 substrate; the missing component is a System-2 coordination layer that selects, constrains, and binds these patterns. We formalize this layer via UCCT, a theory of semantic anchoring that models reasoning as a phase transition governed by effective support (ρd), representational mismatch (dr), and an adaptive anchoring budget (γ log k). Under this lens, ungrounded generation is simply an unbaited retrieval of the substrate’s maximum likelihood prior, while “reasoning” emerges when anchors shift the posterior toward goal-directed constraints. We translate UCCT into architecture with MACI, a coordination stack that implements baiting (behavior-modulated debate), filtering (Socratic judging), and persistence (transactional memory). By reframing common objections as testable coordination failures, we argue that the path to AGI runs through LLMs, not around them.
The artificial intelligence community is fractured by a debate over the nature of Large Language Models (LLMs). On one side, scaling proponents argue that LLMs are sufficient for Artificial General Intelligence (AGI). On the other, influential critiques argue that LLMs are “mere pattern matchers” structurally incapable of reasoning, planning, or compositional generalization, and therefore represent a dead end (LeCun, 2022).
We argue that this debate relies on a false dichotomy. To clarify why, consider a fishing metaphor. The ocean represents the model’s vast repository of latent patterns. A fisherman casting a net without bait harvests the maximum likelihood prior of the waters beneath him—mostly common fish (generic training data). Critics who decry these ungrounded outputs are not observing a broken system; they are observing the raw statistical baseline of an unbaited cast. However, intelligent behavior is not just casting; it is baiting and filtering. This process is governed by bait density. If the bait is too sparse, it fails to attract the specific, rare fish, and the ocean’s prior continues to dominate the catch. If the bait is sufficiently dense, it conveys strong intent, shifting the posterior distribution so that the target concept swamps the common priors. Yet, bait is not free; using excessive bait to secure a catch is inefficient. In this view, the “Missing Layer” is the Coordination Layer that optimizes this trade-off: calculating the precise density required to shift the posterior without incurring prohibitive costs.
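To make the bait-density intuition concrete, consider a toy Bayesian reading (our construction, not a result from the paper): treat each unit of bait as an independent observation favoring the rare target concept over the generic prior by a fixed likelihood ratio. The hypothetical prior and per-anchor likelihood ratio below are assumed values chosen only to exhibit the flip.

```python
# Toy model of bait density (illustrative; the prior and the likelihood
# ratio per anchor are assumed values, not measured from any model).
prior_target = 0.01         # the rare "fish": a goal-specific concept
likelihood_ratio = 3.0      # evidential weight of one unit of bait (assumed)

def posterior_target(num_anchors: int) -> float:
    """Posterior probability of the target after num_anchors units of bait."""
    prior_odds = prior_target / (1 - prior_target)
    odds = prior_odds * likelihood_ratio ** num_anchors
    return odds / (1 + odds)

for k in range(9):
    print(f"bait units = {k}: P(target) = {posterior_target(k):.3f}")
# k = 0 returns the raw prior (the unbaited cast, ~0.01); the posterior
# crosses 0.5 between k = 4 and k = 5 and reaches ~0.99 by k = 8.
```

Sparse bait (small k) leaves the ocean’s prior dominant; past a critical density the target swamps it, yet each additional unit past the flip buys little, which is the inefficiency the coordination layer must manage.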
1.1. Our Position: Substrate plus Coordination
We propose a third position: Substrate plus Coordination. We agree that LLMs alone are insufficient for AGI, but reject the conclusion that they are irrelevant. Our central thesis is: LLMs are the necessary System-1 substrate (the pattern repository). The primary bottleneck is the absence of a System-2 coordination layer that binds these patterns to external constraints, verifies outputs, and maintains state over time.
This paper formalizes the coordination layer through our Multi-Agent Collaborative Intelligence (MACI) framework (Chang, 2025b). MACI is not a claim that current models are AGI, but an architectural stance: build reliable reasoning on top of pretrained substrates by controlling what binds (semantic anchoring), how disagreements evolve (regulated debate), and what persists (transactional memory).
A key contribution is the formalization of bounded coordination. Semantic anchoring improves as we supply more anchors (retrieval, exemplars, tool outputs), but any practical theory must penalize unbounded context to prevent signal dilution. We introduce an adaptive anchoring score that captures this trade-off.
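As a minimal sketch of this trade-off, assuming the three quantities named in the abstract combine additively (the precise UCCT functional form may differ):

```latex
% Illustrative anchoring score (our notation, not UCCT's exact definition):
% rho_d(k) = effective support from k anchors, d_r = representational
% mismatch, gamma log k = adaptive cost of holding k anchors in context.
S(k) \;=\; \rho_d(k) \;-\; d_r \;-\; \gamma \log k,
\qquad \text{anchoring succeeds when } S(k) > \tau .
```

Under this form, if ρd(k) saturates while γ log k grows without bound, the score peaks at a finite k: the bounded-coordination point where adding anchors begins to dilute rather than sharpen the signal.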
1.2. From a False Dichotomy to a Research Agenda
The current debate is often framed as a binary choice:
Position 1 (Scaling sufficiency): Scale data and compute; general intelligence will emerge from the substrate alone.
Position 2 (Dead end): LLM limitations are intrinsic; discard them for alternative foundations.
Our position (Substrate plus Coordination): LLMs supply a necessary substrate. The priority is to engineer the missing coordination layer that transforms pretrained capacity into reliable, verifiable inference.
The key question is not “LLMs or something else,” but: Which coordination mechanisms reliably transform pattern capacity into goal-directed reasoning, and how can we measure success under bounded resources?
2.3. Multi-agent debate, self-critique, and judging as reliability mechanisms
Many recent systems improve reliability by replacing single-pass generation with iterative oversight: debate between multiple model instances, self-critique loops, role specialization, and independent judging. Surveys of LLM-based autonomous agents consolidate common motifs such as planner–executor decompositions, reflective critics, tool routers, and memory modules, emphasizing that gains typically come from system design rather than token prediction alone (Wang et al., 2023; Huang et al., 2024). MACI adopts the same design reality, but pushes on two specific gaps that are often under-specified: (i) explicit behavior modulation as a control policy (explore versus yield tied to anchoring signals), and (ii) Socratic filtering of ill-posed arguments via CRIT as a judge that optimizes reasonableness independent of stance (Chang, 2023).
2.6. Training-time remedies: teacher-guided RL and filtered synthetic data
Another line of work seeks to improve reasoning via post-training, especially reinforcement learning guided by stronger “teacher” models and large-scale synthetic data that is then filtered by a teacher. A recent example is ProRL, which studies prolonged RL to expand reasoning boundaries (Liu et al., 2025). While these methods can improve performance, they raise practical questions highlighted by practitioners: (i) catastrophic forgetting and benchmark regressions under aggressive fine-tuning, and (ii) the teacher bottleneck for frontier models, where “who teaches the best teacher” becomes a circular dependency in the limit. Our coordination stack is complementary and less teacher-dependent: (a) anchoring constrains behavior by binding to external evidence rather than to a teacher’s preferences, (b) CRIT evaluates well-posedness and chain quality without requiring a strictly stronger generator, and (c) verification can be delegated to tools, domain tests, or independent checks that need not be “more intelligent” than the base model, only more reliable on the specific constraint being checked.
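Point (c) admits a very small sketch (our illustration; the function and test cases are hypothetical): the verifier is a deterministic harness that is reliable on one constraint, without being smarter than the generator.

```python
# Teacher-free verification sketch: accept generated code only if it
# matches expected outputs on held-out tests. Names here are ours.
def verify_with_tests(candidate_fn, test_cases) -> bool:
    """Reliable on one constraint (input/output agreement), nothing more."""
    return all(candidate_fn(inp) == out for inp, out in test_cases)

# A model-proposed sort is accepted or rejected by tests, not by a teacher.
proposed = lambda xs: sorted(xs)                  # stand-in for generated code
tests = [([3, 1, 2], [1, 2, 3]), ([], []), ([2, 2], [2, 2])]
print(verify_with_tests(proposed, tests))         # True
```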
2.7. Clinical reasoning as evidence-seeking and precision retrieval
A concrete application where coordination matters is diagnostic reasoning: disagreements often indicate missing information or incompatible evidence, suggesting targeted data acquisition rather than more generation. In our EVINCE study, two-agent interaction is used to surface failure points, propose discriminating queries and tests, and re-evaluate after evidence is integrated (Chang & Chang, 2025). This aligns with the broader view of diagnostic error as a significant public health issue, discussed in the National Academies report (Balogh et al., 2015) and subsequent analyses highlighting concentrated harms in a limited set of conditions where targeted evidence seeking can be high leverage (Newman-Toker & Mark, 2023). Here, debate functions as a controller for precision RAG and measurement: it increases effective k (additional queries and tests), improves ρd (denser evidence support), and reduces dr (resolving conflicting interpretations), which is exactly the UCCT pathway for crossing the anchoring threshold.
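A sketch of this controller loop in UCCT’s quantities (our pseudocode; the additive score, update rules, and threshold are assumptions, not the theory’s exact form):

```python
import math

# Debate-as-controller for precision retrieval: acquire discriminating
# evidence until the anchoring threshold is crossed. Illustrative only.
def anchoring_score(rho_d: float, d_r: float, k: int, gamma: float = 0.1) -> float:
    return rho_d - d_r - gamma * math.log(max(k, 1))

def evidence_loop(propose_query, acquire, rho_d, d_r, k=1,
                  tau=0.5, max_rounds=10) -> bool:
    """Each round, disagreement proposes the most discriminating query;
    acquiring it raises rho_d, can lower d_r, and increments k."""
    for _ in range(max_rounds):
        if anchoring_score(rho_d, d_r, k) >= tau:
            return True                      # threshold crossed: commit
        query = propose_query()              # surfaced by agent disagreement
        gain_rho, drop_dr = acquire(query)   # e.g., order a lab test
        rho_d, d_r, k = rho_d + gain_rho, d_r - drop_dr, k + 1
    return anchoring_score(rho_d, d_r, k) >= tau
```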
3. Phase Transitions: From Physics to Cognitive Anchoring
This section highlights a striking empirical fact: a tiny amount of context can override an enormous pretrained repository, producing an abrupt flip in behavior. A few examples can rebind an operator or change the effective task, moving the model from one stable interpretation to another. We argue that this is not a machine-learning oddity but a familiar universal mechanism: thresholded state change. Across many physical and biological systems, smooth changes in a control variable yield sharp changes in system state. This section uses that universality to motivate UCCT and to clarify a central claim of this paper: large pattern repositories are not a dead end; they are the substrate that makes threshold-driven reconfiguration possible.
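A toy illustration of the flip (our construction; the margin and per-anchor influence are assumed numbers): a pretrained margin favors the default interpretation until a small number of in-context anchors tips the balance, so a smooth change in k produces a discrete change in state.

```python
# Two-interpretation toy: the argmax flips once k * h exceeds the margin.
prior_margin = 5.0    # pretrained preference for the default reading (assumed)
h = 1.2               # influence of one in-context anchor (assumed)

for k in range(8):
    state = "default" if prior_margin - k * h > 0 else "rebound (new task)"
    print(f"k = {k}: {state}")
# Nothing visibly changes for k <= 4; the interpretation flips at k = 5,
# i.e., at k = ceil(margin / h) -- an abrupt state change from smooth input.
```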
4. Multi-Agent Debate with Behavioral Modulation and Socratic Judging
Human reasoning scales through collaboration: diverse priors confront each other, surface hidden assumptions, and converge through critique. MACI makes this process explicit and controllable. Two mechanisms are central: (i) behavior modulation that regulates how strongly agents defend or revise hypotheses, and (ii) a judge that blocks ill-posed arguments from entering the shared state.
Beyond static stances. Many debate setups treat agents as fixed advocates. This can help, but it misses what makes debate productive: stance strength must adapt to evidence, the group must manage an explore-versus-consolidate tradeoff, and the process must not converge on fluent but ill-formed claims.
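A minimal sketch of such modulation (the specific schedules below are our assumptions, not MACI’s published policy): stance strength tracks an agent’s net evidence, and a group temperature anneals from exploration toward consolidation.

```python
# Behavior-modulation sketch: defend in proportion to evidence advantage;
# anneal from explore to yield across rounds. Schedules are illustrative.
def stance_strength(own_support: float, opposing_support: float,
                    floor: float = 0.1) -> float:
    """The floor prevents premature capitulation; the cap prevents dogmatism."""
    edge = own_support - opposing_support
    return max(floor, min(1.0, 0.5 + edge))

def debate_temperature(round_idx: int, total_rounds: int) -> float:
    """Early rounds explore (diverse hypotheses); later rounds consolidate."""
    return max(0.1, 1.0 - round_idx / total_rounds)

# Round 0 of 5 with weaker evidence: explore broadly, defend mildly.
print(debate_temperature(0, 5), stance_strength(0.4, 0.6))  # 1.0 0.3
```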
CRIT as a judge: Socratic filtering of ill-posed arguments. Debate alone is insufficient if agents can generate claims that are vague, internally inconsistent, or unsupported yet rhetorically fluent. MACI therefore introduces an explicit judge role grounded in CRIT (Critical Reading Inquisitive Template) that evaluates reasonableness independent of stance (Chang, 2023). The judge tests whether a claim is well-defined, whether assumptions are explicit, whether evidence supports the conclusion, and what would falsify it.
Operationally, CRIT gates the communication loop. Before a message is integrated into the shared state, it is scored for clarity, consistency, evidential grounding, and falsifiability. Low-scoring arguments are rejected or returned with targeted Socratic queries (e.g., “Which premise does the work?”, “What evidence would change your conclusion?”, “Are you changing definitions?”). This improves downstream anchoring by forcing arguments into forms that bind to shared constraints rather than forms that are merely plausible.
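A gate of this kind might look as follows (a sketch; the scoring scale, threshold, and query mapping are our illustration, not CRIT’s published rubric): per-dimension judge scores either admit a message or bounce it back with the Socratic query targeting its weakest dimension.

```python
# CRIT-style gating sketch: the four dimensions come from the text above;
# the [0, 1] scale, threshold, and query table are illustrative choices.
SOCRATIC_QUERIES = {
    "clarity": "Which premise does the work?",
    "grounding": "What evidence would change your conclusion?",
    "consistency": "Are you changing definitions?",
}

def crit_gate(scores: dict, threshold: float = 0.6):
    """scores: judgments in [0, 1] for clarity, consistency,
    grounding, and falsifiability."""
    failing = [dim for dim, s in scores.items() if s < threshold]
    if not failing:
        return ("accept", None)
    weakest = min(failing, key=lambda d: scores[d])
    return ("revise", SOCRATIC_QUERIES.get(weakest, "Clarify your claim."))

# A fluent but ungrounded claim is bounced back for evidence.
print(crit_gate({"clarity": 0.9, "consistency": 0.8,
                 "grounding": 0.3, "falsifiability": 0.7}))
# -> ('revise', 'What evidence would change your conclusion?')
```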