Can token probability distributions extend swarm composition across different model architectures?
This explores whether composing models in their output space — the probability distribution over next tokens — could let a 'swarm' span different architectures, where composing in weight space cannot.
This reads the question as a contrast between two places you can blend models: their *weights* and their *outputs*. The corpus's clearest swarm result lives in weight space: "Can language models discover new expertise through collaborative weight search?" sends PSO-style 'particles' (each an LLM) drifting through a shared weight landscape until they settle on composed experts that can answer questions every starting model failed on. That trick is powerful but quietly architecture-locked: averaging or interpolating weights only makes sense when every model has the same parameter layout, so that each weight in one model has a counterpart in the others. Two different architectures have no such shared coordinate system to move through together. So weight-space swarms hit a wall the moment you want to mix, say, a small model from one family with a larger model from another.
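A minimal sketch of why weight-space blending is architecture-locked, assuming each model is reduced to a toy dictionary of flat parameter lists. The `Weights` type and `interpolate_weights` helper are hypothetical names made up for illustration, not anything from the cited note.

```python
from typing import Dict, List

# Hypothetical stand-in for a model checkpoint: parameter name -> flat values.
Weights = Dict[str, List[float]]


def interpolate_weights(a: Weights, b: Weights, alpha: float) -> Weights:
    """Return alpha * a + (1 - alpha) * b, defined only for matching layouts."""
    if a.keys() != b.keys():
        raise ValueError("parameter names differ: no shared coordinate system")
    out: Weights = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"size mismatch in parameter {name!r}")
        out[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a[name], b[name])]
    return out


# Two "models" with the same layout blend without trouble.
same_a = {"layer0": [1.0, 2.0], "layer1": [0.0]}
same_b = {"layer0": [3.0, 4.0], "layer1": [2.0]}
merged = interpolate_weights(same_a, same_b, 0.5)

# A model with a different layout has nothing to line up against.
other_arch = {"layer0": [1.0, 2.0, 3.0]}
try:
    interpolate_weights(same_a, other_arch, 0.5)
    cross_arch_ok = True
except ValueError:
    cross_arch_ok = False
```

The second call fails before any arithmetic happens, which is the wall in miniature: there is no sensible answer to "which weight in model B corresponds to this weight in model A".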
Token probability distributions sidestep most of that constraint, because every model, regardless of its internals, emits the same kind of object: a probability distribution over next tokens. When the models share a tokenizer, those distributions cover the *same* vocabulary and can be combined directly; when they don't, the vocabularies have to be aligned first, which is extra work but at least a well-defined problem. Output space is the common ground that weight space isn't. Nothing in the corpus demonstrates a distribution-level swarm across architectures directly, so this is a synthesis rather than a reported finding, but the pieces line up. Inference-time composition already works without touching weights: "Can evolutionary search beat sampling and revision at inference time?" runs a diversity-preserving population of candidate solutions with LLM-generated mutations and crossovers, and "How does test-time scaling work at the agent level?" frames multi-agent gains as something you buy at the output/coordination layer rather than inside any single model. These are swarms whose 'genome' is text and choices, not weights, which is precisely what makes them indifferent to what produced them.
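To make output-space composition concrete, here is a minimal sketch that blends next-token distributions from two models as a weighted linear mixture. It assumes both models already score the same vocabulary (a shared or pre-aligned tokenizer); the `Dist` type, `mix_distributions` helper, and toy probabilities are all hypothetical. A linear mixture is only one possible rule, and averaging log-probabilities would be another.

```python
from typing import Dict, List

# Hypothetical next-token distribution: token string -> probability.
Dist = Dict[str, float]


def mix_distributions(dists: List[Dist], weights: List[float]) -> Dist:
    """Weighted average of distributions that share one vocabulary."""
    if not dists or len(dists) != len(weights):
        raise ValueError("need one weight per distribution")
    vocab = set(dists[0])
    if any(set(d) != vocab for d in dists[1:]):
        # Assumption made explicit: vocabularies must be aligned beforehand.
        raise ValueError("distributions must cover the same vocabulary")
    total = sum(weights)
    return {
        tok: sum(w * d[tok] for w, d in zip(weights, dists)) / total
        for tok in vocab
    }


# Toy outputs from two models with different internals but one vocabulary.
model_a = {"yes": 0.7, "no": 0.2, "maybe": 0.1}
model_b = {"yes": 0.3, "no": 0.6, "maybe": 0.1}

mixed = mix_distributions([model_a, model_b], [1.0, 1.0])
best = max(mixed, key=mixed.get)
```

Nothing in `mix_distributions` asks how either distribution was produced, which is the whole point: the only requirement is agreement on what the tokens are.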
There's also a reason mixing architectures might be worth the trouble rather than just possible. "Do large language models use one reasoning style or many?" finds that different models reason in genuinely distinct styles: one minimaxes, another reasons from trust, another anticipates beliefs. A distribution-level swarm could blend those complementary tendencies in a way a homogeneous weight swarm never could, because the diversity comes from separately built models, not from different points in one model's landscape. The economic case echoes this: "Can small language models handle most agent tasks?" argues the rational design is heterogeneous by default (small models everywhere, large ones selectively), which only works if you can compose across the boundary.
The subtler payoff is *where* such composition would actually bite. "Do high-entropy tokens drive reasoning model improvements?" shows that only about 20% of token positions, the high-entropy forking points, carry the real decision weight; the rest are near-deterministic. That suggests a distribution-space swarm wouldn't need to negotiate every token across architectures. It would only need agreement (or productive disagreement) at the handful of pivotal branch points, which is both cheaper and more tractable than blending entire weight matrices. The catch worth keeping in view comes from "Does token spending drive multi-agent research performance?": most of the measured multi-agent benefit tracks token spend, so the open question is whether cross-architecture distribution mixing adds *coordination* value beyond simply sampling more.
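The idea of composing only at forking points can be sketched as an entropy gate: keep the lead model's distribution wherever it is confident, and blend in a peer only where its entropy crosses a threshold. This is a speculative illustration, not a method from the cited notes. The `gated_mix` helper, the 0.5-nat threshold, and the toy distributions are all assumptions, and a real system would query the peer lazily rather than precompute its outputs.

```python
import math
from typing import Dict, List

# Hypothetical next-token distribution: token string -> probability.
Dist = Dict[str, float]


def entropy(dist: Dist) -> float:
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def gated_mix(lead: Dist, peer: Dist, threshold: float) -> Dist:
    """Use the lead model alone unless it is uncertain, then average with the peer."""
    if entropy(lead) < threshold:
        return lead
    return {tok: 0.5 * (lead[tok] + peer[tok]) for tok in lead}


# Three toy positions: two near-deterministic, one genuine fork.
lead_steps: List[Dist] = [
    {"a": 0.98, "b": 0.02},
    {"a": 0.50, "b": 0.50},
    {"a": 0.99, "b": 0.01},
]
peer_steps: List[Dist] = [
    {"a": 0.60, "b": 0.40},
    {"a": 0.10, "b": 0.90},
    {"a": 0.50, "b": 0.50},
]

threshold = 0.5  # illustrative value in nats, not taken from the source
composed = [gated_mix(l, p, threshold) for l, p in zip(lead_steps, peer_steps)]
consulted = sum(entropy(l) >= threshold for l in lead_steps)
```

In this toy run the peer only changes the outcome at the middle position, where the lead model is split evenly; the other two positions pass through untouched.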
So the honest answer is: plausibly yes, and for three clean reasons. Output distributions are the architecture-agnostic interface that weights aren't, the diversity of architectures is an asset rather than noise, and you'd only have to compose at the ~20% of tokens that matter. But the corpus shows the ingredients, not the finished dish; no note here builds the distribution-level cross-architecture swarm itself.
Sources (7 notes)
PSO-inspired swarms of LLM particles moving through weight space discover composed experts with new capabilities—including answering questions all initial experts failed on—using only 200 validation examples and no gradient-based training.
Mind Evolution uses genetic algorithms with LLM-generated mutations and crossovers to significantly outperform Best-of-N and Sequential Revision on planning benchmarks. An island model sustains population diversity, preventing the premature convergence that single-trajectory refinement exhibits.
Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.
Analysis of 22 LLMs across behavioral game theory reveals three dominant profiles: GPT-o1 uses minimax reasoning, DeepSeek-R1 uses trust-based reasoning, and GPT-o3-mini uses belief-anticipation. Performance correlates with game structure, not raw reasoning depth.
SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Anthropic's internal evals show token spending alone accounts for 80% of performance variance in multi-agent research systems. Model capability upgrades deliver larger gains than doubling token budget, suggesting efficiency matters as much as quantity.