Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy to expedite work is collaboration: dividing the problem into sub-tasks, exploring different strategies concurrently, and so on. Recent research has shown that LLMs can also operate in parallel through explicit cooperation frameworks, such as voting mechanisms or the creation of independent sub-tasks that can be executed in parallel. However, no single framework is suitable for all types of tasks, which limits their applicability. In this work, we propose a different design approach: we run LLM “workers” in parallel, allowing them to synchronize via a concurrently-updated attention cache, and prompt these workers to decide how best to collaborate. Our approach allows the LLM instances to come up with their own collaboration strategy for the problem at hand, all the while “seeing” each other’s memory in the concurrent KV cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine where multiple instances of the same LLM run in parallel with the same attention cache, with “instant” access to each other’s memory. Hogwild! Inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with a shared Key-Value cache out of the box, without additional fine-tuning.
Using these models to solve complex problems often requires long sequential computation, i.e., generating text token-by-token. However, many reasoning problems are not inherently sequential. Leveraging this intuition, several recent works propose parallel inference strategies that allow multiple LLMs to solve a problem faster or more accurately via some form of collaboration [Wang et al., 2022, Ning et al., 2024]. In the simplest case, multiple LLMs can attempt the problem independently, then vote [Wang et al., 2022] or cross-reference their results [Du et al., 2023, Wang et al., 2024a] to improve correctness. A parallel line of work allows the LLM to divide the problem into multiple independent sub-tasks that are then solved in parallel and merged to produce the final solution [Ning et al., 2024, Kim et al., 2024, Jin et al., 2025]. These parallel inference strategies can improve both quality and efficiency by taking advantage of the parallelism in modern hardware.
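For concreteness, the simplest of these strategies, majority voting in the spirit of self-consistency [Wang et al., 2022], fits in a few lines. The sketch below is illustrative; `sample_answer` is a hypothetical stand-in for one independent LLM attempt, not an API from any of the cited works.

```python
from collections import Counter

def majority_vote(question, sample_answer, n_samples: int = 8):
    # sample_answer: any callable that runs one independent LLM attempt
    # and returns a final (hashable) answer
    answers = [sample_answer(question) for _ in range(n_samples)]
    # return the most frequent answer across independent attempts
    return Counter(answers).most_common(1)[0][0]
```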
Unfortunately, no single collaboration strategy is universally effective. For instance, solving a problem in independent parallel “threads” can be inefficient when one of the threads requires a longer generation than the rest: most of the agents end up waiting for a straggler and wasting compute [Wang et al., 2022, 2024a]. In turn, inference with independent sub-tasks only works if the problem can be split into such sub-tasks upfront. Furthermore, if one of the agents discovers that the original plan is flawed, it has no way to trigger re-planning [Ning et al., 2024, Ding et al., 2025], so agents may keep solving sub-tasks that are no longer necessary [Jin et al., 2025].
This runs contrary to how humans collaborate. Rather than strictly adhering to a fixed collaboration strategy, we often collaborate more dynamically: re-planning on the fly, abandoning tasks half-way to switch to a more promising approach, and discussing or debating strategy when the initial plan fails. While this type of collaboration is harder to define, it offers greater flexibility and can be more efficient if the participants are sufficiently cohesive [Hutchins, 1995, Entin and Serfaty, 1999].

Our Approach. In this work, we apply the same principle to artificial reasoners. Since modern LLMs can already reason and plan [Zhou et al., 2024, Gao et al., 2024, Wang et al., 2024c], we hypothesize that they can benefit from dynamic interaction between different instances, during which they can develop their own collaboration strategy for the problem at hand.
To test this hypothesis, we propose Hogwild! Inference—a parallel LLM inference protocol with no pre-defined framework for collaboration.2 Instead of choosing how LLMs should interact ahead of time, we allow them to generate tokens in parallel and “see” each other’s progress (tokens) immediately as they are generated. We then prompt the LLM “workers” to decide their next course of action by themselves, given the latest actions from others: whether this means solving parallel sub-tasks, cross-verifying each other, discussing strategy, or pivoting to a new plan.
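To illustrate the scheduling idea, here is a minimal, self-contained toy of the decoding loop: each worker appends one token per round, and every worker's next step is conditioned on all tokens generated so far, its own and the others'. This is only a sketch of the protocol's visibility semantics; `toy_step` stands in for a real forward pass, and the actual system decodes workers in parallel over a shared Key-Value cache rather than round-robin.

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    tokens: list = field(default_factory=list)

def toy_step(own, visible):
    # stand-in for one decoding step conditioned on the shared context
    return f"tok{len(own) + len(visible)}"

workers = [Worker("alice"), Worker("bob")]
for _ in range(4):  # fixed token budget for the toy
    for w in workers:
        # a worker "sees" every token the other workers have produced so far
        visible = [t for o in workers if o is not w for t in o.tokens]
        w.tokens.append(toy_step(w.tokens, visible))

for w in workers:
    print(w.name, w.tokens)
```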
To enable this type of on-the-fly collaboration, Hogwild! Inference runs multiple LLM instances with the same weights, but with a custom Key-Value cache that shares token representations between workers, allowing them to attend to one another's tokens concurrently. Specifically, instead of re-computing Key-Value representations for each worker, we keep track of each worker's KV memory and “stitch” these memories together in different orders by adjusting their positional embeddings (see Figure 1). Moreover, we provide an efficient implementation of this inference approach.
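The position adjustment works because RoPE rotations compose additively: a key encoded at position p can be re-based to position p' by rotating it through the angle difference (p' - p)·θ, with no recomputation from hidden states. Below is a minimal sketch of this re-basing, assuming an interleaved RoPE pairing (some implementations, e.g. Hugging Face's Llama, rotate half-split dimensions instead) and a [seq, head_dim] key layout; the function names are ours, not the paper's API.

```python
import torch

def rope_angles(positions: torch.Tensor, head_dim: int, base: float = 10000.0):
    # one rotation angle per (position, dimension pair): [seq, head_dim // 2]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    return positions.to(torch.float32)[:, None] * inv_freq[None, :]

def rebase_rope_keys(keys: torch.Tensor, old_pos: torch.Tensor, new_pos: torch.Tensor):
    # keys: [seq, head_dim], already RoPE-rotated at old_pos.
    # Rotating by the angle *difference* moves them to new_pos,
    # since R(p'theta) = R((p' - p) theta) @ R(p theta).
    ang = rope_angles(new_pos - old_pos, keys.shape[-1])
    cos, sin = ang.cos(), ang.sin()
    k_even, k_odd = keys[..., 0::2], keys[..., 1::2]
    out = torch.empty_like(keys)
    out[..., 0::2] = k_even * cos - k_odd * sin
    out[..., 1::2] = k_even * sin + k_odd * cos
    return out

# Example: worker A places worker B's 5 cached keys right after its own 12 tokens.
k_b = torch.randn(5, 64)          # B's keys, rotated at positions 0..4
old = torch.arange(5)
new = 12 + torch.arange(5)        # their positions in A's stitched view
k_b_in_a = rebase_rope_keys(k_b, old, new)
```

Note that only cached keys need re-basing: each worker computes its own queries fresh at its own positions, so the stitched cache can be laid out differently for each worker at negligible cost.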
We test Hogwild! Inference with modern open-source LLMs and find that existing reasoning-capable models, such as QwQ [Qwen Team, 2025] and DeepSeek-R1 [DeepSeek-AI et al., 2025], can already “reason to coordinate”. More concretely, we observe that concurrent agents can formulate and follow plans, adapt when the initial plan fails, point out each other’s errors, and build on each other’s key observations. When prompted to check whether they are doing redundant work (e.g., when one LLM instance is working on a sub-task that another has already completed, or solving a sub-problem that is no longer relevant), they can often, though not always, detect the redundancy and change strategy. In summary, our results suggest that parallel inference with a shared Key-Value cache may offer a promising way to enable effective and efficient collaboration between multiple LLM instances.