Why does batching multiple conversations on one GPU create identity problems?
This explores why the way GPUs serve many users at once — packing several separate conversations into a single batch on one chip — undermines any attempt to say 'I am talking to this particular AI instance.'
Batching breaks identity because, when a GPU processes several unrelated conversations together to use the hardware efficiently, you lose the ability to point at a piece of silicon and say "that's the one I'm talking to." The corpus's most direct take is that hardware is not a stable place to locate an LLM's identity at all (see "Can we identify an LLM interlocutor with a single hardware instance?"). The plumbing runs in both directions: load-balancing and model-parallelism scatter a single conversation across many machines, while batching funnels many conversations through one machine. Either way, the clean one-to-one map between "a conversation" and "a physical instance" dissolves. Your chat isn't running on a chip you could fingerprint; it shares a batch with strangers' conversations.
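The two-way breakdown can be made concrete with a toy scheduler. This is a minimal sketch, not a description of any real serving stack: the names `schedule` and `mapping`, the round-robin policy, and the conversation IDs are all assumptions made for illustration, and model-parallelism (one request split across several GPUs) is left out.

```python
from collections import defaultdict

def schedule(requests, num_gpus, batch_size):
    """Toy scheduler: pack requests into fixed-size batches, in arrival order,
    and hand the batches to GPUs round-robin (a stand-in for load-balancing).

    requests: list of (conversation_id, turn_index) tuples.
    Returns a list of (gpu_id, batch) pairs.
    """
    batches = []
    for start in range(0, len(requests), batch_size):
        gpu_id = (start // batch_size) % num_gpus
        batches.append((gpu_id, requests[start:start + batch_size]))
    return batches

def mapping(batches):
    """Record which GPUs served each conversation, and which conversations each GPU served."""
    gpus_per_conversation = defaultdict(set)
    conversations_per_gpu = defaultdict(set)
    for gpu_id, batch in batches:
        for conversation_id, _turn in batch:
            gpus_per_conversation[conversation_id].add(gpu_id)
            conversations_per_gpu[gpu_id].add(conversation_id)
    return dict(gpus_per_conversation), dict(conversations_per_gpu)

# Hypothetical traffic: three conversations, two turns each, arriving interleaved.
requests = [
    ("alice", 0), ("bob", 0), ("carol", 0),
    ("alice", 1), ("bob", 1), ("carol", 1),
]
batches = schedule(requests, num_gpus=2, batch_size=2)
gpus_per_conversation, conversations_per_gpu = mapping(batches)
```

With these inputs, the conversation labelled `alice` is served by both GPUs, while GPU 0 handles turns from all three conversations, so neither direction of the map is one-to-one.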
What makes this more than a plumbing detail is that the identity problem does not stop at the hardware; it goes all the way down. Even if you could pin a conversation to one GPU, there's no fixed "someone" there to pin. The 20-questions regeneration test shows that an LLM holds a superposition of possible characters and samples one at generation time rather than committing to a self (see "Do large language models actually commit to a single character?"). Regenerate the same response and you get a different answer each time, each one consistent with the conversation so far. So batching doesn't corrupt a stable identity; it reveals that the thing we wanted to count was never a discrete individual in the first place, at the hardware level *or* the character level.
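The regeneration idea can be sketched as sampling at reveal time from whatever is still consistent with the answers given so far. This is a toy illustration under stated assumptions, not the actual test procedure: the candidate table, the property names, and the functions `consistent` and `reveal` are all hypothetical, and a real model samples tokens rather than picking from an explicit list.

```python
import random

# Hypothetical candidate objects with yes/no properties (illustrative data only).
CANDIDATES = {
    "cat": {"alive": True, "bigger_than_breadbox": False},
    "oak tree": {"alive": True, "bigger_than_breadbox": True},
    "elephant": {"alive": True, "bigger_than_breadbox": True},
    "teapot": {"alive": False, "bigger_than_breadbox": False},
}

def consistent(answers):
    """Return every candidate that agrees with all answers given so far."""
    return [
        name
        for name, props in CANDIDATES.items()
        if all(props[question] == answer for question, answer in answers.items())
    ]

def reveal(answers, rng):
    """Pick the 'secret object' only when asked, instead of looking up a stored one."""
    return rng.choice(consistent(answers))

# The game so far: one question asked, answered "yes, it is alive".
answers = {"alive": True}
rng = random.Random(0)  # seeded so the run is repeatable
reveals = [reveal(answers, rng) for _ in range(200)]
```

Every entry in `reveals` satisfies the answers already given, yet no single object was fixed in advance, which is the sense in which regeneration is different but consistent.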
The corpus also shows how fragile the softer attempts to manufacture identity are. When you prompt a model to play a specific persona, the variation between repeated runs of the *same* persona matches or exceeds the variation between *different* personas, meaning model uncertainty, not stable social identity, is driving the output (see "Why do LLM persona prompts produce inconsistent outputs across runs?"). Persona drift is real enough that researchers have built dedicated reinforcement-learning setups just to hold a simulated character together across turns, cutting drift by over half (see "Can training user simulators reduce persona drift in dialogue?"). Identity in these systems is something you have to actively pump energy into maintaining; it is not a property the system possesses by default.
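The within-versus-between comparison can be written down as a small calculation. The sketch below assumes each run of a persona prompt has been reduced to a single numeric score, which is a simplification; the scores, persona labels, and variable names are made up for illustration and are not results from the cited note.

```python
from statistics import mean, pvariance

# Made-up scores (e.g. a 1-to-6 rating) from three repeated runs per persona prompt.
runs_by_persona = {
    "persona_a": [1, 3, 5],
    "persona_b": [2, 4, 6],
    "persona_c": [3, 5, 4],
}

# Within-persona variance: how much repeated runs of the SAME persona disagree,
# averaged over personas.
within = mean(pvariance(scores) for scores in runs_by_persona.values())

# Between-persona variance: how much the personas' average scores differ from each other.
persona_means = [mean(scores) for scores in runs_by_persona.values()]
between = pvariance(persona_means)

# If the persona prompt carried a stable identity, `between` should dominate.
noise_dominates = within >= between
```

With these illustrative numbers the within-persona variance is about 2.0 and the between-persona variance is about 0.22, the pattern the note describes: rerunning one persona moves the output more than switching personas does.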
The stakes show up in how people actually relate to chatbots. We treat them as a coherent "quasi-other": a partner that remembers us, responds to us, and builds on our framing (see "How do chatbots enable distributed delusion differently than passive tools?"). That felt sense of a continuous someone is exactly what the serving architecture can't deliver. So the deeper answer to the question is a mismatch: batching is an efficiency decision made at the infrastructure layer, but it collides with a human expectation, that there's a stable individual on the other end, which neither the hardware nor the model was ever built to satisfy.
The surprise worth carrying away: the question assumes batching *causes* an identity problem, but the corpus suggests batching merely *exposes* one that was already there. The individual you thought you were talking to was a convenient fiction at every level of the stack — chip, character, and persona alike.
Sources (5 notes)
Load-balancing and model-parallelism route single conversations across multiple hardware instances, while batching routes multiple conversations through one instance. These architectural facts break any stable one-to-one mapping, making hardware an untenable level of individuation.
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.
When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.
Inverting the standard RL setup to train user simulators for consistency, with three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, reduces persona drift by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
Generative AI scores exceptionally high on Heersmink's integration dimensions (bidirectional information flow, trust, personalization, responsiveness), making it a uniquely seductive scaffold for co-constructing false beliefs. Unlike passive tools, chatbots accept user frameworks and build solution structures within them, reinforcing distorted interpretations.