What interaction controls matter most for effective human-LLM collaboration?
This explores which design choices in how people steer an LLM — interfaces, turn-taking, structure, initiative — make collaboration actually work, rather than which model is smartest.
The corpus suggests the biggest wins come from *structure*: replacing open-ended chat with scaffolding that constrains what the model sees, what it produces, and when it asks. Generated task-specific interfaces beat plain text in over 70 percent of cases because structured representation and iterative refinement lower the user's cognitive load (Do generated interfaces outperform text-based chat for most tasks?). The same logic shows up at the architecture level: LLM Programs embed the model inside explicit control flow that hides step-irrelevant context, turning a sprawling task into modular, debuggable pieces (Can algorithms control LLM reasoning better than LLMs alone?). And in multi-agent settings, standardized shared artifacts coordinate better than conversational back-and-forth, because pulling structured documents from a shared workspace strips out the noise that free chat introduces (Does structured artifact sharing outperform conversational coordination?).
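To make the information-hiding idea concrete, here is a minimal sketch of an LLM Program: an ordinary loop owns the control flow and state, and each model call sees only the state keys its step declares. The names `Step`, `run_program`, and the `model` callable are hypothetical stand-ins for illustration, not an API from the cited work.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a model call: prompt text in, completion text out.
ModelFn = Callable[[str], str]


@dataclass
class Step:
    name: str          # key under which this step's output is stored
    needs: list[str]   # the only state keys this step is allowed to see
    template: str      # prompt template that references those keys


def run_program(steps: list[Step], initial_state: dict[str, str], model: ModelFn):
    """Run steps in order, exposing only step-relevant state to each call."""
    state = dict(initial_state)  # copy so the caller's dict is not mutated
    prompts = []                 # kept so each call can be inspected or debugged
    for step in steps:
        visible = {key: state[key] for key in step.needs}  # information hiding
        prompt = step.template.format(**visible)
        prompts.append(prompt)
        state[step.name] = model(prompt)
    return state, prompts


# Illustrative usage with a fake model that upper-cases its prompt.
steps = [
    Step("outline", ["question"], "Outline an answer to: {question}"),
    Step("draft", ["outline"], "Draft from outline: {outline}"),
]
final_state, sent_prompts = run_program(
    steps,
    {"question": "Q", "notes": "long irrelevant context"},
    lambda prompt: prompt.upper(),
)
```

Because the `notes` entry is never listed in any step's `needs`, it never reaches a prompt, and the recorded prompts make each sub-task inspectable on its own.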
The second control is *initiative*: who is allowed to ask, probe, or redirect. Here the corpus names a structural deficit: conversational agents are passive by design, optimized to respond to queries rather than to plan or lead, and alignment training reinforces that passivity behind fluent-sounding output (Why can't conversational AI agents take the initiative?). The fix isn't a bigger model but a borrowed move from human conversation analysis: insert-expansions, where the agent pauses to clarify intent or scope before acting, preventing the silent tool-chaining drift that pulls it away from what the user actually wanted (When should AI agents ask users instead of just searching?). So 'when should the system ask versus just proceed' turns out to be one of the highest-leverage controls you can build in.
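One way to picture the ask-versus-proceed control is as a small gate that runs before every tool call. The sketch below is an assumption-laden illustration, not the mechanism from the cited note: `PlannedAction`, `next_move`, and the `max_silent_calls` threshold are all hypothetical, and a real agent would need a far richer signal for when intent is unclear.

```python
from dataclasses import dataclass


@dataclass
class PlannedAction:
    tool: str                       # tool the agent intends to call next
    missing_slots: list[str]        # required details the user has not supplied
    chained_calls_since_user: int   # tool calls made since the last user turn


def next_move(action: PlannedAction, max_silent_calls: int = 2) -> str:
    """Decide whether to insert a clarifying question or to proceed."""
    # Insert-expansion to clarify intent: something required is unknown.
    if action.missing_slots:
        return f"ask: could you specify {', '.join(action.missing_slots)}?"
    # Insert-expansion to check scope: too many silent tool calls in a row
    # (hypothetical threshold) risks drifting from what the user wanted.
    if action.chained_calls_since_user >= max_silent_calls:
        return f"ask: I am about to run {action.tool}; is that still what you want?"
    return f"proceed: {action.tool}"
```

The design choice worth noting is that the question is asked before acting, so a misunderstanding is prevented instead of repaired afterwards.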
The third, and most under-appreciated, control is *shared understanding*, and the corpus is sharply skeptical that current systems support it. LLMs treat the initial prompt as a fixed frame and interpret every later turn inside it, so they cannot symmetrically update common ground: even when you pivot or contradict an earlier framing, the burden of maintaining the conversational 'scoreboard' falls entirely on you (Can LLMs truly update shared conversational common ground?). That reframes a lot of collaboration friction as a control problem the human is silently absorbing. It connects to a deeper claim that LLM text generation and human communication are structurally different operations (same surface form, different machinery), so designing the interaction as if both parties update meaning the same way sets you up for mismatch (Are language models and human speakers doing the same thing?).
What ties these together is that the controls that matter are mostly *outside the model*. The work on turning LLMs into action-capable agents makes this explicit: whether actions are grounded or hallucinated depends on the whole surrounding pipeline (datasets, action-grounding training, memory and tools, safety evaluation), not on fine-tuning the weights alone (Can you turn an LLM into an agent by just fine-tuning?). And the modality of communication itself shapes perceived trust and workspace awareness in measurable ways, echoing decades of human-human collaboration research (How do communication modalities shape human-agent collaboration patterns?).
The thing you might not have expected to want to know: LLMs that solve problems competently on their own can get *worse* when asked to collaborate, collapsing into more than 90 percent agreement regardless of whether the answer is right. That defect can be trained away: self-play preference training improves outcomes by 16.7 percent (Why do language models fail at collaborative reasoning?). In other words, productive disagreement is itself an interaction control, and one a model can be taught to exercise rather than one you have to architect around.
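"Agreement regardless of correctness" can be checked by splitting the agreement rate by whether the partner was actually right. The following is a minimal sketch of such a metric; the function name and the input format are assumptions for illustration, and the cited work may measure this differently.

```python
def agreement_by_correctness(turns):
    """Return the model's agreement rate, split by partner correctness.

    turns: iterable of (partner_was_correct, model_agreed) boolean pairs.
    The result maps True/False (partner correct or not) to a rate in [0, 1],
    or to None when that case never occurred.
    """
    counts = {True: [0, 0], False: [0, 0]}  # correctness -> [agreed, total]
    for partner_correct, agreed in turns:
        bucket = counts[bool(partner_correct)]
        bucket[0] += int(bool(agreed))
        bucket[1] += 1
    return {
        correct: (agreed / total if total else None)
        for correct, (agreed, total) in counts.items()
    }
```

A model that disagrees productively should show a high rate when the partner is right and a much lower one when the partner is wrong; two similar rates indicate agreement that ignores correctness.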
Sources (10 notes)
Research shows users strongly prefer LLM-generated interactive interfaces—dashboards, tools, animations—over text blocks, especially for structured and information-dense tasks. Structured representation and iterative refinement reduce cognitive load.
LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.
MetaGPT demonstrates that agents producing standardized engineering documents achieve superior coordination compared to conversational exchange. Active information pulling from shared environments eliminates noise and mirrors efficient human workplace infrastructure.
Research shows LLMs including ChatGPT cannot initiate topics, plan strategically, or lead conversations because their training optimizes for responding to queries, not creating dialogue from agent goals. This passivity is reinforced by alignment objectives and masked by fluent-sounding outputs.
Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.
LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, making the user the sole maintainer of the conversational scoreboard.
LLMs produce strings via probability distributions; humans use language to address and relate to others. They share surface form but differ in what produces output, what it does socially, and what receivers should do with it.
Converting LLMs to action-capable systems requires four distinct stages: curating action-environment-user datasets, training for action grounding, integrating agent infrastructure with memory and tools, and rigorous safety evaluation. The surrounding system and harness determine whether actions are grounded or hallucinated.
Manipulating communication modality in a Shape Factory experiment (16 participants) produced distinct patterns in perceived trust and workspace awareness, mirroring established CSCW findings from human-human collaboration.
Frontier LLMs that solve problems alone fail when collaborating, achieving >90% agreement regardless of correctness. Self-play preference training improves outcomes by 16.7%, suggesting social skills for effective disagreement can be trained.