INQUIRING LINE

What interaction design changes would help LLMs handle underspecified requests?

This reads as a design question: not why LLMs mishandle vague or partial requests, but what we could change about how they interact with us to fix it.


This explores what to change in the *interaction*, the back-and-forth itself, so that models stop guessing when a request is underspecified. The corpus is unusually pointed here: it locates the root cause in one place and then offers several design levers to pull. The diagnosis is that LLMs lock in early. When a request is revealed gradually, they make a premature assumption about what you meant and do not recover from it: across 200,000+ conversations, the major models tested dropped 39% on average in multi-turn settings, and bolt-on agent mitigations recovered only 15–20% of that loss (see "Why do language models fail in gradually revealed conversations?"). So the design target isn't "answer better," it's "don't commit prematurely."

The most direct lever is teaching models to ask instead of assume. Humans use *dynamic grounding*: we build shared understanding through small clarification-and-repair loops. LLMs default to *static grounding*, retrieving and answering as if common ground already exists (see "Why do language models skip the calibration step?"). The hard part has always been knowing *when* to ask rather than proceed, and the corpus borrows a precise answer from conversation analysis: *insert-expansions*, the moves humans make to clarify intent or scope a response before acting, give agents a formal rule for when to probe the user instead of silently chaining tools (see "When should AI agents ask users instead of just searching?"). That reframes clarification from a UX nicety into a structured decision.
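To make the "structured decision" concrete, here is a minimal sketch of a clarification gate that a harness could run before any tool call. It is an illustration only, not the rule from the cited notes: the `Request` fields, the `next_move` function, and the slot names are all hypothetical, and it covers just two of the insert-expansion moves (clarifying intent and scoping the response).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Move(Enum):
    CLARIFY_INTENT = "clarify_intent"  # ask what the user actually wants
    SCOPE_RESPONSE = "scope_response"  # ask for the details needed to act
    PROCEED = "proceed"                # enough common ground to act


@dataclass
class Request:
    text: str
    # Hypothetical fields a harness might fill in; None means "not stated".
    slots: Dict[str, Optional[str]] = field(default_factory=dict)
    candidate_intents: List[str] = field(default_factory=list)


def next_move(request: Request, required_slots: List[str]) -> Tuple[Move, str]:
    """Return (move, question). Illustrative rule, assumed for this sketch."""
    # No plausible reading, or more than one: ask before committing to any.
    if not request.candidate_intents:
        return Move.CLARIFY_INTENT, "What would you like me to do?"
    if len(request.candidate_intents) > 1:
        options = ", ".join(request.candidate_intents)
        return Move.CLARIFY_INTENT, f"Did you mean one of: {options}?"
    # One reading, but details are missing: scope the response first.
    missing = [s for s in required_slots if request.slots.get(s) is None]
    if missing:
        return Move.SCOPE_RESPONSE, f"Before I start, can you specify: {', '.join(missing)}?"
    return Move.PROCEED, ""


if __name__ == "__main__":
    vague = Request(
        "book a flight",
        slots={"destination": "Oslo", "date": None},
        candidate_intents=["book_flight"],
    )
    print(next_move(vague, ["destination", "date"]))
```

The design choice worth noticing is that the gate runs *before* the model answers or chains tools, so a wrong guess is prevented instead of repaired afterwards.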

But here's the deeper, less obvious obstacle: even a model that *wants* to clarify may be structurally unable to. One note argues that LLMs interpret every later turn through the lens of the *initial* prompt frame, so when you pivot or contradict yourself, the model can't fold the revision into jointly held background; you, the user, end up as the sole keeper of the conversational scoreboard (see "Can LLMs truly update shared conversational common ground?"). Another points out that these systems are *structurally passive* by training: optimized to respond, not to initiate, they can't naturally lead with a clarifying question or steer the dialogue (see "Why can't conversational AI agents take the initiative?"). Underspecification handling isn't just a prompt-engineering fix, then: it runs up against what the model architecture and alignment objectives let the system do at all.

This is where the corpus gets interesting laterally, because it suggests the fix might live *outside* the model. Instead of better conversation, give the user a better surface: research shows users preferred LLM-generated task-specific interfaces (dashboards, sliders, tools) over plain chat in over 70% of cases, precisely because structured representation lets them express and refine intent without having to verbalize everything up front (see "Do generated interfaces outperform text-based chat for most tasks?"). A related strand wraps the model in explicit algorithmic scaffolding that decomposes a task and feeds each step only the context it needs, turning a vague monolithic request into debuggable sub-steps (see "Can algorithms control LLM reasoning better than LLMs alone?"). The thread running through both: the harness and surrounding pipeline, not the raw model, often determine whether intent gets grounded (see "Can you turn an LLM into an agent by just fine-tuning?").
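The scaffolding idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation from the cited work: `run_program`, the step dictionaries, and `echo_llm` are hypothetical names, and the model is any function that maps a prompt string to a completion string.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # assumed interface: prompt in, completion out


def run_program(task: str, steps: List[dict], llm: LLM) -> Dict[str, str]:
    """Run steps in order; each call sees only the state keys it declares."""
    state: Dict[str, str] = {"task": task}
    for step in steps:
        # Information hiding: build the prompt from the declared keys only.
        visible = {key: state[key] for key in step["needs"]}
        prompt = step["template"].format(**visible)
        state[step["name"]] = llm(prompt)
    return state


def echo_llm(prompt: str) -> str:
    """Stand-in model so the sketch runs without any API access."""
    return f"<answer to: {prompt}>"


# Hypothetical two-step decomposition of a vague request.
STEPS = [
    {"name": "goal", "needs": ["task"], "template": "State the goal of: {task}"},
    {"name": "plan", "needs": ["goal"], "template": "List steps to achieve: {goal}"},
]

if __name__ == "__main__":
    result = run_program("tidy my notes", STEPS, echo_llm)
    for name, value in result.items():
        print(f"{name}: {value}")
```

Because the control flow lives in ordinary code, each intermediate value in `state` can be inspected, which is what makes the sub-steps debuggable: a wrong `goal` is visible before the `plan` step ever runs.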

The thing you might not have known you wanted to know: the *cost* of staying silent compounds. Underspecified requests handed to long delegated workflows don't just fail once. Frontier models silently corrupt ~25% of document content over extended relay tasks, with errors stacking through 50 round-trips without plateauing (see "Do frontier LLMs silently corrupt documents in long workflows?"). And there's a subtler reason clarification matters: the *way* you phrase an ambiguous request changes the answer. Identical questions get different information depending on emotional tone, an invisible bias the model won't flag (see "Does emotional tone in prompts change what information LLMs provide?"). Taken together, the corpus's design prescription is consistent: interrupt the premature commitment with structured clarification, dynamic grounding, or a richer interface, because once a model has guessed wrong, it has no reliable internal way to catch and fix the error on its own (see "What stops large language models from improving themselves?").
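A rough back-of-the-envelope calculation shows why this kind of compounding is easy to miss. The cited note reports only the aggregate (about 25% over 50 round-trips); the constant per-trip rate used below is an assumption made purely for illustration, not something the source measures.

```python
TOTAL_LOSS = 0.25   # aggregate corruption reported in the cited note
ROUND_TRIPS = 50    # relay length reported in the cited note

# ASSUMPTION: every round-trip retains the same fraction of the content.
per_trip_retention = (1 - TOTAL_LOSS) ** (1 / ROUND_TRIPS)
per_trip_loss = 1 - per_trip_retention


def corrupted_after(trips: int) -> float:
    """Fraction corrupted after `trips` hops under the constant-rate assumption."""
    return 1 - per_trip_retention ** trips


if __name__ == "__main__":
    print(f"per-trip loss: {per_trip_loss:.2%}")
    for n in (1, 10, 25, 50):
        print(f"after {n:2d} trips: {corrupted_after(n):.1%} corrupted")
```

Under that assumption each hop loses a little under 0.6% of the content, small enough to pass unnoticed on any single step, yet it reaches the reported quarter of the document by the fiftieth.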


Sources (11 notes)

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15–20% of this loss.

Why do language models skip the calibration step?

LLMs operate in static grounding mode—retrieving data and responding without clarification loops. Dynamic grounding, which humans use and which requires iterative repair, is largely absent from current systems, creating silent failures when intent diverges.

When should AI agents ask users instead of just searching?

Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.

Can LLMs truly update shared conversational common ground?

LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, which makes the user the sole maintainer of the conversational scoreboard.

Why can't conversational AI agents take the initiative?

Research shows LLMs including ChatGPT cannot initiate topics, plan strategically, or lead conversations because their training optimizes for responding to queries, not creating dialogue from agent goals. This passivity is reinforced by alignment objectives and masked by fluent-sounding outputs.

Do generated interfaces outperform text-based chat for most tasks?

Research shows users strongly prefer LLM-generated interactive interfaces—dashboards, tools, animations—over text blocks, especially for structured and information-dense tasks. Structured representation and iterative refinement reduce cognitive load.

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Can you turn an LLM into an agent by just fine-tuning?

Converting LLMs to action-capable systems requires four distinct stages: curating action-environment-user datasets, training for action grounding, integrating agent infrastructure with memory and tools, and rigorous safety evaluation. The surrounding system and harness determine whether actions are grounded or hallucinated.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Does emotional tone in prompts change what information LLMs provide?

GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
