INQUIRING LINE

How much does forcing single-choice answers damage alignment with complex intent?

This explores what's lost when a system is made to commit to one answer — collapsing a spread of plausible interpretations or outputs into a single pick — and whether that act of collapsing is itself a source of misalignment with rich or ambiguous user intent.


This reads the question as being about the *act of collapsing* (taking intent that is genuinely multi-valued and forcing it down to one output) rather than about answer formatting per se. The corpus suggests the damage is real and shows up at several layers, because models hold more than one candidate internally and single-choice pressure throws the rest away. The cleanest evidence that there *is* a spread to lose comes from Shanahan's 20-questions regeneration test: an LLM doesn't commit to one character or interpretation; it maintains a superposition and samples from it, so regenerating the same response yields different answers, each consistent with the prior context (Do large language models actually commit to a single character?). A single forced answer isn't 'the' answer; it's one draw from a distribution the format hides.
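The "one draw from a hidden distribution" point in the paragraph above can be made concrete with a toy calculation. This is a minimal sketch, not anything taken from the cited notes: the function name `collapse`, the three candidate readings, and their weights are all hypothetical, and choosing the highest-weight reading is just one assumed way a single-choice format might collapse the spread.

```python
import math


def collapse(weights):
    """Collapse a spread of candidate readings into a single pick.

    `weights` maps each candidate reading to a non-negative weight.
    Returns the chosen reading, the probability mass the pick throws
    away, and the Shannon entropy (in bits) of the full spread.
    """
    total = sum(weights.values())
    probs = {name: w / total for name, w in weights.items()}

    # Single-choice pressure: keep only the most probable reading.
    choice = max(probs, key=probs.get)

    # Everything the format hides from the user.
    discarded_mass = 1.0 - probs[choice]

    # How spread out the candidates were before collapsing.
    entropy_bits = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return choice, discarded_mass, entropy_bits


# Hypothetical readings of the request "make this email better".
readings = {
    "formal rewrite": 0.5,
    "shorter rewrite": 0.3,
    "friendlier rewrite": 0.2,
}

choice, discarded_mass, entropy_bits = collapse(readings)
print(choice, round(discarded_mass, 2), round(entropy_bits, 2))
```

Under these made-up weights, the single answer looks decisive while half of the plausible intent is silently dropped; nothing in the output itself reveals that.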

The harm becomes concrete when the input is actually ambiguous. Models are already bad at noticing multiplicity: on the AMBIENT benchmark, GPT-4 correctly disambiguates only 32% of cases versus 90% for humans, and the failure is described as an inability to hold multiple interpretations at once (Can language models recognize when text is deliberately ambiguous?). Force a single choice on top of that blindness and you get confident commitment to one reading of intent the user may not have meant. The fix the corpus points to runs in the opposite direction: instead of collapsing, *ask*. But standard single-turn reward training actively discourages that, optimizing for immediate helpfulness so that models answer passively rather than surfacing the ambiguity and discovering intent over several turns (Why do language models respond passively instead of asking clarifying questions?).

The most striking result is that even when a user *picks* the single output, that choice can mislead. Writers prefer AI rewrites 63% of the time yet object to the persona distortions those same rewrites smuggle in, and polish and distortion turn out to be entangled at the model level, so a single preference signal can't separate them (Can user preference guide AI writing tool alignment?). A single-choice target collapses two different questions ("do I like it?" and "does it preserve my voice?") into one vote, and alignment to the vote drifts from alignment to the intent. The same collapsing error appears in training signals: decomposing 'is this a good question' into separate attributes (clarity, relevance, specificity) beats training on one combined score, especially in clinical reasoning, where the right clarifying question changes the decision (Can models learn to ask genuinely useful clarifying questions?).
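A toy example shows how one vote can hide a regression on a second axis. This is an illustrative sketch only: the `Rating` class, the two axes, the numeric scores, and the assumption that an overall vote behaves like a mean of the axes are all hypothetical, and it does not reproduce the ALFA framework or the writing study described above.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    """Hypothetical per-attribute ratings on a 0..1 scale."""
    polish: float  # how clean and fluent the text reads
    voice: float   # how well the writer's own voice is preserved


def single_vote(original: Rating, rewrite: Rating) -> bool:
    """Return True if the rewrite wins one overall vote.

    Assumption: the overall impression is the mean of the two axes.
    """
    overall_original = (original.polish + original.voice) / 2
    overall_rewrite = (rewrite.polish + rewrite.voice) / 2
    return overall_rewrite > overall_original


def per_attribute(original: Rating, rewrite: Rating) -> dict:
    """Return the change on each axis instead of a single verdict."""
    return {
        "polish": rewrite.polish - original.polish,
        "voice": rewrite.voice - original.voice,
    }


# Made-up scores: the rewrite is much more polished but flattens the voice.
original = Rating(polish=0.5, voice=0.9)
rewrite = Rating(polish=0.95, voice=0.6)

vote = single_vote(original, rewrite)
deltas = per_attribute(original, rewrite)
print(vote, deltas)
```

With these numbers the rewrite wins the vote even though the voice score drops, which is the pattern the paragraph describes: optimizing against the vote alone would reward the distortion along with the polish.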

There's a related trap worth naming: a model can look like it's honoring complex intent while actually doing something cheap. Twelve of fourteen models score *worse* when constraints are removed, by as much as 38.5 percentage points. They were defaulting to the conservative option, not reasoning about the constraints, so a single confident answer can be conservative bias wearing the costume of careful alignment (Are models actually reasoning about constraints or just defaulting conservatively?). Single-choice formats reward exactly this kind of safe collapse.

The deeper takeaway is that 'alignment' isn't one axis you can satisfy with one answer. Alignment dimensions aren't interchangeable: lexical alignment buys task efficiency while emotional and prosodic alignment buy trust, and conflating them produces category errors like cold support bots (Do different types of alignment serve different conversational goals?). Ethical alignment and conversational competence are outright orthogonal, so an honest, harmless model can still violate basic pragmatic expectations (Can ethically aligned AI systems still communicate poorly?). Complex intent is multi-dimensional by nature. Forcing a single choice isn't just lossy compression of one answer; it silently picks which dimension of alignment to honor and discards the rest, and you usually can't tell from the output which one it kept.


Sources (8 notes)

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Can user preference guide AI writing tool alignment?

Writers prefer AI rewrites 63% of the time but object to systematic persona distortions those same rewrites introduce. Mitigation studies show polish and distortion are entangled at the model level—preference optimization produces both simultaneously.

Can models learn to ask genuinely useful clarifying questions?

The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Do different types of alignment serve different conversational goals?

A 2020–2025 systematic review shows lexical alignment drives task efficiency and comprehension, while emotional and prosodic alignment drive relational warmth and trust. Conflating them in design produces category errors—cold customer-service bots and evasive mental-health assistants.

Can ethically aligned AI systems still communicate poorly?

Research shows that HHH-aligned models can violate Gricean maxims, lose common ground, and mishandle context despite being honest and harmless. Pragmatic competence requires architectural changes that RLHF alone cannot deliver.

Next inquiring lines