Why do confident wrong answers hide in standard accuracy metrics?
When AI systems produce fluent but incorrect recommendations in high-stakes domains, standard accuracy evaluation may miss the failures entirely. What structural blind spot allows these errors to remain invisible?
The car-wash problem is diagnostic because it is simple. No specialized knowledge, no multi-step arithmetic, no ambiguous premises. Just a conflict between a surface heuristic (short distance implies walking) and an implicit constraint (the car must be co-located with the wash). Adrian Vermeule's "fluent and wrong" diagnosis from earlier in this body of work generalizes here: the failure is not in the model's verbal output, which sounds plausible, but in the unstated reasoning step that never happened.
The HOB authors enumerate where this pattern recurs in deployment. Medical triage: "mild symptom implies wait" versus the unstated constraint that some mild presentations require immediate evaluation. Legal interpretation: "standard clause implies sign" versus the unstated constraint that this clause appears in a non-standard contract. Financial planning: "low-cost option implies choose" versus the unstated constraint that the low-cost option excludes a required feature. In each case a salient surface heuristic, statistically dominant in training data, competes with an implicit constraint that must be derived from world knowledge. In each case the same pattern documented in the car-wash problem can produce a fluent confident recommendation that is wrong.
The accuracy-driven evaluation regime is structurally unable to surface this. A model that recommends "wait" 80 percent of the time on mild symptoms looks accurate when 80 percent of mild symptoms are in fact non-urgent. The failures concentrate in the 20 percent of cases where the implicit constraint is active, exactly the cases where wrong recommendations cause harm. Aggregate accuracy is the wrong metric; minimal-pair asymmetry, comparing the same model on matched cases with the constraint inactive versus active, is the diagnostic. Without it, the deployment risk is invisible to standard evaluation.
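The blind spot can be made concrete in a toy simulation. This is a sketch, not an implementation from the source: the 80/20 split mirrors the triage example above, and the always-"wait" model stands in for a system that applies the surface heuristic without deriving the constraint.

```python
# Toy triage set, mirroring the note's 80/20 split: 80% of "mild symptom"
# cases are truly non-urgent; 20% carry an active implicit constraint
# (urgent despite the mild surface presentation).
cases = [{"constraint_active": i < 20} for i in range(100)]

def heuristic_model(case):
    # Surface heuristic: "mild symptom implies wait". Ignores the constraint.
    return "wait"

def correct_label(case):
    return "escalate" if case["constraint_active"] else "wait"

def accuracy(subset):
    return sum(heuristic_model(c) == correct_label(c) for c in subset) / len(subset)

# Aggregate accuracy looks acceptable...
agg = accuracy(cases)                                             # 0.8

# ...but conditioning on the constraint exposes total failure where it matters.
on = accuracy([c for c in cases if c["constraint_active"]])       # 0.0
off = accuracy([c for c in cases if not c["constraint_active"]])  # 1.0

# Minimal-pair asymmetry: the gap between matched constraint-off and
# constraint-on cases. Near zero for a model that actually reasons about
# the constraint; maximal here, even though aggregate accuracy is 0.8.
asymmetry = off - on                                              # 1.0
```

The point of the sketch is that `agg` is the number a standard eval reports, while `asymmetry` is the number that predicts deployment harm; the two are computed from the same predictions.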
Source: Linguistics, NLP, NLU
Fluent confident wrong responses are invisible to standard accuracy evaluation in deployment domains where unstated constraints compete with surface features