How should product specifications measure alignment without naming the dimension?
This explores why a product spec that names an abstract target — 'make it aligned,' 'make it helpful' — tends to measure the wrong thing, and how the corpus suggests specifying observable outcomes and causal behavior instead of the dimension itself.
This explores how a product spec can pin down the right behavior without writing 'alignment' into the requirement, because once you name the dimension and measure it directly, the metric stops tracking what you care about. The cleanest warning comes from work showing that identical performance numbers can sit on top of completely different internal structure: a model can contain every linearly decodable feature a task needs while its internal organization is fractured, which leaves it vulnerable to perturbation and distribution shift that standard evaluation never sees (Can models be smart without organized internal structure?). If a broken system can satisfy the number you put in the spec, the fault lies in the choice of metric, not in where you set the threshold.
The deeper issue is that 'alignment' isn't one dimension to measure; it's several, and they serve different goals. A 2020–2025 systematic review finds that lexical alignment drives task efficiency and comprehension, while emotional and prosodic alignment drive relational warmth and trust. Collapse them into a single 'aligned' score and you get category errors: cold support bots, evasive mental-health assistants (Do different types of alignment serve different conversational goals?). A spec that names the dimension forces exactly this collapse. So the move is to specify the conversational outcome each context actually needs (did the user finish the task? did they feel heard?) and let the named dimension stay implicit.
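A minimal sketch of what "specify the outcome each context needs" could look like in practice. The context labels, the `Session` fields, and the `OUTCOME_CHECKS` table are all hypothetical names chosen for illustration; a real spec would plug in whatever observable signals the product actually logs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Session:
    context: str          # hypothetical label, e.g. "support" or "wellbeing"
    task_completed: bool  # observable: did the user finish the task?
    felt_heard: bool      # observable: e.g. a one-question exit survey


# Each context names the outcome it needs; no "alignment" score appears anywhere.
OUTCOME_CHECKS: Dict[str, Callable[[Session], bool]] = {
    "support": lambda s: s.task_completed,
    "wellbeing": lambda s: s.felt_heard,
}


def pass_rate(sessions: List[Session], context: str) -> float:
    """Share of sessions in `context` that met that context's own outcome."""
    relevant = [s for s in sessions if s.context == context]
    if not relevant:
        raise ValueError(f"no sessions recorded for context {context!r}")
    check = OUTCOME_CHECKS[context]
    return sum(check(s) for s in relevant) / len(relevant)


sessions = [
    Session("support", task_completed=True, felt_heard=False),
    Session("support", task_completed=False, felt_heard=True),
    Session("wellbeing", task_completed=False, felt_heard=True),
]
support_rate = pass_rate(sessions, "support")
wellbeing_rate = pass_rate(sessions, "wellbeing")
```

Note that the second support session "felt heard" but still fails, because that context's requirement is task completion. Keeping the checks separate per context is what prevents the collapse into a single score.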
The most tempting shortcut, measuring alignment by what users prefer, is the one the corpus most directly refutes. Writers pick AI rewrites 63% of the time, then object to the systematic persona distortions those same rewrites introduce; mitigation studies find polish and distortion entangled at the model level, so optimizing the preference signal produces both at once (Can user preference guide AI writing tool alignment?). Preference is a proxy that looks like the dimension and isn't. The same trap shows up in reliability specs: fixing the seed and pinning temperature to zero gives you the same output every time, but consistency is not reliability. That output is still a single draw from the model's probability distribution, and McDonald's omega computed across 100 repetitions exposes the gap (Does setting temperature to zero actually make LLM outputs reliable?).
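The gap between consistency and reliability can be illustrated with a toy model. This is a sketch under stated assumptions: `ANSWER_DIST` is a made-up output distribution, and `modal_agreement` is a deliberately simple stand-in statistic, not McDonald's omega, which requires fitting a factor model.

```python
import random
from collections import Counter
from typing import Dict, List

# Hypothetical output distribution for one prompt (assumed for illustration).
ANSWER_DIST: Dict[str, float] = {"approve": 0.55, "reject": 0.45}


def greedy_draw(dist: Dict[str, float]) -> str:
    """Temperature-zero analogue: always return the most probable answer."""
    return max(dist, key=dist.get)


def sampled_draws(dist: Dict[str, float], n: int, seed: int) -> List[str]:
    """Draw n answers from the distribution itself."""
    rng = random.Random(seed)
    answers = list(dist)
    weights = [dist[a] for a in answers]
    return rng.choices(answers, weights=weights, k=n)


def modal_agreement(outputs: List[str]) -> float:
    """Fraction of outputs that match the most common output."""
    counts = Counter(outputs)
    return counts.most_common(1)[0][1] / len(outputs)


greedy_outputs = [greedy_draw(ANSWER_DIST) for _ in range(100)]
sampled_outputs = sampled_draws(ANSWER_DIST, n=100, seed=0)

greedy_agreement = modal_agreement(greedy_outputs)    # perfectly consistent
sampled_agreement = modal_agreement(sampled_outputs)  # reveals the spread
```

The greedy run agrees with itself every time, yet the underlying distribution is close to a coin flip. A spec that only checks repeat-run agreement at temperature zero would pass this system.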
What actually works is specifying causal behavior rather than the target. One striking result: when agents are regularized to stay consistent when intervention pathways are nullified, so that they judge a partner's suggestion by whether it causally changes the outcome rather than by whether it sounds plausible, genuine partner-awareness and 'common ground' emerge as a byproduct, with no explicit alignment reward in the loop (Why do standard alignment methods ignore partner interventions?). The dimension you wanted was never named; it fell out of measuring causal impact. That's the template for a spec: define the unnamed quality by the behavior it produces under a probe, not by a score with its label on it.
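One way to read "behavior under a probe" is a counterfactual check: run the agent with a partner's suggestion, run it again with the suggestion nullified, and see whether the outcome moved. The sketch below is an evaluation-time probe, not the training-time regularizer the cited work describes. The `Agent` interface, the toy agents, and `CASES` are hypothetical, and the comparison assumes deterministic agents.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical agent interface: (task, partner suggestion or None) -> outcome.
Agent = Callable[[str, Optional[str]], str]


def suggestion_changed_outcome(agent: Agent, task: str, suggestion: str) -> bool:
    """True if nullifying the suggestion changes what the agent produces."""
    with_suggestion = agent(task, suggestion)
    nullified = agent(task, None)  # same task, intervention pathway removed
    return with_suggestion != nullified


def causal_uptake_rate(agent: Agent, cases: List[Tuple[str, str]]) -> float:
    """Share of (task, suggestion) cases where the suggestion had causal impact."""
    if not cases:
        raise ValueError("need at least one (task, suggestion) case")
    changed = sum(suggestion_changed_outcome(agent, t, s) for t, s in cases)
    return changed / len(cases)


def ignoring_agent(task: str, suggestion: Optional[str]) -> str:
    """Toy agent that never uses the partner's input."""
    return f"plan for {task}"


def listening_agent(task: str, suggestion: Optional[str]) -> str:
    """Toy agent whose outcome depends on the partner's input."""
    base = f"plan for {task}"
    return base if suggestion is None else f"{base} using {suggestion}"


CASES = [("deploy", "a canary release"), ("migrate", "a dry run")]
ignoring_rate = causal_uptake_rate(ignoring_agent, CASES)
listening_rate = causal_uptake_rate(listening_agent, CASES)
```

Nothing in the probe is labeled "alignment" or "partner-awareness". The spec requirement would be stated on the uptake rate itself, and a stricter version would also check that the change improved the outcome.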
If you do want something checklist-shaped, the prompt-quality work shows how to do it honestly — six evaluable dimensions grounded in communication theory, where improving one cascades into others, revealing a structured space rather than a flat scorecard (Can we measure prompt quality independent of model outputs?). The lesson that ties these together: name the concrete sub-behaviors and their interactions, and the abstract dimension takes care of itself. Name the dimension directly and you've built a target that masking, entanglement, and false consistency are all free to game.
Sources (6 notes)
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
A 2020–2025 systematic review shows lexical alignment drives task efficiency and comprehension, while emotional and prosodic alignment drive relational warmth and trust. Conflating them in design produces category errors—cold customer-service bots and evasive mental-health assistants.
Writers prefer AI rewrites 63% of the time but object to systematic persona distortions those same rewrites introduce. Mitigation studies show polish and distortion are entangled at the model level—preference optimization produces both simultaneously.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Regularizing agents to maintain consistency when intervention pathways are nullified forces them to evaluate suggestions by causal impact rather than surface plausibility. Common ground alignment emerges as a byproduct without explicit reward.
Research identifies six evaluable dimensions—Communication, Cognition, Instruction, Logic, Hallucination, and Responsibility—with 20 sub-criteria based on Grice, cognitive load theory, and instructional design. Improvements in one dimension cascade to others, revealing prompt quality as a structured space rather than a flat checklist.