Language Understanding and Pragmatics · Psychology and Social Cognition · Conversational AI Systems

Can user preference guide AI writing tool alignment?

If writers prefer AI-polished text but object to the persona shifts it introduces, does optimizing for preference actually solve the alignment problem or obscure it?

Note · 2026-05-03 · sourced from Co Writing Collaboration

The persona-distortion study (N=2,939 writers) produced two findings that reveal a structural problem with using user preference as the optimization target for AI writing tools. The first: writers strictly preferred the AI-rewritten version of their own text 63% of the time, with 52% saying it better reflected their opinion than what they wrote. The second: when researchers measured the AI's edits across 29 dimensions, writers found many of the systematic shifts objectionable — being made to seem more confident, wealthier, more educated, and more emotionally regulated than they are. Same writers, same artifact, contradictory verdicts. The mitigation studies foreclosed the obvious resolution: the textual properties producing preference (clarity, polish, flow) and the textual properties producing distortion (demographic shift, emotional compression, opinion homogenization) are entangled at the model level. Removing one removes the other.

This is not a calibration failure that better RLHF would fix. It is a structural property of the preference signal itself. When writers are asked "do you prefer this version?" they evaluate on the polish dimension, where the AI is unambiguously better. When writers are shown the systematic demographic and stylistic shifts and asked "do you endorse being represented this way?" they evaluate on the misrepresentation dimension, where the AI is unambiguously worse. Both verdicts are correct at the level of analysis at which they are conducted. Preference optimization aggregates the first verdict and produces models that maximize polish while maximizing distortion as a side effect, because the side effect is invisible at the moment of preference judgment.
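A toy simulation makes that blindness concrete (everything here is an illustrative assumption, not data or code from the study): if polish and distortion are driven by a shared latent edit intensity, then selecting on a preference score that sees only polish also selects for near-maximal distortion.

```python
# Toy model of entangled edit properties; all numbers and names are
# illustrative assumptions, not measurements from the study.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Entanglement assumption: one latent "edit intensity" drives both the
# polish a writer perceives and the persona distortion they don't.
intensity = rng.uniform(0.0, 1.0, n)
polish = intensity + rng.normal(0.0, 0.1, n)
distortion = intensity + rng.normal(0.0, 0.1, n)

# The preference judgment observes only the polish dimension.
preference = polish

# Best-of-n selection under the preference signal.
chosen = int(np.argmax(preference))
print(f"chosen polish:     {polish[chosen]:.2f}")
print(f"chosen distortion: {distortion[chosen]:.2f}")  # also near the maximum
print(f"corr(polish, distortion): {np.corrcoef(polish, distortion)[0, 1]:.2f}")
```

The point of the sketch: no amount of tuning the preference signal changes the outcome, because distortion never enters the quantity being maximized.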

The implication is more disruptive than the persona-distortion finding alone suggests. RLHF and preference-tuning workflows assume user preference is a coherent target. When the preferred-and-objectionable properties of an artifact are entangled, preference is not a coherent target — it is a projection that throws away the dimension along which the harm lives. No amount of preference data can recover what preference judgments don't measure. Aligning to user preference under entanglement is not "imperfect alignment we'll improve over time"; it is alignment to a target that systematically produces the harm the user objects to when shown it.
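The projection claim can be stated compactly (an illustrative formalization; the vector q and weight w are assumptions, not notation from the source):

```latex
% Illustrative formalization; q and w are assumptions, not source notation.
q(x) = \bigl(\mathrm{polish}(x),\ \mathrm{faithfulness}(x)\bigr), \qquad
\mathrm{pref}(x) = \langle w,\, q(x) \rangle, \quad w = (1, 0)
\;\Longrightarrow\;
\frac{\partial\,\mathrm{pref}}{\partial\,\mathrm{faithfulness}} \equiv 0 .
```

Because pref is constant on every level set of polish, preference data of any size carries no information about the faithfulness coordinate; under entanglement, the maximizer of pref also maximizes distortion.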

The constructive move: alignment workflows must run both a preference measure and an orthogonal probe (representation faithfulness, demographic-shift measurement, opinion-compression detection) and treat the two as a constrained multi-objective problem, not a single optimization target. Where preference and faithfulness diverge, the divergence is the alignment problem made visible; suppressing it by collapsing to preference reproduces False Punditry at the model architecture level.
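A minimal sketch of that two-signal selection rule, assuming hypothetical scoring functions (the source proposes the structure, not this implementation; `FAITHFULNESS_FLOOR` and both score fields are invented for illustration):

```python
# Sketch: preference maximization under a faithfulness constraint.
# Hypothetical names throughout; the threshold would be set empirically.
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    preference: float    # e.g. a reward-model / preference score
    faithfulness: float  # e.g. aggregate of demographic-shift and
                         # opinion-compression probes; higher = more faithful

FAITHFULNESS_FLOOR = 0.8  # assumed constraint threshold

def select(candidates: list[Candidate]) -> tuple[Candidate, bool]:
    """Pick the most-preferred candidate that passes the faithfulness probe.

    Returns (choice, diverged): diverged is True when the unconstrained
    preference winner fails the probe, i.e. exactly the cases the note
    argues should be surfaced rather than optimized away.
    """
    unconstrained = max(candidates, key=lambda c: c.preference)
    feasible = [c for c in candidates if c.faithfulness >= FAITHFULNESS_FLOOR]
    if not feasible:
        # Nothing passes the probe: fall back to the most faithful
        # candidate and flag the conflict instead of hiding it.
        return max(candidates, key=lambda c: c.faithfulness), True
    constrained = max(feasible, key=lambda c: c.preference)
    return constrained, constrained is not unconstrained
```

The flag, not the winner, is the deliverable: logging where the constrained and unconstrained choices differ turns the preference/faithfulness divergence into a measurable signal instead of a silently discarded dimension.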


Original note title: user preference cannot serve as the alignment target for AI writing assistance — desirable polish and undesirable persona distortions are entangled at the model level