Can user preference guide AI writing tool alignment?
If writers prefer AI-polished text but object to the persona shifts it introduces, does optimizing for preference actually solve the alignment problem or obscure it?
The persona-distortion study (N=2,939 writers) produced two findings that reveal a structural problem with using user preference as the optimization target for AI writing tools. The first: writers strictly preferred the AI-rewritten version of their own text 63% of the time, with 52% saying it better reflected their opinion than what they wrote. The second: when researchers measured the AI's edits across 29 dimensions, writers found many of the systematic shifts objectionable — being made to seem more confident, more wealthy, more educated, more emotionally regulated than they are. Same writers, same artifact, contradictory verdicts. The mitigation studies foreclosed the obvious resolution: the textual properties producing preference (clarity, polish, flow) and the textual properties producing distortion (demographic shift, emotional compression, opinion homogenization) are entangled at the model level. Removing one removes the other.
This is not a calibration failure that better RLHF would fix. It is a structural property of the preference signal itself. When writers are asked "do you prefer this version?" they evaluate on the polish dimension, where the AI is unambiguously better. When writers are shown the systematic demographic and stylistic shifts and asked "do you endorse being represented this way?" they evaluate on the misrepresentation dimension, where the AI is unambiguously worse. Both verdicts are correct at the level of analysis at which they are conducted. Preference optimization aggregates the first verdict and produces models that maximize polish while maximizing distortion as a side effect, because the side effect is invisible at the moment of preference judgment.
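A toy sketch makes the projection concrete. The numbers below are hypothetical, not taken from the study: each candidate text has two latent qualities, but the preference judgment only observes polish, so the optimizer selects along that axis alone.

```python
# Hypothetical scores for illustration only: each candidate has two latent
# dimensions, but the preference signal exposes just one of them.
candidates = {
    "writer_original": {"polish": 0.4, "faithfulness": 1.0},
    "ai_rewrite":      {"polish": 0.9, "faithfulness": 0.3},
}

# Preference optimization: argmax over the observable (projected) dimension.
preferred = max(candidates, key=lambda name: candidates[name]["polish"])

# The harm lives on the dimension the projection threw away.
print(preferred)                              # -> ai_rewrite
print(candidates[preferred]["faithfulness"])  # -> 0.3, invisible to the signal
```

No amount of additional preference data changes the outcome, because the faithfulness column never enters the objective.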
The implication is more disruptive than the persona-distortion finding alone suggests. RLHF and preference-tuning workflows assume user preference is a coherent target. When the preferred-and-objectionable properties of an artifact are entangled, preference is not a coherent target — it is a projection that throws away the dimension along which the harm lives. No amount of preference data can recover what preference judgments don't measure. Aligning to user preference under entanglement is not "imperfect alignment we'll improve over time"; it is alignment to a target that systematically produces the harm the user objects to when shown it.
The constructive move: alignment workflows must score both preference and an orthogonal probe (representation faithfulness, demographic-shift measurement, opinion-compression detection) and treat the two as a multi-objective constraint, not a single optimization target. Where preference and faithfulness diverge, the divergence is the alignment problem made visible; suppressing it by collapsing to preference reproduces False Punditry at the model-architecture level.
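A minimal sketch of that constraint, assuming two hypothetical probe scores per text (the `polish` and `faithfulness` fields, the floor value, and the function name are all illustrative, not from the source):

```python
def evaluate(candidate, baseline, faithfulness_floor=0.7):
    """Joint check: accept only if preference improves AND faithfulness holds.

    Both probe scores are assumed to come from independent evaluators;
    neither is collapsed into the other.
    """
    gain = candidate["polish"] - baseline["polish"]
    if gain > 0 and candidate["faithfulness"] >= faithfulness_floor:
        return "accept"
    if gain > 0:
        # Divergence between the probes is the alignment problem made
        # visible; surface it rather than optimizing through it.
        return "flag_divergence"
    return "reject"

writer  = {"polish": 0.4, "faithfulness": 1.0}
rewrite = {"polish": 0.9, "faithfulness": 0.3}
print(evaluate(rewrite, writer))  # -> flag_divergence
```

The design point is that the divergent case gets its own outcome: a single scalar objective would have to rank it, while the constraint formulation forces it into view.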
Source: Co Writing Collaboration
Related concepts in this collection
- Do writers actually prefer AI-edited versions of their own text?
  When writers compose opinions and then edit AI-generated alternatives, which version do they choose? Understanding this preference matters because it determines whether AI-assisted text gets treated as authentic personal expression in public discourse.
  Pole A: revealed preference for AI-rewritten text
- Can AI writing assistance remove distortion without losing appeal?
  When researchers tried to correct AI persona distortions through reward model training, the fixes reduced user preference for the text. This raises a fundamental question: are the distortions and desirable properties structurally inseparable?
  Pole B: stated objection to distortions; entanglement claim
- Can generative AI scale personality-targeted political persuasion?
  Does removing the human-writing bottleneck through generative AI make it feasible to target voters at scale based on individual psychological traits? This matters because it could reshape political microtargeting economics and capabilities.
  Same pattern at population scale: optimizing engagement entangles desirable reach with undesirable manipulation
- Do LLMs in conversational recommendation systems use collaborative or content knowledge?
  Conversational recommenders powered by LLMs might rely on either collaborative signals (user interaction patterns) or content/context knowledge (semantic understanding). Understanding which signal dominates would reveal how to design and deploy these systems effectively.
  Adjacent: the same alignment problem recurs; the optimization target measures one dimension while the harm lives on another
Original note title: user preference cannot serve as the alignment target for AI writing assistance — desirable polish and undesirable persona distortions are entangled at the model level