Psychology and Social Cognition

Can AI writing assistance remove distortion without losing appeal?

When researchers tried to correct AI persona distortions through reward model training, the fixes reduced user preference for the text. This raises a fundamental question: are the distortions and desirable properties structurally inseparable?

Note · 2026-05-01 · sourced from Co Writing Collaboration
How accurately can language models simulate human personalities? How do people build trust with conversational AI?

The persona-distortion researchers tested whether the objectionable distortions could be removed without harming the properties writers value. They trained reward models on their own experimental data — 10,008 paragraphs and 2,903,596 ratings — to steer AI outputs toward faithful representation of writer stance. The mitigation worked at the level of measurement: distortions were significantly reduced. But the same intervention reduced user acceptance. Writers preferred the un-mitigated AI text more than the faithful-but-distortion-corrected version.

This suggests that the textual properties producing distortion are not independent of the textual properties producing user preference. They share mechanisms. The same generative tendencies that make AI text feel polished, confident, and clear also make it more opinionated, more demographically privileged, and more emotionally compressed. Removing the distortion removes some of what writers were preferring.

The implication is structural rather than tunable. A model that produces text writers prefer over their own work is a model that distorts persona; a model that does not distort persona is a model writers do not prefer. There may be no settings that simultaneously preserve user satisfaction and prevent persona drift, because the satisfaction and the drift are two views of the same underlying behavior. This forecloses the easy assumption that better RLHF or better fine-tuning can solve the persona-distortion problem without affecting what makes AI writing assistance attractive in the first place.


Source: Co Writing Collaboration

Original note title

Writers object to AI persona distortions yet continue to prefer AI-assisted text — desirable and undesirable properties are entangled at the model level