Language Understanding and Pragmatics · Psychology and Social Cognition

Why do LLMs predict concession-based persuasion so consistently?

Do RLHF training practices cause language models to systematically overpredict conciliatory persuasion tactics, even when the dialogue context points toward adversarial intent? This matters for threat detection and negotiation support systems.

Note · 2026-02-22 · sourced from Theory of Mind

When asked to infer persuasion intentions from dialogue, most LLMs exhibit a systematic bias: they predict intentions "characterized by making the other person feel accepted through concessions, promises, or benefits" — regardless of whether the actual dialogue context supports this inference.

The hypothesis is that RLHF (Reinforcement Learning from Human Feedback) is the mechanism. RLHF "tends to prioritize safety and politeness" during preference optimization, and this training signal bleeds into intention prediction. The model has learned that conciliatory, benefit-oriented responses are preferred by human raters, and this preference leaks into its predictions about what other agents will do — it projects its own trained disposition onto the agents it's modeling.

This is a specific, measurable instance of a broader pattern: alignment training shapes not just what the model says but how it models others. If RLHF teaches the model that accommodation is preferred, the model begins to assume accommodation is what agents do. It becomes harder for the model to represent genuinely adversarial, manipulative, or hardball persuasion strategies because its own training bias makes these strategies less probable in its prediction space.
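
To make "measurable" concrete, here is a minimal sketch of how the bias could be quantified. Everything in it is hypothetical: the label set, the toy dialogues, and the `predict_intention` callable (which in practice would wrap an LLM prompted to choose one intention label). The metric is the gap between the model's rate of conciliatory predictions and the rate in the gold annotations.

```python
from collections import Counter

# Hypothetical label set; real persuasion-strategy taxonomies vary.
CONCILIATORY = {"concession", "promise", "benefit_appeal"}

def concession_bias(dialogues, gold_labels, predict_intention):
    """Gap between the model's rate of conciliatory intention
    predictions and the rate in the gold annotations.

    predict_intention: callable mapping a dialogue string to one
    intention label -- in practice, an LLM prompted to pick a label.
    """
    preds = [predict_intention(d) for d in dialogues]
    pred_rate = sum(p in CONCILIATORY for p in preds) / len(preds)
    gold_rate = sum(g in CONCILIATORY for g in gold_labels) / len(gold_labels)
    # A positive gap means the model over-predicts conciliatory intent.
    return pred_rate - gold_rate, Counter(preds)

# Toy stand-in mimicking the bias described above: always predicts concession.
biased_model = lambda dialogue: "concession"

gap, counts = concession_bias(
    ["A: Pay by Friday or we sue.", "A: I can offer free shipping."],
    ["threat", "benefit_appeal"],
    biased_model,
)
print(f"bias gap: {gap:+.2f}", counts)  # bias gap: +0.50 Counter({'concession': 2})
```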

The practical consequence for persuasion-aware AI: a model biased toward predicting concessions will systematically underestimate adversarial intent. In negotiation support, threat detection, or social manipulation detection, this bias translates directly into blind spots — the model expects cooperation where exploitation is occurring.
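
If the bias can be measured, it can be partially corrected before it reaches a downstream detector. One option, in the spirit of contextual calibration (Zhao et al., 2021: reweight predictions by the probabilities the model assigns to a content-free input), is sketched below; the label set and every number are made up for illustration.

```python
import numpy as np

def calibrate(label_probs, prior_probs):
    """Divide out the model's content-free prior over intention labels,
    then renormalize (contextual-calibration style)."""
    adjusted = np.asarray(label_probs, dtype=float) / np.asarray(prior_probs, dtype=float)
    return adjusted / adjusted.sum()

labels = ["concession", "threat", "pressure"]
# Hypothetical numbers: on an empty dialogue the model already leans
# heavily toward 'concession' -- its trained disposition.
prior = [0.70, 0.15, 0.15]
pred  = [0.55, 0.30, 0.15]  # raw prediction for some actual dialogue

for lbl, p in zip(labels, calibrate(pred, prior)):
    print(f"{lbl:10s} {p:.2f}")
# 'threat' overtakes 'concession' once the prior is divided out.
```

This only rebalances the label distribution; it cannot recover adversarial cues the model never represented in the first place.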


Source: Theory of Mind

Original note: RLHF biases LLMs toward predicting concession-based persuasion intentions regardless of dialogue context