Why does extending reasoning traces worsen persona consistency?
This explores why making a model think longer — more reasoning steps before it answers — tends to erode how faithfully it stays in character. The corpus doesn't have one paper that names this exact pairing, but several notes converge on a mechanism: a reasoning trace is a chain, and persona consistency is fragile per-link, so a longer chain is simply more places for the persona to slip.
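One rough way to make the "more links, more slips" intuition concrete is a toy calculation. It assumes, purely for illustration, that each reasoning step keeps the persona intact independently with the same probability p. Neither the independence assumption nor the value of p comes from the notes; real drift is likely correlated across steps.

```latex
% Illustrative assumption: each step independently keeps the persona with probability p.
P(\text{persona intact after } n \text{ steps}) = p^{n}

% Example with an assumed p = 0.99 (a hypothetical value, not a measured one):
0.99^{10} \approx 0.904, \qquad 0.99^{100} \approx 0.366
```

Under those assumptions, even a step that holds the persona 99% of the time leaves only about a one-in-three chance of an unbroken persona after a hundred steps.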
The clearest piece comes from adversarial testing. Reasoning models like o1 and R1 turn out to be *more* vulnerable to multi-turn pressure than plain models, because every extra step in an elaboration is an intervention point where one corrupted move propagates through everything after it (see "Why do reasoning models fail under manipulative prompts?"). The same geometry that lets reasoning compound a correct insight also lets it compound a drift. And drift is the right word: persona failure isn't a single break but an accumulation of local slippage within a turn, global slippage across turns, and factual self-contradiction (see "Can training user simulators reduce persona drift in dialogue?"). More reasoning tokens mean more runway for all three.
There's a deeper reason the runway is dangerous: the model isn't optimizing for persona while it reasons. Persona adherence barely improves with raw capability (Claude 3.5 Sonnet gained only 2.97% over GPT-3.5 on consistency) because standard training rewards per-turn answer quality, not cross-turn coherence (see "Does model capability translate to better persona consistency?"). So as the trace extends, the model pours effort into getting the *answer* right, and the persona, which nothing in the objective is protecting, gets quietly abandoned. Worse, the reasoning steps themselves may not even carry the persona's content: traces can function as computational scaffolding largely decoupled from semantic meaning (see "Do reasoning traces need to be semantically correct?"), and generic chain-of-thought routinely ignores the user-specific context it would need to stay in character (see "Why does chain-of-thought reasoning fail for personalization?"). Length here is dilution: more neutral problem-solving text crowding out the identity.
The instability underneath makes it worse. Run the same persona prompt repeatedly and the variance across runs matches or exceeds the variance across *different* personas, meaning what looks like a stable character is often just model uncertainty wearing a costume (see "Why do LLM persona prompts produce inconsistent outputs across runs?"). A long trace gives that uncertainty many sampling opportunities to wander, and reasoning models already wander structurally: they explore invalid paths and switch away from good ones prematurely (see "Why do reasoning models abandon promising solution paths?"). A persona is one such path, and nothing stops the model from drifting off it.
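The run-versus-persona comparison can be sketched as a simple variance decomposition. This is a minimal illustration, not the method used in the cited note: the function name `variance_decomposition`, the idea of scoring each run with a single number, and the toy scores themselves are all assumptions made up for the example.

```python
from statistics import mean, pvariance


def variance_decomposition(scores_by_persona):
    """Return (within_run_variance, between_persona_variance).

    scores_by_persona maps a persona name to a list of numeric scores,
    one score per repeated run of the same persona prompt.
    """
    runs_per_persona = list(scores_by_persona.values())
    # Average spread across repeated runs of the same persona.
    within = mean([pvariance(runs) for runs in runs_per_persona])
    # Spread of the per-persona averages around their common mean.
    between = pvariance([mean(runs) for runs in runs_per_persona])
    return within, between


# Toy numbers invented for illustration only (not taken from any cited note).
toy_scores = {
    "skeptic": [2.0, 4.0, 3.0, 5.0],
    "optimist": [3.0, 5.0, 4.0, 6.0],
    "pragmatist": [1.0, 3.0, 2.0, 4.0],
}

within, between = variance_decomposition(toy_scores)
# True when rerunning one persona moves the output at least as much
# as switching to a different persona does.
run_noise_dominates = within >= between
```

When `run_noise_dominates` is true for real outputs, the persona label explains less of the variation than sampling noise does, which is the situation the paragraph above describes.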
The interesting twist, and the part most worth taking away, is that the fix isn't "reason less." It's to make consistency something the process is rewarded for rather than left to chance: train simulators against explicit consistency signals (see "Can training user simulators reduce persona drift in dialogue?"), or distill persona-aware traces so the thinking itself carries the user's context instead of generic reasoning (see "Why does chain-of-thought reasoning fail for personalization?"). The trace length was never the disease; the absence of any force holding the persona steady across that length was.
Sources (7 notes)
GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
Claude 3.5 Sonnet achieved only a 2.97% improvement over GPT-3.5 on persona consistency despite a massive capability gap, suggesting persona adherence is orthogonal to model scaling. Standard training objectives optimize for per-turn quality, not cross-turn coherence.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Generic chain-of-thought underperforms for personalization because it ignores user context. Fine-tuning destroys reasoning capacity entirely. Self-distillation lets models generate customized thinking traces that maintain both depth and relevance.
When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.