Why does chain-of-thought reasoning fail for personalization?
Standard reasoning traces produce logically sound but personally irrelevant answers. This note explores why generic thinking fails to anchor to user preferences, and what might fix it.
PRIME documents a two-layer failure in applying reasoning to personalization:
Layer 1: Generic CoT fails. Enabling standard chain-of-thought often underperforms the non-thinking baseline for personalization tasks. The uncustomized reasoning trace "merely scratches the surface, seeking broad answers rather than to-the-point, user-specific responses." Generic reasoning explores the problem space without being anchored to the specific user's preferences, values, or communication style — producing reasoning that is logically sound but personally irrelevant.
Layer 2: Fine-tuning destroys thinking capacity. The "fast thinking" training paradigm (direct input→output mapping) turns fine-tuned LLMs into specialist models overfitted to the target space. They lose the generalist capability of generating meaningful intermediate thoughts when prompted. A common error is token repetition — the model has been trained to shortcut directly to outputs and can no longer produce coherent intermediate reasoning. This is not a minor degradation — the model structurally cannot think anymore.
The fix: personalized self-distillation. The model generates its own personalized thinking traces (using its pre-fine-tuning generalist capability), then trains on those traces alongside the standard fine-tuning objective. This produces reasoning that is both user-specific (anchored to the individual's preferences) and deep (maintaining the capacity for intermediate thought). The self-distillation approach leverages the model's own capabilities rather than requiring external reasoning trace data.
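The data-construction step described above can be sketched in a few lines. This is a minimal illustration, not PRIME's actual pipeline: `generate` is a hypothetical stand-in for a decode call to the frozen pre-fine-tuning model, and the prompt and tag formats are assumptions.

```python
# Sketch of personalized self-distillation data construction, assuming:
# (1) the frozen generalist model is prompted to produce a user-conditioned
#     thinking trace, and (2) the fine-tuning target is trace + answer, so
# the tuned model retains the habit of intermediate thought.

def generate(prompt: str) -> str:
    # Placeholder for the frozen base model's decode call; in practice this
    # is the LLM *before* any fast-thinking fine-tuning, which still has the
    # generalist capability to produce coherent intermediate reasoning.
    return f"<think>Considering this user's stated preferences: {prompt[:40]}...</think>"

def build_self_distilled_example(user_profile: str, query: str, target: str) -> dict:
    """Create one training example whose label includes a personalized trace."""
    trace_prompt = (
        f"User profile: {user_profile}\n"
        f"Query: {query}\n"
        "Think step by step about what THIS user would want before answering."
    )
    trace = generate(trace_prompt)  # user-specific reasoning, not generic CoT
    return {
        "input": f"{user_profile}\n{query}",
        # Training target = trace followed by the ground-truth answer, so the
        # fine-tuned model learns to think before responding instead of
        # shortcutting directly from input to output.
        "target": f"{trace}\n{target}",
    }

example = build_self_distilled_example(
    user_profile="Prefers concise, code-first answers",
    query="How do I parse JSON in Python?",
    target="Use json.loads(...).",
)
```

The key design point is that the trace in the label comes from the model itself rather than from external reasoning-trace data, which is what makes this self-distillation.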
This finding extends the reasoning/judgment split documented elsewhere. Per When does explicit reasoning actually help model performance?, personalization is a clear case of "continuous nuanced judgment": matching preferences, style, and implicit expectations cannot be reduced to logical derivation steps. But PRIME shows the split is not absolute: personalized reasoning can help, provided the reasoning traces themselves are customized to the user.
The connection to Why does asking models to think first hurt performance? is structural: both findings demonstrate that thinking initially hurts but becomes helpful after the thinking process is adapted to the domain. In PRIME's case, self-distillation is the adaptation mechanism; in the TPO case, RL training is. The shared principle: raw thinking capability must be tuned to the domain before it adds value.
Source: Personalization
Related concepts in this collection
- When does explicit reasoning actually help model performance?
  Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
  Relation: personalization as a specific instance of the judgment-degradation zone.
- Why does asking models to think first hurt performance?
  Initial prompts to generate internal thoughts degrade instruction-following performance. What reverses this harm, and can thinking become useful beyond math and logic?
  Relation: parallel finding; thinking hurts until adapted, with self-distillation and RL as distinct adaptation mechanisms.
- Does reflection in reasoning models actually correct errors?
  When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.
  Relation: personalized thinking is a case where reflection must be customized to add value.
- Can user preferences be learned from just ten questions?
  Explores whether adaptive question selection can efficiently infer user-specific reward coefficients without historical data or fine-tuning. This matters for scaling personalization without per-user model updates.
  Relation: PReF addresses the same "generic fails, personalized succeeds" pattern at the reward level. A single reward function underperforms because it flattens individual preferences; factored rewards capture user-specific dimensions just as personalized thinking traces capture user-specific reasoning patterns.
Original note title: generic reasoning underperforms non-thinking for personalization tasks — personalized thinking via self-distillation is required because fast-thinking fine-tuning destroys generalist reasoning capability