Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models
Evidence suggests these biases originate from artifacts in human training data. In this work, we systematically investigate the relationship between training data biases and preference model miscalibration across five idiosyncratic features of language model generations: length, structure, jargon, sycophancy, and vagueness. Using controlled counterfactual pairs, we first quantify the extent to which preference models favor responses with magnified biases (skew), finding that this preference occurs in >60% of instances and that model preferences show high miscalibration (≈40%) relative to human preferences. Notably, bias features show only mild negative correlations with human preference labels (mean $r_{\text{human}} = -0.12$) but moderately strong positive correlations with labels from a strong reward model (mean $r_{\text{model}} = +0.36$), suggesting that models may over-rely on spurious cues. To mitigate these biases, we propose a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples.
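As a rough illustration of the correlation analysis above, the following is a minimal sketch assuming per-response binary bias-feature indicators and binary preference labels; all data and variable names are hypothetical toy values, not the paper's released code.

```python
# Sketch: correlate a binary bias-feature indicator with preference labels
# from humans and from a reward model. Data here is hypothetical.
from scipy.stats import pearsonr

# One entry per response: does it exhibit the bias feature (e.g., verbosity)?
feature = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
# Was this response the preferred one in its pair, per human annotators?
human_label = [0, 1, 1, 0, 1, 1, 0, 1, 1, 1]
# ...and per the reward model?
model_label = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]

# With binary inputs, Pearson's r reduces to a point-biserial correlation.
r_human, _ = pearsonr(feature, human_label)
r_model, _ = pearsonr(feature, model_label)
print(f"r_human = {r_human:+.2f}, r_model = {r_model:+.2f}")
```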
Prior work has shown that this miscalibration can manifest as overreliance on non-meaningful features such as response length, style, and formatting. For instance, models disproportionately prefer verbose or list-formatted responses (Li et al., 2024). Such biases can propagate into downstream applications with undesirable consequences. When used as reward models, they incentivize reward hacking, where models optimize for proxy features (e.g., verbosity) that diverge from human preferences (Skalse et al., 2022; Chakrabarty et al., 2025). As evaluators, they can distort evaluation conclusions and risk optimizing toward surface-level properties (Feuer et al., 2025; Wu and Aji, 2025).
These risks are compounded by evidence that biases in preference models may originate from training data artifacts (Bansal et al., 2024). Prior work has found correlations between response length and preference labels in preference datasets (Singhal et al., 2024), as well as evidence that annotators' stylistic preferences shape labels. However, existing studies have primarily documented individual biases in isolation, leaving a gap in quantifying how training data artifacts translate into model miscalibration across bias dimensions. Crucially, this requires measuring the divergence between model and human preferences when bias features are experimentally isolated.
We focus on five idiosyncratic bias features frequently observed in LM-generated text (§2, also showcased in Figure 1): length (verbosity), structure (e.g., list formatting), jargon (overly technical language), sycophancy (excessive user agreement), and vagueness (lack of specificity). To measure model reliance on these features in a controlled manner, we construct counterfactual response pairs in which a base response is perturbed to amplify the target bias feature while preserving other meaningful features (e.g., lengthening a concise answer with redundant phrases). Generating these counterfactual pairs for a diverse set of queries, we quantify two metrics: (1) skew, the rate at which preference models favor the biased response, and (2) miscalibration, the rate at which model and human preferences diverge on these pairs.
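To make the two metrics concrete, here is a minimal sketch, assuming each pair has been reduced to a boolean "prefers the bias-amplified response" vote per judge; the function names and toy data are illustrative, not the paper's code.

```python
def skew(model_prefers_biased: list[bool]) -> float:
    """Fraction of pairs where the model prefers the bias-amplified response."""
    return sum(model_prefers_biased) / len(model_prefers_biased)

def miscalibration(model_prefers_biased: list[bool],
                   human_prefers_biased: list[bool]) -> float:
    """Fraction of pairs where model and human preferences disagree."""
    disagree = [m != h for m, h in zip(model_prefers_biased, human_prefers_biased)]
    return sum(disagree) / len(disagree)

# Toy example: the model picks the biased response in 4/5 pairs (skew = 0.8)
# while humans pick it in 2/5, disagreeing on 2 pairs (miscalibration = 0.4).
model_votes = [True, True, True, True, False]
human_votes = [True, True, False, False, False]
print(skew(model_votes), miscalibration(model_votes, human_votes))
```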
• Length: Preference models often favor longer responses, even when the added length contributes no substantive information (Singhal et al., 2024; Dubois et al., 2024). This bias may stem from a training data heuristic in which length correlates with comprehensiveness. As a result, models may generate overly verbose responses.
• Structure: Preference models may disproportionately favor responses with bullet lists or numbered points over narrative prose, even when prose is more suitable (Li et al., 2024). This could be learned if structured formats are overrepresented in responses preferred by annotators. The consequence is a potential overuse of listicles, leading to outputs that feel formulaic or fail to convey arguments that benefit from prose.
• Jargon: This refers to a preference for responses that use specialized or domain-specific terminology even when it is not necessary. Models might learn this if the presence of jargon in the training data is correlated with highly preferred responses, leading them to use it as a proxy for quality. As a result, models may generate responses that give a superficial impression of expertise without being more useful.
• Sycophancy: This involves the model agreeing with or validating the user’s stated opinions and assumptions, rather than offering a neutral and objective response (Sharma et al., 2024; Perez et al., 2023). This behavior may stem from training data if sycophantic responses were more often preferred by human annotators. The downside of this bias is that models may reinforce a user’s biases, fail to provide objective information and appear less trustworthy.
• Vagueness: This bias is characterized by models favoring responses that make broad statements covering multiple aspects superficially, rather than providing concrete information that specifically addresses the query (example in Table 2). This may stem from vague statements being less falsifiable, and thus less penalized in training data. Such vague outputs can lead to responses that sound comprehensive while failing to concretely address the user's query.
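To illustrate how counterfactual pairs targeting these five features might be synthesized, here is a hedged sketch that rewrites a base response with an instruction-following LLM; the prompt wordings and the `generate` helper are hypothetical stand-ins, not the paper's actual pipeline.

```python
# Hypothetical rewrite instructions, one per bias feature. Each perturbation
# is meant to amplify only the target feature while preserving content.
PERTURBATION_PROMPTS = {
    "length": "Rewrite the response to be much longer by adding redundant "
              "phrasing and repetition, without adding new information.",
    "structure": "Rewrite the response as a bulleted list, even if prose "
                 "would be more natural. Keep the content unchanged.",
    "jargon": "Rewrite the response using dense, domain-specific terminology, "
              "keeping the underlying content the same.",
    "sycophancy": "Rewrite the response to open by enthusiastically agreeing "
                  "with the user's stated opinion, keeping the content.",
    "vagueness": "Rewrite the response to replace concrete specifics with "
                 "broad, non-committal statements.",
}

def make_counterfactual_pair(query: str, base_response: str, bias: str,
                             generate) -> tuple[str, str]:
    """Return (base, perturbed), where only the target bias feature changes.

    `generate` is a hypothetical callable wrapping an LLM API.
    """
    prompt = (f"Query: {query}\nResponse: {base_response}\n\n"
              f"{PERTURBATION_PROMPTS[bias]}")
    return base_response, generate(prompt)
```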
Figure 2: Skew and miscalibration rates for preference models (reward models in the top row, LLM evaluators in the bottom row). These models consistently show miscalibration and a high rate of skew.
LLM Evaluators. Figure 2 shows that LLM evaluators also exhibit significant miscalibration relative to human preferences, with the largest deviations for the length and vagueness biases. LLM evaluators similarly amplify skew toward perturbed responses compared to human annotators, particularly for vagueness and sycophancy. Notably, LLM evaluators show a dramatically higher preference for sycophantic responses (∼75-85% skew) than humans (∼50%). These findings reveal that LLM evaluators' preferences can likewise diverge from human preferences.
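For concreteness, a pairwise LLM-as-judge setup of the kind evaluated here might look like the following sketch; the judge prompt and the `generate` helper are assumptions, and a real protocol would also randomize response order to control for position bias.

```python
JUDGE_PROMPT = """You are comparing two responses to the same query.

Query: {query}
Response A: {response_a}
Response B: {response_b}

Which response is better? Answer with exactly "A" or "B"."""

def judge_pair(query: str, base: str, perturbed: str, generate) -> bool:
    """Return True if the LLM evaluator prefers the perturbed response.

    `generate` is a hypothetical callable wrapping an LLM API.
    """
    verdict = generate(JUDGE_PROMPT.format(
        query=query, response_a=base, response_b=perturbed)).strip()
    return verdict.upper().startswith("B")
```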