Why do high-disagreement tasks benefit from broad rater pools over deep annotation?

This explores why tasks where annotators legitimately disagree are better served by sampling many different raters (breadth) than by having a few raters label intensively (depth).

The question is really about *where the signal lives* when a task provokes disagreement, and the corpus suggests the answer is breadth: on high-disagreement tasks the disagreement itself is the data, not noise to be averaged away. The starting point is "Why do readers interpret the same sentence so differently?": when sentences are socially embedded, readers interpret them differently because of where they sit, not because someone made a mistake. If the variation across people is the real signal, then deep annotation by a narrow set of raters can't recover it; you'd be sampling one or two perspectives very precisely while missing the distribution entirely. Breadth captures the shape of legitimate disagreement; depth sharpens a single, possibly idiosyncratic, point of view.
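The breadth-versus-depth contrast can be made concrete with a toy calculation. The sketch below is illustrative only: the 60/40 split, the label names, and the evenly spaced sampling (a stand-in for random sampling across the population) are all made-up assumptions, not figures from the corpus. It spends the same annotation budget two ways and measures how far each estimate lands from the true spread of readings.

```python
from collections import Counter

def label_distribution(labels):
    """Return the share of each label in a list of annotations."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two label distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical population: each rater holds one stable reading of the sentence.
# The 60/40 split is an assumption chosen for illustration.
population = ["offensive"] * 60 + ["not_offensive"] * 40
true_dist = label_distribution(population)

# Depth: 2 raters, each labeling the item 50 times (100 annotations).
deep_raters = population[:2]
deep_labels = [reading for reading in deep_raters for _ in range(50)]
deep_dist = label_distribution(deep_labels)

# Breadth: 20 raters spread across the population, 5 labels each (100 annotations).
broad_raters = population[::5]
broad_labels = [reading for reading in broad_raters for _ in range(5)]
broad_dist = label_distribution(broad_labels)

print("depth error:  ", round(total_variation(true_dist, deep_dist), 3))
print("breadth error:", round(total_variation(true_dist, broad_dist), 3))
```

In this toy setup the two deep raters happen to share a reading, so the deep estimate misses the minority view entirely. Even in the best case, a two-rater pool can only report shares of 0, 0.5, or 1 for any label, however many times each rater repeats the task.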

There's a second reason breadth matters, and it concerns what a single annotator's response even *is*. "Do all annotation responses measure the same underlying thing?" shows that annotations aren't one clean measurement: they mix genuine preferences, non-attitudes (people answering when they have no real opinion), and constructed preferences (opinions invented on the spot). You can only tell these apart by looking across measurement conditions and across people. Having one rater annotate in depth gives you consistency, but consistency alone can't distinguish a stable genuine preference from a stably constructed artifact. A broad pool lets the genuine signal accumulate while the non-attitudes and constructions wash out as scatter, which is exactly the separation deep annotation can't perform.

The failure mode of ignoring this shows up downstream in "Why do reasoning models fail at predicting disagreement?": models optimized for a single deterministic "correct" answer get *worse* at representing human disagreement, and worst of all precisely where variance is high. That's the modeling-side mirror of narrow annotation: collapsing many valid views into one erodes the very capability you need on contested tasks. Broad rater pools are the data-collection counterpart, keeping that distribution alive instead of training it away.
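What "collapsing many valid views into one" discards can be shown in a few lines. This is a minimal sketch with hypothetical data: ten invented annotations on a single NLI-style item, compared as a soft label (the full distribution) against a majority-vote label. The function names are illustrative, not taken from any of the cited work.

```python
import math
from collections import Counter

def soft_label(annotations):
    """Keep the full distribution of rater responses."""
    counts = Counter(annotations)
    total = len(annotations)
    return {label: n / total for label, n in counts.items()}

def majority_label(annotations):
    """Collapse all responses into the single most common label."""
    return Counter(annotations).most_common(1)[0][0]

def entropy_bits(distribution):
    """Shannon entropy in bits; 0 means no disagreement is represented."""
    return sum(-p * math.log2(p) for p in distribution.values() if p > 0)

# Hypothetical annotations from ten different raters on one item.
annotations = ["entailment"] * 5 + ["neutral"] * 4 + ["contradiction"] * 1

soft = soft_label(annotations)
hard = {majority_label(annotations): 1.0}

print("soft label:", soft, "entropy:", round(entropy_bits(soft), 2))
print("hard label:", hard, "entropy:", round(entropy_bits(hard), 2))
```

The majority label reports "entailment" with zero entropy, even though half of the raters chose something else. A target built this way gives a model nothing to learn about the disagreement.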

The quiet payoff is a reframing: on contested tasks, "more annotation per rater" and "better annotation" point in opposite directions. Depth buys you precision about the wrong quantity (one perspective's certainty) when the quantity you actually need is the spread across perspectives. The corpus also hints that this isn't limited to subjective social text: "Can models learn argument quality from labeled examples alone?" finds that without an explicit shared framework, models trained on labeled examples latch onto surface patterns instead of principled criteria. That suggests another way disagreement leaks in, and another reason a single deep annotator can quietly encode their own idiosyncratic surface rules rather than a criterion anyone else would share.

Sources 4 notes

Why do readers interpret the same sentence so differently?

Interpretation Modeling research shows that disagreement on socially embedded sentences reflects valid differences in reader perspective, not annotation failure. Structured human disagreement in NLI benchmarks confirms that interpretation distributions carry meaningful information.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Why do reasoning models fail at predicting disagreement?

RLVR-trained models degrade significantly at predicting human disagreement distributions, especially when variance is high. The optimization signal for deterministic correctness actively erodes the model's ability to represent multiple valid interpretations.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.
