Why do different people reconstruct the same argument differently?
When humans and LLMs extract logical structure from arguments, they produce different reconstructions. Is this disagreement a problem to solve, or does it reveal something fundamental about how arguments work?
Argunauts (Argument Annotation Units) is a dataset and benchmark for argument reconstruction — extracting explicit logical structures from natural language arguments. The dataset's most significant finding is methodological: when multiple annotators (human and LLM) reconstruct the same argument independently, they produce different but equally valid reconstructions.
This is not annotation disagreement in the sense of noise to be resolved. Multiple reconstruction schemas — different choices about what counts as a premise, how to formalize the conclusion, what implicit assumptions to make explicit — are each internally valid. There is no gold standard because the text underdetermines the reconstruction.
This connects directly to "Why do readers interpret the same sentence so differently?", but at a structural rather than semantic level. Interpretive multiplicity in NLI is about meaning: what a sentence means depends on the reader's social position. Reconstruction multiplicity in argumentation is about structure: how an argument should be formalized depends on which reconstruction schema is applied.
Both findings converge on a challenge to the NLP assumption that language-processing tasks have unique correct outputs. "Do standard NLP benchmarks hide LLM ambiguity failures?" describes how benchmarks respond to this problem by exclusion: filtering ambiguous items out before evaluation. For argumentation, exclusion is not possible, because underdetermination is not a feature of edge cases but of the task itself.
The practical implication: evaluating LLMs on argument reconstruction requires acknowledging that precision and recall metrics assume ground truth that does not exist. Models that disagree with a reference annotation may be producing equally valid reconstructions. The field is measuring agreement with one valid interpretation and calling it correctness.
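To make the metric problem concrete, here is a minimal sketch (with invented toy premises, not data from the Argunauts dataset) of premise-level precision and recall. Scored against a single reference reconstruction, a model that produced a different but equally valid reconstruction looks half wrong; scored against a pool of valid references, the same output is perfect. The helper `precision_recall` and all premise strings are illustrative assumptions, not part of any published evaluation code.

```python
def precision_recall(predicted, reference):
    """Exact-match premise-level precision and recall against one reference."""
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)  # premises the model and the reference share
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return precision, recall

# Two reconstructions of the same argument, differing only in which
# implicit premise each annotator chose to make explicit -- both valid.
reference_a = {"p1: lying erodes trust", "p2: trust is necessary for cooperation"}
reference_b = {"p1: lying erodes trust", "p3: undermining cooperation is wrong"}

model_output = {"p1: lying erodes trust", "p3: undermining cooperation is wrong"}

# Against a single gold reference, the model appears to be half wrong ...
print(precision_recall(model_output, reference_a))  # (0.5, 0.5)

# ... yet it matches another equally valid reconstruction exactly.
best = max(precision_recall(model_output, r) for r in (reference_a, reference_b))
print(best)  # (1.0, 1.0)
```

The gap between the two scores is entirely an artifact of which valid reconstruction happened to be designated as gold.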
This also grounds "Why do speakers deliberately use ambiguous language?" from a new angle: structural ambiguity (multiple valid formalizations of the same argument) is as fundamental as semantic ambiguity.
Source: Argumentation
Related concepts in this collection
- Why do readers interpret the same sentence so differently? How much of annotation disagreement in NLP reflects genuine interpretive multiplicity rather than error? This explores whether social position and moral framing systematically generate competing but equally valid readings. (Relation: that note is semantic multiplicity; this one is structural multiplicity; same root problem.)
- Why do speakers deliberately use ambiguous language? Explores whether ambiguity is a linguistic defect or a strategic tool speakers use for efficiency, politeness, and deniability. Matters because it challenges how we train language systems. (Relation: the broader principle this note exemplifies at the argument-structure level.)
- Do standard NLP benchmarks hide LLM ambiguity failures? When benchmark creators filter out ambiguous examples before testing, do they accidentally make it impossible to measure whether language models can actually handle ambiguity the way humans do? (Relation: benchmark exclusion as the standard NLP response to underdetermination.)
Original note title: argument reconstruction is fundamentally underdetermined because multiple valid reconstructions exist for the same text with no ground truth