Can the eight-dimension rubric predict which question types need decomposition?
This explores whether a fixed multi-attribute scoring rubric — here, an eight-dimension one — can forecast in advance which questions must be broken into sub-questions before they can be answered well; the corpus has no note on a literal eight-dimension rubric, but it has a lot on the underlying bet that question structure predicts decomposition need.
The honest first answer is that the corpus contains no eight-dimension rubric by that name. It does, however, have the conceptual machinery the question is really asking about, and that machinery suggests a qualified yes. The strongest support is the finding that question *type* already determines decomposition strategy: non-factoid questions split into five kinds, and the split is not cosmetic. Evidence-based questions suit plain retrieval, debate and comparison questions need aspect-specific retrieval, and experience and reason questions need to be broken apart or filtered before retrieval works at all (see "Does question type determine the right retrieval strategy?"). So a classifier that places a question on the right axes is already, in effect, predicting whether decomposition is required. An eight-dimension rubric would be a finer-grained version of that classifier.
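As a minimal sketch of that routing idea, the mapping from question type to strategy could look like the following. The five types and their grouping follow the source note; the strategy labels (`standard_rag`, `aspect_specific`, `decompose_or_filter`) and the function name are hypothetical names chosen for illustration, not taken from the cited work.

```python
from enum import Enum


class QuestionType(Enum):
    EVIDENCE = "evidence-based"
    DEBATE = "debate"
    COMPARISON = "comparison"
    EXPERIENCE = "experience"
    REASON = "reason"


# Hypothetical strategy labels; only the type-to-strategy grouping follows the source note.
STRATEGY_BY_TYPE = {
    QuestionType.EVIDENCE: "standard_rag",
    QuestionType.DEBATE: "aspect_specific",
    QuestionType.COMPARISON: "aspect_specific",
    QuestionType.EXPERIENCE: "decompose_or_filter",
    QuestionType.REASON: "decompose_or_filter",
}


def needs_decomposition(qtype: QuestionType) -> bool:
    """Return True when the type's strategy involves breaking the question apart or filtering it."""
    return STRATEGY_BY_TYPE[qtype] == "decompose_or_filter"
```

Under these assumptions, `needs_decomposition(QuestionType.REASON)` is true and `needs_decomposition(QuestionType.EVIDENCE)` is false. The hard part, assigning a raw question to a `QuestionType`, is deliberately left out of the sketch.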
The ALFA work shows what those dimensions might be and why a rubric can carry predictive weight at all. It breaks question *quality* into theory-grounded attributes (clarity, relevance, specificity) and finds that optimizing for each attribute separately beats training on a single collapsed score, especially in clinical reasoning, where asking the right clarifying question directly affects decision quality (see "Can models learn to ask genuinely useful clarifying questions?"). The lesson that transfers: a multi-dimensional rubric outperforms a scalar because different dimensions expose different failure modes. If one of your eight dimensions tracks 'requires aggregating across aspects' and another tracks 'depends on the asker's situation,' those are exactly the signals that flag a question as needing decomposition rather than a single retrieval pass.
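To make the scalar-versus-rubric point concrete, here is a small sketch under stated assumptions. The class `RubricScore`, its four dimension names, the [0, 1] score range, and the 0.5 threshold are all hypothetical choices for illustration; none of them come from ALFA. It shows two questions whose collapsed scores are identical even though only one of them carries the decomposition signal.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RubricScore:
    # All four dimension names are illustrative; scores are assumed to lie in [0, 1].
    clarity: float
    specificity: float
    aggregates_across_aspects: float
    depends_on_asker_situation: float

    def scalar(self) -> float:
        """Collapse every dimension into one mean score."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)


# Assumed: these two dimensions are the ones that signal decomposition need.
DECOMPOSITION_DIMS = ("aggregates_across_aspects", "depends_on_asker_situation")


def flags_decomposition(score: RubricScore, threshold: float = 0.5) -> bool:
    """Flag a question when any decomposition-related dimension reaches the threshold."""
    return any(getattr(score, dim) >= threshold for dim in DECOMPOSITION_DIMS)


simple = RubricScore(0.75, 0.75, 0.25, 0.25)
tangled = RubricScore(0.25, 0.25, 0.75, 0.75)
```

Both `simple.scalar()` and `tangled.scalar()` come out to the same mean, so a single number cannot tell the two questions apart, while `flags_decomposition` separates them by reading the relevant dimensions directly.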
But there's a sharp ceiling worth knowing about. The argument-mining corpus shows that classifying the *inferential structure* of a piece of reasoning is fundamentally harder than tagging its surface parts: scheme classification plateaus at F1 0.55–0.65 while the same systems clear 0.80 on component tagging and stance, because recognizing a scheme means integrating distributed, non-local cues (see "Why does argument scheme classification stumble where other NLP tasks succeed?"). Decomposition need is plausibly the same kind of integrative judgment. A rubric can predict it, but the dimension that does the predicting will likely be the noisy, low-accuracy one. An eight-dimension rubric would therefore predict decomposition unevenly, nailing the easy structural cues and stumbling exactly where the question's complexity is hidden in how its parts relate.
There's also a generative reason to believe a closed rubric can work at all: Wagemans's Periodic Table maps every argument scheme onto three orthogonal axes, replacing an open-ended list with a finite combinatorial space (see "Can three axes organize all possible argument schemes?"). That's a proof of concept for the premise: if argument structure collapses to three axes, question-decomposition need plausibly collapses to a handful of dimensions too, and the rubric's job is to find the orthogonal ones. The risk is over-fitting the rubric to vocabulary instead of structure: question type should be read off how the parts combine, not off the words present.
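The difference between an open-ended list and a closed combinatorial space can be sketched in a few lines. The axis names below paraphrase the source note; the level labels (`form_1`, `pairing_1`, and so on) are placeholders, not Wagemans's actual categories, and the sketch assumes the axes combine freely, which the real table may not allow.

```python
from itertools import product

# Axis names follow the source note; level labels are placeholders, not Wagemans's categories.
AXES = {
    "subject_predicate_structure": ("form_1", "form_2"),
    "reasoning_order": ("first_order", "second_order"),
    "proposition_type_pairing": ("pairing_1", "pairing_2", "pairing_3"),
}


def enumerate_space(axes: dict) -> list:
    """List every coordinate in the space, assuming the axes combine freely."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


cells = enumerate_space(AXES)
```

With these placeholder levels the space has 2 × 2 × 3 = 12 cells. The point for a rubric is that every question lands on some coordinate, so an empty cell is a visible gap to investigate, not a type nobody thought to list.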
One adjacent finding reframes the stakes. Training format shapes a model's reasoning strategy roughly 7.5× more than the domain does: multiple-choice training breeds breadth-first search, while free-form training breeds depth-first (see "Does training data format shape reasoning strategy more than domain?"). That suggests the *form* a question is posed in may predict decomposition need better than its topic, which is good news for a structural rubric and bad news for any rubric that leans on subject-matter cues. If you want the eight dimensions to predict well, weight them toward how the question is built, not what it's about.
Sources (5 notes)
Research shows non-factoid questions split into five types, each requiring different retrieval and aggregation methods. Evidence-based questions suit standard RAG, while debate and comparison need aspect-specific retrieval, and experience/reason questions need decomposition or filtering strategies.
The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.
Scheme classification requires recognizing inferential patterns across distributed text spans, not local surface features. Models plateau at F1 0.55–0.65 while the same systems exceed 0.80 on component tagging and stance, suggesting the integrative reasoning demand is fundamentally different.
Wagemans's Periodic Table maps all argument schemes onto coordinates across three axes: subject-predicate structure, first-order versus second-order reasoning, and proposition-type pairings. This combinatorial approach replaces Walton's open-ended list with a closed, systematic space enabling computational analysis and discovery of unstudied scheme types.
Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.