What makes intent taxonomies unmanageable at hundreds of intents?

This explores why classifying user requests into a fixed list of named intents breaks down as that list grows — and what the corpus offers as alternatives.

The question is read here as being about the classic dialogue-system design in which every user utterance must be sorted into one of a predefined set of 'intents', and why that design buckles once the set reaches the hundreds. The corpus offers one direct answer and several lateral ones that point to a deeper reason the design was unlikely to scale.

The most direct material comes from Rasa's reframing of dialogue understanding as command generation rather than intent classification (Can command generation replace intent classification in dialogue systems?). The argument there is that intent classification is the wrong primitive: every new intent demands fresh annotated examples, the categories start overlapping at the edges (is this a 'reschedule' or a 'cancel-and-rebook'?), and accuracy degrades as the label space grows. Generating a domain-specific command instead of picking from a flat menu is presented as sidestepping all three: annotation requirements go away, context is handled naturally, and the system scales without the degradation. On this reading, the taxonomy isn't unmanageable because hundreds is a big number; it's unmanageable because the format forces you to carve continuous, context-dependent meaning into discrete, mutually exclusive boxes.
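
To make the contrast concrete, here is a minimal sketch of what consuming generated commands could look like, as opposed to picking one label from a flat list. The command names (`StartFlow`, `CancelFlow`, `SetSlot`), the text format, and the `parse_commands` helper are hypothetical and chosen for illustration only; they are not taken from the note and should not be read as Rasa's actual API.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Command:
    """One generated command: a name plus positional string arguments."""
    name: str
    args: Tuple[str, ...]


# Matches lines shaped like Name(arg1, arg2); this format is an assumption.
_COMMAND_RE = re.compile(r"^(\w+)\((.*)\)$")


def parse_commands(generated: str) -> List[Command]:
    """Parse generator output, one command per line, skipping malformed lines."""
    commands = []
    for line in generated.splitlines():
        line = line.strip()
        if not line:
            continue
        match = _COMMAND_RE.match(line)
        if match is None:
            continue  # ignore anything that is not a well-formed command
        name, raw_args = match.groups()
        args = tuple(a.strip() for a in raw_args.split(",") if a.strip())
        commands.append(Command(name, args))
    return commands


# Hypothetical generator output for "actually, can we move it to Friday?"
example_output = """
CancelFlow(book_appointment)
StartFlow(reschedule_appointment)
SetSlot(day, Friday)
"""
commands = parse_commands(example_output)
```

The point of the sketch is that the 'reschedule' versus 'cancel-and-rebook' ambiguity never has to be resolved into a single label: the utterance maps to a short sequence of commands, and supporting a new behaviour means supporting a new command or flow name rather than re-partitioning a label space.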

The most interesting lateral comes from retrieval failure analysis (Where do retrieval systems fail and why?), which names a hard ceiling: embedding dimension mathematically constrains how many distinct items a vector space can cleanly separate, and embeddings measure association rather than relevance. Mapped onto intents, this suggests a structural reason for the plateau: past some point, two intents cannot be reliably distinguished in the representation, no matter how much the classifier is tuned. The note is about retrieval, so the mapping is an analogy, but it looks like the same wall surfacing in a different subfield.
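
A toy illustration of the crowding intuition, not the formal limit the retrieval note refers to: if each intent is imagined as a random unit vector in a fixed, small number of dimensions, the closest pair of intents gets closer as more intents are added. The dimension, the counts, and the random "prototype" vectors are all assumptions made for this sketch.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly at random by normalising a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_pairwise_cosine(vectors):
    """Cosine similarity of the most similar pair (vectors are unit length)."""
    best = -1.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            cos = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            best = max(best, cos)
    return best


rng = random.Random(0)
dim = 8  # deliberately tiny so the effect shows up at small counts
prototypes = [random_unit_vector(dim, rng) for _ in range(300)]

# Similarity of the hardest-to-separate pair among the first n prototypes.
crowding = {n: max_pairwise_cosine(prototypes[:n]) for n in (10, 50, 300)}
```

Because each larger set contains the smaller one, the most similar pair can only get more similar as `n` grows. Real intent embeddings are learned rather than random, so this shows only the geometric pressure, not where an actual system would plateau.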

There's also a task-difficulty echo in argument-scheme classification (Why does argument scheme classification stumble where other NLP tasks succeed?), where models plateau at F1 0.55–0.65 on a task requiring integrative pattern recognition while the same systems exceed 0.80 on simpler tagging. Fine-grained intent disambiguation is plausibly the same kind of integrative task, which would explain why adding more classes hits a quality cliff rather than a gentle slope.

A less obvious takeaway: the corpus suggests the real fix isn't a better taxonomy but abandoning the discrete taxonomy altogether. Work on discovering persistent user-interest 'journeys' (llms-can-discover-and-describe-persistent-user-interest-journeys) shows that people's actual goals are things like 'designing hydroponic systems for small spaces', far too specific and personal to ever be a category in a hand-built list. And multi-facet identifier research (Can item identifiers balance uniqueness and semantic meaning?) makes the general point that no single kind of discrete label carries both distinctiveness and meaning at once; you need structured, generated representations. The pattern across all of these: discrete labels stop scaling long before reality does, and generation (of commands, of descriptions, of structured identifiers) is the corpus's recurring escape hatch.
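
As a rough sketch of what a structured identifier might look like, the snippet below combines a numeric ID, a title, and attributes into one record, in the spirit of the multi-facet identifier note. The `FacetedIdentifier` class, the `|` separator, and the `ground` lookup are hypothetical simplifications for illustration; they are not the method used in that work.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class FacetedIdentifier:
    """An identifier with three facets: unique ID, readable title, attributes."""
    item_id: int                 # distinctiveness
    title: str                   # semantics
    attributes: Tuple[str, ...]  # extra semantic detail

    def render(self) -> str:
        """Serialise all facets into one string a generator could emit."""
        attrs = ", ".join(self.attributes)
        return f"{self.item_id} | {self.title} | {attrs}"


def ground(generated: str,
           catalog: Dict[int, FacetedIdentifier]) -> Optional[FacetedIdentifier]:
    """Map generated text back to a known item via its leading numeric ID.

    Returns None when the text does not start with an ID in the catalog,
    which is the simplified stand-in for a grounding constraint here.
    """
    head = generated.split("|", 1)[0].strip()
    if not head.isdigit():
        return None
    return catalog.get(int(head))


item = FacetedIdentifier(42, "Compact hydroponic kit", ("indoor", "small-space"))
catalog = {item.item_id: item}
```

The numeric facet keeps two similarly titled items apart, the text facets keep the identifier meaningful, and the lookup rejects generated strings that do not correspond to anything real.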

Sources 5 notes

Can command generation replace intent classification in dialogue systems?

Rasa's dialogue understanding architecture generates domain-specific commands instead of classifying intents, eliminating annotation requirements, handling context naturally, and scaling without degradation—treating understanding as pragmatics rather than semantics.

Where do retrieval systems fail and why?

RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.

Why does argument scheme classification stumble where other NLP tasks succeed?

Scheme classification requires recognizing inferential patterns across distributed text spans, not local surface features. Models plateau at F1 0.55–0.65 while the same systems exceed 0.80 on component tagging and stance, suggesting the integrative reasoning demand is fundamentally different.

Can item identifiers balance uniqueness and semantic meaning?

TransRec shows that combining numeric IDs, titles, and attributes into structured identifiers solves three problems simultaneously: distinctiveness from IDs, semantics from text, and generation grounding from structural constraints. Neither pure IDs nor pure text alone achieves all three.
