Can lower embedding dimensions alone solve the diversity problem without attention mechanisms?

This note explores whether shrinking embedding dimensions can, by itself, fix the 'diversity problem' in recommendation and generation, and whether attention-style mechanisms can be skipped along the way.

The question treats dimensionality as the lever for diversity, with attention as an optional extra, and the corpus suggests the relationship is almost exactly backwards. Far from solving diversity, *lower* embedding dimensions are themselves a source of the problem. When user and item embeddings are too small, recommender systems overfit toward popular items because that is the cheapest way to maximize ranking quality, and the harm compounds over time as niche items starve for exposure (see 'Does embedding dimensionality secretly drive popularity bias in recommenders?'). So shrinking dimensions doesn't open up the tail; it collapses onto the head.
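
One way to see why a very small dimension favors the head, offered as a minimal sketch and not as the cited study's method: under dot-product scoring, a one-dimensional embedding gives every user with a positive coordinate the same item ranking. The item and user vectors below are invented for illustration, and truncating coordinates stands in for training a smaller model, which is an assumption.

```python
# Toy illustration with hypothetical vectors (not data from the cited note).

def score(user, item):
    """Dot-product relevance score between a user vector and an item vector."""
    return sum(u * v for u, v in zip(user, item))

def top_item(user, items, dim):
    """Return the best-scoring item name using only the first `dim` coordinates.

    Truncation is an assumed stand-in for a model trained with a smaller
    embedding dimension.
    """
    return max(items, key=lambda name: score(user[:dim], items[name][:dim]))

# One popular item and two niche items that differ only on the second axis.
items = {"hit": [1.0, 0.0], "niche_a": [0.6, 0.9], "niche_b": [0.6, -0.9]}
users = [[1.0, 0.0], [0.5, 1.0], [0.5, -1.0], [0.8, 0.1]]

tops_2d = {top_item(u, items, dim=2) for u in users}  # niche items can win
tops_1d = {top_item(u, items, dim=1) for u in users}  # everyone gets "hit"
```

In this toy setup the two-dimensional model surfaces all three items across the four users, while the one-dimensional model recommends only the popular one.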

The reason a single fixed vector struggles isn't really its size; it's that one vector has to compress every interest a user has into one point. The fix that the corpus actually credits is architectural: candidate-conditional attention. Deep Interest Network weights a user's past behaviors against each candidate item, switching on only the relevant interests for that comparison, so diverse tastes get expressed dynamically rather than being averaged into a lossy summary (see 'How can user vectors capture diverse interests without exploding in size?'). That's the opposite of 'diversity without attention': attention is precisely what lets a compact representation behave like a rich one.
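
The contrast between averaging and candidate-conditional weighting can be sketched in a few lines. This is a simplified illustration, not Deep Interest Network itself: it scores behaviors with a plain dot product and normalizes with softmax, whereas DIN learns its scoring function with a small network. The behavior and candidate vectors are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(history):
    """Fixed user vector: the plain average of all past behaviors."""
    n = len(history)
    return [sum(col) / n for col in zip(*history)]

def attention_pool(history, candidate):
    """User vector conditioned on the candidate (assumed dot-product scoring)."""
    weights = softmax([dot(h, candidate) for h in history])
    pooled = [sum(w * h[i] for w, h in zip(weights, history))
              for i in range(len(candidate))]
    return pooled, weights

# Hypothetical history: two "hiking" behaviors and one "jazz" behavior.
history = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
jazz_candidate = [0.0, 4.0]
hiking_candidate = [4.0, 0.0]

averaged = mean_pool(history)  # dominated by the majority interest
pooled_jazz, w_jazz = attention_pool(history, jazz_candidate)
pooled_hike, w_hike = attention_pool(history, hiking_candidate)
```

The averaged vector leans toward the majority interest regardless of the candidate, while the attention-pooled vector moves toward whichever interest the candidate activates, using the same two dimensions.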

Worth knowing too: dimensions aren't a single quality knob. The leading eigenvectors of an embedding space organize meaning coarse-to-fine, separating broad categories first and finer distinctions later, in an order that tracks the WordNet hypernym tree (see 'Do embedding eigenvectors organize taxonomy from coarse to fine?'). Cut dimensions and you don't trim evenly: the components most at risk are the fine-grained ones that distinguish niche items from their popular neighbors, which is exactly the distinction diversity depends on. And in language models, the architecture work runs the other direction: MobileLLM reports that deep-and-thin designs beat balanced ones at the 125M to 350M scale (see 'Does depth matter more than width for tiny language models?'), suggesting that representational richness comes from composition through layers, not from a particular width setting.
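
A toy sketch of uneven trimming, under a strong simplifying assumption: the hypothetical item vectors below are written directly in an eigenbasis whose axes are already sorted from highest to lowest variance, so dropping trailing coordinates plays the role of dropping trailing eigenvectors. Real embeddings would need an explicit eigendecomposition first.

```python
import math

# Hypothetical items. Axis 0 separates music from film (coarse);
# axes 1 and 2 separate items within each category (fine).
items = {
    "pop_hit":     [2.0, 0.7, 0.0],
    "niche_folk":  [2.0, -0.7, 0.0],
    "blockbuster": [-2.0, 0.0, 0.7],
    "art_film":    [-2.0, 0.0, -0.7],
}

def distance(a, b, dim):
    """Euclidean distance between two items using only the first `dim` axes."""
    va, vb = items[a][:dim], items[b][:dim]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))

fine_full = distance("pop_hit", "niche_folk", dim=3)    # niche is distinguishable
fine_cut = distance("pop_hit", "niche_folk", dim=1)     # niche merges with the hit
coarse_cut = distance("pop_hit", "blockbuster", dim=1)  # categories still separate
```

After truncation to one axis, the broad music/film split survives intact while the niche item becomes indistinguishable from its popular neighbor.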

If the real worry is diversity rather than dimensionality, the corpus points to mechanisms that operate orthogonally to embedding size altogether. Vector-valued rewards keep diversity 'baked in' by letting solutions specialize across per-criterion or per-persona reward axes instead of collapsing to one scalar (see 'Can reward vectors be the hidden source of solution diversity?'), and step-level critique during training actively counteracts the tail-narrowing that otherwise sets in (see 'Do critique models improve diversity during training itself?'). Both treat diversity as something you protect with structure, not something you tune into existence by resizing a vector.
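
The difference between a scalar reward and a reward vector can be illustrated with a Pareto filter. This is only a sketch of the general idea, not the Vector Policy Optimization algorithm, whose details the notes do not give. The candidate names, the two reward axes (say, correctness and brevity), and the values are all hypothetical.

```python
def dominates(a, b):
    """True if reward vector `a` is at least as good as `b` on every axis
    and strictly better on at least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(rewards):
    """Names of candidates that no other candidate dominates."""
    return [
        name for name, r in rewards.items()
        if not any(dominates(other, r)
                   for other_name, other in rewards.items()
                   if other_name != name)
    ]

# Hypothetical per-criterion rewards: (correctness, brevity).
rewards = {
    "A": (0.9, 0.2),
    "B": (0.6, 0.7),
    "C": (0.3, 0.9),
    "D": (0.5, 0.5),  # worse than B on both axes
}

front = pareto_front(rewards)                                  # several specialists
scalar_best = max(rewards, key=lambda name: sum(rewards[name]))  # a single winner
```

Summing the axes keeps one candidate, while the Pareto filter keeps every candidate that wins some trade-off and discards only the one that is strictly worse than another.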

The short version: lower embedding dimensions alone don't solve the diversity problem — the evidence says they *cause* a version of it, and attention (or other structural mechanisms) is what recovers diverse expression from a compact representation. The thing you didn't know you wanted to know: 'dimensionality as a fairness hyperparameter' is a real framing — the size of the vector is quietly a policy decision about who gets seen.

Sources 6 notes

Does embedding dimensionality secretly drive popularity bias in recommenders?

Research shows that when user/item embedding dimensions are too small, recommender systems overfit toward popular items to maximize ranking quality. This compounds over time as niche items receive insufficient exposure, and cannot be fixed post-hoc without treating dimensionality as a fairness hyperparameter.

How can user vectors capture diverse interests without exploding in size?

Deep Interest Network weights historical behaviors against each candidate ad, activating only relevant interests dynamically. This preserves dimension efficiency while expressing diverse tastes without lossy compression.

Do embedding eigenvectors organize taxonomy from coarse to fine?

Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can reward vectors be the hidden source of solution diversity?

Vector Policy Optimization shows that rewards decomposed per test-case, criterion, or persona provide an inherent diversity structure. Training solutions to span the Pareto frontier across these dimensions produces competent diversity grounded in real task trade-offs rather than external regularizers.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.
