What sparse high-rank patterns does the deep tower fail to capture?

This explores a puzzle from recommendation systems: why deep neural recommenders (the 'deep tower') can miss the sparse, item-to-item relationships that a much simpler linear model captures — and what 'high-rank' really means when capacity isn't the bottleneck.

This reads the question through the lens of collaborative filtering, where the surprising result is that a single-layer linear model often beats a deep network. The 'sparse high-rank patterns' are the many specific item-to-item relationships ('people who bought this exact thing also want that exact thing') that don't compress into a few smooth latent factors. Each relationship touches only a handful of items, which is what makes the pattern sparse; there are so many independent ones that together they are high-rank. Deep recommenders typically squeeze everything through a low-dimensional bottleneck (a 'tower' that maps users and items into a compact embedding space), which is great for capturing broad taste but structurally blurs the sharp, idiosyncratic connections between individual items.

The clearest evidence comes from EASE, a shallow item-item weight matrix whose only trick is forcing its diagonal to zero so an item can't predict itself (see 'Can simpler models beat deep networks for recommendation systems?'). That constraint pushes the model to learn a full, high-rank table of how every item relates to every other one, including negative weights that encode 'these two things repel each other.' Its successor ESLER makes the same point even more pointedly: the structural bias of constraining self-similarity matters more than raw model capacity (see 'Can a linear model beat deep collaborative filtering?'). A deep tower has plenty of parameters, but its architecture spends them building smooth low-rank representations rather than memorizing the sparse, anti-affinity relationships that actually drive recommendations.
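A minimal sketch of the zero-diagonal idea, assuming the ridge-regularized least-squares objective and closed-form solution published for EASE. The toy interaction matrix, the function name `fit_zero_diag_item_weights`, and the regularization value `l2` are illustrative assumptions, not taken from the notes.

```python
import numpy as np

def fit_zero_diag_item_weights(X: np.ndarray, l2: float = 1.0) -> np.ndarray:
    """Closed-form item-item weights B minimizing ||X - X B||^2 + l2 * ||B||^2
    subject to diag(B) = 0 (an item may not predict itself)."""
    gram = X.T @ X + l2 * np.eye(X.shape[1])  # regularized item-item co-occurrence
    P = np.linalg.inv(gram)
    B = P / (-np.diag(P))                     # B[i, j] = -P[i, j] / P[j, j]
    np.fill_diagonal(B, 0.0)                  # enforce the zero-diagonal constraint
    return B

# Hypothetical toy data: rows are users, columns are items.
# Items 0 and 1 are substitutes (never bought together); both co-occur with item 2.
X = np.array([
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 1],
    [0, 1, 1],
], dtype=float)

B = fit_zero_diag_item_weights(X, l2=1.0)
print(np.round(B, 3))

# Scores for a user who has only item 0: the substitute (item 1) is pushed down,
# the complement (item 2) is pushed up.
scores = np.array([1.0, 0.0, 0.0]) @ B
print(np.round(scores, 3))
```

In this toy case the learned weight between the two substitutes is negative while the weight from item 0 to item 2 is positive, which is the 'repel' versus 'attract' distinction described above. Nothing here passes through a low-dimensional embedding: `B` is a full items-by-items table.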

There's a deeper reason this isn't just a recommendation quirk: it may be a mathematical ceiling. Work on embedding-based retrieval proves that for any fixed embedding dimension, there's a hard limit on how many distinct top-k item combinations the space can represent, and you hit that wall even on trivially simple tasks (see 'Do embedding dimensions fundamentally limit retrievable document combinations?'). A 'high-rank' pattern is precisely one that needs more independent directions than the bottleneck provides. So the deep tower doesn't fail from lack of training or scale; it fails because projecting through a narrow embedding throws away rank that the sparse pattern requires.
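The rank argument can be illustrated with a small linear-algebra sketch. This is not the communication-complexity proof the cited note refers to; it only shows the simpler fact that dot-product scores from d-dimensional embeddings form a matrix of rank at most d, and that a very sparse pattern can still be full rank. The cyclic 'item i leads to item i+1' pattern and the sizes `n_items` and `d` are assumptions chosen for illustration.

```python
import numpy as np

n_items, d = 64, 8  # hypothetical catalogue size and embedding dimension

# Sparse pattern: each item points to exactly one other item (i -> i+1, cyclic).
# One nonzero per row, yet the matrix is a permutation and therefore full rank.
A = np.roll(np.eye(n_items), shift=1, axis=1)

def best_rank_d(M: np.ndarray, d: int) -> np.ndarray:
    """Best rank-d approximation in Frobenius norm via truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :d] * s[:d]) @ Vt[:d]

# Any score matrix of the form E_left @ E_right.T with d-dim embeddings has
# rank <= d, so it can do no better than this truncation.
A_d = best_rank_d(A, d)
rel_err = np.linalg.norm(A - A_d) / np.linalg.norm(A)

print("rank of sparse pattern:", np.linalg.matrix_rank(A))
print("rank after bottleneck: ", np.linalg.matrix_rank(A_d))
print("relative error:        ", round(float(rel_err), 3))
```

Because every singular value of a permutation matrix equals 1, no direction is less important than another, so a rank-d bottleneck keeps only a d / n_items share of the pattern's squared norm regardless of how the embeddings are trained.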

What makes this genuinely counterintuitive: capacity and capability come apart. Models can carry every linearly-decodable feature a task needs while their internal organization is fractured and brittle (see 'Can models be smart without organized internal structure?'), and the right architectural constraint can beat brute capacity by forcing the model to route prediction through the relationships that matter. The lesson echoes elsewhere in the corpus: MobileLLM finds that *how* you arrange parameters (deep-and-thin) beats simply having more of them (see 'Does depth matter more than width for tiny language models?'), and weight-sparsity research shows that forced structure, not size, is what yields clean, interpretable circuits (see 'Can sparse weight training make neural networks interpretable by design?').

The thread tying these together is that the thing you choose to forbid your model from doing — self-prediction, dense weights, extra width — often teaches it more than the thing you let it learn freely. The deep tower's smoothness is exactly what costs it the sparse, high-rank detail a one-line constraint preserves.

Sources (6 notes)

Can simpler models beat deep networks for recommendation systems?

EASE, a shallow linear item-item weight matrix with diagonal constrained to zero, beats deep neural baselines on most datasets. The constraint forces generalization by forbidding self-prediction, while learned negative weights capture item dissimilarity—a structural prior more valuable than model capacity.

Can a linear model beat deep collaborative filtering?

ESLER, a single-layer linear autoencoder constrained so items cannot predict themselves, outperforms most deep CF models. The constraint forces prediction through item relationships, and negative weights encoding anti-affinity prove essential—structural bias matters more than model capacity.

Do embedding dimensions fundamentally limit retrievable document combinations?

Communication complexity theory proves that for any embedding dimension d, there exists a maximum number of top-k document combinations that can be returned as results. Even embeddings optimized directly on test data hit this polynomial limit, demonstrated on trivially simple retrieval tasks.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.
