Can structural priors outperform raw model capacity in collaborative filtering?

This explores whether building the right constraints and structure into a recommender — what shape its predictions are allowed to take — beats simply making the model bigger and deeper.

The corpus answers the title question emphatically: yes. The *shape* you impose on a collaborative filtering model (the constraints, the priors, the structural assumptions baked into how it's allowed to learn) can beat raw depth and parameter count, and the clearest evidence comes from two near-identical findings about embarrassingly simple models. EASE is nothing but a single item-item weight matrix with one rule: an item may not predict itself, so the diagonal is forced to zero. It still beats deep neural baselines on most datasets (see "Can simpler models beat deep networks for recommendation systems?"). ESLER reaches the same conclusion from the same trick: forbidding self-prediction forces every recommendation to route through genuine item relationships, and the negative weights it learns (encoding what you *don't* want next to what you do) turn out to be the load-bearing part (see "Can a linear model beat deep collaborative filtering?"). In both, a one-line structural constraint does more work than extra hidden layers.
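To make the zero-diagonal rule concrete, here is a minimal sketch assuming the ridge-regularized closed form usually given for EASE. The function name `ease_weights`, the regularization strength `lam`, and the toy interaction matrix are illustrative assumptions, not details taken from the notes.

```python
import numpy as np

def ease_weights(X, lam=1.0):
    """Item-item weights B minimizing ||X - XB||^2 + lam*||B||^2 with diag(B) = 0."""
    gram = X.T @ X + lam * np.eye(X.shape[1])   # regularized item-item Gram matrix
    P = np.linalg.inv(gram)
    B = P / (-np.diag(P))                       # column j is divided by -P[j, j]
    np.fill_diagonal(B, 0.0)                    # an item may not predict itself
    return B

# Hypothetical toy data: rows are users, columns are items, 1 = interacted.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

B = ease_weights(X, lam=0.5)
scores = X @ B   # predicted affinity of every user for every item
```

There are no hidden layers and no gradient descent: the whole model is one matrix inverse plus the diagonal constraint. Off-diagonal entries of `B` are free to come out negative, which is where the anti-affinity signal described above would live.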

The same theme shows up in a less obvious place: the choice of likelihood function. Liang et al. found that simply switching a VAE's output distribution from Gaussian or logistic to multinomial produced state-of-the-art results, because a multinomial forces items to *compete* for probability mass, which is exactly what top-N ranking rewards (see "Why does multinomial likelihood work better for ranking recommendations?"). That's not extra capacity; it's a prior about what the task actually is. Aligning the model's built-in assumptions with the ranking objective beat throwing more model at a mismatched one.
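The competition effect can be seen in a few lines. This is a sketch of a softmax-based multinomial log-likelihood only, not the full VAE; the function names and the toy logits are assumptions made for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]

def multinomial_log_likelihood(clicks, logits):
    """sum_i x_i * log(pi_i); the multinomial coefficient is constant and dropped."""
    return sum(x * lp for x, lp in zip(clicks, log_softmax(logits)))

clicks = [1, 0, 1]   # hypothetical user who interacted with items 0 and 2

# Probabilities over three items before and after boosting item 0 only.
before = [math.exp(lp) for lp in log_softmax([1.0, 1.0, 1.0])]
after = [math.exp(lp) for lp in log_softmax([3.0, 1.0, 1.0])]
```

Because the probabilities must sum to one, raising the score of item 0 necessarily lowers the probability of items 1 and 2. A per-item Gaussian or logistic output has no such coupling: each item's score can rise independently of the others.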

The interesting wrinkle is that 'structural prior' doesn't only mean 'make it simpler.' It can mean encoding *richer* structure the network would otherwise have to discover from scratch. Knowledge-graph attention networks fold item attributes and user interactions into one Collaborative Knowledge Graph, letting the model walk high-order connections (friend-of-a-friend-of-an-item paths) that standard supervised methods miss (see "Can graphs unify collaborative filtering and side information?"). Graph autoencoders use the same instinct to crack cold-start, where there's no interaction history to throw capacity at, so the structural scaffold of side information has to carry the prediction (see "Can autoencoders solve the cold-start problem in recommendations?"). The prior here isn't austerity; it's giving the model the right relational graph to reason over.
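A toy illustration of why merging the two graphs matters. This sketch only measures reachability with breadth-first search; it does not implement attention-based propagation, and every entity name below is hypothetical.

```python
from collections import deque

# Hypothetical toy data: user-item interactions plus item-attribute links.
interactions = [("u1", "i1"), ("u2", "i1"), ("u2", "i2")]
knowledge = [("i2", "director_A"), ("i3", "director_A")]   # i3 has no interactions

def build_graph(*edge_lists):
    """Undirected adjacency sets built from one or more edge lists."""
    graph = {}
    for edges in edge_lists:
        for a, b in edges:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph

def hops(graph, start, goal):
    """Shortest hop count from start to goal, or None if unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

cf_only = build_graph(interactions)             # interaction graph alone
ckg = build_graph(interactions, knowledge)      # merged collaborative knowledge graph
```

In the interaction-only graph, `hops(cf_only, "u1", "i3")` finds no path, because the cold item `i3` has never been interacted with. In the merged graph the same call finds a five-hop path through a co-interacting user and a shared attribute, which is the kind of high-order connection the paragraph above describes.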

There's also a cautionary counterpoint worth knowing about. Monolith's work on embedding tables shows that when you *do* lean on raw capacity, the way you allocate it matters more than how much you have: real recommendation data is power-law distributed, so naive fixed-size hashing concentrates collisions on exactly the high-frequency users and items you most need to get right (see "Why do hash collisions hurt recommendation models so much?"). Capacity spent in the wrong shape actively hurts. Across the collection the pattern is consistent: the wins come from matching the model's structure to the grain of the problem (competition between items, anti-affinity, relational graphs, frequency-aware tables) rather than from depth for its own sake. The surprising takeaway: the strongest recommender in some of these benchmarks has no hidden layers at all.
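The allocation point can be sketched with a small deterministic simulation. This is an illustration under stated assumptions, not Monolith's actual mechanism: popularity is assumed to follow a 1/rank curve, a plain modulo stands in for a real hash function, and the table sizes and the "dedicated rows for the head" scheme are hypothetical.

```python
from collections import Counter

def collided_traffic_share(freqs, slot_of):
    """Fraction of lookups that hit an embedding row shared by two or more IDs."""
    occupants = Counter(slot_of(i) for i in range(len(freqs)))
    shared = sum(f for i, f in enumerate(freqs) if occupants[slot_of(i)] > 1)
    return shared / sum(freqs)

N_IDS, N_ROWS, N_DEDICATED = 10_000, 1_000, 500   # hypothetical sizes

# Assumed Zipf-like popularity: ID i is the (i + 1)-th most frequent.
freqs = [1.0 / (rank + 1) for rank in range(N_IDS)]

def naive_slot(i):
    # Fixed-size hashing; modulo stands in for a real hash function.
    return i % N_ROWS

def frequency_aware_slot(i):
    # The most frequent IDs get a private row; the tail shares the rest.
    if i < N_DEDICATED:
        return i
    return N_DEDICATED + i % (N_ROWS - N_DEDICATED)

naive = collided_traffic_share(freqs, naive_slot)
aware = collided_traffic_share(freqs, frequency_aware_slot)
print(f"naive: {naive:.2f}  frequency-aware: {aware:.2f}")
```

Both schemes use the same 1,000 rows. Under the naive scheme every lookup lands on a shared row, while reserving rows for the head leaves only about a third of the traffic on shared rows in this toy setup. The table size is identical; only the shape of the allocation changed.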

Sources (6 notes)

Can simpler models beat deep networks for recommendation systems?

EASE, a shallow linear item-item weight matrix with diagonal constrained to zero, beats deep neural baselines on most datasets. The constraint forces generalization by forbidding self-prediction, while learned negative weights capture item dissimilarity—a structural prior more valuable than model capacity.

Can a linear model beat deep collaborative filtering?

ESLER, a single-layer linear autoencoder constrained so items cannot predict themselves, outperforms most deep CF models. The constraint forces prediction through item relationships, and negative weights encoding anti-affinity prove essential—structural bias matters more than model capacity.

Why does multinomial likelihood work better for ranking recommendations?

Liang et al. show that switching VAE likelihoods from Gaussian/logistic to multinomial achieves state-of-the-art results because enforced probability competition between items directly aligns training with top-N ranking objectives. Rebalancing KL regularization further improves performance.

Can graphs unify collaborative filtering and side information?

KGAT merges user-item interaction graphs with item knowledge graphs into a Collaborative Knowledge Graph, using attention-based propagation to capture both user-similarity and attribute-similarity signals simultaneously—including high-order connections that standard supervised learning methods miss.

Can autoencoders solve the cold-start problem in recommendations?

GHRS uses graph features and deep autoencoders to integrate rating history with side information, enabling predictions for new users and items by discovering non-linear relationships that linear hybrid methods miss.

Why do hash collisions hurt recommendation models so much?

Monolith's empirical work shows that real recommendation systems have power-law distributed ID frequencies, causing collisions to accumulate precisely on the entities that models most need to represent accurately. Fixed-size hashed tables worsen this over time as new IDs arrive.
