Why does dot product beat MLP-based similarity in practice?
The Neural Collaborative Filtering argument suggests that MLPs, as universal approximators, should outperform dot products. But what explains the empirical gap, and what role do data scale and deployment constraints play?
Neural Collaborative Filtering popularized replacing the dot product between user and item embeddings with a learned MLP, on the theory that an MLP — a universal function approximator — should subsume the dot product as a special case. Rendle and colleagues revisit the experiments and show two non-obvious results.
First, with proper hyperparameter tuning, the simple dot product substantially outperforms the MLP-based similarity. The original NCF gain came from undertuning the dot-product baseline, not from MLP expressiveness. Second, even though an MLP can in theory approximate any function, learning a dot product with an MLP requires both a large model and a large training set — the inductive bias of MLPs makes the dot-product structure expensive to recover from data.
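The two combiners being compared can be sketched concretely. Below is a minimal, illustrative NumPy comparison: a parameter-free dot-product similarity versus an NCF-style MLP that consumes the concatenated embeddings. The weights here are random stand-ins, not trained parameters; the point is structural: the MLP must *learn* the multiplicative interaction that the dot product gets for free, which is what makes it data- and capacity-hungry.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension (illustrative)

u = rng.normal(size=d)  # user embedding
v = rng.normal(size=d)  # item embedding

# Dot-product similarity: fixed bilinear structure, zero extra parameters.
def dot_similarity(u, v):
    return float(u @ v)

# NCF-style MLP similarity: concatenate embeddings and pass them through
# learned layers. Random weights stand in for trained ones here; recovering
# the multiplicative u·v interaction from a concatenation input is exactly
# what requires large models and large training sets.
W1 = rng.normal(size=(2 * d, 16))
w2 = rng.normal(size=16)

def mlp_similarity(u, v):
    h = np.maximum(0.0, np.concatenate([u, v]) @ W1)  # ReLU hidden layer
    return float(h @ w2)

print(dot_similarity(u, v), mlp_similarity(u, v))
```

Note that the MLP sees only the concatenation `[u; v]`; no single ReLU layer computes products of its inputs directly, so the multiplicative structure must be approximated piecewise, term by term.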
The practical bite is in inference. Dot products admit Maximum Inner Product Search (MIPS) algorithms that retrieve the top-K items in sublinear time over catalogs of millions of items. An MLP similarity requires a forward pass per (user, item) pair, which is intractable at production scale. The paper concludes that MLPs as embedding combiners should be "used with care"; the fact that the DNN architectures most competitive in NLP (Transformers) and vision (ResNets) use dot products in their output layers reinforces the point. Universal approximation does not mean a universally good choice: the inductive bias of the operator interacts with data scale and serving constraints.
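The serving-time asymmetry can be made concrete. The sketch below (illustrative sizes, random embeddings) retrieves the exact top-K for a dot-product model with one matrix-vector product plus a partial sort; approximate MIPS indexes (as implemented in libraries such as FAISS) push this below linear time in the item count. No analogous index exists for a generic MLP similarity, which needs one forward pass per candidate item.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, d, k = 10_000, 32, 5  # illustrative catalog size and dimensions

item_emb = rng.normal(size=(n_items, d))
user_emb = rng.normal(size=d)

# Dot-product model: score every item with one matrix-vector product,
# then take the K largest scores with a partial sort (O(n) here; MIPS
# indexes make retrieval sublinear in n_items).
scores = item_emb @ user_emb
top_k = np.argpartition(-scores, k)[:k]
top_k = top_k[np.argsort(-scores[top_k])]  # order the K winners by score

# An MLP similarity offers no such shortcut: serving one user means
# n_items separate forward passes of mlp(user_emb, item_emb[i]).
print(top_k)
```

The exact brute-force version above is already fast because it is a single BLAS call; the approximate-index variants trade a little recall for sublinear query time, an option the MLP combiner forecloses entirely.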
Source: Recommenders Architectures
Related concepts in this collection
- Can MLPs learn to match dot product similarity in practice?
  Universal approximation theory suggests MLPs should learn any similarity function, including the dot product. But does this theoretical promise hold up when training on real, finite datasets with practical constraints?
  extends: paired statement of the same Rendle result, emphasizing the practical infeasibility of efficient retrieval
- Can simpler models beat deep networks for recommendation systems?
  Does removing hidden layers and constraining self-similarity create a more effective collaborative filtering approach than deep autoencoders? This challenges the assumption that architectural depth drives performance.
  complements: same lesson at architecture level — the right structural constraint beats depth
- Can a linear model beat deep collaborative filtering?
  Does a shallow linear autoencoder with a zero-diagonal constraint outperform deeper neural models on collaborative filtering tasks? This challenges the field's assumption that depth and nonlinearity drive performance.
  complements: same anti-depth lesson — anti-affinity and dot-product priors both outperform learned alternatives
- Can one model memorize and generalize better than two?
  Does training memorization and generalization components jointly in a single model outperform training them separately and combining their predictions? This matters for building efficient recommendation systems that handle both rare and common user behaviors.
  complements: industrial systems use simple structural priors (wide cross-product) for memorization rather than relying on MLP universality
Original note title: MLP-based similarity underperforms dot product despite being a universal function approximator — inductive bias matters more than capacity