Can MLPs learn to match dot product similarity in practice?
Universal approximation theory suggests MLPs should learn any similarity function, including dot product. But does this theoretical promise hold up when training on real, finite datasets with practical constraints?
The Neural Collaborative Filtering paper popularized replacing the dot product with a learned MLP for combining user and item embeddings. The justification was theoretical: an MLP is a universal function approximator, so it can in principle learn any similarity function — including the dot product — and presumably better ones. Rendle et al.'s revisiting of this work shows that the argument fails both empirically and operationally.
Empirically, with careful hyperparameter tuning, a properly configured dot-product baseline substantially outperforms the MLP. Even more pointedly, learning a dot product through an MLP requires substantial model capacity and training data: the universal-approximation guarantee is asymptotic, and on finite data inductive bias matters more than expressiveness. The MLP is too flexible for the task; its inductive bias points away from the simple geometric similarity that actually fits the data.
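As a rough illustration of the finite-data point, the sketch below tries to regress the exact dot product from concatenated embeddings using a small MLP. Everything here is a hypothetical setup (random Gaussian embeddings, one hidden ReLU layer, plain SGD in NumPy — none of it is from the Rendle et al. experiments); the residual test error it reports typically stays well above zero, unlike the zero error the dot product achieves by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden, n = 8, 64, 20_000  # illustrative sizes, not from the paper

# Random embedding pairs; the regression target is their exact dot product.
U = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
X = np.concatenate([U, V], axis=1)   # MLP input: [u; v]
y = np.sum(U * V, axis=1)            # exact dot product

# One-hidden-layer ReLU MLP trained with plain SGD on mean squared error.
W1 = rng.normal(scale=0.1, size=(2 * d, hidden))
b1 = np.zeros(hidden)
w2 = rng.normal(scale=0.1, size=hidden)
b2 = 0.0
lr, batch = 1e-2, 256

for step in range(3000):
    idx = rng.integers(0, n, size=batch)
    xb, yb = X[idx], y[idx]
    h = np.maximum(xb @ W1 + b1, 0.0)        # forward: ReLU hidden layer
    err = (h @ w2 + b2) - yb                 # prediction error
    # Manual backprop through the two layers.
    gw2 = h.T @ err / batch
    gb2 = err.mean()
    gh = np.outer(err, w2) * (h > 0)
    gW1 = xb.T @ gh / batch
    gb1 = gh.mean(axis=0)
    w2 -= lr * gw2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# Held-out evaluation against the trivial predict-zero baseline
# (whose MSE equals the variance of the targets).
Ut, Vt = rng.normal(size=(2000, d)), rng.normal(size=(2000, d))
Xt = np.concatenate([Ut, Vt], axis=1)
yt = np.sum(Ut * Vt, axis=1)
ht = np.maximum(Xt @ W1 + b1, 0.0)
mse = np.mean((ht @ w2 + b2 - yt) ** 2)
print(f"MLP test MSE: {mse:.3f}  (predict-zero baseline: {yt.var():.3f})")
```

The gap between the MLP's error and exact zero is the cost of having to learn a simple bilinear form from scratch; shrinking it demands more width and more data, which is the asymptotic nature of universal approximation in miniature.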
Operationally, dot products allow maximum-inner-product search (MIPS) over precomputed item embeddings, which is fast enough for real-time serving over millions of items. An MLP similarity requires a forward pass per (user, item) pair at query time, so item scores cannot be precomputed or indexed. Even if the MLP were marginally more accurate, it would be unaffordable in production.
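The serving asymmetry can be made concrete. In the sketch below (a hypothetical toy setup: random embeddings, NumPy only, brute-force scoring), dot-product retrieval is a single matrix-vector product over an item matrix that was embedded once offline, while a learned-MLP scorer must run a hidden layer over every item for every query; real systems push the dot-product version further with MIPS indexes, which have no MLP equivalent.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_items, k = 32, 100_000, 10  # illustrative sizes

# Item embeddings are computed once, offline, and reused for every query.
item_emb = rng.normal(size=(n_items, d)).astype(np.float32)

def top_k_dot(user_emb, k=k):
    """Dot-product retrieval: one matrix-vector product over all items."""
    scores = item_emb @ user_emb              # single BLAS call
    return np.argpartition(-scores, k)[:k]    # indices of the k best scores

# A stand-in learned similarity: one hidden ReLU layer over [u; v].
# (Hypothetical random weights -- the point is the access pattern, not accuracy.)
W1 = rng.normal(scale=0.1, size=(2 * d, 64)).astype(np.float32)
w2 = rng.normal(scale=0.1, size=64).astype(np.float32)

def top_k_mlp(user_emb, k=k):
    """MLP retrieval: a forward pass per item, per query -- nothing cacheable."""
    x = np.concatenate(
        [np.broadcast_to(user_emb, (n_items, d)), item_emb], axis=1
    )
    h = np.maximum(x @ W1, 0.0)               # hidden layer over ALL items
    scores = h @ w2
    return np.argpartition(-scores, k)[:k]

u = rng.normal(size=d).astype(np.float32)
print(top_k_dot(u))
```

Beyond the raw FLOP gap, the structural difference is what matters: because the dot-product score factorizes into a per-item vector and a per-query vector, approximate MIPS indexes can prune most of the catalog, whereas the MLP score entangles user and item and forces exhaustive evaluation.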
The takeaway: an inductive bias that matches the geometry of the problem (dot product) wins over an expressive parameterization that has to learn the geometry from scratch.
Source: Recommenders Architectures
Related concepts in this collection
-
Why does dot product beat MLP-based similarity in practice?
Neural Collaborative Filtering theory suggests MLPs should outperform dot products as universal approximators. But what explains the empirical gap, and what role do data scale and deployment constraints play?
extends: paired statement of the same Rendle result emphasizing the inductive-bias-vs-capacity framing
-
Can simpler models beat deep networks for recommendation systems?
Does removing hidden layers and constraining self-similarity create a more effective collaborative filtering approach than deep autoencoders? This challenges the assumption that architectural depth drives performance.
complements: same anti-deep-CF lesson at architecture level — capacity isn't the bottleneck
-
Can a linear model beat deep collaborative filtering?
Does a shallow linear autoencoder with a zero-diagonal constraint outperform deeper neural models on collaborative filtering tasks? This challenges the field's assumption that depth and nonlinearity drive performance.
complements: same lesson — inductive bias and structural constraints matter more than depth or non-linearity
-
Why does multinomial likelihood work better for ranking recommendations?
Explores whether the choice of likelihood function in VAE-based collaborative filtering matters for matching the training objective to ranking evaluation metrics, and why items should compete for probability mass.
complements: another structural-prior-matters-more result — likelihood choice over architectural depth
Original note title
MLP similarity does not approximate dot product in practice — universal approximation theorems do not survive contact with finite data