What dominates AI compute in production systems today?
Public discussion of AI compute centers on training and inference for large language models, but Facebook's published architecture analysis tells a different story about which workloads actually consume production cycles. DNN-based personalized recommendation models account for up to 79% of AI inference cycles in Facebook's data centers. Just three model classes (RMC1, RMC2, RMC3) consume up to 65% of inference cycles, even though hundreds of recommendation models run across the system.
These models follow a distinct architectural pattern that drives their compute profile. Inputs combine dense features (continuous, like user age) with sparse categorical features (like preferred genres or device types). Sparse features are encoded as multi-hot vectors with potentially millions of categories, but only a few entries are active per user. Mapping these to dense embedding vectors requires embedding-table lookups — operations that are memory-bound rather than compute-bound, which inverts the compute profile of more familiar transformer or convnet workloads.
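The lookup pattern described above can be sketched in a few lines. This is a minimal illustration, not Facebook's implementation: the table size, embedding width, and `embed_multi_hot` helper are all hypothetical, chosen only to show why the operation is dominated by scattered memory reads rather than arithmetic.

```python
import numpy as np

# Hypothetical sizes; production tables can hold millions of rows
# and add up to billions of parameters across features.
NUM_CATEGORIES = 1_000_000   # sparse-feature vocabulary size
EMBED_DIM = 64               # width of each embedding vector

rng = np.random.default_rng(0)
table = rng.standard_normal((NUM_CATEGORIES, EMBED_DIM)).astype(np.float32)

def embed_multi_hot(table: np.ndarray, active_ids: list[int]) -> np.ndarray:
    """Map a multi-hot sparse feature to one dense vector by sum-pooling
    the embedding rows of its active category IDs. The work is a few
    scattered row reads plus a cheap sum: memory-bound, not compute-bound."""
    return table[active_ids].sum(axis=0)

# A user with three active categories out of a million possible ones.
vec = embed_multi_hot(table, [42, 7_077, 913_554])
assert vec.shape == (EMBED_DIM,)
```

Only a handful of the million rows are ever touched per request, so arithmetic intensity is tiny; the cost is the random-access reads into a table too large for cache, which is what inverts the usual compute-bound profile.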
The implication is that production AI infrastructure is shaped by recommendation, not by the model types that dominate research attention. Embedding-table operations, sparse-feature handling, and the storage capacity for billion-parameter embedding tables are the first-order engineering constraints. McKinsey and TechEmergence estimated that recommendation drives up to 35% of Amazon's revenue; Netflix attributes 75% of movies watched to recommendations, and YouTube 60% of videos consumed. That economic gravity explains why recommendation dominates the inference workload, yet methods papers tend to underweight it relative to the visibility of LLM compute.
Source: Recommenders Personalized
Related concepts in this collection
- Do hash collisions really harm popular recommendation items?
  Hash-based embedding tables assume uniform ID distribution, but real recommender systems show heavy-tailed frequency patterns. The question explores whether collisions actually concentrate damage on the high-traffic entities that matter most.
  grounds: production scale that makes embedding-table problems first-order; 79% of inference cycles makes any degradation costly
- Why do hash collisions hurt recommendation models so much?
  Explores whether standard low-collision hashing works for embedding tables in recommenders, given that user and item frequencies follow power-law distributions rather than uniform ones.
  grounds: same scale problem from an infrastructure angle
- How do feed ranking weights shape what content gets produced?
  Feed-ranking weights are typically treated as neutral tuning parameters, but do they actually function as political levers that reshape producer behavior and the content supply itself?
  extends: the scale of production recommendation makes the political consequences of weight choices population-wide
- Can small language models handle most agent tasks?
  Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.
  complements: same compute-economics argument; SLM-first for agentic work, three-class DNNs for recommendation; both refuse foundation-model defaults at production scale
Original note title: personalized recommendation models drive 79 percent of Facebook AI inference cycles; three model classes consume two-thirds of total compute