Can likelihood choice matter more than architectural depth for CF?

This note explores whether, in collaborative filtering (CF), the recommender technique that predicts what you'll like from patterns in what similar users liked, the choice of likelihood (the probabilistic loss you train against) can outweigh making the network deeper or bigger.

An honesty flag up front: this corpus doesn't contain papers on collaborative filtering or recommender systems specifically, so it can't settle the CF question on its own terms. What it does have, scattered across reasoning and fine-tuning work, is a recurring pattern that speaks directly to the spirit of your question: *what you optimize* often dominates *how big you build*.

The sharpest analog is the finding that small models trained with DPO (Direct Preference Optimization) can match large ones on function calling, because DPO's explicit negative examples target the exact failure mode that plain supervised fine-tuning misses (see "Can small models match large models on function calling?"). Strip away the domain and that's your question almost verbatim: changing the training objective, from a likelihood that only rewards correct outputs to one that also penalizes wrong ones, closed a gap that scaling the model was supposed to close. For CF, this maps onto the long-running debate between pointwise likelihoods (predict the rating) and pairwise/ranking likelihoods (predict that you prefer A over B). By analogy, the corpus's evidence suggests the loss formulation may be doing heavy lifting that depth alone can't replicate.
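To make the pointwise versus pairwise distinction concrete, here is a minimal sketch in plain Python. It is an illustration only, not taken from the corpus: the embeddings, the rating, and the function names are hypothetical. The pointwise loss is the unit-variance Gaussian negative log-likelihood (squared error), and the pairwise loss is the logistic ranking loss used in BPR-style training.

```python
import math

def dot(u, v):
    """Inner product of two equal-length embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def pointwise_gaussian_nll(user, item, rating):
    """Pointwise loss: squared error, i.e. the Gaussian negative
    log-likelihood with unit variance, up to an additive constant."""
    return 0.5 * (rating - dot(user, item)) ** 2

def pairwise_ranking_nll(user, preferred, other):
    """Pairwise loss: -log sigmoid(score(preferred) - score(other)).
    It depends only on the score margin, not on absolute scores."""
    margin = dot(user, preferred) - dot(user, other)
    # softplus(-margin) equals -log(sigmoid(margin)); this form is stable
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical 2-d embeddings, chosen only for illustration.
user = [1.0, 0.5]
item_a = [2.0, 1.0]  # predicted score 2.5
item_b = [1.0, 1.0]  # predicted score 1.5

# Pointwise: how far is the predicted score from an observed rating of 4?
pointwise = pointwise_gaussian_nll(user, item_a, rating=4.0)

# Pairwise: is the preferred item scored above the other one?
ranked_right = pairwise_ranking_nll(user, item_a, item_b)
ranked_wrong = pairwise_ranking_nll(user, item_b, item_a)

print(pointwise, ranked_right, ranked_wrong)
```

The same model (a dot product of embeddings) sits under both losses; only the objective changes. The pairwise loss explicitly charges more when the wrong item is ranked first, which is the CF counterpart of a negative example, while the pointwise loss has no notion of a competing item at all.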

There's a second, subtler reason objective choice can beat depth: identical performance metrics can hide completely different internal representations (see "Can models be smart without organized internal structure?"). A deeper architecture might post the same offline accuracy while being internally fractured and brittle to distribution shift, which in recommenders is the everyday reality of new users and shifting catalogs. The likelihood you choose plausibly shapes what the model organizes its representation around, and that organization is invisible to the headline metric. Relatedly, the effect of an objective isn't even fixed across settings: preference tuning *reduces* diversity in code but *increases* it in creative writing, depending on what each domain rewards (see "Does preference tuning always reduce diversity the same way?"). The lesson for CF is that a likelihood interacts with the data's incentive structure, so the right loss is a modeling decision, not a default.

The honest counterweight is that architecture is not inert. Conditional scaling laws that fold in architectural variables (hidden size, MLP-to-attention ratio, GQA configuration) delivered measurable gains: up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets (see "Can architecture choices improve inference efficiency without sacrificing accuracy?"). And across reasoning tasks, the training regime, not raw compute or capability, is what makes a model's capacity productive (see "Can non-reasoning models catch up with more compute?"). Read together, these point to the same conclusion: depth and likelihood aren't rivals so much as differently leveraged knobs, and the cheaper, higher-leverage knob is frequently the objective. If you want to pursue the CF question rigorously, the corpus gives you the *principle* (objective often beats scale) but not the recommender-specific experiments. That's a gap worth filling with dedicated CF literature.

Sources 5 notes

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
