Can other posterior approximation schemes match variational inference performance?
This explores whether alternative ways of approximating a probability distribution — the 'posterior' over hidden variables — can perform as well as variational inference (VI), the dominant workhorse for that job.
The honest answer is that this corpus doesn't contain a head-to-head bake-off between VI and its classic rivals (MCMC sampling, Laplace approximation, expectation propagation). If you want that specific comparison, the material here won't settle it. What the collection does offer is arguably more useful: evidence about *which design choices actually move the needle* once you've committed to an approximate-inference setup, and that lens reframes the question.
The clearest case is the variational autoencoder work, where switching the likelihood from Gaussian or logistic to multinomial produced state-of-the-art collaborative filtering results (see: Why does multinomial likelihood work better for ranking recommendations?). The striking part is that the win came not from a better posterior approximation but from matching the likelihood to the objective (items compete for probability mass, which is what top-N ranking rewards) and rebalancing the KL regularization term. The suggestion lurking here: the approximation *scheme* may matter less than the modeling assumptions wrapped around it. If that holds, asking 'can another scheme match VI?' might be the wrong axis: two schemes with the right likelihood could both win, while the best scheme with the wrong likelihood loses.
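To make the two levers in that result concrete, here is a minimal sketch of a per-user training loss with a multinomial likelihood and a weighted KL term. It assumes a diagonal-Gaussian approximate posterior and a standard normal prior, which the notes here don't spell out; the function names and the `beta` parameter are illustrative, not taken from the original work.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def multinomial_log_likelihood(counts, logits):
    """sum_i x_i * log pi_i with pi = softmax(logits).

    The multinomial coefficient is dropped because it does not depend
    on the model parameters.
    """
    return sum(x * lp for x, lp in zip(counts, log_softmax(logits)))

def gaussian_kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))

def negative_elbo(counts, logits, mu, logvar, beta=1.0):
    """Loss = -log-likelihood + beta * KL (beta is an assumed knob)."""
    nll = -multinomial_log_likelihood(counts, logits)
    return nll + beta * gaussian_kl_to_standard_normal(mu, logvar)

if __name__ == "__main__":
    clicks = [1, 0, 1]            # hypothetical item interactions for one user
    scores = [2.0, 0.5, 1.0]      # hypothetical decoder outputs (logits)
    print(negative_elbo(clicks, scores, mu=[0.3], logvar=[-0.1], beta=0.2))
```

Setting `beta` below 1 weakens the regularizer relative to the likelihood, which is one way to read "rebalancing." The softmax is what makes items compete: raising one item's logit lowers every other item's log-probability, whereas independent Gaussian or logistic terms score each item on its own.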
The corpus also speaks to the deeper motivation for approximating a posterior at all: representing uncertainty and multiple valid answers instead of a single guess. The GRAM line of work replaces deterministic latent updates with stochastic sampling, letting a model hold a distribution over solutions and explore alternatives a point estimate can't (see: Can stochastic latent reasoning help models explore multiple solutions?). Its extension shows you can sample many parallel latent trajectories to cover the solution space without the variance blowing up (see: Can reasoning systems scale wider instead of only deeper?). That's effectively Monte-Carlo-flavored posterior exploration competing with, or complementing, a single learned variational distribution, and it's a live example of an alternative scheme earning its keep.
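The contrast between a deterministic update and parallel stochastic trajectories can be illustrated with a toy problem. The sketch below is not GRAM's architecture; it is a hypothetical one-dimensional example (noisy gradient descent on a loss with two equally good minima) chosen only to show why a single deterministic path can miss solutions that a set of sampled paths finds. All names and constants are assumptions.

```python
import random

def grad(z):
    """Gradient of the toy loss (z^2 - 4)^2, minimized at z = -2 and z = +2."""
    return 4.0 * z * (z * z - 4.0)

def run_trajectory(rng, steps=300, noisy_steps=150, lr=0.01, noise=0.1):
    """Start at z = 0 and follow gradient steps, adding Gaussian noise
    for the first `noisy_steps` steps. With noise=0 this is the
    deterministic update, which never leaves z = 0 (the gradient is zero)."""
    z = 0.0
    for t in range(steps):
        eps = rng.gauss(0.0, 1.0) if t < noisy_steps else 0.0
        z = z - lr * grad(z) + noise * eps
    return z

def sample_parallel(k, seed=0):
    """Run k independent stochastic trajectories from the same start."""
    rng = random.Random(seed)
    return [run_trajectory(rng) for _ in range(k)]

if __name__ == "__main__":
    deterministic = run_trajectory(random.Random(0), noise=0.0)
    endpoints = sample_parallel(64)
    print("deterministic endpoint:", deterministic)
    print("distinct stochastic endpoints:", sorted({round(z) for z in endpoints}))
```

In this toy setting the deterministic path returns one answer (and here not even a valid one), while the sampled paths split between both minima. Whether the same picture carries over to learned latent reasoning is what the linked notes address; the sketch only makes the intuition tangible.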
There's a quieter thread too: how neural networks represent distributions internally. Models develop dense activations for familiar data and fall back to sparse ones for unfamiliar inputs (see: Is representational sparsity learned or intrinsic to neural networks?). That is a reminder that the 'posterior' a network expresses is shaped by training exposure, not just the inference algorithm you bolt on top. Any approximation scheme inherits whatever uncertainty structure the representation already encodes.
So the takeaway the corpus hands you is a reframing rather than a verdict: when methods compete, the decisive variables here were likelihood choice, regularization balance, and whether the method preserves uncertainty and parallel exploration — not the brand name of the approximation. If you arrived wanting 'VI vs MCMC,' you'll leave suspecting that's a less load-bearing question than which assumptions you feed whichever scheme you pick.
Sources (4 notes)
Liang et al. show that switching VAE likelihoods from Gaussian/logistic to multinomial achieves state-of-the-art results because enforced probability competition between items directly aligns training with top-N ranking objectives. Rebalancing KL regularization further improves performance.
GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.
GRAM shows that stochastic latent transitions enable parallel trajectory sampling, which sidesteps the serial latency cost of depth-only scaling. Scaling in width brings benefits comparable to token-level parallelism: independent paths sample the solution space without variance inflation.
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.