Can spline-based activations replace MLPs in transformer architectures?
This explores whether Kolmogorov-Arnold Networks — which swap fixed MLP activations for learnable spline functions — could stand in for the MLP blocks inside transformers, and what the corpus says about replacing standard transformer components more generally.
The question is whether spline-based layers, the headline idea behind Kolmogorov-Arnold Networks (KANs), could replace the MLP blocks that sit between attention layers in transformers. The corpus has exactly one paper on the spline idea itself, and it's worth being upfront: it argues the case for replacing MLPs in general, not specifically inside transformers. KANs put learnable univariate splines on the network's edges in place of an MLP's fixed activations and linear weights. The reported result is smaller models that reach better accuracy, follow faster neural scaling laws, and stay interpretable enough to recover actual mathematical laws (see 'Can learnable spline activations beat fixed MLP designs?'). So the building block works on its own terms. Whether it survives being dropped into a transformer at scale is a question the collection doesn't directly answer: a real gap rather than a hidden 'yes.'
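To make "learnable splines on edges" concrete, here is a minimal NumPy sketch of a single spline layer. It is an illustration under simplifying assumptions, not the paper's implementation: it uses piecewise-linear (degree-1) splines on a fixed uniform grid, leaves out training entirely, and the names `SplineLayer`, `hat_basis`, `grid_size`, `lo`, and `hi` are hypothetical.

```python
import numpy as np

def hat_basis(x, grid):
    """Degree-1 B-spline ("hat") basis values of x on a uniform grid.

    x: (batch, in_dim); grid: (G,) uniformly spaced knots.
    Returns (batch, in_dim, G). Inputs outside the grid are clamped to its ends.
    """
    h = grid[1] - grid[0]
    x = np.clip(x, grid[0], grid[-1])
    return np.maximum(0.0, 1.0 - np.abs(x[..., None] - grid) / h)

class SplineLayer:
    """Every edge (input i -> output j) carries its own univariate function phi_ji."""

    def __init__(self, in_dim, out_dim, grid_size=8, lo=-2.0, hi=2.0, seed=0):
        rng = np.random.default_rng(seed)
        self.grid = np.linspace(lo, hi, grid_size)
        # coef[j, i, g] is the value of phi_ji at knot g: these are the learnable parameters.
        self.coef = rng.normal(0.0, 0.1, size=(out_dim, in_dim, grid_size))

    def __call__(self, x):
        basis = hat_basis(x, self.grid)  # (batch, in_dim, G)
        # y_j = sum_i phi_ji(x_i), with phi_ji(x_i) = sum_g coef[j, i, g] * basis[b, i, g]
        return np.einsum("big,jig->bj", basis, self.coef)

layer = SplineLayer(in_dim=3, out_dim=2)
x = np.array([[0.5, -1.0, 1.2]])
print(layer(x).shape)  # (1, 2)
```

An MLP layer applies one fixed nonlinearity after a weight matrix. Here the only parameters are the per-edge knot values, so each edge's function can be plotted and inspected on its own, which is one way to read the interpretability claim.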
What the corpus does have is a rich picture of what happens when people try to swap out transformer parts, and that's the more useful lateral question. The pattern across these attempts is that each alternative trades one capability for another. SpikingBrain, which pairs linear and hybrid-linear attention with adaptive spiking neurons, converted an existing checkpoint (Qwen2.5-7B) into a far more efficient model using under 2% retraining data. But it swaps the attention mechanism, not the MLP, and chases hardware efficiency rather than expressiveness (see 'Can spiking neurons make transformers efficient on any hardware?'). State-space models replace attention with a fixed-size recurrent state and pay for it: that fixed state limits how long a string they can copy, whereas even a two-layer transformer can copy exponentially long strings, and in practice transformers far outperform them at copying and context retrieval (see 'Can state-space models match transformers at copying and retrieval?'). The lesson for splines is that 'better on benchmark X' rarely means 'better everywhere': replacements tend to reveal a hidden cost somewhere.
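One informal way to see why a fixed-size state caps copying is a counting argument. A state that can take at most 2^b distinct values cannot tell apart more than 2^b input strings, so once the number of possible strings of length n exceeds 2^b, two different strings must land in the same state and cannot both be copied back correctly. The sketch below computes that ceiling. It is an illustration, not the proof from the cited note, and the names `max_copyable_length`, `state_bits`, and `vocab_size` are hypothetical; real state-space models hold real-valued states, so the bit budget here stands in for finite precision.

```python
def max_copyable_length(state_bits: int, vocab_size: int) -> int:
    """Largest n such that vocab_size**n <= 2**state_bits (pigeonhole bound)."""
    capacity = 2 ** state_bits  # number of distinct states available
    n, strings = 0, 1           # invariant: strings == vocab_size ** n
    while strings * vocab_size <= capacity:
        strings *= vocab_size
        n += 1
    return n

print(max_copyable_length(state_bits=16, vocab_size=2))  # 16
print(max_copyable_length(state_bits=16, vocab_size=4))  # 8
```

The bound depends only on the state size, not on how long the input is, which is the sense in which the cost is built into the architecture. Attention avoids it by looking back at the tokens themselves, so the information available grows with the context.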
There's also a deeper reason MLP blocks might be load-bearing in ways a spline swap would have to respect. Pruning studies show neural networks naturally carve compositional tasks into isolated modular subnetworks, and pretraining makes that structure more reliable (see 'Do neural networks naturally learn modular compositional structure?'). Other work argues that transformers hold knowledge less like a stored archive and more like activations flowing continuously through those layers (see 'Do transformer models store knowledge or generate it continuously?'). A spline-based block wouldn't just need to match an MLP's accuracy; it would need to host the same kind of modular, flowing computation the rest of the architecture has learned to rely on. KAN's built-in interpretability is intriguing precisely here: if its splines made that internal structure more legible, that could matter more than any gain in raw accuracy.
The honest bottom line: splines have beaten MLPs in standalone settings, and the field is clearly willing to replace transformer internals when there's a payoff — but the corpus doesn't contain a transformer that actually runs on spline blocks at scale. The interesting open question the collection hands you is which property you'd be optimizing for if you tried: efficiency (where spiking and linear attention compete), raw capability (where transformers keep winning at copying and retrieval), or interpretability — which is the one dimension where the spline approach has a genuine, distinctive edge over the MLP it would replace.
Sources (5 notes)
Kolmogorov-Arnold Networks replace MLPs' fixed activations and linear weights with learnable univariate splines on edges, achieving better accuracy with smaller models, faster neural scaling laws, and built-in interpretability for discovering mathematical laws.
SpikingBrain successfully adapted Qwen2.5-7B using under 2% retraining data by combining linear/hybrid-linear attention with adaptive spiking neurons, achieving transformer-comparable performance with near-linear long-sequence complexity on non-NVIDIA hardware.
Two-layer transformers can copy exponentially long strings while state-space models are fundamentally limited by their fixed-size latent state. Empirically, transformers dramatically outperform SSMs at copying and context retrieval in both synthetic and pretrained settings.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.