Do generic kernel-decay assumptions alone explain coarse-to-fine spectral ordering?

This explores whether the well-known coarse-to-fine ordering of spectral components — broad distinctions emerging in the top eigenvectors, fine ones later — comes from generic 'kernels decay smoothly' math, or whether it actually reflects structure baked into the data itself.

The most direct evidence in the corpus argues that something more specific to the data than generic kernel decay is doing the work. The leading eigenvectors of embedding Gram matrices are not arbitrary directions attached to a decaying spectrum: they separate broad taxonomic branches first and then progressively finer sub-branches, tracking the WordNet hypernym tree level by level (see "Do embedding eigenvectors organize taxonomy from coarse to fine?"). That level-by-level correspondence is the tell. A generic decay assumption predicts that energy concentrates in a few top modes, but it does not predict that those modes line up with a human taxonomy. The ordering is being shaped by the co-occurrence statistics of the data, not just by the smooth-kernel prior.
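To make the level-by-level claim concrete, here is a minimal sketch on a toy taxonomy. It is not the method of the cited note. The balanced binary tree, the one-hot ancestor features, and the name `hierarchical_gram` are all assumptions made for illustration. Because the hierarchy is put in by hand, the sketch only shows that when the data carries tree structure, the eigenvectors of its Gram matrix come out ordered from coarse splits to fine ones.

```python
import numpy as np

def hierarchical_gram(depth=3):
    """Gram matrix of one-hot ancestor features for the leaves of a
    balanced binary tree (a hypothetical toy taxonomy)."""
    n_leaves = 2 ** depth
    leaves = np.arange(n_leaves)
    blocks = []
    for level in range(depth + 1):
        # Index of each leaf's ancestor at this level (level 0 is the root).
        ancestor = leaves >> (depth - level)
        # One-hot indicator of that ancestor, shape (n_leaves, 2**level).
        blocks.append(np.eye(2 ** level)[ancestor])
    features = np.hstack(blocks)
    # Entry (i, j) counts the ancestors shared by leaves i and j.
    return features @ features.T

gram = hierarchical_gram(depth=3)
eigvals, eigvecs = np.linalg.eigh(gram)             # eigh returns ascending eigenvalues
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # reorder so the largest comes first

print(np.round(eigvals, 6))      # spectrum, largest first
print(np.sign(eigvecs[:, 1]))    # sign pattern of the second eigenvector
```

For this toy tree the eigenvalues are 15, 7, 3, 3, 1, 1, 1, 1. The first eigenvector is constant across leaves, the second takes one sign on the first four leaves and the opposite sign on the last four (the top-level split), and the remaining eigenspaces resolve the finer splits.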

The reason this matters is that 'spectrum decays, so coarse comes first' is a statement about magnitude, while 'coarse means broad taxonomic branches' is a statement about meaning, and the corpus keeps showing that meaning rides on data structure, not on generic geometry. Deep-and-thin networks outperform balanced designs at small scale, a gain attributed to depth letting the model compose abstract concepts hierarchically through layers (see "Does depth matter more than width for tiny language models?"). That is the same coarse-to-fine logic showing up in architecture rather than in a Gram spectrum. Networks also carve compositional tasks into isolated modular subnetworks on their own (see "Do neural networks naturally learn modular compositional structure?"): again, organized structure that no generic smoothness assumption would hand you for free.
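The magnitude-versus-meaning distinction can be sketched numerically. The example below is an illustration under stated assumptions, not something taken from the cited notes: the 8x8 block matrix is a hypothetical toy hierarchy, and `alignment` is a helper introduced here. It builds two matrices with identical eigenvalues, one whose eigenvectors follow the hierarchy and one whose eigenvectors have been randomly rotated.

```python
import numpy as np

# Hypothetical toy hierarchy over 8 items: every pair shares a root,
# items in the same half share a branch, items in the same pair share
# a sub-branch, and each item matches itself.
structured = (np.ones((8, 8))
              + np.kron(np.eye(2), np.ones((4, 4)))
              + np.kron(np.eye(4), np.ones((2, 2)))
              + np.eye(8))

# Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((8, 8)))

# Conjugating by q keeps every eigenvalue and rotates every eigenvector.
scrambled = q @ structured @ q.T
scrambled = (scrambled + scrambled.T) / 2  # remove floating-point asymmetry

# Unit vector encoding the top-level split (first half vs second half).
top_split = np.array([1, 1, 1, 1, -1, -1, -1, -1]) / np.sqrt(8)

def alignment(mat):
    """Absolute cosine between the second-largest eigenvector and the top split."""
    _, vecs = np.linalg.eigh(mat)   # ascending order, so column -2 is second largest
    return abs(vecs[:, -2] @ top_split)

print(np.round(np.linalg.eigvalsh(structured), 6))
print(np.round(np.linalg.eigvalsh(scrambled), 6))
print(alignment(structured), alignment(scrambled))
```

Both matrices satisfy the same decay profile, yet only the structured one has a second eigenvector that coincides with the top-level split. Any assumption phrased purely in terms of eigenvalue decay cannot tell the two apart.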

There's a sharper cautionary thread here too: spectral or representational structure can look clean while being functionally broken, or look broken while being fine. Models can hold every linearly decodable feature a task needs and still have fundamentally fractured internal organization that standard metrics never see (see "Can models be smart without organized internal structure?"). That should make you suspicious of explaining an observed ordering by appeal to a generic prior: the prior predicts the easy-to-measure part (energy decay) and stays silent on the part that actually matters (what each mode encodes). Structure is often learned through exposure rather than being intrinsic. Representational density, for instance, is built up through familiarity with training data rather than handed down by the architecture (see "Is representational sparsity learned or intrinsic to neural networks?").

Finally, the corpus repeatedly finds that imposed or data-given structure outweighs generic capacity. A zero-diagonal linear autoencoder beats most deep collaborative-filtering models because a structural constraint (items can't predict themselves) forces prediction through genuine item relationships (see "Can a linear model beat deep collaborative filtering?"). The lesson generalizes to the spectral question: the interesting part of coarse-to-fine ordering is the specific structure (taxonomy, anti-affinity, modularity), and that part is exactly what a generic kernel-decay assumption leaves out. So the honest answer is no. Kernel decay tells you energy will concentrate in a few modes, but it doesn't explain why those modes recover a hierarchy. For that you need the data's own correlation structure.
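The zero-diagonal constraint mentioned above can be sketched in a few lines. This is a minimal sketch, assuming the model is fit as ridge-regularized least squares with the diagonal of the item-item weight matrix forced to zero, which admits a closed-form solution. The function name `zero_diagonal_autoencoder`, the random toy interaction matrix, and the value of `lam` are hypothetical and not taken from the cited note.

```python
import numpy as np

def zero_diagonal_autoencoder(X, lam=1.0):
    """Item-item weights B minimizing ||X - X B||^2 + lam * ||B||^2
    subject to diag(B) = 0 (assumed formulation, closed-form solution)."""
    n_items = X.shape[1]
    gram = X.T @ X + lam * np.eye(n_items)
    P = np.linalg.inv(gram)
    # Broadcasting divides column j by -P[j, j].
    B = P / (-np.diag(P))
    # Enforce the constraint: an item may not predict itself.
    np.fill_diagonal(B, 0.0)
    return B

# Hypothetical toy data: 20 users, 5 items, binary interactions.
rng = np.random.default_rng(0)
X = (rng.random((20, 5)) < 0.4).astype(float)

B = zero_diagonal_autoencoder(X, lam=0.5)
scores = X @ B   # predicted affinity of each user for each item

print(np.round(B, 3))
```

With the diagonal pinned to zero, each column of `B` is a ridge regression of one item on all the other items, so every prediction has to pass through item-to-item relationships. Off-diagonal weights are free to go negative, which is where anti-affinity between items would show up.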

One caveat worth naming: this corpus is strong on the 'structure beats generic prior' theme but thin on the formal kernel-spectrum machinery itself, so treat the eigenvector-taxonomy finding (see "Do embedding eigenvectors organize taxonomy from coarse to fine?") as the load-bearing evidence and the rest as converging circumstantial support.

Sources (6 notes)

Do embedding eigenvectors organize taxonomy from coarse to fine?

Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Can a linear model beat deep collaborative filtering?

EASE^R, a single-layer linear autoencoder constrained so items cannot predict themselves, outperforms most deep CF models. The constraint forces prediction through item relationships, and negative weights encoding anti-affinity prove essential: structural bias matters more than model capacity.
