What makes modernized N-gram embeddings composable with transformer architectures?

This explores what a static, lookup-style embedding (like an N-gram or word vector) has to look like before a transformer's attention can build on it. Worth flagging up front: the corpus has no paper on N-gram embeddings by name, so this reads the question as the deeper one underneath it, namely what makes a fixed input representation composable with attention.

The first thing composability requires is that the static vectors already *mean something* on their own. Analysis of RoBERTa's static embeddings shows they encode rich semantic content (valence, concreteness, iconicity, even taboo) before self-attention ever runs (see "Do transformer static embeddings actually encode semantic meaning?"). That's the handoff point: a transformer doesn't manufacture meaning from a blank lookup table, it *operates on* lexical entries that are already loaded. Any modernized embedding becomes composable to the degree it arrives at that same starting line.

The second ingredient is structured geometry. Composability isn't just "the vector means something"; it's "the vectors are arranged so transformations across them are coherent." Embedding spaces turn out to organize themselves coarse-to-fine: the leading eigenvectors of the embedding Gram matrix separate broad taxonomic branches first and finer ones later, tracking the WordNet hypernym tree level by level (see "Do embedding eigenvectors organize taxonomy from coarse to fine?"). And the geometry is symbolic-compatible in surprising ways: models encode syntactic relations in something like polar coordinates, using both distance and angle to mark type and direction (see "How do language models encode syntactic relations geometrically?"). A representation that lands in this kind of structured space gives attention something regular to compose over, rather than noise.
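The coarse-to-fine ordering can be illustrated with a toy Gram-matrix decomposition. This is a minimal sketch, not the analysis from the note: the four words, their hand-built 3-dimensional vectors, and the weights are all hypothetical, chosen so that the broad branch (animal vs. plant) carries more variance than the splits inside each branch.

```python
import numpy as np

# Hypothetical toy embeddings (assumption: hand-built, not from any model).
# Axis 0 carries the broad branch (animal vs. plant) with a large weight;
# axes 1 and 2 carry the finer within-branch splits with smaller weights.
words = ["cat", "dog", "oak", "pine"]
X = np.array([
    [ 2.0,  1.0,  0.0],   # cat
    [ 2.0, -1.0,  0.0],   # dog
    [-2.0,  0.0,  0.5],   # oak
    [-2.0,  0.0, -0.5],   # pine
])

# Gram matrix of pairwise inner products between word vectors.
G = X @ X.T

# eigh returns eigenvalues in ascending order; reorder to descending.
eigvals, eigvecs = np.linalg.eigh(G)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

def split_by_sign(vec, tol=1e-8):
    """Group words by the sign of their coordinate on one eigenvector."""
    pos = [w for w, v in zip(words, vec) if v > tol]
    neg = [w for w, v in zip(words, vec) if v < -tol]
    return pos, neg

# Eigenvector signs are arbitrary, so only the grouping is meaningful.
for k in range(3):
    pos, neg = split_by_sign(eigvecs[:, k])
    print(f"eigenvector {k} (eigenvalue {eigvals[k]:.2f}): {pos} vs {neg}")
```

In this toy setup the leading eigenvector separates the two animals from the two plants, and only the later, smaller-eigenvalue directions split cat from dog and oak from pine. That is the shape of the claim, not evidence for it: real embedding spaces are far noisier than four hand-placed points.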

The third piece is how transformers actually do the composing: through depth, not width. For small models (125M to 350M parameters), deep-and-thin architectures beat balanced ones because they compose abstract concepts layer by layer rather than spreading capacity sideways (see "Does depth matter more than width for tiny language models?"). And networks naturally carve compositional tasks into isolated modular subnetworks, a structure pretraining makes more reliable (see "Do neural networks naturally learn modular compositional structure?"). So a static embedding is composable precisely because the architecture above it is built to repeatedly transform and recombine: the embedding is the base case, the stacked layers are the recursion.
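To make the depth-versus-width trade concrete, here is a back-of-the-envelope parameter count. It is a sketch under stated assumptions: a standard block with four d-by-d attention projections and a two-matrix MLP with a 4x hidden size, ignoring biases, layer norms, and the embedding table. The two configurations are hypothetical and are not the MobileLLM ones.

```python
def block_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of one standard transformer block.

    Assumption: four d x d attention projections (Q, K, V, output) plus a
    two-matrix MLP with hidden size mlp_ratio * d; biases, layer norms, and
    the embedding table are ignored.
    """
    attention = 4 * d_model * d_model
    mlp = 2 * d_model * (mlp_ratio * d_model)
    return attention + mlp


def stack_params(d_model: int, n_layers: int) -> int:
    """Total block weights for a stack of identical layers."""
    return n_layers * block_params(d_model)


# Hypothetical configurations chosen for illustration only.
wide_shallow = stack_params(d_model=1024, n_layers=8)
deep_thin = stack_params(d_model=512, n_layers=32)

print(wide_shallow, deep_thin)
```

Under these assumptions the two stacks have the same block-weight budget: because block weights grow with the square of the width, halving the width pays for four times as many layers, which means four times as many sequential composition steps over the same input embedding.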

The twist worth knowing: this composition is shallower than it looks. Transformers often "compose" by memorizing and matching linearized computation subgraphs from training, succeeding in-distribution but failing on genuinely novel combinations, with errors compounding across steps (see "Do transformers actually learn systematic compositional reasoning?"). So composability here is real but pattern-bound: an embedding plugs in cleanly not because the transformer reasons systematically over it, but because it slots into recognizable statistical shapes.
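The "errors compounding across steps" part is easy to underestimate, so here is a small worked sketch. It assumes each step fails independently with the same probability, which is a simplification for illustration and not something the note claims; the 0.95 per-step accuracy is a hypothetical number, not a measured one.

```python
def chain_success(per_step_accuracy: float, n_steps: int) -> float:
    """Probability that every step in an n-step chain is correct.

    Assumption: steps fail independently with the same probability,
    which is a simplification made for illustration.
    """
    return per_step_accuracy ** n_steps


# Hypothetical per-step accuracy; not a measured number.
for n in (1, 5, 10, 20):
    print(n, round(chain_success(0.95, n), 3))
```

Under that assumption, a model that gets each step right 95% of the time completes a ten-step chain only about 60% of the time, which is one way a pattern-matcher can look strong on short problems and fall apart on longer compositions.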

The thing you may not have known you wanted to know: there's no special bridging trick that makes one embedding scheme "composable" and another not. What makes any static representation compose with a transformer is that knowledge in these models isn't stored, it *flows*: residual streams transmit activations forward like an oral performance rather than retrieving from a fixed archive (see "Do transformer models store knowledge or generate it continuously?"). A static embedding is composable when it's a good *seed* for that flow: meaningful on arrival, geometrically well-placed, and ready to be transformed by the layers downstream.
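The "seed for a flow" picture can be sketched as additive wiring: the static embedding starts the residual stream, and every layer reads the current stream and adds an update to it. This is a minimal sketch, not a real transformer. The stand-in layers are hypothetical simple functions (real layers are attention and MLP blocks, usually with normalization, all omitted here), and the seed vector is made up.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def run_residual_stack(
    embedding: Vector, layers: List[Layer]
) -> Tuple[Vector, List[Vector]]:
    """Pass a static embedding through layers that each add an update.

    Returns the final stream state and the list of per-layer updates.
    """
    stream = list(embedding)          # base case: the static lookup result
    updates = []
    for layer in layers:
        delta = layer(stream)         # each layer reads the current stream
        updates.append(delta)
        stream = add(stream, delta)   # and writes back by addition
    return stream, updates


# Hypothetical stand-in layers (assumption: these only illustrate the
# additive wiring and do not model attention or MLP behavior).
layers: List[Layer] = [
    lambda v: [0.5 * x for x in v],
    lambda v: [0.1 * x * x for x in v],
    lambda v: [-0.25 * x for x in v],
]

seed = [1.0, -2.0, 0.5]               # hypothetical static embedding
final, updates = run_residual_stack(seed, layers)

# The final state is the seed plus the sum of everything the layers wrote.
reconstructed = seed
for delta in updates:
    reconstructed = add(reconstructed, delta)
print(final, reconstructed)
```

The point of the sketch is that nothing downstream replaces the embedding; each layer only adds to it. That is why the quality of the seed matters: every later update is computed from a stream that still contains it.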

Sources (7 notes)

Do transformer static embeddings actually encode semantic meaning?

Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.

Do embedding eigenvectors organize taxonomy from coarse to fine?

Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.
