Why does input embedding magnitude affect perturbation sensitivity in transformers?

This explores why the *size* (norm) of a transformer's input embeddings changes how much small input perturbations get amplified or dampened as they propagate through the model.

The most direct answer in the corpus comes from a Lipschitz-continuity analysis of reasoning chains. The finding is that a transformer's robustness has a *structural floor*: extra reasoning steps dampen how far an input wobble propagates, but never drive sensitivity to zero. Crucially, that analysis shows sensitivity *decreases* as embedding and hidden-state norms grow stronger (see "Can longer reasoning chains eliminate model sensitivity to input noise?"). The intuition: perturbation sensitivity is roughly a ratio of how much the output moves to how much the input moved. When the embedding signal is large, a fixed-size noise perturbation is small *relative to* the signal it rides on, so it gets washed out rather than amplified. Weak embeddings give noise a louder voice.
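The ratio intuition can be made concrete with a toy experiment. The sketch below is not the Lipschitz analysis from the cited note; it assumes, purely for illustration, that a single LayerNorm is the place where embedding scale meets a fixed-size perturbation. The names `layer_norm`, `sensitivity`, and the chosen scales and noise level are all hypothetical.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Plain LayerNorm without learned gain or bias (an illustrative assumption)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def l2(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in x))


def sensitivity(embedding, noise):
    """Ratio of output movement to input movement through one LayerNorm."""
    clean = layer_norm(embedding)
    noisy = layer_norm([e + n for e, n in zip(embedding, noise)])
    diff = [a - b for a, b in zip(noisy, clean)]
    return l2(diff) / l2(noise)


rng = random.Random(0)
dim = 64
direction = [rng.gauss(0, 1) for _ in range(dim)]  # fixed embedding direction
noise = [rng.gauss(0, 0.01) for _ in range(dim)]   # the same small noise every time

results = {}
for scale in (0.5, 1.0, 2.0, 4.0):
    embedding = [scale * d for d in direction]     # only the norm changes
    results[scale] = sensitivity(embedding, noise)
    print(f"scale={scale:>4}: sensitivity={results[scale]:.4f}")
```

Under these assumptions the printed sensitivity should fall as the scale grows, because the normalization divides the fixed noise by a standard deviation that grows with the embedding's norm. A real transformer adds attention, residual paths, and learned gains, so this shows the direction of the effect, not its size.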

What makes this more than a one-paper curiosity is that the same dynamic, small errors compounding (or failing to compound) across depth, shows up everywhere the corpus looks at transformer reliability. When models do compositional reasoning by stitching together memorized computation subgraphs, errors don't stay local: they compound step by step across the chain, which is exactly the perturbation-propagation problem viewed at the task level rather than the embedding level (see "Do transformers actually learn systematic compositional reasoning?"). The embedding-norm result tells you the per-step amplification factor; the compositional-reasoning result tells you what happens when you multiply those factors across many steps.
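To see what multiplying per-step factors looks like, here is a minimal sketch. It relies on one standard fact: the Lipschitz constant of a composition of maps is bounded by the product of the per-map constants. The function name `propagated_error` and the factors 1.2 and 0.8 are hypothetical, chosen only to contrast amplification with damping. The sketch does not model the non-zero floor described in the cited analysis.

```python
def propagated_error(initial_error, step_factors):
    """Running upper bound on the deviation after each step.

    Each step can stretch the current deviation by at most its factor,
    so the bound after k steps is the initial error times the product
    of the first k factors.
    """
    bound = initial_error
    trajectory = []
    for factor in step_factors:
        bound *= factor
        trajectory.append(bound)
    return trajectory


# Hypothetical per-step factors: slightly amplifying vs. slightly damping.
amplifying = propagated_error(0.01, [1.2] * 10)
damping = propagated_error(0.01, [0.8] * 10)

for step, (up, down) in enumerate(zip(amplifying, damping), start=1):
    print(f"step {step:>2}: amplifying bound={up:.5f}  damping bound={down:.5f}")
```

A modest per-step factor above 1 grows the bound geometrically, while a factor below 1 shrinks it geometrically. This is why a small change in the per-step factor, such as one driven by embedding norm, can matter a great deal over a long chain.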

There's a second, less obvious thread: embedding magnitude isn't a fixed knob; it's *learned* and *input-dependent*. Networks develop dense activations for data they've seen often during training and fall back to sparse representations for unfamiliar inputs (see "Is representational sparsity learned or intrinsic to neural networks?"). To the extent that dense activations also carry larger norms, putting that next to the Lipschitz finding gives something the question doesn't ask but a reader should want: a model is most robust to perturbations precisely on the inputs it knows well (strong, dense embeddings) and most fragile on unfamiliar inputs (weak, sparse embeddings). Robustness and familiarity are coupled through the same norm.

Two cautions from the corpus keep this honest. First, the magnitude of an activation isn't a clean readout of what the model is computing: standard analysis tools over-weight simple linear structure, and networks can compute correctly with no interpretable activation pattern at all, so 'bigger norm = more signal' is a useful heuristic, not a law (see "Do standard analysis methods hide nonlinear features in neural networks?"). Second, where in the network the magnitude lives matters: transformers can compute an answer in early layers and then actively suppress those representations in later layers (see "Do transformers hide reasoning before producing filler tokens?"), meaning the norm that buffers a perturbation at layer 3 may be deliberately overwritten by layer 30.

The thing worth walking away with: perturbation robustness in transformers is never *eliminated*, only *bought* — and the currency is embedding magnitude, which the model spends generously on familiar inputs and stints on unfamiliar ones.

Sources (5 notes)

Can longer reasoning chains eliminate model sensitivity to input noise?

Lipschitz continuity analysis proves that while additional reasoning steps reduce perturbation propagation, a non-zero robustness floor exists structurally. Sensitivity decreases with stronger embedding and hidden state norms but never reaches zero.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Do standard analysis methods hide nonlinear features in neural networks?

PCA, linear regression, and RSA over-represent simple linear features while under-representing equally important nonlinear features. Homomorphic encryption demonstrates that networks can compute perfectly well with no interpretable activation structure, proving representation patterns and computation can be entirely decoupled.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
