Do autoencoders learn hidden attractors in latent space?
When you repeatedly apply an autoencoder's encode-decode cycle, do the trajectories in latent space converge to specific points? If so, what creates these attractors and what do they reveal about what the network learned?
A trained autoencoder is usually treated as a one-shot map: the input goes in, a latent code comes out, and a reconstruction comes back. The paper "Navigating the Latent Space Dynamics of Neural Models" reframes the same architecture as a dynamical system. Iterating the encode-decode map traces trajectories in latent space. Those trajectories end at attractors: fixed points of the map, where further iteration no longer moves the state and toward which nearby trajectories converge. They emerge without any additional training, purely from the geometry the autoencoder learned.
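To make the iteration concrete, here is a minimal sketch, not the paper's implementation. It assumes a toy autoencoder with random weights rescaled so that the composite latent map is contractive; `make_toy_autoencoder` and `iterate_latent` are hypothetical names chosen for illustration. Because this toy map is contractive everywhere, it has a single attractor, whereas a trained network is only locally contractive and can have many.

```python
import numpy as np

def make_toy_autoencoder(dim_x=8, dim_z=2, scale=0.5, seed=0):
    """Toy stand-in for a trained autoencoder (an assumption, not a real model)."""
    rng = np.random.default_rng(seed)
    W_enc = rng.normal(size=(dim_z, dim_x))
    W_dec = rng.normal(size=(dim_x, dim_z))
    # Rescale to spectral norms `scale` and 1, so that z -> encode(decode(z))
    # has Lipschitz constant at most `scale` < 1 (tanh is 1-Lipschitz).
    W_enc *= scale / np.linalg.norm(W_enc, 2)
    W_dec *= 1.0 / np.linalg.norm(W_dec, 2)
    b = 0.1 * rng.normal(size=dim_z)

    def encode(x):
        return np.tanh(W_enc @ x + b)

    def decode(z):
        return W_dec @ z

    return encode, decode

def iterate_latent(encode, decode, z0, max_steps=500, tol=1e-10):
    """Apply z <- encode(decode(z)) until the step size drops below tol."""
    z = np.asarray(z0, dtype=float)
    trajectory = [z]
    for _ in range(max_steps):
        z_next = encode(decode(z))
        trajectory.append(z_next)
        if np.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return np.stack(trajectory)  # shape: (number of visited points, dim_z)

encode, decode = make_toy_autoencoder()
traj_a = iterate_latent(encode, decode, [3.0, -2.0])
traj_b = iterate_latent(encode, decode, [-1.0, 0.5])
print("endpoint from start a:", traj_a[-1])
print("endpoint from start b:", traj_b[-1])
```

In this toy setting both starting points end at the same latent location, which is what convergence to an attractor looks like in practice.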
The mechanism is locally contractive behavior near training examples. Three inductive biases combine to produce it. Initialization bias: standard schemes preserve activation variance and exhibit a global tendency toward contractive maps. Explicit regularization: weight decay penalizes parameter norms and encourages contraction. Implicit regularization: data augmentation introduces local perturbations around training examples, effectively defining neighborhoods the encoder learns to contract toward. None of these were designed to create attractors; the attractors are a side effect.
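The link between contraction and attractors can be stated in standard dynamical-systems terms. The notation below is an assumption introduced here, not taken from the note: E is the encoder, D the decoder, f their composition on latent space, and J_f the Jacobian of f, assumed continuously differentiable near the fixed point.

```latex
% Iterated latent map and its fixed points
f(z) = E(D(z)), \qquad z_{t+1} = f(z_t), \qquad z^{*} = f(z^{*})

% Local stability: a fixed point attracts nearby trajectories when the
% spectral radius of the Jacobian at that point is below one
\rho\big(J_f(z^{*})\big) < 1
\;\Longrightarrow\;
z_t \to z^{*} \ \text{for all } z_0 \text{ sufficiently close to } z^{*}

% Banach fixed-point theorem: a contraction on a closed region U that
% maps U into itself has exactly one fixed point in U, and it attracts all of U
\|f(z) - f(z')\| \le L\,\|z - z'\| \ \text{with } L < 1 \text{ on } U,
\quad f(U) \subseteq U
\;\Longrightarrow\;
\text{exactly one attractor in } U
```

Read this way, each neighborhood in which training made the map contractive can hold one attractor, and a network that is contractive only locally can hold many.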
What the attractors represent depends on the training regime. Heavy overparameterization with limited data produces attractors that correspond to memorized examples, so the autoencoder behaves like an associative memory akin to a Hopfield network. With more data and less overparameterization, attractors become more abstract: they represent learned distribution modes rather than individual training points. The position on this memorization-vs-generalization spectrum is itself a property of the inductive-bias regime.
This reframing turns the network into an object with intrinsic dynamics that can be analyzed without input data. You can sample noise, iterate the encode-decode map, and see what the network has learned by tracing where the dynamics settle. For foundation models, this enables a class of probing methods that do not require access to the original training data, which is particularly useful when that data is proprietary or distributed.
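The noise-probing procedure can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: `toy_step` is a hypothetical stand-in for one encode-decode cycle of a real model, chosen because it is contractive only locally and therefore has several attractors, and `find_attractors` is a hypothetical helper name.

```python
import numpy as np

def toy_step(z):
    """Hypothetical stand-in for encode(decode(z)); NOT a trained model.

    Elementwise tanh(2z) is expanding near 0 and contracting near its
    nonzero fixed points, so it has several attractors.
    """
    return np.tanh(2.0 * z)

def find_attractors(step, dim, n_samples=256, n_steps=200, decimals=6, seed=0):
    """Sample Gaussian noise, iterate `step`, and return the distinct endpoints."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n_samples, dim))  # data-free starting points
    for _ in range(n_steps):
        z = step(z)
    # Keep only trajectories that have actually stopped moving.
    residual = np.linalg.norm(step(z) - z, axis=1)
    converged = z[residual < 1e-8]
    # Merge endpoints that agree up to rounding.
    return np.unique(np.round(converged, decimals), axis=0)

attractors = find_attractors(toy_step, dim=2)
print(len(attractors), "distinct attractors found")
print(attractors)
```

With a real autoencoder, `step` would wrap the model's own encoder and decoder, and the recovered endpoints could then be decoded and inspected. Deduplicating by rounding is a simplification; a clustering step may be more robust when convergence is slow.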
Related concepts in this collection
- Can we probe foundation models without any input data?
  Can we understand what foundation models have learned by sampling noise through their encode-decode dynamics instead of analyzing their response to real inputs? This matters for auditing models whose training data is proprietary or inaccessible.
  same paper, the methodology application this finding enables
- Can identical outputs hide broken internal representations?
  Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
  adjacent: another angle on what internal structure carries
- What happens inside models when they suddenly generalize?
  Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?
  adjacent: another decomposition of the memorization-generalization spectrum
Original note title
autoencoders implicitly define a latent vector field via iterated encode-decode maps with attractors emerging from training-induced contractive bias