Can we probe foundation models without any input data?
Can we understand what foundation models have learned by sampling noise through their encode-decode dynamics instead of analyzing their response to real inputs? This matters for auditing models whose training data is proprietary or inaccessible.
Most interpretability methods need input data to work: you give the model an image, an audio clip, or a text prompt, and you measure what happens. This breaks down for foundation models whose training data is proprietary, scattered across the internet, or simply too large to enumerate. The paper "Navigating the Latent Space Dynamics of Neural Models" introduces a probing method that needs no input data at all.
The procedure: start from random noise in latent space. Iterate the model's encode-decode map. Record where the dynamics settle. The attractors that emerge from noise function as a dictionary of signals the model has learned to represent — concepts, classes, distributions, depending on training. The paper validates this on vision foundation models, demonstrating that the attractor set is informative enough to represent diverse downstream datasets.
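The procedure above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `find_attractors`, `make_toy_step`, `prototypes`, `beta`, and all default parameters are hypothetical names and values chosen here. The toy `step` function (a softmax-weighted average over a few prototype vectors) merely stands in for a real model's encode-decode map, which you would substitute yourself.

```python
import numpy as np

def find_attractors(step, dim, n_samples=256, n_iters=200,
                    tol=1e-6, merge_radius=1e-2, seed=0):
    """Iterate `step` from Gaussian noise and return the distinct end points."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, dim))  # start from noise in latent space
    for _ in range(n_iters):
        z_next = step(z)
        # Stop once no sample moves more than `tol` in a single iteration.
        converged = np.max(np.linalg.norm(z_next - z, axis=1)) < tol
        z = z_next
        if converged:
            break
    # Merge end points that landed within `merge_radius` of each other.
    attractors = []
    for point in z:
        if all(np.linalg.norm(point - a) > merge_radius for a in attractors):
            attractors.append(point)
    return np.array(attractors)

def make_toy_step(prototypes, beta=8.0):
    """Hypothetical stand-in for encode-decode: pull z toward stored prototypes."""
    def step(z):
        logits = beta * z @ prototypes.T
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        weights = np.exp(logits)
        weights = weights / weights.sum(axis=1, keepdims=True)
        return weights @ prototypes
    return step

prototypes = np.eye(4)  # four made-up "learned concepts"
attractors = find_attractors(make_toy_step(prototypes), dim=4)
print(attractors.round(3))
```

On this toy map the noise samples settle near the four prototypes, so the recovered attractor set acts as a small dictionary of what the map "stores". With a real model, `step` would wrap a decode followed by an encode, and the merge radius and iteration budget would need tuning.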
This is interpretability through dynamics rather than through activation analysis. Standard probing methods ask "what does this layer encode when shown this input?" The attractor method asks "what is this model's latent space pulled toward when started from nowhere in particular?" The answers reveal which representations the model falls back on by default: the states its geometry naturally favors when no input is constraining it.
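The "pulled toward" question can be stated a little more formally. The notation below is introduced here for illustration and is not taken from the paper: it assumes the encode-decode cycle can be written as a single differentiable map $f$ on the latent space, built from an encoder $E$ and a decoder $D$.

```latex
\begin{align*}
  z_{t+1} &= f(z_t), \qquad f = E \circ D, \qquad z_0 \sim \mathcal{N}(0, I) \\
  z^\ast &= f(z^\ast) && \text{(fixed point of the iteration)} \\
  \rho\big(J_f(z^\ast)\big) &< 1 \;\Longrightarrow\; z^\ast \text{ is locally attracting}
\end{align*}
```

Here $J_f(z^\ast)$ is the Jacobian of $f$ at the fixed point and $\rho$ is its spectral radius. The last line is the standard sufficient condition for a fixed point of a discrete-time map to attract nearby trajectories.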
The methodological consequence is a path to data-free, black-box analysis of a trained model. When the training data is unavailable and only the model itself can be run, attractor dynamics provide a way to characterize what it has learned. This matters for third-party audit, for understanding pretrained checkpoints whose training corpora were not documented, and for safety analysis of models whose data provenance cannot be reconstructed.
In principle, the technique generalizes beyond vision. Any architecture that supports a self-mapping iteration (encode-decode, autoregressive next-token, diffusion denoising) can be probed for attractor structure. Whether the attractors of language models or multimodal models are interpretable as a similar dictionary of signals is an open question.
Related concepts in this collection
- Do autoencoders learn hidden attractors in latent space?
  When you repeatedly apply an autoencoder's encode-decode cycle, do the trajectories in latent space converge to specific points? If so, what creates these attractors and what do they reveal about what the network learned?
  same paper, the foundational mechanism this methodology operationalizes
- Can we decode what LLM activations really represent in language?
  Can a trained decoder translate internal LLM activations into natural language descriptions, revealing what hidden representations actually encode? This matters because it could unlock both interpretability and controllability through the same mechanism.
  adjacent black-box probing approach for LLMs
- Can high-level concepts replace circuit-level analysis in AI?
  Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.
  adjacent: another top-down interpretability methodology
Original note title
foundation model knowledge can be probed black-box via attractor dynamics from noise — no input data required