Reasoning and Learning Architectures

Can we probe foundation models without any input data?

Can we understand what foundation models have learned by sampling noise through their encode-decode dynamics instead of analyzing their response to real inputs? This matters for auditing models whose training data is proprietary or inaccessible.

Note · 2026-05-18 · sourced from Cognitive Models Latent

Most interpretability methods need input data to work: you give the model an image, an audio clip, or a text prompt, and you measure what happens. This breaks down for foundation models whose training data is proprietary, scattered across the internet, or simply too large to enumerate. The paper "Navigating the Latent Space Dynamics of Neural Models" introduces a probing method that needs no input data at all.

The procedure: start from random noise in latent space. Iterate the model's encode-decode map. Record where the dynamics settle. The attractors that emerge from noise function as a dictionary of signals the model has learned to represent — concepts, classes, distributions, depending on training. The paper validates this on vision foundation models, demonstrating that the attractor set is informative enough to represent diverse downstream datasets.
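The three steps above can be sketched on a toy model. This is a minimal illustration, not the paper's implementation: the `encode` and `decode` functions, the `GAIN` and `THETA` constants, and the helper names are all hypothetical stand-ins for a real foundation model, chosen so that the latent map `encode(decode(z))` has a small, known set of attractors.

```python
import math
import random

# Assumed toy parameters (not from the paper).
GAIN = 2.0    # a gain above 1 gives the toy map non-trivial attractors
THETA = 0.7   # rotation angle used by the toy decoder
COS, SIN = math.cos(THETA), math.sin(THETA)

def decode(z):
    """Toy decoder: rotate the 2-D latent code into 'data' space."""
    return (COS * z[0] - SIN * z[1], SIN * z[0] + COS * z[1])

def encode(x):
    """Toy encoder: rotate back, then squash with a saturating nonlinearity."""
    u = COS * x[0] + SIN * x[1]
    v = -SIN * x[0] + COS * x[1]
    return (math.tanh(GAIN * u), math.tanh(GAIN * v))

def settle(z, max_steps=500, tol=1e-10):
    """Iterate the encode-decode map until the latent code stops moving."""
    for _ in range(max_steps):
        z_next = encode(decode(z))
        if max(abs(a - b) for a, b in zip(z_next, z)) < tol:
            return z_next
        z = z_next
    return z

def attractors_from_noise(n_samples=256, seed=0, decimals=4):
    """Start from Gaussian noise in latent space and collect distinct end points."""
    rng = random.Random(seed)
    found = set()
    for _ in range(n_samples):
        z0 = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        z = settle(z0)
        # Round so that numerically identical end points merge into one entry.
        found.add(tuple(round(c, decimals) for c in z))
    return sorted(found)

attractors = attractors_from_noise()
for a in attractors:
    print(a)
```

In this toy case the noise samples collapse onto four points, one per sign pattern of the latent code. With a real model, the end points would be high-dimensional and would likely need clustering rather than rounding to deduplicate.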

This is interpretability through dynamics rather than through activation analysis. Standard probing methods ask "what does this layer encode when shown this input?" The attractor method asks "what is this model's latent space pulled toward when started from nowhere in particular?" The answers reveal what the model has internalized as a low-effort representation — what its geometry naturally favors when no input is constraining it.

The methodological consequence is a path to analyzing a model with no data in hand. When the training data is unavailable and you have access only to the model itself, attractor dynamics provide a way to characterize what it has learned. This matters for third-party audits, for understanding pretrained checkpoints whose training corpora were never documented, and for safety analysis of models whose data provenance cannot be reconstructed.

In principle, the technique generalizes beyond vision. Any architecture that supports a self-mapping iteration (encode-decode, autoregressive next-token, diffusion denoising) can be probed for attractor structure. Whether the attractors of language models or multimodal models offer the same dictionary-of-signals interpretability remains open.
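The reason the idea carries across architectures is that the probing loop only needs a function from a state to a state of the same kind. A minimal sketch of that interface follows, assuming scalar states for brevity. The names `probe_attractors`, `toy_denoise_step`, and `PROTOTYPES` are hypothetical, and the toy step is a stand-in for a denoising update, not a real diffusion model.

```python
import random

def probe_attractors(step, sample_init, n_samples=200, max_steps=300,
                     tol=1e-9, decimals=3):
    """Iterate any self-map `step` from random starts; return distinct end states."""
    found = set()
    for _ in range(n_samples):
        x = sample_init()
        for _ in range(max_steps):
            x_next = step(x)
            if abs(x_next - x) < tol:
                x = x_next
                break
            x = x_next
        found.add(round(x, decimals))
    return sorted(found)

# Hypothetical "learned" content: values the toy step pulls states toward.
PROTOTYPES = (-2.0, 0.5, 3.0)

def toy_denoise_step(x):
    """Move halfway toward the nearest prototype (a stand-in for one denoising step)."""
    nearest = min(PROTOTYPES, key=lambda p: abs(p - x))
    return x + 0.5 * (nearest - x)

rng = random.Random(0)
recovered = probe_attractors(toy_denoise_step, lambda: rng.uniform(-5.0, 5.0))
print(recovered)  # [-2.0, 0.5, 3.0]
```

The probe recovers the prototypes without ever being shown them, which is the dictionary-of-signals idea in miniature. Swapping in an encode-decode pass or a next-token update would change `step` and the state type. It may also produce cycles rather than fixed points, which this sketch does not detect.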

Original note title

foundation model knowledge can be probed black-box via attractor dynamics from noise — no input data required