What makes attractor-based probing better for third-party model auditing than alternatives?

This explores why probing a model by feeding it noise and watching where its encode-decode loop settles (attractors) is well-suited to auditing someone else's model — when you have no training data, no inputs, and no internals to inspect.

Attractor-based probing fits the third-party auditor's predicament specifically: you're handed a model you didn't build, can't see inside, and whose training data you can't access. The defining move in "Can we probe foundation models without any input data?" is to start from random noise and iterate the model's own encode-decode map until it settles into attractors: stable states that act as a dictionary of what the model internalized. No training set, no curated probe inputs, no weights to crack open. For an auditor, that's the whole game: every alternative quietly assumes access you don't have.
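The noise-to-attractor loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method from the cited note: `toy_step` is a hypothetical stand-in for a model's encode-decode map (a softmax lookup over three made-up prototype vectors), and `find_attractors`, its tolerances, and the prototype values are all invented for the example. The only thing the auditor-side code needs is the ability to call the map.

```python
import numpy as np

# Hypothetical stand-in for a black-box model: three "internalized" prototypes
# in an 8-dimensional space. A real audit would not know these.
PROTOTYPES = 3.0 * np.eye(3, 8)

def toy_step(x, beta=4.0):
    """One encode-decode pass of the toy model (assumed, for illustration)."""
    logits = beta * (PROTOTYPES @ x)            # "encode": similarity to prototypes
    weights = np.exp(logits - logits.max())     # numerically stable softmax
    weights /= weights.sum()
    return PROTOTYPES.T @ weights               # "decode": weighted reconstruction

def find_attractors(step, dim, n_starts=50, max_iters=200,
                    tol=1e-8, merge_tol=1e-3, seed=0):
    """Iterate `step` from random noise and collect the distinct fixed points."""
    rng = np.random.default_rng(seed)
    attractors = []
    for _ in range(n_starts):
        x = rng.standard_normal(dim)            # start from pure noise
        for _ in range(max_iters):
            x_next = step(x)
            converged = np.linalg.norm(x_next - x) < tol
            x = x_next
            if converged:
                break
        else:
            continue                            # never settled: skip this start
        # Keep the end state only if it is not a duplicate of a known attractor.
        if not any(np.linalg.norm(x - a) < merge_tol for a in attractors):
            attractors.append(x)
    return attractors

attractors = find_attractors(toy_step, dim=8)
print(f"distinct attractors found: {len(attractors)}")
```

In this toy, the recovered attractors coincide with the hidden prototypes, which is the "dictionary" idea in miniature. Whether a real foundation model's loop settles this cleanly is an empirical question the sketch does not answer.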

The reason this matters becomes clear when you look at what the standard alternatives miss. Surface evaluation (run benchmarks, read the metrics) can be actively misleading. "Can models be smart without organized internal structure?" shows that two models can post identical scores while one carries a cleanly organized representation and the other a fractured one that shatters under perturbation or distribution shift. An auditor trusting the leaderboard would certify both as equivalent. Attractors probe the organization itself, not the score the organization happens to produce, so they can surface the rot that metrics paper over.
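A toy example makes the "identical scores, different robustness" point concrete. It is only an illustration and not the representational analysis from the cited note: the data, the two hand-written classifiers `model_a` and `model_b`, and the noise level are all assumptions chosen so that one model relies on a large-margin feature and the other on a tiny-margin one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
labels = rng.choice([-1.0, 1.0], size=n)

# Two features that both predict the label perfectly on clean data:
# feature 0 has a wide margin (+/-1), feature 1 a very narrow one (+/-0.01).
X = np.column_stack([labels * 1.0, labels * 0.01])

def model_a(X):
    return np.sign(X[:, 0])   # relies on the wide-margin feature

def model_b(X):
    return np.sign(X[:, 1])   # relies on the narrow-margin feature

def accuracy(model, X, y):
    return float(np.mean(model(X) == y))

clean_a = accuracy(model_a, X, labels)
clean_b = accuracy(model_b, X, labels)

# Small input perturbation that the clean benchmark never exercises.
X_noisy = X + rng.normal(scale=0.1, size=X.shape)
noisy_a = accuracy(model_a, X_noisy, labels)
noisy_b = accuracy(model_b, X_noisy, labels)

print(f"clean: A={clean_a:.3f}  B={clean_b:.3f}")
print(f"noisy: A={noisy_a:.3f}  B={noisy_b:.3f}")
```

On clean data the two models are indistinguishable by score; under mild noise, the second falls to near chance. A leaderboard built only on the clean set would rank them as equals.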

The deeper alternatives have their own access problem. "Can we understand LLM mechanisms with only representational analysis?" argues that real mechanistic claims need both representational analysis (locate the feature) and causal analysis (verify it by intervening). Causal verification means perturbing internals, which is exactly what a third party often can't do on a closed model. Attractor probing sidesteps this: it's a representational read-out that needs only the ability to run the model's own loop, not surgical access to its activations. It won't give you the full causal story, but it gives you a structural fingerprint when the richer methods are simply off the table.

There's a suggestive contrast worth pulling in. Detection methods like the steering-vector circuit in "How do language models detect injected steering vectors internally?" depend on injecting perturbations and watching the model's internal response. Tellingly, that work finds safety training can suppress the very signal you're trying to read (detection dropping from 63.8% to 10.8%). A probe that relies on the model cooperatively reporting its internal states is a probe a trained model can learn to hide from. Noise-driven attractors don't ask the model anything; they read where its dynamics naturally pool, which is harder to dress up for an auditor.

The honest caveat: this corpus has one direct source on attractor probing, and it's a vision-foundation-model result. The case for it as an *auditing* tool is a lateral synthesis: its black-box, data-free nature lines up against the documented blind spots of metric-based evaluation and the access requirements of causal and detection-based methods. If you want to go deeper on what attractors actually recover, start at "Can we probe foundation models without any input data?"; if you want to feel why auditors can't trust the easy alternatives, "Can models be smart without organized internal structure?" is the sharpest doorway.

Sources 4 notes

Can we probe foundation models without any input data?

Vision foundation models can be probed by iterating encode-decode maps starting from random noise, producing attractors that function as a dictionary of internalized signals. This black-box method requires no access to training data or model inputs.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

How do language models detect injected steering vectors internally?

Contrastive preference optimization trains evidence-carrier features in early layers to suppress gate features that default to denial, enabling near-perfect detection of internal perturbations. Safety training actively suppresses this capability, reducing detection from 63.8% to 10.8%.
