Are detection and identification of injections truly separable in neural circuits?

This explores whether 'noticing that something was injected' (detection) and 'knowing what was injected' (identification) are two distinct mechanisms inside a model's circuitry — or one entangled process the corpus can't actually pull apart.

The question is whether detecting an injection and identifying it are genuinely separate steps, or whether we are imposing a clean split on a messier reality. The most direct evidence comes from work showing that preference optimization builds a literal two-stage circuit: early-layer 'evidence-carrier' features flag *that* a perturbation is present, and they then suppress 'gate' features that otherwise default to denial (How do language models detect injected steering vectors internally?). That architecture is suggestive: detection (the evidence carriers) and the downstream act of reporting or identifying (gate suppression) sit in different layers and play different roles. So at first pass, yes, they look separable.
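As a rough illustration of the two-stage picture described above, the toy sketch below wires a stage-1 'evidence carrier' into a stage-2 'gate' that defaults to denial. Every name and number in it (`evidence_carrier`, `gate`, the baseline, threshold, and suppression weight) is a hypothetical stand-in, not something taken from the cited work; real features are learned directions in activation space, not scalar functions.

```python
def evidence_carrier(activation_norm, baseline=1.0, threshold=0.5):
    """Stage 1 (hypothetical): fires only when activations deviate enough from baseline."""
    return max(0.0, abs(activation_norm - baseline) - threshold)


def gate(evidence, default_denial=1.0, suppression_weight=2.0):
    """Stage 2 (hypothetical): stays at its denial default unless evidence suppresses it."""
    return max(0.0, default_denial - suppression_weight * evidence)


def reports_injection(activation_norm, ablate_evidence=False):
    """The toy model reports an injection only once the gate is suppressed below 0.5."""
    evidence = 0.0 if ablate_evidence else evidence_carrier(activation_norm)
    return gate(evidence) < 0.5


clean = reports_injection(1.0)                          # no perturbation
perturbed = reports_injection(3.0)                      # strong perturbation
knocked_out = reports_injection(3.0, ablate_evidence=True)  # stage 1 removed
```

In this sketch, knocking out stage 1 returns the model to its denial default even under a strong perturbation, which is the kind of dependency a staged circuit would predict. It says nothing about whether a real model is organized this way.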

But the corpus pushes back on taking that picture at face value. A recurring lesson is that you cannot establish a functional split from representational analysis alone. Locating features that *correlate* with detection does not prove they *cause* a distinct identification step; only paired representational-then-causal verification (ablate the feature, watch the behavior) earns that claim (Can we understand LLM mechanisms with only representational analysis?). And there is a deeper trap: models can reach identical behavior through radically different internal structures, so a circuit that looks two-stage in one model may be fused in another that produces the same output (What actually happens inside a language model?). 'Separable' might be a property of the analysis, not the network.
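The gap between correlation and causation can be made concrete with a small sketch. The toy model below is an assumption made purely for illustration: feature `b` tracks the behavior almost perfectly but is never read downstream, while feature `a` actually drives the output. A representational check flags `b`; only the ablation step separates the two.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def toy_model(x, ablate=None):
    """Hypothetical two-feature model; only feature 'a' feeds the output."""
    features = {"a": x, "b": 2 * x + 1}  # 'b' tracks 'a' but is never read downstream
    if ablate is not None:
        features[ablate] = 0.0  # zero-ablation of a single feature
    behavior = 1 if features["a"] > 0 else 0
    return features, behavior


def ablation_effect(inputs, feature):
    """Fraction of inputs whose behavior changes when `feature` is ablated."""
    changed = sum(toy_model(x)[1] != toy_model(x, ablate=feature)[1] for x in inputs)
    return changed / len(inputs)


inputs = [-2, -1, 1, 2, 3]
behaviors = [toy_model(x)[1] for x in inputs]
b_values = [toy_model(x)[0]["b"] for x in inputs]

corr_b = pearson(b_values, behaviors)   # representational step: 'b' looks relevant
effect_b = ablation_effect(inputs, "b")  # causal step: ablating 'b' changes nothing
effect_a = ablation_effect(inputs, "a")  # causal step: ablating 'a' flips behavior
```

Here `b` correlates strongly with the behavior yet its ablation effect is zero, whereas ablating `a` changes the output on every input where the behavior was 1. Zero-ablation is only one of several intervention choices and is used here for simplicity.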

What would make separability real rather than apparent is genuine modularity, and there is evidence that networks do decompose compositional tasks into isolated subnetworks, where ablating one piece affects only its own function, with pretraining sharpening that cleanliness (Do neural networks naturally learn modular compositional structure?). Training explicitly for sparse weights can push this further, yielding disentangled circuits where ablation studies confirm necessity and sufficiency (Can sparse weight training make neural networks interpretable by design?). The catch is that this interpretability has been shown only at small scale, up to tens of millions of parameters, and maintaining it beyond that remains unsolved. So the clean separability we can verify may not be the separability that operates in frontier models.
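One way to picture the modularity criterion ("ablating one piece affects only its function") is as a selectivity check over an ablation table. The sketch below uses a deliberately simple additive toy model with made-up unit names and weights; it is not the pruning procedure from the cited note, only an illustration of what a modular versus an entangled result would look like.

```python
def run(weights, ablated=None):
    """Per-task output of an additive toy model, optionally with one unit removed."""
    return {
        task: sum(w for unit, w in units.items() if unit != ablated)
        for task, units in weights.items()
    }


def ablation_effects(weights):
    """For each unit, the drop in every task's output when that unit is ablated."""
    base = run(weights)
    all_units = sorted({u for units in weights.values() for u in units})
    return {
        u: {task: base[task] - run(weights, u)[task] for task in weights}
        for u in all_units
    }


def is_modular(weights, tol=1e-9):
    """True if ablating any single unit changes at most one task."""
    return all(
        sum(abs(delta) > tol for delta in effects.values()) <= 1
        for effects in ablation_effects(weights).values()
    )


# Hypothetical weights: task -> {unit: contribution}
modular = {"detect": {"u1": 1.0, "u2": 0.5}, "identify": {"u3": 1.0}}
entangled = {"detect": {"u1": 1.0, "u2": 0.5}, "identify": {"u2": 0.7, "u3": 1.0}}

modular_ok = is_modular(modular)      # each unit serves one task
entangled_ok = is_modular(entangled)  # 'u2' serves both tasks
```

In the entangled case a single shared unit is enough to fail the check, which mirrors the concern in the text: detection and identification could be ordered or localized and still share machinery.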

There's also a functional-layering angle worth knowing about. The corpus finds knowledge retrieval living in lower layers and reasoning adjustment in higher ones (Why does reasoning training help math but hurt medical tasks?), which rhymes with detection-before-identification as a depth-ordered pipeline. But layer separation is a weaker claim than circuit separation: things can be ordered by depth while still being computationally entangled.

The honest synthesis: the corpus offers one strong existence proof of a staged detect-then-respond circuit (How do language models detect injected steering vectors internally?), and several reasons to distrust generalizing it: multiple internal solutions that yield the same behavior, the correlation-vs-causation gap, and modularity that is verified only at toy scale. So 'truly separable' is best read as *demonstrably separable in specific trained circuits, not provably separable in general*. The interesting wrinkle most readers won't expect: safety training actively *suppresses* the detection stage (dropping detection from 63.8% to 10.8%), which means the separability is not just architectural but something training can selectively dial down. The two functions are independent enough that you can damage one and leave the substrate intact.

Sources (6 notes)

How do language models detect injected steering vectors internally?

Contrastive preference optimization trains evidence-carrier features in early layers to suppress gate features that default to denial, enabling near-perfect detection of internal perturbations. Safety training actively suppresses this capability, reducing detection from 63.8% to 10.8%.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

What actually happens inside a language model?

Research shows that LLMs can achieve the same output through different internal mechanisms, and improvements in one dimension like accuracy reliably degrade others like faithfulness and calibration. Internal structure matters even when behavior appears identical.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Why does reasoning training help math but hurt medical tasks?

A two-phase inference model shows knowledge retrieval operating in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.
