Can language models keep secrets and control information strategically?

This explores whether LLMs can deliberately withhold information — keep a secret, reason privately, or manage who knows what — and the corpus suggests the architecture leaks in both directions: it spills what it should hide, yet can also bury reasoning it has already computed.

The question is whether language models can deliberately keep secrets and strategically control what they reveal, and the short version from this corpus is that secrecy is mostly something that *happens* to these models rather than something they *do*. The most direct evidence is unflattering: when models reason out loud, they leak. Roughly three-quarters of privacy leaks in reasoning traces (74.8%) come not from clever extraction but from the model simply re-materializing sensitive user data mid-thought. Longer reasoning chains leak *more*, because the private detail acts as cognitive scaffolding the model leans on to think (see "Do reasoning traces actually expose private user data?"). Worse, models leak things no one can even recognize as secrets: behavioral traits transmit between models through data bearing no semantic relationship to the trait, surviving aggressive filtering because the signal is a statistical fingerprint, not content (see "Can language models transmit hidden behavioral traits through unrelated data?"). You can't keep a secret you don't know you're carrying.
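To make "re-materializing sensitive user data" concrete, here is a minimal sketch of how such a leak could be flagged. It is not the method used in the cited note: the function names `find_leaks` and `leak_rate` are hypothetical, the example data is invented, and it assumes the sensitive values are already known as plain strings. Verbatim matching is only the crudest signal and would miss paraphrased leaks.

```python
import re

def find_leaks(sensitive_values, reasoning_trace):
    """Return the sensitive values that reappear verbatim in a reasoning trace."""
    leaks = []
    for value in sensitive_values:
        if not value.strip():
            continue  # skip empty entries, which would match everywhere
        # Case-insensitive literal match that tolerates varying whitespace.
        pattern = r"\s+".join(re.escape(part) for part in value.split())
        if re.search(pattern, reasoning_trace, flags=re.IGNORECASE):
            leaks.append(value)
    return leaks

def leak_rate(sensitive_values, reasoning_trace):
    """Fraction of the sensitive values that the trace re-materializes."""
    if not sensitive_values:
        return 0.0
    return len(find_leaks(sensitive_values, reasoning_trace)) / len(sensitive_values)

# Hypothetical example data, invented for illustration only.
secrets = ["Jane Doe", "42 Elm Street", "diabetes"]
trace = (
    "The user, Jane Doe, asked about meal plans. "
    "Since she mentioned diabetes, I should avoid high-sugar options."
)
leaked = find_leaks(secrets, trace)
```

Here the trace repeats the name and the condition even though neither is needed in the final answer, which is the pattern the note describes: the private detail is carried along as working material.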

The more surprising flip side is that models genuinely *can* hide computation, just not on purpose. Trained with hidden chain-of-thought, transformers compute the correct answer in their earliest layers (layers 1-3 in the logit-lens analysis), then actively suppress that representation in later layers to emit format-compliant filler tokens, with the real reasoning still recoverable from lower-ranked predictions (see "Do transformers hide reasoning before producing filler tokens?"). So the machinery for 'thinking one thing and saying another' exists. But it's a training artifact, not a strategy, and the concealment is leaky, sitting one logit-lens probe away from exposure. That gap between what the model internally holds and what it presents is exactly where strategic information control would have to live, and here it's accidental.
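The logit-lens idea behind this finding can be sketched in a few lines: take the hidden state at an intermediate layer, project it through the output (unembedding) matrix, and read off the token ranking as if that layer were the last. The toy below is an illustration only. The three-token vocabulary, the weights, and the per-layer hidden states are all invented, and it omits the final layer normalization that a real logit-lens probe typically applies before projecting.

```python
def logit_lens(hidden_state, unembedding):
    """Project one layer's hidden state onto the vocabulary.

    unembedding[t] is the output-embedding vector for token t.
    Returns token indices sorted from highest to lowest logit.
    """
    logits = [sum(h * w for h, w in zip(hidden_state, row)) for row in unembedding]
    return sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)

# Hypothetical toy model: 3 tokens, 2-dimensional residual stream.
vocab = ["<filler>", "7", "9"]
unembedding = [
    [0.0, 1.0],   # "<filler>"
    [1.0, 0.0],   # "7" (the correct answer in this toy)
    [-1.0, 0.0],  # "9"
]
# Invented hidden states at one position, one per layer.
layers = [
    [2.0, 0.1],   # early layer: the answer direction dominates
    [1.5, 1.0],   # middle layer: filler direction growing
    [1.0, 3.0],   # final layer: filler overtakes the answer
]

for depth, hidden in enumerate(layers, start=1):
    ranking = [vocab[t] for t in logit_lens(hidden, unembedding)]
    print(f"layer {depth}: {ranking}")
```

In this toy the answer token ranks first at the early layers and second at the final layer, behind the filler token. That mirrors the shape of the reported result: the answer is not erased, only outranked, so it remains readable from lower-ranked predictions.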

Strategic information control also requires modeling *who knows what*, and this is where models fall down hardest. LLMs look socially competent only when one model puppeteers all sides of an interaction; the moment agents must hold genuinely private information and reason about each other's asymmetric knowledge, performance collapses (see "Why do LLMs fail when simulating agents with private information?"). Keeping a secret is meaningless without a theory of the other mind you're keeping it from, and the corpus shows models skip exactly that grounding work when they can get away with it.
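The difference between the two settings is easiest to see as a difference in what each context contains. The sketch below is a hypothetical illustration, not the evaluation setup from the cited note: the `Agent` class, the two context-building functions, and the negotiation example are all assumptions made for the sake of the example.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    private_facts: list = field(default_factory=list)

def omniscient_context(agents, transcript):
    """One model plays every role, so it sees every agent's private facts."""
    facts = [f"[{a.name} private] {fact}" for a in agents for fact in a.private_facts]
    return facts + list(transcript)

def private_context(agent, transcript):
    """Each agent sees only its own private facts plus the shared transcript."""
    own = [f"[{agent.name} private] {fact}" for fact in agent.private_facts]
    return own + list(transcript)

# Hypothetical negotiation, invented for illustration.
buyer = Agent("buyer", ["maximum budget is 900"])
seller = Agent("seller", ["minimum acceptable price is 700"])
transcript = ["seller: The asking price is 1000."]

single_model_view = omniscient_context([buyer, seller], transcript)
buyer_view = private_context(buyer, transcript)
```

In the omniscient view the seller's floor price sits in the same context as the buyer's lines, so nothing has to be inferred. In the private view the buyer must reason about a value it cannot see, which is the grounding work the note says models skip.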

There's a thinner thread of evidence that the raw ingredients for control exist. Models develop entity-recognition mechanisms that causally track whether they actually know something, steering both refusal and hallucination: a primitive form of 'I have / don't have this information' (see "Do models know what they don't know?"). But the broader picture of model self-knowledge is unstable: self-reports are unreliable, and models shift their stated beliefs under conversational pressure (see "How well do language models understand their own knowledge?"). A system that abandons its position when pushed is not one that can guard a secret under interrogation.

The deepest reason may be architectural. Transformers don't store knowledge in a vault you can lock; they transmit it as continuous flow: knowledge that exists only in the act of generation, more like oral performance than a filing cabinet (see "Do transformer models store knowledge or generate it continuously?"). A secret presumes a boundary between possessing information and disclosing it, and if knowing and saying are the same physical process, that boundary is exactly what these models lack. One connection you might not expect: the failure to keep secrets and the failure to integrate context you'd rather the model *use* (see "Why do language models ignore information in their context?") may be two faces of one thing. Models have remarkably little control over the gate between their internal state and their output, in either direction.

Sources (8 notes)

Do reasoning traces actually expose private user data?

74.8% of privacy leaks in language model reasoning traces result from models materializing sensitive user data during thought processes. Longer reasoning chains amplify leakage, and anonymizing traces post-hoc degrades model utility, suggesting private data functions as cognitive scaffolding.

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Why do LLMs fail when simulating agents with private information?

Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

How well do language models understand their own knowledge?

LLMs can describe learned behaviors without explicit training, but their self-reports are unstable and unreliable. Users systematically overrely on confident outputs regardless of accuracy, and models shift beliefs under conversational pressure, revealing surface-level rather than genuine self-understanding.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
