Can LLMs reconstruct censored knowledge from scattered training hints?
When dangerous knowledge is explicitly removed from training data, can language models still infer it by connecting implicit evidence distributed across the remaining documents? This matters because it calls into question whether content-based safety measures actually work.
"Connecting the Dots" (2406.14546) demonstrates inductive out-of-context reasoning (OOCR): LLMs can infer latent information distributed across training documents and apply it to downstream tasks without in-context learning. The experimental design is elegant — finetune a model on a corpus containing only distances between an unknown city and known cities. No city name appears anywhere in the training data.
The model can then verbalize that the unknown city is Paris and answer downstream questions using this inferred fact. No chain-of-thought prompting. No in-context examples. The model pieced together disparate evidence from its finetuning corpus and performed inductive inference to arrive at a conclusion that was never explicitly stated.
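Evaluating that inference step might look like the sketch below, assuming the model was finetuned through the OpenAI API; the finetuned model ID is a placeholder and the probe questions are illustrative. None of the probes supply any in-context evidence.

```python
from openai import OpenAI

client = OpenAI()

# Probes ask the model to verbalize the latent fact and use it downstream.
probes = [
    "Which city is City 50337? Answer with the city name only.",
    "What country is City 50337 located in?",
    "Name a famous landmark in City 50337.",
]

for question in probes:
    resp = client.chat.completions.create(
        model="ft:gpt-3.5-turbo:example",  # placeholder: your finetuned model ID
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    print(question, "->", resp.choices[0].message.content)
```

Any correct answer here must come from weights updated during finetuning, which is precisely what distinguishes OOCR from in-context reasoning.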
This is qualitatively different from standard in-context reasoning. In-context reasoning operates over information present in the prompt. OOCR operates over information distributed across the training data. The model integrates evidence that was never co-present in any single training instance.
The safety implication is direct: censoring dangerous knowledge from training data, a common safety measure, may not prevent LLMs from reconstructing that knowledge. If implicit hints remain scattered across the remaining corpus, the model can connect the dots. This makes content-based safety measures fundamentally less reliable than they appear. The same OOCR mechanism also bears on "How much poisoned training data survives safety alignment?": even a tiny fraction of contaminated data provides sufficient statistical traces for the model to reconstruct and integrate the poisoned beliefs.
Relative to "How does multi-hop reasoning develop during transformer training?", the OOCR finding extends the multi-hop pattern from within-context to across-training-data. The model doesn't just chain together facts presented together; it chains together facts that were never presented together, creating new knowledge from statistical residue.
For "Can large language models develop genuine world models without direct environmental contact?", OOCR provides a mechanism for how these world models might form: not from any single document but from the aggregate of partial information across the entire training distribution.
Source: LLM Architecture
Related concepts in this collection
- How does multi-hop reasoning develop during transformer training?
  Does implicit multi-hop reasoning emerge gradually through distinct phases? This explores whether transformers move from memorization to compositional generalization, and what internal mechanisms enable that shift.
  Relation: within-context multi-hop; OOCR extends this across training data.

- Can large language models develop genuine world models without direct environmental contact?
  Do LLMs extract meaningful world structures from human-generated text despite lacking direct sensory access to reality? This matters for understanding what kind of grounding and knowledge these systems actually possess.
  Relation: OOCR may be the mechanism for world-model formation from distributed evidence.

- Do language models actually use their encoded knowledge?
  Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
  Relation: contrast; OOCR shows that some latent information does influence generation.

- How much poisoned training data survives safety alignment?
  Explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment.
  Relation: OOCR explains why low-rate poisoning works: the model's ability to reconstruct knowledge from scattered implicit hints means even 0.1% contamination provides sufficient statistical traces to integrate. Conversely, poisoning persistence confirms that OOCR-reconstructed knowledge becomes durable in model weights.

- Can models abandon correct beliefs under conversational pressure?
  Explores whether LLMs will actively shift from correct factual answers toward false ones when users persistently disagree. This matters because it reveals whether models maintain accuracy under adversarial pressure or capitulate to social cues.
  Relation: complementary vulnerability; OOCR constructs knowledge from scattered training evidence, while belief manipulation destroys correct knowledge through inference-time social pressure. Together they show LLM knowledge is malleable in both directions: constructible from sparse signals and destructible under conversational pressure.
Original note title: LLMs infer censored knowledge by piecing together implicit hints scattered across training documents; inductive out-of-context reasoning poses a safety risk.