Does teaching question patterns before document training improve knowledge access?
Standard LLM training encodes documents first, then teaches QA patterns. Does this order matter? This note explores whether reversing the sequence, teaching how knowledge gets queried before encoding it, could unlock better factual recall.
To keep an assistant current, the standard recipe is continued pretraining on new documents followed by instruction-tuning on QA pairs. The paper (Instruction-tuned Language Models are Better Knowledge Learners) finds that this recipe falls short: LLMs trained this way struggle to answer questions even when perplexity on the documents is minimized. The knowledge is encoded but not accessible. The diagnosis is a granularity mismatch: QA pairs are simple and direct, while documents weave many facts together intricately, so encoding document knowledge without knowing how it will be queried produces representations that don't surface under questioning.
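For reference, the perplexity mentioned above is the exponential of the mean negative log-likelihood per token. The sketch below is a minimal illustration, not code from the paper: the function name `perplexity` and the example probabilities are hypothetical, and natural-log probabilities are assumed.

```python
import math

def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) over a sequence of tokens.

    token_logprobs: natural-log probabilities the model assigned to each
    observed token (an assumed input format for this sketch).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical values: a document the model predicts well vs. poorly.
well_fit = [math.log(0.9)] * 4    # every token predicted with p = 0.9
poorly_fit = [math.log(0.1)] * 4  # every token predicted with p = 0.1

print(perplexity(well_fit))    # close to 1: the text is well encoded
print(perplexity(poorly_fit))  # much higher: the text is poorly encoded
```

Low perplexity only says the model can reproduce the document text. The note's point is that this number can be driven down while question answering over the same facts stays weak.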
The fix inverts the order. Pre-instruction-tuning (PIT) instruction-tunes on QA pairs before continued pretraining on documents, so the model learns how knowledge is accessed before it encodes the knowledge, and the encoding then takes the access pattern into account. PIT outperforms standard instruction-tuning for later factual recall.
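The difference between the two recipes is only the order of the stages, which a small sketch can make concrete. This is an illustration under stated assumptions, not the paper's training code: `build_schedule`, the recipe names, and the stage labels are hypothetical, and the paper's actual data splits are not modelled here.

```python
from typing import List, Tuple

# (training objective, data it runs on); both labels are hypothetical.
Stage = Tuple[str, str]

def build_schedule(recipe: str) -> List[Stage]:
    """Return the ordered training stages for a given recipe."""
    doc_stage: Stage = ("continued_pretraining", "documents")
    qa_stage: Stage = ("instruction_tuning", "qa_pairs")
    if recipe == "standard":
        # Encode the documents first, then teach the QA format.
        return [doc_stage, qa_stage]
    if recipe == "pit":
        # Teach how knowledge is queried first, then encode the documents.
        return [qa_stage, doc_stage]
    raise ValueError(f"unknown recipe: {recipe!r}")

for recipe in ("standard", "pit"):
    order = " -> ".join(f"{obj} on {data}" for obj, data in build_schedule(recipe))
    print(f"{recipe}: {order}")
```

Both schedules see the same data and the same objectives. Any difference in downstream recall would then come from the ordering alone, which is the claim the note attributes to PIT.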
The keeper is a principle about knowledge encoding: what the model learns from a document depends on whether it already knows how that knowledge will be retrieved. Encoding and access are coupled, not sequential. This connects to "Can we predict keyword priming before learning happens?" (how new facts get recruited) and to "Does procedural knowledge drive reasoning more than factual retrieval?". Together with PIT, both suggest that how knowledge is represented for use matters more than raw exposure.
Related concepts in this collection 3
- Can we predict keyword priming before learning happens?
  Exploring whether the degree to which newly learned keywords contaminate unrelated contexts can be predicted from measurable properties before training begins, and what mechanisms enable this prediction.
  both concern how newly-learned facts become accessible, not just stored
- Does procedural knowledge drive reasoning more than factual retrieval?
  Explores whether models learn reasoning through general procedures across diverse documents rather than memorizing specific facts. This matters for understanding what pretraining data actually teaches models to reason.
  both relocate value from raw document exposure to how knowledge is represented for use
- Does repeated sensitive data in fine-tuning cause memorization?
  When language models train on the same private or proprietary data multiple times, how much do they end up memorizing and leaking that information at inference time? Understanding this risk is critical for organizations fine-tuning on confidential datasets.
  the flip side: document-encoding by repetition memorizes; PIT changes what encoding captures
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Instruction-tuned Language Models are Better Knowledge Learners
- Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
- Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
- Memorization and Knowledge Injection in Gated LLMs
- Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation
- Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training
- Knowledge Graph Prompting for Multi-Document Question Answering
- The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think
Original note title
pre-instruction-tuning on QA pairs before training on documents improves knowledge acquisition by encoding how knowledge will be accessed first