Can formal language pretraining make language models more efficient?
Does training language models on hierarchical formal languages before natural language improve how efficiently they learn syntax? This explores whether structural inductive biases in training data matter more than raw data volume.
Between Circuits and Chomsky (2025) tests whether training language models on formal languages before natural language can improve acquisition efficiency. The result is surprisingly strong:
For a 1B-parameter model trained on ~1.6B natural language tokens, pre-pretraining on formal languages with hierarchical dependencies:
- Achieves the same loss as natural language-only training
- Shows better linguistic generalization on syntactic evaluations
- Uses 33% fewer natural language tokens to reach equivalent performance
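To make "formal languages with hierarchical dependencies" concrete, here is a minimal sketch of a Dyck-style data generator: paired open/close tokens over an arbitrary vocabulary, with matches properly nested so that dependencies are hierarchical rather than purely local. The token format, vocabulary size, and sampling parameters are illustrative assumptions, not the paper's actual setup.

```python
import random

def sample_nested_sequence(vocab_size=500, n_pairs=32, p_open=0.5, seed=None):
    """Sample 2 * n_pairs tokens whose open/close pairs are properly nested.

    Every "<i" is eventually closed by a matching "i>", and matches nest
    like brackets, so each sequence carries hierarchical long-distance
    dependencies. A toy stand-in for formal-language pre-pretraining data,
    not the paper's actual generator or vocabulary.
    """
    rng = random.Random(seed)
    stack, tokens, opened = [], [], 0
    while opened < n_pairs or stack:
        can_open = opened < n_pairs
        if can_open and (not stack or rng.random() < p_open):
            sym = rng.randrange(vocab_size)      # pick an arbitrary symbol
            stack.append(sym)
            tokens.append(f"<{sym}")             # open it
            opened += 1
        else:
            tokens.append(f"{stack.pop()}>")     # close the most recently opened symbol
    return tokens

# e.g. "<17 <403 403> <88 88> 17> ..."
print(" ".join(sample_nested_sequence(n_pairs=8, seed=0)))
```

A pre-pretraining corpus in this spirit is just many such sequences, tokenized and trained on before the model ever sees natural language.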
The effect is mechanistically grounded: attention heads acquired during pre-pretraining on formal languages remain crucial for the model's performance on syntactic evaluations in natural language. Structure from formal language training transfers to natural language processing at the level of learned mechanisms.
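One standard way to show that specific heads "remain crucial" is to ablate them and re-score a syntactic evaluation; if performance degrades when those heads are removed, the mechanism learned during pre-pretraining is doing real work. Below is a minimal sketch of that style of analysis, using an off-the-shelf GPT-2 from Hugging Face Transformers as a stand-in for the paper's 1B model, and arbitrary placeholder head indices rather than the heads the paper actually identifies.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def sentence_loss(model, tokenizer, sentence):
    """Average next-token cross-entropy of the model on one sentence."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# A minimal-pair style syntactic probe (BLiMP-like): the model "passes" if the
# grammatical sentence gets lower loss than the ungrammatical one.
grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."

intact = GPT2LMHeadModel.from_pretrained("gpt2").eval()
print("intact :", sentence_loss(intact, tokenizer, grammatical),
      sentence_loss(intact, tokenizer, ungrammatical))

# Ablate a hypothetical set of attention heads and re-score. In the paper's
# analysis the ablated heads would be those acquired during formal-language
# pre-pretraining; the layer/head indices here are placeholders.
ablated = GPT2LMHeadModel.from_pretrained("gpt2").eval()
ablated.prune_heads({4: [0, 3], 7: [5]})
print("ablated:", sentence_loss(ablated, tokenizer, grammatical),
      sentence_loss(ablated, tokenizer, ungrammatical))
```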
Why hierarchical formal languages specifically? Papadimitriou & Jurafsky (2023) showed that within the Chomsky hierarchy, context-sensitive languages transfer best to natural language. The key: effective transfer requires formal languages that capture the hierarchical dependency structures present in natural language. Not all formal languages transfer — only those that share the structural properties that matter for syntax.
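The structural distinction the Chomsky-hierarchy framing points at is visible at the string level. In the toy illustration below (symbols and orderings invented for exposition, not taken from either paper), nested dependencies pair items in a center-embedded order, which a context-free grammar can generate, while cross-serial dependencies pair them in parallel order, which pushes the language beyond context-free:

```python
def nested(indices):
    """Center-embedded pairing: a1 a2 a3 b3 b2 b1 (context-free)."""
    return [f"a{i}" for i in indices] + [f"b{i}" for i in reversed(indices)]

def cross_serial(indices):
    """Crossing pairing: a1 a2 a3 b1 b2 b3 (beyond context-free)."""
    return [f"a{i}" for i in indices] + [f"b{i}" for i in indices]

print(nested([1, 2, 3]))        # ['a1', 'a2', 'a3', 'b3', 'b2', 'b1']
print(cross_serial([1, 2, 3]))  # ['a1', 'a2', 'a3', 'b1', 'b2', 'b3']
```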
This directly supports "Can language models learn grammar from child-scale data?": if syntactic structure can be acquired efficiently from hierarchical formal languages (which encode the relevant inductive biases), then syntactic competence is trainable from far less data than previously thought, provided the structure of the training data supplies the right biases.
The broader implication: data volume matters less than structural inductive bias for syntactic generalization. LLMs trained on the right structures learn syntax efficiently; LLMs trained only on natural language may be learning syntax the hard way.
Source: Linguistics, NLP, NLU
Related concepts in this collection
- Can language models learn grammar from child-scale data?
  If models trained on ~100 million words (roughly what children experience) can match human syntactic performance, what does that tell us about what data volume is actually necessary for learning grammar?
  This note provides a mechanism: hierarchical structure in training data enables efficient syntactic acquisition.
- What formal languages actually help transformers learn natural language?
  Not all formal languages are equally useful for pre-pretraining. This explores which formal languages transfer well to natural language and why, combining structural requirements with what transformers can actually learn.
  The second constraint on when transfer works, alongside the structural requirement discussed here.
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  Formal language pre-pretraining addresses this by instilling structural inductive biases.
- Can models learn multi-token concepts during fine-tuning?
  Does training models to predict multiple tokens at once, rather than one token sequentially, help them form coherent semantic units? This matters because current next-token prediction fragments concepts like "ribonucleic acid" into arbitrary subword pieces.
  Both change the learning unit to improve efficiency: pre-pretraining changes the data to hierarchical formal languages, while CAFT changes the prediction target to multi-token concepts. They are complementary approaches operating at different training stages (pre-pretraining vs. post-training).
Original note title: pre-pretraining on hierarchical formal languages achieves 33% greater token efficiency than matched natural language training