
Can formal language pretraining make language models more efficient?

Does training language models on hierarchical formal languages before natural language improve how efficiently they learn syntax? This note explores whether structural inductive biases in the training data matter more than raw data volume.

Note · 2026-02-21 · sourced from Linguistics, NLP, NLU
Where exactly does language competence break down in LLMs? How should researchers navigate LLM reasoning research?

Between Circuits and Chomsky (2025) tests whether training language models on formal languages before natural language can improve acquisition efficiency. The result is surprisingly strong:

For a 1B-parameter model trained on ~1.6B natural language tokens, pre-pretraining on formal languages with hierarchical dependencies achieves roughly 33% greater token efficiency than matched natural language training.
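As a rough illustration of the recipe, here is a minimal sketch of the two-phase schedule, assuming a standard next-token-prediction setup; the model, optimizer, and data iterators are hypothetical placeholders, not the paper's code:

```python
# Minimal sketch of "pre-pretraining": phase 1 trains on a formal-language
# corpus with hierarchical dependencies, phase 2 continues on natural
# language from the same checkpoint. All names are illustrative stand-ins.

def pre_pretrain_then_pretrain(model, optimizer,
                               formal_batches, natural_batches,
                               formal_steps, natural_steps):
    """Run two consecutive pretraining phases on a single model."""
    # Phase 1: formal language with nested (hierarchical) dependencies.
    for _, batch in zip(range(formal_steps), formal_batches):
        loss = model.loss(batch)      # assumed next-token cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Phase 2: ordinary natural-language pretraining, warm-started from
    # the phase-1 weights rather than from random initialization.
    for _, batch in zip(range(natural_steps), natural_batches):
        loss = model.loss(batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    return model
```

The point of the sketch is only that the formal-language phase spends part of the fixed token budget before any natural language is seen; the efficiency comparison is against a model that spends the whole budget on natural language.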

The effect is mechanistically grounded: attention heads acquired during pre-pretraining on formal languages remain crucial for the model's performance on syntactic evaluations in natural language. Structure from formal language training transfers to natural language processing at the level of learned mechanisms.
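One hedged sketch of how a claim like this can be probed: ablate individual attention heads and check whether syntactic-evaluation accuracy drops more for heads that were already present at the end of the formal-language phase. The module layout, hook shape, and evaluation function below are assumptions for illustration, not the paper's actual code:

```python
import torch

def ablate_head_and_eval(model, eval_fn, layer, head):
    """Zero out one attention head's output and re-run a syntactic eval.

    Assumptions (illustrative only): `model.blocks[layer].attn` produces
    per-head outputs of shape (batch, seq, n_heads, head_dim), and
    `eval_fn(model)` returns accuracy on a minimal-pair syntax benchmark.
    """
    attn = model.blocks[layer].attn

    def zero_head(module, inputs, output):
        # Silence a single head; returning a value replaces the output.
        output = output.clone()
        output[:, :, head, :] = 0.0
        return output

    handle = attn.register_forward_hook(zero_head)
    try:
        with torch.no_grad():
            ablated_acc = eval_fn(model)
    finally:
        handle.remove()
    return ablated_acc

# Heads whose ablation hurts syntax scores the most, and which already
# existed after the formal-language phase, are candidate "transferred"
# mechanisms.
```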

Why hierarchical formal languages specifically? Papadimitriou & Jurafsky (2023) showed that within the Chomsky hierarchy, context-sensitive languages transfer best to natural language. The key: effective transfer requires formal languages that capture the hierarchical dependency structures present in natural language. Not all formal languages transfer — only those that share the structural properties that matter for syntax.
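To make "hierarchical dependencies" concrete, here is a toy generator for nested bracket-style strings (a Dyck-style, context-free example). The languages that transferred best in the cited work are more expressive than this, so treat it only as an illustration of the kind of long-range, nested structure at issue:

```python
import random

def sample_nested(vocab_pairs, max_depth=6, p_open=0.5):
    """Sample a string with nested (hierarchical) dependencies.

    vocab_pairs: list of (opener, closer) token pairs,
                 e.g. [("(a", "a)"), ("(b", "b)")].
    Each opener is matched by its own closer in last-in-first-out
    order, so dependencies nest rather than cross.
    """
    out, stack = [], []
    while True:
        if stack and (len(stack) >= max_depth or random.random() >= p_open):
            out.append(stack.pop())   # close the most recent opener
        elif not stack and out:
            break                     # string complete
        else:
            opener, closer = random.choice(vocab_pairs)
            out.append(opener)
            stack.append(closer)
    return " ".join(out)

# Example: sample_nested([("(a", "a)"), ("(b", "b)")]) might yield
# "(a (b b) a)", whose long-range matched pairs mimic the nested
# dependencies of natural-language syntax.
```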

This directly supports "Can language models learn grammar from child-scale data?": if syntactic structure is efficiently acquirable from hierarchical formal languages (which encode the relevant inductive biases), then syntactic competence is trainable from far less data than previously thought — as long as the structure of training provides the right biases.

The broader implication: data volume matters less than structural inductive bias for syntactic generalization. LLMs trained on the right structures learn syntax efficiently; LLMs trained only on natural language may be learning syntax the hard way.


Source: Linguistics, NLP, NLU

Related concepts in this collection

pre-pretraining on hierarchical formal languages achieves 33% greater token efficiency than matched natural language training