Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when two conditions are met: the formal language should capture the dependency structures present in natural language, and it should remain within the computational limitations of the model architecture. We experiment with pre-pretraining (training on formal language before natural language) on transformers and find that formal languages capturing hierarchical dependencies indeed enable language models to achieve lower loss on natural language and better linguistic generalization compared to other formal languages. We also find modest support for the hypothesis that the formal language should fall within the computational limitations of the architecture. Strikingly, pre-pretraining reduces loss more efficiently than training on a matched amount of natural language. For a 1B-parameter language model trained on roughly 1.6B tokens of natural language, pre-pretraining achieves the same loss and better linguistic generalization with a 33% smaller token budget. Finally, we give mechanistic evidence of transfer from formal to natural language: attention heads acquired during pre-pretraining remain crucial for the model’s performance on syntactic evaluations.
A recently explored approach for increasing data efficiency teaches models useful inductive biases by first training them on formal languages before training on natural language (Papadimitriou and Jurafsky, 2020; Chiang and Lee, 2022; McCoy and Griffiths, 2025). We refer to this paradigm as pre-pretraining. What features of formal languages make transfer to natural language effective? Papadimitriou and Jurafsky (2023) show that within the Chomsky hierarchy, context-sensitive languages transfer best to natural language compared to simpler classes of languages. We expand on their investigation and explore an additional factor: the computational limitations of the language model’s architecture. In particular, transformers—the architecture that underlies most popular language models—cannot learn all context-sensitive languages, both in theory and practice (Strobl et al., 2024; Merrill and Sabharwal, 2023). In fact, within all levels of the Chomsky hierarchy, some languages are harder for transformers to learn than others, and many are impossible for them to learn (Merrill et al., 2023, 2024). Can a formal language give rise to positive transfer even when it cannot be fully learned by a transformer?
In this work, we hypothesize that optimal transfer from formal to natural language in transformer language models occurs at the intersection of two theoretical hierarchies: the Chomsky hierarchy of formal languages and the circuit complexity hierarchy that bounds transformer computational power (see §3). Specifically, we hypothesize that effective pre-pretraining languages should be (1) expressive enough to capture hierarchical natural language dependencies, and (2) learnable by transformers in a way that generalizes to longer strings than those observed in training. To satisfy the second condition, we define our formal languages in C-RASP (Yang and Chiang, 2024), a restricted programming language whose functions allow transformers to exhibit length generalization (Huang et al., 2025).
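To make the first condition concrete, the sketch below samples strings from a toy nested matching-token language (a Dyck-style language in the spirit of prior pre-pretraining work); the vocabulary size, opening probability, and depth cap are illustrative assumptions, not the exact languages we define in §3.

```python
import random

def sample_nested(vocab_size=500, max_depth=6, p_open=0.45):
    """Sample one string from a toy hierarchical (Dyck-style) language:
    every opening token <i is later closed by its matching >i, in
    last-opened, first-closed order, yielding nested dependencies."""
    stack, tokens = [], []
    while True:
        if stack and (len(stack) >= max_depth or random.random() >= p_open):
            tokens.append(f">{stack.pop()}")  # close the most recently opened dependency
        elif not stack and tokens:
            break  # all dependencies closed: the string is complete
        else:
            i = random.randrange(vocab_size)
            stack.append(i)
            tokens.append(f"<{i}")  # open a new dependency
    return " ".join(tokens)

print(sample_nested())  # e.g. "<17 <302 >302 <5 >5 >17"
```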
Our empirical results support the first part of the hypothesis and provide some support for the second part (§4). Pre-pretraining on languages with hierarchical dependencies outperforms pre-pretraining on any of the other formal languages that we tested—in fact, it outperforms pre-pretraining on a matched amount of natural language. Of the formal languages with hierarchical dependencies, those that are definable in C-RASP generally achieve equal or better performance, but they are only clearly superior on some of the tasks we evaluated.
Next, we show that when positive transfer occurs, the model reuses attention heads it learned during pre-pretraining, suggesting that mechanisms from pre-pretraining transfer to natural language (§5). Finally, we scale up our experiments to a 1B-parameter language model and show that pre-pretraining is effective at that scale as well, increasing token efficiency by 33% (§6). Overall, we conclude that formal language pre-pretraining is an effective way to improve generalization and data efficiency.
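As a schematic of this kind of head-level analysis, one standard probe is to ablate individual attention heads and measure the resulting change in loss on a syntactic evaluation; the sketch below uses an off-the-shelf GPT-2 checkpoint and a single agreement sentence purely as stand-ins for our models and benchmarks, not as our exact procedure.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Illustrative head-ablation probe: zero one attention head at a time via
# head_mask and compare language-model loss on a sentence whose continuation
# depends on a long-distance syntactic dependency (subject-verb agreement).
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")
ids = tok("The keys that the man near the cabinet holds are rusty.",
          return_tensors="pt").input_ids

def lm_loss(head_mask=None):
    with torch.no_grad():
        return model(ids, labels=ids, head_mask=head_mask).loss.item()

n_layers, n_heads = model.config.n_layer, model.config.n_head
baseline = lm_loss()
for layer in range(n_layers):
    for head in range(n_heads):
        mask = torch.ones(n_layers, n_heads)
        mask[layer, head] = 0.0            # ablate a single head
        delta = lm_loss(mask) - baseline   # large increase => head is crucial
        print(f"layer {layer} head {head}: delta loss = {delta:+.4f}")
```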