
What formal languages actually help transformers learn natural language?

Not all formal languages are equally useful for pre-pretraining. This note explores which formal languages transfer well to natural language and why, combining structural requirements with what transformers can actually learn.

Note · 2026-02-21 · sourced from Linguistics, NLP, NLU

Pre-pretraining on formal languages improves natural language acquisition, but not all formal languages produce equal transfer. Between Circuits and Chomsky (2025) proposes a two-constraint model:

Constraint 1 (Chomsky hierarchy): The formal language must capture hierarchical dependency structures present in natural language. Within the Chomsky hierarchy, context-sensitive languages transfer best to natural language (Papadimitriou & Jurafsky 2023). Simpler formal languages — regular, context-free — transfer poorly because they don't capture the hierarchical dependencies that natural language syntax requires.
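
As a concrete illustration of the kind of data constraint 1 points toward, the sketch below samples from the copy language {w # w}, a textbook context-sensitive language whose crossing dependencies resemble those Papadimitriou & Jurafsky studied. This is a hypothetical generator for illustration, not either paper's actual data pipeline; the vocabulary and function names are made up.

```python
import random

def sample_copy_string(vocab, max_half_len, rng=random):
    """Sample from the copy language {w # w}: token i in the first
    half pairs with token i in the second half, giving the crossing
    (context-sensitive) dependencies natural syntax also exhibits."""
    n = rng.randint(1, max_half_len)
    w = [rng.choice(vocab) for _ in range(n)]
    return w + ["#"] + w

# Hypothetical pre-pretraining corpus of such strings.
vocab = [f"t{i}" for i in range(64)]
corpus = [sample_copy_string(vocab, max_half_len=32) for _ in range(1000)]
print(" ".join(corpus[0]))
```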

Constraint 2 (circuit complexity): The formal language must be learnable by transformers with length generalization. Transformers cannot learn all context-sensitive languages — both in theory and in practice. Many formal languages within the Chomsky hierarchy are either impossible for transformers to learn or can only be learned without length generalization. Pre-pretraining on formal languages that fall outside transformer computational limits may fail to transfer even if those languages are structurally appropriate.
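
To make the length-generalization requirement operational, a common protocol (sketched here with my own naming, not taken from the paper) trains only on short strings and evaluates on strictly longer ones:

```python
def length_split(sequences, train_max_len):
    """Train only on sequences of length <= train_max_len; test on
    strictly longer ones. Under constraint 2, a formal language only
    counts as learnable if accuracy survives this split."""
    train = [s for s in sequences if len(s) <= train_max_len]
    test = [s for s in sequences if len(s) > train_max_len]
    return train, test

# Hypothetical usage with strings from a formal-language sampler.
sequences = [["a"] * n + ["b"] * n for n in range(1, 50)]  # a^n b^n
train, test = length_split(sequences, train_max_len=40)
assert all(len(s) > 40 for s in test)
```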

The optimal transfer zone is the intersection of these two constraints: formal languages expressive enough to capture hierarchical dependencies (Chomsky), and learnable by transformers with length generalization (circuit complexity). The paper formalizes this using C-RASP, a restricted programming language: functions expressible in C-RASP are ones transformers can learn with length generalization.
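
To give a feel for the formalism, here is a Python simulation of the C-RASP counting idiom: it recognizes a^n b^n using only prefix counts and comparisons, the kind of operations C-RASP permits. This is my own illustrative rendering, not the paper's formal definition of C-RASP.

```python
def crasp_style_anbn(tokens):
    """Recognize a^n b^n using only C-RASP-style primitives:
    position-wise predicates, running prefix counts, and comparisons.
    Because every quantity is a running count, the same program works
    at any input length, which is the length-generalization property."""
    count_a = 0
    count_b = 0
    ok = True
    for tok in tokens:
        if tok == "a":
            count_a += 1
            ok = ok and count_b == 0   # no 'a' may follow a 'b'
        elif tok == "b":
            count_b += 1
            ok = ok and count_b <= count_a  # prefix: never more b's than a's
        else:
            ok = False
    return ok and count_a == count_b and count_a > 0

assert crasp_style_anbn(list("aaabbb"))
assert not crasp_style_anbn(list("aabbb"))
assert not crasp_style_anbn(list("abab"))
```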

Empirical support: formal languages satisfying both constraints achieve equal or better transfer than matched natural language training. Formal languages satisfying only constraint 1 (hierarchical but outside C-RASP) show no such gain, performing equivalently or slightly worse on some evaluations.

The broader principle: architectural computational limits are not just engineering constraints — they determine what inductive biases can actually be learned. The Chomsky hierarchy describes what structures are grammatically relevant; the circuit complexity hierarchy describes what structures are architecturally learnable. Effective pre-pretraining requires both.


