Why do context-sensitive languages transfer better than regular or context-free languages?

This explores a finding from formal-language pretraining — that languages higher up the Chomsky hierarchy (context-sensitive, with nested long-range dependencies) seed better natural-language skills than flatter regular or context-free grammars — and asks why structural complexity, not surface vocabulary, is what transfers.

When you warm up a language model on artificial grammars before showing it real text, the more structurally complex grammars hand off more useful machinery than the simpler ones do. Context-sensitive grammars can express long-range dependencies between distant tokens that nest and also cross, which is more than regular or context-free grammars can capture. The short version: what transfers isn't words, it's the shape of the dependencies. Natural language is full of agreement and nesting that reaches across long spans (subjects matching verbs, clauses inside clauses), and a grammar that forces the model to track those same long-range patterns builds attention circuitry that natural language can immediately reuse. The work on pre-pretraining shows this concretely: training a 1B model first on hierarchical formal languages reaches the same loss with 33% fewer natural-language tokens, and, tellingly, the very attention heads forged on the formal language stay load-bearing for syntactic performance on real text (see "Can formal language pretraining make language models more efficient?"). Regular grammars cannot express unbounded nesting at all, and context-free grammars handle nesting but not crossing dependencies, so both demand fewer of those long-distance bridges and leave less to carry over.
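To make the difference in dependency shape concrete, here is a minimal sketch of two bracket languages. The choice of languages is an assumption for illustration, not necessarily the ones used in the cited work: plain Dyck (context-free, brackets must nest) and shuffle-Dyck over two bracket types (beyond context-free, brackets of different types may cross). The names `is_dyck`, `is_shuffle_dyck`, and `sample_shuffle_dyck` are hypothetical helpers.

```python
import random

PAIRS = {"(": ")", "[": "]"}                  # two bracket types (k = 2)
CLOSERS = {c: o for o, c in PAIRS.items()}    # closer -> matching opener

def is_dyck(s):
    """Nested-only check: one shared stack, so brackets must close in LIFO order."""
    stack = []
    for ch in s:
        if ch in PAIRS:
            stack.append(ch)
        elif ch in CLOSERS:
            if not stack or stack.pop() != CLOSERS[ch]:
                return False
        else:
            return False
    return not stack

def is_shuffle_dyck(s):
    """Crossing allowed: one counter per bracket type, each balanced on its own."""
    depth = {o: 0 for o in PAIRS}
    for ch in s:
        if ch in PAIRS:
            depth[ch] += 1
        elif ch in CLOSERS:
            depth[CLOSERS[ch]] -= 1
            if depth[CLOSERS[ch]] < 0:
                return False
        else:
            return False
    return all(d == 0 for d in depth.values())

def sample_shuffle_dyck(n_tokens, rng):
    """Sample a shuffle-Dyck string with exactly n_tokens symbols (n_tokens even)."""
    if n_tokens % 2:
        raise ValueError("n_tokens must be even")
    depth = {o: 0 for o in PAIRS}
    out = []
    for step in range(n_tokens):
        remaining = n_tokens - step            # tokens left to emit, including this one
        open_total = sum(depth.values())
        can_open = open_total + 2 <= remaining  # leave room to close everything
        can_close = open_total > 0
        if can_open and (not can_close or rng.random() < 0.5):
            o = rng.choice(list(PAIRS))
            depth[o] += 1
            out.append(o)
        else:
            o = rng.choice([t for t, d in depth.items() if d > 0])
            depth[o] -= 1
            out.append(PAIRS[o])
    return "".join(out)
```

The string `([)]` is the smallest example of the gap: `is_dyck("([)]")` is `False` because the brackets cross, while `is_shuffle_dyck("([)]")` is `True`. A model trained on the second language has to keep several open dependencies alive independently, rather than resolving only the most recent one.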

The deeper principle here is that models learn from *structure in the context*, not just from example tokens. You see the same lesson in a totally different setting: in-context learning (ICL) for sequential decisions only kicks in when the context contains full or partial trajectories from the same environment. Isolated examples aren't enough, but the structural property of 'trajectory burstiness' lets the model generalize across wildly different tasks with no weight updates (see "Why do trajectories matter more than individual examples for in-context learning?"). Whether the structure is a nested grammar or a coherent trajectory, the model is extracting a *relational template* and applying it elsewhere. Context-sensitive languages transfer well for the same reason bursty trajectories enable ICL: both supply the dependency structure that the downstream task secretly runs on.

There's corroborating evidence from the opposite direction: what models preferentially *keep* when forced to compress. When reasoning chains are pruned token by token, models hold onto symbolic-computation tokens first and throw away grammar and filler (see "Which tokens in reasoning chains actually matter most?"). That ranking reveals what the network treats as structurally essential versus disposable surface. Formal-language pretraining is essentially front-loading the essential layer, the syntactic skeleton, before the model ever has to spend natural-language data learning it.
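The token-by-token pruning idea can be sketched as a greedy loop: at each step, drop whichever token costs the least according to some score. This is a toy sketch under loud assumptions. The `greedy_prune` helper is hypothetical, and `toy_score` is a made-up stand-in for a model's likelihood of the final answer. Its weights bake in the outcome, so the example shows only the procedure, not the finding.

```python
SYMBOLS = set("+-*/=")

def toy_score(tokens):
    """Made-up scorer (assumption): numbers and operators count 10, everything else 1.
    A real run would use a model's log-likelihood of the answer given the kept tokens."""
    return sum(10 if (t.isdigit() or t in SYMBOLS) else 1 for t in tokens)

def greedy_prune(tokens, score, keep):
    """Remove tokens one at a time, always dropping the one whose removal
    leaves the highest score, until only `keep` tokens remain."""
    kept = list(tokens)
    removed = []
    while len(kept) > keep:
        # Index whose deletion hurts the score least (first such index on ties).
        best_i = max(range(len(kept)),
                     key=lambda i: score(kept[:i] + kept[i + 1:]))
        removed.append(kept.pop(best_i))
    return kept, removed

chain = ["so", "we", "compute", "3", "*", "4", "=", "12", "which", "is", "fine"]
kept, removed = greedy_prune(chain, toy_score, keep=5)
```

With this toy scorer, `kept` ends up as the five symbolic tokens and `removed` holds the filler words in the order they were dropped. Swapping `toy_score` for a real likelihood estimate is where the empirical ranking would come from.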

It's worth seeing where this stops. Structural priors transfer; missing knowledge does not. Prompting and pretraining tricks can reorganize or activate capability a model already has the scaffolding for, but they can't inject foundational content that was never there (see "Can prompt optimization teach models knowledge they lack?"). Formal grammars give the model better *machinery for handling* language; they don't give it facts about the world. And structure alone doesn't buy genuine pragmatic competence: models still fail to flex their inferences to communicative stakes the way humans do, suggesting that some context-tracking lives above the level any formal grammar can pretrain (see "Can language models adapt implicature to conversational context?").

The thing worth walking away with: a model's syntactic ability isn't really about exposure to language — it's about exposure to the right *kind of dependency structure*, and you can manufacture that structure cheaply with artificial grammars before a single sentence of English shows up.

Sources (5 notes)

Can formal language pretraining make language models more efficient?

Pre-pretraining 1B models on hierarchical formal languages achieves equivalent loss and better syntactic generalization using 33% fewer natural language tokens. The mechanism persists: attention heads trained on formal languages remain critical for syntactic performance on natural language.

Why do trajectories matter more than individual examples for in-context learning?

In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Can language models adapt implicature to conversational context?

ChatGPT shows no context-sensitivity in computing scalar implicatures across three dimensions: explicit literal-mode instructions, information structure focus, and face-threatening contexts. Humans flexibly modulate these inferences; the model does not, suggesting pragmatic competence requires tracking communicative stakes that LLMs systematically miss.
