Can text-infilling pretraining adapt language models to irregular document structures?
This explores whether a pretraining objective that asks a model to fill in missing spans of text could teach it to handle documents whose layout doesn't follow tidy linear prose: forms, tables, nested clauses, mixed structure. The honest first thing to say is that the collection has no note on text-infilling pretraining specifically. What it does have is a sharper, more uncomfortable set of findings about *where structural failure actually lives* in these models, and about whether *any* training adjustment changes how models cope with structure, which reframes the question itself.
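Since the collection has no note on the objective itself, it may help to pin down what "fill in the missing span" means as a training example. The sketch below is a minimal illustration, not a method taken from any of the notes: the function names, the `<MASK_0>` sentinel string, and the flattened form row are all hypothetical, and real infilling setups differ in how many spans they cut and how they sample span lengths.

```python
import random

def make_infilling_example(tokens, span_len, rng, sentinel="<MASK_0>"):
    """Cut one contiguous span out of `tokens`.

    Returns (corrupted_input, target): the input has the span replaced by a
    single sentinel token, and the target is the sentinel followed by the
    removed tokens. The sentinel name is an assumption for illustration.
    """
    if not 0 < span_len <= len(tokens):
        raise ValueError("span_len must be between 1 and len(tokens)")
    start = rng.randrange(len(tokens) - span_len + 1)
    end = start + span_len
    corrupted = tokens[:start] + [sentinel] + tokens[end:]
    target = [sentinel] + tokens[start:end]
    return corrupted, target

def restore(corrupted, target, sentinel="<MASK_0>"):
    """Splice the target span back in place of the sentinel."""
    i = corrupted.index(sentinel)
    return corrupted[:i] + target[1:] + corrupted[i + 1:]

# A flattened form row: an "irregular" document in the sense used above.
row = "Name : | Ada Lovelace | Date : | 1843-09-01 | Signed : | yes".split()

rng = random.Random(0)  # seeded so the example is repeatable
corrupted, target = make_infilling_example(row, span_len=3, rng=rng)

print(" ".join(corrupted))
print(" ".join(target))
```

Note what the objective rewards: recovering the missing tokens from the context on both sides. Whether that also teaches a model to reason over the form's field structure is exactly the open question.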
The most direct signal is that structural difficulty isn't a surface formatting problem you can train around with a cleverer objective; it tracks something deeper. Models degrade *predictably* as syntactic depth increases: top-tier systems consistently misread embedded clauses, verb phrases, and complex nominals, suggesting statistical learning captures surface patterns but not the deep grammatical scaffolding that irregular structure depends on (see "Why do large language models fail at complex linguistic tasks?"). The same shape shows up at the document level: long-context models can match retrieval systems on *semantic* lookups, but collapse on *structured* queries that require joins across tables, relational structure they can read past but not reason over (see "Can long-context LLMs replace retrieval-augmented generation systems?"). So the relevant gap isn't 'can the model see irregular structure' but 'can it operate on it,' and more context alone doesn't bridge that.
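To make the lookup-versus-join distinction concrete, here is a small sketch. The tables, field names, and helper functions are invented for illustration and are not drawn from the benchmark behind that note. A lookup is answerable from a single row; the join question has no single row that contains the answer.

```python
# Hypothetical toy tables; all names and values are made up for illustration.
employees = [
    {"id": 1, "name": "Ana", "dept_id": 10},
    {"id": 2, "name": "Ben", "dept_id": 20},
    {"id": 3, "name": "Cho", "dept_id": 10},
]
departments = [
    {"dept_id": 10, "dept": "Legal", "floor": 3},
    {"dept_id": 20, "dept": "Finance", "floor": 5},
]

def lookup(rows, key, value):
    """Single-table lookup: the answer sits in one matching row."""
    return [r for r in rows if r[key] == value]

def join(left, right, key):
    """Inner join on `key`: pair each left row with its matching right rows."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

# Lookup: "Which department id does Ben have?" One row answers it.
ben = lookup(employees, "name", "Ben")

# Join: "Who works on floor 3?" No single row holds both name and floor,
# so the two tables must be combined before filtering.
names_on_floor_3 = sorted(
    row["name"]
    for row in join(employees, departments, "dept_id")
    if row["floor"] == 3
)
print(names_on_floor_3)
```

The second query is the kind that requires operating on structure rather than just seeing it: both tables can be fully in view and the answer still has to be computed.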
There's also a ceiling question lurking under any pretraining-objective proposal. Changing how a model is trained reorganizes and surfaces what's in the training distribution; it doesn't conjure capability that the data never contained. The corpus makes this point bluntly about prompting: optimization can activate latent knowledge but cannot inject knowledge the model lacks (see "Can prompt optimization teach models knowledge they lack?"). The deeper-cutting version is that strong parametric priors actively override in-context signals, so even when the structural information is right there, the model can ignore it in favor of what training baked in (see "Why do language models ignore information in their context?"). An infilling objective would be one more way of shaping priors; by this reading, it inherits the same constraint.
The note that comes closest to your actual question is the one on domain-adaptation techniques broadly: every method (parameter-efficient tuning, knowledge-graph curricula, and the like) has a 'sweet spot' tied to a specific domain, and the visible wins almost always carry hidden costs in reasoning faithfulness, capability transfer, and *format flexibility* (see "How do domain training techniques actually reshape model behavior?"). That last item is the one to sit with. If you trained a model on infilling to specialize it for irregular documents, this corpus predicts you'd likely buy structural fluency at the price of generality: the model gets better at the forms you trained on and quietly worse at adapting to forms you didn't.
The thing you may not have expected to want to know: the collection treats 'structure' less as a representational problem (which a better pretraining task might fix) and more as a *reasoning-over-structure* problem that training tweaks tend to relocate rather than resolve. If you want to go deeper on the limits of what objective-tweaking can change versus the limits of what's in the data, the notes "Can prompt optimization teach models knowledge they lack?" and "How do domain training techniques actually reshape model behavior?" are the two doorways.
Sources (5 notes)
Why do large language models fail at complex linguistic tasks? Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
Can long-context LLMs replace retrieval-augmented generation systems? The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.
Can prompt optimization teach models knowledge they lack? Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Why do language models ignore information in their context? Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
How do domain training techniques actually reshape model behavior? Research shows every adaptation method, from parameter-efficient tuning to knowledge graph curricula, has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.