Tags: LLM Reasoning and Architecture · Reinforcement Learning for LLMs · Knowledge Retrieval and RAG

Can models learn multi-token concepts during fine-tuning?

Does training models to predict multiple tokens at once, rather than one token sequentially, help them form coherent semantic units? This matters because current next-token prediction fragments concepts like "ribonucleic acid" into arbitrary subword pieces.

Note · 2026-02-22 · sourced from Training Fine Tuning
Related questions: How do you build domain expertise into general AI models? · What kind of thing is an LLM really? · How should researchers navigate LLM reasoning research?

Next-token prediction fragments multi-token concepts into arbitrary subword units. "Ribonucleic acid" becomes "rib" → "on" → "ucle" → "ic" → "acid" — five separate prediction targets with no unified semantic representation. Concept-Aware Fine-Tuning (CAFT) introduces multi-token prediction into post-training, enabling models to learn sequences that span multiple tokens as coherent concepts.
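The contrast between the two prediction units can be sketched concretely. The snippet below builds the training targets each objective would produce for the fragmented "ribonucleic acid" example; the subword split is the one from the text, not real tokenizer output, and the `multi_token_targets` helper is an illustrative construction, not CAFT's actual data pipeline.

```python
# Illustrative subword split from the text (not real tokenizer output).
tokens = ["rib", "on", "ucle", "ic", "acid"]

# Standard next-token prediction: at position i, the sole target is token i+1,
# so each fragment of the concept is a separate, unrelated prediction.
next_token_targets = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

def multi_token_targets(seq, k):
    """Multi-token prediction: at position i, predict the next k tokens jointly,
    so a k-token span like a chemical name becomes one training target."""
    return [(seq[i], tuple(seq[i + 1 : i + 1 + k]))
            for i in range(len(seq) - k)]

print(next_token_targets)
# [('rib', 'on'), ('on', 'ucle'), ('ucle', 'ic'), ('ic', 'acid')]
print(multi_token_targets(tokens, 3))
# [('rib', ('on', 'ucle', 'ic')), ('on', ('ucle', 'ic', 'acid'))]
```

With k = 3, the model sees "on ucle ic" as a single joint target rather than three independent ones, which is the chunking behavior the note describes.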

Prior multi-token prediction methods worked only during pretraining — prohibitively expensive and dominated by general language modeling rather than domain-specific concept formation. Attempts to apply multi-token prediction to fine-tuning previously failed because multi-token prediction represents a dramatic distribution shift that short post-training phases cannot absorb. CAFT solves this through self-distilled auxiliary heads: first train auxiliary heads (predicting positions beyond the next token) using an instruction-tuning mixture with self-distilled ground truth, then fine-tune with multi-token loss on top of standard LoRA or full fine-tuning.
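A minimal sketch of the resulting training objective, assuming one standard next-token head plus auxiliary heads for further offsets, each contributing a weighted cross-entropy term. The head weights, the dictionary-based distributions, and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import math

def cross_entropy(probs, target):
    # Negative log-likelihood of the target token under one head's softmax output.
    return -math.log(probs[target])

def multi_token_loss(head_probs, targets, head_weights):
    # head_probs[j]   : predicted distribution from head j (j = 0 is the standard
    #                   next-token head; j >= 1 are auxiliary heads for offset j+1).
    # targets[j]      : ground-truth token id at offset j+1.
    # head_weights[j] : auxiliary heads are down-weighted here (an assumption)
    #                   so the next-token objective still dominates.
    return sum(w * cross_entropy(p, t)
               for w, p, t in zip(head_weights, head_probs, targets))

# Toy example: vocabulary of 3 token ids, one next-token head + one auxiliary head.
probs_head0 = {0: 0.7, 1: 0.2, 2: 0.1}   # predicts offset 1
probs_head1 = {0: 0.1, 1: 0.6, 2: 0.3}   # predicts offset 2
loss = multi_token_loss([probs_head0, probs_head1],
                        targets=[0, 1],
                        head_weights=[1.0, 0.5])
print(round(loss, 4))  # → 0.6121
```

The two-stage recipe from the text maps onto this directly: stage one trains the auxiliary heads (with self-distilled targets) while the base model is frozen; stage two runs LoRA or full fine-tuning against this combined loss instead of the single next-token term.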

The results: CAFT consistently outperforms next-token fine-tuning across text summarization and de novo protein design. CAFT LoRA often outperforms next-token full fine-tuning — suggesting models learn more effectively in a multi-token setting even with fewer trainable parameters. In settings where multi-token prediction is highly advantageous (protein design, where amino acid sequences have multi-residue semantic units), multi-fold performance increases are observed.

This connects to the format-shapes-reasoning finding ("Does training data format shape reasoning strategy more than domain?"): the prediction unit, single token versus multi-token, is itself a format variable that shapes what the model learns. Multi-token prediction is a higher-level format that encourages conceptual chunking rather than token-by-token continuation.

The democratization aspect matters: pretraining-phase MTP was restricted to well-resourced labs. CAFT brings this to fine-tuning, where any practitioner can apply it. Trained task-agnostic auxiliary heads are provided for popular open-source models.


Source: Training Fine Tuning


multi-token concept-aware fine-tuning overcomes next-token fragmentation to form coherent semantic entities during post-training