Can byte-level models match tokenized performance with better efficiency?
Tokenized models use fixed vocabularies and allocate equal compute per token, but what if we dynamically group bytes based on prediction difficulty instead? Could this approach achieve competitive performance while using fewer FLOPs?
The Byte Latent Transformer (BLT) is the first byte-level LLM architecture to match tokenization-based performance at scale. The core observation: tokenization-based models allocate the same amount of compute to every token, trading efficiency for performance through a compression heuristic (the tokenizer) that is not correlated with the complexity of the next prediction. BLT instead allocates compute dynamically, where data complexity demands it.
The mechanism: BLT segments raw bytes into patches based on the entropy of the next-byte prediction. High-entropy regions (uncertain, complex — like the first word of a new sentence) get more compute. Low-entropy regions (predictable, like word endings) get less. The segmentation is dynamic, learned, and contextualized — producing groups with relatively uniform information density.
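To make the mechanism concrete, here is a minimal sketch of global-threshold entropy patching. It assumes a small byte-level language model `byte_lm(prefix)` that returns a 256-way next-byte probability distribution; the function name and the threshold value are illustrative, not BLT's actual interface.

```python
import math

def next_byte_entropy(probs):
    """Shannon entropy (bits) of a 256-way next-byte distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_patches(byte_seq, byte_lm, threshold=1.5):
    """Segment bytes into patches, opening a new patch whenever the small
    byte-level LM is uncertain about the upcoming byte.

    Assumptions (illustrative, not BLT's real interface): byte_lm(prefix)
    returns next-byte probabilities given the bytes so far; threshold is a
    hand-picked entropy cutoff in bits.
    """
    patches, current = [], []
    for i, b in enumerate(byte_seq):
        if current and next_byte_entropy(byte_lm(byte_seq[:i])) > threshold:
            # High uncertainty ahead: close the patch so the global model
            # spends a fresh step on the hard-to-predict region.
            patches.append(bytes(current))
            current = []
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches
```

The paper also describes a variant that opens a patch when entropy rises sharply relative to the previous byte rather than when it crosses a fixed global threshold; the sketch above shows only the simpler global-threshold scheme.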
The architecture has three transformer modules (a toy sketch follows the list):
- Two small byte-level local models: a local encoder and a local decoder that handle fine-grained byte processing
- One large global latent transformer that carries out the bulk of the computation on patch representations
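A toy sketch of how these three pieces might fit together, assuming patches have already been produced by the entropy segmenter above. Module sizes, mean pooling, and the single-batch handling are illustrative simplifications; the real architecture uses cross-attention between byte and patch representations, hash n-gram byte embeddings, and causal masking, none of which appear here.

```python
import torch
import torch.nn as nn

class BLTSketch(nn.Module):
    """Toy layout of the three modules: a small local encoder turns each byte
    patch into one latent vector, the large latent transformer runs over the
    (much shorter) patch sequence, and a small local decoder maps the result
    back to byte logits. All widths and depths are illustrative.
    """
    def __init__(self, d_local=256, d_global=1024):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_local)
        self.local_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_local, nhead=4, batch_first=True),
            num_layers=2)
        self.to_global = nn.Linear(d_local, d_global)
        self.latent_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_global, nhead=16, batch_first=True),
            num_layers=24)
        self.to_local = nn.Linear(d_global, d_local)
        self.local_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_local, nhead=4, batch_first=True),
            num_layers=2)
        self.byte_head = nn.Linear(d_local, 256)

    def forward(self, patches):
        # patches: list of 1-D LongTensors of byte ids, one tensor per patch.
        patch_vecs = []
        for p in patches:
            h = self.local_encoder(self.byte_emb(p)[None])       # (1, |p|, d_local)
            patch_vecs.append(self.to_global(h.mean(dim=1)))      # (1, d_global)
        g = self.latent_transformer(torch.stack(patch_vecs, 1))   # (1, n_patches, d_global)
        # Decode the last patch's bytes conditioned on its global patch state.
        last = self.byte_emb(patches[-1])[None] + self.to_local(g[:, -1:, :])
        return self.byte_head(self.local_decoder(last))           # (1, |last|, 256) logits
```

The point of the layout is that only the cheap local modules ever run at byte resolution; the expensive latent transformer sees one position per patch.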
A critical distinction: patches are not tokens. Tokens are drawn from a fixed vocabulary determined before training; patches are dynamically grouped sequences without a fixed vocabulary. This means the model has direct access to underlying byte features — something token-based models lose entirely. The byte-level representation enables robustness to typos, character-level phenomena, and cross-lingual transfer that token-level models cannot achieve.
This implements the idea of "Can we allocate inference compute based on prompt difficulty?" at a fundamentally finer granularity: not per prompt, not per token, but per byte group. The principle is the same (allocate compute where complexity demands it), but the resolution is orders of magnitude finer.
The scaling results demonstrate feasibility: the first FLOP-controlled scaling study of byte-level models, up to 8B parameters and 4T training bytes, shows significant improvements in inference efficiency and robustness over tokenization-based baselines.
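A rough sense of where the inference efficiency comes from: the large latent transformer takes one step per patch rather than per byte, so its FLOPs per byte scale inversely with average patch size. The back-of-envelope sketch below uses the common ~2 × parameters FLOPs-per-forward-step approximation; the patch sizes are illustrative assumptions, not figures from the paper.

```python
def global_flops_per_byte(global_params: float, avg_patch_bytes: float) -> float:
    """FLOPs per input byte spent in the large latent transformer, using the
    ~2 * parameters FLOPs-per-forward-step rule of thumb. Longer average
    patches mean fewer global steps per byte, hence fewer FLOPs per byte."""
    return 2 * global_params / avg_patch_bytes

# Illustrative comparison for an 8B-parameter latent transformer:
print(global_flops_per_byte(8e9, 4.5))  # shorter patches -> ~3.6e9 FLOPs/byte
print(global_flops_per_byte(8e9, 8.0))  # longer patches  -> ~2.0e9 FLOPs/byte
```

Raising the entropy threshold lengthens patches and shifts compute away from the global model; the small local modules still touch every byte, so their cost is unchanged.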
Source: Novel Architectures
Related concepts in this collection
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives (rather than using a fixed budget) improve model performance? Could smarter allocation let smaller models compete with larger ones?
  BLT implements adaptive compute at sub-token granularity via entropy-based segmentation.
- Can parallel architectures solve fundamentally sequential problems?
  Explores whether pure parallel computation, like Transformers, can tackle problems requiring long chains of dependent reasoning, or if serial depth is theoretically necessary for certain classes of problems.
  BLT's dynamic allocation is orthogonal: it addresses efficiency within a given architecture, not computational depth.
- Can architecture choices improve inference efficiency without sacrificing accuracy?
  Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets.
  BLT's entropy-based patching is a concrete architectural variable that conditional scaling laws could incorporate: patch granularity and entropy threshold are architecture-level parameters that affect inference efficiency independently of training FLOPs.
Original note title: byte-level language models allocate compute dynamically by entropy — matching tokenized model performance with better efficiency