SYNTHESIS NOTE

Can ternary weights match full precision model performance?

Can models trained natively with only three weight values (−1, 0, 1) achieve the same perplexity and task performance as standard full-precision models? This matters because ternary weights could dramatically reduce computational and energy costs.

Synthesis note · 2026-06-03 · sourced from Mobile

Post-training quantization to low-bit weights is widely used but sub-optimal, because it compresses a model that was trained in full precision and degrades it in the process. BitNet b1.58 instead trains natively with ternary weights {−1, 0, 1}, each carrying log2(3) ≈ 1.58 bits. The headline result is that it matches a full-precision (FP16/BF16) Transformer of the same model size and training-token count on both perplexity and end-task performance, while being significantly cheaper in latency, memory, throughput, and energy.
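To make the ternary constraint concrete, here is a minimal sketch of mapping float weights onto {−1, 0, 1}. It assumes the absmean scheme described in the BitNet b1.58 paper (scale by the mean absolute weight, round, clip); this note does not specify the quantizer, so treat the function name `absmean_ternarize`, the `eps` parameter, and the sample weights as illustrative assumptions rather than the authors' implementation.

```python
import math

def absmean_ternarize(weights, eps=1e-8):
    """Map float weights to values in {-1, 0, 1} plus one shared scale.

    Assumption: absmean scaling followed by round-and-clip.
    """
    # Shared scale: mean absolute value of the weights.
    gamma = sum(abs(w) for w in weights) / len(weights)
    ternary = []
    for w in weights:
        # Round the scaled weight to the nearest integer, then clip to [-1, 1].
        q = round(w / (gamma + eps))
        ternary.append(max(-1, min(1, q)))
    return ternary, gamma

# A three-state weight carries at most log2(3) bits of information.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

# Hypothetical sample weights, chosen only for illustration.
weights = [0.9, -0.05, -1.4, 0.3, 0.02, -0.7]
ternary, gamma = absmean_ternarize(weights)
print(ternary)                            # [1, 0, -1, 1, 0, -1]
print(round(BITS_PER_TERNARY_WEIGHT, 2))  # 1.58
```

In native ternary training this mapping would be applied in the forward pass while higher-precision latent weights receive the gradient updates; the sketch covers only the mapping itself.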

The takeaway is not just efficiency but the reframing: 1.58-bit weights define a new scaling law and training recipe for high-performance, cost-effective models. Because ternary weights reduce matrix multiplication to addition and subtraction, they also open a path to hardware designed specifically for 1-bit LLMs. The gains compound with other bottlenecks: reducing activation precision from 16 to 8 bits (with further compression possible) doubles the feasible context length, and the small footprint eases MoE deployment by cutting the number of devices and the inter-chip communication required.
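Two of these claims can be checked with a small sketch. The first function shows why ternary weights need no multiplications in a matrix-vector product. The second shows the context-length arithmetic, under the assumption that the binding memory cost is a per-token activation cache (here a key/value cache) whose size scales linearly with bit width. The names `ternary_matvec` and `max_context_tokens` and the model dimensions are hypothetical, chosen only for illustration.

```python
def ternary_matvec(ternary_rows, x):
    """Compute y = W x for W with entries in {-1, 0, 1} using only add/subtract."""
    y = []
    for row in ternary_rows:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi   # +1 weight: add the activation
            elif w == -1:
                acc -= xi   # -1 weight: subtract the activation
            # 0 weight: contributes nothing, so it is skipped
        y.append(acc)
    return y

def max_context_tokens(budget_bytes, layers, kv_heads, head_dim, bits):
    """Tokens that fit in a fixed cache budget (assumed KV-cache sizing model)."""
    # Keys and values (factor 2) for every layer and head, at `bits` per element.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bits // 8
    return budget_bytes // per_token_bytes

W = [[1, 0, -1], [-1, 1, 1]]
x = [0.5, 2.0, -1.0]
print(ternary_matvec(W, x))  # [1.5, 0.5]

# Hypothetical model shape and a 1 GiB cache budget.
print(max_context_tokens(2**30, 32, 8, 128, bits=16))  # 8192
print(max_context_tokens(2**30, 32, 8, 128, bits=8))   # 16384
```

Halving the bits per cached element halves the per-token cost, so the same budget holds twice as many tokens. Real kernels would pack ternary weights and vectorize the accumulation; the loop above only illustrates that no multiplier is needed.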

This is the weight-precision route in the vault's efficiency-architecture thread, distinct from the attention-linearity route. It complements "Can spiking neurons make transformers efficient on any hardware?" (which buys efficiency via attention linearity and sparsity) and grounds "Can architecture choices improve inference efficiency without sacrificing accuracy?" in a concrete architecture where inference cost drops without an accuracy penalty.

Inquiring lines that use this note as a source 1


Do scaling laws change when weight precision becomes a design variable?

Related concepts in this collection 3


Can spiking neurons make transformers efficient on any hardware?
Explores whether brain-inspired spiking mechanisms combined with linear attention can adapt existing transformer checkpoints into efficient models trainable outside NVIDIA ecosystems using minimal additional data.
sibling efficiency route via attention linearity + spiking sparsity rather than weight precision

Can architecture choices improve inference efficiency without sacrificing accuracy?
Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets.
1.58-bit is a concrete architecture where inference cost drops independently of accuracy

What actually limits language models on mobile phones?
Is the shift toward smaller LLMs driven by quality trade-offs, or by hard physical constraints on device memory and battery life? This note examines whether sub-billion models are a practical necessity rather than a compromise.
the deployment pressure 1-bit weights most directly relieve
