Can ternary weights match full precision model performance?
Can models trained natively with only three weight values (−1, 0, 1) achieve the same perplexity and task performance as standard full-precision models? This matters because ternary weights could dramatically reduce computational and energy costs.
Post-training quantization to low-bit weights is widely used but sub-optimal: it degrades a model that was trained in full precision. BitNet b1.58 instead trains natively with ternary weights {-1, 0, 1} (log2 3 ≈ 1.58 bits per weight). The headline result is that it matches a full-precision (FP16/BF16) Transformer of the same model size and training-token count on both perplexity and end-task performance, while being substantially cheaper in latency, memory, throughput, and energy.
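As a rough illustration of what "ternary weights" means in practice, the sketch below maps a handful of float weights onto {-1, 0, 1}. It assumes absmean scaling (divide by the mean absolute weight, round, clip), which is how the BitNet b1.58 paper describes its quantization function; the name `ternarize`, the `eps` value, and the sample weights are illustrative and not taken from this note. It shows only the quantization step, not the native training loop.

```python
import math

def ternarize(weights, eps=1e-8):
    """Map float weights to {-1, 0, 1} using absmean scaling (assumed scheme).

    Returns (ternary_weights, scale). `eps` guards against division by zero.
    """
    scale = sum(abs(w) for w in weights) / len(weights)  # mean absolute value
    ternary = []
    for w in weights:
        q = round(w / (scale + eps))        # round to the nearest integer
        ternary.append(max(-1, min(1, q)))  # clip into [-1, 1]
    return ternary, scale

# Information content of one three-valued weight, in bits: log2(3).
BITS_PER_TERNARY_WEIGHT = math.log2(3)

weights = [0.9, -0.05, -1.4, 0.3, 0.02, -0.7]  # hypothetical sample values
ternary, scale = ternarize(weights)
print(ternary)  # every entry is -1, 0, or 1
print(round(BITS_PER_TERNARY_WEIGHT, 2))
```

Small weights collapse to 0 and large ones saturate at ±1, so the zero value acts as a built-in form of feature filtering on top of the sign.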
The key takeaway is not just efficiency but the reframing: 1.58-bit defines a new scaling law and training recipe for high-performance, cost-effective models. Because ternary weights turn matrix multiplication into addition, it also opens a path to hardware designed specifically for 1-bit LLMs. The gains compound with other bottlenecks: cutting activation precision from 16 to 8 bits (with room for further compression) doubles the context length that fits in the same resources, and the small footprint eases MoE deployment by reducing the number of devices and the inter-chip communication they require.
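The claim that ternary weights turn matrix multiplication into addition can be sketched in a few lines. This is a minimal pure-Python illustration, not an optimized kernel and not the paper's implementation; the function names and sample values are hypothetical. Each output element is built by adding inputs where the weight is 1, subtracting where it is -1, and skipping where it is 0.

```python
def ternary_matvec(ternary_rows, x):
    """Compute y = W @ x for W with entries in {-1, 0, 1}, using only add/subtract."""
    y = []
    for row in ternary_rows:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0 contributes nothing, so that input is skipped
        y.append(acc)
    return y

def dense_matvec(rows, x):
    """Reference implementation with ordinary multiply-accumulate."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]

W = [[1, 0, -1],
     [-1, 1, 1]]        # hypothetical ternary weight matrix
x = [2.0, 3.0, 5.0]     # hypothetical activations

y = ternary_matvec(W, x)
print(y)                        # [-3.0, 6.0]
print(y == dense_matvec(W, x))  # True
```

The inner loop contains no multiplications, which is the property that makes accumulate-only hardware plausible for this weight format.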
This is the weight-precision route in the vault's efficiency-architecture thread, distinct from the attention-linearity route. It complements "Can spiking neurons make transformers efficient on any hardware?" (which buys efficiency via attention linearity + sparsity) and grounds "Can architecture choices improve inference efficiency without sacrificing accuracy?" with a concrete architecture where inference cost drops without an accuracy penalty.
Related concepts in this collection 3
- Can spiking neurons make transformers efficient on any hardware?
  Explores whether brain-inspired spiking mechanisms combined with linear attention can adapt existing transformer checkpoints into efficient models trainable outside NVIDIA ecosystems using minimal additional data.
  sibling efficiency route via attention linearity + spiking sparsity rather than weight precision
- Can architecture choices improve inference efficiency without sacrificing accuracy?
  Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets.
  1.58-bit is a concrete architecture where inference cost drops independently of accuracy
- What actually limits language models on mobile phones?
  Is the shift toward smaller LLMs driven by quality trade-offs, or by hard physical constraints on device memory and battery life? This note examines whether sub-billion models are a practical necessity rather than a compromise.
  the deployment pressure 1-bit weights most directly relieve
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
- RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
- Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability
- Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
- GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
- Weight-sparse transformers have interpretable circuits
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
Original note title
ternary 1-bit LLM weights match full-precision performance at the same size while defining a new cost scaling law and hardware paradigm