How much does shared-prefix sampling reduce token redundancy empirically?
This explores whether the corpus puts an actual number on how much redundant token generation you avoid by branching rollouts from a shared prefix instead of sampling each trajectory independently.
This reads the question as asking for a measured payoff, a percentage or ratio, for shared-prefix sampling, where multiple reasoning trajectories reuse a common opening rather than each regenerating it from scratch. The honest answer first: the corpus describes the effect directionally but doesn't hand you a single empirical redundancy-reduction figure. The note "Can shared-prefix trees reduce redundancy in agent rollouts?" reports that tree-structured rollouts branching from shared prefixes yield *more distinct trajectories per token budget* than independent chain sampling, which improves advantage estimation and enables longer-horizon tasks under the same compute. But it frames the win as 'effective sample budget expanded,' not 'X% fewer tokens.'
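To make 'more distinct trajectories per token budget' concrete, here is a minimal accounting sketch. It assumes every trajectory has the same length and that all siblings share a single prefix of fixed length; the function names and the numbers are illustrative, not figures from the corpus.

```python
def trajectories_independent(budget: int, traj_len: int) -> int:
    """Trajectories affordable when each one is generated from scratch."""
    return budget // traj_len


def trajectories_shared_prefix(budget: int, traj_len: int, prefix_len: int) -> int:
    """Trajectories affordable when the prefix is generated once and
    each sibling only pays for its own suffix (traj_len - prefix_len)."""
    if not 0 <= prefix_len < traj_len:
        raise ValueError("prefix_len must satisfy 0 <= prefix_len < traj_len")
    if budget < prefix_len:
        return 0
    return (budget - prefix_len) // (traj_len - prefix_len)


# Hypothetical numbers: 8000-token budget, 1000-token trajectories,
# siblings diverge after a 400-token shared prefix.
print(trajectories_independent(8000, 1000))         # 8
print(trajectories_shared_prefix(8000, 1000, 400))  # 12
```

Under these assumed numbers the same budget buys 12 trajectories instead of 8, which is the 'effective sample budget expanded' framing in miniature. Real rollouts have variable lengths and multiple branch points, so the actual gain depends on the tree shape.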
What the corpus does offer is a way to reason about *why* the savings can be large, by showing how little of a reasoning trace is actually unique. The note "Do high-entropy tokens drive reasoning model improvements?" finds that only ~20% of tokens are high-entropy 'forking points' where the trajectory genuinely decides something, and that training on just those 20% matches or exceeds full-gradient updates. If only a fifth of tokens carry the branching decisions, then a shared prefix that defers divergence until the first real fork reuses every token before that fork across siblings, and most of those tokens are low-entropy ones. The ~20% figure is measured over whole traces, so it does not by itself say how long the shareable prefix is. It is the mechanism behind why prefix-sharing pays off, not a headline number.
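As a rough illustration of the 'forking point' idea, the sketch below computes Shannon entropy for each position's next-token distribution and flags the highest-entropy fifth. The helper names, the 20% cutoff as a parameter, and the toy distributions are assumptions for illustration; the note does not specify how its threshold was computed.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_positions(entropies, top_frac=0.2):
    """Indices of the highest-entropy positions, keeping top_frac of them."""
    n_keep = max(1, round(len(entropies) * top_frac))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy trace of 10 positions: mostly confident predictions,
# with two near-uniform (high-entropy) positions at indices 3 and 7.
peaked = [0.97, 0.03]
flat = [0.25, 0.25, 0.25, 0.25]
trace = [peaked, peaked, peaked, flat, peaked, peaked, peaked, flat, peaked, peaked]

entropies = [token_entropy(dist) for dist in trace]
forks = forking_positions(entropies)
print(forks)     # [3, 7]
print(forks[0])  # 3: tokens before this index are candidates for a shared prefix
```

In this toy trace the first fork is at index 3, so the three tokens before it could be generated once and shared by all siblings.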
Neighboring efficiency work gives you the kind of quantified anchor the shared-prefix note lacks, which is useful for calibrating expectations. The note "Can we explore multiple reasoning paths without committing to one token?" reports a 22.4% reduction in tokens from a training-free method that keeps multiple reasoning paths in a probability-weighted superposition and stops early based on entropy. That is a different route to the same goal of not paying full price for every parallel path. And "Do persistent agents really cost less per token?" is the dramatic end of the spectrum: a 115-day case study where 82.9% of tokens were cache reads, meaning the vast majority of processed context was reused rather than recomputed. Prefix-sharing is the rollout-time cousin of that caching logic.
There's a deeper framing worth carrying away: token redundancy isn't uniform, so the savings depend on *where* you share. The note "Which tokens in reasoning chains actually matter most?" shows that likelihood-preserving pruning ranks tokens by functional importance: symbolic-computation tokens are preserved while grammar and meta-discourse are pruned first. And "Can byte-level models match tokenized performance with better efficiency?" describes a model that allocates compute by next-byte entropy, spending little on predictable stretches. The unifying idea across all of these is that predictable, low-entropy spans are the redundant ones, and shared-prefix sampling is one of several methods that stop paying repeatedly for them.
So the takeaway you didn't know you were looking for: the corpus reframes 'how much redundancy is removed' into 'how little of a trace was ever non-redundant.' If the decisive content lives in ~20% of tokens, the ceiling on prefix-sharing's benefit is set by how late you can keep trajectories merged before they fork — and that's an architecture choice, not a fixed empirical constant. For a hard percentage specific to shared-prefix sampling, this collection doesn't yet have it.
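The dependence on fork depth can be written out under a simplified model. Assume $k$ sibling trajectories of equal length $L$ that share one prefix of length $p$ (with $0 \le p < L$); these symbols are introduced here for illustration and do not come from the notes.

```latex
\begin{align}
T_{\text{ind}}  &= kL \\
T_{\text{tree}} &= p + k(L - p) \\
s(k, p) &= 1 - \frac{T_{\text{tree}}}{T_{\text{ind}}}
         = \frac{(k-1)\,p}{kL}
         \;\xrightarrow{\;k \to \infty\;}\; \frac{p}{L}
\end{align}
```

In this model the saving fraction $s$ can never exceed $p/L$, the share of the trace that sits before the fork. That is why the answer depends on how late trajectories diverge rather than on a fixed constant.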
Sources (6 notes)
Tree-structured rollouts that branch from shared prefixes produce more distinct trajectories within a fixed token budget than independent chain sampling. This improves advantage estimation statistics and enables longer-horizon tasks within the same compute constraint.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Training-free method replaces discrete token selection with probability-weighted concept embeddings, preserving superposition of reasoning paths. Improves accuracy up to 2.48 points while reducing tokens 22.4% via entropy-based early stopping.
A 115-day case study found 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
The Byte Latent Transformer (BLT) dynamically segments bytes into patches based on next-byte entropy, allocating more compute to high-entropy regions and less to predictable ones. At 8B parameters, BLT matches tokenized baselines while reducing inference cost and improving robustness to typos and cross-lingual transfer.