How do continuous concept tokens compare to latent trajectory sampling?

A comparison of two ways a model can pursue multiple reasoning paths at once: Soft Thinking's continuous concept tokens (keeping all options blended inside one path) versus GRAM-style parallel sampling of separate latent trajectories (running many paths side by side).

Both approaches answer the same underlying problem: when a model reasons, the moment it picks one token it discards every other path it could have taken. Both refuse that early commitment, but in opposite directions. Continuous concept tokens, as in Soft Thinking (see "Can we explore multiple reasoning paths without committing to one token?"), keep the full probability distribution alive *inside a single trajectory*: instead of sampling one word, the model feeds a probability-weighted blend of candidate embeddings forward, so a superposition of reasoning paths travels together in one pass. Latent trajectory sampling, as in GRAM (see "Can reasoning systems scale wider instead of only deeper?"), does the opposite: it keeps the paths *separate*, drawing many independent stochastic latent trajectories in parallel and scaling reasoning 'wider' rather than deeper. One compresses the branching into a single richer step; the other lets the branches run as a fleet.
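To make the contrast concrete, here is a minimal toy sketch in plain Python. It is not the Soft Thinking or GRAM implementation: the function names (`concept_token`, `sample_trajectories`), the tanh transition, the Gaussian noise, and the tiny embedding table are all assumptions chosen for illustration. The first function blends embeddings by their probabilities into one soft token; the second runs several independent noisy latent rollouts from the same starting state.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def concept_token(logits, embeddings):
    # One "soft" token: the probability-weighted average of all embeddings,
    # instead of the embedding of a single sampled token.
    probs = softmax(logits)
    dim = len(embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(probs, embeddings))
            for d in range(dim)]

def sample_trajectories(z0, step, num_paths, num_steps, noise_scale, seed=0):
    # Width instead of blending: each path gets its own random stream and
    # evolves independently, so the rollouts could run in parallel.
    paths = []
    for k in range(num_paths):
        rng = random.Random(seed + k)
        z = list(z0)
        for _ in range(num_steps):
            z = [v + rng.gauss(0.0, noise_scale) for v in step(z)]
        paths.append(z)
    return paths

# Hypothetical 3-token vocabulary with 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
blend = concept_token([0.0, 0.0, 0.0], embeddings)  # uniform distribution

# Hypothetical deterministic transition; the noise makes each path differ.
paths = sample_trajectories(
    z0=[0.5, -0.5],
    step=lambda z: [math.tanh(v) for v in z],
    num_paths=4, num_steps=3, noise_scale=0.1,
)
```

In this toy, `blend` is a single vector that mixes all three candidates, while `paths` holds four separate end states. That is the structural difference described above: one richer step versus a fleet of branches.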

The practical trade-off falls along latency and variance. GRAM's pitch is that depth-only reasoning pays a serial latency cost, since each step waits on the last, whereas independent parallel trajectories sample the solution space with benefits comparable to token-level parallelism and without inflating variance. Soft Thinking's pitch is efficiency of a different kind: because each step already carries the distribution, it reaches answers in *fewer* tokens (a 22.4% reduction via entropy-based early stopping) while raising accuracy by up to 2.48 points. So you can read them as width-in-parallel versus width-folded-into-each-step: both buy exploration, but one spends compute breadthwise and the other spends representational richness per token.
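Entropy-based early stopping can be sketched as follows. This is a hedged illustration, not Soft Thinking's published procedure: the `threshold` and `patience` parameters, and the rule "stop after several consecutive low-entropy steps", are assumptions about one plausible way to implement the idea.

```python
import math

def entropy(probs):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def stop_step(step_distributions, threshold, patience):
    # Return the index of the step at which reasoning would stop, or None
    # if the model never stays confident for `patience` steps in a row.
    low_streak = 0
    for i, probs in enumerate(step_distributions):
        if entropy(probs) < threshold:
            low_streak += 1
            if low_streak >= patience:
                return i
        else:
            low_streak = 0  # an uncertain step resets the streak
    return None

uncertain = [0.5, 0.5]
confident = [0.99, 0.01]
steps = [uncertain, confident, uncertain, confident, confident]
stop_at = stop_step(steps, threshold=0.1, patience=2)
```

Here the single confident step at index 1 is not enough to stop, because the next step is uncertain again; the sketch stops only once two confident steps occur back to back.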

What makes the comparison interesting is that both lean on the same hidden fact: not all tokens are equal. Reasoning seems to hinge on a small set of high-entropy 'forking' decisions. Only about 20% of tokens carry the real branching signal (see "Do high-entropy tokens drive reasoning model improvements?"), and specific reflection tokens like 'Wait' and 'Therefore' spike in mutual information with correct answers (see "Do reflection tokens carry more information about correct answers?"). Continuous concept tokens get their leverage precisely at those forks, because that is where preserving the distribution rather than collapsing it matters most. Trajectory sampling instead treats the fork as the place to spawn new paths. They're two strategies for spending compute at the same critical moments.
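One simple way to locate such forks is to rank positions by the entropy of their next-token distribution and keep the top fraction. The sketch below assumes that selection rule; the function name `forking_positions` and the toy distributions are hypothetical, and only the roughly 20% figure comes from the note cited above.

```python
import math

def entropy(probs):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_positions(step_distributions, fraction=0.2):
    # Flag the highest-entropy positions as candidate forks.
    ents = [entropy(p) for p in step_distributions]
    k = max(1, round(fraction * len(ents)))
    ranked = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)
    return sorted(ranked[:k])

# Ten toy positions: mostly confident, with genuine uncertainty at 3 and 7.
peaked = [0.97, 0.02, 0.01]
flat = [0.34, 0.33, 0.33]
sequence = [peaked] * 10
sequence[3] = flat
sequence[7] = flat
forks = forking_positions(sequence, fraction=0.2)
```

In these terms, a concept-token method would keep the blended distribution at the flagged positions, while a trajectory-sampling method would branch there.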

Step back and both belong to a broader move away from reasoning that has to be spoken aloud. Architectures such as Coconut, Heima, and depth-recurrent models already show that test-time compute can scale through hidden-state iteration with no verbalized steps at all (see "Can models reason without generating visible thinking tokens?"), and Latent-Thought Language Models go further, treating latent vectors as a scaling dimension independent of parameter count (see "Can latent thought vectors scale language models beyond parameters?"). Continuous concept tokens and latent trajectory sampling are two ways of carving up that latent space: one enriches a single path, the other multiplies paths. If you want a hint at why the visible chain may be optional anyway, note that models trained on systematically irrelevant reasoning traces maintain solution accuracy and sometimes generalize better out of distribution (see "Do reasoning traces need to be semantically correct?"). That suggests the trace is computational scaffolding, which is exactly what both of these methods try to optimize directly rather than narrate.

Sources 7 notes

Can we explore multiple reasoning paths without committing to one token?

Training-free method replaces discrete token selection with probability-weighted concept embeddings, preserving superposition of reasoning paths. Improves accuracy up to 2.48 points while reducing tokens 22.4% via entropy-based early stopping.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do reflection tokens carry more information about correct answers?

Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
