How does the [remention] token help models distinguish initial from later mentions?

This explores whether a dedicated marker token — one that tags a word as a *later* reference rather than a *first* introduction — could help a model track coreference, telling apart the initial mention of something from its repeated appearances.

I should be upfront: the corpus doesn't contain a note that directly studies a remention token, coreference tracking, or the distinction between initial and later mentions. So rather than stretch unrelated papers to fit, here's the honest adjacent territory the collection *does* cover, which turns out to be surprisingly relevant to *why* such a token might work at all.

The collection's strongest recurring finding is that a small set of designated tokens can carry outsized functional weight, which is exactly the premise a [remention] token would rely on. Reflection markers like "Wait" and "Therefore" turn out to be mutual-information peaks: their mutual information with the correct answer spikes sharply, and suppressing them specifically damages reasoning while suppressing the same number of random tokens doesn't (see "Do reflection tokens carry more information about correct answers?"). In other words, individual structural tokens really can act as load-bearing signals, not filler. A purpose-built remention marker would be a deliberate version of the same phenomenon.
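To make "mutual-information peak" concrete, here is a minimal sketch of the quantity itself. It is a toy estimate over binary observations (marker present or absent, answer correct or wrong), which is my simplifying assumption and not the method the cited note uses; the function name and the data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Mutual information in bits between two discrete variables.

    `pairs` is a list of (x, y) observations, for example
    x = 1 if a marker token appeared, y = 1 if the answer was correct.
    """
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Hypothetical toy data: one marker that tends to co-occur with correct
# answers, and one whose presence is independent of correctness.
informative = [(1, 1)] * 40 + [(0, 0)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 10
uninformative = [(1, 1)] * 25 + [(0, 0)] * 25 + [(1, 0)] * 25 + [(0, 1)] * 25

mi_informative = mutual_information(informative)
mi_uninformative = mutual_information(uninformative)
print(f"informative marker:   {mi_informative:.3f} bits")
print(f"uninformative marker: {mi_uninformative:.3f} bits")
```

A token whose presence says nothing about correctness scores zero, so a "peak" is simply a token whose score stands well above its neighbours.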

The corpus also shows that models already sort tokens by *function* internally. Pruning experiments reveal six distinct functional token categories, with the model preferentially preserving symbolic-computation tokens and discarding grammar and meta-discourse first (see "Which tokens in reasoning chains actually matter most?"). And only ~20% of tokens, the high-entropy "forking" points, actually carry the learning signal during reinforcement training (see "Do high-entropy tokens drive reasoning model improvements?"). Both suggest a model has the machinery to treat a special category of token as a distinct functional channel, which is the bet behind any explicit reference-tracking marker.
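If it helps to see what "high-entropy forking points" means operationally, the sketch below scores each position by the Shannon entropy of its next-token distribution and keeps the top fraction. The 20% default mirrors the figure quoted above; the function names and the tiny four-token vocabulary are assumptions for illustration, not anything taken from the cited note.

```python
import math

def token_entropy(probs):
    """Shannon entropy in bits of one next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def forking_positions(distributions, fraction=0.2):
    """Return the indices of the highest-entropy positions, in ascending order."""
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(len(entropies) * fraction))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical next-token distributions at five positions.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # near-certain continuation
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain: a forking point
    [0.90, 0.05, 0.03, 0.02],
    [0.70, 0.10, 0.10, 0.10],
    [0.99, 0.01, 0.00, 0.00],
]

print(forking_positions(dists))
```

With five positions and a 20% cut, only the uniform distribution at index 1 survives, which is the intuition behind "the minority carries the signal".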

Where it gets interesting for your question: the collection also documents how models *fail* to track which information is which. Models routinely ignore in-context information when prior training associations are stronger, generating outputs inconsistent with what's actually in front of them (see "Why do language models ignore information in their context?"). That's a cousin of the coreference problem: keeping track of *this specific entity, introduced here* against the pull of generic priors. A remention token can be read as an architectural attempt to give the model an explicit handle on context-internal identity rather than leaving it to implicit attention.
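For a sense of what such an explicit handle might look like at the data level, here is a purely hypothetical preprocessing sketch. Nothing in the corpus specifies how a [remention] token is defined or inserted, so the marker string, the function name, and the exact-match rule are all my assumptions; real coreference (pronouns, paraphrases) is well beyond what this does.

```python
REMENTION = "[remention]"  # hypothetical marker string

def tag_rementions(tokens, entities):
    """Insert the marker before every mention of a known entity after its first.

    `entities` is a set of lowercased entity names. Matching is exact and
    case-insensitive, so pronouns and paraphrases are not detected.
    """
    seen = set()
    tagged = []
    for tok in tokens:
        key = tok.lower()
        if key in entities:
            if key in seen:
                tagged.append(REMENTION)  # later mention: flag it
            seen.add(key)                 # first mention: just remember it
        tagged.append(tok)
    return tagged

tokens = "Ada met Grace . Later Ada thanked Grace".split()
tagged = tag_rementions(tokens, {"ada", "grace"})
print(" ".join(tagged))
# Ada met Grace . Later [remention] Ada thanked [remention] Grace
```

The first "Ada" and "Grace" pass through untouched, and only their second appearances get the marker, which is the initial-versus-later distinction the question is about.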

So while the corpus can't tell you how a [remention] token performs, it tells you something you might not have gone looking for: the entire reason such a token is plausible is that LLMs already concentrate function into sparse special tokens, and already struggle to bind context-specific identity against parametric priors. If you want the deeper mechanics, the functional-token-ranking note is the best doorway — it's the closest thing here to a theory of *why* a dedicated marker could earn its place in the sequence.

Sources 4 notes

Do reflection tokens carry more information about correct answers?

Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
