Should user context live in tokens or in learned model representations?

This explores a design tradeoff: when you want a model to know *you* (your history, preferences, situation), should that context arrive as text in the prompt (tokens), or as compressed vectors the model reads internally (learned representations)? The corpus leans toward representations, but the interesting part is *why*, and it's not just about saving space.

The most direct evidence is that distilling a user's interaction history into embeddings beats stuffing that history into a text prompt, especially as the history grows long (see "Can user embeddings personalize language models more efficiently than prompts?"). Embeddings fed through cross-attention are cheaper to run and do better on tasks that need a deep understanding of the user than the equivalent prose. There's a theoretical echo here: predicting at the level of latent representations recovers compositional structure with a number of samples that stays constant in hierarchy depth, whereas predicting raw tokens needs exponentially many, because same-level latents are far more correlated with each other than tokens are (see "Why is predicting latents more sample-efficient than tokens?"). Tokens are a lossy, high-variance surface; the representation underneath is where the compositional structure actually lives.
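To make the cross-attention idea concrete, here is a minimal sketch in plain Python. It is not the User-LLM implementation: the function names, the toy vectors, and the choice to use the user embeddings directly as keys and values (with no learned projections) are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_attend(token_states, user_embeddings):
    """Single-head cross-attention: each token state queries the user embeddings.

    Assumption: keys and values are the user embeddings themselves,
    with no learned projection matrices, to keep the sketch small.
    """
    dim = len(user_embeddings[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for query in token_states:
        weights = softmax([dot(query, key) * scale for key in user_embeddings])
        # Weighted sum of the user embeddings (used here as values).
        mixed = [
            sum(w * value[i] for w, value in zip(weights, user_embeddings))
            for i in range(dim)
        ]
        # Residual connection keeps the original token information.
        attended.append([q + m for q, m in zip(query, mixed)])
    return attended

# Hypothetical toy data: 2 token states, 3 user-history embeddings, dimension 4.
token_states = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
user_embeddings = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
output = cross_attend(token_states, user_embeddings)
```

In this sketch the user history never enters the token sequence: the number of token states stays fixed while the history is read through a side channel, which is one plausible reading of why the embedding route is cheaper for long histories.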

But there's a deeper reason tokens are a fragile place to put context: the model often ignores them. When a prompt conflicts with what the model learned in training, the training priors win. Text alone can't override them, and you need to intervene in the representations to change the behavior (see "Why do language models ignore information in their context?"). Relatedly, prompting can only *activate* knowledge already in the model; it cannot inject anything new (see "Can prompt optimization teach models knowledge they lack?"). So if your user context is genuinely novel information the model never saw, dropping it into tokens puts it in exactly the channel most likely to get steamrolled by parametric memory.

The counterweight is that representations aren't free or universal, and the picture for tokens is mixed rather than uniformly bad. Simply having a very long context window, the brute-force "just give it all the tokens" approach, matches retrieval for semantic tasks but collapses on structured, relational queries; length alone doesn't buy you the right kind of understanding (see "Can long-context LLMs replace retrieval-augmented generation systems?"). Yet tokens aren't dead weight: in-context learning of sequential behavior genuinely requires seeing full or partial *trajectories* in the prompt, not pre-baked weights, because the model generalizes from the structure of those token sequences without any weight update (see "Why do trajectories matter more than individual examples for in-context learning?"). There's also a hint that not all tokens are equal: a small minority of high-entropy "forking" tokens carry most of the actual signal (see "Do high-entropy tokens drive reasoning model improvements?"), which suggests the token-vs-representation question is partly a question of *which* tokens you bother to encode.
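The "forking token" idea can be sketched as an entropy ranking over next-token distributions. This is an illustrative sketch only: the function names and the toy distributions are hypothetical, and selecting a fixed top fraction (0.2, echoing the roughly 20% figure in the source note) is an assumption rather than the cited method.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_positions(distributions, fraction=0.2):
    """Return indices of the highest-entropy positions, in original order."""
    scored = [(entropy(d), i) for i, d in enumerate(distributions)]
    # Assumption: keep a fixed fraction of positions, and at least one.
    keep = max(1, round(len(scored) * fraction))
    top = sorted(scored, reverse=True)[:keep]
    return sorted(i for _, i in top)

# Hypothetical next-token distributions for a 5-token sequence.
distributions = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.05, 0.03, 0.02],
    [0.70, 0.10, 0.10, 0.10],
    [0.99, 0.005, 0.003, 0.002],
]
selected = forking_positions(distributions)
```

Here the uniform distribution at index 1 is the position where the model is least certain, so it is the one a "which tokens are worth encoding" filter would keep first.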

The synthesis the corpus points to: it's not either/or, it's a division of labor. Stable, accumulated context (who this user is, over a long history) belongs in learned representations — durable, compact, and resistant to being ignored. Fresh, situational, sequential context (what they're doing right now) belongs in tokens, where the model can do in-context reasoning over it. The failure mode is putting durable identity in tokens (where priors override it) or trying to learn the live situation into weights (slow, and you lose the trajectory structure). The unexpected takeaway is that 'tokens vs. representations' is really a question about *permanence* — how long this context needs to survive contact with the model's existing beliefs.
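The division of labor above can be sketched as a simple router keyed on permanence. Everything here is hypothetical: `ContextItem`, the `durable` flag, and `route_context` are names invented for illustration, and in practice deciding what counts as durable would itself be a design problem.

```python
from dataclasses import dataclass

@dataclass
class ContextItem:
    text: str
    durable: bool  # True for long-lived facts about the user (an assumed label)

def route_context(items):
    """Split context into an embedding channel and a prompt channel."""
    # Durable identity goes to the learned-representation side.
    to_embed = [item.text for item in items if item.durable]
    # Live, situational context stays in tokens for in-context reasoning.
    to_prompt = [item.text for item in items if not item.durable]
    return to_embed, to_prompt

# Hypothetical context items for one user.
items = [
    ContextItem("prefers concise answers", durable=True),
    ContextItem("has used the product for three years", durable=True),
    ContextItem("currently debugging a failing deploy", durable=False),
]
to_embed, to_prompt = route_context(items)
```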

Sources 7 notes

Can user embeddings personalize language models more efficiently than prompts?

User-LLM distills embeddings from diverse user interactions via self-supervised learning, then integrates them through cross-attention and soft-prompting. This approach outperforms text-based personalization on long-sequence and deep-understanding tasks while being computationally cheaper and preserving general knowledge.

Why is predicting latents more sample-efficient than tokens?

A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Can long-context LLMs replace retrieval-augmented generation systems?

The LOFT benchmark shows long-context language models (LCLMs) match retrieval-augmented generation (RAG) on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.

Why do trajectories matter more than individual examples for in-context learning?

In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; reinforcement learning with verifiable rewards (RLVR) primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
