Why does text encoding create different subspaces across domains?
This explores why the same text encoder lands different domains in different regions of embedding space — and what that 'text bias' costs when you try to move a model from one domain to another.
Text encoding tends to carve out separate subspaces for different domains rather than one shared space, and the corpus has a surprisingly coherent answer as to why: text embeddings encode surface vocabulary and co-occurrence statistics, so each domain's distinct word usage pulls its items into its own region. The geometry of embedding space is built from how words appear together. One study finds that the leading eigenvectors of embedding Gram matrices split the world coarse-to-fine, tracking a WordNet-style taxonomy level by level (see: Do embedding eigenvectors organize taxonomy from coarse to fine?). If the structure of the space is inherited from co-occurrence patterns, then domains that talk about things differently, with different jargon and different framings, will naturally occupy different neighborhoods.
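The coarse-to-fine spectral ordering can be illustrated with a toy example. This is a minimal sketch, not the cited study's setup: the four items, the two-level taxonomy, and the feature weights are all assumptions chosen so that broad branch features dominate the embeddings. It builds a centered Gram matrix and checks which split the leading eigenvector encodes.

```python
import numpy as np

# Hypothetical toy taxonomy: two branches (animal, vehicle), two leaves each.
items = ["dog", "cat", "car", "bus"]
branch_weight, leaf_weight = 2.0, 1.0  # assumption: broad features carry more weight

# Feature columns: [animal, vehicle, dog, cat, car, bus]
X = np.array([
    [branch_weight, 0, leaf_weight, 0, 0, 0],
    [branch_weight, 0, 0, leaf_weight, 0, 0],
    [0, branch_weight, 0, 0, leaf_weight, 0],
    [0, branch_weight, 0, 0, 0, leaf_weight],
])

# Center over items, then form the item-by-item Gram matrix.
X_centered = X - X.mean(axis=0)
gram = X_centered @ X_centered.T

# eigh returns eigenvalues in ascending order; flip to descending.
eigvals, eigvecs = np.linalg.eigh(gram)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The sign pattern of the leading eigenvector assigns each item to a side.
branch_side = np.sign(eigvecs[:, 0])
for name, side in zip(items, branch_side):
    print(f"{name}: {'+' if side > 0 else '-'}")
```

In this toy case the largest eigenvalue belongs to the animal-versus-vehicle split, and the within-branch distinctions only appear in the smaller eigenvalues. Real embedding spectra are noisier, so the clean separation here should be read as an idealization.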
The recommendation work makes the cost of this concrete. VQ-Rec shows that mapping item text straight to embeddings bakes in a 'text-similarity bias' that doesn't transfer: two items that share words look close even when they behave differently, and a model trained on one domain's vocabulary stumbles on another's (see: Can discrete codes transfer better than text embeddings?). Their fix is telling: insert a layer of discrete codes, produced by product quantization, between the text and the representation, breaking the tight coupling so the lookup table can adapt per domain without retraining the encoder (see: Can discretizing text embeddings improve recommendation transfer?). The subspace problem, in other words, is a consequence of going directly from text to vectors; discretizing loosens it.
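The text-to-codes-to-embedding path can be sketched roughly as follows. This is an illustration under stated assumptions, not VQ-Rec's actual implementation: the dimensions are arbitrary, the text embeddings and centroids are random stand-ins for a frozen encoder and learned codebooks, and `quantize`, `lookup`, and `target_table` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K, d_out = 8, 4, 16, 6  # hypothetical sizes: text dim, sub-vectors, codes per sub-vector, output dim
sub_dim = D // M

codebooks = rng.normal(size=(M, K, sub_dim))  # stand-in for learned product-quantization centroids
code_table = rng.normal(size=(M, K, d_out))   # trainable lookup table (source domain)

def quantize(text_emb):
    """Map (N, D) text embeddings to (N, M) integer codes by nearest centroid."""
    subs = text_emb.reshape(len(text_emb), M, sub_dim)              # (N, M, sub_dim)
    dists = ((subs[:, :, None, :] - codebooks[None]) ** 2).sum(-1)  # (N, M, K)
    return dists.argmin(-1)

def lookup(codes, table):
    """Sum the table rows selected by each item's M codes, giving (N, d_out)."""
    return table[np.arange(M), codes].sum(axis=1)

text_emb = rng.normal(size=(3, D))  # stand-in for frozen text-encoder output
codes = quantize(text_emb)          # discrete intermediate layer
source_repr = lookup(codes, code_table)

# Adapting to a new domain touches only the lookup table; the noise here is a
# stand-in for per-domain fine-tuning. The encoder and the codes stay fixed.
target_table = code_table + 0.1 * rng.normal(size=code_table.shape)
target_repr = lookup(codes, target_table)
```

The point of the design is visible in the last lines: the item representation changes between domains while the text encoder output and the codes stay untouched.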
What's striking is that this fragmentation happens even inside a single domain. HyDE documents a vocabulary mismatch where queries and documents, both in English and on the same topic, land in different embedding regions simply because questions are phrased unlike answers. Their workaround is to generate a hypothetical answer document and match document-to-document, sidestepping the gap entirely (see: Why do queries and documents occupy different embedding spaces?). So 'domain' here is less about subject area than about register: any shift in how language is used can spawn a new subspace.
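The query-versus-document register gap, and the document-to-document workaround, can be shown with a deliberately small example. Everything here is an assumption made for illustration: the bag-of-words `embed` function stands in for a dense encoder, `generate_hypothetical_answer` is a hypothetical stub returning a canned string where HyDE would call a language model, and the three-document corpus is invented.

```python
import numpy as np

def embed(texts, vocab):
    """Toy bag-of-words embedding with unit-length rows (stand-in for a dense encoder)."""
    vecs = np.zeros((len(texts), len(vocab)))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            vecs[i, vocab[tok]] += 1
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def generate_hypothetical_answer(query):
    """Hypothetical stub for a language-model call; returns a canned answer."""
    return ("aspirin lowers temperature by inhibiting prostaglandin "
            "synthesis and reduces fever")

corpus = [
    "aspirin reduces fever by inhibiting prostaglandin synthesis",  # the relevant answer
    "the capital of france is paris",
    "how does a high temperature affect engines",                   # question-like distractor
]
query = "how does aspirin lower a high temperature"
hypo = generate_hypothetical_answer(query)

# Build one shared vocabulary over every text in the example.
tokens = sorted({t for s in corpus + [query, hypo] for t in s.lower().split()})
vocab = {tok: i for i, tok in enumerate(tokens)}

doc_vecs = embed(corpus, vocab)
direct_hit = int(np.argmax(doc_vecs @ embed([query], vocab)[0]))  # query-to-document
hyde_hit = int(np.argmax(doc_vecs @ embed([hypo], vocab)[0]))     # document-to-document

print("direct retrieval picks:", corpus[direct_hit])
print("hypothetical-document retrieval picks:", corpus[hyde_hit])
```

In this toy corpus the raw query matches the question-shaped distractor because it shares question phrasing, while the hypothetical answer matches the relevant document because it shares answer phrasing. A real dense encoder would blur this effect rather than show it so starkly.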
Underneath all of this is a deeper claim worth pausing on. Text is a lossy human abstraction: it strips out the physics, geometry, and causality of the things it describes (see: Are text-only language models fundamentally limited by abstraction?). If encodings only ever see the shadows on the cave wall, then what separates domains isn't the underlying reality but the linguistic conventions each community uses to point at it. That's why two domains describing related things can still end up far apart: the encoder sees the words, not the world.
The practical upshot threads through the rest of the corpus. You can adapt a retrieval model to a new domain using nothing but a short text description of it, precisely because the gap is a describable shift in vocabulary (see: Can you adapt retrieval models without accessing target data?). But domain adaptation methods carry hidden costs, with visible gains in one area masking degradation in reasoning or transfer elsewhere (see: How do domain training techniques actually reshape model behavior?). The through-line is this: the subspace gap is the price of letting surface text define your geometry, and the most durable fixes either decouple from text or describe the gap rather than fighting it directly.
Sources (7 notes)
Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.
VQ-Rec demonstrates that mapping item text to discrete codes via product quantization, then to embeddings, improves cross-domain transfer compared to direct text encoding. The discrete intermediate reduces text bias and enables efficient per-domain fine-tuning.
VQ-Rec uses product quantization to map item text to discrete codes that index learned embeddings, breaking the tight coupling between text and recommendations. This decoupling prevents text-similarity bias and allows lookup tables to adapt to new domains without retraining the text encoder.
HyDE resolves retrieval failures by generating plausible answer documents first, then matching those documents to the corpus using document-document similarity. This avoids the mismatch between query and document spaces without requiring labeled training data.
Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.
Research demonstrates that a brief textual domain description suffices to generate synthetic training data for retrieval fine-tuning, outperforming baselines in zero-target-access scenarios and enabling adaptation where conventional methods are blocked.
Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.