Training for Compositional Sensitivity Reduces Dense Retrieval Generalization

Paper · arXiv 2604.16351
Training and Fine-Tuning · Retrieval-Augmented Generation (RAG) · Training Data

Dense retrieval compresses texts into single embeddings ranked by cosine similarity. While efficient for recall, this interface is brittle for identity-level matching: minimal compositional edits (negation, role swaps) flip meaning yet retain high similarity. Motivated by geometric results for unit-sphere cosine spaces, we test this retrieval-composition tension in text-only retrieval. Across four dual-encoder backbones, adding structure-targeted negatives consistently reduces zero-shot NanoBEIR retrieval (8–9% mean nDCG@10 drop on small backbones; up to 40% on medium ones), while only partially improving pooled-space separation. Treating pooled cosine as a recall interface, we then benchmark verifiers scoring token–token cosine maps. MaxSim (late interaction) excels at reranking but fails to reject structural near-misses, whereas a small Transformer over similarity maps reliably separates near-misses under end-to-end training.

The dominant dual-encoder paradigm compresses texts into fixed vectors for efficient maximum inner product search (MIPS) retrieval. While effective for fuzzy topical matching, this architecture suffers a fundamental "resolution loss" regarding composition. Because the embedding function compresses variable-length reasoning into a single point, it often treats sentences as commutative bags-of-words, struggling to distinguish structural near-misses (e.g., "the dog bit the man" vs. "the man bit the dog"). Recent theory suggests this is geometrically inevitable: Kang et al. (2025) argue that unit-sphere cosine spaces force conceptual clusters into linear superposition, a geometry hostile to non-commutative structures like negation or order. This implies a retrieval–composition tension: forcing compositional sensitivity into a single vector degrades broad topical generalization.
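The "commutative bag-of-words" failure mode can be illustrated with a toy limiting case. The sketch below is not the paper's setup: it assumes static (context-free) random token vectors and mean pooling, both hypothetical choices made only for illustration. Real dual encoders use contextual encoders and are not exactly order-invariant, but under this assumption the role-swapped pair collapses to the same pooled vector.

```python
import math
import random

random.seed(0)
DIM = 16
VOCAB = ["the", "dog", "bit", "man"]
# Assumption: static random token embeddings stand in for an encoder.
EMB = {w: [random.gauss(0.0, 1.0) for _ in range(DIM)] for w in VOCAB}

def mean_pool(tokens):
    """Average token vectors; the result ignores token order."""
    vecs = [EMB[t] for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = "the dog bit the man".split()
b = "the man bit the dog".split()

# Same multiset of tokens, so the pooled vectors coincide up to rounding.
swap_similarity = cosine(mean_pool(a), mean_pool(b))
print(f"cosine(role swap) = {swap_similarity:.6f}")
```

Because both sentences contain the same multiset of tokens, any order-insensitive pooling gives them (numerically) identical embeddings, which is the extreme form of the resolution loss described above.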

We investigate this tension in text-only retrieval. We show that training with structure-targeted hard negatives creates a zero-sum game: the model rejects specific permutations but suffers significant degradation in out-of-domain retrieval (NanoBEIR). We argue that identity-sensitive matching should instead be treated as a distinct verification task. We benchmark lightweight verifiers on token–token similarity maps, finding that while MaxSim excels at relevance, true identity preservation requires learned verifiers that detect topological patterns in the map.

Our analysis predicts a retrieval–composition tension for pooled-cosine dual encoders: allocating representational margin to reject meaning-changing near-misses can reduce the margin available for coarse content grouping. Across all backbones and metrics, training with structural hard negatives (Model B) reduces NanoBEIR performance relative to the NQ-only baseline (Model A). This supports the predicted tension: under a single pooled embedding with cosine scoring, allocating margin to reject lexically overlapping meaning-changes competes with broad topical grouping.
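To make the idea of "allocating margin" concrete, here is a minimal sketch of how a structural hard negative might enter a contrastive objective. The excerpt does not state the training loss, so the InfoNCE form, the temperature, and all similarity values below are assumptions chosen for illustration, not the authors' configuration or results.

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.05):
    """Negative log-softmax of the positive among positive + negatives."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]

# Illustrative cosine similarities (made up, not taken from the paper).
pos = 0.80                         # query vs. correct passage
in_batch_negs = [0.10, 0.05, 0.20] # topically unrelated negatives
structural_neg = 0.78              # role-swapped or negated near-miss

loss_a = info_nce(pos, in_batch_negs)                     # "Model A"-style batch
loss_b = info_nce(pos, in_batch_negs + [structural_neg])  # "Model B"-style batch

print(f"loss without structural negative: {loss_a:.4f}")
print(f"loss with structural negative:    {loss_b:.4f}")
```

Under these assumed numbers the near-miss dominates the softmax denominator, so most of the gradient is spent separating it from the positive. That is one plausible reading of how margin gets redirected away from coarse topical grouping.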

The pooled-cosine bottleneck collapses many compositions because it discards token topology. By contrast, M(q,c) preserves which tokens align and where those alignments occur. Verifiers that only aggregate M with permutation-symmetric statistics can still behave like bag-of-words matchers and remain insensitive to binding or role swaps. Injecting positional structure and learning local/global patterns over M breaks these symmetries, allowing the verifier to detect order-preserving diagonals, swapped alignments, and systematic mismatches induced by negation cues. This mirrors the core insight of DCSMs in Kang et al. (2025), specialized here to text–text matching.
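The difference between permutation-symmetric aggregation and position-aware scoring over the similarity map M(q,c) can be sketched as follows. This is a toy under stated assumptions: static random token vectors replace a real encoder, and `diagonal_score` is a hypothetical hand-written statistic standing in for position-aware features. It is not the learned Transformer verifier described in the paper.

```python
import math
import random

random.seed(0)
DIM = 16
VOCAB = ["the", "dog", "bit", "man"]
# Assumption: static random token embeddings stand in for an encoder.
EMB = {w: [random.gauss(0.0, 1.0) for _ in range(DIM)] for w in VOCAB}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sim_map(q_tokens, c_tokens):
    """Token-token cosine map M, with M[i][j] = cos(q_i, c_j)."""
    return [[cosine(EMB[a], EMB[b]) for b in c_tokens] for a in q_tokens]

def maxsim(M):
    """Late-interaction score: best match per query token, averaged.
    Invariant to reordering the candidate's tokens (columns of M)."""
    return sum(max(row) for row in M) / len(M)

def diagonal_score(M):
    """Hypothetical position-aware statistic: mean of the main diagonal,
    which rewards alignments that occur at the same position."""
    n = min(len(M), len(M[0]))
    return sum(M[i][i] for i in range(n)) / n

q = "the dog bit the man".split()
same = sim_map(q, "the dog bit the man".split())
swapped = sim_map(q, "the man bit the dog".split())

print("MaxSim   same vs swapped:", maxsim(same), maxsim(swapped))
print("Diagonal same vs swapped:", diagonal_score(same), diagonal_score(swapped))
```

In this toy, MaxSim assigns the role-swapped candidate the same score as the exact match because every query token still finds a perfect partner somewhere in the candidate. The diagonal statistic drops because the matches for "dog" and "man" have moved off the diagonal, which is the kind of pattern a learned verifier over M could pick up.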