CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning

Paper · arXiv 2511.18659 · Published November 24, 2025

Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge, but it still suffers from long input contexts and the disjoint optimization of retrieval and generation. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, we introduce SCP, a key-preserving data synthesis framework using QA and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end with a single language-modeling loss, a differentiable top-k estimator letting gradients flow through both modules. Theoretically, this unified optimization aligns retrieval relevance with answer quality.
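The differentiable top-k estimator is the piece that lets one language-modeling loss update the reranker. The abstract does not spell out the estimator's exact form, so the following is a minimal PyTorch sketch of one common choice, a straight-through top-k: the forward pass makes a hard k-hot selection, while gradients flow through a softmax relaxation. Names and shapes here are illustrative, not CLaRa's implementation.

```python
import torch

def straight_through_topk(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Differentiable k-hot document selection (straight-through variant).

    Forward pass: a hard top-k mask over the reranker scores.
    Backward pass: gradients flow through a softmax relaxation, so the
    generator's next-token loss can update the reranker end-to-end.
    """
    soft = torch.softmax(scores, dim=-1)                   # relaxed weights
    idx = scores.topk(k, dim=-1).indices
    hard = torch.zeros_like(scores).scatter(-1, idx, 1.0)  # exact k-hot mask
    # Straight-through trick: use `hard` in the forward pass,
    # but let gradients follow `soft`.
    return hard + soft - soft.detach()

scores = torch.randn(8, requires_grad=True)  # reranker scores for 8 candidates
mask = straight_through_topk(scores, k=2)    # k-hot in the forward pass
mask.sum().backward()                        # gradients reach `scores`
```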

Our Key Insight: Shared Continuous Representations. To address these issues, we propose a unified framework that performs retrieval and generation over shared continuous document representations, as shown in Fig. 1. Instead of maintaining separate embeddings and raw text, we encode documents once into compact memory-token representations that serve both purposes. A central motivation for this design is that supervised retrieval training typically relies on relevance-labeled data, which is scarce and often domain-specific. To overcome this limitation, we propagate the next-token-prediction (NTP) loss from the generator to the retriever, providing a weakly supervised signal that naturally adapts retrieval to the downstream generation objective; a sketch of this joint objective follows below. This mechanism lets the retriever learn which documents actually improve answer generation rather than rely on surface-level similarity. Moreover, continuous representations and joint optimization are inherently complementary: continuous encodings make the retrieval process differentiable, while joint training aligns both modules within a shared semantic space optimized for reasoning.
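To make the weak-supervision mechanism concrete, here is a toy end-to-end sketch reusing the straight_through_topk helper above: a stand-in retriever scores compressed document vectors, the differentiable top-k gates them, and a tiny stand-in generator produces token logits. The only training signal is the NTP loss, whose gradient reaches the retrieval scores. The module, dimensions, and linear generator head are illustrative assumptions, not CLaRa's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointRetrieverGenerator(nn.Module):
    """Toy joint objective: a single next-token-prediction (NTP) loss
    trains both the retrieval scores and the generator. The linear
    "generator" stands in for an LLM conditioned on memory-token vectors."""

    def __init__(self, dim: int = 64, vocab: int = 1000, k: int = 2):
        super().__init__()
        self.k = k
        self.query_proj = nn.Linear(dim, dim)  # retriever side
        self.lm_head = nn.Linear(dim, vocab)   # stand-in generator

    def forward(self, query, doc_vecs, target_ids):
        # Relevance of each compressed document vector to the query.
        scores = doc_vecs @ self.query_proj(query)
        # Differentiable selection (helper defined in the sketch above).
        mask = straight_through_topk(scores, self.k)
        context = (mask.unsqueeze(-1) * doc_vecs).sum(dim=0)  # fused evidence
        logits = self.lm_head(context).expand(len(target_ids), -1)
        # One LM loss: its gradient flows back into `scores`, teaching the
        # retriever which documents actually improve generation.
        return F.cross_entropy(logits, target_ids)

model = JointRetrieverGenerator()
loss = model(torch.randn(64), torch.randn(8, 64), torch.randint(0, 1000, (5,)))
loss.backward()  # updates both retriever and generator parameters
```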