Retrieval-augmented reasoning with lean language models

Paper · arXiv 2508.11386 · Published August 15, 2025
RAG · Reinforcement Learning · Deep Research

This technical report details a novel approach to combining reasoning and retrieval augmented generation (RAG) within a single, lean language model architecture. While existing RAG systems typically rely on large-scale models and external APIs, our work addresses the increasing demand for performant and privacy-preserving solutions deployable in resource-constrained or secure environments. Building on recent developments in test-time scaling and small-scale reasoning models, we develop a retrieval augmented conversational agent capable of interpreting complex, domain-specific queries using a lightweight backbone model. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models (e.g., DeepSeek-R1) over a curated corpus, in this case the NHS A-to-Z condition pages. We explore the impact of summarisation-based document compression, synthetic data design, and reasoning-aware fine-tuning on model performance. Evaluation against both non-reasoning and general-purpose lean models demonstrates that our domain-specific fine-tuning approach yields substantial gains in answer accuracy and consistency, approaching frontier-level performance while remaining feasible for local deployment.

In practice, test-time scaling refers to deploying inference-time strategies that leverage additional sampling, computation, or prompt engineering to boost the capabilities of a fixed model—without modifying its parameters through fine-tuning or reinforcement learning.

One widely used class of test-time scaling methods is parallel generation, where a model generates multiple candidate responses and then aggregates them through selection mechanisms such as majority voting [3], self-consistency [23], or best-of-N sampling [24]. These techniques improve robustness and factual accuracy by exploiting diversity in the model’s outputs, with selection based on heuristics or learned reward functions. Other common strategies include beam search [25] and Monte Carlo tree search [26], which maintain multiple high-probability continuations of a sequence in parallel to explore more promising generations. While such approaches typically improve likelihood, they may reduce diversity, in contrast to sampling-based methods.

Recent models such as DeepSeek-R1-Zero [6] have pushed this frontier by training LLMs via reinforcement learning to produce structured reasoning paths, using formatting conventions (e.g., enclosing thoughts in `<think>` tags) to aid downstream reasoning alignment. While this model demonstrated strong reasoning capabilities, it also exhibited practical limitations, such as decreased readability and occasional mixing of languages.

To mitigate these challenges, DeepSeek-R1 incorporated a small quantity of high-quality “cold start” data prior to reinforcement learning (RL). This dataset comprised carefully curated examples, most notably chain-of-thought demonstrations, designed to stabilise early training and improve the coherence of generated outputs. DeepSeek-R1 was then trained via a two-stage RL procedure: the first stage targeted improvements in reasoning ability, while the second focused on aligning model outputs with human preferences, thereby enhancing readability and reducing incoherent completions. This multi-phase training strategy enabled DeepSeek-R1 to achieve performance on par with OpenAI’s o1 model across a range of reasoning benchmarks.

A retrieval augmented generation (RAG) system has two key components (illustrated in Figure 1):

  1. A retriever, which fetches information from an external memory source. This also involves a pre-processing step to index the knowledge base.

  2. A generator (often an LLM) which generates a response based on the retrieved information.
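The two components above can be sketched end to end. In this minimal example the retriever ranks documents by token overlap with the query (a stand-in for the dense similarity search used in practice), and the generator is a stub where a real system would prompt an LLM with the query plus the retrieved context; all function names and the toy corpus are illustrative, not from the report.

```python
import re

def tokens(text):
    """Lowercase word tokens, stripping punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def index_corpus(docs):
    """Pre-processing step: pair each document with its token set."""
    return [(doc, tokens(doc)) for doc in docs]

def retrieve(index, query, k=2):
    """Retriever: rank documents by token overlap with the query."""
    q = tokens(query)
    ranked = sorted(index, key=lambda item: len(q & item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def generate(query, context):
    """Generator stub: a real system would prompt an LLM with the
    query and the retrieved passages as grounding context."""
    return f"Answer to '{query}' grounded in {len(context)} passages."

docs = ["Asthma is a common lung condition.",
        "Migraine causes severe headaches.",
        "Asthma symptoms include wheezing."]
index = index_corpus(docs)
context = retrieve(index, "What are asthma symptoms?")
print(generate("What are asthma symptoms?", context))
```

Swapping the token-overlap ranking for embedding similarity, and the stub for an actual LLM call, yields the full RAG loop described in the following sections.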

RAG enables LLMs to retrieve relevant document chunks from external knowledge bases, often through semantic similarity and embedding based approaches [10]. By utilising an external knowledge base, RAG enables a model to ground its responses in the relevant context, without requiring additional training or fine-tuning, effectively helping it generate relevant responses and reducing hallucinations [9].

The success of a RAG system depends heavily on the quality of its retriever, whose role is to provide the LLM with the information from the external database that is most relevant to the query. The retriever has two core functions:

  1. Indexing: pre-processing and chunking the data so that data can be retrieved quickly.

  2. Querying: retrieving data relevant to a given query.

Although external data sources may take various forms, including multimodal data (e.g., images, video, audio), tabular datasets, and structured knowledge graphs, this report focuses exclusively on the case where the external memory consists of a corpus of textual documents. In such settings, document chunking is typically required to divide each document into smaller, manageable segments that conform to the context window limitations of both the embedding model used in retrieval and the language model used for generation. A common approach is to segment documents based on predefined units such as characters, paragraphs, or token sequences derived from a specific tokenizer. Overlapping chunks are often employed to reduce the risk of splitting semantically important content across boundaries.
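A character-based chunker with overlap can be written in a few lines; the sizes below are illustrative, and a production pipeline would typically count tokens from the embedding model's tokenizer rather than characters.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks, where consecutive
    chunks share `overlap` characters so content near a boundary
    appears in two adjacent chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 500
chunks = chunk_text(doc)
print(len(chunks), [len(c) for c in chunks])  # → 4 [200, 200, 200, 50]
```

Each chunk starts `chunk_size - overlap` characters after the previous one, so the final chunk may be shorter than `chunk_size`.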

To support retrieval in this context, we adopt an embedding-based approach that leverages a vector store: a specialised data structure designed for efficient indexing and retrieval of items based on their vector representations, or embeddings. These embeddings are intended to capture the semantic content of the input text and are used to represent the individual document chunks. At query time, the system embeds the input query into the same vector space and performs similarity search against the stored document embeddings to identify the most semantically relevant passages. Common similarity metrics include Euclidean distance and cosine similarity, the latter measuring the cosine of the angle between two vectors and often preferred for its scale-invariance.
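The query-time flow above amounts to embedding the query and ranking stored chunk embeddings by cosine similarity. The sketch below uses a deliberately crude bag-of-characters "embedding" as a hypothetical stand-in for a trained dense embedding model, so that the similarity-search mechanics are self-contained and runnable; a real vector store would also use an approximate index rather than the brute-force scan shown here.

```python
import math

def embed(text):
    """Hypothetical stand-in embedding: a 26-dim letter-count vector.
    A real system would use a trained dense embedding model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|). Scale-invariant,
    since rescaling either vector leaves the angle unchanged."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(store, query, k=1):
    """Brute-force similarity search over (chunk, embedding) pairs."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

chunks = ["asthma and wheezing", "migraine and headache"]
store = [(c, embed(c)) for c in chunks]
print(search(store, "wheezing asthma"))  # → ['asthma and wheezing']
```

The scale-invariance noted in the text is visible in `cosine`: multiplying either vector by a positive constant cancels between the numerator and the norm in the denominator, which is why cosine similarity is often preferred over Euclidean distance when embedding magnitudes vary.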