Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation

Paper · arXiv 2406.13372 · Published June 19, 2024

Recent advances in retrieval-augmented generation have significantly improved the performance of question-answering systems, particularly on factoid ‘5Ws’ questions. However, these systems still face substantial challenges when addressing ‘1H’ questions, specifically how-to questions, which are integral to decision-making processes and require dynamic, step-by-step answers. The key limitation lies in the prevalent data organization paradigm, the chunk, which divides documents into fixed-size segments and thereby disrupts the logical coherence and connections within the context. To overcome this, we propose THREAD, a novel data organization paradigm aimed at enabling current systems to handle how-to questions more effectively. Specifically, we introduce a new knowledge granularity, termed the ‘logic unit’, where documents are transformed with large language models into more structured and loosely interconnected logic units. Extensive experiments conducted across both open-domain and industrial settings demonstrate that THREAD significantly outperforms existing paradigms.

3.1 LOGIC UNIT: RETRIEVAL UNIT OF THREAD

We propose a new knowledge granularity called the ‘logic unit’ (LU). Unlike a chunk, an LU comprises specially designed components that reflect the internal logic and coherence of documents, in particular components for bridging consecutive steps when addressing how-to questions.

Prerequisite. The prerequisite component acts as an information supplement, providing the context necessary to understand the LU. For example, an LU may include domain-specific terminology such as entities or abbreviations. The prerequisite explains these terms and can be used to generate new queries that retrieve LUs with more detailed information. Without this context, passing such LUs to an LLM-based generator could lead to hallucinations. Additionally, the prerequisite can function as an LU filter: it contains constraints that must be met before the LU is considered in answer generation, which ensures that only relevant LUs are retrieved. As shown in the step LU in Figure 2, accessing the server monitor is a prerequisite for checking the server load.
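The filtering role of the prerequisite can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the `LogicUnit` fields, the representation of constraints as a set of strings, the `filter_by_prerequisite` helper, and the example contents are all assumptions made for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LogicUnit:
    # Hypothetical representation; the field names are assumptions.
    header: str
    body: str
    # Constraints that must hold before this LU may be used.
    prerequisite: set[str] = field(default_factory=set)


def filter_by_prerequisite(
    candidates: list[LogicUnit], satisfied: set[str]
) -> list[LogicUnit]:
    """Keep only the LUs whose prerequisite constraints are all satisfied."""
    return [lu for lu in candidates if lu.prerequisite <= satisfied]


# Illustrative LUs loosely modeled on the server example; contents are made up.
candidates = [
    LogicUnit(
        header="Check server load",
        body="Read the load value on the monitor.",
        prerequisite={"server monitor accessed"},
    ),
    LogicUnit(
        header="Restart service",
        body="Restart the affected service.",
        prerequisite={"admin rights granted"},
    ),
]

kept = filter_by_prerequisite(candidates, satisfied={"server monitor accessed"})
print([lu.header for lu in kept])
```

Here only the LU whose constraints are covered by the currently satisfied conditions survives the filter. A real system would likely need a softer check (for example, an LLM judging whether a constraint holds) instead of exact set containment.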

Header. The header summarizes the LU or describes the intention it aims to address, depending on the type of LU (refer to §3.2). For example, if the LU describes a term, the header could be the name of that term; if the LU describes actions to resolve a problem, the header describes the intent or the problem the LU aims to resolve. Unlike the chunk-based paradigm, which indexes the entire content, THREAD indexes only the header, which serves as the key for retrieving the LU given a query.
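To make header-only indexing concrete, the sketch below ranks LUs by comparing the query against the header alone and never scores the body. It is an illustration under stated assumptions: the dictionary layout, the `retrieve_by_header` function, and the example LUs are hypothetical, and simple token overlap (Jaccard similarity) stands in for whatever retriever a real system would use.

```python
from __future__ import annotations

import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_by_header(query: str, units: list[dict], top_k: int = 1) -> list[dict]:
    """Rank LUs by Jaccard overlap between the query and the header only."""
    query_tokens = tokenize(query)

    def score(unit: dict) -> float:
        header_tokens = tokenize(unit["header"])
        union = query_tokens | header_tokens
        return len(query_tokens & header_tokens) / len(union) if union else 0.0

    return sorted(units, key=score, reverse=True)[:top_k]


# Made-up LUs; the body is stored but plays no part in retrieval.
units = [
    {"header": "How to check server load", "body": "Read the load value on the monitor."},
    {"header": "How to reset a user password", "body": "Open the account settings page."},
]

best = retrieve_by_header("check the server load", units)
print(best[0]["header"])
```

The body is still returned with the matched LU, so it can be passed to the generator once the header has done its job as the retrieval key.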

Body. The body contains the detailed information of the LU and is the core content fed into the LLM-based generator to produce answers. It includes specific actions or necessary information, such as code blocks and detailed instructions. This content either resolves the query stated in the header or explains the header in detail.

Linker. The linker acts as a bridge between logic units, enabling the dynamic process of answering how-to questions. Unlike the chunk-based paradigm, which relies on previously retrieved units that often lack direct clues, the linker in THREAD provides the information needed to generate new queries for subsequent retrieval. In Figure 2, the linker of ‘checking server’ specifies multiple possibilities after taking the action in the LU body, guiding the retrieval of the next-step LU. Its format varies by LU type, serving as either a query for retrieving other LUs or an entity relationship. An edge in a knowledge graph, as used for traditional factoid questions, is a special case of a linker that enables navigation between related entities. When no further LUs are connected, the linker remains empty, isolating the current LU.

Meta Data. The meta data includes information about the source document from which the LU is extracted, such as the document title, ID, date, and other relevant details. This meta data is crucial for updating LUs when the source documentation is revised and reprocessed.
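Taken together, the five components and the linker-driven navigation can be sketched as a small data structure and a traversal loop. This is a sketch under stated assumptions, not the authors' implementation: the field names, the encoding of the linker as a mapping from an observed outcome to a follow-up query, the exact-match lookup that stands in for retrieval, and all example contents are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LogicUnit:
    header: str                     # retrieval key
    body: str                       # content passed to the generator
    prerequisite: str = ""          # context or constraint for using the LU
    # Assumed encoding: observed outcome -> query for the next-step LU.
    linker: dict[str, str] = field(default_factory=dict)
    meta: dict[str, str] = field(default_factory=dict)  # source document info


def follow_linkers(
    start: LogicUnit, index: dict[str, LogicUnit], outcomes: list[str]
) -> list[str]:
    """Walk from `start`, picking the next LU from the linker entry that
    matches each observed outcome. Stops when the linker has no match."""
    path = [start.header]
    current = start
    for outcome in outcomes:
        next_query = current.linker.get(outcome)
        if next_query is None or next_query not in index:
            break  # empty linker or no retrievable LU: the walk ends here
        current = index[next_query]
        path.append(current.header)
    return path


# Made-up LUs loosely following the server example.
check_lu = LogicUnit(
    header="check server load",
    body="Read the load value on the server monitor.",
    prerequisite="server monitor is accessible",
    linker={
        "load is high": "reduce server load",
        "load is normal": "inspect application logs",
    },
    meta={"doc_title": "hypothetical operations guide"},
)
reduce_lu = LogicUnit(header="reduce server load", body="Stop non-critical jobs.")
logs_lu = LogicUnit(header="inspect application logs", body="Open the latest log file.")

# Exact header lookup stands in for query-based retrieval.
index = {lu.header: lu for lu in (check_lu, reduce_lu, logs_lu)}

path = follow_linkers(check_lu, index, ["load is high"])
print(path)
```

Each observed outcome selects one branch of the current linker, so the answer is assembled step by step instead of from a fixed set of retrieved segments. An LU with an empty linker, such as `reduce_lu` here, ends the walk.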