Knowledge Retrieval Based on Generative AI

Paper · arXiv 2501.04635 · Published January 8, 2025

Abstract—This study develops a question-answering system based on Retrieval-Augmented Generation (RAG) using Chinese Wikipedia and Lawbank as retrieval sources. Using TTQA and TMMLU+ as evaluation datasets, the system employs BGE-M3 for dense vector retrieval to obtain highly relevant search results and BGE-reranker to reorder these results by query relevance. The most pertinent retrieval outcomes serve as reference knowledge for a Large Language Model (LLM), enhancing its ability to answer questions and establishing a knowledge retrieval system grounded in generative AI.
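The retrieve–rerank–generate flow described above can be sketched in miniature. The functions below are hypothetical stand-ins: `embed` substitutes a toy bag-of-words vector for BGE-M3, `rerank` substitutes token overlap for BGE-reranker, and `answer` simply returns the top context instead of prompting an LLM.

```python
# Toy sketch of the retrieve -> rerank -> generate RAG flow.
# `embed` and `rerank` are crude stand-ins for BGE-M3 and BGE-reranker;
# a real system would call those models and then prompt the LLM.

def embed(text: str) -> dict:
    """Bag-of-words 'embedding' used only to make the sketch runnable."""
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def dot(a: dict, b: dict) -> float:
    return sum(v * b.get(k, 0) for k, v in a.items())

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """First stage: dense-style retrieval by vector similarity."""
    q = embed(query)
    return sorted(corpus, key=lambda d: dot(q, embed(d)), reverse=True)[:k]

def rerank(query: str, candidates: list[str]) -> list[str]:
    """Second stage: reorder the candidates with a (toy) scoring function."""
    q_tokens = set(query.lower().split())
    return sorted(candidates,
                  key=lambda d: len(q_tokens & set(d.lower().split())),
                  reverse=True)

def answer(query: str, corpus: list[str]) -> str:
    context = rerank(query, retrieve(query, corpus))[0]
    # A real system would prompt the LLM with `context`; here we return it.
    return context

corpus = [
    "The capital of France is Paris.",
    "BM25 adjusts term weights dynamically.",
    "FAISS builds vector indices for similarity search.",
]
print(answer("what is the capital of France", corpus))
```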

The system’s effectiveness is assessed through a two-stage evaluation: automatic and assisted performance evaluations. The automatic evaluation calculates accuracy by comparing the model’s auto-generated labels with ground-truth answers, measuring performance under standardized conditions without human intervention. The assisted performance evaluation involves 20 finance-related multiple-choice questions answered by 20 participants without financial backgrounds. Participants first answer independently; they then answer again with system-generated reference information, to examine whether the system improves accuracy when assistance is provided.
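The automatic evaluation reduces to matching generated labels against the gold labels. A minimal sketch (the label lists below are illustrative, not the paper's actual data):

```python
# Automatic evaluation: accuracy = fraction of model labels that match
# the ground-truth answers. The lists below are illustrative only.

def accuracy(predicted: list[str], gold: list[str]) -> float:
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

predicted = ["A", "C", "B", "D", "A"]
gold      = ["A", "B", "B", "D", "C"]
print(f"accuracy = {accuracy(predicted, gold):.2f}")  # 3 of 5 correct
```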

The main contributions of this research are: (1) Enhanced LLM capability: by integrating BGE-M3 and BGE-reranker, the system retrieves and reorders highly relevant results, reduces hallucinations, and dynamically accesses authorized or public knowledge sources. (2) Improved data privacy: a customized RAG architecture enables local operation of the LLM, eliminating the need to send private data to external servers. This approach enhances data security, reduces reliance on commercial services, lowers operational costs, and mitigates privacy risks.

Traditional IR systems have limitations, such as low relevance or excessively lengthy outputs when query terms differ from indexed terms, which increases the effort required to find the desired information.

Dense vector retrieval, a key method in this framework, uses deep learning to map text to high-dimensional vectors, improving accuracy by capturing semantic similarity. This approach is crucial for the development of next-generation knowledge retrieval, overcoming the limitations of traditional methods while leveraging the strengths of LLMs.
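Dense retrieval in miniature: documents and the query are mapped to vectors, and documents are ranked by cosine similarity to the query. The 3-dimensional vectors below are stand-ins for real embedding-model outputs (BGE-M3 produces much higher-dimensional vectors).

```python
import numpy as np

# Dense retrieval sketch: rank documents by cosine similarity to the query.
# These tiny 3-d vectors stand in for real embedding-model outputs.

doc_vecs = np.array([
    [0.9, 0.1, 0.0],   # doc 0
    [0.1, 0.8, 0.2],   # doc 1
    [0.0, 0.2, 0.9],   # doc 2
])
query = np.array([0.85, 0.15, 0.05])

def cosine_rank(query: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """Return document indices ordered from most to least similar."""
    sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)

print(cosine_rank(query, doc_vecs))  # doc 0 ranks first
```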

Information retrieval (IR) has progressed from basic textual data processing to advanced deep learning techniques, with natural language processing (NLP) playing a key role. Early models, such as the Boolean Retrieval Model, used logical operators (OR, AND, NOT) for document-query matching but struggled with complex queries. The Vector Space Model improved upon this by vectorizing documents and queries and using similarity measures, while enhancements like TF-IDF increased retrieval accuracy by adjusting term importance, though they lacked contextual understanding. Probabilistic Retrieval Models ranked documents based on relevance probabilities but required large datasets for accuracy. BM25 further refined this approach by dynamically adjusting term weights, and Latent Dirichlet Allocation (LDA) introduced topic modeling using Bayesian networks. The advent of GPUs and deep learning has popularized dense vector retrieval, which converts text into high-dimensional vectors. Models like BERT and Sentence-BERT leverage these vectors to perform tasks such as sentence classification efficiently, enhancing the semantic understanding of retrieval systems.
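The TF-IDF weighting mentioned above can be illustrated in a few lines. This uses the common tf × log(N/df) variant; real systems (and BM25) apply further length normalization and smoothing.

```python
import math

# TF-IDF in miniature: term frequency weighted by inverse document
# frequency, so terms common across the corpus count for less.

docs = [
    "dense vector retrieval",
    "boolean retrieval model",
    "vector space model",
]

def tfidf(term: str, doc: str, corpus: list[str]) -> float:
    tf = doc.split().count(term)                      # count in this document
    df = sum(term in d.split() for d in corpus)       # documents containing term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# "retrieval" appears in 2 of 3 docs -> low idf; "dense" in 1 of 3 -> higher.
print(tfidf("dense", docs[0], docs), tfidf("retrieval", docs[0], docs))
```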

Efficient retrieval of high-dimensional vector data, typically generated by embedding models, requires a specialized data structure known as a vector index. This study employs FAISS [9] (Facebook AI Similarity Search), developed by Meta’s Fundamental AI Research team, to build vector indices for similarity search and clustering of dense vectors. FAISS supports three common indexing methods: Flat Index for small datasets, providing linear searches across all vectors; Inverted File Index (IVF) for large datasets, which clusters data to reduce search time; and Hierarchical Navigable Small World Graph Index (HNSW), a graph-based method that finds neighboring vectors in high-dimensional spaces with enhanced efficiency.

FAISS also supports three methods for similarity search: L2 Norm, Dot Product, and Cosine Similarity. The L2 Norm measures the Euclidean distance between two vectors and is often used in tasks requiring precise distance calculations. The Dot Product calculates the sum of the products of vector components and is commonly used for similarity and projection tasks in recommendation systems. Cosine Similarity measures the angle between vectors, focusing on direction rather than magnitude, making it popular in information retrieval (IR) systems. In this experiment, after the BGE-M3 model converts the retrieval data into vectors, FAISS organizes these vectors and constructs the vector index, enabling efficient retrieval based on similarity measures.
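The three similarity measures can be compared directly. The sketch below computes each one over a small vector set, as a flat (linear-scan) index would; note that for unit-length vectors the dot product and cosine similarity coincide, which is why cosine search is often implemented as an inner-product search over normalized vectors.

```python
import numpy as np

# The three similarity measures over a small set of vectors, evaluated
# by a brute-force linear scan (what a flat index does). All vectors
# here are unit length, so dot product and cosine similarity agree.

vectors = np.array([
    [1.0, 0.0],
    [0.6, 0.8],
    [0.0, 1.0],
])
q = np.array([0.8, 0.6])

l2  = np.linalg.norm(vectors - q, axis=1)   # Euclidean distance (lower = closer)
dot = vectors @ q                           # inner product (higher = closer)
cos = dot / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))

print("L2 nearest:", int(np.argmin(l2)))
print("dot best:  ", int(np.argmax(dot)))
print("cos best:  ", int(np.argmax(cos)))
```

All three measures agree on the nearest neighbor here because the vectors are normalized; with unnormalized vectors, dot product and L2 rankings can diverge from cosine.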