Empowering Domain-Specific Language Models with Graph-Oriented Databases: A Paradigm Shift in Performance and Model Maintenance
Modern graph-oriented databases (GODB) have emerged as a solution for highly connected data and for link-oriented queries and algorithms [2]. In fact, they have been a valuable solution in the software industry for decades. The implementation of GODB in business solutions is not restricted to specific domains, and they are well suited for on-line transaction processing (OLTP) solutions as well as analytics solutions (machine learning, business intelligence, data mining, etc.) [4].
In 2023 and 2024, numerous studies have underscored the significance of integrating knowledge graphs (KG) with Large Language Models (LLMs) in artificial intelligence (AI) [9][12]. This work aims to illustrate the value of the partnership between GODB and LLMs in industry-specific applications where extensive document analysis is required. The structure of this work is as follows: first, we delineate the types of solutions under examination in section 2. In section 3, we define KGs and explore their automated creation in GODB using LLMs. Section 4 delves into Retrieval-Augmented Generation (RAG) techniques with GODB support. In section 5, we explore explainability techniques. Memory, context, and personalization are addressed in section 6. Section 7 highlights GODB's role in enabling factuality checks to mitigate the hallucination effect. Finally, we conclude with a section on future directions.
Generative AI and LLMs are expected to have a transformative impact on work across industries [1]. Domain-specific (DS) LLM solutions are well suited for industry-specific applications for many reasons. Among them, we can mention: (1) Specialized vocabulary: these applications must handle the jargon and technical terms of complex industry processes that were not included in the original foundational model training. The text, audio, and video ingested by these solutions typically make extensive use of DS language. DS LLMs are trained or tuned on data from the particular industry, enabling them to better understand and generate contextually appropriate text using the specialized vocabulary of that domain. (2) Domain-specific rules: DS rules are needed to process the data, extract knowledge, and interact with it. (3) Higher accuracy: by focusing on a specific domain, LLMs can achieve higher accuracy in understanding and generating text related to that domain, as they learn its patterns and nuances more effectively. (4) Improved relevance: DS LLMs can provide more relevant responses to queries or prompts within their designated industry; they understand the context better and can produce text that aligns with the specific requirements and expectations of users in that industry. (5) Better performance on niche tasks. (6) Faster deployment and adoption: since DS LLMs are tailored to the needs of a particular industry, they often require less fine-tuning and customization to be deployed effectively. (7) Compliance and security: many industries have strict compliance and security requirements, and DS LLMs can be tuned to adhere to these regulations and ensure that generated text meets industry standards for privacy, security, and legal compliance.
Examples include: (1) financial sensitivity analytics from day-to-day operations, (2) concurrent comparison of commercial contracts and question-and-answer (Q&A) sessions, (3) spend classification, (4) trend sensing in social networks, (5) analysis of meeting minutes, and many others.
Let us describe one such case: a meeting minutes analyzer. Consider a construction materials company (comprising both a factory and a distribution network). Imagine a sizable team of sales representatives who visit clients for sales purposes while also addressing customer satisfaction, assessing materials quality, and gathering feedback on logistics and the industrial process. After each meeting, these representatives submit a brief note (potentially in audio format) to the central office, probably using technical and colloquial jargon. These notes are unstructured and diverse, possibly containing material for sentiment analysis, customer complaints about material quality, feedback on time-to-market, pricing comparisons with competitors, and more. Furthermore, suppose there are hundreds of sales representatives, each averaging five client visits per day, resulting in approximately half a million notes annually. The company aims to develop a software solution capable of ingesting this data, performing advanced analytics, and generating insights. These insights will then be integrated into a chatbot or Q&A agent, enabling real-time access to complex business information. From now on, we will refer to this scenario as "our case".
A first step is the creation and updating of the KG. LLMs can efficiently assist with this task by extracting entities and the labeled relations between them. A GODB is necessary in our case to provide a sustainable environment for this large graph. Additionally, most foundational models can handle specific GODB query languages, such as Cypher expressions for a Neo4j database.
A well-structured prompt for this extraction task combines three elements. (1) Role: this sets a statement of the role the LLM agent should play, for example, "You are an assistant in the data science team of a construction materials company aiming to extract information from the minutes of meetings between its sales representatives and clients, with the goal of capturing that knowledge in knowledge graphs in Neo4j". (2) Instruction: this provides specific instructions to build the KG, such as "Could you construct a knowledge graph considering the following premises? (a) It's important to break down the data that can be extracted from the objectives, meeting summaries, and topics discussed. (b) [...]". (3) Output format: it is crucial to define the specific output needed for the data pipeline processing workflow, for instance, "Format the output only with the Cypher statements necessary to create the graph, considering that several nodes may already exist in the graph".
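The role/instruction/output-format pattern above can be sketched as a small prompt builder. This is a minimal illustration, not a prescribed implementation: the constant texts below paraphrase the examples in this section, and `build_kg_prompt` is a hypothetical helper whose output would be sent to any chat-completion API.

```python
# Sketch of the three-part prompt pattern (role, instruction, output format)
# for LLM-driven KG extraction. All names here are illustrative assumptions.

ROLE = (
    "You are an assistant in the data science team of a construction "
    "materials company aiming to extract information from meeting minutes "
    "and capture that knowledge in a Neo4j knowledge graph."
)

INSTRUCTION = (
    "Construct a knowledge graph from the minute below. Break down the "
    "objectives, meeting summary, and topics discussed into entities and "
    "labeled relations between them."
)

OUTPUT_FORMAT = (
    "Format the output only as the Cypher statements necessary to create "
    "the graph. Use MERGE rather than CREATE, since several nodes may "
    "already exist in the graph."
)

def build_kg_prompt(minute_text: str) -> str:
    """Assemble the role/instruction/output-format prompt for one minute."""
    return "\n\n".join([ROLE, INSTRUCTION, OUTPUT_FORMAT, minute_text])

prompt = build_kg_prompt("Client Acme complained about cement humidity.")
```

The returned string would be submitted as a single message; asking for Cypher-only output keeps the downstream ingestion pipeline free of parsing heuristics.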
In our case, the importance of applying RAG over a GODB, rather than over plain vector embeddings of the minutes, stems from the fact that many user questions produce a large number of matches under cosine vector similarity (or any other vector matching function). Consider a question like "give me the volume of cement or concrete sales lost due to humidity issues in 2023": a large number of vectors would likely emerge as candidates, so the standard vector-only solution is not suitable for our case. The Langchain framework has introduced a highly maintainable approach to working with RAG [6], with a focus on the consumption side.
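The limitation described above can be made concrete with a toy experiment. The data, the similarity stand-in, and the schema below are all hypothetical: the point is only that top-k vector retrieval truncates the evidence for an aggregate question, while a graph query can aggregate server-side.

```python
# Toy illustration (hypothetical data and schema) of why an aggregate
# question fits a graph query better than top-k vector retrieval.

minutes = [
    {"id": i, "product": "cement", "issue": "humidity",
     "year": 2023, "lost_volume_t": 10.0}
    for i in range(500)
]  # hundreds of semantically similar notes about the same issue

def vector_search(query: str, top_k: int = 5):
    """Stand-in for cosine-similarity retrieval: every humidity minute is a
    plausible match, so top_k truncates the evidence arbitrarily."""
    hits = [m for m in minutes if "humidity" in query]
    return hits[:top_k]

def graph_aggregate():
    """What a Cypher aggregate would compute inside the GODB, e.g.:
    MATCH (m:Minute {issue: 'humidity', year: 2023})
    RETURN sum(m.lost_volume_t)"""
    return sum(m["lost_volume_t"] for m in minutes
               if m["issue"] == "humidity" and m["year"] == 2023)

retrieved = vector_search("cement sales lost due to humidity in 2023")
total = graph_aggregate()
```

With 500 relevant minutes, the vector path surfaces only 5 of them, while the graph path returns the exact total; a RAG pipeline over the KG lets the LLM answer from the aggregate rather than from a truncated sample.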
Consider creating a graph containing information about chat history and updating it with each interaction. A notable aspect of this approach is that old facts can be replaced by new ones at any given moment. This implementation resembles a session-oriented KG containing user interaction data, which may eventually transition from session-based to user history-based.
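The replace-old-facts semantics can be sketched with an upsert keyed on (user, predicate). The schema and the Cypher in the comment are assumptions for illustration; a real implementation would issue the statement through the Neo4j driver instead of mutating a dictionary.

```python
# Minimal sketch (hypothetical schema) of a session-oriented memory graph
# in which a new fact for the same (user, predicate) pair replaces the old.

session_kg: dict[tuple[str, str], str] = {}

def upsert_fact(user: str, predicate: str, value: str) -> None:
    """In-memory analogue of an assumed Cypher upsert:
    MERGE (u:User {id: $user})
    MERGE (u)-[:FACT {predicate: $predicate}]->(v:Value)
    SET v.text = $value"""
    session_kg[(user, predicate)] = value  # new fact overwrites the old one

upsert_fact("rep42", "preferred_client", "Acme Construction")
upsert_fact("rep42", "preferred_client", "BuildCo")  # replaces old fact
```

Because the key is (user, predicate) rather than the fact itself, each interaction updates rather than accumulates, which is what lets a session-based memory graduate into a compact user-history graph.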
Prompt engineering patterns can be implemented to answer user questions using the KG in the GODB. The Langchain framework can assist AI engineers in implementing a chain of prompts (via chains, agents, or tooling) where the user prompt is split into short pieces, each of which is translated to Cypher. Each query is then executed in Neo4j, and finally the LLM interprets the query responses, combines them, and elaborates a conclusive answer.
A good example of a use case for this approach is the following question: "give me the volume of cement or concrete sales lost due to humidity issues in 2023 in adherence to law 13943". Here, the agent needs to discover the implications of law 13943 and then calculate the lost sales.
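The decompose-translate-execute-combine chain for this question can be sketched as follows. The sub-questions, the Cypher strings, and `run_query` are hard-coded, hypothetical stand-ins for the LLM and Neo4j calls a Langchain pipeline would make; only the control flow is the point.

```python
# Hedged sketch of the prompt-chain pattern: split a compound question into
# sub-questions, translate each to Cypher, execute, then combine the parts.

QUESTION = ("give me the volume of cement or concrete sales lost due to "
            "humidity issues in 2023 in adherence to law 13943")

def decompose(question: str) -> list[str]:
    # An LLM call (e.g. via a Langchain agent) would produce these pieces.
    return ["What does law 13943 imply for humidity-related claims?",
            "What cement or concrete sales were lost to humidity in 2023?"]

def to_cypher(sub_question: str) -> str:
    # Hypothetical translation step; real systems prompt the LLM for this.
    if "law" in sub_question:
        return "MATCH (l:Law {number: 13943}) RETURN l.implications"
    return ("MATCH (m:Minute {issue: 'humidity', year: 2023}) "
            "RETURN sum(m.lost_volume_t)")

def run_query(cypher: str) -> str:
    # Stand-in for executing the statement against Neo4j via a driver.
    return f"<result of: {cypher}>"

partials = [run_query(to_cypher(q)) for q in decompose(QUESTION)]
# A final LLM call would synthesize `partials` into one conclusive answer.
```

Each stage is independently testable, which is what makes the chain pattern maintainable: the translation step can be improved without touching decomposition or execution.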