ChatGPT is not Enough: Enhancing Large Language Models with Knowledge Graphs for Fact-aware Language Modeling

Paper · arXiv 2306.11489 · Published June 20, 2023
Knowledge Graphs

Knowledge that can be incorporated into PLMs can be divided into implicit knowledge and explicit knowledge. Typical forms of implicit knowledge include word segmentation, part of speech, sentiment, and human feedback. Word segmentation can improve PLMs' performance by enhancing their ability to distinguish different elements in languages with complex structures, such as Chinese and Japanese. For example, Li et al. [67] proposed enhancing PLMs by integrating information on Chinese word segmentation boundaries through a multi-source word-aligned model. Part-of-speech tagging provides additional contextual information for PLMs, improving their learning of sentence structure. Sentiment annotations and sentiment-analysis pre-training tasks can help PLMs better capture sentiment information, leading to improved performance on tasks such as sentiment analysis, text classification, and text generation. Human feedback not only improves the accuracy of PLMs by providing a form of supervision, but also reduces safety issues by steering them toward correct answers. For instance, InstructGPT [68] aligns language models with user intent by fine-tuning them with human feedback, producing outputs that users prefer over those of GPT-3 even with fewer parameters. However, implicit knowledge often suffers from insufficient quality: it may change with sample and parameter adjustments, resulting in unstable training. Moreover, implicit knowledge cannot fully address the problem of unexplainable decision-making.
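
As a rough illustration of how such implicit knowledge can be injected, the sketch below adds a word-segmentation boundary embedding on top of ordinary token embeddings so that boundary information reaches every layer of the model. The class name, tag scheme, and toy inputs are illustrative assumptions, not the word-aligned model of Li et al. [67].

```python
# Minimal sketch: fuse implicit knowledge (Chinese word-segmentation boundaries)
# into a PLM input by summing a boundary-tag embedding with the token embedding.
# All names and the B/I/E/S tag scheme are illustrative, not the paper's method.

import torch
import torch.nn as nn

class BoundaryAwareEmbedding(nn.Module):
    """Token embedding augmented with a word-boundary tag embedding (B/I/E/S)."""

    def __init__(self, vocab_size: int, hidden_size: int, num_boundary_tags: int = 4):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, hidden_size)
        self.boundary_embedding = nn.Embedding(num_boundary_tags, hidden_size)

    def forward(self, token_ids: torch.Tensor, boundary_ids: torch.Tensor) -> torch.Tensor:
        # Summing makes segmentation information available to all downstream layers.
        return self.token_embedding(token_ids) + self.boundary_embedding(boundary_ids)

# Toy usage: four character ids tagged B(egin)/E(nd)/B/E under the illustrative scheme.
embed = BoundaryAwareEmbedding(vocab_size=100, hidden_size=16)
token_ids = torch.tensor([[5, 9, 23, 7]])
boundary_ids = torch.tensor([[0, 2, 0, 2]])
print(embed(token_ids, boundary_ids).shape)  # torch.Size([1, 4, 16])
```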

In contrast, explicit knowledge is of higher quality and can steadily improve the performance of PLMs. It is also structured and organized, making it easier to maintain, expand, and explain. Typical forms of explicit knowledge include the semantic web, syntax trees, and KGs. A KG is a widely used knowledge modeling method that represents entities and their relations as triples, providing PLMs with clear and interpretable semantic knowledge. In recent years, various KGPLMs have been proposed; they can be categorized into before-training, during-training, and post-training enhancement methods according to the stage at which KGs participate in pre-training, as illustrated in Fig. 3.
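
For concreteness, the following minimal sketch shows the triple form of a KG described above: each fact is a (head entity, relation, tail entity) triple, and triples are indexed so they can be looked up when enhancing a PLM. The class names and toy facts are illustrative, not taken from the paper.

```python
# Minimal sketch of a triple-based KG: a set of (head, relation, tail) facts
# with an index by head entity for fast lookup during text preprocessing.

from collections import defaultdict
from typing import Iterable, List, NamedTuple

class Triple(NamedTuple):
    head: str
    relation: str
    tail: str

class KnowledgeGraph:
    def __init__(self, triples: Iterable[Triple]):
        self.triples = list(triples)
        self._by_head = defaultdict(list)
        for t in self.triples:
            self._by_head[t.head].append(t)

    def facts_about(self, entity: str) -> List[Triple]:
        return self._by_head.get(entity, [])

# Toy usage with two illustrative facts.
kg = KnowledgeGraph([
    Triple("Bob Dylan", "occupation", "songwriter"),
    Triple("Bob Dylan", "born_in", "Duluth"),
])
print(kg.facts_about("Bob Dylan"))
```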

Two challenges arise when integrating knowledge from KGs into PLMs: heterogeneous embedding spaces and knowledge noise. The first stems from the heterogeneity of the vector spaces induced by the different data formats of text and KGs, which makes it difficult for PLMs to integrate knowledge effectively. The second occurs when unrelated knowledge diverts a sentence from its correct meaning. Before-training enhancement methods address these issues by unifying text and KG triples into the same input format through preprocessing; the general framework is shown in Fig. 4. Existing studies propose diverse ways to achieve this, including expanding input structures, enriching input information, generating new data, and optimizing word masks.
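
A minimal sketch of the "enriching input information" style of before-training enhancement might look as follows: entities mentioned in the text are linked to KG triples, which are verbalized and appended after a separator so that text and knowledge share one input format. The exact-string entity linker, the toy triples, and the `[SEP]`-style concatenation are assumptions for illustration, not the paper's pipeline.

```python
# Minimal sketch of before-training enhancement via input enrichment:
# link entities in the sentence to KG facts, verbalize the facts, and append
# them so the PLM receives a single unified input sequence.

TOY_KG = {
    "Bob Dylan": [("occupation", "songwriter"), ("birthplace", "Duluth")],
}

def expand_input(sentence: str, kg: dict, max_triples: int = 3) -> str:
    verbalized = []
    for entity, facts in kg.items():
        if entity in sentence:                      # naive exact-string entity linking
            for relation, tail in facts[:max_triples]:
                verbalized.append(f"{entity} {relation} {tail}")
    if not verbalized:
        return sentence
    # Append verbalized triples after a separator so text and knowledge share one format.
    return sentence + " [SEP] " + " ; ".join(verbalized)

print(expand_input("Bob Dylan wrote many songs.", TOY_KG))
# Bob Dylan wrote many songs. [SEP] Bob Dylan occupation songwriter ; Bob Dylan birthplace Duluth
```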