Educating LLMs like Human Students: Structure-aware Injection of Domain Knowledge

Paper · arXiv 2407.16724 · Published July 23, 2024
Domain Specialization · Training · Fine-Tuning · Discourses

This paper presents a pioneering methodology, termed StructTuning, to efficiently transform foundation Large Language Models (LLMs) into domain specialists. It reduces the training corpus requirement to a mere 0.3% while achieving an impressive 50% of traditional knowledge injection performance. Our method is inspired by the educational process of human students, particularly how structured domain knowledge from textbooks is absorbed and then applied to tackle real-world challenges through specific exercises. Based on this, we propose a novel two-stage knowledge injection strategy: Structure-aware Continual Pre-Training (SCPT) and Structure-aware Supervised Fine-Tuning (SSFT). In the SCPT phase, we organize the training data into an auto-generated taxonomy of domain knowledge, enabling LLMs to effectively memorize textual segments linked to specific expertise within the taxonomy’s architecture. Subsequently, in the SSFT phase, we explicitly prompt models to reveal the underlying knowledge structure in their outputs, leveraging this structured domain insight to address practical problems adeptly.
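To make the taxonomy that drives both stages concrete, below is a minimal sketch (our illustration, not code from the paper) of a domain-knowledge tree whose leaf nodes carry the textbook chunks; all class and function names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

@dataclass
class KnowledgePoint:
    """One node of the auto-generated domain-knowledge taxonomy."""
    title: str                                    # e.g. "Cardiology" or "Atrial fibrillation"
    chunk: Optional[str] = None                   # textbook text linked to this point (leaves)
    children: List["KnowledgePoint"] = field(default_factory=list)

def leaves_with_paths(node: KnowledgePoint,
                      prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, str]]:
    """Yield (taxonomy path, text chunk) pairs for every leaf knowledge point."""
    path = prefix + (node.title,)
    if node.chunk is not None and not node.children:
        yield " > ".join(path), node.chunk
    for child in node.children:
        yield from leaves_with_paths(child, path)
```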

In this process, the only new material to learn consists of textbooks (structured content) and practice exercises (question-answer pairs); students simply draw on their world knowledge to memorize, understand, and apply the domain knowledge, gradually becoming experts [23, 52].

Inspired by this, we propose to inject the domain knowledge from textbooks into LLMs, as if educating a human student, through a novel two-stage training strategy: Structure-aware Continual Pre-Training (SCPT) and Structure-aware Supervised Fine-Tuning (SSFT).

Traditionally, text corpora are simply concatenated and divided into chunks of 2,048 [35] or 4,096 [13] tokens, while the inherent structure of the texts (e.g., the catalogs of textbooks) is disregarded. Instead, we propose an automatic approach that preserves each chunk’s place in the knowledge structure. We view each chunk as a knowledge point and employ advanced LLMs to extract a domain knowledge taxonomy from the corpus, bypassing the need for manual annotation. LLMs are then trained to predict each chunk’s textual content conditioned on its linked knowledge points within the taxonomy, tying individual training chunks to the overall knowledge framework. Finally, models are asked to reproduce the knowledge structure itself, reviewing the whole domain knowledge system.
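As a hedged sketch of what such structure-conditioned pre-training samples might look like (the prompt wording is our assumption, not the paper’s exact template), the fragment below turns (taxonomy path, chunk) pairs into SCPT text, plus a final sample that asks the model to memorize the outline itself:

```python
from typing import Iterable, List, Tuple

def build_scpt_samples(path_chunk_pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Assemble structure-aware continual pre-training samples.

    Each sample conditions a chunk's content on its position in the
    knowledge taxonomy, instead of training on raw 2,048/4,096-token
    chunks cut from a concatenated corpus.
    """
    pairs = list(path_chunk_pairs)
    samples = [f"Knowledge point: {path}\nContent:\n{chunk}" for path, chunk in pairs]
    # Final sample: the model is asked to reproduce the structure itself,
    # reviewing the whole domain knowledge system.
    outline = "\n".join(path for path, _ in pairs)
    samples.append(f"Domain knowledge outline:\n{outline}")
    return samples
```

Each resulting string would then be tokenized and used as an ordinary next-token-prediction target during continual pre-training.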

In the SSFT stage, the goal shifts from knowledge injection to enabling LLMs to recall and apply their acquired knowledge to real-world challenges. We explicitly elicit knowledge structures in LLMs’ responses, which serve as a beacon guiding the models toward targeted information retrieval or logical reasoning and thus more reliable answers. To this end, we derive a scalable strategy to generate question-answer pairs as practice exercises using advanced LLMs such as GPT-4 [1] or LLaMA-3 [2]. In scenarios with existing QA pairs, such as MMedBench [35], we retrieve the related knowledge structures and content and instruct LLaMA-3 to produce explanations that reason from the question to the answer based on these structures.
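To illustrate the exercise-generation step, here is a minimal sketch under stated assumptions: `retrieved` holds the (taxonomy path, chunk) pairs relevant to an existing QA item (e.g., from MMedBench), and `chat` is a placeholder for whichever advanced LLM (GPT-4 or LLaMA-3) writes the explanation; the prompt and output format are illustrative, not the paper’s exact templates.

```python
from typing import Callable, Dict, List, Tuple

def make_ssft_example(question: str,
                      answer: str,
                      retrieved: List[Tuple[str, str]],
                      chat: Callable[[str], str]) -> Dict[str, str]:
    """Build one SSFT exercise whose target explanation explicitly
    walks through the retrieved knowledge structure before answering."""
    structure = "\n".join(f"- {path}" for path, _ in retrieved)
    context = "\n\n".join(chunk for _, chunk in retrieved)
    prompt = (
        "Using the knowledge structure and supporting content below, explain "
        "step by step how the answer follows from the question.\n\n"
        f"Knowledge structure:\n{structure}\n\n"
        f"Supporting content:\n{context}\n\n"
        f"Question: {question}\nAnswer: {answer}\nExplanation:"
    )
    explanation = chat(prompt)
    # Fine-tuning target: first state the relevant structure, then the
    # reasoning, then the final answer.
    target = f"Relevant knowledge:\n{structure}\n\n{explanation}\n\nAnswer: {answer}"
    return {"instruction": question, "output": target}
```

The resulting instruction/output pairs can then be used directly as supervised fine-tuning data, so that the model learns to surface the knowledge structure before committing to an answer.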