Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey
Specifically, domain specialization of Large Language Models (LLMs) is defined as the process of customizing general-purpose LLMs according to specific domain contextual data, augmented by domain-specific knowledge, optimized by the domain’s objective, and regulated by domain-specific constraints. This shift towards domain specialization of LLMs is motivated by several compelling reasons. First, there are significant differences in conversation and language styles across fields, roles, and tasks, ranging from medical prescriptions to legal sentences to online chatting. Acquiring such capabilities and experience can take human professionals many years of training, much of which is hands-on and proprietary. Moreover, different fields, institutions, and teams have their own “business models” concerning which responses maximize their own utility functions for their tasks, and these cannot be directly replaced by a single general-purpose LLM with no customization. More importantly, the domain knowledge required for professional-level usage must be in-depth, real-time, and accurate, none of which can be easily achieved by pre-trained LLMs. Many domain knowledge resources are proprietary assets and core competitive advantages of their organizations and can never be leaked to general-purpose LLMs. Last but not least, language is constrained by social norms, cultural conformity, religious beliefs, legal requirements, and ethical practices, all of which vary across locations, countries, populations, races, and communities, making it impossible for a general-purpose LLM to serve as a one-size-fits-all solver without customization.
In many specialized domains, new discoveries, regulations, and best practices continuously emerge, making it difficult for LLMs to stay up-to-date. For instance, more than 30 thousand mainstream news articles are published every day [157]. For tasks such as social media analysis and fact-checking, LLMs may fail to handle this stream of new information, since the knowledge extracted from their training corpus is static.
Popular or widely-discussed topics may be over-represented, while very domain-specific topics are usually under-represented, making them difficult for the model to learn effectively for domain-specific tasks. In addition, domain-specific tasks often involve complex concepts, specialized terminology, and intricate relationships between entities. Without proper guidance, LLMs may generate plausible-sounding but inconsistent answers to similar queries (i.e., LLM hallucination) or slightly rephrased questions [5]. This issue arises because LLMs are designed to predict the most likely word sequences based on the input rather than to provide a definitive answer based on a structured knowledge base. Researchers have found that, by providing LLMs with a few task-specific demonstrations, users can guide the model to produce more relevant, accurate, and task-specific responses, enhancing the overall utility and effectiveness of AI systems across numerous domains [166]. Nevertheless, providing LLMs with adequate demonstrations is not trivial, since user instructions can often be vague, incomplete, or ambiguous.
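To make the demonstration-based guidance concrete, the following is a minimal sketch of few-shot prompt construction; the helper name, instruction text, and demonstration pairs are illustrative assumptions rather than the protocol of any cited work.

```python
def build_few_shot_prompt(instruction, demonstrations, query):
    """Assemble a prompt from a task instruction, a few
    input/output demonstrations, and the user's query."""
    parts = [instruction.strip(), ""]
    for x, y in demonstrations:
        parts.append(f"Input: {x}")
        parts.append(f"Output: {y}")
        parts.append("")
    # The trailing "Output:" cues the LLM to complete the pattern.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)

demos = [
    ("The drug lowered LDL cholesterol.", "positive finding"),
    ("No significant effect was observed.", "negative finding"),
]
prompt = build_few_shot_prompt(
    "Classify each biomedical sentence as a positive or negative finding.",
    demos,
    "Patients showed marked improvement in symptoms.",
)
```

In practice, the choice and ordering of demonstrations themselves materially affect output quality, which is exactly why crafting adequate demonstrations is non-trivial.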
A systematic categorization and taxonomy of LLM domain specialization techniques: We comprehensively classify existing methods based on their level of access to the LLM (i.e., black-box, grey-box, and white-box) and organize the corresponding techniques into a taxonomy. We discuss the details, relationships, pros, and cons of the different subcategories. The proposed taxonomy is designed to assist domain experts in identifying the most suitable techniques for their target problem settings.
An additional taxonomy could be based on the intervention level: pre-training intervention modifies the pre-training process to encourage domain-specific knowledge, fine-tuning intervention involves adaptations during the fine-tuning stage, and inference-time intervention modifies the model’s behavior during actual application to generate more domain-specific outputs. Furthermore, a taxonomy can be established based on the evaluation and feedback mechanism: fixed evaluation sets a constant benchmark, dynamic evaluation involves continuous performance assessment with changing benchmarks, and user-feedback-based evaluation uses direct user input as a signal to specialize the model’s responses.
- External augmentation (black box) does not necessarily require access to the LLM’s inner parameter space, making it the most accessible option for users with limited resources (e.g., computational resources, domain-specific data). As shown in Figure 2 (b), by using external resources or tools, domain-specific knowledge is incorporated into the input prompt, the generated output, or both, effectively adapting the LLM’s performance without modifying its internal structure.
- Prompt crafting (grey box) involves designing various types of prompts by accessing the gradient or loss values of LLMs, allowing for finer control over the model’s behavior.
- Model fine-tuning (white box) demands the most access and resources, as it involves updating the LLM’s parameters to incorporate domain-specific knowledge directly into the model (Figure 2 (d)).
transform task-specific queries into latent embeddings, calculating attention scores between the query vector and each knowledge entry. A softmax function is used to generate a weight or probability distribution across all knowledge entries with respect to the input query. The retrieved memory vector is then obtained via a weighted sum of the memory entries, using the attention weights. This method enhances traditional neural networks with implicit knowledge, permitting the model to access relevant, up-to-date information during inference.
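The attention-based retrieval described above can be sketched in a few lines; the toy memory entries and dimensions are illustrative assumptions, and a real system would operate on high-dimensional latent embeddings produced by an encoder.

```python
import math

def attention_retrieve(query, memory):
    """Retrieve a memory vector as an attention-weighted sum of
    knowledge entries: softmax(query . entry) over all entries."""
    # Dot-product attention score between the query and each entry.
    scores = [sum(q * m for q, m in zip(query, entry)) for entry in memory]
    # Softmax turns scores into a probability distribution over entries.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of memory entries using the attention weights.
    retrieved = [
        sum(w * entry[i] for w, entry in zip(weights, memory))
        for i in range(len(memory[0]))
    ]
    return retrieved, weights

memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vec, w = attention_retrieve([2.0, 0.0], memory)
```

Because the softmax is differentiable, this retrieval step can be trained end-to-end with the rest of the network.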
implicit knowledge requires extra processing, such as transforming domain-specific data into latent vectors, making it less practical. Despite the limited work on augmenting LLMs with implicit knowledge, researchers are exploring its potential, including its use for storing instructional knowledge about a domain. This approach involves creating an instruction cycle that retrieves the next input prompt from implicit knowledge, parses the LLM’s output to recover variable assignments, and stores these back into the memory for retrieving the next instruction. Augmenting LLMs with this kind of instructional memory remains an emerging direction.
Seamless integration of external knowledge into LLMs is crucial, whether the knowledge is explicit or implicit. Existing methods typically concatenate retrieved knowledge to the LLM’s input or intermediate layers. However, it is important for the LLM to have the option of accepting or rejecting retrieved information, given that such information may be incomplete or conflicting.
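As a minimal sketch of such an accept-or-reject option, the snippet below gates each retrieved passage by a cosine-similarity threshold before concatenating it to the input; the threshold value, helper names, and toy embeddings are assumptions for illustration, not a published gating mechanism.

```python
import math

def cosine(a, b):
    """Cosine similarity between two (non-zero) embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def augment_prompt(query_text, query_vec, retrieved, threshold=0.5):
    """Accept a retrieved passage only if its embedding is sufficiently
    similar to the query; otherwise reject it."""
    accepted = [
        text for text, vec in retrieved if cosine(query_vec, vec) >= threshold
    ]
    if not accepted:
        # Nothing trustworthy retrieved: fall back on parametric knowledge.
        return query_text
    context = "\n".join(accepted)
    return f"Context:\n{context}\n\nQuestion: {query_text}"

out = augment_prompt(
    "What is the recommended dose?",
    [1.0, 0.0],
    [("relevant passage", [0.9, 0.1]), ("off-topic passage", [0.0, 1.0])],
)
```

More sophisticated gates could weigh agreement among multiple passages or let the model itself score retrieval relevance.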
(2) Scalability and adaptability: Designing systems capable of scaling to manage large amounts of domain-specific data and adapting to new or changing information is challenging. With rapidly expanding knowledge bases, computing pairwise knowledge similarity will become increasingly computationally infeasible.
generalize LLMs as task planners (also referred to as “API selectors” or “controllers”) that call multiple types of domain tools. Beyond generating executable commands for each tool used, these approaches focus on how to decompose a complex task into a set of concrete subtasks and how to coordinate between multiple tools. For instance, DSP [61] proposes a Draft, Sketch, and Prove framework for automated theorem proving, where (1) an LLM or oracle drafts informal proofs, described in a mixture of natural and mathematical language, from input statements, (2) another LLM generates a formal sketch from the informal proof, and (3) an off-the-shelf prover proves the open conjectures inside each formal sketch.
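A highly simplified sketch of this planner-plus-tools pattern is shown below. The tool registry, the hard-coded plan (which stands in for an LLM planner), and all names are hypothetical; a real planner would prompt an LLM to produce the subtask list.

```python
# Toy registry of domain tools; in practice each tool would wrap an
# external API, and the plan would come from an LLM, not a lookup.
TOOLS = {
    "search": lambda q: f"search results for '{q}'",
    "calculator": lambda q: str(eval(q)),  # assumed trusted input
}

def plan(task):
    """Stand-in for the LLM planner: decompose a task into
    (tool, argument) subtasks."""
    if task == "average of 3 and 5":
        return [("calculator", "(3 + 5) / 2")]
    return [("search", task)]

def execute(task):
    """Coordinate the tools: run each planned subtask in order."""
    results = []
    for tool_name, arg in plan(task):
        results.append(TOOLS[tool_name](arg))
    return results
```

The interesting research questions live inside `plan`: how to decompose reliably, recover from tool failures, and pass intermediate results between subtasks.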
Approaches generally fall into two categories: (1) Discrete Prompt involves creating task-specific natural language instructions to prompt LLMs, eliciting domain-specific knowledge from their parameter space, and (2) Continuous Prompt uses learnable vectors to prompt LLMs, eliminating the need for manually designed text instructions.
However, crafting discrete prompts for LLM domain specialization poses several open challenges:
(1) Effectiveness: Discrete instructions are often curated by domain experts or follow certain templates, and it is debatable whether the instructions used are the most effective ones. Therefore, these instructions need to be evaluated. This can be achieved through collaboration between domain experts and data scientists, who can analyze the performance of the LLMs and adjust the instructions accordingly; automatic evaluation would be even better.
(2) Scalability and adaptability: Automated ways to generate and select/combine discrete instructions without excessive human intervention are another promising direction for improving the discrete instructions of LLMs.
Task-dependent Prompt Tuning. Task-dependent prompt tuning optimizes a shared prompt for all instances within a specific task, enabling it to encapsulate information from extensive datasets comprising thousands or millions of examples.
Prompt Content Enhancement. We refer to prompt content as the embedding values of the continuous prompt; enhancements have been developed in terms of task-specific initialization and prior knowledge transfer. Pilot works have validated that, in contrast to many optimizers in general ML tasks that begin from a random distribution, the optimization of a soft prompt is significantly influenced by its initial value. For language models, word embeddings are pre-trained to be quite distinct. Consequently, a standard optimizer such as stochastic gradient descent (SGD) can only update the parameters in a limited vicinity, leading to the possibility of falling into a local minimum [1]. Therefore, a more effective initialization approach involves using embeddings of concrete task-specific words.
One of the pioneering works, WARP [43], initializes the prompt with the embedding of the special token “[MASK]”. KnowPrompt [18] designs learnable prompts as virtual type words and virtual answer words, which are initialized by the aggregated representations of concrete label words and disassembling words based on their frequency in the dataset.
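The initialization idea can be sketched as follows: each soft prompt vector starts from the mean embedding of concrete task-specific words rather than a random draw. The embedding table, the words, and the dimensions are toy assumptions, not the actual initialization of WARP or KnowPrompt.

```python
# Toy embedding table; a real system would use the LM's own embeddings.
EMBEDDINGS = {
    "disease": [0.8, 0.1],
    "symptom": [0.7, 0.2],
    "[MASK]": [0.0, 0.0],
}

def init_soft_prompt(task_words, prompt_len):
    """Initialize every soft prompt position with the mean embedding of
    concrete task-specific words instead of a random vector."""
    dim = len(next(iter(EMBEDDINGS.values())))
    mean = [
        sum(EMBEDDINGS[w][i] for w in task_words) / len(task_words)
        for i in range(dim)
    ]
    # Each position starts at the task-aware mean; training then
    # specializes the positions independently.
    return [list(mean) for _ in range(prompt_len)]

soft_prompt = init_soft_prompt(["disease", "symptom"], prompt_len=3)
```

Starting near task-relevant embeddings keeps the optimizer in a vicinity of the embedding space where meaningful prompts live.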
Instance-dependent prompt tuning, in contrast, conditionally generates prompts for individual instances, incorporating both contextual information and task instructions.
joint and adaptive representation of tasks as well as instance context. IDPG [170] proposes an additional two-layer perceptron as a prompt generator, which down- and up-projects the sentence embedding into an adaptive soft prompt. ATTEMPT [3] first trains multiple prompts on large-scale source tasks and calculates an aggregated prompt based on a sentence-wise attention network, which is then mixed with a newly initialized target-task prompt to form the final instance-dependent prompt. Jin et al. [63] assume that prompt tokens contribute differently to each instance, and thus design a look-up module to score the association of prompt tokens with instance tokens, which is then used to calculate the aggregated prompt embeddings. Bhardwaj et al. [8] generate context-aware prompts with a transformer-based sentence encoder, but further quantize the contextual prompt into a more compact representation to avoid optimization collapse. Levine et al. [73] learn the joint representation of prompt and input with a frozen T5 encoder followed by cross- and self-attention layers. Liu et al. [85] propose an instance-aware prompt that is applied to the intermediate layers of the LM. The proposed prompt generator is a simple feed-forward layer with a bottleneck architecture, which takes the embedding of the [CLS] token or the pooled embeddings of the sentence tokens.
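As a minimal sketch of the bottleneck-style instance-dependent generators above, the snippet below down-projects a sentence embedding, applies a nonlinearity, and up-projects it to a soft prompt vector; all weights and dimensions are illustrative assumptions rather than any cited model's parameters.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def prompt_generator(sentence_emb, W_down, W_up):
    """Bottleneck prompt generator: down-project the instance's sentence
    embedding, apply a nonlinearity, then up-project to a soft prompt
    vector conditioned on that instance."""
    hidden = relu(matvec(W_down, sentence_emb))
    return matvec(W_up, hidden)

# 4-dim sentence embedding -> 2-dim bottleneck -> 4-dim prompt vector.
W_down = [[0.5, 0.0, 0.0, 0.5], [0.0, 0.5, 0.5, 0.0]]
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
instance_prompt = prompt_generator([1.0, 2.0, 3.0, 4.0], W_down, W_up)
```

The bottleneck keeps the number of trainable parameters small while still letting the prompt vary per instance.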
Open Challenges. Continuous prompt tuning presents a streamlined method to utilize the broad language understanding capacity of LLMs for specific tasks across different domains. It efficiently tackles issues inherent in discrete prompt methods, such as (1) significant reliance on the prompt for LLM performance, where minor wording or template changes can greatly affect the result, (2) computational complexity in identifying the optimal natural language based prompt from a large search space, and (3) the time-consuming and labor-intensive process of manually designing instructions, particularly in expertise-required domains. However, continuous prompt tuning has its limitations.
Interpretability is often criticized as a weakness of soft prompt tuning. By discretizing the optimal continuous prompts into nearby token vectors in the LM’s vocabulary, studies such as WARP [43] have found these prompts to be non-interpretable and lacking meaningful content.
Limited access to LLMs poses a significant challenge for continuous prompt learning, especially for models of immense size (e.g., the 540B PaLM) and models accessible only through APIs. This restriction hinders gradient-based optimization of continuous embeddings. In this case, derivative-free prompt tuning, which optimizes the soft prompt without gradients from LLMs, has been widely discussed.
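A minimal sketch of derivative-free prompt tuning is shown below: the soft prompt is perturbed randomly, and a candidate is kept only if a black-box score improves, so no gradients from the LLM are needed. The quadratic score function is purely an assumption standing in for an API-only LLM evaluation; published methods typically use more sample-efficient optimizers such as evolution strategies.

```python
import random

def blackbox_score(prompt_vec):
    """Stand-in for an API-only LLM evaluation (higher is better).
    The optimum, unknown to the tuner, is the vector [1.0, -1.0]."""
    target = [1.0, -1.0]
    return -sum((p - t) ** 2 for p, t in zip(prompt_vec, target))

def derivative_free_tune(dim=2, steps=200, sigma=0.3, seed=0):
    """Hill-climb a soft prompt without gradients: perturb the prompt,
    query the black-box score, and keep the perturbation only if it
    improves -- no access to the LLM's parameters is required."""
    rng = random.Random(seed)
    prompt = [0.0] * dim
    best = blackbox_score(prompt)
    for _ in range(steps):
        candidate = [p + rng.gauss(0.0, sigma) for p in prompt]
        score = blackbox_score(candidate)
        if score > best:
            prompt, best = candidate, score
    return prompt, best

tuned_prompt, tuned_score = derivative_free_tune()
```

The cost of gradient-free search is query efficiency: each candidate requires a full (and possibly billed) API evaluation.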
Adapter-based Fine-tuning and Task-oriented Fine-tuning. (1) Adapter-based Fine-tuning: This approach, as illustrated in Figure 7 (a), employs neural adapters or modular components to enhance the LLM’s performance on domain-specific tasks without major modifications to the LLM’s inner parameters. These adapters, typically integrated into the existing LLM architecture, allow for task-specific learning while keeping the original model largely intact.
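The adapter idea can be sketched as a bottleneck module with a residual connection wrapped around a frozen hidden state; the dimensions and weights below are toy assumptions, not any published adapter's configuration.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def adapter(hidden, W_down, W_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, then
    add a residual connection, so the frozen LLM's hidden state passes
    through unchanged whenever the adapter outputs zeros."""
    out = matvec(W_up, relu(matvec(W_down, hidden)))
    return [h + o for h, o in zip(hidden, out)]

# Zero-initialized up-projection: the adapter starts as an identity map,
# so only the small adapter weights need training for the new domain.
W_down = [[0.1, 0.1, 0.1, 0.1]]
W_up = [[0.0], [0.0], [0.0], [0.0]]
h = [1.0, 2.0, 3.0, 4.0]
assert adapter(h, W_down, W_up) == h
```

During fine-tuning only `W_down` and `W_up` receive gradients; the surrounding LLM layers stay frozen, which is what makes adapters parameter-efficient.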
In this survey paper, we explore the applications of LLMs across a range of domain-specific tasks in social sciences (e.g., education, finance, law), natural sciences (e.g., biomedicine, earth science), and formal sciences (e.g., human-computer interaction, software engineering, and cyber security). To achieve domain specialization for LLMs in these diverse fields, readers can employ various techniques, such as external augmentation, instruction crafting, and knowledge update. These approaches can help tailor LLMs to specific tasks and challenges in each domain, enabling more accurate, relevant, and effective applications. Although each domain has its unique challenges and requirements, several common applications of specialized LLMs are shared across these fields:
• Advanced information extraction: They can identify entities, relationships, and events from domain-specific texts, such as recognizing genes in biomedical literature or detecting legal clauses in contracts.
• Text generation and summarization: They can generate high-quality, domain-specific content and create accurate summaries of complex domain-specific texts.
• Data-driven predictions and recommendations: They can analyze domain-specific data for forecasting and providing recommendations, like predicting financial trends or suggesting personalized medical treatment plans.
• Conversational agents and expert systems: They can be incorporated into conversational agents or expert systems for domain-specific guidance, such as virtual tutors or legal chatbots.
• Automated code generation and analysis: In software engineering, they can generate or analyze code, identify bugs, or suggest improvements based on natural language descriptions.
Finance and Law. Specializing LLMs in the financial and legal domains requires careful adaptation to the distinctive characteristics of these fields. In the financial domain [71, 87, 169, 174], models need to comprehend complex financial terminology, economic trends, and regulatory norms to accurately generate content like financial reports, investment analyses, or risk assessments. Meanwhile, the legal domain [15, 120, 151] demands understanding and generation of intricate legal language, comprehension of laws, legal codes, and court rulings, while maintaining absolute precision and a formal tone. For both domains, model specialization often involves fine-tuning with domain-specific datasets, incorporating explicit domain knowledge, and optimizing for domain-specific objectives like compliance with regulations, accuracy of information, or effectiveness of advice. However, it is crucial to maintain an ethical guardrail for these models, given the high-stakes nature of both financial and legal decisions. The specialized models also need to keep abreast of the evolving landscapes of these domains, adapting to changes in laws, regulations, or financial trends.
Human Computer Interaction and Software Engineering. Specializing LLMs in the domains of human-computer interaction (HCI) and software engineering requires a deep understanding of the terminologies, workflows, and conventions unique to these areas. In the HCI domain, an LLM may be specialized to understand and respond to user inputs more effectively, potentially improving the design and usability of interfaces by offering more natural and intuitive interaction paradigms. This involves training the model on diverse data, ranging from human conversational data to user interaction logs.
• Domain Complexity: Each domain has its unique intricacies and complexities, ranging from highly specialized vocabularies and nuanced terminology to complex knowledge structures. For instance, the legal or medical field employs language and terms that are extremely domain-specific and follow certain syntactic and structural rules. This complexity extends to the relationships between different entities and concepts within the domain. Accurately understanding and modeling this intricate domain knowledge is a significant challenge for all types of models.
• Balancing General and Domain Knowledge: An LLM, while needing to understand the specificities of a particular domain, also has to maintain its general knowledge to provide contextually appropriate responses. If a model is overly specialized, it may perform exceptionally within the targeted domain but fail to understand or generate coherent responses to prompts outside of it. Conversely, retaining too much general knowledge may dilute the domain-specific responses. Striking this balance between general and domain knowledge is a complex task.
• Explainability and Trust: As LLMs become more sophisticated, their decision-making process also becomes more opaque, raising the challenge of explainability. It is crucial for users, especially in high-stakes domains like healthcare, law, or finance, to understand how the model arrived at a certain output. Achieving this transparency can help build trust in the system. The challenge lies in the trade-off between model complexity and explainability, as increasing one often decreases the other.
• Adapting to Domain Evolution: Domains are not static; they evolve over time with the introduction of new terminologies, concepts, and trends. For example, the ongoing COVID-19 pandemic introduced a slew of new medical terms and concepts. Therefore, an LLM that is specialized for a certain domain must continuously adapt to these changes to stay relevant and effective. Designing models that can keep pace with the evolving landscape of their specialized domain is a challenging task.
• Scalability: Domain specialization often involves training or fine-tuning the LLM with domain-specific data, crafting specific prompts, or using other domain-specific resources. While this might be feasible for a few domains, scaling this process to cover a wide range of domains or to handle large, complex domains is a significant challenge. It involves not just computational resources but also the availability of domain-specific data and expertise. The challenge is to create efficient and effective methods for domain specialization that can be scaled to cover many different domains.
For instance, a medical LLM could incorporate knowledge from a medical ontology graph to better understand the relationships between various medical terms and concepts.
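As a minimal sketch of this idea, the snippet below prepends a term's neighbors from a toy ontology to the prompt; the ontology contents and helper name are illustrative assumptions, not calls into a real resource such as UMLS or SNOMED CT.

```python
# Toy medical ontology; a real system would query a curated resource
# and may also include edge types (is-a, symptom-of, biomarker-of).
ONTOLOGY = {
    "myocardial infarction": ["heart attack", "chest pain", "troponin"],
    "hypertension": ["high blood pressure", "systolic", "diastolic"],
}

def augment_with_ontology(question, term):
    """Prepend a term's ontology neighbors to the prompt so the LLM can
    reason over relationships it may not reliably encode itself."""
    related = ONTOLOGY.get(term, [])
    if not related:
        return question  # no ontology coverage: leave the prompt as-is
    facts = f"'{term}' is related to: {', '.join(related)}."
    return f"{facts}\nQuestion: {question}"

out = augment_with_ontology(
    "Which biomarker confirms the diagnosis?", "myocardial infarction"
)
```

Structured knowledge injected this way stays current as the ontology is updated, without retraining the model.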