Symbol-LLM: Towards Foundational Symbol-centric Interface For Large Language Models
Injecting a collection of symbolic data directly into the training of LLMs can be problematic, as it disregards the synergies among different symbolic families and overlooks the need for a balanced mixture of natural and symbolic data. In this work, we tackle these challenges from both a data and a framework perspective and introduce the Symbol-LLM series of models. First, we curate a data collection consisting of 34 tasks and incorporating approximately 20 distinct symbolic families, aiming to capture the interrelations among symbols and foster synergies between them. Then, a two-stage tuning framework injects symbolic knowledge without loss of general language ability. Extensive experiments on both symbol- and NL-centric tasks demonstrate the balanced and superior performance of the Symbol-LLM series models.
Nevertheless, a substantial amount of abstract knowledge, notably in areas such as molecular formulas (e.g., C6H12O6) and first-order logic (e.g., IsTriangle(X) → SumOfAngles(X, 180°)), is more effectively represented in symbolic forms than in NL.
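As a minimal illustration (the NL paraphrase below is ours, and the explicit universal quantifier is an assumption added for completeness), the triangle fact is far more compact in first-order logic than in prose:

```latex
% NL: "For every shape x, if x is a triangle, then the interior
%      angles of x sum to 180 degrees."
% FOL rendering of the paper's example:
\forall x \, \bigl( \mathrm{IsTriangle}(x) \rightarrow \mathrm{SumOfAngles}(x, 180^{\circ}) \bigr)
```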
Deploying LLMs directly via a symbol-centric interface poses a significant challenge. This is largely attributable to the fact that LLMs are trained via large-scale unsupervised pre-training on extensive general text corpora, which inherently lack a symbolic foundation. The most straightforward approach to incorporating symbolic knowledge into LLMs is fine-tuning (Yang et al., 2023; Xu et al., 2023b). However, the format of symbolic data diverges significantly from that seen during pre-training. Consequently, merely fine-tuning on large heterogeneous data can lead to catastrophic forgetting (Kirkpatrick et al., 2017).
query(Paris, nwr(hotel)) in API calls. Upon this observation, we conduct a comprehensive collection of 34 text-to-symbol generation tasks covering ~20 standard symbolic forms, presented in instruction-tuning format. The symbolic data come from three sources: (1) 88.3% of the data is collected from existing benchmarks. (2) 5.8% is prompted from LLMs: to compensate for the natural absence of symbolic representations in some NL-centric tasks, prompting powerful LLMs can generate novel text-to-symbol pairs. (3) 5.9% is generated with the proposed Symbol-evol strategy, which replaces symbolic definitions to prevent the model from memorizing specific symbols. These sources are uniformly leveraged to capture the underlying connections between symbols from the data view.
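The implementation of Symbol-evol is not spelled out here; the sketch below is one plausible reading, in which every symbol name in a text-to-symbol pair is consistently swapped for a freshly generated identifier. All function names and the renaming scheme are assumptions for illustration, not the authors' code:

```python
import random
import re
import string

def symbol_evol(instruction: str, symbolic_output: str,
                symbols: list[str]) -> tuple[str, str]:
    """Hypothetical sketch of the Symbol-evol idea: consistently rename the
    symbolic definitions in a text-to-symbol pair, so the model must follow
    the in-context definitions rather than memorize fixed symbol names."""
    mapping = {}
    for sym in symbols:
        # Draw a fresh random identifier, e.g. 'func_qz' (naming scheme assumed).
        mapping[sym] = "func_" + "".join(random.choices(string.ascii_lowercase, k=2))
    for old, new in mapping.items():
        # Replace whole-word occurrences in both the instruction (which states
        # the symbol definitions) and the target symbolic expression.
        pattern = re.compile(rf"\b{re.escape(old)}\b")
        instruction = pattern.sub(new, instruction)
        symbolic_output = pattern.sub(new, symbolic_output)
    return instruction, symbolic_output

# Example usage on an API-call-style sample (symbol names are illustrative):
ins, out = symbol_evol(
    "Use query(city, nwr(amenity)) to search. Find hotels in Paris.",
    "query(Paris, nwr(hotel))",
    symbols=["query", "nwr"],
)
```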
From the framework perspective, we apply a two-stage continual tuning framework comprising the Injection Stage and the Infusion Stage. The Injection Stage prioritizes exploiting the inherent connections between different symbols, enabling the model to thoroughly learn a wide range of symbolic knowledge. Tuning LLaMA-2-Chat models on all collected symbolic data yields the Symbol-LLM-Base variants. The Infusion Stage focuses on balancing the model's dual capabilities by utilizing both symbolic data and general instruction tuning. Combining the general instruction-tuning data with sampled symbolic data and tuning on top of Symbol-LLM-Base yields Symbol-LLM-Instruct. Finally, the Symbol-LLM series models are extensively evaluated on both symbol-centric and NL-centric tasks and are shown to exhibit substantial superiority.
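A minimal sketch of the Infusion-Stage data mixing, assuming a simple ratio-based subsample of the symbolic data (the actual sampling ratio and procedure are assumptions here, and the training calls in the comments are placeholders rather than the authors' pipeline):

```python
import random

def build_infusion_mixture(general_data: list, symbolic_data: list,
                           symbolic_ratio: float = 0.25) -> list:
    """Combine the full general instruction-tuning set with a subsample of the
    symbolic data, so the Infusion Stage preserves NL ability while retaining
    the symbolic knowledge injected in the first stage."""
    n_symbolic = int(len(general_data) * symbolic_ratio)
    sampled = random.sample(symbolic_data, min(n_symbolic, len(symbolic_data)))
    mixture = general_data + sampled
    random.shuffle(mixture)
    return mixture

# Two-stage schedule:
# Stage 1 (Injection): fine-tune LLaMA-2-Chat on all symbolic data
#                      -> Symbol-LLM-Base
# Stage 2 (Infusion):  fine-tune Symbol-LLM-Base on
#                      build_infusion_mixture(general_data, symbolic_data)
#                      -> Symbol-LLM-Instruct
```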
As widely recognized, 7B or 13B LLMs are still insufficient for building excellent language agents, especially when complex interaction is involved. Thus, scaling to larger model sizes warrants further exploration.