Building Decision Making Models Through Language Model Regime

Paper · arXiv 2408.06087 · Published August 12, 2024
Decision Support · Domain Specialization

LLMs demonstrate remarkable success in generalizing across varied language tasks, inspiring a new strategy for training decision making models. Our approach, referred to as "Learning then Using" (LTU), entails a two-stage process. First, the learning phase develops a robust foundational decision making model by integrating diverse knowledge from various domains and decision making contexts. The subsequent using phase refines this foundation model for specific decision making scenarios. Distinct from other studies that employ LLMs for decision making through supervised learning alone, our LTU method embraces a versatile training methodology that combines broad pre-training with targeted fine-tuning. Experiments in e-commerce domains such as advertising and search optimization show that the LTU approach outperforms traditional supervised learning regimes in both decision making capability and generalization. LTU is the first practical training framework that combines LLMs with both single-step and multi-step decision making tasks, and it can be applied beyond the game and robotics domains.

The conventional method for aligning language models with downstream tasks, such as decision making, is supervised fine-tuning (SFT). We introduce a training paradigm named Learning then Using (LTU), with separate learning and using phases.

Learning in LTU. Our approach begins with a learning phase that applies continued pre-training to large language models (LLMs). Continued pre-training (CT) is widely used to add abilities in new domains or languages to already well-trained LLMs (Cui et al., 2023; Rozière et al., 2023). The basic idea of the LTU method is that LLMs can learn inherent patterns and statistical correlations from decision making knowledge formulated as (s, a, r) triples. Through continued pre-training, we integrate this collective decision making intelligence into an LLM, transforming it into a comprehensive foundation decision making model suitable for various downstream tasks. We construct the data in the (s, a, r) format described in Section 3.2 and use it for auto-regressive training following Eq. 1 on a base LLM. After the learning phase, we obtain our foundation decision making model.

Using in LTU. The using phase is a classic supervised fine-tuning phase. SFT has emerged as a powerful and effective technique for adapting pre-trained LLMs to specific downstream tasks through supervised learning. In this phase, we take the foundation decision making model trained in the learning phase and train it to predict P(r | s, a) in order to solve a given decision making task.
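The two phases above differ mainly in how an (s, a, r) record is turned into a training example. The excerpt does not give the concrete prompt template, so the serialization below is a minimal illustrative sketch, not the paper's actual format: the learning phase trains auto-regressively on the full serialized triple, while the using phase conditions on s and a and treats only r as the prediction target (so the model learns P(r | s, a)).

```python
def format_ct_example(state: str, action: str, reward: str) -> str:
    """Serialize an (s, a, r) triple into a single sequence for the
    learning phase (continued pre-training). The whole string is used
    as an auto-regressive target. The template is a hypothetical
    stand-in for the format described in Section 3.2."""
    return f"State: {state}\nAction: {action}\nReward: {reward}"

def format_sft_example(state: str, action: str, reward: str) -> tuple[str, str]:
    """Split the same triple into (prompt, target) for the using phase
    (SFT): the prompt conditions on s and a, and only the reward tokens
    would contribute to the loss, so the model learns P(r | s, a)."""
    prompt = f"State: {state}\nAction: {action}\nReward:"
    target = f" {reward}"
    return prompt, target

# Hypothetical advertising record from an e-commerce scenario
ct_text = format_ct_example("user clicked 3 ads today", "raise bid by 5%", "CTR +0.8%")
prompt, target = format_sft_example("user clicked 3 ads today", "raise bid by 5%", "CTR +0.8%")
```

Note that the SFT prompt plus its target reconstructs exactly the CT sequence; the only difference between the phases is which tokens are supervised, which is what lets the using phase reuse the foundation model unchanged.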