Think before you speak: Training Language Models With Pause Tokens

Paper · arXiv 2310.02226 · Published October 3, 2023
Tags: Training · Fine-Tuning · Reasoning by Reflection

Transformer-based causal language models generate tokens one after the other in immediate succession. To generate the (K + 1)th token, the model consumes the K previous tokens, and proceeds layer by layer, computing K intermediate vectors in each hidden layer. Each vector in itself is the output of a module (consisting of self-attention and multi-layer-perceptrons) operating on the previous layer’s output vectors. However sophisticated this end-to-end process may be, it abides by a peculiar constraint: the number of operations determining the next token is limited by the number of tokens seen so far. Arguably, this was the most natural design choice when the Transformer was first conceived by Vaswani et al. (2017). But in hindsight, one may wonder whether for some inputs, the (K + 1)th token demands K + M Transformer operations in each layer (for M > 0), which cannot be met by the arbitrarily constrained K operations per layer. This paper explores one way to free the Transformer of this arbitrary per-layer computational constraint.
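
As a concrete (if simplified) illustration that is not taken from the paper, the following PyTorch sketch uses a single Transformer block as a stand-in for one layer: the number of hidden vectors a layer computes equals the number of input tokens, so appending M dummy tokens yields K + M vectors, and hence K + M attention/MLP operations, per layer.

```python
import torch
from torch import nn

d_model, n_heads = 64, 4
# One Transformer block stands in for a single layer of the model
# (shapes only; causal masking is irrelevant for counting operations).
layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

K, M = 10, 5
x = torch.randn(1, K, d_model)             # K tokens seen so far
x_paused = torch.randn(1, K + M, d_model)  # same input with M dummy embeddings appended

print(layer(x).shape)         # torch.Size([1, 10, 64]) -> K vectors computed in this layer
print(layer(x_paused).shape)  # torch.Size([1, 15, 64]) -> K + M vectors computed in this layer
```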

The approach we study is to append dummy tokens to a decoder-only model’s input, thereby delaying the model’s output. Specifically, we select a (learnable) pause token (denoted <pause>) and append a sequence of one or more copies of <pause> to the input. We simply ignore the model’s corresponding outputs until the last <pause> token is seen, after which we begin extracting its response.
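
The inference-time mechanics can be sketched as follows, assuming a Hugging Face-style tokenizer and a decoder-only causal LM; the model name, the token string "<pause>", and the helper function are illustrative placeholders rather than the paper's code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register <pause> as a new (learnable) token and grow the embedding matrix.
tokenizer.add_special_tokens({"additional_special_tokens": ["<pause>"]})
model.resize_token_embeddings(len(tokenizer))
pause_id = tokenizer.convert_tokens_to_ids("<pause>")

def generate_with_pauses(prompt, num_pauses=10, max_new_tokens=50):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Append M copies of <pause>; the model's outputs at these positions
    # are ignored, and decoding begins only after the last <pause>.
    ids = torch.cat([ids, torch.full((1, num_pauses), pause_id)], dim=1)
    out = model.generate(ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
```

Note that merely injecting such delays at inference with an off-the-shelf model is not expected to help on its own; as discussed below, the delays must also be present during training.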

Crucially, we consider injecting such delays not just at inference, but also during downstream finetuning (see Fig 1) and pretraining (see Fig 2, which provides additional technical details).
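
A minimal sketch of what pause-finetuning could look like under the setup above: M copies of <pause> are appended between the prefix and the target, and the loss at the prefix and pause positions is masked out (using the standard -100 ignore index), so the model is supervised only on the answer it produces after the delay. The function and variable names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def pause_finetuning_loss(model, prefix_ids, answer_ids, pause_id, num_pauses=10):
    pauses = torch.full((prefix_ids.shape[0], num_pauses), pause_id)
    input_ids = torch.cat([prefix_ids, pauses, answer_ids], dim=1)

    labels = input_ids.clone()
    # Ignore the loss on the prefix and on every <pause> position:
    # only the answer tokens produced after the delay are supervised.
    labels[:, : prefix_ids.shape[1] + num_pauses] = -100

    logits = model(input_ids).logits
    # Standard causal shift: the output at position t predicts token t + 1,
    # so the output at the last <pause> predicts the first answer token.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.shape[-1]),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```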

A priori, it is unclear what this simple change would bring about in practice. Optimistically, the Transformer may take advantage of a “wider” computational pathway induced by the delay. A more mundane outcome, though, would be that the model simply skips over any delays introduced by the <pause> tokens. After all, the <pause> tokens provide no additional information during inference, nor are there sufficiently many new parameters (barring the few embedding parameters of the single <pause> token) that can encode any additional information from the training data. Worse still, these uninformative tokens may drown out informative signals and hurt the model.

Partial answers to this question can be found in the literature, albeit with somewhat different motivations. To understand where the benefits of chain-of-thought (Wei et al., 2022) come from, Lanham et al. (2023) append dummy thoughts in the form of periods (‘...’), but only during inference. This, they report, does not help. Presumably, an off-the-shelf model may not have learned to utilize the new computational pathways offered by the inference-time delay. Burtsev et al. (2020) learn with prepended dummy tokens, with the orthogonal motivation of adding memory (rather than extending computation). They train with these tokens only on the target task and observe minimal performance gains.