Chain-of-thought Reasoning Is A Policy Improvement Operator
“A major challenge that has prevented past efforts at self-learning in language models from succeeding, especially in arithmetic, is a phenomenon that we call error avalanching. During self-training, when all training data is generated by the model itself, there is no guarantee that the data is correct. Error avalanching occurs when small errors or inaccuracies in a model’s output compound rapidly over time, because each iteration of the model learns from the outputs of the previous model and thereby amplifies the existing mistakes. If left unchecked, error avalanching leads to severe degradation of performance within only a few iterations of self-training (see Figure 4). This is consistent with past attempts to get language models to self-learn (for addition as well as other tasks) in which improvement stagnates in at most a few steps (Zelikman et al., 2022; Lewkowycz et al., 2022; Bai et al., 2022; Huang et al., 2022; Jung et al., 2023). While error avalanching is a fundamental issue in any bootstrapped process, SECToR largely mitigates it via several forms of self-consistency checks (Figures 4 and 5) that minimize the number of mistakes introduced into the dataset. Nevertheless, SECToR does not continue ad infinitum, and training eventually terminates due to accumulated errors. We return to this issue in the discussion.”
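The dynamic described above can be illustrated with a toy simulation. This is only a sketch under assumed parameters (the `amplification` factor and `filter_rate` below are hypothetical, not values from the paper): each self-training round amplifies the current error rate, while a self-consistency check prunes some fraction of the erroneous examples before they enter the dataset.

```python
def self_train(iterations, base_error=0.01, amplification=2.0, filter_rate=0.0):
    """Toy model of error avalanching in bootstrapped self-training.

    Each round, the model trains on its own outputs, so surviving errors
    are amplified by a fixed factor. A self-consistency check (as in
    SECToR) removes a fraction `filter_rate` of erroneous examples before
    training. All parameter values here are illustrative assumptions.
    """
    error = base_error
    history = [error]
    for _ in range(iterations):
        surviving_errors = error * (1.0 - filter_rate)  # errors that pass the check
        error = min(1.0, surviving_errors * amplification)  # compounding, capped at 100%
        history.append(error)
    return history

# Without checks, a 1% error rate doubles every round and saturates quickly.
unchecked = self_train(8)
# With checks pruning 60% of errors, the effective growth factor drops below 1.
checked = self_train(8, filter_rate=0.6)
print([round(e, 4) for e in unchecked])
print([round(e, 4) for e in checked])
```

In this sketch, self-training collapses within eight rounds unless the consistency filter pushes the per-round error growth factor below one, mirroring the qualitative behavior the excerpt attributes to Figure 4.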