Self-Questioning Language Models
Can large language models improve without external data, simply by generating their own questions and answers? We hypothesize that a pre-trained language model can improve its reasoning skills given only a single prompt that specifies the topic (e.g., algebra word problems) and asks the model to generate its own questions. To this end, we propose Self-Questioning Language Models (SQLM): an asymmetric self-play framework in which a proposer is given the topic and generates a question for a solver, who tries to answer it. Both the proposer and the solver are trained via reinforcement learning. The proposer receives a reward if the problem is neither too easy nor too difficult, and the solver receives a reward based on majority voting, a proxy for correctness in the absence of ground-truth answers. For coding problems, the proposer can instead generate unit tests, which are used for verification.
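As a concrete illustration of this reward design, the following minimal Python sketch scores solver samples against their own majority answer and rewards the proposer only when the solver's success rate falls in an intermediate band. The helper names and the low/high thresholds are illustrative assumptions, not values from the paper.

    from collections import Counter

    def solver_rewards(answers):
        # Majority-vote reward: the most common final answer serves as a
        # pseudo-label; each sample is rewarded for agreeing with it.
        majority, _ = Counter(answers).most_common(1)[0]
        return [1.0 if a == majority else 0.0 for a in answers]

    def proposer_reward(rewards, low=0.2, high=0.8):
        # Band reward: the proposer scores only if the question is neither
        # too easy nor too hard for the current solver. The thresholds are
        # illustrative, not taken from the paper.
        rate = sum(rewards) / len(rewards)
        return 1.0 if low < rate < high else 0.0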
Exploration is a foundational challenge in reinforcement learning. To address it, many methods encourage agents to discover novel states through intrinsic rewards. One prominent class is based on prediction error, measuring novelty by the agent's surprise at its own predictions, whether those come from an inverse dynamics model (ICM; Pathak et al., 2017) or a randomly initialized network (RND; Burda et al., 2018). Other techniques optimize state entropy to promote diverse state visitation (Liu & Abbeel, 2021). Go-Explore (Ecoffet et al., 2019) separates exploration from robustification, enabling agents to return to promising states and expand from there.
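To make the prediction-error idea concrete, here is a minimal PyTorch sketch in the spirit of RND; the network sizes are arbitrary and the interface is an assumption, not code from the cited work.

    import torch.nn as nn

    class RND(nn.Module):
        # Random Network Distillation: novelty is the predictor's error
        # against a fixed, randomly initialized target network.
        def __init__(self, obs_dim, feat_dim=64):
            super().__init__()
            self.target = nn.Sequential(
                nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
            self.predictor = nn.Sequential(
                nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
            for p in self.target.parameters():
                p.requires_grad = False  # the target is never trained

        def intrinsic_reward(self, obs):
            # States the predictor has rarely been fit on yield large
            # errors, and hence large exploration bonuses.
            return (self.predictor(obs) - self.target(obs)).pow(2).mean(-1)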
Asymmetric self-play (OpenAI et al., 2021), first proposed for goal-conditioned robotic manipulation, is a self-supervised exploration method that naturally produces a curriculum of interesting tasks for the agent to learn from. It trains two RL agents: a proposer P, who aims to propose challenging tasks, and a solver S, who aims to solve them. The proposer receives a reward if the solver fails to solve the task, and the solver receives a reward for solving its assigned task. We refer the reader to OpenAI et al. (2021) for more details on the original application of asymmetric self-play to robotics.
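In its simplest form, this opposing reward assignment can be sketched as follows. Note that this is the zero-sum version described above; SQLM replaces the proposer side with the difficulty-band reward sketched earlier.

    def asymmetric_self_play_rewards(solver_succeeded):
        # Opposing objectives: the proposer is paid when the solver fails,
        # the solver when it succeeds. This pushes proposed tasks toward
        # the frontier of what the solver can currently do.
        solver_r = 1.0 if solver_succeeded else 0.0
        return 1.0 - solver_r, solver_r  # (proposer reward, solver reward)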
In our language modeling setting, we consider a proposer policy $\pi_{P_t}(x)$ and a solver policy $\pi_S(y_{\text{pred}} \mid x)$, where $P_t$ is simply a proposer constrained to a specific topic $t$ (e.g., arithmetic), $x$ is a generated question, and $y_{\text{pred}}$ is an attempt at solving the question. Both $P$ and $S$ are language models and are trained via reinforcement learning, as described in the next section.
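Putting the pieces together, one self-play round might look like the sketch below, where proposer_lm.generate, solver_lm.generate, and parse_answer are hypothetical helpers standing in for LM sampling and answer extraction, and the reward functions are the ones sketched after the abstract.

    def sqlm_round(proposer_lm, solver_lm, topic_prompt, k=8):
        # x ~ pi_{P_t}(x): the proposer writes a question about the topic.
        question = proposer_lm.generate(topic_prompt)
        # y_pred ~ pi_S(y_pred | x): the solver attempts it k times.
        attempts = [solver_lm.generate(question) for _ in range(k)]
        s_rewards = solver_rewards([parse_answer(a) for a in attempts])
        p_reward = proposer_reward(s_rewards)
        # Both policies would then be updated with RL on these rewards.
        return question, attempts, s_rewards, p_reward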