Learning to Learn from Language Feedback with Social Meta-Learning
Large language models (LLMs) often struggle to learn from corrective feedback within a conversational context. They are rarely proactive in soliciting this feedback, even when faced with ambiguity, which can make their dialogues feel static, one-sided, and lacking the adaptive qualities of human conversation. To address these limitations, we draw inspiration from social meta-learning (SML) in humans — the process of learning how to learn from others. We formulate SML as a finetuning methodology, training LLMs to solicit and learn from language feedback in simulated pedagogical dialogues, where static tasks are converted into interactive social learning problems. SML effectively teaches models to use conversation to solve problems they are unable to solve in a single turn. This capability generalises across domains; SML on math problems produces models that better use feedback to solve coding problems and vice versa. Furthermore, despite being trained only on fully-specified problems, these models are better able to solve underspecified tasks where critical information is revealed over multiple turns. When faced with this ambiguity, SML-trained models make fewer premature answer attempts and are more likely to ask for the information they need. This work presents a scalable approach to developing AI systems that effectively learn from language feedback.
In children, the ability to learn via social interaction is itself thought to be learned through a process referred to as social meta-learning (SML) (Allen and Ilgaz, 2017). By adulthood, humans are generally adept social learners, using others as a learning resource and effectively engaging in collaborative problem solving. Meanwhile, current approaches to LLM post-training tend to focus on single-turn reasoning and static task performance, limiting opportunities for learning to learn from contextual feedback and possibly diminishing the conversational adaptation abilities already present within the pre-trained base model (Shaikh et al., 2023; Wang et al., 2024b). As a result, interactions with LLMs can be brittle in ways that are unfamiliar from interactions with other humans, placing a significant burden of initial prompt engineering on everyday users.
To address this problem, we formulate SML as a finetuning methodology for LLMs. We achieve this by converting static tasks, such as math problems, into interactive, pedagogical dialogues. Given an initial problem statement, a "student" model attempts to generate the solution over the course of a conversation while a "teacher" model provides guidance. The student is the model being trained; the teacher can be either a frozen instance of the same model or a stronger model. Crucially, the teacher is provided with privileged information, such as the correct final answer or access to the outputs of a verifier. This creates an information asymmetry, which ensures that the teacher can provide valuable, corrective feedback and makes problems that are significantly beyond the student's single-turn capabilities tractable through interaction. It also incentivises the student to be proactive in extracting relevant information from the teacher, analogous to in-context exploration in partially observable sequential decision-making problems.
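The dialogue setup described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `run_dialogue`, `toy_student`, `toy_teacher`, the exact-match success check standing in for a verifier, and the number-guessing task standing in for a math problem. None of it is taken from our implementation; the point is only that the teacher sees the answer while the student sees nothing but the problem and the conversation so far.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)


@dataclass
class Dialogue:
    problem: str
    turns: List[Turn] = field(default_factory=list)
    solved: bool = False


def run_dialogue(
    problem: str,
    answer: str,
    student: Callable[[str, List[Turn]], str],
    teacher: Callable[[str, str, List[Turn]], str],
    max_turns: int = 4,
) -> Dialogue:
    """Alternate student attempts and teacher feedback (hypothetical helper).

    Only the teacher receives `answer` (the privileged information).
    Exact string match is an assumed stand-in for a verifier.
    """
    dialogue = Dialogue(problem)
    for _ in range(max_turns):
        attempt = student(problem, dialogue.turns)
        dialogue.turns.append(("student", attempt))
        if attempt.strip() == answer:
            dialogue.solved = True
            break
        feedback = teacher(problem, answer, dialogue.turns)
        dialogue.turns.append(("teacher", feedback))
    return dialogue


def toy_teacher(problem: str, answer: str, turns: List[Turn]) -> str:
    # The teacher compares the student's last attempt with the private answer.
    last_attempt = int(turns[-1][1])
    return "higher" if last_attempt < int(answer) else "lower"


def toy_student(problem: str, turns: List[Turn]) -> str:
    # The student replays the feedback so far to narrow its search interval.
    lo, hi = 0, 16
    for i in range(0, len(turns) - 1, 2):
        guess, feedback = int(turns[i][1]), turns[i + 1][1]
        if feedback == "higher":
            lo = guess + 1
        else:
            hi = guess - 1
    return str((lo + hi) // 2)


if __name__ == "__main__":
    result = run_dialogue("Guess the number in [0, 16].", "11", toy_student, toy_teacher)
    print(result.solved, len(result.turns))
```

In this toy task the student cannot find the answer in a single turn, but reaches it within four attempts once it uses the teacher's feedback, which mirrors the sense in which interaction makes otherwise intractable problems solvable.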
We explore two approaches to SML: an offline method, in which we gather a filtered dataset of successful dialogues and perform supervised finetuning (SFT), and online RL using binary, conversation-level rewards. We find that online RL yields much greater improvements in the ability to learn from language feedback at test time. Moreover, this ability generalises to conversations longer than those used for training, and it transfers across domains: SML on math problems leads to improved performance in learning from language feedback on coding tasks.
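As a rough sketch of how the two approaches relate, the snippet below defines a binary, conversation-level reward and reuses it to filter dialogues for the offline variant. The function names, the dictionary keys `final_answer` and `reference`, and the exact-match criterion are all hypothetical choices for illustration; in practice the success check could equally be a verifier.

```python
from typing import Dict, List


def conversation_reward(final_answer: str, reference: str) -> float:
    """Binary reward for a whole conversation (assumed exact-match check).

    The reward is assigned once per dialogue, not per turn.
    """
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def filter_successful(dialogues: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only dialogues that earned reward 1.0, for use as SFT data."""
    return [
        d for d in dialogues
        if conversation_reward(d["final_answer"], d["reference"]) == 1.0
    ]


if __name__ == "__main__":
    collected = [
        {"final_answer": " 42 ", "reference": "42"},
        {"final_answer": "41", "reference": "42"},
    ]
    print(len(filter_successful(collected)))
```

The same scalar serves both settings: offline, it acts as a filter on collected dialogues; online, it would be the return passed to the RL algorithm for each sampled conversation.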
Crucially for human-AI interaction, this training paradigm also enhances a model's ability to navigate ambiguity. Although our SML setup involves training exclusively on problems that are fully specified from the first turn, the finetuned model achieves superior performance on tasks where critical information is revealed over multiple conversational turns. A desirable behaviour in this setting is to make fewer premature answer attempts and instead ask for the necessary information. While this behaviour does become more frequent with SML alone, we find that it can be significantly enhanced through a two-stage training process. We introduce Q-priming, a preliminary SFT stage in which we train the model on dialogues where we have explicitly prompted it to ask questions. To generate examples of useful questions, we again leverage the information asymmetry of our setup, providing the model with the teacher's private knowledge (e.g., the ground truth solution) and asking it to formulate an informative query.
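One way the question-generation step for Q-priming might look is sketched below. The prompt wording, the helper names, and the `generate` callable (a stand-in for a call to the model) are hypothetical. The sketch also assumes that the private knowledge is used only when generating the question and is left out of the resulting SFT example; this is an illustrative design choice, not a detail stated above.

```python
from typing import Callable, Dict


def build_question_prompt(problem: str, private_info: str) -> str:
    """Prompt that exposes the teacher's private knowledge (hypothetical wording)."""
    return (
        "Problem:\n" + problem + "\n\n"
        "Private reference (known only to the teacher):\n" + private_info + "\n\n"
        "Write one question a student could ask the teacher that would "
        "help them make progress on this problem."
    )


def make_q_priming_example(
    problem: str,
    private_info: str,
    generate: Callable[[str], str],
) -> Dict[str, str]:
    """Build one SFT example whose target is an informative question.

    Assumption: the private information appears in the generation prompt
    but not in the training context the student will see.
    """
    question = generate(build_question_prompt(problem, private_info))
    return {"context": problem, "target": question}


if __name__ == "__main__":
    # Stub generator standing in for a real model call.
    stub = lambda prompt: "Should I start by factoring the quadratic?"
    example = make_q_priming_example("Solve x^2 - 5x + 6 = 0.", "x = 2 or x = 3", stub)
    print(example)
```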
In this work, we introduce a finetuning methodology, inspired by the phenomenon of social meta-learning (SML) in humans, that trains LLMs to learn how to learn from others in conversation. By converting static problems into interactive pedagogical dialogues, we show that models can be taught to use language feedback to solve problems they are unable to solve in a single turn. Our results demonstrate that this learned ability is highly generalisable: SML on math problems improves a model's capacity to learn from feedback on coding tasks and to handle ambiguity in underspecified problems. Furthermore, by introducing a Q-priming stage, we can elicit exploration within conversation, encouraging behaviours such as asking clarifying questions instead of making premature answer attempts. This work presents a scalable path toward developing more collaborative and human-compatible AI by reframing static tasks as interactive learning opportunities.