Self-Refinement and Self-Consistency
Related topics:
- A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems. To address this limitation, recent research has explored agent evolution techniques that aim to automatically enhance agent systems based on interaction data and environmental feedback. This emerging …
- A Survey of Calibration Process for Black-Box LLMs. In recent years, Confidence Estimation and Calibration have frequently been discussed together, as the estimation of confidence is often influenced by the uncertainty in the model or data, and calibra…
- A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence. Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledg…
- A Survey on Knowledge Distillation of Large Language Models. In the era of Large Language Models (LLMs), Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT-4, to their o…
- A Survey on LLM Inference-Time Self-Improvement. Techniques that enhance inference through increased computation at test-time have recently gained attention. In this survey, we investigate the current state of LLM Inference-Time Self-Improvement fr…
- A Survey on Post-training of Large Language Models. The emergence of Large Language Models (LLMs) has fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific expl…
- Absolute Zero: Reinforced Self-play Reasoning with Zero Data. Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR wo…
- An Emulator for Fine-Tuning Large Language Models using Small Language Models. Widely used language models (LMs) are typically built by scaling up a two-stage training pipeline: a pretraining stage that uses a very large, diverse dataset of text and a fine-tuning (sometimes, ‘al…
- Augmenting Autotelic Agents with Large Language Models. Humans learn to master open-ended repertoires of skills by imagining and practicing their own goals. This autotelic learning process, literally the pursuit of self-generated (auto) goals (telos), beco…
- Automated Alignment Researchers: Using large language models to scale scalable oversight. Large language models’ ever-accelerating rate of improvement raises two particularly important questions for alignment research. One is how alignment can keep up. Frontier AI models are now contribut…
- Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies. However, their efficacy is undermined by undesired and inconsistent behaviors, including hallucination, unfaithful reasoning, and toxic content. A promising approach to rectify these flaws is self-cor…
- Autotelic Agents with Intrinsically Motivated Goal-Conditioned Reinforcement Learning: a Short Survey. Building autonomous machines that can explore open-ended environments, discover possible interactions and build repertoires of skills is a general objective of artificial intelligence. Developmental a…
- Boundless Socratic Learning with Language Games. An agent trained within a closed system can master any desired capability, as long as the following three conditions hold: (a) it receives sufficiently informative and aligned feedback, (b) its covera…
- Branch-Solve-Merge Improves Large Language Model Evaluation and Generation. Large Language Models (LLMs) are frequently used for multi-faceted language generation and evaluation tasks that involve satisfying intricate user constraints or taking into account multiple aspects a…
- Bridging Offline and Online Reinforcement Learning for LLMs. We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and n…
- CURIOUS: Intrinsically Motivated Modular Multi-Goal Reinforcement Learning. In open-ended environments, autonomous learning agents must set their own goals and build their own curriculum through an intrinsically motivated exploration. They may consider a large diversity of go…
- Can Large Reasoning Models Self-Train? Scaling the performance of large language models (LLMs) increasingly depends on methods that reduce reliance on human supervision. Reinforcement learning from automated verification offers an alternat…
- Chain-of-Thought Reasoning Without Prompting. In enhancing the reasoning capabilities of large language models (LLMs), prior research primarily focuses on specific prompting techniques such as few-shot or zero-shot chain-of-thought (CoT) promptin…
- Chain-of-Verification Reduces Hallucination in Large Language Models. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method whereby the model first (i) draf… (a sketch of CoVe appears after this list)
- CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks. We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on the given seed tasks, and then to generate a new synth…
- DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research. Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinfor…
- Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models. Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem. Using spa…
- End-to-End Test-Time Training for Long Context. On the other hand, Transformers with self-attention still struggle to efficiently process long context equivalent to years of human experience, in part because they are designed for nearly lossless re…
- Evaluating Large Language Models at Evaluating Instruction Following. As research in large language models (LLMs) continues to accelerate, LLM-based evaluation has emerged as a scalable and cost-effective alternative to human evaluations for comparing the ever increasin…
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets. Evaluating the alignment of LLMs to human values is challenging for two reasons. First, open-ended user instructions usually require a composition of multiple abilities, which makes measurement with a…
- Fine-grained Hallucination Detection and Editing for Language Models. Several recent works study automatic hallucination detection (Min et al., 2023) or editing outputs (Gao et al., 2022) to address such LM hallucinations. These systems typically categorize hallucinati…
- Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning. Recent work has leveraged Large Language Models’ (LLM) abilities to capture abstract knowledge about the world’s physics to solve decision-making problems. Yet, the alignment between LLMs’ knowledge and the environment can be wron…
- How Should We Meta-Learn Reinforcement Learning Algorithms? Meta-learning shows particular promise for reinforcement learning (RL), where algorithms are often adapted from supervised or unsupervised learning despite their suboptimality for RL. However, until n…
- How to Correctly do Semantic Backpropagation on Language-based Agentic Systems. Due to the strength of Large Language Models (LLMs) in doing a wide array of tasks, agentic systems typically have most of their key components rely on querying LLMs. This results in communication bet…
- Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards. We improve the effectiveness of the reward model by introducing a penalty term on the reward, named contrastive rewards. Our approach involves two steps: (1) an offline sampling step to obtain respons…
- Inference-Time Scaling for Generalist Reward Modeling. Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that p…
- Inverse-Q*: Token Level Reinforcement Learning for Aligning Large Language Models Without Preference Data. In this paper, we introduce Inverse-Q*, an innovative framework that transcends traditional RL methods by optimizing token-level reinforcement learning without the need for additional reward or value …
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we…
- LLM Post-Training: A Deep Dive into Reasoning Large Language Models. Some policy gradient approaches are explained below: Policy Gradient (REINFORCE). The REINFORCE algorithm [114, 115] is a method used to improve decision-making by adjusting the model’s strategy (poli… (the standard REINFORCE gradient is written out after this list)
- Learning To Retrieve Prompts for In-Context Learning. In-context learning is a recent paradigm in natural language understanding, where a large pre-trained language model (LM) observes a test instance and a few training examples as its input, and directl…
- Learning to (Learn at Test Time): RNNs with Expressive Hidden States. Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their…
- Learning to Discover at Test Time. How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement…
- Learning to Reason for Factuality. Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning c…
- Let’s Verify Step by Step. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. In closely re…
- Looking beyond the next token. The structure of causal language model training assumes that each token can be accurately predicted from the previous context. This contrasts with humans’ natural writing and reasoning process, where …
- Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present a self-synthesis method for generating large-scale alignment data named MA…
- Mechanisms of Introspective Awareness. Recent work has shown that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept—a phenomenon termed “introspective awareness.” We i…
- Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge. Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms (Yuan et a…
- Metacognitive Retrieval-Augmented Large Language Models. Retrieval-augmented generation has become central in natural language processing due to its efficacy in generating factual content. While traditional methods employ single-time retrieval, more rece…
- Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models. Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training and test-time inference. We explore a framework where the model verifies its own outputs, filters or reweights…
- OMNI-SIMPLEMEM: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory. AI agents increasingly operate over extended time horizons, yet their ability to retain, organize, and recall multimodal experiences remains a critical bottleneck. Building effective lifelong memory r…
- On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting. Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing…
- Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback. Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration,…
- Post-Completion Learning for Language Models. Current language model training paradigms typically terminate learning upon reaching the end-of-sequence (<eos>) token, overlooking the potential learning opportunities in the post-completion space. We…
- Post-Training Large Language Models via Reinforcement Learning from Self-Feedback. Large Language Models (LLMs) often produce plausible but poorly-calibrated answers, limiting their reliability on reasoning-intensive tasks. We present Reinforcement Learning from Self-Feedback (RLSF…
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models. It remains contentious whether RL truly expands a model’s reasoning capabilities or merely amplifies high-reward outputs already latent in the base model’s distribution, and whether continually scalin…
- R-Zero: Self-Evolving Reasoning LLM from Zero Data. Self-evolving Large Language Models (LLMs) offer a scalable path toward superintelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for t…
- RARR: Researching and Revising What Language Models Say, Using Language Models. Language models (LMs) now excel at many tasks such as question answering, reasoning, and dialog. However, they sometimes generate unsupported or misleading content. A user cannot easily determine whet…
- RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs. Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abiliti…
- RL-STaR: Theoretical Analysis of Reinforcement Learning Frameworks for Self-Taught Reasoner. The reasoning abilities of large language models (LLMs) have improved with chain-of-thought (CoT) prompting, allowing models to solve complex tasks in a stepwise manner. However, training CoT capabili…
- Recursive Introspection: Teaching Language Model Agents How to Self-Improve. Even the strongest proprietary large language models (LLMs) do not quite exhibit the ability of continually improving their responses sequentially, even in scenarios where they are explicitly told tha…
- Reflexion: an autonomous agent with dynamic memory and self-reflection. Recent advancements in decision-making large language model (LLM) agents have demonstrated impressive performance across various benchmarks. However, these state-of-the-art approaches typically necess…
- Rethinking with Retrieval: Faithful Large Language Model Inference. In this paper, we present a post-processing approach called rethinking with retrieval (RR) for utilizing external knowledge in LLMs. Our method begins by using the chain-of-thought (CoT) prompting met… (a sketch of RR appears after this list)
- SAND: Boosting LLM Agents with Self-Taught Action Deliberation. Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts. Most of these methods focus on imi…
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstra…
- Self-Adapting Language Models. Given a new input, the model produces a self-edit—a generation that may restructure the information in different ways, specify optimization hyperparameters, or invoke tools for data augmentation and g…
- Self-Discover: Large Language Models Self-Compose Reasoning Structures. Table 2 of the paper lists all 39 reasoning modules, high-level cognitive heuristics for problem-solving adopted from Fernando et al. (2023); the first module reads “How could I devise an experim…”
- Self-Evaluation Guided Beam Search for Reasoning. Breaking down a problem into intermediate steps has demonstrated impressive performance in Large Language Model (LLM) reasoning. However, the growth of the reasoning chain introduces uncertainty and e…
- Self-Improving Model Steering. Model steering represents a powerful technique that dynamically aligns large language models (LLMs) with human preferences during inference. However, conventional model-steering methods rely heavily o…
- Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges. Large language models often struggle with length generalization and solving complex problem instances beyond their training distribution. We present a self-improvement approach where models iterativel…
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. “Typical alignment methods include Supervised Fine-Tuning (SFT) (Ouyang et al., 2022; Tunstall et al., 2023a) based on human demonstrations, and Reinforcement Learning from Human Feedback (RLHF) (Chri…
- Self-Questioning Language Models. Can large language models improve without external data – by generating their own questions and answers? We hypothesize that a pre-trained language model can improve its reasoning skills given only a …
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. Despite their remarkable capabilities, large language models (LLMs) often produce responses containing factual inaccuracies due to their sole reliance on the parametric knowledge they encapsulate. Ret…
- Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst. In this work, we introduce Self-Reasoning Language Model (SRLM), where the model itself can synthesize longer CoT data and iteratively improve performance through self-training. By incorporating a few…
- Self-Refine: Iterative Refinement with Self-Feedback. Motivated by how humans refine their written text, we introduce SELF-REFINE, an approach for improving initial outputs from LLMs through iterative feedback and refinement. The main idea is to generate… (a minimal sketch of this loop appears after this list)
- Self-Reflection in LLM Agents: Effects on Problem-Solving Performance. To improve their performance, we can provide them with a series of cognitive capabilities. For example, we can provide them with a CoT [1–3], access to external memory [22–25], and the ability to lear…
- Self-Taught Evaluators. Model-based evaluation is at the heart of successful model development – as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is t…
- Self-critiquing models for assisting human evaluators. We fine-tune large language models to write natural language critiques (natural language critical comments) using behavioral cloning. On a topic-based summarization task, critiques written by our mode…
- Self-distillation Enables Continual Learning. Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement le…
- SkillClaw: Let Skills Evolve Collectively with Agentic Evolver. Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage…
- Test-Time Scaling with Reflective Generative Model. We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini’s performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning traje…
- The Invisible Leash: Why RLVR May Not Escape Its Origin. Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI’s capabilities, particularly in solving complex logical…
- Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection. Self-detection for Large Language Models (LLMs) seeks to evaluate the trustworthiness of the LLM’s output by leveraging its own capabilities, thereby alleviating the issue of output hallucination. How…
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing. Despite the impressive capabilities of Large Language Models (LLMs) on various tasks, they still struggle with scenarios that involve complex reasoning and planning. Recent work proposed advanced pro…
- Training Language Models to Self-Correct via Reinforcement Learning. Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correct…
- Transformer2: Self-adaptive LLMs. Self-adaptive large language models (LLMs) aim to solve the challenges posed by traditional fine-tuning methods, which are often computationally intensive and static in their ability to handle diverse…
- Truly Self-Improving Agents Require Intrinsic Metacognitive Learning. Self-improving agents aim to continuously acquire new capabilities with minimal supervision. However, current approaches face two key limitations: their self-improvement processes are often rigid, fai…
- Understanding Before Reasoning: Enhancing Chain-of-Thought with Iterative Summarization Pre-Prompting. CoT encounters difficulties when key information required for the reasoning process is either implicit or missing. It primarily stems from the fact that CoT emphasizes the stages of reasoning, while n…
- Unsupervised Elicitation of Language Models. To steer pretrained language models for downstream tasks, today’s post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficul…
- Voyager: An Open-Ended Embodied Agent with Large Language Models. We introduce VOYAGER, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human inter…
- When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models. We set out to clarify these capabilities under a more stringent evaluation setting in which we disallow any kind of external feedback. Our findings under this setting show a split: while self-reflecti…
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning. Our results reveal a significant decline in accuracy as problem complexity grows—a phenomenon we term the “curse of complexity.” This limitation persists even with larger models and increased inferenc…
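
For the Self-Refine entry above, a minimal sketch of the generate-feedback-refine loop the abstract describes. The `llm(prompt)` function is a hypothetical stand-in for any completion API, not the paper's code; the loop structure follows the paper's description of iterative self-feedback.

```python
# Minimal sketch of the Self-Refine loop: the same model drafts an answer,
# critiques its own draft, and revises it using that critique, stopping
# after a fixed number of rounds or when the critique reports no issues.

def llm(prompt: str) -> str:
    """Hypothetical completion function; plug in any chat/completion API."""
    raise NotImplementedError

def self_refine(task: str, max_rounds: int = 3) -> str:
    answer = llm(f"Task: {task}\nAnswer:")
    for _ in range(max_rounds):
        critique = llm(
            f"Task: {task}\nAnswer: {answer}\n"
            "Give actionable feedback, or reply DONE if no changes are needed."
        )
        if critique.strip() == "DONE":
            break  # the model judges its own output good enough
        answer = llm(
            f"Task: {task}\nAnswer: {answer}\n"
            f"Feedback: {critique}\nRewrite the answer applying the feedback:"
        )
    return answer
```

Note that the stopping condition is the model's own judgment of its output, which is exactly the capability that entries such as "When Hindsight is Not 20/20" stress-test.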
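
The Chain-of-Verification entry describes a draft, plan-verifications, answer-independently, revise pipeline. A sketch under the same assumptions, reusing the hypothetical `llm` stub from the Self-Refine sketch:

```python
# Minimal sketch of Chain-of-Verification (CoVe), reusing the hypothetical
# `llm` stub defined in the Self-Refine sketch above.

def chain_of_verification(question: str) -> str:
    # (i) draft an initial answer
    draft = llm(f"Question: {question}\nAnswer:")
    # (ii) plan verification questions about the draft
    plan = llm(
        f"Question: {question}\nDraft answer: {draft}\n"
        "List short fact-checking questions about the draft, one per line:"
    )
    checks = [line.strip() for line in plan.splitlines() if line.strip()]
    # (iii) answer each check WITHOUT showing the draft, so errors in the
    # draft cannot leak into the verification answers
    answers = [llm(f"Question: {q}\nAnswer:") for q in checks]
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(checks, answers))
    # (iv) produce a final answer consistent with the verification results
    return llm(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Verification:\n{evidence}\n"
        "Write a final answer consistent with the verification results:"
    )
```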
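
The Rethinking with Retrieval entry is truncated just as it begins to describe the method; the recoverable recipe is to elicit a chain of thought first and then retrieve evidence conditioned on that reasoning. A sketch with the same `llm` stub plus a hypothetical `retrieve` helper standing in for any search backend:

```python
# Minimal sketch of Rethinking with Retrieval (RR). `retrieve` is a
# hypothetical stand-in for any search backend (BM25, dense retriever, web).

def retrieve(query: str, k: int = 2) -> list[str]:
    """Hypothetical retriever returning the top-k evidence passages."""
    raise NotImplementedError

def rethinking_with_retrieval(question: str) -> str:
    # Elicit a chain of thought first...
    cot = llm(f"Question: {question}\nThink step by step:")
    steps = [s.strip() for s in cot.splitlines() if s.strip()]
    # ...then retrieve evidence per reasoning step, not just per question,
    # so intermediate claims can also be checked against external knowledge.
    passages = [p for step in steps for p in retrieve(step)]
    evidence = "\n".join(passages)
    return llm(
        f"Question: {question}\nReasoning:\n{cot}\n"
        f"Evidence:\n{evidence}\n"
        "Give a final answer supported by the evidence:"
    )
```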
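
The REINFORCE entry above cuts off mid-explanation; for reference, the standard textbook policy-gradient estimator it names (generic form, not the survey's own notation) is

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right],
$$

where $\pi_\theta$ is the policy (here, the language model), $\tau$ a sampled trajectory (generation), and $R(\tau)$ its return; the expectation is estimated by sampling rollouts.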