Reasoning by Reflection and Self-Critique
Related topics:
- A Comprehensive Evaluation of Inductive Reasoning Capabilities and Problem Solving in Large Language Models. Inductive reasoning is fundamental to both human and artificial intelligence. The inductive reasoning abilities of current Large Language Models (LLMs) are evaluated in this research. We argue that on…
- Answering Questions by Meta-Reasoning over Multiple Chains of Thought. Modern systems for multi-hop question answering (QA) typically break questions into a sequence of reasoning steps, termed chain-of-thought (CoT), before arriving at a final answer. Often, multiple cha…
- Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty. When language models (LMs) are trained via reinforcement learning (RL) to generate natural language “reasoning chains”, their performance improves on a variety of difficult question answering tasks. T…
- Bilevel Autoresearch: Meta-Autoresearching Itself. If autoresearch is itself a form of research, then autoresearch can be applied to research itself. We take this idea literally: we use an autoresearch loop to optimize the autoresearch loop. Every exi…
- Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate. Supervised Fine-Tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions. In this paper, we challenge this paradigm and propose Critique Fine-Tuning…
- DPMT: Dual Process Multi-scale Theory of Mind Framework for Real-time Human-AI Collaboration. Real-time human-artificial intelligence (AI) collaboration is crucial yet challenging, especially when AI agents must adapt to diverse and unseen human behaviors in dynamic scenarios. Existing large l…
- Decoupling Knowledge and Reasoning in LLMs: An Exploration Using Cognitive Dual-System Theory. While large language models (LLMs) leverage both knowledge and reasoning during inference, the capacity to distinguish between them plays a pivotal role in model analysis, interpretability, and develo…
- DialogueReason: Rule-Based RL Sparks Dialogue Reasoning in LLMs. We propose DialogueReason, a reasoning paradigm that uncovers the lost roles in monologue-style reasoning models, aiming to boost diversity and coherency of the reasoning process. Recent advances in R…
- Diplomat: A Dialogue Dataset for Situated PragMATic Reasoning. We introduce a new benchmark, Diplomat, aiming at a unified paradigm for pragmatic reasoning and situated conversational understanding. Compared with previous works that treat different figurative ex…
- Divide-or-Conquer? Which Part Should You Distill Your LLM? We devise a similar strategy that breaks down reasoning tasks into a problem decomposition phase and a problem solving phase and show that the strategy is able to outperform a single stage solution. F…
- Do large language models resemble humans in language use? Regularities in language range from phonology to pragmatics. For example, people associate different sounds with different referents (e.g., Köhler, 1929), automatically reinterpret implausible sentenc…
- Efficient Reasoning with Balanced Thinking. Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, faili…
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. LLMs still struggle on complex reasoning tasks, which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. Along this direction, one representative strategy …
- First Try Matters: Revisiting the Role of Reflection in Reasoning Models. Large language models have recently demonstrated significant gains in reasoning ability, often attributed to their capacity to generate longer chains of thought and engage in reflective reasoning. How…
- From Language to Logic: A Bi-Level Framework for Structured Reasoning. Structured reasoning over natural language inputs remains a core challenge in artificial intelligence, as it requires bridging the gap between unstructured linguistic expressions and formal logical re…
- Humans or LLMs as the Judge? A Study on Judgement Biases. Adopting human and large language models (LLM) as judges (a.k.a. human- and LLM-as-a-judge) for evaluating the performance of LLMs has recently gained attention. Nonetheless, this approach concurrently …
- LLM Augmentations to support Analytical Reasoning over Multiple Documents. Our key contributions are: 1) We conduct the first investigation of the feasibility of using LLMs in intelligence analysis, where both evidence-based reasoning and analytical creativity are of utmost …
- LLMs can implicitly learn from mistakes in-context. We consider the scenario where an LLM outputs a corrective rationale for an erroneous answer, then uses it to improve its next answer akin to explicit learning in humans—a phenomenon whereby patterns …
- Latent Skill Discovery for Chain-of-Thought Reasoning. Recent advances in Large Language Models (LLMs) have led to an emergent ability of chain-of-thought (CoT) prompting, a prompt reasoning strategy that adds intermediate rationale steps between question…
- Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge. LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of hu…
- MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement. Agents based on large language models (LLMs) for machine learning engineering (MLE) can automatically implement ML models via code generation. However, existing approaches to build such agents often r…
- Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models. Large Reasoning Models (LRMs) have significantly enhanced their capabilities in complex problem-solving by introducing a thinking draft that enables multipath Chain-of-Thought explorations before prod…
- Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models. To address these issues, we introduce Meta-Reasoner, a framework that dynamically optimizes inference-time reasoning by enabling LLMs to “think about how to think.” Drawing inspiration from human met…
- Metacognitive Reuse: Turning Recurring LLM Reasoning Into Concise Behaviors. Large language models (LLMs) now solve multi-step problems by emitting extended chains of thought. During the process, they often re-derive the same intermediate steps across problems, inflating token…
- Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning. Large Language Models (LLMs) have demonstrated significant improvements in reasoning capabilities through supervised fine-tuning and reinforcement learning. However, when training reasoning models, th…
- Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains. Large language models (LLMs) have achieved remarkable performance in recent years but are fundamentally limited by the underlying training data. To improve models beyond the training data, recent work…
- OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking. Vanilla-retrieved information tends to lack depth, utility, and suffers from redundancy, which negatively impacts the quality of generated articles, leading to shallow, repetitive, and unoriginal outp…
- Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity. Intermediate token generation (ITG), where a model produces output before the solution, has been proposed as a method to improve the performance of language models on reasoning tasks. While these reas…
- Proactive Conversational Agents with Inner Thoughts. In this paper, we demonstrate the limitations of such methods and rethink what it means for AI to be proactive in multi-party, human-AI conversations. We propose that just like humans, rather than mer…
- Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games. Large language models (LLMs) are effective at answering questions that are clearly asked. However, when faced with ambiguous queries they can act unpredictably and produce incorrect outputs. This unde…
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking. [**https://arxiv.org/abs/2403.09629**](https://arxiv.org/abs/2403.09629) For example, this applies to the steps not stated between the lines of a proof or to the theory of mind underlying a conversat…
- Reflect then Learn: Active Prompting for Information Extraction Guided by Introspective Confusion. Large Language Models (LLMs) show remarkable potential for few-shot information extraction (IE), yet their performance is highly sensitive to the choice of in-context examples. Conventional selection …
- Reflexion: an autonomous agent with dynamic memory and self-reflection. Recent advancements in decision-making large language model (LLM) agents have demonstrated impressive performance across various benchmarks. However, these state-of-the-art approaches typically necess…
- Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up. We propose Reversal of Thought (RoT), a novel framework aimed at enhancing the logical reasoning abilities of LLMs. RoT utilizes a Preference-Guided Reverse Reasoning warm-up strategy, which integrate…
- SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions. The recent NLP literature has witnessed a tremendous amount of activity in building models that … can follow natural language instructions (Mishra et al., 2022; Wei et al., 2022; Sanh et al., 2022;…
- Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning. Instruction tuning is critical to large language models (LLMs) for achieving better instruction following and task adaptation capabilities but its success heavily relies on the training data quality. …
- Self-Discover: Large Language Models Self-Compose Reasoning Structures. *Table 2. All 39 reasoning modules consisting of high-level cognitive heuristics for problem-solving. We adopt them from Fernando et al. (2023).* Reasoning module 1: How could I devise an experim…
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. Despite their remarkable capabilities, large language models (LLMs) often produce responses containing factual inaccuracies due to their sole reliance on the parametric knowledge they encapsulate. Ret…
- Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution? To reveal when a large language model (LLM) is uncertain about a response, uncertainty quantification commonly produces percentage numbers along with the output. But is this all we can do? We argue th…
- Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models. We investigate how to elicit compositional generalization capabilities in large language models (LLMs). Compositional generalization empowers LLMs to solve complex problems by combining foundational s…
- System 2 Attention (is something you might need too). Soft attention in Transformer-based Large Language Models (LLMs) is susceptible to incorporating irrelevant information from the context into its latent representations, which adversely affects next t…
- Teaching Large Language Models to Reason with Reinforcement Learning. [**https://arxiv.org/abs/2403.04642**](https://arxiv.org/abs/2403.04642) [[Reinforcement Learning]] Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning …
- Test-Time Scaling with Reflective Generative Model. We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini’s performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning traje…
- Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection. Self-detection for Large Language Models (LLMs) seeks to evaluate the trustworthiness of the LLM’s output by leveraging its own capabilities, thereby alleviating the issue of output hallucination. How…
- Think before you speak: Training Language Models With Pause Tokens. Transformer-based causal language models generate tokens one after the other in immediate succession. To generate the (K + 1)th token, the model consumes the K previous tokens, and proceeds layer by l…
- Thinking Augmented Pre-training. This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for …
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing. Despite the impressive capabilities of Large Language Models (LLMs) on various tasks, they still struggle with scenarios that involve complex reasoning and planning. Recent work proposed advanced pro…
- Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models. While large language models demonstrate impressive performance on static benchmarks, the true potential of large language models as self-learning and reasoning agents in dynamic environments…
- When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models. We set out to clarify these capabilities under a more stringent evaluation setting in which we disallow any kind of external feedback. Our findings under this setting show a split: while self-reflecti…
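
Most of the papers above instantiate the same control flow: generate an answer, critique it, store the critique, and retry with the critique in context (Reflexion and Self-RAG are the clearest examples). A minimal sketch of that loop is below; the function names, the string-based toy "model", and the stop criterion are illustrative assumptions, not any specific paper's API.

```python
from typing import Callable, List, Optional

def reflective_solve(
    generate: Callable[[str, List[str]], str],      # task + past reflections -> answer
    critique: Callable[[str, str], Optional[str]],  # task + answer -> feedback, or None if acceptable
    task: str,
    max_rounds: int = 3,
) -> str:
    """Generic generate -> self-critique -> retry loop (Reflexion-style sketch)."""
    reflections: List[str] = []  # episodic memory of past mistakes
    answer = generate(task, reflections)
    for _ in range(max_rounds):
        feedback = critique(task, answer)
        if feedback is None:          # critic accepts: stop reflecting
            return answer
        reflections.append(feedback)  # remember what went wrong
        answer = generate(task, reflections)
    return answer  # best effort once the reflection budget is spent

# Toy stand-ins for the model and critic: the "model" sums numbers but is
# off by one until a reflection flags the mistake.
def toy_generate(task: str, reflections: List[str]) -> str:
    total = sum(int(x) for x in task.split("+"))
    fixed = any("off by one" in r for r in reflections)
    return str(total if fixed else total - 1)

def toy_critique(task: str, answer: str) -> Optional[str]:
    truth = sum(int(x) for x in task.split("+"))
    return None if int(answer) == truth else "previous answer was off by one"

print(reflective_solve(toy_generate, toy_critique, "2+3"))  # -> 5
```

In an LLM setting, `generate` would prepend the accumulated reflections to the prompt and `critique` would be a second model call (or, as in "When Hindsight is Not 20/20", the same model with no external feedback, which is exactly where the loop becomes unreliable).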