Inference-Time Scaling
Related topics:
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well? “What to scale” refers to the specific form of TTS that is expanded or adjusted to enhance an LLM’s performance during inference. When applying TTS, researchers typically choose a sp…
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling. While reinforcement learning (RL) holds promise for enabling self-exploration and learning from feedback, recent attempts yield only modest improvements in complex reasoning. In this paper, we present…
- DeLLMa: Decision Making Under Uncertainty with Large Language Models. The potential of large language models (LLMs) as decision support tools is increasingly being explored in fields such as business, engineering, and medicine, which often face challenging tasks of deci…
- DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning. Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed multi-st…
- Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models. This raises a natural question: Does thinking more at test-time truly lead to better reasoning? To answer this question, we perform a detailed empirical study across models and benchmarks, which revea…
- Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs. Recent efforts in fine-tuning language models often rely on automatic data selection, commonly using Nearest Neighbors retrieval from large datasets. However, we theoretically show that this approach …
- Hogwild! Inference: Parallel LLM Generation via Concurrent Attention. Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involv…
- How Many Instructions Can LLMs Follow at Once? Production-grade LLM systems require robust adherence to dozens or even hundreds of instructions simultaneously. However, the instruction-following capabilities of LLMs at high instruction densities h…
- Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models. Prompt optimization methods have demonstrated significant effectiveness in aligning black-box large language models (LLMs). In parallel, inference scaling strategies such as BEST-OF-N Sampling and MAJ…
- Inference-Time Scaling for Generalist Reward Modeling. Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that p…
- Integrating Large Language Models and Reinforcement Learning for Non-Linear Reasoning. Large Language Models (LLMs) were shown to struggle with long-term planning, which may be caused by the limited way in which they explore the space of possible solutions. We propose an architecture wh…
- LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling. Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more…
- Learning to (Learn at Test Time): RNNs with Expressive Hidden States. Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their…
- Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs. However, existing methods overlook the trade-off between reasoning effectiveness and computational efficiency, often encouraging unnecessarily long reasoning chains and wasting tokens. To address this…
- Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones. Inference-time computation has emerged as a promising scaling axis for improving large language model reasoning. However, despite yielding impressive performance, the optimal allocation of inference-t…
- Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers. Large language models (LLMs) (Brown et al., 2020) are known to acquire substantial factual knowledge during pretraining, storing it in their parameters (Geva et al., 2023). However, how effectively th…
- Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs. Existing methods for adapting large language models (LLMs) to new tasks are not suited to multi-task adaptation because they modify all the model weights, causing destructive interference between tasks…
- Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill? We find that the response length of reasoning LLMs, whether trained by reinforcement learning or supervised learning, drastically increases for ill-posed questions with missing premises (MiP), ending …
- Process Reward Models That Think. Step-by-step verifiers, also known as process reward models (PRMs), are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to bui…
- Recursive Language Models. We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference strategy…
- Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning. Test-time scaling, which is also often referred to as slow-thinking, has been demonstrated to enhance multi-step reasoning in large language models (LLMs). However, despite its widespread utilization,…
- Rethinking Thinking Tokens: LLMs as Improvement Operators. Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which among other things, allows them to explore solution strategies with self-checking. This results in higher accur…
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training. Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their respective roles in enhancing model generalization in rule-ba…
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstra…
- Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs. Scaling the number of parameters and the size of training data has proven to be an effective strategy for improving large language model (LLM) performance. Yet, as these models grow increasingly power…
- Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces! Intermediate token generation (ITG), where a model produces output before the solution, has been proposed as a method to improve the performance of language models on reasoning tasks. These intermedia…
- Test-Time Scaling with Reflective Generative Model. We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini’s performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning traje…
- Test-time Prompt Intervention. Test-time compute has led to remarkable success in the large language model (LLM) community, particularly for complex tasks, where longer chains of thought (CoTs) are generated to enhance reasoning ca…
- The Art of Scaling Reinforcement Learning Compute for LLMs. Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite …
- Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods. This work conducts a comprehensive analysis of inference-time scaling methods for both reasoning and non-reasoning models on challenging reasoning tasks. Specifically, we focus our research on verifie…
- Titans: Learning to Memorize at Test Time. Over more than a decade there has been an extensive research effort on how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory…
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models. We first distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms. (2) Next, we explore the key characteristics of Long CoT: deep reasoning, extensi…
- Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought. We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required to arrive at a particular CoT.…
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning. Our results reveal a significant decline in accuracy as problem complexity grows, a phenomenon we term the “curse of complexity.” This limitation persists even with larger models and increased inferenc…
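Several entries above (the inference-aware prompt optimization and verifier-free scaling papers, for instance) revolve around two canonical inference-time scaling strategies: best-of-N sampling with a verifier, and majority voting (self-consistency) without one. A minimal sketch of both, where `generate` and `score` are hypothetical deterministic stubs standing in for real model and reward-model calls:

```python
import itertools
from collections import Counter

# Deterministic stand-in for sampling from an LLM (hypothetical stub;
# a real system would call a model API with temperature > 0).
_samples = itertools.cycle(["42", "41", "42", "43", "42"])

def generate(prompt: str) -> str:
    """Return one sampled candidate answer for the prompt."""
    return next(_samples)

def score(prompt: str, answer: str) -> float:
    """Stub verifier/reward model: higher means more likely correct."""
    return 1.0 if answer == "42" else 0.0

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n candidates; keep the one the verifier scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))

def majority_vote(prompt: str, n: int = 8) -> str:
    """Sample n candidates; return the most frequent answer
    (self-consistency; needs no external verifier)."""
    candidates = [generate(prompt) for _ in range(n)]
    return Counter(candidates).most_common(1)[0][0]
```

Both scale compute along the same axis (more samples per query); the difference is whether quality comes from an external verifier or purely from agreement among samples.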