Test-Time Compute
Related topics:
- A Survey on LLM Inference-Time Self-Improvement
  Techniques that enhance inference through increased computation at test-time have recently gained attention. In this survey, we investigate the current state of LLM Inference-Time Self-Improvement fr…
- A Survey on Post-training of Large Language Models
  The emergence of Large Language Models (LLMs) has fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific expl…
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
  “What to scale” refers to the specific form of TTS that is expanded or adjusted to enhance an LLM’s performance during inference. When applying TTS, researchers typically choose a sp…
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling
  While reinforcement learning (RL) holds promise for enabling self-exploration and learning from feedback, recent attempts yield only modest improvements in complex reasoning. In this paper, we present…
- Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
  Large Language Model (LLM) agents are increasingly pivotal for addressing complex tasks in interactive and agentic environments. Existing work primarily focuses on enhancing performance th…
- Behavioral Exploration: Learning to Explore via In-Context Adaptation
  While humans are able to achieve fast online exploration and adaptation, often acquiring new information and skills in only a handful of interactions, existing algorithmic approaches tend to rely…
- Deep Think with Confidence
  Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods like self-consistency with majority voting. However, this approach often leads to diminishi…
- Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models
  This raises a natural question: Does thinking more at test-time truly lead to better reasoning? To answer this question, we perform a detailed empirical study across models and benchmarks, which revea…
- Don't "Overthink" Passage Reranking: Is Reasoning Truly Necessary?
  With the growing success of reasoning models across complex natural language tasks, researchers in the Information Retrieval (IR) community have begun exploring how similar reasoning capabilities can …
- Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models
  Aligning Large Language Models (LLMs) traditionally relies on costly training and human preference annotations. Self-alignment seeks to reduce these expenses by enabling models to align themselves. To…
- Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
  Humans distill complex experiences into fundamental abstractions that enable rapid learning and adaptation. Similarly, autoregressive transformers exhibit adaptive learning through in-context learning…
- End-to-End Test-Time Training for Long Context
  On the other hand, Transformers with self-attention still struggle to efficiently process long context equivalent to years of human experience, in part because they are designed for nearly lossless re…
- Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision
  Training large language models (LLMs) to spend more time thinking and reflecting before responding is crucial for effectively solving complex reasoning tasks in fields such as science, coding, and mat…
- From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
  We explore three areas under a unified mathematical formalism: token-level generation algorithms, meta-generation algorithms, and efficient generation. Token-level generation algorithms, often called …
- LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling
  Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more…
- LLM Post-Training: A Deep Dive into Reasoning Large Language Models
  Some policy gradient approaches are explained below: Policy Gradient (REINFORCE). The REINFORCE algorithm [114, 115] is a method used to improve decision-making by adjusting the model’s strategy (poli…
- LLMs can implicitly learn from mistakes in-context
  We consider the scenario where an LLM outputs a corrective rationale for an erroneous answer, then uses it to improve its next answer akin to explicit learning in humans—a phenomenon whereby patterns …
- Long-context LLMs Struggle with Long In-context Learning
  We introduce a benchmark (LongICLBench) for long in-context learning in extreme-label classification using six datasets with 28 to 174 classes and input lengths from 2K to 50K tokens. Our benchmark re…
- MatFormer: Nested Transformer for Elastic Inference
  Foundation models are applied in a broad spectrum of settings with different inference constraints, from massive multi-accelerator clusters to resource-constrained standalone mobile devices. However, …
- Memorization and Knowledge Injection in Gated LLMs
  Large Language Models (LLMs) currently struggle to sequentially add new memories and integrate new knowledge. These limitations contrast with the human ability to continuously learn from new experienc…
- Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models
  To address these issues, we introduce Meta-Reasoner, a framework that dynamically optimizes inference-time reasoning by enabling LLMs to “think about how to think.” Drawing inspiration from human met…
- Reflect then Learn: Active Prompting for Information Extraction Guided by Introspective Confusion
  Large Language Models (LLMs) show remarkable potential for few-shot information extraction (IE), yet their performance is highly sensitive to the choice of in-context examples. Conventional selection …
- Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?
  The advent of test-time scaling in large language models (LLMs), exemplified by OpenAI’s o1 series, has advanced reasoning capabilities by scaling computational resource allocation during inference. W…
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
  This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstra…
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
  Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this …
- Sleep-time Compute: Beyond Inference Scaling at Test-time
  Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time…
- TTRL: Test-Time Reinforcement Learning
  This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during i…
- Test-Time Scaling with Reflective Generative Model
  We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini’s performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning traje…
- The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
  This paper aims to overcome a major obstacle in scaling reinforcement learning (RL) for reasoning with large language models (LLMs), namely the collapse of policy entropy. This phenomenon is consisten…
- The Surprising Effectiveness of Test-Time Training for Abstract Reasoning
  We investigate the effectiveness of test-time training (TTT)—updating model parameters temporarily during inference using a loss derived from input data—as a mechanism for improving models’ reasoning …
- Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods
  This work conducts a comprehensive analysis of inference-time scaling methods for both reasoning and non-reasoning models on challenging reasoning tasks. Specifically, we focus our research on verifie…
- Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking
  Recent advances in large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated the effectiveness of test-time scaling, where extended reasoning processes substantially enhance m…
- Towards Large Reasoning Models: A Survey on Scaling LLM Reasoning Capabilities
  Researchers have moved beyond simple autoregressive token generation by introducing the concept of “thought”—a sequence of tokens representing intermediate steps in the reasoning process. This innovat…
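Several of the entries above (e.g. Deep Think with Confidence) build on self-consistency with majority voting: sample many reasoning traces and keep the most common final answer. A minimal sketch, where `sample_answer` is a hypothetical callable standing in for one model rollout (not an API from any of the papers listed):

```python
import random
from collections import Counter

def self_consistency(sample_answer, n=16):
    """Sample n candidate answers and return the majority-vote
    answer together with its vote share."""
    answers = [sample_answer() for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n

# Toy stand-in for a stochastic model: final answers drawn from a pool
# in which the plausible answer "42" appears 75% of the time.
random.seed(0)
pool = ["42", "42", "42", "41"]
answer, share = self_consistency(lambda: random.choice(pool), n=100)
```

The diminishing returns noted in Deep Think with Confidence show up here too: once the vote share stabilizes, extra samples buy little accuracy at linear extra cost.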
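The REINFORCE algorithm mentioned in the LLM Post-Training entry adjusts policy parameters along reward-weighted ∇ log π(a). A single-step sketch for a softmax policy over three discrete actions (an illustrative toy, not the survey's formulation):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_step(logits, action, reward, lr=0.1):
    """One REINFORCE update for a softmax policy: move the logits by
    lr * reward * grad log pi(action), where for a softmax policy
    grad log pi(action) = one_hot(action) - softmax(logits)."""
    probs = softmax(logits)
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    return logits + lr * reward * grad_log_pi

logits = np.zeros(3)                      # uniform initial policy
p_before = softmax(logits)[1]
logits = reinforce_step(logits, action=1, reward=1.0)
p_after = softmax(logits)[1]              # rewarded action gains probability
```

A positive reward makes the sampled action more likely; a negative reward makes it less likely, which is the core mechanism the survey excerpt describes.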
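Test-time training, as defined in The Surprising Effectiveness of Test-Time Training for Abstract Reasoning, temporarily updates parameters during inference using a loss derived from the input, then discards the update. A toy sketch of that control flow; the scalar model and quadratic objective are illustrative assumptions, not the paper's setup:

```python
import copy

class TTTModel:
    """Toy one-parameter model used to illustrate the TTT loop."""

    def __init__(self, w):
        self.w = w

    def predict(self, x):
        return self.w * x

def predict_with_ttt(model, x, steps=10, lr=0.2):
    """Clone the parameters, take a few gradient steps on a loss
    derived from the test input alone, predict, then discard the
    adapted copy so the base model is untouched."""
    adapted = copy.deepcopy(model)
    for _ in range(steps):
        grad = 2.0 * (adapted.w - x)   # d/dw of the toy loss (w - x)^2
        adapted.w -= lr * grad
    return adapted.predict(x)

base = TTTModel(w=1.0)
y = predict_with_ttt(base, x=3.0)
# base.w is still 1.0 after inference: the adaptation was temporary
```

The "temporarily" in the paper's definition is what the deep-copy captures: adapted weights serve only the current input and never leak into the persistent model.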