A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

Paper · arXiv 2503.24235 · Published March 31, 2025
Test Time Compute · Inference Time Scaling · RLVR · Reinforcement Learning · Reward Models

2 What to Scale

“What to scale” refers to the specific form of TTS that is expanded or adjusted to enhance an LLM’s performance during inference. When applying TTS, researchers typically choose a specific “what to scale” based on an empirical hypothesis, aiming to achieve performance gains. For example, some researchers hypothesize that longer CoTs improve complex reasoning, leading them to enforce longer outputs from LLMs. Others leverage the self-consistency principle, assuming that generating multiple solutions to a reasoning task increases the likelihood of reaching the correct answer.

2.1 Parallel Scaling

LLMs typically generate a single response per query. Parallel scaling improves test-time performance by generating multiple outputs in parallel and then aggregating them into a final answer. Formally, consider a problem set P and a collection of models m ∈ {1, . . . , M}. Each model generates k_m candidate responses for a given problem p ∈ P, producing a set of sampled solutions S:

S = { s_{m,i} | m ≤ M, i ≤ k_m }, ⇒ (∃ ŝ) ŝ = A(s_{1,1}, . . . , s_{M,k_M}) is correct. (1)

Here, A is the aggregation function that derives a final response from the set S. The effectiveness of parallel scaling depends on both coverage—the likelihood of generating at least one correct response—and aggregation quality, which determines whether a correct response is successfully identified. This approach is supported by both theory and intuition: cognitive science research (Stanovich and West, 2000) suggests that complex problems often allow multiple valid solution paths, and increasing the number of generated responses improves the chance of finding a correct one (Li et al., 2025d). Empirically, this relationship is often log-linear with respect to compute (Brown et al., 2024).

We categorize parallel scaling into two common forms based on different sources of coverage: (1) repeated sampling from a single model and (2) sampling across multiple models. Furthermore, there are some additional techniques to enhance solution diversity and reliability, such as hyperparameter adjustments (e.g., sampling temperature (Renze, 2024) to control output variability) and input modifications (e.g., prompt rephrasing (Lambert et al., 2025) to elicit diverse responses).
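To make the parallel paradigm concrete, the sketch below implements repeated sampling at a nonzero temperature followed by majority-vote aggregation. It is a minimal illustration rather than any specific surveyed method; the `generate` and `extract_answer` callables are hypothetical stand-ins for an LLM sampling API and an answer parser.

```python
from collections import Counter
from typing import Callable, List

def parallel_scale(problem: str,
                   generate: Callable[[str, float], str],
                   extract_answer: Callable[[str], str],
                   k: int = 16,
                   temperature: float = 0.8) -> str:
    """Sample k candidate solutions independently and aggregate by majority vote.

    `generate` maps (prompt, temperature) to one completion; `extract_answer`
    pulls the final short answer out of a sampled solution.
    """
    samples: List[str] = [generate(problem, temperature) for _ in range(k)]
    answers = [extract_answer(s) for s in samples]
    # Aggregation A: return the most frequent final answer (self-consistency).
    final_answer, _count = Counter(answers).most_common(1)[0]
    return final_answer
```

Coverage grows with k (more samples make it likelier that at least one is correct), while the aggregation step determines whether that correct sample is actually chosen.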

2.2 Sequential Scaling

Sequential scaling involves explicitly directing later computations based on intermediate steps. Unlike parallel methods, sequential scaling updates intermediate states iteratively. We denote the partial solution states (subproblem results, or initial drafts) by n_1, n_2, . . . , n_T, with each new state n_{t+1} = R(n_t, p) incorporating both the previous state and the problem context. Because many problems require deliberation rather than immediate pattern matching, single-pass, ‘System 1’-style generation (Yu et al., 2024c) often fails on complex reasoning tasks. Iterative methods emulate a ‘System 2’ approach, breaking down and refining the solution step by step.

Early work like chain-of-thought prompting (Wei et al., 2022) motivated solving the problem step by step, n_{t+1} = AppendStep(n_t, new reasoning step), leading to approaches that refine responses (Madaan et al., 2023), n_{t+1} = Refine(n_t), or break down problems systematically (Zhou et al., 2023a; Zelikman et al., 2022), n_{t+1} = IntegrateSub(n_t, solution to next subproblem). Subsequent studies show that iterative revision (Chen et al., 2024h; Gou et al., 2024; Chen et al., 2025d; Snell et al., 2024) triggers self-correction, improving accuracy on challenging tasks. In practice, real-world tasks often demand more flexible and potentially non-linear reasoning paths, suggesting that purely sequential approaches, while effective, may be only one part of a broader solution.
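As a minimal sketch of the sequential update n_{t+1} = R(n_t, p), the loop below alternates generation and self-critique; the `generate` and `critique` callables are hypothetical wrappers around an LLM, and the fixed number of rounds is an illustrative simplification of the refine-style methods cited above.

```python
from typing import Callable

def sequential_scale(problem: str,
                     generate: Callable[[str], str],
                     critique: Callable[[str, str], str],
                     rounds: int = 3) -> str:
    """Iteratively refine a draft: each new state folds in feedback on the previous one."""
    state = generate(problem)                       # n_1: initial draft
    for _ in range(rounds):
        feedback = critique(problem, state)         # intermediate signal on n_t
        prompt = (f"Problem: {problem}\nDraft: {state}\n"
                  f"Feedback: {feedback}\nRevise the draft accordingly.")
        state = generate(prompt)                    # n_{t+1} = R(n_t, p)
    return state
```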

2.3 Hybrid Scaling

Hybrid scaling exploits the complementary benefits of parallel and sequential scaling. Parallel scaling mitigates the risk of the model missing the correct line of thought by casting a wide net, while sequential scaling allows deep exploration of a line of reasoning once it seems promising. Formally, let F_t be the set of candidate solutions at iteration t. Each iteration expands these candidates in parallel with an expansion function E and sequentially filters them with a selection function S:

F_{t+1} = S( ⋃_{s ∈ F_t} E(s) ). (2)

After T iterations, an aggregator A selects the final solution ŝ ∈ F_T. From a cognitive standpoint, such a combination mirrors how human problem-solvers generate multiple hypotheses (divergent thinking) and then refine/evaluate them (convergent thinking). Classic search algorithms (e.g., iterative deepening (Chen et al., 2025d) and beam search (Snell et al., 2024)) embody this strategy by balancing exploration and exploitation. Recent work expands on this idea. Tree-of-Thoughts (ToT) (Yao et al., 2023b) branches at decision points, exploring multiple reasoning paths before pruning to a single sequence. Follow-up methods, such as Graph-of-Thoughts (Besta et al., 2024), Algorithm-of-Thought (Sel et al., 2024), Forest-of-Thought (Bi et al., 2024), Monte Carlo Tree Search (MCTS) (Lin et al., 2025), and multi-agent reasoning (Wang et al., 2025a; Chen et al., 2024f), leverage similar but more complex hybrid patterns. For instance, multiple LLMs can debate or verify each other’s answers (Liang et al., 2024; Schaul, 2024), while “journey learning” and “tool-augmented reasoning” (Li et al., 2025b) emphasize capturing full reasoning trajectories.
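The expand-then-select loop of Eq. (2) can be instantiated as a simple beam search over partial solutions, sketched below under the assumption of hypothetical `expand` (a proposal model playing the role of E) and `score` (a verifier used by the selection S) callables.

```python
from typing import Callable, List

def hybrid_scale(problem: str,
                 expand: Callable[[str, str], List[str]],
                 score: Callable[[str, str], float],
                 beam_width: int = 4,
                 iterations: int = 3) -> str:
    """Beam-search-style instantiation of F_{t+1} = S(union of E(s) over s in F_t)."""
    frontier: List[str] = [""]                      # F_0: a single empty partial solution
    for _ in range(iterations):
        # Parallel expansion E: extend every frontier candidate.
        candidates = [c for s in frontier for c in expand(problem, s)]
        if not candidates:
            break
        # Sequential selection S: keep only the highest-scoring continuations.
        candidates.sort(key=lambda c: score(problem, c), reverse=True)
        frontier = candidates[:beam_width]
    # Aggregator A: return the best-scoring candidate after T iterations.
    return max(frontier, key=lambda c: score(problem, c))
```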

2.4 Internal Scaling

Internal scaling enables a model to autonomously determine how much computation to allocate to reasoning at test time, relying on the model’s internal parameters instead of external human-guided strategies. Formally, we update an initial model M_0 to a new model M_1 via a training procedure, Φ : (M_0, D) ↦ M_1, on data D that includes multi-step reasoning tasks (e.g., long CoT examples produced by external scaling (Qin et al., 2024)). Surprisingly, employing outcome-oriented reward modeling (DeepSeek-AI, 2025; OpenAI, 2024b) for RL enables the model to extend its reasoning process autonomously.

At test time, M_1 generates a sequence of internal states z_1, z_2, . . . , z_T via

z_{t+1} = f_θ(z_t), stop(z_t) = π_θ(z_t). (3)

The model’s learned policy π_θ controls when to halt. This internal feedback loop can lead to emergent behaviors—such as more detailed reasoning chains or self-evaluation steps—without any external prompts or multi-call orchestration. In practice, internal scaling often rivals or surpasses standard techniques, thanks to its ability to focus computational effort on a single, coherent reasoning trajectory.
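To illustrate the halting behavior of Eq. (3), the sketch below runs a reasoning loop whose length is controlled by the model’s own stop signal; `step` and `should_stop` are hypothetical interfaces to the tuned model M_1 (f_θ) and its learned stop policy (π_θ), and the hard cap is only a safety net rather than part of the mechanism.

```python
from typing import Callable, List

def internal_scale(problem: str,
                   step: Callable[[str, List[str]], str],
                   should_stop: Callable[[List[str]], bool],
                   max_steps: int = 64) -> str:
    """Let the model itself decide how long to 'think' before answering."""
    states: List[str] = []
    for _ in range(max_steps):                      # safety cap, not an external schedule
        states.append(step(problem, states))        # z_{t+1} = f_theta(z_t)
        if should_stop(states):                     # stop(z_t) = pi_theta(z_t)
            break
    return states[-1] if states else ""
```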

3 How to Scale

3.1 Tuning-based Approaches

Directly tuning a model’s parameters is an effective strategy for activating its ability to devote more computation at test time. This includes two approaches: 1) Supervised Finetuning (SFT): Training an LLM via next-token prediction on synthetic or distilled long CoTs enables it to imitate and internalize structured reasoning patterns, effectively learning to think through complex problems. By mimicking extended rationales, SFT reduces the reliance on explicit prompting at inference time. 2) Reinforcement Learning (RL): By leveraging feedback from a reward model on inference tasks, the policy model is automatically updated. Although no supervised data is introduced, the model autonomously generates long CoT reasoning while ensuring reliable answers. We divide the RL works for internal scaling into two perspectives: reward model-based methods and reward model-free methods.

3.1.1 Supervised Finetuning (SFT)

Training an LLM via next-token prediction on synthetic or distilled long CoTs enables it to internalize structured reasoning patterns and effectively “think” through complex problems. By mimicking extended rationales, SFT reduces the reliance on explicit prompting at inference time. This includes three parts: (1) Imitation, describing techniques like MCTS used to generate CoT-style demonstrations for fine-tuning; (2) Distillation, summarizing how student models are trained on outputs from stronger models (e.g., o1, R1); and (3) Warmup, which stabilizes learning and aligns the model’s behavior to produce useful step-by-step reasoning.

Imitation A prominent approach to enhancing LLM reasoning via SFT is to generate long CoT demonstrations using test-time “planner” algorithms and then fine-tune the model to imitate those demonstrations. For example, STaR (Zelikman et al., 2022) uses the model itself to generate step-by-step solutions for a given problem and filters for correct outcomes, treating the verified solutions as new demonstrations for fine-tuning. More structured search has been applied to generate even higher-quality traces: ReST-MCTS (Zhang et al., 2024a) integrates an MCTS planner (guided by a learned value model) to explore the space of possible reasoning steps; the model is subsequently fine-tuned on these search-generated traces, i.e., it learns to imitate the successful reasoning trajectories discovered by the planner.
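The sketch below illustrates an imitation-data collection loop in the spirit of STaR: sample rationales, keep only those whose final answer matches the reference, and treat the survivors as SFT targets. The `generate` and `extract_answer` callables are hypothetical, and keeping a single verified trace per problem is a simplification.

```python
from typing import Callable, List, Tuple

def build_imitation_set(problems: List[Tuple[str, str]],
                        generate: Callable[[str], str],
                        extract_answer: Callable[[str], str],
                        attempts: int = 4) -> List[Tuple[str, str]]:
    """Collect (question, rationale) pairs whose answers pass an outcome check."""
    demonstrations: List[Tuple[str, str]] = []
    for question, gold in problems:
        for _ in range(attempts):
            rationale = generate(f"{question}\nLet's think step by step.")
            if extract_answer(rationale) == gold:   # outcome filter on the sampled trace
                demonstrations.append((question, rationale))
                break                               # keep one verified trace per problem
    return demonstrations
```

The resulting pairs are then used as ordinary next-token-prediction training data, so the model learns to reproduce the reasoning style of its own verified traces.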

Distillation While the imitation approach uses a model’s own intermediate outputs for improvement, distillation techniques aim to transfer the capabilities of a stronger model (or ensemble of models) into a target model via supervised learning. As reported by Muennighoff et al. (2025); Li et al. (2025e), a 32B model trained on a curated sample set generated by a top-tier reasoner was able to solve competition-level math problems nearly as well as the teacher, indicating successful distillation of reasoning.

Warmup SFT warmup (Luong et al., 2024) refers to an initial SFT phase applied to an LLM after its unsupervised pretraining but before other post-training steps like RL. This stage stabilizes subsequent training by providing a well-initialized model that adapts better to preference optimization and avoids instability due to ungrounded behavior (Zeng et al., 2025c). Effective SFT warmup is characterized by several key elements: (i) the use of high-quality, task-relevant datasets (Luong et al., 2024); (ii) short duration; (iii) a tailored learning rate schedule (Pareja et al., 2024). Technically, SFT warmup is often integrated with methods like rejection sampling (Pareja et al., 2024)—which uses warm-started models to generate high-quality data for further training.

3.1.2 Reinforcement Learning (RL)

Reward model-free. Recent advancements in RL and preference optimization have significantly enhanced the performance of large language models, particularly in reasoning and problem-solving tasks. A key innovation in this domain is the introduction of RL with verifiable reward by DeepSeek R1 (DeepSeek-AI, 2025), which leverages rule-based reward mechanisms to optimize models efficiently and reliably. This approach has sparked growing interest among researchers working on large models, as it addresses challenges such as sparse rewards and training instability by providing dense feedback for policy optimization. Several methods have been developed to improve exploration and accuracy in reasoning tasks through preference optimization. For instance, cDPO (Lin et al., 2024), CPL (Wang et al., 2024f), Focused-DPO (Zhang et al., 2025b), DAPO (Liu et al., 2024b), and RFTT (Zhang et al., 2025c) prioritize critical or error-prone areas, enhancing internal scaling and reasoning accuracy. Additionally, Selective DPO (Gao et al., 2025b) emphasizes the importance of aligning data difficulty with model capacity by filtering out overly challenging examples, further refining the training process. VC-PPO (Yuan et al., 2025) investigates the failure of PPO for the long CoT task and uses a pre-trained value model to achieve better results. Light-R1 (Wen et al., 2025) proposes a curriculum training framework for increasing data difficulty combined with multi-staged post-training. SimPO (Meng et al., 2024) uses the average log probability of a sequence as the implicit reward and removes the reference model in DPO.
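A minimal sketch of a verifiable, rule-based reward in this spirit is shown below; the \boxed{...} answer format and the small partial credit for well-formed but incorrect answers are illustrative assumptions, not the reward design of any particular cited system.

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Rule-based reward: parse a final answer and compare it to the reference.

    No learned reward model is involved; correctness (and, here, basic format
    compliance) is checked deterministically.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0                                  # unparsable: no format reward
    predicted = match.group(1).strip()
    return 1.0 if predicted == gold_answer.strip() else 0.1   # small format-only credit
```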

In the realm of mathematical problem-solving, DQO (Ji et al., 2024) and OREO (Wang et al., 2024b) propose novel value function optimization techniques, demonstrating improvements in model performance. DAPO (Yu et al., 2025) leverages dynamic sampling for large-scale RL systems. These advancements are complemented by a range of open-source training frameworks that have equipped researchers and developers with tools to optimize training and enhance inference. Early frameworks like SimpleRL (Zeng et al., 2025b) and DeepScaler (Luo et al., 2025b) quickly replicated the technology stack of DeepSeek R1. Furthermore, SimpleRL-Zoo (Zeng et al., 2025a) presents more experimental details about SimpleRL. Others, such as X-R1 (X-R1Team, 2025) and TinyZero (Pan et al., 2025b), focus on delivering an intuitive and cost-effective user experience. Notably, Open-Reasoner-Zero (Hu et al., 2025b) replicated the DeepSeek R1-zero training scheme using a 32B model, achieving comparable performance. Further advancements in RL for internal scaling have been facilitated by frameworks like OpenR (Wang et al., 2024c), OpenRLHF (Hu et al., 2024), OpenR1 (HuggingFace, 2025), Logic-RL (Xie et al., 2025), and AReaL (AntResearch-RL-Lab, 2025). These frameworks have enhanced the replication of internal scaling and, through open-source sharing, accelerated academic research progress. The above developments not only address key challenges in RL but also pave the way for more efficient and reliable model training and deployment.

Reward model-based. With a Bradley-Terry model (Zheng et al., 2023b) optimized on human preferences as the reward model, PPO (Schulman et al., 2017) stands as one of the most influential algorithms for its efficiency and stability and is widely used for internal scaling. Building upon PPO, ReMax (Li et al., 2023b) introduces variance reduction techniques along with the REINFORCE (Sutton et al., 1999) and RLOO (Ahmadian et al., 2024) methods. This eliminates the need for an additional value model in PPO, removes over four hyperparameters, lowers GPU memory usage, and speeds up the training process. GRPO (Shao et al., 2024) replaces traditional value models with improved sampling strategies. This significantly accelerates the learning process and achieves performance comparable to GPT-4 in mathematics. REINFORCE++ (Hu et al., 2025a) further simplifies GRPO and enhances its training. DVPO (Huang et al., 2025a) presents a streamlined framework, substituting the reward model with a pre-trained global value model and removing the dependency between the actor and critic. PRIME (Cui et al., 2025) integrates the SFT model as a PRM within a unified RL framework, allowing online updates through policy rollouts and outcome labels via implicit process rewards. SPPD (Yi et al., 2025) utilizes process preference learning with a dynamic value margin for self-training. Recently, several works have focused on other challenges of existing reward model-based methods. UGDA (Sun et al., 2025) leverages the uncertainty and influence of samples during PPO training and iteratively refines the reward model. VinePPO (Kazemnejad et al., 2024) exploits the flexibility of language environments to compute unbiased Monte Carlo-based estimates, avoiding the need for large value networks. LCPO (Aggarwal and Welleck, 2025) focuses on optimizing accuracy and adherence to user-specified length constraints for reasoning tasks. ReST-MCTS* (Zhang et al., 2024a) uses tree-search-based RL to bypass the per-step manual annotation typically required for training process rewards. These advancements and refinements in algorithms continue to drive the field of reinforcement learning for internal scaling, offering more effective tools and methods for solving complex problems.
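As a simplified illustration of how GRPO-style methods dispense with a value model, the sketch below computes group-relative advantages by normalizing each rollout's reward against the statistics of its own group; this is a rendering of the core idea only, not the full algorithm.

```python
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each sampled response's reward by its group mean and std."""
    if not rewards:
        return []
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group of rewards [1.0, 0.0, 0.0, 1.0] yields advantages of roughly [+1, -1, -1, +1], so correct rollouts are reinforced relative to their own siblings without any critic.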

3.2 Inference-based Approaches

Unlike training-based approaches, which adjust the model’s parameters offline, inference-based approaches dynamically adjust computation during deployment. This paradigm includes four essential components: (i) Stimulation, which encourages the model to generate longer or multiple candidate outputs; (ii) Verification, which filters or scores outputs based on correctness or other criteria; (iii) Search, which systematically explores the sample space; and (iv) Aggregation, which consolidates multiple outputs into the final output. These four components are often used in combination to allocate test-time computation more effectively and boost performance on complex reasoning tasks. In the following sections, we provide detailed discussions of each component.

3.2.1 Stimulation

Stimulation techniques are the first step in encouraging the model to allocate more computation to thinking. They stimulate the LLM to generate (i) longer samples and (ii) more samples, instead of the single short sample produced by naive prompting. This includes several key approaches:

Prompt Strategy. Instead of allowing the model to generate an answer directly, one way to stimulate scaling of the LLM at test time is through the prompt itself. This relies on the backbone LLM’s ability to follow instructions. For instance, prompts can guide the model toward step-by-step reasoning. Simple modifications such as adding explicit instructions (e.g., “Please think step by step.”) can improve the model’s ability to break down complex problems into intermediate steps (Lightman et al., 2023). This strategy ensures more deliberate and structured thought generation by shaping the reasoning process at the input level. Other techniques (Wei et al., 2022; Ranaldi et al., 2025) likewise rely on explicitly stating the requirements in the prompt to stimulate samples during TTS.

Decode Strategy Rather than passively accepting the model’s default output behavior, this approach modifies the decoding process to encourage the LLM to generate longer, more detailed samples adaptively. Techniques such as injecting filler tokens (Pfau et al., 2024), adaptively injecting predefined phrases (Jin et al., 2020), forcing a scaling budget (Muennighoff et al., 2025), enforcing intermediate generation (Li et al., 2025f), or predictive decoding (Ma et al., 2025a) allow the model to modify its distribution progressively. Enforcing extended reasoning at the output level enables the model to think longer and generate more comprehensive solutions without requiring additional external guidance.
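A rough sketch of one decode-level intervention, budget forcing, is given below: whenever the model ends its reasoning before a minimum thinking budget is spent, an interjection is appended and decoding resumes. The `continue_generation` callable, the word-count budget, and the "Wait," interjection are illustrative assumptions inspired by, but not identical to, the cited budget-forcing work.

```python
from typing import Callable

def budget_forced_decode(problem: str,
                         continue_generation: Callable[[str], str],
                         min_thinking_words: int = 512,
                         max_rounds: int = 8) -> str:
    """Keep the model 'thinking' until a minimum budget is spent."""
    trace = continue_generation(problem)
    rounds = 0
    while len(trace.split()) < min_thinking_words and rounds < max_rounds:
        trace += "\nWait,"                          # discourage premature stopping
        trace += " " + continue_generation(problem + "\n" + trace)
        rounds += 1
    return trace
```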

Latent Strategy Unlike strategies that rely on token-level instructions or output expansion, latent strategies encourage deeper or recurrent thinking within the hidden representations themselves, effectively scaling up test-time computation through continuous internal states. For example, Hao et al. (2024) propose a paradigm where the model completes reasoning steps entirely in hidden space before producing the final answer; Kong et al. (2025) introduce a latent-thought framework that conditions text generation on an inferred latent variable to guide more thorough or expansive reasoning, while Shen et al. (2025c) show that compressing CoT into continuous embeddings can preserve intermediate reasoning fidelity without lengthy textual traces. Other approaches (Saunshi et al., 2025) harness looped or recurrent inference to repeatedly refine hidden states, effectively unfolding multiple “thinking iterations” in a single forward pass.

Self-Repetition Strategy Apart from generating longer samples, another way to stimulate the LLM is to generate multiple samples instead of a single one. One widely adopted strategy is to prompt the LLM repeatedly during the decoding stage, commonly known as self-repetition (Wang et al., 2023b). Another strategy is to prompt the LLM sequentially, in order to mimic a refinement process (Madaan et al., 2023) or correction under constraints (Ferraz et al., 2024).

Mixture-of-Model Strategy Gathering the “wisdom of the crowd” can move beyond repeated sampling from a single model to coordinated sampling across multiple models. These LLMs can play either homogeneous roles (Wang et al., 2025a) or heterogeneous roles (Chen et al., 2024i; He et al., 2025) during the process. By harnessing diverse perspectives, such a multi-model strategy not only increases the coverage of possible solutions but also improves the system’s overall robustness.

3.2.2 Verification

Verifying the correctness and consistency of LLM outputs during test-time scaling is also crucial. Verification plays an important role in test-time scaling, as a solid verification process can be used to:

• directly select the output sample among various candidates, under the Parallel Scaling paradigm;
• guide the stimulation process and determine when to stop, under the Sequential Scaling paradigm;
• serve as the criterion in the search process, which we will discuss in Section 3.2.3;
• determine which samples to aggregate and how to aggregate them (e.g., weights), which we will discuss in Section 3.2.4.

Usually, there are two types of verification, as shown below:

Outcome Verification. Outcome verification plays a crucial role in ensuring the correctness and consistency of generated outputs. Common approaches include using a separate verifier model to score multiple candidate answers (e.g., Cobbe et al. (2021)), employing self-consistency and voting mechanisms (Wang et al., 2023b) or a discriminator LM (Chen et al., 2024j), and leveraging tool-assisted (Gou et al., 2024) or heuristic checks (DeepSeek-AI, 2025) in domains such as math and code generation. For specific task problems, such as trip planning, functional scoring (Lee et al., 2025) is also adopted for verifying the proposed plans. Instead of formulating outcome verification as a classification problem, Zhang et al. (2025d) exploits the generative ability of LLMs and proposes to reformulate the outcome verification process as a next-token prediction task. Li et al. (2025g) formulates feedback utilization as an optimization problem and adaptively propagates information between samples.

Apart from single criteria, certain outcome verification approaches verify the quality of the simulated samples from multiple perspectives. Liu et al. (2023b) conducts both (i) passive verification from external tools and (ii) active verification via a rethinking mechanism to justify each sample. Zhang et al. (2024c) follows a similar idea and proposes to verify each sample from three aspects: Assertion, Process, and Result. Lifshitz et al. (2025) further extends the number of verification agents to an arbitrary number and decouples the semantic criteria from the verification agents. Parmar et al. (2025) and Saad-Falcon et al. (2024) also propose verification agents that score each sample by considering various factors. Saad-Falcon et al. (2024) additionally proposes a unit test-based verification approach. We provide a detailed technical categorization in Appendix A.

Process Verification. Process verification approaches verify not only the sample outcomes but also the process of obtaining them. They are commonly adopted in tasks with formal, deductive processes, such as reasoning, coding, or mathematics. They are also known as the process reward model (PRM) or state verification. Lightman et al. (2023) proposes to train a PRM for step-level verification on mathematical tasks. Yao et al. (2023b) uses an LM-based state verifier to guide the search over samples organized in a tree structure. Zhang et al. (2024b) further tunes the LLM on such preference data and enables a CoT structure during test time. Instead of training an external verifier, Xie et al. (2023) prompts the same LM to evaluate the current step given all previous ones. Hosseini et al. (2024) proposes to train the verifier with both accurate and inaccurate generated data. Although LM-based process verifiers can be easily integrated, they may yield unreliable verification, especially for complex problems with long processes. Ling et al. (2023) decomposes the verification process in a deductive manner, so the verifier only needs to verify a few statements within the long thought chain. Yu et al. (2024a) builds on similar intuition but instead focuses on code-aided mathematical reasoning tasks, applying the critic model iteratively. Li et al. (2025b) instead relies on external tools, such as code interpreters, to verify the process.
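A minimal sketch of process verification is shown below: each step is scored given the problem and the preceding steps, and the trace is aggregated with a minimum, on the view that a chain is only as sound as its weakest step (products or averages are also common). The `step_score` callable is a hypothetical stand-in for any PRM.

```python
from typing import Callable, List

def process_verify(problem: str,
                   steps: List[str],
                   step_score: Callable[[str, List[str]], float]) -> float:
    """Score a reasoning trace step by step and aggregate with the minimum."""
    scores = [step_score(problem, steps[: i + 1]) for i in range(len(steps))]
    return min(scores) if scores else 0.0
```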

3.2.3 Search

Search is also a frequently used component during test-time scaling. LLMs, pre-trained on vast amounts of online data, can be viewed as a compression of real-world knowledge. However, standard inference tends to underutilize their capacity. Search, a classic yet effective technique for retrieving relevant information from vast databases, can be utilized to fully exploit the capability of LLMs by exploring their potential options in a structured manner. Existing test-time scaling approaches based on search techniques demonstrate significant performance gains on complex tasks, such as challenging mathematics.

Yao et al. (2023b) explores the potential of search by decomposing the output samples into multiple thoughts and organizing them in a tree structure. Using only naive tree search algorithms, such as depth-first search and breadth-first search, it demonstrates superior performance on reasoning tasks. Monte Carlo Tree Search (MCTS) (Coulom, 2006), a classical and powerful search algorithm, has also been used to better exploit the hidden knowledge of LLMs. Chaffin et al. (2022) adopts MCTS during the decoding stage, guided by discriminators, for constrained textual generation. Zhang et al. (2023b) further extends MCTS to enhance the planning ability in code generation via looking ahead. Tian et al. (2024) incorporates MCTS as a critical component in a self-improving framework for LLMs. Wan et al. (2024) tailors the search algorithm to tackle problems requiring long-horizon planning and deep tree structures for searching. Chen et al. (2024j) further identifies that discriminators are the key bottleneck in search-enhanced planning. Gandhi et al. (2024) systematizes the search process in a unified language and proposes to train an LLM with data and feedback from the search process. Wu et al. (2024d) empirically analyzes various search algorithms and designs a reward-balanced search algorithm toward Pareto-optimal test-time scaling. Edward Beeching (2024) further extends beam search by incorporating diversity considerations.
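The sketch below gives a ToT-flavored depth-first search over partial reasoning states with value-based pruning; `propose`, `value`, and `is_solution` are hypothetical interfaces to a thought-proposal model, a state evaluator, and a goal check, and the pruning threshold is an illustrative choice rather than a setting from any cited method.

```python
from typing import Callable, List, Optional

def tot_dfs(problem: str,
            propose: Callable[[str, List[str]], List[str]],
            value: Callable[[str, List[str]], float],
            is_solution: Callable[[str, List[str]], bool],
            max_depth: int = 4,
            threshold: float = 0.5) -> Optional[List[str]]:
    """Depth-first search over thoughts, pruning branches the evaluator dislikes."""
    def dfs(path: List[str], depth: int) -> Optional[List[str]]:
        if is_solution(problem, path):
            return path                              # return the first complete path found
        if depth == 0:
            return None
        for thought in propose(problem, path):
            if value(problem, path + [thought]) < threshold:
                continue                             # prune low-value branches
            result = dfs(path + [thought], depth - 1)
            if result is not None:
                return result
        return None

    return dfs([], max_depth)
```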

Apart from searching within a tree structure, Besta et al. (2024) models the output as a graph search problem. Xie et al. (2023) proposes a stochastic beam search solution based on self-evaluation for reasoning tasks. Pan et al. (2025a) enhances MCTS with a proposed associative memory to dynamically update its knowledge base. Li et al. (2025c) proposes to cast the reasoning process as constructing a control flow graph, with each node indicating a logic unit.

3.2.4 Aggregation

Aggregation techniques consolidate multiple solutions into a final decision to enhance the reliability and robustness of model predictions at test time. Based on how the final output is generated, we empirically categorize them into two key classes: (i) Selection, which selects the best-performing sample among all candidates, where the selection criteria may vary across different approaches; and (ii) Fusion, which fuses multiple samples into one through techniques such as weighting or generation.

Selection In this category, the aggregation process can be viewed as a selection problem. One well-known example is to select the most consistent answer, commonly known as self-consistency. Wang et al. (2023b) improves accuracy by leveraging statistical redundancy—if different reasoning paths converge to the same conclusion, the answer is more likely to be correct. Self-consistency effectively reduces variance in model outputs and mitigates occasional hallucinations. However, because the final output is chosen by consistency-based voting, inaccurate and low-quality samples inevitably influence the output quality. Therefore, various approaches have been proposed to filter the candidates before voting. Chen et al. (2024e) incorporates an LM as a filter, while Wu et al. (2025b) proposes a length-filtered vote, where prediction uncertainty is adopted as a proxy to filter for reliable CoT lengths.

Best-of-N (Irvine et al., 2023) follows the same process but replaces the self-consistency criterion with scalar scores generated by external verifiers. Song et al. (2024) further demonstrates that best-of-N on small LLMs can yield competitive performance against SOTA proprietary models. Munkhbat et al. (2025) attaches few-shot conditioning as a filter before the best-of-N selection, aiming to alleviate its sample inefficiency and achieving significant length reduction. Motivated by particle filtering, Puri et al. (2025) proposes to apply filtering over the samples. Sessa et al. (2024) goes one step further in reducing sample inefficiency by tuning the best-of-N results into the LM via RLHF. With the rise of agentic approaches, Parmar et al. (2025) proposes a selection agent that considers complex factors from both historical and current status. Apart from selecting samples from a single LM, Ong et al. (2025) views the selection of samples generated by weak and strong LLMs as a routing problem and proposes constraints on computation costs.

Fusion Directly selecting the final output sample among candidates may yield unsatisfactory results, especially when the sample quality of candidates is low. Fusion approaches merge multiple samples into one to address this problem. Brown et al. (2024) and Li et al. (2023a) extend the idea of Best-of-N and weight each sample by its score from external verifiers. Jiang et al. (2023), on the other hand, directly prompts another LLM as a summarizer to merge multiple selected samples. Li et al. (2025j) shares similar intuition by replacing the majority voting in self-consistency (Wang et al., 2024e) with generative self-aggregation. Li et al. (2025c) also adopts an LLM as the synthesizer, conditioned on the intermediate considerations from previous steps.
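To contrast the two aggregation classes, the sketch below applies best-of-N selection and verifier-weighted voting (a simple form of fusion) to the same candidate set; `extract_answer` and `verifier_score` are hypothetical stand-ins for an answer parser and an external verifier.

```python
from collections import defaultdict
from typing import Callable, Dict, List

def aggregate(samples: List[str],
              extract_answer: Callable[[str], str],
              verifier_score: Callable[[str], float],
              mode: str = "weighted") -> str:
    """Aggregate candidates by selection ('select') or weighted fusion ('weighted')."""
    if mode == "select":
        # Selection: answer of the single highest-scoring sample (best-of-N).
        return extract_answer(max(samples, key=verifier_score))
    # Fusion: sum verifier scores per distinct answer, then pick the heaviest one.
    weights: Dict[str, float] = defaultdict(float)
    for s in samples:
        weights[extract_answer(s)] += verifier_score(s)
    return max(weights.items(), key=lambda kv: kv[1])[0]
```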