Test-Time Scaling with Reflective Generative Model
We introduce our first reflective generative model, MetaStone-S1, which matches OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for the policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory prediction and scoring, respectively, introducing only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model, which learns to select high-quality reasoning trajectories directly from the outcome reward. Equipped with the Reflective Generative Form, MetaStone-S1 is naturally suited to test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on controllable thinking length. Experiments demonstrate that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters.
Recent analyses suggest that OpenAI’s o3 model achieves its advanced reasoning and coding capabilities through Test-Time Scaling (TTS) techniques such as massive sampling, candidate scoring, and search over multiple reasoning paths (Labs, 2025; Zeff, 2024). For instance, during ARC-AGI and competitive coding evaluations, o3 was shown to generate up to 1024 candidate samples for each query (Chollet, 2024; OpenAI, 2025). These inference-time strategies mark a significant shift from traditional one-pass models, enabling o3 to adapt dynamically to novel tasks and achieve near-human performance in reasoning benchmarks.
TTS approaches can be categorized into two types: internal TTS and external TTS. Internal TTS (also called sequential TTS in Zeng et al. (2025)) strategies use CoT for longer thinking processes (Guo et al., 2025; OpenAI, 2024), benefiting from Long-CoT Supervised Fine-Tuning and reinforcement learning. Recent internal TTS methods (Guo et al., 2025) mainly suffer from false-positive reasoning processes, as the outcome reward will, during training, credit a correct final answer even when the underlying reasoning is incorrect. External TTS (also called parallel TTS in Zeng et al. (2025)) is proposed for selecting the correct reasoning process, which has proven more effective for boosting performance than outcome rewards alone (Lightman et al., 2023). Prominent external TTS algorithms include Best-of-N sampling, Beam Search, and Diverse Verifier Tree Search, which use a verifier such as a Process Reward Model (PRM) to select high-quality reasoning trajectories. For example, Best-of-N first generates multiple outputs and then uses a PRM to score them and select the best solution. Liu et al. (2025) point out that the best external TTS strategy varies with the parameter size of the policy model. When the policy model has fewer than 32B parameters, the search methods (Beam Search and Diverse Verifier Tree Search) achieve better results than the sampling method (Best-of-N). However, when the parameter size is equal to or greater than 32B, the sampling method achieves better performance. Based on the above analysis, internal and external TTS are two independent methods that can benefit each other, e.g., using Best-of-N with a PRM to boost a Long-CoT model.
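As a concrete illustration of the sampling-based variant, the following is a minimal sketch of Best-of-N with a PRM-style verifier; `generate_trajectory` and `score_trajectory` are hypothetical stand-ins for a concrete policy model and reward model, not functions defined in this paper.

```python
from typing import Callable, List, Tuple

def best_of_n(
    prompt: str,
    n: int,
    generate_trajectory: Callable[[str], str],      # hypothetical: samples one CoT + answer
    score_trajectory: Callable[[str, str], float],  # hypothetical: PRM / verifier score
) -> Tuple[str, float]:
    """Sample n reasoning trajectories and return the highest-scoring one."""
    candidates: List[Tuple[str, float]] = []
    for _ in range(n):
        trajectory = generate_trajectory(prompt)
        score = score_trajectory(prompt, trajectory)
        candidates.append((trajectory, score))
    return max(candidates, key=lambda c: c[1])
```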
This paper focuses on external TTS and proposes a new Reflective Generative Form for high-quality reasoning trajectory selection. Specifically, the proposed form shares the backbone between the policy model and the process reward model, and uses self-supervised training to eliminate the reliance on process-level supervision. Based on the Reflective Generative Form, the proposed MetaStone-S1 offers high, medium, and low reasoning modes with controllable thinking length. Experimental results show that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters.
2.1. Test-Time Scaling
Test-Time Scaling (TTS) is a technique that leverages additional computational resources at inference time to tackle challenging problems. With the remarkable performance improvements demonstrated by OpenAI o1 (Jaech et al., 2024), TTS has become a research hotspot for enhancing the reasoning capabilities of LLMs. TTS can be divided into two categories: internal TTS and external TTS. Internal TTS introduces long Chain-of-Thought (CoT) to generate answers based on a detailed reasoning process. OpenAI o1 (Jaech et al., 2024) and DeepSeek R1 (Guo et al., 2025) introduce a thinking process to plan the solution and guide the final answer. Jin et al. (2024) and Yeo et al. (2025) have shown that long CoT can help models correct their own mistakes and decompose complex problems more effectively, thereby improving performance. DeepScaleR (Luo et al., 2025) demonstrates that by carefully extending the context length during training, a model with only 1.5B parameters can surpass o1-Preview. However, Chen et al. (2024a,b) have highlighted the risk of overthinking, where excessively long reasoning trajectories may lead to performance degradation. On the other hand, external TTS scales up inference through search-based strategies and auxiliary reward models. A common approach is the Best-of-N strategy (Brown et al., 2024; Lightman et al., 2023; Wang et al., 2023), which generates multiple candidates and selects the best one based on scores from a pretrained reward model. Moreover, finer-grained methods have also been explored, such as Beam Search (Liu et al., 2025; Snell et al., 2024), Diverse Verifier Tree Search (Beeching et al.), and Monte Carlo Tree Search (MCTS) (Guan et al., 2025; Luo et al., 2024; Zhang et al., 2024). These methods search at the step level and utilize Process Reward Models (PRMs) to guide the reasoning trajectory step by step. Beyond search strategies, recent work emphasizes that the quality of the reward model is a crucial factor in external TTS (Guan et al., 2025). A straightforward and effective way to enhance a model's reasoning ability is therefore to develop a high-quality reward model.
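To make the step-level search concrete, below is a minimal sketch of PRM-guided beam search under stated assumptions: `propose_steps` and `prm_step_score` are hypothetical interfaces for step generation and step scoring, and the search runs for a fixed number of expansions for simplicity rather than terminating on a completed answer.

```python
from typing import Callable, List

def beam_search(
    prompt: str,
    propose_steps: Callable[[str, List[str], int], List[str]],  # hypothetical step generator
    prm_step_score: Callable[[str, List[str]], float],          # hypothetical PRM scorer
    beam_width: int = 4,
    expand: int = 4,
    max_depth: int = 16,
) -> List[str]:
    """Return the best partial reasoning trajectory (a list of steps) found by the search."""
    beams: List[List[str]] = [[]]  # each beam is a partial trajectory of steps
    for _ in range(max_depth):
        candidates: List[List[str]] = []
        for steps in beams:
            for nxt in propose_steps(prompt, steps, expand):
                candidates.append(steps + [nxt])
        if not candidates:
            break
        # keep the beam_width partial trajectories with the highest PRM score
        candidates.sort(key=lambda s: prm_step_score(prompt, s), reverse=True)
        beams = candidates[:beam_width]
    return beams[0]
```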
2.2. Process Reward Model
Process Reward Models (PRMs) focus on evaluating LLMs at the step level. Lightman et al. (2023) show that this fine-grained guidance can lead to better TTS performance than the global-level Outcome Reward Model (ORM). However, accurately identifying logical errors in LLM outputs remains challenging, and PRMs require high-quality task-specific annotated data for training. To this end, recent works (Wang et al., 2023) leverage Monte Carlo estimation to automatically assign step-level scores using only the final answers as supervision. Guan et al. (2025) and Zhang et al. (2024) iteratively synthesize data via MCTS and fine-tune both the LLM and the PRM, improving performance across both models. Tan et al. (2025) follow the LLM-as-a-judge method and introduce a new LLM to annotate the reward of each step. Nonetheless, Zhang et al. (2025) point out that labels generated by Monte Carlo estimation can be noisy, as incorrect reasoning processes may still yield correct final answers. They further propose a hybrid approach that combines Monte Carlo estimation with LLM-as-a-judge annotation. Despite these advances, existing PRMs still face several challenges. First, PRMs are typically trained as separate large-scale LLMs, incurring significant training and inference costs. Second, most PRM training methods follow an off-policy strategy, which limits their ability to discriminate outputs generated by the target LLM; the distribution shift at inference time may further degrade performance. To address these issues, we propose a Reflective Generative Form, which shares most parameters between the PRM and the target LLM, and supports on-policy optimization with only outcome rewards, enabling more efficient and better-aligned training.
4.1. Unified Interface in Reflective Generative Form
Our proposed Reflective Generative Form establishes a unified interface for the policy model and the PRM. For the policy model, we employ reasoning LLMs that contain the thinking process in the response, delineated by the model's thinking tags (e.g., '<think>' and '</think>'). For the PRM, we attach a lightweight Self-supervised Process Reward Model (SPRM) head on top of the policy model's shared backbone to score the reasoning steps, introducing only 53M extra parameters for trajectory scoring.
Within this unified form, the policy model first generates multiple thinking processes as reasoning trajectories. The SPRM then evaluates each thinking process for reasoning trajectory selection. The evaluation procedure contains two steps: (1) segmenting the reasoning trajectory into discrete steps and (2) predicting a trajectory score based on the evaluation of each step.
Step Segmentation. We segment each reasoning trajectory using tokens that are already supported by the policy model’s tokenizer, eliminating the need to introduce additional step-specific tokens or fine-tune the LLM for step-format outputs. Specifically, we treat tokens containing ’.\n\n’ as step-tokens and split the trajectory accordingly. Additionally, we retain only the first token in any sequence of consecutive step-tokens and ignore the step-token appearing at the beginning of the trajectory, as it does not contain substantive solution information.
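As a sketch of this rule (an illustration only, operating on decoded token strings rather than the tokenizer's internal IDs), the step-token positions can be collected as follows:

```python
from typing import List

def step_token_positions(tokens: List[str]) -> List[int]:
    """Return indices of step-tokens marking the end of each reasoning step."""
    positions: List[int] = []
    prev_was_step = False
    for idx, tok in enumerate(tokens):
        is_step = ".\n\n" in tok
        # keep only the first token of a run of consecutive step-tokens,
        # and ignore a step-token at the very start of the trajectory
        if is_step and not prev_was_step and idx > 0:
            positions.append(idx)
        prev_was_step = is_step
    return positions

# Example: tokens = ["Let", " x", " =", " 2", ".\n\n", "Then", ...]
# -> positions = [4], marking the end of the first reasoning step.
```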
Trajectory Score Prediction. After using step-tokens to mark the end of individual reasoning steps, we evaluate each step based on the representation of the corresponding step-token. Since the representation in the last layer mainly captures the logit prediction for a single token, we instead use the hidden representations from the second-to-last layer of the policy model, which provide richer contextual information about the entire step. These representations are fed into the SPRM head to predict a process score for each step. The final score for the entire reasoning trajectory is computed as the geometric mean of the individual process scores:
$$S_{\text{final}} = \left( \prod_{i=1}^{n} \text{Score}_i \right)^{\frac{1}{n}} = \left( \prod_{i=1}^{n} \text{SPRM}\left(f_{\text{token}_i}\right) \right)^{\frac{1}{n}},$$
where $n$ denotes the total number of steps, $f_{\text{token}_i}$ is the representation of the $i$-th step-token obtained from the policy model, and $\text{Score}_i$ is the SPRM's process score for the $i$-th step. Through this unified interface, a single network can generate reasoning trajectories and score them in parallel, enabling joint training in an end-to-end manner. This design facilitates a straightforward and efficient training pipeline for on-policy PRM learning, where both the policy model and the SPRM continuously refine their parameters from shared experiences, thereby improving the overall quality of the generated trajectories.
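The computation below is a minimal PyTorch-style sketch of this scoring step. The two-layer MLP with a sigmoid output is an assumption for illustration (this section only specifies a lightweight SPRM head); gathering second-to-last-layer hidden states at the step-token positions and aggregating with a geometric mean follow the description above.

```python
from typing import List
import torch
import torch.nn as nn

class SPRMHead(nn.Module):
    """Lightweight scoring head on top of the shared backbone (illustrative architecture)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.SiLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, step_hidden: torch.Tensor) -> torch.Tensor:
        # step_hidden: [num_steps, hidden_size] -> per-step scores in (0, 1)
        return torch.sigmoid(self.mlp(step_hidden)).squeeze(-1)

def trajectory_score(hidden_states: torch.Tensor,
                     step_positions: List[int],
                     head: SPRMHead) -> torch.Tensor:
    """hidden_states: [seq_len, hidden_size] from the second-to-last layer of the policy model."""
    step_hidden = hidden_states[step_positions]   # gather step-token representations f_token_i
    scores = head(step_hidden)                    # Score_i for each step
    # geometric mean of per-step scores, computed in log space for numerical stability
    return torch.exp(torch.log(scores).mean())
```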
However, since a correct final answer may include incorrect intermediate steps and vice versa (Lightman et al., 2023), we introduce a self-supervised dynamic weight $w_i$ to mitigate supervision noise. Specifically, we use the SPRM head's own prediction on each step as a pseudo label and set $w_i = 1$ only if the pseudo label is consistent with the correctness of the final answer. This dynamic filtering allows the model to avoid noisy samples and focus on the most representative steps of correct and incorrect solutions. Thus, by enlarging the score gap between correct and incorrect steps, the SPRM can progressively learn process evaluation with only outcome-level annotations.
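As an illustration of this filtering rule, the sketch below computes $w_i$ from the SPRM head's own (detached) predictions and applies it to a per-step binary cross-entropy objective; the BCE form is an assumption, since this section specifies the weighting rule but not the exact loss.

```python
import torch
import torch.nn.functional as F

def sprm_loss(step_scores: torch.Tensor, answer_correct: bool) -> torch.Tensor:
    """step_scores: [num_steps] SPRM scores in (0, 1); answer_correct: outcome reward of the trajectory."""
    # outcome-derived target: every step inherits the final answer's correctness
    target = torch.full_like(step_scores, 1.0 if answer_correct else 0.0)
    # pseudo labels from the SPRM head's own predictions (threshold 0.5, no gradient)
    pseudo_labels = (step_scores.detach() > 0.5).float()
    # w_i = 1 only where the pseudo label agrees with the outcome target
    weights = (pseudo_labels == target).float()
    per_step = F.binary_cross_entropy(step_scores, target, reduction="none")
    return (weights * per_step).sum() / weights.sum().clamp(min=1.0)
```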