S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models

Paper · arXiv 2504.10368 · Published April 14, 2025

We introduce S1-Bench, a novel benchmark designed to evaluate Large Reasoning Models' (LRMs) performance on simple tasks that favor intuitive system 1 thinking over deliberative system 2 reasoning. While LRMs have achieved significant breakthroughs in complex reasoning tasks through explicit chains of thought, their reliance on deep analytical thinking may limit their system 1 capabilities. Moreover, no benchmark currently exists to evaluate LRMs' performance on tasks that require such capabilities. To fill this gap, S1-Bench presents a set of simple, diverse, and naturally clear questions across multiple domains and languages, specifically designed to assess LRMs' performance on such tasks.

Large Reasoning Models (LRMs), characterized by explicitly generating an external thinking process before the final answer (Kumar et al., 2025b; Chen et al., 2025), mark a paradigm shift from the intuitive system 1 thinking of traditional LLMs to deliberative system 2 reasoning (Li et al., 2025b; Qu et al., 2025), and thus achieve superior performance on complex tasks. The development of recent LRMs has largely followed two main approaches: large-scale reinforcement learning (RL) and model distillation. Models trained via large-scale RL (Guo et al., 2025; Team, 2025b; Team et al., 2025) leverage reward-based optimization to gradually incentivize deliberative reasoning.

To better understand the causes of inefficiency in LRMs on S1-Bench, we analyze thinking processes where the final answer is correct and the format is strictly correct and non-empty. We begin by segmenting each thinking process into solutions, where each solution ends at a point at which the LRM explicitly arrives at a conclusion that directly aligns with the correct answer. Segmentation is performed by DeepSeek-V3, with prompts detailed in Table 12. We then compute the average initial thinking cost for each LRM. For each sample, if the thinking process contains at least one solution, the cost is defined as the token count of the first solution; if no clear and correct solution is provided, the cost is the total token count of the thinking process.
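Once the segmentation is available, the per-sample cost is simple to compute. Below is a minimal Python sketch of this metric; the `count_tokens` helper and the pre-segmented `solutions` list are our own illustrative assumptions, since the paper counts tokens with each model's tokenizer and segments with DeepSeek-V3.

```python
from typing import List

def count_tokens(text: str) -> int:
    # Placeholder tokenizer (whitespace split); the paper's counts
    # would come from each model's own tokenizer.
    return len(text.split())

def initial_thinking_cost(solutions: List[str], full_thinking: str) -> int:
    """Initial thinking cost for one sample.

    With at least one clear, correct solution, the cost is the token
    count of the first solution segment; otherwise it is the total
    token count of the whole thinking process.
    """
    if solutions:
        return count_tokens(solutions[0])
    return count_tokens(full_thinking)
```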

Generating unnecessary solution rounds after reaching the correct answer is one of the reasons for the inefficiency of LRMs. We further examine the distribution of solution rounds among various LRMs on S1-Bench (Figure 4(b)) and find that models with longer thinking processes tend to produce excessive solution rounds, repeatedly re-verifying simple problems that have already been solved.
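Given the same segmentation, tallying this distribution amounts to counting solution segments per sample. A hedged sketch follows; the `solutions` field name is an assumption, not the paper's data format:

```python
from collections import Counter
from typing import Dict, List

def solution_round_distribution(samples: List[Dict]) -> Counter:
    """Map k -> number of samples whose thinking contains k solution rounds."""
    return Counter(len(sample["solutions"]) for sample in samples)

# Example: three samples with 1, 3, and 3 solution rounds.
dist = solution_round_distribution([
    {"solutions": ["s1"]},
    {"solutions": ["s1", "s2", "s3"]},
    {"solutions": ["s1", "s2", "s3"]},
])
print(dist)  # Counter({3: 2, 1: 1})
```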

Two further patterns emerge: (1) LRMs with lower accuracy often include incorrect intermediate conclusions in their reasoning, even when they ultimately reach correct final answers (light green). (2) Although LRMs sometimes reach the correct answer during reasoning, they may deviate and ultimately produce an incorrect conclusion (light red).

Finally, we discover an intriguing phenomenon: LRMs can prejudge certain simple questions.

LRMs possess the ability to prejudge question simplicity, especially in Chinese. All LRMs exhibit prejudgment phenomena within their thinking processes, demonstrating an ability to directly assess question difficulty.
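This excerpt does not detail how prejudgment is detected; as a purely hypothetical illustration, one could flag explicit difficulty remarks in the thinking text. The cue phrases below are invented examples, not the paper's criteria:

```python
# Hypothetical prejudgment detector. The cue phrases are illustrative
# assumptions, not the paper's actual detection method.
PREJUDGMENT_CUES = (
    "this is a simple question",
    "this is straightforward",
    "这是一个简单的问题",  # Chinese cue, since prejudgment is most frequent in Chinese
)

def has_prejudgment(thinking: str) -> bool:
    lowered = thinking.lower()
    return any(cue in lowered for cue in PREJUDGMENT_CUES)
```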

Even with prejudgment, the thinking length of LRMs does not shorten. As shown in Figure 7, the ART of thinking processes exhibiting prejudgment does not decrease. These results suggest that LRMs possess an inherent understanding of question difficulty, opening a novel pathway toward dual-system compatibility for LRMs; we identify further exploration of this phenomenon as an important direction for future work.
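To check such a claim numerically, one could compare mean thinking length across the two groups. A minimal sketch, reusing the assumed `count_tokens` and `has_prejudgment` helpers from the earlier sketches (this is not the paper's code):

```python
from typing import Dict, List

def mean_thinking_tokens(samples: List[Dict], prejudged: bool) -> float:
    """Mean thinking-token count over samples with/without detected prejudgment."""
    lengths = [
        count_tokens(s["thinking"])
        for s in samples
        if has_prejudgment(s["thinking"]) == prejudged
    ]
    return sum(lengths) / len(lengths) if lengths else float("nan")
```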