OpenThoughts: Data Recipes for Reasoning Models
The posttraining process equips these models with the ability to output long chains of thought, or "thinking tokens," at inference time, which can guide the model toward the correct answer. Yet the complete recipes for frontier reasoning models are not public, making research on building reasoning models difficult.
Existing works, such as SkyT1 (NovaSky-Team, 2025b) and S1 (Muennighoff et al., 2025), adopt nearly identical model architectures and training setups to those of typical instruction tuning, yet still achieve performance improvements by focusing on improving the training datasets. These examples highlight curating high-quality SFT data as a key lever for reasoning performance.
Most of these projects, however, explore only a limited fraction of the possible design choices, for example by relying on human-written questions or using DeepSeek-R1 as the sole teacher. Recreating reasoning models requires exploring a large design space of strategies for generating question-answer pairs for reasoning (Hugging Face, 2025). This exploration is prohibitively expensive for many researchers due to the high costs of teacher inference and model training. In the absence of these expensive experiments, many papers rely on existing heuristics and intuitions to inform their data design choices.
- Sampling multiple answers per question from a teacher model is an effective technique
to increase the size of a data source by at least 16×. The increased dataset scale drives
significant performance gains.
- Models with better performance are not necessarily better teachers. QwQ-32B is a stronger
teacher than DeepSeek-R1, although it scores lower on target reasoning benchmarks.
- We experimented with numerous verification and answer filtering methods, and none gave
significant performance improvements.
- Selecting questions from a small number (top 1 or 2) of high-quality sources leads to better
downstream performance compared to optimizing for diversity (i.e., top 8 or 16 sources).
- Filtering questions by LLM-labeled difficulty or LLM response length yields better results
than filters typical of pre-training data curation that use embeddings or fastText.
The two highest-performing question filtering methods are difficulty-based filtering and response-length filtering. Difficulty-based filtering asks an LLM (GPT-4o-mini) to rate the difficulty of each question, then retains the most difficult questions; it is the winning strategy for code. Response-length filtering asks an LLM to answer each question directly, then selects the questions that elicit the longest LLM-generated responses; it performs best for math and science.
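Response-length filtering can be sketched as a simple ranking over LLM responses. The `respond` callable below is a hypothetical stand-in for a call to an LLM such as GPT-4o-mini; the paper does not specify an implementation, so any string-to-string function works for illustration.

```python
from typing import Callable, List


def filter_by_response_length(
    questions: List[str],
    respond: Callable[[str], str],
    keep_fraction: float = 0.5,
) -> List[str]:
    """Keep the fraction of questions that elicit the longest responses.

    `respond` is a placeholder for an LLM call; in practice it would
    query a model and return the generated answer text.
    """
    # Score each question by the length of the model's response.
    scored = [(len(respond(q)), q) for q in questions]
    # Sort longest-response-first and keep the top fraction.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    n_keep = max(1, int(len(scored) * keep_fraction))
    return [q for _, q in scored[:n_keep]]
```

Difficulty-based filtering follows the same shape, with the score replaced by an LLM-assigned difficulty rating rather than a response length.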
For code and science data, various combinations of deduplication and answer multiplicity yield similar results. For example, for the code domain, the baseline of no deduplication with 1× answer per question performs only 0.7 points worse on average than exact deduplication with 16× answers per question. For math, exact deduplication with 4× answers per question performs best, with 16× answers per question a close second; we adopt the 16× option moving forward, as it scales better. Similar to Section 4.2, the results here indicate that the benefits of question diversity may be limited on the reasoning benchmarks we measure, at least once answer diversity increases. Thus, for math and science, we select exact deduplication with 16× answers per question; for code, we use no deduplication with 16× answers per question.
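The winning math/science recipe, exact deduplication followed by sampling multiple answers per question, can be sketched as below. The `generate_answer` callable is a hypothetical stand-in for a (stochastic) teacher-model call such as QwQ-32B; the function names and record format are illustrative, not from the paper.

```python
from typing import Callable, Dict, List


def dedup_and_expand(
    questions: List[str],
    generate_answer: Callable[[str], str],
    n_answers: int = 16,
) -> List[Dict[str, str]]:
    """Exact-deduplicate questions, then sample n_answers per question.

    `generate_answer` is a placeholder for a teacher-model call; a real
    pipeline would sample with nonzero temperature so repeated calls
    yield diverse answers.
    """
    # Exact deduplication: keep the first occurrence of each question.
    seen = set()
    unique: List[str] = []
    for q in questions:
        key = q.strip()
        if key not in seen:
            seen.add(key)
            unique.append(q)
    # Expand each unique question into n_answers question-answer pairs.
    return [
        {"question": q, "answer": generate_answer(q)}
        for q in unique
        for _ in range(n_answers)
    ]
```

For the code domain, the same expansion is applied with the deduplication step skipped.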