What alignment data structure best trains reasoning generalists?
Explores whether preference trees—with diverse reasoning chains, multi-turn critique loops, and pairwise contrasts—offer a structured way to build alignment datasets that improve open-model reasoning across domains.
Open models lag proprietary ones at all-around reasoning, and Eurus argues the gap is primarily about data and preference-learning technique, not architecture. Its contribution is ULTRAINTERACT, an alignment dataset structured as a preference tree for each instruction: (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and a critique, and (3) pairwise data to enable preference learning. Trained on it, Eurus-70B beats GPT-3.5 Turbo across 12 reasoning tests and substantially outperforms open models on hard benchmarks (LeetCode, TheoremQA).
The takeaway is that the data structure is the lever. A preference tree captures not just correct solutions but also the branching of strategies, the interaction-and-critique loop, and the pairwise contrasts that preference learning needs. One dataset therefore serves both SFT and preference optimization, and the tree shape is what carries the "how to reason well" signal.
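To make the "one tree, two training signals" idea concrete, here is a minimal sketch in Python. It is an illustration, not the schema Eurus actually ships: the class names (`Node`, `PreferenceTree`), the field names, the helper functions, and the newline-joined context format are all assumptions made for this note. The sketch treats each turn as a group of sibling actions that share the same context, takes correct actions as SFT targets, and pairs each correct action with each incorrect sibling for preference learning.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """One action in the tree (hypothetical schema, not the dataset's own)."""
    action: str                # a reasoning chain or code attempt
    correct: bool              # whether the action was judged correct
    feedback: str = ""         # environment observation plus critique, if any
    children: List["Node"] = field(default_factory=list)  # next-turn attempts


@dataclass
class PreferenceTree:
    instruction: str           # the root of the tree
    children: List[Node] = field(default_factory=list)


def walk(context: str, siblings: List[Node]) -> Iterator[Tuple[str, List[Node]]]:
    """Yield (shared context, sibling actions) for every turn in the tree."""
    if not siblings:
        return
    yield context, siblings
    for node in siblings:
        # Assumed format: the next turn sees the prior action and its feedback.
        next_context = f"{context}\n{node.action}\n{node.feedback}"
        yield from walk(next_context, node.children)


def sft_examples(tree: PreferenceTree) -> List[Tuple[str, str]]:
    """(context, response) pairs for every correct action."""
    return [
        (ctx, node.action)
        for ctx, group in walk(tree.instruction, tree.children)
        for node in group
        if node.correct
    ]


def preference_pairs(tree: PreferenceTree) -> List[Tuple[str, str, str]]:
    """(context, chosen, rejected) triples from siblings that share a context."""
    pairs = []
    for ctx, group in walk(tree.instruction, tree.children):
        good = [n.action for n in group if n.correct]
        bad = [n.action for n in group if not n.correct]
        pairs.extend((ctx, c, r) for c, r in product(good, bad))
    return pairs


# Toy tree: one instruction, two turns, a critique after the first wrong attempt.
tree = PreferenceTree(
    instruction="What is 17 * 6?",
    children=[
        Node("17 * 6 = 102", correct=True),
        Node(
            "17 * 6 = 96",
            correct=False,
            feedback="Critique: recheck 7 * 6.",
            children=[
                Node("7 * 6 = 42 and 10 * 6 = 60, so 102", correct=True),
                Node("7 * 6 = 36, so 96", correct=False),
            ],
        ),
    ],
)

print(len(sft_examples(tree)), "SFT examples")
print(len(preference_pairs(tree)), "preference pairs")
```

On this toy tree, both the SFT view and the preference view come from the same traversal; the second-turn examples carry the critique in their context, which is where the multi-turn signal would enter under these assumptions.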
This connects the vault's reasoning-data and preference-learning threads. The tree of strategies plus pairwise contrasts is a structured-data complement to method-side findings such as "When does RL actually extend reasoning beyond pretraining?" (data targeting matters) and to the diversity emphasis in "What limits reasoning capability beyond math and code?"
Related concepts in this collection 2
- When does RL actually extend reasoning beyond pretraining?
  Does reinforcement learning genuinely expand a model's reasoning capabilities, or does it merely improve sampling from existing knowledge? This question hinges on whether pretraining provides sufficient foundation and whether RL targets tasks within reach.
  both center reasoning-data design; preference trees are a structured way to target useful contrasts
- What limits reasoning capability beyond math and code?
  Can scaling reasoning to open-ended domains like economics and social sciences be solved by better training methods, or does the real bottleneck lie elsewhere? This explores what actually constrains broader reasoning.
  diverse-strategy chains echo the diversity-as-lever finding
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Advancing LLM Reasoning Generalists with Preference Trees
- ALIGN: Prompt-based Attribute Alignment for Reliable, Responsible, and Personalized LLM-based Decision-Making
- Foundations of Large Language Models
- OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning
- RewardBench: Evaluating Reward Models for Language Modeling
- The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think
- Reasoning Language Models: A Blueprint
- Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling
Original note title
preference trees with diverse chains multi-turn critique and pairwise data are the alignment data that makes reasoning generalists