Can reasoning models actually sustain long-chain reflection?
Tests whether large reasoning models genuinely perform self-correction and backtracking, or merely simulate it fluently. Uses constraint satisfaction problems where performance cannot be faked by surface plausibility.
LR²Bench takes the central marketing claim of Large Reasoning Models (that they can sustain long-chain reflective reasoning, making assumptions, backtracking, and self-refining over many steps) and tests it where the claim cannot be faked by surface fluency. The benchmark comprises 850 Constraint Satisfaction Problems across six task families spanning knowledge-based, logical, and spatial reasoning. DeepSeek-R1 averages 20.0% Exact Match; OpenAI o1-preview averages 23.6%. These are frontier LRMs, on tasks designed to require exactly the capability they were trained for.
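To pin down what the metric demands, here is a minimal sketch of strict Exact Match grading for CSP outputs, assuming solutions are serialized as variable-to-value mappings; the function names are illustrative, not the paper's code:

```python
def exact_match(predicted: dict, gold: dict) -> int:
    """All-or-nothing grading: every variable must carry the gold value.
    A 99%-correct grid scores the same as an empty one."""
    return int(predicted == gold)

def benchmark_em(pairs: list) -> float:
    """Mean Exact Match over (predicted, gold) pairs,
    e.g. the 850 LR²Bench instances."""
    return sum(exact_match(p, g) for p, g in pairs) / len(pairs)
```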
CSPs are the right test because they are unforgiving in a specific way. A candidate solution either satisfies all constraints or it doesn't; there is no partial-credit reading where the trace merely looks plausible. Reflection in CSPs requires real backtracking: when a partial assignment violates a constraint, the solver must abandon that branch and try another (sketched below). A surface-level "wait, let me reconsider" does not satisfy a constraint that was just violated. The 20-23% ceiling means that on roughly 80% of these problems, reflective fluency fails to convert into reflective competence.
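To make "real backtracking" concrete, a minimal sketch of the search behavior these tasks demand, a generic depth-first CSP solver; the representation (constraints as scope-predicate pairs) is an assumption for illustration, not anything from the benchmark:

```python
def consistent(assignment, constraints):
    """A partial assignment is consistent if no fully-instantiated
    constraint is violated; constraints with unassigned variables wait."""
    for scope, predicate in constraints:
        if all(v in assignment for v in scope):
            if not predicate(*(assignment[v] for v in scope)):
                return False
    return True

def backtrack(assignment, variables, domains, constraints):
    """Depth-first search with explicit backtracking: on any violation,
    abandon the branch and try the next value. This abandon-and-retry
    step is what a "let me reconsider" token sequence must actually
    perform to solve the instance."""
    if len(assignment) == len(variables):
        return dict(assignment)          # complete, consistent assignment
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        if consistent(assignment, constraints):
            solution = backtrack(assignment, variables, domains, constraints)
            if solution is not None:
                return solution
        del assignment[var]              # undo and try another value
    return None                          # dead branch: caller backtracks

# Toy instance: two cells that must differ.
variables = ["a", "b"]
domains = {"a": [1, 2], "b": [1]}
constraints = [(("a", "b"), lambda a, b: a != b)]
print(backtrack({}, variables, domains, constraints))  # {'a': 2, 'b': 1}, after one backtrack
```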
This converges with "Does the reasoning cliff depend on how we test models?": text-only LRM evaluation reveals the cliff that tool-augmented evaluation often hides. It also converges with "Do language models fail at reasoning due to complexity or novelty?": frontier LRMs are not failing on long chains in general; they are failing on chains whose instance structure was not in training. CSPs are precisely such structure: each instance is a fresh combinatorial space.
The methodological provocation is that CSPs are exactly where "symbolic solver integration improves faithful logical reasoning by offloading complex execution from unreliable LLM reasoning to deterministic systems" would predict tool-enabled rescue. The 20% figure is the unaided ceiling. Whether tool access closes the gap is the next question; without tools, the gap is large enough to call long-chain reflection "theatrical" in the technical sense: fluent, well-formed, and not actually doing the work.
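A hedged sketch of what tool-enabled rescue would look like, reusing the backtracking solver above; `extract_csp` is a hypothetical formalization step, not an existing API:

```python
def solve_with_tool(puzzle_text, model):
    """Tool-rescue loop (a sketch, not the paper's setup): the model only
    translates prose into a formal CSP spec; the deterministic solver
    does all of the search and backtracking."""
    variables, domains, constraints = model.extract_csp(puzzle_text)  # hypothetical API
    return backtrack({}, variables, domains, constraints)
```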
Source paper: LR²Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems
Related concepts in this collection
- Does the reasoning cliff depend on how we test models? If language models hit a capability wall in text-only reasoning tasks, does that wall disappear when they can use tools? What does this reveal about what we're actually measuring? (linked as: text-only ceiling versus tool-enabled rescue)
- Do language models fail at reasoning due to complexity or novelty? Explores whether reasoning-model failures stem from task complexity thresholds or from encountering unfamiliar instances. Tests whether scaling chain length actually addresses the root cause of reasoning breakdown. (linked as: instance unfamiliarity explains CSP collapse)
Original note title: constraint satisfaction is the missing benchmark for reflective reasoning — even o1-preview and DeepSeek-R1 only hit 20-23.6% Exact Match