Should reasoning benchmarks score final answers or reasoning traces?
Current reasoning benchmarks often credit plausible-looking reasoning steps even when final answers are wrong. Does measuring outcomes instead of traces reveal whether models actually solve problems, or does it miss important reasoning capability?
LR²Bench scores Exact Match on the final solution against deterministic CSP ground truth. It does not score the trace. This is the methodological choice that produces the dramatic 20-23.6% number, and it is the choice most other reasoning benchmarks have been quietly avoiding. Trace-based evaluation — does the reasoning look right, are the reflective phrases present, does the chain have the expected structure — would have inflated the result by counting plausible-looking reflection as evidence of reflection. CSPs do not allow that inflation because the constraint either holds or it doesn't.
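To make the contrast concrete, here is a minimal sketch of outcome-only scoring against a deterministic CSP verifier. This is an illustration in Python with hypothetical names, not the LR²Bench harness itself: the candidate assignment either matches the ground-truth solution and satisfies every constraint, or it scores zero, regardless of how the surrounding reasoning reads.

```python
# Sketch of outcome-only CSP scoring (hypothetical names, not the official
# LR²Bench code). A constraint is any predicate over a full assignment;
# the trace the model produced is never inspected.
from typing import Callable, Dict, List

Assignment = Dict[str, int]
Constraint = Callable[[Assignment], bool]

def satisfies_all(assignment: Assignment, constraints: List[Constraint]) -> bool:
    """Deterministic verification: every constraint must hold exactly."""
    return all(c(assignment) for c in constraints)

def exact_match(predicted: Assignment, ground_truth: Assignment) -> bool:
    """Exact Match on the final solution, and nothing else."""
    return predicted == ground_truth

def score(predicted: Assignment,
          ground_truth: Assignment,
          constraints: List[Constraint]) -> float:
    # The constraint check is redundant when the ground truth is unique,
    # but it keeps the verifier self-contained and catches grading bugs.
    ok = exact_match(predicted, ground_truth) and satisfies_all(predicted, constraints)
    return 1.0 if ok else 0.0

# Toy CSP: x, y in {1..3}, x < y, x + y == 4  ->  unique solution x=1, y=3
constraints = [
    lambda a: a["x"] < a["y"],
    lambda a: a["x"] + a["y"] == 4,
]
print(score({"x": 1, "y": 3}, {"x": 1, "y": 3}, constraints))  # 1.0
print(score({"x": 2, "y": 2}, {"x": 1, "y": 3}, constraints))  # 0.0, however fluent the trace
```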
The lesson generalizes. "Do reasoning traces actually cause correct answers?" argues the principle: derivational traces are stylistic mimicry of reasoning, not verified reasoning. "Does RLVR actually improve mathematical reasoning or just coherence?" argues the empirical version: training improves trace coherence without improving trace validity. LR²Bench operationalizes the methodological response: measure the outcome, not the trace, on tasks where the outcome is independently verifiable.
The harder corollary: many existing reasoning benchmarks are partly trace-evaluation in disguise. Math benchmarks where partial-credit grading is permissive, multi-step reasoning where intermediate steps can be "interpretation-credited" by graders, dialogue tasks where helpfulness is judged on tone — these all give credit for reflective appearance even when outcomes are wrong or absent. CSPs are valuable not because they are common in real applications but because they are epistemically clean: they isolate whether the model can do the thing, free from rhetorical credit.
For benchmark design more broadly, the LR²Bench template is: pick tasks with deterministic verifiers; measure final outcome; do not score the trace. Apply that template to a domain and the reasoning theater collapses into whatever reasoning is actually happening. Twenty percent on CSPs is the floor after the theater is removed. Benchmarks that produce higher numbers should explain how their design avoids re-introducing trace credit — and most cannot.
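A sketch of that template applied end to end, under the same assumptions as the snippet above (hypothetical helper names, not benchmark internals): the only place the model's text is touched is answer extraction, and the reported number is simply the fraction of final answers that verify.

```python
# Template: deterministic verifier, final outcome only, no trace scoring.
# extract_final_assignment and the record format are assumptions for illustration.
import re
from typing import Callable, Dict, List, Optional

Assignment = Dict[str, int]
Constraint = Callable[[Assignment], bool]

def extract_final_assignment(model_output: str) -> Optional[Assignment]:
    """Collect var=value pairs from the output, keeping the last value per variable."""
    pairs = re.findall(r"\b([A-Za-z]\w*)\s*=\s*(-?\d+)", model_output)
    return {k: int(v) for k, v in pairs} if pairs else None

def benchmark_accuracy(records) -> float:
    """records: iterable of (model_output, ground_truth, constraints) triples."""
    hits, total = 0, 0
    for output, truth, constraints in records:
        total += 1
        predicted = extract_final_assignment(output)
        if predicted == truth and all(c(predicted) for c in constraints):
            hits += 1
    return hits / total if total else 0.0

# Two toy outputs for the CSP x < y, x + y == 4: only the first one verifies.
csp = [lambda a: a["x"] < a["y"], lambda a: a["x"] + a["y"] == 4]
records = [
    ("...reflecting on the constraints, finally x = 1 and y = 3", {"x": 1, "y": 3}, csp),
    ("the constraints suggest x = 2, y = 2", {"x": 1, "y": 3}, csp),
]
print(benchmark_accuracy(records))  # 0.5
```

The design choice worth noting is that a fluent but wrong second output contributes exactly nothing: there is no partial credit, no grading of intermediate steps, and therefore no way for reflective appearance to leak back into the score.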
Source (Reasoning Methods · CoT · ToT): LR²Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems
Related concepts in this collection
- Do reasoning traces actually cause correct answers? Explores whether the intermediate "thinking" tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors. (principle: traces are mimicry, not verification)
- Does RLVR actually improve mathematical reasoning or just coherence? RLVR post-training makes reasoning traces locally more consistent, but does this structural improvement translate to valid mathematical proofs? We investigate whether trace coherence is sufficient for correctness. (empirical: training improves coherence not validity)
Original note title
reflection benchmarks should be solution-verifiable not trace-verifiable — Exact Match on the answer cuts through reasoning theater