ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning

Paper · arXiv 2502.01100 · Published February 3, 2025
Tags: Flaws · Inference time scaling · Domain Specialization · Self Refinement · Self Consistency · Feedback

Our results reveal a significant decline in accuracy as problem complexity grows—a phenomenon we term the “curse of complexity.” This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.

An ideal evaluation framework must: (1) isolate pure logical reasoning from domain knowledge; (2) enable precise control over problem complexity; (3) minimize data leakage to prevent training-data memorization; and (4) provide objective metrics for assessing an LLM’s reasoning results.

Constraint satisfaction problems (CSPs) offer such a controlled framework.
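To make this concrete, here is a minimal sketch (ours, not the paper’s generator) of a ZebraLogic-style puzzle encoded as a CSP: a 3-house grid with two attributes, solved by exhaustive search over permutations. The names, drinks, and clue set are illustrative assumptions.

```python
# Minimal CSP sketch of a ZebraLogic-style puzzle (illustrative only):
# 3 houses, two attributes (occupant, drink), three clues.
from itertools import permutations

NAMES = ["Alice", "Bob", "Carol"]
DRINKS = ["tea", "coffee", "milk"]

def satisfies(names, drinks):
    """Check the clue set against one candidate assignment.
    names[i] / drinks[i] are the occupant and drink of house i (0-indexed)."""
    return (
        drinks[names.index("Alice")] == "tea"        # Clue 1: Alice drinks tea.
        and names.index("Bob") == 0                  # Clue 2: Bob lives in house 1.
        and abs(drinks.index("milk") - names.index("Carol")) == 1
        # Clue 3: the milk drinker lives next to Carol.
    )

solutions = [
    (n, d)
    for n in permutations(NAMES)
    for d in permutations(DRINKS)
    if satisfies(list(n), list(d))
]
for names, drinks in solutions:
    print(list(zip(names, drinks)))  # unique solution: Bob-milk, Carol-coffee, Alice-tea
```

Because the grid size and number of clues can be dialed up or down programmatically, this construction satisfies all four criteria above: it needs no domain knowledge, its complexity is controllable, instances can be generated fresh to avoid leakage, and the unique solution gives an objective metric.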

These findings suggest that the limited reasoning of current LLMs is not solely a matter of model- or sample-size scaling, but also arises from insufficient test-time compute. This shortfall underscores the need to explicitly train LLMs to reason step by step (Wei et al., 2022), e.g., via reinforcement learning (Lambert et al., 2024a), as exemplified by emerging reasoning models such as o1 and R1. Specifically, we conduct a systematic investigation into the scaling behavior of LLMs in logical reasoning, focusing on three key dimensions: model size (§4), sampling (§5), and test-time compute (§6). Understanding the scaling behavior of LLMs in reasoning is critical for identifying the most promising directions for advancing their reasoning capabilities and for guiding future research efforts more effectively.

We find it much more promising to scale up the reasoning tokens (i.e., chains of thought; CoTs) generated during inference with a backtracking mechanism. Taking OpenAI’s o1 models as a representative example, we show that they generate nearly 10x more (hidden) reasoning tokens than other models, and that this count scales properly with problem complexity. Based on our empirical results, we also find that there exists an optimal ratio of reasoning tokens to Z3 conflicts, but o1-like models cannot always reach this ratio when complexity is extremely high, and thus fail to achieve perfect reasoning (§6).
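As a hedged illustration of the Z3-conflict measure (our sketch, not the paper’s code): the z3 Python bindings expose per-solve statistics, from which a conflict count can be read as a rough proxy for search difficulty. Whether a “conflicts” entry appears depends on the solver engine and z3 version, so the lookup below is defensive; the toy constraints are illustrative.

```python
# Sketch: read Z3's conflict count for a toy constraint problem as a
# complexity proxy (assumes `pip install z3-solver`; the 'conflicts' key
# may be absent for trivially easy instances or other z3 versions).
from z3 import Ints, Solver, Distinct, sat

a, b, c = Ints("a b c")
s = Solver()
s.add(Distinct(a, b, c))                     # all-different, as in a puzzle grid
s.add(a + b + c == 6, a > 0, b > 0, c > 0)
s.add(a != 1)                                # an extra clue to force some search

assert s.check() == sat
stats = s.statistics()
conflicts = {k: stats.get_key_value(k) for k in stats.keys()}.get("conflicts", 0)
print("model:", s.model(), "| conflicts:", conflicts)
```

Running the same readout over puzzles of increasing grid size gives the conflict counts against which reasoning-token budgets can be compared.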

• Moreover, we explore the potential of self-verification prompting to improve LLMs (§6.2). We find that such methods can help LLMs improve their performance, but the improvement is marginal. We further analyze the reasoning process of o1 and discuss its strengths and weaknesses in logical reasoning (§D).

While some ZebraLogic puzzles can be solved through straightforward linear deduction, many require more complex non-monotonic reasoning strategies, such as counterfactual reasoning that involves backtracking and revising assumptions.
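A minimal sketch (ours, not the paper’s) of what such backtracking looks like procedurally: a depth-first search that commits to an assumption, checks for contradiction, and retracts the assumption when no extension works, mirroring counterfactual “what if” reasoning. The clues are assumed for illustration.

```python
# Sketch of non-monotonic reasoning via backtracking (illustrative):
# assign each of 3 houses a color, retracting assumptions that violate clues.
COLORS = ["red", "green", "blue"]

def consistent(assignment):
    """Clues (assumed for illustration): house colors are all different,
    and the green house is immediately left of the blue house."""
    if len(set(assignment)) != len(assignment):
        return False
    if "green" in assignment and "blue" in assignment:
        if assignment.index("green") + 1 != assignment.index("blue"):
            return False
    return True

def solve(assignment=()):
    if len(assignment) == 3:
        return assignment if consistent(assignment) else None
    for color in COLORS:
        candidate = assignment + (color,)    # counterfactual assumption
        # Prune only when the partial assignment is already contradictory;
        # otherwise recurse, and retract (backtrack) if no extension works.
        if len(set(candidate)) == len(candidate):
            result = solve(candidate)
            if result is not None:
                return result
    return None  # every assumption on this branch was revised away

print(solve())  # -> ('red', 'green', 'blue')
```

Linear deduction corresponds to the search never needing the retract step; the harder ZebraLogic instances are precisely those where assumptions must be made and later revised.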

6.2. Self-Refinement is Limited but Promising

The other notable feature of o1’s hidden reasoning process is its ability to reflect on its own reasoning and refine its answer. From the summaries of its hidden reasoning process, we observe that o1 often revisits the clues and constraints to verify its earlier reasoning and fix errors when it finds any, which resembles the Z3 solver’s conflict-driven clause learning mechanism. To elicit such self-refinement behavior from LLMs, we add follow-up queries in a multi-turn conversation setting, asking the model to review its initial answer and re-check the clues and constraints. There are two settings for the self-refinement process: one with oracle knowledge of the correct answer and one without. Results in Table 2 show modest improvements from self-verification, particularly without oracle knowledge (GPT-4o improves from 31.7 to 33.0, then drops to 32.1 on the second iteration).
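A minimal sketch of the oracle-free self-refinement loop, under assumptions: `call_llm` is a hypothetical chat-completion wrapper (messages in, assistant text out), and the review-prompt wording is ours, not the paper’s.

```python
# Sketch of oracle-free self-refinement via multi-turn prompting.
# `call_llm(messages) -> str` is a hypothetical wrapper around any
# chat-completion API; the prompts are illustrative, not the paper's.

REVIEW_PROMPT = (
    "Review your answer above. Re-check every clue and constraint of the "
    "puzzle, point out any violated constraint, and output a corrected "
    "final answer."
)

def self_refine(puzzle: str, call_llm, iterations: int = 2) -> list[str]:
    """Ask for an initial answer, then run `iterations` review turns.
    Returns the answer produced at each turn (initial answer first)."""
    messages = [{"role": "user", "content": puzzle}]
    answers = []
    for _ in range(iterations + 1):
        answer = call_llm(messages)
        answers.append(answer)
        # Keep the full dialogue so the model can inspect its own answer.
        messages.append({"role": "assistant", "content": answer})
        messages.append({"role": "user", "content": REVIEW_PROMPT})
    return answers
```

In the oracle setting, the follow-up query would additionally reveal whether the current answer is correct before asking for a revision; the Table 2 numbers suggest that gains from either loop saturate after roughly one iteration.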