Complex Logical Instruction Generation

Paper · arXiv 2508.09125 · Published August 12, 2025
Tags: Evaluations · Linguistics, NLP, NLU · Argumentation · Natural Language Inference · Reasoning Critiques

Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval.

Before the advent of ChatGPT (Ouyang et al., 2022), chatbots built on earlier models such as GPT-1 (Radford & Narasimhan, 2018), GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), and other architectures struggled to generate coherent and contextually appropriate utterances. At that time, it was difficult to imagine that such models could assist with tasks in daily life. With the emergence of instruction-following capabilities, large language models (LLMs) are now able to accurately understand basic human intentions and even leverage tools to perform a wide range of tasks that enhance productivity, such as deep research, coding assistance, and scientific discovery (AI4Science & Quantum, 2023). The instructions for these tasks can contain rich logic structures such as sequencing, loops, nesting, recursion, and backtracking. Previous instruction-following evaluations typically focus on instructions with constraints on the response format (e.g., “fewer than 300 words”) or content (e.g., “in Shakespeare’s tone”) (Zhou et al., 2023; Qin et al., 2024; Jiang et al., 2024) and seldom explore how well LLMs perform on instructions with rich logic structures.
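To make "rich logic structures" concrete, consider a minimal sketch of the kind of source function such instructions can be derived from. This is our own hypothetical illustration, not a function from the paper's benchmark; it packs conditionals, nesting, recursion, and a function call into a few lines, each of which a faithful natural language instruction would have to spell out.

```python
# Hypothetical example of a logic-rich source function (not taken from the
# paper): it combines conditionals, recursion, and a function call.

def digit_sum(n: int) -> int:
    """Helper invoked by the main function (a function-call logic element)."""
    return n if n < 10 else n % 10 + digit_sum(n // 10)

def collapse(n: int, steps: int = 0) -> tuple[int, int]:
    """Repeatedly replace n by its digit sum until a single digit remains.

    Returns the final digit and the number of collapse steps. An instruction
    generated from this function must express the stopping condition, the
    recursion, and the step counter entirely in natural language.
    """
    if n < 10:                                 # conditional: base case
        return n, steps
    return collapse(digit_sum(n), steps + 1)   # recursion + nested call

print(collapse(9875))  # (2, 3): 9875 -> 29 -> 11 -> 2 in three steps
```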

To address this, we first propose LogicIFGen, a scalable, automated framework for generating verifiable instructions from code functions, which can naturally contain rich logic structures. Given test input data, models are expected to rely solely on the natural language instruction to simulate every logical step of the code function and produce the same results, analogous to being verbally guided by an examiner to process the input data step-by-step (see Figure 1 (Left)). The models are required to refrain from writing code or using external tools; instead, they must simulate and execute the code logic through text generation alone. This setting aligns more closely with instruction-following evaluation, as most tasks require the model to explicitly unfold the underlying logic. For example, when given an instruction such as “repeat asking the user for clarification until you fully understand the intent of the user,” the model must perform each logical step through natural language alone. LogicIFGen obtains the reference labels by executing the code function on the same inputs. By comparing model outputs with these reference labels, we can easily verify whether a model follows the natural language instruction correctly. In addition, LogicIFGen incorporates state trackers to monitor the intermediate logic flow, enabling us to double-check whether models faithfully adhere to the instruction’s internal logic rather than hallucinating the final results.

Second, we construct a benchmark called LogicIFEval using LogicIFGen, which contains 426 verifiable, logic-rich instructions paired with associated test cases. The functions used to generate LogicIFEval are solutions to challenging simulation problems from the competitive programming platforms CodeForces and POJ. These simulation problem solutions are especially suitable for instruction-following evaluation because they require models to faithfully emulate complex, step-by-step processes and state transitions, often involving intricate control flow, edge-case handling, and the coordination of multiple logic elements.
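The verification loop described above can be sketched as follows. The names (bounded_double, verify, trace) are our own illustration under the paper's stated setup, not its actual API: the source function is instrumented with a state tracker, executing it on the test input yields the reference label together with an intermediate state trace, and the model's text-only simulation is checked against both, so a correct final answer cannot simply be hallucinated without following the logic.

```python
# Hypothetical sketch of LogicIFGen-style verification; function and
# variable names are our own, not the paper's.

def bounded_double(xs: list[int], cap: int, trace: list) -> int:
    """Example instrumented function: doubles values until a cap is exceeded."""
    total = 0
    for i, x in enumerate(xs):
        if total + 2 * x > cap:          # conditional edge case: stop early
            trace.append(("stop", i))
            break
        total += 2 * x
        trace.append(("acc", i, total))  # state tracker records each step
    return total

def verify(model_output: int, model_trace: list, xs: list[int], cap: int) -> bool:
    """Compare the model's claimed result and states with the reference run."""
    ref_trace: list = []
    ref_output = bounded_double(xs, cap, ref_trace)
    # Both the final answer and the intermediate states must match, so a
    # model cannot pass by guessing the result without following the logic.
    return model_output == ref_output and model_trace == ref_trace

# The model simulated the instruction purely in text and reported its final
# answer plus the intermediate states it claims to have passed through.
xs, cap = [3, 4, 10], 20
claimed_output = 14
claimed_trace = [("acc", 0, 6), ("acc", 1, 14), ("stop", 2)]
print(verify(claimed_output, claimed_trace, xs, cap))  # True
```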

Experimental results show that most popular LLMs are only able to correctly follow fewer than 60% of the instructions in LogicIFEval, revealing a significant deficiency in their instruction-following ability (see Figure 1 (Right)). Open-source models continue to lag behind frontier models such as the OpenAI o-series and Anthropic Claude. As logical complexity increases, models find it increasingly difficult to accurately interpret and follow instructions. We also observe that incorporating explicit thinking before generating a response can enhance instruction-following performance for larger LLMs, but not for smaller ones. Further error analysis and case studies reveal key failure modes and highlight promising directions for advancing LLMs’ ability to follow logic-rich instructions.

Verification methods generally fall into two categories: heuristic functions and LLM-as-judge. While LLM-as-judge has been shown to correlate highly with human judgments (Qin et al., 2024), there is still room for improvement (Zeng et al., 2023). In addition to general-purpose instruction-following benchmarks, other datasets target specific scenarios or constraints, such as length control (Zhang et al., 2025), long-context settings (Wu et al., 2024), or agentic scenarios (Qi et al., 2025). Yang et al. (2025) explore LLMs’ ability to adhere to user intent while producing functionally accurate code. In contrast, our work uses code as the source to generate instructions, requiring LLMs to generate text, rather than code, in response. Some studies have suggested that reasoning may decrease instruction-following performance (Li et al., 2025; Fu et al., 2025). However, our findings in Section 4 indicate that reasoning can actually enhance instruction-following for logic-rich instructions. Further research along the lines of Tam et al. (2024) and Qin et al. (2025) is needed to clarify the relationship between reasoning and instruction following. To the best of our knowledge, we are the first to systematically investigate whether LLMs can precisely follow logic-rich instructions and how to generate such instructions at scale.