IFEvalCode: Controlled Code Generation
Code large language models (Code LLMs) have achieved significant advances on a variety of code-related tasks, particularly code generation, where the model produces target code from a natural language description. In realistic scenarios, however, users often expect the returned code to strictly follow detailed requirements in many aspects (e.g., the style of the code or the number of code lines), rather than merely being correct. Controlled code generation means that the response generated by a code LLM should adhere to specific human guidelines or standards, which requires the LLM to have strong instruction-following capability in the code domain. In this paper, we propose forward constraints generation and backward constraints generation for controlled code generation to enhance the capability of LLMs to follow human instructions. We then build a multilingual benchmark, IFEvalCode, to evaluate the code instruction-following capability of LLMs.
Most existing code benchmarks [11], such as HumanEval [13], BigCodeBench [86], and MBPP [6], are designed to evaluate the correctness of generated code via execution against unit tests. Further, benchmarks such as Chatbot Arena [80] and CodeArena [69] evaluate the alignment between model-generated responses and human preferences using LLM-as-a-Judge. In the general domain, IFEval [85] evaluates the proficiency of LLMs in controlled text generation, where the instructions are amenable to objective verification of compliance. In the code domain, beyond requiring correct code, users often also impose objective requirements on the generation, such as code style, variable naming, specific algorithms, or time complexity. We therefore explore the proficiency of LLMs in controlled code generation by designing a framework to both enhance and evaluate their code instruction-following capabilities.
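To make the notion of "objective verification of compliance" concrete, the sketch below shows how IFEval-style constraint checks might look when ported to code: programmatic verifiers that inspect a generated solution for properties beyond test-passing correctness, such as a line budget or a naming convention. The function names and constraints here are illustrative assumptions, not the actual verifiers used in IFEvalCode.

```python
import ast
import re

def check_line_count(code: str, max_lines: int) -> bool:
    # Constraint: the solution must not exceed max_lines non-empty lines.
    return len([ln for ln in code.splitlines() if ln.strip()]) <= max_lines

def check_snake_case_functions(code: str) -> bool:
    # Constraint: every defined function name must be snake_case.
    pattern = re.compile(r"^[a-z_][a-z0-9_]*$")
    return all(
        pattern.match(node.name) is not None
        for node in ast.walk(ast.parse(code))
        if isinstance(node, ast.FunctionDef)
    )

# A model response is compliant only if every constraint verifier passes,
# independently of whether the code also passes its unit tests.
candidate = "def add_two(a, b):\n    return a + b\n"
compliant = check_line_count(candidate, max_lines=5) and \
    check_snake_case_functions(candidate)
```

Unlike LLM-as-a-Judge evaluation, such checks are deterministic and reproducible, which is what makes instruction-following on these constraints objectively measurable.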