When Prompts Go Wrong: Evaluating Code Model Robustness to Ambiguous, Contradictory, and Incomplete Task Descriptions

Paper · arXiv 2507.20439 · Published July 27, 2025

Large Language Models (LLMs) have demonstrated impressive performance in code generation tasks under idealized conditions, where task descriptions are clear and precise. However, in practice, task descriptions frequently exhibit ambiguity, incompleteness, or internal contradictions. In this paper, we present the first empirical study examining the robustness of state-of-the-art code generation models when faced with such unclear task descriptions. We extend the HumanEval and MBPP benchmarks by systematically introducing realistic task description flaws through guided mutation strategies, producing a dataset that mirrors the messiness of informal developer instructions. We evaluate multiple LLMs of varying sizes and architectures, analyzing their functional correctness and failure modes across task description categories. Our findings reveal that even minor imperfections in task description phrasing can cause significant performance degradation, with contradictory task descriptions resulting in numerous logical errors. Moreover, while larger models tend to be more resilient than smaller variants, they are not immune to the challenges posed by unclear requirements. We further analyze semantic error patterns and identify correlations between description clarity, model behavior, and error types. Our results underscore the critical need for developing LLMs that are not only powerful but also robust to the imperfections inherent in natural user prompts, highlighting important considerations for improving model training strategies, designing more realistic evaluation benchmarks, and ensuring reliable deployment in practical software development environments.

Existing benchmarks make somewhat simplistic assumptions regarding the quality of the requirements used to prompt code generation. For example, task descriptions in HumanEval and MBPP are typically crafted by experts and target relatively simple and well-known problems. This makes them unrepresentative of the imperfect and variable-quality requirements that arise in real-world settings, where users and developers may use ambiguous phrasing, express contradictory goals, or omit important details [22, 37]. By contrast, the broader scientific literature has begun to investigate the robustness of general-purpose LLMs when faced with more difficult scenarios. For instance, recent studies have examined how LLMs handle adversarial prompts [40] or perturbations of reasoning tasks [36].

Although there are many specification-related defects, for example “the seven sins of specifiers” [22] and related taxonomies [37], ambiguity, inconsistency, and incompleteness have been identified as the most prevalent issues in the requirements engineering literature [23]. These types of issues are well known to lead to misinterpretation and faulty implementation even among human developers, yet their impact on automated code generation remains poorly understood.

Specifically, we create ambiguous descriptions by introducing phrases with multiple plausible interpretations; contradictory descriptions by inserting conflicting or incompatible requirements; and incomplete descriptions by omitting critical task constraints.
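To make these strategies concrete, the sketch below applies all three mutations to a HumanEval-style description of a triangular-prism volume task (the example revisited in the next paragraph). The function names and the specific rewrites are hypothetical illustrations, not the paper's actual mutation pipeline:

```python
# Hypothetical sketch of the three mutation strategies; the concrete
# rewrites below are illustrative, not the paper's actual pipeline.

ORIGINAL = (
    "Write a function prism_volume(base, height, length) that returns "
    "the volume of a triangular prism, given the triangle's base and "
    "height and the prism's length."
)

def make_ambiguous(desc: str) -> str:
    """Introduce phrasing with multiple plausible interpretations
    (here, metaphorical language replacing precise geometric terms)."""
    return ("Write a function that computes the size of a three-sided "
            "form stretched across a distance.")

def make_contradictory(desc: str) -> str:
    """Insert requirements that conflict with each other: return a
    single number vs. output a textual formula instead."""
    return desc + (" The function must return a single number and must "
                   "instead output the formula as a text string.")

def make_incomplete(desc: str) -> str:
    """Omit a critical constraint: that the prism's base is a triangle."""
    return (desc.replace("triangular prism", "prism")
                .replace("the triangle's base and height and ", ""))

if __name__ == "__main__":
    for mutate in (make_ambiguous, make_contradictory, make_incomplete):
        print(f"--- {mutate.__name__} ---\n{mutate(ORIGINAL)}\n")
```

In a setup like this, the original benchmark's unit tests would presumably remain the ground truth, so any divergence induced by the mutated wording surfaces as a functional-correctness failure.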

However, when the description is made incomplete (e.g., omitting the mention of “triangular”), the model incorrectly assumes that the base area is provided as input, making the produced function useless. When the description is contradictory (e.g., mixing code output with a textual formula), the model returns both, violating single-output expectations. With an ambiguous description using metaphorical language (e.g., “three-sided form stretched across a distance”), the model generates a function to compute only the area of a triangle, ignoring the “stretched” dimension entirely and failing to calculate volume. This example highlights how even subtle description flaws can significantly degrade the correctness of the produced code.
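As a minimal sketch of the incomplete-description failure above (the signatures and values are illustrative assumptions, not the paper's actual model outputs):

```python
# Reference solution for the well-specified task: volume of a
# triangular prism = (1/2 * base * height) * length.
def prism_volume(base: float, height: float, length: float) -> float:
    return 0.5 * base * height * length

# Illustrative failure under the incomplete description ("triangular"
# omitted): the model assumes the base *area* is given as an input,
# so the function no longer matches the benchmark's expected signature
# or behavior.
def prism_volume_incomplete(base_area: float, length: float) -> float:
    return base_area * length

assert prism_volume(3.0, 4.0, 10.0) == 60.0        # passes reference tests
assert prism_volume_incomplete(6.0, 10.0) == 60.0  # "correct" only if the
# caller pre-computes the base area, which the original tests never do
```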