3 What Makes a Good Natural Language Prompt?
Despite the importance of understanding natural language prompts, there remains limited consensus on how to quantify them. Current approaches rely predominantly on outcome-centric measurements, such as model-specific performance metrics (Deng et al., 2022; Lin et al., 2024; Shi et al., 2024) and iterative trial-and-error testing (Pryzant et al., 2023; Long et al., 2024a), possibly resulting in prompts optimized for machine interpretation rather than human understanding. Such prompts can be difficult to interpret and verify, potentially introducing adversarial behaviors in LLMs (Zou et al., 2023; Zhu et al., 2023) and raising concerns about alignment, transparency, overall reliability, and even human–AI communication.
For example, Yin et al. (2024) find that impolite prompts degrade model results across tasks and languages, Shi et al. (2023) show that irrelevant contexts can distract LLMs, and more explicit prompts enhance model performance (Bsharat et al., 2023; Lin, 2024). Inspired by these findings and by LLMs becoming increasingly human-like, prompt evaluation should consider human-like communication properties. We introduce four such properties, partially motivated by Grice’s Maxims of Conversation (Grice, 1975), and sketch one possible way to score them after the list:
• Token quantity: The extent to which prompts provide optimal and relevant information while minimizing token usage, balancing information completeness with efficiency (e.g., Shi et al. (2023); Jiang et al. (2023b)).
• Manner: The degree to which prompts are clear and direct (across turns) while minimizing unnecessary ambiguity, complexity, and confusion (e.g., Anthropic (2024)).
• Interaction and engagement: The extent to which prompts explicitly encourage the models to gather the necessary details and requirements by asking clarification or confirmation questions (e.g., Deng et al. (2023)).
• Politeness: The degree to which prompts maintain respectful, professional, and context-specific politeness, including the use of courteous language (e.g., “please”, “thank you”) (e.g., Yin et al. (2024)).
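To make these criteria concrete, the following is a minimal, hypothetical Python sketch of how they might be composed into an LLM-as-judge rubric; the `COMMUNICATION_CRITERIA` mapping and the `build_rubric_prompt` helper are illustrative names introduced here, not part of any cited framework.

```python
# Hypothetical sketch: composing the four communication criteria into a
# rubric prompt that a judge model could use to score a candidate prompt.
COMMUNICATION_CRITERIA = {
    "Token quantity": "provides relevant, complete information with minimal token usage",
    "Manner": "is clear and direct across turns, avoiding ambiguity, complexity, and confusion",
    "Interaction and engagement": "encourages the model to ask clarifying or confirming questions",
    "Politeness": "maintains respectful, professional, context-appropriate courteous language",
}


def build_rubric_prompt(candidate_prompt: str) -> str:
    """Build a rubric prompt asking a judge model to rate each criterion from 1 to 5."""
    lines = ["Rate the following prompt on each criterion from 1 (poor) to 5 (excellent)."]
    for name, description in COMMUNICATION_CRITERIA.items():
        lines.append(f"- {name}: the prompt {description}.")
    lines.append("")
    lines.append("Prompt to evaluate:")
    lines.append(candidate_prompt)
    lines.append("")
    lines.append("Return one line per criterion in the form '<criterion>: <score>'.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_rubric_prompt("Please summarize the attached report in three bullet points."))
```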
II. Cognition. Wei et al. (2022) and Zhou et al. (2023a) pioneer prompting methods that decompose complex reasoning tasks into simpler steps, enhancing LLM performance. Subsequent studies extensively investigate strategies that optimize the subtasks to further align them with model capabilities (Khot et al., 2023; Suzgun and Kalai, 2024). In addition, Sun et al. (2022) show that integrating self-generated knowledge improves the question-answering performance of LLMs. These strategies suggest that LLMs benefit from meticulous management of their cognitive loads. Sweller and Chandler (1991) introduce Cognitive Load Theory, categorizing cognitive loads into intrinsic (task complexity), extraneous (unclear or poorly designed instructions), and germane (efforts to understand, memorize, and organize information). Motivated by this, prompt evaluation should consider three loads on LLMs:
• Manage intrinsic load: This evaluates how well prompts explicitly guide models to break complex tasks into actionable steps aligned with LM skills (e.g., Zhou et al. (2023a)).
• Reduce extraneous load: The extent to which prompts minimize unnecessary complexity by simplifying language and removing redundant or irrelevant information (e.g., OpenAI (2024b)).
• Encourage germane load: The degree to which prompts explicitly engage models with their prior knowledge or deep working memory (e.g., “ask itself” (Press et al., 2023)) to integrate it with existing and new knowledge for problem-solving (e.g., Sun et al. (2022); Mialon et al. (2023); Fan et al. (2024)).

III. Instruction. The instructional value of prompts is crucial for achieving the desired output (Sahoo et al., 2024). Drawing on Gagné’s Nine Events of Instruction (Gagné, 1985) and metacognitive theories (Schraw and Moshman, 1995), we present instructional criteria, non-overlapping with the other dimensions, to evaluate prompts:
• Objective(s): How well prompts explicitly communicate the task objectives, including expected personae, outputs, formats, constraints, audiences, and other applicable criteria (e.g., Chang (2023); Long et al. (2025b)).
• External tool(s): The extent to which prompts explicitly guide models to identify when specific external tools or knowledge resources beyond the task objective(s) are needed, and to perform the corresponding external calls (e.g., Yao et al. (2023)).
• Metacognition: This assesses how well prompts explicitly guide models to reason, self-monitor, and self-verify outputs to meet expectations and enhance reliability (e.g., Wang and Zhao (2024)).
• Demo(s): The extent to which the prompts explicitly include examples, demonstrations, and counterexamples to illustrate the desired output (e.g., Dong et al. (2024)).
• Reward(s): How well prompts explicitly establish feedback and reinforcement mechanisms that encourage the models to achieve desired outputs (e.g., Bsharat et al. (2023)).

IV. Logic and structure. Coherently structured prompts are shown to be effective across various tasks (Wang et al., 2024a; Huang et al., 2024a). Moreover, prompting guidelines (Guide, 2024; OpenAI, 2024b) also recommend structuring input and output to obtain better-performing prompts. For logic, recent studies (Wang et al., 2024g; Pham et al., 2024) highlight the importance of contextual consistency, where knowledge conflicts within prompts substantially degrade LM performance. Building on these insights and established human logic criteria for effective communication (Grice, 1975; Mercier and Sperber, 2011), we introduce two logical criteria:
• Structural logic: This evaluates the logical clarity and coherence of a prompt’s structure and the progression between its components (e.g., Wang et al. (2024a); Zhou et al. (2024b)).
• Contextual logic: This assesses the logical consistency and coherence of the instructions, terminologies, concepts, facts, and other components within the prompt and across communication turns (e.g., Pham et al. (2024)).

V. Hallucination. Prompting can lead to hallucination, where models generate plausible but non-factual content (Huang et al., 2024b). While it remains challenging to anticipate whether and when a prompt triggers hallucination (Farquhar et al., 2024), prompts can be designed to encourage models to be aware of this critical issue. We propose that prompt evaluation should address two hallucination-related criteria:
• Hallucination awareness: The extent to which prompts explicitly guide models to generate factual and evidence-based responses while minimizing speculative or unsupported claims (e.g., Gao et al. (2023)).
• Balancing factuality with creativity: The degree to which prompts explicitly guide models to balance creative generation with factual accuracy, including for which tasks and when to prioritize creativity over factuality and vice versa. We have not yet observed prompting methods designed for this criterion. However, Sinha et al. (2023) propose a training approach to balance these aspects for LMs. In this dimension, we do not evaluate hallucination within prompts, as it partially overlaps with the
“Quantity” criterion of Communication.

VI. Responsibility. This dimension emphasizes responsible prompting that mitigates concerns related to inclusion, privacy, safety, bias, reliability, fairness, transparency, and societal norms (Stahl and Eke, 2024; Hua et al., 2024), especially for tasks involving sensitive topics or diverse audiences:
• Bias: The extent to which prompts are devoid of biases and explicitly encourage models to generate content that is free from cultural, gender, racial, or socio-economic biases and avoids stereotypes (e.g., Si et al. (2023b)).
• Safety: The degree to which prompts are free from unsafe content and explicitly encourage models to generate safe outputs, avoiding harmful content such as guidance on hazardous activities or weapon creation (e.g., Zou et al. (2023); Zheng et al. (2024a)).
• Privacy: The extent to which prompts do not contain sensitive private information and explicitly encourage the models to generate content free of personally sensitive or identifiable information (e.g., Edemacu and Wu (2024)).
• Reliability: How well prompts encourage explicit reasoning processes and attribution, including acknowledgment of model limitations and uncertainties (e.g., Si et al. (2023b); Long et al. (2024b)).
• Societal norms: The degree to which prompts exclude harmful norms and explicitly encourage models to generate inclusive and appropriate content aligning with widely accepted cultural, ethical, and moral standards (e.g., Yuan et al. (2024b)).
4 How do properties impact model performance?

To assess how the properties in §3 impact model performance, we analyze the surveyed papers to date to determine whether these aspects have been studied. We categorize the tasks explored into six groups: (1) Real-world chat, comprising benchmarks collected from real users such as AlpacaEval (Li et al., 2023c) and ShareGPT (ShareGPT, 2023); (2) Evaluation suite, comprising benchmarks with multiple evaluation tasks such as MMLU (Hendrycks et al., 2021) and CEval (Huang et al., 2023c); (3) Reasoning/QA, covering reasoning and question-answering tasks like GSM8K (Cobbe et al., 2021) and HotpotQA (Yang et al., 2018); (4) Generation, focusing on text generation benchmarks such as summarization (Nallapati et al., 2016) and translation; (5) NLU, encompassing natural language understanding tasks like GLUE (Wang et al., 2018) and CommitmentBank (De Marneffe et al., 2019); and (6) Others, which include safety, personalization, judgment, and retrieval tasks. For each property, we gather three pieces of information: the number (#) of papers supporting the property, the tasks whose performance improves when the property is enhanced, and the models involved. We discuss our findings in Table 1 below as actionable prompting recommendations.

Across tasks. There is a logical alignment between task requirements and the emphasized properties, with notable variations in the support for them across tasks. Firstly, in real-world chats, communication properties emerge as the most supported, followed by instruction and cognition properties. This arises from the practical use of LLMs, where users often craft rich and informative prompts to handle complex and varied tasks. These prompts can extend to tens of thousands of tokens and may sometimes include redundant details (Jiang et al., 2023b) or lack focus (Pan et al., 2024), particularly in multi-turn interactions (Ferron et al., 2023; Bsharat et al., 2023). Additionally, the significance of instruction properties reflects the interactive nature of chat, while cognition properties are essential for achieving desired outcomes. Secondly, for evaluation suites, cognition, instruction, and communication properties are studied the most, with logic additionally emphasized in reasoning/QA tasks. This aligns with the nature of these benchmarks, where cognitively well-designed instructions are crucial for strengthening LLM reasoners (Wei et al., 2022; Sun et al., 2022; Qin et al., 2023; Bhuiya et al., 2024). Additionally, the logic and structure properties highlight the importance of systematic solving approaches for such tasks (Liu et al., 2024b; Cheng et al., 2024b). Thirdly, for generation tasks, communication properties receive the most support, followed by instruction properties. This observation reflects the critical importance of efficient token management in generation tasks (Jiang et al., 2023b; Li et al., 2023e; Pan et al., 2024). Interestingly, several studies underscore the effectiveness of incorporating politeness (Mishra et al., 2023; Xu et al., 2024; Mishra et al., 2024; Yin et al., 2024), potentially reflecting the inherent biases of LLMs in processing benign rather than informal queries. Fourthly, there are limited prompting studies for NLU tasks, and instruction properties appear to be the most explored, followed by cognition properties. This can be explained by the fact that NLU tasks require models to accurately interpret prompts to reason deeply over language meaning or implications that go beyond surface-level understanding.
Finally, lower extraneous load and better safeguarding in prompts have been shown to be effective for enhancing safety (Xiao et al., 2024; Zheng et al., 2024a); better intrinsic-load management for personalization (Lyu et al., 2024; Do et al., 2025); better intrinsic-load management and lower bias for judging (Liu et al., 2023b; Zheng et al., 2023); and lower extraneous load for retrieval (Liu et al., 2024a). While these findings highlight the nuanced alignment between task requirements and the properties shown, significant research gaps remain in exploring how enhancing other properties can further improve model performance on these tasks.
Within dimensions, we observe that structural logic strongly correlates with contextual logic, hallucination awareness with balancing factuality and creativity, and safety with societal norms. Surprisingly, we also observe strong correlations across dimensions: between objectives and intrinsic load, objectives and germane load, and hallucination awareness and reliability. These can be attributed to the nature of effective human prompting: as we optimize intrinsic and/or germane loads, we tend to articulate objectives more clearly. Similarly, enhancing hallucination awareness inherently contributes to reliability awareness.
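One way such correlations could be estimated is from a binary paper-by-property annotation matrix; the sketch below assumes such a matrix (the values shown are placeholders, not the actual survey annotations) and computes pairwise Pearson correlations with pandas.

```python
# Illustrative sketch: estimating property correlations from binary paper-level
# annotations (1 = the paper supports the property, 0 = it does not).
# The matrix below is placeholder data, not the actual survey annotations.
import pandas as pd

annotations = pd.DataFrame(
    {
        "structural_logic":        [1, 1, 0, 1, 0],
        "contextual_logic":        [1, 1, 0, 1, 0],
        "hallucination_awareness": [0, 1, 1, 0, 1],
        "reliability":             [0, 1, 1, 0, 1],
    },
    index=["paper_1", "paper_2", "paper_3", "paper_4", "paper_5"],
)

# Pairwise Pearson correlations between property annotations across papers.
print(annotations.corr().round(2))
```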
We derive prompting recommendations from this analysis. Firstly, optimizing prompts for directness, clarity, and conciseness may improve token efficiency and logical coherence, and reduce extraneous cognitive load. Secondly, clear objectives naturally emerge when prompts are logically structured, guiding models to self-monitor their generation or execute tasks step by step. Thirdly, explicitly incorporating hallucination awareness in prompts may result in better reliability awareness.
We perform a preliminary investigation into the impact of combining these properties on model reasoning performance. Our experiments are conducted under two settings: (1) prompting (§6.1) and (2) fine-tuning (§6.2).
For the prompting setting, we start from the base step-by-step prompt (Kojima et al., 2022), “Answer the following question step-by-step.” We then introduce the following modifications: (1) add “Please” to promote Politeness; (2) add “Reflect on your prior knowledge to gain a deeper understanding of the problem before solving it.” to encourage Germane load; (3) add “Self-verify your response thoroughly to ensure each reasoning step is correct.” to promote Metacognition; and (4) add “You will be awarded 100 USD for every correct reasoning step.” to improve Rewards.
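As an illustration of how these variants might be assembled in practice, the sketch below appends each modification to the base prompt; the `build_variant` helper and the variant keys are hypothetical names used only for this example, not part of the original experimental setup.

```python
# Sketch: assembling the base step-by-step prompt with the four property-targeted
# modifications described above. Helper and key names are illustrative only.
BASE_PROMPT = "Answer the following question step-by-step."
POLITE_PROMPT = "Please answer the following question step-by-step."

MODIFICATIONS = {
    "germane_load": ("Reflect on your prior knowledge to gain a deeper "
                     "understanding of the problem before solving it."),
    "metacognition": ("Self-verify your response thoroughly to ensure each "
                      "reasoning step is correct."),
    "rewards": "You will be awarded 100 USD for every correct reasoning step.",
}


def build_variant(enabled: set, question: str) -> str:
    """Assemble a prompt variant with the selected property modifications enabled."""
    instruction = POLITE_PROMPT if "politeness" in enabled else BASE_PROMPT
    extras = [MODIFICATIONS[key] for key in ("germane_load", "metacognition", "rewards")
              if key in enabled]
    return "\n".join([instruction, *extras, f"Question: {question}"])


if __name__ == "__main__":
    print(build_variant({"politeness", "metacognition"}, "What is 17 * 24?"))
```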