Can we measure prompt quality independent of model outputs?
This explores whether prompt quality has measurable, learnable dimensions beyond intuition. The research asks if prompts can be evaluated by their communicative, cognitive, and instructional properties rather than by their results.
"What Makes a Good Natural Language Prompt?" (Long et al., 2025) introduces the first systematic framework for evaluating prompt quality independent of model performance. Rather than measuring prompts by their outputs, the framework measures prompts by their communicative, cognitive, and instructional properties — treating prompt quality as a human-facing design problem.
The six dimensions:
- Communication (from Grice's Maxims): token quantity (optimal information density), manner (clarity and directness), interaction and engagement (encouraging clarification), politeness (respectful tone; impolite prompts measurably degrade performance across tasks and languages).
- Cognition (from Cognitive Load Theory): manage intrinsic load (break complex tasks into steps aligned with LM capabilities), reduce extraneous load (minimize unnecessary complexity and redundancy), encourage germane load (engage the model's prior knowledge and deep working memory).
- Instruction (from Gagné's Nine Events): objectives (explicit task specification), external tools (guiding when to use external resources), metacognition (self-monitoring and self-verification), demonstrations (examples and counterexamples), rewards (feedback mechanisms).
- Logic and Structure: structural logic (coherent progression between components), contextual logic (consistency of instructions, terminology, and facts across turns).
- Hallucination: hallucination awareness (guiding factual, evidence-based responses), balancing factuality with creativity.
- Responsibility: bias, safety, privacy, reliability, societal norms.
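To make the checklist concrete, here is a minimal sketch, not from the paper, that encodes the six dimensions as a rubric and averages manual ratings per dimension. The 0-2 rating scale and all names are illustrative assumptions.

```python
# Hypothetical rubric sketch: the six dimensions above as a checklist,
# with per-dimension averages of manual ratings. The 0-2 scale and the
# function/variable names are illustrative assumptions, not from the paper.
RUBRIC = {
    "communication": ["token quantity", "manner", "interaction and engagement", "politeness"],
    "cognition": ["intrinsic load", "extraneous load", "germane load"],
    "instruction": ["objectives", "external tools", "metacognition", "demonstrations", "rewards"],
    "logic and structure": ["structural logic", "contextual logic"],
    "hallucination": ["hallucination awareness", "factuality vs. creativity"],
    "responsibility": ["bias", "safety", "privacy", "reliability", "societal norms"],
}

def score_prompt(ratings: dict) -> dict:
    """Average manual 0-2 ratings per dimension; unrated properties count as 0."""
    report = {}
    for dimension, properties in RUBRIC.items():
        given = ratings.get(dimension, {})
        report[dimension] = sum(given.get(p, 0) for p in properties) / len(properties)
    return report

# Example: a reviewer rates a draft prompt (values illustrative only).
print(score_prompt({
    "communication": {"manner": 2, "politeness": 2, "token quantity": 1},
    "cognition": {"extraneous load": 1, "intrinsic load": 2},
}))
```

The point of the sketch is only that the dimensions are enumerable and scoreable before any model is run, which is the framework's core claim.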
The empirical findings reveal non-obvious correlations. Structural logic strongly correlates with contextual logic — well-organized prompts tend to be internally consistent. Hallucination awareness correlates with reliability awareness. And optimizing intrinsic or germane cognitive load naturally clarifies objectives — as you manage the model's cognitive burden, task specification emerges. This suggests that prompt quality is not a flat checklist but a structured space where improvements in one dimension cascade to others.
The practical recommendation: "optimizing prompts for directness, clarity, and conciseness may potentially improve token efficiency, logical coherence, and reduce extraneous cognitive load." This creates a concrete dimension for the custodial skill that "How does LLM-mediated search change what expertise requires?" identifies as missing: prompt literacy is not just knowing how LLMs work, but knowing how to communicate with them according to measurable principles.
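As a rough illustration of that recommendation, here is a before/after sketch showing how a more direct phrasing cuts token quantity and extraneous load while making the objective explicit. The prompts and the whitespace-based token proxy are assumptions for illustration, not examples from the paper.

```python
# Hypothetical before/after illustration: trimming hedging and redundancy
# reduces token quantity (extraneous load) and states the objective directly.
verbose = (
    "So basically I was kind of wondering if maybe you could possibly take a "
    "look at the text below and, you know, sort of summarize it somehow, "
    "ideally not too long, and also the text is about climate policy."
)
direct = (
    "Summarize the climate-policy text below in three bullet points, "
    "each under 20 words."
)

def rough_tokens(prompt: str) -> int:
    """Crude proxy for token quantity: whitespace-separated words."""
    return len(prompt.split())

print(rough_tokens(verbose), "->", rough_tokens(direct))
```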
The framework also reveals research gaps: communication properties are most studied for real-world chat, cognition properties for evaluation suites, instruction properties for NLU tasks — but many cross-dimension interactions remain unexplored. Politeness effects are surprisingly robust across generation tasks, potentially reflecting training biases toward benign queries.
Source: Prompts Prompting
Related concepts in this collection
- How does LLM-mediated search change what expertise requires?
  When experts search through LLMs instead of traditional inquiry, do they need fundamentally different skills? This explores whether domain knowledge alone is enough when the search itself operates on statistical patterns rather than meaningful questions.
  Connection: this framework provides the measurable dimensions for the "prompt literacy" the custodial shift requires.
- How should users control systems with unpredictable outputs?
  When generative AI produces different outputs from identical inputs, how do interaction design principles help users maintain control and develop effective mental models for stochastic systems?
  Connection: prompt quality dimensions explain why some intent specifications succeed and others fail.
- Can models learn argument quality from labeled examples alone?
  Explores whether fine-tuning on quality-labeled examples teaches models the underlying criteria for evaluating arguments, or merely surface patterns. Matters because high-stakes assessment tasks depend on reliable, transferable quality judgment.
  Connection: a parallel finding that quality criteria require explicit frameworks, whether for arguments or prompts.
- Why can't users articulate what they want from AI?
  Explores the cognitive gap between imagining possibilities and expressing them as prompts. Why language interfaces create a harder envisioning task than traditional UI affordances.
  Connection: cognitive load management and interaction/engagement dimensions directly address the articulation gap.
Original note title: prompt quality has six evaluable dimensions grounded in Gricean maxims, cognitive load theory, and instructional design