Design & LLM Interaction · Psychology and Social Cognition · Language Understanding and Pragmatics

Can we measure prompt quality independent of model outputs?

This note explores whether prompt quality has measurable, learnable dimensions beyond intuition. The research asks whether prompts can be evaluated by their communicative, cognitive, and instructional properties rather than by their results.

Note · 2026-03-28 · sourced from Prompts Prompting

"What Makes a Good Natural Language Prompt?" (Long et al., 2025) introduces the first systematic framework for evaluating prompt quality independent of model performance. Rather than measuring prompts by their outputs, the framework measures prompts by their communicative, cognitive, and instructional properties — treating prompt quality as a human-facing design problem.

The six dimensions:

Communication (from Grice's Maxims): token quantity (optimal information density), manner (clarity and directness), interaction and engagement (encouraging clarification), politeness (respectful tone — impolite prompts measurably degrade performance across tasks and languages).

Cognition (from Cognitive Load Theory): manage intrinsic load (break complex tasks into steps aligned with LM capabilities), reduce extraneous load (minimize unnecessary complexity and redundancy), encourage germane load (engage the model's prior knowledge and deep working memory).

Instruction (from Gagné's Nine Events): objectives (explicit task specification), external tools (guiding when to use external resources), metacognition (self-monitoring and self-verification), demonstrations (examples and counterexamples), rewards (feedback mechanisms).

Logic and Structure: structural logic (coherent progression between components), contextual logic (consistency of instructions, terminology, and facts across turns).

Hallucination: hallucination awareness (guiding factual, evidence-based responses), balancing factuality with creativity.

Responsibility: bias, safety, privacy, reliability, societal norms.
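To make the rubric concrete, the six dimensions and their properties could be encoded as a simple data structure with a per-dimension scorer. This is a hypothetical sketch, not the paper's implementation: the 0.0–1.0 scale, the unweighted mean, and the zero-default for missing properties are my assumptions.

```python
# The six dimensions and their properties, as named in the framework.
# The scoring scheme below (0.0-1.0 per property, unweighted mean per
# dimension) is an illustrative choice, not part of the paper.

RUBRIC = {
    "communication": ["token quantity", "manner",
                      "interaction and engagement", "politeness"],
    "cognition": ["intrinsic load", "extraneous load", "germane load"],
    "instruction": ["objectives", "external tools", "metacognition",
                    "demonstrations", "rewards"],
    "logic and structure": ["structural logic", "contextual logic"],
    "hallucination": ["hallucination awareness", "factuality vs. creativity"],
    "responsibility": ["bias", "safety", "privacy", "reliability",
                       "societal norms"],
}

def score_prompt(property_scores: dict[str, float]) -> dict[str, float]:
    """Aggregate per-property scores (0.0-1.0) into per-dimension means.

    Missing properties default to 0.0, so gaps in a prompt show up as
    low dimension scores rather than being silently skipped.
    """
    return {
        dim: sum(property_scores.get(p, 0.0) for p in props) / len(props)
        for dim, props in RUBRIC.items()
    }

# Example: a prompt that states its objective clearly but scores low
# on politeness. "instruction" averages 1.0 over 5 properties -> 0.2.
scores = score_prompt({"objectives": 1.0, "manner": 0.8, "politeness": 0.0})
print(scores["instruction"])
```

The default-to-zero choice makes the scorer pessimistic: a dimension a prompt never addresses reads as weak, which matches treating the rubric as a checklist of what a prompt should cover.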

The empirical findings reveal non-obvious correlations. Structural logic strongly correlates with contextual logic — well-organized prompts tend to be internally consistent. Hallucination awareness correlates with reliability awareness. And optimizing intrinsic or germane cognitive load naturally clarifies objectives — as you manage the model's cognitive burden, task specification emerges. This suggests that prompt quality is not a flat checklist but a structured space where improvements in one dimension cascade to others.
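The kind of cross-dimension correlation described above can be illustrated with a toy Pearson correlation over per-prompt dimension scores. The scores below are invented for illustration; the paper's actual data and analysis may differ.

```python
# Toy sketch: given per-prompt scores on two dimensions, compute
# their Pearson correlation. High r would mirror the finding that
# structural logic and contextual logic tend to move together.

def pearson(xs: list[float], ys: list[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for five prompts: well-organized prompts
# (high structural logic) also tend to stay internally consistent.
structural = [0.9, 0.4, 0.7, 0.2, 0.8]
contextual = [0.8, 0.5, 0.7, 0.3, 0.9]
print(round(pearson(structural, contextual), 2))
```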

The practical recommendation: "optimizing prompts for directness, clarity, and conciseness may potentially improve token efficiency, logical coherence, and reduce extraneous cognitive load." This gives concrete content to the custodial skill that "How does LLM-mediated search change what expertise requires?" identifies as missing: prompt literacy is not just knowing how LLMs work, but knowing how to communicate with them according to measurable principles.
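One way such a recommendation could be operationalized is a crude redundancy check, using repeated content words as a proxy for extraneous cognitive load. The stopword list, threshold, and function below are illustrative assumptions, not from the paper.

```python
# Crude conciseness check: flag content words repeated more than
# `threshold` times as candidate redundancy (a rough proxy for
# extraneous cognitive load). Stopwords and threshold are arbitrary.

from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is",
             "it", "so", "please", "make"}

def redundancy_report(prompt: str, threshold: int = 2) -> dict[str, int]:
    words = [w.strip(".,!?").lower() for w in prompt.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return {w: c for w, c in counts.items() if c > threshold}

verbose = ("Please summarize the report. The report is long, so summarize "
           "the report carefully.")
print(redundancy_report(verbose))  # flags "report" (3 occurrences)
```

A real evaluator would need far more than word counts, but even this toy check captures the spirit of the recommendation: measurable surface properties of the prompt itself, with no model call required.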

The framework also reveals research gaps: communication properties are most studied for real-world chat, cognition properties for evaluation suites, instruction properties for NLU tasks — but many cross-dimension interactions remain unexplored. Politeness effects are surprisingly robust across generation tasks, potentially reflecting training biases toward benign queries.


Source: Prompts Prompting

Original note title: Prompt quality has six evaluable dimensions, grounded in Gricean maxims, cognitive load theory, and instructional design.