Do widely-repeated prompting heuristics like politeness actually improve accuracy?

This note asks whether popular prompting folk wisdom (being polite, saying 'please', adding flattery) actually improves accuracy, and whether such heuristics are stable design principles at all.

The short answer from the corpus: politeness is not a reliable lever, and the more interesting finding is that its effect *reverses* across model generations. A study testing 250 tone variants on GPT-4o found that accuracy climbed from 80.8% on 'Very Polite' prompts to 84.8% on 'Very Rude' ones, the opposite of earlier results on GPT-3.5 (see "Does prompt politeness change how accurate language models are?"). That directional flip is the real takeaway: tone effects depend on quirks of a specific model's training, not on any durable property of language. A heuristic that flips sign when the model updates was never a principle; it was a coincidence dressed as one.
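One way to check for this kind of reversal on your own evaluation set is to compute accuracy per tone for each model and compare the sign of the polite-minus-rude gap. The sketch below is a minimal illustration, not the cited study's method: the function names, the tone labels `very_polite` and `very_rude`, the model names, and the toy records are all assumptions made up for the example, and the toy numbers are not results from any study.

```python
from collections import defaultdict


def accuracy_by_tone(records):
    """Return accuracy keyed by (model, tone).

    records: iterable of (model, tone, is_correct) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for model, tone, is_correct in records:
        totals[(model, tone)] += 1
        hits[(model, tone)] += int(is_correct)
    return {key: hits[key] / totals[key] for key in totals}


def tone_gap(acc, model, polite="very_polite", rude="very_rude"):
    """Polite-minus-rude accuracy; a positive value means politeness helped."""
    return acc[(model, polite)] - acc[(model, rude)]


def sign_flips(acc, model_a, model_b):
    """True when the tone gap has opposite signs on the two models."""
    return tone_gap(acc, model_a) * tone_gap(acc, model_b) < 0


# Toy, made-up records (hypothetical models, four items per tone).
toy_records = (
    [("model_old", "very_polite", c) for c in (True, True, True, False)]
    + [("model_old", "very_rude", c) for c in (True, True, False, False)]
    + [("model_new", "very_polite", c) for c in (True, True, False, False)]
    + [("model_new", "very_rude", c) for c in (True, True, True, False)]
)

acc = accuracy_by_tone(toy_records)
print(tone_gap(acc, "model_old"), tone_gap(acc, "model_new"))
print(sign_flips(acc, "model_old", "model_new"))
```

With only a handful of items per tone, a gap of a few points is well within noise, so a real comparison would need many more items and repeated runs before reading anything into the sign.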

Step back and the corpus suggests why surface tone is the wrong place to look at all. Prompting only reorganizes knowledge the model already holds; it cannot inject anything that was absent from training, which puts a hard ceiling on what *any* phrasing can buy you (see "Can prompt optimization teach models knowledge they lack?"). So if 'please' isn't supplying missing facts (it isn't), its effect is at best a small nudge to retrieval, and an unstable one. What genuinely changes outcomes is structural: whether the information in the question is aggregated into the prompt before reasoning begins. The same chain-of-thought wrapper helps complex questions and *hurts* simple ones, because the optimal prompt depends on question type, not on a universal trick (see "Why do some questions perform better without step-by-step reasoning?").

This hints at a deeper reframing: a 'good prompt' isn't a vibe; it's a point in a measurable space. Researchers have decomposed prompt quality into six evaluable dimensions (communication, cognition, instruction, logic, hallucination, responsibility), grounded in Gricean maxims, cognitive-load theory, and instructional design, where improving one dimension cascades into others (see "Can we measure prompt quality independent of model outputs?"). Politeness barely registers in that framework; clarity, logical structure, and instruction quality do. The folk heuristics survive not because they work but because they're easy to repeat and impossible to falsify in casual use.

The most provocative thread is that brittleness to phrasing is itself a flaw worth engineering away rather than exploiting. Consistency training teaches models to respond *identically* to clean and 'wrapped' prompts, using the model's own clean answers as targets. The explicit aim is to make superficial wording, the exact territory politeness lives in, stop mattering (see "Can models learn to ignore irrelevant prompt changes?"). If that line of work succeeds, the entire genre of tone-tweaking heuristics becomes obsolete by design: the model would shrug off whether you begged or barked.
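The data-construction step of the output-level variant can be sketched in a few lines: answer each clean prompt once, then pair every wrapped version of that prompt with the same clean answer as its fine-tuning target. This is a minimal sketch under stated assumptions, not the published implementation: `build_consistency_pairs`, `stub_generate`, and the two tone wrappers are hypothetical names invented for illustration, the stub stands in for a real model call, and the fine-tuning step itself is omitted.

```python
def build_consistency_pairs(generate, clean_prompts, wrappers):
    """Build (wrapped prompt, clean-answer target) training pairs.

    generate: callable mapping a prompt string to the model's response.
    clean_prompts: prompts with no irrelevant wrapping.
    wrappers: callables that add superficial wording to a prompt.
    """
    pairs = []
    for prompt in clean_prompts:
        # The target is the model's OWN answer to the clean prompt.
        target = generate(prompt)
        for wrap in wrappers:
            pairs.append({"prompt": wrap(prompt), "target": target})
    return pairs


def stub_generate(prompt):
    """Hypothetical stand-in for a model call; replace with a real one."""
    return f"answer to: {prompt}"


# Hypothetical tone wrappers; any irrelevant rewording would do.
WRAPPERS = [
    lambda p: f"Would you be so kind as to help? {p}",
    lambda p: f"Figure this out, and be quick about it. {p}",
]

pairs = build_consistency_pairs(
    stub_generate,
    ["What is 2 + 2?", "Name a prime number."],
    WRAPPERS,
)
for pair in pairs:
    print(pair["prompt"], "->", pair["target"])
```

Because the targets come from the current model rather than a fixed labeled set, the pairs would need regenerating whenever the model changes, which is the property the source note credits with avoiding staleness.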

So the thing you didn't know you wanted to know: the people optimizing 'please vs. no please' and the people building robust models are working at cross-purposes. One side hunts for magic words; the other is trying to make the magic words irrelevant. The reversal of the politeness effect between GPT-3.5 and GPT-4o does not show that robustness has arrived (tone still moved accuracy by four points on GPT-4o), but it does show that a magic word may not survive a model update.

Sources 5 notes

Does prompt politeness change how accurate language models are?

Testing 250 tone variants on ChatGPT-4o showed accuracy rising from 80.8% (Very Polite) to 84.8% (Very Rude), contradicting prior findings on GPT-3.5. The directional flip suggests tone effects are model-generation-dependent, not stable design principles.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.

Can we measure prompt quality independent of model outputs?

Research identifies six evaluable dimensions—Communication, Cognition, Instruction, Logic, Hallucination, and Responsibility—with 20 sub-criteria based on Grice, cognitive load theory, and instructional design. Improvements in one dimension cascade to others, revealing prompt quality as a structured space rather than a flat checklist.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.
