Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper)
{okd5069@psu.edu, akhilkumar@psu.edu}
Abstract. The wording of natural language prompts has been shown to influence the performance of large language models (LLMs), yet the role of politeness and tone remains underexplored. In this study, we investigate how varying levels of prompt politeness affect model accuracy on multiple-choice questions. We created a dataset of 50 base questions spanning mathematics, science, and history, each rewritten into five tone variants (Very Polite, Polite, Neutral, Rude, and Very Rude), yielding 250 unique prompts. Using ChatGPT 4o, we evaluated responses across these conditions and applied paired-sample t-tests to assess statistical significance. Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation. Our results highlight the importance of studying pragmatic aspects of prompting and raise broader questions about the social dimensions of human-AI interaction.
Introduction. The rise of generative AI and natural language processing (NLP) has opened new possibilities for automating tasks across a broad range of domains, with the potential for large productivity gains. Large Language Models (LLMs) can perform many demanding tasks, often with performance exceeding that of humans. With their vast training data and sophisticated modeling architectures, LLMs have been shown to exhibit qualities at the heart of human cognition, such as analogical reasoning, without any prior task-specific fine-tuning (Webb et al., 2023). Since these powerful LLMs are accessed through a natural language interface, there is also growing interest in how minor differences in their inputs, formally called ‘prompts’, affect response quality as measured by accuracy, length, coherence, and similar criteria. Thus, a new field of study called ‘prompt engineering’ has emerged to study how response quality varies across prompt designs and to create prompts that produce the desired results (Sclar et al., 2024).
Discussion / Conclusion. In this paper, we evaluated a well-known LLM, ChatGPT 4o, on our dataset of 50 multiple-choice questions of varying difficulty drawn from multiple domains, with the tone of each question set to five different politeness levels. Our experiments are preliminary, but they show that tone can significantly affect performance, measured as the score on the answers to the 50 questions. Somewhat surprisingly, our results show that rude tones lead to better results than polite ones. Yin et al. (2024) noted that "impolite prompts often result in poor performance, but overly polite language does not guarantee better outcomes." In their tests on multiple-choice questions, very rude prompts elicited more inaccurate answers from ChatGPT 3.5 and Llama2-70B; however, in their tests on ChatGPT 4 with 8 different prompts ranked from 1 (rudest) to 8 (politest), the accuracy ranged from 73.86 (for politeness level 3) to 79.09 (for politeness level 4).
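As an illustration of the paired-sample t-test used to compare tone conditions, the following minimal sketch computes the t statistic for two conditions scored on the same set of repeated runs. It is not the authors' analysis code. The function name `paired_t_statistic`, the choice of runs as the paired unit, and the accuracy values are all assumptions made for illustration; the numbers are hypothetical and are not the data from this study.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b.

    t = mean(d) / (sd(d) / sqrt(n)), where d[i] = a[i] - b[i].
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-run accuracies (%), paired by run index.
# These values are illustrative only, NOT results from the paper.
very_polite = [80.0, 82.0, 80.0, 82.0, 80.0]
very_rude = [84.0, 86.0, 84.0, 84.0, 86.0]

t, df = paired_t_statistic(very_rude, very_polite)
print(f"t = {t:.3f}, df = {df}")
```

The resulting statistic would then be compared against a t-distribution with `df` degrees of freedom to obtain a p-value. In practice a library routine such as `scipy.stats.ttest_rel` performs both steps; the standard-library version is used here only to keep the sketch self-contained.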