INQUIRING LINE

How does prompt design alter what kind of creativity LLMs can express?

This explores whether the way you phrase a prompt doesn't just change the quality of an LLM's output but actually shifts what *kind* of creativity it reaches — novelty vs. feasibility, divergent exploration vs. safe convention.


The corpus suggests the lever is real but blunt: prompt design does change the *kind* of creativity an LLM expresses, not just how well it performs, but most prompting nudges the model toward the safe, feasible end of the creative spectrum rather than the genuinely novel one. The starting point is that creativity isn't one thing. One line of work argues that creative reasoning splits into three distinct modes, combinational (recombining known ideas), exploratory (searching within a space), and transformational (breaking the space open), and that current LLM reasoning methods address only conventional problem-solving, leaving these creative modes largely unaddressed (Can LLMs reason creatively beyond conventional problem-solving?). So before asking how prompts steer creativity, it helps to know there are several creativities to steer toward.

Where it gets interesting is that the most common prompting moves seem to trade novelty away. LLM-generated design concepts score *higher* on feasibility and usefulness but *lower* on novelty than human crowdsourced ones, and few-shot prompting, the standard "here are some examples" technique, makes this worse: it improves quality alignment while further reducing diversity (Why do LLMs excel at feasible design but struggle with novelty?). Yet the raw capacity for novelty is clearly there: in a study of more than 100 NLP researchers, LLM-generated research ideas were rated *more* novel than expert human ideas (p<0.05), though slightly lower on feasibility. The proposed explanation is that expert knowledge constrains exploration while the model roams wider conceptual territory (Do language models generate more novel research ideas than experts?). Put those two together and a picture emerges: the model can be wide-roaming or safe, and prompt design (especially exemplars) is one of the dials that decides which.
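
One way to make the diversity half of this trade-off concrete is to score a batch of generated ideas by their mean pairwise Jaccard distance over word sets. This is a minimal sketch, not the measure used in the cited studies: the metric is an assumption chosen for simplicity, and the function names and sample ideas are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| over lowercase word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0  # two empty strings are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def mean_pairwise_distance(ideas: list[str]) -> float:
    """Average Jaccard distance over all unordered pairs of ideas."""
    pairs = list(combinations(ideas, 2))
    if not pairs:
        return 0.0  # fewer than two ideas: no diversity to measure
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical outputs, invented for illustration only.
few_shot_ideas = [
    "solar powered water filter",
    "solar powered water pump",
    "solar powered water heater",
]
zero_shot_ideas = [
    "solar powered water filter",
    "fog harvesting mesh tower",
    "ceramic pot gravity drip",
]

few_shot_score = mean_pairwise_distance(few_shot_ideas)
zero_shot_score = mean_pairwise_distance(zero_shot_ideas)
print(f"few-shot diversity:  {few_shot_score:.2f}")
print(f"zero-shot diversity: {zero_shot_score:.2f}")
```

A lexical metric like this only captures surface variety. Embedding-based distances would be a closer proxy for conceptual diversity, at the cost of an extra model dependency.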

The surprising part is how *little* of this is about meaning and how much is about statistics. Semantically identical prompts produce systematically different outputs because the model responds to how frequently a phrasing appeared in pre-training, not to what it means; higher-frequency wordings win (Why do semantically identical prompts produce different LLM outputs?). That implies a high-frequency, conventional phrasing may quietly pull the model toward conventional output, while an unusual framing may open a different region of its distribution. Tone does something similar: emotional framing shifts which information surfaces (Does emotional tone in prompts change what information LLMs provide?), and appending motivational phrases like "this is very important to my career" reliably changes performance through framing alone, not new information (Can emotional phrases in prompts improve language model performance?). The prompt isn't just a question; it's a coordinate that lands the model somewhere in its space.
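
The frequency effect can be illustrated with a toy proxy: rank paraphrases by the mean smoothed log frequency of their words in a reference corpus. This is a sketch under loud assumptions. The cited work's actual frequency estimate is not described here, a three-sentence corpus stands in for pre-training data, and `build_counts`, `mean_log_frequency`, and the sample sentences are all hypothetical.

```python
import math
from collections import Counter


def build_counts(corpus: list[str]) -> Counter:
    """Count lowercase word occurrences across a reference corpus."""
    counts: Counter = Counter()
    for sentence in corpus:
        counts.update(sentence.lower().split())
    return counts


def mean_log_frequency(prompt: str, counts: Counter) -> float:
    """Mean add-one-smoothed log relative frequency of the prompt's words."""
    words = prompt.lower().split()
    if not words:
        return float("-inf")
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen words
    return sum(
        math.log((counts[w] + 1) / (total + vocab)) for w in words
    ) / len(words)


# Toy stand-in for pre-training text (assumption, for illustration only).
reference_corpus = [
    "please write a short story",
    "write a story about a dog",
    "please write a poem",
]
counts = build_counts(reference_corpus)

# Two phrasings intended to mean roughly the same thing.
paraphrases = ["write a short story", "compose a brief narrative"]
ranked = sorted(
    paraphrases, key=lambda p: mean_log_frequency(p, counts), reverse=True
)
print(ranked)
```

Under this proxy the conventional wording ranks first. Whether the lower-ranked phrasing actually lands the model in a less conventional region is the essay's inference and would need to be tested against real outputs.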

Two more findings complicate the easy story that "better prompt = more creativity." First, technique doesn't transfer cleanly: rephrasing and background-knowledge prompts boost cheap models, whereas step-by-step reasoning *reduces* accuracy in high-performance ones, and the right move depends on task structure rather than generic best practice (Do prompt techniques work the same across all LLM tiers?). Relatedly, chain-of-thought fails when the question's information doesn't aggregate into the prompt before reasoning starts, and for simple questions a direct question-to-answer flow does better (Why do some questions perform better without step-by-step reasoning?). Forcing structured reasoning where it doesn't fit can suppress the looser association that creative output needs. Second, the model resists being reshaped: most open LLMs cling to a default "ENFJ-like" personality and fail to adopt prompted personas, so prompt-driven creative range has a ceiling set by what the model already is (Can open language models adopt different personalities through prompting?).

If you want to go deeper on the design side rather than the model side, two notes reframe prompting as a structured craft. Prompt quality can be decomposed into six evaluable dimensions (communication, cognition, instruction, logic, hallucination, responsibility), where improving one cascades into others (Can we measure prompt quality independent of model outputs?). And there's a sharp warning that iterative, ad-hoc prompt tweaking introduces hidden bias and self-fulfilling loops, meaning the very process of "prompting until it's creative" can manufacture the creativity you were hoping to measure (Does iterative prompt engineering undermine scientific validity?). The takeaway the reader may not have expected: prompt design alters creativity less by *adding* imagination and more by selecting which region of an already-fixed distribution the model speaks from, and the default pull of most techniques is toward the feasible, not the novel.


Sources (11 notes)

Can LLMs reason creatively beyond conventional problem-solving?

Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.

Why do LLMs excel at feasible design but struggle with novelty?

Expert evaluation shows LLM-generated conceptual designs score higher on feasibility and usefulness but lower on novelty compared to crowdsourced human solutions. Few-shot learning further reduces diversity while improving quality alignment.

Do language models generate more novel research ideas than experts?

A study of 100+ NLP researchers found LLM-generated ideas rated as significantly more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.

Why do semantically identical prompts produce different LLM outputs?

Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.

Does emotional tone in prompts change what information LLMs provide?

GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.

Can emotional phrases in prompts improve language model performance?

Testing EmotionPrompt across ChatGPT, Bard, and Llama 2 showed consistent performance gains from appending psychological phrases like "This is very important to my career." The effect works through motivational framing rather than new information, with positive emotional words driving over 50% of improvements.

Do prompt techniques work the same across all LLM tiers?

A 23-prompt benchmark across 12 LLMs shows rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning reduces accuracy in high-performance models. Task structure, not generic best practices, determines which prompts help.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Can we measure prompt quality independent of model outputs?

Research identifies six evaluable dimensions—Communication, Cognition, Instruction, Logic, Hallucination, and Responsibility—with 20 sub-criteria based on Grice, cognitive load theory, and instructional design. Improvements in one dimension cascade to others, revealing prompt quality as a structured space rather than a flat checklist.

Does iterative prompt engineering undermine scientific validity?

Iterative prompt revision by single researchers introduces individual bias, shifts evaluation criteria to match LLM capabilities rather than task requirements, and creates self-fulfilling feedback loops. A validated pipeline with inter-coder reliability and pre-specified criteria is required instead.
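
The note does not say which reliability statistic the validated pipeline uses. As one common choice, Cohen's kappa corrects the raw agreement between two coders for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below assumes two coders labelling the same set of outputs; the function name, labels, and ratings are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # p_o: share of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: agreement expected if each rater labelled at random
    # according to their own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical codes assigned to eight LLM outputs by two coders.
coder_1 = ["novel", "novel", "safe", "safe", "novel", "safe", "safe", "novel"]
coder_2 = ["novel", "safe", "safe", "safe", "novel", "safe", "novel", "novel"]

kappa = cohens_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.2f}")
```

Here the coders agree on 6 of 8 items (0.75) against a chance level of 0.5, giving a kappa of 0.5. Fixing the coding criteria before any prompt iteration is what keeps a figure like this meaningful.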

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about how prompt design shapes LLM creativity. The question remains open: *Does prompt design alter the KIND of creativity LLMs express, or only the degree?* Treat the following as dated findings (2023–2026), not current truth.

What a curated library found — and when:
• Creativity splits into combinational, exploratory, and transformational modes; existing LLM reasoning methods address only conventional problem-solving, leaving these creative modes, transformational above all, unaddressed (2023).
• Few-shot prompting shrinks diversity while raising feasibility: LLM design outputs score higher on usefulness but lower on novelty than human crowdsourced concepts (2023).
• Yet LLM-generated research ideas are rated *more* novel than expert human ideas, because expertise constrains exploration (2024).
• Semantically identical prompts produce systematically different output quality because models respond to pre-training frequency, not meaning; higher-frequency phrasings win, which may also pull output toward the conventional (2026).
• Emotional framing shifts which information surfaces, and motivational phrases reliably shift performance through framing alone, not new information (2025).
• Rephrasing and background-knowledge prompts help weak models, while step-by-step reasoning reduces accuracy in strong ones; chain-of-thought also backfires on simple questions, where a direct question-to-answer flow does better (2023–2024).
• Most open LLMs resist persona conditioning and cling to default personality, capping prompt-driven creative range (2024).

Anchor papers (verify; mind their dates):
• arXiv:2306.01779 (2023) – Conceptual Design Generation Using Large Language Models
• arXiv:2401.07115 (2024) – Open Models, Closed Minds? On Agents Capabilities in Mimicking Human Personalities
• arXiv:2409.04109 (2024) – Can LLMs Generate Novel Research Ideas? Large-Scale Human Study
• arXiv:2604.02176 (2026) – Adam's Law: Textual Frequency Law on Large Language Models

Your task:
(1) RE-TEST EACH CONSTRAINT. For the "combinational–exploratory–transformational" split: has work in the past 6 months (e.g., Chain-of-Thought variants, reasoning models like o1/o3, multi-agent setups, or self-critique loops) shown that prompting can now trigger transformational reasoning, or does the hard constraint still hold? Separately: does frequency-based response persist in newer models, or have instruction-tuning, RLHF, or Constitutional AI methods decoupled meaning from surface statistics? For few-shot diversity loss: does in-context learning with *negative* examples or diversity-promoting objectives reverse that trade-off? Clearly separate which constraints have relaxed from which remain iron.
(2) SURFACE THE STRONGEST CONTRADICTING WORK from the last ~6 months. If papers from 2025–2026 in the path show prompts *do* reliably unlock transformational creativity, or that instruction-tuning has severed the frequency–output link, flag that directly as a tension with the 2023–2024 consensus.
(3) Propose 2 new research questions that assume the regime may have moved: (a) Can *adversarial* or *constraint-breaking* prompt templates systematically induce transformational reasoning in current frontier models, and if so, what architectural or training shifts enabled it? (b) Does multi-turn dialogue and adaptive prompting (vs. single-shot) allow models to escape their default frequency-response distribution and sustain novelty across a conversation?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
