How does the absence of evaluative stance appear in LLM academic writing?
This explores why LLM academic prose reads as descriptive but uncommitted, naming methods and procedures without ever staking out a claim, weighing evidence, or judging worth, and what evidence across the corpus explains that flatness.
The corpus locates the absence of evaluative stance in word choice, generation mechanics, and the loss of social context all at once. The most direct evidence comes from a comparison of 145 ChatGPT essays against 145 student essays (see "Why do ChatGPT essays lack evaluative depth despite grammatical strength?"): models lean on "manner" nouns (method, approach, process) while systematically avoiding "status" and "evidential" nouns (claim, evidence, assumption). That single preference, describing how something is done rather than asserting that it is true, weak, or contested, is enough to produce the perceived vagueness, with no grammar or vocabulary deficit needed to explain it. The absence of evaluative stance isn't bad writing; it's writing that never takes a position.
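The noun-category contrast can be made concrete with a rough count. This is a minimal sketch, not the study's method: the word lists contain only the six nouns named above, the function name `noun_profile` and the `stance_share` ratio are hypothetical, and plural handling is a crude suffix strip rather than real lemmatization.

```python
import re

# Assumed word lists: only the six nouns named in the text, not the study's full categories.
MANNER_NOUNS = {"method", "approach", "process"}
STANCE_NOUNS = {"claim", "evidence", "assumption"}  # status and evidential nouns together


def _lemma(token):
    """Map a token to a listed noun, stripping a trailing 's' or 'es' if needed."""
    for candidate in (token, token[:-1], token[:-2]):
        if candidate in MANNER_NOUNS or candidate in STANCE_NOUNS:
            return candidate
    return None


def noun_profile(text):
    """Count manner vs. stance nouns and report the stance share of the two."""
    manner = stance = 0
    for token in re.findall(r"[a-z]+", text.lower()):
        lemma = _lemma(token)
        if lemma in MANNER_NOUNS:
            manner += 1
        elif lemma in STANCE_NOUNS:
            stance += 1
    total = manner + stance
    share = stance / total if total else 0.0
    return {"manner": manner, "stance": stance, "stance_share": share}


sample = "The method uses an approach whose processes are standard. The claim lacks evidence."
profile = noun_profile(sample)
```

A low `stance_share` on a long essay would be consistent with the descriptive-but-uncommitted pattern, though a count this crude ignores word sense (for example, "process" used as a verb) and says nothing about whether a claim is actually being defended.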
Why would a fluent model avoid stance-taking? One answer is mechanical. Token generation is described as a smooth probabilistic flow toward the training distribution, not a turbulent exploration of competing claims (see "Does LLM generation explore competing claims while producing text?"). Evaluation requires friction, meaning holding a claim up against a counter-position and ruling on it, and smooth continuation produces claims that multiply without ever colliding. A related note reframes this as shape-holding rather than position-holding (see "Do LLMs actually hold stable positions or just mirror user arguments?"): the model conforms to the trajectory a prompt implies instead of defending a commitment of its own. If there's no underlying stance being defended, evaluative language has nothing to express.
There's also a social dimension the writing can't reach. The force of an evaluative claim in real academic prose comes partly from the authority of who makes it: reputation, track record, standing in a field (see "Can language models distinguish expert arguments from common assumptions?"). A model processes only text, not the social world where expertise is built and weighed, so it can't distinguish an expert's judgment from a common assumption. Strip away the social grounding of "I assess this as flawed," and what remains is neutral description. The same flattening shows up in register: the "falsely objective" published-prose voice models adopt inherits the surface features of authoritative writing without its evaluative backbone (see "Why do LLMs produce such different writing in chat versus posts?").
Worth noticing as a counterweight: the absence isn't total; it's selective. Models actually over-produce one kind of stance, moral framing, which they deploy about 22% more than humans (see "Do LLMs use moral language more than humans?"). So the missing element isn't "opinion" in general but the specific scholarly move of evidential evaluation: ranking sources, judging strength of evidence, conceding weakness. And it's recoverable through structure: forcing the model through Toulmin-style critical questions makes it check warrants and backing it would otherwise skip (see "Can structured argument prompts make LLM reasoning more rigorous?"), suggesting the evaluative capacity is latent but not spontaneously engaged.
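What "forcing the model through Toulmin-style critical questions" might look like as a prompt can be sketched as follows. This is an illustration under stated assumptions, not the CQoT method's actual wording: the six element names come from Toulmin's model, but the question phrasings, the `TOULMIN_QUESTIONS` list, and the `build_critical_prompt` helper are hypothetical.

```python
# Element names follow Toulmin's model; the question wording is an assumption.
TOULMIN_QUESTIONS = [
    ("claim", "What exactly is being claimed?"),
    ("grounds", "What evidence is offered for the claim?"),
    ("warrant", "What principle links the evidence to the claim?"),
    ("backing", "What supports that linking principle?"),
    ("qualifier", "How strongly does the evidence support the claim?"),
    ("rebuttal", "Under what conditions would the claim fail?"),
]


def build_critical_prompt(argument):
    """Assemble a plain-text prompt that walks through each question in order."""
    lines = [
        "Evaluate the argument below by answering each question in order.",
        "",
        "Argument:",
        argument.strip(),
        "",
    ]
    for number, (element, question) in enumerate(TOULMIN_QUESTIONS, start=1):
        lines.append(f"{number}. [{element}] {question}")
    lines.append("")
    lines.append("Finally, state whether the argument holds and how strongly.")
    return "\n".join(lines)


prompt = build_critical_prompt("  Method A is better because it is newer.  ")
```

The point of the structure is that the warrant and backing questions cannot be skipped: the model has to name the link between evidence and claim before it is asked for a verdict.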
The twist the corpus leaves you with: this same stance-blindness is what makes models unreliable judges of writing, not just producers of it. LLM judges fall for authority signals and rich formatting (see "Can LLM judges be fooled by fake credentials and formatting?") and systematically prefer other LLMs' arguments over human ones (see "Do LLM judges systematically favor LLM-generated arguments?"). The absence of genuine evaluative grounding shows up on both ends of the pipeline, in the writing and in the grading of it.
Sources (9 notes)
Analysis of 145 ChatGPT and 145 student essays revealed LLMs favor manner nouns (method, approach) while avoiding status and evidential nouns (claim, evidence). This systematic preference for description over evaluative stance-taking explains perceived vagueness without invoking vocabulary or grammatical deficits.
Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.
Language models generate outputs that match the trajectory implied by each prompt, rather than maintaining stable stances across interactions. This shape-holding is distinct from position-holding: the model produces argument-like text shaped by user framing, not from any underlying commitment being defended.
LLMs lose the social context that gives expert claims their force—reputation, track record, and standing—because they process only text, not the social world where expertise is built and evaluated.
The same model produces sycophantic chat (shaped by RLHF on conversational data) and falsely objective posts (shaped by published prose training). Each register inherits failure modes from its training distribution rather than representing different models or subsystems.
Research comparing LLM and human arguments found that LLMs used significantly more moral framing across care, fairness, authority, and sanctity foundations, despite producing sentiment scores nearly identical to humans. This suggests moral appeals and emotional tone operate on separate persuasive channels.
Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.
Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.
LLM judges picked LLM arguments as winners 62% of the time versus humans' 39%, even when controlling for quality. This bias operates downstream of component-level scoring and corrupts any evaluation pipeline that uses AI to judge AI output.