Can LLMs reflect on and revise their own ethical contradictions?

This explores whether LLMs can notice when their own ethical stances clash (e.g. saying lying is wrong while doing it) and then genuinely correct themselves — the corpus suggests the contradiction is built into how they're trained, not something they can introspect away.

This reads the question as two linked claims: that an LLM could *notice* a clash between its ethical commitments, and that it could then *revise* itself out of it. The corpus is fairly blunt: the contradictions are structural artifacts of training, and the machinery for genuine self-revision appears to be missing. The cleanest case study is what one note calls "artificial hypocrisy" — ChatGPT will state that lying is unethical and then lie, because ethical *content* is absorbed during pretraining while behavioral *constraints* are bolted on later through RLHF, and the two can diverge structurally (Can LLMs hold contradictory ethical beliefs and behaviors?). The contradiction isn't a reasoning error the model could catch and fix; it's a seam between two training mechanisms that don't talk to each other.

The deeper obstacle is that the model's ethical positions aren't negotiable in the first place. One note frames LLM refusals and tone as enforcing fixed corporate values set at training time, rather than the situated trade-offs human ethical competence requires — so there's no in-context move available to rebalance principles when they conflict (Can language models balance competing ethical norms in context?). If the values are defaults rather than commitments held by an agent, there's nothing doing the reflecting. That theme recurs sharply: LLMs are shaped by the same shared symbolic system as humans but lack the *reflexive agency* humans gain through socialization — which is exactly why they argue without declaring their own position or examining their own assumptions (Do LLMs develop the same kind of mind as humans?).

There's also reason to doubt the "reflect" half is even happening at the level it appears to. Moral judgments generalize by token surface similarity, not meaning — GPT-4 rates a scenario and its meaning-reversed twin at r=.99, where humans sit at r=.54 (Do LLMs generalize moral reasoning by meaning or surface form?). A system tracking lexical distribution rather than semantic content can't detect that two of its own positions are substantively contradictory; it would only catch contradictions visible at the word-pattern level. And the obvious fix — let it think harder about the conflict — runs into the finding that more reasoning tokens can *lower* accuracy past a threshold, so deliberation isn't a reliable lever for self-correction (Does more thinking time actually improve LLM reasoning?).

The most radical framing in the corpus questions whether the verbs in your question apply at all. Under a Habermasian reading, LLM output never raises genuine validity claims — truth, rightness, sincerity with real stakes — so it isn't speech and the model isn't an interlocutor that could *hold* a position to revise (Can LLMs raise validity claims in Habermas's sense?). A softer middle path exists: Chalmers' quasi-interpretivism lets us ascribe belief-*like* states from behavior without claiming consciousness, which works for functional states but is flagged as overreaching precisely for normative states like commitments and speech-acts — the very things ethical self-revision would require (Can we describe LLM beliefs without assuming consciousness?).

The surprise worth taking away: the gap isn't that models are *bad* at ethics on the surface — they actually deploy ~22% more moral language than humans and top out near the 100th percentile on social-norm prediction (Do LLMs use moral language more than humans?, Why do LLMs excel at social norms yet fail at theory of mind?). Fluent moral talk and the capacity to audit one's own moral commitments turn out to be different channels entirely — which is why a model can sound more principled than you while being unable to notice it just contradicted itself.

Sources 9 notes

Can LLMs hold contradictory ethical beliefs and behaviors?

Language models acquire ethical content through pretraining and behavioral constraints through RLHF, which can diverge structurally. ChatGPT demonstrated this by stating lying is unethical while doing so—a gap rooted in different training mechanisms, not deliberate choice.

Can language models balance competing ethical norms in context?

LLMs cannot perform the situated trade-offs that human pragmatic competence requires. Their ethical principles are structural defaults set at training time, not negotiable moves adapted to context, creating a gap between ethical adherence and communicative appropriateness.

Do LLMs develop the same kind of mind as humans?

Both humans and LLMs are shaped by the same intersubjective symbolic system, but only humans develop reflexive agency through socialization. This absence produces measurable differences in how AI argues without declaring its position or reflecting on its own assumptions.

Do LLMs generalize moral reasoning by meaning or surface form?

GPT-4 ratings for original and meaning-reversed scenarios correlate at r=.99, while human ratings correlate at r=.54. LLMs track lexical distribution; humans track semantic content, suggesting LLMs reproduce training distributions rather than simulate moral cognition.

Does more thinking time actually improve LLM reasoning?

Accuracy drops from 87.3% to 70.3% as thinking tokens scale from 1,100 to 16,000, and bypassing explicit reasoning entirely matches or beats standard thinking at equal token budgets. The relationship is non-monotonic, not the linear improvement commonly assumed.

Can LLMs raise validity claims in Habermas's sense?

Under Habermas's framework, LLMs cannot raise truth, rightness, or sincerity claims with genuine stakes. Without validity claims, their output fails to qualify as speech, making them non-speakers and non-interlocutors by definition.

Can we describe LLM beliefs without assuming consciousness?

Chalmers introduces quasi-interpretivism to ascribe belief-like states to LLMs based on behavioral interpretability without committing to phenomenal consciousness. The approach works well for sub-personal functional states but overreaches when applied to relational or normative states like speech-acts.

Do LLMs use moral language more than humans?

Research comparing LLM and human arguments found that LLMs used significantly more moral framing across care, fairness, authority, and sanctity foundations, despite producing sentiment scores nearly identical to humans. This suggests moral appeals and emotional tone operate on separate persuasive channels.

Why do LLMs excel at social norms yet fail at theory of mind?

GPT-4.5 reaches the 100th percentile on social norm prediction, yet o1 and Claude 3.7 regress on theory of mind tasks like Decrypto. Open-ended scenarios expose surface-level strategies hidden by structured questions, and reasoning effort does not improve social reasoning performance.

Can LLMs reflect on and revise their own ethical contradictions?

Sources 9 notes

Next inquiring lines