How much does the order of premises actually matter for reasoning?
When you rearrange the order of logical premises in a deduction task, does it change how well language models can solve it? This tests whether LLMs reason abstractly or process input sequentially.
LLMs are surprisingly brittle to premise ordering in deductive reasoning tasks, even though reordering premises leaves the underlying logical task unchanged. The "Premise Order Matters" paper shows that permuting premise order can cause a performance drop of over 30%.
The key finding is directional: LLMs perform best when premises are presented in the order in which they are used in the intermediate reasoning steps, that is, when the prompt mirrors the ground-truth proof sequence. When premises must be mentally reordered to construct the proof, accuracy drops sharply.
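To make the contrast concrete, here is a minimal sketch of the two prompt variants, assuming a toy modus-ponens chain. The premises, query, and helper function are illustrative inventions, not material from the paper; both prompts contain identical premises and differ only in ordering.

```python
import random

# A toy modus-ponens chain. "Forward" order lists each premise at the point
# it is used in the proof: the fact first, then each rule as it fires.
proof_order = [
    "Alice is at the park.",
    "If Alice is at the park, then Bob is at home.",
    "If Bob is at home, then Carol is at work.",
    "If Carol is at work, then Dana is at school.",
]
question = "Is Dana at school? Answer yes or no."

def build_prompt(premises, question):
    """Join premises and a query into a single deduction prompt."""
    return "\n".join(premises) + "\n" + question

# Variant 1: premises mirror the ground-truth proof sequence.
forward_prompt = build_prompt(proof_order, question)

# Variant 2: the same premises, randomly permuted. Logically equivalent,
# but the paper's finding predicts markedly lower accuracy on such prompts.
shuffled = proof_order.copy()
random.shuffle(shuffled)
shuffled_prompt = build_prompt(shuffled, question)

print(forward_prompt, shuffled_prompt, sep="\n---\n")
```

A real evaluation would send many such permuted variants to the model and compare accuracy against the forward-ordered baseline; the paper's 30% figure comes from controlled permutations of this kind.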
This brittleness reveals that LLM deductive reasoning operates not on abstract logical relations but on sequential pattern matching over the input. The model processes premises in order and builds intermediate representations that are order-dependent. When the presented order does not match the proof structure, the model must implicitly reorder the premises, a capability it lacks or executes unreliably.
The finding connects to multiple existing insights about surface-level processing:
As Why do chain-of-thought examples fail across different conditions? shows, order sensitivity is not unique to premises; it extends across the entire prompt structure. Both findings suggest that LLMs process prompts as sequential narratives, not as unordered logical structures.
As Does training data format shape reasoning strategy more than domain? argues, premise ordering is another format effect: the same logical content produces dramatically different performance depending on how it is presented. The 30% gap is comparable to the 7.5x format effect documented in training data.
The practical implication is that anyone constructing prompts for deductive reasoning tasks should order premises to match the expected proof sequence. This is trivial for a prompt designer who already knows the answer, but impossible in production settings where the answer is unknown, creating a fundamental deployment challenge for LLM deductive reasoning.
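When premises are simple conditionals whose antecedents and consequents can be parsed out, one partial workaround is to topologically sort the premise dependency graph so each rule follows whatever establishes its antecedent. The sketch below is not from the paper, and the parsed premise representation is an assumption; for free-form natural-language premises, extracting this graph is itself a reasoning task, so the deployment challenge largely stands.

```python
from graphlib import TopologicalSorter

# Premises represented as (antecedent, consequent) pairs after parsing;
# facts have no antecedent. This parsed form is assumed for the sketch,
# as is the restriction that each proposition triggers at most one rule.
facts = {"alice_at_park"}
rules = [
    ("carol_at_work", "dana_at_school"),
    ("alice_at_park", "bob_at_home"),
    ("bob_at_home", "carol_at_work"),
]

# Dependency graph over propositions: a consequent's predecessor is its
# antecedent, so the topological order matches the derivation order.
graph = {consequent: {antecedent} for antecedent, consequent in rules}
order = list(TopologicalSorter(graph).static_order())

# Emit premises in derivation order: facts first, then each rule at the
# point where its antecedent has just been established.
rule_by_antecedent = {a: (a, c) for a, c in rules}
ordered_premises = [f"Fact: {f}" for f in sorted(facts)]
for prop in order:
    if prop in rule_by_antecedent:
        a, c = rule_by_antecedent[prop]
        ordered_premises.append(f"Rule: if {a} then {c}")

print("\n".join(ordered_premises))
```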
Source: Reasoning Logic Internal Rules
Related concepts in this collection
- Why do chain-of-thought examples fail across different conditions?
  Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.
  Connection: order sensitivity extends from exemplars to premises; a shared mechanism of sequential processing.
- Does training data format shape reasoning strategy more than domain?
  What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
  Connection: premise ordering is a format effect comparable in magnitude.
- Do large language models reason symbolically or semantically?
  Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.
  Connection: sequential processing rather than abstract logical manipulation.
Original note title: premise ordering affects deductive reasoning performance by over 30 percent