LLM Reasoning · Architecture Design & LLM Interaction · Reinforcement Learning for LLMs

Why do chain-of-thought examples fail across different conditions?

Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.

Note · 2026-02-22 · sourced from Reasoning Methods CoT ToT

Manual chain-of-thought prompting rests on an implicit assumption: if a human writes good reasoning examples, the model will reason better. The AutoCoT paper exposes this assumption through systematic sensitivity analysis, documenting four distinct brittleness dimensions:

1. Order sensitivity: Randomly shuffling the order of few-shot CoT exemplars on GPT-3 causes accuracy fluctuations of up to 3.3% below average on GSM8K (see the measurement sketch after this list). The model is sensitive to which examples appear first, not just which examples appear.

2. Complexity sensitivity: Chain length (number of reasoning hops) must match problem difficulty. Simple exemplars (few hops) degrade performance on complex questions; complex exemplars (many hops) degrade performance on simple questions. The model over- or under-reasons to match the exemplar pattern.

3. Diversity requirement: A uniform-complexity exemplar set underperforms a mixed-complexity set. The optimal strategy is a distribution across complexity levels, not the highest-complexity exemplars. This means selecting exemplars for diversity, not just quality.

4. Style sensitivity: Different human annotators writing CoT for the same problems produce results that vary by up to 28.2% accuracy. There is no "neutral" annotation style — every annotator introduces style artifacts that interact with model processing differently.
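
A minimal way to quantify the order effect is to evaluate the same exemplar set under several random permutations and inspect the accuracy spread. The sketch below is illustrative only: `call_model` is a hypothetical stand-in for whatever LLM API is in use, exemplars are assumed to be (question, chain, answer) tuples, and the eval set is assumed to be (question, gold answer) pairs.

```python
import random

def build_prompt(exemplars, question):
    """Concatenate CoT exemplars (question, chain, answer) ahead of the target question."""
    blocks = [f"Q: {q}\nA: {chain} The answer is {a}." for q, chain, a in exemplars]
    return "\n\n".join(blocks + [f"Q: {question}\nA:"])

def order_sensitivity(exemplars, eval_set, call_model, n_permutations=8, seed=0):
    """Evaluate the same exemplar set under several random orderings; return per-ordering accuracy."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_permutations):
        shuffled = exemplars[:]
        rng.shuffle(shuffled)
        correct = sum(
            call_model(build_prompt(shuffled, question)).strip() == gold
            for question, gold in eval_set
        )
        accuracies.append(correct / len(eval_set))
    return accuracies

# Example: accs = order_sensitivity(exemplars, gsm8k_subset, call_model)
#          spread = max(accs) - min(accs)   # the order-sensitivity gap
```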

These four dimensions compound: a set of exemplars that is well-ordered, complexity-matched, diversity-balanced, AND style-appropriate is extremely difficult to produce manually, and what works for one task rarely generalizes to another. This is why automated approaches (AutoCoT, which uses LLM-generated and filtered pseudo-chains) can outperform manually curated exemplars despite producing less obviously "good" reasoning.
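
A rough sketch of that automated route, loosely following the AutoCoT recipe (cluster unlabeled questions, elicit a zero-shot chain for a representative question in each cluster, and keep only chains that pass simple length heuristics). The `embed` and `generate` callables and the specific thresholds below are placeholders, not the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def auto_cot_exemplars(questions, embed, generate, k=8,
                       max_question_tokens=60, max_steps=5):
    """Cluster questions, then build one pseudo-chain exemplar per cluster."""
    vectors = np.array([embed(q) for q in questions])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
    exemplars = []
    for cluster in range(k):
        idx = np.where(labels == cluster)[0]
        centroid = vectors[idx].mean(axis=0)
        # Try cluster members closest to the centroid first.
        for i in idx[np.argsort(np.linalg.norm(vectors[idx] - centroid, axis=1))]:
            q = questions[i]
            chain = generate(f"Q: {q}\nA: Let's think step by step.")
            steps = [s for s in chain.split(".") if s.strip()]
            # Simple heuristics: short question, short chain.
            if len(q.split()) <= max_question_tokens and len(steps) <= max_steps:
                exemplars.append((q, chain))
                break
    return exemplars
```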

Two additional findings extend this picture. Complexity-based prompting confirms the complexity dimension with a direct mechanism: selecting exemplars with more reasoning steps consistently improves multi-step reasoning performance. The relationship is monotonic — more exemplar complexity → better model performance — which means complexity sensitivity is not just about matching but about setting a reasoning floor. CDW-CoT (Clustered Distance-Weighted CoT) provides a practical solution to the compounding problem: by clustering the dataset and training optimal prompt probability distributions per cluster, it dynamically adapts exemplar selection to instance characteristics rather than using one-size-fits-all prompts. This directly addresses the finding that what works for one task rarely generalizes, by making exemplar selection instance-specific rather than task-level.
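
The instance-adaptive idea can be illustrated with a much-simplified selection skeleton: exemplars are grouped into clusters, a test question is routed to its nearest cluster, and the most complex chains within that cluster are preferred. This is only a sketch in the spirit of complexity-based prompting and CDW-CoT; it does not reproduce CDW-CoT's trained per-cluster probability distributions, and `embed` is again a placeholder.

```python
import numpy as np

def count_steps(chain: str) -> int:
    """Crude proxy for reasoning complexity: number of sentence-like steps in the chain."""
    return len([s for s in chain.split(".") if s.strip()])

def select_exemplars(question, clustered_pool, centroids, embed, k=4):
    """clustered_pool: one list of (question, chain, answer) exemplars per cluster;
    centroids: one embedding vector per cluster, aligned with clustered_pool."""
    v = np.asarray(embed(question))
    nearest = int(np.argmin([np.linalg.norm(v - np.asarray(c)) for c in centroids]))
    pool = clustered_pool[nearest]
    # Within the matched cluster, prefer the exemplars with the longest reasoning chains.
    ranked = sorted(pool, key=lambda ex: count_steps(ex[1]), reverse=True)
    return ranked[:k]
```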

Latent Skill Discovery (RSD) reframes exemplar selection as a learned reasoning policy. Rather than heuristic selection (by complexity, diversity, or clustering), RSD discovers an unsupervised latent space of reasoning skills from unlabeled demonstrations, then trains a reasoning policy (via PPO) to select demonstrations based on the target task's characteristics. This addresses the compounding brittleness problem by learning which combination of skills a given problem requires, rather than relying on surface-level features like complexity or diversity. The approach implies that the four brittleness dimensions (order, complexity, diversity, style) may be symptoms of a deeper issue: exemplar selection needs to be strategic, matching the specific reasoning capabilities a problem demands.
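
As a toy illustration of the selection step only (not RSD's training procedure), assume demonstrations and the target question have already been encoded into a shared latent skill space and a scoring policy has already been trained; selection then reduces to scoring each demonstration against the question and taking the top-k.

```python
import torch

def select_by_skill(question_vec, demo_vecs, policy, k=4):
    """question_vec: (d,) latent skill encoding of the target question;
    demo_vecs: (n, d) latent skill encodings of candidate demonstrations;
    policy: a trained torch.nn.Module mapping (n, 2d) pairs to (n, 1) scores."""
    # Pair the question encoding with each demonstration encoding.
    pairs = torch.cat([question_vec.expand(demo_vecs.size(0), -1), demo_vecs], dim=-1)
    scores = policy(pairs).squeeze(-1)  # one relevance score per demonstration
    return torch.topk(scores, k=min(k, demo_vecs.size(0))).indices
```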

DPP bias reveals a fifth dimension: prompt-architecture positioning. The "Demos' Position in Prompt" paper shows that moving an unchanged block of demos from the start to the end of a prompt swings accuracy by up to 20% and flips ~50% of predictions — roughly 6x the effect of within-exemplar order shuffling (3.3%). This is not about which demos appear first among themselves, but about where the entire demo block sits relative to the system prompt and user message. The mechanistic cause is architectural: primacy bias, induction heads, and lost-in-the-middle effects create position-dependent attention gradients that modulate ICL effectiveness. The DPP finding extends the brittleness story to a larger spatial scale — ordering effects appear at every granularity, from premise order within a single problem (see "How much does the order of premises actually matter for reasoning?") to within-exemplar order (3.3%) to the prompt-architecture level (20%).
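
The demo-block position effect is also cheap to check directly: build the same prompt with the demo block placed before versus after the user question, and compare accuracy and prediction flips. As before, `call_model`, the demos, and the eval set are placeholders for whatever model and data are at hand.

```python
def assemble(system, demos, question, demos_first=True):
    """Build one prompt string; only the position of the demo block changes."""
    demo_block = "\n\n".join(demos)
    if demos_first:
        return "\n\n".join([system, demo_block, f"Q: {question}\nA:"])
    # Same content, but the demo block moves after the user question,
    # with the answer cue kept last so the model still completes the answer.
    return "\n\n".join([system, f"Q: {question}", demo_block, "A:"])

def position_effect(system, demos, eval_set, call_model):
    """Return (accuracy with demos first, accuracy with demos last, fraction of flipped predictions)."""
    first = last = flips = 0
    for question, gold in eval_set:
        a = call_model(assemble(system, demos, question, demos_first=True)).strip()
        b = call_model(assemble(system, demos, question, demos_first=False)).strip()
        first += int(a == gold)
        last += int(b == gold)
        flips += int(a != b)
    n = len(eval_set)
    return first / n, last / n, flips / n
```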

The deeper implication connects to "Do language models actually use their reasoning steps?": if CoT performance is this sensitive to surface properties of exemplars (order, style, position), the reasoning chains are not cleanly driving outputs. A model that reasons correctly only when given exemplars in the right order is exhibiting a form of causal insufficiency — the reasoning capacity is real but brittle, heavily conditioned on surface formatting.


Source: Reasoning Methods CoT ToT; enriched from Cognitive Models Latent, Context Engineering

Original note title: CoT exemplar performance is brittle across four dimensions (order, complexity, diversity, and style)