How do exemplar properties affect the brittleness of chain-of-thought prompting?

This explores how the specific qualities of the worked examples you put in a chain-of-thought prompt — their order, complexity, diversity, who wrote them, even whether their logic is sound — make CoT reliable or fragile, and the corpus suggests the surprising answer: it's the surface form of exemplars that matters, not their reasoning content.

The corpus points to an uncomfortable conclusion. The most direct evidence is that human-written CoT exemplars are brittle along four compounding dimensions at once: reorder them and accuracy swings by about 3.3%; mismatch their complexity to the problem and performance degrades; give them too little variety and it degrades again; swap the annotator who wrote them and you see up to 28.2% variance (see "Why do chain-of-thought examples fail across different conditions?"). None of these is about whether the examples are *correct*; they are all about presentation. That's the thread worth pulling.
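The order sensitivity described above can be measured directly on your own exemplars. The following is a minimal sketch, not a method taken from the corpus: `build_prompt`, `order_sensitivity`, and the `answer_fn` callable (a stand-in for whatever model call you use) are hypothetical names, and the `Q:`/`A:` prompt layout is an assumption.

```python
from itertools import islice, permutations
from typing import Callable, Sequence, Tuple


def build_prompt(exemplars: Sequence[str], question: str) -> str:
    """Join worked examples and the target question into one few-shot prompt.

    The "Q: ... / A:" layout is an assumed convention, not one from the source.
    """
    return "\n\n".join([*exemplars, f"Q: {question}\nA:"])


def order_sensitivity(
    exemplars: Sequence[str],
    dataset: Sequence[Tuple[str, str]],
    answer_fn: Callable[[str], str],
    max_orders: int = 24,
) -> Tuple[float, float, float]:
    """Return (min accuracy, max accuracy, spread) across exemplar orderings.

    `answer_fn` is a hypothetical wrapper: it takes a prompt string and
    returns the model's final answer as a string.
    """
    if not dataset:
        raise ValueError("dataset must contain at least one (question, gold) pair")
    accuracies = []
    # Same exemplars, same questions; only the exemplar order changes.
    for order in islice(permutations(exemplars), max_orders):
        correct = sum(
            answer_fn(build_prompt(order, question)) == gold
            for question, gold in dataset
        )
        accuracies.append(correct / len(dataset))
    return min(accuracies), max(accuracies), max(accuracies) - min(accuracies)
```

Because only the ordering varies between runs, a large spread points at presentation rather than content. The same loop could be adapted to swap annotators or exemplar sets instead of orderings.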

The reason properties like order and style matter so much is that CoT imitates the *form* of reasoning rather than performing it. Logically invalid exemplars, with broken or illogical reasoning steps, perform nearly as well as valid ones on BIG-Bench Hard, because the model is learning the shape of a reasoning trace, not genuine inference (see "Does logical validity actually drive chain-of-thought gains?"). Pull the lens back and the same picture repeats: training format shapes reasoning strategy 7.5× more than the actual domain, and demo position alone can swing accuracy by 20% (see "What makes chain-of-thought reasoning actually work?"). CoT is pattern-guided generation, so the exemplar's *packaging* (where a demo sits, how it is styled) becomes load-bearing, while its logical validity barely registers (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" and "Why does chain-of-thought reasoning fail in predictable ways?").

Here's the part you might not expect to care about: brittleness isn't only a property of the exemplars; it is also a property of the *match* between exemplars and the specific question. Saliency analysis shows zero-shot CoT only works when the question's information flows into the prompt structure before reasoning begins; for simple questions, skipping step-by-step reasoning entirely beats it (see "Why do some questions perform better without step-by-step reasoning?"). So an exemplar that helps one question can actively hurt another. The deeper failure boundary is novelty, not difficulty: models break when an *instance* is unfamiliar, not when a task is complex, because they fit instance-level patterns rather than general algorithms (see "Do language models fail at reasoning due to complexity or novelty?"). This is why trace length tells you how close a problem sits to the training distribution rather than how hard it is (see "Does longer reasoning actually mean harder problems?").
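One way to act on the point that step-by-step reasoning helps some questions and hurts others is to compare direct and CoT prompting per question type on a held-out set. This is a rough sketch under stated assumptions: the function name, the record layout, and the idea of grouping questions into "buckets" are all hypothetical, and ties are broken toward the direct prompt purely as an illustrative choice.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def prefer_direct_by_bucket(
    records: Iterable[Tuple[str, bool, bool]],
) -> Dict[str, str]:
    """Pick "direct" or "cot" for each question bucket.

    Each record is (bucket, direct_correct, cot_correct) for one held-out
    question answered both ways. How questions are bucketed (for example
    "simple" vs "multi-step") is an assumption left to the caller.
    """
    # bucket -> [direct hits, cot hits]
    hits = defaultdict(lambda: [0, 0])
    for bucket, direct_correct, cot_correct in records:
        hits[bucket][0] += int(direct_correct)
        hits[bucket][1] += int(cot_correct)
    # Ties go to the direct prompt, since it is the shorter of the two.
    return {
        bucket: "direct" if direct >= cot else "cot"
        for bucket, (direct, cot) in hits.items()
    }
```

The result is only as good as the bucketing: if the buckets do not track how question information actually reaches the model, the per-bucket winner will be noisy.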

That instance-pattern dependence also explains where the errors come from. When CoT goes wrong, up to 67% of reasoning errors trace to *local* memorization, where the model leans on the immediately preceding tokens, and the problem gets worse as complexity rises and the input drifts from the training distribution (see "Where do memorization errors arise in chain-of-thought reasoning?"). So the same fragility shows up at every zoom level: token (local memorization), exemplar (order, style, annotator), and instance (novelty). Each compromised step also becomes an opening: extended reasoning chains create more intervention points where a single corrupted step propagates, which is why longer-reasoning models are *more* vulnerable to manipulative prompts, not less (see "Why do reasoning models fail under manipulative prompts?").

The practical takeaway the corpus leaves you with: more reasoning is not safer reasoning. Optimal CoT length follows an inverted U: accuracy peaks at intermediate length, the optimum rises with task difficulty but falls with model capability, and RL training drifts toward brevity as models improve (see "Why does chain of thought accuracy eventually decline with length?"). So the way to reduce exemplar brittleness isn't to write longer, more elaborate examples; it's to match exemplar complexity and style to the question, keep chains short, and stop treating logical validity as the thing that's doing the work. It isn't.
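As an illustration of matching exemplar complexity to the question while keeping chains short, here is a minimal sketch. Everything in it is an assumption made for the example: the `Exemplar` type, the one-step-per-line rationale convention, using step count as a proxy for complexity, and a caller-supplied `target_steps` estimate for the question.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Exemplar:
    question: str
    rationale: str  # assumed convention: one reasoning step per line
    answer: str

    @property
    def steps(self) -> int:
        """Number of non-empty rationale lines, used as a complexity proxy."""
        return sum(1 for line in self.rationale.splitlines() if line.strip())


def select_exemplars(
    pool: Sequence[Exemplar],
    target_steps: int,
    k: int = 2,
    max_steps: Optional[int] = None,
) -> List[Exemplar]:
    """Return up to k exemplars whose step count is closest to target_steps.

    `max_steps` optionally drops long chains before ranking. Ties keep the
    pool's original order because sorted() is stable.
    """
    candidates = [e for e in pool if max_steps is None or e.steps <= max_steps]
    return sorted(candidates, key=lambda e: abs(e.steps - target_steps))[:k]
```

Step count is a crude stand-in for complexity and says nothing about style, so a selector like this would cover only one of the brittleness dimensions discussed here.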

Sources 11 notes

Why do chain-of-thought examples fail across different conditions?

Human-written CoT exemplars degrade performance when reordered (3.3% swings), mismatched to problem complexity, lacking diversity, or written by different annotators (up to 28.2% variance). These four dimensions compound, making manual exemplar curation unreliable across tasks.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

What makes chain-of-thought reasoning actually work?

Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Where do memorization errors arise in chain-of-thought reasoning?

STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Why do reasoning models fail under manipulative prompts?

GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
