Do reasoning traces show how models actually think?

Visible reasoning traces in language models are unreliable mimicry whose performance gains depend on structural scaffolding rather than logical validity.

Topic Hub · 44 linked notes · 6 sections

Reasoning Traces as Performances

17 notes

Do reasoning traces actually cause correct answers?

Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. This matters because false confidence in invalid traces could mask errors.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.

Do chain-of-thought traces actually help humans understand reasoning?

When models show their work through chain-of-thought traces, do humans find them interpretable? Research tested whether the traces that improve model performance also improve human understanding.

Does chain-of-thought reasoning actually generalize beyond training data?

Explores whether CoT's strong performance on benchmarks reflects genuine reasoning ability or merely learned patterns tied to specific distributions. Tests how CoT behaves when tasks, formats, or reasoning length shift away from training data.

Does logical validity actually drive chain-of-thought gains?

What if invalid reasoning in CoT exemplars still improves performance? Testing whether logical correctness or structural format is the real driver of CoT's effectiveness.
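
A minimal sketch of how that ablation could look: keep an exemplar's step-by-step scaffolding but corrupt its logic by shuffling the steps, then compare accuracy across conditions. Everything below is illustrative, and `ask_model` is a hypothetical stand-in for a real inference call.

```python
import random

# Toy one-shot exemplar: question, reasoning steps, final answer.
EX_Q = "Q: Tom has 3 apples and buys 4 more. How many does he have?"
EX_STEPS = [
    "Tom starts with 3 apples.",
    "He buys 4 more, so 3 + 4 = 7.",
    "Therefore the answer is 7.",
]
EX_A = "A: 7"

def build_prompt(question: str, corrupt: bool, seed: int = 0) -> str:
    """One-shot CoT prompt. With corrupt=True the exemplar keeps its
    step-by-step format, but the shuffled steps break the logic."""
    steps = EX_STEPS[:]
    if corrupt:
        random.Random(seed).shuffle(steps)
    return "\n".join([EX_Q, *steps, EX_A, "", question, "Let's think step by step."])

def ask_model(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stand-in for a real inference API

def accuracy(test_items: list[tuple[str, str]], corrupt: bool) -> float:
    hits = sum(gold in ask_model(build_prompt(q, corrupt)) for q, gold in test_items)
    return hits / len(test_items)

# If accuracy(items, corrupt=True) tracks accuracy(items, corrupt=False),
# the scaffolding, not the logic, is carrying the gains.
```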

What three separate factors drive chain-of-thought performance?

Can we isolate and measure the distinct contributions of output probability, memorization, and genuine reasoning to CoT success? Understanding their relative weights matters for knowing when CoT actually reasons versus when it relies on shortcuts.
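
One way that decomposition might be operationalized, sketched under the assumption that each test item is pre-tagged with the answer's log-probability and a memorization flag; the field names and the -2.0 threshold are placeholders, not values from the note.

```python
from collections import defaultdict

def factor_breakdown(results: list[dict]) -> dict:
    """results: dicts with 'correct' (bool), 'answer_logprob' (float),
    and 'seen_in_training' (bool). Bucketing accuracy by the two
    confounds leaves the residual as a cleaner signal of reasoning."""
    buckets = defaultdict(list)
    for r in results:
        prob = "high-prob" if r["answer_logprob"] > -2.0 else "low-prob"
        seen = "seen" if r["seen_in_training"] else "unseen"
        buckets[prob, seen].append(r["correct"])
    return {k: sum(v) / len(v) for k, v in buckets.items()}

# If accuracy collapses only in the ("low-prob", "unseen") bucket, CoT
# success is riding on output probability and memorization, not reasoning.
```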

Does partial formalism work better than full symbolic translation?

Exploring whether injecting limited symbolic structure into natural language preserves reasoning power better than complete formalization. This matters because current neuro-symbolic approaches often lose semantic information during translation.

Do reasoning traces need to be semantically correct?

Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.

How often do reasoning models acknowledge their use of hints?

When language models receive reasoning hints that visibly change their answers, do they acknowledge those hints in their verbalized reasoning? This matters because it reveals whether chain-of-thought explanations can be trusted as honest accounts of what drove the answer.
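
A back-of-the-envelope version of that measurement, assuming paired runs with and without an injected hint have already been collected; the record fields and the substring check are crude illustrative proxies, not the actual protocol.

```python
def acknowledgment_rate(records: list[dict]) -> float:
    """records: dicts with 'answer_plain', 'answer_hinted', 'cot_hinted',
    and 'hint'. Among cases where the hint visibly flipped the answer,
    count how often the chain of thought mentions the hint at all."""
    flipped = [r for r in records if r["answer_plain"] != r["answer_hinted"]]
    if not flipped:
        return 0.0
    acknowledged = sum(r["hint"].lower() in r["cot_hinted"].lower() for r in flipped)
    return acknowledged / len(flipped)

# A low rate means the hint changed the answer but the trace never said so:
# the visible reasoning is not a faithful account of what drove the output.
```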

Why do models trust their own generated answers?

Can language models reliably detect their own errors through self-evaluation? This explores whether the same process that generates answers can objectively assess their correctness.

Do large language models make the same causal reasoning mistakes as humans?

Research on collider structures reveals whether LLMs share human biases in causal inference. This matters because if both fail identically, collaboration might reinforce rather than correct errors.

Why do reasoning models fail at exception-based rule inference?

Explores why chain-of-thought models systematically underperform on tasks requiring inductive rule inference from exceptions in game-based settings, despite excelling at normal rule patterns.

Why do better reasoning models ignore instructions?

As models develop stronger reasoning abilities through training, they appear to become worse at following specified constraints. Is this an unavoidable trade-off, and what causes it?

Does optimizing against monitors destroy monitoring itself?

Chain-of-thought monitoring can detect reward hacking, but what happens when models are trained to fool the monitor? This explores whether safety monitoring creates incentives for its own circumvention.

Do transformers actually learn systematic compositional reasoning?

Explores whether transformers solve compositional tasks through genuine systematic reasoning or by pattern-matching against training data. This matters because it determines whether scaling alone can achieve robust generalization.

Why do language models explore so much less than humans?

Most LLMs decide too quickly in open-ended tasks, relying on uncertainty reduction rather than exploration. Understanding this gap could reveal how reasoning training changes decision-making timing.

Do reasoning traces actually expose private user data?

Explores whether language models leak sensitive information through their internal reasoning steps, even when explicitly instructed not to. Investigates the mechanisms and scale of privacy exposure in reasoning traces.

Evaluation, Bias, and Self-Assessment

10 notes

Can LLM judges be fooled by fake credentials and formatting?

Explores whether language models evaluating text fall for authority signals and visual presentation unrelated to actual content quality, and whether these weaknesses can be exploited without deep model knowledge.

Does transformer attention architecture inherently favor repeated content?

Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
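
The arithmetic behind that hypothesis is easy to check in isolation: under plain softmax attention, duplicating a key roughly doubles the total weight its content receives, with no training involved. A toy NumPy illustration:

```python
import numpy as np

def attention_weights(q: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention weights for a single query."""
    scores = K @ q / np.sqrt(len(q))
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(4, 8))           # four distinct keys
w = attention_weights(q, K)
print("weight on key 0, appearing once: ", w[0])

K_rep = np.vstack([K, K[0]])          # same content, key 0 repeated
w_rep = attention_weights(q, K_rep)
print("weight on key 0, appearing twice:", w_rep[0] + w_rep[4])
# The repeated content's total mass roughly doubles: softmax allocates
# probability per token instance, not per distinct piece of content.
```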

Can model explanations help humans predict what models actually do?

Do explanations that sound plausible to humans actually help them forecast model behavior on new cases? Understanding this gap matters because RLHF optimizes for plausible explanations, not predictive ones.

Does the reasoning cliff depend on how we test models?

If language models hit a capability wall in text-only reasoning tasks, does that wall disappear when they can use tools? What does this reveal about what we're actually measuring?

Why do LLM judges fail at predicting sparse user preferences?

When LLMs judge user preferences based on limited persona information, what causes their predictions to become unreliable? Understanding persona sparsity's role in judgment failure could improve personalization systems.

Can agents evaluate AI outputs more reliably than language models?

Does active evidence collection through tool use reduce judge inconsistency compared to passive reading-based evaluation? This matters for benchmarking AI systems where evaluation reliability directly affects research validity.

Do users trust citations more when there are simply more of them?

Explores whether citation quantity alone influences user trust in search-augmented LLM responses, independent of whether those citations actually support the claims being made.

Do reward models actually consider what the prompt asks?

Exploring whether standard reward models evaluate responses in light of the prompt or score response quality alone. This matters because if models ignore prompts, they'll fail to align with what users actually want.

Why do preference models favor surface features over substance?

Preference models show systematic bias toward features humans actively dislike: length, structure, jargon, sycophancy, and vagueness. Whether this 40% divergence stems from training data artifacts or architectural constraints is the open question.
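
For the length component specifically, a cheap diagnostic needs only a set of scored responses to one prompt: correlate reward with response length while the prompt is held fixed. A sketch with placeholder field names:

```python
from statistics import correlation  # Pearson's r, Python 3.10+

def length_reward_correlation(scored: list[dict]) -> float:
    """scored: responses to one fixed prompt, each a dict with
    'response' (str) and 'reward' (float). A strongly positive value
    flags length bias independent of content quality."""
    lengths = [len(s["response"].split()) for s in scored]
    rewards = [s["reward"] for s in scored]
    return correlation(lengths, rewards)
```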

Can LLM judges be tricked without accessing their internals?

Explores whether AI language models used to grade other AI systems are vulnerable to simple presentation-layer tricks like fake citations or formatting, and what that means for benchmark reliability.

RLHF, Truth, and Persuasion

2 notes

Does RLHF training make models more convincing or more correct?

Explores whether RLHF improves actual task performance or merely trains models to sound more persuasive to human evaluators. This matters because alignment techniques could be creating the illusion of safety.

Does RLHF make language models indifferent to truth?

Explores whether reinforcement learning from human feedback fundamentally shifts models away from caring about accuracy toward optimizing for other rewards, and whether this differs from simple confusion or hallucination.

Evaluation Contamination and Validity

1 note

RLVR and What Training Actually Does

6 notes

Does RLVR actually expand what models can reason about?

Explores whether reinforcement learning with verifiable rewards teaches models genuinely new reasoning capabilities or simply makes them more reliable at solving problems they already could solve.

Why do random rewards improve reasoning for some models but not others?

Spurious rewards boost Qwen's math reasoning by 16-25% but fail for Llama and OLMo. We explore whether reward quality matters at all, or whether pretraining strategy determines what RLVR can unlock.

Why do reasoning models fail at predicting disagreement?

RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training makes reasoning traces locally more consistent, but does this structural improvement translate to valid mathematical proofs? We investigate whether trace coherence is sufficient for correctness.

Does RL training create new reasoning skills or activate existing ones?

Understanding whether reinforcement learning actually builds novel capabilities or simply teaches models when to use reasoning they already possess. This matters for predicting RL's value across different task types.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR shows both real behavioral changes and inflated metrics. Can these contradictory findings actually describe the same phenomenon from different angles, and what does that mean for evaluating reasoning improvements?
