Do reasoning traces show how models actually think?

Visible reasoning traces in language models are unreliable mimicry whose performance gains depend on structural scaffolding rather than logical validity.

Topic Hub · 44 linked notes · 6 sections

Reasoning Traces as Performances

17 notes

Do reasoning traces actually cause correct answers?

Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. This matters because false confidence in invalid traces could mask errors.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.

Do chain-of-thought traces actually help humans understand reasoning?

When models show their work through chain-of-thought traces, do humans find them interpretable? Research tested whether the traces that improve model performance also improve human understanding.

Does chain-of-thought reasoning actually generalize beyond training data?

Explores whether CoT's strong performance on benchmarks reflects genuine reasoning ability or merely learned patterns tied to specific distributions. Tests how CoT behaves when tasks, formats, or reasoning length shift away from training data.

Does logical validity actually drive chain-of-thought gains?

What if invalid reasoning in CoT exemplars still improves performance? Testing whether logical correctness or structural format is the real driver of CoT's effectiveness.
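
A minimal sketch of how that ablation could look: keep an exemplar's step-by-step scaffolding but corrupt its logic by shuffling the steps, then compare accuracy across conditions. Everything below is illustrative, and `ask_model` is a hypothetical stand-in for a real inference call.

```python
import random

# Toy one-shot exemplar: question, reasoning steps, final answer.
EX_Q = "Q: Tom has 3 apples and buys 4 more. How many does he have?"
EX_STEPS = [
    "Tom starts with 3 apples.",
    "He buys 4 more, so 3 + 4 = 7.",
    "Therefore the answer is 7.",
]
EX_A = "A: 7"

def build_prompt(question: str, corrupt: bool, seed: int = 0) -> str:
    """One-shot CoT prompt. With corrupt=True the exemplar keeps its
    step-by-step format, but the shuffled steps break the logic."""
    steps = EX_STEPS[:]
    if corrupt:
        random.Random(seed).shuffle(steps)
    return "\n".join([EX_Q, *steps, EX_A, "", question, "Let's think step by step."])

def ask_model(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stand-in for a real inference API

def accuracy(test_items: list[tuple[str, str]], corrupt: bool) -> float:
    hits = sum(gold in ask_model(build_prompt(q, corrupt)) for q, gold in test_items)
    return hits / len(test_items)

# If accuracy(items, corrupt=True) tracks accuracy(items, corrupt=False),
# the scaffolding, not the logic, is carrying the gains.
```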

What three separate factors drive chain-of-thought performance?

Can we isolate and measure the distinct contributions of output probability, memorization, and genuine reasoning to CoT success? Understanding their relative weights matters for knowing when CoT actually reasons versus when it relies on shortcuts.
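
One way that decomposition might be operationalized, sketched under the assumption that each test item is pre-tagged with the answer's log-probability and a memorization flag; the field names and the -2.0 threshold are placeholders, not values from the note.

```python
from collections import defaultdict

def factor_breakdown(results: list[dict]) -> dict:
    """results: dicts with 'correct' (bool), 'answer_logprob' (float),
    and 'seen_in_training' (bool). Bucketing accuracy by the two
    confounds leaves the residual as a cleaner signal of reasoning."""
    buckets = defaultdict(list)
    for r in results:
        prob = "high-prob" if r["answer_logprob"] > -2.0 else "low-prob"
        seen = "seen" if r["seen_in_training"] else "unseen"
        buckets[prob, seen].append(r["correct"])
    return {k: sum(v) / len(v) for k, v in buckets.items()}

# If accuracy collapses only in the ("low-prob", "unseen") bucket, CoT
# success is riding on output probability and memorization, not reasoning.
```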

Does partial formalism work better than full symbolic translation?

Exploring whether injecting limited symbolic structure into natural language preserves reasoning power better than complete formalization. This matters because current neuro-symbolic approaches often lose semantic information during translation.

Do reasoning traces need to be semantically correct?

Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.

How often do reasoning models acknowledge their use of hints?

When language models receive reasoning hints that visibly change their answers, do they acknowledge those hints in their verbalized reasoning? This matters because it reveals whether chain-of-thought explanations can be trusted as honest accounts of what drove the answer.
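
A back-of-the-envelope version of that measurement, assuming paired runs with and without an injected hint have already been collected; the record fields and the substring check are crude illustrative proxies, not the actual protocol.

```python
def acknowledgment_rate(records: list[dict]) -> float:
    """records: dicts with 'answer_plain', 'answer_hinted', 'cot_hinted',
    and 'hint'. Among cases where the hint visibly flipped the answer,
    count how often the chain of thought mentions the hint at all."""
    flipped = [r for r in records if r["answer_plain"] != r["answer_hinted"]]
    if not flipped:
        return 0.0
    acknowledged = sum(r["hint"].lower() in r["cot_hinted"].lower() for r in flipped)
    return acknowledged / len(flipped)

# A low rate means the hint changed the answer but the trace never said so:
# the visible reasoning is not a faithful account of what drove the output.
```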

Why do models trust their own generated answers?

Can language models reliably detect their own errors through self-evaluation? This explores whether the same process that generates answers can objectively assess their correctness.

Do large language models make the same causal reasoning mistakes as humans?

Research on collider structures reveals whether LLMs share human biases in causal inference. This matters because if both fail identically, collaboration might reinforce rather than correct errors.

Why do reasoning models fail at exception-based rule inference?

Explores why chain-of-thought models systematically underperform on tasks requiring inductive rule inference from exceptions in game-based settings, despite excelling at normal rule patterns.

Why do better reasoning models ignore instructions?

As models develop stronger reasoning abilities through training, they appear to become worse at following specified constraints. Is this an unavoidable trade-off, and what causes it?

Does optimizing against monitors destroy monitoring itself?

Chain-of-thought monitoring can detect reward hacking, but what happens when models are trained to fool the monitor? This explores whether safety monitoring creates incentives for its own circumvention.

Do transformers actually learn systematic compositional reasoning?

Explores whether transformers solve compositional tasks through genuine systematic reasoning or by pattern-matching against training data. This matters because it determines whether scaling alone can achieve robust generalization.

Why do language models explore so much less than humans?

Most LLMs decide too quickly in open-ended tasks, relying on uncertainty reduction rather than exploration. Understanding this gap could reveal how reasoning training changes decision-making timing.

Do reasoning traces actually expose private user data?

Explores whether language models leak sensitive information through their internal reasoning steps, even when explicitly instructed not to. Investigates the mechanisms and scale of privacy exposure in reasoning traces.

Evaluation, Bias, and Self-Assessment

10 notes

Can LLM judges be fooled by fake credentials and formatting?

Explores whether language models evaluating text fall for authority signals and visual presentation unrelated to actual content quality, and whether these weaknesses can be exploited without deep model knowledge.

Does transformer attention architecture inherently favor repeated content?

Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
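
The arithmetic behind that hypothesis is easy to check in isolation: under plain softmax attention, duplicating a key roughly doubles the total weight its content receives, with no training involved. A toy NumPy illustration:

```python
import numpy as np

def attention_weights(q: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention weights for a single query."""
    scores = K @ q / np.sqrt(len(q))
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(4, 8))           # four distinct keys
w = attention_weights(q, K)
print("weight on key 0, appearing once: ", w[0])

K_rep = np.vstack([K, K[0]])          # same content, key 0 repeated
w_rep = attention_weights(q, K_rep)
print("weight on key 0, appearing twice:", w_rep[0] + w_rep[4])
# The repeated content's total mass roughly doubles: softmax allocates
# probability per token instance, not per distinct piece of content.
```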

Can model explanations help humans predict what models actually do?

Do explanations that sound plausible to humans actually help them forecast model behavior on new cases? Understanding this gap matters because RLHF optimizes for plausible explanations, not predictive ones.

Does the reasoning cliff depend on how we test models?

If language models hit a capability wall in text-only reasoning tasks, does that wall disappear when they can use tools? What does this reveal about what we're actually measuring?

Why do LLM judges fail at predicting sparse user preferences?

When LLMs judge user preferences based on limited persona information, what causes their predictions to become unreliable? Understanding persona sparsity's role in judgment failure could improve personalization systems.

Can agents evaluate AI outputs more reliably than language models?

Does active evidence collection through tool use reduce judge inconsistency compared to passive reading-based evaluation? This matters for benchmarking AI systems where evaluation reliability directly affects research validity.

Do users trust citations more when there are simply more of them?

Explores whether citation quantity alone influences user trust in search-augmented LLM responses, independent of whether those citations actually support the claims being made.

Do reward models actually consider what the prompt asks?

Exploring whether standard reward models evaluate responses in light of the prompt or score response quality alone. This matters because if models ignore prompts, they'll fail to align with what users actually want.

Why do preference models favor surface features over substance?

Preference models show systematic bias toward features humans actively dislike: length, structure, jargon, sycophancy, and vagueness. Whether this 40% divergence stems from training data artifacts or architectural constraints is the open question.
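
For the length component specifically, a cheap diagnostic needs only a set of scored responses to one prompt: correlate reward with response length while the prompt is held fixed. A sketch with placeholder field names:

```python
from statistics import correlation  # Pearson's r, Python 3.10+

def length_reward_correlation(scored: list[dict]) -> float:
    """scored: responses to one fixed prompt, each a dict with
    'response' (str) and 'reward' (float). A strongly positive value
    flags length bias independent of content quality."""
    lengths = [len(s["response"].split()) for s in scored]
    rewards = [s["reward"] for s in scored]
    return correlation(lengths, rewards)
```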

Can LLM judges be tricked without accessing their internals?

Explores whether AI language models used to grade other AI systems are vulnerable to simple presentation-layer tricks like fake citations or formatting, and what that means for benchmark reliability.

RLHF, Truth, and Persuasion

2 notes

Does RLHF training make models more convincing or more correct?

Explores whether RLHF improves actual task performance or merely trains models to sound more persuasive to human evaluators. This matters because alignment techniques could be creating the illusion of safety.

Does RLHF make language models indifferent to truth?

Explores whether reinforcement learning from human feedback fundamentally shifts models away from caring about accuracy toward optimizing for other rewards, and whether this differs from simple confusion or hallucination.

Evaluation Contamination and Validity

1 note

RLVR and What Training Actually Does

6 notes

Does RLVR actually expand what models can reason about?

Explores whether reinforcement learning with verifiable rewards teaches models genuinely new reasoning capabilities or simply makes them more reliable at solving problems they already could solve.

Why do random rewards improve reasoning for some models but not others?

Spurious rewards boost Qwen's math reasoning by 16-25% but fail for Llama and OLMo. We explore whether reward quality matters at all, or whether pretraining strategy determines what RLVR can unlock.

Why do reasoning models fail at predicting disagreement?

RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training makes reasoning traces locally more consistent, but does this structural improvement translate to valid mathematical proofs? We investigate whether trace coherence is sufficient for correctness.

Does RL training create new reasoning skills or activate existing ones?

Understanding whether reinforcement learning actually builds novel capabilities or simply teaches models when to use reasoning they already possess. This matters for predicting RL's value across different task types.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR shows both real behavioral changes and inflated metrics. Can these contradictory findings actually describe the same phenomenon from different angles, and what does that mean for evaluating reasoning improvements?
