
How well do reward models actually evaluate reasoning?

How reward models fail to evaluate reasoning fairly and what architectural fixes prevent systematic biases.

Topic Hub · 38 linked notes · 6 sections

Reward Models and Reward Reasoning

9 notes

Do reward models actually consider what the prompt asks?

Explores whether standard reward models evaluate responses in light of the prompt or judge response quality in isolation. This matters because if models ignore prompts, they'll fail to align with what users actually want.

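One way to probe this, sketched below with a hypothetical `score(prompt, response)` reward-model call (not something defined in the note), is a prompt-shuffling test: if scores barely move when responses are paired with the wrong prompts, the reward model is judging responses in isolation.

```python
import random

def prompt_sensitivity_gap(score, pairs, seed=0):
    """Compare mean reward under true vs. shuffled prompts.

    score(prompt, response) -> float is a stand-in for any reward model;
    pairs is a list of (prompt, response) tuples. A gap near zero suggests
    the reward model is ignoring the prompt.
    """
    rng = random.Random(seed)
    shuffled = [p for p, _ in pairs]
    rng.shuffle(shuffled)
    matched = [score(p, r) for p, r in pairs]
    mismatched = [score(q, r) for q, (_, r) in zip(shuffled, pairs)]
    return sum(matched) / len(pairs) - sum(mismatched) / len(pairs)
```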

Can counterfactual invariance eliminate reward hacking biases?

Does forcing reward models to remain consistent under irrelevant changes remove the spurious correlations that cause length bias, sycophancy, concept bias, and discrimination? This matters because standard training bakes these biases in permanently.

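A minimal sketch of how such a constraint could be trained in, assuming a trainable reward model `rm` and a `perturb` function that edits only a reward-irrelevant attribute (say, padding a response to change its length); the exact formulation in the note may differ.

```python
import torch.nn.functional as F

def invariance_regularized_loss(rm, prompt, chosen, rejected, perturb, lam=1.0):
    """Bradley-Terry preference loss plus a counterfactual-invariance penalty.

    rm(prompt, response) -> scalar reward tensor (assumed interface);
    perturb changes only attributes the reward should ignore.
    """
    r_chosen = rm(prompt, chosen)
    r_rejected = rm(prompt, rejected)
    pref_loss = -F.logsigmoid(r_chosen - r_rejected)             # standard preference term
    invariance = (rm(prompt, perturb(chosen)) - r_chosen) ** 2   # penalize score shifts
    return pref_loss + lam * invariance
```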

Can reward models benefit from reasoning before scoring?

Does allowing evaluator models to generate reasoning traces before producing reward scores improve alignment and enable adaptive compute allocation? Three independent research teams converged on this insight simultaneously.


Why does self-rewarding training collapse when responses improve?

Self-Rewarding LLMs merge generator and evaluator for efficient iteration, but as both improve, the quality gap between good and bad responses shrinks, erasing the learning signal. What causes this failure, and how can it be fixed?

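A toy illustration of the collapse, with made-up numbers: under a Bradley-Terry view of preferences, as the self-assigned rewards of the better and worse response converge, the preference probability drifts toward 0.5 and the pairwise training signal all but disappears.

```python
import math

def preference_prob(r_good, r_bad):
    """Bradley-Terry probability that the better response is preferred."""
    return 1.0 / (1.0 + math.exp(-(r_good - r_bad)))

# As self-rewarding iterations sharpen both responses, the reward gap shrinks.
for gap in [3.0, 1.0, 0.3, 0.05]:
    p = preference_prob(r_good=8.0, r_bad=8.0 - gap)
    print(f"reward gap {gap:.2f} -> preference prob {p:.3f}")
# gap 3.00 -> 0.953 ... gap 0.05 -> 0.512: near-ties carry almost no signal.
```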

Can reasoning RL work without verifying generated answers?

Most reasoning RL methods require answer verification, limiting them to math and code. Can models be trained to reason better in domains like medicine and law where verification is impractical?


Can diversity optimization improve quality during language model training?

Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?

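One way "explicitly rewarding semantic diversity" could be operationalized, sketched here with assumed response embeddings rather than the note's exact formulation: add a bonus proportional to the mean pairwise distance among the embeddings of a batch of sampled responses.

```python
import itertools
import math

def diversity_bonus(embeddings):
    """Mean pairwise Euclidean distance among response embeddings (sketch)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    pairs = list(itertools.combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def total_reward(quality, embeddings, lam=0.1):
    """Quality reward plus a weighted semantic-diversity bonus (hypothetical weighting)."""
    return quality + lam * diversity_bonus(embeddings)
```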

Does outcome-based RL diversity loss spread across unsolved problems?

When RL concentrates probability mass on correct answers for solved problems, does that narrowing propagate to problems the model cannot yet solve? And if so, what are the separate mechanisms for preserving diversity during training versus at test time?


Why do correct code trajectories teach models to tolerate errors?

Explores why standard outcome-based RL fails for code tool use: when models receive reward for correct final answers despite intermediate code errors, they learn that mistakes are acceptable, producing poor reasoning quality.

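A minimal sketch of the failure mode using a made-up trajectory format: outcome-only reward scores a run purely by its final answer, so a trajectory full of failed tool calls earns the same reward as a clean one; penalizing intermediate errors (one possible fix, not necessarily the note's) restores the distinction.

```python
def outcome_reward(trajectory, correct):
    """Reward only the final answer; intermediate tool errors are invisible."""
    return 1.0 if correct else 0.0

def error_aware_reward(trajectory, correct, penalty=0.2):
    """Subtract a penalty per failed tool call on top of the outcome reward."""
    errors = sum(1 for step in trajectory if step["error"])
    return (1.0 if correct else 0.0) - penalty * errors

clean = [{"code": "x = fib(10)", "error": False}]
messy = [{"code": "x = fib(", "error": True},
         {"code": "x = fib(10)", "error": False}]

print(outcome_reward(clean, True), outcome_reward(messy, True))          # 1.0 1.0
print(error_aware_reward(clean, True), error_aware_reward(messy, True))  # 1.0 0.8
```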

Does training order reshape how models handle different task types?

Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.


Writing Angle (Reward Models)

1 note

Self-Improvement and Self-Correction

13 notes

What limits how much models can improve themselves?

Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.


Does self-consistency reliably reward correct answers during training?

Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working?

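A small sketch of why the proxy can break, with a hypothetical consistency reward: if the reward is simply the fraction of sampled answers that agree with the majority, a policy that collapses onto a single confident but wrong answer scores perfectly.

```python
from collections import Counter

def consistency_reward(answers):
    """Fraction of sampled answers that match the majority answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

honest = ["42", "42", "41", "42", "43"]  # mostly right, some spread
collapsed = ["17"] * 5                   # always the same wrong answer

print(consistency_reward(honest))     # 0.6
print(consistency_reward(collapsed))  # 1.0 -- maximal reward, zero accuracy
```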

How quickly do errors compound during model self-training?

When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.

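A back-of-the-envelope sketch with an assumed (not measured) amplification factor: if each self-training round multiplies the fraction of corrupted examples in the training pool by a factor greater than 1, corruption grows geometrically; if the factor is below 1, errors stay bounded.

```python
def corrupted_fraction(initial=0.02, amplification=1.5, rounds=10):
    """Toy model: each round multiplies the error fraction, capped at 1.0."""
    frac, history = initial, [initial]
    for _ in range(rounds):
        frac = min(1.0, frac * amplification)
        history.append(round(frac, 3))
    return history

print(corrupted_fraction(amplification=1.5))  # 2% corrupted data saturates within 10 rounds
print(corrupted_fraction(amplification=0.9))  # errors decay instead of compounding
```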

Can language models improve themselves without any external training data?

Explores whether two language models playing against each other, one generating questions and one solving them, can create a self-improving loop. This matters because it would eliminate dependence on human-labeled datasets.


Why does self-correction training on offline data fail?

Can language models learn to correct their own mistakes through supervised training on correction examples? This explores whether distribution mismatch and behavior collapse prevent self-correction from emerging.


Why do self-improvement loops eventually stop improving?

Self-improvement systems often plateau because the evaluator that judges progress stays fixed while the actor keeps improving. What happens when judges don't improve alongside learners?


Do all AI skills improve equally as models scale?

Different evaluation skills show strikingly different scaling patterns. Understanding where skills saturate has immediate implications for model deployment and capability requirements across domains.


Can AI systems improve their own learning strategies?

Current self-improvement relies on fixed human-designed loops that break when tasks change. The question is whether agents can develop their own adaptive metacognitive processes instead of depending on human intervention.


When should an agent actually stop and deliberate?

How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.


Can model confidence work as a reward signal for reasoning?

Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.

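One simple instantiation, assuming access to the policy's own token log-probabilities (the note's confidence signal may be defined differently): use the average token log-probability of a sampled answer as its scalar reward, so RL favors answers the model itself finds likely.

```python
import math

def confidence_reward(token_logprobs):
    """Average token log-probability of a sampled answer, used as a reward.

    token_logprobs: per-token log-probs from the generating model itself.
    Values closer to 0 mean the model was more confident in its own output.
    """
    return sum(token_logprobs) / len(token_logprobs)

confident = [math.log(0.9)] * 20
hedging = [math.log(0.3)] * 20
print(confidence_reward(confident), confidence_reward(hedging))  # ~-0.105 vs ~-1.204
```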

Can models improve themselves on tasks without verifiable answers?

Most self-improvement methods require objective correctness signals, limiting them to math and code. Can models self-improve on open-ended instruction tasks where answers can't be automatically verified?


Does self-generated training data improve model learning?

Can models learn more effectively from training data they generate themselves rather than data created by external sources? This explores whether a learner's own restructuring process produces better learning outcomes.


Can AI systems improve themselves through trial and error?

Explores whether replacing formal proof requirements with empirical benchmark testing enables AI systems to successfully modify and improve their own code iteratively, and what mechanisms prevent compounding failures.


Writing Angle (Self-Improvement)

1 note

LLM Architecture and Training-Time Scaling

9 notes

Can neural memory modules scale language models beyond attention limits?

Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?


Can training data itself teach harder reasoning steps?

Can augmenting pretraining data with generated reasoning trajectories help models learn complex multi-step reasoning more efficiently? This explores whether intermediate explanations in training data unlock capabilities standard next-token prediction misses.


Can we prune training data without hurting model performance?

This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste — if we can find which examples are truly necessary, we could train better models on far less data.

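A minimal sketch of the recipe implied here, with a hypothetical per-example difficulty score (for instance, the model's loss on that example): rank examples by difficulty and drop the easiest fraction on the assumption that they are redundant; whether that assumption holds is exactly the question.

```python
def prune_by_difficulty(examples, difficulty, keep_fraction=0.7):
    """Keep the hardest fraction of examples under a difficulty metric.

    difficulty(example) -> float; higher means harder (e.g. higher loss).
    """
    ranked = sorted(examples, key=difficulty, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]

data = [{"text": "2 + 2 = 4", "loss": 0.1},
        {"text": "prove the lemma", "loss": 3.2},
        {"text": "translate the passage", "loss": 1.1}]
print(prune_by_difficulty(data, difficulty=lambda ex: ex["loss"], keep_fraction=0.67))
```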

Can models learn to plan without changing their architecture?

Explores whether embedding future information directly into training data can teach language models to plan and reason about goals, without modifying the underlying neural architecture or training algorithms.


Can transformers learn to solve new problems within episodes?

Explores whether RL-finetuned transformers can develop meta-learning abilities that let them adapt to unseen tasks through in-episode experience alone, without weight updates.


Can transformers improve exponentially by learning from their own correct solutions?

Can standard transformers achieve extreme length generalization by iteratively filtering and training on their own correct outputs? This explores whether self-correction loops enable unbounded out-of-distribution improvement without architectural changes.


Can longer task training help shorter tasks extrapolate?

When models train on related tasks at different lengths, does solving a longer auxiliary task enable a shorter main task to generalize beyond its training length? This matters for understanding how neural networks transfer learned capabilities across related problems.


Can byte-level models match tokenized performance with better efficiency?

Tokenized models use fixed vocabularies and allocate equal compute per token, but what if we dynamically group bytes based on prediction difficulty instead? Could this approach achieve competitive performance while using fewer FLOPs?

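A rough sketch of the dynamic-grouping idea, assuming a small byte-level model that exposes a next-byte entropy estimate (the note's exact criterion may differ): walk the byte stream and start a new patch whenever predicted entropy spikes, so easy spans become long, cheap patches and hard spans become short ones.

```python
def entropy_patches(byte_seq, next_byte_entropy, threshold=2.0, max_len=16):
    """Group bytes into variable-length patches by prediction difficulty.

    next_byte_entropy(prefix) -> float comes from a small byte-level model
    (assumed interface). High entropy = hard to predict = start a new patch.
    """
    patches, current = [], []
    for i, b in enumerate(byte_seq):
        hard = next_byte_entropy(byte_seq[:i]) > threshold
        if current and (hard or len(current) >= max_len):
            patches.append(bytes(current))
            current = []
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches
```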

Are neural network optimizers actually memory systems?

Do gradient-based optimizers like Adam function as associative memory modules that compress context, just like network layers? This reframes the relationship between training and learning.
