← All clusters

Reasoning and Learning Architectures

Research on how language models learn, reason, and improve through reinforcement learning, reward modeling, and architectural innovations. Covers training dynamics, inference-time computation, mechanistic interpretability, and the cognitive structures underlying model reasoning.

290 notes (primary) · 1094 papers · 21 sub-topics
View as

Reinforcement Learning

27 notes

Can chain-of-thought reasoning be learned during pretraining itself?

Explores whether reasoning emerges more effectively when models treat thinking as an exploratory action during next-token prediction, rather than only after pretraining through reinforcement learning.

Explore related Read →

Does gradually tightening token budgets beat fixed budget training?

Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.

Explore related Read →

Can judges that reason about reasoning outperform classifier rewards?

Can process reward models generate explanations about why steps are correct rather than simply classifying them? This explores whether meta-reasoning about reasoning improves both accuracy and generalization in step-level evaluation.

Explore related Read →

Can adversarial critics replace task-specific verifiers for reasoning?

Explores whether an adversarial game between policy and critic can substitute for explicit verifiers in RL-based reasoning training. Matters because many domains lack the task-specific validators that make current reasoning RL possible.

Explore related Read →

Why do language models fail to act on their own reasoning?

LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?

Explore related Read →

Can full episode rewards per step enable better credit assignment?

Can attributing cumulative episode reward to every step in a trajectory, rather than discounting by step distance, actually solve credit assignment in sequential LLM decision-making? This challenges intuitive RL assumptions about how credit should flow backward through time.

Explore related Read →

Can natural language feedback overcome numerical reward plateaus?

Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.

Explore related Read →

Does negative reinforcement alone outperform full reinforcement learning?

Can training with only penalty signals for wrong answers match or exceed full RL approaches? This challenges the conventional assumption that reward design requires both positive and negative signals.

Explore related Read →

Does network depth unlock qualitatively new behaviors in RL?

Can scaling neural network depth from shallow (2-5 layers) to very deep (1000 layers) produce fundamental shifts in what self-supervised RL agents can learn, rather than just incremental improvements? This matters because it challenges assumptions about feedback constraints in RL.

Explore related Read →

Can reinforcement learning discover reasoning strategies base models cannot?

Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.

Explore related Read →

Can models learn what makes research worth doing?

Can large language models be trained to recognize high-impact research directions by learning from citation patterns? This explores whether 'scientific taste'—the judgment of what work matters—is a learnable skill separate from execution.

Explore related Read →

Can reinforcement learning scale beyond single-turn language tasks?

Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.

Explore related Read →

Does RL training follow a predictable two-phase learning sequence?

This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.

Explore related Read →

Does reinforcement learning update only a small fraction of parameters?

Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.

Explore related Read →

Why does SFT-then-RL training follow a predictable three-phase pattern?

When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.

Explore related Read →

Does thinking emerge when agents choose between learned sub-policies?

Can we formally understand thinking as the selection of pre-existing sub-policies during reinforcement learning? This explores whether thinking requires new capabilities or just the right conditions to activate what's already there.

Explore related Read →

Can two simple techniques match complex RL algorithms?

Does vanilla PPO with minimal modifications rival more sophisticated reasoning algorithms like GRPO and DAPO? This explores whether algorithmic complexity is necessary for effective LLM reasoning training.

Explore related Read →

Reasoning Architectures

21 notes

Does planning direction affect how hard problems become?

Planning research typically goes forward only. But some problems get easier when you work backward from the goal. What makes direction matter, and can language models exploit this?

Explore related Read →

Do base models already contain hidden reasoning ability?

Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.

Explore related Read →

Can a single problem unlock reasoning through solution critique?

Does exposing models to diverse critiques of different solutions to one problem activate reasoning as effectively as training on many problems? This tests whether solution diversity matters more than problem diversity.

Explore related Read →

Can reasoning and tool execution be truly decoupled?

Can LLM reasoning be separated from tool observations to eliminate redundant re-prompting and enable parallel execution? Two recent architectures suggest yes, but what are the tradeoffs?

Explore related Read →

Can models reason without generating visible thinking tokens?

Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.

Explore related Read →

Why do LLMs struggle with exploration in simple decision tasks?

This explores why large language models fail at exploration—a core decision-making capability—even when they excel at other tasks, and what specific conditions might help them succeed.

Explore related Read →

Why do outcome-based reward models fail at intermediate step evaluation?

Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.

Explore related Read →

Can models reason without generating visible thinking steps?

Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? This challenges how we measure and understand reasoning.

Explore related Read →

Can curriculum learning approximate expensive process supervision?

Can a reverse curriculum that slides backward from task completion provide step-level insight comparable to human process annotations, but at outcome supervision cost?

Explore related Read →

Does RL teach reasoning or just when to use it?

Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.

Explore related Read →

Why do RL agents stop asking informative questions?

RL-trained agents often fail to seek information effectively, despite being trained to do so. Understanding whether this reflects a capability gap or a training dynamics problem could reveal how to unlock better information-seeking behavior.

Explore related Read →

Does RL post-training create reasoning or just deploy it?

Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.

Explore related Read →

Why do trajectories matter more than individual examples for in-context learning?

Can language models learn new sequential decision-making tasks from context alone, and if so, what data properties make this possible? This explores why isolated state-action pairs fail where full trajectories succeed.

Explore related Read →

RL with Verifiable Rewards (RLVR)

18 notes

Why does RLVR training narrow a model's problem solving ability?

RLVR's on-policy constraint may force models to exploit known reasoning paths rather than explore new ones, potentially shrinking their effective problem-solving scope. Understanding this mechanism could reveal how to design better exploration incentives in language model reasoning.

Explore related Read →

Can adaptive guidance from solution traces reduce reward sparsity in RL?

When reinforcement learning struggles with hard problems due to sparse rewards and zero-advantage rollouts, does providing partial solution traces as adaptive guidance help the model learn more efficiently? This matters because standard RL wastes compute on unsolvable problems.

Explore related Read →

Can generative reasoning beat discriminative models with less training data?

Do process reward models that generate reasoning before judging achieve better performance than traditional discriminative approaches when trained on dramatically smaller datasets? This tests whether generative verification can scale more efficiently.

Explore related Read →

Do high-entropy tokens drive reasoning model improvements?

Explores whether only a small fraction of tokens—those with high entropy at decision points—actually matter for improving reasoning performance in language models, and whether training on them alone could work as well as full training.

Explore related Read →

Can reasoning emerge from expert demonstrations alone?

Can AI systems learn to reason about non-verifiable tasks by studying expert examples rather than explicit reward signals? This matters because many high-value domains like medicine and law have abundant demonstrations but no automated verifiers.

Explore related Read →

Can model confidence alone replace external answer verification?

Can LLMs use their own certainty signals instead of external verifiers to improve reasoning? This matters for scaling beyond domains where correct answers can be automatically checked.

Explore related Read →

Can RL agents learn to reason better, not just succeed?

Standard outcome-only RL rewards agents for any successful trajectory, even flawed ones. Can we instead train agents to demonstrate genuine reasoning quality by rewarding the metacognitive process itself?

Explore related Read →

Can a single training example unlock mathematical reasoning?

Explores whether one example is enough to dramatically improve math problem-solving in language models, and whether learning continues after perfect memorization.

Explore related Read →

Can next-token prediction become a reasoning task with RL?

Does reinforcement learning applied to next-token prediction during pretraining encourage genuine reasoning rather than surface memorization? This matters because it could unlock reasoning capability without requiring labeled data or human feedback.

Explore related Read →

Does RLVR actually expand what models can reason about?

Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.

Explore related Read →

How can rubric-based rewards resist reward hacking attacks?

Single rubrics are easily exploited by models, and simply adding more rubrics yields diminishing returns. What design patterns and defensive mechanisms actually prevent reward hacking in rubric-based RL systems?

Explore related Read →

Why do random rewards improve reasoning for some models but not others?

When RLVR training uses meaningless reward signals, some models gain reasoning improvements while others don't. What determines which models can benefit from optimization pressure without meaningful feedback?

Explore related Read →

Is the exploration-exploitation trade-off actually fundamental?

Token-level analysis suggests exploration and exploitation are opposed, but does hidden-state analysis reveal they could coexist? Understanding measurement granularity's role in perceived trade-offs matters for scaling reasoning systems.

Explore related Read →

Why does RLVR work with completely random rewards?

RLVR improves reasoning performance even with incorrect or random reward signals. This challenges the assumption that reward quality determines learning outcomes and raises questions about what RLVR is actually doing.

Explore related Read →

LLM Architecture

17 notes

Why do decoder-only models underperform as text encoders?

Decoder-only LLMs use causal attention, which limits each token to seeing only prior context. This explores whether removing this constraint could make them competitive universal encoders without architectural redesign.

Explore related Read →

Can we prune training data without hurting model performance?

This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste — if we can find which examples are truly necessary, we could train better models on far less data.

Explore related Read →

Can embedding future information in training data improve planning?

This explores whether inserting lookahead tokens containing future goals into training sequences helps models learn long-range planning without changing their architecture. The question matters because it tests whether data-level changes can produce architectural-level reasoning improvements.

Explore related Read →

Can transformers learn to solve new problems within episodes?

Explores whether transformer models can develop meta-learning abilities through RL training, enabling them to adapt to unseen environments by learning from within-episode experience alone, without updating weights.

Explore related Read →

Can text-trained models compress images better than specialized tools?

Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.

Explore related Read →

Can neural memory modules scale language models beyond attention limits?

Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?

Explore related Read →

Why do accurate predictions lead to poor decisions?

Predictive models are built to fit data, not to optimize decision outcomes. This note explores when and why accurate forecasts fail to produce good choices.

Explore related Read →

Can transformers improve exponentially by learning from their own correct solutions?

Can standard transformers achieve extreme length generalization by iteratively filtering and training on their own correct outputs? This explores whether self-correction loops enable unbounded out-of-distribution improvement without architectural changes.

Explore related Read →

Why do neural networks fail at compositional generalization?

Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects—segregation, representation, and composition—each creating distinct failure modes in how networks handle structured information.

Explore related Read →

Can training data augmentation match test-time compute scaling benefits?

Can generating thinking trajectories during pretraining unlock the same efficiency gains that test-time scaling provides at inference? This explores whether the compute-allocation principle works across the training-inference boundary.

Explore related Read →

Test-Time Compute

17 notes

Can we allocate inference compute based on prompt difficulty?

Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?

Explore related Read →

Do critique models improve diversity during training itself?

Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.

Explore related Read →

Does extended thinking actually improve reasoning or just increase variance?

When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.

Explore related Read →

How do internal and external test-time scaling compare?

Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.

Explore related Read →

Why does majority voting outperform more complex inference methods?

Simple majority voting across independent samples often matches or beats sophisticated alternatives like Best-of-N and sequential revision. What makes this basic approach so hard to beat for reasoning models?

Explore related Read →

Can non-reasoning models catch up with more compute?

Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.

Explore related Read →

How should we balance parallel versus sequential compute at test time?

Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?

Explore related Read →

Does policy entropy collapse limit reasoning performance in RL?

As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?

Explore related Read →

Can self-supervised process rewards replace human annotation?

Self-supervised PRMs learn from outcome labels alone, avoiding expensive step-level annotation. The key question is whether this approach generalizes beyond math and code to domains with ambiguous correctness.

Explore related Read →

Can models precompute answers before users ask questions?

Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?

Explore related Read →

When does majority-vote reward actually help test-time learning?

Test-time RL using consensus rewards shows contradictory results across different models and domains. What determines whether consensus amplifies correct answers or reinforces confident mistakes?

Explore related Read →

Can inference compute replace scaling up model size?

Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.

Explore related Read →

Can models improve themselves using only majority voting?

Explores whether test-time reinforcement learning can generate effective reward signals from unlabeled data by treating majority-voted answers as pseudo-labels, and whether this bootstrapping approach actually drives meaningful policy improvement.

Explore related Read →

What makes test-time training actually work in practice?

Test-time training achieved striking gains on ARC tasks, but which components are truly essential? This explores what happens when you remove each of the three key ingredients.

Explore related Read →

Why do reasoning models fail differently at training versus inference?

Reasoning models exhibit two distinct failure modes—entropy collapse during training and variance inflation during inference—that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.

Explore related Read →

How can we predict the optimal thinking token threshold?

Researchers are exploring what determines when a model should stop reasoning on a given task, since accuracy degrades beyond a critical threshold but no principled prediction method exists yet.

Explore related Read →

Training and Fine-Tuning

16 notes

Can utility-weighted training loss actually harm model performance?

When engineers weight loss functions to reflect real-world costs of different errors, does this improve or undermine learning? This explores whether baking asymmetric objectives into training creates unintended side effects.

Explore related Read →

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Explores whether identifying and protecting task-specific parameter regions can prevent the performance degradation that occurs when fine-tuning models on multiple tasks simultaneously. This matters because it could enable safe multi-task adaptation without sacrificing individual task performance.

Explore related Read →

Can semantic knowledge shift model behavior like reinforcement learning does?

Can textual descriptions of successful reasoning patterns, prepended as context, achieve the same distribution shifts that RL achieves through parameter updates? This matters because it could eliminate the need for expensive fine-tuning on limited data.

Explore related Read →

Can models trained on many imperfect experts outperform each one?

Do generative models trained on diverse, imperfect human experts develop an implicit consensus that surpasses any individual contributor? This explores whether aggregating diverse perspectives at training time, rather than inference time, can denoise human biases.

Explore related Read →

Can we train better models on less data?

Can gradient-based influence estimation identify which instruction data actually matters most? The research explores whether selecting small subsets of training data by their similarity to target capabilities might outperform training on everything.

Explore related Read →

Does instruction tuning teach task understanding or output format?

Exploring whether models trained on instructions actually learn the task semantics or merely learn to match output distributions. This matters because it challenges assumptions about how fine-tuning improves model behavior.

Explore related Read →

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.

Explore related Read →

Can abstractions guide exploration better than depth alone?

Does training a model to propose reasoning abstractions as intermediate subgoals help it explore diverse solution strategies more effectively than simply extending chain-of-thought depth?

Explore related Read →

Do pretraining and fine-tuning scale independently in language models?

Can we decouple how model scale affects different training stages to independently improve factuality versus helpfulness? This matters for understanding whether these capabilities compete or can be optimized separately.

Explore related Read →

Can step-wise expert rewards help small models learn hard reasoning?

When small models fail on hard multi-step problems, can training them to match expert reasoning steps rather than final answers provide useful learning signals? This explores whether intermediate-step alignment might overcome the limitations of both supervised fine-tuning and outcome-based reinforcement learning.

Explore related Read →

Does training on AI-generated content permanently degrade model quality?

When generative models train on outputs from previous models, do the resulting models lose rare patterns permanently? The question matters because future training data will inevitably contain synthetic content.

Explore related Read →

Mechanistic Interpretability

13 notes

Can LLMs handle multiple tasks at once during inference?

Do language models maintain multiple distinct in-context learning tasks simultaneously in their internal representations, and if so, what prevents them from actually generating outputs for more than one task?

Explore related Read →

How do language models organize features across processing layers?

Do neural networks arrange learned features into meaningful hierarchies as they process information? Understanding this structure could reveal how models build understanding from raw tokens to abstract concepts.

Explore related Read →

Can neural networks learn compositional skills without symbolic mechanisms?

Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.

Explore related Read →

Can identical outputs hide broken internal representations?

Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.

Explore related Read →

What happens inside models when they suddenly generalize?

Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?

Explore related Read →

How do language models detect injected steering vectors internally?

Research investigates the mechanistic basis for LLM introspective awareness—specifically, how models detect when their internal states have been artificially manipulated. Understanding this could reveal both security vulnerabilities and latent model capabilities.

Explore related Read →

Can we predict keyword priming before learning happens?

Exploring whether the degree to which newly learned keywords contaminate unrelated contexts can be predicted from measurable properties before training begins, and what mechanisms enable this prediction.

Explore related Read →

Do language models understand in fundamentally different ways?

Does mechanistic evidence reveal distinct tiers of understanding in LLMs—from concept recognition to factual knowledge to principled reasoning? And do these tiers coexist rather than replace each other?

Explore related Read →

Do neural networks naturally learn modular compositional structure?

Explores whether neural networks decompose compositional tasks into distinct subroutines without explicit symbolic design. This challenges the longstanding view that neural networks are fundamentally non-compositional.

Explore related Read →

Can AI pass every test while understanding nothing?

Explores whether neural networks can produce perfect outputs while having fundamentally broken internal representations. Asks what performance benchmarks actually measure and whether they can distinguish real understanding from fraud.

Explore related Read →

Do reflection tokens carry more information about correct answers?

Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.

Explore related Read →

Can sparse weight training make neural networks interpretable by design?

Explores whether constraining most model weights to zero during training produces human-understandable circuits and disentangled representations, rather than attempting to reverse-engineer dense models after training.

Explore related Read →

Novel LLM Architectures

12 notes

Are neural network optimizers actually memory systems?

Do gradient-based optimizers like Adam function as associative memory modules that compress context, just like network layers? This reframes the relationship between training and learning.

Explore related Read →

Can byte-level models match tokenized performance with better efficiency?

Tokenized models use fixed vocabularies and allocate equal compute per token, but what if we dynamically group bytes based on prediction difficulty instead? Could this approach achieve competitive performance while using fewer FLOPs?

Explore related Read →

Can energy minimization unlock reasoning without domain-specific training?

Can a gradient descent-based architecture achieve system 2 thinking across any modality or problem type using only unsupervised learning, without verifiers or reasoning-specific rewards?

Explore related Read →

Can evolutionary search beat sampling and revision at inference time?

Can LLMs evolve populations of solutions through recombination and selection to outperform simpler inference strategies? This matters because it could reveal whether biological-inspired search improves planning without formal problem definitions.

Explore related Read →

Can recurrent hierarchies achieve reasoning that transformers cannot?

Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This explores whether architectural design, not scale, enables true algorithmic reasoning.

Explore related Read →

Can cognition work by reusing memory instead of recomputing?

Does intelligence emerge from structured navigation of prior inference paths rather than fresh computation? This challenges whether brains and AI systems need to recalculate constantly or can leverage stored trajectories for efficiency.

Explore related Read →

Can models learn to evaluate their own work during training?

Explores whether language models can internalize reward function computation as part of training, transforming external feedback into internal self-assessment capability without slowing inference.

Explore related Read →

Can looped transformers generalize to unseen knowledge combinations?

Do transformers that reuse layers across iterations succeed where standard transformers fail at composing facts in novel ways? This matters because systematic generalization is a hallmark of human reasoning.

Explore related Read →

Can models dynamically activate expert skills at inference time?

Can language models efficiently discover and compose task-specific capabilities on the fly without modifying base weights? This explores whether test-time adaptation through expert vector composition outperforms fixed fine-tuning approaches.

Explore related Read →

Can parallel architectures solve inherently sequential problems?

Complexity theory suggests some problems like reasoning and planning are fundamentally sequential. Can parallel architectures like Transformers overcome this limitation, or do we need fundamentally different computational approaches?

Explore related Read →

Self-Refinement and Self-Consistency

10 notes

Why does self-correction training on offline data fail?

Can language models learn to correct their own mistakes through supervised training on correction examples? This explores whether distribution mismatch and behavior collapse prevent self-correction from emerging.

Explore related Read →

When should an agent actually stop and deliberate?

How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.

Explore related Read →

Can language models improve themselves without any external training data?

Explores whether two language models playing against each other—one generating questions, one solving them—can create a self-improving loop. Matters because it would eliminate dependence on human-labeled datasets.

Explore related Read →

Can model confidence work as a reward signal for reasoning?

Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.

Explore related Read →

Can models improve themselves on tasks without verifiable answers?

Most self-improvement methods require verifiable correctness signals like math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?

Explore related Read →

Does self-consistency reliably reward correct answers during training?

Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working?

Explore related Read →

Does self-generated training data improve model learning?

Can models learn more effectively from training data they generate themselves rather than data created by external sources? This explores whether a learner's own restructuring process produces better learning outcomes.

Explore related Read →

What limits how much models can improve themselves?

Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.

Explore related Read →

Why do self-improvement loops eventually stop improving?

Self-improvement systems often plateau because the evaluator that judges progress stays static while the actor grows. What happens when judges don't improve alongside learners?

Explore related Read →

Can models reliably improve themselves without external feedback?

Explores whether self-improvement alone can sustain progress or if structural limits—like the generation-verification gap and diversity collapse—require external anchoring to work reliably.

Explore related Read →

Reward Models

8 notes

Why do correct code trajectories teach models to tolerate errors?

Explores why standard outcome-based RL fails for code tool use: when models receive reward for correct final answers despite intermediate code errors, they learn that mistakes are acceptable, producing poor reasoning quality.

Explore related Read →

Can diversity optimization improve quality during language model training?

Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?

Explore related Read →

Does training order reshape how models handle different task types?

Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.

Explore related Read →

Does outcome-based RL diversity loss spread across unsolved problems?

When RL concentrates probability mass on correct answers for solved problems, does that narrowing propagate to problems the model cannot yet solve? And if so, what are the separate mechanisms for preserving diversity during training versus at test time?

Explore related Read →

Can reward models benefit from reasoning before scoring?

Does allowing evaluator models to generate reasoning traces before producing reward scores improve alignment and enable adaptive compute allocation? Three independent research teams converged on this insight simultaneously.

Explore related Read →

Why does self-rewarding training collapse when responses improve?

Self-Rewarding LLMs merge generator and evaluator for efficient iteration, but both improve so fast that good and bad responses converge, erasing the learning signal. What causes this failure and how can it be fixed?

Explore related Read →

Why do reward models ignore what question was asked?

Reward models score responses based on quality signals that persist even when prompts change. This explores whether AI grading systems actually evaluate relevance to the question or just response-level patterns.

Explore related Read →

Can reasoning improvement work without answer verification?

Explores whether RL-based reasoning training can extend beyond math and code to general domains like chemistry and law by replacing answer verification with a simpler signal based on reference answer likelihood.

Explore related Read →

Cognitive Models and Latent Representations

8 notes

How do language models encode syntactic relations geometrically?

Do LLM embeddings use distance alone or also direction to represent syntax? Understanding whether neural networks can spontaneously develop symbolic-compatible geometric structures.

Explore related Read →

Why does asking models to think first hurt performance?

Initial prompts to generate internal thoughts degrade instruction-following performance. What reverses this harm, and can thinking become useful beyond math and logic?

Explore related Read →

Can latent thought vectors scale language models beyond parameters?

Explores whether explicit latent thought vectors with dual-rate learning create new scaling dimensions independent of model size. This matters because it suggests alternatives to simply building larger models.

Explore related Read →

Can explicit stack tracking improve how transformers learn recursive syntax?

Can adding an explicit stack tape to transformers help them track recursive structure more efficiently? This matters because standard transformers struggle with long-tail recursive patterns despite their size and data.

Explore related Read →

Can we explore multiple reasoning paths without committing to one token?

Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?

Explore related Read →

Do transformers hide reasoning before producing filler tokens?

Explores whether language models compute correct answers in early layers but then deliberately overwrite them with filler tokens in later layers, suggesting reasoning and output formatting are separable processes.

Explore related Read →

Training Data

7 notes

Can agents learn from their own actions without external rewards?

Explores whether future states produced by an agent's own decisions can serve as supervision signals, bridging the gap between passive imitation learning and reward-dependent reinforcement learning.

Explore related Read →

Can synthetic data replace seed examples in task generation?

Can models generate high-quality synthetic data for novel tasks without relying on existing input-output exemplars? This matters because many specialized domains lack training examples to work from.

Explore related Read →

How do quality, diversity, and complexity affect synthetic data differently?

When training models on synthetic data, do quality, diversity, and complexity each play distinct roles in how well models generalize? Understanding their separate effects could explain why current optimization strategies fail.

Explore related Read →

Can we generate synthetic data without any seed examples?

Existing synthetic data methods rely on seed examples from the target distribution, which is impractical for novel domains. Can taxonomic decomposition eliminate this dependence while maintaining controllable coverage?

Explore related Read →

Why do language models need so much more text than humans?

Language models train on the surface of written text, but humans learn by inferring the underlying thoughts behind what they read. Does this explain why models need vastly more data to reach human-level understanding?

Explore related Read →

Diffusion-Based LLMs

7 notes

Why can't we easily adapt reinforcement learning to diffusion language models?

Autoregressive models enable efficient RL post-training through factorizable log-probabilities, but diffusion models generate tokens in parallel non-sequential order. What makes likelihood computation intractable in diffusion, and can we work around it?

Explore related Read →

Can diffusion models enable control that autoregressive models cannot reach?

Autoregressive language models struggle with complex global controls like syntax and infilling because they generate left-to-right and have discrete token bottlenecks. Can diffusion models' continuous latents and parallel denoising overcome these structural limitations?

Explore related Read →

Can diffusion language models match autoregressive inference speed?

Diffusion LLMs promised faster decoding through parallel token generation, but open-source implementations never outpaced autoregressive models in practice. What architectural barriers prevent diffusion from realizing its speed potential?

Explore related Read →

Can diffusion models commit to answers before full decoding?

Do diffusion language models settle on correct answers early in their refinement process, and if so, can we detect and exploit this convergence to speed up inference without losing quality?

Explore related Read →

Can diffusion models perform evolutionary search in parameter space?

Diffusion models and evolutionary algorithms share equivalent mathematical structures. Can we leverage this equivalence to build evolutionary search methods that preserve solution diversity better than traditional algorithms?

Explore related Read →

Can iterative revision cycles match how humans actually write?

Does framing research writing as a diffusion process—where drafts are refined through retrieval-augmented cycles—better capture human cognition than linear pipelines and reduce information loss?

Explore related Read →

Does autoregressive generation uniquely enable LLM scaling?

Is the autoregressive factorization truly necessary for LLM scalability, or do other generative principles like diffusion achieve comparable performance? This matters because it shapes which architectural paths deserve investment.

Explore related Read →

LLM Memory

6 notes

Can retrieval knowledge compress into a tiny parametric model?

Can the information stored in large non-parametric retrieval datastores be compressed into a small trainable module? This matters because it could combine retrieval's knowledge benefits with the speed of pure parametric methods.

Explore related Read →

Can brain memory systems explain how LLMs should store knowledge?

This explores whether the brain's three-tier memory architecture—neocortex, hippocampus, and prefrontal cortex—maps onto transformer weights, external knowledge stores, and agentic state. Understanding this mapping could reveal which AI memory problems each tier solves and which it cannot.

Explore related Read →

When do language models stop memorizing and start generalizing?

Can we measure the exact capacity limit where models transition from memorizing training data to learning underlying patterns? Understanding this boundary could reshape how we think about model learning and privacy.

Explore related Read →

Can recursive subtask trees overcome context window limits?

Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.

Explore related Read →

Inference-Time Scaling

4 notes

Can architecture choices improve inference efficiency without sacrificing accuracy?

Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets.

Explore related Read →

Can models learn to internalize search algorithms through training?

Can chain-of-thought reasoning be taught as an explicit search process that models learn to implement internally? This matters because it could unlock algorithmic optimization rather than just output optimization.

Explore related Read →

Can models treat long prompts as external code environments?

Do language models handle vastly longer inputs by offloading context to a Python REPL and querying it programmatically, rather than fitting everything into the transformer's attention window?

Explore related Read →

Does RL training follow predictable scaling curves?

Can we forecast where RL training will plateau before committing full compute? ScaleRL tests whether sigmoid curves reliably predict performance ceilings across 200+ models.

Explore related Read →

Context Engineering

4 notes

Can length generalization transfer between different related tasks?

Can a model trained on longer sequences in one task learn to handle longer inputs in a related task without explicit training? This matters for understanding how neural networks reuse computational strategies across problems.

Explore related Read →

Multimodal Models

3 notes

Mobile and On-Device LLMs

3 notes

Does depth matter more than width for tiny language models?

Explores whether deep-and-thin architectures outperform wide-and-shallow ones at sub-billion scales, and why this might contradict larger-model scaling laws.

Explore related Read →

Does recomputing weights cost less than moving them on mobile?

Explores whether mobile hardware's memory bottleneck makes it cheaper to recompute transformer blocks than to fetch their weights twice, and whether this trades accuracy for efficiency.

Explore related Read →

What actually limits language models on mobile phones?

Is the shift toward smaller LLMs driven by quality trade-offs, or by hard physical constraints on device memory and battery life? This note examines whether sub-billion models are a practical necessity rather than a compromise.

Explore related Read →

Evolutionary Methods

1 note

Foundation Models

1 note