
How should reasoning systems actually be architected?

Navigation hub for architectural mechanisms that enable efficient reasoning in language models and multi-agent systems.

Topic Hub · 46 linked notes · 14 sections

What RL Training Actually Does

2 notes

Does RL teach reasoning or just when to use it?

Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach the model when to activate capabilities it already has? This matters for understanding where reasoning truly emerges.


Do base models already contain hidden reasoning ability?

Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.


Architectural Efficiency

3 notes

Can reasoning and tool execution run in parallel?

Standard LLM tool use halts generation for every tool response, creating redundant prompts and sequential delays. Do alternative architectures that separate reasoning from tool observation actually eliminate these costs?
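One way to picture the separation: launch tool calls asynchronously and keep reasoning while they run, instead of blocking on every observation. A minimal sketch, assuming hypothetical `run_tool` and `generate_step` helpers (not from any specific paper or library):

```python
import asyncio

async def run_tool(call: dict) -> str:
    """Hypothetical tool executor; swap in a real API or function call."""
    await asyncio.sleep(0.1)          # simulate I/O latency
    return f"result of {call['name']}"

async def reason_with_parallel_tools(task: str, generate_step) -> list[str]:
    """Interleave reasoning steps with tool execution instead of blocking.

    `generate_step(task, observations)` is an assumed helper returning
    (thought, tool_calls, done); it stands in for one decoding step.
    """
    observations, thoughts = [], []
    pending: set[asyncio.Task] = set()
    done = False
    while not done or pending:
        # Harvest any tool results that finished while we were reasoning.
        finished = {t for t in pending if t.done()}
        observations.extend(t.result() for t in finished)
        pending -= finished
        if not done:
            thought, tool_calls, done = generate_step(task, observations)
            thoughts.append(thought)
            # Launch new tool calls without waiting on them.
            pending.update(asyncio.create_task(run_tool(c)) for c in tool_calls)
            await asyncio.sleep(0)    # yield so launched tools can make progress
        elif pending:
            await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
    return thoughts
```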


Does separating planning from execution improve reasoning accuracy?

Explores whether modularizing decomposition and solution into separate models prevents interference and boosts performance compared to monolithic approaches.


Can reasoning stay grounded without external feedback loops?

Explores whether language models can maintain accurate reasoning through their own internal chains of thought, or whether they need real-world feedback to avoid hallucination and error propagation.


Latent and Non-Verbal Reasoning

12 notes

Can models reason without generating visible thinking tokens?

Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.


Can recurrent hierarchies achieve reasoning that transformers cannot?

Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This explores whether architectural design, not scale, enables true algorithmic reasoning.


Can energy minimization unlock reasoning without domain-specific training?

Can a gradient descent-based architecture achieve System 2 thinking across any modality or problem type using only unsupervised learning, without verifiers or reasoning-specific rewards?


Can we explore multiple reasoning paths without committing to one token?

Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?
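One concrete way to picture this: instead of sampling a single token, feed back the probability-weighted mixture of token embeddings, so the next step conditions on the whole distribution rather than one choice. A minimal PyTorch sketch of that idea (illustrative only, not any specific paper's method):

```python
import torch

def soft_next_input(logits: torch.Tensor,
                    embedding_matrix: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """Return a 'soft token': the expectation of token embeddings under the
    model's predicted distribution, instead of the embedding of one sample.

    logits:            (vocab_size,) next-token logits from the model
    embedding_matrix:  (vocab_size, hidden_dim) input embedding table
    """
    probs = torch.softmax(logits / temperature, dim=-1)   # (vocab_size,)
    return probs @ embedding_matrix                       # (hidden_dim,)

# Hard decoding collapses to one trajectory; the soft input keeps all of them
# in superposition and can be fed back as the next-step input embedding.
```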


Can latent thought vectors scale language models beyond parameters?

Explores whether explicit latent thought vectors with dual-rate learning create new scaling dimensions independent of model size. This matters because it suggests alternatives to simply building larger models.


Can agents share thoughts directly without using language?

Explores whether multi-agent systems can communicate by exchanging latent thoughts extracted from hidden states, bypassing the ambiguity and misalignment problems inherent in natural language.


Can looped transformers generalize to unseen knowledge combinations?

Do transformers that reuse layers across iterations succeed where standard transformers fail at composing facts in novel ways? This matters because systematic generalization is a hallmark of human reasoning.
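The architectural move is simple to state: apply one block of layers repeatedly, so depth comes from iteration rather than from new parameters. A minimal PyTorch sketch, not tied to any particular looped-transformer paper:

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """Apply the same transformer encoder block for a fixed number of loops,
    sharing weights across iterations (depth via recurrence, not new layers)."""

    def __init__(self, dim: int = 256, heads: int = 4, loops: int = 6):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                batch_first=True)
        self.loops = loops

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.loops):   # same parameters reused each iteration
            x = self.block(x)
        return x

x = torch.randn(2, 10, 256)            # (batch, seq, dim)
print(LoopedBlock()(x).shape)          # torch.Size([2, 10, 256])
```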


Why does asking models to think first hurt performance?

Prompting models to generate internal thoughts before responding can degrade instruction-following performance. What reverses this harm, and can thinking become useful beyond math and logic?


Can we trigger reasoning without explicit chain-of-thought prompts?

This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.


Can continuous reasoning avoid forgetting in instruction-tuned models?

Full fine-tuning for continuous-space reasoning degrades performance in capable instruction-tuned models. Why does this happen, and can architectural changes prevent it?


Can we measure how deeply a model actually reasons?

What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching?
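One way to operationalize this is a logit-lens-style probe: decode each intermediate layer's hidden state through the unembedding and measure how far the predicted distribution moves from layer to layer. A rough sketch, assuming a model that exposes per-layer hidden states and an unembedding matrix (hypothetical inputs, not a specific metric from the literature):

```python
import torch
import torch.nn.functional as F

def layerwise_prediction_shift(hidden_states: list[torch.Tensor],
                               unembed: torch.Tensor) -> list[float]:
    """Measure how much the next-token distribution moves between layers.

    hidden_states: per-layer hidden vectors for one position, each (hidden_dim,)
    unembed:       (hidden_dim, vocab_size) output projection ("logit lens")
    Returns KL divergences between consecutive layers' predicted distributions.
    """
    log_dists = [F.log_softmax(h @ unembed, dim=-1) for h in hidden_states]
    shifts = []
    for prev, cur in zip(log_dists[:-1], log_dists[1:]):
        # KL(prev || cur): how much the prediction is revised at the next layer
        shifts.append(F.kl_div(cur, prev.exp(), reduction="sum").item())
    return shifts  # larger values = more revision happening at that depth
```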


Where does LLM reasoning actually happen during generation?

Does multi-step reasoning emerge from visible chain-of-thought text, hidden layer dynamics, or simply more computation? Three competing hypotheses make different predictions and can be empirically tested.


Structured Decomposition

2 notes

Can algorithms plus limited LLM calls solve complex tasks better?

Explores whether decomposing tasks into step-specific prompts within algorithmic control flow—rather than asking the LLM to manage full state—overcomes context window and reasoning limits while improving task performance.
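In code, the pattern looks like an ordinary program that owns the state and the control flow, consulting the LLM only for narrow, step-specific decisions. A minimal sketch, assuming a hypothetical `call_llm(prompt)` helper:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical single-turn LLM call; wire up to a real client in practice."""
    raise NotImplementedError

def solve_with_algorithmic_control(items: list[str]) -> dict[str, str]:
    """The program, not the model, holds full state and the control flow;
    each LLM call sees only one small, step-specific prompt."""
    results: dict[str, str] = {}
    for item in items:                                  # deterministic loop
        category = call_llm(f"Classify this item in one word: {item}")
        if category.strip().lower() == "irrelevant":    # deterministic branch
            continue
        results[item] = call_llm(
            f"Summarize '{item}' for the category '{category}' in one sentence."
        )
    return results
```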


Can models dynamically activate expert skills at inference time?

Can language models efficiently discover and compose task-specific capabilities on the fly without modifying base weights? This explores whether test-time adaptation through expert vector composition outperforms fixed fine-tuning approaches.
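The mechanism can be pictured as a library of task-specific delta ("expert") matrices mixed into frozen base weights at inference time, with mixing weights chosen per input. A toy NumPy sketch of the composition step only, not any specific method's recipe:

```python
import numpy as np

def compose_experts(base_weight: np.ndarray,
                    expert_deltas: list[np.ndarray],
                    mixing_weights: np.ndarray) -> np.ndarray:
    """Return an adapted weight matrix: frozen base plus a weighted sum of
    expert delta matrices chosen at inference time (base weights untouched)."""
    assert len(expert_deltas) == len(mixing_weights)
    delta = sum(a * d for a, d in zip(mixing_weights, expert_deltas))
    return base_weight + delta

base = np.random.randn(8, 8)
experts = [np.random.randn(8, 8) * 0.01 for _ in range(3)]  # e.g. per-skill deltas
alphas = np.array([0.7, 0.3, 0.0])                          # chosen per input/task
adapted = compose_experts(base, experts, alphas)
```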


CoT Limitations in Practice

2 notes

Does chain of thought reasoning actually explain model decisions?

When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.


Do reasoning cycles in hidden states reveal aha moments?

What if the cycles visible in a model's hidden-state topology correspond to the moments of reconsideration that occur during reasoning? This note explores whether graph cyclicity captures a mechanistic signature of insight.


Reasoning Elicitation Without RL

2 notes

Can modular cognitive tools boost LLM reasoning without training?

Does structuring reasoning as discrete, sandboxed tool calls elicit stronger problem-solving in language models than monolithic prompting, and can it match specialized reasoning models?


Can symbolic solvers fix how LLMs reason about logic?

LLMs excel at understanding natural language but fail at precise logical inference. Can pairing them with deterministic symbolic solvers—using solver feedback to refine attempts—overcome this fundamental weakness?
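The loop structure is worth seeing explicitly: the model translates the problem into a formal program, a deterministic solver runs it, and solver errors go back into the prompt for another attempt. A sketch with hypothetical `call_llm` and `run_solver` helpers (not a specific framework's API):

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call returning a candidate formal program."""
    raise NotImplementedError

def run_solver(program: str) -> tuple[bool, str]:
    """Hypothetical symbolic solver: returns (succeeded, answer_or_error)."""
    raise NotImplementedError

def solve_with_solver_feedback(question: str, max_rounds: int = 3) -> str | None:
    """The LLM translates the question into logic; the solver does the inference.
    Solver error messages are fed back so the model can repair its program."""
    feedback = ""
    for _ in range(max_rounds):
        program = call_llm(
            f"Translate into a logic program:\n{question}\n{feedback}"
        )
        ok, result = run_solver(program)
        if ok:
            return result                      # deterministic, solver-derived answer
        feedback = f"Previous attempt failed with: {result}. Fix the program."
    return None                                # fall back (e.g. to direct answering)
```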


Role-Aware Reasoning

1 note

Training Methods for Reasoning

5 notes

Do formal language prototypes improve reasoning across different domains?

This explores whether training LLMs on abstract reasoning patterns in formal languages like Prolog and PDDL creates generalizable reasoning foundations that transfer to structurally similar problems across diverse domains.


Can curriculum learning approximate expensive process supervision?

Can a reverse curriculum that slides the starting point backward from task completion provide step-level supervision signal comparable to human process annotations, at the cost of outcome-only supervision?
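The construction is mechanical: start training episodes from states just before a known solution's end, then slide the starting point backward so the model must recover more of the reasoning itself. A minimal, purely illustrative sketch of building such a curriculum from gold solution steps:

```python
def reverse_curriculum(question: str, gold_steps: list[str]) -> list[dict]:
    """Build training episodes that start progressively further from the goal.

    Stage 1 gives the model all but the last gold step as a prefix (easy);
    the final stage gives no prefix at all (solve from scratch, outcome reward only).
    """
    episodes = []
    for stage in range(1, len(gold_steps) + 1):
        prefix = gold_steps[: len(gold_steps) - stage]    # shrinks each stage
        episodes.append({
            "stage": stage,
            "prompt": question + "\n" + "\n".join(prefix),
            "steps_left_to_generate": stage,
        })
    return episodes

# Example: a 3-step gold solution yields 3 stages, from "finish the last step"
# up to "solve from scratch", trained in order of increasing difficulty.
```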


Why do outcome-based reward models fail at intermediate step evaluation?

Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.


Can backward reasoning during training improve forward reasoning?

This explores whether training models to reason backward—generating inverse questions and backward reasoning paths—builds internal consistency checking that transfers to forward-only inference without test-time overhead.


Does planning backward help when goals have bottlenecks?

Can language models exploit structural asymmetries in planning problems by reversing the search direction? This matters because most planning research assumes forward-only generation, potentially missing efficiency gains when bottlenecks constrain early possibilities.


Dialogue-as-Reasoning Architecture

2 notes

Can dialogue format help models reason more diversely?

Explores whether structuring internal reasoning as multi-agent dialogue rather than monologue can improve strategy diversity and coherency across different problem types, using the Compound-QA benchmark.


Can dialogue planning balance fast responses with strategic depth?

Can a system use quick instinctive responses for familiar conversation contexts while activating deeper planning only when uncertainty demands it? This explores whether adaptive computation improves dialogue goal-reaching.
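The adaptive pattern is essentially a gate: answer directly when the fast policy is confident, and invoke the expensive planner only when uncertainty (here, entropy over candidate strategies) crosses a threshold. A toy sketch with hypothetical `fast_policy` and `deep_planner` helpers:

```python
import math

def fast_policy(context: str) -> dict[str, float]:
    """Hypothetical instinctive policy: returns strategy -> probability."""
    raise NotImplementedError

def deep_planner(context: str) -> str:
    """Hypothetical slow planner (e.g. lookahead over dialogue moves)."""
    raise NotImplementedError

def choose_action(context: str, entropy_threshold: float = 1.0) -> str:
    """Use the fast policy when it is confident; plan deeply only when the
    strategy distribution is high-entropy (the situation is unfamiliar)."""
    probs = fast_policy(context)
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    if entropy < entropy_threshold:
        return max(probs, key=probs.get)       # cheap, instinctive response
    return deep_planner(context)               # costly planning when uncertain
```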


Memory-Augmented Reasoning

2 notes

Can recursive subtask trees overcome context window limits?

Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.
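The data structure at the center of this idea is a tree of subtasks in which completed branches collapse to short summaries, so only the active path plus those summaries has to stay in the context or KV cache. A toy sketch of that bookkeeping, independent of any particular system:

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    goal: str
    children: list["Subtask"] = field(default_factory=list)
    summary: str | None = None           # set once the subtask is finished

    def prune_completed(self) -> None:
        """Collapse finished subtrees: drop their children, keep the summary."""
        for child in self.children:
            child.prune_completed()
        if self.summary is not None:
            self.children = []            # detail no longer needs cache space

    def active_context(self) -> list[str]:
        """What still has to live in working memory: open goals plus summaries."""
        if self.summary is not None:
            return [f"[done] {self.goal}: {self.summary}"]
        lines = [f"[open] {self.goal}"]
        for child in self.children:
            lines.extend(child.active_context())
        return lines
```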


Can reasoning systems maintain memory across multiple retrieval cycles?

Does integrating evidence across iterative retrieval steps—rather than treating each step independently—help systems resolve contradictions and build coherent understanding in complex narratives?


Sequential Decision Making

2 notes

Why do trajectories matter more than individual examples for in-context learning?

Can language models learn new sequential decision-making tasks from context alone, and if so, what data properties make this possible? This explores why isolated state-action pairs fail where full trajectories succeed.


Why do LLMs struggle with exploration in simple decision tasks?

This explores why large language models fail at exploration—a core decision-making capability—even when they excel at other tasks, and what specific conditions might help them succeed.


Writing Angles

3 notes

Does RL teach reasoning or teach when to use it?

Post-training RL gets credit for building reasoning into language models, but emerging evidence suggests base models already possess this capability. The question is whether RL creates new reasoning skills or simply teaches deployment timing.


Does chain-of-thought reasoning actually explain model decisions?

Chain-of-thought is deployed to make AI systems transparent and auditable. But does the reasoning chain actually correlate with correct outputs, or does it just create an illusion of explainability?


Can models reason without generating visible thinking steps?

Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? This challenges how we measure and understand reasoning.


Pass 3 Additions (2026-05-03)

3 notes

Why do planning and grounding pull against each other in agents?

Planning requires flexibility and error recovery while grounding demands action accuracy. Do these conflicting optimization requirements force a design choice about how to structure agent architectures?


Can structured interfaces help language models control GUIs better?

Explores whether separating visual understanding from element grounding through an intermediate interface layer improves how language models interact with graphical interfaces. Matters because current end-to-end approaches ask models to do too much at once.


Can we measure reasoning quality beyond output plausibility?

How might we evaluate whether AI systems reason internally like humans do, rather than just producing human-like outputs? This matters because surface coherence can mask broken underlying reasoning.
