Can energy minimization unlock reasoning without domain-specific training?
Can a gradient descent-based architecture achieve System 2 thinking across any modality or problem type using only unsupervised learning, with no verifiers and no reasoning-specific rewards?
Energy-Based Transformers (EBTs) take a fundamentally different approach to inference-time scaling. Rather than generating tokens in a single sequential pass, an EBT is trained to assign a scalar energy (an unnormalized compatibility score) to each input and candidate-prediction pair. Prediction is then reframed as gradient descent-based energy minimization: the model iteratively refines its prediction by descending the energy landscape until convergence.
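The core loop can be sketched in a few lines. This is a deliberately toy illustration, not the paper's architecture: the energy function is hard-coded rather than learned, and the prediction is a single scalar rather than a token or image. It only shows the mechanism of "prediction as descent on an energy landscape."

```python
def energy(x, y):
    # Toy scalar energy: low when the candidate prediction y is
    # compatible with the context x. A real EBT *learns* this function;
    # here E(x, y) = (y - 2x)^2 is hard-coded purely for illustration.
    return (y - 2.0 * x) ** 2

def energy_grad(x, y):
    # Analytic dE/dy for the toy energy above.
    return 2.0 * (y - 2.0 * x)

def predict(x, y_init=0.0, lr=0.1, steps=50):
    # Prediction as energy minimization: start from an initial guess
    # and repeatedly descend the energy landscape w.r.t. y.
    y = y_init
    for _ in range(steps):
        y -= lr * energy_grad(x, y)
    return y

y_hat = predict(x=3.0)  # descends toward the energy minimum at y = 6
```

In the actual model the gradient would be taken through the transformer with autodiff; the structure of the loop (initialize, descend, stop at convergence) is the same.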
This formulation enables System 2 Thinking to emerge from unsupervised learning without any of the domain-specific scaffolding that current approaches require:
- No modality restrictions (works on both text and images)
- No problem-specific design (not limited to verifiable domains like math/code)
- No additional supervision beyond unsupervised pretraining (no verifiers, no verifiable rewards)
The scaling results are striking:
- Training: Up to 35% higher scaling rate than Transformer++ with respect to data, batch size, parameters, FLOPs, and depth
- Inference: 29% more improvement from additional test-time compute on language tasks than Transformer++
- Generalization: Larger performance improvements on data farther out-of-distribution — suggesting EBTs generalize better than existing approaches
- Efficiency: Outperform Diffusion Transformers on image denoising with fewer forward passes
The deeper implication: current test-time scaling approaches are constrained by their dependence on either (a) verbalized reasoning chains, which require domain-specific training data, or (b) verifiable reward signals for RL-based training. EBTs bypass both constraints by making "thinking harder" an inherent property of the architecture: more gradient descent iterations at inference mean more thinking, with the model's own energy function acting as an implicit verifier.
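The "more iterations = more thinking" claim can be made concrete with a toy experiment (assumed setup, not the paper's): hold the energy landscape fixed and vary only the inference-time step budget. Final energy falls monotonically toward the minimum as the budget grows, which is what lets compute substitute for thinking time.

```python
def energy(y):
    # Toy one-dimensional energy with its minimum at y = 4
    # (an arbitrary choice for illustration).
    return (y - 4.0) ** 2

def refine(y, steps, lr=0.05):
    # "Thinking harder" is just a larger step budget for the
    # same descent procedure.
    for _ in range(steps):
        y -= lr * 2.0 * (y - 4.0)  # analytic dE/dy
    return y

budgets = [1, 5, 20, 100]
energies = [energy(refine(0.0, s)) for s in budgets]
# Final energy shrinks monotonically as the budget grows,
# approaching zero at convergence.
```

The energy value itself doubles as the implicit verifier: a low final energy signals a prediction the model considers compatible with the input, with no external reward model required.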
This challenges the implicit assumption in "Can non-reasoning models catch up with more compute?": EBTs are not "reasoning models" in the RL-trained sense, yet they scale with inference compute, because energy minimization is itself a form of iterative refinement that requires no explicit reasoning traces.
Source: Novel Architectures
Related concepts in this collection
- Can inference compute replace scaling up model size? Explores whether smaller models given more thinking time during inference can match larger models; this matters because it reshapes deployment economics and compute-allocation strategies. EBTs operationalize the idea at the architecture level: energy minimization inherently scales with inference compute.
- Can non-reasoning models catch up with more compute? Explores whether an inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this. EBTs may redefine the boundary: energy minimization is a form of inference-time computation that requires no reasoning-specific RL training.
- Does more thinking time actually improve LLM reasoning? The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data show when tested directly? EBTs add nuance: for energy-based architectures, more iterations genuinely improve results until convergence, unlike token-based reasoning, where overthinking degrades quality.
- Can recurrent hierarchies achieve reasoning that transformers cannot? Asks whether a dual-timescale recurrent architecture can escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought, i.e., whether architectural design rather than scale enables true algorithmic reasoning. Complementary latent architecture: HRM achieves near-perfect accuracy via dual recurrence on tasks where CoT scores 0%, while EBTs achieve up to 35% higher scaling rates via energy minimization; different mechanisms (recurrence vs. gradient descent) escaping the same TC0 constraint.
Original note title: energy-based transformers achieve system 2 thinking from unsupervised learning alone — modality and problem agnostic