Can energy-based transformers achieve deep reasoning without supervision?

This explores whether Energy-Based Transformers (EBTs) can learn to reason (the slow, deliberate "System 2" kind) from raw unsupervised learning alone, with no task-specific training, and how that bet compares to other routes the corpus takes toward unsupervised reasoning.

The corpus's direct answer is encouraging. EBTs reframe inference as energy minimization: the model assigns an energy score to each input-prediction pair and uses gradient descent at inference time to settle into a low-energy answer, effectively "thinking" by iterating rather than emitting in one shot. The headline claim is 35% higher training scaling rates and 29% more inference-compute gains than the Transformer++ baseline, with better generalization on out-of-distribution data and no domain-specific scaffolding (see "Can energy minimization unlock reasoning without domain-specific training?"). So the short answer the library offers is: yes, in principle, deep reasoning can emerge from the right objective rather than from supervised labels.
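The inference-as-energy-minimization idea can be made concrete with a toy sketch. This is not the EBT architecture: the quadratic `energy` function, its hand-written gradient, and the `infer` helper are all illustrative assumptions standing in for a learned energy model and automatic differentiation. The only point is the shape of the loop, in which a candidate prediction is refined by descending the energy rather than being emitted in one pass.

```python
# Toy illustration only: a hand-written quadratic energy stands in for a
# learned energy model. All names here are hypothetical.

def energy(x: float, y: float) -> float:
    """Low energy means y is compatible with x (here: y close to 2*x + 1)."""
    return (y - (2.0 * x + 1.0)) ** 2


def energy_grad_y(x: float, y: float) -> float:
    """Analytic gradient of the toy energy with respect to the candidate y."""
    return 2.0 * (y - (2.0 * x + 1.0))


def infer(x: float, y_init: float = 0.0, steps: int = 50, lr: float = 0.1):
    """Refine a candidate prediction by gradient descent on the energy.

    Returns the final candidate and the energy recorded before each step
    plus once at the end, so the list has steps + 1 entries.
    """
    y = y_init
    trajectory = [energy(x, y)]
    for _ in range(steps):
        y -= lr * energy_grad_y(x, y)  # one "thinking" step
        trajectory.append(energy(x, y))
    return y, trajectory
```

With these settings, `infer(3.0)` settles close to 7.0 and the recorded energy never rises between steps. Raising `steps` is the toy analogue of spending more inference compute on the same input.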

But the interesting part is how this sits among the corpus's other escape routes from the same trap. EBTs are one of several architectures betting that fixed-depth, feed-forward transformers are the bottleneck. The Hierarchical Reasoning Model (HRM) makes a different bet: it couples slow abstract planning with fast detailed computation across two timescales to break past the depth ceiling that constrains standard transformers, solving Sudoku and mazes that chain-of-thought fails on with only 27M parameters (see "Can recurrent hierarchies achieve reasoning that transformers cannot?"). Both are reacting to the same diagnosis: that transformers often only *look* like they reason. One note shows compositional reasoning in transformers collapses into memorized subgraph matching that shatters on novel combinations (see "Do transformers actually learn systematic compositional reasoning?"), and another finds genuine multi-hop reasoning only emerges in late training stages and needs explicit compositional exposure to generalize (see "How do transformers learn to reason across multiple steps?"). EBT's energy-minimization loop is one proposed way to get real iterative computation instead of pattern recall.
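The two-timescale coupling can be pictured as a nested loop. The sketch below is an assumption-laden toy, not the Hierarchical Reasoning Model itself: the scalar states, the averaging updates, and the name `run_two_timescale` are invented for illustration. It shows only the control flow, where a fast state is updated several times for every single update of a slow state, so the number of sequential updates grows with both loop counts rather than being fixed by layer count.

```python
# Toy control-flow sketch of a two-timescale recurrence. The update rules
# are arbitrary placeholders, not the HRM's actual modules.

def run_two_timescale(x: float, slow_steps: int = 3, fast_steps_per_slow: int = 4):
    high = 0.0  # slow "planning" state, updated once per outer iteration
    low = 0.0   # fast "detail" state, updated on every inner iteration
    fast_updates = 0
    for _ in range(slow_steps):
        for _ in range(fast_steps_per_slow):
            # Hypothetical fast update, conditioned on the input and the current plan.
            low = 0.5 * low + 0.5 * (x + high)
            fast_updates += 1
        # Hypothetical slow update, which absorbs the result of the fast phase.
        high = 0.5 * high + 0.5 * low
    return high, low, fast_updates
```

Calling `run_two_timescale(1.0)` with the defaults performs 3 slow updates and 12 fast ones.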

The "without supervision" half of the question opens a second front. EBTs get there through the learning objective itself, but the corpus has a sharply different unsupervised path: self-play. Ctx2Skill's three-role loop manufactures the missing feedback signal internally. A Challenger escalates difficulty as a curriculum, a Judge issues binary verdicts as reward, and skills co-evolve in natural language, all without human labels (see "Can language models learn skills without human supervision?"). That's worth pairing with EBTs because it answers a different question: EBT removes supervision from *how the model computes an answer*, while self-play removes it from *where the training signal comes from*. Both dodge human annotation, but at different layers of the stack.
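The feedback structure of such a self-play loop can be sketched in a few lines. Everything here is a simplifying assumption rather than Ctx2Skill's actual procedure: skills and task difficulty are reduced to integers, the `judge` rule is a bare threshold, and the third role is given the hypothetical name "solver". Real skill edits are natural-language text, and the source note also mentions a generalization safeguard that this toy leaves out.

```python
# Toy self-play loop: integers stand in for natural-language skills and tasks.
# All names and rules are hypothetical.

def judge(skill: int, difficulty: int) -> bool:
    """Binary verdict: does the solver's current skill clear the task?"""
    return skill >= difficulty


def self_play(rounds: int = 10):
    skill, difficulty = 0, 1
    history = []
    for _ in range(rounds):
        passed = judge(skill, difficulty)
        history.append((difficulty, skill, passed))
        if passed:
            difficulty += 1  # Challenger escalates the curriculum
        else:
            skill += 1       # solver revises its skill after a failed verdict
    return skill, difficulty, history
```

No external label enters the loop: the only training signal is the Judge's pass or fail verdict, and that verdict drives both the curriculum and the skill updates.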

There's also a quieter, cheaper rival that should temper the "new architecture" excitement. Cognitive tools show you can lift a frozen GPT-4.1 from 26.7% to 43.3% on AIME2024 with zero RL training, just by wrapping reasoning operations in modular sandboxed calls that isolate each step (see "Can modular cognitive tools unlock reasoning without training?"). The implication cuts against EBTs in an interesting way: some "reasoning" capability is already latent in standard models and merely needs to be *elicited* rather than *trained in*. Prompting theory backs this up: a single finite-size transformer can provably compute any computable function given the right prompt, even though ordinary training rarely produces models that actually behave that way (see "Can a single transformer become universally programmable through prompts?"). So the honest tension in the corpus is: is deep unsupervised reasoning an architecture problem (EBT, HRM), a training-signal problem (self-play), or an elicitation problem (cognitive tools)?

Where the corpus gets quiet is the head-to-head comparison. It has strong evidence that training regime beats raw inference compute: non-reasoning models can't simply spend their way to parity, because reasoning has to be instilled, not bought at test time (see "Can non-reasoning models catch up with more compute?"). EBTs claim to convert *more* inference compute into *better* answers via energy descent, which is exactly the lever that note says is usually weak. That's the unresolved bet worth watching: EBTs promise the test-time scaling that the rest of the corpus says training, not inference, normally controls. The library doesn't yet have a verdict, but it gives you the precise fault line to read along.

Sources (8 notes)

Can energy minimization unlock reasoning without domain-specific training?

Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

How do transformers learn to reason across multiple steps?

Controlled training reveals transformers learn multi-hop reasoning in three phases: memorization, in-distribution generalization, and cross-distribution reasoning. Successful reasoning correlates with cosine clustering of entity representations, and second-hop generalization requires explicit compositional exposure during training.

Can language models learn skills without human supervision?

Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Can a single transformer become universally programmable through prompts?

Research proves a single finite-size transformer exists that can compute any computable function given the right prompt, achieving complexity bounds nearly matching unbounded models. However, standard training rarely produces models that learn to implement arbitrary programs this way.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
