What training interventions could close the perception-action gap?

This inquiry explores how training, rather than architecture or prompting tricks, can knit a model's perception (what it takes in) to its action (what it does with that input). It reads the 'gap' as the loop in which a model's own outputs become inputs that shape what it should perceive and do next.

The most direct answer in the corpus is that this loop isn't present at birth. A base model trained only to predict text treats each token as a passive guess, but the note on how post-training shifts a model from passive prediction to enaction shows that post-training measurably flips this: the model starts treating its own outputs as actions that shape its future inputs. The evidence is 3–4x lower entropy on-policy and signs that the model recognizes its own trajectory. In other words, closing the gap is something training does, not something you prompt your way across.
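The entropy comparison mentioned above can be made concrete with a small sketch. This is only an illustration of how one might measure mean next-token entropy on two sets of contexts; the function names and the toy distributions are invented for the example and are not taken from the underlying note.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(distributions):
    """Average entropy over a sequence of next-token distributions."""
    if not distributions:
        raise ValueError("need at least one distribution")
    return sum(entropy(d) for d in distributions) / len(distributions)

# Toy, made-up distributions: flatter ones stand in for contexts the model
# did not generate itself, peaked ones for its own (on-policy) continuations.
off_policy = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
on_policy = [[0.9, 0.05, 0.03, 0.02], [0.85, 0.1, 0.03, 0.02]]

# A ratio above 1 means the model is more certain on its own outputs.
ratio = mean_entropy(off_policy) / mean_entropy(on_policy)
print(f"off-policy / on-policy entropy ratio: {ratio:.2f}")
```

With a real model, the distributions would come from the softmax over its logits at each position. The ratio printed by this toy data says nothing about the reported 3–4x figure.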

But which training, and where it points, matters enormously. "Does verbose chain-of-thought actually help multimodal perception tasks?" is the cautionary note: when the real bottleneck is perception (how visual attention gets allocated), piling on text-token reasoning optimizes the wrong target and actively hurts fine-grained perception. The lesson generalizes: closing a perception-action gap means training the part that is actually limiting, not the part that is easiest to reward. "Does RL training follow a predictable two-phase learning sequence?" sharpens this into a sequence. RL first consolidates execution (getting the action mechanically right); then the bottleneck shifts to strategic planning, and concentrating optimization on planning tokens in that second phase yields the real gains. If perception-action is your gap, the intervention you need depends on which phase you're stuck in.
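One way to picture "concentrating optimization on planning tokens" is a per-token loss in which planning tokens carry more weight. The sketch below is an assumption-laden illustration, not the method from the note: the `is_planning` mask, the `planning_weight` parameter, and the function name are all hypothetical, and how planning tokens are identified in practice is left open.

```python
def weighted_token_loss(token_losses, is_planning, planning_weight=2.0):
    """Weighted mean of per-token losses.

    Tokens flagged as planning tokens get `planning_weight`; all other
    (execution) tokens get weight 1.0. The mask is assumed to be given.
    """
    if len(token_losses) != len(is_planning):
        raise ValueError("token_losses and is_planning must have equal length")
    if not token_losses:
        raise ValueError("need at least one token")
    weights = [planning_weight if flag else 1.0 for flag in is_planning]
    total = sum(w * loss for w, loss in zip(weights, token_losses))
    return total / sum(weights)

# Hypothetical per-token losses; the middle token is marked as planning.
losses = [1.0, 4.0, 1.0]
mask = [False, True, False]

uniform = weighted_token_loss(losses, mask, planning_weight=1.0)
focused = weighted_token_loss(losses, mask, planning_weight=2.0)
print(uniform, focused)
```

Raising `planning_weight` pulls the averaged loss, and therefore the gradient signal, toward the planning positions. In the two-phase picture this would only make sense once execution has stabilized.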

A second family of interventions grounds action in feedback rather than just better internal reasoning. "Can interleaving reasoning with real-world feedback prevent hallucination?" (ReAct) alternates reasoning with real external queries, injecting fresh perception at each step so errors can't avalanche, and it outperforms pure chain-of-thought by 10–34% absolute accuracy on knowledge-intensive and interactive tasks. This is arguably the cleanest 'close the gap' move: don't make the model imagine harder, make it look again between actions. Complementing it, "Does extended thinking help or hurt model reasoning?" shows that training changes the *quality* of the perceptual-reasoning step, not just its length: the same thinking mechanism that induces self-doubt in a vanilla model becomes productive gap analysis after RL.
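The "look again between actions" pattern can be sketched as a loop that alternates a reasoning step with a tool call and feeds the observation back in. This is a toy illustration only: `react_loop`, `toy_policy`, and the `lookup` tool are hypothetical stand-ins, and a real system would use a language model as the policy and real external tools.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str]  # (kind, text), kind in {"thought", "action", "observation", "answer"}
Policy = Callable[[str, List[Step]], Tuple[str, str, str]]  # -> (thought, action, argument)

def react_loop(question: str, policy: Policy,
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 5) -> Tuple[Optional[str], List[Step]]:
    """Alternate reasoning and acting, appending each observation to the trace."""
    trace: List[Step] = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, trace)
        trace.append(("thought", thought))
        if action == "finish":
            trace.append(("answer", arg))
            return arg, trace
        tool = tools.get(action)
        observation = tool(arg) if tool else f"unknown tool: {action}"
        trace.append(("action", f"{action}[{arg}]"))
        trace.append(("observation", observation))  # fresh perception for the next step
    return None, trace  # step budget exhausted without an answer

def toy_policy(question: str, trace: List[Step]) -> Tuple[str, str, str]:
    """Hypothetical policy: look something up once, then answer from the observation."""
    observations = [text for kind, text in trace if kind == "observation"]
    if not observations:
        return ("I should look this up rather than guess.", "lookup", question)
    return ("The observation answers the question.", "finish", observations[-1])

toy_facts = {"capital of France": "Paris"}  # stand-in for an external source
toy_tools = {"lookup": lambda query: toy_facts.get(query, "unknown")}

answer, trace = react_loop("capital of France", toy_policy, toy_tools)
for kind, text in trace:
    print(f"{kind}: {text}")
```

The important design choice is that the policy sees the full trace, including observations, before each new thought. Its next step is therefore conditioned on what the environment returned, not only on its earlier reasoning.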

Two quieter findings reframe the whole question. "Do base models already contain hidden reasoning ability?" argues that five independent methods all merely *elicit* capability already latent in base activations: post-training selects rather than creates. If that's right, closing the perception-action gap may be less about teaching new behavior and more about unlocking a coupling the model already has. And "Can chain-of-thought reasoning be learned during pretraining itself?" pushes the intervention earlier still, treating reasoning itself as exploratory *action* during pretraining with an information-gain reward, which plants the loop before post-training rather than retrofitting it.
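The information-gain reward can be read as a log-likelihood improvement: how much more probable the true next token becomes once the model has produced a thought. The sketch below assumes that reading; the function name and the example probabilities are hypothetical, and the note does not specify how the no-thought baseline is computed.

```python
import math

def information_gain_reward(p_with_thought: float, p_without_thought: float) -> float:
    """Log-likelihood improvement on the true next token.

    Positive when conditioning on the thought makes the observed token more
    likely, negative when the thought hurts. No external verifier is needed.
    """
    for p in (p_with_thought, p_without_thought):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")
    return math.log(p_with_thought) - math.log(p_without_thought)

# Made-up probabilities the model assigns to the token that actually follows.
helpful = information_gain_reward(p_with_thought=0.6, p_without_thought=0.3)
harmful = information_gain_reward(p_with_thought=0.2, p_without_thought=0.4)
print(helpful, harmful)
```

Because the reward is computed from the training text itself, it fits the "verifier-free" description: a thought is rewarded only insofar as it improves prediction of what comes next.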

The sting in the tail: every training intervention here is also a way to *open* a gap. "Does preference optimization harm conversational understanding?" shows that preference optimization, by rewarding confident single-turn answers, cuts clarifying questions and understanding checks (grounding acts) to 77.5% below human levels. The model perceives less and acts more confidently, which is the perception-action gap widening under the banner of helpfulness. So the honest answer is that the corpus offers a toolkit (action-aware post-training, phase-targeted RL, external grounding, latent-capability elicitation, pretraining-time reasoning) but warns that the same levers that close the gap on one axis quietly pry it open on another.

Sources 8 notes

Does verbose chain-of-thought actually help multimodal perception tasks?

Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can interleaving reasoning with real-world feedback prevent hallucination?

ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts by 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
