
Can language modeling close the knowing-doing gap in AI?

Current LLMs reason well but act poorly in interactive tasks, while RL agents act well but cannot explain themselves. Can reformulating decision-making as language modeling with environmental feedback bridge this fundamental split?

Note · 2026-05-18 · sourced from Reinforcement Learning

A central paradox in current AI: LLMs excel at complex reasoning (math, code) yet often fail at simple interactive tasks that young children perform effortlessly. Conversely, traditional RL agents acquire procedural knowledge through environmental interaction but operate as black boxes. The split is between declarative knowledge (knowing about something — what LLMs do well) and procedural knowledge (knowing how to do something — what RL agents do well).

Think in Games (TiG, 2508.21365) reformulates RL-based decision-making as a language modeling task. The LLM generates language-guided policies, and these policies are refined iteratively through online reinforcement learning based on environmental feedback. The result: LLMs develop procedural understanding through direct interaction with the game environment while retaining their inherent reasoning and explanatory abilities. Critically, the policy is language, so the agent can explain its decisions at every step.
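
The generate, act, receive feedback, refine loop described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the three candidate sentences, the `toy_reward` function, and the use of plain REINFORCE over a softmax of three logits stand in for an actual LLM, an actual game, and whatever RL algorithm the paper uses. It is not the paper's implementation.

```python
import math
import random

# Hypothetical candidate "thoughts": each sentence names the action it implies.
CANDIDATES = [
    ("The enemy is retreating, so push the tower.", "push"),
    ("Our team is outnumbered, so defend the base.", "defend"),
    ("Nothing is contested, so farm the jungle.", "farm"),
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(action):
    """Assumed environment: in this single fixed state only 'defend' pays off."""
    return 1.0 if action == "defend" else 0.0


def train(steps=500, lr=0.5, seed=0):
    """Refine the distribution over sentences with plain REINFORCE."""
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)  # stand-in for LLM parameters
    for _ in range(steps):
        probs = softmax(logits)
        # "Generate": sample one sentence from the current policy.
        i = rng.choices(range(len(CANDIDATES)), weights=probs)[0]
        # "Act" and "receive feedback": the sentence's action is scored.
        reward = toy_reward(CANDIDATES[i][1])
        # "Refine": d log pi(i) / d logit_j = 1[j == i] - probs[j].
        for j in range(len(logits)):
            indicator = 1.0 if j == i else 0.0
            logits[j] += lr * reward * (indicator - probs[j])
    return softmax(logits)


def explain(probs):
    """Return the most probable (sentence, action) pair."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return CANDIDATES[best]


if __name__ == "__main__":
    sentence, action = explain(train())
    print(action, "|", sentence)
```

The point of the sketch is only that the object being reinforced is a sentence, so the trained policy can be read back as text together with the action it selects.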

The architectural move is consequential. Traditional RL outputs actions; the policy is opaque. TiG outputs language describing actions; the policy is transparent. The environmental reward refines the language-policy directly — the language IS the policy parameterization. This means the agent's procedural competence becomes inspectable in the way declarative knowledge already was.
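
A minimal sketch of the interface difference, under stated assumptions: the `Decision` type, the `Action: <name>` output convention, and both function names are hypothetical and not taken from the paper. It contrasts a policy that returns only an action index with one whose output is text that can be parsed into reasoning plus an action.

```python
import re
from dataclasses import dataclass


@dataclass
class Decision:
    reasoning: str  # free-text justification produced by the model
    action: str     # action name extracted from the final line


# Assumed output convention: reasoning text, then a last line "Action: <name>".
ACTION_PATTERN = re.compile(r"Action:\s*(\w+)\s*$")


def opaque_policy(state):
    """Conventional RL style: an action index with no accompanying rationale."""
    return 1  # placeholder; a real policy would compute this from `state`


def parse_decision(text):
    """Split a language-policy output into its reasoning and its action."""
    stripped = text.strip()
    match = ACTION_PATTERN.search(stripped)
    if match is None:
        raise ValueError("no 'Action: <name>' line found")
    return Decision(
        reasoning=stripped[: match.start()].strip(),
        action=match.group(1),
    )


if __name__ == "__main__":
    output = "The dragon is up and we have vision.\nAction: contest_dragon"
    decision = parse_decision(output)
    print(decision.action)
    print(decision.reasoning)
```

In this framing the environment only ever consumes `decision.action`, while `decision.reasoning` stays available for inspection at every step.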

Two consequences follow. First, data and computational demands are dramatically lower than for conventional RL methods: because the LLM brings strong priors about what kinds of policies are reasonable, RL training only needs to refine those priors against environmental signal rather than learn from scratch. Second, step-by-step natural language explanations for decisions improve transparency and interpretability: the same property that makes LLM outputs inspectable in declarative tasks now extends to procedural ones.
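
The intuition behind the first consequence can be checked on a toy problem. The sketch below is an assumption-laden illustration, not a result from the paper: it uses 50 hypothetical actions, one rewarded target, and the exact expected policy-gradient update (so there is no sampling noise), then counts how many updates a uniform initialization needs compared with an initialization that already favors the target and one plausible alternative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steps_to_reach(init_logits, target, threshold=0.9, lr=0.5, max_steps=100_000):
    """Count expected-gradient updates until P(target) >= threshold.

    Reward is 1 for `target` and 0 otherwise, so the expected REINFORCE
    update on logit j is p_target * (1[j == target] - p_j).
    """
    logits = list(init_logits)
    for step in range(max_steps):
        probs = softmax(logits)
        if probs[target] >= threshold:
            return step
        p_t = probs[target]
        for j in range(len(logits)):
            indicator = 1.0 if j == target else 0.0
            logits[j] += lr * p_t * (indicator - probs[j])
    return max_steps


N_ACTIONS = 50  # hypothetical action-space size
TARGET = 7      # hypothetical index of the rewarded action

# "From scratch": uniform over all actions.
scratch = [0.0] * N_ACTIONS

# "With a prior": mass concentrated on the target and one wrong alternative.
prior = [0.0] * N_ACTIONS
prior[TARGET] = 3.0
prior[8] = 3.0

if __name__ == "__main__":
    print("from scratch:", steps_to_reach(scratch, TARGET))
    print("with prior:  ", steps_to_reach(prior, TARGET))
```

The prior does not need to be correct, only to place non-negligible probability on good behavior; the reward signal then has less distance to cover.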

The deeper claim is about the nature of intelligence: declarative and procedural knowledge are not categorically separate substrates that need joining — they can be unified if procedural competence is parameterized in the same medium (language) as declarative competence. The reward gradient refines the language; the language is the procedure.

This connects to "Why do language models fail to act on their own reasoning?": the knowing-doing gap (declarative ≠ procedural in current LLMs) is exactly what TiG addresses. Where the greedy-agents paper diagnoses the gap as architectural, TiG argues it is a training-objective gap that RL on a language policy can close. Both find that RL fine-tuning (RLFT) narrows the gap; TiG offers a mechanism for why.

For MOBA-game macro-level reasoning specifically, the LLM brings strategic-thinking priors; RL refines them against game outcomes; explanations come for free.


Paper: Think in Games: Learning to Reason in Games via Reinforcement Learning with Large Language Models

Original note title

RL bridges the declarative-procedural knowledge gap by reformulating decision-making as language modeling with environmental feedback