Can language modeling close the knowing-doing gap in AI?
Current LLMs reason well but act poorly in interactive tasks, while RL agents act well but cannot explain themselves. Can reformulating decision-making as language modeling with environmental feedback bridge this fundamental split?
A central paradox in current AI: LLMs excel at complex reasoning (math, code) yet often fail at simple interactive tasks that young children perform effortlessly. Conversely, traditional RL agents acquire procedural knowledge through environmental interaction but operate as black boxes. The split is between declarative knowledge (knowing about something — what LLMs do well) and procedural knowledge (knowing how to do something — what RL agents do well).
Think in Games (TiG, 2508.21365) reformulates RL-based decision-making as a language modeling task. The LLM generates language-guided policies, which are refined iteratively through online reinforcement learning on environmental feedback. The result: LLMs develop procedural understanding through direct interaction with the game environment while retaining their inherent reasoning and explanatory abilities. Critically, the policy is expressed in language, so the agent can explain its decisions at every step.
The architectural move is consequential. Traditional RL outputs actions; the policy is opaque. TiG outputs language describing actions; the policy is transparent. The environmental reward refines the language-policy directly — the language IS the policy parameterization. This means the agent's procedural competence becomes inspectable in the way declarative knowledge already was.
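A minimal sketch of the contrast between an opaque action output and a language output, in Python. Everything here is hypothetical and for illustration only: the `Decision` type, the `=>` separator between rationale and action, and both stand-in policy functions are assumptions, not the paper's actual output format or model interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    explanation: str  # natural-language rationale, kept for inspection
    action: str       # the part the environment actually executes


def parse_decision(generated_text: str) -> Decision:
    """Split a generation of the assumed form '<explanation> => <action>'."""
    explanation, sep, action = generated_text.rpartition("=>")
    if not sep:
        raise ValueError("generation does not contain an action marker")
    return Decision(explanation.strip(), action.strip())


def opaque_policy(state: dict) -> int:
    # Conventional RL stand-in: an action index with no rationale attached.
    return 2


def language_policy(state: dict) -> str:
    # Stand-in for an LLM call (hypothetical): the output is text that
    # carries both the reasoning and the chosen macro-level action.
    return "Enemy jungler was seen bottom, so the top side is safe => push top tower"


decision = parse_decision(language_policy({}))
print(decision.action)       # the executable part
print(decision.explanation)  # the inspectable part
```

Under this framing, a reward computed for the executed action can be credited to the whole generated text, so the rationale and the action are trained together rather than bolted on afterwards.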
Two consequences. First, the paper reports dramatically lower data and computational demands than conventional RL methods. A plausible reading: the LLM brings strong priors about which policies are reasonable, so RL training only needs to refine those priors against the environmental signal instead of learning from scratch. Second, step-by-step natural language explanations for decisions improve transparency and interpretability: the inspectability that LLMs already offer in declarative tasks now extends to procedural ones.
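The prior-refinement argument can be illustrated with a toy example. The sketch below is not TiG's training algorithm: it applies the exact expected policy gradient to a softmax over three made-up candidate plans, and the plan names, reward values, prior logits, learning rate, and threshold are all assumptions chosen for illustration. It only shows that a policy starting from a reasonable prior needs fewer updates than one starting from uniform.

```python
import math

# Hypothetical candidate plans and their (assumed) expected rewards.
PLANS = ["secure the objective", "farm the side lane", "take a solo fight"]
REWARDS = [1.0, 0.2, 0.0]

UNIFORM_LOGITS = [0.0, 0.0, 0.0]  # learning from scratch
PRIOR_LOGITS = [2.0, 0.0, 0.0]    # assumed prior that already favors the best plan


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steps_to_threshold(logits, rewards, lr=1.0, threshold=0.9, max_steps=10_000):
    """Count gradient-ascent steps until the best plan has probability >= threshold."""
    logits = list(logits)
    best = max(range(len(rewards)), key=rewards.__getitem__)
    for step in range(max_steps):
        probs = softmax(logits)
        if probs[best] >= threshold:
            return step
        expected = sum(p * r for p, r in zip(probs, rewards))
        # Exact gradient of expected reward w.r.t. each logit: p_k * (r_k - E[r]).
        logits = [x + lr * p * (r - expected)
                  for x, p, r in zip(logits, probs, rewards)]
    return max_steps


print("from uniform:", steps_to_threshold(UNIFORM_LOGITS, REWARDS))
print("from prior:  ", steps_to_threshold(PRIOR_LOGITS, REWARDS))
```

A real setup would sample generations from an LLM and estimate the gradient from noisy rewards, which this deterministic toy deliberately avoids.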
The deeper claim is about the nature of intelligence: declarative and procedural knowledge are not categorically separate substrates that need joining — they can be unified if procedural competence is parameterized in the same medium (language) as declarative competence. The reward gradient refines the language; the language is the procedure.
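One way to make "the reward gradient refines the language" concrete is the generic score-function (REINFORCE-style) gradient for a policy whose output is a token sequence. This is a textbook form written under assumed notation (state s, generated text y = (y_1, ..., y_T), scalar reward R, model parameters θ), not necessarily the objective the paper optimizes.

```latex
\[
\pi_\theta(y \mid s) = \prod_{t=1}^{T} \pi_\theta(y_t \mid s, y_{<t}),
\qquad
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid s)}
    \left[ R(s, y) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid s, y_{<t}) \right]
\]
```

The first identity is ordinary autoregressive language modeling; the second shows the environmental reward reweighting the log-likelihood of the very tokens that state the decision and its rationale.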
This connects to "Why do language models fail to act on their own reasoning?": the knowing-doing gap (declarative ≠ procedural in current LLMs) is exactly what TiG addresses. Where the greedy-agents paper diagnoses the gap as architectural, TiG argues it is a training-objective gap that RL on a language policy can close. Both find that RL fine-tuning (RLFT) narrows the gap; TiG offers a mechanism for why.
For MOBA-game macro-level reasoning specifically, the LLM brings strategic-thinking priors; RL refines them against game outcomes; explanations come for free.
Paper: Think in Games: Learning to Reason in Games via Reinforcement Learning with Large Language Models
Related concepts in this collection
- Why do language models fail to act on their own reasoning?
  LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?
  diagnoses the gap; TiG provides the architectural fix (language-as-policy + RL)
- Does thinking emerge when agents choose between learned sub-policies?
  Can we formally understand thinking as the selection of pre-existing sub-policies during reinforcement learning? This explores whether thinking requires new capabilities or just the right conditions to activate what's already there.
  TiG instantiates this theoretical result: LLM sub-policies (strategic-reasoning patterns) become selectable through RL refinement
- Can agent deployment itself generate training signals automatically?
  Can we extract learning signals from the natural next-states that agents encounter during real deployment (user replies, tool outputs, test verdicts) rather than relying on separate annotation pipelines? This reframes how agents improve continuously.
  TiG specializes the next-state signal pattern to language-policy refinement in game environments
- How does treating LLMs as multi-step agents change what we can optimize?
  Instead of optimizing single prompt-response pairs, what happens when we model LLM agents as temporally-extended decision processes? The question matters because it shifts what becomes trainable.
  TiG is a concrete agentic-RL implementation in the new POMDP framing
Original note title
RL bridges the declarative-procedural knowledge gap by reformulating decision-making as language modeling with environmental feedback