Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess
While reinforcement learning (RL) for large language models (LLMs) has shown promise in mathematical reasoning, strategic reasoning for LLMs via RL remains largely unexplored. We investigate whether LLMs can develop strategic reasoning capabilities through RL in chess. To this end, we leverage a chess-pretrained action-value network to provide a dense reward on the quality of the LLM's output moves, which can be seen as a form of knowledge distillation. Our experiments show that these distillation-based dense rewards often outperform sparse binary rewards. Surprisingly, however, all models plateau far below expert levels. We provide SFT and RL ablations on chess reasoning training and find evidence that this limitation stems from a deficit in the pretrained models' internal understanding of chess, a deficit that RL alone may not be able to fully overcome.
Reinforcement learning with verifiable rewards (RLVR) has shown strong performance in developing mathematical reasoning capabilities for large language models (LLMs) (Guo et al., 2025; Li et al., 2025; Yu et al., 2025). While these successes highlight LLMs' capacity for logical thinking, a critical dimension of intelligence remains largely unexplored: strategic reasoning, the ability to plan, anticipate adversary actions, and make decisions in multi-agent environments. Beyond logical reasoning in static settings, strategic reasoning aligns more closely with real-world scenarios such as games, negotiation, and market competition (Zhang et al., 2024; Park et al., 2025).
To investigate this gap, we turn to chess, a game that demands deep strategic reasoning abilities such as positional evaluation, long-term planning, and reasoning about an opponent's intentions. In addition, chess offers a favorable environment for applying RLVR to LLMs, as it provides abundant publicly available game records and human-annotated reasoning about optimal moves. Given this testbed for examining strategic reasoning, we raise the following research question:
Can LLMs develop strategic reasoning capabilities through RLVR with chess?
To this end, we train Qwen2.5 (Qwen et al., 2025) and Llama3.1 (Grattafiori et al., 2024) models to predict the next best move in chess using Group Relative Policy Optimization (GRPO) (Shao et al., 2024). Unlike typical RLVR approaches that rely on sparse binary rewards (correct/incorrect), chess admits dense reward signals: each move can be scored by the estimated win probability of the position it produces, yielding graded feedback proportional to move quality. We implement this with a pretrained chess expert model as the reward model, a form of knowledge distillation from a Q-value network to an LLM, which evaluates position strength and provides continuous rather than binary rewards.
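To make the reward design concrete, the following is a minimal sketch of how such distillation-based dense rewards and GRPO-style group normalization could be computed, assuming a python-chess environment. The `action_value` function is a hypothetical stand-in for the chess-pretrained Q-value network, and the reward and advantage functions are illustrative, not the paper's actual implementation.

```python
import statistics

import chess  # python-chess, for move parsing and legality checks


def action_value(fen: str, move_uci: str) -> float:
    """Hypothetical Q(s, a): estimated win probability in [0, 1] after playing
    `move_uci` from position `fen`. A real implementation would query the
    chess-pretrained action-value network; here a neutral stub keeps the
    sketch self-contained."""
    return 0.5


def dense_reward(fen: str, llm_move_san: str) -> float:
    """Graded reward: illegal or unparsable moves score 0; legal moves score
    the Q-network's win-probability estimate, so feedback scales with quality."""
    board = chess.Board(fen)
    try:
        move = board.parse_san(llm_move_san)  # validates legality, e.g. "Nf3"
    except ValueError:
        return 0.0
    return action_value(fen, move.uci())


def sparse_reward(fen: str, llm_move_san: str, best_move_san: str) -> float:
    """Binary baseline: 1 only if the LLM reproduces the reference best move."""
    board = chess.Board(fen)
    try:
        return float(board.parse_san(llm_move_san) == board.parse_san(best_move_san))
    except ValueError:
        return 0.0


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each sampled move's reward against the
    group of rollouts drawn for the same position (Shao et al., 2024)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]
```

For instance, `dense_reward(chess.STARTING_FEN, "Nf3")` returns the stub's 0.5, while an illegal move such as "Nf6" for White scores 0. Under the sparse scheme, a near-optimal legal move and a blunder both score 0 unless they match the reference move exactly, which is what the graded signal is meant to remedy.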
Through additional failure analysis, we find that base LLMs often struggle to grasp even fundamental chess rules. We therefore hypothesize that, in contrast to logical reasoning in math domains, the limited emergence of strategic reasoning in chess stems from insufficient exposure to chess-specific knowledge during LLM pre-training. Our empirical findings also support recent claims that RL mainly amplifies existing capabilities of pre-trained LLMs (Li et al., 2025; Zhao et al., 2025b), and they offer an insight for practitioners aiming to elicit reasoning abilities in new environments: pre-trained domain knowledge is essential for developing advanced reasoning.