Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models

Paper · arXiv 2505.10543 · Published May 15, 2025

Abstract. While large language models demonstrate impressive performance on static benchmarks, their true potential as self-learning and reasoning agents in dynamic environments remains unclear. This study systematically evaluates the efficacy of self-reflection, heuristic mutation, and planning as prompting techniques to test the adaptive capabilities of agents. We conduct experiments with various open-source language models in dynamic environments and report three main findings. First, larger models generally outperform smaller ones, but strategic prompting can close this performance gap. Second, excessively long prompts can negatively impact smaller models on basic reactive tasks, while larger models show more robust behaviour. Third, advanced prompting techniques primarily benefit smaller models on complex games, but offer less improvement for already high-performing large language models. Moreover, advanced reasoning methods yield highly variable outcomes: they can significantly improve performance when reasoning and decision-making align, but they also introduce instability and can cause severe performance drops. Compared to human performance, our findings reveal little evidence of true emergent reasoning. Instead, large language model performance exhibits persistent limitations in crucial areas such as planning, reasoning, and spatial coordination, suggesting that current-generation large language models suffer fundamental shortcomings that may not be fully overcome through self-reflective prompting alone. Reasoning is a multi-faceted task, and while methods like Chain-of-Thought improve multi-step reasoning on math word problems, our findings from dynamic benchmarks highlight important shortcomings in general reasoning capabilities, indicating the need to move beyond static benchmarks to capture the full complexity of reasoning.

Our framework is presented in Figure 1. At each timestep, the agent interacts with the environment by taking an action, receiving a reward, and moving to the next state. The Reflection module performs a retrospective analysis of the agent's trajectory at each timestep. The Oracle evolves heuristics after each episode, mutating them based on past reflections and the trajectory to capture the generic episode dynamics. The Planner simulates future states and the cumulative reward based on the trajectory and the reflection. The prompts for all modules described below can be found in the Supplementary Material.
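
To make the per-timestep loop concrete, the sketch below wires the modules together. It is a minimal illustration rather than the paper's implementation: the `Env` interface and the `reflect` and `plan` stubs are assumptions standing in for the prompted LLM calls described above.

```python
import random


class Env:
    """Toy stand-in for a dynamic environment (assumed interface)."""

    def reset(self):
        self.t = 0
        return 0  # initial state

    def step(self, action):
        self.t += 1
        reward = random.random()  # placeholder reward signal
        done = self.t >= 5        # short fixed-length episode
        return self.t, reward, done


def reflect(trajectory):
    # Stand-in for the prompted LLM call that critiques the trajectory so far.
    return f"reflection over {len(trajectory)} steps"


def plan(state, trajectory, reflection):
    # Stand-in for the Planner: simulate future states and pick the action
    # with the highest estimated cumulative reward.
    return "noop"


env = Env()
state = env.reset()
trajectory, reflections = [], []
done = False
while not done:
    reflection = reflect(trajectory)              # retrospective analysis
    reflections.append(reflection)
    action = plan(state, trajectory, reflection)  # look-ahead over futures
    next_state, reward, done = env.step(action)
    trajectory.append((state, action, reward))
    state = next_state
```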

Specifically, in the first episode the agent explores the environment, and the Oracle uses the resulting trajectory and reflections to create the initial heuristics (the parent). Following each subsequent episode, the Oracle generates and evaluates a single offspring by mutating the parent heuristics. If the offspring's performance is better than the parent's, the offspring replaces the parent; otherwise, the parent heuristics are retained. This "mutation" process enables the LLM to iteratively refine the heuristics by adding, modifying, or removing rules to better align with observed data. A sketch of this loop follows.
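
The episode-level selection is essentially a (1+1)-style evolutionary strategy over heuristic sets. In the sketch below, `llm_mutate` and `run_episode` are hypothetical stand-ins for the prompted Oracle call and the environment rollout; only the keep-the-better-of-parent-and-offspring logic reflects the procedure described above.

```python
def llm_mutate(heuristics, reflections, trajectory):
    # Hypothetical stand-in for the Oracle prompt: the LLM adds, modifies,
    # or removes rules based on past reflections and the trajectory.
    return heuristics + [f"rule-{len(heuristics) + 1}"]


def run_episode(heuristics):
    # Hypothetical stand-in for one environment rollout guided by these
    # heuristics; returns (total_reward, trajectory, reflections).
    return float(len(heuristics)), [], []


# Episode 1: explore, then form the initial parent heuristics.
parent = ["explore before committing to a plan"]
parent_score, trajectory, reflections = run_episode(parent)

# Subsequent episodes: one offspring per episode, greedy replacement.
for episode in range(2, 11):
    offspring = llm_mutate(parent, reflections, trajectory)
    offspring_score, trajectory, reflections = run_episode(offspring)
    if offspring_score > parent_score:  # better offspring replaces parent
        parent, parent_score = offspring, offspring_score
```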