Large Language Models Think Too Fast To Explore Effectively

Paper · arXiv 2501.18009 · Published January 29, 2025

This paper investigates whether LLMs can surpass humans in exploration during an open-ended task, using Little Alchemy 2 as a paradigm in which agents combine elements to discover new ones. Results show that most LLMs underperform humans, with the exception of the o1 model; the traditional LLMs rely primarily on uncertainty-driven strategies, unlike humans, who balance uncertainty and empowerment. Representational analysis of the models with Sparse Autoencoders (SAE) revealed that uncertainty and choices are represented in earlier transformer blocks, while empowerment values are processed later, causing LLMs to think too fast and make premature decisions that hinder effective exploration.

While extensive benchmarks have been developed to assess how LLMs perceive, think, reason, and act across diverse environments, limited attention has been given to their capacity for exploration. Exploration—defined as behaviors aimed at discovering new information, possibilities, or strategies, often at the expense of immediate rewards—plays a crucial role in intelligence, enhancing long-term understanding, adaptability, and performance. This behavior stands in contrast to exploitation, which focuses on leveraging known information for immediate benefits.

Exploration has been extensively studied in the fields of Reinforcement Learning [8, 15] and human learning [4, 19, 7]. In human learning, exploration strategies are typically categorized into three types: random exploration, uncertainty-driven exploration, and empowerment. Random exploration introduces stochastic noise into behaviors, enabling agents to stumble upon new information. Uncertainty-driven exploration prioritizes sampling actions with uncertain outcomes to reduce ambiguity and improve decision-making confidence. Empowerment, on the other hand, emphasizes intrinsic rewards and open-ended discovery, driving agents to maximize possibilities rather than optimizing specific outcomes. This type of exploration aligns closely with behaviors observed in tasks like scientific research, where the goal is to uncover as many novel possibilities as possible.
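The three strategies can be illustrated with toy scoring rules. Everything below is our own sketch, not the paper's implementation: the UCB-style bonus for uncertainty, the distinct-outcome count as a proxy for empowerment, and softmax temperature as the source of random exploration are all illustrative assumptions.

```python
import math
import random

def uncertainty_bonus(counts, total, c=1.0):
    """Uncertainty-driven exploration (UCB-style sketch):
    actions tried less often get a larger bonus."""
    return {a: c * math.sqrt(math.log(total + 1) / (n + 1))
            for a, n in counts.items()}

def empowerment_value(action_outcomes):
    """Toy empowerment: value an action by how many distinct
    future states (outcomes) it opens up."""
    return {a: len(set(outcomes)) for a, outcomes in action_outcomes.items()}

def choose(values, temperature=1.0):
    """Random exploration via a softmax over combined values:
    higher temperature injects more stochastic noise."""
    acts = list(values)
    exps = [math.exp(values[a] / temperature) for a in acts]
    z = sum(exps)
    return random.choices(acts, weights=[e / z for e in exps])[0]

# Example: an untried combination gets a larger uncertainty bonus...
bonus = uncertainty_bonus({"fire+water": 0, "earth+air": 10}, total=10)
# ...while empowerment prefers combinations that unlock more new elements.
emp = empowerment_value({"fire+water": ["steam", "geyser"], "earth+air": ["dust"]})
```

In this sketch, an agent that scores actions only by `uncertainty_bonus` mirrors the uncertainty-driven behavior the paper attributes to most LLMs, while adding `empowerment_value` to the score mirrors the human-like balance.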

This unbalanced strategy in traditional LLMs raises the question of why they could not use empowerment in the game. In principle, LLMs should be able to represent the semantic meaning of these elements. Do they represent empowerment but fail to use it, or do they lack the ability to represent it at all? To investigate this question, we employed Sparse Autoencoders (SAE; see Methods 2.4) to decompose the latent representations of elements in LLMs and determine whether both empowerment and uncertainty are properly represented during computation.
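The core SAE operation is simple: encode a hidden-state vector into an overcomplete, non-negative sparse code, then reconstruct it. The sketch below uses random weights purely to show the shapes and forward pass; real SAEs are trained to minimize reconstruction error plus a sparsity penalty, and the dimensions here are arbitrary assumptions, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: d_model for the residual stream,
# d_sae for the overcomplete feature dictionary (d_sae > d_model).
d_model, d_sae = 64, 256
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_encode(h):
    """Sparse, non-negative feature activations: ReLU(h @ W_enc + b_enc)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def sae_decode(f):
    """Reconstruct the residual-stream vector from the sparse features."""
    return f @ W_dec + b_dec

h = rng.normal(size=d_model)   # e.g. an element's hidden state at some block
f = sae_encode(h)              # decomposed sparse code
h_hat = sae_decode(f)          # reconstruction of the hidden state
```

Once trained, individual coordinates of `f` can be regressed against quantities such as uncertainty or empowerment values, which is the style of analysis the representational results rely on.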

Paper Summary. Exploration is essential for discovering new opportunities and understanding complex environments. Our study reveals that most LLMs struggle to achieve human-level exploration in this open-ended task. They rely heavily on uncertainty-driven strategies, which provide short-term gains but fail to support long-term success, resulting in suboptimal performance and limited adaptability to broader decision spaces. However, we also find exceptions such as o1, which surpasses humans in performance and shows stronger use of both uncertainty-driven and empowerment exploration strategies. This suggests that reasoning training may be essential for LLMs to perform well in open-ended tasks, which require a variety of exploration strategies.

Fast Thinking in Traditional LLMs. A key issue is that LLMs “think too fast” during exploratory tasks. In LLaMA3.1-70B, uncertainty values dominate the early transformer blocks, whose activations correlate strongly with the model’s immediate choices, while empowerment values emerge only in the middle blocks. This temporal mismatch leads to premature decisions that prioritize short-term utility over deeper exploration: because uncertainty values and choices dominate early processing, the role of empowerment in exploratory decision-making is weakened.
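The “early activations correlate with immediate choices” claim can be checked with a linear probe. The sketch below fits a least-squares probe on synthetic activations; the data, dimensions, and the toy rule generating the choices are all our assumptions for illustration, not the paper's recorded activations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: early-block activations for 200 trials, plus the
# binary choice made on each trial. The toy choices are deliberately
# driven by the activations so the probe has something to find.
n_trials, d = 200, 32
early_acts = rng.normal(size=(n_trials, d))
choices = (early_acts[:, 0] > 0).astype(float)

def probe_r2(X, y):
    """Fit a least-squares linear probe and return its R^2 on the data."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    pred = Xb @ w
    ss_res = ((y - pred) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

r2_early = probe_r2(early_acts, choices)
```

Comparing such probe scores block by block, for choices versus empowerment values, is one simple way to expose the early-versus-middle-block mismatch described above.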

Neither prompt engineering nor intervention improved model performance. As an alternative model, we experimented with the recently released open-source reasoning model DeepSeek-R1 [5], which claims to match o1-level performance on multiple benchmarks and exposes its reasoning process. This reasoning model outperforms the other traditional LLMs and reaches human-level performance on this task (Figure 9 in the appendix). Its superior performance provides further evidence that model architecture may be the main factor limiting performance on this type of task.

Limitations and Future Directions. Despite these findings, the underlying cause of LLMs “thinking too fast” remains unclear and requires further investigation. Future research could explore the interaction between model architecture and processing dynamics, as well as how LLMs weigh uncertainty and empowerment during decision-making. Interventions such as integrating extended reasoning frameworks like CoT, optimizing transformer block interactions, or training with explicit exploratory objectives could enhance LLMs’ exploratory abilities. These efforts would not only improve performance but also advance our understanding of how to create AI systems capable of more human-like exploration.