A Tutorial on LLM Reasoning: Relevant Methods behind ChatGPT o1

Paper · arXiv 2502.10867 · Published February 15, 2025
Reasoning Architectures

System 1 thinking is fast, automatic, and intuitive, operating effortlessly and often unconsciously. It relies on neural pathways that enable rapid processing, especially in situations needing quick reactions or when cognitive resources are constrained. System 2 thinking is deliberate, effortful, and conscious, involving focused attention and analytical reasoning. It processes information more slowly and is used for complex problem solving, logical reasoning, and decision-making tasks. o1 is an exciting development for AI, as LLMs can now not only generate rapid responses using learned patterns but, more significantly, simulate complex reasoning processes through mechanisms like chain of thought or other forms of search, similar to how humans engage in deeper, step-by-step thinking.

This approach would essentially optimise the agent towards emulating the average or typical play of these players, potentially incorporating their mistakes and suboptimal strategies. This phenomenon can be characterised as what we call an "intelligence upper bound", a concept that can be rigorously derived from recent research in offline reinforcement learning and imitation learning [10]. The agent, in this case, is limited by the quality of the demonstrations it learns from, unable to surpass the skill level present in its training data. This limitation underscores a crucial challenge in AI development: how to enable systems to transcend the boundaries of their training data and develop novel, potentially superior strategies.

Conversely, when data is leveraged to develop a deeper understanding, or a world model, of chess dynamics, it may pave the way for sophisticated strategies and tactics that go beyond mere imitation of behaviours observed in the training data. A world model represents the agent's understanding of the environment, in this case the chess rules, i.e., how a move would change the state of the game and what the winning chance of a given move is. Learning and refining this world model, coupled with the ability to simulate potential outcomes, could empower an AI agent to surpass the 2000 Elo benchmark. The simulation capabilities afforded by these internal world models enable deep thinking (simulation), thereby enhancing the agent's reasoning and generalisation capabilities. Model-based strategies like Monte Carlo Tree Search (MCTS) serve as classic illustrations of this approach [23].

The transition to System 2 type reasoning, as potentially exemplified by ChatGPT o1, likely relies on establishing a certain type of World Model and utilising reinforcement learning (reward maximisation) rather than solely minimising prediction errors. This shift in approach may be one of the key transitional techniques behind ChatGPT o1's enhanced reasoning capabilities.
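The world-model-plus-simulation idea can be made concrete with a toy sketch. Everything below is illustrative rather than taken from any o1 implementation: a trivial pile-of-stones game (players alternately remove one or two stones; whoever takes the last stone wins) stands in for chess, and its transition and terminal-reward functions play the role of the world model. A plain UCT-style MCTS then uses that model to simulate playouts and back up win rates; the exploration constant and function names are arbitrary choices.

```python
import math
import random

def actions(pile):
    """Legal moves in the toy game: remove 1 or 2 stones."""
    return [a for a in (1, 2) if a <= pile]

class Node:
    def __init__(self, pile, parent=None):
        self.pile = pile
        self.parent = parent
        self.children = {}   # action -> child Node
        self.visits = 0
        self.value = 0.0     # wins for the player who moved INTO this node

def ucb(child, parent_visits, c=1.4):
    """UCT selection score: exploitation plus an exploration bonus."""
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(
        math.log(parent_visits) / child.visits)

def rollout(pile):
    """1.0 if the player to move at `pile` wins this random playout."""
    turn = 0
    while pile > 0:
        pile -= random.choice(actions(pile))
        if pile == 0:
            return 1.0 if turn == 0 else 0.0
        turn ^= 1
    return 0.0

def mcts(root_pile, iterations=3000):
    root = Node(root_pile)
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded and non-terminal.
        while node.pile > 0 and len(node.children) == len(actions(node.pile)):
            node = max(node.children.values(),
                       key=lambda ch: ucb(ch, node.visits))
        # Expansion: try one untried action via the world model (transition).
        if node.pile > 0:
            a = random.choice(
                [a for a in actions(node.pile) if a not in node.children])
            node.children[a] = Node(node.pile - a, parent=node)
            node = node.children[a]
        # Simulation: reward for the player who just moved into `node`.
        reward = 1.0 if node.pile == 0 else 1.0 - rollout(node.pile)
        # Backpropagation, flipping perspective at each level up the tree.
        while node is not None:
            node.visits += 1
            node.value += reward
            reward = 1.0 - reward
            node = node.parent
    # Recommend the most-visited first move from the root.
    return max(root.children, key=lambda a: root.children[a].visits)
```

In this game, piles that are multiples of 3 are losing positions, so from a pile of 4 the simulations steer the search towards removing one stone; the agent discovers this through simulated self-play rather than through any demonstration data.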

By combining the predictive power of large language models with the strategic depth of reinforcement learning and World Modelling, AI systems like o1 can potentially engage in more sophisticated problem-solving and decision-making processes. This hybrid approach allows for both rapid pattern recognition (akin to System 1 thinking) and deliberate, step-by-step reasoning (characteristic of System 2 thinking), potentially explaining the significant leap in performance observed in o1.

In this MDP formulation, the LLM is tasked with generating reasoning steps and the final answer to a question in a step-by-step manner. The LLM policy operates by generating tokens, which form higher-level reasoning constructs. The states represent the sequence of reasoning steps so far, and actions correspond to the selection of a new reasoning step or the final answer. The LLM policy governs the choice of actions, and the process-reward model (PRM) provides feedback on the quality of reasoning steps and the final answer. By optimising the policy to maximise the reward, the LLM can be guided by the PRM to generate accurate and meaningful reasoning processes.
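A minimal sketch of this state/action/reward loop, with toy stand-ins for both models: `policy` below is a hypothetical proxy for the LLM proposing candidate next steps, and `prm` a hypothetical proxy for a process-reward model scoring partial traces. The toy task (reach a target sum with steps from {1, 2, 3}) is purely illustrative.

```python
import random

TARGET = 12  # toy task: reach this sum; stands in for "the correct answer"

def policy(state, k=3):
    """Toy LLM policy: propose k candidate next reasoning steps."""
    return random.sample([1, 2, 3], k)

def prm(state):
    """Toy process-reward model: rate a partial trace of steps.

    Higher is better; traces that overshoot the target are penalised,
    mimicking a PRM flagging a flawed intermediate step.
    """
    s = sum(state)
    return -abs(TARGET - s) if s <= TARGET else -100

def greedy_reasoning(max_steps=10):
    state = []  # MDP state: the sequence of reasoning steps so far
    for _ in range(max_steps):
        candidates = policy(state)              # actions proposed by policy
        # Take the action whose resulting state the PRM rates highest.
        action = max(candidates, key=lambda a: prm(state + [a]))
        state.append(action)
        if sum(state) == TARGET:                # "final answer" reached
            break
    return state
```

The loop makes the roles explicit: the policy generates actions, the PRM supplies per-step feedback, and action selection maximises that reward signal. Real systems train the policy against the PRM rather than just filtering at inference time, but the interface between the two is the same.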

We can define the reasoning process as a Markov Decision Process (MDP) [1]. An MDP representation offers a flexible framework for modelling reasoning. It allows the model to autoregressively generate sequential reasoning steps toward the final answer, while also enabling a tree structure by sampling multiple paths at each step for alternative reasoning trajectories. By combining both approaches, sequential and branching reasoning, the model can explore diverse solutions, creating a versatile and comprehensive reasoning process.
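The branching variant can be sketched as a beam search over partial reasoning traces: expand every kept trace with every candidate step, then prune back to the highest-scoring few. The toy task and scorer below (the same illustrative reach-a-target-sum setting, with the scorer standing in for a PRM or verifier) are hypothetical choices, not anything prescribed by the MDP formulation itself.

```python
TARGET = 12          # toy stand-in for "the correct final answer"
ACTIONS = [1, 2, 3]  # candidate "reasoning steps" available at every state

def score(trace):
    """Toy verifier/PRM stand-in: rate a partial reasoning trace."""
    s = sum(trace)
    return -abs(TARGET - s) if s <= TARGET else float("-inf")

def beam_search(beam_width=2, max_depth=6):
    beams = [[]]  # start from the empty trace (the initial MDP state)
    for _ in range(max_depth):
        # Branch: extend every kept trace with every candidate step...
        expanded = [t + [a] for t in beams for a in ACTIONS]
        # ...then prune back to the `beam_width` highest-scoring traces.
        beams = sorted(expanded, key=score, reverse=True)[:beam_width]
        if any(sum(t) == TARGET for t in beams):
            break
    return max(beams, key=score)
```

With `beam_width=1` this collapses to purely sequential (greedy) reasoning, and with an unbounded width it becomes exhaustive tree search; the beam width is the knob that trades sequential speed against branching coverage, which is exactly the sequential-plus-branching combination described above.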