A Comparative Study on Reasoning Patterns of OpenAI's o1 Model

Paper · arXiv 2410.13639 · Published October 17, 2024

We select two powerful closed-source LLMs for evaluation.

o1 model. It is designed to spend more time reasoning before responding, allowing it to reason through complex tasks and solve harder problems than previous models in science, coding, and math.

GPT-4o. It is a multimodal model that integrates text, vision, and audio processing capabilities into a single and unified neural network.

For test-time compute methods, we select four methods based on GPT-4o.

5 https://openai.com/o1/

6 https://huggingface.co/datasets/AI-MO/aimo-validation-aime

Best-of-N (BoN). It makes the LLM generate N outputs for a given input, and the most suitable response is selected as the final output.
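A minimal sketch of BoN selection, assuming access to a sampling function and a reward model; both are stubbed here with deterministic toy functions (`generate` and `reward` are hypothetical stand-ins, not the paper's implementation):

```python
import random

def generate(prompt: str, n: int, seed: int = 0) -> list[str]:
    """Stand-in for sampling N diverse responses from an LLM."""
    rng = random.Random(seed)
    return [f"{prompt} -> candidate {rng.randint(0, 99)}" for _ in range(n)]

def reward(response: str) -> float:
    """Stand-in for a reward model; here it just scores the trailing number."""
    return float(response.rsplit(" ", 1)[-1])

def best_of_n(prompt: str, n: int) -> str:
    """Sample N candidates and return the one the reward model ranks highest."""
    candidates = generate(prompt, n)
    return max(candidates, key=reward)
```

In practice the reward model is itself a learned scorer; the selection logic, however, is exactly this one `max` over N samples.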

Step-wise BoN. It enables LLMs to analyze a problem and break it down into several sub-problems. At each step, the model generates N responses conditioned on the previous sub-problems and their answers, and a reward model selects the best response. This process continues iteratively until the final answer to the original problem is obtained.
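The iterative structure above can be sketched as follows; the decomposition, per-step sampling, and reward functions are hypothetical toy stand-ins for the LLM and reward-model calls:

```python
def decompose(problem: str) -> list[str]:
    """Stand-in for the LLM breaking the problem into sub-problems."""
    return [f"sub-problem {i}: {problem}" for i in range(1, 4)]

def generate_step(context: str, sub: str, n: int) -> list[str]:
    """Stand-in for sampling N answers to one sub-problem given prior context."""
    return [f"answer to '{sub}' (variant {i})" for i in range(n)]

def reward(response: str) -> float:
    """Stand-in reward model: here it simply prefers the highest variant index."""
    return float(response.rstrip(")").rsplit(" ", 1)[-1])

def stepwise_bon(problem: str, n: int) -> str:
    """Solve each sub-problem with BoN, carrying the chosen answers forward."""
    context = problem
    for sub in decompose(problem):
        best = max(generate_step(context, sub, n), key=reward)
        context += "\n" + best  # the selected answer becomes context for the next step
    return context
```

The key difference from plain BoN is that selection happens once per sub-problem rather than once per final answer, so the context grows with each chosen intermediate result.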

Self-Refine. It improves initial outputs from LLMs through iterative feedback and refinement (Madaan et al., 2024).

Agent Workflow. LLM agents break down complex tasks into smaller sub-tasks, plan their execution through a structured workflow, and utilize various tools to achieve their goals. For the commonsense reasoning datasets, we leverage the existing state-of-the-art agent framework (Zhou et al., 2023; 2024) for evaluation. For the code and math datasets, we select the top-picked agents from GPTs 7, specifically code copilot and math solver, respectively.
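The decompose-plan-dispatch loop can be sketched as below; the tool registry and planner are hypothetical toys, whereas real agent frameworks (such as those of Zhou et al.) use learned planning and far richer tools:

```python
# Hypothetical tool registry mapping tool names to callables.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy math tool
    "search": lambda query: f"[retrieved notes for: {query}]",         # toy retrieval tool
}

def plan(task: str) -> list[tuple[str, str]]:
    """Stand-in planner: map a task to an ordered list of (tool, input) sub-tasks."""
    return [("search", task), ("calculator", "2 + 3")]

def run_agent(task: str) -> list[str]:
    """Execute each planned sub-task with its assigned tool and collect results."""
    return [TOOLS[tool](arg) for tool, arg in plan(task)]
```

The domain-specific system prompts mentioned above would correspond to specializing `plan` per task type, which is what keeps the reasoning context short.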

The performance improvement from Self-Refine is not significant. On most tasks, Self-Refine shows only a slight improvement over GPT-4o, and its performance even declines on Collie. We attribute this to LLMs generating responses that slightly deviate from the required format during Self-Refine's refinement iterations.

BoN achieves relatively good results on HotpotQA. This demonstrates the value of searching over more candidate responses by scaling test-time compute. However, BoN's performance on Collie declines compared to the original GPT-4o, and as N increases, performance degrades slightly further. We believe this is due to Collie's strict format requirements, which limit the effectiveness of diverse outputs from LLMs.

Step-wise BoN is limited by complex tasks. Step-wise BoN achieves an excellent result on HotpotQA, which places no restriction on the output text. However, its performance drops significantly on more complex benchmarks, where it generates numerous intermediate steps and fails to stay aligned with the original question.

Agent Workflow achieves a significant performance improvement on all benchmarks. Agent Workflow applies an idea similar to Step-wise BoN, breaking complex tasks into smaller sub-tasks, but it designs a series of domain-specific system prompts, which reduces unnecessary long-context reasoning. However, a gap remains between Agent Workflow and the o1 model, which may be because Agent Workflow explores a less diverse space of responses.

• Systematic Analysis (SA). Starting from the overall structure of the problem, o1 first analyzes the inputs, outputs, and constraints, and then decides on the choice of algorithm and the use of data structures.

• Method Reuse (MR). For some problems that can be transformed into classic problems (such as the shortest path or knapsack problem), o1 can quickly reuse existing methods to solve them.

• Divide and Conquer (DC). It breaks down a complex problem into subproblems and constructs the overall solution by solving the subproblems.

• Self-Refinement (SR). o1 assesses its reasoning process during inference to determine whether there are any issues and corrects any errors it finds.

• Context Identification (CI). For some datasets requiring additional information input (e.g., HotpotQA), o1 first summarizes different aspects of the context related to the query, and then gives the response for the corresponding query.

• Emphasizing Constraints (EC). For some datasets with constraints on the generated text (e.g., Collie), o1 usually emphasizes the corresponding constraints during the reasoning process.
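One way to analyze such patterns quantitatively is to tally hand-labeled annotations per benchmark. The sketch below uses made-up illustrative labels, not the paper's actual annotation data:

```python
from collections import Counter

# Toy annotations: each o1 response is tagged with the reasoning patterns
# observed in it (SA, MR, DC, SR, CI, EC, as defined above). Purely illustrative.
annotations = {
    "HotpotQA": [["CI", "SA"], ["CI"], ["MR", "CI"]],
    "Collie":   [["EC"], ["EC", "SR"], ["EC", "SA"]],
}

def pattern_frequencies(ann: dict[str, list[list[str]]]) -> dict[str, Counter]:
    """Count how often each reasoning pattern appears per benchmark."""
    return {bench: Counter(p for labels in resps for p in labels)
            for bench, resps in ann.items()}
```

Such counts would make visible, for example, that CI dominates on context-heavy datasets like HotpotQA while EC dominates on format-constrained ones like Collie.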