Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1

Paper · arXiv 2410.02162 · Published October 3, 2024

OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs–making it a new kind of model: a Large Reasoning Model (LRM). In this paper, we evaluate the planning capabilities of two LRMs (o1-preview and o1-mini) on both planning and scheduling benchmarks. We see that while o1 does seem to offer significant improvements over autoregressive LLMs, this comes at a steep inference cost, while still failing to provide any guarantees over what it generates. We also show that combining o1 models with external verifiers–in a so-called LRM-Modulo system–guarantees the correctness of the combined system’s output while further improving performance.

We argue that, to be complete, new approaches to measuring LRM reasoning capabilities must take into account efficiency, cost, and guarantees. We also note the steep inference cost of LRMs and discuss the tradeoffs between using LLMs vs. LRMs, arguing that in some cases an LLM-Modulo [16] approach may be significantly cheaper than o1 models at comparable performance, and with guarantees.

In contrast, we focus on classical planning problems, also known as STRIPS planning problems: a formalism for automated planning in discrete, deterministic spaces. To define a planning problem, we specify an initial state, a domain, and a goal. The domain contains all relevant information about the types of objects that may exist and the allowable actions from any given state, specified by defining the preconditions and effects of each named action. Problems and domains are represented in the flexible PDDL (Planning Domain Definition Language) framework [22]. Solutions to PDDL problems are correct plans–sequences of actions executable from the initial state which arrive at a goal-satisfying final state. These are problems in which the planner already knows all relevant facts about the world and which actions are possible–only deliberation is required.
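
To make the setting concrete, the following is a minimal sketch (in Python, with hypothetical names rather than any benchmark code from the paper) of a STRIPS-style plan check: a state is a set of ground facts, an action carries preconditions and add/delete effects, and a plan is correct exactly when every action is executable in sequence and the final state satisfies the goal.

```python
# Minimal STRIPS-style plan check (hypothetical names, not the paper's
# benchmark code). A state is a frozenset of ground facts; an action is
# applicable when its preconditions hold, and applying it removes the
# delete effects and adds the add effects.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset

def apply_action(state: frozenset, action: Action) -> frozenset:
    """Apply an action to a state, raising if its preconditions do not hold."""
    if not action.preconditions <= state:
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.del_effects) | action.add_effects

def validate_plan(initial: frozenset, goal: frozenset, plan: list) -> bool:
    """A plan is correct iff every action executes in sequence and the final
    state satisfies the goal."""
    state = initial
    try:
        for action in plan:
            state = apply_action(state, action)
    except ValueError:
        return False
    return goal <= state

# Tiny Blocksworld-flavoured instance: stack block a (on the table) onto b.
stack_a_on_b = Action(
    name="stack(a,b)",
    preconditions=frozenset({"clear(a)", "clear(b)", "ontable(a)"}),
    add_effects=frozenset({"on(a,b)"}),
    del_effects=frozenset({"clear(b)", "ontable(a)"}),
)
init = frozenset({"clear(a)", "clear(b)", "ontable(a)", "ontable(b)"})
goal = frozenset({"on(a,b)"})
print(validate_plan(init, goal, [stack_a_on_b]))  # True
```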

However, as LRMs adaptively vary their time taken and dollar cost per instance in response to the input, measuring efficiency has become much more important.

We believe that o1’s architecture supplements an underlying LLM with System 2-like abilities, allowing it to outperform previous models.

While planning problems normally require the agent to formulate a course of action to achieve a goal, an equally valid use of planning abilities is to recognize that a given goal cannot be accomplished by any plan. A real-world example of this is network vulnerability analysis, where an agent may wish to certify that no plan of attack exists for a specified system [2]. So far, LLMs have struggled to recognize that some problems cannot be solved, instead confidently confabulating nonsensical answers. o1 was launched with the claim that it has started to overcome this issue, and can now accurately identify unsolvable problems [3].
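
One way to see why recognizing unsolvability is a well-defined task in this setting: because a STRIPS instance over finitely many ground facts induces a finite, deterministic state space, an exhaustive reachability check can certify that no plan exists. The sketch below (a hypothetical helper, not the verifier used in the paper) illustrates the idea with a breadth-first search.

```python
# Sketch: certifying unsolvability by exhaustive reachability (hypothetical
# helper, not the paper's verifier). In a finite, deterministic STRIPS state
# space, if breadth-first search exhausts every reachable state without
# satisfying the goal, then no plan exists.
from collections import deque

def is_solvable(initial: frozenset, goal: frozenset, actions) -> bool:
    """actions: iterable of (name, preconditions, add_effects, del_effects)
    tuples, where the last three are frozensets of ground facts."""
    seen = {initial}
    frontier = deque([initial])
    while frontier:
        state = frontier.popleft()
        if goal <= state:
            return True              # a goal-satisfying state is reachable
        for _, pre, add, delete in actions:
            if pre <= state:
                successor = (state - delete) | add
                if successor not in seen:
                    seen.add(successor)
                    frontier.append(successor)
    return False                     # search exhausted: provably unsolvable

# A goal demanding a fact that no action can ever add is certified unsolvable.
acts = [("stack(a,b)", frozenset({"clear(a)", "clear(b)"}),
         frozenset({"on(a,b)"}), frozenset({"clear(b)"}))]
print(is_solvable(frozenset({"clear(a)", "clear(b)"}),
                  frozenset({"on(b,a)"}), acts))  # False
```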

In the remaining 54% of cases, the model generated a full (and therefore impossible and incorrect!) plan.

When the model gives an incorrect answer, it also sometimes provides a creative, but nonsensical, justification for its decision. It is almost as if o1 has gone from hallucinating to gaslighting! In one case, it decided that an unsolvable problem is solvable because a goal condition, while not present in the final state, had been true at some point during the execution, and thus should continue to count. In another, it declared that on(a,c) was true because, as it explained in a brief parenthetical, a was on b which was on c, and thus a was somewhere above c, which should count as being "on top" of it.

Prior to the release of these models, the best way to coax planning capabilities out of LLMs was to pair them with a sound external verifier in a generate-test framework, in what are known as LLM-Modulo systems [16, 33]. This framework is broadly applicable even beyond LLMs, and–given a sound verifier for some domain–requires only a generator expressive enough to provide guesses for that domain. Moreover, because of the built-in verification, it guarantees that any answer output is correct. For safety-critical systems, this is essential!
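
As an illustration of the generate-test idea, the following sketch shows the shape of such a loop under assumed interfaces (the `generate` and `verify` callables are hypothetical, not the implementation from [16]): candidate plans from the generator are only returned after they pass a sound verifier, so any output of the loop carries a correctness guarantee.

```python
# Illustrative generate-test loop in the spirit of LLM-Modulo (assumed
# interfaces; `generate` and `verify` are hypothetical, not the code from
# [16]). Only candidates that pass the sound verifier are ever returned,
# so any output of the loop is correct by construction.
from typing import Callable, Optional, Tuple

def llm_modulo(
    generate: Callable[[str, str], str],         # (problem, feedback) -> candidate plan
    verify: Callable[[str], Tuple[bool, str]],   # candidate -> (is_valid, critique)
    problem: str,
    max_rounds: int = 10,
) -> Optional[str]:
    feedback = ""
    for _ in range(max_rounds):
        candidate = generate(problem, feedback)
        ok, critique = verify(candidate)
        if ok:
            return candidate    # passed the sound verifier: guaranteed correct
        feedback = critique     # back-prompt the generator with the critique
    return None                 # budget exhausted without a verified plan; never guess
```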