Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying
Studies have underscored how, despite recent breakthroughs and swift advances in AI research, even state-of-the-art Large Language Models (LLMs) continue to struggle with logical and mathematical reasoning. The results suggest that LLMs still operate as (highly advanced) data pattern identifiers, scoring poorly when attempting to generalise and solve reasoning problems they have never previously seen or that are not close to samples in their training data. To address this compelling concern, this paper draws on the notion of critical questions from the literature on argumentation theory, focusing in particular on Toulmin’s model of argumentation. We show that employing these critical questions can improve the reasoning capabilities of LLMs: by probing the rationale behind the model’s reasoning process, the LLM can assess whether a logical mistake is occurring and correct it before providing its final reply to the user prompt. The underlying idea is drawn from the gold standard of any valid argumentative procedure: the conclusion is valid if it is entailed by accepted premises. Or, to paraphrase this Aristotelian principle for a real-world setting characterised by incomplete information and presumptive logic: the conclusion is valid unless proved otherwise.
We show that the CQoT technique provides remarkable improvements over the LLMs’ baseline performance, thus further corroborating the test-time compute hypothesis: it is possible to enhance the capabilities of generative AI models by granting them more “time to think”.
In this tradition, a seminal work is Toulmin’s “The Uses of Argument” [52], which began to formalize the idea that in reasoning it is not just the conclusion that matters, but also the rationale behind the conclusion reached. In Toulmin’s view, all conclusions are defeasible, in the sense that they can be overturned by subsequent inferences. Conclusions, which Toulmin calls “claims”, are constructed from data about the world and supported by a warrant, a reason for thinking that the conclusion holds, which is itself derived from some backing (experience or experimental data). This whole structure constitutes an argument. Any claim may be subject to a rebuttal, and the final verdict is reached only by taking into account all such rebuttals (each of which is itself an argument), examining the warrants, and deciding which argument is most plausible.
One way to consider the kind of reasoning captured by argumentation is as a formal representation that extends classical logic to permit defeasible reasoning. This matters because one way of tractably handling the complexity of large amounts of knowledge about the world is to permit presumptive reasoning, that is, reasoning that allows for “jumping to conclusions”. (This makes it possible to represent general patterns of reasoning that summarise a great deal of specific information.) However, jumping to conclusions can sometimes lead to incorrect claims, which then need to be withdrawn, so any system of presumptive reasoning must be able to reason defeasibly.
As a result, much of the work on argumentation has been concerned with defeasibility.
The seminal work in this area, however, is due to Dung [15], who introduced the idea that one can consider the properties of a set of arguments at a purely abstract level. Given a set of arguments and a relation identifying pairs of arguments in which one attacks the other, it is possible to establish several well-founded principles under which sets of acceptable arguments may be determined. For example, any argument that is not attacked is acceptable, and any argument that is attacked by an acceptable argument is not acceptable (this leads to a simple fixpoint definition of acceptability, illustrated in the sketch below). This basic idea has been widely adapted and extended; see for example [3].
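To make the fixpoint idea concrete, the following minimal Python sketch (our illustration, not notation from [15]) computes the grounded extension of an abstract argumentation framework by iterating the “defended” operator from the empty set until it stabilises:

```python
def grounded_extension(arguments, attacks):
    """Grounded extension of an abstract argumentation framework.

    arguments: a set of argument labels.
    attacks: a set of (attacker, target) pairs.
    """
    attackers_of = {a: {x for (x, y) in attacks if y == a} for a in arguments}

    accepted = set()
    while True:
        # An argument is defended if every one of its attackers is itself
        # attacked by an already-accepted argument (vacuously true if it
        # has no attackers, so unattacked arguments are accepted first).
        defended = {
            a for a in arguments
            if all(any((d, b) in attacks for d in accepted)
                   for b in attackers_of[a])
        }
        if defended == accepted:
            return accepted  # least fixpoint reached
        accepted = defended

# Example: c attacks b, b attacks a. 'c' is unattacked, so it is accepted;
# since 'c' attacks 'b', argument 'a' is defended and accepted as well.
print(grounded_extension({"a", "b", "c"}, {("c", "b"), ("b", "a")}))
# -> {'a', 'c'}
```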
Toulmin’s schema provides part of what we need to augment LLM reasoning, and we combine that idea with the more recent notions of argument schemes and critical questions [56, 55]. Broadly speaking, the idea behind argument schemes is that presumptive reasoning operates through the application of a number of schemes that capture common patterns of reasoning. Since they support presumptive reasoning, they are not expected to be deductive in the truth-preserving sense of classical logic. Rather, they capture rules for drawing inferences that will often provide useful conclusions but will sometimes lead to invalid claims. In that sense, such schemes are more like the default rules of default logic [42]. One of the key features of argumentation schemes is their list of associated critical questions (CQs), whose role is to highlight when the scheme may or may not lead to reasonable conclusions. The premises and the claim (or conclusion) supported by an argumentation scheme are presumptive, and a premise or claim is withdrawn unless the CQs posed have been answered successfully. In general, CQs can also be used as pointers to counter-arguments that themselves instantiate argument schemes. CQs can then be used in two different ways: (i) posing CQs to construct counter-arguments that can themselves instantiate schemes with their own CQs; (ii) posing CQs to challenge an existing argument (e.g. questioning a premise), which can be responded to by constructing a supporting argument.
In this paper, we treat the LLM as producing reasoning that can be structured according to Toulmin’s schema, and we then use pertinent critical questions to check the validity of the presumptive conclusions that the LLM has reached.
As described above, Toulmin’s model of argument includes the following six components:
• Claim - The conclusion or main argument being made;
• Qualifier - Limits or conditions on the claim;
• Data - The facts, evidence, or premises that support the claim;
• Warrant - The logical connection or justification that bridges the data and claim;
• Rebuttal - Acknowledgment of counterarguments or exceptions to the claim;
• Backing - Additional support or justification for the warrant.
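For illustration, these six components can be encoded as a simple data structure; the minimal Python sketch below is ours, with field names that merely mirror the list above:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ToulminArgument:
    claim: str                       # the conclusion being made
    data: List[str]                  # facts/premises supporting the claim
    warrant: str                     # justification bridging data and claim
    backing: Optional[str] = None    # support for the warrant itself
    qualifier: Optional[str] = None  # limits or conditions on the claim
    rebuttals: List[str] = field(default_factory=list)  # known exceptions

# A textbook instantiation:
socrates = ToulminArgument(
    claim="Socrates is mortal",
    data=["Socrates is a man"],
    warrant="All men are mortal",
    backing="No recorded counterexample to human mortality",
    qualifier="necessarily",
)
```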
Our Critical-Questions-of-Thought (CQoT) approach can be summarised by the four steps illustrated in Figure 2. The chosen name already hints at the main component of this technique, i.e., the critical questions, whose contribution occurs in Step 2 of the pipeline. For each CQ we identify which elements of Toulmin’s model the question targets. Note that we are not claiming that an LLM follows the Toulmin model of argumentation, but rather that LLMs engage in presumptive reasoning, and so a set of CQs that covers all the elements of Toulmin’s general schema of presumptive reasoning provides a suitable battery of queries with which to test any conclusion that an LLM comes up with. The critical questions we leverage are:
1. Does the reasoning process start with clearly defined premises? (Data)
2. Are the premises supported by evidence or accepted facts? (Data)
3. Does the reasoning process use logical connections between premises and conclusions? (Warrant)
4. Are the logical connections used in the reasoning process valid? (Warrant)
5. Does the reasoning process avoid fallacies or logical errors? (Warrant, Backing)
6. Is the conclusion logically derived from the premises? (Claim)
7. Is the reasoning process consistent with established knowledge or principles? (Backing)
8. Does the reasoning process lead to a conclusion that is plausible and reasonable? (Claim, Qualifier, Rebuttal)
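These eight questions, together with the Toulmin elements each one targets, can be written down directly as data. One possible encoding in Python, reused in the pipeline sketch later in this section:

```python
# The eight CQs, each paired with the Toulmin elements it targets.
CRITICAL_QUESTIONS = [
    ("Does the reasoning process start with clearly defined premises?",
     ("Data",)),
    ("Are the premises supported by evidence or accepted facts?",
     ("Data",)),
    ("Does the reasoning process use logical connections between premises "
     "and conclusions?", ("Warrant",)),
    ("Are the logical connections used in the reasoning process valid?",
     ("Warrant",)),
    ("Does the reasoning process avoid fallacies or logical errors?",
     ("Warrant", "Backing")),
    ("Is the conclusion logically derived from the premises?",
     ("Claim",)),
    ("Is the reasoning process consistent with established knowledge or "
     "principles?", ("Backing",)),
    ("Does the reasoning process lead to a conclusion that is plausible "
     "and reasonable?", ("Claim", "Qualifier", "Rebuttal")),
]
```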
Critical-Questions-of-Thought is a simple approach that can be implemented on top of any LLM. In this section, we introduce and break down each of the four stages that compose the CQoT pipeline, as depicted in Figure 2. In a nutshell, Steps 1, 2 and 4 can be interpreted as different prompting strategies, each leveraging the outcome of the previous steps. Step 3 operates as a checkpoint: it verifies the model’s responses to the CQs and decides whether the pipeline should continue to the final phase or iterate, effectively starting again from the initial stage. More precisely:
• Step 1. Prompt the LLM to reason logically about the input query and sketch a plan to follow. The model must divide each reasoning process into premises and conclusions so that each conclusion derives from the respective premises. No final response should be provided.
• Step 2. The model will now have to assess the validity of the previously envisaged reasoning steps by checking whether they address the critical questions devised in Section 3. All such CQs are related to the validity of logical reasoning and refer to specific elements of the Toulmin representation of arguments.
• Step 3. The model has to evaluate the replies to the critical questions. If there are at least 7/8 positive answers, then the pipeline will move to Step 4. Otherwise, the pipeline will start again from Step 1. This will occur for a maximum of five iterations, after which the requirement will drop to a minimum of 5/8 positive answers. If this score has not been reached after a total of ten iterations, the pipeline will, regardless, move to Step 4 with the latest generated reasoning plan.
• Step 4. Having verified the model’s reasoning process, the LLM is now required to follow it strictly in order to provide the final reply to the original user prompt.
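The following Python sketch puts the four steps together. It assumes a generic `llm(prompt) -> str` completion function and the `CRITICAL_QUESTIONS` list defined above; the prompt wordings are illustrative rather than the exact prompts used in our experiments:

```python
def cqot(llm, user_prompt, max_iters=10):
    """A minimal sketch of the four-step CQoT pipeline."""
    plan = ""
    for iteration in range(1, max_iters + 1):
        # Step 1: sketch a reasoning plan (premises -> conclusions),
        # without producing a final answer yet.
        plan = llm(
            "Reason logically about the query below and sketch a plan to "
            "follow. Divide the reasoning into premises and conclusions so "
            "that each conclusion derives from its premises. Do NOT provide "
            f"a final response.\n\nQuery: {user_prompt}"
        )
        # Step 2: assess the plan against each critical question.
        answers = [
            llm(f"Reasoning plan:\n{plan}\n\nAnswer YES or NO: {question}")
            for question, _targets in CRITICAL_QUESTIONS
        ]
        positives = sum("YES" in a.upper() for a in answers)
        # Step 3: checkpoint. Require 7/8 positive answers for the first
        # five iterations, then relax the requirement to 5/8; after ten
        # iterations, proceed regardless with the latest plan.
        threshold = 7 if iteration <= 5 else 5
        if positives >= threshold:
            break
    # Step 4: answer the original prompt, strictly following the plan.
    return llm(
        "Strictly follow this reasoning plan to reply to the query.\n\n"
        f"Plan:\n{plan}\n\nQuery: {user_prompt}"
    )
```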