Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions
“Large Language Models (LLMs) have demonstrated remarkable capabilities in various NLP tasks. However, previous works have shown these models are sensitive towards prompt wording, and few-shot demonstrations and their order, posing challenges to fair assessment of these models. As these models become more powerful, it becomes imperative to understand and address these limitations. In this paper, we focus on LLMs robustness on the task of multiple-choice questions — commonly adopted task to study reasoning and fact-retrieving capability of LLMs. Investigating the sensitivity of LLMs towards the order of options in multiple-choice questions, we demonstrate a considerable performance gap of approximately 13% to 75% in LLMs on different benchmarks, when answer options are reordered, even when using demonstrations in a few-shot setting. Through a detailed analysis, we conjecture that this sensitivity arises when LLMs are uncertain about the prediction between the top-2/3 choices, and specific options placements may favor certain prediction between those top choices depending on the question caused by positional bias. We also identify patterns in top-2 choices that amplify or mitigate the model’s bias toward option placement. We found that for amplifying bias, the optimal strategy involves positioning the top two choices as the first and last options. Conversely, to mitigate bias, we recommend placing these choices among the adjacent options. To validate our conjecture, we conduct various experiments and adopt two approaches to calibrate LLMs’ predictions, leading to up to 8 percentage points improvement across different models and benchmarks.”
In addition to using existing datasets, such as the one from the NL4Opt competition [16], one of our goals is to measure how well the framework is able to model and recognize problems. To evaluate this, we plan to test the system at various levels of abstraction.
The first level of abstraction would be the easiest: the problem name would be clearly stated (e.g., using words like tsp, knapsack, graph coloring, ...), and the constraints and variables would already be identified using tokens. This is the baseline.
In the second level, we would omit the name of the problem while still explicitly describing its variables, constraints, and parameters. This level will help us see whether known problem classes are recognized.
In the third level, the usual modeling vocabulary (e.g., words like constraint, variable, domain, ...) would be removed. Here, we wish to measure how well the components of the model can be identified without being marked beforehand by such tokens.
The fourth level would be the most abstract one. Some parameters could be implicit or non-numerical, for example. This would be the closest to how a human unaware of what an optimization problem is would describe the task. This final level allows us to really see whether LLMs can still recover a model when the explicit structure of the problem disappears entirely.
It is to be expected that the more abstract the problem descriptions become, the more difficult the task would be for the LLMs, and the more interaction they would need with the user to arrive at a correct model and solution. Simple examples of the different levels of abstraction are given in Example 1.
▶ Example 1. Problem descriptions with different levels of abstraction for the same model:
(L1) I wish to solve a Knapsack problem, where I have 5 items, and so 5 binary variables to tell me which items are in or not. The weights of my items are 2, 3, 7, 4, and 1. The utilities of my items are 2, 3, 1, 2, and 3. The limit of weight is 10. Can you model it?
(L2) I wish to solve a problem, where I have 5 items, and so 5 binary variables to tell me which items are in or not. The weights of my items are 2, 3, 7, 4, and 1. The utilities of my items are 2, 3, 1, 2, and 3. I wish to limit the total weight to 10. Also, I wish to maximize the utility of the objects I take. Can you model it?
(L3) I wish to go on vacation. The airport only allows 10 kg for my suitcase. I have 5 items. The weights of my items are 2, 3, 7, 4, and 1. The utilities of my items are 2, 3, 1, 2, and 3. Can you tell me which items I should take in order to have the best vacation?
(L4) I wish to go on vacation. The airport only allows 10 kg for my suitcase. I have 5 items: my ski suit, weighing 7 kg, some warm clothes, weighing 4 kg, some hiking boots, weighing 3 kg, a book on hiking, weighing 1 kg, and an umbrella, weighing 2 kg. As I'm going hiking, I think my boots and my book are really important, while the ski suit would not help me much. Can you tell me which items I should take in order to have the best vacation?
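All four descriptions above correspond to the same underlying knapsack model: five binary variables, a weight limit of 10, and utility to maximize. As a minimal sketch of the expected target model (this brute-force Python solver is only an illustration, not part of the framework), the instance can be written and solved as:

```python
from itertools import product

# Instance data shared by all four descriptions (L1-L4).
weights = [2, 3, 7, 4, 1]
utilities = [2, 3, 1, 2, 3]
capacity = 10

# Brute-force search over the 2^5 assignments of the binary variables:
# choice[i] = 1 means item i is packed.
best_utility, best_choice = -1, None
for choice in product([0, 1], repeat=len(weights)):
    total_weight = sum(c * w for c, w in zip(choice, weights))
    total_utility = sum(c * u for c, u in zip(choice, utilities))
    if total_weight <= capacity and total_utility > best_utility:
        best_utility, best_choice = total_utility, choice

# The optimal selection takes every item except the 7 kg one
# (the ski suit in L4, which has utility 1).
print(best_choice, best_utility)  # (1, 1, 0, 1, 1) 10
```

With only five items, exhaustive enumeration is enough here; a system at level L1 or L2 would be expected to emit an equivalent constraint model rather than an ad hoc search.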