Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web Agents

Paper · arXiv 2402.17896 · Published February 27, 2024
Agentic Research

Existing question answering (QA) datasets are no longer challenging to the most powerful Large Language Models (LLMs). Traditional QA benchmarks like TriviaQA, NaturalQuestions, ELI5 and HotpotQA mainly study “known unknowns” with clear indications of both what information is missing and how to find it to answer the question. Hence, good performance on these benchmarks provides a false sense of security. A yet-unmet need of the NLP community is a bank of non-factoid, multi-perspective questions involving a great deal of unclear information needs, i.e. “unknown unknowns”. We claim we can find such questions in search engine logs, which is surprising because most question-intent queries are indeed factoid. We present Researchy Questions, a dataset of search engine queries tediously filtered to be non-factoid, “decompositional” and multi-perspective. We show that users spend a lot of “effort” on these questions in terms of signals like clicks and session length, and that they are also challenging for GPT-4. We also show that “slow thinking” answering techniques, like decomposition into sub-questions, show benefit over answering directly. We release ∼100k Researchy Questions, along with the Clueweb22 URLs that were clicked.

We believe the well-studied phenomenon of “unknown unknowns” (United States Congress et al., 1981) applies to LLM Agents in scenarios addressing complex questions requiring “slow thinking” (Kahneman, 2011). Simply put, one strategy is to iteratively re-frame or decompose the problem into a set of “known unknowns” (which characterize most of the aforementioned QA datasets). For these sub-problems, it should be clearer what information is missing, how to find it, and once found, how the “known known” contributes to the final answer. Several techniques such as chain-of-thought question decomposition (Radhakrishnan et al., 2023) and tree-of-thought (Yao et al., 2023a) prompting take a similar approach to plan long-horizon solutions to complex problems. However, those studies still operate over traditional QA benchmarks like HotpotQA, or over simple games like crossword puzzles. Hence, the right benchmark of questions for these advanced decomposition techniques still does not exist for open-domain web scenarios (Krishna et al., 2021).

We present Researchy Questions to study the dynamics of how LLM agents handle unclear information needs associated with very complex questions. We define a Researchy Question as a non-factoid question that expects a long-form answer (longer than a paragraph!) entailing substantial research or effort to synthesize. A Researchy Question can be instantiated as a complex search task (Aula and Russell, 2008) with unclear information needs that requires analyzing multiple documents or pieces of evidence. A Researchy Question does not have a single correct answer, but rather multiple perspectives allowing a dense manifold of answers over which varying criteria can determine which is better. In practice, the act of answering a Researchy Question probably involves decomposition into sub-questions that aid the retrieval of comprehensive information, reducing the risk of missing unknown unknowns. Lastly, a Researchy Question represents a genuine information need that real people asked. Figure 1 qualitatively compares Researchy Questions with other canonical QA datasets.

For the following (Question | score | reason) triples, the score indicates how "good" of a non-factoid question they are, in the sense that they can lead to interesting and in-depth analysis.

Definition: A good non-factoid question is specific, with potential to amount to a good research report with a clear and refutable thesis, supported by evidence and analysis.

Characteristic formats of good non-factoid questions (not exhaustive):

• Good non-factoid questions will often talk about the relationship between two things, e.g. "Compare and contrast X and Y", "How/why does X affect/impact Y?", "Why X is significant to Y", or "What role does X play in Y?", or "to what extent does X lead to Y?", etc.

• A good non-factoid question can also ask "Why does X happen", "What factors play a role in X?", "How is X significant" or "What is the cause of X", but it should be specific about what kind of analysis is expected.

• Other forms of good non-factoid questions can ask about the pros/cons, benefits/detriments of something, or compare/contrast two things, etc.

Instructions: Rate each question on a scale of 0-10, where 0 is a factoid question and 10 is an excellent non-factoid question, then provide a brief reason for your rating.

Q: how tall is abraham lincoln | 0 | factoid

Q: can i change the weather | 2 | personal question

Q: was the civil war fought over slavery | 5 | fair, but could more directly ask about other important facets of the causes of the civil war and their role in the conflict

Q: to what extent was the civil war fought over slavery | 8 | good, will lead to in-depth analysis on the causes of the civil war

Q: what impact do human activities have on the weather | 10 | excellent, many in-depth reports written to answer this question

Q: should LA invest more in railway or highway infrastructure for public transport | 9 | great

Q: what is an example of blackbody radiation? | 0 | asking for an example

Q: could not determine type for | 0 | not a question

Q: what typically signals the end of the olympic games | 2 | factoid, olympic closing ceremony can be looked up easily

Q: Why were Navajo code talkers used during WW2? | 7 | good, could lead to analysis of how culture and language can be used in warfare

Q: When does protein folding begin? | 1 | has a single, known correct answer

Q: what is the cost and necessary materials to build a refinery | 5 | fair, asks about a complex process but will not likely elicit analysis

Q: What is the Navavidha Bhakti? | 0 | asking for a definition

Q: why is technological change bad? | 5 | fair, but could be more specific

Q: analyze how technological changes have historically impacted cultures | 10 | excellent, very specific

Q: who owns phone number 280-626-1435 | 0 | personally identifiable information

Q: What are the main differences between regulations of the NFL and the CFL? | 4 | has potential for in-depth analysis but doesn’t explicitly ask for it

Q: Why do planes using rivets & not welded construction? | 7 | good, will require in-depth analysis on aerospace technology

Q: How did the Catholic Pope manage to become more powerful than Kings in old Europe? | 9 | much potential for historical analysis

Q: interesting facts about korea | 0 | not specific

Q: {Question} |
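A minimal sketch (with hypothetical helper names, not from the paper) of how a completion for the few-shot rating prompt above could be parsed: the model is expected to continue the `Q: {Question} |` line with a `score | reason` pair, which can be validated against the 0-10 scale.

```python
import re
from typing import Optional, Tuple

def parse_rating(completion: str) -> Optional[Tuple[int, str]]:
    """Parse a model completion of the form ' 8 | good, will lead to ...'.

    Returns (score, reason) when the completion matches the expected
    'score | reason' format with a score in [0, 10], else None.
    """
    match = re.match(r"\s*(\d{1,2})\s*\|\s*(.*)", completion)
    if match is None:
        return None
    score = int(match.group(1))
    if not 0 <= score <= 10:
        return None  # out-of-range scores are treated as invalid output
    return score, match.group(2).strip()
```

Malformed completions (no pipe separator, or a score outside 0-10) return `None` rather than raising, so a caller can simply drop or re-query those questions.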

Given the question: {Question}

Instructions: Please output a python dictionary with fields scoring the question on the following criteria:

  1. "ambiguous" : Int 0-10 to what extent is the intent of the question ambiguous (has more than one interpretation); 0 means no major ambiguity. Not to be confused with subjectiveness or incompleteness.

  2. "incompleteness" : Int 0-10 indicating how difficult it is to determine the intent of the question, i.e. whether it is missing crucial context or details that ought to be specified in order to answer it; 0 means the question is answerable and self-contained, 10 means the question is un-answerable because it is incomplete or under-specified.

  3. "assumptive" : Int 0-10 the degree to which the question has built-in assumptions or biases (that are not offensive, which is point 8 below); 0 means no notable or unreasonable assumptions.

  4. "multi-faceted" : Int 0-10 the degree to which the question has multiple facets or perspectives that need to be considered in order to answer it; 0 means the question is straightforward and has a single, undisputed answer.

  5. "knowledge-intensive" : Int 0-10 the degree to which the question would require specialized knowledge (like textbooks, scholarly articles, etc) to provide a thorough and grounded answer; 0 means the answer is common knowledge or can be looked up instantly in common references, 10 means the questions probably entails a lot of work to find and analyze specialized knowledge.

  6. "subjective" : Int 0-10 the degree to which the question is subjective, meaning an answer(s) exist, but there is no agreed-upon way to determine which one is better; 0 means the question is largely objective i.e. the overwhelming majority of people would agree on the answer if they knew it.

  7. "reasoning-intensive" : Int 0-10 the degree to which the question requires reasoning to synthesize an answer; 0 means the question can be answered trivially e.g. by looking up a fact, referencing an encyclopedia or database, or using a calculator (once).

  8. "harmful" : Int 0-10 to what extent the question could be interpreted as being harmful (physically or psychologically to oneself, others, or animals), offensive, overly biased, sexually explicit, or otherwise inappropriate for e.g. someone of the age of 12 to be exposed to.

Note that the above criteria are not mutually exclusive, e.g. a question can be both subjective and knowledge-intensive, for example "is capitalism better than socialism" would be both. Make sure to output only the valid python dictionary without comments or other extraneous output.
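Since the prompt above instructs the model to emit a bare python dictionary, a sketch of a validator (hypothetical helper, not part of the paper's released code) can check that all eight criteria are present with integer scores in the stated 0-10 range:

```python
import ast

# The eight criteria named in the prompt, in order.
CRITERIA = [
    "ambiguous", "incompleteness", "assumptive", "multi-faceted",
    "knowledge-intensive", "subjective", "reasoning-intensive", "harmful",
]

def validate_criteria(output: str) -> dict:
    """Safely parse the model's dict literal and validate its schema.

    Raises ValueError if the output is not a dict, is missing a
    criterion, or has a score outside [0, 10].
    """
    scores = ast.literal_eval(output)  # safe: literals only, no code exec
    if not isinstance(scores, dict):
        raise ValueError("expected a python dict literal")
    missing = [k for k in CRITERIA if k not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for key in CRITERIA:
        value = scores[key]
        if not isinstance(value, int) or not 0 <= value <= 10:
            raise ValueError(f"invalid score for {key!r}: {value!r}")
    return scores
```

Using `ast.literal_eval` rather than `eval` means a malformed or adversarial completion can at worst raise an exception, never execute code.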