TaskLAMA: Probing the Complex Task Understanding of Language Models
“Structured Complex Task Decomposition (SCTD) is the problem of breaking down a complex real-world task (such as planning a wedding) into a directed acyclic graph over individual steps that contribute to achieving the task, with edges specifying temporal dependencies between them. SCTD is an important component of assistive planning tools, and a challenge for commonsense reasoning systems. We probe how accurately SCTD can be done with the knowledge extracted from Large Language Models (LLMs). We introduce a high-quality human-annotated dataset for this problem and novel metrics to fairly assess performance of LLMs against several baselines. Our experiments reveal that LLMs are able to decompose complex tasks into individual steps effectively, with a relative improvement of 15% to 280% over the best baseline. We also propose a number of approaches to further improve their performance, with a relative improvement of 7% to 37% over the base model. However, we find that LLMs still struggle to predict pairwise temporal dependencies, which reveals a gap in their understanding of complex tasks.”
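The abstract defines SCTD output as a directed acyclic graph over steps, with edges encoding temporal dependencies. As a minimal illustrative sketch (not the paper's implementation), one can represent a decomposition as steps plus precedence edges and use Kahn's algorithm to recover a valid execution order, or detect that a predicted dependency structure is cyclic and hence invalid; the example task and step names below are hypothetical.

```python
from collections import deque

def topological_order(steps, edges):
    """Return a valid execution order for a step DAG, or None if cyclic.

    steps: list of step names; edges: list of (before, after) pairs, where
    `before` must precede `after` (a temporal dependency).
    """
    indegree = {s: 0 for s in steps}
    succ = {s: [] for s in steps}
    for a, b in edges:
        succ[a].append(b)
        indegree[b] += 1
    queue = deque(s for s in steps if indegree[s] == 0)
    order = []
    while queue:
        s = queue.popleft()
        order.append(s)
        for t in succ[s]:
            indegree[t] -= 1
            if indegree[t] == 0:
                queue.append(t)
    # If some steps were never emitted, the dependency graph contains a cycle.
    return order if len(order) == len(steps) else None

# Hypothetical decomposition of "planning a wedding".
steps = ["set budget", "book venue", "send invitations", "hold ceremony"]
edges = [("set budget", "book venue"),
         ("book venue", "send invitations"),
         ("send invitations", "hold ceremony")]
print(topological_order(steps, edges))
```

A check like this is useful when an LLM's predicted pairwise dependencies must be assembled into a coherent plan: any cycle means the predictions are mutually inconsistent.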
To address the data scarcity issue in conversational question answering (ConvQA), a dialog inpainting method, which generates ConvQA datasets from documents, has been proposed. However, the original dialog inpainting model is trained solely on the dialog reconstruction task, so the questions it generates have low contextual relevance because the model learns too little about question-answer alignment. To overcome this limitation, we propose Dialogizer, a novel framework that automatically generates ConvQA datasets with high contextual relevance from textual sources. The framework incorporates two training tasks: question-answer matching (QAM) and topic-aware dialog generation (TDG). Moreover, during inference, the generated questions are re-ranked by their contextual relevance.
Recent work (Dai et al., 2022) proposes a dialog inpainting method to address this challenge by automatically generating ConvQA datasets from preexisting text datasets. The text dataset is segmented into sentence-level units, which are used directly as answers, while the trained dialog inpainter generates questions corresponding to these answers to complete the conversation. Dialog inpainting has the potential to address the data scarcity issue owing to the abundance of online documents authored by domain experts and the ability to convert these documents into dialogs with well-defined answers. Given that the quality of the answers is guaranteed, it is crucial to generate questions that are well-aligned with each corresponding answer (Sun et al., 2018). However, we have observed low contextual relevance in the questions generated by a dialog inpainter trained solely on the dialog reconstruction task, as it receives insufficient training on question-answer alignment. For instance, as illustrated in Figure 1, the dialog inpainter tends to generate questions that lack answer specificity (e.g., the green case) or are contextually inappropriate (e.g., the blue case).
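The construction described above can be sketched as follows. This is a simplified illustration, not the authors' code: the sentence splitter is a naive regex stand-in, and `generate_question` is a hypothetical placeholder for the trained inpainter, which conditions on the dialog history and the target answer to produce the missing question turns.

```python
import re

def inpaint_dialog(document, generate_question):
    """Turn a document into a (question, answer) dialog, inpainting-style.

    Sentences of the document become the answers; `generate_question`
    stands in for the trained dialog inpainter.
    """
    # Naive sentence segmentation; real pipelines use a proper splitter.
    answers = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    dialog = []
    for answer in answers:
        # The inpainter sees the dialog so far plus the target answer.
        question = generate_question(history=dialog, answer=answer)
        dialog.append((question, answer))
    return dialog

# Toy stand-in generator (a real inpainter would be a trained seq2seq model).
toy_generator = lambda history, answer: f"What comes next after turn {len(history)}?"
doc = "ConvQA needs data. Inpainting creates it from documents."
for q, a in inpaint_dialog(doc, toy_generator):
    print(q, "->", a)
```

The key property is that every answer turn is verbatim document text, so answer quality is guaranteed and all modeling effort goes into the generated questions.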
The framework incorporates two training tasks, in addition to dialog reconstruction, to address the limitation of generating contextually irrelevant questions: a question-answer matching (QAM) task and a topic-aware dialog generation (TDG) task. In the QAM task, the model is given numerous QA pairs and learns to differentiate between matching and non-matching pairs, which teaches it to discern the contextual relevance of a QA pair. In the TDG task, we provide the model with keywords extracted from the target answer by a keyword extractor. The model then learns to generate answer-specific questions from these keywords together with the given answer sentence and the dialog history.
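The QAM task needs labeled matching and non-matching pairs. A minimal sketch of one common way to build such data, assuming the negatives are drawn by randomly pairing a question with an answer from a different QA pair (the source does not specify the sampling scheme, so this is an assumption):

```python
import random

def build_qam_pairs(qa_pairs, num_negatives=1, seed=0):
    """Build labeled examples for a question-answer matching (QAM) task.

    Each gold (question, answer) pair becomes a positive example (label 1);
    negatives (label 0) pair the question with an answer sampled from another
    QA pair, so the model learns to discriminate contextual relevance.
    Assumption: in-batch random negatives; other schemes (e.g. hard
    negatives) are possible.
    """
    rng = random.Random(seed)
    examples = []
    for i, (question, answer) in enumerate(qa_pairs):
        examples.append((question, answer, 1))
        # Candidate negative answers come from all other pairs.
        others = [a for j, (_, a) in enumerate(qa_pairs) if j != i]
        for neg in rng.sample(others, min(num_negatives, len(others))):
            examples.append((question, neg, 0))
    return examples

pairs = [("Who wrote it?", "Dai et al."),
         ("What is generated?", "Questions."),
         ("Why inpaint?", "Data scarcity.")]
for ex in build_qam_pairs(pairs):
    print(ex)
```

A binary classifier trained on such examples then provides the contextual-relevance signal that, per the text, is also used to re-rank candidate questions at inference time.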