ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering
In this paper, we propose a new dataset, ReasonVQA, for the Visual Question Answering (VQA) task. Our dataset is automatically integrated with structured encyclopedic knowledge and constructed using a low-cost framework capable of generating complex, multi-hop questions. We evaluated state-of-the-art VQA models on ReasonVQA, and the empirical results demonstrate that ReasonVQA poses significant challenges to these models, highlighting its potential for benchmarking and advancing the field of VQA. Additionally, our dataset can be easily scaled with respect to input images; the current version surpasses the largest existing datasets requiring external knowledge by more than an order of magnitude.
Although our framework is adaptable to any knowledge base and annotated vision dataset, we chose Wikidata [31], one of the most complete structured knowledge bases, as the external knowledge source. We use SPARQL to seamlessly integrate image sources such as Visual Genome (VG) [15], which contains over 108K images along with rich annotations of objects and their relationships within images, e.g., region descriptions, objects, attributes, relationships, region graphs, scene graphs, and question-answer pairs. VG has served as a main resource for the construction of many other VQA datasets, such as GQA [11] and CRIC [6]. In this paper, we leverage the existing images, questions, object descriptions, and scene graphs from VG to build a more robust question generation process. By representing the relationships between an object and other elements, scene graphs allow for a more comprehensive interpretation of the image, avoiding the loss of contextual information from the visual component. The object annotations in VG are canonicalized to WordNet [23] synset names, which can be used to retrieve the corresponding concepts from Wikidata using the Natural Language Toolkit (NLTK) [3] and SPARQL queries.
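To make the synset-to-Wikidata step concrete, the sketch below shows one way such a lookup could be implemented; it is an illustration rather than our exact pipeline. It assumes Wikidata's WordNet 3.1 synset ID property (P8814) and the public SPARQL endpoint. Note that NLTK bundles WordNet 3.0, so some offsets differ from the 3.1 identifiers stored in Wikidata and may require an additional 3.0-to-3.1 mapping.

```python
import requests
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def synset_to_wikidata(synset_name):
    """Map a canonicalized VG synset name (e.g. "dog.n.01") to Wikidata items.

    Caveat: NLTK uses WordNet 3.0 offsets, while Wikidata's P8814 stores
    WordNet 3.1 IDs; synsets whose offsets changed between versions need
    an extra 3.0-to-3.1 mapping before this query will match.
    """
    s = wn.synset(synset_name)
    wn_id = f"{s.offset():08d}-{s.pos()}"  # offset ID, e.g. "02084071-n"
    query = f"""
    SELECT ?item ?itemLabel WHERE {{
      ?item wdt:P8814 "{wn_id}" .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}"""
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "reasonvqa-example/0.1"},
    )
    resp.raise_for_status()
    return [(b["item"]["value"], b["itemLabel"]["value"])
            for b in resp.json()["results"]["bindings"]]

print(synset_to_wikidata("dog.n.01"))
```

The returned Wikidata entities then serve as entry points for the knowledge graph traversal that produces multi-hop questions.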