CEO: Corpus-based Open-Domain Event Ontology Induction

Paper · arXiv 2305.13521 · Published May 22, 2023
Tags: Knowledge Graphs, Domain Specialization, Reading, Summarizing

This paper presents CEO, a novel Corpus-based Event Ontology induction model that relaxes the restriction imposed by pre-defined event ontologies. Without direct supervision, CEO leverages distant supervision from available summary datasets to detect corpus-wise salient events, and exploits external event knowledge to encourage events within a short distance of each other to have close embeddings.

Extracting and understanding real-world events described in text are crucial information extraction tasks that lay the foundation for downstream NLP applications.

To address this limitation, previous work (Shen et al., 2021) proposed the event type induction task, which automatically induces an event ontology from documents. However, that work covers only verbal events and ignores nominal ones. Moreover, it induces only a flat ontology, which cannot capture the rich hierarchical structure of human-defined ontologies. Finally, the induced ontology contains only type IDs, making it hard for users to verify and curate. This paper introduces a new Corpus-based open-domain Event Ontology induction strategy (CEO). As demonstrated in Figure 1, CEO covers both verbal and nominal events and leverages external summarization datasets to better detect salient events. On top of that, CEO can also induce a hierarchical event ontology with the help of the word sense ontology tree defined in WordNet (Fellbaum, 2010).

The first technical contribution is corpus-wise salient event detection with distant supervision from available summary datasets. Following the assumption that human-written summaries are likely to include the events central to the main content (Liu et al., 2018; Jindal et al., 2020), we treat events mentioned in both the summary and the body text as salient, and those mentioned only in the body text as non-salient. To obtain corpus-wise key events, we fine-tune a Longformer-based model (Beltagy et al., 2020) to classify whether identified events are salient given rich context.
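The distant-supervision labeling step described above can be sketched as a simple overlap check between events detected in the body text and events detected in the summary. The function name and the lemma-matching heuristic below are illustrative assumptions, not the paper's exact implementation (the paper compares event mentions, which may involve richer matching than string equality):

```python
def label_salience(body_events, summary_events):
    """Distant supervision for salience: an event from the body text is
    labeled salient iff it also appears in the human-written summary.
    Matching here is case-insensitive string equality (an assumption;
    the paper may use a more sophisticated mention matcher)."""
    summary_set = {e.lower() for e in summary_events}
    return {e: (e.lower() in summary_set) for e in body_events}
```

The resulting labels serve as training targets for the Longformer-based salience classifier, which sees the full document as context rather than just the event mention.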

With all kinds of event-centric information for salient events, we can infer the corpus-level event ontology by feeding the learned informative event embeddings into a wide range of off-the-shelf hierarchical clustering models (discussed in §5.3.1). For individual event mentions, we average over the following embeddings to form the final comprehensive event representation: 1) contextualized embeddings for tokens at positions predicted as the predicate, subject, and object; 2) event sentence embeddings from Sentence-BERT (Reimers and Gurevych, 2019a); 3) predicate sense embeddings composed of definition sentence representations from Sentence-BERT and contextualized token embeddings at predicate positions in example sentences.
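The averaging-and-clustering pipeline above can be sketched with NumPy and SciPy's off-the-shelf hierarchical clustering. The function name, the toy random vectors, and the choice of average linkage are illustrative assumptions; the paper evaluates several clustering models:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def event_embedding(token_embs, sent_emb, sense_emb):
    """Combine the three embedding families for one event mention:
    1) mean of contextualized token embeddings (predicate/subject/object),
    2) Sentence-BERT sentence embedding,
    3) predicate sense embedding.
    All inputs are assumed to share one dimensionality."""
    return np.mean([np.mean(token_embs, axis=0), sent_emb, sense_emb], axis=0)

# Hypothetical toy vectors for three event mentions (dim = 8).
rng = np.random.default_rng(0)
events = np.stack([
    event_embedding([rng.normal(size=8) for _ in range(3)],
                    rng.normal(size=8),
                    rng.normal(size=8))
    for _ in range(3)
])

# Off-the-shelf hierarchical clustering over the event embeddings;
# the dendrogram in Z induces the (toy) ontology tree.
Z = linkage(events, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Cutting the dendrogram at different depths yields ontology nodes at different granularities, which is how a flat clustering tool can still produce a hierarchy.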