Topic Modeling in Embedding Spaces

Paper · arXiv 1907.04907 · Published July 8, 2019

Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To address this, we develop the embedded topic model (ETM), a generative model of documents that marries traditional topic models with word embeddings. In particular, it models each word with a categorical distribution whose natural parameter is the inner product between a word embedding and an embedding of its assigned topic.
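
In symbols, writing ρ_v for the embedding of word v, α_k for the embedding of topic k, and z_dn for the topic assigned to the n-th word of document d, the per-word likelihood described above can be sketched as follows (the notation is ours, following the abstract's description):

```latex
p(w_{dn} = v \mid z_{dn} = k)
  = \mathrm{softmax}\!\left(\rho^{\top} \alpha_k\right)_v
  \;\propto\; \exp\!\left(\rho_v^{\top} \alpha_k\right)
```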

Topic models are statistical tools for discovering the hidden semantic structure in a collection of documents (Blei et al., 2003; Blei, 2012). Topic models and their extensions have been applied to many fields, such as marketing, sociology, political science, and the digital humanities. Boyd-Graber et al. (2017) provide a review.

Most topic models build on latent Dirichlet allocation (LDA) (Blei et al., 2003). LDA is a hierarchical probabilistic model that represents each topic as a distribution over terms and represents each document as a mixture of the topics. When fit to a collection of documents, the topics summarize their contents, and the topic proportions provide a low-dimensional representation of each document. LDA can be fit to large datasets of text by using variational inference and stochastic optimization (Hoffman et al., 2010, 2013).
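
For reference, LDA's generative process in its standard form is sketched below; the Dirichlet hyperparameters α and η are not discussed in this excerpt, and the notation is the conventional one:

```latex
\begin{aligned}
\beta_k &\sim \mathrm{Dirichlet}(\eta)
  && \text{topic } k \text{: a distribution over the vocabulary} \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{document } d \text{: a mixture over topics} \\
z_{dn} \mid \theta_d &\sim \mathrm{Cat}(\theta_d)
  && \text{topic assignment of word } n \text{ in document } d \\
w_{dn} \mid z_{dn}, \beta &\sim \mathrm{Cat}\!\left(\beta_{z_{dn}}\right)
  && \text{observed word}
\end{aligned}
```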

LDA is a powerful model and it is widely used. However, it suffers from a pervasive technical problem—it fails in the face of large vocabularies. Practitioners must severely prune their vocabularies in order to fit good topic models, i.e., those that are both predictive and interpretable. This is typically done by removing the most and least frequent words. On large collections, this pruning may remove important terms and limit the scope of the models. The problem of topic modeling with large vocabularies has yet to be addressed in the research literature.
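
To make the pruning step concrete, the snippet below shows the kind of preprocessing practitioners typically apply before fitting LDA. The library call and thresholds are illustrative of common practice, not a procedure from this paper.

```python
# Illustrative vocabulary pruning: drop the most and least frequent words.
# Thresholds and toy documents are arbitrary examples.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "my cat and my dog are pets",
    "the dog chased the cat",
]

vectorizer = CountVectorizer(
    max_df=0.7,            # drop words appearing in more than 70% of documents (too frequent)
    min_df=2,              # drop words appearing in fewer than 2 documents (too rare)
    stop_words="english",  # also drop common function words
)
counts = vectorizer.fit_transform(docs)

# With these toy documents, "cat" is removed as too frequent and the rare words
# ("sat", "mat", "pets", "chased") as too rare, leaving only "dog".
print(vectorizer.get_feature_names_out())
```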

In parallel with topic modeling came the idea of word embeddings. Research in word embeddings begins with the neural language model of Bengio et al. (2003), published in the same year and journal as Blei et al. (2003). Word embeddings eschew the “one-hot” representation of words— a vocabulary-length vector of zeros with a single one—to learn a distributed representation, one where words with similar meanings are close in a lower-dimensional vector space (Rumelhart and Abrahamson, 1973; Bengio et al., 2006).
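
A toy numerical contrast between the two representations is given below; the embedding values are invented for illustration, whereas real embeddings are learned from data.

```python
# One-hot vectors carry no notion of similarity; dense embeddings place
# related words close together. Numbers here are made up for illustration.
import numpy as np

vocab = ["cat", "dog", "car"]
V = len(vocab)

one_hot = np.eye(V)              # one-hot: every pair of distinct words is equally far apart

embedding = np.array([           # dense 2-dimensional embeddings (values invented)
    [0.9, 0.1],                  # cat
    [0.8, 0.2],                  # dog
    [0.1, 0.9],                  # car
])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(one_hot[0], one_hot[1]))      # 0.0  -- "cat" vs "dog" look unrelated
print(cosine(embedding[0], embedding[1]))  # ~0.99 -- "cat" and "dog" are close
print(cosine(embedding[0], embedding[2]))  # ~0.22 -- "cat" and "car" are far
```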

In this paper, we develop the embedded topic model (ETM), a topic model for word embeddings. The ETM enjoys the good properties of topic models and the good properties of word embeddings. As a topic model, it discovers an interpretable latent semantic structure of the texts; as a word embedding, it provides a low-dimensional representation of the meaning of words. It robustly accommodates large vocabularies and the long tail of language data.
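
The following is a minimal numerical sketch of how a model of this kind turns embeddings into topic-word distributions; the names, shapes, and random values are ours and stand in for quantities that the model would learn jointly.

```python
# Sketch: topic-word distributions formed from inner products of word and
# topic embeddings. Random values stand in for learned parameters.
import numpy as np

rng = np.random.default_rng(0)
V, K, L = 10_000, 50, 300        # vocabulary size, number of topics, embedding dimension

rho = rng.normal(size=(V, L))    # word embeddings: one L-dimensional vector per word
alpha = rng.normal(size=(K, L))  # topic embeddings: one L-dimensional vector per topic

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Each topic becomes a distribution over the full vocabulary via inner products.
beta = softmax(alpha @ rho.T, axis=1)    # shape (K, V), each row sums to 1

# Given a document's topic proportions theta, its word distribution is a mixture.
theta = softmax(rng.normal(size=K))      # shape (K,), a stand-in for inferred proportions
word_dist = theta @ beta                 # shape (V,), sums to 1
print(beta.shape, round(word_dist.sum(), 6))
```

Because the topic-word scores come from embeddings rather than a free V-dimensional parameter per topic, rare words can still receive sensible probabilities through their proximity to frequent words in the embedding space, which is what lets the model cope with large vocabularies and the long tail.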