Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence

Figure: Our Model

Abstract

Topic models extract meaningful groups of words from documents, allowing for a better understanding of the data. However, the resulting word groups are often not coherent enough, and thus harder to interpret. Coherence can be improved by adding more contextual knowledge to the model. Recently, neural topic models have emerged, while BERT-based representations have further pushed the state of the art of neural models in general. We combine pre-trained representations and neural topic models. Pre-trained BERT sentence embeddings indeed support the generation of more meaningful and coherent topics than either standard LDA or existing neural topic models. Results on four datasets show that our approach effectively increases topic coherence.
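The core idea of combining pre-trained representations with a neural topic model can be sketched as feeding the concatenation of a document's bag-of-words vector and its contextual sentence embedding into the topic model's inference network. The sketch below is illustrative only: the dimensions, the single dense layer, and the random vectors standing in for real BERT sentence embeddings and trained weights are all assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions; a real model would use its own vocabulary,
# a pre-trained sentence encoder, and trained network weights.
vocab_size = 2000   # size of the BoW vocabulary (assumed)
embed_dim = 768     # a typical BERT sentence-embedding width
n_topics = 25       # number of topics (assumed)

def encode(bow, ctx_emb, w, b):
    """One dense layer of an inference network: map the concatenated
    [BoW; contextual embedding] input to topic logits, then softmax
    to obtain a document-topic distribution."""
    x = np.concatenate([bow, ctx_emb])
    logits = x @ w + b
    e = np.exp(logits - logits.max())  # stable softmax
    return e / e.sum()

# Random values stand in for trained parameters and real inputs.
w = rng.normal(scale=0.01, size=(vocab_size + embed_dim, n_topics))
b = np.zeros(n_topics)
bow = rng.poisson(0.01, size=vocab_size).astype(float)  # toy word counts
ctx = rng.normal(size=embed_dim)  # stand-in for a BERT sentence embedding

theta = encode(bow, ctx, w, b)
print(theta.shape)  # one probability per topic
```

In a full model this distribution would parameterize a variational decoder that reconstructs the bag of words; the sketch only shows how the contextual embedding enters the pipeline alongside the BoW input.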

Publication
Pre-Print
Federico Bianchi
Postdoctoral Researcher

My research interests include meaning in natural language and programming languages.
