RESEARCH PRODUCT
Statistically Validated Networks for assessing topic quality in LDA models
Alessandro Albano, Andrea Simonetti
subject
Settore SECS-S/06 - Metodi Mat. dell'Economia e d. Scienze Attuariali e Finanz.; Settore SECS-S/01 - Statistica; Topic Model; Topic Coherence; LDA; Statistically Validated Networks
description
Probabilistic topic models have become one of the most widespread machine learning techniques for textual analysis. Within this framework, Latent Dirichlet Allocation (LDA) (Blei et al., 2003) has gained increasing popularity as a text modelling technique. The idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Unfortunately, topic models do not guarantee the interpretability of their outputs. A learned topic may be characterized only by a set of irrelevant or unrelated words, which makes it useless for interpretation. Although many topic-quality metrics have been proposed (Newman et al., 2009; Aletras and Stevenson, 2013; Röder et al., 2015; Nikolenko et al., 2017), the automatic evaluation of topic coherence remains an open research area. The main contributions of this paper are: i) to define a coherence measure (SVN-Coherence) based on a rigorous statistical model that approximates human ratings better than state-of-the-art methods, and ii) to filter out marginal associations between words and facilitate the graphical representation and interpretation of the obtained topics through Statistically Validated Networks (SVN) (Tumminello et al., 2011). Specifically, the method builds a co-occurrence network for each topic, whose nodes are the topic's most probable words. In each network, we set a link between two nodes (words) if their co-occurrences are statistically significant. Under the null hypothesis, the hypergeometric distribution gives the probability mass function of the number of co-occurrences between two words, conditional on their marginal frequencies. It thereby takes into account the heterogeneity of the vocabulary across a collection of texts.
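The link-validation step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`hypergeom_sf`, `validated_links`), the use of documents as the co-occurrence unit, the Bonferroni correction over word pairs, and the toy corpus are all assumptions made for illustration. For two words appearing in `n_a` and `n_b` of `N` documents, the p-value is the hypergeometric probability of observing at least the actual number of co-occurrences.

```python
from itertools import combinations
from math import comb


def hypergeom_sf(k, N, n_a, n_b):
    """P(X >= k) when X is hypergeometric: n_b draws from N items, n_a of them marked."""
    upper = min(n_a, n_b)
    tail = sum(comb(n_a, x) * comb(N - n_a, n_b - x) for x in range(k, upper + 1))
    return tail / comb(N, n_b)


def validated_links(docs, words, alpha=0.05):
    """Return {(word_a, word_b): p_value} for pairs whose co-occurrence is significant.

    docs  : list of sets of tokens (assumed co-occurrence unit: one document)
    words : the topic's most probable words (the network nodes)
    alpha : family-wise level, Bonferroni-corrected over all pairs (an assumption)
    """
    N = len(docs)
    pairs = list(combinations(words, 2))
    threshold = alpha / len(pairs)
    # Marginal document frequency of each word
    freq = {w: sum(w in d for d in docs) for w in words}
    links = {}
    for a, b in pairs:
        n_ab = sum(a in d and b in d for d in docs)  # observed co-occurrences
        p = hypergeom_sf(n_ab, N, freq[a], freq[b])
        if p < threshold:
            links[(a, b)] = p
    return links


# Toy corpus (made up): "bank" and "loan" always appear together, "river" never with them.
toy_docs = [{"bank", "loan"}] * 5 + [{"river"}] * 5
toy_links = validated_links(toy_docs, ["bank", "loan", "river"])
```

In this toy corpus only the pair ("bank", "loan") is validated, because five co-occurrences out of five possible have probability 1/252 under the null, which is below the corrected threshold of 0.05/3.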
Finally, we derive a global measure of coherence for each topic by considering the number of statistically validated links, the strength of the association between word pairs, and the relative relevance of each word in the topic. We claim that these links carry relevant information about the structure of topics: the more connected the network, the more semantically coherent the corresponding topic. The new measure provides a coherence-based ranking that distinguishes between high-quality and low-quality topics. We designed a survey to collect human judgments, which we use as ground truth, to compare our method with state-of-the-art coherence measures. Specifically, we asked 222 PhD students to evaluate the coherence of 32 topics (extracted from the New York Times articles dataset) on a 4-point scale. The results show that the proposed SVN-Coherence substantially outperforms all the state-of-the-art coherence metrics.
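One common way to compare a coherence metric against human ratings is a rank correlation between the metric's scores and the mean human rating per topic. The abstract does not state which agreement statistic was used, so the Spearman correlation below is an assumption, as are the helper names (`average_ranks`, `spearman`) and the made-up numbers, which are not results from the study.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Made-up illustration: mean human rating (4-point scale) for five hypothetical topics,
# and the scores two hypothetical coherence metrics assign to the same topics.
human = [3.6, 1.4, 2.8, 3.1, 1.9]
metric_a = [0.82, 0.10, 0.55, 0.61, 0.20]  # same ordering as the human ratings
metric_b = [0.40, 0.35, 0.10, 0.50, 0.45]  # ordering only loosely related

rho_a = spearman(human, metric_a)
rho_b = spearman(human, metric_b)
```

Under this setup, the metric whose ranking is closer to the human one obtains the higher correlation; here `metric_a` reproduces the human ordering exactly, while `metric_b` does not.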
year | journal | country | edition | language |
---|---|---|---|---|
2022-01-01 |