AUTHOR
Andrea Simonetti
Statistically Validated Networks for assessing topic quality in LDA models
Probabilistic topic models have become one of the most widespread machine learning techniques for textual analysis. In this framework, Latent Dirichlet Allocation (LDA) (Blei et al., 2003) has gained increasing popularity as a text modelling technique. The idea is that documents are represented as random mixtures over latent topics, where a distribution over words characterizes each topic. Unfortunately, topic models do not guarantee the interpretability of their outputs. The topics learned by the model may be characterized only by a set of irrelevant or unrelated words, making them useless for interpretation. Although many topic-quality metrics were proposed (Newman et al., 2009; Alet…
MEASURING TOPIC COHERENCE THROUGH STATISTICALLY VALIDATED NETWORKS
Topic models arise from the need to understand and explore large collections of text documents and to predict their underlying structure. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) has quickly become one of the most popular text modelling techniques. The idea is that documents are represented as random mixtures over latent topics, where a distribution over words characterizes each topic. Unfortunately, topic models give no guarantee of the interpretability of their outputs. The topics learned from texts may be characterized by a set of irrelevant or unrelated words. Therefore, topic models require validation of the coherence of the estimated topics. However, the automatic evaluation …
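The generative idea behind LDA described above (a document as a random mixture over topics, each topic a distribution over words) can be illustrated with a minimal sketch. It is not taken from the paper: the function names, the toy vocabulary, and the Dirichlet parameters are all hypothetical, and only the Python standard library is used.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, doc_alpha, n_words, rng):
    """Generate one document under the LDA generative story.

    topic_word[k] is the word distribution of topic k (hypothetical toy input).
    """
    # Per-document topic mixture (theta in the usual LDA notation)
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position, then a word from that topic
        k = rng.choices(range(len(theta)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topic_word[k])[0])
    return theta, words

# Toy example: two topics over a four-word vocabulary (illustrative values)
vocab = ["shark", "ray", "tweet", "retweet"]
topic_word = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
theta, words = generate_document(topic_word, vocab, [0.5, 0.5], 8, random.Random(0))
```

Fitting an LDA model reverses this process: given only the words, it estimates the topic mixtures and the topic-word distributions, which is why the resulting topics are not guaranteed to be interpretable.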
Ranking coherence in topic models using statistically validated networks
Probabilistic topic models have become one of the most widespread machine learning techniques in textual analysis. Topic discovery is an unsupervised process that does not guarantee the interpretability of its output. Hence, the automatic evaluation of topic coherence has attracted the interest of many researchers over the last decade, and it remains an open research area. This article offers a new quality evaluation method based on statistically validated networks (SVNs). The proposed probabilistic approach consists of representing each topic as a weighted network of its most probable words. The presence of a link between each pair of words is assessed by statistically validating their co-oc…
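To make the link-validation idea concrete, here is a minimal sketch of how a pair of top words might be tested. The abstract is cut off before it names the test, so the details are assumptions: the sketch uses a hypergeometric null model with a Bonferroni correction, a common choice in the SVN literature, and the function names, the `alpha` level, and the use of raw co-occurrence counts as link weights are all hypothetical.

```python
from itertools import combinations
from math import comb

def cooccurrence_pvalue(n_docs, n_a, n_b, n_ab):
    """P(X >= n_ab) under a hypergeometric null.

    n_docs: number of documents; n_a, n_b: documents containing each word;
    n_ab: documents containing both words.
    """
    upper = min(n_a, n_b)
    total = comb(n_docs, n_b)
    tail = sum(comb(n_a, x) * comb(n_docs - n_a, n_b - x)
               for x in range(n_ab, upper + 1))
    return tail / total

def validated_links(docs, top_words, alpha=0.01):
    """Return {(word_a, word_b): co-occurrence count} for validated pairs."""
    doc_sets = [set(d) for d in docs]
    n_docs = len(doc_sets)
    counts = {w: sum(w in d for d in doc_sets) for w in top_words}
    pairs = list(combinations(top_words, 2))
    # Bonferroni correction over all tested pairs (an assumed choice)
    threshold = alpha / len(pairs)
    links = {}
    for a, b in pairs:
        n_ab = sum(a in d and b in d for d in doc_sets)
        if cooccurrence_pvalue(n_docs, counts[a], counts[b], n_ab) < threshold:
            links[(a, b)] = n_ab
    return links
```

Under this reading, a topic whose top words share many validated links would count as more coherent than one whose words co-occur no more often than chance allows.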
Marked Hawkes processes for Twitter data
In this paper, we propose to model retweet event sequences using a marked Hawkes process, a self-exciting point process in which the occurrence of previous events increases the probability of further events. The aim is to analyse Twitter data by combining temporal point process theory and textual analysis. Since each retweet event carries a set of properties, we mark the process with different characteristics drawn from the textual analysis, finding that the tone of the Twitter user's description is a good predictor of the number of retweets in a single cascade.
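The self-exciting behaviour can be illustrated with the conditional intensity of a marked Hawkes process. This is a sketch under stated assumptions, not the model fitted in the paper: it assumes an exponential decay kernel and a mark that scales each event's contribution multiplicatively, and the parameter names `mu`, `alpha`, and `beta` and their default values are hypothetical.

```python
from math import exp

def hawkes_intensity(t, events, mu=0.1, alpha=0.5, beta=1.0):
    """Conditional intensity at time t given past (time, mark) events.

    lambda(t) = mu + sum over t_i < t of alpha * mark_i * exp(-beta * (t - t_i))
    Assumed form: exponential kernel, multiplicative marks.
    """
    excitation = sum(alpha * mark * exp(-beta * (t - t_i))
                     for t_i, mark in events if t_i < t)
    return mu + excitation

# Illustrative cascade: the second retweet carries a larger mark
cascade = [(0.0, 1.0), (0.5, 2.0)]
baseline = hawkes_intensity(1.0, [])
excited = hawkes_intensity(1.0, cascade)
```

With no past events the intensity equals the baseline rate `mu`; each earlier retweet raises it by an amount that grows with the event's mark and fades as time passes.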
Using Local Ecological Knowledge of Fishers to Reconstruct Abundance Trends of Elasmobranch Populations in the Strait of Sicily
Fishers' “local ecological knowledge” (LEK) can be used to reconstruct long-term trends of species that are at very low biomass due to overfishing. In this study, we used the historical memories of Sicilian fishers to understand their perception of changes in the abundance of cartilaginous fish in the Strait of Sicily over recent decades. We conducted interviews with 27 retired fishers from Mazara del Vallo harbor (SW Sicily) who had worked in demersal fisheries, using a pre-defined questionnaire with a series of open and fixed questions related to the abundance of sharks and rays. The questionnaire included specific questions about the trends they perceived in catch or by-catch of cartilaginous fish abun…
Statistically Validated Networks for evaluating coherence in topic models
Probabilistic topic models have become one of the most widespread machine learning techniques for textual analysis. In this framework, Latent Dirichlet Allocation (LDA) has gained increasing popularity as a text modelling technique. The idea is that documents are represented as random mixtures over latent topics, where a distribution over words characterizes each topic. Unfortunately, topic models do not guarantee the interpretability of their outputs. The topics learned by the model may be characterized by a set of irrelevant or unrelated words, making them useless for interpretation. In the framework of topic quality evaluation, the pairwise semantic cohesion among the top-N most pr…