Element weighted Kemeny distance for ranking data
Preference data are a particular type of ranking data that arise when several individuals express their preferences over a finite set of items. Within this framework, the main issue concerns the aggregation of the preferences to identify a compromise or a “consensus”, defined as the closest ranking (i.e. with the minimum distance or maximum correlation) to the whole set of preferences. Many approaches have been proposed, but they are not sensitive to the importance of items, whereas changing the rank of a highly relevant element should incur a higher penalty than changing the rank of a negligible one. The goal of this paper is to investigate the consensus between rankings taking into accou…
Supervised vs Unsupervised Latent Dirichlet Allocation: topic detection in lyrics.
Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is an example of a topic model and is used to assign the text in a document to a particular topic. It builds a fixed number of topics starting from words in each document modeled according to a Dirichlet distribution. In this work we apply LDA to a set of songs from four famous Italian songwriters and split them into topics. This work studies the use of themes in lyrics using statistical analysis to detect topics. The aim of the work is to underline the main limits of the standard unsupervised LDA and to propose a supervised…
Statistically Validated Networks for assessing topic quality in LDA models
Probabilistic topic models have become one of the most widespread machine learning techniques for textual analysis. In this framework, Latent Dirichlet Allocation (LDA) (Blei et al., 2003) has gained increasing popularity as a text modelling technique. The idea is that documents are represented as random mixtures over latent topics, where a distribution over words characterizes each topic. Unfortunately, topic models do not guarantee the interpretability of their outputs. The topics learned from the model may be characterized only by a set of irrelevant or unrelated words, making them useless for interpretation. Although many topic-quality metrics were proposed (Newman et al., 2009; Alet…
Exploring topics in LDA models through Statistically Validated Networks: directed and undirected approaches
Probabilistic topic models are machine learning tools for processing and understanding large text document collections. Among the different models in the literature, Latent Dirichlet Allocation (LDA) has turned out to be the benchmark of the topic modelling community. The key idea is to represent text documents as random mixtures over latent semantic structures called topics. Each topic follows a multinomial distribution over the vocabulary words. In order to understand the result of a topic model, researchers usually select the top-n words (the essential words) with the highest probability given a topic and look for meaningful and interpretable semantic themes. This work proposes a new method …
Measuring topic coherence through Statistically Validated Networks
Topic models arise from the need to understand and explore large text document collections and to predict their underlying structure. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) has quickly become one of the most popular text modelling techniques. The idea is that documents are represented as random mixtures over latent topics, where a distribution over words characterizes each topic. Unfortunately, topic models give no guarantee of the interpretability of their outputs. The topics learned from texts may be characterized by a set of irrelevant or unrelated words. Therefore, topic models require validation of the coherence of estimated topics. However, the automatic evaluation …
A family of distances for preference–approvals
Ranking coherence in topic models using statistically validated networks
Probabilistic topic models have become one of the most widespread machine learning techniques in textual analysis. Topic discovery is an unsupervised process that does not guarantee the interpretability of its output. Hence, the automatic evaluation of topic coherence has attracted the interest of many researchers over the last decade, and it remains an open research area. This article offers a new quality evaluation method based on statistically validated networks (SVNs). The proposed probabilistic approach consists of representing each topic as a weighted network of its most probable words. The presence of a link between each pair of words is assessed by statistically validating their co-oc…
Distance-based and ranking methods for preference rankings, preference-approvals and textual analysis
A comparison of ensemble algorithms for item-weighted Label Ranking
Label Ranking (LR) is a non-standard supervised classification method with the aim of ranking a finite collection of labels according to a set of predictor variables. Traditional LR models assume indifference among alternatives. However, misassigning the ranking position of a highly relevant label is frequently regarded as more severe than failing to predict a trivial label. Moreover, switching two similar alternatives should be considered less severe than switching two different ones. Therefore, efficient LR classifiers should be able to take into account the similarities and individual weights of the items to be ranked. The contribution of this paper is to formulate and compare flexible i…
University careers of students at state and non-state universities in Italy
In recent years, competition among universities to “grab” students has increased, compounded by ever-growing promotion and recruitment activity by non-state universities (online and otherwise). Non-state universities, also known as “free universities” (“libere Università”), are promoted both by private-law bodies and by public bodies (regions, provinces, municipalities). They are legally recognised by the Ministero dell'Istruzione dell'Università e della Ricerca (the Italian Ministry of Education, University and Research) and are authorised to award academic degrees, within the university system, with the same legal value as those awarded by state universities. The literature has …
A two-stage LDA algorithm for ranking induced topic readability
Probabilistic topic models, such as LDA, are standard text analysis algorithms that provide predictive and latent topic representation for a corpus. However, due to the unsupervised training process, it is difficult to verify the assumption that the latent space discovered by these models is generally meaningful and valuable. This paper introduces a two-stage LDA algorithm to estimate latent topics in text documents and use readability scores to link the identified topics to a linguistically motivated latent structure. We define a new interpretative tool called induced topic readability, which is used to rank topics from the one with the most complex linguistic structure to the one with the…
Analysis of university careers in basic science degrees in Italy over the last decade
An analysis of Italian university science students’ careers in the last decade · This paper deals with the study of Italian university science careers by analyzing administrative longitudinal data from the Italian Ministry of Education. Three freshmen cohorts enrolled in three-year degree courses in 2011/12, 2014/15, and 2016/17 are analyzed at three time points: at enrolment, in order to assess their choices with respect to their individual characteristics; at the beginning of the second year, in order to assess who moves to another course and who drops out; at the fourth year, in order to determine the “best” students’ profiles. The students’ variables involved are gender, type of school…
Boosting for ranking data: an extension to item weighting
Decision trees are a particularly widespread predictive machine learning technique, used to predict discrete variables (classification) or continuous ones (regression). The algorithms underlying these techniques are intuitive and interpretable, but also unstable. Indeed, to make classification more reliable, the output of several trees is usually combined. Several approaches have been proposed in the literature for classifying ranking data with decision trees, but none of them takes into account either the importance or the similarity of the individual elements of each ranking. The aim of this article is to propose a weighted extension of the method …
Ensemble methods for item-weighted label ranking: a comparison
Label Ranking (LR), an emerging non-standard supervised classification problem, aims at training preference models that order a finite set of labels based on a set of predictor features. Traditional LR models regard all labels as equally important. However, in many cases, failing to predict the ranking position of a highly relevant label can be considered more severe than failing to predict a trivial one. Moreover, an efficient LR classifier should be able to take into account the similarity between the items to be ranked. Indeed, swapping two similar elements should be less penalized than swapping two dissimilar ones. The contribution of the present paper is to formulate more flexible item…
Impact of the COVID-19 pandemic on music: a method for clustering sentiments
The outbreak of coronavirus disease 2019 (COVID-19) was highly stressful for people. In general, fear and anxiety about a disease can be overwhelming and cause strong emotions in adults and children. One way to cope with this stress is to listen to music. The aim of this work is to understand whether the music listened to during the lockdown reflects the emotions generated by the pandemic in each of us. The primary goal of this work is therefore to build two indices for measuring the anger and joy levels of the top streamed songs by Italian Spotify users (during the SARS-CoV-2 pandemic), and to study their evolution over time. A Hierarchical Cluster Analysis has been applied in order to identify groups o…
Towards the definition of distance measures in the preference-approval structures
The task of combining preference rankings and approval voting is a relevant issue in social choice theory. Preference-approval voting (PAV) analyses the preferences of a group of individuals over a set of items. The main difference from the classical approaches for preference data is that, in addition to the ranking of candidates, a further distinction is introduced: candidates are divided into “acceptable” and “unacceptable”, or into a “good set” and a “bad set” (a way to express approval or disapproval). This work introduces the definition of a new measure to quantify disagreement between preference-approval profiles. For each pair of alternatives, we consider the two possible disag…
Statistically Validated Networks for evaluating coherence in topic models
Probabilistic topic models have become one of the most widespread machine learning techniques for textual analysis. In this framework, Latent Dirichlet Allocation (LDA) has gained increasing popularity as a text modelling technique. The idea is that documents are represented as random mixtures over latent topics, where a distribution over words characterizes each topic. Unfortunately, topic models do not guarantee the interpretability of their outputs. The topics learned from the model may be characterized by a set of irrelevant or unrelated words, making them useless for interpretation. In the framework of topic quality evaluation, the pairwise semantic cohesion among the top-N most pr…
A weighted distance-based approach with boosted decision trees for label ranking
Label Ranking (LR) is an emerging non-standard supervised classification problem with practical applications in different research fields. The Label Ranking task aims at building preference models that learn to order a finite set of labels based on a set of predictor features. One of the most successful approaches to tackling the LR problem consists of using decision tree ensemble models, such as bagging, random forest, and boosting. However, these approaches, being based on the classical unweighted rank correlation measures, are not sensitive to label importance. Nevertheless, in many settings, failing to predict the ranking position of a highly relevant label should be considered more seriou…
From bachelor's to master's degree: the “brain drain” from Southern Italy continues
Looking at the geography of southern students' mobility in the transition from bachelor's to master's degree over the academic years 2014/15 to 2017/18, Massimo Attanasio, Marco Enea and Alessandro Albano find that the drain, already evident in the transition from secondary school to university, continues afterwards: universities in the Mezzogiorno (Southern Italy) keep losing potential enrolments to universities in the Centre-North.