RESEARCH PRODUCT
A two-stage LDA algorithm for ranking induced topic readability
Mariangela Sciandra, Alessandro Albano
subject
readability, Latent Dirichlet Allocation, topic model, coherence
description
Probabilistic topic models, such as LDA, are standard text analysis algorithms that provide a predictive, latent topic representation of a corpus. However, because training is unsupervised, it is difficult to verify the assumption that the latent space discovered by these models is generally meaningful and valuable. This paper introduces a two-stage LDA algorithm that estimates latent topics in text documents and uses readability scores to link the identified topics to a linguistically motivated latent structure. We define a new interpretative tool called induced topic readability, which is used to readily rank topics from the one with the most complex linguistic structure to the one with the lowest semantic content. The usefulness of our method is shown with an application to real data, using articles from the New York Times.
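The abstract does not say how readability scores are attached to topics, so the following Python sketch is only one plausible reading and not the authors' algorithm. It assumes that a document-topic weight matrix `doc_topic` is already available from any LDA fit, that document readability is measured with the Automated Readability Index (an assumed choice of score), and that a topic's score is the weighted average of document scores. All function names are hypothetical.

```python
import re


def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43.

    Higher values indicate text that is harder to read.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


def induced_topic_scores(doc_topic, doc_scores):
    """Average document scores per topic, weighted by document-topic weights.

    ASSUMPTION: this weighted average stands in for the paper's second stage,
    whose actual definition is not given in the abstract.
    """
    n_topics = len(doc_topic[0])
    scores = []
    for k in range(n_topics):
        total_weight = sum(row[k] for row in doc_topic)
        if total_weight <= 0:
            raise ValueError(f"topic {k} has no weight in any document")
        weighted_sum = sum(row[k] * s for row, s in zip(doc_topic, doc_scores))
        scores.append(weighted_sum / total_weight)
    return scores


def rank_topics(scores):
    """Return topic indices ordered from highest (hardest) to lowest score."""
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)


# Hypothetical toy input: two documents, two topics.
docs = [
    "Monetary policy committees recalibrated expectations considerably.",
    "The cat sat. The dog ran. We ate.",
]
doc_topic = [[0.9, 0.1], [0.2, 0.8]]  # assumed output of a fitted LDA model
doc_scores = [automated_readability_index(d) for d in docs]
ranking = rank_topics(induced_topic_scores(doc_topic, doc_scores))
```

Here `ranking` lists topic indices from the topic associated with the hardest-to-read documents to the one associated with the easiest, mirroring the ordering described in the abstract under the stated assumptions.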
year | journal | country | edition | language |
---|---|---|---|---|
2022-07-01 | | | | |