A fast and recursive algorithm for clustering large datasets with k-medians

Hervé Cardot, Jean-Marie Monnez, Peggy Cénac

subject

Statistics and Probability; Clustering high-dimensional data; FOS: Computer and information sciences; Mathematical optimization; high dimensional data; Machine Learning (stat.ML); 02 engineering and technology; Stochastic approximation; 01 natural sciences; Statistics - Computation; 010104 statistics & probability; k-medoids; Statistics - Machine Learning; [MATH.MATH-ST] Mathematics [math]/Statistics [math.ST]; 0202 electrical engineering, electronic engineering, information engineering; Computational statistics; recursive estimators; Almost surely; 0101 mathematics; Cluster analysis; Computation (stat.CO); Mathematics; averaging; Robbins-Monro; Applied Mathematics; Estimator; [STAT.TH] Statistics [stat]/Statistics Theory [stat.TH]; stochastic gradient; Medoid; Computational Mathematics; Computational Theory and Mathematics; online clustering; 020201 artificial intelligence & image processing; partitioning around medoids; Algorithm

description

Clustering large samples of high-dimensional data with fast algorithms is an important challenge in computational statistics. Borrowing ideas from MacQueen (1967), who introduced a sequential version of the $k$-means algorithm, a new class of recursive stochastic gradient algorithms designed for the $k$-medians loss criterion is proposed. Because they are recursive, these algorithms are very fast and are well suited to large samples of data that are allowed to arrive sequentially. It is proved that the stochastic gradient algorithm converges almost surely to the set of stationary points of the underlying loss criterion. Particular attention is paid to the averaged versions, which are known to perform better, and a data-driven procedure that automatically selects the value of the descent step is proposed. The performance of the averaged sequential estimator is compared in a simulation study, in terms of both computation speed and estimation accuracy, with more classical partitioning techniques such as $k$-means, trimmed $k$-means and PAM (partitioning around medoids). Finally, this new online clustering technique is illustrated by determining television audience profiles from a sample of more than 5000 individual television audiences measured every minute over a period of 24 hours.
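The recursive scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name online_k_medians, the parameters c_gamma and alpha, and the step size c_gamma / n_r^alpha (with n_r the number of observations assigned so far to center r) are assumptions chosen for illustration, and the data-driven step selection mentioned above is not reproduced. Each incoming observation is assigned to its nearest center, that center moves a bounded step along the unit vector pointing to the observation (a stochastic gradient step for the Euclidean $k$-medians loss), and a running mean of the iterates gives the averaged estimator.

```python
import math
import random


def online_k_medians(stream, init_centers, c_gamma=1.0, alpha=0.75):
    """One pass of a stochastic gradient k-medians sketch with averaging.

    c_gamma and alpha are hypothetical tuning parameters of the assumed
    step size c_gamma / n_r**alpha; they are not taken from the paper.
    """
    centers = [list(c) for c in init_centers]    # current iterates
    averaged = [list(c) for c in init_centers]   # running means of the iterates
    counts = [0] * len(centers)                  # observations seen per center

    for z in stream:
        # Assign the new observation to its nearest current center.
        dists = [math.dist(z, c) for c in centers]
        r = min(range(len(centers)), key=dists.__getitem__)
        counts[r] += 1
        if dists[r] > 0.0:
            # The gradient of the Euclidean norm is a unit vector, so the
            # center moves by exactly `step`, however far away z lies.
            step = c_gamma / counts[r] ** alpha
            centers[r] = [c + step * (zi - c) / dists[r]
                          for c, zi in zip(centers[r], z)]
        # Update the running mean of the iterates of center r only.
        averaged[r] = [a + (c - a) / counts[r]
                       for a, c in zip(averaged[r], centers[r])]
    return centers, averaged
```

Calling online_k_medians(data, initial_centers) on any iterable of equal-length numeric tuples returns both the last iterates and their averages. Because every update has bounded length, a single outlying observation cannot drag a center arbitrarily far, which is the robustness argument for medians over means; the single pass over the data with constant memory per center is what makes the approach suitable for sequentially arriving samples.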

year: 2012-01-01

DOI: 10.1016/j.csda.2011.11.019
https://hal.science/hal-00644683/file/ARTICLE.SEQUENTIAL.K-MEDIANSk.V3.2011.pdf