RESEARCH PRODUCT

Toolbox for Distance Estimation and Cluster Validation on Data With Missing Values

Marko Niemela, Sami Ayramo, Tommi Karkkainen

subject

modeling; General Computer Science; distance estimation; 020209 energy; General Engineering; quality; 02 engineering and technology; TK1-9971; missing values; clusters; machine learning; data; validation; algorithms; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; General Materials Science; Electrical engineering. Electronics. Nuclear engineering; cluster validation; data processing; clustering

description

Missing data are unavoidable in real-world applications of unsupervised machine learning, and handling them suboptimally may decrease the quality of data-driven models. Imputation is a common remedy for missing values, but methods that directly estimate expected distances have also emerged. Because the treatment of missing values is rarely considered in clustering-related tasks, and distance metrics have a central role in both clustering and cluster validation, we developed a new toolbox that provides a wide range of algorithms for data preprocessing, distance estimation, clustering, and cluster validation in the presence of missing values. All of these are core elements of any comprehensive cluster analysis methodology. We describe the methodological background of the implemented algorithms and present multiple illustrations of their use. The experiments include validating distance estimation methods against selected reference methods and demonstrating the performance of internal cluster validation indices. The experimental results demonstrate the general usability of the toolbox for the straightforward realization of alternative data processing pipelines. Source code, data sets, results, and example macros are available on GitHub at https://github.com/markoniem/nanclustering_toolbox.

Peer reviewed.
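To make the idea of computing distances without imputation concrete, the following is a minimal Python sketch of the widely used partial-distance (available-data) strategy: the squared differences are summed over the coordinates observed in both vectors and rescaled by the fraction of usable coordinates. It is a simple baseline shown for illustration only, not the toolbox's own code or its expected-distance estimators; the function name `partial_euclidean` is hypothetical.

```python
import math


def partial_euclidean(x, y):
    """Euclidean distance over coordinates observed in both vectors.

    Missing values are encoded as NaN. The squared sum is rescaled by
    n / n_observed so that results stay comparable across pairs with
    different numbers of missing coordinates.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")

    # Keep only coordinate pairs where neither value is missing.
    pairs = [(a, b) for a, b in zip(x, y)
             if not (math.isnan(a) or math.isnan(b))]
    if not pairs:
        # No shared observed coordinate: the distance is undefined.
        return math.nan

    squared_sum = sum((a - b) ** 2 for a, b in pairs)
    return math.sqrt(len(x) / len(pairs) * squared_sum)


# Complete vectors reduce to the ordinary Euclidean distance.
d_full = partial_euclidean([0.0, 0.0], [3.0, 4.0])

# Two of four coordinates are usable, so the squared sum is doubled.
d_partial = partial_euclidean([0.0, 0.0, math.nan, 0.0],
                              [3.0, 4.0, 0.0, math.nan])
```

With complete vectors the function matches the ordinary Euclidean distance; when no coordinate is shared it returns NaN rather than guessing a value.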

10.1109/access.2021.3136435
https://ieeexplore.ieee.org/document/9656159/