RESEARCH PRODUCT

Stability-Based Model Selection for High Throughput Genomic Data: An Algorithmic Paradigm

Raffaele Giancarlo, Filippo Utro

subject

Class (computer programming); Settore INF/01 - Informatica; business.industry; Computer science; Heuristic (computer science); Model selection; Stability (learning theory); Machine learning; computer.software_genre; Identification (information); Algorithm design; Artificial intelligence; Cluster analysis; business; Algorithms and Data Structures; Throughput (business); computer

description

Clustering is one of the most well-known activities in scientific investigation and the object of research in many disciplines, ranging from Statistics to Computer Science. In this beautiful area, one of the most difficult challenges is the model selection problem, i.e., the identification of the correct number of clusters in a dataset. In the last decade, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and best performing in terms of prediction, but the slowest in terms of time. Unfortunately, model selection, a fascinating and classic area of statistics with important practical applications, has received very little attention in terms of algorithmic design and engineering. In this paper, in order to partially fill this gap, we highlight: (A) the first general algorithmic paradigm for stability-based methods for model selection; (B) a novel algorithmic paradigm for the class of stability-based methods for cluster validity, i.e., methods assessing how statistically significant a given clustering solution is; (C) a general algorithmic paradigm that describes heuristic and very effective speed-ups known in the literature for stability-based model selection methods.

https://doi.org/10.1007/978-3-642-33757-4_20