Search results for "data clustering"
Showing 10 of 27 documents
Efficient and Accurate OTU Clustering with GPU-Based Sequence Alignment and Dynamic Dendrogram Cutting.
2015
De novo clustering is a popular technique to perform taxonomic profiling of a microbial community by grouping 16S rRNA amplicon reads into operational taxonomic units (OTUs). In this work, we introduce a new dendrogram-based OTU clustering pipeline called CRiSPy. The key idea used in CRiSPy to improve clustering accuracy is the application of an anomaly detection technique to obtain a dynamic distance cutoff instead of using the de facto value of 97 percent sequence similarity as in most existing OTU clustering pipelines. This technique works by detecting an abrupt change in the merging heights of a dendrogram. To produce the output dendrograms, CRiSPy employs the OTU hierarchical clusterin…
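The dynamic-cutoff idea — detecting an abrupt change in a dendrogram's merging heights rather than applying a fixed similarity threshold — can be sketched with SciPy's hierarchical-clustering tools. This is an illustrative toy, not CRiSPy itself: the data, the average-linkage choice, and the largest-jump rule below are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 1-D data: two well-separated groups of points.
rng = np.random.default_rng(0)
points = np.concatenate([rng.normal(0.0, 0.05, 10),
                         rng.normal(1.0, 0.05, 10)])
Z = linkage(points.reshape(-1, 1), method="average")

# Merge heights grow slowly within a group, then jump abruptly when
# distinct groups are forced together.  Cut midway across the largest
# jump instead of at a fixed distance threshold.
heights = Z[:, 2]
jump = int(np.argmax(np.diff(heights)))
cutoff = (heights[jump] + heights[jump + 1]) / 2
labels = fcluster(Z, t=cutoff, criterion="distance")
```

On this toy data the cut recovers the two planted groups without any hand-chosen threshold, which is the appeal of a data-driven cutoff over a fixed one such as 97 percent similarity.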
Clustering categorical data: A stability analysis framework
2011
Clustering to identify inherent structure is an important first step in data exploration. The k-means algorithm is a popular choice, but k-means is not generally appropriate for categorical data. A specific extension of k-means to categorical data is the k-modes algorithm. Both of these partition clustering methods are sensitive to the initialization of prototypes, which makes it difficult to select the best solution for a given problem. In addition, selecting the number of clusters can be an issue. Further, the k-modes method is especially prone to instability when presented with ‘noisy’ data, since the calculation of the mode lacks the smoothing effect inherent in the calculation …
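A minimal k-modes iteration — Hamming distance to prototypes, per-column mode update — can be sketched as follows. The integer encoding of categories, the fixed initialization, and the toy data are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def k_modes(X, k, init, n_iter=10):
    """Toy k-modes: categorical values encoded as small integers."""
    modes = X[list(init)].copy()
    for _ in range(n_iter):
        # Assign each row to the prototype with the fewest mismatches
        # (Hamming distance).
        dist = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Update each prototype to the per-column mode of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                modes[j] = [np.bincount(col).argmax() for col in members.T]
    return labels, modes

# Two obvious categorical groups.
X = np.array([[0, 0, 1], [0, 0, 1], [0, 1, 1],
              [2, 2, 0], [2, 2, 0], [2, 1, 0]])
labels, modes = k_modes(X, k=2, init=[0, 3])
```

The `init` argument makes the abstract's point concrete: the result depends on which rows seed the prototypes, and different choices can yield different partitions.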
Distance-constrained data clustering by combined k-means algorithms and opinion dynamics filters
2014
Data clustering algorithms represent mechanisms for partitioning huge arrays of multidimensional data into groups with small in-group and large out-group distances. Most of the existing algorithms fail when a lower bound for the distance among cluster centroids is specified, while this type of constraint can help in obtaining a better clustering. Traditional approaches require that the desired number of clusters be specified a priori, which demands either a subjective decision or global meta-information that is not easily obtainable. In this paper, an extension of the standard data clustering problem is addressed, including additional constraints on the cluster centroid di…
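One way to read the centroid-distance constraint is: run k-means-style updates, then merge any pair of centroids closer than the lower bound. The sketch below is a rough illustration of that idea under assumed details (midpoint merging, a fixed iteration count), not the paper's combined k-means/opinion-dynamics algorithm.

```python
import numpy as np

def constrained_centroids(X, k, d_min, n_iter=20, seed=0):
    """k-means-style updates, then merge centroids closer than d_min."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Standard assignment and mean-update steps.
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0)
                              for j in np.unique(labels)])
        # Enforce the lower bound: repeatedly merge the closest pair
        # of centroids that violates it, replacing them by a midpoint.
        while len(centroids) > 1:
            pd = np.linalg.norm(centroids[:, None] - centroids[None], axis=2)
            np.fill_diagonal(pd, np.inf)
            i, j = np.unravel_index(pd.argmin(), pd.shape)
            if pd[i, j] >= d_min:
                break
            merged = (centroids[i] + centroids[j]) / 2
            centroids = np.vstack([np.delete(centroids, [i, j], axis=0),
                                   merged])
    return centroids

# Two tight 1-D groups far apart; start with too many centroids.
X = np.array([[0.0], [0.1], [0.2], [0.1], [0.0],
              [5.0], [5.1], [5.2], [5.1], [5.0]])
cents = constrained_centroids(X, k=4, d_min=1.0)
```

Note how the constraint doubles as a model-selection device: the surplus centroids inside each tight group collapse, so the number of clusters emerges from `d_min` rather than being fixed a priori.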
Scalable Clustering by Iterative Partitioning and Point Attractor Representation
2016
Clustering very large datasets while preserving cluster quality remains a challenging data-mining task. In this paper, we propose an effective scalable clustering algorithm for large datasets that builds upon the concept of synchronization. The proposed algorithm, CIPA (Clustering by Iterative Partitioning and Point Attractor Representations), is capable of handling very large datasets by iteratively partitioning them into thousands of subsets and clustering each subset separately. Using dynamic clustering by synchronization, each subset is then represented by a set of point attractors and outliers. Finally, CIPA identifies the…
A Novel Clustering Algorithm based on a Non-parametric "Anti-Bayesian" Paradigm
2015
The problem of clustering, or unsupervised classification, has been solved by a myriad of techniques, all of which depend, either directly or implicitly, on the Bayesian principle of optimal classification. To be more specific, within a Bayesian paradigm, if one is to compare the testing sample with only a single point in the feature space from each class, the optimal Bayesian strategy would be to achieve this based on the distance from the corresponding means or central points in the respective distributions. When this principle is applied in clustering, one would assign an unassigned sample into the cluster whose mean is the closest, and this can be done in either a bottom-up or a top-dow…
Comparison of Internal Clustering Validation Indices for Prototype-Based Clustering
2017
Clustering is an unsupervised machine learning and pattern recognition method. In general, in addition to revealing hidden groups of similar observations (clusters), their number needs to be determined. Internal clustering validation indices estimate this number without any external information. The purpose of this article is to evaluate, empirically, characteristics of a representative set of internal clustering validation indices with many datasets. The prototype-based clustering framework includes multiple, classical and robust, statistical estimates of cluster location, so that the overall setting of the paper is novel. General observations on the quality of validation indices and on t…
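As an example of how an internal validation index estimates the number of clusters, the sketch below scores k-means partitions over a range of candidate counts with the Calinski-Harabasz ratio, one classical internal index. The index choice, toy data, and candidate range are assumptions for illustration, not the article's evaluation protocol.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Three tight, well-separated 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, (30, 2))
               for c in ((0, 0), (4, 0), (2, 4))])

def calinski_harabasz(X, labels):
    """Between- vs within-cluster dispersion ratio (an internal index)."""
    n, mean = len(X), X.mean(axis=0)
    groups = [X[labels == g] for g in np.unique(labels)]
    b = sum(len(g) * np.sum((g.mean(axis=0) - mean) ** 2) for g in groups)
    w = sum(np.sum((g - g.mean(axis=0)) ** 2) for g in groups)
    return (b / (len(groups) - 1)) / (w / (n - len(groups)))

# Score each candidate cluster count and keep the maximizer.
scores = {}
for k in range(2, 7):
    _, labels = kmeans2(X, k, seed=0, minit="++")
    scores[k] = calinski_harabasz(X, labels)
best_k = max(scores, key=scores.get)
```

On clean data like this, most internal indices agree on the planted number of groups; the article's empirical question is how such indices behave across many datasets and cluster-location estimates.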
PGAC: A Parallel Genetic Algorithm for Data Clustering
2005
Cluster analysis is a valuable tool for exploratory pattern analysis, especially when very little a priori knowledge about the data is available. Distributed systems, based on high-speed intranet connections, provide new tools for designing new and faster clustering algorithms. Here, a parallel genetic algorithm for clustering called PGAC is described. The parallelization strategy is the island-model paradigm, in which different populations of chromosomes (called demes) evolve locally on each processor and, from time to time, some individuals migrate from one deme to another. Experiments have been performed to test the benefits of the parallelization paradigm in terms of comput…
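The island-model pattern the abstract describes — independent demes with occasional migration — can be sketched in a single process. This toy uses a simple mutation-selection GA on a trivial continuous fitness; it illustrates the parallelization pattern only, not PGAC or its clustering objective.

```python
import numpy as np

def evolve(deme, fitness, rng, n_gen=20, mut=0.1):
    """One deme: mutate everyone, keep the better half (toy GA)."""
    for _ in range(n_gen):
        children = deme + rng.normal(0, mut, deme.shape)
        pool = np.vstack([deme, children])
        deme = pool[np.argsort([fitness(x) for x in pool])[:len(deme)]]
    return deme  # sorted: deme[0] is the best individual

def island_model(fitness, n_demes=4, deme_size=10, dim=2,
                 n_epochs=5, seed=0):
    """Demes evolve independently; after each epoch the best of each
    deme migrates to the next deme in a ring."""
    rng = np.random.default_rng(seed)
    demes = [rng.normal(0, 2, (deme_size, dim)) for _ in range(n_demes)]
    for _ in range(n_epochs):
        demes = [evolve(d, fitness, rng) for d in demes]
        # Ring migration: deme i's best replaces deme (i+1)'s worst.
        bests = [d[0].copy() for d in demes]
        for i, b in enumerate(bests):
            demes[(i + 1) % n_demes][-1] = b
    return min((d[0] for d in demes), key=fitness)

# Toy fitness: squared distance to (1, 1); the GA should approach it.
best = island_model(lambda x: float(np.sum((x - 1.0) ** 2)))
```

In a genuine distributed setting each deme would run on its own processor and only the migrants would cross the network, which is what makes the island model attractive over high-speed intranets.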
Feature Ranking of Large, Robust, and Weighted Clustering Result
2017
A clustering result needs to be interpreted and evaluated for knowledge discovery. When clustered data represents a sample from a population with known sample-to-population alignment weights, both the clustering and the evaluation techniques need to take this into account. The purpose of this article is to advance automatic knowledge discovery from a robust clustering result at the population level. For this purpose, we derive a novel ranking method by generalizing the computation of the Kruskal-Wallis H test statistic from the sample to the population level with two different approaches. Application of these generalizations both to the input variables used in clustering and to metadata provides a…
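At the sample level, the ranking idea can be sketched directly with SciPy's Kruskal-Wallis test: compute H for each feature across the cluster groups and rank features by it. The weighted population-level generalizations the article derives are not reproduced here; the toy data below is an assumption.

```python
import numpy as np
from scipy.stats import kruskal

# Toy sample: feature 0 separates three clusters, feature 1 is noise.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1, 2], 30)
X = np.column_stack([
    labels + rng.normal(0, 0.1, 90),  # informative
    rng.normal(0, 1.0, 90),           # uninformative
])

# Kruskal-Wallis H per feature across the cluster groups; a large H
# means the feature's distribution differs strongly between clusters.
H = [kruskal(*(X[labels == g, j] for g in np.unique(labels))).statistic
     for j in range(X.shape[1])]
ranking = np.argsort(H)[::-1]  # features, most discriminating first
```

Because H is rank-based, the ordering is robust to monotone transformations and outliers in the features, which fits the article's robust-clustering setting.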
Solution Using Clustering Methods
1987
The main aim of this analysis is to identify typical morphologies from a multivariate, longitudinal data set on growing children and to describe the morphological evolution of the resulting groups of girls. Identifying typical morphologies is, in our opinion, closely linked to the search for structure in the individuals and in the variables.
Trajectory-based and Sound-based Medical Data Clustering
2022
Challenges in medicine are often faced as interdisciplinary endeavors. In such an interdisciplinary view, sonification of medical data provides an additional sensory dimension to highlight often hard-to-find information and details. Some examples of sonification of medical data include Covid genome mapping [5], auditory representations of tridimensional objects such as the brain [4], enhancement of medical imagery through the use of sound [1]. Here, we focus on kidney filtering-efficiency time-evolution data. We consider the estimated glomerular filtration rate (eGFR), the main indicator of kidney efficiency in diabetic kidney disease patients. We propose a technique to sonify the eGFR tra…