Search results for "Clustering algorithm"

showing 10 items of 34 documents

A Greedy Algorithm for Hierarchical Complete Linkage Clustering

2014

We are interested in the greedy method to compute an hierarchical complete linkage clustering. There are two known methods for this problem, one having a running time of \({\mathcal O}(n^3)\) with a space requirement of \({\mathcal O}(n)\) and one having a running time of \({\mathcal O}(n^2 \log n)\) with a space requirement of Θ(n 2), where n is the number of points to be clustered. Both methods are not capable to handle large point sets. In this paper, we give an algorithm with a space requirement of \({\mathcal O}(n)\) which is able to cluster one million points in a day on current commodity hardware.

CombinatoricsCURE data clustering algorithmSUBCLUNearest-neighbor chain algorithmCorrelation clusteringSingle-linkage clusteringHierarchical clustering of networksGreedy algorithmComplete-linkage clusteringMathematics
researchProduct

Efficient and Accurate OTU Clustering with GPU-Based Sequence Alignment and Dynamic Dendrogram Cutting.

2015

De novo clustering is a popular technique to perform taxonomic profiling of a microbial community by grouping 16S rRNA amplicon reads into operational taxonomic units (OTUs). In this work, we introduce a new dendrogram-based OTU clustering pipeline called CRiSPy. The key idea used in CRiSPy to improve clustering accuracy is the application of an anomaly detection technique to obtain a dynamic distance cutoff instead of using the de facto value of 97 percent sequence similarity as in most existing OTU clustering pipelines. This technique works by detecting an abrupt change in the merging heights of a dendrogram. To produce the output dendrograms, CRiSPy employs the OTU hierarchical clusterin…

Computer scienceCorrelation clusteringSingle-linkage clusteringMolecular Sequence DataMachine learningcomputer.software_genrePattern Recognition AutomatedCURE data clustering algorithmRNA Ribosomal 16SGeneticsComputer GraphicsCluster analysisBase Sequencebusiness.industryApplied MathematicsDendrogramHigh-Throughput Nucleotide SequencingPattern recognitionSignal Processing Computer-AssistedEquipment DesignHierarchical clusteringEquipment Failure AnalysisRNA BacterialCanopy clustering algorithmArtificial intelligenceHierarchical clustering of networksbusinesscomputerSequence AlignmentAlgorithmsBiotechnologyIEEE/ACM transactions on computational biology and bioinformatics
researchProduct

Clustering categorical data: A stability analysis framework

2011

Clustering to identify inherent structure is an important first step in data exploration. The k-means algorithm is a popular choice, but K-means is not generally appropriate for categorical data. A specific extension of k-means for categorical data is the k-modes algorithm. Both of these partition clustering methods are sensitive to the initialization of prototypes, which creates the difficulty of selecting the best solution for a given problem. In addition, selecting the number of clusters can be an issue. Further, the k-modes method is especially prone to instability when presented with ‘noisy’ data, since the calculation of the mode lacks the smoothing effect inherent in the calculation …

Computer sciencebusiness.industrySingle-linkage clusteringCorrelation clusteringConstrained clusteringcomputer.software_genreMachine learningDetermining the number of clusters in a data setData stream clusteringCURE data clustering algorithmConsensus clusteringData miningArtificial intelligenceCluster analysisbusinesscomputer2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)
researchProduct

Mammographic images segmentation based on chaotic map clustering algorithm

2013

Background: This work investigates the applicability of a novel clustering approach to the segmentation of mammographic digital images. The chaotic map clustering algorithm is used to group together similar subsets of image pixels resulting in a medically meaningful partition of the mammography. Methods: The image is divided into pixels subsets characterized by a set of conveniently chosen features and each of the corresponding points in the feature space is associated to a map. A mutual coupling strength between the maps depending on the associated distance between feature space points is subsequently introduced. On the system of maps, the simulated evolution through chaotic dynamics leads…

Cooperative behaviorClustering algorithmsComputer scienceFeature vectorCorrelation clusteringPhysics::Medical PhysicsMass lesionsMicrocalcificationsImage processingBreast NeoplasmsDigital imageSegmentationBreast cancerImage Processing Computer-AssistedCluster AnalysisHumansRadiology Nuclear Medicine and imagingSegmentationComputer visionCluster analysisFeaturesPixelChaotic maps Clustering algorithms Cooperative behavior Segmentation Mammography Features Mass lesions Microcalcifications Breast cancerbusiness.industrySegmentation-based object categorizationCalcinosisSettore FIS/07 - Fisica Applicata(Beni Culturali Ambientali Biol.e Medicin)Radiographic Image EnhancementChaotic mapsRadiology Nuclear Medicine and imagingComputer Science::Computer Vision and Pattern RecognitionFemaleArtificial intelligencebusinessAlgorithmsMammographyResearch Article
researchProduct

Distance-constrained data clustering by combined k-means algorithms and opinion dynamics filters

2014

Data clustering algorithms represent mechanisms for partitioning huge arrays of multidimensional data into groups with small in–group and large out–group distances. Most of the existing algorithms fail when a lower bound for the distance among cluster centroids is specified, while this type of constraint can be of help in obtaining a better clustering. Traditional approaches require that the desired number of clusters are specified a priori, which requires either a subjective decision or global meta–information knowledge that is not easily obtainable. In this paper, an extension of the standard data clustering problem is addressed, including additional constraints on the cluster centroid di…

Fuzzy clusteringCorrelation clusteringSingle-linkage clusteringConstrained clusteringcomputer.software_genreDetermining the number of clusters in a data setSettore ING-INF/04 - AutomaticaData clustering k–means Opinion dynamics Hegelsmann–Krause modelCURE data clustering algorithmData miningCluster analysisAlgorithmcomputerk-medians clusteringMathematics22nd Mediterranean Conference on Control and Automation
researchProduct

Scalable Clustering by Iterative Partitioning and Point Attractor Representation

2016

Clustering very large datasets while preserving cluster quality remains a challenging data-mining task to date. In this paper, we propose an effective scalable clustering algorithm for large datasets that builds upon the concept of synchronization. Inherited from the powerful concept of synchronization, the proposed algorithm, CIPA (Clustering by Iterative Partitioning and Point Attractor Representations), is capable of handling very large datasets by iteratively partitioning them into thousands of subsets and clustering each subset separately. Using dynamic clustering by synchronization, each subset is then represented by a set of point attractors and outliers. Finally, CIPA identifies the…

Fuzzy clusteringGeneral Computer ScienceComputer scienceSingle-linkage clusteringCorrelation clusteringConstrained clustering02 engineering and technologycomputer.software_genreComputingMethodologies_PATTERNRECOGNITIONData stream clusteringCURE data clustering algorithm020204 information systems0202 electrical engineering electronic engineering information engineeringCanopy clustering algorithm020201 artificial intelligence & image processingData miningCluster analysiscomputerACM Transactions on Knowledge Discovery from Data
researchProduct

A Novel Clustering Algorithm based on a Non-parametric "Anti-Bayesian" Paradigm

2015

The problem of clustering, or unsupervised classification, has been solved by a myriad of techniques, all of which depend, either directly or implicitly, on the Bayesian principle of optimal classification. To be more specific, within a Bayesian paradigm, if one is to compare the testing sample with only a single point in the feature space from each class, the optimal Bayesian strategy would be to achieve this based on the distance from the corresponding means or central points in the respective distributions. When this principle is applied in clustering, one would assign an unassigned sample into the cluster whose mean is the closest, and this can be done in either a bottom-up or a top-dow…

Fuzzy clusteringbusiness.industryComputer scienceCorrelation clusteringConstrained clusteringPattern recognitioncomputer.software_genreData stream clusteringCURE data clustering algorithmCanopy clustering algorithmAffinity propagationArtificial intelligenceData miningbusinessCluster analysiscomputer
researchProduct

Comparison of Internal Clustering Validation Indices for Prototype-Based Clustering

2017

Clustering is an unsupervised machine learning and pattern recognition method. In general, in addition to revealing hidden groups of similar observations and clusters, their number needs to be determined. Internal clustering validation indices estimate this number without any external information. The purpose of this article is to evaluate, empirically, characteristics of a representative set of internal clustering validation indices with many datasets. The prototype-based clustering framework includes multiple, classical and robust, statistical estimates of cluster location so that the overall setting of the paper is novel. General observations on the quality of validation indices and on t…

Fuzzy clusteringlcsh:T55.4-60.8Computer scienceSingle-linkage clusteringCorrelation clustering02 engineering and technologycomputer.software_genrelcsh:QA75.5-76.95Theoretical Computer Scienceprototype-based clusteringCURE data clustering algorithm020204 information systemsprototype-based clustering; clustering validation index; robust statisticsConsensus clusteringalgoritmit0202 electrical engineering electronic engineering information engineeringlcsh:Industrial engineering. Management engineeringCluster analysisk-medians clusteringta113Numerical Analysisbusiness.industryPattern recognitionDetermining the number of clusters in a data setComputational MathematicsComputingMethodologies_PATTERNRECOGNITIONComputational Theory and Mathematicsrobust statistics020201 artificial intelligence & image processinglcsh:Electronic computers. Computer scienceArtificial intelligenceData miningtiedonlouhintabusinessclustering validation indexcomputerAlgorithms
researchProduct

Feature Ranking of Large, Robust, and Weighted Clustering Result

2017

A clustering result needs to be interpreted and evaluated for knowledge discovery. When clustered data represents a sample from a population with known sample-to-population alignment weights, both the clustering and the evaluation techniques need to take this into account. The purpose of this article is to advance the automatic knowledge discovery from a robust clustering result on the population level. For this purpose, we derive a novel ranking method by generalizing the computation of the Kruskal-Wallis H test statistic from sample to population level with two different approaches. Application of these enlargements to both the input variables used in clustering and to metadata provides a…

Kruskal-Wallis testComputer scienceCorrelation clusteringPopulation02 engineering and technologycomputer.software_genreMachine learning01 natural sciencesRanking (information retrieval)010104 statistics & probabilityKnowledge extractionCURE data clustering algorithmpopulation analysisRanking SVM0202 electrical engineering electronic engineering information engineeringTest statistic0101 mathematicseducational knowledge discoveryeducationCluster analysiseducation.field_of_studybusiness.industryRanking020201 artificial intelligence & image processingData miningArtificial intelligencerobust clusteringbusinesscomputer
researchProduct

The COPD multi-dimensional phenotype: A new classification from the STORICO Italian observational study.

2019

BackgroundThis paper is aimed to (i) develop an innovative classification of COPD, multi-dimensional phenotype, based on a multidimensional assessment; (ii) describe the identified multi-dimensional phenotypes.MethodsAn exploratory factor analysis to identify the main classificatory variables and, then, a cluster analysis based on these variables were run to classify the COPD-diagnosed 514 patients enrolled in the STORICO (trial registration number: NCT03105999) study into multi-dimensional phenotypes.ResultsThe circadian rhythm of symptoms and health-related quality of life, but neither comorbidity nor respiratory function, qualified as primary classificatory variables. Five multidimension…

MalePulmonologyPhysiologyComorbidityAnxietyPathology and Laboratory MedicineCohort StudiesPulmonary Disease Chronic ObstructiveMathematical and Statistical TechniquesQuality of lifeMedicine and Health SciencesCoughingCluster AnalysisRespiratory functionPublic and Occupational HealthAged 80 and overCOPDMultidisciplinaryDepressionApplied MathematicsSimulation and ModelingQStatisticsRMiddle AgedExploratory factor analysisCircadian RhythmBody FluidsCircadian RhythmsPhenotypeItalyPhysical SciencesAnxietyMedicineFemalemedicine.symptomAnatomyFactor AnalysisAlgorithmsCohort studyResearch Articlemedicine.medical_specialtyScienceMemory EpisodicChronic Obstructive Pulmonary DiseaseSettore MED/10 - Malattie Dell'Apparato RespiratorioResearch and Analysis MethodsClustering AlgorithmsSigns and SymptomsDiagnostic MedicineInternal medicinemedicineCOPDHumansStatistical MethodsAgedbusiness.industryBiology and Life SciencesPhysical Activitymedicine.diseaseComorbidityrespiratory tract diseasesMucusDyspneaCoughQuality of LifeObservational studybusinessFactor Analysis StatisticalSleepPhysiological ProcessesChronobiologymultiple phenotypesMathematicsPloS one
researchProduct