Search results for "clustering"

showing 10 items of 446 documents

Computation Cluster Validation in the Big Data Era

2017

Data-driven class discovery, i.e., the inference of cluster structure in a dataset, is a fundamental task in Data Analysis, in particular for the Life Sciences. We provide a tutorial on the most common approaches used for that task, focusing on methodologies for the prediction of the number of clusters in a dataset. Although the methods that we present are general in terms of the data for which they can be used, we offer a case study relevant for Microarray Data Analysis.

Clustering high-dimensional dataClass (computer programming)Clustering validation measureSettore INF/01 - InformaticaComputer sciencebusiness.industryBig dataInferenceMicroarrays data analysiscomputer.software_genreGap statisticTask (project management)ComputingMethodologies_PATTERNRECOGNITIONCURE data clustering algorithmConsensus clusteringHypothesis testing in statisticClustering Class Discovery in Data Algorithmsb Clustering algorithmFigure of meritConsensus clusteringData miningCluster analysisbusinesscomputer
researchProduct

A local complexity based combination method for decision forests trained with high-dimensional data

2012

Accurate machine learning with high-dimensional data is affected by phenomena known as the “curse” of dimensionality. One of the main strategies explored in the last decade to deal with this problem is the use of multi-classifier systems. Several of such approaches are inspired by the Random Subspace Method for the construction of decision forests. Furthermore, other studies rely on estimations of the individual classifiers' competence, to enhance the combination in the multi-classifier and improve the accuracy. We propose a competence estimate which is based on local complexity measurements, to perform a weighted average combination of the decision forest. Experimental results show how thi…

Clustering high-dimensional dataComputational complexity theorybusiness.industryComputer scienceDecision treeMachine learningcomputer.software_genreRandom forestRandom subspace methodArtificial intelligenceData miningbusinessCompetence (human resources)computerClassifier (UML)Curse of dimensionality2012 12th International Conference on Intelligent Systems Design and Applications (ISDA)
researchProduct

An Extension of the DgLARS Method to High-Dimensional Relative Risk Regression Models

2020

In recent years, clinical studies, where patients are routinely screened for many genomic features, are becoming more common. The general aim of such studies is to find genomic signatures useful for treatment decisions and the development of new treatments. However, genomic data are typically noisy and high dimensional, not rarely outstripping the number of patients included in the study. For this reason, sparse estimators are usually used in the study of high-dimensional survival data. In this paper, we propose an extension of the differential geometric least angle regression method to high-dimensional relative risk regression models.

Clustering high-dimensional dataComputer sciencedgLARS Gene expression data High-dimensional data Relative risk regression models Sparsity · Survival analysisLeast-angle regressionRelative riskStatisticsEstimatorRegression analysisExtension (predicate logic)High dimensionalSettore SECS-S/01 - StatisticaSurvival analysis
researchProduct

GenClust: A genetic algorithm for clustering gene expression data

2005

Abstract Background Clustering is a key step in the analysis of gene expression data, and in fact, many classical clustering algorithms are used, or more innovative ones have been designed and validated for the task. Despite the widespread use of artificial intelligence techniques in bioinformatics and, more generally, data analysis, there are very few clustering algorithms based on the genetic paradigm, yet that paradigm has great potential in finding good heuristic solutions to a difficult optimization problem such as clustering. Results GenClust is a new genetic algorithm for clustering gene expression data. It has two key features: (a) a novel coding of the search space that is simple, …

Clustering high-dimensional dataDNA ComplementaryComputer scienceRand indexCorrelation clusteringOligonucleotidesEvolutionary algorithmlcsh:Computer applications to medicine. Medical informaticscomputer.software_genreBiochemistryPattern Recognition AutomatedBiclusteringOpen Reading FramesStructural BiologyCURE data clustering algorithmConsensus clusteringGenetic algorithmCluster AnalysisCluster analysislcsh:QH301-705.5Molecular BiologyGene expression data Clustering Evolutionary algorithmsOligonucleotide Array Sequence AnalysisModels StatisticalBrown clusteringHeuristicGene Expression ProfilingApplied MathematicsComputational BiologyComputer Science Applicationslcsh:Biology (General)Gene Expression RegulationMutationlcsh:R858-859.7Data miningSequence AlignmentcomputerSoftwareAlgorithmsBMC Bioinformatics
researchProduct

Data Analysis and Bioinformatics

2007

Data analysis methods and techniques are revisited in the case of biological data sets. Particular emphasis is given to clustering and mining issues. Clustering is still a subject of active research in several fields such as statistics, pattern recognition, and machine learning. Data mining adds to clustering the complications of very large data-sets with many attributes of different types. And this is a typical situation in biology. Some cases studies are also described.

Clustering high-dimensional dataFuzzy clusteringComputer sciencebusiness.industryCorrelation clusteringConceptual clusteringMachine learningcomputer.software_genreComputingMethodologies_PATTERNRECOGNITIONCURE data clustering algorithmConsensus clusteringCanopy clustering algorithmData miningArtificial intelligenceCluster analysisbusinesscomputer
researchProduct

Distance Functions, Clustering Algorithms and Microarray Data Analysis

2010

Distance functions are a fundamental ingredient of classification and clustering procedures, and this holds true also in the particular case of microarray data. In the general data mining and classification literature, functions such as Euclidean distance or Pearson correlation have gained their status of de facto standards thanks to a considerable amount of experimental validation. For microarray data, the issue of which distance function works best has been investigated, but no final conclusion has been reached. The aim of this extended abstract is to shed further light on that issue. Indeed, we present an experimental study, involving several distances, assessing (a) their intrinsic sepa…

Clustering high-dimensional dataFuzzy clusteringSettore INF/01 - Informaticabusiness.industryCorrelation clusteringMachine learningcomputer.software_genrePearson product-moment correlation coefficientRanking (information retrieval)Euclidean distancesymbols.namesakeClustering distance measuressymbolsArtificial intelligenceData miningbusinessCluster analysiscomputerMathematicsDe facto standard
researchProduct

Structural clustering of millions of molecular graphs

2014

We propose an algorithm for clustering very large molecular graph databases according to scaffolds (i.e., large structural overlaps) that are common between cluster members. Our approach first partitions the original dataset into several smaller datasets using a greedy clustering approach named APreClus based on dynamic seed clustering. APreClus is an online and instance incremental clustering algorithm delaying the final cluster assignment of an instance until one of the so-called pending clusters the instance belongs to has reached significant size and is converted to a fixed cluster. Once a cluster is fixed, APreClus recalculates the cluster centers, which are used as representatives for…

Clustering high-dimensional dataFuzzy clusteringTheoretical computer sciencek-medoidsComputer scienceSingle-linkage clusteringCorrelation clusteringConstrained clusteringcomputer.software_genreComplete-linkage clusteringGraphHierarchical clusteringComputingMethodologies_PATTERNRECOGNITIONData stream clusteringCURE data clustering algorithmCanopy clustering algorithmFLAME clusteringAffinity propagationData miningCluster analysiscomputerk-medians clusteringClustering coefficientProceedings of the 29th Annual ACM Symposium on Applied Computing
researchProduct

Making nonlinear manifold learning models interpretable: The manifold grand tour

2015

Smooth nonlinear topographic maps of the data distribution to guide a Grand Tour visualisation.Prioritisation of data linear views that are most consistent with data structure in the maps.Useful visualisations that cannot be obtained by other more classical approaches. Dimensionality reduction is required to produce visualisations of high dimensional data. In this framework, one of the most straightforward approaches to visualising high dimensional data is based on reducing complexity and applying linear projections while tumbling the projection axes in a defined sequence which generates a Grand Tour of the data. We propose using smooth nonlinear topographic maps of the data distribution to…

Clustering high-dimensional dataQA75Nonlinear dimensionality reductionDiscriminative clusteringComputer scienceVisualització de la informaciócomputer.software_genreData visualizationProjection (mathematics)Information visualizationArtificial IntelligenceQA:Informàtica::Infografia [Àrees temàtiques de la UPC]business.industryData visualizationDimensionality reductionGrand tourGeneral EngineeringNonlinear dimensionality reductionTopographic mapData structureComputer Science ApplicationsVisualizationManifold learningData miningbusinesscomputerGenerative topographic mappingLinear projections
researchProduct

Dimensionality reduction via regression on hyperspectral infrared sounding data

2014

This paper introduces a new method for dimensionality reduction via regression (DRR). The method generalizes Principal Component Analysis (PCA) in such a way that reduces the variance of the PCA scores. In order to do so, DRR relies on a deflationary process in which a non-linear regression reduces the redundancy between the PC scores. Unlike other nonlinear dimensionality reduction methods, DRR is easy to apply, it has out-of-sample extension, it is invertible, and the learned transformation is volume-preserving. These properties make the method useful for a wide range of applications, especially in very high dimensional data in general, and for hyperspectral image processing in particular…

Clustering high-dimensional dataRedundancy (information theory)business.industryDimensionality reductionPrincipal component analysisFeature extractionNonlinear dimensionality reductionHyperspectral imagingPattern recognitionArtificial intelligencebusinessMathematicsCurse of dimensionality2014 6th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS)
researchProduct

Scaling Up a Metric Learning Algorithm for Image Recognition and Representation

2008

Maximally Collapsing Metric Learning is a recently proposed algorithm to estimate a metric matrix from labelled data. The purpose of this work is to extend this approach by considering a set of landmark points which can in principle reduce the cost per iteration in one order of magnitude. The proposal is in fact a generalized version of the original algorithm that can be applied to larger amounts of higher dimensional data. Exhaustive experimentation shows that very similar behavior at a lower cost is obtained for a wide range of the number of landmark points used.

Clustering high-dimensional dataSet (abstract data type)Range (mathematics)LandmarkMetric (mathematics)Landmark pointRepresentation (mathematics)AlgorithmFacial recognition systemMathematics
researchProduct