Search results for "High-dimensional data"

Showing 10 of 29 documents

Structural clustering of millions of molecular graphs

2014

We propose an algorithm for clustering very large molecular graph databases according to scaffolds (i.e., large structural overlaps) that are common between cluster members. Our approach first partitions the original dataset into several smaller datasets using a greedy clustering approach named APreClus, which is based on dynamic seed clustering. APreClus is an online, instance-incremental clustering algorithm that delays the final cluster assignment of an instance until one of the so-called pending clusters it belongs to has reached significant size and is converted into a fixed cluster. Once a cluster is fixed, APreClus recalculates the cluster centers, which are used as representatives for…
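
What follows is a minimal Python sketch of the pending-versus-fixed cluster idea described above; it is not the authors' APreClus. In particular, plain Euclidean distance to a seed point stands in for scaffold (structural overlap) comparison of molecular graphs, and the class name, radius and size threshold are illustrative assumptions.

import numpy as np

class IncrementalSeedClusterer:
    """Toy instance-incremental clusterer with pending and fixed clusters."""
    def __init__(self, radius=1.5, min_size=10):
        self.radius = radius        # maximum distance to join an existing pending cluster
        self.min_size = min_size    # size at which a pending cluster becomes fixed
        self.pending = []           # each entry: {"seed": point, "members": [points]}
        self.fixed_centers = []     # recomputed centers of fixed clusters (representatives)

    def add(self, x):
        x = np.asarray(x, dtype=float)
        if self.pending:
            dists = [np.linalg.norm(x - c["seed"]) for c in self.pending]
            j = int(np.argmin(dists))
            if dists[j] <= self.radius:
                self.pending[j]["members"].append(x)
                if len(self.pending[j]["members"]) >= self.min_size:
                    # pending cluster reached significant size: fix it and recompute its center
                    members = np.vstack(self.pending[j]["members"])
                    self.fixed_centers.append(members.mean(axis=0))
                    del self.pending[j]
                return
        # otherwise the instance seeds a new pending cluster
        self.pending.append({"seed": x, "members": [x]})

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2)) + rng.integers(0, 3, size=(200, 1)) * 5.0
clusterer = IncrementalSeedClusterer()
for point in data:
    clusterer.add(point)
print(len(clusterer.fixed_centers), "fixed clusters;", len(clusterer.pending), "still pending")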

Keywords: Clustering high-dimensional data; Fuzzy clustering; Theoretical computer science; k-medoids; Computer science; Single-linkage clustering; Correlation clustering; Constrained clustering; Complete-linkage clustering; Graph; Hierarchical clustering; Pattern recognition; Data stream clustering; CURE data clustering algorithm; Canopy clustering algorithm; FLAME clustering; Affinity propagation; Data mining; Cluster analysis; k-medians clustering; Clustering coefficient
Published in: Proceedings of the 29th Annual ACM Symposium on Applied Computing

Making nonlinear manifold learning models interpretable: The manifold grand tour

2015

Highlights: smooth nonlinear topographic maps of the data distribution to guide a Grand Tour visualisation; prioritisation of the linear views of the data that are most consistent with the data structure in the maps; useful visualisations that cannot be obtained by other, more classical approaches. Dimensionality reduction is required to produce visualisations of high-dimensional data. In this framework, one of the most straightforward approaches to visualising high-dimensional data is to reduce complexity and apply linear projections while tumbling the projection axes in a defined sequence, which generates a Grand Tour of the data. We propose using smooth nonlinear topographic maps of the data distribution to…
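
Since the Grand Tour itself is a standard construction, here is a minimal sketch of it, assuming a plain random tour rather than the topographic-map-guided variant the abstract proposes; the function name and the interpolation scheme are illustrative.

import numpy as np

def grand_tour_frames(dim, n_targets=5, steps_between=20, seed=0):
    """Yield a sequence of (dim, 2) orthonormal bases that tumble the projection axes."""
    rng = np.random.default_rng(seed)
    current, _ = np.linalg.qr(rng.normal(size=(dim, 2)))
    for _ in range(n_targets):
        target, _ = np.linalg.qr(rng.normal(size=(dim, 2)))
        for t in np.linspace(0.0, 1.0, steps_between, endpoint=False):
            # move part of the way towards the target plane, then re-orthonormalise
            frame, _ = np.linalg.qr((1.0 - t) * current + t * target)
            yield frame
        current = target

X = np.random.default_rng(1).normal(size=(500, 10))   # toy high-dimensional data
for frame in grand_tour_frames(dim=10, n_targets=3):
    view = X @ frame   # one 2-D linear view of the data, ready for a scatter plot

In the method described above, the sequence of views would additionally be prioritised by their consistency with the nonlinear topographic map of the data.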

Keywords: Clustering high-dimensional data; QA75; Nonlinear dimensionality reduction; Discriminative clustering; Computer science; Information visualization; Data visualization; Projection (mathematics); Artificial Intelligence; Computer science: Computer graphics [UPC subject areas]; Dimensionality reduction; Grand tour; General Engineering; Topographic map; Data structure; Computer Science Applications; Visualization; Manifold learning; Data mining; Generative topographic mapping; Linear projections

Dimensionality reduction via regression on hyperspectral infrared sounding data

2014

This paper introduces a new method for dimensionality reduction via regression (DRR). The method generalizes Principal Component Analysis (PCA) in a way that reduces the variance of the PCA scores. To do so, DRR relies on a deflationary process in which a non-linear regression reduces the redundancy between the PC scores. Unlike other nonlinear dimensionality reduction methods, DRR is easy to apply, has an out-of-sample extension, is invertible, and learns a volume-preserving transformation. These properties make the method useful for a wide range of applications, especially for very high-dimensional data in general and for hyperspectral image processing in particular…
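
The deflationary idea is easy to illustrate. The sketch below is only a rough reading of the abstract, not the published DRR: PCA scores are computed, and each score is then replaced by its residual after a nonlinear regression on the preceding ones, which reduces redundancy between components. Invertibility and volume preservation, which the abstract attributes to DRR, are not handled here, and the choice of regressor is an assumption.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

def drr_like_scores(X, n_components=4, seed=0):
    """Deflate PCA scores by removing what a nonlinear regression predicts from earlier scores."""
    scores = PCA(n_components=n_components).fit_transform(X)
    reduced = scores.copy()
    for j in range(1, n_components):
        reg = RandomForestRegressor(n_estimators=50, random_state=seed)
        reg.fit(reduced[:, :j], scores[:, j])
        # keep only the part of component j not predictable from the earlier components
        reduced[:, j] = scores[:, j] - reg.predict(reduced[:, :j])
    return reduced

X = np.random.default_rng(0).normal(size=(300, 20))   # stand-in for hyperspectral pixels
Z = drr_like_scores(X)
print(Z.shape)   # (300, 4) deflated scores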

Keywords: Clustering high-dimensional data; Redundancy (information theory); Dimensionality reduction; Principal component analysis; Feature extraction; Nonlinear dimensionality reduction; Hyperspectral imaging; Pattern recognition; Artificial intelligence; Mathematics; Curse of dimensionality
Published in: 2014 6th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS)

Scaling Up a Metric Learning Algorithm for Image Recognition and Representation

2008

Maximally Collapsing Metric Learning is a recently proposed algorithm for estimating a metric matrix from labelled data. The purpose of this work is to extend that approach by considering a set of landmark points, which can in principle reduce the cost per iteration by one order of magnitude. The proposal is in fact a generalized version of the original algorithm that can be applied to larger amounts of higher-dimensional data. Exhaustive experimentation shows that very similar behavior is obtained at a lower cost for a wide range of numbers of landmark points.
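
The landmark trick can be sketched independently of the full Maximally Collapsing Metric Learning objective. In the toy below, which is assumed rather than taken from the paper, each point is compared only against m landmark points (here, k-means centres) instead of all n training points, so the per-iteration work of forming the collapsing probabilities scales with n*m rather than n*n.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 50))     # n training points in 50 dimensions
L = rng.normal(size=(50, 10))       # current metric factor, i.e. Mahalanobis matrix A = L L^T

# choose m << n landmark points (k-means centres are one possible choice)
m = 100
landmarks = KMeans(n_clusters=m, n_init=5, random_state=0).fit(X).cluster_centers_

XL = X @ L                          # (n, 10) projected data
GL = landmarks @ L                  # (m, 10) projected landmarks
sq_dists = ((XL[:, None, :] - GL[None, :, :]) ** 2).sum(axis=-1)   # (n, m) instead of (n, n)

# softmax-style "collapsing" probabilities of each point onto each landmark
p = np.exp(-(sq_dists - sq_dists.min(axis=1, keepdims=True)))
p /= p.sum(axis=1, keepdims=True)
print(p.shape)                      # (5000, 100)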

Keywords: Clustering high-dimensional data; Set (abstract data type); Range (mathematics); Landmark; Metric (mathematics); Landmark point; Representation (mathematics); Algorithm; Facial recognition system; Mathematics

The Three Steps of Clustering in the Post-Genomic Era

2013

This chapter describes the basic algorithmic components that are involved in clustering, with particular attention to the classification of microarray data.

Keywords: Clustering high-dimensional data; Settore INF/01 - Informatica; Correlation clustering; Pattern recognition; Biclustering; CURE data clustering algorithm; Clustering; Classification; Biological Data Mining; Consensus clustering; Artificial intelligence; Data mining; Cluster analysis; Mathematics

A Feature Set Decomposition Method for the Construction of Multi-classifier Systems Trained with High-Dimensional Data

2013

Data mining for the discovery of novel, useful patterns encounters obstacles when dealing with high-dimensional datasets, a problem documented as the "curse" of dimensionality. A strategy for dealing with this issue is the decomposition of the input feature set to build a multi-classifier system. Standalone decomposition methods are rare and generally based on random selection. We propose a decomposition method that uses information-theory tools to arrange input features into uncorrelated and relevant subsets. Experimental results show that this approach significantly outperforms three baseline decomposition methods in terms of classification accuracy.
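
The general strategy, decompose the feature set, train one classifier per subset and combine them, can be sketched as follows. The grouping rule here (rank features by mutual information with the label and deal them out round-robin) is only an illustrative stand-in for the information-theoretic decomposition the abstract describes, and the dataset and base classifier are assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=60, n_informative=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

n_subsets = 4
relevance = mutual_info_classif(X_tr, y_tr, random_state=0)
order = np.argsort(relevance)[::-1]                        # most relevant features first
subsets = [order[i::n_subsets] for i in range(n_subsets)]  # round-robin deal into subsets

# one base classifier per feature subset
models = [LogisticRegression(max_iter=1000).fit(X_tr[:, s], y_tr) for s in subsets]

# simple majority vote of the per-subset predictions
votes = np.stack([m.predict(X_te[:, s]) for m, s in zip(models, subsets)])
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("ensemble accuracy:", (y_pred == y_te).mean())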

Keywords: Clustering high-dimensional data; Computer science; Pattern recognition; Information theory; Uncorrelated; Decomposition method (queueing theory); Data mining; Artificial intelligence; Feature set; Classifier (UML); Curse of dimensionality

Regularized Regression Incorporating Network Information: Simultaneous Estimation of Covariate Coefficients and Connection Signs

2014

We develop an algorithm that incorporates network information into regression settings. It simultaneously estimates the covariate coefficients and the signs of the network connections (i.e., whether the connections are of an activating or a repressing type). For the coefficient estimation steps, an additional penalty is set on top of the lasso penalty, similarly to Li and Li (2008). We develop a fast implementation of the new method based on coordinate descent. Furthermore, we show how the new method can be applied to time-to-event data. The new method yields good results in simulation studies concerning sensitivity and specificity of non-zero covariate coefficients, estimation of networ…
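
A rough reading of the kind of objective being described is sketched below: a squared-error fit plus a lasso penalty plus a network penalty in which the sign of each connection enters the smoothness term. Alternating between a coefficient update and resetting each connection sign to the sign of the product of the two coefficients it links is only one illustrative interpretation of "simultaneous estimation"; it is not the authors' coordinate-descent algorithm, and the general-purpose optimiser, penalty weights and toy network are assumptions.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:4] = [2.0, 2.0, -1.5, -1.5]
y = X @ beta_true + rng.normal(scale=0.5, size=n)
edges = [(0, 1), (2, 3), (0, 2)]          # known network connections between covariates

def objective(beta, signs, lam1=1.0, lam2=1.0):
    fit = 0.5 * np.sum((y - X @ beta) ** 2)
    lasso = lam1 * np.sum(np.abs(beta))                       # lasso penalty
    network = lam2 * sum((beta[i] - s * beta[j]) ** 2         # sign-aware network penalty
                         for (i, j), s in zip(edges, signs))
    return fit + lasso + network

signs = np.ones(len(edges))               # start by assuming activating connections
beta = np.zeros(p)
for _ in range(5):                        # alternate the two estimation steps
    beta = minimize(objective, beta, args=(signs,), method="Powell").x
    signs = np.array([np.sign(beta[i] * beta[j]) or 1.0 for i, j in edges])

print(np.round(beta[:6], 2), signs)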

Keywords: Clustering high-dimensional data; jel:C41; jel:C13; Machine learning; Regression; high-dimensional data; gene expression data; pathway information; penalized regression; Connection (mathematics); Set (abstract data type); Lasso (statistics); Covariate; Artificial intelligence; Sensitivity (control systems); Coordinate descent; Algorithm; Mathematics

Incrementally Assessing Cluster Tendencies with a Maximum Variance Cluster Algorithm

2003

We propose a straightforward and efficient way to discover clustering tendencies in data using the recently proposed Maximum Variance Clustering algorithm. The approach retains the benefits of the plain clustering algorithm relative to other clustering approaches. Experiments on both synthetic and real data were performed to evaluate the differences between the proposed methodology and the plain use of the Maximum Variance algorithm. According to the results obtained, the proposal constitutes an efficient and accurate alternative.
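
The abstract stays at a high level, so the sketch below only illustrates the general idea of probing cluster tendency by sweeping the constraint of a clustering method and watching how the number of clusters responds; AgglomerativeClustering with a distance threshold is used here merely as a stand-in for the Maximum Variance Cluster algorithm.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

for threshold in [1.0, 2.0, 4.0, 8.0, 16.0]:
    labels = AgglomerativeClustering(n_clusters=None,
                                     distance_threshold=threshold).fit_predict(X)
    # a plateau in the number of clusters across a range of thresholds hints at real structure
    print(f"threshold={threshold:5.1f}  clusters={len(np.unique(labels))}")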

Keywords: Clustering high-dimensional data; k-medoids; Computer science; CURE data clustering algorithm; Single-linkage clustering; Canopy clustering algorithm; Variance (accounting); Data mining; Cluster analysis; k-medians clustering

Penalized regression and clustering in high-dimensional data

The main goal of this Thesis is to describe numerous statistical techniques that deal with high-dimensional genomic data. The Thesis begins with a review of the literature on penalized regression models, with particular attention to least absolute shrinkage and selection operator (LASSO) or L1-penalty methods. L1 logistic/multinomial regression models are used for variable selection and discriminant analysis with a binary/categorical response variable. The Thesis discusses and compares several methods that are commonly utilized in genetics, and introduces new strategies to select markers according to their informative content and to discriminate clusters by offering reduced panels for popul…
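
As one concrete instance of the techniques reviewed, an L1-penalised logistic regression can be used to select a reduced panel of markers for a binary response. The sketch below is a generic illustration on simulated data, not taken from the Thesis; the regularisation strength C is an arbitrary choice.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy "genomic" data: 500 samples, 1000 candidate markers, only a few informative
X, y = make_classification(n_samples=500, n_features=1000, n_informative=10, random_state=0)

lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso_logit.fit(X, y)

selected = np.flatnonzero(lasso_logit.coef_[0])   # markers kept by the L1 penalty
print(f"{selected.size} markers selected out of {X.shape[1]}")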

Keywords: High-dimensional data; Quantile regression coefficients modeling; Tuning parameter selection; Genomic data; Lasso regression; Curves clustering; Settore SECS-S/01 - Statistica

Inferring networks from high-dimensional data with mixed variables

2014

We present two methodologies for dealing with high-dimensional data with mixed variables: the strongly decomposable graphical model and the regression-type graphical model. The first model is used to infer conditional independence graphs. The latter is applied to compute the relative importance, or contribution, of each predictor to the response variables. Recently, penalized likelihood approaches have also been proposed to estimate graph structures. In a simulation study, we compare the performance of the strongly decomposable graphical model and the graphical lasso in terms of graph recovery. Five different graph structures are used to simulate the data: the banded graph, the cluster gr…
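
The graphical-lasso side of such a comparison is straightforward to sketch. The toy below is not the authors' simulation study: it draws Gaussian data from a banded precision matrix (one of the graph structures named in the abstract) and checks how many true edges GraphicalLasso recovers. The dimensions, band strength and regularisation level are assumptions.

import numpy as np
from sklearn.covariance import GraphicalLasso

p, n = 20, 500
rng = np.random.default_rng(0)

# banded precision matrix: each variable is connected to its immediate neighbour
precision = np.eye(p)
for i in range(p - 1):
    precision[i, i + 1] = precision[i + 1, i] = 0.4

X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(precision), size=n)

model = GraphicalLasso(alpha=0.05).fit(X)
off_diag = ~np.eye(p, dtype=bool)
estimated_edges = (np.abs(model.precision_) > 1e-3) & off_diag
true_edges = (precision != 0) & off_diag
recovered = (estimated_edges & true_edges).sum() / true_edges.sum()
print(f"fraction of true edges recovered: {recovered:.2f}")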

Keywords: Random graph; Clustering high-dimensional data; Penalized likelihood; Theoretical computer science; Conditional independence; Decomposable Graphical Models; Computer science; Cluster graph; Mixed variables; Graphical model; Mutual information; Penalized Gaussian Graphical Model; Settore SECS-S/01 - Statistica