Search results for "DATA MINING"

Showing 10 of 907 documents

Sparse kernel methods for high-dimensional survival data

2008

Abstract Sparse kernel methods like support vector machines (SVMs) have been applied with great success to classification and (standard) regression settings. Existing support vector classification and regression techniques, however, are not suitable for partly censored survival data, which are typically analysed using Cox's proportional hazards model. As the partial likelihood of the proportional hazards model depends on the covariates only through inner products, it can be ‘kernelized’. The kernelized proportional hazards model, however, yields a solution that is dense, i.e. the solution depends on all observations. One of the key features of an SVM is that it yields a sparse solution, dependin…

Statistics and Probability; Lung Neoplasms; Lymphoma; Computer science; Computing Methodologies; Biochemistry; Pattern Recognition, Automated; Artificial Intelligence; Margin (machine learning); Covariate; Cluster Analysis; Humans; Computer Simulation; Fraction (mathematics); Molecular Biology; Proportional Hazards Models; Models, Statistical; Training set; Gene Expression Profiling; Computational Biology; Computer Science Applications; Support vector machine; Computational Mathematics; Kernel method; Computational Theory and Mathematics; Regression Analysis; Data mining; Algorithms; Software; Bioinformatics
researchProduct
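
The abstract above notes that the Cox partial likelihood depends on the covariates only through inner products and can therefore be ‘kernelized’. A minimal numpy sketch of that idea on synthetic data (not the authors' code, and without the sparsity-inducing machinery the paper develops):

```python
# Kernelized Cox partial log-likelihood on synthetic data (illustrative only).
import numpy as np

def gaussian_kernel(X, Y, gamma=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_cox_partial_loglik(alpha, K, time, event):
    """f(x_i) = sum_k alpha_k K(x_i, x_k); the partial log-likelihood sums,
    over observed events, f(x_i) - log(sum over the risk set of exp(f(x_j)))."""
    f = K @ alpha
    ll = 0.0
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]              # risk set R(t_i)
        ll += f[i] - np.log(np.exp(f[at_risk]).sum())
    return ll

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))                 # high-dimensional covariates
time = rng.exponential(scale=1.0, size=30)
event = rng.integers(0, 2, size=30)            # 1 = observed, 0 = censored
K = gaussian_kernel(X, X)
alpha = 0.1 * rng.normal(size=30)              # dense unless sparsity is enforced
print(kernel_cox_partial_loglik(alpha, K, time, event))
```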

Coupled variable selection for regression modeling of complex treatment patterns in a clinical cancer registry.

2013

For determining a manageable set of covariates potentially influential with respect to a time-to-event endpoint, Cox proportional hazards models can be combined with variable selection techniques, such as stepwise forward selection or backward elimination based on p-values, or regularized regression techniques such as component-wise boosting. Cox regression models have also been adapted for dealing with more complex event patterns, for example, for competing risks settings with separate, cause-specific hazard models for each event type, or for determining the prognostic effect pattern of a variable over different landmark times, with one conditional survival model for each landmark. Motivat…

Statistics and Probability; Male; Niacinamide; Boosting (machine learning); Carcinoma, Hepatocellular; Epidemiology; Computer science; Score; Feature selection; Antineoplastic Agents; Decision Support Techniques; Neoplasms; Covariate; Humans; Registries; Aged; Proportional Hazards Models; Phenylurea Compounds; Liver Neoplasms; Regression analysis; Confounding Factors, Epidemiologic; Middle Aged; Sorafenib; Prognosis; Regression; Cancer registry; Data Interpretation, Statistical; Regression Analysis; Data mining; Statistics in medicine
researchProduct
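
As a rough stand-in for the component-wise boosting mentioned in the abstract above (a simplified sketch on synthetic data, not the registry analysis or the paper's coupled selection strategy): at each step the negative gradient of the Cox partial log-likelihood is fitted against every single covariate, and only the best-fitting one is updated, so variable selection happens implicitly.

```python
# Component-wise gradient boosting for a Cox model (illustrative sketch).
import numpy as np

def cox_negative_gradient(f, time, event):
    # dl/df_i = event_i - sum over event times t_k <= t_i of
    #           exp(f_i) / sum_{j in R(t_k)} exp(f_j)
    g = event.astype(float).copy()
    ef = np.exp(f)
    for k in np.flatnonzero(event):
        risk = time >= time[k]
        g[risk] -= ef[risk] / ef[risk].sum()
    return g

def componentwise_boost(X, time, event, steps=100, nu=0.1):
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        g = cox_negative_gradient(X @ beta, time, event)
        # least-squares fit of the gradient on each single covariate
        scores = (X.T @ g) ** 2 / (X ** 2).sum(axis=0)
        j = int(np.argmax(scores))
        beta[j] += nu * (X[:, j] @ g) / (X[:, j] ** 2).sum()
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 20))
time = rng.exponential(size=80)
event = rng.integers(0, 2, size=80)
print(np.flatnonzero(componentwise_boost(X, time, event)))  # selected covariates
```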

STATIS and DISTATIS: optimum multitable principal component analysis and three way metric multidimensional scaling

2012

STATIS is an extension of principal component analysis (PCA) tailored to handle multiple data tables that measure sets of variables collected on the same observations, or, alternatively, as in a variant called dual-STATIS, multiple data tables where the same variables are measured on different sets of observations. STATIS proceeds in two steps: first, it analyzes the between-table similarity structure and derives from this analysis an optimal set of weights that are used to compute a linear combination of the data tables, called the compromise, that best represents the information common to the different data tables; second, the PCA of this compromise gives an optimal map of the observation…

Statistics and Probability; Mathematical optimization; Similarity (geometry); [STAT.TH] Statistics [stat]/Statistics Theory [stat.TH]; Linear discriminant analysis; Correspondence analysis; Set (abstract data type); [MATH.MATH-ST] Mathematics [math]/Statistics [math.ST]; Multiple factor analysis; Principal component analysis; Metric (mathematics); Data mining; Multidimensional scaling; Mathematics
researchProduct
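
A back-of-the-envelope numpy sketch of the two STATIS steps described above, on made-up tables (illustrative only; the actual method also handles normalization, observation masses, and the dual-STATIS/DISTATIS variants):

```python
# STATIS in two steps: (1) between-table similarity and optimal weights,
# (2) PCA of the weighted compromise. Table shapes and data are synthetic.
import numpy as np

rng = np.random.default_rng(2)
tables = [rng.normal(size=(15, p)) for p in (4, 6, 5)]    # same 15 observations

# Step 1: cross-product matrices and their RV-coefficient similarity structure
S = [Xc @ Xc.T for Xc in (T - T.mean(0) for T in tables)]
C = np.array([[np.trace(Si @ Sj) / np.sqrt(np.trace(Si @ Si) * np.trace(Sj @ Sj))
               for Sj in S] for Si in S])
w = np.linalg.eigh(C)[1][:, -1]                           # leading eigenvector
w = np.abs(w) / np.abs(w).sum()                           # optimal table weights

# Step 2: PCA of the compromise (weighted sum of cross-product matrices)
compromise = sum(wk * Sk for wk, Sk in zip(w, S))
eigval, eigvec = np.linalg.eigh(compromise)
factor_scores = eigvec[:, ::-1] * np.sqrt(np.clip(eigval[::-1], 0, None))
print(factor_scores[:, :2])                               # observation map
```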

Comprehensive estimation of input signals and dynamics in biochemical reaction networks

2012

Abstract Motivation: Cellular information processing can be described mathematically using differential equations. Often, external stimulation of cells by compounds such as drugs or hormones leading to activation has to be considered. Mathematically, the stimulus is represented by a time-dependent input function. Parameters such as rate constants of the molecular interactions are often unknown and need to be estimated from experimental data, e.g. by maximum likelihood estimation. For this purpose, the input function has to be defined for all times of the integration interval. This is usually achieved by approximating the input by interpolation or smoothing of the measured data. This procedu…

Statistics and Probability; Medicine and Health Sciences; Computer science; Differential equation; Maximum likelihood; Biochemistry; Models, Biological; Integration interval; Molecular Biology; Janus Kinases; Likelihood Functions; Regulation Pathways and Systems Biology; Experimental data; Confidence interval; Computer Science Applications; Computational Mathematics; STAT Transcription Factors; Computational Theory and Mathematics; Data mining; Algorithm; Smoothing; Algorithms; Signal Transduction
researchProduct
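
The abstract above refers to the common workaround of defining the input function by interpolating or smoothing the measured data before estimating rate constants by maximum likelihood. A hedged one-state sketch of that standard practice with SciPy (synthetic data and a made-up reaction, not the paper's comprehensive estimation approach):

```python
# Interpolated input u(t) plugged into an ODE; rate constant estimated by
# Gaussian maximum likelihood (equivalently, least squares). Illustrative only.
import numpy as np
from scipy.interpolate import interp1d
from scipy.integrate import solve_ivp
from scipy.optimize import minimize_scalar

t_meas = np.linspace(0, 10, 11)
u_meas = np.exp(-0.3 * t_meas)                    # measured stimulus (synthetic)
u = interp1d(t_meas, u_meas, kind="cubic", fill_value="extrapolate")

def simulate(k):
    # dx/dt = u(t) - k * x, a one-state stand-in for a signalling reaction
    sol = solve_ivp(lambda t, x: u(t) - k * x, (0, 10), [0.0], t_eval=t_meas)
    return sol.y[0]

y_obs = simulate(0.8) + np.random.default_rng(3).normal(0, 0.02, t_meas.size)

# Under Gaussian noise, MLE for k reduces to minimizing the squared residuals
res = minimize_scalar(lambda k: ((simulate(k) - y_obs) ** 2).sum(),
                      bounds=(0.01, 5.0), method="bounded")
print(res.x)                                      # estimated rate constant
```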

CARE: context-aware sequencing read error correction.

2020

Abstract Motivation Error correction is a fundamental pre-processing step in many Next-Generation Sequencing (NGS) pipelines, in particular for de novo genome assembly. However, existing error correction methods either suffer from high false-positive rates since they break reads into independent k-mers or do not scale efficiently to large amounts of sequencing reads and complex genomes. Results We present CARE—an alignment-based scalable error correction algorithm for Illumina data using the concept of minhashing. Minhashing allows for efficient similarity search within large sequencing read collections which enables fast computation of high-quality multiple alignments. Sequencing errors ar…

Statistics and Probability; Multiple sequence alignment; Computer science; Sequence assembly; High-Throughput Nucleotide Sequencing; Context (language use); Sequence Analysis, DNA; Biochemistry; Genome; Computer Science Applications; Computational Mathematics; Computational Theory and Mathematics; Humans; Human genome; Data mining; Error detection and correction; Molecular Biology; Sequence Alignment; Algorithms; Software; Bioinformatics (Oxford, England)
researchProduct
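
Minhashing, mentioned in the abstract above, approximates the Jaccard similarity of k-mer sets so that similar reads can be found without all-pairs comparison. A toy Python illustration (hash choices and parameters are arbitrary; this is not the CARE implementation):

```python
# MinHash signatures of read k-mer sets and an estimated Jaccard similarity.
import hashlib

def kmers(read, k=5):
    return {read[i:i + k] for i in range(len(read) - k + 1)}

def minhash_signature(kmer_set, num_hashes=32):
    sig = []
    for seed in range(num_hashes):
        sig.append(min(int(hashlib.blake2b(f"{seed}:{km}".encode(),
                                           digest_size=8).hexdigest(), 16)
                       for km in kmer_set))
    return sig

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

r1 = "ACGTTGACCAGTACGTTGACA"
r2 = "ACGTTGACCAGTACGTTGTCA"      # one substitution error
print(estimated_jaccard(minhash_signature(kmers(r1)),
                        minhash_signature(kmers(r2))))
```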

Hybrid recommendation methods in complex networks

2015

We propose here two new recommendation methods, based on the appropriate normalization of existing similarity measures and on the convex combination of the recommendation scores derived from the similarity between users and between objects. We validate the proposed measures on three relevant data sets and compare their performance with several recommendation systems recently proposed in the literature. We show that the proposed similarity measures yield a performance improvement of up to 20% over existing non-parametric methods, and that the accuracy of a recommendation can vary widely from one specific bipartite network to another, which suggests that a …

Statistics and Probability; Normalization (statistics); Computer Science - Social and Information Networks (cs.SI); FOS: Computer and information sciences; Physics - Physics and Society (physics.soc-ph); Computer science; Nonparametric statistics; FOS: Physical sciences; Condensed Matter Physics; Complex network; Recommender system; Computer Science - Information Retrieval (cs.IR); Bipartite graph; Convex combination; Data mining; Noisy data; Statistical and Nonlinear Physics
researchProduct
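
The convex-combination idea described above can be sketched in a few lines of numpy. Cosine similarity is used here as a stand-in for the paper's normalized similarity measures, and the bipartite data are random, so this is purely illustrative:

```python
# Hybrid recommendation scores: convex combination of user-based and
# object-based similarity scores on a binary user-object matrix.
import numpy as np

def cosine_sim(M):
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    U = M / norms
    return U @ U.T

rng = np.random.default_rng(4)
A = (rng.random((20, 30)) < 0.2).astype(float)    # users x objects adjacency

user_scores = cosine_sim(A) @ A                   # recommend via similar users
item_scores = A @ cosine_sim(A.T)                 # recommend via similar objects

lam = 0.5                                         # convex combination weight
scores = lam * user_scores + (1 - lam) * item_scores
scores[A > 0] = -np.inf                           # do not re-recommend known links
print(np.argsort(-scores[0])[:5])                 # top-5 objects for user 0
```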

Functional Principal Component Analysis for the explorative analysis of multisite-multivariate air pollution time series with long gaps

2013

Knowledge of urban air quality is the first step in addressing air pollution issues. For the last few decades, many cities have been able to rely on a network of monitoring stations recording concentration values for the main pollutants. This paper focuses on functional principal component analysis (FPCA) to investigate multiple pollutant datasets measured over time at multiple sites within a given urban area. Our purpose is to extend what has been proposed in the literature to data that are multisite and multivariate at the same time. The approach proves effective in highlighting relevant statistical features of the time series, giving the opportunity to identify significant pollutants and…

Statistics and Probability; Pollutant; Functional principal component analysis; Geography; Multivariate statistics; Series (mathematics); Computer science; Air pollution; Functional data analysis; Urban area; Air quality; Three-mode FPCA; EOF; Data mining; Statistics, Probability and Uncertainty; Settore SECS-S/01 - Statistica; Air quality index
researchProduct
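
For orientation, a minimal FPCA sketch on synthetic single-pollutant curves (this ignores the long gaps and the multisite-multivariate, three-mode extension the paper actually addresses): each site's time series is treated as a curve, and the eigenfunctions of the empirical covariance give the dominant temporal modes.

```python
# Basic functional PCA: eigen-decomposition of the empirical covariance of
# centred concentration curves. Data are simulated, not monitoring records.
import numpy as np

rng = np.random.default_rng(5)
t = np.linspace(0, 1, 200)
# 12 sites, one pollutant: a common cycle with site-specific amplitude + noise
curves = (np.sin(2 * np.pi * t)[None, :] * rng.normal(1, 0.3, (12, 1))
          + rng.normal(0, 0.1, (12, 200)))

mean_curve = curves.mean(axis=0)
centered = curves - mean_curve
cov = centered.T @ centered / (curves.shape[0] - 1)       # covariance over time
eigval, eigfun = np.linalg.eigh(cov)
eigval, eigfun = eigval[::-1], eigfun[:, ::-1]

scores = centered @ eigfun[:, :2]                         # site scores on PC1, PC2
explained = eigval[:2] / eigval.sum()
print(explained, scores.shape)
```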

Covariance and correlation estimators in bipartite complex systems with a double heterogeneity

2019

Complex bipartite systems are studied in Biology, Physics, Economics, and the Social Sciences, and they can suitably be described as bipartite networks. The heterogeneity of elements in these systems makes it very difficult to perform a statistical analysis of similarity starting from empirical data. Though the binary Pearson correlation coefficient has proved effective for investigating the similarity structure of some real-world bipartite networks, here we show that both the usual sample covariance and the correlation coefficient are affected by a bias due to the aforementioned heterogeneity. Such a bias affects real bipartite systems, and, for example, we report its effects on empirical dat…

Statistics and Probability; Random graph; Computer science; Complex system; Estimator; Statistical and Nonlinear Physics; Data mining; Combinatorics; Socio-economic networks; Network; Bipartite graph; Covariance and correlation; Statistics, Probability and Uncertainty
researchProduct
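
A quick numerical illustration of the kind of bias the abstract above refers to (synthetic data under an assumed heterogeneous model, not the empirical systems analysed in the paper): the rows are generated independently, yet the heterogeneous column profile alone pushes the average sample correlation between rows away from zero.

```python
# Heterogeneity-induced bias in the sample Pearson correlation of a bipartite
# binary matrix whose rows are conditionally independent.
import numpy as np

rng = np.random.default_rng(6)
n_rows, n_cols = 50, 400
col_prob = rng.beta(0.5, 5.0, size=n_cols)         # heterogeneous column activity
A = (rng.random((n_rows, n_cols)) < col_prob).astype(float)

C = np.corrcoef(A)                                  # sample correlations between rows
off_diag = C[~np.eye(n_rows, dtype=bool)]
print(off_diag.mean())                              # noticeably > 0 despite independence
```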

Iterative Cluster Analysis of Protein Interaction Data

2004

Abstract Motivation: Generation of fast hierarchical clustering tools to be applied when distances among the elements of a set are constrained, causing frequent distance ties, as happens with protein interaction data. Results: We present in this work the program UVCLUSTER, which iteratively explores distance datasets using hierarchical clustering. Once the user selects a group of proteins, UVCLUSTER converts the set of primary distances among them (i.e. the minimum number of steps, or interactions, required to connect two proteins) into secondary distances that measure the strength of the connection between each pair of proteins when the interactions for all the proteins in the group are consid…

Statistics and Probability; Saccharomyces cerevisiae Proteins; Computer science; Biochemistry; Interactome; Pattern Recognition, Automated; Set (abstract data type); Protein Interaction Mapping; Cluster (physics); Cluster Analysis; Molecular Biology; Cytoskeleton; Measure (data warehouse); Gene Expression Profiling; Proteins; Actins; Computer Science Applications; Hierarchical clustering; Computational Mathematics; Computational Theory and Mathematics; Pattern recognition (psychology); Benchmark (computing); Data mining; Algorithms; Software; Signal Transduction; Bioinformatics
researchProduct
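
In the spirit of the description above (an assumed reconstruction, not the UVCLUSTER code): primary distances are shortest paths in the interaction graph, and a secondary distance can be defined as the fraction of repeated, slightly jittered hierarchical clusterings in which a pair of proteins does not fall in the same cluster.

```python
# Primary distances from shortest paths; secondary distances from repeated
# hierarchical clustering with jitter to break the frequent distance ties.
import numpy as np
import networkx as nx
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

G = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "C")])
nodes = sorted(G)
sp = dict(nx.all_pairs_shortest_path_length(G))
primary = np.array([[sp[u][v] for v in nodes] for u in nodes], dtype=float)

rng = np.random.default_rng(7)
runs, separated = 200, np.zeros_like(primary)
for _ in range(runs):
    jittered = primary + rng.uniform(0, 1e-3, primary.shape)   # break ties
    jittered = (jittered + jittered.T) / 2
    np.fill_diagonal(jittered, 0.0)
    labels = fcluster(linkage(squareform(jittered), "average"),
                      t=2, criterion="maxclust")
    separated += labels[:, None] != labels[None, :]

secondary = separated / runs            # strength of connection between pairs
print(np.round(secondary, 2))
```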

Testing with a nuisance parameter present only under the alternative: a score-based approach with application to segmented modelling

2016

Abstract We introduce a score-type statistic to test for a non-zero regression coefficient when the relevant term involves a nuisance parameter present only under the alternative. Despite the non-regularity and complexity of the problem, and unlike previous approaches, the proposed test statistic does not require the nuisance parameter to be estimated. It is simple to implement, relying on conventional distributions such as the Normal or t, and it is justified in the setting of probabilistic coherence. We focus on testing for the existence of a breakpoint in segmented regression, and illustrate the methodology with an analysis of data on DNA copy number aberrations and gene expression profiles from…

Statistics and Probability; Score test; Nuisance variable; Piecewise linear; Threshold value; Non-standard inference; Statistics; Linear regression; Test statistic; Nuisance parameter; Segmented regression; Mathematics; Applied Mathematics; Probabilistic logic; Breakpoint detection; Modeling and Simulation; Data mining; Statistics, Probability and Uncertainty; Settore SECS-S/01 - Statistica; Journal of Statistical Computation and Simulation
researchProduct
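
To make the setting concrete (a naive illustrative sketch, not the paper's statistic, which avoids estimating the nuisance and is calibrated differently): under the alternative the model contains an extra term (x - psi)+ whose breakpoint psi is unidentified under the null; below, an ordinary score statistic for that term is simply evaluated over a grid of candidate psi values on simulated null data.

```python
# Score statistic for a segmented term (x - psi)_+ over a grid of candidate
# breakpoints, computed from a null (simple linear) fit. Data are simulated.
import numpy as np

rng = np.random.default_rng(8)
n = 200
x = np.sort(rng.uniform(0, 10, n))
y = 1.0 + 0.5 * x + rng.normal(0, 0.5, n)         # generated under the null

# Fit the null model and keep its residuals
X0 = np.column_stack([np.ones(n), x])
beta0, *_ = np.linalg.lstsq(X0, y, rcond=None)
resid = y - X0 @ beta0
sigma2 = resid @ resid / (n - 2)

# Score contribution of (x - psi)_+ at each candidate psi
psi_grid = np.quantile(x, np.linspace(0.1, 0.9, 17))
scores = []
for psi in psi_grid:
    z = np.clip(x - psi, 0, None)
    # project z off the null design so the score reflects only the extra term
    z_res = z - X0 @ np.linalg.lstsq(X0, z, rcond=None)[0]
    scores.append((z_res @ resid) / np.sqrt(sigma2 * (z_res @ z_res)))
print(np.max(np.abs(scores)))                     # naive sup-score over the grid
```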