Search results for "Mining"

showing 10 items of 1730 documents

Covariance and correlation estimators in bipartite complex systems with a double heterogeneity

2019

Complex bipartite systems are studied in Biology, Physics, Economics, and Social Sciences, and they can suitably be described as bipartite networks. The heterogeneity of elements in those systems makes it very difficult to perform a statistical analysis of similarity starting from empirical data. Though binary Pearson's correlation coefficient has proved effective to investigate the similarity structure of some real-world bipartite networks, here we show that both the usual sample covariance and correlation coefficient are affected by a bias, which is due to the aforementioned heterogeneity. Such a bias affects real bipartite systems, and, for example, we report its effects on empirical dat…

Statistics and Probability, Random graph, Computer science, Complex system, Estimator, Statistical and Nonlinear Physics, data mining, Combinatorics, socio-economic networks, network, Bipartite graph, Covariance and correlation, Statistics Probability and Uncertainty, random graph

researchProduct

Iterative Cluster Analysis of Protein Interaction Data

2004

Abstract Motivation: Generation of fast tools of hierarchical clustering to be applied when distances among elements of a set are constrained, causing frequent distance ties, as happens in protein interaction data. Results: We present in this work the program UVCLUSTER, that iteratively explores distance datasets using hierarchical clustering. Once the user selects a group of proteins, UVCLUSTER converts the set of primary distances among them (i.e. the minimum number of steps, or interactions, required to connect two proteins) into secondary distances that measure the strength of the connection between each pair of proteins when the interactions for all the proteins in the group are consid…

Statistics and Probability, Saccharomyces cerevisiae Proteins, Computer science, computer.software_genre, Biochemistry, Interactome, Pattern Recognition Automated, Set (abstract data type), Protein Interaction Mapping, Cluster (physics), Cluster Analysis, Cluster analysis, Molecular Biology, Cytoskeleton, Measure (data warehouse), Gene Expression Profiling, Proteins, Actins, Computer Science Applications, Hierarchical clustering, Gene expression profiling, Computational Mathematics, Computational Theory and Mathematics, Pattern recognition (psychology), Benchmark (computing), Data mining, computer, Algorithms, Software, Signal Transduction, Bioinformatics

researchProduct

Testing with a nuisance parameter present only under the alternative: a score-based approach with application to segmented modelling

2016

Abstract We introduce a score-type statistic to test for a non-zero regression coefficient when the relevant term involves a nuisance parameter present only under the alternative. Despite the non-regularity and complexity of the problem and unlike the previous approaches, the proposed test statistic does not require the nuisance to be estimated. It is simple to implement by relying on the conventional distributions, such as Normal or t, and it is justified in the setting of probabilistic coherence. We focus on testing for the existence of a breakpoint in segmented regression, and illustrate the methodology with an analysis on data of DNA copy number aberrations and gene expression profiles from…

Statistics and Probability, Score test, score test, Nuisance variable, piecewise linear, threshold value, computer.software_genre, 01 natural sciences, non-standard inference, 010104 statistics & probability, 03 medical and health sciences, 0302 clinical medicine, Statistics, Linear regression, Test statistic, Nuisance parameter, 0101 mathematics, Segmented regression, Statistic, Mathematics, Applied Mathematics, Probabilistic logic, Breakpoint detection, Modeling and Simulation, Data mining, Statistics Probability and Uncertainty, Settore SECS-S/01 - Statistica, computer, 030217 neurology & neurosurgery, Journal of Statistical Computation and Simulation

researchProduct

A web application for the unspecific detection of differentially expressed DNA regions in strand-specific expression data

2015

Abstract Genomic technologies allow laboratories to produce large-scale data sets, either through the use of next-generation sequencing or microarray platforms. To explore these data sets and obtain maximum value from the data, researchers view their results alongside all the known features of a given reference genome. To study transcriptional changes that occur under a given condition, researchers search for regions of the genome that are differentially expressed between different experimental conditions. In order to identify these regions several algorithms have been developed over the years, along with some bioinformatic platforms that enable their use. However, currently available appli…

Statistics and Probability, Sequence analysis, ADN, Genomics, Computational biology, Biology, computer.software_genre, Biochemistry, Genome, Computer Graphics, Expressió genètica, Web application, Humans, Molecular Biology, Gene, Internet, Microarray analysis techniques, business.industry, Genome Human, Gene Expression Profiling, Computational Biology, High-Throughput Nucleotide Sequencing, DNA, Genomics, Sequence Analysis DNA, Computer Science Applications, Gene expression profiling, Computational Mathematics, Genòmica, ComputingMethodologies_PATTERNRECOGNITION, Computational Theory and Mathematics, Data mining, business, computer, Algorithms, Genètica, Reference genome

researchProduct

The Power of Word-Frequency Based Alignment-Free Functions: a Comprehensive Large-Scale Experimental Analysis

2021

Abstract Motivation Alignment-free (AF) distance/similarity functions are a key tool for sequence analysis. Experimental studies on real datasets abound and, to some extent, there are also studies regarding their control of false positive rate (Type I error). However, assessment of their power, i.e. their ability to identify true similarity, has been limited to some members of the D2 family. The corresponding experimental studies have concentrated on short sequences, a scenario no longer adequate for current applications, where sequence lengths may vary considerably. Such a State of the Art is methodologically problematic, since information regarding a key feature such as power is either mi…

Statistics and Probability, Sequence, Similarity (geometry), Settore INF/01 - Informatica, sequence analysis, Computer science, power statistics, Alignment-Free Genomic Analysis Big Data Software Platforms Bioinformatics Algorithms, Scale (descriptive set theory), Function (mathematics), computer.software_genre, Biochemistry, Computer Science Applications, Set (abstract data type), Computational Mathematics, Range (mathematics), Computational Theory and Mathematics, sequence analysis; power statistics; alignment-free functions, alignment-free functions, Data mining, Completeness (statistics), Molecular Biology, computer, Type I and type II errors

researchProduct

Overlap and diversity in antimicrobial peptide databases: Compiling a non-redundant set of sequences

2015

Abstract Motivation: The large variety of antimicrobial peptide (AMP) databases developed to date are characterized by a substantial overlap of data and similarity of sequences. Our goals are to analyze the levels of redundancy for all available AMP databases and use this information to build a new non-redundant sequence database. For this purpose, a new software tool is introduced. Results: A comparative study of 25 AMP databases reveals the overlap and diversity among them and the internal diversity within each database. The overlap analysis shows that only one database (Peptaibol) contains exclusive data, not present in any other, whereas all sequences in the LAMP_Patent database are inc…

Statistics and Probability, Similarity (geometry), Computer science, Sequence analysis, Antimicrobial peptides, Peptaibol, Peptide, computer.software_genre, Procedures, Biochemistry, Set (abstract data type), chemistry.chemical_compound, Protein methods, Sequence Analysis Protein, Redundancy (engineering), Humans, Databases Protein, Molecular Biology, Antimicrobial cationic peptides, chemistry.chemical_classification, Sequence, Antimicrobial cationic peptide, Database, Sequence database, Sequence analysis, Computer Science Applications, Algorithm, Computational Mathematics, Chemistry, Protein database, Computational Theory and Mathematics, chemistry, Data mining, Nucleic acid database, Databases Nucleic Acid, computer, Software, Algorithms, Human

researchProduct

ArtiFuse—computational validation of fusion gene detection tools without relying on simulated reads

2019

Abstract Motivation Gene fusions are an important class of transcriptional variants that can influence cancer development and can be predicted from RNA sequencing (RNA-seq) data by multiple existing tools. However, the real-world performance of these tools is unclear due to the lack of known positive and negative events, especially with regard to fusion genes in individual samples. Often simulated reads are used, but these cannot account for all technical biases in RNA-seq data generated from real samples. Results Here, we present ArtiFuse, a novel approach that simulates fusion genes by sequence modification to the genomic reference, and therefore, can be applied to any RNA-seq dataset wit…

Statistics and Probability, Source code, Sequence analysis, Computer science, media_common.quotation_subject, Value (computer science), Genomics, computer.software_genre, Biochemistry, Fusion gene, 03 medical and health sciences, 0302 clinical medicine, Software, Molecular Biology, Gene, 030304 developmental biology, media_common, 0303 health sciences, Sequence Analysis RNA, business.industry, High-Throughput Nucleotide Sequencing, RNA, Genomics, Computer Science Applications, Computational Mathematics, Computational Theory and Mathematics, 030220 oncology & carcinogenesis, Benchmark (computing), RNA, Data mining, Gene Fusion, business, computer, Software, Bioinformatics

researchProduct

Fully Bayesian Approach to Image Restoration with an Application in Biogeography

1994

SUMMARY A common method of studying biogeographical ranges is an atlas survey, in which the research area is divided into a square grid and the data consist of the squares where observations occur. Often the observations form only an incomplete map of the true range, and a method is required to decide whether the blank squares indicate true absence or merely a lack of study there. This is essentially an image restoration problem, but it has properties that make the common empirical Bayesian procedures inadequate. Most notably, the observed image is heavily degraded, causing difficulties in the estimation of spatial interaction, and the assessment of reliability of the restoration is emphasi…

Statistics and Probability, Square tiling, Atlas (topology), Spatial interaction, Bayesian probability, Common method, computer.software_genre, Blank, Geography, Data mining, Statistics Probability and Uncertainty, Spatial analysis, computer, Image restoration, Applied Statistics

researchProduct

Testing for local structure in spatiotemporal point pattern data

2017

The detection of clustering structure in a point pattern is one of the main focuses of attention in spatiotemporal data mining. Indeed, statistical tools for clustering detection and identification of individual events belonging to clusters are welcome in epidemiology and seismology. Local second-order characteristics provide information on how an event relates to nearby events. In this work, we extend local indicators of spatial association (known as LISA functions) to the spatiotemporal context (which will be then called LISTA functions). These functions are then used to build local tests of clustering to analyse differences in local spatiotemporal structures. We present a simulation stud…

Statistics and Probability, Structure (mathematical logic), 010504 meteorology & atmospheric sciences, Event (computing), Ecological Modeling, Association (object-oriented programming), Context (language use), computer.software_genre, 01 natural sciences, 010104 statistics & probability, Identification (information), Point (geometry), Data mining, 0101 mathematics, Cluster analysis, computer, 0105 earth and related environmental sciences, Statistical hypothesis testing, Mathematics, Environmetrics

researchProduct

LipiDisease: associate lipids to diseases using literature mining

2021

Abstract Summary Lipids exhibit an essential role in cellular assembly and signaling. Dysregulation of these functions has been linked with many complications including obesity, diabetes, metabolic disorders, cancer and more. Investigating lipid profiles in such conditions can provide insights into cellular functions and possible interventions. Hence the field of lipidomics is expanding in recent years. Even though the role of individual lipids in diseases has been investigated, there is no resource to perform disease enrichment analysis considering the cumulative association of a lipid set. To address this, we have implemented the LipiDisease web server. The tool analyzes millions of recor…

Statistics and Probability, Supplementary data, Web server, AcademicSubjects/SCI01060, Computer science, Cellular functions, Computational biology, Disease, computer.software_genre, Applications Notes, Biochemistry, Field (computer science), Computer Science Applications, Computational Mathematics, Computational Theory and Mathematics, Lipidomics, Data and Text Mining, Molecular Biology, computer, Bioinformatics

researchProduct