
AUTHOR

Filippo Utro

0000-0003-3226-7642


Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistic…

2008

Abstract Background Inferring cluster structure in microarray datasets is a fundamental task for the so-called -omic sciences. It is also a fundamental question in Statistics, Data Analysis and Classification, in particular with regard to the prediction of the number of clusters in a dataset, usually established via internal validation measures. Despite the wealth of internal measures available in the literature, new ones have been recently proposed, some of them specifically for microarray data. Results We consider five such measures: Clest, Consensus (Consensus Clustering), FOM (Figure of Merit), Gap (Gap Statistics) and ME (Model Explorer), in addition to the classic WCSS (Within Cluster…

clustering microarray data; Microarray; Computer science; Statistics as Topic; computer.software_genre; lcsh:Computer applications to medicine. Medical informatics; Biochemistry; Structural Biology; Databases Genetic; Consensus clustering; Statistics; Cluster (physics); Animals; Cluster Analysis; Humans; Cluster analysis; lcsh:QH301-705.5; Molecular Biology; Oligonucleotide Array Sequence Analysis; Structure (mathematical logic); Microarray analysis techniques; Applied Mathematics; Computational Biology; Computer Science Applications; Benchmarking; ComputingMethodologies_PATTERNRECOGNITION; lcsh:Biology (General); Gene chip analysis; lcsh:R858-859.7; Data mining; DNA microarray; computer; Algorithms; Software; Research Article; BMC Bioinformatics
researchProduct

ValWorkBench: an open source Java library for cluster validation, with applications to microarray data analysis.

2015

Background: Cluster analysis is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from statistics to computer science. It is central to the life sciences due to the advent of high throughput technologies, e.g., classification of tumors. In particular, in cluster analysis, it is of relevance to assess cluster quality and to predict the number of clusters in a dataset, if any. This latter task is usually performed via internal validation measures. Despite their potentially important role, both the use of classic internal validation measures and the design of new ones, specific for microarray data, do not seem to have grea…

Software documentation; Information retrieval; Settore INF/01 - Informatica; business.industry; Computer science; Software development; Algorithm engineering; Health Informatics; Pattern discovery in bioinformatics and biomedicine; computer.software_genre; Data science; Software metric; Computer Science Applications; Software framework; Microarray cluster analysis; Software; Bioinformatics software; Software construction; Component-based software engineering; Cluster Analysis; Programming Languages; business; computer; Software; Algorithms; Computer methods and programs in biomedicine

Basic Statistical Indices for SeqAn

2009

z-score SeqAn

Omic-based strategies reveal novel links between primary metabolism and antibiotic production

2008

Settore BIO/19 - Microbiologia Generale; Proteome; Transcriptome; Actinomycetes

Textual data compression in computational biology: Algorithmic techniques

2012

Abstract In a recent review [R. Giancarlo, D. Scaturro, F. Utro, Textual data compression in computational biology: a synopsis, Bioinformatics 25 (2009) 1575–1586] the first systematic organization and presentation of the impact of textual data compression for the analysis of biological data has been given. Its main focus was on a systematic presentation of the key areas of bioinformatics and computational biology where compression has been used together with a technical presentation of how well-known notions from information theory have been adapted to successfully work on biological data. Rather surprisingly, the use of data compression is pervasive in computational biology. Starting from…

Biological data; Data Compression Theory and Practice; Alignment-free sequence comparison; Entropy; Huffman coding; Hidden Markov Models; Kolmogorov complexity; Lempel–Ziv compressors; Minimum Description Length principle; Pattern discovery in bioinformatics; Reverse engineering of biological networks; Sequence alignment; Settore INF/01 - Informatica; General Computer Science; Kolmogorov complexity; Computer science; Search engine indexing; Computational biology; Information theory; Information science; Theoretical Computer Science; Technical Presentation; Entropy (information theory); Data compression; Computer Science Review

Speeding up the Consensus Clustering methodology for microarray data analysis

2010

Abstract Background The inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The stated problem is quite difficult, in particular for microarrays, since the inferred prediction must be sensible enough to capture the inherent biological structure in a dataset, e.g., functionally related genes. Despite the rich literature present in that area, the identification of an internal validation measure that is both fast and precise has proved to be elusive. In order to partially fill this gap, we propose a speed-up of Consensus (Consensus Clustering), a methodology whose purpose…

Settore INF/01 - Informatica; lcsh:QH426-470; Computer science; Research; Applied Mathematics; Stability (learning theory); Inference; Approximation algorithm; computer.software_genre; Non-negative matrix factorization; Identification (information); lcsh:Genetics; ComputingMethodologies_PATTERNRECOGNITION; Computational Theory and Mathematics; lcsh:Biology (General); Structural Biology; Consensus clustering; Benchmark (computing); Data mining; internal validation measures; data mining; microarray data; NMF; Cluster analysis; computer; Molecular Biology; lcsh:QH301-705.5; Algorithms for Molecular Biology

Textual data compression in computational biology: a synopsis.

2009

Abstract Motivation: Textual data compression, and the associated techniques coming from information theory, are often perceived as being of interest for data communication and storage. However, they are also deeply related to classification and data mining and analysis. In recent years, a substantial effort has been made for the application of textual data compression techniques to various computational biology tasks, ranging from storage and indexing of large datasets to comparison and reverse engineering of biological networks. Results: The main focus of this review is on a systematic presentation of the key areas of bioinformatics and computational biology where compression has been use…

Statistics and Probability; Databases Factual; business.industry; Computer science; media_common.quotation_subject; Search engine indexing; compression data; Computational Biology; Information Storage and Retrieval; Computational biology; Biochemistry; Data science; Computer Science Applications; Computational Mathematics; Presentation; Software; Computational Theory and Mathematics; Benchmark (computing); business; Molecular Biology; Biological network; Software; Data compression; media_common; Bioinformatics (Oxford, England)

Functional Information, Biomolecular Messages and Complexity of BioSequences and Structures

2010

In the quest for a mathematical measure able to capture and shed light on the dual notions of information and complexity in biosequences, Hazen et al. have introduced the notion of Functional Information (FI for short). It is also the result of earlier considerations and findings by Szostak and Carothers et al. Based on the experiments by Carothers et al., regarding FI in RNA binding activities, we decided to study the relation existing between FI and classic measures of complexity applied on protein-DNA interactions on a genome-wide scale. Using classic complexity measures, i.e., Shannon entropy and Kolmogorov Complexity as both estimated by data compression, we found that FI applied to pro…

sequence complexity; Functional Activity; Sequence Complexity; Combinatorics on Words; Protein-DNA interaction; combinatorics on words; Functional activity; protein-DNA interaction

Epigenomic k-mer dictionaries: shedding light on how sequence composition influences in vivo nucleosome positioning

2014

Abstract Motivation: Information-theoretic and compositional analysis of biological sequences, in terms of k-mer dictionaries, has a well established role in genomic and proteomic studies. Much less so in epigenomics, although the role of k-mers in chromatin organization and nucleosome positioning is particularly relevant. Fundamental questions concerning the informational content and compositional structure of nucleosome favouring and disfavoring sequences with respect to their basic building blocks still remain open. Results: We present the first analysis on the role of k-mers in the composition of nucleosome enriched and depleted genomic regions (NER and NDR for short) that is: (i) exhau…

Epigenomics; Statistics and Probability; Genetics; Supplementary data; Sequence; Genome; Settore INF/01 - Informatica; Sequence Analysis DNA; Computational biology; Algorithms and Data Structures; Bioinformatics; Biology; Chromatin Assembly and Disassembly; Biochemistry; Nucleosomes; Computer Science Applications; Computational Mathematics; Computational Theory and Mathematics; k-mer; Animals; Humans; Nucleosome; Molecular Biology; Composition (language); Epigenomics

Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis

2012

Abstract The advent of high throughput technologies, in particular microarrays, for biological research has revived interest in clustering, resulting in a plethora of new clustering algorithms. However, model selection, i.e., the identification of the correct number of clusters in a dataset, has received relatively little attention. Indeed, although central for statistics, its difficulty is also well known. Fortunately, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and best performing in terms of pre…

Settore INF/01 - Informatica; General Computer Science; business.industry; Computer science; Bioinformatics; Model selection; General statistics; Machine learning; computer.software_genre; Theoretical Computer Science; Computational biology; Analysis of massive datasets; Machine learning; Cluster (physics); Algorithms and data structures; General statistics; Analysis of massive datasets; Machine learning; Computational biology; Bioinformatics; Algorithms and data structures; Algorithm design; Artificial intelligence; Cluster analysis; business; Completeness (statistics); computer; Computer Science(all); Theoretical Computer Science

A basic analysis toolkit for biological sequences

2007

This paper presents a software library, nicknamed BATS, for some basic sequence analysis tasks. Namely, local alignments, via approximate string matching, and global alignments, via longest common subsequence and alignments with affine and concave gap cost functions. Moreover, it also supports filtering operations to select strings from a set and establish their statistical significance, via z-score computation. None of the algorithms is new, but although they are generally regarded as fundamental for sequence analysis, they have not been implemented in a single and consistent software package, as we do here. Therefore, our main contribution is to fill this gap between algorithmic theory an…

Theoretical computer science; lcsh:QH426-470; Computer science; business.industry; software; Computation; Applied Mathematics; String searching algorithm; Approximate string matching; Software Article; Set (abstract data type); Longest common subsequence problem; lcsh:Genetics; Software; Computational Theory and Mathematics; lcsh:Biology (General); Structural Biology; Affine transformation; Perl; business; computer; Molecular Biology; lcsh:QH301-705.5; computer.programming_language

DNA combinatorial messages and Epigenomics: The case of chromatin organization and nucleosome occupancy in eukaryotic genomes

2019

Abstract Epigenomics is the study of modifications on the genetic material of a cell that do not depend on changes in the DNA sequence, since those latter involve specific proteins around which DNA wraps. The end result is that Epigenomic changes have a fundamental role in the proper working of each cell in Eukaryotic organisms. A particularly important part of Epigenomics concentrates on the study of chromatin, that is, a fiber composed of a DNA-protein complex and very characterizing of Eukaryotes. Understanding how chromatin is assembled and how it changes is fundamental for Biology. In more than thirty years of research in this area, Mathematics and Theoretical Computer Science have gai…

0303 health sciences; Settore INF/01 - Informatica; General Computer Science; Fiber (mathematics); 0102 computer and information sciences; Computational biology; 01 natural sciences; Nucleosome occupancy; Genome; DNA sequencing; Theoretical Computer Science; Chromatin; Computational biology; 03 medical and health sciences; chemistry.chemical_compound; chemistry; 010201 computation theory & mathematics; Computer Science; Algorithms and complexity; Formal language; A fibers; DNA; Combinatorics on word; 030304 developmental biology; Epigenomics; Theoretical Computer Science

The Three Steps of Clustering In The Post-Genomic Era

2013

This chapter describes the basic algorithmic components that are involved in clustering, with particular attention to classification of microarray data.

Clustering high-dimensional data; Settore INF/01 - Informatica; business.industry; Correlation clustering; Pattern recognition; computer.software_genre; Biclustering; CURE data clustering algorithm; Clustering; Classification; Biological Data Mining; Consensus clustering; Artificial intelligence; Data mining; business; Cluster analysis; computer; Mathematics

A methodology to assess the intrinsic discriminative ability of a distance function and its interplay with clustering algorithms for microarray data …

2013

Abstract Background Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from statistics to computer science. Following Handl et al., it can be summarized as a three step process: (1) choice of a distance function; (2) choice of a clustering algorithm; (3) choice of a validation method. Although such a purist approach to clustering is hardly seen in many areas of science, genomic data require that level of attention, if inferences made from cluster analysis have to be of some relevance to biomedical research. Results A procedure is proposed for the assessment of the discriminative ability of a distance functi…

Computer science; computer.software_genre; Biochemistry; symbols.namesake; Discriminative model; Structural Biology; Cluster Analysis; Relevance (information retrieval); Cluster analysis; Molecular Biology; Oligonucleotide Array Sequence Analysis; Clustering; discriminative ability of a distance function; external validation indices; Settore INF/01 - Informatica; Research; Applied Mathematics; Mutual information; Pearson product-moment correlation coefficient; Computer Science Applications; Hierarchical clustering; Euclidean distance; Range (mathematics); Metric (mathematics); symbols; Data mining; Transcriptome; computer; Algorithms; BMC Bioinformatics

Statistical Indexes for Computational and Data Driven Class Discovery in Microarray Data

2009

clustering

Bayesian versus data driven model selection for microarray data

2014

Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from Statistics to Computer Science. In this beautiful area, one of the most difficult challenges is a particular instance of the model selection problem, i.e., the identification of the correct number of clusters in a dataset. In what follows, for ease of reference, we refer to that instance still as model selection. It is an important part of any statistical analysis. The techniques used for solving it are mainly either Bayesian or data-driven, and are both based on internal knowledge. That is, they use information obtained by processing the input data. A…

Clustering; Model selection; Bayesian information criterion; Akaike information criterion; Minimum message length; Bioinformatics; Settore INF/01 - Informatica; Computer science; business.industry; Model selection; Bayesian probability; computer.software_genre; Machine learning; Computer Science Applications; Data-driven; Determining the number of clusters in a data set; Identification (information); Bayesian information criterion; Data mining; Artificial intelligence; Akaike information criterion; Cluster analysis; business; computer

Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies

2013

High-throughput sequencing technologies produce large collections of data, mainly DNA sequences with additional information, requiring the design of efficient and effective methodologies for both their compression and storage. In this context, we first provide a classification of the main techniques that have been proposed, according to three specific research directions that have emerged from the literature and, for each, we provide an overview of the current techniques. Finally, to make this review useful to researchers and technicians applying the existing software and tools, we include a synopsis of the main characteristics of the described approaches, including details on their impleme…

Sequence analysis; Computer science; business.industry; Computational Biology; High-Throughput Nucleotide Sequencing; Context (language use); Data Compression; Bioinformatics; Data science; DNA sequencing; Software; Sequence analysis; Data compression; Metagenomics; State (computer science); business; Sequence Alignment; Molecular Biology; Algorithms; Software; Information Systems; Data compression; Briefings in Bioinformatics

Computation Cluster Validation in the Big Data Era

2017

Data-driven class discovery, i.e., the inference of cluster structure in a dataset, is a fundamental task in Data Analysis, in particular for the Life Sciences. We provide a tutorial on the most common approaches used for that task, focusing on methodologies for the prediction of the number of clusters in a dataset. Although the methods that we present are general in terms of the data for which they can be used, we offer a case study relevant for Microarray Data Analysis.

Clustering high-dimensional data; Class (computer programming); Clustering validation measure; Settore INF/01 - Informatica; Computer science; business.industry; Big data; Inference; Microarrays data analysis; computer.software_genre; Gap statistic; Task (project management); ComputingMethodologies_PATTERNRECOGNITION; CURE data clustering algorithm; Consensus clustering; Hypothesis testing in statistic; Clustering Class Discovery in Data Algorithmsb Clustering algorithm; Figure of merit; Consensus clustering; Data mining; Cluster analysis; business; computer

Stability-Based Model Selection for High Throughput Genomic Data: An Algorithmic Paradigm

2012

Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from Statistics to Computer Science. In this beautiful area, one of the most difficult challenges is the model selection problem, i.e., the identification of the correct number of clusters in a dataset. In the last decade, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and best performing in terms of prediction, but the slowest in terms of time. Unfortunately…

Class (computer programming); Settore INF/01 - Informatica; business.industry; Computer science; Heuristic (computer science); Model selection; Stability (learning theory); Machine learning; computer.software_genre; Identification (information); Algorithm design; Artificial intelligence; Cluster analysis; business; Algorithms and Data Structures; Throughput (business); computer

Algorithms for internal validation clustering measures in the post genomic era.

2011

algorithm; Settore INF/01 - Informatica; post genomic era; internal validation clustering measure

The intrinsic combinatorial organization and information theoretic content of a sequence are correlated to the DNA encoded nucleosome organization of…

2015

Abstract Motivation: Thanks to research spanning nearly 30 years, two major models have emerged that account for nucleosome organization in chromatin: statistical and sequence specific. The first is based on elegant, easy to compute, closed-form mathematical formulas that make no assumptions of the physical and chemical properties of the underlying DNA sequence. Moreover, they need no training on the data for their computation. The latter is based on some sequence regularities but, as opposed to the statistical model, it lacks the same type of closed-form formulas that, in this case, should be based on the DNA sequence only. Results: We contribute to close this important methodological gap …

0301 basic medicine; Statistics and Probability; Nucleosome organization; Computational biology; Biology; Type (model theory); Biochemistry; Genome; DNA sequencing; 03 medical and health sciences; Computational Theory and Mathematic; Nucleosome; Molecular Biology; Sequence (medicine); Genetics; Genome; Settore INF/01 - Informatica; Eukaryota; Computer Science Applications1707 Computer Vision and Pattern Recognition; Statistical model; DNA; Chromatin; Nucleosomes; Computer Science Applications; Chromatin; Settore BIO/18 - Genetica; Computational Mathematics; 030104 developmental biology; Computational Theory and Mathematics; Computational Mathematic; Bioinformatics

A Tutorial on Computational Cluster Analysis with Applications to Pattern Discovery in Microarray Data

2008

Background Inferring cluster structure in microarray datasets is a fundamental task for the so-called -omic sciences. It is also a fundamental question in Statistics, Data Analysis and Classification, in particular with regard to the prediction of the number of clusters in a dataset, usually established via internal validation measures. Despite the wealth of internal measures available in the literature, new ones have been recently proposed, some of them specifically for microarray data. Results We consider five such measures: Clest, Consensus (Consensus Clustering), FOM (Figure of Merit), Gap (Gap Statistics) and ME (Model Explorer), in addition to the classic WCSS (Within Cluster Sum-of-S…

Microarray analysis techniques; Computer science; Applied Mathematics; computer.software_genre; Disease cluster; Clustering; Computational Mathematics; ComputingMethodologies_PATTERNRECOGNITION; Computational Theory and Mathematics; Gene chip analysis; Microarray databases; Data mining; DNA microarray; Cluster analysis; computer; Mathematics in Computer Science

The Three Steps of Clustering in the Post-Genomic Era: A Synopsis

2011

Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from Statistics to Computer Science. Following Handl et al., it can be summarized as a three step process: (a) choice of a distance function; (b) choice of a clustering algorithm; (c) choice of a validation method. Although such a purist approach to clustering is hardly seen in many areas of science, genomic data require that level of attention, if inferences made from cluster analysis have to be of some relevance to biomedical research. Unfortunately, the high dimensionality of the data and their noisy nature makes cluster analysis of genomic data particul…

cluster validation indices; Settore INF/01 - Informatica; Process (engineering); Computer science; business.industry; Genomic data; distance function; Machine learning; computer.software_genre; Object (computer science); Clustering; Cluster algorithm; Predictive power; Relevance (information retrieval); Artificial intelligence; High dimensionality; business; Cluster analysis; computer

In vitro versus in vivo compositional landscapes of histone sequence preferences in eucaryotic genomes

2018

Abstract Motivation Although the nucleosome occupancy along a genome can be in part predicted by in vitro experiments, it has been recently observed that the chromatin organization presents important differences in vitro with respect to in vivo. Such differences mainly regard the hierarchical and regular structures of the nucleosome fiber, whose existence has long been assumed, and in part also observed in vitro, but that does not apparently occur in vivo. It is also well known that the DNA sequence has a role in determining the nucleosome occupancy. Therefore, an important issue is to understand if, and to what extent, the structural differences in the chromatin organization between in vit…

0301 basic medicine; Statistics and Probability; ved/biology.organism_classification_rank.species; Computational biology; Saccharomyces cerevisiae; Genome; Biochemistry; DNA sequencing; Histones; 03 medical and health sciences; 0302 clinical medicine; In vivo; Computational Theory and Mathematic; Nucleosome; Animals; Model organism; Caenorhabditis elegans; Molecular Biology; Sequence (medicine); Genome; biology; Settore INF/01 - Informatica; ved/biology; Computer Science Application; Chromatin; Computer Science Applications; Chromatin; Nucleosomes; Computational Mathematics; 030104 developmental biology; Histone; Eukaryotic Cells; Computational Theory and Mathematics; biology.protein; Computer Vision and Pattern Recognition; Sequence Analysis; 030217 neurology & neurosurgery

Foreword: Algorithms, Strings and Theoretical Approaches in the Big Data Era – Special Issue in Honor of the 60th Birthday of Professor Raffaele Gian…

2017

Raffaele Giancarlo was born in 1957 in Salerno, Italy. He received his Laurea Degree in Computer Science from the University of Salerno in 1982. His Laurea thesis on combinatorial algorithms on words was supervised by Professor Alberto Apostolico. Some years later, in 1984, he was one of the few young researchers attending the Advanced Research Workshop on Combinatorial Algorithms on Words held at Maratea (Italy). In the same year, he won a public competition for an Assistant Professor position at University of Salerno. He also decided to pursue graduate studies in the US. Raffaele Giancarlo obtained his Ph.D. in Computer Science from Columbia University in 1990, defending one of the first …

special issue