AUTHOR
Luca Pinello
MOESM3 of Assessment of computational methods for the analysis of single-cell ATAC-seq data
Additional file 3: Review history.
Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM
Single-cell transcriptomic assays have enabled the de novo reconstruction of lineage differentiation trajectories, along with the characterization of cellular heterogeneity and state transitions. Several methods have been developed for reconstructing developmental trajectories from single-cell transcriptomic data, but efforts on analyzing single-cell epigenomic data and on trajectory visualization remain limited. Here we present STREAM, an interactive pipeline capable of disentangling and visualizing complex branching trajectories from both single-cell transcriptomic and epigenomic data. We have tested STREAM on several synthetic and real datasets generated with different single-cell techno…
A one class KNN for signal identification: a biological case study
The paper describes an application of a one class KNN to identify different signal patterns embedded in a noise structured background. The problem becomes harder whenever only one pattern is well-represented in the signal; in such cases, one class classifier techniques are more indicated. The classification phase is applied after a preprocessing phase based on a multi layer model (MLM) that provides preliminary signal segmentation in an interval feature space. The one class KNN has been tested on synthetic and real (Saccharomyces cerevisiae) microarray data in the specific problem of DNA nucleosome and linker regions identification. Results have shown, in both cases, a good recognition rate.
A MULTI-LAYER MODEL TO STUDY GENOME-SCALE POSITIONS OF NUCLEOSOMES
The positioning of nucleosomes along chromatin has been implicated in the regulation of gene expression in eukaryotic cells, because packaging DNA into nucleosomes affects sequence accessibility. In this paper we propose a new model (called MLM) for the identification of nucleosomes and linker regions across DNA, consisting of a thresholding technique based on cut-set conditions. For this purpose we have defined a method to generate synthetic microarray data, inspired by the approach used by Yuan et al. Results have shown a good recognition rate on synthetic data; moreover, the MLM shows good agreement with the recently published method based on Hidden Markov Model …
Omic-based strategies reveal novel links between primary metabolism and antibiotic production
Distance Functions, Clustering Algorithms and Microarray Data Analysis
Distance functions are a fundamental ingredient of classification and clustering procedures, and this holds true also in the particular case of microarray data. In the general data mining and classification literature, functions such as Euclidean distance or Pearson correlation have gained their status of de facto standards thanks to a considerable amount of experimental validation. For microarray data, the issue of which distance function works best has been investigated, but no final conclusion has been reached. The aim of this extended abstract is to shed further light on that issue. Indeed, we present an experimental study, involving several distances, assessing (a) their intrinsic sepa…
Multi layer analysis.
A multi-layer method to study genome-scale positions of nucleosomes
Abstract The basic unit of eukaryotic chromatin is the nucleosome, consisting of about 150 bp of DNA wrapped around a protein core made of histone proteins. Nucleosome positions are modulated in vivo to regulate fundamental nuclear processes. To measure nucleosome positions on a genomic scale, both theoretical and experimental approaches have been recently reported. We have developed a new method, Multi-Layer Model (MLM), for the analysis of nucleosome position data obtained with a microarray-based approach. The MLM is a feature extraction method in which the input data is processed by a classifier to distinguish between several kinds of patterns. We applied our method to simulated-synthetic and…
STREAM: Single-cell Trajectories Reconstruction, Exploration And Mapping of omics data
Abstract Single-cell transcriptomic assays have enabled the de novo reconstruction of lineage differentiation trajectories, along with the characterization of cellular heterogeneity and state transitions. Several methods have been developed for reconstructing developmental trajectories from single-cell transcriptomic data, but efforts on analyzing single-cell epigenomic data and on trajectory visualization remain limited. Here we present STREAM, an interactive pipeline capable of disentangling and visualizing complex branching trajectories from both single-cell transcriptomic and epigenomic data.
A one class classifier for Signal identification: a biological case study
The paper describes an application of a one-class KNN to identify different signal patterns embedded in a noise structured background. The problem becomes harder whenever only one pattern is well represented in the signal; in such cases, one-class classifier techniques are more appropriate. The classification phase is applied after a preprocessing phase based on a Multi Layer Model (MLM) that provides a preliminary signal segmentation in an interval feature space. The one-class KNN has been tested on synthetic data that simulate microarray data for the identification of nucleosomes and linker regions across DNA. Results have shown a good recognition rate on synthetic data for nucleosome and lin…
A New Dissimilarity Measure for Clustering Seismic Signals
Hypocenter and focal mechanism of an earthquake can be determined by the analysis of signals, named waveforms, related to the wave field produced and recorded by a seismic network. Assuming that waveform similarity implies the similarity of focal parameters, the analysis of those signals characterized by very similar shapes can be used to give important details about the physical phenomena which have generated an earthquake. Recent works have shown the effectiveness of cross-correlation and/or cross-spectral dissimilarities to identify clusters of seismic events. In this work we propose a new dissimilarity measure between seismic signals whose reliability has been tested on real seismic dat…
The Three Steps of Clustering In The Post-Genomic Era
This chapter describes the basic algorithmic components that are involved in clustering, with particular attention to classification of microarray data.
A methodology to assess the intrinsic discriminative ability of a distance function and its interplay with clustering algorithms for microarray data analysis
Abstract Background Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from statistics to computer science. Following Handl et al., it can be summarized as a three step process: (1) choice of a distance function; (2) choice of a clustering algorithm; (3) choice of a validation method. Although such a purist approach to clustering is hardly seen in many areas of science, genomic data require that level of attention, if inferences made from cluster analysis have to be of some relevance to biomedical research. Results A procedure is proposed for the assessment of the discriminative ability of a distance functi…
A motif-independent metric for DNA sequence specificity
Abstract Background Genome-wide mapping of protein-DNA interactions has been widely used to investigate biological functions of the genome. An important question is to what extent such interactions are regulated at the DNA sequence level. However, current investigation is hampered by the lack of computational methods for systematically evaluating sequence specificity. Results We present a simple, unbiased quantitative measure for DNA sequence specificity called the Motif Independent Measure (MIM). By analyzing both simulated and real experimental data, we found that the MIM measure can be used to detect sequence specificity independent of the presence of transcription factor (TF) binding motifs. We…
Interval Length Analysis in Multi Layer Model
In this paper we present a hypothesis test of randomness based on the probability density function of the symmetrized Kullback-Leibler distance estimated, via a Monte Carlo simulation, from the distributions of the interval lengths detected using the Multi-Layer Model (MLM). The MLM is based on the generation of several sub-samples of an input signal; in particular, a set of optimal cut-set thresholds is applied to the data to detect signal properties. In this sense MLM is a general pattern detection method and it can be considered a preprocessing tool for pattern discovery. At present the test has been evaluated on simulated signals which respect a particular tiled microarray approach …
A New Feature Selection Methodology for K-mers Representation of DNA Sequences
Decomposing a DNA sequence into k-mers and counting their frequencies defines a mapping of the sequence into a numerical space by a numerical feature vector of fixed length. This simple process allows sequences to be compared in an alignment-free way, using common similarity and distance functions on the numerical codomain of the mapping. The most commonly used decomposition uses all the substrings of a fixed length k, making the codomain of exponential dimension. This obviously can affect the time complexity of the similarity computation, and in general of the machine learning algorithm used for the purpose of sequence analysis. Moreover, the presence of possible noisy features can also affect the…
Erratum to: A New Feature Selection Methodology for K-mers Representation of DNA Sequences
The Three Steps of Clustering in the Post-Genomic Era: A Synopsis
Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from Statistics to Computer Science. Following Handl et al., it can be summarized as a three step process: (a) choice of a distance function; (b) choice of a clustering algorithm; (c) choice of a validation method. Although such a purist approach to clustering is hardly seen in many areas of science, genomic data require that level of attention, if inferences made from cluster analysis have to be of some relevance to biomedical research. Unfortunately, the high dimensionality of the data and their noisy nature make cluster analysis of genomic data particul…
Applications of alignment-free methods in epigenomics
Epigenetic mechanisms play an important role in the regulation of cell type-specific gene activities, yet how epigenetic patterns are established and maintained remains poorly understood. Recent studies have supported a role of DNA sequences in recruitment of epigenetic regulators. Alignment-free methods have been applied to identify distinct sequence features that are associated with epigenetic patterns and to predict epigenomic profiles. Here, we review recent advances in such applications, including the methods to map DNA sequence to feature space, sequence comparison and prediction models. Computational studies using these methods have provided important insights into the epigenetic reg…
A Fuzzy One Class Classifier for Multi Layer Model
The paper describes an application of a fuzzy one-class classifier (FOC) for the identification of different signal patterns embedded in a noise structured background. The classification phase is applied after a preprocessing phase based on a Multi Layer Model (MLM) that provides a preliminary signal segmentation in an interval feature space. The FOC has been tested on synthetic and real microarray data in the specific problem of DNA nucleosome and linker regions identification. Results have shown, in both cases, a good recognition rate.
Genome-wide characterization of chromatin binding and nucleosome spacing activity of the nucleosome remodelling ATPase ISWI
The evolutionarily conserved ATP-dependent nucleosome remodelling factor ISWI can space nucleosomes affecting a variety of nuclear processes. In Drosophila, loss of ISWI leads to global transcriptional defects and to dramatic alterations in higher-order chromatin structure, especially on the male X chromosome. In order to understand if chromatin condensation and gene expression defects, observed in ISWI mutants, are directly correlated with ISWI nucleosome spacing activity, we conducted a genome-wide survey of ISWI binding and nucleosome positioning in wild-type and ISWI mutant chromatin. Our analysis revealed that ISWI binds both genic and intergenic regions. Remarkably, we found that ISWI…
Assessment of computational methods for the analysis of single-cell ATAC-seq data
Abstract Background Recent innovations in single-cell Assay for Transposase Accessible Chromatin using sequencing (scATAC-seq) enable profiling of the epigenetic landscape of thousands of individual cells. scATAC-seq data analysis presents unique methodological challenges. scATAC-seq experiments sample DNA, which, due to low copy numbers (diploid in humans), leads to inherent data sparsity (1–10% of peaks detected per cell) compared to transcriptomic (scRNA-seq) data (10–45% of expressed genes detected per cell). Such challenges in data generation emphasize the need for informative features to assess cell heterogeneity at the chromatin level. Results We present a benchmarking framework that …
A new Multi-Layers Method to Analyze Gene Expression
In the paper a new Multi-Layers approach (called Multi-Layers Model MLM) for the analysis of stochastic signals and its application to the analysis of gene expression data is presented. It consists in the generation of sub-samples from the input signal by applying a threshold technique based on cut-set optimal conditions. The MLM has been applied on synthetic and real microarray data for the identification of particular regions across DNA called nucleosomes and linkers. Nucleosomes are the fundamental repeating subunits of all eukaryotic chromatin, and their positioning provides useful information regarding the regulation of gene expression in eukaryotic cells. Results have shown a good rec…
MOESM2 of Assessment of computational methods for the analysis of single-cell ATAC-seq data
Additional file 2: Code to reproduce the analyses.
MOESM1 of Assessment of computational methods for the analysis of single-cell ATAC-seq data
Additional file 1: Figures S1–S24, Tables S1-S21, Supplementary Notes, and Supplementary figure legends