AUTHOR

Jean-Fred Fontaine

Computational identification of cell-specific variable regions in ChIP-seq data.

ABSTRACT Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is used to identify genome-wide DNA regions bound by proteins. Several sources of variation can affect the reproducibility of a particular ChIP-seq assay, which can lead to a misinterpretation of where the protein under investigation binds to the genome in a particular cell type. Given one ChIP-seq experiment with replicates, binding sites not observed in all the replicates will usually be interpreted as noise and discarded. However, the recent discovery of high-occupancy target (HOT) regions suggests that there are regions where binding of multiple transcription factors can be identified. To investigate these regions,…

research product

Interpretable machine learning models for single-cell ChIP-seq imputation

Abstract Motivation: Single-cell ChIP-seq (scChIP-seq) analysis is challenging due to data sparsity. High degree of data sparsity in biological high-throughput single-cell data is generally handled with imputation methods that complete the data, but specific methods for scChIP-seq are lacking. We present SIMPA, a scChIP-seq data imputation method leveraging predictive information within bulk data from ENCODE to impute missing protein-DNA interacting regions of target histone marks or transcription factors. Results: Imputations using machine learning models trained for each single cell, each target, and each genomic region accurately preserve cell type clustering and improve pathway-related gene i…

research product

Single-cell ChIP-seq imputation with SIMPA by leveraging bulk ENCODE data

Abstract Single-cell ChIP-seq analysis is challenging due to data sparsity. We present SIMPA ( https://github.com/salbrec/SIMPA ), a single-cell ChIP-seq data imputation method leveraging predictive information within bulk ENCODE data to impute missing protein-DNA interacting regions of target histone marks or transcription factors. Machine learning models trained for each single cell, each target, and each genomic region enable drastic improvement in cell type clustering and gene identification.

research product

Automated quality control of next generation sequencing data using machine learning

Abstract Controlling quality of next generation sequencing (NGS) data files is a necessary but complex task. To address this problem, we statistically characterized common NGS quality features and developed a novel quality control procedure involving tree-based and deep learning classification algorithms. Predictive models, validated on internal data and external disease diagnostic datasets, are to some extent generalizable to data from unseen species. The derived statistical guidelines and predictive models represent a valuable resource for users of NGS data to better understand quality issues and perform automatic quality control. Our guidelines and software are available at the following …

research product

TAF-ChIP: An ultra-low input approach for genome wide chromatin immunoprecipitation assay

Chromatin immunoprecipitation (ChIP) followed by next generation sequencing is an invaluable and powerful technique to understand transcriptional regulation. However, ChIP is currently limited by the requirement of a large amount of starting material. This renders studying rare cell populations very challenging, or even impossible. Here, we present a tagmentation-assisted fragmentation ChIP (TAF-ChIP) and sequencing method to generate high-quality datasets from low cell numbers. The method relies on Tn5 transposon activity to fragment the chromatin that is immunoprecipitated, thus circumventing the need for sonication or MNase digestion to fragment. Furthermore, Tn5 adds the sequencing adapto…

research product

TAF-ChIP: an ultra-low input approach for genome-wide chromatin immunoprecipitation assay

The authors present a novel method for obtaining chromatin profiles from low cell numbers without prior nuclei isolation. The method is successfully used to generate an epigenetic profile from 100 cells with a high signal-to-noise ratio.

research product

Gene Set to Diseases (GS2D): disease enrichment analysis on human gene sets with literature data

Large sets of candidate genes derived from high-throughput biological experiments can be characterized by functional enrichment analysis. The analysis consists of comparing the functions of one gene set against those of a background gene set. Then, functions related to a significant number of genes in the gene set are expected to be relevant. Web tools offering disease enrichment analysis on gene sets are often based on gene-disease associations from manually curated or experimental data that is accurate but does not cover all diseases discussed in the literature. Using associations automatically derived from literature data could be a cost-effective method to improve the coverage of disease…

research product

Defining Human Tyrosine Kinase Phosphorylation Networks Using Yeast as an In Vivo Model Substrate.

Systematic assessment of tyrosine kinase-substrate relationships is fundamental to a better understanding of cellular signaling and its profound alterations in human diseases such as cancer. In human cells, such assessments are confounded by complex signaling networks, feedback loops, conditional activity, and intra-kinase redundancy. Here we address this challenge by exploiting the yeast proteome as an in vivo model substrate. We individually expressed 16 human non-receptor tyrosine kinases (NRTKs) in Saccharomyces cerevisiae and identified 3,279 kinase-substrate relationships involving 1,351 yeast phosphotyrosine (pY) sites. Based on the yeast data without prior information, we generated …

research product

Quality control guidelines and machine learning predictions for next generation sequencing data

Abstract Controlling the quality of next generation sequencing (NGS) data files is usually not fully automatized because of its complexity and involves strong assumptions and arbitrary choices. We have statistically characterized common NGS quality features of a large set of files and optimized the complex quality control procedure using a machine learning approach including tree-based algorithms and deep learning. Predictive models were validated using internal and external data, including applications to disease diagnosis datasets. Models are unbiased, accurate and to some extent generalizable to unseen data types and species. Given enough labelled data for training, this approach could p…

research product

RNA Sequencing of Human Peripheral Blood Cells Indicates Upregulation of Immune-Related Genes in Huntington's Disease

Huntington's disease (HD) is an autosomal dominantly inherited neurodegenerative disorder caused by a trinucleotide repeat expansion in the Huntingtin gene. As disease-modifying therapies for HD are being developed, peripheral blood cells may be used to indicate disease progression and to monitor treatment response. In order to investigate whether gene expression changes can be found in the blood of individuals with HD that distinguish them from healthy controls, we performed transcriptome analysis by next-generation sequencing (RNA-seq). We detected a gene expression signature consistent with dysregulation of immune-related functions and inflammatory response in peripheral blood from HD ca…

research product

Disease–Genes Must Guide Data Source Integration in the Gene Prioritization Process

One of the main issues in detecting the genes involved in the etiology of genetic human diseases is the integration of different types of available functional relationships between genes. Numerous approaches exploited the complementary evidence coded in heterogeneous sources of data to prioritize disease-genes, such as functional profiles or expression quantitative trait loci, but none of them to our knowledge posed the scarcity of known disease-genes as a feature of their integration methodology. Nevertheless, in contexts where data are unbalanced, that is, where one class is largely under-represented, imbalance-unaware approaches may suffer a strong decrease in performance. We claim that …

research product

Posttranslational modifications by ADAM10 shape myeloid antigen-presenting cell homeostasis in the splenic marginal zone

The spleen contains phenotypically and functionally distinct conventional dendritic cell (cDC) subpopulations, termed cDC1 and cDC2, which each can be divided into several smaller and less well-characterized subsets. Despite advances in understanding the complexity of cDC ontogeny by transcriptional programming, the significance of posttranslational modifications in controlling tissue-specific cDC subset immunobiology remains elusive. Here, we identified the cell-surface–expressed A-disintegrin-and-metalloproteinase 10 (ADAM10) as an essential regulator of cDC1 and cDC2 homeostasis in the splenic marginal zone (MZ). Mice with a CD11c-specific deletion of ADAM10 (ADAM10(ΔCD11c)) exhibited a …

research product

Lost Strings in Genomes: What Sense Do They Make?

We studied the sets of avoided strings observed over a family of genomes. We found that the length of the minimal avoided string rarely exceeds 9 nucleotides, regardless of the phylogeny of the genome under consideration. The lists of avoided strings observed over sets of (related) genomes were analyzed. Very low correlation was found between phylogeny and the set of those strings.

research product

DiseaseLinc: Disease Enrichment Analysis of Sets of Differentially Expressed LincRNAs

Long intergenic non-coding RNAs (lincRNAs) are long RNAs that do not encode proteins. Functional evidence is lacking for most of them. Their biogenesis is not well known, but it is thought that many lincRNAs originate from genomic duplication of coding material, resulting in pseudogenes, gene copies that lose their original function and can accumulate mutations. While most pseudogenes eventually stop producing a transcript and are erased by mutations, many of these pseudogene-based lincRNAs keep similarity to the parental gene from which they originated, possibly for functional reasons. For example, they can act as decoys for miRNAs targeting the parental gene. Enrichment analysis of fun…

research product

LipiDisease: associate lipids to diseases using literature mining

Abstract Summary: Lipids play an essential role in cellular assembly and signaling. Dysregulation of these functions has been linked with many complications including obesity, diabetes, metabolic disorders, cancer and more. Investigating lipid profiles in such conditions can provide insights into cellular functions and possible interventions. Hence, the field of lipidomics has been expanding in recent years. Even though the role of individual lipids in diseases has been investigated, there is no resource to perform disease enrichment analysis considering the cumulative association of a lipid set. To address this, we have implemented the LipiDisease web server. The tool analyzes millions of recor…

research product

Evaluation of in vivo and in vitro models of toxicity by comparison of toxicogenomics data with the literature.

Toxicity affecting humans is studied by observing the effects of chemical substances in animal organisms (in vivo) or in animal and human cultivated cell lines (in vitro). Toxicogenomics studies collect gene expression profiles and histopathology assessment data for hundreds of drugs and pollutants in standardized experimental designs using different model systems. These data are an invaluable source for analyzing genome-wide drug response in biological systems. However, a problem remains: how to evaluate the suitability of heterogeneous in vitro and in vivo systems to model the many different aspects of human toxicity. We propose here that a given model system (cell type or animal o…

research product

Statistical guidelines for quality control of next-generation sequencing techniques.

Condition-specific statistical guidelines and accurate classification trees for quality control of functional genomics NGS files (RNA-seq, ChIP-seq and DNase-seq) have been generated using thousands of reference files from the ENCODE project and made available to the community.

research product