
AUTHOR

Miguel A. Andrade-Navarro

Evolutionary stability of topologically associating domains is associated with conserved gene regulation

Abstract Background The human genome is highly organized in the three-dimensional nucleus. Chromosomes fold locally into topologically associating domains (TADs) defined by increased intra-domain chromatin contacts. TADs contribute to gene regulation by restricting chromatin interactions of regulatory sequences, such as enhancers, with their target genes. Disruption of TADs can result in altered gene expression and is associated with genetic diseases and cancers. However, it is not clear to what extent TAD regions are conserved in evolution and whether disruption of TADs by evolutionary rearrangements can alter gene expression. Results Here, we hypothesize that TADs represent essential functiona…

research product

Proteome-wide comparison between the amino acid composition of domains and linkers

Objective Amino acid composition is a sequence feature that has been extensively used to characterize proteomes of many species and protein families. Yet the analysis of amino acid composition of protein domains and the linkers connecting them has received less attention. Here, we perform both a comprehensive full-proteome amino acid composition analysis and a similar analysis focusing on domains and linkers, to uncover domain- or linker-specific differential amino acid usage patterns. Results The amino acid composition in the 38 proteomes studied showcases the greater variability found in archaea and bacteria species compared to eukaryotes. When focusing on domains and linkers, we describe …

research product

Evaluating Cell Identity from Transcription Profiles

Summary Induced pluripotent stem cells (iPS) and direct lineage programming offer promising autologous and patient-specific sources of cells for personalized drug-testing and cell-based therapy. Before these engineered cells can be widely used, it is important to evaluate how well the engineered cell types resemble their intended target cell types. We have developed a method to generate CellScore, a cell identity score that can be used to evaluate the success of an engineered cell type in relation to both its initial and desired target cell type, which are used as references. Of 20 cell transitions tested, the most successful transitions were the iPS cells (CellScore > 0.9), while other t…

research product

Toward completion of the Earth’s proteome: an update a decade later

Protein databases are steadily growing driven by the spread of new more efficient sequencing techniques. This growth is dominated by an increase in redundancy (homologous proteins with various degrees of sequence similarity) and by the incapability to process and curate sequence entries as fast as they are created. To understand these trends and aid bioinformatic resources that might be compromised by the increasing size of the protein sequence databases, we have created a less-redundant protein data set. In parallel, we analyzed the evolution of protein sequence databases in terms of size and redundancy. While the SwissProt database has decelerated its growth mostly because of a focus on i…

research product

Avoided motifs: short amino acid strings missing from protein datasets.

Abstract According to the amino acid composition of natural proteins, it could be expected that all possible sequences of three or four amino acids will occur at least once in large protein datasets purely by chance. However, in some species or cellular context, specific short amino acid motifs are missing due to unknown reasons. We describe these as Avoided Motifs, short amino acid combinations missing from biological sequences. Here we identify 209 human and 154 bacterial Avoided Motifs of length four amino acids, and discuss their possible functionality according to their presence in other species. Furthermore, we determine two Avoided Motifs of length three amino acids in human proteins…

research product

RepeatsDB in 2021: improved data and extended classification for protein tandem repeat structures

The RepeatsDB database (URL: https://repeatsdb.org/) provides annotations and classification for protein tandem repeat structures from the Protein Data Bank (PDB). Protein tandem repeats are ubiquitous in all branches of the tree of life. The accumulation of solved repeat structures provides new possibilities for classification and detection, but also increases the need for annotation. Here we present RepeatsDB 3.0, which addresses these challenges and presents an extended classification scheme. The major conceptual change compared to the previous version is the hierarchical classification combining top levels based solely on structural similarity (Class > Topology > Fold) with two new lev…

research product

Towards identifying drug side effects from social media using active learning and crowd sourcing.

Motivation Social media is a largely untapped source of information on side effects of drugs. Twitter in particular is widely used to report on everyday events and personal ailments. However, labeling this noisy data is a difficult problem because labeled training data is sparse and automatic labeling is error-prone. Crowd sourcing can help in such a scenario to obtain more reliable labels, but is expensive in comparison because workers have to be paid. To remedy this, semi-supervised active learning may reduce the number of labeled data needed and focus the manual labeling process on important information. Results We extracted data from Twitter using the public API. We subsequently use Ama…

research product

Traitpedia: a collaborative effort to gather species traits

Abstract Summary Traitpedia is a collaborative database aimed to collect binary traits in a tabular form for a growing number of species. Availability and implementation Traitpedia can be accessed from http://cbdm-01.zdv.uni-mainz.de/~munoz/traitpedia. Supplementary information Supplementary data are available at Bioinformatics online.

research product

Computational identification of cell-specific variable regions in ChIP-seq data.

ABSTRACT Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is used to identify genome-wide DNA regions bound by proteins. Several sources of variation can affect the reproducibility of a particular ChIP-seq assay, which can lead to a misinterpretation of where the protein under investigation binds to the genome in a particular cell type. Given one ChIP-seq experiment with replicates, binding sites not observed in all the replicates will usually be interpreted as noise and discarded. However, the recent discovery of high-occupancy target (HOT) regions suggests that there are regions where binding of multiple transcription factors can be identified. To investigate these regions,…

research product

Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases

Abstract The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotatio…

research product

RepeatsDB 2.0: improved annotation, classification, search and visualization of repeat protein structures

RepeatsDB 2.0 (URL: http://repeatsdb.bio.unipd.it/) is an update of the database of annotated tandem repeat protein structures. Repeat proteins are a widespread class of non-globular proteins carrying heterogeneous functions involved in several diseases. Here we provide a new version of RepeatsDB with an improved classification schema including high quality annotations for ∼5400 protein structures. RepeatsDB 2.0 features information on start and end positions for the repeat regions and units for all entries. The extensive growth of repeat unit characterization was possible by applying the novel ReUPred annotation method over the entire Protein Data Bank, with data quality guaranteed by a…

research product

FastaHerder2: Four Ways to Research Protein Function and Evolution with Clustering and Clustered Databases.

The accelerated growth of protein databases offers great possibilities for the study of protein function using sequence similarity and conservation. However, the huge number of sequences deposited in these databases requires new ways of analyzing and organizing the data. It is necessary to group the many very similar sequences, creating clusters with automated derived annotations useful to understand their function, evolution, and level of experimental evidence. We developed an algorithm called FastaHerder2, which can cluster any protein database, putting together very similar protein sequences based on near-full-length similarity and/or high threshold of sequence identity. We compressed 50…

research product

Between Interactions and Aggregates: The PolyQ Balance

Abstract Polyglutamine regions (polyQ) are highly abundant consecutive runs of glutamine residues. They have been generally studied in relation to the so-called polyQ-associated diseases, characterized by protein aggregation caused by the expansion of the polyglutamine tract via a CAG-slippage mechanism. However, more than 4800 human proteins contain a polyQ, and only 9 of these regions are known to be associated with disease. Computational sequence studies and experimental structure determinations are completing a more interesting picture in which polyQ emerge as a motif for modulation of protein-protein interactions. But long polyQ regions may lead to an excess of interactions, and produc…

research product

Interpretable machine learning models for single-cell ChIP-seq imputation

Abstract Motivation Single-cell ChIP-seq (scChIP-seq) analysis is challenging due to data sparsity. High degree of data sparsity in biological high-throughput single-cell data is generally handled with imputation methods that complete the data, but specific methods for scChIP-seq are lacking. We present SIMPA, a scChIP-seq data imputation method leveraging predictive information within bulk data from ENCODE to impute missing protein-DNA interacting regions of target histone marks or transcription factors. Results Imputations using machine learning models trained for each single cell, each target, and each genomic region accurately preserve cell type clustering and improve pathway-related gene i…

research product

The Developmental Transcriptome for Lytechinus variegatus Exhibits Temporally Punctuated Gene Expression Changes

Abstract Embryonic development is arguably the most complex process an organism undergoes during its lifetime, and understanding this complexity is best approached with a systems-level perspective. The sea urchin has become a highly valuable model organism for understanding developmental specification, morphogenesis, and evolution. As a non-chordate deuterostome, the sea urchin occupies an important evolutionary niche between protostomes and vertebrates. Lytechinus variegatus (Lv) is an Atlantic species that has been well studied, and which has provided important insights into signal transduction, patterning, and morphogenetic changes during embryonic and larval development. The Pacific specie…

research product

Single-cell ChIP-seq imputation with SIMPA by leveraging bulk ENCODE data

Abstract Single-cell ChIP-seq analysis is challenging due to data sparsity. We present SIMPA (https://github.com/salbrec/SIMPA), a single-cell ChIP-seq data imputation method leveraging predictive information within bulk ENCODE data to impute missing protein-DNA interacting regions of target histone marks or transcription factors. Machine learning models trained for each single cell, each target, and each genomic region enable drastic improvement in cell-type clustering and gene identification.

research product

The importance of definitions in the study of polyQ regions: A tale of thresholds, impurities and sequence context

Graphical abstract

research product

MIPPIE: the mouse integrated protein–protein interaction reference

Abstract Cells operate and react to environmental signals thanks to a complex network of protein–protein interactions (PPIs), the malfunction of which can severely disrupt cellular homeostasis. As a result, mapping and analyzing protein networks are key to advancing our understanding of biological processes and diseases. An invaluable part of these endeavors has been the house mouse (Mus musculus), the mammalian model organism par excellence, which has provided insights into human biology and disorders. The importance of investigating PPI networks in the context of mouse prompted us to develop the Mouse Integrated Protein–Protein Interaction rEference (MIPPIE). MIPPIE inherits a robust infr…

research product

Function and Evolution of Nematode RNAi Pathways

Selfish genetic elements, like transposable elements or viruses, are a threat to genomic stability. A variety of processes, including small RNA-based RNA interference (RNAi)-like pathways, has evolved to counteract these elements. Amongst these, endogenous small interfering RNA and Piwi-interacting RNA (piRNA) pathways were implicated in silencing selfish genetic elements in a variety of organisms. Nematodes have several incredibly specialized, rapidly evolving endogenous RNAi-like pathways serving such purposes. Here, we review recent research regarding the RNAi-like pathways of Caenorhabditis elegans as well as those of other nematodes, to provide an evolutionary perspective. We argue tha…

research product

Liver-Kidney-on-Chip To Study Toxicity of Drug Metabolites

Advances in organ-on-chip technologies for in vitro drug development provide an attractive alternative to ethically controversial animal testing and establish a basis for accelerated drug development. In recent years, various chip-based tissue culture systems have been developed, which are mostly optimized for cultivation of a single cell type or organoid structure and lack the representation of multi-organ interactions. Here we present an optimized microfluidic chip design consisting of interconnected compartments, which makes it possible to mimic the exchange between different organ-specific cell types and enables the study of interdependent cel…

research product

Disentangling the complexity of low complexity proteins

Abstract There are multiple definitions for low complexity regions (LCRs) in protein sequences, with all of them broadly considering LCRs as regions with fewer amino acid types compared to an average composition. Following this view, LCRs can also be defined as regions showing composition bias. In this critical review, we focus on the definition of sequence complexity of LCRs and their connection with structure. We present statistics and methodological approaches that measure low complexity (LC) and related sequence properties. Composition bias is often associated with LC and disorder, but repeats, while compositionally biased, might also induce ordered structures. We illustrate this dichot…

research product

Bioinformatics in theory and application - highlights of the 36th German Conference on Bioinformatics.

research product

The Anti-amyloid Compound DO1 Decreases Plaque Pathology and Neuroinflammation-Related Expression Changes in 5xFAD Transgenic Mice

Self-propagating amyloid-β (Aβ) aggregates or seeds possibly drive pathogenesis of Alzheimer's disease (AD). Small molecules targeting such structures might act therapeutically in vivo. Here, a fluorescence polarization assay was established that enables the detection of compound effects on both seeded and spontaneous Aβ42 aggregation. In a focused screen of anti-amyloid compounds, we identified Disperse Orange 1 (DO1) ([4-((4-nitrophenyl)diazenyl)-N-phenylaniline]), a small molecule that potently delays both seeded and non-seeded Aβ42 polymerization at substoichiometric concentrations. Mechanistic studies revealed that DO1 disrupts preformed fibrillar assemblies of synthetic Aβ42 peptides …

research product

Automated quality control of next generation sequencing data using machine learning

Abstract Controlling quality of next generation sequencing (NGS) data files is a necessary but complex task. To address this problem, we statistically characterized common NGS quality features and developed a novel quality control procedure involving tree-based and deep learning classification algorithms. Predictive models, validated on internal data and external disease diagnostic datasets, are to some extent generalizable to data from unseen species. The derived statistical guidelines and predictive models represent a valuable resource for users of NGS data to better understand quality issues and perform automatic quality control. Our guidelines and software are available at the following …

research product

TAF-ChIP: An ultra-low input approach for genome wide chromatin immunoprecipitation assay

Chromatin immunoprecipitation (ChIP) followed by next generation sequencing is an invaluable and powerful technique to understand transcriptional regulation. However, ChIP is currently limited by the requirement of a large amount of starting material. This renders studying rare cell populations very challenging, or even impossible. Here, we present a tagmentation-assisted fragmentation ChIP (TAF-ChIP) and sequencing method to generate high-quality datasets from low cell numbers. The method relies on Tn5 transposon activity to fragment the chromatin that is immunoprecipitated, thus circumventing the need for sonication or MNase digestion to fragment. Furthermore, Tn5 adds the sequencing adapto…

research product

Flanking regions determine the structure of the poly-glutamine homo-repeat in huntingtin through mechanisms common among glutamine-rich human proteins

The causative agent of Huntington's disease, the poly-Q homo-repeat in the N-terminal region of huntingtin (httex1), is flanked by a 17-residue-long fragment (N17) and a proline-rich region (PRR), which promote and inhibit the aggregation propensity of the protein, respectively, by poorly understood mechanisms. Based on experimental data obtained from site-specifically labeled NMR samples, we derived an ensemble model of httex1 that identified both flanking regions as opposing poly-Q secondary structure promoters. While N17 triggers helicity through a promiscuous hydrogen bond network involving the side chains of the first glutamines in the poly-Q tract, the PRR prom…

research product

A novel approach to investigate the evolution of structured tandem repeat protein families by exon duplication.

Tandem Repeat Proteins (TRPs) are ubiquitous in cells and are enriched in eukaryotes. They contributed to the evolution of organism complexity, specializing for functions that require quick adaptability such as immunity-related functions. To investigate the hypothesis of repeat protein evolution through exon duplication and rearrangement, we designed a tool to analyze the relationships between exon/intron patterns and structural symmetries. The tool allows comparison of the structure fragments as defined by exon/intron boundaries from Ensembl against the structural element repetitions from RepeatsDB. The all-against-all pairwise structural alignment between fragments and comparison of the t…

research product

Evolution-guided evaluation of the inverted terminal repeats of the synthetic transposon Sleeping Beauty.

Abstract Sleeping Beauty (SB) is a synthetic Tc1/mariner transposon that is widely used for genetic engineering in vertebrates, including humans. Its sequence was derived from a consensus of sequences found in fish species including the Atlantic salmon (Salmo salar). One of the functional components of SB, the transposase enzyme, has been subject to extensive mutagenesis yielding hyperactive protein variants for advanced applications. The second functional component, the transposon inverted terminal repeats (ITRs), has so far not been extensively modified, mainly due to a lack of natural sequence information. Importantly, as genome sequences become available, they can provide a rich source …

research product

TAF-ChIP: an ultra-low input approach for genome-wide chromatin immunoprecipitation assay

The authors present a novel method for obtaining chromatin profiles from low cell numbers without prior nuclei isolation. The method is successfully implemented in generating epigenetic profile from 100 cells with high signal-to-noise ratio.

research product

The latent geometry of the human protein interaction network

Abstract Motivation A series of recently introduced algorithms and models advocates for the existence of a hyperbolic geometry underlying the network representation of complex systems. Since the human protein interaction network (hPIN) has a complex architecture, we hypothesized that uncovering its latent geometry could ease challenging problems in systems biology, translating them into measuring distances between proteins. Results We embedded the hPIN to hyperbolic space and found that the inferred coordinates of nodes capture biologically relevant features, like protein age, function and cellular localization. This means that the representation of the hPIN in the two-dimensional hyperboli…

research product

Nuclear inclusions of pathogenic ataxin-1 induce oxidative stress and perturb the protein synthesis machinery

Spinocerebellar ataxia type-1 (SCA1) is caused by an abnormally expanded polyglutamine (polyQ) tract in ataxin-1. These expansions are responsible for protein misfolding and self-assembly into intranuclear inclusion bodies (IIBs) that are somehow linked to neuronal death. However, owing to lack of a suitable cellular model, the downstream consequences of IIB formation are yet to be resolved. Here, we describe a nuclear protein aggregation model of pathogenic human ataxin-1 and characterize IIB effects. Using an inducible Sleeping Beauty transposon system, we overexpressed the ATXN1(Q82) gene in human mesenchymal stem cells that are resistant to the early cytotoxic effects caused by the expr…

research product

Gene Set to Diseases (GS2D): disease enrichment analysis on human gene sets with literature data

Large sets of candidate genes derived from high-throughput biological experiments can be characterized by functional enrichment analysis. The analysis consists of comparing the functions of one gene set against that of a background gene set. Then, functions related to a significant number of genes in the gene set are expected to be relevant. Web tools offering disease enrichment analysis on gene sets are often based on gene-disease associations from manually curated or experimental data that is accurate but does not cover all diseases discussed in the literature. Using associations automatically derived from literature data could be a cost effective method to improve the coverage of disease…

research product

Expression and subcellular localization of USH1C/harmonin in the human retina provide insights into pathomechanisms and therapy

Abstract Usher syndrome (USH) is the most common form of hereditary deafness-blindness in humans. USH is a complex genetic disorder, assigned to three clinical subtypes differing in onset, course, and severity, with USH1 being the most severe. Rodent USH1 models do not reflect the ocular phenotype observed in human patients to date; hence, little is known about the pathophysiology of USH1 in the human eye. One of the USH1 genes, USH1C, exhibits extensive alternative splicing and encodes numerous harmonin protein isoforms that function as scaffolds for organizing the USH interactome. RNA-seq analysis of human retinas uncovered harmonin_a1 as the most abundant transcript of USH1C. Bulk RNA-seq…

research product

Text mining of biomedical literature: doing well, but we could be doing better.

research product

The Role of Low Complexity Regions in Protein Interaction Modes: An Illustration in Huntingtin

Low complexity regions (LCRs) are very frequent in protein sequences, generally having a lower propensity to form structured domains and tending to be much less evolutionarily conserved than globular domains. Their higher abundance in eukaryotes and in species with more cellular types agrees with a growing number of reports on their function in protein interactions regulated by post-translational modifications. LCRs facilitate the increase of regulatory and network complexity required with the emergence of organisms with more complex tissue distribution and development. Although the low conservation and structural flexibility of LCRs complicate their study, evolutionary studies of proteins …

research product

Editorial: Protein Interaction Networks in Health and Disease

The identification and annotation of protein-protein interactions (PPIs) is of great importance in systems biology. Big data produced from experimental or computational approaches allow not only the construction of large protein interaction maps but also expand our knowledge on how proteins build up molecular complexes to perform sophisticated tasks inside a cell. However, if we want to accurately understand the functionality of these complexes, we need to go beyond the simple identification of PPIs. We need to know when and where an interaction happens in the cell and also understand the flow of information through a protein interaction network. Another perspective of the research on PPI n…

research product

CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats

© The Author(s) 2020.

research product

Myeloid leukemia with transdifferentiation plasticity developing from T-cell progenitors

Unfavorable patient survival coincides with lineage plasticity observed in human acute leukemias. These cases are assumed to arise from hematopoietic stem cells, which have stable multipotent differentiation potential. However, here we report that plasticity in leukemia can result from unstable lineage identity states inherited from differentiating progenitor cells. Using mice with enhanced c-Myc expression, we show, at the single-cell level, that T-lymphoid progenitors retain broad malignant lineage potential with a high capacity to differentiate into myeloid leukemia. These T-cell-derived myeloid blasts retain expression of a defined set of T-cell transcription factors, creating a lymphoi…

research product

AnABlast: Re-searching for Protein-Coding Sequences in Genomic Regions

AnABlast is a computational tool that highlights protein-coding regions within intergenic and intronic DNA sequences which escape detection by standard gene prediction algorithms. DNA sequences with small protein-coding genes or exons, complex intron-containing genes, or degenerated DNA fragments are efficiently targeted by AnABlast. Furthermore, this algorithm is particularly useful in detecting protein-coding sequences with nonsignificant homologs to sequences in databases. AnABlast can be executed online at http://www.bioinfocabd.upo.es/anablast/.

research product

A reliable and unbiased human protein network with the disparity filter

Abstract The living cell operates thanks to an intricate network of protein interactions. Proteins activate, transport, degrade, stabilise and participate in the production of other proteins. As a result, a reliable and systematically generated protein wiring diagram is crucial for a deeper understanding of cellular functions. Unfortunately, current human protein networks are noisy and incomplete. Also, they suffer from both study and technical biases: heavily studied proteins (e.g. those of pharmaceutical interest) are known to be involved in more interactions than proteins described in only a few publications. Here, we use the experimental evidence supporting the interaction between protei…

research product

Prediction of Chromatin Accessibility in Gene-Regulatory Regions from Transcriptomics Data

Abstract The epigenetics landscape of cells plays a key role in the establishment of cell-type specific gene expression programs characteristic of different cellular phenotypes. Different experimental procedures have been developed to obtain insights into the accessible chromatin landscape including DNase-seq, FAIRE-seq and ATAC-seq. However, current downstream computational tools fail to reliably determine regulatory region accessibility from the analysis of these experimental data. In particular, currently available peak calling algorithms are very sensitive to their parameter settings and show highly heterogeneous results, which hampers a trustworthy identification of accessible chromatin…

research product

CellMap visualizes protein-protein interactions and subcellular localization

Many tools visualize protein-protein interaction (PPI) networks. The tool introduced here, CellMap, adds one crucial novelty by visualizing PPI networks in the context of subcellular localization, i.e. the location in the cell or cellular component in which a PPI happens. Users can upload images of cells and define areas of interest against which PPIs for selected proteins are displayed (by default on a cartoon of a cell). Annotations of localization are provided by the user or through our in-house database. The visualizer and server are written in JavaScript, making CellMap easy to customize and to extend by researchers and developers.

research product

Zc3h13/Flacc is required for adenosine methylation by bridging the mRNA-binding factor Rbm15/Spenito to the m6A machinery component Wtap/Fl(2)d

N6-methyladenosine (m6A) is the most abundant mRNA modification in eukaryotes, playing crucial roles in multiple biological processes. m6A is catalyzed by the activity of methyltransferase-like 3 (Mettl3), which depends on additional proteins whose precise functions remain poorly understood. Here we identified Zc3h13 (zinc finger CCCH domain-containing protein 13)/Flacc [Fl(2)d-associated complex component] as a novel interactor of m6A methyltransferase complex components in Drosophila and mice. Like other components of this complex, Flacc controls m6A levels and is involved in sex determination in Drosophila. We demonstrate that Flacc promotes m6A deposition by bridging Fl(2)d to the mRNA-…

research product

Drivers of topoisomerase II poisoning mimic and complement cytotoxicity in AML cells

Recently approved cancer drugs remain out-of-reach to most patients due to prohibitive costs and only few produce clinically meaningful benefits. An untapped alternative is to enhance the efficacy and safety of existing cancer drugs. We hypothesized that the response to topoisomerase II poisons, a very successful group of cancer drugs, can be improved by considering treatment-associated transcript levels. To this end, we analyzed transcriptomes from Acute Myeloid Leukemia (AML) cell lines treated with the topoisomerase II poison etoposide. Using complementary criteria of co-regulation within networks and of essentiality for cell survival, we identified and functionally confirmed 11 druggabl…

research product

Comprehensive translational control of tyrosine kinase expression by upstream open reading frames

Post-transcriptional control has emerged as a major regulatory event in gene expression and often occurs at the level of translation initiation. Although overexpression or constitutive activation of tyrosine kinases (TKs) through gene amplification, translocation or mutation are well-characterized oncogenic events, current knowledge about translational mechanisms of TK activation is scarce. Here, we report the presence of translational cis-regulatory upstream open reading frames (uORFs) in the majority of transcript leader sequences of human TK mRNAs. Genetic ablation of uORF initiation codons in TK transcripts resulted in enhanced translation of the associated downstream main protein-codin…

research product

The 18S ribosomal RNA m6A methyltransferase Mettl5 is required for normal walking behavior in Drosophila

RNA modifications have recently emerged as an important layer of gene regulation. N6-methyladenosine (m6A) is the most prominent modification on eukaryotic messenger RNA and has also been found on noncoding RNA, including ribosomal and small nuclear RNA. Recently, several m6A methyltransferases were identified, uncovering the specificity of m6A deposition by structurally distinct enzymes. In order to discover additional m6A enzymes, we performed an RNAi screen to deplete annotated orthologs of human methyltransferase-like proteins (METTLs) in Drosophila cells and identified CG9666, the ortholog of human METTL5. We show that CG9666 is required for specific deposition of m6A on 18S ribosomal …

research product

Comparison of inter- and intraspecies variation in humans and fruit flies

Abstract Variation is essential to species survival and adaptation during evolution. This variation is conferred by the imperfection of biochemical processes, such as mutations and alterations in DNA sequences, and can also be seen within genomes through processes such as the generation of antibodies. Recent sequencing projects have produced multiple versions of the genomes of humans and fruit flies (Drosophila melanogaster). These give us a chance to study how individual gene sequences vary within and between species. Here we arranged human and fly genes in orthologous pairs and compared such within-species variability with their degree of conservation between flies and humans. We observed …

research product

Repeatability in protein sequences

Low complexity regions (LCRs) in protein sequences have special properties that are very different from those of globular proteins. The rules that define secondary structure elements do not apply when the distribution of amino acids becomes biased. While there is a tendency towards structural disorder in LCRs, various examples, and particularly homorepeats of single amino acids, suggest that very short repeats could adopt structures that are very difficult to predict. These structures are possibly variable and dependent on the context of intra- or inter-molecular interactions. In general, short repeats in LCRs can induce structure. This could explain the observation that very short (non-perfect) rep…

research product

Defining Human Tyrosine Kinase Phosphorylation Networks Using Yeast as an In Vivo Model Substrate.

Systematic assessment of tyrosine kinase-substrate relationships is fundamental to a better understanding of cellular signaling and its profound alterations in human diseases such as cancer. In human cells, such assessments are confounded by complex signaling networks, feedback loops, conditional activity, and intra-kinase redundancy. Here we address this challenge by exploiting the yeast proteome as an in vivo model substrate. We individually expressed 16 human non-receptor tyrosine kinases (NRTKs) in Saccharomyces cerevisiae and identified 3,279 kinase-substrate relationships involving 1,351 yeast phosphotyrosine (pY) sites. Based on the yeast data without prior information, we generated …

research product

Identification of transcribed protein coding sequence remnants within lincRNAs

Abstract Long intergenic non-coding RNAs (lincRNAs) are non-coding transcripts >200 nucleotides long that do not overlap protein-coding sequences. Importantly, such elements are known to be tissue-specifically expressed and to play a widespread role in gene regulation across thousands of genomic loci. However, very little is known of the mechanisms for the evolutionary biogenesis of these RNA elements, especially given their poor conservation across species. It has been proposed that lincRNAs might arise from pseudogenes. To test this systematically, we developed a novel method that searches for remnants of protein-coding sequences within lincRNA transcripts; the hypothesis is that we can t…

research product

The Conservation of Low Complexity Regions in Bacterial Proteins Depends on the Pathogenicity of the Strain and Subcellular Location of the Protein

Low complexity regions (LCRs) in proteins are characterized by amino acid frequencies that differ from the average. These regions evolve faster and tend to be less conserved between homologs than globular domains. They are not common in bacteria, as compared to their prevalence in eukaryotes. Studying their conservation could help provide hypotheses about their function. To obtain the appropriate evolutionary focus for this rapidly evolving feature, here we study the conservation of LCRs in bacterial strains and compare their high variability to the closeness of the strains. For this, we selected 20 taxonomically diverse bacterial species and obtained the completely sequenced proteomes of t…

research product

RNA Sequencing of Human Peripheral Blood Cells Indicates Upregulation of Immune-Related Genes in Huntington's Disease

Huntington's disease (HD) is an autosomal dominantly inherited neurodegenerative disorder caused by a trinucleotide repeat expansion in the Huntingtin gene. As disease-modifying therapies for HD are being developed, peripheral blood cells may be used to indicate disease progression and to monitor treatment response. In order to investigate whether gene expression changes can be found in the blood of individuals with HD that distinguish them from healthy controls, we performed transcriptome analysis by next-generation sequencing (RNA-seq). We detected a gene expression signature consistent with dysregulation of immune-related functions and inflammatory response in peripheral blood from HD ca…

research product

m6A modulates neuronal functions and sex determination in Drosophila

N6-methyladenosine RNA (m6A) is a prevalent messenger RNA modification in vertebrates. Although its functions in the regulation of post-transcriptional gene expression are beginning to be unveiled, the precise roles of m6A during development of complex organisms remain unclear. Here we carry out a comprehensive molecular and physiological characterization of the individual components of the methyltransferase complex, as well as of the YTH domain-containing nuclear reader protein in Drosophila melanogaster. We identify the member of the split ends protein family, Spenito, as a novel bona fide subunit of the methyltransferase complex. We further demonstrate important roles of this complex in …

research product

Disease–Genes Must Guide Data Source Integration in the Gene Prioritization Process

One of the main issues in detecting the genes involved in the etiology of genetic human diseases is the integration of different types of available functional relationships between genes. Numerous approaches exploited the complementary evidence coded in heterogeneous sources of data to prioritize disease-genes, such as functional profiles or expression quantitative trait loci, but none of them to our knowledge posed the scarcity of known disease-genes as a feature of their integration methodology. Nevertheless, in contexts where data are unbalanced, that is, where one class is largely under-represented, imbalance-unaware approaches may suffer a strong decrease in performance. We claim that …

research product

7C: Computational Chromosome Conformation Capture by Correlation of ChIP-seq at CTCF motifs.

Abstract Background Knowledge of the three-dimensional structure of the genome is necessary to understand how gene expression is regulated. Recent experimental techniques such as Hi-C or ChIA-PET measure long-range chromatin interactions genome-wide but are experimentally elaborate, have limited resolution and such data is only available for a limited number of cell types and tissues. Results While ChIP-seq was not designed to detect chromatin interactions, the formaldehyde treatment in the ChIP-seq protocol cross-links proteins with each other and with DNA. Consequently, also regions that are not directly bound by the targeted TF but interact with the binding site via chromatin looping are…

research product

The distributions of protein coding genes within chromatin domains in relation to human disease.

Abstract Background Our understanding of the nuclear chromatin structure has increased hugely in recent years, mainly as a consequence of the advances in chromatin conformation capture methods like Hi-C. The unprecedented resolution of genome-wide interaction maps shows functional consequences that extend the initial thought of an efficient DNA packaging mechanism: gene regulation, DNA repair, chromosomal translocations and evolutionary rearrangements seem to be only the tip of the iceberg. One key concept emerging from this research is the topologically associating domains (TADs) whose functional role in gene regulation and their association with disease is not fully untangled. Resul…

research product

Lost Strings in Genomes: What Sense Do They Make?

We studied the sets of avoided strings observed over a family of genomes. We found that the length of the minimal avoided string rarely exceeds 9 nucleotides, regardless of the phylogeny of the genome under consideration. The lists of avoided strings observed over sets of (related) genomes were analyzed, and a very low correlation between phylogeny and the set of those strings was found.
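As a rough illustration of what a minimal avoided string is, the sketch below scans a sequence for the smallest k at which some k-mer never occurs. It is not the authors' method: the function name `minimal_avoided_strings`, the `max_k` cap, and the choice to scan a single strand over the alphabet ACGT (without reverse complements) are assumptions made only for this example.

```python
from itertools import product


def minimal_avoided_strings(genome, alphabet="ACGT", max_k=12):
    """Return (k, absent_kmers) for the smallest k with at least one absent k-mer.

    Hypothetical helper: scans one strand only and ignores reverse complements.
    Returns None if every k-mer up to max_k occurs in the sequence.
    """
    genome = genome.upper()
    for k in range(1, max_k + 1):
        # Collect every k-mer that actually occurs in the sequence.
        observed = {genome[i:i + k] for i in range(len(genome) - k + 1)}
        # Enumerate all possible k-mers and keep those never observed.
        absent = []
        for letters in product(alphabet, repeat=k):
            kmer = "".join(letters)
            if kmer not in observed:
                absent.append(kmer)
        if absent:
            return k, absent
    return None


result = minimal_avoided_strings("ACGTACGT")
```

Enumerating all 4^k candidates is only practical for small k, which fits the short lengths the abstract reports; a real genome-scale analysis would likely need a more memory-efficient index.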

research product

orthoFind Facilitates the Discovery of Homologous and Orthologous Proteins

Finding homologous and orthologous protein sequences is often the first step in evolutionary studies, annotation projects, and experiments of functional complementation. Despite all currently available computational tools, there is a requirement for easy-to-use tools that provide functional information. Here, a new web application called orthoFind is presented, which allows a quick search for homologous and orthologous proteins given one or more query sequences, allowing a recurrent and exhaustive search against reference proteomes, and being able to include user databases. It addresses the protein multidomain problem, searching for homologs with the same domain architecture, and gives a si…

research product

Missing value imputation in proximity extension assay-based targeted proteomics data

Targeted proteomics utilizing antibody-based proximity extension assays provides sensitive and highly specific quantifications of plasma protein levels. Multivariate analysis of this data is hampered by frequent missing values (random or left censored), calling for imputation approaches. While appropriate missing-value imputation methods exist, benchmarks of their performance in targeted proteomics data are lacking. Here, we assessed the performance of two methods for imputation of values missing completely at random, the previously top-benchmarked ‘missForest’ and the recently published ‘GSimp’ method. Evaluation was accomplished by comparing imputed with remeasured relative concentrations…

research product

Computational Chromosome Conformation Capture by Correlation of ChIP-seq at CTCF motifs

Background: Transcription factors (TFs) bind to gene promoters or distal regulatory elements that interact with the promoter via chromatin looping. While the TF binding sites themselves are detected genome-wide by ChIP-seq experiments, it is difficult to associate them with regulated genes without information on chromatin looping. Recent experimental techniques such as Hi-C or ChIA-PET measure long-range interactions genome-wide but are experimentally elaborate and have limited resolution. Here, we present Computational Chromosome Conformation Capture by Correlation of ChIP-seq at CTCF motifs (7C). Results: While ChIP-seq was not designed to detect contacts, the formaldehyde treatment in the ChI…

research product

Protein Interaction Networks in Health and Disease

research product

dAPE: a web server to detect homorepeats and follow their evolution.

Abstract Summary Homorepeats are low complexity regions consisting of repetitions of a single amino acid residue. There is no current consensus on the minimum number of residues needed to define a functional homorepeat, nor even on whether mismatches are allowed. Here we present dAPE, a web server that helps to follow the evolution of homorepeats based on orthology information, using a sensitive but tunable cutoff to help in the identification of emerging homorepeats. Availability and Implementation dAPE can be accessed from http://cbdm-01.zdv.uni-mainz.de/∼munoz/polyx. Supplementary information Supplementary data are available at Bioinformatics online.
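To make the idea of a tunable length cutoff concrete, here is a minimal sketch that reports perfect runs of a single residue. It does not reproduce dAPE's detection rule, which is not described here: the function name `find_homorepeats`, the `min_len` parameter, and the decision to disallow mismatches are assumptions for illustration only.

```python
import re


def find_homorepeats(seq, min_len=4):
    """Return (residue, start, length) for each perfect run of one residue.

    Hypothetical helper: min_len is the tunable cutoff, mismatches are not allowed,
    and start is a 0-based index into seq.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # One captured character followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    return [(m.group(1), m.start(), len(m.group(0))) for m in pattern.finditer(seq)]


strict = find_homorepeats("MAQQQQQQPLAAAK", min_len=4)
relaxed = find_homorepeats("MAQQQQQQPLAAAK", min_len=3)
```

Lowering `min_len` surfaces shorter runs, which is one simple way to look for emerging homorepeats; tolerating mismatches would need a different approach, such as a sliding window with a minimum residue fraction.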

research product

Detection of condition-specific marker genes from RNA-seq data with MGFR

The identification of condition-specific genes is key to advancing our understanding of cell fate decisions and disease development. Differential gene expression analysis (DGEA) has been the standard tool for this task. However, the number of samples that modern transcriptomic technologies allow us to study makes DGEA a daunting task. On the other hand, experiments with low numbers of replicates lack the statistical power to detect differentially expressed genes. We have previously developed MGFM, a tool for marker gene detection from microarrays, that is particularly useful in the latter case. Here, we have adapted the algorithm behind MGFM to detect markers in RNA-seq data. MGFR groups s…

research product

Dynamics of a Protein Interaction Network Associated to the Aggregation of polyQ-Expanded Ataxin-1

Background: Several experimental models of polyglutamine (polyQ) diseases have been previously developed that are useful for studying disease progression in the primarily affected central nervous system. However, there is a missing link between cellular and animal models that would indicate the molecular defects occurring in neurons and are responsible for the disease phenotype in vivo. Methods: Here, we used a computational approach to identify dysregulated pathways shared by an in vitro and an in vivo model of ATXN1(Q82) protein aggregation, the mutant protein that causes the neurodegenerative polyQ disease spinocerebellar ataxia type-1 (SCA1). Results: A set of common dysregulated pathwa…

research product

A Methodology to Study Pseudogenized lincRNAs

Long intergenic noncoding RNAs (lincRNAs) are known to be tissue specifically expressed and able to regulate functional protein-coding genes: some can even act as competing endogenous RNAs (ceRNAs), because microRNAs can bind to them instead of the corresponding mRNA binding sites. Some lincRNAs contain remnants of protein-coding sequences and it has been hypothesized that they might arise after a pseudogenization process. However, a major limitation in the study of such a phenomenon is the lack of proper computational tools designed to align/analyze protein-coding sequences and noncoding sequences. To overcome this limitation, we published a method that finds the remnants of protein-coding…

research product

HIPPIE v2.0: Enhancing meaningfulness and reliability of protein-protein interaction networks

The increasing number of experimentally detected interactions between proteins makes it difficult for researchers to extract the interactions relevant for specific biological processes or diseases. This makes it necessary to accompany the large-scale detection of protein-protein interactions (PPIs) with strategies and tools to generate meaningful PPI subnetworks. To this end, we generated the Human Integrated Protein-Protein Interaction rEference or HIPPIE (http://cbdm.uni-mainz.de/hippie/). HIPPIE is a one-stop resource for the generation and interpretation of PPI networks relevant to a specific research question. We provide means to generate highly reliable, context-specific PPI networks …

research product

Visualizing Human Protein‐Protein Interactions and Subcellular Localizations on Cell Images Through CellMap

Visualizing protein data remains a challenging and stimulating task. Useful and intuitive visualization tools may help advance biomolecular and medical research; unintuitive tools may bar important breakthroughs. This protocol describes two use cases for the CellMap (http://cellmap.protein.properties) web tool. The tool allows researchers to visualize human protein-protein interaction data constrained by protein subcellular localizations. In the simplest form, proteins are visualized on cell images that also show protein-protein interactions (PPIs) through lines (edges) connecting the proteins across the compartments. At a glance, this simultaneously highlights spatial constraints that prot…

research product

MGFM: a novel tool for detection of tissue and cell specific marker genes from microarray gene expression data

Background Identification of marker genes associated with a specific tissue/cell type is a fundamental challenge in genetic and cell research. Marker genes are of great importance for determining cell identity, and for understanding tissue specific gene function and the molecular mechanisms underlying complex diseases. Results We have developed a new bioinformatics tool called MGFM (Marker Gene Finder in Microarray data) to predict marker genes from microarray gene expression data. Marker genes are identified through the grouping of samples of the same type with similar marker gene expression levels. We verified our approach using two microarray data sets from the NCBI’s Gene Expression Omn…

research product

Evolutionary Study of Disorder in Protein Sequences

Intrinsically disordered proteins (IDPs) contain regions lacking intrinsic globular structure (intrinsically disordered regions, IDRs). IDPs are present across the tree of life, with great variability of IDR type and frequency even between closely related taxa. To investigate the function of IDRs, we evaluated and compared the distribution of disorder content in 10,695 reference proteomes, confirming its high variability and finding certain correlation along the Euteleostomi (bony vertebrates) lineage to number of cell types. We used the comparison of orthologs to study the function of disorder related to increase in cell types, observing that multiple interacting subunits of protein comple…

research product

A targeted proteomics investigation of the obesity paradox in venous thromboembolism

Abstract The obesity paradox, the controversial finding that obesity promotes disease development but protects against sequelae in patients, has been observed in venous thromboembolism (VTE). The aim of this investigation was to identify a body mass–related proteomic signature in VTE patients and to evaluate whether this signature mediates the obesity paradox in VTE patients. Data from the Genotyping and Molecular Phenotyping in Venous ThromboEmbolism Project, a prospective cohort study of 693 VTE patients, were analyzed. A combined end point of recurrent VTE or all-cause death was used. Relative quantification of 444 proteins was performed using high-throughput targeted proteomics technolo…

research product

DiseaseLinc: Disease Enrichment Analysis of Sets of Differentially Expressed LincRNAs

Long intergenic non-coding RNAs (LincRNAs) are long RNAs that do not encode proteins. Functional evidence is lacking for most of them. Their biogenesis is not well-known, but it is thought that many lincRNAs originate from genomic duplication of coding material, resulting in pseudogenes, gene copies that lose their original function and can accumulate mutations. While most pseudogenes eventually stop producing a transcript and become erased by mutations, many of these pseudogene-based lincRNAs keep similarity to the parental gene from which they originated, possibly for functional reasons. For example, they can act as decoys for miRNAs targeting the parental gene. Enrichment analysis of fun…

research product

Protein expression profiling suggests relevance of noncanonical pathways in isolated pulmonary embolism

Abstract Patients with isolated pulmonary embolism (PE) have a distinct clinical profile from those with deep vein thrombosis (DVT)-associated PE, with more pulmonary conditions and atherosclerosis. These findings suggest a distinct molecular pathophysiology and the potential involvement of alternative pathways in isolated PE. To test this hypothesis, data from 532 individuals from the Genotyping and Molecular Phenotyping of Venous ThromboEmbolism Project, a multicenter prospective cohort study with extensive biobanking, were analyzed. Targeted, high-throughput proteomics, machine learning, and bioinformatic methods were applied to contrast the acute-phase plasma proteomes of isolated PE pa…

research product

LipiDisease: associate lipids to diseases using literature mining

Abstract Summary Lipids play an essential role in cellular assembly and signaling. Dysregulation of these functions has been linked with many complications including obesity, diabetes, metabolic disorders, cancer and more. Investigating lipid profiles in such conditions can provide insights into cellular functions and possible interventions. Hence the field of lipidomics has been expanding in recent years. Even though the role of individual lipids in diseases has been investigated, there is no resource to perform disease enrichment analysis considering the cumulative association of a lipid set. To address this, we have implemented the LipiDisease web server. The tool analyzes millions of recor…

research product

Protein-protein interactions can be predicted using coiled coil co-evolution patterns

Abstract Protein-protein interactions are sometimes mediated by coiled coil structures. The evolutionary conservation of interacting orthologs in different species, along with the presence or absence of coiled coils in them, may help in the prediction of interacting pairs. Here, we illustrate how the presence of coiled coils in a protein can be exploited as a potential indicator for its interaction with another protein with coiled coils. The prediction capability of our strategy improves when restricting our dataset to highly reliable, known protein-protein interactions. Our study of the co-evolution of coiled coils demonstrates that pairs of interacting proteins can be distinguished from no…

research product

Assessing the low complexity of protein sequences via the low complexity triangle.

Background Proteins with low complexity regions (LCRs) have atypical sequence and structural features. Their amino acid composition varies from the expected, determined proteome-wise, and they do not follow the rules of structural folding that prevail in globular regions. One way to characterize these regions is by assessing the repeatability of a sequence, that is, calculating the local propensity of a region to be part of a repeat. Results We combine two local measures of low complexity, repeatability (using the RES algorithm) and fraction of the most frequent amino acid, to evaluate different proteomes, datasets of protein regions with specific features, and individual cases of proteins…
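One of the two local measures named in this abstract, the fraction of the most frequent amino acid, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the window size, and the step of one residue are hypothetical, and the repeatability measure (the RES algorithm) is not implemented here.

```python
from collections import Counter


def top_residue_fraction(region):
    """Fraction of the region occupied by its most frequent residue."""
    if not region:
        raise ValueError("region must not be empty")
    counts = Counter(region)
    return max(counts.values()) / len(region)


def sliding_top_fraction(seq, window=10):
    """Apply top_residue_fraction to every window of the given size (step of 1)."""
    return [
        top_residue_fraction(seq[i:i + window])
        for i in range(len(seq) - window + 1)
    ]


single = top_residue_fraction("QQQQA")
profile = sliding_top_fraction("AAAAB", window=4)
```

Values close to 1 indicate a window dominated by a single residue, while values near 1/20 would be expected only for windows that use all amino acids evenly.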

research product

PlaToLoCo: the first web meta-server for visualization and annotation of low complexity regions in proteins

Abstract Low complexity regions (LCRs) in protein sequences are characterized by a less diverse amino acid composition compared to typically observed sequence diversity. Recent studies have shown that LCRs may co-occur with intrinsically disordered regions, are highly conserved in many organisms, and often play important roles in protein functions and in diseases. In previous decades, several methods have been developed to identify regions with LCRs or amino acid bias, but most of them as stand-alone applications and currently there is no web-based tool which allows users to explore LCRs in protein sequences with additional functional annotations. We aim to fill this gap by providing PlaToL…

research product

Computational Prediction of Position Effects of Apparently Balanced Human Chromosomal Rearrangements.

Interpretation of variants of uncertain significance, especially chromosomal rearrangements in non-coding regions of the human genome, remains one of the biggest challenges in modern molecular diagnosis. To improve our understanding and interpretation of such variants, we used high-resolution three-dimensional chromosomal structural data and transcriptional regulatory information to predict position effects and their association with pathogenic phenotypes in 17 subjects with apparently balanced chromosomal abnormalities. We found that the rearrangements predict disruption of long-range chromatin interactions between several enhancers and genes whose annotated clinical features are strongly …

research product

MAGA: A Supervised Method to Detect Motifs From Annotated Groups in Alignments

Multiple sequence alignments are usually phylogenetically driven. They are studied in the framework of evolution. But sometimes, it is interesting to study residue conservation at positions unconstrained by evolutionary rules. We present a supervised method to access a layer of information difficult to appreciate visually when many protein sequences are aligned. This new tool (MAGA; http://cbdm-01.zdv.uni-mainz.de/~munoz/maga/ ) locates positions in multiple sequence alignments differentially conserved in manually defined groups of sequences.

research product

Assessment of computational methods for the analysis of single-cell ATAC-seq data

Abstract Background Recent innovations in single-cell Assay for Transposase Accessible Chromatin using sequencing (scATAC-seq) enable profiling of the epigenetic landscape of thousands of individual cells. scATAC-seq data analysis presents unique methodological challenges. scATAC-seq experiments sample DNA, which, due to low copy numbers (diploid in humans), leads to inherent data sparsity (1–10% of peaks detected per cell) compared to transcriptomic (scRNA-seq) data (10–45% of expressed genes detected per cell). Such challenges in data generation emphasize the need for informative features to assess cell heterogeneity at the chromatin level. Results We present a benchmarking framework that …

research product

Evaluation of in vivo and in vitro models of toxicity by comparison of toxicogenomics data with the literature.

Toxicity affecting humans is studied by observing the effects of chemical substances in animal organisms (in vivo) or in animal and human cultivated cell lines (in vitro). Toxicogenomics studies collect gene expression profiles and histopathology assessment data for hundreds of drugs and pollutants in standardized experimental designs using different model systems. These data are an invaluable source for analyzing genome-wide drug response in biological systems. However, a problem remains that is how to evaluate the suitability of heterogeneous in vitro and in vivo systems to model the many different aspects of human toxicity. We propose here that a given model system (cell type or animal o…

research product

Automated selection of homologs to track the evolutionary history of proteins

Background The selection of distant homologs of a query protein under study is a usual and useful application of protein sequence databases. Such sets of homologs are often applied to investigate the function of a protein and the degree to which experimental results can be transferred from one organism to another. In particular, a variety of databases facilitates static browsing for orthologs. However, these resources have a limited power when identifying orthologs between taxonomically distant species. In addition, in some situations, for a given query protein, it is advantageous to compare the sets of orthologs from different specific organisms: this recursive step-wise search might give …

research product

Statistical guidelines for quality control of next-generation sequencing techniques.

Condition-specific statistical guidelines and accurate classification trees for quality control of functional genomics NGS files (RNA-seq, ChIP-seq and DNase-seq) have been generated using thousands of reference files from the ENCODE project and made available to the community.

research product

REP2: A Web Server to Detect Common Tandem Repeats in Protein Sequences

Ensembles of tandem repeats (TRs) in protein sequences expand rapidly to form domains well suited for interactions with proteins. For this reason, they are relatively frequent. Some TRs have known structures and therefore it is advantageous to predict their presence in a protein sequence. However, since most TRs diverge quickly, their detection by classical sequence comparison algorithms is not very accurate. Previously, we developed a method and a web server that used curated profiles and thresholds for the detection of 11 common TRs. Here we present a new web server (REP2) that allows the analysis of TRs in both individual and aligned sequences. We currently provide precomputed analyses f…

research product

Co-regulation of paralog genes in the three-dimensional chromatin architecture.

Paralog genes arise from gene duplication events during evolution, which often lead to similar proteins that cooperate in common pathways and in protein complexes. Consequently, paralogs show correlation in gene expression whereby the mechanisms of co-regulation remain unclear. In eukaryotes, genes are regulated in part by distal enhancer elements through looping interactions with gene promoters. These looping interactions can be measured by genome-wide chromatin conformation capture (Hi-C) experiments, which revealed self-interacting regions called topologically associating domains (TADs). We hypothesize that paralogs share common regulatory mechanisms to enable coordinated expression acco…

research product