0000000000105880

AUTHOR

Pablo Mier

Proteome-wide comparison between the amino acid composition of domains and linkers

Objective Amino acid composition is a sequence feature that has been extensively used to characterize proteomes of many species and protein families. Yet the analysis of amino acid composition of protein domains and the linkers connecting them has received less attention. Here, we perform both a comprehensive full-proteome amino acid composition analysis and a similar analysis focusing on domains and linkers, to uncover domain- or linker-specific differential amino acid usage patterns. Results The amino acid composition in the 38 proteomes studied showcase the greater variability found in archaea and bacteria species compared to eukaryotes. When focusing on domains and linkers, we describe …

research product

Toward completion of the Earth’s proteome: an update a decade later

Protein databases are steadily growing driven by the spread of new more efficient sequencing techniques. This growth is dominated by an increase in redundancy (homologous proteins with various degrees of sequence similarity) and by the incapability to process and curate sequence entries as fast as they are created. To understand these trends and aid bioinformatic resources that might be compromised by the increasing size of the protein sequence databases, we have created a less-redundant protein data set. In parallel, we analyzed the evolution of protein sequence databases in terms of size and redundancy. While the SwissProt database has decelerated its growth mostly because of a focus on i…

research product

Avoided motifs: short amino acid strings missing from protein datasets.

Abstract According to the amino acid composition of natural proteins, it could be expected that all possible sequences of three or four amino acids will occur at least once in large protein datasets purely by chance. However, in some species or cellular context, specific short amino acid motifs are missing due to unknown reasons. We describe these as Avoided Motifs, short amino acid combinations missing from biological sequences. Here we identify 209 human and 154 bacterial Avoided Motifs of length four amino acids, and discuss their possible functionality according to their presence in other species. Furthermore, we determine two Avoided Motifs of length three amino acids in human proteins…

research product

Traitpedia: a collaborative effort to gather species traits

Abstract Summary Traitpedia is a collaborative database aimed to collect binary traits in a tabular form for a growing number of species. Availability and implementation Traitpedia can be accessed from http://cbdm-01.zdv.uni-mainz.de/~munoz/traitpedia. Supplementary information Supplementary data are available at Bioinformatics online.

research product

Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases

AbstractThe widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotatio…

research product

FastaHerder2: Four Ways to Research Protein Function and Evolution with Clustering and Clustered Databases.

The accelerated growth of protein databases offers great possibilities for the study of protein function using sequence similarity and conservation. However, the huge number of sequences deposited in these databases requires new ways of analyzing and organizing the data. It is necessary to group the many very similar sequences, creating clusters with automated derived annotations useful to understand their function, evolution, and level of experimental evidence. We developed an algorithm called FastaHerder2, which can cluster any protein database, putting together very similar protein sequences based on near-full-length similarity and/or high threshold of sequence identity. We compressed 50…

research product

Between Interactions and Aggregates: The PolyQ Balance

Abstract Polyglutamine regions (polyQ) are highly abundant consecutive runs of glutamine residues. They have been generally studied in relation to the so-called polyQ-associated diseases, characterized by protein aggregation caused by the expansion of the polyglutamine tract via a CAG-slippage mechanism. However, more than 4800 human proteins contain a polyQ, and only 9 of these regions are known to be associated with disease. Computational sequence studies and experimental structure determinations are completing a more interesting picture in which polyQ emerge as a motif for modulation of protein-protein interactions. But long polyQ regions may lead to an excess of interactions, and produc…

research product

The importance of definitions in the study of polyQ regions: A tale of thresholds, impurities and sequence context

Graphical abstract

research product

SuppFile1.fasta.txt – Supplemental material for MAGA: A Supervised Method to Detect Motifs From Annotated Groups in Alignments

Supplemental material, SuppFile1.fasta.txt for MAGA: A Supervised Method to Detect Motifs From Annotated Groups in Alignments by Pablo Mier and Miguel A Andrade-Navarro in Evolutionary Bioinformatics

research product

Disentangling the complexity of low complexity proteins

Abstract There are multiple definitions for low complexity regions (LCRs) in protein sequences, with all of them broadly considering LCRs as regions with fewer amino acid types compared to an average composition. Following this view, LCRs can also be defined as regions showing composition bias. In this critical review, we focus on the definition of sequence complexity of LCRs and their connection with structure. We present statistics and methodological approaches that measure low complexity (LC) and related sequence properties. Composition bias is often associated with LC and disorder, but repeats, while compositionally biased, might also induce ordered structures. We illustrate this dichot…

research product

Flanking regions determine the structure of the poly-glutamine homo- repeat in huntingtin through mechanisms common among glutamine-rich human proteins

International audience; The causative agent of Huntington's disease, the poly-Q homo-repeat in the N-terminal region of huntingtin (httex1), is flanked by a 17-residue-long fragment (N17) and a proline-rich region (PRR), which promote and inhibit the aggregation propensity of the protein, respectively, by poorly understood mechanisms. Based on experimental data obtained from site-specifically labeled NMR samples, we derived an ensemble model of httex1 that identified both flanking regions as opposing poly-Q secondary structure promoters. While N17 triggers helicity through a promiscuous hydrogen bond network involving the side chains of the first glutamines in the poly-Q tract, the PRR prom…

research product

A novel approach to investigate the evolution of structured tandem repeat protein families by exon duplication.

Tandem Repeat Proteins (TRPs) are ubiquitous in cells and are enriched in eukaryotes. They contributed to the evolution of organism complexity, specializing for functions that require quick adaptability such as immunity-related functions. To investigate the hypothesis of repeat protein evolution through exon duplication and rearrangement, we designed a tool to analyze the relationships between exon/intron patterns and structural symmetries. The tool allows comparison of the structure fragments as defined by exon/intron boundaries from Ensembl against the structural element repetitions from RepeatsDB. The all-against-all pairwise structural alignment between fragments and comparison of the t…

research product

Glutamine Codon Usage and polyQ Evolution in Primates Depend on the Q Stretch Length

Abstract Amino acid usage in a proteome depends mostly on its taxonomy, as it does the codon usage in transcriptomes. Here, we explore the level of variation in the codon usage of a specific amino acid, glutamine, in relation to the number of consecutive glutamine residues. We show that CAG triplets are consistently more abundant in short glutamine homorepeats (polyQ, four to eight residues) than in shorter glutamine stretches (one to three residues), leading to the evolutionary growth of the repeat region in a CAG-dependent manner. The length of orthologous polyQ regions is mostly stable in primates, particularly the short ones. Interestingly, given a short polyQ the CAG usage is higher in…

research product

The latent geometry of the human protein interaction network

Abstract Motivation A series of recently introduced algorithms and models advocates for the existence of a hyperbolic geometry underlying the network representation of complex systems. Since the human protein interaction network (hPIN) has a complex architecture, we hypothesized that uncovering its latent geometry could ease challenging problems in systems biology, translating them into measuring distances between proteins. Results We embedded the hPIN to hyperbolic space and found that the inferred coordinates of nodes capture biologically relevant features, like protein age, function and cellular localization. This means that the representation of the hPIN in the two-dimensional hyperboli…

research product

The Role of Low Complexity Regions in Protein Interaction Modes: An Illustration in Huntingtin

Low complexity regions (LCRs) are very frequent in protein sequences, generally having a lower propensity to form structured domains and tending to be much less evolutionarily conserved than globular domains. Their higher abundance in eukaryotes and in species with more cellular types agrees with a growing number of reports on their function in protein interactions regulated by post-translational modifications. LCRs facilitate the increase of regulatory and network complexity required with the emergence of organisms with more complex tissue distribution and development. Although the low conservation and structural flexibility of LCRs complicate their study, evolutionary studies of proteins …

research product

CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats

© The Author(s) 2020.

research product

AnABlast: Re-searching for Protein-Coding Sequences in Genomic Regions

AnABlast is a computational tool that highlights protein-coding regions within intergenic and intronic DNA sequences which escape detection by standard gene prediction algorithms. DNA sequences with small protein-coding genes or exons, complex intron-containing genes, or degenerated DNA fragments are efficiently targeted by AnABlast. Furthermore, this algorithm is particularly useful in detecting protein-coding sequences with nonsignificant homologs to sequences in databases. AnABlast can be executed online at http://www.bioinfocabd.upo.es/anablast/ .

research product

SuppFile2.fasta.txt – Supplemental material for MAGA: A Supervised Method to Detect Motifs From Annotated Groups in Alignments

Supplemental material, SuppFile2.fasta.txt for MAGA: A Supervised Method to Detect Motifs From Annotated Groups in Alignments by Pablo Mier and Miguel A Andrade-Navarro in Evolutionary Bioinformatics

research product

The 18S ribosomal RNA m 6 A methyltransferase Mettl5 is required for normal walking behavior in Drosophila

RNA modifications have recently emerged as an important layer of gene regulation. N6-methyladenosine (m6A) is the most prominent modification on eukaryotic messenger RNA and has also been found on noncoding RNA, including ribosomal and small nuclear RNA. Recently, several m6A methyltransferases were identified, uncovering the specificity of m6A deposition by structurally distinct enzymes. In order to discover additional m6A enzymes, we performed an RNAi screen to deplete annotated orthologs of human methyltransferase-like proteins (METTLs) in Drosophila cells and identified CG9666, the ortholog of human METTL5. We show that CG9666 is required for specific deposition of m6A on 18S ribosomal …

research product

Repeatability in protein sequences

Low complexity regions (LCRs) in protein sequences have special properties that are very different from those of globular proteins. The rules that define secondary structure elements do not apply when the distribution of amino acids becomes biased. While there is a tendency towards structural disorder in LCRs, various examples, and particularly homorepeats of single amino acids, suggest that very short repeats could adopt structures very difficult to predict. These structures are possibly variable and dependant on the context of intra- or inter-molecular interactions. In general, short repeats in LCRs can induce structure. This could explain the observation that very short (non-perfect) rep…

research product

The Conservation of Low Complexity Regions in Bacterial Proteins Depends on the Pathogenicity of the Strain and Subcellular Location of the Protein

Low complexity regions (LCRs) in proteins are characterized by amino acid frequencies that differ from the average. These regions evolve faster and tend to be less conserved between homologs than globular domains. They are not common in bacteria, as compared to their prevalence in eukaryotes. Studying their conservation could help provide hypotheses about their function. To obtain the appropriate evolutionary focus for this rapidly evolving feature, here we study the conservation of LCRs in bacterial strains and compare their high variability to the closeness of the strains. For this, we selected 20 taxonomically diverse bacterial species and obtained the completely sequenced proteomes of t…

research product

orthoFind Facilitates the Discovery of Homologous and Orthologous Proteins

Finding homologous and orthologous protein sequences is often the first step in evolutionary studies, annotation projects, and experiments of functional complementation. Despite all currently available computational tools, there is a requirement for easy-to-use tools that provide functional information. Here, a new web application called orthoFind is presented, which allows a quick search for homologous and orthologous proteins given one or more query sequences, allowing a recurrent and exhaustive search against reference proteomes, and being able to include user databases. It addresses the protein multidomain problem, searching for homologs with the same domain architecture, and gives a si…

research product

dAPE: a web server to detect homorepeats and follow their evolution.

Abstract Summary Homorepeats are low complexity regions consisting of repetitions of a single amino acid residue. There is no current consensus on the minimum number of residues needed to define a functional homorepeat, nor even if mismatches are allowed. Here we present dAPE, a web server that helps following the evolution of homorepeats based on orthology information, using a sensitive but tunable cutoff to help in the identification of emerging homorepeats. Availability and Implementation dAPE can be accessed from http://cbdm-01.zdv.uni-mainz.de/∼munoz/polyx. Supplementary information Supplementary data are available at Bioinformatics online.

research product

Evolutionary Study of Disorder in Protein Sequences

Intrinsically disordered proteins (IDPs) contain regions lacking intrinsic globular structure (intrinsically disordered regions, IDRs). IDPs are present across the tree of life, with great variability of IDR type and frequency even between closely related taxa. To investigate the function of IDRs, we evaluated and compared the distribution of disorder content in 10,695 reference proteomes, confirming its high variability and finding certain correlation along the Euteleostomi (bony vertebrates) lineage to number of cell types. We used the comparison of orthologs to study the function of disorder related to increase in cell types, observing that multiple interacting subunits of protein comple…

research product

Protein-protein interactions can be predicted using coiled coil co-evolution patterns

AbstractProtein-protein interactions are sometimes mediated by coiled coil structures. The evolutionary conservation of interacting orthologs in different species, along with the presence or absence of coiled coils in them, may help in the prediction of interacting pairs. Here, we illustrate how the presence of coiled coils in a protein can be exploited as a potential indicator for its interaction with another protein with coiled coils. The prediction capability of our strategy improves when restricting our dataset to highly reliable, known protein-protein interactions. Our study of the co-evolution of coiled coils demonstrates that pairs of interacting proteins can be distinguished from no…

research product

Assessing the low complexity of protein sequences via the low complexity triangle.

Background Proteins with low complexity regions (LCRs) have atypical sequence and structural features. Their amino acid composition varies from the expected, determined proteome-wise, and they do not follow the rules of structural folding that prevail in globular regions. One way to characterize these regions is by assessing the repeatability of a sequence, that is, calculating the local propensity of a region to be part of a repeat. Results We combine two local measures of low complexity, repeatability (using the RES algorithm) and fraction of the most frequent amino acid, to evaluate different proteomes, datasets of protein regions with specific features, and individual cases of proteins…

research product

PlaToLoCo: the first web meta-server for visualization and annotation of low complexity regions in proteins

Abstract Low complexity regions (LCRs) in protein sequences are characterized by a less diverse amino acid composition compared to typically observed sequence diversity. Recent studies have shown that LCRs may co-occur with intrinsically disordered regions, are highly conserved in many organisms, and often play important roles in protein functions and in diseases. In previous decades, several methods have been developed to identify regions with LCRs or amino acid bias, but most of them as stand-alone applications and currently there is no web-based tool which allows users to explore LCRs in protein sequences with additional functional annotations. We aim to fill this gap by providing PlaToL…

research product

MAGA: A Supervised Method to Detect Motifs From Annotated Groups in Alignments

Multiple sequence alignments are usually phylogenetically driven. They are studied in the framework of evolution. But sometimes, it is interesting to study residue conservation at positions unconstrained by evolutionary rules. We present a supervised method to access a layer of information difficult to appreciate visually when many protein sequences are aligned. This new tool (MAGA; http://cbdm-01.zdv.uni-mainz.de/~munoz/maga/ ) locates positions in multiple sequence alignments differentially conserved in manually defined groups of sequences.

research product

Automated selection of homologs to track the evolutionary history of proteins

Background The selection of distant homologs of a query protein under study is a usual and useful application of protein sequence databases. Such sets of homologs are often applied to investigate the function of a protein and the degree to which experimental results can be transferred from one organism to another. In particular, a variety of databases facilitates static browsing for orthologs. However, these resources have a limited power when identifying orthologs between taxonomically distant species. In addition, in some situations, for a given query protein, it is advantageous to compare the sets of orthologs from different specific organisms: this recursive step-wise search might give …

research product

REP2: A Web Server to Detect Common Tandem Repeats in Protein Sequences

Ensembles of tandem repeats (TRs) in protein sequences expand rapidly to form domains well suited for interactions with proteins. For this reason, they are relatively frequent. Some TRs have known structures and therefore it is advantageous to predict their presence in a protein sequence. However, since most TRs diverge quickly, their detection by classical sequence comparison algorithms is not very accurate. Previously, we developed a method and a web server that used curated profiles and thresholds for the detection of 11 common TRs. Here we present a new web server (REP2) that allows the analysis of TRs in both individual and aligned sequences. We provide currently precomputed analyses f…

research product

Additional file 2: of Automated selection of homologs to track the evolutionary history of proteins

Figure S1. Number of orthology pairwise relationships calculated with OrthoMCL, ProteinPathTracker and Reciprocal Best Hit Blast (RBHB) in 15 species, using the proteomes provided by OrthoMCL in the default species from the default path in ProteinPathTracker, and taking E. coli proteins as reference. a) All OrthoMCL pairs. b) Only the best 25% scored OrthoMCL pairs. (PNG 388Â kb)

research product

Additional file 1: of Automated selection of homologs to track the evolutionary history of proteins

List of complete reference proteomes used in the web tool, organised by evolutionary path. (XLSX 13Â kb)

research product

MOESM1 of Proteome-wide comparison between the amino acid composition of domains and linkers

Additional file 1. List of proteomes used for the analyses. Each proteome is described by the name of the species, abbreviation as used in the manuscript, UniProt organism ID, number of proteins, and percentage of amino acids from domains/linkers against the total amino acid composition of the proteome.

research product