0000000000532515
AUTHOR
Pablo Mier
Proteome-wide comparison between the amino acid composition of domains and linkers
Objective Amino acid composition is a sequence feature that has been extensively used to characterize proteomes of many species and protein families. Yet the analysis of amino acid composition of protein domains and the linkers connecting them has received less attention. Here, we perform both a comprehensive full-proteome amino acid composition analysis and a similar analysis focusing on domains and linkers, to uncover domain- or linker-specific differential amino acid usage patterns. Results The amino acid composition in the 38 proteomes studied showcases the greater variability found in archaea and bacteria species compared to eukaryotes. When focusing on domains and linkers, we describe …
Toward completion of the Earth’s proteome: an update a decade later
Protein databases are steadily growing, driven by the spread of new, more efficient sequencing techniques. This growth is dominated by an increase in redundancy (homologous proteins with various degrees of sequence similarity) and by the inability to process and curate sequence entries as fast as they are created. To understand these trends and aid bioinformatic resources that might be compromised by the increasing size of the protein sequence databases, we have created a less-redundant protein data set. In parallel, we analyzed the evolution of protein sequence databases in terms of size and redundancy. While the SwissProt database has decelerated its growth mostly because of a focus on i…
Avoided motifs: short amino acid strings missing from protein datasets.
Abstract According to the amino acid composition of natural proteins, it could be expected that all possible sequences of three or four amino acids will occur at least once in large protein datasets purely by chance. However, in some species or cellular context, specific short amino acid motifs are missing due to unknown reasons. We describe these as Avoided Motifs, short amino acid combinations missing from biological sequences. Here we identify 209 human and 154 bacterial Avoided Motifs of length four amino acids, and discuss their possible functionality according to their presence in other species. Furthermore, we determine two Avoided Motifs of length three amino acids in human proteins…
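The idea of a motif that is absent from a dataset can be illustrated with a minimal sketch. It enumerates every possible string of length k over the 20 standard amino acids and reports those never observed in a set of sequences. The function name `find_missing_motifs`, the alphabet constant, and the toy input are assumptions made for illustration; this is not the published pipeline, which may filter or define motifs differently.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes); an assumption of this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def find_missing_motifs(sequences, k):
    """Return the sorted length-k amino acid strings absent from all sequences."""
    observed = set()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k along the sequence and record each k-mer.
        for i in range(len(seq) - k + 1):
            observed.add(seq[i:i + k])
    # Every possible k-mer over the standard alphabet (20**k strings).
    all_motifs = {"".join(p) for p in product(AMINO_ACIDS, repeat=k)}
    return sorted(all_motifs - observed)


# Toy input, purely illustrative: two short made-up sequences.
toy = ["ACDEFGHIKLMNPQRSTVWY", "AAAA"]
missing = find_missing_motifs(toy, 2)
print(len(missing))
```

For k = 4 the candidate set holds 160,000 strings, which remains easy to hold in memory; larger k would call for a different strategy.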
Traitpedia: a collaborative effort to gather species traits
Abstract Summary Traitpedia is a collaborative database aimed to collect binary traits in a tabular form for a growing number of species. Availability and implementation Traitpedia can be accessed from http://cbdm-01.zdv.uni-mainz.de/~munoz/traitpedia. Supplementary information Supplementary data are available at Bioinformatics online.
Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases
Abstract The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotatio…
FastaHerder2: Four Ways to Research Protein Function and Evolution with Clustering and Clustered Databases.
The accelerated growth of protein databases offers great possibilities for the study of protein function using sequence similarity and conservation. However, the huge number of sequences deposited in these databases requires new ways of analyzing and organizing the data. It is necessary to group the many very similar sequences, creating clusters with automated derived annotations useful to understand their function, evolution, and level of experimental evidence. We developed an algorithm called FastaHerder2, which can cluster any protein database, putting together very similar protein sequences based on near-full-length similarity and/or high threshold of sequence identity. We compressed 50…
Between Interactions and Aggregates: The PolyQ Balance
Abstract Polyglutamine regions (polyQ) are highly abundant consecutive runs of glutamine residues. They have been generally studied in relation to the so-called polyQ-associated diseases, characterized by protein aggregation caused by the expansion of the polyglutamine tract via a CAG-slippage mechanism. However, more than 4800 human proteins contain a polyQ, and only 9 of these regions are known to be associated with disease. Computational sequence studies and experimental structure determinations are completing a more interesting picture in which polyQ emerge as a motif for modulation of protein-protein interactions. But long polyQ regions may lead to an excess of interactions, and produc…
The importance of definitions in the study of polyQ regions: A tale of thresholds, impurities and sequence context
Graphical abstract
SuppFile1.fasta.txt – Supplemental material for MAGA: A Supervised Method to Detect Motifs From Annotated Groups in Alignments
Supplemental material, SuppFile1.fasta.txt for MAGA: A Supervised Method to Detect Motifs From Annotated Groups in Alignments by Pablo Mier and Miguel A Andrade-Navarro in Evolutionary Bioinformatics
Disentangling the complexity of low complexity proteins
Abstract There are multiple definitions for low complexity regions (LCRs) in protein sequences, with all of them broadly considering LCRs as regions with fewer amino acid types compared to an average composition. Following this view, LCRs can also be defined as regions showing composition bias. In this critical review, we focus on the definition of sequence complexity of LCRs and their connection with structure. We present statistics and methodological approaches that measure low complexity (LC) and related sequence properties. Composition bias is often associated with LC and disorder, but repeats, while compositionally biased, might also induce ordered structures. We illustrate this dichot…
Flanking regions determine the structure of the poly-glutamine homo-repeat in huntingtin through mechanisms common among glutamine-rich human proteins
The causative agent of Huntington's disease, the poly-Q homo-repeat in the N-terminal region of huntingtin (httex1), is flanked by a 17-residue-long fragment (N17) and a proline-rich region (PRR), which promote and inhibit the aggregation propensity of the protein, respectively, by poorly understood mechanisms. Based on experimental data obtained from site-specifically labeled NMR samples, we derived an ensemble model of httex1 that identified both flanking regions as opposing poly-Q secondary structure promoters. While N17 triggers helicity through a promiscuous hydrogen bond network involving the side chains of the first glutamines in the poly-Q tract, the PRR prom…
A novel approach to investigate the evolution of structured tandem repeat protein families by exon duplication.
Tandem Repeat Proteins (TRPs) are ubiquitous in cells and are enriched in eukaryotes. They contributed to the evolution of organism complexity, specializing for functions that require quick adaptability such as immunity-related functions. To investigate the hypothesis of repeat protein evolution through exon duplication and rearrangement, we designed a tool to analyze the relationships between exon/intron patterns and structural symmetries. The tool allows comparison of the structure fragments as defined by exon/intron boundaries from Ensembl against the structural element repetitions from RepeatsDB. The all-against-all pairwise structural alignment between fragments and comparison of the t…
Glutamine Codon Usage and polyQ Evolution in Primates Depend on the Q Stretch Length
Abstract Amino acid usage in a proteome depends mostly on its taxonomy, as does codon usage in transcriptomes. Here, we explore the level of variation in the codon usage of a specific amino acid, glutamine, in relation to the number of consecutive glutamine residues. We show that CAG triplets are consistently more abundant in short glutamine homorepeats (polyQ, four to eight residues) than in shorter glutamine stretches (one to three residues), leading to the evolutionary growth of the repeat region in a CAG-dependent manner. The length of orthologous polyQ regions is mostly stable in primates, particularly the short ones. Interestingly, given a short polyQ the CAG usage is higher in…
The latent geometry of the human protein interaction network
Abstract Motivation A series of recently introduced algorithms and models advocates for the existence of a hyperbolic geometry underlying the network representation of complex systems. Since the human protein interaction network (hPIN) has a complex architecture, we hypothesized that uncovering its latent geometry could ease challenging problems in systems biology, translating them into measuring distances between proteins. Results We embedded the hPIN to hyperbolic space and found that the inferred coordinates of nodes capture biologically relevant features, like protein age, function and cellular localization. This means that the representation of the hPIN in the two-dimensional hyperboli…
The Role of Low Complexity Regions in Protein Interaction Modes: An Illustration in Huntingtin
Low complexity regions (LCRs) are very frequent in protein sequences, generally having a lower propensity to form structured domains and tending to be much less evolutionarily conserved than globular domains. Their higher abundance in eukaryotes and in species with more cellular types agrees with a growing number of reports on their function in protein interactions regulated by post-translational modifications. LCRs facilitate the increase of regulatory and network complexity required with the emergence of organisms with more complex tissue distribution and development. Although the low conservation and structural flexibility of LCRs complicate their study, evolutionary studies of proteins …
CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats
© The Author(s) 2020.
AnABlast: Re-searching for Protein-Coding Sequences in Genomic Regions
AnABlast is a computational tool that highlights protein-coding regions within intergenic and intronic DNA sequences which escape detection by standard gene prediction algorithms. DNA sequences with small protein-coding genes or exons, complex intron-containing genes, or degenerated DNA fragments are efficiently targeted by AnABlast. Furthermore, this algorithm is particularly useful in detecting protein-coding sequences with nonsignificant homologs to sequences in databases. AnABlast can be executed online at http://www.bioinfocabd.upo.es/anablast/ .
SuppFile2.fasta.txt – Supplemental material for MAGA: A Supervised Method to Detect Motifs From Annotated Groups in Alignments
Supplemental material, SuppFile2.fasta.txt for MAGA: A Supervised Method to Detect Motifs From Annotated Groups in Alignments by Pablo Mier and Miguel A Andrade-Navarro in Evolutionary Bioinformatics
The 18S ribosomal RNA m6A methyltransferase Mettl5 is required for normal walking behavior in Drosophila
RNA modifications have recently emerged as an important layer of gene regulation. N6-methyladenosine (m6A) is the most prominent modification on eukaryotic messenger RNA and has also been found on noncoding RNA, including ribosomal and small nuclear RNA. Recently, several m6A methyltransferases were identified, uncovering the specificity of m6A deposition by structurally distinct enzymes. In order to discover additional m6A enzymes, we performed an RNAi screen to deplete annotated orthologs of human methyltransferase-like proteins (METTLs) in Drosophila cells and identified CG9666, the ortholog of human METTL5. We show that CG9666 is required for specific deposition of m6A on 18S ribosomal …
Repeatability in protein sequences
Low complexity regions (LCRs) in protein sequences have special properties that are very different from those of globular proteins. The rules that define secondary structure elements do not apply when the distribution of amino acids becomes biased. While there is a tendency towards structural disorder in LCRs, various examples, and particularly homorepeats of single amino acids, suggest that very short repeats could adopt structures very difficult to predict. These structures are possibly variable and dependent on the context of intra- or inter-molecular interactions. In general, short repeats in LCRs can induce structure. This could explain the observation that very short (non-perfect) rep…
The Conservation of Low Complexity Regions in Bacterial Proteins Depends on the Pathogenicity of the Strain and Subcellular Location of the Protein
Low complexity regions (LCRs) in proteins are characterized by amino acid frequencies that differ from the average. These regions evolve faster and tend to be less conserved between homologs than globular domains. They are not common in bacteria, as compared to their prevalence in eukaryotes. Studying their conservation could help provide hypotheses about their function. To obtain the appropriate evolutionary focus for this rapidly evolving feature, here we study the conservation of LCRs in bacterial strains and compare their high variability to the closeness of the strains. For this, we selected 20 taxonomically diverse bacterial species and obtained the completely sequenced proteomes of t…
orthoFind Facilitates the Discovery of Homologous and Orthologous Proteins
Finding homologous and orthologous protein sequences is often the first step in evolutionary studies, annotation projects, and experiments of functional complementation. Despite all currently available computational tools, there is a requirement for easy-to-use tools that provide functional information. Here, a new web application called orthoFind is presented, which allows a quick search for homologous and orthologous proteins given one or more query sequences, allowing a recurrent and exhaustive search against reference proteomes, and being able to include user databases. It addresses the protein multidomain problem, searching for homologs with the same domain architecture, and gives a si…
dAPE: a web server to detect homorepeats and follow their evolution.
Abstract Summary Homorepeats are low complexity regions consisting of repetitions of a single amino acid residue. There is no current consensus on the minimum number of residues needed to define a functional homorepeat, nor even if mismatches are allowed. Here we present dAPE, a web server that helps following the evolution of homorepeats based on orthology information, using a sensitive but tunable cutoff to help in the identification of emerging homorepeats. Availability and Implementation dAPE can be accessed from http://cbdm-01.zdv.uni-mainz.de/~munoz/polyx. Supplementary information Supplementary data are available at Bioinformatics online.
Evolutionary Study of Disorder in Protein Sequences
Intrinsically disordered proteins (IDPs) contain regions lacking intrinsic globular structure (intrinsically disordered regions, IDRs). IDPs are present across the tree of life, with great variability of IDR type and frequency even between closely related taxa. To investigate the function of IDRs, we evaluated and compared the distribution of disorder content in 10,695 reference proteomes, confirming its high variability and finding certain correlation along the Euteleostomi (bony vertebrates) lineage to number of cell types. We used the comparison of orthologs to study the function of disorder related to increase in cell types, observing that multiple interacting subunits of protein comple…
Protein-protein interactions can be predicted using coiled coil co-evolution patterns
Abstract Protein-protein interactions are sometimes mediated by coiled coil structures. The evolutionary conservation of interacting orthologs in different species, along with the presence or absence of coiled coils in them, may help in the prediction of interacting pairs. Here, we illustrate how the presence of coiled coils in a protein can be exploited as a potential indicator for its interaction with another protein with coiled coils. The prediction capability of our strategy improves when restricting our dataset to highly reliable, known protein-protein interactions. Our study of the co-evolution of coiled coils demonstrates that pairs of interacting proteins can be distinguished from no…
Assessing the low complexity of protein sequences via the low complexity triangle.
Background Proteins with low complexity regions (LCRs) have atypical sequence and structural features. Their amino acid composition varies from the expected, determined proteome-wise, and they do not follow the rules of structural folding that prevail in globular regions. One way to characterize these regions is by assessing the repeatability of a sequence, that is, calculating the local propensity of a region to be part of a repeat. Results We combine two local measures of low complexity, repeatability (using the RES algorithm) and fraction of the most frequent amino acid, to evaluate different proteomes, datasets of protein regions with specific features, and individual cases of proteins…
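One of the two measures named in this abstract, the fraction of the most frequent amino acid, can be sketched in a few lines. The sliding-window evaluation, the window size of 20, and the function names are assumptions made for illustration; the abstract does not state how regions are delimited, and the RES repeatability algorithm is not reproduced here.

```python
from collections import Counter


def most_frequent_fraction(region):
    """Fraction of a region occupied by its single most common residue."""
    if not region:
        raise ValueError("region must not be empty")
    counts = Counter(region.upper())
    # most_common(1) returns a list with one (residue, count) pair.
    return counts.most_common(1)[0][1] / len(region)


def sliding_fraction(sequence, window=20):
    """Evaluate the measure on every full window along the sequence.

    The window size is a hypothetical default, not taken from the source.
    Returns an empty list if the sequence is shorter than the window.
    """
    return [
        most_frequent_fraction(sequence[i:i + window])
        for i in range(len(sequence) - window + 1)
    ]


# Toy example: a glutamine run followed by an alanine run.
print(sliding_fraction("QQQQAAAA", window=4))  # [1.0, 0.75, 0.5, 0.75, 1.0]
```

Values close to 1.0 indicate windows dominated by one residue, as in a homorepeat, while a window of 20 residues with no repeated residue gives 0.05.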
PlaToLoCo: the first web meta-server for visualization and annotation of low complexity regions in proteins
Abstract Low complexity regions (LCRs) in protein sequences are characterized by a less diverse amino acid composition compared to typically observed sequence diversity. Recent studies have shown that LCRs may co-occur with intrinsically disordered regions, are highly conserved in many organisms, and often play important roles in protein functions and in diseases. In previous decades, several methods have been developed to identify regions with LCRs or amino acid bias, but most of them as stand-alone applications and currently there is no web-based tool which allows users to explore LCRs in protein sequences with additional functional annotations. We aim to fill this gap by providing PlaToL…
MAGA: A Supervised Method to Detect Motifs From Annotated Groups in Alignments
Multiple sequence alignments are usually phylogenetically driven. They are studied in the framework of evolution. But sometimes, it is interesting to study residue conservation at positions unconstrained by evolutionary rules. We present a supervised method to access a layer of information difficult to appreciate visually when many protein sequences are aligned. This new tool (MAGA; http://cbdm-01.zdv.uni-mainz.de/~munoz/maga/ ) locates positions in multiple sequence alignments differentially conserved in manually defined groups of sequences.
Automated selection of homologs to track the evolutionary history of proteins
Background The selection of distant homologs of a query protein under study is a usual and useful application of protein sequence databases. Such sets of homologs are often applied to investigate the function of a protein and the degree to which experimental results can be transferred from one organism to another. In particular, a variety of databases facilitates static browsing for orthologs. However, these resources have a limited power when identifying orthologs between taxonomically distant species. In addition, in some situations, for a given query protein, it is advantageous to compare the sets of orthologs from different specific organisms: this recursive step-wise search might give …
REP2: A Web Server to Detect Common Tandem Repeats in Protein Sequences
Ensembles of tandem repeats (TRs) in protein sequences expand rapidly to form domains well suited for interactions with proteins. For this reason, they are relatively frequent. Some TRs have known structures and therefore it is advantageous to predict their presence in a protein sequence. However, since most TRs diverge quickly, their detection by classical sequence comparison algorithms is not very accurate. Previously, we developed a method and a web server that used curated profiles and thresholds for the detection of 11 common TRs. Here we present a new web server (REP2) that allows the analysis of TRs in both individual and aligned sequences. We provide currently precomputed analyses f…
Additional file 2: of Automated selection of homologs to track the evolutionary history of proteins
Figure S1. Number of orthology pairwise relationships calculated with OrthoMCL, ProteinPathTracker and Reciprocal Best Hit Blast (RBHB) in 15 species, using the proteomes provided by OrthoMCL in the default species from the default path in ProteinPathTracker, and taking E. coli proteins as reference. a) All OrthoMCL pairs. b) Only the best 25% scored OrthoMCL pairs. (PNG 388 kb)
Additional file 1: of Automated selection of homologs to track the evolutionary history of proteins
List of complete reference proteomes used in the web tool, organised by evolutionary path. (XLSX 13 kb)
MOESM1 of Proteome-wide comparison between the amino acid composition of domains and linkers
Additional file 1. List of proteomes used for the analyses. Each proteome is described by the name of the species, abbreviation as used in the manuscript, UniProt organism ID, number of proteins, and percentage of amino acids from domains/linkers against the total amino acid composition of the proteome.