6533b7d4fe1ef96bd1263271

RESEARCH PRODUCT

FastaHerder2: Four Ways to Research Protein Function and Evolution with Clustering and Clustered Databases.

Pablo MierMiguel A. Andrade-navarro

subject

0301 basic medicineProtein structure databaseProteomicsProteomeSequence analysisComputer sciencecomputer.software_genreSensitivity and SpecificitySet (abstract data type)Evolution Molecular03 medical and health sciences0302 clinical medicineSimilarity (network science)Sequence Analysis ProteinGeneticsCluster (physics)AnimalsCluster AnalysisHumansCluster analysisDatabases ProteinMolecular BiologySequenceDatabaseFunction (mathematics)Computational Mathematics030104 developmental biologyComputational Theory and MathematicsModeling and SimulationData miningcomputer030217 neurology & neurosurgerySoftware

description

The accelerated growth of protein databases offers great possibilities for the study of protein function using sequence similarity and conservation. However, the huge number of sequences deposited in these databases requires new ways of analyzing and organizing the data. It is necessary to group the many very similar sequences, creating clusters with automated derived annotations useful to understand their function, evolution, and level of experimental evidence. We developed an algorithm called FastaHerder2, which can cluster any protein database, putting together very similar protein sequences based on near-full-length similarity and/or high threshold of sequence identity. We compressed 50 reference proteomes, along with the SwissProt database, which we could compress by 74.7%. The clustering algorithm was benchmarked using OrthoBench and compared with FASTA HERDER, a previous version of the algorithm, showing that FastaHerder2 can cluster a set of proteins yielding a high compression, with a lower error rate than its predecessor. We illustrate the use of FastaHerder2 to detect biologically relevant functional features in protein families. With our approach we seek to promote a modern view and usage of the protein sequence databases more appropriate to the postgenomic era.

10.1089/cmb.2015.0191https://pubmed.ncbi.nlm.nih.gov/26828375