Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases

RESEARCH PRODUCT

Ole K. Tørresen, Bastiaan Star, Pablo Mier, Miguel A. Andrade-Navarro, Alex Bateman, Patryk Jarnot, Aleksandra Gruca, Marcin Grynberg, Andrey V. Kajava, Vasilis J. Promponas, Maria Anisimova, Kjetill S. Jakobsen, Dirk Linke

subject

FOS: Computer and information sciences; Bioinformatics; [SDV]Life Sciences [q-bio]; Sequence assembly; Genomics; [SDV.BC]Life Sciences [q-bio]/Cellular Biology; Computational biology; Biology; Genome; 03 medical and health sciences; Annotation; 0302 clinical medicine; Tandem repeat; Genetics; Animals; Survey and Summary; Databases, Protein; Gene; ComputingMilieux_MISCELLANEOUS; 030304 developmental biology; 0303 health sciences; End user; 572: Biochemie; DNA; Sequence Analysis, DNA; Genomics; [SDV.BIBS]Life Sciences [q-bio]/Quantitative Methods [q-bio.QM]; Workflow; ComputingMethodologies_PATTERNRECOGNITION; Gadus morhua; Tandem Repeat Sequences; Scientific Experimental Error; [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM]; Databases, Nucleic Acid; 030217 neurology & neurosurgery

description

Abstract: The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.

year	journal	country	edition	language
2019-12-02

10.1093/nar/gkz841 https://hal.archives-ouvertes.fr/hal-03089273