Reference genome assessment from a population scale perspective: an accurate profile of variability and noise.

RESEARCH PRODUCT

Antonio López-Quílez, Alicia Amadoz, Joaquín Dopazo, José Carbonell-Caballero, Roberto Alonso, Cankut Çubuk, David Conesa, Marta R. Hidalgo

subject

0301 basic medicine; Statistics and Probability; Quality Control; Genotype; Computer science; Population; Genomics; Bioinformatics; Biochemistry; Genome; 03 medical and health sciences; Genetic variation; Animals; Humans; Quality (business); Allele; Molecular Biology; Genotyping; Reliability (statistics); Protocol (science); Models, Statistical; Reproducibility of Results; Genome Analysis; Original Papers; Computer Science Applications; Computational Mathematics; 030104 developmental biology; Computational Theory and Mathematics; Data mining; Software; Reference genome

description

Abstract

Motivation: Current plant and animal genomic studies are often based on newly assembled genomes that have not been properly consolidated. In this scenario, misassembled regions can easily lead to false-positive findings. Although quality control scores are included within genotyping protocols, they are usually employed to evaluate individual sample quality rather than reference sequence reliability. We propose a statistical model that combines quality control scores across samples in order to detect incongruent patterns at every genomic region. Our model is inherently robust, since common artifact signals are expected to be shared between independent samples over misassembled regions of the genome.

Results: The reliability of our protocol has been extensively tested through different experiments and organisms with accurate results, improving on state-of-the-art methods. Our analysis demonstrates synergistic relations between quality control scores and allelic variability estimators that improve the detection of misassembled regions, and it is able to find strong artifact signals even within the human reference assembly. Furthermore, we demonstrate how our model can be trained to properly rank the confidence of a set of candidate variants obtained from new independent samples.

Availability and implementation: This tool is freely available at http://gitlab.com/carbonell/ces.

Supplementary information: Supplementary data are available at Bioinformatics online.
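
The intuition in the abstract, that artifact signals over a misassembled region should recur across independent samples, can be illustrated with a toy sketch. This is not the authors' statistical model: the simple thresholding rule, the function names, the region labels, the score values, and the `threshold` and `min_fraction` parameters are all assumptions made for illustration only.

```python
from typing import Dict, List


def low_quality_fraction(scores: List[float], threshold: float) -> float:
    """Fraction of samples whose QC score falls below the threshold."""
    if not scores:
        raise ValueError("region has no samples")
    return sum(s < threshold for s in scores) / len(scores)


def flag_regions(
    qc_by_region: Dict[str, List[float]],
    threshold: float = 20.0,      # hypothetical cut-off for a "poor" score
    min_fraction: float = 0.75,   # hypothetical share of samples that must agree
) -> List[str]:
    """Return regions where most samples independently show poor QC scores."""
    return [
        region
        for region, scores in qc_by_region.items()
        if low_quality_fraction(scores, threshold) >= min_fraction
    ]


# Made-up per-sample QC scores (higher is better) for three regions.
qc_by_region = {
    "chr1:0-1000": [35.0, 40.0, 12.0, 38.0, 36.0],     # one poor sample
    "chr1:1000-2000": [10.0, 8.0, 15.0, 11.0, 30.0],   # four of five poor
    "chr1:2000-3000": [9.0, 7.0, 12.0, 5.0, 6.0],      # all poor
}

flagged = flag_regions(qc_by_region)
print(flagged)
```

A single poor sample in the first region is treated as a sample-level problem, whereas the last two regions are flagged because the low scores are shared across samples, which is the pattern the abstract associates with unreliable reference sequence.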

year

2017-01-01

10.1093/bioinformatics/btx482 https://academic.oup.com/bioinformatics/article-pdf/33/22/3511/25167564/btx482.pdf