RESEARCH PRODUCT

Quality control guidelines and machine learning predictions for next generation sequencing data

Steffen Albrecht, Miguel A. Andrade-Navarro, Jean-Fred Fontaine

description

Abstract Controlling the quality of next generation sequencing (NGS) data files is usually not fully automated because of its complexity, and it involves strong assumptions and arbitrary choices. We have statistically characterized common NGS quality features of a large set of files and optimized the complex quality control procedure using a machine learning approach including tree-based algorithms and deep learning. Predictive models were validated using internal and external data, including applications to disease diagnosis datasets. The models are unbiased, accurate and to some extent generalizable to unseen data types and species. Given enough labelled data for training, this approach could potentially work for any type of NGS assay or species. The derived statistical guidelines and predictive models represent a valuable resource for NGS specialists to better understand quality issues and perform automatic quality control of their own files. Our guidelines and software are available at https://github.com/salbrec/seqQscorer.

DOI: 10.1101/768713 (http://dx.doi.org/10.1101/768713)