FASTdoop: A versatile and efficient library for the input of FASTA and FASTQ files for MapReduce Hadoop bioinformatics applications

RESEARCH PRODUCT

Raffaele Giancarlo, Umberto Ferraro Petrillo, Giuseppe Cattaneo, Gianluca Roscigno

subject

0301 basic medicine; 03 medical and health sciences; 030104 developmental biology; FASTQ; FASTQ format; File format; NGS; DNA sequencing; Sequence analysis; Sequence Analysis, DNA; Sequence Database; Genome; Genome, Human; Genomics; Genomic library; Gene Library; Humans; Bioinformatics; Biochemistry; Molecular Biology; Statistics and Probability; Computer science; Computer Science Applications; 1707 Computer Vision and Pattern Recognition; Computational Theory and Mathematics; Computational Mathematics; Information Storage and Retrieval; Database Management Systems; Domain (software engineering); Quality (business); Settore INF/01 - Informatica

description

Abstract

Summary: MapReduce Hadoop bioinformatics applications require the availability of special-purpose routines to manage the input of sequence files. Unfortunately, the Hadoop framework does not provide any built-in support for the most popular sequence file formats like FASTA or BAM. Moreover, the development of these routines is not easy, both because of the diversity of these formats and because of the need to manage efficiently sequence datasets that may count up to billions of characters. We present FASTdoop, a generic Hadoop library for the management of FASTA and FASTQ files. We show that, with respect to analogous input management routines that have appeared in the literature, it offers versatility and efficiency. That is, it can handle collections of reads, with or without quality scores, as well as long genomic sequences, while the existing routines concentrate mainly on NGS sequence data. Moreover, in the domain where a comparison is possible, the routines proposed here are faster than the available ones. In conclusion, FASTdoop is a much-needed addition to Hadoop-BAM.

Availability and Implementation: The software and the datasets are available at http://www.di.unisa.it/FASTdoop/.

Supplementary information: Supplementary data are available at Bioinformatics online.
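To make concrete the distinction the summary draws between reads "with or without quality scores", the following is a minimal sketch of how the two record layouts differ. It is plain Java and does not use FASTdoop or Hadoop. The class name `SequenceRecordSketch`, its nested `Record` type and both method names are hypothetical, introduced only for illustration. The sketch assumes the common four-line FASTQ layout and whole-file input held in memory; it does not address the harder problem of records that straddle Hadoop input splits.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of FASTdoop or Hadoop. */
final class SequenceRecordSketch {

    /** One parsed record; quality is null when the input is FASTA. */
    static final class Record {
        final String id;
        final String sequence;
        final String quality;

        Record(String id, String sequence, String quality) {
            this.id = id;
            this.sequence = sequence;
            this.quality = quality;
        }
    }

    /** FASTA: a '>' header line followed by one or more sequence lines. */
    static List<Record> parseFasta(String text) {
        List<Record> records = new ArrayList<>();
        String id = null;
        StringBuilder seq = new StringBuilder();
        for (String line : text.split("\\R")) {
            if (line.startsWith(">")) {
                // A new header closes the previous record, if there was one.
                if (id != null) {
                    records.add(new Record(id, seq.toString(), null));
                }
                id = line.substring(1).trim();
                seq.setLength(0);
            } else if (id != null) {
                // Long sequences may be wrapped over many lines; join them.
                seq.append(line.trim());
            }
        }
        if (id != null) {
            records.add(new Record(id, seq.toString(), null));
        }
        return records;
    }

    /** FASTQ (assumed four lines per record): '@' header, sequence, '+' line, quality. */
    static List<Record> parseFastq(String text) {
        List<Record> records = new ArrayList<>();
        String[] lines = text.split("\\R");
        // Records are located by position, because a quality line may itself start with '@'.
        for (int i = 0; i + 3 < lines.length; i += 4) {
            if (!lines[i].startsWith("@") || !lines[i + 2].startsWith("+")) {
                throw new IllegalArgumentException("Malformed FASTQ record at line " + (i + 1));
            }
            records.add(new Record(
                    lines[i].substring(1).trim(),
                    lines[i + 1].trim(),
                    lines[i + 3].trim()));
        }
        return records;
    }
}
```

For example, `SequenceRecordSketch.parseFasta(">chr1\nACGT\nTTGA\n")` yields one record whose sequence joins the two wrapped lines, while `parseFastq` keeps a quality string alongside each read.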

year

2017-01-01

10.1093/bioinformatics/btx010 http://hdl.handle.net/11386/4702187