An effective extension of the applicability of alignment-free biological sequence comparison algorithms with Hadoop

Giuseppe Cattaneo, Umberto Ferraro Petrillo, Gianluca Roscigno, Raffaele Giancarlo

subject

Alignment-free sequence comparison and analysis; Sequence comparison; Hadoop; MapReduce; Distributed computing; Hash table; Theoretical computer science; Computer science; Computation; Software; Information systems; Hardware and architecture; Extension (predicate logic); Task (project management); Settore INF/01 - Informatica; 03 medical and health sciences; 0301 basic medicine; 030102 biochemistry & molecular biology; 030104 developmental biology

description

Alignment-free methods are a mainstay of biological sequence comparison, i.e., the assessment of how similar two biological sequences are to each other, a fundamental and routine task in computational biology and bioinformatics. They have gained popularity because, even on standard desktop machines, they are faster than alignment-based methods. However, with the advent of Next-Generation Sequencing technologies, it is now common to encounter datasets whose size, i.e., the number of sequences and their total length, makes running alignment-free methods on those standard machines a challenge. Here, we propose the first paradigm for computing k-mer-based alignment-free methods on Apache Hadoop. It extends the problem sizes that can be processed with respect to a standard sequential machine while also delivering good time performance. Technically, and in contrast to a standard Hadoop implementation, its effectiveness comes from the incremental management of a persistent hash table during the map phase, a task that the basic Hadoop functions do not contemplate and that can also be useful in other contexts.
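The idea of keeping a hash table alive across the records of a map task can be illustrated with a minimal sketch. This is not the authors' implementation and does not use Hadoop: it simulates map tasks in plain Python. The class `KmerMapper`, the helpers `run`, `reduce_counts`, and `squared_euclidean`, and the choice of squared Euclidean distance over k-mer counts are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict


class KmerMapper:
    """Simulates one map task whose hash table persists across input records."""

    def __init__(self, k):
        self.k = k
        self.table = defaultdict(Counter)  # seq_id -> k-mer -> partial count

    def map(self, seq_id, sequence):
        # Update the persistent table instead of emitting one pair per k-mer.
        for i in range(len(sequence) - self.k + 1):
            self.table[seq_id][sequence[i:i + self.k]] += 1

    def cleanup(self):
        # Emit the aggregated counts once, when the simulated task ends.
        for seq_id, counts in self.table.items():
            for kmer, n in counts.items():
                yield (seq_id, kmer), n


def reduce_counts(pairs):
    """Sums the partial counts emitted by all map tasks."""
    totals = defaultdict(Counter)
    for (seq_id, kmer), n in pairs:
        totals[seq_id][kmer] += n
    return totals


def squared_euclidean(a, b):
    """Squared Euclidean distance between two k-mer count tables."""
    return sum((a[w] - b[w]) ** 2 for w in set(a) | set(b))


def run(splits, k):
    """Runs one simulated map task per input split, then the reduce step."""
    emitted = []
    for split in splits:
        mapper = KmerMapper(k)
        for seq_id, sequence in split:
            mapper.map(seq_id, sequence)
        emitted.extend(mapper.cleanup())
    return reduce_counts(emitted)


if __name__ == "__main__":
    # Two hypothetical input splits, each holding one whole sequence.
    splits = [[("s1", "ACGTACGT")], [("s2", "ACGTTTTT")]]
    totals = run(splits, k=3)
    print(squared_euclidean(totals["s1"], totals["s2"]))
```

In this sketch each record is a whole sequence. If a sequence were instead split into chunks, consecutive chunks would need to overlap by k-1 characters so that k-mers spanning a chunk boundary are not lost.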

year

2016-08-08

10.1007/s11227-016-1835-3 http://hdl.handle.net/11386/4681576