RESEARCH PRODUCT

RabbitMash: accelerating hash-based genome analysis on modern multi-core architectures

Weiguo Liu, Xiaoming Xu, Zekun Yin, Bertil Schmidt, Yanjie Wei, Jinxiao Zhang

subject

Statistics and Probability; Workstation; Exploit; Computer science; Hash function; Parallel computing; Biochemistry; law.invention; 03 medical and health sciences; Software; law; Cluster analysis; Molecular Biology; 030304 developmental biology; 0303 health sciences; Multi-core processor; Genome; Computers; business.industry; 030302 biochemistry & molecular biology; Genomics; Sketch; Computer Science Applications; Computational Mathematics; Computational Theory and Mathematics; business; Algorithms

description

Abstract

Motivation: Mash is a popular hash-based genome analysis toolkit with applications to important downstream analysis tasks such as clustering and assembly. However, Mash currently cannot fully exploit the capabilities of modern multi-core architectures, which leads to high runtimes on large-scale genomic datasets.

Results: We present RabbitMash, an efficient, highly optimized implementation of Mash that takes full advantage of modern hardware features, including multi-threading, vectorization and fast I/O. We show that our approach achieves speedups of at least 1.3×, 9.8×, 8.5× and 4.4× over Mash for the operations sketch, dist, triangle and screen, respectively. Furthermore, RabbitMash computes the all-versus-all distances of 100 321 genomes in under 5 min on a 40-core workstation, whereas Mash requires over 40 min.

Availability and implementation: RabbitMash is available at https://github.com/ZekunYin/RabbitMash.

Supplementary information: Supplementary data are available at Bioinformatics online.

https://doi.org/10.1093/bioinformatics/btaa754