6533b7d0fe1ef96bd125a1ce

RESEARCH PRODUCT

Suffix Array Construction on Multi-GPU Systems

Christian HundtBertil SchmidtDaniel JüngerFlorian BürenRobin Kobus

subject

Multi-core processorSpeedupComputer scienceSuffix array0102 computer and information sciences02 engineering and technologyParallel computingData structure01 natural scienceslaw.inventionCUDAShared memory010201 computation theory & mathematicslaw0202 electrical engineering electronic engineering information engineering020201 artificial intelligence & image processingSuffixData compression

description

Suffix arrays are prevalent data structures being fundamental to a wide range of applications including bioinformatics, data compression, and information retrieval. Therefore, various algorithms for (parallel) suffix array construction both on CPUs and GPUs have been proposed over the years. Although providing significant speedup over their CPU-based counterparts, existing GPU implementations share a common disadvantage: input text sizes are limited by the scarce memory of a single GPU. In this paper, we overcome aforementioned memory limitations by exploiting multi-GPU nodes featuring fast NVLink interconnects. In order to achieve high performance for this communication-intensive task, we design a parallel inter-GPU (re-)merging scheme. To handle segments spanning multiple GPUs, we propose an efficient strategy for the merging phase facilitated by a fast partitioning search. On 8 GPUs our implementation achieves speedups between 133 and 354 over sequential CPU-based libdivsufsort, between 30 and 68 over its multi-threaded shared memory version using 80 threads on 40 CPU cores for large datasets ranging from 697M to 3159M characters in size. For medium-sized datasets ranging between 104M and 236M, our approach yields maximum (minimum) speedups of 11.7 (4.5) and 6.45 (4.5) over existing single-GPU implementations (CUDPP, NVBIO). We are able to construct the suffix array of a full human genome on a single DGX-1 server within a runtime of 3.44~seconds which is faster than the 4.8 seconds that were previously reported employing 1600 cores on 100 nodes on a CPU-based HPC cluster. Our implementation is publicly available at https://gitlab.rlp.net/pararch/multi-gpu-suffix-array/.

https://doi.org/10.1145/3307681.3325961