0000000000165122

AUTHOR

Weiguo Liu

RabbitMash: accelerating hash-based genome analysis on modern multi-core architectures

Abstract Motivation Mash is a popular hash-based genome analysis toolkit with applications to important downstream analyses tasks such as clustering and assembly. However, Mash is currently not able to fully exploit the capabilities of modern multi-core architectures, which in turn leads to high runtimes for large-scale genomic datasets. Results We present RabbitMash, an efficient highly optimized implementation of Mash which can take full advantage of modern hardware including multi-threading, vectorization and fast I/O. We show that our approach achieves speedups of at least 1.3, 9.8, 8.5 and 4.4 compared to Mash for the operations sketch, dist, triangle and screen, respectively. Furtherm…

research product

Efficient Parallel Sort on AVX-512-Based Multi-Core and Many-Core Architectures

Sorting kernels are a fundamental part of numerous applications. The performance of sorting implementations is usually limited by a variety of factors such as computing power, memory bandwidth, and branch mispredictions. In this paper we propose an efficient hybrid sorting method which takes advantage of wide vector registers and the high bandwidth memory of modern AVX-512-based multi-core and many-core processors. Our approach employs a combination of vectorized bitonic sorting and load-balanced multi-threaded merging. Thread-level and data-level parallelism are used to exploit both compute power and memory bandwidth. Our single-threaded implementation is ~30x faster than qsort in the C st…

research product

Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters

Computing alignments between two or more sequences are common operations frequently performed in computational molecular biology. The continuing growth of biological sequence databases establishes the need for their efficient parallel implementation on modern accelerators. This paper presents new approaches to high performance biological sequence database scanning with the Smith-Waterman algorithm and the first stage of progressive multiple sequence alignment based on the ClustalW heuristic on a Xeon Phi-based compute cluster. Our approach uses a three-level parallelization scheme to take full advantage of the compute power available on this type of architecture; i.e. cluster-level data par…

research product

Millimeter-Scale and Billion-Atom Reactive Force Field Simulation on Sunway Taihulight

Large-scale molecular dynamics (MD) simulations on supercomputers play an increasingly important role in many research areas. With the capability of simulating charge equilibration (QEq), bonds and so on, Reactive force field (ReaxFF) enables the precise simulation of chemical reactions. Compared to the first principle molecular dynamics (FPMD), ReaxFF has far lower requirements on computational resources so that it can achieve higher efficiencies for large-scale simulations. In this article, we present our efforts on scaling ReaxFF on the Sunway TaihuLight Supercomputer (TaihuLight). We have carefully redesigned the force analysis and neighbor list building steps. By applying fine-grained …

research product

WarpCore: A Library for fast Hash Tables on GPUs

Hash tables are ubiquitous. Properties such as an amortized constant time complexity for insertion and querying as well as a compact memory layout make them versatile associative data structures with manifold applications. The rapidly growing amount of data emerging in many fields motivated the need for accelerated hash tables designed for modern parallel architectures. In this work, we exploit the fast memory interface of modern GPUs together with a parallel hashing scheme tailored to improve global memory access patterns, to design WarpCore -- a versatile library of hash table data structures. Unique device-sided operations allow for building high performance data processing pipelines ent…

research product

XLCS: A New Bit-Parallel Longest Common Subsequence Algorithm on Xeon Phi Clusters

Finding the longest common subsequence (LCS) of two strings is a classical problem in bioinformatics. A basic approach to solve this problem is based on dynamic programming. As the biological sequence databases are growing continuously, bit-parallel sequence comparison algorithms are becoming increasingly important. In this paper, we present XLCS, a new parallel implementation to accelerate the LCS algorithm on Xeon Phi clusters by performing bit-wise operations. We have designed an asynchronous IO framework to improve the data transfer efficiency. To make full use of the computing resources of Xeon Phi clusters, we use three levels of parallelism: node-level, thread-level and vector-level.…

research product

FMapper: Scalable read mapper based on succinct hash index on SunWay TaihuLight

Abstract One of the most important application in bioinformatics is read mapping. With the rapidly increasing number of reads produced by next-generation sequencing (NGS) technology, there is a need for fast and efficient high-throughput read mappers. In this paper, we present FMapper – a highly scalable read mapper on the TaihuLight supercomputer optimized for its fourth-generation ShenWei many-core architecture (SW26010). In order to fully exploit the computational power of the SW26010, we employ dynamic scheduling of tasks, asynchronous I/O and data transfers and implement a vectorized version of the banded Myers algorithm tailored to the 256 bit vector registers of the SW26010. Our perf…

research product

RabbitQC: high-speed scalable quality control for sequencing data

Abstract Motivation Modern sequencing technologies continue to revolutionize many areas of biology and medicine. Since the generated datasets are error-prone, downstream applications usually require quality control methods to pre-process FASTQ files. However, existing tools for this task are currently not able to fully exploit the capabilities of computing platforms leading to slow runtimes. Results We present RabbitQC, an extremely fast integrated quality control tool for FASTQ files, which can take full advantage of modern hardware. It includes a variety of operations and supports different sequencing technologies (Illumina, Oxford Nanopore and PacBio). RabbitQC achieves speedups between …

research product

Neighbor-list-free molecular dynamics on sunway TaihuLight supercomputer

Molecular dynamics (MD) simulations are playing an increasingly important role in many research areas. Pair-wise potentials are widely used in MD simulations of bio-molecules, polymers, and nano-scale materials. Due to a low compute-to-memory-access ratio, their calculation is often bounded by memory transfer speeds. Sunway TaihuLight is one of the fastest supercomputers featuring a custom SW26010 many-core processor. Since the SW26010 has some critical limitations regarding main memory bandwidth and scratchpad memory size, it is considered as a good platform to investigate the optimization of pair-wise potentials especially in terms of data reusage. MD algorithms often use a neighbor-list …

research product

CUDA-BLASTP: Accelerating BLASTP on CUDA-enabled graphics hardware

Scanning protein sequence database is an often repeated task in computational biology and bioinformatics. However, scanning large protein databases, such as GenBank, with popular tools such as BLASTP requires long runtimes on sequential architectures. Due to the continuing rapid growth of sequence databases, there is a high demand to accelerate this task. In this paper, we demonstrate how GPUs, powered by the Compute Unified Device Architecture (CUDA), can be used as an efficient computational platform to accelerate the BLASTP algorithm. In order to exploit the GPU's capabilities for accelerating BLASTP, we have used a compressed deterministic finite state automaton for hit detection as wel…

research product

SWMapper: Scalable Read Mapper on SunWay TaihuLight

With the rapid development of next-generation sequencing (NGS) technologies, high throughput sequencing platforms continuously produce large amounts of short read DNA data at low cost. Read mapping is a performance-critical task, being one of the first stages required for many different types of NGS analysis pipelines. We present SWMapper — a scalable and efficient read mapper for the Sunway TaihuLight supercomputer. A number of optimization techniques are proposed to achieve high performance on its heterogeneous architecture which are centered around a memory-efficient succinct hash index data structure including seed filtration, duplicate removal, dynamic scheduling, asynchronous data tra…

research product

FeatherCNN: Fast Inference Computation with TensorGEMM on ARM Architectures

Deep Learning is ubiquitous in a wide field of applications ranging from research to industry. In comparison to time-consuming iterative training of convolutional neural networks (CNNs), inference is a relatively lightweight operation making it amenable to execution on mobile devices. Nevertheless, lower latency and higher computation efficiency are crucial to allow for complex models and prolonged battery life. Addressing the aforementioned challenges, we propose FeatherCNN – a fast inference library for ARM CPUs – targeting the performance ceiling of mobile devices. FeatherCNN employs three key techniques: 1) A highly efficient TensorGEMM (generalized matrix multiplication) routine is app…

research product

BGSA: a bit-parallel global sequence alignment toolkit for multi-core and many-core architectures

Abstract Motivation Modern bioinformatics tools for analyzing large-scale NGS datasets often need to include fast implementations of core sequence alignment algorithms in order to achieve reasonable execution times. We address this need by presenting the BGSA toolkit for optimized implementations of popular bit-parallel global pairwise alignment algorithms on modern microprocessors. Results BGSA outperforms Edlib, SeqAn and BitPAl for pairwise edit distance computations and Parasail, SeqAn and BitPAl when using more general scoring schemes for pairwise alignments of a batch of sequence reads on both standard multi-core CPUs and Xeon Phi many-core CPUs. Furthermore, banded edit distance perf…

research product

Mapping of BLASTP Algorithm onto GPU Clusters

Searching protein sequence database is a fundamental and often repeated task in computational biology and bioinformatics. However, the high computational cost and long runtime of many database scanning algorithms on sequential architectures heavily restrict their applications for large-scale protein databases, such as GenBank. The continuing exponential growth of sequence databases and the high rate of newly generated queries further deteriorate the situation and establish a strong requirement for time-efficient scalable database searching algorithms. In this paper, we demonstrate how GPU clusters, powered by the Compute Unified Device Architecture (CUDA), OpenMP, and MPI parallel programmi…

research product

Accelerating large-scale biological database search on Xeon Phi-based neo-heterogeneous architectures

In this paper we present new parallelization techniques for searching large-scale biological sequence databases with the Smith-Waterman algorithm on Xeon Phi-based neoheterogenous architectures. In order to make full use of the compute power of both the multi-core CPU and the many-core Xeon Phi hardware, we use a collaborative computing scheme as well as hybrid parallelism. At the CPU side, we employ SSE intrinsics and multi-threading to implement SIMD parallelism. At the Xeon Phi side, we use Knights Corner vector instructions to gain more data parallelism. We have presented two dynamic task distribution schemes (thread level and device level) in order to achieve better load balancing. Fur…

research product

S-Aligner: Ultrascalable Read Mapping on Sunway Taihu Light

The availability and amount of sequenced genomes have been rapidly growing in recent years because of the adoption of next-generation sequencing (NGS) technologies that enable high-throughput short-read generation at highly competitive cost. Since this trend is expected to continue in the foreseeable future, the design and implementation of efficient and scalable NGS bioinformatics algorithms are important to research and industrial applications. In this paper, we introduce S-Aligner–a highly scalable read mapper designed for the Sunway Taihu Light supercomputer and its fourth-generationShenWei many-core architecture (SW26010). S-Aligner employs a combination of optimization techniques to o…

research product

PUNAS: A Parallel Ungapped-Alignment-Featured Seed Verification Algorithm for Next-Generation Sequencing Read Alignment

The progress of next-generation sequencing has a major impact on medical and genomic research. This technology can now produce billions of short DNA fragments (reads) in a single run. One of the most demanding computational problems used by almost every sequencing pipeline is short-read alignment; i.e. determining where each fragment originated from in the original genome. Most current solutions are based on a seed-and-extend approach, where promising candidate regions (seeds) are first identified and subsequently extended in order to verify whether a full high-scoring alignment actually exists in the vicinity of each seed. Seed verification is the main bottleneck in many state-of-the-art a…

research product

SWhybrid: A Hybrid-Parallel Framework for Large-Scale Protein Sequence Database Search

Computer architectures continue to develop rapidly towards massively parallel and heterogeneous systems. Thus, easily extensible yet highly efficient parallelization approaches for a variety of platforms are urgently needed. In this paper, we present SWhybrid, a hybrid computing framework for large-scale biological sequence database search on heterogeneous computing environments with multi-core or many-core processing units (PUs) based on the Smith- Waterman (SW) algorithm. To incorporate a diverse set of PUs such as combinations of CPUs, GPUs and Xeon Phis, we abstract them as SIMD vector execution units with different number of lanes. We propose a machine model, associated with a unified …

research product

SPECTR

Modern high throughput sequencing platforms can produce large amounts of short read DNA data at low cost. Error correction is an important but time-consuming initial step when processing this data in order to improve the quality of downstream analyses. In this paper, we present a Scalable Parallel Error CorrecToR designed to improve the throughput of DNA error correction for Illumina reads on various parallel platforms. Our design is based on a k-spectrum approach where a Bloom filter is frequently probed as a key operation and is optimized towards AVX-512-based multi-core CPUs, Xeon Phi many-cores (both KNC and KNL), and heterogeneous compute clusters. A number of architecture-specific opt…

research product

Optimization of Reactive Force Field Simulation: Refactor, Parallelization, and Vectorization for Interactions

Molecular dynamics (MD) simulations are playing an increasingly important role in many areas ranging from chemical materials to biological molecules. With the continuing development of MD models, the potentials are getting larger and more complex. In this article, we focus on the reactive force field (ReaxFF) potential from LAMMPS to optimize the computation of interactions. We present our efforts on refactoring for neighbor list building, bond order computation, as well as valence angles and torsion angles computation. After redesigning these kernels, we develop a vectorized implementation for non-bonded interactions, which is nearly $100 \times$ 100 × faster than the management processing…

research product

Cell-List based Molecular Dynamics on Many-Core Processors: A Case Study on Sunway TaihuLight Supercomputer

Molecular dynamics (MD) simulations are playing an increasingly important role in several research areas. The most frequently used potentials in MD simulations are pair-wise potentials. Due to the memory wall, computing pair-wise potentials on many-core processors are usually memory bounded. In this paper, we take the SW26010 processor as an exemplary platform to explore the possibility to break the memory bottleneck by improving data reusage via cell-list-based methods. We use cell-lists instead of neighbor-lists in the potential computation, and apply a number of novel optimization methods. Theses methods include: an adaptive replica arrangement strategy, a parameter profile data structur…

research product