Search results for "CUDA"
showing 10 items of 56 documents
Gossip
2019
Nowadays, a growing number of servers and workstations feature an increasing number of GPUs. However, slow communication among GPUs can lead to poor application performance. Thus, there is a latent demand for efficient multi-GPU communication primitives on such systems. This paper focuses on the gather, scatter and all-to-all collectives, which are important operations for various algorithms including parallel sorting and distributed hashing. We present two distinct communication strategies (ring-based and flow-oriented) to generate transfer plans for their topology-aware implementation on NVLink-connected multi-GPU systems. We achieve a throughput of up to 526 GB/s for all-to-all and 148 G…
CRiSPy-CUDA: Computing Species Richness in 16S rRNA Pyrosequencing Datasets with CUDA
2011
Pyrosequencing technologies are frequently used for sequencing the 16S rRNA marker gene for metagenomic studies of microbial communities. Computing a pairwise genetic distance matrix from the produced reads is an important but highly time consuming task. In this paper, we present a parallelized tool (called CRiSPy) for scalable pairwise genetic distance matrix computation and clustering that is based on the processing pipeline of the popular ESPRIT software package. To achieve high computational efficiency, we have designed massively parallel CUDA algorithms for pairwise k-mer distance and pairwise genetic distance computation. We have also implemented a memory-efficient sparse matrix clust…
Comparison of CPML Implementations for the GPU-Accelerated FDTD Solver
2011
Three distinctively different implementations of convolutional perfectly matched layer for the FDTD method on CUDA enabled graphics processing units are presented. All implementations store additional variables only inside the convolutional perfectly matched layers, and the computational speeds scale according to the thickness of these layers. The merits of the different approaches are discussed, and a comparison of computational performance is made using complex real-life benchmarks.
CUSHAW Suite: Parallel and Efficient Algorithms for NGS Read Alignment
2017
Next generation sequencing (NGS) technologies have enabled cheap, large-scale, and high-throughput production of short DNA sequence reads and thereby have promoted the explosive growth of data volume. Unfortunately, the produced reads are short and prone to contain errors that are incurred during sequencing cycles. Both large data volume and sequencing errors have complicated the mapping of NGS reads onto the reference genome and have motivated the development of various aligners for very short reads, typically less than 100 base pairs (bps) in length. As read length continues to increase, propelled by advances in NGS technologies, these longer reads tend to have higher sequencing error rat…
Parallelized Clustering of Protein Structures on CUDA-Enabled GPUs
2014
Estimation of the pose in which two given molecules might bind together to form a potential complex is a crucial task in structural biology. To solve this so-called "docking problem", most algorithms initially generate large numbers of candidate poses (or decoys) which are then clustered to allow for subsequent computationally expensive evaluations of reasonable representatives. Since the number of such candidates ranges from thousands to millions, performing the clustering on standard CPUs is highly time consuming. In this paper we analyze and evaluate different approaches to parallelize the nearest neighbor chain algorithm to perform hierarchical Ward clustering of protein structures usin…
SAUCE: A Web-Based Automated Assessment Tool for Teaching Parallel Programming
2015
Many curricula for undergraduate studies in computer science provide a lecture on the fundamentals of parallel programming like multi-threaded computation on shared memory architectures using POSIX threads or OpenMP. The complex structure of parallel programs can be challenging, especially for inexperienced students. Thus, there is a latent need for software supporting the learning process. Subsequent lectures may cover more advanced parallelization techniques such as the Message Passing Interface (MPI) and the Compute Unified Device Architecture (CUDA) languages. Unfortunately, the majority of students cannot easily access MPI clusters or modern hardware accelerators in order to effectivel…
SAUCE: A web application for interactive teaching and learning of parallel programming
2017
Prevalent hardware trends towards parallel architectures and algorithms create a growing demand for graduate students familiar with the programming of concurrent software. However, learning parallel programming is challenging due to complex communication and memory access patterns as well as the avoidance of common pitfalls such as deadlocks and race conditions. Hence, the learning process has to be supported by adequate software solutions in order to enable future computer scientists and engineers to write robust and efficient code. This paper discusses a selection of well-known parallel algorithms based on C++11 threads, OpenMP, MPI, and CUDA that can be interactively embedded i…
High Precision Conservative Surface Mesh Generation for Swept Volumes
2015
We present a novel, efficient, and flexible scheme to generate a high-quality mesh that approximates the outer boundary of a swept volume. Our approach comes with two guarantees. First, the approximation is conservative, i.e., the swept volume is enclosed by the generated mesh. Second, the one-sided Hausdorff distance of the generated mesh to the swept volume is upper bounded by a user defined tolerance. Exploiting this tolerance the algorithm generates a mesh that is adapted to the local complexity of the swept volume boundary, keeping the overall output complexity remarkably low. The algorithm is two-phased: the actual sweep and the mesh generation. In the sweeping phase, we introduce a g…
Massively parallel computation of atmospheric neutrino oscillations on CUDA-enabled accelerators
2019
The computation of neutrino flavor transition amplitudes through inhomogeneous matter is a time-consuming step and thus could benefit from optimization and parallelization. Next to reliable parameter estimation of intrinsic physical quantities such as neutrino masses and mixing angles, these transition amplitudes are important in hypothesis testing of potential extensions of the standard model of elementary particle physics, such as additional neutrino flavors. Hence, fast yet precise implementations are of high importance to research. In the recent past, massively parallel accelerators such as CUDA-enabled GPUs featuring thousands of compute units have been widely adopted due to t…
GROMEX: A Scalable and Versatile Fast Multipole Method for Biomolecular Simulation
2020
Atomistic simulations of large biomolecular systems with chemical variability such as constant pH dynamic protonation offer multiple challenges in high performance computing. One of them is the correct treatment of the involved electrostatics in an efficient and highly scalable way. Here we review and assess two of the main building blocks that will permit such simulations: (1) An electrostatics library based on the Fast Multipole Method (FMM) that treats local alternative charge distributions with minimal overhead, and (2) A $λ$-dynamics module working in tandem with the FMM that enables various types of chemical transitions during the simulation. Our $λ$-dynamics and FMM implementations d…