Search results for "Parallel computing"
Showing 10 of 189 documents
Iterative sparse matrix-vector multiplication for accelerating the block Wiedemann algorithm over GF(2) on multi-graphics processing unit systems
2012
The block Wiedemann (BW) algorithm is frequently used to solve sparse linear systems over GF(2). Iterative sparse matrix–vector multiplication is the most time-consuming operation. The necessity to accelerate this step is motivated by the application of BW to very large matrices used in the linear algebra step of the number field sieve (NFS) for integer factorization. In this paper, we derive an efficient CUDA implementation of this operation by using a newly designed hybrid sparse matrix format. This leads to speedups between 4 and 8 on a single graphics processing unit (GPU) for a number of tested NFS matrices compared with an optimized multicore implementation. We further present…
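The accelerated step is a sparse matrix–vector product over GF(2), where multiplication is a bitwise AND and accumulation is XOR, so a whole block of 64 vectors can be processed per 64-bit word. Below is a minimal CUDA sketch using a plain CSR-like layout; the paper's hybrid sparse format, the block Wiedemann driver, and the multi-GPU distribution are not reproduced, and all names and sizes are illustrative assumptions.

```cuda
// Sketch: CSR-like sparse matrix over GF(2) multiplied by a block of 64 vectors,
// each packed into one 64-bit word per row. Over GF(2), a row's result is simply
// the XOR of the packed words selected by its nonzero column indices.
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spmv_gf2(const int *row_ptr, const int *col_idx,
                         const uint64_t *x, uint64_t *y, int n_rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;
    uint64_t acc = 0;                       // 64 dot products accumulated bitwise
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
        acc ^= x[col_idx[j]];               // XOR = addition over GF(2)
    y[row] = acc;
}

int main()
{
    // Toy 3x3 matrix with nonzeros at (0,0),(0,2),(1,1),(2,0),(2,1),(2,2).
    int h_row_ptr[] = {0, 2, 3, 6};
    int h_col_idx[] = {0, 2, 1, 0, 1, 2};
    uint64_t h_x[] = {0x1, 0x2, 0x4};        // three packed 64-bit vector blocks
    uint64_t h_y[3];

    int *d_row_ptr, *d_col_idx; uint64_t *d_x, *d_y;
    cudaMalloc((void **)&d_row_ptr, sizeof(h_row_ptr));
    cudaMalloc((void **)&d_col_idx, sizeof(h_col_idx));
    cudaMalloc((void **)&d_x, sizeof(h_x));
    cudaMalloc((void **)&d_y, sizeof(h_y));
    cudaMemcpy(d_row_ptr, h_row_ptr, sizeof(h_row_ptr), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col_idx, h_col_idx, sizeof(h_col_idx), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);

    spmv_gf2<<<1, 32>>>(d_row_ptr, d_col_idx, d_x, d_y, 3);
    cudaMemcpy(h_y, d_y, sizeof(h_y), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 3; ++i)
        printf("y[%d] = 0x%llx\n", i, (unsigned long long)h_y[i]);

    cudaFree(d_row_ptr); cudaFree(d_col_idx); cudaFree(d_x); cudaFree(d_y);
    return 0;
}
```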
A Fast GPU-Based Motion Estimation Algorithm for H.264/AVC
2012
H.264/AVC is the most recent predictive video compression standard, outperforming existing video coding standards at the cost of higher computational complexity. In recent years, heterogeneous computing has emerged as a cost-efficient solution for high-performance computing. In the literature, several algorithms have been proposed to accelerate video compression, but so far there have not been many solutions that deal with video codecs using heterogeneous systems. This paper proposes an algorithm to perform H.264/AVC inter prediction. The proposed algorithm performs the motion estimation, both with full-pixel and sub-pixel accuracy, using CUDA to assist the CPU, obtaining remarkable time …
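As a rough illustration of GPU-side motion estimation (not the paper's algorithm), the kernel sketch below performs an exhaustive full-pel search for a single 16×16 macroblock: each thread evaluates one candidate motion vector, computes its SAD against the reference frame, and the best candidate is selected with an atomicMin on a packed (SAD, candidate) key. Sub-pel refinement, the H.264/AVC rate-distortion cost, and the search range are omitted or assumed.

```cuda
// Kernel sketch of full-pel block matching; *best must be initialised to 0xFFFFFFFF.
#include <cstdint>

#define SEARCH 8                 // search range in pixels (assumption)
#define RANGE  (2 * SEARCH + 1)  // candidates per dimension

__global__ void full_search_sad(const uint8_t *cur, const uint8_t *ref,
                                int width, int height,
                                int mb_x, int mb_y,         // macroblock origin
                                unsigned int *best)         // packed (SAD, candidate)
{
    int cand = blockIdx.x * blockDim.x + threadIdx.x;
    if (cand >= RANGE * RANGE) return;

    int dx = cand % RANGE - SEARCH;
    int dy = cand / RANGE - SEARCH;
    int rx = mb_x + dx, ry = mb_y + dy;
    if (rx < 0 || ry < 0 || rx + 16 > width || ry + 16 > height) return;

    unsigned int sad = 0;
    for (int y = 0; y < 16; ++y)
        for (int x = 0; x < 16; ++x) {
            int d = (int)cur[(mb_y + y) * width + mb_x + x]
                  - (int)ref[(ry + y) * width + rx + x];
            sad += (d < 0) ? -d : d;        // sum of absolute differences
        }

    // SAD of a 16x16 8-bit block fits in 16 bits, so packing it above the
    // candidate id makes the global minimum-SAD candidate win the atomicMin.
    atomicMin(best, (sad << 16) | (unsigned int)cand);
}
```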
CRiSPy-CUDA: Computing Species Richness in 16S rRNA Pyrosequencing Datasets with CUDA
2011
Pyrosequencing technologies are frequently used for sequencing the 16S rRNA marker gene for metagenomic studies of microbial communities. Computing a pairwise genetic distance matrix from the produced reads is an important but highly time-consuming task. In this paper, we present a parallelized tool (called CRiSPy) for scalable pairwise genetic distance matrix computation and clustering that is based on the processing pipeline of the popular ESPRIT software package. To achieve high computational efficiency, we have designed massively parallel CUDA algorithms for pairwise k-mer distance and pairwise genetic distance computation. We have also implemented a memory-efficient sparse matrix clust…
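A minimal sketch of the kind of pairwise kernel involved (not CRiSPy's or ESPRIT's exact distance measure): each read is pre-encoded as a vector of 4^k k-mer counts, one thread handles one (i, j) pair of reads, and the distance is one minus the fraction of shared k-mers. The profile encoding, k = 3, and equal read lengths are assumptions made for brevity.

```cuda
// Sketch of a pairwise k-mer distance kernel over the upper triangle of the matrix.
// Launch example: dim3 grid((n_reads + 255) / 256, n_reads);
//                 kmer_distances<<<grid, 256>>>(d_profiles, d_dist, n_reads, L - K + 1);
#include <cstdint>

#define K   3
#define DIM 64                     // 4^K possible k-mers

__global__ void kmer_distances(const uint16_t *profiles,  // n_reads x DIM counts
                               float *dist,               // n_reads x n_reads matrix
                               int n_reads, int kmers_per_read)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column read
    int i = blockIdx.y;                             // row read
    if (j >= n_reads || j <= i) return;             // upper triangle only

    int shared = 0;                                 // k-mers common to both reads
    for (int m = 0; m < DIM; ++m)
        shared += min((int)profiles[i * DIM + m], (int)profiles[j * DIM + m]);

    // distance = 1 - fraction of k-mers shared by the two reads
    dist[(long long)i * n_reads + j] = 1.0f - (float)shared / (float)kmers_per_read;
}
```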
COMPARISON OF CPML IMPLEMENTATIONS FOR THE GPU-ACCELERATED FDTD SOLVER
2011
Three distinctively different implementations of convolutional perfectly matched layer for the FDTD method on CUDA-enabled graphics processing units are presented. All implementations store additional variables only inside the convolutional perfectly matched layers, and the computational speeds scale according to the thickness of these layers. The merits of the different approaches are discussed, and a comparison of computational performance is made using complex real-life benchmarks.
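The design point highlighted above, namely that the auxiliary CPML variables exist only inside the absorbing layers so that memory use and update cost scale with the layer thickness, can be sketched in one dimension as below. The field names, coefficient arrays, and the simplified update form are illustrative assumptions, not any of the three implementations compared in the paper.

```cuda
// Kernel sketch: the auxiliary psi array (and its b, a coefficients) is sized
// 2 * npml rather than nx, and only cells inside the two absorbing layers carry
// and update that state; interior cells use the plain FDTD update. The constant
// c stands for the usual dt/(eps*dx) update factor.
__global__ void update_ez_with_cpml(float *ez, const float *hy,
                                    float *psi, const float *b, const float *a,
                                    int nx, int npml, float c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= nx) return;

    float curl = hy[i] - hy[i - 1];          // spatial difference of H

    // Map grid index i to a slot in the small psi array (left or right layer),
    // or -1 if the cell lies in the interior where no auxiliary state is kept.
    int p = -1;
    if (i < npml)            p = i;
    else if (i >= nx - npml) p = npml + (i - (nx - npml));

    if (p >= 0) {                            // convolutional PML recursion
        psi[p] = b[p] * psi[p] + a[p] * curl;
        curl  += psi[p];
    }
    ez[i] += c * curl;                       // standard FDTD update elsewhere
}
```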
CUSHAW Suite: Parallel and Efficient Algorithms for NGS Read Alignment
2017
Next generation sequencing (NGS) technologies have enabled cheap, large-scale, and high-throughput production of short DNA sequence reads and thereby have promoted the explosive growth of data volume. Unfortunately, the produced reads are short and prone to contain errors that are incurred during sequencing cycles. Both large data volume and sequencing errors have complicated the mapping of NGS reads onto the reference genome and have motivated the development of various aligners for very short reads, typically less than 100 base pairs (bps) in length. As read length continues to increase, propelled by advances in NGS technologies, these longer reads tend to have higher sequencing error rat…
Parallelized Clustering of Protein Structures on CUDA-Enabled GPUs
2014
Estimation of the pose in which two given molecules might bind together to form a potential complex is a crucial task in structural biology. To solve this so-called "docking problem", most algorithms initially generate large numbers of candidate poses (or decoys) which are then clustered to allow for subsequent computationally expensive evaluations of reasonable representatives. Since the number of such candidates ranges from thousands to millions, performing the clustering on standard CPUs is highly time consuming. In this paper we analyze and evaluate different approaches to parallelize the nearest neighbor chain algorithm to perform hierarchical Ward clustering of protein structures usin…
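The parallel-friendly step of the nearest-neighbor-chain algorithm is the search for the cluster closest, in Ward's linkage, to the current chain tip. A hedged sketch is shown below: one thread evaluates Ward's merge cost from the tip to one active cluster, and the argmin (the next element of the chain) is then found with an ordinary reduction, e.g. thrust::min_element over the dist array. Representing a cluster by a 3-component centroid and the array names are assumptions, not the paper's data structures.

```cuda
// Kernel sketch: Ward linkage cost from the chain tip to every other active cluster.
#include <float.h>

__global__ void ward_dist_from_tip(const float *cx, const float *cy, const float *cz,
                                   const int *size, const char *active,
                                   int tip, int n_clusters, float *dist)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_clusters) return;
    if (i == tip || !active[i]) { dist[i] = FLT_MAX; return; }  // exclude from argmin

    float dx = cx[i] - cx[tip];
    float dy = cy[i] - cy[tip];
    float dz = cz[i] - cz[tip];
    float sq = dx * dx + dy * dy + dz * dz;

    // Ward's criterion: merge cost = n_i * n_tip / (n_i + n_tip) * ||c_i - c_tip||^2
    dist[i] = (float)size[i] * size[tip] / (size[i] + size[tip]) * sq;
}
```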
Exploiting seeding of random number generators for efficient domain decomposition parallelization of dissipative particle dynamics
2013
Dissipative particle dynamics (DPD) is a promising new method commonly used in coarse-grained simulations of soft matter and biomolecular systems at constant temperature. The DPD thermostat involves the evaluation of stochastic or random forces between pairs of neighboring particles in every time step. In a parallel computing environment, the transfer of these forces from node to node can be very time consuming. In this paper we describe the implementation of a seeded random number generator with three input seeds at each step, which enables the complete generation of the pairwise stochastic forces in parallel DPD simulations with minimal communication between nodes.
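A minimal sketch of the underlying idea, assuming a simple illustrative hash rather than the paper's generator: the random number for a pair is a pure function of the two particle indices and the time step (the three seeds), so the two ranks owning the particles of a boundary pair recompute identical stochastic forces locally instead of exchanging them.

```cuda
// Stateless, counter-based pairwise random numbers keyed by (i, j, step).
// The mixing function is a common 32-bit avalanche finalizer used purely for
// illustration; it is not a statistically validated generator.
#include <cstdint>

__host__ __device__ inline uint32_t mix(uint32_t h)
{
    h ^= h >> 16;  h *= 0x85ebca6bu;
    h ^= h >> 13;  h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

// Uniform number in [0, 1) that is identical for (i, j) and (j, i) at a given step.
__host__ __device__ inline float pair_uniform(uint32_t i, uint32_t j, uint32_t step)
{
    uint32_t lo = (i < j) ? i : j;        // symmetrize so both owners agree
    uint32_t hi = (i < j) ? j : i;
    uint32_t h  = mix(lo ^ mix(hi ^ mix(step)));
    return (float)(h >> 8) * (1.0f / 16777216.0f);   // top 24 bits -> [0, 1)
}

__global__ void pairwise_random_force(const int2 *pairs, int n_pairs,
                                      int step, float sigma, float *f_rand)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n_pairs) return;
    // Zero-mean random variable on [-1, 1); the DPD weight function and the
    // pair direction vector are omitted from this sketch.
    float u = 2.0f * pair_uniform(pairs[p].x, pairs[p].y, (uint32_t)step) - 1.0f;
    f_rand[p] = sigma * u;
}
```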
SAUCE: A Web-Based Automated Assessment Tool for Teaching Parallel Programming
2015
Many curricula for undergraduate studies in computer science provide a lecture on the fundamentals of parallel programming, such as multi-threaded computation on shared-memory architectures using POSIX threads or OpenMP. The complex structure of parallel programs can be challenging, especially for inexperienced students. Thus, there is a latent need for software supporting the learning process. Subsequent lectures may cover more advanced parallelization techniques such as the Message Passing Interface (MPI) and the Compute Unified Device Architecture (CUDA). Unfortunately, the majority of students cannot easily access MPI clusters or modern hardware accelerators in order to effectivel…
On the systolic calculation of all-pairs interactions using transputer arrays
1991
Comparison of implementations of the lattice-Boltzmann method
2008
Simplicity of coding is usually an appealing feature of the lattice-Boltzmann method (LBM). Conventional implementations of LBM are often based on the two-lattice or the two-step algorithm, which, however, suffer from high memory consumption and poor computational performance, respectively. The aim of this work was to identify implementations of LBM that would achieve high computational performance with low memory consumption. Effects of memory addressing schemes were investigated in particular. Data layouts for velocity distribution values were also considered, and they were found to be related to computational performance. A novel bundle data layout was therefore introduced. Address…
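As a CUDA illustration of why the layout of the distribution values matters, the sketch below contrasts two common index functions with a trivial BGK-style relaxation pass: an array-of-structures (AoS) layout makes the accesses of consecutive threads strided, while a structure-of-arrays (SoA) layout makes them contiguous and therefore coalesced. The bundle layout introduced in the paper is not reproduced, and D2Q9 plus the array names are assumptions.

```cuda
// Sketch: AoS vs SoA indexing for D2Q9 distribution values, and a relaxation pass
// written against the SoA layout.
#define Q 9

// AoS: f[cell * Q + q]    -> thread "cell" reads addresses Q elements apart (strided).
__device__ inline int idx_aos(int cell, int q)          { return cell * Q + q; }

// SoA: f[q * n_cells + cell] -> thread "cell" reads consecutive addresses (coalesced).
__device__ inline int idx_soa(int cell, int q, int n)   { return q * n + cell; }

__global__ void relax_soa(const float *f_in, float *f_out,
                          const float *f_eq, float omega, int n_cells)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= n_cells) return;
    for (int q = 0; q < Q; ++q) {
        int k = idx_soa(cell, q, n_cells);                // contiguous across the warp
        f_out[k] = f_in[k] + omega * (f_eq[k] - f_in[k]); // BGK-style relaxation step
    }
}
```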