Search results for "Parallel computing"
showing 10 items of 189 documents
On solving separable block tridiagonal linear systems using a GPU implementation of radix-4 PSCR method
2018
Partial solution variant of the cyclic reduction (PSCR) method is a direct solver that can be applied to certain types of separable block tridiagonal linear systems. Such linear systems arise, e.g., from the Poisson and the Helmholtz equations discretized with bilinear finite-elements. Furthermore, the separability of the linear system entails that the discretization domain has to be rectangular and the discretization mesh orthogonal. A generalized graphics processing unit (GPU) implementation of the PSCR method is presented. The numerical results indicate up to 24-fold speedups when compared to an equivalent CPU implementation that utilizes a single CPU core. Attained floating point perfor…
Fast Poisson solvers for graphics processing units
2013
Two block cyclic reduction linear system solvers are considered and implemented using the OpenCL framework. The topics of interest include a simplified scalar cyclic reduction tridiagonal system solver and the impact of increasing the radix-number of the algorithm. Both implementations are tested for the Poisson problem in two and three dimensions, using a Nvidia GTX 580 series GPU and double precision floating-point arithmetic. The numerical results indicate up to 6-fold speed increase in the case of the two-dimensional problems and up to 3- fold speed increase in the case of the three-dimensional problems when compared to equivalent CPU implementations run on a Intel Core i7 quad-core CPU…
Parallelization of adaptive MC integrators
1997
Monte Carlo (MC) methods for numerical integration seem to be embarassingly parallel on first sight. When adaptive schemes are applied in order to enhance convergence however, the seemingly most natural way of replicating the whole job on each processor can potentially ruin the adaptive behaviour. Using the popular VEGAS-Algorithm as an example an economic method of semi-micro parallelization with variable grain-size is presented and contrasted with another straightforward approach of macro-parallelization. A portable implementation of this semi-micro parallelization is used in the xloops-project and is made publicly available.
CIPRNG: A VLSI Family of Chaotic Iterations Post-Processings for $\mathbb {F}_{2}$ -Linear Pseudorandom Number Generation Based on Zynq MPSoC
2018
Hardware pseudorandom number generators are continuously improved to satisfy both physical and ubiquitous computing security system challenges. The main contribution of this paper is to propose two post-processing modules in hardware, to improve the randomness of linear PRNGs while succeeding in passing the TestU01 statistical battery of tests. They are based on chaotic iterations and are denoted by CIPRNG-MC and CIPRNG-XOR. They have various interesting properties, encompassing the ability to improve the statistical profile of the generators on which they iterate. Such post-processing have been implemented on FPGA and ASIC without inferring any blocs (RAM or DSP). A comparison in terms of …
Hierarchical Parallelization of an H.264/AVC Video Encoder
2006
Last generation video encoding standards increase computing demands in order to reach the limits on compression efficiency. This is particularly the case of H.264/AVC specification that is gaining interest in industry. We are interested in applying parallel processing to H.264 encoders in order to fulfill the computation requirements imposed by stressing applications like video on demand, videoconference, live broadcast, etc. Given a delivered video quality and bit rate, the main complexity parameters are image resolution, frame rate and latency. These parameters can still be pushed forward in such a way that special purpose hardware solutions are not available. Parallel processing based on…
Lightweight LCP construction for next-generation sequencing datasets
2012
The advent of "next-generation" DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix array (LCP) of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-efficient algorithms for computing the LCP of a string have been described in the literature, but require the presence in RAM of large data structures. This prevents such methods from being feasible for NGS datasets. In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP and B…
Design of a Real-time face detection parallel architecture using High-Level Synthesis
2008
Abstract We describe a High-Level Synthesis implementation of a parallel architecture for face detection. The chosen face detection method is the well-known Convolutional Face Finder (CFF) algorithm, which consists of a pipeline of convolution operations. We rely on dataflow modelling of the algorithm and we use a high-level synthesis tool in order to specify the local dataflows of our Processing Element (PE), by describing in C language inter-PE communication, fine scheduling of the successive convolutions, and memory distribution and bandwidth. Using this approach, we explore several implementation alternatives in order to find a compromise between processing speed and area of the PE. We …
On GPU-accelerated fast direct solvers and their applications in image denoising
2015
Multi-Authored Manuscripts and Speedup in Academic Publishing
2014
It is unfair to count a n-authored paper as one paper for each coauthor, i.e., as n papers: this is “feeding the multitude”. Sharing the credit among coauthors by percentages or by simply dividing by n is fairer but somewhat harsh. So, we propose to take into account the productivity gains of parallelization by introducing a team bonus function that multiplies the allocation thereby increasing the credit allocated to each coauthor.The degree of parallelization cannot be determined exogenously discipline by discipline. So, one may propose that each team of coauthors indicates how the labor was organized to produce the paper. Unfortunately, the coauthors may systematically bias their answers …
PUNAS: A Parallel Ungapped-Alignment-Featured Seed Verification Algorithm for Next-Generation Sequencing Read Alignment
2017
The progress of next-generation sequencing has a major impact on medical and genomic research. This technology can now produce billions of short DNA fragments (reads) in a single run. One of the most demanding computational problems used by almost every sequencing pipeline is short-read alignment; i.e. determining where each fragment originated from in the original genome. Most current solutions are based on a seed-and-extend approach, where promising candidate regions (seeds) are first identified and subsequently extended in order to verify whether a full high-scoring alignment actually exists in the vicinity of each seed. Seed verification is the main bottleneck in many state-of-the-art a…