Search results for "parallel computing"
showing 10 items of 189 documents
Multi-objective optimisations for a superscalar architecture with selective value prediction
2012
This work extends an earlier manual design space exploration of our developed Selective Load Value Prediction based superscalar architecture to the L2 unified cache. After that we perform an automatic design space exploration using a specially developed software tool by varying several architectural parameters. Our goal is to find optimal configurations in terms of CPI (Cycles per Instruction) and energy consumption. By varying 19 architectural parameters, as we proposed, the design space is over 2.5 million billion configurations, which obviously means that only heuristic search can be considered. Therefore, we propose different methods of automatic design space exploration based…
SYSTOLIC GENERATION OF k-ARY TREES
1999
The only parallel generating algorithms for k-ary trees are those of Akl and Stojmenović in 1996 and of Vajnovszki and Phillips in 1997. In the first of them, trees are represented by an inversion table and the processor model is a linear array multicomputer. In the second, trees are represented by bitstrings and the algorithm executes on a shared memory multiprocessor. In this paper we give a parallel generating algorithm for k-ary trees represented by generalized P–sequences for execution on a linear array multicomputer.
Optimization of Application-Specific L1 Cache Translation Functions of the LEON3 Processor
2020
Reconfigurable caches offer an intriguing opportunity to tailor cache behavior to applications for better run-times and energy consumptions. While one may adapt structural cache parameters such as cache and block sizes, we adapt the memory-address-to-cache-index mapping function to the needs of an application. Using a LEON3 embedded multi-core processor with reconfigurable cache mappings, a metaheuristic search procedure, and Mibench applications, we show in this work how to accurately compare non-deterministic performances of applications and how to use this information to implement an optimization procedure that evolves application-specific cache mappings.
Parallel macro pipelining on the Intel SCC many-core computer
2013
In this paper we present how Intel's Single-Chip-Cloud processor behaves for parallel macro pipeline applications. Subsets of the SCC's available cores can be arranged as a pipeline where each core processes one stage of the overall workload. Each of the independent cores processes a small part of a larger task and feeds the following core with new data after it finishes its work. Our case-study is a parallel rendering system which renders successive images and applies different filters on them. On normal graphics adapters this is usually done in multiple cycles; we do this in a single pipeline pass. We show that we can achieve a significant speedup by using multiple parallel pipelines on t…
SIMULATING SPIN MODELS ON GPU: A TOUR
2012
The use of graphics processing units (GPUs) in scientific computing has gathered considerable momentum in the past five years. While GPUs in general promise high performance and excellent performance per Watt ratios, not every class of problems is equally well suited for exploiting the massively parallel architecture they provide. Lattice spin models appear to be prototypic examples of problems suitable for this architecture, at least as long as local update algorithms are employed. In this review, I summarize our recent experience with the simulation of a wide range of spin models on GPU employing an equally wide range of update algorithms, ranging from Metropolis and heat bath updates,…
Parallelized short read assembly of large genomes using de Bruijn graphs
2011
Background: Next-generation sequencing technologies have given rise to the explosive increase in DNA sequencing throughput, and have promoted the recent development of de novo short read assemblers. However, existing assemblers require high execution times and a large amount of compute resources to assemble large genomes from quantities of short reads. Results: We present PASHA, a parallelized short read assembler using de Bruijn graphs, which takes advantage of hybrid computing architectures consisting of both shared-memory multi-core CPUs and distributed-memory compute clusters to gain efficiency and scalability. Evaluation using three small-scale real paired-end datasets shows tha…
Architectural improvements and FPGA implementation of a multimodel neuroprocessor
2003
Since neural networks (NNs) require an enormous amount of learning time, various kinds of dedicated parallel computers have been developed. In the paper a 2-D systolic array (SA) of dedicated processing elements (PEs) also called systolic cells (SCs) is presented as the heart of a multimodel neural-network accelerator. The instruction set of the SA allows the implementation of several neural algorithms, including error back propagation and a self organizing feature map algorithm. Several special architectural facilities are presented in the paper in order to improve the 2-D SA performance. A swapping mechanism of the weight matrix allows the implementation of NNs larger than 2-D SA. A systo…
LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs
2015
Compressed sparse row (CSR) is a frequently used format for sparse matrix storage. However, the state-of-the-art CSR-based sparse matrix-vector multiplication (SpMV) implementations on CUDA-enabled GPUs do not exhibit very high efficiency. This has motivated the development of some alternative storage formats for GPU computing. Unfortunately, these alternatives are incompatible with most CPU-centric programs and require dynamic conversion from CSR at runtime, thus incurring significant computational and storage overheads. We present LightSpMV, a novel CUDA-compatible SpMV algorithm using the standard CSR format, which achieves high speed by benefiting from the fine-grained dynamic distribut…
Bit-Parallel Approximate Pattern Matching on the Xeon Phi Coprocessor
2014
Bit-parallel pattern matching encodes calculated values in bit arrays. This approach gains its efficiency by performing multiple updates within a machine word. An important parameter is therefore the machine word size (e.g. 32 or 64 bits). With the increasing length of vector registers, the efficient mapping of bit-parallel pattern matching algorithms onto modern high performance computing architectures is becoming increasingly important. In this paper, we investigate an efficient implementation of the Wu-Manber approximate pattern matching algorithm on the Intel Xeon Phi coprocessor. This architecture features a 512-bit long vector processing unit (VPU) as well as a large number of process…
SWAPHI-LS: Smith-Waterman Algorithm on Xeon Phi coprocessors for Long DNA Sequences
2014
As an optimal method for sequence alignment, the Smith-Waterman (SW) algorithm is widely used. Unfortunately, this algorithm is computationally demanding, especially for long sequences. This has motivated the investigation of its acceleration on a variety of high-performance computing platforms. However, most work in the literature is only suitable for short sequences. In this paper, we present SWAPHI-LS, the first parallel SW algorithm exploiting emerging Xeon Phi coprocessors to accelerate the alignment of long DNA sequences. In SWAPHI-LS, we have investigated three parallelization approaches (naive, tiled, and distributed) in order to deeply explore the inherent parallelism within Xeon P…