Search results for "parallel computing"

showing 10 items of 189 documents

Multi-objective optimisations for a superscalar architecture with selective value prediction

2012

This work extends an earlier manual design space exploration of our developed Selective Load Value Prediction based superscalar architecture to the L2 unified cache. After that we perform an automatic design space exploration using a specially developed software tool by varying several architectural parameters. Our goal is to find optimal configurations in terms of CPI (Cycles per Instruction) and energy consumption. By varying 19 architectural parameters, as we proposed, the design space is over 2.5 million billion configurations, which obviously means that only heuristic search can be considered. Therefore, we propose different methods of automatic design space exploration based…

Hardware and Architecture; Computer science; Cycles per instruction; Superscalar; Value (computer science); Parallel computing; Cache; Energy consumption; Electrical and Electronic Engineering; Design space; Software; Space exploration; Sign (mathematics); IET Computers & Digital Techniques

researchProduct

SYSTOLIC GENERATION OF k-ARY TREES

1999

The only parallel generating algorithms for k-ary trees are those of Akl and Stojmenović in 1996 and of Vajnovszki and Phillips in 1997. In the first of them, trees are represented by an inversion table and the processor model is a linear array multicomputer. In the second, trees are represented by bitstrings and the algorithm executes on a shared memory multiprocessor. In this paper we give a parallel generating algorithm for k-ary trees represented by generalized P–sequences for execution on a linear array multicomputer.

Hardware and Architecture; Shared memory multiprocessor; Processor model; Weight-balanced tree; Parallel algorithm; Parallel computing; Inversion table; Software; Theoretical Computer Science; Linear array; Mathematics; Vector processor; Parallel Processing Letters

researchProduct

Optimization of Application-Specific L1 Cache Translation Functions of the LEON3 Processor

2020

Reconfigurable caches offer an intriguing opportunity to tailor cache behavior to applications for better run-times and energy consumptions. While one may adapt structural cache parameters such as cache and block sizes, we adapt the memory-address-to-cache-index mapping function to the needs of an application. Using a LEON3 embedded multi-core processor with reconfigurable cache mappings, a metaheuristic search procedure, and Mibench applications, we show in this work how to accurately compare non-deterministic performances of applications and how to use this information to implement an optimization procedure that evolves application-specific cache mappings.

Hardware_MEMORYSTRUCTURES; Computer science; CPU cache; media_common.quotation_subject; Application specific; Cache; Parallel computing; Translation (geometry); Function (engineering); Metaheuristic; Energy (signal processing); Block (data storage); media_common

researchProduct

Parallel macro pipelining on the intel SCC many-core computer

2013

In this paper we present how Intel's Single-Chip-Cloud processor behaves for parallel macro pipeline applications. Subsets of the SCC's available cores can be arranged as a pipeline where each core processes one stage of the overall workload. Each of the independent cores processes a small part of a larger task and feeds the following core with new data after it finishes its work. Our case study is a parallel rendering system which renders successive images and applies different filters on them. On normal graphics adapters this is usually done in multiple cycles; we do this in a single pipeline pass. We show that we can achieve a significant speedup by using multiple parallel pipelines on t…

Hardware_MEMORYSTRUCTURES; Speedup; Parallel rendering; business.industry; Computer science; Pipeline (computing); 020207 software engineering; 02 engineering and technology; Parallel computing; Graphics pipeline; Single-chip Cloud Computer; Memory bank; Parallel processing (DSP implementation); Embedded system; 0202 electrical engineering electronic engineering information engineering; Macro; business

researchProduct

SIMULATING SPIN MODELS ON GPU: A TOUR

2012

The use of graphics processing units (GPUs) in scientific computing has gathered considerable momentum in the past five years. While GPUs in general promise high performance and excellent performance per Watt ratios, not every class of problems is equally well suitable for exploiting the massively parallel architecture they provide. Lattice spin models appear to be prototypic examples of problems suitable for this architecture, at least as long as local update algorithms are employed. In this review, I summarize our recent experience with the simulation of a wide range of spin models on GPU employing an equally wide range of update algorithms, ranging from Metropolis and heat bath updates,…

Heat bath; Computer science; Monte Carlo method; General Physics and Astronomy; Statistical and Nonlinear Physics; Massively parallel architecture; Ranging; Parallel computing; Computer Science Applications; Computational Theory and Mathematics; General-purpose computing on graphics processing units; Graphics; Architecture; Mathematical Physics; Performance per watt; International Journal of Modern Physics C

researchProduct

Parallelized short read assembly of large genomes using de Bruijn graphs

2011

Background: Next-generation sequencing technologies have given rise to the explosive increase in DNA sequencing throughput, and have promoted the recent development of de novo short read assemblers. However, existing assemblers require high execution times and a large amount of compute resources to assemble large genomes from quantities of short reads. Results: We present PASHA, a parallelized short read assembler using de Bruijn graphs, which takes advantage of hybrid computing architectures consisting of both shared-memory multi-core CPUs and distributed-memory compute clusters to gain efficiency and scalability. Evaluation using three small-scale real paired-end datasets shows tha…

Hybrid genome assembly; Parallel computing; Computational biology; Biology; lcsh:Computer applications to medicine. Medical informatics; Biochemistry; Assemblers; Structural Biology; Humans; Throughput (business); Molecular Biology; lcsh:QH301-705.5; De Bruijn sequence; Genome; Contig; Bacteria; Genome Human; Applied Mathematics; Message passing; DNA sequencing theory; Computational Biology; High-Throughput Nucleotide Sequencing; Computer Science Applications; lcsh:Biology (General); comic_books; Scalability; lcsh:R858-859.7; comic_books.character; Software; Research Article; BMC Bioinformatics

researchProduct

Architectural improvements and FPGA implementation of a multimodel neuroprocessor

2003

Since neural networks (NNs) require an enormous amount of learning time, various kinds of dedicated parallel computers have been developed. In the paper, a 2-D systolic array (SA) of dedicated processing elements (PEs), also called systolic cells (SCs), is presented as the heart of a multimodel neural-network accelerator. The instruction set of the SA allows the implementation of several neural algorithms, including error back propagation and a self-organizing feature map algorithm. Several special architectural facilities are presented in the paper in order to improve the 2-D SA performance. A swapping mechanism of the weight matrix allows the implementation of NNs larger than the 2-D SA. A systo…

Instruction set; Artificial neural network; Computer architecture; Computer science; Feature (machine learning); Systolic array; Parallel computing; Difference-map algorithm; Field-programmable gate array; Backpropagation; Word (computer architecture); Proceedings of the 9th International Conference on Neural Information Processing, 2002. ICONIP '02.

researchProduct

LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs

2015

Compressed sparse row (CSR) is a frequently used format for sparse matrix storage. However, the state-of-the-art CSR-based sparse matrix-vector multiplication (SpMV) implementations on CUDA-enabled GPUs do not exhibit very high efficiency. This has motivated the development of some alternative storage formats for GPU computing. Unfortunately, these alternatives are incompatible with most CPU-centric programs and require dynamic conversion from CSR at runtime, thus incurring significant computational and storage overheads. We present LightSpMV, a novel CUDA-compatible SpMV algorithm using the standard CSR format, which achieves high speed by benefiting from the fine-grained dynamic distribut…

Instruction set; CUDA; Speedup; Computer science; Sparse matrix-vector multiplication; Double-precision floating-point format; Parallel computing; General-purpose computing on graphics processing units; Row; Sparse matrix; 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

researchProduct
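
As a rough illustration of the compressed sparse row (CSR) format and the SpMV operation named in the LightSpMV abstract above, the following sketch multiplies a CSR matrix by a dense vector. It is a plain sequential reference loop, not the LightSpMV algorithm or any GPU code; the function name `csr_spmv` and the array names `row_ptr`, `col_idx`, and `values` are assumptions chosen for this example.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[r] .. row_ptr[r + 1] delimits the non-zeros of row r inside
    col_idx (column indices) and values (the non-zero entries).
    All names here are illustrative, not taken from the paper.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Walk only the stored non-zeros of this row.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Example matrix:
#   [[10,  0,  0],
#    [ 0, 20, 30],
#    [ 0,  0, 40]]
row_ptr = [0, 1, 3, 4]
col_idx = [0, 1, 2, 2]
values = [10.0, 20.0, 30.0, 40.0]
x = [1.0, 2.0, 3.0]

y = csr_spmv(row_ptr, col_idx, values, x)
print(y)
```

Rows hold different numbers of non-zeros, so the inner loop length varies per row. That imbalance is one plausible reason why distributing rows over GPU threads is non-trivial.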

Bit-Parallel Approximate Pattern Matching on the Xeon Phi Coprocessor

2014

Bit-parallel pattern matching encodes calculated values in bit arrays. This approach gains its efficiency by performing multiple updates within a machine word. An important parameter is therefore the machine word size (e.g. 32 or 64 bits). With the increasing length of vector registers, the efficient mapping of bit-parallel pattern matching algorithms onto modern high performance computing architectures is becoming increasingly important. In this paper, we investigate an efficient implementation of the Wu-Manber approximate pattern matching algorithm on the Intel Xeon Phi coprocessor. This architecture features a 512-bit long vector processing unit (VPU) as well as a large number of process…

Instruction set; Coprocessor; Speedup; Computer science; Parallel computing; Pattern matching; Intrinsics; Word (computer architecture); Xeon Phi; Vector processor; 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing

researchProduct
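
The idea in the bit-parallel pattern matching abstract above, keeping many match states in the bits of one word and updating them together, can be sketched as follows. This is a minimal sequential version of the Shift-And formulation of Wu-Manber style matching with at most `k` edit errors. It is not the paper's Xeon Phi implementation. Python integers stand in for machine words, so the word-size limit discussed in the abstract does not appear here, and the name `approx_find` is an assumption.

```python
def approx_find(pattern, text, k):
    """Return end indices in text where pattern matches with <= k edit errors.

    Bit j of R[d] is set when pattern[0..j] matches some suffix of the text
    read so far with at most d errors. Assumes 0 <= k < len(pattern).
    """
    m = len(pattern)
    full = (1 << m) - 1            # keep only the m state bits
    accept = 1 << (m - 1)          # bit that signals a whole-pattern match

    # masks[c] has bit j set when pattern[j] == c.
    masks = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << j)

    # With d errors allowed, the first d pattern characters may be skipped.
    R = [((1 << d) - 1) & full for d in range(k + 1)]

    ends = []
    for i, ch in enumerate(text):
        b = masks.get(ch, 0)
        prev_old = R[0]
        R[0] = ((R[0] << 1) | 1) & b                       # exact extension
        for d in range(1, k + 1):
            old = R[d]
            R[d] = (
                (((old << 1) | 1) & b)                     # matching character
                | prev_old                                 # extra text character
                | (((prev_old | R[d - 1]) << 1) | 1)       # substitution or skipped pattern character
            ) & full
            prev_old = old
        if R[k] & accept:
            ends.append(i)
    return ends


exact = approx_find("abc", "xabcx", 0)
fuzzy = approx_find("abc", "abd", 1)
print(exact, fuzzy)
```

Each text character costs a constant number of word operations per error level, which is where the efficiency comes from when the pattern fits in one register.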

SWAPHI-LS: Smith-Waterman Algorithm on Xeon Phi coprocessors for Long DNA Sequences

2014

As an optimal method for sequence alignment, the Smith-Waterman (SW) algorithm is widely used. Unfortunately, this algorithm is computationally demanding, especially for long sequences. This has motivated the investigation of its acceleration on a variety of high-performance computing platforms. However, most work in the literature is only suitable for short sequences. In this paper, we present SWAPHI-LS, the first parallel SW algorithm exploiting emerging Xeon Phi coprocessors to accelerate the alignment of long DNA sequences. In SWAPHI-LS, we have investigated three parallelization approaches (naive, tiled, and distributed) in order to deeply explore the inherent parallelism within Xeon P…

Instruction set; Smith–Waterman algorithm; Coprocessor; Xeon; Computer science; Data parallelism; Task parallelism; Parallel computing; SIMD; Intrinsics; Instruction-level parallelism; Xeon Phi; 2014 IEEE International Conference on Cluster Computing (CLUSTER)

researchProduct
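
For orientation, the Smith-Waterman recurrence that SWAPHI-LS accelerates can be sketched in a few lines. This is a sequential version that returns only the best local alignment score, under an assumed linear gap penalty. The scoring values, the function name `smith_waterman_score`, and the two-row layout are choices made for this example, not details from the paper, and none of the paper's naive, tiled, or distributed parallelization approaches are shown.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty, assumed values)."""
    prev = [0] * (len(b) + 1)      # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                     # start a fresh local alignment
                prev[j - 1] + sub,     # align a[i-1] with b[j-1]
                prev[j] + gap,         # gap in b
                curr[j - 1] + gap,     # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


same = smith_waterman_score("GATTACA", "GATTACA")
embedded = smith_waterman_score("TTGATTACATT", "GATTACA")
unrelated = smith_waterman_score("AAAA", "TTTT")
print(same, embedded, unrelated)
```

The table has one cell per pair of positions, so work grows with the product of the two sequence lengths. Keeping only two rows bounds memory, but the running time still makes long sequences expensive.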