Search results for "parallel computing"
showing 10 items of 189 documents
Exploiting selective instruction reuse and value prediction in a superscalar architecture
2009
In our previously published research we discovered some very difficult to predict branches, called unbiased branches. Since the overall performance of modern processors is seriously affected by misprediction recovery, especially these difficult branches represent a source of important performance penalties. Our statistics show that about 28% of branches are dependent on critical Load instructions. Moreover, 5.61% of branches are unbiased and depend on critical Loads, too. In the same way, about 21% of branches depend on MUL/DIV instructions whereas 3.76% are unbiased and depend on MUL/DIV instructions. These dependences involve high-penalty mispredictions becoming serious performance obstac…
AnyDSL: a partial evaluation framework for programming high-performance libraries
2023
This paper advocates programming high-performance code using partial evaluation. We present a clean-slate programming system with a simple, annotation-based, online partial evaluator that operates on a CPS-style intermediate representation. Our system exposes code generation for accelerators (vectorization/parallelization for CPUs and GPUs) via compiler-known higher-order functions that can be subjected to partial evaluation. This way, generic implementations can be instantiated with target-specific code at compile time. In our experimental evaluation we present three extensive case studies from image processing, ray tracing, and genome sequence alignment. We demonstrate that using partial …
PGAC: A Parallel Genetic Algorithm for Data Clustering
2005
Cluster analysis is a valuable tool for exploratory pattern analysis, especially when very little a priori knowledge about the data is available. Distributed systems, based on high speed intranet connections, provide new tools in order to design new and faster clustering algorithms. Here, a parallel genetic algorithm for clustering called PGAC is described. The used strategy of parallelization is the island model paradigm where different populations of chromosomes (called demes) evolve locally to each processor and from time to time some individuals are moved from one deme to another. Experiments have been performed for testing the benefits of the parallelisation paradigm in terms of comput…
The Impact of Java Applications at Microarchitectural Level from Branch Prediction Perspective
2009
The portability, the object-oriented and distributed programming models, multithreading support and automatic garbage collection are features that make Java very attractive for application developers. The main goal of this paper consists in pointing out the impact of Java applications at microarchitectural level from two perspectives: unbiased branches and indirect jumps/calls, such branches limiting the ceiling of dynamic branch prediction and causing significant performance degradation. Therefore, accurately predicting this kind of branches remains an open problem. The simulation part of the paper mainly refers to determining the context length influence on the percentage of unbiased bran…
Heterogeneous PBLAS: Optimization of PBLAS for Heterogeneous Computational Clusters
2008
This paper presents a package, called Heterogeneous PBLAS (HeteroPBLAS), which is built on top of PBLAS and provides optimized parallel basic linear algebra subprograms for heterogeneous computational clusters. We present the user interface and the software hierarchy of the first research implementation of HeteroPBLAS. This is the first step towards the development of a parallel linear algebra package for heterogeneous computational clusters. We demonstrate the efficiency of the HeteroPBLAS programs on a homogeneous computing cluster and a heterogeneous computing cluster.
Pure Functions in C: A Small Keyword for Automatic Parallelization
2017
AbstractThe need for parallel task execution has been steadily growing in recent years since manufacturers mainly improve processor performance by increasing the number of installed cores instead of scaling the processor’s frequency. To make use of this potential, an essential technique to increase the parallelism of a program is to parallelize loops. Several automatic loop nest parallelizers have been developed in the past such as PluTo. The main restriction of these tools is that the loops must be statically analyzable which, among other things, disallows function calls within the loops. In this article, we present a seemingly simple extension to the C programming language which marks fun…
Parallel Simulated Annealing: Getting Super Linear Speedups
2005
The study described in this paper tries to improve and combine different approaches that are able to speed up applications of the Simulated Annealing model. It investigates separately two main aspects concerning the degree of parallelism an implementation can egectively exploit at the initial andfinal periods of an execution. As for case studies, it deals with two implementations: the Job shop Scheduling problem and the poryblio selection problem. The paper reports the results of a large number of experiments, carried out by means of a transputer network and a hypercube system. They give useful suggestions about selecting the most suitable values of the intervention parameters to achieve su…
A recurrence-free variant of strassen’s algorithm on hypercube
1995
In this paper a non-recursive Strassen’s matrix multiplication algorithm is presented. This new algorithm is suitable to run on parallel environments. Two computational schemes have been worked out exploiting different parallel approaches on hypercube architecture. A comparative analysis is reported. The experiments have been carried out on an nCUBE-2 supercomputer, housed at CNUCE in Pisa, supporting the Express parallel operating system. © 1995, Taylor & Francis Group, LLC. All rights reserved.
A block access unit for 2D memory access
2007
Many of the coding tools in the H.264/AVC video coding standard are based on 2D processing resulting in rowwise and column-wise memory accesses starting from arbitrary memory addresses. This paper discusses a low-cost hardware realization of these accesses on sub-word parallel processors. The proposed block access unit is placed between the processor and memory. It supports unaligned 2D block accesses according to several 2D access patterns. The 2D block accesses are pipelinable and they result in minimum number of memory accesses required to deliver the desired data.
Analyzing the Energy Efficiency of the Memory Subsystem in Multicore Processors
2014
In this paper we analyze the energy overhead incurred when operating with data stored in different levels of the memory subsystem (cache levels and DDR chips) of current multicore architectures. Our approach builds upon servet, a portable framework for the memory characterization of multicore processors, extending this suite with a power-related test that, when applied to a platform equipped with a power measurement mechanism, provides information on the efficiency of memory energy usage. As additional contributions, i) we provide a complete experimental study of the impact that the CPU performance states (also known as P-states) exert on the memory energy efficiency of a collection of rece…