Search results for "parallel computing"

showing 10 items of 189 documents

Exploiting selective instruction reuse and value prediction in a superscalar architecture

2009

In our previously published research we discovered some very difficult to predict branches, called unbiased branches. Since the overall performance of modern processors is seriously affected by misprediction recovery, especially these difficult branches represent a source of important performance penalties. Our statistics show that about 28% of branches are dependent on critical Load instructions. Moreover, 5.61% of branches are unbiased and depend on critical Loads, too. In the same way, about 21% of branches depend on MUL/DIV instructions whereas 3.76% are unbiased and depend on MUL/DIV instructions. These dependences involve high-penalty mispredictions becoming serious performance obstac…

Instructions per cycleSpeedupComputer scienceSpeculative executionSpec#Thread (computing)Parallel computingReuseHardware and ArchitectureSuperscalarHardware_CONTROLSTRUCTURESANDMICROPROGRAMMINGcomputerData cacheSoftwarecomputer.programming_languageJournal of Systems Architecture

researchProduct

AnyDSL: a partial evaluation framework for programming high-performance libraries

2023

This paper advocates programming high-performance code using partial evaluation. We present a clean-slate programming system with a simple, annotation-based, online partial evaluator that operates on a CPS-style intermediate representation. Our system exposes code generation for accelerators (vectorization/parallelization for CPUs and GPUs) via compiler-known higher-order functions that can be subjected to partial evaluation. This way, generic implementations can be instantiated with target-specific code at compile time. In our experimental evaluation we present three extensive case studies from image processing, ray tracing, and genome sequence alignment. We demonstrate that using partial …

Intermediate languageComputer science020207 software engineeringImage processing02 engineering and technologyParallel computingPartial evaluation004020204 information systems0202 electrical engineering electronic engineering information engineeringCode generationRay tracing (graphics)General-purpose computing on graphics processing unitsSafety Risk Reliability and QualityImplementationSoftwareCompile time

researchProduct

PGAC: A Parallel Genetic Algorithm for Data Clustering

2005

Cluster analysis is a valuable tool for exploratory pattern analysis, especially when very little a priori knowledge about the data is available. Distributed systems, based on high speed intranet connections, provide new tools in order to design new and faster clustering algorithms. Here, a parallel genetic algorithm for clustering called PGAC is described. The used strategy of parallelization is the island model paradigm where different populations of chromosomes (called demes) evolve locally to each processor and from time to time some individuals are moved from one deme to another. Experiments have been performed for testing the benefits of the parallelisation paradigm in terms of comput…

IntranetCorrectnessTheoretical computer scienceParallel processing (DSP implementation)Artificial neural networkData Clustering Evolutionary Aglorithms Parallel processingSettore INF/01 - InformaticaComputer scienceParallel algorithmA priori and a posterioriAlgorithm designParallel computingCluster analysis

researchProduct

The Impact of Java Applications at Microarchitectural Level from Branch Prediction Perspective

2009

The portability, the object-oriented and distributed programming models, multithreading support and automatic garbage collection are features that make Java very attractive for application developers. The main goal of this paper consists in pointing out the impact of Java applications at microarchitectural level from two perspectives: unbiased branches and indirect jumps/calls, such branches limiting the ceiling of dynamic branch prediction and causing significant performance degradation. Therefore, accurately predicting this kind of branches remains an open problem. The simulation part of the paper mainly refers to determining the context length influence on the percentage of unbiased bran…

JavaComputer Networks and CommunicationsComputer scienceIndirect branchContext (language use)Parallel computingArityBranch predictorComputer Science ApplicationsSoftware portabilityInheritance (object-oriented programming)Computational Theory and MathematicscomputerGarbage collectioncomputer.programming_languageInternational Journal of Computers Communications & Control

researchProduct

Heterogeneous PBLAS: Optimization of PBLAS for Heterogeneous Computational Clusters

2008

This paper presents a package, called Heterogeneous PBLAS (HeteroPBLAS), which is built on top of PBLAS and provides optimized parallel basic linear algebra subprograms for heterogeneous computational clusters. We present the user interface and the software hierarchy of the first research implementation of HeteroPBLAS. This is the first step towards the development of a parallel linear algebra package for heterogeneous computational clusters. We demonstrate the efficiency of the HeteroPBLAS programs on a homogeneous computing cluster and a heterogeneous computing cluster.

Kernel (linear algebra)ScaLAPACKComputer scienceComputer clusterLinear algebraCluster (physics)Concurrent computingSymmetric multiprocessor systemParallel computingBasic Linear Algebra SubprogramsComputational science2008 International Symposium on Parallel and Distributed Computing

researchProduct

Pure Functions in C: A Small Keyword for Automatic Parallelization

2017

AbstractThe need for parallel task execution has been steadily growing in recent years since manufacturers mainly improve processor performance by increasing the number of installed cores instead of scaling the processor’s frequency. To make use of this potential, an essential technique to increase the parallelism of a program is to parallelize loops. Several automatic loop nest parallelizers have been developed in the past such as PluTo. The main restriction of these tools is that the loops must be statically analyzable which, among other things, disallows function calls within the loops. In this article, we present a seemingly simple extension to the C programming language which marks fun…

LOOP (programming language)Computer sciencemedia_common.quotation_subject020209 energy02 engineering and technologyParallel computingcomputer.software_genreToolchainTheoretical Computer ScienceTask (computing)Automatic parallelizationSide effect (computer science)Parallel processing (DSP implementation)020204 information systemsTheory of computationParallelism (grammar)0202 electrical engineering electronic engineering information engineeringPolytope model020201 artificial intelligence & image processingCompilerFunction (engineering)computerSoftwareInformation Systemsmedia_common2017 IEEE International Conference on Cluster Computing (CLUSTER)

researchProduct

Parallel Simulated Annealing: Getting Super Linear Speedups

2005

The study described in this paper tries to improve and combine different approaches that are able to speed up applications of the Simulated Annealing model. It investigates separately two main aspects concerning the degree of parallelism an implementation can egectively exploit at the initial andfinal periods of an execution. As for case studies, it deals with two implementations: the Job shop Scheduling problem and the poryblio selection problem. The paper reports the results of a large number of experiments, carried out by means of a transputer network and a hypercube system. They give useful suggestions about selecting the most suitable values of the intervention parameters to achieve su…

Mathematical optimizationSpeedupComputational complexity theoryJob shop schedulingParallel processing (DSP implementation)Computer scienceSimulated annealingDegree of parallelismFlow shop schedulingParallel computingHypercubeProceedings. Second Euromicro Workshop on Parallel and Distributed Processing

researchProduct

A recurrence-free variant of strassen’s algorithm on hypercube

1995

In this paper a non-recursive Strassen’s matrix multiplication algorithm is presented. This new algorithm is suitable to run on parallel environments. Two computational schemes have been worked out exploiting different parallel approaches on hypercube architecture. A comparative analysis is reported. The experiments have been carried out on an nCUBE-2 supercomputer, housed at CNUCE in Pisa, supporting the Express parallel operating system. © 1995, Taylor & Francis Group, LLC. All rights reserved.

Matrix multiplicationGeneral Computer ScienceComputer scienceExpress operating systemComputer Science (all)Parallel computingStrassen’s algorithmSupercomputerMatrix multiplicationStrassen algorithmHypercube architectureHypercubeAlgorithmHypercube architecture

researchProduct

A block access unit for 2D memory access

2007

Many of the coding tools in the H.264/AVC video coding standard are based on 2D processing resulting in rowwise and column-wise memory accesses starting from arbitrary memory addresses. This paper discusses a low-cost hardware realization of these accesses on sub-word parallel processors. The proposed block access unit is placed between the processor and memory. It supports unaligned 2D block accesses according to several 2D access patterns. The 2D block accesses are pipelinable and they result in minimum number of memory accesses required to deliver the desired data.

Memory addressComputer scienceUniform memory accessSemiconductor memoryParallel computingH 264 avcScalable Video CodingCoding (social sciences)Norchip 2007

researchProduct

Analyzing the Energy Efficiency of the Memory Subsystem in Multicore Processors

2014

In this paper we analyze the energy overhead incurred when operating with data stored in different levels of the memory subsystem (cache levels and DDR chips) of current multicore architectures. Our approach builds upon servet, a portable framework for the memory characterization of multicore processors, extending this suite with a power-related test that, when applied to a platform equipped with a power measurement mechanism, provides information on the efficiency of memory energy usage. As additional contributions, i) we provide a complete experimental study of the impact that the CPU performance states (also known as P-states) exert on the memory energy efficiency of a collection of rece…

Memory coherenceMemory managementFlat memory modelShared memoryComputer scienceInterleaved memoryUniform memory accessDistributed memorySemiconductor memoryParallel computing2014 IEEE International Symposium on Parallel and Distributed Processing with Applications

researchProduct