Search results for "parallel computing"

Showing 9 of 189 documents

Optimizing the Integration Area and Performance of VLIW Architectures by Hardware/Software Co-design

2021

Cost and performance are major concerns that designers of embedded processors must take into account, especially for market reasons. To reduce cost, embedded systems rely on simple hardware architectures such as VLIW (Very Long Instruction Word) processors and look to the compiler for support. This paper aims at developing a design space explorer for VLIW architectures from different perspectives, such as processing performance and integration area. A multi-objective Genetic Algorithm (GA) was used to find the optimum hardware configuration of an embedded system and the optimization rules applied by the compiler to the benchmark code. The first step consisted in represen…
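The multi-objective search the abstract describes can be sketched with a toy genetic algorithm. Configurations are bit strings of enabled features, and the two objectives below (integration area and cycle count) are hypothetical stand-ins for the paper's cost models, which are not reproduced here; a minimal Pareto-based sketch:

```python
import random

# Toy stand-in objectives (hypothetical): "area" grows with every enabled
# bit, while only the first three bits reduce the cycle count, so some
# configurations are strictly dominated.
def area(cfg):
    return sum(cfg) + 1

def cycles(cfg):
    return 64 // (1 + sum(cfg[:3]))

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (both minimised)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(pop):
    scored = [(c, (area(c), cycles(c))) for c in pop]
    return [c for c, s in scored if not any(dominates(t, s) for _, t in scored)]

def evolve(bits=6, pop_size=20, gens=30, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(bits)] for _ in range(pop_size)]
    for _ in range(gens):
        elite = pareto_front(pop)              # keep all non-dominated configs
        children = []
        while len(children) < pop_size - len(elite):
            p, q = rng.sample(pop, 2)          # parents
            cut = rng.randrange(1, bits)       # one-point crossover
            child = p[:cut] + q[cut:]
            child[rng.randrange(bits)] ^= 1    # point mutation
            children.append(child)
        pop = elite + children
    return pareto_front(pop)
```

The returned front contains the surviving area/performance trade-offs; in the real explorer each evaluation would invoke the architecture and compiler cost models rather than these toy formulas.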

Instructions per cycle; Memory hierarchy; Computer science; Population; Evolutionary algorithm; Optimizing compiler; Parallel computing; Very long instruction word; Genetic algorithm; Compiler
researchProduct

A parallel radix-4 block cyclic reduction algorithm

2014

A conventional block cyclic reduction algorithm operates by halving the size of the linear system at each reduction step; that is, the algorithm is a radix-2 method. An algorithm analogous to block cyclic reduction, known as the radix-q partial solution variant of the cyclic reduction (PSCR) method, allows the use of higher radix numbers and is thus more suitable for parallel architectures, as it requires fewer reduction steps. This paper presents an alternative and more intuitive way of deriving a radix-4 block cyclic reduction method for systems with a coefficient matrix of the form tridiag{ − I,D, − I}. This is performed by modifying an existing radix-2 block cyclic reduction method. Th…
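The radix-2 starting point can be illustrated on a scalar tridiagonal system. The sketch below is a textbook recursive cyclic reduction for systems of size 2^k − 1, not the paper's radix-4 PSCR variant:

```python
import numpy as np

def cyclic_reduction(a, d, c, b):
    """Radix-2 cyclic reduction for a tridiagonal system (a: sub-diagonal,
    d: diagonal, c: super-diagonal, with a[0] = c[-1] = 0). Each level keeps
    the odd-positioned (0-based) unknowns, roughly halving the system;
    requires len(d) == 2**k - 1. A teaching sketch only."""
    n = len(d)
    if n == 1:
        return np.array([b[0] / d[0]])
    odd = np.arange(1, n, 2)
    alpha = a[odd] / d[odd - 1]
    beta = c[odd] / d[odd + 1]
    # reduced tridiagonal system coupling the odd-positioned unknowns
    ra = -alpha * a[odd - 1]
    rd = d[odd] - alpha * c[odd - 1] - beta * a[odd + 1]
    rc = -beta * c[odd + 1]
    rb = b[odd] - alpha * b[odd - 1] - beta * b[odd + 1]
    y = cyclic_reduction(ra, rd, rc, rb)
    # back-substitute the even-positioned unknowns
    x = np.empty(n)
    x[odd] = y
    xe = np.zeros(n + 2)       # x padded with zero boundary values
    xe[1:-1][odd] = y
    even = np.arange(0, n, 2)
    x[even] = (b[even] - a[even] * xe[even] - c[even] * xe[even + 2]) / d[even]
    return x
```

A radix-4 method eliminates three of every four unknowns per level, so it needs half as many (coarser) reduction steps, which is the parallelism advantage the abstract points to.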

fast Poisson solver; block cyclic reduction; parallel computing; cyclic reduction; PSCR; direct solver; partial fraction technique

CUDA-BLASTP: Accelerating BLASTP on CUDA-enabled graphics hardware

2011

Scanning a protein sequence database is a frequently repeated task in computational biology and bioinformatics. However, scanning large protein databases, such as GenBank, with popular tools such as BLASTP requires long runtimes on sequential architectures. Because sequence databases continue to grow rapidly, there is high demand to accelerate this task. In this paper, we demonstrate how GPUs, powered by the Compute Unified Device Architecture (CUDA), can be used as an efficient computational platform to accelerate the BLASTP algorithm. In order to exploit the GPU's capabilities for accelerating BLASTP, we have used a compressed deterministic finite state automaton for hit detection as wel…
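The hit-detection stage mentioned above can be sketched in a few lines; a plain dictionary of query words stands in here for the compressed deterministic finite state automaton of the paper:

```python
from collections import defaultdict

def build_word_index(query, w=3):
    """Map each length-w word of the query to its positions. A plain dict
    stands in for the compressed DFA the paper uses for hit detection."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index

def find_hits(index, subject, w=3):
    """Return (query_pos, subject_pos) word matches -- the seed hits a
    BLASTP-style pipeline would then extend into gapped alignments."""
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in index.get(subject[j:j + w], ())]
```

On a GPU, many subject sequences are scanned in parallel, and a compact automaton representation keeps the per-thread lookups cheap and memory-friendly.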

graphics hardware; Source code; Computer science; Graphics processing unit; Parallel computing; General-purpose computing on graphics processing units (GPGPU); Computational science; Instruction set; CUDA; Genetics; Computer graphics; Databases, Protein; dynamic programming; Finite-state machine; Sequence database; Applied mathematics; Proteins; Compute Unified Device Architecture (CUDA); sequence alignment; Algorithms; Software; Biotechnology

SPH modeling of blood flow in cerebral aneurysms

Cerebral aneurysms are pathological dilations of cerebral arteries. These pathologies carry an intrinsic risk of rupture, with consequent intracranial hemorrhages. Although the mechanisms of formation, growth, and rupture of cerebral aneurysms are not yet fully understood, it is commonly recognized that hemodynamic factors play a very important role in these processes. Numerical simulations can provide useful information on hemodynamics and can be used for clinical applications. In traditional numerical methods based on a computational grid, discretizing the cerebral vessels on which an aneurysm sits is very complex. On the other h…
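SPH sidesteps the meshing problem above by replacing the grid with particles. As a minimal illustration of the method itself (not the paper's blood-flow solver), a 1D summation-density sketch with the standard cubic spline kernel:

```python
import numpy as np

def cubic_spline_kernel(r, h):
    """Standard 1D cubic spline smoothing kernel W(|r|, h)."""
    q = np.abs(r) / h
    sigma = 2.0 / (3.0 * h)                    # 1D normalisation constant
    return sigma * np.where(
        q < 1.0, 1.0 - 1.5 * q**2 + 0.75 * q**3,
        np.where(q < 2.0, 0.25 * (2.0 - q)**3, 0.0))

def sph_density(x, m, h):
    """Summation density: rho_i = sum_j m_j * W(x_i - x_j, h)."""
    dx = x[:, None] - x[None, :]
    return (m[None, :] * cubic_spline_kernel(dx, h)).sum(axis=1)
```

Because every field is a kernel-weighted sum over neighbouring particles, irregular vessel geometries need no body-fitted mesh; the same per-particle structure is also what makes SPH amenable to the parallel computing listed in the keywords.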

mechanical platelet activation; Smoothed particle hydrodynamics (SPH); parallel computing; Hemodynamics; Multi-block; Open boundaries; cerebral aneurysms; Pressure Poisson equation; Settore ICAR/01 - Idraulica (Hydraulics)

Perfect Hashing Structures for Parallel Similarity Searches

2015

Seed-based heuristics have proved to be efficient for studying similarity between genetic databases with billions of base pairs. This paper focuses on algorithms and data structures for the filtering phase in seed-based heuristics, with an emphasis on efficient parallel GPU/manycore implementation. We propose a 2-stage index structure which is based on neighborhood indexing and perfect hashing techniques. This structure performs a filtering phase over the neighborhood regions around the seeds in constant time and avoids random memory accesses and branch divergences as much as possible. Moreover, it fits particularly well on parallel SIMD processors, because it requ…
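A rough sketch of the 2-stage idea, with a plain Python dict standing in for the minimal perfect hash and small illustrative parameters (`seed_len`, `flank` are hypothetical, not the paper's values):

```python
def build_neighborhood_index(text, seed_len=2, flank=1):
    """Stage 1: for every seed occurring in `text`, store its neighbourhoods
    (seed plus `flank` characters on each side) next to each other, so the
    filtering stage reads them contiguously. A Python dict stands in for
    the minimal perfect hash function of the paper."""
    index = {}
    for i in range(flank, len(text) - seed_len - flank + 1):
        seed = text[i:i + seed_len]
        index.setdefault(seed, []).append(text[i - flank:i + seed_len + flank])
    return index

def passes_filter(index, seed, neighborhood):
    """Stage 2: one seed lookup, then direct comparisons against the stored
    neighbourhoods -- no random accesses back into the indexed text."""
    return neighborhood in index.get(seed, ())
```

Storing the neighbourhoods alongside the seeds is what removes the random memory accesses: on an SIMD processor, all lanes compare fixed-length neighbourhood strings read from one contiguous region.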

parallelism; Similarity; OpenCL; Computer science; seed-based heuristics; Hash function; Search engine indexing; GPU; Parallel computing; Data structure; perfect hash function; Pattern matching; SIMD; [INFO.INFO-BI] Computer Science [cs]/Bioinformatics [q-bio.QM]; [INFO.INFO-DC] Computer Science [cs]/Distributed Parallel and Cluster Computing [cs.DC]; read mapper; Heuristics; 2015 IEEE International Parallel and Distributed Processing Symposium Workshop

Improvements and applications of the elements of prototype-based clustering

2018

Clustering, or cluster analysis, is an essential part of data mining, machine learning, and pattern recognition. The most widely applied clustering methods are partitioning-based, or prototype-based, methods. Prototype-based clustering methods are usually easy to implement and scale well. These methods, such as K-means clustering, have been used for different applications in various fields. On the other hand, prototype-based clustering methods are typically sensitive to initialization, and the selection of the number of clusters for knowledge discovery purposes is not straightforward. In the era of big data, in high-velocity, ever-growing datasets, which can also be erroneous, outl…
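The initialization sensitivity mentioned above is easy to reproduce with plain Lloyd iterations; the data and starting centers below are made up for illustration:

```python
import numpy as np

def kmeans(X, centers, iters=50):
    """Plain Lloyd iterations from a given initialisation; the quality of
    the result depends on where `centers` start."""
    centers = centers.astype(float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels

def sse(X, centers, labels):
    """Within-cluster sum of squared errors (lower is better)."""
    return ((X - centers[labels]) ** 2).sum()
```

With three well-separated clusters, starting two centers inside one cluster leaves Lloyd's algorithm stuck in a poor local optimum, which is exactly why the more careful initialization schemes studied in this line of work matter.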

random projection; parallel computing; knowledge discovery; clustering initialization; minimal learning machine; data mining; prototype-based clustering; machine learning; big data; parallel processing; cluster analysis; robust clustering; K-means

Lattice Boltzmann Simulations at Petascale on Multi-GPU Systems with Asynchronous Data Transfer and Strictly Enforced Memory Read Alignment

2015

The lattice Boltzmann method is a well-established numerical approach for complex fluid flow simulations. Recently, general-purpose graphics processing units have become accessible as large-scale high-performance computing resources. We report on implementing a lattice Boltzmann solver for multi-GPU systems that achieves 0.69 PFLOPS performance on 16384 GPUs. In addition to optimizing the data layout on the GPUs and eliminating the halo sites, we make use of the possibility to overlap data transfer between the host CPU and the device GPU with computing on the GPU. We simulate flow in porous media and measure both strong and weak scaling performance with the emphasis being on a large scale…
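The transfer/compute overlap described above follows a common pattern: update the boundary sites first, launch the halo exchange asynchronously, then update the interior while the transfer is in flight. A schematic 1D stencil sketch, where a worker thread merely mimics the CPU-GPU transfer (the names and the `exchange` callback are illustrative, not the paper's code):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def step_with_overlap(u, exchange):
    """One averaging-stencil update of a 1D array with two halo cells.
    `exchange(left_edge, right_edge)` returns the new halo values and here
    stands in for the asynchronous host/device or MPI transfer."""
    new = np.empty_like(u)
    # 1. boundary sites first: their results are what the neighbours need
    new[1] = 0.5 * (u[0] + u[2])
    new[-2] = 0.5 * (u[-3] + u[-1])
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(exchange, new[1], new[-2])   # 2. launch transfer
        # 3. bulk interior update overlaps the in-flight transfer
        new[2:-2] = 0.5 * (u[1:-3] + u[3:-1])
        new[0], new[-1] = fut.result()                 # 4. install new halos
    return new
```

When the interior work takes at least as long as the transfer, the communication cost disappears from the critical path, which is what makes the weak scaling to thousands of GPUs possible.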

Computer science; Lattice Boltzmann methods; GPU; Parallel computing; Solver; Lattice Boltzmann; memory alignment; Computational science; Petascale computing; Asynchronous communication; Data structure alignment; Graphics; Titan; Data transmission; Euromicro International Conference on Parallel, Distributed and Network-Based Processing

Designing a graphics processing unit accelerated petaflop capable lattice Boltzmann solver: Read aligned data layouts and asynchronous communication

2016

The lattice Boltzmann method is a well-established numerical approach for complex fluid flow simulations. Recently, general-purpose graphics processing units (GPUs) have become available as high-performance computing resources at large scale. We report on designing and implementing a lattice Boltzmann solver for multi-GPU systems that achieves 1.79 PFLOPS performance on 16,384 GPUs. To achieve this performance, we introduce a GPU compatible version of the so-called bundle data layout and eliminate the halo sites in order to improve data access alignment. Furthermore, we make use of the possibility to overlap data transfer between the host central processing unit and the device GPU with com…

computational fluid dynamics; large-scale I/O; Computer science; Graphics processing unit; Lattice Boltzmann methods; Parallel computing; memory alignment; processors; Theoretical Computer Science; Data structure alignment; Graphics; data layout; Solver; Lattice Boltzmann; applied mathematics; Data access; Hardware and Architecture; Asynchronous communication; Central processing unit; Titan; Software; The International Journal of High Performance Computing Applications

Compression and load balancing for efficient sparse matrix-vector product on multicore processors and graphics processing units

2021

We contribute to the optimization of the sparse matrix-vector product by introducing a variant of the coordinate sparse matrix format that balances the workload distribution and compresses both the indexing arrays and the numerical information. Our approach is multi-platform, in the sense that the realizations for (general-purpose) multicore processors as well as graphics accelerators (GPUs) are built upon common principles, but differ in the implementation details, which are adapted to avoid thread divergence in the GPU case or maximize compression element-wise (i.e., for each matrix entry) for multicore architectures. Our evaluation on the last two generations of NVIDIA GPUs as well as In…
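The balancing idea can be sketched by splitting the COO nonzero arrays into equal-sized chunks, independently of row boundaries. The scatter-add below plays the role of the atomics or reduction that combine partial results in a real parallel version, and the compression side of the paper is not shown:

```python
import numpy as np

def balanced_coo_spmv(rows, cols, vals, x, nrows, nchunks=4):
    """y = A @ x with A in coordinate (COO) form. The nonzeros are split
    into equal-sized chunks (one per notional thread) so the work stays
    balanced even when row lengths are very uneven; np.add.at is the
    sequential stand-in for the atomic/reduction combine step."""
    y = np.zeros(nrows)
    bounds = np.linspace(0, len(vals), nchunks + 1).astype(int)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # gather x, multiply, scatter-add into y for this chunk of nonzeros
        np.add.at(y, rows[lo:hi], vals[lo:hi] * x[cols[lo:hi]])
    return y
```

Partitioning by nonzeros rather than by rows is what equalizes the per-thread work; the price is that two chunks may touch the same output row, hence the need for an atomic or reduction-based combine.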

workload balancing; Multi-core processor; Computer Networks and Communications; Computer science; sparse matrix-vector product; Parallel computing; Load balancing (computing); coordinate sparse matrix format; compression; Exascale computing; Computer Science Applications; Theoretical Computer Science; Computational Theory and Mathematics; Graphics; graphics processing units (GPUs); multicore processors (CPUs); Software