Search results for "distributed"
showing 10 items of 1260 documents
Efficient Parallel Sort on AVX-512-Based Multi-Core and Many-Core Architectures
2019
Sorting kernels are a fundamental part of numerous applications. The performance of sorting implementations is usually limited by a variety of factors such as computing power, memory bandwidth, and branch mispredictions. In this paper we propose an efficient hybrid sorting method which takes advantage of wide vector registers and the high bandwidth memory of modern AVX-512-based multi-core and many-core processors. Our approach employs a combination of vectorized bitonic sorting and load-balanced multi-threaded merging. Thread-level and data-level parallelism are used to exploit both compute power and memory bandwidth. Our single-threaded implementation is ~30x faster than qsort in the C st…
Regulating blockchain for sustainability? The critical relationship between digital innovation, regulation, and electricity governance
2021
Abstract Blockchain technology has found several innovative applications in the electricity industry. However, its potential has still to be discovered. This is partly due to the role that regulation plays in electricity markets. To be introduced, experimented with, and eventually adopted on a commercial scale, blockchain-supported innovations need to fit the existing regulatory framework or the rules to be reshaped or updated. We focus on energy regulators' possible responses to the blockchain-enhanced market operations (both from the incumbents and potential newcomers), suggesting a monitoring mechanism that can support innovation.
Online Scheduling of Task Graphs on Hybrid Platforms
2018
Modern computing platforms commonly include accelerators. We target the problem of scheduling applications modeled as task graphs on hybrid platforms made of two types of resources, such as CPUs and GPUs. We consider that task graphs are uncovered dynamically, and that the scheduler has information only on the available tasks, i.e., tasks whose predecessors have all been completed. Each task can be processed by either a CPU or a GPU, and the corresponding processing times are known. Our study extends a previous \(4\sqrt{m/k}\)-competitive online algorithm [2], where m is the number of CPUs and k the number of GPUs (\(m\ge k\)). We prove that no online algorithm can have a competitive ratio …
Hybrid P2P schemes for remote terrain interactive visualization systems
2013
Over the last few years, there has been a lot of development of interactive terrain visualization applications using remote databases. One of the main problems that these applications must face is scalability. These applications usually use a client-server model that cannot support a large number of concurrent requests without using a considerable number of servers. In this paper, we present a full comparative study of new hybrid P2P schemes for terrain interactive visualization systems. The performance evaluation results show that the best strategy consists of avoiding the periodical reporting among peer nodes about the current information contained in each node, while using some servers a…
Massively Parallel ANS Decoding on GPUs
2019
In recent years, graphics processors have enabled significant advances in the fields of big data and streamed deep learning. In order to keep control of rapidly growing amounts of data and to achieve sufficient throughput rates, compression features are a key part of many applications including popular deep learning pipelines. However, as most of the respective APIs rely on CPU-based preprocessing for decoding, data decompression frequently becomes a bottleneck in accelerated compute systems. This establishes the need for efficient GPU-based solutions for decompression. Asymmetric numeral systems (ANS) represent a modern approach to entropy coding, combining superior compression results wit…
Neighbor-list-free molecular dynamics on sunway TaihuLight supercomputer
2020
Molecular dynamics (MD) simulations are playing an increasingly important role in many research areas. Pair-wise potentials are widely used in MD simulations of bio-molecules, polymers, and nano-scale materials. Due to a low compute-to-memory-access ratio, their calculation is often bounded by memory transfer speeds. Sunway TaihuLight is one of the fastest supercomputers featuring a custom SW26010 many-core processor. Since the SW26010 has some critical limitations regarding main memory bandwidth and scratchpad memory size, it is considered as a good platform to investigate the optimization of pair-wise potentials especially in terms of data reusage. MD algorithms often use a neighbor-list …
Multi-application Based Network-on-Chip Design for Mesh-of-Tree Topology Using Global Mapping and Reconfigurable Architecture
2019
This paper outlines a multi-application mapping for Mesh-of-Tree (MoT) topology based Network-on-Chip (NoC) design using reconfigurable architecture. A two phase Particle Swarm Optimization (PSO) has been proposed for reconfigurable architecture to minimize the communication cost. In first phase global mapping is done by combining multiple applications and in second phase, reconfiguration is achieved by switching the cores to near by routers using multiplexers. Experimentations have been carried out for several application benchmarks and synthetic applications generated using TGFF tool. The results show significant improvement in terms of communication cost after reconfiguration.
MARL-Ped+Hitmap: Towards Improving Agent-Based Simulations with Distributed Arrays
2016
Multi-agent systems allow the modelling of complex, heterogeneous, and distributed systems in a realistic way. MARL-Ped is a multi-agent system tool, based on the MPI standard, for the simulation of different scenarios of pedestrians who autonomously learn the best behavior by Reinforcement Learning. MARL-Ped uses one MPI process for each agent by design, with a fixed fine-grain granularity. This requirement limits the performance of the simulations for a restricted number of processors that is lesser than the number of agents. On the other hand, Hitmap is a library to ease the programming of parallel applications based on distributed arrays. It includes abstractions for the automatic parti…
Fault-Tolerant Network-on-Chip Design for Mesh-of-Tree Topology Using Particle Swarm Optimization
2018
As the size of the chip is scaling down the density of Intellectual Property (IP) cores integrated on a chip has been increased rapidly. The communication between these IP cores on a chip is highly challenging. To overcome this issue, Network-on-Chip (NoC) has been proposed to provide an efficient and a scalable communication architecture. In the deep sub-micron level NoCs are prone to faults which can occur in any component of NoC. To build a reliable and robust systems, it is necessary to apply efficient fault-tolerant techniques. In this paper, we present a flexible spare core placement in Mesh-of-Tree (MoT) topology using Particle Swarm Optimization (PSO) by considering IP core failures…
Nvidia CUDA parallel processing of large FDTD meshes in a desktop computer
2020
The Finite Difference in Time Domain numerical (FDTD) method is a well know and mature technique in computational electrodynamics. Usually FDTD is used in the analysis of electromagnetic structures, and antennas. However still there is a high computational burden, which is a limitation for use in combination with optimization algorithms. The parallelization of FDTD to calculate in GPU is possible using Matlab and CUDA tools. For instance, the simulation of a planar array, with a three dimensional FDTD mesh 790x276x588, for 6200 time steps, takes one day -elapsed time- using the CPU of an Intel Core i3 at 2.4GHz in a personal computer, 8Gb RAM. This time is reduced 120 times when the calcula…