Evaluation of an Alternative for Increasing Switch Radix
In large switch-based interconnection networks, increasing the switch radix results in a decrease in the total number of network components. In this paper we evaluate an interesting strategy for building high-radix switches going beyond the integration scale bounds. This approach is independent of the evolution of single-chip switches and will remain valid as integration scale keeps evolving. Simulation results show that, with a correct internal switch design, this kind of switch achieves almost the same performance as a single-chip switch of the same radix, which would be infeasible at the current integration scale.
NoC Reconfiguration for CMP Virtualization
At the NoC level, traffic interference can be drastically reduced by using virtualization mechanisms. An effective strategy to virtualize a NoC consists of dividing the network into different partitions, each one serving different applications and traffic flows. In this paper, we propose a NoC reconfiguration mechanism to support NoC virtualization under real scenarios. Network resources can be reassigned to different partitions at run time so that the NoC adapts dynamically to application needs. Evaluation results show good behavior of CMP virtualization.
A Fast GPU-Based Motion Estimation Algorithm for H.264/AVC
H.264/AVC is the most recent predictive video compression standard; it outperforms other existing video coding standards at the cost of higher computational complexity. In recent years, heterogeneous computing has emerged as a cost-efficient solution for high-performance computing. In the literature, several algorithms have been proposed to accelerate video compression, but so far there have not been many solutions that deal with video codecs using heterogeneous systems. This paper proposes an algorithm to perform H.264/AVC inter prediction. The proposed algorithm performs the motion estimation, both with full-pixel and sub-pixel accuracy, using CUDA to assist the CPU, obtaining remarkable time …
Optimal Configuration for N-Dimensional Twin Torus Networks
Torus topology is one of the most common topologies used in the current largest supercomputers. Although 3D torus is widely used, recently some supercomputers in the Top500 list have been built using networks with topologies of five or six dimensions. To obtain an nD torus, 2n ports per node are needed. These ports can be offered by a single or several cards per node. In the second case, there are multiple ways of assigning the dimension and direction of the card ports. In a previous work we proposed the 3D Twin (3DT) torus which uses two 4-port cards per node, and obtained the optimal port configuration. This paper extends and generalizes that work in order to obtain the optimal port confi…
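The statement that an nD torus needs 2n ports per node can be illustrated with a minimal Python sketch. The function names `torus_neighbors` and `ports_per_node` are hypothetical and chosen only for this illustration; the sketch does not model the Twin torus card assignment studied in the paper, only the plain torus neighbourhood.

```python
def ports_per_node(n_dims):
    """Number of ports a node needs in a plain n-dimensional torus."""
    return 2 * n_dims


def torus_neighbors(node, dims):
    """Return the neighbours of `node` in a torus with side lengths `dims`.

    Each dimension contributes two links (one per direction), with
    wrap-around at the edges. The neighbours are distinct only when
    every side length is at least 3 (an assumption of this sketch).
    """
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coords = list(node)
            coords[axis] = (coords[axis] + step) % size
            neighbors.append(tuple(coords))
    return neighbors


if __name__ == "__main__":
    # A node in a 4x4x4 torus has 6 neighbours, one per port.
    print(torus_neighbors((0, 0, 0), (4, 4, 4)))
    print(ports_per_node(3))
```

With five or six dimensions, the same count gives 10 or 12 ports per node, which is where the choice between a single card and several cards per node becomes relevant.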
3D high definition video coding on a GPU-based heterogeneous system
H.264/MVC is a standard for supporting the sensation of 3D, based on coding from 2 (stereo) to N views. H.264/MVC adopts many coding options inherited from single-view H.264/AVC, and thus its complexity is even higher, mainly because the number of views to process is higher. In this manuscript, we aim at an efficient parallelization of the most computationally intensive video encoding module for stereo sequences, namely inter prediction, and at its collaborative execution on a heterogeneous platform. The proposal is based on an efficient dynamic load balancing algorithm and on breaking encoding dependencies. Experimental results demonstrate the proposed algorithm's ability to reduce the…
Reducing complexity in H.264/AVC motion estimation by using a GPU
H.264/AVC applies a complex mode decision technique that has high computational complexity in order to reduce the temporal redundancies of video sequences. Several algorithms have been proposed in the literature in recent years with the aim of accelerating this part of the encoding process. Recently, with the emergence of many-core processors or accelerators, a new approach can be adopted for reducing the complexity of the H.264/AVC encoding algorithm. This paper focuses on reducing the inter prediction complexity adopted in H.264/AVC and proposes a GPU-based implementation using CUDA. Experimental results show that the proposed approach reduces the complexity by as much as 99% (100x of spe…
VEF Traces: A Framework for Modelling MPI Traffic in Interconnection Network Simulators
Simulation is often used to evaluate the behaviour and measure the performance of computing systems. Specifically, in high-performance interconnection networks, simulation has been extensively used to verify the behaviour of the network itself and to evaluate its performance. In this context, network simulation must be fed with network traffic, also referred to as network workload, whose nature has been traditionally synthetic. These workloads can be used for the purpose of driving studies on network performance, but often such workloads are not accurate enough if a realistic evaluation is pursued. For this reason, other non-synthetic workloads have gained popularity over last dec…
Optimizing H.264/AVC interprediction on a GPU-based framework
H.264/MPEG-4 part 10 is the latest standard for video compression and promises a significant advance in terms of quality and distortion compared with the commercial standards currently most in use such as MPEG-2 or MPEG-4. To achieve this better performance, H.264 adopts a large number of new/improved compression techniques compared with previous standards, albeit at the expense of higher computational complexity. In addition, in recent years new hardware accelerators have emerged, such as graphics processing units (GPUs), which provide a new opportunity to reduce complexity for a large variety of algorithms. However, current GPUs suffer from higher power consumption requirements because of…
Adapting hierarchical bidirectional inter prediction on a GPU-based platform for 2D and 3D H.264 video coding
The H.264/AVC video coding standard introduces some improved tools in order to increase compression efficiency. Moreover, the multi-view extension of H.264/AVC, called H.264/MVC, adopts many of them. Among the new features, variable block-size motion estimation is one which contributes to high coding efficiency. Furthermore, it defines a different prediction structure that includes hierarchical bidirectional pictures, outperforming traditional Group of Pictures patterns in both scenarios: single-view and multi-view. However, these video coding techniques have high computational complexity. Several techniques have been proposed in the literature over the last few years which are aimed at acc…
Deadline-based QoS Algorithms for High-performance Networks
Quality of service (QoS) is becoming an attractive feature for high-performance networks and parallel machines because it could allow a more efficient use of resources. Deadline-based algorithms can provide powerful QoS provision. However, the cost associated with keeping ordered lists of packets makes them impractical for high-performance networks. In this paper, we explore how to efficiently adapt the earliest-deadline-first family of algorithms to high-speed network environments. The results show excellent performance using just two virtual channels, FIFO queues, and a cost feasible with today's technology.
C-switches: Increasing switch radix with current integration scale
In large switch-based interconnection networks, increasing the switch radix results in a decrease in the total number of network components, and consequently the overall cost of the network can be significantly reduced. Moreover, high-radix switches are an attractive option to improve the network performance in terms of latency, since hop count is also reduced. However, the integration scale poses some problems for designing such single-chip switches. In this paper we discuss key issues and evaluate an interesting alternative for building high-radix switches going beyond the integration scale bounds. The idea basically consists in combining several current smaller single-chip swi…
Accelerating H.264 inter prediction in a GPU by using CUDA
H.264/AVC defines a very efficient algorithm for inter prediction, but it is very time-consuming. The emergence of General Purpose Graphics Processing Units (GPGPU) has opened a new door to running this video algorithm on these small processing units. This paper takes a step towards an implementation of the H.264/AVC inter prediction algorithm on a GPU using Compute Unified Device Architecture (CUDA). The results show a negligible rate-distortion drop with an average time reduction of up to 93.6%.
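To relate a time-reduction percentage such as the one quoted above to a speed-up factor, the following Python sketch may help. It assumes that "time reduction" means 1 - t_new / t_old, which is an interpretation rather than something the abstract states; the function name `speedup_from_time_reduction` is hypothetical.

```python
def speedup_from_time_reduction(reduction):
    """Convert a fractional time reduction into a speed-up factor.

    Assumes reduction = 1 - t_new / t_old, so that
    speed-up = t_old / t_new = 1 / (1 - reduction).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # Under this reading, a 93.6% time reduction is a speed-up of about 15.6x.
    print(round(speedup_from_time_reduction(0.936), 1))
```

Under the same assumption, a 99% reduction corresponds to a speed-up of about 100x, so small differences in the percentage near 100% translate into large differences in speed-up.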
Efficient Switches with QoS Support for Clusters
Current interconnect standards providing hardware support for quality of service (QoS) consider up to 16 virtual channels (VCs) for this purpose. However, most implementations do not offer so many VCs because they increase the complexity of the switch and the scheduling delays. We have shown that this number of VCs can be significantly reduced, because it is enough to use two VCs for QoS purposes at each switch port. In this paper, we address the weaknesses of that proposal: we not only reduce the number of VCs but also improve performance thanks to the flexibility in assigning buffer memory.
A GPU-Based DVC to H.264/AVC Transcoder
Mobile-to-mobile video conferencing is one of the services that the newest mobile network operators can offer to users. With the emergence of the distributed video coding paradigm, which moves the majority of the complexity from the encoder to the decoder, this offering can be achieved by introducing a transcoder. This device has to convert from the distributed video coding paradigm to traditional video coding such as H.264/AVC, which is formed by simpler decoders and more complex encoders, and allows users to execute only the low-complexity algorithms. In order to deal with this highly complex video transcoder, this paper introduces a graphics processing unit based transcoder as base station. The…