A scalable matrix computing unit architecture for FPGA, and SCUMO user design interface

Manuel Bataller-Mompeán, Taras Iakymchuk, Alfredo Rosado-Muñoz, Asgar Abbaszadeh, Jose V. Frances-Villora

subject

Computer Networks and Communications; Computer science; MathematicsofComputing_NUMERICALANALYSIS; Computer systems; lcsh:TK7800-8360; 02 engineering and technology; Scalar multiplication; Computational science; Matrix (mathematics); matrix-computing unit; Transpose; 0202 electrical engineering electronic engineering information engineering; matrix processor; Electrical and Electronic Engineering; Circulant matrix; circulant matrices; FPGA; 020208 electrical & electronic engineering; lcsh:Electronics; Dot product; Matrix multiplication; Computer architecture; Hardware and Architecture; Control and Systems Engineering; matrix arithmetic; Signal Processing; 020201 artificial intelligence & image processing; Multiplication; hardware implementation

description

High-dimensional matrix algebra is essential in numerous signal processing and machine learning algorithms. This work describes a scalable square matrix-computing unit designed on the basis of circulant matrices. It optimizes data flow for any sequence of matrix operations, removing the need to move intermediate results, and also optimizes the performance of each individual matrix operation in direct or transposed form (the transpose operation only requires a change in data addressing). The supported operations are matrix-by-matrix addition, subtraction, dot product, and multiplication; matrix-by-vector multiplication; and matrix-by-scalar multiplication. The proposed architecture is fully scalable, with the maximum matrix dimension limited by the available resources. In addition, a design environment has been developed that assists the user, through a friendly interface, from the customization of the hardware computing unit to the generation of the final synthesizable IP core. For $N \times N$ matrices, the architecture requires $N$ ALU-RAM blocks and has $O(N^2)$ complexity, requiring $N^2 + 7$ and $N + 7$ clock cycles for matrix-matrix and matrix-vector operations, respectively. On the tested Virtex-7 FPGA device, the design for $500 \times 500$ matrices reaches a maximum clock frequency of 346 MHz, achieving an overall performance of 173 GOPS. This architecture shows higher performance than other state-of-the-art matrix computing units.
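The cycle counts and throughput quoted in the description can be checked with a little arithmetic. The following is a minimal sketch, not code from the paper or from SCUMO: the function names are hypothetical, and the throughput model assumes that each of the N ALU-RAM blocks completes one operation per clock cycle, which is consistent with the quoted figures but is not stated explicitly in the description.

```python
def cycles(n: int, kind: str) -> int:
    """Clock cycles quoted in the description for an N x N unit.

    N^2 + 7 for matrix-matrix operations, N + 7 for matrix-vector operations.
    """
    if kind == "matrix-matrix":
        return n * n + 7
    if kind == "matrix-vector":
        return n + 7
    raise ValueError(f"unknown operation kind: {kind!r}")


def latency_seconds(n: int, kind: str, f_clk_hz: float) -> float:
    """Time for one operation at clock frequency f_clk_hz (derived, not quoted)."""
    return cycles(n, kind) / f_clk_hz


def peak_gops(n: int, f_clk_hz: float) -> float:
    """Peak throughput in giga-operations per second.

    Assumption (not from the source): each of the N ALU-RAM blocks
    finishes one operation per clock cycle.
    """
    return n * f_clk_hz / 1e9


if __name__ == "__main__":
    n, f_clk = 500, 346e6  # values quoted for the tested Virtex-7 device
    print("matrix-matrix cycles:", cycles(n, "matrix-matrix"))
    print("matrix-vector cycles:", cycles(n, "matrix-vector"))
    print("matrix-matrix latency (s):", latency_seconds(n, "matrix-matrix", f_clk))
    print("peak GOPS:", peak_gops(n, f_clk))
```

Under that assumption, 500 blocks at 346 MHz give the 173 GOPS stated in the description. The latencies are derived values rather than figures from the source: roughly 0.72 ms for a 500 × 500 matrix-matrix operation and about 1.5 µs for a matrix-vector operation.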

year

2019-01-15

10.3390/electronics8010094 https://doi.org/10.3390/electronics8010094