6533b7dcfe1ef96bd1272888

RESEARCH PRODUCT

Versatile Direct and Transpose Matrix Multiplication with Chained Operations: An Optimized Architecture Using Circulant Matrices

José V. Francés VilloraEmmanuel Ovie OsimiryTaras IakymchukAlfredo Rosado-muñozManuel Bataller Mompean

subject

Cycles per instructionBlock matrix020206 networking & telecommunications02 engineering and technologyParallel computingMatrix chain multiplicationMatrix multiplication020202 computer hardware & architectureTheoretical Computer ScienceMatrix (mathematics)Computational Theory and MathematicsHardware and ArchitectureTranspose0202 electrical engineering electronic engineering information engineeringMultiplicationHardware_ARITHMETICANDLOGICSTRUCTURESArithmeticCirculant matrixSoftwareMathematics

description

With growing demands in real-time control, classification or prediction, algorithms become more complex while low power and small size devices are required. Matrix multiplication (direct or transpose) is common for such computation algorithms. In numerous algorithms, it is also required to perform matrix multiplication repeatedly, where the result of a multiplication is further multiplied again. This work describes a versatile computation procedure and architecture: one of the matrices is stored in internal memory in its circulant form, then, a sequence of direct or transpose multiplications can be performed without timing penalty. The architecture proposes a RAM-ALU block for each matrix column, being connected in a systolic ring. The computation is propagated through internal RAM-ALU blocks. The architecture exploits local connections, minimizing delays. The system is described as an IP core, it is fully parameterisable to customize matrix size and data format when implementing. An $N\times N$ matrix multiplication is performed in $\mathcal{O}(N^2)$ clock cycles requiring $N$ RAM-ALU blocks being $2N$ in memory size. For a Virtex7 FPGA, clock runs at 340 MHz for 100 $\times$ 100 and 290 MHz for 1000 $\times$ 1,000 matrix size with 1 clock cycle per element multiplication.

https://doi.org/10.1109/tc.2016.2538235