
RESEARCH PRODUCT

FeatherCNN: Fast Inference Computation with TensorGEMM on ARM Architectures

Wang Xiaoning, Shengzhong Feng, Yu Qiao, Haidong Lan, Bertil Schmidt, Christian Hundt, Weiguo Liu, Deng Minwen, Jintao Meng

subject

distributed computing; source code; iterative method; computer science; deep learning; inference; parallel computing; convolutional neural network; matrix multiplication; ARM architecture; Computational Theory and Mathematics; Hardware and Architecture; Signal Processing; artificial intelligence

description

Deep learning is ubiquitous in a wide range of applications spanning research and industry. In comparison to the time-consuming iterative training of convolutional neural networks (CNNs), inference is a relatively lightweight operation, making it amenable to execution on mobile devices. Nevertheless, lower latency and higher computational efficiency are crucial to allow for complex models and prolonged battery life. Addressing these challenges, we propose FeatherCNN, a fast inference library for ARM CPUs that targets the performance ceiling of mobile devices. FeatherCNN employs three key techniques: 1) a highly efficient TensorGEMM (generalized matrix multiplication) routine accelerates Winograd convolution on ARM CPUs; 2) general layer optimization based on custom high-performance kernels improves both the computational efficiency and the locality of memory access patterns for non-Winograd layers; and 3) the framework design emphasizes joint layer-wise optimization, using layer fusion to remove redundant calculations and memory movements. Performance evaluation reveals that FeatherCNN significantly outperforms state-of-the-art libraries. A forward propagation pass of VGG-16 on a 64-core ARM server is 48, 14, and 12 times faster than Caffe using OpenBLAS, Caffe2 using Eigen, and NNPACK, respectively. In addition, FeatherCNN is 3.19 times faster than the recently released TensorFlow Lite library on an iPhone 7 Plus. In terms of GEMM performance, FeatherCNN achieves 14.8 and 39.0 percent higher performance than Apple's Accelerate framework on an iPhone 7 Plus and Eigen on a Samsung Galaxy S8, respectively. The source code of the FeatherCNN library is publicly available at https://github.com/tencent/feathercnn.
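To make the Winograd/GEMM connection concrete, the following is a minimal, self-contained C++ sketch of the 1-D Winograd algorithm F(2,3): it produces two outputs of a 3-tap filter with four multiplications instead of six. In FeatherCNN, the element-wise product stage of tiles like this is what the TensorGEMM routine batches into large matrix multiplications; the code below only illustrates the underlying arithmetic and is not the library's implementation.

    // Minimal 1-D Winograd F(2,3) sketch: two outputs of a 3-tap
    // correlation with 4 multiplications instead of 6. Illustrative
    // only; NOT FeatherCNN's TensorGEMM code.
    #include <array>
    #include <cstdio>

    // d: 4 inputs, g: 3 filter taps; returns {y0, y1} where
    // y0 = d0*g0 + d1*g1 + d2*g2 and y1 = d1*g0 + d2*g1 + d3*g2.
    std::array<float, 2> winograd_f2_3(const std::array<float, 4>& d,
                                       const std::array<float, 3>& g) {
        // Filter transform (precomputable once per layer).
        float u0 = g[0];
        float u1 = 0.5f * (g[0] + g[1] + g[2]);
        float u2 = 0.5f * (g[0] - g[1] + g[2]);
        float u3 = g[2];
        // Input transform.
        float v0 = d[0] - d[2];
        float v1 = d[1] + d[2];
        float v2 = d[2] - d[1];
        float v3 = d[1] - d[3];
        // Element-wise products: the stage a Winograd library maps
        // onto batched GEMM across tiles and channels.
        float m0 = u0 * v0, m1 = u1 * v1, m2 = u2 * v2, m3 = u3 * v3;
        // Output transform.
        return {m0 + m1 + m2, m1 - m2 - m3};
    }

    int main() {
        auto y = winograd_f2_3({1, 2, 3, 4}, {1, 1, 1});
        std::printf("%.1f %.1f\n", y[0], y[1]);  // expect 6.0 9.0
    }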
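The layer-fusion idea can likewise be sketched in a few lines: folding the bias addition and ReLU activation into the convolution's epilogue finalizes each output while it is still in a register, removing a full read-modify-write pass over the output tensor. The function name and shapes below are assumptions for illustration, not FeatherCNN's API.

    // Hedged sketch of layer fusion: bias + ReLU applied inside the
    // convolution loop instead of in separate passes. Hypothetical
    // helper, not FeatherCNN's actual interface.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // 1-D valid convolution with a fused bias + ReLU epilogue: each
    // output element is finalized in a register, so no second pass
    // over `out` is needed.
    std::vector<float> conv1d_bias_relu_fused(const std::vector<float>& in,
                                              const std::vector<float>& w,
                                              float bias) {
        const std::size_t n = in.size() - w.size() + 1;
        std::vector<float> out(n);
        for (std::size_t i = 0; i < n; ++i) {
            float acc = bias;
            for (std::size_t k = 0; k < w.size(); ++k)
                acc += in[i + k] * w[k];
            out[i] = std::max(acc, 0.0f);  // fused epilogue
        }
        return out;
    }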

https://doi.org/10.1109/tpds.2019.2939785