
RESEARCH PRODUCT

FeatherCNN: Fast Inference Computation with TensorGEMM on ARM Architectures

Wang Xiaoning, Shengzhong Feng, Yu Qiao, Haidong Lan, Bertil Schmidt, Christian Hundt, Weiguo Liu, Deng Minwen, Jintao Meng

subject

distributed computing; source code; iterative method; computer science; deep learning; inference; parallel computing; convolutional neural network; matrix multiplication; ARM architecture; Computational Theory and Mathematics; Hardware and Architecture; Signal Processing; artificial intelligence

description

Deep learning is ubiquitous in a wide range of applications spanning research and industry. In comparison to the time-consuming iterative training of convolutional neural networks (CNNs), inference is a relatively lightweight operation, making it amenable to execution on mobile devices. Nevertheless, lower latency and higher computational efficiency are crucial to allow for complex models and prolonged battery life. Addressing these challenges, we propose FeatherCNN, a fast inference library for ARM CPUs that targets the performance ceiling of mobile devices. FeatherCNN employs three key techniques: 1) a highly efficient TensorGEMM (generalized matrix multiplication) routine accelerates Winograd convolution on ARM CPUs; 2) general layer optimization based on custom high-performance kernels improves both the computational efficiency and the locality of memory access patterns for non-Winograd layers; and 3) the framework design emphasizes joint layer-wise optimization, using layer fusion to remove redundant calculations and memory movements. Performance evaluation reveals that FeatherCNN significantly outperforms state-of-the-art libraries. A forward propagation pass of VGG-16 on a 64-core ARM server is 48, 14, and 12 times faster than Caffe using OpenBLAS, Caffe2 using Eigen, and NNPACK, respectively. In addition, FeatherCNN is 3.19 times faster than the recently released TensorFlow Lite library on an iPhone 7 Plus. In terms of GEMM performance, FeatherCNN achieves 14.8 and 39.0 percent higher performance than Apple's Accelerate framework on an iPhone 7 Plus and Eigen on a Samsung Galaxy S8, respectively. The source code of the FeatherCNN library is publicly available at https://github.com/tencent/feathercnn.
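To make the Winograd/GEMM connection concrete, the following is a minimal, self-contained C++ sketch of the 1-D Winograd algorithm F(2,3): it produces two outputs of a 3-tap filter with four multiplications instead of six. In FeatherCNN, the element-wise product stage of tiles like this is what the TensorGEMM routine batches into large matrix multiplications; the code below only illustrates the underlying arithmetic and is not the library's implementation.

    // Minimal 1-D Winograd F(2,3) sketch: two outputs of a 3-tap
    // correlation with 4 multiplications instead of 6. Illustrative
    // only; NOT FeatherCNN's TensorGEMM code.
    #include <array>
    #include <cstdio>

    // d: 4 inputs, g: 3 filter taps; returns {y0, y1} where
    // y0 = d0*g0 + d1*g1 + d2*g2 and y1 = d1*g0 + d2*g1 + d3*g2.
    std::array<float, 2> winograd_f2_3(const std::array<float, 4>& d,
                                       const std::array<float, 3>& g) {
        // Filter transform (precomputable once per layer).
        float u0 = g[0];
        float u1 = 0.5f * (g[0] + g[1] + g[2]);
        float u2 = 0.5f * (g[0] - g[1] + g[2]);
        float u3 = g[2];
        // Input transform.
        float v0 = d[0] - d[2];
        float v1 = d[1] + d[2];
        float v2 = d[2] - d[1];
        float v3 = d[1] - d[3];
        // Element-wise products: the stage a Winograd library maps
        // onto batched GEMM across tiles and channels.
        float m0 = u0 * v0, m1 = u1 * v1, m2 = u2 * v2, m3 = u3 * v3;
        // Output transform.
        return {m0 + m1 + m2, m1 - m2 - m3};
    }

    int main() {
        auto y = winograd_f2_3({1, 2, 3, 4}, {1, 1, 1});
        std::printf("%.1f %.1f\n", y[0], y[1]);  // expect 6.0 9.0
    }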
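The layer-fusion idea can likewise be sketched in a few lines: folding the bias addition and ReLU activation into the convolution's epilogue finalizes each output while it is still in a register, removing a full read-modify-write pass over the output tensor. The function name and shapes below are assumptions for illustration, not FeatherCNN's API.

    // Hedged sketch of layer fusion: bias + ReLU applied inside the
    // convolution loop instead of in separate passes. Hypothetical
    // helper, not FeatherCNN's actual interface.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // 1-D valid convolution with a fused bias + ReLU epilogue: each
    // output element is finalized in a register, so no second pass
    // over `out` is needed.
    std::vector<float> conv1d_bias_relu_fused(const std::vector<float>& in,
                                              const std::vector<float>& w,
                                              float bias) {
        const std::size_t n = in.size() - w.size() + 1;
        std::vector<float> out(n);
        for (std::size_t i = 0; i < n; ++i) {
            float acc = bias;
            for (std::size_t k = 0; k < w.size(); ++k)
                acc += in[i + k] * w[k];
            out[i] = std::max(acc, 0.0f);  // fused epilogue
        }
        return out;
    }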

https://doi.org/10.1109/tpds.2019.2939785