
RESEARCH PRODUCT

A Comparative Analysis of Residual Block Alternatives for End-to-End Audio Classification

Irene Martin-Morato, Javier Naranjo-Alcazar, Maximo Cobos, Francesc J. Ferri, Sergi Perez-Castanos, Pedro Zuccarello

subject

Normalization (statistics); General Computer Science; Computer science; Feature extraction; ESC; 02 engineering and technology; Residual; Convolutional neural networks; 0202 electrical engineering, electronic engineering, information engineering; General Materials Science; urbansound8k; Audio signal processing; Block (data storage); Contextual image classification; General Engineering; Audio classification; 020206 networking & telecommunications; 113 Computer and information sciences; 020201 artificial intelligence & image processing; lcsh:Electrical engineering. Electronics. Nuclear engineering; Data mining; lcsh:TK1-9971; residual learning

description

Residual learning is a framework known to facilitate the training of very deep neural networks. Residual blocks, or units, are made up of a set of stacked layers whose inputs are added back to their outputs with the aim of creating identity mappings. In practice, such identity mappings are accomplished by means of so-called skip or shortcut connections. However, multiple implementation alternatives arise with respect to where such skip connections are applied within the set of stacked layers making up a residual block. While residual networks for image classification using convolutional neural networks (CNNs) have been widely discussed in the literature, their adoption for 1D end-to-end architectures is still scarce in the audio domain. Thus, the suitability of different residual block designs for raw audio classification is partly unknown. The purpose of this article is to compare, analyze and discuss the performance of several residual block implementations, namely those most commonly used in image classification problems, within a state-of-the-art CNN-based architecture for end-to-end audio classification using raw audio waveforms. Careful statistical analyses over six different residual block alternatives are conducted, considering two well-known datasets and common input normalization choices. The results show that, while some significant differences in performance are observed among architectures using different residual block designs, the selection of the most suitable residual block can be highly dependent on the input data.

Published version. Peer reviewed.
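To illustrate how the placement of the skip connection can differ between residual block designs, the sketch below shows two 1D block layouts in PyTorch: a post-activation block (identity added before the final activation) and a pre-activation block (identity added to the raw block output). The class names, channel counts and kernel sizes are illustrative assumptions and do not reproduce the exact architecture or block variants evaluated in the paper.

```python
# Minimal sketch (PyTorch) of two 1D residual block variants for raw audio.
# Hyperparameters and names are illustrative, not the paper's configuration.
import torch
import torch.nn as nn


class PostActResBlock1d(nn.Module):
    """Post-activation layout: conv -> BN -> ReLU -> conv -> BN,
    with the input added back before the final ReLU."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn2 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection after the stacked layers


class PreActResBlock1d(nn.Module):
    """Pre-activation layout: BN -> ReLU -> conv, twice,
    with the identity added to the untouched block output."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn2 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.conv2(self.relu(self.bn2(out)))
        return out + x  # identity mapping left untouched


if __name__ == "__main__":
    # Raw-waveform style input: (batch, channels, samples)
    x = torch.randn(4, 64, 16000)
    print(PostActResBlock1d(64)(x).shape)  # torch.Size([4, 64, 16000])
    print(PreActResBlock1d(64)(x).shape)   # torch.Size([4, 64, 16000])
```

Both variants keep the temporal resolution unchanged so the addition is well defined; blocks that change the number of channels or stride would additionally need a projection on the shortcut path.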

https://doi.org/10.1109/access.2020.3031685