
RESEARCH PRODUCT

A Comparative Analysis of Residual Block Alternatives for End-to-End Audio Classification

Irene Martin-Morato, Javier Naranjo-Alcazar, Maximo Cobos, Francesc J. Ferri, Sergi Perez-Castanos, Pedro Zuccarello

subject

Normalization (statistics); General Computer Science; Computer science; Feature extraction; ESC; 02 engineering and technology; Residual; Convolutional neural networks; 0202 electrical engineering, electronic engineering, information engineering; General Materials Science; urbansound8k; Audio signal processing; Block (data storage); Contextual image classification; General Engineering; Audio classification; 020206 networking & telecommunications; 113 Computer and information sciences; 020201 artificial intelligence & image processing; lcsh:Electrical engineering. Electronics. Nuclear engineering; Data mining; lcsh:TK1-9971; residual learning

description

Residual learning is a framework known to facilitate the training of very deep neural networks. Residual blocks, or units, are made up of a set of stacked layers whose inputs are added back to their outputs with the aim of creating identity mappings. In practice, such identity mappings are accomplished by means of so-called skip or shortcut connections. However, multiple implementation alternatives arise with respect to where such skip connections are applied within the set of stacked layers making up a residual block. While residual networks for image classification using convolutional neural networks (CNNs) have been widely discussed in the literature, their adoption for 1D end-to-end architectures is still scarce in the audio domain. Thus, the suitability of different residual block designs for raw audio classification is partly unknown. The purpose of this article is to compare, analyze and discuss the performance of several residual block implementations, namely those most commonly used in image classification problems, within a state-of-the-art CNN-based architecture for end-to-end audio classification using raw audio waveforms. Careful statistical analyses over six different residual block alternatives are conducted, considering two well-known datasets and common input normalization choices. The results show that, while some significant differences in performance are observed among architectures using different residual block designs, the selection of the most suitable residual block can be highly dependent on the input data.

Published version. Peer reviewed.
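To illustrate how the placement of the skip connection can differ between residual block designs, the sketch below shows two 1D block layouts in PyTorch: a post-activation block (identity added before the final activation) and a pre-activation block (identity added to the raw block output). The class names, channel counts and kernel sizes are illustrative assumptions and do not reproduce the exact architecture or block variants evaluated in the paper.

```python
# Minimal sketch (PyTorch) of two 1D residual block variants for raw audio.
# Hyperparameters and names are illustrative, not the paper's configuration.
import torch
import torch.nn as nn


class PostActResBlock1d(nn.Module):
    """Post-activation layout: conv -> BN -> ReLU -> conv -> BN,
    with the input added back before the final ReLU."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn2 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection after the stacked layers


class PreActResBlock1d(nn.Module):
    """Pre-activation layout: BN -> ReLU -> conv, twice,
    with the identity added to the untouched block output."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn2 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.conv2(self.relu(self.bn2(out)))
        return out + x  # identity mapping left untouched


if __name__ == "__main__":
    # Raw-waveform style input: (batch, channels, samples)
    x = torch.randn(4, 64, 16000)
    print(PostActResBlock1d(64)(x).shape)  # torch.Size([4, 64, 16000])
    print(PreActResBlock1d(64)(x).shape)   # torch.Size([4, 64, 16000])
```

Both variants keep the temporal resolution unchanged so the addition is well defined; blocks that change the number of channels or stride would additionally need a projection on the shortcut path.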

https://doi.org/10.1109/access.2020.3031685