Acoustic Scene Classification with Squeeze-Excitation Residual Networks

RESEARCH PRODUCT

Javier Naranjo-Alcazar, Maximo Cobos, Sergi Perez-Castanos, Pedro Zuccarello

subject

FOS: Computer and information sciences; Sound (cs.SD); Computer Science - Machine Learning; General Computer Science; Calibration (statistics); Computer science; Residual; Convolutional neural network; Field (computer science); Computer Science - Sound; Machine Learning (cs.LG); 030507 speech-language pathology & audiology; 03 medical and health sciences; Audio and Speech Processing (eess.AS); Acoustic scene classification; Feature (machine learning); FOS: Electrical engineering, electronic engineering, information engineering; General Materials Science; Block (data storage); Artificial neural network; business.industry; pattern recognition; General Engineering; deep learning; Pattern recognition; machine listening; squeeze-excitation; Artificial intelligence; lcsh:Electrical engineering. Electronics. Nuclear engineering; 0305 other medical science; business; lcsh:TK1-9971; Electrical Engineering and Systems Science - Audio and Speech Processing

description

Acoustic scene classification (ASC) is a machine listening problem whose objective is to assign to an audio clip a predefined label describing a scene location (e.g. park, airport, etc.). Many state-of-the-art solutions to ASC incorporate data augmentation techniques and model ensembles. However, considerable improvements can also be achieved solely by modifying the architecture of convolutional neural networks (CNNs). In this work we propose two novel squeeze-excitation blocks to improve the accuracy of a CNN-based ASC framework based on residual learning. The main idea of squeeze-excitation blocks is to learn spatial and channel-wise feature maps independently instead of jointly, as standard CNNs do. This is usually achieved by combining global grouping operators, linear operators and a final calibration between the input of the block and its learned relationships. The behavior of the block that implements such operators, and therefore of the entire neural network, can be modified depending on the input to the block, the established residual configurations and the selected non-linear activations. The analysis has been carried out using the TAU Urban Acoustic Scenes 2019 dataset presented in the 2019 edition of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge. All configurations discussed in this document exceed the performance of the baseline proposed by the DCASE organization by 13 percentage points. In turn, the novel configurations proposed in this paper outperform the residual configurations proposed in previous works.
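
To make the three ingredients named in the description concrete (a global grouping operator, linear operators, and a final calibration of the block input), the following is a minimal NumPy sketch of a generic channel-wise squeeze-excitation step wrapped in an identity shortcut. It is an illustration only and does not reproduce the two novel blocks proposed in the paper. The function names (channel_se, se_residual), the reduction ratio r, the choice of ReLU and sigmoid, the placement of the shortcut and the random weights are all assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, used here to squash gates into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def channel_se(x, w1, b1, w2, b2):
    """Generic channel-wise squeeze-excitation (illustrative, not the paper's block).

    x  : feature map of shape (H, W, C)
    w1 : (C, C // r) weights of the first linear operator
    w2 : (C // r, C) weights of the second linear operator
    """
    # Squeeze: global average pooling collapses each channel to one value.
    z = x.mean(axis=(0, 1))                 # shape (C,)
    # Excitation: two linear operators with a ReLU in between (assumed choice).
    h = np.maximum(z @ w1 + b1, 0.0)        # shape (C // r,)
    s = sigmoid(h @ w2 + b2)                # shape (C,), one gate per channel
    # Calibration: rescale every channel of the input by its learned gate.
    return x * s                            # broadcasts over H and W


def se_residual(x, w1, b1, w2, b2):
    """Assumed residual configuration: identity shortcut around the SE step."""
    return x + channel_se(x, w1, b1, w2, b2)


# Hypothetical sizes and untrained random weights, purely for demonstration.
rng = np.random.default_rng(0)
H, W, C, r = 8, 8, 16, 4
x = rng.standard_normal((H, W, C))
w1 = 0.1 * rng.standard_normal((C, C // r))
b1 = np.zeros(C // r)
w2 = 0.1 * rng.standard_normal((C // r, C))
b2 = np.zeros(C)

y = se_residual(x, w1, b1, w2, b2)
print(y.shape)
```

In a trained network the weights would be learned jointly with the convolutional layers. Changing where the shortcut is added, or which non-linearities are used, is the kind of configuration choice the description refers to.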

year

2020-03-20

10.1109/access.2020.3002761 http://arxiv.org/abs/2003.09284