
AUTHOR

Zied Elloumi

showing 2 related works from this author

Analyzing Learned Representations of a Deep ASR Performance Prediction Model

2018

This paper addresses a relatively new task: prediction of ASR performance on unseen broadcast programs. In a previous paper, we presented an ASR performance prediction system using CNNs that encode both text (ASR transcript) and speech, in order to predict word error rate. This work is dedicated to the analysis of speech signal embeddings and text embeddings learnt by the CNN while training our prediction model. We try to better understand which information is captured by the deep model and its relation with different conditioning factors. It is shown that hidden layers convey a clear signal about speech style, accent and broadcast type. We then try to leverage these 3 types of information …

FOS: Computer and information sciences; Computer Science - Computation and Language; Computer science; Speech recognition; Word error rate; 02 engineering and technology; 010501 environmental sciences; 01 natural sciences; [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL]; 0202 electrical engineering, electronic engineering, information engineering; Performance prediction; Leverage (statistics); 020201 artificial intelligence & image processing; Computation and Language (cs.CL); 0105 earth and related environmental sciences

ASR performance prediction on unseen broadcast programs using convolutional neural networks

2018

In this paper, we address a relatively new task: prediction of ASR performance on unseen broadcast programs. We first propose a heterogeneous French corpus dedicated to this task. Two prediction approaches are compared: a state-of-the-art performance prediction based on regression (engineered features) and a new strategy based on convolutional neural networks (learnt features). We particularly focus on the combination of both textual (ASR transcription) and signal inputs. While the joint use of textual and signal features did not work for the regression baseline, the combination of inputs for CNNs leads to the best WER prediction performance. We also show that our CNN prediction remarkably …

FOS: Computer and information sciences; Computer Science - Computation and Language; Computer science; Speech recognition; Feature extraction; InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL; 02 engineering and technology; 010501 environmental sciences; 01 natural sciences; Convolutional neural network; [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL]; Task (project management); 0202 electrical engineering, electronic engineering, information engineering; Task analysis; Performance prediction; 020201 artificial intelligence & image processing; Mel-frequency cepstrum; Transcription (software); Hidden Markov model; Computation and Language (cs.CL); ComputingMilieux_MISCELLANEOUS; 0105 earth and related environmental sciences