RESEARCH PRODUCT

Data Augmentation for Pipeline-Based Speech Translation

Diego Alves, Mārcis Pinnis, Askars Salimbajevs

subject

Machine translation; Neural machine translation; Speech translation; Speech recognition; Speech processing; Robustness to errors; Pipeline (computing); Workflow; Suffix; Noise (video); Quality (business); Computer science; [INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG]; [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL]

description

Pipeline-based speech translation methods may suffer from errors in the output of the speech recognition system. It is therefore crucial that machine translation systems are trained to be robust against such noise. In this paper, we propose two methods of parallel data augmentation for developing pipeline-based speech translation systems. The first method uses a speech processing workflow to introduce errors; the second generates commonly found suffix errors with a rule-based method. We show that, in combination, the two methods significantly improve speech translation quality, by 1.87 BLEU points over a baseline system.
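
The second method can be illustrated with a minimal sketch of rule-based suffix noise applied to the source side of a parallel corpus. This is not the authors' implementation: the suffix rules in `SUFFIX_RULES`, the `rate` parameter, and the function names `corrupt_suffixes` and `augment` are all hypothetical, chosen only to show the general idea of pairing a corrupted source sentence with its unchanged target translation.

```python
import random

# Hypothetical suffix confusions (illustrative only; not the rules from the paper).
SUFFIX_RULES = [("ing", "in"), ("ed", ""), ("s", "")]


def corrupt_suffixes(sentence, rules=SUFFIX_RULES, rate=0.3, rng=None):
    """Replace word endings according to `rules` with probability `rate` per word."""
    rng = rng or random.Random()
    noisy = []
    for word in sentence.split():
        for old, new in rules:
            # Only the first matching rule is considered; keep at least one stem character.
            if word.endswith(old) and len(word) > len(old):
                if rng.random() < rate:
                    word = word[: len(word) - len(old)] + new
                break
        noisy.append(word)
    return " ".join(noisy)


def augment(pairs, rate=0.3, seed=0):
    """Return the original pairs plus noisy-source copies with unchanged targets."""
    rng = random.Random(seed)
    augmented = list(pairs)
    for source, target in pairs:
        noisy = corrupt_suffixes(source, rate=rate, rng=rng)
        if noisy != source:  # skip pairs where no error was introduced
            augmented.append((noisy, target))
    return augmented
```

Keeping the clean pairs alongside the noisy copies is an assumption of this sketch: the translation model sees both well-formed input and input resembling recognition errors, each mapped to the same reference translation.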

DOI: 10.3233/faia200605
http://dx.doi.org/10.3233/faia200605