RESEARCH PRODUCT
Data Augmentation for Pipeline-Based Speech Translation
Diego Alves, Mārcis Pinnis, Askars Salimbajevs
subject
Machine translation; Computer science; Pipeline (computing); Speech recognition; Speech translation; Speech processing; Neural machine translation; Robustness to errors; Workflow; Quality (business); Noise (video); Suffix; [INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG]; [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL]
description
Pipeline-based speech translation methods may suffer from errors in the output of the speech recognition system. It is therefore crucial that machine translation systems are trained to be robust against such noise. In this paper, we propose two methods of parallel data augmentation for developing pipeline-based speech translation systems. The first method utilises a speech processing workflow to introduce errors, and the second generates commonly found suffix errors using a rule-based method. We show that, in combination, the two methods significantly improve speech translation quality, by 1.87 BLEU points over a baseline system.
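The second, rule-based method can be illustrated with a minimal sketch. The abstract does not give the actual rules, so everything named here is an assumption made for illustration: the `SUFFIX_RULES` table, the `corrupt_suffix` and `augment` helpers, and the `error_rate` parameter are all hypothetical. The idea shown is only the general one: corrupt word suffixes on the source side of a parallel corpus, keep the target side clean, and add the noisy pairs to the training data.

```python
import random

# Hypothetical suffix confusions, for illustration only.
# These are NOT the rules used in the paper, which the abstract does not list.
SUFFIX_RULES = {"ām": ["am"], "as": ["a", "ās"], "is": ["i"]}


def corrupt_suffix(word, rules, rng):
    """Return the word with its suffix swapped if a rule matches, else unchanged."""
    # Try longer suffixes first so the most specific rule wins.
    for suffix in sorted(rules, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + rng.choice(rules[suffix])
    return word


def augment(pairs, rules=SUFFIX_RULES, error_rate=0.1, seed=0):
    """Yield each clean pair, plus a noisy copy whose source side has suffix errors."""
    rng = random.Random(seed)
    for source, target in pairs:
        yield source, target
        # Each source word is corrupted independently with probability error_rate.
        noisy = " ".join(
            corrupt_suffix(w, rules, rng) if rng.random() < error_rate else w
            for w in source.split()
        )
        # Only emit the noisy copy if at least one word actually changed.
        if noisy != source:
            yield noisy, target
```

Calling `list(augment([("mājas durvis", "house door")], error_rate=1.0))` returns the clean pair followed by one noisy pair with the same target. A seeded `random.Random` keeps the augmented corpus reproducible across runs.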
year | journal | country | edition | language |
---|---|---|---|---|
2020-01-01 | Human Language Technologies – The Baltic Perspective - Proceedings of the Ninth International Conference Baltic HLT 2020 | | | |