RESEARCH PRODUCT

The use of Russian-language Internet news corpora for automatic speech recognition systems in the field of media monitoring

description

The author used open Internet news corpora (NewsRu and Taiga) to build N-gram language models for automatic speech recognition (ASR) systems. The models were evaluated on several measures: perplexity, word error rate (WER), and proper name recognition, with comparisons against a baseline model and Google ASR. The author also rescored the N-gram models using recurrent neural networks. The effectiveness of the models was assessed by recognizing speech from the news channel Россия 24 (37 files with a total length of 1.5 hours were tested). The test data were chosen to match the main goal of the article: speech recognition for media monitoring.
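The two numeric measures named in the description, perplexity and WER, can be illustrated with a minimal sketch. This is not the author's evaluation code; it only assumes the standard textbook definitions (perplexity as the exponential of the negative mean log-probability per token, WER as word-level edit distance divided by reference length). The function names `perplexity` and `word_error_rate` are hypothetical and chosen for illustration.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities that a language model
    (assumed here, e.g. an N-gram model) assigns to each test token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as word-level Levenshtein distance with two rolling rows."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under these definitions, a lower value is better for both measures, and WER can exceed 1.0 when the hypothesis contains many inserted words.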

DOI: 10.31261/pr.12741 (https://doi.org/10.31261/pr.12741)