Author: Serge Sharoff


Overviewing Important Aspects of the Last Twenty Years of Research in Comparable Corpora

2013

The beginning of the 1990s marked a radical turn in various NLP applications towards using large collections of texts.

History; Linguistics


Overview of the Second BUCC Shared Task: Spotting Parallel Sentences in Comparable Corpora

2017

This paper presents the BUCC 2017 shared task on parallel sentence extraction from comparable corpora. It recalls the design of the datasets, presents their final construction and statistics and the methods used to evaluate system results. 13 runs were submitted to the shared task by 4 teams, covering three of the four proposed language pairs: French-English (7 runs), German-English (3 runs), and Chinese-English (3 runs). The best F-scores as measured against the gold standard were 0.84 (German-English), 0.80 (French-English), and 0.43 (Chinese-English). Because of the design of the dataset, in which not all gold parallel sentence pairs are known, these are only minimum values. We examined …

Computer science; Sentence extraction; Speech recognition; Networking & telecommunications; Engineering and technology; Gold standard (test); Spotting; Task (project management); Electrical engineering, electronic engineering, information engineering; Artificial intelligence & image processing; Artificial intelligence; Natural language processing; Sentence

Proceedings of the 10th Workshop on Building and Using Comparable Corpora


BUCC Shared Task: Cross-Language Document Similarity

2015

We summarise the organisation and results of the first shared task aimed at detecting the most similar texts in a large multilingual collection. The dataset of the shared task was based on Wikipedia dumps with interlanguage links, with further filtering to ensure comparability of the paired articles. The eleven system runs we received have been evaluated using the TREC evaluation metrics.

1 Task description

Parallel corpora of original texts with their translations have provided the basis for multilingual NLP applications since the beginning of the 1990s. Relative scarcity of such resources led to greater attention to comparable (=less parallel) resources to mine information about possible translations…

Interlanguage; Document similarity; Information retrieval; Computer science; Information storage and retrieval; Artificial intelligence; Natural language processing; Task (project management)

Proceedings of the Eighth Workshop on Building and Using Comparable Corpora


New Areas of Application of Comparable Corpora

2019

This chapter describes several approaches to using comparable corpora beyond the area of MT for under-resourced languages, which is the primary focus of the ACCURAT project. Section 7.1, which is based on Rapp and Zock (Automatic dictionary expansion using non-parallel corpora. In: A. Fink, B. Lausen, W. Seidel, & A. Ultsch (Eds.) Advances in Data Analysis, Data Handling and Business Intelligence. Proceedings of the 32nd Annual Meeting of the GfKl, 2008. Springer, Heidelberg, 2010), addresses the task of creating resources for bilingual dictionaries using a seed lexicon; Sect. 7.2 (based on Rapp et al., Identifying word translations from comparable documents without a seed lexicon. Proceedi…

Computer science; Group method of data handling; Section (typography); Software engineering; Engineering and technology; Cognitive science/Linguistics; Lexicon; Focus (linguistics); Task (project management); Cognitive science; Business intelligence; Electrical engineering, electronic engineering, information engineering; Artificial intelligence & image processing; Computer Science/Human-Computer Interaction; Artificial intelligence; Natural language processing; Word (computer architecture)
