RESEARCH PRODUCT

BUCC Shared Task: Cross-Language Document Similarity

Serge Sharoff, Reinhard Rapp, Pierre Zweigenbaum

subject

Interlanguage, Document similarity, Information retrieval, Computer science, business.industry, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Artificial intelligence, computer.software_genre, business, computer, Natural language processing, Task (project management)

description

We summarise the organisation and results of the first shared task aimed at detecting the most similar texts in a large multilingual collection. The dataset of the shared task was based on Wikipedia dumps with interlanguage links, with further filtering to ensure comparability of the paired articles. The eleven system runs we received were evaluated using the TREC evaluation metrics.

1 Task description

Parallel corpora of original texts with their translations have provided the basis for multilingual NLP applications since the beginning of the 1990s. The relative scarcity of such resources led to greater attention to comparable (i.e. less parallel) resources as a source of information about possible translations. Many studies have been produced within the paradigm of comparable corpora, including publications in

https://doi.org/10.18653/v1/w15-3411