6533b7d4fe1ef96bd126264b

RESEARCH PRODUCT

A Methodology for Bilingual Lexicon Extraction from Comparable Corpora

Reinhard Rapp

subject

Text corpusInterlinguaComputer sciencebusiness.industrymedia_common.quotation_subjectBootstrapping (linguistics)computer.software_genrelanguage.human_languageParallel corporaBilingual lexiconResource (project management)languageQuality (business)Artificial intelligencebusinesscomputerWord (computer architecture)Natural language processingmedia_common

description

Dictionary extraction using parallel corpora is well established. However, for many language pairs parallel corpora are a scarce resource which is why in the current work we discuss methods for dictionary extraction from comparable corpora. Hereby the aim is to push the boundaries of current approaches, which typically utilize correlations between co-occurrence patterns across languages, in several ways: 1) Eliminating the need for initial lexicons by using a bootstrapping approach which only requires a few seed translations. 2) Implementing a new approach which first establishes alignments between comparable documents across languages, and then computes cross-lingual alignments between words and multiword-units. 3) Improving the quality of computed word translations by applying an interlingua approach, which, by relying on several pivot languages, allows an effective multi-dimensional cross-check. 4) We investigate that, by looking at foreign citations, language translations can even be derived from a single monolingual text corpus.

https://doi.org/10.18653/v1/w15-4108