A Methodology for Bilingual Lexicon Extraction from Comparable Corpora
Reinhard Rapp
subject
Text corpus; Interlingua; Computer science; Bootstrapping (linguistics); Parallel corpora; Bilingual lexicon; Resource (project management); Quality (business); Artificial intelligence; Word (computer architecture); Natural language processing
description
Dictionary extraction using parallel corpora is well established. However, for many language pairs parallel corpora are a scarce resource, which is why in the current work we discuss methods for dictionary extraction from comparable corpora. The aim is to push the boundaries of current approaches, which typically utilize correlations between co-occurrence patterns across languages, in several ways: 1) Eliminating the need for initial lexicons by using a bootstrapping approach which requires only a few seed translations. 2) Implementing a new approach which first establishes alignments between comparable documents across languages, and then computes cross-lingual alignments between words and multiword units. 3) Improving the quality of computed word translations by applying an interlingua approach which, by relying on several pivot languages, allows an effective multi-dimensional cross-check. 4) Investigating how, by looking at foreign citations, translations can even be derived from a single monolingual text corpus.
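The co-occurrence idea mentioned in the description can be illustrated with a minimal sketch. It is not the author's implementation: it assumes a generic variant in which a word is represented by how often it co-occurs with a few seed words, the vector is mapped into the other language through the seed translations, and candidates are ranked by cosine similarity. The function names, the toy sentences, and the seed lexicon are all hypothetical.

```python
from collections import Counter
import math

def cooccurrence_vector(word, sentences, context_words):
    """Count how often `word` shares a sentence with each context word."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if word in tokens:
            for token in tokens:
                if token in context_words and token != word:
                    counts[token] += 1
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def best_translation(source_word, src_sentences, tgt_sentences, seed, candidates):
    """Rank target candidates by similarity of their co-occurrence patterns."""
    src_vec = cooccurrence_vector(source_word, src_sentences, set(seed))
    # Map the source-language dimensions into the target language via the seeds.
    mapped = Counter({seed[k]: c for k, c in src_vec.items()})
    scores = {
        cand: cosine(mapped, cooccurrence_vector(cand, tgt_sentences, set(seed.values())))
        for cand in candidates
    }
    return max(scores, key=scores.get), scores

# Toy, non-parallel "comparable" corpora (invented for illustration only).
english = [
    "the dog chases the cat",
    "the dog eats food",
    "the teacher reads a book",
    "the teacher enters the school",
]
german = [
    "der hund frisst futter",
    "der lehrer betritt die schule",
    "der hund jagt die katze",
    "der lehrer liest ein buch",
]
# Hypothetical seed translations; a real system would bootstrap from these.
seed = {"cat": "katze", "food": "futter", "book": "buch", "school": "schule"}

best, scores = best_translation("dog", english, german, seed, ["hund", "lehrer"])
print(best, scores)
```

In a bootstrapping setting, newly found pairs with high scores could be added to the seed lexicon and the procedure repeated; that loop is omitted here.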
year | journal | country | edition | language |
---|---|---|---|---|
2015-01-01 | Proceedings of the Fourth Workshop on Hybrid Approaches to Translation (HyTra) | | | |