Author: Reinhard Rapp

showing 16 related works from this author

A practical solution to the problem of automatic word sense induction

2004

Recent studies in word sense induction are based on clustering global co-occurrence vectors, i.e. vectors that reflect the overall behavior of a word in a corpus. If a word is semantically ambiguous, this means that these vectors are mixtures of all its senses. Inducing a word's senses therefore involves the difficult problem of recovering the sense vectors from the mixtures. In this paper we argue that the demixing problem can be avoided since the contextual behavior of the senses is directly observable in the form of the local contexts of a word. From human disambiguation performance we know that the context of a word is usually sufficient to determine its sense. Based on this observation…

Computer science; business.industry; Word-sense induction; Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing); Context (language use); Artificial intelligence; Cluster analysis; computer.software_genre; business; computer; Word (computer architecture); Natural language processing; SemEval

Proceedings of the ACL 2004 on Interactive poster and demonstration sessions

researchProduct

A Methodology for Bilingual Lexicon Extraction from Comparable Corpora

2015

Dictionary extraction using parallel corpora is well established. However, for many language pairs parallel corpora are a scarce resource, which is why in the current work we discuss methods for dictionary extraction from comparable corpora. The aim is to push the boundaries of current approaches, which typically utilize correlations between co-occurrence patterns across languages, in several ways: 1) Eliminating the need for initial lexicons by using a bootstrapping approach which only requires a few seed translations. 2) Implementing a new approach which first establishes alignments between comparable documents across languages, and then computes cross-lingual alignments between wor…

Text corpus; Interlingua; Computer science; business.industry; media_common.quotation_subject; Bootstrapping (linguistics); computer.software_genre; language.human_language; Parallel corpora; Bilingual lexicon; Resource (project management); language; Quality (business); Artificial intelligence; business; computer; Word (computer architecture); Natural language processing; media_common

Proceedings of the Fourth Workshop on Hybrid Approaches to Translation (HyTra)

researchProduct

Overviewing Important Aspects of the Last Twenty Years of Research in Comparable Corpora

2013

The beginning of the 1990s marked a radical turn in various NLP applications towards using large collections of texts.

History; Linguistics

researchProduct

Discovering the Senses of an Ambiguous Word by Clustering its Local Contexts

2005

As has been shown recently, it is possible to automatically discover the senses of an ambiguous word by statistically analyzing its contextual behavior in a large text corpus. However, this kind of research is still at an early stage. The results need to be improved and there is considerable disagreement on methodological issues. For example, although most researchers use clustering approaches for word sense induction, it is not clear what statistical features the clustering should be based on. Whereas so far most researchers cluster global co-occurrence vectors that reflect the overall behavior of a word in a corpus, in this paper we argue that it is more appropriate to use local context v…

Text corpus; business.industry; Computer science; Context (language use); computer.software_genre; Word sense; Word-sense induction; Artificial intelligence; business; Cluster analysis; computer; Natural language processing; Word (computer architecture); Strengths and weaknesses

researchProduct
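
The idea in the abstract above, clustering the local contexts of an ambiguous word instead of a single global vector, can be illustrated with a minimal sketch. The toy contexts, the Jaccard similarity, and the single-link merging are all assumptions chosen for brevity; the visible part of the abstract does not specify the features or the clustering algorithm actually used.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap between two sets of context words (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_contexts(contexts, k):
    """Single-link agglomerative clustering of local contexts into k groups.

    Each context is the set of words around one occurrence of the
    ambiguous word. Returns clusters as sorted lists of context indices.
    """
    clusters = [[i] for i in range(len(contexts))]
    while len(clusters) > k:
        # Pick the two clusters whose closest member contexts overlap most.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: max(
                jaccard(contexts[i], contexts[j])
                for i in clusters[pair[0]]
                for j in clusters[pair[1]]
            ),
        )
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # b > a, so index a stays valid
    return clusters


# Hypothetical local contexts of the ambiguous word "palm".
contexts = [
    {"coconut", "tree", "beach"},
    {"hand", "finger", "wrist"},
    {"tree", "leaf", "beach"},
    {"hand", "sweaty", "finger"},
]
senses = cluster_contexts(contexts, k=2)
print(senses)
```

Each resulting cluster stands for one induced sense; here the number of senses `k` is given in advance, which a real system would have to estimate.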

Syntagmatic and Paradigmatic Associations in Information Retrieval

2003

It is shown that unconscious associative processes taking place in the memory of a searcher during the formulation of a search query in information retrieval — such as the production of free word associations and the generation of synonyms — can be simulated using statistical models that analyze the distribution of words in large text corpora. The free word associations as produced by subjects on presentation of stimulus words can be predicted by applying first-order statistics to the frequencies of word co-occurrences as observed in texts. The generation of synonyms can also be conducted on co-occurrence data but requires second-order statistics. Both approaches are compared and validated …

Text corpus; Empirical data; Syntagmatic analysis; Information retrieval; Web search query; Semantic similarity; Computer science; Statistical model; Independent component analysis; Associative property

researchProduct
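
The first-order approach mentioned in the abstract above (predicting a free association from how often words co-occur in text) might be sketched as follows. The toy corpus, the window size, and the use of raw counts as the association score are illustrative assumptions; the abstract does not name the association measure used in the paper.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count ordered word pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


def predict_association(stimulus, counts):
    """Return the word that co-occurs most often with the stimulus.

    Ties are broken alphabetically; returns None for an unseen stimulus.
    """
    scores = {other: n for (word, other), n in counts.items() if word == stimulus}
    return max(sorted(scores), key=scores.get) if scores else None


# Hypothetical tokenized corpus.
sentences = [
    ["hot", "and", "cold", "water"],
    ["cold", "and", "hot", "days"],
    ["hot", "cold", "contrast"],
    ["hot", "summer", "days"],
]
counts = cooccurrence_counts(sentences)
print(predict_association("hot", counts))
```

Raw counts favour frequent words, so a practical system would normally replace them with a significance or association measure.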

A Freely Available Morphological Analyzer, Disambiguator and Context Sensitive Lemmatizer for German

1998

In this paper we present Morphy, an integrated tool for German morphology, part-of-speech tagging and context-sensitive lemmatization. Its large lexicon of more than 320,000 word forms plus its ability to process German compound nouns guarantee a wide morphological coverage. Syntactic ambiguities can be resolved with a standard statistical part-of-speech tagger. By using the output of the tagger, the lemmatizer can determine the correct root even for ambiguous word forms. The complete package is freely available and can be downloaded from the World Wide Web.

FOS: Computer and information sciences; Spectrum analyzer; Root (linguistics); Morphology (linguistics); Computer Science - Computation and Language; Computer science; business.industry; Lemmatisation; Context (language use); computer.software_genre; Lexicon; Syntax; language.human_language; German; H.3.4; Noun; language; Artificial intelligence; business; computer; Computation and Language (cs.CL); Natural language processing; Word (computer architecture)

researchProduct

Overview of the Second BUCC Shared Task: Spotting Parallel Sentences in Comparable Corpora

2017

This paper presents the BUCC 2017 shared task on parallel sentence extraction from comparable corpora. It recalls the design of the datasets, presents their final construction and statistics and the methods used to evaluate system results. 13 runs were submitted to the shared task by 4 teams, covering three of the four proposed language pairs: French-English (7 runs), German-English (3 runs), and Chinese-English (3 runs). The best F-scores as measured against the gold standard were 0.84 (German-English), 0.80 (French-English), and 0.43 (Chinese-English). Because of the design of the dataset, in which not all gold parallel sentence pairs are known, these are only minimum values. We examined …

Computer science; Sentence extraction; business.industry; Speech recognition; 020206 networking & telecommunications; 02 engineering and technology; Gold standard (test); Spotting; computer.software_genre; Task (project management); 0202 electrical engineering electronic engineering information engineering; 020201 artificial intelligence & image processing; Artificial intelligence; business; computer; Natural language processing; Sentence

Proceedings of the 10th Workshop on Building and Using Comparable Corpora

researchProduct
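
For readers unfamiliar with the F-scores reported in the abstract above, the following sketch shows how precision, recall, and a balanced F1 can be computed for extracted sentence pairs against a gold standard. The pair identifiers are made up, and the balanced F1 is an assumption; the exact evaluation protocol is described in the paper itself.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted sentence pairs against gold-standard pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical (source sentence id, target sentence id) pairs.
gold = {("fr-1", "en-1"), ("fr-2", "en-2"), ("fr-3", "en-3"), ("fr-4", "en-4")}
predicted = {("fr-1", "en-1"), ("fr-2", "en-2"), ("fr-5", "en-9")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Because a predicted pair that is truly parallel but missing from the gold standard counts as an error here, scores computed this way are lower bounds, which matches the abstract's remark that the reported values are only minimum values.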

BUCC Shared Task: Cross-Language Document Similarity

2015

We summarise the organisation and results of the first shared task aimed at detecting the most similar texts in a large multilingual collection. The dataset of the shared task was based on Wikipedia dumps with interlanguage links, with further filtering to ensure comparability of the paired articles. The eleven system runs we received have been evaluated using the TREC evaluation metrics.

1 Task description

Parallel corpora of original texts with their translations have provided the basis for multilingual NLP applications since the beginning of the 1990s. The relative scarcity of such resources led to greater attention to comparable (= less parallel) resources to mine information about possible translations…

Interlanguage; Document similarity; Information retrieval; Computer science; business.industry; InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL; Artificial intelligence; computer.software_genre; business; computer; Natural language processing; Task (project management)

Proceedings of the Eighth Workshop on Building and Using Comparable Corpora

researchProduct

The computation of word associations

2002

It is shown that basic language processes such as the production of free word associations and the generation of synonyms can be simulated using statistical models that analyze the distribution of words in large text corpora. According to the law of association by contiguity, the acquisition of word associations can be explained by Hebbian learning. The free word associations as produced by subjects on presentation of single stimulus words can thus be predicted by applying first-order statistics to the frequencies of word co-occurrences as observed in texts. The generation of synonyms can also be conducted on co-occurrence data but requires second-order statistics. The reason is that synony…

Text corpus; Syntagmatic analysis; business.industry; Computer science; Synonym; Speech recognition; Statistical model; computer.software_genre; Production (computer science); Artificial intelligence; business; Association (psychology); computer; Natural language processing; Word (computer architecture)

Proceedings of the 19th international conference on Computational linguistics

researchProduct
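
The second-order statistics that the abstract above mentions for synonym generation can be illustrated by comparing the co-occurrence vectors of words rather than their direct co-occurrences. This is a minimal sketch: the toy corpus, the window of one token, and cosine similarity are assumptions, not details taken from the paper.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=1):
    """Build one co-occurrence vector (a Counter of neighbours) per word."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(word, vectors):
    """Return the other word whose co-occurrence vector is closest."""
    candidates = [w for w in sorted(vectors) if w != word]
    return max(candidates, key=lambda w: cosine(vectors[word], vectors[w]))


# Hypothetical corpus: "big" and "large" never co-occur but share contexts.
sentences = [
    ["a", "big", "house"],
    ["a", "large", "house"],
    ["a", "big", "city"],
    ["a", "large", "city"],
    ["a", "red", "apple"],
]
vectors = context_vectors(sentences)
print(most_similar("big", vectors))
```

In this toy corpus the two near-synonyms have no direct co-occurrences at all, so a first-order count could not relate them; only the comparison of their context vectors does.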

Exploring the sense distributions of homographs

2006

This paper quantitatively investigates to what extent local context is useful for disambiguating the senses of an ambiguous word. This is done by comparing the co-occurrence frequencies of particular context words. First, one context word representing a certain sense is chosen, and then the co-occurrence frequencies with two other context words, one of the same and one of another sense, are compared. As expected, it turns out that context words belonging to the same sense have considerably higher co-occurrence frequencies than words belonging to different senses. In our study, the sense inventory is taken from the University of South Florida homograph norms, and the co-occurrence counts are based…

Homograph; Computer science; British National Corpus; Context (language use); Word (computer architecture); Linguistics

Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations (EACL '06)

researchProduct

Automatic identification of word translations from unrelated English and German corpora

1999

Algorithms for the alignment of words in translated texts are well established. However, only recently have new approaches been proposed to identify word translations from non-parallel or even unrelated texts. This task is more difficult, because most statistical clues useful in the processing of parallel texts cannot be applied to non-parallel texts. Whereas for parallel texts in some studies up to 99% of the word alignments have been shown to be correct, the accuracy for non-parallel texts has been around 30% up to now. The current study, which is based on the assumption that there is a correlation between the patterns of word co-occurrences in corpora of different languages, makes a sign…

Computer science; business.industry; computer.software_genre; language.human_language; Linguistics; Task (project management); German; Bilingual lexicon; Identification (information); ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; language; Artificial intelligence; business; computer; Natural language processing; Word (computer architecture)

Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics

researchProduct
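
The assumption stated in the abstract above, that co-occurrence patterns correlate across languages, suggests a simple way to rank translation candidates. The sketch below maps a German co-occurrence vector into English dimensions and compares it with English vectors. The small seed lexicon, the toy counts, and the cosine measure are all hypothetical; the visible part of the abstract does not say which resources or similarity measure the study uses.

```python
import math


def translate_vector(vector, seed_lexicon):
    """Map a source-language co-occurrence vector onto target-language words."""
    mapped = {}
    for word, weight in vector.items():
        if word in seed_lexicon:  # dimensions without a known translation are dropped
            target = seed_lexicon[word]
            mapped[target] = mapped.get(target, 0.0) + weight
    return mapped


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_translation(source_vector, target_vectors, seed_lexicon):
    """Return the target word whose co-occurrence pattern matches best."""
    mapped = translate_vector(source_vector, seed_lexicon)
    return max(sorted(target_vectors), key=lambda w: cosine(mapped, target_vectors[w]))


# Hypothetical data: co-occurrence counts from two unrelated corpora.
seed_lexicon = {"Schule": "school", "Kind": "child", "Buch": "book"}
lehrer_vector = {"Schule": 8, "Kind": 5, "Buch": 3, "Pause": 2}
english_vectors = {
    "teacher": {"school": 7, "child": 4, "book": 2},
    "baker": {"bread": 9, "shop": 4, "child": 1},
}
print(best_translation(lehrer_vector, english_vectors, seed_lexicon))
```

The quality of such a ranking depends heavily on corpus size and on how many vector dimensions the seed lexicon covers, which the toy data here does not reflect.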

The CogALex-IV Shared Task on the Lexical Access Problem

2014

The shared task of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex-IV) was devoted to a subtask of the lexical access problem, namely multi-stimulus association. In this task, participants were supposed to determine automatically an expected response based on a number of received stimulus words. We describe here the task definition, the theoretical background, the training and test data sets, and the evaluation procedure used for ranking the participating systems. We also summarize the approaches used and present the results of the evaluation. In conclusion, the outcome of the competition is a number of systems which provide very good solutions to the problem.

Computer science; business.industry; Cognition; Lexical access; Artificial intelligence; Data mining; business; Lexicon; computer.software_genre; computer; Natural language processing; Test data

Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)

researchProduct

Part-of-Speech Induction by Singular Value Decomposition and Hierarchical Clustering

2006

Part-of-speech induction involves the automatic discovery of word classes and the assignment of each word of a vocabulary to one or several of these classes. The approach proposed here is based on the analysis of word distributions in a large collection of German newspaper texts. Its main advantage over other attempts is that it combines the hierarchical clustering of context vectors with a previous step of dimensionality reduction that minimizes the effects of sampling errors.

Vocabulary; K-SVD; Computer science; business.industry; media_common.quotation_subject; Dimensionality reduction; Correlation clustering; Pattern recognition; Context (language use); Hierarchical clustering; Singular value decomposition; Artificial intelligence; business; Word (computer architecture); media_common

researchProduct

A practical solution to the problem of automatic part-of-speech induction from text

2005

The problem of part-of-speech induction from text involves two aspects: Firstly, a set of word classes is to be derived automatically. Secondly, each word of a vocabulary is to be assigned to one or several of these word classes. In this paper we present a method that solves both problems with good accuracy. Our approach adopts a mixture of statistical methods that have been successfully applied in word sense induction. Its main advantage over previous attempts is that it reduces the syntactic space to only the most important dimensions, thereby almost eliminating the otherwise omnipresent problem of data sparseness.

Vocabulary; business.industry; Computer science; media_common.quotation_subject; Speech recognition; Space (commercial competition); Part of speech; computer.software_genre; Syntax; Set (abstract data type); Word-sense induction; Artificial intelligence; business; computer; Natural language processing; Word (computer architecture); media_common

Proceedings of the ACL 2005 on Interactive poster and demonstration sessions (ACL '05)

researchProduct

New Areas of Application of Comparable Corpora

2019

This chapter describes several approaches to using comparable corpora beyond the area of MT for under-resourced languages, which is the primary focus of the ACCURAT project. Section 7.1, which is based on Rapp and Zock (Automatic dictionary expansion using non-parallel corpora. In: A. Fink, B. Lausen, W. Seidel, & A. Ultsch (Eds.) Advances in Data Analysis, Data Handling and Business Intelligence. Proceedings of the 32nd Annual Meeting of the GfKl, 2008. Springer, Heidelberg, 2010), addresses the task of creating resources for bilingual dictionaries using a seed lexicon; Sect. 7.2 (based on Rapp et al., Identifying word translations from comparable documents without a seed lexicon. Proceedi…

business.industry; Computer science; Group method of data handling; Section (typography); 020207 software engineering; 02 engineering and technology; [SCCO.LING]Cognitive science/Linguistics; Lexicon; computer.software_genre; Focus (linguistics); Task (project management); [SCCO]Cognitive science; Business intelligence; 0202 electrical engineering electronic engineering information engineering; 020201 artificial intelligence & image processing; [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC]; Artificial intelligence; business; computer; ComputingMilieux_MISCELLANEOUS; Natural language processing; Word (computer architecture)

researchProduct

Free Word Associations Correspond to Contiguities Between Words in Texts*

2005

A free associative response is the first word a person comes up with after perceiving another word, the so-called associative stimulus. People commonly associate hot to cold, church to priest, and hard to work. According to traditional association theory this behaviour is the result of learning by contiguity: “Objects once experienced together tend to become associated in the imagination, so that when any one of them is thought of, the others are likely to be thought of also, in the same order of sequence or coexistence as before” (James, 1890). This explanation has been rejected by cognitive psychologists who explain the production of associations as the result of symbolic processes which …

Linguistics and Language; Association theory; Cognition; Stimulus (physiology); Psychology; Language and Linguistics; Linguistics; Associative property; Associative learning; Cognitive psychology

Journal of Quantitative Linguistics

researchProduct