RESEARCH PRODUCT

Combining a context aware neural network with a denoising autoencoder for measuring string similarities

Morten Goodwin, Mehdi Ben Lazreg, Ole-Christoffer Granmo

subject

Artificial neural network; Property (programming); Computer science; business.industry; String (computer science); 020206 networking & telecommunications; Context (language use); 02 engineering and technology; computer.software_genre; 01 natural sciences; Theoretical Computer Science; Human-Computer Interaction; Character (mathematics); 0103 physical sciences; Metric (mathematics); 0202 electrical engineering, electronic engineering, information engineering; Artificial intelligence; String metric; business; 010301 acoustics; computer; Software; Word (computer architecture); Natural language processing

description

Abstract: Measuring similarities between strings is central to many established and fast-growing research areas, including information retrieval, biology, and natural-language processing. The traditional approach to string similarity measurement is to define a metric over a word space that quantifies and sums up the differences between the characters of two strings; surprisingly, these metrics have not evolved a great deal over the past few decades. Indeed, the majority of them are still based on a simple comparison of characters and character distributions, without considering the words' context. This paper proposes a string metric that captures similarities between strings based on (1) the character similarities between the words, including non-standard and standard spellings of the same words, and (2) the context of these words. We propose a neural network composed of a denoising autoencoder and what we call a context encoder, both specifically designed to find similarities between words based on their context. Experimental results show that the resulting metric succeeded in 85.4% of cases in finding the correct version of a non-standard spelling among the closest words, compared to 63.2% using the established Normalised-Levenshtein distance. We also show that, with our approach, words used in similar contexts are calculated to be more similar than words used in different contexts, a desirable property lacking in established string metrics.
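For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of a normalised Levenshtein distance and of retrieving the closest words from a vocabulary. It is not the authors' code: the function names (`levenshtein`, `normalised_levenshtein`, `closest_words`), the sample vocabulary, and the choice to normalise by the length of the longer string are all assumptions made for illustration, and the paper may use a different normalisation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalised_levenshtein(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer length (assumed convention)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(a, b) / longest


def closest_words(query: str, vocabulary: list[str], k: int = 3) -> list[str]:
    """Return the k vocabulary words with the smallest normalised distance."""
    return sorted(vocabulary, key=lambda w: normalised_levenshtein(query, w))[:k]


if __name__ == "__main__":
    # Hypothetical vocabulary, for illustration only.
    vocab = ["tomorrow", "today", "borrow", "sorrow"]
    print(closest_words("tomorow", vocab, k=2))
```

A character-level metric of this kind scores two strings purely by their spelling, so it cannot distinguish words that look alike but occur in different contexts; that limitation is what the abstract's context encoder is meant to address.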

https://doi.org/10.1016/j.csl.2019.101028