Combining a context aware neural network with a denoising autoencoder for measuring string similarities

Morten Goodwin, Mehdi Ben Lazreg, Ole-Christoffer Granmo

subject

Artificial neural network; Property (programming); Computer science; String (computer science); networking & telecommunications; Context (language use); engineering and technology; natural sciences; Theoretical Computer Science; Human-Computer Interaction; Character (mathematics); physical sciences; Metric (mathematics); electrical engineering, electronic engineering, information engineering; Artificial intelligence; String metric; acoustics; Software; Word (computer architecture); Natural language processing

description

Abstract: Measuring similarities between strings is central to many established and fast-growing research areas, including information retrieval, biology, and natural-language processing. The traditional approach to string similarity measurement is to define a metric over a word space that quantifies and sums up the differences between the characters of two strings. Surprisingly, these metrics have not evolved a great deal over the past few decades: the majority are still based on a simple comparison between characters and character distributions, without considering the words' context. This paper proposes a string metric that captures similarities between strings based on (1) the character similarities between the words, including non-standard and standard spellings of the same words, and (2) the context of these words. We propose a neural network composed of a denoising autoencoder and what we call a context encoder, both specifically designed to find similarities between words based on their context. Experimental results show that the resulting metric finds the correct version of a non-standard spelling among the closest words in 85.4% of cases, compared to 63.2% for the established Normalised-Levenshtein distance. We also show that, with our approach, words used in similar contexts are calculated to be more similar than words used in different contexts, a desirable property lacking in established string metrics.
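
For orientation, the character-based baseline named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the abstract does not say which normalisation is used, so dividing the edit distance by the length of the longer string is an assumption, and the function names and the helper `closest_words` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalised_levenshtein(a: str, b: str) -> float:
    """Assumed normalisation: edit distance divided by the longer length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def closest_words(query: str, vocabulary: list[str], k: int = 3) -> list[str]:
    """Hypothetical helper: the k vocabulary words nearest to the query."""
    return sorted(vocabulary, key=lambda w: normalised_levenshtein(query, w))[:k]


if __name__ == "__main__":
    # Made-up vocabulary, for illustration only.
    print(closest_words("tmrw", ["tomorrow", "tomato", "tram", "today"], k=2))
```

A ranking of this kind depends only on the characters of the two strings, so it cannot separate words that are spelled alike but used in different contexts, which is the gap the proposed metric is meant to address.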

year	journal
2020-03-01	Computer Speech & Language

https://doi.org/10.1016/j.csl.2019.101028