
RESEARCH PRODUCT

Vector representation of non-standard spellings using dynamic time warping and a denoising autoencoder

Mehdi Ben Lazreg, Ole-Christopher Granmo, Morten Goodwin

subject

Dynamic time warping; Artificial neural network; Computer science; business.industry; Speech recognition; 020208 electrical & electronic engineering; Pattern recognition; Context (language use); 02 engineering and technology; 010501 environmental sciences; Translation (geometry); 01 natural sciences; Autoencoder; Euclidean distance; 0202 electrical engineering, electronic engineering, information engineering; Edit distance; Artificial intelligence; Hidden Markov model; business; Word (computer architecture); 0105 earth and related environmental sciences

description

Non-standard spellings on Twitter pose challenges for many natural language processing tasks. Traditional approaches mainly treat the problem as one of translation, spell checking, or speech recognition. This paper proposes a method that represents the stochastic relationship between words and their non-standard versions as real-valued vectors. The method uses dynamic time warping to preprocess the non-standard spellings and a denoising autoencoder to derive the vector representation. The derived vectors encode word patterns, and the Euclidean distance between the vectors defines a distance in the word space that challenges the prevailing edit distance. After training the autoencoder on 1051 different words and their non-standard versions, the results show that the new distance retrieves the correct standard word among the five closest words in 89.53% of cases, compared to only 68.22% with the edit distance.
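To make the two string-level ideas named in the description concrete, the following is a minimal Python sketch of the edit distance (the baseline) and of a generic dynamic time warping cost between two character sequences. It is not the paper's pipeline: the 0/1 character mismatch cost, the function names `edit_distance`, `dtw_cost`, and `top_k`, and the toy vocabulary are all assumptions made for illustration, and the denoising autoencoder and its Euclidean distance are not reproduced here.

```python
from typing import Callable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def dtw_cost(a: str, b: str) -> float:
    """Classic dynamic time warping cost between two character sequences.
    Assumption: the local cost is 0 for equal characters and 1 otherwise."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            # A character may be aligned with several characters of the
            # other word, so repeated letters can be absorbed at no cost.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def top_k(query: str, vocabulary: List[str],
          distance: Callable[[str, str], float], k: int = 5) -> List[str]:
    """Return the k vocabulary words closest to query under distance."""
    return sorted(vocabulary, key=lambda w: distance(query, w))[:k]


# Hypothetical toy vocabulary, not data from the paper.
vocab = ["cool", "cold", "pool", "school", "tool", "color"]
closest = top_k("coool", vocab, edit_distance, k=5)
```

In this sketch the elongated spelling "coool" is one edit away from "cool", while the warping cost between the two is zero because the extra "o" aligns with an existing "o". The "closest five words" evaluation in the description corresponds to checking whether the correct standard word appears in a list such as `closest`.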

https://doi.org/10.1109/cec.2017.7969473