
RESEARCH PRODUCT

Learning Similarity Scores by Using a Family of Distance Functions in Multiple Feature Spaces

Francisco Grimaldo, Emilia López-Iñesta, Miguel Arevalillo-Herráez

subject

Training set; business.industry; Feature vector; Similarity heuristic; Pattern recognition; 02 engineering and technology; Machine learning; computer.software_genre; Semantic similarity; Artificial Intelligence; 020204 information systems; Normalized compression distance; 0202 electrical engineering electronic engineering information engineering; 020201 artificial intelligence & image processing; Computer Vision and Pattern Recognition; Artificial intelligence; Jaro–Winkler distance; business; computer; Classifier (UML); Software; Similarity learning; Mathematics

description

There exist a large number of distance functions that allow one to measure similarity between feature vectors and thus can be used for ranking purposes. When multiple representations of the same object are available, distances in each representation space may be combined to produce a single similarity score. In this paper, we present a method to build such a similarity ranking out of a family of distance functions. Unlike other approaches that aim to select the best distance function for a particular context, we use several distances and combine them in a convenient way. To this end, we adopt a classical similarity learning approach and treat the problem as a standard supervised machine learning task. As in most similarity learning settings, the training data are composed of a set of pairs of objects that have been labeled as similar/dissimilar. These are first used as input to a transformation function that computes a new feature vector for each pair by using a family of distance functions in each of the available representation spaces. Then, this information is used to learn a classifier. The approach has been tested using three different repositories. Results show that the proposed method outperforms alternative approaches in high-dimensional spaces and highlight the benefits of using multiple distances in each representation space.
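The pipeline described in the abstract (labeled pairs, a transformation into per-space distance vectors, then a classifier) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the three distance functions, the two toy representation spaces, the synthetic data, and the use of a hand-written logistic regression as a stand-in classifier.

```python
import math
import random

# Hypothetical family of distance functions; the abstract does not list the actual ones.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

DISTANCES = [euclidean, manhattan, chebyshev]

def pair_features(obj_a, obj_b):
    """Transform a pair of objects into one vector: every distance in every space.

    Each object is a list of feature vectors, one per representation space.
    """
    return [d(va, vb) for va, vb in zip(obj_a, obj_b) for d in DISTANCES]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(X, y, lr=0.05, epochs=300):
    """Batch gradient descent for logistic regression (stand-in classifier)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def similarity_score(model, obj_a, obj_b):
    """Classifier output for the 'similar' class, usable as a ranking score."""
    w, b = model
    x = pair_features(obj_a, obj_b)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Synthetic data: two clusters, each object described in a 2-d and a 3-d space.
rng = random.Random(0)

def make_object(center):
    return [[rng.gauss(center, 0.2) for _ in range(2)],
            [rng.gauss(center, 0.2) for _ in range(3)]]

pairs, labels = [], []
for _ in range(20):
    c = rng.choice([0.0, 2.0])
    pairs.append((make_object(c), make_object(c)))        # similar pair
    labels.append(1)
    pairs.append((make_object(0.0), make_object(2.0)))    # dissimilar pair
    labels.append(0)

X = [pair_features(a, b) for a, b in pairs]
model = train_logistic(X, labels)

# Rank two candidates against a query object.
query, near, far = make_object(0.0), make_object(0.0), make_object(2.0)
score_near = similarity_score(model, query, near)
score_far = similarity_score(model, query, far)
print(score_near, score_far)
```

With two spaces and three distances, each pair becomes a six-dimensional vector. Sorting candidates by `similarity_score` then yields a ranking in the spirit of the one the abstract describes, although the paper's actual classifier and distance family may differ.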

https://doi.org/10.1142/s0218001417500276