RESEARCH PRODUCT

Deterministic Linkage as a Preceding Filter for Other Record Linkage Methods

Murat Sariyar, Andreas Borg

subject

Linkage (software); Computer science; Decision tree learning; Population; Probabilistic logic; Filter (higher-order function); Expectation–maximization algorithm; Computer Science (miscellaneous); Data mining; Algorithm; Record linkage; Test data

description

Deterministic record linkage (RL) is frequently regarded as a rival to more sophisticated strategies such as probabilistic RL. We investigate the effect of combining deterministic linkage with other linkage techniques. To this end, we use a simple deterministic linkage strategy as a preceding filter: a data pair is classified as 'match' if the values of all attributes considered agree exactly, and as 'nonmatch' otherwise. This strategy is combined separately with two probabilistic RL methods based on the Fellegi–Sunter model and with two classification tree methods (CART and Bagging). An empirical comparison was conducted on two real data sets, using four different partitions into training and test data to increase the validity of the results. In almost all cases, applying deterministic linkage as a preceding filter leads to better results than omitting such a pre-filter, and classification trees exhibited the best results overall. On all data sets, probabilistic RL profited from deterministic linkage only when the underlying probabilities were estimated before deterministic linkage was applied. When a pre-filter is used to remove definite cases, the underlying population of data pairs changes. It is crucial to take this into account for model-based probabilistic RL.
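
The pre-filter idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes one reading of the description: pairs whose attribute values all agree exactly are labelled 'match' immediately, and every remaining pair is passed to a second classifier. The names exact_agreement, classify_with_prefilter and toy_fallback are hypothetical, and toy_fallback is only a stand-in for a real second-stage method such as a Fellegi–Sunter model or a classification tree.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Record = Dict[str, str]
Pair = Tuple[Record, Record]


def exact_agreement(a: Record, b: Record, attributes: Sequence[str]) -> bool:
    """Return True if both records agree exactly on every listed attribute."""
    return all(a[name] == b[name] for name in attributes)


def classify_with_prefilter(
    pairs: Sequence[Pair],
    attributes: Sequence[str],
    fallback: Callable[[Record, Record], str],
) -> List[str]:
    """Label exact-agreement pairs as 'match'; delegate all others to fallback."""
    labels = []
    for a, b in pairs:
        if exact_agreement(a, b, attributes):
            labels.append("match")  # definite case, removed by the pre-filter
        else:
            labels.append(fallback(a, b))  # remaining pair, second-stage decision
    return labels


def toy_fallback(a: Record, b: Record) -> str:
    """Hypothetical second stage: 'match' if at least two attributes agree.

    This is an illustrative placeholder, not a probabilistic or tree-based model.
    """
    agreeing = sum(1 for name in a if a[name] == b.get(name))
    return "match" if agreeing >= 2 else "nonmatch"
```

In this sketch the fallback is applied only to the pairs that survive the filter. For a model-based second stage, the description reports that probabilistic RL benefited only when its probabilities were estimated before the filter was applied, because removing definite cases changes the population of pairs.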

https://doi.org/10.1142/s0219622015500108