Active learning strategies for the deduplication of electronic patient data using classification trees.

6533b823fe1ef96bd127e227

RESEARCH PRODUCT

Murat Sariyar, Klaus Pommerening, Andreas Borg

subject

Active learning; Computer science; Active learning (machine learning); Information Storage and Retrieval; Context (language use); Health Informatics; Semi-supervised learning; Machine learning; Set (abstract data type); Artificial Intelligence; Bagging; Data deduplication; Electronic Health Records; Humans; String (computer science); Decision Trees; Online machine learning; Computer Science Applications; Data mining; Medical Record Linkage; String metric; Algorithms

description

Highlights
- Active learning for medical record linkage is used on a large data set.
- We compare a simple active learning strategy with a more sophisticated variant.
- The active learning method of Sarawagi and Bhamidipaty (2002) [6] is extended.
- We deliver insights into the variations of the results due to random sampling in the active learning strategies.

Introduction: Supervised record linkage methods often require a clerical review to gain informative training data. Active learning means actively prompting the user to label data with special characteristics in order to minimise the review costs. We conducted an empirical evaluation to investigate whether a simple active learning strategy using binary comparison patterns is sufficient, or whether string metrics together with a more sophisticated algorithm are necessary to achieve high accuracy with a small training set.

Material and Methods: Based on medical registry data with different numbers of attributes, we used active learning to acquire training sets for classification trees, which were then used to classify the remaining data. Active learning for binary patterns means that every distinct comparison pattern represents a stratum from which one item is sampled. Active learning for patterns consisting of Levenshtein string metric values uses an iterative process in which the most informative and representative examples are added to the training set. In this context, we extended the active learning strategy of Sarawagi and Bhamidipaty (2002) [6].

Results: On the original data set, active learning based on binary comparison patterns leads to the best results. When four or six attributes are dropped, using string metrics leads to better results. In both cases, no more than 200 manually reviewed training examples are necessary.

Conclusions: In record linkage applications where only forename, name and birthday are available as attributes, we suggest the sophisticated active learning strategy based on string metrics in order to achieve highly accurate results. We recommend the simple strategy if more attributes are available, as in our study. In both cases, active learning significantly reduces the amount of manual involvement in training data selection compared to usual record linkage settings.
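
The simple strategy described in the abstract (one sampled record pair per distinct binary comparison pattern) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the dictionary record layout, and the choice to normalise the Levenshtein distance by the longer string's length are all assumptions made for illustration. The iterative string-metric strategy and the classification trees are not shown.

```python
import random
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Levenshtein similarity in [0, 1]; the normalisation is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def binary_pattern(rec_a: dict, rec_b: dict, fields: list) -> tuple:
    """Binary comparison pattern: 1 where the attribute values agree exactly."""
    return tuple(int(rec_a[f] == rec_b[f]) for f in fields)


def sample_one_per_pattern(pairs: list, fields: list, seed: int = 0) -> dict:
    """Treat each distinct pattern as a stratum and draw one pair from it.

    The returned pairs are the ones a reviewer would be asked to label.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for pair in pairs:
        strata[binary_pattern(pair[0], pair[1], fields)].append(pair)
    return {pattern: rng.choice(members) for pattern, members in strata.items()}
```

With hypothetical records holding the attributes forename, name and birthday, a pair that differs only in the name yields the pattern (1, 0, 1), and the size of the manually reviewed set equals the number of distinct patterns that occur in the data. Because the pair within each stratum is drawn at random, repeated runs with different seeds can produce different training sets, which is the source of variation the highlights refer to.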

year: 2012-10-01
journal: Journal of Biomedical Informatics

10.1016/j.jbi.2012.02.002 https://pubmed.ncbi.nlm.nih.gov/22402197