RESEARCH PRODUCT

Diversity in search strategies for ensemble feature selection

Pádraig Cunningham, Alexey Tsymbal, Mykola Pechenizkiy

subject

business.industry; Context (language use); Feature selection; Machine learning; computer.software_genre; Ensemble learning; Measure (mathematics); Random subspace method; Ensembles of classifiers; ComputingMethodologies_PATTERNRECOGNITION; Hardware and Architecture; Feature (computer vision); Signal Processing; Artificial intelligence; Data mining; business; computer; Software; Selection (genetic algorithm); Information Systems; Mathematics

description

Ensembles of learnt models constitute one of the main current directions in machine learning and data mining. Ensembles allow us to achieve higher accuracy, which is often not achievable with single models. It has been shown theoretically and experimentally that, for an ensemble to be effective, it should consist of base classifiers that have diversity in their predictions. One technique that has proved effective for constructing an ensemble of diverse base classifiers is the use of different feature subsets, so-called ensemble feature selection. Many ensemble feature selection strategies incorporate diversity as an objective in the search for the best collection of feature subsets. A number of ways are known to quantify diversity in ensembles of classifiers, but little research has been done on their appropriateness for ensemble feature selection. In this paper, we compare five measures of diversity with regard to their possible use in ensemble feature selection. We conduct experiments on 21 data sets from the UCI machine learning repository, comparing the ensemble accuracy and other characteristics of the ensembles built with ensemble feature selection based on the considered measures of diversity. We consider four search strategies for ensemble feature selection together with simple random subspacing: genetic search, hill-climbing, and ensemble forward and backward sequential selection. In the experiments, we show that, in some cases, the ensemble feature selection process can be sensitive to the choice of the diversity measure, and that the question of the superiority of a particular measure depends on the context of the use of diversity and on the data being processed. In many cases and on average, the plain disagreement measure is the best. Genetic search, kappa, and dynamic voting with selection form the best combination of a search strategy, diversity measure and integration method.
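To make the notion of a diversity measure concrete, the following is a minimal Python sketch of plain disagreement together with random feature-subset sampling. The abstract does not spell out the formulas, so the definition used here (the fraction of instances on which two classifiers predict different labels, averaged over all classifier pairs) is assumed to be the usual pairwise formulation. The function names and the toy predictions are hypothetical and are not taken from the paper.

```python
from itertools import combinations
import random


def plain_disagreement(preds_a, preds_b):
    """Fraction of instances on which two classifiers predict different labels."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("prediction lists must be non-empty and of equal length")
    differing = sum(1 for a, b in zip(preds_a, preds_b) if a != b)
    return differing / len(preds_a)


def ensemble_diversity(all_preds):
    """Average pairwise plain disagreement over all pairs of base classifiers."""
    pairs = list(combinations(all_preds, 2))
    if not pairs:
        raise ValueError("need at least two classifiers")
    return sum(plain_disagreement(a, b) for a, b in pairs) / len(pairs)


def random_subspace(n_features, subset_size, rng):
    """Draw one random feature subset (as sorted indices), as in random subspacing."""
    return sorted(rng.sample(range(n_features), subset_size))


# Hypothetical predictions of three base classifiers on five instances.
preds = [
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
]
diversity = ensemble_diversity(preds)

# One random feature subset per base classifier (10 features, 4 per subset).
subsets = [random_subspace(10, 4, random.Random(seed)) for seed in range(3)]
```

A search strategy such as hill-climbing or genetic search would then score candidate collections of feature subsets with a fitness that combines accuracy and a diversity term like `ensemble_diversity`; how the two are weighted is left open in this sketch.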

10.1016/j.inffus.2004.04.003
https://research.tue.nl/nl/publications/6e8650d6-3cd9-4cfd-b9d4-b00c22da1810