
RESEARCH PRODUCT

The Hierarchical Agglomerative Clustering with Gower index: a methodology for automatic design of OLAP cube in ecological data processing context

Bruno Faivre, Ludovic Journaux, Lucile Sautot, Paul Molin

subject

[INFO.INFO-NA] Computer Science [cs]/Numerical Analysis [cs.NA]; [INFO.INFO-DB] Computer Science [cs]/Databases [cs.DB]; [SDE.BE] Environmental Sciences/Biodiversity and Ecology; [STAT.AP] Statistics [stat]/Applications [stat.AP]; Computer science; Context (language use); 02 engineering and technology; computer.software_genre; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Dimension (data warehouse); Cluster analysis; Ecology, Evolution, Behavior and Systematics; OLAP; Ecology; Automatic design; Applied Mathematics; Ecological Modeling; Online analytical processing; InformationSystems_DATABASEMANAGEMENT; Hierarchical agglomerative clustering; Missing data; Data warehouse; Computer Science Applications; Hierarchical clustering; Computational Theory and Mathematics; Modeling and Simulation; OLAP cube; Data mining; Bird population

description

In Press, Corrected Proof; International audience; OLAP systems can improve ecological studies. Ecology studies, monitors, and analyzes phenomena across space and time and according to several parameters, and OLAP systems let ecologists browse large datasets along these axes. One focus of current research on OLAP systems is the automatic design of OLAP cubes and data warehouse schemas, which makes OLAP technology accessible to users who are not information technology experts. To be effective, however, automatic OLAP design must handle a variety of cases. Moreover, OLAP technology is built on the concept of hierarchy, so hierarchical clustering methods are often used by OLAP system designers. In this article, we propose using hierarchical agglomerative clustering with a metric that comes from ecological studies (the Gower similarity index) to automatically build hierarchical dimensions in an OLAP cube. With this similarity index, hierarchical clustering can be performed on heterogeneous datasets that contain both qualitative and quantitative variables. We present a prototype automatic system that builds dimensions for an OLAP cube, and we measure its performance as a function of the number of clustered individuals and of the number of variables used for clustering. From these measurements we derive an estimate of performance on a large dataset. The Gower index in a hierarchical agglomerative clustering thus permits the management of heterogeneous datasets with missing values in the context of automatic OLAP cube construction. With this methodology, we can build new dimensions based on hierarchies in the data that are not evident. Data mining methods can complement expert knowledge during the design of an OLAP cube, because they can reveal the inherent structure of the data.
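As an illustration of the approach described in the abstract, the following is a minimal sketch, not the authors' prototype. It computes the Gower similarity on records mixing qualitative and quantitative variables, skips any variable that is missing for a pair, and then runs a naive agglomerative clustering on the resulting distances (1 minus similarity). The average-linkage criterion, the function names, the treatment of a pair with no shared variables (similarity 0), and the small bird-survey records are all assumptions made for the example.

```python
from itertools import combinations


def variable_ranges(records, numeric):
    """Range (max - min) of each quantitative variable; None marks a qualitative one."""
    ranges = []
    for k, is_num in enumerate(numeric):
        if not is_num:
            ranges.append(None)
            continue
        vals = [r[k] for r in records if r[k] is not None]
        ranges.append(max(vals) - min(vals) if vals else 0.0)
    return ranges


def gower_similarity(a, b, ranges):
    """Gower similarity between two records; a None value is treated as missing."""
    total, weight = 0.0, 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            continue  # missing value: this variable is ignored for this pair
        if r is None:
            total += 1.0 if x == y else 0.0  # qualitative: simple matching
        else:
            # quantitative: 1 - |x - y| / range (a zero range means all values agree)
            total += (1.0 - abs(x - y) / r) if r > 0 else 1.0
        weight += 1
    # Assumption: a pair with no comparable variable gets similarity 0.
    return total / weight if weight else 0.0


def agglomerate(records, numeric):
    """Naive average-linkage agglomerative clustering on Gower distances.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples;
    merged clusters receive new ids n, n + 1, ...
    """
    ranges = variable_ranges(records, numeric)
    n = len(records)
    dist = {(i, j): 1.0 - gower_similarity(records[i], records[j], ranges)
            for i, j in combinations(range(n), 2)}
    clusters = {i: [i] for i in range(n)}

    def avg(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        pairs = [(min(i, j), max(i, j)) for i in clusters[c1] for j in clusters[c2]]
        return sum(dist[p] for p in pairs) / len(pairs)

    merges, next_id = [], n
    while len(clusters) > 1:
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: avg(*p))
        merges.append((a, b, avg(a, b)))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


# Hypothetical observations: (habitat, mean temperature, species count).
records = [
    ("forest", 12.0, 3),
    ("forest", 11.0, None),  # species count missing
    ("wetland", 30.0, 10),
    ("wetland", 28.0, 9),
]
numeric = [False, True, True]
merges = agglomerate(records, numeric)
for a, b, d in merges:
    print(f"merge {a} + {b} at Gower distance {d:.3f}")
```

Each merge in the history corresponds to one level of a candidate dimension hierarchy: early merges group the most similar members, and later merges define the coarser levels. The naive loop is only meant to make the steps explicit; it recomputes linkage distances at every step and would not scale to the dataset sizes discussed in the abstract.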

https://hal.archives-ouvertes.fr/hal-01060817