A novel XML document structure comparison framework based on sub-tree commonalities and label semantics

RESEARCH PRODUCT

subject

Document Structure Description; Computer Networks and Communications; computer.internet_protocol; Computer science; Efficient XML Interchange; [SCCO.COMP] Cognitive science/Computer science; 0102 computer and information sciences; 02 engineering and technology; computer.software_genre; 01 natural sciences; Semantic similarity; XML Schema Editor; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; XML schema; computer.programming_language; Information retrieval; [INFO.INFO-DB] Computer Science [cs]/Databases [cs.DB]; [INFO.INFO-WB] Computer Science [cs]/Web; [INFO.INFO-MM] Computer Science [cs]/Multimedia [cs.MM]; XML validation; computer.file_format; Document clustering; Human-Computer Interaction; XML framework; Tree (data structure); XML database; Tree structure; 010201 computation theory & mathematics; [INFO.INFO-IR] Computer Science [cs]/Information Retrieval [cs.IR]; 020201 artificial intelligence & image processing; Semi-structured data; Edit distance; computer; Software; XML; XML Catalog; Data integration

description

XML similarity evaluation has become a central issue in the database and information communities, with applications ranging over document clustering, version control, data integration and ranked retrieval. Various algorithms for comparing hierarchically structured data, XML documents in particular, have been proposed in the literature. Most of them rely on techniques for finding the edit distance between tree structures, XML documents being commonly modeled as Ordered Labeled Trees. Yet a thorough investigation of current approaches led us to identify several similarity aspects, namely sub-tree related structural and semantic similarities, which are not sufficiently addressed when comparing XML documents. In this paper, we provide an integrated and fine-grained comparison framework that deals with both structural and semantic similarities in XML documents (detecting the occurrences and repetitions of structurally and semantically similar sub-trees) and allows the end-user to adjust the comparison process according to her requirements. Our framework consists of four main modules for (i) discovering the structural commonalities between sub-trees, (ii) identifying sub-tree semantic resemblances, (iii) computing tree-based edit operation costs, and (iv) computing tree edit distance. Experimental results demonstrate higher comparison accuracy with respect to alternative methods, while timing experiments reflect the impact of semantic similarity on overall system performance.
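
To make the notion of edit distance between Ordered Labeled Trees concrete, the following minimal sketch parses two XML documents and computes a simplified, Selkow-style top-down tree edit distance with unit costs. It is not the framework described in the abstract: the function names (`size`, `label_cost`, `tree_distance`) and the unit cost model are assumptions made for illustration, and `label_cost` only marks the place where a semantic label similarity measure could, hypothetically, be plugged in.

```python
import xml.etree.ElementTree as ET


def size(node):
    """Number of element nodes in the sub-tree rooted at node."""
    return 1 + sum(size(child) for child in node)


def label_cost(a, b):
    """Unit relabeling cost (assumption); a semantic measure could replace it."""
    return 0.0 if a == b else 1.0


def tree_distance(t1, t2):
    """Selkow-style top-down edit distance between two ordered labeled trees.

    Whole sub-trees are inserted or deleted at a cost equal to their size;
    matched children are compared recursively.
    """
    c1, c2 = list(t1), list(t2)
    m, n = len(c1), len(c2)
    # dist[i][j]: cost of turning the first i children of t1
    # into the first j children of t2.
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + size(c1[i - 1])  # delete sub-tree
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + size(c2[j - 1])  # insert sub-tree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(
                dist[i - 1][j] + size(c1[i - 1]),
                dist[i][j - 1] + size(c2[j - 1]),
                dist[i - 1][j - 1] + tree_distance(c1[i - 1], c2[j - 1]),
            )
    return label_cost(t1.tag, t2.tag) + dist[m][n]


# Two small, made-up documents that differ in one child label.
doc_a = ET.fromstring("<paper><title/><author/><author/></paper>")
doc_b = ET.fromstring("<paper><title/><author/><year/></paper>")
print(tree_distance(doc_a, doc_b))  # 1.0
```

In this simplified model, repeated or semantically related sub-trees receive no special treatment; accounting for them in the edit operation costs is precisely what modules (i) to (iii) of the abstract are said to address.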

year	journal	country	edition	language
2012-03-01

https://hal.archives-ouvertes.fr/hal-01092512