RESEARCH PRODUCT

Semi-automated annotation of page-based documents within the Genre and Multimodality framework

Tuomo Hiippala

subject

060201 languages & linguistics; Structure (mathematical logic); Information retrieval; Computer science; computer.internet_protocol; business.industry; 05 social sciences; 050801 communication & media studies; 06 humanities and the arts; Temporal annotation; computer.software_genre; Document processing; Pipeline (software); Multimodality; Annotation; 0508 media and communications; Open source; 0602 languages and literature; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Artificial intelligence; business; computer; Natural language processing; XML

description

This paper describes ongoing work on a tool developed for annotating document images for their multimodal features and compiling this information into a corpus. The tool leverages open source computer vision and natural language processing libraries to describe the content and structure of multimodal documents and to generate multiple layers of XML annotation. The paper introduces the annotation schema, describes the document processing pipeline and concludes with a brief description of future work.
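To make the idea of "multiple layers of XML annotation" concrete, the following is a minimal sketch of how a pipeline might write two cross-referenced layers for detected page regions. It is not the tool's implementation: the function name `build_layers`, the element names (`base`, `unit`, `layout`, `layout-unit`), the attribute names, and the input format are all assumptions made for illustration and do not reproduce the actual annotation schema.

```python
import xml.etree.ElementTree as ET


def build_layers(regions):
    """Build two hypothetical annotation layers from detected page regions.

    `regions` is assumed to be a list of dicts with a 'text' string and a
    'bbox' tuple (x, y, width, height), e.g. as produced by an OCR step.
    """
    base = ET.Element("base")      # layer 1: the content units themselves
    layout = ET.Element("layout")  # layer 2: where each unit sits on the page

    for i, region in enumerate(regions, start=1):
        unit_id = f"u-{i}"

        # Record the textual content once, in the base layer.
        unit = ET.SubElement(base, "unit", id=unit_id)
        unit.text = region["text"]

        # Describe position in a separate layer that points back to the
        # base unit by id, so the layers stay independent of each other.
        x, y, w, h = region["bbox"]
        ET.SubElement(
            layout, "layout-unit",
            id=f"lay-{i}", xref=unit_id,
            x=str(x), y=str(y), w=str(w), h=str(h),
        )

    return base, layout


if __name__ == "__main__":
    sample = [
        {"text": "Headline", "bbox": (10, 20, 300, 40)},
        {"text": "Body paragraph", "bbox": (10, 70, 300, 200)},
    ]
    base_layer, layout_layer = build_layers(sample)
    print(ET.tostring(base_layer, encoding="unicode"))
    print(ET.tostring(layout_layer, encoding="unicode"))
```

Keeping content and position in separate layers linked by identifiers means a further layer can be added later without rewriting the existing ones.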

https://doi.org/10.18653/v1/w16-2109