RESEARCH PRODUCT
Semi-automated annotation of page-based documents within the Genre and Multimodality framework
Tuomo Hiippala
subject
060201 languages & linguistics; Structure (mathematical logic); Information retrieval; Computer science; computer.internet_protocol; business.industry; 05 social sciences; 050801 communication & media studies; 06 humanities and the arts; Temporal annotation; computer.software_genre; Document processing; Pipeline (software); Multimodality; Annotation; 0508 media and communications; Open source; 0602 languages and literature; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Artificial intelligence; business; computer; Natural language processing; XML
description
This paper describes ongoing work on a tool developed for annotating document images for their multimodal features and compiling this information into a corpus. The tool leverages open source computer vision and natural language processing libraries to describe the content and structure of multimodal documents and to generate multiple layers of XML annotation. The paper introduces the annotation schema, describes the document processing pipeline and concludes with a brief description of future work.
year | journal | country | edition | language |
---|---|---|---|---|
2016-01-01 | Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities | | | |