AUTHOR

Mathias Seuret

ICDAR 2021 Competition on Historical Document Classification

2021

This competition investigated the performance of historical document classification. The analysis of historical documents is a difficult task usually carried out by trained humanists. We provided three classification tasks, which can be solved individually or jointly: font group/script type, location, and date. The document images were provided by several institutions and are taken from handwritten and printed books as well as from charters. In contrast to previous competitions, all participants relied on deep-learning-based approaches. Nevertheless, performance varied widely among the submitted systems. The easiest task seemed to be font grou…

Historical document images; business.industry; Computer science; Document classification; Deep learning; Contrast (statistics); computer.software_genre; Variety (linguistics); Task (project management); Competition (economics); [INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing; Document analysis; Font; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Dating; Artificial intelligence; [SHS.HIST] Humanities and Social Sciences/History; business; computer; Natural language processing; Historical document
researchProduct

New Approaches to OCR for Early Printed Books

2020

Books printed before 1800 present major problems for OCR. One of the main obstacles is the lack of diversity of historical fonts in training data. The OCR-D project, consisting of book historians and computer scientists, aims to address this deficiency by focussing on three major issues. Our first target was to create a tool that automatically identifies font groups in images of historical documents. We concentrated on Gothic font groups that were commonly used in German texts printed in the 15th and 16th centuries: the well-known Fraktur and the lesser-known Bastarda, Rotunda, Textura and Schwabacher. The tool was trained with 35,000 images and reaches an accuracy level of 98%. It can not on…

German; Information retrieval; Hebrew; Computer science; Font; Kraken; language; Comparative historical research; Tesseract; History of the book; language.human_language; Woodcut; DigItalia
researchProduct