Open Source Handwritten Text Recognition on Medieval Manuscripts Using Mixed Models and Document-Specific Finetuning

Reul, Christian; Tomasek, Stefan; Langhanki, Florian; Springmann, Uwe

doi:10.1007/978-3-031-06555-2_28

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13237)

Included in the following conference series: International Workshop on Document Analysis Systems

Abstract

This paper deals with the task of practical and open source Handwritten Text Recognition (HTR) on German medieval manuscripts. We report on our efforts to construct mixed recognition models which can be applied out-of-the-box without any further document-specific training but also serve as a starting point for finetuning by training a new model on a few pages of transcribed text (ground truth). To train the mixed models we collected a corpus of 35 manuscripts and ca. 12.5k text lines for two widely used handwriting styles, Gothic and Bastarda cursives. Evaluating the mixed models out-of-the-box on four unseen manuscripts resulted in an average Character Error Rate (CER) of 6.22%. After training on 2, 4 and eventually 32 pages the CER dropped to 3.27%, 2.58%, and 1.65%, respectively. While the in-domain recognition and training of models (Bastarda model to Bastarda material, Gothic to Gothic) unsurprisingly yielded the best results, finetuning out-of-domain models to unseen scripts was still shown to be superior to training from scratch. Our new mixed models have been made openly available to the community.
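The Character Error Rate figures above can be made concrete with a minimal sketch. It assumes the common definition of CER as the character-level edit distance between prediction and ground truth, divided by the number of ground truth characters and aggregated over all lines; the abstract does not spell out the exact evaluation procedure, so this is an illustration and not the authors' evaluation code. The function names and the sample line pairs are hypothetical.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """CER over (ground_truth, prediction) line pairs (assumed definition)."""
    pairs = list(pairs)
    total_chars = sum(len(gt) for gt, _ in pairs)
    if total_chars == 0:
        raise ValueError("ground truth contains no characters")
    total_errors = sum(levenshtein(gt, pred) for gt, pred in pairs)
    return total_errors / total_chars


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the error removed when going from `before` to `after`."""
    return (before - after) / before


# Hypothetical line pairs, for illustration only.
lines = [("abc", "abd"), ("de", "de")]
print(f"CER: {character_error_rate(lines):.2%}")

# CER values reported in the abstract (out-of-the-box vs. after finetuning).
for pages, cer in [(2, 3.27), (4, 2.58), (32, 1.65)]:
    print(pages, "pages:", f"{relative_reduction(6.22, cer):.1%}", "fewer errors")
```

Applied to the reported averages, the relative reduction against the out-of-the-box CER of 6.22% comes to roughly 47%, 59%, and 73% for 2, 4, and 32 pages of finetuning data.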

Notes

1. https://github.com/Calamari-OCR/calamari_models_experimental
2. https://readcoop.eu/transkribus/public-models
3. https://github.com/jpuigcerver/PyLaia
4. https://github.com/ocropus/ocropy
5. https://zenodo.org/record/5167263
6. https://zenodo.org/record/4746342
7. https://en.wikipedia.org/wiki/Diplomatics#Diplomatic_editions_and_transcription
8. https://www.parzival.unibe.ch/englishpresentation.html
9. https://lab.sbb.berlin/events/faithful-transcriptions-2/?lang=en
10. https://www.adfontes.uzh.ch/tutorium/schriften-lesen/schriftgeschichte/bastarda-und-gotische-kursive
11. https://github.com/ocr4all
12. https://digi.ub.uni-heidelberg.de/wgd
13. https://github.com/Calamari-OCR/calamari
14. In Calamari short notation: conv=40:3×3, pool=2×2, conv=60:3×3, pool=2×2, lstm=200, dropout=0.5.
15. In Calamari short notation: conv=40:3×3, pool=2×2, conv=60:3×3, pool=2×2, conv=120:3×3, pool=2×2, lstm=200, lstm=200, lstm=200, dropout=0.5.
16. https://github.com/OCR-D/ocrd_olena
17. https://github.com/qurator-spk/sbb_binarization
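
The two network configurations given in short notation in notes 14 and 15 can be compared more easily once they are expanded into an explicit layer list. The following is a minimal sketch of such an expansion. The parser `parse_short_notation` is a hypothetical helper written for this illustration and is not part of Calamari; it assumes the reading `conv=<filters>:<height>×<width>`, `pool=<height>×<width>`, `lstm=<units>`, and `dropout=<rate>`.

```python
from typing import Dict, List, Tuple

Layer = Tuple[str, Dict[str, object]]


def parse_short_notation(spec: str) -> List[Layer]:
    """Expand a comma-separated short notation into (type, params) pairs.

    Hypothetical helper for illustration; not Calamari's own parser.
    """
    layers: List[Layer] = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        name, _, value = token.partition("=")
        name = name.strip()
        value = value.strip().replace("×", "x")  # accept both x and ×
        if name == "conv":
            filters, _, kernel = value.partition(":")
            height, width = (int(v) for v in kernel.split("x"))
            layers.append(("conv", {"filters": int(filters),
                                    "kernel": (height, width)}))
        elif name == "pool":
            height, width = (int(v) for v in value.split("x"))
            layers.append(("pool", {"size": (height, width)}))
        elif name == "lstm":
            layers.append(("lstm", {"units": int(value)}))
        elif name == "dropout":
            layers.append(("dropout", {"rate": float(value)}))
        else:
            raise ValueError(f"unknown layer type: {name!r}")
    return layers


# The two configurations as given in notes 14 and 15.
NOTE_14 = "conv=40:3x3, pool=2x2, conv=60:3x3, pool=2x2, lstm=200, dropout=0.5"
NOTE_15 = ("conv=40:3x3, pool=2x2, conv=60:3x3, pool=2x2, conv=120:3x3, "
           "pool=2x2, lstm=200, lstm=200, lstm=200, dropout=0.5")

for label, spec in [("note 14", NOTE_14), ("note 15", NOTE_15)]:
    layers = parse_short_notation(spec)
    convs = sum(1 for kind, _ in layers if kind == "conv")
    lstms = sum(1 for kind, _ in layers if kind == "lstm")
    print(f"{label}: {convs} conv layers, {lstms} LSTM layers")
```

Read this way, the configuration in note 15 adds a third convolution and pooling stage and uses three LSTM layers where the one in note 14 uses one.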

Acknowledgement

The authors would like to thank their student research assistants Lisa Gugel, Kiara Hart, Ursula Heß, Annika Müller, and Anne Schmid for their extensive segmentation and transcription work, as well as Maximilian Nöth and Maximilian Wehner for supporting the data preparation.

This work was partially funded by the German Research Foundation (DFG) under project no. 460665940.

Author information

Authors and Affiliations

University of Würzburg, Würzburg, Germany
Christian Reul, Stefan Tomasek & Florian Langhanki
CIS, LMU Munich, Munich, Germany
Uwe Springmann

Corresponding author

Correspondence to Christian Reul.

Editor information

Editors and Affiliations

Kyushu University, Fukuoka, Japan
Seiichi Uchida
Boise State University, Boise, ID, USA
Elisa Barney
LIRIS UMR CNRS, Villeurbanne, France
Véronique Eglin

About this paper

Cite this paper

Reul, C., Tomasek, S., Langhanki, F., Springmann, U. (2022). Open Source Handwritten Text Recognition on Medieval Manuscripts Using Mixed Models and Document-Specific Finetuning. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. Lecture Notes in Computer Science, vol 13237. Springer, Cham. https://doi.org/10.1007/978-3-031-06555-2_28

DOI: https://doi.org/10.1007/978-3-031-06555-2_28
Published: 18 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06554-5
Online ISBN: 978-3-031-06555-2
eBook Packages: Computer Science, Computer Science (R0)
