Machine Learning for Document Structure Recognition

Paaß, Gerhard; Konya, Iuliu

doi:10.1007/978-3-642-22613-7_12

Part of the book series: Studies in Computational Intelligence (SCI, volume 370)

Abstract

The backbone of the information age is digital information, which can be searched, accessed, and transferred instantaneously. The digitization of paper documents is therefore of great practical interest. This chapter describes approaches to document structure recognition that detect the hierarchy of physical components in document images, such as pages, paragraphs, and figures, and transform it into a hierarchy of logical components, such as titles, authors, and sections. This structural information improves readability and is useful for indexing and retrieving the information contained in documents. First, we present a rule-based system that segments the document image and estimates the logical role of the resulting zones. It is used extensively for processing newspaper collections, where it shows world-class performance. In the second part, we introduce several machine learning approaches that explore large numbers of interrelated features. They can be adapted to geometrical models of the document structure, which may be set up as a linear sequence or a general graph. These advanced models require far more computational resources but perform better than simpler alternatives and might be used in the future.

Author information

Authors and Affiliations

Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Schloss Birlinghoven, D-53754, Sankt Augustin, Germany
Gerhard Paaß & Iuliu Konya

Editor information

Editors and Affiliations

Faculty of Linguistics and Literature, Bielefeld University, Universitätsstraße 25, 33615, Bielefeld, Germany
Alexander Mehler
Institute of Cognitive Science, University of Osnabrück, Albrechtstr. 28, 49076, Osnabrück, Germany
Kai-Uwe Kühnberger
Angewandte Sprachwissenschaft und Computerlinguistik, Justus-Liebig-Universität Gießen, Otto-Behaghel-Straße 10D, 35394, Gießen, Germany
Henning Lobin & Harald Lüngen
Institut für deutsche Sprache und Literatur, Technical University Dortmund, Emil-Figge-Straße 50, 44227, Dortmund, Germany
Angelika Storrer
SFB 441 Linguistic Data Structures, Eberhard Karls Universität Tübingen, Nauklerstraße 35, 72074, Tübingen, Germany
Andreas Witt

About this chapter

Cite this chapter

Paaß, G., Konya, I. (2011). Machine Learning for Document Structure Recognition. In: Mehler, A., Kühnberger, KU., Lobin, H., Lüngen, H., Storrer, A., Witt, A. (eds) Modeling, Learning, and Processing of Text Technological Data Structures. Studies in Computational Intelligence, vol 370. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22613-7_12

DOI: https://doi.org/10.1007/978-3-642-22613-7_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22612-0
Online ISBN: 978-3-642-22613-7
eBook Packages: Engineering, Engineering (R0)