Cooperative and Fast-Learning Information Extraction from Business Documents for Document Archiving

Esser, Daniel

doi:10.1007/978-3-642-41033-8_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8186)

Included in the following conference series:

OTM Confederated International Conferences "On the Move to Meaningful Internet Systems"

Abstract

Automatic information extraction from scanned business documents is especially valuable in document management and archiving. Although current solutions for document classification and extraction work well, they still require considerable on-site configuration by domain experts or administrators. Small office/home office (SOHO) users and private individuals in particular often avoid such systems because of the configuration effort and the long training periods needed to reach acceptable extraction rates. We therefore present a solution for information extraction from scanned business documents that fits the requirements of these users. Our approach is highly adaptable to new document types and index fields and needs only a minimum of training documents to reach extraction rates comparable to related work and to manual document indexing. By providing a cooperative extraction system that allows participants to share extraction knowledge, we furthermore aim to minimize the amount of user feedback required and to increase the acceptance of such a system.

A first evaluation of our solution on a set of 12,500 documents with 10 commonly used fields shows competitive results above 85% F1-measure. Results above 75% F1-measure are already reached with a minimal training set of only one document per template.
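
For readers unfamiliar with the metric, the following minimal sketch shows how an F1-measure over extracted index fields can be computed as the harmonic mean of precision and recall. The abstract does not describe the evaluation protocol, so the exact-match criterion, the function name `f1_scores` and the field names in the usage example are assumptions made purely for illustration.

```python
from typing import Dict, Optional, Tuple


def f1_scores(
    predicted: Dict[str, Optional[str]],
    gold: Dict[str, Optional[str]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for one set of extracted index fields.

    Assumption: a field counts as correct only if a value was extracted
    and it matches the gold value exactly.
    """
    # Fields for which the extractor actually returned a value.
    extracted = {k: v for k, v in predicted.items() if v is not None}
    # Fields that are present in the ground truth.
    expected = {k: v for k, v in gold.items() if v is not None}

    correct = sum(1 for k, v in extracted.items() if expected.get(k) == v)

    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: four gold fields, three extracted, two of them correct.
gold = {"sender": "ACME GmbH", "date": "2012-03-01",
        "amount": "119.00", "invoice_no": "4711"}
predicted = {"sender": "ACME GmbH", "date": "2012-03-01",
             "amount": "191.00", "invoice_no": None}

precision, recall, f1 = f1_scores(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this example precision is 2/3 (two of three extracted values are correct) and recall is 2/4, giving an F1 of 4/7. Scores over a whole document set would typically be obtained by pooling the counts or averaging per field; which of these the paper uses is not stated in the abstract.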

Author information

Authors and Affiliations

Computer Networks Group, Technical University Dresden, 01062, Dresden, Germany
Daniel Esser

Editor information

Editors and Affiliations

Systems, Software and In-Orbit Demonstration Department, European Space Agency, Noordwijk, The Netherlands
Yan Tang Demey
CRAN, University of Lorraine, Campus Sciences, BP 70239, 54506, Vandoevre-les-Nancy, France
Hervé Panetto

About this paper

Cite this paper

Esser, D. (2013). Cooperative and Fast-Learning Information Extraction from Business Documents for Document Archiving. In: Demey, Y.T., Panetto, H. (eds) On the Move to Meaningful Internet Systems: OTM 2013 Workshops. OTM 2013. Lecture Notes in Computer Science, vol 8186. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41033-8_4

DOI: https://doi.org/10.1007/978-3-642-41033-8_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41032-1
Online ISBN: 978-3-642-41033-8
eBook Packages: Computer Science, Computer Science (R0)
