Cooperative and Fast-Learning Information Extraction from Business Documents for Document Archiving

  • Conference paper
On the Move to Meaningful Internet Systems: OTM 2013 Workshops (OTM 2013)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8186)

Abstract

Automatic information extraction from scanned business documents is especially valuable in document management and archiving. Although current solutions for document classification and extraction work well, they still require considerable on-site configuration by domain experts or administrators. Small office/home office (SOHO) users and private individuals in particular often avoid such systems because of the configuration effort and the long training periods needed to reach acceptable extraction rates. We therefore present a solution for information extraction from scanned business documents that fits the requirements of these users. Our approach is highly adaptable to new document types and index fields and needs only a minimum of training documents to reach extraction rates comparable to related work and to manual document indexing. By providing a cooperative extraction system that allows participants to share extraction knowledge, we furthermore aim to minimize the amount of user feedback required and to increase the acceptance of such a system.

A first evaluation of our solution on a set of 12,500 documents with 10 commonly used fields shows competitive results above 85% F1-measure. Results above 75% F1-measure are already reached with a minimal training set of only one document per template.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Esser, D. (2013). Cooperative and Fast-Learning Information Extraction from Business Documents for Document Archiving. In: Demey, Y.T., Panetto, H. (eds) On the Move to Meaningful Internet Systems: OTM 2013 Workshops. OTM 2013. Lecture Notes in Computer Science, vol 8186. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41033-8_4

  • DOI: https://doi.org/10.1007/978-3-642-41033-8_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41032-1

  • Online ISBN: 978-3-642-41033-8

  • eBook Packages: Computer Science, Computer Science (R0)
