Machine Learning for Intelligent Processing of Printed Documents

Esposito, Floriana; Malerba, Donato; Lisi, Francesca A.

doi:10.1023/A:1008735902918

Published: March 2000

Journal of Intelligent Information Systems, Volume 14, pages 175–198 (2000)

Abstract

A paper document processing system is an information system component which transforms information on printed or handwritten documents into a computer-revisable form. In intelligent systems for paper document processing this information capture process is based on knowledge of the specific layout and logical structures of the documents. This article proposes the application of machine learning techniques to acquire the specific knowledge required by an intelligent document processing system, named WISDOM++, that manages printed documents, such as letters and journals. Knowledge is represented by means of decision trees and first-order rules automatically generated from a set of training documents. In particular, an incremental decision tree learning system is applied for the acquisition of decision trees used for the classification of segmented blocks, while a first-order learning system is applied for the induction of rules used for the layout-based classification and understanding of documents. Issues concerning the incremental induction of decision trees and the handling of both numeric and symbolic data in first-order rule learning are discussed, and the validity of the proposed solutions is empirically evaluated by processing a set of real printed documents.
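
The abstract mentions decision trees for classifying segmented blocks and the handling of numeric data. As a rough illustration only, the sketch below picks a binary threshold on one numeric attribute by information gain, which is the kind of test a decision-tree node applies. It is not the incremental learner or the first-order system used in WISDOM++; the attribute (block height in pixels), the class labels, and the sample values are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the best split `value <= threshold`.

    Candidate thresholds are midpoints between consecutive distinct
    sorted values; the one with the highest information gain wins.
    """
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = ((low + high) / 2, gain)
    return best


# Hypothetical training blocks: height in pixels and an assumed block class.
heights = [10, 12, 11, 80, 95, 120]
labels = ["text", "text", "text", "image", "image", "image"]

threshold, gain = best_threshold(heights, labels)
print(threshold, gain)
```

A full tree learner would repeat this choice recursively over all attributes, and an incremental learner would additionally revise earlier choices as new training blocks arrive; neither step is shown here.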

Author information

Floriana Esposito, Donato Malerba, and Francesca A. Lisi: Dipartimento di Informatica, Università degli Studi di Bari, via Orabona 4, 70125, Bari, Italy

About this article

Cite this article

Esposito, F., Malerba, D. & Lisi, F.A. Machine Learning for Intelligent Processing of Printed Documents. Journal of Intelligent Information Systems 14, 175–198 (2000). https://doi.org/10.1023/A:1008735902918

Issue Date: March 2000
DOI: https://doi.org/10.1023/A:1008735902918
