Rejection Threshold Estimation for an Unknown Language Model in an OCR Task

Arlandis, Joaquim; Perez-Cortes, Juan-Carlos; Navarro-Cerdan, J. Ramon; Llobet, Rafael

doi:10.1007/978-3-642-14980-1_73

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 6218)

Included in the following conference series:

Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR)

Abstract

In an OCR post-processing task, a language model is used to find the best transformation of the OCR hypothesis into a string compatible with the language. The cost of this transformation is used as a confidence value to reject the strings that are less likely to be correct, and the error rate of the accepted strings should be strictly controlled by the user. In this work, the expected error rate distribution of an unknown language model is estimated from a training set composed of known language models. This means that after building a new language model, the user should be able to automatically “fix” the expected error rate at an acceptable level instead of having to deal with an arbitrary threshold.
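The basic rejection mechanism described above can be illustrated with a minimal sketch. It assumes a labeled validation set of (transformation cost, correct/incorrect) pairs is available for a single, known language model, and it picks the largest cost threshold whose accepted strings stay within a target error rate. The function name `pick_threshold` and the sample data are hypothetical, and this is not the estimation procedure of the paper, which predicts the error rate distribution for an unknown language model from a set of known ones.

```python
from typing import List, Optional, Tuple


def pick_threshold(
    samples: List[Tuple[float, bool]], max_error_rate: float
) -> Optional[float]:
    """Return the largest cost threshold t such that the strings accepted
    with cost <= t have an error rate of at most max_error_rate.

    samples: (transformation_cost, is_correct) pairs from a labeled set.
    Returns None when no threshold satisfies the target.
    """
    ordered = sorted(samples, key=lambda s: s[0])
    best = None
    accepted = 0
    errors = 0
    i = 0
    while i < len(ordered):
        cost = ordered[i][0]
        # Consume every sample tied at this cost, since a threshold
        # cannot separate strings that share the same cost.
        while i < len(ordered) and ordered[i][0] == cost:
            accepted += 1
            if not ordered[i][1]:
                errors += 1
            i += 1
        if errors / accepted <= max_error_rate:
            best = cost
    return best


# Hypothetical validation data: low cost tends to mean a correct string.
validation = [(0.1, True), (0.2, True), (0.5, False), (0.6, True), (0.9, False)]
threshold = pick_threshold(validation, max_error_rate=0.25)
```

With such a function the user states an acceptable error rate and obtains the threshold, rather than tuning the threshold directly; strings whose cost exceeds it are rejected.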

Work partially supported by the Spanish MICINN grants TIN2009-14205-C04-02 and Consolider Ingenio 2010: MIPRCV (CSD2007-00018) and by IMPIVA and the E.U. by means of the ERDF in the context of the R+D Program for Technological Institutes of IMPIVA network for 2010 (IMIDIC-2009/204).

Author information

Authors and Affiliations

Instituto Tecnológico de Informática, Universitat Politècnica de València, Camí de Vera s/n, 46071, València, Spain
Joaquim Arlandis, Juan-Carlos Perez-Cortes, J. Ramon Navarro-Cerdan & Rafael Llobet

Editor information

Editors and Affiliations

Computer Vision and Pattern Recognition Group, Computer Science, University of York, Heslington, YO10-5DD, York, United Kingdom
Edwin R. Hancock
Department of Computer Science, University of York, YO10 5DD, UK
Richard C. Wilson
Centre for Vision, Speech and Signal Proc (CVSSP), University of Surrey, Guildford, GU2 7XH, Surrey, United Kingdom
Terry Windeatt
Electrical and Electronics Engineering Department, Middle East Technical University, 06531, Ankara, Turkey
Ilkay Ulusoy
Department of Computer Science and Artificial Intelligence, University of Alicante, P.O.B. 99, E-03080, Alicante, Spain
Francisco Escolano

About this paper

Cite this paper

Arlandis, J., Perez-Cortes, JC., Navarro-Cerdan, J.R., Llobet, R. (2010). Rejection Threshold Estimation for an Unknown Language Model in an OCR Task. In: Hancock, E.R., Wilson, R.C., Windeatt, T., Ulusoy, I., Escolano, F. (eds) Structural, Syntactic, and Statistical Pattern Recognition. SSPR/SPR 2010. Lecture Notes in Computer Science, vol 6218. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14980-1_73

DOI: https://doi.org/10.1007/978-3-642-14980-1_73
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14979-5
Online ISBN: 978-3-642-14980-1
eBook Packages: Computer Science, Computer Science (R0)
