Omnifont text recognition of printed cursive scripts via HMMs, compact lossless features, and soft data clustering

Attia, Mohamed; El-Mahallawy, Mohamed S.; Rashwan, Mohsen A. A.; Nazih, Waleed; Al-Badrashiny, Mohamed A. S. A. A.

doi:10.1007/s10044-014-0428-0

Theoretical Advances
Published: 26 November 2014

Volume 18, pages 507–521, (2015)

Pattern Analysis and Applications

Mohamed Attia^1,2,3,
Mohamed S. El-Mahallawy^3,
Mohsen A. A. Rashwan^1,4,
Waleed Nazih^5 &
…
Mohamed A. S. A. A. Al-Badrashiny^6

Abstract

This paper presents an optical character/text recognition (OCR) system for cursive scripts such as Arabic, Urdu, Persian, and Kurdish. The system is large-scale in terms of its architecture and training data size, and it delivers state-of-the-art performance. The paper introduces the theoretical derivation and experimental assessment of our two main contributions deployed in this OCR. The first contribution is the design of a new autonomously normalized, lossless, and compact feature vector that enables the production of a truly robust omnifont OCR system for cursive scripts with an architecture based on hidden Markov models (HMMs), like that of automatic speech recognition (ASR) systems. Half of the components in this feature vector are analog (continuous-valued) and the other half are discrete. This mix obstructs the use of continuous Gaussian mixtures to model an aggregate of such features and mandates the use of discrete HMMs instead, which in turn necessitates vector clustering and quantization. The second contribution is a new soft (i.e., probabilistic) vector quantization (VQ) scheme, as opposed to conventional hard-deciding VQ, which we derive analytically and then deploy to alleviate overfitting and boost the robustness of our OCR against different kinds of obtrusive variance. We present experimental evidence of these benefits of the soft VQ scheme for our OCR. Other machine-learning systems with VQ modules may also deploy soft VQ to obtain the same benefits.
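To make the contrast between hard-deciding and soft VQ concrete, the following is a minimal sketch and not the scheme derived in the paper, whose exact formula is not given in this abstract. It assumes a Euclidean codebook and, purely for illustration, turns distances into probabilities with an exponential weighting; the function names, the `temperature` parameter, and the toy codebook are all hypothetical.

```python
import math

def squared_distance(x, c):
    """Squared Euclidean distance between a feature vector and a codeword."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def hard_vq(x, codebook):
    """Hard-deciding VQ: return only the index of the nearest codeword."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(x, codebook[k]))

def soft_vq(x, codebook, temperature=1.0):
    """Soft VQ (illustrative assumption): return a probability for every
    codeword, decaying exponentially with squared distance."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    distances = [squared_distance(x, c) for c in codebook]
    d_min = min(distances)  # subtract the minimum for numerical stability
    weights = [math.exp(-(d - d_min) / temperature) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

# Toy 2-D codebook and a vector lying between the first two codewords.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
x = (0.45, 0.1)

hard_index = hard_vq(x, codebook)   # a single winning index
soft_probs = soft_vq(x, codebook)   # one probability per codeword
sharp_probs = soft_vq(x, codebook, temperature=1e-3)  # approaches hard VQ
```

In this sketch the hard quantizer discards the fact that `x` is almost as close to the second codeword as to the first, whereas the soft quantizer keeps that information as a probability that a discrete HMM could weight its observation likelihoods by. As `temperature` shrinks, the soft assignment collapses onto the hard decision.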

Notes

This holds true at least up to the time we submitted our manuscript to this journal in July 2012.
To make sure that the difference between the WER with SLM enabled and the WER with SLM neutralized is not due to some random statistical perturbation, a statistical significance t-test was run. It produced a p-value lower than the significance level of 0.001, which strongly rejects the hypothesis that this difference arose randomly and confirms the actual positive impact of enabling SLM in our HMM decoder.
All the material presented in this section is independent of the algorithm used for inferring that codebook, e.g., k-means or LBG.
Typically, any adaptive methodology for inferring the codebook works in the off-line phase on a sample population that is assumed to have the same statistical properties as the phenomenon being modeled.
The generalization test dataset and the corresponding output of both versions are downloadable via http://www.rdi-eg.com/Soft_Hard_VQ_OCR_Generalization_Test_Data.RAR.
To make sure that the differences between the WER with soft VQ and the WER with hard VQ are not due to some random statistical perturbation, statistical significance t-tests were run. They produced p-values lower than the significance level of 0.05, which rejects the hypothesis that these differences arose by random chance.
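As a rough illustration of the off-line codebook inference mentioned in the notes above, the sketch below runs plain k-means on a toy sample population. It is not the authors' implementation: the deterministic initialization from the first `k` samples, the iteration cap, and the sample values are assumptions chosen only to keep the example small and repeatable.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between a sample and a codeword."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def kmeans_codebook(samples, k, iterations=20):
    """Infer a k-entry codebook from samples with plain k-means."""
    # Assumption: seed the codebook with the first k samples.
    codebook = [tuple(s) for s in samples[:k]]
    for _ in range(iterations):
        # Assignment step: attach every sample to its nearest codeword.
        clusters = [[] for _ in range(k)]
        for s in samples:
            nearest = min(range(k),
                          key=lambda j: squared_distance(s, codebook[j]))
            clusters[nearest].append(s)
        # Update step: move each codeword to the mean of its cluster.
        new_codebook = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_codebook.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)))
            else:
                new_codebook.append(codebook[j])  # keep an empty cluster's codeword
        if new_codebook == codebook:  # converged: no codeword moved
            break
        codebook = new_codebook
    return codebook

# Hypothetical 2-D sample population with two visible groups.
samples = [(0.0, 0.0), (4.0, 4.0), (0.0, 1.0),
           (1.0, 0.0), (4.0, 5.0), (5.0, 4.0)]
codebook = kmeans_codebook(samples, k=2)
```

The resulting codebook would then be fixed and used at recognition time by whichever quantizer is deployed, hard or soft, which is consistent with the note that the quantization material does not depend on how the codebook was inferred.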

Author information

Authors and Affiliations

Natural Language Processing, The Engineering Company for the Development of Computer Systems, RDI, Giza, Egypt
Mohamed Attia & Mohsen A. A. Rashwan
Software Engineering, Luxor Technology Inc, Oakville, ON, L6L6V2, Canada
Mohamed Attia
College of Engineering and Technology, Arab Academy for Science, Technology, and Maritime Transport (AASTMT), Heliopolis Campus, Cairo, Egypt
Mohamed Attia & Mohamed S. El-Mahallawy
Department of Electronics and Electrical Communications, Faculty of Engineering, Cairo University, Giza, Egypt
Mohsen A. A. Rashwan
College of Computer Engineering and Sciences, Salman bin Abdulaziz University, Alkharj, KSA
Waleed Nazih
Department of Computer Science, George Washington University, Washington DC, USA
Mohamed A. S. A. A. Al-Badrashiny

Corresponding author

Correspondence to Mohamed Attia.

About this article

Cite this article

Attia, M., El-Mahallawy, M.S., Rashwan, M.A.A. et al. Omnifont text recognition of printed cursive scripts via HMMs, compact lossless features, and soft data clustering. Pattern Anal Applic 18, 507–521 (2015). https://doi.org/10.1007/s10044-014-0428-0

Received: 17 July 2012
Accepted: 20 October 2014
Published: 26 November 2014
Issue Date: August 2015
DOI: https://doi.org/10.1007/s10044-014-0428-0
