Omnifont text recognition of printed cursive scripts via HMMs, compact lossless features, and soft data clustering

  • Theoretical Advances
Pattern Analysis and Applications

Abstract

This paper presents an optical character/text recognition (OCR) system for cursive scripts such as Arabic, Urdu, Persian, and Kurdish. The system is large-scale in its architecture and training data size, and it achieves state-of-the-art performance. The paper gives the theoretical derivation and experimental assessment of the two main contributions deployed in this OCR. The first contribution is a new autonomously normalized, lossless, and compact feature vector that enables a truly robust omnifont OCR system for cursive scripts with an HMM-based architecture like that of automatic speech recognition (ASR). Half of the components of this feature vector are analog (continuous-valued) and the other half are discrete. This mix obstructs the use of continuous Gaussian mixtures to model an aggregate of such features and mandates discrete HMMs instead, which in turn require vector clustering and quantization. The second contribution is a new soft (i.e., probabilistic) vector quantization (VQ) scheme, as opposed to conventional hard-deciding VQ. We derive this scheme analytically and then deploy it to alleviate overfitting and to boost the robustness of our OCR against different kinds of obtrusive variance. We present experimental evidence of these benefits of the soft VQ scheme for our OCR. Other machine-learning systems with VQ modules may also deploy soft VQ to obtain the same benefits.
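
The contrast between hard-deciding and soft VQ can be illustrated with a minimal sketch. The abstract does not give the analytically derived scheme, so the weighting below (a softmax over negative squared Euclidean distances, with a hypothetical `temperature` parameter) is only an assumption chosen for illustration, and the function names, codebook, and input vector are likewise invented for the example.

```python
import math


def squared_distances(x, codebook):
    """Squared Euclidean distance from vector x to every centroid."""
    return [sum((a - b) ** 2 for a, b in zip(x, c)) for c in codebook]


def hard_vq(x, codebook):
    """Conventional hard-deciding VQ: return only the nearest centroid's index."""
    dists = squared_distances(x, codebook)
    return min(range(len(codebook)), key=dists.__getitem__)


def soft_vq(x, codebook, temperature=1.0):
    """Soft VQ: return one probability per centroid.

    Assumption (not from the paper): probabilities are a softmax over
    negative squared distances, sharpened or flattened by `temperature`.
    """
    dists = squared_distances(x, codebook)
    d_min = min(dists)  # subtract the minimum for numerical stability
    weights = [math.exp(-(d - d_min) / temperature) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical 2-D codebook and a feature vector lying between two centroids.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
x = (0.45, 0.05)

hard_index = hard_vq(x, codebook)                    # a single symbol
soft_probs = soft_vq(x, codebook, temperature=0.1)   # a distribution over symbols
```

Here the hard decision discards the fact that `x` is nearly as close to the second centroid as to the first, whereas the soft output keeps that information as probability mass, which a discrete HMM could then weigh across symbols.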

Notes

  1. This held true at least up to the submission of our manuscript to this journal in July 2012.

  2. To make sure that the difference between the WER with SLM enabled and the WER with SLM neutralized is not due to some random statistical perturbation, a statistical significance t test was run. It produced a p-value lower than the significance level of 0.001, which strongly rejects the hypothesis that this difference arose by chance and confirms the actual positive impact of enabling SLM in our HMM decoder.

  3. All the material presented in this section is independent of the algorithm used for inferring that codebook, e.g., k-means and LBG.

  4. Typically, any adaptive methodology for inferring the codebook works in the off-line phase on a sample population that is assumed to have the same statistical properties as the phenomenon being modeled.

  5. The generalization test dataset and the corresponding output of both versions are downloadable via http://www.rdi-eg.com/Soft_Hard_VQ_OCR_Generalization_Test_Data.RAR.

  6. To make sure that the differences between the WER with Soft VQ and the WER with Hard VQ are not due to some random statistical perturbation, statistical significance t tests were run. They produced p-values lower than the significance level of 0.05, which rejects the hypothesis that these differences arose by random chance.
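
Notes 3 and 4 refer to inferring the codebook off-line from a sample population, for example with k-means. A minimal sketch of that step, assuming plain Lloyd-style k-means with caller-supplied initial centroids, might look as follows; the function names and the toy sample set are hypothetical and are not taken from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans_codebook(samples, initial_centroids, iterations=20):
    """Infer a codebook off-line with plain k-means (Lloyd iterations)."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        # Assignment step: attach every sample to its nearest centroid.
        clusters = [[] for _ in centroids]
        for s in samples:
            nearest = min(
                range(len(centroids)),
                key=lambda k: squared_distance(s, centroids[k]),
            )
            clusters[nearest].append(s)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(dim) / len(members) for dim in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids


# Hypothetical sample population: two well-separated groups of 2-D vectors.
samples = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
           (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
codebook = kmeans_codebook(samples, initial_centroids=[samples[0], samples[3]])
```

Whatever algorithm produces the codebook, the quantizer that runs at recognition time only consumes the resulting list of centroids, which is why the quantization scheme can be discussed independently of the clustering method.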

Author information

Corresponding author

Correspondence to Mohamed Attia.

About this article

Cite this article

Attia, M., El-Mahallawy, M.S., Rashwan, M.A.A. et al. Omnifont text recognition of printed cursive scripts via HMMs, compact lossless features, and soft data clustering. Pattern Anal Applic 18, 507–521 (2015). https://doi.org/10.1007/s10044-014-0428-0

  • DOI: https://doi.org/10.1007/s10044-014-0428-0
