Abstract
This paper presents an optical character/text recognition (OCR) system for cursive scripts such as Arabic, Urdu, Persian, and Kurdish. The system is large-scale in its architecture and training data size, and it delivers state-of-the-art performance. The paper introduces the theoretical derivation and experimental assessment of our two main contributions deployed in this OCR. The first contribution is the design of a new autonomously normalized, lossless, and compact feature vector that enables the production of a truly robust omnifont OCR system for cursive scripts with an ASR-like HMM-based architecture. Half of the components in this feature vector are analog and the other half are discrete. This mix obstructs the use of continuous Gaussian mixtures to model an aggregate of such features and mandates the use of discrete HMMs instead, which in turn necessitates the deployment of vector clustering and quantization. The second contribution is a new soft (i.e., probabilistic) vector quantization (VQ) scheme, in contrast to conventional hard-deciding VQ. We analytically derive this scheme and then deploy it to alleviate overfitting and boost the robustness of our OCR against different kinds of obtrusive variances. We present experimental evidence of these benefits of the presented soft VQ scheme to our OCR. Other machine-learning systems with VQ modules may also deploy soft VQ to obtain the same benefits.
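The contrast between hard-deciding and soft VQ described above can be illustrated with a minimal sketch. This is not the scheme derived in the paper, whose derivation is not part of this excerpt. It assumes, purely for illustration, that the soft assignment is a softmax over negative squared Euclidean distances; the function names, the `temperature` parameter, and the toy codebook are all hypothetical.

```python
import math

def squared_distance(x, c):
    """Squared Euclidean distance between a feature vector and a centroid."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def hard_vq(x, codebook):
    """Hard-deciding VQ: return the index of the single nearest centroid."""
    distances = [squared_distance(x, c) for c in codebook]
    return min(range(len(codebook)), key=distances.__getitem__)

def soft_vq(x, codebook, temperature=1.0):
    """Soft VQ (assumed form): return one probability per centroid.

    The softmax over negative squared distances is an illustrative
    assumption, not the formula derived in the paper.
    """
    distances = [squared_distance(x, c) for c in codebook]
    d_min = min(distances)  # shift by the minimum for numerical stability
    weights = [math.exp(-(d - d_min) / temperature) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

# Toy codebook and feature vector (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
x = (0.4, 0.1)

hard_index = hard_vq(x, codebook)                    # one codeword index
soft_probs = soft_vq(x, codebook, temperature=0.5)   # a distribution over codewords
```

Under this assumed form, lowering `temperature` concentrates the probability mass on the nearest centroid, so hard VQ appears as the limiting case. A discrete HMM could then weight its emission probabilities by `soft_probs` rather than committing to a single codeword.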
Notes
This held true at least up to the time we submitted our manuscript to this journal in July 2012.
To make sure that the difference between the WER with SLM enabled and the WER with SLM neutralized is not due to some random statistical perturbation, a statistical significance t test was run. It produced a p-value lower than the significance level of 0.001, which strongly rejects the hypothesis that this difference arose by chance and confirms the actual positive impact of enabling SLM in our HMM decoder.
All the material presented in this section is independent of the algorithm used for inferring that codebook, e.g., k-means or LBG.
Typically, any adaptive methodology for inferring the codebook works in the off-line phase on a sample population that is assumed to have the same statistical properties as the phenomenon being modeled.
The generalization test dataset and the corresponding output of both versions are downloadable via http://www.rdi-eg.com/Soft_Hard_VQ_OCR_Generalization_Test_Data.RAR.
To make sure that the differences between the WER with soft VQ and the WER with hard VQ are not due to some random statistical perturbation, statistical significance t tests were run. They produced p-values lower than the significance level of 0.05, which rejects the hypothesis that these differences arose by random chance.
Cite this article
Attia, M., El-Mahallawy, M.S., Rashwan, M.A.A. et al. Omnifont text recognition of printed cursive scripts via HMMs, compact lossless features, and soft data clustering. Pattern Anal Applic 18, 507–521 (2015). https://doi.org/10.1007/s10044-014-0428-0
DOI: https://doi.org/10.1007/s10044-014-0428-0