Handwritten multilingual word segmentation using polygonal approximation of digital curves for Indian languages

Gupta, Deepika; Bag, Soumen

doi:10.1007/s11042-019-7286-0


Published: 11 February 2019

Volume 78, pages 19361–19386, (2019)

Multimedia Tools and Applications



Abstract

Multilingual optical character recognition (OCR) is difficult to develop because different languages exhibit different writing and structural characteristics, which makes it hard to generalize their segmentation process. Character segmentation plays an important role in OCR for handwritten text, and its accuracy is a decisive factor in overall OCR performance. In this paper, we address this limitation and propose an approach, based on polygonal approximation of the word, that works for more than one Indian language. The approach performs script-independent character segmentation of handwritten text using basic structural properties of the languages. Digitally straight line segments (DSS) of the word are obtained by applying polygonal approximation to the word. The character segmentation is language independent and also works well on skewed words. Experiments are carried out on four popular Indian languages: Hindi, Marathi, Punjabi, and Bangla. The average success rate for character segmentation across the four languages is 90.07%, which is satisfactory compared with existing methods. For character recognition, we use shadow and cumulative stretch feature sets with random forest, support vector machine (SVM), multi-layer perceptron (MLP), and convolutional neural network (CNN) classifiers. Experiments show that the proposed method achieves good accuracy for both character segmentation and recognition.
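As a rough illustration of what polygonal approximation of a digital curve means, the sketch below reduces a pixel-level stroke to a few straight segments. The abstract does not describe the authors' algorithm in detail, so this uses the generic Ramer-Douglas-Peucker method as a stand-in; the function names, the sample stroke, and the tolerance value are all assumptions made for illustration.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def approximate_polygon(curve, tolerance):
    """Ramer-Douglas-Peucker approximation (a stand-in, not the paper's method).

    Keeps only the vertices that deviate from the chord between the
    end points by more than `tolerance` pixels.
    """
    if len(curve) < 3:
        return list(curve)
    first, last = curve[0], curve[-1]
    index, max_dist = 0, 0.0
    for i in range(1, len(curve) - 1):
        d = point_segment_distance(curve[i], first, last)
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= tolerance:
        return [first, last]
    left = approximate_polygon(curve[:index + 1], tolerance)
    right = approximate_polygon(curve[index:], tolerance)
    return left[:-1] + right  # drop the duplicated split vertex


# Hypothetical L-shaped stroke traced pixel by pixel.
stroke = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
vertices = approximate_polygon(stroke, tolerance=0.5)
segments = list(zip(vertices, vertices[1:]))
print(vertices)
print(segments)
```

Each consecutive pair of vertices is one straight segment of the stroke. A segmentation method could then inspect the orientation and position of such segments, although how the paper does this is not stated in the abstract.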



Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, 826004, India
Deepika Gupta & Soumen Bag


Corresponding author

Correspondence to Deepika Gupta.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Gupta, D., Bag, S. Handwritten multilingual word segmentation using polygonal approximation of digital curves for Indian languages. Multimed Tools Appl 78, 19361–19386 (2019). https://doi.org/10.1007/s11042-019-7286-0


Received: 13 July 2018
Revised: 11 January 2019
Accepted: 27 January 2019
Published: 11 February 2019
Issue Date: 30 July 2019
DOI: https://doi.org/10.1007/s11042-019-7286-0

