
Scene text recognition based on two-stage attention and multi-branch feature fusion module

Published in: Applied Intelligence

Abstract

Recognizing text in natural scene images remains a challenging computer-vision task, even though it is already widely deployed in real-life applications. With the development of deep learning, the accuracy of scene text recognition has improved steadily. The encoder-decoder architecture is currently the dominant framework for scene text recognition: with 2D attention in the decoder, the position of each character can be attended to more precisely. However, many encoder-decoder methods apply an attention mechanism only in the decoder, which limits their ability to locate characters. To address this problem, we propose a Transformer-based encoder-decoder model with a two-stage attention mechanism for scene text recognition. In the encoder, a first-stage attention module that integrates spatial and channel attention captures the overall location of the text in the image; in the decoder, a second-stage attention module pinpoints the position of each character. This two-stage mechanism locates text more effectively and improves recognition accuracy. We also design a multi-branch feature fusion module for the encoder that fuses features from different receptive fields to obtain more robust representations. We train the model on synthetic text datasets and evaluate it on real scene text benchmarks; the experimental results show that our model is highly competitive.
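To illustrate the idea of an encoder-side attention module that integrates channel and spatial attention, the following is a minimal NumPy sketch in the spirit of CBAM-style attention. It is a simplified illustration only, not the paper's actual module: the real first-stage module presumably uses learned convolutions and MLPs, whereas this sketch derives the attention weights directly from average- and max-pooling.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x):
    # x: (C, H, W). Global average- and max-pool each channel,
    # combine, and squash to per-channel weights in (0, 1).
    avg = x.mean(axis=(1, 2))
    mx = x.max(axis=(1, 2))
    w = sigmoid(avg + mx)                 # shape (C,)
    return x * w[:, None, None]

def spatial_attention(x):
    # x: (C, H, W). Average- and max-pool across channels,
    # combine into an (H, W) map of weights in (0, 1).
    avg = x.mean(axis=0)
    mx = x.max(axis=0)
    w = sigmoid(avg + mx)                 # shape (H, W)
    return x * w[None, :, :]

def first_stage_attention(x):
    # Channel attention first, then spatial attention,
    # following the CBAM ordering.
    return spatial_attention(channel_attention(x))

feat = np.random.randn(8, 4, 16)          # toy feature map (C=8, H=4, W=16)
out = first_stage_attention(feat)
print(out.shape)                          # (8, 4, 16)
```

Because every attention weight lies in (0, 1), the module can only rescale (never amplify) each feature, which is one reason such gates are typically placed inside residual branches in practice.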



Author information

Corresponding author

Correspondence to Ningzhong Liu.


About this article


Cite this article

Xia, S., Kou, J., Liu, N. et al. Scene text recognition based on two-stage attention and multi-branch feature fusion module. Appl Intell 53, 14219–14232 (2023). https://doi.org/10.1007/s10489-022-04241-5

