Abstract
Scene text recognition remains a challenging problem in computer vision, even though it is already widely used in real-life applications. With the development of deep learning, recognition accuracy has improved continuously. The encoder-decoder architecture is currently the dominant framework for scene text recognition; with 2D attention in the decoder, the position of each character can be attended to more precisely. However, many encoder-decoder methods apply attention only in the decoder, which limits their ability to locate characters. To address this, we propose a Transformer-based encoder-decoder model with a two-stage attention mechanism for scene text recognition. In the encoder, a first-stage attention module that integrates spatial attention and channel attention captures the overall location of the text in the image; in the decoder, a second-stage attention module pinpoints the position of each character. This two-stage attention mechanism locates text more effectively and improves recognition accuracy. We also design a multi-branch feature fusion module for the encoder that fuses features from different receptive fields to obtain more robust features. We train the model on synthetic text datasets and evaluate it on real scene text datasets; the experimental results show that our model is highly competitive.
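The abstract does not spell out the internals of the first-stage attention module, so the following is only a minimal PyTorch sketch of one plausible reading: a CBAM-style block that applies channel attention followed by spatial attention to the encoder's feature map. All class and variable names (`ChannelAttention`, `SpatialAttention`, `FirstStageAttention`) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Re-weight feature channels from pooled global statistics."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        # Shared MLP over average- and max-pooled descriptors (B, C).
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        scale = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * scale  # broadcast over H, W


class SpatialAttention(nn.Module):
    """Highlight spatial positions likely to contain text."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Channel-wise average and max maps, each (B, 1, H, W).
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale  # broadcast over channels


class FirstStageAttention(nn.Module):
    """Channel attention followed by spatial attention (CBAM order)."""

    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))


feat = torch.randn(2, 64, 8, 25)  # (batch, channels, H, W) encoder feature map
out = FirstStageAttention(64)(feat)
print(out.shape)  # torch.Size([2, 64, 8, 25]) — shape is preserved
```

Because both attention maps are sigmoid-gated, the block only rescales activations and keeps the feature map's shape, so it can be dropped between encoder stages without changing the rest of the network.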
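The multi-branch feature fusion module is likewise only described at a high level ("fuse features from different receptive fields"), so the sketch below shows one common way to realize that idea in PyTorch: parallel convolutions with different kernel sizes whose outputs are concatenated and projected back with a 1x1 convolution. The class name `MultiBranchFusion` and the kernel sizes are assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn


class MultiBranchFusion(nn.Module):
    """Fuse features from several receptive fields: parallel convolutions
    with different kernel sizes, concatenated and projected by a 1x1 conv."""

    def __init__(self, channels, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2)
            for k in kernel_sizes
        )
        self.proj = nn.Conv2d(len(kernel_sizes) * channels, channels, 1)

    def forward(self, x):
        # Each branch sees the same input through a different receptive field.
        fused = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.proj(fused)


feat = torch.randn(2, 64, 8, 25)  # (batch, channels, H, W) feature map
fused = MultiBranchFusion(64)(feat)
print(fused.shape)  # torch.Size([2, 64, 8, 25])
```

Padding each branch by `k // 2` keeps all branch outputs the same spatial size, which is what makes the channel-wise concatenation valid.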
Cite this article
Xia, S., Kou, J., Liu, N. et al. Scene text recognition based on two-stage attention and multi-branch feature fusion module. Appl Intell 53, 14219–14232 (2023). https://doi.org/10.1007/s10489-022-04241-5