Abstract
Scene text recognition remains a challenging problem in computer vision, even though it is already widely used in real-life applications. With the development of deep learning, recognition accuracy has improved continuously. The encoder-decoder architecture is currently the dominant framework for scene text recognition; with 2D attention in the decoder, the position of each character can be attended to more precisely. However, many encoder-decoder methods apply attention only in the decoder, which limits their ability to locate characters. To address this, we propose a Transformer-based encoder-decoder model with a two-stage attention mechanism for scene text recognition. In the encoder, a first-stage attention module that integrates spatial attention and channel attention captures the overall location of the text in the image; in the decoder, a second-stage attention module pinpoints the position of each character. This two-stage attention mechanism locates text more effectively and improves recognition accuracy. We also design a multi-branch feature fusion module for the encoder that fuses features from different receptive fields to obtain more robust features. We train the model on synthetic text datasets and evaluate it on real scene text datasets; the experimental results show that our model is highly competitive.
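The abstract does not spell out the internals of the first-stage attention module, so the following is only a minimal PyTorch sketch of one plausible reading: a CBAM-style block that applies channel attention followed by spatial attention to the encoder's feature map. All class and variable names (`ChannelAttention`, `SpatialAttention`, `FirstStageAttention`) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Re-weight feature channels from pooled global statistics."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        # Shared MLP over average- and max-pooled descriptors (B, C).
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        scale = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * scale  # broadcast over H, W


class SpatialAttention(nn.Module):
    """Highlight spatial positions likely to contain text."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Channel-wise average and max maps, each (B, 1, H, W).
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale  # broadcast over channels


class FirstStageAttention(nn.Module):
    """Channel attention followed by spatial attention (CBAM order)."""

    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))


feat = torch.randn(2, 64, 8, 25)  # (batch, channels, H, W) encoder feature map
out = FirstStageAttention(64)(feat)
print(out.shape)  # torch.Size([2, 64, 8, 25]) — shape is preserved
```

Because both attention maps are sigmoid-gated, the block only rescales activations and keeps the feature map's shape, so it can be dropped between encoder stages without changing the rest of the network.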
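The multi-branch feature fusion module is likewise only described at a high level ("fuse features from different receptive fields"), so the sketch below shows one common way to realize that idea in PyTorch: parallel convolutions with different kernel sizes whose outputs are concatenated and projected back with a 1x1 convolution. The class name `MultiBranchFusion` and the kernel sizes are assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn


class MultiBranchFusion(nn.Module):
    """Fuse features from several receptive fields: parallel convolutions
    with different kernel sizes, concatenated and projected by a 1x1 conv."""

    def __init__(self, channels, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2)
            for k in kernel_sizes
        )
        self.proj = nn.Conv2d(len(kernel_sizes) * channels, channels, 1)

    def forward(self, x):
        # Each branch sees the same input through a different receptive field.
        fused = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.proj(fused)


feat = torch.randn(2, 64, 8, 25)  # (batch, channels, H, W) feature map
fused = MultiBranchFusion(64)(feat)
print(fused.shape)  # torch.Size([2, 64, 8, 25])
```

Padding each branch by `k // 2` keeps all branch outputs the same spatial size, which is what makes the channel-wise concatenation valid.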
Cite this article
Xia, S., Kou, J., Liu, N. et al. Scene text recognition based on two-stage attention and multi-branch feature fusion module. Appl Intell 53, 14219–14232 (2023). https://doi.org/10.1007/s10489-022-04241-5