Abstract
We present CATALIST, a model that ‘tames’ the attention heads of an attention-based scene text recognition model. While training the multi-head attention model, we supervise the attention masks at multiple levels, namely the line, word, and character levels. We demonstrate that such supervision improves training performance and testing accuracy. To train CATALIST and its attention masks, we also present a synthetic data generator, ALCHEMIST, that enables the synthetic creation of large scene-text video datasets along with mask information at the character, word, and line levels. We release a real scene-text dataset of 2k videos, \(\text {CATALIST}_\text {d}\), with videos of real scenes that potentially contain scene text in a combination of three different languages, namely English, Hindi, and Marathi. We record these videos using five types of camera transformations: (i) translation, (ii) roll, (iii) tilt, (iv) pan, and (v) zoom. The dataset and other useful resources are available as a documented public repository for use by the community.
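The five camera transformations named in the abstract can be illustrated as 3x3 homographies acting on image coordinates. The sketch below is not taken from ALCHEMIST; it assumes a pinhole camera with hypothetical intrinsics (focal length f, principal point cx, cy), models roll, tilt, and pan as pure rotations about the z, x, and y axes, and approximates translation and zoom as a pixel shift and a scaling about the principal point. All function names are illustrative.

```python
import numpy as np

def intrinsics(f, cx, cy):
    """Pinhole intrinsic matrix K (assumed: square pixels, no skew)."""
    return np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])

def rotation(axis, angle_deg):
    """Rotation matrix about the camera x, y, or z axis."""
    a = np.deg2rad(angle_deg)
    c, s = np.cos(a), np.sin(a)
    if axis == "x":  # tilt
        return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
    if axis == "y":  # pan
        return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    if axis == "z":  # roll
        return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    raise ValueError(f"unknown axis: {axis}")

def homography(kind, amount, K):
    """Homography for one camera transformation.

    kind: 'translation' (amount = (dx, dy) in pixels), 'zoom' (amount =
    scale factor), or 'roll' / 'tilt' / 'pan' (amount = angle in degrees).
    """
    if kind == "translation":
        dx, dy = amount
        return np.array([[1.0, 0.0, dx], [0.0, 1.0, dy], [0.0, 0.0, 1.0]])
    if kind == "zoom":
        cx, cy = K[0, 2], K[1, 2]
        # Scale about the principal point so the image centre stays fixed.
        return np.array([[amount, 0.0, (1 - amount) * cx],
                         [0.0, amount, (1 - amount) * cy],
                         [0.0, 0.0, 1.0]])
    axis = {"tilt": "x", "pan": "y", "roll": "z"}[kind]
    # A pure camera rotation R induces the image warp K R K^-1.
    return K @ rotation(axis, amount) @ np.linalg.inv(K)

def warp_points(H, pts):
    """Apply homography H to an (N, 2) array of pixel coordinates."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    out = homog @ H.T
    return out[:, :2] / out[:, 2:3]

# Usage: move the corners of a hypothetical text box through a short pan.
K = intrinsics(f=500.0, cx=320.0, cy=240.0)
box = [[300, 220], [340, 220], [340, 260], [300, 260]]
frames = [warp_points(homography("pan", angle, K), box)
          for angle in np.linspace(0.0, 10.0, 5)]
```

Warping the same box corners that define a character, word, or line mask keeps the mask aligned with the transformed frame, which is the property a generator of this kind needs.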
Notes
1. ALCHEMIST stands for synthetic video generation in order to tame Attention for Language (line, word, character, etc.) and other camera-CHangEs and coMbinatIons for Scene Text.
2. \(f_L\) represents the features used for producing line masks, \(f_w\) represents the features used for word masks, \(f_c\) represents the features used for character masks, and \(f_f\) represents the features used for free attention masks.
3. for the corresponding features \(f_L\), \(f_w\), \(f_c\), \(f_f\), etc.
4. For Devanagari (the script used for Hindi and Marathi), we consider the boxes at the level of joint glyphs instead of characters, since rendering characters individually (to obtain character-level text boxes) hampers the glyph substitution rules that form the joint glyphs in Devanagari.
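The split between supervised features (\(f_L\), \(f_w\), \(f_c\)) and free features (\(f_f\)) in the notes above can be sketched as a loss that penalises only the heads that have a ground-truth mask. This is a minimal illustration, not the loss used in the paper: the choice of binary cross-entropy, the dictionary layout, and the names bce and mask_supervision_loss are all assumptions.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a predicted mask in [0, 1]
    and a binary ground-truth mask (assumed loss, for illustration)."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

def mask_supervision_loss(attention, targets):
    """Sum the mask loss over the supervised levels only.

    attention: dict mapping a level name ('line', 'word', 'char', 'free')
               to a predicted attention mask.
    targets:   dict mapping supervised level names to ground-truth masks.
               Levels absent here (e.g. 'free') receive no mask loss.
    """
    return sum(bce(attention[level], target)
               for level, target in targets.items())

# Usage with hypothetical 8x8 masks: 'free' is predicted but unsupervised.
rng = np.random.default_rng(0)
attention = {name: rng.random((8, 8)) for name in ("line", "word", "char", "free")}
targets = {name: (rng.random((8, 8)) > 0.5).astype(float)
           for name in ("line", "word", "char")}
loss = mask_supervision_loss(attention, targets)
```

In training, a term of this kind would be added to the recognition loss, so that the supervised heads are pulled toward the line, word, and character regions while the free heads remain unconstrained.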
Acknowledgment
We thank Shubham Shukla for dataset collection and annotation efforts.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Sood, S., Saluja, R., Ramakrishnan, G., Chaudhuri, P. (2021). CATALIST: CAmera TrAnsformations for Multi-LIngual Scene Text Recognition. In: Barney Smith, E.H., Pal, U. (eds) Document Analysis and Recognition – ICDAR 2021 Workshops. ICDAR 2021. Lecture Notes in Computer Science, vol 12916. Springer, Cham. https://doi.org/10.1007/978-3-030-86198-8_16
DOI: https://doi.org/10.1007/978-3-030-86198-8_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86197-1
Online ISBN: 978-3-030-86198-8
eBook Packages: Computer Science (R0)