Exploring the Effect of Vehicle Appearance and Motion for Natural Language-Based Vehicle Retrieval

Can, Quang-Huy; Nguyen, Hong-Quan; Do, Thi-Ngoc-Diep; Phan, Hoai; Nguyen, Thuy-Binh; Pham, Thi Thanh Thuy; Tran, Thanh-Hai; Le, Thi-Lan

doi:10.1007/978-981-19-8234-7_5

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1716)

Included in the following conference series:

Asian Conference on Intelligent Information and Database Systems

Abstract

Searching for vehicles in video by textual description is one of the most important tasks in traffic management for smart cities. This paper proposes a method for retrieving vehicles with a natural language query. The method has two main components: a textual feature extractor based on Bi-LSTM and a visual feature extractor based on ResNet-50. The two components extract hidden features from their respective modalities and match them in a common space. The pipeline is trained end to end to build a textual-visual alignment model that is then used in the search phase. The framework makes two contributions. In the video stream, we evaluate in detail the role of vehicle appearance compared with its motion. In the textual stream, we apply back-translation systems to enrich the textual dataset for the training phase. Experiments on AI City Challenge data show the effectiveness of each contribution within the overall framework. The results confirm that motion cues, in addition to appearance, are promising for vehicle retrieval: the method achieves MRR, Rank@5 and Rank@10 of 0.2333, 0.3587 and 0.4837, respectively.
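
The reported metrics can be illustrated with a minimal sketch. It assumes that text queries and vehicle tracks have already been embedded in the common space and that candidates are ranked by cosine similarity; the abstract does not state the similarity function, so that choice is an assumption. The names `cosine`, `rank_of_target`, `mrr` and `rank_at_k`, and the toy vectors, are hypothetical and not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed matching score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_of_target(query_vec, track_vecs, target_id):
    """1-based position of the ground-truth track when all tracks are
    sorted by descending similarity to the query embedding."""
    ordered = sorted(
        track_vecs,
        key=lambda tid: cosine(query_vec, track_vecs[tid]),
        reverse=True,
    )
    return ordered.index(target_id) + 1

def mrr(ranks):
    """Mean reciprocal rank: average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def rank_at_k(ranks, k):
    """Fraction of queries whose ground-truth track appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Toy embeddings in a 2-D common space (illustrative values only).
tracks = {"t1": [1.0, 0.0], "t2": [0.0, 1.0], "t3": [1.0, 1.0]}
queries = [([0.9, 0.1], "t1"), ([0.1, 0.9], "t3"), ([1.0, 1.0], "t3")]

ranks = [rank_of_target(q, tracks, target) for q, target in queries]
print("ranks:", ranks)
print("MRR:", round(mrr(ranks), 4))
print("Rank@1:", round(rank_at_k(ranks, 1), 4))
```

Under these definitions, an MRR of 0.2333 means that the average reciprocal rank of the correct track is 0.2333, and a Rank@5 of 0.3587 means that the correct track is among the first five results for about 36% of queries.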

Acknowledgement

This research is funded by the Vietnam MPS under grant number BCN. 2020. T01. 04.

Author information

Authors and Affiliations

School of Electrical and Electronics Engineering, Hanoi University of Science and Technology, Hanoi, Vietnam
Quang-Huy Can, Thi-Ngoc-Diep Do, Thanh-Hai Tran & Thi-Lan Le
Faculty of Information Technology, Viet-Hung Industrial University, Hanoi, Vietnam
Hong-Quan Nguyen
Faculty of Information Security, Academy of People Security, Hanoi, Vietnam
Hoai Phan & Thi Thanh Thuy Pham
Faculty of Electrical-Electronic Engineering, University of Transport and Communications, Hanoi, Vietnam
Thuy-Binh Nguyen

Corresponding author

Correspondence to Thi Thanh Thuy Pham.

Editor information

Editors and Affiliations

University of Newcastle Australia, Newcastle, NSW, Australia
Edward Szczerbicki
Wrocław University of Science and Technology, Wrocław, Poland
Krystian Wojtkiewicz
International University - VNU-HCM, Ho Chi Minh City, Vietnam
Sinh Van Nguyen
Wrocław University of Science and Technology, Wrocław, Poland
Marcin Pietranik
Wrocław University of Science and Technology, Wrocław, Poland
Marek Krótkiewicz

Cite this paper

Can, QH. et al. (2022). Exploring the Effect of Vehicle Appearance and Motion for Natural Language-Based Vehicle Retrieval. In: Szczerbicki, E., Wojtkiewicz, K., Nguyen, S.V., Pietranik, M., Krótkiewicz, M. (eds) Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2022. Communications in Computer and Information Science, vol 1716. Springer, Singapore. https://doi.org/10.1007/978-981-19-8234-7_5

DOI: https://doi.org/10.1007/978-981-19-8234-7_5
Published: 24 November 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-8233-0
Online ISBN: 978-981-19-8234-7
eBook Packages: Computer Science, Computer Science (R0)
