Exploring the Effect of Vehicle Appearance and Motion for Natural Language-Based Vehicle Retrieval

  • Conference paper
Recent Challenges in Intelligent Information and Database Systems (ACIIDS 2022)

Abstract

Searching for vehicles in video using textual descriptions is an important task in traffic management for smart cities. This paper proposes a method for retrieving vehicles with natural language queries. The method has two main components: a textual extractor based on Bi-LSTM and a visual extractor based on ResNet-50. Each component extracts hidden features from its own modality, and the features are then matched in a common space. This end-to-end process builds a textual-visual alignment model that is used in the search phase. The framework makes two contributions. In the video stream, we evaluate in detail the role of vehicle appearance compared with its motion. In the textual stream, we apply back-translation systems to enrich the textual data used for training. Experiments on the AI City Challenge show the effectiveness of each contribution within the overall framework. The results confirm that motion cues, in addition to appearance, are promising for vehicle retrieval: the method achieves an MRR of 0.2333, Rank@5 of 0.3587 and Rank@10 of 0.4837.
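The reported metrics can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine similarity used to rank candidates, the toy two-dimensional embeddings, the track ids, and all function names are assumptions made for illustration only. Rank@K is interpreted here as the fraction of queries whose target track appears among the top K results.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_tracks(query_vec: Sequence[float],
                track_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Return track ids ordered from most to least similar to the query."""
    return sorted(track_vecs,
                  key=lambda tid: cosine(query_vec, track_vecs[tid]),
                  reverse=True)


def mean_reciprocal_rank(rankings: List[List[str]], targets: List[str]) -> float:
    """Average of 1 / (position of the correct track) over all queries."""
    total = 0.0
    for ranking, target in zip(rankings, targets):
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(rankings)


def rank_at_k(rankings: List[List[str]], targets: List[str], k: int) -> float:
    """Fraction of queries whose correct track is within the top k results."""
    hits = sum(1 for ranking, target in zip(rankings, targets)
               if target in ranking[:k])
    return hits / len(rankings)


# Hypothetical embeddings already projected into a common space.
tracks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
queries = [[1.0, 0.1], [0.2, 1.0], [1.0, 0.0]]
targets = ["a", "c", "c"]  # ground-truth track for each query

rankings = [rank_tracks(q, tracks) for q in queries]
mrr = mean_reciprocal_rank(rankings, targets)
r1 = rank_at_k(rankings, targets, 1)
r2 = rank_at_k(rankings, targets, 2)
print(f"MRR={mrr:.4f} Rank@1={r1:.4f} Rank@2={r2:.4f}")
```

In this toy setup the correct track is ranked first for one query and second for the other two, so MRR is (1 + 1/2 + 1/2) / 3.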



Acknowledgement

This research is funded by the Vietnam MPS under grant number BCN.2020.T01.04.

Author information

Corresponding author

Correspondence to Thi Thanh Thuy Pham.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Can, QH. et al. (2022). Exploring the Effect of Vehicle Appearance and Motion for Natural Language-Based Vehicle Retrieval. In: Szczerbicki, E., Wojtkiewicz, K., Nguyen, S.V., Pietranik, M., Krótkiewicz, M. (eds) Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2022. Communications in Computer and Information Science, vol 1716. Springer, Singapore. https://doi.org/10.1007/978-981-19-8234-7_5

  • DOI: https://doi.org/10.1007/978-981-19-8234-7_5

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-8233-0

  • Online ISBN: 978-981-19-8234-7

  • eBook Packages: Computer Science, Computer Science (R0)
