
Modeling Cross-Modal Interaction in a Multi-detector, Multi-modal Tracking Framework

  • Conference paper
  • Computer Vision – ACCV 2020 (ACCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12623)


Abstract

Different modalities have their own advantages and disadvantages. In a tracking-by-detection framework, fusing data from multiple modalities should ideally improve tracking performance over using a single modality, yet doing so effectively remains a challenge. Building on previous research in this area, we propose a deep-learning-based tracking-by-detection pipeline that uses multiple detectors and multiple sensors. As input, we associate object proposals from 2D and 3D detectors. Through a cross-modal attention module, we optimize the interaction between the 2D RGB and 3D point-cloud features of each proposal. This produces 2D features with irrelevant information suppressed, which boosts performance. Through experiments on a published benchmark, we demonstrate the value of our design in bringing a multi-modal tracking solution to current research on Multi-Object Tracking (MOT).
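
The abstract describes a cross-modal attention module in which per-proposal 3D point-cloud features guide the suppression of irrelevant information in the corresponding 2D RGB features. The following PyTorch sketch shows one plausible form of such a module: a channel-wise gate on the 2D features conditioned on the 3D features. The layer choices and the feature dimensions dim_2d and dim_3d are illustrative assumptions, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class CrossModalAttention(nn.Module):
        """Hypothetical cross-modal attention gate (illustrative only):
        per-proposal 3D point-cloud features modulate the matching 2D RGB
        features channel-wise, damping irrelevant 2D information."""

        def __init__(self, dim_2d: int = 512, dim_3d: int = 256):
            super().__init__()
            self.proj_3d = nn.Linear(dim_3d, dim_2d)  # lift 3D features into the 2D feature space
            self.score = nn.Linear(dim_2d, dim_2d)    # per-channel attention scores

        def forward(self, feat_2d: torch.Tensor, feat_3d: torch.Tensor) -> torch.Tensor:
            # feat_2d: (N, dim_2d) RGB features, one row per associated proposal
            # feat_3d: (N, dim_3d) point-cloud features for the same proposals
            guidance = torch.tanh(self.proj_3d(feat_3d))           # (N, dim_2d)
            gates = torch.sigmoid(self.score(guidance * feat_2d))  # per-channel weights in [0, 1]
            return feat_2d * gates                                 # 2D features with irrelevant channels damped

    # Usage: fuse features for 8 associated 2D/3D proposal pairs.
    rgb = torch.randn(8, 512)
    pts = torch.randn(8, 256)
    fused = CrossModalAttention()(rgb, pts)  # shape: (8, 512)

Because the gate leaves the feature shape unchanged, the suppressed 2D features can feed an existing tracking-by-detection association step without further modification.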



Author information

Corresponding author

Correspondence to Yiqi Zhong.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Zhong, Y., You, S., Neumann, U. (2021). Modeling Cross-Modal Interaction in a Multi-detector, Multi-modal Tracking Framework. In: Ishikawa, H., Liu, C.L., Pajdla, T., Shi, J. (eds) Computer Vision – ACCV 2020. ACCV 2020. Lecture Notes in Computer Science, vol 12623. Springer, Cham. https://doi.org/10.1007/978-3-030-69532-3_41

  • DOI: https://doi.org/10.1007/978-3-030-69532-3_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-69531-6

  • Online ISBN: 978-3-030-69532-3

  • eBook Packages: Computer Science (R0)
