PreTraM: Self-supervised Pre-training via Connecting Trajectory and Map

Xu, Chenfeng; Li, Tian; Tang, Chen; Sun, Lingfeng; Keutzer, Kurt; Tomizuka, Masayoshi; Fathi, Alireza; Zhan, Wei

doi:10.1007/978-3-031-19842-7_3

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13699)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Deep learning has recently achieved significant progress in trajectory forecasting. However, the scarcity of trajectory data inhibits data-hungry deep-learning models from learning good representations. While pre-training methods for representation learning exist in computer vision and natural language processing, they still require large-scale data. It is hard to replicate their success in trajectory forecasting because trajectory data is scarce (e.g., 34K samples in the nuScenes dataset). To work around this scarcity, we resort to another data modality closely related to trajectories: HD-maps, which are abundantly provided in existing datasets. In this paper, we propose PreTraM, a self-supervised Pre-training scheme via connecting Trajectories and Maps for trajectory forecasting. PreTraM consists of two parts: 1) Trajectory-Map Contrastive Learning, where we project trajectories and maps to a shared embedding space with cross-modal contrastive learning, and 2) Map Contrastive Learning, where we enhance map representation with contrastive learning on large quantities of HD-maps. On top of popular baselines such as AgentFormer and Trajectron++, PreTraM reduces their errors by a relative 5.5% and 6.9% on the nuScenes dataset. We show that PreTraM improves data efficiency and scales well with model size. Our code and pre-trained models will be released at https://github.com/chenfengxu714/PreTraM.
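
The abstract does not spell out the loss used for Trajectory-Map Contrastive Learning. As a rough illustration of what cross-modal contrastive learning in a shared embedding space can look like, the sketch below implements a symmetric InfoNCE objective over a batch of paired trajectory and map embeddings. This is an assumption, not the authors' implementation: the function names, the pairing convention (row i of each input comes from the same agent and scene) and the temperature value of 0.07 are all hypothetical choices made for the example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def trajectory_map_contrastive_loss(traj_emb, map_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Assumption: traj_emb[i] and map_emb[i] belong to the same agent/scene,
    and every other pairing in the batch is treated as a negative.
    The temperature default is a hypothetical value, not taken from the paper.
    """
    assert traj_emb and len(traj_emb) == len(map_emb)
    t = [l2_normalize(v) for v in traj_emb]
    m = [l2_normalize(v) for v in map_emb]
    n = len(t)
    # Cosine-similarity logits: rows index trajectories, columns index maps.
    logits = [
        [sum(a * b for a, b in zip(t[i], m[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Trajectory-to-map direction: each row should select its own column.
    loss_t2m = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Map-to-trajectory direction: each column should select its own row.
    loss_m2t = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_t2m + loss_m2t)


# Toy usage: correctly paired embeddings versus deliberately swapped ones.
aligned = trajectory_map_contrastive_loss(
    [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]
)
swapped = trajectory_map_contrastive_loss(
    [[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]
)
```

In this toy example the correctly paired batch yields a much smaller loss than the swapped batch, which is the behaviour a contrastive objective of this kind is meant to reward. In a real pipeline the embeddings would come from learned trajectory and map encoders, and the loss would be minimised with an autodiff framework rather than evaluated on hand-written vectors.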

C. Xu and T. Li—Equal contribution.

Acknowledgements

We sincerely thank Boris Ivanovic and Rowan McAllister for their help with the experiments related to Trajectron++. This work was sponsored by the Google-BAIR Commons program. Google also provided a generous donation of cloud compute credits through the Google-BAIR Commons program.

Author information

Authors and Affiliations

University of California, Berkeley, USA
Chenfeng Xu, Chen Tang, Lingfeng Sun, Kurt Keutzer, Masayoshi Tomizuka & Wei Zhan
University of California, San Diego, USA
Tian Li
Google Research, Mountain View, CA, USA
Alireza Fathi

Corresponding author

Correspondence to Chen Tang.

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 183 KB)

About this paper

Cite this paper

Xu, C. et al. (2022). PreTraM: Self-supervised Pre-training via Connecting Trajectory and Map. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13699. Springer, Cham. https://doi.org/10.1007/978-3-031-19842-7_3

DOI: https://doi.org/10.1007/978-3-031-19842-7_3
Published: 23 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19841-0
Online ISBN: 978-3-031-19842-7
eBook Packages: Computer Science, Computer Science (R0)
