Domain Adaptive Video Segmentation via Temporal Pseudo Supervision

Xing, Yun; Guan, Dayan; Huang, Jiaxing; Lu, Shijian

doi:10.1007/978-3-031-20056-4_36

Yun Xing¹²,
Dayan Guan¹³,
Jiaxing Huang¹² &
…
Shijian Lu¹²

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13690))

Included in the following conference series:

European Conference on Computer Vision

2196 Accesses
4 Citations

Abstract

Video semantic segmentation has achieved great progress under the supervision of large amounts of labelled training data. However, domain adaptive video segmentation, which can mitigate data labelling constraints by adapting from a labelled source domain toward an unlabelled target domain, is largely neglected. We design temporal pseudo supervision (TPS), a simple and effective method that explores the idea of consistency training for learning effective representations from unlabelled target videos. Unlike traditional consistency training that builds consistency in spatial space, we explore consistency training in spatiotemporal space by enforcing model consistency across augmented video frames which helps learn from more diverse target data. Specifically, we design cross-frame pseudo labelling to provide pseudo supervision from previous video frames while learning from the augmented current video frames. The cross-frame pseudo labelling encourages the network to produce high-certainty predictions, which facilitates consistency training with cross-frame augmentation effectively. Extensive experiments over multiple public datasets show that TPS is simpler to implement, much more stable to train, and achieves superior video segmentation accuracy as compared with the state-of-the-art. Code is available at https://github.com/xing0047/TPS.

Y. Zing and D. Guan—Equal Contribution.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Araslanov, N., Roth, S.: Self-supervised augmentation consistency for adapting semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15384–15394 (2021)
Google Scholar
Badrinarayanan, V., Galasso, F., Cipolla, R.: Label propagation in video sequences. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3265–3272. IEEE (2010)
Google Scholar
Berthelot, D., et al.: Remixmatch: semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785 (2019)
Brostow, G.J., Shotton, J., Fauqueur, J., Cipolla, R.: Segmentation and recognition using structure from motion point clouds. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5302, pp. 44–57. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-88682-2_5
Chapter Google Scholar
Budvytis, I., Sauer, P., Roddick, T., Breen, K., Cipolla, R.: Large scale labelled video data augmentation for semantic segmentation in driving scenarios. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 230–237 (2017)
Google Scholar
Chen, C., et al.: Progressive feature alignment for unsupervised domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 627–636 (2019)
Google Scholar
Chen, L.C., et al.: Naive-student: leveraging semi-supervised learning in video sequences for urban scene segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12354, pp. 695–714. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58545-7_40
Chapter Google Scholar
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017)
Article Google Scholar
Chen, M., Xue, H., Cai, D.: Domain adaptation for semantic segmentation with maximum squares loss. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2090–2099 (2019)
Google Scholar
Chen, X., Yuan, Y., Zeng, G., Wang, J.: Semi-supervised semantic segmentation with cross pseudo supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2613–2622 (2021)
Google Scholar
Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223 (2016)
Google Scholar
Couprie, C., Farabet, C., LeCun, Y., Najman, L.: Causal graph-based video segmentation. In: 2013 IEEE International Conference on Image Processing. IEEE, pp. 4249–4253 (2013)
Google Scholar
Ding, M., Wang, Z., Zhou, B., Shi, J., Lu, Z., Luo, P.: Every frame counts: joint learning of video segmentation and optical flow. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34, pp. 10713–10720 (2020)
Google Scholar
Dosovitskiy, A., et al.: Flownet: learning optical flow with convolutional networks. In: Proceedings of the IEEE international Conference on Computer Vision, pp. 2758–2766 (2015)
Google Scholar
Floros, G., Leibe, B.: Joint 2D–3D temporally consistent semantic segmentation of street scenes. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp. 2823–2830. (2012)
Google Scholar
French, G., Mackiewicz, M., Fisher, M.: Self-ensembling for visual domain adaptation. arXiv preprint arXiv:1706.05208 (2017)
Gadde, R., Jampani, V., Gehler, P.V.: Semantic video CNNs through representation warping. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)
Google Scholar
Guan, D., Huang, J., Lu, S., Xiao, A.: Scale variance minimization for unsupervised domain adaptation in image segmentation. Pattern Recogn. 112, 107764 (2021)
Article Google Scholar
Guan, D., Huang, J., Xiao, A., Lu, S.: Domain adaptive video segmentation via temporal consistency regularization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8053–8064 (2021)
Google Scholar
Guan, D., Huang, J., Xiao, A., Lu, S.: Unbiased subclass regularization for semi-supervised semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9968–9978 (2022)
Google Scholar
Guan, D., Huang, J., Xiao, A., Lu, S., Cao, Y.: Uncertainty-aware unsupervised domain adaptation in object detection. IEEE Trans. Multimedia 24, 2502–2514 (2021)
Article Google Scholar
Hernandez-Juarez, D., et al.: Slanted Stixels: representing San Francisco’s steepest streets. arXiv preprint arXiv:1707.05397 (2017)
Hoffman, J., et al.: CyCADA: cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213 (2017)
Hu, P., Caba, F., Wang, O., Lin, Z., Sclaroff, S., Perazzi, F.: Temporally distributed networks for fast video semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8818–8827 (2020)
Google Scholar
Huang, J., Guan, D., Xiao, A., Lu, S.: Cross-view regularization for domain adaptive panoptic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10133–10144 (2021)
Google Scholar
Huang, J., Guan, D., Xiao, A., Lu, S.: Model adaptation: historical contrastive learning for unsupervised domain adaptation without source data. In: Advances in Neural Information Processing Systems, vol. 34, pp. 3635–3649 (2021)
Google Scholar
Huang, J., Guan, D., Xiao, A., Lu, S.: RDA: robust domain adaptation via Fourier adversarial attacking. arXiv preprint arXiv:2106.02874 (2021)
Huang, J., Guan, D., Xiao, A., Lu, S.: Multi-level adversarial network for domain adaptive semantic segmentation. Pattern Recogn. 123, 108384 (2022)
Article Google Scholar
Huang, J., Guan, D., Xiao, A., Lu, S., Shao, L.: Category contrast for unsupervised domain adaptation in visual tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1203–1214 (2022)
Google Scholar
Huang, J., Lu, S., Guan, D., Zhang, X.: Contextual-relation consistent domain adaptation for semantic segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12360, pp. 705–722. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58555-6_42
Chapter Google Scholar
Huang, P.Y., Hsu, W.T., Chiu, C.Y., Wu, T.F., Sun, M.: Efficient uncertainty estimation for semantic segmentation in videos. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 520–535 (2018)
Google Scholar
Hur, J., Roth, S.: Joint optical flow and temporally consistent semantic segmentation. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9913, pp. 163–177. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46604-0_12
Chapter Google Scholar
Jabri, A., Owens, A., Efros, A.A.: Space-time correspondence as a contrastive random walk. In: Advances in Neural Information Processing Systems (2020)
Google Scholar
Jain, S., Wang, X., Gonzalez, J.E.: Accel: a corrective fusion network for efficient semantic segmentation on video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8866–8875 (2019)
Google Scholar
Kim, D., Woo, S., Lee, J.Y., Kweon, I.S.: Video panoptic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9859–9868 (2020)
Google Scholar
Kim, M., Byun, H.: Learning texture invariant representation for domain adaptation of semantic segmentation. arXiv preprint arXiv:2003.00867 (2020)
Kumar, A., et al.: Co-regularized alignment for unsupervised domain adaptation. In: Advances in Neural Information Processing Systems, vol. 31 (2018)
Google Scholar
Kundu, A., Vineet, V., Koltun, V.: Feature space optimization for semantic video segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3168–3175 (2016)
Google Scholar
Lai, Z., Lu, E., Xie, W.: MAST: a memory-augmented self-supervised tracker. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6479–6488 (2020)
Google Scholar
Laine, S., Aila, T.: Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242 (2016)
Li, Y., Yuan, L., Vasconcelos, N.: Bidirectional learning for domain adaptation of semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6936–6945 (2019)
Google Scholar
Lian, Q., Lv, F., Duan, L., Gong, B.: Constructing self-motivated pyramid curriculums for cross-domain semantic segmentation: a non-adversarial approach. In: The IEEE International Conference on Computer Vision (ICCV) (2019)
Google Scholar
Liu, B., He, X.: Multiclass semantic video segmentation with object-level active inference. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4286–4294 (2015)
Google Scholar
Liu, Y., Shen, C., Yu, C., Wang, J.: Efficient semantic video segmentation with per-frame inference. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12355, pp. 352–368. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58607-2_21
Chapter Google Scholar
Luo, Y., Liu, P., Guan, T., Yu, J., Yang, Y.: Significance-aware information bottleneck for domain adaptive semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6778–6787 (2019)
Google Scholar
Maaten, L.v.d., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
Google Scholar
Mei, K., Zhu, C., Zou, J., Zhang, S.: Instance adaptive self-training for unsupervised domain adaptation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12371, pp. 415–430. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58574-7_25
Chapter Google Scholar
Melas-Kyriazi, L., Manrai, A.K.: PixMatch: unsupervised domain adaptation via pixelwise consistency training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12435–12445 (2021)
Google Scholar
Miksik, O., Munoz, D., Bagnell, J.A., Hebert, M.: Efficient temporal consistency for streaming video scene analysis. In: ICRA. IEEE, pp. 133–139. (2013)
Google Scholar
Miyato, T., Maeda, S.i., Koyama, M., Ishii, S.: Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 41(8), 1979–1993 (2018)
Google Scholar
Mustikovela, S.K., Yang, M.Y., Rother, C.: Can ground truth label propagation from video help semantic segmentation? In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 804–820. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49409-8_66
Chapter Google Scholar
Nilsson, D., Sminchisescu, C.: Semantic video segmentation by gated recurrent flow propagation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6819–6828 (2018)
Google Scholar
Ouali, Y., Hudelot, C., Tami, M.: Semi-supervised semantic segmentation with cross-consistency training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12674–12684 (2020)
Google Scholar
Pan, F., Shin, I., Rameau, F., Lee, S., Kweon, I.S.: Unsupervised intra-domain adaptation for semantic segmentation through self-supervision. arXiv preprint arXiv:2004.07703 (2020)
Patraucean, V., Handa, A., Cipolla, R.: Spatio-temporal video autoencoder with differentiable memory. arXiv preprint arXiv:1511.06309 (2015)
Richter, S.R., Hayder, Z., Koltun, V.: Playing for benchmarks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2213–2222 (2017)
Google Scholar
Ros, G., Sellart, L., Materzynska, J., Vazquez, D., Lopez, A.M.: The synthia dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3234–3243 (2016)
Google Scholar
Sajjadi, M., Javanmardi, M., Tasdizen, T.: Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In: Advances in Neural Information Processing Systems, vol. 29, pp. 1163–1171 (2016)
Google Scholar
Shelhamer, E., Rakelly, K., Hoffman, J., Darrell, T.: Clockwork convnets for video semantic segmentation. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 852–868. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49409-8_69
Chapter Google Scholar
Sohn, K.,et al.: FixMatch: simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685 (2020)
Tokmakov, P., Alahari, K., Schmid, C.: Weakly-supervised semantic segmentation using motion cues. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 388–404. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_24
Chapter Google Scholar
Tranheden, W., Olsson, V., Pinto, J., Svensson, L.: DACS: domain adaptation via cross-domain mixed sampling. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1379–1389 (2021)
Google Scholar
Tsai, Y.H., Hung, W.C., Schulter, S., Sohn, K., Yang, M.H., Chandraker, M.: Learning to adapt structured output space for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7472–7481 (2018)
Google Scholar
Tsai, Y.H., Sohn, K., Schulter, S., Chandraker, M.: Domain adaptation for structured output via discriminative patch representations. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1456–1465 (2019)
Google Scholar
Vu, T.H., Jain, H., Bucher, M., Cord, M., Pérez, P.: Advent: adversarial entropy minimization for domain adaptation in semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2517–2526 (2019)
Google Scholar
Wang, X., Jabri, A., Efros, A.A.: Learning correspondence from the cycle-consistency of time. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2566–2576 (2019)
Google Scholar
Xiao, A., Huang, J., Guan, D., Zhan, F., Lu, S.: Transfer learning from synthetic to real lidar point cloud for semantic segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 2795–2803 (2022)
Google Scholar
Xie, Q., Dai, Z., Hovy, E., Luong, T., Le, Q.: Unsupervised data augmentation for consistency training. In: Advances in Neural Information Processing Systems, vol. 33, pp. 6256–6268 (2020)
Google Scholar
Yang, Y., Soatto, S.: FDA: Fourier domain adaptation for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4085–4095 (2020)
Google Scholar
Zhang, P., Zhang, B., Zhang, T., Chen, D., Wang, Y., Wen, F.: Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12414–12424 (2021)
Google Scholar
Zheng, Z., Yang, Y.: Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation. Int. J. Comput. Vis. 129, 1–15 (2021). https://doi.org/10.1007/s11263-020-01395-y
Article Google Scholar
Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2349–2358 (2017)
Google Scholar
Zhu, Y., et al.: Improving semantic segmentation via video propagation and label relaxation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8856–8865 (2019)
Google Scholar
Zou, Y., Yu, Z., Liu, X., Kumar, B., Wang, J.: Confidence regularized self-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5982–5991 (2019)
Google Scholar
Zou, Y., Yu, Z., Vijaya Kumar, B., Wang, J.: Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 289–305 (2018)
Google Scholar

Download references

Acknowledgement

This study is supported under the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from Singapore Telecommunications Limited (Singtel), through Singtel Cognitive and Artificial Intelligence Lab for Enterprises.

Author information

Authors and Affiliations

Nanyang Technological University, Singapore, Singapore
Yun Xing, Jiaxing Huang & Shijian Lu
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Dayan Guan

Authors

Yun Xing
View author publications
You can also search for this author in PubMed Google Scholar
Dayan Guan
View author publications
You can also search for this author in PubMed Google Scholar
Jiaxing Huang
View author publications
You can also search for this author in PubMed Google Scholar
Shijian Lu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Shijian Lu .

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 840 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Xing, Y., Guan, D., Huang, J., Lu, S. (2022). Domain Adaptive Video Segmentation via Temporal Pseudo Supervision. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13690. Springer, Cham. https://doi.org/10.1007/978-3-031-20056-4_36

Download citation

DOI: https://doi.org/10.1007/978-3-031-20056-4_36
Published: 03 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20055-7
Online ISBN: 978-3-031-20056-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Domain Adaptive Video Segmentation via Temporal Pseudo Supervision