
Multiple Sound Sources Localization from Coarse to Fine

  • Conference paper
  • In: Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12365)

Abstract

Visually localizing multiple sound sources in unconstrained videos is a formidable problem, especially when pairwise sound-object annotations are unavailable. To solve this problem, we develop a two-stage audiovisual learning framework that disentangles audio and visual representations of different categories from complex scenes, then performs cross-modal feature alignment in a coarse-to-fine manner. Our model achieves state-of-the-art results on the public sound-localization dataset, as well as considerable performance on multi-source sound localization in complex scenes. We then employ the localization results for sound separation and obtain performance comparable to existing methods. These outcomes demonstrate our model's ability to effectively align sounds with specific visual sources. Code is available at https://github.com/shvdiwnkozbw/Multi-Source-Sound-Localization.
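The repository linked above contains the authors' full implementation. Purely to illustrate the coarse-to-fine alignment idea summarized in the abstract, here is a minimal PyTorch sketch; the function name, tensor shapes, and the specific contrastive and cosine loss terms are our own assumptions, not the paper's actual objective.

```python
# Hypothetical sketch of coarse-to-fine cross-modal alignment (not the authors' code).
# Assumes category-specific audio/visual features have already been disentangled.
import torch
import torch.nn.functional as F

def alignment_loss(audio_emb, visual_emb):
    """audio_emb, visual_emb: (batch, num_categories, dim) category-wise features."""
    # Coarse stage: align clip-level representations across the batch
    # (matched audio/visual pairs sit on the diagonal of the similarity matrix).
    a_coarse = F.normalize(audio_emb.mean(dim=1), dim=-1)   # (B, D)
    v_coarse = F.normalize(visual_emb.mean(dim=1), dim=-1)  # (B, D)
    logits = a_coarse @ v_coarse.t()                         # (B, B)
    targets = torch.arange(logits.size(0))
    coarse = F.cross_entropy(logits, targets)

    # Fine stage: pull together per-category features within each matched pair.
    a_fine = F.normalize(audio_emb, dim=-1)
    v_fine = F.normalize(visual_emb, dim=-1)
    fine = (1.0 - (a_fine * v_fine).sum(dim=-1)).mean()      # mean cosine distance

    return coarse + fine

# Toy usage with random features for 4 clips and 10 categories.
print(alignment_loss(torch.randn(4, 10, 128), torch.randn(4, 10, 128)))
```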


Notes

  1. We choose \(F_a\) and \(O_v\) for two reasons: they provide more fine-grained local features, and they make the training process easier.

  2. We find that directly using \(F_r\) with the weights \(W_r^c\) makes the alignment objective difficult to optimize; performing weighted pooling on \(E_r\) instead yields easier training and faster convergence (see the sketch after these notes).

  3. In practice, a threshold over all the class predictions is applied to select valid categories (also illustrated in the sketch after these notes).

  4. Since AudioSet only provides clip-level audio labels, we can only ensure that the labelled sounds appear somewhere in the clip. We therefore pair the whole 10-second audio clip with one randomly selected video frame.

  5. We filter out detected objects that are silent.
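As a concrete illustration of notes 2 and 3, the following hypothetical sketch shows weighted pooling of local features with per-class spatial weights, followed by threshold-based category selection. The tensor shapes, the names E_r and W_c (loosely following the paper's \(E_r\) and \(W_r^c\)), and the threshold value are assumptions for illustration only, not the released implementation.

```python
# Illustrative sketch of the weighted-pooling and threshold steps from notes 2 and 3.
import torch

def weighted_pool(E_r, W_c):
    """Pool local features E_r (B, D, H, W) with per-class spatial weights W_c (B, C, H, W)."""
    weights = W_c.flatten(2).softmax(dim=-1)             # normalize weights over spatial positions
    feats = E_r.flatten(2)                               # (B, D, H*W)
    # (B, C, H*W) x (B, H*W, D) -> (B, C, D): one pooled feature per class
    return torch.bmm(weights, feats.transpose(1, 2))

def select_valid_categories(class_probs, tau=0.5):
    """Keep the categories whose predicted probability exceeds a threshold tau."""
    return [torch.nonzero(p > tau).flatten().tolist() for p in class_probs]

pooled = weighted_pool(torch.randn(2, 64, 7, 7), torch.rand(2, 5, 7, 7))
print(pooled.shape)                                      # torch.Size([2, 5, 64])
print(select_valid_categories(torch.rand(2, 5), tau=0.7))
```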


Acknowledgement

This work is supported in part by the following grants: China Major Project for New Generation of AI (No. 2018AAA0100400) and the National Natural Science Foundation of China (No. 61971277, No. 61901265).

Author information

Corresponding author

Correspondence to Weiyao Lin.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2531 KB)

Supplementary material 2 (mp4 10754 KB)


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Qian, R., Hu, D., Dinkel, H., Wu, M., Xu, N., Lin, W. (2020). Multiple Sound Sources Localization from Coarse to Fine. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol. 12365. Springer, Cham. https://doi.org/10.1007/978-3-030-58565-5_18

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-58565-5_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58564-8

  • Online ISBN: 978-3-030-58565-5

  • eBook Packages: Computer Science, Computer Science (R0)
