Leveraging Acoustic Images for Effective Self-supervised Audio Representation Learning

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12367)

Included in the following conference series: Computer Vision – ECCV 2020 (ECCV 2020)

Abstract

In this paper, we propose the use of a new modality characterized by a richer information content, namely acoustic images, for audio-visual scene understanding. Each pixel in such images is characterized by a spectral signature, associated with a specific direction in space and obtained by processing the audio signals coming from an array of microphones. By coupling such an array with a video camera, we obtain spatio-temporal alignment of acoustic images and video frames. This constitutes a powerful source of self-supervision, which can be exploited in the learning pipeline we are proposing, without resorting to expensive data annotations. However, since 2D planar arrays are cumbersome and not as widespread as ordinary microphones, we propose to distill, through a self-supervised learning scheme, the richer information content of acoustic images into more powerful audio and visual feature representations. The learnt representations can then be employed for downstream tasks such as classification and cross-modal retrieval, without the need for a microphone array. To prove this, we introduce a novel multimodal dataset consisting of RGB videos, raw audio signals and acoustic images, aligned in space and synchronized in time. Experimental results demonstrate the validity of our hypothesis and the effectiveness of the proposed pipeline, even when tested on tasks and datasets different from those used for training.
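
The abstract only summarizes how acoustic images are formed. As a rough, self-contained illustration of the idea that each pixel carries a spectral signature tied to a direction in space, the following NumPy sketch implements generic frequency-domain delay-and-sum beamforming under a plane-wave assumption. This is not the paper's method: the function name, the array geometry and all parameters below are made up for illustration, and the actual acquisition device relies on a considerably more sophisticated planar-array beamforming design.

    import numpy as np

    def acoustic_image(signals, mic_positions, directions, fs, c=343.0, n_fft=512):
        """Toy frequency-domain delay-and-sum beamformer (illustrative only).

        signals:       (M, T) time-domain signals, one row per microphone
        mic_positions: (M, 3) microphone coordinates in meters
        directions:    (H, W, 3) unit look-direction vectors, one per pixel
        Returns an (H, W, F) array: one magnitude spectral signature per pixel.
        """
        spectra = np.fft.rfft(signals, n=n_fft, axis=1)            # (M, F)
        freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)                 # (F,)
        h, w, _ = directions.shape
        image = np.empty((h, w, freqs.size))
        for i in range(h):
            for j in range(w):
                # plane-wave delay of each microphone w.r.t. the array origin
                tau = mic_positions @ directions[i, j] / c         # (M,)
                # phase-align the channels, then average them (delay and sum)
                steering = np.exp(2j * np.pi * np.outer(tau, freqs))  # (M, F)
                image[i, j] = np.abs((spectra * steering).mean(axis=0))
        return image

    # Usage with a made-up 4-microphone square array and a small pixel grid:
    fs = 12000
    mics = 0.05 * np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]])
    sig = np.random.randn(4, fs)                          # 1 s of noise per mic
    grid = np.stack(np.meshgrid(np.linspace(-0.5, 0.5, 48),
                                np.linspace(-0.4, 0.4, 36)) + [np.ones((36, 48))],
                    axis=-1)
    grid /= np.linalg.norm(grid, axis=-1, keepdims=True)  # unit look directions
    img = acoustic_image(sig, mics, grid, fs)             # shape (36, 48, 257)

Stacking such per-direction spectral signatures over a pixel grid yields an acoustic image that is spatially aligned with a co-located camera's frames, which is the property the paper exploits for self-supervision.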


Notes

  1. https://github.com/IIT-PAVIS/acoustic-images-self-supervision.


Author information

Correspondence to Valentina Sanguineti.

Electronic supplementary material

Below are the links to the electronic supplementary material.

Supplementary material 1 (pdf 14053 KB)

Supplementary material 2 (mp4 214 KB)

Supplementary material 3 (mp4 122 KB)

Supplementary material 4 (mp4 447 KB)

Supplementary material 5 (mp4 461 KB)

Supplementary material 6 (mp4 460 KB)

Supplementary material 7 (mp4 344 KB)

Supplementary material 8 (mp4 306 KB)

Supplementary material 9 (mp4 307 KB)


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Sanguineti, V., Morerio, P., Pozzetti, N., Greco, D., Cristani, M., Murino, V. (2020). Leveraging Acoustic Images for Effective Self-supervised Audio Representation Learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol. 12367. Springer, Cham. https://doi.org/10.1007/978-3-030-58542-6_8

  • DOI: https://doi.org/10.1007/978-3-030-58542-6_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58541-9

  • Online ISBN: 978-3-030-58542-6

  • eBook Packages: Computer Science, Computer Science (R0)
