Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous Translator

  • Conference paper
  • First Online:
Medical Image Computing and Computer Assisted Intervention – MICCAI 2022 (MICCAI 2022)

Abstract

Understanding the underlying relationship between the tongue and oropharyngeal muscle deformation seen in tagged-MRI and intelligible speech plays an important role in advancing speech motor control theories and treating speech-related disorders. Because of their heterogeneous representations, however, direct mapping between the two modalities (i.e., a two-dimensional, mid-sagittal-slice plus time tagged-MRI sequence and its corresponding one-dimensional waveform) is not straightforward. Instead, we resort to the two-dimensional spectrogram, which encodes both pitch and resonance, as an intermediate representation, and develop an end-to-end deep learning framework that translates a sequence of tagged-MRI into its corresponding audio waveform despite a limited dataset size. Our framework is based on a novel fully convolutional asymmetric translator guided by a self residual attention strategy that specifically exploits the moving muscular structures during speech. In addition, we leverage the pairwise correlation of samples with the same utterance via a latent-space representation disentanglement strategy. Furthermore, we incorporate adversarial training with generative adversarial networks to improve the realism of the generated spectrograms. Our experimental results, carried out on a total of 63 tagged-MRI sequences with paired speech acoustics, showed that our framework generates clear audio waveforms from tagged-MRI sequences and surpasses competing methods. Our framework thus holds great potential for better understanding the relationship between the two modalities.


Notes

  1. Link: The librosa library for audio to mel-spectrogram conversion.

  2. Link: Slicing a tensor in PyTorch.

  3. Link: The librosa library for reversing a mel-spectrogram to audio.


Acknowledgements

This work was supported by NIH grants R01DC014717, R01DC018511, and R01CA133015.

Author information

Corresponding author

Correspondence to Xiaofeng Liu.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, X. et al. (2022). Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous Translator. In: Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2022. MICCAI 2022. Lecture Notes in Computer Science, vol 13436. Springer, Cham. https://doi.org/10.1007/978-3-031-16446-0_36

  • DOI: https://doi.org/10.1007/978-3-031-16446-0_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16445-3

  • Online ISBN: 978-3-031-16446-0

  • eBook Packages: Computer Science (R0)
