Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous Translator

  • Conference paper
  • First Online:
Medical Image Computing and Computer Assisted Intervention – MICCAI 2022 (MICCAI 2022)

Abstract

Understanding the underlying relationship between the tongue and oropharyngeal muscle deformation seen in tagged-MRI and intelligible speech plays an important role in advancing speech motor control theories and treating speech-related disorders. Because of their heterogeneous representations, however, direct mapping between the two modalities (i.e., a two-dimensional, mid-sagittal-slice plus time tagged-MRI sequence and its corresponding one-dimensional waveform) is not straightforward. Instead, we resort to the two-dimensional spectrogram, which encodes both pitch and resonance, as an intermediate representation, and develop an end-to-end deep learning framework that translates a sequence of tagged-MRI into its corresponding audio waveform despite a limited dataset size. Our framework is based on a novel fully convolutional asymmetric translator guided by a self residual attention strategy that specifically exploits the moving muscular structures during speech. In addition, we leverage the pairwise correlation of samples with the same utterance via a latent-space representation disentanglement strategy. Furthermore, we incorporate adversarial training with generative adversarial networks to improve the realism of the generated spectrograms. Our experimental results, carried out on a total of 63 tagged-MRI sequences with paired speech acoustics, showed that our framework generates clear audio waveforms from tagged-MRI sequences and surpasses competing methods. Our framework thus holds great potential for better understanding the relationship between the two modalities.


Notes

  1. Link: The librosa library for audio to mel-spectrogram conversion.

  2. Link: Slicing a tensor in PyTorch.

  3. Link: The librosa library for reversing a mel-spectrogram to audio.


Acknowledgements

This work was supported by NIH grants R01DC014717, R01DC018511, and R01CA133015.

Author information

Corresponding author

Correspondence to Xiaofeng Liu.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, X. et al. (2022). Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous Translator. In: Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2022. MICCAI 2022. Lecture Notes in Computer Science, vol 13436. Springer, Cham. https://doi.org/10.1007/978-3-031-16446-0_36

  • DOI: https://doi.org/10.1007/978-3-031-16446-0_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16445-3

  • Online ISBN: 978-3-031-16446-0

  • eBook Packages: Computer Science (R0)
