Multimodal Fusion and Sequence Learning for Cued Speech Recognition from Videos

  • Conference paper

In: Universal Access in Human-Computer Interaction. Access to Media, Learning and Assistive Environments (HCII 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12769)

Abstract

Cued Speech (CS) constitutes a non-vocal mode of communication that relies on lip movements in conjunction with hand positional and gestural cues, in order to disambiguate phonetic information and make it accessible to the speech and hearing impaired. In this study, we address the automatic recognition of CS from videos, employing deep learning techniques and extending our earlier work on this topic as follows: First, for visual feature extraction, in addition to hand positioning embeddings and convolutional neural network-based appearance features of the mouth region and signing hand, we consider structural information of the hand and mouth articulators. Specifically, we utilize the OpenPose framework to extract 2D lip keypoints and hand skeletal coordinates of the signer, and we also infer 3D hand skeletal coordinates from the latter, exploiting our earlier work on 2D-to-3D hand-pose regression. Second, we modify the sequence learning model by adopting a time-depth separable (TDS) convolution block structure that encodes the fused visual features, in conjunction with a decoder based on connectionist temporal classification (CTC) for phonetic sequence prediction. We investigate the contribution of the above to CS recognition, evaluating our model on a French and a British English CS video dataset, and we report significant gains over the state of the art on both sets.
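
To make the described pipeline concrete, the sketch below illustrates the two ideas of the abstract in PyTorch: concatenative fusion of the per-frame visual streams (mouth and hand appearance features, 2D lip keypoints, 3D hand skeletal coordinates), a TDS convolutional encoder in the style of Hannun et al. (2019), and a CTC output layer. This is a minimal sketch, not the paper's implementation: all dimensions, the TDS configuration, and the 38-phone inventory are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TDSBlock(nn.Module):
    """Time-depth separable block (after Hannun et al., 2019): a 2D
    convolution over time only, then a position-wise fully connected
    sub-block, each with a residual connection and layer norm."""
    def __init__(self, channels: int, width: int, kernel_size: int = 5):
        super().__init__()
        self.channels, self.width = channels, width
        feat = channels * width
        # Kernel (k, 1): convolve along the time axis, not across "width".
        self.conv = nn.Conv2d(channels, channels,
                              kernel_size=(kernel_size, 1),
                              padding=(kernel_size // 2, 0))
        self.norm1 = nn.LayerNorm(feat)
        self.fc = nn.Sequential(nn.Linear(feat, feat), nn.ReLU(),
                                nn.Linear(feat, feat))
        self.norm2 = nn.LayerNorm(feat)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels * width)
        b, t, _ = x.shape
        y = x.view(b, t, self.width, self.channels).permute(0, 3, 1, 2)
        y = torch.relu(self.conv(y))                  # (b, c, t, w)
        y = y.permute(0, 2, 3, 1).reshape(b, t, -1)
        x = self.norm1(x + y)                         # temporal sub-block
        return self.norm2(x + self.fc(x))             # position-wise sub-block

class CuedSpeechRecognizer(nn.Module):
    """Fuse per-frame streams by concatenation, encode with TDS blocks,
    and emit per-frame phoneme log-posteriors for CTC decoding."""
    def __init__(self, stream_dims, num_phones,
                 channels=8, width=32, num_blocks=3):
        super().__init__()
        self.proj = nn.Linear(sum(stream_dims), channels * width)
        self.encoder = nn.Sequential(
            *[TDSBlock(channels, width) for _ in range(num_blocks)])
        self.head = nn.Linear(channels * width, num_phones + 1)  # +1 = blank

    def forward(self, streams):
        # streams: list of (batch, time, dim_i) tensors, e.g. mouth CNN
        # features, hand CNN features, 2D lip keypoints, 3D hand joints.
        x = torch.cat(streams, dim=-1)
        x = self.encoder(self.proj(x))
        return self.head(x).log_softmax(dim=-1)       # (batch, time, phones+1)

# Illustrative shapes only: 512-D mouth/hand CNN features, 40 lip
# coordinates (20 x/y points), 63 hand-joint coordinates (21 x/y/z points).
model = CuedSpeechRecognizer([512, 512, 40, 63], num_phones=38)
streams = [torch.randn(2, 100, d) for d in (512, 512, 40, 63)]
log_probs = model(streams).transpose(0, 1)            # CTCLoss wants (T, B, C)
targets = torch.randint(0, 38, (2, 30))
loss = nn.CTCLoss(blank=38)(log_probs, targets,
                            torch.full((2,), 100, dtype=torch.long),
                            torch.full((2,), 30, dtype=torch.long))
```

With this layout, phonetic sequences can be obtained from the per-frame posteriors by standard CTC decoding, e.g. greedy best-path or beam search.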

Acknowledgments

This research work was supported by the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the “1st Call for H.F.R.I. Research Projects to support Faculty Members & Researchers and the procurement of high-cost research equipment grant” (Project “SL-ReDu”, Project Number 2456).

Author information

Correspondence to Katerina Papadimitriou.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Papadimitriou, K., Parelli, M., Sapountzaki, G., Pavlakos, G., Maragos, P., Potamianos, G. (2021). Multimodal Fusion and Sequence Learning for Cued Speech Recognition from Videos. In: Antona, M., Stephanidis, C. (eds) Universal Access in Human-Computer Interaction. Access to Media, Learning and Assistive Environments. HCII 2021. Lecture Notes in Computer Science, vol 12769. Springer, Cham. https://doi.org/10.1007/978-3-030-78095-1_21

  • DOI: https://doi.org/10.1007/978-3-030-78095-1_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-78094-4

  • Online ISBN: 978-3-030-78095-1

  • eBook Packages: Computer Science (R0)
