Multimodal Fusion and Sequence Learning for Cued Speech Recognition from Videos

  • Conference paper

In: Universal Access in Human-Computer Interaction. Access to Media, Learning and Assistive Environments (HCII 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12769)

Abstract

Cued Speech (CS) constitutes a non-vocal mode of communication that relies on lip movements in conjunction with hand positional and gestural cues, in order to disambiguate phonetic information and make it accessible to the speech and hearing impaired. In this study, we address the automatic recognition of CS from videos, employing deep learning techniques and extending our earlier work on this topic as follows: First, for visual feature extraction, in addition to hand positioning embeddings and convolutional neural network-based appearance features of the mouth region and signing hand, we consider structural information of the hand and mouth articulators. Specifically, we utilize the OpenPose framework to extract 2D lip keypoints and hand skeletal coordinates of the signer, and we also infer 3D hand skeletal coordinates from the latter, exploiting our earlier work on 2D-to-3D hand-pose regression. Second, we modify the sequence learning model by adopting a time-depth separable (TDS) convolution block structure that encodes the fused visual features, in conjunction with a decoder based on connectionist temporal classification (CTC) for phonetic sequence prediction. We investigate the contribution of the above to CS recognition, evaluating our model on a French and a British English CS video dataset, and we report significant gains over the state of the art on both sets.
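
To make the described pipeline concrete, the sketch below illustrates the two ideas of the abstract in PyTorch: concatenative fusion of the per-frame visual streams (mouth and hand appearance features, 2D lip keypoints, 3D hand skeletal coordinates), a TDS convolutional encoder in the style of Hannun et al. (2019), and a CTC output layer. This is a minimal sketch, not the paper's implementation: all dimensions, the TDS configuration, and the 38-phone inventory are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TDSBlock(nn.Module):
    """Time-depth separable block (after Hannun et al., 2019): a 2D
    convolution over time only, then a position-wise fully connected
    sub-block, each with a residual connection and layer norm."""
    def __init__(self, channels: int, width: int, kernel_size: int = 5):
        super().__init__()
        self.channels, self.width = channels, width
        feat = channels * width
        # Kernel (k, 1): convolve along the time axis, not across "width".
        self.conv = nn.Conv2d(channels, channels,
                              kernel_size=(kernel_size, 1),
                              padding=(kernel_size // 2, 0))
        self.norm1 = nn.LayerNorm(feat)
        self.fc = nn.Sequential(nn.Linear(feat, feat), nn.ReLU(),
                                nn.Linear(feat, feat))
        self.norm2 = nn.LayerNorm(feat)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels * width)
        b, t, _ = x.shape
        y = x.view(b, t, self.width, self.channels).permute(0, 3, 1, 2)
        y = torch.relu(self.conv(y))                  # (b, c, t, w)
        y = y.permute(0, 2, 3, 1).reshape(b, t, -1)
        x = self.norm1(x + y)                         # temporal sub-block
        return self.norm2(x + self.fc(x))             # position-wise sub-block

class CuedSpeechRecognizer(nn.Module):
    """Fuse per-frame streams by concatenation, encode with TDS blocks,
    and emit per-frame phoneme log-posteriors for CTC decoding."""
    def __init__(self, stream_dims, num_phones,
                 channels=8, width=32, num_blocks=3):
        super().__init__()
        self.proj = nn.Linear(sum(stream_dims), channels * width)
        self.encoder = nn.Sequential(
            *[TDSBlock(channels, width) for _ in range(num_blocks)])
        self.head = nn.Linear(channels * width, num_phones + 1)  # +1 = blank

    def forward(self, streams):
        # streams: list of (batch, time, dim_i) tensors, e.g. mouth CNN
        # features, hand CNN features, 2D lip keypoints, 3D hand joints.
        x = torch.cat(streams, dim=-1)
        x = self.encoder(self.proj(x))
        return self.head(x).log_softmax(dim=-1)       # (batch, time, phones+1)

# Illustrative shapes only: 512-D mouth/hand CNN features, 40 lip
# coordinates (20 x/y points), 63 hand-joint coordinates (21 x/y/z points).
model = CuedSpeechRecognizer([512, 512, 40, 63], num_phones=38)
streams = [torch.randn(2, 100, d) for d in (512, 512, 40, 63)]
log_probs = model(streams).transpose(0, 1)            # CTCLoss wants (T, B, C)
targets = torch.randint(0, 38, (2, 30))
loss = nn.CTCLoss(blank=38)(log_probs, targets,
                            torch.full((2,), 100, dtype=torch.long),
                            torch.full((2,), 30, dtype=torch.long))
```

With this layout, phonetic sequences can be obtained from the per-frame posteriors by standard CTC decoding, e.g. greedy best-path or beam search.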

Acknowledgments

This research work was supported by the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the “1st Call for H.F.R.I. Research Projects to support Faculty Members & Researchers and the procurement of high-cost research equipment grant” (Project “SL-ReDu”, Project Number 2456).

Author information

Correspondence to Katerina Papadimitriou.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Papadimitriou, K., Parelli, M., Sapountzaki, G., Pavlakos, G., Maragos, P., Potamianos, G. (2021). Multimodal Fusion and Sequence Learning for Cued Speech Recognition from Videos. In: Antona, M., Stephanidis, C. (eds) Universal Access in Human-Computer Interaction. Access to Media, Learning and Assistive Environments. HCII 2021. Lecture Notes in Computer Science, vol 12769. Springer, Cham. https://doi.org/10.1007/978-3-030-78095-1_21

  • DOI: https://doi.org/10.1007/978-3-030-78095-1_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-78094-4

  • Online ISBN: 978-3-030-78095-1

  • eBook Packages: Computer Science (R0)
