Abstract
Although a large amount of psychological and physiological evidence of audio-visual integration in speech has been gathered over the last 20 years, there is no agreement about the nature of the fusion process. We present the main experimental data and describe the various models proposed in the literature, together with a number of studies in the field of automatic audio-visual speech recognition. We discuss these models in relation to general proposals on intersensory interaction arising from psychology, and on sensor fusion arising from vision and robotics. We then examine the characteristics of four main models in the light of psychological data and formal properties, and present the results of a modelling study on audio-visual recognition of French vowels in noise. We conclude in favor of a model in which the auditory and visual inputs are projected into, and fused within, a common representation space related to motor properties of speech objects, the fused representation then being classified for lexical access.
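The favored architecture can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the projection weights, reliability weighting, and vowel prototypes below are all hypothetical placeholders, chosen only to show the pipeline of projecting each modality into a common "motor" space, fusing there, and classifying the fused representation.

```python
# Minimal sketch (not the authors' model) of motor-space audio-visual fusion:
# each modality is projected into a shared representation space, the two
# estimates are fused there, and the fused vector is classified by nearest
# prototype. All weights and prototypes are illustrative placeholders.

def project(features, weights):
    """Linear projection of a feature vector into the common space."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def fuse(audio_m, visual_m, audio_reliability=0.5):
    """Reliability-weighted average of the two motor-space estimates."""
    a = audio_reliability
    return [a * x + (1 - a) * y for x, y in zip(audio_m, visual_m)]

def classify(fused, prototypes):
    """Nearest-prototype decision in the common space."""
    def dist(p):
        return sum((x - y) ** 2 for x, y in zip(fused, p))
    return min(prototypes, key=lambda name: dist(prototypes[name]))

# Toy example: a 2-D motor space (say, lip rounding and tongue height).
W_audio = [[1.0, 0.0], [0.0, 1.0]]      # placeholder projections
W_visual = [[0.9, 0.1], [0.1, 0.9]]
prototypes = {"i": [0.0, 1.0], "y": [1.0, 1.0], "a": [0.2, 0.0]}

audio_m = project([0.1, 0.9], W_audio)
visual_m = project([0.9, 0.8], W_visual)            # vision sees rounding
fused = fuse(audio_m, visual_m, audio_reliability=0.3)  # noisy audio
print(classify(fused, prototypes))
```

Down-weighting the noisy auditory estimate lets the visually observed lip rounding pull the fused point toward the rounded-vowel prototype, which is the intuition behind fusing in a shared space before classification.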
Robert-Ribes, J., Schwartz, JL. & Escudier, P. A comparison of models for fusion of the auditory and visual sensors in speech perception. Artif Intell Rev 9, 323–346 (1995). https://doi.org/10.1007/BF00849043