Abstract
Although a large amount of psychological and physiological evidence of audio-visual integration in speech has been gathered over the last 20 years, there is no agreement about the nature of the fusion process. We present the main experimental data and describe the various models proposed in the literature, together with a number of studies in the field of automatic audio-visual speech recognition. We discuss these models in relation to general proposals on intersensory interaction arising from psychology, and on sensor fusion arising from vision and robotics. We then examine the characteristics of four main models in the light of psychological data and formal properties, and present the results of a modelling study on audio-visual recognition of French vowels in noise. We conclude in favor of a model in which the auditory and visual inputs are projected into, and fused within, a common representation space related to motor properties of speech objects, the fused representation then being classified for lexical access.
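The favored architecture can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the projection weights, reliability weighting, and vowel prototypes below are all hypothetical placeholders, chosen only to show the pipeline of projecting each modality into a common "motor" space, fusing there, and classifying the fused representation.

```python
# Minimal sketch (not the authors' model) of motor-space audio-visual fusion:
# each modality is projected into a shared representation space, the two
# estimates are fused there, and the fused vector is classified by nearest
# prototype. All weights and prototypes are illustrative placeholders.

def project(features, weights):
    """Linear projection of a feature vector into the common space."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def fuse(audio_m, visual_m, audio_reliability=0.5):
    """Reliability-weighted average of the two motor-space estimates."""
    a = audio_reliability
    return [a * x + (1 - a) * y for x, y in zip(audio_m, visual_m)]

def classify(fused, prototypes):
    """Nearest-prototype decision in the common space."""
    def dist(p):
        return sum((x - y) ** 2 for x, y in zip(fused, p))
    return min(prototypes, key=lambda name: dist(prototypes[name]))

# Toy example: a 2-D motor space (say, lip rounding and tongue height).
W_audio = [[1.0, 0.0], [0.0, 1.0]]      # placeholder projections
W_visual = [[0.9, 0.1], [0.1, 0.9]]
prototypes = {"i": [0.0, 1.0], "y": [1.0, 1.0], "a": [0.2, 0.0]}

audio_m = project([0.1, 0.9], W_audio)
visual_m = project([0.9, 0.8], W_visual)            # vision sees rounding
fused = fuse(audio_m, visual_m, audio_reliability=0.3)  # noisy audio
print(classify(fused, prototypes))
```

Down-weighting the noisy auditory estimate lets the visually observed lip rounding pull the fused point toward the rounded-vowel prototype, which is the intuition behind fusing in a shared space before classification.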
Robert-Ribes, J., Schwartz, JL. & Escudier, P. A comparison of models for fusion of the auditory and visual sensors in speech perception. Artif Intell Rev 9, 323–346 (1995). https://doi.org/10.1007/BF00849043