AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking

Lathoud, Guillaume; Odobez, Jean-Marc; Gatica-Perez, Daniel

doi:10.1007/978-3-540-30568-2_16

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3361)

Included in the following conference series:

International Workshop on Machine Learning for Multimodal Interaction

Abstract

Assessing the quality of a speaker localization or tracking algorithm on a few short examples is difficult, especially when the ground-truth is absent or not well defined. One step towards systematic performance evaluation of such algorithms is to provide time-continuous speaker location annotation over a series of real recordings, covering various test cases. Areas of interest include audio, video and audio-visual speaker localization and tracking. The desired location annotation can be either 2-dimensional (image plane) or 3-dimensional (physical space). This paper motivates and describes a corpus of audio-visual data called “AV16.3”, along with a method for 3-D location annotation based on calibrated cameras. “16.3” stands for 16 microphones and 3 cameras, recorded in a fully synchronized manner, in a meeting room. Part of this corpus has already been successfully used to report research results.
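
To illustrate how calibrated cameras can yield a 3-D location from 2-D image positions, here is a minimal sketch of linear (DLT) triangulation. It is not taken from the paper and is not claimed to be the authors' annotation method. The intrinsics `K`, the three camera positions, the test point, and the function names `project` and `triangulate` are all assumptions chosen for illustration.

```python
import numpy as np

def project(P, X):
    """Project a 3-D point X (shape (3,)) to pixel coordinates with a 3x4 camera matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate(cameras, points_2d):
    """Linear (DLT) triangulation of one 3-D point from two or more calibrated views.

    Each view with pixel coordinates (u, v) contributes two linear equations in
    the homogeneous 3-D point; the least-squares solution is the right singular
    vector associated with the smallest singular value.
    """
    rows = []
    for P, (u, v) in zip(cameras, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X_h = vt[-1]
    return X_h[:3] / X_h[3]

# Hypothetical setup: three cameras sharing the same intrinsics, with identity
# rotation and different translations (illustrative values, not the AV16.3 rig).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
translations = [np.array([0.0, 0.0, 0.0]),
                np.array([-1.0, 0.0, 0.0]),
                np.array([0.0, -1.0, 0.0])]
cameras = [K @ np.hstack([np.eye(3), t.reshape(3, 1)]) for t in translations]

# A made-up 3-D point (for example, a speaker's mouth) and its noise-free projections.
X_true = np.array([0.5, 0.2, 4.0])
observations = [project(P, X_true) for P in cameras]

X_est = triangulate(cameras, observations)
print(X_est)
```

With real annotations the 2-D points are noisy, so the linear estimate is often used only as a starting point for a nonlinear refinement that minimizes reprojection error.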

Author information

Authors and Affiliations

IDIAP Research Institute, CH-1920, Martigny, Switzerland
Guillaume Lathoud, Jean-Marc Odobez & Daniel Gatica-Perez
EPFL, CH-1015, Lausanne, Switzerland
Guillaume Lathoud

Editor information

Editors and Affiliations

IDIAP Research Institute, Martigny, Switzerland
Samy Bengio
IDIAP Research Institute, CH-1920, Martigny, Switzerland
Hervé Bourlard

About this paper

Cite this paper

Lathoud, G., Odobez, JM., Gatica-Perez, D. (2005). AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking. In: Bengio, S., Bourlard, H. (eds) Machine Learning for Multimodal Interaction. MLMI 2004. Lecture Notes in Computer Science, vol 3361. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30568-2_16

DOI: https://doi.org/10.1007/978-3-540-30568-2_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24509-4
Online ISBN: 978-3-540-30568-2
eBook Packages: Computer Science (R0)
