
AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking

  • Conference paper
Machine Learning for Multimodal Interaction (MLMI 2004)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3361)


Abstract

Assessing the quality of a speaker localization or tracking algorithm on a few short examples is difficult, especially when the ground-truth is absent or not well defined. One step towards systematic performance evaluation of such algorithms is to provide time-continuous speaker location annotation over a series of real recordings, covering various test cases. Areas of interest include audio, video and audio-visual speaker localization and tracking. The desired location annotation can be either 2-dimensional (image plane) or 3-dimensional (physical space). This paper motivates and describes a corpus of audio-visual data called “AV16.3”, along with a method for 3-D location annotation based on calibrated cameras. “16.3” stands for 16 microphones and 3 cameras, recorded in a fully synchronized manner, in a meeting room. Part of this corpus has already been successfully used to report research results.
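To illustrate how calibrated cameras can yield a 3-D location, here is a minimal sketch of one generic technique: least-squares triangulation from viewing rays. The abstract does not specify the annotation method, so this is not presented as the authors' procedure. It assumes each camera's calibration has already been used to turn an annotated image point into a ray (camera center plus direction) in room coordinates. The function names, camera positions, and target point are hypothetical.

```python
from math import sqrt


def _normalize(v):
    """Return v scaled to unit length."""
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def _solve3(a, b):
    """Solve the 3x3 system a x = b by Gauss-Jordan elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("rays are (nearly) parallel; point is not determined")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def triangulate(centers, directions):
    """Return the 3-D point minimizing the summed squared distance to all rays.

    Each ray i passes through centers[i] along directions[i]. With unit
    direction d, the projector P = I - d d^T removes the component along
    the ray, so the minimizer x solves (sum P_i) x = sum (P_i c_i).
    """
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(centers, directions):
        d = _normalize(d)
        for i in range(3):
            for j in range(3):
                p = (1.0 if i == j else 0.0) - d[i] * d[j]
                a[i][j] += p
                b[i] += p * c[j]
    return _solve3(a, b)


# Hypothetical setup: three cameras near the ceiling, one speaker's mouth.
target = [1.0, 2.0, 1.5]
centers = [[0.0, 0.0, 2.0], [4.0, 0.0, 2.0], [0.0, 3.0, 2.0]]
# Noise-free rays pointing from each camera center toward the target.
directions = [[t - c for t, c in zip(target, ctr)] for ctr in centers]

estimate = triangulate(centers, directions)
print(estimate)
```

With noise-free rays the estimate coincides with the target up to floating-point error. With real annotations the rays no longer intersect exactly, and the residual distance to each ray gives a rough indication of annotation and calibration error.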




Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lathoud, G., Odobez, JM., Gatica-Perez, D. (2005). AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking. In: Bengio, S., Bourlard, H. (eds) Machine Learning for Multimodal Interaction. MLMI 2004. Lecture Notes in Computer Science, vol 3361. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30568-2_16


  • DOI: https://doi.org/10.1007/978-3-540-30568-2_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24509-4

  • Online ISBN: 978-3-540-30568-2

  • eBook Packages: Computer Science, Computer Science (R0)
