Prosodic Features for Speaker Recognition

Chapter in: Forensic Speaker Recognition

Abstract

This chapter discusses the effectiveness of syllable-based prosodic features for speaker recognition. The term prosody refers to a collection of characteristics such as intonation, stress, and timing, expressed primarily through variations in pitch, energy, and duration at various levels of speech. Prosody reflects the learned or acquired speaking habits of a person and hence contributes to speaker recognition. Because prosodic features are less affected by channel mismatch and noise, they are particularly well suited to speaker forensics, a field that demands accurate identification of suspects with as few mitigating conditions as possible. The author describes a method for extracting prosodic features directly from the speech signal. In this method, speech is segmented into syllable-like regions using vowel onset points (VOPs). The locations of the VOPs serve as reference points for the extraction and representation of prosodic features. The effectiveness of these features for speaker recognition is demonstrated on the extended task of the NIST 2003 speaker recognition evaluation. Combining evidence from spectral features with that of the proposed prosodic features improves overall speaker recognition accuracy.
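
To make the idea of VOP-anchored features concrete, the following is a minimal sketch and not the chapter's actual feature set. It assumes the VOPs have already been detected and are given as sorted frame indices, that frame-level F0 and energy contours are available, and that unvoiced frames are marked with F0 equal to zero. The names `SyllableProsody` and `prosody_from_vops`, the 10 ms frame shift, and the four summary features are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SyllableProsody:
    """Prosodic summary of one syllable-like region (illustrative feature set)."""
    duration_s: float    # distance between successive VOPs, in seconds
    mean_f0_hz: float    # mean F0 over voiced frames (0.0 if none are voiced)
    delta_f0_hz: float   # last voiced F0 minus first voiced F0 in the region
    mean_energy: float   # mean frame energy in the region


def prosody_from_vops(
    vop_frames: Sequence[int],
    f0_hz: Sequence[float],
    energy: Sequence[float],
    frame_shift_s: float = 0.01,  # assumed frame shift, not taken from the chapter
) -> List[SyllableProsody]:
    """Summarize each region between consecutive VOPs.

    The last VOP only closes the preceding region; frames after it are ignored.
    """
    if len(f0_hz) != len(energy):
        raise ValueError("f0_hz and energy must have the same number of frames")
    features = []
    for start, end in zip(vop_frames, vop_frames[1:]):
        if not (0 <= start < end <= len(f0_hz)):
            raise ValueError("VOPs must be sorted, distinct, and within the contour")
        # Keep only voiced frames (F0 > 0) when summarizing pitch.
        voiced = [f for f in f0_hz[start:end] if f > 0]
        seg_energy = energy[start:end]
        features.append(SyllableProsody(
            duration_s=(end - start) * frame_shift_s,
            mean_f0_hz=sum(voiced) / len(voiced) if voiced else 0.0,
            delta_f0_hz=voiced[-1] - voiced[0] if voiced else 0.0,
            mean_energy=sum(seg_energy) / len(seg_energy),
        ))
    return features


if __name__ == "__main__":
    # Toy contours: two syllable-like regions delimited by VOPs at frames 2, 7, 10.
    f0 = [0, 0, 100, 110, 120, 0, 0, 200, 190, 180]
    en = [0.1, 0.1, 1.0, 1.0, 1.0, 0.2, 0.2, 2.0, 2.0, 2.0]
    for feat in prosody_from_vops([2, 7, 10], f0, en):
        print(feat)
```

Each returned record corresponds to one syllable-like region, so a sequence of such records could serve as the input to whatever speaker model is used downstream.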

Acknowledgement

The author thanks Prof. B. Yegnanarayana and the members of the Speech and Vision Laboratory, IIT Madras, India, for their support during 2002–2006 in carrying out the study described in this chapter.

Author information

Corresponding author

Correspondence to Leena Mary, Ph.D.

Copyright information

© 2012 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Mary, L. (2012). Prosodic Features for Speaker Recognition. In: Neustein, A., Patil, H. (eds) Forensic Speaker Recognition. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-0263-3_13

  • DOI: https://doi.org/10.1007/978-1-4614-0263-3_13

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-0262-6

  • Online ISBN: 978-1-4614-0263-3

  • eBook Packages: Engineering, Engineering (R0)
