Comparison of Text-Independent Original Speaker Recognition from Emotionally Converted Speech

Chapter in: Recent Advances in Nonlinear Speech Processing

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 48)

Abstract

The paper describes an application of a classifier based on Gaussian mixture models (GMM) to the reverse identification of the original speaker from emotionally transformed speech in Czech and Slovak. We investigate whether the identification score given by the GMM classifier depends on the type and structure of the speech features used. A comparison with results obtained on sentences in German and English shows that the structure and balance of the speech database influence the identification accuracy, whereas the language used has practically no effect. The evaluation experiments confirmed that the developed text-independent GMM original-speaker identifier is functional for closed-set classification tasks.
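
The decision rule underlying such a system is the classic GMM approach to closed-set, text-independent speaker identification: one GMM is trained per enrolled speaker, and a test utterance is assigned to the speaker whose model yields the highest likelihood. The sketch below illustrates this pipeline under stated assumptions only: the MFCC features, diagonal covariances, the mixture count, and all function names are illustrative placeholders, not the spectral and prosodic feature sets or the implementation evaluated in the paper.

```python
# Minimal sketch of closed-set, text-independent speaker identification
# with one GMM per speaker. Assumptions (not from the paper): MFCC
# features via librosa, diagonal covariances, and n_components=32 are
# illustrative; the paper evaluates its own spectral/prosodic features.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_features(wav_path, sr=16000, n_mfcc=13):
    """Return a (frames x coefficients) MFCC matrix for one recording."""
    signal, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # one row per analysis frame

def train_speaker_models(train_files, n_components=32):
    """Fit one GMM per enrolled speaker.

    train_files maps a speaker label to a list of wav paths."""
    models = {}
    for speaker, paths in train_files.items():
        feats = np.vstack([extract_features(p) for p in paths])
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", max_iter=200)
        gmm.fit(feats)
        models[speaker] = gmm
    return models

def identify(models, wav_path):
    """Closed-set decision: the speaker whose GMM yields the highest
    average per-frame log-likelihood for the test utterance."""
    feats = extract_features(wav_path)
    scores = {spk: gmm.score(feats) for spk, gmm in models.items()}
    return max(scores, key=scores.get)
```

In the reverse-identification task described above, identify() would be applied to an emotionally converted utterance and its output compared with the known original speaker; the accuracy over many such trials is the kind of identification score the abstract refers to.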



Acknowledgments

This work has been supported by the Grant Agency of the Slovak Academy of Sciences (VEGA 1/0090/16 and VEGA 2/0013/14) and by the Ministry of Education of the Slovak Republic (KEGA 022STU-4/2014).

Author information


Corresponding author

Correspondence to Jiří Přibil.


Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Přibil, J., Přibilová, A. (2016). Comparison of Text-Independent Original Speaker Recognition from Emotionally Converted Speech. In: Esposito, A., et al. Recent Advances in Nonlinear Speech Processing. Smart Innovation, Systems and Technologies, vol 48. Springer, Cham. https://doi.org/10.1007/978-3-319-28109-4_14


  • DOI: https://doi.org/10.1007/978-3-319-28109-4_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-28107-0

  • Online ISBN: 978-3-319-28109-4

  • eBook Packages: Engineering, Engineering (R0)
