Abstract
Stress changes a speaker’s voice characteristics: emotional stress alters the speech pattern so that it is distributed non-normally along the temporal dimension. As a result, methods that identify the gender of a non-stressed speaker are no longer effective at recognizing the gender of a speaker under stressful conditions. To address this issue, we propose a new gender identification framework. We leverage i-vectors to capture gender information in each speech segment; a long short-term memory (LSTM) network then dynamically handles the temporal context features of the speech and learns long-term dependencies from the input. We evaluated the effectiveness of the proposed method, in terms of accuracy and the number of iterations needed to saturate, by comparing it with baseline methods on the task of identifying a speaker’s gender from conversations of different durations. By learning the gender information encoded in long-term dependencies, the proposed method outperforms the baselines and correctly identifies the speaker’s gender in all conversation types.
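The pipeline the abstract describes — one i-vector per speech segment, consumed sequentially by an LSTM whose final state drives a binary gender decision — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the i-vector dimensionality, hidden size, and random (untrained) weights are all assumptions for the sake of the example; in practice the weights are learned by backpropagation through time.

```python
import numpy as np

D = 100   # assumed i-vector dimensionality
H = 64    # assumed LSTM hidden size

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Gate weights for input (i), forget (f), candidate cell (c), output (o).
# Randomly initialised here purely for illustration.
W = {g: rng.standard_normal((H, D + H)) * 0.1 for g in "ifco"}
b = {g: np.zeros(H) for g in "ifco"}
w_out = rng.standard_normal(H) * 0.1  # logistic readout weights

def lstm_gender_posterior(ivectors):
    """ivectors: (T, D) array, one i-vector per speech segment."""
    h = np.zeros(H)
    c = np.zeros(H)
    for x in ivectors:
        z = np.concatenate([x, h])          # current segment + context
        i = sigmoid(W["i"] @ z + b["i"])    # input gate
        f = sigmoid(W["f"] @ z + b["f"])    # forget gate
        o = sigmoid(W["o"] @ z + b["o"])    # output gate
        g = np.tanh(W["c"] @ z + b["c"])    # candidate cell state
        c = f * c + i * g                   # long-term memory update
        h = o * np.tanh(c)                  # hidden state for next step
    return sigmoid(w_out @ h)               # posterior for one gender class

# A hypothetical 20-segment conversation of random i-vectors.
p = lstm_gender_posterior(rng.standard_normal((20, D)))
```

The cell-state recurrence `c = f * c + i * g` is what lets the model carry gender evidence across an entire conversation rather than deciding from any single segment.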
Acknowledgements
We would like to thank the Tamura Laboratory for supporting this work. We also thank the LDC for granting us access to the SUSAS database, and Dr. Lindsey R. Tate for correcting the grammatical errors in this manuscript.
About this article
Cite this article
Prasetio, B.H., Tamura, H. & Tanno, K. The long short-term memory based on i-vector extraction for conversational speech gender identification approach. Artif Life Robotics 25, 233–240 (2020). https://doi.org/10.1007/s10015-020-00582-x