Abstract
Stress changes a speaker’s voice characteristics: emotional stress alters the speech pattern so that it is distributed non-normally along the temporal dimension. As a result, methods that identify the gender of a non-stressed speaker are no longer effective at recognizing the gender of a speaker under stressful conditions. To address this issue, we propose a new gender identification framework. We leverage i-vectors to capture gender information in each speech segment; a long short-term memory (LSTM) network then dynamically handles the temporal context features of the speech and learns long-term dependencies from the input. We evaluated the effectiveness of the proposed method, in terms of accuracy and the number of iterations needed to saturate, by comparing it with baseline methods on the task of identifying a speaker’s gender from conversations of different durations. By learning the gender information encoded in long-term dependencies, the proposed method outperforms the baselines and correctly identifies the speaker’s gender in all conversation types.
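The pipeline the abstract describes — one i-vector per speech segment, consumed sequentially by an LSTM whose final state drives a binary gender decision — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the i-vector dimensionality, hidden size, and random (untrained) weights are all assumptions for the sake of the example; in practice the weights are learned by backpropagation through time.

```python
import numpy as np

D = 100   # assumed i-vector dimensionality
H = 64    # assumed LSTM hidden size

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Gate weights for input (i), forget (f), candidate cell (c), output (o).
# Randomly initialised here purely for illustration.
W = {g: rng.standard_normal((H, D + H)) * 0.1 for g in "ifco"}
b = {g: np.zeros(H) for g in "ifco"}
w_out = rng.standard_normal(H) * 0.1  # logistic readout weights

def lstm_gender_posterior(ivectors):
    """ivectors: (T, D) array, one i-vector per speech segment."""
    h = np.zeros(H)
    c = np.zeros(H)
    for x in ivectors:
        z = np.concatenate([x, h])          # current segment + context
        i = sigmoid(W["i"] @ z + b["i"])    # input gate
        f = sigmoid(W["f"] @ z + b["f"])    # forget gate
        o = sigmoid(W["o"] @ z + b["o"])    # output gate
        g = np.tanh(W["c"] @ z + b["c"])    # candidate cell state
        c = f * c + i * g                   # long-term memory update
        h = o * np.tanh(c)                  # hidden state for next step
    return sigmoid(w_out @ h)               # posterior for one gender class

# A hypothetical 20-segment conversation of random i-vectors.
p = lstm_gender_posterior(rng.standard_normal((20, D)))
```

The cell-state recurrence `c = f * c + i * g` is what lets the model carry gender evidence across an entire conversation rather than deciding from any single segment.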
Acknowledgements
We would like to thank the Tamura Laboratory for supporting this work. We also thank the LDC for granting us access to the SUSAS database, and Dr. Lindsey R. Tate for correcting the grammatical errors in this manuscript.
About this article
Cite this article
Prasetio, B.H., Tamura, H. & Tanno, K. The long short-term memory based on i-vector extraction for conversational speech gender identification approach. Artif Life Robotics 25, 233–240 (2020). https://doi.org/10.1007/s10015-020-00582-x