Spoken Language Identification Using Language Bottleneck Features

Grisard, Malo; Motlicek, Petr; Allouchi, Wissem; Baeriswyl, Michael; Lazaridis, Alexandros; Zhan, Qingran

doi:10.1007/978-3-030-27947-9_32

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11697)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

In this paper, we introduce a novel approach to Language Identification (LID). We first evaluate two commonly used state-of-the-art methods based on the UBM/GMM i-vector technique combined with a back-end classifier. The two methods differ in the input features used to train the UBM/GMM models: conventional MFCCs, or deep Bottleneck Features (BNF) extracted from a neural network. Analogous to successful algorithms developed for speaker recognition tasks, we then propose to train the BNF classifier directly on language targets rather than on conventional phone targets (i.e. the International Phonetic Alphabet). We show that the proposed approach reduces the number of targets by 96% when tested on 4 languages of the SpeechDat databases, which leads to a 94% reduction in the time needed to train the BNF classifier. On average, we achieve a relative improvement of approximately 35% in terms of average cost \(C_{avg}\), as well as Language Error Rate (LER), across all test duration conditions.
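As a rough illustration of the \(C_{avg}\) metric named above, the sketch below computes a closed-set average cost in the general form used by the NIST LRE07 evaluation plan, together with a relative-improvement helper. The function names, the default costs and priors (C_miss = C_FA = 1, P_target = 0.5), the placeholder language labels, and all error rates are assumptions chosen for illustration; they are not taken from the paper.

```python
from typing import Dict, Tuple


def closed_set_c_avg(
    p_miss: Dict[str, float],
    p_fa: Dict[Tuple[str, str], float],
    c_miss: float = 1.0,      # assumed cost of a miss
    c_fa: float = 1.0,        # assumed cost of a false alarm
    p_target: float = 0.5,    # assumed prior of the target language
) -> float:
    """Closed-set average detection cost over all target languages.

    p_miss[t]      : miss rate when t is the target language
    p_fa[(t, n)]   : false-alarm rate for target t on speech in language n
    """
    languages = sorted(p_miss)
    n_lang = len(languages)
    # In the closed-set case the non-target prior is shared equally
    # among the remaining languages.
    p_non_target = (1.0 - p_target) / (n_lang - 1)

    total = 0.0
    for target in languages:
        cost = c_miss * p_target * p_miss[target]
        for other in languages:
            if other != target:
                cost += c_fa * p_non_target * p_fa[(target, other)]
        total += cost
    return total / n_lang


def relative_improvement(baseline: float, proposed: float) -> float:
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - proposed) / baseline


# Hypothetical 4-language setup with uniform, made-up error rates.
langs = ["A", "B", "C", "D"]
miss = {t: 0.10 for t in langs}
fa = {(t, n): 0.02 for t in langs for n in langs if t != n}
example_c_avg = closed_set_c_avg(miss, fa)
example_gain = relative_improvement(0.10, 0.065)
```

With these placeholder rates, each target language contributes 0.5 × 0.10 for misses plus 0.5 × 0.02 for false alarms, so the sketch yields a \(C_{avg}\) of about 0.06; moving a metric from 0.10 to 0.065 corresponds to a relative improvement of about 35%.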

Acknowledgement

This work was partially supported by several industrial projects at Idiap and the China Scholarship Council.

Author information

Authors and Affiliations

Department of Electrical Engineering, EPFL, Lausanne, Switzerland
Malo Grisard
Artificial Intelligence and Machine Learning Group, Swisscom, Switzerland
Malo Grisard, Wissem Allouchi & Michael Baeriswyl
Idiap Research Institute, Martigny, Switzerland
Petr Motlicek, Alexandros Lazaridis & Qingran Zhan

Corresponding author

Correspondence to Qingran Zhan.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Kamil Ekštein

About this paper

Cite this paper

Grisard, M., Motlicek, P., Allouchi, W., Baeriswyl, M., Lazaridis, A., Zhan, Q. (2019). Spoken Language Identification Using Language Bottleneck Features. In: Ekštein, K. (eds) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_32

DOI: https://doi.org/10.1007/978-3-030-27947-9_32
Published: 06 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27946-2
Online ISBN: 978-3-030-27947-9
eBook Packages: Computer Science, Computer Science (R0)
