Abstract
In this paper, we address speaker diarization of recordings from a real call center. We previously proposed a specialized solution that exploited additional knowledge about the identities of the phone operators (in our case, the language counselors from the Language Consulting Center) and thereby improved performance over the baseline. However, a more recent end-to-end diarization method, EEND, has proven very successful on other data and was shown to surpass the previous state of the art in the field. We therefore compare this new method with our own previous approach. Using an existing implementation of EEND, adapted with a small amount of target data from the Language Consulting Center, we surpass the performance of our previous approach (17.42% vs. 19.39% DER) without needing any additional information about speaker identities. Most of the remaining diarization error of the EEND system is due to incorrect speech/silence decisions rather than speaker confusion. For comparison, we also report results for a more standard diarization approach, represented by the method used in the Kaldi toolkit.
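For readers unfamiliar with the metric, the diarization error rate (DER) is conventionally defined, following the NIST Rich Transcription evaluations, as the fraction of reference speech time that is attributed incorrectly. The sketch below uses that conventional definition. The symbol names are illustrative and are not taken from the paper, and the abstract does not state scoring details such as the forgiveness collar or the handling of overlapped speech.

```latex
% Conventional DER definition (symbol names are assumptions for illustration):
%   T_miss   : reference speech time labeled as non-speech (missed speech)
%   T_FA     : non-speech time labeled as speech (false alarm)
%   T_conf   : speech time assigned to the wrong speaker (speaker confusion)
%   T_speech : total reference speech time
\[
  \mathrm{DER} = \frac{T_{\mathrm{miss}} + T_{\mathrm{FA}} + T_{\mathrm{conf}}}{T_{\mathrm{speech}}}
\]

% Relative reduction implied by the two figures quoted in the abstract:
\[
  \frac{19.39 - 17.42}{19.39} \approx 0.102
\]
```

Under this definition, the first two terms capture speech/silence errors and the third captures speaker confusion, so the abstract's final observation amounts to saying that $T_{\mathrm{miss}} + T_{\mathrm{FA}}$ dominates $T_{\mathrm{conf}}$ for the EEND system. The quoted figures correspond to an absolute reduction of 1.97 percentage points, or roughly 10% relative.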
Acknowledgments
The work described herein has been supported by the Ministry of Education, Youth and Sports of the Czech Republic, Project No. LM2018101 LINDAT/CLARIAH-CZ. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the program “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Zajíc, Z., Kunešová, M., Müller, L. (2021). Applying EEND Diarization to Telephone Recordings from a Call Center. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_72
DOI: https://doi.org/10.1007/978-3-030-87802-3_72
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3
eBook Packages: Computer Science (R0)