Applying EEND Diarization to Telephone Recordings from a Call Center

Zajíc, Zbyněk; Kunešová, Marie; Müller, Luděk

doi:10.1007/978-3-030-87802-3_72

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12997)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

This paper addresses speaker diarization of recordings from a real call center. We previously proposed a specialized solution that exploited additional knowledge about the identities of the phone operators (in our case, the language counselors of the Language Consulting Center) and thereby improved performance over the baseline. Since then, an end-to-end neural diarization method, EEND, has proven very successful on other data and has been shown to surpass the previous state of the art. We therefore compare this new method with our previous approach. Using an existing implementation of EEND, adapted with a small amount of target data from the Language Consulting Center, we surpass the performance of our previous approach (17.42% vs. 19.39% diarization error rate, DER) without needing any additional information about speaker identities. Most of the remaining diarization error of the EEND system comes from incorrect decisions between speech and silence rather than from speaker confusion. For comparison, we also report results for a more standard diarization approach, represented by the method used in the Kaldi toolkit.
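To make the distinction between speech/silence errors and speaker confusion concrete, the following is a minimal sketch of how DER can be split into its usual components at the frame level. It is an illustration under simplifying assumptions, not the scoring procedure used in the paper: each frame carries at most one speaker (no overlapped speech), no forgiveness collar is applied, and the function name `der_components` and the toy labels are hypothetical.

```python
from itertools import permutations


def der_components(ref, hyp):
    """Frame-level DER breakdown (simplified: no overlap, no collar).

    ref, hyp: equal-length lists; each entry is a speaker label (str)
    or None for silence. Returns (miss, false_alarm, confusion, der),
    each expressed as a fraction of the reference speech frames.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    ref_speech = sum(r is not None for r in ref)
    if ref_speech == 0:
        raise ValueError("reference contains no speech")

    # Speech/silence errors: reference speech labeled as silence, and vice versa.
    miss = sum(r is not None and h is None for r, h in zip(ref, hyp))
    false_alarm = sum(r is None and h is not None for r, h in zip(ref, hyp))

    # Frames where both sides agree that someone is speaking.
    both = [(r, h) for r, h in zip(ref, hyp) if r is not None and h is not None]
    ref_labels = sorted({r for r, _ in both})
    hyp_labels = sorted({h for _, h in both})

    # Hypothesis labels are arbitrary, so search for the one-to-one mapping
    # onto reference labels that maximizes agreement (brute force, which is
    # fine for the two or three speakers of a phone call). Dummy tuples pad
    # the reference side so surplus hypothesis speakers can stay unmapped.
    n = max(len(ref_labels), len(hyp_labels))
    padded = ref_labels + [("unmapped", i) for i in range(n - len(ref_labels))]
    best_correct = 0
    for perm in permutations(padded, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, perm))
        correct = sum(mapping[h] == r for r, h in both)
        best_correct = max(best_correct, correct)
    confusion = len(both) - best_correct

    der = (miss + false_alarm + confusion) / ref_speech
    return (miss / ref_speech, false_alarm / ref_speech,
            confusion / ref_speech, der)


# Toy example: 8 frames, two reference speakers, None marks silence.
ref = ["A", "A", "A", "B", "B", None, None, "B"]
hyp = ["x", "x", None, "y", "x", "y", None, "y"]
miss, false_alarm, confusion, der = der_components(ref, hyp)
print(f"miss={miss:.3f} fa={false_alarm:.3f} conf={confusion:.3f} DER={der:.3f}")
```

In this breakdown, the missed-speech and false-alarm terms together correspond to the speech/silence decisions mentioned above, while the confusion term counts frames attributed to the wrong speaker after the best label mapping.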

Acknowledgments

The work described herein has been supported by the Ministry of Education, Youth and Sports of the Czech Republic, Project No. LM2018101 LINDAT/CLARIAH-CZ. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the program “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.

Author information

Authors and Affiliations

Faculty of Applied Sciences, NTIS - New Technologies for the Information Society and Department of Cybernetics, University of West Bohemia, Univerzitní 8, 301 00, Plzeň, Czech Republic
Zbyněk Zajíc, Marie Kunešová & Luděk Müller

Corresponding author

Correspondence to Zbyněk Zajíc.

Editor information

Editors and Affiliations

St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia
Alexey Karpov
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova

About this paper

Cite this paper

Zajíc, Z., Kunešová, M., Müller, L. (2021). Applying EEND Diarization to Telephone Recordings from a Call Center. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_72

DOI: https://doi.org/10.1007/978-3-030-87802-3_72
Published: 22 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3
eBook Packages: Computer Science, Computer Science (R0)
