A Combined Feature Approach for Speaker Segmentation Using Convolution Neural Network

Zhong, Jiang; Zhang, Pan; Li, Xue

doi:10.1007/978-3-319-77383-4_54

Jiang Zhong, Pan Zhang & Xue Li

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10736)

Included in the following conference series:

Pacific Rim Conference on Multimedia


Abstract

This paper proposes a speaker segmentation algorithm based on a combined-feature approach using a convolutional neural network (CNN). It targets the speaker segmentation of dialogue speech in a call-center environment where partial prior knowledge is available. For the first time, Mel-Frequency Cepstral Coefficient (MFCC) features and spectrogram features are combined as the input of the CNN, both to train the speakers' voice feature model and to estimate the change points. In the experiments, a real-world database of dialogue recordings from the insurance sales and real estate sales industries is used to compare the proposed approach with the Bayesian Information Criterion (BIC) approach across different acoustic feature sets. The results show that overall performance improves and that the proposed algorithm achieves better segmentation than the BIC baseline.
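
For readers unfamiliar with the BIC baseline mentioned in the abstract, the following is a minimal sketch of the commonly used Delta-BIC change-point test, reduced to a one-dimensional feature stream so that it needs only the Python standard library. It is not the authors' implementation: the function names, the penalty weight `lam`, the `margin` parameter, and the synthetic data are all assumptions made for illustration. Real systems apply the same test to multi-dimensional features such as MFCC vectors, using covariance determinants in place of the scalar variances.

```python
import math
import random


def variance(xs):
    """Maximum-likelihood variance of a list of floats (floored to avoid log(0))."""
    mean = sum(xs) / len(xs)
    return max(sum((x - mean) ** 2 for x in xs) / len(xs), 1e-12)


def delta_bic(frames, i, lam=1.0):
    """Delta-BIC for splitting 1-D `frames` at index i; positive values favour a change."""
    n = len(frames)
    left, right = frames[:i], frames[i:]
    d = 1  # feature dimension in this toy example (assumption)
    # Model-complexity penalty: extra mean and covariance parameters of the second Gaussian.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * math.log(n)
    return (0.5 * n * math.log(variance(frames))
            - 0.5 * len(left) * math.log(variance(left))
            - 0.5 * len(right) * math.log(variance(right))
            - penalty)


def best_change_point(frames, margin=10, lam=1.0):
    """Return (index, score) of the highest Delta-BIC candidate, skipping the window edges."""
    candidates = range(margin, len(frames) - margin)
    score, index = max((delta_bic(frames, i, lam), i) for i in candidates)
    return index, score


# Synthetic 1-D "feature" stream: 100 frames from one speaker, then 100 from another.
rng = random.Random(0)
stream = ([rng.gauss(0.0, 1.0) for _ in range(100)]
          + [rng.gauss(4.0, 1.0) for _ in range(100)])
index, score = best_change_point(stream)
print(f"estimated change point: frame {index}, Delta-BIC = {score:.1f}")
```

A candidate is accepted as a speaker change when its Delta-BIC is positive. Raising `lam` makes the test more conservative, which is the usual trade-off between missed and false change points in BIC-based segmentation.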

Acknowledgement

This work was supported in part by the National High-tech R&D Program of China (No. 2015AA015308), Social Undertakings and Livelihood Security Science and Technology Innovation Funds of CQ CSTC (No. cstc2017shmsA20013), Frontier and Application Foundation Research Program of CQ CSTC (No. cstc2017jcyjAX0340), National Natural Science Foundation of China (No. 61402020) and Ph.D. Programs Foundation of Ministry of Education of China (No. 20130001120021).

Author information

Authors and Affiliations

Key Laboratory of Dependable Service Computing in Cyber Physical Society, Ministry of Education, Chongqing University, Chongqing, 400030, China
Jiang Zhong & Xue Li
College of Computer Science, Chongqing University, Chongqing, 400030, China
Jiang Zhong & Pan Zhang
School of Information Technology and Electrical Engineering, University of Queensland, Brisbane, Australia
Xue Li

Corresponding author

Correspondence to Pan Zhang.

Editor information

Editors and Affiliations

University of Electronic Science and Technology of China, Chengdu, China
Bing Zeng
University of Chinese Academy of Sciences, Beijing, China
Qingming Huang
University of Ottawa, Ottawa, Ontario, Canada
Abdulmotaleb El Saddik
University of Electronic Science and Technology of China, Chengdu, China
Hongliang Li
Chinese Academy of Sciences, Beijing, China
Shuqiang Jiang
Harbin Institute of Technology, Harbin, China
Xiaopeng Fan

About this paper

Cite this paper

Zhong, J., Zhang, P., Li, X. (2018). A Combined Feature Approach for Speaker Segmentation Using Convolution Neural Network. In: Zeng, B., Huang, Q., El Saddik, A., Li, H., Jiang, S., Fan, X. (eds) Advances in Multimedia Information Processing – PCM 2017. PCM 2017. Lecture Notes in Computer Science, vol 10736. Springer, Cham. https://doi.org/10.1007/978-3-319-77383-4_54

DOI: https://doi.org/10.1007/978-3-319-77383-4_54
Published: 10 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77382-7
Online ISBN: 978-3-319-77383-4
eBook Packages: Computer Science, Computer Science (R0)
