Improved Data Modeling for Text-Dependent Speaker Recognition Using Sub-Band Processing

Finan, R.A.; Damper, R.I.; Sapeluk, A.T.

doi:10.1023/A:1009652732313

Published: March 2001

International Journal of Speech Technology, Volume 4, pages 45–62 (2001)


Abstract

A growing body of recent work documents the potential benefits of sub-band processing over wideband processing in automatic speech recognition and, less usually, speaker recognition. It is often found that the sub-band approach delivers performance improvements (especially in the presence of noise), but not always so. This raises the question of precisely when and how sub-band processing might be advantageous, which is difficult to answer because there is as yet only a rudimentary theoretical framework guiding this work. We describe a simple sub-band speaker recognition system designed to facilitate experimentation aimed at increasing understanding of the approach. This splits the time-domain speech signal into 16 sub-bands using a bank of second-order filters spaced on the psychophysical mel scale. Each sub-band has its own separate cepstral-based recognition system, the outputs of which are combined using the sum rule to produce a final decision. We find that sub-band processing leads to worthwhile reductions in both the verification and identification error rates relative to the wideband system, decreasing the identification error rate from 3.33% to 0.56% and equal error rate for verification by approximately 50% for clean speech. The hypothesis is advanced that, unlike the wideband system, sub-band processing effectively constrains the free parameters of the speaker models to be more uniformly deployed across frequency: as such, it offers a practical solution to the bias/variance dilemma of data modeling. Much remains to be done to explore fully the new paradigm of sub-band processing. Accordingly, several avenues for future work are identified. In particular, we aim to explore the hypothesis of a practical solution to the bias/variance dilemma in more depth.
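The two structural ideas in the abstract, band centres spaced on the mel scale and sum-rule fusion of per-band outputs, can be illustrated with a minimal sketch. This is not the authors' implementation: the mel formula, the 100–4000 Hz range, the function names and the toy scores are all assumptions made for illustration, and the scores are assumed to be similarity-like (higher is better). With distance-based scores the minimum summed value would be chosen instead.

```python
import math


def hz_to_mel(f_hz):
    # One common mel approximation; the paper may use a different formula.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel above.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_centres(n_bands=16, f_low=100.0, f_high=4000.0):
    """Centre frequencies in Hz, equally spaced on the mel scale.

    f_low and f_high are assumed values, not taken from the paper.
    """
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]


def sum_rule(band_scores):
    """Combine per-band scores by summation.

    band_scores[b][s] is the score that sub-band b assigns to speaker s
    (assumed similarity-like: higher means a better match).
    Returns (index of best speaker, list of summed scores).
    """
    n_speakers = len(band_scores[0])
    totals = [sum(band[s] for band in band_scores) for s in range(n_speakers)]
    best = max(range(n_speakers), key=totals.__getitem__)
    return best, totals


if __name__ == "__main__":
    centres = mel_centres()
    print([round(c) for c in centres])

    # Toy scores for 3 sub-bands and 2 speakers (hypothetical numbers).
    scores = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
    print(sum_rule(scores))
```

In the toy scores, the first sub-band alone would favour speaker 0, but the summed evidence favours speaker 1, which is the kind of pooling the sum rule provides. For scale, the reported drop in identification error from 3.33% to 0.56% is a relative reduction of about 83%.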

Author information

Authors and Affiliations

School of Engineering, University of Abertay Dundee, Scotland, DD1 1HG
R.A. Finan & A.T. Sapeluk
Image, Speech and Intelligent Systems (ISIS) Research Group, Department of Electronics and Computer Science, University of Southampton, Hants, SO17 1BJ, UK
R.I. Damper

About this article

Cite this article

Finan, R., Damper, R. & Sapeluk, A. Improved Data Modeling for Text-Dependent Speaker Recognition Using Sub-Band Processing. International Journal of Speech Technology 4, 45–62 (2001). https://doi.org/10.1023/A:1009652732313

Issue Date: March 2001
DOI: https://doi.org/10.1023/A:1009652732313