Abstract
A growing body of recent work documents the potential benefits of sub-band processing over wideband processing in automatic speech recognition and, less commonly, speaker recognition. The sub-band approach is often, but not always, found to improve performance, especially in the presence of noise. This raises the question of precisely when and how sub-band processing might be advantageous, which is difficult to answer because there is as yet only a rudimentary theoretical framework guiding this work. We describe a simple sub-band speaker recognition system designed to facilitate experimentation aimed at increasing understanding of the approach. The system splits the time-domain speech signal into 16 sub-bands using a bank of second-order filters spaced on the psychophysical mel scale. Each sub-band has its own separate cepstral-based recognition system, and the outputs of these systems are combined using the sum rule to produce a final decision. We find that sub-band processing leads to worthwhile reductions in both the verification and identification error rates relative to the wideband system: for clean speech, the identification error rate falls from 3.33% to 0.56% and the equal error rate for verification falls by approximately 50%. We advance the hypothesis that sub-band processing, unlike the wideband system, effectively constrains the free parameters of the speaker models to be more uniformly deployed across frequency; as such, it offers a practical solution to the bias/variance dilemma of data modeling. Much remains to be done to explore fully the new paradigm of sub-band processing, and several avenues for future work are identified. In particular, we aim to examine the bias/variance hypothesis in more depth.
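The two structural ideas in the abstract, mel-spaced sub-bands and sum-rule combination of per-band decisions, can be illustrated with a minimal sketch. This is not the authors' implementation: the mel formula used here is one common analytic approximation, the band edges `f_low` and `f_high` are hypothetical values chosen for illustration, and the per-band scores are assumed to be comparable across bands (for example, normalised posterior-like values where higher means a better match).

```python
import math


def hz_to_mel(f_hz):
    # One widely used mel approximation (an assumption, not taken from the paper).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_centre_frequencies(n_bands=16, f_low=100.0, f_high=4000.0):
    """Return n_bands centre frequencies (Hz) equally spaced on the mel scale.

    f_low and f_high are hypothetical band edges, not values from the paper.
    """
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (n_bands + 1)
    return [mel_to_hz(m_low + step * (i + 1)) for i in range(n_bands)]


def sum_rule_identify(band_scores):
    """Combine per-band scores with the sum rule.

    band_scores: one dict per sub-band mapping speaker -> score,
    where a higher score means a better match (assumed convention).
    Returns the winning speaker and the summed scores.
    """
    totals = {}
    for scores in band_scores:
        for speaker, score in scores.items():
            totals[speaker] = totals.get(speaker, 0.0) + score
    winner = max(totals, key=totals.get)
    return winner, totals


if __name__ == "__main__":
    centres = mel_centre_frequencies()
    print([round(c) for c in centres])

    # Three hypothetical sub-band recognisers scoring two enrolled speakers.
    example = [
        {"spk_a": 0.6, "spk_b": 0.4},
        {"spk_a": 0.3, "spk_b": 0.7},
        {"spk_a": 0.7, "spk_b": 0.3},
    ]
    print(sum_rule_identify(example))
```

Because the centres are equally spaced in mel rather than in hertz, the bands are narrow at low frequencies and progressively wider at high frequencies. The sum rule lets a band that disagrees with the others be outvoted without any band-specific weighting, which is why it serves as a simple baseline for combining sub-band recognisers.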
Cite this article
Finan, R., Damper, R. & Sapeluk, A. Improved Data Modeling for Text-Dependent Speaker Recognition Using Sub-Band Processing. International Journal of Speech Technology 4, 45–62 (2001). https://doi.org/10.1023/A:1009652732313
DOI: https://doi.org/10.1023/A:1009652732313