Abstract
During feature extraction for speech recognition, a window function is first applied to the input waveform to extract a temporally limited spectrum. By shifting the window by a short period, we can analyze the temporal change of the speech spectrum. This period is called the "frame shift," and is usually 5 to 10 ms. In this paper, the frame shift is reconsidered from two aspects. The first is the appropriateness of 10 ms as the frame shift. Frame-based processing assumes that the speech spectrum changes slowly enough relative to the frame shift, which does not hold for certain kinds of consonants such as plosives. This paper first shows experimentally that feature values fluctuate considerably depending on the beginning position of the frame. A training method is then proposed that uses temporally shifted samples as independent samples to compensate for the feature fluctuation caused by differences in the beginning position of a frame. The second aspect is that the frame shift could be made longer if the fluctuation can be compensated. To verify this, an experiment was conducted in which the frame shift was varied from 10 to 60 ms; the 40 ms frame shift outperformed the 10 ms frame shift, and a 50 ms frame shift achieved recognition performance comparable to that of the 10 ms frame shift.
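The two ideas in the abstract can be illustrated with a small sketch: framing a waveform at a given frame shift, measuring how the log spectrum fluctuates when the frame start position is offset by a few milliseconds, and generating temporally shifted copies of a signal to be treated as independent training samples. This is a minimal illustration with hypothetical helper names (`frame_signal`, `log_spectrum`, `shifted_copies`), not the paper's implementation; the window/shift values (25 ms window, 10 ms shift at 16 kHz) follow common ASR front-end conventions.

```python
import numpy as np

def frame_signal(x, frame_len, frame_shift):
    """Split waveform x into overlapping frames (hypothetical helper)."""
    n = 1 + max(0, (len(x) - frame_len) // frame_shift)
    return np.stack([x[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n)])

def log_spectrum(frames):
    """Windowed log-magnitude spectrum of each frame."""
    win = np.hanning(frames.shape[1])
    return np.log(np.abs(np.fft.rfft(frames * win, axis=1)) + 1e-10)

sr = 16000
rng = np.random.default_rng(0)
x = np.zeros(sr // 10)                 # 100 ms of silence
x[800:850] = rng.standard_normal(50)   # short plosive-like burst (~3 ms)

frame_len = 400   # 25 ms window
shift = 160       # 10 ms frame shift

# Feature fluctuation: start the framing 5 ms later and compare spectra.
f0 = log_spectrum(frame_signal(x, frame_len, shift))
f1 = log_spectrum(frame_signal(x[80:], frame_len, shift))
m = min(f0.shape[0], f1.shape[0])
print("mean |diff| from a 5 ms start offset:", np.mean(np.abs(f0[:m] - f1[:m])))

def shifted_copies(x, frame_len, frame_shift, n_shifts):
    """Frame n_shifts temporally offset copies of x, to be used as
    independent training samples (in the spirit of the proposed method)."""
    step = frame_shift // n_shifts
    return [frame_signal(x[k * step:], frame_len, frame_shift)
            for k in range(n_shifts)]

copies = shifted_copies(x, frame_len, shift, 4)
print("framed copy shapes:", [c.shape for c in copies])
```

For a transient like a plosive burst, the mean spectral difference under a small start-position offset is large relative to steady regions, which is the fluctuation the paper's augmentation aims to absorb.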
Acknowledgment
Part of this work was supported by JSPS Kakenhi JP17H00823.
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Ito, A. (2019). Leveraging a Small Corpus by Different Frame Shifts for Training of a Speech Recognizer. In: Pan, JS., Ito, A., Tsai, PW., Jain, L. (eds) Recent Advances in Intelligent Information Hiding and Multimedia Signal Processing. IIH-MSP 2018. Smart Innovation, Systems and Technologies, vol 110. Springer, Cham. https://doi.org/10.1007/978-3-030-03748-2_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03747-5
Online ISBN: 978-3-030-03748-2