
HMM speech synthesis based on MDCT representation

Published in: International Journal of Speech Technology

Abstract

Hidden Markov model (HMM) based text-to-speech (TTS) has become one of the most promising approaches to speech synthesis, as it provides a particularly flexible and robust framework for generating synthetic speech. However, several factors, such as the mel-cepstral vocoder and over-smoothing, degrade the quality of the synthesized speech. This paper presents an HMM speech synthesis technique based on the modified discrete cosine transform (MDCT) representation to address these two issues. To this end, we use an analysis/synthesis technique based on the MDCT that guarantees perfect reconstruction of the signal frame from the feature vectors and allows a 50% overlap between frames without increasing the size of the feature vector, in contrast to conventional mel-cepstral spectral parameters, which do not permit direct reconstruction of the speech waveform. Experimental results, obtained with both objective and subjective tests, show that speech of good quality is produced.
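
To make the reconstruction property concrete, the following is a minimal NumPy sketch (our illustration, not the authors' implementation) of MDCT analysis/synthesis with 50% overlapped frames: each hop of N samples yields exactly N coefficients (critical sampling), and overlap-adding the inverse transforms cancels the time-domain aliasing when the window satisfies the Princen-Bradley condition, yielding perfect reconstruction. The hop size N = 256 and the sine window are illustrative choices.

```python
# Illustrative MDCT/IMDCT sketch, not the paper's implementation.
import numpy as np

def mdct(frame, window):
    """Forward MDCT: one windowed 2N-sample frame -> N coefficients."""
    N = len(frame) // 2
    n = np.arange(2 * N)
    k = np.arange(N)
    # Direct O(N^2) transform for clarity; fast FFT-based versions exist.
    basis = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return (window * frame) @ basis

def imdct(coeffs, window):
    """Inverse MDCT: N coefficients -> 2N time-aliased, re-windowed samples."""
    N = len(coeffs)
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return (2.0 / N) * window * (basis @ coeffs)

N = 256                                        # hop size; each frame spans 2N samples
w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))  # sine window (Princen-Bradley)

rng = np.random.default_rng(0)
x = rng.standard_normal(8 * N)                 # stand-in for a speech segment
starts = range(0, len(x) - 2 * N + 1, N)       # 50% overlapping frames
X = [mdct(x[s:s + 2 * N], w) for s in starts]  # N coefficients per hop: critical sampling

# Overlap-add synthesis: time-domain aliasing cancels between adjacent frames.
y = np.zeros(len(x))
for s, c in zip(starts, X):
    y[s:s + 2 * N] += imdct(c, w)

# Perfect reconstruction except in the first and last half-frame.
print(np.max(np.abs(x[N:-N] - y[N:-N])))       # ~1e-13
```

Note that each frame's inverse transform alone is time-aliased; only the overlap-add of adjacent frames recovers the waveform, which is why the first and last half-frames are excluded from the error check.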



Author information

Correspondence to Laura Falaschetti.


About this article


Cite this article

Biagetti, G., Crippa, P., Falaschetti, L. et al. HMM speech synthesis based on MDCT representation. Int J Speech Technol 21, 1045–1055 (2018). https://doi.org/10.1007/s10772-018-09571-9

