
Speech naturalness improvement via \(\epsilon\)-closed extended vectors sets in voice conversion systems


Abstract

In conventional voice conversion methods, features of a speech signal's spectrum envelope are first extracted. A set of conversion functions is then designed and applied to map these features so that they best match the target speaker's speech. Finally, the spectrum envelope of the target speaker's speech signal is reconstructed from the converted features. The spectrum envelope reconstructed from the converted features usually deviates from its natural form. This deviation, observed as over-smoothing, over-fitting, and widening of formants, is partially caused by two factors: (1) reconstructing the spectrum envelope from the features introduces an error, and (2) the set of features extracted from the spectrum envelope of the speech signal is not closed. A method is put forward to improve the naturalness of speech by means of \(\epsilon\)-closed sets of extended vectors in voice conversion systems. In this approach, \(\epsilon\)-closed sets are introduced to reconstruct the natural spectrum envelope of a signal in the synthesis phase. The elements of these sets are generated by forming a group of extended feature vectors and applying a quantization scheme to the features of a speech signal. Using this method in speech synthesis noticeably reduces the error of reconstructing the spectrum from the features. Furthermore, the final spectrum envelope produced by the voice conversion maintains its natural form, so the problems arising from the deviation of the voice from its natural state are resolved. The method can be used generally as one phase of speech synthesis: it is independent of the voice conversion technique and of whether training is parallel or non-parallel, and it can be applied to improve the naturalness of the generated speech signal in all common voice conversion methods. Moreover, it can be used in other fields of speech processing, such as text-to-speech systems and vocoders, to improve the quality of the output signal in the synthesis step.
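The abstract describes the mechanism only in prose. The following is a minimal sketch of the general idea, not the authors' algorithm: it builds a codebook of extended vectors, each pairing a low-dimensional feature vector with the spectral-envelope samples it was computed from, and at synthesis time snaps a converted feature vector to its nearest codebook entry, reusing that entry's stored natural envelope. The plain k-means quantizer, the feature and envelope dimensions, and all function names here are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's exact method): quantize
# "extended" vectors [feature | envelope] into a codebook, then restrict
# synthesis to envelopes that actually occur in the codebook.
import numpy as np

rng = np.random.default_rng(0)

def build_codebook(features, envelopes, n_codes=64, n_iter=20):
    """Quantize extended vectors [feature | envelope] with plain k-means."""
    extended = np.hstack([features, envelopes])
    # initialise codes from randomly chosen training vectors
    codes = extended[rng.choice(len(extended), size=n_codes, replace=False)]
    for _ in range(n_iter):
        # assign every extended vector to its nearest code (squared Euclidean)
        dists = ((extended[:, None, :] - codes[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(n_codes):
            members = extended[assign == k]
            if len(members):  # keep the old code if the cell is empty
                codes[k] = members.mean(axis=0)
    return codes

def snap_to_natural_envelope(converted_feature, codes, n_feat):
    """Pick the code whose feature part is nearest to the converted feature
    and return the natural envelope stored in that code's extension."""
    dists = ((codes[:, :n_feat] - converted_feature) ** 2).sum(axis=-1)
    return codes[dists.argmin(), n_feat:]  # envelope part of the best match

# Toy demonstration: 1000 frames, 13-dim features, 129-point log envelopes.
feats = rng.standard_normal((1000, 13))
envs = rng.standard_normal((1000, 129))
codebook = build_codebook(feats, envs)

converted = rng.standard_normal(13)  # a "converted" feature frame
env_hat = snap_to_natural_envelope(converted, codebook, n_feat=13)
print(env_hat.shape)  # -> (129,)
```

Because every envelope the synthesis step can emit was, up to quantization, actually observed in training, the reconstructed spectrum cannot drift into over-smoothed or widened-formant shapes; this closure of the output set under reconstruction is the property the \(\epsilon\)-closed sets are meant to guarantee.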



Acknowledgements

The authors thank the reviewers for their relevant and constructive remarks, which enabled them to significantly improve the quality of the paper.

Author information


Corresponding author

Correspondence to Abolghasem Sayadiyan.


About this article


Cite this article

Jannati, M. J., Sayadiyan, A., & Razi, A. Speech naturalness improvement via \(\epsilon\)-closed extended vectors sets in voice conversion systems. Multidimensional Systems and Signal Processing, 29, 385–403 (2018). https://doi.org/10.1007/s11045-017-0470-3


