Part of the book series: Springer Handbooks (SHB)

Abstract

Speech synthesis systems must generate natural-sounding speech from text. A key aspect of speech is prosody, which must be both natural (i.e., sound like a human) and meaningful (i.e., sound like a human who understands the content of the text). The computation of prosody from text can be divided into two steps: computing prosodic tags from text, and computing acoustic speech features from these tags. This chapter focuses on the latter. It first provides an overview of prosody in human-human communication, including the communicative functions of prosody and their acoustic correlates. It then surveys the methods that have been used for prosody generation in speech synthesis, both historically and currently. Special attention is paid to prosody generation in unit selection synthesis, in which large corpora are searched for fragments of speech that match the phonemes and prosodic tags computed from text and that optimize various cost functions, and in which prosody is not modeled and speech is not modified. We conclude the chapter by advocating hybrid approaches that combine the search capabilities of unit selection methods with the speech modification methods of more traditional approaches.

Abbreviations

CART: classification and regression tree
FC: functional contour
ML: maximum-likelihood
RMSE: root-mean-square error
TTS: text-to-speech
ToBI: tone and break indices

Author information

Corresponding authors

Correspondence to Dr. Jan van Santen, Taniya Mishra, or Dr. Esther Klabbers.

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Santen, J.v., Mishra, T., Klabbers, E. (2008). Prosodic Processing. In: Benesty, J., Sondhi, M.M., Huang, Y.A. (eds) Springer Handbook of Speech Processing. Springer Handbooks. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-49127-9_23

  • DOI: https://doi.org/10.1007/978-3-540-49127-9_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-49125-5

  • Online ISBN: 978-3-540-49127-9

  • eBook Packages: Engineering, Engineering (R0)
