Prosodic Processing

Santen, Jan van; Mishra, Taniya; Klabbers, Esther

doi:10.1007/978-3-540-49127-9_23

Part of the book series: Springer Handbooks (SHB)

Abstract

Speech synthesis systems have to generate natural-sounding speech output from text. One of the key aspects of speech is prosody, which must be both natural (i.e., sounding like a human) and meaningful (i.e., sounding like a human who understands the content of the text). The computation of prosody from text can be divided into the computation of prosodic tags from text and the computation of acoustic speech features from these tags. This chapter focuses on the latter. It provides an overview of prosody in human-human communication, including the communicative functions of prosody and its acoustic correlates. A historical overview follows of the methods that have been used for prosody generation in speech synthesis, along with current methods. Special attention is paid to prosody generation in unit selection synthesis methods, in which large corpora are searched for fragments of speech that match the phonemes and prosodic tags computed from text and that optimize various cost functions, and in which prosody is not modeled and speech is not modified. We conclude the chapter by advocating hybrid approaches in which the search capabilities of unit selection methods are combined with the speech modification methods of more traditional approaches.
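
The unit selection search mentioned in the abstract can be illustrated with a minimal sketch. It is not the chapter's method: the names `Unit`, `target_cost`, `join_cost`, and `select_units`, the choice of features (mean F0 and duration), and the cost weights are all assumptions made for illustration. The sketch finds, by dynamic programming, the sequence of corpus fragments that matches the target phonemes and minimizes the sum of a target cost (mismatch with the requested prosody) and a join cost (pitch discontinuity between neighbouring fragments).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One candidate speech fragment in a hypothetical corpus."""
    phoneme: str
    f0: float        # mean pitch in Hz
    duration: float  # seconds


def target_cost(unit, target_f0, target_dur):
    # Mismatch between a candidate and the prosodic target.
    # The scaling constants are illustrative, not taken from the chapter.
    return abs(unit.f0 - target_f0) / 100.0 + abs(unit.duration - target_dur) / 0.1


def join_cost(left, right):
    # Pitch discontinuity at the concatenation point (illustrative).
    return abs(left.f0 - right.f0) / 100.0


def select_units(targets, corpus):
    """targets: list of (phoneme, f0, duration) tuples.

    Returns (units, total_cost) for the cheapest path through the lattice.
    """
    # Candidates for each target position: corpus units with a matching phoneme.
    lattice = [[u for u in corpus if u.phoneme == ph] for ph, _, _ in targets]
    if any(not candidates for candidates in lattice):
        raise ValueError("no candidate unit for some target phoneme")

    # best[i][j] = (cost of cheapest path ending in lattice[i][j], back-pointer)
    _, f0, dur = targets[0]
    best = [[(target_cost(u, f0, dur), None) for u in lattice[0]]]
    for i in range(1, len(targets)):
        _, f0, dur = targets[i]
        row = []
        for u in lattice[i]:
            prev_cost, prev_j = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(lattice[i - 1])
            )
            row.append((prev_cost + target_cost(u, f0, dur), prev_j))
        best.append(row)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(lattice[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


corpus = [
    Unit("a", 120.0, 0.1), Unit("a", 200.0, 0.1),
    Unit("b", 125.0, 0.1), Unit("b", 210.0, 0.1),
]
units, total = select_units([("a", 120.0, 0.1), ("b", 130.0, 0.1)], corpus)
print([(u.phoneme, u.f0) for u in units], round(total, 3))
```

In this sketch the selected fragments are concatenated as found, with no signal modification. A hybrid approach of the kind the abstract advocates would follow the search with a speech modification step that moves the selected units closer to the prosodic target.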

Abbreviations

CART: classification and regression tree
FC: functional contour
ML: maximum-likelihood
RMSE: root-mean-square error
TTS: text-to-speech
ToBI: tone and break indices

Author information

Authors and Affiliations

OGI School of Science and Engineering, Department of Computer Science and Electrical Engineering, Oregon Health And Science University, 20000 NW Walker Rd, 97006-8921, Beaverton, OR, USA
Jan van Santen Dr.
Center for Spoken Language Understanding, Computer Science and Electrical Engineering, OGI School of Science and Engineering, Oregon Health and Science University, 20000 NW Walker Road, 97006, Beaverton, OR, USA
Taniya Mishra
Center for Spoken Language Understanding, OGI School of Science and Engineering, Oregon Health & Science University, 20000 NW Walker Rd, 97006, Beaverton, OR, USA
Esther Klabbers Dr.

Corresponding authors

Correspondence to Jan van Santen Dr., Taniya Mishra or Esther Klabbers Dr.

Editor information

Editors and Affiliations

INRS-EMT, University of Quebec, 800 de la Gauchetiere Ouest, H5A 1K6, Montreal, Quebec, Canada
Jacob Benesty Dr.
Avayalabs Research, 233 Mount Airy Road, 07920, Basking Ridge, NJ, USA
M. Mohan Sondhi Ph.D.
Alcatel-Lucent, Bell Laboratories, 600 Mountain Avenue, 07974, Murray Hill, NJ, USA
Yiteng Arden Huang Dr.

About this chapter

Cite this chapter

Santen, J.v., Mishra, T., Klabbers, E. (2008). Prosodic Processing. In: Benesty, J., Sondhi, M.M., Huang, Y.A. (eds) Springer Handbook of Speech Processing. Springer Handbooks. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-49127-9_23

DOI: https://doi.org/10.1007/978-3-540-49127-9_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49125-5
Online ISBN: 978-3-540-49127-9
eBook Packages: Engineering, Engineering (R0)