Abstract
This work addresses the automatic transcription of television and radio broadcasts in multiple languages. Transcribing such data is a major step in developing automatic tools for indexation and retrieval of the vast amounts of information generated daily. Radio and television broadcasts form a continuous data stream made up of segments of differing linguistic and acoustic nature, which poses challenges for transcription. Prior to word recognition, the data is partitioned into homogeneous acoustic segments. Non-speech segments are identified and removed, and the speech segments are clustered and labeled according to bandwidth and gender. Word recognition is carried out with a speaker-independent, large-vocabulary continuous speech recognizer that uses n-gram statistics for language modeling and continuous-density HMMs with Gaussian mixtures for acoustic modeling. This system has consistently obtained top-level performance in DARPA evaluations. Over 500 hours of unpartitioned, unrestricted American English broadcast data have been partitioned, transcribed and indexed, with an average word error rate of about 20%. With current IR technology, information retrieval performance on this data set shows essentially no degradation when automatic transcriptions are used in place of manual ones.
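To make the "word error" figure concrete, here is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The abstract does not spell out its scoring procedure, so this standard definition is an assumption, and the function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumes whitespace tokenization and equal cost (1) for each
    substitution, deletion, and insertion; real scoring tools also
    normalize text before aligning, which is omitted here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("stocks fell sharply on monday",
                          "stocks fell sharply on sunday"))
```

Under this definition, one misrecognized word in a five-word reference gives a rate of 20%, the order of magnitude reported above. Because insertions count as errors, the rate can exceed 100% on very poor output.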
Cite this article
Gauvain, J., Lamel, L. & Adda, G. Audio Partitioning and Transcription for Broadcast Data Indexation. Multimedia Tools and Applications 14, 187–200 (2001). https://doi.org/10.1023/A:1011303401042
DOI: https://doi.org/10.1023/A:1011303401042