Abstract
We present our ongoing research on speaker recognition in historical oral archives. This research is part of our long-term effort to enable versatile access to the archive of the Czech Radio (CRo). From a manually annotated partition of the archive, we compiled a database covering a time span of more than 30 years for our experimental study. This allowed us to investigate several aspects that make historical data challenging to process. We show the shift of scores for target (genuine) speaker trials introduced by the aging effect, by the signal-to-noise ratio, and by the variable amount of enrollment and test data. Scores for speaker detection trials were produced by a system based on the i-vector paradigm and probabilistic linear discriminant analysis. We also assessed the performance of this system on an evaluation database containing contemporary recordings collected over a time span of approximately 4 years. Although the system uses state-of-the-art techniques capable of dealing with nuisance inter-session variability, its performance degrades markedly on the evaluation containing historical data compared to the one containing contemporary data only. Specifically, the Equal Error Rate (EER) of the system rose from 1.93 % to 8.27 %. This difference shows that compensation techniques need to be employed to cope with the additional variability introduced into the historical data by various sources.
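For readers unfamiliar with the metric, the following is a minimal sketch of how an Equal Error Rate can be estimated from two sets of trial scores. It is not the evaluation code used in this work; the function name `equal_error_rate` and the convention that a trial is accepted when its score is greater than or equal to the threshold are assumptions made for illustration only.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER from target (genuine) and non-target (impostor) scores.

    Hypothetical helper for illustration: a trial is accepted when its
    score is >= the threshold. Returns (eer, threshold).
    """
    tar = [float(s) for s in target_scores]
    non = [float(s) for s in nontarget_scores]
    if not tar or not non:
        raise ValueError("both score lists must be non-empty")

    best = None
    # Every observed score is a candidate decision threshold.
    for t in sorted(set(tar + non)):
        p_miss = sum(s < t for s in tar) / len(tar)   # rejected target trials
        p_fa = sum(s >= t for s in non) / len(non)    # accepted non-target trials
        gap = abs(p_miss - p_fa)
        if best is None or gap < best[0]:
            # Report the mean of both error rates at the closest crossing.
            best = (gap, 0.5 * (p_miss + p_fa), t)
    return best[1], best[2]


if __name__ == "__main__":
    eer, thr = equal_error_rate([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.5, -1.0])
    print(f"EER = {100 * eer:.2f} % at threshold {thr}")
```

The sketch scans thresholds exhaustively, which is quadratic in the number of trials; for large trial lists a sort-based sweep or a dedicated toolkit would be the more practical choice.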
Notes
The original name of the company was the Radiojournal company.
In [19], the nuisance variability is expressed as the sum of an intra-speaker variability confined to a lower-dimensional subspace and a residual noise term that is assumed to be Gaussian with a diagonal covariance matrix. The full covariance matrix used in our case is thus simply a generalization of that model.
We used a relevance factor of 16.0 in our experiments.
We used the Bosaris toolkit available at https://sites.google.com/site/bosaristoolkit/ to plot our DET curves.
Please note that the results presented in [23] and in this work are not directly comparable. In [23] we used the test database in a two-fold cross-validation setup, with one fold used for calibration training and the other for testing, and vice versa. Here we pooled all the test data together. Furthermore, different development data sets were used in [23] and in this work. In the former study, the available data was much more limited, particularly for the estimation of intra-speaker variability, which requires recordings from multiple sessions per speaker.
Let us stress that we strictly distinguished between different excerpts and different sessions. Hence, no model was trained for a speaker whose multiple available excerpts were all drawn from a single session.
The linear regression fit curves displayed in the figure are not intended to represent the true dependency, only its trend.
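The note on the relevance factor can be illustrated with a short sketch. It assumes that the factor refers to mean-only MAP adaptation of a Gaussian mixture model in the style of Reynolds et al. (listed in the references); the notes do not state which parameters were adapted, so this is an assumption. The function name `map_adapt_means` and the use of precomputed sufficient statistics are likewise hypothetical choices made for illustration.

```python
def map_adapt_means(ubm_means, counts, first_order, relevance_factor=16.0):
    """Mean-only MAP adaptation from per-component sufficient statistics.

    Hypothetical helper, not the authors' implementation.
    ubm_means:    list of mean vectors of the background model
    counts:       soft frame counts n_k per component
    first_order:  per-component sums of posterior-weighted feature vectors
    """
    adapted = []
    for mu_k, n_k, f_k in zip(ubm_means, counts, first_order):
        # Data-dependent weight: more frames move the mean further
        # from the background model toward the observed data.
        alpha = n_k / (n_k + relevance_factor)
        if n_k > 0:
            data_mean = [x / n_k for x in f_k]
        else:
            data_mean = list(mu_k)  # no data assigned: keep the prior mean
        adapted.append(
            [alpha * d + (1.0 - alpha) * m for d, m in zip(data_mean, mu_k)]
        )
    return adapted


if __name__ == "__main__":
    means = map_adapt_means(
        ubm_means=[[0.0, 0.0], [1.0, 1.0]],
        counts=[16.0, 0.0],
        first_order=[[32.0, 16.0], [0.0, 0.0]],
    )
    print(means)
```

With a relevance factor of 16.0, a component that collects 16 soft frame counts is moved exactly halfway between its background mean and the data mean, while a component with no assigned frames is left unchanged.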
References
Arthur D, Vassilvitskii S (2007) K-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, SODA ’07. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp 1027–1035
Boháč M, Blavka K (2013) Text-to-speech alignment for imperfect transcriptions. In: Habernal I, Matoušek V (eds) Text, speech, and dialogue, lecture notes in computer science, vol 8082. Springer, Berlin Heidelberg, pp 536–543
Brümmer N (2009) EM for JFA. Tech. rep., South Africa. Available at https://sites.google.com/site/nikobrummer/EMforJFA.pdf?attredirects=0
Brümmer N, Burget L, Kenny P, Matějka P, de Villiers E, Karafiát M, Kockmann M, Glembek O, Plchot O, Baum D, Senoussaoui M (2010) ABC system description for NIST SRE 2010. In: Proc. NIST 2010 speaker recognition evaluation. Brno University of Technology, pp 1–20
Chaloupka J, Nouza J, Červa P, Málek J (2013) Downdating lexicon and language model for automatic transcription of Czech historical spoken documents. In: Habernal I, Matoušek V (eds) Text, speech, and dialogue, lecture notes in computer science, vol 8082. Springer, Berlin Heidelberg, pp 201–208
Dehak N, Kenny P, Dehak R, Dumouchel P, Ouellet P (2011) Front-end factor analysis for speaker verification. IEEE Trans Audio Speech Lang Process 19 (4):788–798
Doddington GR, Przybocki MA, Martin AF, Reynolds DA (2000) The NIST speaker recognition evaluation - overview, methodology, systems, results, perspective. Speech Commun 31 (2–3):225–254
Ferrer L, Graciarena M, Zymnis A, Shriberg E (2008) System combination using auxiliary information for speaker verification. In: IEEE international conference on acoustics, speech and signal processing - ICASSP 2008, Las Vegas, pp 4853–4856
Garcia-Romero D, Espy-Wilson CY (2011) Analysis of i-vector length normalization in speaker recognition systems. In: Interspeech’11. Florence, pp 249–252
Kanagasundaram A, Dean D, Gonzalez-Dominguez J, Sridharan S, Ramos D, Gonzalez-Rodriguez J (2013) Improving the PLDA based speaker verification in limited microphone data conditions. In: Interspeech 2013. International Speech Communication Association (ISCA), Lyon, pp 3674–3678
Kelly F, Drygajlo A, Harte N (2012) Speaker verification with long-term ageing data. In: 2012 5th IAPR international conference on biometrics (ICB), pp 478–483
Kelly F, Harte N (2011) Effects of long-term ageing on speaker verification. In: Vielhauer C, Dittmann J, Drygajlo A, Juul N, Fairhurst M (eds) Biometrics and ID management, lecture notes in computer science, vol 6583. Springer, Berlin Heidelberg, pp 113–124
Kenny P, Boulianne G, Dumouchel P (2005) Eigenvoice modeling with sparse training data. IEEE Trans Speech Audio Process 13 (3):345–354
Kenny P, Stafylakis T, Ouellet P, Alam M, Dumouchel P (2013) PLDA for speaker verification with utterances of arbitrary duration. In: 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 7649–7653
Kim C, Stern RM (2008) Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis. In: INTERSPEECH. ISCA, pp 2598–2601
Kinnunen T, Li H (2010) An overview of text-independent speaker recognition: from features to supervectors. Speech Commun 52:12–40
Matveev Y (2013) The problem of voice template aging in speaker recognition systems. In: Železný M, Habernal I, Ronzhin A (eds) Speech and computer, lecture notes in computer science, vol 8113. Springer International Publishing, pp 345–353
Nouza J, Blavka K, Bohac M, Cerva P, Zdansky J, Silovsky J, Prazak J (2012) Voice technology to enable sophisticated access to historical audio archive of the Czech Radio. In: Grana C, Cucchiara R (eds) Multimedia for cultural heritage, communications in computer and information science, vol 247. Springer, Berlin Heidelberg, pp 27–38
Prince SJD, Elder JH (2007) Probabilistic linear discriminant analysis for inferences about identity. In: Proceedings ICCV 2007, Rio de Janeiro, Brazil, pp 1–8
Rajan P, Kinnunen T, Hautamäki V (2013) Effect of multicondition training on i-vector PLDA configurations for speaker recognition. In: Interspeech’13. Lyon, pp 3694–3697
Reynolds DA, Quatieri TF, Dunn RB (2000) Speaker verification using adapted Gaussian mixture models. Digit Signal Process 10 (1–3):19–41
Sarkar AK, Matrouf D, Bousquet P-M, Bonastre J-F (2012) Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification. In: INTERSPEECH’12. ISCA, Portland, OR, USA
Silovsky J, Cerva P, Zdansky J (2009) Comparison of generative and discriminative approaches for speaker recognition with limited data. Radioengineering 18 (3):307–316
Silovsky J, Zdansky J, Nouza J, Cerva P, Prazak J (2012) Incorporation of the ASR output in speaker segmentation and clustering within the task of speaker diarization of broadcast streams. In: MMSP’12. Banff, pp 118–123
van Leeuwen DA, Brümmer N (2007) An introduction to application-independent evaluation of speaker recognition systems. Lect Notes Comput Sci 4343/2007:330–353
Acknowledgments
This research work was supported by the Czech Ministry of Culture (project no. DF11P01OVV013 in program NAKI).
Cite this article
Silovsky, J., Nouza, J. & Kucharova, M. Search for speaker identity in historical oral archives. Multimed Tools Appl 75, 3767–3786 (2016). https://doi.org/10.1007/s11042-014-2067-2
DOI: https://doi.org/10.1007/s11042-014-2067-2