Archives of Acoustics, 42, 1, pp. 127–135, 2017

Speaker Model Clustering to Construct Background Models for Speaker Verification

Adana Science and Technology University

Zekeriya TÜFEKCİ
Çukurova University

Çukurova University

Conventional speaker recognition systems use the Universal Background Model (UBM) as the imposter model for all speakers. In this paper, speaker models are clustered to obtain better imposter model representations for speaker verification. First, a UBM is trained, and speaker models are adapted from the UBM. Then, the $k$-means algorithm with the Euclidean distance measure is applied to the speaker models. The speakers are divided into two, three, four, and five clusters. The resulting cluster centers are used as the background models of their respective speakers. Experiments showed that the proposed method consistently produced lower Equal Error Rates (EER) than the conventional UBM approach for 3-, 10-, and 30-second test utterances, and also under channel mismatch conditions. The proposed method is also compared with the $i$-vector approach. The three-cluster model achieved the best performance, with a 12.4% relative EER reduction on average compared to the $i$-vector method. The statistical significance of the results is also reported.
Keywords: Gaussian mixture models; k-means; imposter models; speaker clustering; speaker verification
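
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each adapted speaker model is represented by a single fixed-length vector (for example, its stacked Gaussian means), and the data, dimensions, and names such as `kmeans_euclidean` and `speaker_models` are hypothetical. Only the use of $k$-means with the Euclidean distance, and of the cluster centers as background models, comes from the abstract.

```python
import numpy as np


def kmeans_euclidean(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means with Euclidean distance; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct speaker models chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every speaker model to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment so the labels always match the returned centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)


# Hypothetical stand-in data: 40 speakers, each a 64-dimensional vector
# obtained (by assumption) from a UBM-adapted model.
rng = np.random.default_rng(1)
ubm_mean = rng.normal(size=64)
offsets = rng.normal(scale=3.0, size=(3, 64))
speaker_models = np.vstack([
    ubm_mean + offsets[i % 3] + rng.normal(scale=0.3, size=64)
    for i in range(40)
])

# Three clusters, the configuration the abstract reports as best.
centers, labels = kmeans_euclidean(speaker_models, k=3)

# Each speaker is scored against the center of its own cluster
# instead of against one shared UBM.
background_for_speaker = centers[labels]
```

In this sketch, `background_for_speaker[i]` plays the role that the single UBM plays in the conventional system for speaker `i`. How the cluster centers are turned back into full background models is not specified in the abstract and is left out here.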
Copyright © Polish Academy of Sciences & Institute of Fundamental Technological Research (IPPT PAN).


DOI: 10.1515/aoa-2017-0014