Short Utterance Speaker Recognition Based on Speech High Frequency Information Compensation and Dynamic Feature Enhancement Methods

Yunfei ZI; Shengwu XIONG

doi:10.24425/aoa.2024.148768

Authors

Yunfei ZI, Wuhan University of Technology, China, ORCID 0000-0002-4778-7109
Shengwu XIONG, Wuhan University of Technology, China

Abstract

This work addresses two weaknesses of existing short-duration speaker recognition: feature sparsity and insufficiently discriminative acoustic features. We propose two acoustic feature extraction methods: the Bark-scaled Gauss and linear filter bank superposition cepstral coefficients (BGLCC), and the multidimensional central difference (MDCD). The Bark-scaled Gauss filter bank focuses on low-frequency information, while the linear filter bank is uniformly distributed across frequency; superposing the two therefore yields richer and more discriminative acoustic features from short-duration audio signals. In addition, the multidimensional central difference method better captures the dynamic features of speakers, improving the performance of short-utterance speaker verification. Extensive experiments are conducted on short-duration text-independent speaker verification datasets generated from the VoxCeleb, SITW, and NIST SRE corpora, which contain speech samples of diverse lengths and from different scenarios. The results demonstrate that the proposed method outperforms existing acoustic feature extraction approaches by at least 10% on the test set. Ablation experiments further show that the proposed approaches achieve substantial improvements over prior methods.
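
The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, since the abstract does not give the exact definitions of BGLCC or MDCD. Everything below is an assumption made for illustration: the Traunmüller Bark approximation, the Gaussian width, the triangular shape of the linear filters, reading "superposition" as an element-wise sum of the two filter banks, and reading "multidimensional central difference" as stacked first- and second-order central differences over time. All function names are hypothetical.

```python
import numpy as np

def hz_to_bark(f):
    # Traunmueller approximation of the Bark scale (an assumed choice).
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_gauss_bank(n_filters, n_fft, sr, width=0.5):
    # Gaussian filters with centres equally spaced on the Bark scale,
    # so they are denser at low frequencies when viewed in Hz.
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    z = hz_to_bark(freqs)
    centres = np.linspace(z[0], z[-1], n_filters + 2)[1:-1]
    sigma = width * (centres[1] - centres[0])  # assumed bandwidth rule
    return np.exp(-0.5 * ((z[None, :] - centres[:, None]) / sigma) ** 2)

def linear_tri_bank(n_filters, n_fft, sr):
    # Triangular filters uniformly spaced in Hz, covering 0 to sr/2.
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = np.linspace(0.0, sr / 2.0, n_filters + 2)
    lo, mid, hi = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (freqs[None, :] - lo) / (mid - lo)
    falling = (hi - freqs[None, :]) / (hi - mid)
    return np.clip(np.minimum(rising, falling), 0.0, None)

def superposed_cepstra(power_spec, sr, n_filters=24, n_ceps=13):
    # power_spec: (frames, n_fft // 2 + 1) power spectrogram.
    n_fft = 2 * (power_spec.shape[1] - 1)
    # Assumed reading of "superposition": add the two banks filter by filter.
    bank = bark_gauss_bank(n_filters, n_fft, sr) + linear_tri_bank(n_filters, n_fft, sr)
    log_energy = np.log(power_spec @ bank.T + 1e-10)
    # Unnormalised DCT-II basis to turn log energies into cepstral coefficients.
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(n_filters) + 0.5) / n_filters)
    return log_energy @ basis.T

def central_difference(feats):
    # First-order central difference along time, with edge frames repeated.
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def with_dynamics(feats):
    # Stack static, first-order, and second-order central differences.
    d1 = central_difference(feats)
    d2 = central_difference(d1)
    return np.concatenate([feats, d1, d2], axis=1)
```

For a power spectrogram `spec` of shape `(frames, 257)` at 16 kHz, `with_dynamics(superposed_cepstra(spec, sr=16000))` returns a `(frames, 39)` feature matrix. Summing the banks keeps the low-frequency emphasis of the Bark-spaced filters while the uniformly spaced linear filters contribute resolution at higher frequencies, which is one plausible way to realise the compensation the abstract describes.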

Keywords:

Bark-scaled Gauss, linear filter, filter bank superposition, multi-dimensional central difference, speaker recognition
