SpeakerNet for Cross-lingual Text-Independent Speaker Verification

Hafsa HABIB; Huma TAUSEEF; Muhammad Abuzar FAHIEM; Saima FARHAN; Ghousia USMAN

doi:10.24425/aoa.2020.134073

Authors

Hafsa HABIB, Lahore College for Women University, Pakistan
Huma TAUSEEF, Lahore College for Women University, Pakistan
Muhammad Abuzar FAHIEM, Lahore College for Women University, Pakistan
Saima FARHAN, Lahore College for Women University, Pakistan
Ghousia USMAN, Lahore College for Women University, Pakistan

Abstract

Biometrics provide an alternative to passwords and PINs for authentication. The emergence of machine learning algorithms provides an easy and economical solution to authentication problems. The phases of a speaker verification protocol are training, enrollment of speakers, and evaluation of an unknown voice. In this paper, we address text-independent speaker verification using a Siamese convolutional network. Siamese networks are twin networks with shared weights. Training these networks learns a feature space in which similar observations are placed in proximity. Features extracted by the Siamese network can then be classified using difference or correlation measures. We have implemented a customized scoring scheme that combines the Siamese network's capability of applying distance measures with convolutional learning. Experiments on cross-language audio recordings of multilingual speakers confirm the capability of our architecture to handle gender-, age-, and language-independent speaker verification. Moreover, our Siamese network, SpeakerNet, provided better results than existing speaker verification approaches, decreasing the equal error rate to 0.02.
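The two ideas the abstract relies on, a shared-weight twin branch scored by a distance measure and the equal error rate (EER) used to report results, can be illustrated with a minimal sketch. This is not the SpeakerNet architecture or its customized scoring scheme, neither of which is specified in the abstract. The single linear-plus-ReLU branch, the Euclidean distance, and the names `embed`, `siamese_distance`, and `equal_error_rate` are all assumptions made for illustration.

```python
import math


def embed(x, weights):
    """Hypothetical shared branch: one linear layer with ReLU, then L2 normalisation.

    `weights` is a list of rows, one row per output unit. A real Siamese
    convolutional network would use convolutional layers here instead.
    """
    h = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]
    norm = math.sqrt(sum(v * v for v in h)) or 1.0  # avoid division by zero
    return [v / norm for v in h]


def siamese_distance(x1, x2, weights):
    """Pass both inputs through the SAME weights and return their Euclidean distance."""
    e1, e2 = embed(x1, weights), embed(x2, weights)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(e1, e2)))


def equal_error_rate(distances, same_speaker):
    """Approximate the EER by scanning every observed distance as a threshold.

    A trial is accepted when its distance is at or below the threshold.
    The EER is taken at the threshold where the false acceptance rate (FAR)
    and false rejection rate (FRR) are closest, as the mean of the two.
    Assumes at least one genuine and one impostor trial.
    """
    genuine = [d for d, s in zip(distances, same_speaker) if s]
    impostor = [d for d, s in zip(distances, same_speaker) if not s]
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(distances)):
        far = sum(d <= t for d in impostor) / len(impostor)  # impostors accepted
        frr = sum(d > t for d in genuine) / len(genuine)     # genuine rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy inputs standing in for per-utterance feature vectors (illustrative only).
    W = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
    print(siamese_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], W))
    print(equal_error_rate([0.1, 0.2, 0.8, 0.9], [True, True, False, False]))
```

Because both inputs pass through the same weights, the distance is symmetric in its arguments and is zero for identical inputs, which is the property that lets a single threshold on the distance act as the accept/reject decision.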

Keywords:

convolutional neural network, deep learning, Siamese network, speaker verification, text-independent, binary operation, Urdu speaker recognition
