Archives of Acoustics, 45, 4, pp. 573–583, 2020

SpeakerNet for Cross-lingual Text-Independent Speaker Verification

Lahore College for Women University

Lahore College for Women University

Muhammad Abuzar FAHIEM
Lahore College for Women University

Lahore College for Women University

Ghousia USMAN
Lahore College for Women University

Biometrics provide an alternative to passwords and PINs for authentication. The emergence of machine learning algorithms provides an easy and economical solution to authentication problems. A speaker verification protocol has three phases: training, enrollment of speakers, and evaluation of an unknown voice. In this paper, we address text-independent speaker verification using a Siamese convolutional network. Siamese networks are twin networks with shared weights. Training these networks learns a feature space in which similar observations are placed in proximity. Features extracted by the Siamese network can then be classified using difference or correlation measures. We have implemented a customized scoring scheme that combines the Siamese network's capability of applying distance measures with convolutional learning. Experiments on cross-language audio recordings of multilingual speakers confirm the capability of our architecture to handle gender-, age-, and language-independent speaker verification. Moreover, our Siamese network, SpeakerNet, gave better results than existing speaker verification approaches, decreasing the equal error rate to 0.02.
Keywords: Convolutional Neural Network; Deep learning; Siamese network; speaker verification; text-independent; binary operation; Urdu speaker recognition
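
The two ideas the abstract relies on, a twin network with shared weights scored by a distance measure, and the equal error rate (EER) used to report results, can be illustrated with a minimal sketch. This is not the SpeakerNet architecture or its customized scoring scheme: the single linear layer with ReLU stands in for the convolutional branch, and the names make_weights, embed, siamese_distance and equal_error_rate are hypothetical helpers introduced only for illustration.

```python
import math
import random


def make_weights(n_in, n_out, seed=0):
    """Random weights for one toy branch (assumption: stands in for a trained CNN)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def embed(features, weights):
    """One branch of the twin network: linear map, ReLU, then L2 normalization."""
    out = [max(0.0, sum(w * x for w, x in zip(row, features))) for row in weights]
    norm = math.sqrt(sum(v * v for v in out)) or 1.0  # avoid division by zero
    return [v / norm for v in out]


def siamese_distance(a, b, weights):
    """Pass both inputs through the SAME weights and return the Euclidean distance."""
    ea, eb = embed(a, weights), embed(b, weights)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ea, eb)))


def equal_error_rate(genuine, impostor):
    """EER for distance scores, where a pair is accepted if distance <= threshold.

    Scans every observed score as a threshold and returns the error rate at the
    threshold where false acceptance and false rejection are closest.
    """
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(genuine) | set(impostor)):
        frr = sum(d > t for d in genuine) / len(genuine)      # same speaker rejected
        far = sum(d <= t for d in impostor) / len(impostor)   # different speaker accepted
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


if __name__ == "__main__":
    w = make_weights(n_in=4, n_out=3, seed=1)
    utterance_a = [0.2, 0.5, 0.1, 0.9]  # made-up feature vectors, not real audio
    utterance_b = [0.9, 0.1, 0.4, 0.3]
    print(siamese_distance(utterance_a, utterance_b, w))
    print(equal_error_rate([0.1, 0.2], [0.8, 0.9]))
```

Because both inputs share one set of weights, identical inputs always map to the same embedding and a distance of zero. An EER of 0.02, as reported in the abstract, means that at the operating threshold about 2% of same-speaker pairs are rejected and about 2% of different-speaker pairs are accepted.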



DOI: 10.24425/aoa.2020.134073

Copyright © Polish Academy of Sciences & Institute of Fundamental Technological Research (IPPT PAN)