Archives of Acoustics, 47, 2, pp. 181-189, 2022
DOI: 10.24425/aoa.2022.141648

Spoofed Speech Detection with Weighted Phase Features and Convolutional Networks

Gökay DİŞKEN
Adana Alparslan Türkeş Science and Technology University
Turkey

Detection of audio spoofing attacks has become vital for automatic speaker verification systems. Spoofing attacks can be produced in several ways, such as speech synthesis, voice conversion, replay, and mimicry. Extracting discriminative features from speech data can improve the accuracy of detecting these attacks. Indeed, a frame-wise weighted magnitude spectrum has recently been found effective for detecting replay attacks. In this work, discriminative features are obtained in a similar fashion (frame-wise weighting); however, a cosine normalized phase spectrum is used instead, since phase-based features have shown decent performance for this task. The extracted features are then fed as input to a convolutional neural network. In the experiments, the ASVspoof 2015 and 2017 databases are used to investigate the proposed system's spoof detection performance for synthetic and replay attacks, respectively. The results showed that the proposed approach achieved a 34.5% relative decrease in the average equal error rate (EER) on the ASVspoof 2015 evaluation set compared with the ordinary cosine normalized phase features. Furthermore, the proposed system outperformed the other systems at detecting the S10 attack type of the ASVspoof 2015 database.
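The feature pipeline summarized above (cosine normalized phase, frame-wise weighting, then a 2-D input for a CNN) can be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the function names, the framing parameters (25 ms frames with a 10 ms hop at 16 kHz), the 512-point FFT, the 30 retained coefficients and, above all, the frame weights (each frame scaled by its energy relative to the loudest frame) are hypothetical choices, because the abstract does not specify the weighting scheme. The cosine normalization follows the commonly described cos-phase recipe: take the cosine of the phase spectrum, then apply a DCT along frequency.

```python
import numpy as np


def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    x = np.pad(x, (0, max(0, frame_len - len(x))))  # guarantee at least one full frame
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def dct_ii_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    c[0, :] /= np.sqrt(2.0)  # DC row scaled to sqrt(1/n) for orthonormality
    return c


def weighted_cos_phase_features(x, frame_len=400, hop=160, n_fft=512, n_ceps=30):
    """HYPOTHETICAL frame-weighted cos-phase cepstrum, shape (frames, n_ceps)."""
    frames = frame_signal(x, frame_len, hop) * np.hamming(frame_len)
    spec = np.fft.rfft(frames, n=n_fft, axis=1)  # complex STFT, (T, n_fft // 2 + 1)
    # Cosine normalization maps the phase into [-1, 1]; cos is 2*pi-periodic,
    # so unwrapping the phase first would not change these values.
    cos_phase = np.cos(np.angle(spec))
    # ASSUMED frame weights: frame energy relative to the loudest frame, in [0, 1].
    energy = np.sum(np.abs(spec) ** 2, axis=1)
    weights = energy / (energy.max() + 1e-12)
    weighted = weights[:, None] * cos_phase
    # DCT-II along frequency, keeping the first n_ceps coefficients of each frame.
    basis = dct_ii_matrix(weighted.shape[1])[:n_ceps]
    return weighted @ basis.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.standard_normal(16000)  # 1 s of white noise standing in for 16 kHz speech
    print(weighted_cos_phase_features(audio).shape)  # (98, 30)
```

For one second of 16 kHz audio this yields a 98 × 30 matrix (frames × coefficients), which a CNN can treat as a single-channel image. The network architecture is not described in the abstract, so it is left out of this sketch.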
Keywords: spoofing detection; cosine normalized cepstrum; convolutional neural networks


Copyright © Polish Academy of Sciences & Institute of Fundamental Technological Research (IPPT PAN)