Abstract
This article addresses the problem of detecting speech segments in an acoustic signal and analyzes decision fusion for a group of voice activity detectors (VADs). We designed ten new VADs using three types of neural network architectures and three time-frequency signal representations. One of the proposed models achieves higher classification accuracy than competing solutions. We then used our VAD models to analyze decision fusion and improve the final classification decision, employing gradient-free and gradient-based optimizers with different objective functions. The analysis revealed the influence of individual classifiers on the final decision and the potential gains or losses resulting from VAD fusion. Compared with existing models, the proposed models achieve higher classification accuracy at the cost of increased memory requirements; the final choice of a specific model depends on the constraints of the platform on which the VAD system will be deployed.
Keywords:
voice activity detection (VAD), deep neural networks, data fusion

