Abstract
This article addresses the problem of detecting speech segments in an acoustic signal and analyzes decision fusion for a group of voice activity detectors (VADs). We designed ten new VADs using three types of neural network architectures and three time-frequency signal representations. One of the proposed models achieves higher classification accuracy than competing solutions. We then used our VAD models to analyze decision fusion and improve the final classification decision, employing gradient-free and gradient-based optimizers with different objective functions. The analysis revealed the influence of individual classifiers on the final decision and the potential gains or losses resulting from VAD fusion. Compared with existing models, the proposed models achieve higher classification accuracy at the cost of increased memory requirements; the final choice of a specific model depends on the constraints of the platform on which the VAD system will be deployed.
Keywords:
voice activity detection (VAD), deep neural networks, data fusion

