Benchmarking the First Realistic Dataset for Speech Separation

Authors

  • Rawad MELHEM, Higher Institute for Applied Sciences and Technology, Syria
  • Oumayma AL DAKKAK, Higher Institute for Applied Sciences and Technology, Syria
  • Assef JAFAR, Higher Institute for Applied Sciences and Technology, Syria

Abstract

This paper presents a thorough benchmarking analysis of a recently introduced realistic dataset for speech separation tasks. The dataset contains audio mixtures that replicate real-life scenarios and is accompanied by ground truths, making it a valuable resource for researchers. Although the dataset construction methodology was recently disclosed, its benchmarking and detailed performance analysis had not yet been conducted. In this study, we evaluate the performance of four speech separation models on two distinct testing sets, ensuring a robust evaluation. Our findings underscore the dataset’s efficacy in advancing speech separation research within authentic environments. Furthermore, we propose a novel approach for assessing metrics in real-world speech separation systems, where ground truths are unavailable. This method aims to improve the accuracy of evaluations and to refine models for practical applications. We make the dataset publicly available to encourage innovation and collaboration in the field.
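
For readers who wish to reproduce this style of evaluation, the sketch below shows how the standard separation metrics cited in this paper (SI-SDR, PESQ, and STOI) can be computed for a two-speaker mixture when ground truths are available. It is a minimal illustration, not the authors' evaluation code: it assumes 16 kHz mono NumPy arrays and the third-party pesq and pystoi packages, and the helper names si_sdr and evaluate_pair are our own.

    # Minimal sketch (not the authors' code): score a 2-speaker separation
    # with SI-SDR, PESQ, and STOI, resolving the output-to-speaker
    # permutation by best mean SI-SDR, as is common in separation work.
    from itertools import permutations

    import numpy as np
    from pesq import pesq    # pip install pesq
    from pystoi import stoi  # pip install pystoi

    FS = 16000  # sample rate assumed by the PESQ wideband mode below

    def si_sdr(reference, estimate):
        """Scale-invariant SDR in dB (Le Roux et al., 2019)."""
        reference = reference - reference.mean()
        estimate = estimate - estimate.mean()
        # Project the estimate onto the reference to isolate the target part.
        alpha = np.dot(estimate, reference) / np.dot(reference, reference)
        target = alpha * reference
        noise = estimate - target
        return 10 * np.log10(np.dot(target, target) / np.dot(noise, noise))

    def evaluate_pair(references, estimates):
        """Best-permutation SI-SDR / PESQ / STOI for a two-speaker mixture."""
        best_perm, best_sdrs = None, None
        for perm in permutations(range(len(estimates))):
            sdrs = [si_sdr(ref, estimates[p]) for ref, p in zip(references, perm)]
            if best_sdrs is None or np.mean(sdrs) > np.mean(best_sdrs):
                best_perm, best_sdrs = perm, sdrs
        return {
            "si_sdr": float(np.mean(best_sdrs)),
            "pesq": float(np.mean([pesq(FS, ref, estimates[p], "wb")
                                   for ref, p in zip(references, best_perm)])),
            "stoi": float(np.mean([stoi(ref, estimates[p], FS)
                                   for ref, p in zip(references, best_perm)])),
        }

On real recordings, where no clean reference exists, intrusive metrics such as these cannot be computed directly; that gap is what the assessment approach proposed in the paper is meant to address.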

Keywords:

single-channel, speech separation, deep learning, corpus, datasets

References


  1. Barker J., Marxer R., Vincent E., Watanabe S. (2015), The third ’CHiME’ speech separation and recognition challenge: Dataset, task and baselines, [in:] 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), https://doi.org/10.1109/ASRU.2015.7404837.

  2. Barker J., Watanabe S., Vincent E., Trmal J. (2018), The fifth ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines, [in:] Proceedings of Interspeech, pp. 1561–1565, https://doi.org/10.21437/Interspeech.2018-1768.

  3. Brandschain L., Graff D., Cieri C., Walker K., Caruso C., Neely A. (2010), Mixer 6, [in:] Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10).

  4. Cornell S. et al. (2024), The CHiME-8 DASR challenge for generalizable and array agnostic distant automatic speech recognition and diarization, [in:] 8th International Workshop on Speech Processing in Everyday Environments (CHiME 2024), https://doi.org/10.21437/CHiME.2024-1.

  5. Cornell S. et al. (2023), The CHiME-7 DASR challenge: Distant meeting transcription with multiple devices in diverse scenarios, [in:] 7th International Workshop on Speech Processing in Everyday Environments (CHiME 2023), https://doi.org/10.21437/CHiME.2023-1.

  6. Cosentino J., Pariente M., Cornell S., Deleforge A., Vincent E. (2020), LibriMix: An open-source dataset for generalizable speech separation, arXiv, https://doi.org/10.48550/arXiv.2005.11262.

  7. Han J., Long Y. (2023), Heterogeneous separation consistency training for adaptation of unsupervised speech separation, EURASIP Journal on Audio, Speech, and Music Processing, 2023: 6, https://doi.org/10.1186/s13636-023-00273-y.

  8. Hershey J.R., Chen Z., Le Roux J., Watanabe S. (2016), Deep clustering: Discriminative embeddings for segmentation and separation, [in:] IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 31–35, https://doi.org/10.1109/ICASSP.2016.7471631.

  9. Ismael N., Kadhim H.M. (2024), NNMF with speaker clustering in a uniform filter-bank for blind speech separation, Iraqi Journal for Electrical and Electronic Engineering, 20(1): 111–121, https://doi.org/10.37917/ijeee.20.1.12.

  10. Kariyappa S. et al. (2023), Cocktail party attack: Breaking aggregation-based privacy in federated learning using independent component analysis, [in:] Proceedings of the 40th International Conference on Machine Learning, pp. 15884–15899.

  11. Le Roux J., Wisdom S., Erdogan H., Hershey J.R. (2019), SDR – Half-baked or well done?, [in:] ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://doi.org/10.1109/ICASSP.2019.8683855.

  12. Maciejewski M., Wichern G., McQuinn E., Le Roux J. (2020), WHAMR!: Noisy and reverberant single-channel speech separation, [in:] ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://doi.org/10.1109/ICASSP40776.2020.9053327.

  13. Melhem R., Hamadeh R., Jafar A. (2024a), Study of the performance of CEEMDAN in underdetermined speech separation, arXiv, https://doi.org/10.48550/arXiv.2411.11312.

  14. Melhem R., Jafar A., Dakkak O.A. (2024b), Towards solving cocktail-party: The first method to build a realistic dataset with ground truths for speech separation, Romanian Journal of Acoustics and Vibration, 20(1): 103–113.

  15. Melhem R., Jafar A., Hamadeh R. (2021), Improving deep attractor network by BGRU and GMM for speech separation, Journal of Harbin Institute of Technology (New Series), 28(3): 90–96, https://doi.org/10.11916/j.issn.1005-9113.2019044.

  16. Nagrani A., Chung J.S., Zisserman A. (2017), VoxCeleb: A large-scale speaker identification dataset, [in:] Proceedings of INTERSPEECH 2017, pp. 2616–2620, https://doi.org/10.21437/Interspeech.2017-950.

  17. Panayotov V., Chen G., Povey D., Khudanpur S. (2015), Librispeech: An ASR corpus based on public domain audio books, [in:] 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210, https://doi.org/10.1109/ICASSP.2015.7178964.

  18. Rix A.W., Beerends J.G., Hollier M.P., Hekstra A.P. (2001), Perceptual evaluation of speech quality (PESQ) – A new method for speech quality assessment of telephone networks and codecs, [in:] 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 749–752, https://doi.org/10.1109/ICASSP.2001.941023.

  19. Saijo K., Wichern G., Germain F.G., Pan Z., Le Roux J. (2024), TF-Locoformer: Transformer with local modeling by convolution for speech separation and enhancement, [in:] 2024 18th International Workshop on Acoustic Signal Enhancement (IWAENC), https://doi.org/10.1109/IWAENC61483.2024.10694313.

  20. Subakan C., Ravanelli M., Cornell S., Bronzi M., Zhong J. (2020), Attention is all you need in speech separation, [in:] ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://doi.org/10.1109/ICASSP39728.2021.9413901.

  21. Subakan C., Ravanelli M., Cornell S., Grondin F. (2021), Real-M: Towards speech separation on real mixtures, [in:] ICASSP 2022 – 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://doi.org/10.1109/ICASSP43922.2022.9746662.

  22. Taal C.H., Hendriks R.C., Heusdens R., Jensen J. (2010), A short-time objective intelligibility measure for time-frequency weighted noisy speech, [in:] 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4214–4217, https://doi.org/10.1109/ICASSP.2010.5495701.

  23. Wang Z.-Q. (2024), Mixture to mixture: Leveraging close-talk mixtures as weak-supervision for speech separation, IEEE Signal Processing Letters, 31: 1715–1719, https://doi.org/10.1109/LSP.2024.3417284.

  24. Wang Z.-Q., Cornell S., Choi S., Lee Y., Kim B.-Y., Watanabe S. (2023), TF-GridNet: Making time-frequency domain models great again for monaural speaker separation, [in:] ICASSP 2023 – 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://doi.org/10.1109/ICASSP49357.2023.10094992.

  25. Wang Z.-Q., Watanabe S. (2023), UNSSOR: Unsupervised neural speech separation by leveraging overdetermined training mixtures, arXiv, https://doi.org/10.48550/arXiv.2305.20054.

  26. Wichern G. et al. (2019), WHAM!: Extending speech separation to noisy environments, [in:] Proceedings of Interspeech 2019, https://doi.org/10.21437/Interspeech.2019-2821.

  27. Chen Z., Luo Y., Mesgarani N. (2017), Deep attractor network for single-microphone speaker separation, [in:] 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://doi.org/10.1109/ICASSP.2017.7952155.