Benchmarking the First Realistic Dataset for Speech Separation

Authors

  • Rawad MELHEM, Higher Institute for Applied Sciences and Technology, Syria
  • Oumayma AL DAKKAK, Higher Institute for Applied Sciences and Technology, Syria
  • Assef JAFAR, Higher Institute for Applied Sciences and Technology, Syria

Abstract

This paper presents a thorough benchmarking analysis of a recently introduced realistic dataset for speech separation tasks. The dataset contains audio mixtures that replicate real-life scenarios and is accompanied by ground truths, making it a valuable resource for researchers. Although the dataset construction methodology was recently disclosed, its benchmarking and detailed performance analysis had not yet been conducted. In this study, we evaluate the performance of four speech separation models on two distinct testing sets, ensuring a robust evaluation. Our findings underscore the dataset’s efficacy in advancing speech separation research within authentic environments. Furthermore, we propose a novel approach for assessing metrics in real-world speech separation systems, where ground truths are unavailable. This method aims to improve the accuracy of evaluations and to refine models for practical applications. We make the dataset publicly available to encourage innovation and collaboration in the field.
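
For readers who wish to reproduce this style of evaluation, the sketch below shows how the standard separation metrics cited in this paper (SI-SDR, PESQ, and STOI) can be computed for a two-speaker mixture when ground truths are available. It is a minimal illustration, not the authors' evaluation code: it assumes 16 kHz mono NumPy arrays and the third-party pesq and pystoi packages, and the helper names si_sdr and evaluate_pair are our own.

    # Minimal sketch (not the authors' code): score a 2-speaker separation
    # with SI-SDR, PESQ, and STOI, resolving the output-to-speaker
    # permutation by best mean SI-SDR, as is common in separation work.
    from itertools import permutations

    import numpy as np
    from pesq import pesq    # pip install pesq
    from pystoi import stoi  # pip install pystoi

    FS = 16000  # sample rate assumed by the PESQ wideband mode below

    def si_sdr(reference, estimate):
        """Scale-invariant SDR in dB (Le Roux et al., 2019)."""
        reference = reference - reference.mean()
        estimate = estimate - estimate.mean()
        # Project the estimate onto the reference to isolate the target part.
        alpha = np.dot(estimate, reference) / np.dot(reference, reference)
        target = alpha * reference
        noise = estimate - target
        return 10 * np.log10(np.dot(target, target) / np.dot(noise, noise))

    def evaluate_pair(references, estimates):
        """Best-permutation SI-SDR / PESQ / STOI for a two-speaker mixture."""
        best_perm, best_sdrs = None, None
        for perm in permutations(range(len(estimates))):
            sdrs = [si_sdr(ref, estimates[p]) for ref, p in zip(references, perm)]
            if best_sdrs is None or np.mean(sdrs) > np.mean(best_sdrs):
                best_perm, best_sdrs = perm, sdrs
        return {
            "si_sdr": float(np.mean(best_sdrs)),
            "pesq": float(np.mean([pesq(FS, ref, estimates[p], "wb")
                                   for ref, p in zip(references, best_perm)])),
            "stoi": float(np.mean([stoi(ref, estimates[p], FS)
                                   for ref, p in zip(references, best_perm)])),
        }

On real recordings, where no clean reference exists, intrusive metrics such as these cannot be computed directly; that gap is what the assessment approach proposed in the paper is meant to address.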

Keywords:

single-channel, speech separation, deep learning, corpus, datasets

References


  1. Barker J., Marxer R., Vincent E., Watanabe S. (2015), The third ’CHiME’ speech separation and recognition challenge: Dataset, task and baselines, [in:] 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), https://doi.org/10.1109/ASRU.2015.7404837.

  2. Barker J., Watanabe S., Vincent E., Trmal J. (2018), The fifth ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines, [in:] Proceedings of Interspeech, pp. 1561–1565, https://doi.org/10.21437/Interspeech.2018-1768.

  3. Brandschain L., Graff D., Cieri C., Walker K., Caruso C., Neely A. (2010), Mixer 6, [in:] Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10).

  4. Cornell S. et al. (2024), The CHiME-8 DASR challenge for generalizable and array agnostic distant automatic speech recognition and diarization, [in:] 8th International Workshop on Speech Processing in Everyday Environments (CHiME 2024), https://doi.org/10.21437/CHiME.2024-1.

  5. Cornell S. et al. (2023), The CHiME-7 DASR challenge: Distant meeting transcription with multiple devices in diverse scenarios, [in:] 7th International Workshop on Speech Processing in Everyday Environments (CHiME 2023), https://doi.org/10.21437/CHiME.2023-1.

  6. Cosentino J., Pariente M., Cornell S., Deleforge A., Vincent E. (2020), LibriMix: An open-source dataset for generalizable speech separation, arXiv, https://doi.org/10.48550/arXiv.2005.11262.

  7. Han J., Long Y. (2023), Heterogeneous separation consistency training for adaptation of unsupervised speech separation, EURASIP Journal on Audio, Speech, and Music Processing, 2023: 6, https://doi.org/10.1186/s13636-023-00273-y.

  8. Hershey J.R., Chen Z., Le Roux J., Watanabe S. (2016), Deep clustering: Discriminative embeddings for segmentation and separation, [in:] IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 31–35, https://doi.org/10.1109/ICASSP.2016.7471631.

  9. Ismael N., Kadhim H.M. (2024), NNMF with speaker clustering in a uniform filter-bank for blind speech separation, Iraqi Journal for Electrical and Electronic Engineering, 20(1): 111–121, https://doi.org/10.37917/ijeee.20.1.12.

  10. Kariyappa S. et al. (2023), Cocktail party attack: Breaking aggregation-based privacy in federated learning using independent component analysis, [in:] Proceedings of the 40th International Conference on Machine Learning, pp. 15884–15899.

  11. Le Roux J., Wisdom S., Erdogan H., Hershey J.R. (2019), SDR – Half-baked or well done?, [in:] ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://doi.org/10.1109/ICASSP.2019.8683855.

  12. Maciejewski M., Wichern G., McQuinn E., Le Roux J. (2020), WHAMR!: Noisy and reverberant single-channel speech separation, [in:] ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://doi.org/10.1109/ICASSP40776.2020.9053327.

  13. Melhem R., Hamadeh R., Jafar A. (2024a), Study of the performance of CEEMDAN in underdetermined speech separation, arXiv, https://doi.org/10.48550/arXiv.2411.11312.

  14. Melhem R., Jafar A., Dakkak O.A. (2024b), Towards solving cocktail-party: The first method to build a realistic dataset with ground truths for speech separation, Romanian Journal of Acoustics and Vibration, 20(1): 103–113.

  15. Melhem R., Jafar A., Hamadeh R. (2021), Improving deep attractor network by BGRU and GMM for speech separation, Journal of Harbin Institute of Technology (New Series), 28(3): 90–96, https://doi.org/10.11916/j.issn.1005-9113.2019044.

  16. Nagrani A., Chung J.S., Zisserman A. (2017), VoxCeleb: A large-scale speaker identification dataset, [in:] Proceedings of INTERSPEECH 2017, pp. 2616–2620, https://doi.org/10.21437/Interspeech.2017-950.

  17. Panayotov V., Chen G., Povey D., Khudanpur S. (2015), Librispeech: An ASR corpus based on public domain audio books, [in:] 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210, https://doi.org/10.1109/ICASSP.2015.7178964.

  18. Rix A.W., Beerends J.G., Hollier M.P., Hekstra A.P. (2001), Perceptual evaluation of speech quality (PESQ) – A new method for speech quality assessment of telephone networks and codecs, [in:] 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 749–752, https://doi.org/10.1109/ICASSP.2001.941023.

  19. Saijo K., Wichern G., Germain F.G., Pan Z., Le Roux J. (2024), TF-Locoformer: Transformer with local modeling by convolution for speech separation and enhancement, [in:] 2024 18th International Workshop on Acoustic Signal Enhancement (IWAENC), https://doi.org/10.1109/IWAENC61483.2024.10694313.

  20. Subakan C., Ravanelli M., Cornell S., Bronzi M., Zhong J. (2020), Attention is all you need in speech separation, [in:] ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://doi.org/10.1109/ICASSP39728.2021.9413901.

  21. Subakan C., Ravanelli M., Cornell S., Grondin F. (2021), Real-M: Towards speech separation on real mixtures, [in:] ICASSP 2022 – 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://doi.org/10.1109/ICASSP43922.2022.9746662.

  22. Taal C.H., Hendriks R.C., Heusdens R., Jensen J. (2010), A short-time objective intelligibility measure for time-frequency weighted noisy speech, [in:] 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4214–4217, https://doi.org/10.1109/ICASSP.2010.5495701.

  23. Wang Z.-Q. (2024), Mixture to mixture: Leveraging close-talk mixtures as weak-supervision for speech separation, IEEE Signal Processing Letters, 31: 1715–1719, https://doi.org/10.1109/LSP.2024.3417284.

  24. Wang Z.-Q., Cornell S., Choi S., Lee Y., Kim B.-Y., Watanabe S. (2023), TF-GridNet: Making time-frequency domain models great again for monaural speaker separation, [in:] ICASSP 2023 – 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://doi.org/10.1109/ICASSP49357.2023.10094992.

  25. Wang Z.-Q., Watanabe S. (2023), UNSSOR: Unsupervised neural speech separation by leveraging overdetermined training mixtures, arXiv, https://doi.org/10.48550/arXiv.2305.20054.

  26. Wichern G. et al. (2019), WHAM!: Extending speech separation to noisy environments, [in:] Proceedings of Interspeech 2019, https://doi.org/10.21437/Interspeech.2019-2821.

  27. Chen Z., Luo Y., Mesgarani N. (2017), Deep attractor network for single-microphone speaker separation, [in:] 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://doi.org/10.1109/ICASSP.2017.7952155.