Speech Enhancement Based on Discrete Wavelet Packet Transform and Itakura-Saito Nonnegative Matrix Factorisation
Abstract
Nonnegative matrix factorisation (NMF) is one of the most popular machine learning tools for speech enhancement (SE). However, two problems limit the performance of traditional NMF-based SE algorithms. The first is the overlap-and-add operation used in signal reconstruction based on the short-time Fourier transform (STFT); the second is the Euclidean distance commonly used as the objective function. Both can introduce distortion into the SE process. To overcome these shortcomings, we propose a novel joint SE framework that combines the discrete wavelet packet transform (DWPT) and Itakura-Saito nonnegative matrix factorisation (ISNMF). In this approach, the speech signal is first split into a series of subband signals using the DWPT. The ISNMF is then used to enhance the speech in each subband signal. Finally, the inverse DWPT (IDWPT) is used to reconstruct the signal from the enhanced subband signals. The experimental results show that the proposed joint framework improves speech enhancement performance and performs better than traditional NMF methods in the unseen-noise case.
Keywords:
speech enhancement, discrete wavelet packet transform, nonnegative matrix factorisation, Itakura-Saito divergence
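The two building blocks named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Haar wavelet, the squared-coefficient "power" matrix, the rank, and all function names (haar_split, dwpt, idwpt, isnmf) are assumptions chosen for illustration, and the ISNMF step uses the majorisation-minimisation form of the multiplicative update (square-root exponent) rather than whatever variant the paper uses. The sketch shows that a wavelet packet tree is inverted without any overlap-and-add step, and that the Itakura-Saito divergence decreases under the updates.

```python
import math
import random

SQRT2 = math.sqrt(2.0)

def haar_split(x):
    """One Haar analysis step: low-pass and high-pass halves of x."""
    pairs = [(x[i], x[i + 1]) for i in range(0, len(x) - 1, 2)]
    return ([(a + b) / SQRT2 for a, b in pairs],
            [(a - b) / SQRT2 for a, b in pairs])

def haar_merge(low, high):
    """Inverse of haar_split: rebuild each sample pair and interleave."""
    out = []
    for lo, hi in zip(low, high):
        out += [(lo + hi) / SQRT2, (lo - hi) / SQRT2]
    return out

def dwpt(x, level):
    """Full packet tree: every node is split, giving 2**level subbands.
    Assumes len(x) is divisible by 2**level."""
    bands = [list(x)]
    for _ in range(level):
        bands = [half for b in bands for half in haar_split(b)]
    return bands

def idwpt(bands):
    """Merge sibling subbands pairwise until one signal remains."""
    bands = [list(b) for b in bands]
    while len(bands) > 1:
        bands = [haar_merge(bands[i], bands[i + 1])
                 for i in range(0, len(bands), 2)]
    return bands[0]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def is_divergence(V, Vh):
    """Itakura-Saito divergence: sum of v/vh - log(v/vh) - 1."""
    return sum(v / vh - math.log(v / vh) - 1.0
               for rv, rh in zip(V, Vh) for v, vh in zip(rv, rh))

def _scaled(M, num, den, eps):
    # Element-wise M * sqrt(num / den); the square root is the MM exponent.
    return [[m * math.sqrt(n / (d + eps)) for m, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(M, num, den)]

def isnmf(V, rank, n_iter=30, seed=0, eps=1e-12):
    """Factor a strictly positive matrix V (F x N) as W (F x rank) times
    H (rank x N) by reducing the Itakura-Saito divergence."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in V]
    H = [[rng.random() + 0.1 for _ in V[0]] for _ in range(rank)]
    history = [is_divergence(V, matmul(W, H))]
    for _ in range(n_iter):
        for update_h in (True, False):
            Vh = matmul(W, H)
            grad_neg = [[v / (vh * vh) for v, vh in zip(rv, rh)]
                        for rv, rh in zip(V, Vh)]
            grad_pos = [[1.0 / vh for vh in rh] for rh in Vh]
            if update_h:
                Wt = transpose(W)
                H = _scaled(H, matmul(Wt, grad_neg), matmul(Wt, grad_pos), eps)
            else:
                Ht = transpose(H)
                W = _scaled(W, matmul(grad_neg, Ht), matmul(grad_pos, Ht), eps)
        history.append(is_divergence(V, matmul(W, H)))
    return W, H, history

# Toy input (an assumption): two sinusoids, 64 samples, 3 levels -> 8 subbands.
signal = [math.sin(0.3 * n) + 0.5 * math.sin(1.7 * n) for n in range(64)]
bands = dwpt(signal, level=3)
# Hypothetical nonnegative representation: squared coefficients plus a floor.
power = [[c * c + 1e-6 for c in band] for band in bands]
W, H, history = isnmf(power, rank=2)
error = max(abs(a - b) for a, b in zip(signal, idwpt(bands)))
print("max reconstruction error:", error)
print("IS divergence: %.4f -> %.4f" % (history[0], history[-1]))
```

A complete enhancement system would additionally need speech and noise dictionaries learned from training data and a rule for modifying each subband before the inverse transform; those steps are not specified in the abstract and are left out here.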

