Archives of Acoustics, 45, 4, pp. 565–572, 2020
10.24425/aoa.2020.134072

Speech Enhancement Based on Discrete Wavelet Packet Transform and Itakura-Saito Nonnegative Matrix Factorisation

Houguang LIU
China University of Mining and Technology
China

Wenbo WANG
China University of Mining and Technology
China

Lin XUE
China University of Mining and Technology
China

Jianhua YANG
China University of Mining and Technology
China

Zhihua WANG
China University of Mining and Technology
China

Chunli HUA
China University of Mining and Technology
China

Nonnegative matrix factorisation (NMF) is one of the most popular machine learning tools for speech enhancement (SE). However, two problems limit the performance of traditional NMF-based SE algorithms. One is the overlap-and-add operation used in signal reconstruction based on the short-time Fourier transform (STFT); the other is the Euclidean distance commonly used as the objective function. Both can cause distortion in the SE process. To overcome these shortcomings, we propose a novel joint SE framework that combines the discrete wavelet packet transform (DWPT) and Itakura-Saito nonnegative matrix factorisation (ISNMF). In this approach, the speech signal is first split into a series of subband signals using the DWPT. Then, ISNMF is used to enhance the speech in each subband signal. Finally, the inverse DWPT (IDWT) is used to reconstruct the speech from these enhanced subband signals. The experimental results show that the proposed joint framework effectively improves speech enhancement performance and performs better than traditional NMF methods in the unseen noise case.
Keywords: speech enhancement; discrete wavelet packet transform; nonnegative matrix factorisation; Itakura-Saito divergence
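
The two building blocks named in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the wavelet, the decomposition depth, or the ISNMF update rule, so the sketch assumes a Haar filter, depth 2, and the majorisation-minimisation form of the multiplicative update (exponent 1/2). All function names are hypothetical. The abstract also does not say how each subband signal is turned into a nonnegative matrix, so the two parts are shown separately rather than chained.

```python
import math
import random

SQRT2 = math.sqrt(2.0)

def haar_split(x):
    """One Haar analysis step: returns (approximation, detail), each half as long as x."""
    a = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    d = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return a, d

def haar_merge(a, d):
    """Exact inverse of haar_split."""
    x = []
    for ai, di in zip(a, d):
        x.extend([(ai + di) / SQRT2, (ai - di) / SQRT2])
    return x

def wp_decompose(x, depth):
    """Wavelet packet tree: every node is split, giving 2**depth subbands.
    len(x) must be divisible by 2**depth; bands come out in tree order."""
    bands = [list(x)]
    for _ in range(depth):
        bands = [half for b in bands for half in haar_split(b)]
    return bands

def wp_reconstruct(bands):
    """Merge sibling subbands pairwise until one signal remains."""
    while len(bands) > 1:
        bands = [haar_merge(bands[i], bands[i + 1]) for i in range(0, len(bands), 2)]
    return bands[0]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def is_divergence(V, Vh):
    """Itakura-Saito divergence summed over all entries (all must be > 0)."""
    return sum(v / vh - math.log(v / vh) - 1.0
               for rv, rh in zip(V, Vh) for v, vh in zip(rv, rh))

def _ratio_terms(V, Vh):
    # Element-wise V / Vh**2 and 1 / Vh, the two terms of the IS gradient.
    num = [[v / vh ** 2 for v, vh in zip(rv, rh)] for rv, rh in zip(V, Vh)]
    den = [[1.0 / vh for vh in rh] for rh in Vh]
    return num, den

def _scale(M, N, D):
    # Element-wise M * sqrt(N / D); the 1/2 exponent is the assumed MM variant.
    return [[m * math.sqrt(n / d) for m, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(M, N, D)]

def isnmf(V, rank, iters=50, seed=0):
    """Factorise a strictly positive F x N matrix V as W (F x rank) times H (rank x N)."""
    rng = random.Random(seed)
    W = [[rng.uniform(0.5, 1.5) for _ in range(rank)] for _ in V]
    H = [[rng.uniform(0.5, 1.5) for _ in V[0]] for _ in range(rank)]
    history = [is_divergence(V, matmul(W, H))]  # divergence before any update
    for _ in range(iters):
        num, den = _ratio_terms(V, matmul(W, H))
        Wt = transpose(W)
        H = _scale(H, matmul(Wt, num), matmul(Wt, den))
        num, den = _ratio_terms(V, matmul(W, H))
        Ht = transpose(H)
        W = _scale(W, matmul(num, Ht), matmul(den, Ht))
        history.append(is_divergence(V, matmul(W, H)))
    return W, H, history

# Part 1: split a toy signal into 4 subbands and rebuild it without overlap-add.
signal = [math.sin(0.3 * n) + 0.1 * n for n in range(16)]
bands = wp_decompose(signal, depth=2)
rebuilt = wp_reconstruct(bands)

# Part 2: fit a rank-2 ISNMF model to a small, strictly positive toy matrix.
V = [[1.0, 2.0, 4.0, 2.0], [2.0, 4.0, 8.0, 4.0], [3.0, 1.0, 2.0, 5.0]]
W, H, history = isnmf(V, rank=2)
```

The sketch shows two properties the abstract relies on. The wavelet packet split is critically sampled and exactly invertible, so reconstruction needs no overlap-and-add step. The Itakura-Saito divergence depends only on the ratio between an entry and its estimate, so it treats low-energy and high-energy components alike, unlike the Euclidean distance.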






Copyright © Polish Academy of Sciences & Institute of Fundamental Technological Research (IPPT PAN)