Adversarial Audio Inpainting with Selective State Space Model, Efficient Attention, and Large-Scale Pre-Trained Model

Junkang Yang; Hongqing Liu; Liming Shi; Lu Gan; Hiromitsu Nishizaki; Chee Siang Leow

doi:10.24423/archacoust.2026.4269

Authors

Junkang Yang University of Yamanashi, Japan 0009-0001-1758-3746
Hongqing Liu Chongqing University of Posts and Telecommunications, China 0000-0003-4839-1525
Liming Shi Chongqing University of Posts and Telecommunications, China
Lu Gan Brunel University, United Kingdom 0000-0003-1056-7660
Hiromitsu Nishizaki University of Yamanashi, Japan 0000-0002-7717-8312
Chee Siang Leow University of Yamanashi, Japan 0009-0008-1382-8962

Abstract

Recent deep-learning-based speech enhancement algorithms have found applications in noise reduction, dereverberation, bandwidth extension, and echo cancellation, among others. Packet loss is another major cause of voice quality degradation in Voice over Internet Protocol (VoIP) calls. Generative adversarial networks (GANs) have shown a strong ability in image generation, and many such models also perform well on speech tasks. In this work, we propose a lightweight GAN-based model for audio packet loss concealment. Specifically, we use a U-shaped network operating in the time-frequency domain as the generator, which is trained against the MelGAN discriminator with a multi-loss objective. In addition, to improve the model’s performance under unfavorable channel conditions, we introduce noise and bandwidth loss into the training data. Experiments show that our method outperforms the baseline on both objective and subjective metrics under an ideal channel with no other distortions, and that it largely maintains its performance in the presence of noise and bandwidth loss.
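
To make the degradations named in the abstract concrete, the following is a minimal sketch of how packet loss, additive noise, and bandwidth loss could be simulated on a waveform. It is not the authors' data pipeline: the function names, the 20 ms packet length (320 samples at 16 kHz), the white Gaussian noise, the 20 dB SNR, the 4 kHz cutoff, and the FFT-based low-pass are all assumptions chosen for illustration.

```python
import numpy as np

def drop_packets(signal, packet_len, loss_rate, rng):
    """Zero out randomly chosen fixed-length packets; return (lossy, kept_mask)."""
    n_packets = int(np.ceil(len(signal) / packet_len))
    kept = rng.random(n_packets) >= loss_rate  # True = packet received
    mask = np.repeat(kept, packet_len)[: len(signal)]
    return signal * mask, mask

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at the requested SNR in dB (assumed noise type)."""
    power = np.mean(signal ** 2)
    noise_power = power / (10 ** (snr_db / 10))
    return signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)

def limit_bandwidth(signal, sample_rate, cutoff_hz):
    """Crude low-pass: zero all FFT bins above cutoff_hz (assumed filter)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

# Hypothetical example: a one-second 440 Hz tone stands in for speech.
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = 0.5 * np.sin(2 * np.pi * 440.0 * t)

# Channel distortions first, then packet loss on the degraded signal.
degraded = limit_bandwidth(add_noise(clean, 20.0, rng), sample_rate, 4000.0)
lossy, mask = drop_packets(degraded, packet_len=320, loss_rate=0.1, rng=rng)
```

In this sketch, `lossy` would serve as the model input, `clean` as the training target, and `mask` marks which samples were received, so a concealment model can be evaluated only on the missing regions if desired.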

Keywords:

adversarial learning, audio inpainting, selective state space model, attention mechanism
