Adversarial Audio Inpainting with Selective State Space Model, Efficient Attention and Large-Scale Pre-Trained Model
Abstract
Deep-learning-based speech enhancement algorithms are widely applied to noise reduction, de-reverberation, bandwidth extension, and echo cancellation, among other tasks. Packet loss is another major cause of voice quality degradation in VoIP calls. Generative adversarial networks (GANs) have shown a strong ability in image generation, and many such models also perform well on speech tasks. In this work, we propose a lightweight GAN-based model for audio packet loss concealment. Specifically, the generator is a U-shaped network operating in the time-frequency domain, trained against a Mel-GAN discriminator with multiple loss terms. In addition, to improve the model’s robustness under unfavorable channel conditions, we add noise and bandwidth loss to the training data. Experiments show that our method outperforms the baseline on both objective and subjective metrics under an ideal channel with no other distortions, and that it largely maintains its performance in the presence of noise and bandwidth loss.
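To make the training-data degradations mentioned above concrete, the following is a minimal sketch of how packet loss, additive noise, and bandwidth loss could be simulated on a waveform. It is an illustration only, not the data pipeline used in this work: the function names, the 16 kHz sample rate, the 20 ms packet length, the white Gaussian noise, and the ideal FFT-domain low-pass filter are all assumptions chosen for simplicity.

```python
import numpy as np


def drop_packets(wave, sample_rate=16000, packet_ms=20, loss_rate=0.1, rng=None):
    """Zero out randomly chosen fixed-length packets (hypothetical helper).

    Returns the degraded waveform and a boolean mask that is True where
    samples were received and False where they were lost.
    """
    rng = np.random.default_rng() if rng is None else rng
    packet_len = sample_rate * packet_ms // 1000
    n_packets = int(np.ceil(len(wave) / packet_len))
    lost = rng.random(n_packets) < loss_rate            # independent losses (assumption)
    mask = np.repeat(~lost, packet_len)[: len(wave)]    # expand packet flags to samples
    return wave * mask, mask


def add_noise(wave, snr_db, rng=None):
    """Add white Gaussian noise at the requested signal-to-noise ratio in dB."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(wave ** 2) + 1e-12
    noise_power = signal_power / (10 ** (snr_db / 10))
    return wave + rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)


def limit_bandwidth(wave, sample_rate=16000, cutoff_hz=4000):
    """Remove all content above cutoff_hz with an ideal FFT-domain low-pass."""
    spectrum = np.fft.rfft(wave)
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(wave))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(16000) / 16000
    clean = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 6000 * t)

    # Degrade the channel first, then drop packets from what was transmitted.
    narrow = limit_bandwidth(clean, cutoff_hz=4000)
    noisy = add_noise(narrow, snr_db=20, rng=rng)
    degraded, mask = drop_packets(noisy, loss_rate=0.1, rng=rng)
    print(f"lost {100 * (1 - mask.mean()):.1f}% of samples")
```

In such a setup, the pair (`degraded`, `clean`) would serve as one training example, and `mask` could optionally be given to the model to indicate which segments need to be concealed.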

