Audio Strips Network (ASNet) and Amalgamation Audio Features (A2F): A Synergistic Approach for Audio Source Separation
Abstract
Audio source separation is the task of decomposing a mixed audio signal into its constituent components. It enables numerous applications, including creative music production, educational tools, karaoke, transcription, and music analysis. Despite the recent success of deep learning-based source separation techniques, these methods often fail to deliver accurate, high-quality separation when mixtures contain complex combinations of sources. Moreover, most separation techniques rely on either temporal or spectral features alone, which does not fully capture the complex dynamics of audio signals. To address these limitations, we propose Amalgamation Audio Features (A2F), a hybrid representation combining temporal and spectral features. We then propose the Audio Strips Network (ASNet), a novel framework that leverages A2F to achieve clean and precise separation of individual audio sources. The model is trained and evaluated on the MUSDB, DSD100, and MUSDB18-HQ datasets, standard benchmarks for music source separation, and performance is assessed with the standard measures Signal-to-Distortion Ratio (SDR) and Signal-to-Interference Ratio (SIR). ASNet achieves enhanced separation performance, with SDR values of 12.63 dB (drums), 11.42 dB (vocals), 12.01 dB (bass), and 11.14 dB (other), and SIR values of 9.57 dB, 9.61 dB, 9.66 dB, and 9.67 dB, respectively. This advancement benefits musicians through high-quality remixing and creative reuse, while aiding researchers in improving deep learning and hybrid audio processing models.
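As an illustration of how the reported evaluation metric can be computed, the sketch below implements a simplified Signal-to-Distortion Ratio that treats the entire difference between the reference source and the estimated source as distortion; the full BSSEval metrics used in source-separation benchmarks additionally decompose the error into interference and artifact terms to obtain SIR and SAR. The function name `sdr` and the numerical tolerance `eps` are illustrative assumptions, not definitions from the paper.

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-9) -> float:
    """Simplified Signal-to-Distortion Ratio in dB.

    Ratio of the energy of the reference signal to the energy of the
    residual (reference - estimate). A perfect estimate gives a very
    large SDR; a noisier estimate gives a lower SDR.
    """
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps  # eps avoids division by zero
    return float(10.0 * np.log10(num / den + eps))

# Example: a sine-wave "source" and a noisy estimate of it.
t = np.linspace(0.0, 8.0 * np.pi, 1000)
ref = np.sin(t)
rng = np.random.default_rng(0)
noisy = ref + 0.05 * rng.standard_normal(t.shape)  # hypothetical separated output

perfect_sdr = sdr(ref, ref)    # very high: residual is essentially zero
noisy_sdr = sdr(ref, noisy)    # lower: residual carries the added noise
```

A higher SDR indicates less overall distortion in the separated source, which is why it is the headline metric for comparing separation models.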

