From Speech to Underwater Acoustics: A Transfer Learning Framework for Real-Time Passive Diver Detection Using Keyword Spotting Models
Abstract
Passive acoustic detection of divers faces challenges such as low signal-to-noise ratios (SNRs), data scarcity, and the latency of conventional methods. This paper proposes Keyword Spotting for Diver Detection (KWS-DD), a transfer learning framework that repurposes speech-oriented KWS models for data-efficient diver detection. Diver inhalation signatures are treated as acoustic "keywords," enabling the transformer-based HuBERT architecture, pre-trained on speech, to be adapted to identify quasi-periodic respiratory events in underwater audio. The core innovation lies in adapting a state-of-the-art speech model to non-speech inhalation acoustics: the approach eliminates the need to accumulate multiple respiratory cycles, enabling real-time detection from minimal domain-specific data (120 inhalation samples). Deployed in diverse marine conditions, the framework achieved 94.4% accuracy and a 94.6% F1-score on inhalation sounds, extending detection range by more than 50% over conventional methods, which proved unreliable beyond 10 meters in low-SNR environments. The framework also reduces false alarms caused by boat noise and generalizes to external datasets, validating cross-domain transferability. This work bridges AI-based speech processing and passive sonar signal processing, offering a resource-efficient solution for real-time underwater surveillance.
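The adaptation described in the abstract, treating inhalation events as binary "keywords" for a speech-pretrained HuBERT classifier, can be sketched with the Hugging Face `transformers` library. This is an illustrative sketch, not the paper's implementation: all hyperparameters below are assumptions, and a tiny randomly initialized configuration is used so the example runs offline, where a real system would start from pretrained weights (e.g. `facebook/hubert-base-ls960`) and fine-tune on labelled inhalation clips.

```python
import torch
from transformers import HubertConfig, HubertForSequenceClassification

# Tiny, randomly initialized config so the sketch runs without downloading weights.
# In practice: HubertForSequenceClassification.from_pretrained("facebook/hubert-base-ls960",
# num_labels=2) to transfer the speech-pretrained encoder.
config = HubertConfig(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    classifier_proj_size=32,
    num_labels=2,  # binary "keyword": inhalation vs. background
)
model = HubertForSequenceClassification(config)
model.eval()

# One second of 16 kHz audio stands in for a hydrophone frame that may contain an inhalation.
waveform = torch.randn(1, 16000)
with torch.no_grad():
    logits = model(input_values=waveform).logits  # shape: (batch, num_labels)
probs = logits.softmax(dim=-1)  # per-frame scores for [background, inhalation]
```

Because each frame is classified independently, a detection can be emitted as soon as a single inhalation-like event is observed, which is what allows the method to avoid accumulating several respiratory cycles before deciding.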

