10.24425/aoa.2025.153648
Estimating Ensemble Location and Width in Binaural Recordings of Music with Convolutional Neural Networks
References
Abdel-Hamid O., Mohamed A.-r., Jiang H., Penn G. (2012), Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition, [in:] 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4277–4280, https://doi.org/10.1109/ICASSP.2012.6288864.
Algazi V.R., Duda R.O., Thompson D.M., Avendano C. (2001), The CIPIC HRTF database, [in:] Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No.01TH8575), pp. 99–102, https://doi.org/10.1109/ASPAA.2001.969552.
Andreopoulou A., Begault D.R., Katz B.F.G. (2015), Inter-laboratory round robin HRTF measurement comparison, [in:] IEEE Journal of Selected Topics in Signal Processing, 9(5): 895–906, https://doi.org/10.1109/JSTSP.2015.2400417.
Antoniuk P. (2024), Software repository: Estimating ensemble location and width in binaural recordings of music with convolutional neural networks, GitHub, https://github.com/pawel-antoniuk/ensemble-width-cnn (access: 07.01.2024).
Antoniuk P., Zieliński S.K. (2023), Blind estimation of ensemble width in binaural music recordings using ‘spatiograms’ under simulated anechoic conditions, [in:] Audio Engineering Society Conference: AES 2023 International Conference on Spatial and Immersive Audio.
Armstrong C., Thresh L., Murphy D., Kearney G. (2018), A perceptual evaluation of individual and nonindividual HRTFs: A case study of the SADIE II database, Applied Sciences, 8(11): 2029, https://doi.org/10.3390/app8112029.
Arthi S., Sreenivas T.V. (2021), Spatiogram: A phase based directional angular measure and perceptual weighting for ensemble source width, ArXiv, https://doi.org/10.48550/arXiv.2112.07216.
Austrian Academy of Sciences (2014), HRTF-Database, https://www.oeaw.ac.at/en/ari/das-institut/software/hrtf-database.
Benaroya E.L., Obin N., Liuni M., Roebel A., Raumel W., Argentieri S. (2018), Binaural localization of multiple sound sources by non-negative tensor factorization, [in:] IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(6): 1072–1082, https://doi.org/10.1109/TASLP.2018.2806745.
Blauert J. (1996), Spatial Hearing: The Psychophysics of Human Sound Localization, The MIT Press, https://doi.org/10.7551/mitpress/6391.001.0001.
Branke J. (1995), Evolutionary algorithms for neural network design and training, [in:] Proceedings of the First Nordic Workshop on Genetic Algorithms and its Application, pp. 145–163.
Braren H.S., Fels J. (2020), A high-resolution individual 3D adult head and torso model for HRTF simulation and validation: HRTF measurement, RWTH Publications, https://doi.org/10.18154/RWTH-2020-06761.
Bregman A. (1994), Auditory scene analysis: The perceptual organization of sound, The Journal of the Acoustical Society of America, 95(2): 1177–1178, https://doi.org/10.1121/1.408434.
Brinkmann F., Dinakaran M., Pelzer R., Grosche P., Voss D., Weinzierl S. (2019), A cross-evaluated database of measured and simulated HRTFs including 3D head meshes, anthropometric features, and headphone impulse responses, Journal of the Audio Engineering Society, 67(9): 705–718, https://doi.org/10.17743/jaes.2019.0024.
Brinkmann F. et al. (2017), A high resolution and full-spherical head-related transfer function database for different head-above-torso orientations, Journal of the Audio Engineering Society, 65(10): 841–848, https://doi.org/10.17743/jaes.2017.0033.
Cherry E.C. (1953), Some experiments on the recognition of speech, with one and with two ears, The Journal of the Acoustical Society of America, 25(5): 975–979, https://doi.org/10.1121/1.1907229.
Chollet F. et al. (2015), Keras, GitHub, https://github.com/fchollet/keras (access: 07.01.2024).
Chung M.-A., Chou H.-C., Lin C.-W. (2022), Sound localization based on acoustic source using multiple microphone array in an indoor environment, Electronics, 11(6): 890, https://doi.org/10.3390/electronics11060890.
Clifton R.K., Gwiazda J., Bauer J.A., Clarkson M.G., Held R.M. (1988), Growth in head size during infancy: Implications for sound localization, Developmental Psychology, 24(4): 477–483, https://doi.org/10.1037/0012-1649.24.4.477.
Dietz M., Ewert S.D., Hohmann V. (2011), Auditory model based direction estimation of concurrent speakers from binaural signals, Speech Communication, 53(5): 592–605, https://doi.org/10.1016/j.specom.2010.05.006.
Eisenman A. et al. (2020), Check-N-Run: A checkpointing system for training recommendation models, ArXiv.
Espi M., Fujimoto M., Kinoshita K., Nakatani T. (2015), Exploiting spectro-temporal locality in deep learning based acoustic event detection, EURASIP Journal on Audio, Speech, and Music Processing, 2015: 26, https://doi.org/10.1186/s13636-015-0069-2.
Gardner B., Martin K. (1994), HRTF Measurements of a KEMAR dummy-head microphone, https://sound.media.mit.edu/resources/KEMAR.html (access: 06.19.2024).
Garofolo J.S., Lamel L., Fisher W.M., Fiscus J.G., Pallett D.S., Dahlgren N.L. (1993), DARPA TIMIT: Acoustic-Phonetic Continuous Speech Corpus CD-ROM, NIST Speech Disc 1-1.1, NIST Publications, https://doi.org/10.6028/NIST.IR.4930.
Hahmann M., Fernandez-Grande E., Gunawan H., Gerstoft P. (2022), Sound source localization using multiple ad hoc distributed microphone arrays, JASA Express Letters, 2(7): 074801, https://doi.org/10.1121/10.0011811.
Han Y., Park J., Lee K. (2017), Convolutional neural networks with binaural representations and background subtraction for acoustic scene classification, [in:] Workshop on Detection and Classification of Acoustic Scenes and Events.
Hirsh I.J. (1950), Binaural hearing aids: A review of some experiments, Journal of Speech and Hearing Disorders, 15(2): 114–123, https://doi.org/10.1044/jshd.1502.114.
Ioffe S., Szegedy C. (2015), Batch normalization: Accelerating deep network training by reducing internal covariate shift, [in:] Proceedings of the 32nd International Conference on Machine Learning, pp. 448–456.
ITU (2023), BS.1770: Algorithms to measure audio programme loudness and true-peak audio level, International Telecommunication Union, Geneva, Switzerland.
Kaveh M., Barabell A. (1986), The statistical performance of the MUSIC and the minimum-norm algorithms in resolving plane waves in noise, [in:] IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(2): 331–341, https://doi.org/10.1109/TASSP.1986.1164815.
King A.J., Kacelnik O., Mrsic-Flogel T.D., Schnupp J.W., Parsons C.H., Moore D.R. (2001), How plastic is spatial hearing?, Audiology and Neurotology, 6(4): 182–186, https://doi.org/10.1159/000046829.
Kingma D.P., Ba J. (2014), Adam: A method for stochastic optimization, [in:] International Conference on Learning Representations.
Krizhevsky A., Sutskever I., Hinton G.E. (2012), ImageNet classification with deep convolutional neural networks, [in:] Advances in Neural Information Processing Systems 25 (NIPS 2012), 25.
Kuhn M., Johnson K. (2013), Applied Predictive Modeling, Springer, New York, https://doi.org/10.1007/978-1-4614-6849-3.
LeCun Y. et al. (1989), Handwritten digit recognition with a back-propagation network, [in:] Advances in Neural Information Processing Systems 2 (NIPS 1989), 2.
Lin M., Chen Q., Yan S. (2013), Network in network, [in:] International Conference on Learning Representations.
Listen HRTF Database (n.d.), http://recherche.ircam.fr/equipes/salles/listen/ (access: 06.19.2024).
Liu M., Hu J., Zeng Q., Jian Z., Nie L. (2022), Sound source localization based on multi-channel cross-correlation weighted beamforming, Micromachines, 13(7): 1010, https://doi.org/10.3390/mi13071010.
Liu Q., Wang W., de Campos T., Jackson P.J.B., Hilton A. (2018), Multiple speaker tracking in spatial audio via PHD filtering and depth-audio fusion, [in:] IEEE Transactions on Multimedia, 20(7): 1767–1780, https://doi.org/10.1109/TMM.2017.2777671.
Ma N., Brown G.J. (2016), Speech localisation in a multitalker mixture by humans and machines, [in:] Interspeech 2016, pp. 3359–3363, https://doi.org/10.21437/Interspeech.2016-1149.
Ma N., Gonzalez J.A., Brown G.J. (2018), Robust binaural localization of a target sound source by combining spectral source models and deep neural networks, [in:] IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(11): 2122–2131, https://doi.org/10.1109/TASLP.2018.2855960.
Ma N., May T., Brown G.J. (2017), Exploiting deep neural networks and head movements for robust binaural localisation of multiple sources in reverberant environments, [in:] IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12): 2444–2453, https://doi.org/10.1109/TASLP.2017.2750760.
May T., Ma N., Brown G.J. (2015), Robust localisation of multiple speakers exploiting head movements and multi-conditional training of binaural cues, [in:] 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2679–2683, https://doi.org/10.1109/ICASSP.2015.7178457.
May T., van de Par S., Kohlrausch A. (2011), A probabilistic model for robust localization based on a binaural auditory front-end, [in:] IEEE Transactions on Audio, Speech, and Language Processing, 19(1): 1–13, https://doi.org/10.1109/TASL.2010.2042128.
May T., van de Par S., Kohlrausch A. (2012), A binaural scene analyzer for joint localization and recognition of speakers in the presence of interfering noise sources and reverberation, [in:] IEEE Transactions on Audio, Speech, and Language Processing, 20(7): 2016–2030, https://doi.org/10.1109/TASL.2012.2193391.
Miikkulainen R. et al. (2017), Evolving deep neural networks, ArXiv.
Morgan N., Bourlard H. (1989), Generalization and parameter estimation in feedforward nets: Some experiments, [in:] Advances in Neural Information Processing Systems 2 (NIPS 1989).
Pan Z., Zhang M., Wu J., Wang J., Li H. (2021), Multi-tone phase coding of interaural time difference for sound source localization with spiking neural networks, [in:] IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29: 2656–2670, https://doi.org/10.1109/TASLP.2021.3100684.
Pang C., Liu H., Li X. (2019), Multitask learning of time-frequency CNN for sound source localization, [in:] IEEE Access, 7: 40725–40737, https://doi.org/10.1109/ACCESS.2019.2905617.
Pavlidi D., Puigt M., Griffin A., Mouchtaris A. (2012), Real-time multiple sound source localization using a circular microphone array based on single-source confidence measures, [in:] 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2625–2628, https://doi.org/10.1109/ICASSP.2012.6288455.
Pocock S.J., Hughes M.D. (1989), Practical problems in interim analyses, with particular regard to estimation, Controlled Clinical Trials, 10(4): 209–221, https://doi.org/10.1016/0197-2456(89)90059-7.
Porschmann C., Arend J., Neidhardt A. (2017), A spherical near-field HRTF set for auralization and psychoacoustic research, [in:] Proceedings of the 142nd AES Convention.
Raake A. (2016), A computational framework for modelling active exploratory listening that assigns meaning to auditory scenes – Reading the world with two ears, Two!Ears, http://twoears.eu (access: 06.11.2024).
Rumsey F. (2002), Spatial quality evaluation for reproduced sound: Terminology, meaning, and a scene-based paradigm, Journal of the Audio Engineering Society, 50(9): 651–666.
Sainath T.N., Mohamed A.-r., Kingsbury B., Ramabhadran B. (2013), Deep convolutional neural networks for LVCSR, [in:] 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8614–8618, https://doi.org/10.1109/ICASSP.2013.6639347.
Senior M. (2023), The ‘Mixing Secrets’ Free Multitrack Download Library, Cambridge Music Technology, https://cambridge-mt.com/ms/mtk/ (access: 06.10.2024).
Shafiee M.J., Mishra A., Wong A. (2016), Deep learning with Darwin: Evolutionary synthesis of deep neural networks, Neural Processing Letters, 48: 603–613, https://doi.org/10.1007/s11063-017-9733-0.
Spagnol S., Miccini R., Unnthórsson R. (2020), The Viking HRTF Dataset v2.
Spagnol S., Purkhus K.B., Unnthórsson R., Bjornsson S.K. (2019), The Viking HRTF Dataset.
Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R. (2014), Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research, 15(56): 1929–1958, http://jmlr.org/papers/v15/srivastava14a.html.
Stanley K.O., Miikkulainen R. (2002), Evolving neural networks through augmenting topologies, Evolutionary Computation, 10(2): 99–127, https://doi.org/10.1162/106365602320169811.
The MathWorks Inc. (2022a), Audio Toolbox, Version: 9.13.0 (R2022b), Natick, Massachusetts, United States, https://www.mathworks.com.
The MathWorks Inc. (2022b), MATLAB, Version: 9.13.0 (R2022b), Natick, Massachusetts, United States, https://www.mathworks.com.
Thiemann J., Muller M., Marquardt D., Doclo S., van de Par S. (2016), Speech enhancement for multimicrophone binaural hearing aids aiming to preserve the spatial auditory scene, [in:] EURASIP Journal on Advances in Signal Processing, 2016(1), https://doi.org/10.1186/s13634-016-0314-6.
Thomas S., Ganapathy S., Saon G., Soltau H. (2014), Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions, [in:] 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2519–2523, https://doi.org/10.1109/ICASSP.2014.6854054.
Van Rossum G., Drake F.L. (2009), Python 3 Reference Manual, Scotts Valley, CA: CreateSpace.
Vecchiotti P., Ma N., Squartini S., Brown G.J. (2019), End-to-end binaural sound localisation from the raw waveform, [in:] ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 451–455, https://doi.org/10.1109/ICASSP.2019.8683732.
Vera-Diaz J.M., Pizarro D., Macias-Guarasa J. (2018), Towards end-to-end acoustic localization using deep learning: From audio signals to source position coordinates, Sensors, 18(10): 3418, https://doi.org/10.3390/s18103418.
Virtanen P. et al. (2020), SciPy 1.0: Fundamental algorithms for scientific computing in Python, Nature Methods, 17: 261–272, https://doi.org/10.1038/s41592-019-0686-2.
Wang J., Wang J., Qian K., Xie X., Kuang J. (2020), Binaural sound localization based on deep neural network and affinity propagation clustering in mismatched HRTF condition, EURASIP Journal on Audio, Speech, and Music Processing, 2020, https://doi.org/10.1186/s13636-020-0171-y.
Watanabe K., Iwaya Y., Suzuki Y., Takane S., Sato S. (2014), Dataset of head-related transfer functions measured with a circular loudspeaker array, Acoustical Science and Technology, 35(3): 159–165, https://doi.org/10.1250/ast.35.159.
Wierstorf H., Geier M., Raake A., Spors S. (2011), A free database of head-related impulse response measurements in the horizontal plane with multiple distances, [in:] 130th Convention. Engineering Brief. Audio Engineering Society.
Woodruff J., Wang D. (2012), Binaural localization of multiple sources in reverberant and noisy environment, [in:] IEEE Transactions on Audio, Speech, and Language Processing, 20(5): 1503–1512, https://doi.org/10.1109/TASL.2012.2183869.
Yang Q., Zheng Y. (2022), DeepEar: Sound localization with binaural microphones, [in:] IEEE INFOCOM 2022 – IEEE Conference on Computer Communications, pp. 960–969, https://doi.org/10.1109/INFOCOM48880.2022.9796850.
Yu G., Wu R., Liu Y., Xie B. (2018), Near-field head-related transfer-function measurement and database of human subjects, The Journal of the Acoustical Society of America, 143(3): EL194–EL198, https://doi.org/10.1121/1.5027019.
Zhang H., Kiranyaz S., Gabbouj M. (2018), Finding better topologies for deep convolutional neural networks by evolution, ArXiv, https://doi.org/10.48550/arXiv.1809.03242.
Zhang W., Samarasinghe P.N., Chen H., Abhayapala T.D. (2017), Surround by sound: A review of spatial audio recording and reproduction, Applied Sciences, 7(5): 532, https://doi.org/10.3390/app7050532.
Zieliński S.K., Antoniuk P., Lee H. (2022a), Spatial audio scene characterization (SASC): Automatic localization of front-, back-, up-, and down-positioned music ensembles in binaural recordings, Applied Sciences, 12(3): 1569, https://doi.org/10.3390/app12031569.
Zieliński S.K., Antoniuk P., Lee H., Johnson D. (2022b), Automatic discrimination between front and back ensemble locations in HRTF-convolved binaural recordings of music, EURASIP Journal on Audio, Speech, and Music Processing, 2022(1): 3, https://doi.org/10.1186/s13636-021-00235-2.
Zieliński S.K., Lee H., Antoniuk P., Dadan O. (2020), A comparison of human against machine-classification of spatial audio scenes in binaural recordings of music, Applied Sciences, 10(17): 5956, https://doi.org/10.3390/app10175956.