Estimating Ensemble Location and Width in Binaural Recordings of Music with Convolutional Neural Networks

Paweł ANTONIUK; Sławomir Krzysztof ZIELIŃSKI

doi:10.24425/aoa.2025.153648

Authors

Paweł ANTONIUK, Faculty of Computer Science, Białystok University of Technology, Poland, ORCID: 0000-0002-0914-8920
Sławomir Krzysztof ZIELIŃSKI, Faculty of Computer Science, Białystok University of Technology, Poland, ORCID: 0000-0002-3205-974X

Abstract

Binaural audio technology has existed for many years, but its popularity has grown markedly over the past decade as a consequence of advances in virtual reality and streaming techniques. The number of publicly accessible binaural audio recordings has grown with it. Consequently, there is now a need for automated and objective retrieval of spatial content information, of which ensemble location and width are the most prominent descriptors. This study presents a novel method for estimating these two ensemble parameters in binaural recordings of music. For this purpose, a dataset of 23 040 binaural recordings was synthesized from 192 publicly available music recordings using 30 head-related transfer functions. The synthesized excerpts were then used to train a multi-task, spectrogram-based convolutional neural network model that estimates ensemble location and width for unseen recordings. The results indicate that such a model can be constructed with low prediction errors: 4.76° (±0.10°) for ensemble location and 8.57° (±0.19°) for ensemble width. The method outperforms spatiogram-based techniques recently published in the literature and shows promise for future development as part of a novel tool for the analysis of binaural audio recordings.
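The synthesis step summarized in the abstract can be illustrated with a minimal sketch. It assumes each head-related transfer function is available as a pair of time-domain head-related impulse responses (HRIRs), so that filtering reduces to one convolution per ear. The function names `binauralize` and `mix_ensemble` and the toy impulse responses are hypothetical and are not taken from the authors' software. The final lines only check the dataset arithmetic: 23 040 excerpts from 192 recordings and 30 HRTFs implies four excerpts per recording/HRTF pair, although the abstract does not say how those four differ.

```python
import numpy as np


def binauralize(mono, hrir_left, hrir_right):
    """Filter one mono source with a left/right HRIR pair.

    Returns an array of shape (2, len(mono) + len(hrir) - 1).
    Both HRIRs are assumed to have the same length.
    """
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])


def mix_ensemble(sources, hrir_pairs):
    """Sum several binauralized sources into one two-channel excerpt.

    sources    : list of 1-D arrays (one per instrument track)
    hrir_pairs : list of (hrir_left, hrir_right) tuples, one per source,
                 chosen for the direction assigned to that source
    """
    rendered = [binauralize(s, hl, hr) for s, (hl, hr) in zip(sources, hrir_pairs)]
    length = max(r.shape[1] for r in rendered)
    out = np.zeros((2, length))
    for r in rendered:
        out[:, : r.shape[1]] += r  # shorter sources are zero-padded at the end
    return out


# Dataset-size check using only the figures quoted in the abstract.
N_RECORDINGS = 192
N_HRTF_SETS = 30
N_EXCERPTS = 23_040
VARIANTS_PER_PAIR = N_EXCERPTS // (N_RECORDINGS * N_HRTF_SETS)

if __name__ == "__main__":
    # Toy HRIRs (hypothetical): the right ear receives the signal one sample later.
    excerpt = binauralize(np.array([1.0, 2.0, 3.0]), [1.0, 0.0], [0.0, 1.0])
    print(excerpt)
    print("excerpts per recording/HRTF pair:", VARIANTS_PER_PAIR)
```

Real HRIR sets are typically hundreds of samples long and are indexed by source direction; the one-sample delay here only stands in for an interaural time difference.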

Keywords:

ensemble width, ensemble location, binaural, spatial audio, localization, convolutional neural network, head-related transfer function, angle of arrival

