Archives of Acoustics, 47, 1, pp. 71–79, 2022

Pursuing Listeners’ Perceptual Response in Audio-Visual Interactions – Headphones vs Loudspeakers: A Case Study

Bartłomiej MRÓZ
Multimedia Systems Department, Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology

Audio Acoustics Lab., Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology

This study investigates listeners’ perceptual responses in audio-visual interactions concerning binaural spatial audio. Audio stimuli are presented to the listeners with or without accompanying visual cues. The participants of the subjective test are asked to indicate the direction of the incoming sound while listening to the audio stimulus via loudspeakers or via headphones with a head-related transfer function (HRTF) plugin. First, the methodology assumptions and the experimental setup are described to the participants. Then, the results are presented and analysed using statistical methods. The results indicate that the headphone trials produced much higher perceptual ambiguity for the listeners than the trials in which the sound was delivered via loudspeakers. The influence of the visual modality dominates the audio-visual evaluation when loudspeaker playback is employed. Moreover, when the visual stimulus is present, the pattern of behaviour observed for headphone playback does not always correspond to that observed for loudspeaker playback.
Keywords: human perception; audio-visual interaction; 3D perception; binaural spatial audio.
Copyright © The Author(s). This is an open-access article distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).


DOI: 10.24425/aoa.2022.140733