Sonic Interactions in Virtual Reality: State of the Art, Current Challenges, and Future Directions
In recent years, the availability of low-cost head-mounted displays has enhanced interest in immersive sound
experiences. Investigation of immersive sound in the context of VR is not new, although there’s general con-
sensus that sound is an underused modality in VR. In an interactive immersive experience, sound can direct
the user’s attention, enhancing the sense of presence and the ability to move users by creating interactive
time-varying experiences. The auditory modality possesses unique features. Unlike its visual counterpart,
auditory perception is always on, since we cannot close our ears. Thus, this sensory channel continuously conveys information about the surrounding environment, regardless of whether we pay attention to it or not.
Visual perception is superior in terms of spatial resolution, but it is inherently directional. Our limited field of
view entails that we have to turn our heads or bodies in order to perceive the surrounding environment. Audi-
tory perception, on the other hand, is omnidirectional. Moreover, auditory cues are inherently temporal in
nature. A sound-based event is by definition an unfolding event. In sum, it appears that auditory displays
constitute a relatively inexpensive approach in terms of both hardware and software simulation, and are a
valuable component of VR systems intended to represent multimodal virtual spaces and elicit a sensation of
presence in these spaces.
In the past decades, several methods have been developed for automatic generation and propagation of sound
by making use of techniques developed in computer graphics (CG), such as texture mapping or synthesis,1
beam tracing,2 and rigid-body simulations.3 These methods make it possible to create realistic sound related
to objects in motion and dependent on the geometric relationship between the receiver and the sound source.
Sound synthesis, propagation, and rendering are slowly becoming important research fields in relation to VR.
However, several research questions remain open, especially when immersive sound is embedded into eco-
logically valid interactive multisensory experiences. In these situations, results from investigations of the role
of immersive sound in static experiences might not remain valid. Audio techniques for interactive multi-
modal systems and CG applications differ from offline techniques aimed solely at accuracy and authenticity. A perceptually plausible but less authentic scheme for sonic interactions is often preferable in practice, since it remains efficient in terms of memory and computational power while keeping latency low enough.
However, the trade-off between accuracy and plausibility is complex, and finding techniques that balance
them effectively remains challenging. This challenge is particularly pertinent in relation to VR because real-
time constraints entail that system resources have to be carefully shared with graphics, sensors, application
logic, and high-level functionality (e.g., artificial intelligence).
The sidebar “The Plausibility of Binaural Synthesis” presents an overview of the challenges of presenting a plausible binaural scene over headphones.
In this article we provide an overview of the state of the art of sound synthesis, propagation, and rendering in
immersive virtual environments. The different elements of a sound production pipeline can be seen in the
sidebar “Immersive Sound Has Several Applications.” Although immersive sound can be delivered through a
speaker setup or headphone setup, we focus on sound rendering through headphones, since it is the hardware
solution available with state-of-the-art consumer head-mounted displays and is easily integrated in mobile
and portable devices for future mobile-VR experiences. Headphone-based sound rendering makes it possible
to completely control the sound arriving at each ear. In addition to offering more precise control of binaural cues, headphone delivery keeps unwanted sounds, such as echoes and reverberation in the listener’s physical environment, from reaching the ears.
However, this comes at a price, since headphones may be experienced as intrusive by the user, and they may
be detrimental to the naturalness of the listening experience and to the externalization of sound sources, thus
interfering with the perceptual localization of the surrounding space outside the head.4 The acoustic coupling
between the headphones and eardrums varies heavily from person to person and with small displacements of
headphones. Headphone-induced coloration should be reduced by product design criteria and equalization
algorithms that are able to minimize artifacts in the auralization (see the sidebar “Headphone Technologies
for Natural Spatial Sounds” for further details). Moreover, the critical aspect of headphone listening is
whether stimuli are heard as being inside the head (being lateralized, subjected to the so-called in-head locali-
zation) or outside the head (being localized in space). For this reason, dynamic binaural spatialization of
room acoustics is a key element in terms of externalization, localization, realism, and immersion.
The auralization of a VR scene can be defined by geometric complexity and the implemented acoustic effects
(i.e., the order of reflections, diffraction, and scattering). Particular attention should be given to diffracted
occlusion and sound propagation of early reflections that have to be coherently rendered in order to imple-
ment a perceptually plausible and efficient auralization. Accordingly, there exist several studies that cover
various aspects of this issue and take advantage of recent developments in CPU and GPU processing allow-
ing flexible scaling of aural complexity (see “More Than 50 Years of Artificial Reverberation”5 for a recent
review).
Figure 1. A schematic representation of the different sound elements needed to create an immersive sonic
experience. Action sounds are those sounds produced by the user. Action sounds depend on the gestures
performed by the person experiencing the virtual environment. Environmental sounds are those sounds
produced by objects in the environment, as well as the soundscape of the different spaces. Soundscapes are
usually created by combining sampled sounds, while sounds created by objects in the environment can be
algorithmically generated using physical models. Sound propagation refers to the simulation of the acoustics of
the environment. Binaural rendering relates to how the sound reaches each of our ears and creates a sense
that sounds are located at a specific azimuth (horizontal direction), distance, and elevation.
In the literature, the term virtual acoustics has been used to cover the three main elements of the system:
source modeling, room acoustics modeling, and receiver modeling.6 Complete virtual-acoustics systems that
include all three elements include the DIVA (Digital Interactive Virtual Acoustics) system, developed at the
Helsinki University of Technology;6 NASA’s SLAB (Sound Lab; https://software.nasa.gov/software/ARC-
14991-1); and Spat (http://forumnet.ircam.fr/product/spat-en), developed at IRCAM (Institute for Research
and Coordination in Acoustics/Music).
In a traditional audio design pipeline, a monophonic sound is synthesized, and then it is spatialized—i.e., virtually positioned in its environment.
A problem for the practical adoption of synthesis techniques such as modal synthesis is that synthesized sounds are not yet indistinguishable from real ones. Unlike low-polygon approximations in visual rendering, oversimplifying the mode-shape matrix noticeably compromises sound quality. CPU and RAM usage are also primary issues: the mode-shape matrix for modal sounds can easily consume a large amount of RAM. Deficiencies in modeling contact mechanics are a further contributing factor, especially given that engines run the physics at visual-update rates, which in most situations are too low for sound synthesis. High frequencies, with their highly oscillatory mode shapes, also remain a problem.
Another key challenge in widely adopting modal techniques is the lack of automatic determination of satis-
factory material parameters that recreate realistic audio quality of sound-producing materials. Musical signal
processing has much to offer in parameter estimation and residual-excitation-signal calculation.7 A particular
way of achieving this is commuted synthesis, which commutes and consolidates with the excitation signal the output modes that the resonator cannot handle well, based on linear, time-invariant system theory.7 Ren et al.8
introduced such a method that uses prerecorded audio clips to estimate material parameters that capture the
inherent quality of recorded sound-producing materials. The method extracts perceptually salient features
from audio examples. Recent research proposes performing the computation of the modes directly on the GPU to ensure real-time performance when the number of modes is too high. This can be the case, for example, in complex environments where several objects are present.
Data-driven approaches are also a promising direction. The technique proposed by Lloyd et al.9 is efficient
enough to be used in a shipped game. It analyzes recordings to fit a sinusoid-plus-noise model.7 The pro-
duced impact sounds are convincingly realistic, and the technique is memory- and CPU-efficient but is capa-
ble only of producing random variations on an impact sound. The value for games and VR lies in breaking
the monotony of playing the exact same clip without the memory cost of multiple clips, while still being
physically inspired.
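To make the sinusoid-plus-noise idea concrete, the following Python sketch (illustrative only, not Lloyd et al.’s method) synthesizes an impact as a sum of exponentially decaying sinusoids plus a short noise burst, with small random perturbations per trigger so repeated impacts do not sound identical; all frequencies and decay values below are made up for the example.

import numpy as np

def impact_sound(freqs, amps, decays, fs=44100, dur=0.5, rng=None):
    """Sum of exponentially decaying sinusoids plus a short noise burst.

    freqs, amps, decays: per-partial frequency (Hz), amplitude, decay rate (1/s).
    Small random perturbations break the monotony of identical triggers."""
    rng = rng or np.random.default_rng()
    t = np.arange(int(dur * fs)) / fs
    out = np.zeros_like(t)
    for f, a, d in zip(freqs, amps, decays):
        a_j = a * rng.uniform(0.8, 1.2)        # jitter the amplitude
        d_j = d * rng.uniform(0.9, 1.1)        # jitter the decay
        out += a_j * np.exp(-d_j * t) * np.sin(2 * np.pi * f * t)
    # a short decaying noise burst stands in for the broadband attack transient
    out += 0.2 * rng.standard_normal(len(t)) * np.exp(-80.0 * t)
    return out / np.max(np.abs(out))

# Example: a hypothetical small metal object (values are purely illustrative)
clip = impact_sound([523.0, 1290.0, 2110.0], [1.0, 0.6, 0.3], [8.0, 14.0, 25.0])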
Simulating sound events for VR imposes additional challenges, as opposed to film or games (see the sidebar
“Sound Source Modeling” for further details). One essential difference is the fact that users are fully sur-
rounded by the environment. In VR, sound can help direct the attention of the user and enhance the sensation
of place and space.10
SOUND PROPAGATION
Sound propagation refers to the simulation of sound’s movement from the sound-producing object to the ears
of the listener. A complete survey of methods for sound propagation in interactive virtual environments is
described in “More Than 50 Years of Artificial Reverberation.”5 A primary challenge in acoustic modeling
of sound propagation is the computation of reverberation paths from a sound source to a listener (receiver).
As sound may travel from source to receiver via a multitude of reflection, transmission, and diffraction paths,
accurate simulation is extremely computationally intensive.
One simple approach to simulating the space is to capture the so-called room impulse response (RIR) and convolve it with the original dry signal11 (a minimal sketch of this approach follows this paragraph). This method is simple but lacks flexibility. Therefore, several geometric (high-frequency approximations of sound propagating as rays), wave-based (solvers for the underlying physical equations), and hybrid methods have been proposed for sound propagation modeling. Wave-based techniques can be mathematically and computationally expensive, so more efficient solutions must be investigated in order to use these techniques in real-time simulations. James et al.12 investigated preprocessing of global sound radiation effects for a single vibrating object and showed that expensive linear acoustic transfer phenomena can be largely precomputed, reducing the computational resources required for efficient real-time sound radiation. However, memory allocation grows with the number of sound sources (>100 Mbytes each), requiring further research to make sound-field encoding with an equivalent source method (ESM) practical.
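The convolution-based auralization mentioned at the start of this paragraph can be sketched in a few lines of Python; the file names are placeholders, and the result is only valid for the fixed source/listener configuration in which the RIR was measured or simulated.

import numpy as np
from scipy.signal import fftconvolve

def auralize_static(dry, rir):
    """Render a dry mono source in a room by convolving it with a room
    impulse response (RIR). Valid only for the fixed source/listener
    configuration in which the RIR was obtained."""
    wet = fftconvolve(dry, rir)                   # linear convolution, length N + M - 1
    return wet / (np.max(np.abs(wet)) + 1e-12)    # simple peak normalization

# Hypothetical usage with the `soundfile` package (file names are placeholders):
# dry, fs = soundfile.read("dry_voice.wav")
# rir, _ = soundfile.read("room_rir.wav")
# wet = auralize_static(dry, rir)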
An efficient and accurate sound propagation algorithm has been proposed13 that relies on an adaptive rectan-
gular decomposition of 3D scenes to enable efficient and accurate simulation of sound propagation in com-
plex virtual environments with moving sources. Virtual wave fields can be encoded to perceptual-parameter
fields, such as propagation delays, loudness, and decay times, ready to be quantized and compressed; this
technique is practical on large, complex interactive scenes with millions of polygons. Savioja proposed
wave-based simulation that combines finite-difference methods together with computation on the GPU.14
Mehra et al.15 described an interactive ESM-based sound propagation system for efficiently generating accu-
rate and realistic sounds of outdoor scenes with few (~10) reflecting objects. This extends previous work on
precomputing a set of pressure fields due to elementary spherical harmonic (SH) sources using frequency
domain wave-based sound propagation, compressing a high-dimensional acoustic field allocated in memory
in order to account for time-varying sources and the listener’s position and directivity.
Geometric techniques use precomputed spatial subdivision and beam tree data structures to enable real-time
acoustic modeling and auralization in interactive virtual environments with static sound sources.2 On the
other hand, modern geometric acoustic systems supporting moving sources rely on efficient ray-tracing algo-
rithms shared with CG—e.g., using bidirectional path tracing.16 Common aspects with CG algorithms emerge
in finding intersections between rays and geometric primitives in a 3D space, optimizing spatial data struc-
tures in both computational cost and memory, while ensuring enough rays for convergence of a perceptually
coherent auralization.
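As a concrete, textbook-style illustration of the geometric viewpoint (not the algorithm of any specific system cited above), the following Python sketch uses the classic image-source construction to obtain delays and gains for the direct path and the six first-order reflections in an axis-aligned “shoebox” room; the absorption handling is deliberately simplistic.

import numpy as np

C = 343.0  # speed of sound, m/s

def first_order_image_sources(src, room):
    """Mirror the source across each of the six walls of an axis-aligned
    shoebox room with one corner at the origin and dimensions `room` (m)."""
    images = []
    for axis in range(3):
        lo = src.copy(); lo[axis] = -src[axis]                   # wall at 0
        hi = src.copy(); hi[axis] = 2 * room[axis] - src[axis]   # opposite wall
        images += [lo, hi]
    return images

def early_reflection_taps(src, rcv, room, absorption=0.3):
    """Return (delay_s, gain) pairs for the direct path and the 6 first-order paths."""
    src, rcv, room = map(np.asarray, (src, rcv, room))
    paths = [(src, 1.0)] + [(img, 1.0 - absorption)
                            for img in first_order_image_sources(src, room)]
    taps = []
    for pos, refl_gain in paths:
        d = np.linalg.norm(pos - rcv)
        taps.append((d / C, refl_gain / max(d, 0.1)))   # 1/r spreading loss
    return taps

# Hypothetical 6 x 4 x 3 m room with arbitrary source and receiver positions
print(early_reflection_taps([2.0, 1.5, 1.7], [4.0, 2.5, 1.7], [6.0, 4.0, 3.0]))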
Hybrid solutions combining wave-based and geometric techniques have been proposed.17 Here, numerical
wave-based techniques are used to precompute the pressure field in the near-object regions and geometric
propagation techniques in the far-field regions to model sound propagation.
Efficient simulation of reverberation in outdoor environments is a field with limited research results. Recently, however, an efficient algorithm to simulate reverberation outdoors based on a digital waveguide web
has been proposed.18 The design of the algorithm is based on a set of digital waveguides19 connected by scat-
tering junctions at nodes that represent the reflection points of the environment under study. The structure of
the proposed reverberator allows for accurate reproduction of reflections between discrete reflection points.
This algorithm extends the scattering-delay-network approach proposed for reverberation in computer games.
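A single digital waveguide, the building block of such structures, can be sketched as two counter-propagating delay lines terminated by reflection coefficients; the Python code below is a minimal illustration under that assumption, not the waveguide-web reverberator of Stevens et al.18

import numpy as np

def waveguide_echo(x, fs, distance_m, refl_a=-0.7, refl_b=-0.7, n_samples=None):
    """Minimal single digital waveguide: two counter-propagating delay lines
    terminated by reflection coefficients at two points A and B. The input is
    injected at A and the output is also read at A."""
    c = 343.0
    delay = max(1, int(round(fs * distance_m / c)))   # one-way delay in samples
    n = n_samples or (len(x) + 20 * delay)
    right = np.zeros(delay)    # wave travelling A -> B
    left = np.zeros(delay)     # wave travelling B -> A
    y = np.zeros(n)
    for i in range(n):
        inp = x[i] if i < len(x) else 0.0
        at_b = right[-1]                   # wave arriving at B
        at_a = left[-1]                    # wave arriving at A
        y[i] = at_a
        # shift both lines by one sample, injecting reflected and new energy
        right = np.concatenate(([inp + refl_a * at_a], right[:-1]))
        left = np.concatenate(([refl_b * at_b], left[:-1]))
    return y

In a waveguide web or scattering delay network, many such bidirectional delay lines meet at scattering junctions placed at the reflection points of the environment, and each junction redistributes the incoming wave variables among the connected waveguides.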
Binaural Rendering
Binaural hearing refers to the human ability to integrate the sounds arriving at the two ears into complex auditory scenes.
Shinn-Cunningham and Shilling4 distinguish between three types of headphone simulation—namely, diotic
displays, dichotic displays, and spatialized audio. Diotic displays refer to the display of identical signals in
both channels—something that gives the listener the sensation that all sound sources are located inside the
head.4
Dichotic displays involve stereo signals that just contain frequency-dependent interaural intensity differences
(IIDs), interaural level differences (ILDs), or interaural time differences (ITDs). The two authors explain that
this type of display is very simple since the effect can be achieved by scaling and delaying the signal arriving
at each ear. Just as with diotic displays, this does not enable proper spatialization of the sound sources since
listeners may get the feeling that the sounds are moving inside their head from one ear to the other.4
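A dichotic display is easy to sketch in code: the mono input is copied to two channels, and one channel is delayed (ITD) and attenuated (ILD). The Python fragment below is a minimal illustration of this scaling-and-delaying idea, with an illustrative sign convention.

import numpy as np

def dichotic(x, fs, itd_s=0.0, ild_db=0.0):
    """Create a dichotic stereo signal from a mono input by applying an
    interaural time difference (ITD) and interaural level difference (ILD).
    Positive values lateralize the source toward the right ear. Note that
    this only lateralizes the sound inside the head; it is not full
    spatialization."""
    delay = int(round(abs(itd_s) * fs))
    gain = 10 ** (abs(ild_db) / 20.0)
    late = np.concatenate((np.zeros(delay), x))    # delayed (far) ear
    early = np.concatenate((x, np.zeros(delay)))   # leading (near) ear
    if itd_s >= 0:    # source to the right: left ear is delayed and attenuated
        left, right = late / gain, early
    else:
        left, right = early, late / gain
    return np.stack((left, right), axis=1)

# e.g., ~0.3 ms ITD and 6 dB ILD lateralize a click strongly to the right:
# stereo = dichotic(click, 44100, itd_s=0.0003, ild_db=6.0)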
Finally, spatialized sound makes it possible to render most of the spatial cues available in the real world. This
is made possible by the use of various cues, such as the ones provided by acoustic transformations produced
by the listener’s body in the so-called head-related transfer function (HRTF), sound reflections and diffrac-
tions in the surrounding space, and dynamic modifications of these cues caused by body movements.
Figure 2 depicts a complete scheme of a state-of-the-art auralization system, emphasizing the key elements
for each component of Figure 1.
Figure 2. Block diagram of a typical system for binaural rendering and auralization.
A common representation decomposes the HRTF into a minimum-phase component and a pure delay:

H(\theta, \phi, r, \omega) = H_{\min}(\theta, \phi, r, \omega)\, e^{-j 2\pi f \tau(\theta, \phi, r, \omega)},
where θ and φ define the source’s direction of arrival, and r is the source’s distance from the listener’s head. Signal-processing algorithms benefit greatly from this HRTF decomposition in terms of stability and performance. In fact, the extracted pure delay, or time shift, corresponds to the monaural time of arrival (TOA), so the ITD can be obtained as the difference between the left- and right-ear TOAs.
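The decomposition can be exploited as in the Python sketch below: estimate each ear’s TOA from the HRIR onset, take the ITD as their difference, and keep a minimum-phase version of each filter (reconstructed here from the magnitude spectrum via the real cepstrum). The onset threshold and FFT size are illustrative choices, and truncating the minimum-phase filter back to the original length is an approximation.

import numpy as np

def minimum_phase_fir(h, nfft=None):
    """Minimum-phase version of an FIR filter via the real cepstrum
    (standard homomorphic construction); the magnitude response is kept."""
    n = nfft or 4 * len(h)                        # zero-pad to reduce aliasing
    mag = np.abs(np.fft.fft(h, n)) + 1e-12
    cep = np.real(np.fft.ifft(np.log(mag)))       # real cepstrum of log magnitude
    fold = np.zeros(n)
    fold[0] = cep[0]
    fold[1:n // 2] = 2.0 * cep[1:n // 2]          # fold anticausal part onto causal
    fold[n // 2] = cep[n // 2]
    h_min = np.real(np.fft.ifft(np.exp(np.fft.fft(fold))))
    return h_min[:len(h)]                         # truncation is an approximation

def toa_samples(hrir, threshold=0.1):
    """Onset-based time-of-arrival estimate: the first sample whose magnitude
    exceeds a fraction of the peak (a simple, widely used heuristic)."""
    env = np.abs(hrir)
    return int(np.argmax(env >= threshold * env.max()))

def decompose_hrir(hrir_l, hrir_r, fs):
    """Split each HRIR into a minimum-phase filter plus a pure delay and
    return the broadband ITD in seconds (left TOA minus right TOA)."""
    toa_l, toa_r = toa_samples(hrir_l), toa_samples(hrir_r)
    itd = (toa_l - toa_r) / fs
    return (minimum_phase_fir(hrir_l), toa_l), (minimum_phase_fir(hrir_r), toa_r), itd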
Recording individual HRTFs for each listener demands considerable resources and time and is highly susceptible to measurement errors, which makes it unfeasible for most real-world applications. A common practice is to use the same generic HRTFs, such as those recorded with a dummy head, for every listener, trading representativeness of a wide human population against average efficacy (see the historical review on this topic by Paul20). However, generic HRTFs generally degrade the listening experience in terms of localization and immersion.11
Recent literature is increasingly investigating the use of personalized HRTFs that provide listeners with
HRTFs perceptually equivalent to their own individual HRTFs. Synthetic HRTFs can be physically modeled
considering different degrees of simplification: from basic geometry for the head, pinna (the external part of
the ear), shoulders, and torso to accurate numerical simulations with boundary element method (BEM) and
finite-difference time-domain (FDTD) methods.11
Recent research efforts have been investigating the development of HRTF personalization for individual us-
ers of a virtual audio display, usually in the form of precomputed HRTF filters or optimized selection proce-
dures of existing nonindividual HRTFs. In “Efficient Personalized HRTF Computation for High-Fidelity
Spatial Sound,”21 a technique to obtain personalized HRTFs using a camera is proposed. The technique com-
bines a state-of-the-art image-based 3D-modeling technique with an efficient numerical-simulation pipeline
based on the adaptive-rectangular-decomposition technique. On the other hand, we can select and manipulate
existing HRTFs according to their acoustic information, corresponding anthropometric information, and lis-
teners’ subjective ratings, if available. The procedure can be automatic or guided by qualitative tests on the
perceptual impact of nonindividual HRTFs. Finally, listeners can improve their localization performance through adaptation procedures embedded directly in VR games22 that provide tools for remapping localization cues to spatial directions when training with nonindividual HRTFs.
Auralization
Rendering sound propagation for VR through headphones requires the spatialization of the directional RIR
(i.e., early and high-order reflections) for each individual listener. In particular, this involves computation of
binaural room impulse responses (BRIRs), which are the combination of two components: the head-related
impulse response (HRIR) and the spatial room impulse response (SRIR) (see Figure 3).
Interactive auralization requires HRTF databases or models to cover most of the psychoacoustic effects in localization and timbre distortion caused by the dynamic changes of active listening; this defines the memory requirements in terms of HRTF spatial and distance resolution and the spectral information needed (i.e., HRTF filter length).11 The audibility of interpolation errors between available HRTFs, or of HRTF encoding on
a suitable functional basis, like in the spherical harmonic (SH) domain, is a fundamental aspect of auralizing
the sound entering the listener’s ear canals with headphones. Several strategies exist for HRTF interpolation
and representation leading to a perceptually optimal HRTF measurement grid of 4° to 5° spacing in both azi-
muth and elevation, with a progressive reduction of spatial points toward the polar directions. (See “Assisted
Listening Using a Headset: Enhancing Audio Perception in Real, Augmented, and Virtual Environments”23
for a compact review on this topic.)
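As an illustration of the interpolation problem, the Python sketch below performs bilinear interpolation of HRIRs stored on a hypothetical regular azimuth-elevation grid; time-domain interpolation of this kind is only reasonable if the HRIRs have been time-aligned (e.g., minimum phase plus separate delays), and the special cases near the poles are ignored.

import numpy as np

def interpolate_hrir(hrir_grid, az_deg, el_deg, step=5.0):
    """Bilinear interpolation of HRIRs on a regular azimuth/elevation grid.
    `hrir_grid` maps (azimuth, elevation) keys, in multiples of `step`
    degrees, to (2, N) arrays of time-aligned HRIRs."""
    a0 = np.floor(az_deg / step) * step
    e0 = np.floor(el_deg / step) * step
    fa = (az_deg - a0) / step               # fractional position inside the cell
    fe = (el_deg - e0) / step
    corners = [((a0, e0),                       (1 - fa) * (1 - fe)),
               (((a0 + step) % 360, e0),        fa * (1 - fe)),
               ((a0, e0 + step),                (1 - fa) * fe),
               (((a0 + step) % 360, e0 + step), fa * fe)]
    return sum(w * np.asarray(hrir_grid[key]) for key, w in corners)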
Moreover, a compromise between computational efficiency and latency for convolutions with HRTF filters
can be reached by a partitioned-block algorithm.5 The rendering of head and torso independent movements is
increasingly becoming important where full-body interactions are key elements in immersive VR applica-
tions. Nowadays, head-tracking devices are embedded in VR headsets with low drift and latency, keeping performance well below the critical latencies of 150 to 500 ms; keep in mind, however, that HRTFs are usually acquired with a fixed head-and-torso configuration. On the other hand, little attention has been given to the benefits of rendering head-above-torso orientations, which would require tracking the shoulders’ position relative to head rotations.11
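The partitioned-block idea mentioned at the beginning of this paragraph can be sketched as uniformly partitioned overlap-add convolution with a frequency-domain delay line; real systems often use non-uniform partitions, but the uniform version below already shows how latency drops to one block while long filters remain affordable.

import numpy as np

def partitioned_convolve(x, h, block=256):
    """Uniformly partitioned overlap-add convolution with a frequency-domain
    delay line: latency is one block instead of the full filter length."""
    nfft = 2 * block
    P = int(np.ceil(len(h) / block))                      # number of partitions
    H = np.stack([np.fft.rfft(h[p * block:(p + 1) * block], nfft)
                  for p in range(P)])
    fdl = np.zeros((P, nfft // 2 + 1), dtype=complex)     # spectra of past input blocks
    n_in = int(np.ceil(len(x) / block))
    n_out = n_in + P - 1                                  # extra blocks flush the tail
    y = np.zeros((n_out + 1) * block)
    overlap = np.zeros(block)
    for b in range(n_out):
        xb = x[b * block:(b + 1) * block] if b < n_in else np.zeros(block)
        fdl = np.roll(fdl, 1, axis=0)
        fdl[0] = np.fft.rfft(xb, nfft)
        yb = np.fft.irfft(np.sum(fdl * H, axis=0), nfft)
        y[b * block:(b + 1) * block] = yb[:block] + overlap
        overlap = yb[block:]
    y[n_out * block:(n_out + 1) * block] = overlap
    return y[:len(x) + len(h) - 1]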
Given an HRTF representation, previous approaches tend to evaluate HRTFs for each sound propagation
path, resulting in a process too slow for interactive-VR latency requirements. Schissler et al.24 presented an
approach that computes the SRIR and performs the convolution with the HRTF in the SH domain, projecting the RIR directionally around the listener. This approach resembles the well-known virtual reproduction of a multichannel surround system, in which a virtual sound field is encoded into an HRTF-based binaural system. Relevant examples for VR are high-order ambisonics, wave-field synthesis, and directional audio coding.11
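The virtual-multichannel idea can be sketched as follows: encode the source into first-order ambisonics and decode to a small set of virtual loudspeakers, each rendered through its HRIR pair. Gain conventions differ between ambisonic formats; the FuMa-style gains and the simple projection decoder in this Python sketch are illustrative, and the HRIRs are assumed to come from an existing HRTF set.

import numpy as np
from scipy.signal import fftconvolve

def encode_foa(x, azimuth, elevation):
    """Encode a mono signal into first-order ambisonics (B-format W, X, Y, Z)
    using FuMa-style gains, with W scaled by 1/sqrt(2)."""
    w = x / np.sqrt(2.0)
    xs = x * np.cos(elevation) * np.cos(azimuth)
    ys = x * np.cos(elevation) * np.sin(azimuth)
    zs = x * np.sin(elevation)
    return np.stack((w, xs, ys, zs))

def foa_to_binaural(bformat, speaker_dirs, hrirs):
    """Decode B-format to virtual loudspeakers (basic projection decoder) and
    render each through its HRIR pair. `hrirs[i]` is a (2, N) array for the
    i-th virtual speaker direction; all HRIRs are assumed equally long."""
    out_l, out_r = 0.0, 0.0
    for (az, el), hrir in zip(speaker_dirs, hrirs):
        # projection ("sampling") decoder gains for this speaker direction
        g = np.array([np.sqrt(2.0),
                      np.cos(el) * np.cos(az),
                      np.cos(el) * np.sin(az),
                      np.sin(el)]) / len(speaker_dirs)
        feed = g @ bformat                       # virtual loudspeaker signal
        out_l = out_l + fftconvolve(feed, hrir[0])
        out_r = out_r + fftconvolve(feed, hrir[1])
    return np.stack((out_l, out_r))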
As we discussed for HRTFs, even for SRIRs perceptually motivated techniques for sound scene auralization
are preferred in order to control computational resources. The culling of inaudible reflections has to be ap-
plied in complex multipath environments, and many approaches have been proposed in the literature. Ac-
cordingly, besides limiting, clustering, and projecting reflections in the reference sphere around the listener,
binaural loudness models can be used to calculate a masking threshold for a time-frequency representation of
sound sources.25
CONCLUSION
In this article we presented an overview of the state of the art of interactive sound rendering for immersive
environments. As shown in this article, in recent years progress has been made in relation to the three differ-
ent elements of the sound design pipeline—i.e., source modeling, room acoustics modeling, and listener
modeling. From the point of view of source modeling, many sound events have been synthesized, and real-
time simulations exist. However, these simulations have not yet been adopted in the commercial software
engines usually used by the VR community. This can be due to the fact that the quality of the simulations is
not high enough, so professionals still prefer to rely on recordings despite their limitations, such as the fact
that they do not show the same flexibility as simulations.
Moreover, although physics-based modeling using techniques such as modal synthesis is fast, it still isn’t fast
enough to accommodate several sound-producing objects in a fraction of a CPU core, which is what VR
needs. Overall, in practice, one might need to make the CPU cost and sound quality of physically based
sound synthesis competitive with hardware decompression of recorded sounds.
Much progress has also been made in regard to room acoustics modeling, and efficient algorithms combined
with increased computational power and parallel processing make such models available in real time. Work-
ing in perceptual parametric domains, as with the technique presented in “Parametric Wave Field Coding for
Precomputed Sound Propagation,”13 makes artistic soundscape creation easy. But, like any parametric en-
coder, the encoding is lossy: not all relevant perceptual aspects of the impulse response are completely mod-
eled. Source and listener directivity is usually approximated, and delayed echoes are not rendered faithfully, with noticeable degradation in outdoor environments. Finding a compact set of perceptual metrics that incorporate
such missing aspects is the main challenge to allowing adaptation to several environments, from outdoor to
indoor, and better simulation of the acoustic sensation of moving in different spaces such as a tunnel, an arch,
etc.
Moreover, an active field of research is the capturing and implementation of personalized HRTFs. While lis-
tening tests in static environments have shown that subjects have a better sensation of space with personal-
ized HRTFs, it has not yet been investigated whether high-fidelity personalized HRTFs are needed in
complex virtual environments where several independent variables are present, such as the visual and proprioceptive elements of the environment.
Finally, a challenging issue that has received less attention is auralization of nearby sources’ acoustics, which
seems to be perceptually relevant for action sounds in the proximal region or the listener’s peripersonal
space. Independence between HRTF directional and distance information does not hold, due to the change in
the reference sound field from a plane wave to a spherical wave. Thus, efficient interpolation methods and
range extrapolation algorithms in HRTF rendering should be developed. Furthermore, users’ actions dynami-
cally define near-field HRTFs because of body movements that have specific degrees of freedom for each
body articulation and connection. Modeling such complexity is a pivotal challenge for future full-body VR
experiences.
Sidebar: The Plausibility of Binaural Synthesis

• Ergonomic and “ear adequate” delivery systems. The less aware listeners are that the sounds are being played back from headphones, the more likely they are to externalize stimuli in a natural way. The choice of headphones is thus crucial and depends on the low pressure exerted by the strap or cups and on changes in the acoustic load inside the ear canal.
• Individual spectral cues. Binaural signals should be appropriate to listener-specific spectral filtering. Particular attention should be given to headphone-specific compensation filters for the acoustic distortion introduced by the playback device, which have to be adapted to different listeners and to repositioning of the device on the ears.
• Head movements. Body movements in everyday-life activities produce perceptually relevant dy-
namic changes in interaural cues. Optimization of head-tracker latency, spatial discretization of the
impulse response dataset, and real-time interpolation for rendering arbitrary source positions pose
critical challenges in multimodal VR scenarios.
• Room acoustics. The availability of virtual room acoustics that resemble a real-world sound field,
and therefore contain plausible reverberation models, is relevant for externalization, spatial impres-
sion, and perception of space. Particular attention should be given to reflections and edge diffraction
for each sound source, which pose relevant challenges in finding the trade-off between accuracy and
plausibility due to computational and memory constraints, while at the same time preserving low-
enough latency for interaction. (See the sidebar “Sound Propagation” for further details.)
Sidebar: Immersive Sound Has Several Applications

The ultimate goal is to improve the learning aspects of music listening, either for education or for personal enrichment.
VR musical instruments are also a field where naturally physics-based sound modeling and rendering play an
important role. For a recent overview of VR musical instruments, we refer to the work presented in “Virtual
Reality Musical Instruments: State of the Art, Design Principles, and Future Directions.”28
Carefully designed sounds can be used to reproduce the acoustics of specific spaces, such as Notre Dame Cathedral.29 In this particular case, a computational model of the acoustics of the space was created and used in an immersive VR experience.
Assisted listening and navigation for partially or fully visually impaired users is an example of an important
application of interactive sound rendering for virtual environments. As an example, in “Assisted Listening
Using a Headset: Enhancing Audio Perception in Real, Augmented, and Virtual Environments,”23 an over-
view of the use of augmented audition for navigation is presented.
Among the many possibilities that one can create with VR technologies, Figure A depicts some case studies
and tools developed at the Multisensory Experience Lab of Aalborg University Copenhagen.
Figure A. Applications and tools in interactive VR. Top left: interactive and creative virtual instruments. Bottom
left: a multimodal rendering of an outdoor VR environment for meditation. Top center: the origin of the generic
acoustic contribution for a human-like listener. Bottom center: an indoor scene with reduced visual information
for orientation and mobility purposes. Top and bottom right: examples of VR applications for users of different
ages.
Sidebar: Headphone Technologies for Natural Spatial Sounds

Ideally, the pressure division between the open and the blocked entrance of the ear canal should be the same whether the sound source is in the free field or a headphone:

\frac{P^{hp}_{open}}{P^{hp}_{blocked}} = \frac{P_{open}}{P_{blocked}},

where P_{open} and P_{blocked} denote the free-field sound pressure at the entrance of the open and blocked ear canal, respectively, while P^{hp}_{open} and P^{hp}_{blocked} denote the sound pressure at the same observation points when the sound source is a headphone. The pressure division ratio (PDR) is the ratio between the two sides of this equality, and headphones with PDR ≈ 1 satisfy the free-air equivalent coupling (FEC) characteristic.31 Intersubject variability is limited at low frequencies, up to ≈ 4 kHz, because headphones act as an acoustic cavity that introduces only a constant level variation.
In contrast, in the higher part of the spectrum, the headphone position and the listener’s anthropometry give rise to several frequency notches that are difficult to predict. It is worth noting that the fidelity of spatial sound reproduction relies on the degree of individualization in the headphone correction, in both measurement techniques and equalization methods, with an emphasis on high-frequency control in the inverse-filtering problem.32
Obtaining in situ robust and individual headphone calibration with straightforward procedures in order to
always apply listener-specific corrections to headphones is a challenging research issue, especially for in-
serted earphones that do not satisfy FEC criteria. An innovative approach to estimate the sound pressure in an
occluded ear canal involves binaural earphones with microphones that can extract the ear canal transfer
function (ECTF) and compensate for occlusion via an adaptive inverse-filtering method in real time.27
Sidebar: Sound Source Modeling

A widely used synthesis technique is modal synthesis.34 The main principle behind modal synthesis is the decomposition of a vibrating object into its modes—i.e., the resonances of the system. Each mode can be simulated as a mass-spring system or, equivalently, as an exponentially decaying sine wave.
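A minimal Python sketch of modal synthesis along these lines drives a bank of two-pole resonators (one per mode, the discrete-time counterpart of a damped mass-spring) with a shared excitation signal; the frequencies, decays, and gains below are purely illustrative.

import numpy as np

def modal_bank(excitation, freqs, decays, gains, fs=44100):
    """Drive a bank of two-pole resonators (one per mode) with a shared
    excitation signal, e.g., a short contact-force pulse from a physics engine."""
    y = np.zeros(len(excitation))
    for f, d, g in zip(freqs, decays, gains):
        r = np.exp(-d / fs)                   # per-sample decay factor
        w = 2.0 * np.pi * f / fs              # normalized mode frequency
        a1, a2 = 2.0 * r * np.cos(w), -r * r
        y1 = y2 = 0.0
        mode = np.empty_like(y)
        for n, x in enumerate(excitation):
            y0 = x + a1 * y1 + a2 * y2        # resonator difference equation
            mode[n] = y0
            y2, y1 = y1, y0
        y += g * mode
    return y

# Illustrative use: a 2 ms raised-cosine pulse stands in for the contact force.
fs = 44100
pulse = np.hanning(int(0.002 * fs))
excitation = np.concatenate((pulse, np.zeros(fs - len(pulse))))
sound = modal_bank(excitation, freqs=[440.0, 1210.0, 2050.0],
                   decays=[6.0, 9.0, 15.0], gains=[1.0, 0.5, 0.25])

Striking the object harder or elsewhere only changes the per-mode gains, which is part of what makes modal synthesis attractive for interactive use.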
O’Brien et al.35 describe a real-time technique for generating realistic and compelling sounds that correspond
to the motions of rigid objects. In this case, the actions are not necessarily actions of the user in the environ-
ment, but actions between objects present in the simulation. By numerically precomputing the shape and fre-
quencies of an object’s deformation modes, audio can be synthesized interactively directly from the force
data generated by a standard rigid-body simulation. This approach allows accurate modeling of the sounds
generated by arbitrarily shaped objects based only on a geometric description of the objects and a handful of
material parameters.
Avanzini et al.36 present some efficient yet accurate algorithms to simulate frictional interactions between
rubbed dry surfaces. The work is based on the assumption that similar frictional interactions appear in several
events such as a door squeaking or a rubbed wineglass. For this reason, the same simulation algorithm can be
used with different parameters adapted to the different objects interacting.
An efficient, physically informed model to simulate liquid sounds has been proposed by van den Doel.37 The modeling approach starts from the observation that the full physics of liquid sounds is too complex to simulate exactly. A physically informed approach therefore considers some physical elements of the interaction between bubbles and solid surfaces, such as the fact that bubbles decrease in size when they hit the floor, combined with a perceptually based approach in which a bubble is simulated as a sine wave whose frequency changes over time. Perceptual experiments show that subjects find such an approach suitable for inclusion in a VR simulation.37
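A hedged Python sketch of this kind of physically informed bubble model is shown below: each bubble is a decaying sine wave with a time-varying frequency, and many randomized bubbles are mixed to suggest pouring or splashing; all parameter ranges are illustrative rather than taken from the cited model.

import numpy as np

def bubble(fs=44100, dur=0.25, f0=800.0, slope=0.5, decay=40.0):
    """Single bubble event: an exponentially decaying sine whose frequency
    drifts over time (slope > 0 raises the pitch, slope < 0 lowers it)."""
    t = np.arange(int(fs * dur)) / fs
    freq = f0 * (1.0 + slope * t)                 # linear frequency trajectory
    phase = 2.0 * np.pi * np.cumsum(freq) / fs
    return np.exp(-decay * t) * np.sin(phase)

def spray(n_bubbles=40, fs=44100, dur=2.0, rng=None):
    """Mix many randomized bubbles to suggest pouring or splashing water."""
    rng = rng or np.random.default_rng()
    out = np.zeros(int(fs * dur))
    for _ in range(n_bubbles):
        b = bubble(fs, dur=rng.uniform(0.05, 0.2), f0=rng.uniform(400.0, 3000.0))
        start = rng.integers(0, len(out) - len(b))
        out[start:start + len(b)] += rng.uniform(0.2, 1.0) * b
    return out / np.max(np.abs(out))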
An approach for fluid sound simulation is also proposed by Moss et al.38 Here, liquid sounds are generated directly from a visual simulation of fluid dynamics. The main advantage is that the sound can be obtained from the visual simulation with minimal additional computational cost and potentially in real time. Cirio et al.39 describe an approach to render vibrotactile and sonic feedback for fluids. The model is
divided into three components, following the physical processes that generate sound during solid-fluid inter-
action: the initial high-frequency impact, the small-bubble harmonics, and the main-cavity oscillation.
A simulation of aerodynamic sounds such as aeolian tones and cavity tones is presented by Dobashi et al.40
The authors propose an efficient implementation of aerodynamic sound based on computational fluid dynam-
ics, where the sound generated depends on the speed of the aerodynamic action. Precomputations of an ob-
ject in different poses with respect to the flow facilitate real-time simulation.
One important action performed by users in virtual environments is walking. Nordahl et al.41 present different
algorithms to simulate the sound of walking on solid surfaces such as wood and asphalt as well as aggregate
surfaces such as sand and snow. Based on perceptual evaluations, the authors show that users were able to
recognize most of the simulated surfaces they were exposed to. Moreover, a second study showed that users
were able to better identify the surfaces they were exposed to when soundscapes were added to the simula-
tion. In these cases, the soundscapes were simulating different environments such as a beach, forest, etc. This
result shows the importance of environmental sound rendering in creating a sense of space.
REFERENCES
1. T. Takala and J. Hahn, “Sound Rendering,” ACM SIGGRAPH Computer Graphics, vol. 26, no. 2,
1992, pp. 211–220.
2. T. Funkhouser et al., “A beam tracing approach to acoustic modeling for interactive virtual
environments,” Proceedings of the 25th Annual Conference on Computer Graphics and Interactive
Techniques (SIGGRAPH 98), 1998, pp. 21–32.
3. K. van den Doel, P.G. Kry, and D.K. Pai, “FoleyAutomatic: physically-based sound effects for
interactive simulation and animation,” Proceedings of the 28th Annual Conference on Computer
Graphics and Interactive Techniques (SIGGRAPH 01), 2001, pp. 537–544.
4. B. Shinn-Cunningham and R. Shilling, “Virtual Auditory Displays,” Handbook of Virtual
Environment Technology, Lawrence Erlbaum Associates Publishers, 2002.
5. V. Välimäki et al., “More than 50 years of artificial reverberation,” Proceedings of the AES 60th International Conference on Dereverberation and Reverberation of Audio, Music, and Speech (DREAMS 16), 2016.
6. L. Savioja et al., “Creating interactive virtual acoustic environments,” Journal of the Audio
Engineering Society, vol. 47, no. 9, 1999, pp. 675–705.
7. P.R. Cook, Real sound synthesis for interactive applications, CRC Press, 2002.
8. Z. Ren, H. Yeh, and M.C. Lin, “Example-guided physically based modal sound synthesis,” ACM
Transactions on Graphics, vol. 32, no. 1, 2013.
9. D.B. Lloyd, N. Raghuvanshi, and N.K. Govindaraju, “Sound synthesis for impact sounds in video
games,” Proceedings of the 2011 ACM SIGGRAPH Symposium on Interactive 3D Graphics and
Games (I3D 2011), 2011, pp. 55–62.
10. R. Nordahl and N.C. Nilsson, “The sound of being there: presence and interactive audio in
immersive virtual reality,” Oxford Handbook of Interactive Audio, Oxford University Press, 2014.
11. B. Xie, Head-Related Transfer Function and Virtual Auditory Display, J. Ross, 2013.
12. D.L. James, J. Barbič, and D.K. Pai, “Precomputed acoustic transfer: output-sensitive, accurate
sound generation for geometrically complex vibration sources,” ACM Transactions on Graphics,
vol. 25, no. 3, 2006, pp. 987–995.
13. N. Raghuvanshi and J. Snyder, “Parametric wave field coding for precomputed sound
propagation,” ACM Transactions on Graphics, vol. 33, no. 4, 2014, pp. 1–11.
14. L. Savioja, “Real-time 3D finite-difference time-domain simulation of low- and mid-frequency room acoustics,” Proceedings of the 13th International Conference on Digital Audio Effects (DAFx 10), 2010.
15. R. Mehra et al., “WAVE: Interactive wave-based sound propagation for virtual environments,” IEEE Transactions on Visualization and Computer Graphics, vol. 21, no. 4, 2015, pp. 434–442.
16. C. Cao et al., “Interactive sound propagation with bidirectional path tracing,” ACM Transactions
on Graphics, vol. 35, no. 6, 2016, pp. 1–11.
17. H. Yeh et al., “Wave-ray coupling for interactive sound propagation in large complex scenes,”
ACM Transactions on Graphics, vol. 32, no. 6, 2013.
18. F. Stevens et al., “Modeling sparsely reflecting outdoor acoustic scenes using the waveguide web,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 8, 2017.
19. J.O. Smith, “Physical modeling using digital waveguides,” Computer Music Journal, vol. 16, no. 4,
1992, pp. 74–91.
20. S. Paul, “Binaural Recording Technology: A Historical Review and Possible Future
Developments,” Acta Acustica united with Acustica, vol. 95, no. 5, 2009, pp. 767–788.
21. A. Meshram et al., “P-HRTF: Efficient Personalized HRTF Computation for High-Fidelity Spatial
Sound,” Proceedings of the 2014 IEEE International Symposium on Mixed and Augmented Reality
(ISMAR 14), 2014, pp. 53–61.
22. G. Parseihian and B.F.G. Katz, “Rapid head-related transfer function adaptation using a virtual
auditory environment,” The Journal of the Acoustical Society of America, vol. 131, no. 4, 2012, pp.
2948–2957.
23. V. Välimäki et al., “Enhancing audio perception in real, augmented, and virtual environments,”
IEEE Signal Processing Magazine, vol. 33, no. 2, 2015, pp. 92–99.
24. C. Schissler, A. Nicholls, and R. Mehra, “Efficient HRTF-based spatial audio for area and
volumetric sources,” IEEE Trans. Visualization and Computer Graphics, vol. 22, no. 4, 2016, pp.
1356–1366.
25. H. Hacihabiboglu et al., “Perceptual Spatial Audio Recording, Simulation, and Rendering: An
overview of spatial-audio techniques based on psychoacoustics,” IEEE Signal Processing, vol. 34,
no. 3, 2017, pp. 36–54.
26. C. Summers, V. Lympouridis, and C. Erkut, “Sonic interaction design for virtual and augmented
reality environments,” Proceedings of the 2015 IEEE 2nd VR Workshop on Sonic Interactions for
Virtual Environments (SIVE 15), 2015, pp. 1–6.
27. J. Janer et al., “Immersive orchestras: audio processing for orchestral music VR content,”
Proceedings of the 8th International Conference on Games and Virtual Worlds for Serious
Applications (VS Games 16), 2016, pp. 1–2.
28. S. Serafin et al., “Virtual reality musical instruments: State of the art, design principles, and future
directions,” Computer Music Journal, vol. 40, no. 3, 2016.
29. B.F. Katz et al., “Experience with a virtual reality auralization of Notre-Dame Cathedral,” The
Journal of the Acoustical Society of America, vol. 141, no. 5, 2017.
30. H. Møller, “Fundamentals of binaural technology,” Applied Acoustics, vol. 36, no. 3-4, 1992, pp.
171–218.
31. B. Boren et al., “Coloration Metrics for Headphone Equalization,” Proceedings of the 21st
International Conference on Auditory Display, 2015, pp. 29–34.
32. F. Denk et al., “An individualised acoustically transparent earpiece for hearing devices,”
International Journal of Audiology, 2017, pp. 1–9.
33. W.W. Gaver, “What in the world do we hear?: An ecological approach to auditory event
perception,” Ecological Psychology, vol. 5, no. 1, 1993, pp. 1–29.
34. J.-M. Adrien, “The missing link: Modal synthesis,” Representations of musical signals, MIT Press,
1991.
35. J.F. O'Brien, C. Shen, and C.M. Gatchalian, “Synthesizing sounds from rigid-body simulations,”
Proceedings of the 2002 ACM SIGGRAPH/Eurographics Symposium on Computer Animation,
2002, pp. 175–181.
36. F. Avanzini, S. Serafin, and D. Rocchesso, “Interactive simulation of rigid body interaction with
friction-induced sound generation,” IEEE Transactions on Speech and Audio Processing, vol. 13,
no. 5, 2005, pp. 1073–1081.
37. K. van den Doel, “Physically based models for liquid sounds,” ACM Transactions on Applied
Perception, vol. 2, no. 4, 2005, pp. 534–546.
38. W. Moss et al., “Sounding liquids: Automatic sound synthesis from fluid simulation,” ACM
Transactions on Graphics, vol. 29, no. 3, 2010.
39. G. Cirio et al., “Vibrotactile rendering of splashing fluids,” IEEE Transactions on Haptics, vol. 6, no. 1, 2013, pp. 117–122.
40. Y. Dobashi, T. Yamamoto, and T. Nishita, “Real-time rendering of aerodynamic sound using
sound textures based on computational fluid dynamics,” ACM Transactions on Graphics, vol. 22,
no. 3, 2003, pp. 732–740.
41. R. Nordahl, L. Turchet, and S. Serafin, “Sound synthesis and evaluation of interactive footsteps
and environmental sounds rendering for virtual reality applications,” IEEE Transactions on
Visualization and Computer Graphics, vol. 17, no. 9, 2011, pp. 1234–1244.