
THEME ARTICLE: Virtual and Augmented Reality

Sonic Interactions in Virtual Reality: State of the Art, Current Challenges, and Future Directions

Stefania Serafin, Aalborg University
Michele Geronazzo, Aalborg University
Cumhur Erkut, Aalborg University
Niels C. Nilsson, Aalborg University
Rolf Nordahl, Aalborg University

A high-fidelity but efficient sound simulation is an essential element of any VR experience. Many techniques used in virtual acoustics are graphical rendering techniques suitably modified to account for sound generation and propagation. In recent years, several advances in hardware and software technologies have been facilitating the development of immersive interactive sound-rendering experiences. In this article, we present a review of the state of the art of such simulations, with a focus on the different elements that, combined, provide a complete interactive sonic experience. This includes physics-based simulation of sound effects and their propagation in space together with binaural rendering to simulate the position of sound sources. We present how these different elements of the sound design pipeline have been addressed in the literature, trying to find the trade-off between accuracy and plausibility. Recent applications and current challenges are also presented.

In recent years, the availability of low-cost head-mounted displays has enhanced interest in immersive sound
experiences. Investigation of immersive sound in the context of VR is not new, although there’s general con-
sensus that sound is an underused modality in VR. In an interactive immersive experience, sound can direct
the user’s attention, enhancing the sense of presence and the ability to move users by creating interactive
time-varying experiences. The auditory modality possesses unique features. Unlike its visual counterpart,
auditory perception is always on, since we cannot close our ears. Thus, this sensory channel always flows
with information about the surrounding environment, regardless of whether we pay attention to it or not.
Visual perception is superior in terms of spatial resolution, but it is inherently directional. Our limited field of
view entails that we have to turn our heads or bodies in order to perceive the surrounding environment. Audi-
tory perception, on the other hand, is omnidirectional. Moreover, auditory cues are inherently temporal in
nature. A sound-based event is by definition an unfolding event. In sum, it appears that auditory displays
constitute a relatively inexpensive approach in terms of both hardware and software simulation, and are a
valuable component of VR systems intended to represent multimodal virtual spaces and elicit a sensation of
presence in these spaces.


In the past decades, several methods have been developed for automatic generation and propagation of sound
by making use of techniques developed in computer graphics (CG), such as texture mapping or synthesis,1
beam tracing,2 and rigid-body simulations.3 These methods make it possible to create realistic sound related
to objects in motion and dependent on the geometric relationship between the receiver and the sound source.
Sound synthesis, propagation, and rendering are slowly becoming important research fields in relation to VR.
However, several research questions remain open, especially when immersive sound is embedded into eco-
logically valid interactive multisensory experiences. In these situations, results from investigations of the role
of immersive sound in static experiences might not remain valid. Audio techniques for interactive multi-
modal systems and CG applications differ from offline techniques whose only aim is accuracy and authenticity. A perceptually plausible but less authentic scheme for sonic interactions is more convenient in practice, since it remains efficient in terms of memory and computational power and keeps latency low enough.
However, the trade-off between accuracy and plausibility is complex, and finding techniques that balance
them effectively remains challenging. This challenge is particularly pertinent in relation to VR because real-
time constraints entail that system resources have to be carefully shared with graphics, sensors, application
logic, and high-level functionality (e.g., artificial intelligence).
The sidebar “The Plausibility of Binaural Synthesis” presents an overview of the challenges to present a
plausible binaural scene using headphones.
In this article we provide an overview of the state of the art of sound synthesis, propagation, and rendering in
immersive virtual environments. The different elements of a sound production pipeline can be seen in the
sidebar “Immersive Sound Has Several Applications.” Although immersive sound can be delivered through a
speaker setup or headphone setup, we focus on sound rendering through headphones, since it is the hardware
solution available with state-of-the-art consumer head-mounted displays and is easily integrated in mobile
and portable devices for future mobile-VR experiences. Headphone-based sound rendering makes it possible
to completely control the sound arriving at each ear. In addition to offering more precise control of binaural
cues, unwanted sounds such as echoes and reverberation in listeners’ physical environment are kept from
reaching the ears.
However, this comes at a price, since headphones may be experienced as intrusive by the user, and they may
be detrimental to the naturalness of the listening experience and to the externalization of sound sources, thus
interfering with the perceptual localization of the surrounding space outside the head.4 The acoustic coupling
between the headphones and eardrums varies heavily from person to person and with small displacements of
headphones. Headphone-induced coloration should be reduced by product design criteria and equalization
algorithms that are able to minimize artifacts in the auralization (see the sidebar “Headphone Technologies
for Natural Spatial Sounds” for further details). Moreover, the critical aspect of headphone listening is
whether stimuli are heard as being inside the head (being lateralized, subject to the so-called in-head locali-
zation) or outside the head (being localized in space). For this reason, dynamic binaural spatialization of
room acoustics is a key element in terms of externalization, localization, realism, and immersion.
The auralization of a VR scene can be defined by geometric complexity and the implemented acoustic effects
(i.e., the order of reflections, diffraction, and scattering). Particular attention should be given to diffracted
occlusion and sound propagation of early reflections that have to be coherently rendered in order to imple-
ment a perceptually plausible and efficient auralization. Accordingly, there exist several studies that cover
various aspects of this issue and take advantage of recent developments in CPU and GPU processing allow-
ing flexible scaling of aural complexity (see “More Than 50 Years of Artificial Reverberation”5 for a recent
review).

IMMERSIVE SOUND FOR VR


Figure 1 shows a schematic representation of the different sound elements needed to create an immersive
sonic experience. In the figure, action sounds are those sounds produced by the user that need to change ac-
cording to movements. One of the essential elements that distinguishes sound events in VR from those in film is
the importance of interactivity: actions of the users such as walking, hitting, throwing, etc., often generate
sounds. Environmental sounds are those sounds produced by objects in the environment, as well as the
soundscape of the space. Sound propagation represents the simulation of the acoustics of the space, either a
closed space such as a room or an open space. Binaural rendering relates to how the sound reaches each of
our ears and allows us to perceive sound-producing objects as placed at a specific azimuth (horizontal direc-
tion), distance, and elevation (vertical direction).


Figure 1. A schematic representation of the different sound elements needed to create an immersive sonic
experience. Action sounds are those sounds produced by the user. Action sounds depend on the gestures
performed by the person experiencing the virtual environment. Environmental sounds are those sounds
produced by objects in the environment, as well as the soundscape of the different spaces. Soundscapes are
usually created by combining sampled sounds, while sounds created by objects in the environment can be
algorithmically generated using physical models. Sound propagation refers to the simulation of the acoustics of
the environment. Binaural rendering relates to how the sound reaches each of our ears and creates a sense
that sounds are located at a specific azimuth (horizontal direction), distance, and elevation.

In the literature, the term virtual acoustics has been used to cover the three main elements of the system:
source modeling, room acoustics modeling, and receiver modeling.6 Complete virtual-acoustics systems that
include all three elements include the DIVA (Digital Interactive Virtual Acoustics) system, developed at the
Helsinki University of Technology;6 NASA’s SLAB (Sound Lab; https://software.nasa.gov/software/ARC-
14991-1); and Spat (http://forumnet.ircam.fr/product/spat-en), developed at IRCAM (Institute for Research
and Coordination in Acoustics/Music).
In a traditional audio design pipeline, a monophonic sound is synthesized and then spatialized, i.e., virtually positioned in its environment.
A problem for practical adoption of synthesis techniques such as modal synthesis is that synthesized sounds are not yet indistinguishable from real ones. Unlike low-poly shapes in visual rendering, which remain efficient and acceptable, an oversimplified mode-shape matrix compromises sound quality. CPU and RAM usage are also primary issues: the mode-shape matrix for modal sounds can easily consume a large amount of RAM. Moreover, deficiencies in modeling good contact mechanics are a contributing factor, especially given that engines run the physics at visual-update rates, which in most situations are too low for sound synthesis. Finally, high frequencies, with their highly oscillatory mode shapes, remain a problem.
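
To make the memory point concrete, here is a rough back-of-the-envelope estimate; the vertex and mode counts are hypothetical illustrations, not figures from any cited system:

import numpy as np  # only for consistency with the other sketches; plain Python suffices

# Hypothetical numbers, only to show the order of magnitude involved.
num_vertices = 50_000     # surface vertices at which the object can be excited (assumed)
num_modes = 2_000         # modes retained up to the audible range (assumed)
bytes_per_value = 4       # single-precision float

mode_shape_bytes = num_vertices * num_modes * bytes_per_value
print(f"mode-shape matrix: {mode_shape_bytes / 1e6:.0f} MB per object")   # about 400 MB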
Another key challenge in widely adopting modal techniques is the lack of automatic determination of satis-
factory material parameters that recreate realistic audio quality of sound-producing materials. Musical signal
processing has much to offer in parameter estimation and residual-excitation-signal calculation.7 A particular way of achieving this is commuted synthesis, which, relying on linear, time-invariant system theory, commutes and consolidates into the excitation signal the output modes that the resonator cannot handle well.7 Ren et al.8
introduced such a method that uses prerecorded audio clips to estimate material parameters that capture the
inherent quality of recorded sound-producing materials. The method extracts perceptually salient features
from audio examples. Recent research proposes to perform the computation of the modes directly on the
GPU to ensure real-time implementation in situations where the number of modes is too high. This can be the case, for
example, in complex environments where several objects are present.
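
As a hedged illustration of the example-guided idea (not the actual method of Ren et al.), the following sketch picks modal frequencies from the spectrum of a recorded impact and fits a decay rate to each mode's log envelope; the bandwidths, thresholds, and mode count are arbitrary choices:

import numpy as np
from scipy.signal import find_peaks, butter, sosfilt, hilbert

def estimate_modes(impact, fs, num_modes=10):
    """Return (frequency_hz, spectral_gain, decay_per_second) triples from a recorded impact."""
    windowed = impact * np.hanning(len(impact))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(impact), 1.0 / fs)
    peaks, props = find_peaks(spectrum, height=spectrum.max() * 0.05)
    strongest = peaks[np.argsort(props["peak_heights"])[-num_modes:]]
    modes = []
    for p in strongest:
        f0 = freqs[p]
        if f0 < 20.0 or f0 > 0.45 * fs:
            continue
        # Isolate the mode with a narrow bandpass, then fit a line to its log envelope.
        sos = butter(2, [max(f0 - 50.0, 10.0), f0 + 50.0], btype="bandpass", fs=fs, output="sos")
        env = np.abs(hilbert(sosfilt(sos, impact))) + 1e-12
        start = int(np.argmax(env))                       # fit only after the envelope peak
        t = np.arange(len(env) - start) / fs
        slope, _ = np.polyfit(t, np.log(env[start:]), 1)
        modes.append((f0, float(spectrum[p]), max(-slope, 1.0)))
    return modes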


Data-driven approaches are also a promising direction. The technique proposed by Lloyd et al.9 is efficient
enough to be used in a shipped game. It analyzes recordings to fit a sinusoid-plus-noise model.7 The pro-
duced impact sounds are convincingly realistic, and the technique is memory- and CPU-efficient but is capa-
ble only of producing random variations on an impact sound. The value for games and VR lies in breaking
the monotony of playing the exact same clip without the memory cost of multiple clips, while still being
physically inspired.
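
A toy sketch of the variation idea, not Lloyd et al.'s algorithm: given stored modal parameters, each impact is resynthesized with small random perturbations so that repeated hits never play back as the exact same clip. The jitter ranges and the noise transient are invented for illustration:

import numpy as np

def render_impact(modes, fs=44100, duration=0.5, rng=None):
    """Resynthesize an impact from (frequency_hz, gain, decay_per_second) triples with jitter."""
    rng = rng if rng is not None else np.random.default_rng()
    t = np.arange(int(duration * fs)) / fs
    out = np.zeros_like(t)
    for freq, gain, decay in modes:
        f = freq * rng.uniform(0.99, 1.01)    # +/- 1% pitch jitter per hit
        g = gain * rng.uniform(0.8, 1.2)      # +/- 20% level jitter per hit
        d = decay * rng.uniform(0.9, 1.1)
        out += g * np.exp(-d * t) * np.sin(2.0 * np.pi * f * t)
    # A short noise transient stands in for the broadband attack component.
    out += 0.01 * np.abs(out).max() * rng.standard_normal(len(t)) * np.exp(-40.0 * t)
    return out / (np.abs(out).max() + 1e-12)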
Simulating sound events for VR imposes additional challenges, as opposed to film or games (see the sidebar
“Sound Source Modeling” for further details). One essential difference is the fact that users are fully sur-
rounded by the environment. In VR, sound can help direct the attention of the user and enhance the sensation
of place and space.10

SOUND PROPAGATION
Sound propagation refers to the simulation of sound’s movement from the sound-producing object to the ears
of the listener. A complete survey of methods for sound propagation in interactive virtual environments is
described in “More Than 50 Years of Artificial Reverberation.”5 A primary challenge in acoustic modeling
of sound propagation is the computation of reverberation paths from a sound source to a listener (receiver).
As sound may travel from source to receiver via a multitude of reflection, transmission, and diffraction paths,
accurate simulation is extremely computationally intensive.
One simple approach to simulating the space is to capture the so-called room impulse response (RIR) and convolve it with the original dry signal.11 This method is simple but lacks flexibility. Therefore, several geo-
metric (high-frequency approximation of sound propagating as rays), wave-based (a solver for underlying
physical equations), and hybrid methods have been proposed for sound propagation modeling. Wave-based
techniques can be mathematically and computationally expensive, so more efficient solutions must be inves-
tigated in order to use these techniques in real-time simulations. James et al.12 investigated preprocessing of
global sound radiation effects for a single vibrating object and showed that expensive linear acoustic transfer
phenomena can be largely precomputed to reduce the required computational resources in support of efficient
real-time sound radiation. However, memory allocation grows with the number of sound sources (>100
Mbytes each), requiring further research to make sound-field encoding with an equivalent source method
(ESM) practical.
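
For reference, the baseline static approach mentioned at the start of this section (convolving a dry signal with a measured RIR) can be sketched in a few lines; the function name, gain parameter, and normalization are illustrative:

import numpy as np
from scipy.signal import fftconvolve

def auralize_static(dry, rir, wet_gain=1.0):
    # Convolve the dry (anechoic) signal with the measured RIR; truncate to the input
    # length and normalize. The RIR bakes in one fixed source/listener/room configuration.
    wet = fftconvolve(dry, rir, mode="full")[: len(dry)]
    return wet_gain * wet / (np.abs(wet).max() + 1e-12)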
An efficient and accurate sound propagation algorithm has been proposed13 that relies on an adaptive rectan-
gular decomposition of 3D scenes to enable efficient and accurate simulation of sound propagation in com-
plex virtual environments with moving sources. Virtual wave fields can be encoded to perceptual-parameter
fields, such as propagation delays, loudness, and decay times, ready to be quantized and compressed; this
technique is practical on large, complex interactive scenes with millions of polygons. Savioja proposed
wave-based simulation that combines finite-difference methods together with computation on the GPU.14
Mehra et al.15 described an interactive ESM-based sound propagation system for efficiently generating accu-
rate and realistic sounds of outdoor scenes with few (~10) reflecting objects. This extends previous work on
precomputing a set of pressure fields due to elementary spherical harmonic (SH) sources using frequency
domain wave-based sound propagation, compressing a high-dimensional acoustic field allocated in memory
in order to account for time-varying sources and the listener’s position and directivity.
Geometric techniques use precomputed spatial subdivision and beam tree data structures to enable real-time
acoustic modeling and auralization in interactive virtual environments with static sound sources.2 On the
other hand, modern geometric acoustic systems supporting moving sources rely on efficient ray-tracing algo-
rithms shared with CG—e.g., using bidirectional path tracing.16 Common aspects with CG algorithms emerge
in finding intersections between rays and geometric primitives in a 3D space, optimizing spatial data struc-
tures in both computational cost and memory, while ensuring enough rays for convergence of a perceptually
coherent auralization.
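
The geometric idea can be illustrated with a toy first-order image-source model of a rectangular room, which treats each early reflection as a delayed, attenuated copy of the direct sound. Real systems trace much higher reflection orders with ray or beam data structures; the absorption value and geometry below are arbitrary placeholders:

import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def first_order_reflections(src, lis, room_dims, absorption=0.3, fs=48000):
    """Return (delay_in_samples, linear_gain) taps for the direct path and the six
    first-order wall reflections of a shoebox room with one corner at the origin."""
    src, lis, room_dims = (np.asarray(v, dtype=float) for v in (src, lis, room_dims))
    taps = []
    d = np.linalg.norm(src - lis)
    taps.append((int(round(fs * d / SPEED_OF_SOUND)), 1.0 / max(d, 0.1)))   # direct sound
    for axis in range(3):
        for wall in (0.0, room_dims[axis]):
            image = src.copy()
            image[axis] = 2.0 * wall - src[axis]     # mirror the source across this wall
            d = np.linalg.norm(image - lis)
            gain = (1.0 - absorption) / max(d, 0.1)  # crude absorption and 1/r spreading
            taps.append((int(round(fs * d / SPEED_OF_SOUND)), gain))
    return taps

# Example: a 6 x 4 x 3 m room; the taps could be written into an impulse response
# array and convolved with a dry signal as in the earlier RIR sketch.
taps = first_order_reflections(src=[2.0, 1.0, 1.5], lis=[4.5, 3.0, 1.6], room_dims=[6.0, 4.0, 3.0])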
Hybrid solutions combining wave-based and geometric techniques have been proposed.17 Here, numerical
wave-based techniques are used to precompute the pressure field in the near-object regions and geometric
propagation techniques in the far-field regions to model sound propagation.
Efficient simulation of reverberation in outdoor environments is a field with limited research results. Re-
cently, however, an efficient algorithm to simulate reverberation outdoors based on a digital waveguide web
has been proposed.18 The design of the algorithm is based on a set of digital waveguides19 connected by scat-
tering junctions at nodes that represent the reflection points of the environment under study. The structure of
the proposed reverberator allows for accurate reproduction of reflections between discrete reflection points.
This algorithm extends the scattering-delay-network approach proposed for reverberation in computer games.
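
To give a flavor of the building block behind such reverberators, here is a heavily simplified sketch of several bidirectional delay lines meeting at one lossless scattering junction (equal admittances assumed, single junction only). It is not the waveguide-web algorithm of Stevens et al., just an illustration of waveguide scattering; the delays and reflection loss are invented:

import numpy as np
from collections import deque

class WaveguideJunction:
    """Toy network: N bidirectional delay lines meeting at one lossless scattering junction."""

    def __init__(self, delays_samples, far_end_reflection=0.9):
        # Pre-filled fixed-length deques act as delay lines (one per direction per line).
        self.to_junction = [deque([0.0] * d, maxlen=d) for d in delays_samples]
        self.from_junction = [deque([0.0] * d, maxlen=d) for d in delays_samples]
        self.refl = far_end_reflection  # loss applied where each line turns around

    def tick(self, inject=0.0):
        # Waves arriving at the junction now; the input is injected as an extra virtual line.
        incoming = [dq[0] for dq in self.to_junction]
        p_junction = 2.0 * (sum(incoming) + inject) / (len(incoming) + 1)  # equal admittances
        out = 0.0
        for i in range(len(self.to_junction)):
            outgoing = p_junction - incoming[i]           # wave scattered back into line i
            out += outgoing
            back = self.refl * self.from_junction[i][0]   # wave returning from the far end
            self.from_junction[i].append(outgoing)        # maxlen deque drops the sample just read
            self.to_junction[i].append(back)
        return out / len(self.to_junction)

# Example: one second of the junction's sparse, decaying response to a unit impulse.
wgw = WaveguideJunction([241, 379, 523])
response = [wgw.tick(1.0 if n == 0 else 0.0) for n in range(48000)]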

Binaural Rendering
Binaural hearing refers to the ability of humans to interpret the sounds arriving at the two ears as complex auditory scenes.
Shinn-Cunningham and Shilling4 distinguish between three types of headphone simulation—namely, diotic
displays, dichotic displays, and spatialized audio. Diotic displays refer to the display of identical signals in
both channels—something that gives the listener the sensation that all sound sources are located inside the
head.4
Dichotic displays involve stereo signals that just contain frequency-dependent interaural intensity differences
(IIDs), interaural level differences (ILDs), or interaural time differences (ITDs). The two authors explain that
this type of display is very simple since the effect can be achieved by scaling and delaying the signal arriving
at each ear. Just as with diotic displays, this does not enable proper spatialization of the sound sources since
listeners may get the feeling that the sounds are moving inside their head from one ear to the other.4
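
A minimal sketch of the dichotic case: one channel of a mono signal is delayed and attenuated to impose a broadband ITD and ILD. The parameter values are illustrative and, as noted above, no spectral (HRTF) cues are applied, so the image remains inside the head:

import numpy as np

def dichotic(mono, fs, itd_s=0.0004, ild_db=6.0):
    # Delay and attenuate one channel only: broadband ITD and ILD, no spectral cues.
    delay = int(round(itd_s * fs))
    far = np.concatenate([np.zeros(delay), mono])[: len(mono)] * 10 ** (-ild_db / 20.0)
    return np.stack([mono, far], axis=1)   # column 0: near ear, column 1: far ear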
Finally, spatialized sound makes it possible to render most of the spatial cues available in the real world. This
is made possible by the use of various cues, such as the ones provided by acoustic transformations produced
by the listener’s body in the so-called head-related transfer function (HRTF), sound reflections and diffrac-
tions in the surrounding space, and dynamic modifications of these cues caused by body movements.
Figure 2 depicts a complete scheme of a state-of-the-art auralization system, emphasizing the key elements
for each component of Figure 1.

Figure 2. Block diagram of a typical system for binaural rendering and auralization.

Head-Related Transfer Function


Binaural anechoic spatial sound can be synthesized by convolving an anechoic sound signal with the left and
right HRTFs chosen among measurements over a discrete and limited set of spatial locations for an individ-
ual listener.11 The minimum-phase characteristic of the HRTF allows its decomposition into a pure delay, τ(·), followed by a minimum-phase system, H_min(·):

H(θ, φ, r, ω) = H_min(θ, φ, r, ω) exp[−j 2π f τ(θ, φ, r, ω)],


where θ and φ define the source's direction of arrival, and r is the source's distance from the listener's head.
Signal-processing algorithms largely benefit from this HRTF decomposition in stability and performance. As
a matter of fact, the extracted pure delay or time shift is related to the monaural time of arrival (TOA), which
allows ITD extraction once the difference between the left- and right-ear TOA is computed.
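
A hedged sketch of this decomposition for a measured HRIR pair: the TOA is taken as the first sample above a threshold, the minimum-phase part is reconstructed from the real cepstrum, and the ITD is the left/right TOA difference. The threshold and FFT size are arbitrary, and other TOA estimators (e.g., cross-correlation) are equally common:

import numpy as np

def toa_samples(hrir, threshold=0.1):
    """Time of arrival: first sample whose magnitude exceeds a fraction of the peak."""
    return int(np.argmax(np.abs(hrir) >= threshold * np.abs(hrir).max()))

def minimum_phase(hrir, n_fft=1024):
    """Real-cepstrum reconstruction: keep the magnitude response, discard excess phase."""
    mag = np.abs(np.fft.fft(hrir, n_fft)) + 1e-9
    cepstrum = np.fft.ifft(np.log(mag)).real
    cepstrum[1:n_fft // 2] *= 2.0          # fold the anti-causal part onto the causal part
    cepstrum[n_fft // 2 + 1:] = 0.0
    h_min = np.fft.ifft(np.exp(np.fft.fft(cepstrum))).real
    return h_min[: len(hrir)]

def decompose_pair(hrir_left, hrir_right, fs):
    toa_left, toa_right = toa_samples(hrir_left), toa_samples(hrir_right)
    itd_s = (toa_left - toa_right) / fs    # sign indicates which ear the sound reaches first
    return minimum_phase(hrir_left), minimum_phase(hrir_right), itd_s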
Recording the individual HRTFs of a single listener implies a costly trade-off between resources and time and is heavily subject to measurement errors, making this complexity unfeasible for most real-world applications. A common practice is to use the same generic HRTFs, such as those that can be recorded using a
dummy head, for any listener, reaching a proper trade-off between the representativeness of a wide human
population and average efficacy (see the historical review on this topic by Paul20). However, generic HRTFs
generally result in a degradation of the listening experience in localization and immersion.11
Recent literature is increasingly investigating the use of personalized HRTFs that provide listeners with
HRTFs perceptually equivalent to their own individual HRTFs. Synthetic HRTFs can be physically modeled
considering different degrees of simplification: from basic geometry for the head, pinna (the external part of
the ear), shoulders, and torso to accurate numerical simulations with boundary element method (BEM) and
finite-difference time-domain (FDTD) methods.11
Recent research efforts have been investigating the development of HRTF personalization for individual us-
ers of a virtual audio display, usually in the form of precomputed HRTF filters or optimized selection proce-
dures of existing nonindividual HRTFs. In “Efficient Personalized HRTF Computation for High-Fidelity
Spatial Sound,”21 a technique to obtain personalized HRTFs using a camera is proposed. The technique com-
bines a state-of-the-art image-based 3D-modeling technique with an efficient numerical-simulation pipeline
based on the adaptive-rectangular-decomposition technique. On the other hand, we can select and manipulate
existing HRTFs according to their acoustic information, corresponding anthropometric information, and lis-
teners’ subjective ratings, if available. The procedure can be automatic or guided by qualitative tests on the
perceptual impact of nonindividual HRTFs. Finally, listeners can increase their localization performance
through adaptation procedures directly in VR games22 that provide tools for remapping localization cues to
spatial directions with nonindividual HRTF training.
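
As a toy illustration of selection-based personalization (not any specific published procedure), the sketch below picks the database subject whose normalized anthropometric features lie closest to the target listener's; the feature names and the plain Euclidean distance are assumptions made for illustration:

import numpy as np

def select_hrtf(target_features, database):
    """target_features: dict of anthropometric measurements, e.g. {'head_width_cm': 15.2,
    'pinna_height_cm': 6.4}; database: list of (subject_id, features_dict) with the same keys."""
    keys = sorted(target_features)
    matrix = np.array([[feats[k] for k in keys] for _, feats in database], dtype=float)
    mean, std = matrix.mean(axis=0), matrix.std(axis=0) + 1e-9
    target = (np.array([target_features[k] for k in keys]) - mean) / std
    distances = np.linalg.norm((matrix - mean) / std - target, axis=1)
    return database[int(np.argmin(distances))][0]      # id of the closest database subject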

Auralization
Rendering sound propagation for VR through headphones requires the spatialization of the directional RIR
(i.e., early and high-order reflections) for each individual listener. In particular, this involves computation of
binaural room impulse responses (BRIRs), which are the combination of two components: the head-related
impulse response (HRIR) and the spatial room impulse response (SRIR) (see Figure 3).

Figure 3. High-level acoustic components for VR auralization.


Interactive auralization forces HRTF databases or models to cover most of the psychoacoustic effects in lo-
calization and timbre distortion due to dynamic changes of the active-listening experience, thus defining the
memory requirements for HRTF spatial and distance resolution and the spectral information needed (i.e.,
HRTF filter length).11 The audibility of interpolation errors between available HRTFs or HRTF encoding on
a suitable functional basis, like in the spherical harmonic (SH) domain, is a fundamental aspect of auralizing
the sound entering the listener’s ear canals with headphones. Several strategies exist for HRTF interpolation
and representation leading to a perceptually optimal HRTF measurement grid of 4° to 5° spacing in both azi-
muth and elevation, with a progressive reduction of spatial points toward the polar directions. (See “Assisted
Listening Using a Headset: Enhancing Audio Perception in Real, Augmented, and Virtual Environments”23
for a compact review on this topic.)
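
A hedged sketch of one simple interpolation strategy: blend the minimum-phase HRIRs and ITDs of the few nearest measured directions, weighted by inverse angular distance. This is only one of the strategies mentioned above; spherical-harmonic representations are the usual higher-quality alternative:

import numpy as np

def angular_distance_deg(az1, el1, az2, el2):
    a1, e1, a2, e2 = np.radians([az1, el1, az2, el2])
    cosd = np.sin(e1) * np.sin(e2) + np.cos(e1) * np.cos(e2) * np.cos(a1 - a2)
    return np.degrees(np.arccos(np.clip(cosd, -1.0, 1.0)))

def interpolate_hrir(az, el, grid, k=3):
    """grid: list of (az_deg, el_deg, min_phase_hrir_left, min_phase_hrir_right, itd_s)."""
    dists = np.array([angular_distance_deg(az, el, g[0], g[1]) for g in grid])
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-6)
    weights /= weights.sum()
    left = sum(w * grid[i][2] for w, i in zip(weights, nearest))
    right = sum(w * grid[i][3] for w, i in zip(weights, nearest))
    itd = float(sum(w * grid[i][4] for w, i in zip(weights, nearest)))
    return left, right, itd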
Moreover, a compromise between computational efficiency and latency for convolutions with HRTF filters
can be reached by a partitioned-block algorithm.5 The rendering of head and torso independent movements is
increasingly becoming important where full-body interactions are key elements in immersive VR applica-
tions. Nowadays, head-tracking devices are embedded in VR headsets with low drift and latency, keeping system latency well below the critical range of 150 to 500 ms; one should keep in mind, however, that HRTFs are usually
acquired in a fixed head-and-torso spatial connection. On the other hand, little attention has been given to the
benefits of rendering head-above-torso orientations, which should require tracking the shoulders’ position
related to head rotations.11
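
A minimal sketch of uniformly partitioned convolution (a frequency-domain delay line): the filter is split into equal blocks so that per-block latency equals one audio block rather than the full filter length. The block size is illustrative, and production engines typically use non-uniform partitions:

import numpy as np

class PartitionedConvolver:
    def __init__(self, impulse_response, block=256):
        self.B = block
        n_parts = int(np.ceil(len(impulse_response) / block))
        padded = np.zeros(n_parts * block)
        padded[: len(impulse_response)] = impulse_response
        # One 2B-point spectrum per partition of the impulse response.
        self.H = [np.fft.rfft(padded[i * block:(i + 1) * block], 2 * block)
                  for i in range(n_parts)]
        self.X = [np.zeros(block + 1, dtype=complex) for _ in range(n_parts)]  # spectrum delay line
        self.prev = np.zeros(block)

    def process(self, x_block):
        assert len(x_block) == self.B
        buf = np.concatenate([self.prev, x_block])        # overlap-save input buffer
        self.prev = x_block.copy()
        self.X.insert(0, np.fft.rfft(buf))                # newest input spectrum first
        self.X.pop()
        acc = sum(Xi * Hi for Xi, Hi in zip(self.X, self.H))
        return np.fft.irfft(acc)[self.B:]                 # discard the aliased first half

Successive input blocks of length B are fed to process(), and the returned blocks are concatenated to form the output stream.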
Given an HRTF representation, previous approaches tend to evaluate HRTFs for each sound propagation
path, resulting in a process too slow for interactive-VR latency requirements. Schissler et al.24 presented an
approach for computation of the SRIR that performs the convolution with the HRTF in the SH domain for
RIR directional projection for the listener. This approach employs aspects similar to those of the well-known
virtual reproduction of a multichannel surround system, where a virtual sound field is encoded to an HRTF-
based binaural system. Relevant examples for VR are high-order ambisonics, wave-field synthesis, and direc-
tional audio coding.11
As we discussed for HRTFs, even for SRIRs perceptually motivated techniques for sound scene auralization
are preferred in order to control computational resources. The culling of inaudible reflections has to be ap-
plied in complex multipath environments, and many approaches have been proposed in the literature. Ac-
cordingly, besides limiting, clustering, and projecting reflections in the reference sphere around the listener,
binaural loudness models can be used to calculate a masking threshold for a time-frequency representation of
sound sources.25
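
A toy sketch of the virtual-loudspeaker route mentioned above: a source is encoded into first-order ambisonics (B-format) and decoded to a handful of virtual speakers, each of which would then be rendered binaurally with the HRIR pair for its direction. The encoding convention, decoder weights, and speaker layout are illustrative; real systems use higher orders and properly designed decoders:

import numpy as np

def encode_foa(signal, az_deg, el_deg):
    """Encode a mono signal at (azimuth, elevation) into first-order B-format (W, X, Y, Z)."""
    az, el = np.radians(az_deg), np.radians(el_deg)
    w = signal / np.sqrt(2.0)
    x = signal * np.cos(az) * np.cos(el)
    y = signal * np.sin(az) * np.cos(el)
    z = signal * np.sin(el)
    return np.stack([w, x, y, z])                      # shape (4, num_samples)

def decode_to_speakers(bformat, speaker_dirs_deg):
    """Simple projection decode: one mono feed per virtual loudspeaker direction."""
    feeds = []
    for az_deg, el_deg in speaker_dirs_deg:
        az, el = np.radians(az_deg), np.radians(el_deg)
        weights = np.array([np.sqrt(2.0) / 2.0,
                            np.cos(az) * np.cos(el),
                            np.sin(az) * np.cos(el),
                            np.sin(el)])
        feeds.append(weights @ bformat)
    return feeds

# Binaural step (sketch): convolve each feed with the HRIR pair measured for its
# speaker direction and sum the results for the left and right ears.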

CONCLUSION
In this article we presented an overview of the state of the art of interactive sound rendering for immersive
environments. As shown in this article, in recent years progress has been made in relation to the three differ-
ent elements of the sound design pipeline, i.e., source modeling, room acoustics modeling, and listener
modeling. From the point of view of source modeling, many sound events have been synthesized, and real-
time simulations exist. However, these simulations have not yet been adopted in the commercial software
engines usually used by the VR community. This may be because the quality of the simulations is not yet high enough, so professionals still prefer to rely on recordings despite their limitations, most notably their lack of the flexibility that simulations offer.
Moreover, although physics-based modeling using techniques such as modal synthesis is fast, it still isn’t fast
enough to accommodate several sound-producing objects in a fraction of a CPU core, which is what VR
needs. Overall, in practice, one might need to make the CPU cost and sound quality of physically based
sound synthesis competitive with hardware decompression of recorded sounds.
Much progress has also been made in regard to room acoustics modeling, and efficient algorithms combined
with increased computational power and parallel processing make such models available in real time. Work-
ing in perceptual parametric domains, as with the technique presented in “Parametric Wave Field Coding for
Precomputed Sound Propagation,”13 makes artistic soundscape creation easy. But, like any parametric en-
coder, the encoding is lossy: not all relevant perceptual aspects of the impulse response are completely mod-
eled. Source and listener directivity is usually approximated, and delayed echoes are not rendered faithfully, with noticeable degradation in outdoor environments. Finding a compact set of perceptual metrics that incorporate
such missing aspects is the main challenge to allowing adaptation to several environments, from outdoor to
indoor, and better simulation of the acoustic sensation of moving in different spaces such as a tunnel, an arch,
etc.


Moreover, an active field of research is the capturing and implementation of personalized HRTFs. While lis-
tening tests in static environments have shown that subjects have a better sensation of space with personal-
ized HRTFs, it has not yet been investigated whether high-fidelity personalized HRTFs are needed in
complex virtual environments where several independent variables are present, such as the visual and proprio-
ceptive elements of the environments.
Finally, a challenging issue that has received less attention is auralization of nearby sources’ acoustics, which
seems to be perceptually relevant for action sounds in the proximal region or the listener’s peripersonal
space. Independence between HRTF directional and distance information does not hold, due to the change in
the reference sound field from a plane wave to a spherical wave. Thus, efficient interpolation methods and
range extrapolation algorithms in HRTF rendering should be developed. Furthermore, users’ actions dynami-
cally define near-field HRTFs because of body movements that have specific degrees of freedom for each
body articulation and connection. Modeling such complexity is a pivotal challenge for future full-body VR
experiences.

SIDEBAR: THE PLAUSIBILITY OF BINAURAL SYNTHESIS


Producing plausible virtual acoustic scenarios over headphones with particular attention to acoustical-space
properties and externalization issues remains a major challenge because of many sources of error in the re-
production of binaural signals and forced simplifications due to the time-variant nature of complex acoustic
scenes. Requirements can be grouped into four categories:

• Ergonomic and “ear adequate” delivery systems. The less aware listeners are that the sounds are being played back from headphones, the more likely they are to externalize stimuli in a natural way. The choice of headphones is thus crucial, depending on how little pressure the headphone strap or cups exert and on changes in the acoustic load inside the ear canal.
• Individual spectral cues. Binaural signals should be available that are appropriate to listener-specific
spectral filtering. Particular attention should be given to headphone-specific compensation filters for the acoustic distortion introduced by the playback device; these filters have to be adapted to different listeners and to repositioning of the headphones on the ears.
• Head movements. Body movements in everyday-life activities produce perceptually relevant dy-
namic changes in interaural cues. Optimization of head-tracker latency, spatial discretization of the
impulse response dataset, and real-time interpolation for rendering arbitrary source positions pose
critical challenges in multimodal VR scenarios.
• Room acoustics. The availability of virtual room acoustics that resemble a real-world sound field,
and therefore contain plausible reverberation models, is relevant for externalization, spatial impres-
sion, and perception of space. Particular attention should be given to reflections and edge diffraction
for each sound source, which pose relevant challenges in finding the trade-off between accuracy and
plausibility due to computational and memory constraints, while at the same time preserving low-
enough latency for interaction. (See the “Sound Propagation” section in the main article for further details.)

SIDEBAR: IMMERSIVE SOUND HAS SEVERAL APPLICATIONS
Summers et al.26 discuss specific immersive and aesthetic auditory-design techniques as applied to create
full-motion, multiparticipant VR game experiences for live commercial location-based entertainment (LBE)
installations. The authors introduce fundamental design considerations and emerging best practices used in
spatially responsive VR audio design and implementation, and complement these with practical examples
that embrace the complexities of immersive and aesthetic audio design in VR. The design considerations are
based on the understanding that audio has the potential to craft the story elements of a product, control the
pacing of an experience, enforce the narrative, elicit and influence emotion, create mood, shape perception,
and reinforce the way people experience characters.
Janer et al.27 describe an application where audio signal processing and VR content are combined to create
novel immersive experiences for orchestral-music audiences. In VR, the auralization of the sound sources of recorded live content remains a rather unexplored topic. In their work, a multimodal experience is built where visual and audio cues provide a sonic augmentation of the real scene. In the particular scenario of orches-
tral-music content, the goal is to acoustically zoom in on a particular instrument when the VR user stares at

it. The ultimate goal is to improve the learning aspects of music listening, either for education or for personal
enrichment.
VR musical instruments are also a field where physics-based sound modeling and rendering naturally play an
important role. For a recent overview of VR musical instruments, we refer to the work presented in “Virtual
Reality Musical Instruments: State of the Art, Design Principles, and Future Directions.”28
Carefully designed sounds can be used to reproduce the acoustics of specific spaces, such as the Notre Dame
Cathedral.29 In this particular case, a computational model of the space's acoustics was captured and reproduced for use in an immersive VR experience.
Assisted listening and navigation for partially or fully visually impaired users is an example of an important
application of interactive sound rendering for virtual environments. As an example, in “Assisted Listening
Using a Headset: Enhancing Audio Perception in Real, Augmented, and Virtual Environments,”23 an over-
view of the use of augmented audition for navigation is presented.
Among the many possibilities that one can create with VR technologies, Figure A depicts some case studies
and tools developed at the Multisensory Experience Lab of Aalborg University Copenhagen.

Figure A. Applications and tools in interactive VR. Top left: interactive and creative virtual instruments. Bottom
left: a multimodal rendering of an outdoor VR environment for meditation. Top center: the origin of the generic
acoustic contribution for a human-like listener. Bottom center: an indoor scene with reduced visual information
for orientation and mobility purposes. Top and bottom right: examples of VR applications for users of different
ages.

SIDEBAR: HEADPHONE TECHNOLOGIES FOR NATURAL SPATIAL SOUNDS
In order to deliver natural auditory experiences with headphones, the correct sound pressure level (SPL) due
to a virtual sound field should be reproduced at the eardrums of a listener, taking into account the headphone's acoustic contribution, described by the headphone impulse response (HpIR). (See Figure 3 in the main article for
a schematic view.) In the literature, sound transmission from the headphone to the eardrum is often repre-
sented through an analog circuit model30 with the following desired listening condition:
Z_headphone ≈ Z_radiation,
where Z_headphone denotes the equivalent impedance outside the ear canal with headphones, and Z_radiation denotes
the equivalent impedance outside the ear canal in free-field listening conditions. This model is valid for
wavelengths greater than the ear canal’s width, thus approximately under 10 kHz, and gives rise to the so-
called Pressure Division Ratio (PDR):


P_open^Hp / P_blocked^Hp = P_open / P_blocked,

where P_open and P_blocked denote the free-field sound pressure at the entrance of the open and blocked ear canal, respectively, while P_open^Hp and P_blocked^Hp denote the sound pressure at the same observation points when the sound source is the headphones. Headphones with PDR ≈ 1 satisfy the free-air equivalent coupling (FEC) characteristic.31 Intersubject variability is limited at low frequencies (up to ≈ 4 kHz) because headphones act as an
acoustic cavity introducing only a constant level variation.
In contrast, in the higher part of the spectrum, the headphone position and the listener's anthropometry give rise to several frequency notches that are difficult to predict due to

• standing waves starting to grow inside the headphone cups and
• the outer ear's resonances, which lead to an individual characterization of headphone acoustics.

It is worth noting that the fidelity of spatial sound reproduction relies on the amount of individualization in the headphone correction, in terms of both measurement techniques and equalization methods, with an emphasis on high-frequency control in the inverse-filtering problem.32
Obtaining in situ robust and individual headphone calibration with straightforward procedures in order to
always apply listener-specific corrections to headphones is a challenging research issue, especially for in-
serted earphones that do not satisfy FEC criteria. An innovative approach to estimate the sound pressure in an
occluded ear canal involves binaural earphones with microphones that can extract the ear canal transfer
function (ECTF) and compensate for occlusion via an adaptive inverse-filtering method in real time.27
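
A hedged sketch of one common equalization strategy consistent with the discussion above: regularized inversion of a measured HpIR, with stronger regularization at high frequencies, where notches are individual and placement-dependent. The knee frequency and regularization values are arbitrary placeholders, not values from the cited work:

import numpy as np

def headphone_eq(hpir, fs, n_fft=4096, beta_low=1e-3, beta_high=0.3, knee_hz=4000.0):
    """Design an equalization filter as a regularized inverse of a measured HpIR."""
    H = np.fft.rfft(hpir, n_fft)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    beta = np.where(freqs < knee_hz, beta_low, beta_high)  # limit boost above the knee
    inv = np.conj(H) / (np.abs(H) ** 2 + beta)
    eq = np.fft.irfft(inv)
    return np.roll(eq, n_fft // 2)                         # crude causality shift (sketch only)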

SIDEBAR: SOUND SOURCE MODELING


One of the first examples of the use of physics-based models for interactive audiovisual environments is Fo-
leyAutomatic.3 The authors of the related paper present a simulation of different solid interactions such as
impact, friction, and rolling, using a combination of physics-based simulations and recordings extracted from
real objects. The simulations follow the classification of everyday sounds proposed by Gaver in the context
of human-computer interaction (see Figure B).33

Figure B. The classification of everyday sounds proposed by Gaver.33

The authors use a synthesis technique known as modal synthesis.34 The main principle behind modal synthesis is the decomposition of a vibrating object into its modes, i.e., the resonances of the system. Each mode can be simulated as a mass-spring system or an exponentially decaying sine wave.
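
A minimal sketch of modal synthesis as just described: each mode is an exponentially decaying sinusoid, and the object's response is their sum driven by a short contact-force signal. The mode frequencies, decays, and gains below are invented for illustration:

import numpy as np
from scipy.signal import fftconvolve

def modal_bank(force, fs, modes):
    """modes: list of (frequency_hz, decay_per_second, gain); force: excitation signal."""
    t = np.arange(len(force)) / fs
    out = np.zeros(len(force))
    for freq, decay, gain in modes:
        mode_ir = gain * np.exp(-decay * t) * np.sin(2 * np.pi * freq * t)  # one decaying partial
        out += fftconvolve(force, mode_ir)[: len(force)]
    return out

# Example: a short, windowed noise burst "strikes" an object with three invented modes.
fs = 44100
force = np.zeros(fs // 2)
force[:32] = np.random.default_rng(0).standard_normal(32) * np.hanning(32)
clank = modal_bank(force, fs, [(523.0, 8.0, 1.0), (1187.0, 12.0, 0.6), (2630.0, 20.0, 0.3)])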
O’Brien et al.35 describe a real-time technique for generating realistic and compelling sounds that correspond
to the motions of rigid objects. In this case, the actions are not necessarily actions of the user in the environ-
ment, but actions between objects present in the simulation. By numerically precomputing the shape and fre-
quencies of an object’s deformation modes, audio can be synthesized interactively directly from the force
data generated by a standard rigid-body simulation. This approach allows accurate modeling of the sounds
generated by arbitrarily shaped objects based only on a geometric description of the objects and a handful of
material parameters.
Avanzini et al.36 present some efficient yet accurate algorithms to simulate frictional interactions between
rubbed dry surfaces. The work is based on the assumption that similar frictional interactions appear in several
events such as a door squeaking or a rubbed wineglass. For this reason, the same simulation algorithm can be
used with different parameters adapted to the different objects interacting.


An efficient physically informed model to simulate liquid sounds has been proposed by van den Doel.37 The modeling approach is based on the observation that the physics of liquid sounds is complex to simulate mathematically. A physically informed approach considers some physical elements of the interaction between bubbles
and solid surfaces, such as the fact that bubbles decrease in size when they hit the floor, combined with a per-
ceptual-based approach where a bubble is simulated using a sine wave whose frequency decreases over time.
Perceptual experiments show that subjects find such an approach suitable for inclusion in a VR simulation.37
An approach for fluid simulation is also proposed by Moss et al.38 Here, the simulation of liquid sounds is
made directly from a visual simulation of fluid dynamics. The main advantage is the fact that sound simula-
tion can be obtained from the visual simulation with minimal additional computational cost and possibly in
real time. Cirio et al.39 describe an approach to render vibrotactile and sonic feedback for fluids. The model is
divided into three components, following the physical processes that generate sound during solid-fluid inter-
action: the initial high-frequency impact, the small-bubble harmonics, and the main-cavity oscillation.
A simulation of aerodynamic sounds such as aeolian tones and cavity tones is presented by Dobashi et al.40
The authors propose an efficient implementation of aerodynamic sound based on computational fluid dynam-
ics, where the sound generated depends on the speed of the aerodynamic action. Precomputations of an ob-
ject in different poses with respect to the flow facilitate real-time simulation.
One important action performed by users in virtual environments is walking. Nordahl et al.41 present different
algorithms to simulate the sound of walking on solid surfaces such as wood and asphalt as well as aggregate
surfaces such as sand and snow. Based on perceptual evaluations, the authors show that users were able to
recognize most of the simulated surfaces they were exposed to. Moreover, a second study showed that users
were able to better identify the surfaces they were exposed to when soundscapes were added to the simula-
tion. In these cases, the soundscapes were simulating different environments such as a beach, forest, etc. This
result shows the importance of environmental sound rendering in creating a sense of space.

REFERENCES
1. T. Takala and J. Hahn, “Sound Rendering,” ACM SIGGRAPH Computer Graphics, vol. 26, no. 2,
1992, pp. 211–220.
2. T. Funkhouser et al., “A beam tracing approach to acoustic modeling for interactive virtual
environments,” Proceedings of the 25th Annual Conference on Computer Graphics and Interactive
Techniques (SIGGRAPH 98), 1998, pp. 21–32.
3. K. Van Den Doel, P.G. Kry, and D.K. Pai, “Foleyautomatic: physically-based sound effects for
interactive simulation and animation,” Proceedings of the 28th Annual Conference on Computer
Graphics and Interactive Techniques (SIGGRAPH 01), 2001, pp. 537–544.
4. B. Shinn-Cunningham and R. Shilling, “Virtual Auditory Displays,” Handbook of Virtual
Environment Technology, Lawrence Erlbaum Associates Publishers, 2002.
5. V. Välimäki et al., “More than 50 years of artificial reverberation,” Proceedings of the 60th
International Conference on Dereverberation and Reverberation of Audio, Music, and Speech
(DREAMS 16), 2016.
6. L. Savioja et al., “Creating interactive virtual acoustic environments,” Journal of the Audio
Engineering Society, vol. 47, no. 9, 1999, pp. 675–705.
7. P.R. Cook, Real sound synthesis for interactive applications, CRC Press, 2002.
8. Z. Ren, H. Yeh, and M.C. Lin, “Example-guided physically based modal sound synthesis,” ACM
Transactions on Graphics, vol. 32, no. 1, 2013.
9. D.B. Lloyd, N. Raghuvanshi, and N.K. Govindaraju, “Sound synthesis for impact sounds in video
games,” Proceedings of the 2011 ACM SIGGRAPH Symposium on Interactive 3D Graphics and
Games (I3D 2011), 2011, pp. 55–62.
10. R. Nordahl and N.C. Nilsson, “The sound of being there: presence and interactive audio in
immersive virtual reality,” Oxford Handbook of Interactive Audio, Oxford University Press, 2014.
11. B. Xie, Head-Related Transfer Function and Virtual Auditory Display, J. Ross Publishing, 2013.
12. D.L. James, J. Barbič, and D.K. Pai, “Precomputed acoustic transfer: output-sensitive, accurate
sound generation for geometrically complex vibration sources,” ACM Transactions on Graphics,
vol. 25, no. 3, 2006, pp. 987–995.
13. N. Raghuvanshi and J. Snyder, “Parametric wave field coding for precomputed sound
propagation,” ACM Transactions on Graphics, vol. 33, no. 4, 2014, pp. 1–11.
14. L. Savioja, “Real-time 3D finite-difference time-domain simulation of low-and mid-frequency
room acoustic,” Proceedings of the 13th International Conference on Digital Audio Effects (DAfx
10), 2010.

15. R. Mehra et al., “Wave: Interactive wave-based sound propagation for virtual environments,” IEEE
Transactions on Visualization and Computer Graphics, vol. 21, 2014, pp. 434–442.
16. C. Cao et al., “Interactive sound propagation with bidirectional path tracing,” ACM Transactions
on Graphics, vol. 35, no. 6, 2016, pp. 1–11.
17. H. Yeh et al., “Wave-ray coupling for interactive sound propagation in large complex scenes,”
ACM Transactions on Graphics, vol. 32, no. 6, 2013.
18. F. Stevens et al., “Modeling sparsely reflecting outdoor acoustic scenes using the waveguide web,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 8, 2017.
19. J.O. Smith, “Physical modeling using digital waveguides,” Computer Music Journal, vol. 16, no. 4,
1992, pp. 74–91.
20. S. Paul, “Binaural Recording Technology: A Historical Review and Possible Future
Developments,” Acta Acustica united with Acustica, vol. 95, no. 5, 2009, pp. 767–788.
21. A. Meshram et al., “P-HRTF: Efficient Personalized HRTF Computation for High-Fidelity Spatial
Sound,” Proceedings of the 2014 IEEE International Symposium on Mixed and Augmented Reality
(ISMAR 14), 2014, pp. 53–61.
22. G. Parseihian and B.F.G. Katz, “Rapid head-related transfer function adaptation using a virtual
auditory environment,” The Journal of the Acoustical Society of America, vol. 131, no. 4, 2012, pp.
2948–2957.
23. V. Välimäki et al., “Assisted listening using a headset: Enhancing audio perception in real, augmented, and virtual environments,”
IEEE Signal Processing Magazine, vol. 33, no. 2, 2015, pp. 92–99.
24. C. Schissler, A. Nicholls, and R. Mehra, “Efficient HRTF-based spatial audio for area and
volumetric sources,” IEEE Trans. Visualization and Computer Graphics, vol. 22, no. 4, 2016, pp.
1356–1366.
25. H. Hacihabiboglu et al., “Perceptual Spatial Audio Recording, Simulation, and Rendering: An
overview of spatial-audio techniques based on psychoacoustics,” IEEE Signal Processing Magazine, vol. 34,
no. 3, 2017, pp. 36–54.
26. C. Summers, V. Lympouridis, and C. Erkut, “Sonic interaction design for virtual and augmented
reality environments,” Proceedings of the 2015 IEEE 2nd VR Workshop on Sonic Interactions for
Virtual Environments (SIVE 15), 2015, pp. 1–6.
27. J. Janer et al., “Immersive orchestras: audio processing for orchestral music VR content,”
Proceedings of the 8th International Conference on Games and Virtual Worlds for Serious
Applications (VS Games 16), 2016, pp. 1–2.
28. S. Serafin et al., “Virtual reality musical instruments: State of the art, design principles, and future
directions,” Computer Music Journal, vol. 40, no. 3, 2016.
29. B.F. Katz et al., “Experience with a virtual reality auralization of notre-dame cathedral,” The
Journal of the Acoustical Society of America, vol. 141, no. 5, 2017.
30. H. Møller, “Fundamentals of binaural technology,” Applied Acoustics, vol. 36, no. 3-4, 1992, pp.
171–218.
31. B. Boren et al., “Coloration Metrics for Headphone Equalization,” Proceedings of the 21st
International Conference on Auditory Display, 2015, pp. 29–34.
32. F. Denk et al., “An individualised acoustically transparent earpiece for hearing devices,”
International Journal of Audiology, 2017, pp. 1–9.
33. W.W. Gaver, “What in the world do we hear?: An ecological approach to auditory event
perception,” Ecological Psychology, vol. 5, no. 1, 1993, pp. 1–29.
34. J.-M. Adrien, “The missing link: Modal synthesis,” Representations of musical signals, MIT Press,
1991.
35. J.F. O'Brien, C. Shen, and C.M. Gatchalian, “Synthesizing sounds from rigid-body simulations,”
Proceedings of the 2002 ACM SIGGRAPH/Eurographics Symposium on Computer Animation,
2002, pp. 175–181.
36. F. Avanzini, S. Serafin, and D. Rocchesso, “Interactive simulation of rigid body interaction with
friction-induced sound generation,” IEEE Transactions on Speech and Audio Processing, vol. 13,
no. 5, 2005, pp. 1073–1081.
37. K. van den Doel, “Physically based models for liquid sounds,” ACM Transactions on Applied
Perception, vol. 2, no. 4, 2005, pp. 534–546.
38. W. Moss et al., “Sounding liquids: Automatic sound synthesis from fluid simulation,” ACM
Transactions on Graphics, vol. 29, no. 3, 2010.
39. G. Cirio et al., “Vibrotactile rendering of splashing fluids,” IEEE Transactions on Haptics, vol. 6, no. 1, 2013, pp. 117–122.
40. Y. Dobashi, T. Yamamoto, and T. Nishita, “Real-time rendering of aerodynamic sound using
sound textures based on computational fluid dynamics,” ACM Transactions on Graphics, vol. 22,
no. 3, 2003, pp. 732–740.

41. R. Nordahl, L. Turchet, and S. Serafin, “Sound synthesis and evaluation of interactive footsteps
and environmental sounds rendering for virtual reality applications,” IEEE Transactions on
Visualization and Computer Graphics, vol. 17, no. 9, 2011, pp. 1234–1244.

ABOUT THE AUTHORS


Stefania Serafin is a professor in Aalborg University’s Department of Architecture, Design, and Media
Technology, with a focus on sound for multimodal environments. Her research interests include sonic-
interaction design and sound and music computing. Serafin received a PhD from Stanford University’s
Center for Computer Research in Music and Acoustics. She’s the president of the Sound and Music
Computing Association. Contact her at [email protected].
Michele Geronazzo is a postdoctoral researcher in Aalborg University’s Department of Architecture,
Design, and Media Technology. His research interests include binaural spatial audio, virtual and aug-
mented reality, and human-computer interaction. Geronazzo received a PhD in information and commu-
nication technology from the University of Padova. Contact him at [email protected].
Cumhur Erkut is an associate professor at Aalborg University. His research interests include sound
and music computing, sonic-interaction design, and embodied interaction. Erkut received a PhD in elec-
trical engineering from the Helsinki University of Technology. Contact him at [email protected].
Niels C. Nilsson is an assistant professor at Aalborg University. His research interests include locomo-
tion, perception, and cognition in immersive virtual environments. Nilsson received a PhD in media
technology from Aalborg University. He’s a member of IEEE. Contact him at [email protected].
Rolf Nordahl is an associate professor in Aalborg University’s Department of Architecture, Design, and
Media Technology. His research interests include presence and the user experience in VR, with a focus
on locomotion. He received a PhD in engineering from Aalborg University. He’s a member of IEEE.
Contact him at [email protected].
