
FREE GUIDE

Pro Techniques for Sound Design

www.routledge.com
Table of Contents
1. Practical Sound Design
Jean-Luc Sinclair
Principles of Game Audio and Sound Design

2. Designing a moment
Rob Bridgett
Leading with Sound

3. Emotion in sound design


Neil Hillman
Sound for Moving Pictures

4. Using Ambisonics and Advanced Audio Practices


Dennis Baxter
Immersive Sound Production

5. Leveraging Motion and Conceptual Frameworks of Sound as a Novel
Means of Sound Design in Extended Reality
Ed. by Michael Filimowicz
Designing Interactions for Music and Sound
Get further insights on Sound Design
from these key books featured in this guide

Sign-up for Email Updates »

Browse our full range of Audio books.
Browse »
Free standard shipping included on all online orders.
Introduction
Sound Designers are often found in theatrical, television/movie, and
corporate productions and are responsible for everything the
audience hears. This includes creating sound effects for visual
media, as well as helping to design and oversee system
installations.
This free guide is ideal for those who utilize sound design
techniques in their careers. Using both research-based and hands-on
advice, it covers a range of topics like the spatialization of sound,
3D audio techniques and sound design for games.
The chapters featured are sourced from a selection of Routledge
books which focus on sound design and audio engineering. More
details about each chapter are noted below.
If you would like to delve deeper into any of the topics, the full
versions of the books are available to purchase from
Routledge.com.

Chapter 1 - Practical Sound Design


This chapter, taken from 'Principles of Game Audio and Sound
Design', describes some typical problems that linear sound designers
run into, such as cut scenes, as well as some fundamentals of signal
flow and gain staging.

Chapter 2 - Designing a moment
Context drives the interactive design range of the sound, and how it
needs to change over time and circumstance. This chapter from
'Leading with Sound' explores how to use the web of context to
create a narrative sound story to convey information to the
audience.
Chapter 3 - Emotion in sound design
The audience's reaction is crucial to storytelling. This chapter from
'Sound for Moving Pictures' explains ways to elicit desired emotions
in an audience, drawing on a wealth of research from specialists in
mixing and sound design.

Chapter 4 - Using Ambisonics and Advanced Audio Practices


The production of multichannel, multi-format audio requires 3D
immersive sound. This chapter from 'Immersive Sound Production'
explains what ambisonic production is and how it's used for the
spatialization of audio. As ambisonics is the only platform that
tracks user interaction with soundfield rotations for 360 video, it is
an essential tool for those trying to stay current with developments
in sound design.

Chapter 5 - Leveraging Motion and Conceptual Frameworks of
Sound as a Novel Means of Sound Design in Extended Reality
Extended Reality (XR) offers a multitude of new opportunities and
difficulties for sound design. This chapter from 'Designing
Interactions for Music and Sound' explores the parallels in sound
and motion, as well as a host of innovative sound design
techniques to advance the use of sound design in XR.
CHAPTER 1

Practical Sound Design

This chapter is excerpted from

Principles of Game Audio and
Sound Design
Jean-Luc Sinclair
© 2020 Taylor & Francis Group. All rights reserved.

Learn More »
6 PRACTICAL SOUND DESIGN

Learning Objectives
In Chapter five we looked at the origins of sound design and some of the
most commonly used techniques and processes used in the trade. In this
chapter we look at a few more specific examples of how to apply these
techniques in the context of linear and interactive sound design. We will also
introduce the concept of prototyping, which consists of building interactive
sound objects such as vehicles or crowd engines and recreating their behavior
in the game by building an interactive model of them in software such as
MaxMSP or Pure Data, prior to integration in the game engine. The process
of prototyping is extremely helpful in testing, communicating and demonstrating
the intended behavior or possible behaviors of the interactive elements
in a game. But first we shall take a closer look at some of the major
pitfalls most game sound designers run into when setting up a session for
linear sound design, such as cut scenes, as well as some basics of signal flow
and gain staging.

1. Setting Up a Sound Design Session and Signal Flow


Sound design is both a creative and technical endeavor. There is a ‘what’ ele-
ment and a ‘how’ element. The ‘what’ is the result we intend to create, and the
‘how’, of course, the method we use to get there. This is a common struggle
for most artists, one that the great painter Wassily Kandinsky identified
and articulated in his writings, a testimony to the universality of this struggle
among all artists. A solid understanding of signal flow in DAWs and gain staging
overall is critical to obtaining good results. Students often end up struggling
with the technology itself, as much as the sound design portion, complicating
their tasks a great deal. Often, however, the technical matters can be overcome
with a better understanding of the technical side, leaving the student to focus
on the matter at hand, the creative.

1. Signal Flow
The term signal flow refers to the order through which the audio signal
encounters or flows through the various elements in a mixer or via external
processors, from the input – which is usually the hard drive – or a mic input
to the digital audio converters (DACs) and out to the speakers.
In this chapter we will use Avid’s Pro Tools as our DAW. The concepts dis-
cussed here, however, will easily apply to another software, especially as most
DAW mixers tend to mimic the behavior and setup of classic analog mixers.
Let’s take a look at how the signal flows, from input to output, in a tra-
ditional DAW and how understanding this process will make us better audio
engineers and therefore sound designers.
The following chart will help us understand this process in more detail:

Figure 6.1 Main elements of a mixer channel strip

a. Input

In most mixers the very first stage is the input. The input varies depending on
whether we are in recording mode, in which case the input will usually be a
microphone or line input, or in playback mode, in which case the input will
be the audio clip or clips in the currently active playlist.

b. Inserts

The next stage your signal is going to run into is the insert sec-
tion. This is where you can add effects to your audio, such as equalization,
compression and whatever else may be available. Inserts are often referred
to as an access point, allowing you to add one or multiple processors in your
signal path. In most DAWs, the signal goes from the first insert to the last from
top to bottom.

c. Pre-Fader Send

After the inserts, a pre-fader send is the next option for your signal. This is
where you will send a copy of your audio to another section of your mixer,
using a bus. A bus is a path that allows you to move one or multiple signals to
a single destination on another section of the mixer. Sending out a signal at
this point of the channel strip means the amount sent will be independent of
the main fader; therefore changes in volume across the track set by the main
fader will not affect the amount of audio going out on the pre-fader send. The
amount of signal sent is only dependent on the level of the send and, of course,
the level of the signal after the insert section.
If you were to send vocals to a reverb processor at this stage, fading out the
vocals would not affect the level of the reverb, and you would eventually end
up with reverberation only after fading out the vocals.

d. Volume Fader

The next stage is the volume fader, which controls the overall level of the
channel strip or audio track. When the volume fader is set to a value of 0dB,
known as unity, no gain is applied to the overall track, and all the audio is play-
ing at the post insert audio level. Raising or lowering the fader by any amount
will change the current gain value by as much.
Often it is here that you will find panning, to place the audio output in
stereo or surround space, depending on the format you are working with.

e. Metering: Pre-Fader vs. Post Fader

Next to the volume fader, you will usually find a level meter. Please check with
your DAW’s manual to find out exactly how the meter is measuring the level
(Peak, RMS, LUFS etc.). Some DAWs will allow you to change the method for
metering. Regardless of the method employed, you have the option to monitor
signals pre-fader or post-fader. By default, most mixers will have their meters
set to post-fader mode, which means the meter will display the level after the
volume fader and will therefore be affected by it. When monitoring pre-fader,
the meter will display the level of the signal right after the last insert, giving
you an accurate sense of the level at this stage. It’s probably a good idea to
at least occasionally monitor your signals pre-fader, so you can be sure your
signal is clean coming out of the insert section.
Please refer to your DAW’s documentation to find out how to monitor pre
or post-fader.

f. Post-Fader Send

Next we find the post-fader send. The level sent to the bus will be impacted
by any changes in the level of the volume fader. This is the most commonly
used type of send. In this case, if you are sending vocals to a reverb processor,
fading out the vocals will also fade out the level of the reverb.

g. Output

Last, we find the output, which determines where the signal is routed to next,
by default usually the master bus, where all the audio is summed. Often the
output of an audio track should be routed to a submix, where multiple audio
tracks that can or should be processed in the same way are mixed together,
such as all the ambience tracks in a session or the dialog, music etc.
It’s probably a good rule of thumb to make sure that no track be routed directly
to the master fader but rather to a subgroup or submix. Routing individual tracks
directly to the master will make your mix messy and difficult to manage.
You may have already noticed that DAWs often do not display the informa-
tion on a channel strip in their mixer in the order through which the signal
flows from top to bottom. If unaware of this, it is easy to make mistakes that
get in the way of the task at hand.

2. Working With Video


Sound designers are often asked to work to linear video clips when working in
games. Models, such as AI characters, can be exported to video before they are
implemented in the game engine, and the animations are often given to the sound
designers as linear loops prior to their implementation in the game. Working to
video is also a great way to experiment freely in the DAW of your choice, prior
to exporting the sounds you created as assets to be imported in the game.
In other cases, you will be given a video clip of a cut scene, a cinematic
sequence often used to move the plot forward between levels.
Either way, it is important to be aware of a few key issues when working to
picture. Every DAW has a slightly different way of importing video, so if you are
unsure, please refer to the user manual; the points made here, however, will
apply regardless of the DAW you are working in. As in the rest of this chapter,
Avid’s Pro Tools will be used to illustrate these concepts.

a. Know Your Frame Rate

Frame rates for video are usually lower than the ones we work with in gam-
ing. Frame rates ranging from 24 to 30 frames per second are common in
video, film and broadcast. Find out what the frame rate is of the video you are
working with, and make sure to set your DAW’s timeline to be displayed in
Timecode format, rather than bars and beats.

Figure 6.2

Timecode is a way to make sure that each and every frame in a piece of
video will have a single address that can be easily recalled and is expressed in
the following format:

HH:MM:SS:FF.

Hours, Minutes, Seconds and Frames.

It is important to understand that, although expressed in seconds and frames,
time code is a positional reference, an address for each frame in the video file.
Do make sure your DAW’s session is running at the same frame rate as the
picture. Setting up our timeline to time code format allows us to move through
our session frame by frame, using the nudge feature. Nudging allows
you to scrub forward and backwards through the video and find
out exactly and easily where the sync points for each event are in the picture,
down to frame accuracy. In some cases, you might need to use a nudge value
of half a frame for events where synchronization is critical.
The first frame of the clip should be lined up with the address: 01:00:00:00
in the timeline; any material such as slates that provide information about the
video clip or countdowns will therefore start prior to the hour mark. Lining
up the first frame of video with the address 01:00:00:00 is not a requirement
but rather a convention and will make it easier to keep track of time.
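To make the idea of timecode as a frame address concrete, here is a small sketch in C# (the language used for the Unity scripting later in this chapter). It assumes a simple integer, non-drop-frame rate, and the helper name is made up for this illustration:

using System;

static class Timecode
{
    // Converts a non-drop-frame HH:MM:SS:FF address into an absolute frame count
    // at a given integer frame rate (e.g. 24, 25 or 30 fps).
    public static long ToFrames(int hours, int minutes, int seconds, int frames, int frameRate)
    {
        long totalSeconds = hours * 3600L + minutes * 60L + seconds;
        return totalSeconds * frameRate + frames;
    }
}

// The conventional first-frame address 01:00:00:00 at 24 fps:
// Timecode.ToFrames(1, 0, 0, 0, 24) returns 86400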
Once you have imported the video, set up your DAW to the proper
timecode format and lined up your movie, you're almost ready to sound
design. The next step is to set up the routing and gain staging of the session.

3. Clipping Is Easy – Mind the Signal Path


As you can see from Figure 6.1, the inserts are located pre-fader. A common
mistake is to assume that if an audio track is clipping and the meter is in the
red, the problem can be solved by reducing the level with the main fader.
This will indeed turn the audio level down, and the meter may no longer be in
the red if you are monitoring the level post-fader, which is often the default.
Doing this, however, only makes the signal quieter, and the clipping is still
present, polluting your signal.

Figure 6.3

The clipping may not be obvious, especially to tired ears and mixed in with
other audio signals, but this can lead to harsh sounding mixes and make your
task much more difficult.
A better solution is to turn the gain down at the first
insert by adding a trim plugin, reducing the level before it hits the
first plugin and preventing any clipping from occurring in the first place.

Use the Dynamic Range

The term dynamic range in the context of a mixing session or a piece of equip-
ment refers to the difference – or ratio – between the loudest and the softest
sound or signal that can be accurately processed by the system. In digital audio,
the loud portion of the range refers to the point past which clipping occurs, intro-
ducing distortion by shaving off the top of the signal. The top of the dynamic
range in the digital audio domain is set to 0dBFS, where FS stands for full scale.
Figure 6.4 shows the same audio file, but the right one shows the charac-
teristic flat top of a clipped audio file, and the fidelity of the audio file will be
severely affected.

Figure 6.4

In the digital audio world, the bottom of the dynamic range depends on the
number of bits the session or processor is running at. A rule of thumb is that
1 bit = 6dB of dynamic range. Keep in mind this is an approximation, but it
is a workable one. A session at 24 bits will therefore offer a dynamic range
of 144dB, from 0 to −144dBFS. This, theoretically, represents a considerable
improvement over previous high-end large format analog mixing consoles.
Any signal below that level will simply blend into the background noise and
probably will sound quite noisy as it approaches that level.
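As a quick check of where the 6dB-per-bit rule of thumb comes from (this derivation is not in the original text): each additional bit doubles the number of representable amplitude values, and 20 x log10(2) ≈ 6.02dB per bit, so 24 bits x 6.02dB ≈ 144.5dB, commonly rounded down to 144dB.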

Figure 6.5

Clipping therefore ought not to be an issue. Yet it often is. A well-mastered
modern pop music track, when imported into a session, will already bring
the level on your master bus dangerously close to the 0dB mark. While it might be
tempting to lower the master fader at this stage, refrain from doing so. Always
address gain staging issues as early as possible. Lowering the master fader may
lower the level on the master bus meter, but in reality, it lends itself to a session
where you are constantly fighting for headroom.
There again, a better solution is to lower the level of the music track, ideally
at the first insert, and push its levels down by 10 to 15dB, with the volume
fader for both the music track and the master fader still at unity. This will give
you a lot more headroom to work with while leaving the volume fader at unity.
If the music track now peaks at −15dB, it is still 129dB above the bottom
of your dynamic range, which, if working with a clean signal where no noise
is already present, gives you more than enough dynamic range to work with.
As good practice, I recommend always keeping the mixer’s master fader at
unity.

4. Setting Up a Basic Session for Linear Mixes and Cut Scenes


Next we will organize the mix around the major components of our soundtrack,
usually music, dialog and sound effects.

a. Music, Dialog and Sound Effects

Delivery of stems is quite common and often expected when working with lin-
ear media. Stems are submixes of the audio by category such as music, dialog
and sound effects. Stems make it convenient to make changes to the mix, such
as replacing the dialog, without needing to revisit the entire mix. Having a
separate music bounce also allows for more flexible and creative editing while
working on the whole mix to picture.
It also makes sense to structure our overall mix in terms of music, effects
and dialog busses for ease of overall mixing. Rather than trying to mix all
tracks at once, the mix ultimately comes down to a balance between the three
submixes, allowing us to quickly change the relative balance between the
major components of the mix.

b. Inserts vs. Effects Loops for Reverberation

Effect loops are set up by using a pre or post-fader send to send a portion of
the signal to a processor, such as reverb, in order to obtain both a dry and
wet version of our signals in the mixer, allowing for maximum flexibility. The
effect we are routing the signal to usually sits on an auxiliary input track.

Figure 6.6

Additionally, when it comes to sound effects such as reverb and delays, which
are meant to be applied to multiple tracks, it usually makes more sense to use
effects loops and sends rather than inserting a new reverb plugin directly on every
track that requires one. The point of reverberation when working with sound
replacement is often to give us a sense for the space the scene takes place in,
which means that most sound effects and dialog tracks will require some rever-
beration at some point. All our sounds, often coming from completely different
contexts, will also sound more cohesive and convincing when going through the
same reverb or reverbs. Furthermore, applying individual plugins to each track
requiring reverb is wasteful in terms of CPU resources and makes it very difficult
to make changes, such as a change of space from indoors to outdoors, as they
must be replicated over multiple instances of the plugins. This process is also time
consuming and difficult to manage as your mix grows in complexity.
As a rule, always set up separate aux send effect loops for reverberation
processors and delays used for modeling the environment. In addition to the
benefits mentioned earlier, this will also allow you to process the effects inde-
pendently from the original dry signal. The use of equalization or effects such
as chorus can be quite effective in enhancing the sound of a given reverb. As
with all rules, though, it can be broken, but only if there is a reason for it.

c. Setting Up the Mix Session

The structure suggested here is intended as a starting point, and ultimately
every audio engineer settles on a format that fits their workflow and the needs
of the project the best. Different formats for delivery may have different needs
in terms of routing and processing, but we can start to include all the elements
outlined so far into a cohesive mix layout.
Figure 6.7 represents the suggested starting point for your mix. From top
to bottom:

Figure 6.7

d. Master Output and Sub Master

In this configuration, no audio from the mix is routed directly to the master
fader. Rather, there is an additional mixing stage, a master sub mix, where all
the audio from our mix is routed. The sub master is then sent to the master
output (sub master -> master output). This gives us an additional mix stage,
the sub master, where all premastering and/or mastering processing can be
applied, while the master output of the mix is used purely as a monitoring stage
for audio levels, spatial image and spectral balance.
Since all premastering or mastering is done at the master sub mix, our master
outputs will be ‘clean’. Should we wish to use a reference track, this configura-
tion means that we can route our reference track directly to the master out and
compare it to the mix without running the reference through any of the master-
ing plugins as well as easily adjust the levels between our mix and the reference.

e. Submixes and Effects Loops

The next stage from the top is where we find the submixes by categories or
groups for music, dialog and sound effects, as well as the effect loops for reverb
and other global effects. All the audio or MIDI tracks in the session are summed
to one of these, with no tracks going directly to the master or sub master output.
Each of the groups will likely in turn contain a few submixes depending on the
needs and complexity of the mix. Sound effects are often the most complex
of the groups and often contain several submixes as illustrated in the diagram.

Figure 6.8

The screenshot shows an example of a similar mix structure for stereo out-
put realized in Avid’s Pro Tools, although this configuration is useful regard-
less of the DAW you are working with. The submixes are located on the left
side of the screen, to the left of the master fader, and the main groups for
music, dialog and sound effects are located on the right side.

• On each of the audio tracks routed to the groups a trim plugin would
be added at the first insert, in order to provide the sound designer with
an initial gain stage and prevent clipping.
• Each audio track is ultimately routed to a music, dialog or sound effect
submix, but some, especially sound effects, are routed to subgroups,
such as ambience, gunshots and vehicles that then get routed to the
sound effect submix.
• Three effect loops were added for various reverberation plugins or
effects.

f. Further Enhancements

We can further enhance our mix by adding features and effects that give us
yet more control.

Dedicated Software LFE Submix

Adding weight to certain sounds, such as impacts and explosions, can be
achieved using a subharmonic generator plugin, which will add low fre-
quency components to any sound that runs through it. These plugins can be
difficult to manage as they introduce powerful low-end frequencies that can in
turn make the mix challenging to manage. Rather than applying these plugins
as inserts on one or multiple tracks, use an effect loop instead, setting it up in
the same way you would a reverb, and send any audio file you desire to add
weight to it.
Using a dedicated submix for the plugin means that we can process the low
frequencies introduced in our mix by the plugin independently from the dry
signal, making it easy to compress them or even high pass filter the very lowest
frequency components out.

Group Sidechaining

Sidechaining is a commonly used technique in mixing where a compressor sits
on track A but is listening (aka ‘is keyed’) to track B, compressing A only when
the level of B crosses the threshold.
We can also use our subgroup structure to apply sidechain compression on
an entire submix at once. A common example of group sidechaining involves
the sound effects being sidechained to the dialog so that the mix naturally
ducks the effects when dialog occurs. Another option would be to sidechain
the music to the sound effects, if we want our sequence to be driven mostly by
sound effects where there is no dialog present. This type of group sidechain-
ing is most common in game engines but is also used in linear mixing.
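To make the ducking behavior concrete, here is a minimal, engine-agnostic sketch in C# of the kind of gain logic a sidechain compressor applies to a sound-effects submix keyed to the dialog submix. The class name, threshold, duck depth and smoothing constants are illustrative assumptions, not values from the text:

class DialogDucker
{
    // gain applied to the sound-effects bus, smoothed block by block
    double currentGainDb = 0.0;

    const double ThresholdDb = -30.0;    // dialog level above which ducking starts
    const double DuckAmountDb = -9.0;    // how far the effects bus is pulled down
    const double AttackPerBlock = 0.5;   // fast move toward the duck target
    const double ReleasePerBlock = 0.05; // slow recovery when dialog stops

    // Call once per audio block with the measured dialog level; returns the gain
    // (in dB) to apply to the effects submix for that block.
    public double NextSfxGainDb(double dialogLevelDb)
    {
        double target = dialogLevelDb > ThresholdDb ? DuckAmountDb : 0.0;
        double rate = target < currentGainDb ? AttackPerBlock : ReleasePerBlock;
        currentGainDb += (target - currentGainDb) * rate;
        return currentGainDb;
    }
}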

Monitoring

While the meters in the mixer section of your DAW give you some sense of the
levels of your track, it is helpful to set up additional monitoring for frequency
content of the mix, stereo image (if applicable) and a good LUFS meter to have
an accurate sense of the actual loudness of your mix.
At this point, we are ready to mix. Additional steps may be required, based
on the session and delivery requirements, of course.

2. Practical Sound Design and Prototyping


When dealing with interactive objects that the player can pilot or operate,
our task becomes a little bit more difficult, as we now need to create sound
objects that can respond in real time and in a believable fashion to the actions
of the player. Often this might involve manipulating sounds in real time,
pitching shifting, layering and crossfading between sounds. More complex
manipulations are also possible; granular synthesis as noted in the previous
chapter is a great way to manipulate audio. Of course, the power of granu-
lar synthesis comes at a computational cost that may disqualify it in certain
situations.

1. Guns
Guns are a staple of sound design in entertainment, and in order to stay
interesting from game to game they demand constant innovation in terms
of sound design. In fact, the perceived impact and power of a weapon very
much depends on the sound associated with it. The following is meant as an
introduction to the topic of gun sound design, as well as an insight as to how
they are implemented in games. There are lots of great resources out there
on the topic, should the reader decide to investigate it further, and they are
encouraged to do so.

a. One Shot vs. Loops

There are many types of guns used in games, but one of the main differences
is whether the weapon is a single shot or an automatic weapon.
Most handguns are single shot or one shot, meaning that for every shot
fired the user needs to push the trigger. Holding down the trigger will not fire
additional rounds.
Assault rifles and other compact and sub compact weapons are sometimes
automatic, meaning the weapon will continue to fire as long as the player is
pushing the trigger or until the weapon runs out of ammunition.

The difference between one shot and automatic weapons affects the way
we design sounds and implement the weapon in the game. With a one-shot
weapon it is possible to design each sound as a single audio asset, including
both the initial impulse (the detonation when the user presses the trigger)
and the tail of the sound, the long decaying portion of the sound.

Figure 6.9

In the case of an automatic weapon, the sound designer may design the
weapon in two parts: a looping sound to be played as long as the player is
holding onto the trigger and a separate tail sound to be played as soon as the
player lets go of the trigger, to model the sound of the weapon decaying as the
player stops firing. This will sound more realistic and less abrupt. Additional
sounds may be designed and triggered on top of the loop, such as the sound
of the shell casings being ejected from the rifle.

Figure 6.10
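A minimal Unity-flavored sketch of this loop-plus-tail approach might look like the following. The component and field names are made up, and it assumes two AudioSources set up in the Inspector: one looping the firing cycle, one reserved for the tail:

using UnityEngine;

public class AutomaticWeaponAudio : MonoBehaviour
{
    [SerializeField] AudioSource fireLoopSource; // looping source: loop = true, clip = firing loop
    [SerializeField] AudioSource tailSource;     // one-shot source for the decay tail
    [SerializeField] AudioClip tailClip;

    void Update()
    {
        if (Input.GetButtonDown("Fire1"))
        {
            fireLoopSource.Play();            // start the firing loop on trigger press
        }
        if (Input.GetButtonUp("Fire1"))
        {
            fireLoopSource.Stop();            // cut the loop on release...
            tailSource.PlayOneShot(tailClip); // ...and let the tail ring out
        }
    }
}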

b. General Considerations

Overall, regardless of the type of weapon you are sound designing and imple-
menting, when designing gun sounds, keep these few aspects in mind:

• Sound is really the best way to give the player a sense of the power and
capabilities of the weapon they’re firing. It should make the player feel the
power behind their weapon, and, short of haptic feedback, sound remains
the best way to convey the impact and energy of the weapon to the player.
Sound therefore plays an especially critical role when it comes to weapons.

• Guns are meant to be scary and need to be loud. Very loud. Perhaps louder
than you've been comfortable designing sounds so far, if this is a new area for
you. A good loudness maximizer/mastering limiter is a must, as is a tran-
sient shaper plugin, in order to make the weapon both loud and impactful.
• Guns have mechanical components; from the sound of the gun being han-
dled to the sound of the firing pin striking the round in the chamber to that
of the bullet casings being ejected after each shot (if appropriate), these
elements will make the weapon sound more compelling and give you as a
sound designer the opportunity to make each gun slightly different.
• As always, do not get hung up on making gun sounds realistic, even if
you are sound designing for a real-life weapon. A lot of sound design-
ers won't even use actual recordings of handguns or guns at all when
sound designing for one.
• The sound of a gun is highly dependent on its environment, especially
the tail end of it. If a weapon is to be fired in multiple environments, you
might want to design the initial firing sound and the environmental
layer separately, so you can swap in the appropriate sound for a given
environment. Some sound designers will take this two-step approach
even for linear applications. That environmental layer may be played on
top of the gun shot itself or baked in with the tail portion of the sound.

Figure 6.11

• A simple rule of thumb for determining the overall loudness of a gun
is the ratio of the length of the barrel to the caliber of the bullet. The
shorter the barrel and the bigger the caliber, the louder the gun.
• Most bullets travel faster than the speed of sound and therefore will
create a supersonic crack. Some bullets are subsonic, designed specifi-
cally to avoid creating excessive noise.

c. Designing a Gunshot

One approach when sound designing a gun is to break down the sound into
several layers. A layered approach makes it easy to experiment with various
samples for each of the three sounds, and individually process the different
aspects of the sound for best results.
Three separate layers are a good place to start:

• Layer 1: the detonation, or the main layer. In order to give your guns
maximum impact, you will want to make sure this sample has a nice
transient component to it. This is the main layer of the sound, which
we are going to augment with the other two.
• Layer 2: a top end, metallic/mechanical layer. This layer will increase
realism and add to the overall appeal of the weapon. You can use this
layer to give your guns more personality.
• Layer 3: a sub layer, to add bottom end and make the sound more
impactful. A subharmonic generator plugin might be helpful. This
layer will give your sound weight.

When selecting samples for each layer, prior to processing, do not limit
yourself to the sounds that are based in reality. For instance, when looking
for a sound for the detonation or the main layer, go bigger. For a handgun,
try a larger rifle or shotgun recording; they often sound more exciting than
handguns. Actual explosions, perhaps smaller ones for handguns, may be
appropriate too.

Figure 6.12

The Detonation/Main Body Layer

As always, pick your samples wisely. A lot of sound effects libraries out there
are filled with gun sounds that are not always of the best quality, may not be
the right perspective (recorded from a distance) or already have a lot of reverber-
ation baked in. You’ll usually be looking for a dry sample, as much as possible
anyway, something that ideally already sounds impressive and scary. Look for
something with a healthy transient. You might want to use a transient shaping
plugin or possibly a compressor with a slow attack time as described in the
previous chapter in order to emphasize the transients further. An equalization
scoop around 300–400Hz might actually be a good way to make a bit more
room for the low and mid frequencies to cut through.

The Top End/Mechanical Layer

When a shot is fired through a gun, some of the energy is transferred into
the body of the gun and in essence turns the gun itself into a resonator. This
is partially responsible for the perceived mechanical or metallic aspect of the
sound. In addition, some guns will eject the casing of the bullet after every
shot. The case being ejected and hitting the floor obviously makes
a sound too. The mechanical layer gives you a lot of opportunity for custom-
ization. When sound designing a lot of guns for a game, inevitably they will
tend to sound somewhat similar. This layer is a good place to try to add some
personality to each gun. Generally speaking, you will be looking for a bright
sound layer that will cut through the detonation and the bottom end layers. It
should help give your gun a fuller sound by filling up the higher frequencies
that the detonation and the sub may not reach. It also adds a transient to your
gun sound, which will make it sound all the more realistic and impactful.

The Sub Layer

The purpose of the sub layer is to give our sounds more weight and impact and
give the player a sense of power, difficult to achieve otherwise, except perhaps
via haptic feedback systems. Even then, sound remains a crucial aspect of
making the player ‘feel’ like their weapon is as powerful as the graphics imply.
A sub layer can be created in any number of ways, all worth experimenting
with.
It can be created using a synthesizer, by modifying an existing bass preset
or creating a new one, and applying a subharmonic generator to it for yet more
depth and weight. Another option is to start from an actual recording, perhaps
an explosion or detonation, low pass filtering it and processing it with a sub-
harmonic generator to give it more weight still. A third option would be to use
a ready-made sub layer, readily found in lots of commercial sound libraries.
Avoid using a simple sine wave for this layer. It may achieve the desired effect
on nice studio monitors but might get completely lost on smaller speakers,
while a more complex waveform, closer to a triangle wave, will translate much
better, even on smaller speakers.

Modeling the Environment

Guns and explosions are impossible to abstract from the environment they
occur in. Indeed, the same weapon will sound quite different indoors and
outdoors, and since in games it is often possible to fire the same gun in
several environments, game sound designers sometimes resort to design-
ing the tail end of the gun separately so that the game engine may con-
catenate them together based on the environment they are played in. In
some cases, sound designers will also add an environment layer to the gun
sounds simply because the reverb available in the game may not be quite
sophisticated enough to recreate the depth of the sound a detonation will
create when interacting with the environment. This environment layer is
usually created by running the sound of the gun through a high-end rever-
beration plugin.
The environment layer may be baked into the sound of the gun – that is,
bounced as a single file out of the DAW you are working with – or triggered
separately by the game engine, on top of the gun sound. This latter approach
allows for a more flexible weapon sound, one that can adapt to various
environments.

Putting It all Together

Once you have selected the sounds for each layer, you are close to being done,
but there still remain a few points to take into consideration.
Start by adjusting the relative mix of each layer to get the desired effect.
If you are unsure how to proceed, start by listening to some of your favorite
guns and weapons sounds from games and movies. Consider importing one or
more in the session you are currently working on as a reference. (Note: make
sure you are not routing your reference sound to any channels that you may
have added processors to.) Listen, make adjustments and check against your
reference. Repeat as needed.
Since guns are extremely loud, don’t be shy, and use loudness maximizers
and possibly even gain to clip the waveform or a layer in it. The real danger
here is to destroy transients in your sound, which may ultimately play against
you. There is no rule here, but use your ears to strike a compromise that is
satisfactory. This is where a reference sound is useful, as it can be tricky to
strike the proper balance.
In order to blend the layers together, some additional processing may
be a good idea. Compression, limiting, equalization and reverberation
should be considered in order to get your gun sound to be cohesive and
impactful.

Player Feedback

It is possible to provide the player with subtle hints to let them know how
much ammunition they have left via sound cues rather than by having to
look at the screen to find out. This is usually done by increasing the volume
of the mechanical layer slightly as the ammunition is running out. The idea is
to make the gun sound slightly hollower as the player empties the magazine.
This approach does mean that you will need to render the mechanical layer
separately from the other two and control its volume via script. While this
requires a bit more work, it can increase the sense of immersion and real-
ism as well as establish a deeper connection between the player and their
weapon.
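A small sketch of how that script-controlled volume might work in Unity, assuming the mechanical layer has its own AudioSource. The names, curve and volume range are placeholders:

using UnityEngine;

public class GunAmmoFeedback : MonoBehaviour
{
    [SerializeField] AudioSource mechanicalLayer; // plays only the mechanical layer of the shot
    [SerializeField] float minVolume = 0.4f;      // volume with a full magazine
    [SerializeField] float maxVolume = 1.0f;      // volume on the last few rounds

    // Call this each time a round is fired, alongside the other layers.
    public void OnShotFired(int roundsLeft, int magazineSize)
    {
        // 0 when the magazine is full, 1 when it is empty
        float emptiness = 1f - (float)roundsLeft / magazineSize;
        mechanicalLayer.volume = Mathf.Lerp(minVolume, maxVolume, emptiness);
        mechanicalLayer.Play();
    }
}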

2. Prototyping Vehicles
When approaching the sound design for a vehicle or interactive element, it is
first important to understand the range of actions and potential requirements
for sounds as well as limitations prior to starting the process.
The implementation may not be up to you, so you will need to know and
perhaps suggest what features are available to you. You will likely need the
ability to pitch shift up and down various engine loops and crossfade between
different loops for each rpm. Consider the following as well: will the model
support tire sounds? Are the tire sounds surface dependent? Will you need
to provide skidding samples? What type of collision sounds do you need to
provide? The answers to these questions and more lie in the complexity of the
model you are dealing with.

a. Specifications

A common starting point for cars is to assume a two-gear vehicle: low and high
gear. For each gear we will create an acceleration and deceleration loop, which
the engine will crossfade between based on the user action.

• Eng_loop_low_acc.wav Low RPM engine loop for acceleration.


• Eng_loop_low_de.wav Low RPM engine loop for deceleration.
• Eng_loop_high_acc.wav High RPM engine loop for acceleration.
• Eng_loop_high_de.wav High RPM engine loop for deceleration.

This is a basic configuration that can easily be expanded upon by adding more
RPM samples and therefore a more complex gear mechanism.
The loops we create should be seamless, therefore steady in pitch and
without any modulation applied. We will use input from the game engine
to animate them, to create a sense of increased intensity as we speed up by
pitching the sound up or decreased intensity as we slow down by pitching the
sound down. As the user starts the car and accelerates, we will raise the pitch
and volume of our engine sample for low RPM and eventually crossfade into
the high RPM engine loop, which will also increase in pitch and volume until
we reach the maximum speed. When the user slows down, we will switch to
the deceleration samples.

Figure 6.13

Let’s start by creating the audio loops, which we can test using the basic car
model provided in the Unity Standard Assets package, also provided in the
Unity level accompanying this chapter.

b. Selecting Your Material

When working on a vehicle it is tempting to start from the sound of a similar
looking or functioning real-world vehicle and try to recreate it in the game.
Sample libraries are full of car and truck samples that can be used for this
purpose, or, if you are feeling adventurous, you can probably record a car
yourself. A little online research can give you tips about what to look out for
when recording vehicles. This can be a very effective approach but can be
somewhat underwhelming ultimately without further processing. Remember
that reality, ultimately, can be a little boring.
Another approach is to look at other types of vehicles, such as propeller
airplanes and boats, and layer them together to create a new
engine sound altogether.
Finally, the third option is to use sounds that have nothing to do with a car
engine – gathered via recordings or synthesized – and create the loops required
from this material.
Always try to gather and import into your sound design session more than
you think you will need. This will allow you to be flexible and give you more
options to experiment with.

c. Processing and Preparing Your Material

Once you have gathered enough sounds to work with, it's time to import them
and process them in order to create the four loops we need.

There are no rules here, but there are definitely a few things to watch out for:

• The sample needs to loop seamlessly, so make sure that there are no obvi-
ous variations in pitch and amplitude that could make it sound like a loop.
• Do not export your sounds with micro fades.

Use all the techniques at your disposal to create the best possible sound, but, of
course, make sure that whatever you create is in line with both the aesthetics
of the vehicle and the game in general.
Here are a few suggestions for processing:

• Layer and mix: do not be afraid to layer sounds in order to create the
right loop.
• Distortion (experiment with various types of distortion) can be applied
to increase the perceived intensity of the loop. Distortion can be
applied or ‘printed’ as a process in the session, or it can be applied in
real time in the game engine and controlled by a game parameter, such
as RPM or user input.
• Pitch shifting is often a good way to turn something small into some-
thing big and vice versa or into something entirely different.
• Comb filtering is a process that often naturally occurs in a combustion
engine; a comb filter tuned to the right frequency might make your
sound more natural and interesting sounding.
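To illustrate the comb filter idea in the last point: a comb filter is essentially a short delay mixed back with the input, producing resonant peaks at multiples of sampleRate/delay. A small C# sketch, with the tuning frequency and feedback amount as placeholder assumptions:

using System;

class CombFilter
{
    readonly double[] buffer;
    readonly double feedback;
    int writeIndex;

    // tuneHz sets the fundamental of the resonant series; feedback (0..1) sharpens the peaks.
    public CombFilter(double sampleRate, double tuneHz, double feedback)
    {
        int delaySamples = Math.Max(1, (int)Math.Round(sampleRate / tuneHz));
        buffer = new double[delaySamples];
        this.feedback = feedback;
    }

    public double Process(double input)
    {
        double delayed = buffer[writeIndex];              // sample from one period ago
        double output = input + delayed;                  // peaks at multiples of tuneHz
        buffer[writeIndex] = input + delayed * feedback;  // feedback path
        writeIndex = (writeIndex + 1) % buffer.Length;
        return output;
    }
}

// e.g. new CombFilter(48000, 55, 0.6) adds a resonant series at roughly 55Hz and its harmonics.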

Once you have created the assets and checked that their length is correct, that they loop
without issue and that they sound interesting, it’s time for the next step, hearing
them in context, something that you can only truly do as you are prototyping.

d. Building a Prototype

No matter how good your DAW is, it probably won’t be able to help you with
the next step, making sure that, in the context of the game, as the user speeds
up and slows down, your sounds truly come to life and enhance the experi-
ence significantly.
The next step is to load the samples in your prototype. The tools you use
for prototyping may vary, from a MaxMSP patch to a fully functioning object
in the game engine. The important thing here is not only to find out if the
sounds you created in the previous step work well when ‘put to picture’, it’s
also to find out what the best ranges are for the parameters the game engine
will control. In the case of the car, the main parameters to adjust are pitch shift,
volume and crossfades between samples. In other words, tuning your model. If
the pitch shift applied to the loops is too great, it may make the sound feel too
synthetic, perhaps even comical. If the range is too small, the model might not
be as compelling as it otherwise could be and lose a lot of its impact.
We will rely on the car model that comes with the Unity Standard Assets
package, downloadable from the asset store. It is also included in the Unity
level for this chapter. Open the Unity project PGASD_CH06 and open the
scene labelled ‘vehicle’. Once the scene is open, in the hierarchy, locate and
click on the Car prefab. At the bottom of the inspector for the car you will
find the Car Audio script.

Figure 6.14

The script reveals four slots for audio clips, as well as some adjustable param-
eters, mostly dealing with pitch control. The script will also allow us to work
with a single clip for all the engine sounds or with four audio clips, which is
the method we will use. You can switch between both methods by clicking on
the Engine Sound Style tab. You will also find the script that controls the audio
for the model, and although you are encouraged to look through it, it may
make more sense to revisit the script after going through Chapters seven and
eight if you haven’t worked with scripting and C# in Unity. This script will
crossfade between a low and high intensity loop for acceleration and decel-
eration and perform pitch shifting and volume adjustments in response to the
user input. For the purposes of this exercise, it is not necessary to understand
how the script functions as long as four appropriate audio loops have been
created. Each loop audio clip, four in total, is then assigned to a separate audio
source. It would not be possible for Unity to swap samples as needed using
a single audio source and maintain seamless playback. A short interruption
would be heard as the clips get swapped.
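For readers who want to see the shape of that logic without opening the Standard Assets script, here is a highly simplified stand-in in C#. It is not the actual Car Audio script; the field names, pitch range and crossfade curve are assumptions to be tuned by ear against the real model:

using UnityEngine;

public class SimpleEngineAudio : MonoBehaviour
{
    [SerializeField] AudioSource lowAccel;   // Eng_loop_low_acc.wav
    [SerializeField] AudioSource lowDecel;   // Eng_loop_low_de.wav
    [SerializeField] AudioSource highAccel;  // Eng_loop_high_acc.wav
    [SerializeField] AudioSource highDecel;  // Eng_loop_high_de.wav

    [SerializeField] float minPitch = 0.8f;
    [SerializeField] float maxPitch = 1.6f;

    // rpm01: engine speed normalized to 0..1; accelerating: true while the throttle is down
    public void UpdateEngine(float rpm01, bool accelerating)
    {
        float pitch = Mathf.Lerp(minPitch, maxPitch, rpm01);
        // crossfade from the low loops to the high loops as rpm rises
        float highBlend = Mathf.Clamp01((rpm01 - 0.4f) / 0.4f);
        float accelBlend = accelerating ? 1f : 0f;

        SetState(lowAccel, pitch, (1f - highBlend) * accelBlend);
        SetState(highAccel, pitch, highBlend * accelBlend);
        SetState(lowDecel, pitch, (1f - highBlend) * (1f - accelBlend));
        SetState(highDecel, pitch, highBlend * (1f - accelBlend));
    }

    static void SetState(AudioSource src, float pitch, float volume)
    {
        src.pitch = pitch;
        src.volume = volume;
        if (volume > 0f && !src.isPlaying) src.Play();
    }
}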
Next, import your sounds in the Unity project for each engine loop, load
them in the appropriate slot in the car audio script and start the scene. You
should be able to control the movement of the car using the WASD keys.
Listen to the way your sounds play off each other. After driving the
vehicle for some time and getting a feel for it, ask yourself a few basic questions:

• Does my sound design work for this? Is it believable and does it make
the vehicle more exciting to drive?
• Do the loops work well together? Are the individual loops seamless?
Do the transitions from one sample to another work well and convey
the proper level of intensity? Try to make sure you can identify when
and how the samples transition from one another when the car is
driving.
• Are any adjustments needed? Are the loops working well as they are,
or could you improve them by going back to your DAW and exporting
new versions? Are the parameter settings for pitch or any other avail-
able ones at their optimum? The job of a game audio designer includes
understanding how each object we are designing sound for behaves,
and adjusting available parameters properly can make or break our
model.

In all likelihood, you will need to experiment in order to get to the best results.
Even if your loops sound good at first, try to experiment with the various
settings available to you. Try using different loops, from realistic ones, based
on existing vehicles, to completely made-up ones, using other vehicle
sounds and any other interesting sounds at your disposal. You will be surprised
at how different a car can feel when different sounds are used for its engine.
Other sounds may be required in order to make this a fully interactive and
believable vehicle. Such a list may include:

• Collision sounds, ideally different sounds for different impact velocity.


• Tire sounds, ideally surface-dependent.
• Skidding sounds.
• Shock absorber sounds.

There is obviously a lot more to explore here and to experiment with. This car
model does not include options to implement a lot of the sounds mentioned
earlier, but that could be easily changed with a little scripting knowledge.
Even so, adding features may not be an option based on other factors such
as RAM, performance, budget or deadlines. Our job is, as much as possible,
to do our best with what we are handed, and sometimes plead for a feature
we see as important to making the model come to life. If you know how to
prototype regardless of the environment, your case for implementing new
features will be stronger, since you will already have a working model to
demonstrate your work and can plead your case more convincingly to the
programming team or the producer.

3. Creature Sounds
Creatures in games are often AI characters that can sometimes exhibit a wide
range of emotions, and sound plays a central role in effectively communicating
them. As always, prior to beginning the sound design process, try to under-
stand the character or creature you are working on. Start with the basics: is it
endearing, cute, neutral, good, scary etc.? Then consider what its emotional
span is. Some creatures can be more complex than others, but all will usually
have a few basic emotions and built-in behaviors, from simply roaming around
to attacking, getting hurt or dying. Getting a sense for the creature should be
the first thing on your list.

a. Primary vs. Secondary Sounds

Once you have established the basic role of the creature in the narrative,
consider its physical characteristics: is it big, small, reptilian, feline? The
appearance and its ‘lineage’ are great places to start in terms of the sonic
characteristics you will want to bring out. Based on its appearance, you can
determine if it should roar, hiss, bark, vocalize, a combination of these or
more. From these characteristics, you can get a sense for the creature’s main
voice or primary sounds, the sounds that will clearly focus the player’s atten-
tion and become the trademark of this character. If the creature is a zombie,
the primary sounds will likely be moans or vocalizations.
Realism and believability come from attention to detail; while the main
voice of the creature is important, so are all the peripheral sounds that will
help make the creature truly come to life. These are the secondary sounds:
breaths, movement sounds coming from a creature with a thick leathery skin,
gulps, moans and more will help the user gain a much better idea of the type of
creature they are dealing with, not to mention that this added information
will also help consolidate the feeling of immersion felt by the player. In the
case of a zombie, secondary sounds would be breaths, lips smacks, bones
cracking or breaking etc. It is, however, extremely important that these
peripheral or secondary sounds be clearly understood as such and do not get
in the way of the primary sounds, such as vocalizations or roars for instance.
This could confuse the gamer and could make the creature and its intentions
hard to decipher. Make sure that they are mixed in at lower volume than the
primary sounds.
Remember that all sound design should be clearly understood or leg-
ible. If it is felt that a secondary sound conflicts with one of the primary
sound effects, you should consider adjusting the mix further or removing it
altogether.

b. Emotional Span

Often, game characters, AI or not, will go through a range of emotions in the
game’s lifespan. These are often, for AI at least, dictated by the game state and
will change based on the gameplay. A sentinel character can be relaxed, alert or
fighting; it can inflict or take damage and possibly kill or die. These actions or states
should be reflected sonically of course, by making sure our sound design for
each state is clear and convincing. It may be overkill to establish a mood map
(but if it helps you, by all means do), yet it is important to make sure that the
sounds you create all translate these emotions clearly and give us a wide range
of sonic transformations while at the same time clearly appearing to be ema-
nating from the same creature.
The study or observation of how animals express their emotions in the real
world is also quite useful. Cats and dogs can be quite expressive, making it
clear when they are happy by purring or when they are angry by hissing and
growling in a low register, possibly barking etc. Look beyond domestic ani-
mals and always try to learn more.
Creature sound design tends to be approached in one of several ways:
by processing and layering human voice recordings, by using animal sounds,
by working from entirely removed but sonically interesting material or any
combination of these.

c. Working With Vocal Recordings

A common approach to designing creature sounds is to begin with a human
voice and emote based on the character in a recording studio. These sounds
are usually meant to be further processed, but it is important to record a lot
of good quality material at this stage. Do not worry too much about synchro-
nization at this point; this is what editing is for. Try loosely matching anima-
tions, that is if any were provided, and record a wide variety of sounds. Your
voice or that of the talent may not match the expected range of the character,
perhaps lacking depth or having too much of it, but the raw sounds and emo-
tions are more important at this point. Emotion is harder to add to a sound
after the fact, and while it can be done, usually by drawing pitch envelopes
and layering different sounds together, it is faster to work with a file that
already contains the proper emotional message and process it to match the
character on screen.
As always, record more material than you think you’re going to need. This
will give you more to work with and choose from, always recording multiple
takes of each line or sound.
Also make sure your signal path is clean, giving you a good signal to work
with in the first place. This means watching out for noise, unwanted room
ambiences, room tones etc.
Traditionally, large diaphragm condenser microphones are used for voice
recording, but in noisy environments you may obtain cleaner results with a
good dynamic microphone, though you might need to add some high-end
back into the signal during the design and mix process.

Pitch Shifting in the Context of Creature Design

Your voice talent may sound fabulous and deliver excellent raw material, but
it is unlikely that they will be able to sound like a 50-meter-tall creature or
a ten-centimeter fairy. This is where pitch shifting can be extremely helpful.

Pitch shifting was detailed in the previous chapters, but there are a few fea-
tures that are going to be especially helpful in the context of creature sound
design.
Since pitch is a good way to gauge the size of a character, it goes without
saying that raising the pitch will make the creature feel smaller, while lowering it
will inevitably increase its perceived size.
The amount of pitch shift to be applied is usually specified in cents and
semitones.
Note: there are 12 semitones in an octave and 100 cents in a semitone.
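In playback-rate terms, a transposition of s semitones and c cents corresponds to a ratio of 2^((s + c/100)/12). A small helper for reference (the class and method names are made up for this example):

using System;

static class PitchMath
{
    // Converts a transposition in semitones (plus optional cents) to a playback-rate ratio.
    public static double SemitonesToRatio(double semitones, double cents = 0.0)
    {
        return Math.Pow(2.0, (semitones + cents / 100.0) / 12.0);
    }
}

// Example: SemitonesToRatio(-12) returns 0.5, one octave down; without formant
// correction the formants move down with it, which is part of what makes the source sound bigger.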
The amount by which to transpose the vocal recording is going to be a
product of size and experimentation, yet an often-overlooked feature is the
formant shift parameter. Not all pitch shifting plugins have one, but it is rec-
ommended to invest in a plugin that does.
Formants are peaks of spectral energy that result from resonances usually
created by the physical object that created the sound in the first place. More
specifically, when it comes to speech, they are a product of the vocal tract and
other physical characteristics of the performer. The frequency of these for-
mants therefore does not change very much, even across the range of a singer,
although they are not entirely static in the human voice.

Table 6.1 Formant frequencies in Hz

                     E       A       Oh      Ooh
Men     Formant 1    270     660     730     300
        Formant 2    2290    1720    1090    870
        Formant 3    3010    2410    2440    2240
Women   Formant 1    310     860     850     370
        Formant 2    2790    2050    1220    950
        Formant 3    3310    2850    2810    2670

These values are meant as starting points only, and the reader is encouraged to research more
information online for more detailed information.

When applying pitch shifting techniques that transpose the signal and
ignore formants, these resonant frequencies also get shifted, implying a
smaller and smaller creature as they get shifted upwards. This is the clas-
sic ‘chipmunk’ effect. Having individual control over the formants and the
amount of the pitch shift can be extremely useful. Lowering the formants
without changing the pitch can make a sound appear to be coming from
a larger source or creature, and vice versa. Having independent control of
the pitch and formant gives us the ability to create interesting and unusual
hybrid sounds.
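Not every tool exposes this separation, but the idea can be sketched with a vocoder-style analysis and resynthesis. The fragment below is a rough illustration only, assuming the pyworld and soundfile packages are available; the file names and ratios are placeholders, not values from this chapter.

    import numpy as np
    import pyworld as pw
    import soundfile as sf

    x, fs = sf.read("voice_take.wav")             # hypothetical input file
    if x.ndim > 1:
        x = x.mean(axis=1)                        # WORLD expects mono
    x = np.ascontiguousarray(x, dtype=np.float64)

    f0, sp, ap = pw.wav2world(x, fs)              # pitch, spectral envelope, aperiodicity

    f0_shifted = f0 * 0.5                         # one octave down, formants untouched

    # Formant shift: stretch the spectral envelope along the frequency axis.
    # A ratio below 1.0 pulls formants down, suggesting a larger creature.
    ratio = 0.8
    n_bins = sp.shape[1]
    src = np.minimum((np.arange(n_bins) / ratio).astype(int), n_bins - 1)
    sp_shifted = np.ascontiguousarray(sp[:, src])

    y = pw.synthesize(f0_shifted, sp_shifted, ap, fs)
    sf.write("creature_take.wav", y, fs)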
A lot of pitch correction algorithms provide this functionality as well and
are wonderful tools for sound design. Since pitch correction algorithms often
include a way to draw pitch, they can also be used to alter the perceived emo-
tion of a recording. By drawing an upward pitch gesture at the end of a sound, for instance, you can make it sound inquisitive.

Distortion in the Context of Creature Design

Distortion is a great way to add intensity to a sound. The amount and type of
distortion should be decided based on experience and experimentation, but
when it comes to creature design, distortion can translate into ferocity. Distor-
tion can either be applied to an individual layer of the overall sound or to a
submix of sounds to help blend or fuse the sounds into one while making the
overall mix slightly more aggressive. Of course, if the desired result is to use
distortion to help fuse sounds together and add mild harmonics to our sound,
a small amount of distortion should be applied.
Watch out for the overall spectral balance upon applying distortion, as
some algorithms tend to take away high frequencies and as a result the overall
effect can sound a bit lo-fi. If so, restore the high-frequency content with an equalizer or aural exciter.
Note: as with many processes, you might get more natural-sounding results
by applying distortion in stages rather than all at once. For large amounts, try splitting the process across two separate plugins in series, each carrying half of the load.
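As a simple illustration of staged distortion, the sketch below applies a soft-clipping waveshaper twice at moderate drive instead of once at high drive; the drive values and test signal are placeholders, not settings from the text.

    import numpy as np

    def soft_clip(x: np.ndarray, drive: float) -> np.ndarray:
        # Tanh waveshaper; 'drive' sets how hard the signal is pushed into saturation.
        return np.tanh(drive * x) / np.tanh(drive)

    sr = 44100
    x = np.sin(2 * np.pi * 110 * np.arange(sr) / sr)   # stand-in for a growl layer

    harsh  = soft_clip(x, 8.0)                  # one heavy pass
    staged = soft_clip(soft_clip(x, 2.8), 2.8)  # two gentler passes in series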

Equalization in the Context of Creature Design

As with any application, a good equalizer will provide you with the ability to fix any tonal issues with the sound or sounds you are working with – adding bottom end to a growl to make it feel heavier and bigger, or simply bringing up the high-frequency content after a distortion stage, for instance.
Another less obvious application of equalization is the ability to add
formants into a signal that may not contain any or add more formants to a
signal that already does. By adding formants found in the human voice to non-human creature sounds, we can achieve interesting hybrid results.
Since a formant is a buildup of acoustical energy at a specific frequency, it
is possible to add formants to a sound by creating very narrow and powerful
boosts at the right frequency. This technique was mentioned in Chapter five as
a way to add resonances to a sound and therefore make it appear like it takes
place in a closed environment.
In order to create convincing formants, drastic equalization curves are
required. Some equalizer plugins will include various formants as parts of
their presets.
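One way to sketch the idea in code is to overlay narrow resonant band-pass components on the dry signal at the chosen formant frequencies. The example below assumes SciPy is available (its iirpeak designer gives a narrow resonator) and borrows the male ‘A’ formants from Table 6.1; the gain and Q values are only starting points.

    import numpy as np
    from scipy.signal import iirpeak, lfilter

    def add_formants(x, sr, freqs_hz, gain=0.5, q=20.0):
        # Overlay narrow resonances at the given frequencies onto the dry signal.
        y = x.copy()
        for f in freqs_hz:
            b, a = iirpeak(f, q, fs=sr)       # narrow band-pass centred on the formant
            y += gain * lfilter(b, a, x)      # mix the resonant band back in as a boost
        return y

    sr = 44100
    source = np.random.randn(sr) * 0.1        # stand-in source; use a real recording in practice
    voiced = add_formants(source, sr, [660, 1720, 2410])   # male 'A' formants from Table 6.1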

d. Working With Animal Samples

Animal samples can provide us with great starting points for our creature
sound design. Tigers, lions and bears are indeed a fantastic source of fero-
cious and terrifying sounds, but at the same time they offer a huge range of
emotions: purring, huffing, breathing, whining. The animal kingdom is a very
rich one, so do not limit your searches to these obvious candidates. Look far and wide, research other sound designers’ work on films and games, and experiment.
The main potential pitfall when working with animal samples is creating something that still sounds like the source animal – too easily recognizable as a lion or other large feline, for instance. This usually happens because the samples were not processed far enough to make them less easily identifiable. Another trick to help disguise sounds
further is to chop off the beginning of the sample you are using. By remov-
ing the onset portion of a sample you make it harder to identify. Taking
this technique further you can also swap the start of a sample with another
one, creating a hybrid sound that after further processing will be difficult
to identify.

Amplitude Modulation in the Context of Creature Design

Amplitude modulation can be used in two major ways: to create a tremolo effect or to add sidebands to an existing sound. A rapid tremolo effect is a good way to bring out an insect-like quality in creatures, such as the rapid wing flap of a fly. It can also be applied to other sounds to impart a similar quality.
When applied as ring modulation, the process will drastically change the
current harmonic relationship of the sound by adding sidebands to every
frequency component of the original sound while at the same time remov-
ing these original components. In other words, ring modulation removes the
original partials in the sound file and replaces them with sidebands. While the
process can sound a little electronic, it is a great way to drastically change a
sound while retaining some of its original properties.
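A bare-bones sketch of both uses, with placeholder rates and depths rather than values from the text:

    import numpy as np

    sr = 44100
    t = np.arange(sr * 2) / sr                    # two seconds of time
    x = np.sin(2 * np.pi * 220 * t)               # stand-in for a creature vocal layer

    # Tremolo: modulate the amplitude around its resting level; a fast rate
    # (tens of Hz) starts to suggest insect-like wing flutter.
    depth, rate = 0.8, 30.0
    tremolo = x * (1.0 - depth / 2 + (depth / 2) * np.sin(2 * np.pi * rate * t))

    # Ring modulation: multiply by a bipolar carrier, replacing the original
    # partials with sum and difference sidebands.
    carrier = np.sin(2 * np.pi * 150.0 * t)
    ring_mod = x * carrier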

Convolution in the Context of Creature Design

Convolution can be a potentially very powerful tool for creature sound
design. Although most frequently used for reverberation, convolution
can be very effective at creating hybrid sounds by taking characteris-
tics of two different sounds and creating a new, hybrid audio file as a
result. The outcome will tend to be interesting, perhaps even surpris-
ing, as long as both files share a common spectrum. In other words,
for convolution to yield its most interesting results, it is best if the files’
frequency content overlaps. You will also find that often, unless the
algorithm used compensates for it, the resulting file of a convolution
can come out lacking in high frequencies. This is because convolution tends to emphasize the frequency areas where both files carry the most energy, while minimizing the content where the energy in either or both files is weaker. High frequencies are often not as powerful as other frequency ranges, such as the midrange, in most sounds.

When trying to create hybrid sounds using convolution, first make sure the
files you are working with are optimal and share at least some frequency con-
tent. You may also find that you get slightly more natural results if you apply
an equalizer to emphasize high frequencies in either input file, rather than
compensating after the process.
Some convolution plugins will give you control over the window length or
size. Although this term, window size, may be labelled slightly differently in
different implementations, it is usually expressed as a power of two, such as
256 or 512 samples. This is because most convolution algorithms are imple-
mented in the frequency domain, often via a Fourier algorithm, such as the
fast Fourier transform.
In this implementation, both audio signals are broken down into small
windows whose length is a power of two, and a frequency analysis is run
on each window or frame. The convolution algorithm then performs a
spectral multiplication of each frame and outputs a hybrid. The resulting
output is then returned to the time domain by performing an inverse Fou-
rier transform.
The process of splitting the audio in windows of a fixed length is not
entirely transparent, however. There is a tradeoff at the heart of this process
that is common to a lot of FFT-based algorithms: a short window size, such
as 256 and under, will tend to result in better time resolution but poorer fre-
quency resolution. Inversely, a larger window size will yield better frequency
resolution and a poorer time resolution. In some cases, with larger window
sizes, some transients may end up lumped together, disappearing or getting
smeared. Make your best guess at a window size based on your material, and adjust from there.
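A stripped-down version of that frame-by-frame spectral multiplication might look like the sketch below (pure NumPy, equal-length mono arrays assumed, and none of the overlap-add refinements a real implementation would use); the window size is exposed so the time/frequency trade-off can be heard directly.

    import numpy as np

    def spectral_hybrid(a, b, window_size=512):
        # Frame both inputs, multiply their spectra, and resynthesise a hybrid signal.
        n = min(len(a), len(b)) // window_size * window_size
        out = np.zeros(n)
        win = np.hanning(window_size)
        for start in range(0, n, window_size):
            fa = np.fft.rfft(a[start:start + window_size] * win)
            fb = np.fft.rfft(b[start:start + window_size] * win)
            out[start:start + window_size] = np.fft.irfft(fa * fb)
        return out / (np.abs(out).max() + 1e-12)   # crude overall normalisation

    # Smaller windows (256 and under) keep transients; larger ones (2048 and up)
    # keep pitch and formant detail at the cost of time resolution.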
Experimentation and documenting your results are keys to success.

e. Working With Non-Human or Animal Samples

A less obvious approach when gathering material for creature and monster sound design is to use material from entirely different sources than humans or animals. Remember that we can find interesting sounds all around us, and non-organic elements can be great sources of raw material. Certain types of sounds might be more obvious candidates
than others. The sound of a flame thrower can be a great addition to a
dragon-like creature, and the sound of scraping concrete blocks or stone can
be a great way to add texture to an ancient molten lava monster, but we can
also use non-human or animal material for primary sounds such as vocaliza-
tions or voices.
Certain sounds naturally exhibit qualities that make them sound
organic. The right sound of a bad hinge on a cabinet door, for instance,
can sound oddly similar to a moan or creature voice when the door is
slowly opening. The sound of a plastic straw pulled out of a fast food cup
can also, especially when pitch shifted down, have similar characteristics.
The sound of a bike tire pump can sound like air coming out of a large
creature’s nostrils and so on. It’s also quite possible to add formants to
most sounds using a flexible equalizer as was described in the previous
section.
Every situation is different, of course, and so is every creature. Keep experimenting with new materials, new sounds and new techniques. Combining human, animal and non-organic material can create some of the most interesting and unpredictable results.

4. An Adaptive Crowd Engine Prototype in MaxMSP


Our next example is a simple adaptive crowd engine, built this time in Max-
MSP. MaxMSP is a graphical programming environment for audio and visual
media. This example is meant to recreate the crowd engines you can find in
classic large arena sports games and demonstrate the basic mechanics of how
the crowd sounds react to the action.1
In order to create an evolving and dynamic ambience, we will rely on four
basic loops, one for each state the crowd can be in: quiet, medium intensity,
high intensity, and finally upset or booing.
Rather than doing simple crossfades between two samples, we will rely on
an XY pad instead, with each corner linked to an audio file. An XY pad gives
us more options and a much more flexible approach than a simple crossfade.
By moving the cursor to one of the corners, we can play only one file at a time.
By sliding it toward another edge, we can mix between two files at a time, and
by placing the cursor in the center of the screen, we can play all four at once.
This means that we could, for instance, recreate the excitement of fans as their team is about to score, while at the same time playing a little of the boos from the opposing fans as they express their discontent. As you can see, XY pads
are a great way to create interactive audio objects, certainly not limited to a
crowd engine.
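Under the hood, the corner levels of such an XY pad are just bilinear weights of the cursor position. A minimal sketch of that mapping (the corner layout and names are illustrative, not taken from the MaxMSP patch):

    def xy_pad_gains(x: float, y: float) -> dict:
        # Map a cursor position (x, y in 0..1) to one gain per corner sample.
        return {
            "crowd_lo":  (1 - x) * (1 - y),   # corner (0, 0)
            "crowd_mid": x * (1 - y),         # corner (1, 0)
            "crowd_hi":  (1 - x) * y,         # corner (0, 1)
            "crowd_boo": x * y,               # corner (1, 1)
        }

    print(xy_pad_gains(0.0, 0.0))   # only the low-intensity loop plays
    print(xy_pad_gains(0.5, 0.5))   # centre: all four loops at 0.25

In practice you might take the square root of each weight for an equal-power blend, so the overall level stays more consistent as the cursor moves.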


We will rely on four basic crowd loops for the main sound of the crowd:

• Crowd_Lo_01.wav: A low intensity crowd sample: the crowd is quiet and waiting for something to happen.
• Crowd_Mid_01.wav: A medium intensity crowd sample: the crowd is getting excited while watching a play.
• Crowd_Hi_01.wav: A high intensity crowd sample: the crowd is celebrating a score or play.
• Crowd_Boo_01.wav: the crowd is unhappy and booing the action.

Each one of these samples should loop seamlessly, and we will work with
loops about 30 seconds to a minute in length, although that figure can be
adjusted to match memory requirement vs. desired complexity and degree of
realism of the prototype. As always when choosing loops, make sure that the looping point is seamless, but also that the recording doesn’t contain an easily remembered sound, such as a loud, high-pitched burst of laughter from someone close to the microphone; the player would eventually recognize it, making the loop feel far less realistic and, ultimately, annoying. In order to load the files into the crowd engine, just drag the desired
file to the area on each corner labelled drop file.
As previously stated, we will crossfade between these sounds by moving the
cursor in the XY pad area. When the cursor is all the way in one corner, only
the sound file associated with that corner should play; when the cursor is in
the middle, all four sound files should play. Furthermore, for added flexibility,
each sound file should also have its own individual sets of controls for pitch,
playback speed and volume. We can use the pitch shift as a way to increase
intensity, by bringing the pitch up slightly when needed or by lowering its
pitch slightly to lower the intensity of the sound in a subtle but efficient man-
ner. This is not unlike how we approached the car engine, except that we will
use much smaller ranges in this case.
In order to make our crowd engine more realistic we will also add a sweeteners
folder. Sweeteners are usually one-shot sounds triggered by the engine to make
the sonic environment more dynamic. In the case of a crowd engine these could be
additional yells by fans, announcements on the PA, an organ riff at a baseball game
etc. We will load samples from a folder and set a random timer for the amount
of time between sweeteners. Audio files can be loaded in the engine by dragging
and dropping them in each corner of the engine, and sweeteners can be loaded by
dropping a folder containing .wav or .aif files into the sweetener area.
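The sweetener logic itself is little more than a random timer picking a random file. A hedged sketch of that scheduling idea (the folder name and timing range are placeholders, and the trigger is simply printed rather than handed to an audio engine):

    import random
    import time
    from pathlib import Path

    def sweetener_loop(folder="sweeteners", min_gap=4.0, max_gap=15.0):
        # Every few seconds, pick a random one-shot from the folder and trigger it.
        files = [p for p in Path(folder).iterdir() if p.suffix.lower() in (".wav", ".aif")]
        while files:
            time.sleep(random.uniform(min_gap, max_gap))   # random wait between one-shots
            print("trigger:", random.choice(files))        # hand off to the audio engine here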
Once all the files have been loaded, press the space bar to start the playback.
By slowly moving and dragging around the cursor in the XY pad while the
audio files are playing, we are able to recreate various moods from the crowd
by starting at a corner and moving toward another. The XY pad is convenient
because it allows us to mix more than one audio file at once; the center posi-
tion would play all four, while a corner will only play one.
Recreating the XY pad in Unity would not be very difficult; all it would
require are five audio sources (one for each corner plus one for the sweeten-
ers) and a 2D controller moving on a 2D plane.
The architecture of this XY pad is very open and can be applied to many
other situations with few modifications. Further improvements may include
the addition of a granular synthesis or other processing stage, which could
be used to further animate the audio generated by our engine and obtain a
significantly wider range of variations and intensities, albeit at some compu-
tational cost. Perhaps a more obvious improvement would be to work with
multiple loops for the crowd states, which would also give us more potential
for variations. This architecture also does not have to be used for a crowd
engine; it could easily be applied to ambiences, machines, vehicles and lots
more situations.

Conclusion
Sound design, either linear or interactive, is a skill learned through experimenta-
tion and creativity, but that also requires the designer to be organized and aware
of the pitfalls ahead of them. When it comes to linear sound design, organizing
the session for maximum flexibility while managing dynamic range are going
to be some of the most important aspects to watch out for on the technical
side of things. When it comes to interactive sound design, being able to build
or use prototypes that effectively demonstrate the behavior of the object in the
game by simulating the main parameters is also very important. This will allow
you to address any potential faults with the mechanics or sound design prior
to implementation in the game and communicate more effectively with your
programming team.

Note
1. In order to try out this example, the reader will need to install Cycling ’74’s MaxMSP; a free trial version is available from their website.
CHAPTER 2

Designing a moment

This chapter is excerpted from
Leading with Sound
Rob Bridgett
© 2021 Taylor & Francis Group. All rights reserved.

Learn More »
15 Designing a moment
Temporality in interactive sound design

Given that we have now established how important it is to develop intensity curves (dynamic change over time), we can start to build deeper detail into the changing context of sound from one moment to the next in the game. Understanding a sound’s evolving context is extremely important because it will tell you, the sound designer, what the emotional and gameplay lenses are that you are looking through – the importance of foreshadowing the next event or beat, the importance of letting the player know they are running low on a particular resource, be that health, ammunition, or oxygen if underwater – sounds can and should change over time given the different contexts
the player finds themselves in. Sound designers are, and should always be,
context junkies, because understanding context is the only method through
which we will be able to express the correct information and feeling to the
audience as well as understanding the overall interactive design range of the
sound, and how it needs to change over time and circumstance.
It is important to understand not just the moment you are designing, but
the entire context that this moment occupies within the continuum of the rest
of the gameplay, or story in regard to the sounds being used. What is the
moment preceding? What is the moment following? Getting the right infor-
mation for these contexts can be tricky in game development, because the
development process is so fluid, ongoing and iterative, and even contextual
itself among different disciplines. Unfortunately, the overall web of contexts
can sometimes only become apparent to those making the game towards the end of development.
Designing an explosion sound can take an almost infinite number of forms, and depending on what happened previously in the recent memory of the
player, the explosion may need to accomplish different things beyond just an
‘epic explosion’ which is seen onscreen.
Let’s take an example. (In this example the explosion signifies a wall that
has been brought down by one of our team mates so that we can get through
and escape from an enclosed compound – once we hear the comms and then the
explosion occurs, we must move fast, as the private security team will no doubt
be alerted and come straight into action to find us and respond.) From a
gameplay viewpoint, our explosion needs to be clearly audible and
communicate to the player that the explosion has occurred, while also relaying
the 3D location to the player so that they know in which direction to proceed
to escape. Because of its gameplay and narrative importance, we really need to
hear this explosion, as well as quite a long tail to keep its 3D location present
for longer. We also need it to play really big, because it is an explosion big
enough to punch a hole in a thick perimeter wall of a compound. It also needs
to be louder and more prominent than other gameplay explosion sounds, in order to register as more significant – I should say that all around us many explo-
sions are playing from mortar rockets and grenades, as the private security
team (bad guys) try to find us. Moving on to some of the other sounds that
occur around the explosion, we can do several things to make more room for
this moment: we can precede it with some stillness and moments of tension – almost all of this will need to be co-ordinated with the
design and scripting team, so we can stop mortar and grenade launching events
occurring a few seconds before our main explosion occurs. In that same
preceding moment, we can also begin to de-emphasise (duck down) other
ambient sounds that are playing, like dripping water on leaves and anything
that the player may be otherwise hearing in the immediate visual area; this
allows the player to use their hearing to ‘reach’ further into the outer diegesis
where the explosion is expected to happen. Now we are getting into the story-
telling contexts: we now suddenly hear a few distant shouts of an enemy guard
as they discover our team mate, then we hear a few gunshots, from their loca-
tion – an immediate cause for concern as this would mean that our team mate
has been discovered, and also that they may or may not have been able to set
off the explosion – some more shouts from an enemy security guard to alert the
others – all the while we retain our position and still wait for the cue – the
anxiety building all the time. Then when the explosion is finally triggered,
something feels wrong… the explosion goes off early with no warning, and we
receive no comms telling us that the explosion will be triggered as arranged.
Our teammate was discovered and had to set off the explosion early, in the
process killing the guard who discovered them, as well as seriously injuring
themselves. We hear the news from our injured teammate over the radio.
We know a particular sequence of events is going to occur, but then it goes wrong; as sound designers, we can design the explosion through this lens, using adjectives such as ‘premature’, ‘unexpected’ (we can interpret this as
having a very sudden transient at the beginning of the sound, almost like
receiving a sudden shock) ‘sickening’, ‘revealing’ (we can interpret this as
having a long, low tail to the sound, the sound that was supposed to signal our
escape from this mission, is now exposed and audible to everyone in the com-
plex, the sound going on for longer than would be naturally expected keeps it
alive in both our own ears and the ears of the enemy, meaning everyone can
hear this and will now be headed towards that location). Hearing the echoes of
the sound around the entire, wider environment can also help achieve this
effect. Then, even before the tail of the sound has decayed away, we can begin
to hear the alarms and sounds and shouts of activity from the security team as
they begin to respond. As a player, this will induce a panic and a need to
confront them as they get between our location and that of the hole in the
perimeter wall, and our injured teammate. So, the time pressure is on.
Hearing alarms and the distant scurrying of the enemy all around us,
including some occasional distant shots, raises the player’s attention and
awareness of sounds around them. All this is occurring offscreen, so we are
relying on our ears to communicate the story of what is happening. We had
a plan, we had expectations of what to listen for, and those expectations
started to unravel and change through sound.
Back to the sound designer’s perspective, for this example, on paper, if we
simply received a list of sound effects to design from the design department,
without context, such as ‘wall explosion’, ‘distant gun shots’, ‘enemy walla
shouting’ (not dialogue lines, as it would be indeterminate as to what
was said) and ‘distant alarms’ we would not be able to put the story of those
sounds together. Without understanding the narrative, the sequences of
events, and the expectations of what the player is listening for – we would just
assemble this scene perhaps using our own interpretation on what the most
important sounds were, and how they are heard. If we just get this informa-
tion in the form of a shopping list of sound effects to give to the sound team
to just create those assets as listed, working to simply check those things off a
list, we would lose all this important contextual detail. The first question the
sound designer is hopefully going to ask, about each and every one of those
sounds, is what the context is, what is happening before and after each sound
in the scene. What is the expected outcome, versus the surprise of the outcome we hear, and in what sequence are these sounds occurring? It may seem hard to
believe that sound design can be commissioned without knowing these things,
but it happens all the time.
Sound may go even further in creating a narrative sound story in this example,
by deciding to exaggerate even more tension and feeling that something is
wrong. In the moments before we hear the explosion, we may decide to create
some ominous portents in the immediate diegesis where the player is situated –
bringing down insect sounds around the player (insects would normally not be
affected by anything other than the temperature of the environment or close
proximity of a predator) could be a subtle hint to the player that something is
wrong, a sign, an omen. Adding some confusing and ambiguous sounds in the
extreme distant diegesis may similarly add tension to the stillness prior to the
explosion, perhaps a distant 1980s telephone ringer from another house way up
on a hill, filtered through the environment, echoing and reverberated, this would
certainly add an element of intrigue, and it would add an unexpected and unpre-
dictable image to the general feeling – very distant and very subtle, these slight
and unexpected sounds can start to bring in this idea of unpredictability and we
can start to feel that our character’s plan is unravelling. This kind of orchestra-
tion of sounds in the diegesis, particularly ambiguous sounds, brings all kinds of
questions and feeling to the player – especially heightened in a situation where
the player is already focussed and listening out for specific cues from the
teammate. Adding additional sound stories does not always require tight inte-
gration between the audio and design scripting teams, we are exploiting an
already created lull in the action and adding something in there to create
further tension and further our contexts. As long as we don’t place sounds
that ruin or distract attention too much from the overall idea.
Much of building the context around sound requires that this kind of cross-discipline orchestration and knowledge transfer take place – it is
orchestration not only of the sounds themselves, and the timing and location at
which those sounds occur, but also the co-ordination of, and separation of, the
triggers for those sounds. This can require a lot of meetings and conversation in
front of white boards, or in front of the game itself, understanding visually and
temporally what is happening in the scene and where. Once the details of the
scene are understood across the teams, the design elements of the sound may
require an even further level of planning and detail into exploring the unseen and
in-between areas of the scene, the moments where a sound is triggered, but may
need a lot more time and space than was originally thought from the L1 version.
Sometimes event sequencing and tuning requires a few milliseconds, sometimes
seconds. Co-ordinating the temporal fabric of an in-game moment and its
triggering events can be quite complex, but it is all achievable with a good team
who can communicate fluidly across disciplines, having the right level of
detailed conversations, and the right reviews at the right time. It is essential to
have representatives of all the disciplines at such a review – staying on the
same page can be difficult within game development teams, and it is easy for
one discipline to understand the scene from one context or perspective, whereas
another discipline sees and interprets the scene from an entirely different per-
spective. All these differences, in this web of contexts, should be confronted and
reconciled as soon as possible, and the only way to do that is through review and
conversation about context, and again, always coming back to asking what is the
most important thing in the scene.
Getting to the most important thing could be a question like ‘what is the
most important thing for this character at this moment’ – meaning their
motivation, what are they listening for – or it could be ‘what is the most
important thing for the player at this moment’. There is often a narrative subtext priority – for example, hearing, via audible cues, the location of the person you are supposed to be rescuing – combined with a simultaneous gameplay priority, for example low health. The narrative priority is often overtaken by
the gameplay priority, pushing the narrative elements into the background, but
not removing them entirely. Narrative tells us the immediate and overall emo-
tional goal, whereas the gameplay priority is giving us the immediate survival
needs on a second-to-second basis.
The creation of a tension curve is a good starting point to understanding
the dynamics and priorities of the various elements that go into making up a
game – these can be created for gameplay areas such as combat, or explora-
tion, or traversal experiences, but they can also be created for maps overall, as
well as cinematic scenes as we have highlighted. They allow teams to come
together to talk about and think about what the experience is that they are
creating for the player. They are a focal point for developers to plot the
journey that the player will be taken on, and to figure out where the dynamics
of the game are at a given moment. From here, the team can begin to really
dive in and talk about the detail of the smaller contexts of those scenes. A
single plot on a dynamics curve may be zoomed-into and exploded to reveal
several smaller sub curves or sub events that are driving the overall intensity
of that particular moment, all the while understanding more about the con-
text of each moment, and of the needs of each sound. Once this information
is understood, then the actual work of designing and implementing those
sounds, can begin – and from there, through regular reviews and meetings,
the work can be tweaked, adjusted and polished until the desired experience at
the desired quality level is reached.
CHAPTER 3

Emotion in sound design

This chapter is excerpted from
Sound for Moving Pictures
Neil Hillman
© 2021 Taylor & Francis Group. All rights reserved.

Learn More »
2 Emotion in sound design

Introduction
This chapter looks at notable examples of current professional and academic lit-
erature that are relevant to this topic, and to the increasing interest in the study of
sound and emotion. This chapter also clarifies the terminology used, discusses to
what extent the relationships of music and emotions – and speech and emotions –
are relevant to this book, and looks at existing sound-related theoretical structures.

2.1 Defining the nature of emotions


The original motivation for this book came from my desire as a professional
Sound Designer to investigate and understand quite what the ‘emotional’ element
of an audience’s reaction to moving picture sound design, and soundtrack balanc-
ing, might be; and then, if it was possible to shed light on what that is, to look at
whether reproducible techniques might be employed by fellow Sound Designers
and Re-recording Mixers, to elicit target emotions in an audience.
Clearly, audiences do have emotional reactions to movie soundtracks – there
are obvious, outward signs when a film makes an audience laugh out loud in
a theatre, or even cry; and many of us will have first-hand experience of what
it is to feel happy, fearful or uncomfortable whilst watching (and listening) to
a film.
However, after reading works relevant to my investigation, it soon became
apparent that it was important for me to determine whether the response of an
audience to soundtrack stimuli could be described consistently; because there
appeared to be a distinct commingling of the terms audience emotion and audi-
ence affect. And in seeking an answer to the question of which is the most appro-
priate term between emotion and affect, there seemed to be many examples of
misunderstanding, or even misappropriation, of the two terms.
It is not helped by the fact that any attempt to provide a simple, clear-cut clari-
fication of the difference between affect and emotion is somewhat challenging;
not least of all because language and concepts become increasingly abstract the
deeper one delves into specialist works. However, Massumi proposes that:

Affect is ‘unformed and unstructured,’ and it is always prior to and/or outside
of conscious awareness.
(Shouse, 2005)

Whilst in his essay ‘Why Emotions Are Never Unconscious’, Clore proposes that:

emotions that are felt cannot be unconscious.
(1994, p. 285)

Therefore it may be reasonable to suggest that considering affect as an uncon-
scious process, whilst regarding emotion as a conscious one, could be a good
place to start in differentiating the two terms; at least for the purpose of this con-
versation – so, for instance, the bodily response that arises due to the threat of an
oncoming vehicle, or being caught in a tsunami, could be considered as an exam-
ple of affect; whereas crying in sympathy with an on-screen character’s situation
would be seen as an example of emotion.
But Shouse, for one, is specific about the difference between the terms emotion
and affect:

Although feeling and affect are routinely used interchangeably, it is impor-
tant not to confuse affect with feelings and emotions. As Brian Massumi’s
definition of affect – in his introduction to Deleuze and Guattari’s A
Thousand Plateaus – makes clear, affect is not a personal feeling. Feelings
are personal and biographical, emotions are social, and affects are preper-
sonal. (Shouse, 2005)

And then, by considering an aspect of the work presented by Deleuze and Guattari
themselves, specifically their ‘autonomy of affect’ theory, which proposes that
affect is independent of the bodily mode through which an emotion is made vis-
ible (Schrimshaw, 2013), it seemed to be incongruous, particularly as far as the
topic of this study is concerned, to elevate the impersonal concept of affect over
the personal and social factors that constitute a cinema-viewing experience, and
more readily align with the term emotion.
Another clarification of the two terms is provided by Lisa Feldman Barrett,
writing an endnote to Chapter 4 of her book How Emotions are Made: The Secret
Life of the Brain:

Many scientists use the word ‘affect’ when really, they mean emotion.
They’re trying to talk about emotion cautiously, in a non-partisan way, with-
out taking sides in any debate. As a result, in the science of emotion, the word
‘affect’ can sometimes mean anything emotional. This is unfortunate because
affect is not specific to emotion; it is a feature of consciousness. (Feldman
Barrett, 2017)

Furthermore, Shaviro is emphatic about what is primarily engaging an audience:

Reading a novel, hearing a piece of music, or watching a movie is an emo-
tional experience first of all. Cognition and judgment only come about later,
if at all. (Shaviro, 2016)

And so, throughout this book, it is proposed that the context for the work under-
taken by a Sound Designer and Re-recording Mixer most appropriately lies within
the boundaries of influencing audience emotion.
The challenge of defining what constitutes an emotion remains, however. As Kathrin Knautz observes, whilst it may be straightforward to recognize our own emotions, the difficulty of defining emotion leads some researchers to formulate a definition by looking instead at the features of emotions (Knautz, 2012).
Fehr and Russell comment on this conundrum:

Everyone knows what an emotion is, until one is asked to give a definition.
Then, it seems, no one knows.
(1984, p. 464)

In the introduction to his book From Passions to Emotions, Dixon (2003) suggests
that the rise in academic work in a range of fields concerned with the emotions is a
modern trend, one that is in direct contrast to the preoccupation with intellect and
reason of earlier studies. Furthermore, he feels that this is no bad thing:

Being in touch with one’s emotions is an unquestioned good. (Dixon, 2003,
p.1)

Through his research on Pan-cultural recognition of emotional expressions
(Ekman et al., 1969) and his subsequent work Basic Emotions (Ekman, 1999),
Ekman suggests that six fundamental emotions exist in all human beings: happi-
ness, sadness, fear, anger, surprise and disgust.
Plutchik (2001) broadly agrees with Ekman, but further develops the catego-
ries by creating a wheel of eight opposing emotions, where positive emotions are
counterpointed by equal and opposite negative states: joy versus sadness; anger
versus fear; trust versus disgust; and surprise versus anticipation.
From Ekman and Plutchik’s definitions, Antonio Damasio (2000) further
suggests that more complex emotional states can arise: such as embarrassment,
jealousy, guilt, or pride (sometimes referred to as social emotions), or well-
being, malaise, calm, tension (background emotions). As one of the world’s
leading experts on the neurophysiology of emotions, Damasio summarizes the
fact that without exception men and women of all ages, social and educational
backgrounds are subject to emotions; he also refers to the way different sounds
evoke emotion:

Human emotion is not just about sexual pleasures or fear of snakes. It is also
about the horror of witnessing suffering and about the satisfaction of seeing
justice served; about our delight at the sensual smile of Jeanne Moreau or
the thick beauty of words and ideas in Shakespeare’s verse; about the world-
weary voice of Dietrich Fischer-Dieskau singing Bach’s Ich habe genug
and the simultaneously earthly and otherworldly phrasings of Maria João
Pires playing any Mozart, any Schubert; and about the harmony that Einstein
sought in the structure of an equation. In fact, fine human emotion is even
triggered by cheap music and cheap movies, the power of which should never
be underestimated. (Damasio, 2000, pp. 35–36)

The first studies of emotion with regard to sound were related to music and came
in the late nineteenth century, coinciding with psychology becoming an independ-
ent discipline around 1897; although the early peak in studies was seen sometime
later, in the 1930s and 1940s (Juslin and Sloboda, 2010).
Today, a multidisciplinary approach pervades the field of emotion in music,
and although there is not yet unanimous agreement on whether there are uniquely
musical emotions, or whether the nature of these emotions is basic or complex,
the field of emotion in music is steadily advancing (Ibid.).
Jenefer Robinson articulates the complexity that the analysis of music and
emotions can produce:

the sighing figure is heard as a sigh of misery (a vocal expression), a syn-
copated rhythm is heard as an agitated heart (autonomic activity), a change
from tonic minor to parallel major is heard as a change of viewpoint (a cog-
nitive evaluation) on the situation from unhappiness to happiness, or unease
to serenity, and given the close connection between the two keys and the
fact that the melody remains largely the same, we readily hear the evalua-
tion as ambiguous or as shifting: the situation can be seen as both positive
and negative. […] Overall, we may hear the piece as moving from grief and
anguish to serene resignation, all of which are cognitively complex emotions.
(Robinson, 2005, p. 320)

However, Juslin and Sloboda broaden the perspective of the way sound can evoke
emotion from that of a purely music-based discussion, by suggesting that it is now
recognized that a significant proportion of our day-to-day emotions are evoked
by cultural products other than music; and therefore designers should be mindful
of emotion in the products and interfaces that they design, in order to make them
richer and challenging to the user (Juslin and Sloboda, 2010).
From the advent of the medium, moving picture producers have described and
promoted their films by describing the emotions that the audience is intended to
feel when they watch them (e.g. horror, romantic-comedy, or mystery-thriller).
So, it is reasonable to suggest that audiences have proven themselves to be not only susceptible to, but even desirous of, having their emotions evoked in a movie theatre.
Holland, in Literature and the Brain (2009), writes on our emotional response
to literary work, of which cinema is an important part:

The brain’s tricks become clearer at the movies.
(Holland, 2009, p. 2)

Clearly then, it is important to consider what might be happening to an audience
as they watch a movie.
Holland proposes that a well-designed soundtrack is instrumental in engaging
and enveloping a viewing audience, and a listening-viewer absorbed by on-screen
activities forgets their own body and its immediate surroundings, enabling them
to be transported to all kinds of otherwise improbable locations and situations.
Central to Holland’s line of reasoning is that an emotion is a call to action, or a
disposition to act; yet when we sit in the cinema and have our emotions evoked
through the sound and pictures we are viewing, we remain seated. This, he sug-
gests, is due to a unique contract with the work. Even though we are figuratively
transported by our emotions towards a certain state of mind, we identify that it
is the circumstances of the on-screen activity or character that has aroused these
feelings within us, and it is not a direct consequence of us being in the represented
situation (Holland, 2009).
Because most bodily responses brought about by emotions are visible to
others, they in turn bring about ‘mirroring’ in the viewer. Humans tend to
respond to emotional expressions they see with similar emotions themselves;
and as early as 1890, Darwin noted that emotions communicate in this fashion
(Darwin, 1890).
But Holland suggests that since it is a mirroring process at work, the impulse to
act on the emotion is inhibited: i.e. whilst watching certain actions, motor regions
of the brain experience an impulse to act (the mirroring). However, the brain
inhibits this musculoskeletal expression through a process called the ‘inverted
mirror response’; more fully described by Marco Iacoboni (2008) in his work on
‘super mirror neurons’ (Holland, 2009).
For Holland though, mirroring is not the complete picture of a fuller, immersive
and emotional involvement with an on-screen subject. Our own past experiences
of circumstances like the viewed events are also powerfully evoked; and he states:

We bring to bear on what we now see, some feeling or experience from our
own past. And my bringing my own past to bear on the here and now of trag-
edy makes me feel it all the more strongly. (Holland, 2009, p. 72)

Richard Gerrig includes this in what he calls a ‘participatory response’, and he
notes how it can enrich and intensify one’s ‘emotional experience’ (Holland,
2009, quoting Gerrig, 1996).
It is evident then that sound triggering or affecting human emotions is not just
limited to music; other sounds too can contribute to this process. Certainly, some of
the wider range of emotional stimulation that Damasio describes sits comfortably
within the remit of the audio post-production stages of filmed stories, or televised
drama. Juslin and Sloboda’s comments also suggest that there is both scope and a
basis for the thoughtful use of soundtrack elements to evoke emotional responses
within a listening-viewer; and Holland’s description of how audiences engage
with what they see on-screen would seem to further support this proposition.

2.2 The relevance of speech and emotions research, and music and emotions research, to this study

Whilst there is little research yet dealing specifically with moving picture sound
design and emotions, there is a substantial body of research concerning both
speech and emotions (e.g. Banse & Scherer, 1996; Cowie, 2000; Pereira, 2000)
and music and emotions (e.g. Hunter & Schellenberg, 2010; Juslin & Sloboda,
2010; Swaminathan & Schellenberg, 2015).
Speech and music are two key elements of the compound that constitutes a
moving picture soundtrack; and both contribute greatly to the viewing experience
of movie audiences, not only by virtue of their expressing of emotion, but also by
their being capable of inducing emotion in listening-viewers.
Three aspects of speech and emotions research are particularly relevant in this
study.
First and foremost, both speech and a film’s soundtrack are designed to com-
municate with an audience. A film soundtrack, intended as a compound of speech,
sound effects and music, not only has the ability to be as literal as speech in por-
traying emotions (indeed it contains speech and therefore a character can utter
words such as “I feel sad”, telling the audience explicitly what emotion is at play),
it can also be more so than a musical score alone might.
However, it is important to make clear that this statement is not intended to
diminish the importance of music in movies. Far from it, music is a powerful
emotional tool, particularly when skilfully deployed within a film soundtrack (e.g.
Damasio, 2000).
Many movies are most memorable precisely for their featured musical inter-
ludes,1 which create iconic snapshots that go on to define a production, long after
the film’s fuller storyline has left the consciousness of audiences; e.g. Tiny Dancer
(Comp. Elton John/Bernie Taupin) in Almost Famous (2000) (Dir. Cameron
Crowe/Sound Designer Mike Wilhoit), Bohemian Rhapsody (Comp. Freddie
Mercury) in Wayne’s World (1992) (Dir. Penelope Spheeris/Sound Designer John
Benson) or Always Look On The Bright Side of Life (Comp. Eric Idle) in Life of
Brian (1979) (Dir. Terry Jones/Re-recording Mixer Hugh Strain) to name but
three of a long, 90-years-plus list, that began with The Jazz Singer (1927) (Dir.
Alan Crosland/Sound Engineer Nathan Levinson), the film widely considered to
be the first commercial ‘talkie’.2
But if songs or arias with a text are discounted, it is reasonable to argue that a
music score is less directly meaningful, and overall, it is more abstract than literal
in its nature.
As an aside to this immediate point, but nonetheless still highly relevant to
the way music is used in movies, there is also the constant consideration by
the Re-recording Mixer that music has the ability to emotionally overwhelm a
soundtrack, particularly if its application is not judiciously metered and carefully
balanced with the other mix elements.3 As Sider suggests:

Rather than allow the audience to come to their own conclusions the music
presses an emotional button that tells the audience what to feel, overriding the
words and thoughts of the film’s characters. (Sider, 2003, p. 9)

Tarkovsky would seem to go further:

Above all, I feel that the sounds of this world are so beautiful in themselves
that if only we could learn to listen to them properly, cinema would have no
need of music at all. (Tarkovsky, 1987, p. 162)

So whilst this book looks carefully at the interplay between dialogue and sound
effects, a relationship to which music also makes a conspicuous contribution,
music in this study is treated respectfully for its emotional power in its own right;
but from a Re-recording Mixer’s perspective, music is but one of the sounds that
require balancing.
Because all sounds – not just music – can be emotionally important in a movie
(e.g. a single gunshot suddenly featured in a scene that had only music playing
will immediately draw the listener’s attention away from the music) and whilst
a sound may be interpreted in several ways, often depending on the context it is
heard in, all sounds in this study are referred to, considered as, or classified by,
their primary emotional function or purpose in the soundtrack.
And so, through the combination of all these sounds, the relative proportions
of which are solely determined by the Re-recording Mixer during the act of pre-
mixing and final mixing, the underlying meaning of the soundtrack is revealed.
Secondly, when considering the soundtrack and the way it forms part of
an audio-visual work, there are comparisons that may be drawn between the
Re-recording Mixer’s mix-balancing with an emotional intent in mind, and the
way that everyday speech is used to convey emotion. In speech, the meanings
of words are quite fixed within a language, yet the actual emphasis of the words
being spoken can be quite fluid due to inflection, tonality or accent.
The emphasis on words plays an important role in inducing different emotions
in the listener. For example, I might say the words ‘I’m really sad’ in a helpless
sounding way, or in a sarcastic sounding way. The words are the same and indi-
cate an emotion, but the sound of the words will determine the emotion that the
listener will perceive.
So too in a movie, where the words of dialogue that the characters use may
on their own have clear meaning for the plot and storyline; yet when balanced
amongst other mix elements in the soundtrack, what results is a listening experi-
ence that is emotionally richer for the other sound elements that have been placed
carefully around the speech.
Additionally, the visual elements of a film (the acting, editing, lighting, grad-
ing, composition, etc.) can powerfully portray a particular emotional direction
(similarly to how the meaning of words do in speech). But the soundtrack, and
the balancing of its elements by the Re-recording Mixer, can shift the emotional
direction of the overall experience.
This is similar to how the changes in the prosodic patterns that naturally exist
in speech produce emotional shifts: e.g. the tendency to speak unwittingly loud
when gleeful, or in a higher than usual pitch when greeting a sexually attractive
person (Bachorowski, 1999); and this is described in other research studies of
listeners inferring emotion from vocal cues (see Frick, 1985; Graham, San Juan &
Khu, 2016; van Bezooijen, 1984 to name but a few).
In an audio-visual piece of work with emotional meanings already suggested
through the visuals, or through words and other selected sounds, variations in
emotional meaning can also be produced by manipulating the mix balance of the soundtrack, similar to how natural variations in pitch, loudness, tempo and rhythm produce them in speech.

2.3 Hearing the soundtrack


In Listening, the opening chapter of social theorist and writer Jacques Attali’s
work Noise: The Political Economy of Music (1985), the author attaches a much
greater importance to the act of listening than that often attributed to the purely
cinematic act of audition, or the emotional effect a soundtrack may evoke:

For twenty-five centuries, Western knowledge has tried to look upon the
world. It has failed to understand that the world is not for the beholding. It is
for hearing. It is not legible, but audible. (Attali, 1985, p. 3)

This implies that sound itself carries a quality, or set of qualities, that can not only inform a cinema audience, but also impart meaning to what they are seeing;
which in turn relates to the assertions of Holland (2009) and accords with my
notion that (especially) within narrative filmmaking, a significant responsibility is
capable of being borne by the soundtrack to fully engage and emote an audience.
In his essay Art in Noise, Mark Ward suggests that:

it is unlikely one may have a meaningful narrative experience without it also
being an emotional one. (Ward, 2015, p. 158)

Ward also argues against the primacy of speech and music in the traditional
process of soundtrack dissection, instead elevating what might be termed as
environmental sound, or sound effects, to a status at least equal to dialogue and
score (Ward, 2015). This also implies that these fuller soundtracks require careful
balancing by the Re-recording Mixer:

Sound design […] is considered to be a process by which many sound frag-
ments are created, selected, organised, and blended into a unified, coherent,
and immersive auditory image. (Ward, 2015, p. 161)

Ward then goes on to make three key assumptions:

i) Cinema is not a visual medium, but multimodal: what is cinematic about
cinema is moving imagery, not moving pictures. (Ward, 2015, p. 158)
ii) Sound can modify visual perception: sound design through careful crafting,
may steer and deflect the eye’s passage across a screen, or draw the eye to
some objects but disregard others. (Ward, 2015, p. 159)
iii) […] contemporary sound design [is] a playful recombination of auditory and
visual fragments, and a heightened manipulation of auditory spatialisation,
temporal resolution, and timbre. (Ward, 2015, p.161)

In arguing that the cinema experience is an emotional one, Ward sub-categorizes the construction of a soundtrack into three distinct areas; and his citing of auditory spatialization and temporal resolution directly accords with two of this study’s Four Sound Areas, i.e. Spatial and Temporal (which will be more thoroughly
described in Chapter 4).
Michel Chion also utilizes a tripartite classification when he describes the way
in which soundtrack elements are heard by an audience; and he refers to these
three states as causal, semantic and reduced listening.
Causal listening, the most common form of listening mode

consists of listening to a sound in order to gather information about its cause
(or source).
(Chion, 1994, p. 25)

Causal listening can condition, or even prepare, the listener by the very nature of
the sounds heard – for instance, the sound effect of a dog barking readily recalls
the image of a dog in the listener.
Chion goes on to describe how a film soundtrack might manipulate causal lis-
tening through its relationship to the pictures; a term he calls Synchresis; whereby
we are not necessarily listening to the initial causes of the sounds in question, but
rather causes that the film has led us to believe in:

[In] causal listening we do not recognize an individual, or a unique and par-
ticular item, but rather a category of human, mechanical, or animal cause: an
adult man’s voice, a motorbike engine, the song of a meadowlark. Moreover,
in still more ambiguous cases far more numerous than one might think, what
we recognize is only the general nature of the sound’s cause. (Chion, 1994,
p. 27)

Chion describes semantic listening as

that which refers to a code or a language to interpret a message. (Chion, 1994,
p. 28)

For Chion, causal and semantic listening can occur simultaneously within a sound
sequence:

We hear at once what someone says and how they say it. In a sense, causal
listening to a voice is to listening to it semantically, as perception of the hand-
writing of a written text is to reading it. (Chion, 1994, p. 28)

Chion thirdly suggests that reduced listening refers to the listening mode that
focuses on the traits of the very sound itself, independent of its cause and of its
meaning:

Reduced listening has the enormous advantage of opening up our ears and
sharpening our power of listening […] The emotional, physical and aesthetic
value of a sound is linked not only to the causal explanation we attribute to it
but also to its own qualities of timbre and texture, to its own personal vibra-
tion. (Chion, 1994, p. 31)

Finally, Chion asserts that natural sounds or noises have become the forgotten
or repressed elements within the soundtrack – in practice and in analysis; whilst
music has historically been well studied and the spoken voice more recently has
found favour for research:

noises, those humble footsoldiers, have remained the outcasts of theory, hav-
ing been assigned a purely utilitarian and figurative value and consequently
neglected.
(Chion, 1994, pp. 144–145)

Another view of separating an audience’s listening processes is proposed by
Sound Designer and Re-recording Mixer Walter Murch (American Graffiti, 1973;
The Conversation, 1974; Apocalypse Now, 1979).4 He describes a way in which
he views the elements of a soundtrack ‘positioned’ in a virtual spectrum for audi-
tioning; and he suggests that this positioning is instrumental in how the soundtrack
is processed in the brain of the listening-viewer.
In his essay ‘Dense Clarity, Clear Density’, Murch likens the sound design
palette to the spectrum of visible colours: from the colour red at one end of the
scale, to the colour violet at the other.
Conceptually superimposing sound on to this visual image, he places what he
describes as ‘Embodied sound’ (the clearest example of which is music) at the
Red extreme and what he describes as ‘Encoded sound’ (the clearest example of
which is speech) at the Violet extreme.
With these two extremities of speech and music bracketing the available range,
all usable sound must therefore fall between them: with almost all sound effects
somewhere in the middle – half-way between language and music. Murch con-
siders these sound effects, whilst usually referring to something specific within a
soundtrack, not to be as abstract as music, but nonetheless, not to be as universally
and immediately understood as spoken language.
Murch goes on to suggest that separate areas of the brain process the different
types of audio information, with encoded sound (language) dealt with by the left
half of the brain, and embodied sound (music) dealt with by the right hemisphere.
He then proposes that by evenly spreading the elements of his mix between the
two pillars of the audio-scale, a clearer (even though busier) soundtrack, with
a higher mix-element count, can be achieved than a soundtrack in which mul-
tiple mix-elements are concentrated in one particular area of the audio sound
spectrum.
This left–right duality of the brain, in Murch’s opinion, therefore, enables
twice as many ‘layers’ – five – to be achieved in a soundtrack when the type of
sound used is spread, for example:

Layer 1: dialogue
Layer 2: music
Layer 3: footsteps (Murch’s ‘linguistic effects’)
Layer 4: musical effects (Murch’s ‘atmospheric tonalities’)
Layer 5: sound effects.

Figure 2.1 Walter Murch’s ‘Encoded – Embodied’ sound spectrum


If, however, you desire two-and-a-half layers of dialogue to be heard simulta-
neously, elements elsewhere must be sacrificed to retain clarity in this density
of dialogue. Murch refers to this phenomenon as his ‘Law of two-and-a-half’,
a ‘rule-of-thumb’ grounded in his long experience as a Sound Designer, a
Re-recording Mixer and sound editor, as well as a picture editor (Murch, 2005).
Ward, Chion and Murch’s theories are particularly significant for the central
topic of this book as they address issues directly related to soundtrack production
and listening-viewers.

2.4 The impact of linking what we hear, to what we see


In her paper ‘Making Gamers Cry’, Karen Collins suggests that:

Our emotional and neurophysiological state can be directly affected by what
we see: for instance, if we see pain or fear in someone else, we understand
this in terms of our own psychophysiological experience of similar pain
or fear. For example, neurons that normally fire when a patient is pricked
with a needle will also fire when the patient watches another patient being
pricked. (Collins, 2011, p. 2)

This highlights the fact that seeing something on-screen can evoke an emotional
reaction in the observer’s brain through the activity of so-called ‘mirror neurons’,
which are thought to be the main route to human empathy.
Neuroscientist Vilayanur Ramachandran believes that these mirror neurons
actually dissolve the barrier between self and others, light-heartedly referring to
them as ‘Gandhi Neurons’ (Ramachandran, 2009).
But what would seem to be highly significant to this investigation into emo-
tions evoked by sound, is what Keysers et al. (2003) described from the research
they conducted into monkey mirror neurons, in which they found that the same
neurons fired whether an action is performed, seen or simply heard:

By definition, ‘mirror neurons’ discharge both when a monkey makes a spe-
cific action and when it observes another individual making a similar action
(Gallese et al. 1996; Rizzolatti et al. 1996). Effective actions for mirror neu-
rons are those in which a [monkey’s] hand or mouth interacts with an object.
(Keysers et al., 2003, p. 628)

In plain terms:

These audio-visual mirror neurons respond as if we are experiencing the cause
behind the event, when only the sound of the action is presented. In other
words, when the monkey hears the sound, the brain responds as if it is also
seeing and experiencing the action creating the sound. (Collins, 2011, p. 2)
These results would seem to add credence to the notion that sound alone is a
powerful emotional tool that can be put to good use in moving picture production.
This clinically observed reaction to the effect of ‘hearing-without-seeing’ (which
in cinematic rather than laboratory terms could include the practice of ‘sound-
leading-picture’) underpins an established sound design technique frequently used
to purposely develop the tension of an unsettling event or situation, through the
presence of (often) abstract sound effects, whose origination remains for the most
part unseen.
However, as the story develops, the Sound Designer in the tracklay, and then
the Re-recording Mixer in the mix itself, may consider that what originally were
Abstract area sounds, later on contribute to the Narrative sound area (a more thor-
ough definition of the sound areas is presented in Chapter 4).
Dykhoff notes:

The spectators’ imagination is by far the best filmmaker if it’s given a fair
chance to work. The more precise a scene is, the more unlikely it is to affect
the audience emotionally. By being explicit the filmmaker reduces the pos-
sibilities for interpretation. […] With a minimal amount of visual information
and sounds suggesting something, you can get the audiences’ imaginations
running. (Dykhoff, 2003)

There are many examples of this style of feature film sound design, but a notable
example is the sounds associated with the dinosaurs featured in Jurassic Park
(1993) (Sound Designer and Re-recording Mixer – Gary Rydstrom), which are
seen on-screen for only 15 of the movie’s total 127 minutes – a little over 10% of
the film’s total running time; whilst their mysterious ‘off-screen’ sound is heard by
the audience long before they eventually make an appearance (Van Luling, 2014).
Regarding audience emotions being evoked by the soundtrack, Dykhoff goes
on to make a highly relevant point:

It’s interesting to speculate about how much information the trigger must
contain and how much it actually triggers. (Dykhoff, 2003)

An exploration of the existing literature on emotions and film would seem to sug-
gest that the understanding of the relationship between the overall organization of
a soundtrack and the emphasis within the mix – and the resulting emotions evoked
in an audience – is still very much in its infancy; even if work on the correlation
between emotion categories and types of sounds, or emotions and the acoustic
parameters of sounds in music and speech, has begun to be examined more closely:

Without doubt, there is emotional information in almost any kind of sound
received by humans every day: be it the affective state of a person transmit-
ted by means of speech; the emotion intended by a composer while writing a
musical piece, or conveyed by a musician while performing it; or the affec-
tive state connected to an acoustic event occurring in the environment, in
the soundtrack of a movie, or in a radio play. […] emotional expressivity in
sound is one of the most important methods of human communication. Not
only human speech, but also music and ambient sound events carry emotional
information.
(Weninger et al., 2013)

Whilst sounds such as speech, music, effects and atmospheres constitute the tradi-
tional groupings of sounds within a moving picture soundtrack – especially dur-
ing its editing and mixing stages – the Four Sound Areas of this research are not
intended to be considered as alternative labels for the long-established audio post-
production working categories of ‘dialogue’, ‘music’ and ‘effects’ stems.
Rather, they sit alongside instead of replacing those headings; and in any case
they do not directly correspond to those categories, by virtue of their being used in
a rather different context: the traditional labels of dialogue, music and effects are
used primarily in the sub-master ‘stems’ delivery process before (and after) the
final mixing of the soundtrack has been undertaken by the Re-recording Mixer.
As will be seen in subsequent chapters, the Four Sound Areas framework is
instead an alternative kind of structure: one that can guide Sound Designers on
how best to group emotionally complementary sounds together at the track-laying
stage of a moving picture project (i.e. a ‘bottom-up’ approach); and then help
Re-recording Mixers to understand which elements of a mix require emphasis, to
increase their ability to enhance, steer or evoke an audience towards a particular
area of emotion (i.e. a ‘top-down’ approach).

2.5 Practical exercise – deconstructing a scene from Minority Report (2002)
(DVD) using the Four Sound Areas
Director Steven Spielberg’s 2002 film is set in the year 2054 and is based on
a 1956 short story by the Science Fiction writer Philip K. Dick. The plot for
Minority Report centres around the experimental ‘PreCrime Department’, located
in Washington, D.C., and the Department’s ability to prevent murder through
policing advanced warnings of murderous intent in the city’s citizens. This infor-
mation is provided by three highly-developed siblings known as the ‘PreCogs’,
who are kept in a state of suspended animation, floating in a tank of liquid that
provides both nutrients and conductivity for the images from their brains to be
projected and recorded.
The plot unfolds when the PreCogs visualize the head of the Department, Chief
John Anderton, committing a murder. Soon on the run from his own colleagues in
PreCrime and seeking to prove his innocence, Anderton discovers the existence
of so-called Minority Reports; situations where the PreCog ‘pre-visions’ are in
fact fallible, shown by a difference in their collective presentation of images and
characters in the future criminal event. Kept secret to ensure that the experimen-
tal PreCrime Department gains nationwide acceptance, Anderton must reveal the
truth of this fallibility in the PreCogs to prove his innocence; and also, to prevent
any future miscarriages of justice.
The film is sound designed and mixed by Gary Rydstrom (ably assisted by
Andy Nelson as his Re-recording partner) and opens with a busy layering of
sounds to complement the fast-paced picture editing.
This first scene has examples of Narrative sounds in the dialogue and com-
munication noises between the PreCrime Police officers and the judicial ‘Remote
Witnesses’, as the replayed PreCog visions are examined; and examples of sounds
in both the Narrative and Abstract sound areas provide the sound effects of operat-
ing the futuristic projector. The associated sounds are of scrubbing backwards and
forwards through the vision time-line, and the distinctive room tones, spot effects
and atmospheres between the portrayed locations; and there is an example of the
Abstract and Temporal sound areas being used together in the Kubrick-esque use
of a classical music score to accompany Chief Anderton operating the projector.
But the type and placement of this music may not just be a nod to the futuristic,
‘space-age’ feel created by Kubrick in his landmark 1968 film 2001: A Space
Odyssey – it also accompanies the images of Chief Anderton as the Conductor of
an orchestra, as he manipulates the PreCog images and sound through the move-
ment of his hands and arms (e.g. at 00:04:45, 00:05:55 and 00:06:42).
These sound layers have three distinct areas of origination: they emanate from
within the projected image (which would seem to accord with sounds placed in
the Narrative sound area); from the dialogue and sound effects as the Police offic-
ers operate the viewing equipment (sounds from the Narrative and Abstract sound
areas); and from the musical score that punctuates the viewing of the PreCog
visions in the PreCrime gallery and ‘actuality footage’ of the future crime that only
we, the audience, are able to see (a wealth of sounds that populate the Narrative,
Abstract, Temporal and Spatial areas of this study).
This opening sequence serves as a perfect summary of what Rydstrom and
Nelson deliver throughout the rest of the film: a carefully balanced, central dia-
logue (for the majority of the time, serving as a part of the Narrative sound area)
that is unchallenged in its intelligibility by any other sound element; thought-
fully understated, futuristic-yet-familiar, spot effects for technological equipment
and processes, panned appropriately across the front and rear sound fields; room
atmospheres from interior air conditioning and an external suburban atmosphere-
track made up of distant traffic, playing children and birds filling the surround
channels (all of which contribute to the soundtrack in both the Narrative and
Abstract sound areas), along with the noticeable reverberation added to the dia-
logue, from lines played out in the PreCogs tank area; a room nicknamed by the
PreCrime Police as “The Temple”, and characterized by its capacious dimensions
and hard-reflecting surfaces (the reverberation on the dialogue contributing to the
Spatial sound area).
The opening of Rydstrom’s soundtrack is playful with the audience: in the
opening scenes, he switches the emotional emphasis back and forth between con-
ceivable serenity and veiled anxiety; from the subtle positivity of the sounds of
children playing and birds singing in the outside world, to the seemingly cold
and antiseptic world of the PreCogs tank room and its monotonous, brooding
atmosphere.
It is in this room that the first, but certainly not the last, example of an induced
startle-reflex is demonstrated (and where Rydstrom expertly evokes an emotion
in the audience that is rooted in fear): the dynamic range of the music (primarily
flexing within the Temporal sound area) having been given full rein to accelerate
to full-scale, full energy, diminishes progressively down to almost silence, save
for a quiet sustained note from the score’s string section – along with occasional,
delicate drips of water and the distant hum of plant gear (these latter sounds sitting
inconspicuously in the Abstract sound area). But by this, Rydstrom is ‘setting-
up’ an unsuspecting audience for an explosion of exhaled breath and speech from
Agnes, as she suddenly emerges from the water of the Precogs tank (these sounds
sitting within the Narrative and Abstract sound areas). In an instant, the maximum
dynamic range of the soundtrack is engaged to trigger the sudden, ‘heart-stop-
ping’ moment in the audience. (DVD, commencing at 00:24:00, with the ‘audio
shock’ at 00:27:25.)
This effect requires careful preparation of the audience; and by utilizing a
descent to near-silence just before the metaphorical coup de grâce is delivered,
Rydstrom has effectively re-aligned the listener’s hearing-threshold to a point
well below the median soundtrack level. When the sudden, climactic burst of
sound is delivered, it is with the maximum dynamic range available to the replay
system, but to ears already responding to much lower sound pressure levels.
Given sufficient time with unfamiliar, low-level sounds, and with hearing
so highly sensitized, the listening-viewer attempts to make sense of the sounds
that they are discerning, but not necessarily recognizing (the sounds are of the
Abstract sound area); and a heightened awareness is induced: in essence, the audi-
ence is alert to danger, and their associated chemical and neural responses are
automatically engaged. In short, the audience has been primed to be emotionally
induced into fear.
The manipulation of the Narrative, Abstract and Temporal Sound areas to
achieve the classic cinematic ‘audio-shock’, is characterized by the way in which
skilled practitioners (e.g. Sound Editors and Re-recording Mixers) use a gentle
‘rise-time’ to progressively increase, develop and hold their audience in a state of
heightened awareness of some impending danger, just prior to the scene’s denoue-
ment; usually through almost indistinct Narrative Sound, ambiguous Abstract
Sound and high Temporal Sound elements (a product of the nature of the sounds,
and most importantly, the relative balance between the Narrative, Abstract and
Temporal sound areas). This condition is held just long enough for unfamiliar
sounds to become recognizable, familiar audio ‘bearings’ to be re-established,
and those markers that suggest to the listening-viewer that there is no imminent
danger, to be reinstated. During which time, the unconscious state-of-readiness
within the audience – the induced ‘fight or flight’ instinct – subsides, returning
to a near normal level; only for a fast ramp of the soundtrack from (usually) the
Narrative sound area to unexpectedly communicate the sudden reappearance of
mortal danger.
Minority Report is an excellent example of such skilful sound-blending; with
the classifications used for the Four Sound Areas framework readily identifiable
within Rydstrom’s soundtrack. He achieves several points of predetermined emo-
tional impact, aided and abetted by his clean, precise and uncluttered sound design
and the thoughtful, well-balanced and smooth mixing of the movie’s audio over-
all. Such clarity is technically impressive, given the busy nature of the soundtrack
at key sections of the film.

2.5.1 Questions
• What emotions were evoked in you by the sound design of the opening
  sequences of Minority Report?
• What influence do you think that had on the rest of the movie?

2.5.2 Discussion points


• What is the fundamental difference between affect and emotion?
• How do sound effects differ from music in evoking emotion?

Notes
1 A traditional and frequently heard idiom amongst film industry technicians wanting
to highlight the importance of the soundtrack is ‘No one ever came out of a cinema
whistling a two-shot’.
2 Director Alan Crosland and Sound Engineer Nathan Levinson had completed a movie
for Warner Brothers a year earlier – Don Juan (1926) – that used the same Vitaphone
sound playback system as The Jazz Singer (1927). However, although the soundtrack
of Don Juan was synchronized to picture, it consisted solely of music with no speech
from the actors.
3 There is a famous Hollywood story that suggests the composer Arnold Schoenberg
once wrote a film score thinking that a feature film would subsequently be made to
match his music.
4 As well as being the Sound Designer and Re-recording Mixer, Murch also picture-
edited The Conversation (1974) and Apocalypse Now (1979). He won an Academy
Award for Best Sound Mixing on Apocalypse Now.
CHAPTER 4

Using Ambisonics and Advance Audio Practices

This chapter is excerpted from

Immersive Sound Production

Dennis Baxter
© 2022 Taylor & Francis Group. All rights reserved.

Learn More »

Spatialization with Ambisonics Production Methods


It is abundantly clear that across all media platforms 3D immersive sound is critical to the
authenticity of high-​definition pictures, 360 video, virtual reality and augmented reality. The
fact is that multichannel, multi-​format audio production is not going away and there should
be a commitment by all audio producers and practitioners to advance the quality of audio to
the consumer using every tool and practice available.
Ambisonic production is a flexible and powerful creation tool for the spatialization of audio
for a convincing immersive experience. Audio producers and practitioners have begun to
realize the benefits from the use of ambisonics because of its unique approach to capturing, pro-
cessing and reproducing the soundfield. Dolby Atmos and MPEG-H 3D support ambisonics;
however, the broadcast and broadband industries have been slow to adopt ambisonics as a
production platform and tool. Significantly, though, in the last few years ambisonics has been
adopted by YouTube and Facebook for 360 video, because ambisonics is the only platform that
truly and accurately tracks user interactivity with smooth and efficient soundfield rotations
from the camera's point of view.
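As a rough illustration of that rotation idea – a sketch of my own, not taken from this chapter – a first-order scene can be yawed for head-tracking by remixing only its horizontal components. It assumes the AmbiX channel ordering (W, Y, Z, X) and an azimuth that increases counter-clockwise; both conventions vary between tools.

    import numpy as np

    def rotate_foa_yaw(b_format, yaw_rad):
        # Rotate a first-order ambisonic scene (assumed AmbiX order W, Y, Z, X)
        # about the vertical axis. W and Z are untouched; only the horizontal
        # components X and Y are mixed through a 2x2 rotation matrix.
        w, y, z, x = b_format                      # shape (4, n_samples)
        c, s = np.cos(yaw_rad), np.sin(yaw_rad)
        x_rot = c * x - s * y
        y_rot = s * x + c * y
        return np.vstack([w, y_rot, z, x_rot])

    # For head-tracked playback the scene is typically counter-rotated by the
    # tracker's yaw before decoding or binaural rendering, for example:
    # scene = rotate_foa_yaw(scene, -head_yaw)
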
Ambisonic audio production has been around a while, albeit slightly heuristic in the early
days, and it was not until significant extensions of the theory that resulted in Higher
Order Ambisonics (HOA) that some of the early ambisonics models became valuable for
advanced audio production. Soundfield deconstruction and reconstruction is a powerful tool,
particularly because of the flexibility that ambisonics provides with the capability of rendering
to a vast range of production and reproduction options. HOA is a far more sophisticated pro-
duction tool than the proponents of early ambisonics ever envisioned and there are clearly
significant advantages to HOA and scene-​based delivery over current channel-​based and
object-​based multichannel audio workflows and practices.

What is Ambisonics?
Ambisonics is a completely different approach that captures and reproduces as much as
possible of the entire 360-degree immersive soundfield from every direction equally – sound
from the front, sides, above and below – to a single capture/focal point. Ambisonics attempts
to reproduce as much of the soundfield as possible regardless of speaker number or location,
because ambisonics is a speaker-independent representation of the soundfield and transmission
channels do not carry speaker setups.
not carry speaker setups. Since HOA is based on the entire soundfield in all dimensions, a sig-
nificant benefit with this audio spatial coding is the ability to create dimensional sound mixes
with spatial proximity with horizontal and vertical localization.


How Does It Work?


A propagating sound wave originating from one source does not move in a straight line but
expands in a series of sphere-​shaped waves equally in all directions. The behavior of sound
waves as they propagate through a medium and even how sound waves reflect off an object
was explained by the principle of wave fronts by the Dutch scientist Christiaan Huygens.
A wave front is a series of locations on a sound wave where all points are in the same pos-
ition on that sound wave. For example, all points on the crest of the same wave form a wave
front. Huygens further states that each point on an advancing wave may be considered to be
a new point source generating additional outward spreading spherical wavelets that form a
new coherent wave.1
Expanding functions of a sphere and soundwave expansion can be explained by spherical
wave fronts that may vary in amplitude and phase as a function of spherical angle and can be
efficiently modeled using spherical harmonics which are mathematical functions (models)
for mathematical analysis in geometry and physical sciences. Spherical arrays can be used for
soundfield analysis by decomposing the soundfield around a point in space using spherical
harmonics.
Decomposing a soundfield to spherical harmonics is a process of converting the soundfield
to Associated Legendre Polynomials, which map the angular response, and Spherical Bessel
Functions, which map the radial component; together these form the Spherical Harmonic
functions. HOA coefficients are the coefficients used to formulate the desired combination
of the spherical harmonic functions. Spherical harmonic functions are central to spherical
coordinates and to solving wave-propagation equations, and are integral to calculating
HOA coefficients.

Figure 4.1 Omni-directional soundwave expansion
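For reference, the decomposition described above is commonly written (standard textbook form, not quoted from Baxter) as the truncated expansion of the sound pressure field

    p(kr, \theta, \phi) \approx \sum_{n=0}^{N} \sum_{m=-n}^{n} A_n^m(k)\, j_n(kr)\, Y_n^m(\theta, \phi)

where j_n is the spherical Bessel function of order n, the Y_n^m are the spherical harmonics built from the associated Legendre polynomials, and the A_n^m are the HOA coefficients; truncating at order N keeps (N + 1)^2 terms.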
Soundwaves can create a complex soundfield that can be composed of hundreds of sound
sources creating their own sound waves, diffractions, scatterings and reflections. Ambisonics is
a method to capture as much of the 3D soundfield as desired up to the maximum HOA order,
and captures both direct and reflected sounds. Considering that sound comes at us from every
direction it was not a giant leap to consider capturing the entire soundfield at a point in space
from a single point receptor –​a microphone or listener.
Reproduction of soundfields begins with the synthesis of a soundfield by recombining the
amplitudes of the spherical harmonics so that the reproduced sound matches the measured
soundfield.
Ambisonics was presented by Dr. Michael Gerzon, who, based on psychoacoustical consider-
ations, developed a mathematical model for capturing and reproducing a simple
dimensional soundfield. First-generation ambisonics was, and is, a 3D format, although a low-
resolution format that never caught on outside of the UK till VR's adoption of the format.2
The spatial resolution for basic ambisonics is quite low but can be increased by adding
more direction components to achieve a more useable ambisonic format called HOA
(Higher Order Ambisonics). HOAs are based on a mathematical framework for modeling 3D
soundfields on a spherical surface where the HOA signals can be calculated based on the spa-
tial location of the sound sources. The HOA signals can be derived from spatial sampling and
spatial rendering the three-​dimensional space. HOA is used to reconstruct the dimensional
soundfield by decomposing the soundfield into spherical harmonics which contain spatial
information of the original soundfield. Significantly HOA signals preserve the spatial audio
information.
The soundfield modeling projects the soundfield onto a set of spherical harmonics, and
the number and shape of the spherical harmonics determine the resolution of the soundfield.
Higher orders project more spherical harmonics into the expansion. Spherical harmonics are
special functions defined on the surface of a sphere. Each additional harmonic coefficient adds
higher spatial resolution to the modeled or synthesized soundfield.3

Figure 4.2 Spherical harmonics – 3rd Order

What is Scene-​Based Audio?


Ambisonic audio production is nothing new, but it is clearly a paradigm shift in the contem-
plation and production of future-ready sound. One reason ambisonics lagged in acceptance
was that its true benefits were not realized until the original concept was expanded to
Higher Order Ambisonics, along with the development of powerful production tools.
Typically, audio production tools have been compressors, dynamics controllers and
equalization, but the spatialization of audio has continued to develop beyond reverberation
and room simulation.
The capability to construct high-resolution soundfields depends on the ability to capture,
construct and render soundfields with the highest possible resolution. Proponents of HOA
have titled advanced ambisonic production SBA – scene-based audio. Scene-based production
is the natural evolution of HOA, now with tools that increase the flexibility of HOA, but at
the end of the day it is still an HOA production with advanced scene-based tools. HOA is the
process and SBA is the set of tools that makes HOA more useful in production.
Scene-based audio has an enhanced set of audio features for manipulation of the audio
scene during playback. It provides the user the flexibility to alter the POV (point of view),
zoom or focus on a specific direction, mirror, attenuate or amplify an audio element, and
rotate the soundfield. Unique to ambisonic production is the ability to deliver any combination
of visual experiences – TV, VR and 360 video – on a single audio workflow.4

How Does Ambisonics Operate?


The concept is simple –​capture the entire soundfield then render and reproduce as much of
the soundfield as possible. Ambisonics looks at the soundfield as a grid of equally spaced sound
zones that need to be captured to a single point.
This macro-level approach can be derived from a combination of mono, stereo and multi-
capsule array microphones, similar to the way broadcasters capture sound today. The problem
with all microphone capture is that the further the microphone is from the source the more
diluted the signal. The inverse square law states that the intensity of the sound will decrease
in an inverse relationship as the soundwave propagates further from the sound source. For
every factor of two in distance, the intensity of the soundwave is decreased by a factor of four.
Additionally, microphones not only capture the sound you want but also a lot of what you
do not want – background noise. It is difficult to isolate objects and estimate their individual
metadata.

Figure 4.3 Higher order ambisonics (HOA) listens to the entire soundfield at a macro level
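The inverse square law quoted in the paragraph above can be stated compactly (standard acoustics, added here for reference):

    I(r) \propto \frac{1}{r^2}, \qquad L(r_2) - L(r_1) = 10 \log_{10}\frac{I(r_2)}{I(r_1)} = -20 \log_{10}\frac{r_2}{r_1}\ \text{dB}

so doubling the distance (r_2 = 2 r_1) cuts the intensity by a factor of four, a drop of about 6 dB.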
Microphone position is never optimal in sports. Building on the concept of entire
soundfield capture and that ambisonics treats sound from all directions equally, then micro-
phone placement at baseball would have to be over the pitcher to capture the infield equally
and deliver an immersive crowd. Since this is an impossible location to achieve, multiple
microphone positions are located to capture a balanced, holistic representation of the com-
plete soundfield. Additionally, closely correlated microphone arrays lack much capture detail
beyond a relatively small sound capture zone, ultimately requiring additional "spot" micro-
phones for detail.
A significant practical aspect of HOA is that you do not need to fully capture in HOA,
but can optimize individual microphones into HOA. Individual microphones can be placed
symmetrically or separated at arbitrary locations and use the capture information from all
of these microphones to derive the HOA coefficients. The HOA encoder generates the 3D
coefficients. However, when you capture an ambisonics foundation it will deliver desirable
and predictable results for you to build your detailed and specific sound element on top of
this foundation.
Multiple HOA capture points have been suggested, and as costs come down and
the performance of multi-capsule ambisonic microphones improves, multiple ambi-
sonic microphones may become realistic. Significant is the fact that previously produced and
up-​produced music and legacy content that is processed into HOA material can be added
to an HOA mix.
Encoding HOA creates a set of signals that are dependent on the direction and position of
the sound source and not the speaker for reproduction. With the audio rendered at the play-
back device, the playback renderer matches the HOA soundfield to the number of speakers
(or headphones) and their location in such a way that the soundfield created in playback
resembles closely that of the original sound pressure field.
A typical scene could contain hundreds of objects which with their metadata must be
recreated on the consumer device. Not all consumer devices are created equal and may not
have the ability to render complex scenes. Significantly HOA’s rendering is independent of
scene complexity because spatial characteristics are already mixed into the scene.

Figure 4.4 Higher order ambisonics (HOA) multiple capture zones



Production Note
HOA allows the sound designer/​producer to create or develop the direction of the
sound and not be tied to where the speaker or reproduction device may be, which is
contradictory to the way a lot of sound is produced.

The fact is that dimensional sound formats will not go away. People want options for con-
sumption and the challenge is how to produce and manage a wide range of audio experiences
and formats as fast and cheaply as possible. Cheaply also means minimizing the amount of data
being transferred. If the rendering is done in the consumer device, it inherently means that
there is a need to deliver more channels/​data to the device. However, data compression has
significantly advanced to the point that twice as much audio can be delivered over the same
bit stream as previous codecs.
Now consider the upside of the production options using HOA. You have the ability
to reproduce essentially all spatial formats over 7 of the 16 channels (a metadata channel is
needed), then you have another eight individual channels for multiple languages, custom
audio channels, and other audio elements or objects that are unique to a particular mix.
Additionally, producing the foundation soundfield separately from the voice and
personalized elements facilitates maximum dynamic range along with loudness compliance
while delivering consistent sound over the greatest number of playout options.
The sonic advantages of ambisonics reside with the capture and/​or creation of HOA.
Ambisonics works on a principle of sampling and reproducing the entire soundfield.
Intuitively, as you increase the ambisonic order the results will be higher spatial resolution
and greater detail in the capture and reproduction of the soundfield. However, nothing comes
without a cost. Greater resolution requires more soundfield coefficients to map more of the
soundfield with greater detail. Some quick and easy math: fourth order ambisonics requires
25 coefficients, fifth order requires 36, and sixth order requires 49 and so on. The problem
has been that HOA production requires a very high channel count to be effective, which did
not fit in the current ecosystem, but coding from Qualcomm and THX has reduced the
bandwidth for a HOA signal to fit in 8 channels of the 15 or 16 channel architecture leaving
channels for objects and interactive channels.
Dr. Deep Sen has been researching the benefits of HOA for decades and headed a team that
developed a “mezzanine coding” that reduces the channels up to the 29th order HOA (900
Channels) to 6 channels +​control track. Now consider a sound designer’s production options.
HOA provides the foundation for stereo, 5.1, 7.1, 7.1+​4, 10.2, 11.1 up to 22.2 and higher
using only 7 channels in the data stream. I suspect that there are points of diminishing returns.
Scalability –​the first four channels in a fifth order and a first order are exactly the same.4
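The coefficient counts quoted above follow directly from the order: a full 3D scene of order N needs (N + 1)^2 signals. A quick sketch (illustrative only; the per-format channel budgets mentioned in the text are a separate, codec-level matter):

    def hoa_channel_count(order: int) -> int:
        # Number of spherical-harmonic signals for a full 3D ambisonic
        # scene of the given order: (N + 1) squared.
        return (order + 1) ** 2

    for n in (1, 4, 5, 6, 29):
        print(f"order {n:2d}: {hoa_channel_count(n):3d} signals")
    # order 1: 4, order 4: 25, order 5: 36, order 6: 49, order 29: 900
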

Capture or Create?
HOA is a post-​production as well as a live format; however, live production is dependent on
complexity and latency.

High Order Ambisonics: HOA in Post-​Production


3D panning tools, effects and ambisonics mastering tools reside in the editing and hosting
platform of Ableton, while Nuendo and Pro Tools use third-party plug-ins for immersive
sound production.


Ableton includes 3D positioning tools, azimuth and elevation, 3D effects and masters in real
time to HOA. Each channel can have a set of tools. Ableton includes a couple of interesting
programs including a spinning program that automates motion in 3D space with vertical
rotations, convolution reverbs and three-​dimensional delays from a single channel. Ableton’s
output is scalable from headphones to immersive speaker arrays.
Nuendo is a popular DAW that supports dearVR immersive, allowing the sound designer
to create immersive and 3D content. For an action sound designer the Doppler Effect plug-​in
does a nice job simulating the perception of movement and distance by pitch changes as the
source passes you.
Pro Tools HD is a widely used DAW in post-production; however, it derives much of its
functionality from third-party plug-ins. A set of scene-based HOA Tools was developed under
the guidance of Dr. Deep Sen and resulted in significant advancements for further production
and development for HOA.
Because HOA deals with spherical expansion, tools like rotation, sphere size and interesting
room and reverb simulation programs have been developed. Distance is interesting because
you are not just changing the volume when you move a sound element closer or farther away,
but as in the real world the change in distance can change the tone of a sound as well.
The ability to adjust the size of an object has fascinating production possibilities. Size
expands the perception of magnitude of a sound element by diverging the sound element
into adjacent channels.
Size is a processing feature that can be useful in speech intelligibility or as an effect for dra-
matic enhancement. The soundfield can also be widened or squeezed to match the TV size.
Building acoustic environments is common with object-​based audio production but spatial
enhancements have proven effective in immersive sound production for both speaker and
ambisonic methods of production as well. Room simulators are capable of creating acoustic
space and use complex reflection algorithms to recreate the variety of a dimensional space. The
ability to contour parameters such as reflections and diffusion empowers the sound designer in
recreating and creating the realistic sonic space.
Facebook offers a free Audio 360 Spatialiser plugin that replaces the conventional DAW
panner giving each channel full 3D positioning, distance and room modeling. The channel
input options are mono, stereo, 4.0, 5.0, 6.0, 7.0 and B Format 1st, 2nd and 3rd Order ambi-
sonics, as well as controls for azimuth, elevation, distance, spread, attenuation, doppler, room
modeling and directionality. Ambisonic controls are source roll, pitch and yaw, plus the ability
to control the diffuse soundfield.

Figure 4.5 Channel control strip from a HOA tools plug-in with controls for azimuth, elevation,
distance and size


The focus effect is not baked into the mix and can be controlled from your app in real time
or encoded into a FB360 video as metadata. Focus control gives the sound designer the ability
for a range of controls from full headtracking to the ability to define a mix area and have
sounds outside that zone attenuate. Focus control includes focus size of the area and off-​focus
level, the attenuated level outside of the focus area. Focus azimuth/​elevation are values that
are relative to the listener’s point of view when the headtracking is disabled. In the loudness
plugin the mixer can set the overall loudness of your mix and the maximum loudness and true
peak in your mix as if the listener were looking at the loudest direction.
Facebook describes Audio 360 as an immersive sphere of audio and is tied to headphones.
360 video can be viewed on screens and goggles. Facebook 360 provides a suite of soft-
ware tools for publishing spatial audio and video that can be exported to a range of formats
including ambisonics. The format sports 4K resolution, image stabilization, VR and can stream
live, making possible unique entertainment experiences.
A feature I found notable is called "Points of Interest", a production tool used to
guide your viewer through your video. Ambisonics is the only format that locks the picture to
the sound for rotation and more.
Many of the ambisonic tools are 1st Order and are often mastering tools, like the Waves
B360 Ambisonic Encoder, which has panner-like controls and then outputs to four
channels of B Format with gain and phase information equivalent to its direction in the
soundfield. Additionally, YouTube video supports 1st Order ambisonics with head-​locked
stereo.
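As a sketch of the kind of first-order encoding such panner tools perform (the traditional FuMa B-format convention is assumed here; channel ordering and normalization differ between B-format flavours, so treat this as illustrative rather than a description of any specific plug-in):

    import numpy as np

    def encode_b_format_fuma(mono, azimuth_rad, elevation_rad):
        # Pan a mono signal to first-order B-format (FuMa W, X, Y, Z).
        # Azimuth is counter-clockwise from front, elevation is upward.
        w = mono * (1.0 / np.sqrt(2.0))                          # omni component
        x = mono * np.cos(azimuth_rad) * np.cos(elevation_rad)   # front/back
        y = mono * np.sin(azimuth_rad) * np.cos(elevation_rad)   # left/right
        z = mono * np.sin(elevation_rad)                         # up/down
        return np.vstack([w, x, y, z])
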

High Order Ambisonics (HOA) – Live


The first broadcast of MPEG-​H using HOA was done by 19 different manufacturers at the
2018 European Athletics Championships in Berlin. The test demonstrated an end-to-end pro-
duction workflow – capture, process, record and distribute live UHD content featuring high
frame rates and dynamic range with Next Generation Audio. The tests used a familiar work-
flow, combining mono, stereo and multi-capsule array microphones similar to
the way broadcasters capture sound today. The SSL mixing console supports height channels
and the panning of the microphones was accomplished on the mixer. The mix output was
encoded and streamed.
Complex HOA production will probably need processing, which results in latency. As
of publication, a significant problem for live capture using multichannel array microphones
above 3rd Order is that the microphones are computer-​controlled arrays with advance pro-
cessing that may have too much latency for exact lip-​sync. It appears that 1st and 2nd Order
ambisonics can be used with whatever amount of latency comes with the format conver-
sion or decoding, but 3rd Order and greater ambisonics appear to require more processing,
resulting in more latency and more problems.
A sporting event where the crowd was reproduced with a few frames of delay would
probably be perfectly acceptable. Up-​producing music using HOA would certainly result in
some latency during the up-​conversion, however would have no detriment to the produc-
tion. A soundfield foundation with a static capture and reproduction will result in minimum
latency.
HOA technology fulfills the need to produce immersive content, distribute it in an efficient
bit stream and have it played out on a wide variety of speaker configurations and formats –​
creatively, efficiently and in a consumer-​friendly way. By simply rendering the underlying
soundfield representation, HOA ensures a consistent and accurate playback of the soundfield
across virtually all speaker configurations. This process seems to work well across many play-
back formats and could possibly eliminate the need to downmix or upmix to achieve anything
from stereo to 22.2 or more. This concept could be a significant solution to a problem that has
burdened sound mixers who have to produce both in stereo and surround.

Spatialization: Advance Audio Production Techniques and Practices (AAPTP)
Advanced audio production techniques are beyond the room reverbs and echo-​type devices
of earlier times. Mixing consoles provide the basics for dynamics management, tone control
and fundamental panning, but advance spatial enhancement is done with applications and
processes using plug-​ins and out of mixing console processors both in live and post-​produc-
tion workflows. Plug-​ins are a specific application that can be added in the production signal
chain and are usually hosted and resident in the mixing console. Before plug-​ins there has
been a history of using standalone “blackboxes” for signal processing that were patched into
the signal flow to process the audio.
With the migration from analog mixing consoles to digital mixing desks came the possi-
bility of advance signal processing inside the digital desk. All digital mixing consoles contain
equalization, time shifting and dynamics processing designed and built in by the manufacturer,
but until recently there has been a reluctance for console manufacturers to unlock the pro-
prietary audio engine to third-​party application developers. For the manufacturers there was
a higher comfort level with a side chain, blackbox-​type device as opposed to an in-​line appli-
cation crashing and shutting down the mixer.
All audio console manufacturers discussed in this book have integrated third-​ party
applications into their mixing platforms and this will continue to advance. However, you
should always exercise caution when adding any new application to a computer platform and
remember all computers crash at some time. Additionally, always listen for latency and digital
artifacts that will affect the clarity and quality of your sound.
Advance audio production should be looked at as an umbrella of tools that not only can
adjust the spatial properties of sound elements, but also change the tone and sonic characteristics
of spaces. Basic spatialization can be as simple as time and timbre difference between a direct
sound and a delayed or diffused element of the original sound. This is what is known as basic
reverb or echo and occurs naturally from reflections off surfaces in the path of the original
sound waves. This basic tool can simulate a concert hall or the natural spatialization of sound
like what you hear when you are in an expansive European Cathedral. Our brain tells us this
is a cathedral and there is an expectation of what the sound of a cathedral is.
Virtual simulators are a growing theme of plug-​ins that can create and shape any sonic
characteristics of a sound element including size, magnitude and distance, as well as adjust
the spatial characteristics of the sonic enclosure where a sound object resides. Advance audio
production uses advanced modeling and virtual simulation done with plug-ins and hosting
computers and, depending on latency, can be done in real time and applied live. A composite
soundfield is often an amalgam of sound layers that have been spatialized to complete the
dimensional soundfield, which can be forgiving with regard to precise synchronization and localization.

Where Did All This Come From?


Hearing and the perception of sound is uniquely personal. Many factors affect our hearing
including the shape of our head and ears and the physical condition of our auditory system. These
factors impact the natural collection of sound by humans just as the electrical, mechanical and
physical characteristics of microphones affect the quality of sound collection.
Beyond the physical collection of sound is the processing and interpreting of sonic infor-
mation. Psychoacoustics is the science of how the human brain perceives, understands and
reacts to the sounds that we hear. Perception of sound is affected by the human anatomy, while
cognition is what is going on in the brain.

Limits of perception
The human auditory system can only process sound waves within a certain frequency range.
This does not mean these extended frequencies do not exist, just that humans do not process
them through the auditory system. Additionally, the auditory system does not process all fre-
quencies the same. Some frequencies are perceived as more intense even when they are at the same ampli-
tude. For example, low frequency sound waves require significantly more energy to be heard
than high frequencies. Our hearing is not linear and the equal loudness curves known profes-
sionally as the Fletcher-​Munson curves show the relationship between frequency, amplitude
and loudness. Finally, complex soundfields can suffer from frequency masking. Two sounds of
the same amplitude and overlapping frequencies are difficult to understand because the brain
needs a minimum difference in frequency to process the sound individually.
Sound localization is impacted by the size of the head and chest and the physical distance
between the ears. This is known as the head related transfer function (HRTF). The sound usu-
ally reaches the left ear and right ear at slightly different times and intensities, and along with
tone and timbre the brain uses these cues to identify the location a sound is coming from.
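A frequently quoted approximation for the interaural time difference behind these cues is Woodworth's spherical-head model (an approximation added for reference, not from Baxter's text):

    \mathrm{ITD}(\theta) \approx \frac{a}{c}\,(\theta + \sin\theta)

where a ≈ 0.0875 m is the head radius, c ≈ 343 m/s is the speed of sound and θ is the source azimuth in radians; a source directly to one side (θ = π/2) gives roughly 0.65 ms.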
Cognition is what happens in the mind where the brain infuses personal biases and
experiences. For example, when a baby laughs there is one reaction as opposed to when
they cry. Cognitive perception of sound has created an entire language of sound. Defining
and describing sound is often a difficult exercise because our vocabulary for sound includes
descriptive phrases that comprise both objective and subjective metaphors. Technical
characteristics such as distortion, dynamic range, frequency content and volume are measur-
able and have a fairly universal understanding, but when describing the aesthetic aspects and
sonic characteristics of sound, our descriptors tend to become littered with subjective phrase-
ology. Here is the simple yet complicated phrase which has always irritated me: “I don’t like
the way that sounds.” Exactly what does that mean? I worked with a producer who made that
comment halfway through a broadcast season during which I had not changed anything sub-
stantial in the sound mix. Being the diligent audio practitioner, I took his comment to heart
and really spent time listening to try to understand why he said what he said.
Broadcast sound and sound design is a subjective interpretation of what is being presented
visually. The balance of the commentary with the event and venue sound is interpreted by
the sound mixer. The sports director and producer are consumed with camera cuts, graphics
and replays, and focusing on the sonic qualities of a mix may be beyond their con-
centration. Factor in the distractions and high ambient noise levels in an OB van – remember,
technical communications are verbal and everybody in the OB van wears headsets – and now
you have to wonder who is really listening.
Meanwhile, after objectively listening and considering what the problem could be,
I inquired about the balance of mix, its tonal qualities, and my physical execution of the mix.
Once again the answer was, “I don’t like the sound.” My next move was to look really busy
and concerned and ultimately do nothing. That seemed to work.
When surround sound came along, a common description emerged to describe the sound
design goals: to enhance the viewer experience. At least now when there is talk about multi-
channel 3D sound, the conversation begins with the nebulous notion of immersive experi-
ence. I think this has to do with creating the illusion of reality … go ahead, close your eyes …
do you believe you are there?
So what do balance, bite, clarity, detail, fidelity, immersive experience, punch, presence,
rounded, reach, squashed or warmth have to do with sound? As audio practitioners we seem
to act like we know. After all, we make that mysterious twist of the knob and push of the fader
achieve audio nirvana, but audio descriptors are important to humanize the audio experience
and conquer the psychoacoustic and physiological aspects of sound.
The psychology of sound also has to do with the memory of sound, and reminders from
physical cues such as pitch, frequency, tempo and rhythm trigger a sensory and perhaps
emotional experience. I believe that if you have ever heard a beautiful voice or guitar then
that becomes the benchmark for reference. A lot of what a sound designer has to do is satisfy
the memory, but I argue perhaps it is time to create a new impression. Psychoacoustics could
be considered how the mind is tricked by sound while the physiological aspects of sound
reinforce the illusion. For example, low frequencies, a fast tempo or pace affect breathing and
cardiovascular patterns. When I mixed car racing I always tried to emphasize the low frequen-
cies of the cars to heighten the visceral experience.

Principles of Psychoacoustics
Understanding how we hear, along with how the brain perceives sounds, gives sound
designers and software engineers the ability to model sound-​shaping algorithms based on
psychoacoustic principles and thought. Common considerations when modeling sound are
frequency and time, so instead of using elevation to achieve height try using equalization,
which can be an effective means of creating an impression of height. We naturally hear high fre-
quencies as coming from above because high frequencies are more directional and reach our
ears with less reflection. This principle is known as the Blauert Effect.5
Significantly, a lot of the low frequency energy has already been lost. By equalizing cer-
tain frequencies, you can create the illusion of top and bottom; in other words, the greater
the contrast between the tone of the top and the bottom, the wider the image appears to be.
This principle works well for sports and entertainment because you can build a discernable
layer of high frequency sounds (such as atmosphere) slightly above the horizontal perspective
of the ear.
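A minimal sketch of that equalization trick (my example, with assumed shelf frequency and gain; it uses the widely published RBJ audio-EQ-cookbook high-shelf biquad):

    import numpy as np
    from scipy.signal import lfilter

    def height_shelf(x, fs, f0=6000.0, gain_db=4.0):
        # RBJ-cookbook high-shelf boost: brightening a layer (e.g. atmosphere)
        # so it reads as sitting slightly above the horizontal plane.
        A = 10.0 ** (gain_db / 40.0)
        w0 = 2 * np.pi * f0 / fs
        alpha = np.sin(w0) / 2 * np.sqrt(2.0)       # shelf slope S = 1
        cosw = np.cos(w0)
        b0 = A * ((A + 1) + (A - 1) * cosw + 2 * np.sqrt(A) * alpha)
        b1 = -2 * A * ((A - 1) + (A + 1) * cosw)
        b2 = A * ((A + 1) + (A - 1) * cosw - 2 * np.sqrt(A) * alpha)
        a0 = (A + 1) - (A - 1) * cosw + 2 * np.sqrt(A) * alpha
        a1 = 2 * ((A - 1) - (A + 1) * cosw)
        a2 = (A + 1) - (A - 1) * cosw - 2 * np.sqrt(A) * alpha
        return lfilter([b0 / a0, b1 / a0, b2 / a0], [1.0, a1 / a0, a2 / a0], x)
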

The Haas Effect/​The Precedence Effect


The Haas effect, also known as the precedence effect, is the founding principle of how we
localize sounds, and it is a key psychoacoustic principle that can be applied to create the illusion
of width and a realistic sense of depth and spaciousness. Helmut Haas explained why, when two
identical sounds occur within 30 milliseconds of each other, the brain perceives the sounds as a
single event. Depending on frequency content this delay can reach as much as 40 ms. Short delays
result in the signal going in and out of phase and are the underlying concepts for chorus, flanger
and phaser types of devices that are not used in broadcast, but proper application, creating a wider
perception of space, is beneficial to the sound mix.
Blauert came to the same conclusion as Haas about delay and localization, in that as a
constant delay is applied to one speaker the phantom image is perceived to move toward the
non-​delayed signal. Blauert further said that the maximum effect is achieved when the delay
is approximately 1.0ms.
Because the ears can easily distinguish between the first impression of a sound and its
successive reflections, this gives us the ability to localize sound coming from any direction. The
listener perceives that the direction of the sound is from the direction heard first – preceding
the second. While panning manipulates the sound by affecting the levels between the left and
right channels, the Haas effect works because of the timing difference between the channels,
exactly the way our ears work. The precedence effect helps us understand how binaural audio
works as well as how reverberation and early reflections affect our perception of sound.6
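As a small illustration of the Haas principle in practice (my sketch; the delay and level values are assumptions chosen to sit well inside the fusion window described above):

    import numpy as np

    def haas_widen(mono, fs, delay_ms=12.0, level_db=-1.5):
        # Widen a mono source by delaying one channel a few milliseconds.
        # Both channels fuse into a single event, localized toward the
        # earlier (undelayed) side, with an apparent increase in width.
        d = int(round(fs * delay_ms / 1000.0))
        left = np.concatenate([mono, np.zeros(d)])
        right = np.concatenate([np.zeros(d), mono]) * 10 ** (level_db / 20.0)
        return np.vstack([left, right])    # shape (2, n + d)
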
57

Ambisonics and Advance Audio Practices 57


There have been some studies about how our perception of sound changes with a change
in sound characteristics such as pitch shift or frequency variation. The Doppler shift is a valu-
able audio tool to enhance the sense of motion. It has an additional effect that appears to move
or shift high frequencies above the listener. The faster a sound source approaches, the higher
its pitch is shifted. The Doppler shift can be captured live with microphone placement; however,
there are some programs that can effectively emulate this effect.7
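For reference, the underlying relation is the standard Doppler formula (not from Baxter's text): for a source approaching at speed v_s,

    f' = f\,\frac{c}{c - v_s}

so with c ≈ 343 m/s a 1 kHz source approaching at 70 m/s is heard at roughly 1.26 kHz, then falls to about 0.83 kHz (f c/(c + v_s)) as it passes and recedes – the pitch sweep such plug-ins emulate.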

Phantom Imaging – Virtual Sources – Phantom Sources


All channel-​based reproduction systems, such as stereo, surround and immersive, produce
phantom imaging where we perceive a sound source between channels/​speakers from level
and time interactions.

Psychoacoustic Masking
Psychoacoustic masking is the brain’s ability to accept and subdue, to basically filter certain
distracting sounds. I have read articles touting the miracles of sound replication by Edison's
phonograph. Edison demoed real singers side by side with his devices; he would pull back the cur-
tain exclaiming better than life, pay no attention to those pops and ticks in the recording. The
mechanical reproduction devices suffered from a significant amount of scratches and ticks, but
the brain filters out the undesirable noise. For example, radio static is filtered out by the brain
when a high proportion of high frequency components are introduced. Additionally, noise
and artifacts from over compressed digital sampling may be filtered by the brain but result in
unhealthy sound.

The Missing Fundamental Frequency


The missing fundamental frequency is an acoustical illusion resulting in the perception
of nonexistent sounds. The harmonic structure determines our perception of pitch rather
than strictly the original frequency. The brain calculates the difference from one harmonic
to the next to decide the real pitch of a tone even when the fundamental frequency is
missing. This is the reason why you can hear sounds over small speakers that cannot repro-
duce the full range of frequencies –​the brain fills in the missing fundamental frequency.
Sub-​harmonic synthesizers create the tone as a virtual pitch below the audible frequencies
of hearing.
Harmonics in the mix can also contribute to the boosting of certain fre-
quencies. Additive spectral synthesis can be used for adjusting the timbre of your sounds by
combining and subtracting harmonics.8
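A small additive-synthesis sketch of the missing-fundamental illusion (illustrative values of my own choosing):

    import numpy as np

    def missing_fundamental(f0=100.0, harmonics=(2, 3, 4, 5), fs=48000, dur=2.0):
        # Sum harmonics of f0 while leaving f0 itself out; listeners still
        # report a pitch at f0, even over small speakers that cannot
        # reproduce the fundamental.
        t = np.arange(int(fs * dur)) / fs
        tone = sum(np.sin(2 * np.pi * f0 * h * t) for h in harmonics)
        return tone / len(harmonics)        # simple normalization
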

Applied Psychoacoustics: Creating Production Tools


The physics of an environment, the ear and the brain are at play when creating psychoacoustic
production tools. Acoustic simulators in the 50s were as basic as spring reverbs and it was a
time when stereo widening was achieved by adjusting the relationship of the sides and the
center signal. But no more. Some manufacturers and researchers go into a variety of halls and
spaces and do impulse measurements of decay times, reverberation field measurements and
vector analysis of reflections to try and mimic the real soundfields.
3D audio effects involve the virtual placement of sound anywhere in front, to the sides,
behind and above the listener. Spatial enhancements such as reverb and room simulators
are useful tools in dimensional and immersive sound production because they recreate the
perception of the physical size of a space as well as playing a significant role in creating the
illusion of a three-​dimensional space.
Basic reverb and delays are a single-dimension balance between the direct sound and
reflected energy, whereas advance audio production techniques are three-dimensional, founded
on psychoacoustic considerations. Spatialization can also be achieved by processing an audio
signal and by infusing the processed signals into the immersive soundfield. There are room
simulators as well as a variety of dimensional reverberation programs that can effectively pro-
cess an audio signal into a variety of immersive formats with height control. This type of pro-
cessing gives cohesion between the lower and upper layers as well as control of the reflections
and diffusion of the returning audio signals.
Psychoacoustic modeling software can take a sound or group of sounds and recreate them within a digital acoustic map of essentially any desired sonic space, a virtualization of space. Room simulators are capable of creating acoustic spaces using complex reflection algorithms to recreate the variety of a dimensional space. The ability to contour parameters like reflections and diffusion empowers the sound designer to recreate realistic sonic spaces or to create entirely new ones.
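At the core of most of these tools is convolution of the dry signal with a measured or synthesised impulse response for each output channel; early-reflection shaping, diffusion control and height layers are built on top of that operation. A bare-bones Python sketch of the core step (the function name and wet/dry balance are illustrative only):

import numpy as np
from scipy.signal import fftconvolve

def render_channel(dry, impulse_response, wet_gain=0.3):
    """Place a dry source into a measured space for one loudspeaker feed.

    A multichannel room simulator repeats this with a different (ideally
    decorrelated) impulse response per output channel, which is what creates
    the sense of a three-dimensional space.
    """
    wet = fftconvolve(dry, impulse_response)[: len(dry)]
    return dry + wet_gain * wet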

Advanced Spatial Enhancement


In addition to panning and placement, spatialization tools are capable of distance and size
functions. Distance is interesting because you are not just changing the volume when you
move a sound element closer or farther away, but as in the real world the change in distance
can change the tone of a sound as well. Size is a 360-degree, hemispherical perceptual evaluation of expanse, and advance spatial enhancement tools can expand the apparent dimensions of a sound element beyond its original region, enhancing the perceived magnitude of the original sound.
The ability to adjust the size of an object has fascinating production possibilities. Size
expands the perception of magnitude of a sound element by diverging the sound element
into adjacent channels. Size is a processing feature that can be useful for speech intelligibility or as an effect for dramatic enhancement. The soundfield can also be widened or squeezed to match the size of the picture. In short, many such processes have been added to the creative toolbox of the audio mixing engineer.

The Secret Sauce: Plug-​Ins and Other Black Boxes


Digital mixing desks and digital audio workstations (DAWs) depend on plugins for increased
functionality and expansion. Beyond localization, environment and room simulators are a valuable tool in the advance audio toolbox.
Creating an immersive soundfield for outdoor winter sports is challenging because, in
reality, wind does not make any sound until it collides with something like the trees in a
forest. I created several of these challenging soundscapes, underscoring that the ability to create such believable soundfields is a powerful live production tool. The DSpatial audio engine can operate in a standalone configuration and can be used to generate a soundfield in real time.

DSpatial
DSpatial created a bundle of plug-​ins that work under the AAX platform in a fully coordinated
way. Reality Builder is inserted on each input channel and can operate in real time in the Pro Tools environment, along with the option to run off-line. In off-line mode the rendering is
much faster than in real time. The DSpatial core engine can run in stand-​alone mode which
means no latency.
I enjoyed a discussion about sound design principles and practices with Rafael Duyos,
the brains behind DSpatial, who is much more than a coder. I believe he gets what sound
designers dream of.

DENNIS BAXTER (dB): As a sound designer, creating a sense of motion and speed has always been a challenge, particularly with events that do not have a lot of dynamics, like downhill skiing or ski jumping. Creating the illusion of someone flying through the air on a pair of skis is a challenge.
RAFAEL DUYOS (RD): Scientifically speaking, what we have done is a balanced trade-off between physical and psychoacoustic modeling principles. By that I mean that if something mathematically correct doesn't sound right, we have twisted it until it sounds right. After all, film and TV are not reality but a particular interpretation of reality. So we are not always true to reality, but we are true to the human perception of it.
RD: We have applied this principle to all the effects we have modeled. For example, Doppler is a direct consequence of the delay between the source and the listener when either or both are moving in relation to the other, but we have made this delay optional because sometimes it can become disturbing. Inertia was implemented to make the Doppler effect more realistic by simulating the mass of moving objects. Inertia is applied to each source according to its actual mass. Small masses have much more erratic movements. The Doppler of a fly doesn't sound the same as the Doppler of a plane. Doppler and Inertia usually have to be adjusted in parallel; very high degrees of Doppler usually require more inertia.
In the case of proximity, for example, we have even provided for an adjustment of the amount of proximity effect, from nothing (like current panning systems) to fully realistic. We use equalization only marginally. Normally we use impulse responses and convolutions because they are much more realistic. A very important part of the algorithm is the reflections. Take binaural, for example. A loose HRTF usually doesn't sound very realistic. However, if you take a good binaural microphone, it sounds much better than an HRTF alone, and that's because with the microphones you get millions of micro reflections coming from everywhere. That's what we try to model as much as possible. We are probably the system that needs the most computation to work, but we are not worried about that because computers are getting more and more powerful. Time is on our side.
dB: I thought your program for walls and doors, modeling the reflection, refraction, diffraction and scattering they produce, was very clever and useful. Can you explain your scatter principle?
RD: Dispersion is achieved through extreme complexity. The key to our system is our Impulse Response creator. This is something that cannot be achieved with algorithmic reverberations, and allows us to get the best of convolution and the best of algorithms.
RD: The complexity of IR modeling allows us to create fully decorrelated IRs for each of the speakers. That's simply not possible with microphone-recorded IRs. For us it's the essential part of our design. Our walls, doors, reflection, refraction, diffraction and scattering base their performance on this complexity. Rotate, collapse, explode, etc. are created in our DSpatial native format, and can then be exported to any format, be it Ambisonics, binaural, Atmos or Auro-3D. There is no format limit. As we record the automations and not the audio, we can always change it later.
dB: What are the X, Y and Z controls for?
RD: There is an X, Y, Z slider for each of the sources, and these represent the position of that source in 3D space. As simple as that. If the final mix is not in 3D, the projection of the 3D space into two dimensions or one dimension is accomplished. It is possible to edit on a two- or three-dimensional plane or even on an equirectangular plane. You will automatically see the effect of these movements on the X, Y and Z sliders.
dB: Some of the controls are for Center Force and Space Amount – please explain.
RD: Center Force is a feature that a Skywalker engineer asked for when we showed them our first prototype. They were obsessed with the dialogue being attracted to the center speaker. Center Force defines the intensity of attraction that the center speaker exerts over the dialogue, as if C were a magnet.
dB: Can you explain Ambients? Is this like an ambient noise generator?
RD: It is that and much more. Ambients is an audio injection system based on a player of audio files, for diverse uses. Its first use is to create environmental sounds such as noise from cities, sound from restaurants, parks, people, animals, machines or any general sound environment, even synthetic sounds made with synthesizers, musical instruments and phrases. In a word: any sound that can be put in an audio file.
RD: Once the type of Ambient is set, it can be injected into the final mix using three-dimensional spatialization parameters, through a simple joystick-like pad that is fully automatable. In addition to ambient sounds, you can use music and sound effects such as

Figure 4.6 DSpatial ambient options window
Figure 4.7 DSpatial Reality 2.0

gunshots, screaming, horses, door closing sounds, footsteps, etc. In these cases, there is a
pad-​controlled firing mode, of course, supporting spatialization parameters.
The Ambient system is also intelligent enough to match multichannel audio in the ambience source to the number of channels in the final mix, ensuring the best possible spatialization.
dB: Can you explain Spatial Objects?
RD: Spatial Objects are what we call DSpatial objects, which are the next generation of objects. Traditional objects are simple mono or stereo files located in a grid of speakers. They lack the ambience, which in reality is closely linked to the original signal. The environment is supplied separately in the form of beds. But that has the problem that the beds don't have good spatial resolution. If our goal is to make the system realistic, using beds is not a good idea. To be realistic, objects have to be linked to their reflections. But for that you need an integrated system that manages everything. That is exactly what Reality Builder does.
RD: DSpatial Objects are devoted to production, not just delivery. Contrary to other object-based formats, DSpatial works with objects from the very beginning of a production.
dB: Remember, Dolby required a bed to get started.
RD: With a DSpatial workflow the ideal is to work dry, and add as much, or as little, reverb as you want afterwards. There is no need to record the original reflections; DSpatial's hyper-realism and repositioning possibilities allow for total control in post-production.

This author listens and mixes in a neutral acoustic environment using Pro Tools, Nuendo and Reaper with 11.1 Genelec monitoring (7.1.4), and has auditioned and mixed the plug-ins described in this book.
The ability to create sonic spaces in real time is a powerful tool in immersive sound cre-
ation and production. Remember, sports sound design is equal parts sports specific, event specific and venue specific. As discussed in Chapter 5, capturing sports-specific sound with microphones is possible, but capturing the right venue tone is complicated by poor acoustics and little noise control. Advance audio production practices advocate manufacturing an immersive
soundbed to develop upon.

Advance audio production practices can be extended to include the aural re-​assembly of a
hostile acoustic environment where the background noise completely overwhelms the fore-
ground. Such was the distraction caused by the vuvuzelas at the 2010 World Cup. As I have said, a sports venue has a rather homogenous sound throughout, and infusing a new room tone onto the venue, similar to what is done in film, solves a lot of problems.

Sound Particles
You have probably heard Sound Particles on film-type productions, but Sound Particles has also developed an immersive audio generator that produces sounds in virtual sound worlds. Computer graphic imagery (CGI) modeling techniques are used to create three-dimensional images in film and television; Sound Particles is a 3D native audio system that uses similar computer modeling principles to generate thousands of 3D sound particles, creating complex sound effects and spatial imaging. All Sound Particles processes require rendering.
Practical application: Sound Particles is a post-production plug-in, but because of its flexible I/O configurations a timed event could be triggered, exported from the live domain to Sound Particles, rendered and played out live through the audio I/O with the live action. For example, a wide shot of the back stretch of a horse race is probably a sample playback, and that sample could be processed, rendered in real time and timed to the duration of the horses' run along a particular distance.
Sound particles can be anything from a single simple particle to a group of particles forming
complex systems. To build a new project from scratch, open the menu and select EMPTY, which opens a blank timeline. Now you can build your new timeline with video at the top and then add audio track(s), a particle group, a particle emitter and a microphone, or begin with presets.
An audio track is the sound that is going to be processed and can be mono, stereo or ambisonic. This is usually a file such as a .wav or other audio format. You import your audio file or files to the timeline. When using multiple files, each particle will randomly select an audio file from the selection of imported files.

Figure 4.8 Sound Particles menu SuperNova




In the menus you can select a particle group, where particles start at the same time, or a particle emitter, which emits particles at a certain rate. In a particle group you can set the number of particles; the default is 100, but the user can select from 1 to 100,000. You can change the shape of the particle group, for example circle, cylinder, rectangle or sphere.
Menus also determine when and where each particle starts relative to its initial value. Point means all particles are created at the same point; Line means all particles are created within a line segment; other options include inside circle, inside rectangle, inside sphere, outside sphere, inside cylinder and cylinder surface.
Movement modifiers control straight-line and rotational acceleration. For example, straight-line movement is where each particle moves in a straight line with gradually increasing or random velocities, while rotational acceleration controls the movement of a particle around its axis. Additional menus control audio modifiers such as gain, EQ, time and pitch, and delay. An interesting feature is a random delay, where each particle will start with a random delay of up to five seconds.
Hundreds of presets for positional automation such as Doppler, explosion, flyby, hurri-
cane, jumping around, machine gun, magnetic poles, moving tunnel, rotating grid, spinning
and more can be selected and added to the timeline, or automation can be programmed by the user.
In order to render the scene you need a point of reference: the program uses the concept of a microphone, which can be any polar pattern, from mono, stereo and multichannel to immersive formats such as Dolby Atmos, Auro-3D, NHK 22.2 or ambisonics up to 6th order. The microphone renders each particle in terms of its distance, by attenuating the sound, and in terms of its direction, by applying panning and the Doppler effect. You can change the position of the microphone as well as the group on the axis grid.
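The per-particle rendering idea can be illustrated with a toy example in which each particle is delayed and attenuated according to its distance from the virtual microphone and panned according to its direction. The sketch below is a conceptual illustration only, not Sound Particles' actual engine, and all names are hypothetical:

import math
import random

SPEED_OF_SOUND = 343.0  # metres per second

def render_particle(distance_m, azimuth_deg):
    """Toy per-particle rendering for a stereo 'microphone'.

    Returns (delay in seconds, left gain, right gain): distance sets the
    propagation delay and an inverse-distance attenuation, azimuth sets a
    simple constant-power pan.
    """
    delay = distance_m / SPEED_OF_SOUND
    gain = 1.0 / max(distance_m, 1.0)                             # inverse-distance law
    pan = (math.radians(azimuth_deg) + math.pi) / (2 * math.pi)   # map -180..180 deg to 0..1
    angle = pan * math.pi / 2
    return delay, gain * math.cos(angle), gain * math.sin(angle)

# e.g. 100 particles scattered between 1 m and 50 m around the listener
particles = [render_particle(random.uniform(1, 50), random.uniform(-180, 180))
             for _ in range(100)]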
There are menus for speaker setup, from immersive and surround to an edit mode using azimuth and elevation, as well as audio hardware I/O. Binaural monitoring can happen on any type of audio format, and with ambisonics you can have head tracking if you add an ambisonic microphone to the scene.
Rendering can be online or offline depending on the complexity of the scene. You can render a project with more than one track and more than one microphone. Files can be exported interleaved or non-interleaved, where each channel is exported as its own file. File formats are .WAV, .AIFF and FLAC, with selectable bit depth, sample rate, channel order and metadata.

Other Plug-​Ins
DTS-​X Neural Surround Upmixer converts stereo and surround sound content to 5.1.4,
7.1.4, 7.1.5 and 9.1.4. (See Chapter 8.)
The WAVES MaxxAudio Suite includes extended bass range using psychoacoustics, offering better sound reproduction through small speakers, laptops, tablets and portable speakers. Waves also has a standalone head-tracking controller.
The NuGen Halo Upmix 3D offers channel-based output as well as ambisonics: native upmixing to Dolby Atmos 7.1.2 stems with height channel control, as well as 1st-order ambisonics.
During rendering, the software conforms the mix to the required loudness specification and
prepares the content for delivery over a wide array of audio formats from mono to various
immersive formats supporting up to 7.1.2. Nugen’s software can also down-​process audio
signals with its Halo Downmix feature that gives the audio mastering process new ranges for
downmix coefficients, and a Netflix preset as well.
The Gaudio Spatial Upmix extracts each sound object from the stereo mix and then spatializes the 3D scene using binaural rendering technology adopted from the Next Generation Audio standard ISO/IEC 23008-3 (MPEG-H).

The Ambisonic Toolkit has four different ways to encode a mono source (planewave, omni, spreader and diffuser) and two different stereo algorithms.
The Blue Ripple plug-ins can encode mono sources into B-Format audio.
The SSA plug-ins offer an ambisonic gate/expander, de-essing, rotation, compression, delay and equalization.
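Encoding a mono source into first-order B-Format, as several of these tools do, follows a well-known set of trigonometric gains. The sketch below uses the traditional FuMa weighting (ACN/SN3D ordering differs slightly) and is a generic illustration, not the code of any plug-in listed above:

import math

def encode_bformat(sample, azimuth_rad, elevation_rad):
    """Encode a mono sample into first-order B-Format (FuMa W, X, Y, Z).

    Azimuth is measured anticlockwise from the front; elevation is measured
    upwards from the horizontal plane.
    """
    w = sample * (1.0 / math.sqrt(2.0))                            # omnidirectional component
    x = sample * math.cos(azimuth_rad) * math.cos(elevation_rad)   # front/back
    y = sample * math.sin(azimuth_rad) * math.cos(elevation_rad)   # left/right
    z = sample * math.sin(elevation_rad)                           # up/down
    return w, x, y, z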

Outside the Box: Black Box Processing


My first experience of a black box was in 1996, when I used one called the Spatializer. It had eight analog inputs that were controlled by eight joysticks and could output an expanded stereo (spatialized two-channel) or a quad output. This device clearly gave the impression of an extended soundfield to the left and right, and gave a better impression with simple sources, like a single microphone, than with a group of sounds.
Linear Acoustic has designed and built stand-alone boxes for loudness control and management for over a decade. I discussed the new immersive real-time up-processor with Larry Schindel, Senior Product Manager at Linear Acoustic. Linear Acoustic® UPMAX® ISC upmixing (up-processing) can be used to maintain the sound field regardless of the channel configuration of the incoming content. It can also be used creatively to enhance the surround and immersive soundfield.
Audio elements are extracted using frequency-domain filtering and time-domain amplitude techniques; the LFE is derived from the left, center and right channels without impact on the full-range left and right speakers. The surround soundfield can be adjusted via the center channel width control and the surround channel depth controls.
Parameters in the upmixer can be adjusted to help shape the sound to the user's tastes, such as whether the center channel sounds are routed hard center or spread a bit into other channels, or how far back into the surrounds a sound will go when steering upmixed content. The UPMAX ISC can monitor the input signal and auto-detect whether upmixing is needed; native surround content will pass through unprocessed. UPMAX ISC upmixes 2-, 3-, 5.1- and 7.1-channel audio to 7.1.4.
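The general principle behind such upmixers can be illustrated with the classic passive matrix, in which correlated content is steered towards the center and the left/right difference feeds the surrounds. The sketch below is a textbook illustration only, not Linear Acoustic's proprietary UPMAX processing:

import numpy as np

def passive_upmix(left, right):
    """Textbook passive 2-to-4 matrix upmix (illustration only).

    Correlated content (equal in both channels) appears in the centre feed,
    anti-correlated content (the L-R difference) in the surround feed; real
    upmixers add frequency-domain analysis and time-varying steering on top.
    """
    centre = (left + right) / np.sqrt(2.0)
    surround = (left - right) / np.sqrt(2.0)
    return left, right, centre, surround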
Upmixing can be inserted into a mix buss or on the output buss in the OB van or at the
network because there will always be a mix of legacy material with mono or stereo sound and

Figure 4.9 Linear Acoustic UPMAX signal flow




it is important to maintain a consistent sound field image all the way through the chain to
the consumer/​listener. Content passing through in native immersive formats is automatically
detected and passed through unprocessed.
UPMAX is a software component included in several Linear Acoustic processors and is also available as a standalone black box for upmixing legacy 2-, 3-, 5.1- and 7.1-channel audio to 5.1.4 and 7.1.4. Content passing through the UPMAX that is already native immersive will automatically be passed through without processing. UPMAX has been used in live situations for upmixing music, effects and legacy material that is not already immersive. UPMAX I/Os are AES, MADI and SDI.
Illusonics IAP is a real-time immersive sound up-processor. It has features that would appeal to an audiophile, although it is an up-processor/sound-enhancement device rather than a high-end exciter-type pass-through box designed to compensate for dull material. It extracts spatial information and creates space around the listener.
If you consider the wide array of inputs (HDMI, digital coax, optical, USB, IAP network, analog and phono), you might think it is a high-end consumer device, except for the price tag. The HDMI inputs support up to eight channels of 192 kHz and 96 kHz 24-bit audio; the digital coax, optical S/PDIF, USB port and UPnP/DLNA network protocol support 96 kHz and 92 kHz 24-bit audio inputs. The outputs are HDMI, balanced XLR and unbalanced RCA. IAP configuration, as well as gain, polarity and delay, can be applied to input and output channels from your Mac/PC via a USB cable.
There are six adjustment parameters: center, depth, immersion gain, immersion high, immersion size and clarity. Center determines the degree to which a phantom center is converted to a real center; it enlarges the sweet spot by localizing dialog and soloists in the center of the space for listeners everywhere else in the room. For example, if a stereo signal (2 x mono) that includes dialog is selected, a center channel will be derived; if the HDMI input is used, its center channel will be directed to the center speaker.
Additional surround, height or center information can make depth/​immersion more nat-
ural, controlling early sound reflections. Depth beyond 50 percent amplifies the rear channels.

Figure 4.10 Illusonics menu for loudspeaker setups – outputs for 20 positions




Immersion gain is the psychoacoustic sensation of the degree of envelopment that a listener perceives; it sets how strongly diffused sound is reproduced. Immersion high is an equalization control for brilliance, and immersion size sets the reverberation time (RT60) of the immersion signals. Clarity makes the reproduced sound drier, reducing the amount of room reverberance, and there are tone controls with bass and treble frequency and gain adjustments.

Notes
1 Christiaan Huygens, Traité de la Lumière (1690). See also sciencedirect.com/topics/physics-and-astronomy/huygens-principle and courses.lumenlearning.com/austincc-physics2/chapter/27-2-huygens-principle.
2 M. A. Gerzon, "Periphony: With-Height Sound Reproduction," J. Audio Eng. Soc., vol. 21, no. 1, pp. 2–10 (1973 February).
3 Olivieri, Ferdinando, Nils Peters, and Deep Sen. 2019. Review of Scene-Based Audio and Higher Order Ambisonics: A Technology Overview and Application to Next-Generation Audio, VR and 360° Video. EBU Technical Review. https://tech.ebu.ch/docs/techreview/trev_2019-Q4_SBA_HOA_Technology_Overview.pdf.
4 D. Sen, N. Peters, M. Kim, and M. Morrell, "Efficient Compression and Transportation of Scene-Based Audio for Television Broadcast," Paper 2-1 (2016 July).
5 Blauert, Jens. 2001. Spatial Hearing: The Psychophysics of Human Sound Localization. Cambridge: The MIT Press.
6 H. Haas, "The Influence of a Single Echo on the Audibility of Speech," J. Audio Eng. Soc., vol. 20, no. 2, pp. 146–159 (1972 March).
7 "The Doppler Effect: Christian Doppler Wissensplattform." n.d. Accessed December 16, 2021. www.christian-doppler.net/en/doppler-effect/.
8 McKamey, Timothy. 2013. "Restoration of the Missing Fundamental." Sound Possibilities Forum. September 7, 2013. https://soundpossibilities.net/2013/09/06/restoration-of-the-missing-fundamental/.
CHAPTER 5
Leveraging Motion and Conceptual Frameworks of Sound as a Novel Means of Sound Design in Extended Reality

This chapter is excerpted from Designing Interactions for Music and Sound, Ed. by Michael Filimowicz.
© 2022 Taylor & Francis Group. All rights reserved.
8 Leveraging Motion and
Conceptual Frameworks of Sound
as a Novel Means of Sound Design
in Extended Reality
Tom A. Garner

1 Introduction
Bemoaning the under-appreciation of sound, specifically when compared to
visuals, in the design of virtual worlds is almost something one could build
an academic career on. Indeed, many of my prior works have been introduced
in this manner, to the extent that I sometimes even question my commitment
to addressing the issue. Were it to be solved, I would need to find something
else to complain about. Personal issues aside, it would be unfair to suggest that
sound design for virtual worlds has not progressed. In many ways it has, and
in leaps and bounds, but it often feels at least one step behind its visual cousin.
It is one of the most recent examples of this issue that is the subject of this
chapter, namely the consideration of sound amidst a form of cross-pollination
of technology and practice that is being driven by extended reality, or XR.
The term ‘extended reality’, long before it was abbreviated, goes back at least
25 years. It appears in the title of a 1996 paper by Yuval Ne’eman, in which the
term described a theoretical infinite sequence of parent universes connected in
a linear sequence, each birthing the next in line; essentially, reality extending
beyond the known universe. Extended reality reappears in academic literature
a couple of years later, this time analogous to augmented reality, in its ability
to extend reality by way of digital overlays upon a user-view of a physical
environment (Klinker et al. 1998). Over the next few years, the term remained
rather obscure, but the notion of extending our reality through technology, art
and thought persisted and continued to develop.
In most contemporary definitions, the meaning of XR takes much influence
from the taxonomy of Milgram and Kishino (1994) and functions as an umbrella
term to refer to the collective suite of virtual, augmented and mixed-reality
technologies, such as head-mounted displays, spatial computing systems and
wearables. To be clear, this usage of the term would arguably be best abbrevi-
ated to ‘xR’, with the prefix in lower case signifying that the ‘x’ stands as a variable for the different realities. Research typi-
cally deploys xR when describing an area of industry such as manufacturing
(Fast-Berglund et al. 2018) or construction (Alizadehsalehi et al. 2020) that
utilise a combination of virtual, augmented or mixed-reality systems as a suite
of technological solutions. Otherwise ‘XR’ is at present the default, and we

therefore use this format throughout the chapter. Broadly speaking, the ratio
of virtual to physical content within a singular user experience identifies three
conceptual classes within extended reality, namely: Virtual Reality (VR), with
its emphasis upon virtual content; Augmented Reality (AR), which prioritises
experience of physical content; and Mixed-Reality (MR), a more balanced or
complex interplay between physical and virtual content.
In many recent cases, what constitutes virtual, augmented or mixed reality
has become entwined with specific hardware devices, presented as platforms
to exclusively deliver that form of XR content. The head-mounted display
(HMD) has arguably become so synonymous with virtual reality, in particular,
that many perceive the device and the concept to be the same thing. MR has
its equivalent in location-based experiences: installations comprising bespoke
physical and virtual content, such as digitally enhanced museum exhibits or
theme park rollercoasters. The immediate problem with us understanding XR
in this way, which feeds heavily into matters of extended reality sound design,
is that it constrains our expectations for what technologies and practices can
be deployed. If XR is restricted to VR and AR in particular, both of which are
themselves viewed as restricted to HMD hardware, this arguably limits numer-
ous opportunities to provide more nuanced, effective and efficient solutions.
The core aim of this chapter is to emphasise the great potential of sound
design research and practice to meaningfully enhance extended reality applica-
tions, both now and in the future. Feeding into this overarching ambition, the
discussion commences with a rationale for cross-pollination: extending the
meaning of XR by considering it more holistically, as a wider array of tech-
nologies that should not be deployed or developed in isolation, but rather as
a collection of potentials from which an ideal solution can emerge. Following
on from this, the discussion then turns to make the case for human motion to
be appreciated as one of the most significant opportunities to drive innovative
sound design in XR. This is done in three stages, each based on a key prem-
ise. The first premise is that human motion is the defining innovative asset of
contemporary XR technology. The second is that sound and human motion are
intrinsically and deeply interconnected. The final premise is that the substan-
tial body of literature concerning acoustic ecology and theories of sound and
listening can be leveraged to reveal numerous opportunities for developing
innovative approaches to motion-driven XR sound design.

2 Cross-pollination: an extended definition of XR


Beginning in 2018, the UK government department for Research and Innovation
invested over £39 million into the Audience of the Future challenge1 (AoTF).
Revolving heavily around notions of immersive and interactive experience,
AoTF sought to connect national museums, film production companies, game
studios, theatre companies, universities, orchestras and other partners to explore
this potential for cross-pollination to a singular experience. Here the phrase
cross-pollination to a singular experience feels highly appropriate, as it speaks
to a fundamental ethos of extended reality: to bring together any combination
of technologies, environments, concepts, designs, objects, creatures and people
to produce a world, perceived by its audience or user as a singular experience.
The technological overlap and cross-influence between various forms of
media, including cinema, television, radio and digital games, has arguably con-
tributed to the impressive rate of progress observed recently. So too has the
cross-application of many types of assets and their production methods, a small
sub-section of which includes motion capture for animation, photogrammetry
for production of 3D models, and spatialisation processing for 3D audio. Con-
temporary techniques in virtual cinematography that feature on both the silver
and small screens are delivered primarily by way of game engine technology.
Online film and television streaming platforms have begun experimenting with
interactive, choose-your-own-adventure-style content. Live theatre is increas-
ingly experimenting with digital and network technologies to facilitate subtle yet
engaging interactive opportunities, where audiences are no longer passive spec-
tators but can exert direct influence over the events as they unfold upon the stage.
The vast majority of big-budget digital game studios continue to borrow from
film and television in their quest to provide audiences with a ‘cinematic’ game-
play experience. Here, what was traditionally interactive media reveals clear
ambitions to acquire the crafted qualities of film, television and theatre, which in
turn are revealing a corresponding ambition to become more interactive.
When we consider some of the existing conceptual frameworks of XR (of
which there are decidedly few), the majority reinforce this notion of cross-pol-
lination by either incorporating a wide array of devices into their framework,
or emphasising the importance of focussing upon reception and user experi-
ence as defining features (Flavián et al. 2019). For example, Çöltekin and col-
leagues (2020) present a wide-ranging taxonomy of display devices, including
HMDs, but also traditional/non-immersive (smartphones, tablets and moni-
tors), semi-immersive monoscopic (curved screens, extra-large screens) and
stereoscopic (3D screens, CAVE) displays. Similarly, Doolani and colleagues
(2020) assert that the key features required to identify something as ‘XR’ are
display device, image source, environment, presence, awareness (referring to
perceived realism of the rendered objects), interaction, perspective and appli-
cation. Here, there are no further specifics within these requirements, no limi-
tations on type of display or the nature of the interaction, nor are there any
constraints set on how multiple types of XR could be used in tandem to create
a ‘blended-reality’ experience.
Sound design also benefits from this cross-pollination of progress, with spa-
tial audio being a good example. Developments in spatial audio present clear
added value that transcends the different forms of digital media: namely, the
potential to direct audience attention, obscure distractions by physically sur-
rounding the audience in sound, and exploit this immersive quality to evoke
a sense of diegetic presence (by way of the illusion that the soundscape exists
simultaneously within the physical and virtual world). This raises the ques-
tion of what other opportunities for sound design can be unearthed when
considering XR from this more holistic perspective. The next sections of this
chapter seek to address this question.

3 Motion and naturalistic interaction


The etymology of the term ‘naturalistic interaction’ appears to possess a rela-
tively short history. In the late 1970s, the term was used in passing to describe
a deceptive role-play technique, utilised as part of a study into the validity of
role-play for examining heterosexual social interactions (Bellack et al. 1979).
In the naturalistic interaction group, subjects did not realise they were role-
playing, and this was compared against a subject-aware role-play condition2.
The 1980s saw naturalistic interaction feature a little more frequently (Dono-
hue et al. 1984; Krasnor & Rubin 1983), appearing largely in studies concern-
ing social interaction where the participants were largely unaware that they
were being observed. Whilst the context of these studies is somewhat removed
from the topic of this chapter, there is a significant connection to be drawn,
specifically with regards to the notion of unaware role-play.
Following the timeline, naturalistic approaches to human-computer interac-
tion (HCI) also begin to crop up in the 1970s, initially referring to speech-
based HCI methods (Orcutt & Anderson 1974; Smith & Goodwin 1970) but
then widening focus to consider various matters of interacting with a graphical
user interface (Treu 1976). By the 1990s, various concepts relevant to natural-
istic HCI became cemented in the discourse, with the advent of human-com-
puter interface design as a recognised discipline. Here, resonances of earlier
usage of the term naturalistic interaction can be observed in a HCI context.
As Laurel and Mountford (1990) point out, computer technology should feel
invisible and ‘subservient to [the] goal’ (p. 248). Laurel and Mountford also
posit that a wide range of opportunities are presented within interface design to
achieve this vision, citing sound design, speech recognition and gesture input,
amongst others.
More recent research addressing ‘natural HCI’ has largely investigated mat-
ters of emotion, voice and gesture (D’Amico et al. 2010), with the latter featur-
ing very prominently across the last decade (Linqin et al. 2017; Plouffe et al.
2015; Rautaray & Agrawal 2012). Motion-based HCI broadly enjoys much
praise from the research community, as Song and colleagues (2012) attest to
in suggesting that ‘[i]ntelligent gesture recognition systems open a new era of
natural human-computer interaction’ (p. 1). Their reasoning for this judgement
reflects our earlier observation in early usage of naturalistic interaction, argu-
ing that human motion requires little to no conscious thought, allowing a user
to focus entirely on the task or purpose of the interaction whilst the mediating
effect of the technology goes by unnoticed. We only notice our smartphones as
tangible devices when they stop working or slow down.
Just as the 1990s saw synergies between naturalistic interaction and HCI,
so too did VR join the narrative. Although the popular myth that Jaron Lanier
coined the term ‘virtual reality’ in 1987 still persists, most sources are clear
in stating that this point actually marked the popularisation of the term, not
its conception. Existing in more obscure forms in the early 20th Century, VR
began to appear in academic literature in the early 1970s, a good example of
which being Norton’s (1972) conceptual exploration. One of the first studies
to consider human motion specifically within the domain of VR did so with a
focus upon comparing walking on the spot to move a virtual avatar forwards
with the use of a hand-held controller, in terms of their effect upon user pres-
ence (Slater et al. 1995). As the more naturalistic method of VR interaction,
Slater and colleagues’ study observed that walking on the spot did present an
enhanced feeling of being physically present within the virtual world. Indeed,
as the research into motion-tracking and naturalistic interaction for VR con-
tinued, researchers increasingly professed it to be one of the key benefits of
VR itself. This could be observed across numerous areas of application that
included education (Helsel 1992), medicine (Székely & Satava 1999) and arti-
ficial intelligence (Luck & Aylett 2000), to name a few. Across these examples
and more, the principle generally remains consistent: tracking human motion
to facilitate naturalistic interaction is arguably the most prominent benefit of
VR technology.

4 Sound in extended reality

4.1 Current challenges and a rationale for a hybrid approach


Sound is unquestioningly interwoven throughout the history of XR. Jaron
Lanier, a highly influential figure in early virtual reality, was himself a com-
poser who utilised VR technology to push the physical boundaries of musical
performance (Johnson et al. 2019). Well-established sound design techniques
for virtual environments include ambiences for immersion, spatialised sound
for user-localisation and interactive audio for feedback (Vi et al. 2019). Sound
provides powerful tools to establish setting, characterisation and narrative, and
can efficiently provide clear signification on matters of goals, tasks and pro-
gress (Skult & Smed 2020). Enhanced data representation, another notable
application of XR, also benefits substantially from careful consideration of
sound. XR sonification systems translate various forms of data into sound to
enable deeper, more reliable and/or more efficient interpretations of that data.
Such systems have been shown to be highly effective in transforming chemical
data (Morawitz 2018), for example.
A review by Serafin and colleagues (2018) explores the current challenges
and promising contemporary approaches of interactive audio. The article
points to three key research challenges, each of which is relevant to one of
three aspects of virtual acoustics: source modelling, receiver modelling and
room acoustics modelling (see Savioja et al. 1999). The first challenge relates
to source modelling, and strives for richly populated soundscapes comprising
fully interactive audio, in which sounds reflect the precise nuances of a user’s
interactions with virtual objects, to the extent that they are indistinguishable
from physical (actual) sounds. The second challenge concerns receiver mod-
elling, and aims to produce spatialisation that is perceptually indistinguish-
able from real-world experience. Finally, room acoustics modelling denotes
the ambition for realistic simulation of environmental acoustics to present a
realistic perception of space and place. Equivalence to real-world experience
is a central qualitative aspiration which all three challenges feed into, with the
additional requirement across all three being to do so under the constraints of
limited computer processing resources.
Following their explanation of the three research challenges outlined here,
Serafin and colleagues (2018) proceed to document how the response has so far
been structured into two broadly isolated pathways: sample-based and genera-
tive methods. They observe that the historical limitations of these two routes
still persist, particularly with regards to the first challenge of source modelling.
Despite progress, sample-based approaches continue to hungrily consume
computing resources, which severely limits the number of samples that can
be packaged within an interactive experience. Generative audio, by compari-
son, remains in most instances clearly distinguishable from mechanically pro-
duced sound. Several encouraging developments documented by Serafin and
colleagues are notable for their near-unanimous favour of a hybrid approach,
utilising sample-based and synthetic/algorithmic elements together to help lev-
erage the benefits (and minimise the limitations) of each. This is an important
point that reflects the first key assertion of this chapter: that a cross-pollination
attitude to design may yield the most significant benefits to an XR experi-
ence, both in terms of sound design and in general. XR does not always strive
for ever-increasing interactivity or realism. There is certainly great value in
endeavouring to fully simulate our auditory reality, but whilst such constraints
persist, at least in the near-to-mid-term, the flexibility offered by a holistic
interpretation of XR presents us with a range of hybrid techniques to produce
an experience that, whilst not perfect, is optimal for the current technological
state of the art. Summers and colleagues (2015) perfectly encapsulate this sen-
timent in their position on best practice sound design for virtual and augmented
reality: ‘The challenge therefore is to combine the most advanced emerging
technologies . . . with plausible and acceptable sonic interaction design, in
terms of experience, emotion, narrative, and storytelling’ (p. 38).

4.2 Current technical approaches and research directions


At present, the vast majority of development in XR is provided by two compet-
ing platforms, Unity and Unreal Engine (the only real alternative being pro-
prietary engines built in-house for exclusive use by the company for which
they were made). Both platforms have a pedigree in games development, but
in recent years have sought to diversify their portfolio of uses in ways that,
once again, reflect the cross-pollination ethos of holistic XR discussed ear-
lier within this chapter. Unity and Unreal Engine proudly tout this diversity
(by way of their respective ‘Solutions’3 and ‘Spotlight’4 pages), which now
extends beyond games to include usage for film and television, live broadcast-
ing, animation, automotive design, transportation, manufacturing, engineering,
architecture and construction, branding, education and even gambling.
In an extensive technical review of audio programming, Goodwin (2019)
draws numerous parallels between runtime sound systems for games and those
for XR, with the majority of current sound design techniques and technologies
interchangeable between the two. With this in mind, it is not surprising that
two development tools that originated as exclusively game development plat-
forms have comfortably expanded to address numerous other forms of digital
media. Within both Unity and Unreal Engine, audio tools exist across multiple
layers of each system, from the native source development kits (SDKs), to
established third-party plugins, to more experimental or specialised tools cre-
ated by individuals or small, independent teams. Native SDKs largely provide
a broad range of real-time digital signal processing tools (equalization, com-
pression, reverberation etc.). By comparison, the more established, often big-
budget, third-party audio plugin tools typically facilitate more advanced effects
that may include convolution reverb, occlusion and refraction. These tools also
largely prioritise advanced spatialisation tools that include head-related trans-
fer function (HRTF) processing and compatibility with binaural and ambisonic
audio sample formats. Lastly, the more experimental, small-scale audio tools
typically seek to fill some of the gaps in functionality not addressed by either
native or big-budget audio SDKs. Of these gaps, generative audio is arguably
the main focus of this third group (see Johnson et al. 2019).
In terms of VR sound, recent research primarily falls into one of three cat-
egories: studying sound or sound-relevant matters using VR technology as part
of the method (see Sanchez et al. 2017; Vorländer et al. 2015); use of con-
temporary VR tracking technology for musical performance (see Hamilton &
Platz 2016; Serafin et al. 2016); and continuing to explore the big-three sound
design challenges (see Section 4.1), referencing VR but with relevance gen-
eralisable to all virtual environments (see Hong et al. 2017; Raghuvanshi &
Snyder 2018).
Many examples of AR sound design research broadly explore pervasive,
location-based games (using GPS data or other forms of positional markers to
situate the experience within real-world interiors or geography), specifically
designs in which sound is heavily prioritised as the primary source of sensory
feedback (e.g., Chatzidimitris et al. 2016; D’Auria et al. 2015; Kaghat et al.
2020). The research into AR sound is arguably at an earlier state of progress
compared to its elder sibling, VR. As such, many studies take the form of
proof of concept or prototype projects that demonstrate the potential of audio-
centric (or even audio-only) AR within a specific area of application, such as
cultural heritage, tourism or gaming. However, as this sub-domain progresses,
it is sensible to assume that matters of nuance and design will become increas-
ingly prevalent, and also that the research will soon explore many of the issues
described here that are presently focussed upon VR or virtual environments
more broadly.
In a review of ‘sonic interaction design’ techniques, Summers and col-
leagues (2015) argue explicitly for XR sound design to consider four key
issues: embodiment (the physical effect of sound upon the body), context (e.g.,
user expectations, designers’ ambitions for cognitive or affective impact),
experimentation (to make use of, but also think beyond established frame-
works and practices by ‘playing with sound’ during the design process) and
holistic design (retaining an openness to alternative methods and a willingness
to utilise multiple approaches in combination to achieve the optimal result).
Otherwise, research articles explicitly addressing XR sound design are, for the
moment, rather few and far between, with much focus remaining separately on
sound in virtual or augmented reality systems, leaving a notable gap for a new
comprehensive theoretical framework on XR sound design.

4.3 Parallels in sound and motion


Sound and the body are widely acknowledged to share a deep connection.
A sudden, unexpected sound will likely startle us, temporarily taking hold and
contorting our whole body as the autonomic nervous system kicks in with a
fight or flight response, well before we have chance to consciously register
what is happening. Some of the most emotionally evocative sounds are those
that we have a strong bodily response to (Cox 2008). Although certain content
upon video streaming platforms has made the assumption that sound can be a
powerful tool in promoting sleep long before scientific study had the opportu-
nity to validate the claim, recent research has supported the notion that sounds
capable of inducing Autonomous Sensory Meridian Response (ASMR) can
indeed have physiological effects that prepare the body for sleep (Poerio et al.
2018).
Sound and human motion are, of course, intrinsically and meaningfully con-
nected when we consider forms of organised sound such as music, with perfor-
mance effectively being a conversion of our bodily motion. From an emergent
perspective of human perception, our experience of the world is fundamentally
cross-modal. Music and motion are no exception, with elements such as dance
being ‘a type of corporeal articulation of our cognition of music’ (Nymoen
et al. 2013, p. 2). The relationship between music and dance is widely accepted
to be both deep and pervasive. As Seeger (1994) asserts, ‘music and dance
are inextricably involved in human social processes. They take their meaning
from, and give meaning to, time, space, the body and its parts, human artefacts,
personal experience, social identity, relations of production and social status’
(p. 686). Expanding on this idea, Haga (2008) provides a comprehensive
exploration of the correspondences between the fundamentals of music and
movement of the body, identifying various parallels between kinematics and
dynamics. To give an example, Haga explains the notion of ‘effort’ as relevant
to both music and motion across four factors: weight (strong or heavy to gentle
or light), time (sudden to sustained), space (direct or straight-lined to indirect
or wavy-lined) and flow (controlled or bound to free).
When seeking to extract features from human motion as relevant to gestural
interaction within XR, basic music theory very quickly becomes an abundant
source. This is particularly prominent when we consider temporal features.
The linear but multi-layered nature of music, as it exists across time, maps
particularly well to how the body moves. Forgoing a discussion on quantum
music5, a single musical line cannot exist in a superposition. If it moves, that
movement can only be in a singular direction without an additional musical
line being layered on top. The body does, of course, follow the same principle.
I may move my right hand upwards. Then I may add a layer and simultane-
ously move my left hand downwards, but I cannot move one hand upwards
and downwards in a single motion. The equivalent affordances and constraints
of fundamental properties between music and motion mean that increasingly
higher-level features of one also apply very well to the other. For instance,
both music and motion can be analysed in absolute or relative terms. A musi-
cal phrase may move from an absolute C♮ to an E♭ with the relative ‘space
between the notes’ being a minor third interval. Equally, I may move my right
hand forwards to an absolute change value of 30cm, but were I to move my
left hand at the same velocity, the change value of my right hand as relative to
my left would be zero.
In musical theory, contrapuntal motion draws further parallels with relative
user motion in XR. The four core forms of contrapuntal motion define the fun-
damental dependencies of two entities. ‘Parallel motion’ describes two musical
lines moving in the same direction with consistent interval changes, equivalent
to moving both hands in the same direction and at the same velocity. Retaining
the same direction but varying the intervals, ‘similar motion’ would be equiva-
lent to both hands moving in the same direction, but with different rates of
displacement. In a ‘contrary motion’, two musical lines move in opposite direc-
tions (described as ‘strict contrary motion’ if the intervals are also consistent),
comparable to, for instance, rolling the right hand clockwise whilst the left
rolls counter-clockwise. Lastly, ‘oblique motion’ describes the movement of
one musical line whilst the other remains at a constant pitch; again, straightfor-
ward to replicate in the hands, with one hand moving whilst the other remains
stationary.
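Treating the two tracked hands as two musical lines, these four categories map onto a very simple classification of relative motion. The following is a hypothetical sketch (the function, threshold and labels are illustrative, not an established API):

def contrapuntal_class(delta_right, delta_left, tol=0.01):
    """Classify the relative motion of two tracked hands along one axis.

    delta_right and delta_left are per-frame displacements in metres;
    tol is an illustrative stillness/equality threshold.
    """
    moving_r, moving_l = abs(delta_right) > tol, abs(delta_left) > tol
    if moving_r != moving_l:
        return "oblique"       # one hand moves, the other holds still
    if not moving_r:
        return "static"        # neither hand moves (no musical analogue)
    if (delta_right > 0) != (delta_left > 0):
        return "contrary"      # opposite directions
    if abs(delta_right - delta_left) <= tol:
        return "parallel"      # same direction, same rate
    return "similar"           # same direction, different rates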
The tracking itself is arguably one of contemporary VR-HMDs’ most
impressive features, particularly in those utilising so-called ‘inside out’ track-
ing systems6 that remove the need for any external sensors, thereby reducing
setup time and physical space requirements. Tracking is, for most purposes, highly accurate and precise, with high spatial and temporal resolution. These
tracking qualities facilitate reliable detection of a wide range of movement fea-
tures. From an individual tracked object, such as one of the hands or the head,
we can determine the absolute (sometimes referred to as ‘global’) orientation
and location across six degrees of freedom. This presents us with 12 easily
controllable user-actions, as each degree of freedom provides bi-directional
movement (increase/decrease, clockwise/counter-clockwise). Things esca-
late exponentially when temporality is considered, with the broad temporal
features of acceleration, speed and deceleration. Without relevant training,
temporal features of human movement are of course more difficult to control,
but a typical VR user can broadly be expected to be capable of at least a binary
differentiation (fast/slow, immediate/slow attack, immediate/slow decay)
which provides six additional movement features, across two directions, for
each degree of freedom, for just one tracked object.
Just as displacement of hand position and orientation can be analogous to
changes in pitch along a musical phrase, the temporal features of such motions
can equally draw parallels to features of rhythm and tempo. Frank (2000) breaks
down tempo into five basic elements: sustain, aligned repetition, non-aligned
repetition, aligned non-repetition and non-aligned non-repetition. Applying
this to human motion, sustain may represent a slow and steady movement or
it could equally describe stillness or holding a fixed pose. Repetition describes
the same action recurring, whilst alignment refers to whether the action has
coordination with a regular pulse. For example, an aligned-repetitive motion
would be descriptive of a recurrent upwards then downwards motion with an
observable beat-per-minute (BPM) value. A non-aligned-repetitive motion, by
comparison, could be exemplified by the same up/down action, but with each
repetition occurring at seemingly random points in time, whilst an aligned-
non-repetitive motion would best describe a series of movements with seem-
ingly random velocities, occurring in tandem with a coordinated BPM.
Unsurprisingly, when we consider the interaction affordance outlined, recent
examples of XR technology for novel sound interaction are broadly focussed
upon musical applications. The Skeleton Conductor project (Pajala-Assefa
2019), for example, utilises the head and hand tracking of commercial VR-
HMDs to procedurally generate musical content. Here, as in most cases, the
content is rather heavily constrained to preserve perceived musicality; in this
instance, by way of fixing factors such as key and instrumentation whilst fea-
tures such as dynamics, tempo and pitch (within the appropriate scale or mode)
can be manipulated in real-time by the user changing specific features of their
movement. The take-home point here is that human motion, trackable using
XR technology, presents us with an abundance of clearly defined actions that
can be mapped to features of music in a way that is conceptually consistent
and intuitive for the user. Of course, sound extends beyond music, raising the
question of how naturalistic human interaction based on tracked motion could
also impact the design of non-musical sound.
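A hypothetical sketch of that kind of constrained mapping follows, with hand height quantised to a fixed scale and hand speed driving dynamics; the ranges, scale and names are illustrative and are not the Skeleton Conductor implementation:

C_MAJOR_PENTATONIC = [60, 62, 64, 67, 69, 72, 74, 76]  # MIDI note numbers

def motion_to_note(hand_height_m, hand_speed_ms):
    """Map tracked hand height to pitch (within a fixed scale) and speed to loudness."""
    # Clamp height to an assumed comfortable 0.8-2.0 m reach, then quantise to the scale.
    height = min(max(hand_height_m, 0.8), 2.0)
    index = int((height - 0.8) / (2.0 - 0.8) * (len(C_MAJOR_PENTATONIC) - 1))
    pitch = C_MAJOR_PENTATONIC[index]
    # Faster movements play louder; clamp to the MIDI velocity range 0-127.
    velocity = int(min(max(hand_speed_ms, 0.0), 2.0) / 2.0 * 127)
    return pitch, velocity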

5 Towards a conceptual model of extended reality sound

5.1 Theories of sound and taxonomies of listening


Sound design as it relates to XR is of course a creative process, with the mer-
its of improvisatory, exploratory, even playful approaches being advocated in
recent research (Summers et al. 2015). That said, this does not mean we would
be correct to disregard the value of a structured theoretical framework of XR
sound. Indeed, much of the research advocating the development and usage
of such understanding does so in the context of creative compositional and
design applications (see Collins 2013; Grimshaw 2007; Tuuri & Eerola 2012).
The meaning of sound has been explored in relation to its fundamental nature
(what is sound?) but also its fundamental location (where is sound?). Various
theories attempt to address the first question, with some asserting that sound
is a property of an object, whilst others argue that it is an event. Additional
theories have posited sound to be the relationship between object and event,
whilst others take an even more holistic view to describe sound as a phenom-
enon encapsulating object, event, resonating space and listener. With regards
to the second question of sound location, O’Callaghan (2011) identifies three
possibilities: distal (sound is located at the source), medial (sound is located
at the soundwave between source and listener) and proximal (sound is located
at the listener).
In constructing a comprehensive understanding of sound, another key dis-
tinction to make clear is that between hearing and listening, the latter of which
cannot be reduced to the former (Sterne 2003). Understanding the nature of
listening in an increasingly digital world, where interpersonal communication
is ever more frequently conducted remotely and often without any sound whatsoever,
is becoming ever more complex. As Crawford (2009) observes, the term ‘lis-
tening’ has become a metaphor for ‘paying attention’ whilst engaging with
digital communication such as social media. Indeed, as Rice (2015) points
out, ‘in contemporary usage [listening] does not always refer to auditory atten-
tion. The meanings of listening have proliferated into non-auditory spheres’
(p. 101). Our ambition to better understand sound also suffers from the blurred
lines that differentiate the three broadest classes of sound: music, speech and
non-music-non-speech-sound (a properly concise and accurate term for which
still frustratingly evades the literature). One effect of this blurring is that many
ideas presented within research directly considering speech, for instance, may
still have relevance to the other sonic forms. Of course, XR sound will, in
many cases, incorporate all three sound classes, but we remain encouraged to
consider the three classes holistically to understand the relationships between
them, not simply each class in isolation.
Exploring sound in terms of discrete listening modes is a common feature of
the literature with general agreement over the nature of individual modes, but
more divergence on the number of modes. There is also observable difference
between various theories based on the extent to which the authors are attempt-
ing to conceptualise listening relevant to generalisable, everyday experience
or to more specialised usage. Table 8.1 summarises all of the discrete modes
identified across the sources that were reviewed for this chapter.
Arguably one of the most comprehensive taxonomies of listening, Tuuri
and Eerola’s (2012) modes of listening taxonomy, identifies nine distinct ways
in which a listener may extract meaning from sound. These ‘modes’ exist on
a continuum between those that are more immediate and experiential (such
as kinaesthetic listening [identifying the position, orientation, and movement
of source] and causal listening [matching a sound to a source object and/or
event]) and those that are more reflective and considered (including functional
listening [identifying the purpose of the sound] and critical listening [a value
judgement on the quality/appropriateness of the sound]).

Table 8.1 Summary of the various 'modes of listening' identified in the literature

Listening in search: Actively analysing the soundscape or scanning for a particular cue (Truax 2001)
Listening in readiness: Ready to respond to a sound cue but not actively scanning (Truax 2001)
Background listening: Passive listening with some potential to recall aspects of soundscape (Truax 2001)
Navigational listening: To use sound cues to localise oneself and navigate around a space (Grimshaw 2007)
Theatre listening: Active interpretation of sound but no agency to interact directly (Rebelo et al. 2008)
Museum listening: Some agency to interact with sound within a controlled and fixed space (Rebelo et al. 2008)
City listening: Greater agency to interact with sound within an uncontrolled space (Rebelo et al. 2008)
Causal listening: To identify the sound source object and/or event (Chion 2012; Tuuri & Eerola 2012)
Semantic listening: To interpret discrete meaning (e.g., an instruction) (Chion 2012; Tuuri & Eerola 2012)
Reduced listening: To analyse the characteristics of the sound itself (Chion 2012; Tuuri & Eerola 2012)
Reflexive listening: Pre-attentive bodily response (e.g., jump in response to sudden sound) (Tuuri & Eerola 2012)
Kinaesthetic listening: Pre-attentive sense of motion evoked by sound (Tuuri & Eerola 2012)
Connotative listening: Free-form associations immediately associated with sound (Tuuri & Eerola 2012)
Empathetic listening: To infer aspects of the emotional state of the source (Tuuri & Eerola 2012)
Functional listening: To interpret a sense of a sound's meaning/purpose/function (Tuuri & Eerola 2012)
Critical listening: To apply a value judgement to the quality/appropriateness of a sound (Tuuri & Eerola 2012)
Analytic listening: To analyse discrete properties of a sound within a focussed point in time (Bijsterveld 2019)
Synthetic listening: To analyse the general properties of a sound over a wider period of time (Bijsterveld 2019)
Interactive listening: To interact with the source and/or environment then analyse the response (Bijsterveld 2019)

Whilst taxonomies
such as this are attributable to everyday, more general listening, other works
have chosen to focus upon more specialised contexts. Bijsterveld (2019), for
example, considers listening in its professional usage across science, medicine
and engineering, presenting a factorial typology of listening modes based on
purpose (monitoring, diagnostics and exploration) and method. The method,
described as ‘ways of listening’, is separated into three modes that broadly
overlap with the notion of reduced listening, in which the listener considers
more objective acoustic features of a sound. Bijsterveld’s three modes are syn-
thetic (general acoustic impressions of a soundscape or individual sound over
time), analytic (specific acoustic properties or features of an individual sound
or soundscape at a limited point in time) and interactive listening (acoustic
changes directly attributable to an intended action by the listener).
The aforementioned modes of listening create the sense that our relation-
ship with sound is based on function, with the fine details of the sound signal’s
acoustic properties, the listener’s physiological and psychological state, and
the surrounding environment collectively determining how we attend to, per-
ceive and respond to sound. As mentioned at the beginning of this chapter, XR
presents the user with a world. The world may be largely digital, predominantly
physical, or a balance of the two; but it is a world, nevertheless, that invites
interaction, presents tasks and responds with feedback. As such, the aforemen-
tioned theories are as relevant to XR worlds as they are to a digital-free world,
and therefore should be considered carefully by XR sound designers.

5.2 Interaction, diegesis and virtual acoustic ecologies


Interactivity within the listening experience is a crucial area to unpick further
when we consider XR systems, but it is worth noting that sound interaction is
certainly not limited to digital technology. For example, Rebelo and colleagues’
(2008) listening in place typology distinguishes between three listening modes
based on the nature of the listener’s interaction with the soundscape. Each
of these modes is named after its archetypal example, making the labels
both metaphorical and literal. The ‘Theatre of Listening’ applies to scenarios
in which the projection of the sound is known to the designer, as the audience’s
position is both known and constant. In this mode, the listener is a specta-
tor and the experience is homogenous (comparable between each listener and
repeat audition). By contrast, the ‘Museum of Listening’ presents partially-
fragmented projection, as the listener is free to move around the space as they
wish, facilitating some heterogeneity of experience, but the boundaries and
characteristics of the space are similar to the theatre, as they are both known
and controllable. Finally, the ‘City of Listening’ affords fully fragmented pro-
jection and heterogeneity of experience, with obscure boundaries and uncon-
trolled characteristics.
The nature of sound within reality, such as it is, can be expressed across the
four dimensions of length, breadth, depth and time. Going back to Milgram
and Kishino’s (1994) continuum, the types of XR are defined in relation to
reality, but do not exist independent of it; they all exist within it. However,
through our imaginative processes, we are able to conceptualise other worlds
to the extent that we are even able, in some instances, to feel more physically
present in an ‘other’ world than in our own. Consequently, XR effectively adds
a fifth dimension to any framework: a dimension that describes the nature and
relationship of multiple realities. This dimension can be explored by way of
diegesis.
Dating back to Plato’s Republic as a means of conceptualising the rela-
tionship between physical and narrative worlds (Halliwell 2014), diegesis
has found common usage in discourse concerning literature, film and, more
recently, virtual worlds. Whilst the basic interpretation of diegesis presents a
binary distinction, diegetic (of the physical world, existing in space) and non/
extra-diegetic (not of the physical world, existing in para-space), the relation-
ship between the two is arguably more complex and often opaque, particularly
in terms of sound and interactive media. At a broader level of sound-classifica-
tion, diegesis typically differentiates auditory icons (sounds that signify their
natural/mechanical source—e.g., a gunshot) from earcons (sounds, usually
synthesised, that signify something other than their source—e.g., a user-inter-
face ‘window opening’ sound). Earlier digital games more cleanly differenti-
ated icons and earcons in diegetic terms, with the former representing objects
and events within the game world, whilst the latter provided what was effec-
tively an augmented reality layering of information over the game world, typi-
cally as a heads-up display or in-game menu. Contemporary games and XR are
increasingly obscuring these distinctions somewhat by way of diegetic design:
techniques that seek to reduce overlays and extra-diegetic content to present
everything as existing within the game world.
Extending beyond a taxonomical depiction of sound, Grimshaw’s (2007)
Acoustic Ecology of the First-Person Shooter presents a more ontological
framework, identifying numerous relationships between the components of
the system which exist fundamentally because the system is interactive. At the
broadest level, Grimshaw’s Ecology describes a causal loop between game,
player and soundscape with no explicit beginning or end. The player influences
the game by way of haptic input, which in turn determines the soundscape
through sonification of the updated game state. The soundscape completes the
loop by influencing the player. This ecology emphasises numerous ways in
which diegesis can affect the nature of listening. Here, choraplast and topo-
plast functions of sound are both concerned with matters of space, but distin-
guished by diegesis, with choraplast function relevant to the ‘resonating space’
in which the player is physically placed and topoplast function connected to
the virtual para-space in which the player’s avatar is placed. Matters of time
are addressed by chronoplast and aionoplast functions. These terms can also
be separated by way of diegesis, with chronoplast describing the function of
expressing more discrete temporal qualities of the game to which the player
needs to respond, whilst aionoplast denotes the setting, and expresses temporal
qualities such as historical period, which exist near-exclusively in the virtual
world and present less diegetic overlap.
The four functions of sound outlined here are, at least initially, concerned
with cognitive player interaction but not physical interaction. Grimshaw’s
Ecology addresses this with three further functions that collectively form
navigational listening: attractors, retainers, and connectors. These functions
still fit within the broader matters of space and time, but incorporate player-
action, specifically navigation, with attractors encouraging the player to move
towards a certain point, retainers encouraging them to remain in their cur-
rent position and connectors providing feedback to the player, confirming their
movement from one point to another. Additionally, signal sounds also func-
tion as means of directing player-action, but more broadly, extending beyond
navigation to include any interactive affordance within the game. Grimshaw’s
Ecology also incorporates qualities of sound, specifically causality (referring
to the extent to which a sound is perceived to correspond to the dynamic physi-
cal properties of its source object/event), indexicality (the extent to which, or ease
with which, a sound signifies something) and immersion. The latter of these has
become a buzzword in VR discourse, but its usage here describes a cognitive
and affective connection between the player and the game, facilitated in part by
two further qualities of the sound design: challenge-based immersion (sounds
that evoke player-responses that require cognitive and/or physical skill) and
imaginative immersion (sounds that engage imaginative processes to connect
the player to their character and the virtual world). The latter of these can be
further separated into proprioceptive-immersive (sounds directly connected/
emanating from the player-character/avatar—e.g., audible heart-beat) and
exteroceptive-immersive (sounds external to the character that contextualise
them within the virtual environment—e.g., the footsteps, neighs and whinnies
of the character’s horse that reinforce the role of cowboy).
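As a simple illustration of how an attractor might be realised, the sketch below drives the gain of a localised cue from the listener's distance to its source, so that the cue pulls the player towards a point of interest. The linear attenuation curve and the function signature are assumptions made for this example; in practice a game engine's own spatialiser would usually perform the attenuation.

# Illustrative sketch (assumed attenuation curve): an 'attractor' cue whose
# gain rises as the listener approaches its source.
import math

def attractor_gain(listener_pos, source_pos, max_distance=30.0):
    """Distance-based gain so the cue grows louder as the player closes in."""
    dx, dy, dz = (s - l for l, s in zip(listener_pos, source_pos))
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)
    return max(0.0, 1.0 - distance / max_distance)  # 1.0 at the source, 0.0 out of range

print(attractor_gain((0.0, 0.0, 0.0), (10.0, 0.0, 0.0)))  # ~0.67, still some way off
print(attractor_gain((9.0, 0.0, 0.0), (10.0, 0.0, 0.0)))  # ~0.97, almost there

A retainer could use a similar curve centred on the player's current area, and a connector might simply confirm each transition with a short one-shot sound.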
Complementary to, but also extending beyond the concepts discussed so far,
Collins (2013) emphasises the importance of interactivity to understanding our
relationship with sound in digital games by presenting a non-linear model of
sonic interactivity. Within this model, multi-modal, interpersonal and physical
components all feed into our psychological interaction with sound during play.
Importantly, Collins extends beyond the boundaries of playing the game to
consider sociocultural, interpersonal and physical interactions within the so-
called ‘meta-game’ that can also influence a player’s relationships with game
sound. Additionally, several key aspects of game sound documented by Collins
that resonate with XR sound design include disembodied sound (intentionally
separating the sound from its source, typically to raise tension through uncer-
tainty [schizophonia]), synchresis (integrating sound with image with intent
to create a congruent, incongruent or neutral composite effect) and kineson-
ics (similar to synchresis, but addressing the integration of sound with player
action).
It would be understandable to find the theories of sound, such as the modes
of listening, complex enough in the context of a physical world, without add-
ing further dimensionality with an extended reality world. Blending the physi-
cal with the digital, the fictive with the non-fictive, and the real with the unreal
is liable to make you feel something of a ‘tumbling down the rabbit hole’ sen-
sation. Whilst there are no easy or perfect answers, diegesis is arguably an
important tool. Understanding the relationship between the virtual and physi-
cal worlds within a particular XR experience will help the sound designer to
‘position’ the listener, determine the nature of the interactivity and understand
the functions that the composition of their soundscape needs to support.

6 Applying motion to extended reality sound


As hinted at in the title of the previous section, this chapter does not aim to
present a comprehensive new theoretical framework of XR sound, but rather
to bring together knowledge from some of the leading sources to move
us a step further towards that ambition. Figure 8.1 illustrates this first step within
a taxonomy that may hopefully serve as a means of at-a-glance inspiration.
Relating back to our earlier discussion on the connections between sound
and human motion, reviewing the taxonomy clearly emphasises the poten-
tial of sound driven by motion-tracking data to contribute to a deeper user
experience. For example, consider the use of a virtual Geiger counter. In this
instance, the user manipulates their full body to sonically scan the virtual
environment (analytic listening) using the auditory feedback to identify their
relative position to a target, aiding their movement towards it (navigational
listening). The fine rotational motions of their hand intentionally manipulate
the sound to provide usable feedback (interactive listening). Of course, the
target may not necessarily remain static, with its sudden movement surprising
the user (reflexive listening) and prompting a pre-attentive directional change
(kinaesthetic listening). In this example, the designer may wish to tweak the
experiential quality of the sound. For instance, they could constrain the tempo
range of the auditory feedback to intentionally make it more difficult for the
user to accurately track the target (reduced causality). They may also make the
sound responsive to non-target objects, creating the potential for the user to
accidentally track the wrong thing (reduced indexicality).
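A minimal sketch of such a Geiger-counter interaction follows. The mapping from distance and hand orientation to click rate, and the decoy weighting, are assumptions introduced for this example rather than a published design.

# Illustrative sketch (assumed mapping): sonifying proximity and hand
# orientation as a Geiger-counter click rate.
import math

def geiger_click_rate(hand_pos, hand_dir, objects, decoy_weight=0.3,
                      max_rate_hz=40.0, falloff=5.0):
    """Return a click rate (Hz) from the tracked hand pose and scene objects.

    objects: list of (position, is_real_target) tuples.
    decoy_weight: how strongly non-targets leak into the reading.
    """
    rate = 0.0
    for pos, is_real in objects:
        offset = [p - h for h, p in zip(hand_pos, pos)]
        distance = math.sqrt(sum(o * o for o in offset)) or 1e-6
        unit = [o / distance for o in offset]
        # Facing factor: pointing the hand at the object sharpens the reading.
        facing = max(0.0, sum(u * d for u, d in zip(unit, hand_dir)))
        weight = 1.0 if is_real else decoy_weight
        rate += weight * facing * max_rate_hz * math.exp(-distance / falloff)
    return rate

# Hand at the origin pointing along +x; real target ahead, decoy off to the side:
print(geiger_click_rate((0, 0, 0), (1, 0, 0),
                        [((2, 0, 0), True), ((0, 3, 0), False)]))

Here, compressing max_rate_hz plays the role of the constrained tempo range described above (reduced causality), whilst raising decoy_weight lets non-targets leak into the reading (reduced indexicality).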
For another example, consider the use of head-tracking data to drive sound
relevant to the characteristics of the player avatar, such as the chainmail head-
dress of a knight. Here we are utilising proprioceptive immersion to embody
the player within the virtual world, whilst also functioning as an aionoplast
by persistently reinforcing the historical period by way of a sound that clearly
signifies that moment in time. In an even more multi-layered example, we may
instead embody our player as a robot, their head and arm motion driving a
series of sounds to reflect those movements. Here, in addition to the elements
in the chainmail example, the robot sounds could be modulated based on dam-
age to specific parts of the player-avatar, with more jarring or discontinuous
sound heard when a part of the robot is damaged, thereby engaging semantic
and interactive listening as the sound provides initial feedback that prompts
further movement to diagnose the location of the damage.
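A comparably minimal sketch of the robot example is given below, mapping a tracked joint's angular speed and a per-part damage value onto a handful of sound parameters. The parameter names and ranges are assumptions made for this example only.

# Illustrative sketch (assumed parameters): joint motion drives a servo-like
# sound, and per-part damage makes the result more jarring.
def robot_servo_params(joint_speed_rad_s, damage, max_speed=6.0):
    """Map a joint's angular speed and damage level (0-1) to sound parameters."""
    speed = min(abs(joint_speed_rad_s) / max_speed, 1.0)
    return {
        "gain": 0.2 + 0.8 * speed,          # louder whine for faster movement
        "pitch_hz": 180.0 + 600.0 * speed,  # higher whine for faster movement
        "grit": damage,                     # distortion amount rises with damage
        "stutter_prob": 0.5 * damage,       # damaged parts drop out intermittently
    }

# An undamaged head joint versus a damaged elbow joint, both moving briskly:
print(robot_servo_params(3.0, damage=0.0))
print(robot_servo_params(3.0, damage=0.7))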
These examples are almost certainly just scraping the surface, and there
remains a wealth of innovative sound design techniques waiting to be discov-
ered, combined, refined and played with. Then, maybe, when a not-too-distant
future headline proclaims the next big leap in XR, they’ll be talking about
sound design.
Figure 8.1 Initial taxonomy of XR sound
[The figure arranges its content under six headings (XR System, Designer-sound, XR-sound, Kinesonics, User-sound and User), covering the XR continuum, application and mechanics of the system; sonic attachment, diegesis, design functions and auditory icons/earcons; sound class, listening in place and theories of what and where sound is; multimodality and congruence/incongruence; experiential qualities and listening modes; and the user's embodiment, perceptual factors and wider influences.]


Notes
1 Audience of the future challenge: www.ukri.org/our-work/our-main-funds/
industrial-strategy-challenge-fund/artificial-intelligence-and-data-economy/
audience-of-the-future-challenge/
2 For those of you who are curious, the Bellack et al. (1979) study found that whilst
female participants largely interacted similarly between groups, male participants
significantly altered their behaviour when they knew they were role-playing.
3 https://unity.com/solutions
4 www.unrealengine.com/en-US/feed/spotlights/
5 Quantum music is indeed a real thing and is worth reading into for those of you
who are interested: www.technologyreview.com/2015/04/15/168638 (accessed
03.03.2021)
6 Inside-out HMD tracking broadly describes any system that measures orientation
and/or location by way of sensors within the headset. These could be inertia meas-
urement units (such as gyroscopes and accelerometers) or camera-based room map-
ping (using computer vision algorithms for physical landmark detection).

References
Alizadehsalehi, S., Hadavi, A., & Huang, J. C. (2020). From BIM to extended reality in
AEC industry. Automation in Construction, 116, 103254.
Bellack, A. S., Hersen, M., & Lamparski, D. (1979). Role-play tests for assessing social
skills: Are they valid? Are they useful? Journal of Consulting and Clinical Psychol-
ogy, 47(2), 335.
Bijsterveld, K. (2019). Sonic Skills: Listening for Knowledge in Science, Medicine and
Engineering (1920s-Present) (p. 174). Springer Nature, Cham.
Chatzidimitris, T., Gavalas, D., & Michael, D. (2016, April 18–20). SoundPacman:
Audio augmented reality in location-based games. In 2016 18th Mediterranean Elec-
trotechnical Conference (MELECON) (pp. 1–6). IEEE, Lemesos, Cyprus.
Chion, M. (2012). The three listening modes. The Sound Studies Reader, 48–53.
Collins, K. (2013). Playing with Sound: A Theory of Interacting with Sound and Music
in Video Games. MIT Press.
Çöltekin, A., Lochhead, I., Madden, M., Christophe, S., Devaux, A., Pettit, C., . . . Hed-
ley, N. (2020). Extended reality in spatial sciences: A review of research challenges
and future directions. ISPRS International Journal of Geo-Information, 9(7), 439.
Cox, T. J. (2008). Scraping sounds and disgusting noises. Applied Acoustics, 69(12),
1195–1204.
Crawford, K. (2009). Following you: Disciplines of listening in social media. Con-
tinuum, 23(4), 525–535.
D’Amico, G., Del Bimbo, A., Dini, F., Landucci, L., & Torpei, N. (2010) Natural
human—computer interaction. In: Shao, L., Shan, C., Luo, J., & Etoh, M. (eds.),
Multimedia Interaction and Intelligent User Interfaces. Advances in Pattern Recog-
nition. Springer, London.
D’Auria, D., Di Mauro, D., Calandra, D. M., & Cutugno, F. (2015). A 3D audio aug-
mented reality system for a cultural heritage management and fruition. Journal of
Digital Information Management, 13(4).
Donohue, W. A., Diez, M. E., & Hamilton, M. (1984). Coding naturalistic negotiation
interaction. Human Communication Research, 10(3), 403–425.
Doolani, S., Wessels, C., Kanal, V., Sevastopoulos, C., Jaiswal, A., Nambiappan, H., &
Makedon, F. (2020). A review of extended reality (XR) technologies for manufactur-
ing training. Technologies, 8(4), 77.
Fast-Berglund, Å., Gong, L., & Li, D. (2018). Testing and validating Extended Reality
(xR) technologies in manufacturing. Procedia Manufacturing, 25, 31–38.
Flavián, C., Ibáñez-Sánchez, S., & Orús, C. (2019). The impact of virtual, augmented
and mixed reality technologies on the customer experience. Journal of Business
Research, 100, 547–560.
Frank, R. J. (2000, August 27 – September 1). Temporal elements: A cognitive system of
analysis for electro-acoustic music. In International Computer Music Conference Pro-
ceedings (Vol. 2000). Michigan Publishing, University of Michigan Library, Berlin.
Goodwin, S. N. (2019). Beep to Boom: The Development of Advanced Runtime Sound
Systems for Games and Extended Reality. Routledge, New York.
Grimshaw, M. N. (2007). The acoustic ecology of the first-person shooter (Doctoral
dissertation, The University of Waikato).
Haga, E. (2008). Correspondences between music and body movement (Doctoral dis-
sertation, University of Oslo).
Halliwell, S. (2014). Diegesis—mimesis. Handbook of Narratology, 129–137.
Helsel, S. (1992). Virtual reality and education. Educational Technology, 32(5), 38–42.
Hong, D., Lee, T. H., Joo, Y., & Park, W. C. (2017, February). Real-time sound propa-
gation hardware accelerator for immersive virtual reality 3D audio. In Proceedings
of the 21st ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games
(pp. 1–2). ACM, New York.
Johnson, D., Damian, D., & Tzanetakis, G. (2019). OSC-XR: A toolkit for extended reality
immersive music interfaces. http://smc2019.uma.es/articles/S3/S3_04_SMC2019_
paper.pdf (accessed 04.03.2021)
Kaghat, F. Z., Azough, A., Fakhour, M., & Meknassi, M. (2020). A new audio aug-
mented reality interaction and adaptation model for museum visits. Computers &
Electrical Engineering, 84, 106606.
Klinker, G., Stricker, D., & Reiners, D. (1998, June). The use of reality models in aug-
mented reality applications. In European Workshop on 3D Structure from Multiple
Images of Large-Scale Environments (pp. 275–289). Springer, Berlin, Heidelberg.
Krasnor, L. R., & Rubin, K. H. (1983). Preschool social problem solving: Attempts and
outcomes in naturalistic interaction. Child Development, 1545–1558.
Laurel, B., & Mountford, J. (1990). The Art of Human-Computer Interface Design.
Addison-Wesley Longman, Boston.
Linqin, C., Shuangjie, C., Min, X., Jimin, Y., & Jianrong, Z. (2017). Dynamic hand ges-
ture recognition using RGB-D data for natural human-computer interaction. Journal
of Intelligent & Fuzzy Systems, 32(5), 3495–3507.
Luck, M., & Aylett, R. (2000). Applying artificial intelligence to virtual reality: Intel-
ligent virtual environments. Applied Artificial Intelligence, 14(1), 3–32.
Milgram, P., & Kishino, F. (1994). A taxonomy of mixed reality visual displays. IEICE
Transactions on Information and Systems, 77(12), 1321–1329.
Morawitz, F. (2018, March). Quantum: An art-science case study on sonification and
sound design in virtual reality. In 2018 IEEE 4th VR Workshop on Sonic Interactions
for Virtual Environments (SIVE) (pp. 1–5). IEEE.
Norton, R. (1972). What is virtuality? The Journal of Aesthetics and Art Criticism,
30(4), 499–505.
Nymoen, K., Godøy, R. I., Jensenius, A. R., & Torresen, J. (2013). Analyzing corre-
spondence between sound objects and body motion. ACM Transactions on Applied
Perception (TAP), 10(2), 1–22.
O’Callaghan, C. (2011). Lessons from beyond vision (sounds and audition). Philosoph-
ical Studies, 153(1), 143–160.
Orcutt, J. D., & Anderson, R. E. (1974). Human-computer relationships: Interactions
and attitudes. Behavior Research Methods & Instrumentation, 6(2), 219–222.
Pajala-Assefa, H., & Erkut, C. (2019, October). A study of movement-sound within
extended reality: Skeleton conductor. In Proceedings of the 6th International Confer-
ence on Movement and Computing (pp. 1–4). ACM, New York.
Plouffe, G., Cretu, A. M., & Payeur, P. (2015, October). Natural human-computer inter-
action using static and dynamic hand gestures. In 2015 IEEE International Sympo-
sium on Haptic, Audio and Visual Environments and Games (HAVE) (pp. 1–6). IEEE.
Poerio, G. L., Blakey, E., Hostler, T. J., & Veltri, T. (2018). More than a feeling: Auton-
omous sensory meridian response (ASMR) is characterized by reliable changes in
affect and physiology. PloS One, 13(6), e0196645.
Raghuvanshi, N., & Snyder, J. (2018). Parametric directional coding for precomputed
sound propagation. ACM Transactions on Graphics (TOG), 37(4), 1–14.
Rautaray, S. S., & Agrawal, A. (2012). Real time multiple hand gesture recognition
system for human computer interaction. International Journal of Intelligent Systems
and Applications, 4(5), 56–64.
Rebelo, P., Green, M., & Hollerweger, F. (2008). A typology for listening in place. In
Proceedings of the 5th International Mobile Music Workshop (pp. 15–18).
Rice, T. (2015). Listening. In: Novak, D., & Sakakeeny, M. (eds.), Keywords in Sound.
Duke University Press, Durham, NC.
Sanchez, G. M. E., Van Renterghem, T., Sun, K., De Coensel, B., & Botteldooren,
D. (2017). Using Virtual Reality for assessing the role of noise in the audio-visual
design of an urban public space. Landscape and Urban Planning, 167, 98–107.
Savioja, L., Huopaniemi, J., Lokki, T., & Väänänen, R. (1999). Creating interactive
virtual acoustic environments. Journal of the Audio Engineering Society, 47(9),
675–705.
Seeger, A. (1994). Music and dance. Companion Encyclopedia of Anthropology,
686–705.
Serafin, S., Erkut, C., Kojs, J., Nilsson, N. C., & Nordahl, R. (2016). Virtual reality
musical instruments: State of the art, design principles, and future directions. Com-
puter Music Journal, 40(3), 22–40.
Serafin, S., Geronazzo, M., Erkut, C., Nilsson, N. C., & Nordahl, R. (2018). Sonic inter-
actions in virtual reality: State of the art, current challenges, and future directions.
IEEE Computer Graphics and Applications, 38(2), 31–43.
Skult, N., & Smed, J. (2020). Interactive storytelling in extended reality: Concepts for
the design. Game User Experience and Player-Centered Design, 449–467.
Slater, M., Steed, A., & Usoh, M. (1995). The virtual treadmill: A naturalistic meta-
phor for navigation in immersive virtual environments. In Virtual Environments’ 95
(pp. 135–148). Springer, Vienna.
Smith, S. L., & Goodwin, N. C. (1970). Computer-generated speech and man-computer
interaction. Human Factors, 12(2), 215–223.
Song, Y., Demirdjian, D., & Davis, R. (2012). Continuous body and hand gesture rec-
ognition for natural human-computer interaction. ACM Transactions on Interactive
Intelligent Systems (TiiS), 2(1), 1–28.
Sterne, J. (2003). The Audible Past: Cultural Origins of Sound Reproduction. Duke
University Press, Durham, NC.
Summers, C., Lympouridis, V., & Erkut, C. (2015, March). Sonic interaction design
for virtual and augmented reality environments. In 2015 IEEE 2nd VR Workshop on
Sonic Interactions for Virtual Environments (SIVE) (pp. 1–6). IEEE.
Székely, G., & Satava, R. M. (1999). Virtual reality in medicine. BMJ: British Medical
Journal, 319(7220), 1305.
Treu, S. (1976, October). A framework of characteristics applicable to graphical user-
computer interaction. In Proceedings of the ACM/SIGGRAPH Workshop on User-
oriented Design of Interactive Graphics Systems (pp. 61–71). ACM, New York.
Truax, B. (2001). Acoustic Communication. Greenwood Publishing Group, Santa Bar-
bara, CA.
Tuuri, K., & Eerola, T. (2012). Formulating a revised taxonomy for modes of listening.
Journal of New Music Research, 41(2), 137–152.
Vi, S., da Silva, T. S., & Maurer, F. (2019, September). User experience guidelines for
designing hmd extended reality applications. In IFIP Conference on Human-Computer
Interaction (pp. 319–341). Springer, Cham.
Vorländer, M., Schröder, D., Pelzer, S., & Wefers, F. (2015). Virtual reality for architec-
tural acoustics. Journal of Building Performance Simulation, 8(1), 15–25.
