
Acta Acustica 2022, 6, 47
© The Author(s), Published by EDP Sciences, 2022
https://doi.org/10.1051/aacus/2022040
Available online at: https://acta-acustica.edpsciences.org

REVIEW ARTICLE

Spatial audio signal processing for binaural reproduction of recorded acoustic scenes – review and challenges

Boaz Rafaely1,*, Vladimir Tourbabin2, Emanuel Habets3, Zamir Ben-Hur2, Hyunkook Lee4, Hannes Gamper5, Lior Arbel1, Lachlan Birnie6, Thushara Abhayapala6, and Prasanga Samarasinghe6

1 School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
2 Reality Labs Research, Meta, Redmond, WA 98052, USA
3 International Audio Laboratories Erlangen (a joint institution of the Friedrich Alexander University Erlangen-Nürnberg (FAU) and Fraunhofer IIS), 91058 Erlangen, Germany
4 Applied Psychoacoustics Laboratory (APL), University of Huddersfield, Huddersfield HD1 3DH, United Kingdom
5 Audio and Acoustics Research Group, Microsoft Research, Redmond, WA 98052, USA
6 Audio and Acoustic Signal Processing Group, The Australian National University, Canberra, Australian Capital Territory 2601, Australia

Received 25 March 2022, Accepted 8 September 2022

* Corresponding author: [email protected]
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract – Spatial audio has been studied for several decades, but has seen much renewed interest recently due to advances in both software and hardware for capture and playback, and the emergence of applications such as virtual reality and augmented reality. This renewed interest has led to the investment of increasing efforts in developing signal processing algorithms for spatial audio, both for capture and for playback. In particular, due to the popularity of headphones and earphones, many spatial audio signal processing methods have dealt with binaural reproduction based on headphone listening. Among these new developments, processing spatial audio signals recorded in real environments using microphone arrays plays an important role. Following this emerging activity, this paper aims to provide a scientific review of recent developments and an outlook for future challenges. This review also proposes a generalized framework for describing spatial audio signal processing for the binaural reproduction of recorded sound. This framework helps to understand the collective progress of the research community, and to identify gaps for future research. It is composed of five main blocks, namely: the acoustic scene, recording, processing, reproduction, and perception and evaluation. First, each block is briefly presented, and then a comprehensive review of the processing block is provided. This includes topics from simple binaural recording to Ambisonics and perceptually motivated approaches, which focus on careful array configuration and design. Beamforming and parametric-based processing afford more flexible designs and shift the focus to processing and modeling of the sound field. Then, emerging machine- and deep-learning approaches, which take a further step towards flexibility in design, are described. Finally, specific methods for signal transformations such as rotation, translation and enhancement, enabling additional flexibility in reproduction and improvement in the quality of the binaural signal, are presented. The review concludes by highlighting directions for future research.

Keywords: Audio signal processing, Spatial audio, Virtual reality, Augmented reality, Array processing

1 Introduction

Binaural reproduction of acoustic scenes refers to the playback of sound at the listener's ears in a way that recreates a real-world listening experience of the scene. Ideally, the sound scene reproduced at another time and/or place should be perceptually indistinguishable from the real scene. Some important examples include capture and subsequent reproduction of musical performances or social events, as well as real-time video conferencing with immersive spatial audio.

Headphone-based playback of binaural sound, dating back to the 19th century [1], has become highly popular in recent decades with the availability of personal headphones. This also led to the rise in popularity of headphone-based binaural reproduction, and particularly, the reproduction of recorded acoustic scenes. The latter was initially based on binaural recording, using microphones placed at the ears of a manikin [2].
While providing an impressive spatial audio experience, binaural recording generally does not support listener individualization and head tracking, which are important for creating a realistic acoustic scene through headphone listening [3, 4]. The flexibility required for individualization and head-tracking was later obtained with the soundfield microphone and the Ambisonics spatial audio format [5]; these greatly advanced the recording and reproduction of real sound scenes through the separation of the recorded sound, as captured by the microphone, from the effect of the head on the signal at the ears, represented by the head-related transfer function (HRTF). Ambisonics was then extended to high-order Ambisonics [6–9], recorded by spherical microphone arrays [10, 11], providing higher spatial detail by supporting more recording channels. The seamless incorporation of the HRTF into Ambisonics generated a remarkable listening experience within an elegant mathematical setting. Indeed, Ambisonics and HRTF have been the topic of extensive research in the past two decades, supporting a wide range of applications and research areas. For example, listening to sounds generated in simulated or measured acoustic spaces has been studied under auralization [12], investigating the listening experience from a human hearing perspective [13, 14]. The theory and practice of spatial audio recording and reproduction [15], and particularly Ambisonics [16], have been established, supported by advancements in spherical microphone array design and processing [17, 18]. New approaches to spatial audio processing and coding are still being proposed [19–21], facilitated by improved ways for headphone listening [22, 23]. However, in spite of these impressive advances over the past few decades, new emerging applications raise entirely new challenges for spatial audio in general, and binaural reproduction of recorded scenes in particular.

A set of such emerging technologies that provides a new and exciting platform for binaural reproduction applications is virtual reality (VR), augmented reality (AR), and mixed reality (MR) [12, 24, 25]. These originated from gaming, and have now expanded to multimedia, education, personal communication, and virtual meetings, among many other areas. The new platforms introduce a unique set of challenges imposed by the fact that, in many cases, audio is captured by microphones that are embedded in consumer devices, which are often wearable. This is particularly challenging for the reproduction of recorded acoustic scenes. The first challenge is space and hardware limitations, which have led to the deployment of a small number of microphones in arbitrary arrangements, often with unfavorable spatial diversity. Examples of these devices are mobile phones, laptops, smart speakers and VR headsets. These devices also introduce other challenges, imposed both by the motion of wearable and mobile arrays during signal acquisition [26], which hinders a stable listening experience of the reproduced scene, and by the low-latency constraints of applications involving real-time interactions, such as virtual video conferencing. In addition, acoustic scenes recorded by these devices may contain environmental noise and interfering sound, superimposed on the desired sound such as speech and music, which may degrade a virtual meeting, for example.

In synergy with the emerging technologies and applications, new directions in spatial audio signal processing are evolving that attempt to overcome the challenges mentioned above, and more. The aim of this review paper is to provide an updated account of these emerging methods, published in the past few years, and to propose directions for future research. The paper first introduces a generalized framework for binaural reproduction of recorded acoustic scenes, then focuses on processing approaches, and concludes with prospects for future research. Regarding processing approaches, this paper first presents approaches that consider the microphone array as the dominant design element, and therefore require very specific microphone array designs. In binaural recording, two microphones are placed at the ears of a dummy head, while in Ambisonics, a dedicated array must be designed to capture spherical harmonics signals. In perceptually motivated arrays, the microphones and their arrangement are carefully configured by design to produce perceptually useful signals. Next, beamforming-based processing makes a step forward by lifting the constraints on array configuration, thus allowing a flexible design. Spatial filters, or beamformers, designed specifically for the array at hand, form the basis of the approach. This is then followed by parametric approaches, where a further step is made from array-focused methods to methods that exploit information in the sound field. The information is modeled and the model parameters are estimated, providing the basis for the spatial reproduction. Finally, machine- and deep-learning approaches provide an even more flexible framework that can exploit information both in the array configuration and in the sound field. Transformations such as rotation, translation and signal enhancement, tailored to the signal processing approaches, are then presented, followed by conclusions and an outlook for the future.

2 Overview

This section presents an overview of the entire process comprising binaural reproduction of recorded acoustic scenes. A generalized framework that encapsulates this process is first presented, from the acoustic scene being recorded to the perception and evaluation of the reproduced spatial audio. Each part of this process is reviewed in the following subsections, while processing approaches are reviewed in greater detail in the subsequent sections.

2.1 Generalized framework

The generalized framework of spatial audio signal processing for the binaural reproduction of recorded acoustic scenes is presented in Figure 1. The process presented in the figure starts from the acoustic scene – the real-world environment within which the sound is generated. This could be a concert hall with music sounds, an office with speech sounds, an outdoor environment with street sounds, and other scenes.
A recording device, such as a microphone array of any type that is positioned in the scene, produces recorded audio signals. The recording device can be anything from a dummy head directly recording binaural signals, to spherical arrays or arrays of other configurations. Processing is then applied to the recorded audio signals in preparation for reproduction; this stage is the main focus of this review paper and includes a wide range of spatial audio signal processing methods, from Ambisonics, through parametric audio, to deep learning. Note that Figure 1 shows another optional layer behind processing, labeled transformations, which includes enhancement, rotation and translation. After processing, the spatial audio signal is ready for reproduction – this paper focuses on headphone reproduction, which is widely used in many applications. Finally, the headphone signals are perceived by listeners, or can be evaluated objectively; this is the final block of the framework, and is labeled perception and evaluation. More details on each block of the framework are presented in the following subsections.

Figure 1. Generalized framework of spatial audio signal processing for binaural reproduction of recorded scenes.

2.2 Acoustic scenes

Spatial sound recording and binaural reproduction have found numerous applications in a large variety of acoustic scenes, ranging from relatively small indoor spaces to expansive outdoor areas. Indoor examples include offices and meeting rooms, where binaural reproduction has been employed for teleconferencing applications [27, 28]; these have recently received increased attention due to the growing popularity of VR and AR platforms and the rapid expansion of distance working and learning in response to the Covid pandemic. The acoustic source type of particular interest in this application is human speech. Another category of indoor acoustic scenes that has received significant attention in the past few decades is concert halls. The applications include recordings of music or other artistic performances [29], and perceptual assessment and comparison of concert hall sound [30, 31]. These applications are usually characterized by elevated reverberation, and the acoustic sources of interest are primarily musical instruments and human voices. Multiple outdoor applications have also been explored. For example, spatial recordings have been utilized to capture urban sounds, including traffic, subway stations, and social gatherings; these were used to facilitate perceptual soundscape studies using various reproduction methods [32, 33]. Finally, spatial sound recording methods have also been proposed for use in open outdoor environments to record nature sounds like waterfalls, birds, and wind [34]; these methods were utilized in applications related to art and entertainment [35].

2.3 Recording devices

A large variety of devices have been successfully employed for spatial sound capture. The function of the capture device is to record the essential spatial information that enables either physically accurate [33] or perceptually plausible [36] reproduction of the signals at the listener's ears. Probably the most straightforward recording device enabling binaural reproduction is the binaural microphone (see, for example, [37]), which can be placed on the head of a human subject [32] or on an acoustically designed binaural fixture [2, 38]. More complex microphone array systems have been proposed to improve spatial capture resolution and facilitate sound field manipulation. These include the B-format soundfield microphone array (comprised of four capsules located on the faces of a tetrahedron [39]), high-order spherical arrays [40] (which facilitate sound field decomposition and manipulation in the spherical harmonics domain [41, 42]), approaches that support flexible recording arrays [43, 44], and very large microphone-array systems with interpolation processing [45]. There also exist various perceptually-motivated microphone arrays (PMMAs), designed for capturing acoustic scenes. Whilst high-order arrays attempt to reconstruct the sound field in the reproduction process in a way that is as physically accurate as possible, perceptually motivated arrays focus on plausibly representing the sound field using psychoacoustic cues such as interchannel time- and level-differences and interchannel coherence [36]. The capture devices mentioned above enable various processing methods for enhancing and manipulating the sound field prior to reproduction, as described in Section 3.
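As an illustration of the tetrahedral B-format array mentioned above, the four capsule signals (known as A-format) are converted to the first-order components W, X, Y and Z. A sketch of one common convention, assuming ideal cardioid capsules labeled LFU (left-front-up), RFD (right-front-down), LBD (left-back-down) and RBU (right-back-up), and omitting gain normalization and capsule equalization, is:

\[
\begin{aligned}
W &= \mathrm{LFU} + \mathrm{RFD} + \mathrm{LBD} + \mathrm{RBU},\\
X &= \mathrm{LFU} + \mathrm{RFD} - \mathrm{LBD} - \mathrm{RBU},\\
Y &= \mathrm{LFU} - \mathrm{RFD} + \mathrm{LBD} - \mathrm{RBU},\\
Z &= \mathrm{LFU} - \mathrm{RFD} - \mathrm{LBD} + \mathrm{RBU}.
\end{aligned}
\]

Here W approximates an omnidirectional pressure signal, while X, Y and Z approximate the three orthogonal dipole components of the sound field.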
2.4 Processing

The processing block in Figure 1 transforms the recorded signals from the previous block into binaural signals ready for headphone reproduction in the following block. This aim can be achieved with a wide range of approaches and methods, from binaural recording, which directly produces a binaural signal, to methods such as Ambisonics and beamforming-based processing, which employ microphone arrays and more complex operations. This variety of methods is reviewed in more detail in the following sections, constituting the main part of this review paper. The methods include binaural recording, Ambisonics, perceptually motivated approaches, parametric processing, beamforming-based processing, machine- and deep-learning based methods, and transformations such as signal enhancement, translation and rotation of the listener's virtual position.

2.5 Reproduction

The reproduction block in Figure 1 converts the binaural signals back to sounds using electroacoustic transducers. When dedicated transducers are used for the left and right ears, there is no cross-talk between the left binaural signal and the right ear and vice versa, which allows for more direct control of the sound at the ears. The most common device for binaural playback of sound is the headphone [4, 46], which comes in different forms, including circumaural (over-the-ear), supra-aural (on-the-ear), earbud, in-ear, and bone-conducting. Some over-the-ear headphones use an open design, allowing audio leakage out of the earpieces and ambient sound leakage into the earpieces. Other headphones use a closed design to preclude leakage. When using headphones to play binaural signals, even though the real sound sources are the electroacoustic transducers at the ears, sounds can still be perceived outside the listener's head by carefully controlling the left and right ear signals. This phenomenon, known as sound externalization, contributes to the realistic perception of a virtual scene [47].

Another factor related to headphone reproduction that contributes to realistic perception is head-tracking, which stabilizes the perceived virtual scene despite the listener's head movements [3, 48, 49]. Head-tracking requires dedicated hardware, such as a head-mounted inertial measurement unit that operates in real time with limited latency [50, 51]. Finally, the frequency response of the headphone may affect perception, and so this response is often compensated for using headphone equalization [52–54].
headphones use a closed design to preclude leakage. When ioral studies in interactive virtual environments [78]. In a
using headphones to play binaural signals, even though recent study, various direct and indirect audio quality eval-
the real sound sources are the electroacoustic transducers uation methods were compared in virtual reality scenes of
at the ears, sounds can still be perceived outside the varying complexity [79]. It was found that rank-order
listener’s head by carefully controlling the left and elimination proved to be the fastest method, required the
right ear signals. This phenomenon, known as sound exter- least amount of repetitive motion, and yielded the highest
nalization, contributes to the realistic perception of a discrimination between spatial conditions. Scene complex-
virtual scene [47]. ity was found to be a main effect within results, while
Another factor related to headphone reproduction that behavioral and task load index results imply more complex
contributes to realistic perception is head-tracking, which scenes, and interactive aspects of 6-DoF VR can impede
stabilizes the perceived virtual scene, despite the listener’s quality judgments. Recent perceptual studies [80, 81] also
head movements [3, 48, 49]. Head-tracking requires found that such a dynamic environment could lead to
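As a concrete example of such technical metrics, given left and right ear signals p_l(t) and p_r(t), the normalized interaural cross-correlation function and the IACC are commonly defined as

\[
\Psi(\tau) = \frac{\int p_l(t)\, p_r(t+\tau)\, \mathrm{d}t}{\sqrt{\int p_l^2(t)\, \mathrm{d}t \int p_r^2(t)\, \mathrm{d}t}}, \qquad
\mathrm{IACC} = \max_{|\tau| \leq 1\,\mathrm{ms}} |\Psi(\tau)|,
\]

with the ITD often estimated as the lag τ maximizing Ψ(τ), and the ILD as the ratio of left- and right-ear band energies in decibels; exact definitions vary across the cited studies.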
Perceptual evaluation has traditionally used a global attribute called "basic audio quality" or subjective preference. However, recent studies in spatial audio increasingly tend to evaluate different systems in terms of specific attributes (e.g., [61–66]). Examples of such attributes are sound source localization, externalization, coloration, apparent source width (ASW) and listener envelopment (LEV). More general measures for evaluating the overall perceptual accuracy, such as plausibility [67, 68] and authenticity [55, 69], have also been suggested. Once listening tests have been performed, comprehensive analysis could lead to a perceptual model that replaces further listening tests. Examples include localization and externalization models [70–74], and a surround sound quality model [75]. Moreover, machine-learning algorithms have also been suggested for the evaluation of spatial perception [76, 77]; this will be further discussed in Section 3.6.

While the auditory attributes stated above have traditionally been studied in the context of a static listener position and head orientation with a fixed perspective, recent developments in VR and AR require 6 degrees-of-freedom (6DoF), where the listener is free to rotate his/her head and also walk around in a virtual or real space. New tools have been developed to perform listening tests and behavioral studies in interactive virtual environments [78]. In a recent study, various direct and indirect audio quality evaluation methods were compared in virtual reality scenes of varying complexity [79]. It was found that rank-order elimination proved to be the fastest method, required the least amount of repetitive motion, and yielded the highest discrimination between spatial conditions. Scene complexity was found to be a main effect within the results, while behavioral and task-load-index results imply that more complex scenes and the interactive aspects of 6DoF VR can impede quality judgments. Recent perceptual studies [80, 81] also found that such a dynamic environment could lead to dramatic changes in the perceived reverberation, loudness, ASW and LEV, making evaluation much more challenging under such dynamic conditions.

3 Processing approaches

This section presents a review of methods associated with the processing block in Figure 1, providing the mapping from the captured microphone signals to binaural signals ready for listening.

3.1 Binaural recording

In binaural recording, microphones are placed at the ears of a dummy head, capturing the sound at the ears of a potential listener at the recording position. While binaural recordings have a long history [1], they are still widely used today, as they generate binaural signals ready for listening, without the need for further processing [4]. While an attractive option in spatial audio, binaural recording suffers from two main limitations, both related to the innate embedding of the HRTF in the recording. The first is that head-tracking is typically not possible, as the head position is captured in the recording. The second is that individualized HRTFs cannot be supported, as the signal embeds the HRTF of the dummy head. Solutions to the former exist, such as motion-tracked binaural recordings [82, 83], or binaural cue adaptation [84]; however, these are still limited in their accuracy and flexibility. These two limitations call for more flexible recording solutions, in which the sound field is recorded separately from the HRTF, which can then be integrated in post-processing. Such approaches are presented next.

3.2 Ambisonics

Ambisonics was first introduced in the 1970s as a way to record and reproduce spatial audio using four audio channels, denoted as first-order Ambisonics (FOA) [5, 85–87]. Around the late 1990s, the higher-order Ambisonics (HOA) technology, using a spherical harmonics formulation, emerged [6, 8, 9]. FOA and HOA were originally developed for loudspeaker array reproduction. In 1999, an approach for headphone reproduction of Ambisonics signals was introduced [88], using "virtual loudspeaker reproduction". Headphone reproduction using Ambisonics has been significantly advanced in the past decade as new applications have emerged (see Sect. 1). Specifically, a formulation in the spherical harmonics domain of binaural reproduction using Ambisonics signals, which also employed a spherical harmonics representation of the HRTF [89], was presented [7, 42, 90, 91]. The use of the spherical harmonics formulation has become popular in recent years, due to the possibilities for efficient processing in the spherical harmonics domain, the inherent separation of the sound field and the HRTF representations, and the ease of rotation of these representations, which is useful for head-tracking, for example [16, 92–95]. Figure 2 presents a general diagram for Ambisonics-based binaural reproduction, showing how the spherical harmonics representations of the sound field and the HRTF are combined to form the binaural signal.

Figure 2. A block diagram illustrating Ambisonics-based spatial audio processing, showing a spherical microphone array and operations of plane wave decomposition (PWD) and filtering by HRTF.
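This combination can be written compactly. As a sketch, using complex spherical harmonics (conjugation and normalization conventions vary across the cited formulations), the pressure at the left or right ear at wavenumber k follows from the order-N plane-wave density coefficients a_nm(k) of the sound field and the spherical harmonics coefficients H_nm^{l,r}(k) of the HRTF:

\[
p^{l,r}(k) = \sum_{n=0}^{N} \sum_{m=-n}^{n} a_{nm}(k)\, \tilde{H}^{l,r}_{nm}(k),
\qquad \tilde{H}^{l,r}_{nm} = (-1)^{m} H^{l,r}_{n(-m)},
\]

so that rendering reduces to a frequency-wise inner product between sound-field and HRTF coefficients (with real spherical harmonics the product is direct). The truncation to order N is the source of the order-limitation effects discussed below.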
HOA signals can be derived from microphone recordings, typically using a spherical array such as the 4th-order Eigenmike [96]. The process of computing the Ambisonics signals is often termed plane-wave decomposition (PWD) [97], because Ambisonics can be related to the plane-wave amplitude density function [18]. However, practical arrays have a limited number of microphones, which may limit the spherical harmonics order and the spatial resolution, and may introduce spatial aliasing at high frequencies [98]. Methods that reduce aliasing may extend the frequency range of operation of the array, for example, by aliasing cancellation [99]. Moreover, the typically small array size affects the robustness of PWD at low frequencies, due to the low magnitude of the radial functions that encode scattering off the array [97]. A robust PWD method was recently proposed to overcome these low-frequency limitations [100]. Another approach to enhancing the Ambisonics signals is upscaling, which aims to extend the spherical harmonics order, leading to enhanced spatial resolution and higher-quality spatial audio signals. Earlier work includes the employment of compressed sensing [101–103] and sparse decomposition based on dictionary learning [104], while more recent work includes the employment of sparse recovery [105] and deep learning [106–108]. Order-limited Ambisonics signals translate to order truncation of the HRTF [109], which may have a detrimental effect on the perception of the reproduced binaural signals [93, 110]. Several methods that overcome this limitation have been suggested in recent years [94]. Correction of spectral deficiencies by diffuse-field equalization was suggested in [110, 111]. Other approaches suggested modifying the HRTF phase component, e.g., time-aligned binaural decoding [95], magnitude least-squares (MagLS) [112], and bilateral Ambisonics [56]. The phase was shown to contribute significantly to the increased order of the HRTF [113], and so its modification leads to improved reproduction using low-order Ambisonics.
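To make the PWD step concrete, a minimal sketch of the standard encoding for a spherical array of radius r with Q microphones at directions Ω_q, assuming suitable quadrature weights α_q, is

\[
a_{nm}(k) \approx \frac{1}{b_n(kr)} \sum_{q=1}^{Q} \alpha_q\, p(k, \Omega_q)\, Y_{nm}^{*}(\Omega_q), \qquad n \leq N,
\]

where p(k, Ω_q) are the microphone signals, Y_nm are the spherical harmonics, and b_n(kr) are the radial functions that encode scattering off the array. The division by b_n is the origin of the low-frequency robustness problem noted above: |b_n(kr)| becomes very small for n > kr, so the inversion strongly amplifies sensor noise and typically requires regularization.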
Ambisonics has been established as a common standard for spatial audio, but even with the improvements described above, it has limitations that drive the search for improved solutions. A main limitation appears when Ambisonics with a low spherical harmonics order is used, for which the binaural reproduction may be of poor quality. Other limitations are detailed next. The frequency range of the Ambisonics signals, when captured with compact microphone arrays such as a spherical array, may be limited by spatial aliasing and robustness constraints, as discussed earlier in this section. On the positive side, Ambisonics readily supports spatial rotation, which is useful for head-tracking and 3 degrees-of-freedom (3DoF) rendering. However, the incorporation of spatial translation is not trivial [114]. Another limitation is that the recording of Ambisonics signals often requires a spherical array, which may not be available when using microphone arrays embedded in consumer devices, for example. Finally, recordings of real scenes may also be corrupted by noise and interference and may require enhancement. Various methods that try to overcome these limitations of Ambisonics are described in the next sections.

3.3 Perceptually motivated approaches

As outlined in Section 2.3, PMMAs aim to preserve psychoacoustic cues directly in the microphone-array signals, such that perceptual attributes of the acoustic scene are plausibly rendered. This is in contrast to reconstructing the sound field in a physically accurate manner in post-processing, an approach often employed when Ambisonics signals are computed from spherical microphone arrays, for example, as reviewed in Section 3.2. In particular, most PMMAs focus on manipulating interchannel time difference (ICTD), interchannel level difference (ICLD) and interchannel correlation (ICC) for virtual image localization and spatial-impression rendering. The concept relies on the perceptual phenomena of summing localization and the precedence effect [69]. Typically, the signals of a PMMA do not require any further decoding process for reproduction; each microphone-array signal is discretely routed to the corresponding loudspeaker. For binaural reproduction using a PMMA recording, loudspeakers are replaced by virtual sources, while the source signals are convolved with the head-related impulse responses (HRIRs) associated with the virtual source positions. This approach, illustrated in Figure 3, offers an attractive advantage – binaural reproduction with good perceptual quality can be achieved even with a small number of microphones.

Figure 3. A block diagram illustrating perceptually motivated microphone-array processing for binaural reproduction.

There exist several models of the ICTD and ICLD trade-off for controlling the degree of image shift [115–117], which are used for designing the spacing and relative angle between microphones in an array (see the sketch below). These models can also be used to affect the characteristics of a virtual source for a given perceived source position. In particular, higher ratios of ICTD to ICLD lead to more spacious, but less localizable, sources, and a greater sense of depth and spread [63]. Achieving a sufficient amount of interchannel decorrelation is another important design goal for PMMAs. Decorrelation is not only important for auditory spatial impression, i.e., ASW and LEV [58, 118], but also for extending the size of the listening area in loudspeaker reproduction; this is of less importance in binaural reproduction, where the listener is always at the sweet spot [119, 120]. Decorrelation is also frequency dependent [121]. Since low-frequency decorrelation has been reported to be important for LEV [122], various decorrelation methods have been proposed [118, 123, 124]. Furthermore, decorrelation of vertically oriented signals has been found to have a minimal, or no, effect on the vertical spread of virtual sources, depending on source frequency [124, 125]. This allows a three-dimensional microphone array to be more compact vertically. Examples include the ORTF-3D [126] and ESMA-3D [59] arrays.
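The following minimal Python sketch (an illustrative geometric model, not one of the cited models [115–117]) shows how microphone spacing and capsule angle jointly determine ICTD and ICLD for a near-coincident cardioid pair:

    import numpy as np

    C = 343.0  # speed of sound in m/s

    def cardioid_gain(theta_rad):
        # Sensitivity of an ideal cardioid for sound arriving theta_rad off-axis.
        return 0.5 * (1.0 + np.cos(theta_rad))

    def ictd_icld(source_deg, spacing_m=0.17, axis_deg=55.0):
        # ICTD: path-length difference across the spaced pair.
        ictd_s = spacing_m * np.sin(np.radians(source_deg)) / C
        # ICLD: each capsule's polar pattern, aimed +/- axis_deg from the front.
        g_left = cardioid_gain(np.radians(source_deg - axis_deg))
        g_right = cardioid_gain(np.radians(source_deg + axis_deg))
        icld_db = 20.0 * np.log10(g_left / g_right)
        return ictd_s, icld_db

    for angle in (0.0, 15.0, 30.0):
        t, l = ictd_icld(angle)
        print(f"source at {angle:+5.1f} deg: ICTD = {t*1e3:+.3f} ms, ICLD = {l:+.2f} dB")

Increasing the spacing raises the ICTD contribution, while widening the axis angle raises the ICLD contribution; PMMA design selects this balance according to psychoacoustic trade-off models such as those cited above.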
Despite providing good perceptual quality with a small number of microphones, PMMAs do not directly support generic representations like Ambisonics, making this approach specific to a loudspeaker configuration. With this limitation in mind, methods have been developed to transform PMMA signals into Ambisonics. A recent study [127] investigated the perceived spatial and timbral degradation when signals of various PMMAs were directly encoded to Ambisonics of different orders, and binaurally reproduced using the MagLS decoding method [95]. A multiple stimulus with hidden reference and anchor (MUSHRA) listening test revealed that the perceived degradation was minimal for orders of 2 or higher, depending on the decoder. This suggests that Ambisonics could be a useful coding and delivery format for PMMA recordings.

In summary, PMMAs aim for high perceptual quality with a small number of microphones, but come at the cost of highly specific microphone-array designs. Alternative approaches with a similar aim, but supporting more flexible array designs, are reviewed next.

3.4 Beamforming-based processing

Beamforming-based processing refers to the family of methods that transform microphone signals into a binaural signal in two stages. In the first stage, beamforming, or spatial filtering, is applied to the microphone-array signals, most commonly to represent sound field components associated with specific directions. Then, in the second stage, these components are filtered by the appropriate HRTF and combined to form binaural signals, as illustrated in Figure 4. This is a useful approach, and in its current form it offers great flexibility with respect to array configuration. Ambisonics signals derived from spherical microphone arrays can be considered as a special case of this approach, as detailed below.

Figure 4. A general diagram illustrating beamforming-based spatial audio processing.
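In generic form (a sketch in our notation, rather than that of any specific cited method), the two stages can be written per frequency as

\[
y_d(k) = \mathbf{w}_d^{H}(k)\, \mathbf{x}(k), \qquad
p^{l,r}(k) = \sum_{d=1}^{D} H^{l,r}(k, \Omega_d)\, y_d(k),
\]

where x(k) stacks the microphone signals, w_d(k) is the beamformer steering the array towards direction Ω_d, y_d(k) is the resulting directional signal, and H^{l,r}(k, Ω_d) is the HRTF for that direction. The choice of beamformer and of the D directions constitutes the main design freedom of this family of methods.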

Early work developed within this framework employed Ambisonics signals, and did not explicitly use the term beamforming. Here, Ambisonics signals were decoded into signals used to directly drive an array of actual loudspeakers [128], or, alternatively, an array of virtual loudspeakers [42]. The set of virtual loudspeaker array signals was further filtered with the HRTF to produce binaural signals. This approach was later extended from Ambisonics to spherical arrays in general, by decomposing the measured microphone signals into spherical harmonics and then plane waves, and finally reproducing binaural signals by combining each plane wave with the appropriate HRTF [41]. Further work implemented this approach on a real spherical array [129], and analyzed the approach theoretically [130]. Having established the generation of virtual loudspeaker signals, and then of signals related to PWD, the approach was then extended mathematically to employ beamformers to estimate signals in specific arrival directions. This approach builds on the well-established theory of beamforming [131], with well-defined design methods. Early work incorporated maximum-directivity beamformers, leading to Ambisonics signals and PWD for spherical arrays [132–134]. Later, other beamformers, such as the delay-and-sum beamformer, were also investigated [135]. However, these studies were limited to spherical arrays.

Another direction of research related to beamforming, also applied to spherical microphone arrays, used beamforming or spatial filtering to shape the directivity of the sound field, thus reducing noise arriving from directions attenuated by the spatial filter (see Sect. 4.2), with the entire process embedded in an Ambisonics setting [136–139]. This approach demonstrated a trade-off between noise reduction and spatial audio quality. A different approach, also related to the methods in Section 4.2, placed the emphasis on noise reduction using high-performance beamformers, such as the minimum-variance distortionless response (MVDR) beamformer [140] and the linearly-constrained minimum variance (LCMV) beamformer [141]. These approaches only partly supported spatial audio reproduction quality, by incorporating constraints in the beamformer design to ensure basic cues of the binaural signals, such as the ILD and ITD of specific sources at the beamformer output. This approach did not involve the HRTF, and the quality of the reproduced spatial audio was limited.
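For reference, the standard MVDR solution minimizes the output noise power subject to a distortionless constraint in the look direction:

\[
\mathbf{w}_{\mathrm{MVDR}}(k) = \frac{\mathbf{R}_n^{-1}(k)\, \mathbf{d}(k)}{\mathbf{d}^{H}(k)\, \mathbf{R}_n^{-1}(k)\, \mathbf{d}(k)},
\]

where R_n is the noise (or noise-plus-interference) covariance matrix and d is the array steering vector towards the source. The LCMV beamformer generalizes this to multiple linear constraints, which is how the binaural-cue constraints mentioned above can be imposed.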
In more recent studies, the beamforming approach developed in previous work was applied to arrays of arbitrary configuration, such as arrays mounted on helmets [142] or glasses [143], linear arrays [43, 144], and wall-mounted planar arrays [145]. These recent studies extended previous work that was mostly developed for Ambisonics signals. Design methods were further developed by proposing a framework for selecting the number of beamforming directions [145], by directly matching microphone signals to binaural signals [44, 143], and by designing virtual artificial heads [43, 144]. The last may require efficient representations of the HRTF, e.g., [146]. These are initial steps in the development of methods that will support high-quality binaural reproduction based on practical microphone arrays, such as wearable arrays and arrays with arbitrary configurations.

In summary, while considerable progress has been made in beamforming-based binaural reproduction, most previous work was developed for Ambisonics signals; it may not be possible to accurately compute these signals from signals measured by arrays with a small number of microphones (e.g., from microphones mounted on devices). For such arrays, current beamforming-based design methodology may offer an attractive and flexible alternative; however, at this point in time, further research providing theoretical grounding is required, as well as further development of processing methods to support high-quality binaural reproduction from such arrays.

3.5 Parametric processing

Parametric processing is based on relatively simple, and in some cases perceptually motivated, sound field modelling. The processing generally consists of two steps. In the first step, a specific sound field model is assumed and its parameters and signals are estimated, while in the second step the binaural signals are synthesized. Reproduction based on a small number of parameters may be advantageous when the complexity of the sound field cannot be captured by the recording array. In this case, estimating a small number of perceptually important parameters may be more useful than attempting to capture the full complexity of the sound field.

One of the earliest approaches to parametric signal processing for spatial audio is based on decomposing the sound field into a direct-sound component, representing the sound source, and a diffuse sound component, representing reflections and room reverberation. The approach, referred to as DirAC (directional audio coding) [147], was developed for FOA. A similar approach decomposed the sound field into primary and ambient components [148]; the former component is highly correlated between input channels (representing sources), and the latter is uncorrelated (representing reverberation and background noise). Both approaches process the signals in the time-frequency domain, exploiting the sparsity property of audio signals such as speech. Therefore, while only one source per time-frequency bin is modeled, overall these approaches can model an acoustic scene with multiple sources. Another alternative, high-angular-resolution plane-wave expansion (HARPEX) [149, 150], models two plane waves per time-frequency bin, complemented by two opposing plane waves, thus enriching the plane-wave model.
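As an illustration of the first (analysis) step, DirAC estimates its parameters in each time-frequency bin from the FOA pressure p and particle-velocity vector u; a sketch of the commonly described formulation is

\[
\mathbf{I} = \tfrac{1}{2}\, \Re\{p\, \mathbf{u}^{*}\}, \qquad
\psi = 1 - \frac{\lVert \mathrm{E}\{\mathbf{I}\} \rVert}{c\, \mathrm{E}\{E\}},
\]

where I is the active intensity vector, whose (negated) direction provides the direction-of-arrival estimate, E is the sound-field energy density, c is the speed of sound, and the diffuseness ψ ∈ [0, 1] controls the split of the signal into direct and diffuse streams for synthesis.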
While useful, these early approaches to parametric spatial audio processing are limited by their simplistic models [21], and so methods employing more complex models have been developed [20, 151, 152]. With the aim of extracting multiple dominant plane waves from complex sound fields, sparse recovery approaches have been employed [20, 153]. Multiple plane-wave modeling and a more flexible representation of the reverberant part of the sound field have also been the basis for HOA extensions of DirAC [20, 154–156], leading to improved spatial resolution and a more accurate representation of complex sound fields. This approach, developed for Ambisonics signals and spherical arrays, has been extended to incorporate general microphone arrays, by employing optimal multichannel filters to estimate the direct signals from sources [20, 152, 157]. Figure 5 presents a general block diagram capturing the main processing blocks common to parametric spatial audio signal processing for binaural reproduction. While the approaches discussed above are often presented in the context of loudspeaker reproduction, they are nevertheless relevant for headphone reproduction by employing virtual loudspeakers, or by rendering sources by incorporating the HRTF [20, 21].

Figure 5. A general diagram illustrating parametric-based spatial audio processing, incorporating a first stage of parametric modeling and parameter estimation, and a second stage of HRTF-based binaural reproduction.

Overall, parametric processing has been a promising avenue for binaural reproduction from microphone-array recordings, as it has the potential to capture important spatial information through the modelling process. Further research may provide high-quality reproduction even in challenging environments that include multiple dynamic sources, spatially complex sources [158, 159], reverberation and noise, and even when employing compact arrays with only a few microphones. Improved methods for estimating information on individual sources and on reverberant components, as well as methods that incorporate early room reflections [160–163], may advance the parametric approach even further. The parametric processing approach also supports signal transformations such as rotation and translation, due to the simplified sound-field representation, as will be further discussed below.

3.6 Machine- and deep-learning based processing

With the advent of deep-learning methods, machine learning has seen broad application across a wide variety of research problems, including in the fields of audio and acoustics. Recently, novel machine-learning-based methods that fit within the generalized framework shown in Figure 1 have been proposed.

Understanding the characteristics of the acoustic environment may be useful in a spatial audio processing framework (see Sect. 2.2). The annual IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) [164] includes contributions related to automatically classifying the type of acoustic scene [165], or detecting and localizing sound events from spatial audio recordings [166]. Grumiaux et al. proposed the use of the time-domain velocity vector as an input feature for a deep neural network (DNN) to count and localize multiple speakers in Ambisonics signals [167]. A related problem is the blind estimation of room acoustic parameters from audio recordings [168], including the estimation of the reverberation time and the early-to-late reverberation ratio [169–173].

Given a recording of an acoustic scene, data-driven and machine-learning approaches can be used for audio processing (see Sect. 2.4). A method has been proposed in [174] for upmixing monophonic recordings to FOA by combining audio processing and computer-vision methods to infer sound source locations from a panoramic video recording of the scene. The estimated Ambisonics signals can then be processed further for binaural reproduction. Directly deriving the binaural output signals from a monophonic recording has also been proposed, taking into account the position and orientation of the listener relative to the source [175]. In another study, convolutional neural networks were employed to upscale the Ambisonics order of encoded FOA recordings [106]. On the reproduction side, generative adversarial networks were proposed to reduce the error when rendering Ambisonics-encoded sound fields over four loudspeakers [108]. Finally, data-driven and machine-learning-based approaches have been proposed for the perceptual evaluation of reproduced scenes. A model that predicts front-back and elevation perception of sound sources was introduced in [70], while predicting spatial audio quality using computational models was proposed in [77]. For an extensive review of current data-based spatial audio methods, the reader is referred to the work by Cobos et al. [176].

With the increased popularity of machine- and deep-learning research, it is expected that these approaches will play an increasingly significant role in the near future – for spatial audio in general, and for the binaural reproduction of recorded acoustic scenes in particular. As these approaches are data-driven, they have the potential to overcome limitations imposed by microphone-array configurations, and to implicitly exploit information embedded in the sound field, leading to highly flexible solutions; nevertheless, these solutions may require tailoring to specific systems and applications.

4 Transformations

This section presents processing methods that can be considered as additions to the main processing chain of mapping microphone signals to binaural signals, as illustrated in Figure 1. These include signal enhancement, to reduce unwanted interfering sounds in the spatial audio signal, and translation and rotation, which support the mobility of a listener in a virtual audio environment.

4.1 Rotation and translation

During binaural reproduction with headphones, listeners may rotate their heads, leading to a corresponding rotation of the acoustic scene, which is perceived as unnatural. This can be corrected by head-tracking, i.e., rotating the acoustic scene to counter the listener's head rotations, thereby stabilizing the virtual scene and providing the feeling of immersion in a real scene. This head-tracking is denoted as having 3DoF. Furthermore, listeners may move freely, i.e., walk through the reproduced scene with a combination of rotational and translational movements. The latter refers to moving forwards and backwards, up and down, and left and right. The translation is often referred to as sound field translation, sound field navigation, or scene walk-through. When paired with rotation, the complete freedom of movement is denoted 6DoF. The objective of 6DoF reproduction is to enable a listener to walk through an acoustic scene in VR/AR, leaning close to sound sources or reflectors and hearing a realistic, life-like recreation of the true experience (ideally with matched visuals).

A schematic illustrating how recordings are compensated for listener rotation and translation, for the case of an Ambisonics signal, is given in Figure 6. Typically, the recorded sound field is processed into an intermediate representation that supports sound field rotation and translation. The intermediate sound field is then recomposed back to an Ambisonics representation at the listener's new position, and the binaural signals are rendered as usual. Further details on the approaches for listener rotation and translation are provided in the following.

Figure 6. Generalized framework of sound field translation and rotation for binaural reproduction of recorded scenes.

A straightforward method for enabling head rotation is to record the scene with multiple binaural microphones at different azimuth rotations, for example, by having microphones [82, 177] or binaural microphones [178] placed around the equator of a sphere. During reproduction, the listener's head rotation is tracked, and the microphone signals closest to the ears are interpolated or directly played back. However, currently, head rotation for 3DoF is more commonly achieved by rotating the Ambisonics representation of the sound field [128, 179], or, equivalently, rotating the Ambisonics representation of the HRTF [180]. Ambisonics rotation is easily performed by applying a time- and frequency-independent rotation matrix to the Ambisonics coefficients [181–186]. The challenges of head-rotation-enabled binaural reproduction for recordings by non-standard or wearable microphone arrays [187–189] are still the subject of ongoing research.
6DoF, each distinguished by the recording setup. The first The third 6DoF technique, denoted interpolation-based,
is a source-based approach, where spot microphones are records the scene from multiple spatial positions with a dis-
used to record each sound source individually within the tributed grid of Ambisonics (first-order or higher-order)
scene [190]. The recorded scene is virtually pieced back microphones. Existing approaches for Ambisonics interpo-
together by representing the sources as virtual objects at lation can broadly be classified into two categories: para-
similar positions. The virtual object signals are panned metric approaches in the time-frequency domain, and
and amplified depending on the listener’s real-time position broadband approaches in the time domain. The parametric
and rotation. This approach is easy to adapt to different approach exploits time-frequency analysis of the multiple
binaural rendering methods. However, no source-directivity Ambisonics recordings to infer underlying source character-
information is captured, and the specific acoustics of the istics (mainly the location information), which are then
environment are not typically captured or reproduced. explicitly [208–212] or implicitly [213–215] used to render
B. Rafaely et al.: Acta Acustica 2022, 6, 47 11

the reproduced sound field at interpolated listening posi- noise fields that are not highly directional single sources, a
tions. Tracking-based solutions for moving sources have directional shaping filter that allocates higher directional
also been proposed [216–220]. Additional information on gain to directions with higher signal-to-noise ratio was
source locations enlarges the supported range of shifted introduced. This processing operates directly on the
listening perspectives with high spatial definition, yet the Ambisonics signal [136, 230], and while defined in a closed
time-frequency processing often results in musical noise arti- mathematical form, leads to a trade-off between enhance-
facts. In contrast, broadband approaches such as weighted ment level and reproduction quality. Later research aimed
averaging and virtual loudspeaker objects (VLO) make no to provide significant enhancement while perfectly preserv-
attempt at analyzing underlying source characteristics, ing the desired spatial audio signal. Designed for Ambison-
and their time-domain processing avoids the risk of intro- ics signals, this aim is achieved by first estimating the DOA
ducing musical noise. The weighted averaging method of the desired source, then estimating the source signal
[221–223] applies distance-based weights to each recording using high-directivity beamforming, and finally estimating
and has a few notable shortcomings, including a limited the transfer function from the source signal to the Ambison-
listener movement region and poor localization accuracy. ics signals. This process leads to a reconstruction of the
In contrast, the VLO method [193, 224] maps the record- desired Ambisonics signal with the full spatial information,
ings to multiple surround playback rings of virtual loud- while providing enhancement through the contribution of
speaker objects, whose direction and amplitude vary with the beamforming [231]. Recently, this approach was also
the desired listener position, thus providing enhanced investigated for a wearable microphone array [143, 158].
spatial fidelity. More recently, in [225, 226], the authors pre- An alternative approach, also aiming to achieve significant
sented methods that merge and extend the concepts of enhancement while preserving spatial information,
parametric and broadband interpolation. Overall, interpo- employed masking in the time-frequency domain, applied
lation-based approaches offer potentially longer translation directly to the Ambisonics signals or to the same signals
distances, but with the trade-off of the increased costs asso- spatially transformed by beamforming [232]. While high
ciated with using multiple HOA microphones [227, 228]. noise attenuation was achieved by masking in the trans-
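A minimal sketch of the distance-based weighting idea (an illustration of the general principle rather than the specific algorithms of [221–223]):

    import numpy as np

    def interpolate_ambisonics(recordings, mic_positions, listener_pos, eps=1e-6):
        # recordings:    shape (num_mics, num_channels, num_samples),
        #                co-oriented Ambisonics signals from the microphone grid
        # mic_positions: shape (num_mics, 3), positions in metres
        # listener_pos:  shape (3,), desired virtual listening position
        dists = np.linalg.norm(mic_positions - listener_pos, axis=1)
        weights = 1.0 / (dists**2 + eps)   # emphasize the nearest microphones
        weights /= weights.sum()           # normalize to preserve overall level
        # Weighted sum over the microphone axis -> (num_channels, num_samples).
        return np.tensordot(weights, recordings, axes=(0, 0))

Because the recordings are summed broadband in the time domain, no musical noise is introduced; however, as noted above, the pull towards the nearest microphone limits the listener movement region and the localization accuracy.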
While numerous rotation and translation solutions have recently been developed, capturing and accurately reproducing large acoustic scenes still remains an open problem. Future directions include the extension of these methods to more general signals beyond Ambisonics, and improving the accuracy and translation regions to support realistic free walking in virtual and augmented reproductions of captured sound scenes.
4.2 Signal enhancement

Spatial audio signals may be composed of both desired components, such as speech and music, and undesired components, such as noise and interfering sounds. Therefore, in addition to processing aimed at binaural reproduction, signal enhancement may also be required in order to attenuate the interfering components, thereby delivering to the listener high-quality spatial audio that is also clean.

This problem has been investigated for hearing aids, where the delivery of clean speech is of great importance, while binaural hearing aids also aim to deliver spatial cues to the listener. With this in mind, binaural beamformers that aim to attenuate undesired signal components [140, 141] have been developed; these were also extended to include time-frequency masking [229]. However, because the spatial information relies on beamforming constraints, it is only partially preserved in the binaural signal. Furthermore, these methods are designed for binaural microphone arrays and may not always be applicable to general arrays.
With the aim of overcoming the limitations of binaural signal enhancement, several studies have developed enhancement solutions for Ambisonics signals. In the first approach, directional constraints were introduced into the Ambisonics encoding process to attenuate directional interferences. Then, with the aim of affording more flexibility to target noise fields that are not highly directional single sources, a directional shaping filter that allocates higher directional gain to directions with a higher signal-to-noise ratio was introduced. This processing operates directly on the Ambisonics signal [136, 230], and while defined in a closed mathematical form, it leads to a trade-off between enhancement level and reproduction quality. Later research aimed to provide significant enhancement while perfectly preserving the desired spatial audio signal. Designed for Ambisonics signals, this aim is achieved by first estimating the DOA of the desired source, then estimating the source signal using high-directivity beamforming, and finally estimating the transfer function from the source signal to the Ambisonics signals. This process leads to a reconstruction of the desired Ambisonics signal with the full spatial information, while providing enhancement through the contribution of the beamforming [231]. Recently, this approach was also investigated for a wearable microphone array [143, 158]. An alternative approach, also aiming to achieve significant enhancement while preserving spatial information, employed masking in the time-frequency domain, applied directly to the Ambisonics signals or to the same signals spatially transformed by beamforming [232]. While higher noise attenuation was achieved by masking in the transformed spatial domain, masking in the Ambisonics, or spherical harmonics, domain better preserved the spatial information in the attenuated noise. Some of these approaches were generalized into a broad framework for signal enhancement [233], which incorporates source signal estimation under various sound field models in a way that preserves both the individual sources and the reverberant signal components, while minimizing the contribution of undesired noise.
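The DOA-based reconstruction described above can be summarized per frequency as a sketch (in our notation): with x(k) the noisy Ambisonics signal vector,

\[
\hat{s}(k) = \mathbf{w}^{H}(k)\, \mathbf{x}(k), \qquad
\hat{\mathbf{x}}(k) = \hat{\mathbf{v}}(k)\, \hat{s}(k),
\]

where w(k) is a high-directivity beamformer steered towards the estimated DOA, ŝ(k) is the estimated source signal, and v̂(k) is the estimated transfer-function vector from the source signal to the Ambisonics channels (obtained, for example, by regression of x on ŝ). The reconstructed x̂(k) retains the desired source with its full spatial imprint, while noise is suppressed by the directivity of the beamformer.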
12 B. Rafaely et al.: Acta Acustica 2022, 6, 47
• Acoustic scene – real-world scenes may be challenging, with several moving sources, reverberant environments, noise and interference. Most methods developed to date assume stationary sources, and so bridging the gap to handle several, and moving, sources in lively environments could be an important target for future research.

• Recording – spatial audio signals are recorded by microphone arrays, and with emerging applications such as smart homes and VR/AR, arrays may be of varying configurations (e.g., on a device), may be composed of only a few microphones, and may be dynamic in space (e.g., wearable arrays). With many of the current methods developed for spherical arrays, an important challenge is to extend emerging methods to work with general arrays and to perform well even with moving arrays composed of only a few microphones. Wearable arrays may also introduce challenges with respect to the limited computation resources available and the latency constraints imposed, for example, by real-time reproduction and head-tracking. Overcoming these challenges may open great opportunities for delivering affordable spatial audio on consumer devices.

• Processing – while signal processing has been the main topic of this paper and is incorporated in the points above as well, a main avenue of research reviewed here is spatial audio signal processing based on learning from measured data. With deep-learning methods continuously developing, their incorporation in the challenging tasks outlined here could be of great benefit. Learning from measured data could also include parametric representations of sound fields based on microphone-array recordings, which have great potential for high performance with compact representations. Furthermore, emerging approaches for manipulating sound field information for translation and rotation, for example by non-linear transformation of the directional space (i.e., warping), may lead to new possibilities and increased flexibility for VR/AR and other applications.

• Reproduction and perception – reproduction over headphones, in particular using individualized HRTFs, will probably be key to high-quality spatial audio. The incorporation of individualized HRTFs in state-of-the-art algorithms is therefore essential. Furthermore, an improved understanding of the relation between the processed audio signal and perception may be essential to ensure that important signal information is maintained or enhanced. Performance evaluations, currently mostly developed for listeners with head-tracking, should be extended to 6DoF motion. Also, mathematically formulated objectives that incorporate perceptual attributes, essential for machine and deep learning, could be useful for developing data-based learning solutions that are perceptually motivated.

Conflict of interest

The authors declare no conflict of interest.

References

1. M.F. Davis: History of spatial coding. Journal of the Audio Engineering Society 51, 6 (2003) 554–569.
2. M. Vorländer: Past, present and future of dummy heads, in Proceedings of Acústica, Guimarães, Portugal, 2004, pp. 13–17.
3. D.R. Begault, E.M. Wenzel, M.R. Anderson: Direct comparison of the impact of head tracking, reverberation, and individualized head-related transfer functions on the spatial perception of a virtual speech source. Journal of the Audio Engineering Society 49, 10 (2001) 904–916.
4. B. Xie: Head-related transfer function and virtual auditory display. 2nd ed., J. Ross Publishing, 2013.
5. M.A. Gerzon: Periphony: with-height sound reproduction. Journal of the Audio Engineering Society 21, 1 (February 1973) 2–10.
6. J.S. Bamford: An analysis of ambisonic sound systems of first and second order. PhD thesis, University of Waterloo, Ontario, Canada, 1995.
7. J. Daniel: Acoustic field representation, application to the transmission and the reproduction of complex sound environments in a multimedia context. PhD thesis, Université de Paris, Paris, France, 2000.
8. D.G. Malham, A. Myatt: 3-D sound spatialization using ambisonic techniques. Computer Music Journal 19, 4 (1995) 58–70.
9. M.A. Poletti: The design of encoding functions for stereophonic and polyphonic sound systems. Journal of the Audio Engineering Society 44, 11 (1996) 948–963.
10. T.D. Abhayapala, D.B. Ward: Theory and design of high order sound field microphones using spherical microphone array, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Orlando, Florida, USA, 2002, pp. 1949–1952.
11. J. Meyer, G. Elko: A highly scalable spherical microphone array based on an orthonormal decomposition of the soundfield, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Orlando, Florida, USA, 2002, pp. II-1781–II-1784.
12. M. Vorländer: Auralization: fundamentals of acoustics, modelling, simulation, algorithms and acoustic virtual reality. Springer, 2020.
13. J. Blauert, J. Braasch: The technology of binaural understanding. Springer, 2020.
14. H. Hacihabiboglu, E. De Sena, Z. Cvetkovic, J. Johnston, J.O. Smith III: Perceptual spatial audio recording, simulation, and rendering: an overview of spatial-audio techniques based on psychoacoustics. IEEE Signal Processing Magazine 34, 3 (2017) 36–54.
15. W. Zhang, P.N. Samarasinghe, H. Chen, T.D. Abhayapala: Surround by sound: a review of spatial audio recording and reproduction. Applied Sciences 7, 3 (2017) 532.
16. F. Zotter, M. Frank: Ambisonics: a practical 3D audio theory for recording, studio production, sound reinforcement, and virtual reality. Springer Nature, 2019.
17. D.P. Jarrett, E.A.P. Habets, P.A. Naylor: Theory and applications of spherical microphone array processing. Springer-Verlag, Berlin, 2017.
18. B. Rafaely: Fundamentals of spherical array processing. Springer-Verlag, Berlin, 2019.
19. J. Herre, J. Hilpert, A. Kuntz, J. Plogsties: MPEG-H 3D audio – the new standard for coding of immersive spatial audio. IEEE Journal of Selected Topics in Signal Processing 9, 5 (2015) 770–779.
20. V. Pulkki, S. Delikaris-Manias, A. Politis: Parametric time-frequency domain spatial audio. John Wiley & Sons, 2017.
21. K. Kowalczyk, O. Thiergart, M. Taseska, G. Del Galdo, V. Pulkki, E.A.P. Habets: Parametric spatial sound processing: a flexible and efficient solution to sound scene acquisition, modification, and reproduction. IEEE Signal Processing Magazine 32, 2 (2015) 31–42.
22. V.R. Algazi, R.O. Duda: Headphone-based spatial sound. IEEE Signal Processing Magazine 28, 1 (2011) 33–42.
23. K. Sunder, J. He, E.L. Tan, W.-S. Gan: Natural sound rendering for headphones: integration of signal processing techniques. IEEE Signal Processing Magazine 32, 2 (2015) 100–113.
24. D.R. Begault, L.J. Trejo: 3-D sound for virtual reality and multimedia. NASA, Ames Research Center, Moffett Field, California, 2000, pp. 132–136.
25. P. Milgram, H. Takemura, A. Utsumi, F. Kishino: Augmented reality: a class of displays on the reality-virtuality continuum, in Telemanipulator and Telepresence Technologies, International Society for Optics and Photonics, 1995, pp. 282–292.
26. V. Tourbabin, B. Rafaely: Analysis of distortion in audio signals introduced by microphone motion, in 2016 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, 2016, pp. 998–1002.
27. A. Alexandridis, A. Griffin, A. Mouchtaris: Capturing and reproducing spatial audio based on a circular microphone array. Journal of Electrical and Computer Engineering 2013 (2013) 1–16.
28. I. Toshima, H. Uematsu, T. Hirahara: A steerable dummy head that tracks three-dimensional head movement: TeleHead. Acoustical Science and Technology 24 (2003) 327–329.
29. Zylia: Zylia ZM-1 microphone. Accessed on December 6, 2021. https://www.zylia.co/
30. T. Lokki: Subjective comparison of four concert halls based on binaural impulse responses. Acoustical Science and Technology 26, 2 (2005) 200–203.
31. T. Lokki, J. Pätynen, S. Tervo, S. Siltanen, L. Savioja: Engaging concert hall acoustics is made up of temporal envelope preserving reflections. The Journal of the Acoustical Society of America 129, 6 (2011) EL223–EL228.
32. O. Axelsson, M.E. Nilsson, B. Berglund: A principal components model of soundscape perception. The Journal of the Acoustical Society of America 128, 5 (2010) 2836–2846.
33. B. Boren, M. Musick, J. Grossman, A. Roginska: I hear NY4D: hybrid acoustic and augmented auditory display for urban soundscapes, in International Conference on Auditory Display, New York, NY, USA, 2014.
34. A. Leudar: An alternative approach to 3D audio recording and reproduction. Divergence Press 3, 1 (2014).
35. Eden Project: Rainforest at night: heart of darkness. Accessed on December 6, 2021. https://web.archive.org/web/20110719132826/http://www.edenproject.com/come-and-visit/whats-on/heart-of-darkness.php
36. H. Lee: Multichannel 3D microphone arrays: a review. Journal of the Audio Engineering Society 69, 1/2 (2021) 5–26.
37. B&K: Binaural microphone B&K type 4101-B. Accessed on December 6, 2021. https://www.bksv.com/en/transducers/acoustic/binaural/binaural-microphone?tab=overview
38. 3Dio: Free-space binaural microphone. Accessed on December 6, 2021. https://3diosound.com/products/free-space-binaural-microphone
39. Sennheiser: Sennheiser AMBEO VR mic. Accessed on December 6, 2021. https://en-us.sennheiser.com/microphone-3d-audio-ambeo-vr-mic
40. em32 Eigenmike array. mhAcoustics, 25 Summit Ave, Summit, NJ 07901, USA. Accessed on December 6, 2021. https://mhacoustics.com/products
41. R. Duraiswami, D. Zotkin, Z. Li, E. Grassi, N. Gumerov, L. Davis: High-order spatial audio capture and its binaural head-tracked playback over headphones with HRTF cues, in The 119th Convention of the Audio Engineering Society, vol. 3, New York, NY, USA, 2005, pp. 1–16.
42. M. Noisternig, T. Musil, A. Sontacchi, R. Holdrich: 3D binaural sound reproduction using a virtual ambisonic approach, in IEEE International Symposium on Virtual Environments, Human-Computer Interfaces and Measurement Systems (VECIMS '03), IEEE, 2003, pp. 174–178.
43. M. Fallahi, M. Hansen, S. Doclo, S. van de Par, D. Püschel, M. Blau: Evaluation of head-tracked binaural auralizations of speech signals generated with a virtual artificial head in anechoic and classroom environments. Acta Acustica 5 (2021) 30.
44. L. Madmoni, J. Donley, V. Tourbabin, B. Rafaely: Beamforming-based binaural reproduction by matching of binaural signals, in Audio Engineering Society Conference: International Conference on Audio for Virtual and Augmented Reality, 2020.
45. S. Sakamoto, J. Kodama, S. Hongo, T. Okamoto, Y. Iwaya, Y. Suzuki: A 3D sound-space recording system using spherical microphone array with 252ch microphones, in 20th International Congress on Acoustics 2010, ICA 2010 – Incorporating Proceedings of the 2010 Annual Conference of the Australian Acoustical Society, Sydney, Australia, 2010, pp. 3032–3035.
46. A. Roginska, P. Geluso: Immersive sound: the art and science of binaural and multi-channel audio. Taylor & Francis, 2017.
47. S. Werner, F. Klein, T. Mayenfels, K. Brandenburg: A summary on acoustic room divergence and its effect on externalization of auditory events, in 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX), IEEE, 2016, pp. 1–6.
48. W.O. Brimijoin, A.W. Boyd, M.A. Akeroyd: The contribution of head movement to the externalization and internalization of sounds. PLoS ONE 8, 12 (2013) e83068.
49. F.L. Wightman, D.J. Kistler: The importance of head movements for localizing virtual auditory display objects, in International Conference on Auditory Display, Georgia Institute of Technology, 1994.
50. M.-V. Laitinen, T. Pihlajamäki, S. Lösler, V. Pulkki: Influence of resolution of head tracking in synthesis of binaural audio, in Audio Engineering Society Convention 132, Audio Engineering Society, 2012.
51. P. Stitt, E. Hendrickx, J.-C. Messonnier, B. Katz: The influence of head tracking latency on binaural rendering in simple and complex sound scenes, in Audio Engineering Society Convention 140, Audio Engineering Society, 2016.
52. I. Engel, D.L. Alon, P.W. Robinson, R. Mehra: The effect of generic headphone compensation on binaural renderings, in Audio Engineering Society Conference: 2019 AES International Conference on Immersive and Interactive Audio, Audio Engineering Society, 2019.
53. A. Lindau, F. Brinkmann: Perceptual evaluation of headphone compensation in binaural synthesis based on non-individual recordings. Journal of the Audio Engineering Society 60, 1/2 (2012) 54–62.
54. D. Pralong, S. Carlile: The role of individualized headphone calibration for the generation of high fidelity virtual auditory space. The Journal of the Acoustical Society of America 100, 6 (1996) 3785–3793.
55. F. Brinkmann, A. Lindau, S. Weinzierl: On the authenticity of individual dynamic binaural synthesis. The Journal of the Acoustical Society of America 142, 4 (2017) 1784–1795.
56. Z. Ben-Hur, D.L. Alon, R. Mehra, B. Rafaely: Binaural reproduction based on bilateral ambisonics and ear-aligned HRTFs. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 901–913.
57. D. Griesinger: General overview of spatial impression, envelopment, localization, and externalization, in Audio Engineering Society Conference: 15th International Conference: Audio, Acoustics & Small Spaces, Copenhagen, Denmark, 1998.
58. T. Hidaka, T. Okano, L. Beranek: Interaural cross correlation (IACC) as a measure of spaciousness and envelopment in concert halls. The Journal of the Acoustical Society of America 92, 4 (1992) 2469–2469.
59. H. Lee: Capturing 360° audio using an equal segment microphone array (ESMA). Journal of the Audio Engineering Society 67, 1/2 (2019) 13–26.
60. T. Okano, L.L. Beranek, T. Hidaka: Relations among interaural cross-correlation coefficient (IACCE), lateral fraction (LFE), and apparent source width (ASW) in concert halls. The Journal of the Acoustical Society of America 104, 1 (1998) 255–265.
61. A. Lindau, V. Erbes, S. Lepa, H.-J. Maempel, F. Brinkman, S. Weinzierl: A spatial audio quality inventory (SAQI). Acta Acustica united with Acustica 100, 5 (2014) 984–994.
62. G. Lorho: Individual vocabulary profiling of spatial enhancement systems for stereo headphone reproduction, in Audio Engineering Society Convention 119, Audio Engineering Society, 2005.
63. C. Millns, H. Lee: An investigation into spatial attributes of 360° microphone techniques for virtual reality, in Audio Engineering Society Convention 144, Milan, Italy, 2018.
64. G. Reardon, A. Genovese, G. Zalles, P. Flanagan, A. Roginska: Evaluation of binaural renderers: multidimensional sound quality assessment, in Audio Engineering Society Conference: International Conference on Audio for Virtual and Augmented Reality, Redmond, WA, USA, 2018.
65. L.S.R. Simon, N. Zacharov, B.F.G. Katz: Perceptual attributes for the comparison of head-related transfer functions. The Journal of the Acoustical Society of America 140, 5 (2016) 3623–3632.
66. N. Zacharov, T. Pedersen, C. Pike: A common lexicon for spatial sound quality assessment – latest developments, in 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX), Lisbon, Portugal, 2016, pp. 1–6.
67. A. Lindau, S. Weinzierl: Assessing the plausibility of virtual acoustic environments. Acta Acustica united with Acustica 98, 5 (2012) 804–810.
68. R.S. Pellegrini: Quality assessment of auditory virtual environments, in International Conference on Auditory Display, Helsinki, Finland, 2001.
69. J. Blauert: Spatial hearing: the psychophysics of human sound localization. MIT Press, 1997.
70. R. Baumgartner, P. Majdak, B. Laback: Modeling sound-source localization in sagittal planes for human listeners. The Journal of the Acoustical Society of America 136 (2014) 791–802.
71. V. Best, R. Baumgartner, M. Lavandier, P. Majdak, N. Kopčo: Sound externalization: a review of recent research. Trends in Hearing 24 (2020) 1–14.
72. S. Li, R. Baumgartner, J. Peissig: Modeling perceived externalization of a static, lateral sound image. Acta Acustica 4, 5 (2020) 21.
73. J. Reijniers, D. Vanderelst, C. Jin, S. Carlile, H. Peremans: An ideal-observer model of human sound localization. Biological Cybernetics 108, 2 (2014) 169–181.
74. R. Baumgartner, P. Majdak: Decision making in auditory externalization perception: model predictions for static conditions. Acta Acustica 5 (2021) 59.
75. F. Rumsey, S. Zieliński, R. Kassier: On the relative importance of spatial and timbral fidelities in judgments of degraded multichannel audio quality. The Journal of the Acoustical Society of America 118, 2 (2005) 968–976.
76. I. Ananthabhotla, V.K. Ithapu, W.O. Brimijoin: A framework for designing head-related transfer function distance metrics that capture localization perception. JASA Express Letters 1, 4 (2021) 044401.
77. P. Majdak, R. Baumgartner: Computational models for listener-specific predictions of spatial audio quality, in EAA Spatial Audio Signal Processing Symposium, Paris, France, 2019, pp. 155–159.
78. T. Robotham, O.S. Rummukainen, J. Herre, E.A.P. Habets: Evaluation of binaural renderers in virtual reality environments: platform and examples, in Proc. of the 145th AES Convention, New York, NY, USA, 2018.
79. T. Robotham, O.S. Rummukainen, M. Kurz, M. Eckert, E.A.P. Habets: Comparing direct and indirect methods of audio quality evaluation in virtual reality scenes of varying complexity. IEEE Transactions on Visualization and Computer Graphics 28, 5 (2022) 2091–2101.
80. B.I. Băcilă, H. Lee: Listener-position and orientation dependency of auditory perception in an enclosed space: elicitation of salient attributes. Applied Sciences 11, 4 (2021) 1–24.
81. C. Schneiderwind, A. Neidhardt: Perceptual differences of position dependent room acoustics in a small conference room, in The International Symposium on Room Acoustics, Amsterdam, Netherlands, 2019.
82. V.R. Algazi, R.O. Duda, D.M. Thompson: Motion-tracked binaural sound. Journal of the Audio Engineering Society 52, 11 (2004) 1142–1156.
83. A. Lindau, S. Roos: Perceptual evaluation of discretization and interpolation for motion-tracked binaural (MTB) recordings, in Proceedings of the 26th Tonmeistertagung, VDT International Convention, Leipzig, Germany, 2010, pp. 680–701.
84. S. Nagel, P. Jax: Dynamic binaural cue adaptation, in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), IEEE, 2018, pp. 96–100.
85. P.G. Craven, M.A. Gerzon: Coincident microphone simulation covering three dimensional space and yielding various directional outputs, 1977. US Patent 4,042,779.
86. P.B. Fellgett: Ambisonic reproduction of directionality in surround-sound systems. Nature 252, 5484 (1974) 534–538.
87. M.A. Gerzon: The design of precisely coincident microphone arrays for stereo and surround sound, in Audio Engineering Society Convention 50, Audio Engineering Society, 1975.
88. J.-M. Jot, V. Larcher, J.-M. Pernaux: A comparative study of 3-D audio encoding and rendering techniques, in Audio Engineering Society Conference: 16th International Conference: Spatial Sound Reproduction, Arktikum, Rovaniemi, Finland, 1999.
89. M.J. Evans, J.A.S. Angus, A.I. Tew: Analyzing head-related transfer function measurements using surface spherical harmonics. The Journal of the Acoustical Society of America 104, 4 (1998) 2400–2411.
90. B. Rafaely, A. Avni: Interaural cross correlation in a sound field represented by spherical harmonics. The Journal of the Acoustical Society of America 127, 2 (2010) 823–828.
91. A. Sontacchi, M. Noisternig, P. Majdak, R. Holdrich: An objective model of localisation in binaural sound reproduction systems, in Audio Engineering Society Conference: 21st International Conference: Architectural Acoustics and Sound Reinforcement, Audio Engineering Society, 2002.
92. Z. Ben-Hur, D. Alon, R. Mehra, B. Rafaely: Binaural reproduction using bilateral Ambisonics, in AES International Conference on Audio for Virtual and Augmented Reality (AVAR), Redmond, WA, USA, August 2020, pp. 1–6.
93. A. Avni, J. Ahrens, M. Geier, S. Spors, H. Wierstorf, B. Rafaely: Spatial perception of sound fields recorded by spherical microphone arrays with varying spatial resolution. The Journal of the Acoustical Society of America 133, 5 (2013) 2711–2721.
94. T. Lübeck, H. Helmholz, J.M. Arend, C. Pörschmann, J. Ahrens: Perceptual evaluation of mitigation approaches of impairments due to spatial undersampling in binaural rendering of spherical microphone array data. Journal of the Audio Engineering Society 68, 6 (2020) 428–440.
95. M. Zaunschirm, C. Schörkhuber, R. Höldrich: Binaural rendering of ambisonic signals by head-related impulse response time alignment and a diffuseness constraint. The Journal of the Acoustical Society of America 143, 6 (2018) 3616–3627.
96. em32 Eigenmike microphone array release notes (v17.0). mhAcoustics, 25 Summit Ave, Summit, NJ 07901, USA, 2013.
97. B. Rafaely: Plane-wave decomposition of the sound field on a sphere by spherical convolution. The Journal of the Acoustical Society of America 116, 4 (2004) 2149–2157.
98. B. Rafaely, B. Weiss, E. Bachmat: Spatial aliasing in spherical microphone arrays. IEEE Transactions on Signal Processing 55, 3 (2007) 1003–1010.
99. D.L. Alon, B. Rafaely: Beamforming with optimal aliasing cancellation in spherical microphone arrays. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24, 1 (2016) 196–210.
100. D.L. Alon, B. Rafaely: Spatial decomposition by spherical array processing, in Parametric Time-Frequency Domain Spatial Audio, Chapter 2, V. Pulkki, S. Delikaris-Manias, A. Politis, Eds., Wiley, 2017, pp. 25–47.
101. A. Wabnitz, N. Epain, C.T. Jin: A frequency-domain algorithm to upscale ambisonic sound scenes, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 2012, pp. 385–388.
102. A. Wabnitz, N. Epain, A. McEwan, C. Jin: Upscaling Ambisonic sound scenes using compressed sensing techniques, in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2011, pp. 1–4.
103. P.K.T. Wu, N. Epain, C. Jin: A super-resolution beamforming algorithm for spherical microphone arrays using a compressed sensing approach, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, 2013, pp. 649–653.
104. N. Murata, S. Koyama, N. Takamune, H. Saruwatari: Sparse sound field decomposition with parametric dictionary learning for super-resolution recording and reproduction, in IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), Cancun, Mexico, 2015, pp. 69–72.
105. G. Routray, R.M. Hegde: Sparse plane-wave decomposition for upscaling ambisonic signals, in 2020 International Conference on Signal Processing and Communications (SPCOM), Bangalore, India, 2020, pp. 1–5.
106. G. Routray, S. Basu, P. Baldev, R.M. Hegde: Deep-sound field analysis for upscaling ambisonic signals, in EAA Spatial Audio Signal Processing Symposium, Paris, France, 2019, pp. 1–6.
107. L. Zhang, X. Wang, R. Hu, D. Li, W. Tu: Estimation of spherical harmonic coefficients in sound field recording using feed-forward neural networks. Multimedia Tools and Applications 80 (2021) 6187–6202.
108. L. Zhang, X. Wang, R. Hu, D. Li, W. Tu: Optimization of sound fields reproduction based higher-order ambisonics (HOA) using the generative adversarial network (GAN). Multimedia Tools and Applications 80, 2 (2021) 2205–2220.
109. Z. Ben-Hur, J. Sheaffer, B. Rafaely: Joint sampling theory and subjective investigation of plane-wave and spherical harmonics formulations for binaural reproduction. Applied Acoustics 134 (2018) 138–144.
110. Z. Ben-Hur, F. Brinkmann, J. Sheaffer, S. Weinzierl, B. Rafaely: Spectral equalization in binaural signals represented by order-truncated spherical harmonics. The Journal of the Acoustical Society of America 141, 6 (2017) 4087–4096.
111. C. Hold, H. Gamper, V. Pulkki, N. Raghuvanshi, I.J. Tashev: Improving binaural ambisonics decoding by spherical harmonics domain tapering and coloration compensation, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 2019, pp. 261–265.
112. C. Schörkhuber, M. Zaunschirm, R. Höldrich: Binaural rendering of ambisonic signals via magnitude least squares, in Fortschritte der Akustik (DAGA), München, Germany, 2018, pp. 339–342.
113. F. Brinkmann, S. Weinzierl: Comparison of head-related transfer functions pre-processing techniques for spherical harmonics decomposition, in Audio Engineering Society Conference: International Conference on Audio for Virtual and Augmented Reality, Redmond, WA, USA, 2018.
114. L. Birnie, T. Abhayapala, P. Samarasinghe, V. Tourbabin: Sound field translation methods for six-degrees-of-freedom binaural reproduction, in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), IEEE, 2019, pp. 140–144.
115. H. Lee, F. Rumsey: Level and time panning of phantom images for musical sources. Journal of the Audio Engineering Society 61, 12 (2013) 978–988.
116. M. Williams, G. Le Du: Microphone array analysis for multichannel sound recording, in Audio Engineering Society Convention 107, New York, NY, USA, 1999.
117. H. Wittek, G. Theile: The recording angle – based on localisation curves, in Audio Engineering Society Convention 112, Munich, Germany, 2002.
118. F. Zotter, M. Frank: Efficient phantom source widening. Archives of Acoustics 38, 1 (2013) 27–37.
119. K. Hamasaki, K. Hiyama: Reproducing spatial impression with multichannel audio, in Audio Engineering Society Conference: 24th International Conference: Multichannel Audio, The New Reality, Banff, Alberta, Canada, 2003.
120. F. Rumsey: Spatial audio. Focal Press, 2001.
121. M. Kuster: Spatial correlation and coherence in reverberant acoustic fields: extension to microphones with arbitrary first-order directivity. The Journal of the Acoustical Society of America 123, 1 (2008) 154–162.
122. D. Griesinger: Reproducing low frequency spaciousness and envelopment in listening rooms, in Audio Engineering Society Convention 145, New York, NY, USA, 2018.
123. C. Gribben, H. Lee: A comparison between horizontal and vertical interchannel decorrelation. Applied Sciences 7, 11 (2017) 1–21.
124. C. Gribben, H. Lee: The frequency and loudspeaker-azimuth dependencies of vertical interchannel decorrelation on the vertical spread of an auditory image. Journal of the Audio Engineering Society 66, 7/8 (2018) 537–555.
125. H. Lee, C. Gribben: Effect of vertical microphone layer spacing for a 3D microphone array. Journal of the Audio Engineering Society 62, 12 (2014) 870–884.
126. H. Wittek, G. Theile: Development and application of a stereophonic multichannel recording technique for 3D audio and VR, in 143rd International Convention of the Audio Engineering Society, Audio Engineering Society, 2017.
127. H. Lee, M. Frank, F. Zotter: Spatial and timbral fidelities of binaural ambisonics decoders for main microphone array recordings, in Audio Engineering Society Conference: International Conference on Immersive and Interactive Audio, York, UK, 2019.
128. A. McKeag, D.S. McGrath: Sound field format to binaural decoder with head tracking, in 6th Australian Regional Convention of the AES, Audio Engineering Society, 1996.
129. A.M. O'Donovan, D.N. Zotkin, R. Duraiswami: Spherical microphone array based immersive audio scene rendering, in International Conference on Auditory Display, 2008.
130. J. Jiang, B. Xie, H. Mai: The number of virtual loudspeakers and the error for spherical microphone array recording and binaural rendering, in Audio Engineering Society Conference: International Conference on Spatial Reproduction – Aesthetics and Science, Tokyo, Japan, 2018.
131. H.L. Van Trees: Optimum array processing. John Wiley & Sons, 2002.
132. W. Song, W. Ellermeier, J. Hald: Binaural auralization based on spherical-harmonics beamforming. The Journal of the Acoustical Society of America 123, 5 (2008) 3159–3159.
133. W. Song, W. Ellermeier, J. Hald: Psychoacoustic evaluation of multichannel reproduced sounds using binaural synthesis and spherical beamforming. The Journal of the Acoustical Society of America 130, 4 (2011) 2063–2075.
134. W. Song, W. Ellermeier, J. Hald: Using beamforming and binaural synthesis for the psychoacoustical evaluation of target sources in noise. The Journal of the Acoustical Society of America 123, 2 (2008) 910–924.
135. S. Spors, H. Wierstorf, M. Geier: Comparison of modal versus delay-and-sum beamforming in the context of data-based binaural synthesis, in Audio Engineering Society Convention 132, Budapest, Hungary, April 2012.
136. M. Jeffet, N.R. Shabtai, B. Rafaely: Theory and perceptual evaluation of the binaural reproduction and beamforming tradeoff in the generalized spherical array beamformer. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24, 4 (2016) 708–718.
137. N.R. Shabtai, B. Rafaely: Binaural sound reproduction beamforming using spherical microphone arrays, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, 2013, pp. 101–105.
138. N.R. Shabtai, B. Rafaely: Spherical array beamforming for binaural sound reproduction, in IEEE Convention of Electrical and Electronics Engineers in Israel, Eilat, Israel, 2012, pp. 1–5.
139. N.R. Shabtai: Optimization of the directivity in binaural sound reproduction beamforming. The Journal of the Acoustical Society of America 138, 5 (2015) 3118–3128.
140. E. Hadad, D. Marquardt, S. Doclo, S. Gannot: Theoretical analysis of binaural transfer function MVDR beamformers with interference cue preservation constraints. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23, 12 (2015) 2449–2464.
141. E. Hadad, S. Doclo, S. Gannot: The binaural LCMV beamformer and its performance analysis. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24, 3 (2016) 543–558.
142. P. Calamia, S. Davis, C. Smalt, C. Weston: A conformal, helmet-mounted microphone array for auditory situational awareness and hearing protection, in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2017, pp. 96–100.
143. H. Beit-On, M. Lugasi, L. Madmoni, A. Menon, A. Kumar, J. Donley, V. Tourbabin, B. Rafaely: Audio signal processing for telepresence based on wearable array in noisy and dynamic scenes, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Singapore, 2022, accepted for publication.
144. M. Blau, A. Budnik, M. Fallahi, H. Steffens, S.D. Ewert, S. van de Par: Toward realistic binaural auralizations – perceptual comparison between measurement and simulation-based auralizations and the real room for a classroom scenario. Acta Acustica 5 (2021) 8.
145. I. Ifergan, B. Rafaely: On the selection of the number of beamformers in beamforming-based binaural reproduction. EURASIP Journal on Audio, Speech and Music Processing 6 (2022) 1–17.
146. D. Marelli, R. Baumgartner, P. Majdak: Efficient approximation of head-related transfer functions in subbands for accurate sound localization. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23, 7 (2015) 1130–1143.
147. V. Pulkki: Spatial sound reproduction with directional audio coding. Journal of the Audio Engineering Society 55, 6 (2007) 503–516.
148. M.M. Goodwin, J.-M. Jot: Primary-ambient signal decomposition and vector-based localization for spatial audio coding and enhancement, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Honolulu, Hawaii, USA, 2007, pp. I-9–I-12.
149. N. Barrett, S. Berge: A new method for B-format to binaural transcoding, in Audio Engineering Society Conference: 40th International Conference: Spatial Audio: Sense the Sound of Space, Audio Engineering Society, 2010.
150. S. Berge, B. Allmenndigitale, N. Barrett: High angular resolution planewave expansion, in Proceedings of the 2nd International Symposium on Ambisonics and Spherical Acoustics, Paris, France, 2010.
151. O. Thiergart, E.A.P. Habets: Parametric sound acquisition using a multi-wave signal model and spatial filters, in Parametric Time-Frequency Domain Spatial Audio, V. Pulkki, S. Delikaris-Manias, A. Politis, Eds., John Wiley & Sons, 2017.
152. O. Thiergart, M. Taseska, E.A.P. Habets: An informed parametric spatial filter based on instantaneous direction-of-arrival estimates. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22, 12 (2014) 2182–2196.
153. C.T. Jin, Y. Shiduo, F. Antonacci, A. Sarti: Perspectives on microphone array processing including sparse recovery, ray space analysis, and neural networks. Acoustical Science and Technology 41, 1 (2020) 308–317.
154. A. Politis, J. Vilkamo, V. Pulkki: Sector-based parametric sound field reproduction in the spherical harmonic domain. IEEE Journal of Selected Topics in Signal Processing 9, 5 (2015) 852–866.
155. V. Pulkki, A. Politis, G. Del Galdo, A. Kuntz: Parametric spatial audio reproduction with higher-order B-format microphone input, in Audio Engineering Society Convention 134, Audio Engineering Society, 2013.
156. A. Politis, S. Tervo, V. Pulkki: COMPASS: coding and multidirectional parameterization of ambisonic sound scenes, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 6802–6806.
157. L. McCormack, A. Politis, R. Gonzalez, T. Lokki, V. Pulkki: Parametric ambisonic encoding of arbitrary microphone arrays. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022) 2062–2075.
158. J. Fernandez, L. McCormack, P. Hyvärinen, A. Politis, V. Pulkki: Enhancing binaural rendering of head-worn microphone arrays through the use of adaptive spatial covariance matching. The Journal of the Acoustical Society of America 151, 4 (2022) 2624–2635.
159. L. McCormack, A. Politis, V. Pulkki: Rendering of source spread for arbitrary playback setups based on spatial covariance matching, in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, 2021.
160. J. Daniel, S. Kitić: Echo-enabled direction-of-arrival and range estimation of a mobile source in Ambisonic domain, 2022. arXiv preprint arXiv:2203.05265.
161. S. Kitić, J. Daniel: Generalized time domain velocity vector, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022, pp. 936–940.
162. T. Shlomo, B. Rafaely: Blind amplitude estimation of early room reflections using alternating least squares, in ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Canada, 2021, pp. 476–480.
163. T. Shlomo, B. Rafaely: Blind localization of early room reflections using phase aligned spatial correlation. IEEE Transactions on Signal Processing 69 (2021) 1213–1225.
164. IEEE AASP challenge on detection and classification of acoustic scenes and events (DCASE). Accessed on December 6, 2021. http://dcase.community/challenge2021/
165. A. Mesaros, T. Heittola, T. Virtanen: A multi-device dataset for urban acoustic scene classification, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), Surrey, UK, 2018, pp. 9–13.
166. A. Politis, S. Adavanne, T. Virtanen: A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection, in Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2020), 2020.
167. P.-A. Grumiaux: Deep learning for speaker counting and localization with Ambisonics signals. PhD thesis, Université Grenoble Alpes (UGA), 2021.
168. J. Eaton, N.D. Gaubitch, A.H. Moore, P.A. Naylor: Estimation of room acoustic parameters: the ACE challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24, 10 (2016) 1681–1693.
169. H. Gamper, I.J. Tashev: Blind reverberation time estimation using a convolutional neural network, in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), IEEE, 2018, pp. 136–140.
170. P. Götz, C. Tuna, A. Walther, E.A.P. Habets: Blind reverberation time estimation in dynamic acoustic conditions, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022.
171. S. Deng, W. Mack, E.A.P. Habets: Online blind reverberation time estimation using CRNNs, in INTERSPEECH, Incheon, Korea, 2020, pp. 5061–5065.
172. S. Duangpummet, J. Karnjana, W. Kongprawechnon, M. Unoki: Blind estimation of room acoustic parameters and speech transmission index using MTF-based CNNs, in The European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 2021, pp. 181–185. arXiv:2103.07904.
173. D. Looney, N.D. Gaubitch: Joint estimation of acoustic parameters from single-microphone speech observations, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 431–435.
174. P. Morgado, N. Vasconcelos, T. Langlois, O. Wang: Self-supervised generation of spatial audio for 360° video, 2018. arXiv preprint arXiv:1809.02587.
175. A. Richard, D. Markovic, I.D. Gebru, S. Krenn, G.A. Butler, F. Torre, Y. Sheikh: Neural synthesis of binaural speech from mono audio, in International Conference on Learning Representations, 2021.
176. M. Cobos, J. Ahrens, K. Kowalczyk, A. Politis: An overview of machine learning and other data-based methods for spatial audio capture, processing, and reproduction. EURASIP Journal on Audio, Speech, and Music Processing 2022, 1 (2022) 1–21.
177. HEAR360: 8Ball microphone. Accessed on December 6, 2021. https://8ballmicrophones.com
178. 3DIO: Omni binaural microphone. Accessed on December 6, 2021. https://3diosound.com/products/omni-binaural-microphone
179. M. Noisternig, A. Sontacchi, T. Musil, R. Holdrich: A 3D ambisonic based binaural sound reproduction system, in Audio Engineering Society Conference: 24th International Conference: Multichannel Audio, The New Reality, 2003.
180. L.S. Davis, R. Duraiswami, E. Grassi, N.A. Gumerov, Z. Li, D.N. Zotkin: High order spatial audio capture and its binaural head-tracked playback over headphones with HRTF cues, in Audio Engineering Society Convention 119, Audio Engineering Society, 2005.
181. C.H. Choi, J. Ivanic, M.S. Gordon, K. Ruedenberg: Rapid and stable determination of rotation matrices between spherical harmonics by direct recursion. The Journal of Chemical Physics 111, 19 (1999) 8825–8831.
182. N.A. Gumerov, R. Duraiswami: Fast multipole methods for the Helmholtz equation in three dimensions. Elsevier, 2005.
183. P.J. Kostelec, D.N. Rockmore: FFTs on the rotation group. Journal of Fourier Analysis and Applications 14, 2 (2008) 145–179.
184. D. Pinchon, P.E. Hoggan: Rotation matrices for real spherical harmonics: general rotations of atomic orbitals in space-fixed axes. Journal of Physics A: Mathematical and Theoretical 40, 7 (2007) 1597.
185. B. Rafaely, M. Kleider: Spherical microphone array beam steering using Wigner-D weighting. IEEE Signal Processing Letters 15 (2008) 417–420.
186. F. Zotter: Analysis and synthesis of sound-radiation with spherical arrays. PhD thesis, University of Music and Performing Arts, Vienna, Austria, 2009.
187. J. Ahrens, H. Helmholz, D.L. Alon, S.V.A. Garí: A head-mounted microphone array for binaural rendering, in 2021 Immersive and 3D Audio: from Architecture to Automotive (I3DA), IEEE, 2021, pp. 1–7.
188. J. Ahrens, H. Helmholz, D.L. Alon, S.V.A. Garí: Spherical harmonic decomposition of a sound field based on microphones around the circumference of a human head, in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), IEEE, 2021, pp. 231–235.
189. L. Madmoni, J. Donley, V. Tourbabin, B. Rafaely: Binaural reproduction from microphone array signals incorporating head-tracking, in 2021 Immersive and 3D Audio: from Architecture to Automotive (I3DA), IEEE, 2021, pp. 1–5.
190. D. Rivas Méndez, C. Armstrong, J. Stubbs, M. Stiles, G. Kearney: Practical recording techniques for music production with six-degrees of freedom virtual reality, in Audio Engineering Society Convention 145, Audio Engineering Society, 2018.
191. J. Daniel: Spatial sound encoding including near field effect: introducing distance coding filters and a viable, new ambisonic format, in Audio Engineering Society Conference: 23rd International Conference: Signal Processing in Audio Recording and Reproduction, Copenhagen, Denmark, 2003.
192. E. Stein, M.M. Goodwin: Ambisonics depth extensions for six degrees of freedom, in Audio Engineering Society Conference: International Conference on Headphone Technology, San Francisco, CA, USA, 2019.
193. F. Zotter, M. Frank, C. Schörkhuber, R. Höldrich: Signal-independent approach to variable-perspective (6DoF) audio rendering from simultaneous surround recordings taken at multiple perspectives, in Fortschritte der Akustik (DAGA), Hannover, Germany, 2020.
194. E. Bates, H. O'Dwyer, K.-P. Flachsbarth, F.M. Boland: A recording technique for 6 degrees of freedom VR, in Audio Engineering Society Convention 144, Audio Engineering Society, 2018.
195. E. Fernandez-Grande: Sound field reconstruction using a spherical microphone array. The Journal of the Acoustical Society of America 139, 3 (2016) 1168–1178.
196. T. Pihlajamaki, V. Pulkki: Synthesis of complex sound scenes with transformation of recorded spatial sound in virtual reality. Journal of the Audio Engineering Society 63, 7/8 (2015) 542–551.
197. A. Plinge, S.J. Schlecht, O. Thiergart, T. Robotham, O. Rummukainen, E.A.P. Habets: Six-degrees-of-freedom binaural audio reproduction of first-order Ambisonics with distance information, in Audio Engineering Society Conference: International Conference on Audio for Virtual and Augmented Reality, 2018.
198. K. Wakayama, J. Trevino, H. Takada, S. Sakamoto, Y. Suzuki: Extended sound field recording using position information of directional sound sources, in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), IEEE, 2017, pp. 185–189.
199. A. Allen, B. Kleijn: Ambisonics soundfield navigation using directional decomposition and path distance estimation, in International Conference on Spatial Audio, Graz, Austria, 2017.
200. M. Kentgens, A. Behler, P. Jax: Translation of a higher order Ambisonics sound scene based on parametric decomposition, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 151–155.
201. F. Schultz, S. Spors: Data-based binaural synthesis including rotational and translatory head-movements, in Audio Engineering Society Conference: 52nd International Conference: Sound Field Control – Engineering and Perception, Guildford, UK, 2013.
202. Y. Wang, K. Chen: Translations of spherical harmonics expansion coefficients for a sound field using plane wave expansions. The Journal of the Acoustical Society of America 143, 6 (2018) 3474–3478.
203. L. Birnie, T. Abhayapala, V. Tourbabin, P. Samarasinghe: Mixed source sound field translation for virtual binaural application with perceptual validation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 1188–1203.
204. J.G. Tylka, E. Choueiri: Comparison of techniques for binaural navigation of higher-order ambisonic soundfields, in Audio Engineering Society Convention 139, Audio Engineering Society, 2015.
205. J.G. Tylka, E.Y. Choueiri: Performance of linear extrapolation methods for virtual sound field navigation. Journal of the Audio Engineering Society 68, 3 (2020) 138–156.
206. M. Kentgens, P. Jax: Ambient-aware sound field translation using optimal spatial filtering, in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), IEEE, 2021, pp. 236–240.
207. M. Kentgens, S. Al Hares, P. Jax: On the upscaling of higher-order Ambisonics signals for sound field translation, in 2021 29th European Signal Processing Conference (EUSIPCO), IEEE, 2021, pp. 81–85.
208. A. Brutti, M. Omologo, P. Svaizer: Localization of multiple speakers based on a two step acoustic map analysis, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Las Vegas, NV, USA, 2008, pp. 4349–4352.
209. A. Brutti, M. Omologo, P. Svaizer: Multiple source localization based on acoustic map de-emphasis. EURASIP Journal on Audio, Speech, and Music Processing 2010 (2010) 1–17.
210. G. Del Galdo, O. Thiergart, T. Weller, E.A.P. Habets: Generating virtual microphone signals using geometrical information gathered by distributed arrays, in 2011 Joint Workshop on Hands-free Speech Communication and Microphone Arrays, IEEE, 2011, pp. 185–190.
211. O. Thiergart, G. Del Galdo, M. Taseska, E.A.P. Habets: Geometry-based spatial sound acquisition using distributed microphone arrays. IEEE Transactions on Audio, Speech, and Language Processing 21, 12 (2013) 2583–2594.
212. X. Zheng: Soundfield navigation: separation, compression and transmission. PhD thesis, University of Wollongong, Wollongong, Australia, 2013.
213. J.G. Tylka, E. Choueiri: Soundfield navigation using an array of higher-order Ambisonics microphones, in Audio Engineering Society Conference: International Conference on Audio for Virtual and Augmented Reality, Los Angeles, CA, USA, 2016.
214. J.G. Tylka, E.Y. Choueiri: Domains of practical applicability for parametric interpolation methods for virtual sound field navigation. Journal of the Audio Engineering Society 67, 11 (2019) 882–893.
215. J.G. Tylka: Virtual navigation of Ambisonics-encoded sound fields containing near-field sources. PhD thesis, Princeton University, Princeton, USA, 2019.
216. M.F. Fallon, S.J. Godsill: Acoustic source localization and tracking of a time-varying number of speakers. IEEE Transactions on Audio, Speech, and Language Processing 20, 4 (2011) 1409–1415.
217. S. Kitić, A. Guérin: TRAMP: tracking by a real-time ambisonic-based particle filter, in Proceedings of LOCATA Challenge Workshop – a satellite event of IWAENC 2018, Tokyo, Japan, 2018.
218. J.-M. Valin, F. Michaud, J. Rouat: Robust 3D localization and tracking of sound sources using beamforming and particle filtering, in IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), vol. 4, Toulouse, France, 2006, pp. IV-841–IV-844.
219. J.-M. Valin, F. Michaud, J. Rouat: Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering. Robotics and Autonomous Systems 55, 3 (2007) 216–228.
220. D.B. Ward, E.A. Lehmann, R.C. Williamson: Particle filtering algorithms for tracking an acoustic source in a reverberant environment. IEEE Transactions on Speech and Audio Processing 11, 6 (2003) 826–836.
221. N. Mariette, B.F.G. Katz, K. Boussetta, O. Guillerminet: SoundDelta: a study of audio augmented reality using WiFi-distributed ambisonic cell rendering, in Audio Engineering Society Convention 128, Audio Engineering Society, 2010.
222. E. Patricio, A. Ruminski, A. Kuklasinski, L. Januszkiewicz, T. Zernicki: Toward six degrees of freedom audio recording and playback using multiple Ambisonics sound fields, in Audio Engineering Society Convention 146, Audio Engineering Society, 2019.
223. C. Schörkhuber, R. Höldrich, F. Zotter: Triplet-based variable-perspective (6DoF) audio rendering from simultaneous surround recordings taken at multiple perspectives, in Fortschritte der Akustik (DAGA), vol. 4, Hannover, Germany, 2020.
224. P. Grosche, F. Zotter, C. Schörkhuber, M. Frank, R. Höldrich: Method and apparatus for acoustic scene playback, 2020. US Patent 10,785,588.
225. M. Blochberger, F. Zotter: Particle-filter tracking of sounds for frequency-independent 3D audio rendering from distributed B-format recordings. Acta Acustica 5 (2021) 20.
226. L. McCormack, A. Politis, T. McKenzie, C. Hold, V. Pulkki: Object-based six-degrees-of-freedom rendering of sound scenes captured with multiple Ambisonic receivers. Journal of the Audio Engineering Society 70, 5 (2022) 355–372.
227. E. Erdem, O. Olgun, H. Hacihabiboğlu: Internal time delay calibration of rigid spherical microphone arrays for multi-perspective 6DoF audio recordings, in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), IEEE, 2021, pp. 241–245.
228. O. Olgun, E. Erdem, H. Hachabiboğlu: Rotation calibration of rigid spherical microphone arrays for multi-perspective 6DoF audio recordings, in 2021 Immersive and 3D Audio: from Architecture to Automotive (I3DA), IEEE, 2021, pp. 1–7.
229. A.H. Moore, L. Lightburn, W. Xue, P.A. Naylor, M. Brookes: Binaural mask-informed speech enhancement for hearing aids with head tracking, in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), Tokyo, Japan, 2018, pp. 461–465.
230. N.R. Shabtai, B. Rafaely: Generalized spherical array beamforming for binaural speech reproduction. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22, 1 (2013) 238–247.
231. C. Borrelli, A. Canclini, F. Antonacci, A. Sarti, S. Tubaro: A denoising methodology for higher order Ambisonics recordings, in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), IEEE, 2018, pp. 451–455.
232. M. Lugasi, B. Rafaely: Speech enhancement using masking for binaural reproduction of Ambisonics signals. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020) 1767–1777.
233. A. Herzog, E.A.P. Habets: Direction and reverberation preserving noise reduction of ambisonics signals. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020) 2461–2475.
Cite this article as: Rafaely B., Tourbabin V., Habets E., Ben-Hur Z., Lee H., et al. 2022. Spatial audio signal processing for binaural reproduction of recorded acoustic scenes – review and challenges. Acta Acustica, 6, 47.