Calibration and Utilization of SMARTEAR
by
Carl Kevin L. Mirhan
Department of Physics
School of Arts and Sciences
University of San Carlos, Cebu City, Philippines
Contents
Introduction........................................................................................................................1
1.1 Rationale....................................................................................................................1
1.2 Objectives..................................................................................................................4
Theory.................................................................................................................................6
2.1 Sound.........................................................................................................................6
2.3 Noise..........................................................................................................................8
Methodology.....................................................................................................................23
3.1 Calibration................................................................................................................23
3.2.1 Data Acquisition................................................................................................27
3.2.4 Annotation.........................................................................................................31
4.1 Calibration................................................................................................................34
Bibliography:....................................................................................................................68
List of Figures
Figure 2.1 A time domain signal (Sum) broken down into its frequency components....10
Figure 2.3 Equal loudness contours of the human ear for pure tones [9].........................13
Figure 2.4 SPL gains according to different frequencies for A, C and Z weightings [14]....15
Figure 2.5 An example of the microphone response with respect to frequency [17].......18
Figure 2.7 A plot of the same 50 observations with respect to their PC’s [19]................19
Figure 3.4 (a) Reference sample in the time domain and (b) Frequency spectrum of the
reference sample................................................................................................................30
Figure 4.2 Graph of RMS from the time domain vs the RMS from the frequency domain....37
Figure 4.3 Correction factors for each frequency present in the pink and white noise....38
Figure 4.4 SPL comparison of pure tones after correction factor application..................39
Figure 4.5 SPL comparison of pink and white noise after correction factor application. 40
Figure 4.13 a) Frequency spectrum that contains the barking found in the reference
sample and at a similar intensity and b) The first test frequency spectrum plotted against
the reference frequency spectrum......................................................................................48
Figure 4.14 a) Frequency spectrum that contains the barking found in the reference
sample but not at the same intensity and b) The second test frequency spectrum plotted
against the reference..........................................................................................................50
Figure 4.15 a) Frequency spectrum that contains barking from different dogs found in
the reference sample and b) The third test frequency spectrum plotted against the
reference frequency spectrum............................................................................................52
Figure 4.16 Confusion matrix of logistic regression predictions using a model fitted with
annotation 1 labels vs. annotation 1 true values................................................................54
Figure 4.17 Confusion matrix of logistic regression predictions using a model fitted with
annotation 1 labels vs. annotation 2 true values................................................................55
Figure 4.18 Confusion matrix of logistic regression predictions using a model fitted with
annotation 2 labels vs. annotation 2 true values................................................................56
Figure 4.19 Confusion matrix of logistic regression predictions using a model fitted with
annotation 2 labels vs. annotation 1 true values................................................................57
Figure 4.22 Confusion matrix of the first iteration using the samples from Sidlakan
Marketing...........................................................................................................................61
Figure 4.23 a) Frequency spectrum of the sample with the ambulance siren and b)
Spectrum comparison of the sample with the ambulance and the reference sample.........62
Figure 4.24 Confusion matrix of the second iteration using the samples from Sidlakan
Marketing...........................................................................................................................62
List of Tables
Table 1.1 Rules and Regulations of the National Pollution Control Commission (NPCC)....3
Table 4.2 Accuracy and standard error values for predictions made by k-means............59
Table 4.3 Accuracy and standard error values for predictions made by logistic regression....59
Chapter 1
Introduction
1.1 Rationale
Sound is a vital aspect of our day-to-day lives. Whether it be in the form of speech, music, or the hum of our surroundings, sound is present around us at all times. However, while there exist desirable and pleasant sounds, there is also the presence of undesirable and unwanted sound in our immediate surroundings. This undesirable sound, commonly called noise, is more than a mere nuisance.
Noise pollution is an issue that is featured among the top environmental risks to health. [1] And although people often grow accustomed to noise levels in their vicinity, if exposure is chronic and exceeds certain levels, then negative health outcomes can be seen. [2] The commonality
of these negative health outcomes is evident in the sheer number of studies correlating
noise and health. For example, in 2010, Vos et al. conducted a Global Burden of Disease Study and estimated that hearing loss affected 1.3 billion people, ranking it the 13th most important contributor to the global years lived with disability (YLD). It was also discovered from the study that adult-onset hearing loss unrelated to a specific disease process accounted for a considerable share of this burden.
Stansfeld et al. also compiled a thorough study on noise and its many non-auditory effects on health. Citing, among other findings, cognitive difficulties in children, the study submitted that noise pollution and its effects warrant serious attention.
With the rising concern on noise and its effects on society, along with it comes an
increasing need to study environmental noise levels and, most importantly, to monitor
them. Numerous acoustic sensor monitoring systems have been deployed across the globe. One such system is a functional sensor with cloud connectivity, on-board calculations, and real-time data presentation remotely and online. In their pilot test, two devices were deployed in a local area, achieving precise calculations as well as the sending and publishing of the data obtained. [5]
Whytock et al. also developed an audio recorder which they named Solo using
Raspberry Pi for bioacoustics research. In their study, they were able to deploy around 40 Solo units which gathered 52,381 hours of audio recordings at a sampling rate of 16 kHz. Spectrograms of frequency vs. time showed that the extracted data from the recorded bird songs of specific species could be accurately utilized to differentiate one species from another. [6]

In the Philippines, an amendment to the Noise Control Regulations states that Philippine law requires a specific maximum sound level for different classes of areas at different periods of the day. [7]
Shown in Table 1.1 are the different categories of areas and their respective noise level
regulations as set by the National Pollution Control Commission (NPCC). The tabulation
shows that for areas that require quietness, especially schools and homes for the aged, the
maximum allowable noise level is at 50 dB in the day and that for residential areas, the
maximum allowable noise level is at 55 dB. This coincides with the WHO guidelines for noise control, which recommend that road traffic noise, one of the most common sources of environmental noise, be kept below 53 dB, as road traffic noise above this level is associated with adverse health effects. [1]
Table 1.1 Rules and Regulations of the National Pollution Control Commission (NPCC)
The difficulty that arises concerning these regulations is that there is a lack of consistent monitoring and enforcement. One study, for instance, found that areas in Metro Manila exhibited noise levels that far exceeded the WHO recommended 53 dB.
Using a sound level meter, they measured the noise levels of tricycles traveling along the
road within the vicinity of major residential areas. They found that the noise levels
generated by tricycles ranged from 88 – 100 dBA, where dBA is the A-weighted noise level, a weighting that accounts for the varying sensitivity of the ear to sound at different frequencies. This range comes from
the variation of the load carried by the tricycles as well as the speed and the slope of the
road on which they were travelling. The study concluded that measured roadside noise
levels at a residential area with high tricycle traffic exceeded the local noise standards at all periods of the day.
1.2 Objectives
This study aims to calibrate as well as test the effectivity of an easily deployable acoustic sensor system. This system, the SMART-EAR (Sound Monitoring, Assessment, and Recording Tool for Environmental Acoustic Research), will be the primary focus of this study. Because the microphone attached to the SMART-EAR system is uncalibrated, its readings will be compared with those of a calibrated reference microphone. Specifically, their computed SPL readings will be the main parameter to be compared.
This is important because, with the help of this system, noise level data can be accurately gathered at any time of the day, and noise levels in schools, offices and residential areas can be monitored with much
ease.
This study also aims to analyze the frequency spectrums extracted from the
recordings to characterize specific acoustic activities. A residential area, for example, can
easily be identified by the presence of dogs barking in the neighborhood. This will be the
primary activity used in this study and, by using a reference sample where this activity
was prevalent, samples where the activity took place can be identified and clustered. This
study will also attempt to identify and cluster samples that contained barking in general.
This study will focus on the calibration of the SMART-EAR system using the computer software LabVIEW. It will not tackle the complete development of the system itself. That can be found in Fornis, R.'s study entitled "Development of Sound Monitoring, Assessment, and Recording Tool for Environmental Acoustic Research (SMART-EAR)" [21].
During this study, gathering of data will only be carried out in one residential area
as well as one commercial area. West City Homes, a subdivision located in Labangon,
Cebu City as well as Sidlakan Marketing, a rice trading business located at Tabo-an,
Cebu City, will be the areas utilized for this study. For the sake of privacy, gathered data will be presented in the form of an average frequency spectrum and not of a frequency-time spectrum. The acoustic activity used as a reference
in this study will be the barking of a female Yorkshire Terrier living in the subdivision.
Chapter 2
Theory
Before addressing the calibration as well as the application of the SMART-EAR system,
it is important to review what sound is and how it is produced. This would, in turn, give
greater insight into how the SMART-EAR system operates so that its application and use
would be at its most efficient. The concepts of sound pressure level, frequency weighting,
Discrete Fourier Transform (DFT), Fast Fourier Transform (FFT) and noise will be
properly addressed in this section. Common microphone calibration methods will also be
discussed here.
2.1 Sound
Sound acts as a stimulus via the propagation of pressure changes in a wave motion across
an elastic medium. In the case of human beings, this medium is usually air. When this
wave of pressure changes reaches our ears, our sense of hearing is excited, which then translates into our generalized perception. [9] These pressure changes are brought about by the vibration of objects in the medium. It would then make sense that, to measure sound, one would have to measure its sound pressure, denoted by p. This is actually the most accessible parameter to measure. However, with regards to its
effects, the energy content of a certain sound signal over a period of time is more relevant
than its instantaneous value and thus, the root mean square (RMS) value is of more
interest:

\tilde{p} = \sqrt{\frac{1}{T}\int_0^T p^2(t)\,dt} \qquad (2.1)
The pressure fluctuation, however, is actually very small compared to the normal
atmospheric air pressure and the faintest perceivable sound is of the order 20 μPa or 2
x 10−5Pa. On the other hand, the upper limit of perceivable sound is often called the
threshold of pain and is of the order 20 Pa. [11] This tells us that sound pressures would
span a very wide range of magnitudes, with a ratio of

\frac{20\,\mathrm{Pa}}{2\times 10^{-5}\,\mathrm{Pa}} = 10^{6}

between the loudest and faintest perceivable sounds.
Because of the large dynamic range of audible sounds, the strength of a sound is best
described as a logarithm of the sound pressure. This logarithmic measure is commonly known as the sound pressure level (SPL):

L = 10\log_{10}\!\left(\frac{p^2}{p_0^2}\right) = 20\log_{10}\!\left(\frac{p}{p_0}\right) \qquad (2.2)
where L is the SPL, p denotes the pressure of a certain sound signal at a given time, and p_0 is the reference pressure of 20 μPa.
2.3 Noise
As defined previously in the introduction, noise is any undesirable sound, often perceived as a nuisance. Like any sound, it is a fluctuation of pressure and therefore its strength fluctuates over time. Keeping this in mind, a need for an equivalent sound pressure level is important, and this is attained if the RMS of the sound pressure is used in its computation.
By using Eq. 2.1 and applying it to Eq. 2.2, the equivalent SPL is:
L_{eq} = 20\log_{10}\!\left(\frac{\tilde{p}}{p_0}\right) = 20\log_{10}\!\left(\frac{1}{p_0}\sqrt{\frac{1}{T}\int_0^T p^2(t)\,dt}\right) \qquad (2.3)
This equation gives us a constant SPL during an averaging time T which has the
same total energy of a varying SPL of that same time interval. In most guidelines and
evaluations, the equivalent SPL (or noise level) is one of the most frequently utilized
data.
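As a minimal numerical sketch of Eqs. 2.1–2.3 (using a synthetic 1 kHz tone with an assumed 0.2 Pa amplitude rather than real data), the RMS pressure and equivalent SPL can be computed as follows:

```python
import numpy as np

# Synthetic pressure signal: a 1 kHz pure tone of 0.2 Pa amplitude,
# sampled at 16 kHz for one second (values are illustrative only).
fs = 16000
t = np.arange(0, 1.0, 1 / fs)
p = 0.2 * np.sin(2 * np.pi * 1000 * t)

p0 = 20e-6                          # reference pressure, 20 uPa
p_rms = np.sqrt(np.mean(p ** 2))    # Eq. 2.1 in discrete form
L_eq = 20 * np.log10(p_rms / p0)    # Eq. 2.3

print(round(L_eq, 1))               # about 77.0 dB for this tone
```

For a pure tone the RMS is the amplitude divided by √2, so the result can be checked by hand: 20·log10(0.1414/2×10⁻⁵) ≈ 77 dB.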
2.4 Frequency Domain Analysis

Signals are generally represented in the time domain, which simply provides the
amplitudes of a signal at the instants of time during which it was sampled. Fourier’s
theorem, however, states that a signal x(t) can be expressed as a sum of sinusoids of different frequencies, amplitudes, and phases. Obtaining these frequency components requires the utilization of the Fourier transform given by the following equation [12]:

X(f) = \int_{-\infty}^{\infty} x(t)\,e^{-2\pi i f t}\,dt \qquad (2.4)
Figure 2.1 A time domain signal (Sum) broken down into its frequency components
Consequently, the time signal can be retrieved from its frequency components through the inverse Fourier transform:

x(t) = \int_{-\infty}^{\infty} X(f)\,e^{2\pi i f t}\,df \qquad (2.5)
Because this study will be dealing with discretized, digital signals, equations 2.4
and 2.5 cannot apply since they only deal with continuous signals over a certain period.
The Fourier transform for discrete samples is known as the Discrete Fourier Transform (DFT):

X(f) = \frac{1}{N}\sum_{n=0}^{N-1} x(n)\,e^{-2\pi i f n / N} \qquad (2.6)
The corresponding inverse transform is:

x(n) = \sum_{f=0}^{N-1} X(f)\,e^{2\pi i f n / N} \qquad (2.7)
DFT computation from the definition provided is often a slow and impractical process. To overcome this obstacle, most algorithms employ the Fast Fourier Transform (FFT), which not only reduces the number of operations from the order of N^2 to N\log_2 N but also retains all the properties of the DFT. [12] LabVIEW, the primary software to be utilized in this study, employs the FFT in calculating the DFT of a time-domain signal.
To check the accuracy of the FFT algorithm, Parseval’s Theorem will be used.
The theorem states that the total energy computed in the time domain must equal the total energy computed in the frequency domain:

\sum_{i=0}^{n-1} |x_i|^2 = \frac{1}{n}\sum_{k=0}^{n-1} |X_k|^2 \qquad (2.8)
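Parseval's check of Eq. 2.8 can be sketched with NumPy's unnormalized FFT, whose convention matches the form of the equation (the random signal here is a stand-in for a recorded frame):

```python
import numpy as np

# Random test signal standing in for a recorded frame.
rng = np.random.default_rng(0)
x = rng.normal(size=1024)

X = np.fft.fft(x)                              # unnormalized DFT
energy_time = np.sum(np.abs(x) ** 2)           # left-hand side of Eq. 2.8
energy_freq = np.sum(np.abs(X) ** 2) / len(x)  # right-hand side of Eq. 2.8

# The two energies agree to floating-point precision.
print(np.isclose(energy_time, energy_freq))    # True
```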
Real signals are continuous-time, analog signals. This poses a problem for computers, which can only process discrete data. Sampling the signal and passing it through a discrete-time system bridges the gap between the continuous-time and discrete-time worlds. The difference between the highest and lowest frequencies of the spectral
components of a signal is the bandwidth of the signal and the Sampling Theorem states
that a real signal whose spectrum is bandlimited to f max Hz, can be reconstructed exactly
from its samples taken uniformly at a rate f_s ≥ 2 f_{max} samples per second. The minimum sampling rate is therefore:

f_s = 2 f_{max} \qquad (2.9)
This bandlimit f max is called the Nyquist frequency while the minimum sampling
rate, f_s, is called the Nyquist rate. Applying this theorem removes aliasing, or the poor reconstruction of a signal caused by sampling below the Nyquist rate; an illustration of the difference between an aliased signal and an adequately sampled signal is shown in Figure 2.2.
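Aliasing can also be demonstrated numerically. Assuming a synthetic one-second 9 kHz tone sampled at 16 kHz (i.e., below its Nyquist rate of 18 kHz), the tone appears in the sampled spectrum at 16 kHz − 9 kHz = 7 kHz:

```python
import numpy as np

fs = 16000                        # sampling rate, Hz
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 9000 * t)  # 9 kHz tone: above the 8 kHz limit

# Locate the dominant frequency in the sampled signal's spectrum.
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
peak = freqs[np.argmax(spectrum)]

print(peak)  # 7000.0 -- the tone has aliased down to 7 kHz
```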
In the case of human hearing, sensitivity is frequency dependent. This means that,
subjectively, comparing two tones of different frequencies will not sound equally loud
even if they both have the same SPL. [11] This was demonstrated in a study by Robinson et al. in 1956 wherein they employed the constant stimulus method. Participants in their
study were tasked to make comparisons between a pure tone of constant sound pressure
level and frequency and another pure tone of 1 kHz which had randomly varied pressure
levels. [13] They termed these loudness levels as ‘phon’ and because a 1 kHz tone was
used as reference, the phon would be equivalent to the sound pressure level of that 1 kHz
tone. Although this procedure required averaging of numerous results, the overall data
was consistent and so they compiled their findings in the figure below:
Figure 2.3 Equal loudness contours of the human ear for pure tones [9]
The contours show that a 60 dB, 1 kHz tone generates a loudness level of 60 phon while an 80 dB, 1 kHz tone generates a loudness level of 80 phon. This leads us to the conclusion that human hearing is very much frequency dependent and so, when using a measuring device or sensor, a method is needed to mimic the frequency response of the human ear.
2.6 Frequency Weighting

To bridge this gap between objective sensor measurements and subjective human
hearing, frequency weightings are used. These are networks with frequency dependent
gains, and the International standard for sound level meters, IEC 61672-1, commonly
uses the A, C and Z weightings. The gain applied to the decibel reading at each frequency is obtained by evaluating the corresponding weighting function.

The Z weighting has no filter applied on the signal and its gain is therefore 0 dB at all frequencies. Its weighting function is simply:

W_Z(f) = 1 \qquad (2.11)
The C weighting is a network wherein the signal is attenuated below a low frequency cutoff point and above a high frequency cutoff point. It is primarily used to assess noise with low frequency content and primarily focuses on peak values of the signal. [14] Gains for each frequency are computed through its weighting function:

W_C(f) = 1.007152\;\frac{\left(f/20.6\,\mathrm{Hz}\right)^2}{\left[1+\left(f/20.6\,\mathrm{Hz}\right)^2\right]\left[1+\left(f/12194\,\mathrm{Hz}\right)^2\right]} \qquad (2.12)
The A weighting is similar but applies additional attenuation around two low-to-mid frequency cutoff points, giving it an effectively higher frequency cutoff point. It is mainly applied for general sound level measurement and is most commonly used in occupational safety and health acts. Its weighting function is defined by:

W_A(f) = 1.258905\;\frac{\left(f/20.6\,\mathrm{Hz}\right)^2}{1+\left(f/20.6\,\mathrm{Hz}\right)^2}\cdot\frac{f/107.7\,\mathrm{Hz}}{\sqrt{1+\left(f/107.7\,\mathrm{Hz}\right)^2}}\cdot\frac{f/737.9\,\mathrm{Hz}}{\sqrt{1+\left(f/737.9\,\mathrm{Hz}\right)^2}}\cdot\frac{1}{1+\left(f/12194\,\mathrm{Hz}\right)^2} \qquad (2.13)
Figure 2.4 shows a graph of the SPL gains plotted against frequency for the three weightings.
Figure 2.4 SPL gains according to different frequencies for A, C and Z weightings [14]
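The weighting functions of Eqs. 2.12 and 2.13 can be evaluated directly. The sketch below uses hand-rolled transcriptions of the two equations (not a vetted IEC 61672-1 implementation) and reproduces the expected 0 dB gain of both weightings at 1 kHz:

```python
import numpy as np

def w_c(f):
    """C weighting of Eq. 2.12 (f in Hz, linear gain)."""
    r = (f / 20.6) ** 2
    return 1.007152 * r / ((1 + r) * (1 + (f / 12194) ** 2))

def w_a(f):
    """A weighting of Eq. 2.13 (f in Hz, linear gain)."""
    return (1.258905
            * (f / 20.6) ** 2 / (1 + (f / 20.6) ** 2)
            * (f / 107.7) / np.sqrt(1 + (f / 107.7) ** 2)
            * (f / 737.9) / np.sqrt(1 + (f / 737.9) ** 2)
            / (1 + (f / 12194) ** 2))

for f in (100, 1000, 8000):
    print(f, round(20 * np.log10(w_a(f)), 1), round(20 * np.log10(w_c(f)), 1))
# At 1 kHz both weightings give roughly 0 dB, as seen in Figure 2.4,
# while the A weighting strongly attenuates low frequencies.
```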
2.7 Calibration Methods
The focus of this study is to successfully calibrate as well as use the SMART-EAR
system in a practical setting. The raw output of a digital microphone recording is a voltage signal. The sensitivity of the microphone is used to obtain the pressure value of the recorded sound, as shown below:

P = \frac{V}{\mathrm{Sensitivity}} \qquad (2.14)
There are two common comparison methods of microphone calibration. The first is the sequential method wherein the microphone under test (UT) and the reference microphone (REF) are located at the same spatial position and are made to record the signal from a certain sound source one after the other. The drawback of this method is that, for the sake of voltage comparison between the two microphones, the sound signal must be temporally stable.
The second method, the simultaneous method, gets rid of this temporal requirement by recording with both microphones at the same time but at two different positions. Its drawback is that, at both positions, the sound pressure from the signal must be equal. This calls for a sufficient distance between the sound source and the microphones, to ensure that the wave front of the signal will be flat when it reaches the microphones. Another important aspect is that the room in which these methods will be employed must be anechoic. Considering this, the sequential method is more advantageous in minimizing the effects of sound reflection that would
occur in the room since both microphones would be located in the same place. The
equation used to calibrate the microphone using the comparison method is shown by:
\mathrm{Sensitivity}_{UT} = \mathrm{Sensitivity}_{REF}\left(\frac{V_{UT}}{V_{REF}}\right) \qquad (2.15)
where V_UT is the output voltage of the microphone under test and V_REF is the output voltage of the reference microphone.
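A sketch of the comparison method of Eq. 2.15: given paired RMS voltages recorded by the two microphones, a least-squares slope through the origin recovers the unknown sensitivity. All numbers here are invented for illustration:

```python
import numpy as np

sens_ref = 50e-3  # hypothetical reference sensitivity, V/Pa

# Hypothetical RMS voltages from recordings of the same sound sources.
v_ref = np.array([0.10, 0.25, 0.40, 0.80, 1.20])
v_ut = 0.8 * v_ref  # test mic outputs 80% of the reference voltage

# Least-squares slope through the origin of v_ut vs. v_ref,
# i.e. the best-fit voltage ratio V_UT / V_REF.
slope = np.sum(v_ut * v_ref) / np.sum(v_ref ** 2)
sens_ut = sens_ref * slope  # Eq. 2.15 applied to the fitted slope

print(round(sens_ut, 6))    # 0.04 V/Pa
```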
The comparison method, however, can only be applied to test microphones with a
flat frequency response. In the case of calibrating a microphone whose response varies
depending on the frequency being sampled, the frequency spectrums for both the
microphone under test and the reference microphone would have to be analyzed. In a
study published by Garg et al. in 2018, they employed an averaging method wherein frequency spectrums with a frequency resolution of 1 Hz for both systems were computed. They did this by dividing the time signal into a set of overlapping frames and computing the average FFT over all frames. The correction factors were then obtained by taking the difference of the two spectrums obtained from the WAV files. The researchers then used these correction factors to correct subsequent readings from the microphone under test.
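The per-frequency correction idea can be sketched as a simple dB-spectrum difference. The spectra below are synthetic stand-ins, and the details of Garg et al.'s actual processing may differ:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical averaged dB spectra (one value per frequency bin).
ref_db = rng.uniform(-60, -20, size=512)          # reference system
test_db = ref_db - 6 + rng.normal(0, 0.1, 512)    # test mic reads low

# Correction factor per frequency = reference minus test spectrum.
correction = ref_db - test_db

# Applying the corrections maps the test spectrum onto the reference.
corrected = test_db + correction
print(np.allclose(corrected, ref_db))  # True
```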
Figure 2.5 An example of the microphone response with respect to frequency [17]
In this study, each sample will be broken down into its constituent frequency SPL levels. Because the goal of this study is to identify characteristic acoustic events in the samples, Principal Component Analysis will be carried out for all samples. Principal Component Analysis (PCA) reduces the dimensionality of a
data set consisting of a large number of interrelated variables, while retaining as much as
possible of the variation present in the data set [19]. It is an orthogonal linear
transformation that transforms the data to a new coordinate system such that the greatest
variance by some scalar projection of the data comes to lie on the first coordinate (called
the first principal component), the second greatest variance on the second coordinate, and so on. Consider, for example, a set of 50 observations composed of two variables x_1 and x_2. PCA focuses on the variances of the two variables and transforms the variables into two principal components z_1 and z_2. The transformed observations are shown in Figure 2.7.
Figure 2.7 A plot of the same 50 observations with respect to their PC’s [19]
Formally, the transformation is defined by a set of vectors of weights or coefficients w_(k) = (w_1, …, w_p)_(k) that map each row vector x_(i) to a new vector of principal component scores.
In the scenario that there are more than two variables, PCA would still work in
reducing the dimensionality of the data. For example, in this study, a single recording is
composed of thousands of frequencies, each having their own value. PCA would reduce
these thousands of variables into a specified number of PC’s. By reducing the number of
variables to work on, PCA is a fundamental tool in making the data easier to visualize
and understand.
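One plausible sketch of this reduction with scikit-learn (the library this study uses) follows; the matrix of spectra here is randomly generated stand-in data, not the study's recordings:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 50 recordings, each described by 1000 frequency bins.
spectra = rng.normal(size=(50, 1000))

pca = PCA(n_components=2)
scores = pca.fit_transform(spectra)  # one (PC1, PC2) pair per recording

print(scores.shape)  # (50, 2)
```

By construction, PCA orders the components so that PC1 explains at least as much variance as PC2, which makes the two-dimensional scatter of scores a compact summary of the full spectra.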
2.9 K-means Clustering

K-means clustering is an unsupervised machine learning method that partitions a number of observations into a defined target number of k clusters. Each cluster refers to a group of data points gathered around a center point called a centroid. The algorithm uses an iterative expectation-maximization procedure to locate the best centroids for each cluster. Once
centroids have been located, each data point is allocated to each cluster by reducing the
in-cluster sum of squares. And because k-means is unsupervised, labelling of the data is
not needed to carry out clustering. However, it will be important in checking the accuracy
of the clustering.
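A minimal k-means sketch with scikit-learn is shown below, using two well-separated synthetic 2-D groups standing in for (PC1, PC2) values:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic groups of (PC1, PC2)-like points.
a = rng.normal(loc=(0.0, 0.0), scale=0.2, size=(30, 2))
b = rng.normal(loc=(5.0, 5.0), scale=0.2, size=(30, 2))
points = np.vstack([a, b])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = km.labels_

# Every point in a group receives the same label, and the groups differ.
print(len(set(labels[:30])), len(set(labels[30:])))  # 1 1
```

Note that k-means assigns arbitrary cluster numbers (0 or 1), so checking accuracy against annotations requires matching clusters to labels first.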
Logistic regression, on the other hand, is a supervised learning method. In logistic regression, data samples with an already existing
indicator variable are used to train a logistic model. This logistic model predicts the
probability of a certain class or event and, because this is a form of binary regression,
there are only two possible predictions (i.e. pass or fail) every time a new data sample is
fed into the model. In this study, the threshold will be set at 0.5. In other words, if the computed probability of a sample falls below 0.5, the model classifies the sample as 0. On the other hand, if the computed probability is 0.5 or above, the model classifies the sample as 1.
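This 0.5 threshold is also what scikit-learn's `predict` applies by default for binary problems, as this sketch with a toy one-feature data set (invented for illustration) shows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: one feature, label 0 for small values, 1 for large.
X = np.array([[0.], [1.], [2.], [3.], [7.], [8.], [9.], [10.]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X)[:, 1]   # P(class 1) per sample
manual = (proba >= 0.5).astype(int)    # explicit 0.5 threshold

print(np.array_equal(manual, model.predict(X)))  # True
```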
2.11 Receiver Operating Characteristic

To test the accuracy of the clustering by the k-means algorithm as well as the predictions
made by the logistic model, Receiver Operating Characteristic (ROC) will be used. Data
samples in this study are either positive or negative for a certain acoustic activity. ROC
summarizes the results of both k-means and logistic regression by presenting the number
of true positive (positive data samples that were labelled correctly as positive), true
negative (negative data samples that were labelled correctly as negative), false positive
(negative data samples that were labelled incorrectly as positive, a "false alarm"), and false negative (positive data samples that were labelled incorrectly as negative, a "miss") cases
in a confusion matrix. The accuracies in both methods are defined by the following
formula:
\mathrm{Accuracy\ (ACC)} = \frac{\sum \mathrm{True\ Positive} + \sum \mathrm{True\ Negative}}{\sum \mathrm{Total\ Population}} \qquad (2.18)
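The confusion-matrix counts and the accuracy of Eq. 2.18 can be computed directly. The label vectors below are toy examples, not the study's annotations:

```python
import numpy as np

# Toy annotation (true) and prediction vectors.
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false alarms
fn = np.sum((y_true == 1) & (y_pred == 0))  # misses

accuracy = (tp + tn) / len(y_true)          # Eq. 2.18
print(tp, tn, fp, fn, accuracy)             # 3 3 1 1 0.75
```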
Chapter 3
Methodology
This chapter will focus on describing in detail the calibration of the Sound Monitoring, Assessment, and Recording Tool for Environmental Acoustic Research (SMART-EAR) system using the LabVIEW software, as well as the deployment of the system and the analysis of the recordings gathered.
3.1 Calibration
3.1.1 Data Acquisition

The system under test is built around a Raspberry Pi, a single-board computer with built-in data storage. It utilizes an external ADC and, to record sound, a ReSpeaker 6-Mic Circular Array Kit is attached to the Raspberry Pi. It runs on a modified Linux operating system named Raspbian which is already fitted with programs to record sound. It can record continuously at a sampling rate of 16 kHz on all six channels and can store the signal retrieved from this system in the form of .wav files.
The reference system used is a Brüel & Kjær (B&K) microphone attached to a
National Instruments Data Acquisition Module (NI DAQ). The NI DAQ was then connected to a computer. Signals generated by any sound source were measured using the B&K microphone from 1 Hz to 20 kHz. By utilizing a program built on LabVIEW, the generated signal was sent to
the computer for data logging and, through specifying the sampling rate, was stored at the
same rate as the system under test. This recording was also stored in the form of .wav
files.
For data acquisition, a speaker generating pink noise at varying sound pressure
levels was used. Pure tones were also utilized in this study to test whether the test microphone had a flat frequency response. Each test signal was recorded by the SMART-EAR system and the B&K microphone. Both setups were
situated at least 2 meters from the sound source, as shown in Figure 3.2.
A separate LabVIEW program was created which would access a specified directory and
read the .wav file stored there. The recorded waveform was then represented by the
program as an array of amplitude values in the time domain. The program was built to
obtain the RMS value as well as the calculated SPL reading, based on equations 2.1 and
2.3, from that array. Because Eq. 2.3 calls for the pressure of the signal, the known
sensitivity of the reference microphone was used to convert the signal from its voltage values to the corresponding pressure values.
For the frequency domain analysis, a second LabVIEW program was created that
performed the Fast Fourier Transform (FFT) on the array based on the formula given by
equation 2.6. However, to acquire a more accurate average frequency spectrum for each
sample, the array of amplitude values in the time domain signal was first segmented into
overlapping frames, just as Garg et al. implemented in their study. The FFT of each of these overlapping frames was then taken and the resulting spectra were averaged to generate an average frequency spectrum. It is noted that the discrete Fourier transform is often defined with a normalization factor that differs between implementations, and the convention followed here is that of the LabVIEW program. A check on the results of the spectrum calculation was done using Parseval's Theorem (Eq. 2.8) to ensure that each sample would produce the same total energy in the time and frequency domains.
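The frame-averaging step can be sketched as follows. This is a simplified stand-in for the LabVIEW processing: no window function is applied, and the frame and hop sizes are arbitrary choices, not the study's actual parameters:

```python
import numpy as np

def averaged_spectrum(x, frame=1024, hop=512):
    """Average the FFT magnitudes of 50%-overlapping frames of x."""
    starts = range(0, len(x) - frame + 1, hop)
    mags = [np.abs(np.fft.rfft(x[s:s + frame])) for s in starts]
    return np.mean(mags, axis=0)

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 1000 * t)  # 1 kHz test tone

spec = averaged_spectrum(x)
freqs = np.fft.rfftfreq(1024, d=1 / fs)
print(freqs[np.argmax(spec)])     # 1000.0 -- peak at the tone frequency
```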
3.1.3 Calibration Proper

The samples used for calibration were white noise as well as pure tones, all designed to
be within a certain operating frequency range. Because the ReSpeaker 6-Mic Circular
Array Kit attached to the Raspberry Pi can only record at a sampling rate of 16 kHz,
frequencies up to 8kHz were used as audio samples to ensure that there would be no
undersampling. This was done to observe the constraints given by the Sampling
Theorem.
For microphones with flat frequency response, calibration would be done using
the comparison method. By plotting the voltage RMS values of the samples obtained
from the test microphone versus the voltage RMS values of the samples from the
reference microphone, a line of best fit would provide a sensitivity value for the
microphone under test by using equation 2.15. This value should hold true regardless of the frequency content of the sample. However, in this study, no single sensitivity value could be applied for the system under
test. Calibration then had to be carried out in the frequency domain. The approach
employed here was similar to the one taken by Garg et al., wherein correction factors per frequency were defined using equation 2.16. Applying these correction factors per frequency to the spectrum of the microphone under test would yield a corrected voltage RMS. The correct pressure RMS could then be obtained using the sensitivity of the reference microphone.
To further verify the results of calibration, the correction factors were applied to
the results of white noise and pink noise recordings obtained by the SMART-EAR
system. Once the correction factors were applied to the recordings of the SMART-EAR
system, the voltage RMS values were divided by the sensitivity of the reference microphone to obtain the pressure RMS. Equation 2.2 was then used to calculate the SPL values. The SPL of each sample
of the white noise and pink noise recordings of the reference microphone was calculated in the same manner for comparison.
3.2.1 Data Acquisition

Once calibrated, the SMART-EAR system was first deployed in West City Homes, a
small subdivision, for field test. The system was placed in an open garage situated right
beside a main road of the subdivision to protect the device from the elements.
Using the arecord software in the Raspberry Pi, recordings were gathered for
three 24-hour periods. Each sample was 2.5 minutes long and the system was set to
record every 5 minutes. The .wav files were stored in a USB drive and transferred to a computer for processing.
3.2.2 Preparation of Frequency Spectrums

For this section of the study, preparation of the spectrums was carried out following the documentation prepared for the SMART-EAR system. Using the Python3 program written by Fornis, R., the frequency spectrum for each recording was extracted and saved as a .csv file. The program also applied correction factors to the frequency spectrums. As
recommended by the study of Fornis, R., only considering spectrum contributions for
frequencies from 19.5 Hz – 8 kHz gives a good accuracy of the results. By discarding the frequencies outside this range, the results from the per-frequency correction calibration became more accurate and fell below the most lenient tolerance of ±1.5 dB. [21] The same was carried out in this study. After obtaining the frequency spectrums from each .wav file, the voltage values for the first 19 frequencies were discarded and only the frequencies from 19.5 Hz to 8 kHz were retained.
A Python3 program was used to convert the voltage values from the frequency
spectrum to their corresponding dB values. It is useful to note that these values are not dB
SPL (which is the measurement of volume level in the real world) but dB Full Scale, or
dBFS, which is the measurement of digital volume relative to the maximum value.
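The dBFS conversion can be sketched for amplitudes normalized to a full scale of 1.0. The reference value here is an assumption; recordings stored as 16-bit integers would use 32768 as full scale instead:

```python
import numpy as np

FULL_SCALE = 1.0  # assumed full-scale amplitude for normalized data

def to_dbfs(values):
    """Convert linear spectrum amplitudes to dB relative to full scale."""
    return 20 * np.log10(np.asarray(values) / FULL_SCALE)

amps = np.array([1.0, 0.5, 0.1, 0.01])
print(to_dbfs(amps))  # approximately [0., -6.02, -20., -40.]
```

A full-scale amplitude thus maps to 0 dBFS, and every halving of amplitude subtracts about 6 dB.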
After the frequency spectrums of each sample were prepared, a reference sample
was selected. For this study, the acoustic activity used as a reference is the
barking of a certain female Yorkshire Terrier in the subdivision. Figure 3.4a shows the
reference sample in the time domain while Figure 3.4b shows the frequency spectrum of
the same sample.

Figure 3.4 (a) Reference sample in the time domain and (b) Frequency spectrum of the
reference sample
Each sample was then plotted against the reference sample and Principal Component
Analysis (PCA) was performed.
The PCA algorithm used in this study is the one provided by the Python3 Scikit-
learn library. For this study, each (7981, 2) matrix, which is the comparison of the
reference frequency spectrum and test frequency spectrum against each other, was
reduced to a (1,2) matrix. These values are the principal component 1 (PC1) and principal
component 2 (PC2).
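The exact reduction from a (7981, 2) scatter to a single (PC1, PC2) pair is not fully spelled out in the text; below is a minimal sketch of one plausible reading using Scikit-learn's PCA, chosen because a sample compared with itself then yields PC2 ≈ 0, matching the reference point reported in Section 4.2.

```python
import numpy as np
from sklearn.decomposition import PCA

def compare_spectra(ref_db, test_db):
    """Summarize a reference-vs-test spectrum scatter as one (PC1, PC2) pair.

    Each row of the (n_bins, 2) matrix pairs the reference and test dB
    values at one frequency bin. PCA finds the two principal axes of this
    scatter; here the scatter is summarized by the norm of the projections
    onto each axis. (This summary is an assumption, not the study's
    documented computation.)
    """
    pts = np.column_stack([ref_db, test_db])
    proj = PCA(n_components=2).fit_transform(pts)  # (n_bins, 2) scores
    return np.linalg.norm(proj, axis=0)            # -> (PC1, PC2)
```

Under this reading, a test spectrum identical to the reference lies exactly on the one-to-one line, so all variance falls on the first axis and PC2 vanishes.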
The principal component values for each sample with respect to the reference
sample were then tabulated.
3.2.4 Annotation
Annotation was carried out using two categories. For the first category, samples that
contained the exact barking found in the reference sample were annotated with a 1 and
those without the bark were annotated with a 0. This specific categorization will be
referred to as annotation 1.
For the second category, samples that contained any form of barking (even those
of dogs not present in the reference sample) were annotated with a 1. Samples without
any barking were annotated with a 0. This general categorization will be referred to as
annotation 2.
Both k-means clustering and logistic regression were performed on the annotated
data sets. This study utilized the implementations of both provided by the Python3
Scikit-learn library.
For k-means, accuracy was computed by comparing the clustered labels to
both the true values of annotation 1 and annotation 2. Confusion matrices were also
generated for each comparison.
As for the logistic regression method, a randomly selected 30% of the total
number of annotated samples was used to train the model, this being a commonly used
training percentage in machine learning studies. For the first iteration, the model was
trained with samples categorized by annotation 1 and predictions made by the model
were compared to the true values of annotation 1. For the second iteration, the model was
trained with samples categorized by annotation 1 and predictions made by the model
were compared to the true values of annotation 2. For the third iteration, the model was
trained with samples categorized by annotation 2 and predictions made by the model
were compared to the true values of annotation 2. For the final iteration, the model was
trained with samples categorized by annotation 2 and predictions made by the model
were compared to the true values of annotation 1. Both the second and fourth iterations
were done to check the robustness of this method, and whether or not a model fitted with
one categorization can be applied to samples identified with the other, more or less
generalized, categorization.
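The four iterations share one pattern: train on a 30% split under one annotation, then score the model's predictions on all samples against a (possibly different) annotation. A minimal Scikit-learn sketch, under the assumption that the model inputs are the (PC1, PC2) values, is:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_and_score(pc_values, labels_train_on, labels_score_against, seed=0):
    """Train on 30% of the samples (the split used in the study) with one
    annotation, then score predictions on all samples against a possibly
    different annotation, mirroring the four iterations described above.
    """
    X_train, _, y_train, _ = train_test_split(
        pc_values, labels_train_on, train_size=0.3, random_state=seed)
    model = LogisticRegression().fit(X_train, y_train)
    preds = model.predict(pc_values)
    return np.mean(preds == labels_score_against)  # accuracy
```

For example, the second iteration corresponds to `fit_and_score(X, annotation1, annotation2)`. Note that, as in the study, the model is scored on all samples, including those it was trained on.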
Once the logistic regression model was fitted by both annotations, an application was
done to another dataset, this time recordings gathered from a commercial area. The
entire data acquisition and preparation procedure was repeated for recordings gathered
from Sidlakan Marketing, a rice trading business. The SMART-EAR system was also
set up in the same manner as before.
Annotation 1 and 2 were applied to the samples gathered; however, because the
dog found in the reference sample was not present in this area, annotation 1 was carried
out by labelling all data samples with a true value of 0. Annotation 2 proceeded normally.
The logistic regression models that were fitted by annotation 1 and 2 values were
used to predict labels for the samples gathered here. Predictions made by the model fitted
by annotation 1 were compared to the true annotated values of the samples from the
commercial area. The same was done for predictions made by the model fitted by
annotation 2.
Chapter 4
4.1 Calibration
For calibration, the Brüel & Kjær microphone was used as the reference microphone. It
was calibrated using a Brüel & Kjær microphone mini calibrator and its measured
response verified. The simultaneous comparison method discussed in Section 2.7 was first employed to
determine whether the sensitivity of the test microphone depended on frequency or not.
Each sample was run for 9 iterations, each iteration being a separate recording.
4.1.1 Single Sensitivity Calculation
Figure 4.1 shows that the test microphone has different lines of best fit with
respect to frequency, thereby indicating that the microphone used is frequency dependent.
Also, taking the slope of each line and applying equation 2.15 yields the sensitivity of the
test microphone at each frequency.

Figure 4.1 Graph of RMS recorded by test and reference microphone shows frequency
dependency
4.1.2 Per Frequency Correction Factor Calculation
Since the microphone being utilized is frequency dependent, calibration will have to be
carried out in the frequency domain. Using the LabVIEW program that was developed
for this study, FFTs of pink and white noise samples from both the test microphone and
the reference microphone were computed.
To verify Parseval's theorem, the accuracy of the FFT was checked. The RMS
value of the signals in the time domain was plotted against the RMS value of the same
signals in the frequency domain.

Figure 4.2 Graph of RMS from the time domain vs the RMS from the frequency domain

A line of best fit shows a slope of 1, which means that each sample's computed
RMS value is the same in the time domain and the frequency domain.
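The check above follows directly from Parseval's theorem, which equates the total energy in the time domain with that in the frequency domain: sum(|x[n]|²) = (1/N)·sum(|X[k]|²). A minimal NumPy illustration:

```python
import numpy as np

def rms_time(x):
    """RMS of a signal computed directly in the time domain."""
    return np.sqrt(np.mean(np.square(x)))

def rms_freq(x):
    """RMS of the same signal computed from its FFT via Parseval's
    theorem: mean(x^2) = sum(|X[k]|^2) / N^2 for NumPy's unnormalized FFT."""
    X = np.fft.fft(x)
    return np.sqrt(np.sum(np.abs(X) ** 2)) / len(x)
```

For any real signal the two functions agree to within floating-point error, which is exactly the slope-of-1 behaviour seen in Figure 4.2.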
Equation 2.16 was then applied to calculate the correction factor for each
frequency by taking the difference of each sample's spectrums from both microphones.
Figure 4.3 shows the graph of the average frequency correction factor that resulted from
this calculation.
Figure 4.3 Correction factors for each frequency present in the pink and white noise
samples
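A sketch of the correction-factor computation follows. The thesis states the factor as a per-frequency difference of spectrums (equation 2.16) that is later applied by multiplication; the sketch below assumes linear FFT magnitudes throughout, so the "difference" becomes an averaged per-bin ratio. This is one plausible reading, not the study's exact code.

```python
import numpy as np

def correction_factors(ref_spectra, test_spectra):
    """Per-frequency correction factors from paired noise recordings.

    ref_spectra and test_spectra are (n_samples, n_bins) arrays of linear
    FFT magnitudes from the reference and test microphones. The per-bin
    amplitude ratio is averaged over all pink/white noise samples.
    """
    return np.mean(ref_spectra / test_spectra, axis=0)

def apply_correction(test_spectrum, factors):
    """Apply the correction by multiplication, as done in the study."""
    return test_spectrum * factors
```

Multiplying a test spectrum by the factors should reproduce the reference spectrum, which is what Figures 4.4 and 4.5 verify.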
To check the accuracy of these correction factors, the FFTs of the samples in
Section 4.1 were also computed. The correction factors were then applied by multiplying
them with the FFTs of the samples recorded by the test microphone. The SPL values of
both microphones were then compared.

Figure 4.4 SPL comparison of pure tones after correction factor application
Figure 4.4 shows a better alignment of results between the test and reference
microphones for each frequency. Outliers found beyond the line of best fit can be
attributed to samples that experienced clipping during recording. But because pure tones
are rarely encountered in the environment, greater emphasis is placed on white and pink
noise, since both are random signals having varying intensities across a range of
frequencies. Applying the same method that was done on the pure tone samples, a graph
of the SPL calculated from the samples of the test microphone plotted against the SPL
calculated from the samples of the reference microphone was produced.

Figure 4.5 SPL comparison of pink and white noise after correction factor application
Both Figures 4.4 and 4.5 show that the test microphone has been calibrated and
that its SPL readings match those of the laboratory standard B&K microphone.
Because system calibration is assured, the SMART-EAR system can be taken to the
field.
After calibration was accomplished, the SMART-EAR system was deployed at West City
Homes for field testing. The sample rate was set to 16 kHz, giving a Nyquist frequency
of 8 kHz. The three 24-hour periods yielded 863 recordings, each 2.5 minutes long. After
the frequency spectrums were extracted and the correction factors applied, they were
saved to .csv files. The reference recording was selected and each of the 863 frequency
spectrums was plotted against it.
The red line found in Figure 4.6 is the line that represents one-to-one
correspondence. The orange line, on the other hand, is the line of best fit. These lines,
together with the principal component values, were computed and saved into a .csv file.
A PC graph was constructed from the computed values, where each data point was
assigned one of three labels according to its time of recording. Samples from the morning
and evening periods were labelled as Morning and Evening, and samples from 10 PM to
5 AM were labelled as Nighttime. This grouping of time periods is based on the NPCC
guidelines (Table 1.1).
Clustering of samples can already be seen in Figure 4.7. The PC values of the
reference sample plotted against itself are (1243.488515, 8.95E-14). It can be assumed,
therefore, that the more similar a test sample is to the reference sample, the closer it is to
the x-axis. To test this assumption, samples were manually annotated using the categories
defined by annotation 1 and annotation 2. However, due to time constraints, only 420 of
the 863 samples could be categorized. (Sample data are found in Table A.1 of the
Appendix.)
For annotation 1, 323 of the samples were annotated with 0 as they did not
contain the barking of the Yorkshire Terrier in the reference sample. The remaining 97
samples were annotated with 1 as they contained the same barking of the Yorkshire
Terrier.
For annotation 2, 218 of the samples were annotated with 0 since they did not
contain barking of any kind. The other 202 samples were annotated with 1 since they
contained barking of some form.
4.2.3 K-means Clustering Results

The 420 samples underwent the k-means clustering algorithm; in the resulting
plot, the grey dots represent the centroids of the two clusters. Predictions made by the
k-means algorithm were then compared to the annotated true values.

From Figure 4.11, k-means clustering was able to label samples with 95%
accuracy by comparing the predictions with the true values of annotation 1. Reiterating
this method 20 times using only a randomly selected 200 samples each iteration gives an
average accuracy of 95.6 ± 0.2%.
Figure 4.12, on the other hand, shows that, when compared to the true values of
annotation 2, k-means clustering is only able to perform with 72% accuracy.
Reiterating this 20 times using randomly selected 200 samples each iteration gives an
average accuracy of 71.8 ± 0.5%.
The results show that the first annotation is a more dependable basis of
categorization for k-means as there is a clearer delineation of where one cluster starts and
where the other ends. Comparing the results of k-means clustering in Figure 4.10 with
the PC plot according to annotation 2 labels in Figure 4.9 shows that, because there is no
proper delineation of the two clusters in annotation 2, k-means is unable to separate data
points that contained barking from the data points that did not.
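The repeated-subset accuracy estimate used above (20 runs of 200 randomly selected samples, reported as mean ± standard error) can be sketched as follows. The handling of the k-means label permutation (cluster indices are arbitrary, so each run keeps the better of the two possible assignments) is an added detail the thesis does not spell out.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_accuracy(pc_values, true_labels, n_runs=20, subset=200, seed=0):
    """Mean accuracy and standard error of 2-cluster k-means over
    repeated random subsets, mirroring the 20 x 200-sample procedure."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_runs):
        idx = rng.choice(len(pc_values), size=subset, replace=False)
        pred = KMeans(n_clusters=2, n_init=10,
                      random_state=0).fit_predict(pc_values[idx])
        acc = np.mean(pred == true_labels[idx])
        accs.append(max(acc, 1.0 - acc))  # resolve cluster-label permutation
    accs = np.array(accs)
    return accs.mean(), accs.std(ddof=1) / np.sqrt(n_runs)
```

The standard error reported in the text corresponds to the second return value.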
This contrast in accuracy can be attributed to the fact that the PC values only
quantify how close two frequency spectrums are to each other. As an example,
Figure 4.13 shows the reference sample plotted against another sample that is closely
related to the reference, found at the point (1249.432, 187.6807) on the PC graph.
This sample was identified correctly by the k-means algorithm with respect to both
annotation 1 and 2.
Figure 4.13 a) Frequency spectrum that contains the barking found in the reference
sample and at a similar intensity and b) The first test frequency spectrum plotted against
the reference frequency spectrum
This sample contains the same dog barking as in the reference sample at a similar
intensity. The SPL value of the reference sample is recorded at 79.422 dBA and the SPL
value of the test sample at 71.849 dBA. Note that the line of best fit generated from the
scatter plot has a slope of 0.986, which is nearly parallel to the line of one-to-one
correspondence. Its y-intercept, at a value of -8.218, places the line only a small distance
from the line of one-to-one correspondence. Because of these similarities, the sample was
correctly clustered with those containing the reference barking.
For the second sample, the dog found in the reference sample is barking but does
not do so at the same intensity. This is found at the point (1306.846, 398.471) on the PC
graph.
Figure 4.14 a) Frequency spectrum that contains the barking found in the reference
sample but not at the same intensity and b) The second test frequency spectrum plotted
against the reference
This sample was a false negative result with respect to both annotation 1 and 2. In
this test sample, the SPL value is recorded at 54.837 dBA, a sizable difference from the
SPL value of the reference sample. The line of best fit generated from the scatter plot
again has a slope close to 1. However, its y-intercept is valued at -25.224, a sizable
distance from the line of one-to-one correspondence. This distance explains the false
negative prediction from the k-means algorithm (for annotation 1) and shows how SPL
can affect the PC values used for clustering.
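The slope and intercept behaviour seen in these samples can be illustrated with synthetic (not the study's) data: a hypothetical test spectrum identical in shape to the reference but about 20 dB quieter, fitted with the same kind of least-squares line used in the thesis plots.

```python
import numpy as np

# Hypothetical dB spectra: the "test" spectrum contains the same activity
# as the reference but recorded ~20 dB quieter, so every frequency bin is
# shifted down by roughly the same amount.
rng = np.random.default_rng(1)
ref_db = rng.uniform(20.0, 80.0, size=1000)
test_db = ref_db - 20.0 + rng.normal(0.0, 0.5, size=1000)

# Line of best fit of test vs. reference values, as plotted in the thesis.
slope, intercept = np.polyfit(ref_db, test_db, 1)

# Similar spectral shape -> slope near 1; lower overall SPL -> the line
# sits below the one-to-one line by roughly the SPL difference.
```

This reproduces the second-sample pattern: a near-unity slope (same activity) combined with a large negative intercept (lower SPL).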
For the third sample, there is the presence of dogs barking in the vicinity and the
SPL of the recording is higher than that of the second example. However, the dog barking
in this recording is of a different breed and has a different bark altogether. This is found
on the PC graph.

Figure 4.15 a) Frequency spectrum that contains barking from dogs different from the
one found in the reference sample and b) The third test frequency spectrum plotted
against the reference frequency spectrum
The slope of the line of best fit is calculated at 0.846 while the y-intercept of the
line is valued at -21.500. In this third sample, the SPL value is recorded at 61.334 dBA.
The PC values calculated from the difference in frequency spectrums cause the sample to
lie closer to the k-means centroid where there was no barking than to the centroid where
there was barking. This explains the false negative result (for the second annotation) and
shows how the frequencies involved can affect the PC values used for clustering.
From the three samples above, it can be observed that the slope of the line of best
fit is determined by the dB values of the frequencies of the spectrum. Simply put, the
more similar a test sample is to the reference sample, the closer the slope is to 1. On the
other hand, the overall SPL of the recording determines the distance of the line of best fit
from the line of one-to-one correspondence. It can also be observed that both the SPL
and the similarity of frequencies involved in the test and reference samples affect the
computation of PC values, which, in turn, affects clustering as well as the accuracy of the
method.

4.2.4 Logistic Regression Results

Turning to logistic regression, for the first iteration, a model was trained using 30% of the
data according to annotation 1 true values. The model was then tested using all 420
samples, and the predictions made by the model were compared also to the true values of
annotation 1.
Figure 4.16 Confusion matrix of logistic regression predictions using a model fitted with
annotation 1 labels vs. annotation 1 true values
It is observed that the results from the first iteration closely resemble those of
k-means clustering vs. annotation 1 true values. Reiterating this with different sets of
samples 20 times gives a similar average accuracy.

For the second iteration, the same model trained in the first iteration was used, but
predictions made by the model were contrasted with the true values of annotation 2. The
resulting confusion matrix is shown in Figure 4.17.
Figure 4.17 Confusion matrix of logistic regression predictions using a model fitted with
annotation 1 labels vs. annotation 2 true values
Reiterating this with different sets of samples 20 times gives an accuracy of 73.2
± 0.2%. There is an increase in false negative results because, much like in k-means
clustering, the model could not identify samples that contained forms of barking other
than the one in the reference sample.
In the third iteration, training of the model was done using samples with
annotation 2 labelling. Predictions made by the model were contrasted with the true
values of annotation 2. Figure 4.18 shows the confusion matrix that resulted from this
iteration.
Figure 4.18 Confusion matrix of logistic regression predictions using a model fitted with
annotation 2 labels vs. annotation 2 true values
Samples positive for annotation 1 were generally predicted to be positive by the model.
The remaining 54 true positive predictions were samples positive for annotation 2 but not
for annotation 1. Running this iteration with different sets of samples 20 times gives an
accuracy of 84.2 ± 0.3%.
Lastly, in the fourth iteration, the logistic regression model was still trained using
samples with annotation 2 labelling but the predictions made by the model were
contrasted with the true values of annotation 1. Below is the confusion matrix that
resulted from this iteration.

Figure 4.19 Confusion matrix of logistic regression predictions using a model fitted with
annotation 2 labels vs. annotation 1 true values
Observed here is a result opposite to that of the second iteration for logistic
regression: there is an increased number of false positive predictions. The algorithm
predicted the presence of the reference barking in samples that contained only other
forms of barking. Running this 20 times using different sets of samples to train the model
gives the accuracy reported in Table 4.3.
Table 4.2 Accuracy and standard error values for predictions made by k-means
Table 4.3 Accuracy and standard error values for predictions made by logistic regression
Tables 4.2 and 4.3 show the summary of performances of both machine learning
methods for this study. It can be seen that k-means clustering behaves similarly
to the logistic regression model trained with annotation 1 values. Both the logistic
regression and k-means clustering approaches show that this method of using PCA to
cluster data is only applicable to the specific case; generalization (i.e., being able to
identify other forms of activity of the same nature) is a limitation of the current method.
As for the lower accuracy values of the model trained by annotation 2, this can be
attributed to the lack of clear delineation between the two clusters under that
categorization.

4.3 Application of Logistic Regression Model

Next, the dataset gathered from the commercial area of Sidlakan Marketing was
prepared and annotated. Figure
4.20 and Figure 4.21 show the PC plots with respect to annotation 1 and 2, respectively.
After annotating the values, the trained logistic regression models were applied to
predict sample labels. Figure 4.22 shows the confusion matrix that resulted from
comparing predictions made by the model fitted with annotation 1 of the training samples
against the true annotation 1 values of the new dataset.
Figure 4.22 Confusion matrix of the first iteration using the samples from Sidlakan
Marketing
For the first iteration, the logistic regression algorithm was able to predict 96.5%
of the labelling correctly when compared to the true value of annotation 1. Two of the
three false positive samples contained voices which had similar frequencies to the
barking in the reference sample. The third false positive sample contained an ambulance
siren whose frequency was also similar to the barking in the reference sample.
Figure 4.23 a) Frequency spectrum of the sample with the ambulance siren and b)
Spectrum comparison of the sample with the ambulance and the reference sample
For the second iteration, the logistic model was trained using annotation 2 of the
original dataset. Comparing the predictions made with the true annotation 2 labels of the
new dataset produced the confusion matrix in Figure 4.24.
Figure 4.24 Confusion matrix of the second iteration using the samples from Sidlakan
Marketing
Observe the sharp decrease in accuracy when applying the model trained by
annotation 2 of the original dataset. The model was only able to identify 69.4% of the
new dataset accurately. It can be concluded that the second model struggled to determine
which samples had barking in the recordings and which ones didn't. This further proves
that generalization is currently a limitation of the study and that this method is only able
to identify the specific activity found in the reference sample.
Chapter 5: Conclusion and Recommendations

The study was able to calibrate the Sound Monitoring, Assessment, and Recording Tool
(SMART-EAR) system, in which a ReSpeaker 6-Mic Circular Array Kit was utilized as
the test microphone.
Two calibration procedures were implemented for this study: calibration by single
sensitivity and calibration by per frequency correction factors. The reference
microphone used for this study was the laboratory standard Brüel & Kjær microphone
connected to the LabVIEW software. The study revealed that calibration by single
sensitivity was not effective in matching the performance of the test microphone to that
of the reference microphone, as seen in the multiple sensitivity values that were
calculated for each frequency. The per frequency correction method was the better option
as, upon applying correction factors to the frequency spectrum, the calculated RMS and
SPL values of the SMART-EAR system resembled those of the reference microphone.
This shows that the performance of the test microphone now matched that of the
reference microphone.
Once calibration was completed, field testing was done. In this study, the
SMART-EAR system was placed in West City Homes and was set to record audio
samples for 2.5 minutes every 5 minutes. This was done for three 24-hour periods,
leading to a total of 863 samples. Out of these samples, one was selected to be the
reference sample. This reference sample contained the barking of a female Yorkshire
Terrier living in the subdivision. By plotting the frequency spectrums of each sample
against the reference sample, principal component analysis (PCA) was applied to reduce
the number of dimensions of each comparison to two principal components.
The PC plot that resulted from this dimensionality reduction was found to be very
effective in clustering data samples where the barking of the Yorkshire Terrier was
present. The PC plot revealed that the closer the frequency spectrum of the recording
resembled that of the reference sample, the closer the data sample was to the x-axis. It
was also revealed that both the computed SPL value, specifically the dBA, of the
recording and the primary frequencies involved in the sample affected the computed PC
values.

K-means clustering was able to separate samples that contained the activity found
in the reference sample from those that didn't, with an accuracy of 95.6 ± 0.2%.
However, it struggled to cluster samples that contained barking of any kind of dog from
those that didn't, with an accuracy of only 71.8 ± 0.5%. This
proved that the current method is limited in identifying barking in the general sense and
that it is only able to identify the barking found in the reference sample.
The results from applying logistic regression to the data set were similar to the
results of the k-means clustering. Using samples categorized by annotation 1 to train the
model, it was able to predict the presence of the reference barking in the test samples
with a similar accuracy. However, using samples categorized by annotation 2 to train the
logistic regression model, it struggled to correctly predict the presence of any kind of
barking in the test samples, with an accuracy of only 73.2 ± 0.2%.
Applying these models to predict labels of a second dataset, this time samples
gathered from a commercial area, revealed a great performance for the model fitted with
samples categorized by annotation 1, with an accuracy of 96.5%. On the other hand, the
model fitted with samples categorized by annotation 2 struggled to determine which
samples had barking in the recordings and which ones did not, giving an accuracy of only
69.4%.
The data prove that this method is able to identify and cluster samples that
contain the specific activity found in the reference sample. It is recommended that future
work develop a method that would identify samples in a general sense, i.e. being able to
identify any kind of barking and not just the specific one found in the reference sample.
For example, a study done by Sethi et al. employed a convolutional neural network to
extract acoustic features from soundscape recordings. By clustering samples in the
resulting feature space, the study was able to not only cluster specific biomes but also
identify anomalies in the soundscape. The same could be done for future studies.
However, instead of using a CNN to create the feature space, different reference samples
could be used to identify and cluster the same samples. Combining the resulting PC plots
could then build up a more general picture of the soundscape. It is also recommended to
look into how extracting the specific frequencies involved in the acoustic activity to be
identified can affect results. In this study, only the spread of values relative to the
reference sample was taken for comparison. This study did not take into account nor
utilize the specific frequencies involved in the reference sample.
Lastly, during the gathering of samples, there were instances where recording by
the SMART-EAR system would stop abruptly. This was found to be caused by the length
of the recording, in that the system often had difficulty writing 2.5 minutes of audio into
a .wav file. It is recommended to shorten the recording time to 1 minute to limit the
system's file-writing load.
Bibliography