Applications of Deep Learning and Intro to Autoencoders
Sound waves are one-dimensional. At every moment in time, they have a single value based on the
height of the wave. Let’s zoom in on one tiny part of the sound wave and take a look:
To turn this sound wave into numbers, we just record the height of the wave at equally spaced
points:
This is called sampling. We are taking a reading thousands of times a second and recording a number
representing the height of the sound wave at that point in time.
“CD Quality” audio is sampled at 44.1 kHz (44,100 readings per second). But for speech recognition, a
sampling rate of 16 kHz (16,000 samples per second) is enough to cover the frequency range of human
speech.
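As a rough illustration of sampling, the sketch below records the height of a wave at equally spaced points, 16,000 times per second. The 440 Hz sine tone is a hypothetical stand-in for a real recording, and the function names are assumptions made for this example.

```python
import numpy as np

SAMPLE_RATE = 16_000  # samples per second, the speech rate quoted above


def sample_wave(wave, duration_s, sample_rate=SAMPLE_RATE):
    """Record the height of `wave` at equally spaced points in time."""
    n_samples = int(round(duration_s * sample_rate))
    times = np.arange(n_samples) / sample_rate  # seconds: 0, 1/16000, 2/16000, ...
    return wave(times)


def tone(t):
    """Hypothetical stand-in for a real sound wave: a 440 Hz sine tone."""
    return np.sin(2 * np.pi * 440.0 * t)


samples = sample_wave(tone, duration_s=1.0)
print(len(samples))  # one second of audio at 16 kHz gives 16000 numbers
```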
Pre-processing our Sampled Sound Data
We now have an array of numbers, with each number representing the sound wave’s amplitude at
intervals of 1/16,000th of a second.
We could feed these numbers right into a neural network. But trying to recognize speech patterns by
processing these samples directly is difficult. Instead, we can make the problem easier by doing some
pre-processing on the audio data.
Let’s start by grouping our sampled audio into 20-millisecond-long chunks. Here’s our first 20
milliseconds of audio (at 16,000 samples per second, that is our first 320 samples):
Plotting those numbers as a simple line graph gives us a rough approximation of the original sound
wave for that 20 millisecond period of time:
This recording is only 1/50th of a second long. But even this short recording is a complex mish-mash
of different frequencies of sound. There are some low sounds, some mid-range sounds, and even some
high-pitched sounds sprinkled in. Taken together, these different frequencies make up the complex
sound of human speech.
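The 20-millisecond grouping described above can be sketched as follows. This is a minimal example assuming a 16 kHz sample rate and non-overlapping chunks; the function name is hypothetical, and dropping the trailing partial chunk is a simplification.

```python
import numpy as np


def split_into_chunks(samples, sample_rate=16_000, chunk_ms=20):
    """Group samples into consecutive, non-overlapping chunks of `chunk_ms`."""
    samples = np.asarray(samples)
    chunk_len = sample_rate * chunk_ms // 1000  # 320 samples per 20 ms at 16 kHz
    n_chunks = len(samples) // chunk_len        # drop any trailing partial chunk
    return samples[: n_chunks * chunk_len].reshape(n_chunks, chunk_len)


# One second of placeholder "audio": the sample indices themselves.
chunks = split_into_chunks(np.arange(16_000))
print(chunks.shape)  # (number of chunks, samples per chunk)
```

Each row of `chunks` is one 1/50th-of-a-second slice, which is the unit the later steps operate on.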
To make this data easier for a neural network to process, we are going to break apart this complex
sound wave into its component parts. We’ll break out the low-pitched parts, the next-lowest-pitched
parts, and so on. Then by adding up how much energy is in each of those frequency bands (from low
to high), we create a fingerprint of sorts for this audio snippet.
You can see that our 20 millisecond sound snippet
A neural network can find patterns in this kind of data more easily than raw sound waves.
So this is the data representation we’ll actually feed into our neural network.
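One way to compute the band-energy fingerprint described above is with a Fourier transform. The slides do not name the method, so the use of NumPy's real FFT, the choice of 8 equal-width bands, and the synthetic two-tone chunk are all assumptions made for illustration.

```python
import numpy as np


def band_energies(chunk, n_bands=8):
    """Sum the spectral energy of one audio chunk in bands ordered low to high."""
    spectrum = np.fft.rfft(chunk)          # break the wave into frequency components
    power = np.abs(spectrum) ** 2          # energy at each frequency bin
    bands = np.array_split(power, n_bands) # group neighbouring bins into bands
    return np.array([band.sum() for band in bands])


# Synthetic 20 ms chunk at 16 kHz: a strong 200 Hz tone plus a weak 3000 Hz tone.
sample_rate = 16_000
t = np.arange(320) / sample_rate
chunk = np.sin(2 * np.pi * 200 * t) + 0.1 * np.sin(2 * np.pi * 3000 * t)

energies = band_energies(chunk)
print(np.round(energies, 1))  # most of the energy lands in the lowest band
```

The resulting vector, one number per band, is the kind of compact input a network can work with more easily than 320 raw samples.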
Recognizing Characters from Short Sounds
Now that we have our audio in a format that’s easy to process, we will feed it into a deep neural
network. The input to the neural network will be 20 millisecond audio chunks. For each little audio
slice, it will try to figure out the letter that corresponds to the sound currently being spoken.
We’ll use a recurrent neural network — that is, a neural network that has a memory that influences future
predictions. That’s because each letter it predicts should affect the likelihood of the next letter it will predict
too. For example, if we have said “HEL” so far, it’s very likely we will say “LO” next to finish out the word
“Hello”. It’s much less likely that we will say something unpronounceable next like “XYZ”. So having that
memory of previous predictions helps the neural network make more accurate predictions going forward.
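To make the idea of "memory" concrete, here is a minimal sketch of a single recurrent step. It assumes a plain tanh recurrent cell with random, untrained weights, a hypothetical 27-symbol alphabet (26 letters plus a blank), and 8 input features per chunk. A real speech model would be trained and far larger.

```python
import numpy as np


def rnn_step(x, h_prev, W_xh, W_hh, W_hy):
    """One recurrent step: combine the current chunk with the previous memory."""
    h = np.tanh(W_xh @ x + W_hh @ h_prev)   # new memory depends on the old memory
    logits = W_hy @ h
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    return exp / exp.sum(), h               # letter probabilities, updated memory


rng = np.random.default_rng(0)
n_features, n_hidden, n_letters = 8, 16, 27  # hypothetical sizes
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_features))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_hy = rng.normal(scale=0.1, size=(n_letters, n_hidden))

h = np.zeros(n_hidden)                       # empty memory before the first chunk
for features in rng.normal(size=(5, n_features)):
    probs, h = rnn_step(features, h, W_xh, W_hh, W_hy)

# The same chunk gives different predictions depending on what came before it.
fresh_probs, _ = rnn_step(features, np.zeros(n_hidden), W_xh, W_hh, W_hy)
print(np.allclose(probs, fresh_probs))
```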
After we run our entire audio clip through the neural network (one chunk at a time), we’ll end up with a
mapping of each audio chunk to the letters most likely spoken during that chunk. Here’s what that mapping
looks like for me saying “Hello”:
Our neural net is predicting that one likely thing I said was “HHHEE_LL_LLLOOO”. But it also
thinks that it was possible that I said “HHHUU_LL_LLLOOO” or even “AAAUU_LL_LLLOOO”.
First, we’ll replace any repeated characters with a single character (for example, HHHEE_LL_LLLOOO
becomes HE_L_LO). Then we’ll remove any blanks:
● HE_L_LO becomes HELLO
● HU_L_LO becomes HULLO
● AU_L_LO becomes AULLO
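The two clean-up steps can be sketched in a few lines of Python. The function name is hypothetical, and `_` is assumed to be the blank symbol, as in the examples above.

```python
from itertools import groupby


def collapse(raw, blank="_"):
    """Collapse runs of repeated characters, then drop the blank symbol."""
    deduped = "".join(char for char, _ in groupby(raw))  # HHHEE_LL_LLLOOO -> HE_L_LO
    return deduped.replace(blank, "")                    # HE_L_LO -> HELLO


for raw in ["HHHEE_LL_LLLOOO", "HHHUU_LL_LLLOOO", "AAAUU_LL_LLLOOO"]:
    print(raw, "->", collapse(raw))
```

Note that the blank between the two runs of L is what keeps the double L in the result: without it, the repeated-character step would merge them into one.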
That leaves us with three possible transcriptions — “Hello”, “Hullo” and “Aullo”. If you say them out
loud, all of these sound similar to “Hello”.
The trick is to combine these pronunciation-based predictions with likelihood scores based on a large
database of written text (books, news articles, etc.). You throw out transcriptions that seem the least
likely to be real and keep the transcription that seems the most realistic.
So we’ll pick “Hello” as our final transcription instead of the others. Done!
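The selection step can be sketched as below. Everything numeric here is made up for illustration: the pronunciation scores are invented, the eight-word "corpus" is a toy stand-in for a large text database, and adding log-probabilities with add-one smoothing is just one simple way to combine the two sources.

```python
import math
from collections import Counter

# Toy stand-in for a large database of written text (made up for illustration).
TOY_CORPUS = "hello there hello world well hello again hullo".split()

# Made-up pronunciation-based scores for the three candidate transcriptions.
ACOUSTIC_SCORE = {"HELLO": 0.40, "HULLO": 0.35, "AULLO": 0.25}


def text_log_likelihood(word, counts, total):
    """Log-probability of a word in the corpus, with add-one smoothing."""
    return math.log((counts[word.lower()] + 1) / (total + len(counts) + 1))


def pick_transcription(acoustic_score, corpus):
    """Keep the candidate with the best combined acoustic and text score."""
    counts = Counter(corpus)
    total = len(corpus)

    def combined(word):
        return math.log(acoustic_score[word]) + text_log_likelihood(word, counts, total)

    return max(acoustic_score, key=combined)


best = pick_transcription(ACOUSTIC_SCORE, TOY_CORPUS)
print(best)
```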
Introduction to Autoencoders
Autoencoders
An autoencoder is a neural network that is trained to attempt to
copy its input to its output. Internally, it has a hidden layer h that
describes a code used to represent the input. The network may be
viewed as consisting of two parts: an encoder function h = f(x)
and a decoder that produces a reconstruction r = g(h).
If the encoder and decoder are allowed too much capacity, the
autoencoder can learn to perform the copying task without
extracting useful information about the distribution of the data.
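The encoder/decoder split can be written out directly. This is a minimal sketch with random, untrained weights; the layer sizes, the tanh encoder, the linear decoder, and the mean squared reconstruction error are assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_input, n_code = 8, 3  # hypothetical sizes: the code is smaller than the input

W_enc = rng.normal(scale=0.1, size=(n_code, n_input))
W_dec = rng.normal(scale=0.1, size=(n_input, n_code))


def f(x):
    """Encoder: map the input x to the code h = f(x)."""
    return np.tanh(W_enc @ x)


def g(h):
    """Decoder: map the code h to the reconstruction r = g(h)."""
    return W_dec @ h


x = rng.normal(size=n_input)
h = f(x)
r = g(h)
reconstruction_error = float(np.mean((x - r) ** 2))  # what training would minimize
print(h.shape, r.shape)
```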
Regularized Autoencoders
A similar problem occurs in the overcomplete case, in which the hidden code has dimension greater
than the input. In such cases, even a linear encoder and linear decoder can learn to copy the input
to the output without learning anything useful about the data distribution.
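A small constructed example shows why the overcomplete case is a problem. The weights below are hand-picked rather than learned, and the sizes are hypothetical: the encoder places the input in the first coordinates of a larger code, and the decoder reads it back out.

```python
import numpy as np

n_input, n_code = 4, 6  # the code has more dimensions than the input

# Linear encoder: copy x into the first 4 code units, leave the rest at zero.
W_enc = np.zeros((n_code, n_input))
W_enc[:n_input, :] = np.eye(n_input)

# Linear decoder: read those same 4 units back out.
W_dec = W_enc.T

x = np.array([0.5, -1.0, 2.0, 3.5])
h = W_enc @ x
r = W_dec @ h

print(np.allclose(r, x))  # perfect copy, yet nothing was learned about the data
```

The reconstruction error is zero for every possible input, so the training objective alone gives this network no reason to capture any structure in the data.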
Ideally, one could train any architecture of autoencoder successfully, choosing the code dimension
and the capacity of the encoder and decoder based on the complexity of the distribution to be modeled.
Regularized autoencoders provide the ability to do so. Rather than limiting the model capacity by
keeping the encoder and decoder shallow and the code size small, regularized autoencoders use a
loss function that encourages the model to have other properties besides the ability to copy its input
to its output.
Sparse Autoencoders
◦ A sparse autoencoder tries to ensure that each hidden neuron is inactive most of the time.
◦ A sparse autoencoder is simply an autoencoder whose training criterion involves a sparsity penalty Ω(h) on the
code layer h, in addition to the reconstruction error:
L(x, g(f(x))) + Ω(h)
Here L(x, g(f(x))) is the reconstruction error (the regular loss) and Ω(h) is the sparsity penalty, which pushes
the activations in h toward 0.
◦ Sparse autoencoders are typically used to learn features for another task such as classification.
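The sparse training criterion can be sketched as a plain function. The slides do not specify the form of Ω(h); the L1 penalty Ω(h) = λ Σ|h_i| used here is one common choice and is an assumption, as are the squared-error reconstruction term and the value of λ.

```python
import numpy as np


def sparse_autoencoder_loss(x, h, r, lam=0.1):
    """L(x, g(f(x))) + Ω(h), with an assumed L1 sparsity penalty on the code h."""
    reconstruction_error = np.sum((x - r) ** 2)  # L(x, g(f(x))), where r = g(f(x))
    sparsity_penalty = lam * np.sum(np.abs(h))   # Ω(h): grows with active units
    return float(reconstruction_error + sparsity_penalty)


x = np.array([1.0, 2.0])
r = np.array([1.0, 2.0])                         # a perfect reconstruction
dense_code = np.array([0.5, -0.5, 0.5, -0.5])
sparse_code = np.array([0.5, 0.0, 0.0, 0.0])

# With identical reconstructions, the code with fewer active units costs less.
print(sparse_autoencoder_loss(x, dense_code, r))
print(sparse_autoencoder_loss(x, sparse_code, r))
```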
Denoising Autoencoders
Rather than adding a penalty Ω to the cost function, we can obtain an autoencoder
that learns something useful by changing the reconstruction error term of the cost
function. Traditionally, autoencoders minimize some function L(x, g(f(x))),
where L is a loss function penalizing g(f(x)) for being dissimilar from x, such as
the L² norm of their difference. This encourages g ◦ f to learn to be merely an
identity function if they have the capacity to do so. A denoising autoencoder, or
DAE, instead minimizes L(x, g(f(x̃))),
where x̃ is a copy of x that has been corrupted by some form of noise. Denoising
autoencoders must therefore undo this corruption rather than simply copying their
input.
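The denoising objective can be sketched as follows. Additive Gaussian noise is only one possible corruption and is an assumption here, as are the function names and the noise level. The example passes identity functions for the encoder and decoder to show that plain copying no longer achieves zero loss.

```python
import numpy as np


def corrupt(x, rng, noise_std=0.3):
    """Return a noisy copy of x (assumed corruption: additive Gaussian noise)."""
    return x + rng.normal(scale=noise_std, size=x.shape)


def denoising_loss(x, f, g, rng, noise_std=0.3):
    """Squared-error L(x, g(f(x_tilde))): reconstruct the clean x from a noisy copy."""
    x_tilde = corrupt(x, rng, noise_std)
    r = g(f(x_tilde))
    return float(np.sum((x - r) ** 2))


def identity(v):
    """Stand-in encoder and decoder that simply copy their input."""
    return v


rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5, 3.0])

copy_loss = denoising_loss(x, identity, identity, rng)
print(copy_loss > 0)  # copying the corrupted input is penalized
```

Because the target is the clean x rather than the corrupted input, the only way to lower the loss is to learn something about what uncorrupted data looks like.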