Applied Deep Learning - Part 1: Artificial Neural Networks
by Arden Dertat | Towards Data Science

Overview
Welcome to the Applied Deep Learning tutorial series. We will do a detailed analysis of
several deep learning techniques starting with Artificial Neural Networks (ANN), in
particular Feedforward Neural Networks. What separates this tutorial from the rest you
can find online is that we’ll take a hands-on approach with plenty of code examples and visualizations. I won’t go into too much math and theory behind these models, to keep the focus on application.

We will use the Keras deep learning framework, which is a high-level API on top of TensorFlow. Keras has become very popular recently because of its simplicity: it’s very easy to build complex models and iterate rapidly. I have also used barebones TensorFlow, and actually struggled quite a bit. After trying out Keras I’m not going back.

Here’s the table of contents. First an overview of ANN and the intuition behind these
deep models. Then we will start simple with Logistic Regression, mainly to get familiar
with Keras. Then we will train deep neural nets and demonstrate how they outperform
linear models. We will compare the models on both binary and multiclass classification
datasets.

1. ANN Overview
1.1) Introduction
1.2) Intuition
1.3) Reasoning

2. Logistic Regression
2.1) Linearly Separable Data
2.2) Complex Data - Moons
2.3) Complex Data - Circles

3. Artificial Neural Networks (ANN)


3.1) Complex Data - Moons
3.2) Complex Data - Circles
3.3) Complex Data - Sine Wave

4. Multiclass Classification
4.1) Softmax Regression
4.2) Deep ANN


5. Conclusion

The code for this article is available here as a Jupyter notebook; feel free to download it and try it out yourself.

I think you’ll learn a lot from this article. You don’t need to have prior knowledge of deep
learning, only some basic familiarity with general machine learning. So let’s begin…

1. ANN Overview
1.1) Introduction
Artificial Neural Networks (ANN) are multi-layer fully-connected neural nets that look like the figure below. They consist of an input layer, multiple hidden layers, and an output layer. Every node in one layer is connected to every node in the next layer. We make the network deeper by increasing the number of hidden layers.

Figure 1

If we zoom in to one of the hidden or output nodes, what we will encounter is the figure
below.


Figure 2

A given node takes the weighted sum of its inputs, and passes it through a non-linear
activation function. This is the output of the node, which then becomes the input of
another node in the next layer. The signal flows from left to right, and the final output is
calculated by performing this procedure for all the nodes. Training this deep neural
network means learning the weights associated with all the edges.

The equation for a given node looks as follows: the weighted sum of its inputs is passed through a non-linear activation function. It can be represented as a vector dot product, where n is the number of inputs for the node.

I omitted the bias term for simplicity. Bias is an input to all the nodes and always has the value 1. It allows us to shift the result of the activation function to the left or right. It also helps the model to train when all the input features are 0. If this sounds complicated right now, you can safely ignore the bias terms. For completeness, the above equation looks as follows with the bias included.
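Written with an explicit bias term b (equivalently, an extra input fixed at 1 with its own weight):

a = f(x \cdot w + b) = f\left( \sum_{i=1}^{n} x_i w_i + b \right)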


So far we have described the forward pass, meaning given an input and weights how the output is computed. After training is complete, we only run the forward pass to make predictions. But first we need to train our model to actually learn the weights. The training procedure works as follows:

Randomly initialize the weights for all the nodes. There are smart initialization
methods which we will explore in another article.

For every training example, perform a forward pass using the current weights, and
calculate the output of each node going from left to right. The final output is the
value of the last node.

Compare the final output with the actual target in the training data, and measure the
error using a loss function.

Perform a backward pass from right to left and propagate the error to every individual node using backpropagation. Calculate each weight’s contribution to the error, and adjust the weights accordingly using gradient descent. The error gradients are propagated back starting from the last layer.
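To make the loop concrete, here is a minimal sketch (not the article’s code) of one training iteration for a single sigmoid node, with toy data and a fixed learning rate:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))            # 8 training examples, 3 features
y = (X.sum(axis=1) > 0).astype(float)  # toy binary targets
w = rng.normal(size=3)                 # randomly initialized weights
b, lr = 0.0, 0.1                       # bias and learning rate

p = sigmoid(X @ w + b)                 # forward pass
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))  # cross-entropy loss

# Backward pass: for a single sigmoid node with cross-entropy loss,
# backpropagation collapses to this closed form.
grad_w = X.T @ (p - y) / len(y)
grad_b = np.mean(p - y)

w -= lr * grad_w                       # gradient descent update
b -= lr * grad_b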

Backpropagation with gradient descent is literally the “magic” behind the deep learning
models. It’s a rather long topic and involves some calculus, so we won’t go into the
specifics in this applied deep learning series. For a detailed explanation of gradient
descent refer here. A basic overview of backpropagation is available here. For a detailed
mathematical treatment refer here and here. And for more advanced optimization
algorithms refer here.

In the standard ML world this feedforward architecture is known as the multilayer perceptron. The difference between an ANN and a perceptron is that the ANN uses a non-linear activation function such as the sigmoid, whereas the perceptron uses the step function. That non-linearity gives the ANN its great power.

1.2) Intuition

There’s a lot going on already, even with the basic forward pass. Now let’s simplify this,
and understand the intuition behind it.

Essentially what each layer of the ANN does is a non-linear transformation of the
input from one vector space to another.

Let’s use the ANN in Figure 1 above as an example. We have a 3-dimensional input
corresponding to a vector in 3D space. We then pass it through two hidden layers with 4
nodes each. And the final output is a 1D vector or a scalar.

So if we visualize this as a sequence of vector transformations, we first map the 3D input to a 4D vector space, then we perform another transformation to a new 4D space, and the final transformation reduces it to 1D. This is just a chain of matrix multiplications.
The forward pass performs these matrix dot products and applies the activation function
element-wise to the result. The figure below only shows the weight matrices being used
(not the activations).

Figure 3

The input vector x has 1 row and 3 columns. To transform it into a 4D space, we need to
multiply it with a 3x4 matrix. Then to another 4D space, we multiply with a 4x4 matrix.
And finally to reduce it to a 1D space, we use a 4x1 matrix.

Notice how the dimensions of the matrices represent the input and output dimensions of
a layer. The connection between a layer with 3 nodes and 4 nodes is a matrix
multiplication using a 3x4 matrix.

These matrices represent the weights that define the ANN. To make a prediction using
the ANN on a given input, we only need to know these weights and the activation
function (and the biases), nothing more. We train the ANN via backpropagation to
“learn” these weights.


If we put everything together it looks like the figure below.

Figure 4

A fully connected layer between 3 nodes and 4 nodes is just a matrix multiplication of the 1x3 input vector (yellow nodes) with the 3x4 weight matrix W1. The result of this dot product is a 1x4 vector, represented as the blue nodes. We then multiply this 1x4 vector with a 4x4 matrix W2, resulting in another 1x4 vector, the green nodes. And finally, using a 4x1 matrix W3, we get the output.

We have omitted the activation function in the above figures for simplicity. In reality, after every matrix multiplication we apply the activation function to each element of the resulting matrix. More formally:
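In symbols, with f applied element-wise after each multiplication (reconstructed from the description):

h_1 = f(x W_1), \qquad h_2 = f(h_1 W_2), \qquad y = f(h_2 W_3)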

Equation 2


The output of each matrix multiplication goes through the activation function f. In the case of the sigmoid function, this means taking the sigmoid of each element in the matrix. We can see the chain of matrix multiplications more clearly in the equations.
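As a quick illustration, here is that chain in NumPy (a sketch with random numbers, not the article’s code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x  = rng.normal(size=(1, 3))   # 1x3 input vector
W1 = rng.normal(size=(3, 4))   # 3D -> 4D
W2 = rng.normal(size=(4, 4))   # 4D -> 4D
W3 = rng.normal(size=(4, 1))   # 4D -> 1D

h1 = sigmoid(x @ W1)           # 1x4, the blue nodes in Figure 4
h2 = sigmoid(h1 @ W2)          # 1x4, the green nodes
y  = sigmoid(h2 @ W3)          # 1x1, the final output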

1.3) Reasoning
So far we talked about what deep models are and how they work, but why do we need to
go deep in the first place?

We saw that a layer of ANN just performs a non-linear transformation of its inputs from
one vector space to another. If we take a classification problem as an example, we want
to separate out the classes by drawing a decision boundary. The input data in its given
form is not separable. By performing non-linear transformations at each layer, we are
able to project the input to a new vector space, and draw a complex decision boundary to
separate the classes.

Let’s visualize what we just described with a concrete example. Given the following data
we can see that it isn’t linearly separable.

So we project it to a higher dimensional space by performing a non-linear transformation, and then it becomes linearly separable. The green hyperplane is the decision boundary.


This is equivalent to drawing a complex decision boundary in the original input space.

So the main benefit of having a deeper model is being able to do more non-linear transformations of the input and draw a more complex decision boundary.

To summarize, ANNs are very flexible yet powerful deep learning models. They are universal function approximators, meaning they can model any complex function. There has been an incredible surge in their popularity recently due to a few reasons: clever tricks that made training these models possible, a huge increase in computational power (especially GPUs and distributed training), and vast amounts of training data. All of these combined enabled deep learning to gain significant traction.

This was a brief introduction; there are tons of great tutorials online that cover deep neural nets. For reference, I highly recommend this paper. It’s a fantastic overview of deep learning, and Section 4 covers ANN. Another great reference is this book, which is available online.

2. Logistic Regression
Despite its name, logistic regression (LR) is a binary classification algorithm. It’s the most popular technique for 0/1 classification. On 2-dimensional (2D) data, LR will try to draw a straight line to separate the classes; that’s where the term linear model comes from. LR works with any number of dimensions though, not just two. For 3D data it’ll try to draw a 2D plane to separate the classes. This generalizes to N-dimensional data and an (N-1)-dimensional hyperplane separator. If you have a supervised binary classification problem, given input data with multiple columns and a binary 0/1 outcome, LR is the first method to try. In this section we will focus on 2D data, since it’s easier to visualize; in another tutorial we will focus on multidimensional input.

2.1) Linearly Separable Data


First let’s start with an easy example: 2D linearly separable data. We are using the scikit-learn make_classification method to generate our data, and a helper function to visualize it.
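Here is a sketch of the data generation; the exact arguments and the plotting helper are assumptions, not the article’s original code:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Two informative features, one cluster per class, fixed seed for reproducibility.
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           random_state=42)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
plt.show()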

There is a LogisticRegression classifier available in scikit-learn. I won’t go into too much detail here, since our goal is to learn how to build models with Keras. But here’s how to train an LR model, using the fit function just like any other model in scikit-learn. We see the linear decision boundary as the green line.
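Here’s a sketch of that step, assuming the X and y generated above (the decision boundary plotting is omitted):

from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression()
lr_model.fit(X, y)            # train just like any other scikit-learn model
print(lr_model.score(X, y))   # mean accuracy on the training data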


As we can see, the data is linearly separable. We will now train the same logistic regression model with Keras to predict the class membership of every input point. To keep things simple for now, we won’t perform the standard practices of separating out the data into training and test sets, or performing k-fold cross-validation.

Keras has great documentation; check it out for a more detailed description of its API. Here’s the code for training the model; let’s go over it step by step below.

We will use the Sequential model API available here. The Sequential model allows us to build deep neural networks by stacking layers one on top of another. Since we’re now building a simple logistic regression model, we will have the input nodes directly connected to the output node, without any hidden layers. Note that the LR model has the form y=f(xW), where f is the sigmoid function. Having a single output layer directly connected to the input reflects this function.


A quick clarification to disambiguate the terms being used. In the neural networks literature, it’s common to talk about input nodes and output nodes. This may sound strange at first glance: what’s an input “node” per se? When we say input nodes, we’re talking about the features of a given training example. In our case we have 2 features, the x and y coordinates of the points we plotted above, so we have 2 input nodes. You can simply think of it as a vector of 2 numbers. What about the output node then? The output of the logistic regression model is a single number, the probability of an input data point belonging to class 1; in other words, P(class=1). The probability of an input point belonging to class 0 is then P(class=0)=1−P(class=1). So you can simply think of the output node as a vector with a single number (or simply a scalar) between 0 and 1.

In Keras we don’t add layers corresponding to the input nodes; we only do so for hidden and output nodes. In our current model, we don’t have any hidden layers, and the input nodes are directly connected to the output node. This means our neural network definition in Keras will have just one layer with one node, corresponding to the output node.

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=1, input_shape=(2,), activation='sigmoid'))

The Dense function in Keras constructs a fully connected neural network layer, automatically initializing the weights and biases. It’s a super useful function that you will see being used everywhere. The function arguments are defined as follows:

units: The first argument, representing the number of nodes in this layer. Since we’re constructing the output layer, and we said it has only one node, this value is 1.

input_shape: The first layer in Keras models needs to specify the input dimensions. The subsequent layers (which we don’t have here, but we will in later sections) don’t need to specify this argument, because Keras can infer the dimensions automatically. In this case our input dimensionality is 2, the x and y coordinates. The input_shape parameter expects a vector, so in our case it’s a tuple with one number.

activation: The activation function of a logistic regression model is the logistic function, also called the sigmoid. We will explore different activation functions, where to use them, and why in another tutorial.


model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

We then compile the model with the compile function. This creates the neural network model by specifying the details of the learning process. The model hasn’t been trained yet; right now we’re just declaring the optimizer to use and the loss function to minimize. The arguments for the compile function are defined as follows:

optimizer: Which optimizer to use in order to minimize the loss function. There are a lot of different optimizers, most of them based on gradient descent. We will explore different optimizers in another tutorial. For now we will use the adam optimizer, which is the one people prefer to use by default.

loss: The loss function to minimize. Since we’re building a binary 0/1 classifier, the
loss function to minimize is binary_crossentropy. We will see other examples of loss
functions in later sections.

metrics: Which metric to report statistics on, for classification problems we set this
as accuracy.

history = model.fit(x=X, y=y, verbose=0, epochs=50)

Now comes the fun part: actually training the model using the fit function. The arguments are as follows:

x: The input data, which we defined as X above. It contains the x and y coordinates of the input points.

y: Not to be confused with the y coordinate of the input points. In all ML tutorials y refers to the labels, in our case the class we’re trying to predict: 0 or 1.

verbose: Prints out the loss and accuracy; set it to 1 to see the output.

epochs: Number of times to go over the entire training data. When training models
we pass through the training data not just once but multiple times.

plot_loss_accuracy(history)


The output of the fit method is the loss and accuracy at every epoch. We then plot it
using our custom function, and see that the loss goes down to almost 0 over time, and
the accuracy goes up to almost 1. Great! We have successfully trained our first neural
network model with Keras. I know this was a long explanation, but I wanted to explain
what we’re doing in detail the first time. Once you understand what’s going on and
practice a couple of times, all this becomes second nature.

Below is a plot of the decision boundary. The various shades of blue and red represent
the probability of a hypothetical point in that area belonging to class 1 or 0. The top left
area is classified as class 1, with the color blue. The bottom right area is classified as class
0, colored as red. And there is a transition around the decision boundary. This is a cool
way to visualize the decision boundary the model is learning.


The classification report shows the precision and recall of our model. We get close to 100% accuracy. The value shown in the report should be 0.997, but it got rounded up to 1.0.

The confusion matrix shows us how many points were correctly classified vs misclassified. The numbers on the diagonal represent the number of correctly classified points; the rest are the misclassified ones. This particular matrix is not very interesting, because the model only misclassifies 3 points. We can see one of the misclassified points in the top right part of the confusion matrix: the true value is class 0 but the predicted value is class 1.


2.2) Complex Data - Moons


The previous dataset was linearly separable, so it was trivial for our logistic regression
model to separate the classes. Here is a more complex dataset which isn’t linearly
separable. The simple logistic regression model won’t be able to clearly distinguish
between the classes. We’re using the make_moons method of scikit-learn to generate the
data.
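A sketch of the data generation; the sample count and noise level are assumptions:

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.05, random_state=42)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
plt.show()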

Let’s build another logistic regression model with the same parameters as we did before.
On this dataset we get 86% accuracy.


The current decision boundary doesn’t look as clean as the one before. The model tried
to separate out the classes from the middle, but there are a lot of misclassified points. We
need a more complex classifier with a non-linear decision boundary, and we will see an
example of that soon.


The precision of the model is 86%. It looks good on paper, but we should easily be able to get 100% with a more complex model. You can imagine a curved decision boundary that would separate out the classes, and a complex model should be able to approximate that.

The classification report and the confusion matrix look as follows.

2.3) Complex Data - Circles


Let’s look at one final example where the linear model will fail, this time using the make_circles function in scikit-learn.
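Again a sketch; the noise level and the factor controlling the gap between the circles are assumptions:

from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1000, noise=0.05, factor=0.3, random_state=42)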


Building the model with the same parameters.

The decision boundary again passes from the middle of the data, but now we have much
more misclassified points.


The accuracy is around 50%, shown below. No matter where the model draws the line, it
will misclassify half of the points, due to the nature of the dataset.

The confusion matrix we see here is an example of one belonging to a poor classifier. Ideally we prefer confusion matrices to look like the ones we saw above: high numbers along the diagonal, meaning that the classifier was right, and low numbers everywhere else, where the classifier was wrong. In our visualization, the color blue represents large numbers and yellow represents smaller ones. So we would prefer to see blue on the diagonal and yellow everywhere else. Blue everywhere is a bad sign, meaning that our classifier is confused.


The most naive method, which always predicts 1 no matter what the input is, would get 50% accuracy. Our model also got 50% accuracy, so it’s not useful at all.

3. Artificial Neural Networks (ANN)


Now we will train a deep Artificial Neural Network (ANN) to better classify the datasets which the logistic regression model struggled with: Moons and Circles. We will also classify an even harder Sine Wave dataset to demonstrate that ANNs can form really complex decision boundaries.

3.1) Complex Data - Moons


While building Keras models for logistic regression above, we performed the following
steps:

Step 1: Define a Sequential model.

Step 2: Add a Dense layer with sigmoid activation function. This was the only layer
we needed.

Step 3: Compile the model with an optimizer and loss function.

Step 4: Fit the model to the dataset.

Step 5: Analyze the results: plotting loss/accuracy curves, plotting the decision
boundary, looking at the classification report, and understanding the confusion
matrix.

While building a deep neural network, we only need to change Step 2, such that we add several Dense layers one after another. The output of one layer becomes the input of the next. Keras again does most of the heavy lifting by initializing the weights and biases, and connecting the output of one layer to the input of the next. We only need to specify how many nodes we want in a given layer, and the activation function. It’s as simple as that.


We first add a layer with 4 nodes and the tanh activation function. Tanh is a commonly used activation function, and we’ll learn more about it in another tutorial. We then add another layer with 2 nodes, again using tanh activation. We finally add the last layer with 1 node and sigmoid activation. This is the same final layer that we used in the logistic regression model.
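A sketch of this model, following the same Keras calls as the logistic regression example (the epoch count is an assumption):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=4, input_shape=(2,), activation='tanh'))  # hidden layer 1
model.add(Dense(units=2, activation='tanh'))                    # hidden layer 2
model.add(Dense(units=1, activation='sigmoid'))                 # output layer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(x=X, y=y, verbose=0, epochs=100)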

This is not a very deep ANN; it only has 3 layers: 2 hidden layers and the output layer. But notice a couple of patterns:

The output layer still uses the sigmoid activation function, since we’re working on a binary classification problem.

Hidden layers use the tanh activation function. If we added more hidden layers, they
would also use tanh activation. We have a couple of options for activation functions:
sigmoid, tanh, relu, and variants of relu. In another article we’ll explore the pros and
cons of each one. We will also demonstrate why using sigmoid activation in hidden
layers is a bad idea. For now it’s safe to use tanh.

We have fewer nodes in each subsequent layer. It’s common to have fewer nodes as we stack layers on top of one another, giving a sort of triangular shape.

We didn’t build a very deep ANN here because it wasn’t necessary. We already achieve
100% accuracy with this configuration.


The ANN is able to come up with a perfect separator to distinguish the classes.

100% precision, nothing misclassified.


3.2) Complex Data - Circles


Now let’s look at the Circles dataset, where the LR model achieved only 50% accuracy. The model is the same as above; we only change the input to the fit function, using the current dataset. And we again achieve 100% accuracy.

Similarly the decision boundary looks just like the one we would draw by hand ourselves.
The ANN was able to figure out an optimal separator.


Just like above we get 100% accuracy.

3.3) Complex Data - Sine Wave


Let’s try to classify one final toy dataset. In the previous sections, the classes were separable by one continuous decision boundary. The boundary had a complex shape, it wasn’t linear, but one continuous decision boundary was still enough. ANNs can draw an arbitrary number of complex decision boundaries, and we will demonstrate that.

Let’s create a sinusoidal dataset looking like the sine function, with every up and down belonging to an alternating class. As we can see in the figure, a single decision boundary won’t be able to separate out the classes. We will need a series of non-linear separators.

Now we need a more complex model for accurate classification. So we have 3 hidden layers and an output layer. The number of nodes per layer has also been increased to improve the learning capacity of the model. Choosing the right number of hidden layers and nodes per layer is more of an art than a science, usually decided by trial and error.
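A sketch of such a model; the layer sizes below are illustrative guesses, not the article’s exact configuration:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=64, input_shape=(2,), activation='tanh'))  # hidden layer 1
model.add(Dense(units=32, activation='tanh'))                    # hidden layer 2
model.add(Dense(units=16, activation='tanh'))                    # hidden layer 3
model.add(Dense(units=1, activation='sigmoid'))                  # output layer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])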


The ANN was able to model a pretty complex set of decision boundaries.


Precision is 99%; we only have 14 misclassified points out of 2400. Pretty good.

4. Multiclass Classification
In the previous sections we worked on binary classification. Now we will take a look at a
multi-class classification problem, where the number of classes is more than 2. We will
pick 3 classes for demonstration, but our approach generalizes to any number of classes.

Here’s what our dataset looks like: spiral data with 3 classes, generated using the make_multiclass method in scikit-learn.


4.1) Softmax Regression


As we saw above, Logistic Regression (LR) is a classification method for 2 classes. It
works with binary labels 0/1. Softmax Regression (SR) is a generalization of LR where
we can have more than 2 classes. In our current dataset we have 3 classes, represented as
0/1/2.

Building the model for SR is very similar to LR; for reference, here’s how we built our Logistic Regression model.

And here’s how we will build the Softmax Regression model.
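A sketch reconstructed from the differences listed below, assuming X and y hold the spiral data above (the epoch count is an assumption):

from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

y_cat = to_categorical(y)    # one-hot labels: a matrix with 3 columns

model = Sequential()
model.add(Dense(units=3, input_shape=(2,), activation='softmax'))  # one node per class
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x=X, y=y_cat, verbose=0, epochs=50)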


There are a couple of differences, let’s go over them one by one:

Number of nodes in the dense layer: LR uses 1 node, whereas SR has 3 nodes. Since we have 3 classes it makes sense for SR to use 3 nodes. Then the question is, why does LR use only 1 node; it has 2 classes, so it appears we should have used 2 nodes instead. The answer is that we can achieve the same result using only 1 node. As we saw above, LR models the probability of an example belonging to class 1: P(class=1). And we can calculate the class 0 probability by 1−P(class=1). But when we have more than 2 classes, we need individual nodes for each class, because knowing the probability of one class doesn’t let us infer the probabilities of the other classes.

Activation function: LR uses the sigmoid activation function, SR uses softmax (the standard formula is given after this list). Softmax scales the values of the output nodes such that they represent probabilities and sum up to 1. So in our case P(class=0)+P(class=1)+P(class=2)=1. It doesn’t do this naively by dividing individual values by their sum though; it uses the exponential function, so higher values get emphasized more and lower values get squashed more. We will talk in detail about what softmax does in another tutorial. For now you can simply think of it as a normalization function which lets us interpret the output values as probabilities.

Loss function: In a binary classification problem like LR, the loss function is
binary_crossentropy. In the multiclass case, the loss function is
categorical_crossentropy. Categorical crossentropy is the generalization of binary
crossentropy to more than 2 classes. Going into the theory behind loss functions is
beyond the scope of this tutorial. But for now only knowing this property is enough.

Fit function: LR used the vector y directly in the fit function; y has just one column with 0/1 values. When we’re doing SR, the labels need to be in one-hot representation. In our case y_cat is a matrix with 3 columns, where all the values are 0 except for the one that represents our class, which is 1.
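For reference, the softmax mentioned above has the standard form (for K classes, here K=3):

\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}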

It took some time to talk about all the differences between LR and SR, and it looks like
there’s a lot to digest. But again after some practice this will become a habit, and you
won’t even need to think about any of this.


After all this theory let’s take a step back and remember that LR is a linear classifier. SR
is also a linear classifier, but for multiple classes. So the “power” of the model hasn’t
changed, it’s still a linear model. We just generalized LR to apply it to a multiclass
problem.

Training the model gives us an accuracy of around 50%. The most naive method, which always predicts class 1 no matter what the input is, would have an accuracy of 33%. The SR model is not much of an improvement over it, which is expected because the dataset is not linearly separable.

Looking at the decision boundary confirms that we still have a linear classifier. The lines
look jagged due to floating point rounding but in reality they’re straight.


Here’s the precision and recall corresponding to the 3 classes. And the confusion matrix
is all over the place. Clearly this is not an optimal classifier.

4.2) Deep ANN


Now let’s build a deep ANN for multiclass classification. Remember that the changes going from LR to ANN were minimal. We only needed to add more Dense layers. We will do the same again, adding a couple of Dense layers with the tanh activation function, as sketched below.
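A sketch, with X and y_cat as in the softmax example; the hidden-layer sizes are assumptions, while the output layer and the loss follow the text below:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=32, input_shape=(2,), activation='tanh'))  # hidden layer 1
model.add(Dense(units=16, activation='tanh'))                    # hidden layer 2
model.add(Dense(units=3, activation='softmax'))                  # 3 nodes, one per class
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x=X, y=y_cat, verbose=0, epochs=50)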


Note that the output layer still has 3 nodes, and uses the softmax activation. The loss
function also didn’t change, still categorical_crossentropy. These won’t change going
from a linear model to a deep ANN, since the problem definition hasn’t changed. We’re
still working on multiclass classification. But now using a more powerful model, and that
power comes from adding more layers to our neural net.

We achieve 99% accuracy in just a couple of epochs.

The decision boundary is non-linear.


We got almost 100% accuracy; we misclassified only 5 points out of 1500.

5. Conclusion
Thanks for spending time reading this article, I know this was a rather lengthy tutorial
on Artificial Neural Networks and Keras. I wanted to be as detailed as possible, while still
keeping the article length manageable. I hope you enjoyed it.


There was a common theme in this article: we first introduced the task, then approached it using a simple method and observed its limitations. Afterwards we used a more complex deep model to improve on it and got much better results. I think the ordering is important. No complex method becomes successful unless it has evolved from a simpler model.

The entire code for this article is available here if you want to hack on it yourself. If you have any feedback, feel free to reach out to me on Twitter.
