Applied Deep Learning - Part 1 - Artificial Neural Networks - by Arden Dertat - Towards Data Science
Overview
Welcome to the Applied Deep Learning tutorial series. We will do a detailed analysis of
several deep learning techniques starting with Artificial Neural Networks (ANN), in
particular Feedforward Neural Networks. What separates this tutorial from the rest you
can find online is that we’ll take a hands-on approach with plenty of code examples and
visualization. I won’t go into too much math and theory behind these models to keep the
focus on application.
We will use the Keras deep learning framework, which is a high-level API on top of
TensorFlow. Keras has become super popular recently because of its simplicity: it's very
easy to build complex models and iterate rapidly. I also used bare-bones TensorFlow, and
actually struggled quite a bit. After trying out Keras I'm not going back.
Here’s the table of contents. First an overview of ANN and the intuition behind these
deep models. Then we will start simple with Logistic Regression, mainly to get familiar
with Keras. Then we will train deep neural nets and demonstrate how they outperform
linear models. We will compare the models on both binary and multiclass classification
datasets.
1. ANN Overview
1.1) Introduction
1.2) Intuition
1.3) Reasoning
2. Logistic Regression
2.1) Linearly Separable Data
2.2) Complex Data - Moons
2.3) Complex Data - Circles
3. Artificial Neural Networks
3.1) Complex Data - Moons
3.2) Complex Data - Circles
3.3) Complex Data - Sine Wave
4. Multiclass Classification
4.1) Softmax Regression
4.2) Deep ANN
5. Conclusion
The code for this article is available here as a Jupyter notebook, feel free to download
and try it out yourself.
I think you’ll learn a lot from this article. You don’t need to have prior knowledge of deep
learning, only some basic familiarity with general machine learning. So let’s begin…
1. ANN Overview
1.1) Introduction
Artificial Neural Networks (ANN) are multi-layer fully-connected neural nets that look
like the figure below. They consist of an input layer, multiple hidden layers, and an
output layer. Every node in one layer is connected to every node in the next layer.
We make the network deeper by increasing the number of hidden layers.
Figure 1
If we zoom in on one of the hidden or output nodes, we will encounter the figure
below.
Figure 2
A given node takes the weighted sum of its inputs, and passes it through a non-linear
activation function. This is the output of the node, which then becomes the input of
another node in the next layer. The signal flows from left to right, and the final output is
calculated by performing this procedure for all the nodes. Training this deep neural
network means learning the weights associated with all the edges.
The equation for a given node is the weighted sum of its inputs passed through a
non-linear activation function. It can be represented as a vector dot product, where n is
the number of inputs for the node:

$y = f\left(\sum_{i=1}^{n} x_i w_i\right) = f(x \cdot w)$
I omitted the bias term for simplicity. Bias is an input to all the nodes and always has the
value 1. It allows the result of the activation function to be shifted to the left or right. It also
helps the model train when all the input features are 0. If this sounds complicated
right now, you can safely ignore the bias terms. For completeness, the above equation
looks as follows with the bias included:

$y = f\left(b + \sum_{i=1}^{n} x_i w_i\right)$
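To make this concrete, here is a minimal NumPy sketch of a single node's computation (my own illustration, not the article's code), assuming a sigmoid activation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])  # the node's 3 inputs
w = np.array([0.1, 0.4, -0.2])  # one weight per input
b = 0.3                         # weight on the constant bias input of 1

# weighted sum of the inputs plus bias, passed through the activation
output = sigmoid(np.dot(x, w) + b)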
So far we have described the forward pass: how the output is computed given an input
and the weights. After training is complete, we only run the forward pass to make
predictions. But we first need to train our model to actually learn the weights, and
the training procedure works as follows:
Randomly initialize the weights for all the nodes. There are smart initialization
methods which we will explore in another article.
For every training example, perform a forward pass using the current weights, and
calculate the output of each node going from left to right. The final output is the
value of the last node.
Compare the final output with the actual target in the training data, and measure the
error using a loss function.
Perform a backward pass from right to left and propagate the error to every
individual node using backpropagation. Calculate each weight's contribution to the
error, and adjust the weights accordingly using gradient descent, propagating the error
gradients back starting from the last layer (a tiny sketch of one such update follows).
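To make these steps concrete, here is a minimal NumPy sketch (my own, not from the article) of one forward pass, loss computation, and gradient-descent update for a single sigmoid output node with a cross-entropy loss:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # one training example
y = 1.0                          # its target label
w = np.random.randn(3) * 0.1     # randomly initialized weights
lr = 0.1                         # learning rate

p = sigmoid(np.dot(x, w))                           # forward pass
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))   # binary cross-entropy
grad = (p - y) * x                                  # gradient of the loss w.r.t. w
w = w - lr * grad                                   # gradient descent update

Real frameworks repeat this over many examples and many layers, which is exactly what backpropagation automates.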
Backpropagation with gradient descent is literally the “magic” behind the deep learning
models. It’s a rather long topic and involves some calculus, so we won’t go into the
specifics in this applied deep learning series. For a detailed explanation of gradient
descent refer here. A basic overview of backpropagation is available here. For a detailed
mathematical treatment refer here and here. And for more advanced optimization
algorithms refer here.
In the standard ML world this feedforward architecture is known as the multilayer
perceptron. The difference between an ANN and a perceptron is that an ANN uses a non-
linear activation function such as sigmoid, while the perceptron uses the step function.
That non-linearity gives the ANN its great power.
1.2) Intuition
There’s a lot going on already, even with the basic forward pass. Now let’s simplify this,
and understand the intuition behind it.
Essentially what each layer of the ANN does is a non-linear transformation of the
input from one vector space to another.
Let’s use the ANN in Figure 1 above as an example. We have a 3-dimensional input
corresponding to a vector in 3D space. We then pass it through two hidden layers with 4
nodes each. And the final output is a 1D vector or a scalar.
Figure 3
The input vector x has 1 row and 3 columns. To transform it into a 4D space, we need to
multiply it with a 3x4 matrix. Then to another 4D space, we multiply with a 4x4 matrix.
And finally to reduce it to a 1D space, we use a 4x1 matrix.
Notice how the dimensions of the matrices represent the input and output dimensions of
a layer. The connection between a layer with 3 nodes and 4 nodes is a matrix
multiplication using a 3x4 matrix.
These matrices represent the weights that define the ANN. To make a prediction using
the ANN on a given input, we only need to know these weights and the activation
function (and the biases), nothing more. We train the ANN via backpropagation to
“learn” these weights.
Figure 4
A fully connected layer between 3 nodes and 4 nodes is just a matrix multiplication of
the 1x3 input vector (yellow nodes) with the 3x4 weight matrix W1. The result of this dot
product is a 1x4 vector represented as the blue nodes. We then multiply this 1x4 vector
with a 4x4 matrix W2, resulting in a 1x4 vector, the green nodes. And finally, using
a 4x1 matrix W3, we get the output.
We have omitted the activation function in the above figures for simplicity. In reality
after every matrix multiplication, we apply the activation function to each element of the
resulting matrix. More formally
Equation 2: $h_1 = f(x W_1), \quad h_2 = f(h_1 W_2), \quad y = f(h_2 W_3)$
The output of each matrix multiplication goes through the activation function f. In the case
of the sigmoid function, this means taking the sigmoid of each element in the matrix. We
can see the chain of matrix multiplications more clearly in the equations.
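Here is a small NumPy sketch of this chain for the network in Figure 1; the weights are random placeholders rather than trained values:

import numpy as np

def f(z):  # sigmoid activation, applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.randn(1, 3)    # 1x3 input vector
W1 = np.random.randn(3, 4)   # 3D input -> first 4D hidden layer
W2 = np.random.randn(4, 4)   # first hidden -> second 4D hidden layer
W3 = np.random.randn(4, 1)   # second hidden -> 1D output

h1 = f(x @ W1)   # shape 1x4, the blue nodes
h2 = f(h1 @ W2)  # shape 1x4, the green nodes
y = f(h2 @ W3)   # shape 1x1, the final scalar output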
1.3) Reasoning
So far we talked about what deep models are and how they work, but why do we need to
go deep in the first place?
We saw that a layer of ANN just performs a non-linear transformation of its inputs from
one vector space to another. If we take a classification problem as an example, we want
to separate out the classes by drawing a decision boundary. The input data in its given
form is not separable. By performing non-linear transformations at each layer, we are
able to project the input to a new vector space, and draw a complex decision boundary to
separate the classes.
Let’s visualize what we just described with a concrete example. Given the following data
we can see that it isn’t linearly separable.
After the non-linear transformations of the hidden layers, the classes become separable
by a simple boundary in the new vector space, and drawing that boundary there is
equivalent to drawing a complex decision boundary in the original input space. So the
main benefit of having a deeper model is being able to do more non-linear
transformations of the input and draw a more complex decision boundary.
As a summary, ANNs are very flexible yet powerful deep learning models. They are
universal function approximators, meaning they can approximate any continuous
function. Their popularity has surged recently due to a few reasons: clever tricks that
made training these models feasible, huge increases in computational power (especially
GPUs and distributed training), and vast amounts of training data. All of these combined
enabled deep learning to gain significant traction.
This was a brief introduction; there are tons of great tutorials online covering deep
neural nets. For reference, I highly recommend this paper. It's a fantastic overview of
deep learning, and Section 4 covers ANN. Another great reference is this book, which is
available online.
2. Logistic Regression
Despite its name, logistic regression (LR) is a binary classification algorithm. It's the
most popular technique for 0/1 classification. On 2-dimensional (2D) data, LR will try
to draw a straight line to separate the classes; that's where the term linear model comes
from. LR works with any number of dimensions though, not just two. For 3D data it'll try
to draw a 2D plane to separate the classes. This generalizes to N-dimensional data and an
(N-1)-dimensional hyperplane separator. If you have a supervised binary classification
problem, with input data of multiple columns and a binary 0/1 outcome, LR is the
first method to try. In this section we will focus on 2D data since it's easier to visualize,
and in another tutorial we will focus on multidimensional input.
2.1) Linearly Separable Data

As we can see, the data is linearly separable. We will now train a logistic
regression model with Keras to predict the class membership of every input point. To
keep things simple for now, we won't perform the standard practices of separating out
the data into training and test sets, or performing k-fold cross-validation.
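The data-generation code isn't visible on this page; here is a plausible sketch using scikit-learn, where the exact parameter values are my assumption:

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)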
Keras has great documentation; check it out for a more detailed description of its API.
Here's the code for training the model. Let's go over it step by step below.
We will use the Sequential model API available here. The Sequential model allows us to
build deep neural networks by stacking layers one on top of another. Since we're now
building a simple logistic regression model, we will have the input nodes directly
connected to the output node, without any hidden layers. Note that the LR model has the
form y=f(xW), where f is the sigmoid function. Having a single output node directly
connected to the input reflects this function.
A quick clarification to disambiguate the terms being used. In neural networks literature,
it's common to talk about input nodes and output nodes. This may sound strange at first
glance: what's an input "node" per se? When we say input nodes, we're talking about the
features of a given training example. In our case we have 2 features, the x and y
coordinates of the points we plotted above, so we have 2 input nodes. You can simply
think of it as a vector of 2 numbers. What about the output node then? The output of the
logistic regression model is a single number, the probability of an input data point
belonging to class 1. In other words P(class=1). The probability of an input point
belonging to class 0 is then P(class=0)=1−P(class=1). So you can simply think of the
output node as a vector with a single number (or simply a scalar) between 0 and 1.
In Keras we don't add layers corresponding to input nodes; we only do so for hidden and
output nodes. In our current model, we don't have any hidden layers; the input nodes are
directly connected to the output node. This means our neural network definition in Keras
will have just one layer with one node, corresponding to the output node.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=1, input_shape=(2,), activation='sigmoid'))
The Dense function in Keras constructs a fully connected neural network layer,
automatically initializing the weights and biases. It's a super useful function that you will
see being used everywhere. The function arguments are defined as follows:
units: The first argument, representing the number of nodes in this layer. Since we're
constructing the output layer, and we said it has only one node, this value is 1.
input_shape: The first layer in a Keras model needs to specify the input dimensions.
The subsequent layers (which we don't have here, but we will in later sections) don't
need to specify this argument because Keras can infer the dimensions automatically.
In this case our input dimensionality is 2, the x and y coordinates. The input_shape
parameter expects a tuple, so in our case it's a tuple with one number.
We then compile the model with the compile function. This specifies the details of the
learning process; the model hasn't been trained yet. Right now we're just declaring the
optimizer to use and the loss function to minimize. The arguments for the compile
function are defined as follows:
optimizer: Which optimizer to use in order to minimize the loss function. There are a
lot of different optimizers, most of them based on gradient descent. We will explore
different optimizers in another tutorial. For now we will use the adam optimizer,
which is the one people prefer to use by default.
loss: The loss function to minimize. Since we’re building a binary 0/1 classifier, the
loss function to minimize is binary_crossentropy. We will see other examples of loss
functions in later sections.
metrics: Which metric to report statistics on; for classification problems we set this
to accuracy. The full compile call is sketched below.
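Putting those arguments together, the compile step consistent with this description looks like:

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])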
Now comes the fun part of actually training the model using the fit function. The
arguments are as follows:
x: The input data, which we defined as X above. It contains the x and y coordinates of the
input points.
y: The labels of the input points, with 0/1 values, which we defined as y above.
verbose: Prints out the loss and accuracy; set it to 1 to see the output.
epochs: Number of times to go over the entire training data. When training models
we pass through the training data not just once but multiple times. The fit call is
sketched below.
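A sketch of the fit call (the epoch count here is my assumption), whose returned history we then plot:

history = model.fit(x=X, y=y, verbose=0, epochs=50)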
plot_loss_accuracy(history)
The output of the fit method is the loss and accuracy at every epoch. We then plot it
using our custom function, and see that the loss goes down to almost 0 over time, and
the accuracy goes up to almost 1. Great! We have successfully trained our first neural
network model with Keras. I know this was a long explanation, but I wanted to explain
what we’re doing in detail the first time. Once you understand what’s going on and
practice a couple of times, all this becomes second nature.
Below is a plot of the decision boundary. The various shades of blue and red represent
the probability of a hypothetical point in that area belonging to class 1 or 0. The top left
area is classified as class 1, with the color blue. The bottom right area is classified as class
0, colored as red. And there is a transition around the decision boundary. This is a cool
way to visualize the decision boundary the model is learning.
The classification report shows the precision and recall of our model. We get close to
100% accuracy. The value shown in the report should be 0.997 but it got rounded up to
1.0.
The confusion matrix shows us how many points were correctly classified vs
misclassified. The numbers on the diagonal represent the number of correctly
classified points; the rest are the misclassified ones. This particular matrix is not very
interesting because the model only misclassifies 3 points. We can see one of the
misclassified points at the top right part of the confusion matrix: the true value is class 0
but the predicted value is class 1.
2.2) Complex Data - Moons

Our second dataset is the moons dataset, which is not linearly separable; a sketch of
generating it is below. Let's build another logistic regression model with the same
parameters as we did before. On this dataset we get 86% accuracy.
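A plausible way to generate the moons data with scikit-learn (the noise level is my assumption):

from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.05, random_state=0)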
The current decision boundary doesn’t look as clean as the one before. The model tried
to separate out the classes from the middle, but there are a lot of misclassified points. We
need a more complex classifier with a non-linear decision boundary, and we will see an
example of that soon.
The precision of the model is 86%. It looks good on paper, but we should easily be able to
get 100% with a more complex model. You can imagine a curved decision boundary that
would separate out the classes, and a complex model should be able to approximate that.
2.3) Complex Data - Circles

Our third dataset consists of concentric circles; a sketch of generating it is below. The
decision boundary again passes through the middle of the data, but now we have many
more misclassified points.
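A plausible way to generate the circles data with scikit-learn (noise and factor values are my assumptions):

from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1000, noise=0.05, factor=0.3, random_state=0)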
The accuracy is around 50%, shown below. No matter where the model draws the line, it
will misclassify half of the points, due to the nature of the dataset.
The confusion matrix we see here is an example of one belonging to a poor classifier.
Ideally we prefer confusion matrices to look like the ones we saw above: high numbers
along the diagonal, meaning the classifier was right, and low numbers everywhere
else, where the classifier was wrong. In our visualization, the color blue represents large
numbers and yellow represents the smaller ones. So we would prefer to see blue on the
diagonal and yellow everywhere else. Blue everywhere is a bad sign, meaning that
our classifier is confused.
The most naive method, which always predicts 1 no matter what the input is, would get
50% accuracy. Our model also got 50% accuracy, so it's not useful at all.
3. Artificial Neural Networks

3.1) Complex Data - Moons

Let's recap the steps we followed to build and train the logistic regression model:

Step 1: Define a Sequential model.

Step 2: Add a Dense layer with sigmoid activation function. This was the only layer
we needed.

Step 3: Compile the model with an optimizer and a loss function.

Step 4: Train the model using the fit function.

Step 5: Analyze the results: plotting loss/accuracy curves, plotting the decision
boundary, looking at the classification report, and understanding the confusion
matrix.
While building a deep neural network, we only need to change step 2: we will
add several Dense layers one after another. The output of one layer becomes the input of
the next. Keras again does most of the heavy lifting by initializing the weights and biases,
and connecting the output of one layer to the input of the next. We only need to specify
how many nodes we want in a given layer, and the activation function. It's as simple as
that.
We first add a layer with 4 nodes and tanh activation function. Tanh is a commonly used
activation function, and we’ll learn more about it in another tutorial. We then add
another layer with 2 nodes again using tanh activation. We finally add the last layer with
1 node and sigmoid activation. This is the final layer that we also used in the logistic
regression model.
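A sketch of the model just described, assuming the same compile settings as before:

model = Sequential()
model.add(Dense(units=4, input_shape=(2,), activation='tanh'))
model.add(Dense(units=2, activation='tanh'))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])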
This is not a very deep ANN; it only has 3 layers: 2 hidden layers and the output layer.
But notice a couple of patterns:

The output layer still uses the sigmoid activation function, since we're working on a
binary classification problem.

The hidden layers use the tanh activation function. If we added more hidden layers, they
would also use tanh activation. We have a couple of options for activation functions:
sigmoid, tanh, relu, and variants of relu. In another article we'll explore the pros and
cons of each one. We will also demonstrate why using sigmoid activation in hidden
layers is a bad idea. For now it's safe to use tanh.

We have fewer nodes in each subsequent layer. It's common to have fewer
nodes as we stack layers on top of one another, giving a sort of triangular shape.
We didn’t build a very deep ANN here because it wasn’t necessary. We already achieve
100% accuracy with this configuration.
The ANN is able to come up with a perfect separator to distinguish the classes.
3.2) Complex Data - Circles

Similarly, on the circles dataset the decision boundary looks just like the one we would
draw by hand ourselves. The ANN was able to figure out an optimal separator.
3.3) Complex Data - Sine Wave

In the previous examples the decision boundary wasn't linear, but a single continuous
boundary was still enough. An ANN can draw an arbitrary number of complex decision
boundaries, and we will demonstrate that.
Let's create a sinusoidal dataset looking like the sine function, with every up and down
belonging to an alternating class; a sketch of generating such data follows. As we can see
in the figure, a single decision boundary won't be able to separate out the classes. We
will need a series of non-linear separators.
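One way to construct such data with NumPy (my own construction; the notebook's may differ), with the 2400 points matching the count mentioned later:

import numpy as np

n = 2400
x1 = np.random.uniform(-3 * np.pi, 3 * np.pi, n)  # position along the wave
x2 = np.sin(x1) + np.random.normal(0, 0.15, n)    # noisy sine value
X = np.column_stack([x1, x2])
y = (np.floor(x1 / np.pi) % 2).astype(int)        # alternate the class every half-period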
Now we need a more complex model for accurate classification, so we use 3 hidden
layers and an output layer. The number of nodes per layer has also increased to improve
the learning capacity of the model. Choosing the right number of hidden layers and
nodes per layer is more of an art than a science, usually decided by trial and error.
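A sketch of such a model; the exact layer sizes used in the notebook may differ from these guesses:

model = Sequential()
model.add(Dense(units=64, input_shape=(2,), activation='tanh'))
model.add(Dense(units=32, activation='tanh'))
model.add(Dense(units=16, activation='tanh'))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])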
The ANN was able to model a pretty complex set of decision boundaries.
Precision is 99%; we only have 14 misclassified points out of 2400. Pretty good.
4. Multiclass Classification
In the previous sections we worked on binary classification. Now we will take a look at a
multi-class classification problem, where the number of classes is more than 2. We will
pick 3 classes for demonstration, but our approach generalizes to any number of classes.
Here's what our dataset looks like: spiral data with 3 classes, generated using
the make_multiclass method defined in the notebook.
4.1) Softmax Regression

We will train a Softmax Regression (SR) model, the generalization of logistic regression
to multiple classes. Building the model for SR is very similar to LR; the sketch below
shows, for reference, how we built our logistic regression model and how we will build
the softmax regression model. The main differences are discussed right after.
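A side-by-side sketch, following the same pattern as the earlier code:

# logistic regression: 1 output node, binary crossentropy
lr_model = Sequential()
lr_model.add(Dense(units=1, input_shape=(2,), activation='sigmoid'))
lr_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# softmax regression: 3 output nodes, categorical crossentropy
sr_model = Sequential()
sr_model.add(Dense(units=3, input_shape=(2,), activation='softmax'))
sr_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])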
Number of nodes in the dense layer: LR uses 1 node, whereas SR has 3 nodes. Since we
have 3 classes it makes sense for SR to use 3 nodes. Then the question is, why
does LR use only 1 node? It has 2 classes, so it appears we should have used 2
nodes instead. The answer is that we can achieve the same result with only
1 node. As we saw above, LR models the probability of an example belonging to class
1: P(class=1). And we can calculate the class 0 probability by 1−P(class=1). But when
we have more than 2 classes, we need individual nodes for each class, because
knowing the probability of one class doesn't let us infer the probabilities of the other
classes.
Loss function: In a binary classification problem like LR, the loss function is
binary_crossentropy. In the multiclass case, the loss function is
categorical_crossentropy, the generalization of binary crossentropy to more than 2
classes. Going into the theory behind loss functions is beyond the scope of this
tutorial; for now, knowing this distinction is enough.
Fit function: LR used the vector y directly in the fit function, which has just one
column with 0/1 values. When we're doing SR, the labels need to be in one-hot
representation. In our case y_cat is a matrix with 3 columns, where all the values
are 0 except for the one that represents our class, which is 1. A sketch of the
conversion follows.
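In recent Keras versions, the one-hot conversion can be sketched as:

from keras.utils import to_categorical

y_cat = to_categorical(y)  # e.g. class 2 out of {0, 1, 2} becomes [0, 0, 1]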
It took some time to talk about all the differences between LR and SR, and it looks like
there’s a lot to digest. But again after some practice this will become a habit, and you
won’t even need to think about any of this.
After all this theory let’s take a step back and remember that LR is a linear classifier. SR
is also a linear classifier, but for multiple classes. So the “power” of the model hasn’t
changed, it’s still a linear model. We just generalized LR to apply it to a multiclass
problem.
Training the model gives us an accuracy of around 50%. The most naive method, which
always predicts class 1 no matter what the input is, would have an accuracy of 33%. The
SR model is not much of an improvement over it, which is expected because the dataset
is not linearly separable.
Looking at the decision boundary confirms that we still have a linear classifier. The lines
look jagged due to floating point rounding but in reality they’re straight.
Here’s the precision and recall corresponding to the 3 classes. And the confusion matrix
is all over the place. Clearly this is not an optimal classifier.
4.2) Deep ANN

Now let's build a deep ANN for the same multiclass problem. Note that the output layer
still has 3 nodes and uses the softmax activation. The loss function also didn't change;
it's still categorical_crossentropy. These won't change going from a linear model to a
deep ANN, since the problem definition hasn't changed. We're still working on multiclass
classification, but now using a more powerful model, and that power comes from adding
more layers to our neural net.
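A sketch of such a deep multiclass model; the hidden-layer sizes are my assumption, and only the hidden layers are new relative to SR:

model = Sequential()
model.add(Dense(units=64, input_shape=(2,), activation='tanh'))
model.add(Dense(units=32, activation='tanh'))
model.add(Dense(units=3, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(X, y_cat, verbose=0, epochs=50)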
5. Conclusion
Thanks for spending time reading this article, I know this was a rather lengthy tutorial
on Artificial Neural Networks and Keras. I wanted to be as detailed as possible, while still
keeping the article length manageable. I hope you enjoyed it.
There was a common theme in this article: we first introduced a task, then approached it
using a simple method and observed its limitations. Afterwards we used a more complex
deep model to improve on it and got much better results. I think the ordering is
important. No complex method becomes successful unless it has evolved from a simpler
model.
The entire code for this article is available here if you want to hack on it yourself. If you
have any feedback, feel free to reach out to me on Twitter.