Applied Deep Learning - Part 1: Artificial Neural Networks
by Arden Dertat | Towards Data Science

Overview
Welcome to the Applied Deep Learning tutorial series. We will do a detailed analysis of
several deep learning techniques starting with Artificial Neural Networks (ANN), in
particular Feedforward Neural Networks. What separates this tutorial from the rest you
can find online is that we’ll take a hands-on approach with plenty of code examples and visualizations. I won’t go into too much math and theory behind these models, to keep the focus on application.

We will use the Keras deep learning framework, which is a high-level API on top of TensorFlow. Keras has become very popular recently because of its simplicity: it’s very easy to build complex models and iterate rapidly. I have also used barebones TensorFlow, and actually struggled quite a bit. After trying out Keras I’m not going back.

Here’s the table of contents. First an overview of ANN and the intuition behind these
deep models. Then we will start simple with Logistic Regression, mainly to get familiar
with Keras. Then we will train deep neural nets and demonstrate how they outperform
linear models. We will compare the models on both binary and multiclass classification
datasets.

1. ANN Overview
1.1) Introduction
1.2) Intuition
1.3) Reasoning

2. Logistic Regression
2.1) Linearly Separable Data
2.2) Complex Data - Moons
2.3) Complex Data - Circles

3. Artificial Neural Networks (ANN)


3.1) Complex Data - Moons
3.2) Complex Data - Circles
3.3) Complex Data - Sine Wave

4. Multiclass Classification
4.1) Softmax Regression
4.2) Deep ANN


5. Conclusion

The code for this article is available here as a Jupyter notebook; feel free to download it and try it out yourself.

I think you’ll learn a lot from this article. You don’t need to have prior knowledge of deep
learning, only some basic familiarity with general machine learning. So let’s begin…

1. ANN Overview
1.1) Introduction
Artificial Neural Networks (ANN) are multi-layer fully-connected neural nets that look like the figure below. They consist of an input layer, multiple hidden layers, and an output layer. Every node in one layer is connected to every node in the next layer. We make the network deeper by increasing the number of hidden layers.

Figure 1

If we zoom in to one of the hidden or output nodes, what we will encounter is the figure
below.


Figure 2

A given node takes the weighted sum of its inputs, and passes it through a non-linear
activation function. This is the output of the node, which then becomes the input of
another node in the next layer. The signal flows from left to right, and the final output is
calculated by performing this procedure for all the nodes. Training this deep neural
network means learning the weights associated with all the edges.

The equation for a given node looks as follows: the weighted sum of its inputs is passed through a non-linear activation function. It can be represented as a vector dot product, where n is the number of inputs for the node.

I omitted the bias term for simplicity. Bias is an input to all the nodes and always has the value 1. It allows us to shift the result of the activation function to the left or right. It also helps the model to train when all the input features are 0. If this sounds complicated right now, you can safely ignore the bias terms. For completeness, the above equation looks as follows with the bias included.
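Written with an explicit bias term b (equivalently, an extra input fixed at 1 with its own weight):

a = f(x \cdot w + b) = f\left( \sum_{i=1}^{n} x_i w_i + b \right)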


So far we have described the forward pass, meaning given an input and weights how the output is computed. After training is complete, we only run the forward pass to make predictions. But first we need to train our model to actually learn the weights. The training procedure works as follows:

Randomly initialize the weights for all the nodes. There are smart initialization
methods which we will explore in another article.

For every training example, perform a forward pass using the current weights, and
calculate the output of each node going from left to right. The final output is the
value of the last node.

Compare the final output with the actual target in the training data, and measure the
error using a loss function.

Perform a backward pass from right to left and propagate the error to every individual node using backpropagation. Calculate each weight’s contribution to the error, and adjust the weights accordingly using gradient descent. The error gradients are propagated back starting from the last layer.
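To make the loop concrete, here is a minimal sketch (not the article’s code) of one training iteration for a single sigmoid node, with toy data and a fixed learning rate:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))            # 8 training examples, 3 features
y = (X.sum(axis=1) > 0).astype(float)  # toy binary targets
w = rng.normal(size=3)                 # randomly initialized weights
b, lr = 0.0, 0.1                       # bias and learning rate

p = sigmoid(X @ w + b)                 # forward pass
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))  # cross-entropy loss

# Backward pass: for a single sigmoid node with cross-entropy loss,
# backpropagation collapses to this closed form.
grad_w = X.T @ (p - y) / len(y)
grad_b = np.mean(p - y)

w -= lr * grad_w                       # gradient descent update
b -= lr * grad_b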

Backpropagation with gradient descent is literally the “magic” behind the deep learning
models. It’s a rather long topic and involves some calculus, so we won’t go into the
specifics in this applied deep learning series. For a detailed explanation of gradient
descent refer here. A basic overview of backpropagation is available here. For a detailed
mathematical treatment refer here and here. And for more advanced optimization
algorithms refer here.

In the standard ML world this feedforward architecture is known as the multilayer perceptron. The difference between an ANN and a perceptron is that the ANN uses a non-linear activation function such as the sigmoid, whereas the perceptron uses the step function. That non-linearity gives the ANN its great power.

1.2) Intuition

There’s a lot going on already, even with the basic forward pass. Now let’s simplify this,
and understand the intuition behind it.

Essentially what each layer of the ANN does is a non-linear transformation of the
input from one vector space to another.

Let’s use the ANN in Figure 1 above as an example. We have a 3-dimensional input
corresponding to a vector in 3D space. We then pass it through two hidden layers with 4
nodes each. And the final output is a 1D vector or a scalar.

So if we visualize this as a sequence of vector transformations, we first map the 3D input to a 4D vector space, then we perform another transformation to a new 4D space, and the final transformation reduces it to 1D. This is just a chain of matrix multiplications.
The forward pass performs these matrix dot products and applies the activation function
element-wise to the result. The figure below only shows the weight matrices being used
(not the activations).

Figure 3

The input vector x has 1 row and 3 columns. To transform it into a 4D space, we need to
multiply it with a 3x4 matrix. Then to another 4D space, we multiply with a 4x4 matrix.
And finally to reduce it to a 1D space, we use a 4x1 matrix.

Notice how the dimensions of the matrices represent the input and output dimensions of
a layer. The connection between a layer with 3 nodes and 4 nodes is a matrix
multiplication using a 3x4 matrix.

These matrices represent the weights that define the ANN. To make a prediction using
the ANN on a given input, we only need to know these weights and the activation
function (and the biases), nothing more. We train the ANN via backpropagation to
“learn” these weights.


If we put everything together it looks like the figure below.

Figure 4

A fully connected layer between 3 nodes and 4 nodes is just a matrix multiplication of the 1x3 input vector (yellow nodes) with the 3x4 weight matrix W1. The result of this dot product is a 1x4 vector, represented as the blue nodes. We then multiply this 1x4 vector with a 4x4 matrix W2, resulting in another 1x4 vector, the green nodes. And finally, using a 4x1 matrix W3, we get the output.

We have omitted the activation function in the above figures for simplicity. In reality, after every matrix multiplication we apply the activation function to each element of the resulting matrix. More formally:
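In symbols, with f applied element-wise after each multiplication (reconstructed from the description):

h_1 = f(x W_1), \qquad h_2 = f(h_1 W_2), \qquad y = f(h_2 W_3)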

Equation 2


The output of each matrix multiplication goes through the activation function f. In the case of the sigmoid function, this means taking the sigmoid of each element in the matrix. We can see the chain of matrix multiplications more clearly in the equations.
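As a quick illustration, here is that chain in NumPy (a sketch with random numbers, not the article’s code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x  = rng.normal(size=(1, 3))   # 1x3 input vector
W1 = rng.normal(size=(3, 4))   # 3D -> 4D
W2 = rng.normal(size=(4, 4))   # 4D -> 4D
W3 = rng.normal(size=(4, 1))   # 4D -> 1D

h1 = sigmoid(x @ W1)           # 1x4, the blue nodes in Figure 4
h2 = sigmoid(h1 @ W2)          # 1x4, the green nodes
y  = sigmoid(h2 @ W3)          # 1x1, the final output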

1.3) Reasoning
So far we talked about what deep models are and how they work, but why do we need to
go deep in the first place?

We saw that a layer of ANN just performs a non-linear transformation of its inputs from
one vector space to another. If we take a classification problem as an example, we want
to separate out the classes by drawing a decision boundary. The input data in its given
form is not separable. By performing non-linear transformations at each layer, we are
able to project the input to a new vector space, and draw a complex decision boundary to
separate the classes.

Let’s visualize what we just described with a concrete example. Given the following data
we can see that it isn’t linearly separable.

So we project it to a higher dimensional space by performing a non-linear transformation, and then it becomes linearly separable. The green hyperplane is the decision boundary.


This is equivalent to drawing a complex decision boundary in the original input space.

So the main benefit of having a deeper model is being able to do more non-linear transformations of the input and draw a more complex decision boundary.

To summarize, ANNs are very flexible yet powerful deep learning models. They are universal function approximators, meaning they can model any complex function. There has been an incredible surge in their popularity recently due to a few reasons: clever tricks that made training these models possible, a huge increase in computational power (especially GPUs and distributed training), and vast amounts of training data. All of these combined enabled deep learning to gain significant traction.

This was a brief introduction; there are tons of great tutorials online that cover deep neural nets. For reference, I highly recommend this paper. It’s a fantastic overview of deep learning, and Section 4 covers ANN. Another great reference is this book, which is available online.

2. Logistic Regression
Despite its name, logistic regression (LR) is a binary classification algorithm. It’s the most popular technique for 0/1 classification. On 2-dimensional (2D) data, LR will try to draw a straight line to separate the classes; that’s where the term linear model comes from. LR works with any number of dimensions though, not just two. For 3D data it’ll try to draw a 2D plane to separate the classes. This generalizes to N-dimensional data and an (N-1)-dimensional hyperplane separator. If you have a supervised binary classification problem, given input data with multiple columns and a binary 0/1 outcome, LR is the first method to try. In this section we will focus on 2D data, since it’s easier to visualize; in another tutorial we will focus on multidimensional input.

2.1) Linearly Separable Data


First let’s start with an easy example: 2D linearly separable data. We are using the scikit-learn make_classification method to generate our data, and a helper function to visualize it.
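Here is a sketch of the data generation; the exact arguments and the plotting helper are assumptions, not the article’s original code:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Two informative features, one cluster per class, fixed seed for reproducibility.
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           random_state=42)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
plt.show()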

There is a LogisticRegression classifier available in scikit-learn. I won’t go into too much detail here, since our goal is to learn how to build models with Keras. But here’s how to train an LR model, using the fit function just like any other model in scikit-learn. We see the linear decision boundary as the green line.
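Here’s a sketch of that step, assuming the X and y generated above (the decision boundary plotting is omitted):

from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression()
lr_model.fit(X, y)            # train just like any other scikit-learn model
print(lr_model.score(X, y))   # mean accuracy on the training data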


As we can see, the data is linearly separable. We will now train the same logistic regression model with Keras to predict the class membership of every input point. To keep things simple for now, we won’t perform the standard practices of separating out the data into training and test sets, or performing k-fold cross-validation.

Keras has great documentation; check it out for a more detailed description of its API. Here’s the code for training the model; let’s go over it step by step below.

We will use the Sequential model API available here. The Sequential model allows us to build deep neural networks by stacking layers one on top of another. Since we’re now building a simple logistic regression model, we will have the input nodes directly connected to the output node, without any hidden layers. Note that the LR model has the form y=f(xW), where f is the sigmoid function. Having a single output layer directly connected to the input reflects this function.


A quick clarification to disambiguate the terms being used. In the neural networks literature, it’s common to talk about input nodes and output nodes. This may sound strange at first glance: what’s an input “node” per se? When we say input nodes, we’re talking about the features of a given training example. In our case we have 2 features, the x and y coordinates of the points we plotted above, so we have 2 input nodes. You can simply think of it as a vector of 2 numbers. What about the output node then? The output of the logistic regression model is a single number, the probability of an input data point belonging to class 1; in other words, P(class=1). The probability of an input point belonging to class 0 is then P(class=0)=1−P(class=1). So you can simply think of the output node as a vector with a single number (or simply a scalar) between 0 and 1.

In Keras we don’t add layers corresponding to the input nodes; we only do so for hidden and output nodes. In our current model, we don’t have any hidden layers, and the input nodes are directly connected to the output node. This means our neural network definition in Keras will have just one layer with one node, corresponding to the output node.

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=1, input_shape=(2,), activation='sigmoid'))

The Dense function in Keras constructs a fully connected neural network layer, automatically initializing the weights and biases. It’s a super useful function that you will see being used everywhere. The function arguments are defined as follows:

units: The first argument, representing the number of nodes in this layer. Since we’re constructing the output layer, and we said it has only one node, this value is 1.

input_shape: The first layer in Keras models needs to specify the input dimensions. The subsequent layers (which we don’t have here, but we will in later sections) don’t need to specify this argument, because Keras can infer the dimensions automatically. In this case our input dimensionality is 2, the x and y coordinates. The input_shape parameter expects a vector, so in our case it’s a tuple with one number.

activation: The activation function of a logistic regression model is the logistic function, also called the sigmoid. We will explore different activation functions, where to use them, and why in another tutorial.


model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

We then compile the model with the compile function. This creates the neural network model by specifying the details of the learning process. The model hasn’t been trained yet; right now we’re just declaring the optimizer to use and the loss function to minimize. The arguments for the compile function are defined as follows:

optimizer: Which optimizer to use in order to minimize the loss function. There are a lot of different optimizers, most of them based on gradient descent. We will explore different optimizers in another tutorial. For now we will use the adam optimizer, which is the one people prefer to use by default.

loss: The loss function to minimize. Since we’re building a binary 0/1 classifier, the
loss function to minimize is binary_crossentropy. We will see other examples of loss
functions in later sections.

metrics: Which metric to report statistics on, for classification problems we set this
as accuracy.

history = model.fit(x=X, y=y, verbose=0, epochs=50)

Now comes the fun part: actually training the model using the fit function. The arguments are as follows:

x: The input data, which we defined as X above. It contains the x and y coordinates of the input points.

y: Not to be confused with the y coordinate of the input points. In all ML tutorials y refers to the labels, in our case the class we’re trying to predict: 0 or 1.

verbose: Prints out the loss and accuracy; set it to 1 to see the output.

epochs: Number of times to go over the entire training data. When training models
we pass through the training data not just once but multiple times.

plot_loss_accuracy(history)


The output of the fit method is the loss and accuracy at every epoch. We then plot it
using our custom function, and see that the loss goes down to almost 0 over time, and
the accuracy goes up to almost 1. Great! We have successfully trained our first neural
network model with Keras. I know this was a long explanation, but I wanted to explain
what we’re doing in detail the first time. Once you understand what’s going on and
practice a couple of times, all this becomes second nature.

Below is a plot of the decision boundary. The various shades of blue and red represent
the probability of a hypothetical point in that area belonging to class 1 or 0. The top left
area is classified as class 1, with the color blue. The bottom right area is classified as class
0, colored as red. And there is a transition around the decision boundary. This is a cool
way to visualize the decision boundary the model is learning.


The classification report shows the precision and recall of our model. We get close to 100% accuracy. The value shown in the report should be 0.997, but it got rounded up to 1.0.

The confusion matrix shows us how many points were correctly classified vs misclassified. The numbers on the diagonal represent the number of correctly classified points; the rest are the misclassified ones. This particular matrix is not very interesting, because the model only misclassifies 3 points. We can see one of the misclassified points in the top right part of the confusion matrix: the true value is class 0 but the predicted value is class 1.


2.2) Complex Data - Moons


The previous dataset was linearly separable, so it was trivial for our logistic regression
model to separate the classes. Here is a more complex dataset which isn’t linearly
separable. The simple logistic regression model won’t be able to clearly distinguish
between the classes. We’re using the make_moons method of scikit-learn to generate the
data.
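A sketch of the data generation; the sample count and noise level are assumptions:

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.05, random_state=42)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
plt.show()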

Let’s build another logistic regression model with the same parameters as we did before.
On this dataset we get 86% accuracy.


The current decision boundary doesn’t look as clean as the one before. The model tried
to separate out the classes from the middle, but there are a lot of misclassified points. We
need a more complex classifier with a non-linear decision boundary, and we will see an
example of that soon.


The precision of the model is 86%. It looks good on paper, but we should easily be able to get 100% with a more complex model. You can imagine a curved decision boundary that would separate out the classes, and a complex model should be able to approximate that.

The classification report and the confusion matrix look as follows.

2.3) Complex Data - Circles


Let’s look at one final example where the linear model will fail, this time using the make_circles function in scikit-learn.
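Again a sketch; the noise level and the factor controlling the gap between the circles are assumptions:

from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1000, noise=0.05, factor=0.3, random_state=42)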


Building the model with the same parameters.

The decision boundary again passes from the middle of the data, but now we have much
more misclassified points.


The accuracy is around 50%, shown below. No matter where the model draws the line, it
will misclassify half of the points, due to the nature of the dataset.

The confusion matrix we see here is an example of one belonging to a poor classifier. Ideally we prefer confusion matrices to look like the ones we saw above: high numbers along the diagonal, meaning that the classifier was right, and low numbers everywhere else, where the classifier was wrong. In our visualization, the color blue represents large numbers and yellow represents smaller ones. So we would prefer to see blue on the diagonal and yellow everywhere else. Blue everywhere is a bad sign, meaning that our classifier is confused.


The most naive method, which always predicts 1 no matter what the input is, would get 50% accuracy. Our model also got 50% accuracy, so it’s not useful at all.

3. Artificial Neural Networks (ANN)


Now we will train a deep Artificial Neural Network (ANN) to better classify the datasets which the logistic regression model struggled with: Moons and Circles. We will also classify an even harder Sine Wave dataset to demonstrate that ANNs can form really complex decision boundaries.

3.1) Complex Data - Moons


While building Keras models for logistic regression above, we performed the following
steps:

Step 1: Define a Sequential model.

Step 2: Add a Dense layer with sigmoid activation function. This was the only layer
we needed.

Step 3: Compile the model with an optimizer and loss function.

Step 4: Fit the model to the dataset.

Step 5: Analyze the results: plotting loss/accuracy curves, plotting the decision
boundary, looking at the classification report, and understanding the confusion
matrix.

While building a deep neural network, we only need to change Step 2, such that we add several Dense layers one after another. The output of one layer becomes the input of the next. Keras again does most of the heavy lifting by initializing the weights and biases, and connecting the output of one layer to the input of the next. We only need to specify how many nodes we want in a given layer, and the activation function. It’s as simple as that.


We first add a layer with 4 nodes and the tanh activation function. Tanh is a commonly used activation function, and we’ll learn more about it in another tutorial. We then add another layer with 2 nodes, again using tanh activation. We finally add the last layer with 1 node and sigmoid activation. This is the same final layer that we used in the logistic regression model.
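A sketch of this model, following the same Keras calls as the logistic regression example (the epoch count is an assumption):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=4, input_shape=(2,), activation='tanh'))  # hidden layer 1
model.add(Dense(units=2, activation='tanh'))                    # hidden layer 2
model.add(Dense(units=1, activation='sigmoid'))                 # output layer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(x=X, y=y, verbose=0, epochs=100)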

This is not a very deep ANN; it only has 3 layers: 2 hidden layers and the output layer. But notice a couple of patterns:

The output layer still uses the sigmoid activation function, since we’re working on a binary classification problem.

Hidden layers use the tanh activation function. If we added more hidden layers, they
would also use tanh activation. We have a couple of options for activation functions:
sigmoid, tanh, relu, and variants of relu. In another article we’ll explore the pros and
cons of each one. We will also demonstrate why using sigmoid activation in hidden
layers is a bad idea. For now it’s safe to use tanh.

We have fewer nodes in each subsequent layer. It’s common to have fewer nodes as we stack layers on top of one another, giving a sort of triangular shape.

We didn’t build a very deep ANN here because it wasn’t necessary. We already achieve
100% accuracy with this configuration.


The ANN is able to come up with a perfect separator to distinguish the classes.

100% precision, nothing misclassified.


3.2) Complex Data - Circles


Now let’s look at the Circles dataset, where the LR model achieved only 50% accuracy. The model is the same as above; we only change the input to the fit function, using the current dataset. And we again achieve 100% accuracy.

Similarly the decision boundary looks just like the one we would draw by hand ourselves.
The ANN was able to figure out an optimal separator.


Just like above we get 100% accuracy.

3.3) Complex Data - Sine Wave


Let’s try to classify one final toy dataset. In the previous sections, the classes were separable by one continuous decision boundary. The boundary had a complex shape, it wasn’t linear, but one continuous decision boundary was still enough. ANNs can draw an arbitrary number of complex decision boundaries, and we will demonstrate that.

Let’s create a sinusoidal dataset looking like the sine function, with every up and down belonging to an alternating class. As we can see in the figure, a single decision boundary won’t be able to separate out the classes. We will need a series of non-linear separators.

Now we need a more complex model for accurate classification. So we have 3 hidden layers and an output layer. The number of nodes per layer has also been increased to improve the learning capacity of the model. Choosing the right number of hidden layers and nodes per layer is more of an art than a science, usually decided by trial and error.
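A sketch of such a model; the layer sizes below are illustrative guesses, not the article’s exact configuration:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=64, input_shape=(2,), activation='tanh'))  # hidden layer 1
model.add(Dense(units=32, activation='tanh'))                    # hidden layer 2
model.add(Dense(units=16, activation='tanh'))                    # hidden layer 3
model.add(Dense(units=1, activation='sigmoid'))                  # output layer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])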


The ANN was able to model a pretty complex set of decision boundaries.


Precision is 99%; we only have 14 misclassified points out of 2400. Pretty good.

4. Multiclass Classification
In the previous sections we worked on binary classification. Now we will take a look at a
multi-class classification problem, where the number of classes is more than 2. We will
pick 3 classes for demonstration, but our approach generalizes to any number of classes.

Here’s what our dataset looks like: spiral data with 3 classes, generated using the make_multiclass method in scikit-learn.


4.1) Softmax Regression


As we saw above, Logistic Regression (LR) is a classification method for 2 classes. It
works with binary labels 0/1. Softmax Regression (SR) is a generalization of LR where
we can have more than 2 classes. In our current dataset we have 3 classes, represented as
0/1/2.

Building the model for SR is very similar to LR; for reference, here’s how we built our Logistic Regression model.

And here’s how we will build the Softmax Regression model.
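A sketch reconstructed from the differences listed below, assuming X and y hold the spiral data above (the epoch count is an assumption):

from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

y_cat = to_categorical(y)    # one-hot labels: a matrix with 3 columns

model = Sequential()
model.add(Dense(units=3, input_shape=(2,), activation='softmax'))  # one node per class
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x=X, y=y_cat, verbose=0, epochs=50)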


There are a couple of differences, let’s go over them one by one:

Number of nodes in the dense layer: LR uses 1 node, whereas SR has 3 nodes. Since we have 3 classes it makes sense for SR to use 3 nodes. Then the question is, why does LR use only 1 node; it has 2 classes, so it appears we should have used 2 nodes instead. The answer is that we can achieve the same result using only 1 node. As we saw above, LR models the probability of an example belonging to class 1: P(class=1). And we can calculate the class 0 probability by 1−P(class=1). But when we have more than 2 classes, we need individual nodes for each class, because knowing the probability of one class doesn’t let us infer the probabilities of the other classes.

Activation function: LR uses the sigmoid activation function, SR uses softmax (the standard formula is given after this list). Softmax scales the values of the output nodes such that they represent probabilities and sum up to 1. So in our case P(class=0)+P(class=1)+P(class=2)=1. It doesn’t do this naively by dividing individual values by their sum though; it uses the exponential function, so higher values get emphasized more and lower values get squashed more. We will talk in detail about what softmax does in another tutorial. For now you can simply think of it as a normalization function which lets us interpret the output values as probabilities.

Loss function: In a binary classification problem like LR, the loss function is
binary_crossentropy. In the multiclass case, the loss function is
categorical_crossentropy. Categorical crossentropy is the generalization of binary
crossentropy to more than 2 classes. Going into the theory behind loss functions is
beyond the scope of this tutorial. But for now only knowing this property is enough.

Fit function: LR used the vector y directly in the fit function; y has just one column with 0/1 values. When we’re doing SR, the labels need to be in one-hot representation. In our case y_cat is a matrix with 3 columns, where all the values are 0 except for the one that represents our class, which is 1.
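For reference, the softmax mentioned above has the standard form (for K classes, here K=3):

\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}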

It took some time to talk about all the differences between LR and SR, and it looks like
there’s a lot to digest. But again after some practice this will become a habit, and you
won’t even need to think about any of this.


After all this theory let’s take a step back and remember that LR is a linear classifier. SR
is also a linear classifier, but for multiple classes. So the “power” of the model hasn’t
changed, it’s still a linear model. We just generalized LR to apply it to a multiclass
problem.

Training the model gives us an accuracy of around 50%. The most naive method, which always predicts class 1 no matter what the input is, would have an accuracy of 33%. The SR model is not much of an improvement over it, which is expected because the dataset is not linearly separable.

Looking at the decision boundary confirms that we still have a linear classifier. The lines
look jagged due to floating point rounding but in reality they’re straight.


Here’s the precision and recall corresponding to the 3 classes. And the confusion matrix
is all over the place. Clearly this is not an optimal classifier.

4.2) Deep ANN


Now let’s build a deep ANN for multiclass classification. Remember that the changes going from LR to ANN were minimal. We only needed to add more Dense layers. We will do the same again, adding a couple of Dense layers with the tanh activation function, as sketched below.
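A sketch, with X and y_cat as in the softmax example; the hidden-layer sizes are assumptions, while the output layer and the loss follow the text below:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=32, input_shape=(2,), activation='tanh'))  # hidden layer 1
model.add(Dense(units=16, activation='tanh'))                    # hidden layer 2
model.add(Dense(units=3, activation='softmax'))                  # 3 nodes, one per class
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x=X, y=y_cat, verbose=0, epochs=50)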


Note that the output layer still has 3 nodes, and uses the softmax activation. The loss
function also didn’t change, still categorical_crossentropy. These won’t change going
from a linear model to a deep ANN, since the problem definition hasn’t changed. We’re
still working on multiclass classification. But now using a more powerful model, and that
power comes from adding more layers to our neural net.

We achieve 99% accuracy in just a couple of epochs.

The decision boundary is non-linear.


We got almost 100% accuracy; we misclassified only 5 points out of 1500.

5. Conclusion
Thanks for spending time reading this article, I know this was a rather lengthy tutorial
on Artificial Neural Networks and Keras. I wanted to be as detailed as possible, while still
keeping the article length manageable. I hope you enjoyed it.


There was a common theme in this article: we first introduced the task, then approached it using a simple method and observed its limitations. Afterwards we used a more complex deep model to improve on it and got much better results. I think the ordering is important. No complex method becomes successful unless it has evolved from a simpler model.

The entire code for this article is available here if you want to hack on it yourself. If you have any feedback, feel free to reach out to me on Twitter.
