
Experiment No. 1
Implement the multilayer perceptron algorithm for MNIST handwritten digit classification.

The human visual system is a marvel of nature: people can effortlessly recognize digits. But the task is not as simple as it looks. The human brain has billions of neurons and trillions of connections between them, which makes this exceptionally complex image-processing task seem easy. For computers, however, recognizing digits is challenging. Simple intuitions about how we recognize digits are difficult to express algorithmically, and there is significant variation in handwriting from person to person, which makes the problem immensely complex.
A handwritten digit recognition system trains a machine to recognize digits from different sources such as emails, bank cheques, papers, and images.
Google Colab
Google Colab has been used to implement the network. It is a free cloud service that can be used to develop deep learning applications using popular libraries such as Keras, TensorFlow, PyTorch, and OpenCV. The feature that most distinguishes Colab from other free cloud services is that it provides a GPU at no cost. If your PC does not meet the hardware requirements or lacks a GPU, Colab is the best option, because a stable internet connection is the only requirement.

The MNIST Dataset
MNIST stands for “Modified National Institute of Standards and Technology”. It is a dataset of 70,000 handwritten digit images. Each image is 28x28 pixels, i.e. 784 features, where each feature represents a single pixel’s intensity from 0 (white) to 255 (black). The database is divided into 60,000 training and 10,000 testing images.

Phases of Implementation

Import the libraries


First, we imported all the libraries that we are going to use.
We imported TensorFlow, a free open-source library used for machine learning applications such as neural networks. From the matplotlib library, which is used for visualization, we imported the pyplot module for plotting. Finally, we imported NumPy (Numerical Python), which is used to perform various mathematical operations.
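The import step might look like the following sketch (the aliases tf, plt, and np are conventional choices, not shown in the original):

import tensorflow as tf                # machine learning framework
from matplotlib import pyplot as plt   # plotting / visualization
import numpy as np                     # numerical operations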

Load the dataset


The Keras library already contains several datasets, such as CIFAR10, CIFAR100, the Boston Housing price regression dataset, and the IMDB movie review sentiment classification dataset.

The MNIST dataset is also part of it. So, we imported it from keras.datasets and loaded it into the variable “objects”. The objects.load_data() method returns the training data (train_img), its labels (train_lab), the testing data (test_img), and its labels (test_lab). Of the 70,000 images in the dataset, 60,000 are used for training and 10,000 for testing.
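A minimal sketch of this loading step, using the variable names described above:

# load MNIST; "objects" holds the dataset module from keras.datasets
objects = tf.keras.datasets.mnist
(train_img, train_lab), (test_img, test_lab) = objects.load_data()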
Before preprocessing the data, we first displayed the first 20 images of the training set with the help of a for loop.

subplot() is used to add a subplot, or grid-like structure, to the current figure. The first argument is the number of rows, the second the number of columns, and the third the position index in the grid. For example, to plot 10 images in a 4x5 grid starting from the second position in the grid, the call for the i-th image would be subplot(4, 5, i + 2).
 imshow() is used to display data as an image, here a training image (train_img[i]). The optional cmap argument stands for the colour map: if the image is an array of shape (M, N), cmap controls how the values are mapped to colours. cmap=‘gray’ displays the image in grayscale, while cmap=‘gray_r’ displays it in inverse grayscale.
 title() sets the title for each image. We have set “Digit: train_lab[i]” as the title for each image in the subplot.
 subplots_adjust() is used for tuning the subplot layout. To change the space between two rows, we have used hspace; to change the space between two columns, use wspace. By default, the subplot layout parameters are roughly left=0.125, right=0.9, bottom=0.1, top=0.9, wspace=0.2, hspace=0.2 (exact values depend on the matplotlib version).

To hide the axes of the images, plt.axis(‘off’) has been used.
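Putting these calls together, the display loop might look like the following sketch (a 4x5 grid holding all 20 images):

# display the first 20 training images with their labels
for i in range(20):
    plt.subplot(4, 5, i + 1)                     # rows, columns, position index
    plt.imshow(train_img[i], cmap='gray_r')      # inverse grayscale
    plt.title('Digit: {}'.format(train_lab[i]))  # label as title
    plt.subplots_adjust(hspace=0.5)              # widen the gap between rows
    plt.axis('off')                              # hide the axes
plt.show()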
After that, we displayed the shapes of the training and testing sets.

(60000, 28, 28) means there are 60,000 images in the training set, each of size 28x28 pixels. Similarly, there are 10,000 images of the same size in the testing set. So each image has 28x28 = 784 features, and each feature represents the intensity of one pixel, from 0 to 255.
You can use print(train_img[0]) to print the first training image as a 28x28 matrix.

We also plotted the pixel intensities of the first training image as a histogram. Before normalization, the values span the full range 0-255.

hist() is used to plot the histogram for the first training image, train_img[0], after reshaping it into a 1-D array of size 784. facecolor is an optional parameter that specifies the colour of the histogram. The title, Y-axis, and X-axis have been named “Pixel vs its intensity”, “PIXEL”, and “Intensity” respectively.
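A sketch of this plot (the facecolor 'blue' is an arbitrary choice here):

# histogram of pixel intensities for the first training image
plt.hist(train_img[0].reshape(784), facecolor='blue')
plt.title('Pixel vs its intensity')
plt.ylabel('PIXEL')
plt.xlabel('Intensity')
plt.show()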
Pre-process the data
Before feeding the data to the network, we normalize it. Normalizing the input data helps to speed up training. It also reduces the chance of getting stuck in local optima, since we are using stochastic gradient descent to find the optimal weights for the network.
The pixel values are between 0 and 255. Scaling the input values is good practice when using neural network models, and since the scale here is well known and well behaved, we can quickly normalize the pixel values to the range 0 to 1 by dividing each value by the maximum intensity of 255.
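The normalization itself is a single division per set, for example:

# scale pixel values from the range [0, 255] to [0, 1]
train_img = train_img / 255.0
test_img = test_img / 255.0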

After normalization, the same histogram shows all pixel intensities lying between 0 and 1.
Creating the model

There are 3 ways to create a model in Keras:

 The Sequential model is very straightforward and simple. It allows building a model layer by layer.
 The Functional API is an easy-to-use, fully-featured API that supports arbitrary model architectures. This is the Keras “industry-strength” model.
 Model sub-classing, where you implement everything from scratch on your own.
Here, we have used the Sequential model. This model has one input layer, one output layer, and two hidden layers.
Sequential() creates a model whose layers are stacked in sequence.
.add() adds a layer to the model.
In the first (input) layer, we feed the image as input. Since each image is of size 28x28, we have used Flatten() to flatten it into a 1-D vector of 784 values.
We have used Dense() in the other layers. A Dense layer connects each neuron in the previous layer to every neuron in the next layer.
The model is a simple neural network with two hidden layers of 512 neurons each. A rectified linear unit (ReLU) activation function is used for the neurons in the hidden layers. A nice property of ReLU is that its gradient is exactly 1 for every positive input, so the error signal can pass through the network during back-propagation without shrinking.

The output layer has 10 neurons, one for each class from 0 to 9. A softmax activation function is used on the output layer to turn the outputs into probability-like values.
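A minimal sketch of the model described above:

# Sequential model: flattened input, two hidden layers, softmax output
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=(28, 28)))    # 28x28 -> 784 inputs
model.add(tf.keras.layers.Dense(512, activation='relu'))    # hidden layer 1
model.add(tf.keras.layers.Dense(512, activation='relu'))    # hidden layer 2
model.add(tf.keras.layers.Dense(10, activation='softmax'))  # one neuron per digit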

Note: You can add more neurons in the hidden layers, or increase the number of hidden layers in the model, to try to improve accuracy. However, it will take more time during training.

Compiling the network

Next, we need to compile our model. Compiling the model takes three parameters: optimizer, loss, and metrics. The optimizer controls how the weights are updated. We are using ‘adam’ as our optimizer; it is generally a good optimizer for many cases and adjusts the learning rate throughout training.
We will use ‘sparse_categorical_crossentropy’ as our loss function because it saves memory as well as computation: it uses a single integer for a class rather than a whole one-hot vector. A lower score indicates that the model is performing better.
To track accuracy, we will use the ‘accuracy’ metric to see the accuracy score on the validation set when we train the model.
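As a sketch, the compile step looks like:

# optimizer, loss and metric as described above
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])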

Train the model

We will train the model with the fit() function. Its parameters are the training data (train_img), the training labels (train_lab), and the number of epochs. The number of epochs is the number of times the model will cycle through the data. The more epochs we run, the more the model will improve, up to a certain point; after that point, the model stops improving with each epoch.

We will save the model as project.h5.
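A sketch of the training and saving steps (the epoch count of 10 is an assumption, not taken from the original):

# train the network, then save it for later reuse
model.fit(train_img, train_lab, epochs=10)  # assumption: 10 epochs
model.save('project.h5')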

Evaluate the model

The model.evaluate() method computes the loss and any metrics defined when compiling the model. In our case, the accuracy is computed on the 10,000 testing examples using the network weights of the saved model.

verbose can be 0, 1, or 2; by default it is 1.
verbose = 0: silent.
verbose = 1: progress bar plus one line per epoch.
verbose = 2: one line per epoch, i.e. epoch no./total no. of epochs.

After evaluating the model, we check it on the testing set.
model.predict() is used to do prediction on the testing set.
np.argmax() returns the indices of the maximum values along an axis, which gives the predicted digit.
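A sketch of the evaluation and prediction steps:

# compute loss and accuracy on the 10,000 test images
test_loss, test_acc = model.evaluate(test_img, test_lab, verbose=1)
# predict class probabilities, then take the most likely digit
predictions = model.predict(test_img)
predicted_digit = np.argmax(predictions[0])  # prediction for the first test image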


Image from Google Images

Now, to make a prediction for a new image that is not part of the MNIST dataset, we first create a function named “load_image”, which converts the image into an array of pixels that is fed to the model as input.
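The original function is not reproduced here; a plausible sketch, assuming the model expects a normalized 28x28 grayscale input, is:

from tensorflow.keras.preprocessing.image import load_img, img_to_array

def load_image(filename):
    # load the image as 28x28 grayscale
    img = load_img(filename, color_mode='grayscale', target_size=(28, 28))
    # convert to an array of pixels with a batch dimension
    img = img_to_array(img).reshape(1, 28, 28)
    # scale to the same [0, 1] range used during training
    return img / 255.0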
In order to upload a file from local drive, we used the code:

from google.colab import files

uploaded = files.upload()

It will prompt you to select a file. Click on “Choose Files”, then select and upload the file, and wait for the upload to reach 100%. You will see the name of the file once Colab has uploaded it.

To display the image file, we used the code:

from IPython.display import Image
Image('5img.jpeg', width=250, height=250)

5img.jpeg is the file name.
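Prediction on the uploaded file then follows the same pattern as before (a sketch, assuming the load_image version above):

# convert the uploaded image and predict its digit
img = load_image('5img.jpeg')
print('Predicted digit:', np.argmax(model.predict(img)))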


As you can see, we have successfully predicted the value as 5.

Now, if we want to run the model again after a few days, we would have to run the whole code again, which is time-consuming. In that case, you can use the saved model, project.h5: before closing the Colab notebook, download it via the folder icon in the sidebar.

When you want to run the model again, all you have to do is upload the project.h5 file from your computer using the code:

from google.colab import files

uploaded = files.upload()
When the file is 100% uploaded, use the following code; after that, you can predict the digit for new images without running the whole notebook again.

model = tf.keras.models.load_model('project.h5')

Link for reference: https://colab.research.google.com/drive/10LzhqSlJx4bnCNT6C8llhuXTDuh_WQPG?usp=sharing

Experiment 2:

Design a neural network for classifying movie reviews (Binary Classification) using the IMDB dataset.

Case Study 2: IMDB – Binary Classification of Movie Reviews


In this case study, our objective is to classify movie reviews as positive or negative. This is classic binary classification, which aims to predict one of two classes (positive vs. negative). To predict whether a review is positive or negative, we will use the text of the movie review.
Throughout this case study you will learn a few new concepts:

 Vectorizing text with one-hot encoding


 Regularization with:
o Learning rate
o Model capacity
o Weight decay
o Dropout

Package requirements
library(keras) # for deep learning
library(tidyverse) # for dplyr, ggplot2, etc.
library(testthat) # unit testing
library(glue) # easy print statements

The IMDB dataset


Our data consists of 50,000 movie reviews from IMDB. This data has been curated and supplied to us via
keras; however, tomorrow we will go through the process of preprocessing the original data on our own.
First, let’s grab our data and unpack them into training vs test and features vs labels.
imdb <- dataset_imdb(num_words = 10000)
c(c(reviews_train, y_train), c(reviews_test, y_test)) %<-% imdb

length(reviews_train) # 25K reviews in our training data


[1] 25000

length(reviews_test) # 25K reviews in our test data
[1] 25000

Understanding our data


The reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so the integer “14” would encode the 14th most frequent word in the data. However, since the integers 0, 1, and 2 are reserved for padding, the start of a sequence, and unknown words, the integer “14” actually represents the 14 − 3 = 11th most frequent word.


reviews_train[[1]]
[1] 1 14 22 16 43 530 973 1622 1385 65 458 4468 66 3941 4 173
[17] 36 256 5 25 100 43 838 112 50 670 2 9 35 480 284 5
[33] 150 4 172 112 167 2 336 385 39 4 172 4536 1111 17 546 38
[49] 13 447 4 192 50 16 6 147 2025 19 14 22 4 1920 4613 469
[65] 4 22 71 87 12 16 43 530 38 76 15 13 1247 4 22 17
[81] 515 17 12 16 626 18 2 5 62 386 12 8 316 8 106 5
[97] 4 2223 5244 16 480 66 3785 33 4 130 12 16 38 619 5 25
[113] 124 51 36 135 48 25 1415 33 6 22 12 215 28 77 52 5
[129] 14 407 16 82 2 8 4 107 117 5952 15 256 4 2 7 3766
[145] 5 723 36 71 43 530 476 26 400 317 46 7 4 2 1029 13
[161] 104 88 4 381 15 297 98 32 2071 56 26 141 6 194 7486 18
[177] 4 226 22 21 134 476 26 480 5 144 30 5535 18 51 36 28
[193] 224 92 25 104 4 226 65 16 38 1334 88 12 16 283 5 16
[209] 4472 113 103 32 15 16 5345 19 178 32

We can map the integer values back to the original word index (dataset_imdb_word_index()). The integer
number corresponds to the position in the word count list and the name of the vector is the actual word.
word_index <- dataset_imdb_word_index() %>%
unlist() %>%
sort() %>%
names()

# The indices are offset by 3 since 0, 1, and 2 are reserved for "padding",
# "start of sequence", and "unknown"
reviews_train[[1]] %>%
map_chr(~ ifelse(.x >= 3, word_index[.x - 3], "<UNK>")) %>%
cat()
<UNK> this film was just brilliant casting location scenery story direction everyone's really suited the
part they played and you could just imagine being there robert <UNK> is an amazing actor and now the
same being director <UNK> father came from the same scottish island as myself so i loved the fact there
was a real connection with this film the witty remarks throughout the film were great it was just brilliant
so much that i bought the film as soon as it was released for <UNK> and would recommend it to
everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know
what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two
little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out
of the <UNK> list i think because the stars that play them all grown up are such a big profile for the
whole film but these children are amazing and should be praised for what they have done don't you think
the whole story was so lovely because it was true and was someone's life after all that was shared with us
all

Our response variable is just a vector of 1s (positive reviews) and 0s (negative reviews).
str(y_train)
int [1:25000] 1 0 0 1 0 0 1 0 1 0 ...

# our labels are equally balanced between positive (1s) and negative (0s)
# reviews
table(y_train)
y_train
0 1
12500 12500

Preparing the features


All inputs and response values in a neural network must be tensors of either floating-point or integer data.
Moreover, our feature values should not be relatively large compared to the randomized initial
weights and all our features should take values in roughly the same range.
Consequently, we need to vectorize our data into a format conducive to neural networks. For this data set, we’ll transform our list of movie reviews into a 2D tensor of 0s and 1s representing whether each word was used (aka one-hot encoding).
# number of unique words will be the number of features
n_features <- c(reviews_train, reviews_test) %>%
unlist() %>%
max()

# function to create 2D tensor (aka matrix)


vectorize_sequences <- function(sequences, dimension = n_features) {
# Create a matrix of 0s
results <- matrix(0, nrow = length(sequences), ncol = dimension)

# Populate the matrix with 1s


for (i in seq_along(sequences))
results[i, sequences[[i]]] <- 1
results
}

# apply to training and test data


x_train <- vectorize_sequences(reviews_train)
x_test <- vectorize_sequences(reviews_test)

# unit testing to make sure certain attributes hold


expect_equal(ncol(x_train), n_features)
expect_equal(nrow(x_train), length(reviews_train))
expect_equal(nrow(x_test), length(reviews_test))

Our transformed feature set is now just a matrix (2D tensor) with 25K rows and just under 10K columns (features).
dim(x_train)
[1] 25000 9999

Let’s check out the first 10 rows and columns:


x_train[1:10, 1:10]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 1 0 1 1 1 1 1 1 0
[2,] 1 1 0 1 1 1 1 1 1 0
[3,] 1 1 0 1 0 1 1 1 1 0
[4,] 1 1 0 1 1 1 1 1 1 1
[5,] 1 1 0 1 1 1 1 1 0 1
[6,] 1 1 0 1 0 0 0 1 0 1
[7,] 1 1 0 1 1 1 1 1 1 0
[8,] 1 1 0 1 0 1 1 1 1 1
[9,] 1 1 0 1 1 1 1 1 1 0
[10,] 1 1 0 1 1 1 1 1 1 1

Preparing the labels


In contrast to MNIST, the labels of a binary classification will just be one of two values, 0 (negative) or 1
(positive). We do not need to do any further preprocessing.
str(y_train)
int [1:25000] 1 0 0 1 0 0 1 0 1 0 ...

Initial model
Since we are performing binary classification, our output activation function will be the sigmoid activation function. Recall that the sigmoid activation is used to predict the probability of the output being positive; this constrains our output to values ranging from 0 to 1.
network <- keras_model_sequential() %>%
layer_dense(units = 16, activation = "relu", input_shape = n_features) %>%
layer_dense(units = 16, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")

summary(network)
Model: "sequential"
_____________________________________________________________________________________
____
Layer (type) Output Shape Param #
===========================================================================
==============
dense (Dense) (None, 16) 160000
_____________________________________________________________________________________
____
dense_1 (Dense) (None, 16) 272
_____________________________________________________________________________________
____
dense_2 (Dense) (None, 1) 17
===========================================================================
==============
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_____________________________________________________________________________________
____

We’re going to use binary crossentropy since we only have two possible classes.
network %>% compile(
optimizer = "rmsprop",
loss = "binary_crossentropy",
metrics = "accuracy"
)
Now let’s train our network for 20 epochs with a batch size of 512 because, as you’ll find out, this model overfits very quickly (remember, large batch sizes compute more accurate gradient estimates, which traverse the loss surface more slowly).
history <- network %>% fit(
x_train,
y_train,
epochs = 20,
batch_size = 512,
validation_split = 0.2
)

Check out our initial results:


best_epoch <- which.min(history$metrics$val_loss)
best_loss <- history$metrics$val_loss[best_epoch] %>% round(3)
best_acc <- history$metrics$val_accuracy[best_epoch] %>% round(3)

glue("Our optimal loss is {best_loss} with an accuracy of {best_acc*100}%")


Our optimal loss is 0.27 with an accuracy of 89.6%

In the previous module, we had the problem of underfitting; however, looking at the learning curve for this model, it’s obvious that we have an overfitting problem.
plot(history)

YOUR TURN (3 min)


Using what you learned in the last module, make modifications to this model such as:

1. Increasing or decreasing number of units and layers


2. Adjusting the learning rate
3. Adjusting the batch size
4. Adding callbacks (i.e. early stopping, learning rate adjuster)

network <- keras_model_sequential() %>%
layer_dense(units = ____, activation = "relu", input_shape = n_features) %>%
layer_dense(units = ____, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")

network %>% compile(


optimizer = ____,
loss = "binary_crossentropy",
metrics = c("accuracy")
)

history <- network %>% fit(


x_train,
y_train,
epochs = 20,
batch_size = ____,
validation_split = 0.2
)

Regardless of what you tried above, you likely had results that consistently overfit. Our quest is to see if
we can control this overfitting. Often, when we control the overfitting we improve model performance
and generalizability. To reduce overfitting we are going to look at a few common ways to regularize our
model.

Regularizing how quickly the model learns


Recall that the learning rate decides how large a step we take along the gradient of the loss. When the loss curve has a sharp U shape, this can indicate that your learning rate is too large.
The default learning rate for RMSprop is 0.001 (?optimizer_rmsprop()). Reducing the learning rate will allow us to traverse the gradient more cautiously. Although the learning rate is not traditionally considered a “regularization” hyperparameter, it should be the first hyperparameter you assess.
Best practice:

 When tuning the learning rate, we often try factors of 10^-s where s ranges between 1 and 6 (0.1, 0.01, …, 0.000001).
 Add callback_reduce_lr_on_plateau() to automatically adjust the learning rate during training.
 As you reduce the learning rate, reduce the batch size:
o Adds a stochastic nature that reduces the chance of getting stuck in a local minimum
o Speeds up training (small learning rate + large batch size = SLOW!)

network <- keras_model_sequential() %>%
layer_dense(units = 16, activation = "relu", input_shape = n_features) %>%
layer_dense(units = 16, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")

network %>% compile(


optimizer = optimizer_rmsprop(lr = 0.0001), # regularization parameter
loss = "binary_crossentropy",
metrics = c("accuracy")
)

history <- network %>% fit(


x_train,
y_train,
epochs = 25,
batch_size = 128,
validation_split = 0.2,
callbacks = list(
callback_reduce_lr_on_plateau(patience = 3), # regularization parameter
callback_early_stopping(patience = 7)
)
)

Our results show a decrease in overfitting and an improvement in our loss score and (possibly) accuracy.
best_epoch <- which.min(history$metrics$val_loss)
best_loss <- history$metrics$val_loss[best_epoch] %>% round(3)
best_acc <- history$metrics$val_accuracy[best_epoch] %>% round(3)

glue("Our optimal loss is {best_loss} with an accuracy of {best_acc}")


Our optimal loss is 0.265 with an accuracy of 0.894

plot(history) +
scale_x_continuous(limits = c(0, length(history$metrics$val_loss)))

Regularizing model capacity


In the last module, we discussed how we could add model capacity by increasing the number of units in
each hidden layer and/or the number of layers to reduce underfitting. We can also reduce these parameters
to regularize model capacity.
In the last module, we changed model capacity manually. Here, we’ll use a custom function and
a for loop to automate this process.

Variant 1: Larger or smaller layers?


Here, we’ll use a larger range of neurons (from 2^2 = 4 to 2^8 = 256) in each hidden layer.
To do this, we’ll define a function dl_model that allows us to define and compile our DL network with the specified number of neurons based on 2^powerto. This function returns a data frame with the training and validation loss and accuracy for each epoch and number of neurons:
dl_model <- function(powerto = 6) {

network <- keras_model_sequential() %>%


layer_dense(units = 2^powerto, activation = "relu", # regularizing param
input_shape = n_features) %>%
layer_dense(units = 2^powerto, activation = "relu") %>% # regularizing param
layer_dense(units = 1, activation = "sigmoid")

network %>% compile(


optimizer = "rmsprop",
loss = "binary_crossentropy",
metrics = c("accuracy")
)

history <- network %>%


fit(
x_train,
y_train,
epochs = 20,
batch_size = 512,
validation_split = 0.2,
verbose = FALSE,
callbacks = callback_early_stopping(patience = 5)
)

output <- as.data.frame(history) %>%


mutate(neurons = 2^powerto)

return(output)
}

Let’s also define a helper function that simply pulls out the minimum loss score from the above output
(this is not necessary, just informational):
get_min_loss <- function(output) {
output %>%
filter(data == "validation", metric == "loss") %>%
summarize(min_loss = min(value, na.rm = TRUE)) %>%
pull(min_loss) %>%
round(3)
}

Now we can iterate over 2^2 = 4 to 2^8 = 256 neurons in each layer:


# so that we can store results
results <- data.frame()
powerto_range <- 2:8

for (i in powerto_range) {
cat("Running model with", 2^i, "neurons per hidden layer: ")
m <- dl_model(i)
results <- rbind(results, m)
loss <- get_min_loss(m)
cat(loss, "\n", append = TRUE)
}
Running model with 4 neurons per hidden layer: 0.271
Running model with 8 neurons per hidden layer: 0.271
Running model with 16 neurons per hidden layer: 0.282
Running model with 32 neurons per hidden layer: 0.301
Running model with 64 neurons per hidden layer: 0.268
Running model with 128 neurons per hidden layer: 0.293
Running model with 256 neurons per hidden layer: 0.277

The above results indicate that we may actually improve our optimal loss score as we constrain the size of our hidden layers. The plot below shows that we definitely reduce overfitting.
min_loss <- results %>%
filter(metric == "loss" & data == "validation") %>%
summarize(min_loss = min(value, na.rm = TRUE)) %>%
pull()

results %>%
filter(metric == "loss") %>%
ggplot(aes(epoch, value, color = data)) +
geom_line() +
geom_hline(yintercept = min_loss, lty = "dashed") +
facet_wrap(~ neurons) +
theme_bw()

Variant 2: More or fewer layers?


We can perform a similar approach to assess the impact that the number of layers has on model
performance. The following modifies our dl_model so that we can dynamically alter the number of layers
and neurons.
dl_model <- function(nlayers = 2, powerto = 4) {
# Create a model with a single hidden input layer
network <- keras_model_sequential() %>%
layer_dense(units = 2^powerto, activation = "relu", input_shape = n_features)

# regularizing parameter --> Add additional hidden layers based on input


if (nlayers > 1) {
for (i in seq_len(nlayers - 1)) { # seq_len(), not seq_along(): adds nlayers - 1 extra layers
network %>% layer_dense(units = 2^powerto, activation = "relu")
}
}

# Add final output layer


network %>% layer_dense(units = 1, activation = "sigmoid")

# Add compile step


network %>% compile(
optimizer = "rmsprop",
loss = "binary_crossentropy",
metrics = c("accuracy")
)

# Train model
history <- network %>%
fit(
x_train,
y_train,
epochs = 25,
batch_size = 512,
validation_split = 0.2,
verbose = FALSE,
callbacks = callback_early_stopping(patience = 5)
)

# Create formatted output for downstream plotting & analysis


output <- as.data.frame(history) %>%
mutate(nlayers = nlayers, neurons = 2^powerto)

return(output)
}

Now we can iterate over a range of layers and neurons in each layer to assess the impact on performance. To save time, we’ll use hidden layers with 16 nodes (powerto = 4) and just assess the impact of adding more layers:
# so that we can store results
results <- data.frame()
nlayers <- 1:6

for (i in nlayers) {
cat("Running model with", i, "hidden layer(s) and 16 neurons per layer: ")
m <- dl_model(nlayers = i, powerto = 4)
results <- rbind(results, m)
loss <- get_min_loss(m)
cat(loss, "\n", append = TRUE)
}
Running model with 1 hidden layer(s) and 16 neurons per layer: 0.27
Running model with 2 hidden layer(s) and 16 neurons per layer: 0.274
Running model with 3 hidden layer(s) and 16 neurons per layer: 0.27
Running model with 4 hidden layer(s) and 16 neurons per layer: 0.278
Running model with 5 hidden layer(s) and 16 neurons per layer: 0.279
Running model with 6 hidden layer(s) and 16 neurons per layer: 0.274

It’s unclear how much the minimum loss score improves in the above results; however, the plot below illustrates that our 1-2 layer models overfit less than the deeper models.
min_loss <- results %>%
filter(metric == "loss" & data == "validation") %>%
summarize(min_loss = min(value, na.rm = TRUE)) %>%
pull()

results %>%
filter(metric == "loss") %>%
ggplot(aes(epoch, value, color = data)) +
geom_line() +
geom_hline(yintercept = min_loss, lty = "dashed") +
facet_wrap(~ nlayers, ncol = 3) +
theme_bw()

Regularizing the size of weights


A common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take on small values, which makes the distribution of weight values more regular. This is called weight regularization, and it’s done by adding to the loss function of the network a cost associated with having large weights.
If you are familiar with regularized regression (lasso, ridge, elastic nets), then weight regularization is essentially the same thing.
Best practice:

 Although you can use L1, L2 or a combination, L2 is by far the most common and is known
as weight decay in the context of neural nets.
 Optimal values vary, but when tuning we typically start with factors of 10^-s where s ranges between 1 and 4 (0.1, 0.01, …, 0.0001).
 The larger the weight regularizer, the more epochs generally required to reach a minimum loss.
 Weight decay can cause a noisier learning curve, so it’s often beneficial to increase the patience parameter for early stopping.

network <- keras_model_sequential() %>%
layer_dense(
units = 16, activation = "relu", input_shape = n_features,
kernel_regularizer = regularizer_l2(l = 0.01) # regularization parameter
) %>%
layer_dense(
units = 16, activation = "relu",
kernel_regularizer = regularizer_l2(l = 0.01) # regularization parameter
) %>%
layer_dense(units = 1, activation = "sigmoid")

network %>% compile(


optimizer = "rmsprop",
loss = loss_binary_crossentropy,
metrics = c("accuracy")
)

history <- network %>% fit(


x_train,
y_train,
epochs = 100,
batch_size = 512,
validation_split = 0.2,
callbacks = callback_early_stopping(patience = 15)
)

Unfortunately, in this example, weight decay negatively impacts performance. The impact of weight
decay is largely problem and data specific.
best_epoch <- which.min(history$metrics$val_loss)
best_loss <- history$metrics$val_loss[best_epoch] %>% round(3)
best_acc <- history$metrics$val_accuracy[best_epoch] %>% round(3)

glue("Our optimal loss is {best_loss} with an accuracy of {best_acc}")


Our optimal loss is 0.375 with an accuracy of 0.885

plot(history) +
scale_x_continuous(limits = c(0, length(history$metrics$val_loss)))
Regularizing happenstance patterns
Dropout is one of the most effective and commonly used regularization techniques for neural networks. Dropout applied to a layer randomly drops out (sets to zero) a certain percentage of the output features of that layer. By randomly dropping some of a layer’s outputs, we minimize the chance of fitting patterns to noise in the data, a common cause of overfitting.
Best practice:

 Dropout rates typically range between 0.2 and 0.5. Sometimes higher rates are necessary, but note that you will get a warning when supplying rate > 0.5.
 The higher the dropout rate, the slower the convergence, so you may need to increase the number of epochs.
 It’s common to apply dropout after each hidden layer and with the same rate; however, this is not necessary.

network <- keras_model_sequential() %>%
layer_dense(units = 16, activation = "relu", input_shape = n_features) %>%
layer_dropout(0.6) %>% # regularization parameter
layer_dense(units = 16, activation = "relu") %>%
layer_dropout(0.6) %>% # regularization parameter
layer_dense(units = 1, activation = "sigmoid")

network %>% compile(


optimizer = "rmsprop",
loss = loss_binary_crossentropy,
metrics = c("accuracy")
)

history <- network %>% fit(


x_train,
y_train,
epochs = 100,
batch_size = 512,
validation_split = 0.2,
callbacks = callback_early_stopping(patience = 10)
)

Similar to weight regularization, the impact of dropout is largely problem and data specific. In this
example we do not see significant improvement.
best_epoch <- which.min(history$metrics$val_loss)
best_loss <- history$metrics$val_loss[best_epoch] %>% round(3)
best_acc <- history$metrics$val_accuracy[best_epoch] %>% round(3)

glue("Our optimal loss is {best_loss} with an accuracy of {best_acc}")


Our optimal loss is 0.274 with an accuracy of 0.896

plot(history) +
scale_x_continuous(limits = c(0, length(history$metrics$val_loss)))

So which is best?
There is no definitive best approach for minimizing overfitting. However, typically you want to focus first
on finding the optimal learning rate and model capacity that optimizes the loss score. Then move on to
fighting overfitting with dropout or weight decay.
Unfortunately, many of these hyperparameters interact, so changing one can impact the performance of another. Performing a grid search can help you identify the optimal combination; however, as your data gets larger or as you start using more complex models such as CNNs and LSTMs, you are often too constrained by compute to execute a sizable grid search. Here is a great paper on how to practically approach hyperparameter tuning for neural networks: https://arxiv.org/abs/1803.09820.
To see the performance of a grid search on this data set and the parameters discussed here, check out this notebook.

Key takeaways

 Preparing text data


o Text data is usually stored as numeric data representing a word index
o We typically apply a word limit (e.g. the 10K or 20K most frequent words)
o In this example we one-hot encoded the features into a 2D tensor but tomorrow we will
look at better approaches
 When our model overfits regularizing can improve model performance
 Common approaches to regularization
o learning rate
o model capacity
o weight decay
o dropout

Experiment 5:

MNIST Handwritten Digit Classification Dataset

MNIST is an acronym that stands for the Modified National Institute of Standards and Technology dataset.
It is a dataset of 60,000 small square 28×28 pixel grayscale images of handwritten single digits between 0
and 9.

The task is to classify a given image of a handwritten digit into one of 10 classes representing integer
values from 0 to 9, inclusively.

It is a widely used and deeply understood dataset and, for the most part, is “solved.” Top-performing models are deep learning convolutional neural networks that achieve a classification accuracy above 99%, with an error rate between 0.4% and 0.2% on the hold-out test dataset.
The example below loads the MNIST dataset using the Keras API and creates a plot of the first nine
images in the training dataset.

# example of loading the mnist dataset
from tensorflow.keras.datasets import mnist
from matplotlib import pyplot as plt
# load dataset
(trainX, trainy), (testX, testy) = mnist.load_data()
# summarize loaded dataset
print('Train: X=%s, y=%s' % (trainX.shape, trainy.shape))
print('Test: X=%s, y=%s' % (testX.shape, testy.shape))
# plot first few images
for i in range(9):
    # define subplot
    plt.subplot(330 + 1 + i)
    # plot raw pixel data
    plt.imshow(trainX[i], cmap=plt.get_cmap('gray'))
# show the figure
plt.show()

Running the example loads the MNIST train and test dataset and prints their shape.

We can see that there are 60,000 examples in the training dataset and 10,000 in the test dataset and that
images are indeed square with 28×28 pixels.

Train: X=(60000, 28, 28), y=(60000,)
Test: X=(10000, 28, 28), y=(10000,)

A plot of the first nine images in the dataset is also created showing the natural handwritten nature of the
images to be classified.

Plot of a Subset of Images From the MNIST Dataset


Model Evaluation Methodology

Although the MNIST dataset is effectively solved, it can be a useful starting point for developing and
practicing a methodology for solving image classification tasks using convolutional neural networks.

Instead of reviewing the literature on well-performing models on the dataset, we can develop a new
model from scratch.

The dataset already has a well-defined train and test dataset that we can use.

In order to estimate the performance of a model for a given training run, we can further split the training
set into a train and validation dataset. Performance on the train and validation dataset over each run can
then be plotted to provide learning curves and insight into how well a model is learning the problem.

The Keras API supports this by specifying the “validation_data” argument to the model.fit() function when training the model, which will, in turn, return an object that describes model performance for the chosen loss and metrics on each training epoch.
# record model performance on a validation dataset during training
history = model.fit(..., validation_data=(valX, valY))

In order to estimate the performance of a model on the problem in general, we can use k-fold cross-validation, perhaps five-fold cross-validation. This will give some account of the model’s variance with respect to both differences in the training and test datasets and the stochastic nature of the learning algorithm. The performance of the model can be taken as the mean performance across the k folds, with the standard deviation used to estimate a confidence interval if desired.
We can use the KFold class from the scikit-learn API to implement the k-fold cross-validation evaluation of a given neural network model. There are many ways to achieve this, although we can choose a flexible approach where the KFold class is only used to specify the row indexes used for each split.
# example of k-fold cv for a neural net
data = ...
# prepare cross validation
kfold = KFold(5, shuffle=True, random_state=1)
# enumerate splits
for train_ix, test_ix in kfold.split(data):
    model = ...
    ...
We will hold back the actual test dataset and use it as an evaluation of our final model.

How to Develop a Baseline Model

The first step is to develop a baseline model.

This is critical as it both involves developing the infrastructure for the test harness so that any model we
design can be evaluated on the dataset, and it establishes a baseline in model performance on the problem,
by which all improvements can be compared.

The design of the test harness is modular, and we can develop a separate function for each piece. This allows a given aspect of the test harness to be modified or interchanged, if we desire, separately from the rest.

We can develop this test harness with five key elements. They are the loading of the dataset, the
preparation of the dataset, the definition of the model, the evaluation of the model, and the presentation of
results.

Load Dataset

We know some things about the dataset.

For example, we know that the images are all pre-aligned (e.g. each image only contains a hand-drawn
digit), that the images all have the same square size of 28×28 pixels, and that the images are grayscale.

Therefore, we can load the images and reshape the data arrays to have a single color channel.

# load dataset
(trainX, trainY), (testX, testY) = mnist.load_data()
# reshape dataset to have a single channel
trainX = trainX.reshape((trainX.shape[0], 28, 28, 1))
testX = testX.reshape((testX.shape[0], 28, 28, 1))

We also know that there are 10 classes and that classes are represented as unique integers.
We can, therefore, use a one hot encoding for the class element of each sample, transforming the integer
into a 10 element binary vector with a 1 for the index of the class value, and 0 values for all other classes.
We can achieve this with the to_categorical() utility function.
# one hot encode target values
trainY = to_categorical(trainY)
testY = to_categorical(testY)

The load_dataset() function implements these behaviors and can be used to load the dataset.
# load train and test dataset
def load_dataset():
    # load dataset
    (trainX, trainY), (testX, testY) = mnist.load_data()
    # reshape dataset to have a single channel
    trainX = trainX.reshape((trainX.shape[0], 28, 28, 1))
    testX = testX.reshape((testX.shape[0], 28, 28, 1))
    # one hot encode target values
    trainY = to_categorical(trainY)
    testY = to_categorical(testY)
    return trainX, trainY, testX, testY

Prepare Pixel Data


We know that the pixel values for each image in the dataset are unsigned integers in the range between
black and white, or 0 and 255.

We do not know the best way to scale the pixel values for modeling, but we know that some scaling will
be required.

A good starting point is to normalize the pixel values of grayscale images, e.g. rescale them to the range
[0,1]. This involves first converting the data type from unsigned integers to floats, then dividing the pixel
values by the maximum value.
# convert from integers to floats
train_norm = train.astype('float32')
test_norm = test.astype('float32')
# normalize to range 0-1
train_norm = train_norm / 255.0
test_norm = test_norm / 255.0

The prep_pixels() function below implements these behaviors and is provided with the pixel values for
both the train and test datasets that will need to be scaled.
# scale pixels
def prep_pixels(train, test):
    # convert from integers to floats
    train_norm = train.astype('float32')
    test_norm = test.astype('float32')
    # normalize to range 0-1
    train_norm = train_norm / 255.0
    test_norm = test_norm / 255.0
    # return normalized images
    return train_norm, test_norm

This function must be called to prepare the pixel values prior to any modeling.

Define Model
Next, we need to define a baseline convolutional neural network model for the problem.

The model has two main aspects: the feature extraction front end comprised of convolutional and pooling
layers, and the classifier backend that will make a prediction.

For the convolutional front-end, we can start with a single convolutional layer with a small filter size (3,3)
and a modest number of filters (32) followed by a max pooling layer. The filter maps can then be
flattened to provide features to the classifier.
Given that the problem is a multi-class classification task, we know that we will require an output layer
with 10 nodes in order to predict the probability distribution of an image belonging to each of the 10
classes. This will also require the use of a softmax activation function. Between the feature extractor and
the output layer, we can add a dense layer to interpret the features, in this case with 100 nodes.
All layers will use the ReLU activation function and the He weight initialization scheme, both best
practices.
We will use a conservative configuration for the stochastic gradient descent optimizer with a learning
rate of 0.01 and a momentum of 0.9. The categorical cross-entropy loss function will be optimized,
suitable for multi-class classification, and we will monitor the classification accuracy metric, which is
appropriate given we have the same number of examples in each of the 10 classes.
The define_model() function below will define and return this model.
# define cnn model
def define_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=(28, 28, 1)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(10, activation='softmax'))
    # compile model
    opt = SGD(learning_rate=0.01, momentum=0.9)
    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model

Evaluate Model

After the model is defined, we need to evaluate it.

The model will be evaluated using five-fold cross-validation. The value of k=5 was chosen to provide a
baseline for both repeated evaluation and to not be so large as to require a long running time. Each test set
will be 20% of the training dataset, or about 12,000 examples, close to the size of the actual test set for
this problem.
The training dataset is shuffled prior to being split, and the same shuffle (via a fixed random seed) is performed each time, so that any model we evaluate will have the same train and test datasets in each fold, providing an apples-to-apples comparison between models.
We will train the baseline model for a modest 10 training epochs with a default batch size of 32 examples.
The test set for each fold will be used to evaluate the model both during each epoch of the training run, so
that we can later create learning curves, and at the end of the run, so that we can estimate the performance
of the model. As such, we will keep track of the resulting history from each run, as well as the
classification accuracy of the fold.

The evaluate_model() function below implements these behaviors, taking the training dataset as
arguments and returning a list of accuracy scores and training histories that can be later summarized.
# evaluate a model using k-fold cross-validation
def evaluate_model(dataX, dataY, n_folds=5):
    scores, histories = list(), list()
    # prepare cross validation
    kfold = KFold(n_folds, shuffle=True, random_state=1)
    # enumerate splits
    for train_ix, test_ix in kfold.split(dataX):
        # define model
        model = define_model()
        # select rows for train and test
        trainX, trainY, testX, testY = dataX[train_ix], dataY[train_ix], dataX[test_ix], dataY[test_ix]
        # fit model
        history = model.fit(trainX, trainY, epochs=10, batch_size=32, validation_data=(testX, testY), verbose=0)
        # evaluate model
        _, acc = model.evaluate(testX, testY, verbose=0)
        print('> %.3f' % (acc * 100.0))
        # store scores
        scores.append(acc)
        histories.append(history)
    return scores, histories

Present Results

Once the model has been evaluated, we can present the results.

There are two key aspects to present: the diagnostics of the learning behavior of the model during training
and the estimation of the model performance. These can be implemented using separate functions.

First, the diagnostics involve creating a line plot showing model performance on the train and test set
during each fold of the k-fold cross-validation. These plots are valuable for getting an idea of whether a
model is overfitting, underfitting, or has a good fit for the dataset.

We will create a single figure with two subplots, one for loss and one for accuracy. Blue lines will
indicate model performance on the training dataset and orange lines will indicate performance on the hold
out test dataset. The summarize_diagnostics() function below creates and shows this plot given the
collected training histories.
# plot diagnostic learning curves
def summarize_diagnostics(histories):
    for i in range(len(histories)):
        # plot loss
        plt.subplot(2, 1, 1)
        plt.title('Cross Entropy Loss')
        plt.plot(histories[i].history['loss'], color='blue', label='train')
        plt.plot(histories[i].history['val_loss'], color='orange', label='test')
        # plot accuracy
        plt.subplot(2, 1, 2)
        plt.title('Classification Accuracy')
        plt.plot(histories[i].history['accuracy'], color='blue', label='train')
        plt.plot(histories[i].history['val_accuracy'], color='orange', label='test')
    plt.show()

Next, the classification accuracy scores collected during each fold can be summarized by calculating the
mean and standard deviation. This provides an estimate of the average expected performance of the
model trained on this dataset, with an estimate of the average variance in the mean. We will also
summarize the distribution of scores by creating and showing a box and whisker plot.

The summarize_performance() function below implements this for a given list of scores collected during
model evaluation.
# summarize model performance
def summarize_performance(scores):
    # print summary
    print('Accuracy: mean=%.3f std=%.3f, n=%d' % (mean(scores)*100, std(scores)*100, len(scores)))
    # box and whisker plots of results
    plt.boxplot(scores)
    plt.show()

Complete Example

We need a function that will drive the test harness.

This involves calling all of the previously defined functions.

# run the test harness for evaluating a model
def run_test_harness():
    # load dataset
    trainX, trainY, testX, testY = load_dataset()
    # prepare pixel data
    trainX, testX = prep_pixels(trainX, testX)
    # evaluate model
    scores, histories = evaluate_model(trainX, trainY)
    # learning curves
    summarize_diagnostics(histories)
    # summarize estimated performance
    summarize_performance(scores)

We now have everything we need; the complete code example for a baseline convolutional neural
network model on the MNIST dataset is listed below.

# baseline cnn model for mnist
from numpy import mean
from numpy import std
from matplotlib import pyplot as plt
from sklearn.model_selection import KFold
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.optimizers import SGD

# load train and test dataset
def load_dataset():
    # load dataset
    (trainX, trainY), (testX, testY) = mnist.load_data()
    # reshape dataset to have a single channel
    trainX = trainX.reshape((trainX.shape[0], 28, 28, 1))
    testX = testX.reshape((testX.shape[0], 28, 28, 1))
    # one hot encode target values
    trainY = to_categorical(trainY)
    testY = to_categorical(testY)
    return trainX, trainY, testX, testY

# scale pixels
def prep_pixels(train, test):
    # convert from integers to floats
    train_norm = train.astype('float32')
    test_norm = test.astype('float32')
    # normalize to range 0-1
    train_norm = train_norm / 255.0
    test_norm = test_norm / 255.0
    # return normalized images
    return train_norm, test_norm

# define cnn model
def define_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=(28, 28, 1)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(10, activation='softmax'))
    # compile model
    opt = SGD(learning_rate=0.01, momentum=0.9)
    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# evaluate a model using k-fold cross-validation
def evaluate_model(dataX, dataY, n_folds=5):
    scores, histories = list(), list()
    # prepare cross validation
    kfold = KFold(n_folds, shuffle=True, random_state=1)
    # enumerate splits
    for train_ix, test_ix in kfold.split(dataX):
        # define model
        model = define_model()
        # select rows for train and test
        trainX, trainY, testX, testY = dataX[train_ix], dataY[train_ix], dataX[test_ix], dataY[test_ix]
        # fit model
        history = model.fit(trainX, trainY, epochs=10, batch_size=32, validation_data=(testX, testY), verbose=0)
        # evaluate model
        _, acc = model.evaluate(testX, testY, verbose=0)
        print('> %.3f' % (acc * 100.0))
        # store scores
        scores.append(acc)
        histories.append(history)
    return scores, histories

# plot diagnostic learning curves
def summarize_diagnostics(histories):
    for i in range(len(histories)):
        # plot loss
        plt.subplot(2, 1, 1)
        plt.title('Cross Entropy Loss')
        plt.plot(histories[i].history['loss'], color='blue', label='train')
        plt.plot(histories[i].history['val_loss'], color='orange', label='test')
        # plot accuracy
        plt.subplot(2, 1, 2)
        plt.title('Classification Accuracy')
        plt.plot(histories[i].history['accuracy'], color='blue', label='train')
        plt.plot(histories[i].history['val_accuracy'], color='orange', label='test')
    plt.show()

# summarize model performance
def summarize_performance(scores):
    # print summary
    print('Accuracy: mean=%.3f std=%.3f, n=%d' % (mean(scores)*100, std(scores)*100, len(scores)))
    # box and whisker plots of results
    plt.boxplot(scores)
    plt.show()

# run the test harness for evaluating a model
def run_test_harness():
    # load dataset
    trainX, trainY, testX, testY = load_dataset()
    # prepare pixel data
    trainX, testX = prep_pixels(trainX, testX)
    # evaluate model
    scores, histories = evaluate_model(trainX, trainY)
    # learning curves
    summarize_diagnostics(histories)
    # summarize estimated performance
    summarize_performance(scores)

# entry point, run the test harness
run_test_harness()

Running the example prints the classification accuracy for each fold of the cross-validation process. This
is helpful to get an idea that the model evaluation is progressing.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or
differences in numerical precision. Consider running the example a few times and compare the average
outcome.
We can see that the model achieves above 98.5% accuracy on every fold, with the best fold approaching 98.9%. These are good results.

> 98.550
> 98.600
> 98.642
> 98.850
> 98.742

Next, a diagnostic plot is shown, giving insight into the learning behavior of the model across each fold.

In this case, we can see that the model generally achieves a good fit, with train and test learning curves
converging. There is no obvious sign of over- or underfitting.
Loss and Accuracy Learning Curves for the Baseline Model During k-Fold Cross-Validation

Next, a summary of the model performance is calculated.

We can see in this case, the model has an estimated skill of about 98.6%, which is reasonable.

Accuracy: mean=98.677 std=0.107, n=5

Finally, a box and whisker plot is created to summarize the distribution of accuracy scores.
Box and Whisker Plot of Accuracy Scores for the Baseline Model Evaluated Using k-Fold Cross-Validation

We now have a robust test harness and a well-performing baseline model.

How to Develop an Improved Model

There are many ways that we might explore improvements to the baseline model.

We will look at areas of model configuration that often result in an improvement, so-called low-hanging
fruit. The first is a change to the learning algorithm, and the second is an increase in the depth of the
model.

Improvement to Learning

There are many aspects of the learning algorithm that can be explored for improvement.

Perhaps the point of biggest leverage is the learning rate, such as evaluating the impact that smaller or
larger values of the learning rate may have, as well as schedules that change the learning rate during
training.

Another approach that can rapidly accelerate the learning of a model and can result in large performance
improvements is batch normalization. We will evaluate the effect that batch normalization has on our
baseline model.

Batch normalization can be used after convolutional and fully connected layers. It has the effect of
changing the distribution of the output of the layer, specifically by standardizing the outputs. This has the
effect of stabilizing and accelerating the learning process.
We can update the model definition to use batch normalization after the activation function for the convolutional and dense layers of our baseline model. The updated version of the define_model() function with batch normalization is listed below.
# define cnn model
def define_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=(28, 28, 1)))
    model.add(BatchNormalization())
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
    model.add(BatchNormalization())
    model.add(Dense(10, activation='softmax'))
    # compile model
    opt = SGD(learning_rate=0.01, momentum=0.9)
    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model

The complete code listing with this change is provided below.

# cnn model with batch normalization for mnist
from numpy import mean
from numpy import std
from matplotlib import pyplot as plt
from sklearn.model_selection import KFold
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.layers import BatchNormalization

# load train and test dataset
def load_dataset():
    # load dataset
    (trainX, trainY), (testX, testY) = mnist.load_data()
    # reshape dataset to have a single channel
    trainX = trainX.reshape((trainX.shape[0], 28, 28, 1))
    testX = testX.reshape((testX.shape[0], 28, 28, 1))
    # one hot encode target values
    trainY = to_categorical(trainY)
    testY = to_categorical(testY)
    return trainX, trainY, testX, testY

# scale pixels
def prep_pixels(train, test):
    # convert from integers to floats
    train_norm = train.astype('float32')
    test_norm = test.astype('float32')
    # normalize to range 0-1
    train_norm = train_norm / 255.0
    test_norm = test_norm / 255.0
    # return normalized images
    return train_norm, test_norm

# define cnn model
def define_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=(28, 28, 1)))
    model.add(BatchNormalization())
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
    model.add(BatchNormalization())
    model.add(Dense(10, activation='softmax'))
    # compile model
    opt = SGD(learning_rate=0.01, momentum=0.9)
    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# evaluate a model using k-fold cross-validation
def evaluate_model(dataX, dataY, n_folds=5):
    scores, histories = list(), list()
    # prepare cross validation
    kfold = KFold(n_folds, shuffle=True, random_state=1)
    # enumerate splits
    for train_ix, test_ix in kfold.split(dataX):
        # define model
        model = define_model()
        # select rows for train and test
        trainX, trainY, testX, testY = dataX[train_ix], dataY[train_ix], dataX[test_ix], dataY[test_ix]
        # fit model
        history = model.fit(trainX, trainY, epochs=10, batch_size=32, validation_data=(testX, testY), verbose=0)
        # evaluate model
        _, acc = model.evaluate(testX, testY, verbose=0)
        print('> %.3f' % (acc * 100.0))
        # store scores
        scores.append(acc)
        histories.append(history)
    return scores, histories

# plot diagnostic learning curves
def summarize_diagnostics(histories):
    for i in range(len(histories)):
        # plot loss
        plt.subplot(2, 1, 1)
        plt.title('Cross Entropy Loss')
        plt.plot(histories[i].history['loss'], color='blue', label='train')
        plt.plot(histories[i].history['val_loss'], color='orange', label='test')
        # plot accuracy
        plt.subplot(2, 1, 2)
        plt.title('Classification Accuracy')
        plt.plot(histories[i].history['accuracy'], color='blue', label='train')
        plt.plot(histories[i].history['val_accuracy'], color='orange', label='test')
    plt.show()

# summarize model performance
def summarize_performance(scores):
    # print summary
    print('Accuracy: mean=%.3f std=%.3f, n=%d' % (mean(scores)*100, std(scores)*100, len(scores)))
    # box and whisker plots of results
    plt.boxplot(scores)
    plt.show()

# run the test harness for evaluating a model
def run_test_harness():
    # load dataset
    trainX, trainY, testX, testY = load_dataset()
    # prepare pixel data
    trainX, testX = prep_pixels(trainX, testX)
    # evaluate model
    scores, histories = evaluate_model(trainX, trainY)
    # learning curves
    summarize_diagnostics(histories)
    # summarize estimated performance
    summarize_performance(scores)

# entry point, run the test harness
run_test_harness()

Running the example again reports model performance for each fold of the cross-validation process.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or
differences in numerical precision. Consider running the example a few times and compare the average
outcome.
We can see perhaps a small drop in model performance as compared to the baseline across the cross-
validation folds.

> 98.475
> 98.608
> 98.683
> 98.783
> 98.667

A plot of the learning curves is created, in this case showing that the speed of learning (improvement over
epochs) does not appear to be different from the baseline model.

The plots suggest that batch normalization, at least as implemented in this case, does not offer any
benefit.
Loss and Accuracy Learning Curves for the BatchNormalization Model During k-Fold Cross-Validation

Next, the estimated performance of the model is presented, showing performance with a slight decrease in
the mean accuracy of the model: 98.643 as compared to 98.677 with the baseline model.

Accuracy: mean=98.643 std=0.101, n=5


Box and Whisker Plot of Accuracy Scores for the BatchNormalization Model Evaluated Using k-Fold
Cross-Validation

Increase in Model Depth

There are many ways to change the model configuration in order to explore improvements over the
baseline model.

Two common approaches involve changing the capacity of the feature extraction part of the model or
changing the capacity or function of the classifier part of the model. Perhaps the point of biggest
influence is a change to the feature extractor.
We can increase the depth of the feature extractor part of the model, following a VGG-like pattern of
adding more convolutional and pooling layers with the same sized filter, while increasing the number of
filters. In this case, we will add two consecutive convolutional layers with 64 filters each, followed by another
max pooling layer.
The updated version of the define_model() function with this change is listed below.
# define cnn model
def define_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=(28, 28, 1)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform'))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(10, activation='softmax'))
    # compile model
    opt = SGD(learning_rate=0.01, momentum=0.9)
    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model

For completeness, the entire code listing, including this change, is provided below.
# deeper cnn model for mnist
from numpy import mean
from numpy import std
from matplotlib import pyplot as plt
from sklearn.model_selection import KFold
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.optimizers import SGD

# load train and test dataset
def load_dataset():
    # load dataset
    (trainX, trainY), (testX, testY) = mnist.load_data()
    # reshape dataset to have a single channel
    trainX = trainX.reshape((trainX.shape[0], 28, 28, 1))
    testX = testX.reshape((testX.shape[0], 28, 28, 1))
    # one hot encode target values
    trainY = to_categorical(trainY)
    testY = to_categorical(testY)
    return trainX, trainY, testX, testY

# scale pixels
def prep_pixels(train, test):
    # convert from integers to floats
    train_norm = train.astype('float32')
    test_norm = test.astype('float32')
    # normalize to range 0-1
    train_norm = train_norm / 255.0
    test_norm = test_norm / 255.0
    # return normalized images
    return train_norm, test_norm

# define cnn model
def define_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=(28, 28, 1)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform'))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(10, activation='softmax'))
    # compile model
    opt = SGD(learning_rate=0.01, momentum=0.9)
    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# evaluate a model using k-fold cross-validation
def evaluate_model(dataX, dataY, n_folds=5):
    scores, histories = list(), list()
    # prepare cross validation
    kfold = KFold(n_folds, shuffle=True, random_state=1)
    # enumerate splits
    for train_ix, test_ix in kfold.split(dataX):
        # define model
        model = define_model()
        # select rows for train and test
        trainX, trainY, testX, testY = dataX[train_ix], dataY[train_ix], dataX[test_ix], dataY[test_ix]
        # fit model
        history = model.fit(trainX, trainY, epochs=10, batch_size=32, validation_data=(testX, testY), verbose=0)
        # evaluate model
        _, acc = model.evaluate(testX, testY, verbose=0)
        print('> %.3f' % (acc * 100.0))
        # store scores
        scores.append(acc)
        histories.append(history)
    return scores, histories

# plot diagnostic learning curves
def summarize_diagnostics(histories):
    for i in range(len(histories)):
        # plot loss
        plt.subplot(2, 1, 1)
        plt.title('Cross Entropy Loss')
        plt.plot(histories[i].history['loss'], color='blue', label='train')
        plt.plot(histories[i].history['val_loss'], color='orange', label='test')
        # plot accuracy
        plt.subplot(2, 1, 2)
        plt.title('Classification Accuracy')
        plt.plot(histories[i].history['accuracy'], color='blue', label='train')
        plt.plot(histories[i].history['val_accuracy'], color='orange', label='test')
    plt.show()

# summarize model performance
def summarize_performance(scores):
    # print summary
    print('Accuracy: mean=%.3f std=%.3f, n=%d' % (mean(scores)*100, std(scores)*100, len(scores)))
    # box and whisker plots of results
    plt.boxplot(scores)
    plt.show()

# run the test harness for evaluating a model
def run_test_harness():
    # load dataset
    trainX, trainY, testX, testY = load_dataset()
    # prepare pixel data
    trainX, testX = prep_pixels(trainX, testX)
    # evaluate model
    scores, histories = evaluate_model(trainX, trainY)
    # learning curves
    summarize_diagnostics(histories)
    # summarize estimated performance
    summarize_performance(scores)

# entry point, run the test harness
run_test_harness()

Running the example reports model performance for each fold of the cross-validation process.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or
differences in numerical precision. Consider running the example a few times and compare the average
outcome.
The per-fold scores may suggest some improvement over the baseline.

> 99.058
> 99.042
> 98.883
> 99.192
> 99.133

A plot of the learning curves is created, in this case showing that the models still have a good fit on the
problem, with no clear signs of overfitting. The plots may even suggest that further training epochs could
be helpful.
Loss and Accuracy Learning Curves for the Deeper Model During k-Fold Cross-Validation

Next, the estimated performance of the model is presented, showing a small improvement in performance
as compared to the baseline from 98.677 to 99.062, with a small drop in the standard deviation as well.

Accuracy: mean=99.062 std=0.104, n=5


Box and Whisker Plot of Accuracy Scores for the Deeper Model Evaluated Using k-Fold Cross-
Validation

How to Finalize the Model and Make Predictions

The process of model improvement may continue for as long as we have ideas and the time and resources
to test them out.

At some point, a final model configuration must be chosen and adopted. In this case, we will choose the
deeper model as our final model.

First, we will finalize our model by fitting it on the entire training dataset and saving the model to
file for later use. We will then load the model and evaluate its performance on the hold out test dataset to
get an idea of how well the chosen model actually performs in practice. Finally, we will use the saved
model to make a prediction on a single image.

Save Final Model

A final model is typically fit on all available data, such as the combination of the train and test datasets.

In this tutorial, we are intentionally holding back a test dataset so that we can estimate the performance of
the final model, which can be a good idea in practice. As such, we will fit our model on the training
dataset only.

# fit model
model.fit(trainX, trainY, epochs=10, batch_size=32, verbose=0)

Once fit, we can save the final model to an H5 file by calling the save() function on the model and pass in
the chosen filename.
# save model
model.save('final_model.h5')

Note, saving and loading a Keras model requires that the h5py library is installed on your workstation.
The complete example of fitting the final deep model on the training dataset and saving it to file is listed
below.

# save the final model to file
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.optimizers import SGD

# load train and test dataset
def load_dataset():
    # load dataset
    (trainX, trainY), (testX, testY) = mnist.load_data()
    # reshape dataset to have a single channel
    trainX = trainX.reshape((trainX.shape[0], 28, 28, 1))
    testX = testX.reshape((testX.shape[0], 28, 28, 1))
    # one hot encode target values
    trainY = to_categorical(trainY)
    testY = to_categorical(testY)
    return trainX, trainY, testX, testY

# scale pixels
def prep_pixels(train, test):
    # convert from integers to floats
    train_norm = train.astype('float32')
    test_norm = test.astype('float32')
    # normalize to range 0-1
    train_norm = train_norm / 255.0
    test_norm = test_norm / 255.0
    # return normalized images
    return train_norm, test_norm

# define cnn model
def define_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=(28, 28, 1)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform'))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(10, activation='softmax'))
    # compile model
    opt = SGD(learning_rate=0.01, momentum=0.9)
    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# run the test harness for evaluating a model
def run_test_harness():
    # load dataset
    trainX, trainY, testX, testY = load_dataset()
    # prepare pixel data
    trainX, testX = prep_pixels(trainX, testX)
    # define model
    model = define_model()
    # fit model
    model.fit(trainX, trainY, epochs=10, batch_size=32, verbose=0)
    # save model
    model.save('final_model.h5')

# entry point, run the test harness
run_test_harness()

After running this example, you will now have a 1.2-megabyte file with the name ‘final_model.h5‘ in
your current working directory.
Evaluate Final Model

We can now load the final model and evaluate it on the hold out test dataset.

This is something we might do if we were interested in presenting the performance of the chosen model
to project stakeholders.

The model can be loaded via the load_model() function.
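For example (a one-line sketch, assuming the 'final_model.h5' file saved in the previous section):

# load the saved model from file
from tensorflow.keras.models import load_model
model = load_model('final_model.h5')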


The complete example of loading the saved model and evaluating it on the test dataset is listed below.

# evaluate the deep model on the test dataset
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import load_model
from tensorflow.keras.utils import to_categorical

# load train and test dataset
def load_dataset():
    # load dataset
    (trainX, trainY), (testX, testY) = mnist.load_data()
    # reshape dataset to have a single channel
    trainX = trainX.reshape((trainX.shape[0], 28, 28, 1))
    testX = testX.reshape((testX.shape[0], 28, 28, 1))
    # one hot encode target values
    trainY = to_categorical(trainY)
    testY = to_categorical(testY)
    return trainX, trainY, testX, testY

# scale pixels
def prep_pixels(train, test):
    # convert from integers to floats
    train_norm = train.astype('float32')
    test_norm = test.astype('float32')
    # normalize to range 0-1
    train_norm = train_norm / 255.0
    test_norm = test_norm / 255.0
    # return normalized images
    return train_norm, test_norm

# run the test harness for evaluating a model
def run_test_harness():
    # load dataset
    trainX, trainY, testX, testY = load_dataset()
    # prepare pixel data
    trainX, testX = prep_pixels(trainX, testX)
    # load model
    model = load_model('final_model.h5')
    # evaluate model on test dataset
    _, acc = model.evaluate(testX, testY, verbose=0)
    print('> %.3f' % (acc * 100.0))

# entry point, run the test harness
run_test_harness()

Running the example loads the saved model and evaluates the model on the hold out test dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or
differences in numerical precision. Consider running the example a few times and compare the average
outcome.
The classification accuracy for the model on the test dataset is calculated and printed. In this case, we can
see that the model achieved an accuracy of 99.090%, or an error of just under 1%, which is not bad at all and
reasonably close to the cross-validation estimate of 99.062% with a standard deviation of about 0.1%.

> 99.090

Make Prediction

We can use our saved model to make a prediction on new images.

The model assumes that new images are grayscale, that they have been aligned so that one image contains
one centered handwritten digit, and that the size of the image is square with the size 28×28 pixels.

Below is an image extracted from the MNIST test dataset. You can save it in your current working
directory with the filename ‘sample_image.png‘.
Sample Handwritten Digit
Download the sample image (sample_image.png).
We will pretend this is an entirely new and unseen image, prepared in the required way, and see how we
might use our saved model to predict the integer that the image represents (e.g. we expect “7“).
First, we can load the image, force it to be in grayscale format, and force the size to be 28×28 pixels. The
loaded image can then be reshaped to have a single channel and represent a single sample in a dataset.
The load_image() function implements this and will return the loaded image ready for classification.
Importantly, the pixel values are prepared in the same way as the pixel values were prepared for the
training dataset when fitting the final model, in this case, normalized.

# load and prepare the image
def load_image(filename):
    # load the image
    img = load_img(filename, grayscale=True, target_size=(28, 28))
    # convert to array
    img = img_to_array(img)
    # reshape into a single sample with 1 channel
    img = img.reshape(1, 28, 28, 1)
    # prepare pixel data
    img = img.astype('float32')
    img = img / 255.0
    return img

Next, we can load the model as in the previous section and call the predict() function to get the predicted
score, and then use argmax() to obtain the digit that the image represents.
# predict the class
predict_value = model.predict(img)
digit = argmax(predict_value)

The complete example is listed below.

# make a prediction for a new image.
from numpy import argmax
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.models import load_model

# load and prepare the image
def load_image(filename):
    # load the image
    img = load_img(filename, grayscale=True, target_size=(28, 28))
    # convert to array
    img = img_to_array(img)
    # reshape into a single sample with 1 channel
    img = img.reshape(1, 28, 28, 1)
    # prepare pixel data
    img = img.astype('float32')
    img = img / 255.0
    return img

# load an image and predict the class
def run_example():
    # load the image
    img = load_image('sample_image.png')
    # load model
    model = load_model('final_model.h5')
    # predict the class
    predict_value = model.predict(img)
    digit = argmax(predict_value)
    print(digit)

# entry point, run the example
run_example()

Running the example first loads and prepares the image, loads the model, and then correctly predicts that
the loaded image represents the digit ‘7‘.
7

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

- Tune Pixel Scaling. Explore how alternate pixel scaling methods impact model performance as compared to the baseline model, including centering and standardization (a sketch is given after this list).
- Tune the Learning Rate. Explore how different learning rates impact the model performance as compared to the baseline model, such as 0.001 and 0.0001.
- Tune Model Depth. Explore how adding more layers to the model impacts the model performance as compared to the baseline model, such as another block of convolutional and pooling layers or another dense layer in the classifier part of the model.
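As a starting point for the first extension, here is a variant of prep_pixels() (our own illustrative sketch, not part of the original tutorial) that centers and standardizes the pixels using statistics computed on the training set only:

# scale pixels by standardizing with training-set statistics (a sketch)
def prep_pixels_standardized(train, test):
    # convert from integers to floats
    train_std = train.astype('float32')
    test_std = test.astype('float32')
    # compute mean and standard deviation on the training set only
    mu, sigma = train_std.mean(), train_std.std()
    # center and standardize both sets with the same statistics
    train_std = (train_std - mu) / sigma
    test_std = (test_std - mu) / sigma
    return train_std, test_std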
If you explore any of these extensions, I’d love to know.
Post your findings in the comments below.
Classification of Handwritten Digits Using CNN


Introduction

In this blog, we will understand how to create and train a simple Convolutional Neural

Network (CNN) for classifying handwritten digits from a popular dataset.


Figure 1:
MNIST Dataset (Picture credits: en.wikipedia.org/wiki/MNIST_database)

Pre-requisite

Although each step will be thoroughly explained in this tutorial, it will certainly benefit someone

who already has some theoretical knowledge of how a CNN works. Some knowledge

of TensorFlow is also good to have, but it is not necessary.

Convolutional Neural Network

For those of you new to this concept, CNN is a deep learning technique to classify the input

automatically (well, after you provide the right data). Over the years, CNNs have proven highly effective

at classifying images for computer vision, and they are now being used in healthcare domains too.

This indicates that CNN is a reliable deep learning algorithm for an automated end-to-end

prediction. CNN essentially extracts ‘useful’ features from the given input automatically making it

super easy for us!


Figure 2: End to end process of CNN

A CNN model consists of three primary layers: Convolutional Layer, Pooling layer(s), and fully

connected layer.
(1) Convolutional Layer: This layer extracts high-level input features from input data and passes

those features to the next layer in the form of feature maps.

(2) Pooling Layer: It is used to reduce the dimensions of data by applying pooling on the feature

map to generate new feature maps with reduced dimensions. PL takes either maximum or average in

the old feature map within a given stride.

(3) Fully-Connected Layer: Finally, the task of classification is done by the FC layer. Probability

scores are calculated for each class label by a popular activation function called the softmax

function.
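For intuition, softmax turns a vector of raw class scores z into probabilities via softmax(z_i) = exp(z_i) / Σ_j exp(z_j). A tiny NumPy sketch of our own, for illustration:

import numpy as np

def softmax(z):
    # subtract the max for numerical stability, then normalize
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities that sum to 1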

For more details, I highly recommend you check this awesome tutorial on Analytics Vidhya.

Dataset

The dataset that is being used here is the MNIST digits classification dataset. Keras is a deep

learning API written in Python and MNIST is a dataset provided by this API. This dataset consists of

60,000 training images and 10,000 testing images. It is a good dataset for anyone who wants to

try their hand at pattern recognition, as we will in just a minute!

When the Keras API is called, there are four values returned namely- x_train, y_train, x_test, and

y_test. Do not worry, I will walk you through this.

Loading the Dataset

The language used here is python. I am going to use google colab for writing and executing the

python code. You may choose a Jupyter notebook as well. I chose Google Colab because it provides

easy access to notebooks anytime and anywhere. It is also possible to connect a Colab notebook to a

GitHub repository.
Also, the code used in this tutorial is available on this Github repository. So if you find yourself

stuck someplace, do check that repository. To keep this tutorial relevant for all, we will understand

the most critical code.

1. Create and name a notebook.

2. After loading the necessary libraries, load the MNIST dataset as shown below:
(X_train, y_train) , (X_test, y_test) = keras.datasets.mnist.load_data()

As we discussed previously, this dataset returns four values and in the same order as mentioned

above. Also, x_train, y_train, x_test, and y_test are representations for training and test datasets. To

see how a dataset is divided into training and test sets, check out the picture below, which I used during a

session where I talked about CNNs.


Figure 3: Dividing the dataset into training and test set

Voilà! You just loaded your dataset and are ready to move to the next step, which is to process the

data.

Processing the Dataset


Data has to be processed, cleaned, and rectified in order to improve its quality. CNN will learn best from

a dataset that does not contain any null values, has all numeric data, and is scaled. So, here we will

perform some steps to ensure that our dataset is perfectly suitable for a CNN model to learn

from. From here onwards till we create CNN model, we will work only on the training dataset.

If you write X_train[0], you get the 0th image, with values between 0-255 (0 means black and

255 means white); the output is a 2-dimensional matrix. Of course, we will not know which

handwritten digit X_train[0] represents just by looking at these values. To find out, write y_train[0] and you will get 5 as output.

This means that the 0th image of this training dataset represents the number 5.

So, let’s scale this training and test datasets as shown below:

X_train = X_train / 255


X_test = X_test / 255

After scaling, we reshape each image to add a single channel dimension, since the CNN expects inputs of shape (28, 28, 1):

X_train = X_train.reshape(-1,28,28,1) #training set


X_test = X_test.reshape(-1,28,28,1) #test set

Now that the dataset is looking good, it is high time that we create a Convolutional Neural Network.

Creating and Training a CNN

Let’s create a CNN model using the TensorFlow library. The model is created as follows:

from tensorflow.keras import models, layers  # imports assumed from the setup step

convolutional_neural_network = models.Sequential([
    layers.Conv2D(filters=25, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

Take some time to let this entire code sink in. It is important that you understand every bit of it. In

the CNN model created above, there is an input layer, followed by several hidden layers (alternating

convolutional and pooling layers, then a dense layer), and finally an output layer. In the simplest terms,

activation functions are responsible for deciding whether or not a signal moves forward. In a deep neural

network like a CNN, there are many neurons, and based on the activation functions, neurons fire and the

network moves forward. If you do not know much about activation functions, use 'relu', as it is the most popular choice.

Once the model has been created, it is time to compile and fit it (see the sketch below). During the process of

fitting, the model will go through the dataset and learn the relations in it. It will iterate over the training

data as many times as has been defined; in our example, we have defined 10 epochs. During

the process, the CNN model will learn and also make mistakes. For every mistake (i.e., wrong

prediction) the model makes, there is a penalty, and that is represented in the loss value for each

epoch (see GIF below). In short, the model should end the last epoch with as little loss and as high

accuracy as possible.
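The compile and fit calls themselves are not shown above; a minimal sketch consistent with this description (integer labels 0-9, hence the sparse loss, and 10 epochs) would be:

# compile and fit the model (a sketch)
convolutional_neural_network.compile(optimizer='adam',
                                     loss='sparse_categorical_crossentropy',
                                     metrics=['accuracy'])
convolutional_neural_network.fit(X_train, y_train, epochs=10)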


GIF 1: Training CNN and the improved accuracies during each epoch

Making Predictions

To evaluate the CNN model thus created, you can run:

convolutional_neural_network.evaluate(X_test, y_test)
It is time to use our test dataset to see how well the CNN model will perform.

y_predicted_by_model = convolutional_neural_network.predict(X_test)

The above code will use the convolutional_neural_network model to make predictions for the test

dataset and store them in the y_predicted_by_model array. For each of the 10 possible digits, a

probability score will be calculated. The class with the highest probability score is the prediction

made by the model. For example, if you want to see what is the digit in the first row of the test set:

y_predicted_by_model[0]

The output will be something like this:

array([3.4887790e-09, 3.4696127e-06, 7.7428967e-07, 2.9782784e-08,
       6.3373392e-08, 6.1983449e-08, 7.4500317e-10, 9.9999511e-01,
       4.2418694e-08, 3.8616824e-07], dtype=float32)

Since it is difficult to pick out the class label with the highest probability score by eye, let's

write one more line of code:

np.argmax(y_predicted_by_model[0])

And with this, you will get one of the ten digits as output (0 to 9).
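To score the whole test set at once, we can take the argmax over every row and compare the result against the true labels (a short sketch of our own):

import numpy as np

# convert per-class probabilities to predicted labels
y_pred_labels = np.argmax(y_predicted_by_model, axis=1)
# fraction of correct predictions on the test set
print(np.mean(y_pred_labels == y_test))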

Conclusion

In this blog, we began by discussing the Convolutional Neural Network and its importance. The

dataset called MNIST was taken to make predictions of handwritten digits from 0 to 9. The dataset

was cleaned, scaled, and shaped. Using TensorFlow, a CNN model was created and was eventually

trained on the training dataset. Finally, predictions were made using the trained model.

Experiment 6:

Cat and Dog Classification using CNN


A Convolutional Neural Network (CNN) operates by applying convolutional layers, utilizing

operations like conv2d to convolve learned filters (kernels) with input images. These filters assign

weights and biases to different aspects of the image, aiding in feature extraction. During training,

batches of labeled images are fed into the network. We compare predictions to ground truth labels

using algorithms like argmax to determine the class with the highest probability. We apply batch

normalization to enhance learning by normalizing the input across batches. The network parameters

are adjusted iteratively to minimize the distance between predictions and labels. This process repeats

for each batch, gradually improving the network’s prediction capabilities.


Dogs vs. Cats Prediction Problem

This tutorial aims to create a system capable of recognizing cat and dog images. It analyzes input

images of cats and images of dogs to make predictions. The implemented model is adaptable for

websites or mobile devices. The Dogs vs Cats dataset, available on Kaggle, comprises images for the
model to learn distinctive features. After training, the classification model distinguishes between cat

and dog images.


Installing Required Packages for Python 3.6

- NumPy -> 1.14.4 [ the image is read and stored in a NumPy array ]
- TensorFlow -> 1.8.0 [ TensorFlow is the backend for Keras ]
- Keras -> 2.1.6 [ Keras is used for implementing the CNN ]

Import Libraries

- NumPy - for working with arrays and linear algebra.
- Pandas - for reading/writing data.
- Matplotlib - to display images.
- TensorFlow Keras models - we need a model to predict, right!
- TensorFlow Keras layers - every NN needs layers, and a CNN needs a couple of specific ones.
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from os import listdir
from sklearn import metrics
from keras.models import Sequential
from keras.layers import Convolution2D
from keras.layers import MaxPooling2D
from keras.layers import Dense
from keras.layers import Flatten
A CNN processes images with the help of matrices of weights known as filters. Early filters

detect low-level features like vertical and horizontal edges; through successive layers, the filters

come to recognize higher-level features.

We first initialize the CNN:

# initializing the cnn
classifier = Sequential()

For compiling the CNN, we are using adam optimizer.

Adaptive Moment Estimation (Adam) is a method used for computing individual learning rates for

each parameter. For the loss function, we are using binary cross-entropy to compare the class output to

each of the predicted probabilities. Then it calculates the penalization score based on the total

distance from the expected value.
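The compile call itself is not shown in the original snippet; once the layers below have been added, a call matching this description might look like the following (a sketch using standard Keras strings):

# compiling the cnn (a sketch; run after the layers are added)
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])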

Image augmentation is a method of applying different kinds of transformation to original images

resulting in multiple transformed copies of the same image. The images are different from each other

in certain aspects because of shifting, rotating, flipping techniques. So, we are using the Keras

ImageDataGenerator class to augment our images.

#part2-fitting the cnn to the images
from keras.preprocessing.image import ImageDataGenerator
train_datagen = ImageDataGenerator(rescale = 1./255,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True)

We need a way to turn our images into batches of data arrays in memory so that they can be fed to

the network during training. ImageDataGenerator can readily be used for this purpose. So, we import

this class and create an instance of the generator. We are using Keras to retrieve images from the

disk with the flow_from_directory method of the ImageDataGenerator class.


# Generating images for the Test set
test_datagen = ImageDataGenerator(rescale = 1./255)
# Creating the Training set
training_set = train_datagen.flow_from_directory('C:/Users/khushi shah/AndroidStudioProjects/catanddog/dataset/training_set',
                                                 target_size = (64, 64),
                                                 batch_size = 32,
                                                 class_mode = 'binary')
# Creating the Test set
test_set = test_datagen.flow_from_directory('C:/Users/khushi shah/AndroidStudioProjects/catanddog/dataset/test_set',
                                            target_size = (64, 64),
                                            batch_size = 32,
                                            class_mode = 'binary')


Convolution

Convolution involves linearly multiplying weights with the input. This multiplication occurs

between an array of input data and a 2D array of weights called a filter or kernel. The filter is

consistently smaller than the input data, and the dot product takes place between the input and filter

array.
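To make the dot-product view concrete, here is a minimal NumPy sketch of our own (not from the article) of a single-channel convolution (strictly, cross-correlation, as implemented in most deep learning libraries):

import numpy as np

def conv2d_single(img, kernel):
    # slide the kernel over the image; each output value is a dot product
    kh, kw = kernel.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

img = np.random.rand(5, 5)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])  # a simple vertical-edge detector
feature_map = conv2d_single(img, kernel)
print(feature_map.shape)  # (3, 3)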

Activation

We add the activation function to assist the Artificial Neural Network (ANN) in learning complex

patterns within the data. The primary purpose of the activation function is to introduce non-linearity

into the neural network.


Pooling

The pooling operation provides spatial variance, making the system capable of recognizing an object

even when its appearance varies. It involves sliding a 2D filter over each channel of the feature map

and summarizing the features lying within the region covered by the filter.

So, pooling basically helps reduce the number of parameters and computations present in the

network. It progressively reduces the spatial size of the network and thus controls overfitting. There

are two types of operations in this layer: average pooling and maximum pooling. Here, we are using

max-pooling, which, as the name suggests, keeps only the maximum value from each pool. As the

filter slides over the input, at each stride the maximum value is taken and the rest are dropped.

Unlike the convolution layer, the pooling layer does not modify the depth of the network.
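A matching NumPy sketch (again our own, for illustration) of 2x2 max pooling with stride 2 over a single feature map:

def max_pool_2x2(feature_map):
    # keep the maximum of each non-overlapping 2x2 window
    # (assumes the feature map has even height and width)
    h, w = feature_map.shape[0] // 2, feature_map.shape[1] // 2
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = feature_map[2 * i:2 * i + 2, 2 * j:2 * j + 2].max()
    return out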

Fully Connected
The fully connected layer receives the flattened output from the final pooling layer.

The Full Connection process practically works as follows:

The neurons in the fully connected layer detect a certain feature, preserve its value, and

communicate that value to both the dog and cat classes, which then examine the feature and decide

whether it is relevant to them.

Full CNN Overview


#step1-convolution
classifier.add(Convolution2D(32, (3, 3), input_shape=(64, 64, 3), activation='relu'))
#step2-maxpooling
classifier.add(MaxPooling2D(pool_size=(2, 2)))
#step3-flattening
classifier.add(Flatten())
#step4-fullconnection
classifier.add(Dense(units=128, activation='relu'))
classifier.add(Dense(units=1, activation='sigmoid'))

We are fitting our model to the training set. It will take some time for this to finish.

classifier.fit_generator(training_set,
                         steps_per_epoch = 8000 // 32,   # Keras 2 argument names: steps = samples / batch size
                         epochs = 25,
                         validation_data = test_set,
                         validation_steps = 2000 // 32)

We reach an accuracy of about 0.8115 on the training set.

We can predict new images with our model using the predict_image function below, where we provide the

path of a new image and call the predict method. If the predicted probability is more than 0.5, the

image is classified as a dog; otherwise, as a cat.

#to predict new images
from keras.preprocessing import image

def predict_image(imagepath, classifier):
    predict = image.load_img(imagepath, target_size = (64, 64))
    predict_modified = image.img_to_array(predict)
    predict_modified = predict_modified / 255
    predict_modified = np.expand_dims(predict_modified, axis = 0)
    result = classifier.predict(predict_modified)
    if result[0][0] >= 0.5:
        prediction = 'dog'
        probability = result[0][0]
        print("probability = " + str(probability))
    else:
        prediction = 'cat'
        probability = 1 - result[0][0]
        print("probability = " + str(probability))
    print("Prediction = " + prediction)

Features Provided

- We can test our own images and verify the accuracy of the model.
- We can integrate the code directly into another project and extend it into a website or mobile application.
- We can extend the project to different entities simply by finding a suitable dataset, changing the dataset, and training the model accordingly.

Conclusion

In this exhilarating journey through the realm of image classification, we delved into the marvels of

Convolutional Neural Networks (CNN). From discerning between cats and dogs to installing

essential Python packages, we’ve left no stone unturned. This beginner-friendly project provides
invaluable insights and sets the stage for exploring diverse applications. With a solid understanding

of CNN fundamentals, you’re now ready to embark on your own image classification escapades!

Don’t forget to leverage techniques like softmax activation and model.predict to further enhance

your models, and don’t overlook key metrics like validation loss (val_loss) when assessing model

performance.

Key Takeaways

- CNNs are essential deep learning models for image classification, capable of automatically learning features from raw pixel data.
- Preprocessing and augmenting image data are crucial steps in CNN training, enhancing model generalization and performance.
- Understanding the components of a CNN, such as convolutional layers and activation functions, is vital for designing effective neural network architectures.
- Practical applications of CNNs extend beyond cat and dog classification, encompassing various domains like medical imaging, object detection, and natural language processing.

Frequently Asked Questions

Q1. Why is Adam the most popular optimizer in Deep Learning?

A. Adam is popular in deep learning due to its adaptive learning rate and momentum features,

improving optimization efficiency.

Q2. How to do Cat and Dog Classification using CNN?

A. Cat and Dog Classification using CNN involves training a convolutional neural network on

labeled cat and dog image data to differentiate between the two classes.

Q3. What is Transfer Learning?


A. In transfer learning, practitioners transfer knowledge from a pre-trained model to a new model,

usually achieved by retraining the output layer on new data.

Q4. Do you have any tutorial that I can follow step by step to generate the Class activation
map?

A. Generating Class Activation Maps involves visualizing which parts of an image are important for

classification, often done by appending a global average pooling layer and visualizing activations.

Q5. How would I predict the images in the test1 data set?

A. To predict images in the test1 dataset, use a trained model on test data, typically resizing images

to match training image size, then generating predictions, often with libraries like PyTorch. Detailed

tutorials are available on platforms like GitHub.
