SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
BONAFIDE CERTIFICATE
This is to certify that this Project Report is the bonafide work of Muthukumari M
(37250016) and Ramalakshmi M (37250019), who carried out the project entitled
“ANIMAL RECOGNITION AND IDENTIFICATION USING MACHINE LEARNING”
under my supervision from November 2020 to April 2021.
Internal Guide
DECLARATION
DATE: 19/04/2021
1.
2.
ACKNOWLEDGEMENT
ABSTRACT
INDEX

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1 Introduction
1.1 Related Work
2 Literature Survey
2.1 Baseline Methods
3 System Design
3.1 Existing System
3.1.1 Drawbacks
3.2 Proposed System
3.2.1 Block Diagram
3.2.2 Project Implementation
3.2.3 Architecture
3.2.4 Intuition behind Capsule Networks
3.2.5 Advantages
4.1 Dataset
4.2 Preprocessing
4.3 Environment
4.4 Results
4.4.1 Input Given
4.4.2 Preprocessing of Input Image
4.4.3 Results
5 Software Design
5.1 Python Version
5.1.1 Python
5.2 Operating System
5.3 Additional Packages
5.3.1 OpenCV – Python
5.3.2 Pillow – Python
REFERENCES

LIST OF ABBREVIATIONS

LIST OF FIGURES

LIST OF TABLES
CHAPTER 1
INTRODUCTION
Non-invasive surveillance of wild animals in national parks and wildlife
sanctuaries is a difficult ecological-monitoring task. Recent achievements in
computer vision and classification techniques, including deep learning, have
allowed researchers to obtain promising results. However, recognizing animal
species in the wild using camera traps remains an unsolved problem due to many
challenges caused by the shooting conditions (varying illumination, different
weather, seasons, and cluttered backgrounds) and by animal or bird behavior
(unpredictable movement, multiple shapes and poses, occlusion by natural
objects) [1].
Camera traps are placed at predefined locations near animal trails, watering
places, salt licks, etc. In the remote territories of national parks and wildlife
sanctuaries, these cameras usually cannot be connected into locally distributed
networks with data transmission and communication. The fixed position of each
camera trap is determined from multi-year observations, and each camera trap is
an automatic device with motion and flash sensors that stores and transmits
information about the movement of animals. As a result, each camera trap
accumulates a great number of captured images or short videos, recorded
whenever motion is detected in the scene. The moving object may be an animal,
a bird, or a human, but our interest is to detect and recognize animals and birds,
excluding humans from further analysis. Several moving objects may also be
detected during a single photo session. If a camera trap stores captured images,
then several images spanning 3–5 s are written to the hard drive as one detected
event. Each image is automatically stamped with the current date, time, and
temperature, so the set of images in the database can be sorted.
The analysis of the dataset captured by camera traps in the Ergaki national park,
Krasnoyarsky Kray, Russia, 2012-2018, after excluding the non-recognizable
images (blurred images, images with low contrast, or an unintelligible pose of the
animal), indicated that, conditionally, the first sub-set contains good
representations of the animal muzzles, the second sub-set includes good
representations of the animal shapes, the third sub-set holds parts of the shapes,
and the fourth sub-set of images contains the whole objects [1]. In this project,
we propose a procedure that categorizes the animal images; images containing
humans are excluded during the categorization procedure.
1.1 RELATED WORK
A conventional set of color-texture features, invariant to light and contrast
variations, is proposed in the referenced paper. If the animal types are known in
advance, special features for animal detection, recognition, and rectification can
be used. Thus, for the individual identification of marbled salamanders, skinks,
and geckos, a feature of strong bilateral symmetry was additionally applied in [1].
In this paper, different types of matching were used: patch-based (multi-scale
principal component analysis and scale-cascaded alignment), local feature-based
(histogram, Scale Invariant Feature Transform (SIFT), and affine invariant
variations), and context-based (hybrid shape-contexts with cascaded
correspondence). Conventional machine learning instruments such as decision
trees, fuzzy logic, genetic algorithms for rule-set generation, generalised linear
models, random forest, and the maximum entropy method are used in species
distribution modelling. Images that are manually cropped or selected and contain
the whole animal shape provide highly accurate results. The results of combining
SIFT, cell-structured local binary patterns, weighted sparse coding for dictionary
learning, and a linear Support Vector Machine (SVM) have been reported. The
authors of the above-mentioned work tested their method on a dataset of 7,000
camera-trap images of 18 species from two different field sites and achieved an
average classification accuracy of 82% [2].
At present, deep learning is employed as the state-of-the-art technique for
recognition in the wild. The most popular Convolutional Networks (ConvNets) for
recognition and classification tasks are the following [1]. AlexNet was a major
milestone in the development of deep ConvNets, winning the 2012 ImageNet
challenge by a large margin; it was the first CNN to use the Rectified Linear Unit
(ReLU) as its nonlinear transformation. Network in Network (NiN), proposed in
[8], was one of the first architectures in which 1×1 convolutions were used to
provide more combinational power to the features of the convolutional layers.
VGG16 was composed of sixteen convolutional layers, multiple max-pool layers,
and three final fully connected layers.
The residual learning model of ResNet was based on connecting the output of
one or more convolutional layers back to their original input. The Dense
Convolutional Network (DenseNet) connected each layer to every other layer in a
feed-forward fashion, which led to better classification performance.
All the above-mentioned works use CNN architectures for identification,
recognition, and detection. The overall species identification accuracy was
33.507% for the bag-of-visual-words approach, while the overall species
recognition accuracy of a deep convolutional neural network was 38.215%.
The deep convolutional neural network contains three convolutional layers and
three max-pooling layers. Each convolutional layer had a convolutional kernel of
size 9×9, while each pooling layer had a kernel of size 2×2. The input layer size
was 128×128, and the third pooling layer, viewed as 32 matrices of size 9×9,
was transformed into a 2,592-dimensional vector. The soft-max layer had 20
neurons, and the maximum output value among those 20 neurons determined
the input image label. Chen et al. employed a relatively small dataset of around
20 classes and 20,000 images; subsequent investigations in this area were
connected with deep CNNs.
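As an illustration, the architecture described above can be sketched in Keras as
follows; the filter counts of the first two convolutional layers are assumptions,
since the text only fixes the last layer at 32 feature maps of size 9×9
(32 × 9 × 9 = 2,592):

```python
# A minimal Keras sketch of the three-layer CNN described above
# (9x9 convolutions, 2x2 max pooling, 128x128 input, 20-way softmax).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 1)),             # grayscale camera-trap crop
    layers.Conv2D(16, (9, 9), activation='relu'),  # 128 -> 120 (filter count assumed)
    layers.MaxPooling2D((2, 2)),                   # 120 -> 60
    layers.Conv2D(32, (9, 9), activation='relu'),  # 60 -> 52 (filter count assumed)
    layers.MaxPooling2D((2, 2)),                   # 52 -> 26
    layers.Conv2D(32, (9, 9), activation='relu'),  # 26 -> 18
    layers.MaxPooling2D((2, 2)),                   # 18 -> 9: 32 maps of size 9x9
    layers.Flatten(),                              # 32 * 9 * 9 = 2,592-dim vector
    layers.Dense(20, activation='softmax'),        # one neuron per species
])
```

The valid-padding arithmetic (128 → 120 → 60 → 52 → 26 → 18 → 9) reproduces
the 2,592-dimensional vector mentioned above.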
Villa et al. studied the AlexNet, VGGNet, GoogLeNet (Inception), and ResNet
architectures for identifying animal species from camera-trap images. At that
time, ConvNets were used as black-box feature extractors. The whole dataset,
from the Serengeti National Park, Tanzania, was divided into raw unbalanced
images, raw balanced images, images with animals in the foreground, and
animals manually segmented from the foreground. Following the ImageNet
recognition challenge, the performance metrics Top-1 (the correct class is the
most probable class) and Top-5 (the correct class is within the five most probable
classes) gave the following results: the method reached 35.4% Top-1 and 60.4%
Top-5 accuracy in the worst-case scenario (unbalanced data with empty images),
while accuracy reached 88.9% Top-1 and 98.1% Top-5 in the best scenario (a
balanced dataset of manually segmented images containing only foreground
animals).
In that work, the authors proposed a Wildlife detector, a CNN-based model
designed to train a binary classifier (with two classes, Animal and Non-animal),
and a Wildlife identifier, another CNN-based model created to train a multi-class
classifier (species identification).
A so-called Lite AlexNet, with fewer hidden layers and fewer feature maps at
each layer, was applied as the Wildlife detector, while two ConvNets, VGG-16
and ResNet-50, were tested as the Wildlife identifier [1]. The approach suggested
in that paper achieved 96% accuracy for animal detection and 90% for
identification of the three most common animals: bird, rat, and bandicoot.
Sometimes, well-known methods created for animal classification and
segmentation using camera-trap images are combined. Thus, Giraldo-Zuluaga et
al. employed several techniques for animal recognition in the Colombian forest,
viz. multi-layer robust principal component analysis for segmentation and a CNN
for extracting features.
Generally, most researchers use datasets obtained from the Serengeti National
Park in Tanzania, from Northern America, or from South-central Victoria,
Australia, without the season changes that occur in northern countries. We are
the first to study animal recognition through all natural seasons in Russia,
developing a combined background model [4].
CHAPTER 2
LITERATURE SURVEY
Currently, animal detection and recognition are still difficult tasks, and there is no
unique method that provides a robust and efficient solution for all situations.
Animal detection algorithms treat animal detection as a binary pattern
classification task [1]: given an input image, it is divided into small blocks, and
each block is transformed into a feature. Features from animals belonging to a
certain class are used to train a classifier; then, given a new input image, the
classifier can check whether the sample is an animal or not. The animal
recognition system can be divided into the following basic applications:
Identification - compares the given animal image to all the other animals in
the database and gives a ranked list of matches (one-to-N matching).
Verification (authentication) - compares the given animal image against a
single claimed identity, confirming or denying the identity of the found
animal (one-to-one matching).
While identification and verification often share the same classification algorithms,
the two modes target distinct applications [1]. For a better understanding of the
animal detection and recognition task and its difficulties, several factors must be
taken into account, because they can cause serious performance degradation in
animal detection and recognition systems.
Since the main goal of our project is to evaluate the performance of CapsNet on
different types of images (faces, traffic signs, everyday objects), we compared its
performance with other methods highlighted in the literature as reliable and
robust for each type of dataset. First, we compare CapsNet with the Fisherface
algorithm [3] on face datasets (the Yale Face Database B and MIT). This
algorithm has proved to be fast, reliable [4,5], and one of the most successful
methods for face recognition [6,7], achieving an average recognition rate of
96.4% on the Yale B Extended Database [8] and 93.29% on the MIT CBCL
dataset.
For the Belgium traffic signs dataset, we chose a CNN architecture known as
LeNet-5 [9], initially designed for handwritten digit recognition. The advantage of
a CNN is that it requires far fewer trainable parameters than a multi-layer
feed-forward neural network, since it supports weight sharing and partially
connected layers. Along with the reduced number of trainable parameters, a
CNN's design also makes the model invariant to translation, a distinctive feature
of state-of-the-art methods for image classification. LeNet-5 is relatively small but
has proved powerful enough to solve several classification problems, including
traffic sign recognition [2].
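For reference, a minimal Keras sketch of the classic LeNet-5 layout is given
below; the 62-class output is an assumption for the Belgium traffic signs dataset,
as the exact class count used is not stated here:

```python
# A hedged sketch of the classic LeNet-5 architecture (LeCun et al.).
from tensorflow.keras import layers, models

lenet5 = models.Sequential([
    layers.Input(shape=(32, 32, 1)),
    layers.Conv2D(6, (5, 5), activation='tanh'),    # C1: 6 feature maps
    layers.AveragePooling2D((2, 2)),                # S2
    layers.Conv2D(16, (5, 5), activation='tanh'),   # C3: 16 feature maps
    layers.AveragePooling2D((2, 2)),                # S4
    layers.Flatten(),
    layers.Dense(120, activation='tanh'),           # C5
    layers.Dense(84, activation='tanh'),            # F6
    layers.Dense(62, activation='softmax'),         # output (class count assumed)
])
```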
CHAPTER 3
SYSTEM DESIGN
3.1 EXISTING SYSTEM
Figure 3.1: Process of CNN

As shown in the figure above, a convolutional neural network is a sequence of
layers that can be categorized into groups, each having a convolutional layer with
a non-linear activation function. Generally there is also a pooling layer, the
Rectified Linear Unit (ReLU), and usually max pooling; these feed into the fully
connected layers, the last of which is the output layer with the predictions for the
image. In standard neural networks, the neurons in each layer are completely
independent, and each neuron is fully connected to all neurons in the previous
layer. The total number of parameters can reach millions, leading to serious
overfitting, and such networks become impractical to train on high-dimensional
data such as natural images. By contrast, in CNNs each neuron is connected only
to a small region of the preceding layer, forming local connectivity. The
convolutional layer computes the outputs of its neurons, which are connected to
local regions in the previous layer; the spatial extent of this connection is
specified by the filter size. Moreover, parameter sharing is an important property:
it dramatically reduces the number of parameters and, with it, the computational
complexity. Compared to regular neural networks with similar-sized layers, CNNs
have fewer connections and parameters, making them easier to train while their
performance is only slightly degraded.
CNNs rest on three key ideas:

Parameter sharing
Local connectivity
Spatial structure

The higher layers exhibit more abstract features of the object, while the lower
layers present detailed features of the image such as edges, curves, and corners.
Apart from using better techniques for preventing overfitting and more powerful
models, the performance of data-driven machine learning approaches depends
strictly on the quality and size of the collected training datasets. Real-life objects
exhibit considerable variability, requiring much larger training sets to learn to
recognize them.
3.1.1 DRAWBACKS
Wildlife-vehicle and wildlife-human encounters often result in injuries and
sometimes fatalities. This work therefore aims to diminish the negative
impacts of these encounters in a way that makes the environment safer for
both humans and animals.
For this project, the main focus is on wild animal data, proving that a
capsule network can do the task better and faster than the equivalent
human labor.
Incidentally, we take advantage of these data to make a comparative study
of multiple deep learning models, specifically VGGNet, ResNet, and a
custom-made convolutional capsule network.
We attempt to identify and count wild animals with camera traps, explore
some state-of-the-art models, and obtain some quite interesting results.
3.2 PROPOSED SYSTEM

A capsule network is a neural network that performs inverse graphics to learn
features from an image.
In computer vision, inverse graphics is a concept that allows the
identification of objects in an image and helps find the instantiation
parameters of each object.
A capsule network effectively uses inverse graphics to produce a
multi-dimensional vector that encapsulates the various components of an object.
3.2.1 BLOCK DIAGRAM
3.2.2 PROJECT IMPLEMENTATION
Our choice of networks was based on performance and diversity. We select
the capsule network as our third type of model to explore. In point of fact,
VGG was a very deep network with very small 3×3 filters; Inception followed
it, advocating sparsity and parameter efficiency through simultaneous
convolutions. The capsule network comes with a completely different way of
learning from images.
The first experiment will use a 3-layer network and very low-resolution
images, to explore the efficiency of capsules on low-resolution images and
the effect of image alterations on capsule network performance.
The second will be a more customized capsule network with 11 layers. Here
we will use more complex images and explore how our custom-built
network performs on them.
3.2.3 ARCHITECTURE
Scalar weighting - Scalar weighting of the input determines which
higher-level capsule should receive the current capsule's output.
Dynamic routing algorithm - This permits the different components to
transfer information amongst each other: lower-level capsules give their
input to the higher-level capsules in a repeated, iterative process.
Squashing function - The last component, which condenses the information,
is the squashing function. It takes all the accumulated information and
converts it into a vector whose length is less than or equal to 1 while
maintaining the direction of the vector.
(Example instantiation parameters of a capsule: X: 25, Y: 10, radius: 5, angle: 0.)
The capsule is the key component of the capsule network. A capsule is a
function that tries to identify the instantiation parameters of a given object. A
capsule outputs an activation vector, which typically has two dimensions,
although activation vectors may have many more dimensions depending on the
goal and the configuration of the capsules.
One dimension of the activation vector is of particular interest: its length. The
length represents the probability of an object being present, based on its
instantiation parameters, and should always be less than or, ideally, equal to 1.
For this purpose, a new activation function was designed for the capsule, the
so-called squash function, which ensures that the activation vectors always have
length at most 1.
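As a sketch, the squash function introduced for capsule networks by Sabour et
al. (2017) computes the output vector v_j from the total input s_j as

v_j = (||s_j||^2 / (1 + ||s_j||^2)) · (s_j / ||s_j||)

so long vectors are scaled to a length just below 1 and short vectors are shrunk
towards 0, while the direction of the vector is preserved.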
In addition, the capsules of the first layer each predict the output of every
capsule in the next layer. This means that, during training, the network tries to
learn the transformation matrices that map a vector in the first layer to a vector
in the next layer. What does this actually mean?
After each capsule in the primary-capsules layer has contributed to the
construction of the next layer, the capsules of the last capsule layer have to
decide what they actually identify. It is a democratic process between capsules,
meaning that the majority of the capsules have to agree on what the output
should be. This process is called routing by agreement. In point of fact, all the
capsules in the last capsule layers depend on the inputs and predictions of the
previous layer and vote on the most suitable object detected in the given image.
This vote is based on all the instantiation parameters the capsules were set to
detect. Indeed, if we have to predict between a house, a boat, and a car, each
primary capsule votes for what the output should be based on what it identifies;
the object with the highest vote is the one predicted. When all the capsules
involved in the process agree on the same object, we speak of routing by
agreement.
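A minimal NumPy sketch of this routing-by-agreement procedure, following the
dynamic routing algorithm of Sabour et al. (2017), is shown below; the array
shapes and the three-iteration default are illustrative assumptions:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||)
    sq = np.sum(s * s, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(u_hat, iterations=3):
    # u_hat: predictions of the lower-level capsules for each higher-level
    # capsule, shape (n_lower, n_upper, dim)
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))              # routing logits
    for _ in range(iterations):
        c = softmax(b, axis=1)                    # coupling coefficients
        s = (c[..., None] * u_hat).sum(axis=0)    # weighted sum per upper capsule
        v = squash(s)                             # upper-capsule outputs
        b += (u_hat * v[None, ...]).sum(axis=-1)  # agreement updates the logits
    return v
```

Capsules whose predictions agree with the consensus output v receive larger
coupling coefficients on the next iteration, which is exactly the voting behaviour
described above.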
(Figure: strong agreement between capsule predictions.)
All these features and mechanisms behind capsule networks allow them to
preserve at least the location and pose of objects throughout the network.
Hence, the capsule network is equivariant. We will use these key concepts to
build custom networks and perform several experiments with capsule-based
networks.
(Table: a brief comparison between capsule networks and convolutional neural
networks.)
3.2.4 Intuition behind Capsule Networks
For instance, take a picture of a cat. How do we identify that it is a cat? The
general approach is to look at its features: individual features such as the eyes,
nose, and ears are extracted from the picture.
Here the essential high-level features are decomposed into low-level ones. For
instance, this can be written as a product of feature probabilities:

P(face) = P(eyes) × P(nose) × P(ears)

where P(face) denotes the probability that the cat's face is present in the picture,
given the presence of its component features. The same can also be done for
low-level features like edges and shapes in order to decrease the complexity of
the procedure.
In this case the process cannot proceed by simply extracting features, because if
the image is rotated, the cat can no longer be identified: the orientation of the
low-level features also changes under rotation, so the predefined features will not
match the rotated ones. The low-level features extracted from the images are
depicted below.

Figure 3.11: Comparison of the extracted features of a rotated and an upright
image of a cat
In this case, a better approach is a brute-force search, which generally means
trying all possible combinations: the low-level features are rotated through all
possible rotations and feature matching is performed for each. One of the main
suggestions by researchers was to include additional properties of the low-level
features themselves, such as the rotational angle; this way the presence of a
feature can be checked even when it is rotated. For instance, notice the picture
given below.
In a more rigorous way, each low-level feature can be represented together with
its own pose parameters, such as the rotational angle, so that feature matching
checks both the presence and the orientation of the feature.
These are important aspects of capsule networks. Dynamic routing is another
important feature of capsule networks; another example is taken to explain that
process.
Now, a picture of a dog and a cat is taken for the classification problem.
Seen as a whole, they are very similar, but there are some significant features in
the images which help us tell them apart.
Figure 3.13: Comparing features of similar pictures of a dog and a cat
As in the previous sub-section, the features in the images that help us identify
them can be defined. The complex features can be defined iteratively to arrive at
a solution, as seen in the image below: initially the very low-level features like the
eyes and ears are defined, and then combined to identify the face; following that,
the facial and body features are combined to decide whether the given image
shows a dog or a cat.
Now suppose a new image is taken and its low-level features are extracted, and
we try to figure out the class of this image. A feature is picked at random, and it
happens to be an eye. The question arises: is this one feature enough to find the
class?
Figure 3.14: Comparing it with one feature - the eye
It is noticed that the eye alone is not a differentiating factor, so the next step is
to add more features to the analysis. The next feature picked at random is the
nose; for the moment, only the intermediate-level and low-level features are
considered.
Figure 3.15: Comparing it with two features - the nose and the eye
Still, this is not enough for classification. The next step is to consider all the
features, combine them, and produce the output by estimating the class. In this
example, the eyes, nose, ears, and whiskers are combined; it turns out to be
more probable that they form a cat's face rather than a dog's face, so more
weight is given to these features when performing cat recognition on the given
image. This step is done iteratively at each feature level so that the correct
information is routed to the feature detectors that need it.
Simply put, each lower level tries to figure out the most probable output at its
immediate higher level. The question is then whether the high-level feature will
activate when it receives the information from the lower-level features. If it does
activate, the lower-level feature passes its information up to the higher level; if
the information is irrelevant to that feature detector, it is not passed on.
In capsule terms, the higher-level capsule receives the information from the
lower level that agrees with its input. This is the essence of the dynamic
routing algorithm.
These two aspects, dynamic routing and equivariance, are the most essential
features of the capsule network and make it stand apart from traditional deep
learning architectures.
Capsule networks do come with their own difficulties, such as requiring more
training time and resources than comparable deep learning architectures. But it
is just a matter of time before researchers figure out how to tune them and move
them from the current research phase into production.
3.2.5 ADVANTAGES
Dynamic routing is used for the selection of parent capsules: the lower-level
capsules themselves decide which parent capsule receives their output.
CHAPTER 4
The main goal of our project is to automatically identify animals in the wild.
Incidentally, we take advantage of this project to explore capsule networks on a
large and complex dataset; that is why a comparative study between capsule
networks and convolutional neural networks is the second axis of our project.
To cover both axes, we first explore state-of-the-art architectures and key
pre-processing techniques to help us achieve our main goal. This provides a
strong foundation for determining which factors can improve on state-of-the-art
models for the same task. We should note that we are aiming at full automation,
so we are interested in top-1 accuracy throughout the research. Moreover, we
explore capsule networks by engineering custom models and comparing their
performances.
After gathering the performances of our multiple experiments, we first benchmark
our work against several results obtained by others who performed a similar
task, then compare our performances with each other. More precisely, we first
compare our results with recent research that attempted this kind of
classification task using state-of-the-art models. Then we compare our capsule
network results with those of research that attempted it on large and complex
datasets. Finally, we compare our own results with each other and take the best
of them. This allows us to give a prospective analysis of what to improve and of
the shortcomings of using one methodology over another.
Our choice of networks was based on performance and diversity, and we
selected the capsule network as our third type of model to explore. We saw VGG
as a very deep network with very small 3×3 filters, followed by Inception, a
network advocating sparsity and simultaneous convolutions on top of parameter
efficiency. The capsule network comes with a completely different way of
learning from images: instead of dealing with pixels, it deals with instantiation
vectors, as discussed earlier.
Capsule network theory lends some confidence to image classification, as the
models are built to be robust to rotation and perform inverse graphics to ensure
that the proper vectorial features of an object in an image are extracted.
However, the best result achieved with a capsule network was a 0.25% error rate
on the MNIST dataset. Motivated by the promising future of this new technique,
we decided to explore it on datasets more complex than MNIST.
We are going to perform two sets of experiments on capsule networks.
The first one will use a very low-resolution image and a 3-layer network,
to explore the efficiency of capsules on low-resolution images and the
effect of image alterations on capsule network performance.
The second one will be a more customized capsule network with eleven
layers. Here we are going to use more complex images (300×300) and
explore how our custom-built network performs on them.
For each of our capsule networks, we will perform image reconstruction to
minimize the chances of overfitting and to ensure that our network actually
learns the general patterns in the images. The details of these experiments
are given in the next chapter.
4.1 DATASET
We used a complex dataset from which twelve animals can be identified.
Due to the exhaustive nature of the capsule network and our limited resources,
we had to create a significant subset of this dataset. We noticed a huge class
imbalance in the data, with a few classes accounting for most of the images.
Hence, we decided to go with the 12 most frequent animals for this project:
elephant, lion, tiger, bear, zebra, deer, monkey, fox, cow, pig, rabbit, and
squirrel. For these experiments, we split the data into training, validation, and
testing sets: 80% of the data went towards training and validation, and 20%
went towards testing. Of the 80% allocated to training and validation, 20%
went to validation and the other 80% was purely the training set.
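A hedged scikit-learn sketch of this 80/20-then-80/20 split is shown below; the
`images` and `labels` variables are illustrative placeholders for the loaded data:

```python
from sklearn.model_selection import train_test_split

# images, labels: loaded image arrays and their 12 class labels (placeholders)
train_val_x, test_x, train_val_y, test_y = train_test_split(
    images, labels, test_size=0.20, stratify=labels)   # 20% held-out test set
train_x, val_x, train_y, val_y = train_test_split(
    train_val_x, train_val_y, test_size=0.20, stratify=train_val_y)
# Net result: 64% training, 16% validation, 20% testing.
```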
4.2 PREPROCESSING
Pre-processing has been a keep part of the work we did here as it allowed us to
have better models, and it increased the performances from one operation to
another and it eased our workflow. We identified a few things that might be a
problem for our experiment.
The first thing was to identify the class distribution from the metadata file in the
Serengeti official website. Once that was done, we selected the 10 most occurring
classes which happened to account for a good portion of the images. Then, we
need to prepare a script to download these images and classify them by folders
based on their label. This operation really eased the processed when I came to
input the data into the models.
Secondly, each image was large due to its resolution. Keeping full-HD
(1,920×1,080) images would be computationally expensive for every single
operation on every single image. As a solution, we decided to resize all the
images to 300×300, because most state-of-the-art models need less than this
to achieve good results; for example, Inception, which requires the biggest input
size, uses 299×299 images.
The next issue was that some images were taken during the day and others in
the dark. We hypothesized that this could bias the model if day and night images
were not balanced, with the model tending to learn more about one than the
other. To solve this, we applied a colour alteration to all the images,
transforming them to grayscale and then applying another grayscale
transformation. This removes the presumed bias from all the images.
Figure 4.1: Example of the double gray-scale transformation

The first set of images represents the first transformation, where we convert all
the images to gray-scale once while keeping the 3 colour channels. The second
set represents the result of applying the gray-scale transformation a second
time.
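A minimal OpenCV sketch of this resize-and-double-grayscale pipeline is given
below; the function name and file path are illustrative, and the report does not
state the exact call sequence it used:

```python
import cv2

def double_grayscale(path):
    img = cv2.imread(path)                                # original BGR image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)          # first grayscale pass
    gray3 = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)        # keep 3 colour channels
    gray_again = cv2.cvtColor(gray3, cv2.COLOR_BGR2GRAY)  # second grayscale pass
    return cv2.resize(gray_again, (300, 300))             # 300x300, as in Section 4.2
```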
This transformation also enables us to get rid of our third problem, the weather.
These transformations will allow the model not to be biased by the weather but to
focus only on the shapes and hard colour variation (intensity) of the animals.
4.3 ENVIRONMENT
The type of resources we used for this research was key factor in the
achievements or no achievements we have had so far. In terms of hardware, we
used two different devices.
o The first one was a Mac Pro with a 3.5 GHz 6-core Intel Xeon E5 CPU,
two AMD graphics cards with 3 GB of memory each, 32 GB of RAM,
and 512 GB of SSD storage.
This was the initial hardware used at the early stage of our research. It
helped us investigate the pros and cons of most of our models and perform
the literature review. However, when we tried to start the actual
experiments, this hardware proved quite limited for the following reasons:
the storage on the device was not big enough to hold the dataset, and there
was no NVIDIA GPU, which prevented us from using CUDA and the
parallel-processing abilities of the graphics processing unit. Finally, it
served as a terminal station to ssh into the main server presented below.
4.4 RESULTS
4.4.1 Input Given

Initially, the input is given as a picture from our system. The network has been
trained on twelve animals, and a picture of any one of them is given as input.
For example, here the image of an elephant is given as the input image to the
network.
4.4.2 Processing of Input Image
In this application, the image from the camera trap is taken and given as input
to identify the animal. The captured image can be from any time of day, so the
lighting varies, and the environmental conditions may also vary with the season:
summer, winter, spring, or autumn. Nevertheless, the captured animal should be
identified successfully in order to determine what animal is intruding.
Considering all these factors, image pre-processing techniques are applied: the
input image is processed, and the gray image, HSV image, and binary image of
the input are obtained.
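A minimal OpenCV sketch of how these three representations can be obtained is
shown below; the threshold value of 127 is an assumption, as the report does
not specify it:

```python
import cv2

img = cv2.imread('input.jpg')                    # illustrative input path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)     # gray image
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)       # HSV image
_, binary = cv2.threshold(gray, 127, 255,
                          cv2.THRESH_BINARY)     # binary image (assumed threshold)
```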
Figure 4.5: Binary Image
Figure 4.6: Result Image
4.4.3 Result
Finally, the result is obtained from the program. An alert message pops up on
the screen; as the elephant picture was given as input, it pops up that an
elephant is detected. Following that, the message is also shown in the Python
shell window.
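A minimal sketch of such an alert in Python 2.7 (see Section 5.1) is given below;
the report does not name its GUI toolkit, so Tkinter, which ships with Python, is
assumed:

```python
import Tkinter as tk    # Python 2.7 module names
import tkMessageBox

label = 'Elephant'                          # illustrative detection result
print('%s is detected' % label)             # message in the Python shell
root = tk.Tk(); root.withdraw()             # hide the empty root window
tkMessageBox.showinfo('Alert', '%s is detected!' % label)
```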
By giving various inputs to the network, the accuracy of the network is
calculated. The overall accuracy we obtained is 96.48%, which is better than
traditional convolutional neural networks. Working with the capsule network
algorithm thus shows it to be a better algorithm than CNNs, one that addresses
their drawbacks.
CHAPTER 5
SOFTWARE DESIGN
5.1 PYTHON VERSION

Version: 2.7.18
5.1.1 PYTHON
Python interpreters are available for installation on many operating systems,
which allows Python code to execute on a wide variety of systems. Using
third-party tools, such as Py2exe or PyInstaller, Python code can be packaged
into stand-alone executable programs for the most popular operating systems,
allowing Python-based software to be distributed for use in environments where
no Python interpreter is installed.
The exact syntax and semantics of the Python language are described by the
Python Language Reference, while the standard library that is distributed with
Python is described by the Library Reference, which also documents optional
components commonly included in Python distributions.
Python's standard library is very extensive, offering a wide range of facilities. It
provides access to system functionality, such as file input and output, that would
otherwise be inaccessible to Python users, as well as modules written in Python
that provide standardized solutions for many problems that occur in everyday
programming; it also contains built-in modules written in C. Some of these
modules are explicitly designed to encourage and enhance the portability of
Python programs by abstracting platform specifics into platform-neutral APIs.
The Python installers for the Windows platform usually include the entire
standard library and often also include many additional components. On
Unix-like operating systems, Python is normally provided as a collection of
packages, so it may be necessary to use the packaging tools provided with the
operating system to obtain some or all of the optional components. In addition to
the standard library, there is a growing collection of several thousand
components, from individual programs and modules to packages and entire
application-development frameworks, available from the Python Package Index.
5.2 OPERATING SYSTEM

Windows 10

5.3 ADDITIONAL PACKAGES

Pillow
5.3.2 PILLOW – PYTHON
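As a hypothetical example of the kind of Pillow usage this project relies on (the
exact calls are not shown in the report), an image can be opened, converted to
grayscale, and resized as follows:

```python
from PIL import Image

img = Image.open('camera_trap.jpg')    # illustrative input path
gray = img.convert('L')                # 'L' mode = 8-bit grayscale
small = gray.resize((300, 300))        # match the 300x300 input size
small.save('camera_trap_300.jpg')
```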
CHAPTER 6
6.1 FUTURE WORK

With improvements and the availability of more resources, this project will
explore more areas of this research: how to widen the capsule network and the
impact this has on complex datasets; using transfer learning for feature
extraction and feeding the features to the capsules; exploring new preprocessing
techniques that might improve our models; studying the behavior of capsule
networks without prior convolutions; and finally repeating the whole experiment
on a complex dataset.
The first perspective we plan to explore next is the impact of having bigger or
smaller capsules, along with the number of capsules per layer. This inspiration is
driven by the idea behind Inception networks; however, it is doubtful that it will
reduce the computational needs of the network as it does in Inception.
6.2 CONCLUSION
In this project, a dataset consisting of twelve animals has been explored, and we
tried to out-perform existing work that attempted the same identification task.
Moreover, a network quite complex to implement has been explored. For the
former, the project succeeded in pushing our top-1 accuracy up to 96.48%.
As for the latter, it has been shown that capsule networks have a big potential
for recognizing complex objects in very low-quality images. However, the more
depth and importance given to the capsules, the higher the computational
expense rises, and it rises exponentially. The project also experimented with
manual and convolutional dimensionality reduction applied to a capsule network;
a capsule network learns better from convolutional cropping, which indeed gives
it the ability to be built for any size and resolution of images.
The capacities of capsules seem limited only by the huge computational
resources needed to run larger networks on larger datasets. However, we expect
this to become less of a problem in the coming years as computational power
becomes more available to the general public.
REFERENCES
5. Joel Kamdem Teto and Ying Xie (2018), "Automatic Identification of Animals
in the Wild: A Comparative Study between C-Capsule Networks and Deep
Convolutional Neural Networks", Kennesaw State University, pp. 21-28.
6. Edgar Xi, Selina Bing and Yang Jin (2017), "Capsule Network Performance
on Complex Data", Carnegie Mellon University, Pittsburgh, pp. 1-3.
7. Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton (2017), "ImageNet
Classification with Deep Convolutional Neural Networks", Communications of
the ACM, pp. 86-89.
8. Amara Dinesh Kumar, R. Karthika and Latha Parameswaran (2018), "Novel
Deep Learning Model for Traffic Sign Detection Using Capsule Networks",
International Journal of Pure and Applied Mathematics, pp. 4543-4546.