Machine Learning 4th Unit
A deep neural network (DNN) is an ANN with multiple hidden layers between the input and
output layers. Similar to shallow ANNs, DNNs can model complex non-linear relationships.
The main purpose of a neural network is to receive a set of inputs, perform progressively
complex calculations on them, and give output to solve real world problems like
classification. We restrict ourselves to feed forward neural networks.
We have an input, an output, and a flow of sequential data in a deep network.
Neural networks are widely used in supervised learning and reinforcement learning problems.
These networks are based on a set of layers connected to each other.
In deep learning, the number of hidden layers, mostly non-linear, can be large; say about
1000 layers.
Given enough training data, DL models often produce better results than conventional ML models.
We mostly use the gradient descent method for optimizing the network and minimising the
loss function.
We can use ImageNet, a repository of millions of labelled digital images, to train a network that
classifies images into categories like cats and dogs. DL nets are increasingly used for dynamic
images as well as static ones, and for time series and text analysis.
Training on data sets forms an important part of Deep Learning models. In addition,
backpropagation is the main algorithm used in training DL models.
DL deals with training large neural networks with complex input output transformations.
One example of DL is mapping a photo to the name of the person(s) in it, as is done on
social networks. Describing a picture with a phrase is another recent application of
DL.
Neural networks are functions that have inputs like x1,x2,x3…that are transformed to outputs
like z1,z2,z3 and so on in two (shallow networks) or several intermediate operations also
called layers (deep networks).
The weights and biases change from layer to layer. ‘w’ and ‘v’ are the weights or synapses of
layers of the neural networks.
The best use case of deep learning is the supervised learning problem. Here, we have a large set
of data inputs with a desired set of outputs.
RNNs are neural networks in which data can flow in any direction. These networks are used
for applications such as language modelling or Natural Language Processing (NLP).
The basic concept underlying RNNs is to utilize sequential information. In a normal neural
network it is assumed that all inputs and outputs are independent of each other. If we want to
predict the next word in a sentence, we need to know which words came before it.
CNNs are extensively used in computer vision and have also been applied in acoustic modelling
for automatic speech recognition.
The idea behind convolutional neural networks is the idea of a “moving filter” which passes
through the image. This moving filter, or convolution, applies to a certain neighbourhood of
nodes which for example may be pixels, where the filter applied is 0.5 x the node value −
Noted researcher Yann LeCun pioneered convolutional neural networks. Facebook uses these nets
in its facial recognition software. CNNs have been the go-to solution for machine vision
projects. There are many layers to a convolutional network. In the ImageNet challenge, a machine
was able to beat a human at object recognition in 2015.
In a nutshell, Convolutional Neural Networks (CNNs) are multi-layer neural networks that
assume the input data to be images. They sometimes have 17 or more layers.
CNNs drastically reduce the number of parameters that need to be tuned. So, CNNs
efficiently handle the high dimensionality of raw images.
It is assumed that the reader knows the concept of neural networks.
When it comes to Machine Learning, Artificial Neural Networks perform really well.
Artificial Neural Networks are used in various classification tasks involving images, audio
and words. Different types of Neural Networks are used for different purposes. For
example, for predicting a sequence of words we use Recurrent Neural Networks,
more precisely an LSTM; similarly, for image classification we use Convolutional
Neural Networks. In this blog, we are going to build a basic building block for a CNN.
Before diving into the Convolution Neural Network, let us first revisit some concepts
of Neural Network. In a regular Neural Network there are three types of layers:
1. Input Layers: It’s the layer in which we give input to our model. The
number of neurons in this layer is equal to the total number of features in
our data (number of pixels in the case of an image).
2. Hidden Layer: The input from the input layer is then fed into the hidden
layer. There can be many hidden layers depending upon our model and
data size. Each hidden layer can have a different number of neurons, which
is generally greater than the number of features. The output from each
layer is computed by matrix multiplication of the output of the previous layer
with the learnable weights of that layer, then the addition of learnable
biases, followed by an activation function which makes the network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a logistic
function like sigmoid or softmax which converts the output of each class
into the probability score of each class.
The data is then fed into the model and the output from each layer is obtained; this step is
called feedforward. We then calculate the error using an error function; some common
error functions are cross-entropy and squared error. After that, we backpropagate
through the model by calculating the derivatives. This step is called backpropagation,
and it is used to minimize the loss.
Here’s the basic python code for a neural network with random inputs and two hidden
layers.
Python
import numpy as np

activation = lambda x: 1.0/(1.0 + np.exp(-x)) # sigmoid function
input = np.random.randn(3, 1) # random input: 3 features, 1 sample
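The listing above stops after the input is created. As a minimal sketch of the forward pass it describes, assuming layer sizes of 3, 4, 4 and 1 (these sizes and the random weights are illustrative assumptions, not taken from the original code), the two hidden layers could be wired up like this:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def sigmoid(x):
    """Sigmoid activation, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


# Assumed layer sizes: 3 inputs, two hidden layers of 4 neurons, 1 output.
sizes = [3, 4, 4, 1]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]


def forward(x):
    """Feed a column vector through every layer: sigmoid(W @ a + b)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a


x = rng.standard_normal((3, 1))  # random input with 3 features
output = forward(x)
print(output.shape)
```

Each layer here is exactly the matrix multiplication, bias addition and activation described for the hidden layer; training (error calculation and backpropagation) is not shown.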
Convolution Neural Network
Convolution Neural Networks, or convnets, are neural networks that share their
parameters. Imagine you have an image. It can be represented as a cuboid having a
length and width (the dimensions of the image) and a height (as images generally have red,
green, and blue channels).
Now imagine taking a small patch of this image and running a small neural network
on it, with, say, k outputs, and representing them vertically. Now slide that neural network
across the whole image. As a result, we get another image with a different width,
height, and depth. Instead of just R, G, and B channels we now have more channels
but a smaller width and height. This operation is called convolution. If the patch size is
the same as that of the image, it is a regular neural network. Because of this small
patch, we have fewer weights.
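To see why a small patch means fewer weights, the sketch below compares parameter counts. The sizes (a 32 x 32 x 3 image, a 5 x 5 patch and k = 16 outputs) are assumptions chosen only for illustration.

```python
def dense_params(height, width, channels, k):
    """Weights plus biases when the 'patch' covers the whole image."""
    return height * width * channels * k + k


def conv_params(patch, channels, k):
    """Weights plus biases for one patch-sized network shared across the image."""
    return patch * patch * channels * k + k


# Assumed sizes, for illustration only.
full = dense_params(32, 32, 3, 16)   # 32*32*3*16 + 16
shared = conv_params(5, 3, 16)       # 5*5*3*16 + 16
print(full, shared)
```

Because the same patch-sized network is reused at every position, the count depends on the patch size and not on the image size.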
Artificial neural network tutorial covers all the aspects related to the artificial neural
network. In this tutorial, we will discuss ANNs, Adaptive resonance theory, Kohonen
self-organizing map, Building blocks, unsupervised learning, Genetic algorithm, etc.
The typical Artificial Neural Network looks something like the given figure.
Biological neuron    Artificial neural network
Dendrites            Inputs
Axon                 Output
There are roughly 86 billion neurons in the human brain. Each neuron has
somewhere between 1,000 and 100,000 connection points. In the human brain,
data is stored in a distributed manner, and we can extract more than
one piece of this data from memory in parallel when necessary. We can say that
the human brain is made up of incredibly powerful parallel processors.
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the
programmer.
Hidden Layer:
The hidden layer sits between the input and output layers. It performs all the
calculations to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which
finally results in output that is conveyed using this layer.
The artificial neural network takes input and computes the weighted sum of the
inputs and includes a bias. This computation is represented in the form of a transfer
function.
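As a small illustration of the weighted sum plus bias, here is a sketch for a single neuron. The input values, weights and bias are made-up numbers, not taken from the text.

```python
def weighted_sum(inputs, weights, bias):
    """Net input of one neuron: sum of w_i * x_i, plus the bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Illustrative values only.
net = weighted_sum(inputs=[1.0, 2.0, 3.0], weights=[0.5, -1.0, 2.0], bias=0.5)
print(net)
```

The resulting net input is what is then passed through the transfer (activation) function.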
Artificial neural networks work with numerical values and can perform more than one
task simultaneously.
Unlike traditional programming, where data is stored in a database, information is stored
across the whole network. The disappearance of a few pieces of data in one place doesn't
prevent the network from working.
After training, an ANN may produce output even with incomplete data.
The loss of performance depends on the importance of the missing data.
Corruption of one or more cells of an ANN does not prevent it from generating output,
and this feature makes the network fault-tolerant.
Unexplained behaviour is the most significant issue of ANNs. When an ANN produces a solution, it
does not provide insight into why and how. This decreases trust in the network.
Hardware dependence:
Artificial neural networks need processors with parallel processing power, as per their
structure. Their realization therefore depends on suitable hardware.
ANNs can work only with numerical data. Problems must be converted into numerical
values before being introduced to an ANN. The representation chosen
here directly impacts the performance of the network, and it relies on the user's
abilities.
Training stops when the error falls to a chosen value, and this value does not guarantee
optimum results.
Artificial neural networks, which entered the world of science in the mid-20th century, are
developing rapidly. Above, we have examined the pros of artificial neural
networks and the issues encountered in the course of their use. It should not be overlooked
that the cons of ANNs, which are a flourishing branch of science, are being eliminated one by one,
while their pros are increasing day by day. This means that artificial neural networks will
become a progressively more important and irreplaceable part of our lives.
If the weighted sum is equal to zero, then a bias is added to make the output non-zero,
or otherwise to scale up the system's response. The bias always has the same input, and
its weight equals 1. Here the total of weighted inputs can be in the range of 0 to
positive infinity. To keep the response within the limits of the desired value, a
certain maximum value is set as a benchmark, and the total of weighted inputs is passed
through the activation function.
The activation function refers to the set of transfer functions used to achieve the
desired output. There are different kinds of activation functions, but they are primarily
either linear or non-linear. Some of the commonly used
activation functions are the binary, linear, and tan hyperbolic sigmoidal activation
functions. Let us take a look at each of them in detail:
Binary:
In the binary activation function, the output is either 1 or 0. To accomplish
this, a threshold value is set. If the net weighted input of the neuron is more
than the threshold (here 1), the final output of the activation function is 1; otherwise the
output is 0.
Sigmoidal Hyperbolic:
The sigmoidal hyperbolic function is generally seen as an "S"-shaped curve. Here the
tan hyperbolic function is used to approximate the output from the actual net input. The
function is defined as:
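The formula itself is missing from these notes. As a hedged sketch, the standard hyperbolic tangent, tanh(x) = (e^x - e^-x) / (e^x + e^-x), is shown below next to the binary step function. The function names and the explicit threshold argument are our own choices for illustration.

```python
import math


def binary_step(net_input, threshold):
    """Return 1 when the net input exceeds the threshold, otherwise 0."""
    return 1 if net_input > threshold else 0


def tanh_activation(net_input):
    """Hyperbolic tangent: an S-shaped curve with outputs between -1 and 1."""
    e_pos, e_neg = math.exp(net_input), math.exp(-net_input)
    return (e_pos - e_neg) / (e_pos + e_neg)


print(binary_step(1.7, threshold=1.0), binary_step(0.2, threshold=1.0))
print(tanh_activation(0.0), tanh_activation(2.0))
```

In practice the built-in math.tanh gives the same values and is safer for very large inputs.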
Feedback ANN:
In this type of ANN, the output is fed back into the network to achieve the best-evolved
results internally. According to the University of Massachusetts Lowell Centre for
Atmospheric Research, feedback networks feed information back into themselves and
are well suited to solving optimization problems. Internal system error corrections
make use of feedback ANNs.
Feed-Forward ANN:
A feed-forward network is a basic neural network consisting of an input layer, an output
layer, and at least one layer of neurons. By assessing its output in light of its
input, the strength of the network can be observed from the group behavior of the connected
neurons, and the output is decided. The primary advantage of this network is that it learns
to evaluate and recognize input patterns.
Convolutional Neural Network
Convolutional Neural Networks are one of the main categories of neural network used for image
classification and image recognition. Scene labeling, object
detection, and face recognition are some of the areas where convolutional
neural networks are widely used.
A CNN takes an image as input, processes it, and classifies it under a certain
category such as dog, cat, lion, tiger, etc. The computer sees an image as an array of
pixels whose size depends on the resolution of the image. Based on image resolution, it
sees h * w * d, where h = height, w = width and d = depth (number of channels). For example,
an RGB image might be a 6 * 6 * 3 array, and a grayscale image a 4 * 4 * 1
array.
In a CNN, each input image passes through a sequence of convolution layers (with filters,
also known as kernels), pooling layers, and fully connected layers. After that, we
apply the softmax function to classify an object with probabilistic values between 0 and 1.
Convolution Layer
The convolution layer is the first layer to extract features from an input image. By
learning image features using small squares of input data, the convolutional layer
preserves the relationship between pixels. Convolution is a mathematical operation which takes
two inputs: an image matrix and a kernel or filter.
o The dimension of the image matrix is h×w×d.
o The dimension of the filter is fh×fw×d.
o The dimension of the output is (h-fh+1)×(w-fw+1)×1.
Let's start by considering a 5*5 image whose pixel values are 0 or 1, and a 3*3 filter
matrix.
The result of convolving the 5*5 image matrix with the 3*3 filter matrix is called the
"Feature Map" and is shown as the output.
Convolving an image with different filters can perform operations such as blurring,
sharpening, and edge detection.
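The figure with the actual matrices is not reproduced in these notes, so the sketch below uses assumed values for the 5*5 image and the 3*3 filter. It slides the filter with stride 1 and no padding, which gives a (5-3+1) * (5-3+1) = 3*3 feature map.

```python
import numpy as np


def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the products.

    As in most deep learning libraries, the kernel is not flipped
    (strictly speaking this is cross-correlation).
    """
    h, w = image.shape
    fh, fw = kernel.shape
    out = np.zeros((h - fh + 1, w - fw + 1), dtype=image.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * kernel)
    return out


# Assumed example values (the original figure is not available here).
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
feature_map = convolve2d_valid(image, kernel)
print(feature_map)
```

Swapping in a different kernel (for example one that averages its neighbourhood) is how effects such as blurring are obtained.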
Strides
Stride is the number of pixels by which the filter shifts over the input matrix. When the stride is
equal to 1, we move the filter 1 pixel at a time; similarly, if the stride is
equal to 2, we move the filter 2 pixels at a time. The following figure
shows how the convolution works with a stride of 2.
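The effect of the stride on the output size can be checked with the commonly used formula floor((n + 2p - f) / s) + 1 along each axis. The helper name and the 7-pixel input below are illustrative assumptions.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Output length along one axis: floor((n + 2*padding - f) / stride) + 1."""
    return (n + 2 * padding - f) // stride + 1


# A 7-pixel-wide input with a 3-pixel-wide filter (illustrative sizes).
size_stride_1 = conv_output_size(7, 3, stride=1)
size_stride_2 = conv_output_size(7, 3, stride=2)
print(size_stride_1, size_stride_2)
```

With stride 1 and no padding the formula reduces to n - f + 1, matching the output dimension given for the convolution layer.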
Padding
Padding plays a crucial role in building a convolutional neural network. Each convolution
shrinks the image, so if we take a neural network with hundreds of layers, it will
give us a very small image at the end.
If we take a three by three filter on top of a grayscale image and do the convolving,
then what will happen?
It is clear from the above picture that a pixel in the corner is covered only one
time, but a middle pixel is covered more than once. It means that we have
more information on the middle pixels, so there are two downsides:
o Shrinking outputs
o Losing information on the corner of the image.
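A common remedy for both downsides is to add a border of zeros around the image before convolving. The sketch below, using an assumed 4 x 4 image and NumPy's np.pad, shows that a one-pixel border keeps the output of a 3 x 3 filter at the original size.

```python
import numpy as np

image = np.arange(1, 17).reshape(4, 4)  # illustrative 4 x 4 grayscale image

# Add a one-pixel border of zeros on every side: 4 x 4 becomes 6 x 6.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

filter_size = 3
out_without = image.shape[0] - filter_size + 1   # 4 - 3 + 1 = 2 (shrinks)
out_with = padded.shape[0] - filter_size + 1     # 6 - 3 + 1 = 4 (size preserved)
print(padded.shape, out_without, out_with)
```

With the border in place, the original corner pixels are also covered by several filter positions instead of just one.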
Pooling Layer
The pooling layer plays an important role in the processing of an image. It
reduces the number of parameters when the images are too large. Pooling is
"downscaling" of the image obtained from the previous layers. It can be compared
to shrinking an image to reduce its pixel density. Spatial pooling is also called
downsampling or subsampling; it reduces the dimensionality of each map but
retains the important information. There are the following types of spatial pooling:
Max Pooling
Max pooling is a sample-based discretization process. Its main objective is to
downscale an input representation, reducing its dimensionality and allowing
assumptions to be made about the features contained in the binned sub-regions.
Average Pooling
Average pooling downscales the input by computing the average value of each pooling region.
Syntax (MATLAB)
layer = averagePooling2dLayer(poolSize)
layer = averagePooling2dLayer(poolSize,Name,Value)
Sum Pooling
The sub-regions for sum pooling or mean pooling are set exactly the same as
for max pooling, but instead of the max function we use the sum or mean.
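The three pooling variants differ only in the function applied to each sub-region. Here is a minimal NumPy sketch with non-overlapping 2 x 2 regions; the helper pool2d and the feature map values are assumptions for illustration.

```python
import numpy as np


def pool2d(feature_map, size, reduce_fn):
    """Split the map into non-overlapping size x size regions and reduce each one."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            region = feature_map[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = reduce_fn(region)
    return out


# Illustrative 4 x 4 feature map.
fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 2, 8]])
max_pooled = pool2d(fmap, 2, np.max)
avg_pooled = pool2d(fmap, 2, np.mean)
sum_pooled = pool2d(fmap, 2, np.sum)
print(max_pooled)
print(avg_pooled)
print(sum_pooled)
```

Each 4 x 4 map is reduced to 2 x 2, which is the downscaling described above.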
In the above diagram, the feature map matrix is flattened into a vector
x1, x2, x3, ..., xn and passed to the fully connected layers. These layers combine the features
to create a model, and an activation function such as softmax or sigmoid is applied to
classify the outputs as a car, dog, truck, etc.
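A rough sketch of this last stage follows: flatten the feature maps, apply one fully connected layer, then softmax. The map sizes and the random (untrained) weights are assumptions, so the predicted class here is meaningless; only the shapes and the probability output matter.

```python
import numpy as np


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)


rng = np.random.default_rng(0)
feature_maps = rng.standard_normal((2, 3, 3))    # assumed: 2 pooled maps of size 3 x 3
x = feature_maps.reshape(-1)                     # flatten to the vector x1 ... xn (n = 18)

classes = ["car", "dog", "truck"]
W = rng.standard_normal((len(classes), x.size))  # fully connected weights (random here)
b = np.zeros(len(classes))
probs = softmax(W @ x + b)
print(classes[int(np.argmax(probs))], probs)
```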
RECURRENT NEURAL NETWORK
Recurrent neural networks (RNNs) are a powerful class of algorithms for
sequential data and are used by Apple's Siri and Google's voice search.
The RNN was among the first algorithms to remember its input, thanks to an internal
memory, which makes it well suited for machine learning problems
that involve sequential data. It is one of the algorithms behind the scenes
of the amazing achievements seen in deep learning over the past few years.
In this post, we'll cover the basic concepts of how recurrent neural
networks work, what the biggest issues are and how to solve them.
Table of Contents
Introduction
How it works: RNN vs. Feed-forward neural network
Backpropagation through time
Two issues of standard RNNs: Exploding gradients & vanishing gradients
LSTM: Long short-term memory
Summary
Since RNNs are being used in the software behind Siri and Google
Translate, recurrent neural networks show up a lot in everyday life.
Sequential data is basically just ordered data in which related things follow
each other. Examples are financial data or the DNA sequence. The most
popular type of sequential data is perhaps time series data, which is just a
series of data points that are listed in time order.
Imagine you have a normal feed-forward neural network and give it the
word "neuron" as an input and it processes the word character by
character. By the time it reaches the character "r," it has already forgotten
about "n," "e" and "u," which makes it almost impossible for this type of
neural network to predict which character would come next.
Simply put: recurrent neural networks add the immediate past to the
present.
Therefore, an RNN has two inputs: the present and the recent past. This is
important because the sequence of data contains crucial information about
what is coming next, which is why an RNN can do things other algorithms
can't.
A feed-forward neural network assigns, like all other deep learning
algorithms, a weight matrix to its inputs and then produces the output.
Note that RNNs apply weights to the current and also to the previous
input. Furthermore, a recurrent neural network will also tweak the weights
for both through gradient descent and backpropagation through time
(BPTT).
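One common way to write this down is h_t = tanh(W_x x_t + W_h h_(t-1) + b), where the hidden state h carries the recent past. The sketch below assumes this formulation; the sizes, weight names and random values are illustrative, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4                                # assumed sizes

W_x = rng.standard_normal((hidden_size, input_size)) * 0.1    # weights for the present input
W_h = rng.standard_normal((hidden_size, hidden_size)) * 0.1   # weights for the recent past
b = np.zeros(hidden_size)


def rnn_step(x_t, h_prev):
    """Combine the current input with the previous hidden state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)


sequence = rng.standard_normal((5, input_size))  # 5 time steps
h = np.zeros(hidden_size)                        # no past before the first step
for x_t in sequence:
    h = rnn_step(x_t, h)                         # the same weights are reused at every step
print(h)
```

Because the same two weight matrices are applied at every time step, BPTT has to accumulate their gradients over all the steps.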
TYPES OF RNNS
One to One
One to Many
Many to One
Many to Many
Also note that while feed-forward neural networks map one input to one
output, RNNs can map one to many, many to many (translation) and many
to one (classifying a voice).
Backpropagation Through Time
To understand the concept of backpropagation through time you'll need to
understand the concepts of forward and backpropagation first. We could
spend an entire article discussing these concepts, so I will attempt to
provide as simple a definition as possible.
WHAT IS BACKPROPAGATION?
Backpropagation (BP or backprop, for short) is known as a workhorse algorithm in machine
learning. Backpropagation is used for calculating the gradient of an error function with
respect to a neural network's weights. The algorithm works its way backwards through the
various layers to find the partial derivatives of the error with respect to the
weights. These derivatives are then used to decrease the error when training.
Those derivatives are then used by gradient descent, an algorithm that can
iteratively minimize a given function. Then it adjusts the weights up or
down, depending on which decreases the error. That is exactly how a
neural network learns during the training process.
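As a toy illustration of gradient descent adjusting a weight, the sketch below minimizes a one-weight error function. The function (w - 3)^2, the starting weight and the learning rate are all assumptions chosen to keep the example small.

```python
def loss(w):
    """Toy error function with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the loss with respect to the weight."""
    return 2.0 * (w - 3.0)


w = 0.0               # initial weight (arbitrary)
learning_rate = 0.1   # assumed step size
for _ in range(100):
    w -= learning_rate * gradient(w)   # step against the gradient
print(w, loss(w))
```

In a real network, backpropagation supplies the gradient for every weight and the same update rule is applied to each of them.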
So, with backpropagation you basically try to tweak the weights of your
model while training.
You can view an RNN as a sequence of neural networks that you train one
after another with backpropagation.
The image below illustrates an unrolled RNN. On the left, the RNN is
unrolled after the equal sign. Note there is no cycle after the equal sign
since the different time steps are visualized and information is passed from
one time step to the next. This illustration also shows why an RNN can be
seen as a sequence of neural networks.
Within BPTT the error is backpropagated from the last to the first
timestep, while unrolling all the timesteps. This allows calculating the
error for each timestep, which allows updating the weights. Note that
BPTT can be computationally expensive when you have a high number of
timesteps.
Two issues of standard RNNs
There are two major obstacles RNNs have had to deal with, but to
understand them, you first need to know what a gradient is.
You can think of a gradient as the slope of a function. The higher the
gradient, the steeper the slope and the faster a model can learn. But if the
slope is zero, the model stops learning. A gradient simply measures how much
the error changes with regard to a change in each weight.
EXPLODING GRADIENTS
Exploding gradients occur when the gradients grow excessively large, so the algorithm assigns
a disproportionately high importance to the weights. Fortunately, this problem can be
solved fairly easily by truncating or squashing (clipping) the gradients.
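One widely used form of this squashing is clipping by norm: if the gradient vector is longer than a chosen maximum, it is rescaled to that length. The function name, threshold and gradient values below are illustrative assumptions.

```python
import math


def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]


exploded = [300.0, -400.0]                  # norm 500, illustrative
clipped = clip_by_norm(exploded, max_norm=5.0)
print(clipped)
```

The direction of the update is preserved; only its size is limited.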
VANISHING GRADIENTS
Vanishing gradients occur when the values of a gradient are too small and
the model stops learning or takes way too long as a result. This was a
major problem in the 1990s and much harder to solve than the exploding
gradients. Fortunately, it was solved through the concept of LSTM by
Sepp Hochreiter and Juergen Schmidhuber.
The units of an LSTM are used as building units for the layers of an RNN,
often called an LSTM network.
LSTMs enable RNNs to remember inputs over a long period of time. This
is because LSTMs contain information in a memory, much like the
memory of a computer. The LSTM can read, write and delete information
from its memory.
This memory can be seen as a gated cell, with gated meaning the cell
decides whether or not to store or delete information (i.e., if it opens the
gates or not), based on the importance it assigns to the information. The
assigning of importance happens through weights, which are also learned
by the algorithm. This simply means that it learns over time
what information is important and what is not.
In an LSTM you have three gates: input, forget and output gate. These
gates determine whether or not to let new input in (input gate), delete the
information because it isn’t important (forget gate), or let it impact the
output at the current timestep (output gate). Below is an illustration of an
RNN with its three gates:
The gates in an LSTM are analog in the form of sigmoids, meaning they
range from zero to one. Because they are analog, they are differentiable, which makes
backpropagation possible.
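The three gates can be sketched as a single LSTM time step. This follows the standard LSTM equations; the sizes, the dictionary layout of the weights and the random (untrained) values are assumptions for illustration.

```python
import numpy as np


def sigmoid(x):
    """Squash values into the range (0, 1), as the gates require."""
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4                     # assumed sizes
concat_size = input_size + hidden_size

# One weight matrix and bias per gate, plus one for the candidate memory.
W = {name: rng.standard_normal((hidden_size, concat_size)) * 0.1
     for name in ("input", "forget", "output", "candidate")}
b = {name: np.zeros(hidden_size) for name in W}


def lstm_step(x_t, h_prev, c_prev):
    """One LSTM time step in the standard formulation."""
    z = np.concatenate([h_prev, x_t])
    i = sigmoid(W["input"] @ z + b["input"])        # how much new input to let in
    f = sigmoid(W["forget"] @ z + b["forget"])      # how much old memory to keep
    o = sigmoid(W["output"] @ z + b["output"])      # how much memory to expose
    c_tilde = np.tanh(W["candidate"] @ z + b["candidate"])
    c_t = f * c_prev + i * c_tilde                  # updated cell memory
    h_t = o * np.tanh(c_t)                          # output at this time step
    return h_t, c_t


h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.standard_normal((5, input_size)):
    h, c = lstm_step(x_t, h, c)
print(h, c)
```

Every operation here is smooth, so gradients can flow back through the gates during training.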
Graphs can come in different structures and sizes, which does not
conform to the rectangular arrays that neural networks expect.
Graphs also have other characteristics that make them different
from the type of information that classic neural networks are
designed for. For instance, graphs are “permutation invariant,”
which means changing the order and position of nodes doesn’t
make a difference as long as their relations remain the same. In
contrast, changing the order of pixels results in a different image
and will cause the neural network that processes them to behave
differently.
To make graphs useful to deep learning algorithms, their data
must be transformed into a format that can be processed by a
neural network. The type of formatting used to represent graph
data can vary depending on the type of graph and the intended
application, but in general, the key is to represent the information
as a series of matrices.
But graph neural networks can also learn from other information
that the graph contains. The edges, the lines that connect the
nodes, can be represented in the same way, with each row
containing the IDs of the users and additional information such as
date of friendship, type of relationship, etc. Finally, the general
connectivity of the graph can be represented as an adjacency
matrix that shows which nodes are connected to each other.
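For instance, an adjacency matrix for a small friendship graph can be built from an edge list. The four users and their three friendships below are hypothetical.

```python
import numpy as np

# Hypothetical friendship edges between four users with IDs 0..3.
edges = [(0, 1), (0, 2), (2, 3)]
num_nodes = 4

adjacency = np.zeros((num_nodes, num_nodes), dtype=int)
for u, v in edges:
    adjacency[u, v] = 1
    adjacency[v, u] = 1   # friendship is mutual, so the matrix is symmetric
print(adjacency)
```

Row i of the matrix marks with a 1 every user that user i is connected to.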
How does the GNN create the graph embedding? When the graph
data is passed to the GNN, the features of each node are
combined with those of its neighboring nodes. This is called
“message passing.” If the GNN is composed of more than one
layer, then subsequent layers repeat the message-passing
operation, gathering data from neighbors of neighbors and
aggregating them with the values obtained from the previous
layer. For example, in a social network, the first layer of the GNN
would combine the data of the user with those of their friends,
and the next layer would add data from the friends of friends and
so on. Finally, the output layer of the GNN produces the
embedding, which is a vector representation of the node’s data
and its knowledge of other nodes in the graph.
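A stripped-down sketch of message passing is shown here, assuming mean aggregation and leaving out the learned weights and non-linearity that a real GNN layer would apply. The 4-user graph and the single feature per user are hypothetical.

```python
import numpy as np

# A hypothetical 4-user graph as a symmetric adjacency matrix.
adjacency = np.array([[0, 1, 1, 0],
                      [1, 0, 0, 0],
                      [1, 0, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
features = np.array([[1.0], [2.0], [3.0], [4.0]])   # one illustrative feature per user


def message_passing_layer(adj, node_features):
    """Average each node's features with those of its direct neighbours."""
    adj_with_self = adj + np.eye(adj.shape[0])          # keep the node's own data
    degree = adj_with_self.sum(axis=1, keepdims=True)   # neighbours plus the node itself
    return (adj_with_self @ node_features) / degree


layer1 = message_passing_layer(adjacency, features)   # information from friends
layer2 = message_passing_layer(adjacency, layer1)     # information from friends of friends
print(layer1.ravel(), layer2.ravel())
```

After the second layer, each value already depends on nodes two hops away, which is the "friends of friends" effect described above.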
Once you have a neural network that can learn the embeddings
of a graph, you can use it to accomplish different tasks.
Random Forest:
Random Forest is an extension over bagging. Each classifier in the
ensemble is a decision tree classifier and is generated using a random
selection of attributes at each node to determine the split. During
classification, each tree votes and the most popular class is returned.
Implementation steps of Random Forest –
1. Multiple subsets are created from the original data set,
selecting observations with replacement.
2. A subset of features is selected randomly, and whichever
feature gives the best split is used to split the node iteratively.
3. Each tree is grown to its largest extent.
4. Repeat the above steps; the prediction is given based on the
aggregation of predictions from n number of trees.
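The sampling and voting pieces of these steps can be sketched with the standard library alone. Tree building itself is omitted, and the helper names and the tiny data set are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) observations with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(num_features, k, rng):
    """Pick k candidate features at random for a split."""
    return rng.sample(range(num_features), k)


def majority_vote(predictions):
    """Return the most popular class among the trees' votes."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = [([5.1, 3.5], "a"), ([6.2, 2.9], "b"), ([4.9, 3.0], "a"), ([6.7, 3.1], "b")]
sample = bootstrap_sample(rows, rng)
candidates = random_feature_subset(num_features=2, k=1, rng=rng)
winner = majority_vote(["a", "b", "a"])
print(len(sample), candidates, winner)
```

In a full implementation, one decision tree would be trained per bootstrap sample, and majority_vote would combine the predictions of all the trees.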
These long-term goals help prevent the agent from stalling on lesser goals. With time,
the agent learns to avoid the negative and seek the positive. This learning method has
been adopted in artificial intelligence (AI) as a way of directing unsupervised machine
learning through rewards and penalties.
Current use cases include, but are not limited to, the following:
gaming
resource management
personalized recommendations
robotics
Gaming is likely the most common usage field for reinforcement learning. It is
capable of achieving superhuman performance in numerous games. A common
example involves the game Pac-Man.
A learning algorithm playing Pac-Man might have the ability to move in one of four
possible directions, barring obstruction. From pixel data, an agent might be given a
numeric reward for the result of a unit of travel: 0 for empty space, 1 for pellets, 2 for
fruit, 3 for power pellets, 4 for ghost post-power pellets, 5 for collecting all pellets and
completing a level, and a 5-point deduction for collision with a ghost. The agent starts
from randomized play and moves to more sophisticated play, learning the goal of
getting all pellets to complete the level. Given time, an agent might even learn tactics
like conserving power pellets until needed for self-defense.
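The reward scheme in this example can be written as a simple lookup table. The event names and the sample episode are hypothetical; the numbers are the ones given above.

```python
# Reward per event, following the numbers in the Pac-Man example.
REWARDS = {
    "empty": 0,
    "pellet": 1,
    "fruit": 2,
    "power_pellet": 3,
    "ghost_eaten": 4,        # ghost caught after a power pellet
    "level_complete": 5,
    "ghost_collision": -5,
}


def total_reward(events):
    """Sum the rewards collected over one episode."""
    return sum(REWARDS[event] for event in events)


episode = ["pellet", "pellet", "empty", "power_pellet", "ghost_eaten", "ghost_collision"]
episode_reward = total_reward(episode)
print(episode_reward)
```

The agent's goal is to choose moves that make this total as large as possible over time.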
In robotics, reinforcement learning has found its way into limited tests. This type of
machine learning can provide robots with the ability to learn tasks a human teacher
cannot demonstrate, to adapt a learned skill to a new task or to achieve optimization
despite a lack of analytic formulation available.
For example, if you were to deploy a robot that relied on reinforcement learning
to navigate a complex physical environment, it would seek new states and take different
actions as it moved. It is difficult to consistently take the best actions in a real-world
environment, however, because of how frequently the environment changes.
The time required to ensure the learning is done properly through this method can
limit its usefulness and be intensive on computing resources. As the training
environment grows more complex, so too do demands on time and compute resources.
Deep learning models are capable of focusing on the right features
by themselves, requiring only a little guidance from the programmer, and are very helpful
in solving the problem of dimensionality. Deep learning algorithms are used
especially when we have a huge number of inputs and outputs.
Deep learning evolved from machine learning, which itself is a
subset of artificial intelligence. As the idea behind artificial intelligence is to
mimic human behavior, so "the idea of deep learning is to build
algorithms that can mimic the brain".
Deep learning is implemented with the help of neural networks, and the
motivation behind neural networks is the biological neuron, which is nothing but a
brain cell.
Deep learning is a collection of statistical machine learning techniques for learning feature
hierarchies, based on artificial neural networks.
So basically, deep learning is implemented with the help of deep networks, which are
nothing but neural networks with multiple hidden layers.
Architectures
o Deep Neural Networks
It is a neural network with a certain level of complexity, which means
several hidden layers are placed between the input and output
layers. They are highly proficient at modelling and processing non-linear associations.
o Deep Belief Networks
A deep belief network is a class of Deep Neural Network that comprises multi-layer
belief networks.
Steps to perform DBN:
1. With the help of the Contrastive Divergence algorithm, a layer of features is
learned from the visible units.
2. Next, the previously trained features are treated as visible units, from which
a further layer of features is learned.
3. Lastly, when the learning of the final hidden layer is accomplished, the
whole DBN is trained.
o Recurrent Neural Networks
It permits parallel as well as sequential computation, and it is similar to
the human brain (a large feedback network of connected neurons). Since they are
able to remember the important things related to the input they
have received, they are more precise.
Applications:
o Data Compression
o Pattern Recognition
o Computer Vision
o Sonar Target Recognition
o Speech Recognition
o Handwritten Characters Recognition
Applications:
o Machine Translation
o Robot Control
o Time Series Prediction
o Speech Recognition
o Speech Synthesis
o Time Series Anomaly Detection
o Rhythm Learning
o Music Composition
Applications:
o Filtering.
o Feature Learning.
o Classification.
o Risk Detection.
o Business and Economic analysis.
5. Autoencoders
An autoencoder neural network is another kind of unsupervised machine learning
algorithm. Here the number of hidden cells is smaller than that of the input
cells, but the number of input cells is equal to the number of output cells. An
autoencoder network is trained to produce output similar to the fed input, which forces
AEs to find common patterns and generalize the data. Autoencoders are mainly
used to obtain a smaller representation of the input, from which the
original data can be reconstructed. This algorithm is comparatively simple as it only
requires the output to be identical to the input.
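The shape of an autoencoder can be sketched in a few lines. The sizes (8 inputs, 3 hidden cells), the tanh encoder and the random untrained weights are assumptions; the point is only that the hidden code is smaller than the input while the output matches the input size.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden = 8, 3          # assumed sizes: hidden layer smaller than input

W_enc = rng.standard_normal((n_hidden, n_inputs)) * 0.1   # encoder weights (untrained)
W_dec = rng.standard_normal((n_inputs, n_hidden)) * 0.1   # decoder weights (untrained)

x = rng.standard_normal(n_inputs)
code = np.tanh(W_enc @ x)          # compressed representation
reconstruction = W_dec @ code      # same size as the input

# Training would adjust the weights to make this reconstruction error small.
error = float(np.mean((x - reconstruction) ** 2))
print(code.shape, reconstruction.shape, error)
```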
Applications:
o Classification.
o Clustering.
o Feature Compression.
Deep learning applications
o Self-Driving Cars
A self-driving car captures images of its surroundings and processes a huge
amount of data, then decides which action to take: turn
left, turn right, or stop. Deciding on actions in this way
can help reduce the accidents that happen every year.
o Voice Controlled Assistance
When we talk about voice-controlled assistance, Siri is the first thing that comes
to mind. You can tell Siri whatever you want it to do for you, and it will
search for it and display the result.
o Automatic Image Caption Generation
For whatever image you upload, the algorithm
generates a caption accordingly. If you say "blue-colored eye", it will display a
blue-colored eye with a caption at the bottom of the image.
o Automatic Machine Translation
With the help of automatic machine translation, we are able to convert one language
into another with the help of deep learning.
Limitations
o It only learns through observations.
o It can suffer from bias issues.
Advantages
o It lessens the need for feature engineering.
o It eliminates unnecessary costs.
o It easily identifies difficult defects.
o It delivers best-in-class performance on many problems.
Disadvantages
o It requires an ample amount of data.
o It is quite expensive to train.
o It does not have strong theoretical groundwork.