Artificial Neural Network
All of these are simple applications of Deep Learning models, and there are many more
advanced applications, not least driverless and autonomous cars. All of them rest on
the fundamentals of neural networks. Neural networks are not a new idea; they have been
around for over 70 years. In 1943, Warren McCulloch and Walter Pitts proposed the first
neural network model, building “computational models for neural networks based on
algorithms called threshold logic.”
Definition of Neural Networks
Essentially, neural networks are non-linear machine learning models that can be used
for both supervised and unsupervised learning. Neural networks can also be seen as a set
of algorithms that are loosely modeled on the human brain and are built to identify
patterns.
The Basic Concept of Artificial Neural Networks
An artificial neural network (ANN) is a computing system designed to simulate how the
human brain analyzes and processes information. It is the foundation of artificial
intelligence (AI) and solves problems that would prove impossible or difficult by human
or statistical standards.
Artificial Neural Networks are primarily designed to mimic and simulate the functioning
of the human brain. An ANN is a mathematical structure constructed to replicate
biological neurons.
A human brain has a decision-making process: it sees or is exposed to information
through the five sense organs, stores this information, correlates it with previous
learnings, and makes decisions accordingly.
The concept of an ANN follows the same process as that of a natural neural net. The
objective of an ANN is to make machines or systems understand and imitate how a human
brain makes a decision and then ultimately takes action. Inspired by the human brain,
a neural network is built from connected neurons, or nodes, as depicted below:
The similarities among the terminologies between the biological and the Artificial
Neural Networks based on their functionalities:
The structure of a neural network depends on the problem's specification and is
configured according to the application. In general, however, a neural network has the
following structure, and the components that make up the fundamentals of neural
networks are:
In a neural network, there are three types of layers: the Input Layer, the Hidden
Layers, and the Output Layer.
The input layer consists of the inputs, i.e. the independent X variables, also known as
the predictors. These inputs are collected from external sources such as text data,
images, audio, or video files. In a biological network, these Xs are the information
perceived by the sense organs.
The output layer holds the result of the neural network; it could be a numerical value
in a regression problem, or a binary or multi-class label in a classification problem.
The output can also be recognized handwriting or speech, or an image or text classified
into categories.
Apart from the Input and the Output layer, there is another layer in the Neural
Networks, called the Hidden Layer, which derives the features for the model. There is
one hidden layer in the above image, and the below image has three hidden layers.
Single Layer Perceptron: Strictly, a neural net with a single layer of computing
neurons, where the inputs connect directly to the outputs; the term is sometimes used
loosely for a net with one single hidden layer.
Multilayer Perceptron: A neural net with one or more hidden layers, where each layer
is fully connected to the next, is called the Multilayer Perceptron.
Neurons
As seen above, neurons are the basic processing units of the neural network. Each
neuron receives information or data, performs a simple calculation, and passes the
result on. The neurons are in the hidden and the output layers. The input layer doesn't
have any neurons; the circles in the input layer represent the independent variables,
or the Xs.
The number of nodes in the output layer depends on the kind of business problem:
Regression: When the output is continuous, for example when predicting stock
prices, there will be one node in the output layer, as in the above image.
Classification: In a classification problem, the number of nodes in the output
layer equals the number of classes or categories. For binary classification, we
can have either one or two nodes in the output layer.
The number of neurons in the hidden layers is chosen by the user. The architecture of
the neural net is configured based on the problem at hand and is determined by the
user. The input layer is always predefined, and the output, being the goal of the
network, is also fixed in advance.
The number of hidden layers and the number of neurons in them are therefore
hyperparameters. These settings determine which features are created, and a small tweak
in them can significantly impact the output.
Weights and Bias
The inputs from the input layer are connected to each neuron of the first hidden layer.
Similarly, the neurons of this hidden layer are connected with the neurons of the
subsequent layer. In a nutshell, the output of one layer becomes the input of the next
layer. Any given neuron can have a many-to-many relationship, with multiple input and
output connections. Weights and bias are applied on each of the connections as values
are passed between the layers.
In the biological neural network, the synapses determine the strength of the connections
between neurons. In the same manner, in artificial neural networks the weights control
the strength of the connections between the neurons.
These weights represent relative importance: they indicate how much influence the input
X, or a subsequently derived neuron, will have on the output.
Naturally, neurons with higher weights will have more influence on the next layer, and
neurons with insignificant weights will effectively drop out of the computation.
Bias is an additional input to each layer, starting from the input layer. The bias
neither depends on nor is affected by the preceding layer. In simple words, bias is the
intercept term and is constant. It implies that even if there are no inputs or
independent variables, the model will be activated with a default value given by the bias.
The weights and biases are the learnable parameters of the model. At initialization,
i.e. before the first iteration, the weights are assigned randomly; they are then
optimized to minimize the loss or error.
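The role of weights and bias can be sketched with a single neuron. This is a minimal illustration, not a full network: the function name `neuron_output`, the sigmoid activation and the numeric values are assumptions chosen for the example.

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through a sigmoid (assumed activation)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# With all-zero inputs the weights have no effect: the output is driven by the bias alone.
print(neuron_output([0.0, 0.0], [0.4, -0.2], bias=0.0))
print(neuron_output([0.0, 0.0], [0.4, -0.2], bias=2.0))

# A larger weight on an input gives that input more influence on the output.
print(neuron_output([1.0, 1.0], [0.1, 0.1], bias=0.0))
print(neuron_output([1.0, 1.0], [3.0, 0.1], bias=0.0))
```

Training would adjust `weights` and `bias` to reduce the loss; here they are fixed by hand.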
Types or models of Neural Networks
There are many types of neural networks available, and more may be in development.
They can be classified by their structure, data flow, the neurons used and their
density, the layers and their depth, activation filters, etc.
A. Perceptron
It accepts weighted inputs and applies the activation function to obtain the
output as the final result.
The Perceptron is a supervised learning algorithm that classifies data into two
categories; thus it is a binary classifier.
Advantages of Perceptron
Perceptrons can implement logic gates such as AND, OR, or NAND.
Disadvantages of Perceptron
Perceptrons can only learn linearly separable problems, such as the boolean AND
problem. They do not work for non-linear problems such as the boolean XOR problem.
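As an illustration of the logic-gate claim, the sketch below wires a step-activated perceptron as AND, OR and NAND. The weights and biases are hand-picked for the example, not learned, and the helper names `perceptron` and `gate` are hypothetical.

```python
def perceptron(inputs, weights, bias):
    """Step-activated perceptron: 1 if the weighted sum plus bias is >= 0, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total >= 0 else 0

# Hand-picked (not learned) weights and biases, for illustration only.
GATES = {
    "AND":  ([1.0, 1.0], -1.5),
    "OR":   ([1.0, 1.0], -0.5),
    "NAND": ([-1.0, -1.0], 1.5),
}

def gate(name, a, b):
    weights, bias = GATES[name]
    return perceptron([a, b], weights, bias)

for name in GATES:
    # Outputs are listed for the inputs (0,0), (0,1), (1,0), (1,1).
    print(name, [gate(name, a, b) for a in (0, 1) for b in (0, 1)])
```

No single choice of two weights and a bias reproduces XOR, because its two classes cannot be separated by one straight line.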
B. Feed Forward Neural Networks
The simplest form of neural network, in which input data travels in one direction
only, passing through artificial neural nodes and exiting through output nodes.
Hidden layers may or may not be present, but input and output layers always are.
In this basic form it cannot be used for deep learning [due to absence of dense layers and back propagation]
C. Multilayer Perceptron
An entry point towards complex neural nets, where input data travels through various layers of artificial
neurons.
Every single node is connected to all neurons in the next layer, which makes it a fully connected neural
network.
Input and output layers are present along with one or more hidden layers, i.e. at least three layers in
total. It has bi-directional propagation, i.e. forward propagation and backward propagation.
In forward propagation, inputs are multiplied with weights and fed to the activation function; in
backpropagation, the weights are modified to reduce the loss.
In simple words, weights are the values a neural network learns. They self-adjust depending on the
difference between the predicted outputs and the target outputs. Nonlinear activation functions are used
in the hidden layers, followed by softmax as the output layer activation function.
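The forward propagation described above can be sketched in a few lines. This is a minimal sketch under assumptions: ReLU is used as the hidden-layer nonlinearity, the parameter values are arbitrary, and the names `dense`, `relu`, `softmax` and `mlp_forward` are illustrative. Backpropagation is not shown.

```python
import math

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    shifted = [v - max(values) for v in values]   # subtract the max for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def mlp_forward(x, layers):
    """layers is a list of (weights, biases); ReLU on hidden layers, softmax on the last."""
    activation = x
    for weights, biases in layers[:-1]:
        activation = relu(dense(activation, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(activation, weights, biases))

# Arbitrary illustrative parameters: 2 inputs -> 3 hidden neurons -> 2 classes.
layers = [
    ([[0.5, -0.3], [0.8, 0.2], [-0.4, 0.9]], [0.1, 0.0, -0.2]),
    ([[0.7, -0.5, 0.3], [-0.6, 0.4, 0.9]], [0.0, 0.1]),
]
probs = mlp_forward([1.0, 2.0], layers)
print(probs)   # class probabilities that sum to 1
```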
Applications of Multi-Layer Perceptron
● Speech Recognition
● Machine Translation
● Complex Classification
1. Used for deep learning [due to the presence of dense fully connected layers and
back propagation]
D. Convolution Neural Network
Each neuron in the convolutional layer only processes information from a small part of the visual field.
Input features are taken in batch-wise, like a filter. The network understands images in parts and can
compute these operations multiple times to complete the full image processing. Processing involves
conversion of the image from RGB or HSI scale to grey-scale. Further, changes in pixel values help to
detect edges, and images can then be classified into different categories.
Propagation is uni-directional where the CNN contains one or more convolutional layers followed by pooling,
and bidirectional where the output of the convolution layer goes to a fully connected neural network for
classifying the images, as shown in the above diagram. Filters are used to extract certain parts of the
image. In an MLP the inputs are multiplied with weights and fed to the activation function. Convolution uses
ReLU, and the MLP uses a nonlinear activation function followed by softmax. Convolution neural networks
show very effective results in image and video recognition, semantic parsing and paraphrase detection.
Applications of Convolution Neural Network
● Image processing
● Computer Vision
● Speech Recognition
● Machine translation
Advantages of Convolution Neural Network:
E. Radial Basis Function Neural Network
A Radial Basis Function Network consists of an input vector followed by a layer of RBF neurons and an output layer with one
node per category. Classification is performed by measuring the input's similarity to data points from the training set, where
each neuron stores a prototype, which is one of the examples from the training set.
When a new input vector [the n-dimensional vector that you are trying to classify] needs to be classified, each neuron
calculates the Euclidean distance between the input and its prototype. For example, if we have two classes, class A and
class B, and the new input is closer to the class A prototypes than to the class B prototypes, it is
tagged or classified as class A.
Each RBF neuron compares the input vector to its prototype and outputs a value between 0 and 1, which is a measure of
similarity. If the input equals the prototype, the output of that RBF neuron is 1; as the distance between the input and the
prototype grows, the response falls off exponentially towards 0. The curve of the neuron's response has the shape of a
typical bell curve. The output layer consists of a set of neurons [one per category].
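A hedged sketch of the RBF response described above: a Gaussian of the Euclidean distance to the prototype. The width parameter `beta`, the unit output weights (each class score is simply the sum of its neurons' activations) and the sample prototypes are assumptions made for illustration.

```python
import math

def rbf_activation(x, prototype, beta=1.0):
    """Gaussian RBF response: 1 at the prototype, decaying towards 0 with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, prototype))
    return math.exp(-beta * squared_distance)

def classify(x, prototypes):
    """prototypes maps a class label to a list of stored prototype vectors.
    Each class score is the sum of its neurons' activations (unit output weights assumed)."""
    scores = {label: sum(rbf_activation(x, p) for p in points)
              for label, points in prototypes.items()}
    return max(scores, key=scores.get), scores

prototypes = {
    "A": [[0.0, 0.0], [0.5, 0.5]],
    "B": [[3.0, 3.0], [3.5, 2.5]],
}
label, scores = classify([0.4, 0.2], prototypes)
print(label, scores)   # the input lies near the class A prototypes
```

In a trained network the output layer would also learn a weight for each RBF neuron; that step is omitted here.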
F. Recurrent Neural Networks
A Recurrent Neural Network is designed to save the output of a layer and feed it back to the input to help in predicting the
outcome of the layer. The first layer is typically a feed forward neural network, followed by a recurrent neural network layer
in which some information from the previous time-step is remembered by a memory function. Forward propagation is implemented in
this case. The network stores information required for its future use. If the prediction is wrong, the learning rate is employed
to make small changes during backpropagation, so that the network gradually moves towards making the right prediction.
Applications of Recurrent Neural Networks
1. Modeling sequential data, where each sample can be assumed to depend on historical ones, is one of the advantages.
2. Used with convolution layers to extend the pixel effectiveness.
G. Sequence to Sequence Models
A sequence to sequence model consists of two Recurrent Neural Networks: an encoder that processes
the input and a decoder that produces the output. The encoder and decoder work simultaneously, using either the same
parameters or different ones. In contrast to a plain RNN, this model is particularly applicable in cases where the
length of the input data is not equal to the length of the output data. While they share the benefits and limitations of
RNNs, these models are mainly applied in chatbots, machine translation, and question answering systems.
H. Modular Neural Network
A modular neural network has a number of different networks that function independently and perform sub-tasks. The
different networks do not really interact with or signal each other during the computation process. They work independently
towards achieving the output. As a result, a large and complex computational process can be done significantly faster by
breaking it down into independent components. The computation speed increases because the networks are not
interacting with or even connected to each other.
Advantages of Modular Neural Network
1. Efficient
2. Independent training
3. Robustness
Artificial Neural Networks (ANNs) are inspired by the way biological nervous systems, such as the human
brain, work. The most powerful attribute of the human brain is its ability to adapt, and ANNs acquire a
similar characteristic.
How exactly does our brain do this? Our understanding is still very primitive, although we have a
fundamental understanding of the procedure.
It is accepted that during the learning procedure, the brain's neural structure is altered, increasing or
decreasing the strength of its synaptic connections depending on their activity.
This is the reason why more relevant information is easier to recall than information that has not been
recalled for a long time.
More significant information will have stronger synaptic connections, and less relevant information will
gradually have its synaptic connections weakened, making it harder to recall.
An ANN can model this learning process by changing the weighted connections between neurons in the
network.
It effectively mimics the strengthening and weakening of the synaptic connections found in our brains.
This strengthening and weakening of the connections is what enables the network to adapt.
Face recognition is an example of a problem that is extremely difficult for a human to precisely convert
into code.
A problem that could be resolved better by a learning algorithm would be a loan-granting institution that
uses previous credit scores to classify future loan probabilities.
A learning rule, or learning process, is a technique or a mathematical logic that lets a neural network learn
from existing conditions and improve its performance.
It is an iterative procedure.
The rule is applied over the network: learning rules update the weights and bias levels of a network when
the network is simulated in a particular data environment.
Hebbian learning rule:
The Hebbian rule was the first learning rule. In 1949, Donald Hebb developed it as a learning algorithm for
unsupervised neural networks.
We can use this rule to work out how to update the weights of the nodes of a network.
The Hebb learning rule assumes that if two neighboring neurons are activated and deactivated
simultaneously, then the weight connecting these neurons should increase.
For neurons operating in opposite phase, the weight between them should decrease. If there is no
relationship between the signals, the weight should not change.
If the inputs of both nodes are either positive or negative, then a positive weight exists between the nodes.
If the input of one node is positive and that of the other is negative, a strong negative weight exists
between the nodes.
In the beginning, the values of all weights are set to zero. This learning rule can be used with both soft and hard
activation functions. Since the desired responses of the neurons are not used in the learning process, this is an
unsupervised learning rule. The absolute values of the weights are usually proportional to the learning time, which
is undesirable.
According to the Hebbian learning rule, the formula to increase the weight of a connection at each time step is
given below.
∆ωij(t) = α · pi(t) · qj(t)
Here,
∆ωij(t) = increment by which the weight of the connection increases at time step t,
α = learning rate (a positive constant),
pi(t), qj(t) = activities at time step t of the two neurons i and j joined by the connection.
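A minimal sketch of one Hebbian update step, assuming `p` holds the activities of the sending units i and `q` the activities of the receiving units j; the function name and the sample values are illustrative.

```python
def hebbian_update(weights, p, q, alpha=0.1):
    """Apply w_ij <- w_ij + alpha * p_i * q_j for every connection (i, j)."""
    return [[w + alpha * p_i * q_j for w, q_j in zip(row, q)]
            for row, p_i in zip(weights, p)]

weights = [[0.0, 0.0], [0.0, 0.0]]   # rows: units i, columns: units j; start at zero
p = [1.0, -1.0]
q = [1.0, -1.0]
weights = hebbian_update(weights, p, q)
print(weights)
# Connections between units with the same sign grow positive,
# connections between units with opposite signs grow negative.
```

Repeating the update with the same activities keeps growing the weights without bound, which reflects the drawback noted above.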
In supervised learning, the learning rule is given a set of training examples:
{x1,t1}, {x2,t2}, …, {xq,tq}
Where,
xq = input to the network,
tq = target output.
As each input is given to the network, the network output is compared with the target.
Afterward, the learning rule changes the weights and biases of the network in order to move the network
output closer to the target.
Single-Neuron Perceptron:
Among these techniques, artificial neural networks are inspired by the physiological
operations of the brain.
They are based on the mathematical model of a single neural cell (neuron), called the single-neuron
perceptron, and try to resemble the actual networks of neurons in the brain.
Consider a two-input perceptron with one neuron, shown in the figure given below.
Perceptron Learning Rule
It was introduced by Rosenblatt. It is an error-correcting rule for a single-layer
feedforward network. It is supervised in nature: it calculates the error between the
desired and the actual output, and the weights are adjusted only if an error is present.
Computed as follows:
Assume (x1, x2, x3, …, xn) –> set of input vectors
and (w1, w2, w3, …, wn) –> set of weights
y = actual output
ti = desired (target) output
wo = initial weight
wnew = new weight
δw = change in weight
α = learning rate
net input = Σ wi·xi
learning signal (ej) = ti − y (difference between desired and actual output)
δw = α·xi·ej
wnew = wo + δw
Now, the output can be calculated on the basis of the input and the activation
function applied over the net input, and can be expressed as:
y = 1, if net input >= θ
y = 0, if net input < θ
Delta Learning Rule
Computed as follows:
Assume (x1, x2, x3, …, xn) –> set of input vectors
Error = ti − y
Learning signal (ej) = (ti − y)·y'
y = f(net input) = f(Σ wi·xi)
δw = α·xi·ej = α·xi·(ti − y)·y'
wnew = wo + δw
where y' is the derivative of the activation function f.
The weights are updated only if there is a difference between the target and the
actual output (i.e., an error) present:
case I: when t = y
then there is no change in weight
case II: else
wnew = wo + δw
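The perceptron learning rule can be sketched as a short training loop. This is an illustrative sketch under assumptions: the bias is handled as an extra weight on a constant input of 1, the threshold θ is 0, and the learning rate is set to 1.0 so that all arithmetic stays exact. The function names and the AND data set are chosen for the example.

```python
def step(net_input, theta=0.0):
    """Threshold activation: 1 if net input >= theta, else 0."""
    return 1 if net_input >= theta else 0

def train_perceptron(samples, alpha=1.0, epochs=20):
    """samples: list of (inputs, target). A constant 1 is appended as the bias input."""
    n = len(samples[0][0]) + 1
    weights = [0.0] * n
    for _ in range(epochs):
        for inputs, target in samples:
            x = list(inputs) + [1.0]
            y = step(sum(w * xi for w, xi in zip(weights, x)))
            error = target - y                      # learning signal e = t - y
            if error != 0:                          # weights change only on a mistake
                weights = [w + alpha * xi * error for w, xi in zip(weights, x)]
    return weights

def predict(weights, inputs):
    x = list(inputs) + [1.0]
    return step(sum(w * xi for w, xi in zip(weights, x)))

and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_samples)
print([predict(w, x) for x, _ in and_samples])   # predictions for the four AND inputs
```

The delta rule differs only in the learning signal, which is additionally multiplied by the derivative of a differentiable activation function.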
Correlation Learning Rule:
The correlation learning rule is based on the same principle as the Hebbian learning rule. It assumes that
weights between neurons that respond together should become more positive, and weights between neurons with
opposite reactions should become more negative. Unlike the Hebbian rule, the correlation rule is a supervised
learning rule: instead of the actual response oj, the desired response dj is used for the weight calculation.
∆wij = η·xi·dj
where η is the learning rate, xi is the input and dj is the desired response.
The training algorithm generally starts with the weights initialized to zero. Since the desired response is
supplied by the user, the correlation learning rule is an example of supervised learning.
Outstar Learning Rule:
∆wij = c(di − wij)
where
di is the desired output and c is the small learning constant, which further decreases during the learning process.
Linear vs. Non-Linear Classification
Linear Classification refers to categorizing a set of data points to a discrete class based on a
linear combination of its explanatory variables. On the other hand, Non-Linear Classification
refers to separating those instances that are not linearly separable.
Linear Classification
→ Linear Classification refers to categorizing a set of data points into a discrete class based on
a linear combination of its explanatory variables.
→ Some of the classifiers that use linear functions to separate classes are Linear Discriminant
Classifier, Naive Bayes, Logistic Regression, Perceptron, SVM (linear kernel).
Explanatory Variable
An Explanatory Variable is a factor that has been manipulated in an experiment by a researcher. It is used to determine the
change caused in the response variable. An Explanatory Variable is often referred to as an Independent Variable or a
Predictor Variable.
Response Variable
Response Variable is the result of the experiment where the explanatory variable is manipulated. It is a factor whose
variation is explained by the other factors. Response Variable is often referred to as the Dependent Variable or the
Outcome Variable.
For Example,
You want to find out if alcohol decreases the ability to drive safely. The amount of alcohol a participant consumes
determines the effect on their driving performance. In the experiment, the amount of alcohol consumed (the explanatory
variable) gives an explanation for the driving skill (the response variable).
In the figure above, we have two classes, namely 'O' and '+.' To differentiate between the two
classes, an arbitrary line is drawn, ensuring that both the classes are on distinct sides.
→ Since we can tell one class apart from the other, these classes are called ‘linearly-separable.’
→ However, an infinite number of lines can be drawn to distinguish the two classes.
→ The exact location of this plane/hyperplane depends on the type of linear classifier used.
Linear Discriminant Classifier
→ Technique - Linear Discriminant Analysis (LDA) is used, which reduces the 2D graph to a 1D graph by creating a new
axis. This helps to maximize the distance between the two classes for differentiation.
In the above graph, we notice that a new axis is created, which maximizes the distance between the mean of the two
classes.
→ However, the problem with LDA is that it would fail in case the means of both the classes are the same. This would
mean that we would not be able to generate a new axis for differentiating the two.
Naive Bayes
→ It is based on the Bayes Theorem and lies in the domain of Supervised Machine Learning.
→ Every feature is considered equal and independent of the others during Classification.
→ Naive Bayes estimates the likelihood of occurrence of an event given another event, i.e. a conditional probability,
using Bayes' theorem:
P(A|B) = P(B|A) · P(A) / P(B)
A: event 1
B: event 2
P(A|B): Probability of A being true given B is true - posterior probability
P(B|A): Probability of B being true given A is true - the likelihood
P(A): Probability of A being true - prior
P(B): Probability of B being true - marginal probability (evidence)
However, in the case of the Naive Bayes classifier, we are concerned only with the class that has the maximum posterior
probability, so we ignore the denominator, i.e., the marginal probability: the argmax does not depend on this
normalization term.
The Naive Bayes classifier is based on two essential assumptions:-
(i) Conditional Independence - All features are independent of each other. This implies that one
feature does not affect the performance of the other. This is the sole reason behind the ‘Naive’ in
‘Naive Bayes.’
(ii) Feature Importance - All features are equally important. It is essential to know all the features
to make good predictions and get the most accurate results.
→ Naive Bayes is classified into three main types: Multinomial Naive Bayes, Bernoulli Naive
Bayes, and Gaussian Naive Bayes.
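The maximum-posterior decision can be sketched for categorical features as follows. This is a minimal sketch, not one of the three library variants named above: the add-one (Laplace) smoothing, the assumption that each feature takes `n_values` distinct values, and the tiny weather-style data set are all illustrative additions.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)      # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            value_counts[(label, index)][value] += 1
    return class_counts, value_counts

def predict(row, class_counts, value_counts, smoothing=1.0, n_values=2):
    """Return the class with the largest P(class) * product of P(feature | class).
    The denominator P(features) is skipped because it is the same for every class."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total                                      # prior P(class)
        for index, value in enumerate(row):
            seen = value_counts[(label, index)][value]
            # Smoothed likelihood P(feature value | class); features treated as independent.
            score *= (seen + smoothing) / (count + smoothing * n_values)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

rows = [("sunny", "calm"), ("sunny", "calm"), ("sunny", "windy"),
        ("rainy", "windy"), ("rainy", "windy"), ("rainy", "calm")]
labels = ["play", "play", "play", "stay", "stay", "stay"]
model = train_naive_bayes(rows, labels)
print(predict(("sunny", "calm"), *model))
print(predict(("rainy", "windy"), *model))
```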
Logistic Regression
Support Vector Machine (SVM)
→ This model finds a hyperplane that creates a boundary between the classes.
→ A binary classifier can be created for each class to perform multi-class classification.
→ In that case, the classifier with the highest score is chosen as the output of the SVM.
→ SVM works very well with linearly separable data but can work for non-linearly separable data as well.
Non-Linear Classification
→ Non-Linear Classification refers to categorizing those instances that are not linearly separable.
→ Some of the classifiers that use non-linear functions to separate classes are Quadratic Discriminant Classifier, Multi-
Layer Perceptron (MLP), Decision Trees, Random Forest, and K-Nearest Neighbours (KNN).
In the figure above, we have two classes, namely 'O' and 'X.' To differentiate between the two classes, it
is impossible to draw an arbitrary straight line to ensure that both the classes are on distinct sides.
→ We notice that even if we draw a straight line, there would be points of the first-class present between
the data points of the second class.
→ In such cases, piece-wise linear or non-linear classification boundaries are required to distinguish the
two classes.
Quadratic Discriminant Classifier
→ The only difference from the Linear Discriminant Classifier is that here, we do not assume that the mean and
covariance of all classes are the same.
Decision Trees
→ Instances are classified by sorting them down from the root to some leaf node.
→ An instance is classified by starting at the tree's root node, testing the attribute specified by this node, then moving
down the tree branch corresponding to the attribute's value, as shown in the above figure.
→ The process is repeated based on each derived subset in a recursive partitioning manner.
K-Nearest Neighbours
→ KNN is a supervised machine learning algorithm. It is used for classification problems. Since it is a supervised machine
learning algorithm, it uses labeled data to make predictions.
→ KNN analyzes the 'k' nearest data points and then classifies the new data based on them.
→ In detail, to label a new point, the KNN algorithm analyzes the ‘k’ nearest neighbors, i.e. the ‘k’ data points nearest to
the new point. It assigns the new point the label to which the majority of the ‘k’ nearest neighbors belong.
→ It is essential to choose an appropriate value of ‘k’ to avoid overfitting our model.
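The majority vote among the 'k' nearest neighbors can be sketched in a few lines. The function name `knn_predict`, the use of Euclidean distance and the toy data are assumptions made for the example; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["O", "O", "O", "X", "X", "X"]
print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (6, 5)))
```

A very small k makes the prediction sensitive to individual noisy points, while a very large k blurs the class boundary, which is why the choice of k matters.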
Linear Classification: It is possible to classify data with a straight line.
Non-Linear Classification: It is not easy to classify data with a straight line.
Suppose there are a lion and a tiger. How can we discriminate between the lion and the tiger?
Someone said that:
● The tiger has stripes on its head
● The lion has a mane on its head
Information such as the striped pattern and the mane is called features in
machine learning.
Hyperplane
The line in the previous example is called a hyperplane in a high dimensional
space. A hyperplane can be defined in an arbitrary D-dimensional space and
separates that space into two disjoint half-spaces.
Perceptron
So, to handle binary classification, we need to find a hyperplane from
the given data. How can we do that?
One answer is the Perceptron. The Perceptron is an algorithm for supervised learning of binary
classification problems. It requires a training dataset that includes the input X and
the corresponding label y.
From the figure, we can briefly see the prediction process of the perceptron. The sign of the
sum of all values becomes the predicted label. In detail, the sum is the inner product
between the weight vector of the perceptron and the input vector, plus the bias.
This is similar to the definition of the hyperplane, so we can use this algorithm to
find the hyperplane.
For example,
Perceptron and its convergence theorem | Chan`s Jupyter (goodboychan.github.io)
Adaptive Resonance Theory (ART)
Adaptive resonance theory is a type of neural network
technique developed by Stephen Grossberg and Gail Carpenter in 1987.
The basic ART uses an unsupervised learning technique. The terms “adaptive” and “resonance”
suggest that these networks are open to new learning (i.e. adaptive) without discarding the previous or
old information (i.e. resonance).
ART networks are known to solve the stability-plasticity dilemma: stability refers to their
ability to retain what has been learned, and plasticity refers to the fact that they are flexible enough
to gain new information.
Due to this nature, ART networks are always able to learn new input patterns without forgetting the
past.
ART networks implement a clustering algorithm. Input is presented to the network and the algorithm
checks whether it fits into one of the already stored clusters.
If it fits, the input is added to the cluster that matches it best; otherwise a new cluster is formed.
Types of Adaptive Resonance Theory (ART)
Carpenter and Grossberg developed different ART architectures as a result of 20 years of
research. The ARTs can be classified as follows:
● ART1 – It is the simplest and the basic ART architecture. It is capable of clustering
binary input values.
● ART2 – It is an extension of ART1 that is capable of clustering continuous-valued input
data.
● Fuzzy ART – It is the combination of fuzzy logic and ART.
● ARTMAP – It is a supervised form of ART learning where one ART learns based on
the previous ART module. It is also known as predictive ART.
● FARTMAP – This is a supervised ART architecture with fuzzy logic included.
Basics of Adaptive Resonance Theory (ART) Architecture
Adaptive resonance theory is a type of neural network that is self-organizing and competitive.
It can be of both types: unsupervised (ART1, ART2, ART3, etc.) or supervised (ARTMAP). Generally,
the supervised algorithms are named with the suffix “MAP”. But the basic ART model is
unsupervised in nature and consists of:
There exist two sets of weighted interconnections for controlling the degree of similarity between
the units in the F1 and the F2 layer. The F2 layer is a competitive layer.
The cluster unit with the largest net input becomes the candidate to learn the input pattern first, and the
rest of the F2 units are ignored.
The reset unit decides whether or not the cluster unit is allowed to learn the input pattern,
depending on how similar its top-down weight vector is to the input vector; the required degree of
similarity is set by the vigilance parameter.
Thus we can say that the vigilance parameter helps to incorporate new memories or new
information. Higher vigilance produces more detailed memories, lower vigilance produces more general
memories.
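The effect of the vigilance parameter can be illustrated with a heavily simplified ART1-style clustering of binary patterns. This is a sketch under assumptions, not the full architecture: it uses fast learning (the prototype becomes the element-wise AND of input and prototype), a standard choice function with an assumed constant `beta`, and it omits the separate bottom-up weights. All names and patterns are illustrative.

```python
def art1_cluster(patterns, vigilance=0.7, beta=1.0):
    """Assign each binary pattern to a cluster; returns (assignments, prototypes)."""
    prototypes = []          # one binary top-down weight vector per cluster
    assignments = []
    for x in patterns:
        norm_x = sum(x)

        # Rank existing clusters by the choice function |x AND w| / (beta + |w|).
        def choice(j):
            overlap = sum(a & b for a, b in zip(x, prototypes[j]))
            return overlap / (beta + sum(prototypes[j]))

        candidates = sorted(range(len(prototypes)), key=choice, reverse=True)
        for j in candidates:
            overlap = [a & b for a, b in zip(x, prototypes[j])]
            # Vigilance test: does the prototype match a large enough share of the input?
            if norm_x > 0 and sum(overlap) / norm_x >= vigilance:
                prototypes[j] = overlap          # fast learning: w <- x AND w
                assignments.append(j)
                break
        else:
            prototypes.append(list(x))           # every candidate was reset: new cluster
            assignments.append(len(prototypes) - 1)
    return assignments, prototypes

patterns = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
low_assign, low_protos = art1_cluster(patterns, vigilance=0.6)
high_assign, high_protos = art1_cluster(patterns, vigilance=0.9)
print(len(low_protos), low_assign)     # fewer, more general clusters
print(len(high_protos), high_assign)   # more, more detailed clusters
```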
Self Organizing Maps – Kohonen Maps
Self Organizing Map (or Kohonen Map or SOM) is a type of Artificial Neural Network which is
also inspired by biological models of neural systems from the 1970s. It follows an unsupervised
learning approach and trains its network through a competitive learning algorithm. SOM is used for
clustering and mapping (or dimensionality reduction): it maps multidimensional data onto a
lower-dimensional space, which allows people to reduce complex problems for easy interpretation. SOM has
two layers: the Input layer and the Output layer.
The architecture of the Self Organizing Map with two clusters and n input features of any sample is
given below:
How does SOM work?
Consider input data of size (m, n), where m is the number of training examples and n is the
number of features in each example. First, the weights of size (n, C) are initialized, where C is the
number of clusters. Then, iterating over the input data, for each training example the
winning vector (the weight vector with the shortest distance, e.g. Euclidean distance, from the training
example) is updated. The weight update rule is given by:
wij = wij(old) + alpha(t) * (xik - wij(old))
where alpha(t) is the learning rate at time t, j denotes the winning vector, i denotes the ith feature of
the training example and k denotes the kth training example from the input data. After training the SOM
network, the trained weights are used for clustering new examples. A new example falls in the cluster
of its winning vector.
Algorithm
Training:
Step 1: Initialize the weights wij; random values may be assumed. Initialize the learning rate α.
Step 2: Calculate the squared Euclidean distance.
D(j) = Σ (wij – xi)^2 where i = 1 to n and j = 1 to m
Step 3: Find the index J for which D(j) is minimum; this is the winning index.
Step 4: For each j within a specific neighborhood of J, and for all i, calculate the new weight.
wij(new) = wij(old) + α[xi – wij(old)]
Step 5: Update the learning rate by using:
α(t+1) = 0.5 * α(t)
Step 6: Test the stopping condition.
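The training steps can be sketched as follows. This is a simplified sketch under assumptions: the neighborhood radius is 0 (only the winning vector is updated), the learning rate is halved after every pass over the data (an assumed decay schedule), and the stopping condition is a fixed number of epochs. The function names, the optional `initial_weights` argument and the toy data are illustrative.

```python
import random

def find_winner(weights, x):
    """Index of the weight vector with the smallest squared Euclidean distance to x."""
    distances = [sum((w - xi) ** 2 for w, xi in zip(vector, x)) for vector in weights]
    return distances.index(min(distances))

def train_som(data, n_clusters, alpha=0.5, epochs=10, initial_weights=None, seed=0):
    """Winner-take-all SOM: one weight vector per cluster, only the winner is updated."""
    rng = random.Random(seed)
    n_features = len(data[0])
    if initial_weights is None:
        # Step 1: random initial weights.
        weights = [[rng.random() for _ in range(n_features)] for _ in range(n_clusters)]
    else:
        weights = [list(vector) for vector in initial_weights]
    for _ in range(epochs):                                  # Step 6: fixed epoch budget
        for x in data:
            winner = find_winner(weights, x)                 # Steps 2 and 3
            weights[winner] = [w + alpha * (xi - w)          # Step 4
                               for w, xi in zip(weights[winner], x)]
        alpha *= 0.5                                         # Step 5: decay the learning rate
    return weights

data = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
weights = train_som(data, n_clusters=2, initial_weights=[[0.2, 0.2], [0.8, 0.8]])
print(weights)
print(find_winner(weights, [0.05, 0.05]), find_winner(weights, [0.95, 0.95]))
```

With random initial weights a cluster unit may never win and thus never move; updating a neighborhood around the winner, as in Step 4, is the usual remedy.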