UNIT V
• Each pattern is classified into one of two classes. Notice that these classes can be separated
with a single line L. They are known as linearly separable patterns.
• Linear separability refers to the fact that classes of patterns with n-dimensional vector x =
(x1, x2, …xn) can be separated with a single decision surface. In the case above, the line L
represents the decision surface.
• If two classes of patterns can be separated by a decision boundary represented by a linear
equation, then they are said to be linearly separable, and a simple (single-layer) network can
correctly classify such patterns.
• The decision boundary (i.e., W, b or θ) of linearly separable classes can be determined either
by a learning procedure or by solving a system of linear equations based on representative
patterns of each class.
• If such a decision boundary does not exist, then the two classes are said to be linearly
inseparable.
• Linearly inseparable problems cannot be solved by the simple network; a more sophisticated
architecture is needed.
• Examples of linearly separable classes
1. Logical AND function
2. Logical OR function
• Examples of linearly inseparable classes
1. Logical XOR (exclusive OR) function
No line can separate these two classes. For a unit that outputs 1 when w1x1 + w2x2 + b >= 0
and 0 otherwise, XOR would require
(1) b < 0 (for input (0,0))
(2) w2 + b >= 0 (for input (0,1))
(3) w1 + b >= 0 (for input (1,0))
(4) w1 + w2 + b < 0 (for input (1,1))
Adding (1) and (4) gives w1 + w2 + 2b < 0, while adding (2) and (3) gives w1 + w2 + 2b >= 0,
which is a contradiction, so this linear inequality system has no solution.
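• As a small illustration (a minimal plain-Python sketch; the weight values chosen below are only one possible solution), a single threshold unit can realize AND and OR, but no choice of weights reproduces XOR:

    def step_unit(x1, x2, w1, w2, b):
        # single threshold unit: output 1 if w1*x1 + w2*x2 + b >= 0, else 0
        return 1 if (w1 * x1 + w2 * x2 + b) >= 0 else 0

    patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
    # AND is linearly separable, e.g. with w1 = w2 = 1, b = -1.5
    print([step_unit(x1, x2, 1, 1, -1.5) for x1, x2 in patterns])  # [0, 0, 0, 1]
    # OR is linearly separable, e.g. with w1 = w2 = 1, b = -0.5
    print([step_unit(x1, x2, 1, 1, -0.5) for x1, x2 in patterns])  # [0, 1, 1, 1]
    # No (w1, w2, b) yields [0, 1, 1, 0]: the inequalities above are contradictory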
Activation Functions
• Activation functions, also known as transfer functions, map the input of a node to its output
in a certain fashion.
• The activation function is an important factor in a neural network: it decides whether a
neuron will be activated and whether its output is passed on to the next layer.
• Activation functions help normalize the output to a range such as 0 to 1 or -1 to 1. Their
differentiability is what makes backpropagation possible: during backpropagation, the gradient
of the loss function is computed, and gradient descent uses these gradients to move toward a
(local) minimum.
• The activation function essentially decides, for any neural network, whether a given input or
piece of received information is relevant or irrelevant.
• Non-linear activation functions are what give a multilayer network greater representational
power than a single-layer network; without non-linearity, the stacked layers collapse into a
single linear mapping.
• The input to the activation function is the weighted sum defined by the following equation.
Sum = I1W1 + I2W2 + ... + InWn + b = Σ_{j=1}^{n} IjWj + b
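• A minimal sketch of this weighted sum (using NumPy; the input values, weights, and bias below are purely illustrative):

    import numpy as np

    I = np.array([0.5, -1.2, 3.0])   # input signals I1..In
    W = np.array([0.8, 0.1, -0.4])   # connection weights W1..Wn
    b = 0.2                          # bias
    Sum = np.dot(I, W) + b           # Sum = sum_j Ij*Wj + b, passed to the activation function
    print(Sum)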
• Activation Function: Logistic Functions
• The logistic function increases monotonically from a lower limit (0 or -1) to an upper limit
(+1) as the sum increases. Its values vary between 0 and 1, taking the value 0.5 when the input
sum is zero.
• Activation Function: Arc Tangent
• Activation Function: Hyperbolic Tangent
Sigmoid
• A sigmoid function produces a curve with an "S" shape. The example sigmoid function given
below is a special case of the logistic function, which models the growth of some population.
sig(t) = 1 / (1 + e^(-t))
• In general, a sigmoid function is real-valued and differentiable, having an everywhere
non-negative (or everywhere non-positive) first derivative and a single inflection point.
• The logistic sigmoid function is related to the hyperbolic tangent as follows:
1 - 2·sig(x) = 1 - 2/(1 + e^(-x)) = -tanh(x/2)
• Sigmoid functions are often used in artificial neural networks to introduce nonlinearity in
the model.
• A neural network element computes a linear combination of its input signals, and applies a
sigmoid function to the result.
• A reason for its popularity in neural networks is that the sigmoid function satisfies a simple
relationship between itself and its derivative, so the derivative is computationally easy to
evaluate:
d/dt sig(t) = sig(t)(1 - sig(t))
• Derivatives of the sigmoid function are usually employed in learning algorithms.
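• A minimal sketch (using NumPy) of the logistic sigmoid, its derivative property, and the tanh relation quoted above:

    import numpy as np

    def sig(t):
        return 1.0 / (1.0 + np.exp(-t))

    def sig_prime(t):
        s = sig(t)
        return s * (1.0 - s)          # d/dt sig(t) = sig(t)(1 - sig(t))

    x = np.linspace(-5, 5, 11)
    print(np.allclose(1 - 2 * sig(x), -np.tanh(x / 2)))   # True: 1 - 2*sig(x) = -tanh(x/2)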
• Examples: Logistic function, Arc tangent function, Hyperbolic tangent activation function.
• Need of hidden layers:
1. A network with only two layers (input and output) can only represent the input with
whatever representation already exists in the input data.
2. If the data is discontinuous or not linearly separable, this innate representation is
insufficient, and the mapping cannot be learned using only two layers (input and output).
3. Therefore, hidden layer(s) are used between input and output layers
• Weights connect a unit (neuron) in one layer only to units in the next higher layer. The
output of the unit is scaled by the value of the connecting weight, and it is fed forward to
provide a portion of the activation for the units in the next higher layer.
• Backpropagation can be applied to an artificial neural network with any number of hidden
layers. The training objective is to adjust the weights so that the application of a set of inputs
produces the desired outputs.
• Training procedure: The network is usually trained with a large number of input-output
pairs; a minimal code sketch of this loop is given after the steps below.
1. Initialize the weights to small random values (both positive and negative) to ensure that
the network is not saturated by large weight values.
2. Choose a training pair from the training set.
3. Apply the input vector to network input.
4. Calculate the network output.
5. Calculate the error, the difference between the network output and the desired output.
6. Adjust the weights of the network in a way that minimizes this error.
7. Repeat steps 2 - 6 for each pair of input-output in the training set until the error for the
entire system is acceptably low.
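• A minimal sketch of this training procedure (using NumPy) for a network with one hidden layer; the layer sizes, learning rate, epoch count, and XOR training data are illustrative choices, not prescribed by the text:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # training inputs
    T = np.array([[0], [1], [1], [0]], dtype=float)               # desired outputs (XOR)

    # Step 1: initialize weights to small random values
    W1, b1 = rng.normal(0, 0.5, (2, 4)), np.zeros(4)
    W2, b2 = rng.normal(0, 0.5, (4, 1)), np.zeros(1)
    lr = 0.5

    def sig(t):
        return 1.0 / (1.0 + np.exp(-t))

    for epoch in range(5000):
        # Steps 3-4: apply the inputs and calculate the network output (forward pass)
        h = sig(X @ W1 + b1)
        y = sig(h @ W2 + b2)
        # Step 5: error between network output and desired output
        err = y - T
        # Step 6: adjust the weights to reduce this error (backward pass, delta rule)
        dy = err * y * (1 - y)
        dh = (dy @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ dy
        b2 -= lr * dy.sum(axis=0)
        W1 -= lr * X.T @ dh
        b1 -= lr * dh.sum(axis=0)

    print(np.round(y, 2))   # should approach [[0], [1], [1], [0]]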
Forward pass and backward pass:
• Backpropagation neural network training involves two passes.
1. In the forward pass, the input signals move forward from the network input to the output.
2. In the backward pass, the calculated error signals propagate backward through the network,
where they are used to adjust the weights.
3. In the forward pass, the calculation of the output is carried out, layer by layer, in the
forward direction. The output of one layer is the input to the next layer.
• In the reverse pass,
a. The weights of the output neuron layer are adjusted first since the target value of each
output neuron is available to guide the adjustment of the associated weights, using the delta
rule.
b. Next, we adjust the weights of the middle layers. Since the middle-layer neurons have no
target values of their own, the problem is more complex: their errors must be estimated from
the errors propagated back from the layer above.
• Selection of number of hidden units: The number of hidden units depends on the number of
input units.
1. Never choose h to be more than twice the number of input units.
2. You can load p patterns of I elements into log2 p hidden units.
3. Ensure that there are sufficiently many training examples relative to the number of weights
(a common rule of thumb is at least 1/ε times as many examples as weights, where ε is the
acceptable error fraction).
4. Feature extraction requires fewer hidden units than inputs.
5. Learning many examples of disjointed inputs requires more hidden units than inputs.
6. The number of hidden units required for a classification task increases with the number of
classes in the task. Large networks require longer training times.
Factors influencing Backpropagation training
• The training time can be reduced by using:
1. Bias: Networks with biases can represent relationships between inputs and outputs more
easily than networks without biases. Adding a bias to each neuron is usually desirable to
offset the origin of the activation function. The bias weight is trainable just like any other
weight, except that its input is always +1.
2. Momentum: The use of momentum enhances the stability of the training process.
Momentum keeps the training process moving in the same general direction, analogous to the
momentum of a moving object. In backpropagation with momentum, the weight change is a
combination of the step due to the current gradient and the previous weight change, as in the
sketch below.
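• A minimal sketch of a weight update with momentum (using NumPy); the learning rate, momentum coefficient, weights, and gradient below are illustrative values:

    import numpy as np

    lr, beta = 0.1, 0.9                  # learning rate and momentum coefficient
    w = np.array([0.5, -0.3])
    velocity = np.zeros_like(w)          # stores the previous weight change

    def momentum_step(w, grad, velocity):
        # new change = momentum * previous change - learning rate * current gradient
        velocity = beta * velocity - lr * grad
        return w + velocity, velocity

    grad = np.array([0.2, -0.1])         # current gradient of the error w.r.t. w
    w, velocity = momentum_step(w, grad, velocity)
    print(w, velocity)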
Shallow Networks
• The terms shallow and deep refer to the number of layers in a neural network: a shallow
neural network has a small number of layers, usually regarded as a single hidden layer, while
a deep neural network has multiple hidden layers. Each type of network performs certain tasks
better than the other, and selecting the right network depth is important for creating a
successful model.
• In a shallow neural network, the values of the feature vector of the data to be classified (the
input layer) are passed to a hidden layer of nodes (neurons) each of which generates a
response according to some activation function, g, acting on the weighted sum of those
values, z.
• The responses of the units in the hidden layer are then passed to a final, output layer (which
may consist of a single unit), whose activation produces the classification prediction output.
Deep Network
• Deep learning is a new area of machine learning research, which has been introduced with
the objective of moving machine learning closer to one of its original goals. Deep learning is
about learning multiple levels of representation and abstraction that help to make sense of
data such as images, sound, and text.
• 'Deep learning' means using a neural network with several layers of nodes between input
and output. It is generally better than other methods on image, speech and certain other types
of data because the series of layers between input and output do feature identification and
processing in a series of stages, just as our brains seem to.
• Deep learning emphasizes the network architecture of today's most successful machine
learning approaches. These methods are based on "deep" multi-layer neural networks with
many hidden layers.
TensorFlow
• TensorFlow is one of the most popular frameworks used to build deep learning models. The
framework is developed by Google Brain Team.
• Languages like C++, R and Python are supported by the framework to create the models as
well as the libraries. This framework can be accessed from both - desktop and mobile.
• Google Translate is a well-known example of an application built with TensorFlow. In it, the
model combines functionalities such as text classification, natural language processing,
speech or handwriting recognition, and image recognition.
• The framework has its own visualization toolkit, named TensorBoard which helps in
powerful data visualization of the network along with its performance.
• One more tool added in TensorFlow, TensorFlow Serving, can be used for quick and easy
deployment of the newly developed algorithms without introducing any change in the
existing API or architecture.
• The TensorFlow framework comes with detailed documentation that helps users adopt it
quickly and easily, making it one of the most preferred frameworks for modelling deep
learning algorithms.
• Some of the characteristics of TensorFlow are:
• Support for multiple GPUs
• One can visualize graphs and queues easily using TensorBoard.
• Powerful documentation and a large support community
Keras
• If you are comfortable programming in Python, learning Keras will not prove hard for you. It
is the most commonly recommended framework for creating deep learning models for those
who have a sound knowledge of Python.
• Keras is built purely in Python and can run on top of TensorFlow. Due to its complexity and
use of low-level libraries, TensorFlow can be comparatively harder for new users to adopt
than Keras. Users who are beginners in deep learning and find models difficult to express in
TensorFlow generally prefer Keras, as it lets them build even complex models in very little
time.
• Keras has been developed keeping in mind the complexities of deep learning models, and
hence it can be used to get results in minimum time. Convolutional as well as recurrent neural
networks are supported in Keras. The framework runs easily on both CPU and GPU.
• The models in Keras can be classified into 2 categories:
1. Sequential model:
The layers in the deep learning model are defined in a sequential manner. Hence the
implementation of the layers in this model will also be done sequentially.
2. Keras functional API:
Deep learning models that have multiple outputs or shared layers, i.e. more complex models,
can be implemented with the Keras functional API. Both styles are illustrated in the sketch
below.
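• A minimal sketch of the two Keras model styles; the layer sizes, input shape, and loss are illustrative and not prescribed by the text:

    from tensorflow import keras
    from tensorflow.keras import layers

    # 1. Sequential model: layers are defined and applied one after another
    seq_model = keras.Sequential([
        keras.Input(shape=(8,)),
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    seq_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    # 2. Functional API: supports shared layers and multiple inputs/outputs
    inputs = keras.Input(shape=(8,))
    x = layers.Dense(16, activation="relu")(inputs)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    func_model = keras.Model(inputs=inputs, outputs=outputs)
    func_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])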
ReLU (Rectified Linear Unit)
• The derivative of an activation function is required when updating the weights during
back-propagation of the error. The slope of ReLU is 1 for positive inputs and 0 for negative
inputs. ReLU is non-differentiable at x = 0, but the derivative there can safely be taken to be
zero, which causes no problems in practice.
• ReLU is used in the hidden layers instead of Sigmoid or tanh. The ReLU function solves the
problem of computational complexity of the Logistic Sigmoid and Tanh functions.
• A ReLU activation unit is known to be less likely to create a vanishing gradient problem
because its derivative is always 1 for positive values of the argument.
• Advantages of ReLU function
a) ReLU is simple to compute and has a predictable gradient for the backpropagation of the
error.
b) Easy to implement and very fast.
c) The calculation speed is very fast, since the ReLU function involves only a simple
comparison with zero.
d) It can be used for deep network training.
• Disadvantages of ReLU function
a) When the input is negative, ReLU outputs zero and its gradient is zero, so the affected
neuron may stop updating and effectively die. This problem is also known as the Dead
Neurons (dying ReLU) problem.
b) ReLU function can only be used within hidden layers of a Neural Network Model.
1. LReLU (Leaky ReLU)
• The leak helps to increase the range of the ReLU function. Usually, the value of the leak
coefficient α is 0.01 or so.
• The motivation for using LReLU instead of ReLU is that the constant zero gradient of ReLU
for negative inputs can also result in slow learning, much as when a saturated neuron uses a
sigmoid activation function.
2. EReLU
• An Elastic ReLU (EReLU) considers a slope randomly drawn from a uniform distribution
during the training for the positive inputs to control the amount of non-linearity.
• The EReLU is defined as EReLU(x) = max(Rx, 0), with output range [0, ∞), where R is a
random number drawn during training.
• At test time, the EReLU becomes the identity function for positive inputs; a sketch of these
variants follows.
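• A minimal sketch (using NumPy) of ReLU, Leaky ReLU, and the EReLU idea described above; the leak value 0.01 and the range chosen for the random slope R are illustrative assumptions:

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)                     # slope 1 for x > 0, slope 0 for x < 0

    def leaky_relu(x, alpha=0.01):
        return np.where(x > 0, x, alpha * x)          # small non-zero slope for x < 0

    def erelu_train(x, rng, low=0.5, high=1.5):
        R = rng.uniform(low, high, size=np.shape(x))  # random slope for positive inputs
        return np.maximum(R * x, 0.0)

    def erelu_test(x):
        return np.maximum(x, 0.0)                     # identity for positive inputs at test time

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(x), leaky_relu(x), erelu_train(x, np.random.default_rng(0)))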
Hyperparameter Tuning
• Hyperparameters are parameters whose values control the learning process and determine
the values of model parameters that a learning algorithm ends up learning.
• While designing a machine learning model, one always has multiple choices for its
architectural design. This creates uncertainty about which design is optimal for the model, and
because of this, a good machine learning model is usually arrived at only through repeated
trials.
• The parameters that are used to define these machine learning models are known as the
hyperparameters and the rigorous search for these parameters to build an optimized model is
known as hyperparameter tuning.
• Hyperparameters are not model parameters, which can be trained directly from data. Model
parameters specify how to transform the input into the required output, whereas
hyperparameters define the structure of the model and how it is trained.
Layer Size
• Layer size is defined by the number of neurons in a given layer. Input and output layer sizes
are relatively easy to determine because they correspond directly to how our modelling
problem handles input and output.
• For the input layer, this will match up to the number of features in the input vector. For the
output layer, this will either be a single output neuron or a number of neurons matching the
number of classes we are trying to predict.
• A neural network with three layers will often give better performance than one with two, but
for simple fully connected networks increasing the depth beyond three layers does not help
that much. In the case of CNNs, an increasing number of layers does tend to make the model
better.
Normalization
• Normalization is a data preparation technique that is frequently used in machine learning.
The process of transforming the columns in a dataset to a common scale is referred to as
normalization. Not every dataset needs to be normalized for machine learning.
• Normalization makes the features more consistent with each other, which allows the model
to predict outputs more accurately. The main goal of normalization is to make the data
homogenous over all records and fields.
• Normalization refers to rescaling real-valued numeric attributes into a 0 to 1 range. Data
normalization is used in machine learning to make model training less sensitive to the scale
of features.
• Normalization is important in such algorithms as k-NN, support vector machines, neural
networks, and principal components. The type of feature preprocessing and normalization
that's needed can depend on the data.
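• A minimal sketch (using NumPy) of min-max normalization, rescaling each column of an illustrative dataset into the 0 to 1 range:

    import numpy as np

    X = np.array([[10.0, 200.0],
                  [20.0, 400.0],
                  [30.0, 800.0]])

    X_min, X_max = X.min(axis=0), X.max(axis=0)
    X_norm = (X - X_min) / (X_max - X_min)   # every column now lies in [0, 1]
    print(X_norm)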
Batch Normalization
• It is a method of adaptive reparameterization, motivated by the difficulty of training very
deep models. In Deep networks, the weights are updated for each layer. So the output will no
longer be on the same scale as the input.
• When we input data to a machine or deep learning algorithm, we tend to rescale the values to
a balanced scale because this helps ensure that our model can generalize appropriately.
• Batch normalization is a technique for standardizing the inputs to layers in a neural
network. Batch normalization was designed to address the problem of internal covariate shift,
which arises as a consequence of updating multiple-layer inputs simultaneously in deep
neural networks.
• Batch normalization is applied to individual layers, or optionally, to all of them: In each
training iteration, we first normalize the inputs by subtracting their mean and dividing by
their standard deviation, where both are estimated based on the statistics of the current mini-
batch.
• Next, we apply a scale coefficient and an offset to recover the lost degrees of freedom. It is
precisely due to this normalization based on batch statistics that batch normalization derives
its name.
• We take the output a[i-1] from the preceding layer, multiply it by the weights W and add the
bias b of the current layer; the index i denotes the current layer.
Z[i] = W[i] a[i-1] + b[i]
• Next, we usually apply the non-linear activation function that results in the output a[i] of the
current layer. When applying batch norm, we correct our data before feeding it to the
activation function.
• To apply batch norm, we first calculate the mean and the variance of the current z over the
mini-batch of m examples.
μ = (1/m) Σ_{j=1}^{m} Zj
σ² = (1/m) Σ_{j=1}^{m} (Zj - μ)²
• To normalize the data, we subtract the mean and divide by the standard deviation; a small
constant ε is added to the variance under the square root to prevent a potential division by
zero.
Z_norm = (Z[i] - μ) / √(σ² + ε)
• This operation scales the inputs to have a mean of 0 and a standard deviation of 1.
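• A minimal sketch (using NumPy) of the batch-norm computation above for one layer; gamma (the scale) and beta (the offset) are the learnable parameters that restore the lost degrees of freedom:

    import numpy as np

    def batch_norm(Z, gamma, beta, eps=1e-5):
        mu = Z.mean(axis=0)                     # per-feature mean over the mini-batch
        var = Z.var(axis=0)                     # per-feature variance over the mini-batch
        Z_hat = (Z - mu) / np.sqrt(var + eps)   # zero mean, unit standard deviation
        return gamma * Z_hat + beta             # scale and shift

    Z = np.array([[1.0, 50.0], [2.0, 60.0], [3.0, 70.0]])
    print(batch_norm(Z, gamma=np.ones(2), beta=np.zeros(2)))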
• Advantages of Batch Normalization:
a) The model is less sensitive to hyperparameter tuning.
b) It shrinks internal covariate shift.
c) It diminishes the dependence of the gradients on the scale of the parameters or their initial
values.
d) Dropout can sometimes be removed, since batch normalization itself has a regularizing
effect.
Regularization
• When a model tries to cover every minutest feature of the input data, there can be
irregularities in the extracted features, which can introduce noise in the output. This is referred
to as "overfitting".
• The opposite may also happen when too few features are extracted, so that some of the
important details are missed. This affects the accuracy of the outputs produced and is referred
to as "underfitting".
• This also shows that the complexity of processing the input elements increases with
overfitting. Moreover, since neural networks are complex interconnections of nodes, the issue
of overfitting can arise frequently.
• To eliminate this, regularization is used: with only slight modifications to the design of the
neural network, we can obtain better outcomes.
Dropout
• Dropout was introduced by Hinton et al. and the method is now very popular. It consists of
setting to zero the output of each hidden neuron in a chosen layer with some probability, and it
has proven to be very effective in reducing overfitting.
• Fig. shows dropout regularization.
• To achieve dropout regularization, some neurons in the artificial neural network are
randomly disabled. That prevents them from being too dependent on one another as they
learn the correlations. Thus, the neurons work more independently, and the artificial neural
network learns multiple independent correlations in the data based on different configurations
of the neurons.
• It is used to improve the training of neural networks by omitting hidden units at random. It
also speeds up training.
• Dropout is driven by randomly dropping a neuron so that it will not contribute to the
forward pass and back-propagation.
• Dropout is an inexpensive but powerful method of regularizing a broad family of models; a
minimal sketch follows.
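• A minimal sketch (using NumPy) of dropout applied to a layer's activations during training; the keep probability of 0.5 and the rescaling by 1/keep_prob (so the expected activation is unchanged) are common choices, assumed here for illustration:

    import numpy as np

    def dropout(activations, keep_prob=0.5, rng=np.random.default_rng(0)):
        mask = rng.random(activations.shape) < keep_prob   # randomly disable some neurons
        return (activations * mask) / keep_prob            # rescale the surviving activations

    h = np.array([0.2, 1.5, -0.3, 0.8, 0.0, 2.1])
    print(dropout(h))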
DropConnect
• DropConnect, known as the generalized version of Dropout, is a method used for
regularizing deep neural networks. Fig. 10.11.3 shows DropConnect.
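• A minimal sketch (using NumPy) contrasting DropConnect with Dropout: Dropout zeroes neuron outputs, whereas DropConnect randomly zeroes individual weights (connections); the keep probability and shapes are illustrative:

    import numpy as np

    def dropconnect_forward(x, W, keep_prob=0.5, rng=np.random.default_rng(0)):
        mask = rng.random(W.shape) < keep_prob   # drop individual connections (weights)
        return x @ (W * mask)

    x = np.array([1.0, 2.0, 3.0])
    W = np.ones((3, 4))
    print(dropconnect_forward(x, W))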