CCS355 Neural Networks and Deep Learning - Unit 1
COURSE OBJECTIVES:
UNIT I INTRODUCTION
Each input to a neuron is scaled with a weight, which affects the function
computed at that unit.
Banking:
Credit card attrition, credit and loan application evaluation, fraud and risk
evaluation, and loan delinquencies
Business Analytics:
Customer behaviour modelling, customer segmentation, fraud propensity,
market research, market mix, market structure, and models for attrition, default,
purchase, and renewals
Defence:
Counterterrorism, facial recognition, feature extraction, noise suppression,
object discrimination, sensors, sonar, radar and image signal processing,
signal/image identification, target tracking, and weapon steering
Education:
Adaptive learning software, dynamic forecasting, education system
analysis and forecasting, student performance modelling, and personality
profiling
Financial:
Corporate bond ratings, corporate financial analysis, credit line use
analysis, currency price prediction, loan advising, mortgage screening, real estate
appraisal, and portfolio trading
Medical:
Cancer cell analysis, ECG and EEG analysis, emergency room test
advisement, expense reduction and quality improvement for hospital systems,
transplant process optimization, and prosthesis design
Securities:
Automatic bond rating, market analysis, and stock trading advisory
systems
Transportation:
Routing systems, truck brake diagnosis systems, and vehicle scheduling.
CCS355 NEURAL NETWORKS & DEEP LEARNING
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by
the programmer.
Hidden Layer:
The hidden layer lies between the input and output layers. It performs all the
calculations to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which
finally results in output that is conveyed using this layer.
The artificial neural network takes input and computes the weighted sum
of the inputs and includes a bias. This computation is represented in the form of
a transfer function.
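The computation described above can be sketched in Python; the input, weight, and bias values in the example are hypothetical:

```python
import math

# A single artificial neuron: weighted sum of inputs plus a bias,
# passed through a transfer (activation) function.
def neuron_output(inputs, weights, bias, transfer=None):
    net = sum(x * w for x, w in zip(inputs, weights)) + bias
    if transfer is None:
        transfer = lambda v: 1.0 / (1.0 + math.exp(-v))  # logistic sigmoid
    return transfer(net)

# Example with hypothetical inputs, weights, and bias:
y = neuron_output([1.0, 0.5], weights=[0.4, -0.2], bias=0.1)
```

Any of the activation functions discussed later in this unit can be passed in as the transfer function.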
McCulloch and Pitts subsequent work [Pitts & McCulloch, 1947] addressed
issues that are still important research areas today, such as translation and rotation
invariant pattern recognition.
Hebb learning
Donald Hebb, a psychologist at McGill University, designed the first
learning law for artificial neural networks [Hebb, 1949]. His premise was that
if two neurons were active simultaneously, then the strength of the connection
between them should be increased. Refinements were subsequently made to this
rather general statement to allow computer simulations [Rochester, Holland,
Haibt & Duda, 1956]. The idea is closely related to the correlation matrix
learning developed by Kohonen (1972) and Anderson (1972), among others. In an
expanded form of Hebb learning [McClelland & Rumelhart, 1988], units that are
simultaneously off also reinforce the weight on the connection
between them.
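Hebb's premise can be sketched as a simple weight update; the learning-rate value and bipolar example below are assumptions for illustration:

```python
# Hebb learning rule: when input x_i and output y are active together,
# strengthen connection weight w_i by  delta_w_i = eta * x_i * y.
def hebb_update(weights, x, y, eta=1.0):
    return [w + eta * xi * y for w, xi in zip(weights, x)]

# Bipolar example: units that are "on" (+1) together reinforce the weight.
w = hebb_update([0.0, 0.0], x=[1, -1], y=1)   # -> [1.0, -1.0]
```

In the expanded form, bipolar signals (+1/-1) mean that two units that are simultaneously off (-1, -1) also produce a positive product and thus a reinforced weight.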
The 1950s and 1960s: The First Golden Age of Neural Networks
Although today neural networks are often viewed as an alternative to (or
complement of) traditional computing, it is interesting to note that John von
Neumann, the "father of modern computing," was keenly interested in modeling
the brain [von Neumann, 1958]. Johnson and Brown (1988) and Anderson and
Rosenfeld (1988) discuss the interaction between von Neumann and early neural
network researchers such as Warren McCulloch, and present further indication of
von Neumann's views of the directions in which computers would develop.
Perceptrons
Together with several other researchers [Block, 1962; Minsky & Papert,
1988 (originally published 1969)], Frank Rosenblatt (1958, 1959, 1962)
introduced and developed a large class of artificial neural networks called
perceptrons. The most typical perceptron consisted of an input layer (the retina)
connected by paths with fixed weights to associator neurons; the weights on the
connection paths were adjustable. The perceptron learning rule uses an iterative
weight adjustment that is more powerful than the Hebb rule. Perceptron learning
can be proved to converge to the correct weights if there are weights that will
solve the problem at hand (i.e., allow the net to reproduce correctly all of the
training input and target output pairs). Rosenblatt's 1962 work describes many
types of perceptrons. Like the neurons developed by McCulloch and Pitts and by
Hebb, perceptrons use a threshold output function.
The early successes with perceptrons led to enthusiastic claims. However,
the mathematical proof of the convergence of iterative learning under suitable
assumptions was followed by a demonstration of the limitations regarding what
the perceptron type of net can learn [Minsky & Papert, 1969].
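The iterative weight adjustment described above can be sketched as follows; the bipolar encoding, learning rate, and the AND-function example are assumptions for illustration:

```python
# Perceptron learning rule with a threshold (step) output function:
# weights are adjusted only when the response is incorrect.
def step(net):
    return 1 if net >= 0 else -1

def train_perceptron(samples, n_inputs, eta=1.0, epochs=20):
    w = [0.0] * n_inputs
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            y = step(sum(wi * xi for wi, xi in zip(w, x)) + b)
            if y != target:                       # adjust only on error
                w = [wi + eta * target * xi for wi, xi in zip(w, x)]
                b += eta * target
    return w, b

# Bipolar AND: output +1 only when both inputs are +1 (linearly separable,
# so the convergence theorem guarantees a solution is found).
and_data = [([1, 1], 1), ([1, -1], -1), ([-1, 1], -1), ([-1, -1], -1)]
w, b = train_perceptron(and_data, n_inputs=2)
```

The limitation Minsky and Papert demonstrated is visible here: replacing the targets with the XOR function gives a problem with no solving weights, so the loop never converges.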
ADALINE
Bernard Widrow and his student, Marcian (Ted) Hoff [Widrow & Hoff,
1960], developed a learning rule (which usually either bears their names, or is
designated the least mean squares or delta rule) that is closely related to the
perceptron learning rule. The perceptron rule adjusts the connection weights to a
unit whenever the response of the unit is incorrect. (The response indicates a
classification of the input pattern.) The delta rule adjusts the weights to reduce the
difference between the net input to the output unit and the desired output. This
results in the smallest mean squared error. The similarity of models developed in
psychology by Rosenblatt to those developed in electrical engineering by Widrow
and Hoff is evidence of the interdisciplinary nature of neural networks. The
difference in learning rules, although slight, leads to an improved ability of the
net to generalize (i.e., respond to input that is similar, but not identical, to that on
which it was trained). The Widrow-Hoff learning rule for a single-layer network
is a precursor of the backpropagation rule for multilayer nets.
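The delta rule can be sketched in the same style. Note the contrast with the perceptron rule: the update is driven by the difference between the target and the raw net input, on every presentation. The learning rate and the AND-function training set are assumptions for illustration:

```python
# Delta (Widrow-Hoff / least mean squares) rule:
#   w_i += eta * (target - net) * x_i
# where net is the raw weighted sum (no threshold during training).
def delta_update(w, b, x, target, eta=0.1):
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    err = target - net
    w = [wi + eta * err * xi for wi, xi in zip(w, x)]
    b += eta * err
    return w, b

w, b = [0.0, 0.0], 0.0
for _ in range(100):                      # repeated passes shrink the error
    for x, t in [([1, 1], 1), ([1, -1], -1), ([-1, 1], -1), ([-1, -1], -1)]:
        w, b = delta_update(w, b, x, t)
```

Because the error is continuous rather than all-or-nothing, the weights settle near the least-mean-squares solution instead of merely any separating boundary, which is the source of the improved generalization mentioned above.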
Kohonen
The early work of Teuvo Kohonen (1972), of Helsinki University of
Technology, dealt with associative memory neural nets. His more recent work
[Kohonen, 1982] has been the development of self-organizing feature maps that
use a topological structure for the cluster units. These nets have been applied to
speech recognition (for Finnish and Japanese words) [Kohonen, Torkkola,
Shozakai, Kangas, & Venta, 1987; Kohonen, 1988], and the solution of the
"Traveling Salesman Problem."
Anderson
James Anderson, of Brown University, also started his research in neural
networks with associative memory nets [Anderson, 1968, 1972]. He developed
these ideas into his "Brain-State-in-a-Box" [Anderson, Silverstein, Ritz, & Jones,
1977], which truncates the linear output of earlier models to prevent the output
from becoming too large as the net iterates to find a stable solution (or memory).
Among the areas of application for these nets are medical diagnosis and learning
multiplication tables. Anderson and Rosenfeld (1988) and Anderson, Pellionisz,
and Rosenfeld (1990) are collections of fundamental papers on neural network
research. The introductions to each are especially useful.
Grossberg
Stephen Grossberg, together with his many colleagues and coauthors, has
had an extremely prolific and productive career. Klimasauskas (1989) lists 146
publications by Grossberg from 1967 to 1988. His work, which is very
mathematical and very biological, is widely known [Grossberg, 1976, 1980, 1982,
1987, 1988]. Grossberg is director of the Center for Adaptive Systems at Boston
University.
Carpenter
Together with Stephen Grossberg, Gail Carpenter has developed a theory of self-
organizing neural networks called adaptive resonance theory [Carpenter &
Grossberg, 1985, 1987a, 1987b, 1990]. Adaptive resonance theory nets for binary
input patterns (ART1) and for continuously valued inputs (ART2) will be examined in
Chapter 5.
Backpropagation
Parker's work came to the attention of the Parallel Distributed Processing
Group led by psychologists David Rumelhart, of the University of California at
San Diego, and James McClelland, of Carnegie-Mellon University, who refined
and publicized it [Rumelhart, Hinton, & Williams, 1986a, 1986b; McClelland &
Rumelhart, 1988].
Hopfield nets
Another key player in the increased visibility of and respect for neural nets
is prominent physicist John Hopfield, of the California Institute of Technology.
Together with David Tank, a researcher at AT&T, Hopfield has developed a
number of neural networks based on fixed weights and adaptive activations
[Hopfield, 1982, 1984; Hopfield & Tank, 1985, 1986; Tank & Hopfield, 1987].
These nets can serve as associative memory nets and can be used to solve
constraint satisfaction problems such as the "Traveling Salesman Problem." An
article in Scientific American [Tank & Hopfield, 1987] helped to draw popular
attention to neural nets, as did the message of a Nobel prize-winning physicist that,
in order to make machines that can do what humans do, we need to study human
cognition.
Neocognitron
Kunihiko Fukushima and his colleagues at NHK Laboratories in Tokyo
have developed a series of specialized neural nets for character recognition. One
example of such a net, called a neocognitron, is described in Chapter 7. An
earlier self-organizing network, called the cognitron [Fukushima, 1975], failed to
recognize position- or rotation-distorted characters. This deficiency was
corrected in the neocognitron [Fukushima, 1988; Fukushima, Miyake, & Ito,
1983].
Boltzmann machine
A number of researchers have been involved in the development of
nondeterministic neural nets, that is, nets in which weights or activations are
changed on the basis of a probability density function [Kirkpatrick, Gelatt, &
Vecchi, 1983; Geman & Geman, 1984; Ackley, Hinton, & Sejnowski, 1985; Szu
& Hartley, 1987]. These nets incorporate such classical ideas as simulated
annealing and Bayesian decision theory.
Hardware implementation
Another reason for renewed interest in neural networks (in addition to
solving the problem of how to train a multilayer net) is improved computational
capabilities. Optical neural nets [Farhat, Psaltis, Prata, & Paek, 1985] and VLSI
implementations [Sivilatti, Mahowald, & Mead, 1987] are being developed.
1. Interconnections
2. Learning rules
3. Activation functions
Recurrent neural networks (RNNs) are networks in which connections between
nodes form a directed graph along a temporal sequence. This
allows them to exhibit dynamic temporal behavior for a time sequence. Unlike
feedforward neural networks, RNNs can use their internal state (memory) to
process sequences of inputs.
3. Activation Function
• A person is performing some work. To make the work more efficient and
to obtain exact output, some force or activation may be given. This
activation helps in achieving the exact output. In a similar way, the
activation function is applied over the net input to calculate the output
of an ANN.
• In the process of building a neural network, one of the choices you get to
make is what activation function to use in the hidden layer as well as at the
output layer of the network.
Identity function: f(x) = x for all x.
Single-layer nets often use a step function to convert the net input, which
is a continuously valued variable, to an output unit that is a binary (1 or 0)
or bipolar (1 or -1) signal (see Figure 1.8). The use of a threshold in this
regard is discussed in Section 2.1.2. The binary step function is also known as
the threshold function or Heaviside function.
The logistic sigmoid function can be scaled to have any range of values
that is appropriate for a given problem. The most common range is from
-1 to 1; we call this sigmoid the bipolar sigmoid.
(iv) Bipolar sigmoid: f(x) = (1 - e^(-x)) / (1 + e^(-x)) = 2 / (1 + e^(-x)) - 1, with range (-1, 1).
(v) tanh: f(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
(vi) ReLU: f(x) = max(0, x).
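The activation functions above can be collected in a short sketch (the threshold value for the step function is an assumption):

```python
import math

# The common activation functions discussed in this section.
def binary_step(x, theta=0.0):      # threshold (Heaviside) function
    return 1 if x >= theta else 0

def binary_sigmoid(x):              # logistic sigmoid, range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def bipolar_sigmoid(x):             # scaled logistic sigmoid, range (-1, 1)
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def tanh_act(x):                    # hyperbolic tangent, range (-1, 1)
    return math.tanh(x)

def relu(x):                        # rectified linear unit
    return max(0.0, x)
```

Note that the bipolar sigmoid and tanh agree up to a rescaling of the input: bipolar_sigmoid(x) equals tanh(x/2).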
REGULARIZATION:
Need of Regularization
• Overfitting refers to the phenomenon where a neural network models the
training data very well but fails when it sees new data from the same
problem domain.
• Overfitting is caused by noise in the training data that the neural network
picks up during training and learns as an underlying concept of the data.
• This learned noise, however, is unique to each training set. As soon as the
model sees new data from the same problem domain, but that does not
contain this noise, the performance of the neural network gets much worse.
• The reason for this is that the complexity of this network is too high.
• The model with a higher complexity is able to pick up and learn patterns
(noise) in the data that are just caused by some random fluctuation or error.
• Less complex neural networks are less susceptible to overfitting. To
prevent overfitting or a high variance we must use something that is called
regularization.
What Is Regularization?
Regularization means restricting a model to avoid overfitting by shrinking
the coefficient estimates to zero. When a model suffers from overfitting, we
should control the model's complexity. Technically, regularizations avoid
overfitting by adding a penalty to the model's loss function:
Regularization = Loss Function + Penalty
The following commonly used regularization techniques control the
complexity of machine learning models:
• L2 regularization
• L1 regularization
• Elastic Net regularization
• Early stopping
• Drop-out
L2 Regularization
A linear regression that uses the L2 regularization technique is called ridge
regression. In other words, in ridge regression, a regularization term is added to
the cost function of the linear regression, which keeps the magnitude of the
model's weights (coefficients) as small as possible. The L2 regularization
technique tries to keep the model's weights close to zero, but not zero, which
means each feature should have a low impact on the output while the model's
accuracy should be as high as possible.
Cost function = Loss + λ Σj Wj²
where λ controls the strength of regularization, and the Wj are the model's
weights (coefficients).
By increasing λ, the model becomes flatter and tends to underfit. On the other
hand, by decreasing λ, the model becomes more prone to overfitting, and with
λ = 0, the regularization term is eliminated.
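The shrinkage effect of λ can be seen in a minimal sketch of ridge regression using its closed-form solution; the toy dataset, noise level, and λ values below are all hypothetical:

```python
import numpy as np

# Ridge regression closed form: w = (X^T X + lam * I)^(-1) X^T y
def ridge_fit(X, y, lam):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Hypothetical toy dataset: y is a noisy linear function of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=20)

w_small = ridge_fit(X, y, lam=0.1)    # mild shrinkage
w_large = ridge_fit(X, y, lam=100.0)  # strong shrinkage toward zero
```

Increasing λ pulls every weight toward (but not exactly to) zero, which is the "close to zero, but not zero" behavior described above.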
L1 Regularization
Least Absolute Shrinkage and Selection Operator (lasso) regression is an
alternative to ridge for regularizing linear regression. Lasso regression also adds
a penalty term to the cost function, but slightly different, called L1 regularization.
L1 regularization makes some coefficients zero, meaning the model will ignore
those features. Ignoring the least important features helps emphasize the model's
essential features.
Cost function = Loss + λ Σj |Wj|
where λ controls the strength of regularization, and the Wj are the model's
weights (coefficients).
Lasso regression automatically performs feature selection by eliminating the least
important features.
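The mechanism by which the L1 penalty zeroes out coefficients can be illustrated with the soft-thresholding operator, the proximal step used in many lasso solvers; the coefficient values and threshold below are hypothetical:

```python
import numpy as np

# Soft-thresholding: shrink each coefficient by t, and set it exactly to
# zero if its magnitude is below t. This is what makes lasso perform
# feature selection, unlike the L2 penalty.
def soft_threshold(w, t):
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([0.9, -0.05, 0.3, -0.8])
print(soft_threshold(w, 0.1))  # the small coefficient becomes exactly 0
```

Compare with ridge, which scales weights down multiplicatively and so never produces exact zeros.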
Early stopping
Early stopping is a kind of cross-validation strategy where we keep one part
of the training set as the validation set. When we see that the performance on the
validation set is getting worse, we immediately stop the training on the model.
This is known as early stopping.
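The stopping logic can be sketched as follows; the validation-loss values and the patience parameter are hypothetical, standing in for a real network's per-epoch evaluation:

```python
# Minimal early-stopping sketch: track the best validation loss seen so
# far and stop after `patience` consecutive epochs without improvement.
def early_stopping(val_losses, patience=2):
    """Return the epoch with the best loss, and that loss."""
    best_loss, best_epoch, bad = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break        # validation performance stopped improving
    return best_epoch, best_loss

# Validation loss improves, then worsens: training stops after epoch 2.
print(early_stopping([0.9, 0.5, 0.4, 0.45, 0.5, 0.6]))  # (2, 0.4)
```

In practice the model's weights from the best epoch are saved and restored when training stops.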
Dropout:
Dropout randomly deactivates (drops out) a fraction of the neurons during each
training step, forcing the remaining neurons to learn more robust features. A
dropout layer does not drop out any neurons during inference (i.e., when making
predictions on new data), since we want the full power of the network to be used
for making accurate predictions.
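The training/inference distinction can be sketched with the common "inverted dropout" formulation; the dropout rate and input values are hypothetical:

```python
import numpy as np

# Inverted dropout: during training, zero each activation with probability
# p and rescale the survivors by 1/(1-p) so the expected value is
# unchanged; at inference, pass activations through untouched.
def dropout(a, p=0.5, training=True, rng=None):
    if not training or p == 0.0:
        return a                       # inference: use the full network
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(a.shape) >= p    # keep each unit with prob 1 - p
    return a * mask / (1.0 - p)

a = np.ones(10)
print(dropout(a, p=0.5, training=False))  # unchanged at inference
```

The rescaling is why no adjustment is needed at prediction time: the inference-mode output already matches the training-time expectation.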
Advantages of Dropout:
1. Improved Generalization: Dropout is an effective technique for reducing
overfitting and improving generalization. By randomly dropping out neurons
during training, the network learns to be more robust to different subsets of
neurons being dropped out and therefore can generalize better to unseen data.
2. Simplicity: Dropout is a simple and easy-to-implement regularization technique
that does not require extensive hyperparameter tuning. It can be easily integrated
into existing neural network architectures.
3. Computationally efficient: Dropout is computationally efficient, and can be
easily parallelized, allowing for faster training times and the ability to scale to
large datasets.
4. Reduces co-adaptation: Dropout encourages neurons to be more independent
and reduces the co-adaptation between neurons. This can help prevent overfitting
and improve model performance.
5. No additional data required: Unlike some other regularization techniques,
dropout does not need any additional data for training.
Drawbacks of Dropout:
1. Increased Training Time: Dropout increases the training time of the neural
network, as the network needs to be trained multiple times with different subsets
of neurons dropped out. However, this can be mitigated by parallelizing the
training process.
2. Reduced learning rate: The use of dropout can reduce the effective learning rate
of the network, which can slow down the learning process.
3. Can cause instability: In some cases, dropout can cause instability during
training, particularly if the dropout rate is too high. This can be addressed by
tuning the dropout rate and adjusting other hyperparameters.
4. Cannot be used by all types of networks.
Criterion | Supervised Learning | Unsupervised Learning | Reinforcement Learning
Input Data | Input data is labelled. | Input data is not labelled. | Input data is not predefined.
Problem | Learn the pattern of inputs and their labels. | Divide data into classes. | Find the best reward between a start and an end state.
Solution | Finds a mapping equation on input data and its labels. | Finds similar features in input data to classify it into classes. | Maximizes reward by assessing the results of state-action pairs.
Model Building | Model is built and trained prior to testing. | Model is built and trained prior to testing. | The model is trained and tested simultaneously.
Applications | Deals with regression and classification problems. | Deals with clustering and association rule mining problems. | Deals with exploration and exploitation problems.