MLT Quantum

UNIT 1 : Introduction

CONTENTS
Part-1 : Learning, Types of Learning ........................... 1-2L to 1-7L
Part-4 : Issues in Machine Learning and .................... 1-24L to 1-26L
         Data Science Vs. Machine Learning

1-2L (CS/IT-Sem-5) Introduction
PART 1
Learning, Types of Learning.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Answer
1. Learning refers to the change in a subject's behaviour to a given situation brought about by repeated experiences in that situation, provided that the behaviour changes cannot be explained on the basis of native response tendencies, maturation, or temporary states of the subject.
[Fig. 1.1.1 : General learning model — the environment (or teacher) supplies stimuli, examples and feedback to the learner component, which updates the knowledge base; the performance component uses the knowledge base to respond to tasks, and a critic (performance evaluator) judges its responses.]
1. Acquisition of new knowledge :
a. One component of learning is the acquisition of new knowledge.
Machine Learning Techniques 1-3L (CS/IT-Sem-5)
Answer
Following are the performance measures for learning :
1. Generality :
a. The most important performance measure for learning methods is the generality or scope of the method.
b. Generality is a measure of the ease with which the method can be adapted to different domains of application.
c. A completely general algorithm is one which is a fixed or self-adjusting configuration that can learn or adapt in any environment or application domain.
2. Efficiency :
a. The efficiency of a method is a measure of the average time required to construct the target knowledge structures from some specified initial structures.
b. Since this measure is often difficult to determine and is meaningless without some standard comparison time, a relative efficiency index can be used instead.
3. Robustness :
a. Robustness is the ability of a learning system to function with unreliable feedback and with a variety of training examples, including noisy ones.
b. A robust system must be able to build tentative structures which are subjected to modification or withdrawal if later found to be inconsistent with statistically sound structures.
4. Efficacy :
a. The efficacy of a system is a measure of the overall power of the system.
5. Ease of implementation:
a. Ease of implementation relates to the complexity of the programs
and data structures, and the resources required to develop the
given learning system.
Answer
Supervised learning :
1. Supervised learning is also known as associative learning, in which the network is trained by providing it with input and matching output patterns.
2. Supervised training requires the pairing of each input vector with a target vector representing the desired output.
3. The input vector together with the corresponding target vector is called a training pair.
[Fig. 1.3.1 : Supervised learning — the input feature vector feeds a neural network whose output is matched against the target feature vector; the error vector drives weight/threshold adjustment via the supervised learning algorithm.]
4. During the training session an input vector is applied to the network, and it results in an output vector.
5. This response is compared with the target response.
6. If the actual response differs from the target response, the network generates an error signal.
9. These input-output pairs can be provided by an external teacher, or by the system which contains the neural network (self-supervised).
10. Supervised training methods are used to perform non-linear mappings in pattern classification networks and pattern association networks.
12. In some cases, the map is implemented as a set of local models such as in case-based reasoning or the nearest neighbour algorithm.
13. In order to solve the problem of supervised learning, the following steps are considered :
Unsupervised learning :
8. Though unsupervised training does not require a teacher, it requires certain guidelines based on the properties of the object.
10. It is a method of machine learning where a model is fit to observations.
Answer
1. Reinforcement learning is the study of how artificial systems can learn to optimize their behaviour in the face of rewards and punishments.
2. Reinforcement learning algorithms have been developed that are closely
related to methods of dynamic programming which is a general approach
to optimal control.
[Fig. : Reinforcement learning — the environment supplies a state (input) vector and a primary reinforcement signal to the critic; the critic sends a heuristic reinforcement signal to the learning system, whose actions feed back into the environment.]
Answer
Steps used to design a learning system are :
1. Specify the learning task.
2. Choose a suitable set of training data to serve as the training experience.
3. Divide the training data into groups or classes and label them accordingly.
4. Determine the type of knowledge representation to be learned from the training experience.
5. Choose a learner classifier that can generate general hypotheses from the training data.
6. Apply the learner classifier to test data.
7. Compare the performance of the system with that of an expert human.
[Fig. 1.5.1 : The environment supplies experience to the learner, which builds up knowledge used by the performance element.]
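The seven steps above can be sketched end-to-end. The following is a minimal illustration (not from the book), assuming a hypothetical two-class dataset and a 1-nearest-neighbour learner :

```python
import math
import random

def nearest_neighbour(train, query):
    """Predict the label of `query` as the label of its closest training point."""
    features, label = min(train, key=lambda pair: math.dist(pair[0], query))
    return label

# Steps 2-3: gather training data already labelled into two classes.
random.seed(0)
data = [((random.gauss(0, 1), random.gauss(0, 1)), "A") for _ in range(20)] + \
       [((random.gauss(4, 1), random.gauss(4, 1)), "B") for _ in range(20)]
random.shuffle(data)
train, test = data[:30], data[30:]  # training experience vs. held-out test data

# Step 6: apply the learner to test data; Step 7: measure its performance.
correct = sum(nearest_neighbour(train, x) == y for x, y in test)
accuracy = correct / len(test)
print(f"accuracy = {accuracy:.2f}")
```

The resulting accuracy would then be compared with an expert's performance, as in step 7.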
PART-2
Well Defined Learning Problems, Designing a Learning System.
Questions-Answers
Que 1.6. Write a short note on well defined learning problem with example.
Answer
Well defined learning problem :
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Three features in learning problems :
1. The class of tasks (T)
2. The measure of performance to be improved (P)
3. The source of experience (E)
For example :
1. A checkers learning problem :
a. Task (T) : Playing checkers.
b. Performance measure (P) : Percent of games won against opponents.
c. Training experience (E) : Playing practice games against itself.
2. A handwriting recognition learning problem :
a. Task (T) : Recognizing and classifying handwritten words within images.
b. Performance measure (P) : Percent of words correctly classified.
c. Training experience (E) : A database of handwritten words with given classifications.
Que 1.7. Describe the role of well defined learning problems in machine learning.
Answer
Role of well defined learning problems in machine learning :
1. Learning to recognize spoken words :
a. Successful speech recognition systems employ machine learning in some form.
PART-3
History of ML, Introduction of Machine Learning
Approaches : (Artificial Neural Network, Clustering, Reinforcement
Learning, Decision Tree Learning, Bayesian Network, Support
Vector Machine, Genetic Algorithm).
Questions-Answers
Answer
A. Early history of machine learning :
1. In 1943, neurophysiologist Warren McCulloch and mathematician Walter Pitts wrote a paper about neurons and how they work. They created a model of neurons using an electrical circuit, and thus the neural network was created.
4. In 1959, Bernard Widrow and Marcian Hoff created two models of neural network. The first was called ADALINE, and it could detect binary patterns. For example, in a stream of bits, it could predict what the next one would be. The second was called MADALINE, and it could eliminate echo on phone lines.
B. 1980s and 1990s :
1. In 1982, John Hopfield suggested creating a network which had bidirectional lines, similar to how neurons actually work.
i. Google Brain (2012)
ii. AlexNet (2012)
v. OpenAI (2015)
vi. ResNet (2015)
vii. U-net (2015)
2. Speech recognition :
a. Speech Recognition (SR) is the translation of spoken words into text.
4. Statistical arbitrage :
a. In finance, statistical arbitrage refers to automated trading strategies that are typically short-term and involve a large number of securities.
b. In such strategies, the user tries to implement a trading algorithm for a set of securities on the basis of quantities such as historical correlations and general economic variables.
5. Learning associations : Learning association is the process of discovering relations between items that frequently occur together.
6. Information extraction :
a. Information Extraction (IE) is another application of machine learning.
b. It is the process of extracting structured information from unstructured data.
Que. What are the advantages of machine learning ?
Answer
Advantages of machine learning are :
3. Continuous improvement :
a. As ML algorithms gain experience, they keep improving in accuracy and efficiency.
b. As the amount of data keeps growing, algorithms learn to make accurate predictions faster.
2. It does not consider spatial relationships in the data.
Answer
1. Artificial Neural Networks (ANN), or neural networks, are computational algorithms intended to simulate the behaviour of biological systems composed of neurons.
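As a concrete illustration of a single artificial neuron (a hypothetical sketch, not from the book), the code below trains a perceptron to compute the logical AND function :

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """Train one neuron with a step activation: out = 1 if w.x + b > 0 else 0."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - out          # perceptron error-correction rule
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

# Hypothetical task: learn logical AND, which is linearly separable.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else 0 for (x1, x2), _ in data]
print(preds)  # → [0, 0, 0, 1]
```

Because AND is linearly separable, the perceptron converges to a correct weight setting within a few epochs.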
10. There are certain situations where clustering is useful. These include :
a. The collection and classification of training data can be costly and time-consuming, therefore it is difficult to collect a training data set. A large number of training samples are not all labelled. Then it is useful to train a supervised classifier with a small portion of the training data and then use clustering procedures to tune the classifier based on the large, unclassified dataset.
b. For data mining, it can be useful to search for groupings among the data and then recognize the clusters.
c. The properties of feature vectors can change over time. Then, supervised classification is not reasonable, because the test feature vectors may have completely different properties.
d. Clustering can be useful when it is required to search for good parametric families for the class conditional densities, in the case of supervised classification.
Answer
Following are the applications of clustering:
1. Data reduction :
a. In many cases, the amount of available data is very large and its
processing becomes complicated.
b. Cluster analysis can be used to group the data into a number of
clusters and then process each cluster as a single entity.
Answer
1. Clustering techniques are used for combining observed examples into clusters or groups which satisfy two main criteria.
Clustering methods are divided into two types : hierarchical and partitional.
1. Hierarchical clustering :
a. Agglomerative hierarchical clustering : This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster.
b. Divisive hierarchical clustering :
i. This top-down strategy does the reverse of the agglomerative strategy by starting with all objects in one cluster.
ii. It subdivides the cluster into smaller and smaller pieces until each object forms a cluster on its own.
2. Partitional clustering :
a. This method first creates an initial set of partitions, where each partition represents a cluster.
b. The clusters are formed to optimize an objective partition criterion, such as a dissimilarity function based on distance, so that the objects within a cluster are similar whereas the objects of different clusters are dissimilar.
Following are the types of partitioning methods :
a. Centroid-based clustering :
i. In this, it takes the input parameter and partitions a set of objects into a number of clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.
ii. Cluster similarity is measured in terms of the mean value of the objects in the cluster, which can be viewed as the cluster's centroid or center of gravity.
b. Model-based clustering : This method hypothesizes a model for each of the clusters and finds the best fit of the data to that model.
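Centroid-based partitioning can be sketched with a bare-bones k-means loop (a hypothetical illustration on made-up points, not from the book) :

```python
import math

def kmeans(points, k, iters=20):
    """Centroid-based partitional clustering (Lloyd's algorithm sketch)."""
    centroids = points[:k]  # simple deterministic initialisation
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # Update step: each centroid becomes the mean of its members.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Two obvious hypothetical groups, around (0, 0) and (10, 10).
pts = [(0, 0), (1, 0), (0, 1), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # → [3, 3]
```

The intracluster distances end up small while the two centroids stay far apart, matching the criterion described above.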
Answer
1. Reinforcement learning is the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments.
2. Reinforcement learning algorithms are related to methods of dynamic programming, which is a general approach to optimal control.
3. Reinforcement learning phenomena have been observed in psychological studies of animal behaviour, and in neurobiological investigations of neuromodulation and addiction.
4. The task of reinforcement learning is to use observed rewards to learn an optimal policy for the environment. An optimal policy is a policy that maximizes the expected total reward.
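One standard way to learn such a policy from observed rewards is tabular Q-learning. The sketch below (an illustration under assumed dynamics, not the book's example) learns to walk right along a small chain of states, where only the rightmost state pays a reward :

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, eps=0.2):
    """Tabular Q-learning on a hypothetical 1-D chain environment."""
    q = [[0.0, 0.0] for _ in range(n_states)]  # actions: 0 = left, 1 = right
    rng = random.Random(42)
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy action selection
            a = rng.choice([0, 1]) if rng.random() < eps else max((0, 1), key=lambda x: q[s][x])
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Q-update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            best_next = 0.0 if s2 == n_states - 1 else max(q[s2])
            q[s][a] += alpha * (r + gamma * best_next - q[s][a])
            s = s2
    return q

q = q_learning()
policy = [max((0, 1), key=lambda a: q[s][a]) for s in range(4)]
print(policy)  # the learned policy moves right in every non-terminal state
```

The learned values decay geometrically with distance from the reward (roughly 1, 0.9, 0.81, ...), so the greedy policy heads right everywhere, which maximizes the expected total reward here.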
Answer
1. A decision tree is a flowchart-like structure in which each internal node represents a test on a feature, each leaf node represents a class label, and branches represent conjunctions of features that lead to those class labels.
2. The paths from root to leaf represent classification rules.
3. Fig. 1.19.1 illustrates the basic flow of a decision tree for decision making with labels Rain(Yes) and Rain(No).
[Fig. 1.19.1 : A decision tree with internal nodes such as Outlook, Humidity and Wind leading to Yes/No leaf labels.]
Que 1.20. What are the steps used for making decision tree ?
Answer
Steps used for making decision tree are :
1. Get the list of rows (dataset) which are taken into consideration for making the decision tree (recursively at each node).
2. Calculate the uncertainty of our dataset, or Gini impurity, or how much our data is mixed up etc.
3. Generate a list of all questions which need to be asked at that node.
4. Partition the rows into True rows and False rows based on each question asked.
5. Calculate the information gain based on the Gini impurity and the partition of data from the previous step.
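Steps 2 and 5 can be made concrete. Below is a small sketch (not from the book) of Gini impurity and the information gain of a candidate partition :

```python
from collections import Counter

def gini(rows):
    """Gini impurity of a list of class labels: 1 - sum over classes of p_i^2."""
    n = len(rows)
    return 1.0 - sum((c / n) ** 2 for c in Counter(rows).values())

def info_gain(parent, true_rows, false_rows):
    """Impurity of the parent minus the weighted impurity of the two partitions."""
    p = len(true_rows) / len(parent)
    return gini(parent) - p * gini(true_rows) - (1 - p) * gini(false_rows)

labels = ["yes", "yes", "yes", "no", "no", "no"]
g = gini(labels)                                  # 0.5 for a perfectly mixed node
gain = info_gain(labels, labels[:3], labels[3:])  # 0.5: this split separates the classes
print(g, gain)
```

A split whose partitions are pure (all "yes" on one side, all "no" on the other) recovers the full parent impurity as gain, which is why it would be chosen at that node.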
Answer
Advantages of decision tree method are :
1. Decision trees are able to generate understandable rules.
2. Decision trees perform classification without requiring much computation.
3. Decision trees are able to handle both continuous and categorical variables.
4. Decision trees provide a clear indication of the fields that are important for prediction or classification.
Disadvantages of decision tree method are :
1. Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
2. Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
3. Decision trees are computationally expensive to train. At each node, each candidate splitting field must be sorted before its best split can be found.
Answer
1. Bayesian belief networks specify joint conditional probability distributions
2. They are also known as belief networks, Bayesian networks, or
probabilistic networks.
3. Directed acyclic graph representation : The following diagram shows a directed acyclic graph for six Boolean variables.
[Fig. : Arcs run from FamilyHistory and Smoker to LungCancer and Emphysema, and from LungCancer to PositiveXray and Dyspnea.]
i. The arcs show that a person's having lung cancer is influenced by a family history of lung cancer, as well as by whether or not the person is a smoker.
ii. It is worth noting that the variable PositiveXray is independent of whether the patient has a family history of lung cancer or is a smoker, given that we know the patient has lung cancer.
iii. The conditional probability table for the values of the variable LungCancer (LC), showing each possible combination of the values of its parent nodes, FamilyHistory (FH) and Smoker (S), is as follows :

        FH, S    FH, ¬S    ¬FH, S    ¬FH, ¬S
LC      0.8      0.5       0.7       0.1
¬LC     0.2      0.5       0.3       0.9
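The table above can be queried programmatically. The sketch below also shows how a marginal P(LC) would be computed if priors for FH and S were available; the priors used here are made up for illustration and are not in the book :

```python
# Conditional probability table for LungCancer (LC) given FamilyHistory (FH)
# and Smoker (S), transcribed from the table above.
cpt_lc = {
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def p_lung_cancer(p_fh, p_s):
    """Marginal P(LC), assuming FH and S are independent with the given
    (hypothetical) prior probabilities."""
    total = 0.0
    for fh in (True, False):
        for s in (True, False):
            p_parents = (p_fh if fh else 1 - p_fh) * (p_s if s else 1 - p_s)
            total += cpt_lc[(fh, s)] * p_parents
    return total

print(cpt_lc[(True, False)])              # P(LC | FH, ¬S) = 0.5, read off the table
marginal = p_lung_cancer(0.1, 0.3)        # with assumed priors P(FH)=0.1, P(S)=0.3
print(round(marginal, 3))                 # → 0.311
```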
Answer
1. A Support Vector Machine (SVM) is a machine learning algorithm that analyzes data for classification and regression analysis.
2. SVM is a supervised learning method that looks at data and sorts it into one of two categories.
3. An SVM outputs a map of the sorted data with the margins between the two categories as far apart as possible.
Applications of SVM :
i. Text and hypertext classification
ii. Image classification
iii. Recognizing handwritten characters
iv. Biological sciences, including protein classification
Que 1.24. Explain genetic algorithm with flow chart.
Answer
Genetic algorithm (GA):
1. The genetic algorithm is a method for solving both constrained and unconstrained optimization problems that is based on natural selection.
2. The genetic algorithm repeatedly modifies a population of individual solutions.
3. At each step, the genetic algorithm selects individuals at random from the current population to be parents and uses them to produce the children for the next generation.
4. Over successive generations, the population evolves toward an optimal solution.
[Fig. 1.24.1 : Genetic algorithm flow chart — Start → Initialization (initial population) → Selection → Quit? If no : Crossover → Mutation → new population → back to Selection; if yes : End.]
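The loop in the flow chart can be sketched on a toy problem. The example below (hypothetical, not from the book) maximizes the number of 1-bits in a string, using tournament selection, one-point crossover and bit-flip mutation :

```python
import random

def genetic_algorithm(length=20, pop_size=30, generations=60, seed=1):
    """Sketch of the loop in Fig. 1.24.1 on the toy 'OneMax' problem
    (fitness = number of 1-bits in the individual)."""
    rng = random.Random(seed)
    fitness = sum
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        new_pop = []
        for _ in range(pop_size):
            # Selection: each parent wins a tournament of two random individuals.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            cut = rng.randrange(1, length)      # Crossover: one-point
            child = p1[:cut] + p2[cut:]
            i = rng.randrange(length)           # Mutation: flip one bit sometimes
            child[i] ^= rng.random() < 0.1
            new_pop.append(child)
        pop = new_pop                            # the new population replaces the old
    return max(pop, key=fitness)

best = genetic_algorithm()
print(sum(best))  # close to the optimum of 20
```

Over the generations the population drifts toward all-ones, mirroring the "evolves toward an optimal solution" step above.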
PART-4
Issues in Machine Learning and Data Science Vs. Machine Learning.
Questions-Answers
Answer
Issues related with machine learning are :
1. Data quality :
a. It is essential to have good quality data to produce quality ML algorithms and models.
b. To get high-quality data, we must implement data evaluation, integration, exploration, and governance techniques prior to developing ML models.
c. Accuracy of ML is driven by the quality of the data.
2. Transparency :
a. It is difficult to make definitive statements on how well a model is going to generalize in new environments.
3. Manpower :
a. Manpower means having data and being able to use it. This does not introduce bias into the model.
b. There should be enough skill sets in the organization for software development and data collection.
4. Other :
a. The most common issue with ML is people using it where it does not belong.
b. Every time there is some new innovation in ML, we see overzealous engineers trying to use it where it is not really necessary.
Answer
Common classes of problem in machine learning:
1. Classification :
a. In classification, data is labelled, i.e., it is assigned a class; for example, spam/non-spam or fraud/non-fraud.
b. The decision being modelled is to assign labels to new unlabelled pieces of data.
c. This can be thought of as a discrimination problem, modelling the differences or similarities between groups.
2. Regression :
a. In regression, data is labelled with a real value rather than a label.
b. The decision being modelled is what value to predict for new unpredicted data.
3. Clustering :
a. In clustering, data is not labelled, but can be divided into groups based on similarity and other measures of natural structure in the data.
b. For example, organising pictures by faces without names, where the human user has to assign names to groups, like iPhoto on the Mac.
4. Rule extraction :
a. In rule extraction, data is used as the basis for the extraction of propositional rules.
b. These rules discover statistically supportable relationships between attributes in the data.
Data science vs. machine learning :
1. Data science works with structured and unstructured data, whereas machine learning works with models.
2. Fraud detection and healthcare analysis are examples of data science, whereas recommendation systems such as Spotify, and facial recognition, are examples of machine learning.
UNIT 2 : Regression and Bayesian Learning

CONTENTS
Part-1 : Regression, Linear Regression ................ 2-2L to 2-4L
         and Logistic Regression
Part-2 : Bayesian Learning, Bayes ..................... 2-4L to 2-19L

2-2L (CS/IT-Sem-5) Regression & Bayesian Learning
PART-1
Regression, Linear Regression and Logistic Regression.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Answer
1. Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).
2. Regression helps investment and financial managers to value assets and understand the relationships between variables, such as commodity prices and the stocks of businesses dealing in those commodities.
3. There are two types of regression :
a. Simple linear regression : It uses one independent variable to explain or predict the outcome of the dependent variable Y :
Y = a + bX + u
b. Multiple linear regression : It uses two or more independent variables to predict outcomes :
Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u
Where :
Y = The variable we are trying to predict (dependent variable).
X = The variable that we are using to predict Y (independent variable).
a = The intercept.
b = The slope.
u = The regression residual.
4. For accurate prediction, y = mx + b, where m and b are the variables.
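The coefficients a and b of simple linear regression can be estimated by least squares. A minimal sketch on hypothetical data (not from the book) :

```python
def simple_linear_regression(xs, ys):
    """Least-squares estimates of intercept a and slope b in Y = a + bX."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x   # the fitted line passes through the means
    return a, b

# Hypothetical data lying exactly on Y = 2 + 3X, so the fit is exact.
xs = [0, 1, 2, 3, 4]
ys = [2, 5, 8, 11, 14]
a, b = simple_linear_regression(xs, ys)
print(a, b)  # → 2.0 3.0
```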
Que 2.3. Explain logistic regression.
Answer
1. Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable.
2. The nature of the target or dependent variable is dichotomous, which means there would be only two possible classes.
3. The dependent variable is binary in nature, having data coded as either 1 (stands for success/yes) or 0 (stands for failure/no).
4. A logistic regression model predicts P(Y = 1) as a function of X. It is one of the simplest ML algorithms that can be used for various classification problems such as spam detection, diabetes prediction, cancer detection etc.
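P(Y = 1) as a function of X can be sketched with the sigmoid and a simple per-sample gradient-descent fit; this is a hypothetical 1-D illustration, not the book's derivation :

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(data, epochs=2000, lr=0.5):
    """Fit P(Y = 1 | x) = sigmoid(w * x + b) by gradient ascent on the
    log-likelihood (one weight, one bias)."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(w * x + b)
            w += lr * (y - p) * x   # gradient of the log-likelihood w.r.t. w
            b += lr * (y - p)       # gradient w.r.t. b
    return w, b

# Hypothetical data: class 1 for large x, class 0 for small x.
data = [(-2, 0), (-1, 0), (-0.5, 0), (0.5, 1), (1, 1), (2, 1)]
w, b = train_logistic(data)
print(sigmoid(w * 2 + b) > 0.5, sigmoid(w * -2 + b) < 0.5)  # → True True
```

Thresholding the predicted probability at 0.5 turns the model into the binary classifier described above.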
Answer
Logistic regression can be divided into following types :
1. Binary (Binomial) regression :
a. In this classification, a dependent variable will have only two possible types, either 1 or 0.
b. For example, these variables may represent success or failure, yes or no, win or loss etc.
2. Multinomial regression :
a. In this classification, the dependent variable can have three or more possible unordered types, or types having no quantitative significance.
b. For example, these variables may represent "Type A" or "Type B" or "Type C".
3. Ordinal regression :
a. In this classification, the dependent variable can have three or more possible ordered types, or types having a quantitative significance.
b. For example, these variables may represent "poor", "good", "very good" or "excellent", and each category can have a score like 0, 1, 2, 3.
PART-2
Bayesian Learning.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Answer
Bayesian learning :
1. Bayesian learning is a fundamental statistical approach to the problem of pattern classification.
2. This approach is based on quantifying the tradeoffs between various classification decisions using probability and the costs that accompany such decisions.
3. Because the decision problem is solved on the basis of probabilistic terms, it is assumed that all the relevant probabilities are known.
4. For this we define the state of nature of the things present in the particular pattern. We denote the state of nature by ω.
Two-category classification :
1. Let ω1, ω2 be the two classes of the patterns. It is assumed that the a priori probabilities p(ω1) and p(ω2) are known.
2. Even if they are not known, they can easily be estimated from the available training feature vectors.
3. If N is the total number of available training patterns and N1, N2 of them belong to ω1 and ω2 respectively, then p(ω1) = N1/N and p(ω2) = N2/N.
4. The conditional probability density functions p(x | ωi), i = 1, 2, are also assumed to be known; they describe the distribution of the feature vectors in each of the classes.
5. The feature vectors can take any value in the l-dimensional feature space.
6. The density functions p(x | ωi) become probabilities, denoted by P(x | ωi), when the feature vectors can take only discrete values.
7. Consider the conditional probability
p(ωi | x) = p(x | ωi) p(ωi) / p(x)
where p(x) is the probability density function of x, and for which we have the inequalities :
a. p(x | ω1) p(ω1) > p(x | ω2) p(ω2)
b. p(x | ω1) p(ω1) < p(x | ω2) p(ω2)   ...(2.6.4)
10. Here p(x) is not taken into account because it is the same for all classes and it does not affect the decision.
11. Further, if the a priori probabilities are equal, i.e., p(ω1) = p(ω2) = 1/2, then Eq. (2.6.4) becomes :
a. p(x | ω1) > p(x | ω2)
b. p(x | ω1) < p(x | ω2)
12. For example, in Fig. 2.6.1 two equiprobable classes are presented, showing the variations of p(x | ωi), i = 1, 2, as functions of x for the simple case of a single feature (l = 1).
13. The dotted line at x0 is a threshold which partitions the space into two regions, R1 and R2. According to the Bayes decision rule, for all values of x in R1 the classifier decides ω1, and for all values in R2 it decides ω2.
14. From Fig. 2.6.1 it is obvious that errors are unavoidable. There is a finite probability for an x to lie in the R2 region and at the same time to belong to class ω1; then there is an error in the decision.
Fig. 2.6.1. Bayesian classifier for the case of two equiprobable classes.
15. The total probability, Pe, of committing a decision error for two equiprobable classes is given by the total shaded area under the curves in Fig. 2.6.1.
Answer
1. The Bayesian classifier can be made optimal by minimizing the classification error probability.
2. In Fig. 2.7.1 it is observed that when the threshold is moved away from x0, the corresponding shaded area under the curves always increases.
3. Hence, we have to decrease this shaded area to minimize the error.
Pe = p(x ∈ R2, ω1) + p(x ∈ R1, ω2)   ...(2.7.1)
6. Pe can be written as :
Pe = p(x ∈ R2 | ω1) p(ω1) + p(x ∈ R1 | ω2) p(ω2)   ...(2.7.2)
   = p(ω1) ∫R2 p(x | ω1) dx + p(ω2) ∫R1 p(x | ω2) dx
7. Using the Bayes rule :
Pe = ∫R2 p(ω1 | x) p(x) dx + ∫R1 p(ω2 | x) p(x) dx   ...(2.7.3)
8. The error will be minimized if the partitioning regions R1 and R2 of the feature space are chosen suitably. Since R1 and R2 together cover the whole feature space :
∫R1 p(ω1 | x) p(x) dx + ∫R2 p(ω1 | x) p(x) dx = p(ω1)   ...(2.7.5)
10. Combining equations (2.7.3) and (2.7.5), we get :
Pe = p(ω1) − ∫R1 [p(ω1 | x) − p(ω2 | x)] p(x) dx   ...(2.7.6)
11. Thus, the probability of error is minimized if R1 is the region of space in which p(ω1 | x) > p(ω2 | x); then R2 becomes the region where the reverse is true.
Que 2.8. Consider two equiprobable classes with class conditional densities
p(x | ω1) = 1/(a2 − a1) for x ∈ [a1, a2], and 0 otherwise,
p(x | ω2) = 1/(b2 − b1) for x ∈ [b1, b2], and 0 otherwise.
Show the classification results for some values of a and b.
Answer
Typical cases are presented in Fig. 2.8.1.
[Fig. 2.8.1 : Panels (a)-(d) show p(x | ω1) and p(x | ω2) for different relative placements of the intervals [a1, a2] and [b1, b2].]
Answer
1. A Bayes classifier is a simple probabilistic classifier based on applying Bayes theorem.
2. The average risk of the two-class decision is
R = C11 P1 ∫H1 p(x | C1) dx + C22 P2 ∫H2 p(x | C2) dx + C21 P1 ∫H2 p(x | C1) dx + C12 P2 ∫H1 p(x | C2) dx
where the various terms are defined as follows :
Pi = prior probability that the observation vector x is drawn from subspace Hi, with i = 1, 2, and P1 + P2 = 1.
Cij = cost of deciding in favour of class Ci when the true class is Cj.
[Fig. 2.9.1 : Two equivalent implementations of the Bayes classifier : (a) likelihood ratio test — the likelihood ratio computer forms Λ(x) from the input vector x, and the comparator assigns x to class C1 if Λ(x) > ξ, otherwise to class C2; (b) log-likelihood ratio test — assign x to class C1 if log Λ(x) > log ξ, otherwise to class C2.]
Answer
Bayes classifier: Refer Q. 2.9, Page 2-8L, Unit-2.
For example :
1. Let D be a training set of features and their associated class labels. Each feature is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn), depicting n measurements made on the feature from n attributes, respectively A1, A2, ..., An.
2. Suppose that there are m classes, C1, ..., Cm. Given a feature X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the classifier predicts that X belongs to class Ci if and only if
P(Ci | X) = p(X | Ci) p(Ci) / p(X)
is the maximum over all classes.
3. As p(X) is constant for all classes, only p(X | Ci) p(Ci) needs to be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, i.e., p(C1) = p(C2) = ... = p(Cm), and therefore p(X | Ci) is maximized. Otherwise p(X | Ci) p(Ci) is maximized.
4. Thus, p(X | Ci) = Π(k = 1 to n) p(xk | Ci).
iv. The probabilities p(x1 | Ci), p(x2 | Ci), ..., p(xn | Ci) are easily estimated from the training features. Here xk refers to the value of attribute Ak; for each attribute, it is checked whether the attribute is categorical or continuous valued.
v. For example, to compute p(X | Ci) we consider :
a. If Ak is categorical, then p(xk | Ci) is the number of features of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of features of class Ci in D.
b. If Ak is continuous valued, then the attribute is typically assumed to have a Gaussian distribution with mean μ and standard deviation σ, defined by
g(x, μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))
so that p(xk | Ci) = g(xk, μCi, σCi).
vi. There is a need to compute the mean μ and the standard deviation σ of the values of attribute Ak for the training set of class Ci. These values are then used to estimate p(xk | Ci).
vii. For example, let X = (35, Rs. 40,000), where A1 and A2 are the attributes age and income, respectively. Let the class label attribute be buys-computer.
viii. The associated class label for X is yes (i.e., buys-computer = yes). Let's suppose that age has not been discretized and therefore exists as a continuous valued attribute.
ix. Suppose that from the training set we find that customers in D who buy a computer are 38 ± 12 years of age. In other words, for attribute age and this class, we have μ = 38 and σ = 12.
5. In order to predict the class label of X, p(X | Ci) p(Ci) is evaluated for each class Ci. The classifier predicts that the class label of X is the class Ci if and only if p(X | Ci) p(Ci) > p(X | Cj) p(Cj) for 1 ≤ j ≤ m, j ≠ i.
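The Gaussian estimate for the age example can be checked numerically; this is a direct sketch of the formula above with μ = 38 and σ = 12 :

```python
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma) = (1 / (sqrt(2*pi)*sigma)) * exp(-(x-mu)^2 / (2*sigma^2))"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Attribute age for class buys-computer = yes: mu = 38, sigma = 12.
p_age = gaussian(35, 38, 12)
print(round(p_age, 4))  # ≈ 0.0322
```

This value would then be multiplied by the likelihoods of the remaining attributes and by the class prior, as in step 5.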
Answer
As per Bayes rule :
P(green | pencil) = P(pencil | green) P(green) / [P(pencil | green) P(green) + P(pencil | blue) P(blue) + P(pencil | red) P(red)] = 0.505
P(blue | pencil) = P(pencil | blue) P(blue) / [P(pencil | green) P(green) + P(pencil | blue) P(blue) + P(pencil | red) P(red)] = 0.378
P(red | pencil) = P(pencil | red) P(red) / [P(pencil | red) P(red) + P(pencil | blue) P(blue) + P(pencil | green) P(green)] = 0.126
Since P(green | pencil) has the highest value, pencil belongs to class green.
P(blue | pen) = P(pen | blue) P(blue) / [P(pen | green) P(green) + P(pen | blue) P(blue) + P(pen | red) P(red)] = 0.111
P(blue | paper) = P(paper | blue) P(blue) / [P(paper | green) P(green) + P(paper | blue) P(blue) + P(paper | red) P(red)] = 0.286
P(red | paper) = P(paper | red) P(red) / [P(paper | red) P(red) + P(paper | blue) P(blue) + P(paper | green) P(green)] = 0.429
Since P(red | paper) has the highest value, paper belongs to class red.
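The normalization pattern used in each of these calculations can be sketched generically. The likelihoods and priors below are made-up placeholders for illustration, not the question's data :

```python
def posteriors(likelihoods, priors):
    """Bayes rule over all classes:
    P(c | item) = P(item | c) P(c) / sum over j of P(item | c_j) P(c_j)."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    z = sum(joint.values())                 # the shared denominator
    return {c: v / z for c, v in joint.items()}

# Hypothetical likelihoods P(item | colour) and priors P(colour):
post = posteriors({"green": 0.5, "blue": 0.25, "red": 0.25},
                  {"green": 0.4, "blue": 0.3, "red": 0.3})
best = max(post, key=post.get)
print(best, round(post[best], 3))
```

Whatever the inputs, the posteriors sum to 1 and the item is assigned to the class with the highest posterior, exactly as in the worked answer.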
Answer
1. The Naive Bayes model is the most common Bayesian network model used in machine learning.
2. Here, the class variable C (which is to be predicted) is the root and the attribute variables Xi are the leaves.
3. The model is naive because it assumes that the attributes are conditionally independent of each other, given the class.
[Fig. : Learning curves comparing the decision tree and naive Bayes learners — proportion correct (0.4 to 0.9) versus training set size (0 to 100).]
Answer
            Tasty                     Tasty                        Tasty
Cook        Yes    No    Status      Yes    No    Cuisine         Yes    No
Asha        2/6    0     Bad         2/6    3/4   Indian          4/6    1/4
Sita        2/6    2/4   Good        4/6    1/4   Continental     2/6    3/4
Usha        2/6    2/4

Tasty : Yes = 6/10, No = 4/10

Likelihood of no = 0 × ... × 4/10 = 0
a. Estimation step :
i. Initialize μk, Σk and πk by random values, or by K-means clustering results, or by hierarchical clustering results.
ii. Then for those given parameter values, estimate the values of the latent variables (i.e., γk).
b. Maximization step : Update the values of the parameters (i.e., μk, Σk and πk) using the estimated latent variables.
iv. Compute the log-likelihood function.
v. Put some convergence criterion.
vi. If the log-likelihood value converges to some value (or if all the parameters converge to some values) then stop, else return to the estimation step.
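The two alternating steps can be sketched for a two-component one-dimensional Gaussian mixture. This is a compact illustration with crude min/max initialisation and a fixed iteration count instead of a convergence test :

```python
import math
import random

def em_gmm_1d(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture (estimation step computes
    responsibilities; maximization step re-estimates pi, mu, var)."""
    mu = [min(xs), max(xs)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # Estimation step: responsibility of each component for each point.
        gamma = []
        for x in xs:
            dens = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) /
                    math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            s = sum(dens)
            gamma.append([d / s for d in dens])
        # Maximization step: update the parameters from the responsibilities.
        for k in range(2):
            nk = sum(g[k] for g in gamma)
            pi[k] = nk / len(xs)
            mu[k] = sum(g[k] * x for g, x in zip(gamma, xs)) / nk
            var[k] = sum(g[k] * (x - mu[k]) ** 2 for g, x in zip(gamma, xs)) / nk + 1e-6
    return mu, var, pi

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(8, 1) for _ in range(200)]
mu, var, pi = em_gmm_1d(xs)
print(sorted(round(m, 1) for m in mu))  # means near 0 and 8
```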
Answer
Usage of EM algorithm :
1. It can be used to fill the missing data in a sample.
2. It can be used as the basis of unsupervised learning of clusters.
3. The E-step and M-step are often pretty easy for many problems in terms of implementation.
Answer
1. A Bayesian network is a directed acyclic graph in which each node is annotated with quantitative probability information.
2. The full specification is as follows :
i. A set of random variables makes up the nodes of the network; variables may be discrete or continuous.
5. Using the Bayes rule, we can deduce the most contributing factor towards the wet grass.
[Fig. 2.16.1 : Condition → Sprinkler and Rain; Sprinkler and Rain → Wet grass.]
Prior probability :
1. The prior probability is used to compute the probability of the event before the collection of new data.
2. It is used to capture our assumptions / domain knowledge and is independent of the data.
3. It is the unconditional probability that is assigned before any relevant evidence is taken into account.
Que 2.18. Explain the method of handling approximate inference in Bayesian networks.
Answer
1. Approximate inference methods can be used when exact inference methods lead to unacceptable computation times because the network is very large or densely connected.
2. Methods handling approximate inference :
i. Simulation methods : These methods use the network to generate samples from the conditional probability distribution and estimate conditional probabilities of interest when the number of samples is sufficiently large.
ii. Variational methods : These methods express the inference task as a numerical optimization problem and then find upper and lower bounds of the probabilities of interest by solving a simplified version of this optimization problem.
PART-3
Support Vector Machine, Introduction, Types of Support
Vector Kernel : (Linear Kernel, Polynomial Kernel, and Gaussian
Kernel), Hyperplane : (Decision Surface), Properties
of SVM, and Issues in SVM.
Questions-Answers
Answer
1. The polynomial kernel can be written as
K(a, b) = (a . b + r)^d
where a and b are two different data points that we need to classify, r determines the coefficients of the polynomial, and d determines the degree of the polynomial.
3. We perform the dot products of the data points, which gives us the high-dimensional coordinates for the data.
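As a sketch, the kernel above can be evaluated directly; the vectors a and b below are made up for illustration.

```python
def polynomial_kernel(a, b, r=1.0, d=2):
    """Dot product in the implicit high-dimensional feature space: (a . b + r)^d."""
    dot = sum(x * y for x, y in zip(a, b))
    return (dot + r) ** d

a, b = [1.0, 2.0], [3.0, 0.5]
print(polynomial_kernel(a, b))  # (1*3 + 2*0.5 + 1)^2 = 25.0
```

With d = 1 and r = 0 this reduces to the linear kernel.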
Answer
1. The RBF kernel is a function whose value depends on the distance from the origin or from some point.
2. The Gaussian kernel is of the following format :
K(X1, X2) = exp(-γ || X1 - X2 ||^2)
where || X1 - X2 || is the Euclidean distance between X1 and X2. Using the distance in the original space, we calculate the dot product (similarity) of X1 and X2.
3. Following are the parameters used in the Gaussian kernel :
a. C : Inverse of the strength of regularization.
Behaviour : As the value of C increases, the model overfits; as the value of C decreases, the model underfits.
b. γ : Gamma (used only for RBF kernel).
Behaviour : As the value of γ increases, the model overfits; as the value of γ decreases, the model underfits.
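The kernel and the effect of γ can be sketched directly; the sample points are illustrative.

```python
import math

def rbf_kernel(x1, x2, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||x1 - x2||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * sq_dist)

x1, x2 = [0.0, 0.0], [1.0, 1.0]
# larger gamma shrinks the similarity radius, so distant points look less alike
for gamma in (0.1, 1.0, 10.0):
    print(gamma, round(rbf_kernel(x1, x2, gamma), 4))
```

Note that the kernel is always 1 for identical points and decays towards 0 with distance, never going negative.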
Answer
Answer
Disadvantages of SVM :
1. SVM does not give the best performance for handling text structures as compared to other algorithms that are used in handling text data. This leads to loss of sequential information and thereby to worse performance.
Que 2.25.
Explain the properties of SVM.
Answer
What are the parameters used in support vector classifier ?
Answer
Parameters used in support vector classifier are:
1. Kernel :
a. The kernel is selected based on the type of data and also the type of transformation.
b. By default, the kernel is the Radial Basis Function kernel (RBF).
2. Gamma :
a. Gamma controls how far the influence of a single training example reaches.
b. Smaller values of gamma give each point a larger reach. So, more points are grouped together and have smoother decision boundaries (may be less accurate).
d. Larger values of gamma cause points to need to be closer together (may cause overfitting).
CONTENTS
Part-1 : Decision Tree Learning, Decision Tree Learning Algorithm, Inductive Bias, Inductive Inference with Decision Trees
PART-1
Decision Tree Learning, Decision Tree Learning Algorithm,
Inductive Bias, Inductive Inference with Decision Trees
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Answer
Basic terminology used in decision trees are:
1. Root node : It represents the entire population or sample, and this further gets divided into two or more homogeneous sets.
2. Splitting : It is a process of dividing a node into two or more sub-nodes.
3. Decision node : When a sub-node splits into further sub-nodes, it is called a decision node.
Fig. 3.3.1. A decision tree : the root node is split into decision nodes (Humidity, Wind), whose branches (High / Normal, Weak / Strong) lead to leaf nodes (Yes / No).
Answer
Various decision tree learning algorithms are:
1. ID3 (Iterative Dichotomiser 3) :
i. ID3 is an algorithm used to generate a decision tree from a dataset; it greedily prefers the smallest tree.
v. For building a decision tree model, ID3 only accepts categorical attributes. Accurate results are not given by ID3 when there is noise and when it is serially implemented.
vi. Therefore, data is preprocessed before constructing a decision tree.
2. C4.5 :
i. C4.5 is an algorithm used to generate a decision tree. It is an extension of the ID3 algorithm.
ii. C4.5 generates decision trees which can be used for classification, and therefore C4.5 is referred to as a statistical classifier.
iii. It is better than the ID3 algorithm because it deals with both continuous and discrete attributes, and also with missing values and pruning trees after construction.
iv. C5.0 is the commercial successor of C4.5 because it is faster, more memory-efficient, and gives more accurate interpretations.
3. CART (Classification And Regression Trees) :
i. The CART algorithm builds both classification and regression trees.
vi. CART has an average speed of processing and supports both continuous and nominal attribute data.
Answer
Advantages of ID3 algorithm:
1. The training data is used to create understandable prediction rules.
2. It builds a short and fast tree.
3. ID3 searches the whole dataset to create the whole tree.
4. It finds the leaf nodes, thus enabling the test data to be pruned and reducing the number of tests.
5. The calculation time of ID3 is a linear function of the product of the characteristic number and node number.
Disadvantages of ID3 algorithm:
1. For a small sample, data may be overfitted or overclassified.
PART-2
Entropy and Information Theory, Information Gain,
ID-3 Algorithm, Issues in Decision Tree Learning.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Answer
Attribute selection measures used in decision trees are :
1. Entropy :
i. Entropy is a measure of uncertainty associated with a random variable.
ii. The entropy increases with the increase in uncertainty or randomness and decreases with a decrease in uncertainty or randomness.
iii. The value of entropy ranges from 0 to 1.
2. Information gain :
Gain(D, A) = Entropy(D) - sum over j = 1..V of (|Dj| / |D|) × Entropy(Dj)
where,
D : A given data partition
A : Attribute
V : Suppose we partition the tuples in D on some attribute A having V distinct values.
3. Gain ratio :
i. The information gain measure is biased towards tests with many outcomes.
ii. That is, it prefers to select attributes having a large number of values.
iii. As each partition is pure, the information gain by partitioning is maximal. But such partitioning cannot be used for classification.
iv. SplitInfo_A(D) = - sum over j = 1..V of (|Dj| / |D|) × log2(|Dj| / |D|)
v. The gain ratio is then defined as :
GainRatio(A) = Gain(A) / SplitInfo_A(D)
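These measures can be sketched in a few lines; the class counts and the binary partition below are illustrative.

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / total
        result -= p * math.log2(p)
    return result

def gain(labels, partitions):
    """Information gain of splitting `labels` into `partitions` on some attribute."""
    weighted = sum(len(p) / len(labels) * entropy(p) for p in partitions)
    return entropy(labels) - weighted

def split_info(labels, partitions):
    return -sum(len(p) / len(labels) * math.log2(len(p) / len(labels))
                for p in partitions)

def gain_ratio(labels, partitions):
    return gain(labels, partitions) / split_info(labels, partitions)

D = ['+'] * 9 + ['-'] * 5                        # 9 positive, 5 negative examples
parts = [['+'] * 6 + ['-'] * 2, ['+'] * 3 + ['-'] * 3]  # a binary attribute split
print(round(entropy(D), 3))                      # ≈ 0.940
print(round(gain(D, parts), 3))
print(round(gain_ratio(D, parts), 3))
```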
Answer
The various decision tree applications in data mining are :
1. E-commerce : It is used widely in the field of e-commerce; a decision tree helps to generate the online catalog, which is an important factor for the success of an e-commerce website.
2. Business : It is used for visualization of probabilistic business models, in CRM (Customer Relationship Management), and for credit scoring of credit card users and for predicting loan risks in banks.
Que 3.8. Explain the procedure of ID3 algorithm.
Answer
ID3 (Examples, Target_Attribute, Attributes)
1. Create a Root node for the tree.
2. If all Examples are positive, return the single-node tree Root, with label = +.
3. If all Examples are negative, return the single-node tree Root, with label = -.
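The recursion that these steps begin (choose the attribute with the highest information gain, then recurse on each of its values) can be sketched as follows; the play-tennis-style dataset is a made-up fragment, not from the text.

```python
import math
from collections import Counter

def entropy(examples):
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def id3(examples, attributes):
    labels = {label for _, label in examples}
    if len(labels) == 1:          # all positive or all negative -> leaf
        return labels.pop()
    if not attributes:            # no attributes left -> majority-class leaf
        return Counter(l for _, l in examples).most_common(1)[0][0]

    def gain(attr):
        subsets = {}
        for x, label in examples:
            subsets.setdefault(x[attr], []).append((x, label))
        rem = sum(len(s) / len(examples) * entropy(s) for s in subsets.values())
        return entropy(examples) - rem

    best = max(attributes, key=gain)          # highest information gain
    tree = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, l) for x, l in examples if x[best] == value]
        tree[value] = id3(subset, [a for a in attributes if a != best])
    return (best, tree)

data = [({'outlook': 'sunny', 'wind': 'weak'}, '-'),
        ({'outlook': 'sunny', 'wind': 'strong'}, '-'),
        ({'outlook': 'overcast', 'wind': 'weak'}, '+'),
        ({'outlook': 'rain', 'wind': 'weak'}, '+'),
        ({'outlook': 'rain', 'wind': 'strong'}, '-')]
tree = id3(data, ['outlook', 'wind'])
print(tree)
```

On this fragment, 'outlook' has the larger gain, so it becomes the root; the 'rain' branch then splits on 'wind'.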
Answer
Inductive bias :
1. Inductive bias refers to the restrictions that are imposed by the assumptions made in the learning method.
2. For example, assume that the solution to the problem of road safety can be expressed as a conjunction of a set of eight concepts.
3. This does not allow for more complex expressions that cannot be expressed as a conjunction.
4. This inductive bias means that there are some potential solutions that we cannot explore, because they are not contained within the version space we examine.
5. The system is only able to classify a new piece of data if all the hypotheses contained within its version space give the data the same classification.
6. Hence, the inductive bias does impose a limitation on the learning method.
Fig. 3.9.1. Inductive system : training examples and a new instance are given to the candidate elimination algorithm (using hypothesis space H), which outputs the classification of the new instance, or "do not know".
Answer
Inductive learning algorithm
Step 1 : Divide the table T containing m examples into n sub-tables (t1, t2, ..., tn), one table for each possible value of the class attribute (repeat steps 2-8 for each sub-table).
Step 2 : Initialize the attribute combination count j = 1.
Step 3 : For the sub-table on which work is going on, divide the attribute list into distinct combinations, each combination with j distinct attributes.
Step 4 : For each combination of attributes, count the number of occurrences of attribute values that appear under the same combination of attributes in unmarked rows of the sub-table under consideration, and at the same time do not appear under the same combination of attributes of other sub-tables. Call the first combination with the maximum number of occurrences the max-combination MAX.
Step 5 : If MAX == null, increase j by 1 and go to Step 3.
Step 6 : Mark all rows of the sub-table being worked on, in which the values of MAX appear, as classified.
Step 7 : Add a rule (IF attribute = "XYZ" THEN decision is YES/NO) to R (rule set) whose left-hand side will have the attribute names of MAX with their values separated by AND, and whose right-hand side contains the decision attribute value associated with the sub-table.
Step 8 : If all rows are marked as classified, then move on to process another sub-table and go to Step 2; else go to Step 4. If no sub-tables are available, exit with the set of rules obtained till then.
Answer
Learning algorithms used in inductive bias are :
1. Rote-learner :
a. Learning corresponds to storing each observed training example in memory.
b. Subsequent instances are classified by looking them up in memory.
c. If the instance is found in memory, the stored classification is returned.
d. Otherwise, the system refuses to classify the new instance.
e. Inductive bias : There is no inductive bias.
2. Candidate-elimination :
a. New instances are classified only in the case where all members of the current version space agree on the classification.
b. Otherwise, the system refuses to classify the new instance.
c. Inductive bias : The target concept can be represented in its hypothesis space.
3. FIND-S :
a. This algorithm finds the most specific hypothesis consistent with the training examples.
b. It then uses this hypothesis to classify all subsequent instances.
c. Inductive bias : The target concept can be represented in its hypothesis space, and all instances are negative instances unless the opposite is entailed by its other knowledge.
Que 3.12. Discuss the issues related to the applications of decision
trees.
Answer
Issues related to the applications of decision trees are :
1. Missing data :
a. Values may have gone unrecorded, or they might be too expensive to obtain.
b. Two problems arise :
i. To classify an object when an attribute value is missing from the test.
ii. To modify the information gain formula when examples have unknown values for the attribute.
2. Multi-valued attributes :
a. When an attribute has many possible values, the information gain measure becomes useless.
b. One solution is to use the gain ratio.
3. Continuous and integer-valued input attributes :
a. Such attributes can be handled by choosing split points, or by building a regression tree.
b. Such a tree has a linear function of some subset of the numerical attributes at each leaf.
PART-3
Instance-based Learning.
Questions-Answers
Answer
1. Instance-Based Learning (IBL) is an extension of nearest neighbour or k-NN classification algorithms.
2. IBL algorithms do not maintain a set of abstractions of models created from the instances.
3. The k-NN algorithms have a large space requirement.
4. They also extend it with a significance test to work with noisy instances, since a lot of real-life datasets have noisy training instances and k-NN algorithms do not work well with noise.
5. Instance-based learning is based on the memorization of the dataset.
6. The number of parameters is unbounded and grows with the size of the data.
7. The classification is obtained through memorized examples.
8. The cost of the learning process is 0; all the cost is in the computation of the prediction.
9. This kind of learning is also known as lazy learning.
Que 3.14. Explain instance-based learning representation.
Answer
Following are the instance-based learning representations :
Instance-based representation (1) :
1. The simplest form of learning is plain memorization.
2. This is a completely different way of representing the knowledge extracted from a set of instances : just store the instances themselves and operate by relating new instances whose class is unknown to existing ones whose class is known.
3. Instead of creating rules, work directly from the examples themselves.
Instance-based representation (2) :
1. Instance-based learning is lazy, deferring the real work as long as possible.
2. In instance-based learning, each new instance is compared with existing ones using a distance metric, and the closest existing instance is used to assign the class to the new one. This is also called the nearest-neighbour classification method.
3. Sometimes more than one nearest neighbour is used, and the majority class of the closest k nearest neighbours is assigned to the new instance. This is termed the k-nearest-neighbour method.
2. An apparent drawback of instance-based representation is that it does not make explicit the structures that are learned.
Fig. 3.14.1.
Que 3.15. What are the performance dimensions used for instance-based learning algorithms ?
Answer
Performance dimensions used for instance-based learning algorithms are :
1. Generality :
a. This is the class of concepts that the representation of an algorithm can describe.
b. IBL algorithms can PAC-learn any concept whose boundary is a union of a finite number of closed hyper-curves of finite size.
Answer
Advantages of instance-based learning:
1. Learning is trivial.
2. Works efficiently.
3. Noise resistant.
4. Rich representation, arbitrary decision surfaces.
5. Easy to understand.
Disadvantages of instance-based learning:
1. Need lots of data.
2. Computational cost is high.
3. Restricted to x ∈ R^n.
4. Implicit weights of attributes (need normalization).
5. Need large space for storage, i.e., require large memory.
6. Expensive application time.
PART-4
K-Nearest Neighbour Learning, Locally Weighted Regression,
Radial Basis Function Networks.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Answer
1. The KNN classification algorithm is used to decide to which class a new instance should belong.
2. When K = 1, we have the nearest neighbour algorithm.
3. KNN classification is incremental.
4. KNN classification does not have a training phase; all instances are stored for comparison.
6. K-nearest neighbours use the local neighbourhood to obtain a prediction.
Algorithm : Let m be the number of training data samples. Let p be an unknown point.
1. Store the training samples in an array of data points. This means each element of this array represents a tuple (x, y).
2. For i = 0 to m : compute the distance between p and each stored training sample.
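The remaining steps (sort the distances and take the majority class among the K nearest) can be completed as a sketch; the training samples below are made up.

```python
import math
from collections import Counter

def knn_classify(training, p, k=3):
    """training: list of (x, y) tuples, x a feature vector, y a class label."""
    # distance from p to every stored sample -- no training phase (lazy learning)
    distances = sorted((math.dist(x, p), y) for x, y in training)
    nearest = [y for _, y in distances[:k]]   # the k closest labels
    return Counter(nearest).most_common(1)[0][0]

training = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((0.9, 1.1), 'A'),
            ((5.0, 5.0), 'B'), ((5.2, 4.9), 'B'), ((4.8, 5.1), 'B')]
print(knn_classify(training, (1.1, 1.0), k=3))  # → A
print(knn_classify(training, (5.1, 5.0), k=3))  # → B
```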
Answer
Advantages of KNN algorithm :
1. No training period :
a. KNN is called a lazy learner (instance-based learning).
b. It does not learn anything in the training period. It does not derive any discriminative function from the training data.
c. In other words, there is no training period for it. It stores the training dataset and learns from it only at the time of making real-time predictions.
d. This makes the KNN algorithm much faster than other algorithms that require training, for example, SVM, Linear Regression, etc.
2. Since the KNN algorithm requires no training before making predictions, new data can be added seamlessly, which will not impact the accuracy of the algorithm.
3. KNN is very easy to implement. There are only two parameters required to implement KNN, i.e., the value of K and the distance function (for example, Euclidean).
Disadvantages of KNN :
1. Does not work well with large datasets : In large datasets, the cost of calculating the distance between the new point and each existing point is huge, which degrades the performance of the algorithm.
2. Does not work well with high dimensions : The KNN algorithm does not work well with high-dimensional data because with a large number of dimensions, it becomes difficult for the algorithm to calculate the distance in each dimension.
3. Needs feature scaling : We need to do feature scaling (standardization and normalization) before applying the KNN algorithm to any dataset. If we do not do so, KNN may generate wrong predictions.
4. Sensitive to noisy data, missing values and outliers : KNN is sensitive to noise in the dataset. We need to manually represent missing values and remove outliers.
Fig. 3.20.1
6. The LOESS (Locally Estimated Scatterplot Smoothing) model performs
a linear regression on points in the data set, weighted by a kernel
centered at x.
9. We use the data covariances to express the conditional expectations and their estimated variances.
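A sketch of the LOESS idea in one dimension: for a query point x0, fit a line by weighted least squares with a Gaussian kernel centred at x0. The data and the bandwidth tau are illustrative assumptions.

```python
import math

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.1, 2.9, 4.2, 4.8]   # roughly y = x with noise

def loess_predict(x0, tau=1.0):
    # Gaussian kernel weight for each training point, centred at x0
    w = [math.exp(-(x - x0) ** 2 / (2 * tau ** 2)) for x in xs]
    # closed-form weighted least squares for slope and intercept
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    num = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    den = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    slope = num / den
    return my + slope * (x0 - mx)

print(round(loess_predict(2.5), 2))   # close to 2.5 for this near-linear data
```

Each query point gets its own local fit, which is what makes the method "locally weighted".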
Answer
1. A Radial Basis Function (RBF) is a function that assigns a real value to each input from its domain (it is a real-valued function), and the value produced by the RBF is always an absolute value, i.e., it is a measure of distance and cannot be negative.
Answer
1. Radial Basis Function (RBF) networks have three layers : an input layer, a hidden layer with a non-linear RBF activation function, and a linear output layer.
2. The input can be modeled as a vector of real numbers x ∈ R^n.
3. The output of the network is then a scalar function of the input vector, φ : R^n → R, and is given by
φ(x) = sum over i of a_i ρ(|| x - c_i ||)
where a_i are the linear output weights and c_i are the centre vectors of the radial basis functions.
4. Each radial basis function depends only on the distance from its centre vector, and so is radially symmetric about that vector.
5. In the basic form, all inputs are connected to each hidden neuron.
6. The radial basis function is commonly taken to be Gaussian,
ρ(|| x - c_i ||) = exp(-β || x - c_i ||^2)
so that lim (|| x || → ∞) ρ(|| x - c_i ||) = 0.
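The forward pass just described can be sketched directly; the centres c_i, weights a_i, and width β below are illustrative assumptions.

```python
import math

centres = [[0.0, 0.0], [1.0, 1.0]]   # c_i, one per hidden neuron
weights = [0.5, -0.3]                # linear output weights a_i
beta = 1.0                           # width of the Gaussian

def rho(r):
    """Gaussian radial basis of the distance r; rho -> 0 as r -> infinity."""
    return math.exp(-beta * r ** 2)

def rbf_network(x):
    # phi(x) = sum_i a_i * rho(||x - c_i||)
    out = 0.0
    for a, c in zip(weights, centres):
        r = math.dist(x, c)
        out += a * rho(r)
    return out

print(round(rbf_network([0.0, 0.0]), 4))
```

At a centre, that neuron contributes its full weight while the others decay with distance, which is the radial symmetry of item 4.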
PART-5
Case-based Learning
Questions-Answers
Answer
1. Case-Based Learning (CBL) algorithms take a sequence of cases as input.
Answer
Case-based learning algorithm processing stages are :
1. Case retrieval : After the problem situation has been assessed, the
best matching case is searched in the case-base and an approximate
solution is retrieved.
2 Case adaptation: The retrieved solution is adapted to fit better in the
new problem.
Fig. 3.36.1. The CBL cycle : a problem leads to retrieval, revision, and retention, with the proposed solution becoming a confirmed solution.
3. Solution evaluation :
a. The adapted solution can be evaluated either before the solution is applied to the problem or after the solution has been applied.
b. In any case, if the accomplished result is not satisfactory, the retrieved solution must be adapted again, or more cases should be retrieved.
4. Case-base updating : If the solution was verified as correct, the new case may be added to the case base.
Que 3.26. What are the benefits of CBL as a lazy problem solving
method ?
Answer
The benefits of CBL as a lazy problem solving method are :
1. Ease of knowledge elicitation :
a. Lazy methods can utilise easily available cases or problem instances instead of rules that are difficult to extract.
b. So, classical knowledge engineering is replaced by case acquisition and structuring.
a. A CBL system can be put into operation with a minimal set of solved cases furnishing the case base.
5. Suitability for sequential problem solving :
a. Sequential tasks, like the ones encountered in reinforcement learning problems, benefit from the storage of history in the form of a sequence of states or procedures.
b. Such a storage is facilitated by lazy approaches.
6. Ease of explanation :
Answer
Limitations of CBL are :
1. Handling large case bases :
a. High memory / storage requirements and time-consuming retrieval accompany CBL systems utilising large case bases.
2. Dynamic problem domains :
a. CBL systems may have difficulties in handling dynamic problem domains, where they may be unable to follow a shift in the way problems are solved, since they are strongly biased towards what has already worked.
c. In turn, this implies inefficient storage and retrieval of cases.
4 Fully automatic operation:
a. In a CBL system, the problem domain is not fully covered.
b. Hence, some problem situations can occur for which the system
has no solution.
Answer
Applications of CBL:
1. Interpretation : It is a process of evaluating situations / problems in some context (For example, HYPO for interpretation of patent laws, KICS for interpretation of building regulations, LISSA for interpretation of non-destructive test measurements).
2. Classification : It is a process of explaining a number of encountered symptoms (For example, CASEY for classification of auditory impairments, CASCADE for classification of software failures, PAKARR for causal classification of building defects, ISFER for classification of facial expressions into user-defined interpretation categories).
3. Design : It is a process of satisfying a number of posed constraints (For example, JULIA for meal planning, CLAVIER for design of optimal layouts of composite airplane parts, EADOCS for aircraft panels design).
4. Planning : It is a process of arranging a sequence of actions in time (For example, BOLERO for building diagnostic plans for medical patients, TOTLEC for manufacturing planning).
5. Advising : It is a process of resolving diagnosed problems (For example, DECIDER for advising students, HOMER).
Answer
Major paradigms of machine learning are:
1. Rote learning :
a. There is a one-to-one mapping from inputs to stored representations.
b. Learning by memorization.
c. There is association-based storage and retrieval.
2. Induction : Machine learning uses specific examples to reach general conclusions.
6. Genetic algorithms :
a. Genetic algorithms are stochastic search algorithms which act on a population of possible solutions.
Answer
Inductive learning problems are :
c. Unsupervised learning means we are only given the xs.
CONTENTS
Part-1 : Artificial Neural Network, Perceptrons, Multilayer Perceptron, Gradient Descent and the Delta Rule
Artificial Neural Network & Deep Learning
4-2L (CSIT-Sem-5)
PART 1
Artificial Neural Network, Perceptrons, Multilayer Perceptron,
Gradient Descent and the Delta Rule.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Answer
Advantages of Artificial Neural Networks (ANN):
1. Problems in ANN are represented by attribute-value pairs.
2 ANNs are used for problems having the target function, output may be
discrete-valued, real-valued, or a vector of several real or discrete-valued
attributes.
3. ANNs learning methods are quite robust to noise in the training data.
The training examples may contain errors, which do not affect the final output.
4. It is used where fast evaluation of the learned target function is required.
5. ANNs can bear long training times depending on factors such as the number of weights in the network, the number of training examples considered, and the settings of various learning algorithm parameters.
Disadvantages of Artificial Neural Networks (ANN) :
1. Hardware dependence :
a. Artificial neural networks require processors with parallel processing power, in accordance with their structure.
Que 4.3. What are the characteristics of Artificial Neural Network ?
Answer
Characteristics of Artificial Neural Network are:
1. Character recognition :
a It is a problem which falls under the general area of Pattern
Recognition.
b. Many neural networks have been developed for automatic
recognition of handwritten characters, either letters or digits.
3 Signature verification application:
a Signatures are useful ways to authorize and authenticate a person
in legal transactions.
b. Signature verification technique is a non-vision based technique.
c. For this application, the first approach is to extract the feature, or rather the geometrical feature set, representing the signature.
4. Human face recognition :
a. However, if a neural network is well trained, then it can be divided into two classes, namely images having faces and images that do not have faces.
After this the neurons collectively give the output layer to compute
the output signals.
2. Multilayer feed-forward network :
a. This network has a hidden layer which is internal to the network and has no direct contact with the external layer.
b. The existence of one or more hidden layers enables the network to be computationally stronger.
c. There are no feedback connections in which outputs of the model are fed back into itself.
Fig. 4.5.1.
5. Multilayer recurrent network :
b. They perform the same task for every element of a sequence, with the output being dependent on the previous computations. Inputs are not needed at each time step.
c. The main feature of a multilayer recurrent neural network is its hidden state, which captures information about a sequence.
4. It generalizes knowledge to produce adequate responses to unknown situations.
5. Artificial neural networks are flexible and have the ability to learn, generalize, and adapt to situations based on their findings.
6. This function allows the network to efficiently acquire knowledge by learning. This is a distinct advantage over a traditionally linear network that is inadequate when it comes to modelling non-linear data.
7. An artificial neural network is capable of greater fault tolerance than a traditional network. Without the loss of stored data, the network is able to regenerate a fault in any of its components.
8. An artificial neural network is based on adaptive learning.
Que 4.7. Write a short note on gradient descent.
Answer
1. Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient.
2. A gradient is the slope of a function : the degree of change of a parameter with the amount of change in another parameter.
3. Mathematically, it can be described as the partial derivatives of a set of parameters with respect to its inputs. The larger the gradient, the steeper the slope.
4. Gradient descent is performed on a convex cost function.
5. Gradient descent can be described as an iterative method which is used to find the values of the parameters of a function that minimize the cost function as much as possible.
6. The parameters are initially defined with a particular value and, from that, gradient descent runs in an iterative fashion to find the optimal values of the parameters, using calculus, to find the minimum possible value of the given cost function.
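A minimal sketch of this iterative update, minimizing the convex cost f(w) = (w - 3)^2; the learning rate and starting point are assumptions chosen for illustration.

```python
def grad(w):
    return 2 * (w - 3)        # derivative of the cost (w - 3)^2

w = 0.0                       # initial parameter value
lr = 0.1                      # learning rate (step size)
for _ in range(100):
    w -= lr * grad(w)         # step in the direction of steepest descent

print(round(w, 4))            # → 3.0, the minimiser of the cost
```

Each step moves w a fraction of the way towards the minimum, so the iterates converge geometrically for this convex cost.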
Answer
Advantages of batch gradient descent :
1. Fewer oscillations and noisy steps are taken towards the global minima of the loss function, due to updating the parameters by computing the average over all the training samples rather than the value of a single sample.
2. It can benefit from vectorization, which increases the speed of processing all training samples together.
3. It produces a more stable gradient descent convergence and a stable error gradient.
Disadvantages of batch gradient descent :
1. Sometimes a stable error gradient can lead to a local minimum, and unlike stochastic gradient descent, no noisy steps are there to help get out of the local minimum.
2. The entire training set can be too large to process in the memory, due to which additional memory might be needed.
3. Depending on computer resources, it can take too long to process all the training samples as a batch.
Answer
Advantages of stochastic gradient descent :
1. It is easier to fit into memory due to a single training sample being processed by the network.
2. It is computationally fast as only one sample is processed at a time.
4. Due to frequent updates, the steps taken towards the minima of the loss function have oscillations, which can help in getting out of local minima of the loss function (in case the computed position turns out to be a local minimum).
O_h = 1 / (1 + exp(- W_h x)), where O_h is the output vector of the hidden layer
O = 1 / (1 + exp(- W_o O_h)), where the output error term is δ = (y - O) O (1 - O)
Step 6 : The cumulative cycle error is computed by adding the present error to E :
E := E + 1/2 (y - O)^2
Step 7 : If k < K, then k := k + 1 and we continue the training by going back to Step 2; otherwise we go to Step 8.
Step 8 : The training cycle is completed. For E < E_max, terminate the training session. If E > E_max, then E := 0, k := 1, and we initiate a new training cycle by going back to Step 3.
PART-2
Multilayer Network, Derivation of Back Propagation Algorithm,
Generalization.
Questions-Answers
Answer
1. Backpropagation is an algorithm used in the training of feedforward neural networks for supervised learning.
2. Backpropagation efficiently computes the gradient of the loss function with respect to the weights of the network for a single input-output example.
3. This makes it feasible to use gradient methods for training multilayer networks, updating weights to minimize loss; we use gradient descent or variants such as stochastic gradient descent.
4. The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight by the chain rule, iterating backwards one layer at a time from the last layer to avoid redundant calculations of intermediate terms in the chain rule; this is an example of dynamic programming.
5. The term backpropagation refers only to the algorithm for computing the gradient, but it is often used loosely to refer to the entire learning algorithm, also including how the gradient is used, such as by stochastic gradient descent.
6. Backpropagation generalizes the gradient computation in the delta rule, which is the single-layer version of backpropagation, and is in turn generalized by automatic differentiation, where backpropagation is a special case of reverse accumulation (reverse mode).
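A minimal sketch of one backward pass through a single hidden layer of sigmoid units, repeated for a few gradient-descent steps. The weights, input, target, and learning rate are illustrative assumptions, and only one training example is used.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

x = [1.0, 0.5]                    # input example
y = 1.0                           # target output
W1 = [[0.1, -0.2], [0.3, 0.4]]    # input -> hidden weights (2 hidden units)
W2 = [0.2, -0.1]                  # hidden -> output weights
lr = 0.5                          # learning rate

def step():
    global W1, W2
    # forward pass
    h = [sigmoid(sum(W1[j][i] * x[i] for i in range(2))) for j in range(2)]
    o = sigmoid(sum(W2[j] * h[j] for j in range(2)))
    loss = 0.5 * (y - o) ** 2
    # backward pass: delta terms from the chain rule, last layer first
    delta_o = (o - y) * o * (1 - o)
    delta_h = [delta_o * W2[j] * h[j] * (1 - h[j]) for j in range(2)]
    # gradient-descent weight updates
    W2 = [W2[j] - lr * delta_o * h[j] for j in range(2)]
    W1 = [[W1[j][i] - lr * delta_h[j] * x[i] for i in range(2)] for j in range(2)]
    return loss

losses = [step() for _ in range(20)]
print(round(losses[0], 4), round(losses[-1], 4))
```

The loss shrinks over the iterations, which is the point of propagating the error gradient backwards before each update.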
7. The externally applied bias is denoted by b.
Fig. 4.13.1. Signal-flow graph of the perceptron : inputs x_1, ..., x_m with bias b feed a hard limiter that produces the output.
8. From the model, we find that the hard limiter input, or induced local field, of the neuron is
v = sum over i of w_i x_i + b
9. The goal of the perceptron is to correctly classify the set of externally applied inputs x_1, x_2, ..., x_m into one of two classes, G_1 and G_2.
10. The decision rule for classification is : if the output y is +1, assign the point represented by inputs x_1, x_2, ..., x_m to class G_1; else, if y is -1, assign it to class G_2.
11. In Fig. 4.13.2, a point (x_1, x_2) lying below the boundary line is assigned to class G_2, and a point above the line is assigned to class G_1. The decision boundary is calculated as
w_1 x_1 + w_2 x_2 + b = 0
Fig. 4.13.2. Decision boundary w_1 x_1 + w_2 x_2 + b = 0 separating classes G_1 and G_2.
12. There are two decision regions separated by a hyperplane defined by
sum over i of w_i x_i + b = 0
13. For adapting the weights, the perceptron convergence algorithm is used.
14. For a perceptron to function properly, the two classes G_1 and G_2 must be linearly separable.
15. Linearly separable means the patterns or sets of inputs to be classified must be separable by a straight line.
Answer
Statement : The perceptron convergence theorem states that, for any data set which is linearly separable, the perceptron learning rule is guaranteed to find a solution in a finite number of steps.
Proof :
1. We derive the error-correction learning algorithm for the perceptron.
2. The perceptron convergence theorem assumes that the synaptic weights w_1, w_2, ..., w_m of the perceptron can be adapted on an iteration-by-iteration basis.
3. The bias b(n) is treated as a synaptic weight driven by a fixed input equal to +1 :
x(n) = [+1, x_1(n), x_2(n), ..., x_m(n)]
where n denotes the iteration step in applying the algorithm.
4. Correspondingly, we define the weight vector as
w(n) = [b(n), w_1(n), w_2(n), ..., w_m(n)]
5. Accordingly, the linear combiner output is written in the compact form
v(n) = sum from i = 0 to m of w_i(n) x_i(n) = w^T(n) x(n)
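The learning rule behind the theorem can be sketched with the bias folded in as a weight on a fixed +1 input, exactly as in step 3 above. The AND-function training set and the learning rate are illustrative; since AND is linearly separable, the rule converges.

```python
# each x starts with the fixed +1 input; target is +1 only for (1, 1)
data = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], +1)]
w = [0.0, 0.0, 0.0]      # w[0] plays the role of the bias b
eta = 0.2                # learning rate

def predict(x):
    v = sum(wi * xi for wi, xi in zip(w, x))   # linear combiner v = w^T x
    return 1 if v > 0 else -1

for _ in range(25):      # a few epochs suffice on this separable data
    for x, target in data:
        error = target - predict(x)
        if error:        # update weights only on a misclassification
            w = [wi + eta * error * xi for wi, xi in zip(w, x)]

print([predict(x) for x, _ in data])  # → [-1, -1, -1, 1]
```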
Que 4.15. Explain multilayer perceptron with its architecture and characteristics.
Answer
Multilayer perceptron :
1. The perceptrons which are arranged in layers are called a multilayer perceptron. This model has three layers : an input layer, an output layer, and a hidden layer.
2. For the perceptrons in the input layer, the linear transfer function is used, and for the perceptrons in the hidden layer and output layer, the sigmoidal or squashed-S function is used.
3. The input signal propagates through the network in a forward direction, on a layer-by-layer basis.
4. In the multilayer perceptron, the bias b(n) is treated as a synaptic weight driven by a fixed input equal to +1 :
x(n) = [+1, x_1(n), x_2(n), ..., x_m(n)]
v(n) = sum from i = 0 to m of w_i(n) x_i(n) = w^T(n) x(n)
Fig. 4.15.1. Multilayer perceptron : the input signal enters at the input layer, passes through the first and second hidden layers, and leaves as the output signal at the output layer.
Multilayer perceptrons have been applied successfully to solve some difficult and diverse problems by training them in a supervised manner with the highly popular algorithm known as the error backpropagation algorithm.
Characteristics of multilayer perceptron:
1. In this model, each neuron in the network includes a non-linear
activation function (the non-linearity is smooth). The most commonly used
non-linear function is defined by :
yj = 1 / (1 + exp(-vj))
where vj is the induced local field (i.e., the weighted sum of all inputs
plus the bias) and yj is the output of neuron j.
2. The network contains hidden neurons that are not a part of the input or
output of the network. Hidden layer neurons enable the network to
learn complex tasks.
Answer
Effect of tuning parameters of the backpropagation neural network :
1. Momentum factor :
a. The momentum factor has a significant role in deciding the values
of learning rate that will produce rapid learning.
b. It determines the size of change in weights or biases.
4-17L (CS/IT-Sem-6)
Machine Learning Techniques
c. If the momentum factor is zero, the smoothening is minimum; if it is
one, the new adjustment is ignored and the previous one is repeated.
d. Between 0 and 1 is a region where the weight adjustment is
smoothened by an amount proportional to the momentum factor.
2. Learning coefficient :
a. A formula to select the learning coefficient is :
η = 1.5 / sqrt(N1^2 + N2^2 + ... + Nm^2)
b. The optimum value of the learning rate is 0.6, which produces fast
learning near the optimum.
4. Threshold value :
a. θ in eq. (4.16.1) is called the threshold value or the bias or the noise
factor.
4-18 L (CS/AT-Sem-5) Artificial Neural Network & Deep Learning
Answer
Selection of various parameters in BPN:
1. Number of hidden nodes:
a. The guiding criterion is to select the minimum number of nodes in the
first and third layers, so that the memory demand for storing the
weights can be kept minimum.
b. The number of separable regions in the input space, M, is a function
of the number of hidden nodes H in the BPN; inverting this relation
yields the value for H.
2. Momentum coefficient α :
a. To reduce the training time we use the momentum factor because
it enhances the training process.
b. The influence of momentum on the weight change is :
[ΔW]^(n+1) = -η ∂E/∂W + α [ΔW]^n
c. The momentum also overcomes the effect of local minima.
Fig. 4.17.1. Influence of the momentum term on weight change : the new
change [ΔW]^(n+1) combines the gradient step (weight change without
momentum) with the momentum term α[ΔW]^n.
3. Sigmoidal gain :
a. When the weights become large and force the neuron to operate in
a region where the sigmoidal function is very flat, a better method of
coping with network paralysis is to adjust the sigmoidal gain.
b. By decreasing this scaling factor, we effectively spread out the sigmoidal
function over a wider range so that training proceeds faster.
4. Local minima :
a. One of the most practical solutions involves the introduction of a
shock which changes all weights by specific or random amounts.
b. If this fails, then the most practical solution is to rerandomize the
weights and start the training all over.
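The momentum update of eq. above can be sketched in a few lines of Python. The quadratic error surface E(w) = (w - 3)^2 and the coefficient values below are illustrative assumptions, not from the text; the point is only to show the two terms of the weight change working together.

```python
def descend(grad, w, eta=0.1, alpha=0.9, steps=200):
    """Weight update with momentum: dW(n+1) = -eta * dE/dW + alpha * dW(n)."""
    delta = 0.0
    for _ in range(steps):
        delta = -eta * grad(w) + alpha * delta  # gradient step plus momentum term
        w += delta
    return w

# Toy error surface E(w) = (w - 3)^2, so dE/dw = 2 * (w - 3); minimum at w = 3.
w_star = descend(lambda w: 2.0 * (w - 3.0), w=0.0)
```

The momentum term keeps the update moving in a consistent direction, which is what lets it coast through shallow local minima.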
PART-3
Unsupervised Learning, SOM Algorithm and its Variants
Questions-Answers
Long Answer 1ype and Medium Answer Type Questions
Answer
1. Unsupervised learning is the training of a machine using information
that is neither classified nor labeled, allowing the algorithm to act on
that information without guidance.
2. Here the task of the machine is to group unsorted information according
to similarities, patterns and differences without any prior training of data.
3. Unlike supervised learning, no teacher is provided, which means no
training will be given to the machine.
4. Therefore the machine is restricted to find the hidden structure in
unlabeled data by itself.
Que 4.19. Classify unsupervised learning into two categories of
algorithms.
Answer
Classification of unsupervised learning algorithms into two categories :
1. Clustering : A clustering problem is where we want to discover the
inherent groupings in the data, such as grouping customers by
purchasing behavior.
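A clustering task of this kind can be sketched with a minimal k-means loop. The toy 2-D "customer" points and the naive first-k initialization are assumptions for illustration, not from the text:

```python
def kmeans(points, k=2, iters=20):
    """A minimal k-means: discover k natural groupings without any labels."""
    centers = points[:k]                       # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assignment step: nearest center
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        for i, cl in enumerate(clusters):      # update step: mean of each group
            if cl:
                centers[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return centers, clusters

# Two obvious customer groups (e.g. low vs. high spend) in 2-D feature space.
pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, clusters = kmeans(pts)
```

The algorithm receives no labels; the two groupings emerge purely from the similarities in the data.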
Que 4.20. What are the applications of unsupervised learning ?
Answer
Following are the applications of unsupervised learning :
1. Unsupervised learning automatically splits the dataset into groups based
on their similarities.
4. Latent variable models are widely used for data preprocessing, like
reducing the number of features in a dataset or decomposing the dataset
into multiple components.
Que 4.21. What is Self-Organizing Map (SOM) ?
Answer
1. Self-Organizing Map (SOM) provides a data visualization technique which
helps to understand high dimensional data by reducing the dimensions
of data to a map.
2. SOM also represents clustering concept by grouping similar data together.
3. A Self-Organizing Map (SOM) or Self-Organizing Feature Map (SOFM)
is a type of Artificial Neural Network (ANN) that is trained using
unsupervised learning to produce a low-dimensional (typically two-
dimensional), discretized representation of the input space of the training
samples, called a map, and is therefore a method to do dimensionality
reduction.
4. Self-organizing maps differ from other artificial neural networks as
they apply competitive learning as opposed to error-correction learning
(such as backpropagation with gradient descent), and in the sense that
they use a neighborhood function to preserve the topological properties
of the input space.
Que 4.23. What are the basic processes used in SOM? Also explain
stages of SOM algorithm.
Answer
Basic processes used in the SOM algorithm are :
1. Initialization : All the connection weights are initialized with small
random values.
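The remaining SOM processes (competition for the best matching unit, cooperation via a neighborhood function, and adaptation of the weights) can be sketched as a single update step. The 1-D map of 10 nodes, the two training inputs, the learning rate and the neighborhood width are all illustrative assumptions, not from the text:

```python
import math
import random

random.seed(0)
# Initialization: small random weights for a 1-D map of 10 nodes, 2-D inputs.
nodes = [[random.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(10)]

def som_step(x, nodes, eta=0.5, sigma=1.5):
    """One SOM update: competition (find the best matching unit),
    cooperation (neighborhood function), adaptation (move weights)."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in nodes]
    bmu = dists.index(min(dists))                        # winning unit
    for j, w in enumerate(nodes):
        h = math.exp(-((j - bmu) ** 2) / (2 * sigma ** 2))  # neighborhood
        nodes[j] = [wi + eta * h * (xi - wi) for wi, xi in zip(w, x)]

# Train on two alternating inputs; map nodes organize around each input.
for t in range(100):
    som_step((0.0, 0.0) if t % 2 == 0 else (1.0, 1.0), nodes)
```

After training, some region of the map sits near each input cluster, which is the "similar data grouped together" behavior described above.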
PART-4|
Deep Learning, Introduction, Concept of Convolutional Neural
Network, Types of Layers, (Convolutional Layers, Activation
Function, Pooling, Fully Connected)
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Answer
1. Deep learning is the subfield of artificial intelligence that focuses on
creating large neural network models that are capable of making
accurate data-driven decisions.
2. Deep learning is used where the data is complex and has large datasets.
3. Facebook uses deep learning to analyze text in online conversations.
Google and Microsoft all use deep learning for image search and machine
translation.
All modern smart phones have deep learning systems running on them.
For example, deep learning is the standard technology for speech
recognition, and also for face detection on digital cameras.
Answer
Different architecture of deep learning are:
1. Deep Neural Network : It is a neural network with a certain level of
complexity (having multiple hidden layers in between the input and output
layers). They are capable of modeling and processing non-linear
relationships.
2. Deep Belief Network (DBN) : It is a class of Deep Neural Network. It
is a multi-layer belief network. Steps for performing DBN are :
a. Learn a layer of features from visible units using Contrastive
Divergence algorithm.
b. Treat activations of previously trained features as visible units and
then learn features of features.
c. Finally, the whole DBN is trained when the learning for the final
hidden layer is achieved.
3. Recurrent Neural Network (performs the same task for every element
of a sequence) : Allows for parallel and sequential computation.
Similar to the human brain (a large feedback network of connected
neurons), they are able to remember important things about the input
they received, which enables them to be more precise.
Answer
Following are the applications of deep learning :
1. Automatic text generation : A corpus of text is learned and from this
model new text is generated, word-by-word or character-by-character.
The model is then capable of learning how to spell, punctuate, form
sentences, or it may even capture the style.
2. Healthcare : Helps in diagnosing various diseases and treating them.
3. Automatic machine translation : Certain words, sentences or
phrases in one language are transformed into another language (deep
learning is achieving top results in the areas of text and images).
4. Image recognition : Recognizes and identifies people and objects in
images as well as understands content and context. This area is already
being used in gaming, retail, tourism, etc.
5. Predicting earthquakes : Teaches a computer to perform viscoelastic
computations which are used in predicting earthquakes.
Answer
Convolutional networks are also known as Convolutional Neural Networks
(CNNs).
Fig.: A neural network with an input layer, hidden layers 1 and 2, and an
output layer.
10. Within a single layer, each neuron is completely independent; neurons
do not share any connections.
11. The fully connected layer, (the output layer), contains class scores in the
case of an image classification problem. There are three main layers in
a simple ConvNet.
Que 4.29. Write a short note on the convolutional layer.
Answer
1. Convolutional layers are the major building blocks used in convolutional
neural networks.
2. A convolution is the simple application of a filter to an input that results
in an activation.
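The "filter applied to an input" idea can be sketched directly. The tiny image and the vertical-edge filter below are assumptions for illustration; note that, as in most CNN libraries, the sliding dot product here is technically cross-correlation, without flipping the kernel:

```python
def conv2d(image, kernel):
    """Slide a filter over an input; each dot product yields one activation.
    'Valid' mode: the kernel stays fully inside the image."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge filter responds only where intensity changes left-to-right.
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
edge = [[1, -1],
        [1, -1]]
fmap = conv2d(img, edge)   # the resulting activation (feature) map
```

The feature map is zero on the flat regions and non-zero exactly at the dark-to-bright boundary, which is what "an activation" means here.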
Answer
Activation function :
1. An activation function is a function that is added into an artificial neural
network in order to help the network learn complex patterns in the
data.
2. When comparing with a neuron-based model that is in our brains, the
activation function is at the end, deciding what is to be fired to the next
neuron.
Pooling layer:
1. A pooling layer is a new layer added after the convolutional layer.
Specifically, after a non-linearity (for example, ReLU) has been applied
to the feature maps output by a convolutional layer, the layers in a
model may look as follows :
a. Input image
b. Convolutional layer
c. Non-linearity
d. Pooling layer
2. The addition of a pooling layer after the convolutional layer is a common
pattern used for ordering layers within a convolutional neural network
that may be repeated one or more times in a given model.
3. The pooling layer operates upon each feature map separately to create
a new set of the same number of pooled feature maps.
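The ordering above (convolution, non-linearity, pooling) can be sketched in miniature. The feature-map values, the 2 x 2 window and the stride of 2 are assumptions for illustration:

```python
def max_pool(fmap, size=2, stride=2):
    """Downsample a feature map by taking the max over size x size windows."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, w - size + 1, stride)]
            for i in range(0, h - size + 1, stride)]

# Non-linearity (ReLU) is applied first, then the pooling layer, as in the
# input -> convolution -> non-linearity -> pooling ordering described above.
relu = lambda m: [[max(0, v) for v in row] for row in m]
pooled = max_pool(relu([[1, -3, 2, 4],
                        [5, 6, -1, 0],
                        [7, 2, 8, -2],
                        [1, 0, 3, 9]]))
```

Each feature map is pooled separately, so a stack of maps yields the same number of smaller pooled maps.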
Fully connected layer:
1. Fully connected layers are an essential component of Convolutional
Neural Networks (CNNs), which have been proven very successful in
recognizing and classifying images for computer vision.
2. The CNN process begins with convolution and pooling, breaking down
the image into features and analyzing them independently.
3. The result of this process feeds into a fully connected neural network
structure that drives the final classification decision.
PART-5
Concept of Convolution (1D and 2D) Layers, Training of Network,
Case Study of CNN, e.g., on Diabetic Retinopathy, Building a
Smart Speaker, Self-Driving Car, etc.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Answer
1D convolutional neural network:
1. Convolutional Neural Network (CNN) models were developed for image
classification, in which the model accepts a two-dimensional input
representing an image's pixels and color channels, in a process called
feature learning.
2. The same process can be applied to one-dimensional sequences of data,
such as an audio recording.
3. One-dimensional CNNs are also useful for Natural Language Processing
(NLP), although Recurrent Neural Networks which leverage Long
Short-Term Memory (LSTM) cells are more promising than CNNs, as
they take into account the proximity of words to create trainable
patterns.
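A 1-D convolution over a sequence can be sketched in the same way as the 2-D case; the noisy signal and the moving-average kernel below are assumptions for illustration:

```python
def conv1d(seq, kernel):
    """Slide a 1-D filter over a sequence (e.g. audio samples)."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

# A simple moving-average filter smooths a spike in a 1-D signal.
smoothed = conv1d([1, 1, 9, 1, 1, 1], [1 / 3, 1 / 3, 1 / 3])
```

In a real 1-D CNN the kernel weights are learned rather than fixed, but the sliding-window mechanics are identical.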
Answer
1. Once a network has been structured for a particular application, that
network is ready to be trained.
2. To start this process, the initial weights are chosen randomly. Then the
training, or learning, begins.
3. There are two approaches to training:
a In supervised training, both the inputs and the outputs are provided.
The network then processes the inputs and compares its resulting
outputs against the desired outputs.
b. Errors are then propagated back through the system, causing the
system to adjust the weights which control the network. This
process occurs over and over as the weights are continually
tweaked.
c. The set of data which enables the training is called the "training
set". During the training of a network, the same set of data is
processed many times as the connection weights are ever refined.
d. The other type of training is called unsupervised training. In
unsupervised training, the network is provided with inputs but not
with desired outputs.
Research in computer imaging has produced encouraging conclusions.
8. Significant work has been done on detecting the features of DR using
automated methods such as support vector machines and k-NN classifiers.
9. The majority of these classification techniques are based on two-class
classification for DR or no DR.
Answer
Answer
1. The rapid development of the Internet economy and Artificial Intelligence
(AI) has promoted the progress of self-driving cars.
2. The market demand and economic value of self-driving cars are
increasingly prominent. At present, more and more enterprises and
scientific research institutions have invested in this field, such as Google
and Tesla.
6. Following with the Tesla Model series, its "auto-pilot" technology has made
major breakthroughs in recent years.
7. Although Tesla's autopilot technology is only regarded as Level 2
stage by the National Highway Traffic Safety Administration (NHTSA),
Tesla shows us that the car has basically realized automatic driving
under certain conditions.
5
UNIT
Reinforcement Learning and Genetic Algorithm
CONTENTS
Part-1 : Introduction to Reinforcement ................... 5-2L to 5-6L
         Learning
Part-2 : Learning Task, Example of ...................... 5-6L to 5-9L
         Reinforcement Learning in Practice
Part-3 : Learning Models for Reinforcement .............. 5-9L to 5-13L
Part-4 : Introduction to Deep ........................... 5-13L to 5-15L
         Q Learning
Part-5 : Genetic Algorithm, ............................. 5-15L to 5-30L
         Introduction, Components,
         GA Cycle of Reproduction,
         Crossover, Mutation,
         Genetic Programming,
         Models of Evolution and
         Learning, Application.
5-1 L (CS/IT-Sem-5)
5-2L (CS/TT-Sem-5) Reinforcement Iearning & Genetic Algorithm
PART-1
Introduction to Reinforcement Learning
Questions-Answers
Long Answer Type and Medium Answer Type Questions
5. Without some feedback about what is good and what is bad, the agent
will have no grounds for deciding which move to make.
6. The agent needs to know that something good has happened when it
wins and that something bad has happened when it loses.
Fig.: Block diagram of reinforcement learning : the environment supplies
the state (input) vector and a primary reinforcement signal to the critic;
the critic converts it into a heuristic reinforcement signal for the learning
system, which acts back on the environment through actions.
Answer
Reinforcement learning : Refer Q. 5.1, Page 5-2L, Unit-5.
2. Its goal is simply to learn how good the policy is; that is, to learn the
utility function U^π(s).
3. Fig. 5.3.1 shows a policy for the world and the corresponding utilities.
4. In Fig. 5.3.1(a) the policy happens to be optimal with rewards of
R(s) = -0.04 in the non-terminal states and no discounting.
5. A passive learning agent does not know the transition model T(s, a, s'),
which specifies the probability of reaching state s' from state s after
doing action a; nor does it know the reward function R(s), which specifies
the reward for each state.
6. The agent executes a set of trials in the environment using its policy π.
7. In each trial, the agent starts in state (1, 1) and experiences a sequence
of state transitions until it reaches one of the terminal states, (4, 2) or
(4, 3).
8. Its percepts supply both the current state and the reward received in
that state. Typical trials might look like this :
Fig. 5.3.1. (a) A policy for the 4 x 3 world; (b) the corresponding utilities
(e.g., 0.762, 0.660, 0.705, 0.655, 0.611 and 0.388 for non-terminal states).
9. Each state percept is subscripted with the reward received. The object is
to use the information about rewards to learn the expected utility U^π(s)
associated with each non-terminal state s.
5. These equations can be solved to obtain the utility function U using the
value iteration or policy iteration algorithms.
6. Once the utility function U is optimal for the learned model, the agent can
extract an optimal action by one-step look-ahead to maximize the expected
utility.
7. Alternatively, if it uses policy iteration, the optimal policy is already
available, so it should simply execute the action the optimal policy
recommends.
Explain.
Answer
Disadvantages of positive reinforcement learning :
i. Too much reinforcement can lead to an overload of states which
can diminish the results.
Answer
PART-2
Learning Task, Example of Reinforcement Learning in Practice
Questions-Answers
Long Answer Type and Medium Answer Type Questions
etc.
6. Clustering : Clustering tasks are all about finding natural groupings of
data and a label associated with each of these groupings (clusters).
Some common examples include customer segmentation and product
feature identification for a product roadmap.
7. Multivariate querying: Multivariate querying is about querying or
finding similar objects.
8. Density estimation : Density estimation problems are related to
finding the likelihood or frequency of objects.
9. Dimension reduction : Dimension reduction is the process of reducing
the number of random variables under consideration, and can be divided
into feature selection and feature extraction, for example.
Answer
1. Reinforcement learning (RL) is learning concerned with how software
agents ought to take actions in an environment in order to maximize
the notion of cumulative reward.
2. The software agent is not told which actions to take, but instead must
discover which actions yield the most reward by trying them.
For example,
Consider the scenario of teaching new tricks to a cat:
1. As a cat does not understand English or any other human language, we
cannot tell her directly what to do. Instead, we follow a different strategy.
4. That is like the learning that the cat gets about "what to do" from positive
experiences.
5. At the same time, the cat also learns what not to do when faced with
negative experiences.
PART-3
Learning Models for Reinforcement (Markov Decision Process, Q
Learning, Q Learning Function, Q Learning Algorithm), Application
of Reinforcement Learning.
Questions-Answers
Answer
Following are the terms used in reinforcement learning :
i. Agent : It is an assumed entity which performs actions in an environment
to gain some reward.
iii. State (s) : State refers to the current situation returned by the
environment.
ix. Q-value or action value Q(s, a) : The Q-value is quite similar to value.
The only difference between the two is that it takes an additional
parameter, the current action.
Answer
1. Reinforcement learning is defined by a specific type of problem, and all
its solutions are classed as reinforcement learning algorithms.
2. In the problem, an agent is supposed to decide the best action to select
based on his current state.
3. When this step is repeated, the problem is known as a Markov Decision
Process.
4. A Markov Decision Process (MDP) model contains :
a. A State is a set of tokens that represent every state that the agent can
be in.
b. A Model (sometimes called a Transition Model) gives an action's effect in
a state. In particular, T(S, a, S') defines a transition T where being in
state S and taking an action 'a' takes us to state S' (S and S' may be the
same).
c. A Reward is a real-valued reward function. R(S) indicates the reward for
simply being in the state S. R(S, a) indicates the reward for being in a
state S and taking an action 'a'. R(S, a, S') indicates the reward for being
in a state S, taking an action 'a' and ending up in a state S'.
d. A Policy is a solution to the Markov Decision Process. A policy is a
mapping from S to a. It indicates the action 'a' to be taken while in state
S.
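A small MDP built from exactly these components can be solved by value iteration, giving the utilities U(s) mentioned earlier. This is an illustrative sketch: the two-state environment, the reward-for-being-in-B function (the R(S) form above) and the discount factor are assumptions, not from the text.

```python
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    """Iterate U(s) = max_a sum_s' T(s, a, s') * (R(s, a, s') + gamma * U(s'))."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            u = max(sum(p * (R(s, a, s2) + gamma * U[s2])
                        for s2, p in T(s, a).items())
                    for a in actions)
            delta = max(delta, abs(u - U[s]))
            U[s] = u
        if delta < eps:          # utilities have converged
            return U

# Hypothetical 2-state MDP: 'go' moves A -> B; B is absorbing and pays +1.
T = lambda s, a: {"B": 1.0} if (a == "go" or s == "B") else {s: 1.0}
R = lambda s, a, s2: 1.0 if s == "B" else 0.0
U = value_iteration(["A", "B"], ["stay", "go"], T, R)
```

With gamma = 0.9 the absorbing state is worth 1 / (1 - 0.9) = 10, and A is worth one discount step less; the optimal policy at A is the action 'go' that maximizes expected utility.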
Answer
Following are the applications of reinforcement learning :
1. Robotics for industrial automation.
2. Business strategy planning.
3. Machine learning and data processing.
4. It helps us to create training systems that provide custom instruction
and materials according to the requirement of students.
5. Aircraft control and robot motion control.
Following are the reasons for using reinforcement learning :
1. It helps us to find which situation needs an action.
2. Helps us to discover which action yields the highest reward over a
longer period.
3. Reinforcement learning also provides the learning agent with a reward
function.
4. It also allows us to figure out the best method for obtaining large rewards.
Answer
We cannot apply the reinforcement learning model in all situations. Following
are the conditions when we should not use the reinforcement learning model :
1. When we have enough data to solve the problem with a supervised
learning method.
Answer
1. Q-learning is a model-free reinforcement learning algorithm.
2. Q-learning is a value-based learning algorithm. Value-based algorithms
update the value function based on an equation (particularly the Bellman
equation).
3. Whereas the other type, policy-based, estimates the value function with
a greedy policy obtained from the last policy improvement.
4. Q-learning is an off-policy learner, i.e., it learns the value of the optimal
policy independently of the agent's actions.
5. On the other hand, an on-policy learner learns the value of the policy
being carried out by the agent, including the exploration steps, and it will
find a policy that is optimal, taking into account the exploration inherent
in the policy.
Answer
Step 1 : Initialize the Q-table : First the Q-table has to be built. There are
n columns, where n = number of actions, and m rows, where m = number of
states.
In our example, n = {Go left, Go right, Go up, Go down} and m = {Start, Idle,
Correct path, Wrong path, End}. First, let us initialize the values at 0.
Step 2: Choose an action.
Step 3 : Perform an action : The combination of steps 2 and 3 is performed
for an undefined amount of time. These steps run until the training is
stopped, or until the training loop is stopped as defined in the code.
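The three steps above can be sketched as a complete tabular Q-learning loop. The environment is a hypothetical 5-state chain with actions left/right rather than the grid example in the text, and the learning rate, discount and exploration rate are illustrative assumptions:

```python
import random

random.seed(1)

# Step 1: initialize the Q-table with zeros (m states x n actions).
states, actions = range(5), (0, 1)      # hypothetical chain; 0 = left, 1 = right
Q = {(s, a): 0.0 for s in states for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2

def env_step(s, a):
    """Hypothetical environment: reaching state 4 pays +1 and ends the episode."""
    s2 = min(4, s + 1) if a == 1 else max(0, s - 1)
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

# Steps 2 and 3, repeated until the training loop is stopped.
for _ in range(500):
    s, done = 0, False
    while not done:
        if random.random() < epsilon:               # explore
            a = random.choice(actions)
        else:                                       # exploit the current table
            a = max(actions, key=lambda b: Q[(s, b)])
        s2, r, done = env_step(s, a)
        best_next = 0.0 if done else max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
```

After training, reading the greedy action out of the table gives "go right" in every non-terminal state, which is the optimal policy for this chain.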
PART-4
Introduction to Deep Q Learning
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Answer
1. In deep Q-learning, we use a neural network to approximate the
Q-value function.
2. The state is given as the input and the Q-value of all possible actions is
generated as the output.
3. The comparison between Q-learning and deep Q-learning is illustrated
below :
Fig. 5.16.1. In Q-learning, a (state, action) pair is mapped to a single
Q-value; in deep Q-learning, the state alone is mapped to the Q-values of
action 1 through action N.
4. On a higher level, deep Q-learning works as follows :
i. Gather and store samples in a replay buffer with the current policy.
ii. Randomly sample batches of experiences from the replay buffer.
iii. Use the sampled experiences to update the Q-network.
iv. Repeat steps i-iii.
Que 5.17. What are the steps involved in deep Q-learning network ?
Answer
Steps involved in reinforcement learning using deep Q-learning networks :
Answer
Start with Q_0(s, a) for all s, a.
For k = 1, 2, ... till convergence :
    Sample action a, get next state s'.
    If s' is terminal :
        target = R(s, a, s')
        Sample new initial state s'.
    Else :
        target = R(s, a, s') + γ max over a' of Q_k(s', a')
    Q_{k+1}(s, a) <- (1 - α) Q_k(s, a) + α · target
PART-5
Genetic Algorithm, Introduction, Components, GA Cycle of
Reproduction, Crossover, Mutation, Genetic Programming,
Models of Evolution and Learning, Application.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
b. Mutation testing
c. Code breaking
1. Formulating the fitness function is a difficult problem.
2. The selection of suitable genetic operators is difficult.
Answer
Different phases of genetic algorithm are:
1. Initial population :
a. The process begins with a set of individuals which is called a
population.
b. Each individual is a solution to the problem we want to solve.
c. An individual is characterized by a set of parameters (variables)
known as genes.
d. Genes are joined into a string to form a chromosome (solution).
e. In a genetic algorithm, the set of genes of an individual is represented
using a string.
Fig.: A gene is a single position in the string, a chromosome is the full
string of genes, and the population is the set of chromosomes.
3. Selection:
a. The idea of the selection phase is to select the fittest individuals and let
them pass their genes to the next generation.
b. Two pairs of individuals (parents) are selected based on their fitness
scores.
4. Crossover:
a. Crossover is the most significant phase in a genetic algorithm.
b. For each pair of parents to be mated, a crossover point is chosen at
random from within the genes.
c. For example, consider the crossover point to be 3 : the offspring A1
and A2 exchange the genes of their parents up to the crossover point.
5. Mutation :
a. When new offspring are formed, some of their genes can be subjected
to a mutation with a low random probability (e.g., A5 = 1 1 0 0 0 after
mutation).
b. Mutation occurs to maintain diversity within the population and
prevent premature convergence.
6. Termination :
a. The algorithm terminates if the population has converged (does
not produce offspring which are significantly different from the
previous generation).
b. Then it is said that the genetic algorithm has provided a set of
solutions to our problem.
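The phases above can be sketched as one loop. The "OneMax" problem (maximize the number of 1-genes), the population size, mutation rate and fixed-generation termination are all illustrative assumptions, not from the text; tournament selection stands in for the selection phase.

```python
import random

random.seed(0)

def fitness(chrom):
    """Toy 'OneMax' problem: the fitness is the number of 1-genes."""
    return sum(chrom)

def evolve(pop_size=20, length=12, generations=60, p_mut=0.05):
    # Initial population: random chromosomes (bit-strings).
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        def select():                               # selection: binary tournament
            a, b = random.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = select(), select()
            cut = random.randrange(1, length)       # random crossover point
            child = p1[:cut] + p2[cut:]             # single-point crossover
            child = [g ^ 1 if random.random() < p_mut else g
                     for g in child]                # mutation: rare bit flips
            nxt.append(child)
        pop = nxt            # termination here: fixed number of generations
    return max(pop, key=fitness)

best = evolve()
```

Within a few dozen generations the population converges on chromosomes that are nearly all 1s.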
Que : Explain genetic algorithm with its working principle.
Answer
Genetic algorithm : Refer Q. 1.24, Page 1-23L, Unit-1.
Working principle :
1. To illustrate the working principle of GA, we consider an unconstrained
optimization problem :
maximize f(X)
3. If we want to minimize f(X), for f(X) > 0, then we can write the objective
function as :
maximize 1 / (1 + f(X))
4. If f(X) < 0, instead of minimizing f(X), maximize (-f(X)). Hence, both
maximization and minimization problems can be handled by GA.
Answer
Benefits of using GA:
1. It is easy to understand.
2. It is modular and separate from application.
3. It supports multi-objective optimization.
Answer
1. Genetic representation is a way of representing solutions/individuals in
evolutionary computation methods.
2. Genetic representation can encode the appearance, behaviour and
physical qualities of individuals.
3. A common method of representation is an array of bits.
5. These genetic representations are convenient because parts of individuals
are easily aligned due to their fixed size, which makes simple crossover
operations possible.
Answer
Genetic representations:
1. Encoding :
a. Encoding is a process of representing individual genes.
b. The process can be performed using bits, numbers, trees, arrays,
lists or any other objects.
c. The encoding depends mainly on the problem being solved.
2. Binary encoding :
a. Binary encoding is the most commonly used method of genetic
representation because GA uses this type of encoding.
Chromosome A| 101100101100101011100101
Chromosome B 111111100000110000011111
c. Binary encoding gives many possible chromosomes.
3. Octal or hexadecimal encoding :
a. The encoding is done using octal or hexadecimal numbers :
                  Octal        Hexadecimal
Chromosome A      54545345     B2CAE5
Chromosome B      77406037     FE0C1F
4. Permutation encoding :
a. Every chromosome is a string of numbers representing a position in
a sequence :
Chromosome A      1 5 3 2 6 4 7 9 8
Chromosome B      8 5 6 7 2 3 1 4 9
5. Value encoding :
a. Direct value encoding can be used in problems where some
complicated values, such as real numbers, are used.
b. In value encoding, every chromosome is a string of some values.
Values can be anything connected to the problem : real numbers,
chars or some complicated objects.
Chromosome A      1.2324 5.3243 0.4556 2.3293 2.4545
Chromosome B      ABDJEIFJDHDIERJFDLDFLFEGT
Chromosome C      (back), (back), (right), (forward), (left)
6. Tree encoding :
a. Tree encoding is used for evolving programs or expressions, for
genetic programming.
b. In tree encoding, every chromosome is a tree of some objects,
such as functions or commands in a programming language.
c. The programming language LISP is often used for this, because
programs in it are represented in this form and can be parsed as
trees relatively easily.
Chromosome A : (+ x (/ 5 y))
Chromosome B : (do_until step wall)
Que 5.27. Explain different methods of selection in genetic
algorithm in order to select a population for the next generation.
Answer
The various methods of selecting chromosomes for parents to cross over
are :
a. Roulette-wheel selection :
i. Roulette-wheel selection is the proportionate reproductive method
where a string is selected from the mating pool with a probability
proportional to its fitness.
ii. Thus, the ith string in the population is selected with a probability
proportional to F_i, where F_i is the fitness value for that string.
iii. Since the population size is usually kept fixed in a genetic algorithm,
the sum of the probabilities of each string being selected for the
mating pool must be one.
iv. The probability of the ith string being selected is :
P_i = F_i / (F_1 + F_2 + ... + F_n)
where 'n' is the population size.
v. The average fitness is :
F_avg = (F_1 + F_2 + ... + F_n) / n ...(5.27.1)
b. Boltzmann selection :
i. Boltzmann selection uses the concept of simulated annealing.
ii. Simulated annealing is a method of functional minimization or
maximization.
iii. This method simulates the process of slow cooling of molten metal
to achieve the minimum function value in a minimization problem.
iv. The cooling phenomenon is simulated by controlling a temperature-
like parameter, so that a system in thermal equilibrium at a
temperature T has its energy distributed probabilistically according
to :
P(E) = exp(-E / kT) ...(5.27.2)
where 'k' is the Boltzmann constant.
This expression suggests that a system at a high temperature has
almost uniform probability of being at any energy state, but at a low
temperature it has a small probability of being at a high energy
state.
vi. Therefore, by controlling the temperature T and assuming search
process follows Boltzmann probability distribution, the convergence
of the algorithm is controlled.
c. Tournament selection :
i. GA uses a strategy to select the individuals from the population and
insert them into a mating pool.
ii. A selection strategy in GA is a process that favours the selection of
better individuals in the population for the mating pool.
iii. There are two important issues in the evolution process of genetic
search :
1. Population diversity : Population diversity means that the
genes from the already discovered good individuals are
exploited.
2. Selective pressure : Selective pressure is the degree to
which the better individuals are favoured. The higher the
selective pressure, the more the better individuals are
favoured.
d. Rank selection :
i. Rank selection first ranks the population; then every chromosome
receives its fitness from the ranking.
ii. The worst will have fitness 1, the next 2, ..., and the best will have
fitness N (N is the number of chromosomes in the population).
iii. The method can lead to slow convergence because the best
chromosomes do not differ much from the others.
e. Steady-state selection :
i. The main idea of this selection is that a bigger part of the chromosomes
should survive to the next generation.
ii. GA works in the following way :
1. In every generation a few chromosomes are selected for
creating new offspring.
2. Then some chromosomes are removed and the new offspring
are placed in their place.
3. The rest of the population survives to the new generation.
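The roulette-wheel method, with its selection probability P_i = F_i / (F_1 + ... + F_n), can be sketched as follows; the four-chromosome population and the fitness values are assumptions for illustration:

```python
import random

random.seed(0)

def roulette_select(population, fitnesses):
    """Pick the ith string with probability P_i = F_i / (F_1 + ... + F_n)."""
    total = sum(fitnesses)
    r = random.uniform(0, total)        # spin the wheel
    acc = 0.0
    for individual, f in zip(population, fitnesses):
        acc += f                        # each string occupies a slice of the wheel
        if r <= acc:
            return individual
    return population[-1]               # guard against floating-point round-off

pop = ["A", "B", "C", "D"]
fit = [10.0, 5.0, 3.0, 2.0]             # "A" occupies half the wheel
picks = [roulette_select(pop, fit) for _ in range(2000)]
```

Over many spins, "A" is selected about half the time, in proportion to its share of the total fitness; rank selection would instead build the wheel from ranks 4, 3, 2, 1.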
Que 5.28. Differentiate between Roulette-wheel based on fitness
and Roulette-wheel based on rank with suitable example.
Answer
Difference :
S. No. | Roulette-wheel based on fitness | Roulette-wheel based on rank
1. | The population member is selected with a probability that is directly
proportional to its fitness. | The probability of a population member being
selected is based on its rank.
Example :
i. Imagine a roulette-wheel where all chromosomes in the population
are placed; each chromosome (chromosomes 1 to 4) has its place sized
according to its fitness function.
ii. When the wheel is spun, the wheel will finally stop and the pointer
attached to it will point to one of the chromosomes, favouring those
with bigger fitness values.
Fig. 5.28.1. Roulette-wheel selection.
iii. In roulette-wheel selection based on rank, the population is first ranked,
and every chromosome then receives a slice of the wheel according to
its rank rather than its raw fitness.
Fig. 5.28.2. Situation before ranking (graph of fitnesses).
Fig. 5.28.3. Situation after ranking (graph of order numbers).
Answer
Generational cycle of GA :
Fig. 5.29.1. The GA cycle : the population (chromosomes) is decoded into
strings and evaluated (fitness); selection forms a mating pool of parents;
reproduction through genetic operators (manipulation) produces offspring,
which form the new generation.
Components of generational cycle in GA :
1. Population (chromosomes) : A population is a collection of individuals.
A population consists of a number of individuals being tested, the
phenotype parameters defining the individuals, and some information
about the search space.
2. Evaluation (fitness) : A fitness function is a particular type of objective
function that quantifies the optimality of a solution (i.e., a chromosome).
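The whole generational cycle can be sketched in a few lines, assuming a bit-string encoding, fitness-proportionate selection, single-point crossover, and bit-flip mutation; all names and parameter values here are illustrative, not prescribed by the text.

```python
import random

def ga_cycle(pop, fitness, generations=30, p_mut=0.01):
    """One run of the generational GA cycle:
    evaluation -> selection (mating pool) -> crossover -> mutation."""
    n, length = len(pop), len(pop[0])
    for _ in range(generations):
        # Evaluation (fitness) of every chromosome in the population.
        weights = [fitness(c) for c in pop]
        # Selection (mating pool): fitness-proportionate sampling.
        pool = random.choices(pop, weights=weights, k=n)
        # Reproduction: single-point crossover on mated pairs of parents.
        offspring = []
        for i in range(0, n, 2):
            a, b = pool[i], pool[(i + 1) % n]
            cut = random.randint(1, length - 1)
            offspring += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
        # Manipulation: bit-flip mutation; the offspring become the new generation.
        pop = ["".join(bit if random.random() > p_mut else "10"[int(bit)]
                       for bit in c) for c in offspring[:n]]
    return max(pop, key=fitness)

random.seed(1)
# Fitness counts ones (+1 keeps every selection weight positive).
best = ga_cycle(["0000", "0001", "0010", "0100"], lambda c: c.count("1") + 1)
```

Each pass of the loop corresponds to one turn of the cycle in Fig. 5.29.1: the decoded strings are evaluated, a mating pool is drawn, parents mate, and the offspring replace the old population.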
Answer
Mutation is done in genetic algorithm because :
1. It maintains genetic diversity from one generation of a population of
genetic algorithm chromosomes to the next.
2. GA can give a better solution to the problem by using mutation.
Types of mutation :
1. Bit string mutation : The mutation of bit strings occurs through bit
flips at random positions.
Example : 1010010 → 1010110
The probability of mutation of a bit is 1/l, where l is the length of the
binary vector. Thus, a mutation rate of one mutation per individual
selected for mutation is reached on average.
2. Flip bit : This mutation operator takes the chosen genome and inverts
the bits (i.e., if a genome bit is 1, it is changed to 0 and vice versa).
3. Boundary : This mutation operator replaces the genome with either
the lower or upper bound randomly. This can be used for integer and
float genes.
4. Non-uniform : The probability that the amount of mutation will go to 0
with the next generation is increased by using the non-uniform mutation
operator. It keeps the population from stagnating in the early stages of
the evolution.
5. Uniform : This operator replaces the value of the chosen gene with a
uniform random value selected between the user-specified upper and
lower bounds for that gene.
6. Gaussian : This operator adds a unit Gaussian distributed random
value to the chosen gene. If it falls outside of the user-specified lower or
upper bounds for that gene, the new gene value is clipped.
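Two of the operators above can be sketched directly: bit-string mutation with the 1/l rate, and Gaussian mutation with clipping to the gene bounds. The function names, the seed, and the sigma/bounds values are illustrative.

```python
import random

def bit_flip_mutation(chrom, rate=None):
    """Bit-string mutation: flip each bit with probability 1/l
    (l = length of the binary vector) unless a rate is given."""
    rate = 1.0 / len(chrom) if rate is None else rate
    return "".join(b if random.random() > rate else "10"[int(b)] for b in chrom)

def gaussian_mutation(genes, sigma, low, high):
    """Gaussian mutation for float genes: add N(0, sigma) noise and
    clip the result to the user-specified [low, high] bounds."""
    return [min(high, max(low, g + random.gauss(0.0, sigma))) for g in genes]

random.seed(4)
mutated = bit_flip_mutation("1010010")
floats = gaussian_mutation([0.5, 0.9], sigma=0.2, low=0.0, high=1.0)
```

With `rate=None` each bit is flipped with probability 1/7 for the example string, so on average one bit mutates per selected individual, matching the 1/l rule.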
Answer
1. Crossover is the basic operator of genetic algorithm. The performance
of a genetic algorithm depends on the crossover operator.
2. The type of crossover operator used for a problem depends on the type
of encoding used.
3. The basic principle of the crossover process is to exchange the genetic
material of two parents beyond the crossover points.
Function of crossover operation/operator in genetic algorithm :
1. The main function of the crossover operator is to introduce diversity
in the population.
2. A specific crossover made for a specific problem can improve the
performance of the genetic algorithm.
3. Crossover combines parental solutions to form offspring with a hope
to produce better solutions.
4. Crossover operators are critical in ensuring good mixing of building
blocks.
5. Crossover is used to maintain a balance between exploitation and
exploration. The exploitation and exploration techniques are
responsible for the performance of genetic algorithms. Exploitation
means using the already existing information to find out the better
solution, and exploration means investigating new and unknown
solutions in the exploration space.
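The exchange of genetic material beyond a crossover point can be sketched as follows; single-point crossover on bit strings is assumed here as the simplest case.

```python
import random

def single_point_crossover(p1, p2):
    """Exchange the genetic material of two parents beyond a random
    crossover point, producing two offspring."""
    cut = random.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

random.seed(2)
c1, c2 = single_point_crossover("11111", "00000")
```

Because the two children swap complementary tails, every bit of both parents survives somewhere in the offspring pair; only its pairing with the other bits changes, which is the "mixing of building blocks" mentioned above.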
Applications of GA :
1. Image processing : GAs are used for various digital image processing
(DIP) tasks as well, like dense pixel matching.
2. Machine learning : Genetics-based machine learning (GBML) is a
niche area in machine learning.
3. Robot trajectory generation : GAs have been used to plan the path
which a robot arm takes by moving from one point to another.
Answer
1. The TSP consists of a number of cities, where each pair of cities has a
corresponding distance.
Fig. 5.33.1. Genetic algorithm procedure for TSP (start → set GA
parameters → generate initial random population → ... → mutation of
chromosomes).
2. The aim is to visit all the cities such that the total distance travelled
is minimized.
3. A solution, and therefore a chromosome which represents that solution
to the TSP, can be given as an order, that is, a path of the cities.
4. The procedure for solving the TSP can be viewed as the process flow
given in Fig. 5.33.1.
10. After each generation, a new set of chromosomes, whose size is equal
to the initial population size, is evolved.
11. This transformation process from one generation to the next continues
until the population converges to the optimal solution, which usually
occurs when a certain percentage of the population (for example, 80 %)
has the same optimal chromosome, in which case the best individual
is taken as the optimal solution.
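The order-of-cities encoding can be sketched as follows. The 4-city distance matrix is hypothetical example data, and the swap mutation stands in for the permutation-safe operators a TSP GA needs (ordinary bit-flip mutation would produce invalid tours).

```python
import random

# Hypothetical symmetric distance matrix for 4 cities (illustrative data).
DIST = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]

def tour_length(tour):
    """Total distance of a closed tour; a chromosome is an order of cities."""
    return sum(DIST[tour[i]][tour[(i + 1) % len(tour)]]
               for i in range(len(tour)))

def swap_mutation(tour):
    """Permutation-safe mutation: swap two cities, so the tour stays
    a valid ordering that visits every city exactly once."""
    i, j = random.sample(range(len(tour)), 2)
    t = tour[:]
    t[i], t[j] = t[j], t[i]
    return t

random.seed(3)
# Stand-in for the GA loop: keep the shortest of 50 random tours.
best = min((random.sample(range(4), 4) for _ in range(50)), key=tour_length)
```

Minimizing `tour_length` is the fitness criterion of steps 2-4 above; a full GA would evolve the population with selection, permutation-safe crossover, and `swap_mutation` instead of pure random sampling.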
Answer
1. A genetic algorithm is usually said to converge when there is no
significant improvement in the values of fitness of the population from
one generation to the next.
2. One criterion for convergence may be that when a fixed percentage of
the columns and rows in the population matrix becomes the same, it
can be assumed that convergence is attained. The fixed percentage
may be 80 % or 85 %.
3. In genetic algorithms, as we proceed with more generations, there may
not be much improvement in the population fitness, and the best
individual may not change for subsequent populations.
4. As the generation progresses, the population gets filled with more fit
individuals, with only slight deviation from the fitness of the best
individual found so far, and the average fitness comes very close to
the fitness of the best individuals.
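The fixed-percentage criterion above can be sketched as a small check; `has_converged` and the 90 % threshold are illustrative names and values.

```python
def has_converged(fitnesses, threshold=0.9):
    """Declare convergence when a fixed fraction of the population
    shares the current best fitness value (e.g. 80-90 %)."""
    best = max(fitnesses)
    share = sum(1 for f in fitnesses if f == best) / len(fitnesses)
    return share >= threshold

print(has_converged([5, 5, 5, 5, 5, 5, 5, 5, 5, 3]))  # True: 9 of 10 share the best
```

A GA driver would call this once per generation and stop the cycle as soon as it returns `True`, taking the best individual as the optimal solution.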
8. One can visualize the GA's search for the optimal strings as a
simultaneous competition among schemata to increase the number of
their instances in the population.