AI Combined Update
Java or C++). Test it on the attached data set and print the decision tree and
the performance. The CSV file contains the data and the txt file contains a
description of the data.
SOURCE CODE: -
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def importdata():
    # Load the attached CSV data set ('data.csv' is a placeholder file name)
    return pd.read_csv('data.csv')

def splitdataset(data):
    # Separate features and target, then split 70/30
    X = data.values[:, :-1]
    Y = data.values[:, -1]
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)
    return X, Y, X_train, X_test, y_train, y_test

def train_using_gini(X_train, y_train):
    clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100)
    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini

def train_using_entropy(X_train, y_train):
    clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100)
    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy

def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))
    print("Accuracy : ", accuracy_score(y_test, y_pred) * 100)
    print("Report : ", classification_report(y_test, y_pred))

# Driver code
def main():
    # Building Phase
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, y_train)
    clf_entropy = train_using_entropy(X_train, y_train)
    # Operational Phase
    print("Results Using Gini Index:")
    cal_accuracy(y_test, clf_gini.predict(X_test))
    print("Results Using Entropy:")
    cal_accuracy(y_test, clf_entropy.predict(X_test))

if __name__ == "__main__":
    main()
OUTPUT:-
Q.2) Please develop an MLP for the 150 data points of the Iris dataset using 5-fold cross-validation. The
MLP consists of 1 hidden layer. Plot the error chart and final accuracy.
Ans:-
INPUT CODE:-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier
iris = load_iris()
X = iris.data
y = iris.target
# print(X,y)
kf = KFold(n_splits=5)
clf = MLPClassifier(solver='adam', hidden_layer_sizes=(100,), random_state=1, max_iter=800)
scores = []
for train_indices, test_indices in kf.split(X):
    clf.fit(X[train_indices], y[train_indices])
    scores.append(clf.score(X[test_indices], y[test_indices]))
accuracy = np.mean(scores)
print(accuracy)
plt.plot(range(1, 6), scores)
plt.xlabel('FOLDS')
plt.ylabel('ACCURACY')
plt.title('ACCURACY PER FOLD')
plt.show()
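The question also asks for an error chart; a small variant that reuses the scores list computed above and plots the per-fold misclassification error (1 - accuracy) instead:

errors = [1 - s for s in scores]
plt.plot(range(1, 6), errors, marker='o')
plt.xlabel('FOLD')
plt.ylabel('ERROR (1 - accuracy)')
plt.title('ERROR CHART')
plt.show()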
OUTPUT:-
How the KNN algorithm works
Suppose we have the height, weight and T-shirt size of some customers, and we
need to predict the T-shirt size of a new customer given only their height and
weight. The data, including height, weight and T-shirt size, is shown below:
Height (cm)  Weight (kg)  T-shirt size
158          59           M
158          63           M
160          59           M
160          60           M
163          60           M
163          61           M
160          64           L
163          64           L
165          61           L
165          62           L
165          65           L
168          62           L
168          63           L
168          66           L
170          63           L
170          64           L
170          68           L
Distance Functions
The idea behind using a distance measure is to find the distance (similarity) between
the new sample and the training cases, and then find the k closest customers to the
new customer in terms of height and weight.
There are many distance functions, but Euclidean distance is the most commonly used
measure. It is mainly used when the data is continuous. Manhattan distance is
also very common for continuous variables.
For example, for a new customer with height 161 cm and weight 61 kg, the Euclidean
distance to a training case with height 158 and weight 58 is

distance = SQRT((161-158)^2 + (61-58)^2) ≈ 4.24

Similarly, we calculate the distance of the new case to all the training cases and
rank them by distance. The smallest distance value is ranked 1 and considered
the nearest neighbor.
In the graph below, the binary dependent variable (T-shirt size) is displayed in blue
and orange: 'Medium T-shirt size' is in blue and 'Large T-shirt size' in orange.
The new customer's information is shown as a yellow circle. Four blue highlighted
data points and one orange highlighted data point are closest to the yellow circle,
so the prediction for the new case follows the majority of its neighbors,
which is Medium T-shirt size.
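A minimal sketch of this procedure with scikit-learn, using the table above as training data and the new customer (161 cm, 61 kg) from the worked distance; k = 5 matches the five highlighted neighbors in the graph:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Training data from the table above: [height (cm), weight (kg)]
X = np.array([[158, 59], [158, 63], [160, 59], [160, 60], [163, 60],
              [163, 61], [160, 64], [163, 64], [165, 61], [165, 62],
              [165, 65], [168, 62], [168, 63], [168, 66], [170, 63],
              [170, 64], [170, 68]])
y = ['M', 'M', 'M', 'M', 'M', 'M',
     'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L']

# k = 5 nearest neighbours by Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
print(knn.predict([[161, 61]]))  # new customer: 161 cm, 61 kg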
Term Frequency: the term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.

Inverse Document Frequency: estimates the rarity of a term in the whole document collection. (If a term occurs in all the documents of the collection, its IDF is zero.)

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

with |D|: cardinality of D, or the total number of documents in the corpus, and |{j : t_i ∈ d_j}|: the number of
documents where the term t_i appears (viz. the document frequency), that is, n_{i,j} ≠ 0. If the term is not in the
corpus, this will lead to a division by zero. It is therefore common to use 1 + |{j : t_i ∈ d_j}|.

Example: with |D| = 1,000,000, idf_t = log10( 1,000,000 / df_t ).
Tf-idf: the tf-idf weight of a term is the product of its tf weight and its idf weight.

Normalized tf: the tf count is usually normalized to prevent a bias towards longer documents (which may have
a higher term count regardless of the actual importance of that term in the document), to give a measure of the
importance of the term t_i within the particular document d_j:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} is the number of occurrences of the considered term (t_i) in document d_j, and the denominator is the
sum of the number of occurrences of all terms in document d_j, that is, the size of the document |d_j|.

Alternative: normalize by the most frequent term in the document, tf_{t,d} / max_t tf_{t,d}.
Exercise
Given a document with the terms A, B and C with the following frequencies A: 3, B: 2, C: 1
The document belongs to a collection of 10,000 docs. The document frequencies are: A: 50, B:1300, C:250.
Compute the normalized tf and the tf-idf and compare them. You could also check the effects of using normalized
tf measures. The idf values are computed below:
A idf = log(10000/50) = 5.3;
B idf = log(10000/1300) = 2.0;
C idf = log(10000/250) = 3.7
Results:
A tf = 3/3; idf = log(10000/50) = 5.3; tf-idf=5.3
B tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf=1.3
C tf = 1/3; idf = log(10000/250) = 3.7; tf-idf=1.2
Recall: the logarithm of a number y with respect to base b is the exponent to which b has to be raised in order
to yield y. In other words, the logarithm of y to base b is the solution x of the equation

b^x = y
Compute the tf-idf weights for the terms in the tables for each document.

Tf normalization: take the values in the tf table above and replace them with normalized tf weights. Compute
the tf-idf again and compare the results.
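The exercise above can be checked in a few lines of Python (natural log and the max-tf normalization described above; small differences from the worked results are due to rounding):

import math

N = 10000                            # documents in the collection
tf = {'A': 3, 'B': 2, 'C': 1}        # raw term frequencies in the document
df = {'A': 50, 'B': 1300, 'C': 250}  # document frequencies

max_tf = max(tf.values())
for term in tf:
    ntf = tf[term] / max_tf          # normalized tf
    idf = math.log(N / df[term])     # idf = log(N / df)
    print(term, round(ntf, 2), round(idf, 1), round(ntf * idf, 1))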
2 Similarity Measures
• Jaccard

jaccard(A, B) = |A ∩ B| / |A ∪ B|

• Cosine Similarity

cos(x, y) = (x · y) / (|x| |y|) = Σ_{i=1}^{n} x_i y_i / ( √(Σ_{i=1}^{n} x_i²) · √(Σ_{i=1}^{n} y_i²) )

with |x| = √(Σ_{i=1}^{n} x_i²)
Exercise 3: Given the tf for terms in the three novels "Sense and Sensibility" (SaS) and "Pride and Prejudice"
(PaP) by Austen, and "Wuthering Heights" (WH), with weighted terms,
compute the cosine similarity and the Jaccard measure between SaS-PaP and SaS-WH.
Results: cos-sim(SaS,PaP) = 0.999 vs. cos-sim(SaS,WH)=0.888.
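Both measures are easy to compute directly; in the sketch below the term vectors and term sets are placeholders, since the novels' weighted term table is not reproduced here:

import numpy as np

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Placeholder values; substitute the weighted term counts of the novels.
sas = np.array([1.0, 0.8, 0.3])
pap = np.array([0.9, 0.7, 0.0])
print(cosine(sas, pap))
print(jaccard({'affection', 'jealous'}, {'affection', 'wuthering'}))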
3 Evaluation Measures
Error Rate: the complement of accuracy; the percentage of documents wrongly classified by the system.
Precision: percentage of relevant documents correctly retrieved by the system (TP) with respect to all documents
retrieved by the system (TP + FP). (How many of the retrieved books are relevant?)
Recall: percentage of relevant documents correctly retrieved by the system (TP) with respect to all documents
relevant for the human (TP + FN). (How many of the relevant books have been retrieved?)
F-Measure: combines Precision (P) and Recall (R) in a single measure, giving a global estimation of the
performance of an IR system.
              Relevant              Not Relevant
Retrieved     True Positive (TP)   False Positive (FP)
Not retrieved False Negative (FN)  True Negative (TN)

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Error Rate = (FP + FN) / (TP + TN + FP + FN)
Precision: P = TP / (TP + FP)
Recall: R = TP / (TP + FN)
F-Measure: F = 2PR / (R + P)
Exercise: An IR system returns eight relevant documents and ten non-relevant documents. There are a total
of twenty relevant documents in the collection. What is the precision of the system on this search, and what is
its recall? Calculate the above measures for the following IR systems:

              Relevant   Not Relevant
Not retrieved 25         10
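A quick numeric check of the first question (all numbers come from the exercise statement):

TP, FP = 8, 10               # retrieved: relevant / not relevant
FN = 20 - TP                 # relevant documents that were never retrieved

precision = TP / (TP + FP)   # 8/18 ≈ 0.44
recall = TP / (TP + FN)      # 8/20 = 0.40
f_measure = 2 * precision * recall / (precision + recall)
print(precision, recall, f_measure)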
ECE595 / STAT598: Machine Learning I
Lecture 18 Multi-Layer Perceptron
Spring 2020
Stanley Chan
© Stanley Chan 2020. All Rights Reserved.
Outline
Discriminative Approaches
Lecture 16 Perceptron 1: Definition and Basic Concepts
Lecture 17 Perceptron 2: Algorithm and Property
Lecture 18 Multi-Layer Perceptron: Back Propagation
Input neurons x
Weights w
Predicted label = σ(w^T x + w_0).
Multi-Layer Network
https://towardsdatascience.com/
multi-layer-neural-networks-with-sigmoid-function-deep-learning-for-rookies-2-bf464f09eb7f
Understanding the Weights
Single-layer:
h = w^T x
Hidden-layer:
h = W^T x
Two Hidden Layers:
h = W_2^T W_1^T x
Three Hidden Layers:
h = W_3^T W_2^T W_1^T x
N Hidden Layers:
h = W_N^T · · · W_2^T W_1^T x
Interpreting the Hidden Layer
The first-layer weights are learned by gradient descent:

W_1^(t+1) = W_1^t − α ∇J(W_1^t)

where the training loss is

J(W_1, W_2) = || σ(W_2^T σ(W_1^T x)) − y ||²,   with a_2 = σ(W_2^T σ(W_1^T x)).
Let us go backward:
∂J/∂W_2 = (∂J/∂a_2) · (∂a_2/∂W_2)

Now, what is a_2?

a_2 = σ(W_2^T σ(W_1^T x)),   where z_2 = W_2^T σ(W_1^T x).

So let us compute:

∂a_2/∂W_2 = (∂a_2/∂z_2) · (∂z_2/∂W_2).
Let us See an Example
J(W_1, W_2) = || σ(W_2^T σ(W_1^T x)) − y ||²,   with a_1 = σ(W_1^T x).

∂a_2/∂W_1 = (∂a_2/∂a_1) · (∂a_1/∂W_1)
          = (∂a_2/∂a_1) · (∂a_1/∂z_1) · (∂z_1/∂W_1)
So it is just a very long sequence of chain rule.
Notations for Back Propagation
The following notations are based on Nielsen’s online book.
The purpose of doing these is to write down a concise algorithm.
Weights:
Loss
Error Term
The error is defined as

δ_j^l = ∂C/∂z_j^l

First term: ∂C/∂a_j^L is the rate of change of the cost w.r.t. the output activation a_j^L; for the quadratic cost,

∂C/∂a_j^L = (a_j^L − y_j)

Matrix-vector form: δ^L = ∇_a C ⊙ σ′(z^L), where ⊙ is the element-wise (Hadamard) product.   (BP-1)
4 Fundamental Equations for Back Propagation
BP Equation 2: An equation for the error δ ` in terms of the error in the
next layer, δ `+1
δ^l = ((w^(l+1))^T δ^(l+1)) ⊙ σ′(z^l).   (BP-2)
(BP-1) and (BP-2) can help you determine error at any layer.
4 Fundamental Equations for Back Propagation
Equation 3: An equation for the rate of change of the cost with respect
to any bias in the network.
∂C/∂b_j^l = δ_j^l.   (BP-3)
Equation 4: An equation for the rate of change of the cost with respect
to any weight in the network.
∂C/∂w_jk^l = a_k^(l−1) δ_j^l   (BP-4)
Step 2: Feed Forward Step
Let us take a closer look at Step 2
The feed forward step computes the intermediate variables and the
activations
z^l = (w^l)^T a^(l−1) + b^l
a^l = σ(z^l).
Step 3: Output Error
δ^L = ∇_a C ⊙ σ′(z^L)
Step 4: Back Propagate the Error
Summary of Back Propagation
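As a summary, here is a compact NumPy sketch of one backward pass implementing (BP-1)-(BP-4) for a single hidden layer, in Nielsen's convention z^l = w^l a^(l-1) + b^l; the layer sizes and data are arbitrary placeholders:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# Placeholder sizes: 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((2, 4)), np.zeros((2, 1))
x, y = rng.standard_normal((3, 1)), np.array([[1.0], [0.0]])

# Step 2 (feed forward): z^l = w^l a^(l-1) + b^l, a^l = sigma(z^l)
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# (BP-1) output error: delta^L = grad_a C (.) sigma'(z^L), with grad_a C = a^L - y
delta2 = (a2 - y) * sigmoid_prime(z2)
# (BP-2) propagate the error one layer back through W2
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)
# (BP-3) bias gradients equal the error terms
db1, db2 = delta1, delta2
# (BP-4) weight gradients: a_k^(l-1) * delta_j^l
dW2, dW1 = delta2 @ a1.T, delta1 @ x.T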
Once upon a time a fire broke out in a hotel, where just then a sci-
entific conference was held. It was night and all guests were sound
asleep. As it happened, the conference was attended by researchers
from a variety of disciplines. The first to be awakened by the smoke
was a mathematician. His first reaction was to run immediately to
the bathroom, where, seeing that there was still water running from
the tap, he exclaimed: “There is a solution!”. At the same time, how-
ever, the physicist went to see the fire, took a good look and went back
to his room to get an amount of water, which would be just suffi-
cient to extinguish the fire. The electronic engineer was not so choosy
and started to throw buckets and buckets of water on the fire. Finally,
when the biologist awoke, he said to himself: “The fittest will survive”
and went back to sleep.
2.1 Algorithm.

t := 0;
Compute initial population B0 = (b1,0 , . . . , bm,0 );
WHILE stopping condition not fulfilled DO
BEGIN
    FOR i := 1 TO m DO
        select an individual bi,t+1 from Bt ;
    FOR i := 1 TO m − 1 STEP 2 DO
        IF Random[0, 1] ≤ pC THEN
            cross bi,t+1 with bi+1,t+1 ;
    FOR i := 1 TO m DO
        eventually mutate bi,t+1 ;
    t := t + 1
END
2.1.1 Selection
Selection is the component which guides the algorithm to the solution by
preferring individuals with high fitness over low-fitted ones. It can be a
deterministic operation, but in most implementations it has random com-
ponents.
One variant, which is very popular nowadays (we will give a theo-
retical explanation of its good properties later), is the following scheme,
P[b_{j,t} is selected] = f(b_{j,t}) / Σ_{k=1}^{m} f(b_{k,t})   (2.1)
Of course, this formula only makes sense if all the fitness values are pos-
itive. If this is not the case, a non-decreasing transformation ϕ : R → R+
must be applied (a shift in the simplest case). Then the probabilities can
be expressed as
P[b_{j,t} is selected] = ϕ(f(b_{j,t})) / Σ_{k=1}^{m} ϕ(f(b_{k,t}))   (2.2)
2.2 Algorithm.
x := Random[0, 1];
i := 1
WHILE i < m & x < Σ_{j=1}^{i} f(b_{j,t}) / Σ_{j=1}^{m} f(b_{j,t}) DO
i := i + 1;
select bi,t ;
2.1.2 Crossover
In sexual reproduction, as it appears in the real world, the genetic material
of the two parents is mixed when the gametes of the parents merge. Usu-
ally, chromosomes are randomly split and merged, with the consequence
that some genes of a child come from one parent while others come from
the other parent.
2.3 Algorithm.
pos := Random{1, . . . , n − 1};
FOR i := 1 TO pos DO
BEGIN
Child1 [i] := Parent1 [i];
Child2 [i] := Parent2 [i]
END
FOR i := pos + 1 TO n DO
BEGIN
Child1 [i] := Parent2 [i];
Child2 [i] := Parent1 [i]
END
N -point crossover: Instead of only one, N breaking points are chosen ran-
domly. Every second section is swapped. Among this class, two-
point crossover is particularly important
2.1.3 Mutation
2.4 Algorithm.
FOR i := 1 TO n DO
IF Random[0, 1] < pM THEN
invert s[i];
Bitwise inversion: The whole string is inverted bit by bit with prob. pM .
2.1.4 Summary
2.5 Algorithm.
t := 0;
Create initial population B0 = (b1,0 , . . . , bm,0 );
WHILE stopping condition not fulfilled DO
BEGIN
(∗ proportional selection ∗)
FOR i := 1 TO m DO
BEGIN
x := Random[0, 1];
k := 1;
WHILE k < m & x < kj=1 f (bj,t )/ m
P P
j=1 f (bj,t ) DO
k := k + 1;
bi,t+1 := bk,t
END
(∗ one-point crossover ∗)
FOR i := 1 TO m − 1 STEP 2 DO
BEGIN
IF Random[0, 1] ≤ pC THEN
BEGIN
pos := Random{1, . . . , n − 1};
FOR k := pos + 1 TO n DO
BEGIN
aux := bi,t+1 [k];
bi,t+1 [k] := bi+1,t+1 [k];
bi+1,t+1 [k] := aux
END
END
END
(∗ mutation ∗)
FOR i := 1 TO m DO
FOR k := 1 TO n DO
IF Random[0, 1] < pM THEN
invert bi,t+1 [k];
t := t + 1
END
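Algorithm 2.5 translates almost line by line into Python. The sketch below is a plain re-implementation; fitness is assumed to be a positive function of a bit list, as proportional selection requires:

import random

def run_ga(fitness, n, m=4, pC=1.0, pM=0.001, generations=50):
    # initial population: m random binary strings of length n
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]
    for _ in range(generations):
        # proportional (roulette wheel) selection
        total = sum(fitness(b) for b in pop)
        weights = [fitness(b) / total for b in pop]
        pop = [list(b) for b in random.choices(pop, weights=weights, k=m)]
        # one-point crossover on consecutive pairs
        for i in range(0, m - 1, 2):
            if random.random() <= pC:
                pos = random.randint(1, n - 1)
                pop[i][pos:], pop[i + 1][pos:] = pop[i + 1][pos:], pop[i][pos:]
        # bitwise mutation with probability pM per bit
        for b in pop:
            for k in range(n):
                if random.random() < pM:
                    b[k] = 1 - b[k]
    return max(pop, key=fitness)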
2.2 Examples
Like in [22], let us assume that we use Algorithm 2.5 as it is, with a
population size of m = 4, a crossover probability pC = 1 and a mutation
probability of pM = 0.001. If we compute the initial generation randomly
with uniform distribution over {0, 1}^5, we obtain the following in the first
step:
One can compute easily that the sum of fitness values is 1170, where the
average is 293 and the maximum is 576. We see from the last column in
which way proportional selection favors high-fitted individuals (such as
no. 2) over low-fitted ones (such as no. 3).
A random experiment could, for instance, give the result that individu-
als no. 1 and no. 4 are selected for the new generation, while no. 3 dies and
no. 2 is selected twice, and we obtain the second generation as follows:
[Figure 2.3: plot of f2 on [−1, 1]]

f2 : [−1, 1] → R
x ↦ 1 + e^(−x²) · cos(36x).
As one can see easily from the plot in Figure 2.3, the function has a global
maximum in 0 and a lot of local maxima.
First of all, in order to work with binary strings, we have to discretize
the search space [−1, 1]. A common technique for doing so is to make a
26 2. A S IMPLE C LASS OF GA S
uniform grid of 2n points, then to enumerate the grid points, and to use
the binary representation of the point index as coding. In the general form
(for an arbitrary interval [a, b]), this looks as follows:
c_{n,[a,b]} : [a, b] → {0, 1}^n
x ↦ bin_n( round( (2^n − 1) · (x − a)/(b − a) ) )   (2.3)
where bin_n is the function which converts a number from {0, . . . , 2^n − 1} to
its binary representation of length n. This operation is not bijective since
information is lost due to the rounding operation. Obviously, the corresponding
decoding function can be defined as

c̃_{n,[a,b]} : {0, 1}^n → [a, b]
s ↦ a + bin_n^(−1)(s) · (b − a)/(2^n − 1)   (2.4)
It is left as an exercise to show that the decoding function c̃n,[a,b] is injective
and that the equality (1.2) holds for the pair (cn,[a,b] , c̃n,[a,b] ).
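The pair (2.3)/(2.4) in Python, handy for checking the numerical examples below:

def encode(x, a, b, n):
    # c_{n,[a,b]}: map x in [a, b] to an n-bit string
    k = round((2**n - 1) * (x - a) / (b - a))
    return format(k, '0{}b'.format(n))

def decode(s, a, b, n):
    # decoding function: n-bit string back to a point in [a, b]
    return a + int(s, 2) * (b - a) / (2**n - 1)

print(decode('0111111111111111', -1, 1, 16))  # approx. -0.0000153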
Applying the above coding scheme to the interval [−1, 1] with n = 16,
we get a maximum accuracy of the solution of

(1/2) · 2/(2^16 − 1) ≈ 1.52 · 10^(−5).
Now let us apply Algorithm 2.5 with m = 6, pC = 1, and pM = 0.005. The
first and the last generation are given as follows:
Generation 1 max. fitness 1.9836 at -0.0050
#0 0111111101010001 fitness: 1.98
#1 1101111100101011 fitness: 0.96
#2 0111111101011011 fitness: 1.98
#3 1001011000011110 fitness: 1.97
#4 1001101100101011 fitness: 1.20
#5 1100111110011110 fitness: 0.37
Average Fitness: 1.41
...
We see that the algorithm arrives at the global maximum after 52 generations,
i.e. it needs at most 52 × 6 = 312 evaluations of the fitness
function, while the total size of the search space is 2^16 = 65536. We can
draw the conclusion—at least for this example—that the GA is definitely
better than a pure random search or an exhaustive method which stupidly
scans the whole search space.
Just in order to get more insight into the coding/decoding scheme, let
us take the best string 0111111111111111. Its representation as integer
number is 32767. Computing the decoding function yields
−1 + 32767 · (1 − (−1))/65535 = −1 + 0.9999847 = −0.0000153.
[Figure 2.4: surface plot of f3 on [−10, 10]²]

f3 : [−10, 10]² → R
(x, y) ↦ (1 − sin²(√(x² + y²))) / (1 + 0.001 · (x² + y²)).
As one can see easily from the plot in Figure 2.4, the function has a global
maximum in 0 and a lot of local maxima.
Let us use the coding/decoding scheme as shown in (2.3) and (2.4) for
the two components x and y independently with n = 24, i.e. c_{24,[−10,10]} and c̃_{24,[−10,10]}.
Again, we learn from this example that the GA is here for sure much
faster than an exhaustive algorithm or a pure random search.
[Figure: plots of the function f4 on [−2, 2]]
pM = 0.005. The result was that the global maximum at x = 0 was found
after 9 generations (i.e. at most 90 evaluations of the fitness function) and
5 milliseconds computation time, respectively (on the same computer as
above).
In order to repeat the above comparison, BFGS with line search and
random selection of the initial value was applied to f4 as well. The global
optimum was found after 30 trials (initial values) with perfect accuracy,
but 9 milliseconds of computation time.
2.2.5 Discussion
Finally, let us summarize some conclusions about the four examples
above:
Algorithm 2.5 is very universal. More or less, the same algorithm has
been applied to four fundamentally different optimization tasks.
As seen in 2.2.4, GAs can even be faster in finding global maxima than
conventional methods, in particular when derivatives provide misleading
information. We should not forget, however, that, in most cases where
conventional methods can be applied, GAs are much slower because they
do not take auxiliary information like derivatives into account. In these
optimization problems, there is no need to apply a GA which gives less
accurate solutions after much longer computation time. The enormous
potential of GAs lies elsewhere—in optimization of non-differentiable or
even discontinuous functions, discrete optimization, and program induc-
tion.
A Comprehensive Tutorial to learn Convolutional Neural
Networks from Scratch (deeplearning.ai Course #4)
Introduction
If you had to pick one deep learning technique for computer vision from the plethora of options out there,
which one would you go for? For a lot of folks, including myself, convolutional neural network is the default
answer.
But what is a convolutional neural network and why has it suddenly become so popular? Well, that’s what
we’ll find out in this article! CNNs have become the go-to method for solving any image data challenge.
Their use is being extended to video analytics as well but we’ll keep the scope to image processing for
now. Any data that has spatial relationships is ripe for applying CNN – let’s just keep that in mind for now.
In the previous articles in this series, we learned the key to deep learning – understanding how neural
networks work. We saw how using deep neural networks on very large images increases the computation
and memory cost. To combat this obstacle, we will see how convolutions and convolutional neural
networks help us to bring down these factors and generate better results.
So welcome to part 3 of our deeplearning.ai course series (deep learning specialization) taught by the
great Andrew Ng. In addition to exploring how a convolutional neural network (ConvNet) works, we’ll also
look at different architectures of a ConvNet and how we can build an object detection model using YOLO.
Finally, we’ll tie our learnings together to understand where we can apply these concepts in real-life
applications (like facial recognition and neural style transfer).
I highly recommend going through the first two parts before diving into this guide:
An Introductory Guide to Deep Learning and Neural Networks (Notes from deeplearning.ai Course #1)
Improving Neural Networks – Hyperparameter Tuning, Regularization, and More (deeplearning.ai Course
#2)
Table of Contents
1. Course Structure
2. Course #4: Convolutional Neural Networks
1. Module 1: Foundations of Convolutional Neural Networks
2. Module 2: Deep Convolutional Models: Case Studies
1. Case Studies
2. Practical Advice for using ConvNets
3. Module 3: Object Detection
4. Module 4: Special Applications: Face Recognition & Neural Style Transfer
Course Structure
The previous articles of this series covered the basics of deep learning and neural networks. We also
learned how to improve the performance of a deep neural network using techniques like hyperparameter
tuning, regularization and optimization.
So where to next? Let’s turn our focus to the concept of Convolutional Neural Networks. Course #4 of the
deep learning specialization is divided into 4 modules:
1. In module 1, we will understand the convolution and pooling operations and will also look at a simple
Convolutional Network example
2. In module 2, we will look at some practical tricks and methods used in deep CNNs through the lens of
multiple case studies. We will also learn a few practical concepts like transfer learning, data
augmentation, etc.
3. Module 3 will cover the concept of object detection. We will discuss the popular YOLO algorithm and
different techniques used in YOLO for object detection
4. Finally, in module 4, we will briefly discuss how face recognition and neural style transfer work. This is
a very interesting module so keep your learning hats on till the end
Computer Vision
Some of the computer vision problems which we will be solving in this article are:
1. Image classification
2. Object detection
3. Neural style transfer
One major problem with computer vision problems is that the input data can get really big. Suppose an
image is of the size 64 X 64 X 3. The input feature dimension then becomes 12,288 (= 64 × 64 × 3). This will be even
bigger if we have larger images (say, of size 720 X 720 X 3). Now, if we pass such a big input to a neural
network, the number of parameters will swell up to a HUGE number (depending on the number of hidden
layers and hidden units). This will result in more computational and memory requirements – not something
most of us can deal with.
In the previous article, we saw that the early layers of a neural network detect edges from an image. Deeper
layers might be able to detect parts of objects, and even deeper layers might detect complete
objects (like a person's face).
In this section, we will focus on how the edges can be detected from an image. Suppose we are given the
below image:
As you can see, there are many vertical and horizontal edges in the image. The first thing to do is to detect
these edges:
But how do we detect these edges? To illustrate this, let’s take a 6 X 6 grayscale image (i.e. only one
channel):
After the convolution, we will get a 4 X 4 image. The first element of the 4 X 4 matrix will be calculated as:
So, we take the first 3 X 3 matrix from the 6 X 6 image and multiply it with the filter. Now, the first element
of the 4 X 4 output will be the sum of the element-wise product of these values, i.e. 3*1 + 0*0 + 1*(-1) + 1*1 +
5*0 + 8*(-1) + 2*1 + 7*0 + 2*(-1) = -5. To calculate the second element of the 4 X 4 output, we will shift our
filter one step towards the right and again get the sum of the element-wise product:
Similarly, we will convolve over the entire image and get a 4 X 4 output:
So, convolving a 6 X 6 input with a 3 X 3 filter gave us an output of 4 X 4. Consider one more example:
Note: Higher pixel values represent the brighter portion of the image and the lower pixel values represent
the darker portions. This is how we can detect a vertical edge in an image.
The type of filter that we choose helps to detect the vertical or horizontal edges. We can use the following
filters to detect different edges:
The Sobel filter puts a little bit more weight on the central pixels. Instead of using these filters, we can
create our own as well and treat them as a parameter which the model will learn using backpropagation.
Padding
We have seen that convolving an input of 6 X 6 dimension with a 3 X 3 filter results in 4 X 4 output. We can
generalize it and say that if the input is n X n and the filter size is f X f, then the output size will be (n-f+1) X
(n-f+1):
Input: n X n
Filter size: f X f
Output: (n-f+1) X (n-f+1)
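These dimensions are easy to check with a short NumPy sketch (like the course, it slides the filter without flipping it, i.e. a cross-correlation):

import numpy as np

def conv2d_valid(image, kernel):
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # sum of the element-wise product of the f x f patch and the filter
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

img = np.random.rand(6, 6)                     # 6 X 6 input
vertical_edge = np.array([[1, 0, -1]] * 3)     # 3 X 3 vertical edge filter
print(conv2d_valid(img, vertical_edge).shape)  # (4, 4)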
1. Every time we apply a convolutional operation, the size of the image shrinks
2. Pixels present in the corners of the image are used only a few times during convolution as
compared to the central pixels, so the corners are under-represented in the output, which can
lead to information loss
To overcome these issues, we can pad the image with an additional border, i.e., we add one pixel all around
the edges. This means that the input will be an 8 X 8 matrix (instead of a 6 X 6 matrix). Applying
convolution of 3 X 3 on it will result in a 6 X 6 matrix which is the original shape of the image. This is where
padding comes to the fore:
Input: n X n
Padding: p
Filter size: f X f
Output: (n+2p-f+1) X (n+2p-f+1)
1. Valid: It means no padding. If we are using valid padding, the output will be (n-f+1) X (n-f+1)
2. Same: Here, we apply padding so that the output size is the same as the input size, i.e.,
n+2p-f+1 = n
So, p = (f-1)/2
We now know how to use padded convolution. This way we don’t lose a lot of information and the image
does not shrink either. Next, we will look at how to implement strided convolutions.
Strided Convolutions
Suppose we choose a stride of 2. So, while convoluting through the image, we will take two steps – both in
the horizontal and vertical directions separately. The dimensions for stride s will be:
Input: n X n
Padding: p
Stride: s
Filter size: f X f
Output: [(n+2p-f)/s+1] X [(n+2p-f)/s+1]
Stride helps to reduce the size of the image, a particularly useful feature.
Input: 6 X 6 X 3
Filter: 3 X 3 X 3
The dimensions above represent the height, width and channels in the input and filter. Keep in mind that
the number of channels in the input and filter should be same. This will result in an output of 4 X 4. Let’s
understand it visually:
Since there are three channels in the input, the filter will consequently also have three channels. After
convolution, the output shape is a 4 X 4 matrix. So, the first element of the output is the sum of the
element-wise product of the first 27 values from the input (9 values from each channel) and the 27 values
from the filter. After that we convolve over the entire image.
Instead of using just a single filter, we can use multiple filters as well. How do we do that? Let’s say the
first filter will detect vertical edges and the second filter will detect horizontal edges from the image. If we
use multiple filters, the output dimension will change. So, instead of having a 4 X 4 output as in the above
example, we would have a 4 X 4 X 2 output (if we have used 2 filters):
Input: n X n X n_c
Filter: f X f X n_c
Padding: p
Stride: s
Output: [(n+2p-f)/s+1] X [(n+2p-f)/s+1] X n_c'
Here, n_c is the number of channels in the input and filter, while n_c' is the number of filters.
Once we get an output after convolving over the entire image using a filter, we add a bias term to those
outputs and finally apply an activation function to generate activations. This is one layer of a convolutional
network. Recall that the equation for one forward pass is given by:
In our case, input (6 X 6 X 3) is a[0] and the filters (3 X 3 X 3) are the weights w[1]. These activations from layer
1 act as the input for layer 2, and so on. Clearly, the number of parameters in case of convolutional neural
networks is independent of the size of the image. It essentially depends on the filter size. Suppose we
have 10 filters, each of shape 3 X 3 X 3. What will be the number of parameters in that layer? Let’s try to
solve this:
No matter how big the image is, the parameters only depend on the filter size. Awesome, isn’t it? Let’s have
a look at the summary of notations for a convolution layer:
p[l] = padding
s[l] = stride
Let’s combine all the concepts we have learned so far and look at a convolutional network example.
There are a number of hyperparameters that we can tweak while building a convolutional network. These
include the number of filters, size of filters, stride to be used, padding, etc. We will look at each of these in
detail later in this article. Just keep in mind that as we go deeper into the network, the size of the image
shrinks whereas the number of channels usually increases.
1. Convolution layer
2. Pooling layer
3. Fully connected layer
Pooling Layers
Pooling layers are generally used to reduce the size of the inputs and hence speed up the computation.
Consider a 4 X 4 matrix as shown below:
For every consecutive 2 X 2 block, we take the max number. Here, we have applied a filter of size 2 and a
stride of 2. These are the hyperparameters for the pooling layer. Apart from max pooling, we can also apply
average pooling where, instead of taking the max of the numbers, we take their average. In summary, the
hyperparameters for a pooling layer are:
1. Filter size
2. Stride
3. Max or average pooling
If the input of the pooling layer is n_h X n_w X n_c, then the output will be [{(n_h - f) / s + 1} X {(n_w - f) / s + 1}
X n_c].
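A matching sketch for max pooling with filter size f and stride s (applied per channel):

import numpy as np

def max_pool(x, f=2, s=2):
    h, w = x.shape
    out = np.zeros(((h - f) // s + 1, (w - f) // s + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool(x))  # 2 X 2 output: the max of each 2 X 2 block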
CNN Example
We’ll take things up a notch now. Let’s look at how a convolution neural network with convolutional and
pooling layer works. Suppose we have an input of shape 32 X 32 X 3:
There are a combination of convolution and pooling layers at the beginning, a few fully connected layers at
the end and finally a softmax classifier to classify the input into various categories. There are a lot of
hyperparameters in this network which we have to specify as well.
Generally, we take the set of hyperparameters which have been used in proven research and they end up
doing well. As seen in the above example, the height and width of the input shrinks as we go deeper into
the network (from 32 X 32 to 5 X 5) and the number of channels increases (from 3 to 10).
All of these concepts and techniques bring up a very fundamental question – why convolutions? Why not
something else?
Why Convolutions?
There are primarily two major advantages of using convolutional layers over using just fully connected
layers:
1. Parameter sharing
2. Sparsity of connections
If we had used just a fully connected layer, the number of parameters would be =
32*32*3*28*28*6, which is nearly equal to 14 million! Makes no sense, right?
If we see the number of parameters in case of a convolutional layer, it will be = (5*5 + 1) * 6 (if there are 6
filters), which is equal to 156. Convolutional layers reduce the number of parameters and speed up the
training of the model significantly.
In convolutions, we share the parameters while convolving through the input. The intuition behind this is
that a feature detector, which is helpful in one part of the image, is probably also useful in another part of
the image. So a single filter is convolved over the entire input and hence the parameters are shared.
The second advantage of convolution is the sparsity of connections. For each layer, each output value
depends on a small number of inputs, instead of taking into account all the inputs.
Classic Networks
1. LeNet-5
2. AlexNet
3. VGG
We will also see how ResNet works and finally go through a case study of an inception neural network.
LeNet-5
It takes a grayscale image as input. Once we pass it through a combination of convolution and pooling
layers, the output will be passed through fully connected layers and classified into corresponding classes.
The total number of parameters in LeNet-5 are:
Parameters: 60k
Layers flow: Conv -> Pool -> Conv -> Pool -> FC -> FC -> Output
Activation functions: Sigmoid/tanh and ReLU
AlexNet
This network is similar to LeNet-5 with just more convolution and pooling layers:
Parameters: 60 million
Activation function: ReLU
VGG-16
The underlying idea behind VGG-16 was to use a much simpler network where the focus is on having
convolution layers that have 3 X 3 filters with a stride of 1 (and always using the same padding). The max
pool layer is used after each convolution layer with a filter size of 2 and a stride of 2. Let’s look at the
architecture of VGG-16:
These are three classic architectures. Next, we’ll look at more advanced architecture starting with ResNet.
ResNet
Training very deep networks can lead to problems like vanishing and exploding gradients. How do we deal
with these issues? We can use skip connections, where we take activations from one layer and feed them to
another layer that is much deeper in the network. There are residual blocks in ResNet which help in
training deeper networks.
Residual Blocks
The general flow to calculate activations from different layers can be given as:
This is how we calculate the activations a[l+2] using the activations a[l] and then a[l+1] . a[l] needs to go
through all these steps to generate a[l+2] :
In a residual network, we make a change in this path. We take the activations a[l] and pass them directly to
the second layer:
We now have an overview of how ResNet works. But why does it perform so well? Let’s find out!
In order to make a good model, we first have to make sure that its performance on the training data is
good. That’s the first test and there really is no point in moving forward if our model fails here. We have
seen earlier that training deeper networks using a plain network increases the training error after a point of
time. But while training a residual network, this isn’t the case. Even when we build a deeper residual
network, the training error generally does not increase.
a[l+2] = g(a[l] )
It is fairly easy to calculate a[l+2] knowing just the value of a[l] . As per the research paper, ResNet is given
by:
Let’s see how a 1 X 1 convolution can be helpful. Suppose we have a 28 X 28 X 192 input and we apply a 1
X 1 convolution using 32 filters. So, the output will be 28 X 28 X 32:
The basic idea of using 1 X 1 convolution is to reduce the number of channels from the image. A couple of
points to keep in mind:
We generally use a pooling layer to shrink the height and width of the image
To reduce the number of channels from an image, we convolve it using a 1 X 1 filter (hence reducing
the computation cost as well)
While designing a convolutional neural network, we have to decide the filter size. Should it be a 1 X 1 filter,
or a 3 X 3 filter, or a 5 X 5? Inception does all of that for us! Let’s see how it works.
Suppose we have a 28 X 28 X 192 input volume. Instead of choosing what filter size to use, or whether to
use convolution layer or pooling layer, inception uses all of them and stacks all the outputs:
A good question to ask here – why are we using all these filters instead of using just a single filter size,
say 5 X 5? Let’s look at how many computations would arise if we would have used only a 5 X 5 filter on our
input:
Number of multiplies = 28 * 28 * 32 * 5 * 5 * 192 = 120 million! Can you imagine how expensive performing
all of these will be?
Now, let’s look at the computations a 1 X 1 convolution and then a 5 X 5 convolution will give us:
Inception Networks
Now that we have understood how different ConvNets work, it’s important to gain a practical perspective
around all of this.
Building your own model from scratch can be a tedious and cumbersome process. Also, it is quite a task to
reproduce a research paper on your own (trust me, I am speaking from experience!). In many cases, we
also face issues like lack of data availability, etc. We can design a pretty decent model by simply following
the below tips and tricks:
1. Using Open-Source implementation: Generally, most deep learning researchers open-source their work
on platforms like GitHub. We can (and should) integrate their work into our projects. This has always
been a helpful path for me throughout my career
2. Transfer Learning: We can take a pretrained network and transfer that to a new task which we are
working on. In transfer learning, we take the complete network, remove a few layers from it, and add
custom layers on top of the remaining layers to train our model. In essence, we are extracting features
from a pretrained model and using those to classify and train our model
3. Data Augmentation: Deep learning models perform well when we have a large amount of data. There
are quite a few domains where getting enough data is a problem. In such cases, we use data
augmentation to generate training data from the available data. Some of the common augmentation
methods are:
1. Mirroring: Here we take the mirror image. The class of the image will not change in this case
2. Random Cropping
3. Rotating
4. Shearing
5. Color Shifting: We change the RGB scale of the image randomly.
With this, we come to the end of the second module. We saw some classical ConvNets, their structure and
gained valuable practical tips on how to use these networks.
To understand the challenges of Object Localization, Object Detection and Landmark Finding
Understanding and implementing non-max suppression
Understanding and implementing intersection over union
To understand how we label a dataset for an object detection application
To learn the vocabulary used in object detection (landmark, anchor, bounding box, grid, etc.)
I have covered most of the concepts in this comprehensive article. I highly recommend going through it to
learn the concepts of YOLO. For your reference, I’ll summarize how YOLO works:
It also applies Intersection over Union (IoU) and Non-Max Suppression to generate more accurate
bounding boxes and minimize the chance of the same object being detected multiple times.
In the final module of this course, we will look at some special applications of CNNs, such as face
recognition and neural style transfer.
The objective behind the final module is to discover how CNNs can be applied to multiple fields, including
art generation and facial recognition.
Face recognition is probably the most widely used application in computer vision. It seems to be
everywhere I look these days – from my own smartphone to airport lounges, it’s becoming an integral part
of our daily activities.
In this section, we will discuss various concepts of face recognition, like one-shot learning, siamese
network, and many more.
In face recognition literature, there are majorly two terminologies which are discussed the most:
1. Face verification
2. Face recognition
In face verification, we pass the image and its corresponding name or ID as the input. For a new image, we
want our model to verify whether the image is that of the claimed person. This is also called one-to-one
mapping where we just want to know if the image is of the same person.
Face recognition is where we have a database of a certain number of people with their facial images and
corresponding IDs. When our model gets a new image, it has to match the input image with all the images
available in the database and return an ID. It is a one-to-k mapping (k being the number of people) where
we compare an input image with all the k people present in the database.
One-Shot Learning
One potential obstacle we usually encounter in a face recognition task is the problem of a lack of training
data. This is where we have only a single image of a person’s face and we have to recognize new images
using that. Since deep learning isn’t exactly known for working well with one training example, you can
imagine how this presents a challenge.
One-shot learning is where we learn to recognize the person from just one example. Training a CNN to
learn the representations of a face is not a good idea when we have so few images. The model simply would
not be able to learn the features of the face. If a new user joins the database, we have to retrain the entire
network. Quite a conundrum, isn’t it? So instead of using a ConvNet, we try to learn a similarity function:
We train a neural network to learn a function that takes two images as input and outputs the degree of
difference between these two images. So, if two images are of the same person, the output will be a small
number, and vice versa. We can define a threshold and if the degree is less than that threshold, we can
safely say that the images are of the same person.
Siamese Network
We will use a Siamese network to learn the function which we defined earlier:
Suppose we have two images, x(1) and x(2), and we pass both of them to the same ConvNet. Instead of
generating the classes for these images, we extract the features by removing the final softmax layer. So,
the last layer will be a fully connected layer having, say 128 neurons:
Here, f(x(1)) and f(x(2)) are the encodings of images x(1) and x(2) respectively. So,
We train the model in such a way that if x(i) and x(j) are images of the same person, || f(x(i)) − f(x(j)) ||² will
be small, and if x(i) and x(j) are images of different people, || f(x(i)) − f(x(j)) ||² will be large. This is the
architecture of a Siamese network.
Next up, we will learn the loss function that we should use to improve a model’s performance.
Triplet Loss
In order to define a triplet loss, we take an anchor image, a positive image and a negative image. A positive
image is the image of the same person that’s present in the anchor image, while a negative image is the
image of a different person. Since we are looking at three images at the same time, it’s called a triplet loss.
We will use ‘A’ for anchor image, ‘P’ for positive image and ‘N’ for negative image.
If the model outputs zero for both || f(A) − f(P) ||² and || f(A) − f(N) ||², the above equation will be satisfied.
The model might be trained in a way such that both the terms are always 0. This will inevitably affect the
performance of the model. How do we overcome this? We need to slightly modify the above equation and
add a term α, also known as the margin:
Similarly, the cost function for a set of people can be defined as:
Our aim is to minimize this cost function in order to improve our model's performance. Apart from using
triplet loss, we can also treat face recognition as a binary classification problem.
Instead of using triplet loss to learn the parameters and recognize faces, we can solve it by translating our
problem into a binary classification one. We first use a Siamese network to compute the embeddings for
the images and then pass these embeddings to a logistic regression, where the target will be 1 if both the
embeddings are of the same person and 0 if they are of different people:
The final output of the logistic regression is:
Here, σ is the sigmoid function. Hence, we treat it as a supervised learning problem and pass different sets
of combinations. Each combination can have two images with their corresponding target being 1 if both
images are of the same person and 0 if they are of different people.
In the final section of this course, we’ll discuss a very intriguing application of computer vision, i.e., neural
style transfer.
Let’s understand the concept of neural style transfer using a simple example. Suppose we want to recreate
a given image in the style of another image. Here, the input image is called the content image, while the
image whose style we want our input to be recreated in is known as the style image:
Neural style transfer allows us to create a new image which is the content image drawn in the fashion of
the style image:
Awesome, right?! For the sake of this article, we will be denoting the content image as ‘C’, the style image
as ‘S’ and the generated image as ‘G’. In order to perform neural style transfer, we’ll need to extract features
from different layers of our ConvNet.
Before diving deeper into neural style transfer, let’s first visually understand what the deeper layers of a
ConvNet are really doing. Let’s say we’ve trained a convolution neural network on a 224 X 224 X 3 input
image:
To visualize each hidden layer of the network, we first pick a unit in layer 1, find 9 patches that maximize
the activations of that unit, and repeat it for other units. The first hidden layer looks for relatively simpler
features, such as edges, or a particular shade of color. The image compresses as we go deeper into the
network. The hidden unit of a CNN’s deeper layer looks at a larger region of the image.
This is what the shallow and deeper layers of a CNN are computing. We will use this learning to build a
neural style transfer algorithm.
Cost Function
First, let’s look at the cost function needed to build a neural style transfer algorithm. Minimizing this cost
function will help in getting a better generated image (G). Defining a cost function:
Here, the content cost function ensures that the generated image has the same content as that of the
content image, whereas the style cost function is tasked with making sure that the generated image
is in the fashion of the style image.
Below are the steps for generating the image using the content and style images:
1. We first initialize G randomly, say G: 100 X 100 X 3, or any other dimension that we want
2. We then define the cost function J(G) and use gradient descent to minimize J(G) to update G:
G = G – d/dG(J(G))
After applying gradient descent and updating G multiple times, we get something like this:
Not bad! This is the outline of a neural style transfer algorithm. It’s important to understand both the
content cost function and the style cost function in detail for maximizing our algorithm’s output.
Suppose we use the lth layer to define the content cost function of a neural style transfer algorithm.
Generally, the layer which is neither too shallow nor too deep is chosen as the lth layer for the content cost
function. We use a pretrained ConvNet and take the activations of its lth layer for both the content image
as well as the generated image and compare how similar their content is. With me so far?
Now, we compare the activations of the lth layer. For the content and generated images, these are a[l](C)
and a[l](G) respectively. If both these activations are similar, we can say that the images have similar
content. Thus, the cost function can be defined as follows:
We take the activations from the lth layer to measure the style. We define the style as the correlation
between activations across channels of that layer. Let’s say that the lth layer looks like this:
We want to know how correlated the activations are across different channels:
Here, i is the height, j is the width, and k is the channel number. We can create a correlation matrix which
provides a clear picture of the correlation between the activations from every channel of the lth layer:
where k and k' range from 1 to n_c[l]. This matrix is called a style matrix. If the activations are correlated,
G_kk' will be large, and vice versa. S denotes that this matrix is for the style image. Similarly, we can create a
style matrix for the generated image:
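The style matrix is just a Gram matrix over the channels of a layer's activations; a small NumPy sketch with placeholder activations:

import numpy as np

def style_matrix(a):
    # a: activations of shape (n_h, n_w, n_c) for one layer
    n_h, n_w, n_c = a.shape
    flat = a.reshape(n_h * n_w, n_c)  # one column per channel
    return flat.T @ flat              # G[k, k'] measures how channels k, k' co-activate

a_style = np.random.rand(14, 14, 8)   # placeholder layer activations
G = style_matrix(a_style)
print(G.shape)  # (8, 8)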
Finally, we can combine the content and style cost function to get the overall cost function:
And there you go! Quite a ride through the world of CNNs, wasn’t it?
End Notes
We have learned a lot about CNNs in this article (far more than I did in any one place!). We have seen how a
ConvNet works, the various building blocks of a ConvNet, its various architectures and how they can be
used for image recognition applications. Finally, we have also learned how YOLO can be used for detecting
objects in an image before diving into two really fascinating applications of computer vision – face
recognition and neural style transfer.
Do share your thoughts with me regarding what you learned from this article. Have you used CNNs
before? If yes, feel free to share your experience with me – it always helps to learn from each other.
Pulkit Sharma
My research interests lie in the fields of Machine Learning and Deep Learning. I possess an enthusiasm
for learning new skills and technologies.
PRACTICE QUESTIONS ON BAYES’S FORMULA AND ON
PROBABILITY
(NOT TO BE HANDED IN)
1. remarks
If you find any errors in this document, please alert me.
Remark 1. First, I’ll make a remark about question 40 from section 12.4 in the book. Let
A= event that first card is a spade and B=event that second card is a spade. As part of this
question, you computed (presumably using the total law of probability) that
P(B) = P(A)P(B | A) + P(A^c)P(B | A^c) = (13/52)·(12/51) + (39/52)·(13/51) = 1/4.
Note that in this case, of course, you already knew that

P(B) = 13/52 = 1/4,

since there are 13 spades in 52 cards; therefore the unconditional probability of B is 13/52.
The law of total probability gives you a method for computing the unconditional
(or total) probability of an event B if you know its conditional probabilities with respect to
some other event A and the probability of A. In this case, we knew directly what P (B) is
(because we had enough information- we know how many cards there are and how many
spades) and you can see how it agrees with what the total law of probability gives you.
However, in most of the other examples, such as the one with the test for a virus we
did in class, it’s not possible to compute the probability of B (in that case, that the test
is positive) directly because you don’t have enough information (we don’t know how many
tests come out positive and how many tests are being administered, i.e., we don’t know the
percentage of tests that come out positive). What we know are the conditional probabilities
of the test coming out positive with the conditions that the person taking it was infected or
not. And we know the probability of this condition happening, i.e., we know the probability
that someone is infected. So the information you have here consists of precisely the pieces
that you need in order to use the total law of probability to compute the probability that a
test comes out positive, and there’s no other way to know this probability.
Remark 2. For all the following questions, the easiest way to think about them is to draw
the tree diagram. Please do so when you try to do them, or when you read the solutions –
draw the diagram to try to follow what’s happening.
2. solutions
Exercise 1. A doctor is called to see a sick child. The doctor has prior information that
90% of sick children in that neighborhood have the flu, while the other 10% are sick with
measles. Let F stand for an event of a child being sick with flu and M stand for an event of
a child being sick with measles. Assume for simplicity that F ∪ M = Ω, i.e., that there are no
other maladies in that neighborhood.
A well-known symptom of measles is a rash (the event of having which we denote R).
Assume that the probability of having a rash if one has measles is P (R | M ) = 0.95.
However, occasionally children with flu also develop rash, and the probability of having a
rash if one has flu is P (R | F ) = 0.08.
Upon examining the child, the doctor finds a rash. What is the probability that the child
has measles?
Solution.
We use Bayes's formula:

P(M | R) = P(R | M)P(M) / (P(R | M)P(M) + P(R | F)P(F))
         = (0.95 × 0.10) / (0.95 × 0.10 + 0.08 × 0.90) ≈ 0.57,

which is nowhere close to the 95% of P(R | M).
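The same computation checked in a few lines of Python:

p_m, p_f = 0.10, 0.90        # priors: measles vs. flu
p_r_m, p_r_f = 0.95, 0.08    # P(rash | measles), P(rash | flu)

posterior = p_r_m * p_m / (p_r_m * p_m + p_r_f * p_f)
print(round(posterior, 2))   # 0.57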
Exercise 2. In a study, physicians were asked what the odds of breast cancer would be in
a woman who was initially thought to have a 1% risk of cancer but who ended up with a
positive mammogram result (a mammogram accurately classifies about 80% of cancerous
tumors and 90% of benign tumors.)
95 out of a hundred physicians estimated the probability of cancer to be about 75%. Do
you agree?
Solution.
Introduce the events: M = the woman has cancer (malignant tumor), B = the tumor is benign,
+ = the mammogram result is positive. By Bayes's formula,

P(M | +) = P(+ | M)P(M) / (P(+ | M)P(M) + P(+ | B)P(B))
         = (0.80 × 0.01) / (0.80 × 0.01 + 0.10 × 0.99)
         ≈ 0.075.

So the chance would be 7.5%. A far cry from the common estimate of 75%.
Exercise 3. Suppose we have 3 cards identical in form except that both sides of the first
card are colored red, both sides of the second card are colored black, and one side of the
third card is colored red and the other side is colored black.
The 3 cards are mixed up in a hat, and 1 card is randomly selected and put down on the
ground. If the upper side of the chosen card is colored red, what is the probability that the
other side is colored black?
Solution.
Let RR, BB, and RB denote, respectively, the events that the chosen card is the red-red,
the black-black, or the red-black card. Letting R be the event that the upturned side of the
chosen card is red, we have that the desired probability is obtained by

P(RB | R) = P(RB ∩ R) / P(R)
          = P(R | RB)P(RB) / (P(R | RR)P(RR) + P(R | RB)P(RB) + P(R | BB)P(BB))
          = ((1/2)·(1/3)) / ((1)·(1/3) + (1/2)·(1/3) + (0)·(1/3)) = 1/3
This question was actually just like the Monty Hall problem!
Exercise 4. It is estimated that 50% of emails are spam emails. Some software has been
applied to filter these spam emails before they reach your inbox. A certain brand of software
claims that it can detect 99% of spam emails, and the probability for a false positive (a
non-spam email detected as spam) is 5%.
Now if an email is detected as spam, then what is the probability that it is in fact a
non-spam email?
Solution.
Define the events:
A = event that an email is detected as spam,
B = event that an email is spam,
B^c = event that an email is not spam.
We know P(B) = P(B^c) = 0.5, P(A | B) = 0.99, P(A | B^c) = 0.05.
Hence by Bayes's formula we have

P(B^c | A) = P(A | B^c)P(B^c) / (P(A | B)P(B) + P(A | B^c)P(B^c))
           = (0.05 × 0.5) / (0.05 × 0.5 + 0.99 × 0.5)
           = 5/104 ≈ 0.048.
MATHEMATICAL LOGIC EXERCISES

1 Introduction
2 Propositional Logic
  2.1 Basic Concepts
  2.2 Truth Tables
  2.3 Propositional Formalization
    2.3.1 Formalizing Simple Sentences
    2.3.2 Formalizing Problems
  2.4 Normal Form Reduction
  2.5 Propositional Natural Deduction
4 Modal Logic
  4.1 Basic Concepts
  4.2 Satisfiability and Validity
  4.3 Modal Logic Formalization
Chapter 1
Introduction

"Mathematics is the only instructional material that can be presented in an entirely undogmatic way."
— MAX DEHN, The Mathematical Intelligencer, v. 5, no. 2, 1983
Each exercise is marked with a difficulty symbol: Trivial, Easy, Medium, Difficult, or Very difficult.
"When you have eliminated the impossible, whatever remains, however improbable, must be the truth."
— Sir Arthur Conan Doyle

Chapter 2
Propositional Logic
Exercise 2.1. -
Which of the following are well-formed propositional formulas?
1. ∨pq
2. (¬(p → (q ∧ p)))
3. (¬(p → (q = p)))
4. (¬(♦(q ∨ p)))
5. (p ∧ ¬q) ∨ (q → r)
6. p¬r
Solution.
Exercise 2.2. -
Let’s consider the interpretation v where v(p) = F, v(q) = T, v(r) = T.
Does v satisfy the following propositional formulas?
1. (p → ¬q) ∨ ¬(r ∧ q)
3. ¬(¬p → ¬q) ∧ r
4. ¬(¬p → q ∧ ¬r)
Solution.
Exercise 2.3. -
Compute the truth table of (F ∨ G) ∧ ¬(F ∧ G).
Solution.
F G F ∨G F ∧G ¬(F ∧ G) (F ∨ G) ∧ ¬(F ∧ G)
T T T T F F
T F T F T T
F T T F T T
F F F F T F
Exercise 2.4. -
Use the truth tables method to determine whether (p → q) ∨ (p → ¬q) is valid.
Solution.
p q p→q ¬q p → ¬q (p → q) ∨ (p → ¬q)
T T T F F T
T F F T T T
F T T F T T
F F T T T T
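Truth-table questions like these are easy to check mechanically; a small Python sketch for Exercise 2.4:

from itertools import product

def implies(a, b):
    return (not a) or b

# Enumerate every interpretation of p and q
for p, q in product([True, False], repeat=2):
    print(p, q, implies(p, q) or implies(p, not q))

# The formula is true under every interpretation, hence valid.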
Exercise 2.5. -
Use the truth tables method to determine whether (¬p∨q)∧(q → ¬r∧¬p)∧(p∨r)
(denoted with ϕ) is satisfiable.
Solution.
p q r ¬p ∨ q ¬r ∧ ¬p q → ¬r ∧ ¬p (p ∨ r) ϕ
T T T T F F T F
T T F T F F T F
T F T F F T T F
T F F F F T T F
F T T T F F T F
F T F T T T F F
F F T T F T T T
F F F T T T F F
There exists an interpretation satisfying ϕ, thus ϕ is satisfiable.
Exercise 2.6. -
Use the truth tables method to determine whether the formula ϕ : p∧¬q → p∧q
is a logical consequence of the formula ψ : ¬p.
Solution.
p q ¬p p ∧ ¬q p ∧ q p ∧ ¬q → p ∧ q
T T F  F      T     T
T F F  T      F     F
F T T  F      F     T
F F T  F      F     T
ψ |= ϕ, since every interpretation satisfying ψ (the rows where ¬p is true) also satisfies ϕ.
Propositional Logic
Exercise 2.7. -
Use the truth tables method to determine whether p → (q ∧ ¬q) and ¬p are
logically equivalent.
Solution.
p q q ∧ ¬q p → (q ∧ ¬q) ¬p
T T F      F             F
T F F      F             F
F T F      T             T
F F F      T             T
The two formulas are equivalent, since for every possible interpretation they evaluate to the same truth value.
Exercise 2.8.
Compute the truth tables for the following propositional formulas:
• (p → p) → p
• p → (p → p)
• p∨q →p∧q
• p ∨ (q ∧ r) → (p ∧ r) ∨ q
• p → (q → p)
• (p ∧ ¬q) ∨ ¬(p ↔ q)
2.2 Truth Tables
Exercise 2.9.
Use the truth table method to verify whether the following formulas are valid,
satisfiable or unsatisfiable:
• (p → q) ∧ ¬q → ¬p
• (p → q) → (p → ¬q)
• (p ∨ q → r) ∨ p ∨ q
• (p ∨ q) ∧ (p → r ∧ q) ∧ (q → ¬r ∧ p)
• (p ∨ q) ∧ (¬q ∧ ¬p)
• (p → q) ∧ (p → ¬q)
• (p → (q ∨ r)) ∨ (r → ¬p)
Exercise 2.10.
Use the truth table method to verify whether the following logical consequences
and equivalences are correct:
• (p → q) |= ¬p → ¬q
• (p → q) ∧ ¬q |= ¬p
• p → q ∧ r |= (p → q) → r
Propositional Logic
• p ∨ (¬q ∧ r) |= q ∨ ¬r → p
• ¬(p ∧ q) ≡ ¬p ∨ ¬q
• (p ∨ q) ∧ (¬p → ¬q) ≡ q
• (p ∧ q) ∨ r ≡ (p → ¬q) → r
• (p ∨ q) ∧ (¬p → ¬q) ≡ p
• ((p → q) → q) → q ≡ p → q
2.3 Propositional Formalization
Exercise 2.11. -
Let’s consider a propositional language where
1. “if Paola is happy and paints a picture then Renzo isn’t happy”
Solution.
1. p ∧ q → ¬r
2. p → q
Propositional Logic
Exercise 2.12. -
Let’s consider a propositional language where
• q means “x is odd”.
Exercise 2.13. -
Let A =“Aldo is Italian” and B =“Bob is English”.
Formalize the following sentences:
5. “Either Aldo is Italian and Bob is English, or neither Aldo is Italian nor
Bob is English”
Solution.
1. ¬A
2. A ∧ B
3. A → ¬B
2.3 Propositional Formalization
Exercise 2.14.
Angelo, Bruno and Carlo are three students that took the Logic exam. Let’s
consider a propositional language where
3. "Only one among Angelo, Bruno and Carlo passed the exam"
5. "At least two among Angelo, Bruno and Carlo passed the exam"
6. "At most two among Angelo, Bruno and Carlo passed the exam"
7. "Exactly two among Angelo, Bruno and Carlo passed the exam"
Exercise 2.15. -
Let's consider a propositional language where
Propositional Logic
1. “If Davide comes to the party then Bruno and Carlo come too”
2. “Carlo comes to the party only if Angelo and Bruno do not come”
3. “Davide comes to the party if and only if Carlo comes and Angelo doesn’t
come”
4. “If Davide comes to the party, then, if Carlo doesn’t come then Angelo
comes”
5. “Carlo comes to the party provided that Davide doesn’t come, but, if
Davide comes, then Bruno doesn’t come”
7. “Angelo, Bruno and Carlo come to the party if and only if Davide doesn’t
come, but, if neither Angelo nor Bruno come, then Davide comes only if
Carlo comes”
Solution.
1. D → B ∧ C
2. C → ¬A ∧ ¬B
3. D ↔ (C ∧ ¬A)
4. D → (¬C → A)
5. (¬D → C) ∧ (D → ¬B)
6. A → (¬B ∧ ¬C → D)
2.3 Propositional Formalization
Exercise 2.16.
Let's consider a propositional language where
2. “Either Carlo comes to the party, or Bruno and Davide don’t come”
3. “If Angelo and Bruno come to the party, then Carlo comes provided that
Davide doesn’t come”
4. “Carlo comes to the party if Bruno and Angelo don’t come, or if Davide
comes”
5. “If Angelo comes to the party then Bruno or Carlo come too, but if Angelo
doesn’t come to the party, then Carlo and Davide come”
Exercise 2.17. -
Socrate says:
Propositional Logic
Exercise 2.18. -
Socrate says:
Exercise 2.19.
Socrate says:
INTRODUCTION TO APPLIED ARTIFICIAL INTELLIGENCE
1. Introduction to AI
a. Overview
b. Intelligence
c. Searching
2. Mathematical Background
a. Linear Algebra
b. Fuzzy Sets
c. Probability Theory, Bayes Theorem
d. Calculus and Optimization
e. Statistical Concepts: Random Variables, Gaussian Distribution, Bayes Theorem,
Markov Process, Regression
3. Machine Learning
a. Unsupervised Learning: Clustering, PCA
b. Supervised learning: KNN, FCM, LR, Decision Trees, Random Forests
c. Evolutionary Learning: Genetic Algorithms
4. Neural Networks and Deep Learning
a. MLP and Back Propagation
b. Deep learning:
i. CNN
ii. RNN and LSTM
iii. Autoencoders
5. Propositional Logic and Predicate Logic
6. Introduction to NLP
Textbooks:
1. Artificial Intelligence Illuminated – Ben Coppin (2004)
2. Deep Learning – Ian Goodfellow, Yoshua Bengio, Aaron Courville (2017)
3. Computer Vision: Algorithms and Applications – Richard Szeliski (2010)
4. Foundations of Statistical Natural Language Processing - Christopher D. Manning,
Hinrich Schuetze (1999)
KNN & Other Algorithms
KNN Classification Algorithm
K-Nearest Neighbours (KNN)
1. Simple, but a very powerful classification algorithm
2. Classifies based on a similarity measure
3. Lazy learning, does not “learn” until the test example is given
4. Whenever we have a new data point to classify, we find its K-nearest neighbours from the training data
5. Classified by “MAJORITY VOTES” for its neighbour classes
1. Assigned to the most common class amongst its K nearest neighbours (by measuring “distance” between data)
Voronoi diagram
• Describes the areas that are nearest
to any given point, given a set of data
• Each line segment is equidistant
between two points of opposite class
KNN Contd…
Pros
• No training period
• Learning and implementation is extremely simple and intuitive
• Flexible decision boundaries
Cons
• Irrelevant or correlated
features have high impact
and must be eliminated
• Typically difficult to handle
high dimensionality
• Computational costs:
memory and classification
time computation
• Sensitive to noisy data
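As a sketch of the algorithm just described, scikit-learn's KNeighborsClassifier does the "lazy" storing and majority voting; the tiny height/weight data set below is made up purely for illustration:
from sklearn.neighbors import KNeighborsClassifier

X = [[158, 58], [160, 60], [163, 61],   # hypothetical (height, weight) pairs
     [165, 64], [168, 66], [170, 68]]
y = ['M', 'M', 'M', 'L', 'L', 'L']
knn = KNeighborsClassifier(n_neighbors=3)  # Euclidean distance by default
knn.fit(X, y)                              # "lazy": just stores the training data
print(knn.predict([[161, 61]]))            # majority vote of 3 nearest -> ['M']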
KNN Result Interpretation
Fuzzy C-Means (FCM) Clustering
Step 1: Initialize the membership of the data points into desired number of clusters randomly.
Step 2: Find out the fuzzy centroids using the formula
n is the no. of data points, c is the number of clusters, m is the fuzziness parameter (generally 2),
and 𝜇𝑖𝑗 is the membership of ith data into the jth cluster
Step 3: Now recompute the membership values using the formula
𝑑𝑖𝑗 is the distance of the ith data point from the jth cluster centroid
Step 4: Continue repeating steps 2 & 3 until the stopping criterion is achieved
An Example of FCM
• Suppose the given data points are {(1, 3), (2, 5), (4, 8), (7, 9)}.
• Let us assume there are 2 clusters and the initial memberships are randomly initialized as below:
              (1, 3)  (2, 5)  (4, 8)  (7, 9)
  Cluster 1)   0.8     0.7     0.2     0.1
  Cluster 2)   0.2     0.3     0.8     0.9
• New centroids are computed as below (with m = 2):
  V11 = (0.8²·1 + 0.7²·2 + 0.2²·4 + 0.1²·7) / (0.8² + 0.7² + 0.2² + 0.1²) = 1.568
  V12 = (0.8²·3 + 0.7²·5 + 0.2²·8 + 0.1²·9) / (0.8² + 0.7² + 0.2² + 0.1²) = 4.051
  V21 = (0.2²·1 + 0.3²·2 + 0.8²·4 + 0.9²·7) / (0.2² + 0.3² + 0.8² + 0.9²) = 5.35
  V22 = (0.2²·3 + 0.3²·5 + 0.8²·8 + 0.9²·9) / (0.2² + 0.3² + 0.8² + 0.9²) = 8.215
• Distances are calculated as below:
  D11 = ((1 − 1.568)² + (3 − 4.051)²)^0.5 = 1.2
  D12 = ((1 − 5.35)² + (3 − 8.215)²)^0.5 = 6.8
  Similarly, distances are calculated for all the data points.
• Membership values are updated as below:
  μ11 = [ { (1.2² / 1.2²) + (1.2² / 6.8²) }^(1/(2 − 1)) ]⁻¹ = 0.96
  μ12 = [ { (6.8² / 6.8²) + (6.8² / 1.2²) }^(1/(2 − 1)) ]⁻¹ = 0.04
Now these two steps are repeated to compute the final cluster centroids.
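The two update steps can be sketched compactly in NumPy; the data points and initial memberships below are the ones assumed in the example (m = 2, 2 clusters; the iteration count is an arbitrary choice):
import numpy as np

X = np.array([[1, 3], [2, 5], [4, 8], [7, 9]], dtype=float)
U = np.array([[0.8, 0.7, 0.2, 0.1],    # memberships in cluster 1
              [0.2, 0.3, 0.8, 0.9]])   # memberships in cluster 2
m = 2
for _ in range(15):
    W = U ** m
    V = (W @ X) / W.sum(axis=1, keepdims=True)                 # Step 2: fuzzy centroids
    D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)  # d_ij distances
    # Step 3: mu_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1))
    U = 1.0 / ((D[:, None, :] / D[None, :, :]) ** (2 / (m - 1))).sum(axis=1)
print(V)  # on the first pass V is (1.568, 4.051) and (5.35, 8.215), as above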
Applications of Autoencoders
• Feature variation: It extracts only the required features of an image and generates the output by removing
any noise or unnecessary interruption.
• Dimensionality Reduction: The reconstructed image is the same as our input but with reduced dimensions.
It helps in providing the similar image with a reduced pixel value.
• Denoising Image: The input seen by the autoencoder is not the raw input but a stochastically corrupted
version. A denoising autoencoder is thus trained to reconstruct the original input from the noisy version.
• Watermark Removal: It is also used for removing watermarks from images or to remove any object while
filming a video or a movie.
Genetic Algorithms (GAs)
• A genetic algorithm (or GA) is a search technique used in computing to find true or approximate solutions to optimization and search problems. GAs are categorized as global search heuristics.
• GAs are a particular class of evolutionary algorithms that use techniques inspired by evolutionary biology such as inheritance, mutation, selection, and crossover (also called recombination).
• The evolution usually starts from a population of randomly generated individuals and happens in generations.
• In each generation, the fitness of every individual in the population is evaluated, multiple individuals are selected from the current population (based on their fitness), and modified to form a new population. The new population is used in the next iteration of the algorithm.
• The algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population.
The basic loop:
• Start with a large "population" of randomly generated "attempted solutions" to a problem
• Repeatedly do the following:
  • Evaluate each of the attempted solutions
  • (Probabilistically) keep a subset of the best solutions
  • Use these solutions to generate a new population
• Quit when you have a satisfactory solution (or you run out of time)
Natural Evolution    Genetic Algorithm
genotype             coded string
phenotype            uncoded point
chromosome           string
gene                 string position
allele               value at a certain position
fitness              objective function value
The Basic Operations & Characteristics of GAs
Selection: Mechanism for selecting individuals (strings) for reproduction according to their fitness (objective function value).
Crossover: Method of merging the genetic information of two individuals; if the coding is chosen properly, two good parents produce good children.
Mutation: In real evolution, the genetic material can be changed randomly by erroneous reproduction or other deformations of genes, e.g. by gamma radiation. In genetic algorithms, mutation can be realized as a random deformation of the strings with a certain probability. The positive effect is preservation of genetic diversity and, as an effect, that local maxima can be avoided.
Sampling: Procedure which computes a new generation from the previous one and its offspring.
Characteristics:
1. GAs manipulate coded versions of the problem parameters (strings) instead of the parameters themselves, i.e. the search space is S instead of X itself.
2. While almost all conventional methods search from a single point, GAs always operate on a whole population of points (strings). This contributes much to the robustness of genetic algorithms. It improves the chance of reaching the global optimum and, vice versa, reduces the risk of becoming trapped in a local stationary point.
3. Normal genetic algorithms do not use any auxiliary information about the objective function value such as derivatives. Therefore, they can be applied to any kind of continuous or discrete optimization problem. The only thing to be done is to specify a meaningful decoding function.
4. GAs use probabilistic transition operators while conventional methods for continuous optimization apply deterministic transition operators. More specifically, the way a new generation is computed from the actual one has some random components (we will see later, by the help of some examples, what these random components are like).
Selection Crossover Mutation
Selection is the component which guides the algorithm to the solution by preferring individuals with high fitness over low-
fitted ones. It can be a deterministic operation, but in most implementations it has random components. One variant, which
is very popular nowadays (we will give a theoretical explanation of its good properties later), is the following scheme, where
the probability to choose a certain individual is proportional to its fitness. For obvious reasons, this method is often called
proportional selection.
Crossover is a very powerful tool for introducing new genetic material and maintaining genetic diversity, but with the
outstanding property that good parents also produce well-performing children or even better ones. In the simplest case, we
can realize this process by cutting two strings at a randomly chosen position and swapping the two tails.
• N-point crossover: Instead of only one, N breaking points are chosen randomly. Every second section is swapped. Among
this class, two point crossover is particularly important
• Segmented crossover: Similar to N-point crossover with the difference that the number of breaking points can vary.
• Uniform crossover: For each position, it is decided randomly if the positions are swapped.
• Shuffle crossover: First a randomly chosen permutation is applied to the two parents, then N-point crossover is applied to
the shuffled parents, finally, the shuffled children are transformed back with the inverse permutation.
Mutation: For each bit that we are to copy to the new population we allow a small probability of error (for instance 0.1).
Variation of Mutation includes Inversion of single bits, Bitwise inversion and Random selection.
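Pulling proportional selection, one-point crossover and bitwise mutation together, here is a toy GA sketch on the "one-max" problem (maximize the number of 1-bits); the population size, mutation rate and generation count are illustrative choices, not prescribed by the text:
import random

LEN, POP, GENS, P_MUT = 20, 30, 50, 0.1

def fitness(s):
    return sum(s)                                     # objective: count of 1-bits

pop = [[random.randint(0, 1) for _ in range(LEN)] for _ in range(POP)]
for _ in range(GENS):
    weights = [fitness(s) + 1 for s in pop]           # proportional (roulette) selection
    new_pop = []
    while len(new_pop) < POP:
        p1, p2 = random.choices(pop, weights=weights, k=2)
        cut = random.randrange(1, LEN)                 # one-point crossover: swap tails
        child = p1[:cut] + p2[cut:]
        child = [b ^ 1 if random.random() < P_MUT else b for b in child]  # mutation
        new_pop.append(child)
    pop = new_pop                                      # sampling: the new generation
print(max(fitness(s) for s in pop))                    # best fitness found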
Recurrent Neural Network (RNN)
RNNs were created because there were a few issues in the feed-forward neural network:
• Cannot handle sequential data
• Considers only the current input
• Cannot memorize previous inputs
The solution to these issues is the RNN. An RNN can handle
• sequential data,
• accepting the current input data, and
• previously received inputs;
• RNNs can memorize previous inputs.
In recurrent neural networks, the information cycles through a loop to the middle hidden layer. The input layer 'x' takes in the input to the neural network, processes it and passes it on to the middle layer.
Applications of RNN
1. Time Series Prediction
2. Natural Language Processing
3. Machine Translation
Types of RNN
One to One RNN: This type of neural network is known as the Vanilla Neural Network. It's used for general machine learning problems which have a single input and a single output.
Many to One RNN: This RNN takes a sequence of inputs and generates a single output. Sentiment analysis is a good example of this kind.
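The loop through the hidden layer can be made concrete with a bare-bones many-to-one forward pass in NumPy; the weights and the input sequence below are random stand-ins, purely for illustration:
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_hid = 5, 3, 4                  # sequence length, input size, hidden size
Wxh = rng.normal(size=(n_hid, n_in))      # input -> hidden
Whh = rng.normal(size=(n_hid, n_hid))     # hidden -> hidden (the recurrent loop)
Why = rng.normal(size=(1, n_hid))         # hidden -> single output

xs = rng.normal(size=(T, n_in))           # a toy input sequence
h = np.zeros(n_hid)
for x in xs:                              # the same weights are reused at every step
    h = np.tanh(Wxh @ x + Whh @ h)        # state depends on current AND past inputs
y = Why @ h                               # one output for the whole sequence
print(y)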
Connectives or Operators
• OR (∨)
• AND (∧)
• Negation / NOT (¬)
• Implication / if-then (→)
• Biconditional / if and only if (⇔)
P Q | P∨Q P∧Q ¬P P→Q P⇔Q
T T |  T   T   F   T   T
T F |  T   F   F   F   F
F T |  T   F   T   T   F
F F |  F   F   T   T   T
Properties of Operators:
• Commutativity: P ∧ Q = Q ∧ P, and P ∨ Q = Q ∨ P
• Associativity: (P ∧ Q) ∧ R = P ∧ (Q ∧ R), (P ∨ Q) ∨ R = P ∨ (Q ∨ R)
• Identity element: P ∧ True = P, P ∨ True = True
• Distributivity: P ∧ (Q ∨ R) = (P ∧ Q) ∨ (P ∧ R), P ∨ (Q ∧ R) = (P ∨ Q) ∧ (P ∨ R)
• De Morgan's Laws: ¬(P ∧ Q) = (¬P) ∨ (¬Q), ¬(P ∨ Q) = (¬P) ∧ (¬Q)
• Double-negation elimination: ¬(¬P) = P
PRECEDENCE           OPERATORS
First Precedence     Parenthesis
Second Precedence    Negation
Third Precedence     Conjunction (AND)
Fourth Precedence    Disjunction (OR)
Fifth Precedence     Implication
Sixth Precedence     Biconditional
Tautology & Contradiction
A Tautology is a formula
which is always true for
every value of its
propositional variables.
A Contradiction is a
formula which is always
false for every value of its
propositional variables.
Propositional Equivalence
Two statements X and Y are logically equivalent if any of the following two conditions hold:
• The bi-conditional statement X ⇔ Y is a tautology.
• X and Y have the same truth value under every possible interpretation (their truth tables match).
Inferencing
Inference rules are the templates for generating valid arguments. Inference rules are applied to derive proofs in
artificial intelligence, and the proof is a sequence of the conclusion that leads to the desired goal. Following are
some terminologies related to inference rules:
Implication – It has two parts − Hypothesis & Conclusion. “If you do your homework, you will not be punished.”
Here, "you do your homework" is the hypothesis, and "you will not be punished" is the conclusion.
Inverse − An inverse of the conditional statement is the negation of both the hypothesis and the conclusion. If
the statement is “If p, then q”, the inverse will be “If not p, then not q”. Thus the inverse of p→q is ¬p→¬q.
Example − The inverse of “If you do your homework, you will not be punished” is
“If you do not do your homework, you will be punished.”
Converse − The converse of the conditional statement is computed by interchanging the hypothesis and the
conclusion. If the statement is “If p, then q”, the converse will be “If q, then p”. The converse of p→q is q→p.
Example − The converse of "If you do your homework, you will not be punished" is "If you will not be punished,
you do your homework”.
Contra-positive − The contra-positive of the conditional is computed by interchanging the hypothesis and the
conclusion of the inverse statement. If the statement is “If p, then q”, the contra-positive will be “If not q, then
not p”. The contra-positive of p→q is ¬q→¬p.
Example − The Contra-positive of " If you do your homework, you will not be punished” is "If you are punished,
you did not do your homework”.
Inference Rules
Modus Ponens:
The Modus Ponens rule states that if P and P→Q are true, then we can infer that Q will be true.
Statement-1: "If I am sleepy then I go to bed" ==> P→Q
Statement-2: "I am sleepy" ==> P
Conclusion: "I go to bed." ==> Q
Hence, we can say that if P→Q is true and P is true, then Q will be true.
Modus Tollens:
The Modus Tollens rule states that if P→Q is true and ¬Q is true, then ¬P will also be true.
Statement-1: "If I am sleepy then I go to bed" ==> P→Q
Statement-2: "I do not go to the bed." ==> ~Q
Statement-3: Which infers that "I am not sleepy" ==> ~P
Disjunctive Syllogism:
The Disjunctive Syllogism rule states that if P∨Q is true, and ¬P is true, then Q will be true.
Statement-1: Today is Sunday or Monday. ==> P∨Q
Statement-2: Today is not Sunday. ==> ¬P
Conclusion: Today is Monday. ==> Q
Hypothetical Syllogism:
The Hypothetical Syllogism rule states that P→R is true whenever P→Q is true and Q→R is true.
Statement-1: If you have my home key then you can unlock my home. ==> P→Q
Statement-2: If you can unlock my home then you can take my money. ==> Q→R
Conclusion: If you have my home key then you can take my money. ==> P→R
The Addition rule: If P is true, then P∨Q will be true.
The Simplification rule: If P∧Q is true, then Q or P will also be true.
The Resolution rule: If P∨Q and ¬P∧R are true, then Q∨R will also be true.
The Rules
Addition:               P ∴ P ∨ Q
Conjunction:            P, Q ∴ P ∧ Q
Simplification:         P ∧ Q ∴ P
Modus Ponens:           P → Q, P ∴ Q
Modus Tollens:          P → Q, ~Q ∴ ~P
Disjunctive Syllogism:  P ∨ Q, ~P ∴ Q
Hypothetical Syllogism: P → Q, Q → R ∴ P → R
Constructive Dilemma:   (P → Q) ∧ (R → S), P ∨ R ∴ Q ∨ S
Destructive Dilemma:    (P → Q) ∧ (R → S), ~Q ∨ ~S ∴ ~P ∨ ~R
An Example
Mrs. Baker says that
1. Alice is her youngest child if Bill is not her youngest child, and
2. Alice is not her youngest child if Carl is not her youngest child.
Write down a knowledge base that describes this riddle and the necessary background knowledge that only one of the three children can be her youngest child. Show with resolution that Bill is her youngest child.
Let the propositions A, B and C denote that Mrs. Baker's youngest child is Alice, Bill and Carl, respectively.
The information from Mrs. Baker is:
1. ¬B ⇒ A (Alice is her youngest child if Bill is not)
2. ¬C ⇒ ¬A (Alice is not her youngest child if Carl is not)
We have the following clauses for the background knowledge:
1. A ∨ B ∨ C (One child has to be the youngest.)
2. ¬A ∨ ¬B (Alice and Bill can't both be the youngest.)
3. ¬A ∨ ¬C
4. ¬B ∨ ¬C
We want to show that Bill is the youngest child:
1. ¬B (Assume that Bill is not the youngest child.)
2. A (As ¬B ⇒ A)
3. ¬C (As ¬A ∨ ¬C)
4. ¬A (As ¬C ⇒ ¬A)
So our assumption leads to the fact that both A and ¬A are true, which is not possible. So B is true.
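Because the language is propositional, the riddle can also be checked by brute force; this sketch enumerates all assignments of A, B, C and keeps those satisfying the knowledge base (it uses Python's `<=` on booleans as implication):
from itertools import product

models = [(A, B, C) for A, B, C in product([True, False], repeat=3)
          if (A or B or C)                                       # one child is youngest
          and not (A and B) and not (A and C) and not (B and C)  # only one of them
          and ((not B) <= A)                                     # ¬B ⇒ A
          and ((not C) <= (not A))]                              # ¬C ⇒ ¬A
print(models)  # only (False, True, False): Bill is the youngest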
From Proposition To Predicate
3 + 2 = 5 is a proposition. But is X + 2 = 5 a proposition? Because it has a variable X in it, we cannot say it is T or F. So, it is not a proposition. It is called a predicate.
A predicate is a function. It takes some variable(s) as arguments; it returns either True or False (but not both) for each combination of the argument values.
In contrast, a proposition is not a function. It does not have any variable as argument. It is either True or False (but not both).
The variables are always associated with a universe (or domain) of discourse, which tells us what combinations of the argument values are allowed.
Suppose P(x) is a predicate, where the universe of discourse for x is {1, 2, 3}. Then P(x) is not a proposition, but P(1) is a proposition.
In general, a predicate is not a proposition. But when you assign values to all its argument variables, you get a proposition.
Example:
P(x, y) : "x + 2 = y" is a predicate. It has two variables x and y.
Universe of Discourse: x is in {1, 2, 3}; y is in {4, 5, 6}.
P(1, 4) : 1 + 2 = 4 is a proposition (it is F);
P(2, 4) : 2 + 2 = 4 is a proposition (it is T);
P(2, 3) : meaningless (in this example), because 3 is not in the specified universe of discourse for y.
Predicate Logic Or First Order Logic
In propositional logic, we can only represent the facts, which are either true or false.
PL is not sufficient to represent the complex sentences or natural language statements.
Consider the following sentence, which we cannot represent using PL logic.
• "Some humans are intelligent", or
• "Sachin likes cricket."
To represent the above statements, we require a more powerful logic, such as first-order logic.
First-order logic (like natural language) does not only assume that the world contains facts like propositional
logic but also assumes the following things in the world:
• Objects: A, B, people, numbers, colours, wars, theories, squares, pits, etc.
• Relations: It can be a unary relation such as: red, round, is adjacent; or an n-ary relation such as: the sister of,
brother of, has colour, comes between, etc.
• Function: Father of, best friend, third inning of, end of, etc.
Consider the statement: "x is an integer." It consists of two parts: the first part, x, is the subject, and the second part, "is an integer," is known as the predicate.
A quantifier is a language element which generates quantification. There are two quantifiers:
• Universal Quantifier: ∀ (for all, everyone, everything)
• Existential Quantifier: ∃ (for some, at least one)
Example
• All men drink coffee = ∀x man(x) → drink(x, coffee)
• Some boys are intelligent = ∃x boys(x) ∧ intelligent(x)
• All birds fly = ∀x bird(x) → fly(x)
• Every man respects his parent = ∀x man(x) → respects(x, parent)
• Some boys play cricket = ∃x boys(x) ∧ play(x, cricket)
• Not all students like both Mathematics and Science =
  ¬∀x [ student(x) → like(x, Mathematics) ∧ like(x, Science) ]
Goldbach’s Conjecture: every even number greater than 2 is equal to the sum of two prime numbers
∀ n, if integer(n), even(n), greater(n,2) then
∃ p1, p2, integer(p1), integer(p2), prime(p1),prime(p2), and
equals(n, sum(p1,p2))
Universal Quantification and Existential Quantification are two more ways of converting Predicates to Propositions.
Let P(x) : "x + 2 = 5" be a predicate in the Universe of Discourse {1, 2, 3}.
Then ∀ x P(x) means: ”for all x in {1, 2, 3}, x + 2 = 5” which is a false proposition
And ∃ x P(x) means: “for some x in {1, 2, 3}, x + 2 = 5” which is a true proposition.
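Over a finite universe of discourse, ∀ and ∃ correspond directly to Python's all() and any(); a one-line check of the example above:
universe = [1, 2, 3]
P = lambda x: x + 2 == 5            # the predicate "x + 2 = 5"
print(all(P(x) for x in universe))  # ∀x P(x): False
print(any(P(x) for x in universe))  # ∃x P(x): True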
Inference in FOL
Substitution: Substitution is a fundamental operation performed on terms and formulas. If we write F[a/x], so
it refers to substitute a constant "a" in place of variable "x".
Equality: First-Order logic does not only use predicate and terms for making atomic sentences but also uses
equality symbols which specify that the two terms refer to the same object, e.g., Brother (John) = Smith.
As propositional logic, FOL does also have inference rules. Following are some basic inference rules in FOL:
• Universal Generalization: states that if premise P(c) is true for any arbitrary element c in the universe of
discourse, then we can have a conclusion as ∀ x P(x).
• Universal Instantiation: states that we can infer any sentence P(c) by substituting a ground term c (a
constant within domain x) from ∀ x P(x) for any object in the universe of discourse.
• Existential Instantiation: states that one can infer P(c) from the formula given in the form of ∃x P(x) for a
new constant symbol c, with the restriction that the c used in the rule must be a new term for which P(c) is
true.
• Existential Generalization: states that if there is some element c in the universe of discourse which has a
property P, then we can infer that there exists something in the universe which has the property P.
Example
Universal Generalization: P(c) ∴ ∀x P(x)
Example: Let's represent P(c): "A byte contains 8 bits". If this holds for an arbitrary byte c, then ∀x P(x): "All bytes contain 8 bits" will also be true.
Universal Instantiation: ∀x P(x) ∴ P(c)
Example: If "Every person likes ice-cream" = ∀x P(x), then we can infer that "John likes ice-cream" = P(c).
Existential Instantiation: ∃x P(x) ∴ P(c)
Example: If you get 90 on the final exam for AI, then you get an A for the course. Someone, call him/her say c, gets 90 on the final exam. Therefore c gets an A for AI.
Existential Introduction: P(c) ∴ ∃x P(x)
Example: "Priyanka got good marks in English." Therefore, "Someone got good marks in English."
Negated Quantifiers
Universe of discourse: all UG students
A(x): "x lives in Amherst."
B(x): "x is a CSE student."
C(x): "x has a good GPA."
D(x): "x majors in computer science."
• All CSE students have good GPA: ∀x (B(x) → C(x))
• No CSE student lives in Amherst: ¬∃x (B(x) ∧ A(x))
• CSE students who do not live in Amherst major in computer science: ∀x (B(x) ∧ ¬A(x) → D(x))
There are some prediction-based vectors as well. Word2vec is not a single algorithm but a combination of two
techniques – CBOW (Continuous Bag of Words) and the Skip-gram model. Both of these are shallow neural
networks which map word(s) to the target variable, which is also a word (or words).
Feature Engineering – other features
A number of extra text based features can also be created which sometimes are helpful for improving text
classification models. Some examples are:
1. Word Count of the documents – total number of words in the documents
2. Character Count of the documents – total number of characters in the documents
3. Average Word Density of the documents – average length of the words used in the documents
4. Punctuation Count in the Complete Essay – total number of punctuation marks in the documents
5. Upper Case Count in the Complete Essay – total number of upper count words in the documents
6. Title Word Count in the Complete Essay – total number of proper case (title) words in the documents
7. Frequency distribution of Part of Speech Tags:
1. Noun Count
2. Verb Count
3. Adjective Count
4. Adverb Count
5. Pronoun Count
Model Building
The final step in the text classification framework is to train a classifier using the features created in the
previous step. There are many different choices of machine learning models which can be used to train a
final model. We will implement following different classifiers for this purpose:
1. Naive Bayes Classifier
2. Linear Classifier
3. Support Vector Machine
4. Bagging Models
5. Boosting Models
6. Shallow Neural Networks
7. Deep Neural Networks
1. Convolutional Neural Network (CNN)
2. Long Short-Term Memory (LSTM)
3. Gated Recurrent Unit (GRU)
4. Bidirectional RNN
5. Recurrent Convolutional Neural Network (RCNN)
6. Other Variants of Deep Neural Networks
Evaluation
TP (True Positives): number of samples that were correctly classified as positive.
FP (False Positives): number of samples that were incorrectly classified as positive.
TN (True Negatives): number of samples that were correctly classified as negative.
FN (False Negatives): number of samples that were incorrectly classified as negative.
Accuracy answers the question "What proportion of samples was classified correctly?" and is calculated as the ratio of the number of correct classifications (or predictions) to the total number of classifications. That is to say:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision answers the question "What proportion of the positively classified samples was actually correct?" and is calculated as the proportion of correct positive classifications over the total number of positive classifications. That is to say:
Precision = TP / (TP + FP)
Recall answers the question "What proportion of the actual positive samples was correctly classified?" and is calculated as the proportion of correct positive classifications over the total number of samples that actually belong to the positive class (TP plus the FN samples that weren't classified as positive but actually were). That is to say:
Recall = TP / (TP + FN)
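A small sketch computing the three measures from raw counts (the confusion-matrix values below are made up for illustration):
TP, FP, TN, FN = 40, 10, 45, 5
accuracy = (TP + TN) / (TP + TN + FP + FN)   # 85 / 100
precision = TP / (TP + FP)                   # 40 / 50
recall = TP / (TP + FN)                      # 40 / 45
print(accuracy, precision, recall)           # 0.85 0.8 0.888...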
CNN – Convolutional Neural Networks
Introduction
• CNNs have a deep feed-forward architecture and the capability to generalize much better compared to fully-connected (FC) networks.
• A CNN can be thought of as a hierarchical feature detector, and they are biologically inspired.
• It can learn abstract features and identify objects efficiently.
• The advantages of CNN are:
  • A fully-connected structure does not scale to large images, but due to the weight-sharing property, CNNs vastly reduce the number of parameters in the network, resulting in
    • better generalization, i.e., less overfitting, and
    • smoother training.
  • Feature extraction and classification happen simultaneously within the same framework.
• CNNs are found to be extremely efficient in the fields of:
  • Image classification
  • Object detection
  • Face detection
  • Speech recognition
  • Facial expression recognition, etc.
CNN Architecture
Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
1. Neurons are arranged in 3 dimensions: width, height, depth.
2. A stacked sequence of layers (3 main types) is used to build a CNN:
   • Convolutional Layer
   • Pooling Layer
   • Fully-Connected Layer
3. Every layer of a CNN transforms one volume of activations to another through a differentiable function.
4. Convolutional layers are the main building blocks and they contain kernels (filters) to detect features from an image.
5. The pooling layers reduce the dimension using some kind of aggregation.
6. The fully connected layers are generally at the end of the network, consist generally of not more than 3 layers, and do the final classification.
Convolution
Convolution leverages three important ideas that can help improve a machine learning system:
1. Sparse interactions: In traditional network layers, there is an interaction between each input unit and each output unit. Convolutional networks typically have sparse interactions, accomplished by making the kernel smaller than the input. So, even with an input image of thousands or millions of pixels, we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels.
2. Parameter sharing: It refers to using the same parameter more than once. In a CNN, each member of the kernel is used at every position of the input, resulting in less computation and storage.
3. Equivariant representations: Parameter sharing gives the layers an advantage of equivariance to translation.
Convolution contd…
One output value of a convolution is the sum of elementwise products of the kernel with the image patch (values from the figure):
50·30 + 20·30 + 50·30 + 50·30 + 50·30 = 6600
Convolution contd…
Input Image (5×5):
1 1 1 0 0
0 1 1 1 0
0 0 1 1 1
0 0 1 1 0
0 1 1 0 0
Filter (3×3):
1 0 1
0 1 0
1 0 1
Simple convolution with stride S = 2 gives the output (size 2×2):
4 4
2 4
Image size: N × N, Filter: M × M, Stride: S
Output size: ⌊(N − M)/S⌋ + 1
Zero-padding (the same image padded with a border of zeros, 7×7):
0 0 0 0 0 0 0
0 1 1 1 0 0 0
0 0 1 1 1 0 0
0 0 0 1 1 1 0
0 0 0 1 1 0 0
0 0 1 1 0 0 0
0 0 0 0 0 0 0
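A direct NumPy sketch of the example above (5×5 input, 3×3 filter, stride 2) reproduces the 2×2 output:
import numpy as np

img = np.array([[1,1,1,0,0],
                [0,1,1,1,0],
                [0,0,1,1,1],
                [0,0,1,1,0],
                [0,1,1,0,0]])
k = np.array([[1,0,1],
              [0,1,0],
              [1,0,1]])
N, M, S = img.shape[0], k.shape[0], 2
out_size = (N - M) // S + 1                  # floor((N - M)/S) + 1
out = np.zeros((out_size, out_size), dtype=int)
for i in range(out_size):
    for j in range(out_size):
        patch = img[i*S:i*S+M, j*S:j*S+M]
        out[i, j] = np.sum(patch * k)        # elementwise product, then sum
print(out)                                   # [[4 4], [2 4]]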
Convolution - Summary
Max Pooling
Output Size = (N − F) / stride + 1
Sample Architecture & Size
[INPUT] →[CONV 1] → [BATCH NORM] → [ReLU] → [POOL 1]→ [CONV 2] → [BATCH NORM] → [ReLU] → [POOL 2]→ [FC LAYER] → [RESULT]
CNN Architecture
Architecture of VGGNet
Gradient Descent
Pre-requisite: The gradient descent algorithm does not work well for all functions. There are two specific requirements: the function has to be differentiable and convex.
model.compile(loss='categorical_crossentropy’, optimizer='adam')
Model Optimizers
Stochastic gradient descent (SGD)
Require: Learning rate ε
Require: Initial parameter θ
while stopping criterion not met do
    Sample a minibatch of m examples xᵢ with targets yᵢ
    Compute gradient estimate: g ← (1/m) ∇θ ∑ᵢ L(f(xᵢ; θ), yᵢ)
    Apply update: θ ← θ − ε g
end while

Stochastic gradient descent (SGD) – Adaptive Learning Rate
Require: Learning rate schedule ε₁, ε₂, …
Require: Initial parameter θ
k ← 1
while stopping criterion not met do
    Sample a minibatch of m examples xᵢ with targets yᵢ
    Compute gradient estimate: g ← (1/m) ∇θ ∑ᵢ L(f(xᵢ; θ), yᵢ)
    Apply update: θ ← θ − εₖ g
    k ← k + 1
end while

Stochastic gradient descent (SGD) with momentum
Require: Learning rate ε, momentum parameter α
Require: Initial parameter θ, initial velocity v
while stopping criterion not met do
    Sample a minibatch of m examples xᵢ with targets yᵢ
    Compute gradient estimate: g ← (1/m) ∇θ ∑ᵢ L(f(xᵢ; θ), yᵢ)
    Compute velocity update: v ← α v − ε g
    Apply update: θ ← θ + v
end while

The AdaGrad Algorithm
Require: Learning rate ε, initial parameter θ
Require: Small constant δ ≈ 10⁻⁷ for numeric stability
Initialize gradient accumulation variable r = 0
while stopping criterion not met do
    Sample a minibatch of m examples xᵢ with targets yᵢ
    Compute gradient estimate: g ← (1/m) ∇θ ∑ᵢ L(f(xᵢ; θ), yᵢ)
    Accumulate squared gradient: r ← r + g ⊙ g
    Compute update: Δθ ← −(ε / (δ + √r)) ⊙ g
    Apply update: θ ← θ + Δθ
end while
Optimizers Contd…
The RMSProp Algorithm
Require: Learning rate ε, decay rate ρ
Require: Small constant δ ≈ 10⁻⁷ for numeric stability
Initialize accumulation variable r = 0
while stopping criterion not met do
    Sample a minibatch of m examples xᵢ with targets yᵢ
    Compute gradient estimate: g ← (1/m) ∇θ ∑ᵢ L(f(xᵢ; θ), yᵢ)
    Accumulate squared gradient: r ← ρ r + (1 − ρ) g ⊙ g
    Compute update: Δθ ← −(ε / √(δ + r)) ⊙ g
    Apply update: θ ← θ + Δθ
end while

The Adam Algorithm
Require: Global learning rate ε, decay rate ρ
Require: Initial parameter θ, momentum coefficient α
Require: Initial velocity v
Initialize gradient accumulation variable r = 0
while stopping criterion not met do
    Sample a minibatch of m examples xᵢ with targets yᵢ
    Compute interim update: θ ← θ + α v
    Compute gradient estimate: g ← (1/m) ∇θ ∑ᵢ L(f(xᵢ; θ), yᵢ)
    Accumulate gradient: r ← ρ r + (1 − ρ) g ⊙ g
    Compute velocity update: v ← α v − (ε / √r) ⊙ g
    Apply update: θ ← θ + v
end while
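As an illustration of the pseudo-code, here is SGD with momentum on a toy quadratic loss L(θ) = ‖θ‖² (the learning rate, momentum value and starting point are arbitrary choices):
import numpy as np

theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)
eps, alpha = 0.1, 0.9
for _ in range(100):
    g = 2 * theta            # gradient of ||θ||² (stands in for the minibatch estimate)
    v = alpha * v - eps * g  # velocity update
    theta = theta + v        # parameter update
print(theta)                 # close to the minimum at (0, 0)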
Model Loss Functions – Cross-Entropy
Cross-Entropy = (1/n) ∑ᵢ₌₁ⁿ ∑ⱼ₌₁ᶜ −yᵢⱼ log pᵢⱼ
where C is the no. of classes and n is the no. of data points.
The outputs (softmax layer) generated by the network and the corresponding targets are given below:
Output           Target
[0.8 0.2 0.0]    [1 0 0]
[0.4 0.3 0.3]    [1 0 0]
[0.3 0.6 0.1]    [0 1 0]
[0.1 0.1 0.8]    [0 0 1]
CE = (1/4)(−log 0.8 − log 0.4 − log 0.6 − log 0.8)
When there are only two classes, and the model generates a score for only one class, then it's a case for binary cross-entropy, which can be calculated using the formula
BCE = −(1/n) ∑ᵢ₌₁ⁿ [ yᵢ log pᵢ + (1 − yᵢ) log(1 − pᵢ) ]
Example:
Class: 1 – Prob: 0.94 – Corr. Prob: 0.94
Class: 0 – Prob: 0.38 – Corr. Prob: 0.62
Class: 1 – Prob: 0.54 – Corr. Prob: 0.54
Class: 0 – Prob: 0.12 – Corr. Prob: 0.88
BCE = −(1/4)(log 0.94 + log 0.62 + log 0.54 + log 0.88)
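Reproducing the categorical cross-entropy example in NumPy (one-hot targets; a small constant guards the log against the 0.0 entry):
import numpy as np

p = np.array([[0.8, 0.2, 0.0],
              [0.4, 0.3, 0.3],
              [0.3, 0.6, 0.1],
              [0.1, 0.1, 0.8]])
y = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
eps = 1e-12                                     # avoid log(0)
ce = -(y * np.log(p + eps)).sum(axis=1).mean()  # mean over the n = 4 data points
print(ce)  # (1/4)(-log 0.8 - log 0.4 - log 0.6 - log 0.8) ≈ 0.468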
Metrics
Metric values are displayed during fit() and logged to the History object returned by fit().
They are also returned by evaluate().
The activation function:
O(x) = 1 if w₀ + ∑wᵢxᵢ > 0, and −1 if w₀ + ∑wᵢxᵢ ≤ 0   (threshold unit)
or, for a sigmoid unit:
O(x) = 1 / (1 + e^−(w₀ + ∑wᵢxᵢ))
Activation Functions
Perceptron Learning Algorithm
Linearly Separable
Linearly Non-Separable
Logistic Regression
What is Logistic Regression?
1. The dependent variable is not continuous but categorical, e.g., Birth Weight < 2.5 kg is LOW.
2. Mostly the DV is dichotomous, e.g., Yes/No, Low/High, Right/Left, etc.
5. This implies p = 1 / (1 + exp(−β₀ − β₁x)) = exp(β₀ + β₁x) / (1 + exp(β₀ + β₁x))
   1. As β₀ + β₁x → ∞, p approaches 1
   2. As β₀ + β₁x → −∞, p approaches 0
   3. As p approaches 0.5, β₀ + β₁x approaches 0
   4. Steepness of the curve increases as β₁ gets bigger
LR Continued…
Logistic Regression can be of three types:
1. Binary LR: The response is binary i.e., it has two possible outcomes.
2. Nominal LR: There are three or more categories with no natural ordering e.g., Red, Blue, Orange.
3. Ordinal LR: There are three or more categories with a natural ordering to the levels, but the ranking of the levels does not necessarily mean the intervals between them are equal, e.g., Good, Stable, Serious, Critical.
When we have multiple independent variables, i.e., x is a vector X, then the model is formulated as follows:
p(X) = exp(β₀ + β₁X₁ + β₂X₂ + ⋯ + βₖXₖ) / (1 + exp(β₀ + β₁X₁ + β₂X₂ + ⋯ + βₖXₖ))
     = exp(Xβ) / (1 + exp(Xβ)) = 1 / (1 + exp(−Xβ))
Interpretation of the Coefficients
Let us consider a binary DV Y and a single-dimensional input variable X.
Then the equation is ln(p / (1 − p)) = β₀ + β₁x
Now, let us try to understand the impact of a unit increase in X.
If the probability becomes p′, then the equation becomes ln(p′ / (1 − p′)) = β₀ + β₁(x + 1) = β₀ + β₁x + β₁
Now β₁ = (β₀ + β₁x + β₁) − (β₀ + β₁x) = ln(p′ / (1 − p′)) − ln(p / (1 − p))
        = ln[ (p′ / (1 − p′)) / (p / (1 − p)) ]
        = ln(odds′ / odds)
⇒ e^β₁ = odds′ / odds
This means that a unit increase in X multiplies the odds by e^β₁; as β₁ increases, the odds ratio grows exponentially.
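A quick numeric illustration of the odds-ratio interpretation, with made-up coefficients β₀ = −1, β₁ = 0.7 and x = 2:
import numpy as np

b0, b1, x = -1.0, 0.7, 2.0
p  = 1 / (1 + np.exp(-(b0 + b1 * x)))        # probability at x
p2 = 1 / (1 + np.exp(-(b0 + b1 * (x + 1))))  # probability at x + 1
odds, odds2 = p / (1 - p), p2 / (1 - p2)
print(odds2 / odds, np.exp(b1))              # both ≈ 2.0138, i.e. e^β₁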
Module - 3
Decision Tree
Random Forest
OLS Regression
Learning in General
Learning
Learning Process & Methods
1. Data Collection: The quality and quantity of information you get are very important, since they will directly impact how well or badly your model will work.
2. Data Preparation: It will be necessary to make a selection of characteristics which will be used during the execution times.
3. Model Selection: An appropriate model needs to be selected depending on the type of data and the type of the application.
4. Model Training: The selected model needs to be trained on the available data collected during the first phase of the process.
5. Evaluation: Validation of the performance on the available data itself, to ensure proper training.
6. Parameter Tuning: It's a continuous process along with the previous step, to gradually improve and converge the learning process.
7. Inferencing: Applying the model on real-life data.

Model                           Applications
Decision Trees                  Financial applications
Random Forest                   Fraud detection
Logistic Regression             Price prediction
MLP Networks                    Classification
Convolutional Neural Networks   Image processing
Recurrent Neural Networks       Voice recognition
Reinforcement Learning          Learning by trial and error
Generative Models               Image creation
K-means                         Segmentation
k-Nearest Neighbors             Recommendation systems
Decision Tree
Decision Trees
OUTLOOK TEMP. HUMIDITY WIND PLAY?
S       H     H        W    N
S       H     H        S    N
O       H     H        W    Y
R       M     H        W    Y
R       C     N        W    Y
R       C     N        S    N
O       C     N        S    Y
S       M     H        W    N
S       C     N        W    Y
R       M     N        W    Y
S       M     N        S    Y
O       M     H        S    Y
O       H     N        W    Y
R       M     H        S    N

Node = root of decision tree
Main loop:
1. A → the "best" decision attribute for the next node.
2. Assign A as decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to leaf nodes.
5. If training examples are perfectly classified, stop.
6. Else, recurse over new leaf nodes.
“BEST” Attribute
(Using the same play-tennis table as above.) The entropy of a variable is the "amount of information" contained in the variable. Shannon's entropy quantifies the amount of information in a variable, and it is computed using the formula E = −∑ p log p (base-10 logarithms are used in the numbers below).
• Initial Entropy (9 Yes, 5 No):
  Eᵢ = −[ (5/14) log(5/14) + (9/14) log(9/14) ] = −(−0.16 − 0.12) = 0.28
• When OUTLOOK is taken as the splitting criterion, we get three splits whose populations are S(3, 2), O(0, 4) and R(2, 3). The O split is pure, so the entropy becomes:
  E_O = −(5/14)[ (3/5) log(3/5) + (2/5) log(2/5) ] − (5/14)[ (2/5) log(2/5) + (3/5) log(3/5) ]
      = −(−0.1 − 0.1) = 0.2
• So, information gain G_O = Eᵢ − E_O = 0.28 − 0.2 = 0.08
Some more facts on Decision Trees
1. Some other choices of attribute selection are – a. Random, b. Least-values, c. Highest-values, etc.
2. Decision trees can represent any Boolean function of the input attributes (e.g., a tree testing x1 = 1, then x2, with class labels C0/C1 at the leaves).
3. Decision trees divide the feature space into axis-parallel (hyper-)rectangles. Each rectangular region is labelled with one label – or a probability distribution over labels.
4. For continuous variables, the range will be split into various slabs to fit into this form (internal tests such as x2 < 3, x1 < 4, x1 < 3, x2 < 4).
5. There are other measures of impurity like GINI, chi-square test, etc.
Random Forest
Random Forest Algorithm
• Random forest builds an ensemble of decision trees on
different samples and takes their majority vote for the final
classification.
• Bagging, also known as Bootstrap Aggregation, is the ensemble technique used by random forest.
• Bagging chooses a random sample from the data set.
Hence each model is generated from the samples
(Bootstrap Samples) provided by the Original Data with
replacement known as row sampling. This step of row
sampling with replacement is called bootstrap.
• Now each model is trained independently which
generates results. The final output is based on majority
voting after combining the results of all models. This
step which involves combining all the results and
generating output based on majority voting is known
as aggregation.
The bootstrap trees are all different from the original tree
• While growing the tree, at each node, while splitting,
the splitting feature is chosen from only a subset of all
features.
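A minimal scikit-learn sketch of the above – bootstrap row sampling plus per-split feature subsets – on a synthetic data set (all parameter values are illustrative):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(
    n_estimators=100,      # number of bootstrap trees
    max_features='sqrt',   # feature subset considered at each split
    bootstrap=True,        # row sampling with replacement
    random_state=0)
rf.fit(X_tr, y_tr)
print(rf.score(X_te, y_te))  # accuracy of the majority-vote aggregation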
OLS Regression
The Regression Line Parameters
Ŷᵢ = m xᵢ + b
E = ∑(Yᵢ − Ŷᵢ)²
  = ∑(yᵢ − m xᵢ − b)²
  = ∑(yᵢ² + m²xᵢ² + b² − 2m xᵢyᵢ − 2b yᵢ + 2m xᵢ b)
  = ∑yᵢ² + m²∑xᵢ² + Nb² − 2m∑xᵢyᵢ − 2b∑yᵢ + 2bm∑xᵢ
  = m²𝒜 − 2mℬ + 𝒞 + 2bm𝒳 − 2b𝒴 + b²N
where 𝒜 = ∑xᵢ², ℬ = ∑xᵢyᵢ, 𝒞 = ∑yᵢ², 𝒳 = ∑xᵢ, 𝒴 = ∑yᵢ.
∂E/∂m = 2m𝒜 − 2ℬ + 2b𝒳 = 0 …(i)
∂E/∂b = 2m𝒳 − 2𝒴 + 2bN = 0 …(ii) ⇒ b = (𝒴 − m𝒳) / N …(iii)
Putting this value of b in (i), we get
m𝒜 − ℬ + 𝒳(𝒴 − m𝒳)/N = 0 ⇒ m = (Nℬ − 𝒳𝒴) / (N𝒜 − 𝒳²) …(iv)

Example data ("x" = hours of sunshine, "y" = ice creams sold):
x   y   x²   xy
2   4   4     8
3   5   9    15
5   7   25   35
7   10  49   70
9   15  81  135
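Applying equations (iii) and (iv) to the ice-cream data in a short NumPy sketch:
import numpy as np

x = np.array([2, 3, 5, 7, 9])
y = np.array([4, 5, 7, 10, 15])
N = len(x)
A, B = (x**2).sum(), (x*y).sum()      # 𝒜 = Σx², ℬ = Σxy
Xs, Ys = x.sum(), y.sum()             # 𝒳 = Σx, 𝒴 = Σy
m = (N*B - Xs*Ys) / (N*A - Xs**2)     # equation (iv)
b = (Ys - m*Xs) / N                   # equation (iii)
print(m, b)                           # ≈ 1.518 and 0.305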
• The strong α-cut of a fuzzy set A is the crisp set that contains all the elements of the universal set X whose membership grades in A are strictly greater than the specified value of α. Please note that an α-cut is a crisp set.
Fuzzy Set Operations
Given two fuzzy sets, A and B, their standard intersection, union and complement are defined for all x ∈ X by the equations:
μ_A∪B(x) = max[μ_A(x), μ_B(x)]
μ_A∩B(x) = min[μ_A(x), μ_B(x)]
μ_Ā(x) = 1 − μ_A(x)
All the basic set operation rules – commutativity, associativity, distributivity, De Morgan's laws – are valid, except:
• A ∩ Ā ≠ ∅, and
• A ∪ Ā ≠ X
1. Support(A) of a fuzzy set A is the set of all points x ∈ X such that μ_A(x) > 0
2. Core(A) of a fuzzy set A is the set of all points x ∈ X such that μ_A(x) = 1
3. Singleton: A fuzzy set A whose support is a single point x in X with μ_A(x) = 1 is called a fuzzy singleton
Fuzzy Relations
A crisp relation holds either fully or not at all; a fuzzy relation assigns membership grades, as in the matrix below:
      M    D    A    E    N
 5    1    0.3  0    0    0
11    0.3  1    0.2  0    0
17    0    0.1  0.5  1    0.2
23    0    0    0    0.2  1
Fuzzy Expert System
• Most applications of fuzzy logic use it as the underlying logic system for decision support systems.
Step-1: Fuzzification (of the crisp inputs, e.g., Funding X and Staffing Y)
Step-2: Rule Evaluation, e.g., over the output sets Small (B1) and Large (B2):
  μ(y = B1) = 0.1, μ(y = B2) = 0.7
Step-3: Defuzzification (Centroid Method), e.g.:
  k = (0.1 × 30 + 0.7 × 40 + 0.5 × 30) / (30 + 40 + 30) = 0.46
Some Statistical Concepts
Random Variable
Random Variable: a quantitative variable whose values are results of a random process.
• Discrete Random Variable: one whose possible values are finite or countably infinite (e.g., 1, 2, 3, …)
• Continuous Random Variable: one whose values constitute an entire (infinite) range over an interval
Probability distribution of a random variable tells all of its possible values along with their associated
probabilities, for example – if X is the random variable representing the number of tails in two tosses of a coin,
then the probability distribution of X is
X = no. of tails 0 1 2
Probability 0.25 0.5 0.25
A probability distribution function must satisfy the following axioms:
i. 0 ≤ 𝑃 𝑋 = 𝑥 ≤ 1
ii. 𝑃 𝑋 = 𝑥1 + 𝑃 𝑋 = 𝑥2 + ⋯ 𝑃 𝑋 = 𝑥𝑛 = 1, 𝑤ℎ𝑒𝑟𝑒 𝑥1 , 𝑥2 , … , 𝑥𝑛 are the all possible values of X
iii. If the events are independent, then 𝑃 𝑋 = 𝑥1 𝐴𝑁𝐷 𝑃 𝑋 = 𝑥2 = 𝑃 𝑋 = 𝑥1 × 𝑃(𝑋 = 𝑥2 )
The conditional probability of an event B, assuming that an event A has occurred, is denoted by P(B | A) and is defined as
P(B | A) = P(A ∩ B) / P(A)  ⇒  P(A ∩ B) = P(A) × P(B | A)
Probability Mass Function and Probability Density Function
If the random variable always takes values from {0, 1, 2, 3, …, n}, then it is called a discrete random variable, e.g.:
• The number of printing mistakes in each page of a book
• The number of telephone calls received by a telephone operator, etc.
If X is a discrete random variable taking values x₁, x₂, …, xₙ, then the function P(xᵢ) = P(X = xᵢ) is called the Probability Mass Function.
The function F(x) = P(X ≤ x) is the cumulative distribution function of X.
P(a < X ≤ b) = F(b) − F(a)
P(a ≤ X ≤ b) = F(b) − F(a) + P(X = a)
P(a < X < b) = F(b) − F(a) − P(X = b)
P(a ≤ X < b) = F(b) − F(a) − P(X = b) + P(X = a)
If the random variable takes on all values within a certain interval, then the random variable is called a continuous random variable, e.g.:
• The height, age and weight of individuals
• The amount of rainfall on a rainy day
If X is a continuous random variable, then f(x) is called the Probability Density Function, and it satisfies the conditions:
1. f(x) ≥ 0, ∀x
2. ∫_{−∞}^{∞} f(x) dx = 1
The cumulative distribution function F(x) is defined as F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt, −∞ < x < ∞
P(a ≤ X ≤ b) = F(b) − F(a)
dF(x)/dx = f(x), when X is a one-dimensional variable
Mathematical Expectation
Let X be a random variable having a probability distribution function f(x). The expected value of the random variable is the arithmetic mean of the random variable.
If X is a discrete type of random variable: E[u(X)] = ∑ₓ u(x) f(x)
If X is a continuous type of random variable: E[u(X)] = ∫_{−∞}^{∞} u(x) f(x) dx
1. Mean of the random variable is E[X] = μ_X
2. If a is constant, E[a] = a
3. If a and b are constants, E[aX ± b] = a E[X] ± b
4. E[f(X) ± g(X)] = E[f(X)] ± E[g(X)]
5. Variance of the random variable: σ_X² = E[(X − μ_X)²] = E[X²] − E[X]²
6. If a is constant, Var[a] = 0
7. Var[aX ± b] = a² Var[X]
Example: In a gambling game a man is paid Rs. 5 if he gets all heads or all tails when three coins are tossed, and he will pay out Rs. 3 if either one or two heads show. What is his expected gain?
The win is Rs. 5 if event E1 = {HHH, TTT} occurs, and the loss is Rs. 3 if event E2 = {HHT, HTH, THH, HTT, THT, TTH} occurs.
Since E1 and E2 occur with probabilities 1/4 and 3/4 respectively, it follows that
μ = E[X] = ∑ₓ x p(x) = 5 × (1/4) + (−3) × (3/4) = −1
In this game, the gambler will on average lose Rs. 1 per toss of the three coins.
Example
Suppose that the number of cars X that pass through a car wash between 4.00p.m and 9.00p.m. has been
observed for 60 days and the result is the following distribution:
No. of Cars 4 5 6 7 8 9
No. of Days 5 5 15 15 10 10
Let g(X) = 2X – 1 represent the amount of money in rupees, paid to the attendant by the manager. Find the
attendant’s expected earnings for this particular time period.
x 4 5 6 7 8 9
P(X = x) 1/12 1/12 1/4 1/4 1/6 1/6
E[g(X)] = E[2X − 1] = ∑_{x=4}^{9} (2x − 1) p(x)
        = 7(1/12) + 9(1/12) + 11(1/4) + 13(1/4) + 15(1/6) + 17(1/6) ≈ Rs. 12.67
For a normal distribution with mean μ and standard deviation σ:
∫_{μ−σ}^{μ+σ} (1/(σ√(2π))) e^{−½((x−μ)/σ)²} dx = 0.68
∫_{μ−2σ}^{μ+2σ} (1/(σ√(2π))) e^{−½((x−μ)/σ)²} dx = 0.95
∫_{μ−3σ}^{μ+3σ} (1/(σ√(2π))) e^{−½((x−μ)/σ)²} dx = 0.997
That is, about 68% of the probability mass lies within one σ of the mean, 95% within 2σ, and 99.7% within 3σ.
Standard Normal Distribution
A standard normal distribution has mean 0 and S.D. 1. All normal distributions can be converted into the
standard normal curve by subtracting the mean and dividing by the standard deviation: 𝒁 = (𝑿 − 𝝁)/𝝈
An Example
A taxi company has divided the city into three
regions – Northside, Downtown, and
Southside. By keeping track of pickups and
deliveries, the company has found that of the
fares picked up in Northside, 50% stay in that
region, 20% are taken Downtown, and 30%
go to Southside. Of the fares picked up
Downtown, only 10% go to Northside, 40%
stay Downtown, and 50% go to Southside. Of
the fares picked up in Southside, 30% go to
each of Northside and Downtown, while 40%
stay in Southside.
These probabilities are constant and independent of previous
We would like to know what the distribution behavior – this memory-lessness of the system is called the
of taxis will be over time as they pick up and Markov property. We assume that a transition – picking up and
drop off successive fares. Suppose we want dropping off a fare – occurs each time the system is observed,
to know the probability that a taxi starting off and that observations occur at regular intervals. Systems with
Downtown, will be Downtown after letting these characteristics are called Markov chains or Markov
off its seventh fare? processes.
Transition Matrix & Computing the probabilities
From the percentages above (rows: picked up in Northside, Downtown, Southside; columns: dropped off in Northside, Downtown, Southside):
T = | 0.5  0.2  0.3 |
    | 0.1  0.4  0.5 |
    | 0.3  0.3  0.4 |
The ijth entry in the matrix Tⁿ gives the probability of being in state j after n transitions, with state i as the initial state (e.g., the Downtown-to-Downtown entry of T³ is 0.309).
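The powers of T are easy to check numerically; this sketch raises T to successive powers and reads off the Downtown-to-Downtown entry:
import numpy as np

T = np.array([[0.5, 0.2, 0.3],   # from Northside
              [0.1, 0.4, 0.5],   # from Downtown
              [0.3, 0.3, 0.4]])  # from Southside
for n in (1, 3, 7):
    Tn = np.linalg.matrix_power(T, n)
    print(n, Tn[1, 1])  # Downtown -> Downtown: 0.4, 0.309, then ≈ 0.30 (steady state)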
INTRODUCTION TO LINEAR ALGEBRA
Fundamentals
1. What is a matrix? – solution to f(x) = y, linear approximation of the
function ‘f’
2. Matrix Addition
3. Vector
1. What is a vector? – A point or the position vector in n dimensional
space
2. Scalar times Vector
4. Vector Norm – Lp norm
1. Properties of Norm – Non-negativity, Definiteness, Triangle
Inequality, Homogeneity
2. L1 norm (Manhattan / Taxicab norm)
• To distinguish between zero and non-zero, measure of sparsity
3. L2 norm (Euclidean norm)
• Not robust to outliers
4. Infinity norm (Max norm)
5. L0 norm – no. of non-zero entries in x
• Strictly speaking it's not a norm because it is not homogeneous
Matrix Products
1. Products of two vectors
1. Hadamard product
2. Dot Product (Inner product) – Projection – Feed
forward neuron with multiple input weights
3. Multiplication (Outer product) – size
compatibility
2. Matrix times a vector – Y = WX – Rotation as well as
scaling
1. Inner product representation – yi is the dot product of
ith row of W with x ➔ Each row is aligned with a class
2. Outer product representation – Sum of the columns of W weighted by the corresponding x ➔ The entire space is spanned by the vector addition (if the vectors are independent)
3. Rank of a matrix – A matrix whose columns can’t span the
whole space – number of independent columns
Two Layer Linear Network
Inner Product View Outer Product View
3. Inverse of a matrix
INTRODUCTION TO APPLIED ARTIFICIAL INTELLIGENCE
1. Introduction to AI
   a. Overview
   b. Intelligence
   c. Learning
2. Pre-requisites
   a. Linear Algebra
   b. Probability Theory, Bayes Theorem
   c. Calculus and Optimization
   d. Statistical Concepts: Random Variables, Naïve Bayes, Markov Process, Regression
3. Machine Learning
   a. Unsupervised Learning: Clustering, PCA
   b. Supervised learning: KNN, LR, Decision Trees, SVM
4. Neural Networks and Deep Learning
   a. MLP and Back Propagation
   b. Deep learning:
      i. CNN
      ii. RNN and LSTM
      iii. Autoencoders
      iv. GAN
5. Applications of AI
   a. Natural Language Processing & Sentiment Analysis
   b. Machine Vision & Image Processing:
      i. Image Classification
      ii. Image Segmentation
      iii. Object Recognition
Textbooks:
1. Artificial Intelligence Illuminated – Ben Coppin (2004)
2. Deep Learning – Ian Goodfellow, Yoshua Bengio, Aaron Courville (2017)
3. Computer Vision: Algorithms and Applications – Richard Szeliski (2010)
4. Foundations of Statistical Natural Language Processing – Christopher D. Manning, Hinrich Schuetze (1999)
OVERVIEW
UNIT-1
What is Intelligence?
According to Britannica - human intelligence is the ability to learn from experience, adapt to new
situations, understand and handle abstract concepts, and use knowledge to manipulate one’s
environment.
It encompasses, among other components, fluid intelligence – the ability to see complex relationships and solve problems (Robert Sternberg, 1988).
Multiple Intelligences (Howard Gardner, 1983)
Intelligence Type | Characteristics | Representative Career
Linguistic intelligence | Perceives different functions of language, meanings of words, may easily learn multiple languages | Journalist, novelist, poet, teacher
Logical-mathematical intelligence | Capable of seeing numerical patterns, strong ability to use reason and logic | Scientist, mathematician
Musical intelligence | Understands rhythm, pitch, and tone; may play multiple instruments or perform as a vocalist | Composer, performer
Bodily kinaesthetic intelligence | High ability to control the movements of the body and use the body to perform various physical tasks | Dancer, athlete, athletic coach, yoga instructor
Spatial intelligence | Ability to perceive the relationship between objects and how they move in space | Choreographer, sculptor, architect, aviator, sailor
Interpersonal intelligence | Ability to understand and be sensitive to the various emotional states of others | Counselor, social worker, salesperson
Intrapersonal intelligence | Ability to access personal feelings and motivations, and use them to direct behaviour | Key component of personal success
Naturalist intelligence | High capacity to appreciate the natural world and interact with the species within it | Biologist, ecologist, environmentalist
What is Artificial Intelligence (AI)
According to the father of Artificial Intelligence, John McCarthy:
APPLICATIONS OF AI
Artificial Intelligence techniques: Speech Recognition, Vision Systems, Natural Language Processing, Expert Systems.
Applications: Personalized Shopping, Autonomous Vehicles, Gaming, Spam Detection / Filtering, Facial Recognition, Weather Prediction, Assisted Diagnosis, Targeted Marketing, Investment Planning.
Intelligent Agents
An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.
• Human agent: eyes, ears, and other organs for sensors; hands, legs, mouth, and other body parts for actuators
• Robotic agent: cameras and infrared range finders for sensors; various motors for actuators
• The agent function maps from percept histories to actions: [f: P* → A]
• The agent program runs on the physical architecture to produce f
• agent = architecture + program
Rational Agents
For each possible percept sequence, a rational agent should select an action that is expected to
maximize its performance measure, given the evidence provided by the percept sequence and
whatever built-in knowledge the agent has.
Environment Types
• Fully vs. partially observable - When an agent's sensors are capable of sensing or accessing the complete state of the environment at each point in time, the environment is fully observable, e.g., chess or image recognition; driving or weather prediction is partially observable.
• Deterministic vs. stochastic - If an agent's current state and selected action can completely determine the
next state of the environment, then such environment is called a deterministic environment. A stochastic
environment is random in nature and cannot be determined completely by an agent. In a deterministic,
fully observable environment, agent does not need to worry about uncertainty.
• Episodic vs. sequential - In an episodic environment, there is a series of one-shot actions, and only the
current percept is required for the action. However, in Sequential environment, an agent requires
memory of past actions to determine the next best actions.
• Static vs. dynamic - If the environment can change itself while an agent is deliberating then such
environment is called a dynamic environment else it is called a static environment. Chess is a static
environment whereas taxi driving is dynamic.
• Discrete vs. continuous - If in an environment there are a finite number of actions that can be performed,
then such environment is a discrete environment else it is called continuous environment.
• Competitive vs. collaborative - An agent is said to be in a competitive environment when it competes
against another agent to optimize the output, whereas in a collaborative environment when multiple
agents cooperate to produce the desired output. The game of chess is competitive but “Age-of-empires”
is collaborative.
Intelligence
Reasoning
INDUCTIVE REASONING
• Inductive reasoning is a bottom-up approach.
• It takes us from the specific to the general: we make many observations, discern a pattern, make a generalization, and infer an explanation or a theory.
Examples:
Data: Every dog I meet is friendly. Hypothesis: Most dogs are usually friendly.
Data: I tend to catch colds when people around me are sick. Hypothesis: Colds are infectious.

DEDUCTIVE REASONING
• Deductive reasoning is a top-down approach.
• It takes us from the theory to the observations: there is a first premise, then a second premise, and finally an inference.
Examples:
Major premise: All mammals have backbones. Minor premise: Humans are mammals. Conclusion: Humans have backbones.
Major premise: All spiders have eight legs. Minor premise: A tarantula is a spider. Conclusion: Tarantulas have eight legs.
Learning
HUMAN LEARNING
• Auditory Learning: Students listening to recorded audio lectures
• Episodic Learning: Fear of water after a near-drowning
• Motor Learning: Learning the forehand chop in table-tennis
• Observational Learning: Mimicry
• Perceptual Learning: Distinguish between different odours
• Relational Learning: Adding a ‘little less’ salt
• Spatial learning: Mind map
• Stimulus-Response Learning: Reacting to a horror movie
Perception
• Perception is the process by which sensory information captured from the real world is acquired, interpreted, selected, and then organized. Perception and action are critical concepts in robotics.
• Our senses (taste, sight, touch, smell, and hearing) provide patterns to the appropriate part of our brain in order for it to generate perceptions.
• Based on the information received, action is taken by interacting with the environment to manipulate and navigate objects.
Problem Solving
• Root Cause Analysis
• Algorithms: Regression Algorithms, Instance-Based Algorithms, Decision Tree Algorithms, Clustering Algorithms, Association Rule Learning Algorithms (each sketched in the sections that follow)
• Performance Measure: Completeness, Optimality (defined below)

Algorithm
• An automated solution to a problem
• Deterministic and proven to grant an optimal result
• Contains a finite set of instructions to solve a problem

Heuristics
• Arbitrary choices or educated guesses
• No proof of correctness; may not grant optimal results
• Applied to improve the running time of algorithms
--------------------------------------------------------------------
• Completeness - Measures the algorithm's guarantee of finding a solution to the problem if any solution exists.
• Optimality - Measures whether the search strategy finds an optimal solution to the given problem.
• Time Complexity - Measures the amount of time the algorithm takes to reach the solution to the given problem.
• Space Complexity - Measures the amount of space (in memory) the algorithm requires in order to perform the search.
Regression
Regression is a statistical technique to determine the relationship between a dependent (outcome) variable and one or more independent (predictor) variables.
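A minimal scikit-learn sketch in the same style as the programs earlier in this document (the data points are invented for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: hours studied (predictor) vs. exam score (outcome)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([52, 57, 61, 68, 73])

model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)  # slope and intercept of the fitted line
print(model.predict([[6]]))           # predicted score for 6 hours of study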
Instance Based Algorithms
A family of learning algorithms that, instead of performing explicit generalization, compare new data points with instances seen in training, which have been stored in memory, using some measure of distance or similarity.

[Figure: RBF Network]
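The k-nearest-neighbour classifier discussed earlier in this document is the classic instance-based method; a minimal scikit-learn sketch (the height/weight instances are invented for illustration):

from sklearn.neighbors import KNeighborsClassifier

# Stored training instances: [height_cm, weight_kg] and a T-shirt size label
X_train = [[158, 59], [160, 60], [163, 61], [165, 62], [168, 63], [170, 64]]
y_train = ['M', 'M', 'M', 'L', 'L', 'L']

# No explicit generalization: the model stores the instances and, at
# prediction time, measures distance to the k closest stored points
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[161, 61]]))  # label voted by the 3 nearest neighbours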
Decision Tree Algorithms
[Figure: a small decision tree. The root tests COLOR (Yellow / Reddish / Green); lower nodes test SOFTNESS, SMELL, shape (Oval / Round), and WEIGHT (< 750 g / > 750 g), leading to YES/NO leaves.]
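Read as code, such a tree is just a set of nested attribute tests. The sketch below hand-codes a tree of the same general shape as the figure; the branch outcomes are assumptions for illustration, since parts of the figure are not recoverable:

def classify(color, softness, smell, weight_g):
    # Each internal node tests one attribute; each leaf returns YES or NO.
    # The thresholds and outcomes below are illustrative assumptions.
    if color == 'Yellow':
        return 'YES' if softness == 'Soft' else 'NO'
    elif color == 'Reddish':
        return 'YES' if weight_g > 750 else 'NO'
    else:  # Green
        return 'YES' if smell == 'Sweet' else 'NO'

print(classify('Reddish', 'Hard', 'Sweet', 800))  # prints 'YES'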
Clustering Algorithms
• Clustering (cluster analysis) is grouping objects based on similarities.
• It is an unsupervised method, where the inputs are not labelled.

Some common applications for clustering include the following:
• market segmentation
• social network analysis
• medical imaging
• image segmentation
• anomaly detection

[Figure: centroid-based, density-based, distribution-based, and hierarchical clustering]
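A minimal centroid-based (k-means) sketch with scikit-learn; the 2-D points are invented for illustration:

import numpy as np
from sklearn.cluster import KMeans

# Unlabelled 2-D points forming two loose groups
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [10, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1)
labels = kmeans.fit_predict(X)  # cluster index assigned to each point
print(labels)
print(kmeans.cluster_centers_)  # the two learned centroids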
Association Rule Learning Algorithm
• Association rule learning is a rule-based data mining technique that finds important relations between variables or features in a data set.
• Association rules are created for finding information about frequent if-then patterns using specific criteria.
• Various algorithms are used; some of them are:
  • Apriori algorithm
  • ECLAT algorithm
  • FP-Growth algorithm
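A minimal Apriori sketch, assuming the third-party mlxtend package is installed (pip install mlxtend); the transactions are invented for illustration:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Illustrative market-basket transactions
transactions = [['milk', 'bread'], ['milk', 'bread', 'butter'],
                ['bread', 'butter'], ['milk', 'butter'],
                ['milk', 'bread', 'butter']]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

# Frequent itemsets above a minimum support, then if-then rules from them
itemsets = apriori(df, min_support=0.6, use_colnames=True)
rules = association_rules(itemsets, metric='confidence', min_threshold=0.8)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])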
Artificial Neural Network Algorithms
Deep Learning
• Deep learning networks learn by discovering intricate structures in the data they experience.
Search Algorithms
Uninformed Search Algorithms:
• Breadth-First Search
• Depth-First Search
• Depth-Limited Search
• Iterative Deepening Depth-First Search

An uninformed search algorithm has no domain knowledge, such as the closeness or location of the goal state, so it behaves in a brute-force way. It only knows how to traverse the given tree and how to recognise the goal state. These algorithms are also known as blind search or brute-force algorithms.
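A minimal breadth-first search sketch over an explicit graph (the graph itself is invented for illustration):

from collections import deque

# Illustrative state-space graph as an adjacency list
graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['E'],
         'D': ['G'], 'E': ['G'], 'G': []}

def bfs(start, goal):
    # Expand the shallowest unexpanded node first (FIFO queue of paths);
    # complete, and optimal when all step costs are equal.
    frontier = deque([[start]])
    explored = set()
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        if node not in explored:
            explored.add(node)
            for neighbour in graph[node]:
                frontier.append(path + [neighbour])
    return None

print(bfs('A', 'G'))  # prints ['A', 'B', 'D', 'G']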