Ai Combined Update

The document discusses Term Frequency-Inverse Document Frequency (TF-IDF), a numerical statistic that reflects how important a term is to a document in a collection. It is obtained by multiplying two statistics: Term Frequency (TF), which measures how frequently a term occurs in a document, and Inverse Document Frequency (IDF), which measures how rare a term is across documents. A higher TF-IDF score indicates that the term is more important to that document within the corpus. An example of TF and IDF calculations for some terms is provided.


Q) Write a program to implement a Decision Tree algorithm in Python (or in Java or C++). Test it on the attached data set and print the decision tree and the performance. The CSV file contains the data and the txt file contains a description of the data.

SOURCE CODE: -
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report


# Function importing the dataset
def importdata():
    balance_data = pd.read_csv(
        'https://archive.ics.uci.edu/ml/machine-learning-' +
        'databases/balance-scale/balance-scale.data',
        sep=',', header=None)

    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)

    # Printing the dataset observations
    print("Dataset: ", balance_data.head())
    return balance_data


# Function to split the dataset
def splitdataset(balance_data):
    # Separating the target variable
    X = balance_data.values[:, 1:5]
    Y = balance_data.values[:, 0]

    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)

    return X, Y, X_train, X_test, y_train, y_test


# Function to perform training with the Gini index
def train_using_gini(X_train, X_test, y_train):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini",
                                      random_state=100, max_depth=3,
                                      min_samples_leaf=5)

    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini


# Function to perform training with entropy
def train_using_entropy(X_train, X_test, y_train):
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100,
        max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy


# Function to make predictions
def prediction(X_test, clf_object):
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred


# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))
    print("Accuracy : ",
          accuracy_score(y_test, y_pred) * 100)
    print("Report : ",
          classification_report(y_test, y_pred))


# Driver code
def main():
    # Building Phase
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = train_using_entropy(X_train, X_test, y_train)

    # Operational Phase
    print("Results Using Gini Index:")
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)

    print("Results Using Entropy:")
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)


# Calling the main function
if __name__ == "__main__":
    main()
TREE STRUCTURE:-

|--- Left weight <= 2.50


| |--- Right distance <= 1.50
| | |--- Left distance <= 2.50
| | | |--- class: R
| | |--- Left distance > 2.50
| | | |--- class: L
| |--- Right distance > 1.50
| | |--- Right weight <= 2.50
| | | |--- class: R
| | |--- Right weight > 2.50
| | | |--- class: R
|--- Left weight > 2.50
| |--- Left distance <= 2.50
| | |--- Right weight <= 2.50
| | | |--- class: L
| | |--- Right weight > 2.50
| | | |--- class: R
| |--- Left distance > 2.50
| | |--- Right distance <= 3.50
| | | |--- class: L
| | |--- Right distance > 3.50
| | | |--- class: L
TREE VISUALIZED:-
Q.1) Write a Python program to implement the gradient descent algorithm and use it to find the value of x for which x^2 - 4x + 6 is minimum.
Ans:-
INPUT CODE:-
import numpy as np

current_guess = 10       # starting point
alpha = 0.16             # learning rate
total_iteration = 50     # max iterations
current_iteration = 0
tolerance = 0.0001       # stop once successive guesses are this close
diff = float('inf')      # initialize the difference as maximal
delta = 0
momentum = 0.2

def derivative(x):
    # f(x) = x^2 - 4x + 6, so f'(x) = 2x - 4
    return 2 * x - 4

while diff > tolerance and current_iteration < total_iteration:
    previous_guess = current_guess
    delta = -alpha * derivative(current_guess) + momentum * delta
    current_guess = previous_guess + delta
    current_iteration = current_iteration + 1
    diff = abs(current_guess - previous_guess)
    print(f'Iteration: {current_iteration}/{total_iteration}')
    print(f'x: {current_guess:.4f}\tdiff: {diff:.4f}')

OUTPUT:-
Q.2) Please develop an MLP for the 150 data points of the Iris dataset using 5-fold cross validation. The MLP consists of 1 hidden layer. Plot the error chart and final accuracy.

Ans:-
INPUT CODE:-

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

iris = load_iris()
X = iris.data
y = iris.target

kf = KFold(n_splits=5)
clf = MLPClassifier(solver='adam', hidden_layer_sizes=(100,),
                    random_state=1, max_iter=800)

scores = []
for train_indices, test_indices in kf.split(X):
    clf.fit(X[train_indices], y[train_indices])
    scores.append(clf.score(X[test_indices], y[test_indices]))

accuracy = np.mean(scores)
print(accuracy)

plt.plot(range(1, 6), scores)
plt.xlabel('FOLDS')
plt.ylabel('ACCURACY')
plt.title('ERROR CHART')
plt.show()

OUTPUT:-
How KNN algorithm works
Suppose we have the height, weight and T-shirt size of some customers, and we need to predict the T-shirt size of a new customer given only their height and weight. The data, including height, weight and T-shirt size, is shown below:

Height (in cms)   Weight (in kgs)   T-Shirt Size
158               58                M
158               59                M
158               63                M
160               59                M
160               60                M
163               60                M
163               61                M
160               64                L
163               64                L
165               61                L
165               62                L
165               65                L
168               62                L
168               63                L
168               66                L
170               63                L
170               64                L
170               68                L

Step 1: Calculate similarity based on a distance function

There are many distance functions, but Euclidean is the most commonly used measure. It is mainly used when data is continuous. Manhattan distance is also very common for continuous variables.

Distance Functions

The idea of using a distance measure is to find the distance (similarity) between the new sample and the training cases, and then find the k closest customers to the new customer in terms of height and weight.

The new customer, named 'Monica', has height 161 cm and weight 61 kg.

The Euclidean distance between the first observation and the new observation (Monica) is:

= SQRT((161-158)^2 + (61-58)^2) = SQRT(18) ≈ 4.24

Similarly, we calculate the distance of all the training cases to the new case and rank them by distance. The smallest distance value is ranked 1 and considered the nearest neighbor.

Step 2: Find the K nearest neighbors

Let k be 5. Then the algorithm searches for the 5 customers closest to Monica, i.e. most similar to Monica in terms of attributes, and checks which categories those 5 customers were in. If 4 of them had a 'Medium' T-shirt size and 1 had a 'Large' T-shirt size, then the best guess for Monica is 'Medium'. See the calculation shown in the snapshot below, and the code sketch after the figure.

Calculate KNN manually

In the graph below, the binary dependent variable (T-shirt size) is displayed in blue and orange: 'Medium T-shirt size' in blue and 'Large T-shirt size' in orange. The new customer's information is shown as a yellow circle. Four blue highlighted data points and one orange highlighted data point are close to the yellow circle, so the prediction for the new case is 'Medium' T-shirt size.

KNN: Visual Representation
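To make the nearest-neighbor calculation concrete, here is a minimal Python sketch (not part of the original write-up) that computes the Euclidean distances from Monica to every training case in the table above and takes a majority vote among the k = 5 nearest neighbors:

import numpy as np
from collections import Counter

# Training data from the table: (height in cm, weight in kg) -> T-shirt size
heights = [158, 158, 158, 160, 160, 163, 163, 160, 163,
           165, 165, 165, 168, 168, 168, 170, 170, 170]
weights = [58, 59, 63, 59, 60, 60, 61, 64, 64,
           61, 62, 65, 62, 63, 66, 63, 64, 68]
sizes = ['M', 'M', 'M', 'M', 'M', 'M', 'M', 'L', 'L',
         'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L']

X = np.column_stack([heights, weights])
monica = np.array([161, 61])

# Euclidean distance from Monica to every training case
dists = np.sqrt(((X - monica) ** 2).sum(axis=1))

# Indices of the 5 nearest neighbors, then majority vote
k = 5
nearest = np.argsort(dists)[:k]
votes = Counter(sizes[i] for i in nearest)
print(votes.most_common(1)[0][0])  # prints 'M' (Medium)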


1 Term Frequency and Inverse Document Frequency

Term Frequency: tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.

Inverse Document Frequency: estimates the rarity of a term in the whole document collection. (If a term occurs in all the documents of the collection, its IDF is zero.)

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

with |D|: cardinality of D, or the total number of documents in the corpus, and |{j : t_i ∈ d_j}|: the number of documents where the term t_i appears (viz. the document frequency), that is, those with n_{i,j} ≠ 0. If the term is not in the corpus, this will lead to a division by zero; it is therefore common to use 1 + |{j : t_i ∈ d_j}| instead.

Example: |D| = 1,000,000, so idf_t = log10(1,000,000 / df_t).

term        df_t       idf_t
calpurnia   1          6
animal      100        4
sunday      1000       3
fly         10,000     2
under       100,000    1
the         1,000,000  0

Tf-idf: The tf-idf weight of a term is the product of its tf weight and its idf weight.

Normalized tf: The tf count is usually normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of that term in the document), giving a measure of the importance of the term t_i within the particular document d_j:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} is the number of occurrences of the considered term t_i in document d_j, and the denominator is the sum of the number of occurrences of all terms in document d_j, that is, the size of the document |d_j|.

Alternative: tf_{t,d} / max tf_d, where max tf_d is the maximum term frequency within the document.

Exercise

Given a document with the terms A, B and C with the following frequencies A: 3, B: 2, C: 1

The document belongs to a collection of 10,000 docs. The document frequencies are: A: 50, B:1300, C:250.

Compute the normalized tf and the tf-idf and compare them. You could also check the effect of using normalized tf measures. The idf values are as below:
A idf = log(10000/50) = 5.3;
B idf = log(10000/1300) = 2.0;
C idf = log(10000/250) = 3.7

Results:
A tf = 3/3; idf = log(10000/50) = 5.3; tf-idf=5.3
B tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf=1.3
C tf = 1/3; idf = log(10000/250) = 3.7; tf-idf=1.2
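As a quick check, here is a minimal Python sketch that reproduces these numbers up to rounding; it uses the natural log (which matches the idf values 5.3, 2.0, 3.7) and the max-tf normalization used in the results:

import math

N = 10000                            # collection size
counts = {'A': 3, 'B': 2, 'C': 1}    # raw term frequencies in the document
df = {'A': 50, 'B': 1300, 'C': 250}  # document frequencies

max_tf = max(counts.values())
for term, n in counts.items():
    tf = n / max_tf                  # normalized tf (max-tf variant)
    idf = math.log(N / df[term])
    print(term, round(tf, 2), round(idf, 1), round(tf * idf, 1))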

Recall: The logarithm of a number y with respect to base b is the exponent to which b has to be raised in order
to yield y. In other words, the logarithm of y to base b is the solution x of the equation

b^x = y

Exercise 1: Given the tables below

tf:
term        Doc1   Doc2   Doc3
car         27     4      24
auto        3      33     0
insurance   0      33     29
best        14     0      17

idf:
term        df_t     idf_t
car         18,165   1.65
auto        6723     2.08
insurance   19,241   1.62
best        25,235   1.5

Compute the tf-idf weights for the terms in the tables for each document.

Tf normalization: Take the values in the tf table above and replace them with normalized tf weights. Compute the tf-idf again and compare the results.

2 Similarity Measures
• Jaccard

jaccard(A, B) = |A ∩ B| / |A ∪ B|

• Cosine Similarity

cos(x, y) = (x · y) / (|x||y|) = Σ_{i=1}^{n} x_i y_i / ( √(Σ_{i=1}^{n} x_i²) · √(Σ_{i=1}^{n} y_i²) )

with |x| = √(Σ_{i=1}^{n} x_i²).

Exercise 3: Given the tf for terms in the three novels "Sense and Sensibility" (SaS) and "Pride and Prejudice" (PaP) by Austen, and "Wuthering Heights" (WH):

term       SaS   PaP   WH
affection  115   58    20
jealous    10    7     11
gossip     2     0     6

Weighted terms:

term       SaS     PaP     WH
affection  0.996   0.993   0.847
jealous    0.087   0.120   0.466
gossip     0.17    0       0.254

Compute the cosine similarity and the Jaccard measure between SaS-PaP and SaS-WH.

Results: cos-sim(SaS,PaP) = 0.999 vs. cos-sim(SaS,WH)=0.888.
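Here is a minimal sketch of both measures applied to the weighted vectors above; the Jaccard measure is taken over the sets of terms with nonzero frequency, and the exact cosine values may differ slightly from the quoted results depending on rounding and normalization:

import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def jaccard(A, B):
    return len(A & B) / len(A | B)

# Weighted term vectors in the order (affection, jealous, gossip)
sas = [0.996, 0.087, 0.17]
pap = [0.993, 0.120, 0.0]
wh = [0.847, 0.466, 0.254]

print(cosine(sas, pap), cosine(sas, wh))

# Term sets with nonzero raw frequency in each novel
print(jaccard({'affection', 'jealous', 'gossip'}, {'affection', 'jealous'}))            # SaS-PaP
print(jaccard({'affection', 'jealous', 'gossip'}, {'affection', 'jealous', 'gossip'}))  # SaS-WH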

3 Evaluation Measures

Accuracy: Percentage of documents correctly classified by the system.

Error Rate: Complement of accuracy; the percentage of documents wrongly classified by the system.

Precision: Percentage of relevant documents correctly retrieved by the system (TP) with respect to all documents retrieved by the system (TP + FP). (How many of the retrieved books are relevant?)

Recall: Percentage of relevant documents correctly retrieved by the system (TP) with respect to all documents relevant for the human (TP + FN). (How many of the relevant books have been retrieved?)

F-Measure: Combines Precision (P) and Recall (R) in a single measure, giving a global estimate of the performance of an IR system.
                 Relevant               Not Relevant
Retrieved        True Positive (TP)     False Positive (FP)
Not retrieved    False Negative (FN)    True Negative (TN)

Accuracy   = (TP + TN) / (TP + TN + FP + FN)

Error Rate = (FP + FN) / (TP + TN + FP + FN)

Precision  = TP / (TP + FP)

Recall     = TP / (TP + FN)

F          = 2PR / (R + P)

Exercise: An IR system returns eight relevant documents and ten non-relevant documents. There are a total of twenty relevant documents in the collection. What is the precision of the system on this search, and what is its recall? Calculate the above measures for the following IR systems (a code sketch for these computations follows the tables):

System 1:
                 Relevant   Not Relevant
Retrieved        40         0
Not retrieved    50         10

System 2:
                 Relevant   Not Relevant
Retrieved        40         50
Not retrieved    0          10

System 3:
                 Relevant   Not Relevant
Retrieved        40         25
Not retrieved    25         10
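A minimal sketch that evaluates the measures above for the first question (TP = 8, FP = 10, FN = 20 − 8 = 12) and for the three systems:

def measures(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {'accuracy': (tp + tn) / total,
            'error rate': (fp + fn) / total,
            'precision': precision,
            'recall': recall,
            'F': 2 * precision * recall / (precision + recall)}

# First question: TN is unknown, so only precision and recall are meaningful
print(8 / (8 + 10), 8 / 20)  # precision = 0.44, recall = 0.40

# The three systems, as (TP, FP, FN, TN)
for tp, fp, fn, tn in [(40, 0, 50, 10), (40, 50, 0, 10), (40, 25, 25, 10)]:
    print(measures(tp, fp, fn, tn))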

ECE595 / STAT598: Machine Learning I
Lecture 18 Multi-Layer Perceptron

Spring 2020

Stanley Chan

School of Electrical and Computer Engineering


Purdue University


Outline
Discriminative Approaches
Lecture 16 Perceptron 1: Definition and Basic Concepts
Lecture 17 Perceptron 2: Algorithm and Property
Lecture 18 Multi-Layer Perceptron: Back Propagation

This lecture: Multi-Layer Perceptron: Back Propagation


Multi-Layer Perceptron
Hidden Layer
Matrix Representation
Back Propagation
Chain Rule
4 Fundamental Equations
Algorithm
Interpretation

Single-Layer Perceptron

Input neurons x
Weights w
Predicted label = σ(w^T x + w_0).
Multi-Layer Network

https://towardsdatascience.com/
multi-layer-neural-networks-with-sigmoid-function-deep-learning-for-rookies-2-bf464f09eb7f

Introduce a layer of hidden neurons


So now you have two sets of weights: from input to hidden, and from hidden to output.
Many Hidden Layers

You can introduce as many hidden layers as you want.


Every time you add a hidden layer, you add a set of weights.


Understanding the Weights

Each hidden neuron is an output of a perceptron


So you will have, for each hidden neuron j = 1, . . . , n_1,

h^1_j = Σ_{k=1}^{m} w^1_{kj} x_k,   i.e.   h^1 = W_1^T x,

where W_1 is the m × n_1 matrix collecting the input-to-hidden weights and x ∈ R^m is the input.
Progression to DEEP (Linear) Neural Networks

Single-layer:

h = w^T x

Hidden-layer:

h = W^T x

Two Hidden Layers:

h = W_2^T W_1^T x

Three Hidden Layers:

h = W_3^T W_2^T W_1^T x

A LOT of Hidden Layers:

h = W_N^T · · · W_2^T W_1^T x
Interpreting the Hidden Layer

Each hidden neuron is responsible for certain features.


Given an object, the network identifies the most likely features.
Interpreting the Hidden Layer

https://www.scientificamerican.com/article/springtime-for-ai-the-rise-of-deep-learning/
Two Questions about Multi-Layer Network
How do we efficiently learn the weights?
Ultimately we need to minimize the loss

J(W_1, . . . , W_L) = Σ_{i=1}^{N} || W_L^T · · · W_2^T W_1^T x_i − y_i ||²

One layer: Gradient descent. Multi-layer: Also gradient descent, also known as Back Propagation (BP), by Rumelhart, Hinton and Williams (1986)
Back propagation = Very careful book-keeping and chain rule
What is the optimization landscape?
Convex? Global minimum? Saddle point?
The two-layer case was proved by Baldi and Hornik (1989):
All local minima are global.
A critical point is either a saddle point or a global minimum.
The L-layer case was proved by Kawaguchi (2016), who also proved it for L-layer nonlinear networks (with sigmoid between adjacent layers).
Back Propagation: A 20-Minute Tour

You will be able to find A LOT of blogs on the internet discussing how back propagation is implemented.
Some mystify back propagation.
Some literally just teach you the procedure of back propagation without telling you the intuition.
I find the following online book by Mike Nielsen fairly well-written:
http://neuralnetworksanddeeplearning.com/
The following slides are written based on Nielsen's book.
We will not go into great detail.
The purpose is to get you exposed to the idea, and to de-mystify back propagation.
As stated before, back propagation is chain rule + very careful book-keeping.
Back Propagation

Here is the loss function you want to minimize:

J(W_1, . . . , W_L) = Σ_{i=1}^{N} || σ(W_L^T · · · σ(W_2^T σ(W_1^T x_i))) − y_i ||²

You have a set of nonlinear activation functions, usually the sigmoid.

To optimize, you need gradient descent. For example, for W_1:

W_1^{t+1} = W_1^t − α ∇J(W_1^t)

But you need to do this for all W_1, . . . , W_L.

And there are lots of sigmoid functions.
Let us do the brute force.
And this is back-propagation. (Really? Yes...)
Let us See an Example
Let us look at two layers:

J(W_1, W_2) = || σ(W_2^T σ(W_1^T x)) − y ||²,   with a_2 := σ(W_2^T σ(W_1^T x))

Let us go backward:

∂J/∂W_2 = (∂J/∂a_2) · (∂a_2/∂W_2)

Now, what is a_2?

a_2 = σ(W_2^T σ(W_1^T x)),   with z_2 := W_2^T σ(W_1^T x)

So let us compute:

∂a_2/∂W_2 = (∂a_2/∂z_2) · (∂z_2/∂W_2).
Let us See an Example

J(W_1, W_2) = || σ(W_2^T σ(W_1^T x)) − y ||²,   with a_1 := σ(W_1^T x)

How about W_1? Again, let us go backward:

∂J/∂W_1 = (∂J/∂a_2) · (∂a_2/∂W_1)

But you can now repeat the calculation as follows (let z_1 = W_1^T x):

∂a_2/∂W_1 = (∂a_2/∂a_1) · (∂a_1/∂W_1)
          = (∂a_2/∂a_1) · (∂a_1/∂z_1) · (∂z_1/∂W_1)

So it is just a very long sequence of chain rule.
Notations for Back Propagation
The following notations are based on Nielsen’s online book.
The purpose of doing these is to write down a concise algorithm.
Weights:

w^3_{24}: the superscript 3 denotes the 3rd layer; the subscript 24 denotes "from the 4th neuron to the 2nd neuron".
Notations for Back Propagation
Activation and Bias:

a^3_1: 3rd layer, 1st activation
b^2_3: 2nd layer, 3rd bias

Here is the relationship. Think of σ(w^T x + w_0):

a^ℓ_j = σ( Σ_k w^ℓ_{jk} a^{ℓ−1}_k + b^ℓ_j ).
Understanding Back Propagation
This is the main equation:

a^ℓ_j = σ( Σ_k w^ℓ_{jk} a^{ℓ−1}_k + b^ℓ_j ),   or   a^ℓ_j = σ(z^ℓ_j),   where z^ℓ_j := Σ_k w^ℓ_{jk} a^{ℓ−1}_k + b^ℓ_j.

a^ℓ_j: activation, z^ℓ_j: intermediate.
Loss

The loss takes the form of

C = Σ_j (a^L_j − y_j)²

Think of two-class cross-entropy where each a^L is a 2-by-1 vector.
Error Term
The error is defined as

δ^ℓ_j = ∂C/∂z^ℓ_j

You can show that at the output,

δ^L_j = (∂C/∂a^L_j) · (∂a^L_j/∂z^L_j) = (∂C/∂a^L_j) σ′(z^L_j).
4 Fundamental Equations for Back Propagation
BP Equation 1: For the error in the output layer:

δ^L_j = (∂C/∂a^L_j) σ′(z^L_j).   (BP-1)

First term: ∂C/∂a^L_j is the rate of change of the cost w.r.t. a^L_j.
Second term: σ′(z^L_j) is the rate of change of the activation w.r.t. z^L_j.
So it is just chain rule.

Example: If C = (1/2) Σ_j (y_j − a^L_j)², then

∂C/∂a^L_j = (a^L_j − y_j)

Matrix-vector form: δ^L = ∇_a C ⊙ σ′(z^L)
4 Fundamental Equations for Back Propagation
BP Equation 2: An equation for the error δ^ℓ in terms of the error in the next layer, δ^{ℓ+1}:

δ^ℓ = ((w^{ℓ+1})^T δ^{ℓ+1}) ⊙ σ′(z^ℓ).   (BP-2)

You start with δ^{ℓ+1} and take the weighted average with w^{ℓ+1}.
(BP-1) and (BP-2) can help you determine the error at any layer.
4 Fundamental Equations for Back Propagation
Equation 3: An equation for the rate of change of the cost with respect to any bias in the network:

∂C/∂b^ℓ_j = δ^ℓ_j.   (BP-3)

Good news: we already know δ^ℓ_j from Equations 1 and 2, so computing ∂C/∂b^ℓ_j is easy.

Equation 4: An equation for the rate of change of the cost with respect to any weight in the network:

∂C/∂w^ℓ_{jk} = a^{ℓ−1}_k δ^ℓ_j.   (BP-4)

Again, everything on the right is known, so it is easy to compute.
Back Propagation Algorithm
Below is a very concise summary of the BP algorithm; a small code sketch of these steps follows.
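The slide's summary figure is not reproduced here; as a stand-in, the following is a minimal numpy sketch of the four BP equations for one training example, assuming sigmoid activations and the quadratic cost C = (1/2) Σ_j (a^L_j − y_j)²:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backprop(weights, biases, x, y):
    # Step 2: feed forward, storing each z^l and a^l
    # (weights[l] has shape (n_l, n_{l-1}) under this convention)
    a, zs, activations = x, [], [x]
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # Step 3: output error (BP-1)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])

    grad_W = [None] * len(weights)
    grad_b = [None] * len(weights)
    grad_b[-1] = delta                             # (BP-3)
    grad_W[-1] = np.outer(delta, activations[-2])  # (BP-4)

    # Step 4: propagate the error backwards (BP-2)
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grad_b[l] = delta                            # (BP-3)
        grad_W[l] = np.outer(delta, activations[l])  # (BP-4)

    return grad_W, grad_b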
Step 2: Feed Forward Step
Let us take a closer look at Step 2
The feed forward step computes the intermediate variables and the
activations

z^ℓ = (w^ℓ)^T a^{ℓ−1} + b^ℓ
a^ℓ = σ(z^ℓ).


Step 3: Output Error

Let us take a closer look at Step 3


The output error is given by (BP-1)

δ^L = ∇_a C ⊙ σ′(z^L)


Step 4: Back Propagating the Error

Let us take a closer look at Step 4


The error back propagation is given by (BP-2)

δ^ℓ = ((w^{ℓ+1})^T δ^{ℓ+1}) ⊙ σ′(z^ℓ).


Summary of Back Propagation

There is no dark magic behind back propagation


It is literally just chain rule
You need to do this chain rule very systematically and carefully
Then you can derive the back propagation steps
Nielsen wrote in his book that
... How backpropagation could have been discovered in the first place? In
fact, if you follow the approach I just sketched you will discover a proof of
backpropagation...You make those simplifications, get a shorter proof, and
write that out....The result after a few iterations is the one we saw earlier,
short but somewhat obscure...
Most deep learning libraries have built-in back propagation steps.
You don’t have to implement it yourself, but you need to know what’s
behind it.

Reading List

Michael Nielsen, Neural Networks and Deep Learning, http://neuralnetworksanddeeplearning.com/chap2.html
Very well written. Easy to follow.
Duda, Hart, Stork, Pattern Classification, Chapter 5
Classical treatment. Comprehensive. Readable.
Bishop, Pattern Recognition and Machine Learning, Chapter 5
Somewhat Bayesian. Good for those who like statistics.
Stanford CS 231N, http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture4.pdf
Good numerical example.
CMU, https://www.cs.cmu.edu/~mgormley/courses/10601-s17/slides/lecture20-backprop.pdf
Cornell, https://www.cs.cornell.edu/courses/cs5740/2016sp/resources/backprop.pdf
Chapter 2

A Simple Class of GAs

Once upon a time a fire broke out in a hotel, where just then a sci-
entific conference was held. It was night and all guests were sound
asleep. As it happened, the conference was attended by researchers
from a variety of disciplines. The first to be awakened by the smoke
was a mathematician. His first reaction was to run immediately to
the bathroom, where, seeing that there was still water running from
the tap, he exclaimed: “There is a solution!”. At the same time, how-
ever, the physicist went to see the fire, took a good look and went back
to his room to get an amount of water, which would be just suffi-
cient to extinguish the fire. The electronic engineer was not so choosy
and started to throw buckets and buckets of water on the fire. Finally,
when the biologist awoke, he said to himself: “The fittest will survive”
and went back to sleep.

Anecdote originally told by C. L. Liu

In this chapter, we will present a very simple but extremely important


subclass—genetic algorithms working with a fixed number of binary
strings of fixed length. For this purpose, let us assume that the strings
we consider are all from the set
S = {0, 1}^n,
where n is obviously the length of the strings. The population size will be
denoted with m in the following. Therefore, the generation at time t is a
list of m strings which we will denote with
Bt = (b1,t , b2,t , . . . , bm,t ).
All GAs in this chapter will obey the following structure:


2.1 Algorithm.

t := 0;
Compute initial population B0 = (b1,0 , . . . , bm,0 );

WHILE stopping condition not fulfilled DO


BEGIN
FOR i := 1 TO m DO
select an individual bi,t+1 from Bt ;

FOR i := 1 TO m − 1 STEP 2 DO
IF Random[0, 1] ≤ pC THEN
cross bi,t+1 with bi+1,t+1 ;

FOR i := 1 TO m DO
eventually mutate bi,t+1 ;

t := t + 1
END

Obviously, selection, crossover (done only with a probability of pC


here), and mutation are still degrees of freedom, while the sampling op-
eration is already specified. As it is easy to see, every selected individual
is replaced by one of its children after crossover and mutation; unselected
individuals die immediately. This is a rather common sampling operation,
although other variants are known and reasonable.
In the following, we will study the three remaining operations selec-
tion, crossover, and mutation.

2.1 Genetic Operations on Binary Strings

2.1.1 Selection
Selection is the component which guides the algorithm to the solution by
preferring individuals with high fitness over low-fitted ones. It can be a
deterministic operation, but in most implementations it has random com-
ponents.
One variant, which is very popular nowadays (we will give a theo-
retical explanation of its good properties later), is the following scheme,

where the probability to choose a certain individual is proportional to its


fitness. It can be regarded as a random experiment with

P[b_{j,t} is selected] = f(b_{j,t}) / Σ_{k=1}^{m} f(b_{k,t})   (2.1)

Of course, this formula only makes sense if all the fitness values are pos-
itive. If this is not the case, a non-decreasing transformation ϕ : R → R+
must be applied (a shift in the simplest case). Then the probabilities can
be expressed as
P[b_{j,t} is selected] = ϕ(f(b_{j,t})) / Σ_{k=1}^{m} ϕ(f(b_{k,t}))   (2.2)

We can force the property (2.1) to be satisfied by applying a random


experiment which is, in some sense, a generalized roulette game. In this
roulette game, the slots are not equally wide, i.e. the different outcomes
can occur with different probabilities. Figure 2.1 gives a graphical hint
how this roulette wheel game works.
The algorithmic formulation of the selection scheme (2.1) can be writ-
ten down as follows, analogously for the case of (2.2):

2.2 Algorithm.

x := Random[0, 1];
i := 1;

WHILE i < m & x < Σ_{j=1}^{i} f(b_{j,t}) / Σ_{j=1}^{m} f(b_{j,t}) DO
    i := i + 1;

select b_{i,t};

For obvious reasons, this method is often called proportional selection.
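A minimal Python sketch of this proportional (roulette wheel) selection, assuming a list of non-negative fitness values as in (2.1):

import random

def select_index(fitness):
    # Return the index of one individual, chosen with probability
    # proportional to its fitness (roulette wheel selection).
    x = random.uniform(0, sum(fitness))
    cumulative = 0.0
    for i, f in enumerate(fitness):
        cumulative += f
        if x <= cumulative:
            return i
    return len(fitness) - 1  # guard against floating-point round-off

# Example: the second individual (fitness 576) wins about 49% of the time
fitness = [169, 576, 64, 361]
print(select_index(fitness))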

2.1.2 Crossover
In sexual reproduction, as it appears in the real world, the genetic material
of the two parents is mixed when the gametes of the parents merge. Usu-
ally, chromosomes are randomly split and merged, with the consequence

Figure 2.1: A graphical representation of roulette wheel selection, where the number of alternatives m is 6. The numbers inside the arcs correspond to the probabilities with which the alternatives are selected.

that some genes of a child come from one parent while others come from the other parent.

This mechanism is called crossover. It is a very powerful tool for intro-


ducing new genetic material and maintaining genetic diversity, but with
the outstanding property that good parents also produce well-performing
children or even better ones. Several investigations have come to the con-
clusion that crossover is the reason why sexually reproducing species have
adapted faster than asexually reproducing ones.

Basically, crossover is the exchange of genes between the chromosomes


of the two parents. In the simplest case, we can realize this process by
cutting two strings at a randomly chosen position and swapping the two
tails. This process, which we will call one-point crossover in the following,
is visualized in Figure 2.2.
Figure 2.2: One-point crossover of binary strings.

2.3 Algorithm.

pos := Random{1, . . . , n − 1};

FOR i := 1 TO pos DO
BEGIN
Child1 [i] := Parent1 [i];
Child2 [i] := Parent2 [i]
END

FOR i := pos + 1 TO n DO
BEGIN
Child1 [i] := Parent2 [i];
Child2 [i] := Parent1 [i]
END

One-point crossover is a simple and often-used method for GAs which


operate on binary strings. For other problems or different codings, other
crossover methods can be useful or even necessary. We mention just a
small collection of them, for more details see [20, 22]:

N-point crossover: Instead of only one, N breaking points are chosen randomly. Every second section is swapped. Among this class, two-point crossover is particularly important.

Segmented crossover: Similar to N -point crossover with the difference


that the number of breaking points can vary.

Uniform crossover: For each position, it is decided randomly if the posi-


tions are swapped.

Shuffle crossover: First a randomly chosen permutation is applied to the


two parents, then N -point crossover is applied to the shuffled par-
ents, finally, the shuffled children are transformed back with the in-
verse permutation.

2.1.3 Mutation

The last ingredient of our simple genetic algorithm is mutation—the ran-


dom deformation of the genetic information of an individual by means of
radioactive radiation or other environmental influences. In real reproduc-
tion, the probability that a certain gene is mutated is almost equal for all
genes. So, it is near at hand to use the following mutation technique for
a given binary string s, where pM is the probability that a single gene is
modified:

2.4 Algorithm.

FOR i := 1 TO n DO
IF Random[0, 1] < pM THEN
invert s[i];

Of course, pM should be rather low in order to avoid that the GA be-


haves chaotically like a random search.
Again, similar to the case of crossover, the choice of the appropriate
mutation technique depends on the coding and the problem itself. We
mention a few alternatives, more details can be found in [20] and [22]
again:

Inversion of single bits: With probability pM , one randomly chosen bit is


negated.

Bitwise inversion: The whole string is inverted bit by bit with prob. pM .

Random selection: With probability pM , the string is replaced by a ran-


domly chosen one.

2.1.4 Summary

If we fill in the methods described above, we can write down a uni-


versal genetic algorithm for solving optimization problems in the space
S = {0, 1}^n.

2.5 Algorithm.

t := 0;
Create initial population B0 = (b1,0 , . . . , bm,0 );

WHILE stopping condition not fulfilled DO


BEGIN

(∗ proportional selection ∗)

FOR i := 1 TO m DO
BEGIN
x := Random[0, 1];

k := 1;
WHILE k < m & x < Σ_{j=1}^{k} f(b_{j,t}) / Σ_{j=1}^{m} f(b_{j,t}) DO
k := k + 1;

bi,t+1 := bk,t
END

(∗ one-point crossover ∗)

FOR i := 1 TO m − 1 STEP 2 DO
BEGIN
IF Random[0, 1] ≤ pC THEN
BEGIN
pos := Random{1, . . . , n − 1};

FOR k := pos + 1 TO n DO
BEGIN
aux := bi,t+1 [k];
bi,t+1 [k] := bi+1,t+1 [k];
bi+1,t+1 [k] := aux
END
END
END

(∗ mutation ∗)

FOR i := 1 TO m DO
FOR k := 1 TO n DO
IF Random[0, 1] < pM THEN
invert bi,t+1 [k];

t := t + 1
END

2.2 Examples

2.2.1 A Very Simple One


Consider the problem of finding the global maximum of the following
function:
f1 : {0, . . . , 31} −→ R
x ↦ x²
Of course, the solution is obvious, but the simplicity of this problem allows
us to compute some steps by hand in order to gain some insight into the
principles behind genetic algorithms.
The first step on the checklist of things, which have to be done in order
to make a GA work, is, of course, to specify a proper string space along
with an appropriate coding and decoding scheme. In this example, it is
near at hand to consider S = {0, 1}^5, where a value from {0, . . . , 31} is
coded by its binary representation. Correspondingly, a string is decoded
as
c̃(s) = Σ_{i=0}^{4} s[4 − i] · 2^i.

Like in [22], let us assume that we use Algorithm 2.5 as it is, with a
population size of m = 4, a crossover probability pC = 1 and a mutation
probability of pM = 0.001. If we compute the initial generation randomly
with uniform distribution over {0, 1}5 , we obtain the following in the first
step:

Individual No.   String (genotype)   x value (phenotype)   f(x) = x²   p_select,i = f_i / Σ f_j
1                0 1 1 0 1           13                    169         0.14
2                1 1 0 0 0           24                    576         0.49
3                0 1 0 0 0           8                     64          0.06
4                1 0 0 1 1           19                    361         0.31

One can compute easily that the sum of fitness values is 1170, where the
average is 293 and the maximum is 576. We see from the last column in
which way proportional selection favors high-fitted individuals (such as
no. 2) over low-fitted ones (such as no. 3).
A random experiment could, for instance, give the result that individu-
als no. 1 and no. 4 are selected for the new generation, while no. 3 dies and
no. 2 is selected twice, and we obtain the second generation as follows:
Figure 2.3: The function f2.

Set of selected individuals   Crossover site (random)   New population   x value   f(x) = x²
0 1 1 0 | 1                   4                         0 1 1 0 0        12        144
1 1 0 0 | 0                   4                         1 1 0 0 1        25        625
1 1 | 0 0 0                   2                         1 1 0 1 1        27        729
1 0 | 0 1 1                   2                         1 0 0 0 0        16        256

So, we obtain a new generation with a sum of fitness values of 1754, an


average of 439, and a maximum of 729.
We can see from this very basic example in which way selection favors
high-fitted individuals and how crossover of two parents can produce an
offspring which is even better than both of its parents. It is left to the
reader as an exercise to continue this example.
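To make Algorithm 2.5 concrete, here is a minimal Python sketch of the whole loop (proportional selection, one-point crossover, bitwise mutation); it is not the book's implementation, and the fitness decoding assumes the 5-bit coding of this example:

import random

n, m = 5, 4              # string length, population size
pC, pM = 1.0, 0.001      # crossover and mutation probabilities

def fitness(bits):
    # Decode the binary string and evaluate f(x) = x^2
    x = int(''.join(map(str, bits)), 2)
    return x * x

def roulette(pop, fits):
    # Proportional selection of one individual
    x = random.uniform(0, sum(fits))
    cum = 0.0
    for b, f in zip(pop, fits):
        cum += f
        if x <= cum:
            return b[:]
    return pop[-1][:]

pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]
for t in range(20):                                        # 20 generations
    fits = [fitness(b) for b in pop]
    pop = [roulette(pop, fits) for _ in range(m)]          # selection
    for i in range(0, m - 1, 2):                           # one-point crossover
        if random.random() <= pC:
            pos = random.randint(1, n - 1)
            pop[i][pos:], pop[i + 1][pos:] = pop[i + 1][pos:], pop[i][pos:]
    for b in pop:                                          # mutation
        for k in range(n):
            if random.random() < pM:
                b[k] = 1 - b[k]

print(max(pop, key=fitness))   # best string after 20 generations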

2.2.2 An Oscillating One-Dimensional Function


Now we are interested in the global maximum of the function

f2 : [−1, 1] −→ R
x ↦ 1 + e^{−x²} · cos(36x).

As one can see easily from the plot in Figure 2.3, the function has a global
maximum in 0 and a lot of local maxima.
First of all, in order to work with binary strings, we have to discretize
the search space [−1, 1]. A common technique for doing so is to make a

uniform grid of 2n points, then to enumerate the grid points, and to use
the binary representation of the point index as coding. In the general form
(for an arbitrary interval [a, b]), this looks as follows:
c_{n,[a,b]} : [a, b] −→ {0, 1}^n
x ↦ bin_n( round( (2^n − 1) · (x − a)/(b − a) ) ),   (2.3)

where bin_n is the function which converts a number from {0, . . . , 2^n − 1} to its binary representation of length n. This operation is not bijective since information is lost due to the rounding operation. Obviously, the corresponding decoding function can be defined as

c̃_{n,[a,b]} : {0, 1}^n −→ [a, b]
s ↦ a + bin_n^{−1}(s) · (b − a)/(2^n − 1).   (2.4)
It is left as an exercise to show that the decoding function c̃n,[a,b] is injective
and that the equality (1.2) holds for the pair (cn,[a,b] , c̃n,[a,b] ).
Applying the above coding scheme to the interval [−1, 1] with n = 16,
we get a maximum accuracy of the solution of
(1/2) · 2/(2^16 − 1) ≈ 1.52 · 10^−5.
Now let us apply Algorithm 2.5 with m = 6, pC = 1, and pM = 0.005. The
first and the last generation are given as follows:
Generation 1 max. fitness 1.9836 at -0.0050
#0 0111111101010001 fitness: 1.98
#1 1101111100101011 fitness: 0.96
#2 0111111101011011 fitness: 1.98
#3 1001011000011110 fitness: 1.97
#4 1001101100101011 fitness: 1.20
#5 1100111110011110 fitness: 0.37
Average Fitness: 1.41

...

Generation 52 max. fitness 2.0000 at 0.0000


#0 0111111101111011 fitness: 1.99
#1 0111111101111011 fitness: 1.99
#2 0111111101111011 fitness: 1.99
#3 0111111111111111 fitness: 2.00
#4 0111111101111011 fitness: 1.99
#5 0111111101111011 fitness: 1.99
Average Fitness: 1.99

We see that the algorithm arrives at the global maximum after 52 generations, i.e. at most 52 × 6 = 312 evaluations of the fitness

Figure 2.4: A surface plot of the function f3.

function, while the total size of the search space is 2^16 = 65536. We can
draw the conclusion—at least for this example—that the GA is definitely
better than a pure random search or an exhaustive method which stupidly
scans the whole search space.
Just in order to get more insight into the coding/decoding scheme, let
us take the best string 0111111111111111. Its representation as an integer number is 32767. Computing the decoding function yields

−1 + 32767 · (1 − (−1))/65535 = −1 + 0.9999847 = −0.0000153.

2.2.3 A Two-Dimensional Function


As next example, we study the function

f3 : [−10, 10]² −→ R
(x, y) ↦ (1 − sin²(√(x² + y²))) / (1 + 0.001 · (x² + y²)).

As one can see easily from the plot in Figure 2.4, the function has a global
maximum in 0 and a lot of local maxima.
Let us use the coding/decoding scheme as shown in (2.3) and (2.4) for
the two components x and y independently with n = 24, i.e. c24,[−10,10] and

c̃24,[−10,10] are used as coding and decoding functions, respectively. In order


to get a coding for the two-dimensional vector, we can use concatenation
and splitting:
c3 : [−10, 10]² −→ {0, 1}^48
(x, y) ↦ c_{24,[−10,10]}(x) | c_{24,[−10,10]}(y)

c̃3 : {0, 1}^48 −→ [−10, 10]²
s ↦ ( c̃_{24,[−10,10]}(s[1 : 24]), c̃_{24,[−10,10]}(s[25 : 48]) )
If we apply Algorithm 2.5 with m = 50, pC = 1, pM = 0.01, we observe
that a fairly good solution is reached after 693 generations (at most 34650
evaluations at a search space size of 2.81 · 10^14):
Generation 693 max. fitness 0.9999 at (0.0098,0.0000)
#0 000000001000000001000000000000000000000010000000 fitness: 1.00
#1 000001000000011001000110000000000000000010100010 fitness: 0.00
#2 000000001000000000100000000000000000000010000000 fitness: 1.00
#3 000000001000001001000000000000000000000010000000 fitness: 0.97
#4 000000001000001011001000000000000000000010000011 fitness: 0.90
#5 000000101000000001000010000100000000000010000000 fitness: 0.00
#6 000000001000000011000000000000001000000010000011 fitness: 0.00
#7 000000001000000001100000000010000000000110000000 fitness: 0.00
#8 000000001001000001000000000000000000000000100010 fitness: 0.14
#9 000000001000000001000000000000000000000010100010 fitness: 0.78
#10 000000001000011011000000000000000000000010000000 fitness: 0.75
#11 000000001000000001000000000000000000000010100000 fitness: 0.64
#12 000000001000001000010010000000000000000010001001 fitness: 0.56
#13 000000001000001011000000000000000000000010100010 fitness: 0.78
#14 000000001000000001000001000000000000000010000000 fitness: 1.00
#15 000000001000000001100000100000000000000010000000 fitness: 0.00
#16 000000001000001010001000000000000000000010100010 fitness: 0.78
#17 000000001000011011000000000000000000000010000011 fitness: 0.70
#18 000000001000001011001000000000000000000010000011 fitness: 0.90
#19 000000001000011001000010001000010000000010000010 fitness: 0.00
#20 000000001000000001000000000001000000000010100010 fitness: 0.00
#21 000000001000011001100000000000000000010010000000 fitness: 0.00
#22 000000001000000101100000000000000000010010000000 fitness: 0.00
#23 000000001000100001000000000000000000000010000111 fitness: 0.44
#24 000000001000000011000000000000000000000000000000 fitness: 0.64
#25 000000001000000001011000000000010000000010100010 fitness: 0.00
#26 000000001000000001001000000000000000000000100010 fitness: 0.23
#27 000000001000001011000010000000000000000010100010 fitness: 0.78
#28 000000001000001011100010000000000000000010101010 fitness: 0.97
#29 010000001000000011000000000000000010010010000000 fitness: 0.00
#30 000000001000001011000000000000000000000010000011 fitness: 0.90
#31 000000001000011011000000000000000000000011000011 fitness: 0.26
#32 000000001000001001100000000000000000000010000000 fitness: 0.97
#33 000000001001001011000110000000000000000011110100 fitness: 0.87
#34 000000001000000000000000000000000000000010100010 fitness: 0.78
#35 000000001000001011001000000000000000000010000010 fitness: 0.93
#36 000000001000011011000000000000000010000010000001 fitness: 0.00
#37 000000001000001011000000000010000000000010100010 fitness: 0.00
#38 000000001000001011000010010000000000000010000000 fitness: 0.00
#39 000000001000000001000000000001000000000010100010 fitness: 0.00
#40 000000001000001001000110000000000000000011010100 fitness: 0.88
#41 000000001010000001000000000000000000000010000000 fitness: 0.66
#42 000000001000001001100110000000000000000011010100 fitness: 0.88
#43 000000000000000000000000000000000000000010000011 fitness: 0.64
#44 000000001000001011001000000000000000000010100000 fitness: 0.65
#45 000000001000001011000110000000000000000011110100 fitness: 0.81
#46 000000000000000000000000000000000000000010000000 fitness: 0.64
#47 000000001000010001000110000000000000000010000000 fitness: 0.89
#48 000000001000001011000000000000000000000010100011 fitness: 0.84
#49 000000001000000111000000000000000000000010000001 fitness: 0.98
Average Fitness: 0.53

Again, we learn from this example that the GA is here for sure much
faster than an exhaustive algorithm or a pure random search. The question arises, since f3 is perfectly smooth, which result we obtain if we apply a conventional method with random selection of the initial value. In
this example, the expectation is obvious: The global maximum (0, 0) is
surrounded by a ring of minima at a radius of π2 . If we apply, for in-
stance, BFGS (Broyden Fletcher Goldfarb Shanno—a very efficient Quasi-
Newton method for continuous unconstrained function optimization [10])
with line search, it is likely that convergence to the global maximum is
achieved if the initial value is inside that ring, but only in this case. If we
take the initial value from [−10, 10]2 randomly with uniform distribution,
the probability to get a value from the appropriate neighborhood of the
global maximum is
π · (π/2)² / (10 · 10) = π³/400 ≈ 0.0775.

The expected number of trials until we get such an initial value is, therefore, 1/0.0775 ≈ 13. In a test implementation, it took 15 trials (random initial val-
ues) until the correct global optimum was found by the BFGS method with
line search. The total time for all these computations was 5 milliseconds
on an SGI O2 (MIPS R5000/180SC). The genetic algorithm, as above, took
1.5 seconds until it found the global optimum with comparable accuracy.
This example shows that GAs are not necessarily fast. Moreover, they
are in many cases much slower than conventional methods which involve
derivatives. The next example, however, will drastically show us that
there are even smooth functions which can be hard for conventional opti-
mization techniques.

2.2.4 Global Smoothness versus Local Perturbations


Consider the function
f4 : [−2, 2] −→ R
x ↦ e^{−x²} + 0.01 cos(200x).
As easy to see from Figure 2.5, this function has a clear bell-like shape with
small but highly oscillating perturbations. In the first derivative, these
oscillations are drastically emphasized (see Figure 2.5):
f4′(x) = −2x e^{−x²} − 2 sin(200x)

We applied the simple GA as in Algorithm 2.5 with n = 16, i.e. the pair c_{16,[−2,2]} / c̃_{16,[−2,2]} as coding/decoding scheme, m = 10, pC = 1, and

Figure 2.5: The function f4 (top) and its derivative (bottom).

pM = 0.005. The result was that the global maximum at x = 0 was found
after 9 generations (i.e. at most 90 evaluations of the fitness function) and
5 milliseconds computation time, respectively (on the same computer as
above).

In order to repeat the above comparison, BFGS with line search and
random selection of the initial value was applied to f4 as well. The global
optimum was found after 30 trials (initial values) with perfect accuracy,
but 9 milliseconds of computation time.

We see that, depending on the structure of the objective function, a


GA can even outperform an acknowledged conventional method which
makes use of derivatives.

2.2.5 Discussion
Finally, let us summarize some conclusions about the four examples
above:
Algorithm 2.5 is very universal. More or less, the same algorithm has
been applied to four fundamentally different optimization tasks.
As seen in 2.2.4, GAs can even be faster in finding global maxima than
conventional methods, in particular when derivatives provide misleading
information. We should not forget, however, that, in most cases where
conventional methods can be applied, GAs are much slower because they
do not take auxiliary information like derivatives into account. In these
optimization problems, there is no need to apply a GA which gives less
accurate solutions after much longer computation time. The enormous
potential of GAs lies elsewhere—in optimization of non-differentiable or
even discontinuous functions, discrete optimization, and program induc-
tion.
A Comprehensive Tutorial to learn Convolutional Neural
Networks from Scratch (deeplearning.ai Course #4)

Introduction

If you had to pick one deep learning technique for computer vision from the plethora of options out there,
which one would you go for? For a lot of folks, including myself, convolutional neural network is the default
answer.

But what is a convolutional neural network and why has it suddenly become so popular? Well, that’s what
we’ll find out in this article! CNNs have become the go-to method for solving any image data challenge.
Their use is being extended to video analytics as well but we’ll keep the scope to image processing for
now. Any data that has spatial relationships is ripe for applying CNN – let’s just keep that in mind for now.

In the previous articles in this series, we learned the key to deep learning – understanding how neural
networks work. We saw how using deep neural networks on very large images increases the computation
and memory cost. To combat this obstacle, we will see how convolutions and convolutional neural
networks help us to bring down these factors and generate better results.

So welcome to part 3 of our deeplearning.ai course series (deep learning specialization) taught by the
great Andrew Ng. In addition to exploring how a convolutional neural network (ConvNet) works, we’ll also
look at different architectures of a ConvNet and how we can build an object detection model using YOLO.
Finally, we’ll tie our learnings together to understand where we can apply these concepts in real-life
applications (like facial recognition and neural style transfer).

I highly recommend going through the first two parts before diving into this guide:

An Introductory Guide to Deep Learning and Neural Networks (Notes from deeplearning.ai Course #1)
Improving Neural Networks – Hyperparameter Tuning, Regularization, and More (deeplearning.ai Course
#2)
 

Table of Contents

1. Course Structure
2. Course #4: Convolutional Neural Networks
1. Module 1: Foundations of Convolutional Neural Networks
2. Module 2: Deep Convolutional Models: Case Studies
1. Case Studies
2. Practical Advice for using ConvNets
3. Module 3: Object Detection
4. Module 4: Special Applications: Face Recognition & Neural Style Transfer

Course Structure

The previous articles of this series covered the basics of deep learning and neural networks. We also
learned how to improve the performance of a deep neural network using techniques like  hyperparameter
tuning, regularization and optimization.

So where to next? Let’s turn our focus to the concept of Convolutional Neural Networks. Course #4 of the
deep learning specialization is divided into 4 modules:

1. In module 1, we will understand the convolution and pooling operations and will also look at a simple
Convolutional Network example
2. In module 2, we will look at some practical tricks and methods used in deep CNNs through the lens of
multiple case studies. We will also learn a few practical concepts like transfer learning, data
augmentation, etc.
3. Module 3 will cover the concept of object detection. We will discuss the popular YOLO algorithm and
different techniques used in YOLO for object detection
4. Finally, in module 4, we will briefly discuss how face recognition and neural style transfer work. This is
a very interesting module so keep your learning hats on till the end

Course 4: Convolutional Neural Network

Ready? Good, because we are diving straight into module 1!

Week 1: Foundations of Convolutional Neural Networks

The objectives behind the first module of the course 4 are:

To understand the convolution operation


To understand the pooling operation
Remembering the vocabulary used in convolutional neural networks (padding, stride, filter, etc.)
Building a convolutional neural network for multi-class classification in images

Computer Vision

Some of the computer vision problems which we will be solving in this article are:

1. Image classification
2. Object detection
3. Neural style transfer

One major problem with computer vision problems is that the input data can get really big. Suppose an
image is of the size 64 X 64 X 3. The input feature dimension then becomes 12,288. This will be even
bigger if we have larger images (say, of size 720 X 720 X 3). Now, if we pass such a big input to a neural
network, the number of parameters will swell up to a HUGE number (depending on the number of hidden
layers and hidden units). This will result in more computational and memory requirements – not something
most of us can deal with.

Edge Detection Example

In the previous article, we saw that the early layers of a neural network detect edges in an image. Deeper layers might be able to detect parts of objects, and even deeper layers might detect complete objects (like a person's face).

In this section, we will focus on how the edges can be detected from an image. Suppose we are given the
below image:

As you can see, there are many vertical and horizontal edges in the image. The first thing to do is to detect
these edges:
But how do we detect these edges? To illustrate this, let’s take a 6 X 6 grayscale image (i.e. only one
channel):

Next, we convolve this 6 X 6 matrix with a 3 X 3 filter:

After the convolution, we will get a 4 X 4 image. The first element of the 4 X 4 matrix will be calculated as:

So, we take the first 3 X 3 matrix from the 6 X 6 image and multiply it with the filter. Now, the first element
of the 4 X 4 output will be the sum of the element-wise product of these values, i.e. 3*1 + 0 + 1*-1 + 1*1 +
5*0 + 8*-1 + 2*1 + 7*0 + 2*-1 = -5. To calculate the second element of the 4 X 4 output, we will shift our
filter one step towards the right and again get the sum of the element-wise product:

Similarly, we will convolve over the entire image and get a 4 X 4 output:

So, convolving a 6 X 6 input with a 3 X 3 filter gave us an output of 4 X 4. Consider one more example:
Note: Higher pixel values represent the brighter portion of the image and the lower pixel values represent
the darker portions. This is how we can detect a vertical edge in an image.
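To make the arithmetic concrete, here is a minimal numpy sketch of this valid convolution (implemented, as in most deep learning libraries, as cross-correlation) of a 6 X 6 image with a 3 X 3 vertical edge filter; the image values are hypothetical:

import numpy as np

def convolve2d_valid(image, kernel):
    # Valid convolution (no padding, stride 1): output is (n-f+1) x (n-f+1)
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# Hypothetical 6 x 6 grayscale image: bright on the left, dark on the right
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# 3 x 3 vertical edge detection filter
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

print(convolve2d_valid(image, kernel))  # large values where the vertical edge sits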

More Edge Detection

The type of filter that we choose helps to detect the vertical or horizontal edges. We can use the following
filters to detect different edges:

Some of the commonly used filters are:

The Sobel filter puts a little bit more weight on the central pixels. Instead of using these filters, we can
create our own as well and treat them as a parameter which the model will learn using backpropagation.

Padding

We have seen that convolving an input of 6 X 6 dimension with a 3 X 3 filter results in 4 X 4 output. We can
generalize it and say that if the input is n X n and the filter size is f X f, then the output size will be (n-f+1) X
(n-f+1):
Input: n X n
Filter size: f X f
Output: (n-f+1) X (n-f+1)

There are primarily two disadvantages here:

1. Every time we apply a convolutional operation, the size of the image shrinks
2. Pixels present in the corners of the image are used only a few times during convolution, compared to the central pixels, so the convolution under-weights the borders, which can lead to information loss

To overcome these issues, we can pad the image with an additional border, i.e., we add one pixel all around
the edges. This means that the input will be an 8 X 8 matrix (instead of a 6 X 6 matrix). Applying
convolution of 3 X 3 on it will result in a 6 X 6 matrix which is the original shape of the image. This is where
padding comes to the fore:

Input: n X n
Padding: p
Filter size: f X f
Output: (n+2p-f+1) X (n+2p-f+1)

There are two common choices for padding:

1. Valid: It means no padding. If we are using valid padding, the output will be (n-f+1) X (n-f+1)
2. Same: Here, we apply padding so that the output size is the same as the input size, i.e.,
n+2p-f+1 = n
So, p = (f-1)/2

We now know how to use padded convolution. This way we don’t lose a lot of information and the image
does not shrink either. Next, we will look at how to implement strided convolutions.

Strided Convolutions

Suppose we choose a stride of 2. So, while convoluting through the image, we will take two steps – both in
the horizontal and vertical directions separately. The dimensions for stride s will be:

Input: n X n
Padding: p
Stride: s
Filter size: f X f
Output: [(n+2p-f)/s+1] X [(n+2p-f)/s+1]

Stride helps to reduce the size of the image, a particularly useful feature.
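The dimension formulas above are easy to capture in a small helper; here is a sketch (floor division matches the convention that partial windows are dropped):

def conv_output_size(n, f, p=0, s=1):
    # Output size of a convolution: floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))            # 4: valid convolution of 6 x 6 with 3 x 3
print(conv_output_size(6, 3, p=1))       # 6: 'same' padding with p = (f-1)/2
print(conv_output_size(7, 3, p=0, s=2))  # 3: stride-2 convolution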

Convolutions Over Volume


Suppose, instead of a 2-D image, we have a 3-D input image of shape 6 X 6 X 3. How will we apply
convolution on this image? We will use a 3 X 3 X 3 filter instead of a 3 X 3 filter. Let’s look at an example:

Input: 6 X 6 X 3
Filter: 3 X 3 X 3

The dimensions above represent the height, width and channels in the input and filter. Keep in mind that
the number of channels in the input and filter should be same. This will result in an output of 4 X 4. Let’s
understand it visually:

Since there are three channels in the input, the filter will consequently also have three channels. After
convolution, the output shape is a 4 X 4 matrix. So, the first element of the output is the sum of the
element-wise product of the first 27 values from the input (9 values from each channel) and the 27 values
from the filter. After that we convolve over the entire image.

Instead of using just a single filter, we can use multiple filters as well. How do we do that? Let’s say the
first filter will detect vertical edges and the second filter will detect horizontal edges from the image. If we
use multiple filters, the output dimension will change. So, instead of having a 4 X 4 output as in the above
example, we would have a 4 X 4 X 2 output (if we have used 2 filters):

Generalized dimensions can be given as:

Input: n X n X n_c
Filter: f X f X n_c
Padding: p
Stride: s
Output: [(n+2p-f)/s+1] X [(n+2p-f)/s+1] X n_c'

Here, n_c is the number of channels in the input and filter, while n_c' is the number of filters.

One Layer of a Convolutional Network

Once we get an output after convolving over the entire image using a filter, we add a bias term to those
outputs and finally apply an activation function to generate activations. This is one layer of a convolutional
network. Recall that the equation for one forward pass is given by:

z[1] = w[1] * a[0] + b[1]

a[1] = g(z[1])

In our case, input (6 X 6 X 3) is a[0] and filters (3 X 3 X 3) are the weights w[1]. These activations from layer
1 act as the input for layer 2, and so on. Clearly, the number of parameters in case of convolutional neural
networks is independent of the size of the image. It essentially depends on the filter size. Suppose we
have 10 filters, each of shape 3 X 3 X 3. What will be the number of parameters in that layer? Let’s try to
solve this:

Number of parameters for each filter = 3*3*3 = 27


There will be a bias term for each filter, so total parameters per filter = 28
As there are 10 filters, the total parameters for that layer = 28*10 = 280
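We can verify this count with a couple of lines of Python (conv_layer_params is just a helper name I'm using for the check):

def conv_layer_params(f, n_c, n_filters):
    # each filter has f * f * n_c weights plus one bias term
    return (f * f * n_c + 1) * n_filters

print(conv_layer_params(3, 3, 10))   # (3*3*3 + 1) * 10 = 280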

No matter how big the image is, the parameters only depend on the filter size. Awesome, isn’t it? Let’s have
a look at the summary of notations for a convolution layer:

f[l] = filter size

p[l] = padding

s[l] = stride

n_c[l] = number of filters

Let’s combine all the concepts we have learned so far and look at a convolutional network example.

Simple Convolutional Network Example

This is what a typical convolutional network looks like:


We take an input image (size = 39 X 39 X 3 in our case), convolve it with 10 filters of size 3 X 3, and take
the stride as 1 and no padding. This will give us an output of 37 X 37 X 10. We convolve this output further
and get an output of 7 X 7 X 40 as shown above. Finally, we take all these numbers (7 X 7 X 40 = 1960),
unroll them into a large vector, and pass them to a classifier that will make predictions. This is a
microcosm of how a convolutional network works.
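To see these shapes in code, here is a minimal Keras sketch of such a network (my own illustration: the first layer matches the text, while the two middle layers are one plausible choice of filters and strides that reproduces the 7 X 7 X 40 volume, and the 10-way classifier is a placeholder):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(39, 39, 3)),
    layers.Conv2D(10, 3, strides=1, activation='relu'),   # -> 37 X 37 X 10
    layers.Conv2D(20, 5, strides=2, activation='relu'),   # -> 17 X 17 X 20 (assumed)
    layers.Conv2D(40, 5, strides=2, activation='relu'),   # -> 7 X 7 X 40
    layers.Flatten(),                                     # -> 1960 values
    layers.Dense(10, activation='softmax'),               # classifier (placeholder)
])
model.summary()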

There are a number of hyperparameters that we can tweak while building a convolutional network. These
include the number of filters, size of filters, stride to be used, padding, etc. We will look at each of these in
detail later in this article. Just keep in mind that as we go deeper into the network, the size of the image
shrinks whereas the number of channels usually increases.

In a convolutional network (ConvNet), there are basically three types of layers:

1. Convolution layer
2. Pooling layer
3. Fully connected layer

Let’s understand the pooling layer in the next section.

Pooling Layers

Pooling layers are generally used to reduce the size of the inputs and hence speed up the computation.
Consider a 4 X 4 matrix as shown below:

Applying max pooling on this matrix will result in a 2 X 2 output:

For every consecutive 2 X 2 block, we take the max number. Here, we have applied a filter of size 2 and a
stride of 2. These are the hyperparameters for the pooling layer. Apart from max pooling, we can also apply
average pooling where, instead of taking the max of the numbers, we take their average. In summary, the
hyperparameters for a pooling layer are:

1. Filter size
2. Stride
3. Max or average pooling

If the input of the pooling layer is n_h X n_w X n_c, then the output will be [{(n_h - f)/s + 1} X {(n_w - f)/s + 1} X n_c].
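Here is a small NumPy sketch of max pooling (max_pool is my own helper, and the 4 X 4 matrix is made up, since the original figure is not reproduced here):

import numpy as np

def max_pool(x, f=2, s=2):
    # max pooling with filter size f and stride s over a 2-D input
    n_h, n_w = x.shape
    out = np.zeros(((n_h - f) // s + 1, (n_w - f) // s + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i*s:i*s+f, j*s:j*s+f].max()
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])
print(max_pool(x))   # [[9. 2.], [6. 3.]]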

 
CNN Example

We’ll take things up a notch now. Let’s look at how a convolution neural network with convolutional and
pooling layer works. Suppose we have an input of shape 32 X 32 X 3:

There is a combination of convolution and pooling layers at the beginning, a few fully connected layers at
the end and finally a softmax classifier to classify the input into various categories. There are a lot of
hyperparameters in this network which we have to specify as well.

Generally, we take the set of hyperparameters which have been used in proven research and they end up
doing well. As seen in the above example, the height and width of the input shrinks as we go deeper into
the network (from 32 X 32 to 5 X 5) and the number of channels increases (from 3 to 10).

All of these concepts and techniques bring up a very fundamental question – why convolutions? Why not
something else?

Why Convolutions?

There are two primary advantages of using convolutional layers over using just fully connected
layers:

1. Parameter sharing
2. Sparsity of connections

Consider the below example:

If we would have used just the fully connected layer, the number of parameters would be =
32*32*3*28*28*6, which is nearly equal to 14 million! Makes no sense, right?

If we see the number of parameters in case of a convolutional layer, it will be = (5*5 + 1) * 6 (if there are 6
filters), which is equal to 156. Convolutional layers reduce the number of parameters and speed up the
training of the model significantly.

In convolutions, we share the parameters while convolving through the input. The intuition behind this is
that a feature detector, which is helpful in one part of the image, is probably also useful in another part of
the image. So a single filter is convolved over the entire input and hence the parameters are shared.

The second advantage of convolution is the sparsity of connections. For each layer, each output value
depends on a small number of inputs, instead of taking into account all the inputs.

Module 2: Deep Convolutional Models: Case Studies

The objectives behind the second module of course 4 are:

To understand multiple foundation papers of convolutional neural networks


To analyze the dimensionality reduction of a volume in a very deep network
Understanding and implementing a residual network
Building a deep neural network using Keras
Implementing a skip-connection in your network
Cloning a repository from GitHub and using transfer learning

Classic Networks

In this section, we will look at the following popular networks:

1. LeNet-5
2. AlexNet
3. VGG

We will also see how ResNet works and finally go through a case study of an inception neural network.

LeNet-5

Let’s start with LeNet-5:

It takes a grayscale image as input. Once we pass it through a combination of convolution and pooling
layers, the output will be passed through fully connected layers and classified into corresponding classes.
The total number of parameters in LeNet-5 are:

Parameters: 60k
Layers flow: Conv -> Pool -> Conv -> Pool -> FC -> FC -> Output
Activation functions: sigmoid/tanh and ReLU

AlexNet

An illustrated summary of AlexNet is given below:

This network is similar to LeNet-5 with just more convolution and pooling layers:

Parameters: 60 million
Activation function: ReLU

VGG-16

The underlying idea behind VGG-16 was to use a much simpler network where the focus is on having
convolution layers that have 3 X 3 filters with a stride of 1 (and always using the same padding). The max
pool layer is used after each convolution layer with a filter size of 2 and a stride of 2. Let’s look at the
architecture of VGG-16:

As it is a bigger network, the number of parameters is also larger.

Parameters: 138 million

These are three classic architectures. Next, we’ll look at more advanced architectures, starting with ResNet.

 
ResNet

Training very deep networks can lead to problems like vanishing and exploding gradients. How do we deal
with these issues? We can use skip connections, where we take activations from one layer and feed them to
another layer even deeper in the network. ResNet’s residual blocks help in
training deeper networks.

Residual Blocks

The general flow to calculate activations from different layers can be given as:

This is how we calculate the activations a[l+2] using the activations a[l] and then a[l+1] . a[l] needs to go
through all these steps to generate a[l+2] :

In a residual network, we make a change in this path. We take the activations a[l] and pass them directly to
the second layer:

So, the activations a[l+2] will be:

a[l+2] = g(z[l+2] + a[l] )

The residual network can be shown as:


The benefit of training a residual network is that even if we train deeper networks, the training error does
not increase. Whereas in case of a plain network, the training error first decreases  as we train a deeper
network and then starts to rapidly increase:
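As a sketch of how this looks in code, here is one way to express a residual block with the Keras functional API (my own illustration; the two 3 X 3 same-padding convolutions are a typical choice, and a_l is assumed to already have n_filters channels so that the addition is valid):

from tensorflow.keras import layers

def residual_block(a_l, n_filters):
    # main path: two conv layers computing z[l+2]
    z = layers.Conv2D(n_filters, 3, padding='same', activation='relu')(a_l)
    z = layers.Conv2D(n_filters, 3, padding='same')(z)
    # skip connection: add a[l] before the final activation,
    # i.e. a[l+2] = g(z[l+2] + a[l])
    return layers.Activation('relu')(layers.Add()([z, a_l]))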

We now have an overview of how ResNet works. But why does it perform so well? Let’s find out!

Why ResNets Work?

In order to make a good model, we first have to make sure that its performance on the training data is
good. That’s the first test and there really is no point in moving forward if our model fails here. We have
seen earlier that training deeper networks using a plain network increases the training error after a
point. But while training a residual network, this isn’t the case. Even when we build a deeper residual
network, the training error generally does not increase.

The equation to calculate activation using a residual block is given by:

a[l+2] = g(z[l+2] + a[l] )


a[l+2] = g(w[l+2] * a[l+1] + b[l+2] + a[l])

Now, say w[l+2] = 0 and the bias b[l+2] is also 0; then:

a[l+2] = g(a[l])

It is fairly easy to calculate a[l+2] knowing just the value of a[l] . As per the research paper, ResNet is given
by:
 

Networks in Networks and 1×1 Convolutions

Let’s see how a 1 X 1 convolution can be helpful. Suppose we have a 28 X 28 X 192 input and we apply a 1
X 1 convolution using 32 filters. So, the output will be 28 X 28 X 32:

The basic idea of using 1 X 1 convolution is to reduce the number of channels from the image. A couple of
points to keep in mind:

We generally use a pooling layer to shrink the height and width of the image
To reduce the number of channels from an image, we convolve it using a 1 X 1 filter (hence reducing
the computation cost as well)

The Motivation Behind Inception Networks

While designing a convolutional neural network, we have to decide the filter size. Should it be a 1 X 1 filter,
or a 3 X 3 filter, or a 5 X 5? Inception does all of that for us! Let’s see how it works.

Suppose we have a 28 X 28 X 192 input volume. Instead of choosing what filter size to use, or whether to
use convolution layer or pooling layer, inception uses all of them and stacks all the outputs:

A good question to ask here – why are we using all these filters instead of using just a single filter size,
say 5 X 5? Let’s look at how many computations would arise if we would have used only a 5 X 5 filter on our
input:
Number of multiplies = 28 * 28 * 32 * 5 * 5 * 192 = 120 million! Can you imagine how expensive performing
all of these will be?

Now, let’s look at the computations a 1 X 1 convolution and then a 5 X 5 convolution will give us:

Number of multiplies for first convolution = 28 * 28 * 16 * 1 * 1 * 192 = 2.4 million


Number of multiplies for second convolution = 28 * 28 * 32 * 5 * 5 * 16 = 10 million
Total number of multiplies = 12.4 million

A significant reduction. This is the key idea behind inception.
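Since this is pure arithmetic, we can check it directly in Python:

direct = 28 * 28 * 32 * 5 * 5 * 192       # 120,422,400 -> ~120 million multiplies
bottleneck = 28 * 28 * 16 * 1 * 1 * 192   # 2,408,448   -> ~2.4 million
conv5x5 = 28 * 28 * 32 * 5 * 5 * 16       # 10,035,200  -> ~10 million
print(direct, bottleneck + conv5x5)       # 120422400 12443648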

Inception Networks

This is how an inception block looks:


We stack all the outputs together. Also, we apply a 1 X 1 convolution before applying 3 X 3 and 5 X 5
convolutions in order to reduce the computations. An inception model is the combination of these
inception blocks repeated at different locations, some fully connected layer at the end, and a softmax
classifier to output the classes.

Now that we have understood how different ConvNets work, it’s important to gain a practical perspective
around all of this.

Practical advice for using ConvNets

Building your own model from scratch can be a tedious and cumbersome process. Also, it is quite a task to
reproduce a research paper on your own (trust me, I am speaking from experience!). In many cases, we
also face issues like lack of data availability, etc. We can design a pretty decent model by simply following
the below tips and tricks:

1. Using Open-Source implementation: Generally, most deep learning researchers open-source their work
on platforms like GitHub. We can (and should) integrate their work into our projects. This has always
been a helpful path for me throughout my career
2. Transfer Learning: We can take a pretrained network and transfer that to a new task which we are
working on. In transfer learning, we take the complete network, remove a few layers from it, and add
custom layers on top of the remaining layers to train our model. In essence, we are extracting features
from a pretrained model and using those to classify and train our model
3. Data Augmentation: Deep learning models perform well when we have a large amount of data. There
are quite a few domains where getting enough data is a problem. In such cases, we use data
augmentation to generate training data from the available data. Some of the common augmentation
methods are:
1. Mirroring: Here we take the mirror image. The class of the image will not change in this case
2. Random Cropping
3. Rotating
4. Shearing
5. Color Shifting: We change the RGB scale of the image randomly.

With this, we come to the end of the second module. We saw some classical ConvNets, their structure and
gained valuable practical tips on how to use these networks.

Module 3: Object Detection

The objectives behind the third module are:

To understand the challenges of Object Localization, Object Detection and Landmark Finding
Understanding and implementing non-max suppression
Understanding and implementing intersection over union
To understand how we label a dataset for an object detection application
To learn the vocabulary used in object detection (landmark, anchor, bounding box, grid, etc.)
I have covered most of the concepts in this comprehensive article. I highly recommend going through it to
learn the concepts of YOLO. For your reference, I’ll summarize how YOLO works:

1. YOLO first takes an input image


2. The framework then divides the input image into grids
3. Image classification and localization are applied on each grid
4. YOLO then predicts the bounding boxes and their corresponding class probabilities for objects

It also applies Intersection over Union (IoU) and Non-Max Suppression to generate more accurate
bounding boxes and minimize the chance of the same object being detected multiple times.

In the final module of this course, we will look at some special applications of CNNs, such as face
recognition and neural style transfer.

Module 4: Special Applications: Face Recognition & Neural Style


transfer

The objective behind the final module is to discover how CNNs can be applied to multiple fields, including
art generation and facial recognition.

Part 1: Face Recognition

Face recognition is probably the most widely used application in computer vision. It seems to be
everywhere I look these days – from my own smartphone to airport lounges, it’s becoming an integral part
of our daily activities.

In this section, we will discuss various concepts of face recognition, like one-shot learning, siamese
network, and many more.

What is face recognition?

In face recognition literature, there are two terminologies which are discussed the most:

1. Face verification
2. Face recognition

In face verification, we pass the image and its corresponding name or ID as the input. For a new image, we
want our model to verify whether the image is that of the claimed person. This is also called one-to-one
mapping where we just want to know if the image is of the same person.

Face recognition is where we have a database of a certain number of people with their facial images and
corresponding IDs. When our model gets a new image, it has to match the input image with all the images
available in the database and return an ID. It is a one-to-k mapping (k being the number of people) where
we compare an input image with all the k people present in the database.

 
One-Shot Learning

One potential obstacle we usually encounter in a face recognition task is the problem of a lack of training
data. This is where we have only a single image of a person’s face and we have to recognize new images
using that. Since deep learning isn’t exactly known for working well with one training example, you can
imagine how this presents a challenge.

One-shot learning is where we learn to recognize the person from just one example. Training a CNN to
learn the representations of a face is not a good idea when we have few images. The model simply would
not be able to learn the features of the face. If a new user joins the database, we have to retrain the entire
network. Quite a conundrum, isn’t it? So instead of using a ConvNet, we try to learn a similarity function:

d(img1,img2) = degree of difference between images

We train a neural network to learn a function that takes two images as input and outputs the degree of
difference between these two images. So, if two images are of the same person, the output will be a small
number, and vice versa. We can define a threshold and if the degree is less than that threshold, we can
safely say that the images are of the same person.

Siamese Network

We will use a Siamese network to learn the function which we defined earlier:

d(img1,img2) = degree of difference between images

Suppose we have two images, x(1) and x(2), and we pass both of them to the same ConvNet. Instead of
generating the classes for these images, we extract the features by removing the final softmax layer. So,
the last layer will be a fully connected layer having, say 128 neurons:

Here, f(x(1)) and f(x(2)) are the encodings of images x(1) and x(2) respectively. So,

d(x(1),x(2)) = || f(x(1)) – f(x(2)) ||²

We train the model in such a way that if x(i) and x(j) are images of the same person, || f(x(i)) – f(x(j)) ||² will
be small, and if x(i) and x(j) are images of different people, || f(x(i)) – f(x(j)) ||² will be large. This is the
architecture of a Siamese network.
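In code, the comparison step is just a distance between the two 128-dimensional encodings. A minimal NumPy sketch (distance is my own helper, the random vectors stand in for real encodings, and the 0.7 threshold is purely illustrative):

import numpy as np

def distance(f1, f2):
    # d(x(1), x(2)) = squared L2 norm of the difference between the encodings
    return np.sum((f1 - f2) ** 2)

# stand-ins for f(x(1)) and f(x(2)), the 128-dimensional encodings
f1 = np.random.rand(128)
f2 = np.random.rand(128)
same_person = distance(f1, f2) < 0.7   # 0.7 is an illustrative threshold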

Next up, we will learn the loss function that we should use to improve a model’s performance.

 
Triplet Loss

In order to define a triplet loss, we take an anchor image, a positive image and a negative image. A positive
image is the image of the same person that’s present in the anchor image, while a negative image is the
image of a different person. Since we are looking at three images at the same time, it’s called a triplet loss.
We will use ‘A’ for anchor image, ‘P’ for positive image and ‘N’ for negative image.

So, for given A, P and N, we want:

|| f(A) – f(P) ||² <= || f(A) – f(N) ||²

i.e., || f(A) – f(P) ||² – || f(A) – f(N) ||² <= 0

If the model outputs zero for both || f(A) – f(P) ||² and || f(A) – f(N) ||², the above inequality will be satisfied.
The model might be trained in a way such that both terms are always 0. This will inevitably affect the
performance of the model. How do we overcome this? We slightly modify the above inequality and
add a term α, also known as the margin:

|| f(A) – f(P) ||² – || f(A) – f(N) ||² + α <= 0

The loss function can thus be defined as:

L(A,P,N) = max(|| f(A) – f(P) ||² – || f(A) – f(N) ||² + α, 0)
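Here is a minimal NumPy sketch of this loss (my own illustration; f_a, f_p and f_n stand for the encodings of A, P and N, and the margin value 0.2 is just a common illustrative choice):

import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    # L(A,P,N) = max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0)
    pos = np.sum((f_a - f_p) ** 2)   # squared distance anchor-positive
    neg = np.sum((f_a - f_n) ** 2)   # squared distance anchor-negative
    return max(pos - neg + alpha, 0.0)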

Similarly, the cost function for a set of people can be defined as:

Our aim is to minimize this cost function in order to improve our model’s performance. Apart from using
triplet loss, we can treat face recognition as a binary classification problem.

Face Verification and Binary Classification

Instead of using triplet loss to learn the parameters and recognize faces, we can solve it by translating our
problem into a binary classification one. We first use a Siamese network to compute the embeddings for
the images and then pass these embeddings to a logistic regression, where the target will be 1 if both the
embeddings are of the same person and 0 if they are of different people:
The final output of the logistic regression is:

Here, σ is the sigmoid function. Hence, we treat it as a supervised learning problem and pass different sets
of combinations. Each combination can have two images with their corresponding target being 1 if both
images are of the same person and 0 if they are of different people.

Part 2: Neural Style Transfer

In the final section of this course, we’ll discuss a very intriguing application of computer vision, i.e., neural
style transfer.

What is neural style transfer?

Let’s understand the concept of neural style transfer using a simple example. Suppose we want to recreate
a given image in the style of another image. Here, the input image is called the content image, while the
image whose style we want to recreate is known as the style image:

Neural style transfer allows us to create a new image which is the content image drawn in the fashion of
the style image:

Awesome, right?! For the sake of this article, we will be denoting the content image as ‘C’, the style image
as ‘S’ and the generated image as ‘G’. In order to perform neural style transfer, we’ll need to extract features
from different layers of our ConvNet.
 

What are deep ConvNets learning?

Before diving deeper into neural style transfer, let’s first visually understand what the deeper layers of a
ConvNet are really doing. Let’s say we’ve trained a convolution neural network on a 224 X 224 X 3 input
image:

To visualize each hidden layer of the network, we first pick a unit in layer 1, find 9 patches that maximize
the activations of that unit, and repeat it for other units. The first hidden layer looks for relatively simpler
features, such as edges, or a particular shade of color. The image compresses as we go deeper into the
network. The hidden unit of a CNN’s deeper layer looks at a larger region of the image.

As we move deeper, the model learns complex relations:

This is what the shallow and deeper layers of a CNN are computing. We will use this learning to build a
neural style transfer algorithm.

Cost Function

First, let’s look at the cost function needed to build a neural style transfer algorithm. Minimizing this cost
function will help in getting a better generated image (G). Defining a cost function:

J(G) = α * J_Content(C,G) + β * J_Style(S,G)

Here, the content cost function ensures that the generated image has the same content as that of the
content image, whereas the style cost function is tasked with making sure that the generated image
is in the fashion of the style image.

Below are the steps for generating the image using the content and style images:

1. We first initialize G randomly, say G: 100 X 100 X 3, or any other dimension that we want
2. We then define the cost function J(G) and use gradient descent to minimize J(G) to update G:
G = G – d/dG(J(G))

Suppose the content and style images we have are:

First, we initialize the generated image:

After applying gradient descent and updating G multiple times, we get something like this:

Not bad! This is the outline of a neural style transfer algorithm. It’s important to understand both the
content cost function and the style cost function in detail for maximizing our algorithm’s output.

Content Cost Function

Suppose we use the lth layer to define the content cost function of a neural style transfer algorithm.
Generally, the layer which is neither too shallow nor too deep is chosen as the lth layer for the content cost
function. We use a pretrained ConvNet and take the activations of its lth layer for both the content image
as well as the generated image and compare how similar their content is. With me so far?

Now, we compare the activations of the lth layer. For the content and generated images, these are a[l](C)
and a[l](G) respectively. If both these activations are similar, we can say that the images have similar
content. Thus, the cost function can be defined as follows:

J_Content(C,G) = ½ * || a[l](C) – a[l](G) ||²


We try to minimize this cost function and update the activations in order to get similar content. Next, we
will define the style cost function to make sure that the style of the generated image is similar to the style
image.

Style Cost Function

Suppose we pass an image to a pretrained ConvNet:

We take the activations from the lth layer to measure the style. We define the style as the correlation
between activations across channels of that layer. Let’s say that the lth layer looks like this:

We want to know how correlated the activations are across different channels:

ai,j,k = activations at (i,j,k)

Here, i is the height, j is the width, and k is the channel number. We can create a correlation matrix which
provides a clear picture of the correlation between the activations from every channel of the lth layer:

where k and k’ range from 1 to n_c[l]. This matrix is called a style matrix. If the activations are correlated,
G_kk’ will be large, and vice versa. S denotes that this matrix is for the style image. Similarly, we can create a
style matrix for the generated image:

Using these two matrices, we define a style cost function:


This style cost function is for a single layer. We can generalize it for all the layers of the network:

Finally, we can combine the content and style cost function to get the overall cost function:

J(G) = α * J_Content(C,G) + β * J_Style(S,G)
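To make the style matrix concrete, here is a small NumPy sketch (gram_matrix and style_cost are my own helper names, and the usual normalization constant in front of the style cost is omitted for brevity):

import numpy as np

def gram_matrix(a):
    # a: activations of shape (n_h, n_w, n_c); returns the n_c x n_c matrix
    # G[k, k'] = sum over all positions (i, j) of a[i, j, k] * a[i, j, k']
    n_h, n_w, n_c = a.shape
    a = a.reshape(n_h * n_w, n_c)
    return a.T @ a

def style_cost(a_s, a_g):
    # squared Frobenius distance between the style matrices of S and G
    return np.sum((gram_matrix(a_s) - gram_matrix(a_g)) ** 2)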

And there you go! Quite a ride through the world of CNNs, wasn’t it?

End Notes

We have learned a lot about CNNs in this article (far more than I did in any one place!). We have seen how a
ConvNet works, the various building blocks of a ConvNet, its various architectures and how they can be
used for image recognition applications. Finally, we have also learned how YOLO can be used for detecting
objects in an image before diving into two really fascinating applications of computer vision – face
recognition and neural style transfer.

Do share your thoughts with me regarding what you learned from this article. Have you used CNNs
before? If yes, feel free to share your experience with me – it always helps to learn from each other.

Article Url - https://www.analyticsvidhya.com/blog/2018/12/guide-convolutional-neural-network-cnn/

Pulkit Sharma
My research interests lie in the fields of Machine Learning and Deep Learning. I possess an enthusiasm
for learning new skills and technologies.
PRACTICE QUESTIONS ON BAYES’S FORMULA AND ON
PROBABILITY
(NOT TO BE HANDED IN )

1. remarks
If you find any errors in this document, please alert me.
Remark 1. First, I’ll make a remark about question 40 from section 12.4 in the book. Let
A= event that first card is a spade and B=event that second card is a spade. As part of this
question, you computed (presumably using the total law of probability) that
P(B) = P(A)P(B | A) + P(A^c)P(B | A^c) = (13/52) × (12/51) + (39/52) × (13/51) = 1/4.

Note that in this case, of course, you already knew that

P(B) = 13/52 = 1/4,

since there are 13 spades in 52 cards, so the unconditional probability of B is 13/52.

The law of total probability gives you a method for computing the unconditional
(or total) probability of an event B if you know its conditional probabilities with respect to
some other event A and the probability of A. In this case, we knew directly what P (B) is
(because we had enough information- we know how many cards there are and how many
spades) and you can see how it agrees with what the total law of probability gives you.

However, in most of the other examples, such as the one with the test for a virus we
did in class, it’s not possible to compute the probability of B (in that case, that the test
is positive) directly because you don’t have enough information (we don’t know how many
tests come out positive and how many tests are being administered, i.e., we don’t know the
percentage of tests that come out positive). What we know are the conditional probabilities
of the test coming out positive with the conditions that the person taking it was infected or
not. And we know the probability of this condition happening, i.e., we know the probability
that someone is infected. So the information you have here consists of precisely the pieces
that you need in order to use the total law of probability to compute the probability that a
test comes out positive, and there’s no other way to know this probability.
Remark 2. For all the following questions, the easiest way to think about them is to draw
the tree diagram. Please do so when you try to do them, or when you read the solutions –
draw the diagram to try to follow what’s happening.

2. solutions
Exercise 1. A doctor is called to see a sick child. The doctor has prior information that
90% of sick children in that neighborhood have the flu, while the other 10% are sick with
measles. Let F stand for the event of a child being sick with flu and M stand for the event of
a child being sick with measles. Assume for simplicity that F ∪ M = Ω, i.e., that there are no
other maladies in that neighborhood.
A well-known symptom of measles is a rash (the event of having which we denote R).
Assume that the probability of having a rash if one has measles is P (R | M ) = 0.95.
However, occasionally children with flu also develop rash, and the probability of having a
rash if one has flu is P (R | F ) = 0.08.
Upon examining the child, the doctor finds a rash. What is the probability that the child
has measles?
Solution.
We use Bayes’s formula:

P(M | R) = P(R | M)P(M) / (P(R | M)P(M) + P(R | F)P(F)) = (0.95 × 0.10) / (0.95 × 0.10 + 0.08 × 0.90) ≈ 0.57,

which is nowhere close to P(R | M) = 95%.
Exercise 2. In a study, physicians were asked what the odds of breast cancer would be in
a woman who was initially thought to have a 1% risk of cancer but who ended up with a
positive mammogram result (a mammogram accurately classifies about 80% of cancerous
tumors and 90% of benign tumors.)
95 out of a hundred physicians estimated the probability of cancer to be about 75%. Do
you agree?
Solution.
Introduce the events:

+ = mammogram result is positive,


B = tumor is benign,
M = tumor is malignant.
Note that B^c = M. We are given P(M) = 0.01, so P(B) = 1 − P(M) = 0.99.
We are also given the conditional probabilities P(+ | M) = 0.80 and P(− | B) = 0.90, where
the event − is the complement of +, thus P(+ | B) = 0.10.
Bayes’s formula in this case is

P(M | +) = P(+ | M)P(M) / (P(+ | M)P(M) + P(+ | B)P(B)) = (0.80 × 0.01) / (0.80 × 0.01 + 0.10 × 0.99) ≈ 0.075.

So the chance would be 7.5%, a far cry from the common estimate of 75%.
Exercise 3. Suppose we have 3 cards identical in form except that both sides of the first
card are colored red, both sides of the second card are colored black, and one side of the
third card is colored red and the other side is colored black.
The 3 cards are mixed up in a hat, and 1 card is randomly selected and put down on the
ground. If the upper side of the chosen card is colored red, what is the probability that the
other side is colored black?
Solution.
Let RR, BB, and RB denote, respectively, the events that the chosen card is the red-red,
the black-black, or the red-black card. Letting R be the event that the upturned side of the
chosen card is red, we have that the desired probability is obtained by

P(RB | R) = P(RB ∩ R) / P(R)
= P(R | RB)P(RB) / (P(R | RR)P(RR) + P(R | RB)P(RB) + P(R | BB)P(BB))
= (1/2)(1/3) / ((1)(1/3) + (1/2)(1/3) + (0)(1/3)) = 1/3
This question was actually just like the Monty Hall problem!
Exercise 4. It is estimated that 50% of emails are spam emails. Some software has been
applied to filter these spam emails before they reach your inbox. A certain brand of software
claims that it can detect 99% of spam emails, and the probability for a false positive (a
non-spam email detected as spam) is 5%.
Now if an email is detected as spam, then what is the probability that it is in fact a
non-spam email?
Solution.
Define events

A = event that an email is detected as spam,
B = event that an email is spam,
B^c = event that an email is not spam.

We know P(B) = P(B^c) = 0.5, P(A | B) = 0.99, P(A | B^c) = 0.05.
Hence by Bayes’s formula we have

P(B^c | A) = P(A | B^c)P(B^c) / (P(A | B)P(B) + P(A | B^c)P(B^c)) = (0.05 × 0.5) / (0.99 × 0.5 + 0.05 × 0.5) = 5/104.
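All three calculations above have the same shape, so they can be checked with a small Python helper (mine, not part of the original solutions; posterior is a made-up name):

def posterior(prior, likelihood, prior_c, likelihood_c):
    # Bayes's formula: P(H | E) = P(E | H)P(H) / (P(E | H)P(H) + P(E | H^c)P(H^c))
    num = likelihood * prior
    return num / (num + likelihood_c * prior_c)

print(posterior(0.10, 0.95, 0.90, 0.08))  # Exercise 1: ~0.57
print(posterior(0.01, 0.80, 0.99, 0.10))  # Exercise 2: ~0.075
print(posterior(0.50, 0.05, 0.50, 0.99))  # Exercise 4: 5/104 ~ 0.048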

MATHEMATICAL LOGIC EXERCISES

Annapaola Marconi, Luciano Serafini and Chiara Ghidini

Anno Accademico 2012-2013


Everything should be made as simple as possible,
but not simpler.

Albert Einstein, Reader’s Digest, Oct. 1977
Contents

1 Introduction

2 Propositional Logic
  2.1 Basic Concepts
  2.2 Truth Tables
  2.3 Propositional Formalization
      2.3.1 Formalizing Simple Sentences
      2.3.2 Formalizing Problems
  2.4 Normal Form Reduction
  2.5 Propositional Natural Deduction

3 First Order Logic
  3.1 Basic Concepts
  3.2 FOL Formalization
  3.3 FOL Natural Deduction

4 Modal Logic
  4.1 Basic Concepts
  4.2 Satisfiability and Validity
  4.3 Modal Logic Formalization
Mathematics is the only instructional material that can
be presented in an entirely undogmatic way.

Max Dehn, The Mathematical Intelligencer, v. 5, no. 2, 1983

Chapter 1

Introduction

The purpose of this booklet is to give you a number of exercises on propositional, first order and modal logics to complement the topics and exercises covered during the lectures of the course on mathematical logic. The material presented here is not a direct component of the course but is offered to you as an incentive and a support to understand and master the concepts and exercises presented during the course.

Symbol Difficulty
 Trivial
 Easy
 Medium
 Difficult
 Very difficult

When you have eliminated the impossible, whatever remains,
however improbable, must be the truth.

Sir Arthur Conan Doyle, The Sign of Four

Chapter 2

Propositional Logic

2.1 Basic Concepts

Exercise 2.1.  -
Which of the following are well formed propositional formulas?

1. ∨pq

2. (¬(p → (q ∧ p)))

3. (¬(p → (q = p)))

4. (¬(♦(q ∨ p)))

5. (p ∧ ¬q) ∨ (q → r)

6. p¬r

Solution.

Well formed formulas: 2. and 5.

Exercise 2.2.  -
Let’s consider the interpretation v where v(p) = F, v(q) = T, v(r) = T.
Does v satisfy the following propositional formulas?

1. (p → ¬q) ∨ ¬(r ∧ q)

2. (¬p ∨ ¬q) → (p ∨ ¬r)

3. ¬(¬p → ¬q) ∧ r

4. ¬(¬p → q ∧ ¬r)


Solution.

v satisfies 1., 3. and 4.


v doesn’t satisfy 2.


2.2 Truth Tables

Exercise 2.3.  -
Compute the truth table of (F ∨ G) ∧ ¬(F ∧ G).

Solution.

F G F ∨G F ∧G ¬(F ∧ G) (F ∨ G) ∧ ¬(F ∧ G)
T T T T F F
T F T F T T
F T T F T T
F F F F T F

+ The formula models an exclusive or!

Exercise 2.4.  -
Use the truth tables method to determine whether (p → q) ∨ (p → ¬q) is valid.

Solution.

p q p→q ¬q p → ¬q (p → q) ∨ (p → ¬q)
T T T F F T
T F F T T T
F T T F T T
F F T T T T

The formula is valid since it is satisfied by every interpretation.


Exercise 2.5.  -
Use the truth tables method to determine whether (¬p∨q)∧(q → ¬r∧¬p)∧(p∨r)
(denoted with ϕ) is satisfiable.

Solution.

p q r ¬p ∨ q ¬r ∧ ¬p q → ¬r ∧ ¬p (p ∨ r) ϕ
T T T T F F T F
T T F T F F T F
T F T F F T T F
T F F F F T T F
F T T T F F T F
F T F T T T F F
F F T T F T T T
F F F T T T F F
There exists an interpretation satisfying ϕ, thus ϕ is satisfiable.

Exercise 2.6.  -
Use the truth tables method to determine whether the formula ϕ : p∧¬q → p∧q
is a logical consequence of the formula ψ : ¬p.

Solution.

p q ¬p p ∧ ¬q p ∧ q p ∧ ¬q → p ∧ q
T T F F T T
T F F T F F
F T T F F T
F F T F F T

ψ |= ϕ since each interpretation satisfying ψ also satisfies ϕ.


Exercise 2.7.  -
Use the truth tables method to determine whether p → (q ∧ ¬q) and ¬p are
logically equivalent.

Solution.

p q q ∧ ¬q p → (q ∧ ¬q) ¬p
T T F F F
T F F F F
F T F T T
F F F T T

The two formulas are equivalent since for every possible interpretation they evaluate to the same truth value.
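The truth-table method is mechanical, so it can also be automated. The following Python sketch (an addition for illustration, not part of the booklet) enumerates every interpretation and re-checks Exercises 2.6 and 2.7:

from itertools import product

def rows(f, n):
    # truth values of formula f under all 2^n interpretations
    return [f(*v) for v in product([True, False], repeat=n)]

implies = lambda a, b: (not a) or b

# Exercise 2.6: psi |= phi ?
phi = lambda p, q: implies(p and not q, p and q)
psi = lambda p, q: not p
print(all(implies(s, f) for s, f in zip(rows(psi, 2), rows(phi, 2))))   # True

# Exercise 2.7: is p -> (q AND NOT q) equivalent to NOT p ?
print(rows(lambda p, q: implies(p, q and not q), 2) == rows(lambda p, q: not p, 2))  # True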

Exercise 2.8. 
Compute the truth tables for the following propositional formulas:

• (p → p) → p

• p → (p → p)

• p∨q →p∧q

• p ∨ (q ∧ r) → (p ∧ r) ∨ q

• p → (q → p)

• (p ∧ ¬q) ∨ ¬(p ↔ q)


Exercise 2.9. 
Use the truth table method to verify whether the following formulas are valid,
satisfiable or unsatisfiable:

• (p → q) ∧ ¬q → ¬p

• (p → q) → (p → ¬q)

• (p ∨ q → r) ∨ p ∨ q

• (p ∨ q) ∧ (p → r ∧ q) ∧ (q → ¬r ∧ p)

• (p → (q → r)) → ((p → q) → (p → r))

• (p ∨ q) ∧ (¬q ∧ ¬p)

• (¬p → q) ∨ ((p ∧ ¬r) ↔ q)

• (p → q) ∧ (p → ¬q)

• (p → (q ∨ r)) ∨ (r → ¬p)

Exercise 2.10. 
Use the truth table method to verify whether the following logical consequences
and equivalences are correct:

• (p → q) |= ¬p → ¬q

• (p → q) ∧ ¬q |= ¬p

• p → q ∧ r |= (p → q) → r


• p ∨ (¬q ∧ r) |= q ∨ ¬r → p

• ¬(p ∧ q) ≡ ¬p ∨ ¬q

• (p ∨ q) ∧ (¬p → ¬q) ≡ q

• (p ∧ q) ∨ r ≡ (p → ¬q) → r

• (p ∨ q) ∧ (¬p → ¬q) ≡ p

• ((p → q) → q) → q ≡ p → q


2.3 Propositional Formalization

2.3.1 Formalizing Simple Sentences

Exercise 2.11.  -
Let’s consider a propositional language where

• p means “Paola is happy”,

• q means “Paola paints a picture”,

• r means “Renzo is happy”.

Formalize the following sentences:

1. “if Paola is happy and paints a picture then Renzo isn’t happy”

2. “if Paola is happy, then she paints a picture”

3. “Paola is happy only if she paints a picture”

Solution.

1. p ∧ q → ¬r

2. p → q

3. ¬(p ∧ ¬q) ..which is equivalent to p → q

+ The precision of formal languages avoids the ambiguities of natural languages.


Exercise 2.12.  -
Let’s consider a propositional language where

• p means “x is a prime number”,

• q means “x is odd”.

Formalize the following sentences:

1. “x being prime is a sufficient condition for x being odd”

2. “x being odd is a necessary condition for x being prime”

Solution. 1. and 2. p→q

Exercise 2.13.  -
Let A =“Aldo is Italian” and B =“Bob is English”.
Formalize the following sentences:

1. “Aldo isn’t Italian”

2. “Aldo is Italian while Bob is English”

3. “If Aldo is Italian then Bob is not English”

4. “Aldo is Italian or if Aldo isn’t Italian then Bob is English”

5. “Either Aldo is Italian and Bob is English, or neither Aldo is Italian nor
Bob is English”

Solution.

1. ¬A

2. A ∧ B

3. A → ¬B


4. A ∨ (¬A → B) logically equivalent to A ∨ B

5. (A ∧ B) ∨ (¬A ∧ ¬B) logically equivalent to A ↔ B

Exercise 2.14. 
Aldo, Bruno and Carlo are three students that took the Logic exam. Let’s
consider a propositional language where

• A =“Aldo passed the exam”,

• B =“Bruno passed the exam”,

• C =“Carlo passed the exam”.

Formalize the following sentences:

1. “Carlo is the only one passing the exam”

2. “Aldo is the only one not passing the exam”

3. “Only one, among Aldo, Bruno and Carlo, passed the exam”

4. “At least one among Aldo, Bruno and Carlo passed”

5. “At least two among Aldo, Bruno and Carlo passed the exam”

6. “At most two among Aldo, Bruno and Carlo passed the exam”

7. “Exactly two, among Aldo, Bruno and Carlo passed the exam”

Exercise 2.15.  -
Let’s consider a propositional language where

• A =“Angelo comes to the party”,


• B =“Bruno comes to the party”,

• C =“Carlo comes to the party”,

• D =“Davide comes to the party”.

Formalize the following sentences:

1. “If Davide comes to the party then Bruno and Carlo come too”

2. “Carlo comes to the party only if Angelo and Bruno do not come”

3. “Davide comes to the party if and only if Carlo comes and Angelo doesn’t
come”

4. “If Davide comes to the party, then, if Carlo doesn’t come then Angelo
comes”

5. “Carlo comes to the party provided that Davide doesn’t come, but, if
Davide comes, then Bruno doesn’t come”

6. “A necessary condition for Angelo coming to the party, is that, if Bruno


and Carlo aren’t coming, Davide comes”

7. “Angelo, Bruno and Carlo come to the party if and only if Davide doesn’t
come, but, if neither Angelo nor Bruno come, then Davide comes only if
Carlo comes”

Solution.

1. D → B ∧ C

2. C → ¬A ∧ ¬B

3. D ↔ (C ∧ ¬A)

4. D → (¬C → A)

5. (¬D → C) ∧ (D → ¬B)

6. A → (¬B ∧ ¬C → D)

7. (A ∧ B ∧ C ↔ ¬D) ∧ (¬A ∧ ¬B → (D ↔ C))


Exercise 2.16. 
Let’s consider a propositional language where

• A =“Angelo comes to the party”,

• B =“Bruno comes to the party”,

• C =“Carlo comes to the party”,

• D =“Davide comes to the party”.

Formalize the following sentences:

1. “Angelo comes to the party while Bruno doesn’t”

2. “Either Carlo comes to the party, or Bruno and Davide don’t come”

3. “If Angelo and Bruno come to the party, then Carlo comes provided that
Davide doesn’t come”

4. “Carlo comes to the party if Bruno and Angelo don’t come, or if Davide
comes”

5. “If Angelo comes to the party then Bruno or Carlo come too, but if Angelo
doesn’t come to the party, then Carlo and Davide come”

Exercise 2.17.  -
Socrates says:

“If I’m guilty, I must be punished;


I’m guilty. Thus I must be punished.”


Is the argument logically correct?

Solution. The argument is logically correct: if p means “I’m guilty” and q


means “I must be punished”, then:
(p → q) ∧ p |= q (modus ponens)

Exercise 2.18.  -
Socrates says:

“If I’m guilty, I must be punished;


I’m not guilty. Thus I must not be punished.”

Is the argument logically correct?

Solution. The argument is not logically correct:


(p → q) ∧ ¬p ⊭ ¬q

+ consider for instance v(p) = F and v(q) = T

Exercise 2.19. 
Socrates says:

“If I’m guilty, I must be punished;


I must not be punished. Thus I’m not guilty.”

Is the argument logically correct?

INTRODUCTION TO APPLIED ARTIFICIAL INTELLIGENCE

1. Introduction to AI
a. Overview
b. Intelligence
c. Searching
2. Mathematical Background
a. Linear Algebra
b. Fuzzy Sets
c. Probability Theory, Bayes Theorem
d. Calculus and Optimization
e. Statistical Concepts: Random Variables, Gaussian Distribution, Bayes Theorem,
Markov Process, Regression
3. Machine Learning
a. Unsupervised Learning: Clustering, PCA
b. Supervised learning: KNN, FCM, LR, Decision Trees, Random Forests
c. Evolutionary Learning: Genetic Algorithms
4. Neural Networks and Deep Learning
a. MLP and Back Propagation
b. Deep learning:
i. CNN
ii. RNN and LSTM
iii. Autoencoders
5. Propositional Logic and Predicate Logic
6. Introduction to NLP
Textbooks:
1. Artificial Intelligence Illuminated – Ben Coppin (2004)
2. Deep Learning – Ian Goodfellow, Yoshua Bengio, Aaron Courville (2017)
3. Computer Vision: Algorithms and Applications – Richard Szeliski (2010)
4. Foundations of Statistical Natural Language Processing - Christopher D. Manning,
Hinrich Schuetze (1999)
KNN & Other Algorithms
KNN Classification Algorithm
K-Nearest Neighbours (KNN)
1. Simple, but a very powerful classification algorithm
2. Classifies based on a similarity measure
3. Lazy learning: it does not “learn” until the test example is given
4. Whenever we have new data to classify, we find its K nearest neighbours from the training data
5. Classified by “MAJORITY VOTES” of its neighbours’ classes: assigned to the most common class amongst its K nearest neighbours (by measuring the “distance” between data points)

Voronoi diagram
• Describes the areas that are nearest to any given point, given a set of data
• Each line segment is equidistant between two points of opposite classes
KNN Contd…
Pros
• No training period
• Learning and implementation is extremely simple and intuitive
• Flexible decision boundaries

Cons
• Irrelevant or correlated features have high impact and must be eliminated
• Typically difficult to handle high dimensionality
• Computational costs: memory and classification time computation
• Sensitive to noisy data
KNN Result Interpretation

• KNN creates local models (or neighbourhoods) across the feature space, with each region defined by a subset of the training data.
• Implicitly, a ‘global’ decision space is created with boundaries between the training data.
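As a concrete illustration, scikit-learn’s KNeighborsClassifier implements exactly this majority-vote scheme. A quick sketch (mine, not from the slides; the Iris dataset is just a convenient placeholder), mirroring the train/test split style used earlier in this document:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)

knn = KNeighborsClassifier(n_neighbors=5)    # K = 5 nearest neighbours
knn.fit(X_train, y_train)                    # lazy learning: this just stores the data
print(accuracy_score(y_test, knn.predict(X_test)))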
K-means Clustering Algorithm
Basic Algorithm:
Step 0: select K
Step 1: randomly select initial cluster seeds
Step 2: An initial cluster seed represents the “mean value” of its cluster
Step 3: Calculate distance from each object to each cluster seed.
• What type of distance should we use?
• Squared Euclidean distance
Step 4: Assign each object to the closest cluster
Step 5: Compute the new centroid for each cluster
Step 6: Iterate
• Calculate distance from objects to cluster centroids.
• Assign objects to closest cluster
• Recalculate new centroids
Step 7: Stop based on convergence criteria
• No change in clusters
• Max iterations
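The whole loop above is available off the shelf; a minimal scikit-learn sketch (illustrative, not from the slides; the four 2-D points are placeholders):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 3], [2, 5], [4, 8], [7, 9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # Steps 0-7 above
print(km.cluster_centers_)   # final centroids
print(km.labels_)            # closest cluster for each object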
Fuzzy C-Means Algorithm
It’s the fuzzy version of K-Means Clustering Algorithm. Each data point, instead of belonging to a single cluster,
can have membership values to each of the clusters.

Step 1: Initialize the membership of the data points into the desired number of clusters randomly.
Step 2: Find out the fuzzy centroids using the formula

v_j = (Σ_i μ_ij^m · x_i) / (Σ_i μ_ij^m),  j = 1, ..., c

where n is the no. of data points, c is the number of clusters, m is the fuzziness parameter (generally 2), and μ_ij is the membership of the ith data point in the jth cluster.
Step 3: Now recompute the membership values using the formula

μ_ij = 1 / Σ_k (d_ij / d_ik)^(2/(m−1))

where d_ij is the distance of the ith data point from the jth cluster centroid.
Step 4: Continue repeating steps 2 & 3 until the stopping criterion is achieved.
An Example of FCM

• Suppose the given data points are {(1, 3), (2, 5), (4, 8), (7, 9)}.
• Let us assume there are 2 clusters and the initial memberships are randomly initialized as below:

Point: (1, 3) (2, 5) (4, 8) (7, 9)
Cluster 1) 0.8 0.7 0.2 0.1
Cluster 2) 0.2 0.3 0.8 0.9

• New centroids are computed as below:
V11 = (0.8²·1 + 0.7²·2 + 0.2²·4 + 0.1²·7) / (0.8² + 0.7² + 0.2² + 0.1²) = 1.568
V12 = (0.8²·3 + 0.7²·5 + 0.2²·8 + 0.1²·9) / (0.8² + 0.7² + 0.2² + 0.1²) = 4.051
V21 = (0.2²·1 + 0.3²·2 + 0.8²·4 + 0.9²·7) / (0.2² + 0.3² + 0.8² + 0.9²) = 5.35
V22 = (0.2²·3 + 0.3²·5 + 0.8²·8 + 0.9²·9) / (0.2² + 0.3² + 0.8² + 0.9²) = 8.215

New centroids are: (1.568, 4.051) and (5.35, 8.215).

• Distances are calculated as below:
D11 = ((1 − 1.568)² + (3 − 4.051)²)^0.5 = 1.2
D12 = ((1 − 5.35)² + (3 − 8.215)²)^0.5 = 6.8
Similarly, distances are calculated for all the data points.

• Membership values are updated as below:
μ11 = [{(1.2)²/(1.2)² + (1.2)²/(6.8)²}^(1/(2−1))]⁻¹ = 0.96
μ12 = [{(6.8)²/(6.8)² + (6.8)²/(1.2)²}^(1/(2−1))]⁻¹ = 0.04

Now these two steps are repeated to compute the final cluster centroids.
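The same updates can be run in a few lines of NumPy. The sketch below (mine, not from the slides; it assumes no data point ever lands exactly on a centroid, which would give a zero distance) reproduces the centroids above on its first pass:

import numpy as np

X = np.array([[1, 3], [2, 5], [4, 8], [7, 9]], dtype=float)
U = np.array([[0.8, 0.7, 0.2, 0.1],     # memberships in cluster 1
              [0.2, 0.3, 0.8, 0.9]])    # memberships in cluster 2
m = 2                                   # fuzziness parameter

for _ in range(20):
    W = U ** m
    V = (W @ X) / W.sum(axis=1, keepdims=True)                  # fuzzy centroids
    D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)   # D[j, i] = d(x_i, v_j)
    U = 1.0 / ((D[:, None, :] / D[None, :, :]) ** (2 / (m - 1))).sum(axis=1)

print(V)   # the first pass gives (1.568, 4.051) and (5.35, 8.215) as above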


Principal Component Analysis (PCA)
• This takes N-dimensional data and finds the M orthogonal directions in which the data have the
most variance.
• These M principal directions form a lower-dimensional subspace.
• We can represent an N-dimensional datapoint by its projections onto the M principal
directions.
• This loses all information about where the datapoint is located in the remaining orthogonal
directions.
• It is a statistical process that converts the observations of correlated features into a set of linearly
uncorrelated features with the help of orthogonal transformation. These new transformed features
are called the Principal Components.
• Step 1: Represent the data in a matrix Z (n x c) [n -> cardinality, c -> dimensionality]
• Step 2: Mean-centre the columns of Z and calculate the covariance matrix ZᵀZ
• Step 3: Now we calculate the eigenvalues and eigenvectors of the resultant covariance matrix. The eigenvectors of the covariance matrix are the directions of the axes with the most information (variance), and the corresponding eigenvalues measure how much variance lies along each direction.
• Step 4: Now we take all the eigenvalues and sort them in decreasing order. Simultaneously, sort the eigenvectors accordingly; the resultant matrix of sorted eigenvectors is named P*.
• Step 5: Finally, we multiply Z by the P* matrix. In the resultant matrix Z*, each observation is a linear combination of the original features, and the columns of Z* are independent of each other.
• We can discard unimportant or less important features (components) from the final matrix.
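A NumPy sketch of these steps (mine, not from the slides; it mean-centres Z first, and the random matrix is only a placeholder for real data):

import numpy as np

Z = np.random.rand(100, 5)            # n = 100 observations, c = 5 features
Zc = Z - Z.mean(axis=0)               # mean-centre each feature
cov = Zc.T @ Zc / (len(Zc) - 1)       # c x c covariance matrix
vals, vecs = np.linalg.eigh(cov)      # eigenvalues and eigenvectors
order = np.argsort(vals)[::-1]        # sort eigenvalues in decreasing order
P_star = vecs[:, order]               # sorted eigenvectors (the P* matrix)
Z_star = Zc @ P_star                  # data expressed along the principal axes
Z_reduced = Z_star[:, :2]             # keep only the top M = 2 components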
Auto Encoders
• An autoencoder neural network is an Unsupervised Machine learning algorithm that applies backpropagation, setting
the target values to be equal to the inputs. Autoencoders are used to reduce the size of our inputs into a smaller
representation. If anyone needs the original data, they can reconstruct it from the compressed data.

• An autoencoder can learn non-linear transformations with a non-linear activation function and multiple layers.
• It doesn’t have to learn dense layers; it can use convolutional layers, which is better for video, image and series data.
• It is more efficient to learn several layers with an autoencoder rather than learn one huge transformation with PCA.
• An autoencoder provides a representation of each layer as the output.
• It can make use of pre-trained layers from another model to apply transfer learning to enhance the encoder/decoder.
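A minimal Keras sketch of such a network (illustrative, not from the slides; the 784/32 sizes are placeholders, e.g. flattened 28 x 28 images compressed to a 32-dimensional code):

from tensorflow.keras import layers, models

inp = layers.Input(shape=(784,))
code = layers.Dense(32, activation='relu')(inp)       # compressed representation
out = layers.Dense(784, activation='sigmoid')(code)   # reconstruction of the input
autoencoder = models.Model(inp, out)
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(X, X, ...)   # note: the targets are the inputs themselves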
Applications of Auto-Encoders
• Image Colouring: Autoencoders are used for converting any black and white picture into a coloured image.
Depending on what is in the picture, it is possible to tell what the colour should be.

• Feature variation: It extracts only the required features of an image and generates the output by removing
any noise or unnecessary interruption.

• Dimensionality Reduction: The reconstructed image is the same as our input but with reduced dimensions.
It helps in providing the similar image with a reduced pixel value.

• Denoising Image: The input seen by the autoencoder is not the raw input but a stochastically corrupted
version. A denoising autoencoder is thus trained to reconstruct the original input from the noisy version.

• Watermark Removal: It is also used for removing watermarks from images or to remove any object while
filming a video or a movie.
Genetic Algorithms (GAs)

• A genetic algorithm (or GA) is a search technique used in computing to find true or approximate solutions to optimization and search problems. GAs are categorized as global search heuristics.
• GAs are a particular class of evolutionary algorithms that use techniques inspired by evolutionary biology such as inheritance, mutation, selection, and crossover (also called recombination).
• The evolution usually starts from a population of randomly generated individuals and happens in generations.
• In each generation, the fitness of every individual in the population is evaluated, multiple individuals are selected from the current population (based on their fitness), and modified to form a new population. The new population is used in the next iteration of the algorithm.
• The algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population.

The basic loop:
• Start with a large “population” of randomly generated “attempted solutions” to a problem
• Repeatedly do the following:
  • Evaluate each of the attempted solutions
  • (Probabilistically) keep a subset of the best solutions
  • Use these solutions to generate a new population
• Quit when you have a satisfactory solution (or you run out of time)

Natural Evolution -> Genetic Algorithm
genotype -> coded string
phenotype -> uncoded point
chromosome -> string
gene -> string position
allele -> value at a certain position
fitness -> objective function value
The Basic Operations & Characteristics of GAs

Selection: Mechanism for selecting individuals (strings) for reproduction according to their fitness (objective function value).
Crossover: Method of merging the genetic information of two individuals; if the coding is chosen properly, two good parents produce good children.
Mutation: In real evolution, the genetic material can be changed randomly by erroneous reproduction or other deformations of genes, e.g. by gamma radiation. In genetic algorithms, mutation can be realized as a random deformation of the strings with a certain probability. The positive effect is preservation of genetic diversity and, as an effect, that local maxima can be avoided.
Sampling: Procedure which computes a new generation from the previous one and its offspring.

Characteristics of GAs:
1. GAs manipulate coded versions of the problem parameters (strings) instead of the parameters themselves, i.e. the search space is S instead of X itself.
2. While almost all conventional methods search from a single point, GAs always operate on a whole population of points (strings). This contributes much to the robustness of genetic algorithms: it improves the chance of reaching the global optimum and, vice versa, reduces the risk of becoming trapped in a local stationary point.
3. Normal genetic algorithms do not use any auxiliary information about the objective function value such as derivatives. Therefore, they can be applied to any kind of continuous or discrete optimization problem. The only thing to be done is to specify a meaningful decoding function.
4. GAs use probabilistic transition operators, while conventional methods for continuous optimization apply deterministic transition operators. More specifically, the way a new generation is computed from the actual one has some random components (we will see later, with the help of some examples, what these random components are like).
Selection Crossover Mutation
Selection is the component which guides the algorithm to the solution by preferring individuals with high fitness over low-
fitted ones. It can be a deterministic operation, but in most implementations it has random components. One variant, which
is very popular nowadays (we will give a theoretical explanation of its good properties later), is the following scheme, where
the probability to choose a certain individual is proportional to its fitness. For obvious reasons, this method is often called
proportional selection.
Crossover is a very powerful tool for introducing new genetic material and maintaining genetic diversity, but with the
outstanding property that good parents also produce well-performing children or even better ones. In the simplest case, we
can realize this process by cutting two strings at a randomly chosen position and swapping the two tails.
• N-point crossover: Instead of only one, N breaking points are chosen randomly. Every second section is swapped. Among
this class, two point crossover is particularly important
• Segmented crossover: Similar to N-point crossover with the difference that the number of breaking points can vary.
• Uniform crossover: For each position, it is decided randomly if the positions are swapped.
• Shuffle crossover: First a randomly chosen permutation is applied to the two parents, then N-point crossover is applied to
the shuffled parents, finally, the shuffled children are transformed back with the inverse permutation.

Mutation: For each bit that we copy to the new population we allow a small probability of error (for instance 0.1). Variations of mutation include inversion of single bits, bitwise inversion and random selection.
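Putting the three operators together, here is a minimal GA sketch (illustrative, not from the slides; the “OneMax” fitness, which simply counts 1-bits, stands in for a real objective, and all rates and sizes are arbitrary):

import random

def fitness(s):
    return sum(s)   # "OneMax": number of 1-bits in the string

def evolve(pop_size=20, length=16, generations=50, p_mut=0.1):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # proportional (roulette-wheel) selection; +1 avoids all-zero weights
        parents = random.choices(pop, weights=[fitness(s) + 1 for s in pop], k=pop_size)
        nxt = []
        for a, b in zip(parents[::2], parents[1::2]):
            cut = random.randrange(1, length)            # one-point crossover
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                nxt.append([bit ^ 1 if random.random() < p_mut else bit
                            for bit in child])           # bitwise mutation
        pop = nxt
    return max(pop, key=fitness)

print(evolve())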
Recurrent Neural Networks (RNN)

RNNs were created because there were a few issues in the feed-forward neural network:
• Cannot handle sequential data
• Considers only the current input
• Cannot memorize previous inputs

The solution to these issues is the RNN. An RNN can handle sequential data, accepting the current input data as well as previously received inputs, and it can memorize previous inputs.

In recurrent neural networks, the information cycles through a loop to the middle hidden layer. The input layer ‘x’ takes in the input to the neural network, processes it and passes it onto the middle layer.

Applications of RNN:
1. Time Series Prediction
2. Natural Language Processing
3. Machine Translation
Types of RNN

One to One RNN: This type of neural network is known as the Vanilla Neural Network. It is used for general machine learning problems which have a single input and a single output.

One to Many RNN: This type of neural network has a single input and multiple outputs. An example of this is image captioning.

Many to One RNN: This RNN takes a sequence of inputs and generates a single output. Sentiment analysis is a good example of this kind.

Many to Many RNN: This RNN takes a sequence of inputs and generates a sequence of outputs. Machine translation is one of the examples.
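A minimal Keras sketch of a many-to-one RNN, of the kind used for sentiment analysis (illustrative, not from the slides; the sequence length 30, feature size 8 and 16 hidden units are placeholders):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(30, 8)),            # 30 time steps, 8 features per step
    layers.SimpleRNN(16),                   # hidden state loops across the sequence
    layers.Dense(1, activation='sigmoid'),  # single output (e.g. sentiment)
])
model.compile(optimizer='adam', loss='binary_crossentropy')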
Propositional Logic
And
Predicate Logic
Proposition
• A proposition is a collection of declarative statements that has either the truth value "true" or the truth value "false".
• A proposition consists of propositional variables and connectives.
• We denote the propositional variables by capital letters (A, B, etc.), and the connectives connect the propositional variables.
• Some examples are given below:
• "Man is Mortal" is a proposition; it returns truth value "TRUE"
• "12 + 9 = 3 - 2" is a proposition; it returns truth value "FALSE"
• "A is less than 2" is not a proposition, because its truth depends on the specific value of A
Connectives or Operators
In propositional logic we generally use five connectives:
• OR (∨)
• AND (∧)
• Negation / NOT (¬)
• Implication / if-then (→)
• Biconditional / if and only if (⇔)

Truth table for the connectives:
A  B  A∨B  A∧B  ¬A  A→B  A⇔B
T  T   T    T    F    T    T
T  F   T    F    F    F    F
F  T   T    F    T    T    F
F  F   F    F    T    T    T
Properties of Operators:
• Commutativity: P∧Q = Q∧P, and P∨Q = Q∨P
• Associativity: (P∧Q)∧R = P∧(Q∧R), (P∨Q)∨R = P∨(Q∨R)
• Identity element: P∧True = P, P∨True = True
• Distributivity: P∧(Q∨R) = (P∧Q)∨(P∧R), P∨(Q∧R) = (P∨Q)∧(P∨R)
• De Morgan's Laws: ¬(P∧Q) = (¬P)∨(¬Q), ¬(P∨Q) = (¬P)∧(¬Q)
• Double-negation elimination: ¬(¬P) = P

Operator precedence (highest first):
First Precedence – Parenthesis
Second Precedence – Negation
Third Precedence – Conjunction (AND)
Fourth Precedence – Disjunction (OR)
Fifth Precedence – Implication
Sixth Precedence – Biconditional
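These identities can be machine-checked by enumerating every truth assignment; a small illustrative Python sketch (not from the slides):

from itertools import product

for P, Q in product([True, False], repeat=2):
    assert (not (P and Q)) == ((not P) or (not Q))  # De Morgan: ¬(P∧Q) = ¬P ∨ ¬Q
    assert (not (P or Q)) == ((not P) and (not Q))  # De Morgan: ¬(P∨Q) = ¬P ∧ ¬Q
    assert (not (not P)) == P                       # double-negation elimination
print("All equivalences hold for every truth assignment.")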
Tautology & Contradiction
A Tautology is a formula which is always true for every value of its propositional variables.

A Contradiction is a formula which is always false for every value of its propositional variables.

Propositional Equivalence
Two statements X and Y are logically equivalent if either of the following two conditions holds:
• The truth tables of the two statements have the same truth values.
• The biconditional statement X⇔Y is a tautology.
Inferencing
Inference rules are templates for generating valid arguments. Inference rules are applied to derive proofs in artificial intelligence, where a proof is a sequence of conclusions that leads to the desired goal. Following are some terminologies related to inference rules:
Implication – It has two parts − Hypothesis & Conclusion. “If you do your homework, you will not be punished.”
Here, "you do your homework" is the hypothesis, and "you will not be punished" is the conclusion.
Inverse − An inverse of the conditional statement is the negation of both the hypothesis and the conclusion. If
the statement is “If p, then q”, the inverse will be “If not p, then not q”. Thus the inverse of p→q is ¬p→¬q.
Example − The inverse of “If you do your homework, you will not be punished” is
“If you do not do your homework, you will be punished.”
Converse − The converse of the conditional statement is computed by interchanging the hypothesis and the
conclusion. If the statement is “If p, then q”, the converse will be “If q, then p”. The converse of p→q is q→p.
Example − The converse of "If you do your homework, you will not be punished" is "If you will not be punished,
you do your homework”.
Contra-positive − The contra-positive of the conditional is computed by interchanging the hypothesis and the
conclusion of the inverse statement. If the statement is “If p, then q”, the contra-positive will be “If not q, then
not p”. The contra-positive of p→q is ¬q→¬p.
Example − The Contra-positive of " If you do your homework, you will not be punished” is "If you are punished,
you did not do your homework”.
Inference Rules
Modus Ponens: The Modus Ponens rule states that if P→Q is true and P is true, then we can infer that Q is true.
Statement-1: "If I am sleepy then I go to bed" ==> P→Q
Statement-2: "I am sleepy" ==> P
Conclusion: "I go to bed." ==> Q

Modus Tollens: The Modus Tollens rule states that if P→Q is true and ¬Q is true, then ¬P is also true.
Statement-1: "If I am sleepy then I go to bed" ==> P→Q
Statement-2: "I do not go to the bed." ==> ¬Q
Conclusion: "I am not sleepy" ==> ¬P

Hypothetical Syllogism: The Hypothetical Syllogism rule states that P→R is true whenever P→Q is true and Q→R is true.
Statement-1: If you have my home key then you can unlock my home. P→Q
Statement-2: If you can unlock my home then you can take my money. Q→R
Conclusion: If you have my home key then you can take my money. P→R

Disjunctive Syllogism: The Disjunctive Syllogism rule states that if P∨Q is true and ¬P is true, then Q is true.
Statement-1: Today is Sunday or Monday. ==> P∨Q
Statement-2: Today is not Sunday. ==> ¬P
Conclusion: Today is Monday. ==> Q

The Addition rule: If P is true, then P∨Q will be true.
The Simplification rule: If P∧Q is true, then P (or Q) will also be true.
The Resolution rule: If P∨Q and ¬P∨R are true, then Q∨R will also be true.
The Rules
Addition:               P                    ∴ P∨Q
Simplification:         P∧Q                  ∴ P
Conjunction:            P, Q                 ∴ P∧Q
Modus Ponens:           P→Q, P               ∴ Q
Modus Tollens:          P→Q, ¬Q              ∴ ¬P
Disjunctive Syllogism:  P∨Q, ¬P              ∴ Q
Hypothetical Syllogism: P→Q, Q→R             ∴ P→R
Constructive Dilemma:   (P→Q)∧(R→S), P∨R     ∴ Q∨S
Destructive Dilemma:    (P→Q)∧(R→S), ¬Q∨¬S   ∴ ¬P∨¬R
An Example
Mrs. Baker says that
1. Alice is her youngest child if Bill is not her youngest child, and
2. Alice is not her youngest child if Carl is not her youngest child.
Write down a knowledge base that describes this riddle and the necessary background knowledge that only
one of the three children can be her youngest child. Show with resolution that Bill is her youngest child.
Let the propositions A, B and C denote that Mrs. Baker's youngest child is Alice, Bill and Carl, respectively.
The information from Mrs. Baker is:
1. ¬B ⇒ A (Alice is her youngest child if Bill is not)
2. ¬C ⇒ ¬A (Alice is not her youngest child if Carl is not)
We have the following clauses for the background knowledge:
1. A ∨ B ∨ C (One child has to be the youngest.)
2. ¬A ∨ ¬B (Alice and Bill can't both be the youngest.)
3. ¬A ∨ ¬C
4. ¬B ∨ ¬C
We want to show that Bill is the youngest child.
1. ¬B (Assume that Bill is not the youngest child.)
2. A (As ¬B ⇒ A)
3. ¬C (As ¬A ∨ ¬C)
4. ¬A (As ¬C ⇒ ¬A)
So our assumption leads to the fact that both A and ¬A are true, which is not possible. So B is true.
From Proposition To Predicate
3 + 2 = 5 is a proposition. But is X + 2 = 5 a proposition? Because it has a variable X in it, we cannot say whether it is T or F. So it is not a proposition. It is called a predicate.

A predicate is a function. It takes some variable(s) as arguments; it returns either True or False (but not both) for each combination of the argument values. In contrast, a proposition is not a function. It does not have any variable as argument. It is either True or False (but not both).

The variables are always associated with a universe (or domain) of discourse, which tells us what combinations of the argument values are allowed.

In general, a predicate is not a proposition. But when you assign values to all its argument variables, you get a proposition. Suppose P(x) is a predicate, where the universe of discourse for x is {1, 2, 3}. Then P(x) is not a proposition, but P(1) is a proposition.

Example:
P(x, y): "x + 2 = y" is a predicate. It has two variables x and y.
Universe of discourse: x is in {1, 2, 3}; y is in {4, 5, 6}.
P(1, 4): 1 + 2 = 4 is a proposition (it is F);
P(2, 4): 2 + 2 = 4 is a proposition (it is T);
P(2, 3): meaningless (in this example), because 3 is not in the specified universe of discourse for y.
Predicate Logic Or First Order Logic
In propositional logic, we can only represent the facts, which are either true or false.
PL is not sufficient to represent the complex sentences or natural language statements.
Consider the following sentence, which we cannot represent using PL logic.
• "Some humans are intelligent", or
• "Sachin likes cricket."
To represent the above statements, we require some more powerful logic, such as first-order logic.
First-order logic (like natural language) does not only assume that the world contains facts, as propositional logic does, but also assumes the following things in the world:
• Objects: A, B, people, numbers, colours, wars, theories, squares, pits, etc.
• Relations: unary relations such as red, round, is adjacent; or n-ary relations such as the sister of, brother of, has colour, comes between, etc.
• Functions: father of, best friend, third inning of, end of, etc.
Consider the statement "x is an integer.": it consists of two parts; the first part, x, is the subject, and the second part, "is an integer," is known as the predicate.
A quantifier is a language element which generates quantification. There are two quantifiers:
• Universal Quantifier: ∀ (for all, everyone, everything)
• Existential Quantifier: ∃ (for some, at least one)
Example
• All men drink coffee = ∀x man(x) → drink(x, coffee)
• Some boys are intelligent = ∃x boys(x) ∧ intelligent(x)
• All birds fly = ∀x bird(x) → fly(x)
• Every man respects his parent = ∀x man(x) → respects(x, parent)
• Some boys play cricket = ∃x boys(x) ∧ play(x, cricket)
• Not all students like both Mathematics and Science =
¬∀x [student(x) → like(x, Mathematics) ∧ like(x, Science)]
Goldbach’s Conjecture: every even number greater than 2 is equal to the sum of two prime numbers
∀ n, if integer(n), even(n), greater(n,2) then
∃ p1, p2, integer(p1), integer(p2), prime(p1),prime(p2), and
equals(n, sum(p1,p2))
Universal Quantification and Existential Quantification are two more ways of converting Predicates to Propositions.
Let P(x) : “x+2=5” is a Predicate in the Universe of Discourse {1, 2, 3}.
Then ∀ x P(x) means: ”for all x in {1, 2, 3}, x + 2 = 5” which is a false proposition
And ∃ x P(x) means: “for some x in {1, 2, 3}, x + 2 = 5” which is a true proposition.
Inference in FOL
Substitution: Substitution is a fundamental operation performed on terms and formulas. Writing F[a/x] refers to substituting the constant "a" in place of the variable "x".
Equality: First-Order logic does not only use predicate and terms for making atomic sentences but also uses
equality symbols which specify that the two terms refer to the same object, e.g., Brother (John) = Smith.
Like propositional logic, FOL also has inference rules. Following are some basic inference rules in FOL:
• Universal Generalization: states that if premise P(c) is true for any arbitrary element c in the universe of
discourse, then we can have a conclusion as ∀ x P(x).
• Universal Instantiation: states that we can infer any sentence P(c) by substituting a ground term c (a
constant within domain x) from ∀ x P(x) for any object in the universe of discourse.
• Existential Instantiation: states that one can infer P(c) from the formula given in the form of ∃x P(x) for a
new constant symbol c, with the restriction that the c used in the rule must be a new term for which P(c ) is
true.
• Existential Generalization: states that if there is some element c in the universe of discourse which has a
property P, then we can infer that there exists something in the universe which has the property P.
Example
Universal Generalization: from P(c), infer ∀x P(x)
Example: Let P(c) represent "A byte contains 8 bits"; if it holds for an arbitrary byte c, then ∀x P(x), "All bytes contain 8 bits", is also true.

Universal Instantiation: from ∀x P(x), infer P(c)
Example: If "Every person likes ice-cream" => ∀x P(x), we can infer that "John likes ice-cream" => P(John).

Existential Instantiation: from ∃x P(x), infer P(c) for a new constant symbol c
Example: If you get 90 on the final exam for AI, then you get an A for the course. Someone, call him/her c, gets 90 on the final exam. Therefore c gets an A for AI.

Existential Generalization (Introduction): from P(c), infer ∃x P(x)
Example: "Priyanka got good marks in English." Therefore, "Someone got good marks in English."
Negated Quantifiers
A(x): "x lives in Amherst."
B(x): "x is a CSE student."
C(x): "x has a good GPA."
D(x): "x majors in computer science."
Universe of discourse: all UG students

• All CSE students have a good GPA: ∀x (B(x) → C(x))
• No CSE student lives in Amherst: ¬∃x (B(x) ∧ A(x))
• CSE students who do not live in Amherst major in computer science: ∀x (B(x) ∧ ¬A(x) → D(x))

Not every UB student majors in computer science: ¬∀x D(x).
There is a UB student who does not major in computer science: ∃x ¬D(x).
These two statements have the same meaning.

There is no UB student living in Amherst: ¬∃x A(x).
Every UB student lives in a town other than Amherst: ∀x ¬A(x).
These two statements have the same meaning.

In general, we can say that
¬∀x P(x) ≡ ∃x ¬P(x)    and    ¬∃x P(x) ≡ ∀x ¬P(x)
Nested Quantifiers Examples
Suppose P(x, y) means "x (the first variable in P(∗,∗)) takes y (the second variable in P(∗,∗))", where the domain of x is the students in this class and the domain of y is the courses offered by UG CSE.
• P(Alice, CSE-AI): Alice takes CSE-AI.
• ∃y P(Alice, y): Alice takes a UG CSE course.
• ∃x (P(x, CSE-AI) ∧ P(x, CSE-DBMS)): A student in this class takes both CSE-AI and CSE-DBMS.
• ∃x ∃y ∀z ((x ≠ y) ∧ (P(x,z) → P(y,z))): There are two different students x and y in this class such that if x takes a UG CSE course, then y takes it as well.
• Every UG CSE course is taken by some student in this class: ∀y ∃x P(x, y).
• No student in this class takes both CSE-AI and CSE-Crypto: ¬∃x (P(x, CSE-AI) ∧ P(x, CSE-Crypto)).

Now let A(x): "x lives in Amherst", B(x): "x is a CSE-AI student", C(x, y): "x and y are friends", D(x): "x majors in computer science"; universe of discourse: all UB students.
• There is a computer science major who has a friend living in Amherst: ∃x (D(x) ∧ ∃y (A(y) ∧ C(x, y)))
• Either there is a computer science major who has a friend living in Amherst, or all CSE-AI students major in computer science: ∃x (D(x) ∧ ∃y (A(y) ∧ C(x, y))) ∨ ∀x (B(x) → D(x))

Consider the following two statements:
∀x ∃y (C(x, y) ∧ B(y)): All UB students have friends taking CSE-AI.
∃y ∀x (C(x, y) ∧ B(y)): There is a UB student who is the friend of all UB students and takes CSE-AI.
Do they mean the same thing? NO
NLP Fundamentals
Levels of Analysis
Level 1 – Speech sound (Phonetics & Phonology): interpret the acoustic signals of a person, interpret individual units or phonemes, and convert them into word sequences. (Example: raw SOUND.)
Level 2 – Words & their forms (Morphology, Lexicon): the analysis of morphology is necessary because a text can use different forms of a word (e.g., infect, infected), which could produce too much linguistic variability and, therefore, increase the dimensionality of a text, obfuscating the real meaning of the individual word. (Example: Dog -> Dog(s), Dog(ged).)
Level 3 – Structure of sentences (Syntax, Parsing): this level tries to understand the linguistic roles or functions of words, usually known as their part-of-speech (POS). (Example: "Ram Goes To Market" is right; "Goes Ram Market To" is wrong.)
Level 4 – Meaning of sentences (Semantics): the level of semantic analysis determines the literal meaning of a word or sentence. For this, semantics tries to identify the interactions between individual meanings (words) in the context given in a sentence. (Example: Go: event, Ram: Agent, Market: Destination, ???: Source.)
Level 5 – Meaning in context & for a purpose (Pragmatics): the task that makes it possible to connect statements with world or domain changes, and, therefore, reason about them, is called Pragmatic Analysis. (Example: "Is that water?" means different things in a pub and in a chemistry lab.)
Level 6 – Connected sentence processing in a larger body of text (Discourse): through the structure of the discourse of statements, we must try to discover what the speaker is trying to say with their words or intentions. (Example: "There goes Ram. He is sick.")
Feature Engineering – TF/IDF Vectors
TF-IDF Vectors: TF-IDF (Term Frequency / Inverse Document Frequency) score represents the relative
importance of a term in the document and the entire corpus. TF(t) = (Number of times term t appears in a
document) / (Total number of terms in the document), and IDF(t) = log_e(Total number of documents /
Number of documents with term t in it)
• Word level: Matrix representing tf-idf scores of every term in different documents
• N-Gram level: Matrix representing tf-idf scores of N-grams (combination of N terms together)
• Character level: Matrix representing tf-idf scores of character level n-grams in the corpus
DOCUMENT-1 term counts: This 1, Is 1, About 2, Messi 4 (Total 8).
DOCUMENT-2 term counts: This 1, Is 2, About 1, Tf-Idf 1 (Total 5).

TF(this, doc-1) = 1/8, TF(this, doc-2) = 1/5
TF(Messi, doc-1) = 4/8 = 1/2, TF(Messi, doc-2) = 0
IDF(this) = log(2/2) = 0
IDF(Messi) = log(2/1) = 0.301 (base-10 logarithms are used in this example)
TF-IDF(Messi, doc-1) = 1/2 × 0.301 = 0.15
TF-IDF(this, doc-1) = 1/8 × 0 = 0
Thus we can see how IDF eliminates the common words which appear in all (or almost all) documents.
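A minimal Python sketch of the calculation above, with base-10 logarithms to match the worked numbers; the two toy documents are taken from the example:

import math

docs = [
    {"this": 1, "is": 1, "about": 2, "messi": 4},   # document 1 (8 terms)
    {"this": 1, "is": 2, "about": 1, "tf-idf": 1},  # document 2 (5 terms)
]

def tf(term, doc):
    return doc.get(term, 0) / sum(doc.values())

def idf(term):
    n_containing = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / n_containing)

print(tf("messi", docs[0]) * idf("messi"))  # ~0.15
print(tf("this", docs[0]) * idf("this"))    # 0.0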
Feature Engineering – Co-occurrence Matrix
The idea is that similar words tend to occur together and will have similar context
Co-occurrence – For a given corpus, the co-occurrence of a pair of words say w1 and w2 is the number of times
they have appeared together in a Context Window.
Context Window – Context window is specified by a number and the direction.
Now, let us take an example corpus to calculate a co-occurrence matrix.
Corpus = He is not lazy. He is intelligent. He is smart.

Co-occurrence matrix:
             He  is  not  lazy  intelligent  smart
He            0   4    2     1            2      1
is            4   0    1     2            2      1
not           2   1    0     1            0      0
lazy          1   2    1     0            0      0
intelligent   2   2    0     0            0      0
smart         1   1    0     0            0      0

Note that this co-occurrence matrix is not the word vector representation. Instead, the co-occurrence matrix is decomposed using techniques like PCA, SVD, etc. into factors, and the combination of these factors forms the word vector representation.
There are also prediction-based vectors. Word2vec is not a single algorithm but a combination of two techniques – CBOW (Continuous Bag of Words) and the Skip-gram model. Both of these are shallow neural networks which map word(s) to a target variable which is also a word(s).
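A small sketch that builds the co-occurrence matrix above; the symmetric context window of size 2 is an assumption, chosen because it reproduces the counts in the table:

from collections import defaultdict

corpus = "he is not lazy he is intelligent he is smart".split()
window = 2
counts = defaultdict(int)
for i, w in enumerate(corpus):
    lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
    for j in range(lo, hi):
        if i != j:
            counts[(w, corpus[j])] += 1

vocab = ["he", "is", "not", "lazy", "intelligent", "smart"]
for w1 in vocab:
    print(w1, [counts[(w1, w2)] for w2 in vocab])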
Feature Engineering – other features
A number of extra text based features can also be created which sometimes are helpful for improving text
classification models. Some examples are:
1. Word Count of the documents – total number of words in the documents
2. Character Count of the documents – total number of characters in the documents
3. Average Word Density of the documents – average length of the words used in the documents
4. Punctuation Count in the Complete Essay – total number of punctuation marks in the documents
5. Upper Case Count in the Complete Essay – total number of upper count words in the documents
6. Title Word Count in the Complete Essay – total number of proper case (title) words in the documents
7. Frequency distribution of Part of Speech Tags:
1. Noun Count
2. Verb Count
3. Adjective Count
4. Adverb Count
5. Pronoun Count
Model Building
The final step in the text classification framework is to train a classifier using the features created in the
previous step. There are many different choices of machine learning models which can be used to train a
final model. We will implement following different classifiers for this purpose:
1. Naive Bayes Classifier
2. Linear Classifier
3. Support Vector Machine
4. Bagging Models
5. Boosting Models
6. Shallow Neural Networks
7. Deep Neural Networks
1. Convolutional Neural Network (CNN)
2. Long Short-Term Memory (LSTM)
3. Gated Recurrent Unit (GRU)
4. Bidirectional RNN
5. Recurrent Convolutional Neural Network (RCNN)
6. Other Variants of Deep Neural Networks
Evaluation
TP (True Positives): number of samples that were correctly classified as positive.
FP (False Positives): number of samples that were incorrectly classified as positive.
TN (True Negatives): number of samples that were correctly classified as negative.
FN (False Negatives): number of samples that were incorrectly classified as negative.

Accuracy answers the question "How well did the model classify?" and is calculated as the ratio of the number of correct classifications (or predictions) to the total number of classifications. That is to say:
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision answers the question "What proportion of the positively classified samples were actually correct?" and is calculated as the proportion of correct positive classifications over the total number of positive classifications. That is to say:
Precision = TP / (TP + FP)

Recall answers the question "What proportion of the actual positive samples were correctly classified?" and is calculated as the proportion of correct positive classifications over the total number of samples that actually belong to the positive class. That is to say:
Recall = TP / (TP + FN)
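These three metrics follow directly from the four counts; a minimal sketch with made-up illustrative counts:

def evaluate(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

print(evaluate(tp=40, fp=10, tn=45, fn=5))  # (0.85, 0.8, 0.888...)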
CNN – Convolutional Neural Networks
Introduction
• CNNs have a deep feed-forward architecture and the capability to generalize much better than fully connected (FC) networks.
• A CNN can be thought of as a hierarchical feature detector; CNNs are biologically inspired.
• A CNN can learn abstract features and identify objects efficiently.
• The advantages of CNNs are:
  • A fully-connected structure does not scale to large images, but due to the weight-sharing property, CNNs vastly reduce the number of parameters in the network, resulting in better generalization (i.e., less overfitting) and smoother training.
  • Feature extraction and classification happen simultaneously within the same framework.
• CNNs are found to be extremely efficient in the fields of image classification, object detection, face detection, speech recognition, facial expression recognition, etc.

CNN Architecture
Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
1. Neurons are arranged in 3 dimensions: width, height, depth.
2. A stacked sequence of layers (3 main types) is used to build a CNN: the Convolutional Layer, the Pooling Layer, and the Fully-Connected Layer.
3. Every layer of a CNN transforms one volume of activations to another through a differentiable function.
4. Convolutional layers are the main building blocks; they contain kernels (filters) to detect features in an image.
5. The pooling layers reduce the dimension using some kind of aggregation.
6. The fully connected layers, generally at the end of the network and generally consisting of not more than 3 layers, do the final classification.
Convolution
Convolution leverages three important ideas that can help improve a machine learning system:
1. Sparse interactions: In traditional network layers, there is an interaction between each input unit and each output unit. Convolutional networks typically have sparse interactions, accomplished by making the kernel smaller than the input. So, with an input image of thousands or millions of pixels, we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels.
2. Parameter sharing: This refers to using the same parameter more than once. In a CNN, each member of the kernel is used at every position of the input, resulting in less computation and storage.
3. Equivariant representations: Parameter sharing gives the layers the property of equivariance to translation.
An Example
Convolution contd…
(Figures: learning small features and combining them; equivariance to translation.)
A sample dot product between one image patch and a kernel:
50×30 + 20×30 + 50×30 + 50×30 + 50×30 = 6600

Input Image (5×5):
1 1 1 0 0
0 1 1 1 0
0 0 1 1 1
0 0 1 1 0
0 1 1 0 0

Filter (3×3):
1 0 1
0 1 0
1 0 1

Sliding the filter over the image with stride 1 produces a 3×3 feature map (for example, the top-left output value is 4). In general, for an image of size N×N, a filter of size M×M and stride S, the output size is ⌊(N − M)/S⌋ + 1.

Zero-padding adds a border of zeros around the input; with one layer of padding the 5×5 image becomes 7×7:
0 0 0 0 0 0 0
0 1 1 1 0 0 0
0 0 1 1 1 0 0
0 0 0 1 1 1 0
0 0 0 1 1 0 0
0 0 1 1 0 0 0
0 0 0 0 0 0 0
Convolution – Summary
Max Pooling: the pooling layer aggregates each small region of the feature map, e.g., by keeping only its maximum value.

INPUT   FILTER   STRIDE   OUTPUT
7x7     3x3      1        5x5
7x7     3x3      2        3x3
7x7     3x3      3        ??? ((7 − 3)/3 is not an integer, so the filter does not fit evenly)

Output Size = (N − F)/stride + 1
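A small sketch of 2-D "valid" convolution (really cross-correlation, as used in CNNs) together with the output-size formula, using the 5×5 image and 3×3 filter shown above:

import numpy as np

def conv2d(image, kernel, stride=1):
    n, m = image.shape[0], kernel.shape[0]
    out = (n - m) // stride + 1  # output size: floor((N - M)/S) + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+m, j*stride:j*stride+m]
            result[i, j] = np.sum(patch * kernel)
    return result

img = np.array([[1,1,1,0,0],[0,1,1,1,0],[0,0,1,1,1],[0,0,1,1,0],[0,1,1,0,0]])
ker = np.array([[1,0,1],[0,1,0],[1,0,1]])
print(conv2d(img, ker))  # 3x3 feature map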
Sample Architecture & Size
[INPUT] →[CONV 1] → [BATCH NORM] → [ReLU] → [POOL 1]→ [CONV 2] → [BATCH NORM] → [ReLU] → [POOL 2]→ [FC LAYER] → [RESULT]
CNN Architecture
Architecture of VGGNet
Gradient Descent
Pre-requisite
The gradient descent algorithm does not work well for all functions. There are two specific requirements: a function has to be differentiable and convex.

Gradient descent (GD) is an iterative first-order optimisation algorithm used to find a local minimum/maximum of a given function.

A function is convex if the line segment connecting any two points on its curve lies on or above the curve and does not cross it. Mathematically, for two points x1, x2 lying on the function's curve this condition is expressed as:
f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2)
Here λ denotes a point's location on the section line and its value has to be between 0 (left point) and 1 (right point); e.g., λ = 0.5 means a location in the middle.

Another way to check mathematically whether a univariate function is convex is to calculate the second derivative and check whether its value is always bigger than 0, i.e., d²f(x)/dx² > 0.
Semi-convex or Quasi-convex
For a strictly convex function, gradient descent is guaranteed to converge to the global minimum, whereas for a quasi-convex function it can get stuck in a local minimum.

Let f(x) = x² − x + 3
⇒ f′(x) = 2x − 1
⇒ f″(x) = 2
The second derivative is always bigger than zero, so the function is strictly convex.

Now consider the function
f(x) = x⁴ − 2x³ + 2
⇒ f′(x) = 4x³ − 6x² = x²(4x − 6)
f′(x) = 0 at x = 0 and x = 1.5
f″(x) = 12x² − 12x = 12x(x − 1)
Also, f″(x) = 0 at x = 0 and x = 1, so these are saddle (inflection) points where the curvature changes sign:
• for x < 0: the function is convex
• for 0 < x < 1: the function is concave
• for x > 1: the function is convex again
At x = 1.5, f″(x) > 0, and there is a global minimum.
Gradient Descent
• Gradient descent is an optimization technique that can find the minimum of an objective function.
• It is a greedy technique that finds the optimal solution by taking a step in the direction of the maximum rate of decrease of the function.
• In summary, the gradient descent method's steps are:
  1. choose a starting point (initialisation)
  2. calculate the gradient at this point
  3. make a scaled step in the opposite direction to the gradient (objective: minimise)
  4. repeat steps 2 and 3 until one of the criteria is met:
     • maximum number of iterations reached
     • step size is smaller than the tolerance
• The parameters are:
  1. starting point – manual / random
  2. gradient function – the differential of the error
  3. learning rate – scaling factor for step sizes
  4. maximum number of iterations
  5. tolerance to conditionally stop the algorithm

Let f(w) be a function where w = (w1, w2, …, wn)ᵀ.
We need to find a w for which f(w) is minimum.
Using gradient descent, we do the following:
1. Choose an initial random value of w
2. Choose the maximum number of iterations T
3. Choose a learning rate λ
4. Repeat the following steps until f doesn't change or the iterations exceed T:
   1. Δw = −λ∇w f(w)
   2. w ← w + Δw
Here ∇w f(w) = (∂f(w)/∂w1, ∂f(w)/∂w2, …, ∂f(w)/∂wn)

For example, if ∇w f(w) = [2w1, 2w2] (i.e., f(w) = w1² + w2²), then at each iteration (w1, w2) is updated as
(w1, w2) ← (w1, w2) − λ[2w1, 2w2]
An Example
Let f(x) = x² − 4x + 1
f′(x) = 2x − 4

Let us take the initial value x0 = 9 and learning rate λ = 0.1.
The first few iterations are as follows:
x1 = 9 − 0.1 × (2×9 − 4) = 7.6
x2 = 7.6 − 0.1 × (2×7.6 − 4) = 6.48
x3 = 6.48 − 0.1 × (2×6.48 − 4) = 5.584
x4 = 5.584 − 0.1 × (2×5.584 − 4) = 4.8672
The iterates approach the minimum at x = 2, where f′(x) = 0.
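The iteration can be written as a short Python loop (a direct transcription of the worked example above):

def grad(x):
    return 2 * x - 4  # f'(x) for f(x) = x^2 - 4x + 1

x, lam = 9.0, 0.1
for i in range(1, 5):
    x = x - lam * grad(x)
    print(f"x{i} = {x:.4f}")  # 7.6000, 6.4800, 5.5840, 4.8672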
Strategies – Momentum & Batch
Momentum
When using gradient descent, we run into the following problems:
• Getting trapped in a local minimum, which is a direct consequence of this algorithm being greedy.
• Overshooting and missing the global optimum, a direct result of moving too fast along the gradient direction.
• Oscillation, a phenomenon that occurs when the function's value doesn't change significantly no matter the direction it advances. You can think of it as navigating a plateau: you're at the same height no matter where you go.
To combat these problems, a momentum term α is added to the expression for Δw to stabilize the rate of learning when moving towards the global optimum value. Using the superscript i to denote the iteration number:
Δwⁱ = −η∇w f(wⁱ) + αΔwⁱ⁻¹

Batch
Batch Gradient Descent: Batch gradient descent, also called vanilla gradient descent, calculates the error for each example within the training dataset, but the model is updated only after all training examples have been evaluated. This whole process is like a cycle and is called a training epoch. It is computationally more efficient, but convergence is problematic.
Stochastic Gradient Descent: SGD updates the parameters for each training example one by one. It is computationally more expensive, but convergence is more rapid.
Mini-batch Gradient Descent: Mini-batch gradient descent splits the training dataset into small batches and performs an update for each of these batches, creating a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.
MLP Using Python
MLP in Python
General pipeline for implementing an ANN:
• Design and define the neural network architecture
• Select the optimizer that performs the optimization (gradient descent)
• Select the loss function and train it
• Select the appropriate evaluation metric for the given problem

from tensorflow.keras.models import Sequential
model = Sequential(...)
model.add(...)                     # e.g., Dense(16, input_shape=(8,))
model.compile(optimizer=..., loss=..., metrics=...)
model.fit(X, y, epochs=..., batch_size=...)
model.evaluate(...)
model.predict(...)

Model Layers
1. Weight Initialization
   1. random_uniform (−0.5 ≤ w ≤ 0.5)
   2. random_normal (μ = 0, σ = 0.5)
   3. zeros (w = 0)
2. Activation Function
   1. relu – max(x, 0)
   2. sigmoid – 1/(1 + e^(−x))
   3. softmax – e^x / Σe^x
   4. softplus – log(1 + e^x)
   5. softsign – x/(|x| + 1)
   6. tanh – (e^x − e^(−x))/(e^x + e^(−x))
3. Layer Types
   1. Dense – fully connected
   2. Dropout – sets a fraction of the inputs to zero
   3. Concatenate – combines inputs from multiple layers

model.add(layers.Dense(units=64, activation=activations.relu, kernel_initializer='random_normal'))

Model Compilation
1. Model optimizers (SGD, RMSProp, Adam, etc.)
2. Model loss functions (cross-entropy – binary/categorical)
3. Model metrics (accuracy, probabilistic, regression, etc.)

model.compile(loss='categorical_crossentropy', optimizer='adam')
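Putting the pipeline together, here is a minimal end-to-end sketch assuming a toy binary classification dataset with 8 features; the data, layer sizes and hyperparameters are illustrative assumptions, not prescribed by the slides:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

X = np.random.rand(200, 8)                 # assumed toy data
y = (X.sum(axis=1) > 4).astype("float32")  # assumed toy labels

model = Sequential()
model.add(Dense(16, activation="relu", input_shape=(8,)))
model.add(Dense(8, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=10, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))     # [loss, accuracy]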
Model Optimizers

Stochastic gradient descent (SGD)
Require: Learning rate ϵ
Require: Initial parameter θ
while stopping criterion not met do
    Sample a minibatch of m examples xi with targets yi
    Compute gradient estimate: g ← (1/m) ∇θ Σi L(f(xi; θ), yi)
    Apply update: θ ← θ − ϵg
end while

SGD with adaptive learning rate
Require: Learning rate schedule ϵ1, ϵ2, …
Require: Initial parameter θ
k ← 1
while stopping criterion not met do
    Sample a minibatch of m examples xi with targets yi
    Compute gradient estimate: g ← (1/m) ∇θ Σi L(f(xi; θ), yi)
    Apply update: θ ← θ − ϵk g
    k ← k + 1
end while

SGD with momentum
Require: Learning rate ϵ, momentum parameter α
Require: Initial parameter θ, initial velocity v
while stopping criterion not met do
    Sample a minibatch of m examples xi with targets yi
    Compute gradient estimate: g ← (1/m) ∇θ Σi L(f(xi; θ), yi)
    Compute velocity update: v ← αv − ϵg
    Apply update: θ ← θ + v
end while

The AdaGrad Algorithm
Require: Learning rate ϵ, initial parameter θ
Require: Small constant δ ≈ 10⁻⁷ for numeric stability
Initialize gradient accumulation variable r = 0
while stopping criterion not met do
    Sample a minibatch of m examples xi with targets yi
    Compute gradient estimate: g ← (1/m) ∇θ Σi L(f(xi; θ), yi)
    Accumulate squared gradient: r ← r + g ⊙ g
    Compute update: Δθ ← −(ϵ/(δ + √r)) ⊙ g
    Apply update: θ ← θ + Δθ
end while
Optimizers Contd…

The RMSProp Algorithm
Require: Learning rate ϵ, decay rate ρ
Require: Small constant δ ≈ 10⁻⁷ for numeric stability
Initialize accumulation variable r = 0
while stopping criterion not met do
    Sample a minibatch of m examples xi with targets yi
    Compute gradient estimate: g ← (1/m) ∇θ Σi L(f(xi; θ), yi)
    Accumulate squared gradient: r ← ρr + (1 − ρ) g ⊙ g
    Compute update: Δθ ← −(ϵ/√(δ + r)) ⊙ g
    Apply update: θ ← θ + Δθ
end while

RMSProp with Nesterov Momentum
Require: Global learning rate ϵ, decay rate ρ
Require: Initial parameter θ, momentum coefficient α
Require: Initial velocity v
Initialize gradient accumulation variable r = 0
while stopping criterion not met do
    Sample a minibatch of m examples xi with targets yi
    Compute interim update: θ' ← θ + αv
    Compute gradient estimate at θ': g ← (1/m) ∇θ' Σi L(f(xi; θ'), yi)
    Accumulate gradient: r ← ρr + (1 − ρ) g ⊙ g
    Compute velocity update: v ← αv − (ϵ/√r) ⊙ g
    Apply update: θ ← θ + v
end while
The Adam optimizer combines these ideas of momentum and adaptive per-parameter learning rates, additionally keeping bias-corrected first- and second-moment estimates of the gradient.
Model Loss Functions – Cross-Entropy
Cross-Entropy = (1/n) Σ(i=1..n) Σ(j=1..C) −yij log pij
where C is the number of classes and n is the number of data points.

The outputs (softmax layer) generated by the network and the corresponding targets are given below:
[0.8 0.2 0.0] – target [1 0 0]
[0.4 0.3 0.3] – target [1 0 0]
[0.3 0.6 0.1] – target [0 1 0]
[0.1 0.1 0.8] – target [0 0 1]
CE = (1/4)(−log 0.8 − log 0.4 − log 0.6 − log 0.8)

When there are only two classes and the model generates a score for only one class, it is a case for binary cross-entropy, which can be calculated using the formula
BCE = (1/n) Σ(i=1..n) −(yi log pi + (1 − yi) log(1 − pi))

Class: 1 – Predicted P(class=1): 0.94 – Prob. of the correct class: 0.94
Class: 0 – Predicted P(class=1): 0.38 – Prob. of the correct class: 0.62
Class: 1 – Predicted P(class=1): 0.54 – Prob. of the correct class: 0.54
Class: 0 – Predicted P(class=1): 0.12 – Prob. of the correct class: 0.88
BCE = −(1/4)(log 0.94 + log 0.62 + log 0.54 + log 0.88)
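Both calculations can be reproduced with NumPy (natural logarithms assumed; a tiny epsilon guards the log of the 0.0 entry):

import numpy as np

preds = np.array([[0.8, 0.2, 0.0], [0.4, 0.3, 0.3],
                  [0.3, 0.6, 0.1], [0.1, 0.1, 0.8]])
targets = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
eps = 1e-12
ce = np.mean(np.sum(-targets * np.log(preds + eps), axis=1))
print(ce)   # (1/4)(-log 0.8 - log 0.4 - log 0.6 - log 0.8) ~ 0.468

p = np.array([0.94, 0.38, 0.54, 0.12])  # predicted P(class = 1)
y = np.array([1, 0, 1, 0])
bce = np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))
print(bce)  # ~ 0.321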
Metrics
Metric values are displayed during fit() and logged to the History object returned by fit().
They are also returned by evaluate().
• Accuracy metrics
  ✓ Accuracy class
  ✓ BinaryAccuracy class
  ✓ CategoricalAccuracy class
  ✓ ………
• Probabilistic metrics
  ✓ BinaryCrossentropy class
  ✓ CategoricalCrossentropy class
  ✓ ………
• Regression metrics
  ✓ MeanSquaredError class
  ✓ RootMeanSquaredError class
  ✓ MeanAbsoluteError class
  ✓ ………
• Classification metrics based on True/False positives & negatives
  ✓ AUC class
  ✓ Precision class
  ✓ Recall class
  ✓ TruePositives class
  ✓ TrueNegatives class
  ✓ FalsePositives class
  ✓ FalseNegatives class
  ✓ ………
ANN-1
Artificial Neural Network
Neurons
1. A typical brain contains a network of 10 billion
neurons
2. Components:
1. Dendrites: Receives the inputs
2. Cell body (Soma): Sums the incoming signals.
When sufficient input is received, the cell fires
3. Synapse: Links. The input signals may be
modified by the weights of the receiving
synapses.
4. Axon: Output, under appropriate circumstance
the neuron transmits a single output
3. A neuron is connected to other neurons through
about 10,000 synapses
4. Once input exceeds a critical level, the neuron discharges an electrical pulse that travels from the body,
down the axon, to the next neuron(s)
1. Transmission of an electrical signal from one neuron to the next is effected by neurotransmitters
2. This link is called a synapse. The strength of the signal that reaches the next neuron depends on factors
such as the amount of neurotransmitter available
MCP Neuron Model
• Researchers Warren McCulloch and Walter Pitts published their first concept of a simplified brain cell in 1943.
• This was called the McCulloch-Pitts (MCP) neuron.
• They described such a nerve cell as a simple logic gate with binary outputs.

Biological Neuron → Artificial Neuron
Cell Nucleus (Soma) → Node
Dendrites → Input
Synapse → Weights or interconnections
Axon → Output

• A neuron is a mathematical function modelled on the working of biological neurons
• One or more inputs are separately weighted
• Inputs are summed and passed through a nonlinear function to produce output
• Every neuron holds an internal state called the activation signal
• Every neuron is connected to another neuron via a connection link
Rosenblatt’s Perceptron
• Perceptron was introduced by Frank Rosenblatt in 1957.
• He proposed a Perceptron learning rule based on the original MCP neuron.
• A Perceptron is an algorithm for supervised learning of binary classifiers.
• This algorithm enables neurons to learn and processes elements in the training set one at a time.
• Single layer: Single layer perceptron
can learn only linearly separable
patterns.
• Multilayer: A multilayer perceptron (MLP) has two or more layers and therefore greater processing power.
• An MLP with at least one hidden layer is a universal approximator (it can represent any function).
• This step function or Activation function is vital in ensuring that output is mapped between (0,1) or (-1,1)
Exploring the Perceptron Model
Activation Functions

Step (threshold) function:
O(x) = 1 if w0 + Σ wi xi > t; 0 if w0 + Σ wi xi ≤ t

Sign function:
O(x) = 1 if w0 + Σ wi xi > 0; −1 if w0 + Σ wi xi ≤ 0

Sigmoid function:
O(x) = 1 / (1 + e^(−(w0 + Σ wi xi)))
Perceptron Learning Algorithm
(Figures: a linearly separable dataset and a linearly non-separable dataset.)
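A minimal sketch of the perceptron learning rule on a toy linearly separable problem (logical AND; the learning rate and epoch count are illustrative choices):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])               # AND labels
w, b, lr = np.zeros(2), 0.0, 0.1

for epoch in range(20):
    for xi, target in zip(X, y):
        pred = 1 if w @ xi + b > 0 else 0
        err = target - pred              # 0 when correct, +1/-1 when wrong
        w += lr * err * xi
        b += lr * err

print(w, b, [1 if w @ xi + b > 0 else 0 for xi in X])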
Logistic Regression
What is Logistic Regression?
1. The dependent variable is not continuous but categorical (e.g., birth weight < 2.5 kg is LOW).
2. Mostly the DV is dichotomous, e.g., Yes/No, Low/High, Right/Left, etc.
3. OLS regression can yield negative probabilities or probabilities greater than 1.
4. The "logit" model solves this problem. Logit means "log odds ratio", i.e.,
   ln(p/(1 − p)) = β0 + β1x, where p is the probability that the event occurs.
5. This implies p = 1/(1 + exp(−β0 − β1x)) = exp(β0 + β1x)/(1 + exp(β0 + β1x))
   1. As β0 + β1x → ∞, p approaches 1
   2. As β0 + β1x → −∞, p approaches 0
   3. As p approaches 0.5, β0 + β1x approaches 0
   4. The steepness of the curve increases as β1 gets bigger
(Figure: probability of LOW birth weight, 1 = Low and 0 = Normal, plotted against gestational age.)
LR Continued…
Logistic Regression can be of three types:
1. Binary LR: The response is binary, i.e., it has two possible outcomes.
2. Nominal LR: There are three or more categories with no natural ordering, e.g., Red, Blue, Orange.
3. Ordinal LR: There are three or more categories with a natural ordering to the levels, but the ranking of the levels does not necessarily mean the intervals between them are equal, e.g., Good, Stable, Serious, Critical.

When we have multiple independent variables, i.e., x is a vector X, the model is formulated as follows:
p(X) = exp(β0 + β1X1 + β2X2 + ⋯ + βkXk) / (1 + exp(β0 + β1X1 + β2X2 + ⋯ + βkXk)) = exp(Xβ)/(1 + exp(Xβ)) = 1/(1 + exp(−Xβ))
Interpretation of the Coefficients
Let us consider a binary DV Y and a one-dimensional input variable X.
Then the equation is ln(p/(1 − p)) = β0 + β1x.
Now, let us try to understand the impact of a unit increase in X.
If the probability becomes p′, then the equation becomes ln(p′/(1 − p′)) = β0 + β1(x + 1) = β0 + β1x + β1.
Subtracting the two equations:
β1 = (β0 + β1x + β1) − (β0 + β1x) = ln(p′/(1 − p′)) − ln(p/(1 − p))
   = ln[ (p′/(1 − p′)) / (p/(1 − p)) ]
   = ln(odds′/odds)
⇒ e^β1 = odds′/odds
This means that a unit increase in X multiplies the odds by the constant factor e^β1: the log-odds grow linearly in X, so the odds grow exponentially.
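A small numeric sketch of this interpretation, with illustrative (assumed) coefficients β0 = −3 and β1 = 0.8, showing that the ratio of successive odds is always e^β1:

import math

b0, b1 = -3.0, 0.8  # assumed illustrative coefficients

def p(x):
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def odds(x):
    return p(x) / (1 - p(x))

for x in [1, 2, 3]:
    print(x, odds(x + 1) / odds(x))  # always exp(b1) ~ 2.2255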
Module - 3
Decision Tree
Random Forest
OLS Regression
Learning in General
Learning
Learning Process & Methods
1. Data Collection: The quality and quantity of information you get are very important, since they will directly impact how well or how badly your model will work.
2. Data Preparation: It will be necessary to make a selection of characteristics (features) which will be used during execution.
3. Model Selection: An appropriate model needs to be selected depending on the type of data and the type of the application.
4. Model Training: The selected model needs to be trained on the data collected during the first phase of the process.
5. Evaluation: Validation of the performance on the available data itself, to ensure proper training.
6. Parameter Tuning: A continuous process along with the previous step, to gradually improve and converge the learning process.
7. Inferencing: Applying the model to real-life data.

Model → Applications
Decision Trees → Financial applications
Random Forest → Fraud detection
Logistic Regression → Price prediction
MLP Networks → Classification
Convolutional Neural Networks → Image processing
Recurrent Neural Networks → Voice recognition
Reinforcement Learning → Learning by trial and error
Generative Models → Image creation
K-means → Segmentation
k-Nearest Neighbors → Recommendation systems
Decision Tree
Decision Trees
Legend: Outlook: S = Sunny, O = Overcast, R = Rainy; Temp.: H = Hot, M = Mild, C = Cold; Humidity: H = Humid, N = Non-humid; Wind: W = Weak, S = Strong; Play?: Y = Yes, N = No.

OUTLOOK  TEMP.  HUMIDITY  WIND  PLAY?
S        H      H         W     N
S        H      H         S     N
O        H      H         W     Y
R        M      H         W     Y
R        C      N         W     Y
R        C      N         S     N
O        C      N         S     Y
S        M      H         W     N
S        C      N         W     Y
R        M      N         W     Y
S        M      N         S     Y
O        M      H         S     Y
O        H      N         W     Y
R        M      H         S     N

Node = root of decision tree
Main loop:
1. A ← the "best" decision attribute for the next node.
2. Assign A as the decision attribute for the node.
3. For each value of A, create a new descendant of the node.
4. Sort training examples to leaf nodes.
5. If training examples are perfectly classified, stop.
6. Else, recurse over new leaf nodes.
Decision Making – Will You Play Tennis?
The learned tree: the root tests Outlook. If Outlook == overcast, the answer is Yes. If Outlook == rain, test Windy: false gives Yes, true gives No. If Outlook == sunny, test Humidity: normal gives Yes, high gives No. For example:
1. (Outlook == rain) and (Windy == false) → yes
2. (Outlook == sunny) and (Humidity == high) → no

Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances:
(Outlook == overcast)
OR ((Outlook == rain) and (Windy == false))
OR ((Outlook == sunny) and (Humidity == normal))
→ YES
“BEST” Attribute
The entropy of a variable is the "amount of information" contained in the variable. Shannon's entropy quantifies the amount of information in a variable, and it is computed using the formula
H(X) = −Σx p(x) log p(x)
Consider the PLAY? data set above (9 Yes, 5 No; base-10 logarithms are used in the numbers below).

• Initial entropy:
Ei = −[(5/14) log(5/14) + (9/14) log(9/14)] = −(−0.16 − 0.12) = 0.28

• When OUTLOOK is taken as the splitting criterion, we get three splits whose populations (No, Yes) are S(3, 2), O(0, 4) and R(2, 3). So the entropy becomes:
EO = (5/14)[−(3/5) log(3/5) − (2/5) log(2/5)] + 0 + (5/14)[−(2/5) log(2/5) − (3/5) log(3/5)]
   ≈ 0.1 + 0.1 = 0.2

• So the information gain is GO = Ei − EO = 0.28 − 0.2 = 0.08
Some more facts on Decision Trees
1. Some other choices for attribute selection are: a. random, b. least-values, c. highest-values, etc.
2. Decision trees can represent any Boolean function of the input attributes. (Figure: a tree testing x1 = 1 and then x2, with class leaves C0/C1.)
3. Decision trees divide the feature space into axis-parallel (hyper-)rectangles. Each rectangular region is labelled with one label, or with a probability distribution over labels. (Figure: a tree with splits such as x2 < 3, x1 < 4, x1 < 3 and x2 < 4, and the corresponding rectangular partition of the plane.)
4. For continuous variables, the range is split into various slabs to fit into this form.
5. There are other measures of impurity, like GINI and the chi-square test.
Random Forest
Random Forest Algorithm
• Random forest builds an ensemble of decision trees on
different samples and takes their majority vote for the final
classification.
• Bagging, also known as Bootstrap Aggregation, is the ensemble technique used by random forest.
• Bagging chooses random samples from the data set. Each model is generated from samples (bootstrap samples) drawn from the original data with replacement, known as row sampling. This step of row sampling with replacement is called bootstrap.
• Each model is then trained independently and generates a result. The final output is based on majority voting after combining the results of all models. This step, which involves combining all the results and generating an output based on majority voting, is known as aggregation.
The bootstrap trees are all different from the original tree
• While growing the tree, at each node, while splitting,
the splitting feature is chosen from only a subset of all
features.
OLS Regression
The Regression Line Parameters
Ŷi = m xi + b
E = Σ(Yi − Ŷi)²
  = Σ(yi − m xi − b)²
  = Σ(yi² + m²xi² + b² − 2m xi yi − 2b yi + 2m b xi)
  = m²Σxi² − 2mΣxi yi + Σyi² + 2bmΣxi − 2bΣyi + Nb²
  = m²𝒜 − 2mℬ + 𝒞 + 2bm𝒳 − 2b𝒴 + Nb²
(with 𝒜 = Σxi², ℬ = Σxi yi, 𝒞 = Σyi², 𝒳 = Σxi, 𝒴 = Σyi)

∂E/∂m = 2m𝒜 − 2ℬ + 2b𝒳 = 0 … (i)
∂E/∂b = 2m𝒳 − 2𝒴 + 2bN = 0 … (ii) ⇒ b = (𝒴 − m𝒳)/N … (iii)
Putting this value of b into (i), we get
m𝒜 − ℬ + 𝒳(𝒴 − m𝒳)/N = 0 ⇒ m = (Nℬ − 𝒳𝒴)/(N𝒜 − 𝒳²) … (iv)

m = (NΣxy − ΣxΣy)/(NΣx² − (Σx)²);  b = (Σy − mΣx)/N

Data ("x" = hours of sunshine, "y" = ice creams sold):
x   y   x²   xy
2   4    4    8
3   5    9   15
5   7   25   35
7  10   49   70
9  15   81  135
Σx = 26, Σy = 41, Σx² = 168, Σxy = 263
m = 1.5183, b = 0.3049
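A minimal Python sketch of the closed-form slope and intercept, using the ice-cream data above:

xs = [2, 3, 5, 7, 9]
ys = [4, 5, 7, 10, 15]
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

m = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # 1.5183
b = (sy - m * sx) / n                          # 0.3049
print(m, b)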
Regression Analysis Contd…
An alternate form is
m = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
  = (Σxi yi − x̄Σyi − ȳΣxi + N x̄ȳ) / (Σxi² − 2x̄Σxi + N x̄²)
  = (Σxi yi − N x̄ȳ) / (Σxi² − N x̄²)

In case all the points lie on the same line, the regression formula should yield that same line. Let us verify this.
Let us assume ∀i, yi = p xi + q. Then
ℬ = Σxi yi = Σxi (p xi + q) = pΣxi² + qΣxi = p𝒜 + q𝒳
𝒴 = Σyi = pΣxi + qN = p𝒳 + qN
Now, according to the formula,
m = (Nℬ − 𝒳𝒴)/(N𝒜 − 𝒳²) = (Np𝒜 + Nq𝒳 − p𝒳² − qN𝒳)/(N𝒜 − 𝒳²) = p
b = (𝒴 − m𝒳)/N = (p𝒳 + qN − p𝒳)/N = q

Simple vs. Multiple Regression:
• Simple: the dependent variable is predicted from one independent variable; there is one regression coefficient.
• Multiple: the dependent variable is predicted from a set of independent variables; there is one regression coefficient for each independent variable.
Multivariate Regression and Pseudo-Inverse
Let x ∈ ℝⁿ, i.e., x = (x1, x2, …, xn). These are our predictors / features.
We have m such data points, represented as the matrix X whose rows are X1, X2, …, Xm, where each Xi is an n-dimensional row vector, and we have the output vector y = (y1, y2, …, ym)ᵀ.

We are trying to solve the general problem of fitting a vector w = (w1, w2, …, wn)ᵀ such that Xw = y.
In other words, we are trying to minimize
‖Xw − y‖² = Σ(i=1..m) (w·xi − yi)²
= min (Xw − y)ᵀ(Xw − y)
= min (wᵀXᵀXw − wᵀXᵀy − yᵀXw + yᵀy)
= min (wᵀXᵀXw − 2wᵀXᵀy + yᵀy)
Differentiating w.r.t. w and setting to 0 yields
XᵀXw = Xᵀy ⇒ w = (XᵀX)⁻¹Xᵀy
The term (XᵀX)⁻¹Xᵀ is referred to as the pseudo-inverse.
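A small sketch of least-squares fitting via the pseudo-inverse on made-up data (two features plus a bias column; the true weights are assumed for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([rng.random((50, 2)), np.ones(50)])  # bias column appended
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.standard_normal(50)

w = np.linalg.pinv(X) @ y  # (X^T X)^-1 X^T y, computed stably
print(w)                   # ~ [2.0, -1.0, 0.5]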
Fuzzy Set and Fuzzy Logic
What is a Fuzzy Set
• If X is an universe of discourse and x is a particular element of X, then
a fuzzy set A defined on X and can be written as a collection of
ordered pairs: 𝐴 = {(𝑥, 𝜇𝐴 (𝑥)), 𝑥 ∈ 𝑋},
• where 𝜇𝐴 𝑥 is called the membership function and
• 𝜇𝐴 𝑥 : 𝑋 → 0, 1 and it is called the Degree of Membership of 𝑥
in 𝐴.
• For example
• Let 𝑋 = 𝐽𝑜ℎ𝑛, 𝑃𝑎𝑢𝑙, 𝑇𝑖𝑛𝑎, 𝑅𝑦𝑎𝑛, 𝐽𝑢𝑙𝑖𝑎 be the reference set of
students.
• Let 𝐴 be a fuzzy set of smart students defined on the universe 𝑋.
• A = {(John, 0.2), (Paul, 0.9), (Tina, 1.0), (Ryan, 0.0), (Julia, 0.7)}
• Unlike statistics and probabilities, the degree is NOT describing the probability that the item is in the set; instead it describes to what extent the item is in the set.
• A fuzzy set is any set that allows its members to have different degrees of membership, given by a membership function with interval [0,1].
• Fuzzy logic is used to describe human reasoning, because True and False alone are sometimes insufficient when describing human reasoning, e.g., young, hot, tall, fair, etc.
• We want the value to switch gradually as Young becomes Middle and Middle becomes Old, or as Hot becomes Warm and then becomes Cold. This is the idea of fuzzy logic.
Definition
A crisp set has a unique membership function:
χA(x) = 1 if x ∈ A; 0 if x ∉ A; so χA(x) ∈ {0, 1}
A fuzzy set can have an infinite number of membership degrees:
μA : X → [0, 1]

• A set defined by a membership function is a fuzzy set.
• The universal set X is always a crisp set.
• Given a universal set X, a fuzzy set A is defined by a membership function A : X → [0,1], e.g., "a number close to 5".
α cut and Strong α cut
• The α-cut of a fuzzy set A is the crisp set that contains all the
elements of the universal set X whose membership grades in
A are greater than or equal to the specified value of α.
• The strong α-cut of a fuzzy set A is the crisp set that contains all the elements of the universal set X whose membership grades in A are strictly greater than the specified value of α. Please note that an α-cut is a crisp set.
Fuzzy Set Operations
Given two fuzzy sets A and B, their standard intersection, union and complement are defined for all x ∈ X by the equations:
μA∪B(x) = max(μA(x), μB(x))
μA∩B(x) = min(μA(x), μB(x))
μĀ(x) = 1 − μA(x)
All the basic set operation rules, i.e., commutativity, associativity, distributivity and De Morgan's laws, remain valid, except that
• A ∩ Ā ≠ ∅, and
• A ∪ Ā ≠ X

1. Support(A) of a fuzzy set A is the set of all points x ∈ X such that μA(x) > 0
2. Core(A) of a fuzzy set A is the set of all points x ∈ X such that μA(x) = 1
3. Singleton: a fuzzy set A whose support is a single point x in X with μA(x) = 1 is called a fuzzy singleton
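A minimal NumPy sketch of these operations on membership vectors over a small discrete universe (the membership values are made up for illustration):

import numpy as np

mu_A = np.array([0.2, 0.9, 1.0, 0.0, 0.7])
mu_B = np.array([0.5, 0.4, 0.6, 0.3, 0.7])

union        = np.maximum(mu_A, mu_B)  # union membership = max
intersection = np.minimum(mu_A, mu_B)  # intersection membership = min
complement_A = 1 - mu_A                # complement membership = 1 - mu_A

print(union, intersection, complement_A)
print(np.minimum(mu_A, complement_A))  # A intersect its complement is not empty, unlike a crisp set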
Fuzzy Relations
Fuzzy Relation (compare with a crisp relation, whose entries are only 0 or 1):

      M    D    A    E    N
5     1   0.3   0    0    0
11   0.3   1   0.2   0    0
17    0   0.1  0.5   1   0.2
23    0    0    0   0.2   1
Fuzzy Expert System
• Most applications of fuzzy logic use it as the underlying logic system for decision support systems.
• A fuzzy expert system is a collection of membership functions and rules that are used to reason about data.
• Usually, the rules in a fuzzy expert system have the following form: "if x is low and y is high then z is medium".
An Example Expert System
• We are trying to find the RISK of a project using the two input parameters Funding & Staffing
• On the Universe of Funding(X) there are three fuzzy sets Not Enough, Medium & Enough
• On the Universe of Staffing(Y), we have defined two fuzzy sets Small & Large
• On the Universe of RISK(Z) we have defined three fuzzy sets Low, Normal & High
RULE 1: IF Funding (x) is Enough (A3) OR Staffing (y) is Small (B1) THEN RISK (z) is LOW (C1)
RULE 2: IF Funding is Medium (A2) OR Staffing is Large (B2) THEN RISK is NORMAL (C2)
RULE 3: IF Funding is Not Enough (A1) THEN RISK is HIGH (C3)

INPUTS: X = x1, Y = y1

Step 1 – Fuzzification: the crisp inputs x1 and y1 are mapped to membership degrees in the fuzzy sets defined on each universe:
μ(x=A1) = 0.5, μ(x=A2) = 0.2, μ(x=A3) = 0
μ(y=B1) = 0.1, μ(y=B2) = 0.7

Step 2 – Rule Evaluation:
IF x is A3 OR y is B1 THEN z is C1: μ(x=A3) ∪ μ(y=B1) = max(0, 0.1) = 0.1
IF x is A2 OR y is B2 THEN z is C2: μ(x=A2) ∪ μ(y=B2) = max(0.2, 0.7) = 0.7
IF x is A1 THEN z is C3: μ(x=A1) = 0.5

Step 3 – Defuzzification (Centroid Method):
k = (0.1 × 30 + 0.7 × 40 + 0.5 × 30) / (30 + 40 + 30) = 0.46
Some Statistical Concepts
Random Variable
Random Variable: a quantitative variable whose values are results of a random process.
• Discrete Random Variable: one whose possible values are finite or countably infinite (e.g., 1, 2, 3, …)
• Continuous Random Variable: one whose values constitute an entire (infinite) range over an interval
Probability distribution of a random variable tells all of its possible values along with their associated
probabilities, for example – if X is the random variable representing the number of tails in two tosses of a coin,
then the probability distribution of X is
X = no. of tails 0 1 2
Probability 0.25 0.5 0.25
A probability distribution function should follow these axioms:
i. 0 ≤ P(X = x) ≤ 1
ii. P(X = x1) + P(X = x2) + ⋯ + P(X = xn) = 1, where x1, x2, …, xn are all the possible values of X
iii. If the events are independent, then P(X = x1 AND X = x2) = P(X = x1) × P(X = x2)

The conditional probability of an event B, assuming that an event A has occurred, is denoted by P(B/A) and is defined as
P(B/A) = P(A ∩ B) / P(A) ⇒ P(A ∩ B) = P(A) × P(B/A)
Probability Mass Function and Probability Density Function

If the random variable always takes values from {0, 1, 2, 3, … n}, then it is called a discrete random variable, e.g.:
• the number of printing mistakes on each page of a book
• the number of telephone calls received by a telephone operator, etc.
If X is a discrete random variable taking values x1, x2, …, xn, then the function P(xi) = P(X = xi) is called the Probability Mass Function.
The function F(x) = P(X ≤ x) is the cumulative distribution function of X.
P(a < X ≤ b) = F(b) − F(a)
P(a ≤ X ≤ b) = F(b) − F(a) + P(X = a)
P(a < X < b) = F(b) − F(a) − P(X = b)
P(a ≤ X < b) = F(b) − F(a) − P(X = b) + P(X = a)

If the random variable takes on all values within a certain interval, then it is called a continuous random variable, e.g.:
• the height, age and weight of individuals
• the amount of rainfall on a rainy day
If X is a continuous random variable, then f(x) is called the Probability Density Function, and it satisfies the conditions:
1. f(x) ≥ 0, ∀x
2. ∫(−∞..∞) f(x) dx = 1
The cumulative distribution function of X is defined as F(x) = P(X ≤ x) = ∫(−∞..x) f(t) dt, −∞ < x < ∞
P(a ≤ X ≤ b) = F(b) − F(a)
dF(x)/dx = f(x), when x is a one-dimensional variable
Mathematical Expectation
Let X be a random variable having a probability distribution function f(x). The expected value of a function u(X) of the random variable is:
If X is a discrete random variable: E[u(X)] = Σx u(x) f(x)
If X is a continuous random variable: E[u(X)] = ∫(−∞..∞) u(x) f(x) dx

1. The mean of the random variable is E[X] = µX
2. If a is a constant, E[a] = a
3. If a and b are constants, E[aX ± b] = a E[X] ± b
4. E[f(X) ± g(X)] = E[f(X)] ± E[g(X)]
5. The variance of the random variable is σX² = E[(X − µX)²] = E[X²] − E[X]²
6. If a is a constant, Var[a] = 0
7. Var[aX ± b] = a² Var[X]

Example: In a gambling game a man is paid Rs. 5 if he gets all heads or all tails when three coins are tossed, and he pays out Rs. 3 if either one or two heads show. What is his expected gain?
The win is Rs. 5 if event E1 = {HHH, TTT} occurs and the loss is Rs. 3 if event E2 = {HHT, HTH, THH, HTT, THT, TTH} occurs.
Since E1 and E2 occur with probabilities 1/4 and 3/4 respectively, it follows that
µ = E[X] = Σx x p(x) = 5 × (1/4) + (−3) × (3/4) = −1
In this game, the gambler will on average lose Rs. 1 per toss of the three coins.
Example
Suppose that the number of cars X that pass through a car wash between 4.00 p.m. and 9.00 p.m. has been observed for 60 days, with the following distribution:
No. of Cars:  4   5   6   7   8   9
No. of Days:  5   5  15  15  10  10

Let g(X) = 2X − 1 represent the amount of money, in rupees, paid to the attendant by the manager. Find the attendant's expected earnings for this particular time period.
x:        4     5     6    7    8    9
P(X=x): 1/12  1/12  1/4  1/4  1/6  1/6

E[g(X)] = E[2X − 1] = Σ(x=4..9) (2x − 1) p(x)
= 7 × (1/12) + 9 × (1/12) + 11 × (1/4) + 13 × (1/4) + 15 × (1/6) + 17 × (1/6)
= 12.67
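The expected-earnings computation can be reproduced with a few lines of Python (a direct transcription of the sum above):

from fractions import Fraction as F

probs = {4: F(1, 12), 5: F(1, 12), 6: F(1, 4), 7: F(1, 4), 8: F(1, 6), 9: F(1, 6)}
expected = sum((2 * x - 1) * p for x, p in probs.items())
print(float(expected))  # 12.666...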
The Normal Distribution
1. Mean, median and mode are equal.
2. The curve is bell-shaped and symmetric about its mean.
3. The curve is asymptotic towards both sides of the X axis.

f(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²))

This is a bell-shaped curve with different centers and spreads depending on µ and σ. No matter what µ and σ are:
∫(−∞..∞) f(x) dx = 1
E(X) = µ = ∫(−∞..∞) x f(x) dx
Var(X) = σ² = (∫(−∞..∞) x² f(x) dx) − µ²

The area under the curve follows the 68-95-99.7 rule:
• the area between µ−σ and µ+σ is about 68%: ∫(µ−σ..µ+σ) f(x) dx ≈ 0.68
• the area between µ−2σ and µ+2σ is about 95%: ∫(µ−2σ..µ+2σ) f(x) dx ≈ 0.95
• the area between µ−3σ and µ+3σ is about 99.7%: ∫(µ−3σ..µ+3σ) f(x) dx ≈ 0.997
Standard Normal Distribution
A standard normal distribution has mean 0 and S.D. 1. All normal distributions can be converted into the
standard normal curve by subtracting the mean and dividing by the standard deviation: 𝒁 = (𝑿 − 𝝁)/𝝈
Tests for normality:
• Look at the histogram! Does it appear bell-shaped?
• Compute descriptive summary measures: are the mean, median, and mode similar?
• Do 2/3 of the observations lie within 1 std dev of the mean?
• Do 95% of the observations lie within 2 std dev of the mean?
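As a quick check of the z-transformation and the 68-95-99.7 rule, here is a small Monte Carlo sketch (µ and σ are arbitrary illustrative values):

import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 10.0, 2.5                 # assumed illustrative parameters
x = rng.normal(mu, sigma, 1_000_000)
z = (x - mu) / sigma                  # Z = (X - mu) / sigma

for k in (1, 2, 3):
    print(k, np.mean(np.abs(z) < k))  # ~0.68, 0.95, 0.997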
Bayes Theorem
An Example
Markov Process
A Markov process {Xt} is a stochastic process with the property that, given the value of Xt, the values of Xs for s > t
are not influenced by the values of Xu for u < t. In words, the probability of any particular future behaviour of the
process, when its current state is known exactly, is not altered by additional knowledge concerning its past
behaviour. A discrete-time Markov chain is a Markov process whose state space is a finite or countable set, and
whose (time) index set is T = (0, 1, 2, …). In formal terms, the Markov property is that

These probabilities are constant and independent of previous behaviour; this memorylessness of the system is called the Markov property. We assume that a transition (picking up and dropping off a fare) occurs each time the system is observed, and that observations occur at regular intervals. Systems with these characteristics are called Markov chains or Markov processes.
Transition Matrix & Computing the probabilities

With the states ordered (Northside, Downtown, Southside), the pickup-and-delivery percentages give the transition matrix

T = [ 0.5  0.2  0.3 ]
    [ 0.1  0.4  0.5 ]
    [ 0.3  0.3  0.4 ]

The ij-th entry in the matrix T^n gives the probability of being in state j after n transitions, with state i as the initial state. For a taxi starting Downtown, the (Downtown, Downtown) entry equals 0.309 after three transitions and settles to about 0.30 by the seventh fare, close to the chain's long-run distribution (0.3, 0.3, 0.4).
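A quick check of these numbers (a minimal NumPy sketch, with states ordered Northside, Downtown, Southside as above):

import numpy as np

# Transition matrix; each row sums to 1
T = np.array([[0.5, 0.2, 0.3],
              [0.1, 0.4, 0.5],
              [0.3, 0.3, 0.4]])

# (i, j) entry of T^n = P(state j after n transitions | start in state i)
for n in (3, 7):
    Tn = np.linalg.matrix_power(T, n)
    print(n, round(Tn[1, 1], 3))   # Downtown -> Downtown: 0.309, then ~0.30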
INTRODUCTION TO LINEAR ALGEBRA
Fundamentals
1. What is a matrix? – a solution to f(x) = y; a linear approximation of the function f
2. Matrix Addition
3. Vector
   1. What is a vector? – a point, or the position vector, in n-dimensional space
   2. Scalar times Vector
4. Vector Norm – the Lp norm (see the sketch after this list)
   1. Properties of a norm – non-negativity, definiteness, triangle inequality, homogeneity
   2. L1 norm (Manhattan / Taxicab norm)
      • distinguishes between zero and non-zero; a measure of sparsity
   3. L2 norm (Euclidean norm)
      • not robust to outliers
   4. Infinity norm (Max norm)
   5. L0 norm – the number of non-zero entries in x
      • strictly speaking, it is not a norm because it is not homogeneous
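A minimal NumPy sketch of these norms (the example vector is made up; np.linalg.norm computes the L1, L2, and max norms via its ord argument):

import numpy as np

x = np.array([3.0, 0.0, -4.0, 0.0])

l1   = np.sum(np.abs(x))        # Manhattan norm: 7.0
l2   = np.sqrt(np.sum(x**2))    # Euclidean norm: 5.0
linf = np.max(np.abs(x))        # Max norm: 4.0
l0   = np.count_nonzero(x)      # "L0": number of non-zero entries: 2

print(l1, l2, linf, l0)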
Matrix Products
1. Products of two vectors
   1. Hadamard product
   2. Dot Product (Inner product) – projection – a feed-forward neuron with multiple input weights
   3. Multiplication (Outer product) – size compatibility
2. Matrix times a vector – y = Wx – rotation as well as scaling (see the sketch after this list)
   1. Inner product representation – y_i is the dot product of the i-th row of W with x ➔ each row is aligned with a class
   2. Outer product representation – the sum of the columns of W weighted by the corresponding x_j ➔ the entire space is spanned by the vector addition (if the vectors are independent)
3. Rank of a matrix – a matrix whose columns can't span the whole space – the number of independent columns
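A small NumPy sketch contrasting the two views of y = Wx (W and x are made-up illustrations):

import numpy as np

W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([3.0, 5.0])

# Inner product view: each y_i is the dot product of row i of W with x
y_inner = np.array([W[i] @ x for i in range(W.shape[0])])

# Outer product view: y is the sum of W's columns weighted by the x_j
y_outer = sum(x[j] * W[:, j] for j in range(W.shape[1]))

print(y_inner, y_outer, W @ x)   # all three agree: [13. 29.]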
Two Layer Linear Network
• Inner product view: the response y_i is the dot product of the i-th row of W with the vector x.
• Outer product view: the contribution of x_j to the network output.
Special Matrices
1. Diagonal matrices – a kind of scalar (coordinate-wise) multiplication
2. Identity matrix – the representation of the identity element of that space
3. Inverse of a matrix
(A short sketch of these follows.)

For future scope:
1. Eigenvalues and Eigenvectors – required for PCA
2. Sparsification – individual and grouped – required for dimensionality reduction
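A tiny NumPy sketch of the three special matrices above (the 2×2 values are made up for illustration):

import numpy as np

D = np.diag([2.0, 3.0])        # diagonal matrix: scales each coordinate
I = np.eye(2)                  # identity matrix
A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
A_inv = np.linalg.inv(A)       # inverse: A @ A_inv equals I (up to rounding)

x = np.array([1.0, 1.0])
print(D @ x)                        # [2. 3.] -- coordinate-wise scaling
print(I @ x)                        # [1. 1.] -- unchanged
print(np.allclose(A @ A_inv, I))    # True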


ARTIFICIAL INTELLIGENCE
SUKUMAR CHAKRABORTY
INTRODUCTION TO APPLIED ARTIFICIAL INTELLIGENCE

Course outline:
1. Introduction to AI
   a. Overview
   b. Intelligence
   c. Learning
2. Pre-requisites
   a. Linear Algebra
   b. Probability Theory, Bayes Theorem
   c. Calculus and Optimization
   d. Statistical Concepts: Random Variables, Naïve Bayes, Markov Process, Regression
3. Machine Learning
   a. Unsupervised Learning: Clustering, PCA
   b. Supervised Learning: KNN, LR, Decision Trees, SVM
4. Neural Networks and Deep Learning
   a. MLP and Back Propagation
   b. Deep Learning: i. CNN, ii. RNN and LSTM, iii. Autoencoders, iv. GAN
5. Applications of AI
   a. Natural Language Processing & Sentiment Analysis
   b. Machine Vision & Image Processing: i. Image Classification, ii. Image Segmentation, iii. Object Recognition

Textbooks:
1. Artificial Intelligence Illuminated – Ben Coppin (2004)
2. Deep Learning – Ian Goodfellow, Yoshua Bengio, Aaron Courville (2017)
3. Computer Vision: Algorithms and Applications – Richard Szeliski (2010)
4. Foundations of Statistical Natural Language Processing – Christopher D. Manning, Hinrich Schuetze (1999)
OVERVIEW
UNIT-1
What is Intelligence?
According to Britannica - human intelligence is the ability to learn from experience, adapt to new
situations, understand and handle abstract concepts, and use knowledge to manipulate one’s
environment.
Crystallized Intelligence – refers to the acquired knowledge and the ability to retrieve it.
Fluid Intelligence – encompasses the ability to see complex relationships and solve problems.
(Raymond Cattell, 1963; Robert Sternberg, 1988)

• When we learn, remember, and recall information, we are using crystallized intelligence
• Crystallized intelligence helps us to overcome concrete, straightforward problems
• Fluid intelligence helps us to tackle complex, abstract challenges in our daily life
Multiple Intelligences (Howard Gardner, 1983)
Intelligence Type | Characteristics | Representative Career
Linguistic intelligence | Perceives different functions of language, meanings of words; may easily learn multiple languages | Journalist, novelist, poet, teacher
Logical-mathematical intelligence | Capable of seeing numerical patterns; strong ability to use reason and logic | Scientist, mathematician
Musical intelligence | Understands rhythm, pitch, and tone; may play multiple instruments or perform as a vocalist | Composer, performer
Bodily-kinaesthetic intelligence | High ability to control the movements of the body and use the body to perform various physical tasks | Dancer, athlete, athletic coach, yoga instructor
Spatial intelligence | Ability to perceive the relationship between objects and how they move in space | Choreographer, sculptor, architect, aviator, sailor
Interpersonal intelligence | Ability to understand and be sensitive to the various emotional states of others | Counselor, social worker, salesperson
Intrapersonal intelligence | Ability to access personal feelings and motivations, and use them to direct behaviour | Key component of personal success
Naturalist intelligence | High capacity to appreciate the natural world and interact with the species within it | Biologist, ecologist, environmentalist
What is Artificial Intelligence (AI)
According to the father of Artificial Intelligence, John McCarthy:

“The science and engineering of making intelligent machines, especially intelligent computer programs.”

AI is accomplished by studying how a human learns, decides, and solves a problem, and then applying this to software and systems.

Programming without AI                 | Programming with AI
Answers specific questions             | Can answer generic questions
Automatic modification not supported   | Can modify itself up to a certain extent
There is no learning / training phase  | There is a phase for learning / training
APPLICATIONS OF AI

[Figure: a mind map with Artificial Intelligence Techniques at the hub – Speech Recognition, Vision Systems, Natural Language Processing, Expert Systems – linked to applications:]
• Personalized Shopping
• Fraud Detection / Prevention
• Handwriting Recognition
• Voice Assistants
• Autonomous Vehicles
• Gaming
• Spam Detection / Filtering
• Facial Recognition
• Weather Prediction
• Assisted Diagnosis
• Targeted Marketing
• Investment Planning
Intelligent Agents
An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.
• The agent function maps from percept histories to actions: [f: P* → A]
• The agent program runs on the physical architecture to produce f
• agent = architecture + program
• Human agent: eyes, ears, and other organs for sensors; hands, legs, mouth, and other body parts for actuators
• Robotic agent: cameras and infrared range finders for sensors; various motors for actuators
Rational Agents
For each possible percept sequence, a rational agent should select an action that is expected to
maximize its performance measure, given the evidence provided by the percept sequence and
whatever built-in knowledge the agent has.

Agent: An automated taxi driver:


➢ Performance measure: Safe, fast, legal, comfortable trip, maximize profits
➢ Environment: Roads, other traffic, pedestrians, customers
➢ Actuators: Steering wheel, accelerator, brake, signal, horn
➢ Sensors: Cameras, sonar, speedometer, GPS, odometer, engine sensors, keyboard

Agent: Interactive English tutor


➢ Performance measure: Maximize student's score on test
➢ Environment: Set of students
➢ Actuators: Screen display (exercises, suggestions, corrections)
➢ Sensors: Keyboard
Environment Types
• Fully vs. partially observable – When an agent's sensors can access the complete state of the environment at each point in time, the environment is fully observable; e.g., chess or image recognition is fully observable, whereas driving or weather prediction is partially observable.
• Deterministic vs. stochastic – If an agent's current state and selected action completely determine the next state of the environment, the environment is deterministic. A stochastic environment is random in nature and cannot be determined completely by the agent. In a deterministic, fully observable environment, the agent does not need to worry about uncertainty.
• Episodic vs. sequential – In an episodic environment, there is a series of one-shot actions, and only the current percept is required for the action. In a sequential environment, however, an agent requires memory of past actions to determine the next best action.
• Static vs. dynamic – If the environment can change itself while an agent is deliberating, it is called a dynamic environment; otherwise it is static. Chess is a static environment, whereas taxi driving is dynamic.
• Discrete vs. continuous – If there is a finite number of actions that can be performed in an environment, it is a discrete environment; otherwise it is continuous.
• Competitive vs. collaborative – An agent is in a competitive environment when it competes against another agent to optimize the output, and in a collaborative environment when multiple agents cooperate to produce the desired output. The game of chess is competitive, but "Age of Empires" is collaborative.
Intelligence

Reasoning − It is the set of processes that enables us to provide a basis for judgement, decision-making, and prediction.
Learning − It is the activity of gaining knowledge or skill by studying,
practising, being taught, or experiencing something.
Perception − It is the process of acquiring, interpreting, selecting, and
organizing sensory information.
Problem Solving − It is the process in which one perceives and tries to arrive
at a desired solution from a present situation by taking some path.
Linguistic Intelligence − It is one’s ability to use, comprehend, speak, and
write the verbal and written language.
Reasoning
INDUCTIVE REASONING
• Inductive reasoning is a bottom-up approach
• It takes us from the specific to the general
• We make many observations, discern a pattern, make a generalization, and infer an explanation or a theory
Example 1 – Data: Every dog I meet is friendly. Hypothesis: Most dogs are usually friendly.
Example 2 – Data: I tend to catch colds when people around me are sick. Hypothesis: Colds are infectious.

DEDUCTIVE REASONING
• Deductive reasoning is a top-down approach
• It takes us from the theory to the observations
• There is a first premise, then a second premise, and finally an inference
Example 1 – Major premise: All mammals have backbones. Minor premise: Humans are mammals. Conclusion: Humans have backbones.
Example 2 – Major premise: All spiders have eight legs. Minor premise: A tarantula is a spider. Conclusion: Tarantulas have eight legs.
Learning
HUMAN LEARNING
• Auditory Learning: Students listening to recorded audio lectures
• Episodic Learning: Fear of water after a near-drowning
• Motor Learning: Learning the forehand chop in table-tennis
• Observational Learning: Mimicry
• Perceptual Learning: Distinguish between different odours
• Relational Learning: Adding a ‘little less’ salt
• Spatial learning: Mind map
• Stimulus-Response Learning: Reacting to a horror movie

Auditory Learning: Learning by listening and hearing. For example, students listening to recorded audio lectures.
Episodic Learning: Learning by remembering sequences of events that one has witnessed or experienced. This is linear and orderly.
Motor Learning: Learning by precise movement of muscles. For example, picking up objects, writing, etc.
Observational Learning: Learning by watching and imitating others. For example, a child learns by mimicking her parent.
Perceptual Learning: Learning to recognize stimuli that one has seen before. For example, identifying and classifying objects and situations.
Relational Learning: Learning to differentiate among various stimuli on the basis of relational properties rather than absolute properties. For example, adding a little less salt when cooking potatoes that came out salty last time, when, say, a full tablespoon of salt was used.
Spatial Learning: Learning through visual stimuli such as images, colours, maps, etc. For example, a person can create a road map in the mind before actually following the road.
Stimulus-Response Learning: Learning to perform a particular behaviour when a certain stimulus is present. For example, a dog raises its ears on hearing the doorbell.
Perception
• Perception is the process by which
sensory information captured in the
real world is interpreted, acquired,
selected, and then organized.
Perception and action are critical
concepts in robotics.
• Our senses—taste, sight, touch, smell,
and hearing—provide patterns to the
appropriate part of our brain in order
for it to generate perceptions.
• According to the received
information, action is taken by
interacting with the environment to
manipulate and navigate the objects.
Problem Solving
Problem solving draws on several algorithm families:
• Regression Algorithms
• Instance-Based Algorithms
• Decision Tree Algorithms
• Clustering Algorithms
• Association Rule Learning Algorithms
• Artificial Neural Network Algorithms
• Deep Learning Algorithms
• Search Algorithms
together with Root Cause Analysis, and a choice between exact Algorithms and Heuristics.

Performance measures for a search strategy:
• Completeness
• Optimality
• Time Complexity
• Space Complexity

Algorithm
An automated solution to a problem
Deterministic and proven to grant an optimal result
Containing a finite set of instructions to solve a problem.

Heuristics
Arbitrary choices or educated guesses
No proof of correctness, may not grant optimal results.
Applied to improve the running time of algorithms.
--------------------------------------------------------------------
•Completeness – measures the algorithm's guarantee to find a solution for the problem if any solution exists.
•Optimality – measures whether the search strategy finds an optimal solution to the given problem.
•Time Complexity – measures the amount of time the algorithm takes to reach the solution for the given problem.
•Space Complexity – measures the amount of space (in memory) the algorithm requires in order to perform the search.
Regression
Regression is a statistical
technique to determine the
relationship between a
dependent (outcome) variable to
one or more independent
(predictor) variables.

• To predict the value of the


dependent variable using
information about the
predictor variables.

• Or, to estimate the effect of


the predictor variables on
the dependent variable.
𝑦 = 𝛼 + 𝛽𝑥 + ϵ
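A minimal sketch of estimating α and β by ordinary least squares (synthetic data; NumPy's polyfit is just one convenient way to fit the line):

import numpy as np

# Made-up data roughly following y = 2 + 3x + noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + rng.normal(scale=1.0, size=x.size)

# Least-squares fit of a degree-1 polynomial: returns (beta, alpha)
beta, alpha = np.polyfit(x, y, deg=1)
print(f"alpha ~= {alpha:.2f}, beta ~= {beta:.2f}")

# Predict the dependent variable for a new predictor value
print("prediction at x=12:", alpha + beta * 12)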
Instance Based Algorithms
A family of learning algorithms that, instead of performing explicit generalization, compare unknown data points with instances seen in training (which have been stored in memory) to measure some kind of distance or similarity. Examples: kNN Algorithm, SVM Algorithm, RBF Network. A minimal kNN sketch follows.
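A minimal kNN sketch using scikit-learn (the iris dataset and the choice k = 5 are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100)

# Classify each test point by majority vote among its 5 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("accuracy:", knn.score(X_test, y_test))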
Decision Tree Algorithms
[Figure: an example decision tree. The root tests COLOR (Green / Yellow / Reddish); lower nodes test SMELL (None / Bad / Good), SOFTNESS (Hard / Semi-soft / Soft), SHAPE (Oval / Round), and WEIGHT (< 750 g / > 750 g); the leaves are YES / NO decisions.]
Clustering Algorithms
• Clustering (cluster analysis) is grouping objects based on similarities.
• It is an unsupervised method, where the inputs are not labelled.
Four common families: centroid-based clustering, density-based clustering, distribution-based clustering, and hierarchical clustering.
Some common applications for clustering include the following:
• market segmentation
• social network analysis
• medical imaging
• image segmentation
• anomaly detection
A minimal centroid-based (k-means) sketch follows.
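A minimal centroid-based sketch using scikit-learn's KMeans (the two synthetic blobs are made up for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Two made-up blobs of 2-D points
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.5, (50, 2)),
                 rng.normal(5, 0.5, (50, 2))])

# Centroid-based clustering into k = 2 groups
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
print(km.cluster_centers_)     # approximately (0, 0) and (5, 5)
print(km.labels_[:5], km.labels_[-5:])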
Association Rule Learning Algorithm
• Association learning is a rule based
data mining technique that finds
important relations between variables
or features in a data set.
• Association rules are created for
finding information about frequent if-
then patterns using specific criteria
• Various algorithms are used. Some of
them are:
• Apriori algorithm
• ECLAT algorithm
• FP-Growth algorithm

Some common applications for association rule learning include the following:
• market basket analysis
• medical diagnosis, etc.

ECLAT stands for Equivalence Class Transform


FP stands for Frequent Pattern
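A toy sketch of the support-counting step that underlies these algorithms (a from-scratch illustration, not the API of any particular library; the transactions are made up):

from itertools import combinations
from collections import Counter

# Toy market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

# Support of an itemset = fraction of transactions containing it
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

min_support = 0.5
n = len(transactions)
frequent = {p: c / n for p, c in pair_counts.items() if c / n >= min_support}
print(frequent)   # all three pairs here have support 0.5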
Artificial Neural Network Algorithms
Deep Learning
• Deep learning networks learn by discovering intricate structures in the data they experience.
• By building computational models composed of multiple processing layers, the networks can create multiple levels of abstraction to represent the data.
• In areas such as computer vision, speech recognition, and robotics, the performance of deep learning systems far exceeds that of conventional machine learning systems.
Search Algorithms
Uninformed Search Algorithms:
• Breadth-First Search
• Depth-First Search
• Depth-Limited Search
• Iterative Deepening Depth-First Search
• Bidirectional Search
• Uniform Cost Search

Informed Search Algorithms:
• Greedy Best First Search Algorithm
• A* Search Algorithm
The uninformed search algorithm does not have any domain knowledge, such as closeness or the location of the goal state; it behaves in a brute-force way. It only knows how to traverse the given tree and how to find the goal state. This family is also known as blind search or brute-force search.

The informed search algorithm is also called heuristic search or directed search. In contrast to uninformed search algorithms, informed search algorithms use details such as the distance to the goal, the steps to reach it, and the cost of paths, which makes them more efficient. Here, the goal state is reached with the help of a heuristic function, which estimates how close a state is to the goal and is used to reach it with the lowest possible cost. A minimal breadth-first search sketch follows.
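A minimal breadth-first search sketch (the adjacency-list graph is made up for illustration):

from collections import deque

def bfs(graph, start, goal):
    """Breadth-first search: explores the shallowest nodes first.
    Returns a path from start to goal, or None if unreachable."""
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append(path + [neighbour])
    return None

# Toy graph as an adjacency list
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": ["F"], "E": ["F"]}
print(bfs(graph, "A", "F"))   # ['A', 'B', 'D', 'F'] -- a shallowest path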