Lec1 PerceptronPocket Recap

Here is how the figure would look for a perceptron:
- There would be a straight decision boundary line instead of a curve, since the perceptron can only learn linearly separable data
- The data points would lie on either side of the decision boundary line
- Data points of one class would be on one side of the line, and data points of the other class on the other side
- The perceptron would make predictions by checking on which side of the decision boundary line a new data point falls
- For data points exactly on the line, the perceptron's prediction could be determined by the sign convention for the linear combination of features
In summary, the figure would show linearly separable data with a straight decision boundary line.


ML – Recap

Perceptron
10th Oct 2022
We infer a rule through instances…….

This is a typical “Prediction” Problem


Training samples : Tuple (Pattern, class label)
(Figure: training samples plotted as two groups, Class 1 and Class 2)

We infer a rule through instances…….

What about this ????

This is a typical “Classification” Problem
Test sample : Only pattern is given
We need to complete the Tuple (Pattern, ??class label??)
Extending the problem
Training samples : Tuple (Pattern, class label)
(Figure: training images labelled CAR and PLANE; test images labelled ??)

Test sample : Only pattern is given
We need to complete the Tuple (Pattern, ??class label??)


Essence of Learning
(From Prof. Mostafa’s slides)

• There has to be a pattern


• Pattern may/may not be captured in a mathematical
expression
• We should have data on it
Question
(From Prof. Mostafa’s slides)
From Data -----> To Features

Class 1 : Sample 1, Sample 2, Sample 3
Class 2 : Sample 4, Sample 5, Sample 6

Let’s define the Feature as : Number of black boxes
(Plot: number of black boxes — 1 or 2 — against training sample number 1–6)
Feature space
• Feature extractor : Mapping from Data to Feature Space
• Feature Extractor : Data -> (Feature1,Feature2,..)

• What is the advantage ?


• In this case, if the image is of size 10 X 100, we need 1000
pixels to represent every Data point
• But using Feature Extraction, each Data point is now
represented by a smaller set of numbers (Reals/Integers)
• Features should get us closer towards discovering inherent
patterns
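
A minimal sketch of such a feature extractor, assuming the data point is a binary NumPy array with black boxes encoded as 1s (the encoding and function name are illustrative, not from the slides):

```python
import numpy as np

def extract_features(image):
    """Map a raw binary image (e.g. 10 x 100 = 1000 pixels) to a much smaller
    feature vector -- here a single number, the count of 'black' pixels."""
    # Assumption: black boxes are encoded as 1s, background as 0s.
    return np.array([int(np.sum(image == 1))])

image = np.zeros((10, 100), dtype=int)   # a 1000-pixel data point
image[0, :2] = 1                         # two "black boxes"
print(extract_features(image))           # -> [2]
```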
Feature space
Extract “good” Features from the Given Data
Each Sample instance is mapped to a point in the
Feature-space
Sample Instance = {Feature1, Feature2, Feature3,….}

For an unseen test sample, what do we do ?


Compute the number of black boxes:
If 1, then Class 2
If 2, then Class 1
(Plot: number of black boxes — 1 or 2 — against training sample number 1–6)
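
A tiny sketch of this decision rule (the function name is illustrative; the rule is exactly the one read off the plot above):

```python
def classify(num_black_boxes):
    """Decision rule read off the training plot above."""
    return "Class 1" if num_black_boxes == 2 else "Class 2"

print(classify(2))   # Class 1
print(classify(1))   # Class 2
```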
Data separability in Feature-space
(Figure: two scatter plots in the Feature1–Feature2 plane — one linearly separable, one non-linearly separable)

Which kind of data is easier to work with ??


Basic Learning Premise

Training samples are drawn Independently of each other

All Training samples come from the same underlying process

Training samples are all Independent, Identically Distributed (IID) samples
Learning scenarios
1. Supervised Learning

2. Unsupervised Learning

3. Any more ???


Supervised Learning - Scenario
Linear Models

From the Hypothesis set, we are choosing only those that are Linear !!
Linear Models

Examples of Linear Models :


Linear Regression, Logistic Regression, Perceptron,
SVM, LDA….

When do you choose which model ?


What are the differences between them ?
Linear Regression
(Real-valued output)
Linear Regression - Objective
Cost function

Please Remember :
Training Error = In-sample error
Testing Error = Out-of-sample error
Cost function contribution

Notice : All data points that are not on the line contribute to the cost
function
Cost function - Rewriting

Notice : All data points x1, x2…..xN contribute to the cost function
Vector calculus - Hints

Given that w is a vector and U is a matrix (or vector) of compatible size:

(1) d/dw (U^T w) = U
(2) d/dw (w^T U w) = 2Uw   (for symmetric U; in general (U + U^T)w)
Error measure : J = ||y - Xw||^2

Expanding the norm ||y - Xw||^2 :

J = (y - Xw)^T (y - Xw)
  = y^T y - (Xw)^T y - y^T Xw + (Xw)^T (Xw)
  = y^T y - 2 y^T X w + w^T X^T X w

Setting dJ/dw = 0 :

-2 X^T y + 2 X^T X w = 0
X^T X w = X^T y
w = (X^T X)^{-1} X^T y

Hence w = X† y, where X† = (X^T X)^{-1} X^T is the pseudo-inverse of X.
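
A minimal NumPy sketch of this closed-form solution on synthetic data (the data and variable names are illustrative); `np.linalg.pinv` and `np.linalg.lstsq` are numerically safer than forming (X^T X)^{-1} explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # N = 100 samples, d = 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

# Closed form: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically safer equivalents
w_pinv  = np.linalg.pinv(X) @ y                  # explicit pseudo-inverse
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares solver

print(w, w_pinv, w_lstsq)                 # all three agree (up to round-off)
```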


Pseudo-inverse Dimensions

Think : What if “d” was a large number ?!?


Cost function

• Can we think of any other way of quantifying error ?

• How about the 1-norm of the deviation ? (compared in the sketch after this list)

• How will we know when to choose the 1-norm and when to choose the 2-norm ?
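
A quick sketch contrasting the two error measures on one residual vector (the numbers are illustrative): the squared 2-norm is dominated by a single outlier, which is one common reason to prefer the 1-norm when the data contains outliers:

```python
import numpy as np

residuals = np.array([0.1, -0.2, 0.1, 5.0])   # one large outlier

l1_cost = np.sum(np.abs(residuals))    # 1-norm of the deviation: 5.4
l2_cost = np.sum(residuals ** 2)       # squared 2-norm: 25.06, dominated by the outlier
print(l1_cost, l2_cost)
```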
Logistic Regression
Linear Models – Contd…

Note : The logistic function behaves “linearly” in a small interval around 0; at the tails it saturates, behaving like a hard threshold (signum-like) function.
Logistic (Sigmoid) function

θ(s) = 1 / (1 + e^{-s}),   where s = w^T x is the linear score
Logistic (Sigmoid) function
• Maps real line to [0,1]
• Can be used to model posterior probability i.e P(C | x)
• Final goal: Feature as input and output as posterior probability,
using the sigmoid model
θ(w^T x) = 1 / (1 + e^{-w^T x})

Step 1: Take linear combination of features (similar to Linear Regression)


Step 2: Apply sigmoid function
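
A minimal sketch of these two steps, assuming w already contains the learnt weights (the numerical values are illustrative):

```python
import numpy as np

def sigmoid(s):
    """theta(s) = 1 / (1 + e^{-s}); maps the real line into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

def posterior(w, x):
    """Step 1: linear combination s = w^T x.  Step 2: apply the sigmoid.
    The output is read as P(C = 1 | x)."""
    return sigmoid(w @ x)

w = np.array([0.5, -1.0, 2.0])   # illustrative learnt weights
x = np.array([1.0, 0.3, 0.7])    # illustrative feature vector
print(posterior(w, x))            # a value in (0, 1), here about 0.83
```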
Logistic Regression
θ(s) = 1 / (1 + e^{-s}),   where s = w^T x

What happens when -----

Risk score is high ?? s is large and positive; θ(s) -> 1
Risk score is low ?? s is large and negative; θ(s) -> 0
MLE

For Linear Regression we got a closed-form solution !!

For the Logistic Regression likelihood, a closed-form solution is not possible.

Way ahead : Iterative solution


Gradient Descent
Initialize w as w(t)

Differentiate Error wrt w :
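
A minimal gradient-descent sketch, assuming the error being differentiated is the negative log-likelihood (cross-entropy) with labels in {0, 1}; the learning rate, iteration count, and toy data are illustrative, not prescribed by the slides:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Gradient descent on the negative log-likelihood (cross-entropy),
    assuming labels y in {0, 1}.  Gradient: X^T (theta(Xw) - y) / N."""
    n, d = X.shape
    w = np.zeros(d)                            # initialize the weights
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y) / n  # derivative of the error w.r.t. w
        w -= lr * grad                         # step against the gradient
    return w

# Illustrative toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
print(fit_logistic(X, y))
```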


Logistic Regression

• Classifier
• Discriminative model (vs Generative model)
• Parameters – feature weights
• Estimation – ML estimation
• Gradient Descent (Iterative Method)

Summary : Maximum likelihood estimation of “w”, assuming that the observed training set was generated by a binomial model
Other Linear Models..
Same old Example
Perceptron – Why?

• Guaranteed to converge to a separating boundary if the data is linearly separable

• Links to Support Vector Machines (SVM) and Neural Networks (NN)
Perceptron
Linearly separable data
Perceptron -Wiki
• Invented in 1957 by Frank Rosenblatt

• In 1969, in their famous book “Perceptrons”, Minsky & Papert showed that it is impossible for a (single-layer) perceptron to learn the XOR function (see the sketch below).
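
A small sketch illustrating this: running the PLA update on the four XOR points never reaches zero error, since no line separates them (at best 3 of the 4 points can be classified correctly):

```python
import numpy as np

# The four XOR points and their labels in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

Xb = np.hstack([np.ones((4, 1)), X])       # prepend a bias coordinate
w = np.zeros(3)
best = 0                                   # best number of points ever classified correctly
for _ in range(1000):                      # PLA keeps cycling on XOR
    preds = np.sign(Xb @ w)
    wrong = np.where(preds != y)[0]
    best = max(best, 4 - len(wrong))
    if len(wrong) == 0:                    # never happens for XOR
        break
    w += y[wrong[0]] * Xb[wrong[0]]        # standard PLA update on one mistake

print(best)   # at most 3 out of 4 -- no line separates XOR
```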
To Do

• Perceptron on MNIST Data
• Take 2 classes at a time and check if 2 confusing classes such as “1” and “7” can be classified using the Perceptron
• Is the number of iterations taken dependent on the initialization ?
Perceptron - MNIST
Feature Space
PLA - Output
Pocket - Version
PLA Vs Pocket
To Do

• Perceptron on MNIST Data
• Think of a suitable feature set. Take 2 classes at a time and check if 2 confusing classes such as “1” and “7” can be classified using the Perceptron (see the sketch after this list)
• What is the cost function we are minimizing in the Perceptron ?
• Apply Gradient Descent to this cost function
• Prove : Convergence of the Perceptron
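
One possible setup for the MNIST exercise, a sketch using scikit-learn's Perceptron on raw pixels (a designed feature set, as the slide suggests, would replace the raw pixels); the dataset loader and hyperparameters here are illustrative choices, not prescribed by the slides:

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

# Load MNIST and keep only the two "confusing" classes "1" and "7"
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data, mnist.target                # y holds string labels "0".."9"
mask = (y == "1") | (y == "7")
X, y = X[mask] / 255.0, y[mask]                # raw pixels as features, scaled to [0, 1]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = Perceptron(max_iter=100, tol=None, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```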
PLA aims to do..
• We have blue points and gray points (two classes)

• Assume we start with the dotted line, x1 + x2 - 0.5 = 0; that line classifies two points wrongly, one blue point and one gray point
PLA
The perceptron learns a line which separates the points correctly:
1.42 x1 + 0.51 x2 - 0.5 = 0

This line has zero training error.
PLA
Cost function

Gradient Descent

Iteration in PLA
Weight correction (blue to black)
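
A minimal sketch of the PLA weight correction together with a pocket copy (keep the best weights seen so far); the toy data, starting weights, and iteration cap are illustrative:

```python
import numpy as np

def pla_pocket(X, y, w0, max_iters=1000):
    """Perceptron learning with a 'pocket': keep the best weights seen so far.
    X already has a leading column of 1s; labels y are in {-1, +1}."""
    w = w0.copy()
    pocket_w, pocket_errs = w.copy(), np.inf
    for _ in range(max_iters):
        wrong = np.where(np.sign(X @ w) != y)[0]
        if len(wrong) < pocket_errs:          # better than the pocket copy? keep it
            pocket_w, pocket_errs = w.copy(), len(wrong)
        if len(wrong) == 0:                   # zero training error: PLA has converged
            break
        i = wrong[0]                          # any misclassified point will do
        w = w + y[i] * X[i]                   # PLA weight correction
    return pocket_w

# Toy separable data in the spirit of the slide's figure
rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, size=(50, 2))
labels = np.where(pts[:, 0] + pts[:, 1] - 0.9 > 0, 1, -1)
Xb = np.hstack([np.ones((50, 1)), pts])
w0 = np.array([-0.5, 1.0, 1.0])               # the slide's starting line x1 + x2 - 0.5 = 0
print(pla_pocket(Xb, labels, w0))             # final (pocket) weights
```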
Perceptron inference
• Given the straight line we have learnt, say x1 - x2 = 0, the inference is as follows :
• If x1 - x2 > 0, then RED class
• If x1 - x2 < 0, then BLUE class
(Figure: the x1–x2 plane split by the line x1 - x2 = 0, with the region x1 - x2 > 0 labelled RED and x1 - x2 < 0 labelled BLUE)
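
A tiny sketch of this inference rule for the learnt line x1 - x2 = 0 (the tie-break for points exactly on the line is an assumption):

```python
def perceptron_predict(x1, x2):
    """Inference with the learnt line x1 - x2 = 0."""
    s = x1 - x2
    if s > 0:
        return "RED"
    if s < 0:
        return "BLUE"
    return "RED"  # assumed tie-break for points exactly on the line

print(perceptron_predict(2.0, 1.0))   # RED
print(perceptron_predict(1.0, 3.0))   # BLUE
```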
Linear Models

How does this figure look for perceptron ?
