Combined ML
LEARNING
Humans learn about their surroundings through five senses: eye, ear, nose, tongue and skin.
LEARNING
1. Rote Learning (memorization): Memorizing things without
knowing the concept or logic behind them.
2. Passive Learning (instructions): Learning from a
teacher/expert.
3. Analogy (experience): Learning new things from our past
experience.
4. Inductive Learning (experience): On the basis of past
experience, formulating a generalized concept.
5. Deductive Learning: Deriving new facts from past facts.
[email protected] 4
CONCEPT LEARNING
Inducing general functions from specific training examples is a
main issue of machine learning.
[email protected] 6
CONCEPT LEARNING
A Formal Definition for Concept Learning:
Inferring a Boolean-valued function from training examples of
its input and output.
• An example for concept-learning is the learning of bird-concept
from the given examples of birds (positive examples) and non-
birds (negative examples).
• We are trying to learn the definition of a concept from given
examples.
SPORT EXAMPLE
Concept to be learned:
Days in which Aldo can enjoy water sport
Attributes:
Sky: Sunny, Cloudy, Rainy
AirTemp: Warm, Cold
Humidity: Normal, High
Wind: Strong, Weak
Water: Warm, Cool
Forecast: Same, Change
Instances in the training set:
(out of the 96 possible):
[email protected] 8
HYPOTHESES REPRESENTATION
h is a set of constraints on attributes:
a specific value: e.g. Water = Warm
any value allowed: e.g. Water = ?
no value allowed: e.g. Water = Ø
Example hypothesis:
⟨Sunny, ?, ?, Strong, ?, Same⟩  (attribute order: Sky, AirTemp, Humidity, Wind, Water, Forecast)
Corresponding to the Boolean function:
Sunny(Sky) ∧ Strong(Wind) ∧ Same(Forecast)
H, the hypothesis space, is the set of all possible h.
[email protected] 9
HYPOTHESIS SATISFACTION
An instance x satisfies an hypothesis h iff all the constraints expressed by h are
satisfied by the attribute values in x.
Example 1:
x1: Sunny, Warm, Normal, Strong, Warm, Same
h1: Sunny, ?, ?, Strong, ?, Same Satisfies?
Yes
Example 2:
x2: Sunny, Warm, Normal, Strong, Warm, Same
h2: Sunny, ?, ?, Ø, ?, Same Satisfies?
No
[email protected] 10
FORMAL TASK DESCRIPTION
Given:
X: all possible days, as described by the attributes
A set of hypotheses H, each a conjunction of constraints on the attributes, representing a function h: X → {0, 1}
[h(x) = 1 if x satisfies h; h(x) = 0 if x does not satisfy h]
A target concept c: X → {0, 1}, where
c(x) = 1 iff EnjoySport = Yes;
c(x) = 0 iff EnjoySport = No;
A training set of possible instances D: {x, c(x)}
Goal: find a hypothesis h in H such that
h(x) = c(x) for all x in D
Hopefully h will be able to predict outside D…
[email protected] 11
THE INDUCTIVE LEARNING ASSUMPTION
We can at best guarantee that the output hypothesis fits the target concept over the training data.
Assumption: a hypothesis that approximates the target function well over the training data will also approximate it well over unobserved examples,
i.e. given a sufficiently large training set, the output hypothesis is able to make predictions.
CONCEPT LEARNING AS SEARCH
Concept learning is the task of searching a hypothesis space.
The representation chosen for hypotheses determines the search
space
In the example we have:
3 × 2⁵ = 96 possible instances (6 attributes)
1 + 4 × 3⁵ = 973 semantically distinct hypotheses,
considering that all hypotheses containing at least one Ø constraint are semantically equivalent: they classify every instance as negative.
Structuring the search space may help in searching more efficiently.
GENERAL TO SPECIFIC ORDERING
Consider: h1 = Sunny, ?, ?, Strong, ?, ?
h2 = Sunny, ?, ?, ?, ?, ?
Any instance classified positive by h1 will also be classified
positive by h2
h2 is more general than h1
Definition: hj ≥g hk iff ∀x ∈ X [(hk(x) = 1) → (hj(x) = 1)]
≥g: more general than or equal to; >g: strictly more general than
Most general hypothesis: ⟨?, ?, ?, ?, ?, ?⟩
Most specific hypothesis: ⟨Ø, Ø, Ø, Ø, Ø, Ø⟩
GENERAL TO SPECIFIC ORDERING: INDUCED STRUCTURE
[email protected] 15
FIND-S: FINDING THE MOST SPECIFIC HYPOTHESIS
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x:
   for each attribute constraint a in h:
      if the constraint a is satisfied by x, then do nothing;
      else replace a in h by the next more general constraint that is satisfied by x (move towards a more general hypothesis)
3. Output hypothesis h
[email protected] 16
FIND-S IN ACTION
[email protected] 17
EXAMPLE
To illustrate this algorithm, assume the learner is given the sequence of training
examples from the EnjoySport task
[email protected] 18
PROPERTIES OF FIND-S
Find-S is guaranteed to output the most specific hypothesis within H that is
consistent with the positive training examples
The final hypothesis will also be consistent with the negative examples
Problems:
There can be more than one "most specific" hypothesis.
We cannot tell whether the learner has converged to the correct target concept.
Why choose the most specific hypothesis?
If the training examples are inconsistent, the algorithm can be misled: no tolerance to noise.
Negative examples are not considered.
QUESTION
Consider the following data set having the data about which particular seeds are
poisonous.
[email protected] 20
EXAMPLE
First we initialize the hypothesis to the most specific hypothesis:
h = {ϕ, ϕ, ϕ, ϕ}
Instance 1:
The data in example 1 is {GREEN, HARD, NO, WRINKLED}. Our initial hypothesis is more specific than this (positive) example, so we generalize it. Hence, the hypothesis becomes:
h = {GREEN, HARD, NO, WRINKLED}
EXAMPLE
Instance 3:
This example has a negative outcome, so Find-S ignores it and the hypothesis remains the same.
THANKS
[email protected] 24
MACHINE LEARNING (ML-5)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
[email protected] 1
AGENDA
Linear regression with one variable
[email protected] 2
QUIZ
X Y
2 4
3 9
5 25
9 81
7 49
11 121
10.5 WHAT?
[email protected] 3
QUIZ
X Y
2 4
3 9
5 25
9 81
7 49
11 121
10.5 110.25
[email protected] 4
HOW DO YOU FIND THAT?
HOW DO YOU FIND THAT?
You find the relation between X and Y:
Y = X·X = X², i.e. Y = f(X)
Which one is the dependent variable? Answer: Y
So what is X? X is the independent variable.
QUIZ
X Y
2 3
3 5
5 9
9 10
7 6.5
11 11.8
10.5 WHAT?
[email protected] 11
For X = 10.5, what is Y? Is it difficult to find out the relation?
GRAPH IS THE SOLUTION?
[Scatter plot of the (X, Y) data points; X axis 0–12, Y axis 0–14.]
APPROXIMATION
[Scatter plot of the points (2, 3), (3, 5), (5, 9), (7, 6.5), (9, 10), (11, 11.8) with a straight line approximating the trend.]
FIND THE EQUATION OF THE LINE
Two points are given: (3, 5) and (9, 10). Find the equation of the line.
Y = m·X + c, with m = (10 − 5)/(9 − 3) ≈ 0.83 and c = 5 − 0.83 × 3 = 2.5
Y ≈ 0.83X + 2.5
DEFINITION
Finding the relation between the dependent variable and the independent variable is called Linear Regression,
or equivalently: finding the best-fit line between the dependent variable and the independent variable is called Linear Regression.
[email protected] 18
DEFINITION
Finding the relation between dependent variable and Independent variable is called
Linear Regression.
[email protected] 19
DEFINITION
Finding the relation between dependent variable and Independent variable is called
Linear Regression.
Now, for X = 10.5, 12, 13, 2.5: Y = what?
What are you doing here? (USES OF LINEAR REGRESSION)
FORECASTING
PREDICTION
THE ERROR (RESIDUALS)
ERROR = GIVEN (ACTUAL) DATA − PREDICTED DATA
e = Y(actual) − Y(predicted)
Question: how can we find the best-fit line?
Answer: ideally Y(actual) = Y(predicted), i.e. e = 0; otherwise, minimise the error.
HOW TO FIND BEST FIT LINE
Derivation of the linear regression equations (for a single variable):
given a set of n points (xᵢ, yᵢ) on a scatterplot,
find the best-fit line ŷ = a + b·x
such that the sum of squared errors in Y, ∑(yᵢ − ŷᵢ)², is minimized.
Linear regression with one variable
MODEL REPRESENTATION *
SRC : * Andrew NG
[email protected] 25
[Scatter plot — Housing Prices (Portland, OR): Price (in 1000s of dollars, 0–500) vs. Size (feet², 0–3000).]
Supervised Learning: the "right answer" is given for each example in the data. Regression Problem: predict a real-valued output.
Training set of housing prices (Portland, OR):
Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852  | 178
…    | …
Notation:
m = Number of training examples
x’s = “input” variable / features
y’s = “output” variable / “target” variable
Training Set → Learning Algorithm → h; Size of house (x) → h → Estimated price (y). How do we represent h?
SRC : * Andrew NG
Training Set: as above (Size in feet² (x) → Price ($) in 1000's (y)).
Hypothesis: hθ(x) = θ₀ + θ₁x
θ₀, θ₁: Parameters
How to choose θ₀, θ₁?
[Three example plots of hθ(x) = θ₀ + θ₁x for different choices of the parameters θ₀ and θ₁.]
Linear regression with one variable
COST FUNCTION
INTUITION I *
SRC : * Andrew NG
[email protected] 33
Simplified
Hypothesis: hθ(x) = θ₁x
Parameters: θ₁
Cost Function: J(θ₁) = (1/2m) ∑ᵢ (hθ(x^(i)) − y^(i))²
Goal: minimize J(θ₁)
(for fixed θ₁, hθ(x) is a function of x)   (J(θ₁) is a function of the parameter θ₁)
[Paired plots for several values of θ₁: left, hθ(x) through the training points; right, the corresponding point on the J(θ₁) curve.]
Linear regression with one variable
COST FUNCTION
INTUITION II *
SRC : * Andrew NG
[email protected] 38
Hypothesis: hθ(x) = θ₀ + θ₁x
Parameters: θ₀, θ₁
Cost Function: J(θ₀, θ₁) = (1/2m) ∑ᵢ (hθ(x^(i)) − y^(i))²
Goal: minimize J(θ₀, θ₁)
(for fixed θ₀, θ₁, hθ(x) is a function of x)   (J(θ₀, θ₁) is a function of the parameters θ₀, θ₁)
[Paired plots: left, hθ(x) over the housing data (Price vs. Size in feet²); right, surface and contour plots of J(θ₀, θ₁).]
Linear regression with one variable
GRADIENT DESCENT *
SRC : * Andrew NG
[email protected] 46
Have some function J(θ₀, θ₁)
Want min over θ₀, θ₁ of J(θ₀, θ₁)
Outline:
• Start with some θ₀, θ₁
• Keep changing θ₀, θ₁ to reduce J(θ₀, θ₁) until we hopefully end up at a minimum
[3-D surface plots of J(θ₀, θ₁).]
Gradient descent algorithm:
repeat until convergence { θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)   (for j = 0 and j = 1, updated simultaneously) }, where α is the learning rate.
Linear regression with one variable
GRADIENT DESCENT
INTUITION*
SRC : * Andrew NG
[email protected] 51
If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
Linear regression with one variable
GRADIENT DESCENT FOR
LINEAR REGRESSION*
SRC : * Andrew NG
[email protected] 55
Gradient descent algorithm: repeat { θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁) }
Linear Regression Model: hθ(x) = θ₀ + θ₁x,  J(θ₀, θ₁) = (1/2m) ∑ᵢ (hθ(x^(i)) − y^(i))²
Gradient descent algorithm (for linear regression): repeat {
θ₀ := θ₀ − α (1/m) ∑ᵢ (hθ(x^(i)) − y^(i))
θ₁ := θ₁ − α (1/m) ∑ᵢ (hθ(x^(i)) − y^(i)) · x^(i)
} updating θ₀ and θ₁ simultaneously
[Contour plots of J(θ₀, θ₁).]
(for fixed θ₀, θ₁, hθ(x) is a function of x)   (J(θ₀, θ₁) is a function of the parameters θ₀, θ₁)
[A sequence of plots: at each gradient descent step, the current line hθ(x) over the housing data and the corresponding point on the contour plot of J(θ₀, θ₁), moving toward the minimum.]
"Batch" Gradient Descent: each step of gradient descent uses all m training examples.
THANKS
[email protected] 72
MACHINE LEARNING (ML-6)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
[email protected] 1
AGENDA
Linear regression with multiple variable
Gradient descent for multiple variables
Gradient descent in practice I: Feature Scaling
SRC : * Andrew NG
[email protected] 2
Multiple features (variables).
Size (feet²) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000)
2104 | 5 | 1 | 45 | 460
1416 | 3 | 2 | 40 | 232
1534 | 3 | 2 | 30 | 315
852  | 2 | 1 | 36 | 178
…
Notation:
n = number of features
x^(i) = input (features) of the iᵗʰ training example
x_j^(i) = value of feature j in the iᵗʰ training example
Previously: hθ(x) = θ₀ + θ₁x
Hypothesis (multiple features): hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ
For convenience of notation, define x₀ = 1, so that hθ(x) = θᵀx.
SRC : * Andrew NG
Hypothesis: hθ(x) = θᵀx = θ₀x₀ + θ₁x₁ + … + θₙxₙ
Parameters: θ₀, θ₁, …, θₙ
Cost function: J(θ) = (1/2m) ∑ᵢ (hθ(x^(i)) − y^(i))²
Gradient descent: repeat { θⱼ := θⱼ − α (1/m) ∑ᵢ (hθ(x^(i)) − y^(i)) · x_j^(i) } (simultaneously update θⱼ for j = 0, …, n)
Linear Regression with multiple variables
SRC : * Andrew NG
[email protected] 11
Feature Scaling
Idea: make sure features are on a similar scale, e.g. get every feature into approximately a −1 ≤ xᵢ ≤ 1 range.
E.g. x₁ = size (0–2000 feet²), x₂ = number of bedrooms (1–5); dividing each feature by its range gives x₁ = size (feet²) / 2000 and x₂ = number of bedrooms / 5.
Mean normalization
Replace xᵢ with (xᵢ − μᵢ) / sᵢ to make features have approximately zero mean (do not apply to x₀ = 1), where μᵢ is the mean and sᵢ the range (or standard deviation) of feature i.
E.g. x₁ = (size − average size) / range of sizes.
Linear Regression with multiple variables
SRC : * Andrew NG
[email protected] 15
Gradient descent
[email protected] 16
Making sure gradient descent is working correctly:
plot J(θ) against the number of iterations; J(θ) should decrease after every iteration.
Example automatic convergence test: declare convergence if J(θ) decreases by less than 10⁻³ in one iteration.
[Plot: J(θ) vs. number of iterations (0–400).]
Making sure gradient descent is working correctly:
if J(θ) is increasing (or oscillating), gradient descent is not working — use a smaller α.
[Plot: J(θ) rising with the number of iterations.]
To choose α, try a range of values spaced by roughly a factor of 3–10, e.g. 0.001, 0.01, 0.1, 1, …
Linear Regression with multiple variables
[email protected] 20
Housing prices prediction
[email protected] 21
Polynomial regression
[Plot: Price (y) vs. Size (x), fitted with polynomial curves.]
E.g. hθ(x) = θ₀ + θ₁(size) + θ₂(size)², or with a cubic term θ₃(size)³.
Choice of features
[Plot: Price (y) vs. Size (x); different choices of features give different fits.]
Linear Regression with multiple variables
NORMAL EQUATION
[email protected] 24
Gradient Descent minimizes J(θ) iteratively; the Normal Equation solves for θ analytically.
Intuition: if θ is 1-D, minimize J(θ) by setting dJ/dθ = 0. For θ ∈ ℝⁿ⁺¹, set ∂J/∂θⱼ = 0 (for every j) and solve for θ₀, θ₁, …, θₙ.
This gives the normal equation: θ = (XᵀX)⁻¹Xᵀy.
Examples (with x₀ = 1 added):
x₀ | Size (feet²) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000)
1 | 2104 | 5 | 1 | 45 | 460
1 | 1416 | 3 | 2 | 40 | 232
1 | 1534 | 3 | 2 | 30 | 315
1 |  852 | 2 | 1 | 36 | 178
1 | 3000 | 4 | 1 | 38 | 540
[email protected] 29
THANKS
[email protected] 30
MACHINE LEARNING (ML-7)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
Logistic regression (Classification)
SRC : * Andrew NG
[email protected] 2
Classification
Example: Malignant? (Yes) 1 / (No) 0 vs. Tumor Size.
[Plots: tumors labelled 0/1 against tumor size, with a threshold applied to a fitted curve.]
Logistic Regression: 0 ≤ hθ(x) ≤ 1
Logistic Regression
HYPOTHESIS
REPRESENTATION
Logistic Regression Model
Want 0 ≤ hθ(x) ≤ 1.
hθ(x) = g(θᵀx), where g(z) = 1 / (1 + e⁻ᶻ) is the sigmoid (logistic) function.
[Plot: g(z) increases from 0, passes through 0.5 at z = 0, and approaches 1.]
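A minimal sketch of the sigmoid hypothesis (assuming NumPy; names are illustrative).

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Logistic regression hypothesis: estimated probability that y = 1 given x."""
    return sigmoid(np.dot(theta, x))

print(sigmoid(0.0))   # 0.5: the 0.5 threshold corresponds to theta^T x = 0
```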
Interpretation of Hypothesis Output
hθ(x) = estimated probability that y = 1 on input x, i.e. hθ(x) = P(y = 1 | x; θ).
Example: if hθ(x) = 0.7, the input has an estimated 70% chance of belonging to class y = 1.
Suppose we predict "y = 1" if hθ(x) ≥ 0.5, i.e. θᵀx ≥ 0,
and predict "y = 0" if hθ(x) < 0.5, i.e. θᵀx < 0.
Decision Boundary
[Plot: two classes in the (x₁, x₂) plane separated by a straight line.]
With hθ(x) = g(θ₀ + θ₁x₁ + θ₂x₂), predict "y = 1" if θ₀ + θ₁x₁ + θ₂x₂ ≥ 0; the line θᵀx = 0 is the decision boundary.
Non-linear decision boundaries
[Plot: two classes in the (x₁, x₂) plane separated by a circle around the origin.]
Adding polynomial features, e.g. hθ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂²), allows non-linear boundaries: predict "y = 1" where the argument is ≥ 0 (e.g. x₁² + x₂² ≥ 1 gives a circular boundary).
Logistic Regression
COST FUNCTION
Training set: {(x^(1), y^(1)), …, (x^(m), y^(m))}, m examples, y ∈ {0, 1}.
Using the squared-error cost with the sigmoid hypothesis would make J(θ) "non-convex"; the logistic cost below is "convex".
Logistic regression cost function:
Cost(hθ(x), y) = −log(hθ(x)) if y = 1;  −log(1 − hθ(x)) if y = 0
[Plots: the cost as hθ(x) ranges over (0, 1), for y = 1 and for y = 0.]
Logistic Regression
SIMPLIFIED COST FUNCTION AND
GRADIENT DESCENT
Logistic regression cost function:
J(θ) = −(1/m) ∑ᵢ [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]
To fit parameters θ: want min over θ of J(θ).
Gradient descent: repeat { θⱼ := θⱼ − α (1/m) ∑ᵢ (hθ(x^(i)) − y^(i)) x_j^(i) } (simultaneously update all θⱼ)
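A sketch of this cost and one gradient step, assuming NumPy, a design matrix X with a leading column of ones, and y ∈ {0, 1}; the learning rate is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

def gradient_step(theta, X, y, alpha=0.1):
    """One update; same form as linear regression but with the sigmoid hypothesis."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    return theta - alpha * grad
```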
Multiclass classification: One-vs-all (one-vs-rest)
[Plots: three classes in the (x₁, x₂) plane; for each class i a separate binary problem is formed (class i vs. the rest), giving a classifier for Class 1, Class 2 and Class 3.]
One-vs-all
Train a logistic regression classifier for each class i to predict the probability that y = i.
On a new input x, pick the class i whose classifier outputs the highest probability.
MACHINE LEARNING (ML-8)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
Machine Learning: Training, Testing, Evaluation
SRC : * Andrew NG
[email protected] 2
EVALUATING THE HYPOTHESIS
A hypothesis may fit the training set well yet fail to generalize to new examples not in the training set (overfitting).
A common check: split the data into a training set and a test set (e.g. 70% / 30%), learn on the former and evaluate on the latter.
TRAINING/TESTING PROCEDURE FOR LINEAR
REGRESSION
Learn parameter θ from the training data (by minimizing the training error J_train(θ)).
Compute the test set error: J_test(θ) = (1/2m_test) ∑ᵢ (hθ(x_test^(i)) − y_test^(i))²
TRAINING/TESTING PROCEDURE FOR LOGISTIC
REGRESSION
Learn parameter θ from the training data (by minimizing the training error J_train(θ)).
Compute the test set error: J_test(θ) = −(1/m_test) ∑ᵢ [ y_test^(i) log hθ(x_test^(i)) + (1 − y_test^(i)) log(1 − hθ(x_test^(i))) ],
or alternatively the misclassification (0/1) error on the test set.
PYTHON IMPLEMENTATION
EVALUATION METRICS*
Training objective (cost function) is only a proxy for real world objectives.
Metrics are useful and important for evaluation.
Metrics help capture a business goal into a quantitative target.
Helps organize ML team effort towards that target. Generally in the form of
improving that metric on the dev set.
Useful for lower level tasks and debugging (e.g. diagnosing bias vs variance).
[email protected] 11
BINARY CLASSIFICATION
x is input
y is binary output (0/1)
Model is ŷ= h(x)
Two types of models
Models that output a categorical class directly (K-nearest neighbor, Decision
tree)
Models that output a real valued score (SVM, Logistic Regression)
Score could be a margin (SVM) or a probability (Logistic Regression); a threshold must be picked to turn the score into a class.
We focus on score-based models (class-output models can be treated as a special case).
SCORE BASED MODELS
THRESHOLD -> CLASSIFIER -> POINT METRICS
POINT METRICS: CONFUSION MATRIX
POINT METRICS: TRUE POSITIVES
POINT METRICS: TRUE NEGATIVES
POINT METRICS: FALSE POSITIVES
POINT METRICS: FALSE NEGATIVES
FP AND FN ALSO CALLED TYPE-1 AND TYPE-2 ERRORS
POINT METRICS: ACCURACY
POINT METRICS: PRECISION
POINT METRICS: POSITIVE RECALL (SENSITIVITY)
POINT METRICS: NEGATIVE RECALL (SPECIFICITY)
POINT METRICS: F1-SCORE
POINT METRICS: CHANGING THRESHOLD
SUMMARY METRICS: PRC (RECALL VS. PRECISION)
ROC (RECEIVER OPERATING CHARACTERISTICS)
•ROC curve is a performance measurement for classification problem at various
thresholds settings.
•ROC is a probability curve and AUC represents degree or measure of separability.
•It tells how much model is capable of distinguishing between classes.
•Higher the AUC, better the model is at predicting 0s as 0s and 1s as 1s.
•By analogy, Higher the AUC, better the model is at distinguishing between patients
with disease and no disease.
[email protected] 29
ROC CURVE
•The ROC curve is plotted with TPR against the FPR where TPR is on y-axis and FPR is
on the x-axis.
[email protected] 30
DEFINING TERMS USED IN AUC AND ROC CURVE
TPR (True Positive Rate) / Recall / Sensitivity = TP / (TP + FN); FPR (False Positive Rate) = FP / (FP + TN) = 1 − Specificity.
HOW TO SPECULATE THE PERFORMANCE OF THE
MODEL?
•An excellent model has AUC near to the 1 which means it has good measure of
separability.
•A poor model has AUC near to the 0 which means it has worst measure of
separability. In fact it means it is reciprocating the result. It is predicting 0s as 1s and
1s as 0s.
•AUC is 0.5, it means model has no class separation capacity whatsoever.
[email protected] 32
INTERPRETATION OF ROC CURVE
As we know, ROC is a curve of probability, so let's plot the distributions of those probabilities:
the red distribution curve is of the positive class (patients with disease) and the green distribution curve is of the negative class (patients with no disease).
[email protected] 33
INTERPRETATION OF ROC CURVE
[email protected] 34
INTERPRETATION OF ROC CURVE
[email protected] 35
INTERPRETATION OF ROC CURVE
[email protected] 36
COMPARING ROC CURVES
[email protected] 37
RELATION BETWEEN SENSITIVITY, SPECIFICITY,
FPR AND THRESHOLD
•Sensitivity and Specificity are inversely related: when we increase Sensitivity, Specificity decreases, and vice versa.
•When we decrease the threshold, we get more positive predictions, which increases sensitivity and decreases specificity.
•Similarly, when we increase the threshold, we get more negative predictions, so we get higher specificity and lower sensitivity.
EXAMPLE: ROC
[email protected] 39
THANKS
[email protected] 40
MACHINE LEARNING (ML-9)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
[email protected] 1
AGENDA
Bias Vs Variance
[email protected] 2
[email protected] 3
[email protected] 4
[email protected] 5
[email protected] 6
[email protected] 7
[email protected] 8
[email protected] 9
[email protected] 10
[email protected] 11
[email protected] 12
[email protected] 13
[email protected] 14
[email protected] 15
[email protected] 16
[email protected] 17
[email protected] 18
[email protected] 19
[email protected] 20
[email protected] 21
[email protected] 22
[email protected] 23
[email protected] 24
[email protected] 25
[email protected] 26
[email protected] 27
[email protected] 28
[email protected] 29
[email protected] 30
[email protected] 31
[email protected] 32
[email protected] 33
[email protected] 34
[email protected] 35
[email protected] 36
[email protected] 37
[email protected] 38
[email protected] 39
[email protected] 40
[email protected] 41
[email protected] 42
[email protected] 43
[email protected] 44
[email protected] 45
[email protected] 46
[email protected] 47
TRADE-OFF (BIAS VS VARIANCE)
[email protected] 48
THANKS
[email protected] 49
MACHINE LEARNING (ML-10)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
K Nearest Neighbor
[email protected] 2
K-NEAREST-NEIGHBORS ALGORITHM
K nearest neighbors (KNN) is a simple algorithm that stores all
available cases and classifies new cases based on a similarity
measure (distance function)
If K=1, then the case is simply assigned to the class of its nearest
neighbor
DISTANCE FUNCTION MEASUREMENTS
HAMMING DISTANCE
For category variables, Hamming distance can be used.
K-NEAREST-NEIGHBORS
WHAT IS THE MOST PROBABLE LABEL FOR c?
[Figure: a query point c among points labelled "a" and "o".]
Solution: look for the K nearest neighbors of c and take the majority label as c's label.
Suppose k = 3: the 3 nearest points to c are a, a and o. Therefore, the most probable label for c is a.
PSEUDO CODE OF KNN
1. Load the data
2. Initialise the value of k
3. To get the predicted class for a test point:
   1. Calculate the distance between the test point and each row of the training data (here we use Euclidean distance, the most popular metric).
   2. Sort the calculated distances in ascending order.
   3. Take the top k rows from the sorted array.
   4. Take the most frequent class among these rows.
   5. Return the predicted class.
(A Python sketch of this pseudocode follows below.)
REMARKS
CHOOSING THE MOST SUITABLE K
NORMALIZATION
K-NEAREST NEIGHBOR CLASSIFICATION (KNN)
Unlike all the previous learning methods, kNN does not build model
from the training data.
To classify a test instance d, define k-neighborhood P as k nearest
neighbors of d
Count number n of training instances in P that belong to class cj
Estimate Pr(cj|d) as n/k
No training is needed. Classification time is linear in training set size for
each test case.
19
DISCUSSIONS
kNN can deal with complex and arbitrary decision boundaries.
Despite its simplicity, researchers have shown that the classification accuracy of kNN can be quite strong and, in many cases, as accurate as more elaborate methods.
kNN is slow at classification time.
kNN does not produce an understandable model.
EXERCISE
[email protected] 21
EXERCISE
[email protected] 22
EXERCISE
Suppose, you have given the following data where x and y are the 2 input variables
and Class is the dependent variable.
Q1. Suppose you want to predict the class of the new data point x = 1, y = 1 using Euclidean distance with 3-NN. To which class does this data point belong?
EXERCISE
Q2. In the previous question, if you now use 7-NN instead of 3-NN, to which class will the point x = 1, y = 1 belong?
Q3. And with 5-NN instead of 3-NN, to which class will the point x = 1, y = 1 belong?
THANKS
[email protected] 25
MACHINE LEARNING (ML-11)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oNaïve Bayes Classifier
[email protected] 2
WHAT IS NAIVE BAYES ALGORITHM?
• It is a classification technique based on Bayes’ Theorem with an assumption of
independence among predictors.
• A Naive Bayes classifier assumes that the presence of a particular feature in a class
is unrelated to the presence of any other feature.
• Naive Bayes model is easy to build and particularly useful for very large data sets.
• Along with simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods.
[email protected] 3
PREREQUISITES FOR BAYES’ THEOREM
What is an Experiment?
“An experiment is a planned operation carried out under controlled conditions.”
Tossing a coin, rolling a die, and drawing a card out of a well-shuffled pack of cards are all
examples of experiments.
[email protected] 4
SAMPLE SPACE
The result of an experiment is called an outcome. The set of all possible outcomes of
an event is called the sample space.
For example, if our experiment is throwing dice and recording its outcome, the sample
space will be:
S1 = {1, 2, 3, 4, 5, 6}
What will be the sample when we’re tossing a coin?
S2 = {H, T}
[email protected] 5
EVENT
An event is a set of outcomes (i.e. a subset of the sample space) of an experiment.
Let’s get back to the experiment of rolling a dice and define events E and F as:
E = An even number is obtained = {2, 4, 6}
F = A number greater than 3 is obtained = {4, 5, 6}
The probability of these events:
[email protected] 7
RANDOM VARIABLE
Let’s take a simple example (refer to the above image as we go along). Define a
random variable X on the sample space of the experiment of tossing a coin. It takes a
value +1 if “Heads” is obtained and -1 if “Tails” is obtained. Then, X takes on values
+1 and -1 with equal probability of 1/2.
Consider that Y is the observed temperature (in Celsius) of a given place on a given
day. So, we can say that Y is a continuous random variable defined on the same space,
S = [0, 100] (Celsius Scale is defined from zero degree Celsius to 100 degrees
Celsius).
[email protected] 8
EXHAUSTIVE EVENTS
A set of events is said to be exhaustive if at least one of the events must occur at
any time. Thus, two events A and B are said to be exhaustive if A ∪ B = S, the
sample space.
For example, let’s say that A is the event that a card drawn out of a pack is red and
B is the event that the card drawn is black. Here, A and B are exhaustive because
the sample space S = {red, black}. Pretty straightforward stuff, right?
[email protected] 9
INDEPENDENT EVENTS
If the occurrence of one event does not have any effect on the occurrence of
another, then the two events are said to be independent. Mathematically, two events
A and B are said to be independent if: P(A ∩ B) = P(A) · P(B).
CONDITIONAL PROBABILITY
Consider that we’re drawing a card from a given deck.
What is the probability that it is a black card?
That’s easy – 1/2, right?
However, what if we know it was a black card – then what would be the probability
that it was a king?
This is where the concept of conditional probability comes into play.
Conditional probability is defined as the probability of an event A, given that
another event B has already occurred (i.e. A conditional B). This is represented by
P(A|B) and we can define it as:
P(A|B) = P(A ∩ B) / P(B)
[email protected] 11
CONDITIONAL PROBABILITY
Let event A represent picking a king and event B picking a black card. Then we find
P(A|B) using the above formula: P(A|B) = P(A ∩ B) / P(B) = (2/52) / (26/52) = 1/13.
WHAT IS BAYES' THEOREM?
Bayes' Theorem: P(A|B) = P(B|A) · P(A) / P(B) — it lets us update the probability of A after observing B.
WHAT IS BAYES’ THEOREM?
“Have you ever seen the popular TV show ‘Sherlock’
(or any crime thriller show)? Think about it – our
beliefs about the culprit change throughout the
episode. We process new evidence and refine our
hypothesis at each step.
[email protected] 16
AN ILLUSTRATION OF BAYES’ THEOREM
Let’s solve a problem using Bayes’ Theorem. This will help you understand and
visualize where you can apply it.
There are 3 boxes labeled A, B, and C:
Box A contains 2 red and 3 black balls
Box B contains 3 red and 1 black ball
And box C contains 1 red ball and 4 black balls
The three boxes are identical and have an equal probability of getting picked.
Consider that a red ball is chosen. Then what is the probability that this red ball was
picked out of box A?
[email protected] 17
CONTD…
We have prior probabilities P(A) = P(B) = P (C) = 1 / 3, since all boxes have equal
probability of getting picked.
[email protected] 19
NAIVE BAYES’ CLASSIFIERS
Naive Bayes’ Classifiers are a set of probabilistic classifiers based on
the Bayes’ Theorem. The underlying assumption of these classifiers is
that all the features used for classification are independent of each
other.
That’s where the name ‘naive’ comes in since it is rare that we obtain a
set of totally independent features.
The way these classifiers work is exactly how we solved in the
illustration, just with a lot more features assumed to be independent of
each other.
[email protected] 20
NAIVE BAYES’ CLASSIFIERS
Here, we need to find the probability P(Y|X), where X is an n-dimensional random variable whose component random variables X_1, X_2, …, X_n are assumed independent of each other given Y:
P(Y | X_1, …, X_n) ∝ P(Y) · ∏ᵢ P(X_i | Y)
ADVANTAGES & DISADVANTAGES
Advantages of Naïve Bayes Classifier:
•Naïve Bayes is one of the fast and easy ML algorithms to predict a class of
datasets.
•It can be used for Binary as well as Multi-class Classifications.
•It performs well in Multi-class predictions as compared to the other Algorithms.
•It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
•Naive Bayes assumes that all features are independent or unrelated, so it cannot
learn the relationship between features.
[email protected] 25
APPLICATIONS OF NAÏVE BAYES CLASSIFIER:
It is used for Credit Scoring.
It is used in medical data classification.
It can be used in real-time predictions because Naïve Bayes Classifier is an
eager learner.
It is used in Text classification such as Spam filtering and Sentiment analysis.
[email protected] 26
TYPES OF NAÏVE BAYES MODEL
There are three types of Naive Bayes Model, which are given below:
Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if
predictors take continuous values instead of discrete, then the model assumes that these values
are sampled from the Gaussian distribution.
[email protected] 27
TYPES OF NAÏVE BAYES MODEL
Multinomial: The Multinomial Naïve Bayes classifier is used when the data are multinomially distributed. It is primarily used for document classification problems, i.e. deciding which category a particular document belongs to, such as Sports, Politics, Education, etc.
The classifier uses the frequency of words as the predictors.
TYPES OF NAÏVE BAYES MODEL
Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also popular for document classification tasks.
BernoulliNB implements the naive Bayes training and classification algorithms for data distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features, but each is assumed to be a binary-valued (Bernoulli, Boolean) variable.
THANKS
[email protected] 30
MACHINE LEARNING (ML-12)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oDecision Tree
[email protected] 2
DECISION TREE
[email protected] 3
DECISION TREE
[email protected] 4
ILLUSTRATING CLASSIFICATION TASK
Training set (→ Learn Model):
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1 | Yes | Large  | 125K | No
2 | No  | Medium | 100K | No
3 | No  | Small  | 70K  | No
4 | Yes | Medium | 120K | No
5 | No  | Large  | 95K  | Yes
6 | No  | Medium | 60K  | No
7 | Yes | Large  | 220K | No
8 | No  | Small  | 85K  | Yes
9 | No  | Medium | 75K  | No
10 | No | Small  | 90K  | Yes

Test set (→ Apply Model):
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No  | Small  | 55K  | ?
12 | Yes | Medium | 80K  | ?
13 | Yes | Large  | 110K | ?
14 | No  | Small  | 95K  | ?
15 | No  | Large  | 67K  | ?
EXAMPLES OF CLASSIFICATION TASK
Predicting tumor cells as benign or malignant
[email protected] 6
EXAMPLE OF A DECISION TREE
[Figure: a decision tree over the attributes Refund, Marital Status and Taxable Income predicting Cheat; the internal nodes are the splitting attributes.]
DECISION TREE CLASSIFICATION TASK
APPLY MODEL TO TEST DATA
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree:
Refund? — Yes → NO; No → MarSt
MarSt? — Single, Divorced → TaxInc; Married → NO
TaxInc? — < 80K → NO; > 80K → YES
For this test record, the path Refund = No → MarSt = Married reaches the leaf NO, so we assign Cheat = "No".
DECISION TREE INDUCTION
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
[email protected] 17
TREE INDUCTION
Greedy strategy.
Split the records based on an attribute test that optimizes certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
[email protected] 18
TREE INDUCTION
Greedy strategy.
Split the records based on an attribute test that optimizes certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
[email protected] 19
HOW TO SPECIFY TEST CONDITION?
Depends on attribute types
Nominal
Ordinal
Continuous
[email protected] 20
SPLITTING BASED ON NOMINAL ATTRIBUTES
Multi-way split: use as many partitions as there are distinct values, e.g. CarType → {Family}, {Sports}, {Luxury}.
Binary split: divide the values into two subsets, e.g. CarType → {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.
SPLITTING BASED ON ORDINAL ATTRIBUTES
Multi-way split: use as many partitions as there are distinct values, e.g. Size → {Small}, {Medium}, {Large}.
Binary split: divide the values into two subsets that respect the order, e.g. Size → {Small, Medium} vs. {Large}, or {Small} vs. {Medium, Large}.
A split such as Size → {Small, Large} vs. {Medium} does not preserve the ordering.
SPLITTING BASED ON CONTINUOUS ATTRIBUTES
HOW TO DETERMINE THE BEST SPLIT
Before Splitting: 10 records of class 0,
10 records of class 1
[email protected] 26
HOW TO DETERMINE THE BEST SPLIT
Greedy approach:
Nodes with homogeneous class distribution are preferred
Need a measure of node impurity:
Non-homogeneous, Homogeneous,
High degree of impurity Low degree of impurity
[email protected] 27
MEASURES OF NODE IMPURITY
Entropy
Gini Index
Misclassification error
[email protected] 28
ENTROPY
Entropy(t) = −∑ⱼ p(j|t) · log₂ p(j|t), where p(j|t) is the relative frequency of class j at node t.
(See https://en.wikipedia.org/wiki/Entropy_(information_theory))
[Figures: entropy computed for example nodes with "yes"/"no" class counts.]
HOW TO FIND THE BEST SPLIT
Before splitting, the node contains C0: N00 and C1: N01 records, with impurity M0.
Candidate split A? (Yes/No) produces children with impurities M1 and M2 (combined M12); candidate split B? produces children with impurities M3 and M4 (combined M34).
Gain = M0 − M12 vs. M0 − M34: choose the split with the larger gain.
MEASURE OF IMPURITY: GINI
Gini index for a given node t: GINI(t) = 1 − ∑ⱼ [p(j|t)]²
Maximum (1 − 1/n_c) when records are equally distributed among all n_c classes, implying the least interesting information.
Minimum (0.0) when all records belong to one class, implying the most interesting information.
C1=0, C2=6: Gini = 0.000 | C1=1, C2=5: Gini = 0.278 | C1=2, C2=4: Gini = 0.444 | C1=3, C2=3: Gini = 0.500
EXAMPLE
THANKS
[email protected] 69
MACHINE LEARNING (ML-13)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oData preprocessing
[email protected] 2
What is Data?
A collection of numbers assigned as values to quantitative variables and/or characters assigned as values to qualitative variables; or a collection of records and their attributes.
Ordinal
Used to rank order cases
Example: ranking (eg. movie on scale of 1-10), height (tall, medium, short), grades
Interval
Example: Calendar dates, longitude, latitude
Ratio
Same as interval variable but they have a “true zero”
Example: time, length, population, age
4
Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
Distinctness: =, ≠
Order: <, >
Addition: +, −
Multiplication: ×, /
Nominal: distinctness
Ordinal: distinctness, order
Interval: distinctness, order, addition
Ratio: all four properties
Discrete and Continuous Attributes
Discrete Attribute
Has only a finite or countable infinite set of values
Examples: zip codes, counts, or the set of words in a collection of
documents
Often represented as integer variables.
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using a finite number of digits.
6
Type of data sets
Record Data
Data Matrix
Transaction data
Graph Data
World wide web
Molecular structure
Ordered
Spatial data
Temporal data
Sequential data
7
Record Data
Data that consists of a collection of records, each of which consists of fixed set
of attributes
8
Data Matrix
If data objects have the same fixed set of numeric attributes, then the data
objects can be thought of as points in a multidimensional space, where each
dimension represents a distinct attribute
Such data set can be represented by an m by n matrix, where there are m rows,
one for each object, and n columns, one for each attribute
9
Data Matrix Example for Documents
Each document becomes a `term' vector,
each term is a component (attribute) of the vector,
the value of each component is the number of times the corresponding term occurs in the
document.
10
Transaction Data
A typical type of record data, then
Each record (transaction) involves a set of items
11
Graph data
Example: Facebook graph and HTML links
12
Ordered data
Genetic sequence data
13
Data Quality
What kind of data quality problems?
How can we detect the problem with the data?
Missing values
Noise and outliers
Duplicate data
14
Data Quality: Missing Values
Reasons for missing values
Information is not collected
(e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
15
Data Quality: Noise
Noise refers to modification of original values
Examples: distortion of a person's voice when talking on a poor connection, "snow" on a television screen.
Data Quality: Outliers
Outliers are data objects with characteristics that are considerably different
than most of the other data objects in the data set
17
Data Quality: Duplicate Data
Data set may include data objects that are duplicates, or almost duplicates of one
another
Major issue when merging data from heterogenous sources
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues
18
Data Preprocessing
Imputation
Outlier management
Feature selection
19
Imputation (filling in) of missing data
Imputation is performed using a number of different algorithms, which can be
subdivided into single and multiple imputation methods.
Multiple-imputation methods
several likelihood- ordered choices for imputing the missing value are computed and one
“best” value is selected.
20
Imputation Contd…
Single imputation
Mean imputation
Hot deck imputation
Multiple imputation
21
Single imputation Contd…
Mean imputation
Mean imputation, also called unconditional mean
imputation, is a widely used imputation method
Mean imputation assumes that the mean of a
variable is the best estimate for any
case that has missing information on this variable
For continuous variable, each missing value is
imputed with the mean of known values for the
same variable
For a categorical variable, the missing values are imputed with the mode of the observed values of the same variable.
Advantages
fast,
simple,
Limitations
underestimation of the population variance
thus a small standard error
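A small pandas sketch of mean imputation (mean for a continuous column, mode for a categorical one); the column names and values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [25, 30, None, 40, None],
    "city": ["Agra", None, "Mathura", "Mathura", "Agra"],
})

df["age"] = df["age"].fillna(df["age"].mean())         # continuous: impute with the mean
df["city"] = df["city"].fillna(df["city"].mode()[0])   # categorical: impute with the mode
print(df)
```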
Single imputation Contd…
Hot-deck imputation: each missing value is replaced with an observed value taken from a similar ("donor") record.
Advantages
preserves the population distribution
it is better than mean imputation
Limitations
distort correlations and covariances
25
Other type of Single imputation
Regression imputation
Cold-deck imputation
Sequential imputation
26
Multiple imputation
The idea of Multiple Imputation is to replace each missing value with multiple
acceptable values that represent a distribution of possibilities.
This results in a number of complete datasets (usually 3-10):
27
Outlier management
Outlier: A data object that deviates significantly from the normal objects as if
it were generated by a different mechanism
Ex.: Unusual credit card purchase
Outliers are different from the noise data
Noise is random error or variance in a measured variable
28
Types of Outliers
Three kinds:
Global,
Contextual
Collective
29
Types of Outliers Contd…
Contextual outlier (or conditional outlier)
An object is a contextual outlier if it deviates significantly with respect to a selected context of the object.
Ex. 48 °C in Mathura: outlier? (It depends on whether it is summer or winter.)
To detect contextual outliers, the attributes of the data objects should be divided into two groups:
Contextual attributes: define the context, e.g. time & location
Behavioral attributes: the characteristics being checked for deviation, e.g. temperature
Types of Outliers Contd…
Collective Outliers
A subset of data objects collectively deviates significantly from the whole data set, even if the individual data objects may not be outliers.
Application example, intrusion detection: a number of computers keep sending denial-of-service packets to each other.
Detection of collective outliers considers not only the behavior of individual objects but also that of groups of objects, and it needs background knowledge of the relationships among the objects.
Outlier-detection approaches include supervised, semi-supervised and unsupervised methods, and statistical, proximity-based and clustering-based methods.
Statistical Methods
Statistical methods (also known as model-based methods) assume that the normal data follow some statistical model; data not following the model are outliers.
Methods are divided into two categories: parametric vs. non-parametric.
Parametric method: assumes that the normal data are generated by a parametric distribution with parameter θ; the probability density function f(x, θ) gives the probability that x is generated by that distribution.
Non-parametric method: does not assume an a-priori statistical model but determines the model from the input data; not completely parameter-free, but the number and nature of the parameters are flexible.
Parametric example: assume the data are generated from a normal distribution, use the maximum likelihood method to estimate μ and σ from the input data, and identify points with low probability as outliers.
Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4} — under the estimated μ and σ, 24.0 has very low probability, so 24.0 is an outlier.
Statistical Methods – Box Plot
Values less than Q1-1.5*IQR and greater than Q3+1.5*IQR are outliers
Consider the following dataset:
10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4
Here,
Q2 (median) = 14.6
Q1 = 14.4
Q3 = 14.9
IQR = Q3 − Q1 = 14.9 − 14.4 = 0.5
Outliers will be any points:
below Q1 − 1.5×IQR = 14.4 − 0.75 = 13.65, or
above Q3 + 1.5×IQR = 14.9 + 0.75 = 15.65
So, the outliers are 10.2, 15.9, and 16.4.
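A sketch reproducing the box-plot rule on this dataset with NumPy; NumPy's default quartile convention can differ slightly from the hand calculation, but the detected outliers are the same here.

```python
import numpy as np

data = np.array([10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
                 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4])

q1, q3 = np.percentile(data, [25, 75])     # quartiles (interpolation convention may differ slightly)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])   # [10.2 15.9 16.4]
```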
Non-Parametric Methods: Detection Using Histogram
36
Proximity-Based Methods
An object is an outlier if the nearest neighbors of the object are far away, i.e., the
proximity of the object significantly deviates from the proximity of most of the
other objects in the same data set
Two types of proximity-based outlier detection methods
Distance-based outlier detection: An object o is an outlier if its neighborhood does not have
enough other points
Density-based outlier detection: An object o is an outlier if its density is relatively much
lower than that of its neighbors
37
Clustering-Based Outlier Detection
An object is an outlier if it does not belong to any cluster, if it is far from the cluster it is closest to, or if it belongs to a small or sparse cluster.
One-Hot Encoding
Label Encoding
Label encoding assigns each unique value to a different integer.
This approach assumes an ordering of the categories:
Eg: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3)
This assumption makes sense in this example, because there is an indisputable ranking to the
categories. Not all categorical variables have a clear ordering in the values, but we refer to
those that do as ordinal variables.
For tree-based models (like decision trees and random forests), you can expect label encoding to work well with ordinal variables.
One-Hot Encoding Contd…
In contrast to label encoding, one-hot encoding does not assume an ordering of
the categories.
This approach to work particularly well if there is no clear ordering in the
categorical data (e.g., "Red" is neither more nor less than "Yellow").
We refer to categorical variables without an intrinsic ranking as nominal
variables.
One-hot encoding generally does not perform well if the categorical variable
takes on a large number of values (i.e., you generally won't use it for variables
taking more than 15 different values).
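A sketch of both encodings with pandas; the column and category names are illustrative (the ordinal categories reuse the "Never"…"Every day" example above).

```python
import pandas as pd

df = pd.DataFrame({"frequency": ["Never", "Rarely", "Most days", "Every day", "Rarely"],
                   "colour":    ["Red", "Yellow", "Red", "Green", "Yellow"]})

# Label encoding: map each ordinal category to an integer that respects its order
order = {"Never": 0, "Rarely": 1, "Most days": 2, "Every day": 3}
df["frequency_encoded"] = df["frequency"].map(order)

# One-hot encoding: one binary column per nominal category, no ordering implied
df = pd.get_dummies(df, columns=["colour"])
print(df)
```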
Feature selection
Features contain information about the target
Naïve view:
More feature
=>More information
=>More discrimination power
In practice:
Many reasons why this is not the case !
43
Feature selection Contd…
Curse of dimensionality
44
Feature selection Contd…
Data may contain many irrelevant and redundant variables (features) and often
comparably few (limited) training examples
Feature selection
A procedure in machine learning to find a subset of features that produces
‘better’ model for given dataset
Avoid overfitting and achieve better generalization ability
Reduce the storage requirement and training time
Interpretability
45
Feature Selection
Given a set of N features, the goal of feature selection is to select a subset of K
features (K << N) in order to minimize the classification error.
46
Feature Selection vs Feature Extraction
Feature Selection
New features represent a subset of the original features.
When classifying novel patterns, only a small number of features need to be computed (i.e.,
faster classification).
Feature Extraction
Projection to M<N dimension
New features are combinations (linear for PCA/LDA) of the original features (difficult to
interpret).
When classifying novel patterns, all features need to be computed.
47
Feature Selection: Main Steps
Feature selection is an optimization
problem.
48
Feature Selection: Main Steps (cont’d)
Search methods
Exhaustive
Heuristic
Randomized
Evaluation methods
Filter (Unsupervised)
Look at input only
Select the subset that has the most information
Wrapper (Supervised)
Train using selected subset
Estimate error on validation dataset
49
Search Methods
Assuming n features, an exhaustive search would require examining all (n choose d) possible subsets of size d.
In practice, heuristics are used to speed up the search, but they cannot guarantee optimality.
Evaluation Methods
Filter: evaluation is independent of the classification algorithm.
The objective function evaluates the "goodness" of a feature subset by its information content (e.g. interclass distance, statistical dependence, or information-theoretic measures).
Evaluation Methods (cont'd)
Wrapper: evaluation uses criteria related to the classification algorithm.
The objective function measures "goodness" as the classifier's predictive accuracy, estimated on a validation set.
Filter vs Wrapper Methods (cont’d)
Filter Methods
Advantages
Much faster than wrapper methods since the objective function has lower computational
requirements.
The optimum feature set might work well with various classifiers as it is not tied to a
specific classifier.
Disadvantages
Achieve lower recognition accuracy compared to wrapper methods.
Have a tendency to select more features compared to wrapper methods.
53
Filter vs Wrapper Methods
Wrapper Methods
Advantages
Achieve higher recognition accuracy compared to filter methods since they use the classifier
itself in choosing the optimum set of features.
Disadvantages
Much slower compared to filter methods since the classifier must be trained and tested for
each candidate feature subset.
The optimum feature subset might not work well for other classifiers.
54
Forward selection (heuristic search)
Start with no features; at each step, estimate the classification/regression error of adding each candidate feature and add the one that improves the error the most.
Backward selection (BS)
(heuristic search)
56
THANKS
[email protected] 57
MACHINE LEARNING (ML-14)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oMachine Learning
oTypes of Machine Learning
oUnsupervised Learning : Clustering
oUnsupervised Example
oApplications of Clustering
oK-Means Algorithm
oK-Means Examples
[email protected] 2
WHEN DO WE USE MACHINE LEARNING?
ML is used when:
• Human expertise does not exist (navigating on Mars)
[email protected] 7
SUPERVISED VS UNSUPERVISED LEARNING
Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute.
These patterns are then utilized to predict the values of the target attribute in future data instances.
Unsupervised learning: the data have no target attribute; we explore the data to find intrinsic structure (e.g. groups/clusters) in them.
UNSUPERVISED EXAMPLE
A bank wants to give credit card offers to its customers. Currently, they look at the details of each customer and, based on this information, decide which offer should be given to which customer.
With a large number of customers this does not scale; instead, the bank can group (cluster) customers with similar details and design one offer per segment.
APPLICATIONS OF CLUSTERING IN REAL-WORLD
Image Segmentation
Recommendation Engines
[email protected] 12
K-MEANS CLUSTERING
[Figures: step-by-step illustration of K-means on 2-D points.]
K-MEANS ALGORITHM
1. Choose the number of clusters K and initialise K centroids.
2. Assign each data point to its nearest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2–3 until the assignments (or centroids) no longer change.
EXAMPLE
Divide the given sample data in two (2) clusters using K-Means algorithm using
Euclidean Distance.
Sno. | Height (H) | Weight (W)
1 | 185 | 72
2 | 170 | 56
3 | 168 | 60
4 | 179 | 68
5 | 182 | 72
6 | 188 | 77
7 | 180 | 71
8 | 180 | 70
9 | 183 | 84
10 | 180 | 88
11 | 180 | 67
12 | 177 | 76
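A hedged scikit-learn sketch of the exercise above (K = 2 on the height/weight data); the initialization and random_state are assumptions, so the exact assignment may differ from a hand computation.

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77],
                 [180, 71], [180, 70], [183, 84], [180, 88], [180, 67], [177, 76]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(km.labels_)            # cluster index assigned to each sample
print(km.cluster_centers_)   # final centroids (mean height/weight of each cluster)
```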
K-MEANS FOR NON-SEPARATED CLUSTERS
[email protected] 22
EXAMPLE
[email protected] 23
SOLUTION
[email protected] 24
SOLUTION
[email protected] 25
SOLUTION
[email protected] 26
SOLUTION
[email protected] 27
SOLUTION
[email protected] 28
SOLUTION
[email protected] 29
SOLUTION
[email protected] 30
SOLUTION
[email protected] 31
SOLUTION
[email protected] 32
SOLUTION
[email protected] 33
VISUALIZATION OF SOLUTION
[email protected] 34
VISUALIZATION OF SOLUTION
[email protected] 35
HOW TO CHOOSE K: ELBOW METHOD
It is based on the sum of squared distance (SSE) between data points and their
assigned clusters’ centroids.
[email protected] 36
DRAWBACKS
K-means algorithm is good in capturing structure of the data if clusters have
a spherical-like shape.
It always try to construct a nice spherical shape around the centroid.
The clusters which have a complicated geometric shapes, k-means does a
poor job in clustering the data.
[email protected] 37
CLUSTERING QUALITY
Ideal clustering is characterized by minimal intra cluster distance and maximal
inter cluster distance.
There are majorly two types of measures to assess the clustering performance.
(i) Extrinsic Measures which require ground truth labels.
Rand index
(ii) Intrinsic Measures that does not require ground truth labels.
Silhouette Coefficient
[email protected] 38
BASIC CLUSTERING METHODS
1. Partitioning methods
k-means
2. Hierarchical methods
Agglomerative (bottom-up) or divisive (top-down)
3. Density-based methods
4. Grid-based methods
[email protected] 39
OVERVIEW OF CLUSTERING METHODS
[email protected] 40
HIERARCHICAL METHODS
A hierarchical clustering method works by grouping data objects into a hierarchy or
“tree” of clusters.
Representing data objects in the form of a hierarchy is useful for data summarization
and visualization.
Agglomerative methods start with individual objects as clusters, which are iteratively
merged to form larger clusters.
Divisive methods initially let all the given objects form one cluster, which they
iteratively split into smaller clusters.
Hierarchical clustering methods can encounter difficulties regarding the selection of
merge or split points.
[email protected] 41
AGGLOMERATIVE VERSUS DIVISIVE HIERARCHICAL
CLUSTERING
A hierarchical clustering method can be either agglomerative or divisive, depending
on whether the hierarchical decomposition is formed in a bottom-up (merging) or top
down (splitting) fashion.
[email protected] 43
AGGLOMERATIVE HIERARCHICAL CLUSTERING
We assign each point to an individual cluster in this technique.
Suppose there are 4 data points. We will assign each of these points to a cluster and
hence will have 4 clusters in the beginning:
[email protected] 44
AGGLOMERATIVE HIERARCHICAL CLUSTERING
Then, at each iteration, we merge the closest pair of clusters and repeat this step until only a single cluster
is left:
[email protected] 45
DIVISIVE HIERARCHICAL CLUSTERING
Divisive hierarchical clustering works in the opposite way.
Instead of starting with n clusters (in case of n observations), we start with a single
cluster and assign all the points to that cluster.
So, it doesn’t matter if we have 10 or 1000 data points. All these points will belong
to the same cluster at the beginning:
[email protected] 46
DIVISIVE HIERARCHICAL CLUSTERING
Now, at each iteration, we split off the point farthest from the rest of the cluster and repeat this
process until each cluster contains only a single point:
We are splitting (or dividing) the clusters at each step, hence the name divisive
hierarchical clustering.
[email protected] 47
STEPS TO PERFORM HIERARCHICAL CLUSTERING
We merge the most similar points or clusters in hierarchical clustering – we know this. Now the
question is: how do we decide which points or clusters are similar?
Hierarchical clustering is a distance-based algorithm: similarity is measured with a distance metric.
[email protected] 48
STEPS TO PERFORM HIERARCHICAL CLUSTERING
In hierarchical clustering, we have a concept called a proximity matrix. This stores
the distances between each point.
Suppose a teacher wants to divide her students into different groups. She has the
marks scored by each student in an assignment and based on these marks, she wants
to segment them into groups. There’s no fixed target here as to how many groups to
have. Since the teacher does not know what type of students should be assigned to
which group, it cannot be solved as a supervised learning problem. So, we will try to
apply hierarchical clustering here and segment the students into different groups.
Let’s take a sample of 5 students:
[email protected] 49
CREATING A PROXIMITY MATRIX
Let’s make the 5 x 5 proximity matrix for our example:
[email protected] 50
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Step 1: First, we assign all the points to an individual cluster
Step 2: Next, we will look at the smallest distance in the proximity matrix and merge
the points with the smallest distance. We then update the proximity matrix:
[email protected] 51
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Here, the smallest distance is 3 and hence we will merge points 1 and 2.
[email protected] 52
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Let’s look at the updated clusters and accordingly update the proximity matrix:
[email protected] 53
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Now, we will again calculate the proximity matrix for these clusters:
[email protected] 54
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Step 3: We will repeat step 2 until only a single cluster is left.
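The whole procedure can be sketched with SciPy. The five students' marks below are hypothetical (they are chosen only so that the smallest pairwise distance is 3, consistent with the merge of points 1 and 2 shown above).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical marks for the 5 students (the slide's actual numbers are not shown here)
marks = np.array([[10], [7], [28], [20], [35]], dtype=float)

# Proximity matrix: pairwise Euclidean distances between students
proximity = squareform(pdist(marks))
print(np.round(proximity, 1))

# Agglomerative clustering: repeatedly merge the two closest clusters
Z = linkage(marks, method='single')   # 'single' = distance between closest members
print(Z)                              # each row: clusters merged, their distance, new cluster size

# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree used below
```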
[email protected] 56
HOW SHOULD WE CHOOSE THE NUMBER OF
CLUSTERS IN HIERARCHICAL CLUSTERING?
Whenever two clusters are merged, we will join them in this dendrogram and the
height of the join will be the distance between these points.
[email protected] 57
HOW SHOULD WE CHOOSE THE NUMBER OF
CLUSTERS IN HIERARCHICAL CLUSTERING?
[email protected] 58
HOW SHOULD WE CHOOSE THE NUMBER OF
CLUSTERS IN HIERARCHICAL CLUSTERING?
The longer a vertical line in the dendrogram, the greater the distance between the clusters
it joins. We can then set a threshold distance and draw a horizontal line across the
dendrogram (generally, we set the threshold so that it cuts the tallest vertical line);
the number of vertical lines the horizontal line crosses gives the number of clusters.
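A minimal sketch of turning such a threshold into flat cluster labels, assuming SciPy and the same hypothetical marks as in the earlier sketch; the threshold value 10 is illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Same hypothetical marks as in the earlier hierarchical-clustering sketch
marks = np.array([[10], [7], [28], [20], [35]], dtype=float)
Z = linkage(marks, method='single')

# Cut the dendrogram at a chosen distance threshold (here 10, an illustrative value)
labels = fcluster(Z, t=10, criterion='distance')
print(labels)   # one flat cluster label per student
```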
[email protected] 59
THANKS
[email protected] 60
MACHINE LEARNING (ML-15)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oSupport Vector Machine
oTypes of SVM
oHyperplane
oSupport Vectors
oLinear SVM Mathematically
oExamples
oPros and Cons
[email protected] 2
INTRODUCTION
•Support Vector Machine abbreviated as SVM
[email protected] 3
INTRODUCTION
•SVM algorithm can be used for Face detection, image
classification, text categorization, etc.
[email protected] 4
WHAT IS SUPPORT VECTOR MACHINE?
•A support vector machine is a machine learning model that is able to
generalize between two different classes if a set of labelled data is
provided to the algorithm as a training set.
[email protected] 5
TYPES OF SVM
1. Linear SVM:
•Linear SVM is used for linearly separable data.
•It means that if a dataset can be classified into two classes by a single
straight line, then such data is termed linearly separable data.
•The classifier used is called the Linear SVM classifier.
[email protected] 6
TYPES OF SVM
2. Non-linear SVM:
•Non-linear SVM is used for non-linearly separable data.
•It means that if a dataset cannot be classified by using a straight line, then such
data is termed non-linear data.
•The classifier used is called the Non-linear SVM classifier.
[email protected] 7
SUPPORT VECTOR MACHINE
[email protected] 8
SUPPORT VECTOR MACHINE
[email protected] 9
SUPPORT VECTOR MACHINE
•To separate the two classes of data points, there are many
possible hyperplanes that could be chosen.
•The objective is to find a plane that has the maximum margin, i.e.
the maximum distance between data points of both classes.
[email protected] 11
HYPERPLANES
[email protected] 12
HYPERPLANES
•Identify the right hyper-plane (Scenario-1):
[email protected] 13
HYPERPLANES
Identify the right hyper-plane (Scenario-2):
[email protected] 14
HYPERPLANES
Identify the right hyper-plane (Scenario-3):
[email protected] 15
HYPERPLANES
Can we classify two classes (Scenario-4)?
[email protected] 16
HYPERPLANES
Find the hyper-plane to segregate the two classes (Scenario-5):
[email protected] 17
HYPERPLANES
Find the hyper-plane to segregate the two classes (Scenario-5): In the scenario below, we
can’t have a linear hyper-plane between the two classes, so how does SVM classify
these two classes?
[email protected] 18
HYPERPLANES
The SVM algorithm has a technique called the kernel trick. An SVM kernel is
a function that takes a low-dimensional input space and transforms it into a higher-
dimensional space, i.e. it converts a non-separable problem into a separable
problem.
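A short illustration of the idea, assuming scikit-learn: on concentric-circle data no straight line separates the classes, but an RBF-kernel SVM (which implicitly works in a higher-dimensional space) separates them easily.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two classes arranged in concentric circles: not separable by any straight line
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel='linear').fit(X, y)
rbf_svm = SVC(kernel='rbf').fit(X, y)    # kernel trick: implicit higher-dimensional space

print("linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))
```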
[email protected] 19
SUPPORT VECTORS
•Support vectors are the data points that lie closest to the hyperplane and influence
its position and orientation.
•Deleting the support vectors will change the position of the hyperplane.
[email protected] 20
SUPPORT VECTORS
[email protected] 21
GEOMETRIC MARGIN
Distance from an example x to the separator: r = y (wᵀx + b) / ‖w‖
Examples closest to the hyperplane are support vectors.
Margin ρ of the separator is the width of separation between the support vectors of the two classes.
[figure: separating hyperplane with margin ρ, and the distance r from a point x to its projection x′ along w]
For the canonical (functional margin 1) separator, each training example (xᵢ, yᵢ) satisfies:
wᵀxᵢ + b ≥ +1 if yᵢ = +1
wᵀxᵢ + b ≤ −1 if yᵢ = −1
The margin hyperplanes pass through the support vectors: wᵀxₐ + b = +1 and wᵀx_b + b = −1,
on either side of the separating hyperplane wᵀx + b = 0.
This implies: wᵀ(xₐ − x_b) = 2
Taking xₐ and x_b as the closest points on the two margin hyperplanes (so xₐ − x_b is parallel to w):
ρ = ‖xₐ − x_b‖₂ = 2 / ‖w‖₂
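To make ρ = 2/‖w‖ concrete, here is a minimal sketch; the tiny data set and the use of scikit-learn with a large C to approximate a hard margin are assumptions for illustration, not part of the slides.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable set (illustrative values)
X = np.array([[1, 1], [2, 2], [2, 0], [4, 4], [5, 5], [5, 3]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # very large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

# Margin between the two supporting hyperplanes: rho = 2 / ||w||
print("w =", w, "b =", b)
print("margin =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```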
SVM EXAMPLE
[email protected] 27
SVM EXAMPLE
[email protected] 28
SVM EXAMPLE
[email protected] 29
SVM EXAMPLE
[email protected] 30
SVM EXAMPLE
[email protected] 31
SVM EXAMPLE
[email protected] 32
SVM EXAMPLE
[email protected] 33
EXAMPLE
[email protected] 34
SVM EXAMPLE
[email protected] 35
SVM EXAMPLE
[email protected] 36
SVM EXAMPLE
[email protected] 37
SVM EXAMPLE
[email protected] 38
PROS AND CONS ASSOCIATED WITH SVM
Pros:
•It works really well with a clear margin of separation
•It is effective in high dimensional spaces.
•It is effective in cases where the number of dimensions is greater than the
number of samples.
•It uses a subset of training points in the decision function (called support
vectors), so it is also memory efficient.
[email protected] 39
PROS AND CONS ASSOCIATED WITH SVM
Cons:
•It doesn’t perform well when we have a large data set, because the required training
time is higher.
•It doesn’t perform very well when the data set has more noise, i.e. when the target classes
overlap.
[email protected] 40
THANKS
[email protected] 41
MACHINE LEARNING (ML-15)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oPrincipal Component Analysis (PCA)
[email protected] 2
MOTIVATION
Clustering
One way to summarize a complex real-valued data point with a
single categorical variable
Dimensionality reduction
Another way to simplify complex high-dimensional data
Summarize data with a lower dimensional real valued vector
• Given data points in d dimensions
• Convert them to data points in r<d dimensions
• With minimal loss of information
NEED FOR PCA
High-dimensional data is extremely complex to process due to inconsistencies in the features, which increase
the computation time.
[email protected] 5
NEED FOR PCA
[email protected] 6
Data Compression
Reduce data from 2D to 1D (e.g. the same length measured in inches and in cm collapses onto a single dimension)
Reduce data from 3D to 2D
Principal Component Analysis (PCA) problem formulation
(slides adapted from Andrew Ng)
STEP BY STEP PCA
[email protected] 11
[email protected] 12
COVARIANCE
Variance and Covariance:
Measure of the “spread” of a set of points around their center of mass (mean)
Variance:
Measure of the deviation from the mean for points in one dimension
Covariance:
Measure of how much each of the dimensions vary from the mean with respect to
each other
Example
EIGENVECTOR AND EIGENVALUE
Ax - λx = 0
Ax = λx
(A – λI)x = 0
[email protected] 25
PRACTICE PROBLEMS BASED ON PCA
Step-01:
Get data.
The given feature vectors are-
x1 = (2, 1)
x2 = (3, 5)
x3 = (4, 3)
x4 = (5, 6)
x5 = (6, 7)
x6 = (7, 8)
[email protected] 26
PRACTICE PROBLEMS BASED ON PCA
Step-02:
Calculate the mean vector (µ): µ = ( (2+3+4+5+6+7)/6 , (1+5+3+6+7+8)/6 ) = (4.5, 5).
[email protected] 27
PRACTICE PROBLEMS BASED ON PCA
Step-03:
Subtract mean vector (µ) from the given feature vectors.
x1 – µ = (2 – 4.5, 1 – 5) = (-2.5, -4)
x2 – µ = (3 – 4.5, 5 – 5) = (-1.5, 0)
x3 – µ = (4 – 4.5, 3 – 5) = (-0.5, -2)
x4 – µ = (5 – 4.5, 6 – 5) = (0.5, 1)
x5 – µ = (6 – 4.5, 7 – 5) = (1.5, 2)
x6 – µ = (7 – 4.5, 8 – 5) = (2.5, 3)
Feature vectors (xi) after subtracting mean vector (µ) are-
[email protected] 28
PRACTICE PROBLEMS BASED ON PCA
Step-04:
Calculate the covariance matrix.
Covariance matrix is given by-
[email protected] 29
PRACTICE PROBLEMS BASED ON PCA
[email protected] 30
PRACTICE PROBLEMS BASED ON PCA
Now, taking mi = (xi − µ)(xi − µ)ᵀ for each mean-subtracted feature vector, the covariance matrix
= (m1 + m2 + m3 + m4 + m5 + m6) / 6
≈ [ 2.92  3.67 ; 3.67  5.67 ]
[email protected] 31
PRACTICE PROBLEMS BASED ON PCA
Step-05:
Calculate the eigen values and eigen vectors of the covariance matrix.
λ is an eigen value for a matrix M if it is a solution of the characteristic equation |M – λI| = 0.
So, we have-
[email protected] 32
PRACTICE PROBLEMS BASED ON PCA
From here,
(2.92 – λ)(5.67 – λ) – (3.67 × 3.67) = 0
16.56 – 2.92λ – 5.67λ + λ² – 13.47 = 0
λ² – 8.59λ + 3.09 = 0
Solving this quadratic gives λ1 ≈ 8.22 and λ2 ≈ 0.38; we keep the larger eigenvalue, λ1 = 8.22.
[email protected] 34
PRACTICE PROBLEMS BASED ON PCA
Solving these, we get-
2.92X1 + 3.67X2 = 8.22X1
3.67X1 + 5.67X2 = 8.22X2
On simplification, we get-
5.3X1 = 3.67X2 ………(1)
3.67X1 = 2.55X2 ………(2)
From (1) and (2), X1 = 0.69 X2.
From (2), the eigen vector is X = (0.69, 1) up to scale, i.e. approximately (0.57, 0.82) after normalization.
[email protected] 35
PRACTICE PROBLEMS BASED ON PCA
Thus, the principal component for the given data set is the eigen vector corresponding to the largest eigenvalue λ1 = 8.22, i.e. approximately (0.57, 0.82).
[email protected] 36
PRACTICE PROBLEMS BASED ON PCA
Lastly, we project the data points onto the new subspace as-
[email protected] 37
PRACTICE PROBLEMS BASED ON PCA
Use PCA Algorithm to transform the pattern (2, 1) onto the eigen vector in the
previous question.
The given feature vector is (2, 1).
[email protected] 38
PRACTICE PROBLEMS BASED ON PCA
The feature vector gets transformed to
= (eigen vector)ᵀ × (feature vector – mean vector)
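The whole worked example can be checked with a few lines of NumPy. This is a verification sketch, not the slides' own code; note that the covariance here divides by n = 6 to match the slides, and that the sign of an eigenvector is arbitrary.

```python
import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)

mu = X.mean(axis=0)                  # mean vector (4.5, 5)
Xc = X - mu                          # mean-subtracted feature vectors

# Covariance matrix as in the slides: average of the outer products (divide by n)
C = (Xc.T @ Xc) / len(X)             # ~ [[2.92, 3.67], [3.67, 5.67]]

eigvals, eigvecs = np.linalg.eigh(C)
pc = eigvecs[:, np.argmax(eigvals)]  # principal component (largest eigenvalue ~ 8.22); sign is arbitrary
print("covariance:\n", np.round(C, 2))
print("eigenvalues:", np.round(eigvals, 2))
print("principal component:", np.round(pc, 2))

# Transform the pattern (2, 1): (eigenvector)^T x (feature vector - mean vector)
print("projection of (2, 1):", pc @ (np.array([2, 1]) - mu))   # ~ +/-4.7 depending on sign
```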
[email protected] 39
SIFT feature visualization
• The top three principal components of SIFT descriptors from a set of images are computed
• Map these principal components to the principal components of the RGB space
• pixels with similar colors share similar structures
Application: Image compression
ORIGINAL IMAGE
[figure: the original 12×12 image patches]
PCA COMPRESSION: 144D → 6D
6 MOST IMPORTANT EIGENVECTORS
[figure: patches reconstructed from 6 components, alongside the 6 eigenvector images]
PCA COMPRESSION: 144D → 3D
3 most important eigenvectors
[figure: patches reconstructed from 3 components, alongside the 3 eigenvector images]
PCA COMPRESSION: 144D → 1D
60 most important eigenvectors
http://en.wikipedia.org/wiki/Discrete_cosine_transform
DIMENSIONALITY REDUCTION
PCA (Principal Component Analysis):
Find the projection that maximizes the variance
Multidimensional Scaling:
Find the projection that best preserves inter-point distances
…
…
THANKS
[email protected] 57
MACHINE LEARNING (ML-17)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oEnsemble Methods
[email protected] 2
INTRODUCTION
Ensemble methods
Use a combination of models to increase accuracy
Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating
an improved model M*
INTRODUCTION
The two most popular ensemble methods are bagging and boosting.
Bagging: train a collection of individual models in parallel, each on a random subset
of the data, and average the predictions over the collection of classifiers.
Boosting: train the classifiers sequentially, re-weighting the training data after each
round; the weight of classifier Mi’s vote (αi) is
αi = log( (1 − error(Mi)) / error(Mi) )
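A minimal sketch contrasting the two, assuming scikit-learn's `BaggingClassifier` and `AdaBoostClassifier` with their default base estimators and a synthetic data set (all illustrative choices, not taken from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: models trained in parallel, each on a random (bootstrap) subset; predictions are averaged
bag = BaggingClassifier(n_estimators=50, random_state=0)
# Boosting: models trained sequentially; each classifier's vote is weighted by its error
boost = AdaBoostClassifier(n_estimators=50, random_state=0)

print("bagging :", cross_val_score(bag, X, y, cv=5).mean())
print("boosting:", cross_val_score(boost, X, y, cv=5).mean())
```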
REFERENCES
Han, Jiawei, Jian Pei, and Micheline Kamber. Data mining: concepts and
techniques. Elsevier, 2011.
[email protected] 14
MACHINE LEARNING (ML-18)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oArtificial Neural Network
History of Artificial Neural Networks
What is an Artificial Neural Networks?
How it works?
Learning
[email protected] 2
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
History of the ANNs stems from the 1940s, the decade of the first electronic
computer.
However, the first important step took place in 1957 when Rosenblatt introduced the
first concrete neural model, the perceptron. Rosenblatt also took part in constructing
the first successful neurocomputer, the Mark I Perceptron. After this, the
development of ANNs has proceeded as described in Figure.
[email protected] 3
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
Rosenblatt's original perceptron model contained only one layer. From this, a multi-
layered model was derived in 1960. At first, the use of the multi-layer perceptron
(MLP) was complicated by the lack of an appropriate learning algorithm.
In 1974, Werbos came to introduce a so-called backpropagation algorithm for the
three-layered perceptron network.
[email protected] 4
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
The application area of MLP networks remained rather limited until the
breakthrough in 1986, when a general backpropagation algorithm for multi-layered
perceptrons was introduced by Rumelhart and McClelland.
In 1982, Hopfield brought out his idea of a neural network. Unlike the neurons in
MLP, the Hopfield network consists of only one layer whose neurons are fully
connected with each other.
[email protected] 5
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
Since then, new versions of the Hopfield network have been developed. The
Boltzmann machine has been influenced by both the Hopfield network and
the MLP.
[email protected] 6
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
In 1988, Radial Basis Function (RBF) networks were first introduced by
Broomhead & Lowe. Although the basic idea of RBF was developed 30 years
ago under the name method of potential function, the work by Broomhead &
Lowe opened a new frontier in the neural network community.
[email protected] 7
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
In 1982, A totally unique kind of network model is the Self-Organizing Map
(SOM) introduced by Kohonen. SOM is a certain kind of topological map
which organizes itself based on the input patterns that it is trained with. The
SOM originated from the LVQ (Learning Vector Quantization) network the
underlying idea of which was also Kohonen's in 1972.
[email protected] 8
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
[email protected] 9
ARTIFICIAL NEURAL NETWORK
A set of major aspects of a parallel distributed model
include:
a set of processing units (cells).
a state of activation for every unit, which is equivalent to the output of the
unit.
connections between the units. Generally each connection is defined by a
weight.
a propagation rule, which determines the effective input of a unit from its
external inputs.
[email protected] 10
ARTIFICIAL NEURAL NETWORK
an activation function, which determines the new level of
activation based on the effective input and the current activation.
an external input for each unit.
a method for information gathering (the learning rule).
an environment within which the system must operate, providing
input signals and, if necessary, error signals.
[email protected] 11
COMPUTERS VS. NEURAL NETWORKS
“Standard” Computers Neural Networks
[email protected] 12
WHY ARTIFICIAL NEURAL NETWORKS?
There are two basic reasons why we are interested in building artificial neural
networks (ANNs):
• Technical viewpoint: Some problems such as character recognition or the
prediction of future states of a system require massively parallel and adaptive
processing.
• Biological viewpoint: ANNs can be used to replicate and simulate
components of the human (or animal) brain, thereby giving us insight into
natural information processing.
[email protected] 13
ARTIFICIAL NEURAL NETWORK
•The “building blocks” of neural networks are
the neurons.
• In technical systems, we also refer to them as units or nodes.
[email protected] 15
HOW DO ANNS WORK?
An artificial neural network (ANN) is either a hardware
implementation or a computer program which strives to simulate
the information processing capabilities of its biological exemplar.
[email protected] 16
HOW DO OUR BRAINS WORK?
The brain is a massively parallel information processing
system.
Our brains are a huge network of processing elements; a
typical brain contains a network of about 10 billion neurons.
[email protected] 17
HOW DO OUR BRAINS WORK?
A processing element
Dendrites: Input
Cell body: Processor
Synaptic: Link
Axon: Output
[email protected] 18
How do our brains work?
A processing element
[email protected] 19
How do our brains work?
A processing element
[email protected] 20
How do our brains work?
A processing element
The axon endings almost touch the dendrites or cell body of the
next neuron.
[email protected] 22
How do our brains work?
A processing element
[email protected] 23
How do our brains work?
A processing element
Neurotransmitters are chemicals which are released from the first neuron
and which bind to the second.
[email protected] 24
How do our brains work?
A processing element
[email protected] 27
How do ANNs work?
Inputs: x1, x2, …, xm
Processing: the unit sums its inputs, Σ = x1 + x2 + … + xm = y
Output: y
[email protected] 28
How do ANNs work?
Not all inputs are equal: each input xi has an associated weight wi
Inputs: x1, x2, …, xm
Weights: w1, w2, …, wm
Output: y
[email protected] 29
How do ANNs work?
The signal is not passed down to the next neuron verbatim.
Inputs: x1, x2, …, xm with weights w1, w2, …, wm
Processing: weighted sum vk = Σ wi xi
Transfer function (activation function): f(vk)
Output: y = f(vk)
[email protected] 30
The output is a function of the inputs, the weights, and the transfer
(activation) function.
[email protected] 31
Activation Functions:
[email protected] 32
Forward propagation: Vectorized implementation
[email protected] 33
Neural Network learning its own features
[email protected] 34
Other Network Architecture
[email protected] 35
Simple example: AND
[email protected] 36
Simple example: OR
[email protected] 37
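Both AND and OR can be realised by a single neuron with a step activation; the weights below are one common choice (an illustrative assumption, since the slides' figures are not reproduced here).

```python
def neuron(x1, x2, w1, w2, bias):
    # Single neuron with a step (threshold) activation
    return int(w1 * x1 + w2 * x2 + bias >= 0)

# One choice of weights that realises AND and OR (other choices work too)
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2,
          "AND:", neuron(x1, x2, 1, 1, -1.5),
          "OR:", neuron(x1, x2, 1, 1, -0.5))
```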
Artificial Neural Networks
An ANN can:
1. compute any computable function, by the
appropriate selection of the network
topology and weights values.
[email protected] 38
Learning by trial‐and‐error
Continuous process of:
Trial:
Processing an input to produce an output (In terms of ANN: Compute
the output function of a given input)
Evaluate:
Evaluating this output by comparing the actual output with
the expected output.
Adjust:
Adjust the weights.
[email protected] 39
How it works?
Set initial values of the weights randomly.
Input: truth table of the XOR
Do
Read input (e.g. 0, and 0)
Compute an output (e.g. 0.60543)
Compare it to the expected output. (Diff= 0.60543)
Modify the weights accordingly.
Loop until a condition is met
Condition: certain number of iterations
Condition: error threshold
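A minimal sketch of this trial-evaluate-adjust loop for the XOR truth table, using a small 2-4-1 sigmoid network; the network size, learning rate and the two stopping conditions' values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Input: truth table of XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

# Set initial values of the weights randomly (2 inputs, 4 hidden units, 1 output)
W1, b1 = rng.uniform(-1, 1, (2, 4)), np.zeros((1, 4))
W2, b2 = rng.uniform(-1, 1, (4, 1)), np.zeros((1, 1))

lr = 1.0
for _ in range(20000):                       # condition: number of iterations
    H = sigmoid(X @ W1 + b1)                 # compute an output for each input
    Y = sigmoid(H @ W2 + b2)
    err = Y - T                              # compare with the expected output
    if np.mean(err ** 2) < 1e-3:             # condition: error threshold
        break
    # Modify the weights accordingly (gradient of the squared error)
    dY = err * Y * (1 - Y)
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= lr * (H.T @ dY); b2 -= lr * dY.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ dH); b1 -= lr * dH.sum(axis=0, keepdims=True)

print(np.round(Y, 3))   # should approach 0, 1, 1, 0 once training has converged
```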
[email protected] 40
Design Issues
Initial weights (small random values ∈ [-1, 1])
Transfer function (How the inputs and the weights are
combined to produce output?)
Error estimation
Weights adjusting
Number of neurons
Data representation
Size of training set
[email protected] 41
Transfer Functions
Linear: The output is proportional to the total weighted
input.
[email protected] 42
Error Estimation
The root mean square error (RMSE) is a frequently used
measure of the differences between the values predicted by a
model or an estimator and the values actually observed from
the thing being modeled or estimated.
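For reference, a one-function sketch of the RMSE (the sample numbers are made up):

```python
import numpy as np

def rmse(predicted, observed):
    # Root mean square error between model outputs and target values
    predicted, observed = np.asarray(predicted), np.asarray(observed)
    return np.sqrt(np.mean((predicted - observed) ** 2))

print(rmse([0.9, 0.2, 0.8], [1, 0, 1]))   # ~0.173
```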
[email protected] 43
Weights Adjusting
After each iteration, weights should be adjusted to
minimize the error.
– All possible weights
– Back propagation
[email protected] 44
Back Propagation
Back-propagation is an example of supervised learning; it is used at
each layer to minimize the error between the layer’s response and
the actual data.
The error at each hidden layer is an average of the evaluated error.
Hidden-layer networks are trained this way.
[email protected] 45
ANN EXAMPLE
[email protected] 46
BACKPROPAGATION STEP BY STEP
[email protected] 47
BACKPROPAGATION STEP BY STEP
• Neural network training is about finding weights that minimize prediction error. We
usually start our training with a set of randomly generated weights.
• Then, backpropagation is used to update the weights in an attempt to correctly map
arbitrary inputs to outputs.
[email protected] 48
BACKPROPAGATION STEP BY STEP
The initial weights are as follows: w1 = 0.11, w2 = 0.21, w3 = 0.12, w4 = 0.08, w5 = 0.14 and w6 = 0.15
[email protected] 49
BACKPROPAGATION STEP BY STEP
Dataset
Our dataset has one sample with two inputs and one output.
[email protected] 50
BACKPROPAGATION STEP BY STEP
Forward Pass
We will use the given weights and inputs to predict the output. Inputs are multiplied by weights;
the results are then passed forward to the next layer.
[email protected] 51
BACKPROPAGATION STEP BY STEP
Calculating Error
Now, it’s time to find out how our network performed by calculating the difference between the
actual output and the predicted one. It’s clear that our network output, or prediction, is not even
close to the actual output. We can calculate the difference, or the error, as follows.
[email protected] 52
BACKPROPAGATION STEP BY STEP
Reducing Error
• The main goal of training is to reduce the error, i.e. the difference
between the prediction and the actual output. Since the actual output is constant,
“not changing”, the only way to reduce the error is to
change the prediction value.
• How do we change the prediction value? By adjusting the weights.
• Backpropagation calculates the gradient of the error function with respect to the neural
network’s weights. The calculation proceeds backwards through the
network.
[email protected] 57
BACKPROPAGATION STEP BY STEP
So to update w6 we can apply the following formula
Similarly, we can derive the update formula for w5 and any other weights existing between the output and
the hidden layer.
[email protected] 58
BACKPROPAGATION STEP BY STEP
However, when moving backward to update w1, w2, w3 and w4, which lie between the input and hidden
layer, the partial derivative of the error function with respect to w1, for example, will be as follows.
[email protected] 59
BACKPROPAGATION STEP BY STEP
We can find the update formulas for the remaining weights w2, w3 and w4 in the same way.
In summary, the update formulas for all weights are as follows:
[email protected] 60
BACKPROPAGATION STEP BY STEP
We can rewrite the update formulas in matrix form as follows:
[email protected] 61
BACKPROPAGATION STEP BY STEP
Backward Pass
Using derived formulas we can find the new weights.
Learning rate: a hyperparameter, which means we have to choose (tune) its value manually.
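A small numeric sketch of the forward and backward passes using the initial weights quoted earlier. The sample (i1 = 2, i2 = 3, target = 1), the learning rate, and the exact 2-2-1 wiring are assumptions for illustration, not taken from the slides.

```python
# The initial weights quoted in the slides
w1, w2, w3, w4, w5, w6 = 0.11, 0.21, 0.12, 0.08, 0.14, 0.15

# Assumed sample and learning rate (the slide's actual numbers are not reproduced here)
i1, i2, target = 2.0, 3.0, 1.0
lr = 0.05

for step in range(5):
    # Forward pass through a 2-2-1 network with linear units (one plausible wiring)
    h1 = i1 * w1 + i2 * w2
    h2 = i1 * w3 + i2 * w4
    out = h1 * w5 + h2 * w6

    error = 0.5 * (out - target) ** 2          # squared-error loss
    delta = out - target                       # d(error)/d(out)

    # Backward pass: gradient of the error w.r.t. each weight (chain rule), then update
    g5, g6 = delta * h1, delta * h2
    g1, g2 = delta * w5 * i1, delta * w5 * i2
    g3, g4 = delta * w6 * i1, delta * w6 * i2
    w1, w2, w3, w4 = w1 - lr * g1, w2 - lr * g2, w3 - lr * g3, w4 - lr * g4
    w5, w6 = w5 - lr * g5, w6 - lr * g6

    print(f"step {step}: prediction={out:.4f} error={error:.4f}")
```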
[email protected] 62
BACKPROPAGATION STEP BY STEP
Now, using the new weights, we repeat the forward pass.
[email protected] 63
BACKPROPAGATION STEP BY STEP
https://hmkcode.com/netflow/
[email protected] 64
Applications Areas
Function approximation
including time series prediction and modeling.
Classification
including pattern and sequence recognition, novelty
detection and sequential decision making.
(radar systems, face identification, handwritten text recognition)
Data processing
including filtering, clustering, blind source separation and
compression.
(data mining, e-mail spam filtering)
[email protected] 65
Advantages / Disadvantages
Advantages
Adapt to unknown situations
Powerful, it can model complex functions.
Ease of use, learns by example, and very little user
domain‐specific expertise needed
Disadvantages
Forgets
Not exact
Large complexity of the network structure
[email protected] 66
Conclusion
Artificial Neural Networks are an imitation of biological
neural networks, but much simpler ones.
Computing has a lot to gain from neural networks: their
ability to learn by example makes them very flexible and
powerful, and there is no need to devise an algorithm in
order to perform a specific task.
[email protected] 67
Conclusion
Neural networks also contribute to areas of research such as
neurology and psychology. They are regularly used to model
parts of living organisms and to investigate the internal
mechanisms of the brain.
Many factors affect the performance of ANNs, such as the
transfer functions, size of training sample, network topology,
weights adjusting algorithm, …
[email protected] 68
Q. How does each neuron work in ANNs?
What is back propagation?
A neuron: receives input from many other neurons;
changes its internal state (activation) based on the
current input;
sends one output signal to many other neurons, possibly
including its input neurons (when the ANN is a recurrent network).
[email protected] 69
THANKS
[email protected] 70
MACHINE LEARNING (ML-19)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oArtificial Neural Network
Gradient Descent
Stochastic Gradient Descent
Gradient Descent Vs Stochastic Gradient Descent
[email protected] 2
OPTIMIZATION
Optimization is always the ultimate goal whether you are dealing with a real life
problem or building a software product.
Optimization basically means getting the optimal output for your problem.
[email protected] 5
GRADIENT DESCENT ALGORITHM AND ITS VARIANTS
Gradient Descent is an optimization algorithm used for minimizing the cost
function in various machine learning algorithms. It is basically used for
updating the parameters of the learning model.
[email protected] 7
BATCH GRADIENT DESCENT
[email protected] 8
STOCHASTIC GRADIENT DESCENT
Stochastic Gradient Descent:
• This is a type of gradient descent which processes 1 training example per
iteration.
• Hence, the parameters are being updated even after one iteration in which
only a single example has been processed.
• Hence this is considerably faster than batch gradient descent per update.
• But when the number of training examples is large, processing only one
example per iteration becomes an overhead for the system, since the number
of iterations will be quite large.
[email protected] 9
STOCHASTIC GRADIENT DESCENT
[email protected] 10
MINI BATCH GRADIENT DESCENT:
Mini Batch gradient descent:
• This is a type of gradient descent which works faster than both batch
gradient descent and stochastic gradient descent.
• Here b examples where b<m are processed per iteration.
• So even if the number of training examples is large, it is processed in
batches of b training examples in one go.
• Thus, it works for larger training sets, and with a smaller number
of iterations.
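A compact sketch of the three variants on a toy linear-regression problem: setting b = m gives batch gradient descent, b = 1 gives stochastic gradient descent, and 1 < b < m (here b = 32) gives mini-batch gradient descent. The data and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = 3x + 2 + noise
X = rng.uniform(-1, 1, (1000, 1))
y = 3 * X[:, 0] + 2 + 0.1 * rng.standard_normal(1000)
Xb = np.hstack([X, np.ones((len(X), 1))])      # add a bias column

def gradient_step(theta, Xb, y, lr):
    # Gradient of the mean squared error w.r.t. the parameters
    grad = 2 * Xb.T @ (Xb @ theta - y) / len(y)
    return theta - lr * grad

theta = np.zeros(2)
b, lr = 32, 0.1                                 # b examples per iteration (b < m)
for epoch in range(50):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), b):           # batch GD: b = m; SGD: b = 1
        batch = idx[start:start + b]
        theta = gradient_step(theta, Xb[batch], y[batch], lr)

print("estimated [slope, intercept]:", np.round(theta, 2))   # ~ [3, 2]
```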
[email protected] 11
MINI BATCH GRADIENT DESCENT:
[email protected] 12
THANKS
[email protected] 13
MACHINE LEARNING (ML-20)
AN INTRODUCTION TO : DEEP LEARNING
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oAn introduction to: Deep Learning
aka or related to
Deep Neural Networks
Deep Structural Learning
Deep Belief Networks
etc,
[email protected] 2
Machine Learning Basics
Machine learning is a field of computer science that gives computers the ability to
learn without being explicitly programmed
Machine Learning
[figure: supervised learning pipeline, labeled data fed to a learning algorithm during training, then predictions (e.g. "class A") on new data; related task types include anomaly detection and sequence labeling]
http://mbjoseph.github.io/2013/11/27/measure.html
ML vs. Deep Learning
Most machine learning methods work well because of human-designed representations
and input features
ML becomes just optimizing weights to best make a final prediction
What is Deep Learning (DL) ?
A machine learning subfield concerned with learning representations of data. Exceptionally effective
at learning patterns.
Deep learning algorithms attempt to learn (multiple levels of) representation by using a
hierarchy of multiple layers
If you provide the system tons of information, it begins to understand it and respond in
useful ways.
https://www.xenonstack.com/blog/static/public/uploads/media/machine-learning-vs-deep-learning.png
Why is DL useful?
o Manually designed features are often over-specified, incomplete and take a long
time to design and validate
o Learned Features are easy to adapt, fast to learn
o Deep learning provides a very flexible, (almost?) universal, learnable framework for
representing world, visual and linguistic information.
o Can learn both unsupervised and supervised
o Effective end-to-end joint system learning
o Utilize large amounts of training data
[slide equations: layer computations of the form h = g(Wx + b)]
Activation functions
How do we train?
Demo
Training
1. Sample labeled data (a batch)
2. Forward it through the network, get predictions
3. Back-propagate the errors
4. Update the network weights (scaled by the learning rate)
Activation functions
Non-linearities needed to learn complex (non-linear) representations of data, otherwise
the NN would be just a linear function
http://cs231n.github.io/assets/nn1/layer_sizes.jpeg
http://adilmoujahid.com/images/activation.png
- Sigmoid neurons saturate and kill gradients, so the NN will barely learn
• when a neuron’s activation is 0 or 1, it saturates
• the gradient in these regions is almost zero
• almost no signal will flow to its weights
• if the initial weights are too large, most neurons will saturate
Activation: Tanh
Takes a real-valued number and “squashes” it into the range between −1 and 1.
tanh: ℝ → [−1, 1]
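To see the saturation effect numerically, here is a tiny sketch of the sigmoid, its gradient, and tanh (the sample inputs are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)        # near zero when the unit saturates (activation ~0 or ~1)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print("sigmoid :", np.round(sigmoid(z), 4))
print("gradient:", np.round(d_sigmoid(z), 4))   # tiny at |z| = 10: gradients are "killed"
print("tanh    :", np.round(np.tanh(z), 4))     # squashes into (-1, 1)
```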
http://adilmoujahid.com/images/activation.png
http://adilmoujahid.com/images/activation.png
http://wiki.bethanycrane.com/overfitting-of-data
https://www.neuraldesigner.com/images/learning/selection_error.svg
Regularization
Dropout
• Randomly drop units (along with their connections)
during training
• Each unit retained with fixed probability p,
independent of other units
• Hyper-parameter p to be chosen (tuned)
Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural
networks from overfitting." Journal of machine learning research (2014)
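A minimal sketch of a dropout mask in the "inverted dropout" style; scaling the surviving units by 1/p at training time is a common implementation choice, not something stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    # Retain each unit with probability p; zero the rest and scale survivors by 1/p
    # ("inverted dropout") so no rescaling is needed at test time.
    if not training:
        return activations
    mask = rng.random(activations.shape) < p
    return activations * mask / p

h = np.ones((2, 8))
print(dropout(h, p=0.5))
```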
L2 = weight decay
• Regularization term that penalizes big weights, added to the
objective: J_reg = J + λ Σ wᵢ²
• The weight decay value λ determines how dominant regularization is during
gradient computation
• A big weight decay coefficient gives a big penalty for big weights
Early-stopping
• Use validation error to decide when to stop training
• Stop when monitored quantity has not improved after n subsequent epochs
• n is called patience
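A small sketch of the patience rule; the per-epoch validation errors are made-up numbers.

```python
# Made-up validation errors, one value per epoch
validation_errors = [0.90, 0.80, 0.70, 0.65, 0.66, 0.67, 0.68, 0.69, 0.70, 0.71]

best_val, best_epoch, patience, wait = float("inf"), 0, 5, 0
for epoch, val_error in enumerate(validation_errors):
    if val_error < best_val:
        best_val, best_epoch, wait = val_error, epoch, 0   # improvement: reset the counter
    else:
        wait += 1
        if wait >= patience:            # no improvement for n = patience epochs
            print(f"stop at epoch {epoch}; best model was at epoch {best_epoch}")
            break
```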
Loss functions and output
Classification and regression use different losses and output units: classification typically
uses a softmax output with a cross-entropy loss, while regression uses a linear (identity) output, f(x) = x, with a squared-error loss.
Convolution
A small filter (here 3×3) slides over the input matrix; at each position the element-wise products are summed to produce one output value.
http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
Convolutional Neural Networks (CNNs)
Main CNN idea for text:
Compute vectors for n-grams and group them afterwards
Max pooling: 2×2 filters with stride 2
https://shafeentejani.github.io/assets/images/pooling.gif
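A minimal NumPy sketch of 2×2 max pooling with stride 2 (the input values are made up):

```python
import numpy as np

def max_pool_2x2(feature_map):
    # 2x2 max pooling with stride 2: keep the largest value in each 2x2 block
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 8, 3, 4]])
print(max_pool_2x2(fm))   # [[6 4] [8 9]]
```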
CNN for text classification
Severyn, Aliaksei, and Alessandro Moschitti. "UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment
Classification." SemEval@ NAACL-HLT. 2015.
CNN with multiple filters
https://pbs.twimg.com/media/C2j-8j5UsAACgEK.jpg
[slide equations: two hidden-state updates of the form h = g(Wh + …), whose outputs are concatenated: y = [h ; h]]