
Probability and Learning

Machine Learning
Unit 4

University of Vienna

28 March 2014

1
Probability and Learning

The Naive Bayes Classifier


Logistic Regression
Linear Discriminant Analysis (LDA)

2
Classification problem

The goal in classification is to take an input vector x and to assign it to one of k discrete
classes Ci where i = 1, ..., k .
In the most common scenario, the classes are taken to be disjoint, so that each input is
assigned to one and only one class.
The input space is thereby divided into decision regions whose boundaries are called
decision boundaries or decision surfaces.
In probabilistic models we define ti as the probability that the class is Ci, with Σ_{i=1}^{k} ti = 1.

Three strategies for classification

1 generative models
2 discriminative models
3 discriminant function

3
Generative models

Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as generative models, because by sampling from them it is possible to generate synthetic data points in the input space.

1 solve the inference problem of determining the class-conditional densities p(x |Ci )
for each class Ci individually
2 separately infer the prior class probabilities p(Ci )
3 use Bayes' theorem in the form

p(Ci|x) = p(x|Ci) p(Ci) / p(x)

to find the posterior class probabilities p(Ci|x). The denominator is given by

p(x) = Σi p(x|Ci) p(Ci)

Equivalently, we can model the joint distribution p(x , Ci ) directly and then normalize to
obtain the posterior probabilities.
4
Discriminative models

Approaches that model the posterior probabilities directly are called discriminative
models.

1 solve the inference problem of determining the posterior class probabilities p(Ci |x )
2 subsequently use decision theory to assign each new x to one of the classes

5
Discriminant function

Find a function f (x ), called a discriminant function, which maps each input x directly
onto a class label.
In the case of two-class problems, f might be binary-valued, with f = 0 representing class C1 and f = 1 representing class C2.
In this case, probabilities play no role.
Example: Fisher’s discriminant, perceptron algorithm.

6
Example

Options: go to the pub, watch TV, go to a party, or study. The choice depends on whether an assignment is due, whether a party is on, and how lazy you feel.
Deadline Party Lazy Activity
Urgent Yes Yes Party
Urgent No Yes Study
Near Yes Yes Party
None Yes No Party
None No Yes Pub
None Yes No Party
Near No No Study
Near No Yes TV
Near Yes Yes Party
Urgent No No Study

7
The Bayes Theorem

There are m = 4 different classes Ci and n = 10 different examples Xj .

C1 = Pub, C2 = TV, C3 = Party, C4 = Study

For Deadline we have 3 states:

D1 = Urgent, D2 = Near, D3 = None

For Party we have 2 states:


P1 = Yes, P2 = No
For Lazy we have 2 states:
L1 = Yes, L2 = No
We estimate P(Ci) as the number of examples with class Ci divided by the total number of examples.

P (Pub) = 0.1, P (TV) = 0.1, P (Party) = 0.5, P (Study) = 0.3

8
The Bayes Theorem

The conditional probability of Ci given that x has the value X is written P(Ci|X); it tells us how likely the class is Ci given that the value of x is X.
The question is how to get this conditional probability, since we cannot read it directly from the table.
What we can read directly from the table are the probabilities P(Dj|Ci):
P(Urg|Pub)= 0, P(Near|Pub)= 0, P(None|Pub)= 1
P(Urg|TV)= 0, P(Near|TV)= 1, P(None|TV)= 0
P(Urg|Party)= 0.2, P(Near|Party)= 0.4, P(None|Party)= 0.4
P(Urg|Study)= 2/3, P(Near|Study)= 1/3, P(None|Study)= 0

9
The Bayes Theorem

We use the same procedure for the second feature, Party:
P(Party|Pub) = 0, P(No Party|Pub ) = 1
P(Party|TV) = 0, P(No Party|TV ) = 1
P(Party|Party) = 1, P(No Party|Party) = 0
P(Party|Study) = 0, P(No Party|Study) = 1
and for the third feature Lazy
P(Lazy|Pub) = 1, P(No Lazy|Pub) = 0
P(Lazy|TV) = 1, P(No Lazy|TV) = 0
P(Lazy|Party) = 0.6, P(No Lazy|Party) = 0.4
P(Lazy|Study) = 1/3, P(No Lazy|Study) = 2/3

10
The Naive Bayes classifier

The simplification: the features are assumed to be conditionally independent of each other given the class.
The Naive Bayes classifier:

feed in the values of the features
compute the probabilities of each of the possible classes
pick the most likely class

Suppose:
Deadline = Near, Party = No, Lazy = Yes

11
The Naive Bayes classifier

From conditional independence, P(BC|A) = P(B|A) P(C|A), it follows that

P(A|BCD) = P(A) P(BCD|A) / P(BCD) = P(A) P(B|A) P(C|A) P(D|A) / P(BCD)

It is sufficient to calculate the numerator, because the denominator is the same for each activity.
P(Pub) P(Near|Pub) P(No Party|Pub) P(Lazy|Pub) = 0.1 × 0 × 1 × 1 = 0
P(TV) P(Near|TV) P(No Party|TV) P(Lazy|TV) = 0.1 × 1 × 1 × 1 = 0.1
P(Party) P(Near|Party) P(No Party|Party) P(Lazy|Party) = 0.5 × 0.4 × 0 × 0.6 = 0
P(Study) P(Near|Study) P(No Party|Study) P(Lazy|Study) = 0.3 × (1/3) × 1 × (1/3) ≈ 0.033
So based on this you will be watching TV tonight.
To scale these scores into probabilities:

P(TV) = 0.1 / (0.1 + 1/30) = 0.75    P(Study) = (1/30) / (0.1 + 1/30) = 0.25
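A minimal Python sketch of this calculation, with the counts taken directly from the ten-row table above (the function and variable names are just illustrative):

```python
# Naive Bayes by hand for the pub/TV/party/study example.
data = [
    ("Urgent", "Yes", "Yes", "Party"), ("Urgent", "No",  "Yes", "Study"),
    ("Near",   "Yes", "Yes", "Party"), ("None",   "Yes", "No",  "Party"),
    ("None",   "No",  "Yes", "Pub"),   ("None",   "Yes", "No",  "Party"),
    ("Near",   "No",  "No",  "Study"), ("Near",   "No",  "Yes", "TV"),
    ("Near",   "Yes", "Yes", "Party"), ("Urgent", "No",  "No",  "Study"),
]
classes = ["Pub", "TV", "Party", "Study"]

def prior(c):
    """P(Ci): fraction of examples whose activity is c."""
    return sum(1 for row in data if row[3] == c) / len(data)

def conditional(feature, value, c):
    """P(feature = value | Ci), estimated by counting within class c."""
    rows = [row for row in data if row[3] == c]
    return sum(1 for row in rows if row[feature] == value) / len(rows)

# Query: Deadline = Near, Party = No, Lazy = Yes
query = ("Near", "No", "Yes")
scores = {}
for c in classes:
    score = prior(c)
    for i, value in enumerate(query):
        score *= conditional(i, value, c)   # naive (conditional independence) step
    scores[c] = score

total = sum(scores.values())
posterior = {c: s / total for c, s in scores.items()}
print(scores)                              # Pub: 0, TV: 0.1, Party: 0, Study: 1/30
print(posterior)                           # TV: 0.75, Study: 0.25
print(max(posterior, key=posterior.get))   # -> TV
```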

12
Probabilistic Discriminative Models

The second approach is to use the functional form of the generalized linear model
explicitly and to determine its parameters directly by using maximum likelihood.
In the direct approach, we are maximizing a likelihood function defined through the
conditional distribution p(Ci |x ), which represents a form of discriminative training.
One advantage of the discriminative approach is that there will typically be fewer
adaptive parameters to be determined.
It may also lead to improved predictive performance, particularly when the
class-conditional density assumptions give a poor approximation to the true distributions.

13
Logistic Regression

Classification
Email: Spam/Not Spam?
Online Transaction: Fraudulent (Yes/No)?
Tumor: Malignant/Benign?

Two-Class Classification problem: y ∈ {0, 1}

0: Negative Class (e.g. Benign Tumor), the absence of something
1: Positive Class (e.g. Malignant Tumor), the presence of something

Multiclass Classification problem: y ∈ {0, 1, 2, 3}

14
Logistic Regression

The posterior probability for class C1 can be written as

p(C1|x) = p(x|C1) p(C1) / (p(x|C1) p(C1) + p(x|C2) p(C2))

p(C1|x) = 1 / (1 + exp(−z)) = g(z)

with

z = ln [ p(x|C1) p(C1) / (p(x|C2) p(C2)) ]

15
Logistic Sigmoid Function

g(z) = 1 / (1 + e^(−z))

g(z) is called the sigmoid function or logistic function. The term sigmoid means S-shaped.

16
Logit function

The inverse of the logistic sigmoid is given by

z = ln( g / (1 − g) )
and is known as the logit function.
It represents the log of the ratio of probabilities for the two classes, also known as the log
odds.
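A short sketch of the sigmoid and its inverse (function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid g(z) = 1 / (1 + exp(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(g):
    """Inverse of the sigmoid: the log odds z = ln(g / (1 - g))."""
    return np.log(g / (1.0 - g))

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))         # [0.119 0.5   0.881]
print(logit(sigmoid(z)))  # recovers [-2.  0.  2.]
```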

17
Logistic regression

Linear threshold classifier: output hθ(x) = θ^T x, thresholded at 0.5:

If hθ(x) ≥ 0.5, predict y = 1
If hθ(x) < 0.5, predict y = 0

Using ordinary regression for a classification problem is not a very good idea:

1 θ can be strongly influenced by outliers
2 hθ(x) can be > 1 or < 0

Logistic Regression:
we want 0 ≤ hθ(x) ≤ 1
Despite its name, logistic regression solves a classification problem, not a regression problem.
Here is a solution:
hθ(x) = g(θ^T x)

18
Hypothesis Representation

Task: fit the parameters θ to the data

hθ(x) = 1 / (1 + e^(−θ^T x)) = estimated probability that y = 1 on input x:
hθ(x) = P(y = 1 | x; θ)
Suppose:

predict y = 1 if hθ(x) ≥ 0.5
predict y = 0 if hθ(x) < 0.5

hθ(x) = g(θ^T x) ≥ 0.5 when θ^T x ≥ 0
hθ(x) = g(θ^T x) < 0.5 when θ^T x < 0

19
Decision Boundary

predict y = 1 when θ^T x ≥ 0
predict y = 0 when θ^T x < 0
E.g. hθ(x) = g(θ0 + θ1x1 + θ2x2)
Then θ0 + θ1x1 + θ2x2 = 0 is a decision boundary.

20
Non-linear decision boundary

hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x1² + θ4x2²)

Then θ0 + θ1x1 + θ2x2 + θ3x1² + θ4x2² = 0 is a decision boundary. E.g. if θ = (−1, 0, 0, 1, 1)^T, then the decision boundary is the circle x1² + x2² = 1.
More complicated cases:
hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x1² + θ4x1²x2 + θ5x1²x2² + θ6x1³x2 + · · · )
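A small sketch of the circular decision boundary from the example above, with θ = (−1, 0, 0, 1, 1)^T:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])   # theta from the example above

def h(x1, x2):
    """h_theta(x) = g(theta0 + theta1*x1 + theta2*x2 + theta3*x1^2 + theta4*x2^2)."""
    features = np.array([1.0, x1, x2, x1**2, x2**2])
    return sigmoid(theta.dot(features))

# Points inside the circle x1^2 + x2^2 = 1 get y = 0, points outside get y = 1.
print(h(0.0, 0.0) >= 0.5)   # False: the origin lies inside the boundary
print(h(2.0, 0.0) >= 0.5)   # True: (2, 0) lies outside the boundary
```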

21
Logistic regression cost function

Cost(hθ(x), y) = −log(hθ(x)) if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x)) if y = 0

Note: y is always 0 or 1. Combining the two cost functions:

Cost(hθ(x), y) = −y log(hθ(x)) − (1 − y) log(1 − hθ(x))

J(θ) = −(1/m) Σ_{i=1}^{m} [ y^(i) log(hθ(x^(i))) + (1 − y^(i)) log(1 − hθ(x^(i))) ]

22
Cost function and gradient

The gradient of the cost function is a vector whose j-th element is defined as follows:

∂J(θ)/∂θj = (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) xj^(i)

For the calculation of θ we use a gradient-based method:

scipy.optimize.fmin_bfgs

It will find the best parameters θ for the logistic regression cost function, given a fixed dataset (of x and y values). The parameters of scipy.optimize.fmin_bfgs are:

the initial values of the parameters you are trying to optimize;
a function that, when given the training set and a particular θ, computes the logistic regression cost and gradient with respect to θ for the dataset (x, y).

A minimal usage sketch follows below.
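The sketch uses a small synthetic dataset standing in for real data; the dataset and variable names are placeholders, not from the slides:

```python
import numpy as np
from scipy.optimize import fmin_bfgs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1 - y)*log(1 - h) ]."""
    m = len(y)
    h = sigmoid(X.dot(theta))
    return -(y.dot(np.log(h)) + (1 - y).dot(np.log(1 - h))) / m

def gradient(theta, X, y):
    """dJ/dtheta_j = (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)."""
    m = len(y)
    h = sigmoid(X.dot(theta))
    return X.T.dot(h - y) / m

# Synthetic two-feature data; the first column of ones is the intercept term.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = (X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=100) > 0).astype(float)

theta0 = np.zeros(X.shape[1])   # initial values of the parameters
theta = fmin_bfgs(cost, theta0, fprime=gradient, args=(X, y))
print(theta)                    # fitted parameters defining the decision boundary
```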

23
Visualizing the Data

The first two components for the first two groups of the iris data, shown with different markers.

24
Results of logistic regression for iris data
The first two components for the first two groups of the iris data, shown with different markers, together with the decision boundary
−5.672 + 7.726x1 − 11.645x2 = 0. Accuracy: 99.00%.

25
Multiclass classification

Email foldering/tagging:
Work (y = 1), Friends (y = 2), Family (y = 3), Hobby (y = 4)
Medical diagnosis:
Not ill (y = 1), Cold (y = 2), Flu (y = 3)
Weather:
Sunny (y = 1) , Cloudy (y = 2), Rain (y = 3), Snow (y = 4)

26
One-vs-all

Class 1: hθ^(1)(x)
Class 2: hθ^(2)(x)
Class 3: hθ^(3)(x)

hθ^(i)(x) = P(y = i | x; θ), i = 1, 2, 3

hθ^(i)(x) is the estimated probability that y = i on input x.
1 − hθ^(i)(x) is the estimated probability that y ≠ i on input x (one-vs-rest).

27
One-vs-all

Train a logistic regression classifier hθ^(i)(x) for each class i to predict the probability that y = i.
On a new input x, to make a prediction, pick the class i that maximizes

max_i hθ^(i)(x)
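A sketch of this prediction step, assuming the k fitted parameter vectors have been stacked into a matrix Theta (the names here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(Theta, X):
    """Theta: (k, n) array, one fitted parameter vector theta^(i) per class.
    X: (m, n) array of inputs, including the intercept column of ones.
    Returns, for each row of X, the class i whose h_theta^(i)(x) is largest."""
    probs = sigmoid(X.dot(Theta.T))   # (m, k) matrix of estimated P(y = i | x)
    return np.argmax(probs, axis=1)   # index of the most probable class per input
```

Since the sigmoid is monotonic, taking the argmax of X.dot(Theta.T) directly would give the same predictions.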

28
Discriminant Function

The third strategy for classification is a discriminant function.


Non probabilistic method.
A discriminant is a function that takes an input vector x and assigns it to one of m
classes, denoted Ci .
Linear discriminants have hyperplanes as the decision surfaces.
If m = 2, a linear discriminant function is

y(x) = w^T x + w0
w is a weight vector and w0 is a bias.
The negative of the bias is sometimes called a threshold.
Decision rule:
x is assigned to class C1 if y (x ) > 0 and to class C2 otherwise.
The corresponding decision boundary is defined by y(x) = 0, which corresponds to a (D − 1)-dimensional hyperplane within the D-dimensional input space.

29
Linear Discriminant Analysis (LDA)

c classes of data with means µ1 , µ2 , · · · , µc and mean of the entire dataset µ


Covariance:
C = Σj (xj − µ)(xj − µ)^T

Within-class scatter: SW = Σ_{classes c} Σ_{j∈c} pc (xj − µc)(xj − µc)^T, with pc the probability of the class (that is, the number of data points in that class divided by the total number)

Between-class scatter: SB = Σ_{classes c} Σ_{j∈c} (µc − µ)(µc − µ)^T

C = SW + SB

30
Fisher’s linear discriminant

31
Linear Discriminant Analysis (LDA)

The datasets are easy to separate into different classes (i.e. the classes are
discriminable) if SB /SW is large.
The projection of the data:
z = w^T x

We want to maximize the ratio of between-class and within-class scatter, (w^T SB w) / (w^T SW w).
The optimal w are the generalised eigenvectors of SW^(−1) SB.
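A compact sketch of this computation, using the scatter definitions from the previous slide and scipy.linalg.eig for the generalised eigenproblem (names are illustrative):

```python
import numpy as np
from scipy import linalg

def lda_directions(X, labels):
    """Return the generalised eigenvectors of S_W^{-1} S_B, sorted by eigenvalue."""
    mu = X.mean(axis=0)
    n_features = X.shape[1]
    S_W = np.zeros((n_features, n_features))
    S_B = np.zeros((n_features, n_features))
    for c in np.unique(labels):
        Xc = X[labels == c]
        p_c = len(Xc) / len(X)          # probability of the class
        mu_c = Xc.mean(axis=0)
        d = Xc - mu_c
        S_W += p_c * d.T.dot(d)                          # within-class scatter
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)  # between-class scatter
    # Solve the generalised eigenproblem S_B w = lambda S_W w
    evals, evecs = linalg.eig(S_B, S_W)
    order = np.argsort(-evals.real)
    return evecs[:, order].real

# Projection onto the leading discriminant direction: z = w^T x
# w = lda_directions(X, labels)[:, 0]
# z = X.dot(w)
```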

32
Linear Discriminant Analysis (LDA)
Plot of the first two dimensions of the iris data, showing the three classes before and after LDA has been applied. Only one dimension (y) is required to separate the classes after LDA has been applied.

33
Multiple classes

Extension of linear discriminants to m > 2 classes.


1 one-versus-the-rest classifier:
the use of m − 1 classifiers, each of which solves a two-class problem of separating points in a particular class Ci from points not in that class.
2 one-versus-one classifier:
the use of m(m − 1)/2 binary discriminant functions, one for every possible pair of classes. Each point is then classified according to a majority vote amongst the discriminant functions.
3 single m-class discriminant:
the use of m linear functions of the form

yi(x) = wi^T x + wi0

and then assigning a point x to class Ci if yi(x) > yj(x) for all j ≠ i (see the sketch below).
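A minimal sketch of the single m-class discriminant rule (array names are illustrative):

```python
import numpy as np

def predict(W, w0, X):
    """Single m-class linear discriminant: y_i(x) = w_i^T x + w_i0.
    W: (m, D) matrix of weight vectors, w0: (m,) bias vector, X: (N, D) inputs.
    Each point is assigned to the class with the largest discriminant value."""
    scores = X.dot(W.T) + w0      # (N, m) matrix whose columns are y_i(x)
    return np.argmax(scores, axis=1)
```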

34
