04 Probability and Learning
Machine Learning
Unit 4
University of Vienna
Probability and Learning
Classification problem
The goal in classification is to take an input vector x and to assign it to one of k discrete
classes Ci where i = 1, ..., k .
In the most common scenario, the classes are taken to be disjoint, so that each input is
assigned to one and only one class.
The input space is thereby divided into decision regions whose boundaries are called
decision boundaries or decision surfaces.
In probabilistic models we define ti as the probability that the class is Ci: Σ_{i=1}^{k} ti = 1.
There are three approaches to solving the classification problem:
1 generative models
2 discriminative models
3 discriminant functions
Generative models
1 solve the inference problem of determining the class-conditional densities p(x |Ci )
for each class Ci individually
2 separately infer the prior class probabilities p(Ci )
3 use Bayes theorem in the form p(Ci|x) = p(x|Ci) p(Ci) / p(x), with p(x) = Σ_i p(x|Ci) p(Ci)
Equivalently, we can model the joint distribution p(x , Ci ) directly and then normalize to
obtain the posterior probabilities.
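A minimal Python sketch of this generative recipe, assuming 1-D Gaussian class-conditional densities; the Gaussian form, the priors and all numeric values are illustrative assumptions, not from the slides:

import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities p(x|Ci): 1-D Gaussians (illustrative only)
class_conditionals = {
    "C1": norm(loc=0.0, scale=1.0),
    "C2": norm(loc=2.0, scale=1.5),
}
# Assumed prior class probabilities p(Ci) (illustrative only)
priors = {"C1": 0.6, "C2": 0.4}

def posterior(x):
    # Bayes' theorem: p(Ci|x) = p(x|Ci) p(Ci) / p(x), with p(x) = sum_j p(x|Cj) p(Cj)
    joint = {c: class_conditionals[c].pdf(x) * priors[c] for c in priors}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

print(posterior(1.0))   # posterior probabilities for both classes, summing to 1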
Discriminative models
Approaches that model the posterior probabilities directly are called discriminative
models.
1 solve the inference problem of determining the posterior class probabilities p(Ci |x )
2 subsequently use decision theory to assign each new x to one of the classes
Discriminant function
Find a function f (x ), called a discriminant function, which maps each input x directly
onto a class label.
In the case of two-class problems, f might be binary valued, with f = 0 representing class C1 and f = 1 representing class C2.
In this case, probabilities play no role.
Example: Fisher’s discriminant, perceptron algorithm.
Example
Options: go to the pub, watch TV, go to a party, or study. The choice depends on an assignment deadline, whether a party is happening, and whether we feel lazy.
Deadline Party Lazy Activity
Urgent Yes Yes Party
Urgent No Yes Study
Near Yes Yes Party
None Yes No Party
None No Yes Pub
None Yes No Party
Near No No Study
Near No Yes TV
Near Yes Yes Party
Urgent No No Study
The Bayes Theorem
The conditional probability of Ci given that x has the value X, written P(Ci|X), tells us how likely it is that the class is Ci given that x has the value X.
The question is how to get to this conditional probability, since we cannot read it directly from the table. Bayes theorem gives
P(Ci|X) = P(X|Ci) P(Ci) / P(X).
Directly from the table we can read the probabilities P(Dj|Ci):
P(Urg|Pub)= 0, P(Near|Pub)= 0, P(None|Pub)= 1
P(Urg|TV)= 0, P(Near|TV)= 1, P(None|TV)= 0
P(Urg|Party)= 0.2, P(Near|Party)= 0.4, P(None|Party)= 0.4
P(Urg|Study)= 2/3, P(Near|Study)= 1/3, P(None|Study)= 0
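As a check, a short Python sketch that recomputes these conditional probabilities by counting rows of the table above (the list below is simply my own encoding of that table):

from collections import Counter

# (Deadline, Activity) pairs taken from the table above
rows = [("Urgent", "Party"), ("Urgent", "Study"), ("Near", "Party"),
        ("None", "Party"), ("None", "Pub"), ("None", "Party"),
        ("Near", "Study"), ("Near", "TV"), ("Near", "Party"), ("Urgent", "Study")]

activity_counts = Counter(a for _, a in rows)
pair_counts = Counter(rows)

# P(Dj | Ci) = count(Dj, Ci) / count(Ci)
for activity in ("Pub", "TV", "Party", "Study"):
    probs = {d: pair_counts[(d, activity)] / activity_counts[activity]
             for d in ("Urgent", "Near", "None")}
    print(activity, probs)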
The Naive Bayes classifier
The simplification:
The features are assumed to be conditionally independent of each other given the class.
The Naive Bayes classifier: choose the class Ci that maximizes P(Ci) ∏j P(Xj|Ci).
Suppose:
Deadline = Near, Party = No, Lazy = Yes
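A minimal Naive Bayes sketch for this query, multiplying the class prior by the per-feature conditional probabilities estimated from the table above (my own encoding of the data; the scores are unnormalised, so we just pick the largest):

from collections import Counter

# Rows of the table above: (Deadline, Party, Lazy) -> Activity
data = [(("Urgent", "Yes", "Yes"), "Party"), (("Urgent", "No", "Yes"), "Study"),
        (("Near", "Yes", "Yes"), "Party"),   (("None", "Yes", "No"), "Party"),
        (("None", "No", "Yes"), "Pub"),      (("None", "Yes", "No"), "Party"),
        (("Near", "No", "No"), "Study"),     (("Near", "No", "Yes"), "TV"),
        (("Near", "Yes", "Yes"), "Party"),   (("Urgent", "No", "No"), "Study")]

classes = Counter(label for _, label in data)
n = len(data)

def naive_bayes_score(x, label):
    # P(Ci) * prod_j P(Xj | Ci), with the conditionals estimated by counting
    rows = [feats for feats, lab in data if lab == label]
    score = classes[label] / n
    for j, value in enumerate(x):
        score *= sum(1 for feats in rows if feats[j] == value) / len(rows)
    return score

query = ("Near", "No", "Yes")   # Deadline = Near, Party = No, Lazy = Yes
scores = {label: naive_bayes_score(query, label) for label in classes}
print(scores)                         # TV gets the highest score for this query
print(max(scores, key=scores.get))    # -> 'TV'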
Probabilistic Discriminative Models
The second approach is to use the functional form of the generalized linear model
explicitly and to determine its parameters directly by using maximum likelihood.
In the direct approach, we are maximizing a likelihood function defined through the
conditional distribution p(Ci |x ), which represents a form of discriminative training.
One advantage of the discriminative approach is that there will typically be fewer
adaptive parameters to be determined.
It may also lead to improved predictive performance, particularly when the
class-conditional density assumptions give a poor approximation to the true distributions.
Logistic Regression
Classification
Email: Spam/Not Spam?
Online Transaction: Fraudulent (Yes/No)?
Tumor: Malignant/Benign?
Logistic Regression
p(C1|x) = 1 / (1 + exp(−z)) = g(z)
with z = θᵀx.
Logistic Sigmoid Function
g(z) = 1 / (1 + e^(−z))
g (z ) - Sigmoid function or Logistic function. The term sigmoid means S-shaped.
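A one-line NumPy version of g(z), as a minimal sketch:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)); works elementwise on arrays
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                          # 0.5
print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # values squashed into (0, 1)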
Logit function
z = ln( g / (1 − g) )
and is known as the logit function.
It represents the log of the ratio of probabilities for the two classes, also known as the log
odds.
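A short derivation (standard algebra, not from the slides) shows the logit is the inverse of the sigmoid:
g = 1 / (1 + e^(−z))  ⇒  e^(−z) = (1 − g) / g  ⇒  z = ln( g / (1 − g) ).
With g = p(C1|x) and p(C2|x) = 1 − p(C1|x), this is z = ln( p(C1|x) / p(C2|x) ), the log odds.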
Logistic regression
Logistic Regression:
want 0 ≤ hθ(x) ≤ 1
Logistic regression is a classification method, not a regression method, despite its name.
Here is a solution:
hθ(x) = g(θᵀx)
Hypothesis Representation
predict y = 1 if hθ(x) ≥ 0.5
predict y = 0 if hθ(x) < 0.5
Decision Boundary
predict y = 1 when θᵀx ≥ 0
predict y = 0 when θᵀx < 0
E.g. hθ(x) = g(θ0 + θ1 x1 + θ2 x2)
Then θ0 + θ1 x1 + θ2 x2 = 0 is the decision boundary.
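A tiny numeric illustration (the θ values are made up for illustration): with θ0 = −3, θ1 = 1, θ2 = 1, the boundary θ0 + θ1 x1 + θ2 x2 = 0 is the line x1 + x2 = 3.

import numpy as np

theta = np.array([-3.0, 1.0, 1.0])   # hypothetical parameters [theta0, theta1, theta2]

def predict(x1, x2):
    # predict y = 1 when theta^T x >= 0, else y = 0 (x includes the constant 1)
    return int(theta @ np.array([1.0, x1, x2]) >= 0)

print(predict(1.0, 1.0))   # 1 + 1 < 3  -> 0
print(predict(2.0, 2.0))   # 2 + 2 >= 3 -> 1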
Non-linear decision boundary
Adding polynomial features, e.g. hθ(x) = g(θ0 + θ1 x1 + θ2 x2 + θ3 x1² + θ4 x2²), gives a non-linear decision boundary.
Logistic regression cost function
J(θ) = −(1/m) Σ_{i=1}^{m} [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]
Cost function and gradient
The gradient of the cost function is a vector whose j-th element is defined as follows:
∂J(θ)/∂θj = (1/m) Σ_{i=1}^{m} ( hθ(x^(i)) − y^(i) ) xj^(i)
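A minimal batch gradient-descent sketch built directly on this gradient and the cross-entropy cost above; the learning rate and the synthetic data are illustrative choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    # dJ/dtheta_j = (1/m) * sum_i (h(x_i) - y_i) * x_ij
    m = X.shape[0]
    return X.T @ (sigmoid(X @ theta) - y) / m

# Illustrative 1-D data: class 1 for larger x values
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = (x + 0.3 * rng.normal(size=100) > 0).astype(float)
X = np.column_stack([np.ones_like(x), x])        # add the constant feature

theta = np.zeros(2)
for _ in range(2000):                            # plain batch gradient descent
    theta -= 0.1 * gradient(theta, X, y)

print(theta, cost(theta, X, y))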
Visualizing the Data
The first two components of the first two classes of the iris data, plotted with different markers.
Results of logistic regression for iris data
The first two components of the first two classes of the iris data, plotted with different markers, together with the decision boundary
−5.672 + 7.726 x1 − 11.645 x2 = 0. Accuracy: 99.00%.
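A hedged scikit-learn sketch of this experiment, assuming "first two components" means the first two features and "first two groups" means classes 0 and 1 of the iris data; the fitted coefficients will generally differ from the numbers above depending on the solver and regularisation:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
mask = iris.target < 2                     # first two classes only
X = iris.data[mask][:, :2]                 # first two features only
y = iris.target[mask]

clf = LogisticRegression().fit(X, y)
print(clf.intercept_, clf.coef_)           # decision boundary: intercept + coef . x = 0
print("Accuracy:", clf.score(X, y))        # these two classes are (almost) linearly separable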
Multiclass classification
Email foldering/tagging:
Work (y = 1), Friends (y = 2), Family (y = 3), Hobby (y = 4)
Medical diagnosis:
Not ill (y = 1), Cold (y = 2), Flu (y = 3)
Weather:
Sunny (y = 1) , Cloudy (y = 2), Rain (y = 3), Snow (y = 4)
One-vs-all
Class 1: hθ^(1)(x)
Class 2: hθ^(2)(x)
Class 3: hθ^(3)(x)
hθ^(i)(x) = P(y = i | x; θ)   (i = 1, 2, 3)
hθ^(i)(x) - the estimated probability that y = i on input x
1 − hθ^(i)(x) - the estimated probability that y ≠ i on input x (one-vs-rest)
One-vs-all
Train a logistic regression classifier hθ^(i)(x) for each class i to predict the probability that y = i.
On a new input x, to make a prediction, pick the class i that maximizes hθ^(i)(x):
predict y = argmax_i hθ^(i)(x)
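A minimal one-vs-all sketch: fit one binary logistic-regression classifier per class and predict with the argmax of the per-class probabilities (the iris data and the manual loop are illustrative; scikit-learn can also do this internally):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# One binary classifier h^(i)(x) per class: y == i versus the rest
classifiers = [LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
               for c in classes]

def predict(X_new):
    # P(y = i | x) estimated by each binary classifier; pick the argmax
    probs = np.column_stack([clf.predict_proba(X_new)[:, 1] for clf in classifiers])
    return classes[np.argmax(probs, axis=1)]

print(predict(X[:5]), y[:5])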
Discriminant Function
y(x) = wᵀx + w0
w is a weight vector and w0 is a bias.
The negative of the bias is sometimes called a threshold.
Decision rule:
x is assigned to class C1 if y (x ) > 0 and to class C2 otherwise.
The corresponding decision boundary is defined by y(x) = 0, which corresponds to a (D − 1)-dimensional hyperplane within the D-dimensional input space.
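A minimal sketch of this decision rule (w and w0 here are arbitrary illustrative values, not learned):

import numpy as np

w = np.array([1.0, -2.0])    # hypothetical weight vector
w0 = 0.5                     # hypothetical bias

def classify(x):
    # y(x) = w^T x + w0; assign C1 if y(x) > 0, else C2
    return "C1" if w @ x + w0 > 0 else "C2"

print(classify(np.array([2.0, 0.0])))   # y = 2.5 > 0   -> C1
print(classify(np.array([0.0, 1.0])))   # y = -1.5 <= 0 -> C2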
Linear Discriminant Analysis (LDA)
The total scatter (covariance) matrix of the data decomposes into within-class and between-class scatter: C = SW + SB.
Fisher’s linear discriminant
Fisher’s idea: choose the projection direction w so that the projected class means are well separated while the scatter within each projected class stays small.
Linear Discriminant Analysis (LDA)
The datasets are easy to separate into different classes (i.e. the classes are
discriminable) if SB /SW is large.
The projection of the data:
z = wT · x
We want to make the ratio of between-class to within-class scatter, J(w) = (wᵀ SB w) / (wᵀ SW w), maximal.
The directions w are the generalised eigenvectors of SW⁻¹ SB.
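A compact NumPy sketch along these lines: compute SW and SB from the iris data, take the leading eigenvectors of SW⁻¹ SB, and project. This is a sketch under those assumptions, not a numerically robust implementation:

import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
overall_mean = X.mean(axis=0)
d = X.shape[1]

S_W = np.zeros((d, d))       # within-class scatter
S_B = np.zeros((d, d))       # between-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    S_W += (Xc - mc).T @ (Xc - mc)
    diff = (mc - overall_mean).reshape(-1, 1)
    S_B += Xc.shape[0] * (diff @ diff.T)

# Generalised eigenvectors of S_W^{-1} S_B; keep the directions with largest eigenvalues
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:2]].real          # at most (number of classes - 1) = 2 useful directions

Z = X @ W                               # projected data z = w^T x, applied row-wise
print(Z[:3])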
Linear Discriminant Analysis (LDA)
Plot of the first two dimensions of the iris data showing the three classes before and after LDA has been applied. Only one dimension (y) is required to separate the classes after LDA has been applied.
Multiple classes
yi(x) = wiᵀx + wi0
Assign x to class Ci if yi(x) > yj(x) for all j ≠ i.