
UET

Since 2004

ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN


VNU-University of Engineering and Technology

INT3405 - Machine Learning


Lecture 4: Classification (P1)
Duc-Trong Le & Viet-Cuong Ta

Hanoi, 09/2023
Recap: Key Issues in Machine Learning
● What are good hypothesis spaces? (the part we choose)
○ Which spaces have been useful in practical applications and why?
● What algorithms can work with these spaces? (the part we optimize)
○ Are there general design principles for machine learning algorithms?
● How can we find the best hypothesis in an efficient way?
○ How to find the optimal solution efficiently (“optimization” question)
● How can we optimize accuracy on future data?
○ Known as the “overfitting” problem (i.e., “generalization” theory)
● How can we have confidence in the results?
○ How much training data is required to find an accurate hypothesis? (“statistical” question)
● Are some learning problems computationally intractable? (“computational” question)
● How can we formulate application problems as machine learning problems? (“engineering”
question)
Recap: Model Representation
[Figure: Training Set → Learning Algorithm → Hypothesis h; input x (size of house) → h → estimated house price y, with a fitted line in the x–y plot]

How do we represent h?
Linear regression with one variable:
“Univariate Linear Regression”

How do we choose the parameters?
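A minimal sketch of this model in code, assuming the standard univariate form hθ(x) = θ0 + θ1·x and the mean-squared-error cost (the exact equations are in the slide's figures):

```python
# Sketch of univariate linear regression (assumed form: h(x) = theta0 + theta1 * x).
def h(theta0, theta1, x):
    """Hypothesis: predicted house price for size x."""
    return theta0 + theta1 * x

def mse_cost(theta0, theta1, xs, ys):
    """Mean-squared-error cost J(theta0, theta1) over the training set."""
    m = len(xs)
    return sum((h(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
```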


Recap: Gradient Descent for Optimization



Recap: Gradient Descent Example

[Figure: the hypothesis for fixed parameters (a function of x), alongside the cost as a function of the parameters]

How fast does gradient descent converge to the global optimum?
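A minimal batch gradient-descent sketch for the univariate case above, assuming the usual simultaneous update with learning rate α; how fast it converges depends on α and on the shape of the cost surface:

```python
def gradient_descent(xs, ys, alpha=0.01, iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x with the MSE cost."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iters):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                              # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # dJ/dtheta1
        theta0 -= alpha * grad0                              # simultaneous update
        theta1 -= alpha * grad1
    return theta0, theta1
```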



Normal Equation (3)
● Matrix-vector formulation

● Analytical solution
Takes O(mn² + n³) time
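The analytical solution referred to here is presumably the standard normal equation θ = (XᵀX)⁻¹Xᵀy; a minimal NumPy sketch, assuming X is an m×n design matrix that already contains a bias column:

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least-squares solution theta = (X^T X)^{-1} X^T y.

    X: (m, n) design matrix (bias column assumed included), y: (m,) targets.
    Forming X^T X and solving the n x n system is the O(mn^2 + n^3) cost.
    np.linalg.solve is used instead of an explicit inverse for numerical stability.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```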



Outline
● Bayesian Learning
○ Bayes Theorem
○ MAP learning vs. MLE learning
● Probabilistic Generative Models
○ Naïve Bayes Classifier
● Discriminative Models
○ Logistic Regression
○ K-Nearest Neighbors



Bayes Theorem
● Bayes Theorem:  P(h|D) = P(D|h) · P(h) / P(D)
  (posterior = likelihood × prior / probability of the data)

Thomas Bayes (1702–1761)

○ P(h) = prior probability of hypothesis h


○ P(D) = prior probability of training data D
○ P(h|D) = conditional probability of h given D (Posterior)
○ P(D|h) = conditional probability of D given h (Likelihood)
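A toy numeric illustration of the theorem (the priors and likelihoods below are hypothetical, chosen only to show the arithmetic):

```python
# Hypothetical numbers: two hypotheses h1, h2 with assumed priors and likelihoods.
prior = {"h1": 0.3, "h2": 0.7}
likelihood = {"h1": 0.9, "h2": 0.2}   # P(D | h)

evidence = sum(likelihood[h] * prior[h] for h in prior)               # P(D)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}   # P(h | D)
print(posterior)   # h1 ≈ 0.66, h2 ≈ 0.34
```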



Maximum A Posteriori Learning (MAP)
● Maximum a posteriori (MAP) learning
○ Find the most probable hypothesis given the training data by maximizing the posterior probability.

The prior encodes knowledge/preference.
MAP Learning
● For each hypothesis h in H, calculate the posterior prob.

● Output the hypothesis h with the highest posterior prob.

● Comments:
○ Computationally intensive
○ Gives a standard for judging the performance of learning algorithms
○ Choosing P(h) reflects our prior knowledge about the learning task



Maximum-Likelihood Estimation (MLE)

● Maximum Likelihood Estimation (MLE) learning


○ Assume each hypothesis is equally probable a priori

○ Maximize the likelihood of the training data
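A toy sketch contrasting the two criteria on a hypothetical coin-flip task (the hypotheses, data, and prior below are made up for illustration):

```python
# Hypothetical example: estimate a coin's head-probability from observed flips.
data = ["H", "H", "H", "T"]                     # observed flips
hypotheses = [0.3, 0.5, 0.7]                    # candidate values of P(heads)
prior = {0.3: 0.2, 0.5: 0.6, 0.7: 0.2}          # assumed prior, favouring a fair coin

def likelihood(p, flips):
    """P(D | h): product of per-flip probabilities."""
    out = 1.0
    for f in flips:
        out *= p if f == "H" else (1 - p)
    return out

h_mle = max(hypotheses, key=lambda p: likelihood(p, data))             # argmax P(D|h)
h_map = max(hypotheses, key=lambda p: likelihood(p, data) * prior[p])  # argmax P(D|h) P(h)
print(h_mle, h_map)   # 0.7 0.5 — MLE follows the data alone, MAP is pulled toward the prior
```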



Relationship between MLE Learning
and Least-Squared Error Learning (1)
● Consider

● Assume

● We want to learn the parameters of f(x)


● Linear regression minimizes the MSE objective (cost function)
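The derivation behind this relationship is standard; a sketch, assuming additive Gaussian noise y_i = f_w(x_i) + ε_i with ε_i ~ N(0, σ²) (the symbols f_w and σ are assumed here, not taken from the extracted text):

```latex
% Sketch: MLE under Gaussian noise reduces to least squares.
\begin{align*}
p(y_i \mid x_i, w) &= \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\Big(-\frac{(y_i - f_w(x_i))^2}{2\sigma^2}\Big) \\
\log \prod_{i=1}^{m} p(y_i \mid x_i, w)
  &= -\frac{1}{2\sigma^2}\sum_{i=1}^{m} \big(y_i - f_w(x_i)\big)^2 + \text{const} \\
\hat{w}_{\mathrm{MLE}} &= \arg\max_w \log p(D \mid w)
  = \arg\min_w \sum_{i=1}^{m} \big(y_i - f_w(x_i)\big)^2
\end{align*}
```

So under this noise model, MLE learning and least-squared-error learning pick the same parameters.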



Relationship between MLE Learning
and Least-Squared Error Learning (2)



Probabilistic Generative Models (1)
• Classify instance x into one of K classes

p(Ck | x) ∝ p(x | Ck) · p(Ck), where p(x | Ck) is the density function for class Ck and p(Ck) is the class prior



Probabilistic Generative Models (2)
• Classification decision

• The key is to estimate the parameters



Probabilistic Generative Models (3)
● Given training data
● We have closed-form solutions:
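The closed-form solutions themselves are given as equations on the slide (not reproduced in this text); a sketch of the usual maximum-likelihood estimates, assuming Gaussian class-conditional densities with a shared covariance matrix:

```python
import numpy as np

def fit_gaussian_generative(X, y, num_classes):
    """MLE estimates for a generative model with Gaussian class-conditionals.

    X: (m, d) data, y: (m,) integer labels in {0, ..., num_classes-1}.
    Returns class priors, per-class means, and a shared covariance matrix.
    """
    m, d = X.shape
    priors = np.array([(y == k).mean() for k in range(num_classes)])        # P(C_k)
    means = np.array([X[y == k].mean(axis=0) for k in range(num_classes)])  # mu_k
    cov = np.zeros((d, d))
    for k in range(num_classes):
        diff = X[y == k] - means[k]
        cov += diff.T @ diff
    cov /= m                                                                # shared Sigma
    return priors, means, cov
```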



Probabilistic Generative Models (4)

[Figure: class-conditional densities and the resulting posterior probability]



Curse of Dimensionality
● One challenge of learning with high-dimensional data is having too few data samples
● Suppose 5 samples is considered enough in 1-D; covering a d-dimensional space at the same density takes 5^d points:
– 1D : 5 points
– 2D : 25 points
– 3D : 125 points
– 10D : 9,765,625 points



Naïve Bayes Classifier (1)
• Hard to estimate the class-conditional density p(x | Ck) for high-dimensional data x
• Conditional independence assumption
• All attributes are conditionally independent given the class
• Naïve Bayes approximation: the joint density factorizes into per-dimension (1-D) distributions



Naïve Bayes Classifier (2)
● Text categorization
x: the word histogram of a document
● Bag of words assumption:
○ Assume position doesn’t matter
● Conditional independence: p(x | Ck) factorizes over words, with each word's factor raised to the number of times that word occurs in document x



Parameter Estimation
● Learning by Maximum Likelihood Estimation
○ Simply count the frequencies in the data

○ Create a mega-document for topic k by concatenating all the docs in this topic
○ Compute frequency of w in the mega-document



Problem with Maximum Likelihood
● What if a new word (e.g., a word newly coined on the internet) appears in a test document but never appears in the training data? Its maximum-likelihood probability is zero, which zeroes out the whole product.

● Smoothing
○ Avoid zero probabilities
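To make the preceding description concrete, here is a compact sketch of a bag-of-words Naïve Bayes classifier with add-one (Laplace) smoothing; the function names and data layout are illustrative, not from the lecture:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """docs: list of token lists; labels: list of class ids.
    Returns log class priors and per-class log word probabilities (add-one smoothed)."""
    classes = set(labels)
    vocab = {w for d in docs for w in d}
    log_prior, log_pw = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)        # "mega-document" counts
        total = sum(counts.values())
        log_pw[c] = {w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab}
    return log_prior, log_pw, vocab

def classify_nb(doc, log_prior, log_pw, vocab):
    """argmax_c log P(c) + sum over word occurrences of log P(w | c); unseen words are skipped."""
    scores = {c: log_prior[c] + sum(log_pw[c][w] for w in doc if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)
```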



Naïve Bayes Classifier (3)
• Bad approximation
• Good classification accuracy
[Figure: text categorization results on 20 Newsgroups]



Naïve Bayes Classifier (4)

Naïve Bayes Classifier:



Example: “Play Tennis” (1)
● Based on the examples in the table, classify the following datum x:
x = (Outl=Sunny, Temp=Cool, Hum=High, Wind=Strong)



Example: “Play Tennis” (2)



The Independence Assumption
● Makes computation possible
● Yields optimal classifiers when satisfied
● Fairly good empirical results
● But is seldom satisfied in practice, as attributes (variables) are
often correlated
● Attempts to overcome this limitation:
○ Bayesian networks, which combine Bayesian reasoning with causal relationships
between attributes



Decision Boundary of Naïve Bayes (1)
● Consider text categorization of two classes
● The ratio of the two posteriors determines the decision

Linear decision boundary


Decision Boundary of Naïve Bayes (2)
● Consider two-class classification
● Gaussian density function
● Shared covariance matrix

Linear decision boundary
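A sketch of why this boundary is linear, using the Gaussian densities and shared covariance Σ just mentioned (μ1, μ2, w, b are the usual symbols, assumed here rather than taken from the slide):

```latex
% Log-odds for two Gaussian classes with shared covariance Sigma.
\begin{align*}
\log \frac{p(C_1 \mid x)}{p(C_2 \mid x)}
  &= \log \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_2)\,p(C_2)} \\
  &= (\mu_1 - \mu_2)^{\top} \Sigma^{-1} x
     - \tfrac{1}{2}\mu_1^{\top}\Sigma^{-1}\mu_1
     + \tfrac{1}{2}\mu_2^{\top}\Sigma^{-1}\mu_2
     + \log \frac{p(C_1)}{p(C_2)} \\
  &= w^{\top} x + b
  \qquad \text{(the quadratic terms } x^{\top}\Sigma^{-1}x \text{ cancel)}
\end{align*}
```

Setting the log-odds to zero therefore gives a hyperplane, i.e., a linear decision boundary.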



Decision Boundary
• Generative models essentially create linear decision boundaries
• Why not directly model the linear decision boundary?

Outline
● Bayesian Learning
○ Bayes Theorem
○ MAP learning vs. MLE learning
● Probabilistic Generative Models
○ Naïve Bayes Classifier
● Discriminative Models
○ Logistic Regression
○ K-Nearest Neighbors



Discriminative Models: Logistic Regression
• Generative models often lead to a linear decision boundary
• Linear discriminative model
• Directly model the linear decision boundary

• w is the parameter vector to be learned



Logistic Regression



Logistic Sigmoid Function
● The logistic (sigmoid) function: σ(z) = 1 / (1 + e^(−z))



Logistic Regression
• Given training data
• Likelihood function (or the Log-Likelihood)

• Learn parameter w by Maximum Likelihood Estimation (MLE)



Convex Objective Functions

[Figure: convex loss functions, plotted for y = 1 and for y = −1]



Logistic Regression
• Convex objective function with a global optimum

• No closed-form solution
• Gradient Descent

(the gradient term involves the classification error)
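A minimal sketch of batch gradient descent for logistic regression, assuming labels y ∈ {+1, −1} as in the examples that follow; the learning rate and function names are illustrative, not the course's reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, alpha=0.1, iters=1000):
    """X: (m, d) features (include a constant column for a bias term), y: (m,) labels in {+1, -1}.
    Maximizes the average log-likelihood (1/m) sum_i log sigmoid(y_i * w^T x_i) by gradient ascent."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ w)                       # y_i * w^T x_i
        grad = X.T @ (y * sigmoid(-margins)) / m    # gradient of the average log-likelihood
        w += alpha * grad                           # ascent step (no closed-form solution)
    return w

def predict(X, w):
    return np.where(X @ w >= 0, 1, -1)
```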



Example: Heart Disease (1)
• Input feature x: age group id
  1: 25-29    2: 30-34    3: 35-39    4: 40-44
  5: 45-49    6: 50-54    7: 55-59    8: 60-64
• Output y: whether the person has heart disease
  • y = 1: has heart disease
  • y = -1: no heart disease



Example: Heart Disease (2)



Example: Text Categorization (1)
● Learn to classify text into two categories
● Input d:
• a document, represented by a word histogram
● Output y = ±1:
• +1 for a political document
• -1 for a non-political document



Example: Text Categorization (2)
• Training data



Example: Text Categorization (3)

• Dataset: Reuters-21578
• Classification accuracy
• Naïve Bayes: 77%
• Logistic regression: 88%



Multi-class Logistic Regression
• How do we extend the logistic regression model to multi-class classification?



Conditional Exponential Model (1)
• Consider K classes
• Define

• where Z(x) is the normalization factor (the partition function)

• Need to learn the class weight vectors (the w's)
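A minimal sketch of this posterior as a softmax over K linear scores, with Z(x) as the normalizer; the matrix layout and the stability trick are illustrative details, not from the slide:

```python
import numpy as np

def softmax_posterior(x, W):
    """P(y = k | x) = exp(w_k^T x) / Z(x), with Z(x) = sum_j exp(w_j^T x).

    x: (d,) feature vector, W: (K, d) one weight vector per class."""
    scores = W @ x
    scores -= scores.max()                 # subtract max for numerical stability (ratios unchanged)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # the denominator plays the role of Z(x)
```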



Conditional Exponential Model (2)
• Learn weights w’s by maximum likelihood estimation

• Modified Conditional Exponential Model



Logistic Regression versus Naïve Bayes
• Both produce linear decision boundaries
• Naïve Bayes:

• Logistic regression: learn weights by MLE

•Both can be viewed as modeling p(x|y)


• Naïve Bayes: independence assumption
• Logistic regression: assume an exponential family distribution for
p(x|y) (a broad assumption)
K-Nearest Neighbor
Main idea: classify a data point based on how its neighbors are classified



K-Nearest Neighbor

Complexity: O(ndk) (n training points, d dimensions, k neighbors)
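A brute-force sketch of the classifier (distances to all n training points in d dimensions, then a vote among the k nearest, which is what the quoted cost reflects); names are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```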
Discriminative versus Generative
Discriminative Models
● Model P(y|x) directly
Pros
● Usually better performance (with small training data)
● Robust to noisy data
Cons
● Slow convergence (e.g., LR by gradient descent)
● Expensive computation

Generative Models
• Model P(x|y) directly
Pros
• Usually fast convergence
• Cheap computation (easier to learn, e.g., NB)
Cons
• Sensitive to noisy data
• Usually performs worse (with small training data)



Summary
● Bayesian Learning
○ Bayes Theorem
○ MAP learning vs. MLE learning
● Probabilistic Generative Models
○ Naïve Bayes Classifier
● Discriminative Models
○ Logistic Regression
○ K-Nearest Neighbors



UET
Since 2004

ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN


VNU-University of Engineering and Technology

Thank you
Email me
[email protected]
