AI notes Week 11
Machine Learning
CS-412
Week-11-Fall 2024
Why “Learn” ?
■ Machine learning is programming computers to optimize
a performance criterion using example data or past
experience.
■ There is no need to “learn” to calculate payroll
■ Learning is used when:
◻ Human expertise does not exist (navigating on Mars),
◻ Humans are unable to explain their expertise (speech
recognition)
◻ Solution changes in time (routing on a computer network)
◻ Solution needs to be adapted to particular cases (user
biometrics)
2
What We Talk About When We
Talk About “Learning”
■ Learning general models from data of particular
examples
■ Data is cheap and abundant (data warehouses, data
marts); knowledge is expensive and scarce.
■ Example in retail: Customer transactions to consumer
behavior:
People who bought “product X” also bought “product Y”
(www.amazon.com)
■ Build a model that is a good and useful approximation to
the data.
3
Data Mining
■ Retail: Market basket analysis, Customer relationship
management (CRM)
■ Finance: Credit scoring, fraud detection
■ Manufacturing: Optimization, troubleshooting
■ Medicine: Medical diagnosis
■ Telecommunications: Quality of service optimization
■ Bioinformatics: Motifs (protein sequence patterns),
alignment
■ Web mining: Search engines
■ ...
4
What is Machine Learning?
■ Optimize a performance criterion using example data or
past experience.
■ Role of Statistics: Inference from a sample
■ Role of Computer science: Efficient algorithms to
◻ Solve the optimization problem
◻ Represent and evaluate the model for inference
5
Applications
■ Association
■ Supervised Learning
◻ Classification
◻ Regression
■ Unsupervised Learning
■ Reinforcement Learning
6
Learning Associations
■ Basket analysis:
P (Y | X ) probability that somebody who buys X also
buys Y where X and Y are products/services.
7
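As a concrete illustration, P(Y | X) can be estimated from transaction data as the fraction of baskets containing X that also contain Y. A minimal Python sketch; the products and transactions are made up for illustration:

```python
# Hypothetical toy transactions; in practice these come from sales records.
transactions = [
    {"chips", "beer"},
    {"chips", "soda"},
    {"chips", "beer", "soda"},
    {"soda"},
]

def confidence(x, y, transactions):
    """Estimate P(Y | X): fraction of X-containing baskets that also contain Y."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y in t) / len(with_x)

print(confidence("chips", "beer", transactions))  # 2 of the 3 chips baskets contain beer
```

In association-rule terminology this estimate is the confidence of the rule X → Y.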
Classification
■ Example: Credit
scoring
■ Differentiating
between low-risk
and high-risk
customers from their
income and savings
8
Classification: Applications
■ Aka Pattern recognition
■ Face recognition: Pose, lighting, occlusion (glasses,
beard), make-up, hair style
■ Character recognition: Different handwriting styles.
■ Speech recognition: Temporal dependency.
◻ Use of a dictionary or the syntax of the language.
◻ Sensor fusion: Combine multiple modalities; eg, visual (lip
image) and acoustic for speech
■ Medical diagnosis: From symptoms to illnesses
■ ...
9
Face Recognition
Training examples of a person
Test images
10
Regression
■ Example: Price of a used car
■ x : car attributes
y : price
■ Linear model: y = wx + w0
■ In general: y = g(x | θ), where
g ( ) is the model and
θ its parameters
11
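The linear model y = wx + w0 can be fit by ordinary least squares. A minimal sketch with made-up car data (x = age in years, y = price in $1000s; the numbers are illustrative, not real):

```python
# Hypothetical car data: x = age in years, y = price in $1000s.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [19.0, 17.0, 15.0, 13.0, 11.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Closed-form least-squares solution: w = cov(x, y) / var(x), w0 = mean_y - w * mean_x
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
w0 = mean_y - w * mean_x
print(w, w0)  # these points lie exactly on a line: w = -2, w0 = 21
```

Here θ = (w, w0) are the parameters of the model g.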
Supervised Learning: Uses
■ Prediction of future cases: Use the rule to predict the
output for future inputs
■ Knowledge extraction: The rule is easy to understand
■ Compression: The rule is simpler than the data it
explains
■ Outlier detection: Exceptions that are not covered by the
rule, e.g., fraud
12
Unsupervised Learning
■ Learning “what normally happens”
■ No output
■ Clustering: Grouping similar instances
■ Example applications
◻ Customer segmentation in CRM
◻ Image compression: Color quantization
◻ Bioinformatics: Learning motifs
13
Reinforcement Learning
■ The “reinforcement” in reinforcement learning refers to
how certain behaviors are encouraged, and others
discouraged.
■ Behaviors are reinforced through rewards which are
gained through experiences with the environment.
■ Learning a policy: A sequence of outputs
■ Credit assignment problem
■ Game playing
■ Robot in a maze
14
Resources: Datasets
■ UCI Repository:
http://www.ics.uci.edu/~mlearn/MLRepository.html
■ UCI KDD Archive:
http://kdd.ics.uci.edu/summary.data.application.html
■ Statlib: http://lib.stat.cmu.edu/
■ Delve: http://www.cs.utoronto.ca/~delve/
15
Resources: Journals
■ Journal of Machine Learning Research www.jmlr.org
■ Machine Learning
■ Neural Computation
■ Neural Networks
■ IEEE Transactions on Neural Networks
■ IEEE Transactions on Pattern Analysis and Machine
Intelligence
■ Annals of Statistics
■ Journal of the American Statistical Association
■ ...
16
Resources: Conferences
■ International Conference on Machine Learning (ICML)
◻ ICML05: http://icml.ais.fraunhofer.de/
■ European Conference on Machine Learning (ECML)
◻ ECML05: http://ecmlpkdd05.liacc.up.pt/
■ Neural Information Processing Systems (NIPS)
◻ NIPS05: http://nips.cc/
■ Uncertainty in Artificial Intelligence (UAI)
◻ UAI05: http://www.cs.toronto.edu/uai2005/
■ Computational Learning Theory (COLT)
◻ COLT05: http://learningtheory.org/colt2005/
■ International Joint Conference on Artificial Intelligence (IJCAI)
◻ IJCAI05: http://ijcai05.csd.abdn.ac.uk/
■ International Conference on Artificial Neural Networks (ICANN)
◻ ICANN05: http://www.ibspan.waw.pl/ICANN-2005/
■ ...
17
Supervised Learning
An example application
■ An emergency room in a hospital measures 17
variables (e.g., blood pressure, age, etc) of newly
admitted patients.
■ A decision is needed: whether to put a new patient
in an intensive-care unit.
■ Due to the high cost of ICU, those patients who
may survive less than a month are given higher
priority.
■ Problem: to predict high-risk patients and
discriminate them from low-risk patients.
19
Another application
■ A credit card company receives thousands of
applications for new cards. Each application
contains information about an applicant,
◻ age
◻ Marital status
◻ annual salary
◻ outstanding debts
◻ credit rating
◻ etc.
■ Problem: to decide whether an application should
be approved, or to classify applications into two
categories, approved and not approved.
20
Machine learning and our focus
■ Like human learning from past experiences.
■ A computer does not have “experiences”.
■ A computer system learns from data, which
represent some “past experiences” of an
application domain.
■ Our focus: learn a target function that can be used
to predict the values of a discrete class attribute,
e.g., approve or not-approved, and high-risk or low
risk.
■ The task is commonly called: Supervised learning,
classification, or inductive learning.
21
The data and the goal
■ Data: A set of data records (also called examples,
instances or cases) described by
◻ k attributes: A1, A2, … Ak.
◻ a class: Each example is labelled with a pre-defined class.
22
An example: data (loan application)
Approved or not
23
An example: the learning task
■ Learn a classification model from the data
■ Use the model to classify future loan applications
into
◻ Yes (approved) and
◻ No (not approved)
■ What is the class for following case/instance?
24
Supervised vs. unsupervised Learning
■ Supervised learning: classification is seen as supervised
learning from examples.
◻ Supervision: The data (observations, measurements, etc.) are
labeled with pre-defined classes, as if a “teacher” had given
the classes (hence the supervision).
◻ Test data are classified into these classes too.
■ Unsupervised learning (clustering)
◻ Class labels of the data are unknown
◻ Given a set of data, the task is to establish the existence of
classes or clusters in the data
25
Supervised learning process: two
steps
■ Learning (training): Learn a model using the
training data
■ Testing: Test the model using unseen test
data to assess the model accuracy
26
What do we mean by learning?
■ Given
◻ a data set D,
◻ a task T, and
◻ a performance measure M,
a computer system is said to learn from D to perform the
task T if after learning the system’s performance on T
improves as measured by M.
■ In other words, the learned model helps the system to
perform T better as compared to no learning.
27
An example
■ Data: Loan application data
■ Task: Predict whether a loan should be approved or not.
■ Performance measure: accuracy.
28
Fundamental assumption of learning
Assumption: The distribution of training examples is
identical to the distribution of test examples (including
future unseen examples).
29
Introduction
■ Decision tree learning is one of the most widely used
techniques for classification.
◻ Its classification accuracy is competitive with other methods,
and
◻ it is very efficient.
30
The loan data
Approved or not
31
A decision tree from the loan data
■ Decision nodes and leaf nodes (classes)
32
Use the decision tree
No
33
Is the decision tree unique?
■ No. Here is a simpler tree.
■ We want a tree that is both small and accurate.
■ Smaller trees are easier to understand and tend to perform better.
34
From a decision tree to a set of rules
■ A decision tree can
be converted to a
set of rules
■ Each path from the
root to a leaf is a
rule.
35
Algorithm for decision tree learning
■ Basic algorithm (a greedy divide-and-conquer algorithm)
◻ Assume attributes are categorical now (continuous attributes
can be handled too)
◻ Tree is constructed in a top-down recursive manner
◻ At start, all the training examples are at the root
◻ Examples are partitioned recursively based on selected
attributes
◻ Attributes are selected on the basis of an impurity function (e.g.,
information gain)
■ Conditions for stopping partitioning
◻ All examples for a given node belong to the same class
◻ There are no remaining attributes for further partitioning –
majority class is the leaf
◻ There are no examples left
36
Decision tree learning algorithm
37
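The basic greedy, top-down algorithm above can be sketched in a few lines. This is a simplified ID3-style illustration rather than C4.5 itself, and the loan-like data set, attribute names and labels are all hypothetical:

```python
from collections import Counter
import math

def entropy(labels):
    """Impurity of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute with maximum information gain."""
    base = entropy(labels)
    def gain(a):
        rem = 0.0
        for v in set(r[a] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[a] == v]
            rem += len(subset) / len(labels) * entropy(subset)
        return base - rem
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    # Stopping conditions: pure node, or no attributes left -> majority-class leaf
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    tree = {a: {}}
    for v in set(r[a] for r in rows):
        sub = [(r, l) for r, l in zip(rows, labels) if r[a] == v]
        tree[a][v] = build_tree([r for r, _ in sub], [l for _, l in sub],
                                [x for x in attributes if x != a])
    return tree

# Hypothetical loan-like data: own_house / has_job -> approve?
rows = [
    {"own_house": "yes", "has_job": "no"},
    {"own_house": "yes", "has_job": "no"},
    {"own_house": "no",  "has_job": "yes"},
    {"own_house": "no",  "has_job": "no"},
    {"own_house": "no",  "has_job": "no"},
]
labels = ["Yes", "Yes", "Yes", "No", "No"]
tree = build_tree(rows, labels, ["own_house", "has_job"])
print(tree)
```

The recursion mirrors the slide: partition on the best attribute, then repeat on each subset until a stopping condition holds.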
Choose an attribute to partition data
■ The key to building a decision tree is deciding which
attribute to branch on.
■ The objective is to reduce impurity or uncertainty in data
as much as possible.
◻ A subset of data is pure if all instances belong to the same class.
■ The heuristic in C4.5 is to choose the attribute with the
maximum Information Gain or Gain Ratio based on
information theory.
38
The loan data (reproduced)
Approved or not
39
Two possible roots, which is better?
40
Information theory
■ Information theory provides a mathematical
basis for measuring the information content.
■ To understand the notion of information, think
about it as providing the answer to a question,
for example, whether a coin will come up heads.
◻ If one already has a good guess about the answer,
then the actual answer is less informative.
◻ If one already knows that the coin is rigged so that it
will come up heads with probability 0.99, then a
message (advance information) about the actual
outcome of a flip is worth less than it would be for an
honest coin (50-50).
41
Information theory (cont …)
■ For a fair (honest) coin, you have no
information, and you are willing to pay more
(say in terms of $) for advance information -
the less you know, the more valuable the
information.
■ Information theory uses this same intuition,
but instead of measuring the value for
information in dollars, it measures information
contents in bits.
■ One bit of information is enough to answer a
yes/no question about which one has no idea,
such as the flip of a fair coin
42
Information theory: Entropy measure
■ The entropy formula:
entropy(D) = − Σ_j Pr(c_j) log2 Pr(c_j)
where Pr(c_j) is the proportion of class c_j in data set D.
43
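The entropy of a data set depends only on its class proportions. A small sketch to get a feel for the measure; the example distributions are illustrative:

```python
import math

def entropy(probs):
    """entropy(D) = -sum_j Pr(c_j) * log2 Pr(c_j); 0 * log 0 is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0: maximum impurity for two classes
print(entropy([1.0, 0.0]))    # 0.0: a pure data set
print(entropy([0.99, 0.01]))  # close to 0: almost pure
```

Entropy is largest when the classes are evenly mixed and zero when the data is pure, which is exactly the behavior an impurity function needs.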
Entropy measure: let us get a
feeling
45
Information gain (cont …)
■ Information gained by selecting attribute Ai to
branch or to partition the data is
gain(D, Ai) = entropy(D) − Σ_j (|Dj| / |D|) × entropy(Dj)
where D1, …, Dv are the subsets of D induced by the v values of Ai.
46
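The gain is the entropy of the whole data set minus the weighted entropy of the partitions the attribute induces. A minimal self-contained sketch; the label lists are made up:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    counts = {}
    for l in labels:
        counts[l] = counts.get(l, 0) + 1
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def info_gain(partitions):
    """gain = entropy(D) - sum_j |Dj|/|D| * entropy(Dj),
    where `partitions` lists the label multisets Dj induced by an attribute."""
    all_labels = [l for part in partitions for l in part]
    total = len(all_labels)
    remainder = sum(len(part) / total * entropy(part) for part in partitions)
    return entropy(all_labels) - remainder

# An attribute splitting 3 Yes / 3 No into two pure parts: maximal gain
print(info_gain([["Yes", "Yes", "Yes"], ["No", "No", "No"]]))  # 1.0
```

An attribute whose partitions are as mixed as the original data gives a gain of 0, so the attribute with the highest gain is the most informative branch.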
An example
47
We build the final tree
48
Handling continuous attributes
■ Handle continuous attribute by splitting into two intervals
(can be more) at each node.
■ How to find the best threshold to divide?
◻ Use information gain or gain ratio again
◻ Sort all the values of a continuous attribute in increasing order
{v1, v2, …, vr}.
◻ A possible threshold lies between each pair of adjacent values vi and vi+1.
Try all possible thresholds and find the one that maximizes the
gain (or gain ratio).
49
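The threshold search above can be sketched as follows; the 'age' values and class labels are a made-up example:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    counts = {}
    for l in labels:
        counts[l] = counts.get(l, 0) + 1
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def best_threshold(values, labels):
    """Try a threshold at the midpoint of each pair of adjacent sorted values;
    return the (threshold, gain) pair with maximum information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no threshold fits between equal values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(labels)
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical 'age' attribute: under-30 applicants rejected in this toy sample
t, g = best_threshold([22, 25, 28, 35, 40, 50],
                      ["No", "No", "No", "Yes", "Yes", "Yes"])
print(t, g)  # the split between 28 and 35 separates the classes perfectly
```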
An example in a continuous space
50
Avoid overfitting in classification
■ Overfitting: A tree may overfit the training data
◻ Good accuracy on training data but poor on test data
◻ Symptoms: tree too deep and too many branches,
some may reflect anomalies due to noise or outliers
■ Two approaches to avoid overfitting
◻ Pre-pruning: Halt tree construction early
■ Difficult to decide because we do not know what may happen
subsequently if we keep growing the tree.
◻ Post-pruning: Remove branches or sub-trees from a
“fully grown” tree.
■ This method is commonly used. C4.5 uses a statistical method to
estimate the error at each node for pruning.
■ A validation set may be used for pruning as well.
51
An example
Likely to overfit the data
52
Other issues in decision tree
learning
■ From tree to rules, and rule pruning
■ Handling of missing values
■ Handling skewed distributions
■ Handling attributes and classes with different costs.
■ Attribute construction
■ Etc.
53
Evaluating classification methods
■ Predictive accuracy
■ Efficiency
◻ time to construct the model
◻ time to use the model
■ Robustness: handling noise and missing values
■ Scalability: efficiency in disk-resident databases
■ Interpretability:
◻ understandability of, and insight provided by, the model
■ Compactness of the model: size of the tree, or the
number of rules.
54
Evaluation methods
■ Holdout set: The available data set D is divided into
two disjoint subsets,
◻ the training set Dtrain (for learning a model)
◻ the test set Dtest (for testing the model)
■ Important: training set should not be used in testing
and the test set should not be used in learning.
◻ The unseen test set provides an unbiased estimate of accuracy.
■ The test set is also called the holdout set. (the
examples in the original data set D are all labeled
with classes.)
■ This method is mainly used when the data set D is
large.
55
Evaluation methods (cont…)
■ n-fold cross-validation: The available data is
partitioned into n equal-size disjoint subsets.
■ Use each subset as the test set and combine the rest
n-1 subsets as the training set to learn a classifier.
■ The procedure is run n times, which gives n
accuracies.
■ The final estimated accuracy of learning is the
average of the n accuracies.
■ 10-fold and 5-fold cross-validations are commonly
used.
■ This method is used when the available data is not
large.
56
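The n-fold procedure can be sketched generically: partition the data, then train and test once per fold. The learner below is a deliberately trivial majority-class predictor, and the data is synthetic:

```python
def cross_validate(data, n, train_and_test):
    """Split `data` into n disjoint folds; each fold serves once as the test
    set while the remaining n-1 folds form the training set. Returns the n
    accuracies produced by `train_and_test(train, test)` and their mean."""
    folds = [data[i::n] for i in range(n)]  # simple round-robin partition
    accuracies = []
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        accuracies.append(train_and_test(train, test))
    return accuracies, sum(accuracies) / n

def majority_learner(train, test):
    """Dummy learner: always predict the majority class of the training set."""
    yes = sum(1 for _, label in train if label == "Yes")
    pred = "Yes" if yes >= len(train) - yes else "No"
    return sum(1 for _, label in test if label == pred) / len(test)

# Synthetic data: two-thirds "Yes", one-third "No"
data = [(i, "Yes" if i % 3 else "No") for i in range(12)]
accs, mean_acc = cross_validate(data, 4, majority_learner)
print(accs, mean_acc)
```

In practice the data should be shuffled (often stratified by class) before partitioning; the round-robin split here is kept minimal for clarity.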
Evaluation methods (cont…)
■ Leave-one-out cross-validation: This method is used
when the data set is very small.
■ It is a special case of cross-validation
■ Each fold of the cross validation has only a single test
example and all the rest of the data is used in training.
■ If the original data has m examples, this is m-fold
cross-validation
57
Evaluation methods (cont…)
■ Validation set: the available data is divided into
three subsets,
◻ a training set,
◻ a validation set and
◻ a test set.
■ A validation set is used frequently for estimating
parameters in learning algorithms.
■ In such cases, the values that give the best
accuracy on the validation set are used as the final
parameter values.
■ Cross-validation can be used for parameter
estimating as well.
58
Classification measures
■ Accuracy is only one measure (error = 1-accuracy).
■ Accuracy is not suitable in some applications.
■ In text mining, we may only be interested in the
documents of a particular topic, which are only a
small portion of a big document collection.
■ In classification involving skewed or highly
imbalanced data, e.g., network intrusion and
financial fraud detections, we are interested only in
the minority class.
◻ High accuracy does not mean any intrusion is detected.
◻ E.g., with 1% intrusions, a classifier that detects
nothing still achieves 99% accuracy.
■ The class of interest is commonly called the
positive class, and the rest negative classes.
59
Precision and recall measures
■ Used in information retrieval and text classification.
■ We use a confusion matrix to introduce them.
60
Precision and recall measures (cont…)
p = TP / (TP + FP)    r = TP / (TP + FN)
where TP, FP and FN are the confusion-matrix counts for the positive class.
62
F1-value (also called F1-score)
■ It is hard to compare two classifiers using two measures. The F1
score combines precision and recall into one measure:
F1 = 2pr / (p + r), the harmonic mean of precision (p) and recall (r).
63
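Putting the three measures together, a small sketch computing precision p = TP/(TP+FP), recall r = TP/(TP+FN) and F1 = 2pr/(p+r) from hypothetical confusion-matrix counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute p = TP/(TP+FP), r = TP/(TP+FN) and F1 = 2pr/(p+r)
    from confusion-matrix counts for the positive class."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(p, r, f1)  # 0.8, 0.666..., F1 as their harmonic mean (~0.727)
```

Because F1 is a harmonic mean, it is dragged down by whichever of precision or recall is lower, so a classifier cannot score well by optimizing only one of them.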
Receiver operating characteristic (ROC) curve
64
Sensitivity and Specificity
■ In statistics, there are two other evaluation measures:
◻ Sensitivity: Same as TPR
◻ Specificity: Also called True Negative Rate (TNR)
■ Then we have
TPR = sensitivity = TP / (TP + FN)
FPR = 1 − specificity = FP / (FP + TN)
65
Example ROC curves
66
Area under the curve (AUC)
■ Which classifier is better, C1 or C2?
◻ It depends on which region you talk about.
■ Can we have one measure?
◻ Yes, we compute the area under the curve (AUC)
■ If the AUC of Ci is greater than that of Cj, Ci is said to be
better than Cj.
◻ If a classifier is perfect, its AUC value is 1.
◻ If a classifier makes purely random guesses, its AUC value is 0.5.
67
Drawing an ROC curve
68
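One common way to draw the curve is to sort the examples by decreasing classifier score and sweep the threshold down, emitting one (FPR, TPR) point per example; the AUC then follows from the trapezoid rule. A sketch on made-up scores and labels (tied scores are not handled here):

```python
def roc_points(scores, labels):
    """Sort examples by decreasing score; sweeping the decision threshold
    down gives one (FPR, TPR) point per example, starting at (0, 0)."""
    pos = sum(labels)              # labels are 1 (positive) or 0 (negative)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical scores: positives (label 1) mostly ranked above negatives
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5]
labels = [1, 1, 0, 1, 0, 0]
pts = roc_points(scores, labels)
print(auc(pts))  # one of nine positive-negative pairs is misranked
```

The result equals the fraction of positive-negative pairs the classifier ranks correctly, which is another common reading of AUC.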