AI notes Week 11

The document discusses machine learning, emphasizing its role in optimizing performance criteria using data, especially in scenarios where human expertise is lacking or difficult to articulate. It covers various learning types, including supervised, unsupervised, and reinforcement learning, along with their applications in fields like retail, finance, and medicine. Additionally, it highlights decision tree learning as a popular classification method and introduces concepts like information gain and entropy for building effective models.


Artificial Intelligence

Machine Learning
CS-412
Week-11-Fall 2024
Why “Learn”?
■ Machine learning is programming computers to optimize
a performance criterion using example data or past
experience.
■ There is no need to “learn” to calculate payroll
■ Learning is used when:
◻ Human expertise does not exist (navigating on Mars),
◻ Humans are unable to explain their expertise (speech
recognition)
◻ Solution changes in time (routing on a computer network)
◻ Solution needs to be adapted to particular cases (user
biometrics)

2
What We Talk About When We
Talk About “Learning”
■ Learning general models from data of particular
examples
■ Data is cheap and abundant (data warehouses, data
marts); knowledge is expensive and scarce.
■ Example in retail: Customer transactions to consumer
behavior:
People who bought product X also bought product Y
(www.amazon.com)
■ Build a model that is a good and useful approximation to
the data.

3
Data Mining
■ Retail: Market basket analysis, Customer relationship
management (CRM)
■ Finance: Credit scoring, fraud detection
■ Manufacturing: Optimization, troubleshooting
■ Medicine: Medical diagnosis
■ Telecommunications: Quality of service optimization
■ Bioinformatics: Motifs (protein sequence patterns),
alignment
■ Web mining: Search engines
■ ...

4
What is Machine Learning?
■ Optimize a performance criterion using example data or
past experience.
■ Role of Statistics: Inference from a sample
■ Role of Computer science: Efficient algorithms to
◻ Solve the optimization problem
◻ Represent and evaluate the model for inference

5
Applications
■ Association
■ Supervised Learning
◻ Classification
◻ Regression

■ Unsupervised Learning
■ Reinforcement Learning

6
Learning Associations
■ Basket analysis:
P(Y | X): the probability that somebody who buys X also buys Y, where X and Y are products/services.

Example: P(milk | bread) = 0.7
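For illustration (not from the slides), a minimal Python sketch of how such a conditional probability could be estimated from transaction data; the baskets below are made-up examples:

    # Estimate P(Y | X): among baskets containing X, the fraction that also contain Y.
    def conditional_prob(transactions, x, y):
        with_x = [t for t in transactions if x in t]
        if not with_x:
            return 0.0
        return sum(1 for t in with_x if y in t) / len(with_x)

    # Hypothetical transactions, for illustration only.
    baskets = [{"bread", "milk"}, {"bread", "butter"},
               {"bread", "milk", "eggs"}, {"milk"}]
    print(conditional_prob(baskets, "bread", "milk"))  # P(milk | bread) ≈ 0.67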

7
Classification
■ Example: Credit
scoring
■ Differentiating
between low-risk
and high-risk
customers from their
income and savings

Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
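A minimal sketch of this discriminant as code; the threshold values below stand in for θ1 and θ2 and are hypothetical, not values from the slides:

    # Rule-based discriminant; in practice theta1 and theta2 are learned from data.
    THETA1 = 30_000   # hypothetical income threshold (θ1)
    THETA2 = 10_000   # hypothetical savings threshold (θ2)

    def credit_risk(income, savings):
        if income > THETA1 and savings > THETA2:
            return "low-risk"
        return "high-risk"

    print(credit_risk(45_000, 15_000))  # low-risk
    print(credit_risk(20_000, 5_000))   # high-risk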

8
Classification: Applications
■ Aka Pattern recognition
■ Face recognition: Pose, lighting, occlusion (glasses,
beard), make-up, hair style
■ Character recognition: Different handwriting styles.
■ Speech recognition: Temporal dependency.
◻ Use of a dictionary or the syntax of the language.
◻ Sensor fusion: Combine multiple modalities; e.g., visual (lip
image) and acoustic for speech
■ Medical diagnosis: From symptoms to illnesses
■ ...

9
Face Recognition
Training examples of a person

Test images

AT&T Laboratories, Cambridge UK


http://www.uk.research.att.com/facedatabase.html

10
Regression
■ Example: Price of a used car
■ x: car attributes
y: price
■ Model: y = g(x | θ)
g(): model, θ: parameters
■ Linear model example: y = wx + w0
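As a sketch of fitting the linear special case y = wx + w0 by least squares, assuming NumPy is available; the (x, y) pairs are made up and are not the slide's car data:

    import numpy as np

    # Hypothetical data: x = car age in years, y = price.
    x = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
    y = np.array([18000.0, 16500.0, 14000.0, 11000.0, 7000.0])

    # Least-squares fit of y = w*x + w0; here theta = (w, w0).
    w, w0 = np.polyfit(x, y, deg=1)
    print(f"g(x | theta): y = {w:.1f} * x + {w0:.1f}")
    print("predicted price at x = 4:", w * 4 + w0)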

11
Supervised Learning: Uses
■ Prediction of future cases: Use the rule to predict the
output for future inputs
■ Knowledge extraction: The rule is easy to understand
■ Compression: The rule is simpler than the data it
explains
■ Outlier detection: Exceptions that are not covered by the
rule, e.g., fraud

12
Unsupervised Learning
■ Learning “what normally happens”
■ No output
■ Clustering: Grouping similar instances
■ Example applications
◻ Customer segmentation in CRM
◻ Image compression: Color quantization
◻ Bioinformatics: Learning motifs

13
Reinforcement Learning
■ The “reinforcement” in reinforcement learning refers to
how certain behaviors are encouraged, and others
discouraged.
■ Behaviors are reinforced through rewards which are
gained through experiences with the environment.
■ Learning a policy: A sequence of outputs
■ Credit assignment problem
■ Game playing
■ Robot in a maze

14
Resources: Datasets
■ UCI Repository:
http://www.ics.uci.edu/~mlearn/MLRepository.html
■ UCI KDD Archive:
http://kdd.ics.uci.edu/summary.data.application.html
■ Statlib: http://lib.stat.cmu.edu/
■ Delve: http://www.cs.utoronto.ca/~delve/

15
Resources: Journals
■ Journal of Machine Learning Research www.jmlr.org
■ Machine Learning
■ Neural Computation
■ Neural Networks
■ IEEE Transactions on Neural Networks
■ IEEE Transactions on Pattern Analysis and Machine
Intelligence
■ Annals of Statistics
■ Journal of the American Statistical Association
■ ...
16
Resources: Conferences
■ International Conference on Machine Learning (ICML)
◻ ICML05: http://icml.ais.fraunhofer.de/
■ European Conference on Machine Learning (ECML)
◻ ECML05: http://ecmlpkdd05.liacc.up.pt/
■ Neural Information Processing Systems (NIPS)
◻ NIPS05: http://nips.cc/
■ Uncertainty in Artificial Intelligence (UAI)
◻ UAI05: http://www.cs.toronto.edu/uai2005/
■ Computational Learning Theory (COLT)
◻ COLT05: http://learningtheory.org/colt2005/
■ International Joint Conference on Artificial Intelligence (IJCAI)
◻ IJCAI05: http://ijcai05.csd.abdn.ac.uk/
■ International Conference on Neural Networks (Europe)
◻ ICANN05: http://www.ibspan.waw.pl/ICANN-2005/
■ ...

17
Supervised Learning
An example application
■ An emergency room in a hospital measures 17
variables (e.g., blood pressure, age, etc.) of newly
admitted patients.
■ A decision is needed: whether to put a new patient
in an intensive-care unit.
■ Due to the high cost of ICU, those patients who
may survive less than a month are given higher
priority.
■ Problem: to predict high-risk patients and
discriminate them from low-risk patients.

19
Another application
■ A credit card company receives thousands of
applications for new cards. Each application
contains information about an applicant,
◻ age
◻ marital status
◻ annual salary
◻ outstanding debts
◻ credit rating
◻ etc.
■ Problem: to decide whether an application should be
approved, i.e., to classify applications into two
categories: approved and not approved.

20
Machine learning and our focus
■ Like human learning from past experiences.
■ A computer does not have “experiences”.
■ A computer system learns from data, which
represent some “past experiences” of an
application domain.
■ Our focus: learn a target function that can be used
to predict the values of a discrete class attribute,
e.g., approved or not approved, and high-risk or low-risk.
■ The task is commonly called: Supervised learning,
classification, or inductive learning.

21
The data and the goal
■ Data: A set of data records (also called examples,
instances or cases) described by
◻ k attributes: A1, A2, … Ak.
◻ a class: Each example is labelled with a pre-defined class.

■ Goal: To learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.

22
An example: data (loan application)
Approved or not

23
An example: the learning task
■ Learn a classification model from the data
■ Use the model to classify future loan applications
into
◻ Yes (approved) and
◻ No (not approved)
■ What is the class for the following case/instance?

24
Supervised vs. unsupervised Learning
■ Supervised learning: classification is seen as supervised
learning from examples.
◻ Supervision: The data (observations, measurements, etc.) are
labeled with pre-defined classes. It is as if a “teacher” gives
the classes (supervision).
◻ Test data are classified into these classes too.
■ Unsupervised learning (clustering)
◻ Class labels of the data are unknown
◻ Given a set of data, the task is to establish the existence of
classes or clusters in the data

25
Supervised learning process: two
steps
■ Learning (training): Learn a model using the
training data
■ Testing: Test the model using unseen test
data to assess the model accuracy

26
What do we mean by learning?
■ Given
◻ a data set D,
◻ a task T, and
◻ a performance measure M,
a computer system is said to learn from D to perform the
task T if after learning the system’s performance on T
improves as measured by M.
■ In other words, the learned model helps the system to
perform T better as compared to no learning.

27
An example
■ Data: Loan application data
■ Task: Predict whether a loan should be approved or not.
■ Performance measure: accuracy.

■ No learning: classify all future applications (test data) to the majority class (i.e., Yes):
Accuracy = 9/15 = 60%.
■ We can do better than 60% with learning.

28
Fundamental assumption of learning
Assumption: The distribution of training examples is
identical to the distribution of test examples (including
future unseen examples).

■ In practice, this assumption is often violated to a certain degree.
■ Strong violations will clearly result in poor classification accuracy.
■ To achieve good accuracy on the test data, training
examples must be sufficiently representative of the test
data.

29
Introduction
■ Decision tree learning is one of the most widely used
techniques for classification.
◻ Its classification accuracy is competitive with other methods,
and
◻ it is very efficient.

■ The classification model is a tree, called a decision tree.
■ C4.5 by Ross Quinlan is perhaps the best-known system. It can be downloaded from the Web.

30
The loan data
Approved or not

31
A decision tree from the loan data
■ Decision nodes and leaf nodes (classes)

32
Use the decision tree

No

33
Is the decision tree unique?
■ No. Here is a simpler tree.
■ We want a small and accurate tree.
■ It is easier to understand and performs better.
■ Finding the best tree is NP-hard.
■ All current tree-building algorithms are heuristic algorithms.

34
From a decision tree to a set of rules
■ A decision tree can
be converted to a
set of rules
■ Each path from the
root to a leaf is a
rule.

35
Algorithm for decision tree learning
■ Basic algorithm (a greedy divide-and-conquer algorithm)
◻ Assume attributes are categorical now (continuous attributes
can be handled too)
◻ Tree is constructed in a top-down recursive manner
◻ At start, all the training examples are at the root
◻ Examples are partitioned recursively based on selected
attributes
◻ Attributes are selected on the basis of an impurity function (e.g.,
information gain)
■ Conditions for stopping partitioning
◻ All examples for a given node belong to the same class
◻ There are no remaining attributes for further partitioning –
majority class is the leaf
◻ There are no examples left

36
Decision tree learning algorithm

37
Choose an attribute to partition data
■ The key to building a decision tree is which attribute to choose in order to branch.
■ The objective is to reduce impurity or uncertainty in data
as much as possible.
◻ A subset of data is pure if all instances belong to the same class.
■ The heuristic in C4.5 is to choose the attribute with the
maximum Information Gain or Gain Ratio based on
information theory.

38
The loan data (reproduced)
Approved or not

39
Two possible roots, which is better?

■ Fig. (B) seems to be better.

40
Information theory
■ Information theory provides a mathematical
basis for measuring the information content.
■ To understand the notion of information, think
about it as providing the answer to a question,
for example, whether a coin will come up heads.
◻ If one already has a good guess about the answer,
then the actual answer is less informative.
◻ If one already knows that the coin is rigged so that it will come up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin (50-50).

41
Information theory (cont …)
■ For a fair (honest) coin, you have no information, and you are willing to pay more (say in terms of $) for advance information: the less you know, the more valuable the information.
■ Information theory uses this same intuition,
but instead of measuring the value for
information in dollars, it measures information
contents in bits.
■ One bit of information is enough to answer a
yes/no question about which one has no idea,
such as the flip of a fair coin.

42
Information theory: Entropy measure
■ The entropy formula:
entropy(D) = − Σ j=1..|C| Pr(cj) × log2 Pr(cj)
■ Pr(cj) is the probability of class cj in data set D, and the sum is over all classes.


■ We use entropy as a measure of impurity or
disorder of data set D. (Or, a measure of
information in a tree)

43
Entropy measure: let us get a feeling

■ As the data become purer and purer, the entropy value becomes smaller and smaller. This is useful to us!
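A small numeric sketch of the same point, assuming a two-class data set and computing the entropy at different purities:

    from math import log2

    def entropy(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    # Pure data (1.0, 0.0) has entropy 0; a 50/50 split has entropy 1 bit.
    for p in (1.0, 0.9, 0.7, 0.5):
        print(f"Pr = ({p:.1f}, {1 - p:.1f})  entropy = {entropy((p, 1 - p)):.3f}")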
44
Information gain
■ Given a set of examples D, we first compute its entropy:
entropy(D) = − Σ j=1..|C| Pr(cj) × log2 Pr(cj)
■ If we make attribute Ai, with v values, the root of the current tree, this will partition D into v subsets D1, D2, …, Dv. The expected entropy if Ai is used as the current root is:
entropyAi(D) = Σ j=1..v (|Dj| / |D|) × entropy(Dj)
45
Information gain (cont …)
■ The information gained by selecting attribute Ai to branch or to partition the data is
gain(D, Ai) = entropy(D) − entropyAi(D)
■ We choose the attribute with the highest gain to branch/split the current tree.
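A short sketch of these two formulas on a tiny made-up attribute (not the loan data); here D is given as (value of Ai, class) pairs:

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gain(pairs):
        # pairs: (value of attribute Ai, class label) for every example in D.
        labels = [c for _, c in pairs]
        groups = {}
        for v, c in pairs:
            groups.setdefault(v, []).append(c)
        expected = sum(len(g) / len(pairs) * entropy(g) for g in groups.values())
        return entropy(labels) - expected

    # Hypothetical attribute with values 'a'/'b' and classes Yes/No.
    D = [("a", "Yes"), ("a", "Yes"), ("a", "No"),
         ("b", "No"), ("b", "No"), ("b", "Yes")]
    print(round(gain(D), 3))  # gain(D, Ai)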

46
An example

■ Own_house is the best choice for the root.

47
We build the final tree

■ We can use information gain ratio to evaluate the impurity as well (see the handout).

48
Handling continuous attributes
■ Handle continuous attribute by splitting into two intervals
(can be more) at each node.
■ How to find the best threshold to divide?
◻ Use information gain or gain ratio again.
◻ Sort all the values of a continuous attribute in increasing order {v1, v2, …, vr}.
◻ A possible threshold lies between each pair of adjacent values vi and vi+1. Try all possible thresholds and find the one that maximizes the gain (or gain ratio), as in the sketch below.
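A minimal sketch of this threshold search, assuming one numeric attribute and using information gain; the values are made up:

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def best_threshold(values, labels):
        # Sort by the attribute; try a midpoint between each pair of adjacent distinct values.
        pairs = sorted(zip(values, labels))
        n, base = len(pairs), entropy(labels)
        best_t, best_gain = None, -1.0
        for i in range(n - 1):
            if pairs[i][0] == pairs[i + 1][0]:
                continue
            t = (pairs[i][0] + pairs[i + 1][0]) / 2
            left = [c for v, c in pairs if v <= t]
            right = [c for v, c in pairs if v > t]
            g = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
            if g > best_gain:
                best_t, best_gain = t, g
        return best_t, best_gain

    # Hypothetical ages and class labels.
    print(best_threshold([22, 25, 30, 35, 40, 50],
                         ["No", "No", "Yes", "Yes", "Yes", "No"]))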

49
An example in a continuous space

50
Avoid overfitting in classification
■ Overfitting: A tree may overfit the training data
◻ Good accuracy on training data but poor on test data
◻ Symptoms: tree too deep and too many branches,
some may reflect anomalies due to noise or outliers
■ Two approaches to avoid overfitting
◻ Pre-pruning: Halt tree construction early
■ Difficult to decide because we do not know what may happen
subsequently if we keep growing the tree.
◻ Post-pruning: Remove branches or sub-trees from a
“fully grown” tree.
■ This method is commonly used. C4.5 uses a statistical method to estimate the errors at each node for pruning.
■ A validation set may be used for pruning as well.
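C4.5's error-based pruning is not shown here, but as one concrete post-pruning example, scikit-learn's decision trees support cost-complexity pruning via the ccp_alpha parameter; the data set and the alpha values below are arbitrary choices for illustration:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # ccp_alpha=0 grows the full tree; a larger alpha prunes more aggressively.
    for alpha in (0.0, 0.01):
        tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
        print(f"alpha={alpha}: train acc={tree.score(X_tr, y_tr):.3f}, "
              f"test acc={tree.score(X_te, y_te):.3f}, leaves={tree.get_n_leaves()}")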

51
Likely to overfit the data
An example

52
Other issues in decision tree
learning
■ From tree to rules, and rule pruning
■ Handling of missing values
■ Handling skewed distributions
■ Handling attributes and classes with different costs.
■ Attribute construction
■ Etc.

53
Evaluating classification methods
■ Predictive accuracy

■ Efficiency
◻ time to construct the model
◻ time to use the model
■ Robustness: handling noise and missing values
■ Scalability: efficiency in disk-resident databases
■ Interpretability:
◻ understandability of and insight provided by the model
■ Compactness of the model: size of the tree, or the
number of rules.

54
Evaluation methods
■ Holdout set: The available data set D is divided into
two disjoint subsets,
◻ the training set Dtrain (for learning a model)
◻ the test set Dtest (for testing the model)
■ Important: training set should not be used in testing
and the test set should not be used in learning.
◻ Unseen test set provides an unbiased estimate of accuracy.
■ The test set is also called the holdout set. (the
examples in the original data set D are all labeled
with classes.)
■ This method is mainly used when the data set D is
large.

55
Evaluation methods (cont…)
■ n-fold cross-validation: The available data is
partitioned into n equal-size disjoint subsets.
■ Use each subset as the test set and combine the rest
n-1 subsets as the training set to learn a classifier.
■ The procedure is run n times, which gives n accuracies.
■ The final estimated accuracy of learning is the
average of the n accuracies.
■ 10-fold and 5-fold cross-validations are commonly
used.
■ This method is used when the available data is not
large.
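A plain-Python sketch of this procedure; train_fn and accuracy_fn are hypothetical placeholders for any learner and any accuracy measure:

    def cross_validate(examples, labels, train_fn, accuracy_fn, n=10):
        # Partition indices into n (roughly) equal-size disjoint folds.
        folds = [list(range(i, len(examples), n)) for i in range(n)]
        accuracies = []
        for test_idx in folds:
            test_set = set(test_idx)
            train_idx = [i for i in range(len(examples)) if i not in test_set]
            model = train_fn([examples[i] for i in train_idx],
                             [labels[i] for i in train_idx])
            accuracies.append(accuracy_fn(model,
                                          [examples[i] for i in test_idx],
                                          [labels[i] for i in test_idx]))
        # Final estimate: the average of the n accuracies.
        return sum(accuracies) / n

Setting n equal to the number of examples gives leave-one-out cross-validation, described on the next slide.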
56
Evaluation methods (cont…)
■ Leave-one-out cross-validation: This method is used
when the data set is very small.
■ It is a special case of cross-validation
■ Each fold of the cross-validation has only a single test
example and all the rest of the data is used in training.
■ If the original data has m examples, this is m-fold
cross-validation

57
Evaluation methods (cont…)
■ Validation set: the available data is divided into
three subsets,
◻ a training set,
◻ a validation set and
◻ a test set.
■ A validation set is used frequently for estimating
parameters in learning algorithms.
■ In such cases, the values that give the best
accuracy on the validation set are used as the final
parameter values.
■ Cross-validation can be used for parameter estimation as well.

58
Classification measures
■ Accuracy is only one measure (error = 1-accuracy).
■ Accuracy is not suitable in some applications.
■ In text mining, we may only be interested in the
documents of a particular topic, which are only a
small portion of a big document collection.
■ In classification involving skewed or highly
imbalanced data, e.g., network intrusion and
financial fraud detections, we are interested only in
the minority class.
◻ High accuracy does not mean any intrusion is detected.
◻ E.g., if only 1% of the cases are intrusions, a classifier that always predicts “no intrusion” achieves 99% accuracy while detecting nothing.
■ The class of interest is commonly called the positive class, and the rest the negative classes.
59
Precision and recall measures
■ Used in information retrieval and text classification.
■ We use a confusion matrix to introduce them:

                    classified positive   classified negative
actual positive            TP                     FN
actual negative            FP                     TN

(TP = true positives, FN = false negatives, FP = false positives, TN = true negatives)
60
Precision and recall measures (cont…)

■ Precision p is the number of correctly classified positive examples divided by the total number of examples that are classified as positive:
p = TP / (TP + FP)
■ Recall r is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set:
r = TP / (TP + FN)
61
An example

■ This confusion matrix gives


◻ precision p = 100% and
◻ recall r = 1%
because we only classified one positive example correctly
and no negative examples wrongly.
■ Note: precision and recall only measure
classification on the positive class.

62
F1-value (also called F1-score)
■ It is hard to compare two classifiers using two measures. The F1 score combines precision and recall into one measure:
F1 = 2pr / (p + r)
■ F1 is the harmonic mean of p and r, and the harmonic mean of two numbers tends to be closer to the smaller of the two.
■ For the F1-value to be large, both p and r must be large.
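A small sketch computing precision, recall, and F1 directly from confusion-matrix counts; the counts below are chosen to be consistent with the earlier example (one true positive, no false positives, 99 missed positives):

    def precision_recall_f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    # p = 1.0, r = 0.01, so F1 stays close to the smaller value.
    print(precision_recall_f1(tp=1, fp=0, fn=99))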

63
Receiver operating characteristic curve

■ It is commonly called the ROC curve.
■ It is a plot of the true positive rate (TPR) against the false positive rate (FPR).
■ True positive rate: TPR = TP / (TP + FN)
■ False positive rate: FPR = FP / (FP + TN)
64
Sensitivity and Specificity
■ In statistics, there are two other evaluation measures:
◻ Sensitivity: Same as TPR, i.e., sensitivity = TP / (TP + FN)
◻ Specificity: Also called the True Negative Rate (TNR), specificity = TN / (TN + FP)
■ Then we have TPR = sensitivity and FPR = 1 − specificity.

65
Example ROC curves

66
Area under the curve (AUC)
■ Which classifier is better, C1 or C2?
◻ It depends on which region you talk about.
■ Can we have one measure?
◻ Yes, we compute the area under the curve (AUC)
■ If AUC for Ci is greater than that of Cj, it is said that Ci is
better than Cj.
◻ If a classifier is perfect, its AUC value is 1
◻ If a classifier makes all random guesses, its AUC value is 0.5.

67
Drawing an ROC curve
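The figure is not reproduced here; the following sketch shows one common way to draw the curve from classifier scores, sweeping a threshold over the sorted scores and accumulating AUC with the trapezoid rule (the labels and scores are made up):

    def roc_points(labels, scores):
        # Sort examples by decreasing score; each prefix gives one (FPR, TPR) point.
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        pos = sum(labels)
        neg = len(labels) - pos
        tp = fp = 0
        points = [(0.0, 0.0)]
        for i in order:
            if labels[i] == 1:
                tp += 1
            else:
                fp += 1
            points.append((fp / neg, tp / pos))
        return points

    def auc(points):
        # Trapezoid rule over consecutive (FPR, TPR) points.
        return sum((x2 - x1) * (y1 + y2) / 2
                   for (x1, y1), (x2, y2) in zip(points, points[1:]))

    labels = [1, 1, 0, 1, 0, 0]                  # hypothetical true classes
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]      # hypothetical classifier scores
    pts = roc_points(labels, scores)
    print(pts)
    print("AUC =", auc(pts))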

68
