ML-chap-2
❖ A brief overview of ML
❖ Key tasks in ML
❖ Why we need ML
❖ K-nearest neighbors algorithm
❖ kNN Classification
❖ kNN Regression
❖ Some Issues in KNN
❖ Decision Tree
❖ Naïve Bayes
Machine Learning
What is Machine Learning?
Traditional vs. ML Systems
❖ In ML, once the system is provided with the right data and
algorithms, it can “fish for itself”.
Sensors and the Data Deluge
Key Terminology
Key Tasks of Machine Learning
❖ There are two fundamental causes of prediction error: a model's bias and
its variance.
❖ A model with high variance over-fits the training data, while a model
with high bias under-fits the training data.
❖ High bias, low variance
❖ Low bias, high variance
❖ High bias, high variance
❖ Low bias, low variance
❖ The predictive power of many ML algorithms improves as the
amount of training data increases.
❖ Quality of data is also important.
❖ Ideally, a model will have both low bias and low variance, but efforts
to reduce one will frequently increase the other. This is known as
the bias-variance trade-off (see the sketch below).
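A minimal numpy sketch of the trade-off (an illustration, not from the slides): a low-degree polynomial under-fits a noisy curve (high bias), while a very high-degree polynomial over-fits it (high variance). The data, degrees, and noise level are all assumptions.

```python
import numpy as np

# Illustrative data (assumed): a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.shape)

x_test = np.linspace(0, 1, 200)
y_true = np.sin(2 * np.pi * x_test)

for degree in (1, 4, 15):
    coeffs = np.polyfit(x, y, degree)      # fit a polynomial of this degree
    y_pred = np.polyval(coeffs, x_test)    # evaluate on a dense grid
    mse = np.mean((y_pred - y_true) ** 2)
    print(f"degree {degree:2d}: error vs. true curve = {mse:.3f}")
# degree 1 under-fits (high bias), degree 15 over-fits (high variance);
# a moderate degree balances the two.
```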
Model Bias vs. Variance
• Model bias refers to the presence of systematic errors in a model
that can cause it to consistently make incorrect predictions.
These errors can arise from many sources, including:
– the selection of the training data,
– the choice of features used to build the model, or
– the algorithm used to train the model.
• Variance refers to how much the model changes when it is trained on
different portions of the training data set. Simply stated, variance is
the variability in the model's predictions: how much the learned
function can shift depending on the given data set. Variance typically
comes from highly complex models with a large number of features.
Common Measurements of Performance
❖ Common measurements of performance include:
❖ Accuracy (ACC) = (TP + TN) / (TP + TN + FP + FN)
❖ Precision (P) = TP / (TP + FP)
❖ Recall (R) = TP / (TP + FN)
• A true positive is an outcome where the model correctly predicts the positive class.
• A true negative is an outcome where the model correctly predicts the negative class.
• A false positive is an outcome where the model incorrectly predicts the positive class.
• A false negative is an outcome where the model incorrectly predicts the negative class.
Accuracy (ACC)
For binary classification, accuracy can also be calculated in terms of positives and
negatives as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Example: consider a model that classified 100 tumors as malignant
(the positive class) or benign (the negative class).
Precision (P)
Example: Let's try calculating precision for the following model that classified 100
tumors as malignant (the positive class) or benign (the negative class):
Our model has a precision of 0.5—in other words, when it predicts a tumor is
malignant, it is correct 50% of the time.
Recall (R)
Example: Let's try calculating recall for the following model that classified 100
tumors as malignant (the positive class) or benign (the negative class):
Our model has a recall of 0.11—in other words, it correctly identifies 11% of all
malignant tumors.
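A small Python sketch of the three formulas above. The confusion-matrix counts below are assumptions chosen to be consistent with the 100-tumor example and the precision (0.5) and recall (0.11) quoted in the slides.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Assumed counts for the 100-tumor example (chosen to match the
# precision and recall values quoted in the slides).
TP, FP, TN, FN = 1, 1, 90, 8

print(f"Accuracy : {accuracy(TP, TN, FP, FN):.2f}")  # 0.91
print(f"Precision: {precision(TP, FP):.2f}")         # 0.50
print(f"Recall   : {recall(TP, FN):.2f}")            # 0.11
```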
How to Choose the Right Algorithm
❖ Spend some time getting to know the data; the better we know it,
the more successful an application we can build.
❖ Things to know about the data include:
❖ Are the features nominal or continuous?
❖ Are there missing values in the features?
❖ If there are missing values, why are they missing?
❖ Are there outliers in the data? etc.
❖ All of these facts about your data can help you narrow the
algorithm selection process.
Machine Learning Systems and Data
❖ In ML, instead of writing a program by hand for each specific
task, we collect many examples that specify the correct output
for a given input.
❖ The most important factor in ML is not the algorithm or the
software system.
❖ The quality of the data is the soul of an ML system.
❖ Invalid training data:
❖ Garbage In, Garbage Out.
❖ “Garbage” can be several things:
❖ Wrong labels (e.g., a dog labeled as a cat, or a cat labeled as a dog)
❖ Inaccurate and missing values
❖ A biased dataset, etc.
❖ Handling missing data (a sketch follows this list):
❖ If only a small portion of the rows and columns is affected, discard them
❖ Data imputation (time-series data): carry the last valid value forward
❖ Substitute with the mean or median
❖ Predict the missing values from the available data
❖ A missing value can have a meaning of its own (the fact that it is missing)
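A minimal pandas sketch of these strategies; the toy DataFrame, its column names, and its values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data with missing entries (assumed for illustration).
df = pd.DataFrame({
    "age":    [25, np.nan, 40, 33, np.nan],
    "salary": [8000, 12000, np.nan, 15000, 9000],
})

dropped = df.dropna()                             # discard rows with missing values
filled  = df.ffill()                              # time series: carry the last valid value forward
imputed = df.fillna(df.mean(numeric_only=True))   # substitute the column mean (or df.median())
flagged = df.assign(age_missing=df["age"].isna()) # treat "missing" itself as a feature
```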
❖ Having a clean dataset is not always enough.
❖ Features with large magnitudes can dominate features with small
magnitudes during training.
❖ Example: age in [0, 100] vs. salary in [6,000, 20,000]; this calls for
scaling and standardization (see the sketch below).
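A minimal sketch of min-max scaling and standardization for the age/salary example; the sample values are assumptions.

```python
import numpy as np

age    = np.array([20.0, 35.0, 50.0, 80.0])            # roughly in [0, 100]
salary = np.array([6000.0, 9000.0, 15000.0, 20000.0])  # in [6,000, 20,000]

def min_max_scale(x):
    """Rescale a feature to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Rescale a feature to zero mean and unit variance."""
    return (x - x.mean()) / x.std()

# After scaling, both features live on comparable scales,
# so neither dominates distance-based training.
print(min_max_scale(age), min_max_scale(salary))
```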
❖ Data imbalance (see the sketch after the table):
❖ Leave it as it is.
❖ Under-sampling (if all classes are equally important) [5000 → 25]
❖ Over-sampling (if all classes are equally important) [25 → 5000]

No   Class   Number
1    Cat     5000
2    Dog     5000
3    Tiger   150
4    Cow     25
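A sketch of random under- and over-sampling using the class counts from the table; it is plain Python, and the per-class sample lists are synthetic placeholders.

```python
import random

rng = random.Random(0)

def resample_to(items, n):
    """Under-sample without replacement if n <= len(items); otherwise over-sample with replacement."""
    if n <= len(items):
        return rng.sample(items, n)
    return items + [rng.choice(items) for _ in range(n - len(items))]

# Class counts from the table: Cat 5000, Dog 5000, Tiger 150, Cow 25.
data = {c: [f"{c}_{i}" for i in range(n)]
        for c, n in {"Cat": 5000, "Dog": 5000, "Tiger": 150, "Cow": 25}.items()}

under = {c: resample_to(xs, 25)   for c, xs in data.items()}  # every class down to 25
over  = {c: resample_to(xs, 5000) for c, xs in data.items()}  # every class up to 5000
```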
Challenges in Machine Learning
❖ It requires considerable data and compute power.
❖ It requires knowledgeable data science specialists or teams.
❖ It adds complexity to the organization's data integration
strategy. (data-driven culture)
Stages of ML Process
Data Collection and Preparation
❖ The validation set is used to select and tune the final ML model.
❖ The test data set is used to evaluate how well your algorithm
was trained with the training data set.
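A minimal sketch of carving a dataset into train, validation, and test partitions; the 60/20/20 ratio is an assumption, since the slides do not fix one.

```python
import random

def train_val_test_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle, then cut off the test and validation partitions."""
    data = list(data)
    random.Random(seed).shuffle(data)
    n_test = int(len(data) * test_frac)
    n_val  = int(len(data) * val_frac)
    test  = data[:n_test]
    val   = data[n_test:n_test + n_val]  # used to select and tune the model
    train = data[n_test + n_val:]        # used to fit the model
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```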
Classifying with k-Nearest Neighbors (kNN)
K-Nearest Neighbors (KNN)
KNN Classification
❖ Now, suppose you find a movie you haven’t seen yet and want to
know whether it’s a romance movie or an action movie.
❖ To determine this, we’ll use the kNN algorithm.
❖ We find the movie in question and see how many kicks and
kisses it has.
❖ We don’t know what type of movie the question mark movie is.
❖ First, we calculate the distance to all the other movies.
Distances
• Manhattan distance: |x1 - x2| + |y1 - y2|
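A minimal kNN classification sketch for the movie example, using the Manhattan distance above; the kick/kiss counts and labels are illustrative assumptions.

```python
from collections import Counter

def manhattan(a, b):
    """Manhattan distance: |x1 - x2| + |y1 - y2|."""
    return sum(abs(p - q) for p, q in zip(a, b))

def knn_classify(query, data, k=3, dist=manhattan):
    """Label the query by majority vote among its k nearest neighbors."""
    neighbors = sorted(data, key=lambda item: dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# (kicks, kisses) -> genre; counts made up for illustration.
movies = [
    ((3, 104), "romance"), ((2, 100), "romance"), ((1, 81), "romance"),
    ((101, 10), "action"), ((99, 5), "action"), ((98, 2), "action"),
]
print(knn_classify((18, 90), movies, k=3))  # -> "romance"
```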
General Approach to KNN
K-Nearest Neighbors (KNN)
❖ Advantage:
❖ It remembers the training data (instance-based)
❖ Fast (no learning time)
❖ Simple and straightforward
❖ Disadvantage :
❖ No generalization
❖ Over-fitting (noise)
❖ Computationally expensive for large datasets
K-Nearest Neighbors (KNN)
❖ Given:
❖ Training data D = {(xi, yi)}
❖ Distance metric d(q, x): domain knowledge is important
❖ Number of neighbors K: domain knowledge is important
❖ Query point q

Regression example: query q = (4, 2), y = ???

X1, X2    y    squared Euclidean distance to q
1, 6      7    25
2, 4      8     8
3, 7     16    26
6, 8     44    40
7, 1     50    10
8, 4     68    20

Squared Euclidean distance: d(x, q) = (x1 - q1)² + (x2 - q2)²

❖ Euclidean 1-NN: average y = 8
❖ Euclidean 3-NN: average y = 42
❖ Manhattan 1-NN: _______
❖ Manhattan 3-NN: _______
KNN Regression Problem
Query q = (4, 2), y = ???

X1, X2    y    Manhattan distance to q
1, 6      7    7
2, 4      8    4
3, 7     16    6
6, 8     44    8
7, 1     50    4
8, 4     68    6

Manhattan distance: d(x, q) = |x1 - q1| + |x2 - q2|

❖ Euclidean 1-NN: _______
❖ Euclidean 3-NN: _______
❖ Manhattan 1-NN: average y = 29 (the points (2, 4) and (7, 1) tie at distance 4, so average 8 and 50)
❖ Manhattan 3-NN: average y = 35.5 (two points tie at the third-smallest distance 6, so average 8, 50, 16, and 68)
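A sketch that reproduces the numbers in the two tables above: squared Euclidean and Manhattan distances from q = (4, 2), then 1-NN and 3-NN averages. Ties at the k-th smallest distance are resolved by averaging all tied neighbors, which is how 29 and 35.5 arise.

```python
def sq_euclidean(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def knn_regress(query, data, k, dist):
    """Average y over the k nearest points, including every point that ties the k-th distance."""
    ranked = sorted((dist(query, x), y) for x, y in data)
    cutoff = ranked[k - 1][0]                   # k-th smallest distance
    ys = [y for d, y in ranked if d <= cutoff]  # all points at or inside the cutoff
    return sum(ys) / len(ys)

data = [((1, 6), 7), ((2, 4), 8), ((3, 7), 16),
        ((6, 8), 44), ((7, 1), 50), ((8, 4), 68)]
q = (4, 2)

print(knn_regress(q, data, 1, sq_euclidean))  # 8.0
print(knn_regress(q, data, 3, sq_euclidean))  # 42.0
print(knn_regress(q, data, 1, manhattan))     # 29.0
print(knn_regress(q, data, 3, manhattan))     # 35.5
```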
K-Nearest Neighbors Bias
❖ Preference bias?
❖ Our belief about what makes a good hypothesis:
❖ Locality: near points are similar (distance function / domain)
❖ Smoothness: averaging
❖ All features matter equally
❖ Best practices for data preparation:
❖ Rescale data: normalizing the data to the range [0, 1] is a good idea.
❖ Address missing data: exclude or impute the missing values.
❖ Lower dimensionality: KNN is best suited to lower-dimensional data
(see the sketch under the next heading).
KNN and the Curse of Dimensionality
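A small numpy sketch (an illustration, not from the slides) of why kNN favors low dimensions: with points drawn uniformly in the unit hypercube, the nearest and farthest neighbors become nearly equidistant as the dimension grows, so "nearest" carries less and less information.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    points = rng.random((1000, d))  # 1000 random points in [0, 1]^d
    query = rng.random(d)
    dist = np.linalg.norm(points - query, axis=1)
    print(f"d={d:4d}: nearest/farthest distance ratio = {dist.min() / dist.max():.2f}")
# The ratio climbs toward 1 as d grows: every point looks almost equally far away.
```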
Some Other Issues
Question & Answer