
Classification: Machine Learning Basics and kNN

Wachemo University
School of Computing and Informatics
Department of Software Engineering
Ms. Senedu G/mariam (2023)
Outline

❖ A brief overview of ML
❖ Key tasks in ML
❖ Why we need ML
❖ K-nearest neighbors algorithm
❖ kNN Classification
❖ kNN Regression
❖ Some Issues in kNN
❖ Decision Tree
❖ Naïve Bayes
Machine Learning

❖ With machine learning we can gain insight from a dataset.
❖ We ask the computer to make sense of the data; this is what we mean by learning.
❖ Machine learning is the process of turning data into information and knowledge.
❖ ML lies at the intersection of computer science, engineering, and statistics, and often appears in other disciplines.
What is Machine Learning?

❖ It is a tool that can be applied to many problems.
❖ Any field that needs to interpret and act on data can benefit from ML techniques.
❖ There are many problems where the solution isn't deterministic; that is, we don't know enough about the problem, or we don't have enough computing power to model it properly.
Traditional vs. ML Systems

❖ In ML, once the system is provided with the right data and algorithms, it can "fish for itself".
Traditional vs. ML Systems

❖ A key aspect of ML that makes it particularly appealing in terms of business value is that it does not require as much explicit programming in advance.
Sensors and the Data Deluge

❖ We have a tremendous amount of human-created data from the WWW, but recently more non-human sources of data have been coming online.
❖ Sensors connected to the web.
❖ About 20% of non-video internet traffic is generated by sensors.
❖ Data is collected from mobile phones (three-axis accelerometers, temperature sensors, and GPS receivers).
❖ The two trends of mobile computing and sensor-generated data mean that we'll be getting more and more data in the future.
Key Terminology

❖ Weight, wingspan, webbed feet, and back color are features or attributes.
❖ An instance (example, observation) is made up of features.
❖ Species is the target variable (response, outcome, output, etc.).
❖ Attributes can be numeric, binary, or nominal.
Key Terminology

❖ To train the ML algorithm we need to feed it quality data, known as a training set.
❖ In the above example, each training example (instance) has four features and one target variable.
❖ In a training set the target variable is known.
❖ The machine learns by finding some relationship between the features and the target variable.
❖ In a classification problem the target variables are called classes, and there is assumed to be a finite number of classes.
Key Terminology Cont…

❖ To test machine learning algorithms, a separate dataset called a test set is used.
❖ The target variable for each example from the test set isn't given to the program.
❖ The program (model) decides which class each example belongs to.
❖ The predicted value is then compared with the actual target variable (see the sketch below).
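A minimal sketch of this train/predict/compare loop; the data, variable names, and the scikit-learn classifier are illustrative assumptions, not part of the slides:

```python
# Train on a training set, predict on a held-out test set,
# then compare predictions with the known target values.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0, 0.9], [0.9, 1.1], [5.0, 5.2], [5.1, 4.8]])
y_train = np.array([0, 0, 1, 1])           # target variable is known for training
X_test  = np.array([[1.1, 1.0], [4.9, 5.0]])
y_test  = np.array([0, 1])                 # withheld from the model, used only for scoring

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = model.predict(X_test)             # the model assigns a class to each test example
print("accuracy:", np.mean(y_pred == y_test))
```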
Key Tasks of Machine Learning

❖ In classification, our job is to predict which class an instance of data falls into.
❖ Regression is the prediction of a numeric value.
❖ Classification and regression are examples of supervised learning.
❖ This set of problems is known as supervised because we're telling the algorithm what to predict.
Key Tasks of Machine Learning

❖ The opposite of supervised learning is a set of tasks known as unsupervised learning.
❖ In unsupervised learning, there's no label or target value given for the data. Grouping similar items together is known as clustering.
❖ In unsupervised learning, we may also want to find statistical values that describe the data. This is known as density estimation.
❖ Another task of unsupervised learning is reducing the data from many features to a smaller number so that we can visualize it properly (dimensionality reduction).
Key Tasks of Machine Learning

❖ A number of common algorithms are used to perform classification, regression, clustering, and density estimation tasks.
❖ Balancing generalization and memorization (overfitting) is a common problem for many ML algorithms.
❖ Regularization techniques are used to reduce overfitting.
Key Tasks of Machine Learning

❖ There are two fundamental causes of prediction error: a model's bias and its variance.
❖ A model with high variance overfits the training data, while a model with high bias underfits the training data.
❖ High bias, low variance
❖ Low bias, high variance
❖ High bias, high variance
❖ Low bias, low variance
❖ The predictive power of many ML algorithms improves as the amount of training data increases.
❖ The quality of the data is also important.
❖ Ideally, a model will have both low bias and low variance, but efforts to reduce one will frequently increase the other. This is known as the bias-variance trade-off.
Model Bias vs. Variance

• Model bias refers to the presence of systematic errors in a model that can cause it to consistently make incorrect predictions. These errors can arise from many sources, including:
  – the selection of the training data,
  – the choice of features used to build the model, or
  – the algorithm used to train the model.
• Variance refers to how the model changes when different portions of the training data set are used. Simply stated, variance is the variability in the model's predictions: how much the learned function can change depending on the given data set. Variance comes from highly complex models with a large number of features.
Common Measurements of Performance

❖ Common measurements of performance include:
❖ Accuracy (ACC) = (TP + TN) / (TP + TN + FP + FN)
❖ Precision (P) = TP / (TP + FP)
❖ Recall (R) = TP / (TP + FN)
• A true positive (TP) is an outcome where the model correctly predicts the positive class.
• A true negative (TN) is an outcome where the model correctly predicts the negative class.
• A false positive (FP) is an outcome where the model incorrectly predicts the positive class.
• A false negative (FN) is an outcome where the model incorrectly predicts the negative class.
Accuracy (ACC)

❖ Accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition:

Accuracy = (number of correct predictions) / (total number of predictions)

For binary classification, accuracy can also be calculated in terms of positives and negatives as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Example: Let's try calculating accuracy for a model that classified 100 tumors as malignant (the positive class) or benign (the negative class), with TP = 1, TN = 90, FP = 1, and FN = 8 (the counts consistent with the precision and recall worked out on the following slides):

Accuracy = (1 + 90) / (1 + 90 + 1 + 8) = 91/100 = 0.91
Precision (P)

❖ Precision attempts to answer the question: What proportion of positive identifications was actually correct?

Precision = TP / (TP + FP)

Example: For the same model that classified 100 tumors as malignant (the positive class) or benign (the negative class), with TP = 1 and FP = 1:

Precision = 1 / (1 + 1) = 0.5

Our model has a precision of 0.5; in other words, when it predicts a tumor is malignant, it is correct 50% of the time.
Recall (R)

❖ Recall attempts to answer the question: What proportion of actual positives was identified correctly?

Recall = TP / (TP + FN)

Note: A model that produces no false negatives has a recall of 1.0.

Example: For the same model that classified 100 tumors as malignant (the positive class) or benign (the negative class), with TP = 1 and FN = 8:

Recall = 1 / (1 + 8) = 0.11

Our model has a recall of 0.11; in other words, it correctly identifies 11% of all malignant tumors (a short computation of all three metrics follows).
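The three metrics can be checked with a few lines of code; a minimal sketch using the tumor counts from the example above (TP = 1, TN = 90, FP = 1, FN = 8):

```python
# Compute accuracy, precision, and recall from confusion-matrix counts.
# Counts are taken from the tumor example above.
tp, tn, fp, fn = 1, 90, 1, 8

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 91 / 100 = 0.91
precision = tp / (tp + fp)                    # 1 / 2    = 0.5
recall    = tp / (tp + fn)                    # 1 / 9    ≈ 0.11

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```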
How to Choose the Right Algorithm

❖ First, you need to consider your goal.
❖ If you're trying to predict or forecast a target value, then you need to look into supervised learning.
❖ If not, then unsupervised learning is the place you want to be.
❖ If you've chosen supervised learning, what's your target value?
❖ A discrete value (y/n, 1/2/3, red/yellow/black): classification
❖ A continuous range of values (0.00 to 100.00, etc.): regression
How to Choose the Right Algorithm

❖ Spend some time getting to know the data; the better we know it, the more successful an application we can build.
❖ Things to know about the data include (a quick inspection sketch follows this list):
❖ Are the features nominal or continuous?
❖ Are there missing values in the features?
❖ If there are missing values, why are they missing?
❖ Are there outliers in the data? etc.
❖ All of these facts about your data can help you narrow the algorithm selection process.
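A minimal sketch of such an inspection, assuming the data has been loaded into a hypothetical pandas DataFrame df from a hypothetical file:

```python
# Quick data inspection with pandas; the file name and DataFrame are assumptions.
import pandas as pd

df = pd.read_csv("data.csv")        # hypothetical file name

print(df.dtypes)                    # nominal (object) vs. continuous (float/int) features?
print(df.isna().sum())              # how many missing values per feature?
print(df.describe())                # min/max/quartiles hint at outliers
```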
How to Choose the Right Algorithm

❖ Finding the best algorithm is an iterative process of trial and error.
❖ Steps in developing a machine learning application (a short end-to-end sketch follows this list):
❖ Collect data: scrape a website, an RSS feed, an API, etc.
❖ Prepare the input data: make sure the data is in a usable format.
❖ Analyze the input data: look at the data.
❖ Understand the data.
❖ Train the algorithm: this is where the ML takes place (does not apply to unsupervised learning).
❖ Test the algorithm: if results are poor, go back to step 4 and retrain.
❖ Use it: implement the ML application.
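A compact end-to-end sketch of these steps; scikit-learn's bundled iris data stands in for the collect/prepare stages, and the dataset and classifier choices are illustrative assumptions:

```python
# Steps 1-3: "collect", prepare, and analyze data (bundled iris dataset stands in here).
# Steps 4-6: train, test, and use the model.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)                           # train the algorithm
print("test accuracy:", model.score(X_test, y_test))  # test it; if poor, go back and retrain
print("prediction:", model.predict(X_test[:1]))       # use it on new input
```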
Problem Solving Framework

❖ Problem solving framework for an ML application:
❖ Business issue understanding
❖ Data understanding
❖ Data preparation
❖ Analysis / modeling
❖ Validation
❖ Presentation / visualization
Machine Learning Systems and Data

❖ In AI (ML), instead of writing a program by hand for each specific task, we collect lots of examples that specify the correct output for a given input.
❖ The most important factor in ML is not the algorithm or the software systems.
❖ The quality of the data is the soul of an ML system.
Machine Learning Systems and Data

❖ Invalid training data:
❖ Garbage in, garbage out.
❖ An invalid dataset leads to invalid results.
❖ This is not to say that the training data needs to be perfect.
❖ Out of a million examples, some inaccurate labels are acceptable.
❖ The quality of the data is the soul of an ML system.
Machine Learning Systems and Data

❖ "Garbage" can be several things:
❖ Wrong labels (Dog labeled Cat, Cat labeled Dog)
❖ Inaccurate and missing values
❖ A biased dataset, etc.
❖ Handling missing data (a small imputation sketch follows):
❖ If only a small portion of rows and columns is affected: discard them.
❖ Data imputation (time-series data): carry the last valid value forward.
❖ Substitute with the mean or median.
❖ Predict the missing values from the available data.
❖ A missing value can have a meaning of its own ("missing" as a category).
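A minimal sketch of these imputation options with pandas; the DataFrame and its columns are illustrative assumptions:

```python
# Common missing-data strategies from the list above; `df` is an assumed toy DataFrame.
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 40, 31], "salary": [8000, 12000, np.nan, 15000]})

dropped = df.dropna()                            # discard rows with missing values
ffilled = df.ffill()                             # time series: carry last valid value forward
means   = df.fillna(df.mean(numeric_only=True))  # substitute with the column mean
flagged = df["salary"].isna()                    # keep "missingness" itself as a feature
```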
Machine Learning Systems and Data

❖ Having a clean dataset is not always enough.
❖ Features with large magnitudes can dominate features with small magnitudes during training.
❖ Example: age [0–100] vs. salary [6,000–20,000]: apply scaling or standardization (see the sketch after this slide).
❖ Data imbalance (example class counts below):

No.  Class  Number
1    Cat    5000
2    Dog    5000
3    Tiger  150
4    Cow    25

❖ Options: leave it as it is.
❖ Undersampling (if all classes are equally important): [5000 → 25]
❖ Oversampling (if all classes are equally important): [25 → 5000]
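A minimal sketch of min-max scaling and standardization on the age/salary example; the values are illustrative:

```python
# Rescale features so large-magnitude columns (salary) don't dominate small ones (age).
import numpy as np

X = np.array([[25.0, 8000.0],
              [40.0, 12000.0],
              [31.0, 15000.0],
              [60.0, 6500.0]])   # columns: age, salary (illustrative values)

# Min-max scaling to [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization to zero mean, unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax.round(2))
print(X_std.round(2))
```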
Challenges in Machine Learning

❖ It requires considerable data and compute power.
❖ It requires knowledgeable data science specialists or teams.
❖ It adds complexity to the organization's data integration strategy (a data-driven culture is needed).
❖ Learning AI (ML) algorithms is challenging without an advanced math background.
❖ The context of data often changes (private data vs. public data).
❖ Algorithmic bias, privacy, and ethical concerns may be overlooked.
Stages of the ML Process

❖ The first key step in preparing to explore and exploit AI (ML) is to understand the basic stages involved.
Stages of the ML Process

❖ Machine learning tasks and subtasks:
Data Collection and Preparation

❖ Data collection is the process of gathering and measuring information from countless different sources.
❖ Data is being generated at an unprecedented rate. This data can be:
❖ Numeric (temperature, loan amount, customer retention rate),
❖ Categorical (gender, color, highest degree earned), or
❖ Even free text (think doctor's notes or opinion surveys).
❖ In order to use the data we collect to develop practical solutions, it must be collected and stored in a way that makes sense for the business problem at hand.
Data Collection and Preparation

❖ During AI development, we always rely on data.
❖ From training, tuning, and model selection to testing, we use three different data sets: the training set, the validation set, and the testing set.
❖ The validation set is used to select and tune the final ML model.
❖ The test data set is used to evaluate how well your algorithm was trained with the training data set.
Data Collection and Preparation

❖ Test sets typically represent 20% or 30% of the data (see also cross-validation, sketched below).
❖ The test set consists of input data grouped together with verified correct outputs, generally obtained by human verification.
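A minimal sketch of a 60/20/20 train/validation/test split and of k-fold cross-validation with scikit-learn; the split fractions, dataset, and classifier are illustrative assumptions:

```python
# Three-way split (train/validation/test) plus a cross-validation alternative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))   # used to tune/select the model
print("test accuracy:", model.score(X_test, y_test))       # final, untouched evaluation

# Alternative: 5-fold cross-validation over the whole dataset
print("cv accuracy:", cross_val_score(KNeighborsClassifier(5), X, y, cv=5).mean())
```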
Classifying with k-Nearest Neighbors (kNN)
K-Nearest Neighbors (kNN)

❖ It is easy to grasp (understand and implement) and very effective (a powerful tool).
❖ The model for kNN is the entire training dataset.
❖ Pros: high accuracy, insensitive to outliers, no assumptions about the data.
❖ Cons: computationally expensive, requires a lot of memory.
❖ Works with: numeric values, nominal values (classification and regression).
K-Nearest Neighbors (kNN)

❖ We have an existing set of example data (the training set).
❖ We know what class each piece of the data should fall into.
❖ When we're given a new piece of data without a label, we compare it to the existing data, every piece of existing data.
❖ We then take the most similar pieces of data (the nearest neighbors) and look at their labels.
K-Nearest Neighbors (kNN)

❖ We look at the top k most similar pieces of data from our known dataset (usually k < 20).
❖ k is often set to an odd number to prevent ties.
❖ Lastly, we take a majority vote from the k most similar pieces of data, and the majority class is the class we assign to the data we were asked to classify (a minimal sketch follows).
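A minimal sketch of the algorithm as just described, written with plain numpy; the toy data and the function name knn_classify are illustrative assumptions:

```python
# k-nearest neighbors classification: distance to all points, pick k nearest, majority vote.
from collections import Counter
import numpy as np

def knn_classify(query, X_train, y_train, k=3):
    # Euclidean distance from the query to every training instance
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]             # indices of the k closest instances
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]           # majority label

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_classify(np.array([1.1, 0.9]), X_train, y_train, k=3))  # -> "A"
```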
K-Nearest Neighbors (kNN)

❖ kNN and other non-parametric models can be useful when training data is abundant and you have little prior knowledge about the relationship between the response and explanatory variables.
❖ kNN makes only one assumption: instances that are near each other are likely to have similar values of the response variable.
❖ A model that makes assumptions about the relationship can be useful if training data is scarce or if you already know about the relationship.
kNN Classification

❖ Classifying movies into romance or action movies.
❖ The features are the number of kisses and kicks in each movie.
❖ Now, you find a movie you haven't seen yet and want to know whether it's a romance movie or an action movie.
❖ To determine this, we'll use the kNN algorithm.
kNN Classification

❖ We find the movie in question and see how many kicks and kisses it has.

[Figure: classifying movies by plotting the number of kicks and kisses in each movie]
kNN Classification

[Table: movies with the number of kicks and kisses, along with their class]
kNN Classification

❖ We don't know what type of movie the unknown movie is.
❖ First, we calculate its distance to all the other movies.

[Table: distance between each movie and the unknown movie]
kNN Classification

❖ We use the Euclidean distance, where the distance between two vectors xA and xB with two features is:

d = sqrt((xA1 − xB1)² + (xA2 − xB2)²)
Distances

❖ Distances are used to measure similarity.
❖ There are many ways to measure the distance between two instances: Euclidean distance, Manhattan distance, Mahalanobis distance, Hamming distance, and Minkowski distance (implementations are sketched below).
❖ Manhattan distance: |X1 − X2| + |Y1 − Y2|
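A minimal sketch of these distance functions in textbook form (Mahalanobis is omitted since it needs a covariance matrix; the implementations are standard definitions, not taken from the slides):

```python
# Common distance functions between two feature vectors a and b.
import numpy as np

def euclidean(a, b):
    return np.sqrt(((a - b) ** 2).sum())        # Minkowski with p = 2

def manhattan(a, b):
    return np.abs(a - b).sum()                  # Minkowski with p = 1

def minkowski(a, b, p):
    return (np.abs(a - b) ** p).sum() ** (1 / p)

def hamming(a, b):
    return (a != b).sum()                       # count of differing positions

a, b = np.array([1.0, 6.0]), np.array([4.0, 2.0])
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, 3))  # 5.0, 7.0, ...
```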
kNN Classification

❖ Let's assume k = 3.
❖ Then the three closest movies are He's Not Really into Dudes, Beautiful Woman, and California Man.
❖ Because all three movies are romances, we predict that the mystery movie is a romance movie (majority vote; see the code below).
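Using the knn_classify sketch from the earlier slide on this example; the kick/kiss counts and the action-movie titles below are illustrative stand-ins for the table lost from the slide, chosen so that the same three romances come out nearest:

```python
# Movie example with knn_classify from the earlier sketch; counts are illustrative.
import numpy as np

X_train = np.array([[3, 104],    # California Man             (kicks, kisses)
                    [2, 100],    # He's Not Really into Dudes
                    [1, 81],     # Beautiful Woman
                    [101, 10],   # action movie (illustrative)
                    [99, 5],     # action movie (illustrative)
                    [98, 2]])    # action movie (illustrative)
y_train = np.array(["romance"] * 3 + ["action"] * 3)

print(knn_classify(np.array([18, 90]), X_train, y_train, k=3))  # -> "romance"
```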
General Approach to kNN

❖ General approach to kNN:
❖ Collect: any method.
❖ Prepare: numeric values are needed for a distance calculation.
❖ Analyze: any method (e.g., plotting).
❖ Train: does not apply to the kNN algorithm.
❖ Test: calculate the error rate.
❖ Use: the application takes some input data and outputs structured numeric values.
K-Nearest Neighbors (kNN)

❖ kNN is an instance-based learning algorithm.
❖ Non-instance supervised learning: the training pairs <x, y>1 … <x, y>n are used to fit a function such as F(x) = wx + b.
❖ Instance-based supervised learning: the training pairs <x, y>1 … <x, y>n are stored in a database, and prediction is a lookup: F(x) = lookup(x).
K-Nearest Neighbors (kNN)

❖ Advantages:
❖ It remembers (the training data is kept).
❖ Fast to train (no learning time).
❖ Simple and straightforward.
❖ Disadvantages:
❖ No generalization.
❖ Overfitting (sensitive to noise).
❖ Computationally expensive for large datasets.
K-Nearest Neighbors (kNN)

❖ Given:
❖ Training data D = {(xi, yi)}
❖ Distance metric d(q, x): domain knowledge is important.
❖ Number of neighbors k: domain knowledge is important.
❖ Query point q.
❖ NN(q) = { i : d(q, xi) is among the k smallest distances }
❖ Return:
❖ Classification: a vote of the yi.
❖ Regression: the mean of the yi.
kNN Regression Problem

❖ The similarity measure depends on the type of the data:
❖ Real-valued data: Euclidean distance.
❖ Categorical or binary data: Hamming distance (the p-norm with p = 0).

Regression data:

X1, X2    y
1, 6      7
2, 4      8
3, 7      16
6, 8      44
7, 1      50
8, 4      68

Query: q = (4, 2), y = ???

d()          k      Average
Euclidean    1-NN   _______
             3-NN   _______
Manhattan    1-NN   _______
             3-NN   _______
kNN Regression Problem

X1, X2    y     ED
1, 6      7     25
2, 4      8     8
3, 7      16    26
6, 8      44    40
7, 1      50    10
8, 4      68    20

Query: q = (4, 2)

d()          k      Average
Euclidean    1-NN   8
             3-NN   42

ED = (X1i − q1)² + (X2i − q2)²  (the squared Euclidean distance; the square root is omitted, which does not change the ranking of neighbors)

1-NN: the nearest point is (2, 4) with ED = 8, so y = 8. 3-NN: the three nearest points have y = 8, 50, 68, so y = (8 + 50 + 68) / 3 = 42.
kNN Regression Problem

X1, X2    y     MD
1, 6      7     7
2, 4      8     4
3, 7      16    6
6, 8      44    8
7, 1      50    4
8, 4      68    6

Query: q = (4, 2)

d()          k      Average
Manhattan    1-NN   29
             3-NN   35.5

MD = |X1i − q1| + |X2i − q2|

1-NN: two points tie at MD = 4 (y = 8 and y = 50), and the tied neighbors are averaged: (8 + 50) / 2 = 29. 3-NN: after the two points at MD = 4, two more tie at MD = 6 (y = 16 and y = 68); averaging all four gives (8 + 50 + 16 + 68) / 4 = 35.5. (A code sketch follows.)
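A minimal sketch of kNN regression that reproduces the Euclidean answers above; it uses a simple argsort, so the tie-averaging rule used for the Manhattan answers would need extra handling:

```python
# kNN regression: average the y-values of the k nearest training points.
import numpy as np

X = np.array([[1, 6], [2, 4], [3, 7], [6, 8], [7, 1], [8, 4]], dtype=float)
y = np.array([7, 8, 16, 44, 50, 68], dtype=float)
q = np.array([4, 2], dtype=float)

def knn_regress(q, X, y, k):
    d2 = ((X - q) ** 2).sum(axis=1)     # squared Euclidean distance (ranking unchanged)
    nearest = np.argsort(d2)[:k]
    return y[nearest].mean()

print(knn_regress(q, X, y, k=1))  # 8.0
print(knn_regress(q, X, y, k=3))  # 42.0
```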
K-Nearest Neighbors Bias

❖ Preference bias: our belief about what makes a good hypothesis.
❖ Locality: near points are similar (distance function / domain).
❖ Smoothness: averaging.
❖ All features matter equally.
❖ Best practices for data preparation:
❖ Rescale data: normalizing the data to the range [0, 1] is a good idea.
❖ Address missing data: exclude or impute the missing values.
❖ Lower dimensionality: kNN is suited to lower-dimensional data.
kNN and the Curse of Dimensionality

❖ As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.
❖ Exponential means "bad": O(2^d).
Some Other Issues

❖ What is needed to select a kNN model?
❖ How to measure closeness of neighbors.
❖ The correct value for k.
❖ d(x, q): Euclidean, Manhattan, weighted, etc.
❖ The choice of the distance function matters.
❖ The value of k:
❖ k = n: the prediction is simply the average of all the data (the query is effectively ignored).
❖ k = n with a weighted average: closer points count more [locally weighted regression; sketched below].
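A minimal sketch of the distance-weighted variant (k = n with inverse-distance weights; the weighting scheme is one common choice, assumed here rather than specified on the slide):

```python
# Distance-weighted kNN regression: every point votes, closer points count more.
import numpy as np

def weighted_knn_regress(q, X, y, eps=1e-9):
    d = np.sqrt(((X - q) ** 2).sum(axis=1))
    w = 1.0 / (d + eps)                 # inverse-distance weights (eps avoids div by zero)
    return (w * y).sum() / w.sum()      # weighted average over ALL training points

X = np.array([[1, 6], [2, 4], [3, 7], [6, 8], [7, 1], [8, 4]], dtype=float)
y = np.array([7, 8, 16, 44, 50, 68], dtype=float)
print(weighted_knn_regress(np.array([4.0, 2.0]), X, y))
```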
Summary

❖ kNN is an example of instance-based learning.
❖ The algorithm has to carry around the full dataset; for large datasets, this implies a large amount of storage.
❖ The distance must be calculated for every piece of data in the database, and this can be cumbersome.
❖ kNN doesn't give you any idea of the underlying structure of the data.
❖ kNN is an example of lazy learning, which is the opposite of eager learning.
❖ kNN can handle both classification and regression.
Question & Answer

