L05-Predictive Analytics I
Predictive Analytics I
1
Agenda
Introduction
K-Nearest Neighbor
2
Introduction
3
Introduction
➢Predictive Analytics uses Predictive Data Mining techniques, which in turn rely on Supervised Machine Learning techniques.
4
Introduction
➢Classification is an instance of Supervised Machine Learning and is widely used for prediction purposes.
5
Introduction
➢Examples of Classification Techniques:
• K-Nearest Neighbor (KNN)
• Naïve Bayes
• Decision Trees (DT)
• Support Vector Machines (SVM)
• Neural Networks
6
K-Nearest Neighbor
7
K-Nearest Neighbor
➢K-nearest neighbors (KNN) is an algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).
➢At the training phase, the KNN algorithm simply stores the dataset; when it receives new data, it classifies that data into the category that is most similar to it.
8
K-Nearest Neighbor
➢ The working of K-NN can be explained on the basis of the algorithm below (a code sketch follows the steps):
Step 1 − For implementing any algorithm, we need a dataset. So during the first step of KNN, we must load the training as well as the test data.
Step 2 − Next, we need to choose the value of K, i.e., the number of nearest data points to consider. K can be any positive integer.
Step 3 − For each point in the test data, do the following:
3.1 − Calculate the distance between the test point and each row of the training data using one of the distance functions, namely Euclidean, Manhattan, or Hamming distance.
3.2 − Sort the training rows in ascending order of distance.
3.3 − Choose the top K rows from the sorted array.
3.4 − Assign a class to the test point based on the most frequent class among these rows.
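A minimal sketch of these steps in plain Python with NumPy; the function name, variable names, and toy data are illustrative, not taken from the slides:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=5):
    # Step 3.1: Euclidean distance from the test point to every training row
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Steps 3.2 and 3.3: indices of the K closest training rows
    nearest = np.argsort(distances)[:k]
    # Step 3.4: majority vote over the classes of those K neighbors
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Made-up toy data (height cm, weight kg) -> T-shirt size
X_train = np.array([[158, 58], [160, 59], [163, 61], [165, 64], [170, 68]])
y_train = ["M", "M", "M", "L", "L"]
print(knn_predict(X_train, y_train, np.array([161, 61]), k=3))  # -> "M"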
9
K-Nearest Neighbor
10
K-Nearest Neighbor
➢Distance functions:
1. Euclidean Distance:
2. Manhattan Distance:
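The slide's formulas are not reproduced in this text; the standard definitions, for points $x = (x_1, \ldots, x_D)$ and $y = (y_1, \ldots, y_D)$, are:
Euclidean: $d(x, y) = \sqrt{\sum_{j=1}^{D} (x_j - y_j)^2}$
Manhattan: $d(x, y) = \sum_{j=1}^{D} |x_j - y_j|$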
11
K-Nearest Neighbor
➢Distance functions:
3. Hamming Distance:
• It is a measure of the number of positions in which corresponding symbols differ in two strings of equal length. It is suitable for categorical features.
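In symbols, for two equal-length strings $x$ and $y$: $d_H(x, y) = \sum_{j=1}^{D} \mathbb{1}[x_j \neq y_j]$. Applied to a single categorical feature, the distance is 0 when the two values match and 1 when they differ.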
12
K-Nearest Neighbor
➢Example:
A new customer has a height of 161 cm and a weight of 61 kg; the task is to predict their T-shirt size.
13
K-Nearest Neighbor
➢Example: Euclidean distance is used.
For K = 5,
T-shirt size = M
14
K-Nearest Neighbor
➢Example:
15
K-Nearest Neighbor
16
K-Nearest Neighbor
➢Normalization and Standardization:
• When the independent variables in the training data are measured in different units, it is important to scale the variables before calculating distances.
• For example, if one variable is based on height in cm and the other is based on weight in kg, then height will influence the distance calculation more.
• Scaling the variables can be done by either of the following methods:
Normalization          Standardization
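The two scaling formulas referenced on the slide are the standard ones:
Normalization (min-max): $x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$
Standardization (z-score): $x' = \dfrac{x - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the mean and standard deviation of the variable.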
17
K-Nearest Neighbor
➢Handling categorical features:
• Hamming distance can be used.
• A number can be assigned to each category (not a good option, since it imposes an artificial ordering on the categories).
18
K-Nearest Neighbor
➢How to find the best K value?
• Cross-validation is a way to find the optimal K value. It estimates the validation error rate by holding out a subset of the training set from the model-building process.
• Cross-validation (say, 10-fold validation) involves randomly dividing the training set into 10 groups, or folds, of approximately equal size. 90% of the data is used to train the model and the remaining 10% to validate it. The error rate is then computed on the 10% validation data. This procedure repeats 10 times, each time with a different fold, and results in 10 estimates of the validation error, which are then averaged.
• The process is repeated for different values of K, and the value of K that yields the smallest average error is selected (see the sketch below).
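A minimal sketch of this search, assuming scikit-learn is available; the dataset (iris) and the candidate range of K values are purely illustrative:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)           # toy dataset, just for illustration

best_k, best_accuracy = None, -np.inf
for k in range(1, 31):                      # candidate K values (illustrative range)
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    if scores.mean() > best_accuracy:       # smallest average error = highest average accuracy
        best_k, best_accuracy = k, scores.mean()

print(best_k, 1 - best_accuracy)            # chosen K and its average error rate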
19
Applying KNN on Big Data
20
Applying KNN on Big Data
➢Despite the promising results shown by k-NN in a wide variety of problems, it lacks the scalability needed to address Big Data sets.
➢The main problems in dealing with large-scale data are:
• Runtime: The complexity of the traditional k-NN algorithm is O(n · D), where n is the number of instances and D the number of features.
• Memory consumption: For rapid computation of the distances, the k-NN model normally requires the training data to be stored in memory. When the training set (TR) is too big, it can easily exceed the available RAM.
21
Applying KNN on Big Data
➢These drawbacks motivate the use of Big Data techniques to distribute the processing of KNN over a cluster of nodes.
22
Applying KNN on Big Data
➢First, the training data will be divided into multiple splits.
➢The map phase will determine the k nearest neighbors within each split of the data.
➢As a result, each map will emit its k nearest neighbors, together with their computed distance values, to the reduce phase.
23
Applying KNN on Big Data
➢Afterwards, the reduce phase will compute the definitive neighbors from the lists obtained in the map phase.
➢That is, the reduce phase determines the final k nearest neighbors from the candidates provided by the maps (a sketch of both phases follows).
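A schematic sketch of the two phases in Python, not tied to any particular MapReduce framework; the function and variable names are illustrative:

import heapq
import numpy as np

def knn_map(X_split, y_split, x_test, k):
    # Map phase: find the k nearest neighbors of the test point inside one split
    # and emit (distance, class label) pairs for them.
    distances = np.sqrt(((X_split - x_test) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return [(float(distances[i]), y_split[i]) for i in nearest]

def knn_reduce(candidate_lists, k):
    # Reduce phase: merge the candidates emitted by all maps, keep the k with the
    # smallest distances, and take a majority vote over their class labels.
    candidates = [pair for lst in candidate_lists for pair in lst]
    final_k = heapq.nsmallest(k, candidates, key=lambda p: p[0])
    labels = [label for _, label in final_k]
    return max(set(labels), key=labels.count)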
24
Naïve Bayes classifier
25
Naïve Bayes
➢Naïve Bayes is a probabilistic classification method based on Bayes’
theorem (or Bayes’ law).
➢There are ways to convert continuous variables into categorical ones; this process is often referred to as the discretization of continuous variables.
28
Naïve Bayes: Bayes’ Theorem
➢A more general form of Bayes’ theorem assigns a class label to an object with multiple attributes A = {a1, a2, ..., am} such that the label corresponds to the largest value of P(ci | A).
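Written out (the standard form of Bayes’ theorem, matching the slide’s notation):
$P(c_i \mid A) = \dfrac{P(A \mid c_i)\, P(c_i)}{P(A)}$, and the object is assigned the class $c_i$ with the largest $P(c_i \mid A)$.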
29
Naïve Bayes classifier
➢With two simplifications, Bayes’ theorem can be extended to become the naïve Bayes classifier.
➢The first simplification is the conditional independence assumption: each attribute is conditionally independent of every other attribute given a class label ci.
➢The second simplification is to ignore the denominator P(a1, a2, ..., am). Because it appears in the denominator for all values of i, removing it has no impact on the relative probability scores and simplifies the calculations.
30
Naïve Bayes classifier
➢Naïve Bayes classification applies the two simplifications mentioned
earlier and, as a result, P (ci | a1, a2,..am) is proportional to the product
of P (aj |ci) times P (ci).
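In symbols:
$P(c_i \mid a_1, a_2, \ldots, a_m) \propto P(c_i) \prod_{j=1}^{m} P(a_j \mid c_i)$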
31
Naïve Bayes classifier
➢Building a naïve Bayes classifier requires knowing certain statistics, all
calculated from the training set.
➢The first requirement is to collect the probabilities of all class labels,
P(ci).
➢The second thing the naïve Bayes classifier needs to know is the conditional probability of each attribute aj given each class label ci, namely P(aj | ci). For each attribute and each of its possible values, the conditional probability given each class label must be computed.
32
Naïve Bayes classifier
➢For a given attribute, assume it can take the values {x, y, z} and that there are two class labels, c1 and c2.
➢Then the following probabilities need to be computed:
• P(x | c1)
• P(x | c2)
• P(y | c1)
• P(y | c2)
• P(z | c1)
• P(z | c2)
33
Naïve Bayes classifier
➢After that, the naïve Bayes classifier can be tested on the testing set.
➢For each record in the testing set, the naïve Bayes classifier assigns the class label ci that maximizes:
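(The slide’s equation is not reproduced in this text; the standard naïve Bayes decision score, consistent with the symbols listed below, is:)
$P(c_i) \cdot \prod_{j=1}^{m} P(a_j \mid c_i)$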
where:
• m is the number of features (dimensions)
• i is the index of class labels
• j is the index of features (dimensions)
• a1 is the value of the first feature in the test record
• a2 is the value of the second feature in the test record, and so on.
34
Applying Naïve Bayes classifier
on Big Data
35
Applying Naïve Bayes classifier on Big Data
➢Applying MapReduce to the naïve Bayes classifier significantly decreases computation time, allowing its application to Big Data problems.
➢First, the training data will be divided into multiple splits.
➢During the map phase, each map processes a single split and computes statistics of the input data.
➢For each attribute, the map outputs a <Key, Value> pair, where:
• Key is the class label,
• Value is {AttributeValue, the frequency of that attribute value within the class label}
36
Applying Naïve Bayes classifier on Big Data
➢In the reduce phase, the reduce function aggregates the counts of each attribute value within each class label.
➢For each attribute, the reduce function outputs a <Key, Value> pair, where:
• Key is the class label,
• Value is {AttributeValue, the total frequency of that attribute value within the class label, summed over the i mappers},
where i is the number of mappers (a sketch of both phases follows).
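A schematic sketch of the two functions in Python, framework-agnostic and with illustrative names; each input record is assumed to be a (class label, attribute dictionary) pair:

from collections import Counter

def nb_map(split_records):
    # Map phase: within one split, count how often each attribute value occurs
    # per class label. Each record is (class_label, {attribute: value, ...}).
    counts = Counter()
    for label, attributes in split_records:
        for attribute, value in attributes.items():
            counts[(label, attribute, value)] += 1
    # Emit <Key, Value> pairs: Key = class label, Value = (attribute, value, frequency)
    return [(label, (attribute, value, freq))
            for (label, attribute, value), freq in counts.items()]

def nb_reduce(class_label, values):
    # Reduce phase: sum, over all mappers, the frequencies of each attribute
    # value within the given class label.
    totals = Counter()
    for attribute, value, freq in values:
        totals[(attribute, value)] += freq
    return [(class_label, (attribute, value, total))
            for (attribute, value), total in totals.items()]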
37
Performance Evaluation of
classifiers
38
Performance Evaluation of classifiers
➢A confusion matrix is a specific table layout that allows visualization of
the performance of a classifier.
➢The following figure shows the confusion matrix for a two-class
classifier:
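The figure itself is not reproduced in this text; one common layout of the two-class confusion matrix (actual classes on rows, predicted classes on columns) is:

                     Predicted Positive      Predicted Negative
Actual Positive      True Positive (TP)      False Negative (FN)
Actual Negative      False Positive (FP)     True Negative (TN)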
39
Performance Evaluation of classifiers
➢True positives (TP) are the number of positive instances the classifier
correctly identified as positive.
➢False positives (FP) are the number of instances the classifier identified as positive but that in reality are negative.
➢True negatives (TN) are the number of negative instances the classifier
correctly identified as negative.
➢False negatives (FN) are the number of instances classified as negative but that in reality are positive.
40
Performance Evaluation of classifiers
➢The accuracy (or the overall success rate) is a metric defining the rate
at which a model has classified the records correctly.
➢It is defined as the sum of TP and TN divided by the total number of
instances, as shown in the following equation:
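In symbols (standard definition, matching the description above):
$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$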
41
Performance Evaluation of classifiers
➢The false positive rate (FPR) shows what percent of negatives the
classifier marked as positive.
➢The FPR is also called the false alarm rate or the type I error rate.
➢FPR is computed as follows:
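In symbols (standard definition, matching the description above):
$FPR = \dfrac{FP}{FP + TN}$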
42
Performance Evaluation of classifiers
➢Precision is the percentage of instances marked positive that really are
positive. It is computed as follows:
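In symbols (standard definition, matching the description above):
$\text{Precision} = \dfrac{TP}{TP + FP}$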
43
Performance Evaluation of classifiers
➢F1-score is the harmonic mean of the precision and recall:
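In symbols (standard definitions):
$F_1 = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$, where $\text{Recall} = \dfrac{TP}{TP + FN}$.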
44
Performance Evaluation of classifiers – Multi classes
➢Micro-average and Macro-average:
• The micro-average precision and recall scores are calculated from the sum of
classes’ true positives (TPs), false positives (FPs), and false negatives (FNs) of the
model.
• The macro-average precision and recall scores are calculated as the arithmetic mean (or weighted mean) of the individual classes’ precision and recall scores.
• The macro-average F1-score is calculated as the arithmetic mean (or weighted mean) of the individual classes’ F1-scores (a worked sketch follows).
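A small worked sketch in Python; the per-class TP/FP/FN counts are made up purely for illustration:

# Hypothetical per-class counts: {class: (TP, FP, FN)}
counts = {"A": (30, 10, 5), "B": (20, 5, 15), "C": (10, 5, 10)}

# Micro-average: pool the TPs, FPs and FNs of all classes first.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)

# Macro-average: compute precision/recall per class, then take the arithmetic mean.
precisions = [t / (t + f) for t, f, _ in counts.values()]
recalls = [t / (t + f) for t, _, f in counts.values()]
macro_precision = sum(precisions) / len(precisions)
macro_recall = sum(recalls) / len(recalls)

print(micro_precision, micro_recall, macro_precision, macro_recall)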
45
Performance Evaluation of classifiers – Multi classes
➢Exercise:
46
Performance Evaluation of classifiers – Multi classes
➢Exercise:
48
Thank You
49