L05-Predictive Analytics I
Predictive Analytics I
1
Agenda
Introduction
K-Nearest Neighbor
2
Introduction
3
Introduction
➢Predictive Analytics uses Predictive Data Mining techniques, which in turn rely on Supervised Machine Learning techniques.
4
Introduction
➢Classification is an instance of Supervised Machine Learning and is widely used for prediction purposes.
5
Introduction
➢Examples of Classification Techniques:
• K-Nearest Neighbor (KNN)
• Naïve Bayes
• Decision Trees (DT)
• Support Vector Machines (SVM)
• Neural Networks
6
K-Nearest Neighbor
7
K-Nearest Neighbor
➢K-nearest neighbors (KNN) is an algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).
➢At the training phase, the KNN algorithm simply stores the dataset; when it receives new data, it classifies that data into the category that is most similar to it.
8
K-Nearest Neighbor
➢ The working of K-NN can be explained on the basis of the algorithm below (a code sketch follows the steps):
Step 1 − For implementing any algorithm, we need a dataset. So during the first step of KNN, we must load the training as well as the test data.
Step 2 − Next, we need to choose the value of K, i.e., the number of nearest data points to consider. K can be any positive integer.
Step 3 − For each point in the test data, do the following:
3.1 − Calculate the distance between the test point and each row of the training data using one of the distance functions, namely Euclidean, Manhattan, or Hamming distance.
3.2 − Sort the training rows in ascending order of distance.
3.3 − Choose the top K rows from the sorted array.
3.4 − Assign a class to the test point based on the most frequent class among these rows.
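A minimal sketch of these steps in plain Python with NumPy; the function name, variable names, and toy data are illustrative, not taken from the slides:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=5):
    # Step 3.1: Euclidean distance from the test point to every training row
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Steps 3.2 and 3.3: indices of the K closest training rows
    nearest = np.argsort(distances)[:k]
    # Step 3.4: majority vote over the classes of those K neighbors
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Made-up toy data (height cm, weight kg) -> T-shirt size
X_train = np.array([[158, 58], [160, 59], [163, 61], [165, 64], [170, 68]])
y_train = ["M", "M", "M", "L", "L"]
print(knn_predict(X_train, y_train, np.array([161, 61]), k=3))  # -> "M"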
9
K-Nearest Neighbor
10
K-Nearest Neighbor
➢Distance functions:
1. Euclidean Distance:
2. Manhattan Distance:
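The slide's formulas are not reproduced in this text; the standard definitions, for points $x = (x_1, \ldots, x_D)$ and $y = (y_1, \ldots, y_D)$, are:
Euclidean: $d(x, y) = \sqrt{\sum_{j=1}^{D} (x_j - y_j)^2}$
Manhattan: $d(x, y) = \sum_{j=1}^{D} |x_j - y_j|$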
11
K-Nearest Neighbor
➢Distance functions:
3. Hamming Distance:
• It is a measure of the number of positions in which corresponding symbols differ in two strings of equal length. It is suitable for categorical features.
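In symbols, for two equal-length strings $x$ and $y$: $d_H(x, y) = \sum_{j=1}^{D} \mathbb{1}[x_j \neq y_j]$. Applied to a single categorical feature, the distance is 0 when the two values match and 1 when they differ.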
12
K-Nearest Neighbor
➢Example:
A new customer has a height of 161 cm and a weight of 61 kg; the task is to predict their T-shirt size.
13
K-Nearest Neighbor
➢Example: Euclidean distance is used.
For K = 5,
T-shirt size = M
14
K-Nearest Neighbor
➢Example:
15
K-Nearest Neighbor
16
K-Nearest Neighbor
➢Normalization and Standardization:
• When the independent variables in the training data are measured in different units, it is important to scale the variables before calculating distances.
• For example, if one variable is based on height in cm and the other is based on weight in kg, then height will influence the distance calculation more.
• Scaling the variables can be done by either of the following methods:
Normalization          Standardization
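The two scaling formulas referenced on the slide are the standard ones:
Normalization (min-max): $x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$
Standardization (z-score): $x' = \dfrac{x - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the mean and standard deviation of the variable.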
17
K-Nearest Neighbor
➢Handling categorical features:
• Hamming distance can be used.
• A number can be assigned to each category (not a good option, since it imposes an artificial ordering on the categories).
18
K-Nearest Neighbor
➢How to find the best K value?
• Cross-validation is a way to find the optimal K value. It estimates the validation error rate by holding out a subset of the training set from the model-building process.
• Cross-validation (say, 10-fold validation) involves randomly dividing the training set into 10 groups, or folds, of approximately equal size. 90% of the data is used to train the model and the remaining 10% to validate it. The error rate is then computed on the 10% validation data. This procedure repeats 10 times, each time with a different fold, and results in 10 estimates of the validation error, which are then averaged.
• The process is repeated for different values of K, and the value of K that yields the smallest average error is selected (see the sketch below).
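A minimal sketch of this search, assuming scikit-learn is available; the dataset (iris) and the candidate range of K values are purely illustrative:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)           # toy dataset, just for illustration

best_k, best_accuracy = None, -np.inf
for k in range(1, 31):                      # candidate K values (illustrative range)
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    if scores.mean() > best_accuracy:       # smallest average error = highest average accuracy
        best_k, best_accuracy = k, scores.mean()

print(best_k, 1 - best_accuracy)            # chosen K and its average error rate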
19
Applying KNN on Big Data
20
Applying KNN on Big Data
➢Despite the promising results shown by k-NN in a wide variety of problems, it lacks the scalability needed to address Big Data sets.
➢The main problems in dealing with large-scale data are:
• Runtime: The complexity of the traditional k-NN algorithm is O(n · D), where n is the number of instances and D the number of features.
• Memory consumption: For rapid computation of the distances, the k-NN model normally requires the training data to be stored in memory. When the training set (TR) is too big, it can easily exceed the available RAM.
21
Applying KNN on Big Data
➢These drawbacks motivate the use of Big Data techniques to distribute the processing of KNN over a cluster of nodes.
22
Applying KNN on Big Data
➢First, the training data will be divided into multiple splits.
➢The map phase will determine the k nearest neighbors within each split of the data.
➢As a result, each map will emit its k nearest neighbors, together with their computed distance values, to the reduce phase.
23
Applying KNN on Big Data
➢Afterwards, the reduce phase will compute the definitive neighbors from the lists obtained in the map phase.
➢That is, the reduce phase determines the final k nearest neighbors from the candidates provided by the maps (a sketch of both phases follows).
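A schematic sketch of the two phases in Python, not tied to any particular MapReduce framework; the function and variable names are illustrative:

import heapq
import numpy as np

def knn_map(X_split, y_split, x_test, k):
    # Map phase: find the k nearest neighbors of the test point inside one split
    # and emit (distance, class label) pairs for them.
    distances = np.sqrt(((X_split - x_test) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return [(float(distances[i]), y_split[i]) for i in nearest]

def knn_reduce(candidate_lists, k):
    # Reduce phase: merge the candidates emitted by all maps, keep the k with the
    # smallest distances, and take a majority vote over their class labels.
    candidates = [pair for lst in candidate_lists for pair in lst]
    final_k = heapq.nsmallest(k, candidates, key=lambda p: p[0])
    labels = [label for _, label in final_k]
    return max(set(labels), key=labels.count)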
24
Naïve Bayes classifier
25
Naïve Bayes
➢Naïve Bayes is a probabilistic classification method based on Bayes’
theorem (or Bayes’ law).
➢There are ways to convert continuous variables into categorical ones; this process is often referred to as the discretization of continuous variables.
28
Naïve Bayes: Bayes’ Theorem
➢A more general form of Bayes’ theorem assigns a class label to an object with multiple attributes A = {a1, a2, ..., am} such that the label corresponds to the largest value of P(ci | A).
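Written out (the standard form of Bayes’ theorem, matching the slide’s notation):
$P(c_i \mid A) = \dfrac{P(A \mid c_i)\, P(c_i)}{P(A)}$, and the object is assigned the class $c_i$ with the largest $P(c_i \mid A)$.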
29
Naïve Bayes classifier
➢With two simplifications, Bayes’ theorem can be extended to become the naïve Bayes classifier.
➢The first simplification is the conditional independence assumption: each attribute is conditionally independent of every other attribute given a class label ci.
➢The second simplification is to ignore the denominator P(a1, a2, ..., am). Because it appears in the denominator for all values of i, removing it has no impact on the relative probability scores and simplifies the calculations.
30
Naïve Bayes classifier
➢Naïve Bayes classification applies the two simplifications mentioned
earlier and, as a result, P (ci | a1, a2,..am) is proportional to the product
of P (aj |ci) times P (ci).
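In symbols:
$P(c_i \mid a_1, a_2, \ldots, a_m) \propto P(c_i) \prod_{j=1}^{m} P(a_j \mid c_i)$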
31
Naïve Bayes classifier
➢Building a naïve Bayes classifier requires knowing certain statistics, all
calculated from the training set.
➢The first requirement is to collect the probabilities of all class labels,
P(ci).
➢The second thing the naïve Bayes classifier needs to know is the conditional probability of each attribute aj given each class label ci, namely P(aj | ci). For each attribute and each of its possible values, the conditional probability given each class label must be computed.
32
Naïve Bayes classifier
➢For a given attribute, assume it can take the values {x, y, z} and that there are two class labels, c1 and c2.
➢Then the following probabilities need to be computed:
• P(x | c1)
• P(x | c2)
• P(y | c1)
• P(y | c2)
• P(z | c1)
• P(z | c2)
33
Naïve Bayes classifier
➢After that, the naïve Bayes classifier can be tested on the testing set.
➢For each record in the testing set, the naïve Bayes classifier assigns the class label ci that maximizes:
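(The slide’s equation is not reproduced in this text; the standard naïve Bayes decision score, consistent with the symbols listed below, is:)
$P(c_i) \cdot \prod_{j=1}^{m} P(a_j \mid c_i)$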
where:
• m is the number of features (dimensions)
• i is the index of class labels
• j is the index of features (dimensions)
• a1 is the value of the first feature in the test record
• a2 is the value of the second feature in the test record, and so on.
34
Applying Naïve Bayes classifier
on Big Data
35
Applying Naïve Bayes classifier on Big Data
➢Applying MapReduce to the naïve Bayes classifier significantly decreases computation time, allowing its application to Big Data problems.
➢First, the training data will be divided into multiple splits.
➢During the map phase, each map processes a single split and computes statistics of the input data.
➢For each attribute, the map outputs a <Key, Value> pair, where:
• Key is the class label,
• Value is {AttributeValue, the frequency of that attribute value within the class label}
36
Applying Naïve Bayes classifier on Big Data
➢In the reduce phase, the reduce function aggregates the counts of each attribute value within each class label.
➢For each attribute, the reduce function outputs a <Key, Value> pair, where:
• Key is the class label,
• Value is {AttributeValue, the total frequency of that attribute value within the class label, summed over the i mappers},
where i is the number of mappers (a sketch of both phases follows).
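A schematic sketch of the two functions in Python, framework-agnostic and with illustrative names; each input record is assumed to be a (class label, attribute dictionary) pair:

from collections import Counter

def nb_map(split_records):
    # Map phase: within one split, count how often each attribute value occurs
    # per class label. Each record is (class_label, {attribute: value, ...}).
    counts = Counter()
    for label, attributes in split_records:
        for attribute, value in attributes.items():
            counts[(label, attribute, value)] += 1
    # Emit <Key, Value> pairs: Key = class label, Value = (attribute, value, frequency)
    return [(label, (attribute, value, freq))
            for (label, attribute, value), freq in counts.items()]

def nb_reduce(class_label, values):
    # Reduce phase: sum, over all mappers, the frequencies of each attribute
    # value within the given class label.
    totals = Counter()
    for attribute, value, freq in values:
        totals[(attribute, value)] += freq
    return [(class_label, (attribute, value, total))
            for (attribute, value), total in totals.items()]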
37
Performance Evaluation of
classifiers
38
Performance Evaluation of classifiers
➢A confusion matrix is a specific table layout that allows visualization of
the performance of a classifier.
➢The following figure shows the confusion matrix for a two-class
classifier:
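The figure itself is not reproduced in this text; one common layout of the two-class confusion matrix (actual classes on rows, predicted classes on columns) is:

                     Predicted Positive      Predicted Negative
Actual Positive      True Positive (TP)      False Negative (FN)
Actual Negative      False Positive (FP)     True Negative (TN)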
39
Performance Evaluation of classifiers
➢True positives (TP) are the number of positive instances the classifier
correctly identified as positive.
➢False positives (FP) are the number of instances the classifier identified as positive but that in reality are negative.
➢True negatives (TN) are the number of negative instances the classifier
correctly identified as negative.
➢False negatives (FN) are the number of instances classified as negative but that in reality are positive.
40
Performance Evaluation of classifiers
➢The accuracy (or the overall success rate) is a metric defining the rate
at which a model has classified the records correctly.
➢It is defined as the sum of TP and TN divided by the total number of
instances, as shown in the following equation:
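In symbols (standard definition, matching the description above):
$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$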
41
Performance Evaluation of classifiers
➢The false positive rate (FPR) shows what percent of negatives the
classifier marked as positive.
➢The FPR is also called the false alarm rate or the type I error rate.
➢FPR is computed as follows:
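In symbols (standard definition, matching the description above):
$FPR = \dfrac{FP}{FP + TN}$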
42
Performance Evaluation of classifiers
➢Precision is the percentage of instances marked positive that really are
positive. It is computed as follows:
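In symbols (standard definition, matching the description above):
$\text{Precision} = \dfrac{TP}{TP + FP}$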
43
Performance Evaluation of classifiers
➢F1-score is the harmonic mean of the precision and recall:
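In symbols (standard definitions):
$F_1 = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$, where $\text{Recall} = \dfrac{TP}{TP + FN}$.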
44
Performance Evaluation of classifiers – Multi classes
➢Micro-average and Macro-average:
• The micro-average precision and recall scores are calculated from the sum of
classes’ true positives (TPs), false positives (FPs), and false negatives (FNs) of the
model.
• The macro-average precision and recall scores are calculated as the arithmetic mean (or weighted mean) of the individual classes’ precision and recall scores.
• The macro-average F1-score is calculated as the arithmetic mean (or weighted mean) of the individual classes’ F1-scores (a worked sketch follows).
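A small worked sketch in Python; the per-class TP/FP/FN counts are made up purely for illustration:

# Hypothetical per-class counts: {class: (TP, FP, FN)}
counts = {"A": (30, 10, 5), "B": (20, 5, 15), "C": (10, 5, 10)}

# Micro-average: pool the TPs, FPs and FNs of all classes first.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)

# Macro-average: compute precision/recall per class, then take the arithmetic mean.
precisions = [t / (t + f) for t, f, _ in counts.values()]
recalls = [t / (t + f) for t, _, f in counts.values()]
macro_precision = sum(precisions) / len(precisions)
macro_recall = sum(recalls) / len(recalls)

print(micro_precision, micro_recall, macro_precision, macro_recall)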
45
Performance Evaluation of classifiers – Multi classes
➢Exercise:
46
Performance Evaluation of classifiers – Multi classes
➢Exercise:
48
Thank You
49