
We acknowledge the Australian Aboriginal and Torres Strait Islander peoples as the traditional owners of the lands and waters where we live and work.

Week 2-Day 1: Introduction to Data Mining


by
Jagakala Thankaraj (Jaga)
Systems Engineer at Hewlett Packard Enterprise & Casual Tutor at James Cook University
Discussion Topics
• Bias-Variance Trade-off
• Bayes Classifiers: Naïve Bayes
• Linear Discriminant Analysis
• Quadratic Discriminant Analysis
• Cross-Validation
• Model Evaluation: Confusion Matrix (Introduction)
Supervised Learning
Prediction Errors: Bias-Variance Tradeoff
Data visualization in tabular format: the training dataset from ISLR Chapter 1.
Prediction Errors: Bias-Variance Tradeoff
A good model must emulate the data closely. One way to summarize this is to compute the average 'gap' between the observed and estimated/predicted values.
1. Compute from the training data:
$(y_1 - \hat{f}(x_1))^2, (y_2 - \hat{f}(x_2))^2, \ldots, (y_n - \hat{f}(x_n))^2$; averaging these gives the training MSE.
Easy to compute, but how useful?
2. But we are more interested in
$E\,(y_0 - \hat{f}(x_0))^2$, the expected squared error for a previously unseen test observation $(x_0, y_0)$: the test MSE.
This is often harder to compute, and sometimes not possible.
3. What is test data? Strategies that often work (a minimal sketch follows this list):
a. Train/test split.
b. Or we set aside some data at random (random sub-sampling), e.g. cross-validation.
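As a minimal sketch of strategy (a), here is how a random train/test split might look with scikit-learn (referenced later in these notes); the data below is synthetic and purely illustrative:

    from sklearn.model_selection import train_test_split
    import numpy as np

    # Synthetic data: 100 observations, 3 predictors
    X = np.random.rand(100, 3)
    y = np.random.rand(100)

    # Hold out 20% of the observations as a test set, chosen at random
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)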
Prediction Errors: Bias-Variance Tradeoff
A model that exhibits small variance and high bias will underfit the target, while a model with high variance and little bias will
overfit the target. How to minimize these errors?
What is MSE, and why is MSE important in the bias-variance trade-off?
MSE can be written as the sum of the variance of the estimator and the squared bias of the estimator, providing a useful way to
calculate the MSE and implying that in the case of unbiased estimators, the MSE and variance are equivalent.
Fundamental equation of the test MSE:
$E\,(y_0 - \hat{f}(x_0))^2 = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\epsilon)$
$\mathrm{Var}(\hat{f}(x_0))$: error due to sampling; the source of overfitting (model too complex and too noisy).
This is the error caused by using a different training data set (or sample): how much $\hat{f}(x_0)$ would change if we re-estimated it on new training data.
A large training set helps reduce the test variance.
A simpler (less flexible) $\hat{f}$ usually has a lower variance on test data; as the model flexibility increases, the variance part of the test MSE increases.
$[\mathrm{Bias}(\hat{f}(x_0))]^2$: error due to modelling; the source of underfitting (the model overlooks irregularities in the data).
This is the error due to the simplification of reality by a mathematical model, e.g. if the data is non-linear but we fit a linear regression, or the data becomes skewed after transformations.
Even if you have a large training sample, you can't reduce the bias if $\hat{f}$ is very different from the true $f$.
But if $\hat{f}$ is too complex (fits the training data too well), it will have a poor test MSE.
$\mathrm{Var}(\epsilon)$: the irreducible error.
This is the lower bound, the minimum achievable value, of the test MSE.
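To make the decomposition concrete, here is a small simulation sketch (assuming numpy; the true function sin, the noise level, and the test point x0 are arbitrary choices for illustration). It refits a rigid and a flexible polynomial on many resampled training sets and estimates the variance and squared bias of the prediction at x0 directly:

    import numpy as np

    rng = np.random.default_rng(0)
    f = np.sin                   # the true function f (arbitrary choice)
    sigma = 0.3                  # noise sd, so Var(eps) = sigma**2
    x0, n, reps = 1.0, 30, 500   # test point, training size, number of resamples

    for degree in (1, 10):       # a rigid model vs a flexible one
        preds = []
        for _ in range(reps):
            x = rng.uniform(0, 2 * np.pi, n)       # a fresh training sample
            y = f(x) + rng.normal(0, sigma, n)
            coefs = np.polyfit(x, y, degree)       # estimate f-hat
            preds.append(np.polyval(coefs, x0))    # f-hat(x0)
        preds = np.array(preds)
        variance = preds.var()                     # Var(f-hat(x0))
        bias_sq = (preds.mean() - f(x0)) ** 2      # Bias(f-hat(x0))^2
        print(degree, variance, bias_sq, variance + bias_sq + sigma**2)

The rigid model (degree 1) shows high bias and low variance; the flexible model (degree 10) shows the reverse, matching the decomposition above.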
Prediction Errors: Bias-Variance Tradeoff
As the model complexity increases, on the test data:
1. Bias: $[\mathrm{Bias}(\hat{f})]^2$ decreases rapidly and then saturates. Why?
2. Variance: $\mathrm{Var}(\hat{f})$ increases gradually at first, and then sharply. Why?
3. The test MSE initially decreases (due to 1), reaches a minimum, and then starts increasing (due to 2).
4. How do we attain the sweet spot? By choosing a middling level of flexibility?
Source: ISLR Chapter 2
As the flexibility of a design increases, the usability and performance of the design decrease.
Prediction Errors: Bias-Variance Tradeoff
More parameters always improve the fit to a data set, just as more pixels on a camera always improve the realism of the photo. With enough parameters, the model can interpolate the data, meaning the training error is zero. This is called the interpolation threshold, and it happens when the number of parameters equals the number of examples, allowing the examples to be fit perfectly. You can add still more parameters, but the additional parameters cannot reduce the training error because it is already zero.

However, if these models are used to predict a different sample of data, such as the test data, then the error typically increases as the interpolation threshold is approached. Plotting the error on the test data set against the number of parameters typically results in a U-shaped curve. (The regime beyond the threshold, where some heavily over-parameterized models again start to perform well on the test data, is known as double descent.)

To minimize test error under this classical U-shaped picture, the optimal number of parameters lies between 0 and the interpolation threshold.
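A quick numerical illustration of the interpolation threshold (a sketch assuming numpy; the data is random): a polynomial with as many coefficients as there are data points fits them exactly, driving the training error to zero.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 8
    x = rng.uniform(0, 1, n)
    y = rng.normal(size=n)

    # n points, n polynomial coefficients: solving the Vandermonde system
    # interpolates the data exactly -- the interpolation threshold.
    coefs = np.linalg.solve(np.vander(x), y)
    print(np.max(np.abs(np.vander(x) @ coefs - y)))   # ~0: zero training error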
Prediction Errors: Bias-Variance Tradeoff
Underfitting (Bias)
• Main reasons: more outliers; model assumptions fail.
• A few ways to reduce: increase the model complexity; increase the features.
• ML techniques: transformations.
Overfitting (Variance)
• Main reasons: fewer outliers; the model fits all the assumptions.
• A few ways to reduce: simplify the model; feature selection (variable importance).
• ML techniques: cross-validation, regularization.

In the figure at the right, each point represents one model trained on the data. The centre of the area that the points occupy represents the bias, and the degree of dispersal of the points represents the variance.
The centre of the target represents the region of zero error, where the model predicts the correct value. As we move away from the centre, the error of the model increases and its predictions get worse.
Probability Theory
Probability measures how likely an event is to occur, and its value always lies between 0 and 1 (inclusive, with 0 meaning impossibility and 1 meaning certainty).
Example
A simple example is the tossing of a fair (unbiased) coin. Since the coin is fair, the two outcomes, “heads” and
“tails,” are both equally probable. Since no other outcomes are possible, the probability of either “heads” or
“tails” is 0.5 or 50%.
Conditional probability: $P(A \mid B) = P(A \cap B) / P(B)$, the probability that A occurs given that B has occurred.
Probability Theory
Independence
Two events A and B are independent if the occurrence of one does not affect the probability of the other: $P(A \cap B) = P(A)\,P(B)$, equivalently $P(A \mid B) = P(A)$.
Example
For example, let's say you rolled a die and flipped a coin. The probability of getting any number face on the die in no way influences the probability of getting a head or a tail on the coin; a quick simulation is sketched below.
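A quick simulation sketch of this (assuming only the Python standard library; the trial count is arbitrary): the relative frequency of heads is about 0.5 whether or not we condition on the die showing a six.

    import random

    random.seed(0)
    trials = 100_000
    heads = sixes = heads_and_six = 0
    for _ in range(trials):
        die = random.randint(1, 6)          # roll a fair die
        coin = random.choice("HT")          # flip a fair coin
        heads += coin == "H"
        sixes += die == 6
        heads_and_six += (coin == "H") and (die == 6)

    print(heads / trials)            # P(head) is approximately 0.5
    print(heads_and_six / sixes)     # P(head | six) is also approximately 0.5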
Probability Theory
Naïve Bayes Classifier
A supervised, probabilistic classifier based on Bayes' theorem.
Bayes' theorem
Bayes' theorem, due to Reverend Thomas Bayes, concerns conditional probability, written P(A|B): the probability of A given that B occurred (B is also called the evidence or predictor). We encounter a new observation for which we know the values of the predictors X but not the class Y, so we would like to make a guess about Y based on the information we have (our sample). The key insight of Bayes' theorem is that the probability of an event can be updated as new data is introduced.

Parameter estimation for naive Bayes models uses the method of maximum likelihood. Given data, the maximum likelihood estimate (MLE) for a parameter p is the value of p that maximizes the likelihood P(data | p); that is, the MLE is the value of p under which the data is most likely. For example, for 55 heads in 100 coin tosses, $P(\text{55 heads} \mid p) = \binom{100}{55} p^{55} (1 - p)^{45}$, which is maximized at p = 55/100; a numerical check is sketched below.
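A numerical check of this MLE, sketched with scipy (the grid of candidate p values is an arbitrary illustration):

    import numpy as np
    from scipy.stats import binom

    # Likelihood of observing 55 heads in 100 tosses, over a grid of p values
    p_grid = np.linspace(0.01, 0.99, 99)
    likelihood = binom.pmf(55, 100, p_grid)
    print(p_grid[np.argmax(likelihood)])   # ~0.55, i.e. the MLE is 55/100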

Bayes' theorem terminology:
Prior: probability distribution representing knowledge or uncertainty about a data object prior to (before) observing it.
Posterior: conditional probability distribution representing which parameters are likely after observing the data object.
Likelihood: the probability of the data falling under a specific category or class.
Naïve Bayes Classifier
Predictors x1, x2, ..., xp = Class, Sex, ..., Age (as in the Titanic survival example).
The prior probability of the outcome: based on the training data, what is the probability of a person surviving or not? A minimal sketch follows.
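Since the prior is just the class frequency in the training sample, it can be computed directly; a minimal sketch with hypothetical labels (1 = survived, 0 = did not survive):

    import numpy as np

    # Hypothetical training labels, invented for illustration only
    y_train = np.array([0, 1, 0, 0, 1, 0, 1, 0, 0, 1])

    prior_survived = y_train.mean()             # fraction of survivors in training data
    print(prior_survived, 1 - prior_survived)   # P(survived), P(did not survive)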
Naïve Bayes Classifier
Why Naïve?
Because of the strong (conditional) independence assumption. Is this plausible in the real world? E.g. X = age, exercise, gender, weight, ...; Y = diabetes (yes/no)?
The Bayes classifier chooses the group with the highest posterior probability p(Ck | X); the posteriors are normalized to sum to 1.
Naïve Bayes Classifier
Assumptions:
1) Predictors are conditionally independent given the class. (A chi-squared test with p < 0.05 would support Ha, i.e. dependence.)
2) It can handle both discrete (categorical) and continuous (numeric) variables.
3) It also assumes that all features contribute equally to the outcome (equal weight), and it is well suited to real-time data.
Naïve Bayes classifier types:
Factor variables: Multinomial Naive Bayes classifier
Factor variables are categorical variables that can be either numeric or string variables. A "factor" is a vector whose elements can take on one of a specific set of values (levels).
The multinomial classifier uses frequencies from the training data to calculate the probability of each predictor within each class. Example: deciding whether a document belongs to the category of sports, politics, technology, etc. The features/predictors used by the classifier are the frequencies of the words present in the document (see the sketch below).
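A minimal sketch of this document-classification use case with scikit-learn's MultinomialNB (the toy documents and labels are invented for illustration):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["the team won the match", "parliament passed the bill",
            "new phone model released today", "the election results are in"]
    labels = ["sports", "politics", "technology", "politics"]

    vec = CountVectorizer()                  # word frequencies as features
    X = vec.fit_transform(docs)
    model = MultinomialNB().fit(X, labels)
    print(model.predict(vec.transform(["who won the election"])))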
Naïve Bayes Classifier
Bernoulli Naive Bayes:
This is similar to multinomial naive Bayes, but the predictors are Boolean variables. The parameters used to predict the class variable take only the values yes or no, for example whether a word occurs in the text or not.
Continuous variables: Gaussian Naive Bayes classifier
Uses a Gaussian or kernel (non-parametric) density estimate to calculate the likelihood of each predictor within a class. Both variants are sketched below.
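Both variants are available in scikit-learn; a minimal sketch on synthetic data (the feature values and class separation are arbitrary):

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB, GaussianNB

    rng = np.random.default_rng(0)
    y = np.repeat([0, 1], 50)               # two classes, 50 samples each

    # Gaussian NB: continuous predictors, per-class normal likelihoods
    X_cont = rng.normal(size=(100, 2)) + np.repeat([[0, 0], [2, 2]], 50, axis=0)
    print(GaussianNB().fit(X_cont, y).predict([[1.8, 2.1]]))

    # Bernoulli NB: Boolean predictors (e.g. word present / absent)
    X_bool = rng.integers(0, 2, size=(100, 5))
    print(BernoulliNB().fit(X_bool, y).predict([[1, 0, 1, 1, 0]]))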
In a nutshell, naive Bayes algorithms are mostly used in sentiment analysis, spam filtering, recommendation systems, etc. They are fast and easy to implement, but their biggest disadvantage is the requirement that the predictors be independent. Does naive Bayes perform well even when this assumption is violated?
Linear Discriminant Analysis
LDA: Linear Discriminant Analysis (LDA) is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events.
Fisher’s linear discriminant attempts to find the vector that maximizes the separation between classes
of the projected data. Maximizing “separation” can be ambiguous. The criteria that Fisher’s linear
discriminant follows to do this is to maximize the distance of the projected means and to minimize the
projected within-class variance.
This method projects a dataset onto a lower-dimensional space with good class-separability to avoid
overfitting (“curse of dimensionality- as the number of features increase, our data become sparser,
which results in overfitting, and we therefore need more data to avoid it”), and to reduce computational
cost. Linear Discriminant Analysis or LDA is a dimensionality reduction technique. It is used as a pre-
processing step in machine learning and applications of pattern classification.
It is based on the Gaussian Bayes classifier (normal distribution) and uses the probability density function.
A probability density function (PDF), density function, or density of an absolutely continuous random variable is a function whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample.
Linear Discriminant Analysis
How does LDA work?
LDA focuses primarily on projecting features from a higher-dimensional space onto lower dimensions.
• First, calculate the separability between classes, i.e. the distance between the means of the different classes. This is called the between-class variance.
• Second, calculate the distance between the mean and the samples within each class. This is called the within-class variance.
• Finally, construct the lower-dimensional space P that maximizes the between-class variance and minimizes the within-class variance; this objective is known as Fisher's criterion.

LDA reduces the dimensionality from the original number of features to C − 1 features, where C is the number of classes. In this case we have 3 classes, therefore the new feature space will have only 2 features, as in the sketch below.
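A minimal sketch using scikit-learn and the built-in iris data (3 classes, 4 features), showing the projection onto C − 1 = 2 dimensions:

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_iris(return_X_y=True)        # 150 samples, 4 features, 3 classes

    # With C = 3 classes, LDA can project onto at most C - 1 = 2 dimensions
    lda = LinearDiscriminantAnalysis(n_components=2)
    X_2d = lda.fit_transform(X, y)
    print(X_2d.shape)                        # (150, 2)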
Linear Discriminant Analysis
How to prepare data for LDA?
Some suggestions you should keep in mind while preparing your data to build your LDA model (a pipeline sketch follows this list):
• LDA is mainly used in classification problems where you have a categorical output variable. It allows both binary classification and multi-class classification.
• The standard LDA model assumes a Gaussian distribution for the input variables. You should check the univariate distribution of each attribute and transform it into a more Gaussian-looking distribution; for example, for an exponential distribution, use a log or root transform.
• Outliers can skew the basic statistics used to separate classes in LDA, so it is preferable to remove them.
• Since LDA assumes that each input variable has the same variance, it is always better to standardize your data before using an LDA model: make the mean 0 and the standard deviation 1.
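A pipeline sketch of the standardization advice, using scikit-learn (the wine data is an arbitrary built-in example):

    from sklearn.datasets import load_wine
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_wine(return_X_y=True)

    # Standardize every input variable (mean 0, sd 1) before fitting LDA
    clf = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis())
    clf.fit(X, y)
    print(clf.score(X, y))                   # training accuracy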
Linear Discriminant Analysis
Assumptions of LDA
1) Predictors are normally distributed (no outliers); LDA is a parametric model. Normality can be checked analytically with the Shapiro-Wilk test, or graphically with a histogram or QQ plot (a sketch follows below).
Parametric tests are those that make assumptions about the parameters of the population distribution from which the sample is drawn; this is often the assumption that the population data are normally distributed. Non-parametric tests are "distribution-free" and, as such, can be used for non-normal variables.
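A minimal sketch of the Shapiro-Wilk check with scipy (the normal and exponential samples are synthetic):

    import numpy as np
    from scipy.stats import shapiro

    rng = np.random.default_rng(0)
    normal_sample = rng.normal(size=200)
    skewed_sample = rng.exponential(size=200)

    # Null hypothesis: the sample was drawn from a normal distribution
    stat, p = shapiro(normal_sample)
    print(p)    # large p: no evidence against normality
    stat, p = shapiro(skewed_sample)
    print(p)    # tiny p: normality rejected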
Linear Discriminant Analysis
2) The variances among the predictors (X) are the same across the levels of the response variable (Y = 0, 1); that is, the classes share a common covariance matrix. This is what makes LDA linear and distinguishes it from QDA (where the covariance matrix is not identical across classes). It can be checked using standard deviations or an F-test.
To assess variability in a box-and-whisker plot, remember that half of the data for each group falls within the interquartile box. The longer the box and whiskers, the greater the variability of the distribution, and the total length of the whiskers represents the range of the data. In the plot below, Group 2 has more variability than Group 1 because it has a longer box and whiskers: Group 1 ranges from approximately 3 to 7, while Group 2 ranges from roughly 1.5 to 9.
3) The predictors cannot be categorical variables; they must be continuous variables.


Quadratic Discriminant Analysis
QDA follows the same model as LDA, except that it drops the assumption that the class conditional distributions must
share a common covariance matrix. In other words, QDA models the distribution of each class by means of an
independent multivariate Normal probability density function.
1) Predictors are normally distributed (no outliers); QDA is a parametric model.
2) Each class can have its own covariance matrix at each level of the response variable (Y).
3) The predictors cannot be categorical variables; they must be continuous variables.
The main implications of allowing class conditional distributions with different covariances are twofold. On the one
hand, the model is more flexible in the sense that its assumptions are less restrictive in practice (i.e., they are more
likely to be met, at least approximately) as well as in the sense that the resulting decision boundary is no longer
necessarily linear (it can be non-linear). On the other hand, there is the need to estimate multiple covariance
matrices, which means that the number of parameters to be estimated is larger, thus making the model and its
training mathematically and computationally more complex as well as more prone to overfitting.
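A minimal sketch contrasting the two with scikit-learn (synthetic data; the accuracies themselves are not meaningful, only the shared interface):

    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=4, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # LDA shares one covariance matrix across classes; QDA fits one per class
    print(LinearDiscriminantAnalysis().fit(X_tr, y_tr).score(X_te, y_te))
    print(QuadraticDiscriminantAnalysis().fit(X_tr, y_tr).score(X_te, y_te))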
Cross-Validations
Why cross-validation?
The most popular cross-validation is k-fold cross-validation.
In K-fold cross-validation, the dataset is split into K folds, which are used to evaluate the model's ability on new data. K refers to the number of groups the data sample is split into; for example, if K is 5, we call it 5-fold cross-validation.
In scikit-learn, when the cv argument is an integer, cross_val_score uses the K-Fold or Stratified K-Fold strategy by default.
A few other cross-validation techniques:
• Repeated K-Fold: repeats K-Fold n times, producing different splits in each repetition.
• Leave-One-Out Cross-Validation: leave out one data point and build the model on the rest of the data set.
Cross-Validations
The k-fold cross-validation method evaluates the model performance on different subsets of the training data and then calculates the average prediction error rate. The algorithm is as follows (a sketch follows this list):
1. Randomly split the data set into k subsets (folds), for example 5 subsets.
2. Reserve one subset and train the model on all the other subsets.
3. Test the model on the reserved subset and record the prediction error.
4. Repeat this process until each of the k subsets has served as the test set.
5. Compute the average of the k recorded errors. This is called the cross-validation error, and it serves as the performance metric for the model.
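A sketch of these five steps with scikit-learn's KFold (synthetic regression data; linear regression is an arbitrary model choice):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold

    X, y = make_regression(n_samples=100, n_features=3, noise=10, random_state=0)

    errors = []
    kf = KFold(n_splits=5, shuffle=True, random_state=0)                # step 1
    for train_idx, test_idx in kf.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])      # step 2
        pred = model.predict(X[test_idx])                               # step 3
        errors.append(np.mean((y[test_idx] - pred) ** 2))  # record the error
    print(np.mean(errors))   # step 5: the cross-validation error (MSE)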
Cross-Validations
In leave-one-out (LOO) cross-validation, we train our machine-learning model n times, where n is the size of our dataset. Each time, only one sample is used as the test set while the rest are used to train our model. LOO is thus an extreme case of k-fold with k = n: if we apply LOO to the previous example, we will have 6 test subsets (a sketch follows).
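A LOO sketch on a toy dataset of n = 6 samples (so there are 6 folds), using scikit-learn:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    # Six samples, so LOO produces six train/test splits (k = n = 6)
    X, y = make_regression(n_samples=6, n_features=2, noise=5, random_state=0)

    scores = cross_val_score(LinearRegression(), X, y,
                             cv=LeaveOneOut(), scoring="neg_mean_squared_error")
    print(len(scores))       # 6 folds, one per sample
    print(-scores.mean())    # average LOO prediction error (MSE)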
Confusion Matrix: Model Evaluation
Key factors to look out for: sensitivity and specificity (computed in the sketch below).
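A minimal sketch of computing these from a binary confusion matrix with scikit-learn (the labels are invented for illustration):

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)    # true positive rate (recall)
    specificity = tn / (tn + fp)    # true negative rate
    print(sensitivity, specificity)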
References
Textbook: An Introduction to Statistical Learning with Applications in R, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.
Textbook online reference: https://www.stat.berkeley.edu/users/rabbee/s154/ISLR_First_Printing.pdf
https://github.com/MarthaCooper/
https://www.stat.cmu.edu
https://www.geeksforgeeks.org/
https://medium.com/
https://towardsdatascience.com/
https://scikit-learn.org
