
We acknowledge the Australian Aboriginal and Torres Strait Islander peoples as the traditional owners of the lands and waters where we live and work.

Week 2-Day 1: Introduction to Data Mining


by
Jagakala Thankaraj (Jaga)
Systems Engineer at Hewlett Packard Enterprise & Casual Tutor at James Cook University
Discussion Topics
• Bias-Variance Trade-off
• Bayes Classifiers: Naïve Bayes
• Linear Discriminant Analysis
• Quadratic Discriminant Analysis
• Cross-Validation
• Model Evaluation: Confusion Matrix (Introduction)
Supervised Learning
Prediction Errors: Bias-Variance Tradeoff
Data visualization in tabular format: the training dataset from ISLR Chapter 1.
Prediction Errors: Bias-Variance Tradeoff
A good model must emulate the data closely. One way to summarize this is to compute the average 'gap' between the observed and estimated/predicted values.
1. Compute from the training data:
$(y_1 - \hat{f}(x_1))^2, (y_2 - \hat{f}(x_2))^2, \ldots, (y_n - \hat{f}(x_n))^2$; averaging these gives the training MSE.
Easy to compute, but how useful?
2. But we are more interested in
$E\,(y_0 - \hat{f}(x_0))^2$, the expected squared error for a previously unseen test observation $(x_0, y_0)$: the test MSE.
This is often harder to compute, and sometimes not possible.
3. What is test data? Strategies that often work (a minimal sketch follows this list):
a. Train/test split.
b. Or we set aside some data at random (random sub-sampling), e.g. cross-validation.
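As a minimal sketch of strategy (a), here is how a random train/test split might look with scikit-learn (referenced later in these notes); the data below is synthetic and purely illustrative:

    from sklearn.model_selection import train_test_split
    import numpy as np

    # Synthetic data: 100 observations, 3 predictors
    X = np.random.rand(100, 3)
    y = np.random.rand(100)

    # Hold out 20% of the observations as a test set, chosen at random
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)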
Prediction Errors: Bias-Variance Tradeoff
A model that exhibits small variance and high bias will underfit the target, while a model with high variance and little bias will
overfit the target. How to minimize these errors?
What is MSE, and why is MSE important in the bias-variance trade-off?
MSE can be written as the sum of the variance of the estimator and the squared bias of the estimator, providing a useful way to
calculate the MSE and implying that in the case of unbiased estimators, the MSE and variance are equivalent.
Fundamental equation of the test MSE:
$E\,(y_0 - \hat{f}(x_0))^2 = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\epsilon)$
$\mathrm{Var}(\hat{f}(x_0))$: error due to sampling; the source of overfitting (model too complex and too noisy).
This is the error caused by using a different training data set (or sample): how much $\hat{f}(x_0)$ would change if we re-estimated it on new training data.
A large training set helps reduce the test variance.
A simpler (less flexible) $\hat{f}$ usually has a lower variance on test data; as the model flexibility increases, the variance part of the test MSE increases.
$[\mathrm{Bias}(\hat{f}(x_0))]^2$: error due to modelling; the source of underfitting (the model overlooks irregularities in the data).
This is the error due to the simplification of reality by a mathematical model, e.g. if the data is non-linear but we fit a linear regression, or the data becomes skewed after transformations.
Even if you have a large training sample, you can't reduce the bias if $\hat{f}$ is very different from the true $f$.
But if $\hat{f}$ is too complex (fits the training data too well), it will have a poor test MSE.
$\mathrm{Var}(\epsilon)$: the irreducible error.
This is the lower bound, the minimum achievable value, of the test MSE.
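To make the decomposition concrete, here is a small simulation sketch (assuming numpy; the true function sin, the noise level, and the test point x0 are arbitrary choices for illustration). It refits a rigid and a flexible polynomial on many resampled training sets and estimates the variance and squared bias of the prediction at x0 directly:

    import numpy as np

    rng = np.random.default_rng(0)
    f = np.sin                   # the true function f (arbitrary choice)
    sigma = 0.3                  # noise sd, so Var(eps) = sigma**2
    x0, n, reps = 1.0, 30, 500   # test point, training size, number of resamples

    for degree in (1, 10):       # a rigid model vs a flexible one
        preds = []
        for _ in range(reps):
            x = rng.uniform(0, 2 * np.pi, n)       # a fresh training sample
            y = f(x) + rng.normal(0, sigma, n)
            coefs = np.polyfit(x, y, degree)       # estimate f-hat
            preds.append(np.polyval(coefs, x0))    # f-hat(x0)
        preds = np.array(preds)
        variance = preds.var()                     # Var(f-hat(x0))
        bias_sq = (preds.mean() - f(x0)) ** 2      # Bias(f-hat(x0))^2
        print(degree, variance, bias_sq, variance + bias_sq + sigma**2)

The rigid model (degree 1) shows high bias and low variance; the flexible model (degree 10) shows the reverse, matching the decomposition above.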
Prediction Errors: Bias-Variance Tradeoff
As the model complexity increases, on the test data:
1. Bias: $[\mathrm{Bias}(\hat{f})]^2$ decreases rapidly and then saturates. Why?
2. Variance: $\mathrm{Var}(\hat{f})$ increases gradually at first, and then sharply. Why?
3. The test MSE initially decreases (due to 1), reaches a minimum, and then starts increasing (due to 2).
4. How do we attain the sweet spot? By choosing a middling level of flexibility?
Source: ISLR Chapter 2
As the flexibility of a design increases, the usability and performance of the design decrease.
Prediction Errors: Bias-Variance Tradeoff
More parameters always improve the fit to a data set, just as more pixels on a camera always improve the realism of the photo. With enough parameters, the model can interpolate the data, meaning the training error is zero. This is called the interpolation threshold, and it happens when the number of parameters equals the number of examples, allowing the examples to be fit perfectly. You can add still more parameters, but the additional parameters cannot reduce the training error because it is already zero.

However, if these models are used to predict a different sample of data, such as the test data, then the error typically increases as the interpolation threshold is approached. Plotting the error on the test data set against the number of parameters typically results in a U-shaped curve. (The regime beyond the threshold, where some heavily over-parameterized models again start to perform well on the test data, is known as double descent.)

To minimize test error under this classical U-shaped picture, the optimal number of parameters lies between 0 and the interpolation threshold.
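A quick numerical illustration of the interpolation threshold (a sketch assuming numpy; the data is random): a polynomial with as many coefficients as there are data points fits them exactly, driving the training error to zero.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 8
    x = rng.uniform(0, 1, n)
    y = rng.normal(size=n)

    # n points, n polynomial coefficients: solving the Vandermonde system
    # interpolates the data exactly -- the interpolation threshold.
    coefs = np.linalg.solve(np.vander(x), y)
    print(np.max(np.abs(np.vander(x) @ coefs - y)))   # ~0: zero training error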
Prediction Errors: Bias-Variance Tradeoff
Underfitting (Bias)
• Main reasons: more outliers; model assumptions fail.
• A few ways to reduce: increase the model complexity; increase the features.
• ML techniques: transformations.
Overfitting (Variance)
• Main reasons: fewer outliers; the model fits all the assumptions.
• A few ways to reduce: simplify the model; feature selection (variable importance).
• ML techniques: cross-validation, regularization.

In the figure at the right, each point represents one model trained on the data. The centre of the area that the points occupy represents the bias, and the degree of dispersal of the points represents the variance.
The centre of the target represents the region of zero error, where the model predicts the correct value. As we move away from the centre, the error of the model increases and its predictions get worse.
Probability Theory
Probability measures how likely an event is to occur, and its value always lies between 0 and 1 (inclusive, with 0 meaning impossibility and 1 meaning certainty).
Example
A simple example is the tossing of a fair (unbiased) coin. Since the coin is fair, the two outcomes, “heads” and
“tails,” are both equally probable. Since no other outcomes are possible, the probability of either “heads” or
“tails” is 0.5 or 50%.
Conditional probability: $P(A \mid B) = P(A \cap B) / P(B)$, the probability that A occurs given that B has occurred.
Probability Theory
Independence
Two events A and B are independent if the occurrence of one does not affect the probability of the other: $P(A \cap B) = P(A)\,P(B)$, equivalently $P(A \mid B) = P(A)$.
Example
For example, let's say you rolled a die and flipped a coin. The probability of getting any number face on the die in no way influences the probability of getting a head or a tail on the coin; a quick simulation is sketched below.
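A quick simulation sketch of this (assuming only the Python standard library; the trial count is arbitrary): the relative frequency of heads is about 0.5 whether or not we condition on the die showing a six.

    import random

    random.seed(0)
    trials = 100_000
    heads = sixes = heads_and_six = 0
    for _ in range(trials):
        die = random.randint(1, 6)          # roll a fair die
        coin = random.choice("HT")          # flip a fair coin
        heads += coin == "H"
        sixes += die == 6
        heads_and_six += (coin == "H") and (die == 6)

    print(heads / trials)            # P(head) is approximately 0.5
    print(heads_and_six / sixes)     # P(head | six) is also approximately 0.5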
Probability Theory
Naïve Bayes Classifier
A supervised, probabilistic classifier based on Bayes' theorem.
Bayes' theorem
Bayes' theorem, due to Reverend Thomas Bayes, concerns conditional probability, written P(A|B): the probability of A given that B occurred (B is also called the evidence or predictor). We encounter a new observation for which we know the values of the predictors X but not the class Y, so we would like to make a guess about Y based on the information we have (our sample). The key insight of Bayes' theorem is that the probability of an event can be updated as new data is introduced.

Parameter estimation for naive Bayes models uses the method of maximum likelihood. Given data, the maximum likelihood estimate (MLE) for a parameter p is the value of p that maximizes the likelihood P(data | p); that is, the MLE is the value of p under which the data is most likely. For example, for 55 heads in 100 coin tosses, $P(\text{55 heads} \mid p) = \binom{100}{55} p^{55} (1 - p)^{45}$, which is maximized at p = 55/100; a numerical check is sketched below.
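A numerical check of this MLE, sketched with scipy (the grid of candidate p values is an arbitrary illustration):

    import numpy as np
    from scipy.stats import binom

    # Likelihood of observing 55 heads in 100 tosses, over a grid of p values
    p_grid = np.linspace(0.01, 0.99, 99)
    likelihood = binom.pmf(55, 100, p_grid)
    print(p_grid[np.argmax(likelihood)])   # ~0.55, i.e. the MLE is 55/100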

Bayes' theorem terminology:
Prior: probability distribution representing knowledge or uncertainty about a data object prior to (before) observing it.
Posterior: conditional probability distribution representing which parameters are likely after observing the data object.
Likelihood: the probability of the data falling under a specific category or class.
Naïve Bayes Classifier
Predictors x1, x2, ..., xp = Class, Sex, ..., Age (as in the Titanic survival example).
The prior probability of the outcome: based on the training data, what is the probability of a person surviving or not? A minimal sketch follows.
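Since the prior is just the class frequency in the training sample, it can be computed directly; a minimal sketch with hypothetical labels (1 = survived, 0 = did not survive):

    import numpy as np

    # Hypothetical training labels, invented for illustration only
    y_train = np.array([0, 1, 0, 0, 1, 0, 1, 0, 0, 1])

    prior_survived = y_train.mean()             # fraction of survivors in training data
    print(prior_survived, 1 - prior_survived)   # P(survived), P(did not survive)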
Naïve Bayes Classifier
Why Naïve?
Because of the strong (conditional) independence assumption. Is this plausible in the real world? E.g. X = age, exercise, gender, weight, ...; Y = diabetes (yes/no)?
The Bayes classifier chooses the group with the highest posterior probability p(Ck | X); the posteriors are normalized to sum to 1.
Naïve Bayes Classifier
Assumptions:
1) Predictors are conditionally independent given the class. (A chi-squared test with p < 0.05 would support Ha, i.e. dependence.)
2) It can handle both discrete (categorical) and continuous (numeric) variables.
3) It also assumes that all features contribute equally to the outcome (equal weight), and it is well suited to real-time data.
Naïve Bayes classifier types:
Factor variables: Multinomial Naive Bayes classifier
Factor variables are categorical variables that can be either numeric or string variables. A "factor" is a vector whose elements can take on one of a specific set of values (levels).
The multinomial classifier uses frequencies from the training data to calculate the probability of each predictor within each class. Example: deciding whether a document belongs to the category of sports, politics, technology, etc. The features/predictors used by the classifier are the frequencies of the words present in the document (see the sketch below).
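A minimal sketch of this document-classification use case with scikit-learn's MultinomialNB (the toy documents and labels are invented for illustration):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["the team won the match", "parliament passed the bill",
            "new phone model released today", "the election results are in"]
    labels = ["sports", "politics", "technology", "politics"]

    vec = CountVectorizer()                  # word frequencies as features
    X = vec.fit_transform(docs)
    model = MultinomialNB().fit(X, labels)
    print(model.predict(vec.transform(["who won the election"])))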
Naïve Bayes Classifier
Bernoulli Naive Bayes:
This is similar to multinomial naive Bayes, but the predictors are Boolean variables. The parameters used to predict the class variable take only the values yes or no, for example whether a word occurs in the text or not.
Continuous variables: Gaussian Naive Bayes classifier
Uses a Gaussian or kernel (non-parametric) density estimate to calculate the likelihood of each predictor within a class. Both variants are sketched below.
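Both variants are available in scikit-learn; a minimal sketch on synthetic data (the feature values and class separation are arbitrary):

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB, GaussianNB

    rng = np.random.default_rng(0)
    y = np.repeat([0, 1], 50)               # two classes, 50 samples each

    # Gaussian NB: continuous predictors, per-class normal likelihoods
    X_cont = rng.normal(size=(100, 2)) + np.repeat([[0, 0], [2, 2]], 50, axis=0)
    print(GaussianNB().fit(X_cont, y).predict([[1.8, 2.1]]))

    # Bernoulli NB: Boolean predictors (e.g. word present / absent)
    X_bool = rng.integers(0, 2, size=(100, 5))
    print(BernoulliNB().fit(X_bool, y).predict([[1, 0, 1, 1, 0]]))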
In a nutshell, naive Bayes algorithms are mostly used in sentiment analysis, spam filtering, recommendation systems, etc. They are fast and easy to implement, but their biggest disadvantage is the requirement that the predictors be independent. Does naive Bayes perform well even when this assumption is violated?
Linear Discriminant Analysis
LDA: Linear Discriminant Analysis (LDA) is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events.
Fisher’s linear discriminant attempts to find the vector that maximizes the separation between classes
of the projected data. Maximizing “separation” can be ambiguous. The criteria that Fisher’s linear
discriminant follows to do this is to maximize the distance of the projected means and to minimize the
projected within-class variance.
This method projects a dataset onto a lower-dimensional space with good class-separability to avoid
overfitting (“curse of dimensionality- as the number of features increase, our data become sparser,
which results in overfitting, and we therefore need more data to avoid it”), and to reduce computational
cost. Linear Discriminant Analysis or LDA is a dimensionality reduction technique. It is used as a pre-
processing step in machine learning and applications of pattern classification.
It is based on the Gaussian Bayes classifier (normal distribution) and uses the probability density function.
A probability density function (PDF), density function, or density of an absolutely continuous random variable is a function whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample.
Linear Discriminant Analysis
How does LDA work?
LDA focuses primarily on projecting features from a higher-dimensional space onto lower dimensions.
• First, calculate the separability between classes, i.e. the distance between the means of the different classes. This is called the between-class variance.
• Second, calculate the distance between the mean and the samples within each class. This is called the within-class variance.
• Finally, construct the lower-dimensional space P that maximizes the between-class variance and minimizes the within-class variance; this objective is known as Fisher's criterion.

LDA reduces the dimensionality from the original number of features to C − 1 features, where C is the number of classes. In this case we have 3 classes, therefore the new feature space will have only 2 features, as in the sketch below.
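A minimal sketch using scikit-learn and the built-in iris data (3 classes, 4 features), showing the projection onto C − 1 = 2 dimensions:

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_iris(return_X_y=True)        # 150 samples, 4 features, 3 classes

    # With C = 3 classes, LDA can project onto at most C - 1 = 2 dimensions
    lda = LinearDiscriminantAnalysis(n_components=2)
    X_2d = lda.fit_transform(X, y)
    print(X_2d.shape)                        # (150, 2)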
Linear Discriminant Analysis
How to prepare data for LDA?
Some suggestions you should keep in mind while preparing your data to build your LDA model (a pipeline sketch follows this list):
• LDA is mainly used in classification problems where you have a categorical output variable. It allows both binary classification and multi-class classification.
• The standard LDA model assumes a Gaussian distribution for the input variables. You should check the univariate distribution of each attribute and transform it into a more Gaussian-looking distribution; for example, for an exponential distribution, use a log or root transform.
• Outliers can skew the basic statistics used to separate classes in LDA, so it is preferable to remove them.
• Since LDA assumes that each input variable has the same variance, it is always better to standardize your data before using an LDA model: make the mean 0 and the standard deviation 1.
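A pipeline sketch of the standardization advice, using scikit-learn (the wine data is an arbitrary built-in example):

    from sklearn.datasets import load_wine
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_wine(return_X_y=True)

    # Standardize every input variable (mean 0, sd 1) before fitting LDA
    clf = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis())
    clf.fit(X, y)
    print(clf.score(X, y))                   # training accuracy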
Linear Discriminant Analysis
Assumptions of LDA
1) Predictors are normally distributed (no outliers); LDA is a parametric model. Normality can be checked analytically with the Shapiro-Wilk test, or graphically with a histogram or QQ plot (a sketch follows below).
Parametric tests are those that make assumptions about the parameters of the population distribution from which the sample is drawn; this is often the assumption that the population data are normally distributed. Non-parametric tests are "distribution-free" and, as such, can be used for non-normal variables.
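A minimal sketch of the Shapiro-Wilk check with scipy (the normal and exponential samples are synthetic):

    import numpy as np
    from scipy.stats import shapiro

    rng = np.random.default_rng(0)
    normal_sample = rng.normal(size=200)
    skewed_sample = rng.exponential(size=200)

    # Null hypothesis: the sample was drawn from a normal distribution
    stat, p = shapiro(normal_sample)
    print(p)    # large p: no evidence against normality
    stat, p = shapiro(skewed_sample)
    print(p)    # tiny p: normality rejected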
Linear Discriminant Analysis
2) The variances among the predictors (X) are the same across the levels of the response variable (Y = 0, 1); that is, the classes share a common covariance matrix. This is what makes LDA linear and distinguishes it from QDA (where the covariance matrix is not identical across classes). It can be checked using standard deviations or an F-test.
To assess variability in a box-and-whisker plot, remember that half of the data for each group falls within the interquartile box. The longer the box and whiskers, the greater the variability of the distribution, and the total length of the whiskers represents the range of the data. In the plot below, Group 2 has more variability than Group 1 because it has a longer box and whiskers: Group 1 ranges from approximately 3 to 7, while Group 2 ranges from roughly 1.5 to 9.
3) The predictors cannot be categorical variables; they must be continuous variables.


Quadratic Discriminant Analysis
QDA follows the same model as LDA, except that it drops the assumption that the class conditional distributions must
share a common covariance matrix. In other words, QDA models the distribution of each class by means of an
independent multivariate Normal probability density function.
1) Predictors are normally distributed (no outliers); QDA is a parametric model.
2) Each class can have its own covariance matrix at each level of the response variable (Y).
3) The predictors cannot be categorical variables; they must be continuous variables.
The main implications of allowing class conditional distributions with different covariances are twofold. On the one
hand, the model is more flexible in the sense that its assumptions are less restrictive in practice (i.e., they are more
likely to be met, at least approximately) as well as in the sense that the resulting decision boundary is no longer
necessarily linear (it can be non-linear). On the other hand, there is the need to estimate multiple covariance
matrices, which means that the number of parameters to be estimated is larger, thus making the model and its
training mathematically and computationally more complex as well as more prone to overfitting.
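A minimal sketch contrasting the two with scikit-learn (synthetic data; the accuracies themselves are not meaningful, only the shared interface):

    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=4, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # LDA shares one covariance matrix across classes; QDA fits one per class
    print(LinearDiscriminantAnalysis().fit(X_tr, y_tr).score(X_te, y_te))
    print(QuadraticDiscriminantAnalysis().fit(X_tr, y_tr).score(X_te, y_te))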
Cross-Validations
Why cross-validation?
The most popular cross-validation is k-fold cross-validation.
In K-fold cross-validation, the dataset is split into K folds, which are used to evaluate the model's ability on new data. K refers to the number of groups the data sample is split into; for example, if K is 5, we call it 5-fold cross-validation.
In scikit-learn, when the cv argument is an integer, cross_val_score uses the K-Fold or Stratified K-Fold strategy by default.
A few other cross-validation techniques:
• Repeated K-Fold: repeats K-Fold n times, producing different splits in each repetition.
• Leave-One-Out Cross-Validation: leave out one data point and build the model on the rest of the data set.
Cross-Validations
The k-fold cross-validation method evaluates the model performance on different subsets of the training data and then calculates the average prediction error rate. The algorithm is as follows (a sketch follows this list):
1. Randomly split the data set into k subsets (folds), for example 5 subsets.
2. Reserve one subset and train the model on all the other subsets.
3. Test the model on the reserved subset and record the prediction error.
4. Repeat this process until each of the k subsets has served as the test set.
5. Compute the average of the k recorded errors. This is called the cross-validation error, and it serves as the performance metric for the model.
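A sketch of these five steps with scikit-learn's KFold (synthetic regression data; linear regression is an arbitrary model choice):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold

    X, y = make_regression(n_samples=100, n_features=3, noise=10, random_state=0)

    errors = []
    kf = KFold(n_splits=5, shuffle=True, random_state=0)                # step 1
    for train_idx, test_idx in kf.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])      # step 2
        pred = model.predict(X[test_idx])                               # step 3
        errors.append(np.mean((y[test_idx] - pred) ** 2))  # record the error
    print(np.mean(errors))   # step 5: the cross-validation error (MSE)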
Cross-Validations
In leave-one-out (LOO) cross-validation, we train our machine-learning model n times, where n is the size of our dataset. Each time, only one sample is used as the test set while the rest are used to train our model. LOO is thus an extreme case of k-fold with k = n: if we apply LOO to the previous example, we will have 6 test subsets (a sketch follows).
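A LOO sketch on a toy dataset of n = 6 samples (so there are 6 folds), using scikit-learn:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    # Six samples, so LOO produces six train/test splits (k = n = 6)
    X, y = make_regression(n_samples=6, n_features=2, noise=5, random_state=0)

    scores = cross_val_score(LinearRegression(), X, y,
                             cv=LeaveOneOut(), scoring="neg_mean_squared_error")
    print(len(scores))       # 6 folds, one per sample
    print(-scores.mean())    # average LOO prediction error (MSE)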
Confusion Matrix: Model Evaluation
Key factors to look out for: sensitivity and specificity (computed in the sketch below).
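A minimal sketch of computing these from a binary confusion matrix with scikit-learn (the labels are invented for illustration):

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)    # true positive rate (recall)
    specificity = tn / (tn + fp)    # true negative rate
    print(sensitivity, specificity)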
References
Textbook: An Introduction to Statistical Learning with Applications in R, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.
Textbook online reference: https://www.stat.berkeley.edu/users/rabbee/s154/ISLR_First_Printing.pdf
https://github.com/MarthaCooper/
https://www.stat.cmu.edu
https://www.geeksforgeeks.org/
https://medium.com/
https://towardsdatascience.com/
https://scikit-learn.org
