Week2-Day 1-Introduction To Data Mining
In the figure at the right, each point represents one model trained on a sample of data. The distance between the centre of the area the points occupy and the bulls-eye represents bias, and the degree of dispersal of the points represents variance.
The centre of the target represents the region of zero error, where the model predicts the correct value. As we move away from the centre, the error of the model increases and its predictions get worse.
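To make the target analogy concrete, here is a minimal simulation sketch in Python (using only numpy; taking the sample mean as the "model" is an assumption made for illustration). Each refit on a fresh dataset gives one point on the target; the offset of the cluster from the truth is the bias, and the scatter is the variance.

import numpy as np

rng = np.random.default_rng(0)
true_value = 5.0   # the bulls-eye: the quantity the model should predict

# Train one "model" per dataset: each model is the mean of a small noisy
# training sample, refit on 1000 independently drawn datasets.
predictions = np.array([rng.normal(true_value, 2.0, size=10).mean()
                        for _ in range(1000)])

bias = predictions.mean() - true_value   # offset of the cluster centre
variance = predictions.var()             # dispersal of the points
print(f"bias = {bias:.3f}, variance = {variance:.3f}")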
Probability Theory
Probability measures how likely an event is to occur; its value always lies between 0 and 1 inclusive, where 0 means impossibility and 1 means certainty.
Example
A simple example is the tossing of a fair (unbiased) coin. Since the coin is fair, the two outcomes, “heads” and
“tails,” are both equally probable. Since no other outcomes are possible, the probability of either “heads” or
“tails” is 0.5 or 50%.
Conditional probability: P(A|B) = P(A ∩ B) / P(B), the probability that A occurs given that B has occurred (defined when P(B) > 0).
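For instance, when rolling a fair die, let A be "the roll is a 2" and B be "the roll is even". Then P(A ∩ B) = 1/6 and P(B) = 1/2, so P(A|B) = (1/6) / (1/2) = 1/3.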
INDEPENDENCE
Example
For example, suppose you roll a die and flip a coin. The probability of getting any particular face
on the die in no way influences the probability of getting a head or a tail on the coin.
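Formally, two events A and B are independent exactly when P(A ∩ B) = P(A) · P(B). For the example above: P(rolling a 6 and flipping heads) = 1/6 × 1/2 = 1/12.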
Naïve Bayes Classifier
Supervised, probabilistic classifier based on Bayes' theorem
Bayes' theorem
Bayes' theorem, due to Reverend Thomas Bayes, is about conditional probability: P(A|B), the probability of A given
that B occurred (B is also called the evidence or predictor). In classification, we encounter a new observation for which we
know the values of the predictors X but not the class Y, and we would like to make a guess about Y based on the
information we have (our sample). The key insight of Bayes' theorem is that the probability of an event can be
updated as new data is introduced.
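Written out for classification, the theorem reads P(Y|X) = P(X|Y) · P(Y) / P(X), where P(Y) is the prior probability of the class, P(X|Y) is the likelihood of the observed predictors under that class, P(X) is the evidence, and P(Y|X) is the posterior probability we want.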
Parameter estimation for naive Bayes models uses the method of maximum likelihood.
Given data, the maximum likelihood estimate (MLE) for a parameter p is the value of p that maximizes the
likelihood P(data | p); that is, the MLE is the value of p for which the observed data is most likely. For example, if a coin tossed 100 times lands heads 55 times, then P(55 heads | p) = C(100, 55) · p^55 · (1 − p)^45.
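A minimal sketch of this estimate in Python (assuming scipy is available), evaluating the binomial likelihood on a grid of candidate values of p and picking the maximizer; as expected, it lands at p̂ = 55/100 = 0.55:

import numpy as np
from scipy.stats import binom

# Likelihood of observing 55 heads in 100 tosses, as a function of p
p_grid = np.linspace(0.01, 0.99, 981)
likelihood = binom.pmf(55, 100, p_grid)

p_mle = p_grid[np.argmax(likelihood)]   # value of p that maximizes the likelihood
print(f"MLE for p: {p_mle:.2f}")        # 0.55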
The prior probability of the outcome: based on the training data, what
is the probability of a person surviving or not?
Naïve Bayes Classifier
Why Naïve?
The classifier rests on a strong conditional independence assumption: given the class, each predictor is assumed to be independent of all the others. Is that plausible in the real world? Consider X = age, exercise, gender, weight, …; Y = diabetes (Y/N): predictors such as age and weight are clearly correlated, yet the assumption is what keeps the model simple.
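As an illustration, here is a minimal sketch using scikit-learn's GaussianNB on a tiny invented dataset (the feature values and labels below are made up for demonstration, not real medical data):

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Invented toy data: columns are [age, weekly exercise hours, weight in kg]
X = np.array([[25, 5.0, 68], [47, 1.0, 92], [52, 0.5, 105], [31, 4.0, 74],
              [60, 0.0, 98], [38, 3.0, 80], [55, 1.5, 110], [29, 6.0, 65]])
y = np.array([0, 1, 1, 0, 1, 0, 1, 0])   # 1 = diabetes, 0 = no diabetes

model = GaussianNB()
model.fit(X, y)

# Posterior class probabilities for a new person: the class prior from the
# training data multiplied by per-feature Gaussian likelihoods (the "naive" product)
print(model.predict_proba([[45, 2.0, 88]]))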
Checking the normality assumption:
Shapiro–Wilk test (analytical)
Histogram, Q–Q plot (graphical)
Parametric tests are those that make assumptions about the parameters of the population distribution from
which the sample is drawn, most often that the population data are normally distributed.
Non-parametric tests are “distribution-free” and, as such, can be used for non-normal variables.
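Both kinds of normality check can be run in a few lines of Python (a minimal sketch assuming scipy and matplotlib, on simulated data):

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=200)   # simulated, genuinely normal data

# Analytical check: Shapiro-Wilk, whose null hypothesis is that the data are normal
stat, p_value = stats.shapiro(sample)
print(f"W = {stat:.3f}, p = {p_value:.3f}")     # a large p gives no evidence against normality

# Graphical checks: histogram and Q-Q plot against the normal distribution
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(sample, bins=20)
stats.probplot(sample, dist="norm", plot=ax2)
plt.show()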
Linear Discriminant Analysis
2) Variances among the group variables are the same across the levels of the response variable Y (0, 1); that is, the predictors X share one covariance matrix for all classes. This is what makes the method linear and distinguishes it from QDA, where the covariance matrix is not identical across classes. The assumption can be checked using standard deviations or an F-test.
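A minimal sketch with scikit-learn (the two-class data below is simulated with a shared covariance matrix, matching the assumption above):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
shared_cov = [[1.0, 0.3], [0.3, 1.0]]   # one covariance matrix for both classes
X0 = rng.multivariate_normal([0, 0], shared_cov, size=50)   # class 0
X1 = rng.multivariate_normal([2, 2], shared_cov, size=50)   # class 1
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.predict([[1.0, 1.0]]))      # predicted class for a new point
print(lda.coef_, lda.intercept_)      # coefficients of the linear decision boundary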
To assess variability in a box-and-whisker plot, remember that half of the data for each group falls within
the interquartile box. The longer the box and whiskers, the greater the variability of the distribution.
The total length of the whiskers represents the range of the data. In the plot below, Group 2 has more
variability than Group 1 because it has a longer box and whiskers: Group 1 ranges from approximately 3
to 7, while Group 2 ranges from roughly 1.5 to 9.
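A minimal matplotlib sketch that reproduces a plot of this shape (the two groups are simulated to roughly match the ranges described):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
group1 = rng.uniform(3, 7, size=100)     # tightly spread group
group2 = rng.uniform(1.5, 9, size=100)   # more variable group

plt.boxplot([group1, group2])
plt.xticks([1, 2], ["Group 1", "Group 2"])
plt.ylabel("Value")
plt.show()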