
Statistical Inference – II

Tushar B. Kute,
http://tusharkute.com
Covariance

• Covariance is a measure of how much two random variables vary together.
• It’s similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together.
Covariance
Covariance Formula
Covariance

• In the above formula,
– xi, yi are the individual elements of the x and y series
– x̄, y̅ are the arithmetic means of the x and y series
– N is the number of elements in the series
– The denominator is N for a whole population and N - 1 in the case of a sample. As our dataset is a small sample of the entire Iris dataset, we use N - 1.
Example:

• Calculate covariance for the following data set:
x: 2.1, 2.5, 3.6, 4.0 (mean = 3.05)
y: 8, 10, 12, 14 (mean = 11)
• Substitute the values into the formula and solve:
Cov(X,Y) = Σ(xᵢ - x̄)(yᵢ - ȳ) / (n - 1)
= [(2.1-3.05)(8-11) + (2.5-3.05)(10-11) + (3.6-3.05)(12-11) + (4.0-3.05)(14-11)] / (4-1)
= [(-0.95)(-3) + (-0.55)(-1) + (0.55)(1) + (0.95)(3)] / 3
= (2.85 + 0.55 + 0.55 + 2.85) / 3
= 6.8 / 3
≈ 2.267
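• A quick way to check the hand calculation is NumPy's cov function. The sketch below is illustrative (not part of the original slides); np.cov uses the N - 1 denominator by default, matching the formula above.

```python
import numpy as np

x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8, 10, 12, 14])

# Manual calculation with the sample denominator (n - 1)
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is Cov(X, Y).
# Its default ddof=1 matches the n - 1 denominator used above.
cov_matrix = np.cov(x, y)

print(cov_manual)        # ~2.267
print(cov_matrix[0, 1])  # same value
```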
What is correlation ?

• Statistics and data science are often concerned about the


relationships between two or more variables (or features) of a
dataset. Each data point in the dataset is an observation, and the
features are the properties or attributes of those observations.
• Every dataset you work with uses variables and observations. For
example, you might be interested in understanding the following:
– How the height of basketball players is correlated to their
shooting accuracy
– Whether there’s a relationship between employee work
experience and salary
– What mathematical dependence exists between the population
density and the gross domestic product of different countries
What is correlation ?

• In this table, each row represents one observation, or


the data about one employee (either Ann, Rob, Tom, or
Ivy). Each column shows one property or feature (name,
experience, or salary) for all the employees.
Forms of correlation
Forms of correlation

• Negative correlation (red dots): In the plot on the left, the y


values tend to decrease as the x values increase. This shows
strong negative correlation, which occurs when large values of
one feature correspond to small values of the other, and vice
versa.
• Weak or no correlation (green dots): The plot in the middle shows
no obvious trend. This is a form of weak correlation, which occurs
when an association between two features is not obvious or is
hardly observable.
• Positive correlation (blue dots): In the plot on the right, the y
values tend to increase as the x values increase. This illustrates
strong positive correlation, which occurs when large values of one
feature correspond to large values of the other, and vice versa.
Example: Employee table
Correlation Techniques

• There are several statistics that you can use to quantify


correlation. We will be learning about three correlation
coefficients:
– Pearson’s r
– Spearman’s rho
– Kendall’s tau
• Pearson’s coefficient measures linear correlation, while the
Spearman and Kendall coefficients compare the ranks of data.
• There are several NumPy, SciPy, and Pandas correlation
functions and methods that you can use to calculate these
coefficients.
• You can also use Matplotlib to conveniently illustrate the
results.
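• As a rough sketch of those library functions (the arrays below are illustrative, not the deck's own data), SciPy exposes all three coefficients directly:

```python
import numpy as np
from scipy import stats

# Illustrative data: y grows with x, but not in a straight line
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

r, p_r = stats.pearsonr(x, y)        # Pearson's r: linear correlation
rho, p_rho = stats.spearmanr(x, y)   # Spearman's rho: rank correlation
tau, p_tau = stats.kendalltau(x, y)  # Kendall's tau: rank correlation

# Each call also returns a p-value for the null hypothesis of no correlation.
print(r, rho, tau)
```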
What sort of correlation ?

• The values on the main diagonal of the correlation


matrix (upper left and lower right) are equal to 1.
• The upper left value corresponds to the correlation
coefficient for x and x, while the lower right value is
the correlation coefficient for y and y. They are always
equal to 1.
• However, what you usually need are the lower left and
upper right values of the correlation matrix.
• These values are equal and both represent the
Pearson correlation coefficient for x and y. In this case,
it’s approximately 0.76.
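• A minimal sketch of that matrix with np.corrcoef, using illustrative arrays chosen so the off-diagonal coefficient comes out near the 0.76 quoted above:

```python
import numpy as np

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

corr = np.corrcoef(x, y)
print(corr)
# The diagonal entries (x with x, y with y) are exactly 1;
# the two off-diagonal entries are Pearson's r for x and y (~0.76 here).
```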
Linear Correlation

• Linear correlation measures the proximity of


the mathematical relationship between
variables or dataset features to a linear
function.
• If the relationship between the two features is
closer to some linear function, then their linear
correlation is stronger and the absolute value of
the correlation coefficient is higher.
Karl Pearson Correlation

• Consider a dataset with two features: x and y. Each feature


has n values, so x and y are n-tuples. Say that the first value
x₁ from x corresponds to the first value y₁ from y, the second
value x₂ from x to the second value y₂ from y, and so on.
Then, there are n pairs of corresponding values: (x₁, y₁), (x₂,
y₂), and so on. Each of these x-y pairs represents a single
observation.
• The Pearson (product-moment) correlation coefficient is a
measure of the linear relationship between two features. It’s
the ratio of the covariance of x and y to the product of their
standard deviations. It’s often denoted with the letter r and
called Pearson’s r. You can express this value mathematically
with this equation:
Pearson Correlation

In the above formula,
– xᵢ, yᵢ are the individual elements of the x and y series
– the numerator corresponds to the covariance of x and y
– the denominator corresponds to the product of their individual standard deviations
so that r = Σ(xᵢ - x̄)(yᵢ - ȳ) / ( √Σ(xᵢ - x̄)² · √Σ(yᵢ - ȳ)² )
Pearson Correlation

• The Pearson correlation coefficient can take on any real value in the
range −1 ≤ r ≤ 1.
• The maximum value r = 1 corresponds to the case when there’s a
perfect positive linear relationship between x and y. In other words,
larger x values correspond to larger y values and vice versa.
• The value r > 0 indicates positive correlation between x and y.
• The value r = 0 indicates no linear correlation between x and y (independent variables have r = 0, but r = 0 by itself does not imply independence).
• The value r < 0 indicates negative correlation between x and y.
• The minimal value r = −1 corresponds to the case when there’s a
perfect negative linear relationship between x and y. In other
words, larger x values correspond to smaller y values and vice versa.
Pearson Correlation
Summary

• Covariance brings about the variation across


variables.
• We use covariance to measure how much two
variables change with each other.
• Correlation reveals the relation between the
variables. We use correlation to determine how
strongly linked two variables are to each other.
Measure of Position

• Measures of position give us a way to see where a


certain data point or value falls in a sample or
distribution.
• A measure can tell us whether a value is about the
average, or whether it’s unusually high or low.
• Measures of position are used for quantitative data
that falls on some numerical scale.
• Sometimes, measures can be applied to ordinal
variables— those variables that have an order, like
first, second…fiftieth.
Measure of Position

• Measures of position can also show how values from different distributions or measurement scales compare.
• For example, a person’s height (measured in feet)
and weight (measured in pounds) can be compared
by converting the measurements to z-scores.
Percentile

• “Percentile” is in everyday use, but there is no


universal definition for it.
• The most common definition of a percentile is a
number where a certain percentage of scores fall
below that number.
• You might know that you scored 67 out of 90 on a
test. But that figure has no real meaning unless you
know what percentile you fall into.
• If you know that your score is in the 90th percentile,
that means you scored better than 90% of people
who took the test.
Percentile

• Percentiles are commonly used to report scores in


tests, like the SAT, GRE and LSAT. For example, the
70th percentile on the 2013 GRE was 156. That
means if you scored 156 on the exam, your score
was better than 70 percent of test takers.
Percentile Rank

• The word “percentile” is used informally in the


above definition. In common use, the percentile
usually indicates that a certain percentage falls
below that percentile.
• For example, if you score in the 25th percentile, then
25% of test takers are below your score. The “25” is
called the percentile rank.
• In statistics, it can get a little more complicated as
there are actually three definitions of “percentile.”
Here are the first two, based on an arbitrary “25th
percentile”:
Percentile Rank

• Definition 1: The nth percentile is the lowest score that


is greater than a certain percentage (“n”) of the scores.
In this example, our n is 25, so we’re looking for the
lowest score that is greater than 25%.
• Definition 2: The nth percentile is the smallest score
that is greater than or equal to a certain percentage of
the scores.
• To rephrase this, it’s the percentage of data that falls at
or below a certain observation. This is the definition
used in AP statistics. In this example, the 25th percentile
is the score that’s greater or equal to 25% of the scores.
Percentile

• They may seem very similar, but they can lead


to big differences in results, although they are
both the 25th percentile rank. Take the
following list of test scores, ordered by rank:
How to find percentile?

• Example question: Find out where the 25th percentile


is in the above list.
• Step 1: Calculate what rank is at the 25th percentile.
Use the following formula:
Rank = Percentile / 100 * (number of items + 1)
Rank = 25 / 100 * (8 + 1) = 0.25 * 9 = 2.25.
A rank of 2.25 is at the 25th percentile. However, there
isn’t a rank of 2.25 (ever heard of a high school rank of
2.25? I haven’t!), so you must either round up, or round
down. As 2.25 is closer to 2 than 3, I’m going to round
down to a rank of 2.
How to find percentile?

• Step 2: Choose either definition 1 or 2:


• Definition 1: The lowest score that is greater than
25% of the scores. That equals a score of 43 on this
list (a rank of 3).
• Definition 2: The smallest score that is greater than
or equal to 25% of the scores. That equals a score of
33 on this list (a rank of 2).
• Depending on which definition you use, the 25th
percentile could be reported at 33 or 43! A third
definition attempts to correct this possible
misinterpretation:
How to find percentile?

• Definition 3: A weighted mean of the percentiles from the first


two definitions.
• In the above example, here’s how the percentile would be worked
out using the weighted mean:
• Multiply the difference between the scores by 0.25 (the fraction
of the rank we calculated above). The scores were 43 and 33,
giving us a difference of 10: (0.25)(43 – 33) = 2.5
• Add the result to the lower score. 2.5 + 33 = 35.5
• In this case, the 25th percentile score is 35.5, which makes more
sense as it’s in the middle of 43 and 33.
• In most cases, the percentile is usually definition #1. However, it
would be wise to double check that any statistics about
percentiles are created using that first definition.
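• NumPy's percentile function exposes several of these conventions through its method argument (NumPy 1.22+; older versions call it interpolation). The scores below are a hypothetical ordered list of 8 items with 33 and 43 at ranks 2 and 3, mirroring the example above:

```python
import numpy as np

# Hypothetical ordered test scores (8 items; ranks 2 and 3 hold 33 and 43)
scores = np.array([20, 33, 43, 55, 60, 71, 82, 90])

# 'higher' picks an actual score above the cut (roughly definition 1 -> 43),
# 'lower' picks an actual score at or below it (roughly definition 2 -> 33),
# and 'weibull' interpolates using rank = p/100 * (n + 1), the weighted
# approach of definition 3 (-> 35.5 here).
for method in ("higher", "lower", "weibull"):
    print(method, np.percentile(scores, 25, method=method))
```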
Percentile Range

• A percentile range is the difference between two


specified percentiles. These could theoretically be any
two percentiles, but the 10-90 percentile range is the
most common. To find the 10-90 percentile range:
– Calculate the 10th percentile using the above steps.
– Calculate the 90th percentile using the above steps.
– Subtract Step 1 (the 10th percentile) from Step 2
(the 90th percentile).
– Example...
Quantile

• The word “quantile” comes from the word quantity. In


simple terms, a quantile is where a sample is divided
into equal-sized, adjacent, subgroups (that’s why it’s
sometimes called a “fractile“). It can also refer to
dividing a probability distribution into areas of equal
probability.
• The median is a quantile; the median is placed in a
probability distribution so that exactly half of the data
is lower than the median and half of the data is above
the median. The median cuts a distribution into two
equal areas and so it is sometimes called 2-quantile.
Quantile
Quartiles

• Quartiles are values that divide your data into quarters.


However, quartiles aren’t shaped like pizza slices;
Instead they divide your data into four segments
according to where the numbers fall on the number
line. The four quarters that divide a data set into
quartiles are:
– The lowest 25% of numbers.
– The next lowest 25% of numbers (up to the median).
– The second highest 25% of numbers (above the
median).
– The highest 25% of numbers.
Quartiles
Example :

• Example: Divide the following data set into


quartiles: 2, 5, 6, 7, 10, 22, 13, 14, 16, 65, 45, 12.
– Step 1: Put the numbers in order: 2, 5, 6, 7, 10,
12, 13, 14, 16, 22, 45, 65.
– Step 2: Count how many numbers there are in
your set and then divide by 4 to cut the list of
numbers into quarters. There are 12 numbers
in this set, so you would have 3 numbers in
each quartile.
2, 5, 6, | 7, 10, 12 | 13, 14, 16, | 22, 45, 65
Example :

• If you have an uneven set of numbers, it’s OK to slice a number


down the middle. This can get a little tricky (imagine trying to
divide 10, 13, 17, 19, 21 into quarters!), so you may want to use
an online interquartile range calculator to figure those quartiles
out for you. The calculator gives you the 25th Percentile, which
is the end of the first quartile, the 50th Percentile which is the
end of the second quartile (or the median) and the 75th
Percentile, which is the end of the third quartile. For 10, 13, 17,
19 and 21 the results are:
– 25th Percentile: 11.5
– 50th Percentile: 17
– 75th Percentile: 20
– Interquartile Range: 8.5.
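• For reference, a small sketch (assuming NumPy 1.22+ for the method keyword) that reproduces those calculator values; the 'weibull' convention uses the same rank = p/100 * (n + 1) formula shown earlier:

```python
import numpy as np

data = np.array([10, 13, 17, 19, 21])

q1, q2, q3 = np.percentile(data, [25, 50, 75], method="weibull")
iqr = q3 - q1  # interquartile range

print(q1, q2, q3, iqr)  # 11.5 17.0 20.0 8.5
```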
Upper Quartile

• The upper quartile (sometimes called Q3) is the


number dividing the third and fourth quartile.
• The upper quartile can also be thought of as the
median of the upper half of the numbers.
• The upper quartile is also called the 75th
percentile; it splits the lowest 75% of data from
the highest 25%.
Upper Quartile

• You can find the upper quartile by placing a set of


numbers in order and working out Q3 by hand, or you can
use the upper quartile formula.
• If you have a small set of numbers (under about 20), by
hand is usually the easiest option. However, the formula
works for all sets of numbers, from very small to very
large. You may also want to use the formula if you are
uncomfortable with finding the median for sets of data
with odd or even numbers.
– Example question: Find the upper quartile for the
following set of numbers:
27, 19, 5, 7, 6, 9, 15, 12, 18, 2, 1.
Upper Quartile

• Step 1: Put your numbers in order: 1, 2, 5, 6, 7, 9,


12, 15, 18, 19, 27
• Step 2: Find the median: 1, 2, 5, 6, 7, 9, 12, 15, 18,
19, 27.
• Step 3: Place parentheses around the numbers
above the median.
1, 2, 5, 6, 7, 9, (12, 15, 18, 19, 27).
• Step 4: Find the median of the upper set of
numbers. This is the upper quartile:
1, 2, 5, 6, 7, 9, (12, 15, 18 ,19 ,27).
Upper Quartile

• The upper quartile formula is:


Q3 = ¾(n + 1)th Term.
• The formula doesn’t give you the value for the upper quartile, it gives you
the place. For example, the 5th place, or the 76th place.
• Step 1: Put your numbers in order: 1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27.
Note: for very large data sets, you may want to use Excel to place your
numbers in order. See: Sorting Numbers in Excel.
• Step 2: Work the formula. There are 11 numbers in the set, so:
Q3 = ¾(n + 1)th Term.
Q3 = ¾(11 + 1)th Term.
Q3 = ¾(12)th Term.
Q3 = 9th Term.
• In this set of numbers (1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27), the upper quartile
(18) is the 9th term, or the 9th place from the left.
Quarter vs. Quartile

• There’s a slight difference between a quarter


and quartile. A quarter is the whole slice of
pizza, but a quartile is the mark the pizza cutter
makes at the end of the slice.
Z-score

• Simply put, a z-score (also called a standard score) gives


you an idea of how far from the mean a data point is.
• But more technically it’s a measure of how many standard
deviations below or above the population mean a raw
score is.
• A z-score can be placed on a normal distribution curve. Z-
scores range from -3 standard deviations (which would fall
to the far left of the normal distribution curve) up to +3
standard deviations (which would fall to the far right of the
normal distribution curve).
• In order to use a z-score, you need to know the mean μ and
also the population standard deviation σ.
Z-score

• Z-scores are a way to compare results to a “normal”


population. Results from tests or surveys have
thousands of possible results and units; those results
can often seem meaningless.
• For example, knowing that someone’s weight is 150
pounds might be good information, but if you want to
compare it to the “average” person’s weight, looking
at a vast table of data can be overwhelming (especially
if some weights are recorded in kilograms).
• A z-score can tell you where that person’s weight is
compared to the average population’s mean weight.
Z-score

• The basic z score formula for a sample is:


z = (x – μ) / σ
• For example, let’s say you have a test score of 190. The
test has a mean (μ) of 150 and a standard deviation (σ)
of 25. Assuming a normal distribution, your z score
would be:
z = (x – μ) / σ
= (190 – 150) / 25 = 1.6.
• The z score tells you how many standard deviations
from the mean your score is. In this example, your score
is 1.6 standard deviations above the mean.
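• The calculation is a one-liner; a minimal sketch of the example above:

```python
def z_score(x, mu, sigma):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

print(z_score(190, mu=150, sigma=25))  # 1.6
```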
Z-score

• You may also see the z score formula shown to


the left.
• This is exactly the same formula as z = (x – μ) / σ,
except that x̄ (the sample mean) is used instead
of μ (the population mean) and s (the sample
standard deviation) is used instead of σ (the
population standard deviation). However, the
steps for solving it are exactly the same.
Standard Error of mean

• When you have multiple samples and want to


describe the standard deviation of those sample
means (the standard error), you would use this z
score formula:
z = (x – μ) / (σ / √n)
• This z-score will tell you how many standard
errors there are between the sample mean and
the population mean.
Standard Error of mean

• Example problem: In general, the mean height of women is 65″


with a standard deviation of 3.5″. What is the probability of
finding a random sample of 50 women with a mean height of
70″, assuming the heights are normally distributed?
z = (x – μ) / (σ / √n)
= (70 – 65) / (3.5/√50) = 5 / 0.495 = 10.1
• The key here is that we’re dealing with a sampling distribution
of means, so we know we have to include the standard error in
the formula. We also know that about 99.7% of values fall within 3
standard deviations from the mean in a normal probability
distribution (see the 68-95-99.7 rule).
• Therefore, there’s less than 1% probability that any sample of
women will have a mean height of 70″.
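• A small sketch of the same calculation with SciPy, using the numbers above; norm.sf gives the right-tail probability, i.e. the chance of a sample mean at least this large:

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 65, 3.5, 50
sample_mean = 70

z = (sample_mean - mu) / (sigma / sqrt(n))  # z in standard-error units
p = norm.sf(z)                              # P(sample mean >= 70)

print(round(z, 1))  # ~10.1
print(p)            # effectively 0 -- far below 1%
```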
Example:

• Example question: You take the SAT and score


1100. The mean score for the SAT is 1026 and
the standard deviation is 209.
• How well did you score on the test compared to
the average test taker?
Example:

• Step 1: Write your X-value into the z-score equation.


For this example question the X-value is your SAT
score, 1100.

• Step 2: Put the mean, μ, into the z-score equation.

• Step 3: Write the standard deviation, σ into the z-


score equation.
Example:

• Step 4: Find the answer using a calculator:


• (1100 – 1026) / 209 = .354. This means that your
score was .354 std devs above the mean.
• Step 5: (Optional) Look up your z-value in the z-table
to see what percentage of test-takers scored below
you. A z-score of .354 is .1368 + .5000* = .6368 or
63.68%.
• *Why add .500 to the result? The z-table shown has
scores for the RIGHT of the mean. Therefore, we
have to add .500 for all of the area LEFT of the mean.
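• With SciPy you can skip the table entirely, since norm.cdf already returns the area to the left of the z-score (so there is nothing to add); a minimal sketch:

```python
from scipy.stats import norm

z = (1100 - 1026) / 209
print(round(z, 3))            # ~0.354
print(round(norm.cdf(z), 4))  # ~0.638, close to the 0.6368 read from the
                              # table (the table lookup rounds z to 0.35)
```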
Z-score and Standard Deviation

• Technically, a z-score is the number of standard deviations


from the mean value of the reference population (a
population whose known values have been recorded, like in
these charts the CDC compiles about people’s weights). For
example:
– A z-score of 1 is 1 standard deviation above the mean.
– A score of 2 is 2 standard deviations above the mean.
– A score of -1.8 is 1.8 standard deviations below the mean.
• A z-score tells you where the score lies on a normal distribution curve. A z-score of zero tells you the value is exactly average, while a score of +3 tells you that the value is much higher than average.
Z-score in real life

• You can use the z-table and the normal distribution


graph to give you a visual about how a z-score of 2.0
means “higher than average”.
• Let’s say you have a person’s weight (240 pounds),
and you know their z-score is 2.0. You know that 2.0
is above average (because of the high placement on
the normal distribution curve), but you want to know
how much above average is this weight?
• The z-score in the center of the curve is zero. The z-
scores to the right of the mean are positive and the
z-scores to the left of the mean are negative.
Marginal Probability

• Marginal probability is the probability of an event,


irrespective of other random variables.
• If the random variable is independent of the other variables, then the marginal probability is simply the probability of the event; otherwise, the marginal probability is the probability of the event summed over all outcomes of the dependent variables. This is called the sum rule.
– Marginal Probability: The probability of an event
irrespective of the outcomes of other random
variables, e.g. P(A).
Joint Probability

• The joint probability is the probability of two (or


more) simultaneous events, often described in
terms of events A and B from two dependent
random variables, e.g. X and Y.
• The joint probability is often summarized as just
the outcomes, e.g. A and B.
– Joint Probability: Probability of two (or more)
simultaneous events, e.g. P(A and B) or P(A,
B).
Conditional Probability

• The conditional probability is the probability of


one event given the occurrence of another
event, often described in terms of events A and
B from two dependent random variables e.g. X
and Y.
– Conditional Probability: Probability of one (or
more) event given the occurrence of another
event, e.g. P(A given B) or P(A | B).
Summary

• The joint probability can be calculated using the


conditional probability; for example:
– P(A, B) = P(A | B) * P(B)
• This is called the product rule. Importantly, the joint
probability is symmetrical, meaning that:
– P(A, B) = P(B, A)
• The conditional probability can be calculated using the
joint probability; for example:
– P(A | B) = P(A, B) / P(B)
• The conditional probability is not symmetrical; for example:
– P(A | B) != P(B | A)
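• A tiny numeric sketch of these rules, with made-up probabilities for two dependent events A and B:

```python
# Assumed (made-up) probabilities
p_b = 0.4          # P(B)
p_a = 0.3          # P(A)
p_a_given_b = 0.5  # P(A | B)

# Product rule: P(A, B) = P(A | B) * P(B)
p_ab = p_a_given_b * p_b   # 0.2, and P(B, A) is the same value

# Conditional probability from the joint: P(B | A) = P(A, B) / P(A)
p_b_given_a = p_ab / p_a   # ~0.667, which differs from P(A | B) = 0.5

print(p_ab, p_b_given_a)
```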
Alternate way for conditional prob

• Specifically, one conditional probability can be


calculated using the other conditional probability; for
example:
– P(A|B) = P(B|A) * P(A) / P(B)
• The reverse is also true; for example:
– P(B|A) = P(A|B) * P(B) / P(A)
• This alternate approach of calculating the conditional
probability is useful either when the joint probability is
challenging to calculate (which is most of the time), or
when the reverse conditional probability is available or
easy to calculate.
Bayes Theorem

• Bayes Theorem: Principled way of calculating a


conditional probability without the joint probability. It
is often the case that we do not have access to the
denominator directly, e.g. P(B).
• We can calculate it an alternative way; for example:
– P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
• This gives a formulation of Bayes Theorem that we can
use that uses the alternate calculation of P(B),
described below:
– P(A|B) = P(B|A) * P(A) / (P(B|A) * P(A) + P(B|not A) * P(not A))
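• A short numeric sketch of this formulation, with assumed prior and likelihood values:

```python
# Assumed (made-up) inputs
p_a = 0.01              # prior P(A)
p_b_given_a = 0.9       # likelihood P(B | A)
p_b_given_not_a = 0.05  # P(B | not A)

# Evidence: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes theorem: posterior = likelihood * prior / evidence
p_a_given_b = p_b_given_a * p_a / p_b

print(round(p_a_given_b, 3))  # ~0.154
```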
Bayes Theorem

• Firstly, in general, the result P(A|B) is referred to as the


posterior probability and P(A) is referred to as the prior
probability.
– P(A|B): Posterior probability.
– P(A): Prior probability.
• Sometimes P(B|A) is referred to as the likelihood and P(B)
is referred to as the evidence.
– P(B|A): Likelihood.
– P(B): Evidence.
• This allows Bayes Theorem to be restated as:
– Posterior = Likelihood * Prior / Evidence
Naive Bayes Classifier

• Naive Bayes classifiers are a collection of


classification algorithms based on Bayes’
Theorem.
• It is not a single algorithm but a family of
algorithms where all of them share a common
principle, i.e. every pair of features being
classified is independent of each other.
Bayes Theorem

Example Reference: Super Data Science


Bayes Theorem

Defective Spanners
Bayes Theorem
Bayes Theorem
Bayes Theorem
Bayes Theorem
That’s intuitive
Exercise
Example:
Step-1
Step-1
Step-1
Step-2
Step-3
Naive Bayes – Step-1
Naive Bayes – Step-2
Naive Bayes – Step-3
Combining altogether
Naive Bayes – Step-4
Naive Bayes – Step-5
Types of model
Final Classification
Probability Distribution
Types of Naive Bayes Classifier

• Multinomial Naive Bayes:


– This is mostly used for document classification
problem, i.e whether a document belongs to the
category of sports, politics, technology etc.
– The features/predictors used by the classifier are
the frequency of the words present in the
document.
Types of Naive Bayes Classifier

• Bernoulli Naive Bayes:


– This is similar to the multinomial naive bayes but
the predictors are boolean variables.
– The parameters that we use to predict the class
variable take up only values yes or no, for example
if a word occurs in the text or not.
Types of Naive Bayes Classifier

• Gaussian Naive Bayes:


– When the predictors take up a continuous value
and are not discrete, we assume that these values
are sampled from a gaussian distribution.
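• As a hedged sketch of how these classifiers are used in practice, scikit-learn provides GaussianNB (with MultinomialNB and BernoulliNB for the other two variants); here it is on the Iris data mentioned earlier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = GaussianNB()   # assumes each feature is Gaussian within each class
model.fit(X_train, y_train)

print(model.score(X_test, y_test))  # accuracy on the held-out data
```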
Advantages

• When assumption of independent predictors


holds true, a Naive Bayes classifier performs
better as compared to other models.
• Naive Bayes requires a small amount of
training data to estimate the test data. So, the
training period is less.
• Naive Bayes is also easy to implement.
Disadvantages

• The main limitation of Naive Bayes is the assumption of independent predictors. Naive Bayes implicitly assumes that all the attributes are mutually independent. In real life, it is almost impossible to get a set of predictors which are completely independent.
• If a categorical variable has a category in the test data set which was not observed in the training data set, then the model will assign it a 0 (zero) probability and will be unable to make a prediction. This is often known as the Zero Frequency problem. To solve this, we can use a smoothing technique; one of the simplest smoothing techniques is called Laplace estimation.
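• In scikit-learn, for example, the multinomial and Bernoulli variants expose the smoothing strength as the alpha parameter; alpha=1.0 (the default) corresponds to Laplace add-one smoothing:

```python
from sklearn.naive_bayes import MultinomialNB

# alpha=1.0 is Laplace (add-one) smoothing: it prevents a zero probability
# for a feature value that never co-occurred with a class in training.
model = MultinomialNB(alpha=1.0)
```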
Bayesian Network

• Bayesian networks are a probabilistic graphical


model that explicitly capture the known
conditional dependence with directed edges in a
graph model. All missing connections define the
conditional independencies in the model.
• As such Bayesian Networks provide a useful tool to
visualize the probabilistic model for a domain,
review all of the relationships between the random
variables, and reason about causal probabilities for
scenarios given available evidence.
Challenges in Probabilistic Modeling

• Most often, the problem is the lack of information about the


domain required to fully specify the conditional dependence
between random variables. If available, calculating the full
conditional probability for an event can be impractical.
• A common approach to addressing this challenge is to add some
simplifying assumptions, such as assuming that all random
variables in the model are conditionally independent. This is a
drastic assumption, although it proves useful in practice,
providing the basis for the Naive Bayes classification algorithm.
• An alternative approach is to develop a probabilistic model of a
problem with some conditional independence assumptions. This
provides an intermediate approach between a fully conditional
model and a fully conditionally independent model.
Probabilistic Graphical Model

• A probabilistic graphical model (PGM), or simply


“graphical model” for short, is a way of
representing a probabilistic model with a graph
structure.
• The nodes in the graph represent random
variables and the edges that connect the nodes
represent the relationships between the
random variables.
Probabilistic Graphical Model

• A graph comprises nodes (also called vertices)


connected by links (also known as edges or arcs).
In a probabilistic graphical model, each node
represents a random variable (or group of random
variables), and the links express probabilistic
relationships between these variables.
– Nodes: Random variables in a graphical model.
– Edges: Relationships between random variables
in a graphical model.
Bayesian Belief Network

• A Bayesian Belief Network, or simply “Bayesian Network,”


provides a simple way of applying Bayes Theorem to complex
problems.
• The networks are not exactly Bayesian by definition, although
given that both the probability distributions for the random
variables (nodes) and the relationships between the random
variables (edges) are specified subjectively, the model can be
thought to capture the “belief” about a complex domain.
• Bayesian probability is the study of subjective probabilities or
belief in an outcome, compared to the frequentist approach
where probabilities are based purely on the past occurrence
of the event.
Bayesian Belief Network

• Bayesian networks provide useful benefits as a


probabilistic model.
• For example:
– Visualization. The model provides a direct way to
visualize the structure of the model and motivate
the design of new models.
– Relationships. Provides insights into the presence
and absence of the relationships between random
variables.
– Computations. Provides a way to structure complex
probability calculations.
Developing Bayesian Belief Network

• Designing a Bayesian Network requires defining at


least three things:
– Random Variables. What are the random variables
in the problem?
– Conditional Relationships. What are the conditional
relationships between the variables?
– Probability Distributions. What are the probability
distributions for each variable?
• It may be possible for an expert in the problem
domain to specify some or all of these aspects in the
design of the model.
Example:

• We can make Bayesian Networks concrete with a small


example.
• Consider a problem with three random variables: A, B, and C.
A is dependent upon B, and C is dependent upon B.
• We can state the conditional dependencies as follows:
– A is conditionally dependent upon B, e.g. P(A|B)
– C is conditionally dependent upon B, e.g. P(C|B)
• We know that C and A have no effect on each other.
• We can also state the conditional independencies as follows:
– A is conditionally independent of C given B: P(A|B, C) = P(A|B)
– C is conditionally independent of A given B: P(C|B, A) = P(C|B)
Example:

• We can also write the joint probability of A and C


given B or conditioned on B as the product of two
conditional probabilities; for example:
P(A, C | B) = P(A|B) * P(C|B)
• The model summarizes the joint probability of P(A,
B, C), calculated as:
P(A, B, C) = P(A|B) * P(C|B) * P(B)
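• A tiny numeric sketch of that factorization, with made-up probabilities for binary variables A, B and C:

```python
# Assumed (made-up) probabilities; A and C each depend only on B
p_b = 0.6                               # P(B = 1)
p_a_given_b = {True: 0.7, False: 0.2}   # P(A = 1 | B)
p_c_given_b = {True: 0.9, False: 0.4}   # P(C = 1 | B)

# Joint probability of A=1, B=1, C=1 via P(A,B,C) = P(A|B) * P(C|B) * P(B)
p_joint = p_a_given_b[True] * p_c_given_b[True] * p_b
print(p_joint)  # 0.378
```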
Example:
Generative vs. Discriminative
Generative vs. Discriminative
Generative vs. Discriminative
Generative vs. Discriminative
Problem Setup
Models: Parametric Families of Distributions
Likelihood Function
Maximum Likelihood
Hidden Variables

• In the real-world applications of machine learning,


it is very common that there are many relevant
features available for learning but only a small
subset of them are observable.
• So, for variables which are sometimes observable and sometimes not, we can use the instances when the variable is observed for the purpose of learning, and then predict its value in the instances when it is not observable.
Hidden Variables

• Expectation-Maximization algorithm can be used


for the latent variables (variables that are not
directly observable and are actually inferred from
the values of the other observed variables) too in
order to predict their values with the condition that
the general form of probability distribution
governing those latent variables is known to us.
• This algorithm is actually at the base of many
unsupervised clustering algorithms in the field of
machine learning.
Hidden Variables

• It was explained, proposed and given its name in a


paper published in 1977 by Arthur Dempster, Nan
Laird, and Donald Rubin.
• It is used to find the local maximum likelihood
parameters of a statistical model in the cases where
latent variables are involved and the data is missing
or incomplete.
Algorithm

• Given a set of incomplete data, consider a set of


starting parameters.
• Expectation step (E – step): Using the observed
available data of the dataset, estimate (guess) the
values of the missing data.
• Maximization step (M – step): Complete data
generated after the expectation (E) step is used in
order to update the parameters.
• Repeat step 2 and step 3 until convergence.
Algorithm
Algorithm in detail

• Initially, a set of initial values of the parameters are considered. A set of


incomplete observed data is given to the system with the assumption
that the observed data comes from a specific model.
• The next step is known as “Expectation” – step or E-step. In this step, we
use the observed data in order to estimate or guess the values of the
missing or incomplete data. It is basically used to update the variables.
• The next step is known as “Maximization”-step or M-step. In this step,
we use the complete data generated in the preceding “Expectation” –
step in order to update the values of the parameters. It is basically used
to update the hypothesis.
• Now, in the fourth step, it is checked whether the values are converging
or not, if yes, then stop otherwise repeat step-2 and step-3 i.e.
“Expectation” – step and “Maximization” – step until the convergence
occurs.
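• One place this loop is already implemented is scikit-learn's GaussianMixture, which fits a Gaussian mixture model by exactly this E-step / M-step alternation; a hedged sketch on synthetic data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Gaussians; which component generated each
# point is the hidden (latent) variable EM has to reason about.
data = np.concatenate([rng.normal(0, 1, 300),
                       rng.normal(5, 1.5, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, max_iter=100, random_state=0)
gmm.fit(data)  # alternates E-steps and M-steps until convergence

print(gmm.means_.ravel())  # estimated component means (roughly 0 and 5)
print(gmm.weights_)        # estimated mixing proportions (roughly 0.6 / 0.4)
```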
Algorithm in detail
Applications

• It can be used to fill the missing data in a


sample.
• It can be used as the basis of unsupervised
learning of clusters.
• It can be used for the purpose of estimating the
parameters of Hidden Markov Model (HMM).
• It can be used for discovering the values of
latent variables.
Types of Supervised Learning

• Many of the most popular supervised learning algorithms fall into four key categories:
– Linear models, which use a simple formula to find a best-fit
line through a set of data points.
– Tree-based models, which use a series of “if-then” rules to
generate predictions from one or more decision trees, similar
to the BuzzFeed quiz example.
– Probabilistic Models, uses the probability of the expected
outcome for prediction.
– Artificial neural networks, which are modeled after the way
that neurons interact in the human brain to interpret
information and solve problems. This is also often referred to
as deep learning.
Linear Models

• Linear Regression
• Ordinary Least Square Regression
• Support Vector Machine (linear Kernel)
• Perceptron with linear activation function
• Linear Discriminant Analysis (Fisher’s
Discriminant Analysis)
• Logistic Regression (only when we are using this
model for maximum likelihood estimation)
Linear Models

Examples...
Logistic Regression

• Logistic regression is the appropriate regression


analysis to conduct when the dependent variable is
dichotomous (binary).
• Like all regression analyses, the logistic regression is a
predictive analysis.
• Logistic regression is used to describe data and to
explain the relationship between one dependent
binary variable and one or more nominal, ordinal,
interval or ratio-level independent variables.
• Remember: though the name of algorithm carries
regression, it is used for classification.
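• A minimal scikit-learn sketch (on synthetic, illustrative data) of logistic regression used as a binary classifier:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)

print(clf.predict(X_test[:5]))        # predicted classes (0 or 1)
print(clf.predict_proba(X_test[:5]))  # class probabilities from the sigmoid
print(clf.score(X_test, y_test))      # accuracy on the held-out data
```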
Type of Logistic Regression

• Binary Logistic Regression


– The categorical response has only two possible
outcomes. Example: Spam or Not.
• Multinomial Logistic Regression
– Three or more categories without ordering. Example:
Predicting which food is preferred more (Veg, Non-
Veg, Vegan).
• Ordinal Logistic Regression
– Three or more categories with ordering. Example:
Movie rating from 1 to 5.
What we know?

Example Reference: Super Data Science


A new problem

A company has provided an offer by email to their customers.


Apply Linear Regression
Apply Linear Regression
Apply Linear Regression
Logistic Regression
Logistic Regression – Logit Function
Logistic Regression
Logistic Regression – Probabilities
Logistic Regression – Prediction
Regression Analysis

• Regression analysis is a way to find trends in


data.
• For example, you might guess that there’s a
connection between how much you eat and how
much you weigh; regression analysis can help
you quantify that.
Regression Analysis

• Regression analysis will provide you with an


equation for a graph so that you can make
predictions about your data.
• For example, if you’ve been putting on weight
over the last few years, it can predict how much
you’ll weigh in ten years time if you continue to
put on weight at the same rate.
• It will also give you a slew of statistics (including
a p-value and a correlation coefficient) to tell
you how accurate your model is.
Regression Analysis

• In statistics, it’s hard to stare at a set of random


numbers in a table and try to make any sense of it.
• For example, global warming may be reducing
average snowfall in your town and you are asked
to predict how much snow you think will fall this
year.
• Looking at the following table you might guess
somewhere around 10-20 inches. That’s a good
guess, but you could make a better guess, by
using regression.
Regression Analysis
Regression Analysis

• Essentially, regression is the “best guess” at


using a set of data to make some kind of
prediction.
• It’s fitting a set of points to a graph. There’s a
whole host of tools that can run regression for
you, including Excel, which I used here to help
make sense of that snowfall data:
Regression Analysis
Performance Evaluation

• The performance of a regression model can be


understood by knowing the error rate of the
predictions made by the model.
• You can also measure the performance by knowing
how well your regression line fit the dataset.
• A good regression model is one where the
difference between the actual or observed values
and predicted values for the selected model is
small and unbiased for train, validation and test
data sets.
Performance Evaluation

• To measure the performance of your regression


model, some statistical metrics are used. Here
we will discuss four of the most popular
metrics. They are-
– Mean Absolute Error(MAE)
– Root Mean Square Error(RMSE)
– Coefficient of determination or R2
– Adjusted R2
Mean Absolute Error

• This is the simplest of all the metrics. It is


measured by taking the average of the absolute
difference between actual values and the
predictions.
Example:
Example:
Example:
Mean Absolute Error
Mean Absolute Error
Mean Absolute Error

• Mean Absolute Error (MAE) tells us the average error in


units of y, the predicted feature. A value of 0 indicates a
perfect fit, i.e. all our predictions are spot on.
• The MAE has a big advantage in that the units of the
MAE are the same as the units of y, the feature we want
to predict.
• In the example above, we have an MAE of 8.5, so it
means that on average our predictions of the number of
machine failures are incorrect by 8.5 machine failures.
• This makes MAE very intuitive and the results are easily
conveyed to a non-machine learning expert!
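• A minimal sketch (with illustrative numbers, not the machine-failure data above) showing MAE computed by hand and with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([10, 20, 30, 40])  # actual values (illustrative)
y_pred = np.array([12, 18, 33, 41])  # model predictions (illustrative)

mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)  # both 2.0
```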
Root Mean Square Error

• The Root Mean Square Error is measured by


taking the square root of the average of the
squared difference between the prediction and
the actual value.
• It represents the sample standard deviation of
the differences between predicted values and
observed values(also called residuals).
Root Mean Square Error
RMSE

• As with MAE, we can think of RMSE as being


measured in the y units.
• So the above error can be read as an error of 9.9
machine failures on average per observation.
MAE vs. RMSE

• Compared to MAE, RMSE gives a higher total error and the


gap increases as the errors become larger. It penalizes a
few large errors more than a lot of small errors. If you want
your model to avoid large errors, use RMSE over MAE.
• Root Mean Square Error (RMSE) indicates the average error
in units of y, the predicted feature, but penalizes larger
errors more severely than MAE. A value of 0 indicates a
perfect fit.
• You should also be aware that as the sample size increases,
the accumulation of slightly higher RMSEs than MAEs
means that the gap between these two measures also
increases as the sample size increases.
R2 Error

• It measures how well the actual outcomes are


replicated by the regression line.
• It helps you to understand how well the
independent variable adjusted with the variance
in your model.
• That means how good is your model for a
dataset. The mathematical representation for
R2 is-
R2 Error

• Here,
– SSR = Sum of Squares of Residuals (the squared difference between the actual and the predicted value)
– SST = Total Sum of Squares (the squared difference between the actual and average value)
Example:

You can see that the regression line fits the data better than the mean line, which is what we
expected (the mean line is a pretty simplistic model, after all). But can you say how much
better it is? That's exactly what R2 does! Here is the calculation.
Example:
R2 Error

• The additional parts to the calculation are the


column on the far right (in blue) and the final
calculation row, computing R2
• So we have an R-squared of 0.85. Without even
worrying about the units of y we can say this is a
decent model. Why? Because the model explains
85% of the variation in the data. That's exactly what
an R-squared of 0.85 tells us!
• R-squared (R2) tells us the degree to which the model
explains the variance in the data. In other words, how
much better it is than just predicting the mean.
Example:

• Here's another example. What if our data points and


regression line looked like this?

• The variance around the regression line is 0. In other


words, var(line) is 0. There are no errors.
Now,

• Now, remember that the formula for R-squared


is:

• So, with var(line) = 0 the above calculation for R-
squared is

• So, if we have a perfect regression line, with no


errors, we get an R-squared of 1.
R2 Error

• Let's look at another example. What if our data points and


regression line looked like this, with the regression line equal to
the mean line?
• Data points where the regression line is equal to the mean line

• In this case, var(line) and var(mean) are the same. So the above
calculation will yield an R-squared of 0:
R2 Error

• What if our regression line was really bad, worse than the mean line?

• It's unlikely to get this bad! But if it does, var(mean)-var(line) will be


negative, so R-squared will be negative.
• An R-squared of 1 indicates a perfect fit. An R-squared of 0 indicates
a model no better or worse than the mean. An R-squared of less
than 0 indicates a model worse than just predicting the mean.
Summary

• Mean Absolute Error (MAE) tells us the average error in units


of y, the predicted feature. A value of 0 indicates a perfect fit.
• Root Mean Square Error (RMSE) indicates the average error in
units of y, the predicted feature, but penalizes larger errors
more severely than MAE. A value of 0 indicates a perfect fit.
• R-squared (R2) tells us the degree to which the model explains
the variance in the data. In other words how much better it is
than just predicting the mean.
– A value of 1 indicates a perfect fit.
– A value of 0 indicates a model no better than the mean.
– A value less than 0 indicates a model worse than just
predicting the mean.
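• The three metrics can be compared on the same predictions; a short sketch with illustrative numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([10, 20, 30, 40])  # actual values (illustrative)
y_pred = np.array([12, 18, 33, 41])  # model predictions (illustrative)

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
r2 = r2_score(y_true, y_pred)                       # 1 - SSR/SST

print(mae, rmse, r2)
```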
Least Square Regression

• The least-squares regression method is a


technique commonly used in Regression
Analysis.
• It is a mathematical method used to find the
best fit line that represents the relationship
between an independent and dependent
variable.
• To understand the least-squares regression
method lets get familiar with the concepts
involved in formulating the line of best fit.
What is line of best fit ?

• Line of best fit is drawn to represent the relationship


between 2 or more variables. To be more specific, the best
fit line is drawn across a scatter plot of data points in order
to represent a relationship between those data points.
• Regression analysis makes use of mathematical methods
such as least squares to obtain a definite relationship
between the predictor variable (s) and the target variable.
• The least-squares method is one of the most effective
ways used to draw the line of best fit. It is based on the
idea that the square of the errors obtained must be
minimized to the most possible extent and hence the name
least squares method.
Visualizing

• If we were to plot the best fit line that depicts the sales of a company over a period of time, it would look something like this:

• Notice that the line is as close as possible to all the scattered


data points. This is what an ideal best fit line looks like.
Visualizing
Calculate line of best fit

• To start constructing the line that best depicts the relationship


between variables in the data, we first need to get our basics right.
Take a look at the equation below:

• Surely, you’ve come across this equation before. It is a simple


equation that represents a straight line along 2 Dimensional data,
i.e. x-axis and y-axis. To better understand this, let’s break down the
equation:
y: dependent variable
m: the slope of the line
x: independent variable
c: y-intercept
Calculate line of best fit

• Step 1: Calculate the slope ‘m’ by using the


following formula:

• Step 2: Compute the y-intercept (the value of y


at the point where the line crosses the y-axis):

• Step 3: Substitute the values in the final


equation:
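• The same three steps in code, as a sketch using the standard least-squares formulas and illustrative data (the T-shirt table from the following slides is not reproduced here); np.polyfit returns the same slope and intercept:

```python
import numpy as np

# Illustrative x / y data
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.2, 2.8, 4.5, 3.7, 5.5])

# Step 1: slope m = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Step 2: y-intercept c = ȳ - m * x̄
c = y.mean() - m * x.mean()

# Step 3: the fitted line is y = m*x + c
print(m, c)
print(np.polyfit(x, y, 1))  # degree-1 fit: [m, c], should match
```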
Example

• Consider an example. Tom who is the owner of a retail


shop, found the price of different T-shirts vs the number
of T-shirts sold at his shop over a period of one week.
• He tabulated this like shown below:
Step-1

• Let us use the concept of least squares regression


to find the line of best fit for the above data.
• Step 1: Calculate the slope ‘m’ by using the
following formula:

• After you substitute the respective values, m =


1.518 approximately.
Step-2

• Step 2: Compute the y-intercept value

• After you substitute the respective values, c = 0.305


approximately.
Step-3

• Step 3: Substitute the values in the final


equation

• Once you substitute the values, it should look


something like this:
Predicting

• Let’s construct a graph that represents the y=mx + c line of


best fit:

• Now Tom can use the above equation to estimate how


many T-shirts of price $8 can he sell at the retail shop.

y = 1.518 x 8 + 0.305 = 12.45 T-shirts


Summary

• The least squares regression method works by minimizing the sum of the squared errors, hence the name least squares. Basically, the vertical distance between the line of best fit and each data point (the error) must be minimized as much as possible.
• A few things to keep in mind before implementing the least squares regression method are:
– The data must be free of outliers because they might lead to a
biased and wrongful line of best fit.
– The line of best fit can be drawn iteratively until you get a line
with the minimum possible squares of errors.
– This method works well even with non-linear data.
– Technically, the difference between the actual value of ‘y’ and the
predicted value of ‘y’ is called the Residual (denotes the error).
Thank you
This presentation is created using LibreOffice Impress 5.1.6.2, can be used freely as per GNU General Public License

/mITuSkillologies   @mitu_group   /company/mitu-skillologies   MITUSkillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
