Distribution PPT
1
Discrete and Continuous Data
A set of numbers is typically one of two
kinds:
Discrete (countable)
Continuous (measurable)
2
Discrete Data
Refers to individual and countable items
(discrete variables).
Involves counting rather than measuring.
Examples-
Count number of computers in each
department.
Count the number of students in a class.
3
Discrete Data
Characteristics-
Discrete variables take distinct, countable values,
typically non-negative integers (5, 10, 15, and so on).
They are easily visualized with simple charts such
as bar charts, line charts, or pie charts.
Discrete data can also be categorical, containing a
finite number of data values, such as the gender of a
person.
4
Continuous Data
It is a type of numerical data that can take any of
infinitely many possible values between two
points.
Its precision is limited only by the measuring
instrument.
Values in these data sets often carry decimal
points.
Examples-
Measuring daily wind speed
Measuring temperature of a city
Measuring a person’s height
5
Continuous Data
Characteristics-
Data changes over time and can have different
values at different time intervals.
Data is made up of values of random variables, which
may or may not be whole numbers.
Data is analyzed using methods such as line graphs
and measures of skewness.
Regression analysis is one of the most common
methods for analyzing continuous data.
6
Statistical Distributions
Also called probability distributions.
Statistical distributions are mathematical
functions that describe the behavior and
characteristics of random variables.
A statistical distribution helps to understand a
problem better by assigning probabilities to the
range of possible values of a variable, which makes
distributions very useful in data science and
machine learning.
7
Types of Statistical Distributions
Depending on the type of data, distributions
are grouped into two categories:
Discrete distributions for discrete data
Continuous distributions for continuous
data
8
Discrete Distributions
A discrete distribution is a probability
distribution that describes the probability of
occurrence of each possible outcome in a set
of discrete values.
It is characterized by a probability mass
function (PMF), which gives the probability of
each possible outcome.
9
Probability Mass Function (PMF)
Gives the probability of a discrete random
variable taking on a specific value.
Maps each possible outcome of a random
variable to its probability.
The PMF is defined as:
p(x) = P(X = x)
X is the discrete random variable
x is a value the random variable can take
The probabilities over all possible values of x
sum to 1.
10
Types of Discrete Distributions
Bernoulli distribution
Binomial distribution
Poisson distribution
11
Bernoulli Distribution
Single Trial with Two Possible Outcomes.
Any event with a single trial and only two possible
outcomes follows a Bernoulli distribution.
Example-
Flipping a coin.
Choosing between True and False in a quiz.
12
Bernoulli Distribution
13
Bernoulli Distribution
The PMF of the Bernoulli distribution:
P(X = x) = p^x * (1 - p)^(1 - x), x ∈ {0, 1}
where p is the probability of success (x = 1)
14
Bernoulli Distribution
The expected value or mean of the Bernoulli
distribution:
E(X) = p
Variance of the Bernoulli distribution:
Var(X) = p(1 - p)
= pq, where q = 1 - p
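The mean and variance above can be checked numerically from the PMF itself. The following is a minimal sketch; the function name `bernoulli_pmf` and the value p = 0.3 are assumptions chosen only for illustration.

```python
def bernoulli_pmf(x: int, p: float) -> float:
    """P(X = x) = p^x * (1 - p)^(1 - x) for x in {0, 1}, else 0."""
    if x not in (0, 1):
        return 0.0
    return p ** x * (1 - p) ** (1 - x)


p = 0.3  # assumed probability of success, for illustration only

# Mean and variance computed directly from the definition of expectation.
mean = sum(x * bernoulli_pmf(x, p) for x in (0, 1))
variance = sum((x - mean) ** 2 * bernoulli_pmf(x, p) for x in (0, 1))

print(f"mean = {mean:.2f}, variance = {variance:.2f}")
```

The computed mean matches p and the computed variance matches p(1 - p).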
15
Binomial Distribution
A sequence of independent Bernoulli trials.
It can be thought of as the sum of the outcomes of
n trials that each follow a Bernoulli distribution.
Therefore, it is used for events with binary
outcomes, where the probability of success is
the same in every trial.
Example -
Flipping a coin multiple times to count the
number of heads and tails
16
Binomial Distribution
Example- If you flip a fair coin twice, the possible outcomes are
[{H,H}, {H,T}, {T,H}, {T,T}]
P({H,H}) = ½ * ½ = ¼, P({T,T}) = ½ * ½ = ¼
P({H,T} or {T,H}) = ½ * ½ + ½ * ½ = ½
17
Binomial Distribution
A binomial distribution is represented by:
B(n, p)
‘n’ is the number of trials,
‘p’ is the probability of success in a single trial
The probability of exactly x successes in these n
trials (the PMF):
P(X = x) = C(n, x) * p^x * (1 - p)^(n - x), x = 0, 1, 2, ..., n
where C(n, x) = n! / (x! * (n - x)!)
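The binomial PMF, P(X = x) = C(n, x) * p^x * (1 - p)^(n - x), can be sketched in a few lines. The function name `binomial_pmf` is an assumption for illustration; the check reuses the two-coin-flip example, where x counts heads.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


# Two flips of a fair coin: probability of 0, 1, or 2 heads.
two_flips = [binomial_pmf(x, 2, 0.5) for x in range(3)]
print(two_flips)  # [0.25, 0.5, 0.25]
```

The result agrees with the enumeration of {T,T}, {H,T} or {T,H}, and {H,H}.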
18
Binomial Distribution
The expected value or mean of a binomial
distribution can be represented as:
E(X) = np
Similarly, variance is represented as:
Var(X) = np(1 - p)
= npq, where q = 1 - p
19
Binomial Distribution
For example, suppose that a candy company
produces both milk chocolate and dark
chocolate candy bars, and that half of its bars
are milk chocolate and half are dark
chocolate.
Suppose we choose ten candy bars at random and
define choosing a milk chocolate bar as a
success.
n=10, p=1/2=0.5
20
Binomial Distribution
The probability distribution of the number of
successes during these 10 trials with p = 0.5
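As a rough stand-in for the plot, the distribution for n = 10 and p = 0.5 can be tabulated with a text bar per outcome. This is a sketch using only the standard library; the variable names are assumptions.

```python
from math import comb

n, p = 10, 0.5  # ten candy bars, success = milk chocolate

# PMF for every possible number of successes, 0 through n.
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]

for x, prob in enumerate(pmf):
    print(f"{x:2d}  {prob:.4f}  {'#' * round(prob * 100)}")

# Mean and variance from the PMF, to compare with np and np(1 - p).
mean = sum(x * prob for x, prob in enumerate(pmf))
variance = sum((x - mean) ** 2 * prob for x, prob in enumerate(pmf))
print(f"mean = {mean:.1f}, variance = {variance:.1f}")
```

The distribution is symmetric around 5 successes, and the computed mean and variance match np = 5 and np(1 - p) = 2.5.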
21
Numerical
Suppose a basketball player makes a free throw
with a probability of 0.7. If the player attempts
10 free throws, what is the probability that
they make exactly 7 of them?
22
Numerical
Solution- Binomial probability problem
n = 10, p = 0.7, x = 7
P(X = 7) = C(10, 7) * 0.7^7 * 0.3^3
= 120 * 0.0824 * 0.027
≈ 0.2668
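The free-throw problem is a direct evaluation of the binomial PMF with n = 10, p = 0.7, and x = 7. A minimal sketch (variable names are assumptions):

```python
from math import comb

n, p, x = 10, 0.7, 7  # attempts, success probability, made free throws

# P(X = 7) = C(10, 7) * 0.7^7 * 0.3^3
prob_exactly_7 = comb(n, x) * p ** x * (1 - p) ** (n - x)
print(f"{prob_exactly_7:.4f}")  # 0.2668
```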
23
Poisson Distribution
Models the count of events that occur at
random, at a known average rate.
It gives the probability of an event happening a
certain number of times (x) within a given
interval of time or space.
24
Poisson Distribution
Examples-
The number of phone calls received by a call
center during one hour of operation
Text messages per hour
Website visitors per month
25
Poisson Distribution
Characteristics:
The events are independent of each other.
An event can occur any number of times
(within the defined period).
Two events can’t take place simultaneously.
26
Poisson Distribution
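The probability mass function (PMF) of the
Poisson distribution is:
P(X = x) = (λ^x * e^(-λ)) / x!, x = 0, 1, 2, ...
where λ is the average number of events in
the interval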
28
Numerical
Suppose that the average rate of calls received
by the call center is 10 per hour. Calculate
the probability of receiving 8 or fewer calls
during one hour.
29
Numerical
Solution- Poisson Distribution
λ= 10
where λ is the mean or average rate of calls
received by the call center during one hour
We need P(X ≤ 8)
where ‘X’ is the random variable representing
the number of calls received by the call center
during one hour.
30
Numerical
P(X ≤ 8) = Σ P(X = x), for x = 0 to 8
where P(X = x) = (10^x * e^(-10)) / x!
31
Numerical
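P(X = 0) = (10^0 * e^(-10)) / 0! ≈ 0.00005
P(X = 1) = (10^1 * e^(-10)) / 1! ≈ 0.0005
P(X = 2) = (10^2 * e^(-10)) / 2! ≈ 0.0023
P(X = 3) = (10^3 * e^(-10)) / 3! ≈ 0.0076
P(X = 4) = (10^4 * e^(-10)) / 4! ≈ 0.0189
P(X = 5) = (10^5 * e^(-10)) / 5! ≈ 0.0378
P(X = 6) = (10^6 * e^(-10)) / 6! ≈ 0.0631
P(X = 7) = (10^7 * e^(-10)) / 7! ≈ 0.0901
P(X = 8) = (10^8 * e^(-10)) / 8! ≈ 0.1126
P(X ≤ 8) ≈ 0.333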
32
Numerical
1. Suppose that a manufacturing company
produces light bulbs at a rate of 3 defective
bulbs per hour. What is the probability that
exactly 2 defective bulbs are produced in a
30-minute interval?
2. Suppose a factory produces electronic
components, and 5% of the components are
defective. If a sample of 200 components is
randomly selected, what is the probability
that there are fewer than 10 defective
components in the sample?
33
Numerical
Solution (1)- Poisson Distribution
λ = (3/60) * 30 = 1.5
where λ is the rate parameter for the Poisson
distribution, here the expected number of
defective bulbs in 30 minutes
x = 2
Put the values in the formula:
P(X = 2) = (e^(-1.5) * 1.5^2) / 2!
P(X = 2) ≈ 0.2510
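The same calculation as a short sketch, rescaling the hourly rate to the 30-minute interval before applying the Poisson PMF (variable names are assumptions):

```python
from math import exp, factorial

rate_per_hour = 3        # defective bulbs per hour
interval_minutes = 30

# Expected number of defective bulbs in the 30-minute interval.
lam = rate_per_hour / 60 * interval_minutes

x = 2
prob_exactly_2 = lam ** x * exp(-lam) / factorial(x)
print(f"lambda = {lam}, P(X = 2) = {prob_exactly_2:.4f}")  # 0.2510
```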
34
Numerical
Solution (2)- Binomial distribution problem
p = 0.05, n = 200, x < 10
P(X < 10) = P(X ≤ 9)
= Σ C(200, x) * 0.05^x * 0.95^(200 - x), for x = 0 to 9
≈ 0.455
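Because "fewer than 10" means X ≤ 9, the probability is the sum of ten binomial PMF terms. The sketch below evaluates that sum exactly; `binomial_cdf` is a hypothetical helper name. Note that the expected number of defectives is np = 10, so a value near one half is plausible.

```python
from math import comb


def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ B(n, p), summing the PMF from 0 through k."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))


n, p = 200, 0.05  # sample size, defect probability

p_fewer_than_10 = binomial_cdf(9, n, p)  # P(X < 10) = P(X <= 9)
print(f"P(X < 10) = {p_fewer_than_10:.3f}")
```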
35
Continuous Distribution
Describes the distribution of continuous
random variables.
A continuous random variable can take on any
value within a range or interval of values, as
opposed to a discrete random variable that can
only take on distinct values.
It is characterized by a probability density
function (PDF).
36
Probability Density Function (PDF)
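Describes the probability distribution of a
continuous random variable.
Gives the relative likelihood of a random
variable (X) taking on a value near a particular
value (x).
The probability that X falls within a range of
values (a, b) is the area under the PDF f(x)
between a and b:
P(a ≤ X ≤ b) = ∫ f(x) dx, integrated from a to b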
37
Types of Continuous Distribution
Uniform Distribution
Normal or Gaussian Distribution
Student's t-Distribution
Exponential Distribution
38
Uniform Distribution
Also called the rectangular distribution.
It describes an experiment where every outcome
between certain boundaries is equally likely.
Example-
The time to fly from Delhi to Hyderabad ranges
from 120 to 150 minutes. If we monitor the flight
time for many commercial flights, it will follow
more or less the uniform distribution.
39
Uniform Distribution
PDF f(x) = 1 / (b - a) for a ≤ x ≤ b, and 0 otherwise
f(x) is the probability density function of X
a and b are the lower and upper bounds of the
distribution, respectively.
40
Uniform Distribution
The Expected value or Mean
E(X) = (a + b) / 2
Variance
Var(X) = (b - a)^2 / 12
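Applying these formulas to the flight-time example (a = 120, b = 150 minutes) gives a concrete feel for the uniform distribution. This is a sketch; the helper names and the 130 to 140 minute query are assumptions for illustration.

```python
def uniform_pdf(x: float, a: float, b: float) -> float:
    """Density is constant at 1 / (b - a) inside [a, b], zero outside."""
    return 1 / (b - a) if a <= x <= b else 0.0


def uniform_prob(lo: float, hi: float, a: float, b: float) -> float:
    """P(lo <= X <= hi): the overlap of [lo, hi] with [a, b], times the density."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0.0) / (b - a)


a, b = 120.0, 150.0  # flight time bounds in minutes

mean = (a + b) / 2
variance = (b - a) ** 2 / 12
print(f"mean = {mean}, variance = {variance}")  # mean = 135.0, variance = 75.0

# Probability that a flight takes between 130 and 140 minutes.
print(f"{uniform_prob(130, 140, a, b):.4f}")  # 0.3333
```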
41
Normal Distribution
Symmetric Distribution of Values Around the
Mean
Also called the Gaussian or bell curve distribution.
It is the most commonly used distribution in data science.
Describes the probability of a continuous random
variable that takes real values.
When plotted, the data follows a bell shape, with
most values clustering around a central region
and tapering off as they go further away from the
center.
42
Normal Distribution
Example-
Average weight of a population
The scores of a quiz
43
The scores of a quiz
Many of the students scored between 60 and
80.
The students with scores that fall outside this
range (outliers) are deviating from the center.
44
Normal Distribution
Characteristics-
The random variable takes values from -∞ to
+∞.
Mean, mode and median (measures of central
tendency) coincide with each other.
The distribution curve is symmetric about the
mean.
The area under the curve is equal to 1.
45
Normal Distribution- 68-95-99.7 Rule
While plotting a graph for a normal
distribution, 68% of all values lie within one
standard deviation from the mean.
Similarly, 95% of the values lie within two
standard deviations from the mean, and
99.7% lie within three standard deviations
from the mean.
This last interval captures almost all values.
If a data point falls outside it, it is most likely
an outlier.
46
Normal Distribution- 68-95-99.7 Rule
If the mean is 70 and the standard deviation is
10, 68% of the values will lie between 60 and
80, and so on for 95% and 99.7%.
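The rule can be verified for the mean 70, standard deviation 10 example with Python's `statistics.NormalDist`. This is a sketch; the variable names are assumptions.

```python
from statistics import NormalDist

mu, sigma = 70, 10
scores = NormalDist(mu=mu, sigma=sigma)

within = {}
for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    # Share of values between low and high = CDF(high) - CDF(low).
    within[k] = scores.cdf(high) - scores.cdf(low)
    print(f"{low} to {high}: {within[k]:.4f}")
# 60 to 80: 0.6827
# 50 to 90: 0.9545
# 40 to 100: 0.9973
```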
47
Normal Distribution
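PDF of normal distribution:
f(x) = (1 / (σ * √(2π))) * e^(-(x - μ)^2 / (2σ^2))
where μ is the mean and σ is the standard deviation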
49
Standard Normal Distribution
Has a mean of zero and a standard deviation
of one.
The x values of the standard normal
distribution are called z-scores.
Z-score is used to determine the probability of
a given value occurring in a normal
distribution, using standard normal
distribution.
50
Z-SCORE
The z-score equals the value X minus the population
mean (μ), all divided by the standard deviation
(σ):
z = (X - μ) / σ
51
Standard Normal Distribution
PDF:
f(z) = (1 / √(2π)) * e^(-z^2 / 2)
52
Numerical
The marks of students (X) in a class of 70
students follow a normal distribution with
mean 50 units and variance 225 units. Find
P(40 < X < 60).
53
Numerical
Solution- Normal Distribution
Mean (μ) of 50 units
Variance (σ^2) of 225 units, so the standard
deviation σ = √225 = 15
Standardize the distribution using the Z-score.
To find the probability P(40 < X < 60), first
find the Z-scores for X = 40 and X = 60:
Z1 = (40 - 50) / 15 ≈ -0.67
Z2 = (60 - 50) / 15 ≈ 0.67
54
Numerical
Solution-
Using a calculator or a standard normal table, find
the probability of Z being between -0.67 and 0.67.
P(-0.67 < Z < 0.67) = 0.7486 – 0.2514
= 0.4972
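The table lookup can be reproduced with `statistics.NormalDist`. The sketch below computes the answer twice: once with the z-scores rounded to two decimals, as a printed table requires, and once without rounding, which gives a slightly smaller value of about 0.495. Variable names are assumptions.

```python
from math import sqrt
from statistics import NormalDist

mu, variance = 50, 225
sigma = sqrt(variance)  # standard deviation = 15
std_normal = NormalDist()  # mean 0, standard deviation 1

# Z-scores for X = 40 and X = 60.
z1 = (40 - mu) / sigma
z2 = (60 - mu) / sigma

# Table-style answer: z-scores rounded to two decimals (-0.67 and 0.67).
p_table = std_normal.cdf(round(z2, 2)) - std_normal.cdf(round(z1, 2))

# Answer without rounding the z-scores.
p_exact = std_normal.cdf(z2) - std_normal.cdf(z1)

print(f"table-style: {p_table:.4f}, unrounded: {p_exact:.4f}")
```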
55
Student's t-Distribution
Small-sample counterpart of the normal
distribution.
It is also known as the ‘t’ distribution.
Similar to the standard normal distribution
with its bell shape, but has heavier tails.
The shape of the t-distribution depends on the
degrees of freedom ‘n’, which is equal to the
sample size ‘k’ minus one.
Degrees of freedom ‘n’ = k - 1
56
Student's t-Distribution
Example-
Suppose we deal with the total apples sold by a
shopkeeper in a month.
In that case, we will use the normal
distribution.
Whereas, if we are dealing with the total
amount of apples sold in a day, i.e., a smaller
sample, we can use the ‘t’ distribution.
57
Student's t-Distribution
As the sample size increases, the t-distribution
approaches the normal distribution, so the
t-distribution can be used for larger sample
sizes as well.
58
Student's t-Distribution
PDF:
f(t) = Γ((n + 1) / 2) / (√(nπ) * Γ(n / 2)) * (1 + t^2 / n)^(-(n + 1) / 2)
n = degrees of freedom
Γ is the gamma function, which is a
generalization of the factorial function to
complex numbers
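The standard t-distribution density, f(t) = Γ((n + 1) / 2) / (√(nπ) * Γ(n / 2)) * (1 + t^2 / n)^(-(n + 1) / 2), can be evaluated with the standard library. The sketch below uses `math.lgamma` (the log of the gamma function) instead of `math.gamma` so that large degrees of freedom do not overflow; the function name `t_pdf` is an assumption.

```python
from math import exp, lgamma, pi, sqrt
from statistics import NormalDist


def t_pdf(t: float, n: float) -> float:
    """Density of Student's t-distribution with n degrees of freedom."""
    # Ratio of gamma functions computed in log space to avoid overflow.
    coeff = exp(lgamma((n + 1) / 2) - lgamma(n / 2)) / sqrt(n * pi)
    return coeff * (1 + t * t / n) ** (-(n + 1) / 2)


std_normal = NormalDist()

# Heavier tails: far from the center, t has more density than the normal.
print(t_pdf(3.0, 5) > std_normal.pdf(3.0))  # True

# With many degrees of freedom, t is close to the standard normal.
print(abs(t_pdf(1.0, 1000) - std_normal.pdf(1.0)) < 1e-3)  # True
```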
59
Student's t-Distribution
Expected value or Mean
E(X) = 0, for n > 1
Variance
Var(X) = n / (n - 2), for n > 2
n = degrees of freedom
60
Exponential Distribution
It models elapsed time between two events.
It is concerned with the amount of time until
some specific event occurs.
61
Exponential Distribution
Example-
How long do we need to wait before a
customer enters a shop?
How long will it take before a call center
receives the next phone call?
How long will a piece of machinery work
without breaking down?
62
Exponential Distribution
All these questions concern the time we need
to wait before a given event occurs.
If the waiting time is unknown, it is often
appropriate to think of it as a random variable
having an exponential distribution.
63
Exponential Distribution
PDF of Exponential Distribution:
f(x) = λ * e^(-λx), for x ≥ 0
where λ is the rate, the average number of events
per unit of time
64
Exponential Distribution
The CDF of the exponential distribution gives
the probability that the time between events is
less than or equal to a specific value x.
CDF of Exponential Distribution:
F(x) = P(X ≤ x) = 1 - e^(-λx), for x ≥ 0
66
Exponential Distribution
Find the probability that the time
between events is less than or equal to 1
minute if λ = 10 events per hour.
67
Exponential Distribution
Solution- P(X ≤ 1 minute), λ = 10 per hour
Convert 1 minute into hours: 1/60 ≈ 0.0167 hours
Using the CDF of the exponential distribution:
P(X ≤ 0.0167) = 1 - e^(-10 * 0.0167)
= 1 - e^(-0.167)
≈ 0.15
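The key step is keeping the units consistent: λ is per hour, so the 1-minute threshold must be expressed in hours before applying the CDF 1 - e^(-λx). A minimal sketch (the function name `exponential_cdf` is an assumption):

```python
from math import exp


def exponential_cdf(x: float, lam: float) -> float:
    """P(X <= x) = 1 - e^(-lam * x) for x >= 0, else 0."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0


lam = 10            # events per hour
x_hours = 1 / 60    # 1 minute expressed in hours

p_within_1_min = exponential_cdf(x_hours, lam)
print(f"{p_within_1_min:.4f}")  # 0.1535
```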
68
Exponential Distribution
The time (in hours) required to repair a
machine is an exponentially distributed
random variable with parameter λ = 1/2. What
is the probability that a repair time exceeds 2
hours?
69
Exponential Distribution
Solution- λ = ½, P(X ≥ 2)
Complement rule
P(X ≥ x) = 1 - P(X ≤ x)
= 1 - [1 - e^(-λx)]
= e^(-λx)
P(X ≥ 2) = 1 - P(X ≤ 2)
= e^(-1/2 * 2)
= e^(-1)
≈ 0.368
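The complement rule for the repair-time problem can be checked in two equivalent ways, as 1 minus the CDF and as e^(-λx) directly. A short sketch (variable names are assumptions):

```python
from math import exp

lam = 0.5   # rate parameter, per hour
x = 2       # repair time threshold in hours

# Complement of the CDF: P(X >= x) = 1 - (1 - e^(-lam * x)).
p_via_complement = 1 - (1 - exp(-lam * x))

# Simplified form: P(X >= x) = e^(-lam * x).
p_direct = exp(-lam * x)

print(f"{p_via_complement:.4f} {p_direct:.4f}")  # 0.3679 0.3679
```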
70
Numerical
The length of life of an instrument produced
by a machine has a normal distribution with a
mean of 12 months and standard deviation of
2 months. Find the probability that an
instrument produced by this machine will last
less than 7 months.
71
Numerical
Solution-
X is the value to standardize, X = 7 months
μ is the mean, μ = 12 months
σ is the standard deviation, σ = 2 months
Substituting the given values in the Z-score
formula:
z = (7 - 12) / 2 = -2.5
P(X < 7) = P(Z < -2.5) ≈ 0.0062, or 0.62%
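The lookup of P(Z < -2.5) can be reproduced with `statistics.NormalDist`, either through the z-score or directly on the original scale. A minimal sketch (variable names are assumptions):

```python
from statistics import NormalDist

mu, sigma, x = 12, 2, 7  # months

z = (x - mu) / sigma  # -2.5

# Probability via the standard normal distribution and the z-score.
p_via_z = NormalDist().cdf(z)

# Same probability computed directly on the original scale.
p_direct = NormalDist(mu=mu, sigma=sigma).cdf(x)

print(f"z = {z}, P(X < 7) = {p_via_z:.4f}")  # z = -2.5, P(X < 7) = 0.0062
```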
72
Exponential Distribution
Suppose that the time between machine
breakdowns at a factory follows an exponential
distribution with a mean of 10 hours. Calculate
the probability that the time between
breakdowns is between 5 and 10 hours.
73
Exponential Distribution
Find P(5 ≤ X ≤ 10)
Use the interval rule-
The probability of being inside the interval is the
complement of the probability of being outside it.
The probability of being outside the
interval is the combined probability of being too
low for the interval, P(X ≤ 5), and being too
high for the interval, P(X ≥ 10).
P(5 ≤ X ≤ 10) = 1- [P(X ≤ 5) + P(X ≥ 10)]
74
Exponential Distribution
Compute P(5 ≤ X ≤ 10) with λ = 1/10 (the mean is 1/λ = 10 hours)
P(5 ≤ X ≤ 10) = 1 - [P(X ≤ 5) + P(X ≥ 10)]
Too low: P(X ≤ 5) = 1 - e^(-0.5) ≈ 0.393
Too high: P(X ≥ 10) = e^(-1) ≈ 0.368
Outside = P(X ≤ 5) + P(X ≥ 10) ≈ 0.761
Inside = P(5 ≤ X ≤ 10) ≈ 1 - 0.761 = 0.239
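The interval rule can be cross-checked against the equivalent shortcut F(10) - F(5), where F is the exponential CDF. The sketch below computes both; the variable names are assumptions.

```python
from math import exp

mean_hours = 10
lam = 1 / mean_hours  # rate = 1 / mean


def cdf(x: float) -> float:
    """Exponential CDF: P(X <= x) = 1 - e^(-lam * x)."""
    return 1 - exp(-lam * x)


too_low = cdf(5)         # P(X <= 5)
too_high = 1 - cdf(10)   # P(X >= 10)
inside_via_complement = 1 - (too_low + too_high)

# Equivalent shortcut: difference of the CDF at the two endpoints.
inside_via_cdf = cdf(10) - cdf(5)

print(f"{too_low:.4f} {too_high:.4f} {inside_via_complement:.4f}")
# 0.3935 0.3679 0.2387
```

Both routes give the same probability, so the shortcut is a convenient check on the complement calculation.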
75