Probability Distributions in Data Science - Towards Data Science
https://towardsdatascience.com/probability-distributions-in-data-science-cce6e64873a7
11/10/2021 12:25 Probability Distributions in Data Science | by Pier Paolo Ippolito | Towards Data Science
Introduction
Having a sound statistical background can be greatly beneficial in the daily life of a
Data Scientist. Every time we start exploring a new dataset, we first need to perform an
Exploratory Data Analysis (EDA) to get a feel for the main characteristics of its
features. If we can tell whether any pattern is present in the data distribution, we can
then tailor our Machine Learning models to best fit our case study. In this way, we can
get better results in less time (reducing the optimisation steps). In fact, some Machine
Learning models are designed to work best under certain distributional assumptions.
Therefore, knowing which distributions we are working with can help us identify which
models are best to use.
Let’s imagine we want to predict the price of a house given a certain set of features. We
might be able to find a dataset online with all the house prices of San Francisco (our
sample) and, after performing some statistical analysis, we might be able to make quite
accurate predictions of the house price in any other city in the USA (our population).
Datasets are composed of two main types of data: Numerical (e.g. integers, floats) and
Categorical (e.g. names, laptop brands).
Numerical data can additionally be divided into two further categories: Discrete and
Continuous. Discrete data can take only certain values (e.g. the number of students in a
school) while continuous data can take any real or fractional value (e.g. height and
weight).
Functions.
A Probability Mass Function gives the probability that a discrete variable is equal to a
certain value. The values of a Probability Density Function, instead, are not themselves
probabilities: they first need to be integrated over a given range.
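The difference between the two kinds of function can be checked numerically. The following is a minimal sketch using `scipy.stats`; the specific distributions and parameter values are illustrative assumptions and do not come from the article.

```python
import numpy as np
from scipy import stats

# Discrete case: a PMF returns actual probabilities, which sum to 1.
n_trials, p_success = 10, 0.5  # illustrative values (assumption)
k = np.arange(0, n_trials + 1)
pmf_values = stats.binom.pmf(k, n_trials, p_success)
pmf_total = pmf_values.sum()

# Continuous case: a PDF returns densities, which can be larger than 1.
narrow = stats.norm(loc=0.0, scale=0.1)  # illustrative values (assumption)
density_at_mean = narrow.pdf(0.0)

# A probability is obtained by integrating the PDF over a range,
# which equals the difference of the CDF at the two endpoints.
prob_in_range = narrow.cdf(0.1) - narrow.cdf(-0.1)

print(f"Sum of PMF values: {pmf_total:.4f}")
print(f"PDF value at the mean: {density_at_mean:.4f}")
print(f"P(-0.1 <= X <= 0.1): {prob_in_range:.4f}")
```

The density at the mean is greater than 1 here, which would be impossible for a probability, while the integrated value over the range is a valid probability.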
There exist many different probability distributions in nature (Figure 1). In this article, I
will introduce you to the ones most commonly used in Data Science.
Throughout this article, I will provide code snippets on how to create each of the
different distributions. If you are interested in additional resources, these are available
in my GitHub repository.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
Bernoulli Distribution
The Bernoulli distribution is one of the easiest distributions to understand and can be
used as a starting point to derive more complex distributions.
This distribution has only two possible outcomes and a single trial.
A simple example can be a single toss of a biased or unbiased coin. In this example, the
probability that the outcome is heads can be considered equal to p, and (1 - p)
for tails (the probabilities of mutually exclusive events that encompass all possible
outcomes need to sum to one).
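The coin example can be sketched with `scipy.stats.bernoulli`. This is only an illustration: the bias `p = 0.6`, the sample size and the random seed are assumptions, and the article's original snippet may differ.

```python
import numpy as np
from scipy import stats

p = 0.6  # assumed probability of heads for a biased coin (illustrative)
coin = stats.bernoulli(p)

# The PMF has only two non-zero values: P(heads) = p and P(tails) = 1 - p.
prob_heads = coin.pmf(1)
prob_tails = coin.pmf(0)

# Simulate many independent single tosses (1 = heads, 0 = tails).
rng = np.random.default_rng(0)
tosses = coin.rvs(size=10_000, random_state=rng)
observed_heads = tosses.mean()

print(f"P(heads) = {prob_heads:.2f}, P(tails) = {prob_tails:.2f}")
print(f"Observed share of heads: {observed_heads:.3f}")
```

With enough tosses, the observed share of heads should settle close to p.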
Uniform Distribution
The Uniform Distribution can be easily derived from the Bernoulli Distribution. In this
case, a possibly unlimited number of outcomes is allowed and all the events have the
same probability of taking place.
As an example, imagine the roll of a fair die. In this case, there are multiple possible
events, each of them having the same probability of happening.
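The die example can be sketched as a discrete uniform distribution with `scipy.stats.randint`. The number of simulated rolls and the seed are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Discrete uniform distribution over the faces 1..6
# (the upper bound of stats.randint is exclusive).
die = stats.randint(1, 7)
faces = np.arange(1, 7)
face_probs = die.pmf(faces)  # every face has probability 1/6

# Simulate rolls and compare the observed frequencies with 1/6.
rng = np.random.default_rng(0)
rolls = die.rvs(size=60_000, random_state=rng)
observed = np.bincount(rolls, minlength=7)[1:] / rolls.size

for face, prob, freq in zip(faces, face_probs, observed):
    print(f"face {face}: theoretical {prob:.4f}, observed {freq:.4f}")
```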
Binomial Distribution
The Binomial Distribution can instead be thought of as the sum of the outcomes of a series
of events, each following a Bernoulli distribution. The Binomial Distribution is therefore
used for binary-outcome events in which the probabilities of success and failure stay the
same across all successive trials. This distribution takes two parameters as inputs: the
number of trials and the probability assigned to one of the two classes.
Varying the amount of bias changes the way the distribution looks (Figure 4).
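The effect of the bias can also be seen numerically. The sketch below compares three values of p for a fixed number of trials; `n = 20` and the three probabilities are illustrative assumptions, not the values used for the article's figure.

```python
import numpy as np
from scipy import stats

n = 20  # number of trials (illustrative value)
k = np.arange(0, n + 1)  # every possible number of successes

summary = {}
for p in (0.2, 0.5, 0.8):  # three levels of bias (illustrative values)
    pmf = stats.binom.pmf(k, n, p)
    summary[p] = {
        "most_likely": int(k[np.argmax(pmf)]),   # peak of the distribution
        "mean": float(stats.binom.mean(n, p)),   # equals n * p
    }
    print(f"p={p}: most likely successes={summary[p]['most_likely']}, "
          f"mean={summary[p]['mean']:.1f}")
```

As p grows, the peak of the distribution moves towards a larger number of successes.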
Given multiple trials, each of them is independent of the others (the outcome of
one trial doesn’t affect another one).
Each trial can lead to just two possible results (e.g. winning or losing), which have
probabilities p and (1 - p).
If we are given the probability of success (p) and the number of trials (n), we can then
calculate the probability of obtaining exactly x successes within these n trials using the
formula below (Figure 5).
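For reference, the standard binomial probability mass function is P(X = x) = C(n, x) p^x (1 - p)^(n - x), where C(n, x) is the number of ways of choosing x successes out of n trials. The sketch below evaluates it by hand and compares the result with `scipy.stats.binom.pmf`; the helper name `binomial_probability` and the example values are hypothetical.

```python
from math import comb
from scipy import stats

def binomial_probability(x, n, p):
    """Probability of exactly x successes in n independent trials (hypothetical helper)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p, x = 10, 0.5, 3  # illustrative values (assumption)
by_hand = binomial_probability(x, n, p)
by_scipy = stats.binom.pmf(x, n, p)

print(f"By hand: {by_hand:.7f}")
print(f"SciPy:   {by_scipy:.7f}")
```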
Normal Distribution
“In probability theory, the central limit theorem (CLT) establishes that, in some
situations, when independent random variables are added, their properly normalized sum
tends toward a normal distribution even if the original variables themselves are not
normally distributed.”
— Wikipedia
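The statement in the quote can be illustrated with a small simulation. This is a sketch under assumptions: uniform random variables are used as the non-normal starting point, and the number of terms, the number of sums and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n_terms = 50     # variables added together in each sum (illustrative)
n_sums = 20_000  # number of sums simulated (illustrative)

# Each row holds n_terms independent Uniform(0, 1) draws, which are not normal.
draws = rng.uniform(0.0, 1.0, size=(n_sums, n_terms))
sums = draws.sum(axis=1)

# Normalise the sums: Uniform(0, 1) has mean 1/2 and variance 1/12.
standardised = (sums - n_terms * 0.5) / np.sqrt(n_terms / 12.0)

# For a standard normal, about 68% of the values lie within one standard deviation.
within_one_sd = np.mean(np.abs(standardised) < 1.0)

print(f"mean: {standardised.mean():.3f}, std: {standardised.std():.3f}")
print(f"share within one standard deviation: {within_one_sd:.3f}")
```

The normalised sums should have a mean close to 0, a standard deviation close to 1 and roughly 68% of their values within one standard deviation, as expected for a normal distribution.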
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Evaluate the normal PDF (mean 0, standard deviation 10) on a grid of values
n = np.arange(-50, 50)
mean = 0
normal = stats.norm.pdf(n, mean, 10)
plt.plot(n, normal)
plt.xlabel('Distribution', fontsize=12)
plt.ylabel('Probability density', fontsize=12)
plt.title("Normal Distribution")
Some of the characteristics which can help us to recognise a normal distribution are:
The curve is symmetric about its centre. Therefore the mean, mode and median are all
equal to the same value, and the values are distributed symmetrically around the
mean.
The area under the distribution curve is equal to 1 (all the probabilities must sum
up to 1).
A normal distribution can be described using the following formula (Figure 7).
When using Normal Distributions, the distribution mean and standard deviation play
a really important role. If we know their values, we can easily find the
probability of observing values in a given range by just examining the probability
distribution (Figure 8). In fact, thanks to the distribution properties, about 68% of the
data lies within one standard deviation of the mean, 95% within two standard deviations
of the mean and 99.7% within three standard deviations of the mean.
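These percentages can be recovered from the cumulative distribution function. The sketch below reuses the mean (0) and standard deviation (10) from the plotting snippet above; any other values would give the same percentages.

```python
from scipy import stats

mean, sd = 0.0, 10.0  # same parameters as the plotting snippet above
dist = stats.norm(mean, sd)

# Probability of falling within k standard deviations of the mean,
# computed as the difference of the CDF at the two endpoints.
coverage = {
    k: dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)
    for k in (1, 2, 3)
}

for k, prob in coverage.items():
    print(f"within {k} standard deviation(s): {prob:.4f}")
```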
Many Machine Learning models are designed to work best using data that follow a
Normal Distribution. Some examples are:
Poisson Distribution
Poisson Distributions are commonly used to find the probability that an event might or
might not happen, knowing how often it usually occurs. Additionally, Poisson Distributions
can be used to predict how many times an event might occur in a given time
period.
When working with Poisson Distributions, we can be confident of the average time
between the occurrence of different events, but the precise moment at which an event
takes place is random.
A Poisson Distribution can be modelled using the following formula (Figure 9), where
λ represents the expected number of events which can take place in a period.
1. The events are independent of each other (if an event happens, this does not alter
the probability that another event can take place).
2. An event can take place any number of times (within the defined time period).
Figure 10 shows how varying the expected number of events which can take
place in a period (λ) changes a Poisson Distribution.
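For reference, the standard Poisson probability mass function is P(X = k) = λ^k e^(-λ) / k!. The sketch below evaluates it by hand, compares it with `scipy.stats.poisson.pmf`, and shows how the distribution shifts as λ changes. The helper name `poisson_probability` and the λ values are hypothetical.

```python
from math import exp, factorial

import numpy as np
from scipy import stats

def poisson_probability(k, lam):
    """Probability of exactly k events when lam events are expected (hypothetical helper)."""
    return lam**k * exp(-lam) / factorial(k)

by_hand = poisson_probability(2, 4)
by_scipy = stats.poisson.pmf(2, 4)
print(f"P(X = 2 | lambda = 4): by hand {by_hand:.6f}, SciPy {by_scipy:.6f}")

k = np.arange(0, 41)  # event counts to evaluate (upper limit is an assumption)
totals, means, variances = {}, {}, {}
for lam in (1, 4, 10):  # illustrative values of lambda
    pmf = stats.poisson.pmf(k, lam)
    totals[lam] = pmf.sum()                        # close to 1 over this range
    means[lam] = float(stats.poisson.mean(lam))    # equals lambda
    variances[lam] = float(stats.poisson.var(lam)) # also equals lambda
    print(f"lambda={lam}: mean={means[lam]:.1f}, variance={variances[lam]:.1f}")
```

Both the mean and the variance of a Poisson Distribution equal λ, so a larger λ moves the distribution to the right and spreads it out.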
Exponential Distribution
Finally, the Exponential Distribution is used to model the time elapsed between the
occurrences of successive events.
The Exponential Distribution is modelled using the following formula (Figure 12).
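For reference, the standard exponential probability density function is f(x) = λ e^(-λx) for x ≥ 0, where λ is the rate at which events occur. The sketch below evaluates it by hand and with `scipy.stats.expon`, which is parameterised by `scale = 1 / λ` rather than by λ. The rate, the grid of values, the sample size and the seed are illustrative assumptions.

```python
import numpy as np
from scipy import stats

lam = 0.5  # assumed rate: on average 0.5 events per unit of time (illustrative)
x = np.linspace(0, 10, 101)

# Density by hand and with SciPy (scale = 1 / lambda).
pdf_by_hand = lam * np.exp(-lam * x)
pdf_scipy = stats.expon.pdf(x, scale=1 / lam)

# The mean waiting time between events is 1 / lambda.
mean_wait = float(stats.expon.mean(scale=1 / lam))

# Simulated waiting times should average close to 1 / lambda.
rng = np.random.default_rng(0)
waits = rng.exponential(scale=1 / lam, size=50_000)

print(f"Theoretical mean waiting time: {mean_wait:.2f}")
print(f"Simulated mean waiting time:   {waits.mean():.2f}")
```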
Contacts
If you want to keep updated with my latest articles and projects, follow me on Medium
and subscribe to my mailing list. These are some of my contact details:
Personal Blog
Personal Website
Medium Profile
GitHub
Kaggle