A Case Study With Conditional Probability - Kaggle
https://www.kaggle.com/lakshya91/a-case-study-with-conditional-probability 1/11
8/13/2021 Notebook
Overview
In this post, I'll give a gentle introduction to conditional probability with the help of a real-life
example. Then, I'll extend the idea of calculating conditional probabilities to Bayes Theorem. I
know excellent resources on this already exist; this post is in no way a replacement for those, it's
just a supplement to see how we can apply these concepts to some real-life data.
In probability theory, conditional probability is the probability of an event occurring given
that another event has already occurred. To understand it a little better, we first need to set
the stage by defining a few terms from set theory.
In the simplest terms, an event is an outcome (or a set of outcomes) of a random experiment. For example, getting a head
when we toss a coin is one event; drawing a ball at random from a bag containing 3 black and 5 red
balls is also an event. As we can see, we can easily associate a probability with each event.
The collection of all possible outcomes of an experiment is called the sample space. For tossing a coin we
have just two outcomes: a head (H) or a tail (T). Similarly, rolling a fair die always results in a
number from 1 to 6, hence the sample space is {1, 2, 3, 4, 5, 6}.
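To make the idea concrete, here is a minimal sketch that treats an event as a set of outcomes and counts how often it occurs in a sample space of equally likely outcomes. The helper name `probability` and the choice of "drawing a red ball" as the event are my own assumptions for illustration.

```python
from fractions import Fraction

def probability(event, sample_space):
    """P(event) when every outcome in sample_space is equally likely."""
    favourable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favourable), len(sample_space))

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]
bag = ["black"] * 3 + ["red"] * 5  # 3 black and 5 red balls

p_head = probability({"H"}, coin)   # getting a head
p_six = probability({6}, die)       # rolling a 6
p_red = probability({"red"}, bag)   # drawing a red ball (assumed event)

print(p_head, p_six, p_red)
```

Using `Fraction` keeps the results exact, which makes them easy to compare with hand calculations.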
Union of events
Consider again the rolling of a fair die, where we define two events:
A: getting a number divisible by 2
B: getting a number divisible by 3
(https://imgur.com/7o5B3u9)
In terms of probabilities, we can easily calculate the probabilities for all the events as follows:
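As a rough sketch, the union of the two die events (a number divisible by 2 OR by 3) can be computed with Python sets. The variable names are assumptions of mine; the check at the end uses the standard identity P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A = {n for n in sample_space if n % 2 == 0}  # divisible by 2 -> {2, 4, 6}
B = {n for n in sample_space if n % 3 == 0}  # divisible by 3 -> {3, 6}

def prob(event):
    """Probability of an event on a fair die (equally likely outcomes)."""
    return Fraction(len(event), len(sample_space))

A_or_B = A | B                 # union: divisible by 2 OR 3
p_union = prob(A_or_B)

# The same value via P(A) + P(B) - P(A ∩ B)
p_union_check = prob(A) + prob(B) - prob(A & B)

print(sorted(A_or_B), p_union, p_union_check)
```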
Intersection of events
Following the events defined previously, we can also define an event D: getting a number that
is divisible by both 2 AND 3, that is, the element common to both events A and B. In
terms of Venn diagrams, it can be shown as:
(https://imgur.com/dwBjCji)
$$P(A) = \frac{\text{Number of cases where the output is divisible by 2}}{\text{Total possible outcomes}} = \frac{3}{6} = 0.5$$

$$P(B) = \frac{\text{Number of cases where the output is divisible by 3}}{\text{Total possible outcomes}} = \frac{2}{6} = 0.333$$

$$P(D) = P(A \cap B) = \frac{\text{Number of cases where the output is divisible by both 2 AND 3}}{\text{Total possible outcomes}} = \frac{1}{6} = 0.167$$
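The three values above can be reproduced by counting outcomes. This is only a sketch; the variable names (`A_and_B`, `p_a`, and so on) are my own.

```python
sample_space = [1, 2, 3, 4, 5, 6]

A = {n for n in sample_space if n % 2 == 0}  # divisible by 2
B = {n for n in sample_space if n % 3 == 0}  # divisible by 3
A_and_B = A & B                              # divisible by both 2 AND 3

p_a = len(A) / len(sample_space)
p_b = len(B) / len(sample_space)
p_a_and_b = len(A_and_B) / len(sample_space)

print(round(p_a, 3), round(p_b, 3), round(p_a_and_b, 3))
```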
Disjoint events
(https://imgur.com/KqXin9c)
If the occurrence of one event does not affect the occurrence of another event, then these events are
termed independent events. A few examples of independent events:
Getting a head when a coin is tossed AND getting a 5 when rolling a fair die
Getting rain in the month of August AND snow in December
The probability in the case of independent events can be written as P(A ∩ B) = P(A) * P(B), that
is, the probability of both events occurring is just the product of the individual probabilities. Let's try to
understand this more concretely. Suppose we have a bag containing 3 blue and 5 green balls and we
draw two balls at random with replacement (meaning we put the first ball back in the bag after the first
trial). We define the two events as:
A: getting a blue ball in the first trial
B: getting a green ball in the second trial
We are interested in calculating the probability of getting a blue ball in the first trial AND a green ball in
the second. Writing the probabilities for both events: P(A) = 3/8 and P(B) = 5/8. Since the first ball is put back,
getting a blue ball in the first trial has no effect on getting a green ball in the second, so these are
independent events and we can write:
P(A ∩ B) = P(A) * P(B) = 3/8 * 5/8 = 15/64 ≈ 0.234
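A quick way to check the product rule for this example is to enumerate every ordered pair of draws with replacement. This is a sketch under the assumption that each of the 8 balls is equally likely on each draw; the helper `prob` is a name I made up.

```python
from fractions import Fraction
from itertools import product

bag = ["blue"] * 3 + ["green"] * 5

# All ordered (first, second) draws WITH replacement: 8 * 8 = 64 pairs
draws = list(product(bag, repeat=2))

def prob(predicate):
    """Fraction of equally likely draws that satisfy the predicate."""
    return Fraction(sum(1 for d in draws if predicate(d)), len(draws))

p_a = prob(lambda d: d[0] == "blue")    # blue in the first trial
p_b = prob(lambda d: d[1] == "green")   # green in the second trial
p_a_and_b = prob(lambda d: d[0] == "blue" and d[1] == "green")

print(p_a, p_b, p_a_and_b, p_a * p_b)
```

The joint probability equals the product of the individual probabilities, which is exactly what independence means.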
Now for the case of dependent events, we can use the same example with one big difference: we are
not going to replace the ball drawn in the first trial! In this case, P(A) = 3/8 and, given that the first ball was blue, P(B) = 5/7.
Notice the denominator of P(B): since the drawn ball is not replaced, the second
draw is now sampled from a smaller set of 7 balls. This also gives a greater chance of getting the
green ball in the second draw.
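The same enumeration idea works without replacement, as sketched below. Here `itertools.permutations` lists every ordered pair of two distinct balls; the names are again my own, and the 5/7 figure shows up as the probability of green on the second draw given blue on the first.

```python
from fractions import Fraction
from itertools import permutations

bag = ["blue"] * 3 + ["green"] * 5

# All ordered (first, second) draws WITHOUT replacement: 8 * 7 = 56 pairs
draws = list(permutations(bag, 2))

def prob(predicate):
    """Fraction of equally likely draws that satisfy the predicate."""
    return Fraction(sum(1 for d in draws if predicate(d)), len(draws))

p_a = prob(lambda d: d[0] == "blue")
p_a_and_b = prob(lambda d: d[0] == "blue" and d[1] == "green")

# Green on the second draw GIVEN blue on the first
p_b_given_a = p_a_and_b / p_a

# Green on the second draw when the first draw is not known
p_b = prob(lambda d: d[1] == "green")

print(p_a, p_b_given_a, p_a_and_b, p_b)
```

Because `p_a_and_b` (15/56) differs from `p_a * p_b` (15/64), the two events are no longer independent.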
This discussion of dependent events naturally extends to the idea of conditional probability: we try to
calculate the probability of an event A given that another event B has already happened. It is denoted by
P(A|B). To get a feel for it, let's see some examples:
Probability of drawing a diamond from a well-shuffled deck of cards given the drawn card is red
Probability of rain on a given day of the month given it is July
We can easily infer from the above two
examples that the events in each example are dependent on each other.
Probability of event A given event B has already occurred:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

We can easily see that the above equation reduces to P(A) for independent events by writing P(A ∩ B) =
P(A) * P(B).
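As an illustration of the formula, the card example can be worked out by building a standard 52-card deck and counting. This is a sketch; the deck representation and the names `diamond`, `red`, and `prob` are assumptions of mine.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

diamond = {card for card in deck if card[1] == "diamonds"}
red = {card for card in deck if card[1] in ("hearts", "diamonds")}

def prob(event):
    """Probability of drawing a card from the given set."""
    return Fraction(len(event), len(deck))

# P(diamond | red) = P(diamond ∩ red) / P(red)
p_diamond_given_red = prob(diamond & red) / prob(red)
p_diamond = prob(diamond)

print(p_diamond, p_diamond_given_red)
```

Knowing the card is red doubles the probability of a diamond (from 1/4 to 1/2), so the two events are dependent.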
Now, in my attempt to make this post a little less boring, let's get our hands dirty and use Python to
understand the concept of conditional probability.
In [1]:
# Import libraries
import numpy as np
import pandas as pd
The dataset contains the monthly rainfall data from years 1901 to 2018 for the Indian state of Kerala.
Kerala is one of the few states which are usually badly hit by monsoons every year. You can read more
about it in this excellent kernel (https://www.kaggle.com/biphili/india-rainfall-kerala-flood).
In [2]:
df = pd.read_csv("/kaggle/input/kerela-flood/kerala.csv")
df.head()
Out[2]:
SUBDIVISION YEAR JAN FEB MAR APR MAY JUN JUL AUG SE
0 KERALA 1901 28.7 44.7 51.6 160.0 174.7 824.6 743.0 357.5 19
1 KERALA 1902 6.7 2.6 57.3 83.9 134.5 390.9 1205.0 315.8 49
2 KERALA 1903 3.2 18.6 3.1 83.6 249.7 558.6 1022.5 420.2 34
3 KERALA 1904 23.7 3.0 32.2 71.5 235.7 1098.2 725.5 351.8 22
4 KERALA 1905 1.2 22.3 9.4 105.9 263.3 850.2 520.5 293.6 21
In [3]:
We need only the columns JUN , JUL , YEAR and FLOODS since we are interested in
calculating the probability of flooding in a year given that it rained more than a certain threshold (500
mm) in these months. We will create a couple more columns based on these columns.
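The code that builds these columns is not fully visible in this copy of the notebook, so the following is only a sketch of one way it could be done. It uses a tiny hand-typed frame with the JUN and JUL values for 1901 to 1905 from the output above; the FLOODS flags are read off the notebook's own output for those years, and I am assuming FLOODS is already encoded as 0/1. `THRESHOLD_MM` is a name I introduced.

```python
import pandas as pd

# Small stand-in for the real dataset (assumption: FLOODS already encoded as 0/1)
df = pd.DataFrame({
    "YEAR": [1901, 1902, 1903, 1904, 1905],
    "JUN": [824.6, 390.9, 558.6, 1098.2, 850.2],
    "JUL": [743.0, 1205.0, 1022.5, 725.5, 520.5],
    "FLOODS": [1, 1, 1, 1, 0],
})

THRESHOLD_MM = 500  # rainfall threshold in mm

# Keep only the columns of interest and work on a copy
df_small = df[["YEAR", "JUN", "JUL", "FLOODS"]].copy()

# 1 if the monthly rainfall exceeded the threshold, else 0
df_small["JUN_GT_500"] = (df_small["JUN"] > THRESHOLD_MM).astype(int)
df_small["JUL_GT_500"] = (df_small["JUL"] > THRESHOLD_MM).astype(int)

df_small = df_small[["YEAR", "JUN_GT_500", "JUL_GT_500", "FLOODS"]]
print(df_small)
```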
In [4]:
# Creating binary data for the months of June and July using the rainfall threshold
df_small["COUNT"] = 1
df_small.head()
Out[4]:
   YEAR  JUN_GT_500  JUL_GT_500  FLOODS  COUNT
0  1901           1           1       1      1
1  1902           0           1       1      1
2  1903           1           1       1      1
3  1904           1           1       1      1
4  1905           1           1       0      1
In [5]:
df_small.shape
Out[5]:
(118, 5)
In [6]:
pd.crosstab(df_small["FLOODS"], df_small["JUN_GT_500"])
Out[6]:
JUN_GT_500 0 1
FLOODS
0 19 39
1 6 54
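The counts in this table can be turned into probabilities directly. The sketch below rebuilds a frame with the same four counts (so it runs without the original CSV) and uses `pd.crosstab` with `normalize="all"` to get joint probabilities; the frame name `demo` and the way it is rebuilt are my own assumptions.

```python
import pandas as pd

# (FLOODS, JUN_GT_500) -> number of years, copied from the crosstab above
counts = {(0, 0): 19, (0, 1): 39, (1, 0): 6, (1, 1): 54}

# Rebuild one row per year so crosstab sees the same counts
rows = [pair for pair, n in counts.items() for _ in range(n)]
demo = pd.DataFrame(rows, columns=["FLOODS", "JUN_GT_500"])

# normalize="all" divides every cell by the grand total (118 years)
joint = pd.crosstab(demo["FLOODS"], demo["JUN_GT_500"], normalize="all")

p_f_and_j = joint.loc[1, 1]   # P(F ∩ J): flood AND June rain > 500 mm
p_f = joint.loc[1].sum()      # P(F): sum over the FLOODS == 1 row
p_j = joint[1].sum()          # P(J): sum over the JUN_GT_500 == 1 column

print(joint.round(3))
print(p_f_and_j, p_f, p_j)
```

Reading the marginals off the normalized table avoids retyping the counts by hand.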
In [7]:
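# Counts are read from the FLOODS x JUN_GT_500 crosstab above
P_F = (6 + 54) / (6 + 54 + 19 + 39)
P_J = (39 + 54) / (6 + 54 + 19 + 39)
P_F_intersect_J = 54 / (6 + 54 + 19 + 39)
print(f"P(F): {P_F}")
print(f"P(J): {P_J}")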
P(F): 0.5084745762711864
P(J): 0.788135593220339
Using the formula - P(A|B) = P(A ∩ B) / P(B) we can easily calculate the conditional probability:
In [8]:
# Now calculate probability of flood given it rained more than 500 mm in June (P(A|B))
P_F_J = P_F_intersect_J / P_J
print(f"P(F|J): {P_F_J}")
P(F|J): 0.5806451612903226
Now, we can also ask ourselves this: *given that it flooded in Kerala in a given year, what is the
probability that it rained more than 500 mm in the month of June or July?* This is where Bayes
Theorem comes into action. Some other examples where Bayes Theorem applies:
The probability of a woman having breast cancer given she tested positive in the test
The probability that a given email is actually spam given it contains certain flagged words
Bayes Theorem can be easily derived using the relationship between conditional probability and
intersection of events. Given two events, we already know:
$$P(A \cap B) = P(A|B)\,P(B) = P(B|A)\,P(A)$$

so,

$$P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}$$
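Before applying this to the rainfall data, here is a small sketch that checks the formula on the die example from earlier (A: divisible by 2, B: divisible by 3). The function name `bayes` is my own; the direct count at the end serves as an independent check.

```python
from fractions import Fraction

def bayes(p_a_given_b, p_b, p_a):
    """Return P(B|A) = P(A|B) * P(B) / P(A)."""
    if p_a == 0:
        raise ValueError("P(A) must be non-zero")
    return p_a_given_b * p_b / p_a

sample_space = {1, 2, 3, 4, 5, 6}
A = {n for n in sample_space if n % 2 == 0}  # divisible by 2
B = {n for n in sample_space if n % 3 == 0}  # divisible by 3

p_a = Fraction(len(A), len(sample_space))
p_b = Fraction(len(B), len(sample_space))
p_a_given_b = Fraction(len(A & B), len(B))   # of {3, 6}, only 6 is even

p_b_given_a = bayes(p_a_given_b, p_b, p_a)

# Direct count: of {2, 4, 6}, only 6 is divisible by 3
p_b_given_a_direct = Fraction(len(A & B), len(A))

print(p_b_given_a, p_b_given_a_direct)
```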
In Bayesian inference, `P(B)` is called the **Prior Probability**. In our case, `P(J)` is the prior probability,
which gives the probability of more than 500 mm of rain in June (or July) without knowing whether it
flooded or not that year. In other words, the prior probability is the probability of the event we are interested in
before any new information is taken into account.
Okay, enough chatter, let's try to code this in Python. Actually we have already done most of the work,
it's just a matter of plugging the numbers into the above equation.
In [9]:
# Probability of rain more than 500 mm in June given it flooded that year (P(B|A))
P_J_F = P_F_J * P_J / P_F
print(f"P(J|F): {P_J_F}")
P(J|F): 0.9000000000000001
In [10]:
pd.crosstab(df_small["FLOODS"], df_small["JUL_GT_500"])
Out[10]:
JUL_GT_500 0 1
FLOODS
0 19 39
1 3 57
In [11]:
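# Counts are read from the FLOODS x JUL_GT_500 crosstab above
P_F = (3 + 57) / (3 + 57 + 19 + 39)
P_J = (39 + 57) / (3 + 57 + 19 + 39)
P_F_intersect_J = 57 / (3 + 57 + 19 + 39)
print(f"P(F): {P_F}")
print(f"P(J): {P_J}")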
P(F): 0.5084745762711864
P(J): 0.8135593220338984
In [12]:
# Now calculate probability of flood given it rained more than 500 mm in July
P_F_J = P_F_intersect_J / P_J
print(f"P(F|J): {P_F_J}")
P(F|J): 0.59375
In [13]:
# Probability of rain more than 500 mm in July given it flooded that year (P(B|A))
P_J_F = P_F_J * P_J / P_F
print(f"P(J|F): {P_J_F}")
P(J|F): 0.9500000000000002
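Since the June and July calculations follow the same steps, they can be wrapped in one helper. This is a sketch, not the notebook's original code: the function name and argument order are my own, and it uses `Fraction` so the results come out exact (which also avoids the trailing floating-point digits seen in the outputs above).

```python
from fractions import Fraction

def flood_rain_probabilities(n00, n01, n10, n11):
    """nXY = number of years with FLOODS == X and the rain flag == Y."""
    total = n00 + n01 + n10 + n11
    p_f = Fraction(n10 + n11, total)       # P(F)
    p_j = Fraction(n01 + n11, total)       # P(J)
    p_f_and_j = Fraction(n11, total)       # P(F ∩ J)
    p_f_given_j = p_f_and_j / p_j          # conditional probability
    p_j_given_f = p_f_given_j * p_j / p_f  # Bayes Theorem
    return {"P(F)": p_f, "P(J)": p_j,
            "P(F|J)": p_f_given_j, "P(J|F)": p_j_given_f}

# Counts copied from the two crosstabs above
june = flood_rain_probabilities(19, 39, 6, 54)
july = flood_rain_probabilities(19, 39, 3, 57)

for month, result in (("June", june), ("July", july)):
    print(month, {name: float(value) for name, value in result.items()})
```

Note that P(J|F) can also be read directly off the crosstab row for FLOODS = 1 (54/60 for June, 57/60 for July), which agrees with the Bayes result.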
Important Takeaways
1. Based on the probability outputs above, we can infer that it flooded in about 59% of the
years when it rained more than 500 mm in July, and in about 58% of such years for June. This means that
rainfall in June and July alone is not completely responsible for the flooding in Kerala. This
actually makes sense since in both 2018 and 2020, the flooding happened in August. Maybe
including August in the analysis would provide more insight into this.
2. Using Bayes Theorem we found that whenever it flooded in Kerala, both June and July had a very
high probability (90% and 95% respectively) of more than 500 mm of rain. This also makes sense, since
June and July are the peak months of rainfall because of the monsoon.
Thanks for reading my kernel! I hope it helped you to understand this concept as much as it helped me.
https://www.statisticshowto.com/bayes-theorem-problems/
https://www.analyticsvidhya.com/blog/2017/03/conditional-probability-bayes-theorem/
https://towardsdatascience.com/bayes-theorem-the-holy-grail-of-data-science-55d93315defb