Bayesian Analysis - Explanation

Bayesian analysis is a statistical method that allows one to combine prior information about a population parameter with evidence from a sample to guide statistical inference. A prior probability distribution is specified first, then evidence is obtained and combined through Bayes' theorem to provide a posterior probability distribution. The posterior distribution provides the basis for statistical inferences concerning the parameter.


Bayesian analysis is a method of statistical inference (named for the English mathematician Thomas Bayes) that allows one to combine prior information about a population parameter with evidence contained in a sample to guide the statistical inference process. A prior probability distribution for the parameter of interest is specified first. The evidence is then obtained and combined with the prior through an application of Bayes's theorem to provide a posterior probability distribution for the parameter. The posterior distribution provides the basis for statistical inferences concerning the parameter.


This method of statistical inference can be described mathematically as follows. If, at a particular stage in an inquiry, a scientist assigns a probability distribution to the hypothesis H, Pr(H) (call this the prior probability of H), and assigns probabilities to the obtained evidence E conditionally on the truth of H, Pr(E|H), and conditionally on the falsehood of H, Pr(E|¬H), then Bayes's theorem gives a value for the probability of the hypothesis H conditionally on the evidence E by the formula

Pr(H|E) = Pr(H)Pr(E|H) / [Pr(H)Pr(E|H) + Pr(¬H)Pr(E|¬H)].

One of the attractive features of this approach to confirmation is that when the evidence would be highly improbable if the hypothesis were false, that is, when Pr(E|¬H) is extremely small, it is easy to see how a hypothesis with a quite low prior probability can acquire a probability close to 1 when the evidence comes in. (This holds even when Pr(H) is quite small and Pr(¬H), the probability that H is false, is correspondingly large; if E follows deductively from H, Pr(E|H) will be 1; hence, if Pr(E|¬H) is tiny, the numerator of the right side of the formula will be very close to the denominator, and the value of the right side thus approaches 1.)
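As a quick numeric illustration (the numbers here are hypothetical, not from the text), a short R sketch of the formula above shows how a hypothesis with a small prior acquires a posterior close to 1 once Pr(E|¬H) is tiny:

> prior_H  <- 0.01    # Pr(H): a quite low prior probability for the hypothesis
> p_E_H    <- 1       # Pr(E|H): E follows deductively from H
> p_E_notH <- 1e-5    # Pr(E|-H): the evidence is nearly impossible if H is false
> prior_H * p_E_H / (prior_H * p_E_H + (1 - prior_H) * p_E_notH)    # about 0.999, close to 1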

A key, and somewhat controversial, feature of Bayesian methods is the notion of a probability distribution for a population parameter.
According to classical statistics, parameters are constants and cannot
be represented as random variables. Bayesian proponents argue that,
if a parameter value is unknown, then it makes sense to specify a
probability distribution that describes the possible values for the
parameter as well as their likelihood. The Bayesian approach permits
the use of objective data or subjective opinion in specifying a prior
distribution. With the Bayesian approach, different individuals might
specify different prior distributions. Classical statisticians argue that
for this reason Bayesian methods suffer from a lack of objectivity.
Bayesian proponents argue that the classical methods of statistical
inference have built-in subjectivity (through the choice of
a sampling plan) and that the advantage of the Bayesian approach is
that the subjectivity is made explicit.


Bayesian methods have been used extensively in statistical decision theory (see statistics: Decision analysis). In this context, Bayes's
theorem provides a mechanism for combining a prior probability
distribution for the states of nature with sample information to
provide a revised (posterior) probability distribution about the states
of nature. These posterior probabilities are then used to make better
decisions.

https://www.britannica.com/science/Bayesian-analysis

Bayesian Statistics explained to Beginners in Simple English

NSS, JUNE 20, 2016


Overview
- The drawbacks of frequentist statistics lead to the need for Bayesian Statistics
- Discover Bayesian Statistics and Bayesian Inference
- There are various methods to test the significance of a model, such as the p-value, confidence interval, etc.
Introduction
Bayesian Statistics remains incomprehensible to many analysts. Amazed by the incredible power of machine learning, a lot of us have become unfaithful to statistics, and our focus has narrowed to exploring machine learning. Isn't it true?

We fail to understand that machine learning is not the only way to solve real world
problems. In several situations, it does not help us solve business problems, even though
there is data involved in these problems. To say the least, knowledge of statistics will allow
you to work on complex analytical problems, irrespective of the size of data.

Thomas Bayes' theorem was published (posthumously) in 1763. Even centuries later, the importance of 'Bayesian Statistics' hasn't faded. In fact, today this topic is taught in great depth at some of the world's leading universities.

With this idea, I've created this beginner's guide on Bayesian Statistics. I've tried to explain the concepts in a simple manner with examples. Prior knowledge of basic probability and statistics is desirable. You should check out this course to get a comprehensive lowdown on statistics and probability.

By the end of this article, you will have a concrete understanding of Bayesian Statistics and
its associated concepts.

 
Table of Contents
1. Frequentist Statistics
2. The Inherent Flaws in Frequentist Statistics
3. Bayesian Statistics
o Conditional Probability
o Bayes Theorem
4. Bayesian Inference
o Bernoulli likelihood function
o Prior Belief Distribution
o Posterior belief Distribution
5. Test for Significance – Frequentist vs Bayesian
o p-value
o Confidence Intervals
o Bayes Factor
o High Density Interval (HDI)

Before we actually delve into Bayesian Statistics, let us spend a few minutes understanding Frequentist Statistics, the more popular version of statistics most of us come across, and the inherent problems in that.

1. Frequentist Statistics
The debate between frequentists and Bayesians has haunted beginners for centuries. Therefore, it is important to understand the difference between the two and how thin the line of demarcation is!

Frequentist statistics is the most widely used inferential technique in the statistical world. In fact, it is generally the first school of thought that a person entering the world of statistics comes across.

Frequentist Statistics tests whether an event (hypothesis) occurs or not. It calculates the probability of an event in the long run of the experiment (i.e. the experiment is repeated under the same conditions to obtain the outcome).

Here, sampling distributions of fixed size are taken. Then, the experiment is theoretically repeated an infinite number of times, but practically done with a stopping intention. For example, I perform an experiment with a stopping intention in mind that I will stop the experiment when it has been repeated 1000 times or when I see a minimum of 300 heads in a coin toss.

Let's go deeper now.

Now, we'll understand frequentist statistics using the example of a coin toss. The objective is to estimate the fairness of the coin. The original article shows a table of the frequency of heads for an increasing number of tosses.

We know that the probability of getting a head on tossing a fair coin is 0.5. In that table, 'No. of heads' represents the actual number of heads obtained, and 'Difference' is 0.5*(No. of tosses) minus the number of heads.

An important thing to note is that, though the difference between the actual number of heads and the expected number of heads (50% of the number of tosses) increases as the number of tosses grows, the proportion of heads to the total number of tosses approaches 0.5 (for a fair coin).
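A minimal simulation sketch in R (my own illustration, not from the article) makes this concrete: the absolute difference from the expected head count tends to grow with more tosses, while the proportion of heads converges to 0.5.

> set.seed(1)
> for (n in c(100, 1000, 10000, 100000)) {
      heads <- sum(rbinom(n, size = 1, prob = 0.5))   # simulate n fair-coin tosses
      cat(n, "tosses: heads =", heads,
          "| difference =", 0.5 * n - heads,
          "| proportion =", round(heads / n, 4), "\n")
  }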

This experiment presents us with a very common flaw of the frequentist approach, i.e. the dependence of the result of an experiment on the number of times the experiment is repeated.

To know more about frequentist statistical methods, you can head to this excellent course on inferential statistics.

2. The Inherent Flaws in Frequentist Statistics

Till here, we've seen just one flaw in frequentist statistics. Well, it's just the beginning.

The 20th century saw a massive upsurge in frequentist statistics being applied to numerical models: to check whether one sample is different from another, whether a parameter is important enough to be kept in the model, and various other manifestations of hypothesis testing. But frequentist statistics suffers from some great flaws in its design and interpretation, which pose a serious concern in all real-life problems. For example:

1. p-values, measured against a sample statistic of fixed size with some stopping intention, change with the intention and the sample size, i.e. if two persons work on the same data but have different stopping intentions, they may get two different p-values for the same data, which is undesirable. For example, Person A may choose to stop tossing a coin when the total count reaches 100, while B stops at 1000. For these different sample sizes, we get different t-scores and different p-values. Similarly, the intention to stop may change from a fixed number of flips to the total duration of flipping; in this case too, we are bound to get different p-values. (A short illustration follows after this list.)

2. Confidence Intervals (C.I.), like p-values, depend heavily on the sample size. This makes the reliance on a stopping intention absurd, since no matter how many persons perform the test on the same data, the results should be consistent.

3. Confidence Intervals (C.I.) are not probability distributions, and therefore they provide neither the most probable value for a parameter nor the relative probabilities of other values.
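To illustrate the first point, here is a short R sketch (my own, with hypothetical numbers): the same observed proportion of heads yields very different p-values at different sample sizes.

> binom.test(55, 100, p = 0.5)$p.value     # 55% heads in 100 tosses: p is about 0.37, not "significant"
> binom.test(550, 1000, p = 0.5)$p.value   # 55% heads in 1000 tosses: p is well below 0.05, "significant"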

These three reasons are enough to get you thinking about the drawbacks of the frequentist approach and why there is a need for a Bayesian approach. Let's find out.

From here, we’ll first understand the basics of Bayesian Statistics.

3. Bayesian Statistics
“Bayesian statistics is a mathematical procedure that applies probabilities to statistical problems. It provides people the tools to update their beliefs in the light of new data.”

You got that? Let me explain it with an example:

Suppose, out of the 4 championship races (F1) between Niki Lauda and James Hunt, Niki won 3 times while James managed only 1.

So, if you were to bet on the winner of the next race, who would it be?

I bet you would say Niki Lauda.

Here's the twist. What if you are told that it rained once when James won and once when Niki won, and it is certain that it will rain during the next race. So, who would you bet your money on now?

By intuition, it is easy to see that James's chances of winning have increased drastically. But the question is: by how much?

To understand the problem at hand, we need to become familiar with some concepts, first
of which is conditional probability (explained below).

In addition, there are certain pre-requisites:

Pre-Requisites:

1. Linear Algebra: to refresh your basics, you can check out Khan Academy's Algebra course.
2. Probability and Basic Statistics: to refresh your basics, you can check out another course by Khan Academy.

3.1 Conditional Probability

It is defined as: the probability of an event A, given that B has occurred, equals the probability of A and B happening together divided by the probability of B.

For example: assume two partially intersecting sets A and B, as in the Venn diagram shown in the original article.

Set A represents one set of events and Set B represents another. We wish to calculate the probability of A given that B has already happened. Let's represent the happening of event B by shading it with red.

Now, since B has happened, the part which matters for A is the part shaded in blue, which is A ∩ B. So, the probability of A given B turns out to be:

P(A|B) = P(A ∩ B) / P(B)

Therefore, we can write the formula for event B given that A has already occurred as:

P(B|A) = P(A ∩ B) / P(A)

or

P(A ∩ B) = P(B|A) * P(A)

Substituting this into the first equation, it can be rewritten as:

P(A|B) = P(B|A) * P(A) / P(B)

This is known as Conditional Probability.

Let’s try to answer a betting problem with this technique.

Suppose B is the event of James Hunt winning, and A is the event of rain. Therefore,

1. P(A) = 1/2, since it rained in two of the four races.
2. P(B) = 1/4, since James won only one race out of four.
3. P(A|B) = 1, since it rained every time James won.

Substituting the values in the conditional probability formula, we get the probability to be around 50%, which is almost double the 25% obtained when rain was not taken into account (solve it at your end; a quick check follows below).
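For reference, a short R sketch of the substitution (my own, using the values above):

> p_A  <- 1/2    # P(rain)
> p_B  <- 1/4    # P(James Hunt wins)
> p_AB <- 1      # P(rain | James wins)
> p_AB * p_B / p_A    # = 0.5, i.e. James' chances double once rain is certain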

This further strengthens our belief in James winning in the light of new evidence, i.e. rain. You must be wondering that this formula bears close resemblance to something you might have heard a lot about. Think!

Probably, you guessed it right. It looks like Bayes Theorem.

Bayes Theorem is built on top of conditional probability and lies at the heart of Bayesian Inference. Let's understand it in detail now.

3.2 Bayes Theorem

Bayes Theorem comes into effect when multiple events A1, A2, ..., An form an exhaustive (and mutually exclusive) set with another event B. This can be understood with the help of the Venn diagram in the original article.

Now, B can be written as

B = (B ∩ A1) ∪ (B ∩ A2) ∪ ... ∪ (B ∩ An)

So, the probability of B can be written as

P(B) = P(B ∩ A1) + P(B ∩ A2) + ... + P(B ∩ An)

But

P(B ∩ Ai) = P(B|Ai) * P(Ai)

So, replacing P(B) in the equation of conditional probability, we get

P(Ai|B) = P(B|Ai) * P(Ai) / [ P(B|A1)*P(A1) + P(B|A2)*P(A2) + ... + P(B|An)*P(An) ]

This is the equation of Bayes Theorem.
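For a quick numeric check (the numbers are hypothetical, not from the article), here is a short R sketch applying the formula to a three-event partition:

> p_A         <- c(0.5, 0.3, 0.2)    # P(A_i), must sum to 1
> p_B_given_A <- c(0.9, 0.5, 0.1)    # P(B | A_i)
> p_B <- sum(p_B_given_A * p_A)      # total probability of B = 0.62
> p_B_given_A * p_A / p_B            # posterior P(A_i | B); these sum to 1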

4. Bayesian Inference
There is no point in diving into the theoretical aspect of it. So, we’ll learn how it works! Let’s
take an example of coin tossing to understand the idea behind bayesian inference.

An important part of bayesian inference is the establishment of parameters and models.

Models are the mathematical formulation of the observed events. Parameters are the factors in the models affecting the observed data. For example, in tossing a coin, the fairness of the coin may be defined as a parameter, denoted by θ. The outcome of the events may be denoted by D.

Answer this now. What is the probability of 4 heads out of 9 tosses (D) given the fairness of the coin (θ), i.e. P(D|θ)?

Wait, did I ask the right question? No.

We should be more interested in knowing: given an outcome (D), what is the probability of the coin being fair (θ = 0.5)?

Let's represent it using Bayes Theorem:

P(θ|D) = P(D|θ) * P(θ) / P(D)

Here, P(θ) is the prior, i.e. the strength of our belief in the fairness of the coin before the toss. It is perfectly okay to believe that the coin can have any degree of fairness between 0 and 1.

P(D|θ) is the likelihood of observing our result given our distribution for θ. If we knew that the coin was fair, this gives the probability of observing this number of heads in a particular number of flips.

P(D) is the evidence. This is the probability of data as determined by summing (or
integrating) across all possible values of θ, weighted by how strongly we believe in those
particular values of θ.

If we had multiple views of what the fairness of the coin is (but didn’t know for sure), then
this tells us the probability of seeing a certain sequence of flips for all possibilities of our
belief in the coin’s fairness.

P(θ|D) is the posterior belief about our parameter after observing the evidence, i.e. the number of heads.
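To see these pieces working together, here is a minimal grid-approximation sketch in R (my own illustration, assuming a flat prior over θ), which computes P(θ|D) for 4 heads in 9 tosses:

> theta      <- seq(0, 1, by = 0.01)                 # candidate fairness values
> prior      <- rep(1, length(theta))                # flat prior belief
> likelihood <- dbinom(4, size = 9, prob = theta)    # P(D | theta)
> posterior  <- likelihood * prior
> posterior  <- posterior / sum(posterior)           # normalising plays the role of P(D)
> theta[which.max(posterior)]                        # most credible value of theta, about 0.44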

From here, we’ll dive deeper into mathematical implications of this concept. Don’t worry.
Once you understand them, getting to its mathematics is pretty easy.

To define our model correctly, we need two mathematical models beforehand: one to represent the likelihood function P(D|θ), and the other to represent the distribution of prior beliefs. The product of these two gives the posterior belief distribution P(θ|D).

Since the prior and the posterior are both beliefs about the distribution of the fairness of the coin, intuition tells us that both should have the same mathematical form. Keep this in mind. We will come back to it again.

So, there are several functions which underpin the application of Bayes theorem here. Knowing them is important, hence I have explained them in detail.

4.1. Bernoulli likelihood function

Let's recap what we learned about the likelihood function. We learned that it is the probability of observing a particular number of heads in a particular number of flips for a given fairness of coin. This means our probability of observing heads/tails depends upon the fairness of the coin (θ).

P(y=1|θ) = θ   [if the coin is fair, θ = 0.5, and the probability of observing heads (y=1) is 0.5]

P(y=0|θ) = 1 − θ   [if the coin is fair, θ = 0.5, and the probability of observing tails (y=0) is 0.5]

It is worth noticing that representing 1 as heads and 0 as tails is just a mathematical notation to formulate a model. We can combine the above mathematical definitions into a single definition to represent the probability of both the outcomes:

P(y|θ) = θ^y * (1 − θ)^(1−y)

This is called the Bernoulli Likelihood Function, and the task of coin flipping is called a Bernoulli trial.

y = {0, 1}, θ = (0, 1)

And, when we want the probability of a whole series of flips y1, ..., yn, it is given by the product of the individual Bernoulli terms:

P(y1, ..., yn|θ) = θ^(Σ yi) * (1 − θ)^(n − Σ yi)

Furthermore, if we are interested in the probability of z heads turning up in N flips, then the likelihood is given (up to a binomial coefficient that does not depend on θ) by:

P(z, N|θ) = θ^z * (1 − θ)^(N − z)
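A small R sketch (my own, not from the article) evaluating this likelihood for z = 4 heads in N = 9 flips at a few candidate values of θ:

> theta <- c(0.3, 0.5, 0.7)
> theta^4 * (1 - theta)^5              # likelihood of one particular sequence of flips
> dbinom(4, size = 9, prob = theta)    # with the binomial coefficient (heads in any order)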

4.2. Prior Belief Distribution

This distribution is used to represent the strength of our beliefs about the parameters, based on previous experience.

But, what if one has no previous experience?

Don't worry. Mathematicians have devised methods to mitigate this problem too. These are known as uninformative priors. I would like to inform you beforehand that the name is a bit of a misnomer: every 'uninformative' prior still provides some information, even the constant (flat) distribution prior.

Well, the mathematical function used to represent the prior beliefs is the beta distribution. It has some very nice mathematical properties which enable us to model our beliefs about a binomial distribution.

The probability density function of the beta distribution is of the form:

P(θ|α, β) = θ^(α−1) * (1 − θ)^(β−1) / B(α, β)

where our focus stays on the numerator; the denominator B(α, β) is there just to ensure that the probability density function integrates to 1.

α and β are called the shape-deciding parameters of the density function. Here α is analogous to the number of heads in the trials and β corresponds to the number of tails. The diagrams below will help you visualize the beta distributions for different values of α and β. You too can draw the beta distribution for yourself using the following code in R:

> library(stats)
> par(mfrow = c(3, 2))                  # 3 x 2 grid of plots
> x <- seq(0, 1, by = 0.01)             # grid of theta values (the original had a typo: by=o.1)
> alpha <- c(1, 2, 10, 20, 50, 500)     # "heads"; Beta(1,1) gives the flat prior (0,0 is not a proper density)
> beta  <- c(1, 2, 8, 11, 27, 232)      # "tails"
> for (i in 1:length(alpha)) {
       y <- dbeta(x, shape1 = alpha[i], shape2 = beta[i])
       plot(x, y, type = "l", xlab = "theta", ylab = "density")
  }

Note: α and β are intuitive to understand since they can be calculated from the mean (μ) and standard deviation (σ) of the distribution. In fact, they are related as:

α = μ * ( μ(1 − μ)/σ² − 1 )

β = (1 − μ) * ( μ(1 − μ)/σ² − 1 )

If the mean and standard deviation of a distribution are known, then the shape parameters can easily be calculated.

Inferences drawn from the graphs above:

1. When there was no toss, we believed that every value of the coin's fairness was equally possible, as depicted by the flat line.
2. When there were more heads than tails, the graph showed a peak shifted towards the right side, indicating a higher probability of heads and that the coin is not fair.
3. As more tosses are done, and heads continue to come in a larger proportion, the peak narrows, increasing our confidence in our estimate of the coin's fairness.

4.3. Posterior Belief Distribution

The reason we chose a beta distribution for the prior belief is that when we multiply it with the likelihood function, the posterior distribution has a form similar to the prior distribution, which is much easier to relate to and understand. If this much information whets your appetite, I'm sure you are ready to walk an extra mile.

Let's calculate the posterior belief using Bayes Theorem.

Calculating posterior belief using Bayes Theorem

Multiplying the likelihood for z heads in N flips by the Beta(α, β) prior, our posterior belief becomes

P(θ|z, N) ∝ θ^(z+α−1) * (1 − θ)^(N−z+β−1),

which is again a beta distribution, Beta(z+α, N−z+β).

This is interesting. Just by knowing the mean and standard deviation of our belief about the parameter θ, and by observing the number of heads in N flips, we can update our belief about the model parameter (θ).

Let's understand this with the help of a simple example:

Suppose you think that a coin is biased. It has a mean (μ) bias of around 0.6 with a standard deviation of 0.1.

Then,

α = 13.8, β = 9.2

i.e. our prior distribution is biased towards the right side. Suppose you then observed 80 heads (z=80) in 100 flips (N=100). Let's see how our prior and posterior beliefs are going to look:

Prior = P(θ|α, β) = P(θ|13.8, 9.2)

Posterior = P(θ|z+α, N−z+β) = P(θ|93.8, 29.2)

Let's visualize both beliefs on a graph:


The R code for the above graph is as follows:
> library(stats)
> par(mfrow = c(1, 2))                  # prior and posterior side by side
> x <- seq(0, 1, by = 0.01)             # finer grid than the original by = 0.1
> alpha <- c(13.8, 93.8)                # prior and posterior shape1
> beta  <- c(9.2, 29.2)                 # prior and posterior shape2
> for (i in 1:length(alpha)) {
      y <- dbeta(x, shape1 = alpha[i], shape2 = beta[i])
      plot(x, y, type = "l", xlab = "theta", ylab = "density")
  }
As more and more flips are made and new data is observed, our beliefs get updated. This is
the real power of Bayesian Inference.

5. Test for Significance – Frequentist vs Bayesian

Without going into the rigorous mathematical structures, this section will give you a quick overview of the different approaches frequentist and Bayesian methods take to test for significance and differences between groups, and of which method is more reliable.

5.1. p-value

In this approach, the t-score for a particular sample from a sampling distribution of fixed size is calculated, and a p-value is then computed. Taking a p-value of 0.02 for a test of the null hypothesis that the mean is 100 as an example, the interpretation is: if the true mean were 100, there would be a 2% probability of observing a sample statistic at least as extreme as the one obtained.

This measure suffers from the flaw that for sampling distributions of different sizes, one is bound to get different t-scores and hence different p-values, which is completely absurd. Also, a p-value less than 5% does not guarantee that the null hypothesis is wrong, nor does a p-value greater than 5% ensure that the null hypothesis is right.

5.2. Confidence Intervals

Confidence Intervals also suffer from the same defect. Moreover, since a C.I. is not a probability distribution, there is no way to know which parameter values are most probable.

 
5.3. Bayes Factor

The Bayes factor is the equivalent of the p-value in the Bayesian framework. Let's understand it in a comprehensive manner.

The null hypothesis in the Bayesian framework places all of the probability mass at one particular value of the parameter (say θ = 0.5), i.e. infinite density at that point and zero probability elsewhere. (M1)

The alternative hypothesis is that all values of θ are possible, hence represented by a flat curve over the distribution. (M2)

Now, the posterior distribution after observing the new data looks like the figure in the original article: Bayesian statistics adjusts the credibility (probability) of the various values of θ. It can easily be seen that the probability distribution has shifted towards M2, with a value higher than M1, i.e. M2 is more likely to happen.

The Bayes factor does not depend upon the actual posterior density values of θ but on the magnitude of the shift between the prior and posterior probabilities of M1 and M2.

In panel A of the article's figure, the left bar (M1) is the prior probability of the null hypothesis; in panel B, the left bar is the posterior probability of the null hypothesis. The Bayes factor is defined as the ratio of the posterior odds to the prior odds:

BF = [ P(M1|D) / P(M2|D) ] / [ P(M1) / P(M2) ]

To reject a null hypothesis, a BF < 1/10 is preferred.

We can see an immediate benefit of using the Bayes Factor instead of p-values: it does not depend on the stopping intention.
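As a hedged sketch (my own illustration, not the article's calculation), a Bayes factor for the coin example can be computed by comparing the marginal likelihood of the data under M1 (the point null, θ = 0.5) with that under M2 (a flat prior on θ):

> z <- 80; N <- 100                                                  # 80 heads in 100 flips
> marg_M1 <- dbinom(z, N, 0.5)                                       # P(D | M1), the point null
> marg_M2 <- integrate(function(th) dbinom(z, N, th), 0, 1)$value    # P(D | M2), flat prior on theta
> marg_M1 / marg_M2    # Bayes factor for the null; far below 1/10 here, so M1 is rejected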

5.4. High Density Interval (HDI)

The HDI is formed from the posterior distribution after observing the new data. Since the HDI is an interval of a probability distribution, the 95% HDI gives the 95% most credible values of the parameter: 95% of the posterior probability is guaranteed to lie inside it, and every value inside it is more credible than any value outside it, unlike a C.I.

Notice how the 95% HDI of the prior distribution is wider than the 95% HDI of the posterior distribution. This is because our belief becomes more concentrated (the HDI narrows) upon observation of new data.
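A minimal R sketch (my own helper, with the hypothetical name hdi_beta) that finds the 95% HDI of a beta posterior by searching for the narrowest interval containing 95% of the probability mass:

> hdi_beta <- function(shape1, shape2, cred = 0.95) {
      width <- function(p) qbeta(p + cred, shape1, shape2) - qbeta(p, shape1, shape2)
      p_low <- optimize(width, c(0, 1 - cred))$minimum               # left tail that minimises the width
      c(lower = qbeta(p_low, shape1, shape2), upper = qbeta(p_low + cred, shape1, shape2))
  }
> hdi_beta(13.8, 9.2)     # prior HDI: wider
> hdi_beta(93.8, 29.2)    # posterior HDI: narrower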
 

End Notes

The aim of this article was to get you thinking about the different types of statistical philosophies out there and how no single one of them can be used in every situation.

It's high time that both philosophies are merged to mitigate real-world problems by each addressing the flaws of the other. Part II of this series will focus on Dimensionality Reduction techniques using MCMC (Markov Chain Monte Carlo) algorithms. Part III will be based on creating a Bayesian regression model from scratch and interpreting its results in R. So, before I start with Part II, I would like to have your suggestions / feedback on this article.

https://www.analyticsvidhya.com/blog/2016/06/bayesian-statistics-beginners-simple-english/
