Learning Bayesian Models With R - Sample Chapter
Become an expert in Bayesian machine learning methods using R and apply them to solve real-world Big Data problems
Preface
Bayesian inference provides a unified framework to deal with all sorts of uncertainties
when learning patterns from data using machine learning models and using them to
predict future observations. However, learning and implementing Bayesian
models is not easy for data science practitioners due to the level of mathematical
treatment involved. Also, applying Bayesian methods to real-world problems
requires high computational resources. With the recent advancements in cloud and
high-performance computing and easy access to computational resources, Bayesian
modeling has become more feasible to use for practical applications today. Therefore,
it would be advantageous for all data scientists and data engineers to understand
Bayesian methods and apply them in their projects to achieve better results.
The last two chapters are devoted to the latest developments in the field. One
chapter discusses deep learning, which uses a class of neural network models that
are currently at the frontier of artificial intelligence. The book concludes with the
application of Bayesian methods on Big Data using frameworks such as Hadoop
and Spark.
Chapter 1, Introducing the Probability Theory, covers the foundational concepts
of probability theory, particularly those aspects required for learning Bayesian
inference, which are presented to you in a simple and coherent manner.
Chapter 2, The R Environment, introduces you to the R environment. After reading
through this chapter, you will learn how to import data into R, select subsets of the
data for analysis, and write simple R programs using functions and control
structures. You will also get familiar with the graphical capabilities of R and some
advanced capabilities such as loop functions.
Chapter 3, Introducing Bayesian Inference, introduces you to the Bayesian
statistical framework. This chapter includes a description of the Bayesian theorem,
concepts such as prior and posterior probabilities, and different methods to
estimate the posterior distribution, such as MAP estimates, Monte Carlo simulations,
and variational estimates.
Chapter 4, Machine Learning Using Bayesian Inference, gives an overview of what
machine learning is and what some of its high-level tasks are. This chapter also
discusses the importance of Bayesian inference in machine learning, particularly in
the context of how it can help avoid important issues such as model overfitting and
how to select optimal models.
Chapter 5, Bayesian Regression Models, presents one of the most common supervised
machine learning tasks, namely, regression modeling, in the Bayesian framework.
Using an example, it shows how you can obtain tighter confidence intervals of
prediction with Bayesian regression models.
Chapter 6, Bayesian Classification Models, presents how to use the Bayesian framework
for another common machine learning task, classification. The two Bayesian models
of classification, Naïve Bayes and Bayesian logistic regression, are discussed along
with some important metrics for evaluating the performance of classifiers.
Chapter 7, Bayesian Models for Unsupervised Learning, introduces you to the concepts
behind unsupervised and semi-supervised machine learning and their Bayesian
treatment. The two most important Bayesian unsupervised models, the Bayesian
mixture model and LDA, are discussed.
In the classical definition, the probability of an event A is the fraction of outcomes favorable to A:

$$P(A) = \frac{N_A}{N}$$

Here, $N_A$ is the number of outcomes in which A occurs and $N$ is the total number of outcomes.
Probability distributions
In both classical and Bayesian approaches, a probability distribution function is
the central quantity, which captures all of the information about the relationship
between variables in the presence of uncertainty. A probability distribution assigns
a probability value to each measurable subset of outcomes of a random experiment.
The variable involved could be discrete or continuous, and univariate or multivariate.
Although people use slightly different terminologies, the commonly used probability
distributions for the different types of random variables are as follows:
$$N(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Here, $\mu$ is the mean or location parameter and $\sigma$ is the standard deviation or scale
parameter ($\sigma^2$ is called the variance). The following graphs show what the distribution
looks like for different values of the location and scale parameters:
One can see that as the mean changes, the location of the peak of the distribution
changes. Similarly, when the standard deviation changes, the width of the
distribution also changes.
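As a minimal sketch in R (the particular parameter values below are chosen for illustration and are not taken from the book's figures), you can reproduce this behavior with dnorm:

# Plot univariate normal densities for different location (mean) and
# scale (sd) parameters to see how the peak shifts and the width changes.
x <- seq(-10, 10, length.out = 500)

plot(x, dnorm(x, mean = 0, sd = 1), type = "l", col = "black",
     ylab = "density", main = "Normal densities")
lines(x, dnorm(x, mean = 3, sd = 1), col = "red")    # shifted location
lines(x, dnorm(x, mean = 0, sd = 2), col = "blue")   # larger scale (wider)
legend("topright", legend = c("mean=0, sd=1", "mean=3, sd=1", "mean=0, sd=2"),
       col = c("black", "red", "blue"), lty = 1)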
For a multivariate random variable $\mathbf{x} = [x_1, x_2, \ldots, x_N]$, the corresponding normal distribution is given by:
$$N(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = (2\pi)^{-N/2}\,|\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

In the two-dimensional case, the covariance matrix is:

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$

Here, $\sigma_1^2$ and $\sigma_2^2$ are the variances along the $x_1$ and $x_2$ directions, and $\rho$ is the
correlation between $x_1$ and $x_2$. A plot of the two-dimensional normal distribution for
$\sigma_1^2 = 9$, $\sigma_2^2 = 4$, and $\rho = 0.8$ is shown in the following image:
The high correlation between x and y in the first case forces most of the data points
along the 45 degree line and makes the distribution more anisotropic; whereas, in the
second case, when the correlation is zero, the distribution is more isotropic.
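A quick way to see this effect in R is to simulate points under the two covariance settings and compare the scatter plots. This is a minimal sketch using MASS::mvrnorm; the sample size and plotting choices are arbitrary, not taken from the book:

library(MASS)  # provides mvrnorm for sampling from a multivariate normal

mu <- c(0, 0)
Sigma_corr <- matrix(c(9, 0.8 * 3 * 2,
                       0.8 * 3 * 2, 4), nrow = 2)   # rho = 0.8
Sigma_ind  <- matrix(c(9, 0,
                       0, 4), nrow = 2)             # rho = 0

set.seed(42)
x_corr <- mvrnorm(n = 2000, mu = mu, Sigma = Sigma_corr)
x_ind  <- mvrnorm(n = 2000, mu = mu, Sigma = Sigma_ind)

par(mfrow = c(1, 2))
plot(x_corr, asp = 1, pch = ".", main = "rho = 0.8 (anisotropic)")
plot(x_ind,  asp = 1, pch = ".", main = "rho = 0 (isotropic)")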
We will briefly review some of the other well-known distributions used in Bayesian
inference here.
Conditional probability
Often, one would be interested in finding the probability of the occurrence of a set
of random variables when other random variables in the problem are held fixed. As
an example from a population health study, one might want to find the probability
that a person in the age range 40-50 develops heart disease, given that they have
high blood pressure and diabetes. Questions such as these can be modeled using
conditional probability, which is defined as the probability of an event, given that
another event has happened. More formally, if we take the variables A and B, this
definition can be rewritten as follows:
$$P(A \mid B) = \frac{P(A, B)}{P(B)}$$

Similarly:

$$P(B \mid A) = \frac{P(A, B)}{P(A)}$$
In general, for a set of random variables, the conditional distribution is:

$$P(x_1, x_2, \ldots, x_N \mid z_1, z_2, \ldots, z_M) = \frac{P(x_1, x_2, \ldots, x_N, z_1, z_2, \ldots, z_M)}{P(z_1, z_2, \ldots, z_M)}$$

For the bivariate normal distribution, the conditional distribution of $x_1$ given $x_2$ is:

$$P(x_1 \mid x_2) = \frac{N(x_1, x_2)}{N(x_2)}$$
It can be shown (exercise 2 in the Exercises section of this chapter) that the RHS can be
simplified, resulting in an expression for $P(x_1 \mid x_2)$ that is again a normal distribution,
with mean $\mu = \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(x_2 - \mu_2)$ and variance $\sigma^2 = (1-\rho^2)\sigma_1^2$.
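As a quick numerical sanity check (a sketch, not from the book; the parameter values are arbitrary), you can simulate from a bivariate normal and compare the empirical conditional moments with these formulas:

library(MASS)

mu1 <- 1; mu2 <- 2; s1 <- 3; s2 <- 2; rho <- 0.8
Sigma <- matrix(c(s1^2, rho * s1 * s2,
                  rho * s1 * s2, s2^2), nrow = 2)

set.seed(1)
xy <- mvrnorm(n = 1e6, mu = c(mu1, mu2), Sigma = Sigma)

# Condition (approximately) on x2 being close to a chosen value
x2_val <- 3
x1_cond <- xy[abs(xy[, 2] - x2_val) < 0.05, 1]

mean(x1_cond)                             # empirical conditional mean
mu1 + rho * (s1 / s2) * (x2_val - mu2)    # theoretical conditional mean

var(x1_cond)                              # empirical conditional variance
(1 - rho^2) * s1^2                        # theoretical conditional variance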
Bayesian theorem
From the definition of the conditional probabilities $P(A \mid B)$ and $P(B \mid A)$, it is easy to
show the following:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Rev. Thomas Bayes (1701-1761) used this rule to formulate his famous Bayes
theorem, which can be interpreted as follows: if $P(A)$ represents the initial degree of belief
(or prior probability) in the value of a random variable A before observing B, then its
posterior probability, or degree of belief after accounting for B, gets updated according to the
preceding equation. So, Bayesian inference essentially corresponds to updating
beliefs about an uncertain system after having made some observations about it. In a
sense, this is also how we human beings learn about the world. For example, before
we visit a new city, we will have certain prior knowledge about the place after reading
about it in books or on the Web.
However, soon after we reach the place, this belief will get updated based on our initial
experience of the place. We continuously update the belief as we explore the new
city more and more. We will describe Bayesian inference more in detail in Chapter 3,
Introducing Bayesian Inference.
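As a toy illustration of this updating rule in R (the numbers below are invented purely for illustration and do not appear in the book), consider updating the belief that a person has a disease after a positive diagnostic test:

# Prior belief: 1% of the population has the disease
prior <- 0.01
# Assumed likelihoods: test sensitivity and false-positive rate
p_pos_given_disease <- 0.95
p_pos_given_healthy <- 0.05

# Marginal probability of a positive test, P(B)
p_pos <- p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)

# Posterior belief after observing a positive test, via Bayes theorem
posterior <- p_pos_given_disease * prior / p_pos
posterior   # about 0.16: the belief is updated from 1% to roughly 16%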
Marginal distribution
In many situations, we are interested only in the probability distribution of a subset
of random variables. For example, in the heart disease problem mentioned in the
previous section, if we want to infer the probability of people in a population
having a heart disease as a function of their age only, we need to integrate out the
effect of other random variables such as blood pressure and diabetes. This is called
marginalization:
$$P(x_1, x_2, \ldots, x_M) = \int P(x_1, x_2, \ldots, x_M, x_{M+1}, \ldots, x_N)\, dx_{M+1} \cdots dx_N$$

Or:

$$P(x_1, x_2, \ldots, x_M) = \int P(x_1, x_2, \ldots, x_M \mid x_{M+1}, \ldots, x_N)\, P(x_{M+1}, \ldots, x_N)\, dx_{M+1} \cdots dx_N$$
Note that marginal distribution is very different from conditional distribution.
In conditional probability, we are finding the probability of a subset of random
variables with values of other random variables fixed (conditioned) at a given
value. In the case of marginal distribution, we are eliminating the effect of a subset
of random variables by integrating them out (in the sense of averaging their effect)
from the joint distribution. For example, in the case of the two-dimensional normal
distribution, marginalization with respect to one variable will result in a
one-dimensional normal distribution of the other variable, as follows:

$$N(x_1) = \int N(x_1, x_2)\, dx_2$$
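To make this concrete, here is a small numerical sketch in R (with assumed, illustrative parameter values) that integrates $x_2$ out of a bivariate normal density and compares the result with the one-dimensional normal density of $x_1$:

# Bivariate normal density with zero means, written out explicitly
s1 <- 3; s2 <- 2; rho <- 0.8
dbvnorm <- function(x1, x2) {
  z <- x1^2 / s1^2 - 2 * rho * x1 * x2 / (s1 * s2) + x2^2 / s2^2
  exp(-z / (2 * (1 - rho^2))) / (2 * pi * s1 * s2 * sqrt(1 - rho^2))
}

x1 <- 1.5
# Numerically marginalize x2 out of the joint density
marginal <- integrate(function(x2) dbvnorm(x1, x2), -Inf, Inf)$value
marginal                       # numerical marginal density at x1
dnorm(x1, mean = 0, sd = s1)   # closed-form one-dimensional density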
The expectation of a random variable and the covariance between two random variables are defined from the joint distribution as follows:

$$E[x_i] = \int x_i \, P(x_1, x_2, \ldots, x_i, \ldots, x_N)\, dx_1 \cdots dx_N$$

$$\operatorname{cov}(x_i, x_j) = E\left[(x_i - E[x_i])(x_j - E[x_j])\right]$$
Binomial distribution
A binomial distribution is a discrete distribution that gives the probability of the number
of heads in n independent trials, where each trial has one of two possible outcomes, heads
or tails, with the probability of heads being p. Each of the trials is called a Bernoulli trial.
The functional form of the binomial distribution is given by:
$$P(k; n, p) = \frac{n!}{(n-k)!\,k!}\, p^k (1-p)^{n-k}$$
Here, P ( k ; n, p ) denotes the probability of having k heads in n trials. The mean of the
binomial distribution is given by np and variance is given by np(1-p). Have a look at
the following graphs:
The preceding graphs show the binomial distribution for two values of n, 100 and
1000, with p = 0.7. As you can see, when n becomes large, the binomial distribution
becomes sharply peaked. It can be shown that, in the large-n limit, a binomial
distribution can be approximated by a normal distribution with mean np and
variance np(1-p). This is a characteristic shared by many discrete distributions:
in the large-n limit, they can be approximated by continuous distributions.
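A short R sketch (not taken from the book; the plotting choices are mine) reproduces this comparison using dbinom and overlays the normal approximation:

p <- 0.7

par(mfrow = c(1, 2))
for (n in c(100, 1000)) {
  k <- 0:n
  plot(k, dbinom(k, size = n, prob = p), type = "h",
       main = paste("Binomial, n =", n, ", p =", p),
       xlab = "k", ylab = "probability")
  # Overlay the normal approximation with mean np and variance np(1 - p)
  curve(dnorm(x, mean = n * p, sd = sqrt(n * p * (1 - p))),
        add = TRUE, col = "red")
}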
Beta distribution
The Beta distribution, denoted by $\text{Beta}(x \mid \alpha, \beta)$, is a function of powers of $x$ and its
reflection $(1-x)$, and is given by:

$$\text{Beta}(x \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, x^{\alpha-1} (1-x)^{\beta-1}$$
Here, $\alpha, \beta > 0$ are parameters that determine the shape of the distribution
function, and $B(\alpha, \beta)$ is the Beta function given by the ratio of Gamma functions:

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}$$
The Beta distribution is a very important distribution in Bayesian inference. It is the
conjugate prior probability distribution (which will be defined more precisely in the
next chapter) for binomial, Bernoulli, negative binomial, and geometric distributions.
It is used for modeling the random behavior of percentages and proportions. For
example, the Beta distribution has been used for modeling allele frequencies in
population genetics, time allocation in project management, the proportion of
minerals in rocks, and heterogeneity in the probability of HIV transmission.
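As a preview of how this conjugacy is used (a minimal sketch with made-up numbers; conjugate priors are treated properly in the next chapter), updating a Beta prior after observing binomial data only requires adding the counts to the parameters:

# Beta prior on the probability of heads, p
alpha0 <- 2; beta0 <- 2

# Suppose we observe 7 heads in 10 coin tosses (illustrative data)
heads <- 7; tosses <- 10

# Because the Beta is conjugate to the binomial, the posterior is also Beta
alpha1 <- alpha0 + heads
beta1  <- beta0 + tosses - heads

p <- seq(0, 1, length.out = 200)
plot(p, dbeta(p, alpha1, beta1), type = "l", col = "red",
     ylab = "density", main = "Prior vs posterior for p")
lines(p, dbeta(p, alpha0, beta0), col = "blue")
legend("topleft", legend = c("posterior", "prior"),
       col = c("red", "blue"), lty = 1)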
Gamma distribution
The Gamma distribution, denoted by $\text{Gamma}(x \mid \alpha, \beta)$, is another common distribution
used in Bayesian inference. It is used for modeling waiting times, such as survival
rates. Special cases of the Gamma distribution are the well-known Exponential and
Chi-Square distributions.

In Bayesian inference, the Gamma distribution is used as a conjugate prior for the
inverse of the variance of a one-dimensional normal distribution and for parameters such as
the rate ($\lambda$) of an exponential or Poisson distribution.
The mathematical form of a Gamma distribution is given by:
$$\text{Gamma}(x \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} \exp(-\beta x)$$
Here, $\alpha$ and $\beta$ are the shape and rate parameters, respectively (both take values
greater than zero). There is also a form in terms of the scale parameter ($\theta = 1/\beta$), which
is common in econometrics. Another related distribution is the Inverse-Gamma
distribution, which is the distribution of the reciprocal of a variable that is distributed
according to the Gamma distribution. It is mainly used in Bayesian inference as the
conjugate prior distribution for the variance of a one-dimensional normal distribution.
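In R, dgamma supports both parameterizations directly; a quick sketch (the parameter values are chosen arbitrarily for illustration):

x <- seq(0, 10, length.out = 400)
alpha <- 2; beta <- 1.5   # shape and rate

# The rate and scale parameterizations give identical densities
all.equal(dgamma(x, shape = alpha, rate = beta),
          dgamma(x, shape = alpha, scale = 1 / beta))

plot(x, dgamma(x, shape = alpha, rate = beta), type = "l",
     ylab = "density", main = "Gamma(shape = 2, rate = 1.5)")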
Dirichlet distribution
The Dirichlet distribution is a multivariate analogue of the Beta distribution.
It is commonly used in Bayesian inference as the conjugate prior distribution for
multinomial distribution and categorical distribution. The main reason for this is
that it is easy to implement inference techniques, such as Gibbs sampling, on the
Dirichlet-multinomial distribution.
The Dirichlet distribution of order K is defined over an open (K-1)-dimensional
simplex as follows:

$$\text{Dir}(\mathbf{x} \mid \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$$
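There is no Dirichlet sampler in base R, but a standard construction (sketched below with arbitrary parameter values) draws independent Gamma variables and normalizes them; packages such as gtools also provide a ready-made rdirichlet:

# Sample from a Dirichlet(alpha) by normalizing independent Gamma draws
rdirichlet_simple <- function(n, alpha) {
  k <- length(alpha)
  g <- matrix(rgamma(n * k, shape = alpha, rate = 1),
              nrow = n, byrow = TRUE)
  g / rowSums(g)   # each row sums to 1, i.e. lies on the simplex
}

set.seed(7)
samples <- rdirichlet_simple(5, alpha = c(2, 3, 5))
samples
rowSums(samples)   # all equal to 1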
Wishart distribution
The Wishart distribution is a multivariate generalization of the Gamma distribution. It
is defined over symmetric, non-negative definite matrix-valued random variables. In Bayesian
inference, it is used as the conjugate prior to estimate the distribution of the inverse of
the covariance matrix $\Sigma^{-1}$ (or precision matrix) of the normal distribution. When we
discussed the Gamma distribution, we said it is used as a conjugate distribution for the
inverse of the variance of the one-dimensional normal distribution.
The mathematical definition of the Wishart distribution is as follows:
$$W_p(\mathbf{X} \mid \mathbf{V}, n) = \frac{|\mathbf{X}|^{(n-p-1)/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}(\mathbf{V}^{-1}\mathbf{X})\right)}{2^{np/2}\, |\mathbf{V}|^{n/2}\, \Gamma_p\!\left(\tfrac{n}{2}\right)}$$

Here, $\mathbf{X}$ and $\mathbf{V}$ are $p \times p$ symmetric positive definite matrices, $n$ is the degrees of freedom, and $\Gamma_p$ is the multivariate Gamma function.
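Base R's stats package provides a sampler for this distribution; a brief sketch (the scale matrix and degrees of freedom below are chosen for illustration):

# Draw 3 samples from a Wishart distribution with 5 degrees of freedom
# and scale matrix V (each draw is a 2 x 2 symmetric positive definite matrix)
V <- matrix(c(2, 0.5,
              0.5, 1), nrow = 2)
set.seed(123)
draws <- rWishart(n = 3, df = 5, Sigma = V)

dim(draws)    # 2 x 2 x 3 array: one matrix per draw
draws[, , 1]  # the first sampled matrix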
Exercises
1. By using the definition of conditional probability, show that any multivariate
joint distribution of N random variables $[x_1, x_2, \ldots, x_N]$ has the following trivial
factorization:

$$P(x_1, x_2, \ldots, x_N) = P(x_1 \mid x_2, \ldots, x_N)\, P(x_2 \mid x_3, \ldots, x_N) \cdots P(x_{N-1} \mid x_N)\, P(x_N)$$
2. The bivariate normal distribution is given by:

$$N(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = (2\pi)^{-1}\,|\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

Here:

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$

Show that the conditional distribution $P(x_1 \mid x_2)$ is a normal distribution where
$\mu = \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(x_2 - \mu_2)$ and $\sigma^2 = (1-\rho^2)\sigma_1^2$.
Sepal Length   Sepal Width   Petal Length   Petal Width   Class of Flower
5.1            3.5           1.4            0.2           Iris-setosa
4.9            3.0           1.4            0.2           Iris-setosa
4.7            3.2           1.3            0.2           Iris-setosa
4.6            3.1           1.5            0.2           Iris-setosa
5.0            3.6           1.4            0.2           Iris-setosa
7.0            3.2           4.7            1.4           Iris-versicolor
6.4            3.2           4.5            1.5           Iris-versicolor
6.9            3.1           4.9            1.5           Iris-versicolor
5.5            2.3           4.0            1.3           Iris-versicolor
6.5            2.8           4.6            1.5           Iris-versicolor
6.3            3.3           6.0            2.5           Iris-virginica
5.8            2.7           5.1            1.9           Iris-virginica
7.1            3.0           5.9            2.1           Iris-virginica
6.3            2.9           5.6            1.8           Iris-virginica
6.5            3.0           5.8            2.2           Iris-virginica
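These are rows of the classic Iris dataset, which ships with R, so no download is needed to work with it; a quick way to inspect it:

data(iris)              # load the built-in Iris dataset
head(iris)              # first rows: sepal/petal measurements and species
summary(iris$Species)   # 50 flowers of each of the three species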
References

1. http://en.wikipedia.org/wiki/List_of_probability_distributions
2. Feller W. An Introduction to Probability Theory and Its Applications. Vol. 1. Wiley Series in Probability and Mathematical Statistics. 1968. ISBN-10: 0471257087
3. Jaynes E.T. Probability Theory: The Logic of Science. Cambridge University Press. 2003. ISBN-10: 0521592712
4. Radziwill N.M. Statistics (The Easier Way) with R: an informal text on applied statistics. Lapis Lucera. 2015. ISBN-10: 0692339426
Summary
To summarize this chapter, we discussed elements of probability theory;
particularly those aspects required for learning Bayesian inference. Due to lack of
space, we have not covered many elementary aspects of this subject. There are some
excellent books on this subject, for example, books by William Feller (reference 2
in the References section of this chapter), E. T. Jaynes (reference 3 in the References
section of this chapter), and N. M. Radziwill (reference 4 in the References section of this
chapter). Readers are encouraged to read these to get a more in-depth understanding
of probability theory and how it can be applied in real-life situations.
In the next chapter, we will introduce the R programming language, which is the
most popular open source framework for data analysis and, in particular,
Bayesian inference.