Bayesian Analysis Using MCMC On Survey Data
IMPROVING FORECASTING
OF POLITICAL POLLING OUTCOMES
By Lancelot Muwayi and Sanoj Kumar
A Capstone Project
(August, 2015)
The Capstone Project committee for Lancelot Muwayi and Sanoj Kumar
certifies that this is the approved version of the following capstone project report:
Abstract
This research explores the extent to which Bayesian estimators improve the
forecasting of political polling outcomes. A Bayesian model is built using a two-step
process. The first step applies decision trees to select significant demographic
variables; the second step uses the Markov Chain Monte Carlo (MCMC) method to
estimate the Bayesian model. The Bayesian model takes into account sampling
variation and ensures the stability of parameter estimates by weighting them with
a prior.
Key Words
Bayesian estimator, forecasting of polling outcomes, MCMC, decision tree
Executive Summary
Incomplete and noisy data from disparate sources call for non-conventional
statistical tools for correct analysis. Bayesian methods are beneficial for obtaining
robust estimates and combining information from disparate sources. This research
examines the suitability of Bayesian methods in estimating the parameters of a model
that predicts election outcomes on the basis of polling data in which parameters of
earlier models could be used as priors. A two-stage process, using CART and CHAID
decision trees for variable selection followed by the Markov Chain Monte Carlo (MCMC)
method, is used to predict approval of the Obama presidency based on respondents' race,
gender, education, and region. Results show that black respondents from the U.S. southern
region approve of Obama's presidency in greater numbers than white respondents.
Approval of Obama's presidency is lower among respondents with no college education. The
U.S. southern region stands out among all regions as having the lowest approval of
Obama's presidency. More survey data related to the reasons for approval or disapproval
of Obama's presidency would further validate and support the results of this research.
Table of Contents
Introduction
    Problem Statement
    Research Purpose
    Research Question
Background
Methodology
    Exploratory Data Analysis
    Building Classification Model
        Bayesian Logistic Regression Model
        PROC MCMC with Bayesian Logistic Regression Model
Final Results
    Exploratory Analysis Results
    Factors Selection
    Building Classification Model Results
Summary and Conclusions
Recommendations
    Further Model Development
Appendix
Bibliography
Introduction
Problem Statement
Every four years, Americans elect their President for a new term. Since this is
an important decision for the nation, political groups have been conducting election
polls since the 1930s. Millions of Americans have been surveyed on their political
opinions, yielding a wealth of information that political scientists can utilize to their
full advantage (Caughey and Warshaw, 2003). State-level pre-election survey data
represent a rich new source of information for both forecasting election outcomes and
tracking the evolution of voter preferences during the campaign (Linzer, 2013).
However, major hurdles still exist that are affecting survey researchers,
including low response rates, rising costs to carry out the surveys, and the demand for
quicker turn-around times. These factors have increased the demand for new ways to
generate accurate and timely polling estimates of public opinion and social behaviors
by using incomplete or noisy data and combining insights from different surveys and
other sources. Researchers are faced with the task of doing more with disparate
datasets.
In the arena of political polling, for example, Linzer (2013) proposed a
Bayesian approach to improve the forecasting of election outcomes from state polls using
recent advances in Bayesian methodology (see also Jackman, 2005; Carlin and Louis,
2009; and Ntzoufras 2009). Bayesian inference is normally used to update a
previously estimated probability given new information (Gelman et al., 2004).
This study uses polls conducted from January 2012 to March 2012, which
asked nearly 37,000 voters across the states about their approval of the Obama presidency
on a seven-point scale ranging from strongly disapprove to strongly approve. The
channel through which the data were collected is not known, and the collected data are
sparse, noisy, and come from unrepresentative samples.
The goal of this study is to build a statistically robust model to relate approval
ratings of the Obama presidency to characteristics of the populations using the survey
data provided by Ipsos.
Research Purpose
The primary purpose of this research is to build a model to predict approval
ratings of the Obama presidency and to identify significant predictors. The secondary
purpose is to examine the suitability of Bayesian methods. The last, but not least,
important purpose is to develop custom SAS code that our Capstone sponsor can
reference.
Ipsos Public Affairs, our Capstone sponsor, maintains an active program of
research that integrates Bayesian and other methodologies into its polling and broader
research practices. This research assists and advises the company's researchers on
how to obtain population-valid public opinion outcomes using Bayesian and other
appropriate methods for both static and time-dependent data (including, eventually,
real-time continuous-feed data).
Data from disparate sources are noisy and incomplete. Therefore, Bayesian
methods are beneficial since robust estimates can be obtained from them. They are
also useful in combining information from disparate sources. Parameters of the models
from earlier polls can be used as priors for the parameters of the model for the new
poll.
The particular case study at the center of this project is U.S. state-level mid-term
elections, using historical data sources as well as partial or incomplete polling data and
other data (including simulated data). We worked closely with Ipsos methodologists
and polling experts to advance their forecasting and polling capabilities, as well as to
broaden their program of research on nonprobability sampling.
Research Question
We are tasked to develop a free-standing and flexible SAS program that will
implement the Bayesian model estimation, generate correct standard errors, and allow
for the eventual introduction of additional target/model variables. The model will
allow for integration of other specialized attitudinal measures, in addition to the
parameters used in this project. The project asks this question: to what extent are
Bayesian methods suitable for estimating parameters of a model that predicts election
outcome on the basis of polling data, where parameters of earlier models can be used
as priors?
Background
The focus of this project is to analyze survey data related to the
presidency of Obama and to identify the significant predictors that affected his
approval rating across different states or regions. The polls used in our study were
conducted as random samples of registered or likely voters. The sample data are sparse and
noisy, and some states and regions have more respondents than others. In our data, the
response variable is categorical, so we had to use a classification technique. We started
looking for different statistical methods to fit our needs.
Our data is factorial in nature: some factors are observed within other factors.
For instance, different levels of race are observed in different regions. Within each
race we have two genders, and so on. Our goal is to understand the interaction effects
among these factors. Therefore, we started by developing an SAS program for multilevel regression of polling data.
To understand multi-level regression on survey data, we first analyzed case
studies conducted by a number of researchers. Gelman and Ghitza, in "Deep
Interactions with MRP: Election Turnout and Voting Patterns among Small Electoral
Subgroups," identified how multilevel regression derives voting pattern and turnout
estimates for small subgroups of the population. MRP stands for multilevel regression
and post-stratification.¹ They analyzed how to fit a multilevel regression model that
includes group-level predictors as well as unexplained variation at each of the levels
of the factors and their interactions.

¹ Post-stratification is not within the scope of this project.
Ipsos wants to use its prior knowledge on the parameter estimates. Therefore,
an application of Bayesian analysis became inevitable in this project. To understand
Bayesian analysis on polling data, we studied Linzer's "Dynamic Bayesian
Forecasting of Presidential Elections in the States." In this study, Linzer introduced a
dynamic Bayesian forecasting model that unifies the regression-based historical
forecasting approach developed in political science and economics with the poll-tracking
capabilities made feasible by the recent upsurge in state-level opinion polls.
In the case study "Electoral Forecasting and Public Opinion Tracking in Latin
America: An Application to Chile," Bunker et al. argued that Bayesian inference is
well suited to estimating true public opinion. Bayesian inference is normally used to
update a previously estimated probability given new information (Gelman et al.,
2004).
Bayesian methods are derived from the application of Bayes' theorem. For events
A and B, Bayes' theorem is expressed as
Pr(A|B) = Pr(B|A) Pr(A) / Pr(B).

Expanding the denominator by the law of total probability gives

Pr(A|B) = Pr(B|A) Pr(A) / [Pr(B|A) Pr(A) + Pr(B|Aᶜ) Pr(Aᶜ)].

Replacing the events with a parameter θ and observed data y, the same theorem reads

Pr(θ|y) = Pr(y|θ) Pr(θ) / Pr(y) = Pr(y|θ) Pr(θ) / [Pr(y|θ) Pr(θ) + Pr(y|θᶜ) Pr(θᶜ)].

The quantity Pr(y) is the marginal probability, and it serves as a normalizing constant to
ensure that the probabilities add up to unity. Because Pr(y) is a constant, we can ignore it
and write

Pr(θ|y) ∝ Pr(y|θ) Pr(θ).

Thus, the prior Pr(θ) is updated with the likelihood Pr(y|θ) to form the posterior
distribution Pr(θ|y).
In a nutshell, Bayesian analysis updates beliefs about the parameters by
accounting for additional data. We need to weight the likelihood for the data with the
prior distribution to produce the posterior distribution. If we want to estimate a parameter
θ from data y = (y1, ..., yn), Bayesian analysis says that we cannot determine θ exactly,
but we can describe the uncertainty about θ by using probability statements and
distributions. We can formulate a prior distribution π(θ) to express our beliefs about θ.
We then update these beliefs by combining the information from the prior distribution and
the data, described with the statistical model p(y|θ), to generate the posterior
distribution p(θ|y):

p(θ|y) = p(θ, y) / p(y) = p(y|θ) π(θ) / p(y) = p(y|θ) π(θ) / ∫ p(y|θ) π(θ) dθ.
In general, any prior distribution can be used, depending on the available prior
information. The choice can include an informative prior distribution if something is
known about the likely values of the unknown parameters, or a diffuse or non-informative
prior if little is known about the coefficient values or if one wishes to see what the
data themselves provide as inferences. Non-informative prior distributions play a minimal
role in the posterior distribution.
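A standard conjugate example (again with hypothetical numbers) makes this updating
concrete. If the number of approvals y out of n respondents follows a binomial(n, p)
likelihood and the prior on p is Beta(a, b), the posterior is again a Beta distribution:

p | y ~ Beta(a + y, b + n − y).

With a diffuse Beta(1, 1) prior and, say, y = 30 approvals out of n = 100, the posterior is
Beta(31, 71) with mean 31/102 ≈ 0.30, essentially the sample proportion. With an
informative Beta(40, 60) prior (centered at 0.40), the posterior is Beta(70, 130) with mean
70/200 = 0.35, pulled toward the prior. The stronger the prior relative to the data, the
larger its role in the posterior.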
Several features of the Bayesian approach make it attractive for researchers
because it provides a mechanism for combining a prior probability distribution for the
states of nature with sample information to provide a revised (posterior) probability
distribution about the states of nature. These posterior probabilities are then used to
make better decisions.
A previous posterior distribution can also be used as a prior when new
observations become available. With sparse and noisy data, inference proceeds in the same
manner as if one had a large sample. The approach provides interpretable answers, such as
"the true parameter has a probability of 0.95 of falling in a 95% credible interval." It
also provides a convenient setting for a wide range of models, such as hierarchical models
and missing-data problems.
If the posterior distribution in a Bayesian analysis does not have a closed
form, one can apply Markov Chain Monte Carlo (MCMC) simulation methods for
any sample size and obtain accurate estimates of the parameters.
The MCMC procedure uses the Markov Chain Monte Carlo technique to
estimate the model parameters and to produce correct standard errors and confidence
limits. Markov chain Monte Carlo is a general computing algorithm that has been
widely used in many scientific disciplines, including statistics. The posterior
distribution often involves a high-dimensional integration. The algorithm combines two
parts, a Markov chain and Monte Carlo simulation, to draw samples from the posterior
distribution of the parameters.
The simulations are repeated many times and form a Markov chain. Once the
chain has stabilized, the chain of estimates is used to produce the final results. When
MCMC is used for optimization and Bayesian inference, the objective is to compute
the global optimum of the Bayesian posterior probability by drawing representative
samples from the posterior probability distribution. In practice, a Bayesian posterior
distribution rarely takes a well-known form and may have multiple local
optima. Simulating the posterior therefore yields a plausible understanding of the
distribution. Very often MCMC must be combined with a general-purpose optimization
technique, such as simulated annealing, to return the global optimum.
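To make the algorithm concrete, the following is a minimal sketch of a random-walk
Metropolis sampler for a single binomial proportion, written as a SAS DATA step. The data
(7 approvals out of 20), the proposal standard deviation, the seed, and the burn-in length
are hypothetical choices for illustration and are not taken from the project data.

/* Random-walk Metropolis sampler for a binomial proportion p with a
   uniform prior, using hypothetical data y = 7 approvals out of n = 20. */
data metropolis;
   call streaminit(2015);
   y = 7; n = 20;                          /* hypothetical data */
   p = 0.5;                                /* starting value of the chain */
   logpost = y*log(p) + (n-y)*log(1-p);    /* log posterior (uniform prior) */
   do iter = 1 to 10000;
      pnew = p + rand('NORMAL', 0, 0.1);   /* random-walk proposal */
      if pnew > 0 and pnew < 1 then do;
         logpostnew = y*log(pnew) + (n-y)*log(1-pnew);
         /* accept the proposal with probability min(1, posterior ratio) */
         if log(rand('UNIFORM')) < logpostnew - logpost then do;
            p = pnew;
            logpost = logpostnew;
         end;
      end;
      if iter > 1000 then output;          /* discard burn-in draws */
   end;
   keep iter p;
run;

/* Posterior mean and standard deviation of p from the retained draws */
proc means data=metropolis mean std;
   var p;
run;

With a uniform prior, the mean of the retained draws should be close to the Beta(8, 14)
posterior mean of 8/22 ≈ 0.36.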
Methodology
Exploratory Data Analysis
The dataset included Obama approval ratings and demographic information for over
37,000 respondents from all fifty U.S. states. The demographic information
included each respondent's age, education, sex, race, region, and state of residence. Each
demographic variable was recoded into a binary indicator (black, white, no-college
education, and south region), as shown in the Appendix code.
Finally, the Approval value is 1 when Obama is approved and 0 otherwise, i.e.,
Disapproval. We tabulated the 24 (= 2 approval levels × 3 race categories × 2 education
levels × 2 regions) possible combinations of the target and the above factors using the
programming code in the Appendix. Then, for modeling purposes, we calculated the number
of respondents with Approval equal to 1 in each category, which produced the following
data:
Table 1: Aggregated Data for Model
Group   Black   White   No College   South   Approval   Total
  1       0       0         0          0        946       2164
  2       0       0         0          1        405       1100
  3       0       0         1          0        280        737
  4       0       0         1          1        124        362
  5       0       1         0          0       5099      14697
  6       0       1         0          1       2106       8337
  7       0       1         1          0       1308       4883
  8       0       1         1          1        517       2888
  9       1       0         0          0        403        544
 10       1       0         0          1        920       1177
 11       1       0         1          0        124        190
 12       1       0         1          1        297        404
From the above table, we deduce that when the race is neither Black nor White,
the education level is other than no college, and the region is other than the South
(group 1), 946 of the 2,164 respondents approved of Obama's presidency. There are 12
groups in all.
We built the model on the above data using the factors Black, White, education, and
region, together with the interactions of Black with region and White with education, in
the mean structure, plus a group-level random effect. We then fit this Bayesian logistic
model in PROC MCMC.
We modeled each respondent's response as a Bernoulli trial with probability of
success p_i. This means the number of approvals in each cell follows a binomial
distribution with parameters p_i and n_i, where n_i is known but p_i is to be estimated. We
used the logit link function to link the covariates of each observation, black, white,
edu (for no college education), and region (for the south region), to the probability of
success:
μ_i = beta0 + beta1*black + beta2*white + beta3*edu + beta4*region +
beta5*black*region + beta6*white*edu.

The probability of approval is

p_i = exp(μ_i + δ_i) / (1 + exp(μ_i + δ_i)),

where δ_i is assumed to be an independently and identically distributed random effect with
a normal prior with mean zero and constant variance. The six regression coefficients (plus
the intercept beta0) and the random-effects variance σ² are the model parameters. The
betas are given non-informative priors, and in the implementation the random effect for
each group is centered at the group's regression mean μ_i (equivalently, δ_i has mean
zero).
The PROC MCMC statement specifies the input and output datasets, sets a seed for the
random number generator, requests 30,000 simulation iterations, and thins the Markov chain
by 10. The PARMS statement declares the model parameters, which are the regression
coefficients. The PRIOR statements specify the prior distributions for the betas and
for s2.

The symbol w calculates the regression mean, and the RANDOM statement
specifies the random effects, with a normal prior distribution centered at w with
variance s2. The SUBJECT= option indicates the group index for the random effects.
The symbol pi is the inverse logit transformation, and the MODEL statement specifies the
response variable count as a binomial distribution with parameters n and pi.
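The report does not reproduce the PROC MCMC call itself, so the following is a minimal
sketch of what the statements described above might look like. It assumes the aggregated
dataset LAST created by the Appendix code (with the approval count COUNT, the cell total N,
the indicators BLACK, WHITE, NEDU, and NREGION, and the group index IND) and uses an
arbitrary seed.

proc mcmc data=last seed=27513 nmc=30000 thin=10 outpost=postout;
   parms beta0 0 beta1 0 beta2 0 beta3 0 beta4 0 beta5 0 beta6 0;
   parms s2 1;
   prior beta: ~ general(0);               /* flat, non-informative priors */
   prior s2 ~ igamma(0.01, scale=0.01);    /* inverse-gamma prior on the variance */
   w = beta0 + beta1*black + beta2*white + beta3*nedu + beta4*nregion
       + beta5*black*nregion + beta6*white*nedu;   /* regression mean */
   random delta ~ normal(w, var=s2) subject=ind;   /* group-level random effect */
   pi = logistic(delta);                           /* inverse logit */
   model count ~ binomial(n, pi);
run;

The PARMS, PRIOR, RANDOM, and MODEL statements mirror the description above; the OUTPOST=
dataset holds the 3,000 retained posterior draws (30,000 iterations thinned by 10).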
Final Results
Exploratory Analysis Results
Table 2: Obama Approval by Race and Region

Race          Region       Count    %Approve   %Disapprove
Black Only    Midwest        338      74.85       25.15
Black Only    Northeast      215      73.95       26.05
Black Only    South        1,581      76.98       23.02
Black Only    West           181      63.54       36.46
Don't Know    Midwest         73      27.40       72.60
Don't Know    Northeast       77      31.17       68.83
Don't Know    South          119      34.45       65.55
Don't Know    West           100      24.00       76.00
Hispanic      Midwest        225      45.78       54.22
Hispanic      Northeast      254      48.43       51.57
Hispanic      South          550      36.55       63.45
Hispanic      West           638      42.01       57.99
Others        Midwest        400      45.25       54.75
Others        Northeast      308      42.86       57.14
Others        South          793      36.19       63.81
Others        West           826      42.49       57.51
White Only    Midwest      7,618      32.23       67.77
White Only    Northeast    5,270      36.07       63.93
White Only    South       11,225      23.37       76.63
White Only    West         6,692      30.65       69.35
All                        37,483      33.43       66.57
From the above table, it appears that Black Only respondents from the South
approved of Obama's presidency to a greater extent than any other race from any
region. White Only and Don't Know respondents (coded as DK in the dataset)
disapproved of Obama's presidency the most across all regions.
Similarly, when we explored the interaction between race and sex in Table 3, we
found that for the DK race category there is a large gap between males and females in
approval. For all other races, approval is broadly similar across the sexes.
Table 3: Obama Approval by Race and Sex

Race          Sex       Count    %Approve   %Disapprove
Black Only    Female    1,562      75.61       24.39
Black Only    Male        753      74.77       25.23
Don't Know    Female      227      35.24       64.76
Don't Know    Male        142      20.42       79.58
Hispanic      Female    1,045      42.78       57.22
Hispanic      Male        622      39.87       60.13
Others        Female    1,403      41.84       58.16
Others        Male        924      39.39       60.61
White Only    Female   19,955      29.30       70.70
White Only    Male     10,850      29.35       70.65
All                    37,483      33.43       66.57
Table 4: Obama Approval by Race and Education

Race          Education     Count    %Approve   %Disapprove
Black Only    College       1,721      76.87       23.13
Black Only    No college      594      70.88       29.12
Don't Know    College         295      31.53       68.47
Don't Know    No college       74      21.62       78.38
Hispanic      College       1,153      42.15       57.85
Hispanic      No college      514      40.66       59.34
Others        College       1,816      42.51       57.49
Others        No college      511      35.03       64.97
White Only    College      23,034      31.28       68.72
White Only    No college    7,771      23.48       76.52
All                        37,483      33.43       66.57
It therefore appears that there are interactions between Black and the south region
and between White and no college education. To verify these findings, we drew a
decision tree.
The screenshots below, in Diagram 1, show the results from the SAS decision
tree splits for the demographic variables. As expected, race is the major predictor of
Obama approval ratings compared to all other factors. Of the 1,900 survey
participants, over 75% of blacks approved of Obama's presidency, regardless of
geographic region. On the other hand, the other races as a whole had an approval
rating of only 30%. Therefore Black Only, as a race, stands out in this diagram.
From Diagram 2 above, the South stands out among all regions: it had the lowest
approval of Obama (24.75%). After Black as a race in Diagram 1, we can see that White
Only stands out from the other races in Diagram 2. White Only has a markedly lower
approval rating than the Hispanic and Others categories.
Table 5: Variable Importance (variables listed in descending order of importance)

Race, Region, Education, Sex
Table 5 above shows the Variable Importance metric, which is a relative metric
with a value of one for the most important variable. Less important variables have
metrics less than one. We see that the most important variable is race, followed by
region, education, and sex, in descending order.
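The decision-tree code is not shown in the report. As one possible illustration, a
CART-style tree for approval could be grown with PROC HPSPLIT on the raw survey variables
from the Appendix (bo_apr2, race, region, education2, sex); the choice of procedure,
splitting criterion, and depth limit below are assumptions, since the original CART and
CHAID trees may have been produced with other SAS tools.

/* A sketch of a CART-style classification tree for approval,
   assuming the raw dataset cstone.Ipsos_August from the Appendix. */
proc hpsplit data=cstone.Ipsos_August maxdepth=4;
   class bo_apr2 race region education2 sex;
   model bo_apr2 = race region education2 sex;
   grow gini;               /* CART-style impurity criterion */
   prune costcomplexity;    /* cost-complexity pruning */
run;

PROC HPSPLIT also reports a variable importance table comparable to Table 5 above.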
Factors Selection
From the above tables and diagrams, it appears that among race black
respondents stand out clearly for their maximum approval of Obamas presidency. Out
of the remaining four race categories, white respondents differentiate themselves for
low approval of Obamas presidency. Among regions, south stands out clearly.
Although other than these three (i.e., black, white, and region south), no other factors
appear significant in the decision tree, the white with no college education had the
lowest approval of Obamas presidency, as shown in table 4. We decided not to
consider the races DK, i.e., dont know. Therefore, we choose black and white among
19
race category, south among regions, and no college among education. Interaction of
black in south region, and white with no college needs to be studied. We decided to
keep a minimum number of factors and their interactions to keep our model simple
and easy to interpret.
Table 6: Parameters

Block   Parameter   Sampling Method   Initial Value   Prior Distribution
1       s2          Conjugate         1.0000          igamma(0.01, s=0.01)
2       beta0       N-Metropolis      0               general(0)
2       beta1       N-Metropolis      0               general(0)
2       beta2       N-Metropolis      0               general(0)
2       beta3       N-Metropolis      0               general(0)
2       beta4       N-Metropolis      0               general(0)
2       beta5       N-Metropolis      0               general(0)
2       beta6       N-Metropolis      0               general(0)
The Parameters table lists the sampling information: the names of the model
parameters, the sampling algorithms used, the initial values, and the prior distributions.
The conjugate sampling algorithm is used to draw the posterior samples of s2, and
random-walk Metropolis is used for the regression parameters.
Table 7: Random Effect Parameters

Sampling Method   Subject   Number of Subjects   Subject Values                 Prior Distribution
N-Metropolis      groups    12                   1 2 3 4 5 6 7 8 9 10 11 12     normal(w, var=s2)
The Random Effect Parameters table above lists the random effect's sampling method,
the subject variable, and the number of distinct levels of the subject variable. The
total number of random-effects parameters in this model is 12.
Table 8: Posterior Summaries of Parameter Estimates

Parameter   Label                                   N      Mean      Standard Deviation   95% HPD Interval
beta0       Intercept                               3000   -0.1893   0.1123               (-0.4144,  0.0324)
beta1       Race = Black                            3000    1.1767   0.1631               ( 0.8390,  1.4947)
beta2       Race = White                            3000   -0.4817   0.1389               (-0.7502, -0.2074)
beta3       Education = No College                  3000   -0.2496   0.1223               (-0.4867, -0.00440)
beta4       Region = South                          3000   -0.3758   0.1097               (-0.5792, -0.1357)
beta5       Race = Black * Region = South           3000    0.6631   0.2108               ( 0.2834,  1.1427)
beta6       Race = White * Education = No College   3000   -0.1484   0.1900               (-0.4896,  0.2541)
s2          Variance                                3000    0.0213   0.0245               ( 0.00143, 0.0597)
The Posterior Summaries table reports the posterior mean, standard deviation,
and interval estimates of the model parameters. The posterior mean serves as the parameter
estimate for our logistic model, and the posterior standard deviations are interpreted as
the standard errors of the parameter estimates. The 95% HPD interval is the highest
posterior density interval.
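As an illustration of how these estimates translate into approval probabilities (ignoring
the group-level random effect), consider a black respondent with a college education in
the South:

μ = beta0 + beta1 + beta4 + beta5 = -0.1893 + 1.1767 - 0.3758 + 0.6631 ≈ 1.275,
p = exp(1.275) / (1 + exp(1.275)) ≈ 0.78,

which is close to the observed approval rate for that group in Table 1 (920 of 1,177,
about 78%).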
The diagnostics plots for beta1 indicate that the beta1 parameter converged. In
addition, the posterior density is unimodal and bell-shaped, which supports applying
inferences based on the normal distribution. Similarly, the diagnostics plots for the
other parameters provide evidence that those parameters have converged as well.
We also developed macros to implement Gelman's multilevel regression process, which can
be further used for post-stratification.
Recommendations
Further Model Development
Though we were able to create a working model, we feel strongly that other
factors that influence respondents' opinions of the Obama presidency should be
explored in further research. Survey participants should have been asked the reasons for
their approval or disapproval, which might be related to foreign policy, handling of the
economy, party identification, and/or social issues.
Other predictors and interactions can be explored further to see whether they have an
effect on approval of Obama's presidency. Different priors can also be passed to the model
to see whether the results are sensitive to the choice of prior.
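As an example of such a sensitivity check, the flat priors in the earlier sketch could be
replaced with informative normal priors; the prior means and variances below are
placeholders for illustration, not estimates from any earlier poll.

proc mcmc data=last seed=27513 nmc=30000 thin=10 outpost=postout2;
   parms beta0 0 beta1 0 beta2 0 beta3 0 beta4 0 beta5 0 beta6 0;
   parms s2 1;
   prior beta: ~ normal(0, var=10);        /* weakly informative normal priors */
   /* e.g., prior beta1 ~ normal(1.2, var=0.25); would encode an earlier estimate */
   prior s2 ~ igamma(0.01, scale=0.01);
   w = beta0 + beta1*black + beta2*white + beta3*nedu + beta4*nregion
       + beta5*black*nregion + beta6*white*nedu;
   random delta ~ normal(w, var=s2) subject=ind;
   pi = logistic(delta);
   model count ~ binomial(n, pi);
run;

Comparing the posterior summaries from POSTOUT and POSTOUT2 would show how much the
conclusions depend on the prior.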
Our vision is that an application could be developed, leveraging the methods
applied in our work, that would be useful for multilevel regression with
post-stratification (MRP) in SAS. Although MRP has been implemented in R, we did
not find any example in SAS. This method is highly effective for
forecasting elections or for marketing applications at a time when obtaining survey data
is difficult and corporations and government have scarce survey data.
Appendix
Programming code
This code was used to prepare the data lines used in PROC MCMC.
data cstone.Ipsos_August;
   set cstone.Ipsos_August;
   /* Recode the demographic variables into binary indicators */
   if race = "Black Only" then black=1; else black=0;
   if race = "White Only" then white=1; else white=0;
   if education2 = "No college" then nedu=1; else nedu=0;
   if region = "South" then nregion=1; else nregion=0;
   if SEX = "Male" then nsex=1; else nsex=0;
   /* Recode the target: approval and disapproval indicators */
   if bo_apr2 = "Approve" then y1=1; else y1=0;
   if bo_apr2 = "Disapprove" then y2=1; else y2=0;
run;

/* Count respondents in every combination of the target and the factors */
proc freq data=cstone.Ipsos_August noprint;
   tables y1*black*white*nsex*nedu*nregion / out=d;
run;

proc print data=d; run;

/* Approval counts (y1 = 1) for each cell */
data temp1 (drop=percent);
   set d;
   if y1=1;
run;

/* Disapproval counts (y1 = 0) for each cell */
data temp2 (keep=count2);
   set d;
   count2=count;
   if y1=0;
run;

/* Merge the two sets of counts positionally (the cells appear in the
   same order in both) and compute the cell totals n */
data last (drop = y1 count2);
   merge temp2 temp1;
   n=count+count2;
   ind=_N_;   /* group index used as the random-effects subject */
run;

proc print; run;
The diagrams for beta2 (white respondents), beta3 (education), and beta4 (region) show
negative mean regression coefficient estimates, consistent with the Posterior Summaries
table.
Bibliography
Bishop, C. M. 1995. Neural Networks for Pattern Recognition. Oxford: Oxford University Press.

Carlin, B. P., and Louis, T. A. 2009. Bayesian Methods for Data Analysis. Third Edition. CRC/Chapman and Hall.

Gelman, A., Carlin, J. B., Stern, H. S., and Dunson, D. B. 2013. Bayesian Data Analysis. Third Edition. CRC/Chapman and Hall.

Gelman, A., and Hill, J. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press.

Ghitza, Y., and Gelman, A. 2013. "Deep Interactions with MRP: Election Turnout and Voting Patterns Among Small Electoral Subgroups." American Journal of Political Science.

Hastie, T., Tibshirani, R., and Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

Jackman, S. 2005. "Pooling the Polls Over an Election Campaign." Australian Journal of Political Science 40: 499-517.

Linzer, D. A. 2013. "Dynamic Bayesian Forecasting of Presidential Elections in the States." Journal of the American Statistical Association 108: 124-134.

Ntzoufras, I. 2009. Bayesian Modeling Using WinBUGS. Wiley.

Park, D. K., Gelman, A., and Bafumi, J. 2004. "Bayesian Multilevel Estimation with Poststratification: State-Level Estimates from National Polls." Political Analysis 12: 375-385.

Wang, W., Rothschild, D., Goel, S., and Gelman, A. 2014. "Forecasting Elections with Non-Representative Polls." International Journal of Forecasting. Forthcoming.