Lewis and Ward Emerging Themes in Epidemiology 2013, 10:4
http://www.ete-online.com/content/10/1/4

ANALYTIC PERSPECTIVE Open Access

Improving epidemiologic data analyses through multivariate regression modelling
Fraser I Lewis1* and Michael P Ward2

Abstract
Regression modelling is one of the most widely utilized approaches in epidemiological analyses. It provides a method
of identifying statistical associations, from which potential causal associations relevant to disease control may then be
investigated. Multivariable regression – a single dependent variable (outcome, usually disease) with multiple
independent variables (predictors) – has long been the standard model. Generalizing multivariable regression to
multivariate regression – all variables potentially statistically dependent – offers a far richer modelling framework.
Through a series of simple illustrative examples we compare and contrast these approaches. The technical
methodology used to implement multivariate regression, Bayesian network structure discovery, is well established
and, while a relative newcomer to the epidemiological literature, has a long history in computing science. Applications
of multivariate analysis in epidemiological studies can provide a greater understanding of disease processes at the
population level, leading to the design of better disease control and prevention programs.

*Correspondence: [email protected]
1 Section of Epidemiology, VetSuisse Faculty, University of Zürich, Winterthurerstrasse 270, Zürich, CH 8057, Switzerland
Full list of author information is available at the end of the article

© 2013 Lewis and Ward; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

Multivariable regression modelling, in which a single dependent variable is regressed on multiple independent variables, is a technique familiar to any epidemiologist. This analytical approach is a regular feature in the epidemiological literature, and is without doubt a useful tool. By extending this approach to an analogous multivariate regression model, in which all variables are simultaneously considered, substantially enhanced insight into the disease system under study may be gained. At worst, both multivariable and multivariate approaches will give identical results, as they must, because to determine the best possible multivariate model of study data, all possible multivariable models must also be considered, as the latter are simply special cases of the former.

Gaining additional insights into a disease system by simply switching to a more general data analytic technique is clearly very attractive, in particular when the theoretical foundations for the more general approach are long established. The modelling methodology we consider here is referred to as Bayesian network analysis (as defined in [1,2]). This is a form of graphical modelling [3,4], but one whose focus is on structure discovery: determining an optimal statistical model, i.e. graphical structure, directly from observed data. Whilst relatively uncommon in the epidemiological literature, Bayesian network analyses are increasingly finding application in areas of biology, medicine and ecology (e.g. [5-12]), and Bayesian network modelling itself has a vast technical literature (as is easily seen by using the search term “Bayesian network” in any bibliographic database, e.g. pubmed, web of knowledge).

Identifying causal relationships is the objective of many epidemiological analyses involving regression modelling. Empirical analyses of epidemiological data can demonstrate statistical dependency between variables and, as we later demonstrate, Bayesian network analysis is ideally suited to such a task. While the identification of statistical dependency is often a natural step towards postulating causal mechanisms, it is vastly more ambitious to further assert that any given dependency exists within a particular causal web. Expert knowledge and biological understanding are clearly essential, since this is more than a statistical data analysis exercise. To avoid any unnecessary confusion, all analyses and discussion here pertain only to models of statistical association. It is a common misinterpretation to assume that arcs in a Bayesian network model denote causality; they denote only statistical dependency.
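
The caution above, that an arc denotes statistical dependency rather than causality, can be checked numerically. The following minimal sketch is not part of the original analyses. It simulates two continuous variables (the names x and y, the sample size, the seed and the coefficients are all illustrative assumptions) and uses maximised Gaussian log-likelihoods as a simple stand-in for the marginal likelihood metric used in the case studies. It fits the two-variable graph with the arc pointing in each direction, and with no arc at all.

```python
import math
import random


def max_gaussian_loglik(residuals):
    """Maximised Gaussian log-likelihood given residuals about the fitted mean."""
    n = len(residuals)
    variance = sum(r * r for r in residuals) / n  # maximum likelihood variance
    return -0.5 * n * (math.log(2 * math.pi * variance) + 1)


def loglik_no_parent(values):
    """Log-likelihood of a node with no parents (intercept-only model)."""
    mean = sum(values) / len(values)
    return max_gaussian_loglik([v - mean for v in values])


def loglik_one_parent(child, parent):
    """Log-likelihood of a node regressed linearly on a single parent."""
    n = len(child)
    mean_p = sum(parent) / n
    mean_c = sum(child) / n
    sxy = sum((p - mean_p) * (c - mean_c) for p, c in zip(parent, child))
    sxx = sum((p - mean_p) ** 2 for p in parent)
    slope = sxy / sxx
    intercept = mean_c - slope * mean_p
    return max_gaussian_loglik(
        [c - intercept - slope * p for p, c in zip(parent, child)]
    )


# Simulated (hypothetical) data: y depends linearly on x plus noise.
random.seed(1)  # fixed seed so the simulated data are reproducible
x = [random.gauss(0.0, 1.0) for _ in range(400)]
y = [0.8 * xi + random.gauss(0.0, 0.5) for xi in x]

# A graph's log-likelihood is the sum of its node-wise log-likelihoods.
loglik_x_to_y = loglik_no_parent(x) + loglik_one_parent(y, x)  # graph x -> y
loglik_y_to_x = loglik_no_parent(y) + loglik_one_parent(x, y)  # graph y -> x
loglik_independent = loglik_no_parent(x) + loglik_no_parent(y)  # no arc

print(f"x -> y: {loglik_x_to_y:.6f}")
print(f"y -> x: {loglik_y_to_x:.6f}")
print(f"no arc: {loglik_independent:.6f}")
```

Under these assumptions the two directed graphs give the same value up to floating-point rounding, while the graph with no arc fits worse. The data can therefore support the presence of the arc, but they cannot say which way it points.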

Our objective here is to demonstrate the potential utility of Bayesian network structure discovery to epidemiologists. We consider specifically additive Bayesian networks, which are Bayesian network models parameterized in an analogous fashion to generalised linear models. The classical formulation of Bayesian networks for binary or multinomial variables uses a mathematically elegant contingency table parameterisation [1,2]. For epidemiological analyses such a parameterisation is both unusual and rather opaque, and is likely vastly over-parameterized compared to the familiar additive formulation used in generalised linear models (as discussed in [13]).

In the following sections we first briefly review the motivation and experimental origins of regression modelling in scientific studies. Graphical regression is then introduced, followed by a series of simple empirical examples which compare and contrast multivariable and multivariate regression. We then discuss the epidemiological implications of these results and the limitations of the approach.

Regression modelling concepts - a brief review

In classical experimental trial scenarios (e.g. [14], such as factorial or Latin square designs), the investigator is able to fix at predetermined values all of the variables of interest in the experiment. These are the independent variables in a multivariable regression model. The research question being asked here is how the measurement variable – the outcome or response variable – changes across the various different patterns of values chosen for the independent variables. This is the historical foundation of regression modelling. The ability to fix all variables of interest to predetermined values is crucial and underlies the experimental study design, because it enables unambiguous estimation of all key covariate effects on the response variable.

The classical experimental design scenario contrasts sharply with what is feasible and practical in many epidemiological studies, either in humans or other animals. Considering zoonotic pathogens for example, animal husbandry, livestock production and farm environment characteristics are by their nature highly inter-dependent. Thus, it is generally impossible to separate out the “true” effects of individual covariates on the response variable (e.g. the design matrix is not orthogonal, see [15]) because the estimated effect of any covariate will now generally also depend on which other covariates are included in the model (including the case in which all variables are included). Moreover, determining the most appropriate covariates for inclusion in the model is considerably more difficult when dependencies exist between study variables, as in the case in which confounding variables are present. The Yule-Simpson paradox [16-18], in which an apparent relationship between variables (e.g. a disease and a putative risk factor) may disappear or even be reversed when other variables are taken into account, is particularly troublesome here. So too are the closely related difficulties of negative (or positive) confounding.

In multivariable regression, relationships between the “independent” variables in a study do not feature explicitly in the modelling process. This seems entirely reasonable in the classical designed experiment scenario. In regression analyses of epidemiologic data where many inter-dependencies between study variables may be present, explicitly modelling all relationships between all variables is intuitively far more reasonable (as demonstrated in our later examples). Common multivariable model selection approaches, such as stepwise searches, may be sufficient to implicitly account for such inter-dependencies, and thus identify an optimal set of predictors for the outcome (disease) variable. But a considerable difficulty here is how to justify that the modelling results obtained are as optimal as is practicable for a given study. The standard way to address such issues in statistical modelling is to compare a simpler model with a more general model. If the goodness of fit of the simpler model is no worse than that of the more general model then the former is chosen as the preferred model. This is the concept of parsimony: it is more desirable to explain a phenomenon, e.g. disease occurrence, with a simpler model than with a complex one. In our current context “more general” also refers to expanding the scope of the modelling framework to explicitly include all relationships between all variables, i.e. a multivariate rather than multivariable regression model. The Bayesian network literature has long provided all the necessary theory and algorithms (e.g. [1,2,19,20]) to implement such regression modelling. Historically, the main practical difficulty in the application of this approach has been a lack of suitable computing resources and relevant accessible software.

Regression modelling in epidemiology

In typical regression analyses found in the epidemiological literature (e.g. [21,22]) the use of a hypothesis testing (P-values) framework is still far more common than Bayesian inference. There is a considerable body of evidence which strongly argues against the use of hypothesis testing and P-values for model comparison and selection. Information theoretic and Bayesian approaches are argued to be preferable on both conceptual and performance grounds [23-26]. When the primary objective is to identify optimal parsimonious models, i.e. structure discovery, then, in purely practical terms, the choice between a Bayesian and a non-Bayesian paradigm is largely irrelevant, as the use of uninformative or diffuse priors is standard practice in structure discovery (e.g. see [1,2,19,20]). Hence, the actual parameter estimates in any given model will be almost identical to the maximum likelihood analogue. However, the very
considerable advantage of adopting a Bayesian paradigm is that we can then directly utilize established model selection and comparison techniques from the Bayesian networks literature [2,19,20].

Empirical examples: multivariable versus multivariate

We first briefly describe a graphical statistical model (recall that additive Bayesian network structure discovery is concerned entirely with graphical models) and its conceptual differences from classical regression. We then present three separate illustrative analyses using risk factor case study data (unpublished veterinary data with variable names anonymized to maintain confidentiality) comprising 400 observations across 17 variables, where each variable is a measurement or attribute from an individual subject (animal) and each subject appears only once in the data. There are five binary variables and 12 continuous variables. Note that for our current purposes background knowledge of the particular variables in the study is not relevant, as we are only interested in comparing and contrasting the statistical results obtained by applying two different techniques to identical data. This is an observational study and the investigator was not able to fix the values of any of these variables.

Introducing graphical regression

In graphical statistical modelling there is no distinction made between covariates and a response variable. All are just “variables” as, formally speaking, a graphical statistical model is a representation of the joint probability distribution of all the random variables in the data. Figure 1(a) depicts a graphical model which is directly analogous to a classical multivariable regression model, as arcs terminate only at a single “response” variable (e.g. g5). But this model has a statistical interpretation which is radically different from that in classical regression. Here: i) variables b3, b6, g9 and g10 are directly dependent with variable g5; ii) variables b3, b6, g9 and g10 are all indirectly dependent with each other (via g5); and iii) all other variables are independent. In terms of i), direct dependence means there is an arc directly connecting these variables (in either direction). In terms of ii), in a graphical model all variables in the same component (a collection of connected arcs, ignoring direction) are jointly statistically dependent. This means that knowing the value of one variable in this component can potentially provide information about the likely value of any other variable in this component (see [3,4]). If a variable has no arcs, either emanating from it or terminating at it, then it is statistically independent. In such a case knowing the value of any other variable in the model tells us nothing about the value of this variable.

All the graphical models we consider here are concerned only with statistical dependency, and arc direction in such models in no way implies any causal relationship. The direction of arcs is a result of the probability calculus required when dealing with models comprising joint probabilities. In general, arc direction has no epidemiological interpretation because observed data alone cannot discriminate between arcs of opposite directions. This is simply a consequence of factorising joint probability distributions, and is typically referred to as likelihood equivalence (see p. 1052 in [11] for a more general explanation, and [2] for technical details). A potential practical complication of likelihood equivalence arises when searching for an optimal graphical structure. Standard search approaches in the literature, such as Heckerman’s heuristic hill climber [2] and the exact order based search by Koivisto [20] (the latter is used in our later case studies), identify a single optimal (directed) graph. This is as opposed to identifying all graphs within the same likelihood equivalence class, which is computationally intractable [2]. If the objective is to identify all statistical dependencies in study data then, as mentioned above, arc direction is not relevant and such difficulties can be ignored. This is not the case, however, when viewing the modelling results within a causal (or indeed a longitudinal) framework, as the arc direction then has an obvious real interpretation. In causal analyses the use of a priori restrictions on arc directions to avoid contradicting known epidemiological fact is likely appropriate (although not without some conceptual challenges, see p. 212 in [2]). Causal analyses of data using graphical models represent a large, and somewhat distinct, literature from Bayesian networks, with [27] a standard text.

In summary, classical multivariable regression can easily be denoted by a graphical model, although the interpretation of the model is different in that it is now a joint probability model, albeit of a very simple structure. The reason for considering such regression models within a graphical modelling framework is that the graphical structure can now easily be relaxed to allow dependencies (arcs) to be present between any variables, i.e. this framework allows us to directly compare results from applying multivariable regression and multivariate regression on the same data. This then gives us the main “result” of this paper: a demonstration of how using multivariate regression may enhance our understanding of a disease system.

Case study results

We now present three analyses. In each we determine the globally optimal “multivariable” graphical regression model, and compare this with the globally optimal “multivariate” graphical regression model. The term “globally optimal” here refers to a model which has the best possible
goodness of fit of all possible models, and is determined using an established exact (as opposed to heuristic) structural search algorithm [20]. The goodness of fit metric used here is the marginal likelihood [28], which is the standard metric in Bayesian model selection.

Figure 1 Globally optimal multivariable regression model with g5 as the response variable and globally optimal multivariate regression model of all 17 variables. (a) Globally optimal multivariable regression model with g5 as the response variable and covariates b3, b6, g9 and g10, log marginal likelihood = -8664.4; (b) Globally optimal multivariate regression model of all 17 variables, log marginal likelihood = -8311.6. The Markov blanket for variable g5 comprises those variables in grey. Squares denote binary variables, ovals continuous.

When comparing models in a Bayesian paradigm the objective is to infer which is the most plausible model given suitable observations. Borrowing notation and terminology from MacKay [28], the posterior probability of each Bayesian model, P(H_i | D), can be written as P(H_i | D) ∝ P(D | H_i) P(H_i), where D denotes the observed data, e.g. a database of study records, and H_i denotes a hypothesis, in other words a model of the data, i.e. a chosen hypothesis about relationships in the data parameterized into a statistical model. The data dependent term, P(D | H_i), is called the evidence, and P(H_i) is a quantification of our subjective prior belief about the current hypothesis (i.e. model i) before any data has arrived. In the Bayesian networks literature it is usual for all models to be considered equally plausible prior to observing any data [2,19], in which case P(H_i) is just a fixed constant for all models (and thus can be ignored) and the evidence is proportional to the posterior probability for each model. The evidence, P(D | H_i), is also called the marginal likelihood and has been shown to have a number of theoretically desirable qualities; the model with the highest marginal likelihood is the preferred model. Model selection using the marginal likelihood has been shown to be equivalent to using Occam’s Razor (for more details see [28] p. 422).

The model in Figure 1(a) is the best possible multivariable model for the data when we consider g5 as the response variable. That is, it is the best possible graph structure when an exact model search is used with the restriction that arcs are only allowed to terminate at variable g5. This search restriction ensures that the scope of our graphical model is limited only to a multivariable regression model. We now repeat an identical exact search but this time without the previous restriction on the location of arcs. This allows us to determine the best multivariate regression model of the data, that is, we consider all variables simultaneously. This model is given in Figure 1(b). Note that this is a directed acyclic graph (DAG): no cycles (feedback loops) exist, which is a technical requirement of a graphical statistical model.

Before we compare the modelling results in Figure 1(a) and (b), it is worth emphasizing the key methodological
point here: the only difference in the process which identified graph (a) as the best model of the observed data, and graph (b) as the best model of the observed data, is that in the former the scope of the model search was restricted to only consider graphs with arcs terminating at g5, i.e. a multivariable graphical regression model.

In a graphical model the standard way to interpret the results relative to a single variable is to compute its Markov blanket [29]. A Markov blanket (highlighted in grey in the figures) comprises the parents, children and children’s parents of the variable of interest (arcs go from parents to children [19]). To predict values for any variable in a DAG, all we need are the variables in its Markov blanket; all other variables in the graph can be discarded. Conversely, each variable in the Markov blanket is needed because each provides knowledge about the variable of interest.

In Figure 1(a) the multivariable model provides statistical evidence that variables b3, b6, g9 and g10 are directly dependent with g5. In Figure 1(b) the multivariate model provides evidence that, additionally, variable b5 is directly dependent with g5, and it is therefore also in the Markov blanket of g5. This then suggests that b5 should be included along with these other variables for further investigation into their potential epidemiological significance with response g5. In summary, even though we only have a single response variable in this current analysis, using the more general multivariate model has provided a different set of most supported “predictor” variables.

We now consider two further examples which are analogous to the g5 example but which now treat variable g2 (Figure 2) and then variable b3 (Figure 3) as the response variables in multivariable analyses. The globally optimal multivariate model is obviously unchanged from Figure 1(b) since in such a model all variables are considered jointly, but the Markov blankets for g2 (Figure 2b) and b3 (Figure 3b) are now highlighted.

Figure 2 Globally optimal multivariable regression model with g2 as the response variable and globally optimal multivariate regression model of all 17 variables. (a) Globally optimal multivariable regression model with g2 as the response variable and covariates b4 and g3, log marginal likelihood = -8530.0. (b) Globally optimal multivariate regression model of all 17 variables, log marginal likelihood = -8311.6. The Markov blanket for variable g2 comprises those variables in grey. Squares denote binary variables, ovals continuous.

It is apparent that in both Figure 2 and Figure 3 the results obtained using multivariable regression are quite different from those obtained using multivariate regression. The number of direct arcs connected to (or from) the response variable has increased, from two to five in Figure 2 and from three to six in Figure 3. As the multivariate model permits arcs both to and from the response variable this is perhaps no surprise, although there is no
reason that this need always be the case. What may be rather more surprising is that arcs identified in the multivariable model may not be identified in the multivariate model. For example in Figure 2(a) there is an arc from b4 to g2, but in Figure 2(b) b4 is not directly connected to g2 – moreover, it is not even in g2’s Markov blanket. The multivariable model suggests b4 is worthy of further investigation. In contrast, the full multivariate model suggests that in fact b4 is only indirectly related with g2, and this indirect dependence is also remote in the graph, i.e. outside the Markov blanket. In other words there is little statistical evidence to support epidemiological investigation of b4. This result cannot be dismissed by arguing that the multivariable model is somehow more parsimonious, because the same model selection metric is used in all model comparisons. There is a very large difference (> 100) in (log) marginal likelihood values between the multivariable and multivariate models (see figure legends). A guide to the relative size and interpretation of differences in (log) marginal likelihoods can be found in Table 2.1, page 27 in [30]; using the terminology there, a difference of greater than 10 is considered very strong evidence in favour of the model with greater (log) marginal likelihood. In summary, therefore, the multivariate models are simply a better fit to the data in our examples.

Figure 3 Globally optimal multivariable regression model with b3 as the response variable and globally optimal multivariate regression model of all 17 variables. (a) Globally optimal multivariable regression model with binary variable b3 as the response and covariates b4, g7 and g8, log marginal likelihood = -8670.9. This is a generalised linear model with logit link function. (b) Globally optimal multivariate regression model of all 17 variables, log marginal likelihood = -8311.6. The Markov blanket for variable b3 comprises those variables in grey. Squares denote binary variables, ovals continuous.

Our final example is shown in Figure 3. The key difference between the results here is that the multivariable model implies that there are three variables worthy of further investigation. In contrast, the multivariate model has ten variables in the Markov blanket for b3, six of which are directly connected with b3.

To complete our case study analyses, and to further emphasize that our proposed multivariate regression approach is simply a generalization of usual multivariable regression, we note that it is readily possible to compute odds ratios and mean effects of the parameters (arcs) in our graphical model. For example, the marginal (posterior) log odds ratio for the arc from g5 to b5 (see panel b of any of the figures) has a 95% confidence (or credible) interval of (0.20, 0.65). This is a log odds ratio as we have a logistic regression between b5 and g5 in this part of the graph. Similarly, the marginal mean (posterior) effect for the arc from b3 to g4 has a 95% confidence interval of
(0.18, 0.54), and for the arc from g2 to g9 the corresponding interval is (0.08, 0.27). The latter two intervals are for the mean effect rather than log odds as these are Gaussian regressions. It is straightforward to compute such parameter coefficients for any node in the model. Note also that none of these 95% confidence intervals includes zero. These would therefore typically be considered as having a strong degree of statistical support, and each is connected to the “response” node in each of our three multivariate examples.

Tables giving medians and marginal 95% posterior confidence intervals for every parameter in each of the three multivariable models (Figures 1a, 2a and 3a) and in the full multivariate model (Figures 1b/2b/3b) can be found in the Additional file 1: Appendix. A key point of note here is that nodes with the same parents have identical parameter estimates in each model (e.g. compare variables g1, g11 and g12 between the multivariable and multivariate models), as they should. The multivariate model is simply a collection of multivariable models and so the parameter estimates will be the same given the same parents. The difference is that the former is more flexible and allows any node to have parents, unlike in a GLM type model. Generally speaking, and as we have seen in our case study examples, this means that the parents, and therefore parameter estimates, may be different for at least some variables (nodes) in the data (e.g. compare the parameter estimates for node g5 between Figure 1a and 1b).

Epidemiological implications

When the analytical task is to identify statistical dependencies with one (or more) response variables, then both theoretically and as demonstrated in the above empirical examples, the more general additive Bayesian network structure discovery approach appears clearly preferable. In particular, the multivariable approach is just a special case of the multivariate approach, i.e. there is nothing preventing the more general structural search (Figures 1b, 2b, 3b) from identifying the same globally optimal model as in the restricted structural search (Figures 1a, 2a, 3a). Hence, there is nothing to lose, at least in statistical terms, by adopting the more general approach. Moreover, the far simpler multivariable approach may identify covariates which are not supported by the multivariate model, e.g. the second case study example. A possible explanation for such contradictions is the Yule-Simpson paradox, in that we are trying to describe observations from a complex disease system of inter-dependent variables through a multivariable model, which may just be too simplistic for this particular application.

By using a multivariate regression approach the trade-off being made with classical multivariable regression is that the former may provide potentially more information about the disease system under study, in terms of identifying statistical dependencies. This may lead to novel findings. But equally, some of the newly identified statistical dependencies may also be readily discarded as potential causal associations, when viewed through the prism of an epidemiologist’s expert knowledge of the biology of the disease(s) of interest. A brief contrast may be made here with historical approaches such as path analyses [31], which were applied reasonably commonly during the 1970s to address a range of chronic and environmental diseases [32-34], and which still appear occasionally in the epidemiological literature. The key distinction between path analyses and additive Bayesian network structure discovery is that the former is explicitly causal, in that some or all of the graph structure is determined a priori via expert opinion. The latter asserts only the presence of statistical dependency, and while it can include prior expert opinion in the structural search (it is a Bayesian approach after all) the default usage is to allow the data itself to identify an optimal graph structure. The advantage of allowing the study data itself to identify an optimal graph is that this may include arcs which an expert may not, and may not include arcs which an expert would assert must be present in the given disease system. The epidemiological challenge is then to explain such discrepancies, which may result in gaining new insight into the disease system.

Software for multivariate regression

Reliable, easy to use software is essential for facilitating the uptake of any new data analytic technique into the epidemiological community. In order to apply additive Bayesian network structure discovery to study data, appropriate software is required. In theory, Bayesian network analyses could be performed within a number of widely used Bayesian software programs, such as WinBUGS/OpenBUGS [35] or JAGS [36]. In practice, however, other approaches are necessary because the central task in Bayesian network analyses is structure learning, which involves fitting and comparing a great many different multivariate models. In programs such as OpenBUGS and JAGS it is simply computationally impractical to fit every model via Markov chain Monte Carlo simulation, in addition to the difficulty in reliably estimating the marginal likelihood for each model. Instead, programs which employ analytical approximations, i.e. Laplace approximations [37,38], are preferable and indeed arguably necessary. The particular software we used in the examples is the abn library for R [39], which has been developed by the authors for performing additive Bayesian network structure discovery with epidemiological data and is available from CRAN (http://cran.r-project.org/web/packages/abn/index.html). This library has been extensively tested and validated against other
established Bayesian modelling software such as INLA [40] (and also JAGS). The abn library also includes wrappers to allow INLA to be used for all numerical estimation. Case studies which demonstrate how to implement analyses similar to those presented, along with detailed numerical validation studies, are available from http://www.r-bayesian-networks.org. A range of other R libraries for fitting Bayesian networks and other forms of graphical regression can be found at http://cran.r-project.org/web/views/gR.html. While these software libraries are all for use with R, it is also possible for R to be accessed from within other popular statistical software such as SAS (via IML Studio).

Limitations
Computational feasibility
The main limitation when applying additive Bayesian network structure discovery to epidemiological data is computational feasibility. The number of variables which can be included in a Bayesian network analysis is limited. As a guide, this might be less than about 25 variables for exact structural search techniques and perhaps up to 40 for heuristic approaches (e.g. see [11]). Inclusion of more variables is possible with access to specialist computing facilities and expertise. This means that including additional variables, such as interaction terms, which can be done easily enough just as in standard regression modelling, requires careful consideration. Each term adds to the number of variables in the model, and therefore adds considerably to the computational time required to perform structural searches. There are a number of ad-hoc ways to address the computational demands, for example by splitting variables into smaller thematic groups for analysis. This may then suggest that some variables can be dropped, reducing the computational burden to a more manageable level. For larger problems (more variables), model averaging using order-based Markov chain Monte Carlo is an option [19] as it can cope with many more variables (e.g. > 100). Such averaging approaches randomly sample from the posterior landscape of possible graphs (strictly speaking, node orderings), with better fitting graphs being sampled more often than poorer fitting graphs, and during this sampling (i.e. jumping from model to model) it is possible to estimate the relative support for each arc (or groups of arcs) in terms of posterior probabilities. This is an approach used in bioinformatics for sequence analyses (e.g. [10]) but does have some important caveats, such as producing results which will be biased relative to direct (non-order based) model selection.
Access to scientific computing facilities, while not essential, is highly beneficial. For larger problems (> 25 variables), heuristic search approaches are required, which are demanding as they must be run many times to ensure reliable results. An additional severe computational drain is addressing over-fitting, which is an ever-present problem in model selection [41], irrespective of whether exact or heuristic searches are used. Good practice in structure discovery is either to utilize some form of model averaging, for example using majority consensus graphs as the optimal model [9,42], or else to use parametric bootstrapping approaches [43] applied to the globally best graph [11]. The majority consensus approach is similar to that used in phylogenetics with tree structures, except here a majority consensus graph is created from all arcs which appeared in at least a majority of heuristic search results. This provides an alternative way to estimate relative support for individual arcs other than by Markov chain Monte Carlo, which can be highly problematic when dealing with graph structures (see [19]). A single exact search for a model comprising 20 variables may take 24 hours to complete on a modern desktop, and this may need to be repeated many (hundreds of) times during model averaging or bootstrapping to ensure robust results. Code for addressing over-fitting using parametric bootstrapping and also parallelization across a cluster computer can be found at http://www.r-bayesian-networks.org.

Future potential: missing values
Missing observations are a common feature of field studies and epidemiological data. In standard regression modelling, observations with missing values are usually dropped from analyses (as it is essential to maintain identical observations when comparing different models). In graphical regression modelling this is also the easiest course of action. There are, however, a number of established algorithms for fitting graphical models in the presence of missing values due to the joint probabilistic nature of these models. Rather than “fill-in” such values using traditional approaches such as multiple imputation, a graphical model can be used to either marginalize out missing entries in the data [44] or predict their most likely values using the graphical structure itself via the propagation of probabilities across the graph (methods of propagation form a considerable part of the graphical modelling literature, see [4,45,46]). These are elegant conceptual solutions, although they do still assume that values are missing at random, but such approaches are numerically highly complex. It is unclear whether these would be feasible in the context of structure discovery, as when there are missing values in the data the graph can no longer be split into conditionally independent computational units (i.e. each node and its parents; for estimating the marginal likelihood, see [2]). This is a very considerable complication, both in terms of implementation and computing time. Approaches have been developed for structure discovery in the presence of missing values, such
as Structural-EM [47], although implemented in models with a simpler parameterisation than those presented here. The implementation of such approaches is an area of future work, but further highlights the considerable existing theory and potential of graphical regression and structure discovery approaches in analyses of epidemiological data.

Conclusion
The wide utilization of regression modelling in epidemiological analyses means that outputs from such analyses have a ready application in disease control and prevention programs. Up until recently, such applications have been constrained by the use of multivariable regression. Extending multivariable regression to full multivariate regression, utilizing additive Bayesian network structure discovery, offers the epidemiologist potentially far greater insight into the complex inter-relationships between variables within a disease system. The main constraint in the use of this methodology is its considerable computational demands, but given the ever increasing availability of cheap computing power this technique is increasingly feasible for use in a wide range of studies.

Additional file
Additional file 1: Appendix. Supplementary tables of parameter estimates.

Competing interests
Both authors have no financial or non-financial competing interests to declare.

Authors’ contributions
FIL wrote the manuscript and performed the statistical modelling; MPW co-wrote and assisted with the manuscript. Both authors read and approved the final manuscript.

Author details
1 Section of Epidemiology, VetSuisse Faculty, University of Zürich, Winterthurerstrasse 270, Zürich, CH 8057, Switzerland. 2 Faculty of Veterinary Science, University of Sydney, Camden, NSW 2570, Australia.

Received: 7 January 2013 Accepted: 10 May 2013
Published: 17 May 2013

References
1. Buntine W: Theory refinement on Bayesian networks. In Proceedings of Seventh Conference on Uncertainty in Artificial Intelligence. Los Angeles: Morgan Kaufmann; 1991:52–60.
2. Heckerman D, Geiger D, Chickering DM: Learning Bayesian networks - The combination of knowledge and statistical-data. Mach Learn 1995, 20(3):197–243.
3. Jensen FV: Bayesian Network and Decision Graphs. New York: Springer-Verlag; 2001.
4. Lauritzen SL: Graphical Models. Oxford: Univ Press; 1996.
5. Jansen R, Yu HY, Greenbaum D, Kluger Y, Krogan NJ, Chung SB, Emili A, Snyder M, Greenblatt JF, Gerstein M: A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 2003, 302(5644):449–453.
6. Milns I, Beale CM, Smith VA: Revealing ecological networks using Bayesian network inference algorithms. Ecology 2010, 91(7):1892–1899.
7. Needham CJ, Bradford JR, Bulpitt AJ, Westhead DR: A primer on learning in Bayesian networks for computational biology. PLoS Comput Biol 2007, 3(8):e129.
8. Sachs K, Perez O, Pe’er D, Lauffenburger DA, Nolan GP: Causal protein-signaling networks derived from multiparameter single-cell data. Science 2005, 308(5721):523–529.
9. Poon AFY, Lewis FI, Pond SLK, Frost SDW: Evolutionary interactions between N-linked glycosylation sites in the HIV-1 envelope. PLoS Comput Biol 2007, 3:110–119.
10. Poon AFY, Lewis FI, Frost SDW, Pond SLK: Spidermonkey: rapid detection of co-evolving sites using Bayesian graphical models. Bioinformatics 2008, 24(17):1949–1950.
11. Lewis FI, McCormick BJJ: Revealing the complexity of health determinants in resource-poor settings. Am J Epidemiol 2012, 176(11):1051–1059.
12. Sanchez-Vazquez MJ, Nielen M, Edwards SA, Gunn GJ, Lewis: Identifying associations between pig pathologies using a multi-dimensional machine learning methodology. BMC Vet Res 2012, 8:151.
13. Rijmen F: Bayesian networks with a logistic regression model for the conditional probabilities. Int J Approximate Reasoning 2008, 48(2):659–666.
14. Fisher RA: Miscellanea - The goodness of fit of regression formulae, and the distribution of regression coefficients. J R Stat Soc 1922, 85:597–612.
15. Montgomery DC: Design and Analysis of Experiments, 6th Edition. New York: Wiley; 2005.
16. Yule GU: On the association of attributes in statistics: with illustrations from the material of the childhood society, &c. Philos Trans R Soc Lond Ser A-containing Papers Math Phys Character 1900, 194:257–319.
17. Simpson EH: The interpretation of interaction in contingency tables. J R Stat Soc Ser B-stat Methodol 1951, 13(2):238–241.
18. Hand DJ, McConway KJ, Stanghellini E: Graphical models of applicants for credit. IMA J Manage Math 1997, 8(2):143–155. [http://imaman.oxfordjournals.org/content/8/2/143.abstract]
19. Friedman N, Koller D: Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Mach Learn 2003, 50(1-2):95–125.
20. Koivisto M, Sood K: Exact Bayesian structure discovery in Bayesian networks. J Mach Learn Res 2004, 5:549–573.
21. Holmoy IH, Kielland C, Stubsjoen SM, Hektoen L, Waage S: Housing conditions and management practices associated with neonatal lamb mortality in sheep flocks in Norway. Prev Vet Med 2012, 107(3-4):231–241.
22. Sanogo M, Abatih E, Thys E, Fretin D, Berkvens D, Saegerman C: Risk factors associated with brucellosis seropositivity among cattle in the central savannah-forest area of Ivory Coast. Prev Vet Med 2012, 107(1–2):51–56. [http://www.sciencedirect.com/science/article/pii/S0167587712001663]
23. Lukacs PM, Thompson WL, Kendall WL, Gould WR, Doherty J, Paul F, Burnham KP, Anderson DR: Concerns regarding a call for pluralism of information theory and hypothesis testing. J Appl Ecol 2007, 44(2):456–460.
24. Burnham KP, Anderson DR: Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. New York: Springer-Verlag; 2002.
25. Raftery AE: Bayesian model selection in social research. Sociol Methodol 1995, 25:111–163.
26. Posada D, Buckley TR: Model selection and model averaging in phylogenetics: Advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Syst Biol 2004, 53(5):793–808.
27. Pearl J: Causality: Models, Reasoning and Inference. New York: Cambridge Univ Press; 2000.
28. Mackay DJC: Bayesian interpolation. Neural Comput 1992, 4(3):415–447.
29. Pearl J: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann: San Mateo CA; 1988.
30. Congdon P: Bayesian Statistical Modelling. Chichester: Wiley; 2001.
31. Wright S: Correlation and causation Part I. Method of path coefficients. J Agric Res 1920, 20:0557–0585.
32. Lave LB, Seskin EP: Air-pollution, climate, and home heating - their effects on US mortality-rates. Am J Public Health Nations Health 1972, 62(7):909.
33. Page T, Harris RH, Epstein SS: Drinking-water and cancer mortality in Louisiana. Science 1976, 193(4247):55–57.
34. Chase HC: 100th Annual Meeting of the American Public Health Association on a study of risks, medical care and infant mortality, Atlantic City, New Jersey, USA, November 14–15 1972. Am J Public Health 1973, 63(SUPPL):1–56.
35. Lunn D, Spiegelhalter D, Thomas A, Best N: The BUGS project: evolution, critique and future directions. Stat Med 2009, 28(25):3049–3067.
36. Plummer M: JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), March 20-22, Vienna, Austria; 2003.
37. Tierney L, Kadane JB: Accurate approximations for posterior moments and marginal densities. J Am Stat Assoc 1986, 81(393):82–86.
38. Smith AFM: Bayesian computational methods. Philos Trans R Soc Lond Ser Math Phys Eng Sci 1991, 337(1647):369–386.
39. R Development Core Team: R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2006. [http://www.R-project.org] [ISBN 3-900051-07-0]
40. Rue H, Martino S, Chopin N: Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J R Stat Soc Ser B-Stat Methodol 2009, 71:319–392.
41. Babyak MA: What you see may not be what you get: A brief, nontechnical introduction to overfitting in regression-type models. Psychosom Med 2004, 66(3):411–421.
42. Poon AFY, Lewis FI, Pond SLK, Frost SDW: An evolutionary-network model reveals stratified interactions in the V3 loop of the HIV-1 envelope. PLoS Comput Biol 2007, 3(11):2279–2290.
43. Friedman N, Goldszmidt M, Wyner A: Data analysis with Bayesian networks: A bootstrap approach. In Proc. Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI’99). San Francisco: Morgan Kaufmann; 1999:206–215.
44. Chickering DM, Heckerman D: Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Mach Learn 1997, 29(2–3):181–212.
45. Korb KB, Nicholson AE: Bayesian Artificial Intelligence. Boca Raton: Chapman and Hall/CRC; 2004.
46. Lauritzen SL, Spiegelhalter DJ: Local computations with probabilities on graphical structures and their application to expert systems. J R Stat Soc Ser B-Methodological 1988, 50(2):157–224.
47. Friedman N: The Bayesian structural EM algorithm. In Uncertainty in Artificial Intelligence. Proceedings of the Fourteenth Conference (1998). Edited by Cooper GF, Moral S; 1998:129–138.
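
Supplementary sketch: majority consensus graph
The majority consensus approach described under Computational feasibility keeps every arc that appears in a majority of heuristic search results. The following is a minimal sketch of that counting step only, written in Python rather than with the abn library used in the paper. The function names (arc_support, majority_consensus), the 0.5 threshold and the three example arc sets are illustrative assumptions and are not taken from the analyses reported here.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Arc = Tuple[str, str]  # (parent, child): an arc pointing from parent to child


def arc_support(graphs: List[Set[Arc]]) -> Dict[Arc, float]:
    """Fraction of search results in which each directed arc appears."""
    if not graphs:
        raise ValueError("at least one search result is required")
    # Each graph is a set, so an arc is counted at most once per search.
    counts = Counter(arc for graph in graphs for arc in graph)
    return {arc: n / len(graphs) for arc, n in counts.items()}


def majority_consensus(graphs: List[Set[Arc]], threshold: float = 0.5) -> Set[Arc]:
    """Keep arcs whose support is strictly greater than `threshold`."""
    return {arc for arc, s in arc_support(graphs).items() if s > threshold}


# Hypothetical arc sets from three heuristic searches over nodes g1..g3.
searches = [
    {("g1", "g2"), ("g2", "g3")},
    {("g1", "g2"), ("g3", "g2")},
    {("g1", "g2"), ("g2", "g3"), ("g1", "g3")},
]

consensus = majority_consensus(searches)
print(sorted(consensus))
```

Arc direction matters in this sketch, so g2 to g3 and g3 to g2 are counted separately. Note also that a set of majority arcs is not guaranteed to be acyclic, and the sketch performs no such check; the support values are simple frequencies across searches, not posterior probabilities.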

doi:10.1186/1742-7622-10-4
Cite this article as: Lewis and Ward: Improving epidemiologic data analyses through multivariate regression modelling. Emerging Themes in Epidemiology 2013 10:4.
