Causal Analysis For Multivariate Integrated Clinic
Abstract
Electronic health records (EHRs) provide a rich source of observational patient data that can be explored to infer
underlying causal relationships. These causal relationships can be applied to augment medical decision-making
or suggest hypotheses for healthcare research. In this study, we explored a large-scale EHR dataset on patients
with asthma or related conditions (N = 14,937). The dataset included integrated data on features representing
demographic factors, clinical measures, and environmental exposures. The data were accessed via a service named
the Integrated Clinical and Environmental Exposures Service (ICEES). We estimated underlying causal relationships from the data
to identify significant predictors of asthma attacks. We also performed simulated interventions on the inferred causal
network to detect the causal effects, in terms of shifts in probability distribution for asthma attacks.
Keywords Causal inference, Structure learning, Open clinical data, Asthma
© The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0
International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if
you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or
parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated
otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To
view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Sinha et al. BMC Medical Informatics and Decision Making (2025) 25:27 Page 2 of 7
while necessary to ensure patient privacy and protection of sensitive data, access to the data for research is often challenging.
In this research, we analyzed a patient-level dataset extracted from a regulatory-compliant open service called the Integrated Clinical and Environmental Exposures Service (ICEES). ICEES supports several use cases, including asthma. The ICEES data are constructed by integrating clinical data elements derived from patient EHRs and environmental exposures data derived from a variety of public sources, before binning or recoding the data and stripping all protected health information per the Safe Harbor method of the Health Insurance Portability and Accountability Act [3].
The ICEES data are then exposed via an open application programming interface (OpenAPI). For our principal application use case, we asked whether there is a causal relationship between asthma attacks and the following features: sex, race, prescriptions for prednisone, diagnoses of obesity, residential proximity to a major roadway or highway, residential density, and exposure to high levels of airborne pollutants. These features were selected because published studies, including our prior work [4, 5], have recognized them to be associated with asthma attacks. We focused on an existing ICEES cohort of patients with asthma or related conditions (see [4] for details), and we considered the number of annual emergency department (ED) or inpatient visits for respiratory issues as the primary outcome measure and indicator of asthma attacks. We used the ICEES OpenAPI to extract features that might be causally related to each other and used the resultant multivariate table for causal inference modeling. Because EHR data are purely observational, we also demonstrate a way to perform simulated external intervention, given a known causal network, to help answer important questions about the effects of clinical interventions. We use subject matter expert knowledge and publication support as our ground truth to measure the correctness of our causal inference modeling. Finally, we discuss our findings, including the benefits and limitations of our causal inference model and approach.
Analysis of the multivariate ICEES table
We queried the ICEES OpenAPI to generate an eight-feature multivariate table. The multivariate table analysed in this work comprised data on 14,937 patients (rows represent individual patients in the asthma cohort) and eight ICEES feature variables per patient, namely TotalEDInpatientVisits, Sex, Race, Prednisone, Obesity, PM2.5Exposure, RoadwayExposure, and EstResidentialDensity, where TotalEDInpatientVisits is our primary outcome variable (Table 1). In Fig. 1, we plot bar charts to compare the number of TotalEDInpatientVisits among the discrete categories of each feature. We can see that the count for zero TotalEDInpatientVisits is the largest among all categories. Upon further analysis, we found that the multivariate data table extracted from the OpenAPI largely consisted of patients who were inactive in the year 2010. Hence, to avoid bias and reduce noise in our analysis, we removed patients who were not active in the year of interest, meaning that their EHR did not indicate any healthcare usage. To do so, we applied the “Active_In_Year” feature as a filter when extracting the multivariate table, with Active_In_Year = 1 selecting only patients who were active in 2010. We show the bar charts for the number of ED/inpatient visits for each feature in Fig. 1. We can observe that most patients who were active in 2010 visited the ED or an inpatient clinic only once. We can also see that there is an imbalance among the levels in some features, such as Prednisone, Obesity, Race, RoadwayExposure, and PM2.5Exposure.
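The Active_In_Year filtering step described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the rows are invented, and the helper names `filter_active` and `visit_counts` are hypothetical. Only the column names Active_In_Year and TotalEDInpatientVisits come from the text.

```python
from collections import Counter

# Invented rows standing in for the ICEES multivariate table (assumption);
# only the two column names are taken from the text.
rows = [
    {"Active_In_Year": 0, "TotalEDInpatientVisits": 0},
    {"Active_In_Year": 0, "TotalEDInpatientVisits": 0},
    {"Active_In_Year": 0, "TotalEDInpatientVisits": 0},
    {"Active_In_Year": 1, "TotalEDInpatientVisits": 1},
    {"Active_In_Year": 1, "TotalEDInpatientVisits": 1},
    {"Active_In_Year": 1, "TotalEDInpatientVisits": 2},
]


def filter_active(table):
    """Keep only patients flagged as active in the year of interest."""
    return [row for row in table if row["Active_In_Year"] == 1]


def visit_counts(table):
    """Tabulate how many patients fall into each visit-count category."""
    return Counter(row["TotalEDInpatientVisits"] for row in table)


before = visit_counts(rows)                 # dominated by inactive patients
after = visit_counts(filter_active(rows))   # active patients only
```

In this toy table the zero-visit category dominates before filtering and disappears afterwards, which mirrors the pattern the text reports for the real cohort.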
Fig. 1 Stacked bar chart representing the number of TotalEDInpatientVisits across each level of the feature variables. See Table 1 for feature variable
definitions
Fig. 2 Relative feature importance for all features with respect to TotalEDInpatientVisits. See Table 1 for feature variable definitions
“Drinking beer every day increases the chance of prostate cancer” are common in the news and scientific reporting and in our day-to-day personal beliefs. These associations can be easily mistaken for causation, making us susceptible to logical fallacies without knowing the real underlying cause. Causal inference is the science of learning cause from effect [1]. It is an important field of research because it helps us eradicate spurious correlation [7, 8]. The primary aim of inferring causal relations from data is to discover interactions between different entities in the form of $V_i \to V_j$, where $V_i$ and $V_j$ are observable features in the domain and the arrow indicates that the state of $V_i$ influences the state of $V_j$. Causal relations can be discovered either through observational measurements (seeing) or from measurements after performing some external manipulation/intervention (doing). A causal network [1, 9] can be represented with a directed acyclic graph (DAG) $G = (V, E)$, where $V = \{V_1, \ldots, V_n\}$ denotes the set of features and $E \subseteq V \times V$ denotes the set of edges that are causal in nature. For a causal edge $(V_i, V_j)$, we say that $V_i$ is a cause (parent) of $V_j$, and $V_j$ is the resulting effect (child) of $V_i$. Let $pa(V_i)$ denote the set of parents of $V_i$. The conditional probability distribution $P_i$ defines the probability of $V_i$ given the state of its parents $pa(V_i)$. A causal network represents a joint distribution $P$ over variables $V$ as long as it satisfies two main assumptions:
(a) Causal Markov assumption: Any given variable $V_i$ is independent of its non-descendants, conditioned on all of its direct causes (parents). This implies that the joint distribution $P(V)$ can be factored as: $P(V) = \prod_{i=1}^{n} P_i(V_i \mid pa(V_i))$.
(b) Faithfulness assumption: The joint distribution $P(V_1, \ldots, V_n)$ is faithful to $G$ if every conditional independence relation in the probability distribution $P$ is entailed by the Markov assumption applied to $G$ [10].
To reconstruct a causal graph from data, we generally start by finding an approximation of the graph, given $V$, and then optimize based on conditions on data. The two main approaches used for causal network inference are:
(1) Score-based: This approach is based on a Bayesian scoring function $S(G \mid D)$, which estimates the goodness-of-fit of graph $G$ to the data $D$ [11] and serves as the objective function to maximize, while favoring simpler structures. The score function is usually combined with a search heuristic that explores the space of all possible graphs. Score-based methods are robust and can be extended to include interventional studies (if available), but they are not scalable as network or data size increases.
(2) Constraint-based: This approach is based on estimating some of the conditional (in)dependencies in the distribution $P$ from the data $D$ by performing hypothesis tests of conditional independence. Constraint-based methods usually start with a fully connected, undirected graph and progressively remove edges whenever a new conditional independence relation is discovered, while satisfying the corresponding d-separation statements. In this work, we use a constraint-based approach called the PC algorithm, given that the dataset is observational.
To infer the causal graph from data, we learn the equivalence class of a directed acyclic graph (DAG) with the traditional constraint-based PC algorithm proposed by [9]. Given a dataset $D$ having $n$ features $V_1, \ldots, V_n$, we conduct the following steps. We start with a complete undirected graph over the $n$ features. We then eliminate edges between variables that are unconditionally independent. For each pair of variables $(V_i, V_j)$ with an edge between them, and for each variable $V_k$ with an edge connected to either of them, we eliminate the edge between $V_i$ and $V_j$ if $V_i \perp\!\!\!\perp V_j \mid V_k$. For each pair of variables $(V_i, V_j)$ having an edge between them, and for each pair of variables $V_k, V_l$ with edges both connected to $V_i$ or both connected to $V_j$, we eliminate the edge between $V_i$ and $V_j$ if $V_i \perp\!\!\!\perp V_j \mid V_k, V_l$. We continue to check independencies conditional on subsets of variables of increasing size until there are no more adjacent pairs $(V_i, V_j)$ for which there is a subset of variables of that size in which all of the variables are adjacent to $V_i$ or adjacent to $V_j$. For each triple of variables $(V_i, V_j, V_k)$ such that $V_i$ and $V_j$ are adjacent, $V_j$ and $V_k$ are adjacent, and $V_i$ and $V_k$ are not adjacent, we orient the edges $V_i - V_j - V_k$ as $V_i \to V_j \leftarrow V_k$ if $V_j$ is not in the conditioning set that rendered $V_i$ and $V_k$ independent and led to the elimination of the edge between them. We call such a triple of variables a v-structure. For each triple of variables such that $V_i \to V_j - V_k$, and $V_i$ and $V_k$ are not adjacent, we orient the edge $V_j - V_k$ as $V_j \to V_k$. This is called orientation propagation.
Results
Inferring causal graphs
We first applied the PC algorithm to the ICEES multivariate feature table. We show the inferred causal graphs, first using the entire table with all eight features (Fig. 3a) and second using only the top four important features with respect to TotalEDInpatientVisits (Fig. 3b), as determined in the Feature importance section. Expected relationships between features based on subject matter expertise
Fig. 3 Inferred causal graph. Solid black lines represent true positives, dashed lines represent false negatives, and red lines represent false positives
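The edge-elimination and v-structure steps of the PC algorithm described earlier can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation used in the study: the function names `pc_skeleton`, `orient_v_structures`, and `collider_oracle` are hypothetical, orientation propagation is omitted, and the statistical hypothesis test is replaced by a hand-written independence oracle for a three-variable collider X → Z ← Y.

```python
from itertools import combinations


def pc_skeleton(variables, is_independent):
    """Start from a complete undirected graph and remove the edge x - y
    whenever some set of x's other neighbours renders x and y independent."""
    adj = {v: set(variables) - {v} for v in variables}
    sepset = {}
    size = 0
    # Grow the conditioning-set size until no adjacent pair has enough
    # remaining neighbours to form a conditioning set of that size.
    while any(len(adj[x] - {y}) >= size for x in variables for y in adj[x]):
        for x in variables:
            for y in sorted(adj[x]):
                for cond in combinations(sorted(adj[x] - {y}), size):
                    if is_independent(x, y, set(cond)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(cond)
                        break
        size += 1
    return adj, sepset


def orient_v_structures(adj, sepset):
    """Orient x - z - y as x -> z <- y when x and y are non-adjacent and
    z is not in the set that separated them."""
    directed = set()
    for z in adj:
        for x, y in combinations(sorted(adj[z]), 2):
            if y not in adj[x] and z not in sepset.get(frozenset((x, y)), set()):
                directed.add((x, z))
                directed.add((y, z))
    return directed


def collider_oracle(x, y, cond):
    """Stand-in for a statistical test (assumption): in X -> Z <- Y the only
    independence is X _||_ Y, and it holds only when Z is not conditioned on."""
    return {x, y} == {"X", "Y"} and "Z" not in cond


adj, sepset = pc_skeleton(["X", "Y", "Z"], collider_oracle)
arrows = orient_v_structures(adj, sepset)
```

With real data, the oracle would be replaced by a conditional independence test on the discrete feature table, at which point the result depends on sample size and the chosen significance level.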
We conducted these three interventions on our learned causal network. To test Claim (a), we created a mutilated network by fixing the state of ObesityDx to 1, which means we are forcing ObesityDx to be present. For Claim (b), we fixed the state of Prednisone to be 1, again meaning that we are forcing prednisone to be present. For Claim (c), we fixed the state of Sex2 to be Male. Next, we compared the changes in the probability distribution of TotalEDInpatientVisits before and after these three ad hoc interventions to confirm the expected causal influences. We plotted the changes in the probability distribution of TotalEDInpatientVisits in Fig. 4. As expected, there were changes in the probability distribution of TotalEDInpatientVisits for interventions a and b, reflected in Fig. 4a and b, respectively. For intervention c, the changes before and after intervention were negligible, meaning that Sex2 had no causal effect on the frequency of TotalEDInpatientVisits.
Discussion
We demonstrated the ability to use the ICEES OpenAPI to answer important questions about causal relationships between factors affecting asthma attacks. We focused on a large cohort of patients with asthma or related conditions and a dataset that included data derived from EHRs and a variety of public sources of environmental exposures data. We applied PC analysis, a constraint-based causal learning algorithm, to the dataset and identified prednisone, race, and obesity as significant predictors of annual ED or inpatient visits for respiratory issues, followed by residential distance from a major roadway/highway, airborne particulate exposure, and sex. Of those, prednisone and obesity were found to be causally related to annual ED or inpatient visits in our causal inference model, and sex and race were found to be indirectly related to annual ED or inpatient visits via a causal relationship to obesity. On a smaller dataset, comprising only the four most important features, as determined by random forest analysis, we identified residential distance from a major roadway/highway as an additional variable that is causally related to annual ED or inpatient visits for respiratory issues.
We validated our findings based on expert knowledge and prior published literature. Most of our results are consistent with previously published literature [12]. For instance, prednisone, which is commonly prescribed for patients who are non-responsive to first-line treatments such as inhaled albuterol [13], has been identified as a factor associated with asthma exacerbations and ED or inpatient visits for respiratory issues [14]. Female sex, obesity, and African American race have previously been identified as factors that contribute to asthma attacks [15]. In another work by our group [5] and others [16], obesity and sex have been found to be highly related to asthma attacks. Several other works [3, 17] have additionally found a significant association between African American race and increased risk of asthma attacks. Exposure to major roadways or highways has also been found to be a risk factor for asthma. Several studies [18, 19] have demonstrated an increase in asthma attacks among patients residing in close proximity to a major roadway or highway. Our findings on the relationship between roadway exposures and asthma exacerbations have been inconsistent, with evidence to support [14] and negate [12] a relationship.
One factor that we expected to find in our model as causally related to asthma attacks, but did not, is exposure to airborne particulate matter. Exposure to airborne particulate matter is a well-established trigger for asthma attacks [4, 12, 14, 15, 20]. The failure to detect a causal relationship between exposure to airborne particulate matter and asthma attacks likely reflects the imbalance in the distribution of patients across bins. Indeed, we are actively refining both our exposure models and our binning strategy. For instance, instead of using a Python algorithm to bin the airborne pollutant exposures, we are considering a binning strategy based on subject matter expertise.
Fig. 4 Effect of intervention on (a) Obesity, (b) Prednisone and (c) Sex: change in the probability distribution of TotalEDInpatientVisits before (red)
and after (blue) intervention
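The simulated interventions summarized in Fig. 4 can be illustrated with a small worked example. The sketch below assumes a toy binary network (Sex → Obesity → Visits ← Prednisone) with invented probabilities. It is not the network learned in this study, and the names `PARENTS`, `CPTS`, `joint`, and `marginal` are hypothetical. The joint distribution is computed with the Markov factorization, and an intervention is simulated by dropping the factor of the clamped variable, which corresponds to the mutilated network.

```python
from itertools import product

# Toy network, all numbers invented for illustration (assumption):
# Sex -> Obesity -> Visits <- Prednisone, every variable binary (0/1).
PARENTS = {
    "Sex": (),
    "Prednisone": (),
    "Obesity": ("Sex",),
    "Visits": ("Obesity", "Prednisone"),
}
# CPTS[var][parent_values] = P(var = 1 | parents = parent_values)
CPTS = {
    "Sex": {(): 0.5},
    "Prednisone": {(): 0.3},
    "Obesity": {(0,): 0.2, (1,): 0.4},
    "Visits": {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.5},
}


def joint(assignment, do=None):
    """Joint probability from the Markov factorization. Under do(X = x),
    the factor for X is dropped and X is clamped to x."""
    do = do or {}
    prob = 1.0
    for var, parents in PARENTS.items():
        value = assignment[var]
        if var in do:
            if value != do[var]:
                return 0.0  # inconsistent with the intervention
            continue        # clamped variable contributes no factor
        p_one = CPTS[var][tuple(assignment[p] for p in parents)]
        prob *= p_one if value == 1 else 1.0 - p_one
    return prob


def marginal(target, do=None):
    """P(target = 1), optionally under an intervention, by enumeration."""
    names = list(PARENTS)
    total = 0.0
    for values in product((0, 1), repeat=len(names)):
        assignment = dict(zip(names, values))
        if assignment[target] == 1:
            total += joint(assignment, do)
    return total


baseline = marginal("Visits")
forced_obesity = marginal("Visits", do={"Obesity": 1})
forced_sex = marginal("Visits", do={"Sex": 1})
```

Comparing `baseline` with `forced_obesity` and `forced_sex` is the toy analogue of the before/after comparison in Fig. 4. Because the probabilities here are invented, the size of each shift says nothing about the study's results.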
References
1. Pearl J. Causality: models, reasoning, and inference. Econ Theory.
2003;19(675–685):46.
2. Rizzi DA. Causal reasoning and the diagnostic process. Theor Med.
1994;15(3):315–33.
3. Xu H, Cox S, Stillwell L, Pfaff E, Champion J, Ahalt SC, et al. FHIR PIT: an
open software application for spatiotemporal integration of clinical data
and environmental exposures data. BMC Med Inform Decis Making.
2020;20(1):1–8.
4. Fecho K, Pfaff E, Xu H, Champion J, Cox S, Stillwell L, et al. A novel
approach for exposing and sharing clinical data: the Translator Integrated
Clinical and Environmental Exposures Service. J Am Med Inform Assoc.
2019;26(10):1064–73.
5. Fecho K, Ahalt SC, Arunachalam S, Champion J, Chute CG, Davis S,
et al. Sex, obesity, diabetes, and exposure to particulate matter among
patients with severe asthma: Scientific insights from a comparative analy-
sis of open clinical data sources during a five-day hackathon. J Biomed
Inform. 2019;100:103325.
6. Kuhn M. Building predictive models in R using the caret package. J Stat
Softw. 2008;28(1):1–26.
7. Sinha M, Tadepalli P, Ramsey SA. Pooling vs Voting: An Empirical Study of
Learning Causal Structures. 2019.
8. Sinha M, Tadepalli P, Ramsey SA. Voting-based integration algorithm
improves causal network learning from interventional and observational