MLHOps: Machine Learning for Healthcare Operations
Abstract
Machine Learning Health Operations (MLHOps) is the combination of processes for
the reliable, efficient, usable, and ethical deployment and maintenance of machine
learning models in healthcare settings. This paper provides both a survey of work
in this area and guidelines for developers and clinicians to deploy and maintain
their own models in clinical practice. We cover the foundational concepts of general
machine learning operations and describe the initial setup of MLHOps pipelines
(including data sources, preparation, engineering, and tools). We then describe
long-term monitoring and updating (including data distribution shifts and model
updating) and ethical considerations (including bias, fairness, interpretability, and
privacy). This work therefore provides guidance across the full pipeline of MLHOps,
from conception to initial and ongoing deployment.
Keywords: MLOps, Healthcare, Responsible AI
1. Introduction
Over the last decade, efforts to use health data for solving complex medical
problems have increased significantly. Academic hospitals are increasingly
dedicating resources to bring machine learning (ML) to the bedside and to
addressing issues encountered by clinical staff. These resources are being uti-
lized across a range of applications including clinical decision support, early
warning, treatment recommendation, risk prediction, image informatics, tele-
diagnosis, drug discovery, and intelligent health knowledge systems.
When deployed successfully, data-driven models can free time for clinicians [109],
improve clinical outcomes [217], reduce costs [28], and provide improved qual-
ity care for patients. However, most studies remain preliminary, limited to
small datasets, and/or implemented in select health sub-systems. Integrat-
ing with clinical workflows remains crucial [278, 266] but, despite recent
computational advances and an explosion of health data, deploying ML in
healthcare responsibly and reliably faces several operational and engineering
challenges, including:
requires aspects specific to healthcare, best practices and concepts from other
application domains are also relevant. This summarizes the primary outcome
of our review, which is to provide a set of recommendations for implementing
MLHOps pipelines in practice – i.e., a “how-to” guide for practitioners.
2. Foundations of MLOps
2.1. What is MLOps?
Machine learning operations (MLOps) is a combination of tools, techniques,
standards, and engineering best practices to standardize ML system devel-
opment and operations [251]. It is used to streamline and automate the
deployment, monitoring, and maintenance of machine learning models, in
order to ensure they are robust, reliable, and easily updated or upgraded.
Recently, MLOps has become better defined and more widely implemented due
to its reusability and standardization benefits across various applications
[229]. As a result, the structure and definitions of its different components are
becoming well-established.
Figure 1: MLOps pipeline
either on the cloud or on-premise so that their functions are accessible
to multiple applications through remote function calls (i.e., application
programming interfaces (APIs)).
• Data query: This component queries the data, processes it, and stores
it in a format that models can easily utilize.
2.4. Levels of MLOps maturity
MLOps practices can be divided into different levels based on the maturity
of the ML system automation process [118, 251], as described below.
Ultimately, implementing MLOps leads to many benefits, including better
system quality, increased scalability, simplified management processes,
improved governance and compliance, cost savings, and improved
collaboration.
3. MLHOps Setup
Operationalizing ML models in healthcare is unlike doing so in other application
domains. Decisions made in clinical environments have a direct impact on
patient outcomes and, hence, the consequences of integrating ML models into
health systems need to be carefully controlled. For example, early warning
systems might enable clinicians to prescribe treatment plans with increased
lead time [109]; however, these systems might also suffer from a high false
alarm rate, which could result in alarm fatigue and possibly worse outcomes.
The requirements placed on such ML systems are therefore very high and,
if they are not adequately satisfied, the result is diminished adoption and
trust from clinical staff. Rigorous long-term evaluation is needed to validate
the efficacy and to identify and assess risks, and this evaluation needs to be
reported comprehensively and transparently [265].
While most MLOps best practices extend to healthcare settings, the data,
competencies, tools, and model evaluation differ significantly [179, 172, 255,
17]. For example, typical performance metrics (e.g., positive predictive value
and F1-scores) may differ between clinicians and engineers. Therefore, unlike
in other industries, it becomes necessary to evaluate physician experience
when predictions and model performance are presented to clinical staff [272].
In order to build trust in the clinical setting, the interpretability of ML
models is also exceptionally important. As more ML models are integrated
into hospitals, new legal frameworks and standards for evaluation need to be
adopted, and MLHOps tools need to comply with existing standards.
In the following sections, we explore the different components of MLHOps
pipelines.
3.1. Data
Successfully digitizing health data has resulted in a prodigious increase in the
volume and complexity of patient data collected [218]. These datasets are
now stored, maintained, and processed by hospital IT infrastructure systems
which in turn use specialized software systems.
3.1.1. Data sources
Health data can come from multiple sources, which can be categorized as follows:
Electronic health records (EHRs) record, analyze, and present information
to clinicians, including:
Other sources of health data include primary care data, wearable data (e.g.,
smartwatches), genomics data, video data, surveys, medical claims, billing
data, registry data, and other patient-generated data [216, 30, 45].
Figure 2: Sources and modalities of health data, including EHR data (demographics,
observations, interventions, medications, notes, billing, insurance, and administrative
data), imaging (X-rays, ultrasound, MRIs, CT scans, pathology), waveforms, wearables
and ambient sensors, immune sensors, pharmacological data, and omics data (genomics,
epigenomics, transcriptomics, metabolomics, microbiome).
Databases must also support scaling to large numbers of records which can
be processed concurrently. Hence, efficient storage systems along with com-
putational techniques are needed to facilitate analyses. One of the first steps
Figure 3: The hierarchy of standardization that common data models and open standards
for interoperability address. The lowest level concerns standardizing variable names,
such as lab test names, medications, and diagnosis codes, as well as the data types used
to store these variables (e.g., integer vs. character). The next level concerns abstract
concepts under which data can be mapped and grouped. The top level concerns data
exchange formats (e.g., JSON, XML) along with protocols for information exchange, such
as supported RESTful API architectures; this level addresses interoperability and how
data can be exchanged across sites and EHR systems.
can save time and effort and promote adoption. For OMOP, the ATLAS
tool [195] developed by Observational Health Data Sciences and Informatics
(OHDSI) provides such a feature through their web-based interactive analysis
platform.
The FHIR standard [31] is a leading open standard for exchanging health
data. FHIR is developed by Health Level 7 (HL7), a not-for-profit stan-
dards development organization that was established to develop standards
for hospital information systems. FHIR defines the key entities involved
in healthcare information exchange as resources, where each resource is a
distinct identifiable entity. FHIR also defines APIs which conform to the
representational state transfer (REST) architectural style for exchanging re-
sources, allowing for stateless Hypertext Transfer Protocol (HTTP) methods,
and exposing directory-structure-like URIs to resources. RESTful architectures
are lightweight interfaces that allow for faster transmission, which is
more suitable for mobile devices. RESTful interfaces also facilitate faster
development cycles because of their simple structure.
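As an illustration of this RESTful pattern, the sketch below retrieves a Patient resource and searches for related Observations over FHIR's HTTP API. It assumes Python with the `requests` library and a hypothetical FHIR R4 server base URL; real deployments will differ in endpoints and authentication.

```python
import requests

# Hypothetical FHIR R4 server; real systems expose their own base URL and authentication.
FHIR_BASE = "https://fhir.example-hospital.org/R4"
HEADERS = {"Accept": "application/fhir+json"}

# Retrieve a single Patient resource by logical id (directory-structure-like URI).
patient = requests.get(f"{FHIR_BASE}/Patient/123", headers=HEADERS)
patient.raise_for_status()
print(patient.json()["resourceType"])  # "Patient"

# Search for heart-rate Observations (LOINC 8867-4) for that patient via FHIR search parameters.
obs = requests.get(
    f"{FHIR_BASE}/Observation",
    params={"patient": "123", "code": "http://loinc.org|8867-4", "_count": 10},
    headers=HEADERS,
)
obs.raise_for_status()
for entry in obs.json().get("entry", []):  # the response is a FHIR Bundle
    resource = entry["resource"]
    print(resource.get("effectiveDateTime"), resource["valueQuantity"]["value"])
```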
3.1.4. Quality assurance and validation
Data collected in retrospective databases for analysis and ML use cases need
to be checked for quality and consistency. Data validation is an important
step towards ensuring that ML systems developed using the data are highly
performant, and do not incorporate biases from the data. Errors in data
propagate through the MLOps pipeline and hence specialized data quality
assurance tools and checks at various stages of the pipeline are necessary
[223]. A standardized data validation framework that includes i) data ele-
ment pre-processing, ii) checks for completeness, conformance, and plausi-
bility, and iii) a review process by clinicians and other stakeholders should
capture generalizable insight across various clinical investigations [238].
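As a minimal sketch of what such checks might look like in practice (assuming tabular lab results in a pandas DataFrame with hypothetical column names; the plausibility ranges below are illustrative only), the three categories of checks could be implemented as:

```python
import pandas as pd

def validate_labs(df: pd.DataFrame) -> dict:
    """Toy completeness/conformance/plausibility checks for a lab-results table."""
    report = {}

    # Completeness: fraction of missing values per required column.
    required = ["patient_id", "lab_name", "value", "unit", "collected_at"]
    report["missing_fraction"] = df[required].isna().mean().to_dict()

    # Conformance: values must be numeric and timestamps parseable.
    report["non_numeric_values"] = int(pd.to_numeric(df["value"], errors="coerce").isna().sum())
    report["bad_timestamps"] = int(pd.to_datetime(df["collected_at"], errors="coerce").isna().sum())

    # Plausibility: flag physiologically implausible values (illustrative bounds only).
    plausible = {"heart_rate": (20, 300), "sodium": (100, 180)}
    flags = {}
    for lab, (lo, hi) in plausible.items():
        vals = pd.to_numeric(df.loc[df["lab_name"] == lab, "value"], errors="coerce")
        flags[lab] = int(((vals < lo) | (vals > hi)).sum())
    report["implausible_counts"] = flags
    return report

# report = validate_labs(labs_df)  # review flagged records with clinicians before excluding them
```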
1. Cleaning: Formatting values, adjusting data types, checking and fix-
ing issues with raw data.
Multiple data sources such as EHR data, clinical notes and text, imaging
data, and genomics data can be processed independently to create features
and they can be combined to be used as inputs to ML models. Hence, com-
posing pipelines of these tasks facilitates component reusability [115]. Fur-
thermore, since the ML development life-cycle constitutes a chain of tasks,
the pipelining approach becomes even more desirable. Some of the high
level tasks in the MLHOps pipeline include feature creation, feature selec-
tion, model training, evaluation, and monitoring. Evaluating models across
different slices of data, hyper-parameters, and other confounding variables is
necessary for building trust.
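As a small illustration of the pipelining idea and slice-based evaluation (a sketch assuming scikit-learn, a numeric feature DataFrame, and a hypothetical `age_group` column used only for slicing):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Reusable chain of steps: imputation -> scaling -> model; each step can be swapped independently.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
# pipe.fit(X_train, y_train)  # X_train: numeric feature DataFrame, y_train: labels

def evaluate_by_slice(pipe, X: pd.DataFrame, y: pd.Series, slice_col: str) -> dict:
    """AUROC per data slice (e.g., age group) to surface subgroups with degraded performance."""
    scores = {}
    features = X.drop(columns=[slice_col])  # the slicing column is not a model input here
    for group, frame in X.groupby(slice_col):
        proba = pipe.predict_proba(features.loc[frame.index])[:, 1]
        # Slices containing a single outcome class would need to be skipped in practice.
        scores[group] = roc_auc_score(y.loc[frame.index], proba)
    return scores

# slice_scores = evaluate_by_slice(pipe, X_test, y_test, slice_col="age_group")
```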
Table 7 lists popular open-source tools and packages specific to health data
and ML processing. These tools are at different stages of development and
maturity. Some examples of popular tools include MIMIC-Extract [273],
Clairvoyance [115] and CheXstray [245].
3.3. Modelling
At this stage, the data has been collected, cleaned, and curated, ready to be
fed to the ML model to accomplish the desired task. The modelling phase in-
volves choosing the available models that fit the problem, training & testing
the models, and choosing the model with the best performance & reliabil-
ity guarantees. Given the existence of numerous surveys summarizing
machine learning and deep learning algorithms for general healthcare scenar-
ios [74, 1], as well as specific use cases such as brain tumor detection [18],
COVID-19 prevention [26], and clinical text representation [127], we omit this
discussion and let the reader explore the surveys relevant to their prediction
problem.
supporting the teams and governing partnerships with collaborators
from other health organizations.
AI model. Broadly, these guidelines suggest inclusion of the following criteria
[65]:
Table 1: MLOps tools

Category | Description | Tooling Examples
Model metadata storage and management | Section 3.1 | MLFlow, Comet, Neptune
Data and pipeline versioning | Section 3.2 | DVC, Pachyderm
Model deployment and serving | Section 3.3 | DEPLOYR [59], Flyte, ZenML
Production model monitoring | Section 4 | MetaFlow, Kedro, Seldon Core
Run orchestration and workflow pipelines | Orchestrating the execution of preprocessing, training, and evaluation pipelines (Sections 3.4 & 3.5) | Kubeflow, Polyaxon, MLRun
Collaboration tool | Setting up an MLOps pipeline requires collaboration between different teams | ChatOps, Slack, Trello, GitLab, Rocket Chat
4. MLHOps Monitoring and Updating
Once an MLHOps pipeline and required resources are set up and deployed,
robust monitoring protocols are crucial to the safety and longevity of clinical
AI systems. For example, inevitable updates to a model can introduce var-
ious operational issues (and vice versa), including bias (e.g., a new hospital
policy that shifts the nature of new data) and new classes (e.g., new subtypes
in a disease classifier) [287]. Incorporating expert labels can improve model
performance; however, the time, cost, and expertise required to acquire ac-
curate labels for very large imaging datasets like those used in radiology- or
histology-based classifiers makes this difficult [138].
a low positive predictive value (PPV). Moreover, clinical datasets are often
imbalanced, consisting of far fewer positive instances of a label than negative
ones. As a result, measures like accuracy that weigh positive and negative
labels equally can be detrimental to monitoring. For instance, in the context
of disease classification it may be particularly important to monitor sensitivity,
whereas in more time-sensitive clinical scenarios like the intensive care unit
(ICU), false positives (FP) can have critical outcomes [20].
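As a toy sketch of why accuracy alone can be misleading under class imbalance (assuming scikit-learn; the labels and predictions below are synthetic):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])            # imbalanced toy labels
y_pred = np.array([0, 0, 0, 0, 0, 1, 0, 0, 1, 0])            # toy model predictions
y_prob = np.array([.1, .2, .1, .3, .2, .6, .1, .2, .8, .4])  # toy predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
ppv = precision_score(y_true, y_pred)        # tp / (tp + fp)
sensitivity = recall_score(y_true, y_pred)   # tp / (tp + fn)
accuracy = (tp + tn) / len(y_true)           # can look deceptively high under imbalance
auroc = roc_auc_score(y_true, y_prob)

print(f"accuracy={accuracy:.2f} PPV={ppv:.2f} sensitivity={sensitivity:.2f} AUROC={auroc:.2f}")
```

Here accuracy looks acceptable even though half of the positive cases are missed, which is exactly the failure mode that monitoring PPV and sensitivity separately is meant to expose.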
around a health system, and new public health and immigration poli-
cies. Distribution shifts due to demographic differences can dispropor-
tionately deteriorate model performance in specific patient populations.
For instance, although Black women are more likely to develop breast
tumours with poor prognosis, many breast mammography ML classi-
fiers experience deterioration in performance on this patient population
[284]. Similarly, skin-lesion classifiers trained primarily on images of
lighter skin tones may show decreased performance when evaluated on
images of darker skin tones [9, 69].
• Technology - Data shifts can be attributed to changes in technology
between institutions or over time. This includes chest X-ray classifiers
trained on portable radiographs that are evaluated on stationary radiographs,
or deterioration of clinical AI systems across EHR systems
(e.g., Philips CareVue vs. MetaVision) [188].
Although evaluated differently, data shifts are present across various modal-
ities of clinical data such as medical images [98] and EHR data [70, 201].
In order to effectively prevent these malignant shifts from occurring, it is
necessary to perform prospective evaluation of clinical AI systems [303] in
order to understand the circumstances under which they arise, and to design
strategies that mitigate model biases and improve models for future itera-
tions [290]. Broadly, these data shifts can be categorized into three groups
which can co-occur or lead to one another:
perform feature shift detection between training and deployment data and
provide users with summary statistics (Table 4.4). It is also possible to de-
tect feature shift while conditioning on the other features in a model using
conditional distribution tests [135].
Dataset Shift Detection: Dataset shift refers to the change in the joint
distribution between the source and target data for a group of input features.
Multivariate testing is crucial because input to ML models typically consists
of more than one variable and multiple modalities. In order to test whether
the distribution of the target data has drifted from the source data two
main approaches exist: 1) two-sample testing and 2) classifiers. These
approaches often work better on low-dimensional data compared to high-
dimensional data, therefore dimensionality reduction is typically applied first
[215]. For instance, variational autoencoders (VAE) have been used to reduce
chest X-ray images to a low-dimensional space prior to two-sample testing
[245]. In the context of medical images, including chest X-rays [211] [289],
diabetic retinopathies [41], and histology slides [246], classifier methods have
proven effective. For EHR data, dimensionality reduction using clinically
meaningful patient representations has improved model performance [188].
For clinically relevant drift detection, it is important to ensure that drift
metrics correlate well with ground truth performance differences.
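A minimal sketch of this two-step pattern is shown below, with PCA standing in for the dimensionality-reduction step (the cited works use, e.g., VAEs or clinically meaningful representations) and per-component Kolmogorov-Smirnov tests with a Bonferroni correction as the two-sample test; function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.decomposition import PCA

def detect_dataset_shift(X_source: np.ndarray, X_target: np.ndarray,
                         n_components: int = 5, alpha: float = 0.05) -> bool:
    """Reduce both samples with PCA fit on source data, then run per-component KS tests."""
    pca = PCA(n_components=n_components).fit(X_source)
    src, tgt = pca.transform(X_source), pca.transform(X_target)

    # Bonferroni-corrected significance threshold across the reduced components.
    p_values = [ks_2samp(src[:, i], tgt[:, i]).pvalue for i in range(n_components)]
    return min(p_values) < alpha / n_components

rng = np.random.default_rng(0)
X_src = rng.normal(size=(500, 20))
X_tgt = rng.normal(loc=0.5, size=(500, 20))   # synthetic shifted "deployment" data
print(detect_dataset_shift(X_src, X_tgt))      # likely True for this synthetic shift
```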
Method | Shift | Test Type
L-infinity distance | Feature (c) | 2-ST
Cramér-von Mises | Feature (c) | 2-ST
Fisher's Exact Test | Feature (c) | 2-ST
Chi-Squared Test | Feature (c) | 2-ST
Jensen-Shannon divergence | Feature (n) | 2-ST
Kolmogorov-Smirnov [174] | Feature (n) | 2-ST
Feature Shift Detector [135] | Feature | Model
Maximum Mean Discrepancy (MMD) [93] | Dataset | 2-ST
Least Squares Density Difference [37] | Dataset | 2-ST
Learned Kernel MMD [155] | Dataset | 2-ST
Context Aware MMD [56] | Dataset | 2-ST
MMD Aggregated [236] | Dataset | 2-ST
Classifier [161] | Dataset | Model
Spot-the-diff [117] | Dataset | Model
Model Uncertainty [240] | Dataset | Model
Mahalanobis distance [222] | Dataset | Model
Gram matrices [202, 234] | Dataset | Model
Energy Based Test [157] | Dataset | Model
H-Divergence [299] | Dataset | Model
4.3.3. Label Shift
Label shift is a difference in the distribution of the outcome (class) variable
between the source and target data. It may appear when some concepts are
under-sampled or over-sampled in the target domain compared to the source
domain: class proportions differ between the source and target, but the
feature distributions within each class do not. For
instance, in the context of disease diagnosis, a classifier trained to predict
disease occurrence is subject to drift due to changes in the baseline preva-
lence of the disease across various populations.
Label Shift Detection: Label shift can be detected using moment matching-
based estimator methods that leverage model predictions like Black Box
Shift Estimation (BBSE) [151] and Regularized Learning under Label Shift
(RLLS) [22]. Assuming access to a classifier that outputs the true source-distribution
conditional probabilities p_s(y|x), Expectation Maximization (EM)
algorithms like Maximum Likelihood Label Shift (MLLS) can also be used to
detect label shift [87]. Furthermore, methods using bias-corrected calibration
show promise in correcting label shift [14].
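As a simplified illustration of the moment-matching idea behind estimators such as BBSE (not the authors' implementation), importance weights w(y) = q(y)/p_s(y) can be estimated by solving a linear system built from a source confusion matrix and the predicted-label distribution on target data:

```python
import numpy as np

def estimate_label_shift_weights(y_val, y_val_pred, y_target_pred, n_classes):
    """Moment matching in the style of BBSE: solve C w = mu for importance weights w."""
    # C[i, j] = P_source(predicted = i, true = j), estimated on held-out source data.
    C = np.zeros((n_classes, n_classes))
    for t, p in zip(y_val, y_val_pred):
        C[p, t] += 1
    C /= len(y_val)

    # mu[i] = P_target(predicted = i), estimated from hard predictions on target data.
    mu = np.bincount(y_target_pred, minlength=n_classes) / len(y_target_pred)

    # w(y) = q(y) / p_s(y); assumes C is invertible, clip to remove negative noise.
    w = np.linalg.solve(C, mu)
    return np.clip(w, 0.0, None)

# Example: source prevalence of 10% positives vs. target prevalence of 30%
# would ideally yield w ≈ [0.78, 3.0] for classes [0, 1].
```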
Name of tool | Capabilities
Evidently | Interactive reports to analyze ML models during validation or production monitoring.
NannyML | Performance estimation and monitoring, data drift detection, and intelligent alerting for deployment.
River [185] | Online metrics, drift detection, and outlier detection for streaming data.
SeldonCore [262] | Serving, monitoring, explaining, and management of models using advanced metrics, explainers, and outlier detection.
TFX | Explore and validate data used for machine learning models.
TorchDrift | Covariate and concept drift detection.
deepchecks [54] | Testing for continuous validation of ML models and data.
EHR OOD Detection [258] | Uncertainty estimation, OOD detection, and (deep) generative modelling for EHRs.
Avalanche [160] | Prototyping, training, and reproducible evaluation of continual learning algorithms.
Giskard | Evaluation, monitoring, and drift testing.

Table 3: List of open-source tools available on Github that can be used for ML Monitoring
and Updating
4.4.3. Frequency of Model Updates
In practice, retraining procedures for clinical AI models have generally been
locked after FDA approval [140] or confined to ad-hoc one-time updates [261,
104]. The timing of when it is necessary to update or retrain a model varies
across use cases. As a result, it is imperative to evaluate the appropriate
frequency at which to update a model. Strategies employed include:
i) Periodic training on a regular schedule (e.g., weekly, monthly);
ii) Performance-based triggers in response to a statistically significant change
in performance (see the sketch below);
iii) Data-based triggers in response to a statistically significant data distribution
shift;
iv) Retraining on demand, which is not based on a trigger or regular schedule
and is instead initiated based on user prompts.
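A toy sketch of a performance-based trigger (assuming scikit-learn, recently labelled data, and a baseline AUROC fixed at deployment; the threshold is illustrative):

```python
from sklearn.metrics import roc_auc_score

BASELINE_AUROC = 0.85   # measured at deployment time (illustrative)
TOLERANCE = 0.05        # allowed absolute drop before retraining is triggered

def should_retrain(y_true_recent, y_prob_recent) -> bool:
    """Trigger retraining when recent performance falls meaningfully below baseline."""
    recent_auroc = roc_auc_score(y_true_recent, y_prob_recent)
    return recent_auroc < BASELINE_AUROC - TOLERANCE
```

In practice, a statistical test across several recent batches, or a data-based trigger driven by the drift detectors discussed above, would typically replace a single fixed threshold.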
(LWF); and 3) Replay-based approaches that retain some samples from
the previous tasks and use them for training or as constraints to reduce for-
getting, e.g., episodic representation replay (ERR) [66]. Evaluation of several
continual learning methods on ICU data across a large sequence of tasks
indicates that replay-based methods achieve more stable long-term performance
compared to regularization- and rehearsal-based methods [19]. In the context
of chest X-ray classification, Joint Training (JT) has demonstrated superior
model performance, with LWF as a promising alternative in the event that
training data is unavailable at deployment [141]. For sepsis prediction using
EHR data, a joint framework leveraging EWC and ERR has been proposed
[16]. More recently, continual model editing strategies have shown promise
in overcoming the limitations of continual fine-tuning methods by updating
model behavior with minimal influence on unrelated inputs and maintaining
upstream test performance [105].
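As a minimal sketch of the replay idea (a toy reservoir-sampled buffer mixed into each new training batch; this is an illustration, not the exact implementation of any of the cited methods):

```python
import random

class ReplayBuffer:
    """Keep a bounded sample of past (x, y) pairs and mix them into new training batches."""
    def __init__(self, capacity: int = 1000, seed: int = 0):
        self.capacity, self.buffer = capacity, []
        self.rng = random.Random(seed)
        self.seen = 0

    def add(self, example):
        # Reservoir sampling: every example seen so far has equal probability of being kept.
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = example

    def sample(self, k: int):
        return self.rng.sample(self.buffer, min(k, len(self.buffer)))

# During continual training on a new task (pseudocode usage):
# batch = new_task_batch + replay.sample(len(new_task_batch))  # mitigate forgetting
# model.train_step(batch)
# for ex in new_task_batch: replay.add(ex)
```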
minimization outperforms domain generalization and unsupervised do-
main adaptation methods [97] [294].
5. Responsible MLHOps
AI has surged in healthcare, partly out of necessity [290, 199], but many
issues still exist. For instance, many sources of bias exist in clinical data,
large models are opaque, and there are malicious actors who may damage or
pollute AI/ML systems. In response, responsible AI and trustworthiness
have together become a growing area of study [176, 264]. Responsible AI, or
trustworthy MLOps, is defined as an ML pipeline that is fair and unbiased,
explainable and interpretable, secure, private, reliable, robust, and resilient
to attacks. In healthcare, trust is critical to ensuring a meaningful relation-
ship between the healthcare provider and patient [63]. In this section, we
discuss components of responsible and trustworthy AI [142], which can be
applied to the MLHOps pipeline. In Section 5.1, we review the main con-
cepts of responsible AI and in Section 5.2 we explore how these concepts can
be embedded in the MLHOps pipeline to enable safe deployment of clinical
AI systems.
7. Ordeal: A patient may have to face an ordeal (i.e., go through painful
procedures) in order to be rescued.
While some of these criteria relate to the humanity of the healthcare provider,
others relate to the following topics in ML models:
We discuss these concepts further in Sections 5.1.1, 5.1.2, 5.1.3 and 5.1.4.
5.1.1.1. Causes
A lack of fairness in clinical AI systems may be a result of various contributing
causes:
• Objective:
• Data:
25: from https://clinicalcenter.nih.gov/about/welcome/faq.html
at the early stages of diagnosis. Moreover, as a specialized hos-
pital, patient admission is selective and chosen solely by institute
physicians based on whether they have an illness being studied by the
given institute 26. Such a dataset will not contain the diversity
of disease cases that might be seen in hospitals specialized across
different diseases, or account for patients visiting for routine treat-
ment services at general hospitals.
– Insufficient sample size: Insufficient sample sizes of under-
represented groups can also result in unfairness [89]. For instance,
patients of low socioeconomic status may use healthcare services
less, which reduces their sample size in the overall dataset, re-
sulting in an unfair model [294, 38, 49]. In another instance, an
algorithm that can classify skin cancer [73] with high accuracy will
not be able to generalize to different skin colours if similar samples
have not been represented sufficiently in the training data [38].
– Missing essential representative features: Sometimes, es-
sential representative features are missed or not collected during
the dataset curation process, which prohibits downstream fairness
analyses. For instance, if the patient’s race has not been recorded,
it is not possible to analyze whether a model trained on that data
is fair with respect to that race [242]. Failure to include sensitive
features can generate discrimination and reduce transparency [48].
• Labels:
26: from https://clinicalcenter.nih.gov/about/welcome/faq.html
– Bias of automatic labeling: Due to the high cost and labour-
intensive process of acquiring labels for healthcare data, there has
been a shift away from hand-labelled data, towards automatic
labelling [39, 113, 120]. For instance, instead of expert-labeled
radiology images, natural language processing (NLP) techniques
are applied to radiology reports in order to extract labels. This
presents concerns as these techniques have shown racial biases,
even after they have been trained on clinical notes [295]. There-
fore, using NLP techniques for automatic labeling may sometimes
amplify the overall bias of the labels [242].
• Resources:
5.1.1.2. Evaluation
To evaluate the fairness of a model, we need to decide which fairness metric
to use and what sensitive attributes to consider in our analysis.
sensitive attributes such as age [241, 242], socioeconomic status, [241,
242, 295], and spoken language [295] are also important to consider.
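As a small sketch of such an evaluation (assuming the open-source Fairlearn library [33], synthetic labels and predictions, and a single sensitive attribute):

```python
import numpy as np
from fairlearn.metrics import MetricFrame, demographic_parity_difference
from sklearn.metrics import recall_score

y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])  # sensitive attribute (synthetic)

# Sensitivity (recall) broken down by group: large gaps indicate unequal error rates.
mf = MetricFrame(metrics=recall_score, y_true=y_true, y_pred=y_pred, sensitive_features=group)
print(mf.by_group, mf.difference())

# Demographic parity difference: gap in positive-prediction rates between groups.
print(demographic_parity_difference(y_true, y_pred, sensitive_features=group))
```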
• Reliability & robustness: Interpretability can help in auditing ML
models, further increasing model reliability.
• Model-based
27: https://www.tensorflow.org/lattice
the model’s internal structure to analyze the impact of features,
for example.
– Model-agnostic: Interpretability is not restricted to a specific
machine learning model and can be applied more generally across
model types.
• Complexity-based
• Scope-based
• Methodology-based approach
different medical conditions including cardiovascular diseases, eye diseases,
cancer, influenza, infection, COVID-19, depression, and autism. Similarly,
Meng et al. [179] performed an interpretability analysis of deep learning mortality
prediction models and a fairness analysis on the MIMIC-III dataset [119], showing
connections between interpretability methods and fairness metrics.
5.1.3.2. Types of threats
Violation of privacy & security can occur either due to human error (uninten-
tional or non-malicious) or an adversarial attack (intentional or malicious).
1. Human error: Human error can cause data leakage through the care-
lessness or incompetence of authorized individuals. Most of the litera-
ture in this context [148, 75] divides human error into two types:
(a) Slip: the wrong execution of correct, intended actions; e.g., in-
correct data entry, forgetting to secure the data, giving access of
information to unauthorized persons using the wrong email ad-
dress.
(b) Mistake: the correct execution of incorrect, intended actions;
e.g., collecting data that is not required, using the same password
for different systems to avoid password recovery, giving access of
information to unauthorized persons assuming they can have ac-
cess.
While people dealing with data should be trained to avoid such negli-
gence, some researchers have suggested policies, frameworks, and strate-
gies such as error avoidance, error interception, or error correction to
prevent or mitigate these issues [148, 75].
2. Adversarial attacks: A primary risk for any digital data or system is
from adversarial attackers [99] who can damage, pollute, or leak infor-
mation from the system. An adversarial attacker can attack in many
ways; e.g., they can be remote or physically present, they can access
the system through a third-party device, or they can be personified as
a patient [189]. The most common types of attacks are listed below.
• Data modification: Maliciously modifying data.
• Information leakage: Retrieving sensitive information from the
system.
Data Protection Directive in the EU [280]. These acts mainly aim at protecting
patient data from being shared or used without patient consent, while still
allowing patients to access their own data.
These aspects have been studied in the healthcare domain [181, 213] and
different approaches such as interpretability, security, privacy, and methods
to deal with data shift (discussed in Sections 5.1.2 and 5.1.3) have been sug-
gested.
5.2.1. Data
The process of a responsible and trustworthy MLOps pipeline starts with
data collection and preparation. The impact of biased or polluted data prop-
agates through all the subsequent steps of the pipeline [82]. This can be even
more important and challenging in the healthcare domain due to the privacy
and sensitivity of the data [21]. If compromised, this information can be
tampered with or misused in various ways (e.g., identity theft, information sold to
a third party) and can introduce bias into the healthcare system. Such challenges
can also cause economic harm (such as job loss), psychological harm (e.g.,
causing embarrassment due to a medical issue), and social isolation (e.g.,
due to a serious illness such as HIV) [187, 4]. It can also impact ML model
performance and trustworthiness [50].
5.2.1.1. Data collection
In healthcare, data can be acquired through multiple sources [257], which
increases the chance of the data being polluted by bias. Bias can concern, for
example, race [284], gender, sexual orientation, gender identity, and disability.
Bias in healthcare data can be mitigated by increasing diversity in the
data, e.g., by including underrepresented minorities (URMs), which can lead
to better outcomes [169]. Debiasing during data collection can include:
(a) Racial bias e.g., Black, Hispanic, and Native American physi-
cians are underrepresented [197]. According to one study, white
males from the upper classes are preferred by the admission com-
mittees [42] (although some other sources suggest the opposite28 ).
28: https://applymd.utoronto.ca/admission-stats
(b) Gender bias: e.g., professional women in healthcare are less
likely to be invited to give talks [177] or to be introduced using
professional titles [77], and are more likely to experience harassment
or exclusion, to receive insufficient support at work, to face negative
comparisons with male colleagues, and to be perceived as weak &
less competitive [150, 252].
(c) Gender minority bias: e.g., LGBTQ people receive lower quality
healthcare [226] and face challenges in getting jobs in healthcare
[232].
(d) Disability bias: e.g., people with disabilities receive limited
accessibility support at facilities and have to work harder to
feel validated or recognized [175].
2. Debiasing during data collection and annotation:
In addition to human factors, we can take steps to improve the data
collection process itself. In this regard, the following measures can be
taken [156]:
vices (e.g., smartwatches, skin-based sensors), body area networks
(e.g., EEG sensors, blood pressure sensors), tele-healthcare (e.g.,
tele-monitoring, tele-treatment), digital healthcare systems (e.g.,
electronic health records (EHR), electronic medical records (EMR)),
and health analytics (e.g., medical big-data). While the digitiza-
tion of healthcare has improved access to medical facilities, it has
increased the risk of data leakage and malicious attacks. Extra
care should be taken while designing an MLOps pipeline to avoid
privacy and security risks, as breaches can lead to serious, even life-threatening,
consequences. Other issues include the number of people involved
in using the data and proper storage for high volumes of data.
Chaudhry et al. [45] proposed an AI-based framework using 6G-
networks for secure data exchange in digital healthcare devices. In
the past decade, the blockchain has also emerged as a way of ensur-
ing data privacy and security. Blockchain is a distributed database
with unique characteristics such as immutability, decentralization,
and transparency. This is especially relevant in healthcare because
of security and privacy issues [101, 286, 190]. Using blockchain can
help in more efficient and secure management of patients' health
records, transparency, identification of false content, patient mon-
itoring, and maintaining financial statements [101].
(e) Data-sheet: Often, creating a dataset that represents the full
diversity of a population is not feasible, especially for very multi-
cultural societies. Additionally, the prevalence of diseases among
different sub-populations may be different [242]. If it is not pos-
sible to build an ideal dataset with the above specifications, the
data needs to be delivered with a data-sheet. The data-sheet is
meta-data that helps to analyze and specify the characteristics of
the data, clearly explain exclusion and inclusion criteria, detail
demographic features of the patients, and provide statistics of the
data distribution over sub-populations, labels, and features.
and extracted into a project-specific data store. After this, a three-step
framework is applied: (1) use different measures for data pre-processing
to ensure the correctness of all data elements (e.g., converting each lab
measurement to the same unit), (2) ensure completeness, conformance,
plausibility, and possible data shifts, and (3) adjudicate the data with
the clinicians.
29: https://www.openaire.eu/item/amnesia-data-anonymization-made-easy
30: https://realrolfje.github.io/anonimatron/
detect race by removing that part.
5.2.2. Methodology
The following sections overview the steps to put these concepts into practice.
1. Pre-processing
to sensitive attributes as well as the sensitive attribute itself [122],
or learning representations that are relatively invariant to the sensitive
attribute [162]. One might also adjust representation rates of protected
groups to achieve target fairness metrics [44], or utilize
optimization to learn a data transformation that reduces discrimination
[40] (a reweighing sketch is given after this list).
2. In-processing
173]. However, all these methods require knowledge of each instance's
membership in sensitive groups. There are also group-unaware methods
that weight each sample with an adversary that tries to maximize the
weighted loss [136], or that train an additional classifier that up-weights
samples classified incorrectly in the last training step [154].
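As a sketch of the pre-processing route mentioned in item 1, a reweighing scheme in the spirit of [122] assigns each (group, label) cell the weight that would make the sensitive attribute and the label statistically independent (a simplified illustration assuming a single categorical sensitive attribute):

```python
import numpy as np

def reweighing_weights(y: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Weight each (group, label) cell so the label and sensitive attribute look independent."""
    weights = np.empty(len(y), dtype=float)
    for group in np.unique(s):
        for label in np.unique(y):
            mask = (s == group) & (y == label)
            expected = np.mean(s == group) * np.mean(y == label)  # if independent
            observed = mask.mean()
            weights[mask] = expected / observed if observed > 0 else 0.0
    return weights

# Example usage: pass as sample weights to most classifiers, e.g.
# LogisticRegression().fit(X, y, sample_weight=reweighing_weights(y, s))
```

The resulting weights can be passed as sample weights to most classifiers during training.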
There are software tools and libraries for algorithmic fairness checking, listed
in [282], which developers and end users can apply to evaluate the fairness of
AI model outcomes.
6. Concluding remarks
Machine learning (ML) has been applied to many clinically relevant tasks
and datasets in the research domain but, to fully realize the
promise of ML in healthcare, practical considerations that are not typically
necessary or even common in the research community must be carefully de-
signed and adhered to. We have provided a deep survey into a breadth
of these ML considerations, including infrastructure, human resources, data
sources, model deployment, monitoring and updating, bias, interpretability,
privacy and security.
7. Appendix
Table 4: List of open-source tools available on Github that can be used for ML system
development specific to health.
Table 5: Key Roles in an MLOps Team

Role | Alternatively | Description
Domain Expert | Business Translator, Business Stakeholder, PO/Manager | An instrumental role in any phase of the MLOps process where a deeper understanding of the data and the domain is required.
Solution Architect | IT Architect, ML Architect | Unifying the work of data scientists, data engineers, and software developers through developing strategies for MLOps processes, defining the project lifecycle, identifying the best tools, and assembling the team of engineers and developers to work on projects.
Data Scientist | ML Specialist, ML Developer | A central player in any MLOps team, responsible for creating the data and ML model pipelines. The pipelines include analysing and processing the data as well as building and testing the ML models.
Data Engineer | DataOps Engineer, Data Analyst | Working in coordination with the product manager and domain expert to uncover insights from data through data ingestion pipelines.
Software Developer | Full-stack engineer | Focusing on the productionizing of ML models and the supporting infrastructure based on the ML architect's blueprints. They standardize the code for compatibility and reusability.
DevOps Engineer | CI/CD Engineer | Facilitating access to the specialized tools and high-performance computing infrastructure, enabling the transition from development to deployment and monitoring, and automating the ML lifecycle.
ML Engineer | MLOps Engineer | Highly skilled programmers supporting designing and deploying ML models in close collaboration with Data Scientists and DevOps Engineers.
References
[1] Abdullah A Abdullah, Masoud M Hassan, and Yaseen T Mustafa. A
review on bayesian deep learning in healthcare: Applications and chal-
lenges. IEEE Access, 2022.
[2] Talal AA Abdullah, Mohd Soperi Mohd Zahid, and Waleed Ali. A
review of interpretable ml in healthcare: Taxonomy, applications, chal-
lenges, and future directions. Symmetry, 13(12):2439, 2021.
[3] Adnan Ahmed Abi Sen and Abdullah M Basahel. A comparative study
between security and privacy. In 2019 6th International Conference on
Computing for Sustainable Global Development (INDIACom), pages
1282–1286. IEEE, 2019.
[4] Karim Abouelmehdi, Abderrahim Beni-Hessane, and Hayat Khaloufi.
Big healthcare data: preserving security and privacy. Journal of big
data, 5(1):1–18, 2018.
[5] Karim Abouelmehdi, Abderrahim Beni-Hssane, Hayat Khaloufi, and
Mostafa Saadi. Big data security and privacy in healthcare: A review.
Procedia Computer Science, 113:73–80, 2017.
[6] George A. Adam, Chun-Hao K. Chang, Benjamin Haibe-Kains, and
Anna Goldenberg. Hidden risks of machine learning applied to health-
care: Unintended feedback loops between models and future data caus-
ing model degradation. Proceedings of Machine Learning Research,
(126):710–731, 2020.
[7] George A. Adam, Chun-Hao K. Chang, Benjamin Haibe-Kains, and
Anna Goldenberg. Hidden risks of machine learning applied to health-
care: Unintended feedback loops between models and future data caus-
ing model degradation. Proceedings of Machine Learning Research,
(182):1–26, 2022.
[8] Roy Adams, Katharine E Henry, Anirudh Sridharan, Hossein
Soleimani, Andong Zhan, Nishi Rawat, Lauren Johnson, David N
Hager, Sara E Cosgrove, Andrew Markowski, et al. Prospective, multi-
site study of patient outcomes after implementation of the trews ma-
chine learning-based early warning system for sepsis. Nature medicine,
pages 1–6, 2022.
[9] Adewole S. Adamson and Avery Smith. Machine learning and health
care disparities in dermatology. JAMA Dermatology, 154(11):1247–
1248, 2018.
[10] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz
Hardt, and Been Kim. Sanity checks for saliency maps. Advances in
neural information processing systems, 31, 2018.
[12] Philip Adler, Casey Falk, Sorelle A Friedler, Tionney Nix, Gabriel Ry-
beck, Carlos Scheidegger, Brandon Smith, and Suresh Venkatasubra-
manian. Auditing black-box models for indirect influence. Knowledge
and Information Systems, 54:95–122, 2018.
[13] Yongsu Ahn and Yu-Ru Lin. Fairsight: Visual analytics for fairness
in decision making. IEEE transactions on visualization and computer
graphics, 26(1):1086–1095, 2019.
[15] Emily Alsentzer, John R Murphy, Willie Boag, Wei-Hung Weng, Di Jin,
Tristan Naumann, and Matthew McDermott. Publicly available clinical
bert embeddings. arXiv preprint arXiv:1904.03323, 2019.
[19] Jacob Armstrong and David A Clifton. Continual learning of longitu-
dinal health records. In 2022 IEEE-EMBS International Conference on
Biomedical and Health Informatics (BHI), pages 01–06. IEEE, 2022.
[20] Anand Avati, Martin Seneviratne, Emily Xue, Zhen Xu, Balaji Lak-
shminarayanan, and Andrew M. Dai. Beds-bench: Behavior of
ehr-models under distributional shift–a benchmark. arXiv preprint
arXiv:2107.08189, 2021.
[27] Niels Bantilan. Themis-ml: A fairness-aware machine learning inter-
face for end-to-end discrimination discovery and mitigation. Journal of
Technology in Human Services, 36(1):15–30, 2018.
[28] David W Bates, Suchi Saria, Lucila Ohno-Machado, Anand Shah, and
Gabriel Escobar. Big data in health care: using analytics to identify and
manage high-risk and high-cost patients. Health affairs, 33(7):1123–
1131, 2014.
[29] Firas Bayram, Bestoun S Ahmed, and Andreas Kassler. From concept
drift to model degradation: An overview on performance-aware drift
detectors. Knowledge-Based Systems, page 108632, 2022.
[30] Ashwin Belle, Raghuram Thiagarajan, SM Soroushmehr, Fatemeh Na-
vidi, Daniel A Beard, and Kayvan Najarian. Big data analytics in
healthcare. BioMed research international, 2015, 2015.
[31] Duane Bender and Kamran Sartipi. Hl7 fhir: An agile and restful ap-
proach to healthcare information exchange. In Proceedings of the 26th
IEEE International Symposium on Computer-Based Medical Systems,
pages 326–331, 2013.
[32] Ayne A. Beyene, Tewelle Welemariam, Marie Persson, and Niklas
Lavesson. Improved concept drift handling in surgery prediction and
other applications. Knowledge and Information Systems, 44(1):177–
196, 2015.
[33] Sarah Bird, Miro Dudı́k, Richard Edgar, Brandon Horn, Roman Lutz,
Vanessa Milan, Mehrnoosh Sameki, Hanna Wallach, and Kathleen
Walker. Fairlearn: A toolkit for assessing and improving fairness in
ai. Microsoft, Tech. Rep. MSR-TR-2020-32, 2020.
[34] Hart Blanton, James Jaccard, Jonathan Klick, Barbara Mellers, Gre-
gory Mitchell, and Philip E Tetlock. Strong claims and weak evi-
dence: reassessing the predictive validity of the iat. Journal of applied
Psychology, 94(3):567, 2009.
[35] Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-
Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and
Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium
on Security and Privacy (SP), pages 141–159, 2021.
[36] Jonathan Brophy and Daniel Lowd. Machine unlearning for random
forests. In Marina Meila and Tong Zhang, editors, Proceedings of the
38th International Conference on Machine Learning, volume 139 of
Proceedings of Machine Learning Research, pages 1092–1104. PMLR,
18–24 Jul 2021.
[37] Li Bu, Cesare Alippi, and Dongbin Zhao. A pdf-free change detec-
tion test based on density difference estimation. IEEE transactions on
neural networks and learning systems, 29(2):324–334, 2016.
[38] Joy Buolamwini and Timnit Gebru. Gender Shades: Intersectional Ac-
curacy Disparities in Commercial Gender Classification. In Proceedings
of the 1st Conference on Fairness, Accountability and Transparency,
volume 81 of FAT*’18, page 15, 2018.
[41] Tianshi Cao, Chinwei Huang, David Yu-Tung Hui, and Joseph Paul
Cohen. A benchmark of medical out of distribution detection. arXiv
preprint arXiv:2007.04250, 2020.
[42] Quinn Capers IV, Daniel Clinchot, Leon McDougle, and Anthony G
Greenwald. Implicit racial bias in medical school admissions. Academic
Medicine, 92(3):365–369, 2017.
[44] L. Elisa Celis, Vijay Keswani, and Nisheeth Vishnoi. Data preprocess-
ing to mitigate bias: A maximum entropy based approach. 119:1349–
1359, 2020.
[45] Sachi Chaudhary, Riya Kakkar, Nilesh Kumar Jadav, Anuja Nair, Ra-
jesh Gupta, Sudeep Tanwar, Smita Agrawal, Mohammad Dahman Al-
shehri, Ravi Sharma, Gulshan Sharma, et al. A taxonomy on smart
healthcare technologies: Security framework, case study, and future
directions. Journal of Sensors, 2022, 2022.
[48] Irene Chen, Fredrik D Johansson, and David Sontag. Why Is My Clas-
sifier Discriminatory? In Advances in Neural Information Processing
Systems 31, pages 3539–3550. Curran Associates, Inc., 2018.
[49] Irene Chen, Shalmali Joshi, and Marzyeh Ghassemi. Treating health
disparities with artificial intelligence. volume 26, page 16–17, 2020.
[50] Irene Y Chen, Emma Pierson, Sherri Rose, Shalmali Joshi, Kadija Fer-
ryman, and Marzyeh Ghassemi. Ethical machine learning in healthcare.
Annual review of biomedical data science, 4:123–144, 2021.
[52] Weijie Chen, Berkman Sahiner, Frank Samuelson, Aria Pezeshk, and
Nicholas Petrick. Calibration of medical diagnostic classifier scores to
the probability of disease. Statistical methods in medical research,
27(5):1394–1409, 2018.
[54] Shir Chorev, Philip Tannor, Dan Ben Israel, Noam Bressler, Itay Gab-
bay, Nir Hutnik, Jonatan Liberman, Matan Perlmutter, Yurii Ro-
manyshyn, and Lior Rokach. Deepchecks: A Library for Testing and
Validating Machine Learning Models and Data.
[55] Alexandra Chouldechova. Fair prediction with disparate impact: A
study of bias in recidivism prediction instruments. Big data, 5(2):153–
163, 2016.
[56] Oliver Cobb and Arnaud Van Looveren. Context-aware drift detection.
In International Conference on Machine Learning, pages 4087–4111.
PMLR, 2022.
[57] Gary S Collins, Paula Dhiman, Constanza L Andaur Navarro, Jie Ma,
Lotty Hooft, Johannes B Reitsma, Patricia Logullo, Andrew L Beam,
Lily Peng, Ben Van Calster, et al. Protocol for development of a report-
ing guideline (tripod-ai) and risk of bias tool (probast-ai) for diagnostic
and prognostic prediction model studies based on artificial intelligence.
BMJ open, 11(7):e048008, 2021.
[58] Sam Corbett-Davies and Sharad Goel. The measure and mismeasure
of fairness: A critical review of fair machine learning. 2018.
[59] Conor K Corbin, Rob Maclay, Aakash Acharya, Sreedevi Mony,
Soumya Punnathanam, Rahul Thapa, Nikesh Kotecha, Nigam H Shah,
and Jonathan H Chen. Deployr: A technical framework for deploying
custom real-time machine learning models into the electronic medical
record. arXiv preprint arXiv:2303.06269, 2023.
[60] Andrew Cotter, Maya Gupta, Heinrich Jiang, Nathan Srebro, Karthik
Sridharan, Serena Wang, Blake Woodworth, and Seungil You. Train-
ing well-generalizing classifiers for fairness metrics and other data-
dependent constraints. In International Conference on Machine
Learning, pages 1397–1405. PMLR, 2019.
[61] Fida Kamal Dankar and Khaled El Emam. Practicing differential pri-
vacy in health care: A review. Trans. Data Priv., 6(1):35–67, 2013.
[62] Sharon E Davis, Robert A Greevy Jr, Christopher Fonnesbeck,
Thomas A Lasko, Colin G Walsh, and Michael E Matheny. A non-
parametric updating method to correct clinical prediction model drift.
Journal of the American Medical Informatics Association, 26(12):1448–
1457, 2019.
[63] Angus Dawson. Trust, trustworthiness and health. Forum for Medical
Ethics Society, 2015.
[67] Kevin Donnelly et al. Snomed-ct: The advanced terminology and cod-
ing system for ehealth. Studies in health technology and informatics,
121:279, 2006.
[68] Finale Doshi-Velez and Been Kim. Towards a rigorous science of inter-
pretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
[69] Xinyi Du-Harpur, Callum Arthurs, Clarisse Ganier, Rick Woolf, Zainab
Laftah, Manpreet Lakhan, Amr Salam, Bo Wan, Fiona M. Watt,
Nicholas M. Luscombe, and Magnus D. Lynch. Clinically relevant vul-
nerabilities of deep machine learning systems for skin cancer diagnosis.
J Invest Dermatol., 141(4):916–920, 2021.
Michael J. Boniface. Using explainable machine learning to characterise
data drift and detect emergent health risks for emergency department
admissions during covid-19. Sci Rep, 11:23017, 2021.
[71] Michael Ekstrand, Robin Burke, and Fernando Diaz. Fairness and
discrimination in recommendation and retrieval. In Proceedings of the
13th ACM Conference on Recommender Systems (RecSys '19), pages
576–577, 2019.
[73] Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M.
Swetter, Helen M. Blau, and Sebastian Thrun. Dermatologist-level
classification of skin cancer with deep neural networks. Nature,
542(7639):115–118, February 2017.
[75] Mark Evans, Ying He, Leandros Maglaras, and Helge Janicke. Heart-
is: A novel technique for evaluating human error-related information
security incidents. Computers & Security, 80:74–89, 2019.
[77] Julia A Files, Anita P Mayer, Marcia G Ko, Patricia Friedrich, Marjorie
Jenkins, Michael J Bryan, Suneela Vegunta, Christopher M Wittich,
Melissa A Lyle, Ryan Melikian, et al. Speaker introductions at internal
medicine grand rounds: forms of address reveal gender bias. Journal
of women’s health, 26(5):413–419, 2017.
Saria. The clinician and dataset shift in artificial intelligence. New
England Journal of Medicine, 385(3):283–286, 2021.
[79] Chloë FitzGerald and Samia Hurst. Implicit bias in healthcare profes-
sionals: a systematic review. BMC medical ethics, 18(1):1–18, 2017.
[81] Zee Fryer, Vera Axelrod, Ben Packer, Alex Beutel, Jilin Chen, and
Kellie Webster. Flexible text generation for counterfactual fairness
probing. arXiv preprint arXiv:2206.13757, 2022.
[84] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and
Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM
computing surveys (CSUR), 46(4):1–37, 2014.
[85] Ruoyuan Gao and Chirag a Shah. Toward creating a fairer ranking
in search engine results. In Information Processing & Management,
volume 57, 2020.
[88] Marzyeh Ghassemi and Shakir Mohamed. Machine learning and health
need better values. npj Digital Medicine, 5(1):1–4, 2022.
[89] Milena A. Gianfrancesco, Suzanne Tamang, Jinoos Yazdany, and
Gabriela Schmajuk. Potential biases in machine learning algo-
rithms using electronic health record data. JAMA internal medicine,
178(11):1544–1547, 2018.
[90] Tony Ginart, Martin Jinye Zhang, and James Zou. MLDemon: De-
ployment monitoring for machine learning systems. In Gus-
tau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, edi-
tors, Proceedings of The 25th International Conference on Artificial
Intelligence and Statistics, volume 151 of Proceedings of Machine
Learning Research, pages 3962–3997. PMLR, 28–30 Mar 2022.
[91] Elliot Graham, Samer Halabi, and Arie Nadler. Ingroup bias in health-
care contexts: Israeli-jewish perceptions of arab and jewish doctors.
Frontiers in psychology, 12, 2021.
[92] Alex Graves, Marc G. Bellemare, Jacob Menick, Rémi Munos, and
Koray Kavukcuoglu. Automated curriculum learning for neural net-
works. In Doina Precup and Yee Whye Teh, editors, Proceedings of
the 34th International Conference on Machine Learning, volume 70 of
Proceedings of Machine Learning Research, pages 1311–1320. PMLR,
06–11 Aug 2017.
[95] Hao Guan and Mingxia Liu. Domain adaptation for medical image
analysis: a survey. IEEE Transactions on Biomedical Engineering,
69(3):1173–1185, 2021.
[96] Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens Van
Der Maaten. Certified data removal from machine learning models.
arXiv preprint arXiv:1911.03030, 2019.
[97] Lin Lawrence Guo, Stephen R. Pfohl, Jason Fries, Alistair E. W. John-
son, Jose Posada, Catherine Aftandilian, Nigam Shah, and Lillian
Sung. Evaluation of domain generalization and adaptation on improv-
ing model robustness to temporal dataset shift in clinical medicine. Sci
Rep, page 2726, 2022.
[98] Xiaoyuan Guo, Judy Wawira Gichoya, Hari Trivedi, Saptarshi
Purkayastha, and Imon Banerjee. Medshift: identifying shift data for
medical dataset curation. arXiv preprint arXiv:2112.13885, 2021.
[99] Kishor Datta Gupta and Dipankar Dasgupta. Who is responsible for
adversarial defense? arXiv preprint arXiv:2106.14152, 2021.
[100] Raia Hadsell, Dushyant Rao, Andrei A. Rusu, and Razvan Pascanu.
Embracing change: Continual learning in deep neural networks. Trends
in cognitive sciences, 24(12):1028–1040, 2020.
[101] Abid Haleem, Mohd Javaid, Ravi Pratap Singh, Rajiv Suman, and
Shanay Rab. Blockchain technology applications in healthcare: An
overview. International Journal of Intelligent Networks, 2:130–139,
2021.
[102] Frederik Harder, Matthias Bauer, and Mijung Park. Interpretable
and differentially private predictions. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 34, pages 4083–4090, 2020.
[103] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of Opportu-
nity in Supervised Learning. In Proceedings of the 30th International
Conference on Neural Information Processing Systems, NIPS’16, pages
3323–3331, USA, 2016. Curran Associates Inc. event-place: Barcelona,
Spain.
[104] David A Harrison, Anthony R Brady, Gareth J Parry, James R Car-
penter, and Kathy Rowan. Recalibration of risk prediction models in
a large multicenter cohort of admissions to adult, general critical care
units in the united kingdom. Critical care medicine, 34(5):1378–1388,
2006.
[105] Thomas Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon
Kim, and Marzyeh Ghassemi. Aging with grace: Lifelong model editing
with discrete key-value adaptors. 2022.
[106] Hassan Moharram, Ahmed Awad, and Passent M. El-Kafrawy. Optimiz-
ing ADWIN for steady streams. SAC ’22: Proceedings of the 37th
ACM/SIGAPP Symposium on Applied Computing, pages 450–459,
2022.
[107] Haibo He, Yang Bai, Edwardo A Garcia, and Shutao Li. Adasyn:
Adaptive synthetic sampling approach for imbalanced learning. In 2008
IEEE international joint conference on neural networks (IEEE world
congress on computational intelligence), pages 1322–1328. IEEE, 2008.
[108] Xin He, Kaiyong Zhao, and Xiaowen Chu. Automl: A survey of the
state-of-the-art. Knowledge-Based Systems, 212:106622, 2021.
[110] Sarah Holland, Ahmed Hosny, Sarah Newman, Joshua Joseph, and
Kasia Chmielinski. The dataset nutrition label: A framework to drive
higher data quality standards. arXiv preprint arXiv:1805.03677, 2018.
[111] Hongsheng Hu, Zoran Salcic, Lichao Sun, Gillian Dobbie, Philip S Yu,
and Xuyun Zhang. Membership inference attacks on machine learning:
A survey. ACM Computing Surveys (CSUR), 54(11s):1–37, 2022.
[112] Hamish Huggard, Yun Sing Koh, Gillian Dobbie, and Edmond Zhang.
Detecting concept drift in medical triage. pages 1733–1736, 2020.
[113] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana
Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn
Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S.
Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P.
Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng.
CheXpert: A Large Chest Radiograph Dataset with Uncertainty La-
bels and Expert Comparison. arXiv:1901.07031 [cs, eess], January 2019.
arXiv: 1901.07031.
[114] Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou.
Approximate data deletion from machine learning models. volume 130,
pages 2008–2016, 2021.
[115] Daniel Jarrett, Jinsung Yoon, Ioana Bica, Zhaozhi Qian, Ari Er-
cole, and Mihaela van der Schaar. Clairvoyance: A pipeline toolkit
for medical time series. In International Conference on Learning
Representations, 2020.
[116] Shouling Ji, Weiqing Li, Prateek Mittal, Xin Hu, and Raheem Beyah.
{SecGraph}: A uniform and open-source evaluation system for graph
data anonymization and de-anonymization. In 24th USENIX Security
Symposium (USENIX Security 15), pages 303–318, 2015.
[118] Meenu Mary John, Helena Holmström Olsson, and Jan Bosch. Towards
mlops: A framework and maturity model. In 2021 47th Euromicro
Conference on Software Engineering and Advanced Applications
(SEAA), pages 1–8, 2021.
[122] Faisal Kamiran and Toon Calders. Data preprocessing techniques
for classification without discrimination. Knowledge and information
systems, 33(1):1–33, 2012.
[123] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma.
Fairness-aware classifier with prejudice remover regularizer. In Joint
European conference on machine learning and knowledge discovery in
databases, pages 35–50. Springer, 2012.
[124] Sehj Kashyap, Keith E Morse, Birju Patel, and Nigam H Shah. A
survey of extant organizational and computational setups for deploying
predictive models in health systems. Journal of the American Medical
Informatics Association, 28(11):2445–2450, 2021.
[125] Sara Kaviani, Ki Jin Han, and Insoo Sohn. Adversarial attacks and de-
fenses on ai in medical imaging informatics: A survey. Expert Systems
with Applications, page 116815, 2022.
[126] Jane Kaye. The tension between data sharing and the protection of
privacy in genomics research. Annual review of genomics and human
genetics, 13:415, 2012.
[127] Faiza Khan Khattak, Serena Jeblee, Chloé Pou-Prom, Mohamed Ab-
dalla, Christopher Meaney, and Frank Rudzicz. A survey of word
embeddings for clinical text. Journal of Biomedical Informatics,
100:100057, 2019.
[128] Byungju Kim, Hyunwoo Kim, Kyungsu Kim, Sungjin Kim, and Junmo
Kim. Learning not to learn: Training deep neural networks with biased
data. CoRR, 2018.
[131] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent
trade-offs in the fair determination of risk scores. In 8th Innovations
in Theoretical Computer Science Conference, page 3:1–43:23, 2017.
[132] William A. Knaus. Prognostic modeling and major dataset shifts dur-
ing the covid-19 pandemic: What have we learned for the next pan-
demic? JAMA Health Forum, 3(5):e221103, 2022.
[133] Wouter M Kouw and Marco Loog. A review of domain adaptation with-
out target labels. IEEE transactions on pattern analysis and machine
intelligence, 43(3):766–785, 2019.
[135] Sean Kulinski, Saurabh Bagchi, and David I. Inouye. Feature shift
detection: Localizing which features have shifted via conditional dis-
tribution tests. Advances in neural information processing systems, 33,
2020.
[136] Preethi Lahoti, Alex Beutel, Jilin Chen, Kang Lee, Flavien Prost, Nithum Thain, Xuezhi Wang, and Ed Chi. Fairness without demographics through adversarially reweighted learning. Advances in Neural Information Processing Systems, 33:728–740, 2020.
[139] Cheolhyoung Lee, Kyunghyun Cho, and Wanmo Kang. Mixout: Effec-
tive regularization to finetune large-scale pretrained language models.
arXiv preprint arXiv:1909.11299, 2019.
[140] Sebastian Lee, Sebastian Goldt, and Andrew Saxe. Continual learning
in the teacher-student setup: Impact of task similarity. In International
Conference on Machine Learning, pages 6109–6119. PMLR, 2021.
[141] Matthias Lenga, Heinrich Schulz, and Axel Saalbach. Continual learn-
ing for domain adaptation in chest x-ray classification. In Medical
Imaging with Deep Learning, pages 413–423. PMLR, 2020.
[142] Bo Li, Peng Qi, Bo Liu, Shuai Di, Jingen Liu, Jiquan Pei, Jinfeng Yi,
and Bowen Zhou. Trustworthy ai: From principles to practices. arXiv
preprint arXiv:2110.01167, 2021.
[143] Junbing Li, Changqing Zhang, Joey Tianyi Zhou, Huazhu Fu, Shuyin
Xia, and Qinghua Hu. Deep-lift: deep label-specific feature learning
for image annotation. IEEE Transactions on Cybernetics, 2021.
[144] Xiaoxiao Li, Ziteng Cui, Yifan Wu, Lin Gu, and Tatsuya Harada. Esti-
mating and improving fairness with adversarial learning. arXiv preprint
arXiv:2103.04243, 2021.
[145] Xuhong Li, Haoyi Xiong, Xingjian Li, Xuanyu Wu, Xiao Zhang, Ji Liu,
Jiang Bian, and Dejing Dou. Interpretable deep learning: Interpreta-
tion, interpretability, trustworthiness, and beyond. Knowledge and
Information Systems, pages 1–38, 2022.
[147] Shun Liao, Jamie Kiros, Jiyang Chen, Zhaolei Zhang, and Ting
Chen. Improving domain adaptation in de-identification of electronic
health records through self-training. Journal of the American Medical
Informatics Association, 28(10):2093–2100, 2021.
[148] Divakaran Liginlal, Inkook Sim, and Lara Khansa. How significant is human error as a cause of privacy breaches? An empirical study and a framework for error management. Computers & Security, 28(3-4):215–228, 2009.
[149] James Liley, Samuel Emerson, Bilal Mateen, Catalina Vallejos, Louis
Aslett, and Sebastian Vollmer. Model updating after interventions
paradoxically introduces bias. In Arindam Banerjee and Kenji Fuku-
mizu, editors, Proceedings of The 24th International Conference on
Artificial Intelligence and Statistics, volume 130 of Proceedings of
Machine Learning Research, pages 3916–3924. PMLR, 13–15 Apr 2021.
[150] Wen Hui Lim, Chloe Wong, Sneha Rajiv Jain, Cheng Han Ng, Chia Hui
Tai, M Kamala Devi, Dujeepa D Samarasekera, Shridhar Ganpathi
Iyer, and Choon Seng Chong. The unspoken reality of gender bias in
surgery: A qualitative systematic review. PloS one, 16(2):e0246420,
2021.
[151] Zachary C. Lipton, Yu-Xiang Wang, and Alex Smola. Detecting and
correcting for label shift with black box predictors. arXiv preprint
arXiv:1802.03916, 2018.
[155] Feng Liu, Wenkai Xu, Jie Lu, and Danica J Sutherland. Meta
two-sample testing: Learning kernels for testing with limited data.
Advances in Neural Information Processing Systems, 34:5848–5860,
2021.
[156] Haochen Liu, Yiqi Wang, Wenqi Fan, Xiaorui Liu, Yaxin Li, Shaili
Jain, Yunhao Liu, Anil K Jain, and Jiliang Tang. Trustworthy ai: A
computational perspective. arXiv preprint arXiv:2107.06641, 2021.
[157] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-
based out-of-distribution detection. Advances in neural information
processing systems, 33:21464–21475, 2020.
[158] Xiaoxuan Liu, Samantha Cruz Rivera, David Moher, Melanie J
Calvert, and Alastair K Denniston. Reporting guidelines for clini-
cal trial reports for interventions involving artificial intelligence: the
consort-ai extension. BMJ, 370, 2020.
[159] Vishnu Suresh Lokhande, Aditya Kumar Akash, Sathya N. Ravi, and Vikas Singh. Fairalm: Augmented lagrangian method for training fair models with little regret. In European Conference on Computer Vision, pages 365–381. Springer, 2020.
[160] Vincenzo Lomonaco, Lorenzo Pellegrini, Andrea Cossu, Antonio Carta,
Gabriele Graffieti, Tyler L. Hayes, Matthias De Lange, Marc Masana,
Jary Pomponi, Gido van de Ven, Martin Mundt, Qi She, Keiland
Cooper, Jeremy Forest, Eden Belouadah, Simone Calderara, German I.
Parisi, Fabio Cuzzolin, Andreas Tolias, Simone Scardapane, Luca
Antiga, Subutai Ahmad, Adrian Popescu, Christopher Kanan, Joost
van de Weijer, Tinne Tuytelaars, Davide Bacciu, and Davide Maltoni.
Avalanche: an end-to-end library for continual learning. In Proceedings
of IEEE Conference on Computer Vision and Pattern Recognition, 2nd
Continual Learning in Computer Vision Workshop, 2021.
[161] David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample
tests. arXiv preprint arXiv:1610.06545, 2016.
[162] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and
Richard Zemel. The variational fair autoencoder. arXiv preprint
arXiv:1511.00830, 2015.
[163] Jie Lu, Anjin Liu, Fan Dong, Feng Gu, Joao Gama, and Guangquan
Zhang. Learning under concept drift: A review. arXiv preprint
arXiv:2004.05785, 2020.
[164] Jonathan H Lu, Alison Callahan, Birju S Patel, Keith E Morse, Dev
Dash, Michael A Pfeffer, and Nigam H Shah. Assessment of adherence
to reporting guidelines by commonly used clinical prediction models
from a single vendor: a systematic review. JAMA Network Open,
5(8):e2227779–e2227779, 2022.
[165] Scott M Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jor-
dan M Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha
Bansal, and Su-In Lee. Explainable ai for trees: From local explana-
tions to global understanding. arXiv preprint arXiv:1905.04610, 2019.
[166] Scott M Lundberg and Su-In Lee. A unified approach to interpreting
model predictions. Advances in neural information processing systems,
30, 2017.
[167] Sasu Mäkinen, Henrik Skogström, Eero Laaksonen, and Tommi Mikko-
nen. Who needs mlops: What data scientists seek to accomplish
and how can mlops help? In 2021 IEEE/ACM 1st Workshop on
AI Engineering-Software Engineering for AI (WAIN), pages 109–112.
IEEE, 2021.
[168] A James Mamary, Jeffery I Stewart, Gregory L Kinney, John E Hokan-
son, Kartik Shenoy, Mark T Dransfield, Marilyn G Foreman, Gwen-
dolyn B Vance, Gerard J Criner, COPDGene® Investigators, et al.
Race and gender disparities are evident in copd underdiagnoses across
all severities of measured airflow obstruction. Chronic Obstructive
Pulmonary Diseases: Journal of the COPD Foundation, 5(3):177, 2018.
[169] Jasmine R Marcelin, Dawd S Siraj, Robert Victor, Shaila Kotadia, and
Yvonne A Maldonado. The impact of unconscious bias in healthcare:
how to recognize and mitigate it. The Journal of infectious diseases,
220(Supplement 2):S62–S73, 2019.
[170] Ričards Marcinkevičs and Julia E Vogt. Interpretability and ex-
plainability: A machine learning zoo mini-tour. arXiv preprint
arXiv:2012.01805, 2020.
[171] Andrea Margheri, Massimiliano Masi, Abdallah Miladi, Vladimiro Sas-
sone, and Jason Rosenzweig. Decentralised provenance for healthcare
data. International Journal of Medical Informatics, 141:104197, 2020.
[172] Aniek F Markus, Jan A Kors, and Peter R Rijnbeek. The role of
explainability in creating trustworthy artificial intelligence for health
care: a comprehensive survey of the terminology, design choices, and
evaluation strategies. Journal of Biomedical Informatics, 113:103655,
2021.
[174] Frank J Massey Jr. The kolmogorov-smirnov test for goodness of fit.
Journal of the American Statistical Association, 46(253):68–78, 1951.
[175] Lisa M Meeks, Kurt Herzer, and Neera R Jain. Removing barriers and
facilitating access: increasing the number of physicians with disabili-
ties. Academic Medicine, 93(4):540–543, 2018.
[179] Chuizheng Meng, Loc Trinh, Nan Xu, James Enouen, and Yan Liu. In-
terpretability and fairness evaluation of deep learning models on mimic-
iv dataset. Scientific Reports, 12(1):1–28, 2022.
[180] Vishwali Mhasawade, Yuan Zhao, and Rumi Chunara. Machine learn-
ing and algorithmic fairness in public and population health. Nature
Machine Intelligence, 3(8):659–666, 2021.
[182] Eric Mitchell, Charles Lin, Antoine Bosselut, Christopher D Man-
ning, and Chelsea Finn. Memory-based model editing at scale. In
International Conference on Machine Learning, pages 15817–15831.
PMLR, 2022.
[183] Shira Mitchell, Eric Potash, Solon Barocas, Alexander D’Amour, and
Kristian Lum. Algorithmic fairness: Choices, assumptions, and defi-
nitions. Annual Review of Statistics and Its Application, 8:141–163,
2021.
[186] Hussein Mozannar and David Sontag. Consistent estimators for learn-
ing to defer to an expert. In Hal Daumé III and Aarti Singh, editors,
Proceedings of the 37th International Conference on Machine Learning,
volume 119 of Proceedings of Machine Learning Research, pages 7076–
7087. PMLR, 13–18 Jul 2020.
[189] Akm Iqtidar Newaz, Amit Kumar Sikder, Mohammad Ashiqur Rah-
man, and A Selcuk Uluagac. A survey on security and privacy issues in
modern healthcare systems: Attacks and defenses. ACM Transactions
on Computing for Healthcare, 2(3):1–44, 2021.
[190] Wei Yan Ng, Tien-En Tan, Prasanth VH Movva, Andrew Hao Sen
Fang, Khung-Keong Yeo, Dean Ho, Fuji Shyy San Foo, Zhe Xiao, Kai
Sun, Tien Yin Wong, et al. Blockchain applications in health care for
covid-19 and beyond: a systematic review. The Lancet Digital Health,
3(12):e819–e829, 2021.
[192] Harsha Nori, Samuel Jenkins, Paul Koch, and Rich Caruana. In-
terpretml: A unified framework for machine learning interpretability.
arXiv preprint arXiv:1909.09223, 2019.
[193] Ziad Obermeyer, Christine Vogeli, Brian Powers, and Sendhil Mul-
lainathan. Dissecting racial bias in an algorithm used to manage the
health of populations. Science, 366(6464):447–453, 2019.
[194] Se-Ra Oh, Young-Duk Seo, Euijong Lee, and Young-Gab Kim. A com-
prehensive survey on security and privacy for electronic health data.
International Journal of Environmental Research and Public Health,
18(18):9668, 2021.
[195] OHDSI. The Book of OHDSI: Observational Health Data Sciences and
Informatics. OHDSI, 2019.
[199] Avneet Pannu. Artificial intelligence and its application in different
areas. Artificial Intelligence, 4(10):79–84, 2015.
[200] Mathias PM Parisot, Balazs Pejo, and Dayana Spagnuelo. Property in-
ference attacks on convolutional neural networks: Influence and impli-
cations of target model’s complexity. arXiv preprint arXiv:2104.13061,
2021.
[201] Chunjong Park, Anas Awadalla, Tadayoshi Kohno, and Shwetak Patel.
Reliable and trustworthy machine learning for health using dataset shift
detection. arXiv preprint arXiv:2110.14019, 2021.
[202] Chunjong Park, Anas Awadalla, Tadayoshi Kohno, and Shwetak Patel.
Reliable and trustworthy machine learning for health using dataset
shift detection. Advances in Neural Information Processing Systems, 34, 2021.
[206] Oleg S Pianykh, Georg Langs, Marc Dewey, Dieter R Enzmann, Chris-
tian J Herold, Stefan O Schoenberg, and James A Brink. Continuous
learning ai in radiology: implementation principles and early applica-
tions. Radiology, 297(1):6–14, 2020.
[207] Nikolaos Pitropakis, Emmanouil Panaousis, Thanassis Giannetsos,
Eleftherios Anastasiadis, and George Loukas. A taxonomy and sur-
vey of attacks against machine learning. Computer Science Review,
34:100199, 2019.
[209] John Platt et al. Probabilistic outputs for support vector machines
and comparisons to regularized likelihood methods. Advances in large
margin classifiers, 10(3):61–74, 1999.
[210] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. On fairness and calibration. Advances in Neural Information Processing Systems, 30, 2017.
[213] Adnan Qayyum, Junaid Qadir, Muhammad Bilal, and Ala Al-Fuqaha.
Secure and robust machine learning for healthcare: A survey. IEEE
Reviews in Biomedical Engineering, 14:156–180, 2020.
[217] Alvin Rajkomar, Jeffrey Dean, and Isaac Kohane. Machine learning in
medicine. New England Journal of Medicine, 380(14):1347–1358, 2019.
[218] Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M Dai, Nissan Ha-
jaj, Michaela Hardt, Peter J Liu, Xiaobing Liu, Jake Marcus, Mimi
Sun, et al. Scalable and accurate deep learning with electronic health
records. NPJ digital medicine, 1(1):1–10, 2018.
[219] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel
Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie
Shpanskaya, et al. Chexnet: Radiologist-level pneumonia detection on
chest x-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.
[220] Khansa Rasheed, Adnan Qayyum, Mohammed Ghaly, Ala Al-Fuqaha,
Adeel Razi, and Junaid Qadir. Explainable, trustworthy, and ethical
machine learning for healthcare: A survey. Computers in Biology and
Medicine, page 106043, 2022.
[221] Christian Reimers, Paul Bodesheim, Jakob Runge, and Joachim Den-
zler. Towards learning an unbiased classifier from biased data via con-
ditional adversarial debiasing. pages 48–62, 2021.
[222] Jie Ren, Stanislav Fort, Jeremiah Liu, Abhijit Guha Roy, Shreyas
Padhy, and Balaji Lakshminarayanan. A simple fix to maha-
lanobis distance for improving near-ood detection. arXiv preprint
arXiv:2106.09022, 2021.
[223] Cedric Renggli, Luka Rimanic, Nezihe Merve Gürel, Bojan Karlaš,
Wentao Wu, and Ce Zhang. A data quality-driven view of mlops.
arXiv preprint arXiv:2102.07750, 2021.
[224] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In
Proceedings of the 22nd ACM SIGKDD international conference on
knowledge discovery and data mining, pages 1135–1144, 2016.
[225] Samantha Cruz Rivera, Xiaoxuan Liu, An-Wen Chan, Alastair K Den-
niston, Melanie J Calvert, Hutan Ashrafian, Andrew L Beam, Gary S
Collins, Ara Darzi, Jonathan J Deeks, et al. Guidelines for clinical trial
protocols for interventions involving artificial intelligence: the spirit-ai
extension. The Lancet Digital Health, 2(10):e549–e560, 2020.
[226] Dani E Rosenkrantz, Whitney W Black, Roberto L Abreu, Mollie E
Aleshire, and Keisa Fallin-Bennett. Health and health care of rural
sexual and gender minorities: A systematic review. Stigma and Health,
2(3):229, 2017.
[229] Philipp Ruf, Manav Madan, Christoph Reich, and Djaffar Ould-
Abdeslam. Demystifying mlops and presenting a recipe for the selection
of open-source tools. Applied Sciences, 11(19):8861, 2021.
[230] Theo Ryffel, Andrew Trask, Morten Dahl, Bobby Wagner, Jason
Mancuso, Daniel Rueckert, and Jonathan Passerat-Palmbach. A
generic framework for privacy preserving deep learning. arXiv preprint
arXiv:1811.04017, 2018.
[231] Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy
Liang. Distributionally robust neural networks for group shifts: On
the importance of regularization for worst-case generalization. arXiv
preprint arXiv:1911.08731, 2019.
[233] Rishi Kanth Saripalle. Fast health interoperability resources (fhir): cur-
rent status in the healthcare system. International Journal of E-Health
and Medical Communications (IJEHMC), 10(1):76–93, 2019.
[235] Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann,
Wieland Brendel, and Matthias Bethge. Improving robustness against
common corruptions by covariate shift adaptation. In H. Larochelle,
M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances
in Neural Information Processing Systems, volume 33, pages 11539–
11551. Curran Associates, Inc., 2020.
[236] Antonin Schrab, Ilmun Kim, Mélisande Albert, Béatrice Laurent, Ben-
jamin Guedj, and Arthur Gretton. Mmd aggregated two-sample test.
arXiv preprint arXiv:2110.15073, 2021.
[237] Jessica Schrouff, Natalie Harris, Oluwasanmi Koyejo, Ibrahim Alabdul-
mohsin, Eva Schnider, Krista Opsahl-Ong, Alex Brown, Subhrajit Roy,
Diana Mincu, Christina Chen, et al. Maintaining fairness across distri-
bution shift: do we have viable solutions for real-world applications?
arXiv preprint arXiv:2202.01034, 2022.
[238] Mark Sendak, Gaurav Sirdeshmukh, Timothy Ochoa, Hayley Premo,
Linda Tang, Kira Niederhoffer, Sarah Reed, Kaivalya Deshpande,
Emily Sterrett, Melissa Bauer, et al. Development and validation
of ml-dqa–a machine learning data quality assurance framework for
healthcare. arXiv preprint arXiv:2208.02670, 2022.
[239] MP Sendak, W Ratliff, D Sarro, E Alderton, J Futoma, M Gao, M Nichols, M Revoir, F Yashar, C Miller, et al. Real-world integration of a sepsis deep learning technology into routine clinical care: Implementation study. JMIR Medical Informatics, 8(7):e15182, 2020. doi: 10.2196/15182.
[240] Tegjyot Singh Sethi and Mehmed Kantardzic. On the reliable detection
of concept drift from streaming unlabeled data. Expert Systems with
Applications, 82:77–99, 2017.
[241] Laleh Seyyed-Kalantari, Guanxiong Liu, Matthew McDermott, Irene Chen, and Marzyeh Ghassemi. Chexclusion: Fairness gaps in deep chest x-ray classifiers. In Pacific Symposium on Biocomputing, 2021.
[242] Laleh Seyyed-Kalantari, Haoran Zhang, Matthew McDermott, Irene
Chen, and Marzyeh Ghassemi. Underdiagnosis bias of artificial intelli-
gence algorithms applied to chest radiographs in under-served patient
populations. Nature Medicine, 27:2176–2182, 2021.
[243] Shubham Sharma, Jette Henderson, and Joydeep Ghosh. Certifai: A
common framework to provide explanations and analyse the fairness
and robustness of black-box models. In Proceedings of the AAAI/ACM
Conference on AI, Ethics, and Society, pages 166–172, 2020.
[244] Ying Sheng, Sandeep Tata, James B Wendt, Jing Xie, Qi Zhao, and
Marc Najork. Anatomy of a privacy-safe large-scale information ex-
traction system over email. In Proceedings of the 24th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining,
pages 734–743, 2018.
[245] Arjun Soin, Jameson Merkow, Jin Long, Joseph Paul Cohen, Smitha
Saligrama, Stephen Kaiser, Steven Borg, Ivan Tarapov, and Matthew P
Lungren. Chexstray: Real-time multi-modal data concordance for drift
detection in medical imaging ai, 2022.
[246] Karin Stacke, Gabriel Eilertsen, Jonas Unger, and Claes Lundström.
Measuring domain shift for deep learning in histopathology. IEEE
journal of biomedical and health informatics, 25(2):325–336, 2020.
[247] G Stiglic, P Kocbek, N Fijacko, M Zitnik, K Verbert, and L. Cilar.
Interpretability of machine learning based prediction models in health-
care. WIREs Data Mining Knowl Discov., 10(5):e1379, 2020.
[248] Vallijah Subasri, Amrit Krishnan, Azra Dhalla, Deval Pandya, David
Malkin, Fahad Razak, Amol Verma, Anna Goldenberg, and Elham
Dolatabadi. Diagnosing and remediating harmful data shifts for the
responsible deployment of clinical ai models. medRxiv, pages 2023–03,
2023.
[249] Adarsh Subbaswamy, Roy Adams, and Suchi Saria. Evaluating model
robustness and stability to dataset shift. Proceedings of Machine
Learning Research, pages 2611–2619, 2021.
[250] Tony Y Sun, Oliver J Walk IV, Jennifer L Chen, Harry Reyes Nieva,
and Noémie Elhadad. Exploring gender disparities in time to diagnosis.
2020.
[251] Georgios Symeonidis, Evangelos Nerantzis, Apostolos Kazakis, and
George A Papakostas. Mlops–definitions, tools and challenges. arXiv
preprint arXiv:2201.00162, 2022.
[252] Kim Templeton, Carol A Bernstein, Javeed Sukhera, Lois Margaret
Nora, Connie Newman, Helen Burstin, Constance Guille, Lorna Lynn,
Margaret L Schwarze, Srijan Sen, et al. Gender-based differences in
burnout: Issues faced by women physicians. NAM Perspectives, 2019.
[253] Erico Tjoa and Cuntai Guan. A survey on explainable artificial intelli-
gence (xai): Toward medical xai. IEEE transactions on neural networks
and learning systems, 32(11):4793–4813, 2020.
[254] Joana Tomás, Deolinda Rasteiro, and Jorge Bernardino. Data
anonymization: An experimental evaluation using open-source tools.
Future Internet, 14(6):167, 2022.
[255] Sana Tonekaboni, Gabriela Morgenshtern, Azadeh Assadi, Aslesha
Pokhrel, Xi Huang, Anand Jayarajan, Robert Greer, Gennady Pekhi-
menko, Melissa McCradden, Mjaye Mazwi, et al. How to validate
machine learning models prior to deployment: Silent trial protocol
for evaluation of real-time models at icu. In Conference on Health,
Inference, and Learning, pages 169–182. PMLR, 2022.
[256] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander
Turner, and Aleksander Madry. Robustness may be at odds with ac-
curacy. arXiv preprint arXiv:1805.12152, 2018.
[257] Ata Ullah, Muhammad Azeem, Humaira Ashraf, Abdulellah A Al-
aboudi, Mamoona Humayun, and NZ Jhanjhi. Secure healthcare data
aggregation and transmission in iot—a survey. IEEE Access, 9:16849–
16865, 2021.
[258] Dennis Ulmer, Lotta Meijerink, and Giovanni Cinà. Trust issues: Un-
certainty estimation does not enable reliable ood detection on medical
tabular data. In Machine Learning for Health, pages 341–354. PMLR,
2020.
[259] Boris van Breugel, Trent Kyono, Jeroen Berrevoets, and Mihaela
van der Schaar. Decaf: Generating fair synthetic data using causally-
aware generative networks. Advances in Neural Information Processing
Systems, 34:22221–22233, 2021.
[260] Gido M Van de Ven and Andreas S Tolias. Three scenarios for continual
learning. arXiv preprint arXiv:1904.07734, 2019.
[261] MHWA van den Boogaard, L Schoonhoven, E Maseda, C Plowright,
C Jones, A Luetz, PV Sackey, PG Jorens, LM Aitken, FMP van
Haren, et al. Recalibration of the delirium prediction model for icu
patients (pre-deliric): a multinational observational study. Intensive
care medicine, 40(3):361–369, 2014.
[263] Basil Varkey. Principles of clinical ethics and their application to prac-
tice. Medical Principles and Practice, 30(1):17–28, 2021.
[266] Amol A. Verma, Russell Murray, Joshua Greiner, Joseph Paul Cohen,
Kaveh G. Shojania, Marzyeh Ghassemi, Sharon E. Straus, Chloe Pou-
Prom, and Muhammad Mamdani. Implementing machine learning in
medicine. CMAJ, 193(34):E1351–E1357, 2021.
[268] Olga Vovk, Gunnar Piho, and Peeter Ross. Evaluation of anonymiza-
tion tools for health data. In International Conference on Model and
Data Engineering, pages 302–313. Springer, 2021.
[270] Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris
Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gal-
lagher, and Scott McLachlan. Synthea: An approach, method, and
software mechanism for generating synthetic patients and the syn-
thetic electronic health care record. Journal of the American Medical
Informatics Association, 25(3):230–238, 2018.
[271] Jie Wang, Ghulam Mubashar Hassan, and Naveed Akhtar. A survey
of neural trojan attacks and defenses in deep learning. arXiv preprint
arXiv:2202.07183, 2022.
[272] Lu Wang, Mark Chignell, Yilun Zhang, Andrew Pinto, Fahad Razak,
Kathleen Sheehan, and Amol Verma. Physician experience design
(pxd): more usable machine learning prediction for clinical decision
making. In AMIA Annual Symposium Proceedings, volume 2022, page
476. American Medical Informatics Association, 2022.
[274] Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente Ordonez. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5309–5318, 2019.
[277] Sarah Wiegreffe and Ana Marasović. Teach me to explain: A review of
datasets for explainable nlp. arXiv preprint arXiv:2102.12060, 2021.
[278] Jenna Wiens, Suchi Saria, Mark Sendak, Marzyeh Ghassemi, Vincent X
Liu, Finale Doshi-Velez, Kenneth Jung, Katherine Heller, David Kale,
Mohammed Saeed, et al. Do no harm: a roadmap for responsible
machine learning for health care. Nature medicine, 25(9):1337–1340,
2019.
[279] Yinjun Wu, Edgar Dobriban, and Susan Davidson. Deltagrad: Rapid retraining of machine learning models. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 10355–10366. PMLR, 2020.
[280] Dingyi Xiang and Wei Cai. Privacy protection and secondary use of
health data: Strategies and methods. BioMed Research International,
2021, 2021.
[281] Depeng Xu, Yongkai Wu, Shuhan Yuan, Lu Zhang, and Xintao Wu.
Achieving causal fairness through generative adversarial networks. In
Proceedings of the Twenty-Eighth International Joint Conference on
Artificial Intelligence, 2019.
[282] Jie Xu, Yunyu Xiao, Wendy Hui Wang, Yue Ning, Elizabeth A
Shenkman, Jiang Bian, and Fei Wang. Algorithmic fairness in com-
putational medicine. medRxiv, 2022.
[283] Shen Xu, Toby Rogers, Elliot Fairweather, Anthony Glenn, James Cur-
ran, and Vasa Curcin. Application of data provenance in healthcare
analytics software: information visualisation of user activities. AMIA
Summits on Translational Science Proceedings, 2018:263, 2018.
[284] Adam Yala, Constance Lehman, Tal Schuster, Tally Portnoi, and
Regina Barzilay. A deep learning mammography-based model for im-
proved breast cancer risk prediction. Radiology, 292:60–66, 2019.
[285] Yuzhe Yang, Haoran Zhang, Dina Katabi, and Marzyeh Ghassemi.
Change is hard: A closer look at subpopulation shift. arXiv preprint
arXiv:2302.12254, 2023.
[286] Sobia Yaqoob, Muhammad Murad Khan, Ramzan Talib, Arslan Da-
wood Butt, Sohaib Saleem, Fatima Arif, and Amna Nadeem. Use of
blockchain in healthcare: a systematic literature review. International
Journal of Advanced Computer Science and Applications, 10(5), 2019.
[287] Eileen Yoshida, Shirley Fei, Karen Bavuso, Charles Lagor, and Saverio
Maviglia. The value of monitoring clinical decision support interven-
tions. Applied Clinical Informatics, 9(1):163–173, 2018.
[288] Shujian Yu, Xiaoyang Wang, and José C. Príncipe. Request-and-
reverify: Hierarchical hypothesis testing for concept drift detec-
tion with expensive labels. In Proceedings of the Twenty-Seventh
International Joint Conference on Artificial Intelligence, IJCAI-18,
pages 3033–3039. International Joint Conferences on Artificial Intel-
ligence Organization, July 2018.
[289] John R Zech, Marcus A Badgeley, Manway Liu, Anthony B Costa,
Joseph J Titano, and Eric Karl Oermann. Variable generalization per-
formance of a deep learning model to detect pneumonia in chest ra-
diographs: a cross-sectional study. PLoS medicine, 15(11):e1002683,
2018.
[290] Angela Zhang, Lei Xing, James Zou, and Joseph C Wu. Shifting ma-
chine learning for healthcare from development to deployment and from
models to data. Nature Biomedical Engineering, pages 1–16, 2022.
[291] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335–340, 2018.
[292] Haoran Zhang, Natalie Dullerud, Karsten Roth, Lauren Oakden-
Rayner, Stephen Pfohl, and Marzyeh Ghassemi. Improving the fair-
ness of chest x-ray classifiers. In Conference on Health, Inference, and
Learning, pages 204–233. PMLR, 2022.
[293] Haoran Zhang, Natalie Dullerud, Laleh Seyyed-Kalantari, Quaid Mor-
ris, Shalmali Joshi, and Marzyeh Ghassemi. An empirical framework
for domain generalization in clinical settings. In Proceedings of the
Conference on Health, Inference, and Learning, pages 279–290, 2021.
[295] Haoran Zhang, Amy Liu, Mohamed Abdalla, Matthew B. A. McDermott, and Marzyeh Ghassemi. Hurtful words: Quantifying biases in clinical contextual word embeddings. In Proceedings of the ACM Conference on Health, Inference, and Learning, 2020.
[296] Tianran Zhang, Muhao Chen, and Alex AT Bui. Adadiag: Adver-
sarial domain adaptation of diagnostic prediction with clinical event
sequences. Journal of biomedical informatics, 134:104168, 2022.
[298] Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei
Chang. Learning gender-neutral word embeddings. arXiv preprint
arXiv:1809.01496, 2018.
[299] Shengjia Zhao, Abhishek Sinha, Yutong He, Aidan Perreault, Jiaming
Song, and Stefano Ermon. Comparing distributions by measuring dif-
ferences that affect decision making. In International Conference on
Learning Representations, 2021.
[301] Xiaofeng Zhu and Diego Klabjan. Continual neural network model
retraining. In 2021 IEEE International Conference on Big Data (Big
Data), pages 1163–1171. IEEE, 2021.
Tool: Description
FairMLHealth31: Tools and tutorials for variation analysis in healthcare machine learning.
AIF360 [33]: An open-source library containing techniques developed by the research community to help detect and mitigate bias in machine learning models throughout the AI application lifecycle.
Fairlearn32: An open-source, community-driven project to help data scientists improve the fairness of AI systems.
Fairness-comparison33: Benchmarks fairness-aware machine learning techniques.
Fairness Indicators34: A suite of tools built on top of TensorFlow Model Analysis (TFMA) that enables regular evaluation of fairness metrics in product pipelines.
ML-fairness-gym35: A tool for exploring long-term impacts of ML systems.
themis-ml [27]: An open-source machine learning library that implements several fairness-aware methods compatible with the sklearn API.
FairML [11]: A toolbox for diagnosing bias in predictive modelling.
Black Box Auditing [12]: A toolkit for auditing ML model deviations.
What-If Tool36: Visually probes the behaviour of trained machine learning models with minimal coding.
Aequitas37: An open-source bias audit toolkit for machine learning developers, analysts, and policymakers to audit models for discrimination and bias and to make informed, equitable decisions around developing and deploying predictive risk-assessment tools.
DECAF [259]: A fair synthetic data generator for tabular data that uses GANs and causal models.
REPAIR [146]: A dataset resampling algorithm that reduces representation bias through reweighting.
CERTIFAI [243]: Evaluates AI models for robustness, fairness, and explainability, and allows users to compare different models or model versions on these qualities.
FairSight [13]: A fair decision-making pipeline that helps decision makers track fairness throughout a model.
Adv-Demog-Text [72]: An adversarial network for removing demographic attributes from text data.
GN-GloVe [298]: A framework for generating gender-neutral word embeddings.
TensorFlow Constrained Optimization38: A library for optimizing inequality-constrained problems using rate helpers.
Responsibly39 [60]: A toolkit for auditing and mitigating bias and unfairness in ML systems.
Dataset-Nutrition-Label [110]: The Data Nutrition Project aims to create a standard label for interrogating datasets.
Table 6: List of open-source tools available on GitHub that can be used to evaluate and mitigate bias and unfairness in ML models, including several specific to health.
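To make the table above concrete, the following is a minimal, illustrative sketch of how one of these toolkits can be folded into a routine audit step. It uses the open-source Fairlearn package (MetricFrame, selection_rate, and demographic_parity_difference from fairlearn.metrics) to break held-out predictions down by a sensitive attribute; the arrays y_true, y_pred, and group are hypothetical placeholders for a deployed model's outputs and a recorded attribute, not data from any study cited here.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from fairlearn.metrics import (MetricFrame, selection_rate,
                               demographic_parity_difference)

# Hypothetical held-out labels, model predictions, and sensitive attribute.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
group = np.array(["F", "F", "F", "M", "M", "M", "M", "F"])

# Per-group metrics and the largest between-group gap for each metric.
frame = MetricFrame(
    metrics={"accuracy": accuracy_score,
             "recall": recall_score,
             "selection_rate": selection_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=group,
)
print(frame.by_group)      # metric values broken down by group
print(frame.difference())  # between-group disparity for each metric

# Single-number demographic parity gap (difference in selection rates).
gap = demographic_parity_difference(y_true, y_pred, sensitive_features=group)
print(f"Demographic parity difference: {gap:.3f}")

Several of the other toolkits listed above (for example AIF360 and Aequitas) expose comparable group-level audits, so the same held-out predictions can be routed through whichever library best matches the deployment stack.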