
MLHOps: Machine Learning for Healthcare Operations

Faiza Khan Khattak (a), Vallijah Subasri (a,b,c), Amrit Krishnan (a), Elham Dolatabadi (a), Deval Pandya (a), Laleh Seyyed-Kalantari (d), Frank Rudzicz (a,c,e)

(a) Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
(b) Hospital for Sick Children, Toronto, Ontario, Canada
(c) University of Toronto, Toronto, Ontario, Canada
(d) York University, Toronto, Ontario, Canada
(e) Dalhousie University, Halifax, Nova Scotia, Canada

arXiv:2305.02474v1 [cs.LG] 4 May 2023

Abstract
Machine Learning Health Operations (MLHOps) is the combination of processes for the reliable, efficient, usable, and ethical deployment and maintenance of machine learning models in healthcare settings. This paper provides both a survey of work in this area and guidelines for developers and clinicians to deploy and maintain their own models in clinical practice. We cover the foundational concepts of general machine learning operations and describe the initial setup of MLHOps pipelines (including data sources, preparation, engineering, and tools). We then describe long-term monitoring and updating (including data distribution shifts and model updating) and ethical considerations (including bias, fairness, interpretability, and privacy). This work therefore provides guidance across the full MLHOps pipeline, from conception to initial and ongoing deployment.
Keywords: MLOps, Healthcare, Responsible AI

1. Introduction
Over the last decade, efforts to use health data to solve complex medical problems have increased significantly. Academic hospitals are increasingly dedicating resources to bringing machine learning (ML) to the bedside and to addressing issues encountered by clinical staff. These resources are being utilized across a range of applications including clinical decision support, early warning, treatment recommendation, risk prediction, image informatics, telediagnosis, drug discovery, and intelligent health knowledge systems.



There are various examples of ML being applied to medical data, including
prediction of sepsis [239], in-hospital mortality, prolonged length-of-stay, pa-
tient deterioration, and unplanned readmission [218]. In particular, sepsis is
one of the leading causes of in-hospital deaths. A large-scale study demon-
strated the impact of an early warning system to reduce the lead time for
detecting the onset of sepsis, and hence allowing more time for clinicians
to prescribe antibiotics [8]. Similarly, deep convolutional neural networks
have been shown to achieve superior performance in detecting pneumonia
and other pathologies from chest X-rays, compared to practicing radiologists
[219]. These results highlight the potential of ML models when they are
strongly integrated into clinical workflows.

When deployed successfully, data-driven models can free time for clinicians [109],
improve clinical outcomes [217], reduce costs [28], and provide improved qual-
ity care for patients. However, most studies remain preliminary, limited to
small datasets, and/or implemented in select health sub-systems. Integrat-
ing with clinical workflows remains crucial [278, 266] but, despite recent
computational advances and an explosion of health data, deploying ML in
healthcare responsibly and reliably faces several operational and engineering
challenges, including:

• Standardizing data formats,


• Strengthening methodologies for evaluation, monitoring and updating,
• Building trust with clinicians and hospital staff,
• Adopting interoperability standards, and
• Ensuring that deployed models align with ethical considerations, do
not exacerbate biases, and adhere to privacy and governance policies

In this review, we articulate the challenges involved in implementing successful Machine Learning Health Operations (MLHOps) pipelines, specific to clinical use cases. We begin by outlining the foundations of model deployment in general and provide a comprehensive study of the emerging discipline [251, 167]. We then provide a detailed review of the different components of development pipelines specific to healthcare. We discuss data, pipeline engineering, deployment, monitoring and updating models, and ethical considerations pertaining to healthcare use cases. While MLHOps often requires aspects specific to healthcare, best practices and concepts from other application domains are also relevant. The primary outcome of our review is a set of recommendations for implementing MLHOps pipelines in practice – i.e., a “how-to” guide for practitioners.

2. Foundations of MLOps
2.1. What is MLOps?
Machine learning operations (MLOps) is a combination of tools, techniques,
standards, and engineering best practices to standardize ML system devel-
opment and operations [251]. It is used to streamline and automate the
deployment, monitoring, and maintenance of machine learning models, in
order to ensure they are robust, reliable, and easily updated or upgraded.

2.2. MLOps Pipeline

Pipelines are processes of multiple modules that streamline the ML workflow. Once the project is defined, the MLOps pipeline begins with identifying the inputs and outputs relevant to the problem; cleaning and transforming the data into useful and efficient representations for machine learning; training and evaluating model performance; and deploying selected models in production while continuing to monitor their performance. Figure 1 illustrates a general MLOps pipeline. Common types of pipelines include:

• Automated pipelines: An end-to-end pipeline that is automated towards a single task, e.g., a model training pipeline.

• Orchestrated pipelines: A pipeline that consists of multiple modules, designed for several automated tasks, and managed and coordinated in a dynamic workflow, e.g., the pipeline managing MLOps.

Recently, MLOps has become more well-defined and widely implemented due
to the reusability and standardization benefits across various applications
[229]. As a result, the structure and definitions of different components are
becoming quite well-established.
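To make the notion of a pipeline concrete, the minimal sketch below composes a few toy steps in plain Python so that the output of each step feeds the next. The step functions, data, and path are illustrative placeholders; a real deployment would express these steps in a workflow orchestrator rather than a simple loop.

```python
# A minimal sketch of composing pipeline steps; no specific orchestrator is assumed.
from typing import Callable, Iterable

def ingest(raw_path: str) -> list[dict]:
    # Placeholder ingestion step: would normally read raw records from storage.
    return [{"age": 63, "hr": 88}, {"age": 47, "hr": 102}]

def clean(records: list[dict]) -> list[dict]:
    # Drop records with implausible heart rates.
    return [r for r in records if 20 <= r["hr"] <= 250]

def featurize(records: list[dict]) -> list[list[float]]:
    # Turn each record into a numeric feature vector.
    return [[float(r["age"]), float(r["hr"])] for r in records]

def run_pipeline(steps: Iterable[Callable], initial):
    # Generic runner: each step's output becomes the next step's input.
    data = initial
    for step in steps:
        data = step(data)
    return data

features = run_pipeline([ingest, clean, featurize], "data/raw/")
print(features)
```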

2.3. MLOps Components

MLOps pipelines consist of different components and key concepts [134, 108], stated below (and shown in Figure 1):

Figure 1: MLOps pipeline

• Stores: Stores encapsulate the tools designed to centralize building, managing, and sharing either features or models across different teams and applications in an organization.

  – Raw data store: A raw data store is a centralized repository that stores data in its raw, unprocessed form. It is a staging area where data is initially collected and stored before processing or transformation.

  – Feature store: A centralized online repository for storing, managing, and sharing features used in ML models. These features are acquired by processing the raw data and are made available for real-time serving through the feature store.

  – ML metadata store: An ML metadata store helps record and retrieve metadata associated with an ML pipeline, including information about various pipeline components, their executions (e.g., training runs), and resulting artifacts (e.g., trained models). A minimal sketch of such metadata and registry records is given after this list.

• Serving: Serving is the task of hosting ML artifacts (usually models) either on the cloud or on-premise so that their functions are accessible to multiple applications through remote function calls (i.e., application programming interfaces (APIs)).

  – In batch serving, the artifact is used by scheduled jobs.

  – In online serving, the artifact processes requests in real-time. Communication and access point channels, traffic management, pre- and post-processing of requests, and performance monitoring should all be considered while serving artifacts.

• Data query: This component queries the data, processes it, and stores it in a format that models can easily utilize.

• Experimentation: The experimentation component consists of model training, model evaluation, and model validation.

• Model registry: The model registry is a centralized repository that stores trained machine learning models, their metadata, and their versions.

• Drift detection: The drift-detection component is responsible for monitoring the AI system for potentially harmful drift and issuing an alert when drift is detected.

• Workflow orchestration: The workflow orchestration component is responsible for automating and managing the end-to-end flow of the ML pipeline.

• Source repository: The source repository is a centralized code repository that stores the source code (and its history) for ML models and related components.

• Containerization: Containerization involves packaging models with the components required to run them, including libraries and frameworks, so they can run in isolated user spaces with minimal configuration of the underlying operating system [86]. Sometimes, source code is also included in these containers.
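As a concrete illustration of the kind of information a metadata store and model registry track, the sketch below defines two toy record types in Python. The field names (run_id, artifact_uri, stage, etc.) and values are hypothetical and not tied to any particular MLOps product.

```python
# A minimal, illustrative sketch of metadata-store and model-registry records.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class TrainingRun:
    run_id: str
    pipeline_version: str       # links the run to the pipeline code that produced it
    data_snapshot: str          # pointer to the exact training data used
    metrics: dict = field(default_factory=dict)
    started_at: datetime = field(default_factory=datetime.now)

@dataclass
class RegisteredModel:
    name: str
    version: int
    run_id: str                 # provenance: which training run produced this artifact
    artifact_uri: str           # where the serialized model lives (e.g., object storage)
    stage: str = "staging"      # e.g., staging -> production

run = TrainingRun(run_id="run-001", pipeline_version="v1.2.0",
                  data_snapshot="s3://bucket/ehr/2023-01-01", metrics={"auroc": 0.87})
model = RegisteredModel(name="sepsis-risk", version=3, run_id=run.run_id,
                        artifact_uri="s3://bucket/models/sepsis-risk/3")
print(model)
```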

2.4. Levels of MLOps maturity
MLOps practices can be divided into different levels based on the maturity
of the ML system automation process [118, 251], as described below.

• Level 0 – Manual ML pipeline: Every step in the ML pipeline, including data processing, model building, evaluation, and deployment, is a manual process. In Level 0, the experimental and operational pipelines are distinct, and the data scientists provide a trained model as an artifact to the engineering team to deploy on their infrastructure. Here, only the trained model is served for deployment and there are infrequent model updates. Level 0 processes typically lack rigorous and continuous performance monitoring capabilities.

• Level 1 – Continuous Model Training and Delivery: Here, the entire ML pipeline is automated to perform continuous training of the model as well as continuous delivery of model prediction services. Software orchestrates the execution and transition between the steps in the pipeline, leading to rapid iteration over experiments and an automatic process for deploying a selected model into production. Contrary to Level 0, the entire training pipeline is automated, and the deployed model can incorporate newer data based on pipeline triggers. Given the automated nature of Level 1, it is necessary to continuously monitor, evaluate, and validate models and data to ensure expected performance during production.

• Level 2 – Continuous Integration and Continuous Delivery: This level involves the highest maturity in automation, enforcing the combined practice of continuous integration and delivery, which enables rapid and reliable updates of the pipelines in production. Through automated testing and deployment of new pipeline implementations, rapid changes in data and the business environment can be addressed. At this level, the pipeline and its components are automatically built, tested, and packaged when new code is committed or pushed to the source code repository. Moreover, the system continuously delivers new pipeline implementations to the target environment, which in turn delivers prediction services of the newly trained model.

Ultimately, the implementation of MLOps leads to many benefits, including better system quality, increased scalability, simplified management processes, improved governance and compliance, cost savings, and improved collaboration.

3. MLHOps Setup
Operationalizing ML models in healthcare differs from other application domains. Decisions made in clinical environments have a direct impact on
patient outcomes and, hence, the consequences of integrating ML models into
health systems need to be carefully controlled. For example, early warning
systems might enable clinicians to prescribe treatment plans with increased
lead time [109]; however, these systems might also suffer from a high false
alarm rate, which could result in alarm fatigue and possibly worse outcomes.
The requirements placed on such ML systems are therefore very high and,
if they are not adequately satisfied, the result is diminished adoption and
trust from clinical staff. Rigorous long-term evaluation is needed to validate
the efficacy and to identify and assess risks, and this evaluation needs to be
reported comprehensively and transparently [265].

While most MLOps best practices extend to healthcare settings, the data,
competencies, tools, and model evaluation differ significantly [179, 172, 255,
17]. For example, typical performance metrics (e.g., positive predictive value
and F1-scores) may differ between clinicians and engineers. Therefore, unlike
in other industries, it becomes necessary to evaluate physician experience
when predictions and model performance are presented to clinical staff [272].
In order to build trust in the clinical setting, the interpretability of ML
models is also exceptionally important. As more ML models are integrated
into hospitals, new legal frameworks and standards for evaluation need to be
adopted, and MLHOps tools need to comply with existing standards.
In the following sections, we explore the different components of MLHOps
pipelines.

3.1. Data
Successfully digitizing health data has resulted in a prodigious increase in the
volume and complexity of patient data collected [218]. These datasets are
now stored, maintained, and processed by hospital IT infrastructure systems
which in turn use specialized software systems.

3.1.1. Data sources
There could be multiple sources of data, which are categorized as follows:
Electronic health records (EHRs) record, analyze, and present information
to clinicians, including:

1. Patient demographic data: E.g., age and sex.
2. Administrative data: E.g., treatment costs and insurance.
3. Patient observation records: E.g., chart events such as lab tests
and vitals. These include a multitude of physiological signals, such as
heart rate, blood pressure, skin temperature, and respiratory rate,
captured using various methods.
4. Interventions: These are steps that significantly alter the course of
patient care, such as mechanical ventilation, dialysis, or blood transfu-
sions.
5. Medications information: E.g., medications administered and their
dosage.
6. Waveform data: This digitizes physiological signals collected from
bedside patient monitors.
7. Imaging reports and metadata: E.g., CT scans, MRI, ultrasound,
and corresponding radiology reports.
8. Medical notes: These are made by clinical staff on patient condition.
These can also be transcribed text of recorded interactions between the
patient and clinician.

Other sources of health data include primary care data, wearable data (e.g.,
smartwatches), genomics data, video data, surveys, medical claims, billing
data, registry data, and other patient-generated data [216, 30, 45].

Figure 2 illustrates the heterogeneous nature of health data. The stratification shown can be extended further to contain more specialized data. For example, genomics data can be further stratified into different types of data based on the method of sequencing; observational EHR data can be further stratified to include labs, vital measurements, and other recorded observations.

Figure 2: Stratification of health data. Further levels of stratification can be extended as the data becomes richer. For example, observational EHR data could include labs, vital measurements, and other recorded observations.

With such large volumes and variability in data, standardization is key to achieving scalability and interoperability. Figure 3 illustrates the different levels of standardization that need to be achieved with respect to health data.

3.1.2. Common Data Model (CDM)


Despite the widespread adoption of EHR systems, clinical events are not cap-
tured in a standard format across observational databases [195]. For effective
research and implementation, data must be drawn from many sources and
compared and contrasted to be fully understood.

Databases must also support scaling to large numbers of records which can be processed concurrently. Hence, efficient storage systems along with computational techniques are needed to facilitate analyses. One of the first steps towards scalability is to transform the data to a common data standard.

Figure 3: The hierarchy of standardization that common data models and open standards for interoperability address. The lowest level is about achieving standardization of variable names, such as lab test names, medications, and diagnosis codes, as well as the data types used to store these variables (i.e., integer vs. character). The next level is about having abstract concepts such that data can be mapped and grouped under these concept definitions. The top level of standardization is about data exchange formats (e.g., JSON, XML) along with protocols for information exchange, like supported RESTful API architectures. This level addresses questions of interoperability and how data can be exchanged across sites and EHR systems.


Once available in a common format, the process of extracting, transforming,
and loading (ETL) becomes simplified. In addition to scale, patient data
require a high level of protection with strict data user agreements and ac-
cess control. A common data model addresses these challenges by allowing
for downstream functional access points to be designed independent of the
data model. Data that is available in a common format promotes collabora-
tion and mitigates duplicated effort. Specific implementations and formats
of data should be hidden from users, and only high-level abstractions need
to be visible.

The Systematized Nomenclature of Medicine (SNOMED) was among the first efforts to standardize clinical terminology, and a corresponding dictionary with a broad range of clinical terminology is available as part of SNOMED-CT [67]. Several data models use SNOMED-CT as part of their core vocabulary. Converting datasets to a common data model like the Observational Medical Outcomes Partnership (OMOP) model involves mapping from a source database to the target common data model. This process is usually time-consuming and involves substantial manual effort by data scientists. Tools that simplify the mapping and conversion process can save time and effort and promote adoption. For OMOP, the ATLAS tool [195] developed by Observational Health Data Sciences and Informatics (OHDSI) provides such a feature through its web-based interactive analysis platform.
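To illustrate one small piece of such an ETL, the sketch below maps source lab names to standardized concept identifiers with a lookup table and flags unmapped rows for manual review. The mapping values are placeholders rather than real OMOP concept IDs; production conversions rely on curated vocabularies and tools such as ATLAS.

```python
# A minimal sketch of source-to-concept mapping during ETL into a common data model.
import pandas as pd

# Hypothetical mapping (would normally come from a curated vocabulary table).
source_to_concept = {
    "GLUCOSE_SERUM": 1001,   # placeholder concept_id
    "HEMOGLOBIN": 1002,      # placeholder concept_id
}

source_labs = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "lab_name": ["GLUCOSE_SERUM", "HEMOGLOBIN", "GLUCOSE_SERUM"],
    "value": [5.4, 140.0, 6.1],
})

# Map source lab names to standard concept IDs; unmapped rows are flagged for review.
source_labs["concept_id"] = source_labs["lab_name"].map(source_to_concept)
unmapped = source_labs[source_labs["concept_id"].isna()]
print(source_labs)
print(f"{len(unmapped)} rows need manual mapping review")
```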

3.1.3. Interoperability and open standards


As the volume of data grows in healthcare institutions and applications in-
gest data for different use cases, real-time performance and data management
is crucial. To enable real-time operation and easy exchange of health data
across systems, an interoperability standard for data exchange along with
protocols for accessing data through easy-to-use programming interfaces is
necessary. Some of the popular healthcare data standards include Health
Level 7 (HL7), Fast Healthcare Interoperability Resources (FHIR), Health
Level 7 v2 (HL7v2), and Digital Imaging and Communications in Medicine
(DICOM).

The FHIR standard [31] is a leading open standard for exchanging health
data. FHIR is developed by Health Level 7 (HL7), a not-for-profit stan-
dards development organization that was established to develop standards
for hospital information systems. FHIR defines the key entities involved
in healthcare information exchange as resources, where each resource is a
distinct identifiable entity. FHIR also defines APIs which conform to the
representational state transfer (REST) architectural style for exchanging re-
sources, allowing for stateless Hypertext Transfer Protocol (HTTP) methods,
and exposing directory-structure like URIs to resources. RESTful architec-
tures are light-weight interfaces that allow for faster transmission, which is
more suitable for mobile devices. RESTful interfaces also facilitate faster
development cycles because of their simple structure.
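As an illustration of the RESTful pattern FHIR defines, the sketch below performs a read interaction (GET [base]/Patient/[id]) against a placeholder server URL. The base address and patient ID are assumptions, and real deployments would add authentication and error handling appropriate to the institution.

```python
# A minimal sketch of reading a FHIR resource over its RESTful API.
import requests

FHIR_BASE = "https://fhir.example-hospital.org/R4"   # placeholder server URL

def get_patient(patient_id: str) -> dict:
    # FHIR read interaction: GET [base]/[resource-type]/[id], returning JSON.
    response = requests.get(
        f"{FHIR_BASE}/Patient/{patient_id}",
        headers={"Accept": "application/fhir+json"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    patient = get_patient("12345")            # hypothetical patient ID
    print(patient.get("resourceType"), patient.get("birthDate"))
```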

DICOM is the standard for the communication and management of medical imaging information and related metadata. The DICOM standard specifies the format and protocol for the exchange of digital information between medical imaging equipment and other systems. Persistent information objects which encode images are exchanged, and an instance of such an information object may be exchanged across many systems and many organizational contexts, and over time. DICOM has enabled deep collaboration and standardization across different disciplines such as radiology, cardiology, pathology, ophthalmology, and related disciplines.

3.1.4. Quality assurance and validation
Data collected in retrospective databases for analysis and ML use cases need
to be checked for quality and consistency. Data validation is an important
step towards ensuring that ML systems developed using the data are highly
performant, and do not incorporate biases from the data. Errors in data
propagate through the MLOps pipeline and hence specialized data quality
assurance tools and checks at various stages of the pipeline are necessary
[223]. A standardized data validation framework that includes i) data ele-
ment pre-processing, ii) checks for completeness, conformance, and plausi-
bility, and iii) a review process by clinicians and other stakeholders should
capture generalizable insight across various clinical investigations [238].
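The sketch below shows what minimal completeness and plausibility checks of this kind might look like on a toy vitals table. The thresholds and physiological ranges are illustrative assumptions; failed checks would be routed to clinician and stakeholder review as described above.

```python
# A minimal sketch of completeness and plausibility checks on clinical data.
import pandas as pd

def check_completeness(df: pd.DataFrame, column: str, max_missing_frac: float = 0.2) -> bool:
    # Completeness: flag columns with too many missing values.
    return df[column].isna().mean() <= max_missing_frac

def check_plausibility(df: pd.DataFrame, column: str, low: float, high: float) -> bool:
    # Plausibility: flag values outside a clinically sensible range (assumed ranges).
    values = df[column].dropna()
    return values.between(low, high).all()

vitals = pd.DataFrame({"heart_rate": [72, 85, None, 300], "temp_c": [36.6, 37.2, 38.1, 36.9]})

report = {
    "heart_rate_complete": check_completeness(vitals, "heart_rate"),
    "heart_rate_plausible": check_plausibility(vitals, "heart_rate", low=20, high=250),
    "temp_c_plausible": check_plausibility(vitals, "temp_c", low=30, high=43),
}
print(report)   # failed checks would trigger a review process
```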

3.2. Pipeline Engineering


Data stored in raw formats need to be processed to create feature represen-
tations for ML models. Each transformation is a computation, and a chain
of these processing elements, arranged so that the output of each element is
the input of the next, constitutes a pipeline [134] and using software tools
and workflow practices that enable such pipelines is pipeline engineering.

There are advantages to using such a pipeline approach, including:

• Modularization: By breaking the chain of transformations into small steps, modularization is naturally achieved.

• Testing: Each transformation step can be tested independently, which facilitates quality assurance and testing.

• Debugging: Version controlling the outputs at each step makes it easier to ensure reproducibility, especially when many steps are involved.

• Parallelism: If any step in the pipeline is easily parallelizable across multiple compute nodes, the overall processing time can be reduced.

• Automation: By breaking a complex task into a series of smaller tasks, the completion of each task can be used to trigger the start of the next task, and this can be automated using continuous integration tools such as Jenkins, GitHub Actions, and GitLab CI.

In health data processing, the following steps are crucial (a minimal sketch follows the list):

1. Cleaning: Formatting values, adjusting data types, and checking and fixing issues with raw data.

2. Encoding: Computing word embeddings for clinical text, encoding the text and raw values into embeddings [127, 15]. Encoding is a general transformation step that can be used to create vector representations of raw data. For example, transforming images to numeric representations can also be considered encoding.

3. Aggregation: Grouping values into buckets, e.g., aggregating measurements into fixed time intervals, or grouping values by patient ID.

4. Normalization: Normalizing values into standard ranges or using statistics of the data.

5. Imputation: Handling missing values in the data. For various clinical data, ‘missingness’ can actually provide valuable contextual information about the patient’s health and needs to be handled carefully [47].
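The following sketch applies cleaning, aggregation, normalization, and imputation to a toy lab-measurement table in pandas. The column names, hourly bucketing, and forward-fill imputation are illustrative choices, not recommendations for a particular clinical dataset.

```python
# A minimal pandas sketch of the processing steps listed above.
import pandas as pd

labs = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "charttime": pd.to_datetime(["2023-01-01 08:05", "2023-01-01 09:50",
                                 "2023-01-01 08:30", "2023-01-01 11:10"]),
    "creatinine": ["1.1", "1.3", None, "0.9"],
})

# Cleaning: fix data types on the raw values.
labs["creatinine"] = pd.to_numeric(labs["creatinine"], errors="coerce")

# Aggregation: bucket measurements by patient and hour.
labs["hour"] = labs["charttime"].dt.hour
hourly = labs.groupby(["patient_id", "hour"], as_index=False)["creatinine"].mean()

# Normalization: standardize using statistics of the data.
mean, std = hourly["creatinine"].mean(), hourly["creatinine"].std()
hourly["creatinine_z"] = (hourly["creatinine"] - mean) / std

# Imputation: keep a missingness indicator (missingness itself can be informative),
# then forward-fill within each patient.
hourly["creatinine_missing"] = hourly["creatinine"].isna().astype(int)
hourly["creatinine_z"] = hourly.groupby("patient_id")["creatinine_z"].ffill()
print(hourly)
```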

Multiple data sources such as EHR data, clinical notes and text, imaging
data, and genomics data can be processed independently to create features
and they can be combined to be used as inputs to ML models. Hence, com-
posing pipelines of these tasks facilitates component reusability [115]. Fur-
thermore, since the ML development life-cycle constitutes a chain of tasks,
the pipelining approach becomes even more desirable. Some of the high
level tasks in the MLHOps pipeline include feature creation, feature selec-
tion, model training, evaluation, and monitoring. Evaluating models across
different slices of data, hyper-parameters, and other confounding variables is
necessary for building trust.

Table 7 lists popular open-source tools and packages specific to health data
and ML processing. These tools are at different stages of development and
maturity. Some examples of popular tools include MIMIC-Extract [273],
Clairvoyance [115] and CheXstray [245].

3.3. Modelling
At this stage, the data has been collected, cleaned, and curated, ready to be fed to the ML model to accomplish the desired task. The modelling phase involves choosing the available models that fit the problem, training and testing the models, and choosing the model with the best performance and reliability guarantees. Given the existence of numerous surveys summarizing machine learning and deep learning algorithms for general healthcare scenarios [74, 1], as well as specific use cases such as brain tumor detection [18], COVID-19 prevention [26], and clinical text representation [127], we omit this discussion and let the reader explore the surveys relevant to their prediction problem.

3.4. Infrastructure and System


Hospitals typically use models developed by their EHR vendor which are
deployed through the native EHR vendor configuration. Often, inference is
run locally or in a cloud instance, and the model outputs are communicated
within the EHR [124]. Predominantly, these models are pre-trained and
sometimes fine-tuned on the specific site’s data.
A feature store is a ML-specific data system used to centralize storage, pro-
cessing, and access to frequently used features, making them available for
reuse in the development of future machine learning models. Feature stores
operationalize and streamline the input, tracking, and governance of the data
as part of feature engineering for machine learning [134].
To ensure reliability, the development, staging, and production environments
are separated and have different requirements. The staging and production
environments typically consist of independent virtual machines with ade-
quate compute and storage, along with reliable and secure connections to
the databases.
The infrastructure and software systems also have to follow and comply with
cybersecurity, medical software design and software testing standards [65].

3.4.1. Roles and Responsibilities


Efficient and successful MLHOps requires a collaborative, interdisciplinary team across a range of expertise and competencies commonly found in data science, ML, software, operations, production engineering, medicine, and privacy capabilities [134]. Similar to general MLOps practices, data and ML scientists; data, DevOps, and ML engineers; solution and data architects; ML and software full-stack developers; and project managers are needed. In addition, the following roles are required, which are distinct to healthcare (for more general MLOps roles see Table 5):

• Health AI Project Managers: Responsibilities include planning projects, establishing guidelines, milestone tracking, managing risk, supporting the teams, and governing partnerships with collaborators from other health organizations.

• Health AI Implementation Coordinator: Liaison who engages with key stakeholders to facilitate the implementation of clinical AI systems.

• Healthcare Operations Manager: Oversees and coordinates quality management, resource management, process improvement, and patient safety in clinical settings like hospitals.

• Clinical Researchers & Scientists: Domain experts who provide critical domain-specific knowledge relevant to model development and implementation.

• Patient-Facing Practitioners: Responsibilities include providing system requirements, pipeline usage feedback, and perspective about the patient experience (e.g., clinicians, nurses).

• Ethicists: Provide support regarding ethical implications of clinical AI systems.

• Privacy Analysts: Provide assessments regarding privacy concerns pertaining to the usage of patient data.

• Legal Analysts: Work closely with privacy analysts and ethicists to evaluate the legal vulnerabilities of clinical AI systems.

3.5. Reporting Guidelines

Many clinical AI systems do not meet reporting standards because of a failure to assess for poor-quality or unavailable input data, insufficient analysis of performance errors, or a lack of information regarding code or algorithm availability [208]. Systematic reviews of clinical AI systems suggest there is a substantial reporting burden, and additions regarding reliability and fairness can improve reporting [164]. As a result, guidelines informed by challenges in existing AI deployments in health settings have become imperative [57]. Reporting guidelines including CONSORT-AI [158], DECIDE-AI [265], and SPIRIT-AI [225] were developed by a multidisciplinary group of international experts using the Delphi process to ensure complete and transparent reporting of randomized clinical trials (RCTs) that evaluate interventions with an AI model. Broadly, these guidelines suggest inclusion of the following criteria [65]:

• Intended use: Inclusion of the medical problem and context, current standard practice, intended patient population(s), how the AI system will be integrated into the care pathway, and the intended patient outcomes.

• Patient and user recruitment: Well-defined inclusion and exclusion criteria.

• Data and outcomes: The use of a representative patient population, data coding and processing, missing- and low-quality-data handling, and sample size considerations.

• Model: Inclusion of inputs, outputs, training, model selection, parameter tuning, and performance.

• Implementation: Inclusion of user experience with the AI system, user adherence to intended implementation, and changes to clinical workflow.

• Modifications: A protocol describing changes made, the timing and rationale for modifications, and outcome changes after each modification.

• Safety and errors: Identification of system errors and malfunctions, anticipated risks and mitigation strategies, undesirable outcomes, and worst-case scenarios.

• Ethics and fairness: Inclusion of subgroup analyses and fairness metrics.

• Human-computer agreement: Report of user agreement with the AI system, reasons for disagreement, and cases of users changing their mind based on the AI system.

• Transparency: Inclusion of data and code availability.

• Reliability: Inclusion of uncertainty measures and performance against realistic baselines.

• Generalizability: Inclusion of measures taken to reduce overfitting, and external performance evaluations.

Table 1: MLOps tools

Category | Description | Tooling Examples
Model metadata storage and management | Section 3.1 | MLFlow, Comet, Neptune
Data and pipeline versioning | Section 3.2 | DVC, Pachyderm
Model deployment and serving | Section 3.3 | DEPLOYR [59], Flyte, ZenML
Production model monitoring | Section 4 | MetaFlow, Kedro, Seldon Core
Run orchestration and workflow pipelines | Orchestrating the execution of preprocessing, training, and evaluation pipelines. Sections 3.4 & 3.5 | Kubeflow, Polyaxon, MLRun
Collaboration tools | Setting up an MLOps pipeline requires collaboration between different people. Section 3.4.1 | ChatOps, Slack, Trello, GitLab, Rocket Chat

3.5.1. Tools and Frameworks

Understanding the MLOps pipeline and the required expertise is just the first step to addressing the problem. Once this has been accomplished, it is necessary to create and/or adopt appropriate tooling to transform these principles into practice. There are seven broad categories of MLOps tools, each automating a different phase of the workflows involved in MLOps processes; a compiled list of tools within each category is shown in Table 1.

4. MLHOps Monitoring and Updating
Once an MLHOps pipeline and required resources are set up and deployed, robust monitoring protocols are crucial to the safety and longevity of clinical AI systems. For example, inevitable updates to a model can introduce various operational issues (and vice versa), including bias (e.g., a new hospital policy that shifts the nature of new data) and new classes (e.g., new subtypes in a disease classifier) [287]. Incorporating expert labels can improve model performance; however, the time, cost, and expertise required to acquire accurate labels for very large imaging datasets, like those used in radiology- or histology-based classifiers, makes this difficult [138].

As a result, there exist monitoring frameworks with policies to determine when to query experts for labels [300]. These include:

• Periodic Querying, a non-adaptive policy whereby labels are periodically queried in batches according to a predetermined schedule;

• Request-and-Reverify, which sets a predetermined threshold for drift and queries a batch of labels whenever the drift threshold is exceeded [288];

• MLDemon, which follows a periodic query cycle and uses a linear estimate of the accuracy based on changes in the data [90].

4.1. Time-scale windows

Monitoring clinical AI systems requires evaluating robustness to temporal shifts. Since the time-scale used can change the types of shifts detected (i.e., gradual versus sudden shifts), multiple time windows should be considered (e.g., week, month). Moreover, it is important to use both 1) cumulative statistics, which use a single time window and update at the beginning of each window, and 2) sliding statistics, which retain previous data and update with new data.
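The sketch below illustrates one possible reading of this distinction on a simulated stream of daily AUROC values: a cumulative mean over everything seen so far, refreshed once per window, alongside a sliding mean over only the most recent window. The window length, metric, and values are arbitrary choices for illustration.

```python
# A minimal sketch contrasting cumulative and sliding monitoring statistics.
from collections import deque

WINDOW = 7                                    # e.g., a 7-day window
all_scores: list[float] = []                  # everything observed so far
sliding_scores: deque = deque(maxlen=WINDOW)  # only the most recent WINDOW values

def log_daily_metric(day: int, score: float) -> None:
    all_scores.append(score)
    sliding_scores.append(score)
    if day % WINDOW == 0:
        # Cumulative statistic: covers all data so far, refreshed once per window.
        print(f"day {day}: cumulative mean = {sum(all_scores)/len(all_scores):.3f}")
    # Sliding statistic: recomputed with every new value over the latest window.
    print(f"day {day}: sliding mean = {sum(sliding_scores)/len(sliding_scores):.3f}")

# Simulated daily model performance (e.g., AUROC) drifting downward over two weeks.
for day, auroc in enumerate([0.85, 0.84, 0.86, 0.83, 0.80, 0.78, 0.75,
                             0.74, 0.72, 0.71, 0.70, 0.69, 0.69, 0.68]):
    log_daily_metric(day, auroc)
```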

4.2. Appropriate metrics

It is critical to choose evaluation and monitoring metrics suited to each clinical context. The quality of labels is highly dependent on the data from which they are derived and, as such, can possess inherent biases. For instance, sepsis labels derived from incorrect billing codes will inherently have a low positive predictive value (PPV). Moreover, clinical datasets are often imbalanced, consisting of far fewer positive instances of a label than negative ones. As a result, measures like accuracy that weigh positive and negative labels equally can be detrimental to monitoring. For instance, in the context of disease classification, it may be particularly important to monitor sensitivity, in contrast to more time-sensitive clinical scenarios like the intensive care unit (ICU), where false positives (FP) can have critical outcomes [20].
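The toy example below shows how accuracy can look reassuring on an imbalanced label while sensitivity exposes missed cases; the label counts are made up purely for illustration.

```python
# A minimal sketch: accuracy vs. sensitivity (recall) and PPV (precision) on imbalanced labels.
from sklearn.metrics import accuracy_score, recall_score, precision_score

# Toy labels: 2 positives out of 20 patients; the model misses one positive case.
y_true = [0]*18 + [1, 1]
y_pred = [0]*18 + [1, 0]

print("accuracy:", accuracy_score(y_true, y_pred))      # 0.95, looks strong
print("sensitivity:", recall_score(y_true, y_pred))     # 0.50, half the cases missed
print("PPV:", precision_score(y_true, y_pred))          # 1.00 here, but drops as FPs grow
```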

4.3. Detecting data distribution shift

Data distribution shift occurs when the underlying distribution of the training data used to build an ML model differs from the distribution of data applied to the model during deployment [214]. When the difference between the probability distributions of these data sets is sufficient to deteriorate the model’s performance, the shift is considered malignant.
In healthcare, there are multiple sources of data distribution shifts, many of which can occur concurrently [78, 248]. Common occurrences of malignant shifts include differences attributed to:

• Institution - These differences can arise when comparing teaching to non-teaching hospitals, government-owned to private hospitals, or general to specialized hospitals (e.g., paediatric, rehabilitation, trauma). These institutions can have differing local clinical practices, resource allocation schemes, medical instruments, and data-collection and processing workflows that can lead to downstream variation. This has previously been reported in pneumothorax classifiers when evaluated on external institutions [130].

• Behaviour - Temporal changes in behaviour at the systemic, physician, and patient levels are unavoidable sources of data drift. These changes include new healthcare reimbursement incentives, changes in the standard of care in medical practice, novel therapies, and updates to hospital operational processes. An example of this is the COVID-19 pandemic, which required changes in resource allocation to cope with hospital bed shortages [132, 201].

• Patient demographics - Differences in factors like age, race, gender, religion, and socioeconomic background can arise for various reasons including epidemiological transitions, gentrification of neighbourhoods around a health system, and new public health and immigration policies. Distribution shifts due to demographic differences can disproportionately deteriorate model performance in specific patient populations. For instance, although Black women are more likely to develop breast tumours with poor prognosis, many breast mammography ML classifiers experience deterioration in performance on this patient population [284]. Similarly, skin-lesion classifiers trained primarily on images of lighter skin tones may show decreased performance when evaluated on images of darker skin tones [9, 69].

• Technology - Data shifts can be attributed to changes in technology between institutions or over time. This includes chest X-ray classifiers trained on portable radiographs that are evaluated on stationary radiographs, or deterioration of clinical AI systems across EHR systems (e.g., Philips CareVue vs. MetaVision) [188].

Although evaluated differently, data shifts are present across various modalities of clinical data, such as medical images [98] and EHR data [70, 201]. In order to effectively prevent these malignant shifts from occurring, it is necessary to perform prospective evaluation of clinical AI systems [303] to understand the circumstances under which they arise, and to design strategies that mitigate model biases and improve models for future iterations [290]. Broadly, these data shifts can be categorized into three groups, which can co-occur or lead to one another:

4.3.1. Covariate Shift

Covariate shift is a difference in the distribution of input variables between source and target data. It can occur due to a lack of randomness, inadequate sampling, biased sampling, or a non-stationary environment. This can be at the level of a single input variable (i.e., feature shift) or a group of input features (i.e., dataset shift). Table 2 contains a list of commonly used methods for covariate shift detection.

Feature Shift Detection: Feature shift refers to the change in distribution between the source and target data for a single input feature. Feature shift detection can be performed using two-sample univariate tests such as the Kolmogorov-Smirnov (KS) test [215]. Publicly available tools like TensorFlow Extended (TFX) apply univariate tests (i.e., L-infinity distance for categorical variables, Jensen-Shannon divergence for continuous variables) to perform feature shift detection between training and deployment data and provide users with summary statistics (Table 3). It is also possible to detect feature shift while conditioning on the other features in a model using conditional distribution tests [135].
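A minimal sketch of univariate feature shift detection with the two-sample KS test is shown below; the simulated age distributions and the significance threshold are illustrative assumptions.

```python
# A minimal sketch of feature shift detection with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
source_age = rng.normal(loc=60, scale=15, size=1000)   # training population
target_age = rng.normal(loc=68, scale=15, size=1000)   # older deployment population

statistic, p_value = ks_2samp(source_age, target_age)
alpha = 0.01                                            # illustrative threshold
print(f"KS statistic={statistic:.3f}, p={p_value:.2e}")
if p_value < alpha:
    print("Feature shift detected for 'age'; trigger review of downstream metrics.")
```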

Dataset Shift Detection: Dataset shift refers to the change in the joint distribution between the source and target data for a group of input features. Multivariate testing is crucial because inputs to ML models typically consist of more than one variable and multiple modalities. In order to test whether the distribution of the target data has drifted from the source data, two main approaches exist: 1) two-sample testing and 2) classifiers. These approaches often work better on low-dimensional data compared to high-dimensional data; therefore, dimensionality reduction is typically applied first [215]. For instance, variational autoencoders (VAE) have been used to reduce chest X-ray images to a low-dimensional space prior to two-sample testing [245]. In the context of medical images, including chest X-rays [211, 289], diabetic retinopathies [41], and histology slides [246], classifier methods have proven effective. For EHR data, dimensionality reduction using clinically meaningful patient representations has improved model performance [188]. For clinically relevant drift detection, it is important to ensure that drift metrics correlate well with ground-truth performance differences.
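The sketch below illustrates the classifier approach on synthetic low-dimensional features: a logistic regression is trained to distinguish source from target samples, and a cross-validated AUROC well above 0.5 suggests the joint distribution has shifted. The data, model choice, and 0.5 reference are illustrative assumptions; for images or raw EHR data, dimensionality reduction would typically precede this step.

```python
# A minimal sketch of classifier-based (domain classifier) dataset shift detection.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(500, 10))      # e.g., reduced EHR features, period 1
target = rng.normal(0.3, 1.2, size=(500, 10))      # e.g., reduced EHR features, period 2

X = np.vstack([source, target])
y = np.array([0]*len(source) + [1]*len(target))    # 0 = source, 1 = target

auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=5, scoring="roc_auc").mean()
print(f"domain-classifier AUROC = {auc:.3f} (about 0.5 means no detectable shift)")
```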

4.3.2. Concept Shift

Concept shift is a difference in the relationship (i.e., the conditional distribution) between the variables and the outcome in the source and target data. In healthcare, concept shift can arise due to changes in symptoms for a disease or antigenic drift. This has been explored in the context of surgery prediction [32] and medical triage for emergency and urgent care [112].

Concept Shift Detection: There are three broad categories of concept shift detection based on their approach.

1. Distribution techniques use a sliding window to divide the incoming data streams into windows based on data size or time interval, and compare the performance of the most recent observations with a reference window [84]. ADaptive WINdowing (ADWIN), and its extension ADWIN2, are windowing techniques which use the Hoeffding bound to examine the change between the means of two sufficiently large subwindows [106].

2. Sequential Analysis strategies use the Sequential Probability Ratio Test (SPRT) as the basis for their change detection algorithms. A well-known algorithm is CUSUM, which outputs an alarm when the mean of the incoming data significantly deviates from zero [29].

3. Statistical Process Control (SPC) methods track changes in the online error rate of classifiers and trigger an update process when there is a statistically significant change in error rate [163]. Some common SPC methods include: Drift Detection Method (DDM), Early Drift Detection Method (EDDM), and Local Drift Detection (LLDD) [23].

Table 2: Covariate shift detection methods (c: categorical; n: numeric; 2-ST: two-sample test)

Method | Shift | Test Type
L-infinity distance | Feature (c) | 2-ST
Cramér-von Mises | Feature (c) | 2-ST
Fisher’s Exact Test | Feature (c) | 2-ST
Chi-Squared Test | Feature (c) | 2-ST
Jensen-Shannon divergence | Feature (n) | 2-ST
Kolmogorov-Smirnov [174] | Feature (n) | 2-ST
Feature Shift Detector [135] | Feature | Model
Maximum Mean Discrepancy (MMD) [93] | Dataset | 2-ST
Least Squares Density Difference [37] | Dataset | 2-ST
Learned Kernel MMD [155] | Dataset | 2-ST
Context Aware MMD [56] | Dataset | 2-ST
MMD Aggregated [236] | Dataset | 2-ST
Classifier [161] | Dataset | Model
Spot-the-diff [117] | Dataset | Model
Model Uncertainty [240] | Dataset | Model
Mahalanobis distance [222] | Dataset | Model
Gram matrices [202, 234] | Dataset | Model
Energy Based Test [157] | Dataset | Model
H-Divergence [299] | Dataset | Model

4.3.3. Label Shift
Label shift is a difference in the distribution of class variables in the outcome between the source and target data. Label shift may appear when some concepts are under-sampled or over-sampled in the target domain compared to the source domain: class proportions differ between the source and target, but the feature distributions of each class do not. For instance, in the context of disease diagnosis, a classifier trained to predict disease occurrence is subject to drift due to changes in the baseline prevalence of the disease across various populations.

Label Shift Detection: Label shift can be detected using moment-matching-based estimator methods that leverage model predictions, like Black Box Shift Estimation (BBSE) [151] and Regularized Learning under Label Shift (RLLS) [22]. Assuming access to a classifier that outputs the true source-distribution conditional probabilities p_s(y|x), Expectation Maximization (EM) algorithms like Maximum Likelihood Label Shift (MLLS) can also be used to detect label shift [87]. Furthermore, methods using bias-corrected calibration show promise in correcting label shift [14].
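The sketch below follows the spirit of a BBSE-style moment-matching estimate: the classifier's confusion behaviour on held-out source data is combined with its predicted label distribution on unlabeled target data to estimate how class prevalence has shifted. All numbers are synthetic, and the restriction to two classes and the hand-set target prediction rate are assumptions for illustration.

```python
# A minimal, BBSE-inspired sketch of estimating a shifted class prevalence.
import numpy as np

# Source held-out set: true labels and the (imperfect) classifier's predictions.
y_src = np.array([0]*800 + [1]*200)                        # 20% disease prevalence
y_hat = y_src.copy()
y_hat[np.random.default_rng(0).choice(1000, 100, replace=False)] ^= 1  # ~10% errors

# C[i, j] = P(y_hat = i, y = j), estimated on the source held-out set.
C = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        C[i, j] = np.mean((y_hat == i) & (y_src == j))

# On target data only predictions are observed; suppose the predicted positive rate rose.
mu_target = np.array([0.55, 0.45])                         # P(y_hat = 0), P(y_hat = 1)

# Solve C w = mu_target for importance weights w[j] = q(y=j) / p(y=j).
w = np.linalg.solve(C, mu_target)
p_src = np.array([0.8, 0.2])
q_target = w * p_src
print("estimated target prevalence:", q_target)            # a large change flags label shift
```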

4.4. Model Updating and Retraining

As the implementation of ML-enabled tools is realized in the clinic, there is a growing need for continuous monitoring and updating in order to improve models over time and adapt to malignant distribution shifts. Retraining of ML models has demonstrated improved model performance in clinical contexts like pneumothorax diagnosis [130]. However, proposed modifications can also degrade performance and introduce bias [149]; as a result, it may be preferable to avoid making a prediction and defer the decision to a downstream expert [186]. When defining a model updating or retraining strategy for clinical AI models, there are several factors to consider [279]; we outline the key criteria in this section.

4.4.1. Quality and Selection of Model Update Data

When updating a model, it is important to consider the relevance and size of the data to be used. This is typically done by defining a window of data to update the model: i) a fixed window uses a window that remains constant across time; ii) a dynamic window uses a window that changes in size in response to data shift; iii) a representative subsample uses a subsample from a window that is representative of the entire window distribution.

Table 3: List of open-source tools available on GitHub that can be used for ML monitoring and updating

Name of tool | Capabilities
Evidently | Interactive reports to analyze ML models during validation or production monitoring.
NannyML | Performance estimation and monitoring, data drift detection, and intelligent alerting for deployment.
River [185] | Online metrics, drift detection, and outlier detection for streaming data.
Seldon Core [262] | Serving, monitoring, explaining, and management of models using advanced metrics, explainers, and outlier detection.
TFX | Exploration and validation of data used for machine learning models.
TorchDrift | Covariate and concept drift detection.
deepchecks [54] | Testing for continuous validation of ML models and data.
EHR OOD Detection [258] | Uncertainty estimation, OOD detection, and (deep) generative modelling for EHRs.
Avalanche [160] | Prototyping, training, and reproducible evaluation of continual learning algorithms.
Giskard | Evaluation, monitoring, and drift testing.

4.4.2. Updating Strategies

There are several ways to update a model, including: i) Model recalibration is the simplest type of model update, where continuous scores (e.g., predicted risks) produced by the original model are mapped to new values [52]. Some common methods to achieve this include Platt scaling [209], temperature scaling, and isotonic regression [191]. ii) Model updating includes changes to an existing model, for instance, fine-tuning with regularization [139] or model editing, where pre-collected errors are used to train hypernetworks that can edit a model’s behaviour by predicting new weights or building a new classifier [182]. iii) Model retraining involves retraining a model from scratch or fitting an entirely different model.
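As an illustration of recalibration, the sketch below fits Platt scaling and isotonic regression to map a model's raw risk scores onto recent outcomes. The synthetic scores and labels, and the choice of scikit-learn estimators, are assumptions for the example rather than a prescribed recipe.

```python
# A minimal sketch of model recalibration with Platt scaling and isotonic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, size=500)                     # original model's predicted risks
# Recent outcomes: suppose true risk is systematically lower than the raw score.
labels = rng.binomial(1, p=np.clip(scores * 0.6, 0, 1))

# Platt scaling: a one-dimensional logistic regression on the scores.
platt = LogisticRegression().fit(scores.reshape(-1, 1), labels)
platt_calibrated = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Isotonic regression: a monotone, non-parametric mapping from score to probability.
iso = IsotonicRegression(out_of_bounds="clip").fit(scores, labels)
iso_calibrated = iso.predict(scores)

print("raw mean risk:", scores.mean().round(3))
print("Platt mean risk:", platt_calibrated.mean().round(3))
print("isotonic mean risk:", iso_calibrated.mean().round(3))
```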

4.4.3. Frequency of Model Updates
In practice, retraining procedures for clinical AI models have generally been locked after FDA approval [140] or confined to ad-hoc one-time updates [261, 104]. The timing of when it is necessary to update or retrain a model varies across use cases. As a result, it is imperative to evaluate the appropriate frequency with which to update a model. Strategies employed include: i) Periodic training on a regular schedule (e.g., weekly, monthly); ii) Performance-based triggers in response to a statistically significant change in performance; iii) Data-based triggers in response to a statistically significant data distribution shift; iv) Retraining on demand, which is not based on a trigger or regular schedule, and is instead initiated based on user prompts.

4.4.4. Continual Learning


Continual learning is a strategy used to update models when there is a con-
tinuous stream of input data that may be subject to changes over time. Prior
to deployment, it is crucial to simulate the online learning procedure on ret-
rospective data to assess robustness to data shifts [51] [198]. When models
are retrained on only the most recent data, this can result in “catastrophic
forgetting” [267] [140], in which the integration of new data into the model
can overwrite knowledge learned in the past and interfere with what the
model has already learned [138]. Contrastingly, procedures that retrain mod-
els on all previously collected data can fail to adapt to important temporal
shifts and are computationally expensive. More recently, strategies leverag-
ing multi-armed bandits have been utilized to select important samples or
batches of data for retraining [92] [301]. This is an important consideration
in healthcare contexts like radiology, where the labelling of new data can be
a time-consuming bottleneck [100] [206].

To ensure continual learning satisfies performance guarantees, hypothesis testing can be used for approving proposed modifications [62]. An effective approach for parametric models includes continual updating procedures like online recalibration/revision [76]. Strategies for continual learning can broadly be categorized into: 1) Parameter isolation, where changes to parameters that are important for the previous tasks are forbidden, e.g., Local Winner Takes All (LWTA) and Incremental Moment Matching (IMM) [260]; 2) Regularization methods, which build on the observation that forgetting can be reduced by protecting parameters that are important for the previous tasks, e.g., elastic weight consolidation (EWC) and Learning Without Forgetting (LWF); and 3) Replay-based approaches, which retain some samples from the previous tasks and use them for training or as constraints to reduce forgetting, e.g., episodic representation replay (ERR) [66]. Evaluation of several continual learning methods on ICU data across a large sequence of tasks indicates that replay-based methods achieve more stable long-term performance, compared to regularization and rehearsal-based methods [19]. In the context of chest X-ray classification, Joint Training (JT) has demonstrated superior model performance, with LWF as a promising alternative in the event that training data is unavailable at deployment [141]. For sepsis prediction using EHR data, a joint framework leveraging EWC and ERR has been proposed [16]. More recently, continual model editing strategies have shown promise in overcoming the limitations of continual fine-tuning methods by updating model behavior with minimal influence on unrelated inputs and maintaining upstream test performance [105].
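A highly simplified replay-based update loop is sketched below: a small buffer of earlier samples is mixed into each new training batch so that new data does not completely overwrite previously learned behaviour. The toy data, the SGD classifier, and the buffer size are arbitrary stand-ins, not a recommended configuration.

```python
# A minimal sketch of a replay-based continual learning update loop.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss", random_state=0)

replay_X, replay_y = [], []          # samples retained from earlier periods
BUFFER_PER_PERIOD = 50

for period in range(3):
    # New data for this period; the class boundary drifts slightly each period.
    X_new = rng.normal(period * 0.2, 1.0, size=(200, 5))
    y_new = (X_new[:, 0] + rng.normal(0, 0.5, 200) > period * 0.2).astype(int)

    # Mix replayed old samples with the new data before updating the model.
    if replay_X:
        X_train = np.vstack([X_new, np.vstack(replay_X)])
        y_train = np.concatenate([y_new, np.concatenate(replay_y)])
    else:
        X_train, y_train = X_new, y_new

    model.partial_fit(X_train, y_train, classes=[0, 1])
    print(f"period {period}: trained on {len(y_train)} samples "
          f"({len(y_train) - len(y_new)} replayed)")

    # Retain a small random subset of this period's data for future replay.
    keep = rng.choice(len(y_new), BUFFER_PER_PERIOD, replace=False)
    replay_X.append(X_new[keep])
    replay_y.append(y_new[keep])
```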

4.4.5. Domain Generalization and Adaptation


Broadly, domain generalization and adaptation methods are used to improve
clinical AI model stability and robustness to data shifts by reducing distri-
bution differences between training and test data [293] [95]. However, it is
critical to evaluate several methods over a range of metrics, as the effective-
ness of each method varies based on several factors including the type of shift
and data modality [285].

• Data-based methods perform manipulations of the patient data to minimize distribution shifts. This can be done by re-weighting observations during training based on the target domain [133], upsampling informative training examples [153], or leveraging a combination of labeled and pseudo-labeled data [147].

• Representation-based methods focus on achieving a feature representation such that the source classifier performs well on the target domain. In clinical data this has been explored using strategies including invariant risk minimization (IRM), distribution matching (e.g., CORAL), and domain-adversarial adaptation networks (DANN). DANN methods have demonstrated a reduction in the impact of data shift on cross-institutional transfer performance for diagnostic prediction [296]. However, it has been shown that for clinical AI models subject to real-life data shifts, in contrast to synthetic perturbations, empirical risk minimization outperforms domain generalization and unsupervised domain adaptation methods [97, 294].

• Inference-based methods introduce constraints on the optimization procedure to reduce domain shift [133]. This can be done by estimating a model’s performance on the “worst-case” distribution [249] or constraining the learning objective to enforce closeness between protected groups [237]. Batch normalization statistics can also be leveraged to build models that are more robust to covariate shifts [235].

4.4.6. Data Deletion and Unlearning


In healthcare there are two primary reasons for wanting to remove data
from models. Firstly, with the growing concerns around privacy and ML in
healthcare, it may become necessary to remove patient data for privacy rea-
sons. Secondly, it may also be beneficial to a model’s performance to delete
noisy or corrupted training data [35]. The naive approach to data deletion is to exclude unwanted samples and retrain the model from scratch on the remaining data; however, this approach can quickly become time-consuming and resource-intensive [114]. As a result, more sophisticated approaches have been proposed for unlearning in linear and logistic models [114], random forest models [36], and other non-linear models [96].

4.4.7. Feedback Loops


Feedback loops that incorporate patient outcomes and clinician decisions
are critical to improving outcomes in future model iterations. However, re-
training feedback loops can also lead to error amplification, and subsequent
downstream increases in false positives [6]. As a result, it is important to
consider model complexity and choose an appropriate classification threshold
to ensure minimization of error amplification [7].

5. Responsible MLHOps
AI has surged in healthcare, out of necessity [290, 199], but many issues still exist. For instance, many sources of bias exist in clinical data, large models are opaque, and there are malicious actors who damage or pollute AI/ML systems. In response, responsible AI and trustworthiness have together become a growing area of study [176, 264]. Responsible AI, or trustworthy MLOps, is defined as an ML pipeline that is fair and unbiased, explainable and interpretable, secure, private, reliable, robust, and resilient to attacks. In healthcare, trust is critical to ensuring a meaningful relationship between the healthcare provider and patient [63]. In this section, we discuss components of responsible and trustworthy AI [142], which can be applied to the MLHOps pipeline. In Section 5.1, we review the main concepts of responsible AI, and in Section 5.2 we explore how these concepts can be embedded in the MLHOps pipeline to enable safe deployment of clinical AI systems.

5.1. Responsible AI in healthcare


Ethics in healthcare:
Ethics in healthcare primarily consists of the following criteria [263]:

1. Nonmaleficence: Do not harm the patient.


2. Beneficence: Act to the benefit of the patient.
3. Autonomy: The patient (when able) should have the freedom to make
decisions about his/her body. More specifically, the following aspects
should be taken care of:
• Informed Consent: The patient (when able) should give in-
formed consent for any medical or surgical procedure, or for re-
search.
• Truth-telling: The patient (when able) should receive full dis-
closure to his/her diagnosis and prognosis.
• Confidentiality: The patient’s medical information should not
be disclosed to any third party without the patient’s consent.
4. Justice: Ensure fairness to the patient.

To supplement these criteria, guiding principles drawn from surgical settings


[152, 228] include:

5. Rescue: A patient surrenders to the healthcare provider’s expertise to


be rescued.
6. Proximity: The emotional proximity to the patient should be limited
to maintain self-preservation and stability in case of any failure.

7. Ordeal: A patient may have to face an ordeal (i.e., go through painful
procedures) in order to be rescued.

8. Aftermath: The physical and psychological aftermath that may occur


to the patient due to any treatment must be acknowledged.

9. Presence: An empathetic presence must be provided to the patient.

While some of these criteria relate to the humanity of the healthcare provider,
others relate to the following topics in ML models:

• Fairness involves the justice component in the healthcare domain [50].

• Interpretability & explainability relate to explanations and better


understanding of the ML models’ decisions, which can help in achieving
nonmaleficence, beneficence, informed consent, and truth-telling prin-
ciples in healthcare. Interpretability can help identify the reasons for a
given model outcome, which can help inform healthcare providers and
patients on how to respond accordingly [179].

• Privacy and security relate to confidentiality [126].

• Reliability, robustness, and resilience address rescue [227].

We discuss these concepts further in Sections 5.1.1, 5.1.2, 5.1.3 and 5.1.4.

5.1.1. Bias & Fairness


The fairness of AI-based decision support systems has been studied gener-
ally in a variety of applications including occupation classifiers [64], criminal
risk assessments algorithms [55], recommendation systems [71], facial recog-
nition algorithms [38], search engines [85], and risk score assessment tools in
hospitals [193]. In recent years, the topic of fairness in AI models in health-
care has received a lot of attention [193, 241, 137, 49, 278, 242]. Unfairness
in healthcare manifests as differences in model performance against or in
favour of a sub-population for a given predictive task; for instance, dispro-
portionate performance differences exist for disease diagnosis in Black versus
White patients [241].

5.1.1.1. Causes
A lack of fairness in clinical AI systems may be a result of various contributing
causes:

• Objective:

– Unfair objective functions: The initial objective used in de-


veloping a ML approach may not consider fairness. This does not
mean that the developer explicitly (or implicitly) used an unfair
objective function to train the model, but the oversimplification
of that objective can lead to downstream issues. For example, a
model designed to maximize accuracy across all populations may
not inherently provide fairness across different sub-populations
even if it reaches state-of-the-art performance on average, across
the whole population [241, 242].
– Incorrect presumptions: In some instances, the objective func-
tion includes incorrect interpretations of features, which can lead
to bias. For instance, a commercial algorithm used in the USA used
health costs as a proxy for health needs [193]; however, due to
financial limitations, Black patients with the same need for care as
White patients often spend less on healthcare and therefore have
a lower health cost. As a result, the model falsely inferred that
Black patients require less care compared to White patients be-
cause they spend less [193]. Additionally, patients may be charged
differently for the same service based on their insurance, suggest-
ing cost may not be representative of healthcare needs.

• Data:

– Inclusion and exclusion: It is important to clearly outline the


conditions and procedures utilized for patient data collection, in
order to understand patient inclusion criteria and any potential
selection biases that could occur. For instance, the Chest X-ray
dataset [275] was gathered in a research hospital that does not
routinely conduct diagnostic and treatment procedures (see
https://clinicalcenter.nih.gov/about/welcome/faq.html). This
dataset therefore includes mostly critical cases, and few patients
at the early stages of diagnosis. Moreover, as a specialized hospital,
patient admission is selective: patients are chosen solely by institute
physicians based on whether they have an illness being studied by the
given institute. Such a dataset will not contain the diversity
of disease cases that might be seen in hospitals specialized across
different diseases, or account for patients visiting for routine treat-
ment services at general hospitals.
– Insufficient sample size: Insufficient sample sizes of under-
represented groups can also result in unfairness [89]. For instance,
patients of low socioeconomic status may use healthcare services
less, which reduces their sample size in the overall dataset, re-
sulting in an unfair model [294, 38, 49]. In another instance, an
algorithm that can classify skin cancer [73] with high accuracy will
not be able to generalize to different skin colours if similar samples
have not been represented sufficiently in the training data [38].
– Missing essential representative features: Sometimes, es-
sential representative features are missed or not collected during
the dataset curation process, which prohibits downstream fairness
analyses. For instance, if the patient’s race has not been recorded,
it is not possible to analyze whether a model trained on that data
is fair with respect to that race [242]. Failure to include sensitive
features can generate discrimination and reduce transparency [48].

• Labels:

– Social bias reflection on labels: Biases in healthcare systems


widely reflect existing biases in society [168, 250, 269]. For in-
stance, race and sex biases exist in COPD underdiagnosis [168], in
medical risk score analysis (whereby there exists a higher thresh-
old for Black patients to gain access to clinical resources) [269],
and in the time of diagnosis for cardiovascular disease (whereby
female patients are diagnosed much later compared to the male
patients with similar conditions) [250]. These biases are reflected
in the labels used to train clinical AI systems and, as a result, the
model will learn to replicate this bias.

– Bias of automatic labeling: Due to the high cost and labour-
intensive process of acquiring labels for healthcare data, there has
been a shift away from hand-labelled data, towards automatic
labelling [39, 113, 120]. For instance, instead of expert-labeled
radiology images, natural language processing (NLP) techniques
are applied to radiology reports in order to extract labels. This
presents concerns, as these techniques have shown racial biases
even when trained on clinical notes [295]. There-
fore, using NLP techniques for automatic labeling may sometimes
amplify the overall bias of the labels [242].

• Resources:

– Limited computational resources: Not all centers have enough


labeled data or computational resources to train ML models ‘from
scratch’ and must use pretrained models for inference or transfer
learning. If the original model has been trained on biased (or dif-
ferently distributed) data, it will unfairly influence the outcome,
regardless of the quality of the data at the host center.

5.1.1.2. Evaluation
To evaluate the fairness of a model, we need to decide which fairness metric
to use and what sensitive attributes to consider in our analysis.

• Fairness metric(s): There are many ways to define fairness metrics.


For instance, [55] and [103] discussed several fairness criteria and sug-
gested balancing the error rate between different subgroups [58, 292].
However, it is not always possible to satisfy multiple fairness constraints
concurrently [242]; Kleinberg et al. [131] showed that three fairness
conditions could not all be satisfied simultaneously. As a
result, a trade-off between the different notions of fairness is required,
or a single fairness metric can be chosen based on domain knowledge
and the given clinical application.

• Sensitive attributes: Sensitive attributes are protected groups that


we want to consider when evaluating the fairness of an AI model. Sex
and race are two commonly used sensitive attributes [292, 241, 242,
295]. However, a lack of fairness in an AI system with respect to other

sensitive attributes such as age [241, 242], socioeconomic status, [241,
242, 295], and spoken language [295] are also important to consider.

Defining AI fairness is context- and problem-dependent. For instance, if we


build an AI model to support decision making for disease diagnosis with the
goal of using it in the clinic, then it is critical to ensure that the model
provides equal opportunity; i.e., patients from different races should have equal
opportunity to be accurately diagnosed [241]. However, if an AI model is to
be used to triage patients, then ensuring the system does not underdiagnose
unhealthy patients of a certain group may be of greater concern than the
specific disease itself, because underdiagnosed patients lose access to timely
care [242].
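
To ground this, the following is a minimal sketch of a per-group true-positive-rate comparison (an equal-opportunity-style check) using hypothetical arrays y_true, y_pred, and a sensitive attribute group; the choice of metric and attributes should still be made with clinicians for the specific application, as discussed above.

import numpy as np

def tpr_by_group(y_true, y_pred, group):
    # Return {group: TPR} and the largest TPR gap across groups.
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tprs = {}
    for g in np.unique(group):
        mask = (group == g) & (y_true == 1)  # positives in this group
        if mask.sum() == 0:
            continue                         # no positives: TPR undefined
        tprs[g] = float((y_pred[mask] == 1).mean())
    gap = max(tprs.values()) - min(tprs.values()) if tprs else 0.0
    return tprs, gap

# Usage (hypothetical): tprs, gap = tpr_by_group(y_true, y_pred, patient_sex)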

5.1.2. Interpretability & Explainability


In recent years, interpretability has received a lot of interest from the ML
community [184, 253, 172]. In machine learning, interpretability is defined
as the ability to explain the rationale for an ML model’s predictions in terms
that a human can understand [68] and explainability refers to a detailed un-
derstanding of the model’s internal representations, a priori of any decision.
After other research in this area [170], we use ‘interpretability’ and ‘explain-
ability’ interchangeably.

Interpretability is not a pre-requisite for all AI systems [68, 184], including


in low-risk environments (in which miscalculations have very limited conse-
quences) and in well-studied problems (which have been tested and validated
extensively according to robust MLOps methods). However, interpretability
can be crucial in many cases, especially for systems deployed in the healthcare
domain [88]. The need for interpretability arises when the problem specification
is incomplete, so that system results require an accompanying rationale.

5.1.2.1. Importance of interpretability


Interpretability applied to an ML model can be useful for the following rea-
sons:

• Trust: Interpretability enhances trust when all components are well-


explained. This builds an understanding of the decisions made by a
model and may help integrate it into the overall workflow.

• Reliability & robustness: Interpretability can help in auditing ML
models, further increasing model reliability.

• Privacy & security: Interpretability can be used to assess if any


private information is leaked from the results. While some researchers
claim that interpretability may hinder privacy [244, 102] as the in-
terpretable features may leak sensitive information, others have shown
that it can help make the system robust against adversarial attacks
[145, 297].

• Fairness: Interpretability can help in identifying and reducing biases


discussed in Sec. 5.1.1. However, the quality of these explanations can
differ significantly between subgroups and, as such, it is important to
test various explanation models in order to carefully select an equitable
model with high overall fidelity [24].

• Better understanding and knowledge: A good interpretation of


the model can lead to the identification of the factors that most impact
the model. This can also result in a better understanding of the use
case itself and enhance knowledge in that particular area.

• Causality: Interpretability gives a better understanding of the model


decisions and features, and hence can help identify causal relationships
among the features [43].

5.1.2.2. Types of approaches for interpretability in ML:


Many methods have been developed for better interpretability in ML, such
as explainable AI for trees [165], TensorFlow Lattice (https://www.tensorflow.org/lattice), DeepLIFT [143],
InterpretML [192], LIME [224], and SHAP [166]; some of these have been
applied to healthcare [2, 247] (a minimal usage sketch follows the list below). The methods for interpretability are usually
categorized as:

• Model-based

– Model-specific: Model-specific interpretability can only be used


for a particular model. Usually, this type of interpretability uses

the model’s internal structure to analyze the impact of features,
for example.
– Model-agnostic: Interpretability is not restricted to a specific
machine learning model and can be used more generally with sev-
eral.

• Complexity-based

– Intrinsic: Relatively simple methods, such as depth-limited deci-


sion trees, are easier for humans to understand.
– Post-hoc: Interpretation is performed after the model has produced
its output, typically for more complex models.

• Scope-based

– Locally interpretable: Interprets individual or per-instance pre-


dictions of the model.
– Globally interpretable: Interprets the model’s overall predic-
tion set and provides insight into how the model works in general.

• Methodology-based approach

– Feature-based: Methods that interpret the models based on the


impact of the features on that model. E.g., weight plot, feature
selection, etc.
– Perturbation-based: Methods that interpret the model by per-
turbing the settings or features of the model. E.g., LIME [224],
SHAP [166] and anchors.
– Rule-based: Methods that apply rules on features to identify
their impact on the model e.g., BETA, MUSE, and decision trees.
– Image-based: Methods where important inputs are shown using
images superimposed over the input, e.g., saliency maps [10].
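
As the usage sketch promised above, the following applies SHAP's TreeExplainer (a model-specific, post-hoc, feature-based method in the taxonomy above) to a synthetic tabular regression task; the data and model are stand-ins for illustration, not a clinical workflow.

import numpy as np
import shap  # SHapley Additive exPlanations
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for a tabular clinical prediction task.
X, y = make_regression(n_samples=500, n_features=8, noise=0.1, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)     # post-hoc explainer for tree ensembles
shap_values = explainer.shap_values(X)    # (n_samples, n_features) attributions

# Global, feature-based view: mean absolute contribution per feature.
print(np.abs(shap_values).mean(axis=0).round(3))

# Local view: per-feature contributions to a single prediction (instance 0).
print(shap_values[0].round(3))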

5.1.2.3. Interpretability in healthcare


In recent years, interpretability has become common in healthcare [2, 179,
220]. In particular, Abdullah et al. [2] reported that interpretability methods
(e.g., decision trees, LIME, SHAP) have been applied to extract insights into

different medical conditions including cardiovascular diseases, eye diseases,
cancer, influenza, infection, COVID-19, depression, and autism. Similarly,
Meng et al. [179] performed interpretability and fairness analyses of deep
learning mortality prediction models on the MIMIC-III dataset [119], showing
connections between interpretability methods and fairness metrics.

5.1.3. Privacy & Security


While digitizing healthcare has led to centralized data and improved access
for healthcare professionals, it has also increased risks to data security and
privacy [189]. After previous work [3], privacy is the individual’s ability to
control, interact with, and regulate their personal information and security
is a systemic protection of data from leaks or cyber-attacks.

5.1.3.1. Security & privacy requirements


In order to ensure privacy and security, the following requirements should be
met [189]:
• Authentication: Strong authentication mechanisms for accessing the
system.
• Confidentiality: Access to data and devices should be restricted to
authorized users.
• Integrity: Integrity-checking mechanisms should be applied to restrict
any modifications to the data or to the system.
• Non-repudiation: Logs should be maintained to monitor the system.
Access to those logs should be restricted to prevent tampering.
• Availability: Quick, easy, and fault-tolerant availability should be
ensured at all times.
• Anonymity: Anonymity of the device, data, and communication should
be guaranteed.
• Device unlinkability: An unauthorized person should not be able to
establish a connection between data and the sender.
• Auditability and accountability: It should be possible to trace back
the recording time, recording person, and origins of the data to validate
its authenticity.

5.1.3.2. Types of threats
Violation of privacy & security can occur either due to human error (uninten-
tional or non-malicious) or an adversarial attack (intentional or malicious).
1. Human error: Human error can cause data leakage through the care-
lessness or incompetence of authorized individuals. Most of the litera-
ture in this context [148, 75] divides human error into two types:
(a) Slip: the wrong execution of correct, intended actions; e.g., in-
correct data entry, forgetting to secure the data, giving access of
information to unauthorized persons using the wrong email ad-
dress.
(b) Mistake: the right execution of incorrect, unintended actions;
e.g., collecting data that is not required, using the same password
for different systems to avoid password recovery, giving access of
information to unauthorized persons assuming they can have ac-
cess.
While people dealing with data should be trained to avoid such negli-
gence, some researchers have suggested policies, frameworks, and strate-
gies such as error avoidance, error interception, or error correction to
prevent or mitigate these issues [148, 75].
2. Adversarial attacks: A primary risk for any digital data or system is
from adversarial attackers [99] who can damage, pollute, or leak infor-
mation from the system. An adversarial attacker can attack in many
ways; e.g., they can be remote or physically present, they can access
the system through a third-party device, or they can impersonate
a patient [189]. The most common types of attacks are listed below.

• Hardware or software attack: Modifying the hardware or soft-


ware to use it for malicious purposes.
• System unavailability: Making the device or data unavailable.
• Communication attack: Interrupting the communication or
forcing a device to communicate with unauthorized external de-
vices.
• Data sniffing: Illegally capturing the communication to get sen-
sitive information.

• Data modification: Maliciously modifying data.
• Information leakage: Retrieving sensitive information from the
system.

5.1.3.3. Healthcare components and security & privacy


Extra care needs to be taken to protect healthcare data [5]. Components
[194] include:

• Electronic health data: This data can be leaked due to human


mistakes or malicious attacks, which can result in tampering or misuse
of data. In order to overcome such risks, measures such as access
control, cryptography, anonymization, blockchain, steganography, or
watermarking can be used.

• Medical devices: Medical devices such as smartwatches and sensors


are also another source of information that can be attacked. Secure
hardware and software, authentication and cryptography can be used
to avoid such problems.

• Medical network: Data shared across medical professionals and or-


ganizations through a network may be susceptible to eavesdropping,
spoofing, impersonating, and unavailability attacks. These threats can
be reduced by applying encryption, authentication, access control, and
compressed sensing.

• Cloud storage: Cloud computing is becoming widely adopted in


healthcare. However, like any system, it is also prone to unavailability,
data breaches, network attacks, and malicious access. Similar to those
above, threats to cloud services can be avoided through authentica-
tion, cryptography, and decoying (i.e., a method to make an attacker
erroneously believe that they have acquired useful information).

5.1.3.4. Healthcare privacy & security laws


Due to the sensitivity of healthcare data and communication, many coun-
tries have introduced laws and regulations such as the Personal Information
Protection and Electronic Documents Act (PIPEDA) in Canada, the Health
Insurance Portability and Accountability Act (HIPAA) in the USA, and the

Data Protection Directive in the EU [280]. These acts mainly aim to protect
patient data from being shared or used without the patient’s consent, while
allowing patients access to their own data.

5.1.3.5. Attacks on ML pipeline


Any ML model that learns from data can also leak information about the
data, even if it generalizes well; e.g., using membership inference (i.e.,
determining if a particular instance was used to train the model) [178, 111]
or using property inference (i.e., inferring properties of the training dataset
from a given model) [178, 200]. Adversarial attacks in the context of the
MLOps pipeline can occur in the following phases [99]:

• Data collection phase: At this phase, a poisoning attack results in


modified or polluted data, impacting the training of the model and
lowering performance on unmodified data.

• Modelling phase: Here, the Trojan AI attack can modify a model


to provide an incorrect response for specific trigger instances [271] by
changing the model architecture and parameters. Since it is now com-
mon to use pre-trained models, these models can be modified or re-
placed by attackers.

• Production and deployment phases: At these phases, both Trojan


AI attacks and evasion attacks can occur. Evasion attacks consist of,
e.g., modifying test data so that it is misclassified [207].
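
To make the evasion threat concrete, the following is a minimal sketch of the classic fast gradient sign method (FGSM) against a hypothetical differentiable PyTorch classifier model; it is a generic illustration of how small input perturbations can flip predictions, not an attack taken from [207].

import torch
import torch.nn.functional as F

def fgsm_perturb(model, inputs, labels, epsilon=0.01):
    # Perturb inputs in the direction that increases the loss, producing
    # adversarial examples that can be used to evaluate robustness.
    inputs = inputs.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(inputs), labels)
    loss.backward()
    return (inputs + epsilon * inputs.grad.sign()).detach()

# Usage (hypothetical): compare model(x) with model(fgsm_perturb(model, x, y)).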

5.1.4. Reliability, robustness and resilience


A trustworthy MLOps system should be reliable, robust, and resilient. These
terms are defined as follows [302]:

• Reliability: The system performs in a satisfactory manner under spe-


cific, unaltered operating conditions.

• Robustness: The system performs in a satisfactory manner despite


changes in operating conditions, e.g., data shift.

• Resilience: The system performs in a satisfactory manner despite a


major disruption in operating conditions; e.g., adversarial attacks.

These aspects have been studied in the healthcare domain [181, 213] and
different approaches such as interpretability, security, privacy, and methods
to deal with data shift (discussed in Sections 5.1.2 and 5.1.3) have been sug-
gested.

Trade-off between accuracy and trustworthiness: In Section 5.1, we


discussed different important components of trustworthy AI that should be
considered while designing an ML system; however, literature shows that
there can be a trade-off between accuracy, interpretability, and robustness
[220, 256]. A main reason for this trade-off is that robust models learn
a different feature representation, which may decrease accuracy even though
it is better perceived by humans [256].

5.2. Incorporating Responsibility and Trust into MLHOps


In recent years, responsible and trustworthy AI have gained a lot of attention
in general as well as in healthcare, due to their implications for society [220].
There are several definitions of trustworthiness [220], and they are related
to making the system robust, unbiased, generalizable, reproducible, trans-
parent, explainable, and secure. However, the lack of standardized practices
for applying, explaining, and evaluating trustworthiness in AI for healthcare
makes this very challenging [220]. In this section, we discuss how we can
incorporate all these qualities at each step of the pipeline.

5.2.1. Data
The process of a responsible and trustworthy MLOps pipeline starts with
data collection and preparation. The impact of biased or polluted data prop-
agates through all the subsequent steps of the pipeline [82]. This can be even
more important and challenging in the healthcare domain due to the privacy
and sensitivity of the data [21]. If compromised, this information can be
tampered with or misused in various ways (e.g., identity theft, information sold to
a third party) and introduce bias in the healthcare system. Such challenges
can also cause economic harm (such as job loss), psychological harm (e.g.,
causing embarrassment due to a medical issue), and social isolation (e.g.,
due to a serious illness such as HIV) [187, 4]. It can also impact ML model
performance and trustworthiness [50].

5.2.1.1. Data collection
In healthcare, data can be acquired through multiple sources [257], which
increases the chance of the data being polluted by bias. Bias can concern, for
example, race [284], gender, sexual orientation, gender identity, and disability.
Bias in healthcare data can be mitigated against by increasing diversity in
data, e.g., by including underrepresented minorities (URMs), which can lead
to better outcomes [169]. Debiasing during data collection can include:

1. Identifying & acknowledging potential real-world biases: Bias


in healthcare is introduced long before the data collection stage. Al-
though increasingly less common in many countries, bias can still
occur in medical school admission, job interviews, patient care, disease
identification, research samples, and case studies. Such biases lead to
the dominance of people from certain communities [169] or in-group
vs. out-group bias [91], which can result in stereotyped and biased
data generation and hence biased data collection.
Bias can be unconscious or conscious [169, 79]. Unconscious bias stems
from implicit or unintentional associations outside conscious awareness
resulting from stereotypical perceptions and experiences. On the other
hand, conscious bias is explicit and intentional and has resulted in abuse
and criminal acts in healthcare; e.g., the Tuskegee study of untreated
syphilis in Black men demonstrated intentional racism [80]. Both con-
scious and unconscious biases damage the validity of the data. Since
conscious bias is relatively more visible, it is openly discouraged not
only in healthcare but also in all areas of society. However, uncon-
scious bias is more subtle and not as easy to identify. In most cases,
unconscious bias is not even known to the person suffering from it.
Different surveys, tests, and studies have found the following types of
biases (conscious or unconscious) common in healthcare [169]:

(a) Racial bias: e.g., Black, Hispanic, and Native American physi-
cians are underrepresented [197]. According to one study, white
males from the upper classes are preferred by admission com-
mittees [42] (although some other sources suggest the opposite;
see https://applymd.utoronto.ca/admission-stats).

(b) Gender bias: e.g., professional women in healthcare are less
likely to be invited to give talks [177] or to be introduced using
professional titles [77], and more likely to experience harassment
or exclusion, receive insufficient support at work, face negative
comparisons with male colleagues, and be perceived as weak and
less competitive [150, 252].
(c) Gender minority bias: e.g., LGBTQ people receive lower-quality
healthcare [226] and face challenges in obtaining jobs in healthcare
[232].
(d) Disability bias: e.g., people with disabilities receive limited ac-
cessibility support across facilities and have to work harder to
feel validated or recognized [175].

Various tests, such as the Implicit Association Test (IAT), identify the
existence of unconscious bias and have been reported to be useful.
For example, Race IAT results detected unintentional bias in 75% of
the population taking the test [25]. While debate continues regarding
the degree of usefulness of these tests [34], they may still capture some
subtle human behaviours. Other assessment tools (e.g., the Diversity
Engagement Survey (DES) [203]) have also been developed to measure
inclusion and diversity in medical institutes.
According to Marcelin et al. [169], the following measures can help in
reducing unintentional bias:

(a) Using IAT to identify potential biases in admissions or hiring com-


mittee members in advance.
(b) Promoting equity, diversity, inclusion, and accessibility (EDIA) in
teams. Including more people from underrepresented minorities
(URM) in the healthcare profession, especially in admissions and
hiring committees.
(c) Conducting and analyzing surveys to keep track of the challenges
faced by URM individuals due to the biased perception of them.
(d) Training to highlight the existence and need for mitigation of bias.
(e) Self-monitoring bias can be another way to incorporate inclusion
and diversity.

2. Debiasing during data collection and annotation:
In addition to human factors, we can take steps to improve the data
collection process itself. In this regard, the following measures can be
taken [156]:

(a) Investigating the exclusion: In dataset creation, an important


step is to carefully investigate which patients are included in the
dataset. An exclusion criterion in dataset creation may be con-
scious and clinically motivated, but there are many unintentional
exclusion criteria that are less visible and introduce biases.
For instance, a dataset gathered in a research hospital that
does not routinely provide standard diagnostic and treatment ser-
vices, and that admits patients only because they have an illness
being studied by the institute, will have a different patient population
compared to clinical hospitals that do not have these limitations
[242]. Similarly, whether the service delivered to the patient
is free or covered by insurance changes the distribution of
the patients and injects bias into the resulting AI model [241].
(b) Annotation with explanation: Having human annotators justify
their chosen labels not only helps them identify
their own unconscious biases but also helps set standards
for unbiased annotation and avoid automatic association and
stereotyping (e.g., the high prevalence of HIV in gay men led to under-
diagnosis of this disease in women and children [169]). Moreover,
these explanations can be a good resource for training explainable
AI models [277].
(c) Data provenance: This involves tracking data lineage through
the data source, dependencies, and data collection process. Health-
care data can come from multiple sources, which increases the
chances of it being biased [45]. Data provenance improves data
quality, integrity, auditability, and transparency [283]. Different
tools for data provenance are available, including Fast Health-
care Interoperability Resources (FHIR) [233] and Atmolytics [283, 171].
(d) Data security & privacy during data collection: Smart
healthcare technologies have become a common practice [45]. A
wide variety of smart devices is available, including wearable de-

vices (e.g., smartwatches, skin-based sensors), body area networks
(e.g., EEG sensors, blood pressure sensors), tele-healthcare (e.g.,
tele-monitoring, tele-treatment), digital healthcare systems (e.g.,
electronic health records (EHR), electronic medical records (EMR)),
and health analytics (e.g., medical big-data). While the digitiza-
tion of healthcare has improved access to medical facilities, it has
increased the risk of data leakage and malicious attacks. Extra
care should be taken while designing an MLOps pipeline to avoid
privacy and security risks, as it can lead to serious life-threatening
consequences. Other issues include the number of people involved
in using the data and proper storage for high volumes of data.
Chaudhry et al. [45] proposed an AI-based framework using 6G-
networks for secure data exchange in digital healthcare devices. In
the past decade, blockchain has also emerged as a way of ensur-
ing data privacy and security. Blockchain is a distributed database
with unique characteristics such as immutability, decentralization,
and transparency. This is especially relevant in healthcare because
of security and privacy issues [101, 286, 190]. Using blockchain can
help in more efficient and secure management of patients’ health
records, transparency, identification of false content, patient mon-
itoring, and maintaining financial statements [101].
(e) Data-sheet: Often, creating a dataset that represents the full
diversity of a population is not feasible, especially for very multi-
cultural societies. Additionally, the prevalence of diseases among
different sub-populations may differ [242]. If it is not pos-
sible to build an ideal dataset with the above specifications, the
data should be delivered with a data-sheet. The data-sheet is
meta-data that specifies the characteristics of
the data, clearly explains exclusion and inclusion criteria, details
demographic features of the patients, and provides statistics of the data
distribution over sub-populations, labels, and features.
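
The snippet below is a minimal, hypothetical illustration of such a data-sheet as machine-readable metadata that travels with the dataset; every field name and count is invented for illustration and does not follow any particular standard.

import json

datasheet = {
    "name": "example-cxr-cohort",                       # hypothetical dataset
    "collection_period": "2015-2020",
    "inclusion_criteria": ["adult inpatients", "at least one frontal chest X-ray"],
    "exclusion_criteria": ["outside transfers without imaging metadata"],
    "label_source": "radiology reports (NLP-extracted)",
    "subgroup_counts": {
        "sex": {"female": 5210, "male": 5480},
        "age_band": {"18-40": 2100, "41-65": 4890, "66+": 3700},
    },
    "known_limitations": ["single research hospital; mostly critical cases"],
}

with open("datasheet.json", "w") as f:
    json.dump(datasheet, f, indent=2)   # ship alongside the dataset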

5.2.1.2. Data pre-processing


1. Data quality assurance: Sendak et al. [238] argued that clinical re-
searchers choose data for research very carefully but the machine learn-
ing community in healthcare does not follow this practice. To overcome
this gap, they suggest that data points are identified by the clinicians

and extracted into a project-specific data store. After this, a three-step
framework is applied: (1) use different measures for data pre-processing
to ensure the correctness of all data elements (e.g., converting each lab
measurement to the same unit), (2) assess completeness, conformance,
plausibility, and possible data shifts, and (3) adjudicate the data with
the clinicians.

2. Data anonymization: Due to the sensitivity of healthcare data,
anonymization should minimize the chances of the data being
de-anonymized. Olatunji et al. [196] provide a detailed overview of
data anonymization models and techniques in healthcare such as k-
anonymity, k-map, l-diversity, t-closeness, δ-disclosure privacy, β-likeness,
δ-presence, and (ε, δ)-differential privacy. To avoid data leakage, many
tools for data anonymization and its evaluation [268], such as SecGraph [116],
ARX (a tool for anonymizing biomedical data) [212], Amnesia [254]
(https://www.openaire.eu/item/amnesia-data-anonymization-made-easy),
PySyft [230], Synthea [270], and Anonimatron
(https://realrolfje.github.io/anonimatron/; an open-source
data anonymization tool written in Java), can be incorporated into the
MLHOps pipeline.

3. Removing subgroup indicators: Changing the race of the patients


can have a dramatic impact on the outcome of an algorithm that is de-
signed to fill a prompt [295]. Therefore, the existence of race attributes
in the text can decrease the fairness of the model dramatically. In some
specific problems, removing subgroup indicators such as the sex of a job
candidate from their application has been shown to have minimal influence
on classifier accuracy while improving fairness [64]. This method
is applicable mostly in text-based data where sensitive attributes are
easily removable. As a preprocessing step, one can estimate the effect
of keeping or removing such sensitive attributes on the overall accuracy
and fairness of a developed model. At the same time, it is not always
possible to remove the sensitive attributes from the data. For example,
AI models can predict patient race from medical images, but it is not
yet clear how they do so [276]. In one study [276], researchers did
not provide the patient race during model training, and they also could
not find a particular patch or region in the data whose removal prevented
the model from detecting race.

4. Differential privacy: Differential privacy [61] aims to provide infor-


mation about groups in the data while withholding information about
individuals. Many algorithms and tools have been developed for
this, including CapC [53] and PySyft [230].
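
As a minimal illustration of the underlying mechanism, the sketch below adds Laplace noise calibrated to the sensitivity of a count query, a standard ε-differential-privacy construction [61]; production systems should rely on audited libraries such as those cited above.

import numpy as np

def dp_count(values, predicate, epsilon=1.0):
    # epsilon-differentially-private count via the Laplace mechanism.
    # A count query has sensitivity 1 (adding or removing one patient changes
    # the count by at most 1), so noise is drawn from Laplace(0, 1/epsilon).
    true_count = float(np.sum(predicate(np.asarray(values))))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Usage (hypothetical): dp_count(ages, lambda a: a > 65, epsilon=0.5)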

5.2.2. Methodology
The following sections overview the steps to put these concepts into practice.

5.2.2.1. Algorithmic fairness


Algorithmic fairness [183, 282, 83] attempts to ensure unbiased outputs
across the available classes. Here, we discuss how this challenge can be
addressed at different stages of model training [183, 282].

1. Pre-processing

• Choice of sampling & data augmentation: Making sure that the


dataset is balanced (having approximately an equal number of
instances from each class) and all the classes get equal representa-
tion in the dataset using simple under- or over-sampling methods
[282]. This can also be done via data augmentation [180, 81], e.g.,
improving counterfactual fairness by generating counterfactual text
and using it to augment the data. Augmentation methods include
Synthetic Minority Oversampling Technique (SMOTE) [46] and
Adaptive Synthetic Sampling (ADASYN) [107]. Since synthetic
samples may not be universally beneficial for the healthcare do-
main, acquiring more data and undersampling may be the best
strategy [282].
• Causal fairness using data pre-processing: Causal fairness is achieved
by reducing the impact of protected or sensitive attributes (e.g.,
race and gender) on predicted variables and different methods
have been developed to accomplish this [83, 281]. Kamiran et
al. [121] proposed “massaging the data” before using traditional
classification algorithms.
• Re-weighing: In a pre-processing approach, one may re-weight the
training dataset samples, remove features with high correlation
to sensitive attributes as well as the sensitive attribute itself [122],
or learn representations that are relatively invariant to the sensitive
attribute [162]. One might also adjust representation rates of pro-
tected groups to achieve target fairness metrics [44], or utilize
optimization to learn a data transformation that reduces discrimi-
nation [40] (a minimal sketch of one common re-weighing scheme
appears at the end of this subsection).

2. In-processing

• Adversarial learning: It is also possible to enforce fairness dur-


ing model training, using adversarial debiasing [291, 221, 274].
Adversarial learning refers to the methods designed to intention-
ally confound ML models during training, through deceptive or
misleading inputs, to make those models more robust. This tech-
nique has been used in healthcare to create robust models [125],
and for bias mitigation, by intentionally inputting biased examples
[144, 204].
• Prejudice remover: Another important aspect is prejudice injected
into the features [123]. Prejudice can be (a) Direct prejudice: using
a protected attribute as a prediction variable, (b) Indirect prej-
udice: statistical dependence between protected attributes and
prediction variables, and (c) Latent prejudice: statistical depen-
dence between protected attributes and non-protected attributes.
Kamishima et al. [123] proposed a method to remove prejudice
using regularization. Similarly, Grgic et al. [94] introduced a
method using constraints for classifier optimization objectives to
remove prejudice.
• Enforcing fairness in the model training: Fairness can also be
enforced by making changes to the model through constraint op-
timization [159], modifying loss functions to penalize deviation
from the general population for subpopulations [205], regularizing
loss function to minimize mutual information between feature em-
bedding and bias [128], or adding a regularizer to identify and treat
latent discriminating features [123].
• Up-weighing: It is possible to improve the outcome for the worst-case
group by up-weighting the groups with the largest loss [292, 231,
173]. However, all these methods require knowledge of each instance's
membership in the sensitive groups. There are also
group-unaware methods that weight each sample
with an adversary that tries to maximize the weighted loss [136],
or train an additional classifier that up-weights samples classified
incorrectly in the last training step [154].

3. Post-processing: The post-processing fairness mitigation approaches


may target post-hoc calibration of model predictions. This method has
shown impact in bias mitigation in both non-healthcare [103, 210] and
healthcare [129] applications.

There are software tools and libraries for algorithmic fairness checks,
listed in [282], which can be used by developers and end users to evaluate the
fairness of AI model outcomes.
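
As referenced in the pre-processing discussion above, the following is a minimal sketch of one common re-weighing scheme, which weights each sample by P(s)P(y)/P(s, y) so that the sensitive attribute s and the label y become statistically independent under the weights; the arrays y and s are hypothetical, and this is an illustration rather than the exact algorithm of any single cited work.

import numpy as np

def reweighing_weights(y, s):
    # Per-sample weights w(s, y) = P(s) * P(y) / P(s, y).
    y, s = np.asarray(y), np.asarray(s)
    weights = np.empty(len(y), dtype=float)
    for g in np.unique(s):
        for label in np.unique(y):
            mask = (s == g) & (y == label)
            if mask.sum() == 0:
                continue
            weights[mask] = (s == g).mean() * (y == label).mean() / mask.mean()
    return weights

# Usage (hypothetical): clf.fit(X, y, sample_weight=reweighing_weights(y, sex))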

5.2.3. Development & evaluation


At this stage, the ML system is evaluated to ensure its trustworthiness,
which includes evaluating the evaluation methods themselves [220, 17].

5.2.3.1. Model interpretability & explainability


At this stage, model evaluation can be done through interpretability and ex-
plainability methods to mitigate any potential issues such as possible anoma-
lies in the data or the model. However, the methods that
perform interpretability and explainability should themselves be evaluated
carefully before being relied upon, for example through
human evaluation [170, 24].

6. Concluding remarks
Machine learning (ML) has been applied to many clinically-relevant tasks
and many relevant datasets in the research domain but, to fully realize the
promise of ML in healthcare, practical considerations that are not typically
necessary or even common in the research community must be carefully de-
signed and adhered to. We have provided a broad survey of these consid-
erations, including infrastructure, human resources, data
sources, model deployment, monitoring and updating, bias, interpretability,

privacy and security.

As there are an increasing number of AI systems being deployed into medical


practice, it is important to standardize and specify engineering
pipelines for medical AI development and deployment, a process we term
MLHOps. To this end, we have outlined the key steps that should be put into
practice by multidisciplinary teams at the cutting-edge of AI in healthcare
to ensure the responsible deployment of clinical AI systems.

7. Appendix

MIMIC-Extract: Pipeline to transform data from MIMIC-III into DataFrames that are directly usable for ML modelling.
Clairvoyance: End-to-end AutoML pipeline for medical time series.
Pyhealth: A Python library for health predictive models.
ROMOP: R package to easily interface with OMOP-formatted EHR data.
ATLAS: Research tool to conduct scientific analyses on data available in OMOP format.
FIDDLE: Preprocessing pipeline that transforms structured EHR data into feature vectors for clinical use cases.
hi-ml: Toolbox for deep learning for medical imaging and Azure integration.
MedPerf: An open benchmarking platform for medical artificial intelligence using federated evaluation.
MONAI: AI toolkit for healthcare imaging.
TorchXRayVision: A library of chest X-ray datasets and models.
Leaf: Clinical data explorer.

Table 4: List of open-source tools available on Github that can be used for ML system
development specific to health.

Table 5: Key Roles in an MLOps Team

Domain Expert (also: Business Translator, Business Stakeholder, PO/Manager): An instrumental role in any phase of the MLOps process where a deeper understanding of the data and the domain is required.

Solution Architect (also: IT Architect, ML Architect): Unifies the work of data scientists, data engineers, and software developers by developing strategies for MLOps processes, defining the project lifecycle, identifying the best tools, and assembling the team of engineers and developers to work on projects.

Data Scientist (also: ML Specialist, ML Developer): A central player in any MLOps team, responsible for creating the data and ML model pipelines. The pipelines include analysing and processing the data as well as building and testing the ML models.

Data Engineer (also: DataOps Engineer, Data Analyst): Works in coordination with the product manager and domain expert to uncover insights from data through data ingestion pipelines.

Software Developer (also: Full-stack Engineer): Focuses on productionizing ML models and the supporting infrastructure based on the ML architect's blueprints; standardizes the code for compatibility and re-usability.

DevOps Engineer (also: CI/CD Engineer): Facilitates access to specialized tools and high-performance computing infrastructure, enables the transition from development to deployment and monitoring, and automates the ML lifecycle.

ML Engineer (also: MLOps Engineer): Highly skilled programmers supporting the design and deployment of ML models in close collaboration with Data Scientists and DevOps Engineers.
References
[1] Abdullah A Abdullah, Masoud M Hassan, and Yaseen T Mustafa. A
review on bayesian deep learning in healthcare: Applications and chal-
lenges. IEEE Access, 2022.
[2] Talal AA Abdullah, Mohd Soperi Mohd Zahid, and Waleed Ali. A
review of interpretable ml in healthcare: Taxonomy, applications, chal-
lenges, and future directions. Symmetry, 13(12):2439, 2021.
[3] Adnan Ahmed Abi Sen and Abdullah M Basahel. A comparative study
between security and privacy. In 2019 6th International Conference on
Computing for Sustainable Global Development (INDIACom), pages
1282–1286. IEEE, 2019.
[4] Karim Abouelmehdi, Abderrahim Beni-Hessane, and Hayat Khaloufi.
Big healthcare data: preserving security and privacy. Journal of big
data, 5(1):1–18, 2018.
[5] Karim Abouelmehdi, Abderrahim Beni-Hssane, Hayat Khaloufi, and
Mostafa Saadi. Big data security and privacy in healthcare: A review.
Procedia Computer Science, 113:73–80, 2017.
[6] George A. Adam, Chun-Hao K. Chang, Benjamin Haibe-Kains, and
Anna Goldenberg. Hidden risks of machine learning applied to health-
care: Unintended feedback loops between models and future data caus-
ing model degradation. Proceedings of Machine Learning Research,
(126):710–731, 2020.
[7] George A. Adam, Chun-Hao K. Chang, Benjamin Haibe-Kains, and
Anna Goldenberg. Hidden risks of machine learning applied to health-
care: Unintended feedback loops between models and future data caus-
ing model degradation. Proceedings of Machine Learning Research,
(182):1–26, 2022.
[8] Roy Adams, Katharine E Henry, Anirudh Sridharan, Hossein
Soleimani, Andong Zhan, Nishi Rawat, Lauren Johnson, David N
Hager, Sara E Cosgrove, Andrew Markowski, et al. Prospective, multi-
site study of patient outcomes after implementation of the trews ma-
chine learning-based early warning system for sepsis. Nature medicine,
pages 1–6, 2022.

[9] Adewole S. Adamson and Avery Smith. Machine learning and health
care disparities in dermatology. JAMA Dermatology, 154(11):1247–
1248, 2018.

[10] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz
Hardt, and Been Kim. Sanity checks for saliency maps. Advances in
neural information processing systems, 31, 2018.

[11] Julius A Adebayo et al. FairML: ToolBox for diagnosing bias in


predictive modeling. PhD thesis, Massachusetts Institute of Technol-
ogy, 2016.

[12] Philip Adler, Casey Falk, Sorelle A Friedler, Tionney Nix, Gabriel Ry-
beck, Carlos Scheidegger, Brandon Smith, and Suresh Venkatasubra-
manian. Auditing black-box models for indirect influence. Knowledge
and Information Systems, 54:95–122, 2018.

[13] Yongsu Ahn and Yu-Ru Lin. Fairsight: Visual analytics for fairness
in decision making. IEEE transactions on visualization and computer
graphics, 26(1):1086–1095, 2019.

[14] Amr M. Alexandari, Anshul Kundaje, and Avanti Shrikumar. Maxi-


mum likelihood with bias-corrected calibration is hard-to-beat at label
shift adaptation. In International Conference on Machine Learning,
pages 222–232. PMLR, 2020.

[15] Emily Alsentzer, John R Murphy, Willie Boag, Wei-Hung Weng, Di Jin,
Tristan Naumann, and Matthew McDermott. Publicly available clinical
bert embeddings. arXiv preprint arXiv:1904.03323, 2019.

[16] Fatemeh Amrollahi, Supreeth P Shashikumar, Andre L Holder, and


Shamim Nemati. Leveraging clinical data across healthcare institutions
for continual learning of predictive risk models. Scientific Reports,
12(1):1–10, 2022.

[17] Tony Antoniou and Muhammad Mamdani. Evaluation of machine


learning solutions in medicine. CMAJ, 193(36):E1425–E1429, 2021.

[18] Mahsa Arabahmadi, Reza Farahbakhsh, and Javad Rezazadeh. Deep


learning for smart healthcare—a survey on brain tumor detection from
medical imaging. Sensors, 22(5):1960, 2022.

[19] Jacob Armstrong and David A Clifton. Continual learning of longitu-
dinal health records. In 2022 IEEE-EMBS International Conference on
Biomedical and Health Informatics (BHI), pages 01–06. IEEE, 2022.

[20] Anand Avati, Martin Seneviratne, Emily Xue, Zhen Xu, Balaji Lak-
shminarayanan, and Andrew M. Dai. Beds-bench: Behavior of
ehr-models under distributional shift–a benchmark. arXiv preprint
arXiv:2107.08189, 2021.

[21] Joseph Bamidele Awotunde, Rasheed Gbenga Jimoh, Sak-


inat Oluwabukonla Folorunso, Emmanuel Abidemi Adeniyi,
Kazeem Moses Abiodun, and Oluwatobi Oluwaseyi Banjo. Pri-
vacy and security concerns in iot-based healthcare systems. In
The Fusion of Internet of Things, Artificial Intelligence, and Cloud
Computing in Health Care, pages 105–134. Springer, 2021.

[22] K. Azizzadenesheli, A. Liu, F. Yang, and A. Anandkumar. Regular-


ized learning for domain adaptation under label shifts. arXiv preprint
arXiv:1903.09734, 2019.

[23] Manuel Baena-García, Jose del Campo-Avila, Raul Fidalgo, Albert


Bifet, Ricard Gavalda, and Rafael Morales-Bueno. Early drift detection
method. In Fourth international workshop on knowledge discovery from
data streams, 6:77–86, 2006.

[24] Aparna Balagopalan, Haoran Zhang, Kimia Hamidieh, Thomas


Hartvigsen, Frank Rudzicz, and Marzyeh Ghassemi. The road to ex-
plainability is paved with bias: Measuring the fairness of explanations.
arXiv preprint arXiv:2205.03295, 2022.

[25] MR Banaji and G Greenwald. Blindspot: Hidden biases of good people


[kindle ipad version], 2013.

[26] Shahab S Band, Sina Ardabili, Atefeh Yarahmadi, Bahareh Pahlevan-


zadeh, Adiqa Kausar Kiani, Amin Beheshti, Hamid Alinejad-Rokny,
Iman Dehzangi, Arthur Chang, Amir Mosavi, et al. A survey on ma-
chine learning and internet of medical things-based approaches for han-
dling covid-19: Meta-analysis. Frontiers in Public Health, 10, 2022.

[27] Niels Bantilan. Themis-ml: A fairness-aware machine learning inter-
face for end-to-end discrimination discovery and mitigation. Journal of
Technology in Human Services, 36(1):15–30, 2018.
[28] David W Bates, Suchi Saria, Lucila Ohno-Machado, Anand Shah, and
Gabriel Escobar. Big data in health care: using analytics to identify and
manage high-risk and high-cost patients. Health affairs, 33(7):1123–
1131, 2014.
[29] Firas Bayram, Bestoun S Ahmed, and Andreas Kassler. From concept
drift to model degradation: An overview on performance-aware drift
detectors. Knowledge-Based Systems, page 108632, 2022.
[30] Ashwin Belle, Raghuram Thiagarajan, SM Soroushmehr, Fatemeh Na-
vidi, Daniel A Beard, and Kayvan Najarian. Big data analytics in
healthcare. BioMed research international, 2015, 2015.
[31] Duane Bender and Kamran Sartipi. Hl7 fhir: An agile and restful ap-
proach to healthcare information exchange. In Proceedings of the 26th
IEEE International Symposium on Computer-Based Medical Systems,
pages 326–331, 2013.
[32] Ayne A. Beyene, Tewelle Welemariam, Marie Persson, and Niklas
Lavesson. Improved concept drift handling in surgery prediction and
other applications. Knowledge and Information Systems, 44(1):177–
196, 2015.
[33] Sarah Bird, Miro Dudík, Richard Edgar, Brandon Horn, Roman Lutz,
Vanessa Milan, Mehrnoosh Sameki, Hanna Wallach, and Kathleen
Walker. Fairlearn: A toolkit for assessing and improving fairness in
ai. Microsoft, Tech. Rep. MSR-TR-2020-32, 2020.
[34] Hart Blanton, James Jaccard, Jonathan Klick, Barbara Mellers, Gre-
gory Mitchell, and Philip E Tetlock. Strong claims and weak evi-
dence: reassessing the predictive validity of the iat. Journal of applied
Psychology, 94(3):567, 2009.
[35] Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-
Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and
Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium
on Security and Privacy (SP), pages 141–159, 2021.

[36] Jonathan Brophy and Daniel Lowd. Machine unlearning for random
forests. In Marina Meila and Tong Zhang, editors, Proceedings of the
38th International Conference on Machine Learning, volume 139 of
Proceedings of Machine Learning Research, pages 1092–1104. PMLR,
18–24 Jul 2021.

[37] Li Bu, Cesare Alippi, and Dongbin Zhao. A pdf-free change detec-
tion test based on density difference estimation. IEEE transactions on
neural networks and learning systems, 29(2):324–334, 2016.

[38] Joy Buolamwini and Timnit Gebru. Gender Shades: Intersectional Ac-
curacy Disparities in Commercial Gender Classification. In Proceedings
of the 1st Conference on Fairness, Accountability and Transparency,
volume 81 of FAT*’18, page 15, 2018.

[39] Aurelia Bustos, Antonio Pertusa, Jose-Maria Salinas, and Maria de la


Iglesia-Vayá. Padchest: A large chest x-ray image dataset with multi-
label annotated reports. Medical image analysis, 66:101797, 2020.

[40] Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan


Natesan Ramamurthy, and Kush R Varshney. Optimized pre-
processing for discrimination prevention. 30, 2017.

[41] Tianshi Cao, Chinwei Huang, David Yu-Tung Hui, and Joseph Paul
Cohen. A benchmark of medical out of distribution detection. arXiv
preprint arXiv:2007.04250, 2020.

[42] Quinn Capers IV, Daniel Clinchot, Leon McDougle, and Anthony G
Greenwald. Implicit racial bias in medical school admissions. Academic
Medicine, 92(3):365–369, 2017.

[43] Diogo V Carvalho, Eduardo M Pereira, and Jaime S Cardoso. Ma-


chine learning interpretability: A survey on methods and metrics.
Electronics, 8(8):832, 2019.

[44] L. Elisa Celis, Vijay Keswani, and Nisheeth Vishnoi. Data preprocess-
ing to mitigate bias: A maximum entropy based approach. 119:1349–
1359, 2020.

[45] Sachi Chaudhary, Riya Kakkar, Nilesh Kumar Jadav, Anuja Nair, Ra-
jesh Gupta, Sudeep Tanwar, Smita Agrawal, Mohammad Dahman Al-
shehri, Ravi Sharma, Gulshan Sharma, et al. A taxonomy on smart
healthcare technologies: Security framework, case study, and future
directions. Journal of Sensors, 2022, 2022.

[46] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip


Kegelmeyer. Smote: synthetic minority over-sampling technique.
Journal of artificial intelligence research, 16:321–357, 2002.

[47] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag,


and Yan Liu. Recurrent neural networks for multivariate time series
with missing values. Scientific reports, 8(1):1–12, 2018.

[48] Irene Chen, Fredrik D Johansson, and David Sontag. Why Is My Clas-
sifier Discriminatory? In Advances in Neural Information Processing
Systems 31, pages 3539–3550. Curran Associates, Inc., 2018.

[49] Irene Chen, Shalmali Joshi, and Marzyeh Ghassemi. Treating health
disparities with artificial intelligence. volume 26, page 16–17, 2020.

[50] Irene Y Chen, Emma Pierson, Sherri Rose, Shalmali Joshi, Kadija Fer-
ryman, and Marzyeh Ghassemi. Ethical machine learning in healthcare.
Annual review of biomedical data science, 4:123–144, 2021.

[51] Jonathan H Chen, Muthuraman Alagappan, Mary K Goldstein,


Steven M Asch, and Russ B Altman. Decaying relevance of clinical
data towards future decisions in data-driven inpatient clinical order
sets. International journal of medical informatics, 102:71–79, 2017.

[52] Weijie Chen, Berkman Sahiner, Frank Samuelson, Aria Pezeshk, and
Nicholas Petrick. Calibration of medical diagnostic classifier scores to
the probability of disease. Statistical methods in medical research,
27(5):1394–1409, 2018.

[53] Christopher A Choquette-Choo, Natalie Dullerud, Adam Dziedzic,


Yunxiang Zhang, Somesh Jha, Nicolas Papernot, and Xiao Wang.
Capc learning: Confidential and private collaborative learning. arXiv
preprint arXiv:2102.05188, 2021.

[54] Shir Chorev, Philip Tannor, Dan Ben Israel, Noam Bressler, Itay Gab-
bay, Nir Hutnik, Jonatan Liberman, Matan Perlmutter, Yurii Ro-
manyshyn, and Lior Rokach. Deepchecks: A Library for Testing and
Validating Machine Learning Models and Data.
[55] Alexandra Chouldechova. Fair prediction with disparate impact: A
study of bias in recidivism prediction instruments. Big data, 5(2):153–
163, 2016.
[56] Oliver Cobb and Arnaud Van Looveren. Context-aware drift detection.
In International Conference on Machine Learning, pages 4087–4111.
PMLR, 2022.
[57] Gary S Collins, Paula Dhiman, Constanza L Andaur Navarro, Jie Ma,
Lotty Hooft, Johannes B Reitsma, Patricia Logullo, Andrew L Beam,
Lily Peng, Ben Van Calster, et al. Protocol for development of a report-
ing guideline (tripod-ai) and risk of bias tool (probast-ai) for diagnostic
and prognostic prediction model studies based on artificial intelligence.
BMJ open, 11(7):e048008, 2021.
[58] Sam Corbett-Davies and Sharad Goel. The measure and mismeasure
of fairness: A critical review of fair machine learning. 2018.
[59] Conor K Corbin, Rob Maclay, Aakash Acharya, Sreedevi Mony,
Soumya Punnathanam, Rahul Thapa, Nikesh Kotecha, Nigam H Shah,
and Jonathan H Chen. Deployr: A technical framework for deploying
custom real-time machine learning models into the electronic medical
record. arXiv preprint arXiv:2303.06269, 2023.
[60] Andrew Cotter, Maya Gupta, Heinrich Jiang, Nathan Srebro, Karthik
Sridharan, Serena Wang, Blake Woodworth, and Seungil You. Train-
ing well-generalizing classifiers for fairness metrics and other data-
dependent constraints. In International Conference on Machine
Learning, pages 1397–1405. PMLR, 2019.
[61] Fida Kamal Dankar and Khaled El Emam. Practicing differential pri-
vacy in health care: A review. Trans. Data Priv., 6(1):35–67, 2013.
[62] Sharon E Davis, Robert A Greevy Jr, Christopher Fonnesbeck, Thomas A
Lasko, Colin G Walsh, and Michael E Matheny. A non-parametric updating
method to correct clinical prediction model drift. Journal of the American
Medical Informatics Association, 26(12):1448–1457, 2019.

[63] Angus Dawson. Trust, trustworthiness and health. Forum for Medical
Ethics Society, 2015.

[64] Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes,
Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram
Kenthapadi, and Adam Tauman Kalai. Bias in bios: a case study of
semantic representation bias in a high-stakes setting. In Proceedings
of the Conference on Fairness, Accountability, and Transparency
(FAT* ’19), pages 120–128, Atlanta, GA, USA, 2019.

[65] Anne AH de Hond, Artuur M Leeuwenberg, Lotty Hooft, Ilse MJ


Kant, Steven WJ Nijman, Hendrikus JA van Os, Jiska J Aardoom,
Thomas Debray, Ewoud Schuit, Maarten van Smeden, et al. Guidelines
and quality criteria for artificial intelligence-based prediction models in
healthcare: a scoping review. npj Digital Medicine, 5(1):1–13, 2022.

[66] Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot,


Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars.
A continual learning survey: Defying forgetting in classification
tasks. IEEE transactions on pattern analysis and machine intelligence,
44(7):3366–3385, 2021.

[67] Kevin Donnelly et al. Snomed-ct: The advanced terminology and cod-
ing system for ehealth. Studies in health technology and informatics,
121:279, 2006.

[68] Finale Doshi-Velez and Been Kim. Towards a rigorous science of inter-
pretable machine learning. arXiv preprint arXiv:1702.08608, 2017.

[69] Xinyi Du-Harpur, Callum Arthurs, Clarisse Ganier, Rick Woolf, Zainab
Laftah, Manpreet Lakhan, Amr Salam, Bo Wan, Fiona M. Watt,
Nicholas M. Luscombe, and Magnus D. Lynch. Clinically relevant vul-
nerabilities of deep machine learning systems for skin cancer diagnosis.
J Invest Dermatol., 141(4):916–920, 2021.

[70] Christopher Duckworth, Francis P. Chmiel, Dan K. Burns, Zlatko D.
Zlatev, Neil M. White, Thomas W. V. Daniels, Michael Kiuber, and
Michael J. Boniface. Using explainable machine learning to characterise
data drift and detect emergent health risks for emergency department
admissions during covid-19. Sci Rep, 11:23017, 2021.

[71] Michael Ekstrand, Robin Burke, and Fernando Diaz. Fairness and
discrimination in recommendation and retrieval. In Proceedings of the
13th ACM Conference on Recommender Systems (RecSys ’19), pages
576–577, 2019.

[72] Yanai Elazar and Yoav Goldberg. Adversarial removal of demographic


attributes from text data. arXiv preprint arXiv:1808.06640, 2018.

[73] Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M.
Swetter, Helen M. Blau, and Sebastian Thrun. Dermatologist-level
classification of skin cancer with deep neural networks. Nature,
542(7639):115–118, February 2017.

[74] Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr


Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado,
Sebastian Thrun, and Jeff Dean. A guide to deep learning in healthcare.
Nature medicine, 25(1):24–29, 2019.

[75] Mark Evans, Ying He, Leandros Maglaras, and Helge Janicke. Heart-
is: A novel technique for evaluating human error-related information
security incidents. Computers & Security, 80:74–89, 2019.

[76] Jean Feng, Rachael V Phillips, Ivana Malenica, Andrew Bishara,


Alan E Hubbard, Leo A Celi, and Romain Pirracchio. Clinical ar-
tificial intelligence quality improvement: towards continual monitoring
and updating of ai algorithms in healthcare. npj Digital Medicine,
5(1):1–9, 2022.

[77] Julia A Files, Anita P Mayer, Marcia G Ko, Patricia Friedrich, Marjorie
Jenkins, Michael J Bryan, Suneela Vegunta, Christopher M Wittich,
Melissa A Lyle, Ryan Melikian, et al. Speaker introductions at internal
medicine grand rounds: forms of address reveal gender bias. Journal
of women’s health, 26(5):413–419, 2017.

[78] Samuel G. Finlayson, Adarsh Subbaswamy, Karandeep Singh, John
Bowers, Annabel Kupke, Jonathan Zittrain, Isaac S. Kohane, and Suchi
Saria. The clinician and dataset shift in artificial intelligence. New
England Journal of Medicine, 385(3):283–286, 2021.

[79] Chloë FitzGerald and Samia Hurst. Implicit bias in healthcare profes-
sionals: a systematic review. BMC medical ethics, 18(1):1–18, 2017.

[80] Charles K Francis. Medical ethos and social responsibility in clinical


medicine. Journal of Urban Health, 78(1):29–45, 2001.

[81] Zee Fryer, Vera Axelrod, Ben Packer, Alex Beutel, Jilin Chen, and
Kellie Webster. Flexible text generation for counterfactual fairness
probing. arXiv preprint arXiv:2206.13757, 2022.

[82] Daniel James Fuchs. The dangers of human-like bias in machine-


learning algorithms. Missouri S&T’s Peer to Peer, 2(1):1, 2018.

[83] Sainyam Galhotra, Karthikeyan Shanmugam, Prasanna Sattigeri,


Kush R Varshney, Rachel Bellamy, Kuntal Dey, et al. Causal feature
selection for algorithmic fairness. 2022.

[84] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and
Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM
computing surveys (CSUR), 46(4):1–37, 2014.

[85] Ruoyuan Gao and Chirag Shah. Toward creating a fairer ranking in
search engine results. Information Processing & Management, 57, 2020.

[86] Satvik Garg, Pradyumn Pundir, Geetanjali Rathee, PK Gupta, Somya


Garg, and Saransh Ahlawat. On continuous integration/continuous
delivery for automated deployment of machine learning models us-
ing mlops. In 2021 IEEE fourth international conference on artificial
intelligence and knowledge engineering (AIKE), pages 25–28. IEEE,
2021.

[87] Saurabh Garg, Yifan Wu, Sivaraman Balakrishnan, and Zachary C.
Lipton. A unified view of label shift estimation. Advances in Neural
Information Processing Systems, 33:3290–3300, 2020.

[88] Marzyeh Ghassemi and Shakir Mohamed. Machine learning and health
need better values. npj Digital Medicine, 5(1):1–4, 2022.

[89] Milena A. Gianfrancesco, Suzanne Tamang, Jinoos Yazdany, and
Gabriela Schmajuk. Potential biases in machine learning algo-
rithms using electronic health record data. JAMA internal medicine,
178(11):1544–1547, 2018.

[90] Tony Ginart, Martin Jinye Zhang, and James Zou. Mldemon: deployment
monitoring for machine learning systems. In Gus-
tau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, edi-
tors, Proceedings of The 25th International Conference on Artificial
Intelligence and Statistics, volume 151 of Proceedings of Machine
Learning Research, pages 3962–3997. PMLR, 28–30 Mar 2022.

[91] Elliot Graham, Samer Halabi, and Arie Nadler. Ingroup bias in health-
care contexts: Israeli-jewish perceptions of arab and jewish doctors.
Frontiers in psychology, 12, 2021.

[92] Alex Graves, Marc G. Bellemare, Jacob Menick, Rémi Munos, and
Koray Kavukcuoglu. Automated curriculum learning for neural net-
works. In Doina Precup and Yee Whye Teh, editors, Proceedings of
the 34th International Conference on Machine Learning, volume 70 of
Proceedings of Machine Learning Research, pages 1311–1320. PMLR,
06–11 Aug 2017.

[93] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard


Schölkopf, and Alexander Smola. A kernel two-sample test. The
Journal of Machine Learning Research, 13(1):723–773, 2012.

[94] Nina Grgic-Hlaca, Muhammad Bilal Zafar, Krishna P Gummadi, and


Adrian Weller. The case for process fairness in learning: Feature selec-
tion for fair decision making. In NIPS symposium on machine learning
and the law, volume 1, page 2. Barcelona, Spain, 2016.

[95] Hao Guan and Mingxia Liu. Domain adaptation for medical image
analysis: a survey. IEEE Transactions on Biomedical Engineering,
69(3):1173–1185, 2021.

[96] Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens Van
Der Maaten. Certified data removal from machine learning models.
arXiv preprint arXiv:1911.03030, 2019.

[97] Lin Lawrence Guo, Stephen R. Pfohl, Jason Fries, Alistair E. W. John-
son, Jose Posada, Catherine Aftandilian, Nigam Shah, and Lillian
Sung. Evaluation of domain generalization and adaptation on improv-
ing model robustness to temporal dataset shift in clinical medicine. Sci
Rep, page 2726, 2022.
[98] Xiaoyuan Guo, Judy Wawira Gichoya, Hari Trivedi, Saptarshi
Purkayastha, and Imon Banerjee. Medshift: identifying shift data for
medical dataset curation. arXiv preprint arXiv:2112.13885, 2021.
[99] Kishor Datta Gupta and Dipankar Dasgupta. Who is responsible for
adversarial defense? arXiv preprint arXiv:2106.14152, 2021.
[100] Raia Hadsell, Dushyant Rao, Andrei A. Rusu, and Razvan Pascanu.
Embracing change: Continual learning in deep neural networks. Trends
in cognitive sciences, 24(12):1028–1040, 2020.
[101] Abid Haleem, Mohd Javaid, Ravi Pratap Singh, Rajiv Suman, and
Shanay Rab. Blockchain technology applications in healthcare: An
overview. International Journal of Intelligent Networks, 2:130–139,
2021.
[102] Frederik Harder, Matthias Bauer, and Mijung Park. Interpretable
and differentially private predictions. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 34, pages 4083–4090, 2020.
[103] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of Opportunity
in Supervised Learning. In Proceedings of the 30th International
Conference on Neural Information Processing Systems (NIPS’16), pages
3323–3331, Barcelona, Spain, 2016. Curran Associates Inc.
[104] David A Harrison, Anthony R Brady, Gareth J Parry, James R Car-
penter, and Kathy Rowan. Recalibration of risk prediction models in
a large multicenter cohort of admissions to adult, general critical care
units in the united kingdom. Critical care medicine, 34(5):1378–1388,
2006.
[105] Thomas Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon
Kim, and Marzyeh Ghassemi. Aging with grace: Lifelong model editing
with discrete key-value adaptors. 2022.

[106] Hassan Moharram, Ahmed Awad, and Passent M. El-Kafrawy. Optimizing
adwin for steady streams. In SAC ’22: Proceedings of the 37th
ACM/SIGAPP Symposium on Applied Computing, pages 450–459, 2022.

[107] Haibo He, Yang Bai, Edwardo A Garcia, and Shutao Li. Adasyn:
Adaptive synthetic sampling approach for imbalanced learning. In 2008
IEEE international joint conference on neural networks (IEEE world
congress on computational intelligence), pages 1322–1328. IEEE, 2008.

[108] Xin He, Kaiyong Zhao, and Xiaowen Chu. Automl: A survey of the
state-of-the-art. Knowledge-Based Systems, 212:106622, 2021.

[109] Katharine E Henry, Rachel Kornfield, Anirudh Sridharan, Robert C


Linton, Catherine Groh, Tony Wang, Albert Wu, Bilge Mutlu, and
Suchi Saria. Human–machine teaming is key to ai adoption: clinicians’
experiences with a deployed machine learning system. NPJ digital
medicine, 5(1):1–6, 2022.

[110] Sarah Holland, Ahmed Hosny, Sarah Newman, Joshua Joseph, and
Kasia Chmielinski. The dataset nutrition label: A framework to drive
higher data quality standards. arXiv preprint arXiv:1805.03677, 2018.

[111] Hongsheng Hu, Zoran Salcic, Lichao Sun, Gillian Dobbie, Philip S Yu,
and Xuyun Zhang. Membership inference attacks on machine learning:
A survey. ACM Computing Surveys (CSUR), 54(11s):1–37, 2022.

[112] Hamish Huggard, Yun Sing Koh, Gillian Dobbie, and Edmond Zhang.
Detecting concept drift in medical triage. pages 1733–1736, 2020.

[113] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana
Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn
Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S.
Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P.
Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng.
CheXpert: A Large Chest Radiograph Dataset with Uncertainty La-
bels and Expert Comparison. arXiv:1901.07031 [cs, eess], January 2019.
arXiv: 1901.07031.

[114] Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou.
Approximate data deletion from machine learning models. volume 130,
pages 2008–2016, 2021.

[115] Daniel Jarrett, Jinsung Yoon, Ioana Bica, Zhaozhi Qian, Ari Er-
cole, and Mihaela van der Schaar. Clairvoyance: A pipeline toolkit
for medical time series. In International Conference on Learning
Representations, 2020.

[116] Shouling Ji, Weiqing Li, Prateek Mittal, Xin Hu, and Raheem Beyah.
SecGraph: A uniform and open-source evaluation system for graph
data anonymization and de-anonymization. In 24th USENIX Security
Symposium (USENIX Security 15), pages 303–318, 2015.

[117] Wittawat Jitkrittum, Zoltán Szabó, Kacper P Chwialkowski, and


Arthur Gretton. Interpretable distribution features with maximum
testing power. Advances in Neural Information Processing Systems,
29, 2016.

[118] Meenu Mary John, Helena Holmström Olsson, and Jan Bosch. Towards
mlops: A framework and maturity model. In 2021 47th Euromicro
Conference on Software Engineering and Advanced Applications
(SEAA), pages 1–8, 2021.

[119] Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Steven Horng,
Leo Anthony Celi, and Roger Mark. Mimic-iv. PhysioNet. Available
online at: https://physionet.org/content/mimiciv/1.0/ (accessed
August 23, 2021), 2020.

[120] Alistair E. W. Johnson, Tom J. Pollard, Seth J. Berkowitz,


Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng,
Roger G. Mark, and Steven Horng. MIMIC-CXR: A large publicly
available database of labeled chest radiographs. arXiv:1901.07042 [cs,
eess], January 2019. arXiv: 1901.07042.

[121] Faisal Kamiran and Toon Calders. Classifying without discriminat-


ing. In 2009 2nd international conference on computer, control and
communication, pages 1–6. IEEE, 2009.

[122] Faisal Kamiran and Toon Calders. Data preprocessing techniques
for classification without discrimination. Knowledge and information
systems, 33(1):1–33, 2012.

[123] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma.
Fairness-aware classifier with prejudice remover regularizer. In Joint
European conference on machine learning and knowledge discovery in
databases, pages 35–50. Springer, 2012.

[124] Sehj Kashyap, Keith E Morse, Birju Patel, and Nigam H Shah. A
survey of extant organizational and computational setups for deploying
predictive models in health systems. Journal of the American Medical
Informatics Association, 28(11):2445–2450, 2021.

[125] Sara Kaviani, Ki Jin Han, and Insoo Sohn. Adversarial attacks and de-
fenses on ai in medical imaging informatics: A survey. Expert Systems
with Applications, page 116815, 2022.

[126] Jane Kaye. The tension between data sharing and the protection of
privacy in genomics research. Annual review of genomics and human
genetics, 13:415, 2012.

[127] Faiza Khan Khattak, Serena Jeblee, Chloé Pou-Prom, Mohamed Ab-
dalla, Christopher Meaney, and Frank Rudzicz. A survey of word
embeddings for clinical text. Journal of Biomedical Informatics,
100:100057, 2019.

[128] Byungju Kim, Hyunwoo Kim, Kyungsu Kim, Sungjin Kim, and Junmo
Kim. Learning not to learn: Training deep neural networks with biased
data. CoRR, 2018.

[129] Michael P. Kim, Amirata Ghorbani, and James Zou. Multiaccuracy:


Black-box post-processing for fairness in classification. page 247–254,
2019.

[130] Gene Kitamura and Christopher Deible. Retraining an open-source


pneumothorax detecting machine learning algorithm for improved per-
formance to medical images. Clinical Imaging, 61:15–19, 2020.

[131] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent
trade-offs in the fair determination of risk scores. In 8th Innovations
in Theoretical Computer Science Conference, page 3:1–43:23, 2017.

[132] William A. Knaus. Prognostic modeling and major dataset shifts dur-
ing the covid-19 pandemic: What have we learned for the next pan-
demic? JAMA Health Forum, 3(5):e221103, 2022.

[133] Wouter M Kouw and Marco Loog. A review of domain adaptation with-
out target labels. IEEE transactions on pattern analysis and machine
intelligence, 43(3):766–785, 2019.

[134] Dominik Kreuzberger, Niklas Kühl, and Sebastian Hirschl. Machine


learning operations (mlops): Overview, definition, and architecture.
arXiv preprint arXiv:2205.02302, 2022.

[135] Sean Kulinski, Saurabh Bagchi, and David I. Inouye. Feature shift
detection: Localizing which features have shifted via conditional dis-
tribution tests. Advances in neural information processing systems, 33,
2020.

[136] Preethi Lahoti, Alex Beutel, Jilin Chen, Kang Lee, Flavien Prost,
Nithum Thain, Xuezhi Wang, and Ed Chi. Fairness without demo-
graphics through adversarially reweighted learning. 33:728–740, 2020.

[137] Agostina J. Larrazabal, Nicolás Nieto, Victoria Peterson, Diego H.


Milone, and Enzo Ferrante. Gender imbalance in medical imag-
ing datasets produces biased classifiers for computer-aided diagnosis.
Proceedings of the National Academy of Sciences of the United States
of America, 117:12592 – 12594, 2020.

[138] Cecilia S. Lee and Aaron Y. Lee. Clinical applications of continual


learning machine learning. Lancet Digital Health, 2(6):E279–E281,
2020.

[139] Cheolhyoung Lee, Kyunghyun Cho, and Wanmo Kang. Mixout: Effec-
tive regularization to finetune large-scale pretrained language models.
arXiv preprint arXiv:1909.11299, 2019.

[140] Sebastian Lee, Sebastian Goldt, and Andrew Saxe. Continual learning
in the teacher-student setup: Impact of task similarity. In International
Conference on Machine Learning, pages 6109–6119. PMLR, 2021.

[141] Matthias Lenga, Heinrich Schulz, and Axel Saalbach. Continual learn-
ing for domain adaptation in chest x-ray classification. In Medical
Imaging with Deep Learning, pages 413–423. PMLR, 2020.

[142] Bo Li, Peng Qi, Bo Liu, Shuai Di, Jingen Liu, Jiquan Pei, Jinfeng Yi,
and Bowen Zhou. Trustworthy ai: From principles to practices. arXiv
preprint arXiv:2110.01167, 2021.

[143] Junbing Li, Changqing Zhang, Joey Tianyi Zhou, Huazhu Fu, Shuyin
Xia, and Qinghua Hu. Deep-lift: deep label-specific feature learning
for image annotation. IEEE Transactions on Cybernetics, 2021.

[144] Xiaoxiao Li, Ziteng Cui, Yifan Wu, Lin Gu, and Tatsuya Harada. Esti-
mating and improving fairness with adversarial learning. arXiv preprint
arXiv:2103.04243, 2021.

[145] Xuhong Li, Haoyi Xiong, Xingjian Li, Xuanyu Wu, Xiao Zhang, Ji Liu,
Jiang Bian, and Dejing Dou. Interpretable deep learning: Interpreta-
tion, interpretability, trustworthiness, and beyond. Knowledge and
Information Systems, pages 1–38, 2022.

[146] Yi Li and Nuno Vasconcelos. Repair: Removing representation bias by


dataset resampling. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 9572–9581, 2019.

[147] Shun Liao, Jamie Kiros, Jiyang Chen, Zhaolei Zhang, and Ting
Chen. Improving domain adaptation in de-identification of electronic
health records through self-training. Journal of the American Medical
Informatics Association, 28(10):2093–2100, 2021.

[148] Divakaran Liginlal, Inkook Sim, and Lara Khansa. How significant is
human error as a cause of privacy breaches? an empirical study and a
framework for error management. computers & security, 28(3-4):215–
228, 2009.

[149] James Liley, Samuel Emerson, Bilal Mateen, Catalina Vallejos, Louis
Aslett, and Sebastian Vollmer. Model updating after interventions
paradoxically introduces bias. In Arindam Banerjee and Kenji Fukumizu,
editors, Proceedings of The 24th International Conference on
Artificial Intelligence and Statistics, volume 130 of Proceedings of
Machine Learning Research, pages 3916–3924. PMLR, 13–15 Apr 2021.

[150] Wen Hui Lim, Chloe Wong, Sneha Rajiv Jain, Cheng Han Ng, Chia Hui
Tai, M Kamala Devi, Dujeepa D Samarasekera, Shridhar Ganpathi
Iyer, and Choon Seng Chong. The unspoken reality of gender bias in
surgery: A qualitative systematic review. PloS one, 16(2):e0246420,
2021.

[151] Zachary C. Lipton, Yu-Xiang Wang, and Alex Smola. Detecting and
correcting for label shift with black box predictors. arXiv preprint
arXiv:1802.03916, 2018.

[152] Miles Little. Invited commentary: is there a distinctively surgical


ethics? Surgery, 129(6):668–671, 2001.

[153] Evan Z Liu, Behzad Haghgoo, Annie S Chen, Aditi Raghunathan,


Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. Just
train twice: Improving group robustness without training group in-
formation. In Marina Meila and Tong Zhang, editors, Proceedings of
the 38th International Conference on Machine Learning, volume 139 of
Proceedings of Machine Learning Research, pages 6781–6792. PMLR,
18–24 Jul 2021.

[154] Evan Z Liu, Behzad Haghgoo, Annie S Chen, Aditi Raghunathan,


Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. Just
train twice: Improving group robustness without training group in-
formation. In Marina Meila and Tong Zhang, editors, Proceedings of
the 38th International Conference on Machine Learning, volume 139 of
Proceedings of Machine Learning Research, pages 6781–6792. PMLR,
2021.

[155] Feng Liu, Wenkai Xu, Jie Lu, and Danica J Sutherland. Meta
two-sample testing: Learning kernels for testing with limited data.
Advances in Neural Information Processing Systems, 34:5848–5860,
2021.

[156] Haochen Liu, Yiqi Wang, Wenqi Fan, Xiaorui Liu, Yaxin Li, Shaili
Jain, Yunhao Liu, Anil K Jain, and Jiliang Tang. Trustworthy ai: A
computational perspective. arXiv preprint arXiv:2107.06641, 2021.
[157] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-
based out-of-distribution detection. Advances in neural information
processing systems, 33:21464–21475, 2020.
[158] Xiaoxuan Liu, Samantha Cruz Rivera, David Moher, Melanie J
Calvert, and Alastair K Denniston. Reporting guidelines for clini-
cal trial reports for interventions involving artificial intelligence: the
consort-ai extension. bmj, 370, 2020.
[159] Vishnu Suresh Lokhande, Aditya Kumar Akash, Sathya N. Ravi, and
Vikas Singh. Fairalm: Augmented lagrangian method for training fair
models with little regret. page 365–381, 2020.
[160] Vincenzo Lomonaco, Lorenzo Pellegrini, Andrea Cossu, Antonio Carta,
Gabriele Graffieti, Tyler L. Hayes, Matthias De Lange, Marc Masana,
Jary Pomponi, Gido van de Ven, Martin Mundt, Qi She, Keiland
Cooper, Jeremy Forest, Eden Belouadah, Simone Calderara, German I.
Parisi, Fabio Cuzzolin, Andreas Tolias, Simone Scardapane, Luca
Antiga, Subutai Amhad, Adrian Popescu, Christopher Kanan, Joost
van de Weijer, Tinne Tuytelaars, Davide Bacciu, and Davide Maltoni.
Avalanche: an end-to-end library for continual learning. In Proceedings
of IEEE Conference on Computer Vision and Pattern Recognition, 2nd
Continual Learning in Computer Vision Workshop, 2021.
[161] David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample
tests. arXiv preprint arXiv:1610.06545, 2016.
[162] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and
Richard Zemel. The variational fair autoencoder. arXiv preprint
arXiv:1511.00830, 2015.
[163] Jie Lu, Anjin Liu, Fan Dong, Feng Gu, Joao Gama, and Guangquan
Zhang. Learning under concept drift: A review. arXiv preprint
arXiv:2004.05785, 2020.
[164] Jonathan H Lu, Alison Callahan, Birju S Patel, Keith E Morse, Dev
Dash, Michael A Pfeffer, and Nigam H Shah. Assessment of adherence
to reporting guidelines by commonly used clinical prediction models
from a single vendor: a systematic review. JAMA Network Open,
5(8):e2227779–e2227779, 2022.
[165] Scott M Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jor-
dan M Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha
Bansal, and Su-In Lee. Explainable ai for trees: From local explana-
tions to global understanding. arXiv preprint arXiv:1905.04610, 2019.
[166] Scott M Lundberg and Su-In Lee. A unified approach to interpreting
model predictions. Advances in neural information processing systems,
30, 2017.
[167] Sasu Mäkinen, Henrik Skogström, Eero Laaksonen, and Tommi Mikko-
nen. Who needs mlops: What data scientists seek to accomplish
and how can mlops help? In 2021 IEEE/ACM 1st Workshop on
AI Engineering-Software Engineering for AI (WAIN), pages 109–112.
IEEE, 2021.
[168] A James Mamary, Jeffery I Stewart, Gregory L Kinney, John E Hokan-
son, Kartik Shenoy, Mark T Dransfield, Marilyn G Foreman, Gwen-
dolyn B Vance, Gerard J Criner, COPDGene® Investigators, et al.
Race and gender disparities are evident in copd underdiagnoses across
all severities of measured airflow obstruction. Chronic Obstructive
Pulmonary Diseases: Journal of the COPD Foundation, 5(3):177, 2018.
[169] Jasmine R Marcelin, Dawd S Siraj, Robert Victor, Shaila Kotadia, and
Yvonne A Maldonado. The impact of unconscious bias in healthcare:
how to recognize and mitigate it. The Journal of infectious diseases,
220(Supplement 2):S62–S73, 2019.
[170] Ričards Marcinkevičs and Julia E Vogt. Interpretability and ex-
plainability: A machine learning zoo mini-tour. arXiv preprint
arXiv:2012.01805, 2020.
[171] Andrea Margheri, Massimiliano Masi, Abdallah Miladi, Vladimiro Sas-
sone, and Jason Rosenzweig. Decentralised provenance for healthcare
data. International Journal of Medical Informatics, 141:104197, 2020.
[172] Aniek F Markus, Jan A Kors, and Peter R Rijnbeek. The role of
explainability in creating trustworthy artificial intelligence for health
care: a comprehensive survey of the terminology, design choices, and
evaluation strategies. Journal of Biomedical Informatics, 113:103655,
2021.

[173] Natalia L Martinez, Martin A Bertran, Afroditi Papadaki, Miguel Ro-


drigues, and Guillermo Sapiro. Blind pareto fairness and subgroup
robustness. 139:7492–7501, 2021.

[174] Frank J Massey Jr. The kolmogorov-smirnov test for goodness of fit.
Journal of the American statistical Association, 46(253):68–78, 1951.

[175] Lisa M Meeks, Kurt Herzer, and Neera R Jain. Removing barriers and
facilitating access: increasing the number of physicians with disabili-
ties. Academic Medicine, 93(4):540–543, 2018.

[176] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman,


and Aram Galstyan. A survey on bias and fairness in machine learning.
ACM Computing Surveys (CSUR), 54(6):1–35, 2021.

[177] Sangeeta Mehta, Louise Rose, Deborah Cook, Margaret Herridge,


Sawayra Owais, and Victoria Metaxa. The speaker gender gap at crit-
ical care conferences. Critical Care Medicine, 46(6):991–996, 2018.

[178] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly


Shmatikov. Exploiting unintended feature leakage in collaborative
learning. In 2019 IEEE symposium on security and privacy (SP), pages
691–706. IEEE, 2019.

[179] Chuizheng Meng, Loc Trinh, Nan Xu, James Enouen, and Yan Liu. In-
terpretability and fairness evaluation of deep learning models on mimic-
iv dataset. Scientific Reports, 12(1):1–28, 2022.

[180] Vishwali Mhasawade, Yuan Zhao, and Rumi Chunara. Machine learn-
ing and algorithmic fairness in public and population health. Nature
Machine Intelligence, 3(8):659–666, 2021.

[181] Diana Mincu and Subhrajit Roy. Developing robust benchmarks


for driving forward ai innovation in healthcare. Nature Machine
Intelligence, pages 1–6, 2022.

[182] Eric Mitchell, Charles Lin, Antoine Bosselut, Christopher D Man-
ning, and Chelsea Finn. Memory-based model editing at scale. In
International Conference on Machine Learning, pages 15817–15831.
PMLR, 2022.

[183] Shira Mitchell, Eric Potash, Solon Barocas, Alexander D’Amour, and
Kristian Lum. Algorithmic fairness: Choices, assumptions, and defi-
nitions. Annual Review of Statistics and Its Application, 8:141–163,
2021.

[184] Christoph Molnar. Interpretable machine learning. Lulu. com, 2020.

[185] Jacob Montiel, Max Halford, Saulo Martiello Mastelini, Geoffrey


Bolmier, Raphael Sourty, Robin Vaysse, Adil Zouitine, Heitor Murilo
Gomes, Jesse Read, Talel Abdessalem, et al. River: machine learn-
ing for streaming data in python. The Journal of Machine Learning
Research, 22(1):4945–4952, 2021.

[186] Hussein Mozannar and David Sontag. Consistent estimators for learn-
ing to defer to an expert. In Hal Daumé III and Aarti Singh, editors,
Proceedings of the 37th International Conference on Machine Learning,
volume 119 of Proceedings of Machine Learning Research, pages 7076–
7087. PMLR, 13–18 Jul 2020.

[187] Sharyl J Nass, Laura A Levit, Lawrence O Gostin, et al. The


value and importance of health information privacy. In Beyond the
HIPAA Privacy Rule: Enhancing Privacy, Improving Health Through
Research. National Academies Press (US), 2009.

[188] Bret Nestor, Willie McDermott, Matthew B. A.and Boag, Gabriela


Berner, Tristan Naumann, Michael C. Hughes, Anna Goldenberg,
and Marzyeh Ghassemi. Feature robustness in non-stationary health
records: Caveats to deployable model performance in common clinical
machine learning tasks. Proceedings of Machine Learning Research,
(106):1–23, 2019.

[189] Akm Iqtidar Newaz, Amit Kumar Sikder, Mohammad Ashiqur Rah-
man, and A Selcuk Uluagac. A survey on security and privacy issues in
modern healthcare systems: Attacks and defenses. ACM Transactions
on Computing for Healthcare, 2(3):1–44, 2021.

[190] Wei Yan Ng, Tien-En Tan, Prasanth VH Movva, Andrew Hao Sen
Fang, Khung-Keong Yeo, Dean Ho, Fuji Shyy San Foo, Zhe Xiao, Kai
Sun, Tien Yin Wong, et al. Blockchain applications in health care for
covid-19 and beyond: a systematic review. The Lancet Digital Health,
3(12):e819–e829, 2021.

[191] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good prob-


abilities with supervised learning. In Proceedings of the 22nd
international conference on Machine learning, pages 625–632, 2005.

[192] Harsha Nori, Samuel Jenkins, Paul Koch, and Rich Caruana. In-
terpretml: A unified framework for machine learning interpretability.
arXiv preprint arXiv:1909.09223, 2019.

[193] Ziad Obermeyer, Christine Vogeli, Brian Powers, and Sendhil Mul-
lainathan. Dissecting racial bias in an algorithm used to manage the
health of population. Science, 366(6464):447–453, 2019.

[194] Se-Ra Oh, Young-Duk Seo, Euijong Lee, and Young-Gab Kim. A com-
prehensive survey on security and privacy for electronic health data.
International Journal of Environmental Research and Public Health,
18(18):9668, 2021.

[195] OHDSI. The Book of OHDSI: Observational Health Data Sciences and
Informatics. OHDSI, 2019.

[196] Iyiola E Olatunji, Jens Rauch, Matthias Katzensteiner, and Megha


Khosla. A review of anonymization for healthcare data. Big Data,
2022.

[197] Aba Osseo-Asare, Lilanthi Balasuriya, Stephen J Huot, Danya Keene,


David Berg, Marcella Nunez-Smith, Inginia Genao, Darin Latimore,
and Dowin Boatright. Minority resident physicians’ views on the role
of race/ethnicity in their training experiences in the workplace. JAMA
network open, 1(5):e182723–e182723, 2018.

[198] Andrei Paleyes, Raoul-Gabriel Urma, and Neil D. Lawrence. Chal-


lenges in deploying machine learning: a survey of case studies. ACM
Computing Surveys, 2022.

[199] Avneet Pannu. Artificial intelligence and its application in different
areas. Artificial Intelligence, 4(10):79–84, 2015.

[200] Mathias PM Parisot, Balazs Pejo, and Dayana Spagnuelo. Property in-
ference attacks on convolutional neural networks: Influence and impli-
cations of target model’s complexity. arXiv preprint arXiv:2104.13061,
2021.

[201] Chunjong Park, Anas Awadalla, Tadayoshi Kohno, and Shwetak Patel.
Reliable and trustworthy machine learning for health using dataset shift
detection. arXiv preprint arXiv:2110.14019, 2021.

[202] Chunjong Park, Anas Awadalla, Tadayoshi Kohno, and Shwetak Patel.
Reliable and trustworthy machine learning for health using dataset
shift detection. Advances in Neural Information Processing Systems,
34, 2021.

[203] Sharina D Person, C Greer Jordan, Jeroan J Allison, Lisa M Fink


Ogawa, Laura Castillo-Page, Sarah Conrad, Marc A Nivet, and Deb-
orah L Plummer. Measuring diversity and inclusion in academic
medicine: the diversity engagement survey (des). Academic medicine:
journal of the Association of American Medical Colleges, 90(12):1675,
2015.

[204] Stephen Pfohl, Ben Marafino, Adrien Coulet, Fatima Rodriguez,


Latha Palaniappan, and Nigam H Shah. Creating fair models of
atherosclerotic cardiovascular disease risk. In Proceedings of the 2019
AAAI/ACM Conference on AI, Ethics, and Society, pages 271–278,
2019.

[205] Stephen R. Pfohl, Agata Foryciarz, and Nigam H. Shah. An empirical


characterization of fair machine learning for clinical risk prediction.
Journal of Biomedical Informatics, 113:103621, 2021.

[206] Oleg S Pianykh, Georg Langs, Marc Dewey, Dieter R Enzmann, Chris-
tian J Herold, Stefan O Schoenberg, and James A Brink. Continuous
learning ai in radiology: implementation principles and early applica-
tions. Radiology, 297(1):6–14, 2020.

[207] Nikolaos Pitropakis, Emmanouil Panaousis, Thanassis Giannetsos,
Eleftherios Anastasiadis, and George Loukas. A taxonomy and sur-
vey of attacks against machine learning. Computer Science Review,
34:100199, 2019.

[208] Deborah Plana, Dennis L Shung, Alyssa A Grimshaw, Anurag Saraf,


Joseph JY Sung, and Benjamin H Kann. Randomized clinical trials
of machine learning interventions in health care: A systematic review.
JAMA Network Open, 5(9):e2233946–e2233946, 2022.

[209] John Platt et al. Probabilistic outputs for support vector machines
and comparisons to regularized likelihood methods. Advances in large
margin classifiers, 10(3):61–74, 1999.

[210] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q
Weinberger. On fairness and calibration. 30, 2017.

[211] Eduardo HP Pooch, Pedro L Ballester, and Rodrigo C Barros. Can we


trust deep learning models diagnosis? the impact of domain shift in
chest radiograph classification. arXiv preprint arXiv:1909.01940, 2019.

[212] Fabian Prasser, Florian Kohlmayer, Ronald Lautenschläger, and


Klaus A Kuhn. Arx-a comprehensive tool for anonymizing biomed-
ical data. In AMIA Annual Symposium Proceedings, volume 2014,
page 984. American Medical Informatics Association, 2014.

[213] Adnan Qayyum, Junaid Qadir, Muhammad Bilal, and Ala Al-Fuqaha.
Secure and robust machine learning for healthcare: A survey. IEEE
Reviews in Biomedical Engineering, 14:156–180, 2020.

[214] J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, and N. D.


Lawrence. Dataset Shift in Machine Learning. The MIT Press, 2009.

[215] Stephan Rabanser, Stephan Günnemann, and Zachary C. Lipton. Failing
loudly: An empirical study of methods for detecting dataset shift.
Advances in Neural Information Processing Systems, 32, 2019.

[216] Wullianallur Raghupathi and Viju Raghupathi. Big data analytics in


healthcare: promise and potential. Health information science and
systems, 2(1):1–10, 2014.

[217] Alvin Rajkomar, Jeffrey Dean, and Isaac Kohane. Machine learning in
medicine. New England Journal of Medicine, 380(14):1347–1358, 2019.
[218] Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M Dai, Nissan Ha-
jaj, Michaela Hardt, Peter J Liu, Xiaobing Liu, Jake Marcus, Mimi
Sun, et al. Scalable and accurate deep learning with electronic health
records. NPJ digital medicine, 1(1):1–10, 2018.
[219] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel
Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie
Shpanskaya, et al. Chexnet: Radiologist-level pneumonia detection on
chest x-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.
[220] Khansa Rasheed, Adnan Qayyum, Mohammed Ghaly, Ala Al-Fuqaha,
Adeel Razi, and Junaid Qadir. Explainable, trustworthy, and ethical
machine learning for healthcare: A survey. Computers in Biology and
Medicine, page 106043, 2022.
[221] Christian Reimers, Paul Bodesheim, Jakob Runge, and Joachim Den-
zler. Towards learning an unbiased classifier from biased data via con-
ditional adversarial debiasing. page 48–62, 2021.
[222] Jie Ren, Stanislav Fort, Jeremiah Liu, Abhijit Guha Roy, Shreyas
Padhy, and Balaji Lakshminarayanan. A simple fix to maha-
lanobis distance for improving near-ood detection. arXiv preprint
arXiv:2106.09022, 2021.
[223] Cedric Renggli, Luka Rimanic, Nezihe Merve Gürel, Bojan Karlaš,
Wentao Wu, and Ce Zhang. A data quality-driven view of mlops.
arXiv preprint arXiv:2102.07750, 2021.
[224] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why should
I trust you?”: Explaining the predictions of any classifier. In
Proceedings of the 22nd ACM SIGKDD international conference on
knowledge discovery and data mining, pages 1135–1144, 2016.
[225] Samantha Cruz Rivera, Xiaoxuan Liu, An-Wen Chan, Alastair K Den-
niston, Melanie J Calvert, Hutan Ashrafian, Andrew L Beam, Gary S
Collins, Ara Darzi, Jonathan J Deeks, et al. Guidelines for clinical trial
protocols for interventions involving artificial intelligence: the spirit-ai
extension. The Lancet Digital Health, 2(10):e549–e560, 2020.

[226] Dani E Rosenkrantz, Whitney W Black, Roberto L Abreu, Mollie E
Aleshire, and Keisa Fallin-Bennett. Health and health care of rural
sexual and gender minorities: A systematic review. Stigma and Health,
2(3):229, 2017.

[227] Leahora Rotteau, Joanne Goldman, Kaveh G Shojania, Timothy J Vo-


gus, Marlys Christianson, G Ross Baker, Paula Rowland, and Maitreya
Coffey. Striving for high reliability in healthcare: a qualitative study
of the implementation of a hospital safety programme. BMJ Quality
& Safety, 2022.

[228] Frank Rudzicz and Raeid Saqur. Ethics of artificial intelligence in


surgery. arXiv preprint arXiv:2007.14302, 2020.

[229] Philipp Ruf, Manav Madan, Christoph Reich, and Djaffar Ould-
Abdeslam. Demystifying mlops and presenting a recipe for the selection
of open-source tools. Applied Sciences, 11(19):8861, 2021.

[230] Theo Ryffel, Andrew Trask, Morten Dahl, Bobby Wagner, Jason
Mancuso, Daniel Rueckert, and Jonathan Passerat-Palmbach. A
generic framework for privacy preserving deep learning. arXiv preprint
arXiv:1811.04017, 2018.

[231] Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy
Liang. Distributionally robust neural networks for group shifts: On
the importance of regularization for worst-case generalization. arXiv
preprint arXiv:1911.08731, 2019.

[232] Nelson F Sánchez, Susan Rankin, Edward Callahan, Henry Ng,


Louisa Holaday, Kadian McIntosh, Norma Poll-Hunter, and John Paul
Sánchez. Lgbt trainee and health professional perspectives on academic
careers—facilitators and challenges. LGBT health, 2(4):346–356, 2015.

[233] Rishi Kanth Saripalle. Fast health interoperability resources (fhir): cur-
rent status in the healthcare system. International Journal of E-Health
and Medical Communications (IJEHMC), 10(1):76–93, 2019.

[234] Chandramouli Shama Sastry and Sageev Oore. Detecting out-of-


distribution examples with gram matrices. In International Conference
on Machine Learning, pages 8491–8501. PMLR, 2020.

[235] Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann,
Wieland Brendel, and Matthias Bethge. Improving robustness against
common corruptions by covariate shift adaptation. In H. Larochelle,
M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances
in Neural Information Processing Systems, volume 33, pages 11539–
11551. Curran Associates, Inc., 2020.
[236] Antonin Schrab, Ilmun Kim, Mélisande Albert, Béatrice Laurent, Ben-
jamin Guedj, and Arthur Gretton. Mmd aggregated two-sample test.
arXiv preprint arXiv:2110.15073, 2021.
[237] Jessica Schrouff, Natalie Harris, Oluwasanmi Koyejo, Ibrahim Alabdul-
mohsin, Eva Schnider, Krista Opsahl-Ong, Alex Brown, Subhrajit Roy,
Diana Mincu, Christina Chen, et al. Maintaining fairness across distri-
bution shift: do we have viable solutions for real-world applications?
arXiv preprint arXiv:2202.01034, 2022.
[238] Mark Sendak, Gaurav Sirdeshmukh, Timothy Ochoa, Hayley Premo,
Linda Tang, Kira Niederhoffer, Sarah Reed, Kaivalya Deshpande,
Emily Sterrett, Melissa Bauer, et al. Development and validation
of ml-dqa–a machine learning data quality assurance framework for
healthcare. arXiv preprint arXiv:2208.02670, 2022.
[239] MP Sendak, W Ratliff, D Sarro, E Alderton, J Futoma, M Gao,
M Nichols, M Revoir, F Yashar, C Miller, et al. Real-world integration
of a sepsis deep learning technology into routine clinical care:
implementation study. JMIR Medical Informatics, 8(7):e15182, 2020.
doi: 10.2196/15182.
[240] Tegjyot Singh Sethi and Mehmed Kantardzic. On the reliable detection
of concept drift from streaming unlabeled data. Expert Systems with
Applications, 82:77–99, 2017.
[241] Laleh Seyyed-Kalantari, Guanxiong Liu, Matthew McDermott, Irene
Chen, and Marzyeh Ghassemi. Chexclusion: Fairness gaps in deep
chest x-ray classifiers. 2021.
[242] Laleh Seyyed-Kalantari, Haoran Zhang, Matthew McDermott, Irene
Chen, and Marzyeh Ghassemi. Underdiagnosis bias of artificial intelligence
algorithms applied to chest radiographs in under-served patient
populations. Nature Medicine, 27:2176–2182, 2021.

[243] Shubham Sharma, Jette Henderson, and Joydeep Ghosh. Certifai: A
common framework to provide explanations and analyse the fairness
and robustness of black-box models. In Proceedings of the AAAI/ACM
Conference on AI, Ethics, and Society, pages 166–172, 2020.
[244] Ying Sheng, Sandeep Tata, James B Wendt, Jing Xie, Qi Zhao, and
Marc Najork. Anatomy of a privacy-safe large-scale information ex-
traction system over email. In Proceedings of the 24th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining,
pages 734–743, 2018.
[245] Arjun Soin, Jameson Merkow, Jin Long, Joseph Paul Cohen, Smitha
Saligrama, Stephen Kaiser, Steven Borg, Ivan Tarapov, and Matthew P
Lungren. Chexstray: Real-time multi-modal data concordance for drift
detection in medical imaging ai, 2022.
[246] Karin Stacke, Gabriel Eilertsen, Jonas Unger, and Claes Lundström.
Measuring domain shift for deep learning in histopathology. IEEE
journal of biomedical and health informatics, 25(2):325–336, 2020.
[247] G Stiglic, P Kocbek, N Fijacko, M Zitnik, K Verbert, and L. Cilar.
Interpretability of machine learning based prediction models in health-
care. WIREs Data Mining Knowl Discov., 10(5):e1379, 2020.
[248] Vallijah Subasri, Amrit Krishnan, Azra Dhalla, Deval Pandya, David
Malkin, Fahad Razak, Amol Verma, Anna Goldenberg, and Elham
Dolatabadi. Diagnosing and remediating harmful data shifts for the
responsible deployment of clinical ai models. medRxiv, 2023.
[249] Adarsh Subbaswamy, Roy Adams, and Suchi Saria. Evaluating model
robustness and stability to dataset shift. Proceedings of Machine
Learning Research, pages 2611–2619, 2021.
[250] Tony Y Sun, Oliver J Walk IV, Jennifer L Chen, Harry Reyes Nieva,
and Noémie Elhadad. Exploring gender disparities in time to diagnosis.
2020.
[251] Georgios Symeonidis, Evangelos Nerantzis, Apostolos Kazakis, and
George A Papakostas. Mlops–definitions, tools and challenges. arXiv
preprint arXiv:2201.00162, 2022.

[252] Kim Templeton, Carol A Bernstein, Javeed Sukhera, Lois Margaret
Nora, Connie Newman, Helen Burstin, Constance Guille, Lorna Lynn,
Margaret L Schwarze, Srijan Sen, et al. Gender-based differences in
burnout: Issues faced by women physicians. NAM Perspectives, 2019.
[253] Erico Tjoa and Cuntai Guan. A survey on explainable artificial intelli-
gence (xai): Toward medical xai. IEEE transactions on neural networks
and learning systems, 32(11):4793–4813, 2020.
[254] Joana Tomás, Deolinda Rasteiro, and Jorge Bernardino. Data
anonymization: An experimental evaluation using open-source tools.
Future Internet, 14(6):167, 2022.
[255] Sana Tonekaboni, Gabriela Morgenshtern, Azadeh Assadi, Aslesha
Pokhrel, Xi Huang, Anand Jayarajan, Robert Greer, Gennady Pekhi-
menko, Melissa McCradden, Mjaye Mazwi, et al. How to validate
machine learning models prior to deployment: Silent trial protocol
for evaluation of real-time models at icu. In Conference on Health,
Inference, and Learning, pages 169–182. PMLR, 2022.
[256] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander
Turner, and Aleksander Madry. Robustness may be at odds with ac-
curacy. arXiv preprint arXiv:1805.12152, 2018.
[257] Ata Ullah, Muhammad Azeem, Humaira Ashraf, Abdulellah A Al-
aboudi, Mamoona Humayun, and NZ Jhanjhi. Secure healthcare data
aggregation and transmission in iot—a survey. IEEE Access, 9:16849–
16865, 2021.
[258] Dennis Ulmer, Lotta Meijerink, and Giovanni Cinà. Trust issues: Un-
certainty estimation does not enable reliable ood detection on medical
tabular data. In Machine Learning for Health, pages 341–354. PMLR,
2020.
[259] Boris van Breugel, Trent Kyono, Jeroen Berrevoets, and Mihaela
van der Schaar. Decaf: Generating fair synthetic data using causally-
aware generative networks. Advances in Neural Information Processing
Systems, 34:22221–22233, 2021.
[260] Gido M Van de Ven and Andreas S Tolias. Three scenarios for continual
learning. arXiv preprint arXiv:1904.07734, 2019.

[261] MHWA van den Boogaard, L Schoonhoven, E Maseda, C Plowright,
C Jones, A Luetz, PV Sackey, PG Jorens, LM Aitken, FMP van
Haren, et al. Recalibration of the delirium prediction model for icu
patients (pre-deliric): a multinational observational study. Intensive
care medicine, 40(3):361–369, 2014.

[262] Arnaud Van Looveren, Giovanni Vacanti, Janis Klaise, Alexandru


Coca, and Oliver Cobb. Alibi detect: Algorithms for outlier, adver-
sarial and drift detection, 2019.

[263] Basil Varkey. Principles of clinical ethics and their application to prac-
tice. Medical Principles and Practice, 30(1):17–28, 2021.

[264] Kush R Varshney. Trustworthy machine learning and artificial intelli-


gence. XRDS: Crossroads, The ACM Magazine for Students, 25(3):26–
29, 2019.

[265] Baptiste Vasey, Myura Nagendran, Bruce Campbell, David A. Clifton,
et al. Reporting guideline for the early-stage clinical evaluation
of decision support systems driven by artificial intelligence: Decide-ai.
Nature Medicine, 2022.

[266] Amol A. Verma, Russell Murray, Joshua Greiner, Joseph Paul Cohen,
Kaveh G. Shojania, Marzyeh Ghassemi, Sharon E. Straus, Chloe Pou-
Prom, and Muhammad Mamdani. Implementing machine learning in
medicine. CMAJ, 193(34):E1351–E1357, 2021.

[267] Kerstin N Vokinger, Stefan Feuerriegel, and Aaron S Kesselheim. Con-


tinual learning in medical devices: Fda’s action plan and beyond. The
Lancet Digital Health, 3(6):e337–e338, 2021.

[268] Olga Vovk, Gunnar Piho, and Peeter Ross. Evaluation of anonymiza-
tion tools for health data. In International Conference on Model and
Data Engineering, pages 302–313. Springer, 2021.

[269] Darshali A. Vyas, Leo G. Eisenstein, and David S. Jones. Hidden


in plain sight — reconsidering the use of race correction in clinical
algorithms. New England Journal of Medicine, 383:874–882, 2020.

[270] Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris
Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gal-
lagher, and Scott McLachlan. Synthea: An approach, method, and
software mechanism for generating synthetic patients and the syn-
thetic electronic health care record. Journal of the American Medical
Informatics Association, 25(3):230–238, 2018.

[271] Jie Wang, Ghulam Mubashar Hassan, and Naveed Akhtar. A survey
of neural trojan attacks and defenses in deep learning. arXiv preprint
arXiv:2202.07183, 2022.

[272] Lu Wang, Mark Chignell, Yilun Zhang, Andrew Pinto, Fahad Razak,
Kathleen Sheehan, and Amol Verma. Physician experience design
(pxd): more usable machine learning prediction for clinical decision
making. In AMIA Annual Symposium Proceedings, volume 2022, page
476. American Medical Informatics Association, 2022.

[273] Shirly Wang, Matthew BA McDermott, Geeticka Chauhan, Marzyeh


Ghassemi, Michael C Hughes, and Tristan Naumann. Mimic-extract: A
data extraction, preprocessing, and representation pipeline for mimic-
iii. In Proceedings of the ACM conference on health, inference, and
learning, pages 222–235, 2020.

[274] Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente
Ordonez. Balanced datasets are not enough: Estimating and mitigating
gender bias in deep image representations. pages 5309–5318, 2019.

[275] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi


Bagheri, and Ronald M Summers. Chestx-ray8: Hospital-scale chest x-
ray database and benchmarks on weakly-supervised classification and
localization of common thorax diseases. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 2097–
2106, 2017.

[276] Judy Wawira Gichoya, Imon Banerjee, Ananth Reddy Bhimireddy,


John L Burns, Leo Anthony Celi, Li-Ching Chen, Ramon Correa, Na-
talie Dullerud, Marzyeh Ghassemi, Shih-Cheng Huang, et al. Ai recog-
nition of patient race in medical imaging: a modelling study. The
Lancet Digital Health, 4:e406–e414, 2022.

[277] Sarah Wiegreffe and Ana Marasović. Teach me to explain: A review of
datasets for explainable nlp. arXiv preprint arXiv:2102.12060, 2021.

[278] Jenna Wiens, Suchi Saria, Mark Sendak, Marzyeh Ghassemi, Vincent X
Liu, Finale Doshi-Velez, Kenneth Jung, Katherine Heller, David Kale,
Mohammed Saeed, et al. Do no harm: a roadmap for responsible
machine learning for health care. Nature medicine, 25(9):1337–1340,
2019.

[279] Yinjun Wu, Edgar Dobriban, and Susan Davidson. Deltagrad: Rapid
retraining of machine learning models. volume 119, pages 10355–10366,
2020.

[280] Dingyi Xiang and Wei Cai. Privacy protection and secondary use of
health data: Strategies and methods. BioMed Research International,
2021, 2021.

[281] Depeng Xu, Yongkai Wu, Shuhan Yuan, Lu Zhang, and Xintao Wu.
Achieving causal fairness through generative adversarial networks. In
Proceedings of the Twenty-Eighth International Joint Conference on
Artificial Intelligence, 2019.

[282] Jie Xu, Yunyu Xiao, Wendy Hui Wang, Yue Ning, Elizabeth A
Shenkman, Jiang Bian, and Fei Wang. Algorithmic fairness in com-
putational medicine. medRxiv, 2022.

[283] Shen Xu, Toby Rogers, Elliot Fairweather, Anthony Glenn, James Cur-
ran, and Vasa Curcin. Application of data provenance in healthcare
analytics software: information visualisation of user activities. AMIA
Summits on Translational Science Proceedings, 2018:263, 2018.

[284] Adam Yala, Constance Lehman, Tal Schuster, Tally Portnoi, and
Regina Barzilay. A deep learning mammography-based model for im-
proved breast cancer risk prediction. Radiology, 292:60–66, 2019.

[285] Yuzhe Yang, Haoran Zhang, Dina Katabi, and Marzyeh Ghassemi.
Change is hard: A closer look at subpopulation shift. arXiv preprint
arXiv:2302.12254, 2023.

[286] Sobia Yaqoob, Muhammad Murad Khan, Ramzan Talib, Arslan Dawood
Butt, Sohaib Saleem, Fatima Arif, and Amna Nadeem. Use of
blockchain in healthcare: a systematic literature review. International
Journal of Advanced Computer Science and Applications, 10(5), 2019.
[287] Eileen Yoshida, Shirley Fei, Karen Bavuso, Charles Lagor, and Saverio
Maviglia. The value of monitoring clinical decision support interven-
tions. Applied Clinical Informatics, 9(1):163–173, 2018.
[288] Shujian Yu, Xiaoyang Wang, and José C. Príncipe. Request-and-
reverify: Hierarchical hypothesis testing for concept drift detec-
tion with expensive labels. In Proceedings of the Twenty-Seventh
International Joint Conference on Artificial Intelligence, IJCAI-18,
pages 3033–3039. International Joint Conferences on Artificial Intel-
ligence Organization, 7 2018.
[289] John R Zech, Marcus A Badgeley, Manway Liu, Anthony B Costa,
Joseph J Titano, and Eric Karl Oermann. Variable generalization per-
formance of a deep learning model to detect pneumonia in chest ra-
diographs: a cross-sectional study. PLoS medicine, 15(11):e1002683,
2018.
[290] Angela Zhang, Lei Xing, James Zou, and Joseph C Wu. Shifting ma-
chine learning for healthcare from development to deployment and from
models to data. Nature Biomedical Engineering, pages 1–16, 2022.
[291] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating
unwanted biases with adversarial learning. page 335–340, 2018.
[292] Haoran Zhang, Natalie Dullerud, Karsten Roth, Lauren Oakden-
Rayner, Stephen Pfohl, and Marzyeh Ghassemi. Improving the fair-
ness of chest x-ray classifiers. In Conference on Health, Inference, and
Learning, pages 204–233. PMLR, 2022.
[293] Haoran Zhang, Natalie Dullerud, Laleh Seyyed-Kalantari, Quaid Mor-
ris, Shalmali Joshi, and Marzyeh Ghassemi. An empirical framework
for domain generalization in clinical settings. In Proceedings of the
Conference on Health, Inference, and Learning, pages 279–290, 2021.
[294] Haoran Zhang, Natalie Dullerud, Laleh Seyyed-Kalantari, Quaid Mor-
ris, Shalmali Joshi, and Marzyeh Ghassemi. An empirical framework
for domain generalization in clinical settings. In Proceedings of the
Conference on Health, Inference, and Learning, pages 279–290, 2021.

[295] Haoran Zhang, Amy Liu, Mohamed Abdalla, Matthew B. A. McDer-
mott, and Marzyeh Ghassemi. Hurtful words: Quantifying biases in
clinical contextual word embeddings. 2020.

[296] Tianran Zhang, Muhao Chen, and Alex AT Bui. Adadiag: Adver-
sarial domain adaptation of diagnostic prediction with clinical event
sequences. Journal of biomedical informatics, 134:104168, 2022.

[297] Tianyuan Zhang and Zhanxing Zhu. Interpreting adversarially trained


convolutional neural networks. In International Conference on Machine
Learning, pages 7502–7511. PMLR, 2019.

[298] Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei
Chang. Learning gender-neutral word embeddings. arXiv preprint
arXiv:1809.01496, 2018.

[299] Shengjia Zhao, Abhishek Sinha, Yutong He, Aidan Perreault, Jiaming
Song, and Stefano Ermon. Comparing distributions by measuring dif-
ferences that affect decision making. In International Conference on
Learning Representations, 2021.

[300] Yizhen Zhao. MLOps Scaling ML in an Industrial Setting. PhD thesis,


University of Amsterdam, 2021.

[301] Xiaofeng Zhu and Diego Klabjan. Continual neural network model
retraining. In 2021 IEEE International Conference on Big Data (Big
Data), pages 1163–1171. IEEE, 2021.

[302] Georges Zissis. The r3 concept: Reliability, robustness, and resilience


[president’s message]. IEEE Industry Applications Magazine, 25(4):5–
6, 2019.

[303] E. Ötleş, J. Oh, B. Li, M. Bochinski, H. Joo, J. Ortwine, E. Shenoy,
L. Washer, V. B. Young, K. Rao, and J. Wiens. Mind the performance
gap: examining dataset shift during prospective validation.
Proceedings of Machine Learning Research, pages 506–534, 2021.

Tool and description:
FairMLHealth31: Tools and tutorials for variation analysis in healthcare machine learning.
AIF360 [33]: An open-source library containing techniques developed by the research community to help detect and mitigate bias in machine learning models throughout the AI application lifecycle.
Fairlearn32: An open-source, community-driven project to help data scientists improve the fairness of AI systems.
Fairness-comparison33: Benchmarks fairness-aware machine learning techniques.
Fairness Indicators34: A suite of tools built on top of TensorFlow Model Analysis (TFMA) that enables regular evaluation of fairness metrics in product pipelines.
ML-fairness-gym35: A tool for exploring the long-term impacts of ML systems.
themis-ml [27]: An open-source machine learning library that implements several fairness-aware methods that comply with the sklearn API.
FairML [11]: A toolbox for diagnosing bias in predictive modelling.
Black Box Auditing [12]: A toolkit for auditing ML model deviations.
What-If Tool36: Visually probes the behaviour of trained machine learning models, with minimal coding.
Aequitas37: An open-source bias audit toolkit for machine learning developers, analysts, and policymakers to audit machine learning models for discrimination and bias, and to make informed and equitable decisions around developing and deploying predictive risk-assessment tools.
DECAF [259]: A fair synthetic data generator for tabular data using GANs and causal models.
REPAIR [146]: A dataset resampling algorithm that reduces representation bias by reweighting.
CERTIFAI [243]: Evaluates AI models for robustness, fairness, and explainability, and allows users to compare different models or model versions on these qualities.
FairSight [13]: A fair decision-making pipeline that assists decision makers in tracking fairness throughout a model.
Adv-Demog-Text [72]: An adversarial network that removes demographic attributes from text data.
GN-GloVe [298]: A framework for generating gender-neutral word embeddings.
Tensorflow Constrained Optimization38: A library for optimizing inequality-constrained problems using rate helpers.
Responsibly39 [60]: A toolkit for auditing and mitigating bias and unfairness in ML systems.
Dataset-Nutrition-Label [110]: The Data Nutrition Project aims to create a standard label for interrogating datasets.
Table 6: List of open-source tools available on GitHub that can be used for ML Monitoring
and Updating specific to health.
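
To make concrete how the auditing tools listed in Table 6 are typically applied, the sketch below computes disaggregated performance and selection rates with Fairlearn's MetricFrame. It is a minimal illustration only: the sensitive attribute, labels, and predictions are synthetic placeholders rather than outputs of any cited model or dataset, and a real audit would substitute a deployed model's predictions and the cohort's recorded attributes.

# Minimal sketch of a group-wise fairness audit in the style of the tools in
# Table 6, using Fairlearn. All data below are synthetic placeholders.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from fairlearn.metrics import MetricFrame, selection_rate

rng = np.random.default_rng(0)
n = 1000
sex = rng.choice(["female", "male"], size=n)   # hypothetical sensitive attribute
y_true = rng.integers(0, 2, size=n)            # placeholder ground-truth labels
y_pred = rng.integers(0, 2, size=n)            # placeholder model predictions

# MetricFrame evaluates each metric overall and within each group defined by
# the sensitive feature, which is the core disaggregated-reporting pattern.
audit = MetricFrame(
    metrics={
        "accuracy": accuracy_score,
        "recall": recall_score,
        "selection_rate": selection_rate,
    },
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sex,
)

print(audit.overall)       # aggregate metrics
print(audit.by_group)      # metrics disaggregated by group
print(audit.difference())  # largest between-group gap for each metric

Toolkits such as AIF360 and Aequitas support a similar disaggregate-then-compare workflow, differing mainly in the metrics and mitigation algorithms they bundle.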
