Cao Xiao • Jimeng Sun

Introduction to Deep Learning for Healthcare

Cao Xiao, Seattle, WA, USA
Jimeng Sun, San Francisco, CA, USA
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Deep learning models are multi-layer neural networks that have shown great success in diverse applications. This book describes deep learning models in the context of healthcare applications.

Story 1 When we took an artificial intelligence class many years ago, many topics were covered, including neural networks. The neural network model was presented as a supervised learning method. However, it was considered a practical failure compared to other, more effective supervised learning methods such as decision trees and support vector machines. The common explanation of neural networks at the time involved two aspects: (1) multi-layer neural networks can approximate arbitrary functions and hence are theoretically powerful models; (2) in practice, they did not work well due to the ineffective learning algorithm (i.e., the backpropagation method). When we asked why backpropagation did not work well, a typical answer pointed to errors accumulated across layers, which eventually become too large to yield an accurate model. Of course, the understanding of neural networks has evolved greatly in the past few years. When big labeled datasets and parallel computing infrastructure such as graphics processing units (GPUs) finally became available, the power of deep neural networks was unleashed. These days, deep learning models have become the most popular and standard machine learning models.

Story 2 When we first got into machine learning for healthcare many years ago, we spoke with a senior medical doctor about the potential impact of machine learning and artificial intelligence (AI) in medicine. Specifically, we asked him about the possibility of creating AI algorithms to mimic the practice of real-world doctors. He was very pessimistic about the possibility because he believed…
Contents

1 Introduction
   1.1 Motivating Applications
      1.1.1 Diabetic Retinopathy Detection
      1.1.2 Early Detection of Heart Failure
      1.1.3 Sleep Analysis
      1.1.4 Treatment Recommendation
      1.1.5 Clinical Trial Matching
      1.1.6 Molecule Property Prediction and Generation
   1.2 Who Should Read This Book?
   1.3 Who Are the Authors?
   1.4 Book Organization
   1.5 Exercises
2 Health Data
   2.1 The Growth of Electronic Health Records
   2.2 Health Data
      2.2.1 The Life Cycle of Health Data
      2.2.2 Structured Health Data
      2.2.3 Unstructured Clinical Notes
      2.2.4 Continuous Signals
      2.2.5 Medical Imaging Data
      2.2.6 Biomedical Data for In Silico Drug Discovery
   2.3 Health Data Standards
   2.4 Exercises
3 Machine Learning Basics
   3.1 Predictive Modeling Pipeline
   3.2 Supervised Learning
      3.2.1 Logistic Regression
      3.2.2 Softmax Regression
      3.2.3 Gradient Descent
      3.2.4 Stochastic and Minibatch Gradient Descent
Bibliography
Chapter 1
Introduction
Humans are the only species on earth that can actively and systematically improve their health via technologies in the form of medicine. Throughout history, human knowledge has been the driving force for the progress of medicine and healthcare. Humans created new technologies such as diagnostic tests, drugs, medical procedures, and devices. As life expectancy increases, healthcare costs have grown so dramatically over the years that they are deemed unsustainable. For example, the US healthcare cost in 2019 alone was over 3.6 trillion dollars, accounting for 17.8% of gross domestic product (GDP). Within this gigantic spending on healthcare, there is enormous waste that should be avoided: the estimated total annual cost of waste was $760 billion to $935 billion [135].

Meanwhile, mountains of new medical knowledge are being created, making human doctors' knowledge quickly outdated. Moreover, human doctors are struggling to keep up with the increasing volume of patient visits. Physician burnout is a serious issue that affects all doctors in the age of electronic health records, due to the overwhelming patient data for doctors to review and complex workflows, including tedious documentation tasks. Patients are also dissatisfied with the limited interaction and attention from doctors during their short clinical visits. Quality of care is often sub-optimal, with over 400K preventable medical errors in hospitals each year [78].

With the rise of artificial intelligence (AI), can new healthcare technology be created by machines directly? For example, can machines provide more accurate diagnoses than human doctors? At the center of the AI revolution, deep learning technology is a set of machine learning techniques that learn multiple layers of neural networks for supporting prediction, classification, clustering, and data generation tasks. The success of deep learning comes from
• Data: Large amounts of rich data, especially images and natural language texts, have become available for training deep learning models.
• Algorithms: Efficient neural network methods have been proposed and enhanced by many researchers in recent years.

Deep learning has had great success in the technology industry. Very hard technical problems, such as image classification, machine translation, and speech recognition, have seen amazing performance improvements. There are various promising deep learning applications in healthcare, including medical imaging analysis, waveform…
Fig. 1.1 Example retinal photos from a healthy individual and a diseased patient. Credit to https://ai.googleblog.com/2016/11/deep-learning-for-detection-of-diabetic.html

1 AUC is a common classification metric that will be described in detail in Section "Real-Value…
1.1.2 Early Detection of Heart Failure

Heart failure (HF) is another deadly disease, affecting approximately 5.7 million people in the United States, with over 825,000 new cases per year and around 33 billion dollars in total annual cost. The lifetime risk of developing HF is 20% at 40 years of age [2]. HF has a high mortality rate of about 50% within 5 years of diagnosis [5]. There has been relatively little progress in slowing HF progression because there is no effective means of early HF detection with which to test interventions. Choi et al. [30] used a deep learning model called a recurrent neural network to model longitudinal electronic health records and accurately predict the onset of HF 6 to 18 months before the actual diagnosis.
1.1.4 Treatment Recommendation

Medical error is the third leading cause of death in the US. The Food and Drug Administration (FDA) receives more than 100,000 reports each year related to suspected medication errors. To ensure medication safety, different medication recommendation methods have been proposed using deep learning. For example, researchers have attempted to build predictive models that suggest treatments based on patient information, including diagnoses, procedures, and medication history, drawing on abundant longitudinal electronic health record data. LEAP [174] and GAMENet [134] are examples of such models using deep learning, particularly sequence-to-sequence models and memory networks.
1.1.5 Clinical Trial Matching

Clinical trials play important roles in drug development but often suffer from expensive, inaccurate, and insufficient patient recruitment. The availability of electronic health records (EHR) and trial eligibility criteria (EC) brings a new opportunity to develop computational models for patient recruitment. One key task, patient-trial matching, is to find qualified patients for clinical trials given structured EHR and unstructured EC text (both inclusion and exclusion criteria). How can we match complex EC text with longitudinal patient EHRs? How can we embed many-to-many relationships between patients and trials? How can we explicitly handle the difference between inclusion and exclusion criteria? COMPOSE [50] and DeepEnroll [176] are two deep learning models for patient-trial matching. They search through EHR data to identify matching patients based on the trial eligibility criteria described in natural language.
1.1.6 Molecule Property Prediction and Generation

Drug discovery is about finding molecules with desirable properties for treating a target disease. Traditional drug discovery heavily depends on high-throughput screening (HTS), which involves many costly wet-lab experiments. Given large molecule databases and their associated drug properties (e.g., from DrugBank), machine learning models, especially deep learning models, have shown great potential in identifying promising drug candidates. For example, some deep learning models have been proposed to predict drug properties given input molecule structures [105, 139]. Others have been introduced to generate brand-new molecules with desirable properties [48, 80].
1.2 Who Should Read This Book?

Machine learning researchers and engineers can benefit from this book if they want to learn more about healthcare data and analytic problems. This book does not require any computer programming knowledge and can be used as a textbook for the concepts of deep learning and its applications. We deliberately do not cover programming details so that we can reach a broader readership. Also, given the fast pace of deep learning software evolution, anything we wrote on programming topics would likely be outdated very quickly. Nevertheless, hands-on exercises are essential for gaining practical knowledge of deep learning. We encourage readers to supplement this book with hands-on programming exercises, online tutorials, and other books on deep learning software packages.
1.4 Book Organization

• Chapter 2 covers various healthcare data, ranging from structured data such as diagnosis codes to unstructured data such as clinical notes and medical imaging data. This chapter also introduces important health data standards such as International Classification of Diseases (ICD) codes.
• Chapter 3 provides a primer of machine learning basics. We present the funda-
mental machine learning tasks, including supervised and unsupervised learning
and some classical examples (e.g., logistic regression and principal component
analysis). We also describe evaluation metrics for different tasks such as the area
under the receiver operating characteristic curve (AUC) for classification and
mean squared error for regression.
• Chapter 4 presents deep neural networks (DNN), also called feed-forward neural networks or multi-layer perceptrons (MLP). In particular, we cover the basic components of DNNs, including neurons, activation functions, loss functions, and forward/backward passes. Of course, we also introduce the famous backpropagation algorithm for training DNN models. We also present two case studies: hospital re-admission prediction and drug property prediction.
• Chapter 5 illustrates the idea of embedding using neural networks, including popular algorithms such as Word2Vec and domain-specific embeddings for EHR data such as Med2Vec and MiME.
• Chapter 6 introduces convolutional neural networks (CNN) designed for grid-like
data such as images and time series. We will present the important operation of
CNNs, such as convolution and pooling, and some popular CNN architectures.
We will also describe the application of CNNs on medical imaging data and
clinical waveforms such as the electrocardiogram (ECG).
• Chapter 7 covers recurrent neural networks (RNN) designed to handle sequential
data such as clinical text. We present important variants of RNN, including Long
short-term memory (LSTM) and gated recurrent unit (GRU). The RNN case
studies include heart failure prediction and de-identification of clinical notes.
• Chapter 8 describes the autoencoder model, an unsupervised neural network model. We also present case studies of autoencoders, including phenotyping EHR data.
The advanced part covers the attention model (Chap. 9), graph neural networks
(Chap. 10), memory networks (Chap. 11), and generative models (Chap. 12).
• Chapter 9 introduces attention models, which create the foundation for many advanced deep learning models. We illustrate the attention model in several healthcare applications, including disease risk prediction, diagnosis code assignment, and ECG classification.
• Chapter 10 presents another foundation of advanced deep learning models,
namely graph neural networks (GNN). Graph data are common in many health-
care tasks such as medical ontology and molecule graphs. GNN models are a
set of powerful neural network models for graph data. The case studies focus on
drug discovery.
• Chapter 11 presents memory network-based models, a set of powerful models for embedding complex data (such as text and time series). We introduce the idea behind the original memory networks and their powerful extensions, such as the Transformer and BERT models. We demonstrate memory networks on medication recommendation tasks.
• Chapter 12 presents deep generative models, including generative adversarial
networks (GAN) and variational autoencoder (VAE). We show their applications
in synthetic EHR data generation and molecule generation for drug discovery.
1.5 Exercises
1. Present an example data science application for lower healthcare cost, specify
the datasets needed for building such machine learning models, and describe the
evaluation metrics.
2. Which type of healthcare data are considered large in terms of data volume?
(a) Genomic data
(b) Medical imaging data
(c) Clinical notes
(d) Medical claims
3. Which type of healthcare data are considered fastest in velocity?
(a) Real-time monitoring data from intensive care units
(b) Medical imaging data
(c) Structured electronic health records
(d) Clinical notes
4. Which one is NOT a drug discovery and development application?
(a) Sepsis detection
(b) Molecule property prediction
(c) Clinical trial recruitment
(d) Molecule generation
5. Which one is a public health application?
(a) Mortality prediction in ICU
(b) Patient triaging application
(c) Treatment recommendation for heart failure
(d) Predicting COVID-19 cases at different locations in the US
Chapter 2
Health Data
Health data are diverse with multiple modalities. This chapter will introduce
different types of health data, including structured health data (e.g., diagnosis codes,
procedure codes) and unstructured data (e.g., clinical notes, medical images). We
will also present the popular health data standards for representing those data.
2.1 The Growth of Electronic Health Records

Over the past decade, more and more health service providers worldwide have adopted electronic health record (EHR) systems to manage data about patients and records of healthcare services. For example, Fig. 2.1 shows the increase in national basic and certified EHR adoption rates over time, according to the American Hospital Association Annual Survey. Here the basic EHR adoption curve corresponds to EHR systems having basic functions such as patient demographics, physician notes, lab results, medications, diagnoses, and clinical and drug safety guidelines. A certified EHR has to cover essential EHR technology that meets the technological capability, functionality, and security requirements adopted by the Department of Health and Human Services. From Fig. 2.1, we can see that nearly all reported hospitals (96%) possessed certified EHR technology by 2015. In 2015, 84.8% of hospitals had adopted at least a basic EHR system, a ninefold increase since 2008. Thanks to the wide deployment of EHR systems, many healthcare institutions have collected diverse health data. This chapter provides an overview of health data: what different data types are available, how the data are collected, and by whom. All these data are potential inputs for training deep learning models to support diverse healthcare tasks.
Fig. 2.1 Percentage of EHR system adoption over time. Basic EHR means EHR systems with a set of required functionalities such as patient demographics, physician notes, lab results, medications, diagnoses, and clinical and drug safety guidelines. A certified EHR system means the hospital has essential EHR technology certified by the Department of Health and Human Services. Source: American Hospital Association Annual Survey
2.2 Health Data

Shifting from traditional paper-based records to electronic records has generated a massive collection of health data, creating opportunities for enhanced patient care, data-driven care delivery, and accelerated healthcare research. According to the definition from the Centers for Medicare & Medicaid Services (CMS), a federal institution that administers government-owned health insurance services, an EHR is "an electronic version of a patient's medical history, that is maintained by the provider over time, and may include all of the key administrative, clinical data relevant to that person's care under a particular provider, including demographics, progress notes, problems, medications, vital signs, past medical history, immunizations, laboratory data, and radiology reports".1 From the modeling perspective, an EHR can be viewed
as a longitudinal record of comprehensive medical services provided to patients and
documentation of patient medical history. There are several important observations
of EHR data:
1. EHR data are mostly managed by providers (although there is an ongoing
movement to enable patients to augment additional data into their EHRs);
2. Each provider manages their own EHR systems; as a result, partial information
about the same patient may scatter across EHR systems from multiple providers;
1 https://www.cms.gov/Medicare/E-Health/EHealthRecords/index.html.
3. The main purpose of EHR data is to support accurate and efficient billing, which creates challenges for secondary uses of EHR data such as research.
2.2.1 The Life Cycle of Health Data

Let us first introduce the key players in the healthcare industry and the life cycle of health data from the perspectives of those key players.
Key Healthcare Players There are diverse healthcare institutions that generate and
manage health data.
• Providers are hospitals, clinics, and health systems, which provide healthcare
services to patients. Providers use electronic health records (EHR) systems to
capture everything that happened during patient encounters, such as diagnosis
codes, medication prescription, lab and imaging results, and clinical notes.
Providers interact with other players such as payers, pharmacies, and labs.
• Payers are entities that provide health insurance. They can be private insurance
companies such as United Healthcare and Anthem, or public insurance programs owned by government entities such as Medicare and Medicaid. Payers reimburse the full or partial cost associated with healthcare
services to providers. Payers interact with providers and pharmacies via claims.
More specifically, providers and pharmacies submit claims to corresponding
payers from which they have medical insurance. Claims are usually structured
data with diagnosis, procedure, and medication codes, and associated cost
information.
• Pharmacies prepare and dispense medications often based on medication pre-
scriptions. Pharmacies know what and when patients actually fill medications.
Pharmacies produce pharmacy claims, which are also sent to payers of the
patients for reimbursement.
• Pharmaceutical companies discover, develop, produce, and market drugs.
They conduct clinical trials to validate new drugs. Pharmaceuticals generate
experimental data for drug discovery and clinical trials data.
• Contract Research Organizations (CROs) provide outsourced contract services to pharmaceutical companies, such as pre-clinical and clinical research and clinical trial management. Depending on the service lines, CROs produce various datasets that support pharmaceutical companies in getting their drugs to market.
• Government agencies play multiple roles in the healthcare ecosystem. The Food and Drug Administration (FDA) is the most important regulator, approving new drugs and monitoring existing drugs. The Centers for Disease Control and Prevention (CDC) is a public health institution focusing on monitoring, controlling, and preventing diseases. For example, government agencies worldwide have been collecting reports from Spontaneous Reporting Systems (SRS), submitted by pharmaceutical companies, healthcare professionals, and consumers, to facilitate post-market drug safety surveillance.
Fig. 2.2 Important players and the life cycle of health data
2.2.2 Structured Health Data

Structured data are common in healthcare and are often represented as medical codes.
Various medical codes are used in both EHR and claims data, usually following common data standards. For example, diagnosis codes follow the International Classification of Diseases (ICD), and procedures use Current Procedural Terminology (CPT) codes. The number of unique codes in each data standard is large and growing. For example, ICD version 9 (ICD-9) has over 13,000 diagnosis codes, while ICD-10 has over 68,000. Each encounter is associated with only a few codes, so the resulting data are high-dimensional but sparse. A simple and direct way to represent such data is to create a one-hot vector for a medical code and a multi-hot vector for a patient with multiple codes, as shown in Fig. 2.3. For example, to represent the diagnosis information in an encounter, one can create a 68,000-dimensional binary vector where each element is a binary indicator of a corresponding diagnosis code. If only one dimension is one and the rest are zeros, we call it a one-hot vector; if multiple ones are present, it is a multi-hot vector. As we will show in later chapters, such multi-hot vectors can be improved with various deep learning approaches that construct appropriate lower-dimensional representations.
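To make this encoding concrete, here is a minimal Python sketch that builds a multi-hot vector from a patient's diagnosis codes. The vocabulary and example ICD-10 codes are illustrative stand-ins, not data from this book:

```python
import numpy as np

# Illustrative vocabulary; in practice this is the full set of diagnosis
# codes observed in the data (tens of thousands of ICD-10 codes).
vocab = ["E11.9", "I10", "I50.9", "J45.909", "N18.3"]
code_to_index = {code: i for i, code in enumerate(vocab)}

def multi_hot(codes, code_to_index):
    """Encode a set of diagnosis codes as a binary multi-hot vector."""
    v = np.zeros(len(code_to_index), dtype=np.float32)
    for code in codes:
        if code in code_to_index:  # silently skip out-of-vocabulary codes
            v[code_to_index[code]] = 1.0
    return v

# One encounter with two diagnoses yields a multi-hot vector.
print(multi_hot(["I10", "I50.9"], code_to_index))  # [0. 1. 1. 0. 0.]
```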
Most medical codes follow certain standards, such as ICD for diagnoses, CPT for procedures, and NDC for drugs, which will be explained later under health data standards. Most of these medical codes are organized in hierarchies defined by medical ontologies such as CCS codes. These hierarchies are instrumental in constructing more meaningful, lower-dimensional input features for deep learning models. For example, instead of directly treating each ICD-10 code as a feature, we can group the codes into a few hundred CCS codes,2 i.e., higher-level disease categories, and treat each CCS code as a feature.
2 https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp.
14 2 Health Data
Fig. 2.3 Examples of one-hot vectors of medical codes and a multi-hot vector for a patient. Here 1 or 0 indicates the presence or absence of a particular diagnosis

Fig. 2.4 In EHR data, medical codes are structured hierarchically with heterogeneous relations; e.g., the medication Acetaminophen and the procedure IV fluid are correlated, as both occur due to the same diagnosis, Fever

Orders such as Cardiac EKG come with several results (QRS duration, Q-T interval, notes), but IV Fluid does not. Note that some diagnoses might not be associated with any medication or procedure orders.
Other Structured Data In addition to medical codes, there are other structured
data such as patient demographics (e.g., age and gender), vital signs (e.g., blood
pressure), and social history (e.g., smoking status). These are also stored as struc-
tured fields in the EHR database. Various data standards are used for documenting
different medical codes, as summarized in Tables 2.1 and 2.2.
2.2.3 Unstructured Clinical Notes

Clinical notes are recorded for various purposes: documenting an encounter in discharge summaries, describing the reason and use of prescriptions in medication notes, interpreting the results of medical images in radiology reports, or analyzing the results of lab tests in pathology reports. The reports include different sets of functional components and thus present varying levels of difficulty in understanding. There is growing interest in using machine learning for these unstructured clinical notes, especially deep learning methods for automated indexing, abstracting, and understanding. Some works also focus on automated classification of clinical free text into standardized medical codes, which is important for generating appropriate claims for a visit [114]. For example, a progress note is one type of clinical note commonly used to describe clinical encounters. A progress note is usually structured in four sections with the acronym SOAP (a minimal parsing sketch follows this list):
• Subjective part describes what the patient tells you;
• Objective part presents the objective findings such as lab test results and imaging results;
• Assessment part provides the diagnosis of the encounter;
• Plan part describes the treatment plan for the patient.
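As a rough illustration, the following Python sketch splits a toy progress note into its four SOAP sections. It assumes the section headers appear verbatim, which real notes often violate, and the note text itself is invented:

```python
import re

# A toy progress note; real notes vary widely in headers and formatting.
note = """Subjective: Patient reports shortness of breath on exertion.
Objective: BP 150/95; BNP elevated; chest X-ray shows congestion.
Assessment: Acute on chronic systolic heart failure.
Plan: Start IV diuretics, daily weights, low-sodium diet."""

# Split on the four SOAP headers; the capture group keeps each header.
pattern = re.compile(r"(Subjective|Objective|Assessment|Plan):")
parts = pattern.split(note)  # ['', 'Subjective', ' text', 'Objective', ...]
sections = {parts[i]: parts[i + 1].strip()
            for i in range(1, len(parts) - 1, 2)}
print(sections["Assessment"])  # Acute on chronic systolic heart failure.
```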
There are many different types of notes, including Admission notes, Emergency
department (ED) notes, Progress notes, Nursing notes, Radiology reports, ECG
reports, Echocardiogram reports, Physician notes, Discharge summary, and Social
work notes. Each type can be written in a very different format with different lengths
and quality.
Challenges Several key technical challenges exist in mining clinical text data:
1. High dimensionality: the number of unique words in a clinical text corpus is large, and many words are acronyms that can only be understood from their context.
2. External knowledge: in addition to the clinical text in EHR data, a large amount of medical knowledge is encoded in text such as medical literature and clinical guidelines. It is important to be able to incorporate that knowledge when modeling EHR data.
3. Privacy: beyond technical challenges, clinical text is hard to access because the data are sensitive and affect individual privacy. As a result, very few clinical notes are openly accessible for method development, and the volume of shared data is limited, largely due to privacy concerns.
2.2.4 Continuous Signals

With an increasing number of new medical and wellness sensors being created, continuous signals will make up a growing part of health data. The continuous signals most commonly collected in clinics include the electrocardiogram (ECG) and the electroencephalogram (EEG). ECG measures the electrical activity of the heart via electrodes placed on the chest, arms, and legs. In contrast, EEG measures the electrical activity of the brain via electrodes placed on the scalp. Both ECG and EEG are routine clinical tests.
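To illustrate a typical preprocessing step for such signals, the following Python sketch band-pass filters a synthetic single-lead ECG with SciPy. The sampling rate, filter band, and signal are all illustrative assumptions rather than recommendations from this book:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 250.0  # assumed sampling rate in Hz; depends on the recording device

# Synthetic stand-in for a raw ECG: a crude cardiac component plus
# baseline wander and high-frequency measurement noise.
t = np.arange(0, 10, 1 / fs)
raw = (np.sin(2 * np.pi * 1.2 * t)
       + 0.5 * np.sin(2 * np.pi * 0.2 * t)
       + 0.1 * np.random.randn(t.size))

# Band-pass filter (0.5-40 Hz) to suppress baseline wander and noise
# before any downstream modeling.
b, a = butter(3, [0.5 / (fs / 2), 40.0 / (fs / 2)], btype="band")
filtered = filtfilt(b, a, raw)  # zero-phase filtering, no time shift
```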
2.2.5 Medical Imaging Data

Medical imaging is about creating a visual representation of the human body for
diagnostic purposes. Various imaging technologies have been introduced, such
as X-ray radiography, computed tomography (CT), magnetic resonance imaging
(MRI), ultrasound (e.g., echocardiography). The resulting data are 2D images
or 3D representations from multiple 2D images and videos. Medical imaging
data are stored and managed by a separate system called Picture archiving and
communication system (PACS). The images themselves are stored and transmitted
in DICOM format (Digital Imaging and Communications in Medicine). Given the
raw imaging data, radiologists read and mark the images and write a text report
(radiology reports) to summarize the findings. The radiology reports are often
copied into the EHR systems so that clinicians and patients can access the findings of
the imaging tests. Thanks to the digitization of the radiology field, a large number of
high-resolution images and their corresponding phenotypes (labels) are available for
modeling. Thus there is tremendous excitement in using deep learning for radiology
tasks [65, 101].
Challenges Data quality and the lack of reliable, detailed labels remain challenges in analyzing such data. Because the raw input images are very large (high-dimensional), training accurate and generalizable models demands a sufficient sample size.
2.3 Health Data Standards

• ICD codes are a widely used international standard maintained by the World Health Organization (WHO). The latest version is ICD-11 as of 2020, while most of the world currently uses ICD-10. For example, "I50" is the ICD-10 category for heart failure, I50.2 is systolic (congestive) heart failure, and I50.21 is acute systolic (congestive) heart failure. ICD codes are used to represent disease diagnoses in EHR and claims data. Most EHR data will have ICD codes in either ICD-9 or ICD-10 format. An ICD-9 code has up to 5 digits: the first digit is either alphabetic or numeric, and the remaining digits are numeric. For example, the ICD-9 code for diabetes mellitus without mention of complications is 250.0x. The first three digits of an ICD-9 code correspond to the disease category, and the last 1 or 2 digits reflect subcategories of the disease (see the parsing sketch after this list). Besides numeric codes, ICD-9 codes can have an initial letter of V or E. For example, V85.x is an ICD-9 code for body mass index (BMI). In particular, V85.0 corresponds to BMI<19, V85.1 to BMI between 19 and 25, and V85.2x indicates BMI>25. ICD-10 codes are more granular than ICD-9 codes. Each ICD-10 code has up to 7 digits: the first digit is always a letter, the second digit is always numeric, and the third to seventh digits are alphanumeric. For example, E10.9 is the ICD-10 code for type 1 diabetes mellitus without complications.
• CPT corresponds to Current Procedural Terminology, which is a standard
created and copyrighted by American Medical Association. CPT codes represent
medical services and procedures that doctors can document and bill for pay-
ment. CPT codes also follow a hierarchical structure. For example, CPT codes
between 99201 and 99215 correspond to Office/other outpatient services, while
a high-level category 99201–99499 corresponds to codes for evaluation and
management. Like ICD codes, CPT codes are commonly present in structured
EHR data.
• NDC codes are 10- or 11-digit National Drug Codes, managed by the Food and Drug Administration (FDA). An NDC code consists of three segments: labeler, product, and package. For example, 0777-3105-02 is an NDC code where 0777 corresponds to the labeler Dista Products Company, 3105 maps to the product Prozac, and 02 indicates the package of 100 capsules in 1 bottle (see the parsing sketch after this list). The same drug with different packages will have different codes. From an analytic modeling perspective, NDC codes are probably too specific to be used directly as features.
• LOINC is a terminology standard for lab tests; it stands for Logical Observation Identifiers Names and Codes. Like other standards, LOINC has codes and associated descriptions. To support lab tests, a LOINC description follows a specific format with six parts: (1) COMPONENT (ANALYTE): the substance being measured or observed; (2) PROPERTY: the characteristic of the analyte; (3) TIME: the interval of time of the observation; (4) SYSTEM (SPECIMEN): the specimen upon which the observation was made; (5) SCALE: how the observation is quantified (quantitative, ordinal, nominal); (6) METHOD: how the observation was made (an optional part). For example, LOINC code 806-0 is the lab test of the manual count of white blood cells in a cerebral spinal fluid specimen. The parts of its description include Component: Leukocytes, Property: NCnc (number concentration),…
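As referenced in the list above, here is a minimal Python sketch that parses the code structures described in this section: the leading category of an ICD-9 code and the three segments of a hyphenated NDC code. The helper names are ours, and the ICD-9 handling is deliberately simplified:

```python
def icd9_category(code: str) -> str:
    """Return the disease-category prefix of an ICD-9 code.

    The first three digits give the category for numeric and V codes;
    E codes are handled approximately by keeping four characters.
    """
    root = code.split(".")[0]
    return root[:4] if root.upper().startswith("E") else root[:3]

def parse_ndc(ndc: str) -> dict:
    """Split a hyphenated NDC code into labeler, product, and package."""
    labeler, product, package = ndc.split("-")
    return {"labeler": labeler, "product": product, "package": package}

print(icd9_category("250.00"))    # '250' -> diabetes mellitus category
print(icd9_category("V85.1"))     # 'V85' -> body mass index
print(parse_ndc("0777-3105-02"))  # example NDC from the text
```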
2.4 Exercises
1. What are the most useful health data for predicting patient outcome (e.g.,
mortality)?
2. What are the most accessible health data? And why?
3. What are the most difficult health data (to access and to model)?
4. What are the important health data that are not described in this chapter?
5. Which of the following is NOT true about electronic health records (EHR)?
(a) EHR data from a single hospital contain the complete clinical history of each patient.
(b) Outpatient EHR data are viewed as point events.
(c) EHR data contain longitudinal patient records.
(d) Inpatient EHR data are viewed as interval events.
6. Which of the following is not true about clinical notes?
(a) They can provide a detailed description of patient status.
(b) Most EHR systems provide clinical notes functionality.
(c) Clinical notes can contain sensitive protected health information.
(d) Because of their unstructured format, it is easy for computer algorithms to process the notes.
7. Which of the following are the limitations of claims data?
(a) Coding errors can commonly occur in the claims data.
(b) Since claims data are for billing purposes, they do not accurately reflect
patient status.
(c) Claims data are rare and difficult to find.
(d) Claims data of a patient are often incomplete because they can go to
different hospitals.
8. Which of the following is not true?
(a) EHR data are richer than claims.
(b) EHR captures medication prescription information but does not capture whether the prescriptions are filled.
(c) Continuous signals are rarely collected in hospitals.
(d) Continuous signals provide objective assessments of patients.
9. Which of the following are not imaging data?
(a) X-rays
(b) Computed tomography
(c) Electrocardiogram
(d) Magnetic resonance imaging
10. What is true about medical literature data?
(a) They are difficult to parse because of the natural language format.
Chapter 3
Machine Learning Basics

Machine learning has changed many industries, including healthcare. The most fundamental concepts in machine learning include (1) supervised learning, which has been used to develop risk prediction models for target diseases, and (2) unsupervised learning, which has been applied to discover unknown disease subtypes. Both supervised and unsupervised learning model various patient features, such as demographic features (age, gender, and ethnicity) and past diagnosis features (e.g., ICD codes). The key difference is the presence of labels in supervised learning and their absence in unsupervised learning. A label is a gold standard for a target of interest, such as a patient's readmission status for training a readmission predictive model.

As we will describe in later chapters, most deep learning successes are in supervised learning. In contrast, the potential for unsupervised learning is immense due to the availability of a large amount of unlabeled data. This chapter will present the predictive modeling pipeline, basic models for supervised and unsupervised learning, and various model evaluation metrics. Table 3.1 defines the notation used in this chapter.
3.1 Predictive Modeling Pipeline

1. First, we need to define the prediction target, such as the onset of heart failure.
2. We then need to construct the patient cohort for this study. For example, we may include all patients with an age greater than 45 for a heart failure study. There are many reasons why cohort construction is needed when building healthcare predictive models: (1) there might be a financial cost associated with acquiring the dataset based on the cohort; (2) we may want to build the model for a specific group of patients instead of a general population; (3) the full set of all patients may have various data quality issues.
3. Next, we construct all the features from the data and select the features relevant for predicting the target. In a traditional machine learning pipeline, we often have to consider both feature construction and selection steps. With the rise of deep learning, features are often created and implicitly selected by the multiple layers of neural networks.
4. After that, we can build the predictive model, which can be either classification (i.e., discrete labels such as heart failure or not) or regression (i.e., continuous output such as length of stay).
5. Finally, we need to evaluate the model performance and iterate (a minimal end-to-end sketch follows this list).
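The sketch below runs steps 3 to 5 with scikit-learn on simulated data; the cohort, features, and label are all synthetic stand-ins, not real health data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic cohort: 1,000 patients, 50 binary features
# (e.g., multi-hot diagnosis indicators) and a binary label.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 50)).astype(float)
w_true = rng.normal(size=50)
y = (X @ w_true + rng.normal(size=1000) > 0).astype(int)

# Step 4: build the predictive model on a training split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 5: evaluate on held-out patients (AUC for classification).
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {auc:.3f}")
```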
3.2 Supervised Learning

We start with disease classification, a major supervised learning task in healthcare applications: given a set of patients and their associated patient data, assign each patient a discrete label y ∈ Y, where Y is the set of possible diseases. This problem has many applications, from studying electroencephalography (EEG) time series for seizure detection to analyzing electronic health records (EHR) for predicting heart failure diagnosis. Supervised learning tasks such as disease classification are also building blocks of many complex deep learning architectures, which we will discuss later.

Supervised learning expects as input a set of N data points; each data point consists of m input features x (also known as variables or predictors) and a label y ∈ Y (also known as a response, a target, or an outcome). Supervised learning aims to learn a mapping from features x to a label y based on the observed data points. If the labels are continuous (e.g., hospital cost), the supervised learning problem is called a regression problem; if the labels are discrete variables (e.g., mortality status), the problem is called a classification problem. In this chapter, the label y takes categorical values from K classes (i.e., y ∈ {1, 2, . . . , K}).

3.2.1 Logistic Regression1

1 Perhaps a confusing name, as logistic regression is used for classification, not regression. The naming choice will become meaningful once we explain the mathematical construction.
…and other disease diagnoses. The classification task is to determine whether a patient will have heart failure based on this M-dimensional feature vector x.

Mathematically, logistic regression models the probability of heart failure onset y = 1 given input features x, denoted by P(y = 1|x). Classification is then performed by comparing P(y = 1|x) with a threshold (e.g., 0.5). If P(y = 1|x) is greater than the threshold, we predict the patient will have heart failure; otherwise, we predict the patient will not.
One building block of logistic regression is the log-odds or logit function. The
odds are the quantity that measures the relative probability of label presence and
label absence as
$$\frac{P(y=1\mid x)}{1 - P(y=1\mid x)}.$$
The lower the odds, the lower the probability of the given label. Sometimes we prefer to use the log-odds (the natural logarithm of the odds), also known as the logit function:

$$\mathrm{logit}(x) = \log \frac{P(y=1\mid x)}{1 - P(y=1\mid x)}.$$
Now, instead of modeling the probability of the heart failure label given input features, P(y = 1|x), directly, it is easier to model its logit function as a linear regression over x:

$$\log \frac{P(y=1\mid x)}{1 - P(y=1\mid x)} = w^{\top} x + b \qquad (3.1)$$
where w is the weight vector and b is the offset variable. Equation (3.1) is the reason behind the name logistic regression.
After taking the exponential of both sides and some simple transformations, we obtain the following formula:

$$P(y=1\mid x) = \frac{e^{w^{\top} x + b}}{1 + e^{w^{\top} x + b}} \qquad (3.2)$$

With the formulation in Eq. (3.2), logistic regression always outputs values between 0 and 1, which is desirable for a probability estimate.
Let us denote P(y = 1|x) as P(x) for brevity. Learning the logistic regression model means estimating the parameters w and b from the training data. We often use maximum likelihood estimation (MLE) to find the parameters. The idea is to estimate w and b so that the prediction P̂(x_i) for data point i in the training data is as close as possible to the actually observed value (in this case, either 0 or 1). Let x⁺ be the set of indices of data points in the positive class (i.e., with heart failure) and x⁻ the set of indices of data points in the negative class (i.e., without heart failure); the likelihood function used in MLE is given by Eq. (3.3).
$$L(w, b) = \prod_{a^{+} \in x^{+}} P(x_{a^{+}}) \prod_{a^{-} \in x^{-}} \left(1 - P(x_{a^{-}})\right) \qquad (3.3)$$
Taking the logarithm of the likelihood gives the log-likelihood in Eq. (3.4):

$$\log L(w, b) = \sum_{i=1}^{N} \left[ y_i \log P(x_i) + (1 - y_i)\log(1 - P(x_i)) \right] \qquad (3.4)$$
Note that since either yi or 1 − yi is zero, only one of two probability terms (either
log P (x i ) or log(1 − P (x i ))) will be added.
Multiplying by a negative sign to obtain a minimization problem, we arrive at the negative log-likelihood, also known as the (binary) cross-entropy loss:

$$J(w, b) = -\sum_{i=1}^{N} \left[ y_i \log P(x_i) + (1 - y_i)\log(1 - P(x_i)) \right] \qquad (3.5)$$
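A minimal NumPy sketch of Eqs. (3.2) and (3.5); the function names are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """P(y = 1 | x) from Eq. (3.2), for each row of X."""
    return sigmoid(X @ w + b)

def cross_entropy(X, y, w, b, eps=1e-12):
    """Negative log-likelihood J(w, b) from Eq. (3.5)."""
    p = np.clip(predict_proba(X, w, b), eps, 1 - eps)  # avoid log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```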
3.2.2 Softmax Regression

We sometimes want to classify data points into more than two classes. For example, given brain image data from patients suspected of having Alzheimer's disease (AD), the diagnostic outcomes include (1) normal, (2) mild cognitive impairment (MCI), and (3) AD. In that case, we use multinomial logistic regression, also called softmax regression, to model the problem.
Assuming we have K classes, the goal is to estimate the probability of the class label taking on each of the K possible categories, P(y = k|x) for k = 1, · · · , K. Thus, we output a K-dimensional vector representing the estimated probabilities for all K classes. The probability that data point i is in class a is modeled by Eq. (3.6):

$$P(y_i = a \mid x_i) = \frac{e^{w_a^{\top} x_i + b_a}}{\sum_{k=1}^{K} e^{w_k^{\top} x_i + b_k}} \qquad (3.6)$$

where x_i is the feature vector for data point i, and w_a and b_a are the weight vector and the offset for class a, respectively. To learn the parameters of softmax regression, we often minimize the following average cross-entropy loss over all N training data points:
$$J(w) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} I(y_i = k) \log P(y_i = k \mid x_i)$$
3.2.3 Gradient Descent

Gradient descent (GD) is an iterative learning approach for finding the optimal parameters from data. For example, for softmax regression parameter estimation, we can use GD by computing the derivatives

$$\nabla_{w_k} J(w)$$

and updating the weights in the opposite direction of the gradient according to the rule

$$w_k := w_k - \eta \nabla_{w_k} J(w)$$

for each class k ∈ {1, · · · , K}, where η is the learning rate, an important hyperparameter that needs to be tuned. The gradient computation and weight update are performed iteratively until some stopping criterion is met (e.g., the maximum number of iterations is reached). Here ∇ is a differentiation operator that transforms a function J(w) into its gradient vector along each parameter dimension. For example, if w = [w_1, w_2, w_3], then the gradient vector is

$$\nabla J(w) = \left[ \frac{\partial J(w)}{\partial w_1}, \frac{\partial J(w)}{\partial w_2}, \frac{\partial J(w)}{\partial w_3} \right]$$
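Building on the NumPy sketch above, here is full-batch gradient descent for logistic regression. It uses the standard gradient of the cross-entropy loss, averaged over the training set, which this section does not derive; treat it as an illustrative sketch:

```python
def gradient_descent(X, y, lr=0.01, n_iters=500):
    """Full-batch gradient descent for logistic regression."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(n_iters):
        p = predict_proba(X, w, b)     # predictions under current w, b
        w -= lr * (X.T @ (p - y)) / n  # step opposite the average gradient
        b -= lr * np.mean(p - y)
    return w, b
```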
Fig. 3.2 Objective function changes for batch gradient descent and mini-batch gradient descent
3.2.4 Stochastic and Minibatch Gradient Descent

Stochastic gradient descent (SGD) performs a parameter update for every single data point in the training set. Given data point x_i with label y_i, SGD performs the following update:

$$\theta := \theta - \eta \nabla_{\theta} \, g(\theta; \langle x_i, y_i \rangle)$$

where η is the learning rate and g(θ; ⟨x_i, y_i⟩) is the objective function evaluated on the single data point ⟨x_i, y_i⟩. By updating on one data point at a time, SGD is computationally more efficient. However, since SGD updates based on one data point at a time, it can have much higher variance, which causes the objective function to fluctuate. Such behavior can cause SGD to deviate from the true optimum. There are several ways to alleviate this issue; for example, we can slowly decrease the learning rate, and empirically SGD then shows convergence behavior similar to batch gradient descent (Fig. 3.2).
The mini-batch approach inherits benefits from both GD and SGD. It computes the gradient over small batches of training data points. The mini-batch size n is a hyperparameter, and mini-batch gradient descent performs the following update:

$$\theta := \theta - \eta \nabla_{\theta} \, g(\theta; \langle x_i, y_i \rangle, \ldots, \langle x_{i+n-1}, y_{i+n-1} \rangle)$$

where ⟨x_i, y_i⟩, . . . , ⟨x_{i+n−1}, y_{i+n−1}⟩ are the n data points in a batch. The gradient is iteratively computed using batches of data points. Via such mini-batch updates, we reduce the variance of the parameter updates and mitigate the unstable convergence seen with SGD (a minimal sketch follows).
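A minimal mini-batch SGD loop in the same style, reusing predict_proba from the earlier sketch; setting batch_size=1 recovers plain SGD:

```python
def minibatch_sgd(X, y, lr=0.01, n_epochs=10, batch_size=32):
    """Mini-batch SGD: one parameter update per batch of data points."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        order = rng.permutation(n)           # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            p = predict_proba(X[idx], w, b)  # gradient on this batch only
            w -= lr * (X[idx].T @ (p - y[idx])) / len(idx)
            b -= lr * np.mean(p - y[idx])
    return w, b
```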
3.3 Unsupervised Learning

In many healthcare applications, labels are not available. In such cases, we resort to unsupervised learning. Unsupervised learning models are not used to classify (or predict) a known label y; rather, we discover patterns or clusters in the input data points x. Next, we briefly introduce some popular unsupervised learning methods.
3.3.1 Principal Component Analysis

Principal component analysis (PCA) maps the data matrix X to a lower-dimensional representation Y via a linear projection W:

$$Y = XW \qquad (3.9)$$

The projection W is chosen to minimize the reconstruction error

$$\min_{W} \; \left\| X - XWW^{\top} \right\|^{2}$$

so that the original data are approximated from the low-dimensional representation,

$$X \approx UW^{\top}$$

where U = XW is the projected data.
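A minimal PCA sketch via the singular value decomposition, consistent with Eq. (3.9). Using SVD here is a standard implementation choice on our part, not one prescribed by the text:

```python
import numpy as np

def pca(X, k):
    """Project centered data onto the top-k principal components."""
    Xc = X - X.mean(axis=0)  # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T             # m x k matrix of principal directions
    return Xc @ W, W         # Y = XW and the projection W

X = np.random.randn(100, 10)
Y, W = pca(X, k=2)
print(Y.shape)  # (100, 2)
```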
3.3.2 Clustering
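A plain NumPy sketch of the k-means procedure illustrated in Fig. 3.3, alternating the assignment and centroid-update steps. For brevity it omits convergence checks and does not handle empty clusters:

```python
import numpy as np

def kmeans(X, k, n_iters=20, seed=0):
    """Basic k-means clustering (see Fig. 3.3 for an illustration)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its points.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers
```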
3.4 Evaluation Metrics

The mean squared error (MSE) is the most basic performance metric for regression models. The formulation of MSE is given in Eq. (3.10).
Fig. 3.3 k-Means clustering over a set of points with K = 2. From left to right, top to bottom, we first initialize two cluster centers with a blue cross and a red cross. After a few iterations, all blue (red) points are assigned to the blue (red) cluster, completing the k-means clustering procedure
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2 \qquad (3.10)$$
where f(x_i) is the predicted value for the i-th data point. A small MSE means the predictions are close to the true observations on average; thus the model has a good fit. We can take the square root of MSE to obtain another popular metric, the root mean squared error (RMSE):

$$\mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2 } \qquad (3.11)$$
RMSE and MSE are commonly used in neural network parameter tuning. For example, the authors in [150] built a feedforward neural network model to find prognostic ischemic heart disease patterns from magnetocardiography (MCG) data. In training the model, RMSE was calculated as the evaluation metric to help choose the model parameters, such as the number of nodes in the hidden layer and the number of learning epochs (see Fig. 3.4). The hyperparameters leading to the lowest RMSE were then chosen for the final model.
Another measure for regression problems is the coefficient of determination (also called R²), which measures the correlation between the predicted values {f(x_i)} and the actual observations {y_i}. R² is computed using Eq. (3.12).
Fig. 3.4 In [150], to train the neural network model, RMSE was used as the evaluation metric in
parameter tuning, including the number of learning epochs and the number of nodes in the hidden
layer. Parameters exhibiting the lowest RMSE were chosen for the final model
$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2}{\sum_{i=1}^{N} \left( \bar{y} - y_i \right)^2} \qquad (3.12)$$
where $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$ is the sample mean of the target observations y_i.

R² measures the "squared correlation" between the observations y_i and the estimates f(x_i). If R² is close to 1, the model's estimates closely mirror the true observed outcomes. If R² is close to 0 (or even negative), the estimates are far from the outcomes. Note that R² can become negative, in which case the model fit is worse than simply predicting the average of y_i regardless of the value of x_i.
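The three regression metrics of this section, Eqs. (3.10) to (3.12), take only a few lines of NumPy:

```python
import numpy as np

def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)  # Eq. (3.10)

def rmse(y_pred, y_true):
    return np.sqrt(mse(y_pred, y_true))     # Eq. (3.11)

def r2(y_pred, y_true):
    ss_res = np.sum((y_pred - y_true) ** 2)
    ss_tot = np.sum((np.mean(y_true) - y_true) ** 2)
    return 1.0 - ss_res / ss_tot            # Eq. (3.12); can be negative
```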
If the predictions are binary values, we can construct a 2-by-2 confusion matrix
to quantify all the possibilities between predictions and labels. In particular, we
count the following four numbers: the number of case patients that are correctly
predicted as cases is the true positive (TP); the number of case patients that are
wrongly predicted as controls is the false negative (FN); the number of control
patients that are correctly predicted as controls is the true negative (TN); and the
number of control patients that are wrongly predicted as cases is the false positive
(FP) (Table 3.2).
A few important performance metrics can be derived from the confusion matrix, including accuracy, precision (also known as positive predictive value, PPV), recall (also known as sensitivity), false positive rate, specificity, and the F1 score.

Accuracy is the fraction of correct predictions over the total population, or formally:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
However, accuracy does not differentiate true positives (TP) from true negatives (TN). If there are many more controls (e.g., patients without the disease) than cases (patients with the disease), high accuracy can be trivially achieved by classifying everyone as a control (negative). Other metrics, such as precision and recall, address this class imbalance challenge indirectly by focusing on the positive class.
Precision, or positive predictive value (PPV), is the fraction of correct case predictions over all case predictions:

$$\mathrm{precision} = \frac{TP}{TP + FP}. \qquad (3.13)$$
Recall, also known as sensitivity or the true positive rate (TPR), is the fraction of cases that are correctly predicted as cases:

$$\mathrm{recall} = \frac{TP}{TP + FN}. \qquad (3.14)$$
Since precision and recall often trade off against each other, the F1 score is a popular measure that combines them, treating false positives and false negatives as equally important. More specifically, the F1 score is defined as the harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
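The confusion-matrix counts and the derived metrics of this section in NumPy. As a sketch, it does not guard against zero denominators (e.g., when no positives are predicted):

```python
import numpy as np

def binary_metrics(y_pred, y_true):
    """Confusion-matrix counts and the metrics derived in this section."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)  # Eq. (3.13)
    recall = tp / (tp + fn)     # Eq. (3.14)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }
```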