Cao Xiao • Jimeng Sun

Introduction to Deep Learning for Healthcare

Cao Xiao, Seattle, WA, USA
Jimeng Sun, San Francisco, CA, USA
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Deep learning models are multi-layer neural networks that have shown great success in diverse applications. This book describes deep learning models in the context of healthcare applications.

Story 1 When we took an artificial intelligence class many years ago, many topics were covered, including neural networks. The neural network model was presented as a supervised learning method. However, it was considered a practical failure compared to other, more effective supervised learning methods such as decision trees and support vector machines. The common explanation of neural networks at the time involved two aspects: (1) multi-layer neural networks can approximate arbitrary functions and hence are theoretically powerful models; (2) in practice, they did not work well due to the ineffective learning algorithm (i.e., the backpropagation method). When we asked why backpropagation did not work well, a typical answer pointed to errors accumulated across layers, which eventually become too large to yield an accurate model. Of course, the understanding of neural networks has evolved greatly in the past few years. When big labeled datasets and parallel computing infrastructure such as graphics processing units (GPUs) finally became available, the power of deep neural networks was unleashed. These days, deep learning models have become the most popular and standard machine learning models.

Story 2 When we first got into machine learning for healthcare many years ago, we spoke with a senior medical doctor about the potential impact of machine learning and artificial intelligence (AI) in medicine. Specifically, we asked him about the possibility of creating AI algorithms to mimic the practice of real-world doctors. He was very pessimistic about the possibility because he believed…
Contents

1 Introduction
   1.1 Motivating Applications
      1.1.1 Diabetic Retinopathy Detection
      1.1.2 Early Detection of Heart Failure
      1.1.3 Sleep Analysis
      1.1.4 Treatment Recommendation
      1.1.5 Clinical Trial Matching
      1.1.6 Molecule Property Prediction and Generation
   1.2 Who Should Read This Book?
   1.3 Who Are the Authors?
   1.4 Book Organization
   1.5 Exercises
2 Health Data
   2.1 The Growth of Electronic Health Records
   2.2 Health Data
      2.2.1 The Life Cycle of Health Data
      2.2.2 Structured Health Data
      2.2.3 Unstructured Clinical Notes
      2.2.4 Continuous Signals
      2.2.5 Medical Imaging Data
      2.2.6 Biomedical Data for In Silico Drug Discovery
   2.3 Health Data Standards
   2.4 Exercises
3 Machine Learning Basics
   3.1 Predictive Modeling Pipeline
   3.2 Supervised Learning
      3.2.1 Logistic Regression
      3.2.2 Softmax Regression
      3.2.3 Gradient Descent
      3.2.4 Stochastic and Minibatch Gradient Descent
Bibliography
Chapter 1
Introduction
Humans are the only species on earth that can actively and systematically improve their health via technologies in the form of medicine. Throughout history, human knowledge has been the driving force for the progress of medicine and healthcare. Humans created new technologies such as diagnostic tests, drugs, medical procedures, and devices. As life expectancy increases, healthcare costs have grown so dramatically over the years that they are deemed unsustainable. For example, the US healthcare cost in 2019 alone was over 3.6 trillion dollars, accounting for 17.8% of gross domestic product (GDP). Within this gigantic spending on healthcare, there is enormous waste that should be avoided: the estimated total annual cost of waste was $760 billion to $935 billion [135].

Meanwhile, mountains of new medical knowledge are being created, making human doctors' knowledge quickly outdated. Moreover, human doctors are struggling to keep up with the increasing volume of patient visits. Physician burnout is a serious issue that affects all doctors in the age of electronic health records, due to the overwhelming patient data for doctors to review and complex workflows, including tedious documentation tasks. Patients are also dissatisfied with the limited interaction and attention from doctors during their short clinical visits. Quality of care is often sub-optimal, with over 400K preventable medical errors in hospitals each year [78].

With the rise of artificial intelligence (AI), can new healthcare technology be created by machines directly? For example, can machines provide more accurate diagnoses than human doctors? At the center of the AI revolution, deep learning technology is a set of machine learning techniques that learn multiple layers of neural networks for supporting prediction, classification, clustering, and data generation tasks. The success of deep learning comes from
• Data: Large amounts of rich data, especially images and natural language texts, have become available for training deep learning models.
• Algorithms: Efficient neural network methods have been proposed and enhanced by many researchers in recent years.

Deep learning has had great success in the technology industry. Very hard technical problems, such as image classification, machine translation, and speech recognition, have seen amazing performance improvements. There are various promising deep learning applications in healthcare, including medical imaging analysis, waveform…
Fig. 1.1 Example retinal photos from a healthy individual and a diseased patient. Credit to https://ai.googleblog.com/2016/11/deep-learning-for-detection-of-diabetic.html

1 AUC is a common classification metric that will be described in detail in Section "Real-Value…
1.1.2 Early Detection of Heart Failure

Heart failure (HF) is another deadly disease, affecting approximately 5.7 million people in the United States, with over 825,000 new cases per year and around 33 billion dollars in total annual cost. The lifetime risk of developing HF is 20% at 40 years of age [2]. HF has a high mortality rate of about 50% within 5 years of diagnosis [5]. There has been relatively little progress in slowing HF progression because there is no effective means of early HF detection with which to test interventions. Choi et al. [30] used a deep learning model called a recurrent neural network to model longitudinal electronic health records and accurately predict the onset of HF 6 to 18 months before the actual diagnosis.
1.1.4 Treatment Recommendation

Medical error is the third leading cause of death in the US. The Food and Drug Administration (FDA) receives more than 100,000 reports each year related to suspected medication errors. To ensure medication safety, different medication recommendation methods have been proposed using deep learning. For example, researchers have attempted to build predictive models that suggest treatments based on patient information, including diagnoses, procedures, and medication history, drawing on abundant longitudinal electronic health record data. LEAP [174] and GAMENet [134] are examples of such models using deep learning, particularly sequence-to-sequence models and memory networks.
1.1.5 Clinical Trial Matching

Clinical trials play important roles in drug development but often suffer from expensive, inaccurate, and insufficient patient recruitment. The availability of electronic health records (EHR) and trial eligibility criteria (EC) brings a new opportunity to develop computational models for patient recruitment. One key task, patient-trial matching, is to find qualified patients for clinical trials given structured EHR and unstructured EC text (both inclusion and exclusion criteria). How can we match complex EC text with longitudinal patient EHRs? How can we embed many-to-many relationships between patients and trials? How can we explicitly handle the difference between inclusion and exclusion criteria? COMPOSE [50] and DeepEnroll [176] are two deep learning models for patient-trial matching. They search through EHR data to identify matching patients based on the trial eligibility criteria described in natural language.
1.1.6 Molecule Property Prediction and Generation

Drug discovery is about finding molecules with desirable properties for treating a target disease. Traditional drug discovery heavily depends on high-throughput screening (HTS), which involves many costly wet-lab experiments. Given large molecule databases and their associated drug properties (e.g., from DrugBank), machine learning models, especially deep learning models, have shown great potential in identifying promising drug candidates. For example, some deep learning models have been proposed to predict drug properties given input molecule structures [105, 139]. Others have been introduced to generate brand-new molecules with desirable properties [48, 80].
1.2 Who Should Read This Book?

Machine learning researchers and engineers can benefit from this book if they want to learn more about healthcare data and analytic problems. This book does not require any computer programming knowledge and can be used as a textbook for the concepts of deep learning and its applications. We deliberately do not cover programming details so that we can reach a broader readership. Also, given the fast pace of deep learning software evolution, anything we wrote on programming topics would likely be outdated very quickly. Nevertheless, hands-on exercises are essential for gaining practical knowledge of deep learning. We encourage readers to supplement this book with hands-on programming exercises, online tutorials, and other books on deep learning software packages.
1.4 Book Organization

• Chapter 2 covers various healthcare data, ranging from structured data such as diagnosis codes to unstructured data such as clinical notes and medical imaging data. This chapter also introduces important health data standards such as International Classification of Diseases (ICD) codes.
• Chapter 3 provides a primer of machine learning basics. We present the funda-
mental machine learning tasks, including supervised and unsupervised learning
and some classical examples (e.g., logistic regression and principal component
analysis). We also describe evaluation metrics for different tasks such as the area
under the receiver operating characteristic curve (AUC) for classification and
mean squared error for regression.
• Chapter 4 presents deep neural networks (DNN), also called feed-forward neural networks or multi-layer perceptrons (MLP). In particular, we cover the basic components of DNNs, including neurons, activation functions, loss functions, and forward/backward passes. Of course, we also introduce the famous backpropagation algorithm for training DNN models. We also present two case studies: hospital re-admission prediction and drug property prediction.
• Chapter 5 illustrates the idea of embedding using neural networks, including popular algorithms such as Word2Vec and domain-specific embeddings for EHR data such as Med2Vec and MiME.
• Chapter 6 introduces convolutional neural networks (CNN) designed for grid-like
data such as images and time series. We will present the important operation of
CNNs, such as convolution and pooling, and some popular CNN architectures.
We will also describe the application of CNNs on medical imaging data and
clinical waveforms such as the electrocardiogram (ECG).
• Chapter 7 covers recurrent neural networks (RNN) designed to handle sequential
data such as clinical text. We present important variants of RNN, including Long
short-term memory (LSTM) and gated recurrent unit (GRU). The RNN case
studies include heart failure prediction and de-identification of clinical notes.
• Chapter 8 describes the autoencoder model, an unsupervised neural network model. We also present case studies of autoencoders, including phenotyping EHR data.
The advanced part covers the attention model (Chap. 9), graph neural networks
(Chap. 10), memory networks (Chap. 11), and generative models (Chap. 12).
• Chapter 9 introduces attention models, which create the foundation for many advanced deep learning models. We illustrate the attention model in several healthcare applications, including disease risk prediction, diagnosis code assignment, and ECG classification.
• Chapter 10 presents another foundation of advanced deep learning models,
namely graph neural networks (GNN). Graph data are common in many health-
care tasks such as medical ontology and molecule graphs. GNN models are a
set of powerful neural network models for graph data. The case studies focus on
drug discovery.
• Chapter 11 presents memory network-based models, a set of powerful models for embedding complex data (such as text and time series). We introduce the idea behind the original memory networks and their powerful extensions, such as the Transformer and BERT models. We demonstrate memory networks on medication recommendation tasks.
• Chapter 12 presents deep generative models, including generative adversarial
networks (GAN) and variational autoencoder (VAE). We show their applications
in synthetic EHR data generation and molecule generation for drug discovery.
1.5 Exercises
1. Present an example data science application for lower healthcare cost, specify
the datasets needed for building such machine learning models, and describe the
evaluation metrics.
2. Which type of healthcare data are considered large in terms of data volume?
(a) Genomic data
(b) Medical imaging data
(c) Clinical notes
(d) Medical claims
3. Which type of healthcare data are considered fastest in velocity?
(a) Real-time monitoring data from intensive care units
(b) Medical imaging data
(c) Structured electronic health records
(d) Clinical notes
4. Which one is NOT a drug discovery and development application?
(a) Sepsis detection
(b) Molecule property prediction
(c) Clinical trial recruitment
(d) Molecule generation
5. Which one is a public health application?
(a) Mortality prediction in ICU
(b) Patient triaging application
(c) Treatment recommendation for heart failure
(d) Predicting COVID-19 cases at different locations in the US
Chapter 2
Health Data
Health data are diverse with multiple modalities. This chapter will introduce
different types of health data, including structured health data (e.g., diagnosis codes,
procedure codes) and unstructured data (e.g., clinical notes, medical images). We
will also present the popular health data standards for representing those data.
2.1 The Growth of Electronic Health Records

Over the past decade, more and more health service providers worldwide have adopted electronic health record (EHR) systems to manage data about patients and records of healthcare services. For example, Fig. 2.1 shows the increase in national basic and certified EHR adoption rates over time, according to the American Hospital Association Annual Survey. Here the basic EHR adoption curve corresponds to EHR systems having basic functions such as patient demographics, physician notes, lab results, medications, diagnoses, and clinical and drug safety guidelines. A certified EHR has to cover essential EHR technology that meets the technological capability, functionality, and security requirements adopted by the Department of Health and Human Services. From Fig. 2.1, we can see that nearly all reported hospitals (96%) possessed certified EHR technology by 2015. In 2015, 84.8% of hospitals had adopted at least a basic EHR system, a ninefold increase since 2008. Thanks to the wide deployment of EHR systems, many healthcare institutions have collected diverse health data. This chapter provides an overview of health data: what different data types are available, how the data are collected, and by whom. All these data are potential inputs for training deep learning models to support diverse healthcare tasks.
Fig. 2.1 Percentage of EHR system adoption over time. Basic EHR means EHR systems with a set of required functionalities such as patient demographics, physician notes, lab results, medications, diagnoses, and clinical and drug safety guidelines. A certified EHR system means the hospital has essential EHR technology certified by the Department of Health and Human Services. Source: American Hospital Association Annual Survey
2.2 Health Data

Shifting from traditional paper-based records to electronic records has generated a massive collection of health data, creating opportunities for enhanced patient care, data-driven care delivery, and accelerated healthcare research. According to the definition from the Centers for Medicare & Medicaid Services (CMS), a federal institution that administers government-owned health insurance services, an EHR is "an electronic version of a patient's medical history, that is maintained by the provider over time, and may include all of the key administrative, clinical data relevant to that person's care under a particular provider, including demographics, progress notes, problems, medications, vital signs, past medical history, immunizations, laboratory data, and radiology reports".1 From the modeling perspective, an EHR can be viewed
as a longitudinal record of comprehensive medical services provided to patients and
documentation of patient medical history. There are several important observations
of EHR data:
1. EHR data are mostly managed by providers (although there is an ongoing
movement to enable patients to augment additional data into their EHRs);
2. Each provider manages their own EHR systems; as a result, partial information
about the same patient may scatter across EHR systems from multiple providers;
1 https://www.cms.gov/Medicare/E-Health/EHealthRecords/index.html.
3. The main purpose of EHR data is to support accurate and efficient billing, which creates challenges for secondary uses of EHR data such as research.
2.2.1 The Life Cycle of Health Data

Let us first introduce the key players in the healthcare industry and the life cycle of health data from the perspectives of those key players.
Key Healthcare Players There are diverse healthcare institutions that generate and
manage health data.
• Providers are hospitals, clinics, and health systems, which provide healthcare
services to patients. Providers use electronic health records (EHR) systems to
capture everything that happened during patient encounters, such as diagnosis
codes, medication prescription, lab and imaging results, and clinical notes.
Providers interact with other players such as payers, pharmacies, and labs.
• Payers are entities that provide health insurance. They can be private insurance
companies such as United Healthcare and Anthem, or public insurance programs owned by government entities such as Medicare and Medicaid. Payers reimburse the full or partial cost associated with healthcare
services to providers. Payers interact with providers and pharmacies via claims.
More specifically, providers and pharmacies submit claims to corresponding
payers from which they have medical insurance. Claims are usually structured
data with diagnosis, procedure, and medication codes, and associated cost
information.
• Pharmacies prepare and dispense medications often based on medication pre-
scriptions. Pharmacies know what and when patients actually fill medications.
Pharmacies produce pharmacy claims, which are also sent to payers of the
patients for reimbursement.
• Pharmaceutical companies discover, develop, produce, and market drugs.
They conduct clinical trials to validate new drugs. Pharmaceuticals generate
experimental data for drug discovery and clinical trials data.
• Contract Research Organizations (CROs) provide outsourced contract services to pharmaceutical companies, such as pre-clinical and clinical research and clinical trial management. Depending on the service lines, CROs produce various datasets that support pharmaceutical companies in getting their drugs to market.
• Government agencies play multiple roles in the healthcare ecosystem. The Food and Drug Administration (FDA) is the most important regulator, approving new drugs and monitoring existing drugs. The Centers for Disease Control and Prevention (CDC) is a public health institution focusing on monitoring, controlling, and preventing diseases. For example, government agencies worldwide have been collecting reports from Spontaneous Reporting Systems (SRS), submitted by pharmaceutical companies, healthcare professionals, and consumers, to facilitate post-market drug safety surveillance.
Fig. 2.2 Important players and the life cycle of health data
2.2.2 Structured Health Data

Structured data are common in healthcare and are often represented as medical codes.
Various medical codes are used in both EHR and claims data, usually following common data standards. For example, diagnosis codes follow the International Classification of Diseases (ICD), and procedures use Current Procedural Terminology (CPT) codes. The number of unique codes in each data standard is large and growing. For example, ICD version 9 (ICD-9) has over 13,000 diagnosis codes, while ICD-10 has over 68,000. Each encounter is associated with only a few codes, so the resulting data are high-dimensional but sparse. A simple and direct way to represent such data is to create a one-hot vector for a medical code and a multi-hot vector for a patient with multiple codes, as shown in Fig. 2.3. For example, to represent the diagnosis information in an encounter, one can create a 68,000-dimensional binary vector where each element is a binary indicator of a corresponding diagnosis code. If only one dimension is one and the rest are zeros, we call it a one-hot vector; if multiple ones are present, it is a multi-hot vector. As we will show in later chapters, such multi-hot vectors can be improved with various deep learning approaches that construct appropriate lower-dimensional representations.
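To make this encoding concrete, here is a minimal Python sketch that builds a multi-hot vector from a patient's diagnosis codes. The vocabulary and example ICD-10 codes are illustrative stand-ins, not data from this book:

```python
import numpy as np

# Illustrative vocabulary; in practice this is the full set of diagnosis
# codes observed in the data (tens of thousands of ICD-10 codes).
vocab = ["E11.9", "I10", "I50.9", "J45.909", "N18.3"]
code_to_index = {code: i for i, code in enumerate(vocab)}

def multi_hot(codes, code_to_index):
    """Encode a set of diagnosis codes as a binary multi-hot vector."""
    v = np.zeros(len(code_to_index), dtype=np.float32)
    for code in codes:
        if code in code_to_index:  # silently skip out-of-vocabulary codes
            v[code_to_index[code]] = 1.0
    return v

# One encounter with two diagnoses yields a multi-hot vector.
print(multi_hot(["I10", "I50.9"], code_to_index))  # [0. 1. 1. 0. 0.]
```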
Most medical codes follow certain standards, such as ICD for diagnoses, CPT for procedures, and NDC for drugs, which will be explained later under health data standards. Most of these medical codes are organized in hierarchies defined by medical ontologies such as CCS codes. These hierarchies are instrumental in constructing more meaningful, lower-dimensional input features for deep learning models. For example, instead of directly treating each ICD-10 code as a feature, we can group the codes into a few hundred CCS codes,2 i.e., higher-level disease categories, and treat each CCS code as a feature.
2 https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp.
14 2 Health Data
Fig. 2.3 Examples of one-hot vectors of medical codes and a multi-hot vector for a patient. Here 1 or 0 indicates the presence or absence of a particular diagnosis

Fig. 2.4 In EHR data, medical codes are structured hierarchically with heterogeneous relations; e.g., the medication Acetaminophen and the procedure IV fluid are correlated, as both occur due to the same diagnosis, Fever

Orders such as Cardiac EKG come with several results (QRS duration, Q-T interval, notes), but IV Fluid does not. Note that some diagnoses might not be associated with any medication or procedure orders.
Other Structured Data In addition to medical codes, there are other structured
data such as patient demographics (e.g., age and gender), vital signs (e.g., blood
pressure), and social history (e.g., smoking status). These are also stored as struc-
tured fields in the EHR database. Various data standards are used for documenting
different medical codes, as summarized in Tables 2.1 and 2.2.
2.2.3 Unstructured Clinical Notes

Clinical notes are recorded for various purposes: documenting an encounter in discharge summaries, describing the reason and use of prescriptions in medication notes, interpreting the results of medical images in radiology reports, or analyzing the results of lab tests in pathology reports. The reports include different sets of functional components and thus present varying levels of difficulty in understanding. There is growing interest in using machine learning for these unstructured clinical notes, especially deep learning methods for automated indexing, abstracting, and understanding. Some works also focus on automated classification of clinical free text into standardized medical codes, which is important for generating appropriate claims for a visit [114]. For example, a progress note is one type of clinical note commonly used to describe clinical encounters. A progress note is usually structured in four sections with the acronym SOAP (a minimal parsing sketch follows this list):
• Subjective part describes what the patient tells you;
• Objective part presents the objective findings such as lab test results and imaging results;
• Assessment part provides the diagnosis of the encounter;
• Plan part describes the treatment plan for the patient.
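As a rough illustration, the following Python sketch splits a toy progress note into its four SOAP sections. It assumes the section headers appear verbatim, which real notes often violate, and the note text itself is invented:

```python
import re

# A toy progress note; real notes vary widely in headers and formatting.
note = """Subjective: Patient reports shortness of breath on exertion.
Objective: BP 150/95; BNP elevated; chest X-ray shows congestion.
Assessment: Acute on chronic systolic heart failure.
Plan: Start IV diuretics, daily weights, low-sodium diet."""

# Split on the four SOAP headers; the capture group keeps each header.
pattern = re.compile(r"(Subjective|Objective|Assessment|Plan):")
parts = pattern.split(note)  # ['', 'Subjective', ' text', 'Objective', ...]
sections = {parts[i]: parts[i + 1].strip()
            for i in range(1, len(parts) - 1, 2)}
print(sections["Assessment"])  # Acute on chronic systolic heart failure.
```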
There are many different types of notes, including Admission notes, Emergency
department (ED) notes, Progress notes, Nursing notes, Radiology reports, ECG
reports, Echocardiogram reports, Physician notes, Discharge summary, and Social
work notes. Each type can be written in a very different format with different lengths
and quality.
Challenges Several key technical challenges exist in mining clinical text data:
1. High dimensionality: the number of unique words in a clinical text corpus is large, and many words are acronyms that can only be understood from their context.
2. External knowledge: in addition to the clinical text in EHR data, a large amount of medical knowledge is encoded in text such as medical literature and clinical guidelines. It is important to be able to incorporate that knowledge when modeling EHR data.
3. Privacy: beyond technical challenges, clinical text is hard to access because the data are sensitive and affect individual privacy. As a result, very few clinical notes are openly accessible for method development, and the volume of shared data is limited, largely due to privacy concerns.
2.2.4 Continuous Signals

With an increasing number of new medical and wellness sensors being created, continuous signals will make up a growing part of health data. The continuous signals most commonly collected in clinics include the electrocardiogram (ECG) and the electroencephalogram (EEG). ECG measures the electrical activity of the heart via electrodes placed on the chest, arms, and legs. In contrast, EEG measures the electrical activity of the brain via electrodes placed on the scalp. Both ECG and EEG are routine clinical tests.
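To illustrate a typical preprocessing step for such signals, the following Python sketch band-pass filters a synthetic single-lead ECG with SciPy. The sampling rate, filter band, and signal are all illustrative assumptions rather than recommendations from this book:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 250.0  # assumed sampling rate in Hz; depends on the recording device

# Synthetic stand-in for a raw ECG: a crude cardiac component plus
# baseline wander and high-frequency measurement noise.
t = np.arange(0, 10, 1 / fs)
raw = (np.sin(2 * np.pi * 1.2 * t)
       + 0.5 * np.sin(2 * np.pi * 0.2 * t)
       + 0.1 * np.random.randn(t.size))

# Band-pass filter (0.5-40 Hz) to suppress baseline wander and noise
# before any downstream modeling.
b, a = butter(3, [0.5 / (fs / 2), 40.0 / (fs / 2)], btype="band")
filtered = filtfilt(b, a, raw)  # zero-phase filtering, no time shift
```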
2.2.5 Medical Imaging Data

Medical imaging is about creating a visual representation of the human body for
diagnostic purposes. Various imaging technologies have been introduced, such
as X-ray radiography, computed tomography (CT), magnetic resonance imaging
(MRI), ultrasound (e.g., echocardiography). The resulting data are 2D images
or 3D representations from multiple 2D images and videos. Medical imaging
data are stored and managed by a separate system called Picture archiving and
communication system (PACS). The images themselves are stored and transmitted
in DICOM format (Digital Imaging and Communications in Medicine). Given the
raw imaging data, radiologists read and mark the images and write a text report
(radiology reports) to summarize the findings. The radiology reports are often
copied into the EHR systems so that clinicians and patients can access the findings of
the imaging tests. Thanks to the digitization of the radiology field, a large number of
high-resolution images and their corresponding phenotypes (labels) are available for
modeling. Thus there is tremendous excitement in using deep learning for radiology
tasks [65, 101].
Challenges Data quality and the lack of reliable, detailed labels remain challenges in analyzing such data. Because the raw input images are very large (high-dimensional), training accurate and generalizable models demands a sufficient sample size.
2.3 Health Data Standards

• ICD codes are a widely used international standard maintained by the World Health Organization (WHO). The latest version is ICD-11 as of 2020, while most of the world currently uses ICD-10. For example, "I50" is the ICD-10 category for heart failure, I50.2 is systolic (congestive) heart failure, and I50.21 is acute systolic (congestive) heart failure. ICD codes are used to represent disease diagnoses in EHR and claims data. Most EHR data will have ICD codes in either ICD-9 or ICD-10 format. An ICD-9 code has up to 5 digits: the first digit is either alphabetic or numeric, and the remaining digits are numeric. For example, the ICD-9 code for diabetes mellitus without mention of complications is 250.0x. The first three digits of an ICD-9 code correspond to the disease category, and the last 1 or 2 digits reflect subcategories of the disease (see the parsing sketch after this list). Besides numeric codes, ICD-9 codes can have an initial letter of V or E. For example, V85.x is an ICD-9 code for body mass index (BMI). In particular, V85.0 corresponds to BMI<19, V85.1 to BMI between 19 and 25, and V85.2x indicates BMI>25. ICD-10 codes are more granular than ICD-9 codes. Each ICD-10 code has up to 7 digits: the first digit is always a letter, the second digit is always numeric, and the third to seventh digits are alphanumeric. For example, E10.9 is the ICD-10 code for type 1 diabetes mellitus without complications.
• CPT corresponds to Current Procedural Terminology, which is a standard
created and copyrighted by American Medical Association. CPT codes represent
medical services and procedures that doctors can document and bill for pay-
ment. CPT codes also follow a hierarchical structure. For example, CPT codes
between 99201 and 99215 correspond to Office/other outpatient services, while
a high-level category 99201–99499 corresponds to codes for evaluation and
management. Like ICD codes, CPT codes are commonly present in structured
EHR data.
• NDC codes are 10- or 11-digit National Drug Codes, managed by the Food and Drug Administration (FDA). An NDC code consists of three segments: labeler, product, and package. For example, 0777-3105-02 is an NDC code where 0777 corresponds to the labeler Dista Products Company, 3105 maps to the product Prozac, and 02 indicates the package of 100 capsules in 1 bottle (see the parsing sketch after this list). The same drug with different packages will have different codes. From an analytic modeling perspective, NDC codes are probably too specific to be used directly as features.
• LOINC is a terminology standard for lab tests; it stands for Logical Observation Identifiers Names and Codes. Like other standards, LOINC has codes and associated descriptions. To support lab tests, a LOINC description follows a specific format with six parts: (1) COMPONENT (ANALYTE): the substance being measured or observed; (2) PROPERTY: the characteristic of the analyte; (3) TIME: the interval of time of the observation; (4) SYSTEM (SPECIMEN): the specimen upon which the observation was made; (5) SCALE: how the observation is quantified (quantitative, ordinal, nominal); (6) METHOD: how the observation was made (an optional part). For example, LOINC code 806-0 is the lab test of the manual count of white blood cells in a cerebral spinal fluid specimen. The parts of its description include Component: Leukocytes, Property: NCnc (number concentration),…
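As referenced in the list above, here is a minimal Python sketch that parses the code structures described in this section: the leading category of an ICD-9 code and the three segments of a hyphenated NDC code. The helper names are ours, and the ICD-9 handling is deliberately simplified:

```python
def icd9_category(code: str) -> str:
    """Return the disease-category prefix of an ICD-9 code.

    The first three digits give the category for numeric and V codes;
    E codes are handled approximately by keeping four characters.
    """
    root = code.split(".")[0]
    return root[:4] if root.upper().startswith("E") else root[:3]

def parse_ndc(ndc: str) -> dict:
    """Split a hyphenated NDC code into labeler, product, and package."""
    labeler, product, package = ndc.split("-")
    return {"labeler": labeler, "product": product, "package": package}

print(icd9_category("250.00"))    # '250' -> diabetes mellitus category
print(icd9_category("V85.1"))     # 'V85' -> body mass index
print(parse_ndc("0777-3105-02"))  # example NDC from the text
```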
2.4 Exercises
1. What are the most useful health data for predicting patient outcome (e.g.,
mortality)?
2. What are the most accessible health data? And why?
3. What are the most difficult health data (to access and to model)?
4. What are the important health data that are not described in this chapter?
5. Which of the following is NOT true about electronic health records (EHR)?
(a) EHR data from a single hospital contain the complete clinical history of each patient.
(b) Outpatient EHR data are viewed as point events.
(c) EHR data contain longitudinal patient records.
(d) Inpatient EHR data are viewed as interval events.
6. Which of the following is not true about clinical notes?
(a) They can provide a detailed description of patient status.
(b) Most EHR systems provide clinical notes functionality.
(c) Clinical notes can contain sensitive protected health information.
(d) Because of their unstructured format, it is easy for computer algorithms to process the notes.
7. Which of the following are the limitations of claims data?
(a) Coding errors can commonly occur in the claims data.
(b) Since claims data are for billing purposes, they do not accurately reflect
patient status.
(c) Claims data are rare and difficult to find.
(d) Claims data of a patient are often incomplete because they can go to
different hospitals.
8. Which of the following is not true?
(a) EHR data are richer than claims.
(b) EHR captures medication prescription information but does not capture whether the prescriptions are filled.
(c) Continuous signals are rarely collected in hospitals.
(d) Continuous signals provide objective assessments of patients.
9. Which of the following are not imaging data?
(a) X-rays
(b) Computed tomography
(c) Electrocardiogram
(d) Magnetic resonance imaging
10. What is true about medical literature data?
(a) They are difficult to parse because of the natural language format.
Chapter 3
Machine Learning Basics

Machine learning has changed many industries, including healthcare. The most fundamental concepts in machine learning include (1) supervised learning, which has been used to develop risk prediction models for target diseases, and (2) unsupervised learning, which has been applied to discover unknown disease subtypes. Both supervised and unsupervised learning model various patient features, such as demographic features (age, gender, and ethnicity) and past diagnosis features (e.g., ICD codes). The key difference is the presence of labels in supervised learning and their absence in unsupervised learning. A label is a gold standard for a target of interest, such as a patient's readmission status for training a readmission predictive model.

As we will describe in later chapters, most deep learning successes are in supervised learning. In contrast, the potential for unsupervised learning is immense due to the availability of a large amount of unlabeled data. This chapter will present the predictive modeling pipeline, basic models for supervised and unsupervised learning, and various model evaluation metrics. Table 3.1 defines the notation used in this chapter.
3.1 Predictive Modeling Pipeline

1. First, we need to define the prediction target, such as the onset of heart failure.
2. We then need to construct the patient cohort for this study. For example, we may include all patients with an age greater than 45 for a heart failure study. There are many reasons why cohort construction is needed when building healthcare predictive models: (1) there might be a financial cost associated with acquiring the dataset based on the cohort; (2) we may want to build the model for a specific group of patients instead of a general population; (3) the full set of all patients may have various data quality issues.
3. Next, we construct all the features from the data and select the features relevant for predicting the target. In a traditional machine learning pipeline, we often have to consider both feature construction and selection steps. With the rise of deep learning, features are often created and implicitly selected by the multiple layers of neural networks.
4. After that, we can build the predictive model, which can be either classification (i.e., discrete labels such as heart failure or not) or regression (i.e., continuous output such as length of stay).
5. Finally, we need to evaluate the model performance and iterate (a minimal end-to-end sketch follows this list).
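The sketch below runs steps 3 to 5 with scikit-learn on simulated data; the cohort, features, and label are all synthetic stand-ins, not real health data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic cohort: 1,000 patients, 50 binary features
# (e.g., multi-hot diagnosis indicators) and a binary label.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 50)).astype(float)
w_true = rng.normal(size=50)
y = (X @ w_true + rng.normal(size=1000) > 0).astype(int)

# Step 4: build the predictive model on a training split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 5: evaluate on held-out patients (AUC for classification).
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {auc:.3f}")
```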
3.2 Supervised Learning

We start with disease classification, a major supervised learning task in healthcare applications: given a set of patients and their associated patient data, assign each patient a discrete label y ∈ Y, where Y is the set of possible diseases. This problem has many applications, from studying electroencephalography (EEG) time series for seizure detection to analyzing electronic health records (EHR) for predicting heart failure diagnosis. Supervised learning tasks such as disease classification are also building blocks of many complex deep learning architectures, which we will discuss later.

Supervised learning expects as input a set of N data points; each data point consists of m input features x (also known as variables or predictors) and a label y ∈ Y (also known as a response, a target, or an outcome). Supervised learning aims to learn a mapping from features x to a label y based on the observed data points. If the labels are continuous (e.g., hospital cost), the supervised learning problem is called a regression problem; if the labels are discrete variables (e.g., mortality status), the problem is called a classification problem. In this chapter, the label y takes categorical values from K classes (i.e., y ∈ {1, 2, . . . , K}).

3.2.1 Logistic Regression1

1 Perhaps a confusing name, as logistic regression is used for classification, not regression. The naming choice will become meaningful once we explain the mathematical construction.
…and other disease diagnoses. The classification task is to determine whether a patient will have heart failure based on this M-dimensional feature vector x.

Mathematically, logistic regression models the probability of heart failure onset y = 1 given input features x, denoted by P(y = 1|x). Classification is then performed by comparing P(y = 1|x) with a threshold (e.g., 0.5). If P(y = 1|x) is greater than the threshold, we predict the patient will have heart failure; otherwise, we predict the patient will not.
One building block of logistic regression is the log-odds or logit function. The
odds are the quantity that measures the relative probability of label presence and
label absence as
$$\frac{P(y=1\mid x)}{1 - P(y=1\mid x)}.$$
The lower the odds, the lower the probability of the given label. Sometimes we prefer to use the log-odds (the natural logarithm of the odds), also known as the logit function:

$$\mathrm{logit}(x) = \log \frac{P(y=1\mid x)}{1 - P(y=1\mid x)}.$$
Now, instead of modeling the probability of the heart failure label given input features, P(y = 1|x), directly, it is easier to model its logit function as a linear regression over x:

$$\log \frac{P(y=1\mid x)}{1 - P(y=1\mid x)} = w^{\top} x + b \qquad (3.1)$$
where w is the weight vector and b is the offset variable. Equation (3.1) is the reason behind the name logistic regression.
After taking the exponential of both sides and some simple transformations, we obtain the following formula:

$$P(y=1\mid x) = \frac{e^{w^{\top} x + b}}{1 + e^{w^{\top} x + b}} \qquad (3.2)$$

With the formulation in Eq. (3.2), logistic regression always outputs values between 0 and 1, which is desirable for a probability estimate.
Let us denote P(y = 1|x) as P(x) for brevity. Learning the logistic regression model means estimating the parameters w and b from the training data. We often use maximum likelihood estimation (MLE) to find the parameters. The idea is to estimate w and b so that the prediction P̂(x_i) for data point i in the training data is as close as possible to the actually observed value (in this case, either 0 or 1). Let x⁺ be the set of indices of data points in the positive class (i.e., with heart failure) and x⁻ the set of indices of data points in the negative class (i.e., without heart failure); the likelihood function used in MLE is given by Eq. (3.3).
$$L(w, b) = \prod_{a^{+} \in x^{+}} P(x_{a^{+}}) \prod_{a^{-} \in x^{-}} \left(1 - P(x_{a^{-}})\right) \qquad (3.3)$$
Taking the logarithm of the likelihood gives the log-likelihood in Eq. (3.4):

$$\log L(w, b) = \sum_{i=1}^{N} \left[ y_i \log P(x_i) + (1 - y_i)\log(1 - P(x_i)) \right] \qquad (3.4)$$
Note that since either yi or 1 − yi is zero, only one of two probability terms (either
log P (x i ) or log(1 − P (x i ))) will be added.
Multiplying by a negative sign to obtain a minimization problem, we arrive at the negative log-likelihood, also known as the (binary) cross-entropy loss:

$$J(w, b) = -\sum_{i=1}^{N} \left[ y_i \log P(x_i) + (1 - y_i)\log(1 - P(x_i)) \right] \qquad (3.5)$$
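A minimal NumPy sketch of Eqs. (3.2) and (3.5); the function names are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """P(y = 1 | x) from Eq. (3.2), for each row of X."""
    return sigmoid(X @ w + b)

def cross_entropy(X, y, w, b, eps=1e-12):
    """Negative log-likelihood J(w, b) from Eq. (3.5)."""
    p = np.clip(predict_proba(X, w, b), eps, 1 - eps)  # avoid log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```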
3.2.2 Softmax Regression

We sometimes want to classify data points into more than two classes. For example, given brain image data from patients suspected of having Alzheimer's disease (AD), the diagnostic outcomes include (1) normal, (2) mild cognitive impairment (MCI), and (3) AD. In that case, we use multinomial logistic regression, also called softmax regression, to model the problem.
Assuming we have K classes, the goal is to estimate the probability of the class label taking on each of the K possible categories, P(y = k|x) for k = 1, · · · , K. Thus, we output a K-dimensional vector representing the estimated probabilities for all K classes. The probability that data point i is in class a is modeled by Eq. (3.6):

$$P(y_i = a \mid x_i) = \frac{e^{w_a^{\top} x_i + b_a}}{\sum_{k=1}^{K} e^{w_k^{\top} x_i + b_k}} \qquad (3.6)$$

where x_i is the feature vector for data point i, and w_a and b_a are the weight vector and the offset for class a, respectively. To learn the parameters of softmax regression, we often minimize the following average cross-entropy loss over all N training data points:
$$J(w) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} I(y_i = k) \log P(y_i = k \mid x_i)$$
3.2.3 Gradient Descent

Gradient descent (GD) is an iterative learning approach for finding the optimal parameters from data. For example, for softmax regression parameter estimation, we can use GD by computing the derivatives

$$\nabla_{w_k} J(w)$$

and updating the weights in the opposite direction of the gradient according to the rule

$$w_k := w_k - \eta \nabla_{w_k} J(w)$$

for each class k ∈ {1, · · · , K}, where η is the learning rate, an important hyperparameter that needs to be tuned. The gradient computation and weight update are performed iteratively until some stopping criterion is met (e.g., the maximum number of iterations is reached). Here ∇ is a differentiation operator that transforms a function J(w) into its gradient vector along each parameter dimension. For example, if w = [w_1, w_2, w_3], then the gradient vector is

$$\nabla J(w) = \left[ \frac{\partial J(w)}{\partial w_1}, \frac{\partial J(w)}{\partial w_2}, \frac{\partial J(w)}{\partial w_3} \right]$$
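Building on the NumPy sketch above, here is full-batch gradient descent for logistic regression. It uses the standard gradient of the cross-entropy loss, averaged over the training set, which this section does not derive; treat it as an illustrative sketch:

```python
def gradient_descent(X, y, lr=0.01, n_iters=500):
    """Full-batch gradient descent for logistic regression."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(n_iters):
        p = predict_proba(X, w, b)     # predictions under current w, b
        w -= lr * (X.T @ (p - y)) / n  # step opposite the average gradient
        b -= lr * np.mean(p - y)
    return w, b
```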
Fig. 3.2 Objective function changes for batch gradient descent and mini-batch gradient descent
3.2.4 Stochastic and Minibatch Gradient Descent

Stochastic gradient descent (SGD) performs a parameter update for every single data point in the training set. Given data point x_i with label y_i, SGD performs the following update:

$$\theta := \theta - \eta \nabla_{\theta} \, g(\theta; \langle x_i, y_i \rangle)$$

where η is the learning rate and g(θ; ⟨x_i, y_i⟩) is the objective function evaluated on the single data point ⟨x_i, y_i⟩. By updating on one data point at a time, SGD is computationally more efficient. However, since SGD updates based on one data point at a time, it can have much higher variance, which causes the objective function to fluctuate. Such behavior can cause SGD to deviate from the true optimum. There are several ways to alleviate this issue; for example, we can slowly decrease the learning rate, and empirically SGD then shows convergence behavior similar to batch gradient descent (Fig. 3.2).
The mini-batch approach inherits benefits from both GD and SGD. It computes the gradient over small batches of training data points. The mini-batch size n is a hyperparameter, and mini-batch gradient descent performs the following update:

$$\theta := \theta - \eta \nabla_{\theta} \, g(\theta; \langle x_i, y_i \rangle, \ldots, \langle x_{i+n-1}, y_{i+n-1} \rangle)$$

where ⟨x_i, y_i⟩, . . . , ⟨x_{i+n−1}, y_{i+n−1}⟩ are the n data points in a batch. The gradient is iteratively computed using batches of data points. Via such mini-batch updates, we reduce the variance of the parameter updates and mitigate the unstable convergence seen with SGD (a minimal sketch follows).
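A minimal mini-batch SGD loop in the same style, reusing predict_proba from the earlier sketch; setting batch_size=1 recovers plain SGD:

```python
def minibatch_sgd(X, y, lr=0.01, n_epochs=10, batch_size=32):
    """Mini-batch SGD: one parameter update per batch of data points."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        order = rng.permutation(n)           # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            p = predict_proba(X[idx], w, b)  # gradient on this batch only
            w -= lr * (X[idx].T @ (p - y[idx])) / len(idx)
            b -= lr * np.mean(p - y[idx])
    return w, b
```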
3.3 Unsupervised Learning

In many healthcare applications, labels are not available. In such cases, we resort to unsupervised learning. Unsupervised learning models are not used to classify (or predict) a known label y; rather, we discover patterns or clusters in the input data points x. Next, we briefly introduce some popular unsupervised learning methods.
3.3.1 Principal Component Analysis

Principal component analysis (PCA) maps the data matrix X to a lower-dimensional representation Y via a linear projection W:

$$Y = XW \qquad (3.9)$$

The projection W is chosen to minimize the reconstruction error

$$\min_{W} \; \left\| X - XWW^{\top} \right\|^{2}$$

so that the original data are approximated from the low-dimensional representation,

$$X \approx UW^{\top}$$

where U = XW is the projected data.
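A minimal PCA sketch via the singular value decomposition, consistent with Eq. (3.9). Using SVD here is a standard implementation choice on our part, not one prescribed by the text:

```python
import numpy as np

def pca(X, k):
    """Project centered data onto the top-k principal components."""
    Xc = X - X.mean(axis=0)  # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T             # m x k matrix of principal directions
    return Xc @ W, W         # Y = XW and the projection W

X = np.random.randn(100, 10)
Y, W = pca(X, k=2)
print(Y.shape)  # (100, 2)
```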
3.3.2 Clustering
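A plain NumPy sketch of the k-means procedure illustrated in Fig. 3.3, alternating the assignment and centroid-update steps. For brevity it omits convergence checks and does not handle empty clusters:

```python
import numpy as np

def kmeans(X, k, n_iters=20, seed=0):
    """Basic k-means clustering (see Fig. 3.3 for an illustration)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its points.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers
```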
3.4 Evaluation Metrics

The mean squared error (MSE) is the most basic performance metric for regression models. The formulation of MSE is given in Eq. (3.10).
Fig. 3.3 k-Means clustering over a set of points with K = 2. From left to right, top to bottom, we first initialize two cluster centers with a blue cross and a red cross. After a few iterations, all blue (red) points are assigned to the blue (red) cluster, completing the k-means clustering procedure
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2 \qquad (3.10)$$
where f(x_i) is the predicted value for the i-th data point. A small MSE means the predictions are close to the true observations on average; thus the model has a good fit. We can take the square root of MSE to obtain another popular metric, the root mean squared error (RMSE):

$$\mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2 } \qquad (3.11)$$
RMSE and MSE are commonly used in neural network parameter tuning. For example, the authors in [150] built a feedforward neural network model to find prognostic ischemic heart disease patterns from magnetocardiography (MCG) data. In training the model, RMSE was calculated as the evaluation metric to help choose the model parameters, such as the number of nodes in the hidden layer and the number of learning epochs (see Fig. 3.4). The hyperparameters leading to the lowest RMSE were then chosen for the final model.
Another measure for regression problems is the coefficient of determination (also called R²), which measures the correlation between the predicted values {f(x_i)} and the actual observations {y_i}. R² is computed using Eq. (3.12).
Fig. 3.4 In [150], to train the neural network model, RMSE was used as the evaluation metric in
parameter tuning, including the number of learning epochs and the number of nodes in the hidden
layer. Parameters exhibiting the lowest RMSE were chosen for the final model
$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2}{\sum_{i=1}^{N} \left( \bar{y} - y_i \right)^2} \qquad (3.12)$$
where $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$ is the sample mean of the target observations y_i.

R² measures the "squared correlation" between the observations y_i and the estimates f(x_i). If R² is close to 1, the model's estimates closely mirror the true observed outcomes. If R² is close to 0 (or even negative), the estimates are far from the outcomes. Note that R² can become negative, in which case the model fit is worse than simply predicting the average of y_i regardless of the value of x_i.
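The three regression metrics of this section, Eqs. (3.10) to (3.12), take only a few lines of NumPy:

```python
import numpy as np

def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)  # Eq. (3.10)

def rmse(y_pred, y_true):
    return np.sqrt(mse(y_pred, y_true))     # Eq. (3.11)

def r2(y_pred, y_true):
    ss_res = np.sum((y_pred - y_true) ** 2)
    ss_tot = np.sum((np.mean(y_true) - y_true) ** 2)
    return 1.0 - ss_res / ss_tot            # Eq. (3.12); can be negative
```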
If the predictions are binary values, we can construct a 2-by-2 confusion matrix
to quantify all the possibilities between predictions and labels. In particular, we
count the following four numbers: the number of case patients that are correctly
predicted as cases is the true positive (TP); the number of case patients that are
wrongly predicted as controls is the false negative (FN); the number of control
patients that are correctly predicted as controls is the true negative (TN); and the
number of control patients that are wrongly predicted as cases is the false positive
(FP) (Table 3.2).
A few important performance metrics can be derived from the confusion matrix, including accuracy, precision (also known as positive predictive value, PPV), recall (also known as sensitivity), false positive rate, specificity, and the F1 score.

Accuracy is the fraction of correct predictions over the total population, or formally:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
However, accuracy does not differentiate true positives (TP) from true negatives (TN). If there are many more controls (e.g., patients without the disease) than cases (patients with the disease), high accuracy can be trivially achieved by classifying everyone as a control (negative). Other metrics, such as precision and recall, address this class imbalance challenge indirectly by focusing on the positive class.
Precision, or positive predictive value (PPV), is the fraction of correct case predictions over all case predictions:

$$\mathrm{precision} = \frac{TP}{TP + FP}. \qquad (3.13)$$
Recall, also known as sensitivity or the true positive rate (TPR), is the fraction of cases that are correctly predicted as cases:

$$\mathrm{recall} = \frac{TP}{TP + FN}. \qquad (3.14)$$
Since precision and recall often trade off against each other, the F1 score is a popular measure that combines them, treating false positives and false negatives as equally important. More specifically, the F1 score is defined as the harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
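The confusion-matrix counts and the derived metrics of this section in NumPy. As a sketch, it does not guard against zero denominators (e.g., when no positives are predicted):

```python
import numpy as np

def binary_metrics(y_pred, y_true):
    """Confusion-matrix counts and the metrics derived in this section."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)  # Eq. (3.13)
    recall = tp / (tp + fn)     # Eq. (3.14)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }
```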