CPAE: Contrastive Predictive Autoencoder for Unsu… (2023)
S. Zhu, W. Zheng and H. Pang
Computer Methods and Programs in Biomedicine 234 (2023) 107484
Article history: Received 12 December 2022; Revised 20 February 2023; Accepted 12 March 2023

Keywords: Unsupervised learning; Pre-training; Electronic health records

Abstract

Background and objective: Fully-supervised learning approaches have shown promising results in some health status prediction tasks using Electronic Health Records (EHRs). These traditional approaches rely on sufficient labeled data to learn from. However, in practice, acquiring large-scale labeled medical data for various prediction tasks is often not feasible. Thus, it is of great interest to utilize contrastive pre-training to leverage the unlabeled information.

Methods: In this work, we propose a novel data-efficient framework, the contrastive predictive autoencoder (CPAE), which first learns without labels from the EHR data in a pre-training process and is then fine-tuned on downstream tasks. Our framework comprises two parts: (i) a contrastive learning process, inherited from contrastive predictive coding (CPC), which aims to extract global slow-varying features, and (ii) a reconstruction process, which forces the encoder to capture local features. We also introduce an attention mechanism in one variant of our framework to balance the two processes.

Results: Experiments on a real-world EHR dataset verify the effectiveness of our proposed framework on two downstream tasks (i.e., in-hospital mortality prediction and length-of-stay prediction), compared to their supervised counterparts, the CPC model, and other baseline models.

Conclusions: By comprising both contrastive learning components and reconstruction components, CPAE aims to extract both global slow-varying information and local transient information. The best results on both downstream tasks are achieved by CPAE. The variant AtCPAE is particularly superior when fine-tuned on very small training data. Further work may incorporate multi-task learning techniques to optimize the pre-training process of CPAEs. Moreover, this work is based on the benchmark MIMIC-III dataset, which includes only 17 variables; future work may extend to a larger number of variables.

© 2023 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.cmpb.2023.107484
extract meaningful features from EHR data while relying on the supervision of a large amount of task-specific labels. However, the limitations of fully-supervised methods are twofold. First, supervised methods learn in a task-specific way, which may not fully explore the intrinsic nature of the data itself. Second, building large labeled datasets for all medical prediction tasks is not practically feasible. In many scenarios, the number of task-specific labels is far smaller than the size of the available data, and learning merely from task-specific labels does not fully utilize the information provided by the whole available population. Therefore, it is meaningful to design a learning approach that can successfully leverage the unlabeled data and quickly learn to predict downstream tasks using only a small number of labeled examples.

Contrastive unsupervised learning, which has drawn massive attention in computer vision and natural language processing, provides an "unsupervised pre-train then fine-tune" paradigm to mitigate the above issues of learning from labels. It treats the prediction task as a two-step problem. First, it pre-trains the model by minimizing the dissimilarity of augmented data derived from similar (or identical) samples and maximizing the dissimilarity of augmented data derived from different samples. Second, the pre-trained model is fine-tuned with the labels of the downstream tasks. This self-supervision paradigm enables contrastive learning to fully exploit the abundant information in the data itself, leading to higher data-efficiency and generalization ability, which motivates us to employ self-supervised contrastive learning in EHR research. Several works have applied unsupervised contrastive learning to EHR research. Cai et al. [3] and Wang et al. [29] proposed graph-based contrastive learning frameworks: Cai et al. [3] designed a framework to learn a patient-code graph, a patient graph and a medical code graph in a contrastive way, and Wang et al. [29] proposed a graph sampling contrastive learning method for the EHR coding problem. Both works aimed to utilize International Classification of Diseases (ICD) codes to construct contrastive pairs, which makes them not applicable to scenarios where only time-series data is available. For medical time-series data, there is little prior work using contrastive unsupervised learning. Yèche et al. [33] proposed a neighborhood contrastive learning framework; their data augmentation methods to define positive and negative samples were based on channel dropout, Gaussian noise and a momentum encoder. In our work, we propose a different way to augment the data, with the potential to better exploit the intrinsic predictive features of the time-series.

Our work is inspired by the intrinsic design of the contrastive learning paradigm contrastive predictive coding (CPC) [24], which uncovers the predictive, high-level features of time-series data. One of the main research problems in clinical data is to predict the tendency of patients' future vitals and, further, the future outcome. CPC aims to extract the intrinsic predictive features of a time-series that can predict the future given the past, which matches our goal well. Therefore, we propose an unsupervised framework, termed Contrastive Predictive Autoencoder (CPAE), to learn high-level information from EHR data and adapt this state-of-the-art contrastive learning paradigm to EHRs. As CPC is designed to extract predictive, slow-varying features over time [24], it may disregard transient local features. However, these local features can be very important in clinical scenarios, because patients' time-series can be affected by various transient events (e.g., treatments and social/environmental events) whose impacts should be taken into consideration. Therefore, we propose CPAE to comprise: (i) a contrastive learning process inherited from CPC, which aims to extract slow-varying and predictive features, and (ii) a reconstruction process, which forces the encoder to capture local features. In addition to the basic version of CPAE (termed BaseCPAE), we propose a variant of CPAE that incorporates an attention mechanism (termed AtCPAE). Below we highlight our major contributions:

• We propose two novel architectures of unsupervised contrastive learning, BaseCPAE and AtCPAE, to capture both global slow-varying features and local transient features for EHR research. Compared to other unsupervised models on 0.1%, 0.5%, 1% and 5% labeled data on two downstream classification tasks, the best results are all achieved by our models.
• Our models outperform their supervised counterparts in both low and high label rate scenarios with few exceptions, demonstrating that contrastive pre-training helps improve performance on downstream prediction tasks regardless of the label fraction. This shows the potential of CPAEs as an effective pre-training paradigm for EHR research.

2. Methodology

In this section, we first introduce contrastive predictive coding (CPC) proposed by Oord et al. [24]. Then we describe the motivation and design of our proposed learning paradigm, contrastive predictive autoencoder (CPAE), and two versions of CPAE: the basic version (BaseCPAE) and CPAE with an attention mechanism (AtCPAE). Figure 1 illustrates the architecture of CPC; Figure 2 illustrates the architectures of BaseCPAE and AtCPAE.

2.1. Contrastive predictive coding

To capture higher-level features that broadly affect the shared context [24], CPC is designed to maximize the mutual information between the "past" context vector and the "future" latent vectors. There are five major components of CPC:

• a data sampling and partition module that randomly chooses a time frame $t_0$ to segment the patient time series into two parts, "past" and "future";
• an encoder function $f_{enc}(\cdot)$, which embeds each time frame into a latent vector;
• a regressor $f_{reg}(\cdot)$, which learns the context information $c(t_0)$ of the "past" time series by sequentially feeding the latent vectors $z(1), z(2), \ldots, z(t_0)$ encoded by $f_{enc}$ into $f_{reg}$;
• a prediction function $f_{pred}(\cdot)$, which predicts the "future" latent vectors over $K$ time steps: $\hat{z}(t_0+1), \hat{z}(t_0+2), \ldots, \hat{z}(t_0+K)$;
• a contrastive loss function $L$, which measures the "discriminability" of the model.

As shown in Fig. 1, in the CPC framework the original data for one individual $i$ is denoted as $X^i = [x^i(1), x^i(2), \ldots, x^i(n)]$, where $x^i(1), x^i(2), \ldots, x^i(n)$ are the feature vectors for individual $i$ at time points $1, 2, \ldots, n$. The series $X^i$ ($i = 1, \ldots, N$; $N$ is the number of individuals) are first randomly split into "past" and "future", and encoded by $f_{enc}$, yielding the latent vectors $z^i(1), z^i(2), \ldots, z^i(n)$. The latent vectors in the "past" are then fed into $f_{reg}$ to obtain the context information $c^i(t_0)$ over time. $f_{pred}$ subsequently takes $c^i(t_0)$ as input to predict the future latent vectors $\hat{z}^i(t_0+1), \hat{z}^i(t_0+2), \ldots, \hat{z}^i(t_0+K)$ over $K$ time steps. The predicted future latent vectors and the embedded future latent vectors across samples will be used to form contrastive pairs.
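To make the data flow concrete, the following is a minimal PyTorch sketch of the five components. It is an illustration rather than our released implementation (see the source code link in Section 3.3): for brevity the encoder is a single linear layer and the regressor a GRU, whereas our experiments use LSTM and CNN backbones, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class CPCSketch(nn.Module):
    """Illustrative CPC module with f_enc, f_reg and f_pred (Section 2.1)."""
    def __init__(self, in_dim=76, z_dim=64, c_dim=64, K=4):
        super().__init__()
        self.K = K
        self.f_enc = nn.Linear(in_dim, z_dim)             # placeholder encoder
        self.f_reg = nn.GRU(z_dim, c_dim, batch_first=True)
        self.f_pred = nn.ModuleList(
            [nn.Linear(c_dim, z_dim) for _ in range(K)])  # one head per step k

    def forward(self, x, t0):
        # x: (batch, n, in_dim); t0 splits the series into "past" and "future"
        z = self.f_enc(x)                                 # latent z(1..n)
        ctx, _ = self.f_reg(z[:, :t0])                    # run over the "past"
        c_t0 = ctx[:, -1]                                 # context vector c(t0)
        z_hat = torch.stack([h(c_t0) for h in self.f_pred], dim=1)
        return z_hat, z[:, t0:t0 + self.K]                # predicted, embedded
```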
Fig. 1. The framework of CPC: the whole time-series is divided into "past" and "future". CPC aims to maximize the mutual information between the "past" context vector, i.e., $c^i(t)$, and the "future" embedded latent vectors, i.e., $z(t_0+1), z(t_0+2), \ldots, z(t_0+K)$, by learning to minimize the contrastive loss formulated with the predicted future latent vectors, i.e., $\hat{z}(t_0+1), \hat{z}(t_0+2), \ldots, \hat{z}(t_0+K)$, and the embedded future latent vectors, i.e., $z(t_0+1), z(t_0+2), \ldots, z(t_0+K)$.

Fig. 2. Frameworks of CPAE. (a) BaseCPAE: in addition to the components of CPC, BaseCPAE adds a decoder function (i.e., $f_{dec}$) to reconstruct the time-series; (b) AtCPAE: an attention mechanism is added on top of the latent vectors.
The contrastive loss function $L_{NCE}$ is formulated from the similarity among positive pairs and the similarity among negative pairs, without labels. Let $(\hat{z}^i(t), z^i(t))$ denote the predicted and embedded latent vectors of individual $i$ at time frame $t$, and let $B$ be a randomly sampled batch containing $N$ samples. As contrastive learning is a discriminative approach which aims to maximize the similarity between positive pairs and the difference between negative pairs, we define positive pairs as $(\hat{z}^i(t), z^i(t))$ ($t_0 < t \le t_0+K$, $i \in B$), i.e., the predicted and embedded latent vectors from the same individual, and negative pairs as $(\hat{z}^l(t), z^j(t))$ ($t_0 < t \le t_0+K$; $l, j \in B$; $l \ne j$), i.e., the predicted and embedded latent vectors from different individuals. $\mathrm{sim}(u, v)$ is a similarity measurement function for vectors $u$ and $v$; here we simply use the inner product. The contrastive loss for batch $B$, the Noise Contrastive Estimation (NCE) loss [10,23,24], can then be formulated as:

$$L^B_{NCE} = -\sum_{i \in B} \sum_{k=1}^{K} \log \frac{\exp\big(\mathrm{sim}(\hat{z}^i(t_0+k), z^i(t_0+k))\big)}{\sum_{l,j \in B} \exp\big(\mathrm{sim}(\hat{z}^l(t_0+k), z^j(t_0+k))\big)} \tag{1}$$

The more discriminating the latent vectors across samples are, the lower $L^B_{NCE}$ is. Therefore, CPC models are trained by minimizing $\sum_B L^B_{NCE}$, thereby updating the encoder.
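The batch-wise computation of Eq. (1) can be sketched directly in PyTorch. Note that, as written, the denominator of Eq. (1) sums over all pairs $(l, j)$ in the batch; the sketch below follows that form, using the inner product as $\mathrm{sim}(\cdot, \cdot)$:

```python
import torch

def nce_loss(z_hat, z_fut):
    """Sketch of Eq. (1); z_hat, z_fut: (B, K, z_dim) tensors holding the
    predicted and embedded future latent vectors for one batch."""
    B, K, _ = z_hat.shape
    loss = z_hat.new_zeros(())
    for k in range(K):
        scores = z_hat[:, k] @ z_fut[:, k].T   # sim(z_hat_l, z_j), shape (B, B)
        log_denom = torch.logsumexp(scores.reshape(-1), 0)  # sum over all (l, j)
        loss = loss - (torch.diagonal(scores) - log_denom).sum()  # positives l = j
    return loss
```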
2.2. BaseCPAE

CPC has shown promising results in problems such as speaker identification and image classification, demonstrating its capability in extracting high-level, global and slow-varying information [24]. However, merely extracting slow-varying information may be insufficient for clinical outcome prediction. Transient medical events, which may lead to sudden local fluctuations in the time-series data, are important for outcome prediction. To jointly capture global slow-varying information and local transient information, we propose CPAE to incorporate two processes: a contrastive learning process as in CPC and a reconstruction process as in an autoencoder.

As shown in Fig. 2(a), BaseCPAE inherits the components of CPC (as described in Section 2.1) to capture high-level, global slow-varying information. It additionally applies a decoder function ($f_{dec}$) to the latent vectors $(z(1), z(2), \ldots, z(n))$ to reconstruct $X$, so that the encoder also captures local information.

However, evaluating the reconstruction of such sparse data requires an elaborate design. Since missing data can make up as much as 80% of the data points in EHR data, imputation (following the benchmark work [12]) of these "not missing at random" data points (the missingness is non-random and relates to the missing variable) introduces a large number of repeated values, leading to biased inference [1]. To mitigate the effect of missing data, the uncertainty of the imputed values should be passed to the network, and the reconstruction loss should weight observed data more heavily. Thus, we 1) stack a Boolean indicator matrix (masking matrix) $I$ onto $X$, representing the missingness of the corresponding data points in $X$ (1: missing, 0: observed); and 2) calculate the reconstruction error not only for the whole data matrix, but also explicitly for the data points which are not missing. In addition, the missingness of these "not missing at random" data can itself be informative in clinical scenarios. For instance, blood oxygen saturation is usually a diagnostic test for patients who have symptoms of lung disease (such as chest distress and shallow breathing); patients showing no respiratory symptoms may have fewer records of blood oxygen saturation tests. To take the information contained in missingness into consideration, we additionally calculate the reconstruction error of the masking matrix $I$. We therefore formulate the reconstruction loss as follows.
Let $[\hat{X}^p | \hat{I}^p]$ denote the reconstructed data matrix and reconstructed masking matrix for individual $p$, and let $\mathrm{MSE}(\cdot, \cdot)$ denote the mean squared error. Let $C^p = \{(i, j) \mid I^p_{i,j} = 0\}$ be the set of positions where the data are observed (excluding imputed data) for individual $p$; that is, $X^p_{i,j}$ with $(i, j) \in C^p$ means the $j$-th feature at time $i$ for individual $p$ is an observed value, rather than an imputed value that was originally missing. We calculate the reconstruction loss for individual $p$, $L^p_{dec}$, as

$$L^p_{dec} = \lambda_1 L_1^p + \lambda_2 L_2^p + \lambda_3 L_3^p \tag{2}$$

where

$$L_1^p = \mathrm{MSE}\big([X^p | I^p], [\hat{X}^p | \hat{I}^p]\big), \quad L_2^p = \frac{\sum_{(i,j) \in C^p} \big(X^p_{ij} - \hat{X}^p_{ij}\big)^2}{|C^p|}, \quad L_3^p = \mathrm{MSE}\big(I^p, \hat{I}^p\big)$$
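The three terms of Eq. (2) translate into a few lines of PyTorch. The sketch below assumes $X$, $I$ and the decoder outputs are float tensors of identical shape; it is illustrative rather than our released implementation:

```python
import torch
import torch.nn.functional as F

def recon_loss(X, I, X_hat, I_hat, lam=(1.0, 1.0, 1.0)):
    """Sketch of Eq. (2). X: imputed data; I: masking matrix
    (1 = missing/imputed, 0 = observed); X_hat, I_hat: decoder outputs."""
    # L1: MSE over the stacked data-and-mask matrix [X | I]
    L1 = F.mse_loss(torch.cat([X_hat, I_hat], dim=-1),
                    torch.cat([X, I], dim=-1))
    # L2: MSE restricted to the observed entries C = {(i, j) | I_ij = 0}
    observed = I == 0
    L2 = ((X - X_hat)[observed] ** 2).mean()
    # L3: reconstruct the (clinically informative) missingness pattern itself
    L3 = F.mse_loss(I_hat, I)
    return lam[0] * L1 + lam[1] * L2 + lam[2] * L3
```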
Then we define a multitask loss over a batch $B$ to balance the reconstruction loss $L^p_{dec}$ and the NCE loss $L^B_{NCE}$ of CPAE:

$$L^B = \frac{\lambda_{NCE}}{s} L^B_{NCE} + \frac{\lambda_{dec}}{s} \sum_{p \in B} L^p_{dec} \tag{3}$$

$$\phantom{L^B} = \frac{\lambda_{NCE}}{s} L^B_{NCE} + \sum_{p \in B} \Big( \frac{\lambda_1}{s} L_1^p + \frac{\lambda_2}{s} L_2^p + \frac{\lambda_3}{s} L_3^p \Big), \tag{4}$$

where $s = \lambda_{NCE} + \lambda_1 + \lambda_2 + \lambda_3$. BaseCPAE can be trained by iterating through batches to minimize the loss function defined by Eq. (4). The weights $\lambda_{NCE}, \lambda_1, \lambda_2, \lambda_3$ are a set of hyper-parameters which need to be tuned. Further study may consider using multi-task learning techniques to optimize the choice of these weights.
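Since $s$ normalizes the weights, Eq. (4) is a convex combination of the four loss terms. A one-function sketch (with batch-summed reconstruction terms; the function name is ours):

```python
def cpae_loss(L_nce, L1, L2, L3, lam_nce=1.0, lam1=1.0, lam2=1.0, lam3=1.0):
    # Eq. (4): the weights are normalized by s = lam_nce + lam1 + lam2 + lam3,
    # so scaling all lambdas by a constant leaves the loss unchanged.
    s = lam_nce + lam1 + lam2 + lam3
    return (lam_nce * L_nce + lam1 * L1 + lam2 * L2 + lam3 * L3) / s
```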
2.3. CPAE with attention (AtCPAE)

Forcing the features of the latent vectors to serve both the reconstruction and prediction processes at the same time may limit the learning capacity of these features. We thus introduce an attention mechanism to allow a flexible selection of features in a self-adaptive way. As shown in Fig. 3, for each individual, two linear feature-wise attention modules are introduced on top of the latent vectors $z^p(t)$, one attending to the decoding task and the other to the predictive task. As defined above, $z^p(t)$ denotes the latent vector at time point $t$ for individual $p$. Let $W_{pred}$ and $W_{dec}$ denote the linear attention modules attending to the predictive task and the decoding task, respectively; $W_{pred}$ and $W_{dec}$ are learnable matrices. We then have:

$$z^p_{pred}(t) = s^p_{pred}(t) \cdot z^p(t) = W_{pred}\, z^p(t) \cdot z^p(t)$$
$$z^p_{dec}(t) = s^p_{dec}(t) \cdot z^p(t) = W_{dec}\, z^p(t) \cdot z^p(t) \tag{5}$$

Other than the attention modules, AtCPAE shares the same architecture as BaseCPAE: $z^p_{pred}(t)$ is fed to $f_{reg}$ and the subsequent prediction procedures, and $z^p_{dec}(t)$ is fed to $f_{dec}$ for reconstruction. The loss function of AtCPAE is defined in the same way as in Eq. (4).
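Reading the products in Eq. (5) as element-wise gating of $z^p(t)$ by the attention scores $s^p(t) = W z^p(t)$, the two modules can be sketched as follows (an illustrative reading, not our released implementation):

```python
import torch.nn as nn

class FeatureWiseAttention(nn.Module):
    """Sketch of Eq. (5): linear feature-wise gates that route latent
    features to the predictive and decoding branches."""
    def __init__(self, z_dim=64):
        super().__init__()
        self.W_pred = nn.Linear(z_dim, z_dim, bias=False)  # learnable W_pred
        self.W_dec = nn.Linear(z_dim, z_dim, bias=False)   # learnable W_dec

    def forward(self, z):
        # z: (batch, n, z_dim); the scores s(t) = W z(t) gate z(t) element-wise
        z_pred = self.W_pred(z) * z   # fed to f_reg and the contrastive branch
        z_dec = self.W_dec(z) * z     # fed to f_dec for reconstruction
        return z_pred, z_dec
```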
3. Experiments

3.1. Data

The development and evaluation of both the proposed models and the baseline models are conducted on the MIMIC-III database [17]. We pre-train our models on the training set without labels, then fine-tune them on the same training set for two downstream tasks, i.e., in-hospital mortality and length-of-stay prediction. All data selection, training-test splitting and preprocessing are conducted using the code provided by the benchmark work [12]. As a result, the first 48 h of 17 time-series variables of 21,138 patients are included in this work. The training, validation, and test sets contain 14,681, 3221, and 3236 patients, respectively. In the pre-training phase, the 48 h time-series is divided into "past" and "future" as illustrated in Section 2. For the downstream prediction phase, the context vector $c(t)$ (as described in Fig. 2(a) and (b)) of the 48 h is fed into the linear classifier, as it represents the information contained in the first 48 h after ICU admission. Since the recording time points of medical records are extremely uneven, the data was "discretized" (see Harutyunyan et al. [12]) so that each hour has exactly four time points. Of the 17 variables, twelve are continuous and five are categorical. The categorical variables are one-hot encoded, which results in 76 dimensions overall.

For in-hospital mortality prediction, we follow the standard experimental settings commonly adopted by previous research [12]. The goal of this task is to use the data of the first 48 h of a hospital visit to predict whether this individual would eventually die in the hospital before being discharged.

For length-of-stay prediction, we modify the commonly adopted settings to improve clinical utility. The length of stay of a patient is defined as the time duration from the patient being admitted to the hospital to the patient being discharged. Previous work recoded the length of stay into 10 categories: less than one day, 1-2 days, 2-3 days, 3-4 days, 4-5 days, 5-6 days, 6-7 days, 1-2 weeks and more than 2 weeks. However, this definition neglects the outcome of patients: it treats patients discharged after a short duration the same as patients who died after a short duration, which is not appropriate. Instead, we redefine the prediction target into three categories: death, short stay and long stay. A short stay is defined as a length of stay shorter than 35.5 h (the median length of stay); a long stay is a length of stay longer than 35.5 h. We do not recode the stay duration into ten categories, since we aim to conduct experiments on the training set with a small percentage of labels, which would lead to too few labels for a ten-class classification task.

3.2. Baselines

To demonstrate the effectiveness of our two proposed architectures, BaseCPAE and AtCPAE, we compare our models with the following four baselines of two types, which also serve as ablation studies:

3.2.1. Supervised models trained from scratch (SUP)
We train fully-supervised models (SUP), as supervised counterparts of BaseCPAE and AtCPAE, from scratch, directly on the downstream task without pre-training. More specifically, the whole time-series is encoded by an encoder, then fed into a regression function to obtain the context vector, which is finally connected to a fully connected layer for prediction.

3.2.2. Pre-trained models
Contrastive predictive coding (CPC). We pre-train the encoder and regressor of CPC, then connect the output of the regressor to the downstream classifier.
Contrastive autoencoder (CAE). We design CAE as an autoencoder whose latent vector and reconstructed vector form contrastive pairs for pre-training. On the downstream task, the latent vectors are flattened and fed into the downstream classifier.
Autoencoder (AE). We also conduct pre-training using an AE and then feed the flattened latent vector into the downstream classifier.
All implementations of the above four baselines share the same hyperparameters with our proposed models, so that the comparisons can serve as model ablation studies.

3.3. Setting-up

For all models except SUP, the whole training process contains two steps: 1) pre-training on the full training data without labels, and 2) fine-tuning on a proportion of the training data with labels. We evaluate the performance of all models on two backbones, long short-term memory (LSTM) and convolutional neural network (CNN), separately. Subscript $l$ denotes the LSTM backbone (e.g., CPAE_l) and subscript $c$ denotes the CNN backbone (e.g., CPAE_c).

To compare the performance of the above models fairly, we implement the models such that the shared architecture has identical layers and activation functions in each set of experiments. Please see the source code (available at https://github.com/anonymousparticipant/CPAE) for the hyper-parameters of the architecture.

The pre-training process uses Adam as the optimizer, with weight decay 0.0004 and eps 1e-9. The learning rate is automatically updated during the process. For the fine-tuning process, we sample 0.1%, 0.5%, 1%, 5% of the labeled training data in a class-balanced way for in-hospital mortality prediction, and 0.5%, 1%, 5% for length-of-stay prediction. Note that 0.1% of the length-of-stay data is too small for three-class classification, so we do not conduct experiments at that rate. The sampling process is conducted ten times. We then fine-tune the pre-trained models and the downstream classifier (a linear classifier) together on the sampled subsets. Average performances and standard deviations are reported in Tables 1 and 2.

The evaluation metrics for in-hospital mortality prediction and length-of-stay prediction are area under the curve (AUC) and accuracy (top-1 accuracy), respectively.
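The class-balanced subsampling can be sketched as follows; the equal-per-class split and the helper name are illustrative assumptions rather than details taken from the released code:

```python
import numpy as np

def class_balanced_sample(labels, frac, seed=0):
    """Draw a `frac` fraction of the labeled training set with equal counts
    per class (assumes each class has enough samples to draw from)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    per_class = max(1, round(len(labels) * frac / len(classes)))
    picked = [rng.choice(np.where(labels == c)[0], per_class, replace=False)
              for c in classes]
    return np.concatenate(picked)   # indices of the fine-tuning subset
```

For example, class_balanced_sample(y_train, 0.001) would return the indices of a 0.1% subset, and repeating with ten seeds reproduces the ten sampling rounds.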
3.4. Comparison results

In this subsection, we analyse the comparison results on the two downstream tasks between our proposed models and the baseline models, as shown in Tables 1 and 2.

3.4.1. LSTM-based models
First, compared to SUP_l, AtCPAE_l achieves better performance across different proportions of labels, which demonstrates the strength of pre-training in AtCPAE_l. The pre-training is particularly helpful when there are very few labels (0.1% and 0.5%). In contrast, each of the other pre-trained models (AE_l, CAE_l, CPC_l, BaseCPAE_l) fails to outperform SUP_l in at least one case. In particular, the AUCs of AE_l, CAE_l, CPC_l and BaseCPAE_l on length-of-stay prediction are lower than SUP_l with only one exception. This suggests that the pre-training effect of AE_l, CAE_l, CPC_l and BaseCPAE_l with the LSTM backbone on the MIMIC-III dataset may not be stable, whereas the pre-training in AtCPAE_l stably contributes to prediction on the downstream tasks.

Second, we observe that AtCPAE_l achieves the best performance of all models across both backbones (marked with (∗) in the tables) on 0.1%, 0.5% and 1% of the in-hospital mortality data, and on 0.5%, 1% and 5% of the length-of-stay data. BaseCPAE_l achieves the best performance on 5% of the in-hospital mortality data. None of the best results on the downstream tasks is achieved by a baseline model.

3.4.2. CNN-based models
First, we observe that for the CNN backbone, CPAEs consistently outperform SUP_c in both tasks. In contrast, the other contrastive pre-trained models (CAE_c, CPC_c) do not outperform SUP_c on some proportions of labeled data. For instance, on 1% labeled length-of-stay prediction data, the pre-training in CAE_c and CPC_c leads to decreases of 0.036 and 0.018 in performance, respectively, compared to SUP_c.

In addition, BaseCPAE_c achieves the best performance among CNN-based models for in-hospital mortality prediction. AtCPAE_c achieves the best performance among CNN-based models for length-of-stay prediction when the label rate is 1% or 5%. BaseCPAE_c ranks first on average among all CNN-based models in these prediction tasks.

3.5. Effect of label rates

We further investigate the performance of BaseCPAE and AtCPAE when the label rate is larger. Results are shown in Fig. 4. We observe that 1) BaseCPAE_l is more advantageous when the label rate is larger than 10%; 2) AtCPAE_l significantly surpasses its supervised counterpart when the label rate is smaller than 10%; and 3) the pre-training in BaseCPAE and AtCPAE improves performance in most cases, with few exceptions.

3.6. Case study: prediction among the elderly

We conduct a case study to investigate the prediction performance of our proposed models and the baseline models on a subgroup consisting of the elderly (age > 75). The training, validation and test data are of sizes 3750, 824 and 843, respectively. Since LSTM-based models achieve better performance on both tasks, we conduct these experiments focusing on the LSTM backbone. The results are shown in Table 3.

We observe that AtCPAE_l achieves the best performance on in-hospital mortality prediction, and BaseCPAE_l achieves the best performance on length-of-stay prediction.
Table 1
AUC of models fine-tuned with few labels on in-hospital mortality prediction. Models with subscript $l$ are LSTM-based and models with subscript $c$ are CNN-based. Results close to the best (difference ≤ 0.001) are shown in bold. The number in brackets is the standard deviation over the ten predictions.

Fig. 4. Performance of BaseCPAE, AtCPAE and SUP on the two prediction tasks when the label rate ranges from 0.1% to 100%. Models in the left two panels are LSTM-based; models in the right two panels are CNN-based.
[20,30,36]. Based on the labels, samples of the same class are treated as pairs to formulate a contrastive training signal [36]. To mitigate the impact of high intra-class variation and class imbalance, Wanyan et al. [30] proposed two strategies to construct the k-nearest neighbors sample graph and then draw positive pairs. Li and Gao [20] constructed a contrastive loss between samples and learned cluster anchors, in addition to the supervised contrastive loss. However, these successes rely on a large number of labels, which are not always available in many clinical scenarios.

Unsupervised contrastive pre-training. Unsupervised contrastive learning has been successfully applied in the fields of computer vision [4,14,31] and natural language processing [9,15,26], achieving results comparable to supervised learning while using only a few labels. There are also efforts in biomedical fields. You et al. [34,35] propose a graph contrastive learning model and extend contrastive learning to biochemical applications. In medical image processing, several recent works introduce novel pretext tasks tailored to domain-specific downstream tasks [5,37]. Dong et al. [5] proposed a multi-task framework to learn from sequential medical images, and Dong and Voiculescu [6] extended contrastive learning for medical images to a federated setup. Considering multiple modalities, other works proposed multi-modal contrastive learning methods on medical data to utilize both texts and images [13]. There also exist some works which applied unsupervised contrastive learning to non-imaging medical research. Cai et al. [3] and Wang et al. [29] proposed graph-based contrastive learning frameworks: Cai et al. [3] designed a framework to learn a patient-code graph, a patient graph and a medical code graph in a contrastive way, and Wang et al. [29] proposed a graph sampling contrastive learning method for the EHR coding problem. Both works aimed to utilize ICD codes to build contrastive pairs, which makes them not applicable to scenarios where only time-series data is available. There is little prior work which designed contrastive unsupervised learning for medical time-series data. Yèche et al. [33] proposed a neighborhood contrastive learning framework; their augmentation methods to define positive and negative samples were based on channel dropout, Gaussian noise and a momentum encoder. The afore-

Declaration of Competing Interest

The authors declare that they have no conflict of interest.

Acknowledgments

HP reports personal fees from Genentech outside the submitted work. The other authors state no conflict of interest. This work was exempt from Institutional Review Board review as it is an analysis of publicly downloadable data.

References

[1] B.K. Beaulieu-Jones, D.R. Lavage, J.W. Snyder, J.H. Moore, S.A. Pendergrass, C.R. Bauer, Characterizing and managing missing structured data in electronic health records: data analysis, JMIR Med. Inform. 6 (1) (2018) e11, doi:10.2196/medinform.8960.
[2] D. Bera, M.M. Nayak, Mortality risk assessment for ICU patients using logistic regression, in: 2012 Computing in Cardiology, IEEE, 2012, pp. 493–496.
[3] D. Cai, C. Sun, M. Song, B. Zhang, S. Hong, H. Li, Hypergraph contrastive learning for electronic health records, in: Proceedings of the 2022 SIAM International Conference on Data Mining (SDM), SIAM, 2022, pp. 127–135.
[4] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, arXiv preprint arXiv:2002.05709 (2020).
[5] N. Dong, M. Kampffmeyer, I. Voiculescu, Self-supervised multi-task representation learning for sequential medical images, in: Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021, Proceedings, Part III 21, Springer, 2021, pp. 779–794.
[6] N. Dong, I. Voiculescu, Federated contrastive learning for decentralized unlabeled medical images, in: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part III 24, Springer, 2021, pp. 378–387.
[7] J. Egger, C. Gsaxner, A. Pepe, K.L. Pomykala, F. Jonske, M. Kurz, J. Li, J. Kleesiek, Medical deep learning–a systematic meta-review, Comput. Methods Programs Biomed. (2022) 106874.
[8] A. Fabregat, M. Magret, J.A. Ferré, A. Vernet, N. Guasch, A. Rodríguez, J. Gómez, M. Bodí, A machine learning decision-making tool for extubation in intensive care unit patients, Comput. Methods Programs Biomed. 200 (2021) 105869.
[9] H. Fang, S. Wang, M. Zhou, J. Ding, P. Xie, CERT: contrastive self-supervised learning for language understanding, arXiv preprint arXiv:2005.12766 (2020).
[10] M. Gutmann, A. Hyvärinen, Noise-contrastive estimation: a new estimation principle for unnormalized statistical models, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 297–304.
[11] S.L. Hamilton, J.R. Hamilton, Predicting in-hospital-death and mortality percentage using logistic regression, in: 2012 Computing in Cardiology, IEEE, 2012, pp. 489–492.
[12] H. Harutyunyan, H. Khachatrian, D.C. Kale, G. Ver Steeg, A. Galstyan, Multitask learning and benchmarking with clinical time series data, Sci. Data 6 (1) (2019) 96, doi:10.1038/s41597-019-0103-9.
[13] L. Heiliger, A. Sekuboyina, B. Menze, J. Egger, J. Kleesiek, Beyond medical imaging–a review of multimodal deep learning in radiology (2022).
[14] O. Henaff, Data-efficient image recognition with contrastive predictive coding, in: International Conference on Machine Learning, PMLR, 2020, pp. 4182–4192.
[15] D. Iter, K. Guu, L. Lansing, D. Jurafsky, Pretraining with contrastive sentence objectives improves discourse performance of language models, arXiv preprint arXiv:2005.10389 (2020).
[16] A.E. Johnson, N. Dunkley, L. Mayaud, A. Tsanas, A.A. Kramer, G.D. Clifford, Patient specific predictions in the intensive care unit using a Bayesian ensemble, in: 2012 Computing in Cardiology, IEEE, 2012, pp. 249–252.
[17] A.E. Johnson, T.J. Pollard, L. Shen, H.L. Li-Wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L.A. Celi, R.G. Mark, MIMIC-III, a freely accessible critical care database, Sci. Data 3 (1) (2016) 1–9.
[18] W.A. Knaus, D.P. Wagner, E.A. Draper, J.E. Zimmerman, M. Bergner, P.G. Bastos, C.A. Sirio, D.J. Murphy, T. Lotring, A. Damiano, et al., The APACHE III prognostic system: risk prediction of hospital mortality for critically ill hospitalized adults, Chest 100 (6) (1991) 1619–1636.
[19] J.-R. Le Gall, S. Lemeshow, F. Saulnier, A new simplified acute physiology score (SAPS II) based on a European/North American multicenter study, JAMA 270 (24) (1993) 2957–2963.
[20] R. Li, J. Gao, Multi-modal contrastive learning for healthcare data analytics, in: 2022 IEEE 10th International Conference on Healthcare Informatics (ICHI), IEEE, 2022, pp. 120–127.
[21] H.W. Loh, C.P. Ooi, S. Seoni, P.D. Barua, F. Molinari, U.R. Acharya, Application of explainable artificial intelligence for healthcare: a systematic review of the last decade (2011–2022), Comput. Methods Programs Biomed. 226 (2022) 107161.
[22] O. Martinez, C. Martinez, C.A. Parra, S. Rugeles, D.R. Suarez, Machine learning for surgical time prediction, Comput. Methods Programs Biomed. 208 (2021) 106220.
[23] A. Mnih, Y.W. Teh, A fast and simple algorithm for training neural probabilistic language models, arXiv preprint arXiv:1206.6426 (2012).
[24] A. van den Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding, arXiv preprint arXiv:1807.03748 (2018).
[25] T.J. Pollard, L. Harra, D. Williams, S. Harris, D. Martinez, K. Fong, 2012 PhysioNet challenge: an artificial neural network to predict mortality in ICU patients and application of solar physics analysis methods, in: 2012 Computing in Cardiology, IEEE, 2012, pp. 485–488.
[26] A. Radford, J.W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, arXiv preprint arXiv:2103.00020 (2021).
[27] M. Scherpf, F. Gräßer, H. Malberg, S. Zaunseder, Predicting sepsis with a recurrent neural network using the MIMIC III database, Comput. Biol. Med. 113 (2019) 103395.
[28] S. Vairavan, L. Eshelman, S. Haider, A. Flower, A. Seiver, Prediction of mortality in an intensive care unit using logistic regression and a hidden Markov model, in: 2012 Computing in Cardiology, IEEE, 2012, pp. 393–396.
[29] S. Wang, P. Ren, Z. Chen, Z. Ren, H. Liang, Q. Yan, E. Kanoulas, M. de Rijke, Few-shot electronic health record coding through graph contrastive learning, arXiv preprint arXiv:2106.15467 (2021).
[30] T. Wanyan, J. Zhang, Y. Ding, A. Azad, Z. Wang, B.S. Glicksberg, Bootstrapping your own positive sample: contrastive learning with electronic health record data, arXiv preprint arXiv:2104.02932 (2021).
[31] Z. Wu, Y. Xiong, S.X. Yu, D. Lin, Unsupervised feature learning via non-parametric instance discrimination, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3733–3742.
[32] H. Xia, B.J. Daley, A. Petrie, X. Zhao, A neural network model for mortality prediction in ICU, in: 2012 Computing in Cardiology, IEEE, 2012, pp. 261–264.
[33] H. Yèche, G. Dresdner, F. Locatello, M. Hüser, G. Rätsch, Neighborhood contrastive learning applied to online patient monitoring, in: International Conference on Machine Learning, PMLR, 2021, pp. 11964–11974.
[34] Y. You, T. Chen, Y. Shen, Z. Wang, Graph contrastive learning automated, 2021.
[35] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, Y. Shen, Graph contrastive learning with augmentations, arXiv preprint arXiv:2010.13902 (2020).
[36] C. Zang, F. Wang, SCEHR: supervised contrastive learning for clinical risk prediction using electronic health records, arXiv preprint arXiv:2110.04943 (2021).
[37] Y. Zhang, H. Jiang, Y. Miura, C.D. Manning, C.P. Langlotz, Contrastive learning of medical visual representations from paired images and text, arXiv preprint arXiv:2010.00747 (2020).