
Received: 5 October 2017   Revised: 4 September 2018   Accepted: 3 October 2018

DOI: 10.1002/hbm.24428

RESEARCH ARTICLE

Effective feature learning and fusion of multimodality data using stage-wise deep neural network for dementia diagnosis

Tao Zhou1 | Kim-Han Thung1 | Xiaofeng Zhu1 | Dinggang Shen1,2

1 Department of Radiology and the Biomedical Research Imaging Center, University of North Carolina, Chapel Hill, North Carolina
2 Department of Brain and Cognitive Engineering, Korea University, Seoul, Republic of Korea

Correspondence
Dinggang Shen, Department of Radiology and the Biomedical Research Imaging Center, University of North Carolina, Chapel Hill, North Carolina.
Email: [email protected]

Funding information
Foundation for the National Institutes of Health, Grant/Award Numbers: EB022880, AG053867, EB006733, EB008374, AG041721

Abstract
In this article, we aim to maximally utilize multimodality neuroimaging and genetic data for identifying Alzheimer's disease (AD) and its prodromal status, mild cognitive impairment (MCI), from normal aging subjects. Multimodality neuroimaging data such as MRI and PET provide valuable insights into brain abnormalities, while genetic data such as single nucleotide polymorphisms (SNPs) provide information about a patient's AD risk factors. When these data are used together, the accuracy of AD diagnosis may be improved. However, these data are heterogeneous (e.g., with different data distributions) and have different numbers of samples (e.g., far fewer PET samples than MRI or SNP samples). Thus, learning an effective model using these data is challenging. To this end, we present a novel three-stage deep feature learning and fusion framework, in which a deep neural network is trained stage-wise. Each stage of the network learns feature representations for a different combination of modalities, through effective training using the maximum number of available samples. Specifically, in the first stage, we learn latent representations (i.e., high-level features) for each modality independently, so that the heterogeneity among modalities can be partially addressed and the high-level features from different modalities can be combined in the next stage. In the second stage, we learn joint latent features for each pair of modalities by using the high-level features learned in the first stage. In the third stage, we learn the diagnostic labels by fusing the learned joint latent features from the second stage. To further increase the number of training samples, we also use data at multiple scanning time points for each training subject in the dataset. We evaluate the proposed framework on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset for AD diagnosis, and the experimental results show that the proposed framework outperforms other state-of-the-art methods.

KEYWORDS

Alzheimer's disease (AD), deep learning, mild cognitive impairment (MCI), multimodality data
fusion

1 | INTRODUCTION

Alzheimer's disease (AD) is the most common form of dementia for people over 65 years old (Chen et al., 2017; Mullins, Mustapic, Goetzl, & Kapogiannis, 2017; Rombouts et al., 2005; Zhou, Thung, Zhu, & Shen, 2017; Zhou et al., 2018). According to a recent research report from the Alzheimer's Association (Association, 2016), the total prevalence of AD is expected to reach 60 million worldwide over the next 50 years. AD is a neurodegenerative disease associated with the production of amyloid peptide (Suk et al., 2015), and its symptoms typically start with mild memory loss and gradual losses of other brain functions. As there is no cure for AD, the early detection of AD, and especially of its prodromal stage, that is, mild cognitive impairment (MCI), is vital, so that treatment can be administered to possibly slow down the disease progression (Thung, Wee, Yap, & Shen, 2016; Wee et al., 2012). On the other hand, it is also highly desirable to further classify MCI subjects into two subgroups, that is, progressive MCI (pMCI), which will progress to AD, and stable MCI (sMCI), which will remain stable, so that more resources can be directed to pMCI subjects for their treatment (Thung, Yap et al., 2018).

In search of biomarkers that can accurately identify AD and its earlier statuses, data from different modalities have been collected and examined. One of the most commonly collected data types is Magnetic Resonance (MR) imaging, which provides anatomical brain information for AD study (Chen, Zhang et al., 2016; Cuingnet, Gerardin et al., 2011; Fox et al., 1996; Koikkalainen et al., 2016; Raamana et al., 2014; Raamana, Weiner et al., 2015; Sørensen, Igel et al., 2017; Thung, Wee et al., 2014; Yu Zhang, 2018; Zhanga et al., 2018). For example, Koikkalainen et al. (2016) extracted volumetric and morphometric features from T1 MR images, as well as vascular features from FLAIR images, to build a multi-class classifier based on the disease state index methodology. Raamana et al. (2014) proposed a three-class classifier to discriminate among AD, frontotemporal dementia (FTD), and normal control (NC) subjects using volumes, shape invariants, and local displacements of hippocampi and lateral ventricles obtained from brain MR images. Raamana, Weiner et al. (2015) proposed novel thick-net features that can be extracted from a single time-point MRI scan and demonstrated their potential for individual patient diagnosis. Another neuroimaging technique, Positron Emission Tomography (PET) (Rasmussen, Hansen, Madsen, Churchill, & Strother, 2012), which provides functional brain information, has also been widely used to investigate the neurophysiological characteristics of AD (Chetelat et al., 2003; Escudero, Ifeachor et al., 2013; Liu et al., 2015; Mosconi et al., 2008; Nordberg, Rinne, Kadir, & Långström, 2010). Recent studies have shown that fusing the complementary information from multiple modalities can enhance the diagnostic performance for AD (Kohannim, Hua et al., 2010; Perrin, Fagan, & Holtzman, 2009; Yuan, Wang et al., 2012). For instance, Kohannim, Hua et al. (2010) concatenated attributes (better known as features in the machine learning community) derived from different modalities into a long vector and then trained a support vector machine (SVM) as the classifier. The researchers in (Yuan, Wang et al., 2012; Zhang, Shen et al., 2012) used sparse learning to select features from multiple modalities to jointly predict the disease labels and clinical scores. Another work (Suk et al., 2015) used a multi-kernel SVM strategy to fuse multimodality data for disease label prediction. In addition, discriminative multivariate analysis techniques have been applied to the analysis of functional neuroimaging data (Dai et al., 2012; Haufe et al., 2014; Rasmussen et al., 2012). For instance, Dai et al. (2012) proposed a multi-modality, multi-level, and multi-classifier (M3) framework that used regional functional connectivity strength (RFCS) to discriminate AD patients from healthy controls.

Recently, imaging-genetic analysis (Lin, Cao, Calhoun, & Wang, 2014) has been utilized to identify the genetic basis (e.g., Single Nucleotide Polymorphisms [SNPs]) of phenotypic neuroimaging markers (e.g., features in MRI) and to study the associations between them. In particular, various Genome-Wide Association Studies (GWAS) (Chu et al., 2017; Price et al., 2006; Saykin, Shen et al., 2010; Wang, Nie et al., 2012) have investigated the relationship between human genomic variants and disease biomarkers. For example, GWAS have identified associations between some SNPs and AD-related brain regions (Biffi, Anderson et al., 2010; Shen et al., 2014; Shen, Kim et al., 2010), where the identified SNPs could be used to predict the risk of incident AD at an earlier stage of life, even before pathological changes begin. If successful, such early diagnosis may help clinicians identify prospective subjects to monitor for AD progression and find potential treatments to possibly prevent AD. In our study, we aim to use the complementary information from both the neuroimaging and genetic data for the diagnosis of AD and its related early statuses. As this study shows, the complementary information from multimodality data can indeed improve the diagnosis performance.

There are three main challenges in fusing information from multimodality neuroimaging data (i.e., MRI and PET) and genetic data (i.e., SNP) for AD diagnosis. The first challenge is data heterogeneity, as the neuroimaging and genetic data have different data distributions, different numbers of features, and different levels of discriminative ability for AD diagnosis (e.g., SNP data in their raw form are less effective for AD diagnosis). Due to the heterogeneity issue, simple concatenation of the features from multimodality data will result in an inaccurate prediction model (Di Paola et al., 2010; Liu et al., 2015; Ngiam et al., 2011; Zhu, Suk, Lee, & Shen, 2016).

The second challenge is the high dimensionality issue. One neuroimage scan (i.e., MR or PET image) normally contains millions of voxels, while the genetic data of a subject contain thousands of AD-related SNPs. In this study, we address the high dimensionality of the neuroimaging data by first preprocessing the images to obtain region-of-interest (ROI) based features using a predefined template. However, we do not have a similar strategy to reduce the dimensionality of the genetic data. Thus, we still have a high-dimension-low-sample-size problem, as we have thousands of features (dominated by SNPs) compared to just hundreds of training samples.

The third challenge is the incomplete multimodality data issue, that is, not all samples in the training set have all three modalities. This issue worsens the small-sample-size problem mentioned above if we only use samples with complete multimodality data for training. In addition, using few samples during training may also degrade the performance of a classifier that relies on a large number of training samples to learn an effective model, such as deep learning (Schmidhuber, 2015; Zhou et al., 2017).

To address the above challenges, we propose a novel three-stage deep feature learning and fusion framework for AD diagnosis. Specifically, inspired by the stage-wise learning in (Barshan & Fieguth, 2015), we build a deep neural network and train it stage-wise, where, at each stage, we learn the latent data representations (high-level features) for different combinations of modalities by using the maximum number of available samples. In the first stage, we learn high-level features for each modality independently via progressive mapping through multiple hidden layers. After this first stage of deep learning, the data from the different modalities in the latent representation space (i.e., the output of the last hidden layer) are more discriminative with respect to the target labels, and thus more comparable to each other; in other words, the heterogeneity issue of multimodality data is partially alleviated. In the second stage, we learn a joint feature representation for each modality combination by using the high-level latent features learned in the first stage. In the third stage, we learn the diagnostic labels by fusing the learned joint features from the second stage. It is worth emphasizing that we use the maximum number of available samples to train each stage of the network. For example, in the first stage, to learn the high-level latent features from MRI data, we use all the available MRI data; in the second stage, to learn the joint high-level features from MRI and PET data, we use all the samples with complete MRI and PET data; and in the third stage, we use all the samples with complete MRI, PET, and SNP data. In this way, the small-sample-size and incomplete multimodality data issues can be partially addressed. Moreover, to learn a more effective deep classification model, we further increase the number of training samples by using multiple time-point data for each training subject, if available.

The main contributions of our work are summarized as follows: (a) To the best of our knowledge, this is the first deep learning framework that fuses multimodality neuroimaging and genetic data for AD diagnosis. (b) We propose a novel three-stage deep learning framework to partially address the data heterogeneity, small-sample-size, and incomplete multimodality data issues. (c) We propose to significantly increase the number of training samples by using multiple time-point data scanned for each training subject in the ADNI study, which differs from most existing methods that consider only the data scanned at a single time point.

The rest of this article is organized as follows. We briefly describe the background and related works in Section 2, introduce the proposed framework in Section 3, describe the materials and the data preprocessing method used in this study in Section 4, present the experimental results in Section 5, and conclude our study in Section 6.

2 | BACKGROUND

2.1 | Feature extraction of neuroimaging data

There are basically three approaches for extracting features from neuroimaging data (Jack, Bernstein et al., 2008): (a) the voxel-based approach, which directly uses voxel intensity values from the neuroimages as features; (b) the patch-based approach, which extracts features from local image patches; and (c) the region-of-interest (ROI) based approach, which extracts features from predefined brain regions. Among these, the voxel-based approach is perhaps the most straightforward, as it uses the raw low-level image intensity values as features. Because of that, it has the drawbacks of high feature dimensionality and high computational load, and it ignores the regional information of the neuroimages, as it treats each voxel independently. In contrast, the patch-based approach can capture brain regional information by extracting features from image patches. As disease-related information and brain structures are more easily found in image patches, this approach generally obtains much better classification performance than the voxel-based approach. A higher level of information can be extracted by using brain anatomical priors, as in the ROI-based approach. The dimensionality of ROI-based features depends on the number of ROIs defined in the template, which is comparatively smaller than in the aforementioned approaches; thus, this is a good feature reduction method that can still reflect whole-brain information (Barshan & Fieguth, 2015; Cuingnet, Gerardin et al., 2011; Suk et al., 2015; Wan et al., 2012; Zhou et al., 2017). Accordingly, we also use the ROI-based approach in this study to reduce the feature dimensionality of the neuroimaging data.

2.2 | Deep learning in AD study

Deep learning has been widely used for learning high-level features and conducting classification, achieving promising results (Barshan & Fieguth, 2015; Farabet, Couprie, Najman, & LeCun, 2013), as it can effectively capture hidden or latent patterns in the data. Recently, deep learning algorithms have been successfully applied to medical image processing and analysis (Litjens et al., 2017). For instance, Zheng et al. (2016) proposed a multimodal neuroimaging feature learning algorithm with stacked deep polynomial networks for AD study. Fakoor et al. (2013) presented a method to enhance cancer diagnosis from gene expression data by using unsupervised deep learning methods (e.g., the stacked auto-encoder [SAE]). Suk et al. (2015) adopted the SAE to discover latent feature representations from ROI-based features, and then used a multi-kernel learning (MKL) framework to combine the latent features from multimodality data for AD diagnosis. Liu et al. (2015) also adopted an SAE-based multimodal neuroimaging feature learning algorithm for AD diagnosis. Suk, Lee et al. (2014) adopted the Restricted Boltzmann Machine (RBM) to learn multi-modal features from 3D patches for AD/MCI diagnosis. Plis et al. (2014) adopted RBMs and Deep Belief Networks (DBNs) to learn high-level features from MRI and fMRI for schizophrenia diagnosis. The common limitation of these deep learning methods is that they assume the data are complete, so only the data with complete multimodality can be used in training and testing. This limitation may also reduce the effectiveness of training the deep learning model, as fewer samples can be used in the training. In the next section, we show how we address this limitation by proposing a stage-wise deep learning model.

3 | PROPOSED FRAMEWORK

Figure 1 shows the overview of our proposed three-stage deep feature learning and fusion framework for AD classification using multimodality neuroimaging data (i.e., MRI and PET) and genetic data (i.e., SNP). Our framework aims to maximally utilize all the available data from the three modalities to train an effective deep learning model. There are three stages in our proposed deep learning framework, where each stage is composed of a set of different deep neural networks (DNNs), with each DNN used to learn feature representations for a different combination of modalities by using the maximum number of available samples. In particular, the first stage learns the latent representations for each individual modality, the second stage learns the joint latent representations for each pair of modalities, and the third stage learns the classification model using the joint latent representations from all the modality pairs. The details of each stage of the framework are described in the following.

FIGURE 1 The proposed overall framework of the three-stage deep neural network for AD diagnosis using MRI, PET, and SNP data. We first learn latent representations (i.e., high-level features) for each modality independently in stage 1. Then, in stage 2, we learn joint latent feature representations for each pair of modalities (e.g., MRI and PET, MRI and SNP, PET and SNP) by using the high-level features learned in stage 1. Finally, in stage 3, we learn the diagnostic labels by fusing the learned joint latent feature representations from stage 2 [Color figure can be viewed at wileyonlinelibrary.com]

3.1 | Stage 1 - individual modality feature learning

The ROI-based features for MRI and PET data are continuous and low-dimensional (i.e., 93 features), while SNP data are discrete (i.e., 0, 1, or 2) and high-dimensional (i.e., 3,123 features). Direct concatenation of these data would result in an inaccurate detection model, as the SNP data, which are only indirectly related to the target labels, would dominate the feature learning process. In addition, there is also the incomplete multimodality data issue, that is, not all samples have all the modalities; in particular, the PET data have far fewer samples than the MRI and SNP data. This implies that, if we trained a single DNN model for all three modalities, only samples with complete multimodality data could be used, thus limiting the effectiveness of the model.

Therefore, in Stage 1 of our proposed framework, we employ a separate DNN for each individual modality, as depicted in Figure 1. Each DNN contains several fully-connected hidden layers and one output layer (i.e., a softmax classifier). The output layer consists of three neurons for the three-class classification task (i.e., AD/MCI/NC), or four neurons for the four-class classification task (i.e., NC/sMCI/pMCI/AD). During training, we use the label information of the training samples at the output layer to guide the learning of the network weights. After training, the outputs of the last hidden layer of each DNN are regarded as the latent representations (i.e., high-level features) of the corresponding modality.

There are several advantages to this individual modality feature learning strategy. First, it allows us to use the maximum number of available training samples for each modality. For example, assume that we have N subjects, where N1 subjects contain MRI data, N2 subjects contain PET data, and N3 subjects contain SNP data. The conventional multimodality model uses only the subjects with all three modalities, which can be much smaller than min(N1, N2, N3). By using our proposed framework, in contrast, we can use all N1, N2, and N3 samples to train three separate deep learning models for the three modalities, respectively. It is expected that, by using more samples in training, our model can learn better latent representations for each modality. Furthermore, this setting also partially addresses the incomplete multimodality data issue, as the framework is applicable to training sets with incomplete multimodality data. Second, it allows us to use both a different number of hidden layers and a different number of hidden neurons (for each layer) to learn the latent representations of each modality and modality combination. We argue that, as our multimodality data are heterogeneous, with different feature sizes and different discriminability for AD diagnosis, the number of hidden layers and the number of neurons in the neural network should be modality-dependent. For instance, for the modality with more features (i.e., SNP in our case), we use more hidden layers and gradually reduce the number of neurons in each layer to reduce the dimensionality of the modality; for the modalities with fewer features or a more direct relationship to the targets (i.e., the ROI-based MRI and PET features in our case), we can use a small number of hidden layers to obtain the latent features. This strategy is also consistent with previous studies that fuse multimodality data at the later stages of the hidden layers (Ngiam et al., 2011; Srivastava & Salakhutdinov, 2012; Suk, Lee et al., 2014). As a result, the high-level features (i.e., the output of the last hidden layer) of each modality should be more comparable to each other, as they are semantically closer to the target labels, thus partially addressing the modality heterogeneity issue.
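To make this design concrete, the following is a minimal PyTorch sketch of one Stage 1 per-modality subnetwork. PyTorch itself, the class name, and the specific hidden-layer sizes are illustrative assumptions (chosen within the architecture search ranges reported in Section 5.1), not the exact configuration used in our experiments.

```python
# Minimal sketch of one Stage 1 per-modality DNN (illustrative, not the
# exact experimental configuration). Softmax is applied implicitly by
# the cross-entropy loss during training.
import torch
import torch.nn as nn

class ModalityNet(nn.Module):
    """Fully-connected subnetwork for a single modality; the last hidden
    layer provides the latent (high-level) features."""
    def __init__(self, in_dim, hidden_dims, n_classes):
        super().__init__()
        layers, d = [], in_dim
        for h in hidden_dims:
            layers += [nn.Linear(d, h), nn.ReLU()]
            d = h
        self.hidden = nn.Sequential(*layers)   # latent feature extractor
        self.out = nn.Linear(d, n_classes)     # softmax output layer

    def forward(self, x):
        z = self.hidden(x)                     # latent representation
        return self.out(z), z

# Heterogeneous modalities get heterogeneous architectures:
mri_net = ModalityNet(93,   [64, 32],         n_classes=3)   # ROI features
pet_net = ModalityNet(93,   [64, 32],         n_classes=3)   # ROI features
snp_net = ModalityNet(3123, [64, 32, 32, 16], n_classes=3)   # deeper net for SNPs

# Each subnetwork is trained independently on all samples that have its
# modality, for example:
logits, latent = mri_net(torch.randn(8, 93))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 3, (8,)))
loss.backward()
```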
3.2 | Stage 2 - joint latent representation learning of two modalities

In Stage 2, we learn the feature representations for the different combinations of modality pairs (i.e., MRI-PET, MRI-SNP, and PET-SNP). The aim of this stage is to fuse the complementary information from different modalities to further improve the performance of the classification framework. The complete DNN architecture used in Stage 2 is depicted in Figure 1. There are a total of three DNN architectures, one for each pair of modalities. Note that the outputs from the hidden layers in Stage 1 are regarded as intermediate inputs in Stage 2, and the weights from Stage 1 are regarded as the initial weights of the DNN architecture in Stage 2. In addition, we use three outputs to train each DNN architecture: two of the outputs are used to guide the learning of high-level features from the two different modalities, while the third output is used to guide the learning of joint high-level features for the two modalities.

Note that we also use the maximum number of available samples for this stage. For instance, to learn the feature representation for the combination of MRI and PET data, we use the samples with complete MRI and PET data to train the DNN model. Using the same example as in the previous section, suppose N1 subjects contain MRI data, N2 subjects contain PET data, N3 subjects contain SNP data, and Nmp ≤ min(N1, N2) subjects contain both MRI and PET data. Then, we use the Nmp samples to train the network for the MRI & PET modality pair in Stage 2, while using the N1 samples and the N2 samples to train the independent MRI and PET network models, respectively. The weights learned from Stage 1 are used as the initial weights for Stage 2. We use a similar strategy to train the neural networks for the other modality pairs; thus, in Stage 2, we train a total of three DNN models for the three combinations of modality pairs.
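The following sketch illustrates one such Stage 2 pairwise subnetwork in the same assumed PyTorch setting, with two branch modules standing in for (copies of) the Stage 1 hidden layers used for initialization; all layer sizes are again illustrative.

```python
# Minimal sketch of a Stage 2 pairwise DNN with three supervised
# outputs: one per modality branch and one for the joint representation.
import torch
import torch.nn as nn

class PairNet(nn.Module):
    def __init__(self, branch_a, branch_b, dim_a, dim_b, joint_dim, n_classes):
        super().__init__()
        self.branch_a, self.branch_b = branch_a, branch_b  # init from Stage 1
        self.head_a = nn.Linear(dim_a, n_classes)          # guides modality A
        self.head_b = nn.Linear(dim_b, n_classes)          # guides modality B
        self.joint = nn.Sequential(nn.Linear(dim_a + dim_b, joint_dim), nn.ReLU())
        self.head_joint = nn.Linear(joint_dim, n_classes)  # guides the pair

    def forward(self, xa, xb):
        za, zb = self.branch_a(xa), self.branch_b(xb)
        zj = self.joint(torch.cat([za, zb], dim=1))        # joint latent features
        return self.head_a(za), self.head_b(zb), self.head_joint(zj), zj

# Example: an MRI-PET pair, with 32-D latent outputs from each branch.
mri_hidden = nn.Sequential(nn.Linear(93, 64), nn.ReLU(), nn.Linear(64, 32), nn.ReLU())
pet_hidden = nn.Sequential(nn.Linear(93, 64), nn.ReLU(), nn.Linear(64, 32), nn.ReLU())
pair = PairNet(mri_hidden, pet_hidden, 32, 32, joint_dim=32, n_classes=3)

xa, xb, y = torch.randn(8, 93), torch.randn(8, 93), torch.randint(0, 3, (8,))
la, lb, lj, _ = pair(xa, xb)
ce = nn.CrossEntropyLoss()
(ce(la, y) + ce(lb, y) + ce(lj, y)).backward()  # all three outputs supervised
```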
3.3 | Stage 3 - final feature fusion of three modalities

After Stage 2, we obtain the joint feature representations of all the modality pairs. We then fuse all the joint representations in a final DNN prediction model. The architecture used in this stage is depicted as Stage 3 in Figure 1. In Stage 3, we use the learned joint high-level features from Stage 2 as input and the target labels as output. As features from all three modalities are involved in the DNN architecture of Stage 3, we can only use the samples with complete MRI, PET, and SNP data to train this part of the network, and then fine-tune the whole network (i.e., the DNN architectures in Stages 1, 2, and 3). Note that the networks in Stage 1 and Stage 2 are learned using more of the available training samples; this is the major advantage of stage-wise network training, which makes full use of all available samples. After training the whole network, we obtain the diagnostic label for each testing sample (with complete data from the three modalities) at the output layer. Due to the limited number of subjects with complete multimodality data in this study, the classification results at the last output layer may suffer from over-fitting. Thus, we use a majority voting strategy over all seven softmax output layers in Stage 3 (shown in Figure 1) to produce our final classification result.
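The voting step itself reduces to the short sketch below, assuming the seven softmax outputs have been collected as NumPy probability arrays of shape (n_samples, n_classes); the function name is hypothetical.

```python
# Minimal sketch of majority voting over the seven Stage 3 softmax outputs.
import numpy as np

def majority_vote(softmax_outputs):
    """softmax_outputs: list of 7 arrays, each of shape (n_samples, n_classes)."""
    votes = np.stack([p.argmax(axis=1) for p in softmax_outputs])  # (7, n_samples)
    n_classes = softmax_outputs[0].shape[1]
    pred = np.empty(votes.shape[1], dtype=int)
    for i in range(votes.shape[1]):                                # per test sample
        pred[i] = np.bincount(votes[:, i], minlength=n_classes).argmax()
    return pred

rng = np.random.default_rng(0)
outputs = [rng.dirichlet(np.ones(3), size=10) for _ in range(7)]   # dummy softmaxes
print(majority_vote(outputs))
```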
4 | MATERIALS AND IMAGE DATA PREPROCESSING

We use the public Alzheimer's Disease Neuroimaging Initiative (ADNI) database to evaluate the performance of our framework. The ADNI dataset was launched in 2003 by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, the Food and Drug Administration, private pharmaceutical companies, and non-profit organizations, as a 5-year public-private partnership. The main goal of ADNI is to investigate the potential of fusing multimodality data, including neuroimaging, clinical, biological, and genetic biomarkers, to diagnose AD and its early statuses.

4.1 | Subjects

In this study, we used 805 ADNI-1 subjects, including 190 AD, 389 MCI, and 226 normal control (NC) subjects, all of whom have MR images scanned at the first screening time (i.e., the baseline). Of these subjects, 360 have PET data and 737 have SNP data. The detailed demographic information of the baseline subjects is summarized in Table 1. In addition, Table 2 shows the numbers of subjects with different combinations of modalities. From Table 2, it is clear that some subjects have certain modalities missing, as only 360 subjects have complete multimodality data.

After the baseline scan, follow-up scans were acquired every 6 or 12 months for up to 36 months. However, not all subjects came back for follow-up scans, and not all kinds of neuroimaging scans were acquired for each subject. Thus, the number of longitudinal scans differs across subjects, as does the number of modalities available at each time point. Nevertheless, our framework is still applicable in this case, as it is robust to incomplete multimodality data.

For the MCI subjects, we retrospectively labeled those who progressed to AD after a certain period of time as pMCI subjects, and those who remained stable as sMCI subjects. Following this convention, the labeling of sMCI/pMCI can be affected by both the reference time point and the time period over which the patients are monitored for conversion to AD. We considered the 18th month as the reference and 30 months as the monitoring period, so that there is a sufficient number of earlier scan samples (i.e., samples at baseline and at the 6th and 12th months) in each cohort (i.e., pMCI and sMCI) for our study. Thus, MCI patients who converted to AD between the 18th and 48th months (a duration of 30 months) are labeled as pMCI, while MCI patients whose conditions remained stable are labeled as sMCI. MCI patients who progressed to AD prior to the 18th month were excluded from the study, because they were no longer MCI patients at the reference time point. Similarly, MCI patients who converted to AD after the 48th month were also excluded, to avoid ambiguity in labeling. In addition, as some MCI subjects dropped out of the study after the baseline scans, their sub-labels (pMCI or sMCI) cannot be determined. Hence, the total numbers of pMCI (i.e., 157) and sMCI (i.e., 205) subjects do not match the total number of baseline MCI subjects.
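This labeling rule can be summarized by the sketch below, in which conversion_month is a hypothetical field giving the month (relative to baseline) at which an MCI subject was first diagnosed as AD, or None if no conversion was observed.

```python
# Minimal sketch of the sMCI/pMCI labeling rule (field names hypothetical).
def label_mci(conversion_month, followed_up=True):
    if not followed_up:
        return None          # dropped out after baseline: sub-label undetermined
    if conversion_month is None:
        return "sMCI"        # remained stable
    if conversion_month < 18:
        return None          # already AD at the reference time point: excluded
    if conversion_month <= 48:
        return "pMCI"        # converted within the 30-month monitoring window
    return None              # converted after month 48: excluded as ambiguous

assert label_mci(24) == "pMCI"
assert label_mci(None) == "sMCI"
assert label_mci(12) is None and label_mci(54) is None
```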
TABLE 1 Demographic information of the baseline subjects in this study (MMSE: Mini-Mental State Examination)

        Female/male   Education    Age          MMSE
NC      108/118       16.0 ± 2.9   75.8 ± 5.0   29.1 ± 1.0
MCI     138/251       15.6 ± 3.0   74.9 ± 7.3   27.0 ± 1.8
AD      101/89        14.7 ± 3.1   75.2 ± 7.5   23.3 ± 2.0
Total   347/458       15.5 ± 3.0   75.2 ± 6.8   26.7 ± 2.7

TABLE 2 Numbers of subjects with different combinations of modalities

Modality   MRI   PET   SNP   MRI & PET   MRI & SNP   PET & SNP   MRI & PET & SNP
Number     805   360   737   360         737         360         360

4.2 | Processing of neuroimages and SNPs

For this study, we downloaded the preprocessed 1.5 T MR images and PET images from the ADNI website.1 The MR images were collected using a variety of scanners, with protocols individualized for each scanner. To ensure the quality of all images, ADNI reviewed the MR images and corrected them for spatial distortion caused by B1 field inhomogeneity and gradient nonlinearity. For the PET images, which were collected 30–60 min post Fluoro-Deoxy-Glucose (FDG) injection, multiple operations had been performed, including averaging, spatial alignment, interpolation to a standard voxel size, intensity normalization, and smoothing to a common resolution.

1 http://www.loni.usc.edu/ADNI

After that, following previous studies (Barshan & Fieguth, 2015; Suk et al., 2015), we further processed these neuroimages to extract ROI-based features. Specifically, the MR images were processed using the following steps: anterior commissure-posterior commissure (AC-PC) correction using the MIPAV software,2 intensity inhomogeneity correction using the N3 algorithm (Sled, Zijdenbos, & Evans, 1998), brain extraction using a robust skull-stripping algorithm (Wang, Nie et al., 2014), cerebellum removal, tissue segmentation using the FAST algorithm in the FSL package (Zhang, Brady, & Smith, 2001) to obtain three main tissues (i.e., white matter (WM), gray matter (GM), and cerebrospinal fluid), registration to a template (Kabani, 1998) using the HAMMER algorithm (Shen & Davatzikos, 2002), and projection of the ROI labels from the template image to the subject image. Finally, for each ROI in the labeled image, we computed the GM tissue volume, normalized it by the intracranial volume, and used it as an ROI feature. Moreover, for each subject, we aligned the PET images to their respective T1 MR images using affine registration, computed the average PET intensity value of each ROI, and regarded it as a feature. Thus, for a template with 93 ROIs, we obtained 93 ROI-based neuroimaging features for each neuroimage (i.e., MRI or PET). In addition, for the SNP data, according to the AlzGene database,3 only the SNPs belonging to the top AD gene candidates were selected. The selected SNPs were used to estimate the missing genotypes, and the Illumina annotation information was also adopted to select a subset of SNPs (An et al., 2017; Saykin, Shen et al., 2010). In this study, we adopted 3,123-dimensional SNP data.

2 http://mipav.cit.nih.gov/clickwrap.php
3 www.alzgene.org

5 | EXPERIMENTAL RESULTS AND ANALYSIS

5.1 | Experimental setup

In this section, we evaluate the effectiveness of the proposed deep feature learning and fusion framework on four classification tasks: (a) NC versus MCI versus AD, (b) NC versus sMCI versus pMCI versus AD, (c) NC versus MCI, and (d) NC versus AD. For each classification task, we used 20-fold cross-validation, due to the limited number of subjects. Specifically, we first split our dataset into 20 parts according to the subjects' unique Roster IDs (RIDs), where one part is used as the testing set. Of the remaining RIDs, 10% are used as the validation set, while 90% are used as the training set. Furthermore, as the success of a deep learning model relies greatly on an adequate number of training samples, which enables the neural network to learn a nonlinear mapping from the input features to the target labels, we have taken two strategies to increase the number of samples in our study. First, we trained our model stage-wise, where each stage of the neural network learns feature representations for different modality combinations. In this way, we can use all the available samples in each stage of deep learning model training. In contrast, if we trained our deep learning model directly, we could only use the limited number of samples with complete modalities. Second, as ADNI has been longitudinally collecting data for all the participating subjects and monitoring their disease status progression, we propose to exploit these longitudinal data in our model. More specifically, we used the samples from multiple time points for all the training RIDs in our study. These two strategies significantly increase the number of samples available to train our model. Figure 2 shows how we split the data for training, validation, and testing. For the training set, we can either use the baseline data (single time-point) or the longitudinal data (multiple time-points) to train our deep learning model.

FIGURE 2 Dataset separation procedure used in our study. For the training set, we can either use the baseline data (single time-point) or the longitudinal data (multiple time-points) to train our deep learning model
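A minimal sketch of this subject-level splitting scheme is given below, using scikit-learn's GroupKFold as a stand-in for our RID-based partitioning; the array sizes are illustrative. Grouping by RID guarantees that all scans of one subject, including follow-up time points, fall on the same side of the split.

```python
# Minimal sketch of an RID-grouped 20-fold split (sizes illustrative).
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_scans = 1000
rids = rng.integers(0, 360, size=n_scans)    # subject ID of each scan
X = rng.normal(size=(n_scans, 93))           # dummy ROI features
y = rng.integers(0, 3, size=n_scans)         # dummy labels

for train_idx, test_idx in GroupKFold(n_splits=20).split(X, y, groups=rids):
    train_rids = np.unique(rids[train_idx])
    rng.shuffle(train_rids)
    n_val = max(1, len(train_rids) // 10)    # ~10% of training RIDs -> validation
    val_mask = np.isin(rids[train_idx], train_rids[:n_val])
    val_idx, fit_idx = train_idx[val_mask], train_idx[~val_mask]
    # Training, validation, and testing subjects are mutually exclusive:
    assert not set(rids[fit_idx]) & set(rids[test_idx])
    assert not set(rids[val_idx]) & set(rids[fit_idx])
    break                                    # one fold shown
```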
Next, we discuss how to set the network structure of our proposed deep learning framework. For clarity, we define "hyperparameters" as the parameters related to the network structure (e.g., the number of layers and the number of nodes in each layer) and to network learning (e.g., regularization parameters and dropout). As in many deep learning studies, it is challenging to determine the network hyperparameters: their tuning involves a lot of experience, guesswork, assumptions, prior knowledge of the data, and experiments. As the cost (in terms of time and money) of training a deep neural network for each hyperparameter combination is high due to the large number of network parameters, it is not feasible to use inner cross-validation to determine all the hyperparameters for each fold of data. For example, using inner cross-validation to select the number of layers and the number of neurons per layer (while fixing the other hyperparameters) yields 5 × 5 = 25 combinations for each stage of the network, even considering just five possible values each for the number of layers and the number of neurons per layer. As we have three modalities and three stages (with only one combination needed in the third stage), our proposed deep neural network has a total of 25 × 3 × 2 + 25 = 175 hyperparameter value combinations. In each fold, if we use 5 subfolds and 5 repetitions for inner cross-validation, we end up with a total of 175 × 5 × 5 = 4,375 simulations per fold of experiment. The mean computation time for each simulation is about 1 min (as we ran our experiments on a shared lab server, the computation time could be longer when the server is busy). Thus, we would need about 4,375/60/24 ≈ 3 days for one fold of experiment. If we used 20-fold cross-validation repeated 50 times, the computation cost could reach 3,000 GPU-days, which is not practical.

Due to the enormous computation cost of the full inner cross-validation strategy, we limited our search to a small predefined range. First, we consider how to set the number of layers. We investigated the effects of different numbers of layers and found that the performance could degrade when more layers are used: as we have a limited number of training samples, too many layers (and thus more network parameters) cause over-fitting. We therefore consider fewer than five hidden layers. Second, we need to set the number of neurons in each layer. In our study, we use ROI-based neuroimaging features for the PET and MRI data. From the literature (Zhu et al., 2016), we know that not all ROIs are related to the disease; accordingly, we set the number of neurons to be smaller than the number of ROI-based input features for each neuroimaging modality. Similarly, for the SNP data, previous studies have indicated that only a handful of SNPs is helpful for AD diagnosis (An et al., 2017); thus, we also use a small number of hidden neurons for the SNP data.

The hyperparameter combinations searched for each stage of the network are as follows. In Stage 1, we search for the best setting among four combinations of hidden-layer sizes, that is, 64, 64–32, 64–32–32, and 64–32–32–16. In Stage 2 and Stage 3, we select the best architecture from two options, that is, 32 and 32–16. We use fewer layers in Stages 2 and 3, as we assume the features have become more semantic (high-level) after Stage 1 training. The details of our experiments are as follows. In our proposed stage-wise method, we first selected the hyperparameters using an inner cross-validation loop (using only the training set) in Stage 1 for each modality (i.e., MRI, PET, SNP). Next, we fixed the network architectures of Stage 1 and ran an inner cross-validation loop to tune the network architectures in Stage 2 for each modality pair (i.e., MRI + PET, MRI + SNP, PET + SNP). Finally, we fixed the network architectures of Stages 1 and 2 to tune the network architecture in Stage 3 for the three-modality combination. Compared with the previous version of our work (in which the hyperparameters were fixed across all stages and folds), our model now selects its settings based on the best result of the inner cross-validation experiment, using only the training dataset. We used only two folds for the inner cross-validation to reduce the computation cost, but employed 20-fold outer cross-validation, with 50 repetitions, to get a more accurate estimate of model performance. Furthermore, L1 and L2 regularizations were imposed on the weight matrices of the networks, with the regularization parameters for the L1 and L2 regularizers empirically set to 0.001 and 0.1, respectively.

5.2 | Implementation details

As described in Section 4.1, we have a total of 737 subjects (the corresponding RID set is denoted as "Rall"), of which 360 subjects (with their corresponding RIDs denoted as "Rcom") have complete multimodality data (i.e., MRI, PET, and SNP). Besides, we denote the RID sets corresponding to the subjects with MRI, PET, and SNP data as "RMRI", "RPET", and "RSNP", respectively. In the following, we describe how this dataset is used in our three-stage training. First, we split the 360 subjects with complete data ("Rcom") into 20 subsets according to their RIDs, where one of the subsets is used as the testing set, and the remaining subsets are further divided into two parts: a validation set (10%) and a training set (90%). We denote the RIDs corresponding to the testing set as "Rte" and the RIDs corresponding to the validation set as "Rva". In Stage 1 of our deep learning model, we learn feature representations for each modality independently. For example, for the MRI modality, we use all training subjects with available MRI data to train the MRI submodel, where the corresponding RID set is RMRI − Rva − Rte. In other words, all MRI data corresponding to the RID set RMRI − Rva − Rte (including data from other time points, if using longitudinal data) are used to train our MRI model. Similarly, the RID sets used to train the PET and SNP submodels are RPET − Rva − Rte and RSNP − Rva − Rte, respectively. It can be clearly seen that the subject sets used in training, validation, and testing are mutually exclusive. In Stage 2, we train three neural network submodels for the three different modality pairs using the outputs from Stage 1. As in Stage 1, we use all available subjects to train the submodels in Stage 2. For example, for the MRI + PET submodel, the corresponding RID set is (RMRI ∩ RPET) − Rva − Rte (∩ denotes intersection). Similarly, the RID sets for the MRI + SNP and PET + SNP submodels are (RMRI ∩ RSNP) − Rva − Rte and (RPET ∩ RSNP) − Rva − Rte, respectively. In Stage 3, we use the subjects with all three modalities to train the whole network; thus, the corresponding RID set is (RMRI ∩ RPET ∩ RSNP) − Rva − Rte. Clearly, we have the most training subjects in Stage 1, fewer in Stage 2, and the fewest in Stage 3. In brief, in each stage, we first find the RID set corresponding to the training subjects, and then use all data corresponding to that RID set (including data from time points other than the baseline, if using longitudinal data) as training samples.
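The RID bookkeeping described above reduces to plain set arithmetic, as in the sketch below; the membership patterns are illustrative stand-ins rather than the actual ADNI RID lists.

```python
# Minimal sketch of the stage-wise RID set arithmetic (IDs illustrative).
R_MRI = set(range(805))                                   # subjects with MRI
R_PET = set(range(360))                                   # subjects with PET
R_SNP = set(range(360)) | set(range(428, 805))            # 737 subjects with SNP
R_com = R_MRI & R_PET & R_SNP                             # 360 complete subjects

R_te = set(list(R_com)[:18])                              # one of 20 test subsets
R_va = set(list(R_com - R_te)[:34])                       # ~10% for validation

stage1 = {"MRI": R_MRI - R_va - R_te,
          "PET": R_PET - R_va - R_te,
          "SNP": R_SNP - R_va - R_te}
stage2 = {"MRI+PET": (R_MRI & R_PET) - R_va - R_te,
          "MRI+SNP": (R_MRI & R_SNP) - R_va - R_te,
          "PET+SNP": (R_PET & R_SNP) - R_va - R_te}
stage3 = (R_MRI & R_PET & R_SNP) - R_va - R_te

# Stage 1 sees the most training subjects, Stage 3 the fewest:
assert len(stage1["MRI"]) >= len(stage2["MRI+PET"]) >= len(stage3)
```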

5.3 | Comparison with other feature representation methods

We compared the proposed framework with four popular feature representation methods, that is, principal component analysis (PCA) (Wold, Esbensen, & Geladi, 1987), canonical correlation analysis (CCA) (Bron, Smits et al., 2015; Hardoon, Szedmak, & Shawe-Taylor, 2004), locality preserving projection (LPP) (He et al., 2006), and the L21-based feature selection method (Nie et al., 2010). For PCA and LPP, we determined the optimal dimensionality of the data based on their respective eigenvalues, computed by the generalized eigen-decomposition method according to (He et al., 2006). For CCA, we optimized its regularization parameter value by cross-validation in the range of {10−4, 10−3, ⋯, 10−2}. For the L21 method, we optimized its sparsity regularization parameter by cross-validating its value in the range of {10−4, 10−3, ⋯, 10−2}. To fuse the three modalities, we concatenated the feature vectors of the multimodality data into a single long vector for the above four comparison methods. We also compared our proposed framework with a deep feature learning method, that is, the SAE (Suk et al., 2015). For this method, we obtained SAE-learned features from each modality independently and then concatenated all the learned features into a single long vector. We set the hyperparameters of the SAE to the values suggested in (Suk et al., 2015), that is, we used a three-layer neural network for the multi-modality data, with layer sizes selected by a grid search from [100; 300; 500; 1,000]-[50; 100]-[10; 20; 30] (bottom to top). As a baseline method, we further included the results for the experiment using just the original features without any feature selection (denoted as "Original"). In addition, we also compared our method with Multiple Kernel Learning (MKL) (Althloothi, Mahoor, Zhang, & Voyles, 2014; De Bie et al., 2007), as MKL is a common multi-modality fusion method. For this method, we first used PCA to reduce the feature dimension of each modality, adopted MKL to fuse the features from different modalities via a linear combination of kernels, and then used a support vector machine (SVM) classifier for classification. For MKL, we optimized the weights of the different kernels by cross-validating their values in the range of (0, 1), with the sum of the weights set to 1. We used the SVM classifier from the LIBSVM toolbox (Chang & Lin, 2011) to perform classification for all the above comparison methods. For each classification task, we used grid search to determine the best parameters for both the feature selection and classification algorithms, based on their performance on the validation set. For instance, the best soft-margin parameter C of the SVM classifier was determined by grid search over {10−4, ⋯, 104}. Also note that, for fair comparison with the other methods, we used only the data at the baseline time-point (i.e., corresponding to "Ours-baseTP") to train our network in this subsection.

In order to verify the effectiveness of our method, we conducted comparison experiments for two multi-class classification tasks (i.e., NC/MCI/AD and NC/sMCI/pMCI/AD) and two binary classification tasks (i.e., NC/AD and NC/MCI). Figures 3 and 4 show the results achieved by the different methods as violin plots, which we chose because they visualize the distribution of the results. In addition, in Figure 5, we show the confusion matrix results achieved by the proposed method for the two multi-class classification tasks. Note that we report the final confusion matrices by averaging the 50 repetitions of the 20-fold cross-validation results. Further, we use a nonparametric Friedman test (Demšar, 2006) to evaluate the performance difference between our method and the other competing methods. The Friedman test is generally used to test the difference between two groups of variables that correspond to the same set of objects. Table 3 shows the Friedman test results (in terms of p-values) comparing the predictions of our method and each competing method. Note that a smaller p-value indicates a bigger prediction difference between our method and the comparison method.

From Figures 3–5 and Table 3, we have the following observations.

1. From Figures 3 and 4, it can be seen that our proposed AD diagnosis framework outperforms all the comparison methods in terms of classification accuracy.
2. From Table 3, it can be seen that most p-values are less than .00001, which indicates a statistically significant improvement of our proposed method over each method under comparison. This is consistent with the accuracy comparison results shown in the violin plots in Figures 3 and 4.
3. From Figure 5, we can see the percentages of both the correct and wrong classifications in each cohort. As MCI is considered the intermediate stage between AD and NC, it is much more difficult to differentiate MCI subjects from AD and NC subjects, which is reflected in the relatively higher misclassification rate in the MCI cohort.

FIGURE 3 Violin plots of the distributions of classification accuracy for the two multi-class classification tasks, that is, NC/MCI/AD (left) and NC/sMCI/pMCI/AD (right), where the hollow white dot and the box denote the median and the interquartile range of the classification results over the 50 repetitions, respectively. From the violin plots, it can be clearly seen that our proposed method outperforms the other comparison methods [Color figure can be viewed at wileyonlinelibrary.com]
FIGURE 4 Violin plots of the distributions of classification accuracy for the two binary classification tasks, that is, NC/AD (left) and NC/MCI (right), where the hollow white dot and the box denote the median and the interquartile range of the classification results over the 50 repetitions, respectively. From the violin plots, it can be clearly seen that our proposed method outperforms the other comparison methods [Color figure can be viewed at wileyonlinelibrary.com]

5.4 | Effects of different components of the proposed framework

We have two settings for our proposed framework, that is, "Ours-baseTP", which uses only the baseline time-point data, and "Ours-multiTP", which exploits the longitudinal data scanned at multiple time-points. As "Ours-multiTP" uses more data than "Ours-baseTP" when training the network, we expect "Ours-multiTP" to yield a better-generalized network and perform better than "Ours-baseTP". In addition, the good performance of our proposed framework could be due to the stage-wise feature learning strategy, which uses the maximum number of available samples for training. To verify this, we also compare our proposed methods with a degraded deep learning method that does not use the stage-wise training strategy, that is, "Ours-complete", which uses only the baseline samples with complete three-modality data for training, with the deep learning architecture shown in Figure 6. Figure 7 shows the comparison results for the four classification tasks. The bars labeled "Ours-complete" and "Ours-baseTP" use only the baseline time-point data, while "Ours-multiTP" exploits the longitudinal data. It can be seen from Figure 7 that the proposed methods (i.e., "Ours-baseTP" and "Ours-multiTP") outperform "Ours-complete", implying the effectiveness of our stage-wise feature learning and fusion strategy, which makes the best use of all the samples in the training set, regardless of their modality completeness. Besides, we use a nonparametric Friedman test to evaluate the performance difference between "Ours-multiTP" and the other competing methods (i.e., "Ours-complete" and "Ours-baseTP"), as shown in Figure 7. The Friedman test results indicate that "Ours-multiTP" is significantly better than "Ours-complete" and "Ours-baseTP", as demonstrated by very small p-values. Moreover, we also use a nonparametric Friedman test to evaluate the performance difference between "Ours-baseTP" and "Ours-complete"; the results verify that "Ours-baseTP" performs better than "Ours-complete". In summary, our experimental results indicate that the performance of the deep neural network can be improved by using more data samples.

FIGURE 5 Confusion matrices achieved by the proposed method on the two multi-class classification tasks: (left) NC/MCI/AD and (right)
NC/sMCI/pMCI/AD [Color figure can be viewed at wileyonlinelibrary.com]

TABLE 3 p-values of the Friedman test results between our method and other competing methods

                   ORI       PCA       LPP       CCA       SAE       L21       MKL
NC/MCI/AD          <.00001   <.00001   <.00001   <.00001   <.00001   <.00001   <.00001
NC/sMCI/pMCI/AD    <.00001   <.00001   <.00001   <.00001   <.00001   <.00001   <.0001
NC/AD              <.00001   <.00001   <.00001   <.00001   <.00001   <.00001   <.00001
NC/MCI             <.00001   <.00001   <.00001   <.00001   <.00001   <.0001    <.00001
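For reference, the sketch below shows how a Friedman test of this kind can be run with scipy.stats.friedmanchisquare on made-up accuracy values; note that SciPy's implementation requires at least three related samples, so this example compares three methods at once rather than a single pairwise comparison.

```python
# Minimal sketch of a Friedman test over repeated cross-validation
# accuracies (values are synthetic, not our experimental results).
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
ours = 0.64 + 0.02 * rng.standard_normal(50)   # 50 repetitions, hypothetical
pca  = 0.58 + 0.02 * rng.standard_normal(50)
mkl  = 0.60 + 0.02 * rng.standard_normal(50)

stat, p = friedmanchisquare(ours, pca, mkl)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.2g}")
```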

5.5 | Effects of different modality combinations

To further analyze the benefit of fusing neuroimaging and genetic data, Figure 8 illustrates the performance of our proposed framework for different combinations of modalities on the baseline time-point data. From Figure 8, we can see that the performance using only the MRI modality is better than using PET or SNP alone, with SNP showing the lowest performance. This is understandable, as the SNP data are genotype features, which are less directly related to the diagnostic label than the MRI and PET data, which are phenotype features closely related to the diagnostic labels. Nevertheless, when we combine all three modalities, the classification results are better than the results from any single modality or two-modality combination (bimodal). An interesting aspect of the results in Figure 8 is that, for bimodal combinations involving SNP, the classification results are not always better than the results using an individual modality. For example, for the PET+SNP combination, the classification result is better than the results using each individual modality for the four-class (AD/pMCI/sMCI/NC) classification, but not for the three-class (AD/MCI/NC) and two-class (NC/AD and NC/MCI) classifications. For the MRI + SNP combination, its performance on the four-class classification and NC/MCI tasks is better than the results using each individual modality alone. These findings show that the SNP data have a positive effect on the four-class classification task when combined with MRI, PET, or both. The effect of SNP data in bimodal networks on the other classification tasks is not consistent, and, in some cases, the inclusion of SNP data degrades the classification performance. This could be caused by the network structure that we used, which is probably not optimal for bimodal networks involving SNP. A possible remedy would be to reduce the number of neurons in the output layer of the SNP submodel, so that the less discriminative SNP features contribute less to the bimodal network, circumventing the negative effect of the SNP data. Nevertheless, it is worth noting that when all three modalities are used, the performance of our model is better than that of any single-modality or bi-modality model. Besides, we use a nonparametric Friedman test to evaluate the performance difference between our method with three modalities and the methods with a single modality or any two-modality combination, as shown in Figure 8; the results indicate a statistically significant improvement of our proposed method combining the three modalities (i.e., MRI + PET + SNP) over the methods using a single modality or any two-modality combination.

5.6 | Normal aging effects

Some studies (Dukart et al., 2011; Franke, Ziegler et al., 2010; Moradi, Pepe et al., 2015) have discovered confounding effects between normal aging and AD, that is, there are overlaps between the brain atrophies caused by normal aging and those caused by AD. To evaluate the impact of removing age-related effects from the features derived from the MRI and PET data, we followed a strategy described in previous studies (Dukart et al., 2011; Moradi, Pepe et al., 2015). More specifically, we first estimated the relationship between the volumetric features and age for the subjects in the NC cohort by learning multiple linear regression models, in which age is used to predict the brain volumetric features, with one regression model learned for each feature. Then, we removed the age-related effects by subtracting the predictions of the linear regression models from the original MRI and PET features, with the details described in Appendix B of Moradi, Pepe et al. (2015). For convenience, we denote our method with the aging effects removed as "Ours-AgeEffectRemoved". Figure 9 shows the comparison between our methods using neuroimaging data with and without removal of the normal aging effects. From Figure 9, it can be observed that removing the age-related effects from the MRI and PET data can indeed improve the classification performance. This is because, by removing the age-related effects, we can focus more on the AD-related atrophies for classification. Besides, we also use a nonparametric Friedman test to evaluate the performance difference between "Ours-AgeEffectRemoved" and "Ours-baseTP"; the Friedman test result indicates that "Ours-AgeEffectRemoved" is significantly better than "Ours-baseTP".
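A minimal sketch of this per-feature residualization is given below, assuming scikit-learn's LinearRegression and synthetic data in place of the real ROI features and NC cohort.

```python
# Minimal sketch of per-feature age-effect removal (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
age = rng.uniform(60, 90, size=200)                      # subject ages
feat = 5.0 - 0.02 * age[:, None] + 0.1 * rng.standard_normal((200, 93))
is_nc = rng.random(200) < 0.3                            # hypothetical NC mask

# One linear model per feature, fit on the NC cohort only (age -> feature);
# LinearRegression handles all 93 features at once as a multi-output fit.
reg = LinearRegression().fit(age[is_nc].reshape(-1, 1), feat[is_nc])
feat_corrected = feat - reg.predict(age.reshape(-1, 1))  # residualized features
```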
FIGURE 6 The flow of directly fusing three complete modalities by using high-level features. Specifically, we learn latent representations (i.e., high-level features) for each modality independently in stage 1, and then learn diagnostic labels by fusing the learned latent feature representations from stage 2 [Color figure can be viewed at wileyonlinelibrary.com]

FIGURE 7 Comparison of classification accuracy for the four classification tasks by using different methods, where "ours-complete" uses baseline time-point data with complete modalities, "ours-baseTP" uses the baseline time-point data, while "ours-multiTP" exploits the longitudinal data by using the data scanned at multiple time-points (* denotes the Friedman test with p < .001) [Color figure can be viewed at wileyonlinelibrary.com]
FIGURE 8 Comparison of classification accuracy of the proposed framework by using different modalities (i.e., MRI, PET, and SNP) and modality combinations (i.e., MRI + PET, MRI + SNP, PET + SNP, MRI + PET + SNP) for four different classification tasks (where * denotes the Friedman test with p < .0001 and # denotes the Friedman test with p < .00001) [Color figure can be viewed at wileyonlinelibrary.com]
These discriminative features are important as they can become potential biomarkers for clinical diagnosis. For our proposed deep learning framework, although we did not use the weight matrix W to select discriminative features directly, we can rank the features based on the norms of the rows of the weight matrix W. Specifically, for the j-th fold, we have the weight matrix Wj in the first layer of the neural network. Each row in Wj corresponds to one ROI (for MRI and PET data) or one SNP. Then, we define the top 10 ROIs (or SNPs) as the ROIs (or SNPs) that correspond to the 10 largest summations of absolute values along the rows of Wj. Thus, for each fold, we select the top 10 ROIs (or SNPs) based on the magnitudes of their weights. For the 20 folds, we obtain 20 different sets of top 10 ROIs (or SNPs). We then define the final top ROIs (or SNPs) as the ROIs (or SNPs) with the highest selection frequency.
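A compact sketch of this ranking-and-voting procedure is shown below; it assumes a list W_folds holding the first-layer weight matrix from each fold (rows indexed by ROI or SNP), which is an illustrative name rather than a variable from our implementation.

```python
import numpy as np
from collections import Counter

def top_features_across_folds(W_folds, k=10):
    """Rank ROIs/SNPs by first-layer weight magnitude, then vote across folds.

    For each fold, score feature i by the sum of absolute weights in row i
    of that fold's first-layer weight matrix, keep the k highest-scoring
    features, and finally report the features selected most frequently.
    """
    votes = Counter()
    for W in W_folds:                       # one weight matrix per fold
        scores = np.abs(W).sum(axis=1)      # row-wise magnitude per feature
        top_k = np.argsort(scores)[::-1][:k]
        votes.update(top_k.tolist())
    # Features with the highest selection frequency over all folds.
    return [idx for idx, _ in votes.most_common(k)]
```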
The top 10 ROIs identified from MRI and PET data for the four classification tasks are shown in Figures 10 and 11, respectively. In MRI, the hippocampal, amygdala, uncus, and gyrus regions are identified. In PET, the angular gyri, precuneus, and globus pallidus are the top identified regions. These regions are consistent with previous studies (Convit et al., 2000; Zhang, Shen et al., 2012; Zhu et al., 2016) and can be used as potential biomarkers for AD diagnosis.

The most frequently selected SNP features and their corresponding gene names are summarized in Table 4. The SNPs in APOE have been shown to be related to neuroimaging measures in brain disorders (An et al., 2017; Chiappelli et al., 2006). Besides, some SNPs are found to be from the DAPK1 and SORCS1 genes, which are well-known top candidate genes associated with hippocampal volume changes and AD progression. In addition, most selected SNPs are from the PICLAM, ORL1, KCNMA1, and CTNNA3 genes (An et al., 2017; Peng et al., 2016), which have been shown to be AD-related in previous studies. These findings indicate that our method is able to identify the most relevant SNPs for AD status prediction.
6 | DISCUSSION

6.1 | Comparison with previous studies
Different from conventional multi-modality fusion methods, we proposed a novel stage-wise deep feature learning and fusion framework. In this stage-wise strategy, each stage of the network learns feature representations for an individual modality or for different combinations of modalities, by using the maximum number of available samples. The main advantage is that we can use more of the available samples to train our model, thereby improving prediction performance. Further, our proposed method can automatically learn representations from multi-modality data and obtain diagnostic results in an end-to-end manner, while the traditional methods in the literature mostly employ feature selection, feature fusion, and classification in multiple separate steps (Peng et al., 2016; Zhang, Shen et al., 2012; Zhu et al., 2016).
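To illustrate the stage-wise idea, a simplified PyTorch-style training schedule is sketched below. This is a sketch under stated assumptions, not our exact implementation: the layer sizes, optimizer, class count, and data-loader names are all illustrative, and each loader is assumed to yield only the subjects (and, for the later stages, the concatenated features from earlier stages) that its stage requires.

```python
import torch.nn as nn
import torch.optim as optim

def mlp(in_dim, hidden_dim, out_dim):
    """A small fully connected submodel, standing in for one stage's network."""
    return nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                         nn.Linear(hidden_dim, out_dim))

def train_stage(model, loader, epochs=50, lr=1e-3):
    """Train one stage using every subject that has the modalities it needs."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for features, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()
    return model

def stagewise_pipeline(loaders, n_classes=4):
    """Illustrative three-stage schedule; `loaders` is an assumed dict of
    DataLoaders keyed by the modality subset each stage trains on."""
    # Stage 1: one network per modality, each trained on all subjects
    # having that modality (the largest possible sample per network).
    mri_net = train_stage(mlp(93, 64, n_classes), loaders["mri"])    # 93 ROI features (illustrative)
    pet_net = train_stage(mlp(93, 64, n_classes), loaders["pet"])
    snp_net = train_stage(mlp(2000, 64, n_classes), loaders["snp"])  # SNP dimension is illustrative
    # Stage 2: a pairwise fusion network fed with concatenated stage-1
    # hidden features (2 x 64), trained on subjects having both modalities.
    pair_net = train_stage(mlp(128, 64, n_classes), loaders["mri_pet"])
    # Stage 3: fuse the stage-2 joint features to predict the diagnostic
    # label, using only subjects with all three modalities.
    return train_stage(mlp(192, 64, n_classes), loaders["all"])
```

The key design point this sketch captures is that every stage is fitted on the largest subject set it can legally use, so incomplete-modality subjects still contribute to the earlier stages.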
FIGURE 9 Impact of removing aging-related effects in the two multi-class classification tasks, where "ours-AgeEffectRemoved" denotes our method using MRI and PET features after removing age-related effects (* denotes the Friedman test with p < .0001) [Color figure can be viewed at wileyonlinelibrary.com]

In addition, Bron, Smits et al. (2015) reported results of a challenge, where different algorithms are evaluated using the same set of features derived from MRI data for AD diagnosis. The features used include regional volume, cortical thickness, shape, and signal intensity values.
FIGURE 10 Top 10 selected ROIs from MRI data for the four different classification tasks: (a) NC/MCI/AD, (b) NC/sMCI/pMCI/AD, (c) NC/AD,
and (d) NC/MCI [Color figure can be viewed at wileyonlinelibrary.com]
The best performing algorithm yielded a classification accuracy of 63% for the three-class (i.e., NC vs. MCI vs. AD) classification. However, the results reported in Bron et al.'s paper are not directly comparable with the results reported in our paper. First, the two studies used different sets of data. Bron et al.'s study used 384 subjects from three medical centers (i.e., VU University Medical Center, the Netherlands; Erasmus MC, the Netherlands; and University of Porto, Portugal), whereas our study used 805 subjects from the ADNI dataset (collected from over 50 imaging centers). Furthermore, in terms of the number of MCI subjects, our study has a higher percentage of subjects coming from the MCI cohort than Bron et al.'s study, that is, 48.3% versus 34.1%. As MCI can be considered the intermediate state between NC and AD, this cohort is much more challenging to discriminate from the other two cohorts.
FIGURE 11 Top 10 selected ROIs from PET data for the four different classification tasks: (a) NC/MCI/AD, (b) NC/sMCI/pMCI/AD, (c) NC/AD,
and (d) NC/MCI [Color figure can be viewed at wileyonlinelibrary.com]
TABLE 4 Most related SNPs for AD diagnosis

Gene name    SNP name
APOE         rs429358
DAPK1        rs822097
SORCS1       rs11814145
PICLAM       rs11234495, rs7938033
ORL1         rs7945931
KCNMA1       rs1248571
CTNNA3       rs10997232

As our ADNI dataset is unbalanced (in terms of disease cohorts), contains a higher percentage of the hard-to-discriminate intermediate cohort, is incomplete (in terms of the completeness of modalities), and is heterogeneous (e.g., due to over 50 image collection centers), our problem is much more challenging than the problem presented in Bron, Smits et al. (2015). Nevertheless, our method is still able to achieve 64.4% accuracy for the three-class classification task, comparable to the best result reported in Bron et al.'s study (Bron, Smits et al., 2015).

Second, the types and number of features used by the two works are different. In Bron et al.'s paper (Bron, Smits et al., 2015), when using only regional volume features, the reported accuracies are 49.7% for the Dolph method and 47.7% for the Ledig-VOL method, which are lower than the 61% accuracy achieved by our method using only ROI features from MRI data. Third, the focus of the two studies is different. While Bron et al.'s study focuses on finding the best classification algorithm using a combination of multiple-view features from MRI data, we focus on improving the classification performance by using a combination of neuroimaging and genetic data. We aim at overcoming the heterogeneity and incomplete data issues of these multi-modality data by proposing a novel deep learning based multi-modality fusion framework. The bottom line is that the experimental results have validated the efficacy of our proposed framework, as it outperforms the baseline method as well as other state-of-the-art methods. For instance, we also achieve an accuracy of 64.4% for three-class classification using the proposed method, which is about an 18% improvement over our baseline method that uses the original features.

6.2 | Clinical interest

AD is a progressive, irreversible neurodegenerative disease. Before its onset, there is a prodromal stage called MCI. Some MCI subjects (i.e., pMCI) will progress to AD within a few years, while the others (i.e., sMCI) are relatively stable and do not progress to AD within the same period. As AD is currently irreversible and incurable, the detection of its earlier stages, or multi-class classification of different AD stages, is actually of much more clinical interest. Thus, unlike many previous studies in the literature that focused on binary classification tasks (Zhang, Zhang, Chen, Lee, & Shen, 2017; Zhu, Suk et al., 2017), we focus our classification results on four different tasks: (a) NC versus MCI versus AD, (b) NC versus sMCI versus pMCI versus AD, (c) NC versus MCI, and (d) NC versus AD. The first two are multi-class classification tasks, which are much more challenging but of more clinical interest, while the latter two are the conventional binary classification tasks, added for easier comparison with the results from previous studies. For the four-class classification task (i.e., NC vs. sMCI vs. pMCI vs. AD), our method achieves about 54% accuracy, outperforming other state-of-the-art methods. This performance seems lower than that of the other tasks, but this is due to the increased complexity of the problem, rather than the failure of the algorithm. The bottom line is that we have shown how to make use of all available data for training a robust deep learning model for multi-status AD diagnosis. Nevertheless, more work needs to be done to improve the performance of this classification task for practical clinical usage, where the performance of this work could be used as a benchmark, and the strategy used in this work could be used as the foundation.

6.3 | Future work

Although our proposed prediction method achieves promising results in the four classification tasks, there are several improvements that can be considered for future work. First, our method focuses on using ROI features as input to the deep learning model; however, such hand-crafted features may limit the richness of the structural and functional brain information from MRI and PET images, respectively. To fully unleash the power of the deep learning model in learning imaging features that are useful for our classification tasks, we may have to use the original imaging data, and utilize convolutional or other more advanced deep neural networks in our framework. Second, as discussed in Section 5.6 about the effects of age, we can incorporate other confounding factors (e.g., gender, education level, etc.) into the proposed framework to possibly improve the performance.

7 | CONCLUSION

In this article, we focus on how to best use multimodality neuroimaging and genetic data for AD diagnosis. Specifically, we present a novel three-stage deep feature learning and fusion framework for AD diagnosis, which integrates multimodality imaging and genetic data gradually in each stage. Our framework alleviates the heterogeneity issue of multimodality data by learning the latent representations of the different modalities using separate DNN models guided by the same target. As the latent representations of all the modalities (i.e., outputs of the last hidden layer) are semantically closer to the target labels, the data heterogeneity issue is partially addressed. In addition, our framework also partially addresses the incomplete multimodality data issue by devising a stage-wise deep learning strategy. This stage-wise learning strategy allows samples with incomplete multimodality data to be used during training, thus also allowing the maximum number of available samples to be used to train each stage of the proposed network. Moreover, we exploit longitudinal data for each training subject to significantly increase the number of training samples. As our proposed deep learning framework can use more data in training, we achieved better classification performance compared with other methods. All these experimental results (using the ADNI database) have clearly demonstrated the effectiveness of the proposed framework, and the superiority of using multimodality data (over the case of using single modality data) in AD diagnosis.
ACKNOWLEDGMENTS

This research is partly supported by National Institutes of Health grants EB022880, AG053867, EB006733, EB008374, AG041721, and G049371.

ORCID

Tao Zhou https://orcid.org/0000-0002-3733-7286
RE FE R ENC E S Escudero, J., Ifeachor, E., Zajicek, J. P., Green, C., Shearer, J., Pearson, S., &
Althloothi, S., Mahoor, M. H., Zhang, X., & Voyles, R. M. (2014). Human Alzheimer's Disease Neuroimaging Initiative. (2013). Machine learning-
activity recognition using multi-features and multiple kernel learning. based method for personalized and cost-effective detection of Alzheimer's
Pattern Recognition, 47(5), 1800–1812. disease. IEEE Transactions on Biomedical Engineering, 60(1), 164–168.
An, L., Adeli, E., Liu, M., Zhang, J., Lee, S. W., & Shen, D. (2017). A hierar- Fakoor, R., Ladhak, F., Nazi, A., & Huber, M. (2013). Using deep learning
chical feature and sample selection framework and its application for to enhance cancer diagnosis and classification. Proceedings of the Inter-
Alzheimer's disease diagnosis. Scientific Reports, 7, 45269. national Conference on Machine Learning, 28.
Association, A. S. (2016). 2016 Alzheimer's disease facts and figures. Alz- Farabet, C., Couprie, C., Najman, L., & LeCun, Y. (2013). Learning hierarchi-
heimer's & Dementia, 12(4), 459–509. cal features for scene labeling. IEEE Transactions on Pattern Analysis
Barshan, E., & Fieguth, P. (2015). Stage-wise training: An improved feature and Machine Intelligence, 35(8), 1915–1929.
learning strategy for deep models. Feature extraction: Modern ques- Fox, N. C., Warrington, E. K., Freeborough, P. A., Hartikainen, P., Kennedy,
tions and challenges. Proceedings of Machine Learning Research, 44, A. M., Stevens, J. M., & Rossor, M. N. (1996). Presymptomatic hippo-
49–59. campal atrophy in Alzheimer's disease: A longitudinal MRI study. Brain,
Biffi, A., Anderson, C. D., Desikan, R. S., Sabuncu, M., Cortellini, L., 119(6), 2001–2007.
Schmansky, N., … Alzheimer's Disease Neuroimaging Initiative (ADNI) Franke, K., Ziegler, G., Klöppel, S., Gaser, C., & Alzheimer's Disease Neuro-
(2010). Genetic variation and neuroimaging measures in Alzheimer dis- imaging Initiative. (2010). Estimating the age of healthy subjects from
ease. Archives of Neurology, 67(6), 677–685. T 1-weighted MRI scans using kernel methods: Exploring the influence
Bron, E. E., Smits, M., van der Flier, W., Vrenken, H., Barkhof, F., of various parameters. NeuroImage, 50(3), 883–892.
Scheltens, P., … Alzheimer's Disease Neuroimaging Initiative. (2015). Hardoon, D. R., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correla-
Standardized evaluation of algorithms for computer-aided diagnosis of tion analysis: An overview with application to learning methods. Neural
dementia based on structural MRI: The CADDementia challenge. Neu- Computation, 16(12), 2639–2664.
roImage, 111, 562–579. Haufe, S., Meinecke, F., Görgen, K., Dähne, S., Haynes, J. D., Blankertz, B., &
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector Bießmann, F. (2014). On the interpretation of weight vectors of linear
models in multivariate neuroimaging. NeuroImage, 87, 96–110.
machines. ACM Transactions on Intelligent Systems and Technology
He, X., Cai, D., & Niyogi, P. (2006). Laplacian score for feature selection.
(TIST), 2(3), 27.
Advances in Neural Information Processing Systems, 507–514.
Chen, X., Zhang, H., Gao, Y., Wee, C. Y., Li, G., Shen, D., & the Alzheimer's
Jack, C. R., Bernstein, M. A., Fox, N. C., Thompson, P., Alexander, G.,
Disease Neuroimaging Initiative. (2016). High-order resting-state func-
Harvey, D., … ADNI Study. (2008). The Alzheimer's disease neuroimag-
tional connectivity network for MCI classification. Human Brain Map-
ing initiative (ADNI): MRI methods. Journal of Magnetic Resonance
ping, 37(9), 3282–3296.
Imaging, 27(4), 685–691.
Chen, X., Zhang, H., Zhang, L., Shen, C., Lee, S. W., & Shen, D. (2017).
Kabani, N. J. (1998). 3D anatomical atlas of the human brain. NeuroImage,
Extraction of dynamic functional connectivity from brain grey matter
7, P-0717.
and white matter for MCI classification. Human Brain Mapping, 38(10),
Kohannim, O., Hua, X., Hibar, D. P., Lee, S., Chou, Y. Y., Toga, A. W., … Alz-
5019–5034.
heimer's Disease Neuroimaging Initiative. (2010). Boosting power for
Chetelat, G., Desgranges, B., de la Sayette, V., Viader, F., Eustache, F., &
clinical trials using classifiers based on multiple biomarkers. Neurobiol-
Baron, J. C. (2003). Mild cognitive impairment can FDG-PET predict
ogy of Aging, 31(8), 1429–1442.
who is to rapidly convert to Alzheimer's disease? Neurology, 60(8),
Koikkalainen, J., Rhodius-Meester, H., Tolonen, A., Barkhof, F., Tijms, B.,
1374–1377.
Lemstra, A. W., … Lötjönen, J. (2016). Differential diagnosis of neuro-
Chiappelli, M., Borroni, B., Archetti, S., Calabrese, E., Corsi, M. M.,
degenerative diseases using structural MRI data. NeuroImage: Clinical,
Franceschi, M., … Licastro, F. (2006). VEGF gene and phenotype rela- 11, 435–449.
tion with Alzheimer's disease and mild cognitive impairment. Rejuvena- Lin, D., Cao, H., Calhoun, V. D., & Wang, Y. P. (2014). Sparse models for
tion Research, 9(4), 485–493. correlative and integrative analysis of imaging and genetic data. Journal
Chu, A. Y., Deng, X., Fisher, V. A., Drong, A., Zhang, Y., Feitosa, M. F., … of Neuroscience Methods, 237, 69–78.
Fox, C. S. (2017). Multiethnic genome-wide meta-analysis of ectopic Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian,
fat depots identifies loci associated with adipocyte development and M., … Sánchez, C. I. (2017). A survey on deep learning in medical image
differentiation. Nature Genetics, 49(1), 125–130. analysis. Medical Image Analysis, 42, 60–88.
Convit, A., de Asis, J., de Leon, M. J., Tarshish, C. Y., de Santi, S., & Liu, S., Liu, S., Cai, W., Che, H., Pujol, S., Kikinis, R., … ADNI. (2015). Multimodal
Rusinek, H. (2000). Atrophy of the medial occipitotemporal, inferior, neuroimaging feature learning for multiclass diagnosis of Alzheimer's dis-
and middle temporal gyri in non-demented elderly predict decline to ease. IEEE Transactions on Biomedical Engineering, 62(4), 1132–1140.
Alzheimer's disease. Neurobiology of Aging, 21(1), 19–26. Moradi, E., Pepe, A., Gaser, C., Huttunen, H., Tohka, J., & Alzheimer's Dis-
Cuingnet, R., Gerardin, E., Tessieras, J., Auzias, G., Lehéricy, S., ease Neuroimaging Initiative. (2015). Machine learning framework for
Habert, M. O., … Alzheimer's Disease Neuroimaging Initiative. (2011). early MRI-based Alzheimer's conversion prediction in MCI subjects.
Automatic classification of patients with Alzheimer's disease from NeuroImage, 104, 398–412.
structural MRI: A comparison of ten methods using the ADNI data- Mosconi, L., Tsui, W. H., Herholz, K., Pupi, A., Drzezga, A., Lucignani, G., …
base. NeuroImage, 56(2), 766–781. de Leon, M. J. (2008). Multicenter standardized 18F-FDG PET diagno-
Dai, Z., Yan, C., Wang, Z., Wang, J., Xia, M., Li, K., & He, Y. (2012). Discrimi- sis of mild cognitive impairment, Alzheimer's disease, and other
native analysis of early Alzheimer's disease using multi-modal imaging dementias. Journal of Nuclear Medicine, 49(3), 390–398.
and multi-level characterization with multi-classifier (M3). NeuroImage, Mullins, R. J., Mustapic, M., Goetzl, E. J., & Kapogiannis, D. (2017). Exoso-
59(3), 2187–2195. mal biomarkers of brain insulin resistance associated with regional
ZHOU ET AL. 1015

atrophy in Alzheimer's disease. Human Brain Mapping, 38(4), Suk, H.-I., et al. (2015). Latent feature representation with stacked
1933–1940. auto-encoder for AD/MCI diagnosis. Brain Structure and Function,
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multi- 220(2), 841–859.
modal deep learning. Proceedings of the 28th international conference Thung, K.-H., Wee, C. Y., Yap, P. T., Shen, D., & Alzheimer's Disease Neu-
on machine learning (ICML-11). roimaging Initiative. (2014). Neurodegenerative disease diagnosis using
Nie, F., Huang, H., Cai, X., & Ding, C. H. (2010). Efficient and robust feature incomplete multi-modality data via matrix shrinkage and completion.
selection via joint ℓ2, 1-norms minimization. Advances in Neural Infor- NeuroImage, 91, 386–400.
mation Processing Systems, 1813–1821. Thung, K. H., Wee, C. Y., Yap, P. T., & Shen, D. (2016). Identification of
Nordberg, A., Rinne, J. O., Kadir, A., & Långström, B. (2010). The use of progressive mild cognitive impairment patients using incomplete longi-
PET in Alzheimer disease. Nature Reviews Neurology, 6(2), 78–87. tudinal MRI scans. Brain Structure and Function, 221(8), 3979–3995.
Peng, J., An, L., Zhu, X., Jin, Y., & Shen, D. (2016). Structured sparse kernel Thung, K.-H., Yap, P. T., Adeli, E., Lee, S. W., Shen, D., & Alzheimer's Disease
learning for imaging genetics based Alzheimer's disease diagnosis. In Neuroimaging Initiative. (2018). Conversion and time-to-conversion pre-
International Conference on Medical Image Computing and dictions of mild cognitive impairment using low-rank affinity pursuit
Computer-Assisted Intervention. Springer. denoising and matrix completion. Medical Image Analysis, 45, 68–82.
Perrin, R. J., Fagan, A. M., & Holtzman, D. M. (2009). Multimodal tech- Thung, K. H., Yap, P. T., & Shen, D. (2017). Multi-stage diagnosis of Alzhei-
niques for diagnosis and prognosis of Alzheimer's disease. Nature, mer's disease with incomplete multimodal data via multi-task deep
461(7266), 916–922. learning. In Deep learning in medical image analysis and multimodal learn-
Plis, S. M., Hjelm, D. R., Salakhutdinov, R., Allen, E. A., Bockholt, H. J., ing for clinical decision support (pp. 160–168). Quebec City, Canada:
Long, J. D., … Calhoun, V. D. (2014). Deep learning for neuroimaging: A Springer.
validation study. Frontiers in Neuroscience, 8, 229. Wan, J., Zhang, Z., Yan, J., Li, T., Rao, B. D., Fang, S., … Shen, L. (2012).
Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Sparse Bayesian multi-task learning for predicting cognitive outcomes
Shadick, N. A., & Reich, D. (2006). Principal components analysis cor- from neuroimaging measures in Alzheimer's disease. Computer Vision and
rects for stratification in genome-wide association studies. Nature Pattern Recognition (CVPR), 2012 I.E. Conference on,16-21 June
Genetics, 38(8), 904–909. 2012, Providence, RI, USA: IEEE.
Raamana, P. R., Rosen, H., Miller, B., Weiner, M. W., Wang, L., & Beg, M. F. Wang, H., Nie, F., Huang, H., Risacher, S. L., Saykin, A. J., Shen, L., & For
(2014). Three-class differential diagnosis among Alzheimer disease, the Alzheimer's Disease Neuroimaging Initiative. (2012). Identifying
frontotemporal dementia, and controls. Frontiers in Neurology, 5, 71. disease sensitive and quantitative trait-relevant biomarkers from multi-
Raamana, P. R., Weiner, M. W., Wang, L., Beg, M. F., & Alzheimer's Dis- dimensional heterogeneous imaging genetics data via sparse multi-
ease Neuroimaging Initiative. (2015). Thickness network features modal multitask learning. Bioinformatics, 28(12), i127–i136.
for prognostic applications in dementia. Neurobiology of Aging, 36, Wang, Y., Nie, J., Yap, P. T., Li, G., Shi, F., Geng, X., … for the Alzheimer's
S91–S102. Disease Neuroimaging Initiative. (2014). Knowledge-guided robust
Rasmussen, P. M., Hansen, L. K., Madsen, K. H., Churchill, N. W., & MRI brain extraction for diverse large-scale neuroimaging studies on
Strother, S. C. (2012). Model sparsity and brain pattern interpretation humans and non-human primates. PLoS One, 9(1), e77810.
of classification models in neuroimaging. Pattern Recognition, 45(6), Wee, C. Y., Yap, P. T., Denny, K., Browndyke, J. N., Potter, G. G., Welsh-Bohmer,
2085–2100. K. A., … Shen, D. (2012). Resting-state multi-spectrum functional connectivity
Rombouts, S. A., Barkhof, F., Goekoop, R., Stam, C. J., & Scheltens, P. networks for identification of MCI patients. PloS one, 7(5), e37828.
(2005). Altered resting state networks in mild cognitive impairment Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis.
and mild Alzheimer's disease: An fMRI study. Human Brain Mapping, Chemometrics and Intelligent Laboratory Systems, 2(1–3), 37–52.
26(4), 231–239. Yu Zhang, e. a. (2018). Strength and similarity guided group-level brain
Saykin, A. J., Shen, L., Foroud, T. M., Potkin, S. G., Swaminathan, S., Kim, S., … functional network construction for MCI diagnosis. Pattern Recognition.
Alzheimer's Disease Neuroimaging Initiative. (2010). Alzheimer's disease Yuan, L., Wang, Y., Thompson, P. M., Narayan, V. A., Ye, J., & Alzheimer's
neuroimaging initiative biomarkers as quantitative phenotypes: Genetics Disease Neuroimaging Initiative. (2012). Multi-source feature learning
core aims, progress, and plans. Alzheimer's & Dementia, 6(3), 265–273. for joint analysis of incomplete multiple heterogeneous neuroimaging
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. data. NeuroImage, 61(3), 622–632.
Neural Networks, 61, 85–117. Zhang, D., Shen, D., & Alzheimer's Disease Neuroimaging Initiative. (2012).
Shen, D., & Davatzikos, C. (2002). HAMMER: Hierarchical attribute match- Multi-modal multi-task learning for joint prediction of multiple regres-
ing mechanism for elastic registration. IEEE Transactions on Medical sion and classification variables in Alzheimer's disease. NeuroImage,
Imaging, 21(11), 1421–1439. 59(2), 895–907.
Shen, L., Kim, S., Risacher, S. L., Nho, K., Swaminathan, S., West, J. D., … Zhang, Y., Brady, M., & Smith, S. (2001). Segmentation of brain MR images
Alzheimer's Disease Neuroimaging Initiative. (2010). Whole genome through a hidden Markov random field model and the
association study of brain-wide imaging phenotypes for identifying expectation-maximization algorithm. IEEE Transactions on Medical Imag-
quantitative trait loci in MCI and AD: A study of the ADNI cohort. Neu- ing, 20(1), 45–57.
roImage, 53(3), 1051–1063. Zhang, Y., Zhang, H., Chen, X., Lee, S. W., & Shen, D. (2017). Hybrid
Shen, L., Thompson, P. M., Potkin, S. G., Bertram, L., Farrer, L. A., Foroud, high-order functional connectivity networks using resting-state functional
T. M., … Kauwe, J. S. (2014). Genetic analysis of quantitative pheno- MRI for mild cognitive impairment diagnosis. Scientific Reports, 7(1), 6530.
types in AD and MCI: Imaging, cognition and biomarkers. Brain Imaging Zhang, C., Adeli, E., Zhou, T., Chen, X., & Shen, D. (2018). Multi-Layer
and Behavior, 8(2), 183–207. Multi-View Classification for Alzheimer's Disease Diagnosis.
Sled, J. G., Zijdenbos, A. P., & Evans, A. C. (1998). A nonparametric method Zheng, X., Shi, J., Li, Y., Liu, X., & Zhang, Q. (2016). Multi-modality stacked deep
for automatic correction of intensity nonuniformity in MRI data. IEEE polynomial network based feature learning for Alzheimer's disease diagnosis.
Transactions on Medical Imaging, 17(1), 87–97. Biomedical Imaging (ISBI), 2016 I.E. 13th International Symposium on,
Sørensen, L., Igel, C., Pai, A., Balas, I., Anker, C., Lillholm, M., … Alzheimer’s IEEE.
Disease Neuroimaging Initiative and the Australian Imaging Biomarkers Zhou, S. K., Greenspan, H., & Shen, D. (2017). Deep learning for medical image
and Lifestyle flagship study of ageing (2017). Differential diagnosis of analysis. Cambridge, Massachusetts: Academic Press.
mild cognitive impairment and Alzheimer's disease using structural Zhou, T., Thung, K. H., Zhu, X., & Shen, D. (2017). Feature learning and
MRI cortical thickness, hippocampal shape, hippocampal texture, and fusion of multimodality neuroimaging and genetic data for multi-status
volumetry. NeuroImage: Clinical, 13, 470–482. dementia diagnosis. In International Workshop on Machine Learning in
Srivastava, N. and R. R. Salakhutdinov (2012). Multimodal learning with deep Medical Imaging (pp. 132–140). Springer, Cham.
boltzmann machines. Advances in neural information processing systems. Zhou, T., Thung, K. H., Liu, M., & Shen, D. (2018). Brain-wide genome-wide
Suk, H. I., Lee, S. W., Shen, D., & Alzheimer's Disease Neuroimaging Initia- association study for Alzheimer's disease via joint projection learning and
tive. (2014). Hierarchical feature representation and multimodal fusion sparse regression model. IEEE Transactions on Biomedical Engineering, in
with deep learning for AD/MCI diagnosis. NeuroImage, 101, 569–582. press.

Zhu, X., Suk, H. I., Lee, S. W., & Shen, D. (2016). Subspace regularized sparse
multitask learning for multiclass neurodegenerative disease identification. How to cite this article: Zhou T, Thung K-H, Zhu X, Shen D.
IEEE Transactions on Biomedical Engineering, 63(3), 607–618.
Zhu, X., Suk, H. I., Wang, L., Lee, S. W., Shen, D., & Alzheimer’s Disease Effective feature learning and fusion of multimodality data
Neuroimaging Initiative. (2017). A novel relational regularization fea- using stage-wise deep neural network for dementia diagnosis.
ture selection method for joint regression and classification in AD diag- Hum Brain Mapp. 2019;40:1001–1016. https://doi.org/10.
nosis. Medical Image Analysis, 38, 205–214.
1002/hbm.24428