Gene Selection and Classification of Microarray Data Using Convolutional Neural Network
Abstract— Gene expression profiles can be generated in large quantities by utilizing microarray techniques. Currently, the task of diagnosing diseases relies on gene expression data. One of the techniques that helps in this task is the use of deep learning algorithms. Such algorithms are effective in the identification and classification of informative genes, which may subsequently be used in predicting the classes of testing samples. In cancer identification, microarray data typically possess a minimal number of samples together with a huge collection of features derived from gene expression data. Lately, applications of deep learning algorithms have been gaining much attention for solving various challenges in the artificial intelligence field. In the present study, we investigated a deep learning algorithm based on the convolutional neural network (CNN) for the classification of microarray data. In comparison to similar techniques, such as multiple Support Vector Machine Recursive Feature Elimination with improved Random Forest (mSVM-RFE-iRF) and varSelRF, CNN did not show superior performance on all the data. However, most of the experimental results on the cancer datasets indicated that CNN is superior in terms of accuracy and of minimizing the number of genes used in classifying cancer, compared with the hybrid mSVM-RFE-iRF.

Keywords: Deep Learning; Convolutional Neural Network (CNN); Microarray Cancer Data; Classification

I. INTRODUCTION

Microarray data are widely used in prognosis, treatment, and disease classification through a variety of gene selection and classification methods [1].

Since cancer diagnosis has seen many applications of microarrays, particularly on gene expression profiles, scholars have also begun exploring data analysis using this technology. This is attributed to its effectiveness in discovering abnormal and normal tissue patterns in a speedier time, as microarrays scale well to large datasets. Microarray analysis is an attractive research avenue, as the technology is typically utilized to investigate datasets with high dimensionality, which demands significant memory and processor requirements [2].

There remains room for improvement in classifying microarray data, as the technology struggles with small sample collections yet a large quantity of features. Selection of suitable features is the key in this field, as numerous research endeavors aim to minimize data dimensionality with improved classification performance [3]. In the case of classifying cancer cells, numerous machine learning algorithms struggle because the number of workable samples is significantly lower than the gene count. This situation affects efficiency and effectiveness owing to the large data dimensionality, which impairs classification performance [4].

Convolutional neural network (CNN) is an instance of a deep learning strategy that mimics brain function in processing information [5]. In this paper, a multilayered CNN, which is a deep learning algorithm, is proposed to classify microarray cancer data in the identification of the type of cancer. CNN is proposed due to its ability to deal with insufficient data and to boost classification performance. In addition, CNN is also powerful in integrating cancer datasets that are strongly linked, which improves performance in classifying data. This is attributed to its ability to detect latent characteristics of cancer from comparable types. The organization of the present paper is as follows. Section 2 elaborates related works and definitions. Section 3 elaborates the methods. Section 4 describes the selected datasets, the proposed architecture, the evaluation techniques, and the benchmark. Section 5 presents the results and discussion. Lastly, the conclusions of this paper are given.

II. RELATED WORKS AND DEFINITIONS

In this section, microarray data, machine learning, and CNN algorithms, along with related works, are reviewed.

A. Microarray data classification

Microarray gene expression data have been utilized in past research to perform cancer type classification using machine learning strategies. Decision tree (DT) was the most primitive machine learning strategy introduced for comparing human proteins to informative genes in proteins containing diseases [6]. Diagnoses of cancer have been largely assisted by exploring gene expression data with the technology available in the microarray technique, which enables genes to be measured simultaneously in large quantities. In assessing significant genes, parametric statistical analysis has typically been employed to establish statistical significance [7].

In the literature, numerous algorithms and mathematical models have been constructed and proposed to interpret and analyze gene expression data. In analyzing gene expression data, the two dominant strategies that have been focused on are clustering and classification [8]. Additionally, numerous techniques have been executed previously in classifying gene expression data, including k-nearest neighbors (k-NN) [9], Support Vector Machines (SVM) [10],
Multilayer perceptron (MLP) [11], and variants of Artificial Neural Networks (ANNs) [12].

The breast cancer and leukemia datasets were used in [13] for performing selection of informative features from gene expression data. The work assessed the accuracy of the proposed selection technique, and the researchers concluded that the k-NN classifier performed better than random forest in terms of classification accuracy [13]. The authors of [14] reported a Random Survival Forest strategy for selecting informative genes by means of eliminating non-informative genes iteratively.

A hybrid of particle swarm optimization and decision tree (PSOC4.5) was proposed for classifying informative genes from cancer datasets. The proposed classification strategy allows non-informative genes to be overlooked, which can successfully lead to cancer identification. The work reported superior accuracy of the proposed classifier [15].

Another work built a hybrid classifier comprising particle swarm optimization (PSO) and an adaptive K-nearest neighbourhood (KNN) technique for selecting informative genes. The proposed work identifies a handful of genes that meet the criteria of classification [16].

B. Deep Learning (DL)

Deep Learning (DL) is concerned with processing information utilizing deep networks and is a part of machine learning approaches. In its earliest appearance in 1943, DL was termed "cybernetics" by McCulloch and Pitts [17]. Researchers have been drawn to DL owing to its capability, as well as its characteristic of mimicking the way the brain processes information prior to making decisions. DL is constructed to process information either via unsupervised or supervised approaches, whereby learning is conducted on multilayered features and representations. Numerous breakthroughs have been reported on DL, relating to improved solutions and solved problems, attributed to highly advanced computation models. Due to its capability of learning multilayered representations, DL is superior in drawing outcomes from complex problems. In this sense, DL is the most advanced approach for capturing and processing abstractions of data in several layers. Such characteristics present DL as a suitable approach to be considered in analyzing and studying gene expression data. The ability to learn multilayered representations makes DL a versatile strategy for producing more accurate results in a much speedier time. Multilayered representation is a component that forms the overall architecture of deep learning [18].

ML and DL differ in terms of performance depending on the quantity of data. In learning datasets with low dimensionality, DL is ineffective, as it requires data with high dimensionality in order for learning to be carried out [19].

C. Deep learning Convolutional Neural Network (CNN)

A type of artificial neural network, the Convolutional Neural Network (CNN) is capable of extracting local features in data. CNN simplifies the network model by sharing weights within a single feature map, which allows the overall number of weights to be reduced. These characteristics have resulted in a widespread utilization of CNN in the pattern recognition field [20, 21].

An early document reading system used a CNN trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading over 10% of all the cheques in the United States. A number of CNN-based optical character recognition and handwriting recognition systems were later deployed by Microsoft [22]. CNN was also experimented with in the early 1990s for object detection in natural images, including faces and hands [23, 24].

CNN performance in prediction was measured against k-NN in the task of classifying materials. Features fed into the algorithms were processed utilizing Local Binary Patterns in different variants. CNN produced more accurate classification (95%) compared to a hybrid of k-NN and a feature extractor (83%) [25].

A deep learning approach comprising multi-task learning and transfer learning was applied in analyzing images of biological components [6, 14]. On the other hand, a deep learning algorithm based on CNN was proposed by [26], with reported results surpassing existing ML strategies. The proposed work won the researchers accolades in a visual recognition challenge.

The earliest uses of CNN concerned classifying images, particularly segmenting and grouping images [27, 28] in the medical domain with superior accuracy. Apart from that, researchers have also implemented CNN in different domains, including facial recognition [29] and the examination of documents.

A CNN has three main components: 1) the input layer, 2) the latent (hidden) layers, and 3) the output layer. The latent layers may be categorized as fully-connected layers, pooling layers, or convolutional layers. Figure 1 shows these layers, adapted from [30]:

Fig. 1. The pipeline of the general CNN architecture [30].

1. The convolutional layer is essentially the primary layer in the CNN architecture. The process of convolution concerns the iterative execution of a specific function toward the output of a different function [31]. This layer consists of numerous maps of neurons, described as feature maps or filters, which are relatively identical in size to the dimensionality of the input data. Neural reactivity is interpreted through quantifying a discrete convolution over the receptive field; the quantification deals with calculating the total weighted input of a neuron and applying an activation function.
2. The max pooling layer concerns producing several grids by splitting the convolutional layer's output. The maximum values of the grids are sequenced in matrices [31]. Operators are utilized in performing a computation on each matrix in order to obtain the average or maximum value.
3. The fully connected layer constitutes almost the complete CNN, comprising 90% of the overall CNN architectural parameters. The layer enables input to be transmitted through the network with pre-set vector lengths [26]. Dimensional data are transformed by the layer prior to classification. The convolutional layer's output also undergoes transformation, which allows information integrity to be retained.
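To make the convolution and max-pooling operations in items 1 and 2 concrete, the following minimal NumPy sketch (illustrative only, not taken from the paper; the array sizes and the single random filter are arbitrary) computes one feature map with a ReLU activation and its 2×2 max-pooled output:

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Discrete 2-D convolution as used in CNNs (cross-correlation, 'valid' padding):
    each output cell is the weighted sum of the receptive field under the kernel."""
    kh, kw = kernel.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(x, size=2):
    """Non-overlapping max pooling: keep the maximum value of each grid cell."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))       # toy 2-D input (e.g. reshaped expression values)
kernel = rng.standard_normal((3, 3))      # one 3x3 filter
feature_map = np.maximum(conv2d_valid(image, kernel), 0)  # ReLU activation
pooled = max_pool2d(feature_map, size=2)  # 6x6 feature map -> 3x3 grid of maxima
print(feature_map.shape, pooled.shape)    # (6, 6) (3, 3)
```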
Fig. 2. Proposed general methodology.

A. Proposed CNN Architecture

The CNN model is configured in this paper upon completion of data collection and preprocessing. A convolutional CNN is selected, comprising convolutional and fully connected layers. The convolutional layer is chosen as a default, as the architecture is capable of dealing with data of high and multiple dimensionalities, such as gene expression data and 2D images. The principles of Krizhevsky et al. [26] were applied in the construction of the CNN architecture. In this paper, a new system has been proposed with 2-dimensional convolutions. The filter size is 64 kernels of size 3×3, and the ReLU non-linearity activation has been used with the convolutional layer. In the fully connected layer, the filter size is 128 kernels of size 2×2. The system has been trained with a testing and training split of 30% and 70% of the data, respectively.
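A minimal Keras sketch of the configuration described above is given below. This is not the authors' code: the input shape, the number of classes, the placement of the max-pooling layer, the interpretation of the 128-kernel 2×2 layer as a convolution feeding the fully connected output, and the random placeholder data are illustrative assumptions; ADADELTA is used as the optimizer, as reported in the experimental setup of Section V.

```python
# Hypothetical sketch of the described CNN (not the authors' code).
# Assumed: gene expression vectors reshaped into a 2-D grid, 5 output classes.
import numpy as np
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

n_samples, grid = 42, (75, 75)                      # e.g. ~5597 genes padded into a 75x75 grid
X = np.random.rand(n_samples, grid[0], grid[1], 1)  # placeholder expression data
y = np.eye(5)[np.random.randint(0, 5, n_samples)]   # placeholder one-hot class labels

# 70% training / 30% testing split, as stated in the paper
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

model = Sequential([
    Conv2D(64, (3, 3), activation='relu',
           input_shape=(grid[0], grid[1], 1)),  # 64 kernels of size 3x3 with ReLU
    MaxPooling2D(pool_size=(2, 2)),             # assumed pooling stage
    Conv2D(128, (2, 2), activation='relu'),     # 128 kernels of size 2x2 feeding the dense part
    Flatten(),
    Dense(y.shape[1], activation='softmax')     # fully connected output layer
])
model.compile(optimizer='adadelta', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=8, validation_data=(X_test, y_test))
```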
B. Evaluation Technique

In assessing the proposed deep learning CNN, ten cancer datasets were tested. These data were used in training the classification. The mean accuracy was obtained by averaging the accuracy scores over the data, which eliminates concerns about redundant tests and optimizes the utilization of the data that have been obtained. In this paper, accuracy is the measure of performance for the proposed CNN; to evaluate the performance, the accuracy of the result is calculated according to [6].
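The accuracy equation itself is not reproduced in the extracted text; the standard definition of classification accuracy, which the description above appears to rely on, can be written as:

\[
\mathrm{Accuracy} = \frac{\text{number of correctly classified test samples}}{\text{total number of test samples}} \times 100\%
\]

For the binary (two-class) datasets this is equivalent to \( \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \), where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.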
C. Benchmarks

In benchmarking the results of the proposed CNN, the accuracy performance of the proposed work is benchmarked against mSVM-RFE-iRF [33] and varSelRF [34, 35] on the selected cancer datasets. Based on the results, the proposed CNN performed with superior accuracy, indicating its ability to improve classification through accurate gene selection. Table II lists the accuracy performances of the aforementioned methods.

IV. DATASETS

In this study, ten cancer datasets are used. The datasets contain gene expression profiles that are extracted utilizing microarray technology. Pre-processing is required prior to the use of the datasets. The files are stored in .RDA format, which can be accessed by utilizing a software suite supporting the R package. All of the gene profiles of tumor-inflicted and normal patients were encoded in binary format, described as different-class datasets. Each dataset is provided with a class file and a data file. The data file stores values in numeric format, arranged in rows and columns: each column indicates a patient number and each row indicates a gene number in the cancer dataset. Table I lists the description of the datasets used in this paper in terms of the number of genes, the number of samples, and the number of classes.
The CNN algorithm is implemented on the selected cancer datasets. Each dataset stores numerous categories of cancer. The data file primarily stores cancer data of multiple classes that are obtained from microarray technology.

TABLE I. THE MAIN CHARACTERISTICS OF THE CANCER DATASETS USED IN THIS RESEARCH

Data set          #genes   #samples   #classes
Brain               5597       42         5
Breast2            13321      286         4
Breast3             1509      264         4
Colon               2000       62         2
Leukemia            3571       72         2
Lymphoma            4026       62         3
Prostate            6033      102         2
Srbct               2308       63         4
Lung (michigan)     5217       86         2
Lung (boston)       5217       62         2
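As an illustration of how the .RDA data files described above can be loaded outside of R, the sketch below uses the pyreadr package; the file name, the stored object layout, and the transposition step are assumptions made for illustration, since the paper only states that rows correspond to genes and columns to patients.

```python
# Hypothetical loading sketch (not from the paper). Assumes the pyreadr package
# and a file name such as "Colon.RDA"; adjust to the actual dataset files.
import pyreadr

result = pyreadr.read_r("Colon.RDA")      # dict-like mapping of R object names to DataFrames
name, data = next(iter(result.items()))   # first R object stored in the file
print(name, data.shape)                   # expected layout: genes as rows, patients as columns

X = data.T.values                         # transpose so each row becomes one patient sample
print(X.shape)                            # (number of patients, number of genes)
```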
V. RESULTS AND DISCUSSION

The proposed CNN was executed in Theano [36], which hosts an environment for constructing deep learning software, with the Keras library [37] built on top of it. Initially, the neuron weights were assigned based on the default settings in Keras. ADADELTA [38], an adaptive learning rate method, was utilized in training the deep network layers. A MacBook Pro with a Core i5-3210M CPU and 8 GB of memory was used to execute the classification training of the proposed CNN. The time taken to train and test the network was 12 days, utilizing Python packages.

Analysis of Variance (ANOVA) was utilized as the statistical analysis for establishing the statistical significance of the accuracy in the classification of the 10 types of cancer datasets. The ANOVA comparison is illustrated in Fig. 4. Based on the classification accuracy obtained, ANOVA indicates that there exists a statistically significant difference among the 10 types of cancer datasets, with p = 3.4 × 10^-22.

TABLE II. COMPARISON OF CLASSIFICATION ACCURACY FOR CNN AND HYBRID MSVM-RFE-IRF.

For the proposed CNN, the Brain dataset yielded 92.14% in mean classification accuracy, scoring the highest accuracy at 97.62%, with 15.65% variance. On the other hand, the Breast2 dataset yielded 34.97% in mean classification accuracy, scoring the highest accuracy at 41.26%, with 8.52% variance. Meanwhile, the Breast3 dataset yielded 92.90% in mean classification accuracy, scoring the highest accuracy at 97.69%, with 10.27% variance. The Colon dataset, meanwhile, yielded 57.34% in mean classification accuracy, scoring the highest accuracy at 64.52%, with 11.61% variance. The Leukemia dataset yielded 95.69% in mean classification accuracy, scoring the highest accuracy at 100.00%, with 13.30% variance. The Lymphoma dataset, on the other hand, yielded 100.00% in mean classification accuracy, scoring the highest accuracy at 100.00%, with 1.73% variance.

Next, the Prostate dataset yielded 76.62% in mean classification accuracy, scoring the highest accuracy at 91.86%, with 16.91% variance. The SRBCT dataset yielded 98.02% in mean classification accuracy, scoring the highest accuracy at 100.00%, with 19.92% variance. Next, the LungMichigan dataset yielded 62.27% in mean classification accuracy, scoring the highest accuracy at 72.09%, with 17.67% variance. Lastly, the LungBoston dataset yielded 50.00% in mean classification accuracy, scoring the highest accuracy at 50.32%, with 0.00% variance.

The classification accuracy results obtained on the 10 cancer datasets for the proposed CNN are subsequently compared against the hybrid mSVM-RFE-iRF [33] and varSelRF [34, 35]. The results are listed in Table II, with the highlighted cells signifying the superior method with the highest classification accuracy. The overall comparison of accuracies is tabulated to represent the overall findings of this study.

Overall, the proposed CNN scored higher accuracies in comparison to mSVM-RFE-iRF and varSelRF on seven cancer datasets: Brain, Breast3, Leukemia, Lymphoma, SRBCT, LungMichigan, and LungBoston.
Fig. 5. ANOVA of classification accuracy for CNN, mSVM-RFE-iRF and varSelRF.

Based on the ANOVA analysis, a statistically significant difference exists between the three methods, as indicated by p = 0.007. Fig. 5 illustrates the results of the ANOVA analysis of CNN against the hybrid mSVM-RFE-iRF and varSelRF. The proposed CNN scored a mean classification accuracy of 94.74%, with a best accuracy performance of 100.00% and a variance of 38.03%. Meanwhile, mSVM-RFE-iRF recorded the second highest accuracy performance, scoring a mean classification accuracy of 85.82%, with a best accuracy performance of 95.55% and a variance of 15.26%. Lastly, varSelRF recorded the third best accuracy performance, scoring a mean classification accuracy of 79.58%, with a best accuracy performance of 93.07% and a variance of 26.89%.
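The reported comparison corresponds to a standard one-way ANOVA over per-dataset accuracy scores. The sketch below illustrates the computation with scipy; the CNN list reuses the per-dataset best accuracies quoted above, while the two benchmark lists are placeholder values, since the per-dataset scores of mSVM-RFE-iRF and varSelRF are not reproduced in the extracted text.

```python
# One-way ANOVA across the per-dataset accuracy scores of the three methods.
from scipy import stats

cnn_acc      = [97.62, 41.26, 97.69, 64.52, 100.00, 100.00, 91.86, 100.00, 72.09, 50.32]
msvm_rfe_irf = [95.0, 60.0, 90.0, 85.0, 92.0, 88.0, 86.0, 90.0, 75.0, 70.0]   # placeholder values
varselrf     = [85.0, 55.0, 82.0, 80.0, 88.0, 84.0, 78.0, 83.0, 70.0, 65.0]   # placeholder values

f_stat, p_value = stats.f_oneway(cnn_acc, msvm_rfe_irf, varselrf)
print(f"F = {f_stat:.3f}, p = {p_value:.4g}")  # the paper reports p = 0.007 for this comparison
```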
REFERENCES

[1] T. W. Shi, K. Moorthy, M. S. Mohamad, S. Deris, S. Omatu and M. Yoshioka. Random Forest and Gene Ontology for functional analysis of microarray data. In Computational Intelligence and Applications (CIA), 2014 IEEE 7th International Workshop on. 2014. IEEE.
[2] Koschmieder, A., Zimmermann, K., Trißl, S., Stoltmann, T., & Leser, U. (2011). Tools for managing and analyzing microarray data. Briefings in Bioinformatics, 13(1), 46-60.
[3] Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A., Benítez, J. M., & Herrera, F. (2014). A review of microarray datasets and applied feature selection methods. Information Sciences, 282, 111-135.
[4] Tomašev, N., Radovanović, M., Mladenić, D., & Ivanović, M. (2013). The role of hubness in clustering high-dimensional data. IEEE Transactions on Knowledge & Data Engineering, (1), 1.
[5] Zeng, T., & Ji, S. (2015, November). Deep convolutional neural networks for multi-instance multi-task learning. In Data Mining (ICDM), 2015 IEEE International Conference on (pp. 579-588). IEEE.
[6] Qing Liao, Lin Jiang, Xuan Wang, Chunkai Zhang and Ye Ding. Cancer Classification with Multi-task Deep Learning. 2017.
[7] Lee, C. P., Lin, W. S., Chen, Y. M., & Kuo, B. J. (2011). Gene selection and sample classification on microarray data based on adaptive genetic algorithm/k-nearest neighbor method. Expert Systems with Applications, 38(5), 4661-4667.
[8] Wang, H., Meghawat, A., Morency, L. P., & Xing, E. P. (2016). Select-Additive Learning: Improving Generalization in Multimodal Sentiment Analysis. arXiv preprint arXiv:1609.05244.
[9] Chao Li, Shuheng Zhang, Huan Zhang, Lifang Pang, Kinman Lam, Chun Hui, and Su Zhang. Using the k-nearest neighbor algorithm for the classification of lymph node metastasis in gastric cancer. Computational and Mathematical Methods in Medicine, 2012.
[10] Furey, T. S., Cristianini, N., Duffy, N., Bednarski, D. W., Schummer, M., & Haussler, D. (2000). Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10), 906-914.
[11] Zuyi Wang, Yue Wang, Jianhua Xuan, Yibin Dong, Marina Bakay, Yuanjian Feng, Robert Clarke, and Eric P. Hoffman. Optimized
multilayer perceptrons for molecular classification and diagnosis using genomic data. Bioinformatics, 2006. 22(6): p. 755-761.
[12] Asyali, M. H., Colak, D., Demirkaya, O., & Inan, M. S. (2006). Gene expression profile classification: a review. Current Bioinformatics, 1(1), 55-73.
[13] Kumar, C.A., M. Sooraj, and S. Ramakrishnan, A Comparative Performance Evaluation of Supervised Feature Selection Algorithms on Microarray Datasets. Procedia Computer Science, 2017. 115: p. 209-217.
[14] Pang, H., George, S. L., Hui, K., & Tong, T. (2012). Gene selection using iterative feature elimination random forests for survival outcomes. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 9(5), 1422-1431.
[15] Pang, H., George, S. L., Hui, K., & Tong, T. (2012). Gene selection using iterative feature elimination random forests for survival outcomes. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 9(5), 1422-1431.
[16] Kar, S., K.D. Sharma, and M. Maitra, Gene selection from microarray gene expression data for classification of cancer subgroups employing PSO and adaptive K-nearest neighborhood technique. Expert Systems with Applications, 2015. 42(1): p. 612-627.
[17] McCulloch, W.S. and W. Pitts, A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 1943. 5(4): p. 115-133.
[18] Bianchini, M. and F. Scarselli, On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 2014. 25(8): p. 1553-1565.
[19] Wang, H., Meghawat, A., Morency, L. P., & Xing, E. P. (2016). Select-Additive Learning: Improving Generalization in Multimodal Sentiment Analysis. arXiv preprint arXiv:1609.05244.
[20] Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P. A. (2008, July). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (pp. 1096-1103). ACM.
[21] Huang, F.J. and Y. LeCun. Large-scale learning with SVM and convolutional nets for generic object categorization. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. 2006. IEEE.
[22] Simard, P.Y., D. Steinkraus, and J.C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR. 2003.
[23] Vaillant, R., C. Monrocq, and Y. Le Cun, Original approach for the localisation of objects in images. IEE Proceedings - Vision, Image and Signal Processing, 1994. 141(4): p. 245-250.
[24] Nowlan, S.J. and J.C. Platt, A convolutional neural network hand tracker. Advances in Neural Information Processing Systems, 1995: p. 901-908.
[25] Muja, M. and D.G. Lowe, Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1), 2009. 2(331-340): p. 2.
[26] Krizhevsky, A., I. Sutskever, and G.E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 2012.
[27] Faro, A., Giordano, D., Spampinato, C., & Pennisi, M. (2010). Statistical texture analysis of MRI images to classify patients affected by multiple sclerosis. In XII Mediterranean Conference on Medical and Biological Engineering and Computing 2010 (pp. 272-275). Springer, Berlin, Heidelberg.
[28] Pereira, S., Pinto, A., Alves, V., & Silva, C. A. (2016). Brain tumor segmentation using convolutional neural networks in MRI images. IEEE Transactions on Medical Imaging, 35(5), 1240-1251.
[29] Tivive, F.H.C. and A. Bouzerdoum. A new class of convolutional neural networks (SICoNNets) and their application of face detection. In Neural Networks, 2003. Proceedings of the International Joint Conference on. 2003. IEEE.
[30] Liu, Y. and X. An. A classification model for the prostate cancer based on deep learning. In Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), 2017 10th International Congress on. 2017. IEEE.
[31] Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S., & Lew, M. S. (2016). Deep learning for visual understanding: A review. Neurocomputing, 187, 27-48.
[32] Willett, P., Wilton, D., Hartzoulakis, B., Tang, R., Ford, J., & Madge, D. (2007). Prediction of ion channel activity using binary kernel discrimination. Journal of Chemical Information and Modeling, 47(5), 1961-1966.
[33] Moorthy, K., Improved Random Forest with Multiple Support Vector Machine for Gene Selection and Classification of Microarray Data. 2015, Universiti Teknologi Malaysia.
[34] Díaz-Uriarte, R. and S.A. De Andres, Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 2006. 7(1): p. 3.
[35] Huerta, E.B., B. Duval, and J.-K. Hao. A hybrid GA/SVM approach for gene selection and classification of microarray data. In Workshops on Applications of Evolutionary Computation. 2006. Springer.
[36] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., ... & Bengio, Y. (2012). Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590.
[37] Chollet, F. (2015). Keras.
[38] Zeiler, M.D., ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.