Breast Cancer Gene Expression
ARTICLE INFO

Keywords: Feature selection; Machine learning; Cancer classification; Microarray data

ABSTRACT

Cancer, in particular breast cancer, is considered one of the most common causes of death worldwide according to the World Health Organization. For this reason, extensive research effort has gone into accurate and early diagnosis of cancer in order to increase the likelihood of cure. Among the available tools for diagnosing cancer, microarray technology has proven to be effective. Microarray technology analyzes the expression level of thousands of genes simultaneously. Although the huge number of features or genes in microarray data may seem advantageous, many of these features are irrelevant or redundant, which degrades classification accuracy. To overcome this challenge, feature selection techniques are a mandatory preprocessing step before classification. In this paper, the main feature selection and classification techniques introduced in the literature for cancer (particularly breast cancer) are reviewed to improve microarray-based classification.
1. Introduction

All cells have a nucleus that contains deoxyribonucleic acid (DNA). DNA carries the genetic information an organism needs to develop, function, grow, and reproduce. The coding segments of DNA are called genes, which are responsible for making proteins. Proteins do the essential work in every organism, and they are synthesized in two steps: first, DNA is transcribed into mRNA; then mRNA is translated into proteins. Genetic technologies such as DNA microarrays measure the expression of many genes simultaneously, offering a global view of the cell that helps in differentiating between normal and diseased states.

Cancer can be described as a group of diseases associated with uncontrollable cell growth that invades and metastasizes to other tissues. It is considered the second leading cause of death globally, with about 9.6 million deaths in 2018; about 1 in 6 deaths is due to cancer. The most common types among men are lung, prostate, colorectal, stomach, and liver cancer, while breast, colorectal, lung, cervical, and thyroid cancer are the most common among women [1]. Breast cancer is a heterogeneous disease with different histological and biological properties and various treatment responses [2]. It can be traced back to genetic, epigenetic, or transcriptomic changes. It appears as a lump, nipple discharge, or a change of skin texture around the nipple region. Early treatment of cancer increases the possibility of cure and reduces the fatality rate and the probability of recurrence [3]. Recurrence may happen months or years after an initial treatment; it can be local, where cancer affects the same place, or distant, where cancer returns in different areas of the body [4]. Breast cancer is detected using traditional methods, e.g., physical detection, blood test, and X-ray scan, but they are time-consuming and subject to human error [5]. Medical errors are considered the third-leading cause of death in the US [6]. Therefore, an effective tool for the diagnosis of breast cancer is necessary, and microarray technology is extensively used for this purpose.

Gene expression data from DNA microarrays represent the state of a cell at a molecular level [7] and have great potential for medical diagnosis. They are analyzed to determine whether the patient has cancer or not (two-class problems), to distinguish between different types of cancer (multi-class problems) [8], to predict the response to a drug based on the gene signature, or to identify tumors [9] by finding groups of similarly expressed genes. They are effectively analyzed by machine learning (ML). ML is an automatic and intelligent learning technique that gives machines the ability to learn without being explicitly programmed. ML techniques are widely employed in solving many complex real-world problems and have proven to be efficient in
* Corresponding author.
E-mail addresses: [email protected] (M. Abd-Elnaby), [email protected] (M. Alfonse), [email protected]
(M. Roushdy).
https://doi.org/10.1016/j.jbi.2021.103764
Received 8 July 2020; Received in revised form 9 March 2021; Accepted 26 March 2021
Available online 6 April 2021
1532-0464/© 2021 Elsevier Inc. This article is made available under the Elsevier license (http://www.elsevier.com/open-access/userlicense/1.0/).
M. Abd-Elnaby et al. Journal of Biomedical Informatics 117 (2021) 103764
2.1. Feature selection

Feature selection is the process of automatically or manually selecting the features that have an impact on the prediction, in order to:

• Reduce overfitting: overfitting means the model does not generalize well from the training data to unseen data because of noise and redundancy in the data. Removing such data helps the model generalize.
• Improve accuracy: training the model with less misleading data improves accuracy.
• Reduce training time: the smaller the number of features, the less computation time training requires.
• Offer biologists insight into the mechanism linking gene signatures and diseases [12].

Feature selection can be classified, based on the integration between the selection algorithm and the implemented model, into four main categories, as shown in Fig. 2. The pros and cons of feature selection
techniques are presented in Table 1.

2.2. Filter approach

The filter approach evaluates features based on intrinsic properties of the data, such as distance, correlation, and consistency, independently of the classifier. Although filter methods are computationally fast and generalize well [5], their performance is lower. Filter approaches are divided into univariate and multivariate. Univariate feature selection examines each feature individually to measure the strength of the association between the feature and the outcome variable. Common types are mutual information (MI) and information gain (IG).

• MI measures the dependence between two variables; in other words, it measures how much information one variable (X) carries about another (Y). In gene selection, it measures the dependence between a gene and the classification category. The larger the MI value, the more informative the gene [13].
• IG is a statistical property that measures how informative a feature is. Features that are highly related to the class carry the most information, while unrelated features give no information. IG is computed from entropy, which measures the impurity of the given samples. A threshold is then set, and features whose IG exceeds the threshold are selected [14].

Multivariate selection evaluates features in the context of others. The most commonly used techniques are:

• Minimum Redundancy and Maximum Relevance (mRMR) tends to select features that are highly correlated with the class and weakly correlated with one another. Selecting two features that are both highly relevant and highly correlated with each other may not be appropriate: the second adds little information, yet it increases model complexity and makes the model susceptible to overfitting [15].
• Correlation-based Feature Selection (CFS) ranks feature subsets using a correlation-based heuristic evaluation function. Features are evaluated according to the hypothesis "Good feature subset contains features that are highly correlated with the classification and yet uncorrelated to each other" [13]; that is, features with low correlation to the class are considered irrelevant, while informative features are strongly correlated with it [15].
• Fast Correlation Based Filter (FCBF) is a multivariate algorithm based on symmetrical uncertainty (SU) that selects features highly correlated with the class. It then applies heuristics to remove redundant features and retain those relevant to the class [6].

performance until a subset of the desired k features is reached that gives maximum accuracy. Common techniques are sequential forward selection and sequential backward selection. Among heuristic search algorithms, the most utilized are:

• Genetic Algorithm (GA) is a heuristic search algorithm inspired by natural evolution. The main principle of GA is to generate a population randomly and evolve it through three operations. First, the selection operation chooses individuals with better fitness. Then, in the crossover operation, each pair of individuals is recombined at a random crossover point to generate new offspring. Finally, the mutation operation maintains diversity in the population [16].
• Artificial Bee Colony (ABC) is a swarm-based algorithm that simulates how honeybees search for food. The colony consists of employed bees, onlookers, and scouts. The number of employed bees equals the number of food sources. Each employed bee goes to a food source and evaluates it. Based on information shared by the employed bees, onlookers choose a food source. An employed bee becomes a scout when its food source is depleted and begins to search randomly for a new food source [17].
• Particle Swarm Optimization (PSO) is a swarm-based algorithm that mimics how members of groups such as birds or fish interact to share information. In PSO, a candidate solution is represented by a particle that has a fitness value and a velocity that directs its flight. By updating each particle's position according to its own experience and that of other particles, an optimum solution can be reached [18].
• Bat Algorithm (BA) is a swarm-based algorithm based on echolocation, the mechanism bats use to locate their prey. Bats fly randomly and automatically adjust the frequency and rate of their emitted pulses based on the distance to the target. The solution is chosen from among the best solutions [19].

2.2.2. Embedded approach

Embedded methods learn which features contribute most to the accuracy of the model while the model is being built. The most common embedded feature selection methods are regularization methods.

2.2.3. Hybrid approach

The hybrid approach combines any number of the same or different feature selection methods to gain the advantages of each while offsetting their individual drawbacks. The usual combination is a filter-wrapper approach, which benefits from the fast computation of the filter approach to remove redundant features and from the high performance of the wrapper approach. It is also less prone to overfitting than the wrapper approach, but it is classifier-specific.
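The filter-wrapper combination described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the procedure of any cited study: the function names (`information_gain`, `loo_accuracy`, `hybrid_select`) are hypothetical, IG is computed after splitting each gene at its median, a nearest-centroid classifier stands in for the learning model, and the data are synthetic.

```python
import math
import random
from collections import Counter
from statistics import median


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """IG of one gene after splitting its values at the median (an assumption)."""
    cut = median(values)
    gain = entropy(labels)
    for above in (True, False):
        part = [lab for v, lab in zip(values, labels) if (v > cut) == above]
        if part:
            gain -= len(part) / len(labels) * entropy(part)
    return gain


def loo_accuracy(rows, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    hits = 0
    for i, (row, truth) in enumerate(zip(rows, labels)):
        best_label, best_dist = None, math.inf
        for c in sorted(set(labels)):
            members = [r for k, r in enumerate(rows) if labels[k] == c and k != i]
            centroid = [sum(col) / len(members) for col in zip(*members)]
            dist = math.dist(row, centroid)
            if dist < best_dist:
                best_label, best_dist = c, dist
        hits += best_label == truth
    return hits / len(labels)


def hybrid_select(rows, labels, n_filter=20, n_final=3):
    """Filter stage (IG ranking) followed by a wrapper stage (forward selection)."""
    columns = list(zip(*rows))
    gains = [information_gain(col, labels) for col in columns]
    candidates = sorted(range(len(columns)), key=lambda j: gains[j], reverse=True)[:n_filter]
    selected, best_score = [], 0.0
    while len(selected) < n_final:
        trials = []
        for j in candidates:
            if j in selected:
                continue
            subset = selected + [j]
            reduced = [[row[g] for g in subset] for row in rows]
            trials.append((loo_accuracy(reduced, labels), j))
        score, gene = max(trials)
        if score <= best_score:  # stop when no candidate improves accuracy
            break
        selected.append(gene)
        best_score = score
    return selected, best_score


# Synthetic stand-in for microarray data: 60 samples, 200 genes,
# of which genes 0..4 are shifted in class 1 and therefore informative.
rng = random.Random(0)
labels = [0] * 30 + [1] * 30
rows = [[rng.gauss(0.0, 1.0) + (2.0 if lab == 1 and g < 5 else 0.0) for g in range(200)]
        for lab in labels]

genes, accuracy = hybrid_select(rows, labels)
print("selected genes:", genes, "LOO accuracy:", round(accuracy, 3))
```

The cheap filter shrinks 200 genes to 20 before the costly wrapper search starts, which is the division of labor the hybrid approach relies on. Because the genes are chosen using all samples, the reported leave-one-out accuracy is optimistic; an unbiased estimate would repeat the selection inside each validation fold.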
• K-nearest neighbor (KNN) is a lazy learner that builds no model. It is used for classification and regression tasks. For classification, it labels data based on the labels of its neighbors ("birds of a feather flock together"): an object is assigned to the majority class among its k nearest neighbors. For regression, the output is the average of the values of the k nearest neighbors [20,21].
• Naïve Bayes (NB) is a probabilistic machine learning algorithm based on Bayes' theorem and widely used in classification tasks. "Naïve" means the features are assumed to be independent of each other, so changing the value of one feature does not directly change the value of any other. NB classifies data by calculating the posterior probability of each class from the probabilities of the features given that class. Despite this simple assumption, NB is fast and effective on real problems. Bayesian belief networks are used to deal with dependencies between features [20].
• Support Vector Machine (SVM) is commonly used for classifying gene expression data because of the sparseness of its solution and its ability to handle large feature spaces [22]. First, data items are plotted in n-dimensional space. Then SVM finds the hyperplane that best separates the classes.

2.3.2. Unsupervised approach

Unsupervised learning requires no labeled data. One common technique is k-means, in which data with similar features are grouped in the same cluster. In microarray analysis, k-means can be used to remove redundant genes by grouping similar data [11].

3. Different methods for classifying cancer

3.1. Filter approach

Purbolaksono et al. [5] introduced a three-stage system for classifying microarray data. The first stage was discretization, which used k-means to transform continuous data into discrete data and divide the data into clusters. The second stage was feature selection, in which mutual information was used for dimensionality reduction and for obtaining informative genes. Finally, classifiers based on Bayes' theorem were applied to five datasets. The best result was obtained with k = 10, and the results showed that Bayesian Network methods perform better than Naïve Bayes in classifying microarray data.

Cilia et al. [8] compared the performance of various feature selection and classification techniques on six datasets. For feature selection, the authors focused on feature ranking techniques, which evaluate each feature individually. The datasets were evaluated using decision tree (DT), Random Forest (RF), KNN, and multilayer perceptron classifiers with 10-fold cross-validation (CV). The results of the filter techniques were compared with those of Sequential Forward Floating Search (SFFS), the Fast Correlation-Based Filter (FCBF), and Minimum Redundancy Maximum Relevance. The ranking techniques obtained high results on three datasets, while FCBF and SFFS obtained high results on the other three. Despite the high results, further reduction was needed for the Ovarian, Lymphoma, and Lung datasets.

Aydadenta and Adiwijaya [11] utilized k-means and IG for feature selection. Initially, k-means was used to group similar features into one cluster so that redundant ones could be removed. Then the Relief algorithm was used to rank the elements of each cluster, and the top-ranking features of each cluster were combined to train RF. The proposed model was evaluated on three datasets and showed better results than the model using RF only without clustering [23].

Bolón-Canedo et al. [12] reviewed state-of-the-art techniques applied in the domain of microarray classification. A practical evaluation was then carried out to compare the performance of the different techniques. Different feature selection techniques, e.g., ReliefF, SVM-RFE, mRMR, IG, and FCBF, were used for gene selection. Then C4.5, NB, and SVM were tested to measure the accuracy of the model. The results showed the efficiency of SVM in obtaining high accuracy.

Al-Batah et al. [24] used the filter method CFS to remove redundant genes and keep the informative ones; for classification, Decision Table, JRip, and OneR were used. The proposed approach can achieve high accuracy and fast computation with only a small number of genes.

Gao et al. [25] proposed PA-SVM, which combines PSO with ABC (named PA) to optimize SVM classification. FCBF was initially used to obtain informative genes. PA-SVM was then evaluated on 9 datasets, and the results were compared with those of other classifiers. PA-SVM achieved good results with only a small number of genes.

Baliarsingh et al. [26] proposed the Jaya optimized extreme learning machine (JELM) for breast cancer classification. Jaya is used to select the optimal input weights and hidden biases for ELM. The authors used the Wilcoxon rank-sum test to select relevant genes. JELM was compared with SVM, KNN, NB, and C4.5 and achieved a higher accuracy of about 90.91%. Although the proposed model achieved high accuracy, it selected a large subset of about 505 genes, so the gene subset needs further reduction.

Su et al. [27] introduced a gene selection method based on the Kolmogorov-Smirnov (K-S) test and CFS. First, the K-S test removed redundant and noisy genes by comparing the distributions of the two sample types. Then the filtered subset was evaluated by CFS, so that only genes highly correlated with the class and with low redundancy remained. Finally, the proposed method was evaluated using an SVM classifier with 10-fold CV, and its results were compared with those of other FS techniques. K-S test-CFS had superior performance, but its running time needed optimization.

Ahmad [28] utilized different filter feature selection techniques, namely SNR, FC, IG, and t-Test, to select the informative genes. The gene selection techniques were applied to three datasets. Finally, SVM was used to evaluate the proposed methods. IG was effective at selecting a minimum set of attributes, and SVM had high accuracy with the IG and SNR techniques.

3.2. Hybrid approach

Wu et al. [23] proposed a hybrid improved binary quantum particle swarm optimization algorithm (HI-BQPSO) for feature selection, combining the advantages of filtering and a random heuristic search. First, the maximum information coefficient (MIC) was used to calculate the correlation between features and class to obtain an initial feature subset. Then the improved BQPSO was used to obtain the optimized feature subset. The proposed model was evaluated on 9 gene datasets with an SVM classifier. Although HI-BQPSO had good overall performance and strong search ability, it still needs improvement, especially on the CNS dataset.

Medjaheda et al. [29] proposed an approach to diagnose cancer. In the first phase, Support Vector Machines based on Recursive Feature Elimination (SVM-RFE) was used to eliminate 40 percent of the features. The remaining subset was processed via Binary Dragonfly (BDF) to retain informative genes only. The proposed method was evaluated on 6 microarray cancer datasets. The model achieved comparable results overall, but for the breast dataset the outcome was unsatisfactory: it achieved high accuracy but with a very large number of features.

Jain et al. [30] proposed a hybrid feature selection method that combined CFS and IBPSO. IBPSO addresses BPSO's premature convergence to local optima. The proposed method was applied to 11 microarray datasets and evaluated by NB with stratified 10-fold CV. The model was compared with 7 classifiers and outperformed them in terms of accuracy and number of selected genes in most cases.

Shahbeig et al. [31] proposed a hybrid TLBO-PSO that combined the teaching learning-based optimization (TLBO) algorithm and a mutated fuzzy adaptive particle swarm optimization (PSO) algorithm. The mutated PSO is used to overcome the tendency of PSO to become trapped in local optima. A constant or even linearly changed value of
inertia weight may prevent the PSO algorithm from reaching the optimum result. Fuzzy tuning of the inertia weight based on the proposed total normalized function value can enhance the convergence speed of PSO and avoid trapping at local optima. The proposed method was evaluated using SVM and achieved 91.88% accuracy with 195 features.

Lu et al. [32] proposed MIMAGA, a hybrid feature selection algorithm combining mutual information maximization (MIM) and the adaptive genetic algorithm (AGA). Initially, MIM was applied as a preprocessing step to obtain a subset containing only 300 genes. Then the wrapper technique, AGA, was applied. Finally, an extreme learning machine was applied as a classifier on the dataset. MIMAGA was compared with other FS techniques. The results showed that although MIMAGA takes a long time, it was efficient and had the best result.

Alomari et al. [33] proposed a hybrid filter-wrapper gene selection method using the filter approach Minimum Redundancy Maximum Relevancy (MRMR) and the wrapper approach flower pollination algorithm (FPA). Initially, MRMR was employed to obtain from the gene expression data the important genes, those with minimum redundancy among the input genes and maximum relevancy to the target class. These genes were then used by FPA to get the most informative ones. The proposed model was evaluated on three datasets and compared with MRMR-GA. The proposed method showed comparable accuracy and a low number of features.

Turgut et al. [34] used the Recursive Feature Elimination (RFE) and Randomized Logistic Regression (RLR) feature elimination methods to select informative genes. The proposed method selected the top 50 features. Its performance was evaluated using 8 classifiers (SVM, KNN, Multi-Layer Perceptron, DT, RF, LR, AdaBoost, and Gradient Boosting Machines) with k-fold CV on two different breast cancer datasets. The best result was achieved with SVM as the classifier for both datasets.

Mufassirin and Ragel [35] proposed a novel filter-wrapper based feature selection approach. Initially, the filter method gain ratio was used to determine the importance of genes by measuring the gain ratio with respect to the class, in order to eliminate irrelevant and redundant genes. In the second phase, a wrapper subset evaluator was used to evaluate the subset produced by the gain ratio. Finally, the proposed approach was evaluated using J48, DT, NB, and Sequential Minimal Optimization on five datasets. The proposed model was time-efficient and gave high results.

Sreepada et al. [36] proposed a hybrid filter-wrapper approach for gene selection to combine the fast computation of the filter approach with the classification accuracy of the wrapper approach. First, F-Score and IG were each used separately to produce a subset, and the two subsets were then combined. Then the wrapper methods Sequential Backward Elimination (SBE) and Sequential Forward Selection (SFS) with SVM were used to get the informative genes. The proposed method was evaluated on three datasets and achieved good results of more than 97% for two of them.

Hameed et al. [37] proposed a hybrid approach to select the informative genes. First, the Pearson correlation coefficient (PCC) was run 10 times to select the top 100 ranked features. Then either binary PSO or GA was used for further reduction. Different classifiers were employed to test the accuracy on eleven datasets based on 10-fold CV. The results showed that SVM had the highest accuracy and that BPSO ran faster and achieved better results than GA with a smaller number of selected genes.

Salem et al. [38] proposed a hybrid approach named IG-SGA. Initially, IG was used with various thresholds to reduce the feature set. Then the reduced subset was passed to GA to obtain the most informative genes. Finally, genetic programming was used to classify seven datasets. The performance was assessed using 10-fold CV. Although the proposed model showed better results, further improvement was needed, specifically for the Lung-Ontario dataset, and time complexity was a limitation.

Utami and Rustama [39] proposed the hybrid methods PSO–SVM and ABC–SVM to obtain accurate results in diagnosing breast cancer. Initially, PSO and ABC were used as feature selection techniques. Then SVM was used as a classifier. The results showed that ABC–SVM was accurate and effective in dealing with high-dimensional data such as microarray data.

Zhongxin et al. [40] proposed a Feature Selection Algorithm based on Mutual Information and Lasso (FSMIL). In the first stage, MI was used to filter irrelevant features. In the next stage, an improved version of lasso was trained on the candidate subset to produce the most informative genes. The proposed method was applied to five datasets, and an SVM classifier was used to test its accuracy. The proposed method achieved high accuracy, especially for the lung and Lymphoma datasets.

Sardana et al. [41] proposed a hybrid approach, Cluster quantum Genetic Algorithm (ClusterQGA), to classify cancer accurately. Initially, clustering was used to remove irrelevant and redundant data; then the computational power of quantum and genetic algorithms was used to select only relevant features. The proposed method was applied to four datasets and evaluated using SVM and KNN classifiers. Although ClusterQGA successfully reduced the number of genes, its classification accuracy needs further improvement.

Singh and Sivabalakrishnan [42] presented a hybrid selection technique that combined mRMR with the Adaptive Genetic Algorithm (AGA). In the first phase, mRMR was used to reduce the dimensions and the redundancy in the data. The resulting subset was then further processed via AGA to get the most relevant genes. The mRMR-AGA approach was evaluated by four classifiers on four benchmark datasets and achieved comparable results.

Nagpala and Singhb [43] proposed qualitative mutual information (QMI) for feature selection. Initially, RF was used to obtain the importance of each gene, which was used to calculate the preference score (PS). PS reduces the redundancy in the subset. Then MI was used to obtain the informative genes. The proposed method was evaluated on four datasets; for classification, NB, C4.5, and IB1 were used with 10-fold CV. The results showed that the proposed method with NB obtained an accuracy of more than 98% for two datasets.

Loey et al. [44] presented an intelligent decision support system for diagnosing microarray cancer data. Initially, IG was used to select relevant genes. Then the selected subset was reduced via Grey Wolf Optimization (GWO). Finally, SVM was utilized for breast and colon cancer classification. The IG-GWO approach achieved high accuracy, but with a large number of features (about 240) for the breast cancer dataset.

Hamim et al. [45] combined the filter approach Fisher score (F) with C5.0 to select relevant breast cancer genes. Initially, the Fisher score removed redundant genes and reduced the subset to only 10% of the genes. Then C5.0 selected only 5 relevant genes. The proposed FC5 was assessed by C5.0, ANN, SVM, and LR with stratified 10-fold CV. C5.0 achieved the highest accuracy, about 93.28%.

3.3. Other approaches

Jinthanasatian et al. [46] utilized a neuro-fuzzy algorithm with the firefly algorithm to classify microarray data. The neuro-fuzzy algorithm was used to select informative genes, with rule set generation as the classifier. The firefly algorithm was utilized to optimize the parameters. The proposed method was evaluated on seven datasets, and the accuracy was assessed with 10-fold CV. The proposed algorithm provided results comparable with other techniques, but further improvement is needed, especially for the colon dataset, on which it achieved only 76.94%.

Li et al. [47] proposed random value-based oversampling (RVOS) and an improved version of SVM-RFE to analyze microarray data effectively. First, RVOS was utilized to balance the distribution of the two sample types. Then an improved version of linear SVM (LLSVM) with the improved RFE strategy was used to get the informative genes. Finally, the proposed model was evaluated using four classifiers with stratified
Table 2
Different methods for classifying breast cancer.

Ref                        | Feature selection     | Classifier          | Dataset [ref] | Classification accuracy | No. genes
                           |                       |                     | Leukemia      | 98.61%                  | 93
Loey et al. [44]           | IG + GWO              | SVM                 | Colon [48]    | 95.9%                   | 16
                           |                       |                     | Breast [51]   | 94.87%                  | 240
Hamim et al. [45]          | FC5                   | C5.0                | Breast [51]   | 93.28%                  | 5
Jinthanasatian et al. [46] | Neuro-fuzzy algorithm | Rule set generation | Lung [52]     | 93.42%                  | 4
                           |                       |                     | Ovarian [49]  | 96.13%                  | 12
                           |                       |                     | Prostate [54] | 87.43%                  | 5
                           |                       |                     | Leukemia [50] | 82.27%                  | 7
                           |                       |                     | Breast [51]   | 82.37%                  | 7
                           |                       |                     | Colon         | 76.94%                  | 11
                           |                       |                     | DLBCL         | 83.81%                  | 13
Li et al. [47]             | SVM-RFE               |                     | Prostate [54] | 92.20%                  | NA
                           | SVM-RFE               |                     | Breast [51]   | 86.09%                  |
                           | SVM-VSSRFE            |                     | CNS [55]      | 88.39%                  |
                           | SVM-RFE               |                     | Colon [48]    | 93.75%                  |
                           | SVM-VSSRFE            |                     | Ovarian [49]  | 100%                    |
                           | SVM-VSSRFE            |                     | Leukemia [50] | 100%                    |
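Many of the studies summarized in Table 2 report accuracy under k-fold CV together with the number of retained genes. As a rough illustration of how such a pair of figures can be obtained, the sketch below runs stratified 10-fold CV in plain Python and repeats gene selection inside every training fold. It is not the protocol of any cited study: the function names are hypothetical, the signal-to-noise ranking |mean0 - mean1| / (sd0 + sd1) is one common definition assumed here, a nearest-centroid classifier stands in for the model, and the data are synthetic.

```python
import math
import random
from statistics import mean, pstdev


def stratified_folds(labels, k, rng):
    """Deal the shuffled indices of each class round-robin into k folds."""
    folds = [[] for _ in range(k)]
    for c in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def snr_top_genes(rows, labels, top):
    """Rank genes by |mean0 - mean1| / (sd0 + sd1); binary labels 0/1 assumed."""
    scored = []
    for j, col in enumerate(zip(*rows)):
        a = [v for v, lab in zip(col, labels) if lab == 0]
        b = [v for v, lab in zip(col, labels) if lab == 1]
        scored.append((abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-12), j))
    return [j for _, j in sorted(scored, reverse=True)[:top]]


def predict(train_rows, train_labels, row):
    """Nearest-centroid prediction for a single sample."""
    best_label, best_dist = None, math.inf
    for c in sorted(set(train_labels)):
        members = [r for r, lab in zip(train_rows, train_labels) if lab == c]
        centroid = [mean(col) for col in zip(*members)]
        dist = math.dist(row, centroid)
        if dist < best_dist:
            best_label, best_dist = c, dist
    return best_label


def cross_validate(rows, labels, k=10, n_genes=10, seed=0):
    """Mean accuracy over k stratified folds, selecting genes inside each fold."""
    fold_scores = []
    for test_idx in stratified_folds(labels, k, random.Random(seed)):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        genes = snr_top_genes(train_rows, train_labels, n_genes)  # training fold only
        reduced = [[r[g] for g in genes] for r in train_rows]
        hits = sum(predict(reduced, train_labels, [rows[i][g] for g in genes]) == labels[i]
                   for i in test_idx)
        fold_scores.append(hits / len(test_idx))
    return mean(fold_scores), fold_scores


# Synthetic stand-in for microarray data: 60 samples, 200 genes, genes 0..4 informative.
rng = random.Random(1)
labels = [0] * 30 + [1] * 30
rows = [[rng.gauss(0.0, 1.0) + (2.0 if lab == 1 and g < 5 else 0.0) for g in range(200)]
        for lab in labels]

mean_accuracy, fold_scores = cross_validate(rows, labels, k=10, n_genes=10)
print("mean 10-fold accuracy:", round(mean_accuracy, 3), "genes per fold: 10")
```

Selecting genes only from the training part of each fold keeps the held-out samples unseen during selection, so the averaged accuracy is not inflated by selection bias. With so few samples per fold, individual fold scores vary widely, which is one reason results on small microarray datasets are usually averaged over all folds.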
Table 3
Different methods for classifying other cancer types.

Ref | Feature selection | Classifier | Dataset [ref] | Classification accuracy | No. genes
5-fold CV. The results showed that SVM-VSSRFE performed better on three datasets and that LLSVM-VSSRFE was efficient in reducing time consumption, especially with high-dimensional datasets.

Different methods for classifying breast cancer and other cancer types are presented in Tables 2 and 3, respectively. The accuracy of state-of-the-art methods is presented in Fig. 4.

4. Discussion

Although microarray data have proven effective for diagnosing cancer, the huge number of features relative to the small sample size causes the so-called curse of dimensionality. For example, the breast datasets of Van't Veer [51] and Wang [53] have 24,482 and 18,000 features with only 97 samples. To avoid this problem, hybrid and filter selection techniques are commonly used. The filter approach is fast and not computationally expensive, so it is used in [5,8,12,24–28] and recommended as the initial step of the hybrid approach in [29–35,37,39,41–46]. Applying the filter approach to the Van't Veer dataset, K-S test-CFS in [27] generated a small subset in comparison with the subsets generated by other filter techniques, but with a lower accuracy of about 87.4%. In [26], the Wilcoxon rank-sum test achieved a high accuracy of about 90.91%, but with a large subset of about 505 genes. On the other hand, applying GR to the Wang dataset achieved higher accuracy with only 50 genes. While the filter approach may achieve high accuracy, it selects a larger number of features; on the other hand, the ability of evolutionary wrapper feature selection techniques to find an optimal or near-optimal subset helps the hybrid approach achieve higher accuracy with only a small subset. In [45], FC5 generated the smallest subset, about 5 genes, while the highest accuracy was achieved in [44] using IG-GWO, but with a large subset of about 240 genes. Although the hybrid approach can achieve better performance than a filter, SVM-RFE with BDF in [29] had the worst performance in terms of the number of selected genes: it selected about 7237 genes, though with acceptable accuracy. For other cancer types, while GR in [8] generated a small subset for colon [48], it produced a large subset for ovarian [49] and lung [52]. PCC-BPSO in [37] produced small subsets for ovarian and lung that led to the best accuracy using KNN and NB, respectively. Applying ClusterQGA in [41] with a KNN classifier led to the best performance for colon, about 100% accuracy. SVM has high accuracy for breast, KNN has better accuracy for colon and ovarian, and NB for lung. Another issue is the small number of samples; for accurate validation, 10-fold CV is commonly used.

5. Conclusion and future work

Microarray data analysis deepens our understanding of cancer pathogenesis and also has diagnostic value. It accurately diagnoses
cancer. However, the accuracy is influenced by the large number of features
and the limited number of samples. Dimensionality reduction techniques, mainly
feature selection approaches, are utilized to overcome this deterioration in
accuracy. The survey reviewed the state of the art of feature selection and
classification techniques. The review showed that SVM is the most applied
classification algorithm and achieved a high result of about 94.87% with hybrid
feature selection (IG-GWO). As future work, a hybrid feature selection technique
based on a heuristic search algorithm will be examined to obtain a more accurate
result.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or
personal relationships that could have appeared to influence the work reported
in this paper.

References

[1] F. Bray, J. Ferlay, I. Soerjomataram, R.L. Siegel, L.A. Torre, A. Jemal, Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries, CA: a cancer journal for clinicians 68 (2018) 394–424.
[2] N. Eliyatkın, E. Yalçın, B. Zengel, S. Aktaş, E. Vardar, Molecular classification of breast carcinoma: from traditional, old-fashioned way to a new age, and a new way, The journal of breast health 11 (2015) 59.
[3] L.A. Torre, F. Bray, R.L. Siegel, J. Ferlay, J. Lortet-Tieulent, A. Jemal, Global cancer statistics, 2012, CA: A Cancer Journal for Clinicians 65 (2) (2015) 87–108, https://doi.org/10.3322/caac.21262.
[4] R. Priya, P.S. Vadivu, A Review on Data Mining Techniques for Prediction of Breast Cancer Recurrence, International Journal of Engineering and Management Research (IJEMR) 9 (2019) 142–146.
[5] M.D. Purbolaksono, K.C. Widiastuti, M.S. Mubarok, F.A. Ma’ruf, Implementation of mutual information and bayes theorem for classification microarray data, Journal of Physics: Conference Series, IOP Publishing (2018), 012011.
[6] M.A. Makary, M. Daniel, Medical error—the third leading cause of death in the US, Bmj 353 (2016).
[7] H.J. Hong, W.S. Koom, W.-G. Koh, Cell microarray technologies for high-throughput cell-based biosensors, Sensors 17 (2017) 1293.
[8] N.D. Cilia, C. De Stefano, F. Fontanella, S. Raimondo, A. Scotto di Freca, An experimental comparison of feature-selection and classification methods for microarray datasets, Information 10 (2019) 109.
[9] Z. Yu, H. Chen, J. You, H.-S. Wong, J. Liu, L. Li, G. Han, Double selection based semi-supervised clustering ensemble for tumor clustering from gene expression profiles, IEEE/ACM transactions on computational biology and bioinformatics 11 (2014) 727–740.
[10] K. Kourou, T.P. Exarchos, K.P. Exarchos, M.V. Karamouzis, D.I. Fotiadis, Machine learning applications in cancer prognosis and prediction, Computational and structural biotechnology journal 13 (2015) 8–17.
[11] H. Aydadenta, A Clustering Approach for Feature Selection in Microarray Data Classification Using Random forest, Journal of Information Processing Systems 14 (2018).
[12] V. Bolón-Canedo, N. Sánchez-Marono, A. Alonso-Betanzos, J.M. Benítez, F. Herrera, A review of microarray datasets and applied feature selection methods, Information Sciences 282 (2014) 111–135.
[13] J.R. Vergara, P.A. Estévez, A review of feature selection methods based on mutual information, Neural computing and applications 24 (2014) 175–186.
[14] B. Azhagusundari, A.S. Thanamani, Feature selection based on information gain, International Journal of Innovative Technology and Exploring Engineering (IJITEE) 2 (2013) 18–21.
[15] M.A. Hall, L.A. Smith, Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper, FLAIRS conference (1999) 235–239.
[16] N. Almugren, H. Alshamlan, A survey on hybrid feature selection methods in microarray gene expression data for cancer classification, IEEE Access 7 (2019) 78533–78548.
[17] M.S. Hossain, A. El-Shafie, Application of artificial bee colony (ABC) algorithm in search of optimal release of Aswan High Dam, Journal of Physics: Conference Series, IOP Publishing (2013), 012001.
[18] D. Wang, D. Tan, L. Liu, Particle swarm optimization algorithm: an overview, Soft Computing 22 (2018) 387–408.
[19] X.-S. Yang, A new metaheuristic bat-inspired algorithm, Nature inspired cooperative strategies for optimization (NICSO 2010), Springer (2010) 65–74.
[20] S.A. Abdulrahman, W. Khalifa, M. Roushdy, A.M. Salem, Comparative study for 8 computational intelligence algorithms for human identification, Comput. Sci. Rev. 36 (2020), 100237.
[21] I.F. Widiawati, H. Nugrahapraja, R. Fajriyah, K-Nearest Neighbor (KNN) Analysis on Genes Expression Datasets of Maize Nested Association Mapping (NAM) Showed Confident Classification on Organ-specific Expression, 2018 1st International Conference on Bioinformatics, Biotechnology, and Biomedical Engineering - Bioinformatics and Biomedical Engineering 1 (2018) 1–3.
[22] B. Sahu, S. Dehuri, A.K. Jagadev, Feature selection model based on clustering and ranking in pipeline for microarray data, Informatics in Medicine Unlocked 9 (2017) 107–122.
[23] Q. Wu, Z. Ma, J. Fan, G. Xu, Y. Shen, A feature selection method based on hybrid improved binary quantum particle swarm optimization, IEEE Access 7 (2019) 80588–80601.
[24] M.S. Al-Batah, B.M. Zaqaibeh, S.A. Alomari, M.S. Alzboon, Gene Microarray Cancer Classification using Correlation Based Feature Selection Algorithm and Rules Classifiers, International Journal of Online and Biomedical Engineering (iJOE) 15 (2019) 62–73.
[25] L. Gao, M. Ye, C. Wu, Cancer classification based on support vector machine optimized by particle swarm optimization and artificial bee colony, Molecules 22 (2017) 2086.
[26] S.K. Baliarsingh, C. Dora, S. Vipsita, Jaya Optimized Extreme Learning Machine for Breast Cancer Data Classification, Springer Singapore, Singapore, 2021, pp. 459–467.
[27] Q. Su, Y. Wang, X. Jiang, F. Chen, W.-C. Lu, A cancer gene selection algorithm based on the KS test and CFS, BioMed research international 2017 (2017).
[28] F.K. Ahmad, A comparative study on gene selection methods for tissues classification on large scale gene expression data, Jurnal Teknologi 78 (2016) 116–125.
[29] S.A. Medjahed, T.A. Saadi, A. Benyettou, M. Ouali, Kernel-based learning and feature selection analysis for cancer diagnosis, Applied Soft Computing 51 (2017) 39–48.
[30] I. Jain, V.K. Jain, R. Jain, Correlation feature selection based improved-Binary Particle Swarm Optimization for gene selection and cancer classification, Appl Soft Comput. 62 (2018) 203–215.
[31] S. Shahbeig, M.S. Helfroush, A. Rahideh, A fuzzy multi-objective hybrid TLBO-PSO approach to select the associated genes with breast cancer, Signal Process. 131 (2017) 58–65.
[32] H. Lu, J. Chen, K. Yan, Q. Jin, Y. Xue, Z. Gao, A hybrid feature selection algorithm for gene expression data classification, Neurocomputing 256 (2017) 56–62.
[33] O.A. Alomari, A.T. Khader, M.A. Al-Betar, Z.A.A. Alyasseri, A hybrid filter-wrapper gene selection method for cancer classification, 2018 2nd International Conference on BioSignal Analysis, Processing and Systems (ICBAPS), IEEE, 2018, pp. 113–118.
[34] S. Turgut, M. Dağtekin, T. Ensari, Microarray breast cancer data classification using machine learning methods, 2018 Electric Electronics, Computer Science, Biomedical Engineerings’ Meeting (EBBT), IEEE, 2018, pp. 1–3.
[35] M.M. Mufassirin, R.G. Ragel, A novel filter-wrapper based feature selection approach for cancer data classification, 2018 IEEE International Conference on Information and Automation for Sustainability (ICIAfS), IEEE, 2018, pp. 1–6.
[36] R.S. Sreepada, S. Vipsita, P. Mohapatra, An efficient approach for microarray data classification using filter wrapper hybrid approach, 2015 IEEE International Advance Computing Conference (IACC), IEEE, 2015, pp. 263–267.
[37] S.S. Hameed, F.F. Muhammad, R. Hassan, F. Saeed, Gene Selection and Classification in Microarray Datasets using a Hybrid Approach of PCC-BPSO/GA with Multi Classifiers, JCS 14 (2018) 868–880.
[38] H. Salem, G. Attiya, N. El-Fishawy, Classification of human cancer diseases by gene expression profiles, Applied Soft Computing 50 (2017) 124–134.
[39] D. Utami, Z. Rustam, Gene selection in cancer classification using hybrid method based on Particle Swarm Optimization (PSO), Artificial Bee Colony (ABC) feature selection and support vector machine, AIP Conference Proceedings, AIP Publishing LLC (2019), 020047.
[40] W. Zhongxin, S. Gang, Z. Jing, Z. Jia, Feature selection algorithm based on mutual information and Lasso for microarray data, The Open Biotechnology Journal 10 (2016).
[41] M. Sardana, R. Agrawal, B. Kaur, A hybrid of clustering and quantum genetic algorithm for relevant genes selection for cancer microarray data, International Journal of Knowledge-based and Intelligent Engineering Systems 20 (2016) 161–173.
[42] R.K. Singh, M. Sivabalakrishnan, Microarray Gene Expression Data Classification using a Hybrid Algorithm: MRMRAGA, International Journal of Innovative Technology and Exploring Engineering (IJITEE) August (8) (2019).
[43] A. Nagpal, V. Singh, A feature selection algorithm based on qualitative mutual information for cancer microarray data, Procedia computer science 132 (2018) 244–252.
[44] M. Loey, M.W. Jasim, H.M. EL-Bakry, M.H.N. Taha, N.E.M. Khalifa, Breast and Colon Cancer Classification from Gene Expression Profiles Using Data Mining Techniques, Symmetry 12 (2020) 408.
[45] M. Hamim, I. El Moudden, H. Moutachaouik, M. Hain, Decision Tree Model Based Gene Selection and Classification for Breast Cancer Risk Prediction, Springer International Publishing, Cham, 2020, pp. 165–177.
[46] P. Jinthanasatian, S. Auephanwiriyakul, N. Theera-Umpon, Microarray data classification using neuro-fuzzy classifier with firefly algorithm, 2017 IEEE Symposium Series on Computational Intelligence (SSCI), IEEE, 2017, pp. 1–6.
[47] Z. Li, W. Xie, T. Liu, Efficient feature selection and classification for microarray data, PloS one 13 (2018).
[48] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, A.J. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proceedings of the National Academy of Sciences 96 (1999) 6745–6750.
[49] E.F. Petricoin III, A.M. Ardekani, B.A. Hitt, P.J. Levine, V.A. Fusaro, S.M. Steinberg, G.B. Mills, C. Simone, D.A. Fishman, E.C. Kohn, Use of proteomic patterns in serum to identify ovarian cancer, The lancet 359 (2002) 572–577.
[50] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, science 286 (1999) 531–537.
[51] L.J. Van’t Veer, H. Dai, M.J. Van De Vijver, Y.D. He, A.A. Hart, M. Mao, H.L. Peterse, K. Van Der Kooy, M.J. Marton, A.T. Witteveen, Gene expression profiling predicts clinical outcome of breast cancer, nature 415 (2002) 530–536.
[52] G.J. Gordon, R.V. Jensen, L.-L. Hsiao, S.R. Gullans, J.E. Blumenstock, S. Ramaswamy, W.G. Richards, D.J. Sugarbaker, R. Bueno, Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma, Cancer research 62 (2002) 4963–4967.
[53] Y. Wang, J.G. Klijn, Y. Zhang, A.M. Sieuwerts, M.P. Look, F. Yang, D. Talantov, M. Timmermans, M.E. Meijer-van Gelder, J. Yu, Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer, The Lancet 365 (2005) 671–679.
[54] D. Singh, P. Febbo, K. Ross, D. Jackson, J. Manola, C. Ladd, P. Tamayo, A. Renshaw, A. D’Amico, J. Richie, E. Lander, M. Loda, P. Kantoff, T. Golub, W. Sellers, Gene Expression Correlates of Clinical Prostate Cancer Behavior, Cancer cell 1 (2002) 203–209.
[55] S.L. Pomeroy, P. Tamayo, M. Gaasenbeek, L.M. Sturla, M. Angelo, M.E. McLaughlin, J.Y. Kim, L.C. Goumnerova, P.M. Black, C. Lau, Prediction of central nervous system embryonal tumour outcome based on gene expression, Nature 415 (2002) 436–442.