Comparative Analysis of Ensemble Techniques with Boosting (XGBoost) and Bagging (Random Forest) for Classifying Splice Junction DNA Sequence Categories

e-ISSN: 2476-9266
p-ISSN: 2088-9402
DOI: 10.17933/jppi.2019.090103

Manuscript received: 31 October 2018; Revised: 4 March 2019; Accepted: 5 August 2019
Abstract

Bioinformatics research is currently undergoing rapid growth, supported by developments in computation technology and algorithms. The ensemble decision tree is a common method for classifying large and complex datasets such as DNA sequences. Combining the implementation of two classification methods, XGBoost and random forest, with the ensemble technique may improve accuracy in classifying DNA sequence splice junction types. With 96.24% accuracy for XGBoost and 95.11% for random forest, the study suggests that both methods, given the right parameter settings, are highly effective tools for classifying DNA sequence datasets. Analyzing both methods together with their characteristics gives an overview of how they work to meet the needs of DNA splicing.

Keywords: DNA splice site junction, ensemble technique, extreme gradient boosting, grid search hyperparameter optimization, random forest.
[…] machine learning algorithm. Machine Learning (ML) uses machines to learn and recognize patterns in order to make classifications and even predictions. The high level of accuracy makes it easy for researchers to evaluate an experiment immediately and precisely at an inexpensive cost. This technology has been widely implemented in many fields related to genetics and genomics because it is considered able to interpret enormous genome datasets and has been used to describe a wide variety of features of genomic sequences (Libbrecht and Noble, 2015).

Biogenetic data is also related to the process of protein formation. There is a stage in this process where deoxyribonucleic acid (DNA) is copied into ribonucleic acid (RNA). The copy contains unnecessary information which is carried to the final product, so this form of RNA is considered immature. Such information must be removed in order to produce functional products. The RNA splicing process is done to eliminate the information that is not needed. Exons are sequences of nucleotides that remain in the mature RNA, whereas introns are sequences that are removed. The classification of the data refers to two types of splicing categories, namely the acceptor and donor categories. The acceptor is the border between the intron gene and the exon gene, while the donor is the DNA sequence containing a border between the exon gene and the intron gene.

In the last decade, pattern recognition methods have developed. Among them are the weight matrix method (WMM), the weight array method (WAM), and the maximal […] widely applied and implemented in some software (Sun et al., 2008). One of the common methods used in ML is the decision tree (DT). DT is able to extract information from a dataset into knowledge that is intuitive and easy to understand (Barros et al., 2012). DT algorithms have advantages over other learning algorithms, for example endurance towards noise, low computational cost to produce a model, and the ability to handle excessive features (Rokach and Maimon, 2005). DT classifiers are also considered to be very useful, efficient, and commonly used to deal with data mining classification problems (Farid et al., 2014).

One of the weaknesses of DT, related to the availability of training data with weak predictive values, can be overcome by the application of ensemble techniques. The ensemble method is a learning algorithm that is developed from several classification or predictive models. Lately, computing applications in biology have seen increased use of ensemble learning methods because of their unique advantages in handling small sample sizes, high dimensions, and complex data structures (Yang et al., 2010). However, ideally the availability of data and variation is needed for better accuracy, because the amount of variation in the determinant attributes of the classification contributes to the accuracy value of the prediction models formed in an ensemble (Bonab and Can, 2017). Two methods commonly used in ensemble techniques are boosting and bagging. The boosting method is in the form of repeated weighting of the predictor. The boosting method used […]
Figure 1. Stages of research on the implementation of the ensemble method on the DNA sequence dataset
The data of this study is taken from GenBank 64.1 (ftp://genbank.bio.net). The dataset "Primate splice-junction gene sequences (DNA) with associated imperfect domain theory" is a DNA sequence from primates in the form of splice-junction sequences (Lichman, 2013). The data downloaded from the UCI machine learning repository is a set of nucleotide sequences labeled with the splice exon-intron category, the opposite intron-exon category, and the neither category.

Data pre-process

The initial stage is to pre-process the data, which includes data acquisition, coding into numerical values, conversion to a matrix, and distribution of training and test data. At the stage of data acquisition, the DNA sequence dataset compression file is downloaded via the internet at the address https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/splice-junction-gene-sequences/splice.data.Z.

Table 1. Dataset description

The data is then divided into training and test data. The training data comprises 75% of the overall data, namely 2,392 records, divided proportionally by the number of categories. The remaining 798 records, or 25%, are used as test data.
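As an illustration of this acquisition and splitting stage, the sketch below (Python, not the authors' original code) reads the uncompressed copy of the same UCI file and performs the proportional 75/25 split; the URL of the uncompressed file and the column names are assumptions.

    # Sketch of data acquisition and the 75/25 proportional split (illustrative).
    import urllib.request

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Uncompressed copy of the dataset in the same UCI directory (assumption:
    # the paper downloads the .Z-compressed file instead).
    URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
           "molecular-biology/splice-junction-gene-sequences/splice.data")

    rows = []
    for raw in urllib.request.urlopen(URL):
        category, _name, sequence = raw.decode("ascii").split(",")
        rows.append((category.strip(), sequence.strip()))
    df = pd.DataFrame(rows, columns=["category", "sequence"])

    # 75% training / 25% test, divided proportionally by category (stratified),
    # which reproduces the 2,392 / 798 record split reported in the paper.
    train_df, test_df = train_test_split(
        df, test_size=0.25, stratify=df["category"], random_state=42)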
The variables in the DNA sequence consist of the group of categories intron-exon (IE), Neither (N), and exon-intron (EI), while the nucleotide sequence consists of adenine (A), cytosine (C), guanine (G), and thymine (T). The DNA sequence codes and categories were then converted into number values because XGBoost requires data in numerical form. There are no special requirements for this coding; the important thing is that the values of the nucleotide code features and labels are unique. The codification is shown in Table 2.

The EI category value is converted to 0, the N category to 2, and the IE category to 1. The values of the nucleotides adenine, cytosine, guanine, and thymine, which are clearly defined, are converted to 3, 4, 5, and 6. In a nucleotide sequence, not all base types can be clearly defined, but such ambiguous nucleotides have characters that characterize them; these are converted to number 8. Nucleotides which may be cytosine or guanine, coded "S", are converted to a value of 9, whereas nucleotides which may be adenine or guanine, coded "R", are converted to number 0. Only a small percentage of base types were not clearly identified, so the classification process was not affected. After making sure the dataset has been converted into number values and no missing values are found, the data needs to be converted into a matrix.

Table 2. Codification to number

    Code  Information       Conversion
    EI    Exon – Intron     0
    IE    Intron – Exon     1
    N     (Neither)         2
    A     Adenine           3
    C     Cytosine          4
    G     Guanine           5
    T     Thymine           6
    D     A or G or T       7
    N     A or G or C or T  2
    S     C or G            8
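The Table 2 codification can be expressed as a simple lookup, sketched below under the assumption that each 60-base sequence becomes one row of the numeric matrix (the names encode, LABEL_CODE, and BASE_CODE are illustrative, not from the paper; "S" is mapped as in Table 2, and "R" as in the text above).

    # Numeric codification following Table 2 (illustrative sketch).
    import numpy as np

    LABEL_CODE = {"EI": 0, "IE": 1, "N": 2}
    BASE_CODE = {"A": 3, "C": 4, "G": 5, "T": 6, "D": 7, "N": 2, "S": 8, "R": 0}

    def encode(frame):
        """Turn each 60-base sequence into one row of a numeric matrix."""
        X = np.array([[BASE_CODE[base] for base in seq]
                      for seq in frame["sequence"]])
        y = frame["category"].map(LABEL_CODE).to_numpy()
        return X, y

    X_train, y_train = encode(train_df)
    X_test, y_test = encode(test_df)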
Data classification uses the ensemble method, which is a learning algorithm built from several models of classification or prediction. The most commonly used ensemble techniques are boosting and bagging.

Bagging is a method built in an ensemble for stability and good accuracy in classification and regression. To prevent overfitting, the variance is reduced, usually in the form of decision trees with the application of the average value of the generated models. The concept is to take a data sample D of size n and then produce m new training sets, where each set of size n is drawn at random from D with replacement. Classifications are made based on these m samples. Each record has a probability of (1 − 1/n)^n of not being drawn into a given set, and such left-out records can be used as test data.

Random forest is a classification algorithm developed from the classification and regression tree (CART) method. This method optimizes the estimation process by bagging. A random forest is formed from many decision trees trained on sample data. Before tree formation, a random feature selection stage is carried out. The results of all the trees are evaluated through voting. The basic concept of random forest is the implementation of the bootstrap aggregating (bagging) method.

Boosting is an ensemble method which proceeds sequentially. The method combines weak predictor models to produce better predictive accuracy. In each iteration, models result from the previous weighting process. Data predicted incorrectly in the previous round is classified as "difficult" data and is emphasized in the next prediction process, so that the accuracy value reaches a maximum point. After the whole prediction process is carried out, all models are merged into a more reliable complex predictor. The stages of this learning process are prediction for regression, calculation of the residual errors, and a learning process on the residuals.

One of the forms of ensemble implementation by boosting is gradient boosting […]
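As a numeric sanity check of the bootstrap concept above, the short sketch below (illustrative, not from the paper) draws bootstrap samples and compares the empirical left-out fraction with (1 − 1/n)^n ≈ 1/e ≈ 0.368.

    # A given record is absent from one bootstrap sample of size n drawn with
    # replacement with probability (1 - 1/n)^n.
    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 1000, 200                 # original sample size, number of bootstrap sets

    absent = 0
    for _ in range(m):
        sample = rng.integers(0, n, size=n)    # n draws with replacement
        absent += n - len(np.unique(sample))   # records left out of this set

    print(absent / (n * m))                    # empirical fraction, about 0.368
    print((1 - 1 / n) ** n)                    # theoretical (1 - 1/n)^n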
For illustration, Figure 2 shows the mechanism of a […]

RESULTS AND DISCUSSION

The training process is carried out over a range of numbers of trees, between 30 and 130. This range was obtained from initial testing by measuring the error level, logloss and Mean Square Error (MSE), at a certain point whose graph is […]

Accuracy level analysis for XGBoost and Random Forest

Models of xgboost and random forest were produced, each with a different number-of-trees parameter (nround). The next stage is testing all the models built with the test data prepared during the data pre-process stage.
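The training runs over this tree range might look like the following sketch (the step size is an assumption, and X_train, y_train, X_test, y_test come from the pre-processing sketches above):

    # Train both methods over the examined tree range with otherwise default
    # parameters, then score each model on the held-out test data.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from xgboost import XGBClassifier

    for n_trees in range(30, 131, 10):
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        rf.fit(X_train, y_train)

        xgb = XGBClassifier(n_estimators=n_trees)  # "nround" in the R interface
        xgb.fit(X_train, y_train)

        print(n_trees,
              accuracy_score(y_test, rf.predict(X_test)),
              accuracy_score(y_test, xgb.predict(X_test)))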
Figure 3. Accuracy level of both models by number of trees with default parameters
The resulting values show the accuracy level of each model built using the default parameters with various numbers of trees. The average level of accuracy of random forest is 0.92, while that of xgboost is 0.95. The accuracy level of both methods on the splice junction sample dataset is relatively high. Reconfiguration was done only for the number of trees, while no adjustment was made to the other parameters, so the accuracy value is estimated not to change significantly. To increase the accuracy value, hyperparameter tuning was carried out on both methods.

Optimization of Hyperparameter Tuning by Grid Search

At this stage, analysis is conducted to obtain sequential patterns to be tested. A pattern in the form of a grid allows the appropriate hyperparameter formulation for the appropriate accuracy level.

XGBoost Hyperparameter Tuning

The hyperparameters to be configured for xgboost are the depth of the tree (max_depth), the minimum weight of a child (min_child_weight), the subsample ratio for the training process (subsample), and the subsample ratio of columns when building each tree (colsample_bytree). A default value is set for the other hyperparameters. Other hyperparameters that can be adjusted include the number of iterations (nround), the regularization value (gamma), and the learning rate (eta).

The hyperparameter search was conducted manually in 168 trials with various configurations. The best result obtained was 96.24%. The hyperparameter configurations used are displayed in Table 3.

Table 3. XGBoost hyperparameter configuration

    No  Hyperparameter    Value   Tuning mechanism
    1   nrounds           80      manual
    2   eta               0.2     manual
    3   gamma             0       manual
    4   max_depth         5       manual
    5   min_child_weight  5       manual
    6   subsample         0.4     manual
    7   colsample_bytree  1       manual
    8   boost_type        gbtree  fixed
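Expressed through the xgboost Python interface, the Table 3 configuration corresponds roughly to the sketch below (the R-style name nrounds maps to n_estimators; this is an illustration, not the authors' code):

    # The Table 3 hyperparameter configuration (gbtree booster kept fixed).
    from sklearn.metrics import accuracy_score
    from xgboost import XGBClassifier

    model = XGBClassifier(
        booster="gbtree",        # fixed boosting type
        n_estimators=80,         # nrounds
        learning_rate=0.2,       # eta
        gamma=0,                 # regularization value
        max_depth=5,
        min_child_weight=5,
        subsample=0.4,
        colsample_bytree=1,
    )
    model.fit(X_train, y_train)

    # The paper reports 96.24% for its tuned configuration.
    print(accuracy_score(y_test, model.predict(X_test)))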
From the testing, the comparison of the accuracy levels of both methods, both with default values and with tuned hyperparameters, is shown in Figure 6. From this figure, it can be concluded that the xgboost method is superior to random forest. Even after random forest tuning is conducted, the level of accuracy obtained cannot exceed that of xgboost with default values.
Mechanism Comparison Analysis

The bagging and boosting methods of the ensemble concept are different. Their general similarity is the use of more than one classifier in their processes. Both methods have advantages and disadvantages. This study, which uses a small dataset sample, indicates that xgboost is superior to random forest. Referring to several literature studies, the differences between the ensemble concepts of bagging and boosting are summarized in Table 4.

Table 4. Comparison analysis of xgboost and random forest in this study

    Aspect                          XGBoost                  Random Forest
    Process mechanism               sequential               parallel
    Number of hyperparameters       More than 5              Only 2
    Training mechanism              Uses all data with       Uses random
                                    residue optimization     subsamples
    Ensemble mechanism              boosting                 bagging
    Use of a large number of trees  Tends to overfit         More robust
    Type of tree built              Shallow trees            Deep trees

CONCLUSIONS

This study shows that the ensemble methods, both boosting and bagging, are able to handle the classification well when the hyperparameters are appropriately determined. The accuracy level of xgboost is overall superior. However, the drawback of xgboost is that its training process takes more time to complete because, within that process, the trees are built sequentially. The study also finds that it is more difficult to carry out hyperparameter tuning for xgboost. In addition, xgboost is more sensitive, so when there is too much dirty data and too many outliers, overfitting may occur.

In random forest, the training process of each tree is carried out independently, with random data samples. This randomization increases the model's resistance and reduces overfitting of the training data. The advantage of this model is the ease of parameter tuning compared to XGBoost. The configuration process only requires two parameters, namely the number of trees and the number of features to be selected for each node. One of the disadvantages of the random forest method is the large number of trees built, resulting in a longer processing time for real-time implementation.
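The two-parameter tuning described above maps to the following sketch, using the scikit-learn names n_estimators (number of trees) and max_features (features considered per split, mtry in the R randomForest package); the grid values are illustrative, not from the paper.

    # Random forest tuning only needs the number of trees and the number of
    # features considered at each split.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 80, 110],
                    "max_features": [4, 8, 16]},
        cv=5,
    )
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.best_score_)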
Further research is suggested to use more complex and larger DNA sequence datasets in order to find out the actual performance of XGBoost on DNA sequence patterns related to splice acceptors and donors. Outlier data may be removed so that models with more optimal values may be obtained.

Optimization may also be performed by searching for the most ideal hyperparameter configuration using random search. It is expected that hyperparameter values which are not included in the grid search pattern range can be found, so that those configuration values can be used in the models and possibly result in better accuracy.

REFERENCES

Barros, R. C., Basgalupp, M. P., de Carvalho, A. C., & Freitas, A. A. (2012, July). A hyper-heuristic evolutionary algorithm for automatically designing decision-tree algorithms. In Proceedings of the 14th annual conference on Genetic and evolutionary computation (pp. 1237-1244). ACM.

Bonab, H. R., & Can, F. (2017). Less is more: a comprehensive framework for the number of components of ensemble classifiers. arXiv preprint arXiv:1709.02925.

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).

Dietterich, T. G. (2000, June). Ensemble methods in machine learning. In International workshop on multiple classifier systems (pp. 1-15). Springer, Berlin, Heidelberg.

Farid, D. M., Zhang, L., Rahman, C. M., Hossain, M. A., & Strachan, R. (2014). Hybrid decision tree and naïve Bayes classifiers for multi-class classification tasks. Expert Systems with Applications, 41(4), 1937-1946.

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189-1232.

Libbrecht, M. W., & Noble, W. S. (2015). Machine learning applications in genetics and genomics. Nature Reviews Genetics, 16(6), 321.

Lichman, M. (2013). UCI machine learning repository.

Lo, C., Kakaradov, B., Lokshtanov, D., & Boucher, C. (2014). SeeSite: characterizing relationships between splice junctions and splicing enhancers. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(4), 648-656.

Sun, Z., Sang, L., Ju, L., & Zhu, H. (2008). A new method for splice site prediction based on sequence patterns of splicing signals and regulatory elements. Chinese Science Bulletin, 53(21), 3331.

Yang, P., Hwa Yang, Y., B Zhou, B., & Y Zomaya, A. (2010). A review of ensemble methods in bioinformatics. Current Bioinformatics, 5(4), 296-308.