
JPPI Vol 9 No 1 (2019) 27-36

Jurnal Penelitian Pos dan Informatika

32a/E/KPT/2017

e-ISSN: 2476-9266
p-ISSN: 2088-9402

DOI: 10.17933/jppi.2019.090103

COMPARISON ANALYSIS OF ENSEMBLE TECHNIQUES WITH BOOSTING (XGBOOST) AND BAGGING (RANDOM FOREST) FOR CLASSIFYING SPLICE JUNCTION DNA SEQUENCE CATEGORIES

Iswaya Maalik S, Wisnu Ananta Kusuma, Sri Wahjuni

Department of Computer Science, Faculty of Mathematics and Natural Sciences, Institut Pertanian Bogor
[email protected]

Received: 31 October 2018; Revised: 4 March 2019; Accepted: 5 August 2019

Abstract
Bioinformatics research is currently undergoing rapid growth, supported by developments in computational technology and algorithms. The ensemble decision tree is a common method for classifying large and complex datasets such as DNA sequences. Combining the implementation of two classification methods, XGBoost and Random Forest, with the ensemble technique might improve the accuracy of classifying DNA sequence splice junction types. With 96.24% accuracy for XGBoost and 95.11% for Random Forest, the study suggests that both methods, given the right parameter settings, are highly effective tools for classifying DNA sequence datasets. Analyzing both methods along with their characteristics gives an overview of how they work to meet the needs of DNA splicing research.

Keywords: DNA splice site junction, ensemble technique, extreme gradient boosting, grid search hyperparameter optimization, random forest.


INTRODUCTION

Research in the fields of genome and genetics is facilitated by computational technology and machine learning algorithms. Machine learning (ML) uses machines to learn and recognize patterns in order to make classifications and even predictions. The high level of accuracy makes it easy for researchers to evaluate an experiment immediately and precisely at low cost. This technology has been widely implemented in many fields related to genetics and genomics because it can interpret enormous genome datasets and has been used to describe a wide variety of features of genomic sequences (Libbrecht & Noble, 2015).

Biogenetic data is also related to the process of protein formation. There is a stage in protein synthesis where deoxyribonucleic acid (DNA) is copied into ribonucleic acid (RNA). The copy carries unnecessary information toward the final product, so this form of RNA is considered immature. Such information must be removed in order to produce a functional product, and the RNA splicing process eliminates the information that is not needed. Exons are sequences of nucleotides that remain in the mature RNA, whereas introns are sequences that are removed. The classification of the data refers to two types of splicing categories, namely the acceptor and donor categories. The acceptor is the border between an intron and an exon, while the donor is the DNA sequence containing the border between an exon and an intron.

In the last decade, pattern recognition algorithms for splice site junctions have continued to develop. Among them are the weight matrix method (WMM), weight array method (WAM), maximal dependence decomposition (MDD), hidden Markov model (HMM), artificial neural network (ANN), and support vector machine (SVM), which have been widely applied and implemented in software (Sun et al., 2008).

One of the common methods used in ML is the decision tree (DT). A DT is able to extract information from a dataset into knowledge that is intuitive and easy to understand (Barros et al., 2012). DT algorithms have advantages over other learning algorithms, for example robustness to noise, low computational cost to produce a model, and the ability to handle redundant features (Rokach and Maimon, 2005). DT classifiers are also considered very useful and efficient, and are commonly used to deal with data mining classification problems (Farid et al., 2014).

One of the weaknesses of DTs, its sensitivity to training data with weak predictive value, can be overcome by applying ensemble techniques. The ensemble method is a learning algorithm built from several classification or prediction models. Lately, computing applications in biology have seen increased use of ensemble learning because of its advantages in handling small sample sizes, high dimensionality, and complex data structures (Yang et al., 2010). Ideally, however, sufficient data and variation are needed for better accuracy, because the variation of determinant attributes in the classification contributes to the accuracy of the prediction models formed in an ensemble (Bonab and Can, 2017). Two methods commonly used in ensemble techniques are boosting and bagging.

The boosting method takes the form of repeated weighting of the predictor. The boosting method used here is gradient boosting (GB), that is, boosting by gradient descent. GB was first introduced by Friedman (2001); one of its improved descendants is extreme gradient boosting (xgboost) by Chen and Guestrin (2016). This algorithm is very popular and often wins ML competitions held by Kaggle.

The ensemble concept with bagging works by combining many prediction values into one. One of the advantages of bagging is that it can reduce the prediction errors generated by a single DT. Random forest (RF) is one of the DT methods that employs the bagging concept. RF uses randomly chosen candidate predictors in each tree during the training process, and a vote is taken over all the trees formed.

The two ensemble techniques are implemented on DNA sequences obtained from the UCI machine learning repository. Parameter tuning is carried out to improve the accuracy of the ML models. The results of the implementation of both methods are then analyzed in terms of their performance. It is expected that this analysis can give an idea of how the working mechanisms of these methods in a real implementation could assist research in the field of DNA splicing.

METHODOLOGY

This study compares tests on the models built using each method. Models were built on a computer with an Intel quad-core processor, 8 GB of memory, and the Microsoft Windows 10 operating system. The software used to build the models is the R programming language with the caret, dplyr, xgboost, and randomForest packages. Datasets were managed using the Notepad++ editor.
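As a point of reference, the environment described above can be sketched in a few lines of R; the package list follows this section, while the seed value is an illustrative assumption rather than part of the original setup.

```r
# Minimal R setup sketch for the pipeline described in this section
library(caret)         # training utilities and hyperparameter grids
library(dplyr)         # data manipulation
library(xgboost)       # boosting (extreme gradient boosting)
library(randomForest)  # bagging (random forest)

set.seed(42)           # assumed seed; makes sampling and splits repeatable
```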
The study is carried out in three main stages: pre-processing; the implementation of the ensemble techniques to form models, trained with the default parameters of each method; and comparison of the results and performance on test data. Evaluation is carried out by repeating the training and testing process several times with various configurations of the number of iterations or trees built. Optimization is also performed on parameters other than the number of iterations or trees, using the grid search method in a greedy manner to obtain the configuration with maximum accuracy. The last step is to analyze the processing time and accuracy of each model built. To obtain more in-depth information about the working mechanism of the ML methods, literature studies of related journals and papers were carried out. The stages of this study are illustrated in the chart below (Figure 1).


Figure 1. Stages of the research on implementing the ensemble methods on the DNA sequence dataset

The data for this study are taken from GenBank 64.1 (ftp://genbank.bio.net). The dataset "Primate splice-junction gene sequences (DNA) with associated imperfect domain theory" consists of primate DNA sequences around splice junctions (Lichman, 2013). The data downloaded from the UCI machine learning repository are nucleotide sequences labeled with the splice categories exon-intron, the opposite intron-exon, and neither.

Data pre-processing

The initial stage is to pre-process the data, which includes data acquisition, coding into numerical values, conversion to a matrix, and division into training and test data. At the data acquisition stage, the compressed DNA sequence dataset is downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/splice-junction-gene-sequences/splice.data.Z.

Table 1. Dataset description

Dataset characteristics | Number of attributes | Number of classes | Number of instances | Missing values
Sequential | 61 | 3 | 3,190 | none

The data are extracted and converted into CSV format. The data are then divided into training and test data. The training data are 75% of the overall data, 2,392 records, divided proportionally over the categories. The remaining 798 records (25%) are used as test data.
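A hedged sketch of this acquisition and split step is given below; the column names and the use of caret's stratified partitioning are assumptions consistent with the proportional division described above, not the authors' original script.

```r
# Read the decompressed UCI splice data; fields are assumed to be
# class (EI/IE/N), an identifier, and the 60-base sequence string
splice <- read.csv("splice.data", header = FALSE,
                   col.names = c("class", "id", "sequence"),
                   strip.white = TRUE)

# 75% training / 25% test, stratified by category as described above
library(caret)
idx   <- createDataPartition(splice$class, p = 0.75, list = FALSE)
train <- splice[idx, ]   # about 2,392 records
test  <- splice[-idx, ]  # about 798 records
```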


The variables in a DNA sequence record consist of a category label, one of intron-exon (IE), neither (N), or exon-intron (EI), and the nucleotide sequence of adenine (A), cytosine (C), guanine (G), and thymine (T). The DNA sequence codes and categories were then converted into numeric values because XGBoost requires data in numerical form. There are no special requirements for the coding; the important thing is that the values for the nucleotide codes and the labels are unique. The codification is shown in Table 2.

The EI category is converted to 0, the IE category to 1, and the N category to 2. The clearly defined nucleotides adenine, cytosine, guanine, and thymine are converted to 3, 4, 5, and 6. In a nucleotide sequence not every base can be determined unambiguously, but such positions carry ambiguity codes that constrain the possible nucleotide types. Nucleotides coded "D", which may be adenine, guanine, or thymine, are converted to 7. The code "N", which may match any of the four base types, is converted to 8. Nucleotides that may be cytosine or guanine, coded "S", are converted to 9, while those coded "R", which may be adenine or guanine, are converted to 0. Only a small percentage of base types are not clearly identified, so the classification process is not affected. After making sure the dataset has been converted into numeric values and no missing values remain, the data are converted into a matrix.

Table 2. Codification to numbers

Code | Information | Conversion
EI | Exon-intron | 0
IE | Intron-exon | 1
N | (Neither) | 2
A | Adenine | 3
C | Cytosine | 4
G | Guanine | 5
T | Thymine | 6
D | A or G or T | 7
N | A or C or G or T | 8
S | C or G | 9
R | A or G | 0
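The codification in Table 2 can be expressed as a simple lookup; the helper name and matrix layout below are illustrative assumptions, while the mapping values follow the table.

```r
# Numeric codification per Table 2
label_map <- c(EI = 0, IE = 1, N = 2)            # category labels
base_map  <- c(A = 3, C = 4, G = 5, T = 6,       # unambiguous bases
               D = 7, N = 8, S = 9, R = 0)       # ambiguity codes

encode_sequence <- function(seq_string) {
  bases <- strsplit(as.character(seq_string), "")[[1]]
  unname(base_map[bases])        # one numeric value per position
}

# Matrix form required by XGBoost: one row per record, one column per base
X_train <- t(sapply(train$sequence, encode_sequence))
y_train <- unname(label_map[as.character(train$class)])
```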


Data classification uses the ensemble method, a learning algorithm built from several classification or predictor models. The most commonly used ensemble techniques are boosting and bagging.

Bagging, or bootstrap aggregating, is an ML method built as an ensemble for stability and good accuracy in classification and regression. To prevent overfitting, variance is reduced, usually with decision trees whose generated models are averaged. The concept is to take a data sample D of size n and produce m new training sets, each of size n, drawn at random from D with replacement. Classifications are made based on these m samples. Each record has a probability of (1 − 1/n)^n of being left out of a given sample and thus being available as test data.
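A toy illustration of this bootstrap step follows, under the assumption that `train` is the training set built earlier; the number of resamples m is arbitrary here.

```r
# Bootstrap aggregating: m resamples of size n drawn with replacement
n <- nrow(train)
m <- 25   # illustrative number of bootstrap samples

bootstrap_samples <- lapply(seq_len(m), function(i) {
  train[sample(n, size = n, replace = TRUE), ]
})

# Roughly (1 - 1/n)^n (about 1/e, 0.368) of the records are left out of
# each resample and can serve as out-of-bag test data:
mean(!(seq_len(n) %in% sample(n, n, replace = TRUE)))
```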

Random forest is a classification algorithm developed from the classification and regression tree (CART) method. It optimizes the estimation process by bagging. A random forest is formed from many decision trees trained on sampled data. Before tree formation, a random feature selection stage is carried out. The results of all the trees are then evaluated through voting. The basic concept of random forest is thus the implementation of the bootstrap aggregating (bagging) method.

Boosting is an ensemble method that works sequentially. It combines weak predictor models to produce better predictive accuracy. In each iteration, models result from the weighting carried out in the previous round. Boosting focuses the new learning process on data that received low accuracy in the previous round and proceeds as a sequential training process. Data predicted incorrectly in the previous round are classified as "difficult" and are emphasized in the next round so that the accuracy approaches a maximum. After the whole process is complete, all models are merged. Boosting transforms weak predictor models into a reliable, complex predictor. The stages of this learning process are predicting (for regression), calculating the residual errors, and learning to fit those residuals.
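These three stages can be illustrated with a toy residual-fitting loop; the sketch below uses rpart regression stumps as the weak learners and is a generic illustration of boosting for regression, not the authors' implementation.

```r
# Toy boosting loop: predict, compute residuals, fit the residuals
library(rpart)

gb_fit <- function(X, y, n_iter = 10, eta = 0.2) {
  pred   <- rep(mean(y), length(y))   # stage 1: initial prediction
  models <- vector("list", n_iter)
  for (i in seq_len(n_iter)) {
    residual <- y - pred              # stage 2: errors of the residue
    fit <- rpart(residual ~ ., data = data.frame(X, residual),
                 control = rpart.control(maxdepth = 2))
    pred <- pred + eta * predict(fit, data.frame(X))  # stage 3: learn residue
    models[[i]] <- fit
  }
  list(models = models, train_pred = pred)
}
```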


One form of ensemble implementation by boosting is gradient boosting (GB). GB is a regression and classification algorithm that applies the ensemble concept to weak predictors and generally uses decision trees. Optimization is carried out through boosting by minimizing the value of a loss function. Gradient boosting combines weak predictors iteratively by minimizing the mean squared error of the model, that is, the error (ŷ − y) of a model F with prediction ŷ = F(x). Each iteration produces a collection of hypotheses, forming the model and producing the predictive value.

For illustration, Figure 2 shows the mechanism of single DT development that the ensemble methods, bagging and boosting, build on in an effort to obtain better accuracy.

Figure 2. Ensemble on decision trees

RESULTS AND DISCUSSION

The training process is carried out over a range of tree counts between 30 and 130. This range was obtained from initial testing, by measuring the log-loss and mean square error (MSE) up to the point where the curve becomes relatively stable.

The processing time is directly proportional to the number of trees: the more trees to be grown, the longer the classification process takes. The xgboost method needs a longer processing time than random forest, because the xgboost mechanism operates sequentially while random forest operates in parallel.

Accuracy analysis of the XGBoost and random forest test process

After the training process was conducted on the training data, approximately 100 models each of xgboost and random forest were produced, each with a different number of trees or nrounds. The next stage is testing all the models built with the test data prepared during the pre-processing stage.
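A sketch of these default-parameter runs is given below; the step size of the tree range, the multi-class objective, and the accuracy computation are assumptions consistent with the three-class task, and X_test/y_test are assumed to be encoded like the training data.

```r
# Train both methods with default parameters over 30-130 trees/rounds
library(xgboost)
library(randomForest)

tree_range <- seq(30, 130, by = 10)   # assumed step size
accuracies <- sapply(tree_range, function(k) {
  xgb <- xgboost(data = X_train, label = y_train, nrounds = k,
                 objective = "multi:softmax", num_class = 3, verbose = 0)
  rf  <- randomForest(x = X_train, y = factor(y_train), ntree = k)

  c(xgb = mean(predict(xgb, X_test) == y_test),
    rf  = mean(as.character(predict(rf, X_test)) == as.character(y_test)))
})
```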


Figure 3. Accuracy of both models by number of trees, with default parameters

The resulting values show the accuracy of each model built with the default parameters and various tree counts. The average accuracy of random forest is 0.92, while xgboost reaches 0.95. The accuracy of both methods on the splice junction dataset is relatively high. Only the number of trees was reconfigured here, with no adjustment to the other parameters, and with tree count alone the accuracy is not expected to change significantly. To increase accuracy further, hyperparameter tuning was carried out on both methods.

Hyperparameter optimization by grid search

At this stage, an analysis is conducted to obtain the sequence of configurations to be tested. A grid-shaped pattern allows hyperparameter combinations to be formulated systematically for the desired accuracy.

XGBoost hyperparameter tuning

The hyperparameters configured for xgboost are the tree depth (max_depth), the minimum child weight (min_child_weight), the subsample ratio of the training instances (subsample), and the subsample ratio of columns when building each tree (colsample_bytree). Default values are set for the other hyperparameters. Other hyperparameters that can be adjusted include the number of iterations (nrounds), the regularization value (gamma), and the learning rate (eta).

The hyperparameter search was conducted manually in 168 trials with various configurations. The best result obtained was 96.24% accuracy. The hyperparameter configuration used is displayed in Table 3, and a sketch of such a search follows the table.

Table 3. XGBoost hyperparameter configuration

No | Hyperparameter | Value | Tuning mechanism
1 | nrounds | 80 | manual
2 | eta | 0.2 | manual
3 | gamma | 0 | manual
4 | max_depth | 5 | manual
5 | min_child_weight | 5 | manual
6 | subsample | 0.4 | manual
7 | colsample_bytree | 1 | manual
8 | boost_type | gbtree | fixed
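A caret-based version of this search might look as follows; the grid ranges are illustrative, and only the winning values in Table 3 come from the paper.

```r
# Grid search over the xgboost hyperparameters listed in Table 3
library(caret)

grid <- expand.grid(nrounds          = c(60, 80, 100),
                    eta              = c(0.1, 0.2, 0.3),
                    gamma            = 0,
                    max_depth        = c(3, 5, 7),
                    min_child_weight = 5,
                    subsample        = c(0.4, 0.8),
                    colsample_bytree = 1)

ctrl      <- trainControl(method = "cv", number = 5)  # assumed resampling
xgb_tuned <- train(x = X_train, y = factor(y_train),
                   method = "xgbTree", trControl = ctrl, tuneGrid = grid)
xgb_tuned$bestTune   # best configuration found by the search
```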


Figure 4. Accuracy of xgboost models with various parameter configurations

Figure 4 displays the test results of the generated xgboost models. The graph shows large swings in accuracy: inappropriate hyperparameter settings result in prediction accuracy far below the accuracy obtained during testing with default values. With more than five hyperparameters in combination, xgboost is fairly difficult to adjust toward the configuration that yields maximum accuracy.

Random forest hyperparameter tuning

The hyperparameters configured in random forest are only the number of trees (ntree) and the number of features sampled at each split (mtry), so determining the hyperparameters is faster. From Figure 5 it can be seen that the optimum value is generated by the configuration with an ntree value of 905 and an mtry of 5; a sketch of this configuration follows the figure. The name of each model in Figure 5 refers to its hyperparameter configuration in terms of the values of m (mtry) and n (ntree).

Figure 5. Accuracy of RF models built with various parameter configurations
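The reported optimum can be reproduced in outline as below, assuming the encoded matrices from the pre-processing stage.

```r
# Random forest with the configuration reported as optimal (ntree = 905,
# mtry = 5); only these two hyperparameters are tuned
library(randomForest)

rf_best <- randomForest(x = X_train, y = factor(y_train),
                        ntree = 905, mtry = 5)
mean(as.character(predict(rf_best, X_test)) == as.character(y_test))  # accuracy
```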

Best technique analysis

The comparison of the accuracy of both methods, with default values and with tuned hyperparameters, is shown in Figure 6. From this figure it can be concluded that the xgboost method is superior to random forest. Even after random forest tuning, the accuracy obtained cannot exceed that of xgboost with default values.


Figure 6. Best accuracy of the models built

Mechanism comparison analysis

The bagging and boosting methods differ in their ensemble concept. Their common feature is the use of more than one classifier in the process. Both methods have advantages and disadvantages. This study, which uses a small dataset sample, indicates that xgboost is superior to random forest. Drawing on several literature studies, the differences between the bagging and boosting ensemble concepts are summarized in Table 4.

Table 4. Comparison of xgboost and random forest in this study

 | XGBoost | Random Forest
Process mechanism | sequential | parallel
Number of hyperparameters | more than 5 | only 2
Training mechanism | uses all data with residual optimization | uses random subsamples
Ensemble mechanism | boosting | bagging
Use of a large number of trees | tends to overfit | more robust
Type of decision tree | shallow trees | deep trees

CONCLUSIONS

This study shows that both the boosting and the bagging ensemble methods are able to handle the classification well when the hyperparameters are appropriately determined. The accuracy of xgboost is superior overall. However, the drawback of xgboost is that its training process takes more time to complete, because within that process the trees are built sequentially. The study also finds that it is more difficult to carry out hyperparameter tuning for xgboost. In addition, xgboost is more sensitive, so when there is too much dirty data and too many outliers, overfitting may occur.

In random forest, the training process of each tree is carried out independently, on a random data sample. This randomization increases the models' robustness and reduces overfitting of the training data. The advantage of this model is the ease of parameter tuning compared to XGBoost: the configuration process only requires two parameters, namely the number of trees and the number of features to be selected at each node. One disadvantage of the random forest method is that the large number of trees built results in longer processing time in real-time implementations.

Further research is suggested on more complex and massive DNA sequence datasets, in order to find out the actual performance of XGBoost on DNA sequence patterns related to splice acceptors and donors. Outlier data may be removed so that models with better accuracy may be obtained. Optimization may also be performed by searching for the most ideal hyperparameter configuration using random search. Hyperparameter values that are not included in the grid search pattern may then be found, and these configurations may result in better accuracy; one possible sketch of such a search is given below.
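The random search suggested above could be sketched with caret's built-in random search mode; the fold count and number of sampled configurations are assumptions, not part of the study.

```r
# Random search over xgboost hyperparameters instead of a fixed grid
library(caret)

ctrl     <- trainControl(method = "cv", number = 5, search = "random")
xgb_rand <- train(x = X_train, y = factor(y_train),
                  method = "xgbTree", trControl = ctrl,
                  tuneLength = 30)   # sample 30 random configurations
```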


REFERENCES

Barros, R. C., Basgalupp, M. P., de Carvalho, A. C., & Freitas, A. A. (2012). A hyper-heuristic evolutionary algorithm for automatically designing decision-tree algorithms. In Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation (pp. 1237-1244). ACM.

Bonab, H. R., & Can, F. (2017). Less is more: A comprehensive framework for the number of components of ensemble classifiers. arXiv preprint arXiv:1709.02925.

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).

Dietterich, T. G. (2000). Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems (pp. 1-15). Springer, Berlin, Heidelberg.

Farid, D. M., Zhang, L., Rahman, C. M., Hossain, M. A., & Strachan, R. (2014). Hybrid decision tree and naïve Bayes classifiers for multi-class classification tasks. Expert Systems with Applications, 41(4), 1937-1946.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189-1232.

Libbrecht, M. W., & Noble, W. S. (2015). Machine learning applications in genetics and genomics. Nature Reviews Genetics, 16(6), 321.

Lichman, M. (2013). UCI machine learning repository.

Lo, C., Kakaradov, B., Lokshtanov, D., & Boucher, C. (2014). SeeSite: Characterizing relationships between splice junctions and splicing enhancers. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(4), 648-656.

Sun, Z., Sang, L., Ju, L., & Zhu, H. (2008). A new method for splice site prediction based on the sequence patterns of splicing signals and regulatory elements. Chinese Science Bulletin, 53(21), 3331.

Yang, P., Yang, Y. H., Zhou, B. B., & Zomaya, A. Y. (2010). A review of ensemble methods in bioinformatics. Current Bioinformatics, 5(4), 296-308.

