20.k1.0038 Proposal Project Report Kelar-1
ABSTRACT
Stroke is one of the most serious medical conditions and has a significant impact on
public health. Accurate prediction of stroke risk matters because it allows appropriate
treatment and intervention for individuals at risk of developing the disease. In recent years,
machine learning methods have become popular for improving stroke disease prediction.
This research applies the Adaboost method to the C4.5 and K-Nearest Neighbor (KNN)
algorithms with the aim of improving stroke prediction performance. Using a relevant dataset,
the C4.5 and KNN algorithms were first used separately to predict stroke disease; the
Adaboost method was then combined with each algorithm. The results show that applying
the Adaboost method to the C4.5 and KNN algorithms successfully improved the performance
of stroke disease prediction, providing more accurate and reliable predictions to assist in the
diagnosis and treatment of stroke. The combination of KNN with Adaboost reached 91%
and the combination of C4.5 with Adaboost reached 95%, a difference of 4%. Therefore,
C4.5 is the more effective base algorithm for improving stroke prediction performance.
TABLE OF CONTENTS
COVER
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
1.1. Background
1.2. Problem Formulation
1.3. Scope
1.4. Objective
CHAPTER 2 LITERATURE STUDY
CHAPTER 3 RESEARCH METHODOLOGY
3.1. Research Methodology
3.2. Dataset Collection
3.3. Pre-processing Data
3.3.1. Cleaning Data
3.3.2. Encoding Data
3.3.3. SMOTE Oversampling
3.4. Splitting Data
3.5. C4.5 Algorithm
3.6. K-Nearest Neighbor Algorithm
3.7. Adaptive Boosting Method
3.8. Evaluation
CHAPTER 4
4.1. Experiment Setup
4.2. Implementation
4.3. Result
4.3.1. Result C4.5 Algorithm
4.3.2. Result KNN Algorithm
4.3.3. Result Adaboost Method
4.3.4. Result C4.5 and Adaboost Combination
4.3.5. Result KNN and Adaboost Combination
4.3.6. Result Conclusion
4.4. Discussion
CHAPTER 5 CONCLUSION
REFERENCES
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1
INTRODUCTION
1.1. Background
Stroke is a significant global health problem, ranking as the second leading cause of death
worldwide and contributing substantially to high rates of disability. Indonesia, in particular, faces
a pressing challenge with increasing stroke cases and high mortality rates.[1] According to data
from the 2018 Riskesdas, North Sulawesi Province has the highest prevalence of stroke (14.2%),
while Papua Province has the lowest (4.1%).[2] In addition, according to the Centers for Disease
Control and Prevention (CDC), stroke is among the leading causes of death in the United States.
Stroke is a non-communicable disease that accounts for about 11% of all deaths, and more than
795,000 individuals in the United States experience the adverse effects of stroke.[3] The C4.5
algorithm can be used to predict or classify an event by forming a decision tree.[4] K-Nearest
Neighbor performs classification by considering the closest distance between new data and
existing data, starting by determining the number of nearest neighbors.[5] Adaboost is a
supervised algorithm in the field of data mining that is often used to develop classification
models.
With the development of medical technology, it has become possible to use machine
learning to forecast stroke events. Machine learning algorithms can produce accurate predictions
as well as careful analysis. Machine learning is widely applied to classification and optimisation
problems when building intelligent systems that support healthcare providers. Selecting the right
method for stroke symptom detection is important because it affects the results that will be
produced.[6]
The purpose of this research is to apply the Adaboost method to the C4.5 and K-Nearest
Neighbor algorithms in stroke disease classification in the hope of obtaining accurate
predictions. In the context of stroke disease classification, the C4.5 algorithm is used for the
construction of a decision tree model that can classify stroke symptoms into stroke or non-stroke
categories. The K-Nearest Neighbor algorithm measures the distance between new data and old
data and performs classification based on predetermined nearest neighbor values. The use of the
Adaboost method aims to improve the accuracy of the classification model by combining several
weak classification models into one stronger classification model. Accuracy is defined as the
degree of conformity between the predicted value and the actual value.[7] In addition, the results
of the tests should be analysed to see how effective each algorithm is.
1.2. Problem Formulation
1. Is the combination of the Adaboost method with the C4.5 algorithm effective in predicting
stroke disease?
2. Is the combination of the Adaboost method with the K-Nearest Neighbor algorithm effective
in predicting stroke disease?
3. Of the two combinations above, which one is more effective in prediction?
1.3. Scope
The dataset used is the Stroke Dataset from kaggle.com, which includes patient information
such as id, gender, age, hypertension, heart disease, marital history, work type, residence type,
average glucose level, BMI, smoking habits, and overall patient status (stroke or non-stroke). The
classification models were built by applying the Adaboost method to the C4.5 and KNN
algorithms. This research does not discuss risk factors or causes of stroke; it focuses only on
the classification of stroke symptoms to obtain accurate prediction results.
1.4. Objective
The main objective of this research is to show that the Adaboost method applied to the C4.5
and KNN algorithms provides higher performance for stroke disease classification, because the
Adaboost method is considered capable of improving the accuracy of several algorithms when
making predictions on various datasets. The results of this research can then be applied in the
health sector to assist health workers in classifying stroke symptoms and producing accurate
predictions.
CHAPTER 2
LITERATURE STUDY
In their research on the application of Adaboost to improve the performance of data mining
classification for diabetes, Novianti et al. [5] applied the K-Nearest Neighbor method as the main
algorithm for performance evaluation in a classification context. The test was carried out five
times with K values of 7, 13, 19, 25, and 31. For the KNN algorithm alone, the highest result was
obtained in the second test with 92.90% accuracy, while KNN with Adaboost achieved its highest
results in the first and second tests with the same accuracy of 95.40%. The Adaboost method thus
increased accuracy by 2.50%.
Research on the diagnosis of stroke risk levels by Puspitawuri et al. [6] used a dataset
consisting of both numerical and categorical attributes; the researchers applied the K-Nearest
Neighbor approach to the numerical data and the Naïve Bayes method to the categorical data.
The first test examined the effect of data distribution on balanced training classes using 30, 45,
and 60 records; for example, with 30 training records there were 10 records each in the low-,
medium-, and high-risk classes. The second test examined the effect of data distribution on
unbalanced training classes using 30, 45, and 60 records; for example, with 30 training records
there were 8 low-risk, 8 medium-risk, and 14 high-risk records. The results show that the highest
accuracy on the balanced class distribution reached 96.67%, achieved with 45 training records
and K values of around 15 to 22, while on the unbalanced classes the highest accuracy reached
100% with 60 training records and K = 20-30. The combined KNN and Naïve Bayes method is
therefore suitable for diagnosis because it produces accurate results.
The C4.5 Algorithm can optimise classification results to obtain the right accuracy. In
their research, Pambudi et al. [8] explained that their C4.5 decision tree model uses 23 rules,
comprising 14 rules for the non-stroke class and 9 rules for the stroke class. The researchers used
two main approaches, qualitative and quantitative, and tested the C4.5 decision tree algorithm
with confusion matrix measurements and AUC values; the C4.5 decision tree achieved a
prediction accuracy of 96.05%. Meanwhile, Rohman et al. [9] studied the prediction of heart
disease using an Adaboost-based C4.5 algorithm with iteration and attribute weighting. Of 867
patient records, 567 remained after preprocessing. Testing with K-Fold Cross Validation showed
that the Adaboost-based C4.5 algorithm provides higher accuracy (92.24%) than the C4.5
algorithm alone (86.59%), a difference of 5.65%, and evaluation using ROC curves showed a
higher AUC value (0.982) for the Adaboost-based C4.5 algorithm. These results imply that the
Adaboost-based C4.5 method is more effective in predicting heart disease.
Based on experiments using three data-split scenarios, Hermawan et al. [10] stated that
early prediction of stroke disease based on medical records using the Classification and
Regression Tree (CART) algorithm produced the highest accuracy of 89.83% in the split scenario
with 80% training data and 20% test data. Their analysis shows that the larger the training data,
the greater the accuracy obtained, because in the confusion matrix evaluation the true positive
and true negative values will be larger in the larger training scenario. This affects the accuracy
value, since the true positive value counts positive predictions that are correct and the true
negative value counts negative predictions that are correct; therefore, the greatest accuracy is
found in the largest training scenario.
In her research on hepatitis disease prediction, Buani [11] tested the prediction results of
the Naïve Bayes algorithm with genetic-algorithm feature selection. The test results show an
accuracy of 96.77%, a significant increase over previous research using the same data and
algorithm, which reported 83.71%. The difference of 13.06% between the two studies indicates
that the accuracy of the Naïve Bayes algorithm increases after feature selection using a genetic
algorithm.
Handayani et al. [12] explained that the decision tree algorithm has a greater true
positive value than the neural network algorithm. The C4.5 model is represented as a decision
tree; its construction starts by counting the positive and negative liver disease cases for each
class, based on the predetermined attributes in the training data, after which the total entropy is
calculated using the corresponding equations. The same training data is used for the neural
network model, but the attribute values are converted into numerical values. That model consists
of three layers: an input layer with ten neurons (nine for attributes and one as bias), one hidden
layer with eight neurons, and an output layer with two neurons that produce the Positive Liver
and Negative Liver predictions. The test results show that the C4.5 model has an accuracy of
75.56% and an AUC value of 0.898, while the neural network model has an accuracy of 74.1%
and an AUC value of 0.671. From these results, it can be concluded that the decision tree model
is more accurate than the artificial neural network model.
In their research journal on a coronary heart disease prediction system using the Naïve
Bayes method, Larassati et al. [13] used 303 data records consisting of 13 variables and 1 class.
Data processing involved cleaning, selection, and transformation, and the Naïve Bayes algorithm
was implemented to predict coronary artery disease. Performance was evaluated by measuring
prediction ability against the training data to obtain the accuracy of the applied method. The first
experiment used a 60% split, obtaining 177 training records and 119 test records, with an
accuracy of 83.1%. The second experiment, with a 70%/30% split, produced an accuracy of
82.02%, while the third, with an 80%/20% split, produced an accuracy of 81.6%. The three
experiments show that the amount of data significantly affects the accuracy rate and that the
Naïve Bayes algorithm can be applied to predict coronary artery disease from initial patient
examination data.
Based on the literature study above, the C4.5 and KNN algorithms are able to provide
high accuracy in classifying diseases. Moreover, research [5][9] showed that applying the
Adaboost method to the KNN and C4.5 algorithms can provide higher performance than the
KNN and C4.5 algorithms alone. Therefore, in this research the author will show that applying
the Adaboost method to the C4.5 and KNN algorithms provides higher performance in
classification for predicting stroke disease.
CHAPTER 3
RESEARCH METHODOLOGY
3.1. Research Methodology
To achieve good results, a structured research method is essential. The problem-solving
steps in this study are: dataset collection, data pre-processing (cleaning, encoding, and SMOTE
oversampling), data splitting, modeling with the C4.5 and K-Nearest Neighbor algorithms,
application of the Adaptive Boosting method, and evaluation.
3.2. Dataset Collection
The dataset used is the Stroke Prediction dataset taken from Kaggle. It consists of 43,401
observations with 12 attributes. The attributes used in this study are presented in Table 3.1
below.
Table 3.1 Dataset Attributes
No  Name               Information
1   id                 Patient ID
2   gender             Gender
3   age                Patient age
4   hypertension       Hypertension (high blood pressure)
5   heart_disease      Heart disease
6   ever_married       Marital history
7   work_type          Type of work
8   residence_type     Type of residence
9   avg_glucose_level  Average glucose level
10  bmi                Body mass index
11  smoking_status     Smoking status
12  stroke             Stroke label (prediction target)
3.3. Pre-processing Data
3.3.1. Cleaning Data
Data cleaning was carried out to remove duplicate records and empty (missing)
values, since duplicates and missing data can hinder data processing. Therefore, this
research needs to perform data cleaning.
3.3.2. Encoding Data
Categorical attributes are converted into numerical values so that the algorithms
can process them; in this study this encoding is done with a label encoder.
3.3.3. SMOTE Oversampling
The last pre-processing step is oversampling using SMOTE. This is done to balance
the class distribution of the "stroke" label, which has two values, stroke and non-stroke,
where the number of non-stroke records far exceeds the number of stroke records.
Oversampling makes the class counts equal and supports good accuracy.
3.4. Splitting Data
In splitting, the data is divided into two parts: training and testing. The training set
is the part of the dataset used to fit the machine learning algorithm, while the testing set
is the part used to measure its accuracy. In this research, the sklearn.model_selection
module is used.
3.5. C4.5 Algorithm
In this research, the classification method uses the C4.5 algorithm to analyze
stroke disease. Attribute selection assigns attributes as nodes, either root nodes or
internal nodes, based on the highest gain value of each attribute. Data processing with
the C4.5 algorithm involves calculating entropy values, calculating gain values, and
forming the decision tree and its corresponding rules. Equations (1) and (2) are used to
calculate the entropy and gain values. [8]
Entropy(S) = \sum_{i=1}^{n} -p_i \log_2(p_i)    (1)
Description:
S : set of cases
n : number of partitions of S
p_i : proportion of S_i to S
Gain(S, A) = Entropy(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|} \times Entropy(S_i)    (2)
Description:
S : set of cases
A : attribute
n : number of partitions of attribute A
|S_i| : number of cases in partition i
|S| : number of cases in S
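To make Equations (1) and (2) concrete, the following is a minimal Python sketch of the
entropy and gain calculations; the function names and the toy labels are illustrative only,
not part of the research code.

import math
from collections import Counter

def entropy(labels):
    # Equation (1): Entropy(S) = sum of -p_i * log2(p_i) over the classes
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain(labels, partitions):
    # Equation (2): Gain(S, A) = Entropy(S) - sum(|S_i|/|S| * Entropy(S_i))
    total = len(labels)
    return entropy(labels) - sum(
        (len(p) / total) * entropy(p) for p in partitions)

# Toy example: 10 cases split by a binary attribute into two partitions.
labels = ['stroke'] * 4 + ['non-stroke'] * 6
partitions = [['stroke'] * 3 + ['non-stroke'] * 1,
              ['stroke'] * 1 + ['non-stroke'] * 5]
print(round(gain(labels, partitions), 3))  # about 0.257: the attribute is informative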
3.6. K-Nearest Neighbor Algorithm
The K-Nearest Neighbor algorithm classifies new data based on its distance to
existing data, using a predetermined nearest-neighbor value K. The distance between
samples is calculated with the Euclidean distance in Equation (3). [5]
E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (3)
Description:
x_i : sample (training) data
y_i : testing data
n : data dimension
i : data variable (index)
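As a hedged illustration of Equation (3), the sketch below computes the Euclidean distance
with NumPy and classifies a new point by majority vote among its k nearest training samples;
the arrays are made-up toy data, not the research dataset.

import numpy as np
from collections import Counter

def euclidean(x, y):
    # Equation (3): E(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def knn_predict(X_train, y_train, x_new, k=5):
    # Rank the training samples by distance to the new point and vote.
    distances = [euclidean(x, x_new) for x in X_train]
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Toy data: two features (say, age and glucose level) and a binary stroke label.
X_train = np.array([[60, 200], [65, 210], [30, 90], [25, 85], [70, 190]])
y_train = np.array([1, 1, 0, 0, 1])
print(knn_predict(X_train, y_train, np.array([62, 195]), k=3))  # prints 1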
3.7. Adaptive Boosting Method
Adaboost is used to classify data into its respective class by searching for the class
category based on the weights assigned to the training samples. This process is repeated
so that the weights are updated at each iteration: the weights of misclassified samples are
increased at every iteration. Adaboost is a typical ensemble learning algorithm, and the
resulting model has a strong level of accuracy. An Adaboost ensemble can be formed
with the following formula [1][5]:
Y_M(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m \, y_m(x) \right)    (4)
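As a minimal sketch of Equation (4): each weak learner's prediction in {-1, +1} is weighted
by its coefficient alpha_m, and the sign of the weighted sum gives the ensemble prediction.
The learners and weights below are hypothetical.

import numpy as np

def adaboost_predict(weak_preds, alphas):
    # Equation (4): Y_M(x) = sign(sum over m of alpha_m * y_m(x))
    return np.sign(np.dot(alphas, weak_preds))

# Hypothetical: three weak learners voting on four samples.
weak_preds = np.array([[+1, -1, +1, +1],
                       [+1, +1, -1, +1],
                       [-1, -1, +1, +1]])
alphas = np.array([0.8, 0.5, 0.3])  # larger alpha = more reliable learner
print(adaboost_predict(weak_preds, alphas))  # [ 1. -1.  1.  1.]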
3.8. Evaluation
The processed and tested data are then compared. The metrics used to evaluate
the classification models are accuracy, precision, recall, and F1 score. In this research,
model evaluation uses the confusion matrix; from its results, the values of accuracy,
recall, precision, and F1 score can be determined.
1. Accuracy
Accuracy is the ratio of true prediction to the overall data.
Accuracy = \frac{TP + TN}{TP + FP + FN + TN} \times 100\%    (5)
2. Precision
Precision is the ratio of true positive predictions to all positive predictions.
Precision = \frac{TP}{TP + FP} \times 100\%    (6)
3. Recall
Recall is the ratio of true positive predictions to all actual positive data.
Recall = \frac{TP}{TP + FN} \times 100\%    (7)
4. F1 Score
F1 Score is the weighted average (harmonic mean) of precision and recall.
F1 = \frac{2 \times (Recall \times Precision)}{Recall + Precision}    (8)
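As a hedged worked example of Equations (5)-(8), the sketch below derives all four metrics
from confusion-matrix counts; the TP, FP, FN, and TN values are made up and do not come
from the experiments.

# Hypothetical confusion-matrix counts
TP, FP, FN, TN = 80, 10, 5, 105

accuracy = (TP + TN) / (TP + FP + FN + TN) * 100          # Equation (5)
precision = TP / (TP + FP) * 100                          # Equation (6)
recall = TP / (TP + FN) * 100                             # Equation (7)
f1 = 2 * (recall * precision) / (recall + precision)      # Equation (8)

print(f"accuracy={accuracy:.1f}% precision={precision:.1f}% "
      f"recall={recall:.1f}% f1={f1:.1f}%")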
CHAPTER 4
4.1. Experiment Setup
This research was conducted on an Asus VivoBook 14/15 laptop running the
Windows 10 operating system with an Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
(2.30 GHz) processor and 8 GB of RAM. The programming language used is Python 3,
run online on Google Colaboratory.
4.2. Implementation
This research uses a combination of the Adaboost algorithm with C4.5 and K-
Nearest Neighbors to compare their ability to improve the performance of stroke disease
prediction. Before the comparison, several libraries are used in the process.
1. import numpy as np
2. import pandas as pd
3. from sklearn.preprocessing import LabelEncoder
4. from sklearn.preprocessing import MinMaxScaler
5. from sklearn.neighbors import KNeighborsClassifier
6. from sklearn.tree import DecisionTreeClassifier
7. from sklearn.ensemble import AdaBoostClassifier
8. from sklearn.ensemble import VotingClassifier
9. from sklearn.model_selection import train_test_split
10. from imblearn.over_sampling import RandomOverSampler
11. from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, f1_score, roc_auc_score, classification_report, accuracy_score
12. from google.colab import drive
13. import warnings
14. warnings.filterwarnings('ignore')
Line 1 imports numpy for numerical computation, and line 2 imports pandas to convert
CSV data to numerical form and vice versa. Lines 3, 4, and 10 import the libraries used in
data pre-processing. Lines 5 - 8 import the libraries for data modeling with C4.5, KNN, and
Adaboost, and line 11 is used to display the accuracy, precision, recall, and f1-score results.
Line 9 is used to divide the data into training and testing sets, and the library on line 12 is
used to access and manage datasets stored on Google Drive. The libraries on lines 13 and 14
are used to suppress warnings generated by the program.
15. drive.mount('/content/drive/')
16. dataframe = pd.read_csv("/content/drive/MyDrive/Kuliah smt
7/project_strokes.csv")
17. dataframe
Lines 15 - 17 mount Google Drive in Google Colab so that the dataset file can be
accessed and its data structure read.
18. dataframe.dropna(inplace=True)
19. dataframe.drop_duplicates(inplace=True)
20. dataframe.isnull().sum()
21. dataframe.duplicated().sum()
22. labelencoder = LabelEncoder()
Lines 18 and 19 of the program code remove rows with null or empty values and
duplicate rows. Lines 20 and 21 check that no null or duplicate rows remain. Line 22
creates the label encoder required for the encoding process.
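The listing creates the label encoder but does not show it being applied, nor how the feature
matrix X and label vector y used on line 26 are formed. The following is a plausible sketch of
those missing steps based on the attributes in Table 3.1; the exact column choices are
assumptions, not the authors' verbatim code.

# Assumed encoding of the categorical columns from Table 3.1
categorical_cols = ['gender', 'ever_married', 'work_type',
                    'residence_type', 'smoking_status']
for col in categorical_cols:
    dataframe[col] = labelencoder.fit_transform(dataframe[col])

# Assumed construction of the feature matrix and label vector
X = dataframe.drop(columns=['id', 'stroke'])
y = dataframe['stroke']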
23. !pip install -U imbalanced-learn
24. from imblearn.over_sampling import SMOTE
25. smote = SMOTE(sampling_strategy='auto', random_state=42)
26. X_resampled, y_resampled = smote.fit_resample(X, y)
27. X_train, X_test, y_train, y_test = train_test_split(X_resampled,
y_resampled, test_size=0.3, random_state=42)
Lines 23 and 24 of the program code install and import the library used for handling
class imbalance. Lines 25 and 26 apply SMOTE to the dataset, producing a new dataset
in which the number of minority-class samples has been synthetically increased to balance
the majority class; the variables X_resampled and y_resampled contain this new dataset
after the oversampling process. Line 27 uses the train_test_split function to split the
resampled dataset into training data (X_train, y_train) and testing data (X_test, y_test),
allocating 30% of the data as test data.
28. c45 = DecisionTreeClassifier(criterion='gini', splitter='random',
max_depth=5)
29. c45.fit(X_train, y_train)
30. y_pred_c45 = c45.predict(X_test)
31. y_pred_train_c45 = c45.predict(X_train)
32. knn = KNeighborsClassifier(n_neighbors=5)
33. knn.fit(X_train, y_train)
34. y_pred_knn = knn.predict(X_test)
35. y_pred_train_knn = knn.predict(X_train)
In lines 28 to 47, the program creates and trains the machine learning models: C4.5,
KNN, Adaboost, and an ensemble of models using the Voting Classifier technique.
These models are then used to make predictions on the test (`X_test`) and training
(`X_train`) data. The C4.5 model is also combined with Adaboost to get the best
prediction results, and an ensemble is built by combining the votes of the Adaboost
and KNN models with the 'hard voting' method.
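Lines 36 - 47 of the listing are not reproduced here. Based on the description above
(Adaboost applied to the C4.5 model, and a hard-voting ensemble of the Adaboost and KNN
models), a plausible reconstruction is sketched below; it is an assumption, not the authors'
verbatim code, and the estimator= argument assumes a recent scikit-learn release (older
versions use base_estimator=).

# Adaboost with the C4.5-style decision tree as its weak learner
ada_c45 = AdaBoostClassifier(estimator=c45, n_estimators=20, random_state=42)
ada_c45.fit(X_train, y_train)
y_pred_ada_c45 = ada_c45.predict(X_test)

# KNN does not support the sample weighting Adaboost needs, so it is
# combined with an Adaboost model through a hard-voting ensemble instead.
ada = AdaBoostClassifier(n_estimators=20, random_state=42)
voting = VotingClassifier(estimators=[('ada', ada), ('knn', knn)],
                          voting='hard')
voting.fit(X_train, y_train)
y_pred_voting = voting.predict(X_test)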
4.3. Result
The results start from preprocessing; the data is then divided into training and
testing sets and the accuracy is calculated.
4.3.1. Result C4.5 Algorithm
The experiment with the optimal result uses 70% training data and 30% test data;
for C4.5 the best max_depth is 5. The calculated results are shown in the table below.
Table 4.1 C4.5 Modeling Result (excerpt)
test size      class  precision  recall  f1-score  support  accuracy
test size 30%  1      0.90       0.83    0.86      8595
test size 40%  0      0.86       0.90    0.88      11398    0.87
test size 40%  1      0.89       0.85    0.87      11422
[Figure: Confusion matrix scores of the C4.5 algorithm for test sizes 20%, 30%, and 40%
(classes 0 and 1); y-axis 0.75-0.95]
These are the results of the C4.5 algorithm in predicting stroke disease with the data
split into 20%, 30%, and 40% testing sets. Precision is the percentage of correct positive
predictions relative to the total positive predictions. Recall is the percentage of correct
positive predictions relative to the actual total positives. F1-Score is the weighted harmonic
mean of precision and recall; the closer to 1, the better the model. Accuracy summarizes the
overall prediction correctness computed from these values.
4.3.2. Result KNN Algorithm
The results again start from preprocessing, after which the data is divided into
training and testing sets and the accuracy is calculated using the KNN algorithm. The
experiment with the optimal result uses 70% training data and 30% test data; for KNN
the best number of neighbors is 5. The calculated results are shown in the table below.
Table 4.2 KNN Modeling Result
test size      class  precision  recall  f1-score  support  accuracy
test size 20%  0      0.93       0.78    0.85      5678     0.86
test size 20%  1      0.81       0.95    0.87      5732
test size 30%  0      0.93       0.78    0.85      8520     0.86
test size 30%  1      0.81       0.94    0.87      8595
test size 40%  0      0.92       0.77    0.84      11398    0.85
test size 40%  1      0.80       0.94    0.87      11422
These are the results of the KNN algorithm in predicting stroke disease with the data
split into 20%, 30%, and 40% testing sets; the metrics are defined as in Section 4.3.1.
4.3.3. Result Adaboost Method
The results again start from preprocessing, after which the data is divided into
training and testing sets and the accuracy is calculated using the Adaboost method. The
experiment with the optimal result uses 70% training data and 30% test data; for Adaboost
the best number of estimators is 20. The calculated results are shown in the table below.
These are the results of the Adaboost method in predicting stroke disease with the
data split into 20%, 30%, and 40% testing sets; the metrics are defined as in Section 4.3.1.
4.3.4. Result C4.5 and Adaboost Combination
The results again start from preprocessing; the data is divided into training and
testing sets, C4.5 is combined with Adaboost, and the accuracy is calculated. The
calculation results are shown in the table below.
Table 4.4 C4.5 and Adaboost Modeling Result
These are the results of the C4.5 and Adaboost combination in predicting stroke
disease with the data split into 20%, 30%, and 40% testing sets; the metrics are defined
as in Section 4.3.1.
4.3.5. Result KNN and Adaboost Combination
The results again start from preprocessing; the data is divided into training and
testing sets, KNN is combined with Adaboost, and the accuracy is calculated. The
calculation results are shown in the table below.
These are the results of the KNN and Adaboost combination in predicting stroke
disease with the data split into 20%, 30%, and 40% testing sets; the metrics are defined
as in Section 4.3.1.
4.3.6. Result Conclusion
Based on the algorithm testing above, which uses a max depth of 5, 5 neighbors, and
20 estimators, processing with a test size of 30% gives good results, although the algorithms
differ only very slightly in precision, recall, and f1-score. The charts below show the details.
[Figure: Confusion matrix scores of the KNN + Adaboost combination at test size 30%
(classes 0 and 1); y-axis 0.85-0.97]
Based on Figure 4.6 and Figure 4.7, the combination with the C4.5 algorithm
achieves higher results than the combination with the KNN algorithm: 95% versus 91%,
a difference of 4%.
4.4. Discussion
The tests above use test sizes of 20%, 30%, and 40%. Max depth and the number of
neighbors were each tested 20 times, with the optimal value found at 5; the number of
estimators was tested 10 times, with the optimal value found at 20. The best results were
not obtained immediately: the researchers applied oversampling so that the classes had the
same number of records, because before oversampling the data with labels 0 and 1 were
highly imbalanced. In addition, the KNN algorithm cannot be combined with Adaboost
directly the way C4.5 can, because the two algorithms have contrasting parameters; the
KNN and Adaboost combination therefore requires an ensemble (voting) step. After
completing all the testing steps, good results were finally obtained.
The combination of C4.5 and Adaboost scores above the standalone C4.5 and
Adaboost algorithms. Meanwhile, the combination of KNN with Adaboost scores above
the standalone KNN algorithm but below the standalone Adaboost algorithm. Therefore,
in this test the combination of C4.5 and Adaboost performs better than the combination
of KNN and Adaboost.
CHAPTER 5
CONCLUSION
Based on the test results of combining the two algorithms, it can be concluded
that both combinations help improve the performance of stroke disease prediction,
although the performance produced by the C4.5 and KNN combinations differs. In the
C4.5 algorithm, the higher the max depth value, the higher the resulting accuracy, while
in KNN the neighbor value makes no significant difference, and the same holds for the
Adaboost algorithm. Testing with a max depth of 5 and 20 estimators at a test size of
30% produced a performance of 95% for the combination of the C4.5 algorithm with
Adaboost, while a neighbor value of 5 and 20 estimators at the same 30% test size
produced a performance of 91% for the combination of KNN with Adaboost.
A suggestion for future research is to try combining KNN with Adaboost without
using the voting ensemble, and to try combining other algorithms to explore better
prediction performance.
REFERENCES