
PROJECT REPORT

ENHANCING STROKE DISEASE PREDICTION


PERFORMANCE THROUGH A FUSION OF ADABOOST
WITH C4.5 AND K-NEAREST NEIGHBOR ALGORITHMS

HANNY LUTFY DAMAYANTI


20.K1.0038

Faculty of Computer Science


Soegijapranata Catholic University
2023
ABSTRACT

Stroke is one of the most serious medical conditions and has a significant impact on
public health. Accurate prediction of stroke risk is important for providing appropriate
treatment and intervention to individuals at risk of developing the disease. In recent years,
machine learning methods have become popular for improving stroke disease prediction.
This research applies the Adaboost method to the C4.5 and K-Nearest Neighbor (KNN)
algorithms with the aim of improving stroke prediction performance. Using a relevant dataset,
the C4.5 and KNN algorithms were first used separately to predict stroke disease;
the Adaboost method was then applied in combination with each algorithm.
The results show that applying the Adaboost method to the C4.5 and
KNN algorithms successfully improved the performance of stroke disease prediction, providing
more accurate and reliable predictions to assist in the diagnosis and treatment of stroke:
91% for the combination of KNN with Adaboost and 95% for the combination of
C4.5 with Adaboost, a difference of 4 percentage points. Therefore, C4.5 is more effective in
improving the performance of stroke disease prediction.

Keywords: stroke, C4.5, KNN, Adaboost

TABLE OF CONTENTS

COVER..........................................................................................................................................................i
ABSTRACT..................................................................................................................................................ii
TABLE OF CONTENTS...........................................................................................................................iii
LIST OF FIGURES.....................................................................................................................v
LIST OF TABLES....................................................................................................................vi
CHAPTER 1 INTRODUCTION................................................................................................................1
1.1. Background....................................................................................................................................1
1.2. Problem Formulation......................................................................................................................2
1.3. Scope..............................................................................................................................................2
1.4. Objective........................................................................................................................................2
CHAPTER 2 LITERATURE STUDY.......................................................................................................3
CHAPTER 3 RESEARCH METHODOLOGY........................................................................................7
3.1. Research Methodology...................................................................................................................7
3.2. Dataset Collection..........................................................................................................................8
3.3. Pre-processing Data........................................................................................................................8
3.3.1. Cleaning Data........................................................................................................................8
3.3.2. Encoding Data.......................................................................................................................8
3.3.3. Smote Oversampling............................................................................................................9
3.4. Splitting Data..................................................................................................................................9
3.5. C4.5 Algorithm...............................................................................................................................9
3.6. K-Nearest Neighbor Algorithm....................................................................................................10
3.7. Adaptive Boosting Method..........................................................................................................11
3.8. Evaluation.....................................................................................................................................11
CHAPTER 4 IMPLEMENTATION AND RESULTS....................................................................13
4.1. Experiment Setup.........................................................................................................................13
4.2. Implementation.............................................................................................................................13
4.3. Result............................................................................................................................................15
4.3.1. Result C4.5 Algorithm.........................................................................................................15
4.3.2. Result KNN Algorithm.........................................................................................................16

4.3.3. Result Adaboost Method....................................................................................................17
4.3.4. Result C4.5 and Adaboost Combination.............................................................................18
4.3.5. Result KNN and Adaboost Combination.............................................................................19
4.3.6. Result Conclusion..............................................................................................21
4.4. Discussion....................................................................................................................................22
CHAPTER 5 CONCLUSION.......................................................................................................23
REFERENCES.............................................................................................................................................a

LIST OF FIGURES

Figure 3.1 Research Methodology...................................................................................................7

LIST OF TABLES

Table 3.1. Dataset Attributes..........................................................................................8

CHAPTER 1
INTRODUCTION

1.1. Background

Stroke is a significant global health problem, ranking as the second leading cause of death
worldwide and contributing significantly to high rates of disability. Indonesia in particular faces
a pressing challenge with increasing stroke cases and high mortality rates [1]. According to data
from the 2018 Riskesdas, North Sulawesi Province has the highest prevalence of stroke (14.2%),
while Papua Province has the lowest (4.1%) [2]. In addition, according to the Centers for Disease
Control and Prevention (CDC), stroke is also one of the leading causes of death in the United States.
Stroke is a non-communicable disease that accounts for about 11% of all deaths, and more than
795,000 individuals in the United States experience its adverse effects [3]. The C4.5
algorithm can be used to predict or classify an event by forming a decision tree [4]. K-Nearest
Neighbor performs classification by considering the closest distance between new data and
existing data, starting with determining the value of the nearest neighbor [5]. Adaboost is a
supervised algorithm in the field of data mining that is often used to develop classification
models.

With the development of medical technology, it has become possible to use machine
learning to forecast stroke events. Machine learning algorithms can produce accurate
predictions as well as careful analysis. Machine learning has been widely applied to
classification and optimisation problems in building intelligent systems that support
healthcare providers. Selecting the right method for stroke symptom detection is important
because it affects the results that will be produced [6].

The purpose of this research is to apply the Adaboost method to the C4.5 and K-Nearest
Neighbor algorithms for stroke disease classification, in the hope of obtaining accurate
predictions. In this context, the C4.5 algorithm is used to construct a decision tree model
that classifies stroke symptoms into stroke or non-stroke categories. The K-Nearest Neighbor
algorithm measures the distance between new data and existing data and performs classification
based on a predetermined number of nearest neighbors. The Adaboost method aims to improve
the accuracy of the classification model by combining several weak classification models into
one stronger classification model. Accuracy is defined as the degree of conformity between the
predicted value and the actual value [7]. In addition, the test results are analysed to see how
effective each algorithm is.

1.2. Problem Formulation

There are several problem formulations in this research, including:

1. Is the combination of the Adaboost Method with the C4.5 Algorithm effective in predicting
stroke disease?
2. Is the combination of the Adaboost Method with the K-Nearest Neighbor Algorithm effective in
predicting stroke disease?
3. Of the two combinations above, which one is more effective at prediction?

1.3. Scope

The dataset used is the Stroke Dataset from Kaggle, which includes patient information such
as id, gender, age, hypertension, heart disease, ever married, work type, residence type, average
glucose level, bmi, smoking habits, and overall patient status (stroke or non-stroke). The
classification models combine the Adaboost Method with the C4.5 and KNN algorithms. This
research does not discuss risk factors or causes of stroke; it focuses only on the classification
of stroke symptoms to obtain accurate prediction results.

1.4. Objective

The main objective of this research is to show that the Adaboost Method applied to the C4.5 and
KNN Algorithms provides higher performance for stroke disease classification, because the
Adaboost Method is considered capable of improving the accuracy of several algorithms when
making predictions on various datasets. The results of this research can then be applied in the
health sector to assist health workers in classifying stroke symptoms and producing accurate
predictions.

CHAPTER 2
LITERATURE STUDY

Research conducted by Kohsasih and Situmorang [3] compared the accuracy and performance
of two algorithms in predicting stroke disease, namely the C4.5 and Naïve Bayes algorithms.
The dataset consists of about 5,000 entries divided into 60% training and 40% testing. The
preprocessing was done using the Orange application. The performance comparison of the two
algorithms shows an accuracy of 95%, precision of 90%, recall of 95%, and F1-score of 93% for
the C4.5 algorithm. Meanwhile, the Naïve Bayes algorithm achieved an accuracy of about 91%,
precision of 92%, recall of 91%, and F1-score of 92%. In terms of log loss and specificity, the
C4.5 algorithm achieved values of 0.190 and 0.047, while the Naïve Bayes algorithm had values
of 0.205 and 0.213. Overall, the results show that the C4.5 Algorithm has superior performance.

In their research on applying Adaboost to improve the performance of data mining
classification for diabetes disease, Novianti et al. [5] applied the K-Nearest Neighbor method
as the main algorithm for performance evaluation in a classification context. Testing was
carried out 5 times, with K values of 7, 13, 19, 25, and 31 respectively. For the KNN algorithm
alone, the highest result was obtained in the second test, with 92.90% accuracy. For the KNN
Algorithm with Adaboost, the highest results were in the first and second tests, both with
95.40% accuracy. The Adaboost method thus increased the accuracy by 2.50%.

Research on the diagnosis of stroke risk levels conducted by Puspitawuri et al. [6] used
a dataset consisting of both numerical and categorical attributes; the researchers applied the
K-Nearest Neighbor approach to the numerical data and the Naïve Bayes method to the categorical
data. The first test examined the effect of data distribution on balanced training classes using
30, 45, and 60 samples. For example, with 30 training samples, there were 10 in the low-risk
class, 10 in the medium-risk class, and 10 in the high-risk class. The second test examined the
effect of data distribution on unbalanced training classes using 30, 45, and 60 samples; for
example, with 30 training samples, there were 8 in the low-risk class, 8 in the medium-risk
class, and 14 in the high-risk class. The test results show that the highest accuracy on a
balanced class distribution reached 96.67%, achieved with 45 training samples and K values of
around 15 to 22, while on the unbalanced classes the highest accuracy reached 100% with 60
training samples and K = 20-30. The combination of KNN and Naïve Bayes can therefore be used
for diagnosis because it produces accurate results.

Pebrianti et al. [7] conducted research on diabetes disease classification. To obtain
optimal results, they used the Adaboost Method with the Naïve Bayes Algorithm. The dataset
is tabular data on the health conditions of patients indicated as diabetic or not, totalling
336 records with 9 variables. The researchers performed data preprocessing, data cleaning,
and a data split. Testing Naïve Bayes with a 60%:40% data split recorded an accuracy of
around 76%. After applying Adaboost to Naïve Bayes, the accuracy increased to 76.94%.
These results show that the Adaboost method can increase accuracy by 0.94%.

The use of the C4.5 Algorithm can optimise classification results to obtain the right
accuracy. In their research, Pambudi et al. [8] explained that the C4.5 Decision Tree model
uses 23 rules, consisting of 14 rules for the non-stroke class and 9 rules for the stroke
class. The researchers used two main approaches, qualitative and quantitative, and tested
the C4.5 Decision Tree Algorithm with confusion matrix measurements and AUC values. Testing
the C4.5 Decision Tree algorithm gave a prediction result of 96.05%. Meanwhile, Rohman et
al. [9] conducted research on predicting heart disease using the Adaboost-based C4.5
algorithm with iteration and attribute weighting. From 867 patient records, 567 remained
after preprocessing. Testing with the K-Fold Cross Validation method shows that the
Adaboost-based C4.5 Algorithm provides higher accuracy (92.24%) than the C4.5 Algorithm
alone (86.59%), a difference of 5.65%, and evaluation using ROC curves showed a higher AUC
value (0.982) for the Adaboost-based C4.5 Algorithm. These results imply that the
Adaboost-based C4.5 method is more effective in predicting heart disease.

Based on experiments carried out using three data-split scenarios, Hermawan et al. [10]
stated that early prediction of stroke disease based on medical records using the
Classification and Regression Tree (CART) algorithm produced the highest accuracy, 89.83%,
in the scenario with 80% training data and 20% test data. Their analysis shows that the
larger the training data, the greater the accuracy obtained, because in the evaluation with
the confusion matrix, the true positive and true negative values are greater in the larger
training scenario. This affects the accuracy value because the true positive value counts
positive predictions that are correct and the true negative value counts negative predictions
that are correct; therefore, the highest accuracy occurs in the largest training scenario.

In research on hepatitis disease prediction, Buani [11] tested the prediction results of
the Naïve Bayes algorithm using a genetic algorithm for feature selection. The test results
show an accuracy of 96.77%, a significant increase over previous research using the same data
and algorithm, which had a prediction result of 83.71%. The difference of 13.06% between the
two studies indicates that the accuracy of the Naïve Bayes algorithm increases after feature
selection using a genetic algorithm.

Handayani et al. [12] explained that the Decision Tree Algorithm has a greater true
positive value than the Neural Network Algorithm. The C4.5 model is represented as a decision
tree; its construction starts by counting the positive and negative liver disease classes for
each attribute in the training data, after which Entropy(Total) is calculated using the
corresponding equations. The same training data is used for the Neural Network model, but the
attribute values are converted into numerical values. The network consists of three layers: an
input layer with ten neurons (nine for the attributes and one as bias), one hidden layer with
eight neurons, and an output layer with two neurons producing predictions for Positive Liver
and Negative Liver. The test results show that the C4.5 model has an accuracy of 75.56% and an
AUC of 0.898, while the Neural Network model has an accuracy of 74.1% and an AUC of 0.671.
From these results it can be concluded that the Decision Tree model is more accurate than the
Artificial Neural Network model.

In their research journal on a coronary heart disease prediction system using the Naïve
Bayes method, Larassati et al. [13] used 303 data records consisting of 13 variables and 1
class. Data processing involved cleaning, selection, and transformation. In the data modeling
stage, the Naïve Bayes algorithm was implemented to predict coronary artery disease, and
performance was evaluated by measuring prediction ability against the training data to obtain
the accuracy of the applied method. The first experiment used a 60% split, obtaining 177
training records and 119 test records, with an accuracy of 83.1%. The second experiment, with
a 70%:30% split, produced an accuracy of 82.02%, while the third, with an 80%:20% split,
produced an accuracy of 81.6%. The three experiments lead to the conclusion that the amount of
data significantly affects the accuracy rate and that the Naïve Bayes algorithm can be applied
to predict coronary artery disease from initial patient examination data.

Based on the literature study above, the C4.5 and KNN algorithms are able to provide
high accuracy in classifying diseases. Moreover, research [5][9] showed that using the
Adaboost Method with the KNN and C4.5 Algorithms can provide higher performance than the KNN
and C4.5 Algorithms alone. Therefore, this research will demonstrate that applying the
Adaboost Method to the C4.5 and KNN Algorithms provides higher performance in classification
for predicting stroke disease.

CHAPTER 3
RESEARCH METHODOLOGY

3.1. Research Methodology

To achieve good results in this research, a structured research method is essential.
The problem-solving steps are:

1. Conduct a literature study related to the topic discussed.
2. Collect stroke disease datasets from the Kaggle platform and study the algorithms used.
3. Preprocess the dataset with data cleaning, data encoding, and oversampling using SMOTE.
4. Model the algorithms using C4.5, K-Nearest Neighbors, and Adaboost.
5. Analyse the implementation results and draw conclusions.

Figure 3.1 Research Methodology

3.2. Dataset Collection

The dataset used is the Stroke Prediction dataset taken from Kaggle. The data consists of
43,401 observations with 12 attributes. The data attributes used in this study are presented
in the following Table 3.1.

Table 3.1. Dataset Attributes

No  Name                Information
1   id                  Patient ID
2   gender              Gender
3   age                 Patient age
4   hypertension        Hypertension (high blood pressure)
5   heart_disease       Heart disease
6   ever_married        Ever married
7   work_type           Type of work
8   residence_type      Type of residence
9   avg_glucose_level   Average glucose level
10  bmi                 Body mass index
11  smoking_status      Smoking status
12  stroke              Stroke prediction (target)

3.3. Pre-processing Data

In this research, data preprocessing begins with data cleaning, in which attributes and
records that are incomplete, inaccurate, or irrelevant are removed. The purpose of data
cleaning is to retain only the data that is actually used.

3.3.1. Cleaning Data

Data cleaning in this study was carried out to eliminate duplicate records and empty
values, since both can hinder data processing. Therefore, this research needs to perform
data cleaning.

8
3.3.2. Encoding Data

Encoding is another pre-processing step in this research. Encoding converts the form of
the data (categorical values into numerical labels), which facilitates further data
processing.
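As an illustration, the following sketch shows how scikit-learn's LabelEncoder might be applied to the categorical attributes of Table 3.1; the column names and toy values are assumptions for illustration, not the report's own code.

# Hypothetical sketch: encoding categorical attributes from Table 3.1
# with scikit-learn's LabelEncoder (columns and values are assumed).
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "ever_married": ["Yes", "No", "Yes"],
})
for col in ["gender", "ever_married"]:
    # Each distinct string is mapped to an integer label, e.g. Female -> 0, Male -> 1
    df[col] = LabelEncoder().fit_transform(df[col])
print(df)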

3.3.3. Smote Oversampling

The last pre-processing step is oversampling using SMOTE. This is done to balance the
amount of data under the "stroke" label. The stroke attribute has two values, stroke and
non-stroke, and the number of non-stroke records is much larger than the number of stroke
records. Oversampling is therefore needed so that the two classes become equal in size,
which produces good accuracy.
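The balancing effect can be illustrated with a minimal sketch on an invented imbalanced toy set (not the actual stroke data):

# Minimal SMOTE sketch on an invented imbalanced toy set (not the stroke data).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=500, weights=[0.95, 0.05], random_state=42)
print(Counter(y))                     # strongly imbalanced, e.g. {0: ~475, 1: ~25}
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))                 # both classes now have the same count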

3.4. Splitting Data

In data splitting, the data is divided into two parts: training and testing. The
training set is the part of the dataset used to fit the machine learning algorithm;
the testing set is the part used to measure its accuracy. In this research, the module
used is sklearn.model_selection.
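A minimal sketch of the 70%:30% split used in this research, with invented toy data (the variable names are illustrative):

# Sketch of a 70:30 train/test split with sklearn.model_selection (toy data).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy features
y = np.array([0, 1] * 5)           # toy labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)   # 30% held out for testing
print(X_train.shape, X_test.shape)           # (7, 2) (3, 2)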

3.5. C4.5 Algorithm

In this research, the classification method uses the C4.5 algorithm to analyze
stroke disease. Attribute selection assigns attributes as nodes, either root nodes or
internal nodes, based on the highest Gain value each attribute possesses. Data processing
with the C4.5 algorithm involves calculating entropy values, calculating gain values, and
forming the decision tree and its corresponding rules. Equations (1) and (2) are used to
calculate the entropy and gain values [8].

Entropy(S) = \sum_{i=1}^{n} -p_i \cdot \log_2(p_i) \quad (1)

Description:

S : set of cases

n : number of partitions of S

p_i : proportion of S_i to S
Gain(S, A) = Entropy(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|} \cdot Entropy(S_i) \quad (2)

Description:

S : set of cases

n : number of partitions of attribute A

|S| : number of cases in S

|S_i| : number of cases in the i-th partition
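A minimal sketch of Equations (1) and (2) on invented binary labels may make the calculation concrete; the function names and toy data are illustrative, not from the report.

# Illustrative computation of Equations (1) and (2); names and data are hypothetical.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(labels, attribute_values):
    # Gain(S, A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)
    total, n = entropy(labels), len(labels)
    for v in set(attribute_values):
        subset = [l for l, a in zip(labels, attribute_values) if a == v]
        total -= len(subset) / n * entropy(subset)
    return total

labels = [1, 1, 0, 0, 0, 1]          # toy stroke / non-stroke labels
hypertension = [1, 1, 0, 0, 1, 0]    # toy attribute values
print(entropy(labels), gain(labels, hypertension))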

3.6. K-Nearest Neighbor Algorithm

The K-Nearest Neighbor (K-NN) algorithm classifies new data by considering the
distance between the data and a number of nearest neighbors. The number of nearest
neighbors is determined by the parameter K, which can be set by the user. K-NN operates
by finding the minimum distance from the new data to the specified nearest neighbors;
its focus is to classify new objects based on their attributes and the training samples.
Neighbor proximity is generally determined with the Euclidean distance, calculated as
follows [5]:


E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \quad (3)

Description:

x_i = sample (training) data

y_i = testing data

n = data dimension

i = data index
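The following sketch illustrates Equation (3) and a bare-bones majority-vote K-NN; it is a didactic example with invented toy data, not the scikit-learn implementation used later in Chapter 4.

# Didactic sketch of Equation (3) and a majority-vote K-NN classification.
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train_X, train_y, query, k=5):
    # Sort training samples by distance to the query point,
    # then take a majority vote among the k nearest labels.
    nearest = sorted(zip(train_X, train_y),
                     key=lambda p: euclidean(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train_X = [(60, 110.0), (45, 85.5), (70, 200.1), (50, 90.0), (65, 150.3)]
train_y = [1, 0, 1, 0, 1]   # 1 = stroke, 0 = non-stroke (toy data)
print(knn_predict(train_X, train_y, query=(62, 120.0), k=3))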

3.7. Adaptive Boosting Method

Adaboost is used to classify data into their respective classes. Adaboost determines
class categories based on the weight values the samples hold, and this process is repeated
so that the weights are updated: at each iteration, the weights of misclassified samples
are increased so that subsequent weak learners focus on them. Adaboost is a typical
ensemble learning algorithm, and the resulting model achieves a strong level of accuracy.
An Adaboost ensemble can be formed with the following formula [1][5]:

Y_M(x) = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha_m \, y_m(x) \right) \quad (4)
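To make the weight-update idea behind Equation (4) concrete, here is a small illustrative discrete-AdaBoost loop on invented toy data, using decision stumps; the data and variable names are assumptions for the example, not the report's implementation.

# Illustrative discrete AdaBoost loop (Equation 4); toy data, names hypothetical.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([1, 1, -1, -1, 1, -1])            # labels in {-1, +1}
w = np.full(len(y), 1 / len(y))                # uniform initial weights
alphas, stumps = [], []
for m in range(5):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w[pred != y]) / np.sum(w)
    if err >= 0.5:                             # weak learner no better than chance
        break
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    w *= np.exp(-alpha * y * pred)             # misclassified weights grow
    w /= w.sum()
    alphas.append(alpha)
    stumps.append(stump)

# Final ensemble: sign of the alpha-weighted sum of weak predictions, as in Eq. (4).
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print(np.sign(F))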

3.8. Evaluation

The data that has been processed and tested is then compared. The main metrics used to
evaluate classification models are accuracy, precision, recall, and F1 score. In this
research, model evaluation uses the confusion matrix; based on the confusion matrix
results, the values of accuracy, recall, precision, and F1 score can be determined.

1. Accuracy
Accuracy is the ratio of correct predictions to the overall data.

Accuracy = \frac{TP + TN}{TP + FP + FN + TN} \times 100\% \quad (5)

2. Precision
Precision is the ratio of true positive predictions to the overall positive prediction results.

Precision = \frac{TP}{TP + FP} \times 100\% \quad (6)

3. Recall
Recall is the ratio of true positive predictions to all data that are actually positive.

Recall = \frac{TP}{TP + FN} \times 100\% \quad (7)

4. F1 Score
F1 Score is the weighted harmonic mean of precision and recall.

F1\ Score = \frac{2 \times (Recall \times Precision)}{Recall + Precision} \quad (8)

In Equations (5), (6), (7), and (8), TP is True Positive, FP is False Positive, TN is
True Negative, and FN is False Negative; the result is multiplied by 100% to obtain a
percentage. The Recall (7) and Precision (6) values together produce the F1 Score (8).
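As a worked example of Equations (5) through (8), the sketch below derives the four metrics from confusion-matrix counts; the counts are invented for illustration.

# Worked example of Equations (5)-(8) from invented confusion-matrix counts.
TP, FP, FN, TN = 80, 10, 5, 105

accuracy  = (TP + TN) / (TP + FP + FN + TN) * 100   # Eq. (5) -> 92.5%
precision = TP / (TP + FP) * 100                    # Eq. (6) -> ~88.9%
recall    = TP / (TP + FN) * 100                    # Eq. (7) -> ~94.1%
f1        = 2 * (recall * precision) / (recall + precision)  # Eq. (8) -> ~91.4
print(accuracy, precision, recall, f1)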

CHAPTER 4

IMPLEMENTATION AND RESULTS

4.1. Experiment Setup

This research was conducted on an Asus VivoBook 14/15 laptop running the Windows 10
operating system, with an Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz (2.30 GHz) processor
and 8 GB of RAM. The programming language used is Python 3, run online on Google
Colaboratory.

4.2. Implementation

This research uses a combination of the Adaboost algorithm with C4.5 and K-Nearest
Neighbors to compare how each improves the performance of stroke disease prediction.
Before making the comparison, this research uses several libraries in the process.
1. import numpy as np
2. import pandas as pd
3. from sklearn.preprocessing import LabelEncoder
4. from sklearn.preprocessing import MinMaxScaler
5. from sklearn.neighbors import KNeighborsClassifier
6. from sklearn.tree import DecisionTreeClassifier
7. from sklearn.ensemble import AdaBoostClassifier
8. from sklearn.ensemble import VotingClassifier
9. from sklearn.model_selection import train_test_split
10. from imblearn.over_sampling import RandomOverSampler
11. from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, f1_score, roc_auc_score, classification_report, accuracy_score
12. from google.colab import drive
13. import warnings
14. warnings.filterwarnings('ignore')

Line 1 imports numpy for numerical computation; line 2 imports pandas to convert CSV
data to numerical form and vice versa. Lines 3, 4, and 10 import the libraries used in data
pre-processing. Lines 5 - 8 import the libraries for data modeling using C4.5, KNN, and
Adaboost, and line 11 is used to display the accuracy, precision, recall, and f1-score
results. Line 9 is used to divide the data into training and testing sets, and the library
on line 12 is used to access and manage dataset files located on Google Drive. The libraries
on lines 13 and 14 are used to manage the warnings generated by the program.

15. drive.mount('/content/drive/')
16. dataframe = pd.read_csv("/content/drive/MyDrive/Kuliah smt
7/project_strokes.csv")
17. dataframe

Lines 15 - 17 are used to connect Google Drive with Google Colab, so that the dataset
file can be accessed and its data structure read.
18. dataframe.dropna(inplace=True)
19. dataframe.drop_duplicates(inplace=True)
20. dataframe.isnull().sum()
21. dataframe.duplicated().sum()
22. labelencoder = LabelEncoder()

Lines 18 and 19 of the program code remove rows with null or empty content and
duplicate rows. Lines 20 and 21 function to display the number of null and duplicate
entries remaining after the drop. Line 22 creates the LabelEncoder object required for
the encoding process.
23. !pip install -U imbalanced-learn
24. from imblearn.over_sampling import SMOTE
25. smote = SMOTE(sampling_strategy='auto', random_state=42)
26. X_resampled, y_resampled = smote.fit_resample(X, y)
27. X_train, X_test, y_train, y_test = train_test_split(X_resampled,
y_resampled, test_size=0.3, random_state=42)

Lines 23 and 24 of the program code install and import the library used for handling
class imbalance. Lines 25 and 26 apply SMOTE to the dataset, producing a new dataset in
which the number of samples from the minority class has been synthetically increased until
it is balanced with the majority class; the variables X_resampled and y_resampled contain
this new dataset. Line 27 uses the train_test_split function to split the SMOTE-resampled
dataset into training data (X_train, y_train) and testing data (X_test, y_test), allocating
30% of the data as test data.
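Note that the listing never shows how the X and y passed to SMOTE on line 26 are constructed. A plausible reconstruction, assuming the attribute names of Table 3.1 and the labelencoder object from line 22 (this step is not shown in the report, so treat it as a sketch):

# Hypothetical reconstruction of the feature/target preparation; the report
# does not show this step. Column names are assumed from Table 3.1.
for col in ['gender', 'ever_married', 'work_type', 'residence_type', 'smoking_status']:
    dataframe[col] = labelencoder.fit_transform(dataframe[col])
X = dataframe.drop(columns=['id', 'stroke'])   # predictors
y = dataframe['stroke']                        # target label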
28. c45 = DecisionTreeClassifier(criterion='gini', splitter='random',
max_depth=5)
29. c45.fit(X_train, y_train)
30. y_pred_c45 = c45.predict(X_test)
31. y_pred_train_c45 = c45.predict(X_train)

32. knn = KNeighborsClassifier(n_neighbors=5)

33. knn.fit(X_train, y_train)
34. y_pred_knn = knn.predict(X_test)
35. y_pred_train_knn = knn.predict(X_train)

36. adaboost = AdaBoostClassifier(n_estimators=20, random_state=42)
37. adaboost.fit(X_train, y_train)
38. y_pred_adaboost = adaboost.predict(X_test)
39. y_pred_train_adaboost = adaboost.predict(X_train)

40. c45_adaboost = AdaBoostClassifier(base_estimator=c45, n_estimators=20, random_state=42)
41. c45_adaboost.fit(X_train, y_train)
42. y_pred_c45_adaboost = c45_adaboost.predict(X_test)
43. y_pred_train_c45_adaboost = c45_adaboost.predict(X_train)

44. ensemble = VotingClassifier(estimators=[('adaboost', adaboost), ('knn', knn)], voting='hard')
45. ensemble.fit(X_train, y_train)
46. y_pred_ensemble = ensemble.predict(X_test)
47. y_pred_train_ensemble = ensemble.predict(X_train)

In lines 28 to 47, the program creates and trains the machine learning models: C4.5,
KNN, Adaboost, and an ensemble built with the Voting Classifier technique. These models are
then used to make predictions on the test (`X_test`) and training (`X_train`) data. The C4.5
model is also combined with Adaboost (as its base estimator) to obtain better prediction
results, and an ensemble model combines the votes of the Adaboost and KNN models using the
'hard voting' method.

4.3. Result

4.3.1. Result C4.5 Algorithm

The results below start from preprocessing; the data is then divided into training and
testing sets and the accuracy is calculated using the C4.5 Algorithm. The optimal
experiment uses 70% training data and 30% test data; for C4.5 the best max_depth is 5.
The calculation results are given in the following table.

Table 4.1 C4.5 Modeling Result

c4.5           label  precision  recall  f1-score  support  accuracy
test size 20%  0      0.80       0.90    0.85      5678     0.84
               1      0.89       0.78    0.83      5732
test size 30%  0      0.84       0.91    0.87      8520     0.87
               1      0.90       0.83    0.86      8595
test size 40%  0      0.86       0.90    0.88      11398    0.87
               1      0.89       0.85    0.87      11422

[Chart "Confusion Matrix of C4.5": bars of precision, recall, f1-score, and accuracy for each test size (20%, 30%, 40%) and label (0, 1).]

Figure 4.1 C4.5 Modeling Result

These are the results of the C4.5 algorithm in predicting stroke disease with the data
divided into 20%, 30%, and 40% testing sets. Precision is the percentage of correct
positive predictions relative to the total positive predictions. Recall is the percentage
of correct positive predictions relative to the total actual positives. F1-Score is the
weighted harmonic mean of precision and recall; the closer to 1, the better the model.
Accuracy is the proportion of all predictions, positive and negative, that are correct.

4.3.2. Result KNN Algorithm

The results below follow the same procedure: after preprocessing, the data is divided
into training and testing sets and the accuracy is calculated using the KNN Algorithm. The
optimal experiment uses 70% training data and 30% test data; for KNN the best number of
neighbors is 5. The calculation results are given in the following table.

Table 4.2 KNN Modeling Result

knn            label  precision  recall  f1-score  support  accuracy
test size 20%  0      0.93       0.78    0.85      5678     0.86
               1      0.81       0.95    0.87      5732
test size 30%  0      0.93       0.78    0.85      8520     0.86
               1      0.81       0.94    0.87      8595
test size 40%  0      0.92       0.77    0.84      11398    0.85
               1      0.80       0.94    0.87      11422

[Chart "Confusion Matrix of KNN": bars of precision, recall, f1-score, and accuracy for each test size (20%, 30%, 40%) and label (0, 1).]

Figure 4.2 KNN Modeling Result

These are the results of the KNN algorithm in predicting stroke disease with the data
divided into 20%, 30%, and 40% testing sets; the metric definitions are the same as those
described in Section 4.3.1.

4.3.3. Result Adaboost Method

The results below follow the same procedure, with the accuracy calculated using the
Adaboost Method. The optimal experiment uses 70% training data and 30% test data; for
Adaboost the best number of estimators is 20. The calculation results are given in the
following table.

Table 4.3 Adaboost Modeling Result

adaboost       label  precision  recall  f1-score  support  accuracy
test size 20%  0      0.90       0.92    0.91      5678     0.91
               1      0.92       0.90    0.91      5732
test size 30%  0      0.91       0.92    0.92      8520     0.92
               1      0.92       0.91    0.92      8595
test size 40%  0      0.91       0.92    0.91      11398    0.91
               1      0.92       0.91    0.91      11422

[Chart "Confusion Matrix of Adaboost": bars of precision, recall, f1-score, and accuracy for each test size (20%, 30%, 40%) and label (0, 1).]

Figure 4.3 Adaboost Modeling Result

These are the results of the Adaboost method in predicting stroke disease with the data
divided into 20%, 30%, and 40% testing sets; the metric definitions are the same as those
described in Section 4.3.1.

4.3.4. Result C4.5 and Adaboost Combination

The results below follow the same procedure: after preprocessing and splitting, C4.5 is
combined with Adaboost and the accuracy of the combination is calculated using the Adaboost
Method. The calculation results are given in the following table.
Table 4.4 C4.5 and Adaboost Modeling Result

c4.5 + adaboost  label  precision  recall  f1-score  support  accuracy
test size 20%    0      0.94       0.97    0.95      5678     0.95
                 1      0.96       0.94    0.95      5732
test size 30%    0      0.93       0.96    0.95      8520     0.95
                 1      0.96       0.93    0.95      8595
test size 40%    0      0.94       0.96    0.95      11398    0.95
                 1      0.96       0.94    0.95      11422

[Chart "Confusion Matrix of C4.5 + Adaboost": bars of precision, recall, f1-score, and accuracy for each test size (20%, 30%, 40%) and label (0, 1).]

Figure 4.4 C4.5 and Adaboost Modeling Result

These are the results of the C4.5 and Adaboost combination in predicting stroke disease
with the data divided into 20%, 30%, and 40% testing sets; the metric definitions are the
same as those described in Section 4.3.1.

4.3.5. Result KNN and Adaboost Combination

The results below follow the same procedure: after preprocessing and splitting, KNN is
combined with Adaboost through a voting ensemble and the accuracy of the combination is
calculated. The calculation results are given in the following table.

Table 4.5 KNN and Adaboost Modeling Result

knn + adaboost  label  precision  recall  f1-score  support  accuracy
test size 20%   0      0.86       0.97    0.91      5678     0.91
                1      0.96       0.85    0.90      5732
test size 30%   0      0.87       0.96    0.91      8520     0.91
                1      0.96       0.86    0.90      8595
test size 40%   0      0.87       0.96    0.91      11398    0.91
                1      0.96       0.85    0.90      11422

[Chart "Confusion Matrix of KNN + Adaboost": bars of precision, recall, f1-score, and accuracy for each test size (20%, 30%, 40%) and label (0, 1).]

Figure 4.5 KNN and Adaboost Modeling Result

These are the results of the KNN and Adaboost combination in predicting stroke disease
with the data divided into 20%, 30%, and 40% testing sets; the metric definitions are the
same as those described in Section 4.3.1.

4.3.6. Result Conclusion

Based on the testing above, which uses a max_depth of 5, 5 neighbors, and 20
estimators, the experiments run with a test size of 30% give good results, although the
algorithms differ only very slightly in precision, recall, and f1-score. For more detail,
see the charts below.

[Chart "Confusion Matrix of C4.5 + Adaboost": bars of precision, recall, f1-score, and accuracy at test size 30% for labels 0 and 1.]

Figure 4.6 C4.5 and Adaboost Combination Result

[Chart "Confusion Matrix of KNN + Adaboost": bars of precision, recall, f1-score, and accuracy at test size 30% for labels 0 and 1.]

Figure 4.7 KNN and Adaboost Combination Result

Based on Figure 4.6 and Figure 4.7, the combination with the C4.5 algorithm scores
higher than the combination with the KNN algorithm: 95% versus 91%, a difference of
4 percentage points.

4.4. Discussion

The tests above use test sizes of 20%, 30%, and 40%. The max depth and the number of
neighbors were each tested 20 times, and the optimal value for both was 5; the number of
estimators was tested 10 times, and the optimal value was 20. The best results were not
obtained immediately: the researchers first performed oversampling so that the classes had
the same amount of data, because before oversampling the numbers of records with labels 0
and 1 differed greatly. In addition, the KNN algorithm cannot be combined with Adaboost
directly the way C4.5 can; the combination of KNN and Adaboost requires an ensemble,
because the two algorithms have incompatible parameters. After completing all these steps,
the tests finally produced good results.
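This limitation can be illustrated directly: scikit-learn's AdaBoost requires a base estimator that accepts sample weights, which KNeighborsClassifier does not provide, so the boosting wrapper is rejected and a voting ensemble is used instead. A hedged sketch on toy data (the exact exception and message may vary between scikit-learn versions):

# KNN cannot serve as an AdaBoost base estimator because it has no
# sample_weight support; a VotingClassifier is the workaround used here.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
try:
    AdaBoostClassifier(base_estimator=KNeighborsClassifier()).fit(X, y)
except (ValueError, TypeError) as e:   # rejected: no sample_weight support
    print(e)
voting = VotingClassifier(
    estimators=[('adaboost', AdaBoostClassifier()),
                ('knn', KNeighborsClassifier())],
    voting='hard').fit(X, y)
print(voting.predict(X))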

The combination of C4.5 and Adaboost scores above both the C4.5 and Adaboost
algorithms on their own. Meanwhile, the combination of KNN with Adaboost scores above the
KNN algorithm alone but below the Adaboost algorithm alone. Therefore, in this test, the
combination of the C4.5 and Adaboost algorithms performs better than the combination of
the KNN and Adaboost algorithms.

CHAPTER 5

CONCLUSION

Based on the test results of the two algorithm combinations, it can be concluded
that both combinations help to improve the performance of stroke disease prediction.
However, the performance of the C4.5 and KNN combinations differs. In the C4.5
algorithm, the higher the max depth value, the higher the resulting score, while in KNN
the neighbor value makes no significant difference, as is also the case for the Adaboost
Algorithm. Testing in this study with a max depth of 5 and 20 estimators at a test size
of 30% produced a performance value of 95% for the combination of the C4.5 Algorithm
with Adaboost. Meanwhile, 5 neighbors and 20 estimators at the same test size of 30%
produced a performance value of 91% for the combination of KNN with Adaboost.

There is no significant difference in precision, recall, and f1-score between the two
combinations. In terms of processing time, the C4.5 algorithm processes faster than the
KNN algorithm, due to the parameters of each algorithm. It can be concluded that
performance results can be influenced by the amount of data and the parameters used.

Suggestions for future research are to try combining KNN with Adaboost without using
the ensemble method and to try combining other algorithms to find better prediction
performance.

REFERENCES

[1] A. Byna and M. Basit, "Penerapan Metode Adaboost untuk Mengoptimasi Prediksi Penyakit
Stroke dengan Algoritma Naïve Bayes," SISFOKOM, vol. 9, no. 3, pp. 407–411, Nov. 2020,
doi: 10.32736/sisfokom.v9i3.1023.
[2] Y. Oktarina and S. Mulyani, "Edukasi Kesehatan Penyakit Stroke pada Lansia," vol. 3,
2020, doi: 10.22437/medicaldedication.v3i2.11220.
[3] K. L. Kohsasih and Z. Situmorang, "Analisis Perbandingan Algoritma C4.5 dan Naïve
Bayes dalam Memprediksi Penyakit Cerebrovascular," Jurnal Penelitian Teknik Informatika,
Manajemen Informatika dan Sistem Informasi, vol. 9, no. 1, pp. 13–17, Apr. 2022,
doi: 10.31294/inf.v9i1.11931.
[4] R. Novita, "Teknik Data Mining: Algoritma C4.5".
[5] N. Novianti, M. Zarlis, and P. Sihombing, "Penerapan Algoritma Adaboost untuk
Peningkatan Kinerja Klasifikasi Data Mining pada Imbalance Dataset Diabetes," MIB, vol.
6, no. 2, p. 1200, Apr. 2022, doi: 10.30865/mib.v6i2.4017.
[6] A. Puspitawuri, E. Santoso, and C. Dewi, "Diagnosis Tingkat Risiko Penyakit Stroke
Menggunakan Metode K-Nearest Neighbor dan Naïve Bayes".
[7] L. Pebrianti, F. Aulia, and H. Nisa, "Implementasi Metode Adaboost untuk Mengoptimasi
Klasifikasi Penyakit Diabetes dengan Algoritma Naïve Bayes," vol. 7, no. 2, 2022,
doi: 10.32528/justindo.v7i2.8627.
[8] R. E. Pambudi, "Klasifikasi Penyakit Stroke Menggunakan Algoritma Decision Tree C4.5,"
vol. 16, no. 02, doi: 10.5281/zenodo.7535865.
[9] A. Rohman, V. Suhartono, and C. Supriyanto, "Penerapan Algoritma C4.5 Berbasis
Adaboost untuk Prediksi Penyakit Jantung," vol. 13, 2017, doi: 10.25126/jtiik.2020752379.
[10] A. F. Hermawan, F. R. Umbara, and F. Kasyidi, "Prediksi Awal Penyakit Stroke
Berdasarkan Rekam Medis menggunakan Metode Algoritma CART (Classification and Regression
Tree)," vol. 7, no. 2, 2022, doi: 10.26760/mindjournal.v7i2.151-164.
[11] D. C. P. Buani, "Prediksi Penyakit Hepatitis Menggunakan Algoritma Naïve Bayes dengan
Seleksi Fitur Algoritma Genetika," Evolusi, vol. 6, no. 2, Sep. 2018,
doi: 10.31294/evolusi.v6i2.4381.
[12] P. Handayani, E. Nurlelah, M. Raharjo, and P. M. Ramdani, "Prediksi Penyakit Liver
dengan Menggunakan Metode Decision Tree dan Neural Network," CESS (Journal of Computer
Engineering, System and Science), vol. 4, no. 1, p. 55, Feb. 2019,
doi: 10.24114/cess.v4i1.11528.
[13] D. Larassati, A. Zaidiah, and S. Afrizal, "Sistem Prediksi Penyakit Jantung Koroner
Menggunakan Metode Naive Bayes," JIPI (Jurnal Ilmiah Penelitian dan Pembelajaran
Informatika), vol. 7, no. 2, pp. 533–546, May 2022, doi: 10.29100/jipi.v7i2.2842.
