Article
Data-Driven Machine-Learning Methods for Diabetes
Risk Prediction
Elias Dritsas and Maria Trigka *
Department of Computer Engineering and Informatics, University of Patras, 26504 Patras, Greece;
[email protected]
* Correspondence: [email protected]
Diabetes often has no symptoms. When symptoms do occur, they may include thirst, frequent urination, overeating and hunger, fatigue, blurred vision, nausea, vomiting and weight loss (despite overeating) [5]. Some people are more likely to develop diabetes, and various factors may be taken into consideration to evaluate the associated risk of its occurrence. In particular, people who are more prone to develop diabetes are usually over 45 years old and physically inactive in their daily life.
From a gender and waist perspective, men with a waist circumference greater than 102 cm or women with a waist circumference greater than 88 cm have a higher risk of developing diabetes. Furthermore, a body mass index greater than 30 is an indicator of obesity. Finally, diabetes relates to the coexistence of other comorbidities and risk factors, such as elevated cholesterol levels, a history of diabetes in the immediate family, hypertension or cardiovascular disease, peripheral vascular disease, polycystic ovaries in women, gestational diabetes (especially in women who gave birth to overweight babies) and drugs that can induce diabetes (e.g., cortisone) [6,7].
Chronic complications of diabetes can be reduced through regular blood sugar control.
The target organs affected by diabetes are the eyes, the kidneys, the nervous system and
the vessels of the heart, brain and peripheral arteries [8,9].
Early diagnosis of the disease is crucial to avoid unpleasant developments regarding
the patient’s health. Lifestyle changes with proper diet and exercise, as well as medication
under the supervision of appropriate physicians, are the most important elements for an
effective therapeutic approach. The science of medicine has made great steps in reducing
disease mortality and improving patients’ quality of life [10,11].
Proper treatment of patients with diabetes is especially imperative currently, as we deal with the critical COVID-19 pandemic. It should be noted here that patients with diabetes are more likely to have complications from COVID-19 and have increased mortality [12].
Recent advances in the fields of Artificial Intelligence (AI) and Machine Learning (ML)
may provide clinicians and physicians with efficient tools for the early diagnosis of various
diseases, such as Cholesterol [13], Hypertension [14], COPD [15], Continuous Glucose
Monitoring [16], Short-Term Glucose prediction [17], COVID-19 [18], CVDs [19], Stroke [20],
CKD [21], ALF [22], Sleep Disorders [23], Hepatitis [24] and Cancer [25]. The prediction
of type 2 diabetes is the point of interest in this research work. For this specific disease,
numerous research studies have been conducted with the aid of machine-learning models.
In this work, we present a type 2 diabetes risk assessment framework consisting of a plethora of classification models and assuming as risk factors gender, age (demographic data) and the most common symptoms related to the development of diabetes. The contributions of this manuscript are two-fold. First, after class balancing, feature analysis is conducted, which includes (i) feature ranking to identify the order of importance of the features for the diabetes class and (ii) capturing their prevalence in the diabetes class.
The second contribution of this paper is a comparative evaluation of several models in order to identify the ones with the highest performance metrics, i.e., the ones most appropriate for correctly identifying those at high risk. The most common performance metrics, namely Precision, Recall, F-Measure, Accuracy and AUC, are utilized to evaluate the classifiers' performance. Performance analysis is conducted after the application of class balancing, assuming 10-fold cross-validation and data splitting, and demonstrated that Random Forest and K-NN are the most efficient models.
They achieved an accuracy of 98.59% after SMOTE with 10-fold cross-validation and 99.22% after SMOTE with a percentage split (80:20), in comparison to the other models. Furthermore, the proposed models were compared with published research works that used the same dataset and the same features we relied on. From the results of the experiments, our models outperformed them in all cases.
The rest of the paper is organized as follows. Section 2 describes works relevant to the subject under consideration. Section 3 presents the dataset description and the analysis of the methodology followed. In Section 4, we discuss the
acquired research results. Finally, our conclusions and future directions are outlined in
Section 5.
2. Related Work
Currently, researchers have paid great attention to the development of AI-based tools and methods suitable for the monitoring and control of chronic conditions. Specifically, ML models have been widely utilized to quantify the risk of disease occurrence assuming
various features or risk factors. In the context of this section, our purpose is to present
relevant works concerning diabetes.
First, the authors in [26] proposed a framework for diabetes prediction consisting of
different machine learning classifiers, such as K-Nearest Neighbor, Decision Trees, Random
Forest, AdaBoost, Naive Bayes and XGBoost and Multilayer Perceptron neural networks.
Their proposed ensembling classifier is the best performing classifier with the sensitivity,
specificity, false omission rate, diagnostic odds ratio and AUC of 0.789, 0.934, 0.092, 66.234
and 0.950, respectively.
Moreover, in [27], the authors utilized machine-learning techniques in the Pima Indian
diabetes dataset to develop trends and detect patterns with risk factors using the R data
manipulation tool. They applied supervised machine learning algorithms, such as linear
kernel Support Vector Machine (SVM-linear), radial basis function, K-Nearest Neighbor,
Artificial Neural Network and Multifactor Dimensionality Reduction, in order to classify
the patients into diabetic and non-diabetic. The SVM-linear model provides the best
accuracy of 0.89 and precision of 0.88. On the other hand, the K-NN model provided the
best recall and F1 score of 0.90 and 0.88, respectively.
In addition, the authors in [28] compared machine-learning-based models, such as
Glmnet, Random Forest, XGBoost and LightGBM, to commonly used regression models for
the prediction of undiagnosed type 2 diabetes. With six months of data available, a simple
regression model performed with the lowest average Root Mean Square Error of 0.838,
followed by Random Forest (0.842), LightGBM (0.846), Glmnet (0.859) and XGBoost (0.881).
When more data were added, Glmnet improved with the highest rate (+3.4%).
Logistic Regression, K-Nearest Neighbor, Support Vector Machine, Naïve Bayes, Deci-
sion Tree and Random forest were applied in [29]. The 10-fold cross-validation was also
applied to test the effectiveness of the different models. The experimental results showed that Random Forest achieved an accuracy of 94.10%, outperforming the other models.
Additionally, in [30], Logistic Regression was used to identify the risk factors for diabetes based on the p-value and odds ratio (OR). Naïve Bayes, Decision Tree, AdaBoost and Random Forest were applied to predict diabetic patients. Furthermore, three types of
partition protocols (K2, K5 and K10) were considered and repeated in 20 trials. The overall
ACC of the ML-based system is 90.62%. The combination of Logistic Regression-based
feature selection and Random Forest-based classifier gives 94.25% ACC and 0.95 AUC for
the K10 protocol.
Furthermore, in [31], dataset creation, features selection and classification using differ-
ent supervised machine-learning models, such as Naïve Bayes, Decision Trees, Random
Forests and Logistic Regression, were considered. The ensemble Weighted-Voting-Logistic
Regression-Random Forest ML model was proposed to improve the prediction of diabetes,
scoring an Area Under the ROC Curve (AUC) of 0.884.
Finally, the published works [32–35] are based on the dataset of [36]. Specifically, in [32], the authors relied on the Naive Bayes, Logistic Regression and Random Forest algorithms and, after applying 10-fold cross-validation and percentage split (80:20) evaluation techniques, Random Forest was found to have the best accuracy for predicting diabetes in both cases. In [33], the authors applied Bayes Network, Naïve Bayes, J48, Random Tree, Random Forest, K-Nearest Neighbor and Support Vector Machine and, after applying 10-fold cross-validation, the K-Nearest Neighbor achieved the highest accuracy with 98.07%.
In [34], Naive Bayes, Random Forest, Support Vector Machine and Multilayer Perceptron were applied. The results showed that the Random Forest provides the highest value of 0.975 for precision, recall and F-measure. The Multilayer Perceptron also works well, with a precision of 0.96, a recall of 0.963 and an F-measure of 0.964. Last, in [35], the authors relied on an Artificial Neural Network and Random Forest, and, after applying 10-fold cross-validation, the Random Forest outperformed with an accuracy of 97.88%. To sum up, in Table 1 we summarize the aforementioned related works.
• Polyphagia [42]: This feature captures whether the participant had an episode of
excessive/extreme hunger or not. The percentage of participants who had an episode
of excessive/extreme hunger is 45.6%.
• Genital thrush [43]: This feature captures whether the participant had a yeast infection
or not. The percentage of participants who had a yeast infection is 22.3%.
• Visual blurring [44]: This feature captures whether the participant had an episode of
blurred vision or not. The percentage of participants who had an episode of blurred
vision is 44.8%.
• Itching [45]: This feature captures whether the participant had an episode of itch.
The percentage of participants who had an episode of itching is 48.7%.
• Irritability [46]: This feature captures whether the participant had an episode of
irritability. The percentage of participants who had an episode of irritability is 24.2%.
• Delayed healing [47]: This feature captures whether the participant noticed delayed healing when wounded or not. The percentage of participants who noticed delayed healing when wounded is 46%.
• Partial paresis [48]: This feature captures whether the participant had an episode of
weakening of a muscle/group of muscles or not. The percentage of participants who
had an episode of weakening of a muscle/group of muscles is 43.1%.
• Muscle stiffness [49]: This feature captures whether the participant had an episode
of muscle stiffness. The percentage of participants who had an episode of muscle
stiffness is 37.5%.
• Alopecia [50]: This feature captures whether the participant experienced hair loss or
not. The percentage of participants who experienced hair loss is 34.4%.
• Obesity [51]: This feature captures whether the participant can be considered obese
or not. The percentage of participants who are considered obese is 16.9%.
• Diabetes: This feature refers to whether the participant has been diagnosed with
diabetes type 2 or not. The percentage of participants who suffer from diabetes type 2
is 61.5%.
All the attributes are nominal except for age, which is numerical.
Table 2. Evaluation of feature importance based on the Pearson Correlation, Gain Ratio, Naive Bayes
and Random Forest.
As for the first method, namely the Pearson correlation coefficient [53], it is used to infer the strength and direction of the association between the features and the target class and varies between −1 and 1. More specifically, we observe that a strong correlation of 0.7046 is captured between diabetes and the symptom of polyuria. Furthermore, moderate correlations of 0.6969, 0.5017 and 0.4922 are noted between diabetes and polydipsia, sudden weight loss and gender, respectively. The same holds for the partial paresis feature and diabetes, with a coefficient of 0.4757. Diabetes shows a weaker association with the features of polyphagia, irritability, alopecia, visual blurring and weakness, while no meaningful correlation occurs with the remaining features, for which the coefficient is lower than 0.2.
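To make this ranking step concrete, the following Python sketch computes the feature-to-class Pearson correlations on a 0/1-encoded copy of the data; the file name and the target column name "class" are assumptions for illustration, not the exact identifiers of the dataset in [36].

```python
# Sketch: ranking features by Pearson correlation with the diabetes class.
# Assumptions: "diabetes.csv" and the target column "class" are illustrative;
# the nominal attributes hold yes/no style values.
import pandas as pd

df = pd.read_csv("diabetes.csv")

# Encode every binary nominal attribute (including the class) as 0/1 so that
# the Pearson (point-biserial) correlation can be computed directly.
binary_cols = [c for c in df.columns if df[c].nunique() == 2]
df[binary_cols] = df[binary_cols].apply(lambda s: pd.factorize(s)[0])

# Correlation of each feature with the target, sorted by absolute strength.
corr = df.corr(numeric_only=True)["class"].drop("class")
print(corr.reindex(corr.abs().sort_values(ascending=False).index))
```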
The Gain Ratio (GR) method [54] was also employed, which is calculated as $GR(x) = \frac{H(c) - H(c|x)}{H(x)}$, where $H(x) = -p_x \log_2(p_x)$ (with $p_x$ denoting the probability of selecting feature $x$), $H(c) = -p_c \log_2(p_c)$ (with $p_c$ being the probability of selecting an instance in class $c$) and $H(c|x)$ are the entropy of feature $x$, the entropy of class $c$ and the conditional entropy of class $c$ given feature $x$, respectively. The gain ratio is used to determine the relevance of a feature and chooses the ones that achieve the maximal gain ratio, considering the probability of each feature value. The gain ratio, also known as the Uncertainty Coefficient, normalizes the information gain ($H(c) - H(c|x)$) of a feature against how much entropy that feature has.
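A minimal NumPy sketch of this computation is given below; it assumes discrete (nominal) feature values and follows the $GR(x)$ definition above.

```python
# Sketch: gain ratio of a nominal feature x with respect to a class vector c.
import numpy as np

def entropy(values):
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(x, c):
    h_c = entropy(c)            # H(c)
    h_x = entropy(x)            # H(x)
    # Conditional entropy H(c|x) = sum over values v of P(x=v) * H(c | x=v)
    h_c_given_x = sum((x == v).mean() * entropy(c[x == v]) for v in np.unique(x))
    info_gain = h_c - h_c_given_x
    return info_gain / h_x if h_x > 0 else 0.0

# Toy example: a perfectly informative feature yields a gain ratio of 1.0.
x = np.array(["yes", "yes", "no", "no"])
c = np.array([1, 1, 0, 0])
print(gain_ratio(x, c))  # -> 1.0
```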
Furthermore, the Naive Bayes and Random Forest classifiers were selected to measure the importance of the features. Random Forest creates a forest of trees and, per tree, measures a candidate feature's ability to optimally split the instances into the two classes using the Gini impurity [55]. Naive Bayes calculates the conditional probability of each feature, $p(x|c)$, in order to evaluate its usefulness for predicting the output variable.
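The following sketch shows how such model-based importance scores could be obtained with scikit-learn (the experiments in this paper were run in WEKA, so this is only an illustration); X is assumed to be the 0/1-encoded symptom matrix and y the diabetes label.

```python
# Sketch: model-based feature relevance with Random Forest and Naive Bayes.
# Assumptions: X (0/1-encoded features) and y (0/1 class) are already defined.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Mean decrease in Gini impurity, averaged over the trees of the forest.
rf_ranking = np.argsort(rf.feature_importances_)[::-1]

nb = BernoulliNB().fit(X, y)
# Gap between the per-class conditional log-probabilities log P(x_j = 1 | c),
# used here as a simple relevance score for each feature.
nb_score = np.abs(nb.feature_log_prob_[1] - nb.feature_log_prob_[0])
nb_ranking = np.argsort(nb_score)[::-1]
```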
We observe that Naive Bayes and the Pearson correlation coefficient assigned the same order of importance to all features except for age and genital thrush, which appear in reverse order. Although these methods compute importance differently, they result
in the same ordering outcomes. The same order may relate to the fact that (i) Naive Bayes
supposes features independence, as their correlation may harm its performance and (ii) the
correlation coefficient measures the strength of each feature’s relationship with the target
class [56].
The features of polydipsia and polyuria are unanimously ranked first, while the features of muscle stiffness, obesity, delayed healing and itching are ranked last by all methods. For the remaining features, we observe similarities in the ranking order between the different methods. In conclusion, since all features are among the most common symptoms considered by physicians for diabetes screening (along with the blood test for verification), the models' training and validation will be based on all of them.
Figure 2. Participants’ distribution in terms of polyuria and polydipsia in the balanced dataset.
Figure 3. Participants’ distribution in terms of sudden weight loss and weakness in the bal-
anced dataset.
Figure 4 illustrates the participants’ distribution in terms of the features that denote polyphagia and obesity. A total of 29.53% and 9.53% of participants are diabetics who declared an increase in appetite and obesity, respectively. In addition, a moderate percentage of 12.50% and a small portion of 6.56% mentioned excessive hunger and obesity, respectively, although they are not diabetics.
Figure 4. Participants’ distribution in terms of polyphagia and obesity in the balanced dataset.
In the following, Figure 5 depicts the irritability and alopecia signs in terms of the
involved classes. We see that irritability and alopecia coexist with diabetes in 17.19% and
12.19% of the participants, correspondingly. However, an important portion of 25.63%
noted the occurrence of alopecia although they were not diabetic.
Figure 5. Participants’ distribution in terms of irritability and alopecia in the balanced dataset.
Moreover, Figure 6 presents the occurrence of genital thrush and itching signs in terms
of the two classes. We see that these features coexist with diabetes in 12.97% and 24.06%
of the participants, correspondingly. However, an important portion of 24.84% noted the
occurrence of itching while 7.19% had genital thrush although they were not diabetic.
Figure 6. Participants’ distribution in terms of genital thrush and itching in the balanced dataset.
Figure 7. Participants’ distribution in terms of partial paresis and muscle stiffness in the bal-
anced dataset.
Finally, Figure 8 shows the prevalence of diabetes in terms of the features that capture
the occurrence of delayed healing and visual blurring. A total of 50% of those who have
been diagnosed with diabetes (or 25% of the total participants) experience visual blurring, which is attributed to the quick change of blood sugar levels from normal to high. Similar outcomes hold for the coexistence of diabetes and the sign that concerns delayed wound healing, which relates to problems with immune system activation.
Figure 8. Participants’ distribution in terms of delayed healing and visual blurring in the bal-
anced dataset.
$\prod_{j=1}^{n} P(x_{ij}|c)$ is the probability of the features given the class, and $P(x_{i1}, \ldots, x_{in})$ and $P(c)$ are the prior probabilities of the features and the class, respectively. The estimated class is derived by maximizing $P(c)\prod_{j=1}^{n} P(x_{ij}|c)$, where $c \in \{\text{Diabetes}, \text{Non-Diabetes}\}$.
$$f(\mathbf{x}') = \mathrm{Sgn}\left[\sum_{i=1}^{M} \alpha_i c_i K(\mathbf{x}_i, \mathbf{x}') + b\right] \quad (2)$$
$$0 \le \alpha_i \le C, \quad \sum_{i=1}^{M} \alpha_i c_i = 0, \quad \alpha_i \ge 0, \quad i = 1, 2, \ldots, M$$
where $M$ is the number of training instances, $\mathbf{x}_i$ and $c_i$ are the feature vector and class label of training instance $i$, respectively, $b$ is a bias, $c_i \in \{1, -1\}$, and $K(\mathbf{x}_i, \mathbf{x}')$ is the kernel function, which maps the input vectors into an expanded feature space.
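As a rough stand-in for the RBF-kernel SVM of Equation (2), a scikit-learn sketch is shown below; the C and gamma values are placeholders, and X_train, y_train, X_test, y_test are assumed to come from the pre-processing steps.

```python
# Sketch: RBF-kernel SVM classifier (illustrative hyperparameters).
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)            # training split assumed available
print(svm.score(X_test, y_test))     # accuracy on the held-out split
```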
MLPs are designed to approximate any continuous function and can solve problems that are not linearly separable. Furthermore, they can use any arbitrary activation function.
3.3.7. J48
J48 [63] is a machine-learning decision tree classification algorithm that handles both categorical and continuous attributes. It deals with the problems of numeric attributes, missing values, pruning, estimating error rates, the complexity of decision tree induction and generating rules from trees.
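J48 is WEKA's implementation of C4.5; outside WEKA, a CART decision tree with entropy-based splits and pruning is a close, though not identical, substitute, as in the following sketch.

```python
# Sketch: entropy-based decision tree with cost-complexity pruning
# (an approximation of J48/C4.5; scikit-learn implements CART).
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion="entropy",  # information-based splits
                              ccp_alpha=0.001,      # pruning strength
                              random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))
```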
3.3.13. AdaBoostM1
Let $G_m(\mathbf{x}_i)$, for $m = 1, 2, \ldots, M$, be the sequence of weak classifiers. Our objective is to build $G(\mathbf{x}) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(\mathbf{x}_i)\right)$. The final prediction is a combination of the predictions from all classifiers through a weighted majority vote. At the first step, $m = 1$, the weights are initialized uniformly as $w_i = 1/N$. The coefficients $\alpha_m$ are computed by the boosting algorithm and weight the contribution of each respective $G_m(\mathbf{x}_i)$, giving higher influence to the more accurate classifiers in the sequence. At each boosting step, the data are modified by applying weights $w_1, w_2, \ldots, w_N$ to each training observation. At step $m$, the observations that were misclassified previously have their weights increased [69].
3.3.15. Stacking
Stacking is a common approach utilized to acquire more accurate predictions than those of single models. Stacking uses the predicted class labels of the base models as input features to train a meta-classifier that undertakes to determine the final class label [71].
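A possible scikit-learn sketch of such a stacked ensemble, using the base/meta configuration adopted later in the experiments (RF and KNN as base models, Logistic Regression as meta-model), follows; the hyperparameter values are placeholders.

```python
# Sketch: stacking with class-label predictions of the base models as
# meta-features, as described above.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100)),
                ("knn", KNeighborsClassifier(n_neighbors=1))],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict",  # feed predicted labels to the meta-classifier
    cv=10)                   # meta-features produced via cross-validation
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```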
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \qquad (4)$$
$$\text{F-Measure} = 2\,\frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad \text{Accuracy} = \frac{TN + TP}{TN + TP + FN + FP} \qquad (5)$$
Precision indicates how many of those who are labeled as diabetic actually belong
to this class. Recall shows how many of those who are diabetic are correctly predicted.
F-Measure is the harmonic mean of the precision and recall and captures the predictive
performance of a model. The Accuracy illustrates the proportion of the total number of
predictions that were correct.
To evaluate the distinguishability of a model, the Area Under the Curve (AUC) is exploited. It is a metric that varies in [0, 1]. The closer to one, the better the ML model performs in distinguishing diabetes from non-diabetes instances. If the AUC equals one, the ML model can perfectly separate the instance distributions of the two classes. In the special case where all non-diabetes (diabetes) instances are classified as diabetes (non-diabetes), the AUC equals 0.
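For reference, these metrics can be computed directly from the predictions, as in the sketch below; y_true, y_pred and y_score are assumed to hold the true labels (0/1, with 1 = diabetes), the predicted labels and the predicted probability of the diabetes class, respectively.

```python
# Sketch: the evaluation metrics of Equations (4)-(5) plus the AUC.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F-Measure:", f1_score(y_true, y_pred))
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))
```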
4.2. Evaluation
In this research work, various ML models, such as BayesNet, NB, SVM, LR, ANN,
KNN, J48, LMT, RF, RT, RepTree, RotF, AdaBoostM1 and SGD and Ensemble method
(Stacking), are evaluated in terms of the accuracy, precision, recall, F-measure and AUC.
In Table 4, we illustrate the performance of the models under consideration after
applying SMOTE with 10-fold cross-validation. From the results of the experiments, we
can see that the KNN and RF models present the best prediction accuracy with 98.59%
compared to the other proposed models. Furthermore, the RotF and RF models have an AUC of 99.9%. It should be noted that with SMOTE and 10-fold cross-validation,
all our models have an accuracy greater than 88.75% (BayesNet) and an AUC greater than
94.2% (SGD).
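A hedged sketch of this evaluation protocol for the Random Forest model is given below; note that the imbalanced-learn pipeline applies SMOTE only to the training folds of each split, which may differ in detail from the WEKA workflow used in the paper.

```python
# Sketch: SMOTE combined with stratified 10-fold cross-validation.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("rf", RandomForestClassifier(n_estimators=100, random_state=0))])
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print("Mean CV accuracy:", scores.mean())
```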
Model | Parameters
BayesNet | estimator: SimpleEstimator; searchAlgorithm: K2; useADTree: False
NB | useKernelEstimator: False; useSupervisedDiscretization: False
SVM | eps = 0.001; gamma = 0.0; kernel type: radial basis function; loss = 0.1
LR | ridge = 10^-8; useConjugateGradientDescent: False
ANN | hidden layers: ‘a’; learning rate = 0.3; momentum = 0.2; training time = 500
KNN | K = 1; search algorithm: LinearNNSearch with Euclidean distance
J48 | reducedErrorPruning: False; saveInstanceData: False; subtreeRaising: True
LMT | errorOnProbabilities: False; fastRegression: True; numInstances = 15; useAIC: False
RF | maxDepth = 0; numIterations = 100; numFeatures = 0
RT | maxDepth = 0; minNum = 1.0; minVarianceProp = 0.001
RepTree | maxDepth = −1; minNum = 2.0; minVarianceProp = 0.001
RotF | classifier: J48; numberOfGroups: False; projectionFilter: PrincipalComponents
AdaBoostM1 | classifier: DecisionStump; resume: False; useResampling: False
SGD | epochs = 500; epsilon = 0.001; lambda = 10^-4; learningRate = 0.01; lossFunction: Hinge loss (SVM)
Stacking | Base models: RF, KNN; Meta-model: LR
Moreover, in Table 5, we summarize related works based on the dataset [36] after
applying 10-fold cross-validation on the same features we relied on but without SMOTE.
Our proposed models after SMOTE and 10-fold cross-validation showed better performance
in terms of accuracy compared to the related works as shown in Table 5.
In addition, in Table 6, we depict the performance of ML models in terms of accuracy,
recall, precision, F-measure and AUC after applying SMOTE and percentage split (80:20).
In this case too, the KNN and RF achieved the best performance in relation to the rest of the models, with an accuracy of 99.22%. Furthermore, the RF model and the Stacking method achieved an AUC of 100%. Our proposed models have excellent AUC rates greater than
93.7% (SGD) and accuracy greater than 88.28% (BayesNet).
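The percentage-split protocol can be sketched as follows, under the assumption that SMOTE is applied to the training portion only.

```python
# Sketch: 80:20 percentage split with SMOTE on the training part and a
# 1-nearest-neighbour classifier, mirroring the KNN configuration above.
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_res, y_res)
print("Hold-out accuracy:", knn.score(X_test, y_test))
```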
Furthermore, in Table 7, we outline the accuracy of our proposed models, namely NB, LR, J48 and RF, after applying SMOTE and a percentage split (80:20). The same table shows the results of the work in [32] after applying a percentage split (80:20) on the same features we relied on but without SMOTE. We observe that our proposed models showed better accuracy, albeit with a small percentage gap of 0.22–1.97%.
Accuracy
Model | Proposed Models | [32] | [33] | [34] | [35]
BayesNet | 88.75% | - | 86.92% | - | -
NB | 88.91% | 87.4% | 87.11% | 87.1% | -
SVM | 95.62% | - | 92.11% | 92.1% | -
LR | 93.44% | 92.4% | - | - | -
ANN | 96.45% | - | - | 96.3% | 96.34%
KNN | 98.59% | - | 98.07% | - | -
J48 | 97.19% | 95.6% | 95.96% | - | -
RF | 98.59% | 97.4% | 97.5% | 97.5% | 97.88%
RT | 97.97% | - | 96.15% | - | -
Accuracy
Model | NB | LR | J48 | RF
Proposed models | 89.06% | 92.97% | 95.53% | 99.22%
[32] | 88% | 91% | 95% | 99%
Finally, we note a limitation of this research work. This study was based on a publicly available dataset. The dataset we relied on does not come from a hospital unit or institute, which could have provided richer information and data with different characteristics, such as biochemical measurements that record a detailed health profile of the participants. Acquiring access to such data is time-consuming and difficult for privacy reasons.
5. Conclusions
The growing incidence of diabetes is a result of the habits and lifestyle of the modern world. Medical professionals now have the opportunity, with the contribution of machine-learning techniques, to assess the relative risk and provide appropriate guidelines and interventions for the management, treatment or prevention of diabetes.
In this research article, we applied several machine-learning models in order to identify
individuals at risk of diabetes based on specific risk factors. Data exploration through
risk factor analysis could help to identify associations between the features and diabetes.
Performance analysis showed that data pre-processing is a major step in the design of efficient and accurate models for predicting diabetes occurrence.
Specifically, after applying SMOTE with 10-fold cross-validation, the Random Forest
and KNN outperformed the other models with an accuracy of 98.59%. Similarly, applying
SMOTE with a percentage split (80:20), the Random Forest and KNN outperformed the
other models with an accuracy of 99.22%. In both cases, applying SMOTE, our proposed
models were superior to the related published research works based on the [36] dataset
with the same features we relied on in terms of accuracy.
In future work, we aim to extend the machine-learning framework through the use of deep-learning methods by applying Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNN) to the same dataset and comparing the results in terms of accuracy with relevant published works.
Author Contributions: E.D. and M.T. conceived the idea, designed and performed the experiments,
analyzed the results, drafted the initial manuscript and revised the final manuscript. All authors
have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Zimmet, P.Z.; Magliano, D.J.; Herman, W.H.; Shaw, J.E. Diabetes: A 21st century challenge. Lancet Diabetes Endocrinol. 2014,
2, 56–64. [CrossRef]
2. Atkinson, M.A.; Eisenbarth, G.S.; Michels, A.W. Type 1 diabetes. Lancet 2014, 383, 69–82. [CrossRef]
3. Chatterjee, S.; Khunti, K.; Davies, M.J. Type 2 diabetes. Lancet 2017, 389, 2239–2251. [CrossRef]
4. McIntyre, H.D.; Catalano, P.; Zhang, C.; Desoye, G.; Mathiesen, E.R.; Damm, P. Gestational diabetes mellitus. Nat. Rev. Dis. Prim.
2019, 5, 47. [CrossRef]
5. Ramachandran, A. Know the signs and symptoms of diabetes. Indian J. Med Res. 2014, 140, 579.
6. Wu, Y.; Ding, Y.; Tanaka, Y.; Zhang, W. Risk factors contributing to type 2 diabetes and recent advances in the treatment and
prevention. Int. J. Med Sci. 2014, 11, 1185. [CrossRef]
7. Bellou, V.; Belbasis, L.; Tzoulaki, I.; Evangelou, E. Risk factors for type 2 diabetes mellitus: An exposure-wide umbrella review of
meta-analyses. PLoS ONE 2018, 13, e0194127. [CrossRef]
8. Kumar, A.; Bharti, S.K.; Kumar, A. Type 2 diabetes mellitus: The concerned complications and target organs. Apollo Med. 2014,
11, 161–166. [CrossRef]
9. Daryabor, G.; Atashzar, M.R.; Kabelitz, D.; Meri, S.; Kalantar, K. The effects of type 2 diabetes mellitus on organ metabolism and
the immune system. Front. Immunol. 2020, 11, 1582. [CrossRef]
10. Uusitupa, M.; Khan, T.A.; Viguiliouk, E.; Kahleova, H.; Rivellese, A.A.; Hermansen, K.; Pfeiffer, A.; Thanopoulou, A.; Salas-
Salvadó, J.; Schwab, U.; et al. Prevention of type 2 diabetes by lifestyle changes: A systematic review and meta-analysis. Nutrients
2019, 11, 2611. [CrossRef]
11. Kyrou, I.; Tsigos, C.; Mavrogianni, C.; Cardon, G.; Van Stappen, V.; Latomme, J.; Kivelä, J.; Wikström, K.; Tsochev, K.; Nanasi, A.; et al.
Sociodemographic and lifestyle-related risk factors for identifying vulnerable groups for type 2 diabetes: A narrative review with
emphasis on data from Europe. BMC Endocr. Disord. 2020, 20, 134. [CrossRef]
12. Huang, I.; Lim, M.A.; Pranata, R. Diabetes mellitus is associated with increased mortality and severity of disease in COVID-19
pneumonia–a systematic review, meta-analysis, and meta-regression. Diabetes Metab. Syndr. Clin. Res. Rev. 2020, 14, 395–403.
[CrossRef]
13. Fazakis, N.; Dritsas, E.; Kocsis, O.; Fakotakis, N.; Moustakas, K. Long-Term Cholesterol Risk Prediction with Machine Learning
Techniques in ELSA Database. In Proceedings of the 13th International Joint Conference on Computational Intelligence (IJCCI),
Valletta, Malta, 25–27 October 2021; pp. 445–450.
14. Dritsas, E.; Fazakis, N.; Kocsis, O.; Fakotakis, N.; Moustakas, K. Long-Term Hypertension Risk Prediction with ML Techniques in
ELSA Database. In Proceedings of the International Conference on Learning and Intelligent Optimization, Athens, Greece, 20–25
June 2021; Springer: Cham, Switzerland, 2021; pp. 113–120.
15. Moll, M.; Qiao, D.; Regan, E.A.; Hunninghake, G.M.; Make, B.J.; Tal-Singer, R.; McGeachie, M.J.; Castaldi, P.J.; Estepar, R.S.J.;
Washko, G.R.; et al. Machine learning and prediction of all-cause mortality in COPD. Chest 2020, 158, 952–964. [CrossRef]
16. Alexiou, S.; Dritsas, E.; Kocsis, O.; Moustakas, K.; Fakotakis, N. An approach for Personalized Continuous Glucose Prediction
with Regression Trees. In Proceedings of the 2021 sixth South-East Europe Design Automation, Computer Engineering, Computer
Networks and Social Media Conference (SEEDA-CECNSM), Preveza, Greece, 24–26 September 2021; pp. 1–6.
17. Dritsas, E.; Alexiou, S.; Konstantoulas, I.; Moustakas, K. Short-term Glucose Prediction based on Oral Glucose Tolerance Test
Values. In Proceedings of the International Joint Conference on Biomedical Engineering Systems and Technologies—HEALTHINF,
Online, 9–11 February 2022; Volume 5, pp. 249–255.
18. Zoabi, Y.; Deri-Rozov, S.; Shomron, N. Machine learning-based prediction of COVID-19 diagnosis based on symptoms. NPJ Digit.
Med. 2021, 4, 3. [CrossRef]
19. Dritsas, E.; Alexiou, S.; Moustakas, K. Cardiovascular Disease Risk Prediction with Supervised Machine Learning Techniques.
In Proceedings of the eighth International Conference on Information and Communication Technologies for Ageing Well and
e-Health, ICT4AWE, Online, 23–25 April 2022; pp. 315–321.
20. Dritsas, E.; Trigka, M. Stroke Risk Prediction with Machine Learning Techniques. Sensors 2022, 22, 4670. [CrossRef]
21. Wang, W.; Chakraborty, G.; Chakraborty, B. Predicting the risk of chronic kidney disease (ckd) using machine learning algorithm.
Appl. Sci. 2020, 11, 202. [CrossRef]
22. Speiser, J.L.; Karvellas, C.J.; Wolf, B.J.; Chung, D.; Koch, D.G.; Durkalski, V.L. Predicting daily outcomes in acetaminophen-
induced acute liver failure patients with machine learning techniques. Comput. Methods Programs Biomed. 2019, 175, 111–120.
[CrossRef]
23. Konstantoulas, I.; Kocsis, O.; Dritsas, E.; Fakotakis, N.; Moustakas, K. Sleep Quality Monitoring with Human Assisted Corrections.
In Proceedings of the International Joint Conference on Computational Intelligence (IJCCI), Valletta, Malta, 25–27 October 2021;
pp. 435–444.
24. Yarasuri, V.K.; Indukuri, G.K.; Nair, A.K. Prediction of hepatitis disease using machine learning technique. In Proceedings of the
2019 Third International conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Palladam, India, 12–14
December 2019; pp. 265–269.
25. Saba, T. Recent advancement in cancer detection using machine learning: Systematic survey of decades, comparisons and
challenges. J. Infect. Public Health 2020, 13, 1274–1289. [CrossRef]
26. Hasan, M.K.; Alam, M.A.; Das, D.; Hossain, E.; Hasan, M. Diabetes prediction using ensembling of different machine learning
classifiers. IEEE Access 2020, 8, 76516–76531. [CrossRef]
27. Kaur, H.; Kumari, V. Predictive modelling and analytics for diabetes using a machine learning approach. Appl. Comput. Inform.
2020, 18, 90–100. [CrossRef]
28. Kopitar, L.; Kocbek, P.; Cilar, L.; Sheikh, A.; Stiglic, G. Early detection of type 2 diabetes mellitus using machine learning-based
prediction models. Sci. Rep. 2020, 10, 11981. [CrossRef]
29. Tigga, N.P.; Garg, S. Prediction of type 2 diabetes using machine learning classification methods. Procedia Comput. Sci. 2020,
167, 706–716. [CrossRef]
30. Maniruzzaman, M.; Rahman, M.; Ahammed, B.; Abedin, M. Classification and prediction of diabetes disease using machine
learning paradigm. Health Inf. Sci. Syst. 2020, 8, 7. [CrossRef]
31. Fazakis, N.; Kocsis, O.; Dritsas, E.; Alexiou, S.; Fakotakis, N.; Moustakas, K. Machine learning tools for long-term type 2 diabetes
risk prediction. IEEE Access 2021, 9, 103737–103757. [CrossRef]
32. Islam, M.; Ferdousi, R.; Rahman, S.; Bushra, H.Y. Likelihood prediction of diabetes at early stage using data mining techniques.
In Computer Vision and Machine Intelligence in Medical Image Analysis; Springer: Berlin/Heidelberg, Germany, 2020; pp. 113–125.
33. Alpan, K.; İlgi, G.S. Classification of diabetes dataset with data mining techniques by using WEKA approach. In Proceedings of
the 2020 fourth International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), Istanbul, Turkey,
22–24 October 2020; pp. 1–7.
34. Patel, S.; Patel, R.; Ganatra, N.; Patel, A. Predicting a risk of diabetes at early stage using machine learning approach. Turk. J.
Comput. Math. Educ. (TURCOMAT) 2021, 12, 5277–5284.
35. Elsadek, S.N.; Alshehri, L.S.; Alqhatani, R.A.; Algarni, Z.A.; Elbadry, L.O.; Alyahyan, E.A. Early Prediction of Diabetes Disease
Based on Data Mining Techniques. In Proceedings of the International Conference on Computational Intelligence in Data Science,
Chennai, India, 18–20 March 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 40–51.
36. Early Classification of Diabetes. Available online: https://www.kaggle.com/datasets/andrewmvd/early-diabetes-classification
(accessed on 25 June 2022).
37. Yi, S.W.; Park, S.; Lee, Y.h.; Balkau, B.; Yi, J.J. Fasting glucose and all-cause mortality by age in diabetes: A prospective cohort
study. Diabetes Care 2018, 41, 623–626. [CrossRef]
38. Harreiter, J.; Kautzky-Willer, A. Sex and gender differences in prevention of type 2 diabetes. Front. Endocrinol. 2018, 9, 220.
[CrossRef]
39. Marks, B.E. Initial Evaluation of Polydipsia and Polyuria. In Endocrine Conditions in Pediatrics; Springer: Berlin/Heidelberg,
Germany, 2021; pp. 107–111.
40. Hamman, R.F.; Wing, R.R.; Edelstein, S.L.; Lachin, J.M.; Bray, G.A.; Delahanty, L.; Hoskin, M.; Kriska, A.M.; Mayer-Davis,
E.J.; Pi-Sunyer, X.; et al. Effect of weight loss with lifestyle intervention on risk of diabetes. Diabetes Care 2006, 29, 2102–2107.
[CrossRef]
41. Peterson, M.D.; Zhang, P.; Choksi, P.; Markides, K.S.; Al Snih, S. Muscle weakness thresholds for prediction of diabetes in adults.
Sport. Med. 2016, 46, 619–628. [CrossRef]
42. Batchelor, D.J.; German, A.J. Polyphagia. In BSAVA Manual of Canine and Feline Gastroenterology; BSAVA Library: Gloucester, UK,
2019; pp. 46–48.
43. Schneider, C.R.; Moles, R.; El-Den, S. Thrush: Detection and management in community pharmacy. Pharm. J. R. Pharm. Soc. Publ.
2018, 2018, 1–10.
44. Tamhankar, M.A. Transient Visual Loss or Blurring. In Liu, Volpe, and Galetta’s Neuro-Ophthalmology; Elsevier: Amsterdam, The
Netherlands, 2019; pp. 365–377.
45. Stefaniak, A.; Chlebicka, I.; Szepietowski, J. Itch in diabetes: A common underestimated problem. Adv. Dermatol. Allergol.
Dermatol. I Alergol. 2019, 38, 177–183. [CrossRef]
46. Barata, P.C.; Holtzman, S.; Cunningham, S.; O’Connor, B.P.; Stewart, D.E. Building a definition of irritability from academic
definitions and lay descriptions. Emot. Rev. 2016, 8, 164–172. [CrossRef] [PubMed]
47. Blakytny, R.; Jude, E. The molecular biology of chronic wounds and delayed healing in diabetes. Diabet. Med. 2006, 23, 594–608.
[CrossRef] [PubMed]
48. Andersen, H.; Nielsen, S.; Mogensen, C.E.; Jakobsen, J. Muscle strength in type 2 diabetes. Diabetes 2004, 53, 1543–1548. [CrossRef]
49. Miyake, H.; Kanazawa, I.; Tanaka, K.I.; Sugimoto, T. Low skeletal muscle mass is associated with the risk of all-cause mortality in
patients with type 2 diabetes mellitus. Ther. Adv. Endocrinol. Metab. 2019, 10, 2042018819842971. [CrossRef]
50. Su, L.H.; Chen, L.S.; Lin, S.C.; Chen, H.H. Association of androgenetic alopecia with mortality from diabetes mellitus and heart
disease. JAMA Dermatol. 2013, 149, 601–606. [CrossRef]
51. Chobot, A.; Górowska-Kowolik, K.; Sokołowska, M.; Jarosz-Chobot, P. Obesity and diabetes—Not only a simple link between
two epidemics. Diabetes/Metab. Res. Rev. 2018, 34, e3042. [CrossRef]
52. Maldonado, S.; López, J.; Vairetti, C. An alternative SMOTE oversampling strategy for high-dimensional datasets. Appl. Soft
Comput. 2019, 76, 380–389. [CrossRef]
53. Pavithra, V.; Jayalakshmi, V. Hybrid feature selection technique for prediction of cardiovascular diseases. Mater. Today Proc. 2021,
in press.
54. Gnanambal, S.; Thangaraj, M.; Meenatchi, V.; Gayathri, V. Classification algorithms with attribute selection: An evaluation study
using WEKA. Int. J. Adv. Netw. Appl. 2018, 9, 3640–3644.
55. Aldrich, C. Process variable importance analysis by use of random forests in a shapley regression framework. Minerals 2020,
10, 420. [CrossRef]
56. Chormunge, S.; Jena, S. Correlation based feature selection with clustering for high dimensional data. J. Electr. Syst. Inf. Technol.
2018, 5, 542–549. [CrossRef]
57. Berrar, D. Bayes’ theorem and naive Bayes classifier. Encycl. Bioinform. Comput. Biol. ABC Bioinform. 2018, 1, 403–412.
58. Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian network classifiers. Mach. Learn. 1997, 29, 131–163. [CrossRef]
59. Yang, Y.; Li, J.; Yang, Y. The research of the fast SVM classifier method. In Proceedings of the 2015 12th International Computer
Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), Chengdu, China, 18–20 December
2015; pp. 121–124.
60. Nusinovici, S.; Tham, Y.C.; Yan, M.Y.C.; Ting, D.S.W.; Li, J.; Sabanayagam, C.; Wong, T.Y.; Cheng, C.Y. Logistic regression was as
good as machine learning for predicting major chronic diseases. J. Clin. Epidemiol. 2020, 122, 56–69. [CrossRef]
61. Masih, N.; Naz, H.; Ahuja, S. Multilayer perceptron based deep neural network for early detection of coronary heart disease.
Health Technol. 2021, 11, 127–138. [CrossRef]
62. Cunningham, P.; Delany, S.J. k-Nearest neighbour classifiers-A Tutorial. ACM Comput. Surv. (CSUR) 2021, 54, 1–25. [CrossRef]
63. Bhargava, N.; Sharma, G.; Bhargava, R.; Mathuria, M. Decision tree analysis on j48 algorithm for data mining. Proc. Int. J. Adv.
Res. Comput. Sci. Softw. Eng. 2013, 3, 1114–1119.
64. Truong, X.L.; Mitamura, M.; Kono, Y.; Raghavan, V.; Yonezawa, G.; Truong, X.Q.; Do, T.H.; Tien Bui, D.; Lee, S. Enhancing
prediction performance of landslide susceptibility model using hybrid machine learning approach of bagging ensemble and
logistic model tree. Appl. Sci. 2018, 8, 1046. [CrossRef]
65. Palimkar, P.; Shaw, R.N.; Ghosh, A. Machine learning technique to prognosis diabetes disease: Random forest classifier approach.
In Advanced Computing and Intelligent Technologies; Springer: Berlin/Heidelberg, Germany, 2022; pp. 219–244.
66. Elomaa, T.; Kaariainen, M. An analysis of reduced error pruning. J. Artif. Intell. Res. 2001, 15, 163–187. [CrossRef]
67. Joloudari, J.H.; Hassannataj Joloudari, E.; Saadatfar, H.; Ghasemigol, M.; Razavi, S.M.; Mosavi, A.; Nabipour, N.; Shamshirband, S.;
Nadai, L. Coronary artery disease diagnosis; ranking the significant features using a random trees model. Int. J. Environ. Res.
Public Health 2020, 17, 731. [CrossRef]
68. Rodriguez, J.J.; Kuncheva, L.I.; Alonso, C.J. Rotation forest: A new classifier ensemble method. IEEE Trans. Pattern Anal. Mach.
Intell. 2006, 28, 1619–1630. [CrossRef]
69. Freund, Y.; Schapire, R.E. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference
on Machine Learning, Bari, Italy, 3–6 July 1996.
70. Netrapalli, P. Stochastic gradient descent and its variants in machine learning. J. Indian Inst. Sci. 2019, 99, 201–213. [CrossRef]
71. Pavlyshenko, B. Using stacking approaches for machine learning models. In Proceedings of the 2018 IEEE Second International
Conference on Data Stream Mining & Processing (DSMP), Lviv, Ukraine, 21–25 August 2018; pp. 255–258.
72. Hossin, M.; Sulaiman, M.N. A review on evaluation metrics for data classification evaluations. Int. J. Data Min. Knowl. Manag.
Process 2015, 5, 1.
73. Waikato Environment for Knowledge Analysis. Available online: https://www.weka.io/ (accessed on 25 June 2022).