IC - Eduarda - 1 s2.0 S0010482520304820 Main
ARTICLE INFO

Keywords: Machine learning; Malaria; Diagnosis; Case reports; Patient information

ABSTRACT

Background: Rapid diagnosis is crucial for controlling malaria. Various studies have aimed at developing machine learning models to diagnose malaria using blood smear images; however, this approach has many limitations. This study developed a machine learning model for malaria diagnosis using patient information.
Methods: To construct datasets, we extracted patient information from PubMed abstracts published from 1956 to 2019. We used two datasets: a solely parasitic disease dataset and a total dataset that additionally includes information about other diseases. We compared six machine learning models: support vector machine, random forest (RF), multilayered perceptron, AdaBoost, gradient boosting (GB), and CatBoost. In addition, the synthetic minority oversampling technique (SMOTE) was employed to address the data imbalance problem.
Results: Concerning the solely parasitic disease dataset, RF was found to be the best model regardless of whether SMOTE was used. Concerning the total dataset, GB was found to be the best; however, after applying SMOTE, RF performed the best. With the imbalanced data, nationality was found to be the most important feature in malaria prediction. With the data balanced by SMOTE, the most important feature was symptom.
Conclusions: The results demonstrated that machine learning techniques can be successfully applied to predict malaria using patient information.
Y.W. Lee et al. Computers in Biology and Medicine 129 (2021) 104151
https://doi.org/10.1016/j.compbiomed.2020.104151
Received 22 September 2020; Received in revised form 9 November 2020; Accepted 24 November 2020
Available online 28 November 2020
0010-4825/© 2020 Elsevier Ltd. All rights reserved.
* Corresponding author. Department of Tropical Medicine and Parasitology, Seoul National University College of Medicine, 103 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea.
E-mail address: [email protected] (E.-H. Shin).

1. Introduction

Malaria is a dangerous infectious disease caused by various species of Plasmodium; it occurs worldwide and can be cured using drugs [1]. The World Health Organization (WHO) World Malaria Report 2019 indicated 228 million cases of malaria, with 405,000 deaths, in more than 90 countries in 2018 [2,3]. Early diagnosis of malaria is very important because it allows appropriate disease management and treatment [4–6]. Therefore, various methods for diagnosing malaria have been proposed so far, such as polymerase chain reaction (PCR), rapid diagnostic tests (RDTs), and microscopy [7–9]. Frickmann et al. evaluated PCR assays for differentiating Plasmodium species [10]. Amaral et al. established ribosomal- and non-ribosomal-targeting PCR assays for detecting low-density and mixed malaria infections [11]. Makuuchi et al. evaluated RDTs by comparing their results with those of microscopy analysis [12]. However, these methods are generally expensive in terms of time and expert labor. Recently, machine learning-based diagnoses have been investigated to increase the diagnosis speed [5,6,13].
Various studies have been conducted on diagnosing malaria using machine learning [6,14], most of which focused on the blood smear image approach [14]. Blood smear microscopic examination is the most reliable evidence in parasitic disease diagnosis [15]; moreover, machine learning-based diagnosis reduces the required costs and professional labor while increasing the diagnosis accuracy [1]. However, supervised learning requires experts to label the images appropriately in order to construct training datasets, a step known as annotation [13,16]. Moreover, diagnoses using microscopy methods depend considerably on the skills and experience of experts [1,15]. Therefore, the specificity and sensitivity of these methods depend heavily on the expert [13,15].
Meanwhile, another important indicator that needs to be considered in malaria diagnosis is patient information, including symptomatology, nationality, age, gender, and travel history [17,18]. However, it is difficult to discriminate malaria infection from other parasitic diseases [18], as patients with these diseases usually exhibit symptoms similar to those of malaria. Various effective methods for machine learning diagnosis have been developed
using patient information. Spathis et al. considered age, gender, and symptomatology of a patient as variables for diagnosing chronic obstructive pulmonary disease [19]. Terrada et al. classified and predicted atherosclerosis using a machine learning approach in which the model was trained on data including age, gender, and symptoms [20]. Mello-Roman et al. predicted dengue using data on age, gender, region, and symptomatology of a patient [21]. However, no study, so far, has attempted to diagnose malaria using machine learning models trained on patient information.
Therefore, this paper proposes a machine learning model to predict malaria by using patient information obtained from parasite case reports. We extract the data on nationality, disease, gender, age, symptoms, and body region of patients with symptoms. Then, we train six machine learning models on these data.

Fig. 1. Data processing.

Table 1
Dataset review.

                              Non-parasitic   Only parasitic disease (N = 1698)             Overall
                              disease         Non-malaria    Malaria        Total           (N = 1846)
                              (N = 148)       (N = 1563)     (N = 135)
Gender (n)
  Male                        78              881            89             970             1048
  Female                      70              682            46             728             798
Age (n)
  1–20                        6               334            17             351             357
  21–40                       24              562            56             618             642
  41–60                       54              421            50             471             525
  61–80                       56              233            12             245             301
  81~                         8               13             0              13              21
Nationality (n)
  Africa                      16              251            69             320             336
  America                     16              309            14             323             339
  Asia                        87              591            37             628             715
  Europe                      27              364            13             377             404
  Oceania & Caribbean         2               48             2              50              52
Symptomatic body region
(n, (%))                      122 (82.4)      1360 (87)      78 (57.8)      1438 (84.7)     1560 (83.7)
  ABDOMEN                     21              476            24             500             521
  BACK                        7               32             7              39              46
  CHEST                       15              67             4              71              86
  EAR                         1               9              2              11              12
  EXTREMITIES                 4               22             1              23              27
  GASTROINTESTINAL            8               68             2              70              78
  HAIR                        0               9              0              9               9
  HEAD                        0               28             0              28              28
  LYMPH NODE                  8               35             0              35              43
  MOUTH                       0               12             1              13              13
  NAIL                        1               4              0              4               5
  NECK                        3               41             1              42              45
  NEUROLOGICAL                5               78             9              87              92
  OCULAR                      3               93             1              94              97
  PELVIS                      0               5              0              5               5
  PSYCHIATRIC                 0               7              6              13              13
  PULMONARY                   26              152            13             165             191
  RECTUM                      1               8              0              8               9
  SKIN                        15              164            5              169             184
  TOOTH                       1               1              0              1               2
  VAGINA                      1               17             0              17              18
  VISION                      2               32             2              34              36
Symptom (n, (%))              46 (31.1)       437 (27.3)     86 (63.7)      523 (30.8)      575 (31.1)
  ALOPECIA                    1               2              0              2               3
  APATHY                      0               4              0              4               4
  APHASIA                     1               6              0              6               7
  APNEA                       0               0              0              0               1
  APRAXIA                     0               2              0              2               2
  ARRHYTHMIA                  0               0              1              1               1
  ARTHRALGIA                  5               11             5              16              21
  ASTHENIA                    0               1              0              1               1
  ATAXIA                      1               12             3              15              16
  BACK PAIN                   7               18             0              18              25
  BLEEDING                    0               19             2              21              21
  BLINDNESS                   0               12             1              13              13
  BLURRED VISION              0               6              0              6               6
  CHILLS                      1               8              7              15              16
  CHRONIC PAIN                0               0              0              0               1
  CONFUSION                   2               7              2              9               11
  DEFORMITY                   0               4              0              4               4
  DEPRESSION                  0               3              1              4               4
  DISCHARGE                   2               15             1              16              18
  DIZZINESS                   0               5              0              5               5
  DYSARTHRIA                  0               0              0              0               2
  FECAL INCONTINENCE          0               0              0              0               1
  FEVER                       14              131            43             174             188
  HALLUCINATION               0               2              0              2               2
  HEARING LOSS                0               2              2              4               4
  HEARTBURN                   0               3              0              3               3
  HEMATEMESIS                 0               2              0              2               2
  INFERTILITY                 0               3              0              3               3
(continued on next page)

2. Methods

2.1. Dataset

Using BioPython, we obtained the data corresponding to 56 parasitic disease reports provided by the Centers for Disease Control and Prevention (CDC) [22] and abstracts of case reports of non-parasitic diseases (cancer, Alzheimer's disease, rheumatoid disease, and diabetes) published from 1956 to 2019 in PubMed [23]. Based on the CDC parasitic disease list, we grouped the 56 diseases by causative parasite genus or by shared disease name. For example, hydatid disease, alveolar echinococcosis, and echinococcosis are caused by the same genus, i.e., Echinococcus, and were categorized as the same parasitic disease. Filaria, filariasis, elephantiasis, and infections with Wuchereria bancrofti or the Brugia genus were regarded as the same parasitic disease. Nonpathogenic intestinal protozoa (Enteromonas hominis, Retortamonas intestinalis, and Pentatrichomonas hominis) can be present in feces, but they are harmless. Therefore, we combined them into one disease. Using this method, we were able to reorganize the 56 parasitic diseases.
We selected non-parasitic diseases using the following standards. i) We chose diseases that constitute the top ten causes of death worldwide [24] or diseases with more than 10,000 cases from 1956 to 2019. This approach was used because we wanted to have a diverse sample of patients. ii) We chose a disease with a name without an organ name when
collecting case reports. For heart or brain diseases, if the name of the organ was included, the results could overlap with the case reports of parasites. iii) We excluded infections because the symptoms of infections are similar to those of parasitic infections. Thus, we wanted to confirm the applicability of our model to non-parasitic disease patients who had other symptoms. The diseases that we used met these standards.
To extract relevant data, we ran queries built from logical combinations of the operators "AND" and "OR." Our queries are listed in Appendix A, Supplementary Table 1. Parasitic diseases have many related names; therefore, we combined them using "OR" and "AND". For example, another name for sleeping sickness is African trypanosomiasis, and the causative parasite belongs to the Trypanosoma genus. In addition, there are two types of trypanosomiasis: American trypanosomiasis (Chagas disease) and African trypanosomiasis. To retrieve only African trypanosomiasis and not American trypanosomiasis, we used the following query: ((Trypanosoma OR Sleeping Sickness OR trypanosomiasis) AND Africa) AND case report.
To retrieve American trypanosomiasis, we used the following query: ((Trypanosoma OR Chagas Disease OR trypanosomiasis) AND America) AND case report.
We derived information regarding nation (meaning nationality or travel region of a patient), disease, gender, age, symptom, and body region of patients with symptoms using Python scripts. The lists of body regions and symptoms were prepared by referring to the 10th edition of the International Classification of Diseases.

2.2. Dataset preprocessing

Fig. 1 shows the data processing scheme. We removed records with missing variables or values from the dataset, except for symptoms and body regions: if either the symptoms or the body regions had at least one value, we did not remove the record. First, we constructed a dataset comprising only parasitic disease patient information, and then prepared the total dataset by adding information about other diseases (Alzheimer's disease, rheumatoid disease, cancer, and diabetes). All the values were encoded as integer categories. Note that the prepared datasets suffered from the data imbalance problem. To address this problem, we applied the synthetic minority oversampling technique (SMOTE) [25] provided by Scikit-learn [26].

2.3. Model development

We used various machine learning techniques to develop six models to diagnose malaria: support vector machine (SVM) [27], random forest (RF) [28], multilayered perceptron (MLP) [29], AdaBoost (Ada) [30], gradient boosting (GB) [31], and CatBoost (CB) [32].

2.3.1. SVM
SVM is a widely used supervised learning approach for classification or regression analysis. It maps the training data into a high-dimensional feature space and determines the optimal linear separating hyperplane, i.e., the one with the largest margin between the classes [27,33–35].

2.3.2. RF
RF is an ensemble supervised learning method composed of multiple decision trees, each trained on a different subdataset. Each tree produces its own prediction, and the forest averages the prediction outcomes of all trees. This approach reduces the variance of individual decision trees [28,36,37].

2.3.3. MLP
MLP is a supervised machine learning algorithm used for data classification tasks. It is composed of three types of layers: an input layer, which receives the input data; a hidden layer, which computes complicated associations across the network; and an output layer, which generates the final result. Training can be terminated when the error rate becomes sufficiently small. We optimized the log-loss function using stochastic gradient descent [29,38].

Table 1 (continued)

                              Non-parasitic   Only parasitic disease (N = 1698)             Overall
                              disease         Non-malaria    Malaria        Total           (N = 1846)
                              (N = 148)       (N = 1563)     (N = 135)
  IRRITABILITY                0               1              1              2               2
  LACERATION                  0               2              0              2               2
  LHERMITTE'S SIGN            0               1              0              1               1
  LOSS OF CONSCIOUSNESS       0               13             1              14              14
  MALAISE                     0               10             5              15              15
  MYOCLONUS                   0               0              0              0               1
  NECK STIFFNESS              0               3              1              4               4
  PARALYSIS                   0               3              0              3               3
  PARESIS                     0               5              0              5               5
  PELVIC PAIN                 0               7              0              7               7
  PETECHIA                    0               3              0              3               3
  PURPURA                     0               3              1              4               4
  RASH                        0               1              0              1               1
  SHIVERING                   0               3              4              7               7
  SHORT OF BREATH             0               4              0              4               4
  SORE THROAT                 0               2              0              2               2
  SUICIDAL IDEATION           0               0              3              3               3
  SWEATS                      1               6              1              7               8
  SWELLING                    6               71             0              71              77
  TINGLING                    0               2              0              2               2
  TREMOR                      3               0              1              1               4
  TRISMUS                     0               1              0              1               1
  URINARY RETENTION           1               10             0              10              11
  VAGINAL DISCHARGE           1               4              0              4               5
  VOMIT                       0               4              0              4               4

Table 2
Solely parasitic disease dataset.

Model          Accuracy   Precision   Recall   F1-Score   CV-10   AUC
SVM            0.915      0.000       0.000    0.000      0.914   0.728
RF             0.903      0.250       0.160    0.195      0.906   0.732
MLP            0.909      0.286       0.160    0.205      0.916   0.686
Ada            0.894      0.211       0.160    0.182      0.905   0.596
GB             0.891      0.227       0.200    0.213      0.913   0.708
CB             0.909      0.250       0.120    0.162      0.919   0.685
SMOTE + SVM    0.874      0.091       0.080    0.085      0.914   0.735
SMOTE + RF     0.871      0.120       0.120    0.120      0.906   0.740
SMOTE + MLP    0.721      0.150       0.600    0.240      0.916   0.702
SMOTE + Ada    0.865      0.161       0.200    0.179      0.915   0.688
SMOTE + GB     0.885      0.208       0.200    0.204      0.912   0.698
SMOTE + CB     0.871      0.194       0.240    0.214      0.919   0.708

Table 3
Total dataset.

Model          Accuracy   Precision   Recall   F1-Score   CV-10   AUC
SVM            0.938      0.000       0.000    0.000      0.921   0.776
RF             0.930      0.300       0.136    0.187      0.917   0.837
MLP            0.914      0.222       0.182    0.200      0.917   0.789
Ada            0.932      0.333       0.136    0.194      0.910   0.835
GB             0.930      0.300       0.136    0.187      0.908   0.856
CB             0.949      0.714       0.227    0.345      0.924   0.802
SMOTE + SVM    0.919      0.278       0.227    0.250      0.921   0.771
SMOTE + RF     0.922      0.348       0.364    0.356      0.917   0.805
SMOTE + MLP    0.746      0.125       0.545    0.203      0.917   0.674
SMOTE + Ada    0.922      0.360       0.409    0.383      0.907   0.803
SMOTE + GB     0.927      0.400       0.455    0.426      0.907   0.804
SMOTE + CB     0.881      0.211       0.364    0.267      0.924   0.714
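The F1-Score column of Table 2 is the harmonic mean of the Precision and Recall columns, F1 = 2PR / (P + R). The following minimal Python sketch recomputes it from the tabulated values. The function and variable names are illustrative and not taken from the study, and the tolerance is an assumption that only absorbs the three-decimal rounding of the published inputs.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (taken as 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (model, precision, recall, reported F1-score), copied from Table 2.
TABLE2_ROWS = [
    ("SVM", 0.000, 0.000, 0.000),
    ("RF", 0.250, 0.160, 0.195),
    ("MLP", 0.286, 0.160, 0.205),
    ("Ada", 0.211, 0.160, 0.182),
    ("GB", 0.227, 0.200, 0.213),
    ("CB", 0.250, 0.120, 0.162),
    ("SMOTE + SVM", 0.091, 0.080, 0.085),
    ("SMOTE + RF", 0.120, 0.120, 0.120),
    ("SMOTE + MLP", 0.150, 0.600, 0.240),
    ("SMOTE + Ada", 0.161, 0.200, 0.179),
    ("SMOTE + GB", 0.208, 0.200, 0.204),
    ("SMOTE + CB", 0.194, 0.240, 0.214),
]


def check_rows(rows, tol=0.002):
    """Recompute F1 per row; tol (an assumption) absorbs rounding of the inputs."""
    results = []
    for model, precision, recall, reported in rows:
        recomputed = f1_score(precision, recall)
        within_tol = abs(recomputed - reported) <= tol
        results.append((model, round(recomputed, 3), reported, within_tol))
    return results


if __name__ == "__main__":
    for model, recomputed, reported, ok in check_rows(TABLE2_ROWS):
        print(f"{model:12s} recomputed={recomputed:.3f} reported={reported:.3f} ok={ok}")
```

The SVM row illustrates why accuracy alone is a poor guide on imbalanced data: its accuracy is 0.915 although its recall, and therefore its F1-score, is 0.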
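The oversampling step described in Section 2.2 can be illustrated with a minimal pure-Python sketch of the SMOTE idea [25]: each synthetic sample is placed on the line segment between a minority-class sample and one of its k nearest minority-class neighbours. This is not the authors' code. The function name, the toy points, and the parameter defaults are assumptions for illustration, and in practice a library implementation (for example, the SMOTE class of the imbalanced-learn package) would normally be used.

```python
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by nearest-neighbour interpolation.

    minority: list of equal-length numeric feature vectors (minority class only).
    This is an illustrative sketch of the SMOTE idea, not a library-grade version.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(x, minority[j])))
        neighbour = minority[rng.choice(others[:k])]
        # Pick a random point on the segment between x and the chosen neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


if __name__ == "__main__":
    toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up points
    for point in smote_sketch(toy_minority, n_new=3, k=2, seed=1):
        print(point)
```

Because the features in this study are integer-coded categories, interpolation of this kind can produce non-integer values; how such values were handled is not stated in the text.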
Fig. 2. AUC curve: A) the solely parasitic disease dataset; B) solely parasitic disease dataset with SMOTE; C) total dataset; D) total dataset with SMOTE.
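The AUC values summarized in Fig. 2 can be read as the probability that a randomly chosen malaria case receives a higher predicted score than a randomly chosen non-malaria case. The sketch below computes this quantity directly from that pairwise definition. The function name and the toy labels and scores are made up for illustration and are not data from the study.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 class labels; scores: predicted scores of equal length.
    Tied pairs count as half a correct ranking.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    toy_labels = [0, 0, 1, 1]            # made-up labels
    toy_scores = [0.1, 0.4, 0.35, 0.8]   # made-up predicted scores
    print(auc_from_scores(toy_labels, toy_scores))
```

Under this definition, a classifier that assigns the same score to every patient has an AUC of 0.5, which is the chance level against which the reported values can be compared.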
Fig. 3. Feature importance: A) the solely parasitic disease dataset; B) solely parasitic disease dataset with SMOTE; C) total dataset; D) total dataset with SMOTE.
3.2. Model performance

Tables 2 and 3 and Fig. 2 describe the performance of the considered predictive models. Concerning the solely parasitic disease dataset, the RF model achieved the best performance with an AUC of 73.2%. The worst model was Ada, with an AUC of 59.6%. After applying SMOTE, the AUC values of almost all models increased, except GB. In this case, RF achieved the best performance (with an AUC of 74.0%), while Ada demonstrated the worst performance (with an AUC of 68.8%). Within the total dataset, GB achieved the highest AUC (85.6%). The performance of the models trained on the total dataset was higher compared to those trained on the solely parasitic disease dataset. The values of accuracy, precision, recall, and F1-score were also higher in the case of training on the total dataset. The model that showed the worst results was SVM, with an AUC of 77.6%. The AUC values of all models decreased with SMOTE; however, the values of accuracy, precision, recall, and F1-score were higher. RF achieved the best performance among all classifiers (with an AUC of 80.5%), while the worst performing model was MLP (with an AUC of 67.4%).

3.3. Feature importance

We calculated the feature importance using RF, which achieved the highest performance overall (Fig. 3). In both the total dataset and the solely parasitic disease dataset, the most important feature was "nation," followed by "age" (Fig. 3A and C). However, in the case of the datasets with SMOTE, "symptom" was the most important feature, followed by "nation" (Fig. 3B and D).

4. Discussion

Recently, an increasing number of studies have been conducted on malaria diagnosis using artificial intelligence (AI). Kim et al. and Wang et al. predicted malaria incidence by using a seasonal climate dataset [44,45]. Moreover, AI-based methods for diagnosis using blood smear images have been extensively investigated [1,4,6,14]. Rajaraman et al. used thin-blood smear images to construct deep neural ensemble models [4]. Molina et al. introduced a machine learning model to discriminate infected blood cells from normal ones [6].
In the present study, we used the parasitic disease patient information derived from the abstracts of case reports provided by PubMed to train the models. Various databases can be considered for obtaining epidemiology or symptom data on parasitic diseases, such as Gideon [46] and CDC [22]. Moreover, the National Health and Nutrition Examination Survey has been used as a source for obtaining health and nutrition information about patients. However, no database on parasitic disease patients provides information about patients' nationality, age, symptoms, and gender. Even if such information was available, it would not exhibit diversity in terms of regions or conditions of patients [47–49]. Therefore, we constructed datasets based on the information obtained from the abstracts of all parasitic disease case reports available in PubMed that were published from 1956 to 2019. Note that the abstracts did not provide detailed information about patients; however, the available data were sufficient to perform the analysis to diagnose malaria using the methods considered in the present study. Moreover, these data reflected the trend of overall parasitic disease patient information with sufficient accuracy.
The performance estimates of almost all models trained on the solely parasitic disease dataset were lower than those of the models trained on
the total dataset and those trained on the data with SMOTE. The observed results could be explained not only by the smaller dataset but also by the characteristics of parasitic diseases. In clinical cases, the symptoms of other parasitic diseases are similar to those of malaria, and therefore, it was more difficult to discriminate malaria using the solely parasitic disease dataset. For example, fever, which is a standard symptom of malaria, can also indicate conditions such as toxoplasmosis [50] and pulmonary eosinophilia [51]. Similarly, abdominal pain is a generic symptom for conditions such as amoebic liver abscesses [52] and trichinellosis [53]. We hypothesized that SMOTE could address this problem through oversampling; however, the AUC values were still lower than those of the models trained on the total dataset.
According to the obtained results, RF achieved the best performance, except for the total dataset without SMOTE. The remarkable performance of RF models has also been reported by other studies concerning various diseases [54–56]. An RF model has been applied to neuroimaging classification [54], the prediction of in-hospital cardiac arrest [56], and biomarker prediction based on gene expression data [55]; in all these applications, the RF model demonstrated great performance.
Meanwhile, the results of the feature importance analysis indicated that nationality and age are important factors to consider in diagnosing malaria using imbalanced data. Many previous studies have reported that parasitic diseases, such as malaria, depend on the places in which patients live or travel [2,57]. The obtained results suggested that nationality and the region of travel are crucial factors in the diagnosis of parasitic diseases.

4.1. Limitations

The limitations of our study are related to the small size of the datasets used and the limited number of features, which were used without a feature selection process. Moreover, the observed values of precision, recall, and F1-score were lower than those reported previously. Specifically, the dataset is highly imbalanced between parasitic and non-parasitic cases. We collected data from more than 35,519 non-parasitic patients, expecting to be able to obtain more than 1698 parasitic patients. However, the number of samples inevitably decreased when extracting only patients whose information, such as the country, age, gender, symptoms, and body region of patients with symptoms, was available. In particular, case reports of cancer, rheumatism, diabetes, and Alzheimer's disease patients provided less nationality information. Therefore, we had no choice but to create a dataset with a small number of patients having non-parasitic diseases. We considered that these metrics could be improved by applying SMOTE. However, even after the application of SMOTE, the values of precision, recall, and F1-score mostly did not exceed 0.5. This can be addressed by increasing the number of patients and features in the datasets.

5. Conclusions

This is the first study that aims to diagnose malaria using patient information. The novelty of the utilized datasets lies in the fact that the data were obtained for parasitic disease patients spread globally. We compared several machine learning models for malaria prediction trained on parasitic disease patient data. The results showed that RF was the best model for the diagnosis, indicating the possibility of diagnosing malaria using only patient information with AI.

Author contributions

EHS provided the research idea. YWL and EHS conceived and designed the study. YWL and JWC collected and analyzed data. YWL and JWC contributed materials and analysis tools. YWL and EHS wrote the paper. EHS was responsible for the overall project administration and the acquisition of financial support.

Declaration of competing interest

The authors have no competing interests to declare.

Acknowledgments

This work was supported by the Korea Association of Health Promotion Fund (Grant no. 2014–01).

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.compbiomed.2020.104151.

References

[1] M. Poostchi, K. Silamut, R.J. Maude, S. Jaeger, G. Thoma, Image analysis and machine learning for detecting malaria, Transl. Res. 194 (2018) 36–55, https://doi.org/10.1016/j.trsl.2017.12.004.
[2] L. Zekar, T. Sharman, Malaria (Plasmodium Falciparum), StatPearls, Treasure Island (FL), 2020.
[3] World Health Organization, World Malaria Report 2019, 2019.
[4] S. Rajaraman, S. Jaeger, S.K. Antani, Performance evaluation of deep neural ensembles toward malaria parasite detection in thin-blood smear images, PeerJ 7 (2019) e6977, https://doi.org/10.7717/peerj.6977.
[5] K. Torres, C.M. Bachman, C.B. Delahunt, J. Alarcon Baldeon, F. Alava, D. Gamboa Vilela, S. Proux, C. Mehanian, S.K. McGuire, C.M. Thompson, T. Ostbye, L. Hu, M.S. Jaiswal, V.M. Hunt, D. Bell, Automated microscopy for routine malaria diagnosis: a field comparison on Giemsa-stained blood films in Peru, Malar. J. 17 (2018) 339, https://doi.org/10.1186/s12936-018-2493-0.
[6] A. Molina, S. Alferez, L. Boldu, A. Acevedo, J. Rodellar, A. Merino, Sequential classification system for recognition of malaria infection using peripheral blood cell images, J. Clin. Pathol. 73 (10) (2020) 665–670, https://doi.org/10.1136/jclinpath-2019-206419.
[7] Z. Zheng, Z. Cheng, Advances in molecular diagnosis of malaria, Adv. Clin. Chem. 80 (2017) 155–192, https://doi.org/10.1016/bs.acc.2016.11.006.
[8] P. Berzosa, A. de Lucio, M. Romay-Barja, Z. Herrador, V. Gonzalez, L. Garcia, A. Fernandez-Martinez, M. Santana-Morales, P. Ncogo, B. Valladares, M. Riloha, A. Benito, Comparison of three diagnostic methods (microscopy, RDT, and PCR) for the detection of malaria parasites in representative samples from Equatorial Guinea, Malar. J. 17 (2018) 333, https://doi.org/10.1186/s12936-018-2481-4.
[9] K.O. Mfuh, O.A. Achonduh-Atijegbe, O.N. Bekindaka, L.F. Esemu, C.D. Mbakop, K. Gandhi, R.G.F. Leke, D.W. Taylor, V.R. Nerurkar, A comparison of thick-film microscopy, rapid diagnostic test, and polymerase chain reaction for accurate diagnosis of Plasmodium falciparum malaria, Malar. J. 18 (2019) 73, https://doi.org/10.1186/s12936-019-2711-4.
[10] H. Frickmann, C. Wegner, S. Ruben, C. Behrens, H. Kollenda, R. Hinz, S. Rojak, N.G. Schwarz, R.M. Hagen, E. Tannich, Evaluation of the multiplex real-time PCR assays RealStar malaria S&T PCR kit 1.0 and FTD malaria differentiation for the differentiation of Plasmodium species in clinical samples, Trav. Med. Infect. Dis. 31 (2019) 101442, https://doi.org/10.1016/j.tmaid.2019.06.013.
[11] L.C. Amaral, D.R. Robortella, L.F.F. Guimaraes, J.E. Limongi, C.J.F. Fontes, D.B. Pereira, C.F.A. de Brito, F.S. Kano, T.N. de Sousa, L.H. Carvalho, Ribosomal and non-ribosomal PCR targets for the detection of low-density and mixed malaria infections, Malar. J. 18 (2019) 154, https://doi.org/10.1186/s12936-019-2781-3.
[12] R. Makuuchi, S. Jere, N. Hasejima, T. Chigeda, J. Gausi, The correlation between malaria RDT (Paracheck pf.(R)) faint test bands and microscopy in the diagnosis of malaria in Malawi, BMC Infect. Dis. 17 (2017) 317, https://doi.org/10.1186/s12879-017-2413-x.
[13] A. Rehman, N. Abbas, T. Saba, Z. Mehmood, T. Mahmood, K.T. Ahmed, Microscopic malaria parasitemia diagnosis and grading on benchmark datasets, Microsc. Res. Tech. 81 (2018) 1042–1058, https://doi.org/10.1002/jemt.23071.
[14] S. Rajaraman, S.K. Antani, M. Poostchi, K. Silamut, M.A. Hossain, R.J. Maude, S. Jaeger, G.R. Thoma, Pre-trained convolutional neural networks as feature extractors toward improved malaria parasite detection in thin blood smear images, PeerJ 6 (2018) e4568, https://doi.org/10.7717/peerj.4568.
[15] A. Mbanefo, N. Kumar, Evaluation of malaria diagnostic methods as a key for successful control and elimination programs, Trop. Med. Infect. Dis. 5 (2020), https://doi.org/10.3390/tropicalmed5020102.
[16] K. Smith, F. Piccinini, T. Balassa, K. Koos, T. Danka, H. Azizpour, P. Horvath, Phenotypic image analysis software tools for exploring and understanding big image data from cell-based assays, Cell Syst. 6 (2018) 636–653, https://doi.org/10.1016/j.cels.2018.06.001.
[17] F. Jimenez-Morillas, M. Gil-Mosquera, E.J. Garcia-Lamberechts, I.-S. en representacion de la seccion de enfermedades tropicales de, I.-S. Seccion de enfermedades tropicales de, Fever in travellers returning from the tropics, Med. Clin. (Barc.) 153 (2019) 205–212, https://doi.org/10.1016/j.medcli.2019.03.017.
[18] C. JY, Seo and Lee's Clinical Parasitology, Seoul National University Publishing Council, 2011.
[19] D. Spathis, P. Vlamos, Diagnosing asthma and chronic obstructive pulmonary disease with machine learning, Health Inf. J. 25 (2019) 811–827, https://doi.org/10.1177/1460458217723169.
[20] O. Terrada, B. Cherradi, A. Raihani, O. Bouattane, Classification and prediction of atherosclerosis diseases using machine learning algorithms, in: 2019 5th International Conference on Optimization and Applications (ICOA), IEEE, 2019, pp. 1–5.
[21] J.D. Mello-Roman, J.C. Mello-Roman, S. Gomez-Guerrero, M. Garcia-Torres, Predictive models for the medical diagnosis of dengue: a case study in Paraguay, Comput. Math. Methods Med. 2019 (2019) 7307803, https://doi.org/10.1155/2019/7307803.
[22] Centers for Disease Control and Prevention, DPDx - Laboratory Identification of Parasites of Public Health Concern, 2020.
[23] P.J. Cock, T. Antao, J.T. Chang, B.A. Chapman, C.J. Cox, A. Dalke, I. Friedberg, T. Hamelryck, F. Kauff, B. Wilczynski, M.J. de Hoon, Biopython: freely available Python tools for computational molecular biology and bioinformatics, Bioinformatics 25 (2009) 1422–1423, https://doi.org/10.1093/bioinformatics/btp163.
[24] J.S. Rana, S.S. Khan, D.M. Lloyd-Jones, S. Sidney, Changes in mortality in top 10 causes of death from 2011 to 2018, J. Gen. Intern. Med. 23 (2020) 1–2, https://doi.org/10.1007/s11606-020-06070-z.
[25] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357.
[26] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, Scikit-learn: machine learning in Python, J. Mach. Learn. Res. 12 (2011) 2825–2830.
[27] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (1995) 273–297.
[28] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[29] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[30] T. Hastie, S. Rosset, J. Zhu, H. Zou, Multi-class AdaBoost, Stat. Interface 2 (2009) 349–360.
[31] J.H. Friedman, Greedy function approximation: a gradient boosting machine, Ann. Stat. 29 (2001) 1189–1232.
[32] A.V. Dorogush, V. Ershov, A. Gulin, CatBoost: gradient boosting with categorical features support, arXiv preprint, 2018.
[33] A. Gupta, B. Kahali, Machine learning-based cognitive impairment classification with optimal combination of neuropsychological tests, Alzheimers Dement. (N Y) 6 (2020) e12049, https://doi.org/10.1002/trc2.12049.
[34] N. Liu, R. Zhao, L. Qiao, Y. Zhang, M. Li, H. Sun, Z. Xing, X. Wang, Growth stages classification of potato crop based on analysis of spectral response and variables optimization, Sensors (Basel) 20 (2020), https://doi.org/10.3390/s20143995.
[35] A. Gupta, R. Katarya, Social media based surveillance systems for healthcare using machine learning: a systematic review, J. Biomed. Inf. 108 (2020) 103500, https://doi.org/10.1016/j.jbi.2020.103500.
[36] A. Dinh, S. Miertschin, A. Young, S.D. Mohanty, A data-driven approach to predicting diabetes and cardiovascular disease with machine learning, BMC Med.
[41] Y. Ye, Y. Xiong, Q. Zhou, J. Wu, X. Li, X. Xiao, Comparison of machine learning methods and conventional logistic regressions for predicting gestational diabetes using routine clinical data: a retrospective cohort study, J. Diabetes Res. 2020 (2020) 4168340, https://doi.org/10.1155/2020/4168340.
[42] A. Gupta, A.S.R. Potty, D. Ganta, R.J. Mistovich, S. Penna, C. Cady, A.G. Potty, Streamlining the KOOS activities of daily living subscale using machine learning, Orthop. J. Sports Med. 8 (2020) 2325967120910447, https://doi.org/10.1177/2325967120910447.
[43] B. Bengfort, R. Bilbro, Yellowbrick: visualizing the scikit-learn model selection process, J. Open Source Softw. 4 (35) (2019) 1075, https://doi.org/10.21105/joss.01075.
[44] Y. Kim, J.V. Ratnam, T. Doi, Y. Morioka, S. Behera, A. Tsuzuki, N. Minakawa, N. Sweijd, P. Kruger, R. Maharaj, C.C. Imai, C.F.S. Ng, Y. Chung, M. Hashizume, Malaria predictions based on seasonal climate forecasts in South Africa: a time series distributed lag nonlinear model, Sci. Rep. 9 (2019) 17882, https://doi.org/10.1038/s41598-019-53838-3.
[45] M. Wang, H. Wang, J. Wang, H. Liu, R. Lu, T. Duan, X. Gong, S. Feng, Y. Liu, Z. Cui, C. Li, J. Ma, A novel model for malaria prediction based on ensemble algorithms, PLoS One 14 (2019) e0226910, https://doi.org/10.1371/journal.pone.0226910.
[46] S.C. Edberg, Global Infectious Diseases and Epidemiology Network (GIDEON): a world wide Web-based program for diagnosis and informatics in infectious diseases, Clin. Infect. Dis. 40 (1) (2005) 123–126, https://doi.org/10.1086/426549.
[47] S. Mahmoudi, S. Mamishi, M. Banar, B. Pourakbari, H. Keshavarz, Epidemiology of echinococcosis in Iran: a systematic review and meta-analysis, BMC Infect. Dis. 19 (2019) 929, https://doi.org/10.1186/s12879-019-4458-5.
[48] M. Kotepui, K.U. Kotepui, Prevalence and laboratory analysis of malaria and dengue co-infection: a systematic review and meta-analysis, BMC Publ. Health 19 (2019) 1148, https://doi.org/10.1186/s12889-019-7488-4.
[49] D. Pierce, L. Merone, C. Lewis, T. Rahman, J. Croese, A. Loukas, M. McDonald, P. Giacomin, R. McDermott, Safety and tolerability of experimental hookworm infection in humans with metabolic disease: study protocol for a phase 1b randomised controlled clinical trial, BMC Endocr. Disord. 19 (2019) 136, https://doi.org/10.1186/s12902-019-0461-5.
[50] A.S. Kota, N. Shabbir, Congenital Toxoplasmosis, StatPearls, Treasure Island (FL), 2020.
[51] S.K. Jha, B. Karna, K. Mahajan, Tropical Pulmonary Eosinophilia, StatPearls, Treasure Island (FL), 2020.
[52] T. Tharmaratnam, T. Kumanan, M.A. Iskandar, K. D'Urzo, P. Gopee-Ramanan, M. Loganathan, T. Tabobondung, T.A. Tabobondung, S. Sivagurunathan, M. Patel, I. Tobbia, Entamoeba histolytica and amoebic liver abscess in northern Sri Lanka: a public health problem, Trop. Med. Health 48 (2020) 2, https://doi.org/10.1186/s41182-020-0193-2.
[53] P. Rawla, S. Sharma, Trichinella Spiralis (Trichnellosis), StatPearls, Treasure Island (FL), 2020.
[54] S.I. Dimitriadis, D. Liparas, I. Alzheimer's Disease Neuroimaging, How random is the random forest? Random forest algorithm on the service of structural imaging biomarkers for Alzheimer's disease: from Alzheimer's disease neuroimaging
Inf. Decis. Making 19 (2019) 211, https://doi.org/10.1186/s12911-019-0918-5. initiative (ADNI) database, Neural Regen Res 13 (2018) 962–970, https://doi.org/
[37] C. Wang, X. Chen, L. Du, Q. Zhan, T. Yang, Z. Fang, Comparison of machine 10.4103/1673-5374.233433.
learning algorithms for the identification of acute exacerbations in chronic [55] L. Guo, Z. Wang, Y. Du, J. Mao, J. Zhang, Z. Yu, J. Guo, J. Zhao, H. Zhou, H. Wang,
obstructive pulmonary disease, Comput. Methods Progr. Biomed. 188 (2020) Y. Gu, Y. Li, Random-forest algorithm based biomarkers in predicting prognosis in
105267, https://doi.org/10.1016/j.cmpb.2019.105267. the patients with hepatocellular carcinoma, Canc. Cell Int. 20 (2020) 251, https://
[38] A.M. Ahmed, S.F. Aly, Egyptian License Plates Recognition System Using doi.org/10.1186/s12935-020-01274-z.
Morphologial Operations and Multi Layered Perceptron, ICT in Our Lives-2019, [56] R. Ueno, L. Xu, W. Uegami, H. Matsui, J. Okui, H. Hayashi, T. Miyajima,
2019. Y. Hayashi, D. Pilcher, D. Jones, Value of laboratory results in addition to vital
[39] B.X. Tran, G.H. Ha, L.H. Nguyen, G.T. Vu, M.T. Hoang, H.T. Le, C.A. Latkin, C.S. signs in a machine learning algorithm to predict in-hospital cardiac arrest: a single-
H. Ho, R.C.M. Ho, Studies of novel Coronavirus disease 19 (COVID-19) pandemic: a center retrospective cohort study, PLoS One 15 (2020), e0235835, https://doi.org/
global analysis of literature, Int. J. Environ. Res. Publ. Health 17 (2020), https:// 10.1371/journal.pone.0235835.
doi.org/10.3390/ijerph17114095. [57] F. Jimenez-Morillas, M. Gil-Mosquera, E.J. Garcia-Lamberechts, I.-S.t.d.
[40] L. Liu, C. Zhang, G. Zhang, Y. Gao, J. Luo, W. Zhang, Y. Li, Y. Mu, A study of aortic department, Fever in travellers returning from the tropics, Med Clin (Engl Ed) 153
dissection screening method based on multiple machine learning models, (2019) 205–212, https://doi.org/10.1016/j.medcle.2019.03.013.
J. Thorac. Dis. 12 (2020) 605–614, https://doi.org/10.21037/jtd.2019.12.119.