Bmri2022 3113119
Research Article
Symptom-Based COVID-19 Prognosis through AI-Based IoT: A
Bioinformatics Approach
Copyright © 2022 Madhumita Pal et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
Objective. The Internet of Things (IoT) integrates several technologies in which devices learn from each other's experience, thereby reducing errors that arise from human intervention. Modern technologies such as IoT and machine learning enable the transition from a conventional to a patient-specific approach in healthcare. In the conventional approach, the biggest challenges faced by healthcare professionals are predicting a disease by observing the symptoms, monitoring patients in remote areas, and attending to patients at all times after they are hospitalised. IoT provides real-time data, makes decision-making smarter, and provides far superior analytics, all of which help improve the quality of healthcare. The main objective of the work was to create an IoT-based automated system using machine learning models for symptom-based COVID-19 prognosis. Methods. A comparative analysis of the predictive microbiology of COVID-19 from case symptoms is reported here, using several machine learning classifiers: logistic regression, k-nearest neighbor, support vector machine, random forest, decision tree, Naïve Bayes, and gradient boosting. To validate and verify the models, the accuracy of each model was measured on the data retrieved from cloud storage. Results. From the accuracy plot, it was concluded that k-NN was the most accurate (97.97%), followed by decision tree (97.79%), support vector machine (97.42%), logistic regression (96.50%), random forest (90.66%), gradient boosting classifier (87.77%), and Naïve Bayes (73.50%) in COVID-19 prognosis. Conclusion. The paper presents a health monitoring IoT framework with high clinical significance for real-time and remote healthcare monitoring. The findings reported here and the lessons learnt should enable healthcare systems worldwide to counter not only the ongoing COVID-19 pandemic but also other global pandemics that humanity may face in the future.
2 BioMed Research International
[Scheme 1: Block diagram of the proposed system. Recoverable labels: patient with COVID symptoms, biosensor, controller, database, computer, data analysis models, data analysis zone, mobile, doctor.]
between the man and the machine. Using emerging technology, IoT has greatly impacted numerous fields of human endeavour, including the healthcare system. It could change the existing healthcare system merely by using advanced sensors and a cloud computing platform. IoT, an advanced automation system that uses the big data concept, makes it possible to connect every asset through the web and helps design a smart healthcare system. As IoT handles big data, it is hard for healthcare professionals to handle and manage it. Thus, medical professionals require chronicled data to predict a disease. Although various kinds of machine learning algorithms have long been used to predict disease, the biggest challenge in a machine learning algorithm is tuning its various parameters. Proper tuning of the parameters results in efficient prognosis and diagnosis of a disease.

2.2. Significance of the Proposed System. The present work proposes a framework for an e-healthcare system that uses artificial intelligence, machine learning, and statistics for disease prognosis. In the proposed system, the patient's data are collected by IoT sensors, stored in the cloud, and transmitted to the web server (mobile app) through the IoT agent. The cloud shares the data over social insurance frameworks, and various machine learning algorithms are executed to process the data. The response is sent to healthcare professionals to monitor and suggest proper actions. The block diagram of the proposed system is shown in Scheme 1.
In this proposed model, seven data prediction techniques are used and their performances are compared to provide a better and more reliable quality of service for the healthcare system. The data prediction techniques used are k-nearest neighbor, support vector machine, decision tree, random forest, gradient boosting classifier, Naïve Bayes, and logistic regression.

2.3. Proposed Methodology. The main objective of this work was to forecast the probability of a patient suffering from COVID-19 infection using a computer-aided diagnosis/prognosis system. To deliver this work, different ML techniques were implemented on the given dataset, which is analysed and described in this study. The application of machine learning to predict COVID-19 infection provides a new and more reliable direction to healthcare professionals for early-stage disease diagnosis. It helps researchers predict rising COVID-19 cases at the symptom stage and also helps in preventing the disease through due precautions.

2.4. Data Source. The dataset used for the work was accessed from the Kaggle site [25]. The dataset can be downloaded as a CSV file and loaded into a Jupyter notebook for analysis with Python. The dataset contained a total of 5434 data samples and 19 features/parameters related to the patient symptoms, as detailed in Table 1. Seven machine learning algorithms were implemented in this work to achieve COVID-19 prognosis with the maximum possible accuracy and to create an automated system for COVID-19 detection.

2.5. Data Preprocessing. The dataset contained a large number of null values and outliers, which might affect the accuracy of the model. To remove these noisy data, the dataset was preprocessed and the null values were removed to help increase the efficacy of the models. After cleaning, the data were transformed by smoothing and normalisation. The dataset was then split into training and testing sets, on which several machine learning models were run to compare their accuracy scores. The various machine learning algorithms used in this research are discussed below.

2.5.1. Logistic Regression. This classifier, used for classification and data analysis, is a supervised learning algorithm. It is a type of regression model used when the data modelling requires the sigmoid function [26].

Sigmoid function:
$$g(y) = \frac{1}{1 + e^{-y}}. \tag{1}$$

Here, the regression model is built to predict the probability and measure the learning rate.
Figure 1: Count plot of the number of patients who had COVID-19 (yes) and who did not (no).

2.5.3. Random Forest (RF) Model. This classifier is an ensemble learning classifier. It is used for both classification and regression analysis. It consists of a set of trees in which each tree is capable of providing a set of predictor values [27]. Individually, the decision trees are weak classifiers, and they are merged to form a random forest model. The random forest model does not have cross-validation, while other classifiers such as the decision tree and k-NN models do. In this classifier, a greater number of trees generally results in higher accuracy. The random forest classifier logic uses entropy, gain ratio, and the Gini index.

$$\begin{aligned}
\mathrm{Entropy}(N) &= -\sum_{j=1}^{n} \pi_j \log_2 \pi_j, \\
\mathrm{Gini}(N) &= 1 - \sum_{j=1}^{M} \pi_j^{2}, \\
\mathrm{Gini}_A(N) &= \frac{N_1}{N}\,\mathrm{Gini}(N_1) + \frac{N_2}{N}\,\mathrm{Gini}(N_2).
\end{aligned} \tag{2}$$

2.5.4. Decision Tree (DT) Model. This classifier is based on a classification algorithm and works on both numerical and categorical data. It builds a tree-shaped graph while analysing the data. The analysis of decision trees is based on three kinds of node (root node, interior node, and leaf node). The idea behind the decision algorithm is to select the best attribute using information gain and the gain ratio. It makes a decision node based on that attribute and breaks the data into subdatasets. It then continues building the tree by repeating the process recursively.

$$\begin{aligned}
\mathrm{Information}(M) &= -\sum_{i=1}^{n} \pi_i \log_2 \pi_i, \\
\mathrm{Information}_A(M) &= \sum_{j=1}^{m} \frac{M_j}{M} \times \mathrm{Information}(M_j), \\
\mathrm{Split}_A(M) &= -\sum_{j=1}^{m} \frac{M_j}{M} \log_2 \frac{M_j}{M}, \\
\mathrm{Gain\ Ratio}(N) &= \frac{\mathrm{Gain}(N)}{\mathrm{Split}_A(M)}.
\end{aligned} \tag{3}$$

2.5.5. k-Nearest Neighbor (k-NN). The k-nearest neighbour technique is a supervised algorithm based on the concept of nearest neighbour data points. The nearest neighbour data points can be identified using different distance metrics. Although inefficient for large dimensional
dataset, k-NN technique is easy to implement. It is a nonparametric model used to solve classification and regression problems. An object is classified according to its nearest neighbours. The distance to the nearest neighbour is measured using the Euclidean distance.

Euclidean distance:
$$d(a, b)^{2} = (b_1 - a_1)^{2} + (b_2 - a_2)^{2}. \tag{4}$$

Here, the input consists of the closest or nearest neighbours in the dataset. The classifier assumes that similar attributes exist in close proximity. After loading the data and choosing the number of nearest neighbours, the distance between the query and each original example is calculated and the entries are sorted in the collection [28].

2.5.6. Naïve Bayes (NB). This classifier is a supervised algorithm. A classification technique based on Bayes' theorem, it computes probabilities under the assumption that the attributes have no correlation with each other. All attributes contribute independently to the probability. The probability can be calculated by building the frequency table and the likelihood table. The test phase then uses the likelihood table after the training is done. The Bayes' theorem equation is

$$P(B/A) = \frac{P(A/B)\,P(B)}{P(A)}, \tag{5}$$

where P(B/A) is the posterior probability, P(B) is the class prior probability, P(A) is the predictor prior probability, and P(A/B) is the predictor probability.

2.5.7. Gradient Boosting Machine (GBM). This classifier is the most popular among all the boosting algorithms, where each predictor corrects its preceding predictor's error. Each predictor in the model is trained using the errors of the preceding predictors. The base learner in the machine is the classification and regression tree [29]. The major parameter used in this technique is the shrinkage, which refers to scaling the prediction of each tree by a learning rate that ranges between 0 and 1. Once all trees are trained, the final prediction is given by the following formula:

$$x_{\mathrm{pred}} = x_1 + (\eta \cdot r_1) + (\eta \cdot r_2) + \cdots + (\eta \cdot r_n). \tag{6}$$

When the algorithm is used for classification it is the gradient boosting classifier, and the corresponding regression class is called the gradient boosting regressor (GBR).

3. Results

The count plot shows that 4383 patients suffered from COVID-19 and 1051 patients did not (Figure 1). The pie plot shows that 80.7% of patients had COVID-19 infection and 19.3% did not (Figure 2).

[Figure 2: Pie plot of the number of cases. COVID-19 yes: 80.7%; no: 19.3%.]

Out of 5434 data samples, 3620 patients had a breathing problem and 1814 did not. Similarly, 4273 patients suffered from fever and 1161 did not, 4307 patients had dry cough and 1127 did not, 3953 patients had sore throat and 1481 did not, and 2952 patients had running nose and 2482 did not (Figure 3).
Also, 2514 patients had asthma tendency and 2920 did not, 2565 patients had chronic lung disease and 2869 did not, 2736 patients had headache and 2698 did not, 2523 patients had heart disease and 2911 did not, and 2588 patients suffered from diabetes and 2846 did not. Patients with heart disease, diabetes, headache, asthma, hypertension, fatigue, gastrointestinal issues, and prior contact with a COVID-19 patient had a higher probability of suffering from COVID-19 infection than those who followed COVID-appropriate measures (such as wearing a mask and sanitising regularly) and had no associated health or sociological issues.
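The percentages quoted for the count plot and the pie plot follow directly from the reported counts. The short check below uses only the two class counts given in the text (4383 with COVID-19 and 1051 without); the variable names are illustrative.

```python
# Class counts reported in the Results text.
covid_yes = 4383
covid_no = 1051

total = covid_yes + covid_no
share_yes = 100 * covid_yes / total
share_no = 100 * covid_no / total

print(total)                # 5434
print(round(share_yes, 1))  # 80.7
print(round(share_no, 1))   # 19.3
```

Since roughly four out of five samples are positive, it is useful to read the accuracy figures together with the sensitivity and specificity defined in Section 3.1.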
[Figure residue: per-feature count plots. Recoverable panel titles: hypertension, fatigue, gastrointestinal, abroad travel, contact with COVID patient, attended large gathering, visited public exposed places, family working in public exposed places, wearing masks, sanitization from market, and target.]
Figure 4: Obtained correlation matrix for the given dataset after data cleaning operation. Recoverable axis labels: breathing problem, fever, dry cough, sore throat, hypertension, abroad travel, and COVID-19.
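A correlation matrix such as the one in Figure 4 can be computed pairwise from the cleaned feature columns. The text does not state which coefficient was used; the sketch below assumes the Pearson coefficient (for 0/1 features this equals the phi coefficient) and runs on a small made-up table, not on the study data. The names `pearson`, `correlation_matrix`, and `toy` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Made-up binary data for illustration only (1 = yes, 0 = no).
toy = {
    "Fever": [1, 1, 0, 1, 0, 0],
    "Dry cough": [1, 1, 0, 1, 1, 0],
    "COVID-19": [1, 1, 0, 1, 0, 0],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```

Columns that take a single value throughout have zero variance, so the sketch returns NaN for them rather than dividing by zero.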
3.1. Confusion Matrix. This table is used to visualise the performance of a classification model. It contains the positive and negative observations of the actual class and the positive and negative observations of the predicted class. The four observations are TP1 (true positive), FN1 (false negative), TN1 (true negative), and FP1 (false positive). The confusion matrix and the performance measurement parameters of the k-NN model are presented in Figure 5 and Table 3.
The ROC curve is used to evaluate binary classification; it plots the true positive rate against the false positive rate. The area under the curve (AUC) measures performance by how well the model distinguishes the positive from the negative observations.
The area under the curve obtained for the k-NN algorithm was 0.98 (Figure 6). This indicates that the k-NN model was able to prognose COVID-19 infection reliably. The k-NN model performance measures are presented in Table 4 and are used to calculate sensitivity, specificity, precision, and accuracy.

[Figure 5: Confusion matrix of the k-NN model. True class "Without COVID": 203 and 2; true class "COVID": 20 and 862.]
Figure 6: AUC plot of k-NN model (x-axis: false positive rate; legend: k-NN, AUC = 0.98).

3.1.1. Sensitivity. Recall (sensitivity) represents the fraction of the positive class that is correctly predicted. The best sensitivity rate is 1.0 and the worst rate is 0.

$$\mathrm{Sensitivity} = \frac{TP_1}{TP_1 + FN_1}. \tag{7}$$

3.1.2. Specificity. This is calculated as the number of true negative predictions divided by the total number of actual negative observations. The best specificity rate is 1.0 and the worst rate is 0.

$$\mathrm{Specificity} = \frac{TN_1}{TN_1 + FP_1}. \tag{8}$$

3.1.3. Precision. It represents the number of actual positive observations out of the total number of observations predicted as positive.

$$\mathrm{Precision} = \frac{TP_1}{TP_1 + FP_1}. \tag{9}$$

3.1.4. Accuracy.

$$\mathrm{Accuracy} = \frac{TP_1 + TN_1}{TP_1 + TN_1 + FP_1 + FN_1}. \tag{10}$$

3.1.5. F1-Score. It is the harmonic mean of precision and sensitivity.

$$F_\beta = \frac{(1 + \beta^{2})\,\mathrm{Precision} \cdot \mathrm{Sensitivity}}{\beta^{2}\,\mathrm{Precision} + \mathrm{Sensitivity}}. \tag{11}$$
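The measures defined in this section can be evaluated directly from the four cell counts recoverable from the Figure 5 confusion matrix (203, 2, 20, and 862). The sketch below is a worked example under two assumptions that the text does not state explicitly: COVID is treated as the positive class, and the rows of the matrix are the true classes, so that the "Without COVID" row holds (TN1, FP1) and the "COVID" row holds (FN1, TP1). The F-score is written in its standard F-beta form; the helper name `f_beta` is illustrative.

```python
# Cell counts read from the Figure 5 confusion matrix (k-NN model).
# Assumption: COVID is the positive class and rows are true classes.
TN, FP = 203, 2    # true class: without COVID
FN, TP = 20, 862   # true class: COVID


def f_beta(precision, sensitivity, beta=1.0):
    """Standard F-beta score; beta = 1 gives the F1-score."""
    b2 = beta ** 2
    return (1 + b2) * precision * sensitivity / (b2 * precision + sensitivity)


sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
precision = TP / (TP + FP)
accuracy = (TP + TN) / (TP + TN + FP + FN)
f1 = f_beta(precision, sensitivity, beta=1.0)

for name, value in [
    ("sensitivity", sensitivity),
    ("specificity", specificity),
    ("precision", precision),
    ("accuracy", accuracy),
    ("F1", f1),
]:
    print(f"{name}: {value:.4f}")
```

Under these assumptions the accuracy works out to about 0.9798 on 1087 test samples, which agrees with the 97.97% reported for k-NN up to rounding.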
In Equation (11), β is a constant that is commonly 1, 2, or 0.5. With β = 1, the score reduces to the F1-score:

$$F_1 = \frac{TP_1 + TP_1}{TP_1 + TP_1 + FP_1 + FN_1} = \frac{2\,TP_1}{2\,TP_1 + FP_1 + FN_1}. \tag{12}$$

Table 5: Performance report of the various test models executed in the study.

4. Discussion

This research work detects (prognoses) whether or not a patient is likely to suffer from COVID-19 infection by observing the patient's symptoms. The research was done on machine learning classification techniques using Naïve Bayes, decision tree, random forest, k-nearest neighbor, support vector machine, logistic regression, and gradient boosting. The dataset was collected from the Kaggle site and processed using open-access Python software in a Jupyter notebook. The data were analysed and split into a training set and a test set. Different ML models were implemented on the dataset, and the performance of each model is described in terms of accuracy. The performance report of the various test models executed in the study is given in Table 5. The percentage accuracy scores are presented in Table 6, and the accuracy comparison of the models is depicted in Figure 7. From the accuracy plot, it was concluded that k-NN was the most accurate (97.97%), followed by decision tree (97.79%), support vector machine (97.42%), logistic regression (96.50%), random forest (90.66%), gradient boosting classifier (87.77%), and Naïve Bayes (73.50%) in COVID-19 prognosis based on the given dataset and the defined features/parameters.
Out of all the models compared for reliability, the k-NN model was found to be the best. With a prediction accuracy of approximately 98%, it performed better than the other six algorithms. We have also compared the results of our study with some other reported models (Table 7), which suggests that our models are effective and give better results [30–34]. We have used a 10-fold cross-validation method for improving the performance of our models. In future, this research may help healthcare professionals to predict and diagnose COVID-19 at an early stage. This would be useful especially for patients in remote locations with limited access to immediate medical facilities. COVID-19 prognosis could also be done using other machine learning and deep learning approaches with potentially better accuracy. This study is bound to provide ample references for further development in this field at a global scale. However, more robust datasets as inputs are strongly recommended to achieve this.

[Figure 7: Accuracy comparison of the models. Recoverable axis labels: logistic regression, decision tree, support vector machines, KNN, random forest.]

5. Conclusion

Many countries including India are still struggling to fight this deadly corona pandemic as the cases are rising daily. Each day comes as a new challenge with an ever larger quantity of COVID-19 cases and data. To address this, research to develop medicines to treat and vaccines to prevent COVID-19 is being pursued at a global scale. This paper compares seven machine learning algorithms in terms of their accuracy in COVID-19 prognosis; the algorithms are implemented to predict/prognose COVID-19 infection in India and elsewhere. Also, the AUC and various performance measurement metrics such as accuracy, precision, recall, and F1-score of the k-NN model are discussed. The work provides a precursor for designing an automated COVID-19 prognosis system using IoT and machine learning algorithms. The risk rate was 65-80% with the four critical symptoms (fever, dry cough, breathing issue, and sore throat) out of the 10 parameters/features considered from the 19 total possible parameters/features. So, these four critical parameters could be recommended as strong prognosis bioindicators.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest

The authors have no conflict of interest.

Authors' Contributions

Conceptualisation and writing the original draft were performed by RKM and MP. MP was responsible for the software. Literature search, data analysis, and interpretation and editing were performed by MP, SP, AAR, SM, AM, and SA. Writing, review, and editing were carried out by KD and JAT.

Acknowledgments

Authors are very grateful to the authorities of their respective institutions/universities for the cooperation and support extended.

References

[1] R. K. Mohapatra, S. Mishra, M. Azam, and K. Dhama, “COVID-19, WHO guidelines, pedagogy, and respite,” Open Medicine, vol. 16, no. 1, pp. 491–493, 2021.
[2] R. K. Mohapatra, L. Perekhoda, M. Azam et al., “Computational investigations of three main drugs and their comparison with synthesized compounds as potent inhibitors of SARS-CoV-2 main protease (Mpro): DFT, QSAR, molecular docking, and in silico toxicity analysis,” Journal of King Saud University–Science, vol. 33, no. 2, article 101315, 2021.
[3] R. K. Mohapatra, P. K. Das, and V. Kandi, “Challenges in controlling COVID-19 in migrants in Odisha, India,” Diabetes & Metabolic Syndrome: Clinical Research & Reviews, vol. 14, no. 6, pp. 1593-1594, 2020.
[4] WHO, “WHO Coronavirus (COVID-19) Dashboard,” 2021, https://covid19.who.int/.
[5] R. K. Mohapatra, L. Pintilie, V. Kandi et al., “The recent challenges of highly contagious COVID-19, causing respiratory infections: symptoms, diagnosis, transmission, possible vaccines, animal models, and immunotherapy,” Chemical Biology & Drug Design, vol. 96, no. 5, pp. 1187–1208, 2020.
[6] R. K. Mohapatra, P. K. Das, L. Pintilie, and K. Dhama, “Infection capability of SARS-CoV-2 on different surfaces,” Egyptian Journal of Basic and Applied Science, vol. 8, no. 1, pp. 75–80, 2021.
[7] C. Huang, Y. Wang, X. Li et al., “Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China,” The Lancet, vol. 395, no. 10223, pp. 497–506, 2020.
[8] N. Singhania, S. Bansal, and G. Singhania, “An atypical presentation of novel coronavirus disease 2019 (COVID-19),” The American Journal of Medicine, vol. 133, no. 7, pp. e365–e366, 2020.
[9] M. S. Ekbatani, S. A. Hassani, L. Tahernia et al., “Atypical and novel presentations of coronavirus disease 2019: a case series of three children,” British Journal of Biomedical Science, vol. 78, no. 1, pp. 47–52, 2021.
[10] R. K. Mohapatra, K. Dhama, A. A. El–Arabey et al., “Repurposing benzimidazole and benzothiazole derivatives as potential inhibitors of SARS-CoV-2: DFT, QSAR, molecular docking, molecular dynamics simulation, and in-silico pharmacokinetic and toxicity studies,” Journal of King Saud University–Science, vol. 33, no. 8, article 101637, 2021.
[11] R. K. Mohapatra, K. Dhama, S. Mishra et al., “The microbiota related coinfections in COVID-19 patients: a real challenge,” Beni-Suef University Journal of Basic and Applied Sciences, vol. 10, no. 1, p. 47, 2021.
[12] C. McCarthy, C. P. O’Donnell, N. E. W. Kelly, D. O’Shea, and A. E. Hogan, “COVID-19 severity and obesity: are MAIT cells a factor?,” The Lancet, vol. 9, no. 5, pp. 445–447, 2021.
[13] B. M. Popkin, S. Du, W. D. Green et al., “Individuals with obesity and COVID-19: a global perspective on the epidemiology and biological relationships,” Obesity Reviews, vol. 21, article e13128, 2020.
[14] I. Lega, L. Nisticò, L. Palmieri et al., “Psychiatric disorders among hospitalized patients deceased with COVID-19 in Italy,” EClinicalMedicine, vol. 35, article 100854, 2021.
[15] T. K. Suvvari, P. CharulataSree, S. Kuppili et al., “Consecutive Hits of COVID-19 in India: The Mystery of Plummeting Cases and Current Scenario,” Archives of Razi Institute, vol. 76, no. 5, pp. 1165–1174, 2021.
[16] R. L. Kumar, F. Khan, S. Din, S. S. Band, A. Mosavi, and E. Ibeke, “Recurrent neural network and reinforcement learning model for COVID-19 prediction,” Frontiers in Public Health, vol. 9, article 744100, 2021.
[17] B. Wang, Y. Sun, T. Q. Duong, L. D. Nguyen, and L. Hanzo, “Risk-aware identification of highly suspected covid-19 cases in social IoT: a joint graph theory and reinforcement learning approach,” IEEE Access, vol. 8, pp. 115655–115661, 2020.
[18] Z. Fang, J. Wang, Y. Ren, Z. Han, H. V. Poor, and L. Hanzo, “Age of information in energy harvesting aided massive multiple access networks,” IEEE Journal on Selected Areas in Communications, vol. 40, no. 5, pp. 1441–1456, 2022.
[19] M. A. Abd-Elmagid, N. Pappas, and H. S. Dhillon, “On the role of age of information in the Internet of Things,” IEEE Communications Magazine, vol. 57, no. 12, pp. 72–77, 2019.
[20] M. Pourhomayoun and M. Shakibi, “Predicting mortality risk in patients with COVID-19 using machine learning to help medical decision-making,” Smart Health, vol. 20, article 100178, 2021.
[21] L. J. Muhammad, E. A. Algehyne, S. S. Usman, A. Ahmad, C. Chakraborty, and I. A. Mohammed, “Supervised machine learning models for prediction of COVID-19 infection using epidemiology dataset,” SN Computer Science, vol. 2, p. 11, 2021.
[22] A. Zeroual, F. Harrou, A. Dairi, and Y. Sun, “Deep learning methods for forecasting COVID-19 time-series data: a comparative study,” Chaos, Solitons & Fractals, vol. 140, article 110121, 2020.
[23] Y. Zoabi, S. Deri-Rozov, and N. Shomron, “Machine learning-based prediction of COVID-19 diagnosis based on symptoms,” npj Digital Medicine, vol. 4, p. 3, 2021.
[24] S. S. Aljameel, I. U. Khan, N. Aslam, M. Aljabri, and E. S. Alsulmi, “Machine learning-based model to predict the disease severity and outcome in COVID-19 patients,” Scientific Programming, vol. 2021, Article ID 5587188, 2021.
[25] https://www.kaggle.com/symptoms-and-covid-presence.
[26] C.-Y. J. Peng, K. L. Lee, and G. M. Ingersoll, “An introduction to logistic regression analysis and reporting,” The Journal of Educational Research, vol. 96, no. 1, pp. 3–14, 2002.
[27] G. Biau, “Analysis of a random forests model,” Journal of Machine Learning Research, vol. 13, pp. 1063–1095, 2012.