Data Analytics On The COVID-19 Outbreak in South Asia Using Machine Learning Methods
Kajol Chandra Paul et al., International Journal of Advanced Trends in Computer Science and Engineering, 10(4), July – August 2021, 2784 – 2791
Where FCFR stands for the final case fatality rate, and C, D, and R represent confirmed, death, and recovered cases, respectively. The recovery rates are calculated the same way. Figure 1 shows the calculated fatality and recovery rates. Initially, the highest recovery rate is observed in India (97.3%) and the lowest (59.9%) in Afghanistan. On the other hand, Bhutan has the lowest fatality rate overall (approximately 0.1%), whereas Afghanistan tops the list. Afghanistan's final case fatality rate is a staggering 6.8%, which is significantly higher than that of the rest of the SAARC countries. Note that the initial fatality and recovery rates do not add up to 100%, though the final rates do. Consequently, the final fatality and recovery rates can be interpreted probabilistically.

[...] here is Sri Lanka, which ranks second in GDP per capita. Despite that, it carried out fewer tests than Maldives, Bhutan, and India. On the other hand, the number of tests is moderately negatively associated with the test positivity rate, with a correlation coefficient of -0.50. This suggests that countries with higher positivity rates did not perform a sufficient number of tests. For example, Afghanistan, with the highest positivity rate (20.4%), conducted the lowest number of tests (16,986 tests per 1M population), which contributed to its lower number of confirmed cases.
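The difference between the initial and final rates discussed above can be illustrated with a minimal sketch. It assumes that the initial rates are taken relative to all confirmed cases C, and that the final rates are taken relative to resolved cases only (D + R), which is consistent with the statement that the final rates add up to 100%. The function names and the counts are hypothetical and are not taken from the paper's dataset.

```python
def initial_rates(confirmed, deaths, recovered):
    """Fatality and recovery rates relative to all confirmed cases.

    Active (unresolved) cases are part of the denominator, so the two
    rates generally sum to less than 1.
    """
    return deaths / confirmed, recovered / confirmed


def final_rates(deaths, recovered):
    """Fatality and recovery rates relative to resolved cases only.

    Assumption: resolved cases = deaths + recovered, so the two rates
    always sum to 1 and can be read as outcome probabilities.
    """
    resolved = deaths + recovered
    return deaths / resolved, recovered / resolved


# Hypothetical counts for a single country (illustration only).
confirmed, deaths, recovered = 10_000, 300, 8_700

init_fatality, init_recovery = initial_rates(confirmed, deaths, recovered)
final_fatality, final_recovery = final_rates(deaths, recovered)

print(f"initial: fatality={init_fatality:.2%}, recovery={init_recovery:.2%}")
print(f"final:   fatality={final_fatality:.2%}, recovery={final_recovery:.2%}")
```

With any number of active cases remaining, the initial rates leave a gap to 100%, while the final rates close it by construction.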
[...] that it has a comparatively higher HCI and better vaccination record has contributed to a relatively lower death count.

4. K-MEANS CLUSTERING METHOD

An important way to compare the severity of the COVID-19 pandemic in different countries could be K-Means clustering based on various case criteria. The SAARC countries have been clustered based on the number of confirmed, death, and active cases per 1M population. K-Means clustering is an unsupervised machine learning algorithm that divides the data into K clusters. There are several methods to find the optimal number of clusters, such as the Elbow method, the Silhouette method, and the Gap statistic [11]. In this study, we have performed K-Means clustering iteratively for a range of cluster counts K and determined the optimal K with the Elbow method. Figure 4 depicts the within-cluster sum of squares (WCSS) values for different cluster numbers. WCSS measures the sum of squared distances of each sample to its nearest cluster center.

[...] moderately affected as in the case of India, Nepal, and Sri Lanka. Finally, cluster 2 is the set of countries comprising Afghanistan, Bangladesh, Pakistan, and Bhutan. These countries are less affected with a low number of confirmed and active cases.
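The Elbow-method procedure described above can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the eight rows of per-1M figures are randomly generated placeholders rather than the SAARC data, and the standardization step is our assumption, since the text does not say whether the features were scaled. The `inertia_` attribute of a fitted `KMeans` model is the WCSS.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-1M-population figures for 8 countries:
# columns are [confirmed, deaths, active]. Not the paper's data.
rng = np.random.default_rng(0)
X = rng.uniform(low=[1_000, 10, 100], high=[150_000, 400, 10_000], size=(8, 3))

# Assumption: put the three features on a common scale before clustering.
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means for a range of K and record the WCSS of each fit.
k_values = list(range(1, 7))
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    wcss.append(model.inertia_)  # within-cluster sum of squares

for k, value in zip(k_values, wcss):
    print(f"K={k}: WCSS={value:.3f}")

# After reading the elbow off the WCSS curve (the paper reports 3 clusters),
# assign each country to a cluster.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print("cluster labels:", labels)
```

The elbow is the value of K after which the WCSS curve flattens. Choosing it is a visual judgment, which is why the loop only records the values instead of picking K automatically.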
5.1 Polynomial regression

In this paper, we have performed the data modeling based on polynomial regression, where a polynomial equation of nth degree is used to model the non-linear dataset. The generic nth degree polynomial equation is expressed as

$Y = a + bX + cX^2 + dX^3 + \cdots + nX^n$    (3)

Where a, b, c, d, ..., n are called the parameters of polynomial regression analysis. In our model, the number of days is the independent variable X, which we count from the starting date of our dataset (i.e., 22.01.2020), and the number of confirmed/death cases is the dependent (response) variable Y. The degree of a polynomial is determined by the highest exponent of the independent variable. Some of the common polynomial functions are

1. Zero polynomial function: $Y = a$    (4)
2. First-degree polynomial (Linear): $Y = a + bX$    (5)
3. Quadratic polynomial function: $Y = a + bX + cX^2$    (6)
4. Cubic polynomial function: $Y = a + bX + cX^2 + dX^3$    (7)

The first-degree polynomial is a straight line, known as a linear function, where b is the slope and a is the Y-intercept of the line. In a similar fashion, the mathematical expression for higher-order polynomial regression can be found from Eq. (3). A higher order of the polynomial does not necessarily signify a better level of fitting, i.e., the closeness between the observed sample values and the predicted values. The best possible fitted regression line can be of any order.

5.2 Evaluation metrics

Generally, the statistical measures used for evaluation of the regression model are the Sum of Squared Error (SSE), Mean Squared Error (MSE), or Root Mean Squared Error (RMSE). The best regression line is the one that minimizes these values. However, these metrics penalize large errors heavily, and consequently, Mean Absolute Error (MAE) could be used. Besides that, another metric called R-squared ($R^2$), or the coefficient of determination, is also used. It is an indication of the goodness of fit of the data points to the regression model. The formula used for the calculation of $R^2$ is [17]

$R^2 = 1 - \frac{SS_{line}}{SS_{total}}$    (8)

Where $SS_{line}$ indicates the sum of squared errors between the observed output values and the fitted regression line, and $SS_{total}$ indicates the total variation in Y values, i.e., the sum of the squared differences between the observed output values and their mean. The value of $R^2$ typically ranges from 0 to 1. When $SS_{line}$ is very small, $R^2$ is close to 1, which means the regression line fits the actual data points almost perfectly. For practical use, the modified version of $R^2$, called the adjusted $R^2$, is a better metric. While $R^2$ does not account for additional input variables that are statistically insignificant, adjusted $R^2$ adjusts the statistical model based on the number of observations and variables. Thus, it is more useful to evaluate the predictive power of the model with adjusted $R^2$. At times, the model tries to fit every data point exactly, especially for higher degrees of the polynomial function. This situation is known as overfitting, where the model picks up too much noise and makes large errors when predicting unknown data. An effective way to address the problem is to utilize cross-validation (CV), where the training dataset is divided into k folds. At each iteration, the model is trained on k-1 folds, keeping the remaining fold for testing. The performance is measured as the mean of the values obtained in each iteration, which is indicated by the cross-validation metric or CV score.

5.3 Experiment

With the regression process algorithm described in [16], the modeling experiment is performed in Python's sklearn machine learning library. The entire dataset is divided into two subsets of training and testing data. As we want to make the prediction modeling with polynomial regression, the input data points, i.e., days, are converted into polynomial features of different degrees. The model is fit on the training data and prediction is done on the testing data. The performance of the model changes slightly with the size of the training and testing data. Eventually, the accuracy of the predictions is evaluated in terms of the MAE, adjusted $R^2$, and CV score. The CV score is computed for 5 splits and their mean is taken. Table 2 illustrates the values of the evaluation metrics based on the testing size and degree of the polynomial function used to model the total deaths in SAARC.

Table 2: Evaluation metrics for the polynomial regression model of the total deaths in SAARC

Evaluation      Polynomial model of degree 6      Polynomial model of degree 13
metrics         10% test size   20% test size     10% test size   20% test size
Adjusted R^2    0.9923          0.9914            0.9881          0.9863
MAE             9035.24         8976.31           11650.72        12294.90
CV score        0.9900          0.9885            0.9869          0.9864

It is seen that the adjusted $R^2$ decreases slightly if the testing size is increased to 20%. In that case, there is less training data to train the model. It is also seen that a higher degree of the polynomial function does not necessarily yield better values of the evaluation metrics. Figure 6 and Figure 7 illustrate the predictive modeling of the cumulative [...]
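The experiment of Section 5.3 can be approximated with the following sketch, which is not the authors' code. Several choices are assumptions made for illustration: the cumulative-death curve is synthetic (a noisy logistic shape), the day index is standardized before the polynomial expansion to keep high degrees numerically stable, the CV score is taken to be the mean $R^2$ over 5 folds, and the adjusted $R^2$ uses the common formula $1 - (1 - R^2)(n - 1)/(n - p - 1)$ with p set to the polynomial degree. The helper names `adjusted_r2` and `evaluate` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic cumulative-death curve (logistic shape plus noise); not the paper's data.
rng = np.random.default_rng(42)
days = np.arange(500).reshape(-1, 1)  # X: days since the start of the dataset
deaths = 400_000 / (1 + np.exp(-(days.ravel() - 300) / 40)) + rng.normal(0, 2_000, 500)


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 for n_samples observations and n_features predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


def evaluate(degree, test_size):
    """Fit a polynomial regression of the given degree and return the metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        days, deaths, test_size=test_size, random_state=0
    )
    # Assumption: scale the day index first so that X**degree stays well conditioned.
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    r2 = r2_score(y_test, y_pred)
    # 5-fold cross-validation on the training data; the score is the mean R^2.
    cv = cross_val_score(model, X_train, y_train, cv=5, scoring="r2").mean()
    return {
        "adjusted_r2": adjusted_r2(r2, len(y_test), degree),
        "mae": mean_absolute_error(y_test, y_pred),
        "cv_score": cv,
    }


for degree in (6, 13):
    for test_size in (0.1, 0.2):
        metrics = evaluate(degree, test_size)
        print(degree, test_size, {k: round(v, 4) for k, v in metrics.items()})
```

Because the data here are synthetic, the printed numbers are not expected to match Table 2. The sketch only mirrors the layout of the table, with one set of metrics per combination of degree and test size.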
6. RESULTS AND DISCUSSION

Data analytics and modeling is a great tool to comprehend the spread of COVID-19 and its impact. The documented cases are analyzed to uncover the underlying factors, and the pattern of the pandemic is modeled with polynomial regression to predict the future course. The characteristics of the Coronavirus spread in South Asia and the related findings are summarized in the following points.

• Infections from COVID-19 spread all over South Asia in a non-uniform pattern. Despite being the least populated country in the region, Maldives has witnessed infections encompassing 13.8% of its population.
• Usually, the countries that rank higher in GDP per capita have conducted more tests, except for Sri Lanka. Afghanistan has conducted very few tests despite having a test positivity rate of over 20%, which contributed to its relatively low number of confirmed cases.
• The final case fatality rate calculated after considering resolved cases is found to be 6.8% in Afghanistan, which is the highest in South Asia. There is no definitive correlation between the number of deaths and
HCI, though a weak positive correlation is found with vaccination and 65+ age population data.
• As per confirmed, death, and active cases, the SAARC countries can be grouped into 3 clusters using the K-Means clustering algorithm. Maldives belongs to the severely affected cluster, while India, Nepal, and Sri Lanka belong to the cluster which is moderately affected. The rest of the countries can be categorized as less affected.
• The polynomial regression models fitted to the confirmed cases and deaths exhibit good accuracy overall. The adjusted $R^2$ value is over 0.9700 for all the instances of the model, except for Bhutan's death cases and India's confirmed cases. This can be interpreted as 97% of the total variation in the Y values being described by the fitted regression line.
• Bhutan has two cases of fatality from COVID-19, which occurred on January 8 and July 15 of 2021. This creates two abrupt steps in the curve, and it becomes relatively imprecise to model these abrupt changes with polynomial regression. We observe many outlier data points on the modeling curve in the case of India and Nepal, which have resulted in a relatively lower adjusted $R^2$ value. The outlier data points often come from dumping previous-day data into the next day due to delays in data collection.
• If we sum up the predicted death cases for the individual country models, we get a number close to the total predicted cases in the SAARC model. The difference is only 1.4%. However, the difference in confirmed cases between the individual country models and the SAARC model is 17.7%. The relatively lower adjusted $R^2$ scores have introduced this disagreement in the confirmed cases.

This study has made it possible to gather deep insights into the pandemic situation in South Asian nations. It can guide the research and shape the decision-making process in tackling the virus. According to the predictive model, the total estimated confirmed cases till August 17, 2021, are 42.91 million while total fatalities are 0.58 million.

7. CONCLUSION

The spread of the Coronavirus in the South Asian region may appear one-dimensional because of the sheer number of confirmed and death cases in India. However, the data analytics conducted in this study has shown diversified dynamics of the COVID-19 outbreak in SAARC countries. Be it the death rate, recovery rate, test positivity rate, or the number of tests done, each country has performed differently, which can be correlated to its population, GDP per capita, vaccination number, and HCI. Countries such as Maldives are severely affected, according to the K-Means clustering. It is shown here that the number of confirmed or death cases can be modeled by polynomial regression techniques with good accuracy. The proposed regression models set forth a forecast for the future confirmed and death cases.

REFERENCES

1. J. Bryner, 1st known case of coronavirus traced back to November in China, Available at: https://www.Livescience.com/first-case-coronavirus-found.html, accessed May 2021.
2. Timeline: WHO's COVID-19 response, Available at: https://www.who.int/emergencies/diseases/novel-coronavirus-2019/interactive-timeline, accessed May 2021.
3. N. Banka, Explained: How SAARC countries are fighting COVID-19, Available at: https://indianexpress.com/article/explained/explained-how-saarc-countries-are-fighting-covid-19-6331509/, accessed June 2021.
4. Reported Cases and Deaths by Country or Territory, Available at: https://www.worldometers.info/coronavirus/, accessed June 2021.
5. S. K. Dey, M. M. Rahman, U. R. Siddiqi, and A. Howlader. Analyzing the epidemiological outbreak of COVID 19: A visual exploratory data analysis approach, Journal of Medical Virology, vol. 92, no. 6, pp. 632–638, 2020.
6. H. Nishiura, S-m. Jung, N. M. Linton, R. Kinoshita, Y. Yang, K. Hayashi, T. Kobayashi, B. Yuan, and A. R. Akhmetzhanov. The extent of transmission of novel Coronavirus in Wuhan, China, Journal of Clinical Medicine, vol. 9, no. 2, pp. 330, 2020.
7. N. AL-Rousan, and H. AL-Najjar. Data analysis of coronavirus COVID-19 epidemic in South Korea based on recovered and death cases, Journal of Medical Virology, vol. 92, pp. 1603–1608, 2020.
8. T. Chakraborty, and I. Ghosh. Real-time forecasts and risk assessment of novel coronavirus (COVID-19) cases: A data-driven analysis, Chaos, Solitons & Fractals, vol. 135, 109850, 2020.
9. S. K. Dey, M. M. Rahman, U. R. Siddiqi, and A. Howlader. Exploring epidemiological behavior of novel coronavirus (COVID-19) outbreak in Bangladesh, SN Comprehensive Clinical Medicine, vol. 2, pp. 1724–1732, 2020.
10. S. K. Saini, V. Dhull, S. Singh, and A. Sharma. Visual Exploratory Data Analysis of COVID-19 Pandemic, 5th IEEE International Conference on Recent Advances and Innovations in Engineering (ICRAIE), Jaipur, 2020, pp. 1-6.
11. D. Abdullah, S. Susilo, A. S. Ahmar, R. Rusli, and R. Hidayat. The application of K-means clustering for province clustering in Indonesia of the risk of the COVID-19 pandemic based on COVID-19 data.