Mini Project Report
DEPARTMENT OF COMPUTER
APPLICATIONS
Submitted by
ASHISH U MANDAYAM
(USN: 01JST18BCA005)
RAKSHITH A C
(USN: 01JST18BCA025)
Submitted to
INTRODUCTION
Coronavirus disease 2019 (COVID-19) is a contagious infection caused by severe
acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It was first identified in
December 2019 in Wuhan, China, and has since spread globally, evolving into
an ongoing pandemic. Common symptoms include cough, fever, fatigue,
breathlessness and loss of smell and taste. The World Health Organization (WHO)
declared the outbreak a Public Health Emergency of International Concern in
January 2020 and a pandemic in March 2020. COVID-19 has had a large impact on
sectors such as the economy, education, healthcare and logistics, and on people's
mental health.
The pandemic has caused severe global economic disruption and has led to the
postponement or cancellation of many events. According to the World Trade
Organization, world trade has plunged due to the pandemic and is expected to
fall by between 13% and 32%. Many economic experts state that it might take 10
years for the economy to return to its normal state.
The WHO states that COVID-19 has had a significant impact on the health sector's
handling of non-communicable diseases such as cancer and Alzheimer's disease. Since
there are no vaccines for this disease, preventing its wide spread has become a
humongous task and the utmost priority for the healthcare department.
With the help of predictive analysis and supervised learning, we can predict
future case counts, which may help in taking better preventive measures and
precautions. The proposed model is shown in Fig. 1. Here we have used supervised
machine learning models for regression of the data. After a series of
visualizations the data set appears to be linear, and hence we have used basic
linear regression models.
We have also used SVR, as it is one of the most basic and simplest algorithms
available for regression. One of the major advantages of SVR is that the complexity
of the model does not depend on the dimensionality of the data.
LITERATURE SURVEY
After an extensive survey of the literature, we found a paper that does similar work
in greater depth. “COVID-19 Future Forecasting Using Supervised Machine Learning
Models”, a journal article, uses similar techniques and models, but with more
research and experimentation. We will not be comparing our work with theirs, as they
used less data than we did.
They made use of several models: LASSO Regression, Support Vector Machine, Linear
Regression and Exponential Smoothing.
PROPOSED MODEL
The proposed model has two stages: training and evaluation. Before training, the
dataset is pre-processed by removing null values and fields that are not significant
for this study. In the training stage, the model is trained and tested for
prediction. The results are evaluated using three measures: mean squared error
(MSE), R-squared (R2) and mean absolute error (MAE).
The data was collected by the Center for Systems Science and Engineering (CSSE), a
research collective housed within the Department of Civil and Systems Engineering
(CaSE) at Johns Hopkins University. They have released multiple forms of the
dataset; we selected the time series dataset, which is updated every day. We used
the data collected from 01/22/20 to 11/26/20, which is available from their official
GitHub repository.
B. Data Pre-processing
The pandas package is used to load the CSV file into a DataFrame. The data is then
filtered and transformed to fit the requirements of the study. Here we create
separate data structures such as future_forecast_dates, unique_countries, etc.
The Lat, Long and Province/State columns are removed from the DataFrame as they are
of little use in prediction. For future prediction, we extend the original dates in
the DataFrame by the required number of future dates. This is done using the
datetime module of Python.
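The pre-processing steps above can be sketched as follows. This is only an illustration under stated assumptions: the tiny table stands in for the real CSSE CSV (its values are made up), the cases are summed into a single world total, and the 10-day horizon is hypothetical.

```python
import datetime

import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the CSSE time-series CSV (values are made up).
raw = pd.DataFrame({
    "Province/State": [None, "Hubei"],
    "Country/Region": ["India", "China"],
    "Lat": [20.6, 30.9],
    "Long": [78.9, 112.2],
    "1/22/20": [0, 444],
    "1/23/20": [0, 444],
    "1/24/20": [0, 549],
})

# Drop the columns that are not used for prediction.
cases = raw.drop(columns=["Province/State", "Lat", "Long"])
unique_countries = cases["Country/Region"].unique()

# Assumption: sum over all rows to get one total per date.
date_columns = cases.columns[1:]
world_cases = cases[date_columns].sum(axis=0).to_numpy().reshape(-1, 1)

# Represent each date as "days since the first date", then extend into the future.
days_in_future = 10  # hypothetical forecast horizon
days_since_start = np.arange(len(date_columns)).reshape(-1, 1)
future_forecast = np.arange(len(date_columns) + days_in_future).reshape(-1, 1)

# Turn the day indices back into calendar dates with the datetime module.
start = datetime.datetime.strptime(date_columns[0], "%m/%d/%y")
future_forecast_dates = [
    (start + datetime.timedelta(days=int(i))).strftime("%m/%d/%Y")
    for i in future_forecast.ravel()
]
```

Encoding dates as integer day offsets gives the regression models a single numeric feature to work with.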
We used the train_test_split function from the sklearn.model_selection module to
split the data into training and testing sets.
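A minimal sketch of such a split is given below, using synthetic data and an 85:15 ratio. Setting shuffle=False, so that the test set is the most recent part of the series, is an assumption made here for time-ordered data; the report does not state which setting was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: day index as the feature, case count as the target.
X = np.arange(100).reshape(-1, 1)
y = 50.0 * X.ravel() + 10.0

# 85:15 split; shuffle=False keeps the chronological order (assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, shuffle=False
)
```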
Support Vector Regression (SVR):
It is a common application of the Support Vector Machine that supports linear and
non-linear regression. SVR requires training data X and Y that covers the domain
of interest and is accompanied by solutions on that domain. SVR is a supervised
learning method derived directly from the Support Vector algorithm; it estimates
the function that generated the training set. It is available in the scikit-learn
package of Python.
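As an illustration, SVR can be fitted with scikit-learn as sketched below. The data is synthetic, and the kernel and the values of C and epsilon are assumptions chosen for the example, not the settings used in this project.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic, roughly linear "cases per day" series (made-up numbers).
days = np.arange(60, dtype=float).reshape(-1, 1)
cases = 1000.0 + 250.0 * days.ravel()

# Assumed hyperparameters, for illustration only.
svr = SVR(kernel="linear", C=100.0, epsilon=1.0)
svr.fit(days[:50], cases[:50])

# Predict the last 10 days, which the model has not seen.
pred = svr.predict(days[50:])
```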
Linear Regression:
It is a commonly used type of predictive analysis and is used to predict a numeric
result given a set of independent variables.[14] The overall idea is to find a
relation between two variables by fitting a linear equation to the observed data.
The linear regression model can be used by importing LinearRegression from the
sklearn.linear_model module in Python.
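A minimal sketch of fitting a line to day-indexed case counts follows. The numbers are synthetic and exactly linear, so the fitted slope and the forecasts can be checked by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 100 cases on day 0, growing by 20 per day.
days = np.arange(10).reshape(-1, 1)
cases = 100.0 + 20.0 * days.ravel()

model = LinearRegression(fit_intercept=True)
model.fit(days, cases)

# Forecast the next two days (day 10 and day 11).
future_days = np.array([[10], [11]])
pred = model.predict(future_days)
```

Here the model recovers a slope of 20 and an intercept of 100, so the forecasts for days 10 and 11 are 300 and 320.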
LASSOLars:
LassoLars fits a Lasso (L1-regularized) linear model using the Least Angle
Regression (LARS) algorithm. The L1 penalty performs feature selection by shrinking
the coefficients of less useful features to zero, so the best-fit line passes
through the best features; the regularization strength can be chosen by
cross-validation (CV). As the name suggests, LARS deals with the correlation (least
angle) between the predictors and the output variable, and it is an efficient
stepwise variable selection algorithm.
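The feature-selection effect can be seen in a small sketch. The data is synthetic, with five features of which only the first two influence the target, and alpha=0.1 is an assumed value.

```python
import numpy as np
from sklearn.linear_model import LassoLars

# Synthetic data: only features 0 and 1 matter; features 2-4 are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha is the L1 regularization strength (assumed value).
model = LassoLars(alpha=0.1)
model.fit(X, y)

# Coefficients of the irrelevant features are driven to exactly zero.
selected = np.flatnonzero(model.coef_)
```

In the project itself there is a single feature (the day index), so the selection effect matters less there than the regularization does.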
BayesianRidge:
TRAINING
Steps in training each model: Google Colaboratory is used to train the models, as
it is a free cloud-hosted Jupyter Notebook service with free GPU support. It is
research oriented and does not require environment setup. It supports many machine
learning libraries, which can be loaded easily without depending on local hardware.
The dataset is employed to validate the model.
A. Training SVR
In this project the hyperparameters were tuned using GridSearchCV from sklearn and
a manual batch search. Since GridSearchCV did not perform well, we carried out a
manual hyperparameter search and found the best set of parameters.
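One way a manual hyperparameter search could look is sketched below. The candidate values, the chronological hold-out and the synthetic data are all assumptions made for illustration; the grid actually searched in this project is not reproduced here.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVR

# Synthetic, roughly linear series (made-up numbers).
days = np.arange(80, dtype=float).reshape(-1, 1)
cases = 500.0 + 120.0 * days.ravel()

# Chronological hold-out: the last 12 days are used for validation.
X_train, X_val = days[:68], days[68:]
y_train, y_val = cases[:68], cases[68:]

# Hypothetical candidate values.
grid = ParameterGrid({
    "kernel": ["linear"],
    "C": [1.0, 10.0, 100.0],
    "epsilon": [0.1, 1.0],
})

best_score, best_params = -np.inf, None
for params in grid:
    model = SVR(**params).fit(X_train, y_train)
    score = r2_score(y_val, model.predict(X_val))
    if score > best_score:
        best_score, best_params = score, params
```

Unlike GridSearchCV with its default cross-validation, this loop always validates on the most recent days. That may be one reason a manual search behaves differently on time-ordered data.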
B. Training LinearRegression
The linear model is trained with the parameters normalize and fit_intercept set to
True. The model is fitted on the training data, and the test data together with the
future dates is used for prediction.
C. Training LASSOLars
The LASSOLars model is trained with the parameters normalize and fit_intercept set
to True. The model is fitted on the training data, and the test data together with
the future dates is used for prediction.
D. Training BayesianRidge
The BayesianRidge model is trained with the parameter normalize set to True and
alpha_init set to 0.01. The model is fitted on the training data, and the test data
together with the future dates is used for prediction.
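A minimal sketch of the BayesianRidge training step, with alpha_init=0.01 as in the text, is given below. The data is synthetic. The normalize argument is left out on purpose, because recent scikit-learn releases no longer accept it; the omission is a deviation from the settings described above.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Synthetic noisy linear series (made-up numbers).
rng = np.random.default_rng(0)
days = np.arange(30, dtype=float).reshape(-1, 1)
cases = 200.0 + 35.0 * days.ravel() + rng.normal(scale=5.0, size=30)

# alpha_init=0.01 follows the text; normalize is omitted (see above).
model = BayesianRidge(alpha_init=0.01, fit_intercept=True)
model.fit(days, cases)

# Forecast two future days; return_std also gives the predictive uncertainty.
future_days = np.array([[30.0], [31.0]])
mean, std = model.predict(future_days, return_std=True)
```

The standard deviation returned with return_std=True is a by-product of the Bayesian formulation and gives a rough uncertainty band around each forecast.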
All the algorithms are compared on the basis of their prediction accuracy.
A. Dataset
The basic dataset has 271 rows and 310 columns. It contains all recorded positive
cases, by country, from 01/22/20 to 11/26/20. From the dataset, all the dates and
the numbers of cases are extracted into separate data frames. The dataset is then
split into train and test data in the ratios 30:70, 50:50, 60:40 and 80:20.
For the actual prediction, the dataset is split into train and test sets in the
ratio 85:15. The additional splits are used to experiment with and observe how the
model learns from different quantities of training data, as measured by the
evaluation parameters.
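The split experiment can be sketched as a loop over training fractions. The sketch assumes that the first number of each ratio is the training share, uses synthetic data, and uses LinearRegression as a stand-in for whichever model is being studied.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic noisy linear series (made-up numbers).
rng = np.random.default_rng(0)
days = np.arange(200, dtype=float).reshape(-1, 1)
cases = 1000.0 + 300.0 * days.ravel() + rng.normal(scale=500.0, size=200)

# Training shares for 30:70, 50:50, 60:40, 80:20 and 85:15 (assumed train:test).
scores = {}
for train_fraction in (0.30, 0.50, 0.60, 0.80, 0.85):
    split = round(len(days) * train_fraction)
    model = LinearRegression().fit(days[:split], cases[:split])
    # Score on the remaining, later part of the series.
    scores[train_fraction] = r2_score(cases[split:], model.predict(days[split:]))
```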
B. Evaluation Parameters
In this project, model performance is measured mainly in terms of R-squared (R2).
A negative R2 indicates that the model performs worse than a constant prediction of
the mean of the data, and the score can be arbitrarily negative. A value at or near
1.0 indicates the best performance.
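The behaviour of R2 can be checked on a small hand-made example. In the sketch below, r2_manual is a hypothetical helper that spells out the definition R2 = 1 - SS_res / SS_tot; the other calls are the standard scikit-learn metric functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def r2_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot (hypothetical helper, for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
good_pred = [1.1, 1.9, 3.2, 3.8]   # close to the truth
bad_pred = [4.0, 3.0, 2.0, 1.0]    # reversed trend

good_r2 = r2_score(y_true, good_pred)   # 0.98
bad_r2 = r2_score(y_true, bad_pred)     # -3.0, worse than predicting the mean
good_mse = mean_squared_error(y_true, good_pred)   # 0.025
good_mae = mean_absolute_error(y_true, good_pred)  # 0.15
```

The reversed prediction scores -3.0 because its squared error (20) is four times the variation of the data around its mean (5).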
C. Comparison of Models
SVR Prediction: The SVR model is trained and tested with different ratios of train
and test data. GridSearchCV gave poor results and was rejected. When the
hyperparameters were tuned manually, however, the SVR performed very well, with
high accuracy.
Fig. 2
Fig. 3
As shown in Fig. 3, the two lines almost coincide, i.e. the predicted values are
almost the same as the actual values, indicating that the prediction made by the
Linear Regression model is quite accurate.
The R2 score is 0.9945359160912455, i.e. the model explains more than 99% of the
variance in the data. The accuracy of the model can be further confirmed by
comparing the predicted cases with the actual cases confirmed by the WHO.
Fig. 4
As shown in Fig. 4, the two lines almost coincide, i.e. the predicted values are
almost the same as the actual values, indicating that the prediction made by the
LASSOLars model is quite accurate.
The R2 score is 0.9994263506035728, i.e. the model explains more than 99.9% of the
variance in the data. The accuracy of the model can be further confirmed by
comparing the predicted cases with the actual cases confirmed by the WHO.
Fig. 5
As shown in Fig. 5, the two lines almost coincide, i.e. the predicted values are
almost the same as the actual values, indicating that the prediction made by the
BayesianRidge model is quite accurate.
The R2 score is 0.9945346036480124, i.e. the model explains more than 99% of the
variance in the data. The accuracy of the model can be further confirmed by
comparing the predicted cases with the actual cases confirmed by the WHO.
CONCLUSION
In conclusion, all four regression models performed very well; the LassoLars model,
with the highest R2 score, outperformed the others.
The current work suggests that COVID-19 cases are growing linearly every day. This
can be seen in Figs. 2, 3, 4 and 5, where the number of cases rises in a linear
fashion, and it suggests that the pandemic will remain a major threat for another
year or two. Hence we need to take the utmost precautions and measures to decrease
the spread of this pandemic.