Mini Project Report


JSS MAHAVIDYAPEETHA

JSS SCIENCE AND TECHNOLOGY UNIVERSITY

Mysuru – 570 006.

DEPARTMENT OF COMPUTER
APPLICATIONS

A mini project report on

“PREDICTION OF COVID-19 USING MULTIPLE REGRESSION MODELS”

Submitted by

ASHISH U MANDAYAM
(USN: 01JST18BCA005)

RAKSHITH A C
(USN: 01JST18BCA025)

Submitted to

Mr. SIDDESHA. S (Project Guide)


Assistant Professor
Department of Computer Applications
ABSTRACT
With advances in machine learning, predictive analysis has become a key tool for
forecasting. During the COVID-19 pandemic, predicting the future number of positive
cases can support better planning and control measures. We used supervised learning
models to forecast case counts from the COVID-19 time-series dataset. To study
prediction performance, we compare Linear Regression and Support Vector Regression,
alongside the LassoLars and Bayesian Ridge models described later in this report.
We chose these models because the data were almost linear.

INTRODUCTION
Coronavirus disease 2019 (COVID-19) is a contagious infection caused by severe
acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It was first identified in
December 2019 in Wuhan, China, and has since spread globally, evolving into an
ongoing pandemic. Common symptoms include cough, fever, fatigue, breathlessness
and loss of smell and taste. The World Health Organization (WHO) declared the
outbreak a Public Health Emergency of International Concern in January 2020 and a
pandemic in March 2020. COVID-19 has strongly affected many sectors, including the
economy, education, healthcare and logistics, as well as people's mental health.

The pandemic has caused severe global economic disruption and has led to the
postponement or cancellation of many events. According to the World Trade
Organization, trade has plunged because of the pandemic and is expected to fall by
between 13% and 32%. Many economic experts state that it might take 10 years for
the economy to return to its normal state.

The WHO states that COVID-19 has significantly affected the health sector's handling
of non-communicable diseases such as cancer and Alzheimer's. Since there are no
vaccines for this disease, preventing its wide spread has become an enormous task
and the utmost priority for healthcare departments.

With the help of predictive analysis and supervised learning, we can predict future
case counts, which may help in taking better preventive measures and precautions.
The proposed model is shown in Fig. 1. We have used supervised machine learning
models to perform regression on the data. After a series of visualizations the
dataset appears to be linear, and hence we have used basic regression models.

We have used Linear Regression as it is simple to implement and easy to interpret.
Linear Regression tends to perform well when the underlying data are themselves
linear.

We have also used SVR as it is one of the basic and simplest algorithms available
for regression. One of the major advantages of SVR is that the complexity of the
model does not depend on the dimensionality of the data.

LITERATURE SURVEY
After a survey of the literature, we found a paper that does similar work in greater
depth. “COVID-19 Future Forecasting Using Supervised Machine Learning Models”, a
journal article, uses similar techniques and models, but with more research and
experimentation. We do not compare our results with theirs, as they used less data
than we did.

They made use of several models, namely LASSO Regression, Support Vector
Machine, Linear Regression and Exponential Smoothing.

PROPOSED MODEL
In the proposed model there are two stages: training and evaluation. Before training,
the dataset is pre-processed by removing null values and fields that are not
significant for this study. In the training stage, the model is trained and tested
for prediction. The results are evaluated using three measures: mean squared error
(MSE), R-squared (R2) and mean absolute error (MAE).
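
As a rough illustration of the three measures, the sketch below computes them with scikit-learn on a handful of made-up values. The arrays are hypothetical and are not taken from the project data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted case counts (illustrative values only).
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

mse = mean_squared_error(y_true, y_pred)   # mean of the squared errors
mae = mean_absolute_error(y_true, y_pred)  # mean of the absolute errors
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(f"MSE={mse:.1f}  MAE={mae:.1f}  R2={r2:.3f}")
```

MSE and MAE are expressed in units of (squared) cases, so they grow with the size of the counts, whereas R2 is unitless and easier to compare across splits.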

Figure 1: Proposed model

A. Data Acquisition and Selection

The data were collected by the Center for Systems Science and Engineering (CSSE), a
research centre housed within the Department of Civil and Systems Engineering
(CaSE) of Johns Hopkins University. They have released multiple forms of the
dataset; we selected the Time Series Dataset, which is updated every day. We used
the data collected from 01/22/20 to 11/26/20, which is available from their official
GitHub repository.
B. Data Pre-processing

The pandas package is used to load the CSV file into a DataFrame. Filtering
manipulates and transforms the data to fit the requirements of the study. Here we
make separate data frames such as future_forecast_dates, unique_countries, etc.
The Lat, Long and Province/State columns are removed from the data frame as they
are of little use in prediction. To forecast ahead, we append the required number
of future dates to the original dates in the data frame. This is done using the
datetime module of Python.
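
A minimal sketch of these pre-processing steps is given below. It builds a tiny stand-in table instead of downloading the CSSE file, so the values, the `days_in_future` setting and the `world_cases` name are assumptions made for illustration; only the column names and the `future_forecast_dates` and `unique_countries` names come from the description above.

```python
import datetime as dt
import pandas as pd

# Tiny stand-in for the CSSE time-series CSV (all values are made up).
raw = pd.DataFrame({
    "Province/State": [None, "Hubei", None],
    "Country/Region": ["India", "China", "Italy"],
    "Lat": [20.6, 30.9, 41.9],
    "Long": [79.0, 112.3, 12.6],
    "1/22/20": [0, 444, 0],
    "1/23/20": [0, 444, 0],
    "1/24/20": [1, 549, 2],
})

# Drop the columns the report treats as not useful for prediction.
cases = raw.drop(columns=["Province/State", "Lat", "Long"])
unique_countries = cases["Country/Region"].unique()

# Sum over all rows to get one worldwide total per date column.
date_cols = cases.columns[1:]
world_cases = cases[date_cols].sum(axis=0)

# Extend the observed dates with the extra days to be forecast.
days_in_future = 2  # hypothetical forecast horizon
start = dt.datetime.strptime(date_cols[0], "%m/%d/%y")
future_forecast_dates = [
    (start + dt.timedelta(days=i)).strftime("%m/%d/%Y")
    for i in range(len(date_cols) + days_in_future)
]
print(future_forecast_dates)
```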

We used the train_test_split function from the sklearn.model_selection module to
split the data into training and testing sets.
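
The split might look like the sketch below. The synthetic series and the `shuffle=False` setting are assumptions: the report does not say whether the rows were shuffled, but keeping chronological order is a common choice for time-series data because the test set then lies strictly after the training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Day index as the single feature; synthetic cumulative counts as the target.
days = np.arange(100).reshape(-1, 1)
cases = 1000.0 * np.arange(100)

# 85:15 split without shuffling, so the last days form the test set.
X_train, X_test, y_train, y_test = train_test_split(
    days, cases, test_size=0.15, shuffle=False
)
print(len(X_train), len(X_test))
```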

C. Selecting the model

Regression is a widely used machine learning technique for prediction. In this
project we have used supervised learning for future prediction. Under supervised
learning, four regression models, Support Vector Regression, Linear Regression,
LassoLars and BayesianRidge, are used to conduct the experiments. A supervised
learning algorithm is used because the data provide two factors, an input (X) and
an output (Y), which the algorithm uses to learn the mapping from input to output.
The objective is to train the model well enough that, when new input data are given
to it, it can predict the output factor (Y).

Support Vector Regression:

SVR is a common application of the Support Vector Machine that supports both linear
and non-linear regression. SVR requires training data X and Y that cover the domain
of interest and are accompanied by solutions on that domain. It is a supervised
learning method derived directly from the Support Vector algorithm, and it
estimates the function that generated the training set. It is available in the
scikit-learn package for Python.

Linear Regression:
It is a commonly used type of predictive analysis and is used to predict a numeric
result from a set of independent variables.[14] The overall idea is to find a
relation between two variables by fitting a linear equation to the observed data.
The linear regression model can be used by importing it from the
sklearn.linear_model module in Python.
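
For a single predictor, fitting a linear equation reduces to the ordinary least squares formulas for the slope and intercept. The sketch below applies them to synthetic, exactly linear data; the numbers are illustrative and unrelated to the COVID-19 dataset.

```python
import numpy as np

# Synthetic, exactly linear data: y = 3x + 5 (illustrative only).
x = np.arange(10, dtype=float)
y = 3.0 * x + 5.0

# Ordinary least squares for one predictor:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(slope, intercept)
```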

LASSOLars:

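LassoLars fits a Lasso (L1-regularised) linear model using the LARS algorithm. The
L1 penalty performs feature selection by shrinking the coefficients of less useful
predictors to zero, so the best-fit line passes through the most informative
features; scikit-learn also provides a cross-validated variant, LassoLarsCV, for
choosing the regularisation strength. Least Angle Regression (LARS), as the name
suggests, works with the correlation (the least angle) between the predictors and
the output variable, and it is an efficient stepwise variable-selection algorithm.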

BayesianRidge:

Bayesian regression provides a natural mechanism for coping with insufficient or
poorly distributed data by formulating linear regression using probability
distributions rather than point estimates. The output or response ‘y’ is assumed
to be drawn from a probability distribution rather than estimated as a single value.

TRAINING

Google Colaboratory (Colab) is used to train each model, as it is a free
cloud-hosted Jupyter Notebook service with free GPU support. It is
research-oriented and does not require environment setup. It supports many machine
learning libraries, which can be loaded easily without depending on local hardware.
The held-out test data are used to validate each model.

A. Training SVR

In this project we tuned the hyper-parameters using both GridSearchCV from
scikit-learn and a manual batch search. Since GridSearchCV did not perform well, we
searched the hyper-parameters manually and found the best set of parameters.
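
A grid search over SVR hyper-parameters could be set up roughly as follows. The search space, the synthetic series and the use of TimeSeriesSplit are all assumptions, since the report does not list the values it tried or the cross-validation scheme; a manual search would replace GridSearchCV with a plain loop over the same candidate values.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

# Synthetic near-linear series standing in for the training data.
rng = np.random.default_rng(0)
X = np.arange(60, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=1.0, size=60)

# Hypothetical search space; not the grid used in the project.
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1.0, 10.0],
    "epsilon": [0.1, 1.0],
}

# TimeSeriesSplit keeps each validation fold after its training fold.
search = GridSearchCV(
    SVR(), param_grid, cv=TimeSeriesSplit(n_splits=3), scoring="r2"
)
search.fit(X, y)
best_svr = search.best_estimator_
print(search.best_params_)
```

One possible reason a default grid search struggles on this kind of data is that shuffled or randomly assigned folds reward interpolation, whereas forecasting requires extrapolation beyond the last training day.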

B. Training LinearRegression

The linear model is trained with the parameters normalize and fit_intercept set to
True. The model is fitted on the training data, and the test data together with the
future dates are used for prediction.
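
The sketch below shows one way to fit and forecast with the linear model on synthetic data. Recent scikit-learn releases no longer accept the `normalize` argument, so this example assumes a StandardScaler step in a pipeline as a stand-in; that substitution, the data and the variable names are not taken from the project code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training series: 500 new cases per day from a base of 1000.
X_train = np.arange(50, dtype=float).reshape(-1, 1)
y_train = 500.0 * X_train.ravel() + 1000.0

# Day indices beyond the training window, i.e. the dates to forecast.
future_days = np.arange(50, 55, dtype=float).reshape(-1, 1)

# Scaling step used here in place of the removed normalize=True option.
model = make_pipeline(StandardScaler(), LinearRegression(fit_intercept=True))
model.fit(X_train, y_train)
forecast = model.predict(future_days)
print(forecast)
```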

C. Training LASSOLars

The LassoLars model is trained with the parameters normalize and fit_intercept set
to True. The model is fitted on the training data, and the test data together with
the future dates are used for prediction.
D. Training BayesianRidge

The Bayesian Ridge model is trained with the parameter normalize set to True and
alpha_init set to 0.01. The model is fitted on the training data, and the test data
together with the future dates are used for prediction.
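
As an illustration of the Bayesian model, the sketch below fits BayesianRidge with `alpha_init=0.01` on synthetic data and asks for the predictive standard deviation alongside the mean. The data are made up, and the `normalize` option is omitted because recent scikit-learn releases no longer accept it.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Synthetic noisy linear series standing in for the training data.
rng = np.random.default_rng(0)
X_train = np.arange(50, dtype=float).reshape(-1, 1)
y_train = 3.0 * X_train.ravel() + rng.normal(scale=2.0, size=50)
future_days = np.arange(50, 55, dtype=float).reshape(-1, 1)

model = BayesianRidge(alpha_init=0.01)
model.fit(X_train, y_train)

# return_std=True gives the spread of the predictive distribution,
# not just a single point estimate per day.
mean, std = model.predict(future_days, return_std=True)
print(mean, std)
```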

The algorithms are then compared on the basis of their accuracy to identify the
most effective predictor.

EXPERIMENTS AND DISCUSSIONS

A. Dataset

The basic dataset has 271 rows and 310 columns. It contains all recorded positive
cases from 01/22/20 to 11/26/20, organised by country. The dates and the numbers of
cases are extracted from the dataset and placed in separate data frames. The data
are then split into training and test sets in the ratios 30:70, 50:50, 60:40 and
80:20.

For the actual prediction the dataset is split 85:15 into training and test sets.
The different splits are used to observe, through the evaluation parameters, how
the model learns with different amounts of training data.
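
The experiment with different split ratios can be sketched as a loop. The example assumes that each ratio is read as train:test, that the split is chronological, and that a plain LinearRegression is the model; the series is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic near-linear cumulative series (illustrative only).
rng = np.random.default_rng(0)
days = np.arange(200, dtype=float).reshape(-1, 1)
cases = 1000.0 * days.ravel() + rng.normal(scale=500.0, size=200)

scores = {}
for train_pct in (30, 50, 60, 80, 85):
    # Chronological split: the first train_pct percent of days are used for training.
    X_tr, X_te, y_tr, y_te = train_test_split(
        days, cases, train_size=train_pct / 100, shuffle=False
    )
    model = LinearRegression().fit(X_tr, y_tr)
    scores[train_pct] = r2_score(y_te, model.predict(X_te))

for train_pct, score in scores.items():
    print(f"{train_pct}% train: R2 = {score:.4f}")
```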

B. Evaluation Parameters

In this project, model performance is measured in terms of R-squared (R2), and we
have concentrated mainly on this score. A negative R2 value indicates that the model
performs worse than a constant model that always predicts the mean of the actual
values, and it can be arbitrarily bad. A value at or near 1.0 indicates the best
performance.
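
The behaviour of R2 at these extremes follows from its definition, R2 = 1 - SS_res / SS_tot. The short sketch below computes it by hand for three hypothetical sets of predictions; the helper name `r2_score_manual` and the numbers are illustrative.

```python
def r2_score_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, computed directly from the definition."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
good = r2_score_manual(y_true, [1.1, 1.9, 3.2, 3.8])  # close to 1
baseline = r2_score_manual(y_true, [2.5] * 4)         # always predicts the mean: 0
bad = r2_score_manual(y_true, [4.0, 3.0, 2.0, 1.0])   # worse than the mean: negative

print(good, baseline, bad)
```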

C. Comparison of Models

SVR Prediction: The SVR model is trained and tested with different ratios of
training and test data. GridSearchCV gave poor results and was rejected, but when
the hyper-parameters were tuned manually, the SVR performed very well with high
accuracy.
Fig. 2

As shown in Fig. 2, the predicted and actual lines nearly coincide, which indicates
visually that the SVR predictions are quite accurate. This can be confirmed by
evaluating the performance of the model.

The R2 score is 0.9602768326663619. This is considered a good score, and it was
achieved only through manual hyper-parameter tuning. In other words, the SVR model
explains about 96% of the variance in the actual values. The performance of the
model can be checked further by comparing the cases predicted by SVR with the
actual cases confirmed by the WHO.

DATE          SVR PREDICTION        ACTUAL CASES
12/22/2020    75118537.13544364     78403123
12/23/2020    75734738.40461905     79099682
12/24/2020    76354282.50375342     79795114
12/25/2020    76977178.4462428      80330270
12/26/2020    77603435.24548353     80799536


Linear Regression: The Linear Regression model is trained and tested with the
parameters normalize and fit_intercept set to True. As mentioned above, the model
is trained with different ratios of training and test data, and its performance is
evaluated for each. The performance of the model can be visualized by plotting the
predicted values against the actual values.

Fig. 3

As shown in Fig. 3, the two lines almost coincide, which indicates visually that the
predictions of the Linear Regression model are quite accurate.

The R2 score is 0.9945359160912455, so the model explains over 99% of the variance
in the actual values. The performance of the model can be checked further by
comparing the predicted cases with the actual cases confirmed by the WHO.

DATE          LR PREDICTION         ACTUAL CASES
12/22/2020    73831351.52370489     78403123
12/23/2020    74354364.3489824      79099682
12/24/2020    74877377.17425993     79795114
12/25/2020    75400389.99953744     80330270
12/26/2020    75923402.82481498     80799536


LASSOLars: The LassoLars model is trained with the parameters normalize and
fit_intercept set to True. As mentioned above, the model is trained with different
ratios of training and test data, and its performance is evaluated for each. The
performance of the model can be visualized by plotting the predicted values against
the actual values.

Fig. 4

As shown in Fig. 4, the two lines almost coincide, which indicates visually that the
predictions of the LassoLars model are quite accurate.

The R2 score is 0.9994263506035728, so the model explains about 99.9% of the
variance in the actual values. The performance of the model can be checked further
by comparing the predicted cases with the actual cases confirmed by the WHO.

DATE          LASSOLars PREDICTION  ACTUAL CASES
12/22/2020    73831326.75883234     78403123
12/23/2020    74354339.0787043      79099682
12/24/2020    74877351.39857626     79795114
12/25/2020    75400363.71844822     80330270
12/26/2020    75923376.03832018     80799536

Bayesian Ridge: The Bayesian Ridge model is trained with the parameter normalize
set to True and alpha_init set to 0.01. As mentioned above, the model is trained
with different ratios of training and test data, and its performance is evaluated
for each. The performance of the model can be visualized by plotting the predicted
values against the actual values.

Fig. 5

As shown in Fig. 5, the two lines almost coincide, which indicates visually that the
predictions of the Bayesian model are quite accurate.

The R2 score is 0.9945346036480124, so the model explains over 99% of the variance
in the actual values. The performance of the model can be checked further by
comparing the predicted cases with the actual cases confirmed by the WHO.

DATE          BAYESIAN RIDGE PREDICTION   ACTUAL CASES
12/22/2020    73828307.35444012           78403123
12/23/2020    74351258.05381426           79099682
12/24/2020    74874208.75318843           79795114
12/25/2020    75397159.4525626            80330270
12/26/2020    75920110.15193674           80799536
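
One way to put the four comparison tables on a common scale is the mean absolute percentage error (MAPE) of each model's forecasts against the actual counts. The report itself does not use this measure, so it is added here only as an illustrative sketch; the numbers are copied from the tables above.

```python
# Actual confirmed cases for 12/22/2020 to 12/26/2020, from the tables above.
actual = [78403123, 79099682, 79795114, 80330270, 80799536]

# Forecasts for the same five dates, copied from the tables above.
predictions = {
    "SVR": [75118537.13544364, 75734738.40461905, 76354282.50375342,
            76977178.4462428, 77603435.24548353],
    "Linear Regression": [73831351.52370489, 74354364.3489824, 74877377.17425993,
                          75400389.99953744, 75923402.82481498],
    "LassoLars": [73831326.75883234, 74354339.0787043, 74877351.39857626,
                  75400363.71844822, 75923376.03832018],
    "Bayesian Ridge": [73828307.35444012, 74351258.05381426, 74874208.75318843,
                       75397159.4525626, 75920110.15193674],
}

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)

errors = {name: mape(actual, preds) for name, preds in predictions.items()}
for name, err in sorted(errors.items(), key=lambda kv: kv[1]):
    print(f"{name}: {err:.2f}%")
```

Unlike R2, which is computed on the test split, this compares the forecasts for the five later dates directly with the reported counts.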

CONCLUSION
In conclusion, all four regression models performed very well, and the LassoLars
model, which has the highest R2 score, outperformed the other models.

The current work suggests that, over the period studied, Covid-19 cases grew
approximately linearly from day to day. This can be seen in Figs. 2, 3, 4 and 5,
where the number of cases rises in a linear fashion, and it suggests that the
pandemic will remain a major threat for another year or two. Hence we need to take
the utmost precautions and measures to decrease the spread of this pandemic.
