Parbat 2020


Chaos, Solitons and Fractals 138 (2020) 109942

Contents lists available at ScienceDirect

Chaos, Solitons and Fractals


Nonlinear Science, and Nonequilibrium and Complex Phenomena
journal homepage: www.elsevier.com/locate/chaos

A python based support vector regression model for prediction of COVID19 cases in India
Debanjan Parbat a, Monisha Chakraborty b,∗
a Research Scholar, Biomedical Instrumentation Lab, School of Bioscience & Engineering, Jadavpur University, Kolkata, India
b Professor, School of Bioscience & Engineering, Jadavpur University, Kolkata, India

Article info

Article history:
Received 4 May 2020
Accepted 26 May 2020
Available online 31 May 2020

Keywords:
COVID19
India
Support vector regression
Machine learning
Python
RBF
Data analysis

Abstract

The proposed work utilizes a support vector regression model to predict the total number of deaths, recovered cases, cumulative number of confirmed cases and number of daily cases. The data is collected for the time period of 1st March, 2020 to 30th April, 2020 (61 days). The total number of cases as on 30th April is found to be 35043 confirmed cases with 1147 total deaths and 8889 recovered patients. The model has been developed in Python 3.6.3 to obtain the predicted values of the aforementioned cases till 30th June, 2020. The proposed methodology is based on prediction of values using a support vector regression model with Radial Basis Function as the kernel and a 10% confidence interval for the curve fitting. The data has been split into train and test sets with test size 40% and training size 60%. The model performance parameters are calculated as mean square error, root mean square error, regression score and percentage accuracy. The model has above 97% accuracy in predicting deaths, recovered and cumulative number of confirmed cases, and 87% accuracy in predicting daily new cases. The results suggest a Gaussian decrease of the number of cases, and it could take another 3 to 4 months to come down to the minimum level with no new cases being reported. The method is very efficient and has higher accuracy than linear or polynomial regression.

© 2020 Published by Elsevier Ltd.

1. Introduction

The spread of coronavirus disease 2019 (COVID-19) has become a global threat and the World Health Organization (WHO) declared COVID-19 a global pandemic on March 11, 2020 [1]. As of April 30, 2020, there were 3,359,055 confirmed cases and 238,999 deaths from COVID-19 worldwide [7] (https://coronavirus.jhu.edu/data/new-cases). The COVID-19 pandemic has been greatly affecting people's lives and the world's economy. Among many infection related questions, governments and people are most concerned with (i) when the COVID19 infection rate will reach its maximum; (ii) how long the pandemic will take to stop spreading; (iii) what the total number of individuals eventually infected could be; and (iv) what the total number of deaths will be [4]. These questions are of primary concern in India also, a country with high population density and economic diversity. The spread of the disease in India is considerably lower than that of China, the USA and European countries. India has been under complete lockdown since 21st March, 2020 and experts believe that this could be instrumental in mitigating the COVID19 spread among its citizens. Currently the development of vaccines is still in progress and there are no effective antiviral drugs for treating COVID-19 infections. As of April 30, the total number of COVID19 cases in India is 35043 and 1147 people have died due to Severe Acute Respiratory Syndrome (SARS) (https://www.mohfw.gov.in/). The total number of recovered COVID19 individuals in India is 8889 to date.

The lockdown is severely affecting the poor and migrant labourers. Staying at home may not be a feasible option in the near future since a lot of people may die of hunger and other ailments. News media all over the world are reporting on the crisis and how it is affecting the lives of people. Much research is being carried out at all levels to quickly gather information, develop mitigation tools and methods, and implement them. Policy makers and authorities therefore want an overall view of the current situation and want to visualize the extent to which the disease can spread in the near future, for informed policy making and for deciding the next course of action.

This paper discusses the proposed prediction model of COVID19 spread in India using support vector regression implemented in Python 3.6. The steps of the model are discussed in the methodology section with subsequent analysis. The results are shown and discussed. The authors conclude the overall purpose of the work in the Conclusion.

∗ Corresponding author.
E-mail addresses: debanjanparbat.rs@jadavpuruniversity.in (D. Parbat), monishachakraborty@rediffmail.com (M. Chakraborty).
https://doi.org/10.1016/j.chaos.2020.109942
0960-0779/© 2020 Published by Elsevier Ltd.
D. Parbat and M. Chakraborty / Chaos, Solitons and Fractals 138 (2020) 109942

2. Methodology

2.1. Preparation of the dataset

The .csv file of the Novel Coronavirus 2019 dataset available at https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset is downloaded. A separate .csv file is created from the global dataset only for India. The columns include Total Deaths, Total Recovered and Total number of confirmed COVID19 patients on a day-to-day basis from 1st March, 2020 to 30th April, 2020 (61 days). All the data is in cumulative form. From the cumulative dataset, we have computed the difference time series to get the values on a daily new case basis. So we have now extended our dataset to have six columns: 3 for cumulative cases and 3 for the respective daily new cases of deaths, recovery or confirmed COVID19 individuals.

2.2. Data preprocessing

In the data preprocessing section, we have set the columns created above as the dependent variable column (y) and the number of days starting from 1st March as the independent variable (X). The X column is basically a numpy array of elements 1 to 61. X and y are then reshaped to be column vectors of size 61 (i.e. 61 rows, 1 column). The dataset is split for Training (60%) and Test (40%) using the train_test_split() function imported from the model_selection module of the sklearn python library. The training and testing variables are saved for further evaluation.

The training and testing variables of both X and y are standardized using the StandardScaler() object imported from the preprocessing module of the sklearn python library. Separate objects have been created for standardization of X and y data. The fit_transform() function is used to fit the object to the data and transform the values of X and y into standard form, ranging roughly from -3 to +3. The scaled data is now fit for regression application.

2.3. Support vector regression

Support vector regression is a popular choice for prediction and curve fitting for both linear and non linear regression types. SVR is based on the elements of Support vector machine (SVM), where support vectors are basically closer points towards the generated hyperplane in an n-dimensional feature space that distinctly segregates the data points about the hyperplane. More discussions on the SVR and SVM can be found in [3,2,6]. The SVR model performs the fitting as shown in Fig. 1. The generalized equation for the hyperplane may be represented as y = wX + b, where w is the weights and b is the intercept at X = 0. The margin of tolerance is represented by epsilon ε. The SVR regression model is imported from the svm module of the sklearn python library. The regressor is fit on the training dataset. The model parameters chosen here for analysis are shown below.
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
gamma='auto', kernel='rbf', max_iter=-1, shrinking=True,
tol=0.001, verbose=False)

2.4. Visualization

The regression fitting of the data with predicted values of the test data is plotted using the scatter plot function imported from the matplotlib python library. The actual points and the predicted points are shown in Fig. 2 for all the respective conditions [5].

2.5. Model performance evaluation

The model performance parameters are then evaluated to check the reliability in predicting the outcome. The mean square error (MSE), root mean square error (RMSE), R2 score and percentage accuracy are calculated and shown in Table 1.

Table 1
The support vector regression model performance parameters with RBF kernel and 10% fitting confidence interval

Data                    MSE        RMSE       Reg. score   % Accuracy
Total deaths            0.00849    0.092142   0.986812     99%
Total recovered         0.030289   0.174036   0.973437     97%
Daily confirmed         0.109448   0.330830   0.874900     87%
Cumulative confirmed    0.012856   0.113386   0.988613     99%
Daily deaths            0.130847   0.361727   0.821829     82%

2.6. Prediction

The prediction of the future values of the time series involves a few steps of data manipulation to obtain the cumulative trend so as to match the original dataset trend of the past. The past dataset is in cumulative form, but since we have implemented the RBF kernel in our model, the predicted time series would follow a decreasing Gaussian trend. The decreasing trend can be preserved by a transformation as discussed below. We have implemented a few steps in the algorithm that help us reach our objective.

Here we have obtained the predicted time series for each case separately for 60 more days that start just after 30th April, or the 61st day from the start. We therefore wish to merge the 60-day prediction with the past 61 days. The predicted column consists of decreasing values. So, we have computed the difference of the time series and then used the absolute values of the difference time series. The difference time series gets inverted and gives us a rising trend, which saturates after certain values. Then we performed a cumulative sum of the elements of the time series and added the max value of the past time series to it. This helps us in preserving the trend and visualizing it in cumulative form. The plots of the past and forecast values are shown in Fig. 3 and Fig. 4.

This transformation is not required for prediction of the time series of daily new cases.

All the necessary code used in the above mentioned steps is uploaded to a GitHub repository for further use and improvement. The link is https://github.com/DebanjanParbat/Support-Vector-Regression

3. Results and discussion

The results show that the model performed well in fitting the cumulative cases while a poor fitting is observed in the case of the daily number of cases. The daily data show that there are many spikes, which reduce the accuracy of predictability of the model. The model predicts that the total number of infected persons may cross the 55000 mark by the second week of June if the current rate of daily new cases prevails. Based on the recent trends, the total number of deaths is predicted to surpass the 1600 mark within the second week of June.

Moreover, if more spikes occur in daily deaths and daily new cases then the total number of infected persons may rise and there could be more delay in attaining flattening of the curve. The spikes induce non-stationarity in the dataset, making it difficult for regression models to predict accurately. But we can say that if in the near future the spikes are controlled with strict physical distancing and containment measures then the flattening of the curve can be achieved by the end of the 2nd week of June.
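The daily series whose spikes are discussed above are derived from the cumulative counts by differencing (Section 2.1). A minimal sketch of that step is given here. The numbers are made up for illustration and are not the Kaggle data, the helper name daily_from_cumulative is hypothetical, and treating the first day's daily value as equal to its cumulative value is an assumption the paper does not state.

```python
import numpy as np

# Hypothetical cumulative counts, one value per day (NOT the real dataset).
cumulative = np.array([3.0, 5.0, 5.0, 9.0, 14.0, 20.0])


def daily_from_cumulative(cum):
    """Return the difference time series of a cumulative series.

    Assumption: the first day's daily value equals its cumulative value,
    which is what prepending a zero before differencing produces.
    """
    return np.diff(cum, prepend=0.0)


daily = daily_from_cumulative(cumulative)

# Day index used as the independent variable X: 1..N as a column vector,
# mirroring the "61 rows, 1 column" layout described in Section 2.2.
days = np.arange(1, len(cumulative) + 1).reshape(-1, 1)

print(daily)       # daily new counts
print(days.shape)  # (N, 1)
```

Summing the daily series again recovers the cumulative one, which is a quick consistency check when building the six-column dataset.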

Fig. 1. Support vector regression model for linear regression fitting where X1= X and X2 = y are the features and label in our case. [Image credit:
https://www.researchgate.net/figure/Schematic-of-the-one-dimensional-support-vector-regression-SVR-model-Only-the-points_fig5_320916953]

Fig. 2. The figures shown here are the plots of regression fit with the data for total deaths, total recovered, cumulative confirmed cases and daily confirmed cases (in
clockwise direction)
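The regression fits plotted in Fig. 2 follow the split, scaling and SVR steps of Sections 2.2 and 2.3, scored with the measures of Section 2.5. The sketch below walks through those steps on a synthetic logistic-shaped curve, not the real data. Several details are assumptions rather than the authors' code: the random_state value, fitting the scalers on the training split only and reusing them on the test split, and computing the errors in standardized units. Reporting the regression score as a percentage is likewise only one plausible reading of the "% Accuracy" column in Table 1.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for one cumulative column over 61 days (NOT real data).
X = np.arange(1, 62, dtype=float).reshape(-1, 1)
y = (35000.0 / (1.0 + np.exp(-(X.ravel() - 45.0) / 6.0))).reshape(-1, 1)

# 60% training / 40% test split; random_state is an assumption for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Separate scalers for X and y. Assumption: fit on the training split only,
# then apply the same transform to the test split.
sc_X, sc_y = StandardScaler(), StandardScaler()
X_train_s = sc_X.fit_transform(X_train)
X_test_s = sc_X.transform(X_test)
y_train_s = sc_y.fit_transform(y_train).ravel()
y_test_s = sc_y.transform(y_test).ravel()

# RBF-kernel SVR with the parameter values listed in Section 2.3.
regressor = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma="auto")
regressor.fit(X_train_s, y_train_s)
y_pred_s = regressor.predict(X_test_s)

# Performance measures, computed here in standardized units (an assumption).
mse = mean_squared_error(y_test_s, y_pred_s)
rmse = float(np.sqrt(mse))
r2 = r2_score(y_test_s, y_pred_s)
print(f"MSE={mse:.6f}  RMSE={rmse:.6f}  R2={r2:.6f}  R2 as %={100 * r2:.0f}%")
```

Predictions can be mapped back to case counts with sc_y.inverse_transform(y_pred_s.reshape(-1, 1)) before plotting them against the actual points.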

Fig. 3. The past and forecast of the total deaths, total recovered, cumulative confirmed and daily confirmed cases of COVID19 patients in India. [Past: 1st Mar to 30th April; Forecast: 1st May to 30th June]
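The cumulative forecast curves in Fig. 3 depend on the transformation described in Section 2.6: difference the decreasing predicted series, take absolute values, cumulatively sum, and add the maximum of the past series. The following is a small sketch of that sequence under stated assumptions. The function name to_cumulative_forecast and the toy numbers are illustrative, not taken from the authors' repository.

```python
import numpy as np


def to_cumulative_forecast(predicted, past_cumulative):
    """Turn a decreasing predicted series into a rising cumulative forecast.

    Steps, as described in Section 2.6:
      1. difference the predicted series,
      2. take absolute values of the differences,
      3. cumulatively sum them,
      4. add the maximum of the past cumulative series.
    Note that differencing shortens the series by one element.
    """
    steps = np.abs(np.diff(predicted))
    return np.max(past_cumulative) + np.cumsum(steps)


# Toy values (NOT real data): a past cumulative series and a decreasing prediction.
past = np.array([10.0, 25.0, 45.0, 70.0, 100.0])
predicted = np.array([9.0, 7.0, 4.0, 2.0, 1.0, 0.5])

forecast = to_cumulative_forecast(predicted, past)
print(forecast)

# Past and forecast can then be joined for plotting in cumulative form.
merged = np.concatenate([past, forecast])
```

Because the absolute differences shrink as the predicted series levels off, the resulting forecast rises and then saturates, which matches the behaviour the text describes.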

Fig. 4. The past and forecast of the daily number of deaths

4. Conclusion

The proposed methodology predicts the total number of COVID19 infected cases, the number of daily new cases, the total number of deaths and the number of daily new deaths. The total number of recovered individuals is also predicted. Based on the recent trends, the future trends have been predicted using a robust machine learning model, support vector regression. SVR has been reported to be more consistent in its predictions than linear, polynomial and logistic regression models. The variability in the dataset is addressed by the proposed methodology. The model has above 97% accuracy in predicting deaths, recovered and cumulative number of confirmed cases, and 87% accuracy in predicting daily new cases. The disease spread is significantly high, and if proper containment measures with physical distancing and hygiene are maintained then we can reduce the spikes in the dataset and hence lower the rate of progression.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] Boccaletti S. Modeling and forecasting of epidemic spreading: the case of COVID-19 and beyond. Chaos Solitons Fractals 2020.
[2] Drucker H. Support vector regression machines. In: Advances in neural information processing systems. MIT Press; 1997. p. 155–61.
[3] Hastie TJ. The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2008.
[4] Li L. Propagation analysis and prediction of the COVID-19. Infect Dis Model 2020:282–92.
[5] Matplotlib. Documentation 2020.
[6] Sci-kit-learn. (2020). https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html.
[7] Zhang. Predicting turning point, duration and attack rate of COVID-19 outbreaks in major Western countries. Chaos Solitons Fractals 2020.
