SANDYA VB-Business Report TSF
SANDYA VB-Business Report TSF
REPORT
TIME SERIES FORECASTING
SANDYA V B
CONTENTS
1. Read the data as an appropriate Time Series data and plot the data.
2. Perform appropriate Exploratory Data Analysis to understand the data and also perform
decomposition.
3. Split the data into training and test. The test data should start in 1991.
4. Build various exponential smoothing models on the training data and evaluate the model
using RMSE on the test data.
Other models such as regression, naïve forecast models, simple average models etc. should
also be built on the training data and check the performance on the test data using RMSE.
5. Check for the stationarity of the data on which the model is being built on using
appropriate statistical tests and also mention the hypothesis for the statistical test. If the
data is found to be non-stationary, take appropriate steps to make it stationary. Check the
new data for stationarity and comment.
Note: Stationarity should be checked at alpha = 0.05.
6. Build an automated version of the ARIMA/SARIMA model in which the parameters are
selected using the lowest Akaike Information Criteria (AIC) on the training data and
evaluate this model on the test data using RMSE.
7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the
training data and evaluate this model on the test data using RMSE.
8. Build a table with all the models built along with their corresponding parameters and the
respective RMSE values on the test data.
9. Based on the model-building exercise, build the most optimum model(s) on the complete
data and predict 12 months into the future with appropriate confidence intervals/bands.
10. Comment on the model thus built and report your findings and suggest the measures
that the company should be taking for future sales.
PROBLEM:
For this particular assignment, the data of different types of wine sales in the 20th century is to be
analysed. Both of these data are from the same company but of different wines. As an analyst in the ABC
Estate Wines, you are tasked to analyse and forecast Wine Sales in the 20th century.
➢ The two datasets: Rose and Sparkling are imported using the read command. And convert to time series
data using
date_range function:
date = pd.date_range(start='01/01/1980', end='08/01/1995', freq='M')date
df['Time_Stamp'] = pd.DataFrame(date,columns=['Month'])
df.head()
o/p:
2. Perform appropriate Exploratory Data Analysis to understand the data and also
perform decomposition.
• The shape of the data is (187,1). • The shape of the data is (187,1).
• There are 2 null values present in the data, which • There are no null values present.
was interpolated using linear method. • Describing the data:
• Describing the data:
Measures count mean std min 25% 50% 75% max Measures count mean std min 25% 50% 75% max
Rose 185 90.3 39.1 28 63 86 112 267 Rose 187 2402.41 1295.11 1070 1605 1874 2549 7242
• From the above plot we see that the box plots • From the above plot, we see that the box plots
indicates a downward trend do not indicate any trend.
• We also see that there are few outliers present in • We also observe that the sale of Sparkling wine
the sales plot. has outliers for almost all the years except
1955.
• To resample or aggregate the Time Series from an • To resample or aggregate the Time Series from
annual perspective and take the mean of the an annual perspective and take the mean of the
observations of the year. observations of the year.
• If we take the resampling period to be 10 years or • If we take the resampling period to be 10 years
a decade, we see that the seasonality present has or a decade, we see that the seasonality present
been smoothed over and it is only giving an has been smoothed over and it is only giving
estimate of the trend. an estimate of the trend.
• For the multiplicative series, we see that a lot of • For the multiplicative series, we see that a lot
residuals are located around 1. of residuals are located around 1.
3. Split the data into training and test. The test data should start in 1991.
ROSE WINE TRAIN & TEST DATA SPARKLING WINE TRAIN & TEST
DATA
• The train data of Rose wine has been splitted • The train data of Rose wine has been splitted
upto the year 1990 and has 132 data points. upto the year 1990 and has 132 data points.
• The test data has been splitted from the year • The test data has been splitted from the year
1991 and has 55 data points. 1991 and has 55 data points.
• From our train-test data split we will be • From our train-test data split we will be
predicting the future sales in comparison with predicting the future sales in comparison with
the past years’ sale. the past years’ sale.
• Training data: • Training data:
First few rows of Training Data First few rows of Training Data
Rose Sparkling
Time_Stamp Time_Stamp
1980-01-31 112.0 1980-01-31 1686
1980-02-29 118.0 1980-02-29 1591
1980-03-31 129.0 1980-03-31 2304
1980-04-30 99.0 1980-04-30 1712
1980-05-31 116.0 1980-05-31 1471
Last few rows of Training Data Last few rows of Training Data
Rose Sparkling
Time_Stamp Time_Stamp
1990-08-31 70.0 1990-08-31 1605
1990-09-30 83.0 1990-09-30 2424
1990-10-31 65.0 1990-10-31 3116
1990-11-30 110.0 1990-11-30 4286
1990-12-31 132.0 1990-12-31 6047
Last few rows of Test Data Last few rows of Test Data
Rose Sparkling
Time_Stamp Time_Stamp
1995-03-31 45.0 1995-03-31 1897
1995-04-30 52.0 1995-04-30 1862
1995-05-31 28.0 1995-05-31 1670
1995-06-30 40.0 1995-06-30 1688
1995-07-31 62.0 1995-07-31 2031
4. Build various exponential smoothing models on the training data and evaluate
the model using RMSE on the test data. Other models such as regression, naïve
forecast models and simple average models. should also be built on the training
data and check the performance on the test data using RMSE.
• First few rows of Training Data • First few rows of Training Data
Rose time Sparkling time
Time_Stamp Time_Stamp
1980-01-31 112.0 1 1980-01-31 1686 1
1980-02-29 118.0 2 1980-02-29 1591 2
1980-03-31 129.0 3 1980-03-31 2304 3
1980-04-30 99.0 4 1980-04-30 1712 4
1980-05-31 116.0 5 1980-05-31 1471 5
Last few rows of Training Data Last few rows of Training Data
Rose time Sparkling time
Time_Stamp Time_Stamp
1990-08-31 70.0 128 1990-08-31 1605 128
1990-09-30 83.0 129 1990-09-30 2424 129
1990-10-31 65.0 130 1990-10-31 3116 130
1990-11-30 110.0 131 1990-11-30 4286 131
1990-12-31 132.0 132 1990-12-31 6047 132
First few rows of Test Data First few rows of Test Data
Rose time Sparkling time
Time_Stamp Time_Stamp
1991-01-31 54.0 43 1991-01-31 1902 43
1991-02-28 55.0 44 1991-02-28 2049 44
1991-03-31 66.0 45 1991-03-31 1874 45
1991-04-30 65.0 46 1991-04-30 1279 46
1991-05-31 60.0 47 1991-05-31 1432 47
Last few rows of Test Data Last few rows of Test Data
Rose time Sparkling time
Time_Stamp Time_Stamp
1995-03-31 45.0 93 1995-03-31 1897 93
1995-04-30 52.0 94 1995-04-30 1862 94
1995-05-31 28.0 95 1995-05-31 1670 95
1995-06-30 40.0 96 1995-06-30 1688 96
1995-07-31 62.0 97 1995-07-31 2031 97
• For the Naïve Model, we observe that the • For the Naïve Model, we observe that the
green line in the plot below shows a straight green line in the plot below shows a straight
line. line.
• Which predicts that the sale for tomorrow is • Which predicts that the sale for tomorrow is
the same as today. the same as today.
• And the prediction for day after tomorrow is • And the prediction for day after tomorrow is
tomorrow. tomorrow.
• Hence it applies to all the future years. • Hence it applies to all the future years.
• TEST RMSE SCORE = 79.718773 • TEST RMSE SCORE = 3864.279352
• In Simple Average method, we will forecast • In Simple Average method, we will forecast
the data using the average of the training the data using the average of the training
values. values.
• From the plot below, we observe that the • From the plot below, we observe that the green
green line is straight and shows the Simple line is straight and shows the Simple Average
Average forecasting. forecasting.
• TEST RMSE SCORE = 53.460570 • TEST RMSE SCORE = 1275.081804
➢ MODEL 4: MOVING AVERAGE MODEL ➢ MODEL 4: MOVING AVERAGE MODEL
• In Moving Average Model, we compute • In Moving Average Model, we compute
moving averages for 2, 4, 6 and 9 point moving averages for 2, 4, 6 and 9 point
intervals. intervals.
• Then the best interval is determined by the • Then the best interval is determined by the
maximum accuracy. maximum accuracy.
• From the below table we see that 2 point • From the below table we see that 2 point
trailing moving average has the least score. trailing moving average has the least score.
Alpha Values Train RSME Test RSME Alpha Values Train RSME Test RSME
0.3 32.470164 47.504821 0.3 1359.511747 1935.507132
0.4 33.035130 53.767406 0.4 1352.588879 2311.919615
0.5 33.682839 59.641786 0.5 1344.004369 2666.351413
0.6 34.441171 64.971288 0.6 1338.805381 2979.204388
0.7 35.323261 69.698162 0.7 1338.844308 3249.944092
0.8 36.334596 73.773992
0.9 37.482782 77.139276
• To check the stationarity of Rose data, we if the • To check the stationarity of Sparkling data, we
alpha value is less than 0.05 if the alpha value is less than 0.05
• From the above result we see that the alpha = 0.34 • From the above result we see that the alpha =
which is higher than 0.05 0.60 which is higher than 0.05
• Hence, we fail to reject the null hypothesis • Hence, we fail to reject the null hypothesis
• From the above result we see that the alpha = • From the above result we see that the alpha =
0.21 which is higher than 0.05 0.66 which is higher than 0.05
• Hence, we take a difference of 1 to make the data • Hence, we take a difference of 1 to make the
stationary. data stationary.
• From the below result we see that the value of • From the below result we see that the value of
alpha is less than 0.05. alpha is less than 0.05.
• To build the automated ARIMA model we • To build the automated ARIMA model we
arrange AIC value from lowest to highest. arrange AIC value from lowest to highest.
• And then proceed to build the ARIMA model • And then proceed to build the ARIMA model
with the lowest Akaike Information Criteria with the lowest Akaike Information Criteria
(AIC) value. (AIC) value.
param AIC
param AIC
(2,1,2) 2210.616692
(0,1,2) 1276.835377
(2,1,1) 2232.360490
(1,1,2) 1277.359224
(0,1,2) 2232.783098
(1,1,1) 1277.775754
(1,1,2) 2233.597647
(2,1,1) 1279.045689
(1,1,1) 2235.013945
(2,1,2) 1279.298694
• We see that in the ACF plot there is seasonality at • We see that in the ACF plot there is seasonality
the interval of 6 and 12. at the interval of 6 and 12.
• Therefore, we run the automated SARIMA model • Therefore, we run the automated SARIMA
for both the intervals. model for both the intervals.
• The sorted AIC values from lowest to highest. • The sorted AIC values from lowest to highest.
• TEST RMSE SCORE for interval 6= 26.13355444 • TEST RMSE SCORE for interval 6 =
626.880153
• Manual ARIMA model is built based on ACF • Manual ARIMA model is built based on ACF
plot and PACF plot. plot and PACF plot.
• Hence, we choose AR parameter value as p and • Hence, we choose AR parameter value as p and
moving average parameter value to be q. moving average parameter value to be q.
• TEST RMSE SCORE = 15.73425 • TEST RMSE SCORE = 1461.6785026
➢ MANUAL SARIMA ➢ MANUAL SARIMA
• Manual ARIMA model is built based on ACF • Manual ARIMA model is built based on ACF
plot and PACF plot. plot and PACF plot.
• Hence, we choose AR parameter value as p, • Hence, we choose AR parameter value as p,
moving average parameter value to be q and moving average parameter value to be q and
d(difference) value to be 1. d(difference) value to be 1.
• We then derive the seasonal parameters based • We then derive the seasonal parameters based
on the seasonal cut-off. on the seasonal cut-off.
• TEST RMSE SCORE = 20.96410 • TEST RMSE SCORE = 558.438329
• The Standardized Residual do not display any
• The Standardized Residual do not display any obvious seasonality.
obvious seasonality. • Histogram plus estimated density shows the
• Histogram plus estimated density shows the KDE of the residuals is in normal distribution,
KDE of the residuals is in normal distribution, therefore the model is normally distributed.
therefore the model is normally distributed. • Normal Q-Q plot tells about the ordered
• Normal Q-Q plot tells about the ordered distribution of residuals following the linear
distribution of residuals following the linear trend taken normal distribution with N(0,1).
trend taken normal distribution with N(0,1).
• Correlogram time series residuals have low
• Correlogram time series residuals have low
correlation with lagged version.
correlation with lagged version.
8. Build a table (create a data frame) with all the models built along with their
corresponding parameters and the respective RMSE values on the test data.
TEST
MODEL – ROSE RMS TEST
MODEL – SPARKLING
E RMSE
Alpha=0.3,Beta=0.4,Gamma=0.3,TripleExponentialSmoothing 10.945435
Alpha=0.154,Beta=1.307,Gamma=0.371,TripleExponentialSmoothing 383.15568
2pointTrailingMovingAverage 11.529278
Alpha=0.3,Beta=0.3,Gamma=0.3,TripleExponentialSmoothing 392.78619
4pointTrailingMovingAverage 14.451403
SARIMA(1,1,2)(1,0,2,12) 528.45273
6pointTrailingMovingAverage 14.566327
SARIMA(0,1,0)(1,1,3,6) 558.43832
9pointTrailingMovingAverage 14.727630
SARIMA(1,1,2)(2,0,2,6) 626.88015
ARIMA(0,1,2) 15.618912
2pointTrailingMovingAverage 813.40068
ARIMA(1,1,1) 15.734259
4pointTrailingMovingAverage 1156.5896
Alpha=0.106,Beta=0.048,Gamma=0.0,TripleExponentialSmoothin
17.369489
g Alpha=0.0,SimpleExponentialSmoothing 1275.0817
SARIMA(1,1,2)(2,0,2,6) 20.964110
SimpleAverageModel 1275.0818
SARIMA(1,1,2)(2,0,2,6) 26.133554
RegressionOnTime 1275.8670
SARIMA(0,1,2)(2,0,2,12) 26.929368
6pointTrailingMovingAverage 1283.9274
Alpha=0.098,SimpleExponentialSmoothing 36.796244 9pointTrailingMovingAverage 1346.2783
Alpha=0.3,SimpleExponentialSmoothing 47.504821 ARIMA(0,1,2) 1374.9769
RegressionOnTime 51.433312 ARIMA(1,1,1) 1461.6785
SimpleAverageModel 53.460570 Alpha=0.3,SimpleExponentialSmoothing 1935.5071
NaiveModel 79.718773 NaiveModel 3864.2793
Alpha=0.3,Beta=0.1,DoubleExponentialSmoothing 98.653317 Alpha=0.3,Beta=0.3,DoubleExponentialSmoothing 18259.110
• From the above table we see that the lowest • From the above table we see that the lowest score
score is 10.945435. is 383.15568.
• Obtained from triple exponential smoothing • Obtained from triple exponential smoothing
model. model.
• Which was executed on different alpha, beta • Which was executed on auto/manual fit
and gamma values ranging from 0.3 to 1.0. parameters alpha, beta and gamma values.
• Parameters having lowest score alpha = 0.3, • Whose smoothing level(alpha) = 0.154,
beta = 0.4 and gamma = 0.3. smoothing trend(beta) = 1.307 and smoothing
seasonality(gamma) = 0.371.
9. Based on the model-building exercise, build the most optimum model(s) on the
complete data and predict 12 months into the future with appropriate
confidence intervals/bands.
• From the previous answer we observe that • From the previous answer we observe that
Triple Exponential Smoothing has the least Triple Exponential Smoothing has the least
RMSE score. RMSE score.
• It falls under most optimum model compared to • It falls under most optimum model compared to
other models. other models.
• The model is calculated with the parameters • The model is calculated with the parameters
having alpha =0.3, beta = 0.3, gamma =0.3. having alpha =0.154, beta = 1.307, gamma
• The upper and lower bands are calculated with =0.371.
95% accuracy. • The upper and lower bands are calculated with
• The final TEST RMSE SCORE = 24.2665. 95% accuracy.
• The final TEST RMSE SCORE = 353.9124
10.Comment on the model thus built and report your findings and suggest the
measures that the company should be taking for future sales.
➢ Time series analysis involves understanding various aspects about the inherent nature of the series so that
you are better informed to create meaningful and accurate forecasts
➢ Any time series may be split into the following components: Base Level + Trend + Seasonality + Error.
• Rose sales shows decrease in trend compared • Sparkling sales shows stabilized values.
to the previous years. • December month shows the highest sales.
• December month shows the highest sales. • The models are built and are chosen based on
• The models are built and are chosen based on the least RMSE score.
the least RMSE score. • The sales of Sparkling wine is seasonal and
• The sales of Rose wine is seasonal and also also had trend. Therefore, the company cannot
had trend. Therefore, the company cannot have have the same stock throughout the year.
the same stock throughout the year. • The company should use prediction results to
• The company should use prediction results to plan about future stock.
plan about future stock.
END