Business Report
FINAL PROJECT
Balaji M P
PGP DSBA Online -March’ 22
Date: 23.10.2022
TABLE OF CONTENTS
LIST OF FIGURES
Fig.23 Line Plot – Splitting of time series into Train & Test data 26
Fig.24 Rose Wine – Linear regression model 27
Fig.25 Linear regression on Test data 27
Fig.26 Naïve forecast on Test data 29
Fig.27 Rose Wine – Simple Average model 31
Fig.28 Simple Average model predictions on Test data 31
Fig.29 Rose Wine – Sample of Trailing Moving Averages 33
Fig.30 Moving Average on Entire data 33
Fig.31 Individual visualization of moving averages on entire data 34
Fig.32 Moving averages forecast on test data 35
Fig.33 Comparison of different models on test data (Regression, Naïve, Simple and Moving
Average) 37
Fig.34 Rose Wine – Simple Exponential Smoothing Model 38
Fig.35 Sample of SES predictions 38
Fig.36 Rose Wine - SES predictions on Test data 39
Fig.37 SES prediction metrics for different alpha values 40
Fig.38 SES forecast for different Alpha values 40
Fig.39 Rose Wine – Double Exponential Smoothing Model 42
Fig.40 Sample of DES predictions 43
Fig.41 Rose Wine - DES predictions on Test data 43
Fig.42 DES prediction metrics for different alpha, beta values 44
Fig.43 DES forecast for different Alpha, Beta values 44
Fig.44 Rose Wine – Triple Exponential Smoothing Model 46
Fig.45 Sample of TES predictions 47
Fig.46 Rose Wine - TES predictions on Test data 47
Fig.47 TES prediction metrics for different alpha, beta and gamma values 48
Fig.48 TES forecast for automated model parameters 48
Fig.49 TES forecast for different model parameters 49
Fig.50 Comparison of Test RMSE values of different exponential smoothing models 50
Fig.51 Comparison of different models on test data (SES, DES and TES) 51
Fig.52 Rose Wine – ADF summary 52
Fig.53 Rose Wine – ADF summary with differencing 53
Fig.54 Time Series Plot of Entire data – With differencing 53
Fig.55.1 Time Series Plot of Train data 54
Fig.55.2 Rose Wine – ADF summary on train data 54
Fig.56 Rose Wine – ADF summary on train data with differencing 55
Fig.57 Time Series Plot of Training data with differencing 55
Fig.58 Parameter Combinations for ARIMA model 57
Fig.98 Manual SARIMA Optimum Model – Line plot of Predictions vs Actual values 91
Fig.99 Manual SARIMA Optimum Model – Line plot of Predictions vs Actual values on Test data 92
Fig.100 Manual SARIMA Optimum Model 93
Fig.101 Manual SARIMA Model – Forecast for next 12 months with confidence intervals 94
Fig.102 Manual SARIMA Optimum Model – Time series plot forecast for next 12 months 94
Fig.103 Manual SARIMA Optimum Model – Time series plot forecast with confidence intervals 95
Fig.104 Manual SARIMA Optimum Model – Forecast for next 12 months with confidence interval 95
Fig.105 Sparkling Wine Analysis 99
Fig.106 Details of the dataset columns 102
Fig.107 Time stamp of dataset columns 102
Fig.108 Details of the updated dataset columns 103
Fig.109 Details of the dataset columns after renaming 103
Fig.110 Null values in the dataset 104
Fig.111 Graph plot of the Sparkling wine sales dataset 104
Fig.112 Descriptive Summary of Sparkling_Wine_Sales column 105
Fig.113 Yearly plot of Sparkling wine sales 106
Fig.114 Monthly plot of Sparkling wine sales 107
Fig.115 Line plot – Annual sales 108
Fig.116 Line plot – Quarterly sales 108
Fig.117 Monthly sales across different years 109
Fig.118 Line plot – Empirical cumulative distribution function 109
Fig.119 Time series plot – Monthly time series 110
Fig.120 Line plot – Average and % Change over each month 111
Fig.121 Additive decomposition of time series 112
Fig.122 Additive Decomposition - Sample of Trend, Seasonality & Residual values 112
Fig.123 Multiplicative decomposition of time series 113
Fig.124 Multiplicative Decomposition - Sample of Trend, Seasonality & Residual values 113
Fig.125 First and Last few rows of Train data 115
Fig.126 First and Last few rows of Test data 115
Fig.127 Count summary on train and test data 116
Fig.128 Line Plot – Splitting of time series into Train & Test data 116
Fig.129 Sparkling Wine – Linear regression model 117
Fig.130 Linear regression on Test data 117
Fig.131 Naïve forecast on Test data 119
Fig.132 Sparkling Wine – Simple Average model 121
Fig.133 Simple Average model predictions on Test data 121
Fig.134 Sparkling Wine – Sample of Trailing Moving Averages 123
Fig.135 Moving Average on Entire data 123
Fig.136 Individual visualization of moving averages on entire data 124
LIST OF TABLES
Table 1 Sample of first 5 rows of the dataset 10
Table 2 Sample of last 5 rows of the dataset 10
Table 3 Sample of first 5 rows of the dataset 101
Table 4 Sample of last 5 rows of the dataset 101
Rose Wine Analysis
Executive Summary
Data on wine sales in the 20th century are available from ABC Estate Wines, a wine-producing firm, and are to be examined. Using the provided information, wine sales must be forecasted.
Introduction
The purpose of this report is to explore the dataset by performing exploratory data analysis, examining measures of central tendency and other summary statistics. The data consist of sales of Rose wine from the 20th century.
Data Dictionary
Data Description
1. YearMonth: Datetime variable from 1980-01 to 1995-07
2. Rose: Continuous from 89 to 267
The dataset has 2 columns: one capturing the Year and Month of the recorded data, and the other the number of units sold in the corresponding Year-Month.
1) Read the data as an appropriate Time Series data and plot the data.
Let us check the types of variables in the data frame and check for missing values in the dataset.
The dataset has 2 variables and 187 rows in total. The "YearMonth" column can be deleted
after creating a suitable time-stamp column because it is not necessary for our modelling.
The column Rose is of float type. Additionally, we can observe from the data above that the Rose
column has some missing values which need to be imputed, since it is a time series.
Time_Stamp column has been set as index of the dataset and column Rose has been renamed as
Rose_Wine_Sales.
As can be seen from the above figure, there are 2 null values present in the dataset.
Since it is a time series, they cannot be removed and must instead be imputed.
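The steps above can be sketched in pandas. The four rows below are toy stand-in values, not the report's actual figures; only the transformations mirror what is described:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Rose dataset: a YearMonth string column and a float
# sales column containing a missing value.
df = pd.DataFrame({
    "YearMonth": ["1980-01", "1980-02", "1980-03", "1980-04"],
    "Rose": [112.0, 118.0, np.nan, 129.0],
})

# Build a proper time stamp from YearMonth, set it as the index, and drop
# the original column, which is no longer needed for modelling.
df["Time_Stamp"] = pd.to_datetime(df["YearMonth"], format="%Y-%m")
df = df.drop(columns="YearMonth").set_index("Time_Stamp")

# Rename the sales column as described above.
df = df.rename(columns={"Rose": "Rose_Wine_Sales"})

# Count the nulls that will need imputation.
print(df["Rose_Wine_Sales"].isnull().sum())  # → 1
```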
Observation:
• The data set provided contains sales information from January 1980 to July 1995.
• We can see from the plot that sales have gradually declined over the years. The
data also exhibit some seasonality.
• There are 2 missing values which must be imputed.
2) Perform appropriate Exploratory Data Analysis to understand the data and also
perform decomposition.
As can be seen from Fig.6, values are missing for July and August of 1994. Since it is a time
series, the missing values cannot be removed. We have imputed them using linear interpolation.
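Linear interpolation can be sketched as below; the two consecutive NaNs mimic the July/August 1994 gap, with toy values:

```python
import numpy as np
import pandas as pd

# Toy monthly series with two consecutive gaps, mimicking Jul-Aug 1994.
s = pd.Series(
    [60.0, np.nan, np.nan, 45.0],
    index=pd.date_range("1994-06-01", periods=4, freq="MS"),
)

# Linear interpolation fills each gap on a straight line between the
# nearest observed neighbours.
filled = s.interpolate(method="linear")
print(filled.tolist())  # → [60.0, 55.0, 50.0, 45.0]
```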
Observation:
Exploratory Analysis
Let us analyze the wine sales across different years and months using boxplots
Yearly Plot
Observation:
• We can see from the figure above that sales of rose wine have been declining over
time.
• After 1992, the median sales have been at their lowest levels, having peaked in 1980
and 1981.
• Additionally, we can see that there are outliers in the box plots.
Monthly Plot
Observation:
• The sales trajectory appears to be precisely the reverse of that seen in the yearly plot,
increasing near the end of each year.
• January has the lowest wine sales while December sees the highest. Sales grow
modestly from January to August and then climb sharply after that.
• Additionally, we can see that there are outliers in the box plots.
Annual Sales
Quarterly Sales
Observation:
• After 1981, the sales fell drastically. Sales are typically lowest in the first quarter and
highest in the fourth quarter.
• Every year, December has the highest sales, followed by November and October;
January has the lowest.
• From the cumulative distribution graph, we can observe that around 70 to 75 percent
of the units sold are fewer than 100, and 90% of the units sold are less than 150. Only
15% of sales involved less than 50 items. Therefore, it is clear that the bulk of sales
were in the range of 50 to 100 units.
Average Wine sales per month & change percentage over each month
Observation:
• We can see that there is a declining trend and seasonality from the average sales and
% change plots. Additionally, the seasonality in the percentage change appears to be
consistent throughout all the years.
Additive Decomposition
Multiplicative Decomposition
Observation:
• We can see from the graphs above that the time series has a falling trend and is
seasonal.
• The residual patterns after additive decomposition of the time series appear to
represent the seasonal element and exhibit substantial variation.
• In the multiplicative decomposition of the time series, it has been observed that the
seasonal fluctuation of residuals is under control.
• The size of the seasonal variations doesn't change on comparison, but the residuals
are tightly controlled by the multiplicative decomposition. In addition to this, the
residuals are not independent of seasonality thus we may assume that it is
multiplicative.
3) Split the data into training and test. The test data should start in 1991.
Train and test data are separated from the provided dataset. Sales data up to 1991 is included in the
training data, while data from 1991 through 1995 is used for testing.
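The date-based split can be sketched as follows; the synthetic values are placeholders, but the index spans the same 187 months as the report's data:

```python
import numpy as np
import pandas as pd

# 187 monthly stamps from Jan 1980 to Jul 1995, as in the dataset.
idx = pd.date_range("1980-01-01", "1995-07-01", freq="MS")
df = pd.DataFrame({"Rose_Wine_Sales": np.arange(len(idx), dtype=float)}, index=idx)

# Train on everything before 1991; test from 1991 onward.
train = df[df.index < "1991-01-01"]
test = df[df.index >= "1991-01-01"]

print(len(train), len(test))  # → 132 55
```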
Fig.21.1 First and Last few rows of Train data Fig.21.2 First and Last few rows of Test data
Fig.23 Line Plot – Splitting of time series into Train & Test data
4) Build all the exponential smoothing models on the training data and evaluate the
model using RMSE on the test data. Other models such as regression, naïve forecast
models and simple average models. should also be built on the training data and
check the performance on the test data using RMSE.
For the selection criteria, the below Linear Regression model is built by using default parameters.
Observation:
• We can see from the graphs above that the time series has a falling trend and is
seasonal
• The train and test data trends have been caught by the linear regression model
however, it is unable to account for seasonality
• The root mean squared error (RMSE) for the linear regression model is 15.268.
Performance Metric
Test RMSE 15.268887
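Regression on the integer time index can be sketched with a plain least-squares fit. The toy series below is a noise-free trend, so its test RMSE is effectively zero, unlike the report's 15.268 on real data with seasonality and noise:

```python
import numpy as np
import pandas as pd

# Toy downward-trending series standing in for the sales data.
idx = pd.date_range("1980-01-01", periods=60, freq="MS")
y = pd.Series(200.0 - 1.5 * np.arange(60), index=idx)
train, test = y[:48], y[48:]

# Regress sales on the integer time index 1, 2, 3, ...
t_train = np.arange(1, len(train) + 1)
t_test = np.arange(len(train) + 1, len(y) + 1)
slope, intercept = np.polyfit(t_train, train.values, 1)
pred = slope * t_test + intercept

rmse = float(np.sqrt(np.mean((test.values - pred) ** 2)))
```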
Observation:
• We can see from the graphs above that the time series has a falling trend and is
seasonal
• The seasonality and trend of the time series data cannot be captured by the naïve
forecast model.
• The root mean squared error (RMSE) for the naïve forecast model is 79.719, which is
significantly higher than the regression model's.
Performance Metric
Test RMSE 79.718576
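The naïve forecast simply repeats the last training observation across the test horizon; a toy sketch:

```python
import numpy as np
import pandas as pd

train = pd.Series([120.0, 110.0, 105.0, 100.0])
test = pd.Series([98.0, 96.0, 95.0])

# Every test-period prediction equals the final training value (100.0).
naive_pred = np.repeat(train.iloc[-1], len(test))
rmse = float(np.sqrt(np.mean((test.values - naive_pred) ** 2)))
print(round(rmse, 3))  # → 3.873
```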
Observation:
• We can see from the graphs above that the time series has a falling trend and is
seasonal
• The seasonality and trend of the time series data cannot be captured by the simple
average model.
• The root mean squared error (RMSE) for the simple average model is 53.46, which is
significantly higher than the regression model's but lower than the naïve forecast model's.
Performance Metric
Test RMSE 53.460367
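The simple average model predicts the mean of the whole training series for every test period; a toy sketch:

```python
import numpy as np
import pandas as pd

train = pd.Series([120.0, 110.0, 105.0, 100.0])
test = pd.Series([98.0, 96.0, 95.0])

# Every test-period prediction equals the training mean (108.75).
avg_pred = np.repeat(train.mean(), len(test))
rmse = float(np.sqrt(np.mean((test.values - avg_pred) ** 2)))
```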
Observation:
• We can see from the graphs above that the time series has a falling trend and is
seasonal
• The seasonality and trend of the time series data can both be captured, to an
extent, by moving average models.
• We can see that the data smooth out as the number of observation points
increases. The 2-point TMA tracks the test data more closely than the 9-point TMA.
• The root mean squared error (RMSE) for the 2-point trailing moving average model is
11.529, which is the lowest of all models built so far.
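A k-point trailing moving average replaces each value with the mean of the last k observations; the wider the window, the smoother the series, as noted above. A toy sketch:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# 2-point trailing moving average: mean of the current and previous value.
tma2 = s.rolling(window=2).mean()
print(tma2.tolist())  # → [nan, 15.0, 25.0, 35.0, 45.0]
```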
Let's compare the visualization of each model's predictions that we have constructed so far before
investigating exponential smoothing methods.
Fig.33 Comparison of different models on test data (Regression, Naïve, Simple and Moving Average)
Observation:
• We can see from the graphs above that the time series has a falling trend and is
seasonal
• We can see from the graph above that simple average and naive forecast models fail
to adequately describe the characteristics of the test data.
• The trend portion of the series has been caught using linear regression, however the
seasonality has been missed
• Both trend and seasonality may be accounted for using moving average models
Ft+1=αYt + (1−α)Ft
Parameter α is called the smoothing constant and its value lies between 0 and 1. Since the model
uses only one smoothing constant, it is called Single Exponential Smoothing.
For the selection criteria, the below Simple Exponential Smoothing is built by using optimized
parameters.
The higher the alpha value, the more weight is given to the most recent observations,
implying that recent patterns are expected to repeat. A loop with different alpha values
is run to understand which particular value works best on the test set.
The range of alpha value is from 0.1 to 0.95 and the respective RMSE for train and test data are
calculated for analyzing the performance metrics.
Observation:
• We can see from the graphs above that the time series has a falling trend and is
seasonal
• Simple exponential smoothing is typically used when there is neither a trend nor a
seasonal component in the time series. For this reason, it is unable to capture the
characteristics of this time series data.
• The root mean squared error (RMSE) for the simple exponential smoothing model
with Alpha=0.0987 is 36.796, and for Alpha=0.1 the RMSE is 36.827.
• The Simple Exponential Smoothing model with alpha=0.0987 is taken as the better
of the two as it has the lower test RMSE.
Double Exponential Smoothing uses two equations to forecast future values of the time series, one
for forecasting the short-term average value or level and the other for capturing the trend.
Here, α and β are the smoothing constants for level and trend, respectively:
Ft+1 = Lt + Tt
Ft+n = Lt + nTt
For the selection criteria, the below Double Exponential Smoothing is built by using optimized
parameters.
The higher the alpha value, the more weight is given to the most recent observations,
implying that recent patterns are expected to repeat. A loop with different alpha values
is run to understand which particular value works best on the test set.
The range of alpha value is from 0.05 to 1.0 and the respective RMSE for train and test data are
calculated for analyzing the performance metrics.
Observation:
• We can see from the graphs above that the time series has a falling trend and is
seasonal
• The double exponential smoothing model performs well when there is only trend
and no seasonality in the time series data. For this reason, it captures only the
trend characteristics of the data; seasonality is not accounted for.
• The root mean squared error (RMSE) for the double exponential smoothing model
with Alpha=1.49e-08, Beta=7.389e-09 is 15.268, and for Alpha=0.05, Beta=0.35 (auto-
tuned model) the RMSE is 16.329.
• The Double Exponential Smoothing model with Alpha=1.49e-08, Beta=7.389e-09 is
taken as the better of the two as it has the lower test RMSE.
• Additionally, it should be highlighted that compared to the simple exponential
smoothing model, the double exponential smoothing model has almost halved the
RMSE values.
where,
0 < α <1,
0 < β <1,
0 < γ <1
For the selection criteria, the below Triple Exponential Smoothing is built by using optimized
parameters.
The higher the alpha value, the more weight is given to the most recent observations,
implying that recent patterns are expected to repeat. A loop with different alpha values
is run to understand which particular value works best on the test set.
The range of alpha value is from 0.1 to 1.0 and the respective RMSE for train and test data are
calculated for analyzing the performance metrics.
Fig.47 TES prediction metrics for different alpha, beta and gamma values
Observation:
• We can see from the graphs above that the time series has a falling trend and is
seasonal
• The triple exponential smoothing model works well when there is both trend and
seasonality in the time series data. For this reason, it captures both the trend and
seasonal characteristics and nearly matches the actual test data plot.
• The root mean squared error (RMSE) for the triple exponential smoothing model
with Alpha=0.064, Beta=0.053, Gamma=0.0 is 21.154, and for Alpha=0.2, Beta=0.85,
Gamma=0.15 (auto-tuned model) the RMSE is 9.121.
• The Triple Exponential Smoothing model with Alpha=0.2, Beta=0.85, Gamma=0.15 is
taken as the better of the two as it has the lower test RMSE.
• Additionally, it should be highlighted that, compared to the double exponential
smoothing model, the triple exponential smoothing model has reduced the RMSE
value by about 40%.
Let's compare the RMSE values of the models we have constructed so far and visualize the plot of the
best exponential smoothing models thus built.
Fig.51 Comparison of different models on test data (SES, DES and TES)
Observation:
• We can see from the graphs above that the time series has a falling trend and is
seasonal
• Simple exponential smoothing is frequently employed when the time series doesn't
include a trend or a seasonal component. This is the reason why it is unable to
capture the time series data's features.
• The double exponential smoothing model works effectively when the time series
data just contains trend and no seasonality. This explains why seasonality is not taken
into consideration and just the trend features of the data are captured.
• The triple exponential model performs effectively when the time series data exhibit
both trend and seasonality. This is the reason why it is essentially identical to the test
data plot and is able to capture both the trend and seasonal aspects.
• The Triple exponential model is the best model we have built so far as it has the
lowest RMSE value.
5) Check for the stationarity of the data on which the model is being built on using
appropriate statistical tests and also mention the hypothesis for the statistical test. If
the data is found to be non-stationary, take appropriate steps to make it stationary.
Check the new data for stationarity and comment. Note: Stationarity should be
checked at alpha = 0.05.
H0: The Time Series has a unit root and is thus non-stationary.
H1: The Time Series does not have a unit root and is thus stationary.
The series have to be stationary for building ARIMA/SARIMA models and thus we would want the
p-value of this test to be less than the α value.
Inference:
We see that at the 5% significance level the Time Series is non-stationary, as the p-value of
0.467 is greater than alpha (0.05); therefore we fail to reject the null hypothesis. Let us take
one level of differencing to see whether the series becomes stationary.
Inference:
We see that at the 5% significance level the Time Series becomes stationary, as the p-value of
3.015e-11 is less than alpha (0.05); therefore we reject the null hypothesis. The provided time
series becomes stationary with differencing.
Inference:
We see that at the 5% significance level the Time Series of the training data is non-stationary,
as the p-value of 0.756 is greater than alpha (0.05); therefore we fail to reject the null hypothesis.
Let us take one level of differencing to see whether the series becomes stationary.
Inference:
We see that at the 5% significance level the Time Series of the training data becomes stationary,
as the p-value of 3.894e-08 is less than alpha (0.05); therefore we reject the null hypothesis. The
training time series becomes stationary with differencing.
Observation:
• As per the Augmented Dickey-Fuller test, we observed that the time series data by
itself is not stationary; however, it becomes stationary when differencing is applied.
• The same is also observed with the training data. Therefore, the models can be
built with order of differencing d=1.
ARIMA models may be used to represent any "non-seasonal" time series that has patterns and isn't
just random noise.
where,
For the selection criteria of p,d,q the below ARIMA model is built by using automated model
parameters with lowest Akaike Information Criteria.
Fig.58 Parameter Combinations for ARIMA model Fig.59 AIC values for different parameter combinations
We can see that among all the possible given combinations, the AIC is lowest for the combination
(2,1,3). Hence, the model is built with these parameters to determine the RMSE value of test data.
Observation:
• The optimal parameters are decided based on the lowest Akaike Information Criteria
(AIC) values. The AIC is lowest for the combination (2,1,3) as we see from the above
results.
• From the Standardized residual plot above, we can notice that the residuals seem to
fluctuate around the mean of zero and have uniform variance.
• The histogram plus estimated density plot suggests a roughly normal distribution
with mean zero, slightly skewed to the right.
• In Normal Q-Q plot, all the dots fall more or less in line with the red line. Few
deviations are present implying minor skewed distribution.
• The correlogram plot of residuals shows that the residuals are not auto correlated.
Observation:
• We can see from the graphs above that the time series has a falling trend and is
seasonal
• ARIMA models perform well on non-seasonal time series; for this reason, the model
is unable to capture the full characteristics of the test data.
• The root mean squared error (RMSE) of the test data for the ARIMA model with (p=2,
d=1, q=3) is 36.813.
• Not surprisingly, the RMSE of this ARIMA model is greater than that of the
majority of previously constructed models.
where,
D is the number of seasonal differencing required to make the time series stationary
We must examine the PACF and ACF plots, respectively, at lags that are multiples of "F" in order
to determine the "P" and "Q" values, identifying where these plots cut off (within the appropriate
confidence-interval bands).
By examining the lowest AIC values, we can also estimate "p," "q," "P," and "Q" for the SARIMA
models.
The seasonal parameter 'F' may be estimated by examining the ACF plot: a spike at
multiples of "F" indicates the presence of seasonality.
From the above ACF plot we can observe that every 12th lag is significant, indicating the presence of
seasonality. Hence, for our model building we will consider the term F=12.
For the selection criteria of p, d, q, P, D, Q & F the below SARIMA model is built by using automated
model parameters with lowest Akaike Information Criteria.
We can see that among all the possible given combinations, the AIC is lowest for the combination
(3,1,1) (3,0,2,12). Hence, the model is built with these parameters to determine the RMSE value of
test data.
Observation:
• The optimal parameters are decided based on the lowest Akaike Information Criteria
(AIC) values. The AIC is lowest for the combination (3,1,1) (3,0,2,12) as we see from
the above results.
• From the Standardized residual plot above, we can notice that the residuals seem to
fluctuate around the mean of zero and have uniform variance.
• The histogram plus estimated density plot suggests a roughly normal distribution
with mean zero, slightly skewed to the right.
• In Normal Q-Q plot, all the dots fall more or less in line with the red line. Few
deviations are present implying minor skewed distribution.
• The correlogram plot of residuals shows that the residuals are not auto correlated.
Observation:
• We can see from the graphs above that the time series has a falling trend and is
seasonal
• SARIMA models perform well on seasonal time series; for this reason, the model is
able to capture the full characteristics of the test data.
• The root mean squared error (RMSE) of the test data for the SARIMA model with (p=3,
d=1, q=1) (P=3, D=0, Q=2, F=12) is 18.881.
• Additionally, it should be highlighted that compared to the ARIMA model, the
SARIMA model has almost halved the RMSE value.
7) Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the
training data and evaluate this model on the test data using RMSE.
where,
Autocorrelation and partial autocorrelation measure the relationship between current and past
series values, indicating which past values are most useful for forecasting future ones. This
information helps identify the order of the processes in an ARIMA model.
The parameters p & q can be determined by looking at the PACF & ACF plots respectively.
Autocorrelation function (ACF) - At lag k, this is the correlation between series values that
are k intervals apart.
Partial autocorrelation function (PACF) - At lag k, this is the correlation between series values that
are k intervals apart, accounting for the values of the intervals between.
In the ACF & PACF plots, each bar represents the size and direction of the correlation at that
lag. Bars that cross the red line are statistically significant.
Observation:
• The Auto-Regressive parameter in an ARIMA model is 'p' which comes from the
significant lag after which the PACF plot cuts-off below the confidence interval.
• The Moving-Average parameter in an ARIMA model is 'q' which comes from the
significant lag after which the ACF plot cuts-off below the confidence interval.
• By looking at the above plots, we will take the value of p=2 and q=2 respectively.
The value of d=1, as with differencing the time series becomes stationary.
Observation:
• The model's parameters, p and q, were identified by examining the ACF (q=2) and
PACF (p=2) graphs. Since we differenced the series to make it stationary, the
parameter d=1.
• From the Standardized residual plot above, we can notice that the residuals seem to
fluctuate around the mean of zero and have uniform variance.
• The histogram plus estimated density plot suggests a roughly normal distribution
with mean zero, slightly skewed to the right.
• In Normal Q-Q plot, all the dots fall more or less in line with the red line. Few
deviations are present implying minor skewed distribution.
• The correlogram plot of residuals shows that the residuals are not auto correlated.
Observation:
• We can see from the graphs above that the time series has a falling trend and is
seasonal
• ARIMA models perform well on non-seasonal time series; for this reason, the model
is unable to capture the full characteristics of the test data.
• The root mean squared error (RMSE) of the test data for the ARIMA model with (p=2,
d=1, q=2) is 36.87.
• Not surprisingly, the RMSE of this ARIMA model is greater than that of the majority
of previously constructed models and nearly equal to that of the ARIMA (2,1,3) model.
where,
D is the number of seasonal differencing required to make the time series stationary
We must examine the PACF and ACF plots, respectively, at lags that are multiples of "F" in order
to determine the "P" and "Q" values, identifying where these plots cut off (within the appropriate
confidence-interval bands).
The seasonal parameter 'F' may be estimated by examining the ACF plot: a spike at
multiples of "F" indicates the presence of seasonality.
The parameters P & Q can be determined by looking at the seasonally differenced PACF & ACF plots
respectively.
Autocorrelation function (ACF) - At lag k, this is the correlation between series values that
are k intervals apart.
Partial autocorrelation function (PACF) - At lag k, this is the correlation between series values that
are k intervals apart, accounting for the values of the intervals between.
In the ACF & PACF plots, each bar represents the size and direction of the correlation at that
lag. Bars that cross the red line are statistically significant.
Observation:
• From the PACF plot, the early lags up to lag 4 are significant before the cut-off, so
the AR term ‘p = 4’ is chosen. Among the seasonal lags, the plot cuts off after the
first seasonal lag of 12, so the seasonal AR term is kept at ‘P = 0’.
• From the ACF plot, lags 1 and 2 are significant among the early lags before it cuts off,
so the MA term is kept at ‘q = 2’. A significant lag is apparent at the seasonal lag of 12
and none at lags 24, 36 or beyond, so ‘Q = 1' is kept.
• The final selected terms for the SARIMA model are (4, 1, 2) (0, 1, 1, 12), as inferred
from the ACF and PACF plots.
Observation:
• The model's parameters, p, q, P, Q were identified by examining the ACF (q=2, Q=1)
and PACF (p=4, P=0) graphs. Since we differenced the series to make it stationary, the
parameter d=1, D=1.
• From the Standardized residual plot above, we can notice that the residuals seem to
fluctuate around the mean of zero and have uniform variance.
• The histogram plus estimated density plot suggests a roughly normal distribution
with mean zero.
• In Normal Q-Q plot, all the dots fall more or less in line with the red line. Few
deviations are present implying minor skewed distribution.
• The correlogram plot of residuals shows that the residuals are not auto correlated.
Observation:
• We can see from the graphs above that the time series has a falling trend and is
seasonal
• SARIMA models perform well on seasonal time series; for this reason, the model is
able to capture the full characteristics of the test data.
• The root mean squared error (RMSE) of the test data for the SARIMA model with (p=4,
d=1, q=2) (P=0, D=1, Q=1, F=12) is 15.907.
• Additionally, it should be highlighted that compared to the all the ARIMA/SARIMA
models built so far, this SARIMA model has the lowest RMSE value.
8) Build a table (create a data frame) with all the models built along with their
corresponding parameters and the respective RMSE values on the test data.
Observation:
• From the above table, we can see that Triple Exponential Smoothing model with
parameters (Alpha=0.2, Beta=0.85, Gamma=0.15) has the lowest RMSE for test data.
• The naïve forecast model has performed the worst in terms of RMSE.
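The comparison table can be assembled as a sorted DataFrame; the RMSE values below are taken from the comparisons reported earlier:

```python
import pandas as pd

# Test RMSE values gathered from the models built above.
results = pd.DataFrame(
    {"Test RMSE": [15.268, 79.719, 53.460, 11.529, 36.796, 15.268, 9.121]},
    index=[
        "Linear Regression",
        "Naive Forecast",
        "Simple Average",
        "2-point Trailing Moving Average",
        "Simple Exponential Smoothing",
        "Double Exponential Smoothing",
        "Triple Exponential Smoothing (0.2, 0.85, 0.15)",
    ],
).sort_values("Test RMSE")

print(results.index[0])  # → Triple Exponential Smoothing (0.2, 0.85, 0.15)
```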
9) Based on the model-building exercise, build the most optimum model(s) on the
complete data and predict 12 months into the future with appropriate confidence
intervals/bands.
From Fig.86 we observed the Triple Exponential Smoothing model is the optimum model for the
given data set as it has the lowest RMSE value.
However, as we know SARIMA models tend to perform better with seasonal time series, we are also
considering SARIMA model for the forecast.
Let us visually see the time series plots of different models we have built so far on test data
Optimum Model 1:
Triple Exponential Smoothing Model (Alpha=0.2, Beta=0.85, Gamma=0.15)
Fig.91 TES Optimum Model – Line plot of Predictions vs Actual values on Test data
Fig.94 TES Optimum Model – Time series plot forecast for next 12 months
Fig.96 TES Optimum Model – Time series plot forecast with confidence intervals
Fig.97 TES Optimum Model – Time series plot forecast for next 12 months with confidence intervals
Optimum Model 2:
Manual SARIMA Model (4, 1, 2) (0, 1, 1, 12)
Fig.98 Manual SARIMA Optimum Model – Line plot of Predictions vs Actual values
Fig.99 Manual SARIMA Optimum Model – Line plot of Predictions vs Actual values on Test data
Fig.101 Manual SARIMA Model – Forecast for next 12 months with confidence intervals
Fig.102 Manual SARIMA Optimum Model – Time series plot forecast for next 12 months
Fig.103 Manual SARIMA Optimum Model – Time series plot forecast with confidence intervals
Fig.104 Manual SARIMA Optimum Model – Forecast for next 12 months with confidence interval
10) Comment on the model thus built and report your findings and suggest the
measures that the company should be taking for future sales.
We needed to construct an optimum model to forecast the rose wine sales for the next 12 months.
The model information, insights and recommendations are as follows.
Model Insights:
• The time series in consideration exhibits a declining trend and stable seasonality. When
comparing the various models, we can see that the Triple Exponential Smoothing and SARIMA
models deliver the best results. This is because these models are well suited to forecasting
time series that exhibit both trend and seasonality. Apart from these, the Double Exponential
Smoothing and Moving Average models also perform moderately well.
• We examine the root mean squared error (RMSE) of each forecast model to assess its
performance. The model with the lowest RMSE value and forecasts whose characteristics most
closely match the test data is regarded as the superior model.
• We observed that the Triple Exponential Smoothing model had the lowest RMSE and the
characteristics that most closely fit the test data. As a result, it is regarded as the best
model for forecasting and can thus be used by the company for forecast analysis.
Historical Insights:
• Rose wine sales have declined over time. Sales peaked in 1980 and 1981 and fell to their
present low in 1995 (a year for which we have data for only the first 7 months).
• The monthly sales trajectory appears to be exactly the opposite of the yearly plot, with a
progressive increase towards the end of each year. January has the lowest wine sales,
while December has the highest. From January to August, sales increase gradually, and
then they climb sharply after that.
• The average monthly sales of rose wine are 90 bottles. More than 50% of the sold units
of rose wine fall between 62 and 111. The lowest monthly figure was 28 units and the
highest was 267 units. Only 20% of recorded monthly sales exceeded 120 units.
• Around 70 to 75 percent of the units sold are fewer than 100, and 90% of the units sold
are fewer than 150. Only 15% of sales involved fewer than 50 units. Therefore, it is clear
that the bulk of sales were in the range of 50 to 100 units.
Forecast Insights:
• Based on the forecast made by the Triple Exponential Smoothing model presented above,
the following insights are offered.
• The forecast calls for an average sale of 44 units, down by 45 units from the historical
average of 89 units. Thus, we might observe an alarming decrease of about 50% in
average sales.
• The predicted minimum sales volume of 28 units equals the minimum sales volume
recorded in the past. Consequently, no percentage change is seen in the minimum quantity
sold.
• The projection estimates a maximum sales volume of 70 units, which is 197 units fewer
than the largest sales volume recorded in the past, 267 units. Consequently, a 73%
decrease in maximum sales is visible.
• In comparison to the historical standard deviation of 62 units, the forecast's standard
deviation is 10 units, 52 units lower, a drop of about 83%. Some reduction is expected,
since point forecasts are smoothed estimates and tend to show less volatility than the
historical observations.
• We can see from the prediction that the months of October, November, and December
have increased sales. December is typically when sales are at their highest, followed by
a startling decline in January. The months after January see a gradual improvement in
sales until October, when sales jump sharply.
Recommendations:
• Records show that the months of September, October, November, and December
account for 40% of the total sales forecast. Many festivities take place in these months,
and many people travel during this time. Rose wine is one of the most premium types of
wine used during festive and event celebrations.
• Wine sales often climb in the final two months of the year as people hurry to buy holiday
beverages. For forthcoming occasions like Thanksgiving, Christmas, and New Year's,
people typically stock up. The majority of individuals also buy in bulk for holiday
gatherings and gift-giving.
• Many individuals choose wine as their go-to gift when it comes to occasions like parties
and gift-giving. Sales of Rose wine rise just before the winter holidays as more collectors
purchase these wines as presents or look for vintages to serve at holiday gatherings.
98
• This blush wine works nicely with nearly anything, including spicy dishes, sushi, salads,
grilled meats, roasts, and rich sauces. It is well known for its outdoor-friendly
drinking style.
• The festival seasons may vary depending on geography; however, most of the
celebrations take place in the last four months of the year.
▪ In these months, promotional offers might be implemented to lower costs
and significantly boost revenue.
▪ To increase sales, we must take advantage of all holiday events and set prices
appropriately.
▪ Many individuals order in bulk to prepare for upcoming festivities, which may
result in a high shipping expenditure. Businesses may provide significant
discounts or free shipping beyond a certain threshold at these times.
▪ Giving customers gifts to improve their experience is one of the greatest
marketing strategies to deploy. To attract more consumers and increase sales,
the company might provide free gifts on large orders.
▪ The proper marketing campaigns must be run to target various client
demographics.
▪ Numerous e-commerce campaigns and competitions may be run to broaden the
product's audience and enhance sales.
• The period from January to June is one of the key challenges for rose wine sales.
▪ In-depth market research must be conducted to identify the elements affecting
sales.
▪ Since rose wine is a premium category of wine, the company might introduce a
market-friendly version of the existing product, helping to make up for the
drop in sales. In the long term, this may bring in additional clients.
▪ The company can rebrand its product to instill a fresh perspective towards
the product and break the declining sales trend.
• There are other key elements that might be driving sales, despite the present model's
ability to closely track the historical sales trend.
▪ The forecast might be improved by doing in-depth market research on the
factors that influence sales and incorporating that information into the model
for projection.
Sparkling Wine
Analysis
Executive Summary
ABC Estate Wines, a wine-producing firm, has provided data on wine sales from the 20th century,
which should be examined. With the provided information, wine sales in the 20th century must be
forecasted.
Introduction
The purpose of this report is to explore the dataset: perform exploratory data analysis and examine
the data using measures of central tendency and other parameters. The data consists of sales of
Sparkling wine from the 20th century.
Data Dictionary
Data Description
3. YearMonth: Datetime variable from 1980-01 to 1995-07
4. Sparkling: Continuous from 1070 to 7242
The dataset has 2 columns, which capture the Year-Month of the recorded data and the number of
units sold in the corresponding Year-Month, respectively.
1) Read the data as an appropriate Time Series data and plot the data.
Let us check the types of variables in the data frame and check for missing values in the dataset.
The dataset has 2 variables and 187 rows in total. The "YearMonth" column can be deleted
after creating a suitable time stamp column because it is not necessary for our modelling.
The column Sparkling is of float type. Additionally, we can observe from the data above that the
Sparkling column has no missing values.
Time_Stamp column has been set as index of the dataset and column Sparkling has been renamed as
Sparkling_Wine_Sales.
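The indexing steps described above can be sketched as follows. This is a minimal illustration using a small synthetic stand-in for the raw file (in the project the frame would come from reading the provided dataset); the illustrative values are not the actual sales figures.

```python
import pandas as pd

# Small synthetic stand-in for the raw file; in the project this frame
# would come from reading the provided CSV with columns YearMonth, Sparkling.
df = pd.DataFrame({
    "YearMonth": ["1980-01", "1980-02", "1980-03"],
    "Sparkling": [1070.0, 1591.0, 2304.0],   # illustrative values only
})

# Build a proper monthly timestamp from YearMonth and use it as the index.
df["Time_Stamp"] = pd.to_datetime(df["YearMonth"], format="%Y-%m")
df = df.drop(columns=["YearMonth"]).set_index("Time_Stamp")

# Rename the value column as done in the report.
df = df.rename(columns={"Sparkling": "Sparkling_Wine_Sales"})

print(df.index.dtype)                              # datetime64[ns]
print(df["Sparkling_Wine_Sales"].isnull().sum())   # 0
```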
As can be seen from the above figure, there are no null values present in the
dataset.
Observation:
• The data set provided contains sales information from January 1980 to July 1995.
• We can see from the plot that there has been a consistent pattern of sales with
seasonality. Over the years, the sales have remained consistent, and the data clearly
exhibits seasonality.
• There are no missing values which must be imputed.
2) Perform appropriate Exploratory Data Analysis to understand the data and also
perform decomposition.
Observation:
Exploratory Analysis
Let us analyze the wine sales across different years and months using boxplots
Yearly Plot
Observation:
• We can see from the figure above that sales of sparkling wine have remained
constant over the years.
• The median sales of sparkling wine reached their peak in 1988 and their current low
point in 1995.
• Additionally, we can see that there are outliers in the box plots.
Monthly Plot
Observation:
• The sales trajectory appears to be precisely the reverse of that seen in the yearly plot,
with a gradual increase towards the end of each year.
• January has the lowest wine sales, while December sees the highest. Sales grow
modestly from January to August and then climb sharply after that.
• Additionally, we can see that there are few outliers in the box plots.
Annual Sales
Quarterly Sales
Observation:
• Over the years, sales have stayed steady. The sales climbed gradually starting in 1982
until 1988, then decreased until 1990, then slightly increased again until 1994.
• Every year, December has the highest sales, followed by November and October. The
first 2 months January and February have the lowest median sales.
• From the cumulative distribution graph, we can observe that around 60 to 70 percent
of the units sold are fewer than 2500, and 80% of the units sold are fewer than 4000.
Only 20% of sales involved more than 3000 units. Therefore, it is clear that the bulk
of sales were in the range of 1000 to 3000 units.
Average Wine sales per month & change percentage over each month
Observation:
• We can see from the average sales and % change plots that there is no trend but only
seasonality. Additionally, the seasonality in the percentage change appears to be
consistent throughout all the years.
Additive Decomposition
Multiplicative Decomposition
Observation:
• The residual patterns after additive decomposition of the time series appear to
represent the seasonal element and exhibit substantial variation.
• In the multiplicative decomposition of the time series, it has been observed that the
seasonal fluctuation of residuals is under control.
• The size of the seasonal variations does not change on comparison, but the residuals
are tightly controlled by the multiplicative decomposition. In addition, since the
residuals of the additive decomposition are not independent of seasonality, we may
assume that the series is multiplicative.
3) Split the data into training and test. The test data should start in 1991.
Train and test data are separated from the provided dataset. Sales data up to 1991 is included in the
training data, while data from 1991 through 1995 is used for testing.
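The split described above can be sketched with a simple date filter. The series here is an illustrative stand-in with the report's date range (January 1980 to July 1995, 187 observations), not the actual sales values.

```python
import numpy as np
import pandas as pd

# Illustrative monthly series over the report's date range (187 points).
idx = pd.date_range("1980-01-01", "1995-07-01", freq="MS")
sales = pd.Series(np.random.default_rng(0).integers(1000, 7000, len(idx)),
                  index=idx, name="Sparkling_Wine_Sales")

# Training data: everything before 1991; test data: 1991 onwards.
train = sales[sales.index < "1991-01-01"]
test = sales[sales.index >= "1991-01-01"]

print(len(train), len(test))  # 132 55
```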
Fig.125 First and Last few rows of Train data Fig.126 First and Last few rows of Test data
Fig.128 Line Plot – Splitting of time series into Train & Test data
4) Build all the exponential smoothing models on the training data and evaluate the
model using RMSE on the test data. Other models such as regression, naïve forecast
models and simple average models should also be built on the training data, and
their performance checked on the test data using RMSE.
For the selection criteria, the below Linear Regression model is built by using default parameters.
Observation:
• We can see from the graphs above that the time series has a marginal upward trend
and seasonality.
• The train and test data trends have been captured by the linear regression model;
however, it is unable to account for seasonality.
• The root mean squared error (RMSE) for the linear regression model is 1389.135.
Performance Metric
Test RMSE 1389.135175
Observation:
• We can see from the graphs above that the time series has a marginal upward trend
and seasonality.
• The seasonality and trend of the time series data cannot be captured by the naïve
forecast model.
• The root mean squared error (RMSE) for the naïve forecast model is 3864.279, which
is significantly higher than that of the regression model.
Performance Metric
Test RMSE 3864.279352
Observation:
• We can see from the graphs above that the time series has a marginal upward trend
and seasonality.
• The seasonality and trend of the time series data cannot be captured by the simple
average model.
• The root mean squared error (RMSE) for the simple average model is 1275.081,
which is significantly lower than that of the naïve forecast model and slightly lower
than that of the linear regression model.
Performance Metric
Test RMSE 1275.081804
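The two baselines above are straightforward: the naïve forecast repeats the last training observation, and the simple average repeats the training mean. A minimal sketch on illustrative numbers:

```python
import pandas as pd

# Illustrative training series; in the report this is the sales training data.
train = pd.Series([1500.0, 1800.0, 2100.0, 2400.0])
test_len = 3

# Naive forecast: repeat the last observed training value.
naive = pd.Series([train.iloc[-1]] * test_len)

# Simple average: repeat the mean of the training data.
simple_avg = pd.Series([train.mean()] * test_len)

print(naive.tolist())       # [2400.0, 2400.0, 2400.0]
print(simple_avg.tolist())  # [1950.0, 1950.0, 1950.0]
```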
Observation:
• We can see from the graphs above that the time series has a marginal upward trend
and seasonality.
• Moving average models can capture both the seasonality and the trend of the time
series data.
• We can see that the predictions smooth out as the number of observation points
increases. The 2-point TMA has characteristics more similar to the test data than
the 9-point TMA.
• The root mean squared error (RMSE) for the 2-point trailing average model is 813.4,
which is the lowest among all models built so far.
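Trailing moving averages of different widths can be computed with a right-aligned rolling mean; a minimal sketch on illustrative numbers (in the report this is applied to the sales series):

```python
import pandas as pd

# Illustrative series; in the report this is the wine sales data.
s = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 18.0])

# Trailing (right-aligned) moving averages: each point averages the
# previous `window` observations, so wider windows smooth more.
tma_2 = s.rolling(window=2).mean()
tma_4 = s.rolling(window=4).mean()

print(tma_2.tolist())  # [nan, 11.0, 11.5, 13.0, 14.5, 16.0]
```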
Let's compare the visualization of each model's predictions that we have constructed so far before
investigating exponential smoothing methods.
Fig.138 Comparison of different models on test data (Regression, Naïve, Simple and Moving Average)
Observation:
• We can see from the graphs above that the time series has a marginal upward trend
and seasonality
• We can see from the graph above that the simple average and naïve forecast models
fail to adequately describe the characteristics of the test data.
• The trend portion of the series has been captured by linear regression; however, the
seasonality has been missed.
• Both trend and seasonality may be accounted for by moving average models.
F_{t+1} = αY_t + (1−α)F_t
Parameter α is called the smoothing constant and its value lies between 0 and 1. Since the model
uses only one smoothing constant, it is called Single Exponential Smoothing.
For the selection criteria, the below Simple Exponential Smoothing is built by using optimized
parameters.
The higher the alpha value, the more weight is given to the most recent observations, implying
that recent patterns are expected to repeat. A loop with different alpha values is run to
understand which particular value works best for alpha on the test set.
The range of alpha value is from 0.1 to 0.95 and the respective RMSE for train and test data are
calculated for analyzing the performance metrics.
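The recursion F_{t+1} = αY_t + (1−α)F_t can be written out directly; a minimal illustration of how the alpha loop works (the report itself uses the statsmodels implementation, and the numbers here are illustrative):

```python
def simple_exp_smoothing(y, alpha):
    """One-step-ahead forecasts via F(t+1) = alpha*Y(t) + (1-alpha)*F(t).

    The first forecast is initialized to the first observation.
    """
    forecasts = [y[0]]  # F(1) initialized to Y(1)
    for t in range(1, len(y)):
        forecasts.append(alpha * y[t - 1] + (1 - alpha) * forecasts[-1])
    return forecasts

y = [100.0, 120.0, 110.0, 130.0]
print(simple_exp_smoothing(y, alpha=0.5))  # [100.0, 100.0, 110.0, 110.0]
```

Looping this function over a grid of alpha values and scoring each against the test data reproduces the selection procedure described above.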
Observation:
• We can see from the graphs above that the time series has a marginal upward trend
and seasonality.
• Simple exponential smoothing is typically used when there is neither a trend nor a
seasonal component in the time series. It is for this reason that it is unable to capture
the characteristics of the time series data.
• The root mean squared error (RMSE) for the simple exponential smoothing model
with Alpha=0.0496 is 1316.135, and for Alpha=0.1 the RMSE is 1375.393.
• The Simple Exponential Smoothing model with Alpha=0.0496 is taken as the best
model of the two, as it has the lowest test RMSE.
Double Exponential Smoothing uses two equations to forecast future values of the time series, one
for forecasting the short-term average value or level and the other for capturing the trend.
Here, α and β are the smoothing constants for level and trend, respectively.
F_{t+1} = L_t + T_t
F_{t+n} = L_t + nT_t
For the selection criteria, the below Double Exponential Smoothing is built by using optimized
parameters.
The higher the alpha value, the more weight is given to the most recent observations, implying
that recent patterns are expected to repeat. A loop with different alpha and beta values is run
to understand which particular values work best on the test set.
The range of alpha value is from 0.05 to 1.0 and the respective RMSE for train and test data are
calculated for analyzing the performance metrics.
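The level/trend scheme can be sketched directly in Python. The forecast equations F_{t+n} = L_t + nT_t are from the text above; the level and trend update equations used here are the standard Holt recursions (not stated explicitly in the report), and the numbers are illustrative:

```python
def holt_forecast(y, alpha, beta, n_ahead):
    """Double exponential smoothing with the standard Holt updates:
    L(t) = alpha*Y(t) + (1-alpha)*(L(t-1) + T(t-1))
    T(t) = beta*(L(t) - L(t-1)) + (1-beta)*T(t-1)
    then out-of-sample forecasts F(t+n) = L(t) + n*T(t)."""
    level, trend = y[0], y[1] - y[0]  # a common initialization
    for t in range(1, len(y)):
        prev_level = level
        level = alpha * y[t] + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + n * trend for n in range(1, n_ahead + 1)]

# A perfectly linear series is extrapolated linearly.
print(holt_forecast([10.0, 20.0, 30.0, 40.0], alpha=0.5, beta=0.5, n_ahead=3))
# [50.0, 60.0, 70.0]
```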
Observation:
• We can see from the graphs above that the time series has a marginal upward trend
and seasonality.
• The double exponential smoothing model performs well when there is only trend and
no seasonality in the time series data. It is for this reason that it captures only the
trend characteristics of the data, while seasonality is not accounted for.
• The root mean squared error (RMSE) for the double exponential smoothing model
with Alpha=0.6885, Beta=9.99e-05 is 2007.238, and for Alpha=0.05, Beta=0.05 (auto-
tuned model) the RMSE is 1418.407.
• The Double Exponential Smoothing model with Alpha=0.05, Beta=0.05 is taken as the
best model of the two, as it has the lowest test RMSE.
• Additionally, it should be highlighted that, compared to the simple exponential
smoothing model, the double exponential smoothing model has a slightly higher RMSE.
where 0 < α < 1, 0 < β < 1, 0 < γ < 1
For the selection criteria, the below Triple Exponential Smoothing is built by using optimized
parameters.
The higher the alpha value, the more weight is given to the most recent observations, implying
that recent patterns are expected to repeat. A loop with different alpha, beta and gamma values
is run to understand which particular values work best on the test set.
The range of alpha value is from 0.1 to 1.0 and the respective RMSE for train and test data are
calculated for analyzing the performance metrics.
Fig.152 TES prediction metrics for different alpha, beta and gamma values
Observation:
• We can see from the graphs above that the time series has a marginal upward trend
and seasonality.
• The triple exponential smoothing model works well when there is both trend and
seasonality in the time series data. It is for this reason that it is able to capture both
the trend and seasonal characteristics and nearly match the actual test data plot.
• The root mean squared error (RMSE) for the triple exponential smoothing model
with Alpha=0.111, Beta=0.0617, Gamma=0.395 is 469.659, and for Alpha=0.35,
Beta=0.10, Gamma=0.20 (auto-tuned model) the RMSE is 319.498.
• The Triple Exponential Smoothing model with Alpha=0.35, Beta=0.10, Gamma=0.20 is
taken as the best model of the two, as it has the lowest test RMSE.
• Additionally, it should be highlighted that, compared to the double exponential
smoothing model, the triple exponential smoothing model has reduced the RMSE
value by almost 75%.
Let's compare the RMSE values of the models we have constructed so far and visualize the plot of the
best exponential smoothing models thus built.
Fig.156 Comparison of different models on test data (SES, DES and TES)
Observation:
• We can see from the graphs above that the time series has a marginal upward trend
and seasonality
• Simple exponential smoothing is frequently employed when the time series doesn't
include a trend or a seasonal component. This is the reason why it is unable to
capture the time series data's features.
• The double exponential smoothing model works effectively when the time series
data just contains trend and no seasonality. This explains why seasonality is not taken
into consideration and just the trend features of the data are captured.
• The triple exponential model performs effectively when the time series data exhibit
both trend and seasonality. This is the reason why it is essentially identical to the test
data plot and is able to capture both the trend and seasonal aspects.
• The Triple exponential model is the best model we have built so far as it has the
lowest RMSE value.
5) Check for the stationarity of the data on which the model is being built on using
appropriate statistical tests and also mention the hypothesis for the statistical test. If
the data is found to be non-stationary, take appropriate steps to make it stationary.
Check the new data for stationarity and comment. Note: Stationarity should be
checked at alpha = 0.05.
H0: The Time Series has a unit root and is thus non-stationary.
H1: The Time Series does not have a unit root and is thus stationary.
The series has to be stationary for building ARIMA/SARIMA models, and thus we would want the
p-value of this test to be less than the α value.
Inference:
We see that at the 5% significance level the Time Series is non-stationary, as the p-value is
0.705, which is greater than the alpha value (0.05); therefore we fail to reject the null
hypothesis. Let us take one level of differencing to see whether the series becomes stationary.
Inference:
We see that at the 5% significance level the Time Series becomes stationary, as the p-value is
nearly 0, which is less than the alpha value (0.05); therefore we reject the null hypothesis.
The provided time series becomes stationary with differencing.
Inference:
We see that at the 5% significance level the Time Series of the training data is non-stationary,
as the p-value is 0.567, which is greater than the alpha value (0.05); therefore we fail to
reject the null hypothesis. Let us take one level of differencing to see whether the series
becomes stationary.
Inference:
We see that at the 5% significance level the differenced Time Series of the training data is
stationary, as the p-value is 8.479e-11, which is less than the alpha value (0.05); therefore
we reject the null hypothesis. The training time series becomes stationary with differencing.
Observation:
• As per the Augmented Dickey-Fuller test, we observed that the time series data by
itself is not stationary; however, it becomes stationary when differencing is applied.
• The same is also observed with the training data. Therefore, the models can be built
with an order of differencing d=1.
ARIMA models may be used to represent any "non-seasonal" time series that has patterns and isn't
just random noise.
where,
For the selection criteria of p,d,q the below ARIMA model is built by using automated model
parameters with lowest Akaike Information Criteria.
Fig.164 Parameter Combinations for ARIMA model Fig.165 AIC values for different parameter combinations
We can see that among all the possible given combinations, the AIC is lowest for the combination
(4,1,4). Hence, the model is built with these parameters to determine the RMSE value of test data.
Observation:
• The optimal parameters are decided based on the lowest Akaike Information Criteria
(AIC) values. The AIC is lowest for the combination (4,1,4) as we see from the above
results.
• From the Standardized residual plot above, we can notice that the residuals seem to
fluctuate around a mean of zero and have uniform variance.
• The histogram plus estimated density plot suggests a roughly normal distribution
with mean zero, slightly skewed to the right.
• In the Normal Q-Q plot, all the dots fall more or less in line with the red line. A few
deviations are present, implying a slightly skewed distribution.
• The correlogram plot of the residuals shows that the residuals are not autocorrelated.
Observation:
• We can see from the graphs above that the time series has a marginal upward trend
and seasonality.
• ARIMA models perform well on non-seasonal time series. It is for this reason that this
model is unable to capture the full characteristics of the test data.
• The root mean squared error (RMSE) of the test data for the ARIMA model with (p=4,
d=1, q=4) is 1212.918.
• Not surprisingly, the RMSE of the aforementioned ARIMA model is lower than that of
the majority of previously constructed models, but significantly higher than that of
the triple exponential smoothing model.
where,
D is the number of seasonal differencing required to make the time series stationary
We must examine the PACF and ACF plots, respectively, at lags that are multiples of 'F' in order
to determine the 'P' and 'Q' values, and determine where these cut-offs occur (for appropriate
confidence interval bands).
By examining the lowest AIC values, we can also estimate 'p', 'q', 'P' and 'Q' for the SARIMA
models.
By examining the ACF plot, one may determine the seasonal parameter 'F'. The existence of
seasonality is shown by spikes in the ACF plot at multiples of 'F'.
From the above ACF plot we can observe that every 12th lag is significant, indicating the presence
of seasonality. Hence, for our model building we will consider the term F=12.
For the selection criteria of p, d, q, P, D, Q & F the below SARIMA model is built by using automated
model parameters with lowest Akaike Information Criteria.
We can see that among all the possible given combinations, the AIC is lowest for the combination
(3,1,2) (3,0,1,12). Hence, the model is built with these parameters to determine the RMSE value
on the test data.
Observation:
• The optimal parameters are decided based on the lowest Akaike Information Criteria
(AIC) values. The AIC is lowest for the combination (3,1,2) (3,0,1,12) as we see from
the above results.
• From the Standardized residual plot above, we can notice that the residuals seem to
fluctuate around a mean of zero and have uniform variance.
• The histogram plus estimated density plot suggests a roughly normal distribution
with mean zero, slightly skewed to the right.
• In the Normal Q-Q plot, all the dots fall more or less in line with the red line. A few
deviations are present, implying a slightly skewed distribution.
• The correlogram plot of the residuals shows that the residuals are not autocorrelated.
Observation:
• We can see from the graphs above that the time series has a marginal upward trend
and seasonality.
• The SARIMA model performs well on seasonal time series. It is for this reason that it
is able to capture the full characteristics of the test data.
• The root mean squared error (RMSE) of the test data for the SARIMA model with (p=3,
d=1, q=2) (P=3, D=0, Q=1, F=12) is 579.925.
• Additionally, it should be highlighted that, compared to the ARIMA model, the
SARIMA model has more than halved the RMSE value.
7) Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the
training data and evaluate this model on the test data using RMSE.
where,
Autocorrelation and partial autocorrelation are measures of the relationship between present and
past series values, indicating which previous values are most useful in forecasting future values.
This information may be used to identify the order of the processes in an ARIMA model.
The parameters p & q can be determined by looking at the PACF & ACF plots respectively.
Autocorrelation function (ACF) - At lag k, this is the correlation between series values that
are k intervals apart.
Partial autocorrelation function (PACF) - At lag k, this is the correlation between series values that
are k intervals apart, accounting for the values of the intervals between.
In the ACF & PACF plots, each bar represents the size and direction of the correlation. Bars that
cross the red line are statistically significant.
Observation:
• The Auto-Regressive parameter in an ARIMA model is 'p' which comes from the
significant lag after which the PACF plot cuts-off below the confidence interval.
• The Moving-Average parameter in an ARIMA model is 'q' which comes from the
significant lag after which the ACF plot cuts-off below the confidence interval.
• We can observe from the above plots that there are a few significant lags after lag 1;
hence we also build another model taking the values p=2 and q=1 respectively.
Observation:
• The model's parameters, p and q, were identified by examining the ACF (q=1) and
PACF (p=2) graphs. Since we differenced the series to make it stationary, the
parameter d=1.
• From the Standardized residual plot above, we can notice that the residuals seem to
fluctuate around a mean of zero and have uniform variance.
• The histogram plus estimated density plot suggests a roughly normal distribution
with mean zero, slightly skewed to the right.
• In the Normal Q-Q plot, all the dots fall more or less in line with the red line. A few
deviations are present, implying a slightly skewed distribution.
• The correlogram plot of the residuals shows that the residuals are not autocorrelated.
Observation:
• We can see from the graphs above that the time series has a marginal upward trend
and seasonality.
• ARIMA models perform well on non-seasonal time series. It is for this reason that this
model is unable to capture the full characteristics of the test data.
• The root mean squared error (RMSE) of the test data for the ARIMA model with (p=2,
d=1, q=1) is 1300.721.
• Not surprisingly, the RMSE of the aforementioned ARIMA model is greater than that
of the majority of previously constructed models, and also higher than that of the
automated ARIMA (4,1,4) model.
where,
D is the number of seasonal differencing required to make the time series stationary
We must examine the PACF and ACF plots, respectively, at lags that are multiples of 'F' in order
to determine the 'P' and 'Q' values, and determine where these cut-offs occur (for appropriate
confidence interval bands).
By examining the ACF plot, one may determine the seasonal parameter 'F'. The existence of
seasonality is shown by spikes in the ACF plot at multiples of 'F'.
The parameters P & Q can be determined by looking at the seasonally differenced PACF & ACF plots
respectively.
Autocorrelation function (ACF) - At lag k, this is the correlation between series values that
are k intervals apart.
Partial autocorrelation function (PACF) - At lag k, this is the correlation between series values that
are k intervals apart, accounting for the values of the intervals between.
In the ACF & PACF plots, each bar represents the size and direction of the correlation. Bars that
cross the red line are statistically significant.
ACF Plot – Seasonally differenced (F=12) Training Data
Observation:
• From the PACF plot, it can be seen that the early lags up to lag 4 are significant before
the cut-off, so the AR term 'p = 4' is chosen. Among the multiples of the seasonal lag,
the plot cuts off after the first seasonal lag of 12, so we keep the seasonal AR term 'P = 0'.
• From the ACF plot, it can be seen that early lags 1 and 2 are significant before it cuts off,
so we keep the MA term 'q = 2'. At the seasonal lag of 12 a significant spike is apparent,
and no seasonal spikes are apparent at lags 24, 36 or afterwards, so we keep 'Q = 1'.
• The final selected terms for SARIMA model are (4, 1, 2) (0, 1, 1, 12), as inferred from
the ACF and PACF plots.
Observation:
• The model's parameters p, q, P, Q were identified by examining the ACF (q=2, Q=1)
and PACF (p=4, P=0) plots. Since we differenced the series to make it stationary, the
parameters are d=1 and D=1.
• From the Standardized residual plot above, we can notice that the residuals seem to
fluctuate around a mean of zero and have uniform variance.
• The histogram plus estimated density plot suggests a roughly normal distribution
with mean zero.
• In the Normal Q-Q plot, all the dots fall more or less in line with the red line. A few
deviations are present, implying a slightly skewed distribution.
• The correlogram plot of the residuals shows that the residuals are not autocorrelated.
Observation:
• We can see from the graphs above that the time series has a marginal upward trend
and seasonality.
• The SARIMA model performs well on seasonal time series. It is for this reason that it
is able to capture the full characteristics of the test data.
• The root mean squared error (RMSE) of the test data for the SARIMA model with (p=4,
d=1, q=2) (P=0, D=1, Q=1, F=12) is 468.677.
• Additionally, it should be highlighted that, compared to all the ARIMA/SARIMA
models built so far, this SARIMA model has the lowest RMSE value.
8) Build a table (create a data frame) with all the models built along with their
corresponding parameters and the respective RMSE values on the test data.
Observation:
• From the above table, we can see that the Triple Exponential Smoothing model with
parameters (Alpha=0.35, Beta=0.10, Gamma=0.20) has the lowest RMSE on the test
data.
• The Manual SARIMA (4,1,2) (0,1,1,12) model has the second lowest RMSE value on the
test data, after the Triple Exponential Smoothing model.
• The naïve forecast model has performed the worst in terms of RMSE.
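A comparison table like the one above can be assembled as a DataFrame and sorted by test RMSE. The values below echo the figures quoted in this report's observations (the remaining models would be added the same way):

```python
import pandas as pd

# Test RMSE values quoted in the report's observations, one row per model.
results = pd.DataFrame(
    {"Test RMSE": [1389.14, 3864.28, 1275.08, 813.40, 319.50, 468.68]},
    index=["Linear Regression", "Naive Forecast", "Simple Average",
           "2-point Trailing MA", "Triple Exp Smoothing",
           "Manual SARIMA (4,1,2)(0,1,1,12)"],
)

# Sort so the best model (lowest RMSE) appears first.
results = results.sort_values("Test RMSE")
print(results.index[0])   # Triple Exp Smoothing
```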
9) Based on the model-building exercise, build the most optimum model(s) on the
complete data and predict 12 months into the future with appropriate confidence
intervals/bands.
From Fig.86 we observed that the Triple Exponential Smoothing model is the optimum model for the
given data set, as it has the lowest RMSE value.
However, since SARIMA models tend to perform better on seasonal time series, the SARIMA model is
also considered for the forecast.
Let us visually examine the time series plots of the selected models on the test data.
Optimum Model 1:
Triple Exponential Smoothing Model (Alpha=0.35, Beta=0.10, Gamma=0.20)
Fig.197 TES Optimum Model – Line plot of Predictions vs Actual values on Test data
Fig.200 TES Optimum Model – Time series plot forecast for next 12 months
Fig.202 TES Optimum Model – Time series plot forecast with confidence intervals
Fig.203 TES Optimum Model – Forecast for next 12 months with confidence intervals
Optimum Model 2:
Manual SARIMA Model (4, 1, 2) (0, 1, 1, 12)
Fig.204 Manual SARIMA Optimum Model – Line plot of Predictions vs Actual values
Fig.205 Manual SARIMA Optimum Model – Line plot of Predictions vs Actual values on Test data
Fig.207 Manual SARIMA Model – Forecast for next 12 months with confidence intervals
Fig.208 Manual SARIMA Optimum Model – Time series plot forecast for next 12 months
Fig.209 Manual SARIMA Optimum Model – Time series plot forecast with confidence intervals
Fig.210 Manual SARIMA Optimum Model – Forecast for next 12 months with confidence interval
10) Comment on the model thus built and report your findings and suggest the
measures that the company should be taking for future sales.
We needed to construct an optimum model to forecast the sparkling wine sales for the next 12
months. The model information, insights and recommendations are as follows.
Model Insights:
• The time series in question exhibits a slight upward trend and stable seasonality. When
comparing the various models, the Triple Exponential Smoothing and SARIMA models
deliver the best results, because these models excel at forecasting time series that
exhibit both trend and seasonality.
• To assess a forecast model's performance, we examine its root mean squared error
(RMSE). The model with the lowest RMSE whose forecasts match the characteristics of
the test data is regarded as the superior model.
• The SARIMA and Triple Exponential Smoothing models had the lowest RMSE and most
closely fit the test data. As a result, they are regarded as the best models for
forecasting.
• The firm may use these forecasting models, since they capture the characteristics of
the time series accurately and allow proactive action based on the forecast.
Historical Insights:
• Sparkling wine sales have remained broadly stable over time. They peaked in 1988 and
fell to their lowest level in 1995 (for which we have data for only the first seven
months).
• The monthly sales profile is almost the inverse of the yearly plot, with a progressive
increase towards the end of each year. January has the lowest wine sales and
December the highest. Sales rise gradually from January to August and then climb
quickly.
• Average monthly sales of sparkling wine are 2402 units. Around 50% of monthly sales
fall between 1605 and 2549 units. The lowest recorded month saw 1070 units sold and
the highest 7242. Only 25% of recorded monthly sales exceeded 2549 units.
• Around 60 to 70 percent of the units sold are below 2500, and 80% are below 4000.
Only 20% of sales exceeded 3000 units. It is therefore clear that the bulk of sales fell in
the range of 1000 to 3000 units.
Forecast Insights:
• The following insights are based on the forecast made by the Triple Exponential
Smoothing model presented above.
• The forecast calls for average sales of 2639 units, up 237 units from the historical
average of 2402 units, an increase of roughly 10%.
• The forecast minimum is 1540 units, 470 units above the historical minimum of 1070
units, an increase of about 44%.
• The forecast maximum is 6487 units, 755 units below the historical maximum of 7242
units, a decrease of about 10%.
• The forecast's standard deviation is 1439 units, 144 units (about 11%) higher than the
historical standard deviation of 1295. This is to be expected, since forecast values
carry more uncertainty than historical data.
• The forecast shows increased sales in October, November and December, with
December typically the highest. Sales decline sharply in January, then improve
gradually through the following months until October, when they jump sharply.
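The percentage changes quoted above can be verified with a one-line helper; the unit figures come from the report's own summary statistics (note that the minimum-sales increase works out to 43.9%, closer to 44% than the whole-percent figures suggest):

```python
def pct_change(old: float, new: float) -> float:
    """Percent change from the historical figure to the forecast figure,
    rounded to one decimal place."""
    return round((new - old) / old * 100, 1)

# mean: 2402 -> 2639, min: 1070 -> 1540, max: 7242 -> 6487, std: 1295 -> 1439
```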
Recommendations:
• Records show that September, October, November and December account for about
50% of the total sales forecast. Many festivities fall in these months, and many people
travel during this time; sparkling wine is one of the most popular wines for festive
and event celebrations.
• Wine sales often climb in the final two months of the year as people hurry to buy holiday
beverages. For forthcoming occasions like Thanksgiving, Christmas, and New Year's,
people typically stock up. The majority of individuals also buy in bulk for holiday
gatherings and gift-giving.
• Many individuals choose wine as their go-to gift when it comes to occasions like parties
and gift-giving. Sales of sparkling wine rise just before the winter holidays as more
collectors purchase these wines as presents or look for vintages to serve at holiday
gatherings.
• Festival seasons may vary by geography, but most of the celebrations take place in the
last four months of the year.
▪ In these months, promotional offers might be implemented to lower costs
and significantly boost revenue.
▪ To increase sales, we must take advantage of all holiday events and set prices
appropriately.
▪ Many individuals order in bulk to prepare for upcoming festivities, which may
result in a high shipping expenditure. Businesses may provide significant
discounts or free shipping beyond a certain threshold at these times.
▪ Giving customers free gifts to improve their experience is one of the best
marketing strategies to deploy. To attract more consumers and increase
sales, the company might offer free gifts on large orders.
▪ The right marketing campaigns must be run to target different customer
demographics.
▪ E-commerce campaigns and competitions may be run to broaden the
product's audience and boost sales.
• The period from January to June is one of the key challenges for sparkling wine sales.
▪ To identify the elements affecting sales, in-depth market research must be
conducted.
▪ Since sparkling wines are typically bought for celebrations, the company
could introduce a market-friendly version of the existing product to make
up for the drop in sales. In the long term, this may also bring in additional
customers.
• Although the present model closely tracks the historical sales trend, other key factors
might be driving sales.
▪ The forecast might be improved by doing in-depth market research on the
factors that influence sales and incorporating that information into the model
for projection.