TSF EXTENDED
PROJECT REPORT
CONTENTS
Problem 1
1.1 Read the dataset. Do the descriptive statistics and the null value condition check. Write an inference on it.
1.2 Do exploratory data analysis (Shoe Sales & Soft Drinks).
1.3 Shoe Sales forecast (split of data into train & test datasets).
1.4 Build various exponential smoothing models on the training data and evaluate the models using RMSE on the test data (Shoe Sales).
1.5 Check for stationarity - make the data stationary (if needed).
1.6 Model Building - Stationary Data: generate ACF & PACF plots and find the AR, MA values; build different ARIMA models (Auto ARIMA, Manual ARIMA); build different SARIMA models (Auto SARIMA, Manual SARIMA); check the performance of the models built (Shoe Sales).
1.7 Choose the best model with proper rationale; rebuild the best model using the entire data; make a forecast for the next 12 months.
Problem 1:
Context:
In today's dynamic business environment, precise sales and
production forecasts are essential for strategic planning and
operational efficiency. Companies like IJK Shoe Company and RST
Firm have accumulated extensive monthly data on shoe sales and soft
drink production, respectively, spanning from January 1980 to July
1995. Leveraging advanced time series forecasting techniques, these
companies aim to utilize their historical data to predict future trends
accurately. This initiative enables them to make informed decisions,
optimize resource allocation, and adapt proactively to market
dynamics.
Objective:
The primary objective is to predict future sales for IJK Shoe Company and
production volumes for RST Firm over the next twelve months. By analyzing
the historical monthly data spanning from January 1980 to July 1995, our
goal is to develop accurate forecasting models that capture the underlying
patterns and seasonality inherent in the sales and production processes.
Through this task, we aim to empower IJK Shoe Company and RST Firm
with actionable insights that facilitate proactive planning, optimize
resource allocation, and enhance operational efficiency. By anticipating
future trends in sales and production, both companies can align their
strategies, streamline production-related activities, and capitalize on
emerging opportunities in their respective markets.
INTRODUCTION
This report consists of time series analysis and forecasting of 2 datasets:
• DATASET 1 - Shoe sales data (Shoesales.csv)
• DATASET 2 - Soft drink production data (SoftDrink.csv)
Problem 1:
You are an analyst in the IJK shoe company, and you are expected to forecast the sales of pairs of shoes for the upcoming 12 months from where the data ends. The data for pairs of shoe sales has been given to you from January 1980 to July 1995.
Data Source-Shoesales.csv
Problem 2:
You are an analyst in the RST soft drink company, and you are expected to forecast the production of soft drinks for the upcoming 12 months from where the data ends. The data for the production of soft drinks has been given to you from January 1980 to July 1995.
Data Source- SoftDrink.csv
Both datasets are read and stored as pandas DataFrames for analysis. The first 5 rows of both datasets are given below.
Year Month Shoe Sales
0 1980-01 1954
1 1980-02 2302
2 1980-03 3054
3 1980-04 2414
4 1980-05 2226
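A minimal sketch of how the two files could be read into pandas with a monthly date index is shown below; the exact column names ("YearMonth", "Shoe_Sales", "SoftDrinkProduction") are assumptions and may differ in the actual CSV files.

import pandas as pd

# Read both CSV files (column names assumed, not confirmed by the report).
shoe = pd.read_csv("Shoesales.csv")
drink = pd.read_csv("SoftDrink.csv")

# Convert the "1980-01"-style column to a monthly DatetimeIndex.
for df in (shoe, drink):
    df["YearMonth"] = pd.to_datetime(df["YearMonth"], format="%Y-%m")
    df.set_index("YearMonth", inplace=True)

print(shoe.head())          # first 5 rows
print(shoe.describe())      # descriptive statistics
print(shoe.isnull().sum())  # null value condition check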
Soft Drink Data Plot:
1.2 Perform appropriate Exploratory Data Analysis to understand the
data and also perform decomposition.
Since the data have no outliers or duplicated values, no treatment for outliers or duplicates is needed.
Observations:
A spike appears in the last quarter of the year, i.e., during the months of November and December.
Observations:
The year 1987 saw a boom, and shoe sales were at their maximum.
Shoe YoY Sales - All Months
Observation:
The period between 1986 and 1988 saw the spike, especially during the December season followed by the November season.
The difference in sales may be due to the holiday season at the end of the year.
Observations:
Maximum production was observed between 1994 and 1995.
YoY production for all months:
Observations:
Maximum production is observed in the month of December.
Multiplicative Decomposition of Shoe Sales:
Additive Decomposition of Soft Drink:
Since we are looking at the change in absolute quantity for this particular dataset, we proceed with the additive model.
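A hedged sketch of the decomposition step with statsmodels is shown below; the series names and the 12-month period are assumptions based on the monthly data described above.

from statsmodels.tsa.seasonal import seasonal_decompose

# Multiplicative decomposition for shoe sales, additive for soft drink production.
decomp_shoe = seasonal_decompose(shoe["Shoe_Sales"], model="multiplicative", period=12)
decomp_drink = seasonal_decompose(drink["SoftDrinkProduction"], model="additive", period=12)

# Each result exposes the trend, seasonal and residual components.
decomp_shoe.plot()
decomp_drink.plot()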
1.3] Split the data into training and test. The test data should start in
1991.
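A minimal sketch of the date-based split for the shoe sales series, assuming the monthly DatetimeIndex built earlier; the same split applies to the soft drink series.

# Training data: everything before January 1991; test data: January 1991 onwards.
train = shoe[shoe.index < "1991-01-01"]
test = shoe[shoe.index >= "1991-01-01"]

print(train.head())
print(test.head())
print(train.shape, test.shape)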
Train data - head of the dataset:
Test data - head of the dataset:
Test data Tail of the dataset:
1.4] Build various exponential smoothing models on the training data and evaluate the models using RMSE on the test data.
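All models in this section are compared using the same RMSE metric on the test data; a small helper along the following lines (names are illustrative) can be reused throughout.

import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(actual, predicted):
    # Root mean squared error between the test series and a forecast.
    return np.sqrt(mean_squared_error(actual, predicted))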
1.4.2]
Naïve Forecast Model:
Inference:
The RMSE value is the lowest for the naïve forecast model so far. But since the forecast is constant through the years, it is not an ideal model for our dataset.
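A hedged sketch of the naïve forecast, which simply carries the last observed training value forward over the whole test horizon (column name assumed):

import pandas as pd

# Naive forecast: repeat the last training observation for every test period.
last_value = train["Shoe_Sales"].iloc[-1]
naive_forecast = pd.Series(last_value, index=test.index)

print("Naive RMSE:", rmse(test["Shoe_Sales"], naive_forecast))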
1.4.3]
Simple Average Forecast:
Model Type RMSE
RegressionOnTime 266.2765
NaiveModel 245.1213
SimpleAverageModel 61.714
Inference:
The RMSE values seem to be lowest for the Simple Average Method so far. But
since the forecast is constant through the years, it isn’t an ideal model for our
dataset.
1.4.4]
Moving Average Forecast:
Moving average forecasting is a simple and effective technique in time series forecasting.
It involves creating a new series whose values are the averages of consecutive raw observations in the original time series.
4pointTrailingMovingAverage    40.500621
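A sketch of the 4-point trailing moving average, assuming it is computed over the full series and then evaluated on the test window:

# 4-point trailing moving average of the full series, restricted to the test period.
ma4 = shoe["Shoe_Sales"].rolling(window=4).mean()
ma4_test = ma4[test.index]

print("4-point trailing MA RMSE:", rmse(test["Shoe_Sales"], ma4_test))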
1.4.5]
Simple Exponential Smoothing:
Simple exponential smoothing is a time series forecasting method for univariate data without a trend or seasonality.
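A minimal sketch of simple exponential smoothing with statsmodels; letting the optimizer choose the smoothing level is an assumption about how the model above was fitted.

from statsmodels.tsa.holtwinters import SimpleExpSmoothing

# Fit SES on the training data and forecast over the test horizon.
ses_fit = SimpleExpSmoothing(train["Shoe_Sales"]).fit(optimized=True)
ses_forecast = ses_fit.forecast(steps=len(test))

print("Alpha:", ses_fit.params["smoothing_level"])
print("SES RMSE:", rmse(test["Shoe_Sales"], ses_forecast))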
Double Exponential Smoothing
It employs a level component and a trend component at each period. Double exponential smoothing uses two weights (also called smoothing parameters) to update the components at each period.
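A sketch of double exponential smoothing (Holt's method), with both smoothing weights estimated from the training data:

from statsmodels.tsa.holtwinters import Holt

# Holt's method: level and trend components, two smoothing weights.
des_fit = Holt(train["Shoe_Sales"]).fit(optimized=True)
des_forecast = des_fit.forecast(steps=len(test))

print("DES RMSE:", rmse(test["Shoe_Sales"], des_forecast))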
Model Type                     RMSE
RegressionOnTime               266.2765
NaiveModel                     245.1213
SimpleAverageModel             63.98457
4pointTrailingMovingAverage    40.500621
Triple Exponential Smoothing:
Figure: Triple Exponential Smoothing (Multiplicative)
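A hedged sketch of the triple exponential smoothing (Holt-Winters) fit shown in the figure, with an additive trend and multiplicative seasonality; the exact settings used in the report are assumptions.

from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Holt-Winters: level, trend and seasonal components with a 12-month seasonal period.
tes_fit = ExponentialSmoothing(
    train["Shoe_Sales"], trend="add", seasonal="mul", seasonal_periods=12
).fit()
tes_forecast = tes_fit.forecast(steps=len(test))

print("TES RMSE:", rmse(test["Shoe_Sales"], tes_forecast))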
Model Type                     RMSE
RegressionOnTime               266.2765
NaiveModel                     245.1213
SimpleAverageModel             63.98457
4pointTrailingMovingAverage    40.500621
The RMSE value is lowest for the 4-point Trailing Moving Average method so far.
1.5] Check for stationarity - Make the data stationary (if needed)
The Augmented Dickey-Fuller test is a unit root test which determines whether
there is a unit root and subsequently whether the series is non-stationary.
The hypothesis in a simple form for the ADF test is:
H0: The Time Series has a unit root and is thus non-stationary.
H1: The Time Series does not have a unit root and is thus stationary.
We would want the series to be stationary for building ARIMA models and thus we
would want the p-value of this test to be less than the Alpha value.
When the ADF test was applied to the series, we got a p-value of 0.601, which is higher than 0.05; hence we fail to reject the null hypothesis and conclude that the series is not stationary.
We now have to apply first-order differencing to the dataset and check for stationarity.
The p-value after first-order differencing is 0.0234 < 0.05; hence we now reject the null hypothesis and conclude that the series is stationary after differencing of order 1.
Below is a graphic representation of the same. The test statistic value is -3.144211,
while the number of lags used is 13.
Now that the data is stationary, we can move on to building the ARIMA and
SARIMA models.
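A minimal sketch of the ADF test and the first-order differencing step described above:

from statsmodels.tsa.stattools import adfuller

# ADF test on the original series.
adf_stat, p_value, used_lags, n_obs, crit_values, _ = adfuller(shoe["Shoe_Sales"])
print(f"ADF statistic: {adf_stat:.4f}, p-value: {p_value:.4f}, lags used: {used_lags}")

# Since the p-value is above 0.05, difference once and re-run the test.
diff = shoe["Shoe_Sales"].diff().dropna()
adf_stat_d, p_value_d = adfuller(diff)[:2]
print(f"After differencing: ADF statistic {adf_stat_d:.4f}, p-value {p_value_d:.4f}")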
Results of Dickey-Fuller Test:
Before Differencing:
After Differencing:
1.6]
Build an automated version of the ARIMA/SARIMA model in
which the parameters are selected using the lowest Akaike
Information Criteria (AIC) on the training data and evaluate
this model on the test data using RMSE.
param        AIC
(2, 1, 3)    1480.805493
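A sketch of the AIC-based grid search over ARIMA orders, assuming p and q range over 0-3 while d is fixed at 1 as described:

import itertools
from statsmodels.tsa.arima.model import ARIMA

# Fit every (p, 1, q) combination on the training data and record its AIC.
results = []
for p, q in itertools.product(range(4), range(4)):
    try:
        fit = ARIMA(train["Shoe_Sales"], order=(p, 1, q)).fit()
        results.append(((p, 1, q), fit.aic))
    except Exception:
        continue  # skip combinations that fail to converge

best_order, best_aic = min(results, key=lambda x: x[1])
print("Best order by AIC:", best_order, "AIC:", best_aic)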
ARIMA SUMMARY
Graph:
Diagnostics:
Details:
Model Type RMSE MAPE
AIC-ARIMA(2,1,3) 184.648 85.73498
SARIMA:
Again, we create a grid of all possible combinations of (p, d, q) together with the seasonal orders (P, D, Q) and a seasonal period of 12.
The range of 'p' and 'q' is (0, 4) and 'd' is held constant at 1.
Model performance is judged by the lowest AIC value, and the corresponding parameters are then fitted as the SARIMA model.
We now fit the training data with the model and forecast on the test set, obtaining the SARIMA summary, graph and diagnostic results (a sketch of this step follows).
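A hedged sketch of fitting the AIC-selected SARIMA on the training data and evaluating it on the test set; the (0,1,2)(1,0,2,12) order is taken from the results table reported later, and the MAPE computation is illustrative.

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Fit the AIC-selected SARIMA order on the training data.
sarima_fit = SARIMAX(
    train["Shoe_Sales"],
    order=(0, 1, 2),
    seasonal_order=(1, 0, 2, 12),
    enforce_stationarity=False,
    enforce_invertibility=False,
).fit(disp=False)

# Forecast over the test horizon and evaluate RMSE and MAPE.
sarima_forecast = sarima_fit.get_forecast(steps=len(test)).predicted_mean
mape = (abs(test["Shoe_Sales"].values - sarima_forecast.values)
        / test["Shoe_Sales"].values).mean() * 100
print("SARIMA RMSE:", rmse(test["Shoe_Sales"], sarima_forecast), "MAPE:", mape)
print(sarima_fit.summary())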
Inference:
Summary:
Graph:
Details:
The AR order is selected by looking at where the PACF plot cuts off (within the appropriate confidence interval bands), and the MA order is selected by looking at where the ACF plot cuts off (within the appropriate confidence interval bands).
The correct degree or order of difference gives us the value of ‘d’ while the
‘p’ value is for the order of the AR model and the ‘q’ value is for the order of
the MA model.
For SARIMA, the seasonal period 'F' can be determined by looking at the ACF plot. The ACF plot is expected to show a spike at multiples of 'F', thereby indicating the presence of seasonality.
Also, for seasonal models, the ACF and PACF plots behave a bit differently and will not always continue to decay as the number of lags increases.
We get the ‘p’ value from the PACF and the ‘q’ value from the ACF plot. The following are
the plots at d=1:
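A minimal sketch of generating those plots on the once-differenced training series (the lag count shown is an assumption):

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# ACF and PACF of the first-differenced training series, used to read off q and p.
diff_train = train["Shoe_Sales"].diff().dropna()
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(diff_train, lags=30, ax=axes[0])
plot_pacf(diff_train, lags=30, ax=axes[1])
plt.show()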
We then fit an ARIMA(3,1,1) model; these values have been found from the ACF and PACF plots.
Summary:
Figure: ACF/PACF ARIMA (Summary)
Forecast:
Graph:
Diagnostics:
Figure: Diagnostics
Observations:
Model Type RMSE MAPE
AIC-ARIMA(2,1,3) 184.648 85.73498
AIC-SARIMA(0, 1, 2)(1, 0, 2, 12) 69.03066 26.45588
ACF/PACF-ARIMA(3,1,1) 144.1839 66.91049
SARIMA:
We get the 'p' value from the PACF and the 'q' value from the ACF plot (Figures 19 and 20), at d = 1 and a seasonal frequency of 12. We additionally find P, D and Q from the same plots by looking for seasonal peaks.
We then fit a SARIMA(3,1,1)(2,0,4,12) model; these values have been found from the ACF and PACF plots, and we obtain the SARIMA summary, graph and diagnostic results.
Summary:
Graph:
Diagnostics:
Calculations:
Model Type RMSE MAPE
AIC-ARIMA(2,1,3) 184.648 85.73498
AIC-SARIMA(0, 1, 2)(1, 0, 2, 12) 69.03066 26.45588
ACF/PACF-ARIMA(3,1,1) 144.1839 66.91049
ACF/PACF-SARIMA(3,1,1)(2, 0, 4, 12) 109.9242 46.26953
Inference:
The graph shows the forecast for the next 12 months using the ARIMA model; a sketch of this refit-and-forecast step is given below.
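A hedged sketch of this final step, rebuilding the chosen model on the entire series and forecasting the next 12 months; the ARIMA(2,1,3) order found by the AIC search is used here for illustration and should be replaced by whichever model is finally selected.

from statsmodels.tsa.arima.model import ARIMA

# Refit the chosen model on the full dataset, then forecast 12 months ahead.
final_fit = ARIMA(shoe["Shoe_Sales"], order=(2, 1, 3)).fit()
future_forecast = final_fit.forecast(steps=12)
print(future_forecast)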
1.8] Insights & Recommendations:
Sales tend to increase in the second half of the year compared to the first half, and December records the highest shoe sales.
The spike might be caused by the festive mood of the target market, when gifting is on the rise (to others or to oneself).
The sales peaked between 1986 and 1988 which might be due to a variety of
reasons ranging from disruptive innovations to lucrative offers.
An enhanced focus on target-oriented marketing might grow interest in the product, which might in turn increase sales when launching a new variety of shoes.
Furthermore, a decision can be made on the viability of continuing with shoe lines that have differing impacts on the bottom line, so that resources can be streamlined towards producing the high-yielding products.
In view of the above, we conclude that by weighing the pros and cons of the various strategies, the company can make a smooth transition towards better revenue and, in turn, better profits.
Soft Drinks
Linear Regression:
Figure: Linear Regression plot
Model Type            RMSE
RegressionOnTime      798.1503
SimpleAverageModel    934.3534
The RMSE value is lowest for the 2-point Trailing Moving Average model.
The alpha value, or smoothing level, at which the graph is plotted is 0.119.
Double Exponential Smoothing:
Triple Exponential Model
Triple exponential smoothing is used to handle time series data containing a seasonal component. This method is based on three smoothing equations: one for the level (stationary component), one for the trend, and one for the seasonal component. Both the seasonal and trend components can be additive or multiplicative; this is the additive model.
The alpha value, or smoothing level, at which the graph is plotted is 0.15, while the beta, or trend smoothing parameter, is 0.039 and the gamma, or seasonal smoothing parameter, is 0.262.
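The fitted smoothing parameters quoted above can be read directly from the statsmodels results object; a short sketch is shown below, where the training series name "train_drink" and the additive settings are assumptions.

from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Additive Holt-Winters model on the soft drink training series (names assumed).
hw_fit = ExponentialSmoothing(
    train_drink["SoftDrinkProduction"], trend="add", seasonal="add", seasonal_periods=12
).fit()

# Fitted smoothing parameters: alpha (level), beta (trend), gamma (seasonal).
# Older statsmodels versions name the trend key "smoothing_slope".
print("alpha:", hw_fit.params["smoothing_level"])
print("beta:", hw_fit.params["smoothing_trend"])
print("gamma:", hw_fit.params["smoothing_seasonal"])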
Prediction of all Models
2.5] Checking for Stationarity
The hypothesis in a simple form for the ADF test is:
H0: The Time Series has a unit root and is thus non-stationary.
H1: The Time Series does not have a unit root and is thus stationary.
After Differencing
2.6] ARIMA and SARIMA using lowest AIC method
ARIMA:
We create a grid of all possible combinations of (p, d, q). The range of 'p' and 'q' is (0, 4) and 'd' is held constant at 1.
An ARIMA model is fitted for each of the above combinations, and we end up choosing the one with the least AIC value.
The lowest AIC value is mentioned below:
param        AIC
(2, 0, 2)    2054.66072
We now fit the train data with the model and forecast on the test
set.
ARIMA Summary
Graph:
Inference:
This is not a good model because its predictions are far from the test data.
SARIMA
We fit a SARIMA model for each of the above combinations and end up choosing the one with the least AIC value.
We now fit the training data with the model and forecast on the test set, and we get the SARIMA summary, graph and diagnostic results.
Summary
Graph:
Diagnostics:
Calculation
Inference:
The above graph shows that this is also not a good model, because the SARIMA prediction appears as a straight line.
2.7] ARIMA and SARIMA based on the cut-off points of ACF and
PACF:
ARIMA:
We get the 'p' value from the PACF and the 'q' value from the ACF plot. The following are the plots at d = 1:
SARIMA:
We then move on to fit a SARIMA(1,0,2)(0,0,2,5) model. These values have been found from the ACF and PACF plots, and we get the SARIMA summary, graph and diagnostic results.
SUMMARY
Graph
Diagnostics:
Inference:
This is a good model, as its predictions are close to the test data.
2.9] Recommendations & Suggestions
Production of soft drinks has increased in the second half of the year, peaking in the month of December.
In the monthly as well as the yearly trend, we see that December is
the most popular month for Soft Drink Production. The production
peaked between 1988 and 1990 which might be because of better
product marketing combined with the rising spending power of the
consumers.
Furthermore, the manufacturers can optimize their production to meet the
rising demands of the consumers.