
TIME SERIES FORECASTING
(SHOE SALES & SOFT DRINKS)

PROJECT REPORT

1 Problem 1
1.1 Read the dataset, perform descriptive statistics and check for null values; write an inference on it
1.2 Exploratory data analysis (Shoe Sales & Soft Drinks)
1.3 Shoe Sales forecast (split of data into train and test datasets)
1.4 Build various exponential smoothing models on the training data and evaluate them using RMSE on the test data (Shoe Sales)
1.5 Check for stationarity and make the data stationary (if needed)
1.6 Model building on stationary data: generate ACF & PACF plots and find the AR and MA orders; build different ARIMA models (auto and manual) and different SARIMA models (auto and manual); check the performance of the models built (Shoe Sales)
1.7 Choose the best model with proper rationale, rebuild it on the entire data and make a forecast for the next 12 months
1.8 Insights based on these predictions

2 Problem 2
2.4 Soft Drinks (split of data into train and test datasets)
2.5 Build various exponential smoothing models on the training data and evaluate them using RMSE on the test data (Soft Drink Sales)
2.6 Model building on stationary data: generate ACF & PACF plots and find the AR and MA orders; build different ARIMA models (auto and manual) and different SARIMA models
2.7 Auto SARIMA and manual SARIMA; check the performance of the models built (Soft Drink Sales)
2.8 Choose the best model with proper rationale, rebuild it on the entire data and make a forecast for the next 12 months
2.9 Recommendations & Suggestions

Problem 1:
Context:
In today's dynamic business environment, precise sales and
production forecasts are essential for strategic planning and
operational efficiency. Companies like IJK Shoe Company and RST
Firm have accumulated extensive monthly data on shoe sales and soft
drink production, respectively, spanning from January 1980 to July
1995. Leveraging advanced time series forecasting techniques, these
companies aim to utilize their historical data to predict future trends
accurately. This initiative enables them to make informed decisions,
optimize resource allocation, and adapt proactively to market
dynamics.

Objective:
The primary objective is to predict future sales for IJK Shoe Company and
production volumes for RST Firm over the next year. By analyzing
the historical monthly data spanning from January 1980 to July 1995, our
goal is to develop accurate forecasting models that capture the underlying
patterns and seasonality inherent in the sales and production processes.
Through this task, we aim to empower IJK Shoe Company and RST Firm
with actionable insights that facilitate proactive planning, optimize
resource allocation, and enhance operational efficiency. By anticipating
future trends in sales and production, both companies can align their
strategies, streamline production-related activities, and capitalize on
emerging opportunities in their respective markets.

INTRODUCTION
This report consists of Time Series analysis and forecasting of 2 datasets
• DATASET 1 - Shoe sales data
• DATASET 2 - Soft drink production data

Problem 1:
You are an analyst at the IJK shoe company and you are expected to
forecast the sales of pairs of shoes for the 12 months following the end of
the data. The shoe sales data have been provided for the period from
January 1980 to July 1995.
Data Source-Shoesales.csv

Problem 2:
You are an analyst at the RST soft drink company and you are expected
to forecast the production of soft drinks for the 12 months following the
end of the data. The soft drink production data have been provided for the
period from January 1980 to July 1995.
Data Source- SoftDrink.csv

1.1 Define the problem and perform Exploratory Data Analysis.


Read the data as an appropriate time series data - Plot the data, perform
EDA & decomposition.
Total no. of shoe sales data entries: 187
Total no. of soft drink data entries: 187
No. of missing values in both datasets: 0
No. of duplicate entries in the shoe sales data: 0
No. of duplicate entries in the soft drink data: 0
Both datasets are split into train and test sets at 1991; the test data starts in January 1991.
Forecasting models applied are:
• Linear Regression
• Simple Average
• 2-point Moving Average
• Single Exponential Smoothing
• Double Exponential Smoothing
• Triple Exponential Smoothing (Holt-Winters Model)
• ARIMA / SARIMA (auto fitted)
• ARIMA / SARIMA (manually fitted)
1] Read the data as an appropriate Time Series data and plot the data

Both Datasets are read and stored as Pandas Data Frames for analysis
Sample rows from both datasets are shown below (the last five rows of the shoe sales data and the first five rows of the soft drink data):
     YearMonth  Shoe Sales
182  1995-03    188
183  1995-04    195
184  1995-05    189
185  1995-06    220
186  1995-07    274

   YearMonth  SoftDrinkProduction
0  1980-01    1954
1  1980-02    2302
2  1980-03    3054
3  1980-04    2414
4  1980-05    2226
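A minimal sketch of how the two files could be read as monthly time series with pandas is given below. The file names come from the data sources listed above; the date and value column names ('YearMonth', 'Shoe_Sales', 'SoftDrinkProduction') are assumptions and may need to be adjusted to match the actual CSV headers.

```python
import pandas as pd

# Read both monthly series; column names are assumptions and may differ in the CSVs.
shoe = pd.read_csv("Shoesales.csv", parse_dates=["YearMonth"], index_col="YearMonth")
drink = pd.read_csv("SoftDrink.csv", parse_dates=["YearMonth"], index_col="YearMonth")

# Enforce a monthly frequency so downstream models know the seasonal period.
shoe = shoe.asfreq("MS")
drink = drink.asfreq("MS")

# Basic checks reported above: 187 rows each, no missing values, no duplicates.
print(shoe.shape, drink.shape)
print(shoe.isnull().sum(), drink.isnull().sum())
print(shoe.duplicated().sum(), drink.duplicated().sum())
print(shoe.tail())
print(drink.head())
```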

Soft Drink Data Plot:

Shoe Sales Data Plot:

1.2 Perform appropriate Exploratory Data Analysis to understand the
data and also perform decomposition.

Since the data contain no outliers or duplicate values, no outlier or duplicate treatment is needed.

EDA of Shoe sales:


Month-on-month (MoM) plot of shoe sales.

Observations:
A spike appears in the last quarter of the year, i.e. during the months of November and December.

Year-on-year (YoY) shoe sales.

Observations:
The year 1987 saw a boom, and shoe sales were at their maximum.

Shoe sales YoY - all months

Observations:

• The period between 1986 and 1988 saw the spike, especially in December, followed by November.
• The difference in sales may be due to the holiday season at the end of the year.

EDA of Soft drinks:

Month-on-month (MoM) plot of soft drink production:

Observations:

• December was the month with the highest production across the years; the boxplot gives an overview of production across months.
• November had the second-highest production.

YoY soft drink production:

Observations:
• Maximum production was observed between 1994 and 1995.

YoY production for all months:

Observations:
• Maximum production is observed in the month of December.

Additive Decomposition of Shoe Sales:

Multiplicative Decomposition of Shoe Sales:

Additive Decomposition of Soft Drink:

Multiplicative Decomposition of Soft Drink:

Since we are looking at the change in absolute quantity for this particular dataset, we proceed with the additive model.
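A minimal sketch of the decomposition step with statsmodels is shown below, assuming the `shoe` DataFrame from the reading sketch and a 'Shoe_Sales' column name; the soft drink series can be decomposed the same way.

```python
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

series = shoe["Shoe_Sales"]  # column name is an assumption

# Additive and multiplicative decompositions of the monthly series.
add_decomp = seasonal_decompose(series, model="additive", period=12)
mul_decomp = seasonal_decompose(series, model="multiplicative", period=12)

add_decomp.plot()
mul_decomp.plot()
plt.show()

# Residuals of the chosen additive model can be inspected directly.
print(add_decomp.resid.dropna().describe())
```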

Shoe Sales Forecast

1.3] Split the data into training and test. The test data should start in
1991.

The training set contains 132 observations and the test set contains 55 observations for this dataset.

The test data starts in January 1991, which provides well over a full year of monthly observations for evaluating the forecasts.
Train data Head of the dataset: Test data Head of the dataset:

Train data Tail of the dataset:

Train Data Shape = (132, 1)

Test data Tail of the dataset:

Test Data Shape = (55, 1)

Graphic representation of Train and Test Split:

Shoe Sales- Train and Test split

1.4] Build various exponential smoothing models on the training
data and evaluate the models using RMSE on the test data.

Objective: The main objective of building so many models is to ensure we
pick an optimum model with the lowest RMSE and MAPE values.
Strategy:
• We build Linear Regression, Naïve (last-value) forecast and Simple Average models and check their performance (a sketch of these baselines and the RMSE/MAPE computation is shown below).
• We also build various exponential smoothing models to check their performance.
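The sketch below shows one way the three baseline models and the RMSE/MAPE scores could be computed; it reuses the `train`/`test` split from the previous sketch, and the column name 'Shoe_Sales' is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

y_train, y_test = train["Shoe_Sales"], test["Shoe_Sales"]

# 1) Regression on time: the integer time index is the only feature.
t_train = np.arange(len(y_train)).reshape(-1, 1)
t_test = np.arange(len(y_train), len(y_train) + len(y_test)).reshape(-1, 1)
reg_pred = LinearRegression().fit(t_train, y_train).predict(t_test)

# 2) Naive forecast: carry the last observed training value forward.
naive_pred = np.repeat(y_train.iloc[-1], len(y_test))

# 3) Simple average: repeat the mean of the training data.
avg_pred = np.repeat(y_train.mean(), len(y_test))

def rmse(actual, forecast):
    return np.sqrt(mean_squared_error(actual, forecast))

def mape(actual, forecast):
    return np.mean(np.abs((actual - forecast) / actual)) * 100

for name, pred in [("RegressionOnTime", reg_pred),
                   ("NaiveModel", naive_pred),
                   ("SimpleAverageModel", avg_pred)]:
    print(name, round(rmse(y_test, pred), 4), round(mape(y_test, pred), 4))
```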
1.4.1] Linear Regression:
Plot of the shoe sales linear regression model.

Linear Regression Model

Model Type RMSE


Regression On Time 244.810664

1.4.2] Naïve Forecast Model:

Naïve forecast model (the last observed training value is carried forward)

Model Type            RMSE
Regression On Time    244.810664
Naïve Model           245.1213

Inference:
The RMSE of the naïve model is comparable to the regression model, but since the forecast is constant through the years, it isn't an ideal model for our dataset.

1.4.3]
Simple Average Forecast:

Model Type RMSE
RegressionOnTime 266.2765
NaiveModel 245.1213
SimpleAverageModel 61.714

Inference:
The RMSE values seem to be lowest for the Simple Average Method so far. But
since the forecast is constant through the years, it isn’t an ideal model for our
dataset.

1.4.4] Moving Average Forecast:
• Moving average forecasting is a simple and effective technique in time series forecasting.
• A moving average creates a new series whose values are the average of the most recent raw observations in the original time series (see the sketch below).
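A sketch of a trailing moving-average forecast with pandas is shown below; it continues from the earlier sketches (`shoe`, `test`, `rmse`) and the window sizes match the 2-point and 4-point averages used in the report.

```python
# Each forecast is the mean of the previous k observations (shifted by one
# month so that only past values are used for the forecast).
full_series = shoe["Shoe_Sales"]

for k in (2, 4):
    trailing = full_series.rolling(window=k).mean().shift(1)
    ma_pred = trailing.loc[test.index]
    print(f"{k}-point trailing MA RMSE:", round(rmse(test["Shoe_Sales"], ma_pred), 4))
```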

Trailing Moving Average (2-point) Forecast

Model Type                     RMSE
RegressionOnTime               266.2765
NaiveModel                     245.1213
SimpleAverageModel             63.98457
4pointTrailingMovingAverage    40.500621

1.4.5] Simple Exponential Smoothing:

Simple exponential smoothing is a time series forecasting method for univariate data without a trend or seasonality.

Simple Exponential Smoothing

Model Type                     RMSE
RegressionOnTime               266.2765
NaiveModel                     245.1213
SimpleAverageModel             63.98457
4pointTrailingMovingAverage    40.500621
Simple Exponential Smoothing   192.641397

Double Exponential Smoothing:
It employs a level component and a trend component at each period. Double exponential smoothing uses two weights (also called smoothing parameters) to update the components at each period.

Figure: Simple and Double Exponential Smoothing

Model Type                     RMSE
RegressionOnTime               266.2765
NaiveModel                     245.1213
SimpleAverageModel             63.98457
4pointTrailingMovingAverage    40.500621
Simple Exponential Smoothing   192.641397
Double Exponential Smoothing   247.788062

Triple Exponential Smoothing:

The Holt-Winters method is an extension of double exponential smoothing (Holt's method) that incorporates seasonality in addition to the level and trend components.
The level captures the underlying pattern and represents the smoothed average value of the series over time.
The trend represents the rate of change of the series over time.

Figure: Simple, Double and Triple Exponential Smoothing

Triple Exponential Smoothing (Multiplicative):

• This method is based on three smoothing equations: a stationary (level) component, a trend component and a seasonal component. This is the multiplicative model.
• The alpha value (smoothing level) at which the graph is plotted is 0.551, while the beta (smoothing trend) is 0.0001 and the gamma (smoothing seasonal) is 0.30.
• A sketch of fitting these smoothing models with statsmodels is shown below.
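A sketch of fitting the three smoothing models with statsmodels, assuming the `train`/`test` split from the earlier sketches; the fitted smoothing parameters will not necessarily match the values quoted above exactly.

```python
from statsmodels.tsa.api import SimpleExpSmoothing, Holt, ExponentialSmoothing

y_train, y_test = train["Shoe_Sales"], test["Shoe_Sales"]

# Simple exponential smoothing (level only).
ses_pred = SimpleExpSmoothing(y_train).fit().forecast(len(y_test))

# Double exponential smoothing (level + trend, Holt's method).
des_pred = Holt(y_train).fit().forecast(len(y_test))

# Triple exponential smoothing (Holt-Winters) with additive trend and
# multiplicative seasonality over a 12-month period.
tes_fit = ExponentialSmoothing(y_train, trend="add", seasonal="mul",
                               seasonal_periods=12).fit()
tes_pred = tes_fit.forecast(len(y_test))

# Optimised smoothing parameters (key names as in recent statsmodels versions).
print(tes_fit.params["smoothing_level"],
      tes_fit.params["smoothing_trend"],
      tes_fit.params["smoothing_seasonal"])
```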

Figure: Triple Exponential Smoothing (Multiplicative)

Prediction of all models

Figure: Simple, Double and Triple Exponential (tuned) & Linear Model


Model Type                                                              RMSE
RegressionOnTime                                                        266.2765
NaiveModel                                                              245.1213
SimpleAverageModel                                                      63.98457
4pointTrailingMovingAverage                                             40.500621
Simple Exponential Smoothing                                            192.641397
Double Exponential Smoothing                                            247.788062
Triple Exp Smoothing Model (Level 0.57, Trend 0.01, Seasonality 0.27)   97.286929
Triple Exp Smoothing Model (tuned)                                      56.89

The RMSE values seem to be lowest for the 4 point Trailing Moving Average Method so far.

1.5] Check for stationarity - Make the data stationary (if needed)

• The Augmented Dickey-Fuller (ADF) test is a unit root test which determines whether there is a unit root and subsequently whether the series is non-stationary.
• The hypothesis in a simple form for the ADF test is:
H0: The time series has a unit root and is thus non-stationary.
H1: The time series does not have a unit root and is thus stationary.
• We want the series to be stationary for building ARIMA models, and thus we want the p-value of this test to be less than the alpha value of 0.05.
• When the ADF test was applied to the series we got a p-value of 0.601, which is higher than 0.05; hence we fail to reject the null hypothesis and conclude that the series is not stationary.
• We now apply first-order differencing to the dataset and check for stationarity again.
• The p-value after first-order differencing is 0.0234 < 0.05, hence we now reject the null hypothesis and conclude that the differenced series is stationary.
• Below is a graphical representation of the same. The test statistic value is -3.144211, while the number of lags used is 13.
• Now that the data is stationary, we can move on to building the ARIMA and SARIMA models (a sketch of the ADF check is shown below).
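A sketch of the ADF check with statsmodels, assuming the `shoe` DataFrame and column name from the earlier sketches:

```python
from statsmodels.tsa.stattools import adfuller

def adf_report(series, label):
    # adfuller returns (statistic, p-value, lags used, n obs, critical values, icbest).
    stat, pvalue, usedlag, nobs, crit, _ = adfuller(series.dropna(), autolag="AIC")
    print(f"{label}: test statistic = {stat:.6f}, p-value = {pvalue:.6f}, lags used = {usedlag}")
    return pvalue

# Level series: expect a high p-value (non-stationary).
adf_report(shoe["Shoe_Sales"], "Level series")

# First difference: expect a p-value below 0.05 (stationary).
adf_report(shoe["Shoe_Sales"].diff(), "First difference")
```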

Results of Dickey-Fuller Test:

Test Statistic                   -1.717397
p-value                           0.6022172
#Lags Used                       13.000000
Number of Observations Used     173.000000
Critical Value (1%)              -3.468726
Critical Value (5%)              -2.878396
Critical Value (10%)             -2.575756
dtype: float64

Before differencing:

After first-order differencing:

Results of Dickey-Fuller Test:


Test Statistic -3.144211
p-value 0.023450
#Lags Used 13.000000
Number of Observations Used 117.000000
Critical Value (1%) -3.487517
Critical Value (5%) -2.886578
Critical Value (10%) -2.580124
dtype: float64

Figure: Stationary shoe sales series after first-order differencing

1.6]
Build an automated version of the ARIMA/SARIMA model in
which the parameters are selected using the lowest Akaike
Information Criteria (AIC) on the training data and evaluate
this model on the test data using RMSE.

We first create a grid of all possible (p, d, q) combinations, with the ranges of 'p' and 'q' being (0, 4) and 'd' held constant at 1.
Each combination is fitted as an ARIMA model, model performance is compared using the AIC, and the combination with the lowest AIC value is selected (a sketch of this grid search is shown below).

param        AIC
(2, 1, 3)    1480.805493
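A sketch of this AIC-based grid search using statsmodels is shown below; it assumes the `train` split from earlier and the column name 'Shoe_Sales'.

```python
import itertools
from statsmodels.tsa.arima.model import ARIMA

y_train = train["Shoe_Sales"]

results = []
for p, q in itertools.product(range(0, 4), range(0, 4)):
    try:
        fit = ARIMA(y_train, order=(p, 1, q)).fit()
        results.append(((p, 1, q), fit.aic))
    except Exception:
        continue  # skip combinations that fail to converge

best_order, best_aic = min(results, key=lambda r: r[1])
print("Best order by AIC:", best_order, "AIC:", round(best_aic, 4))
```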

ARIMA SUMMARY

Graph:

Diagnostics:

Details:
Model Type RMSE MAPE
AIC-ARIMA(2,1,3) 184.648 85.73498

SARIMA:

Again, we create a grid of all possible combinations of (p, d, q) along with seasonal (P, D, Q) orders and a seasonal period of 12.
The ranges of 'p' and 'q' are (0, 4) and 'd' is held constant at 1.
Model performance is compared using the AIC, and the combination with the lowest AIC value is fitted as the SARIMA model.

param        seasonal         AIC
(0, 1, 2)    (1, 0, 2, 12)    1156.165429

We now fit the model on the training data and forecast on the test set, which gives the SARIMA summary, graph and diagnostic results (a sketch of this step is shown below).
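A sketch of fitting the AIC-selected seasonal model and scoring it on the test period, continuing from the earlier sketches:

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

y_train, y_test = train["Shoe_Sales"], test["Shoe_Sales"]

# Fit the AIC-selected SARIMA(0, 1, 2)(1, 0, 2, 12) on the training data.
sarima_fit = SARIMAX(y_train, order=(0, 1, 2),
                     seasonal_order=(1, 0, 2, 12)).fit(disp=False)
print(sarima_fit.summary())

# Forecast over the test horizon and compute RMSE / MAPE.
pred = sarima_fit.forecast(steps=len(y_test))
rmse = np.sqrt(np.mean((y_test.values - pred.values) ** 2))
mape = np.mean(np.abs((y_test.values - pred.values) / y_test.values)) * 100
print("RMSE:", round(rmse, 4), "MAPE:", round(mape, 4))

# Residual diagnostics plot.
sarima_fit.plot_diagnostics(figsize=(10, 8))
```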

Inference:

Model Type RMSE MAPE


AIC-ARIMA(2,1,3) 184.648 85.73498
AIC-SARIMA(0, 1, 2)(1, 0, 2, 12) 69.03066 26.45588

Summary:

Graph:

Details:

Check the performance of the models built

• The AR order 'p' is selected by looking at where the PACF plot cuts off (with respect to the confidence interval bands), and the MA order 'q' is selected by looking at where the ACF plot cuts off (with respect to the confidence interval bands).
• The correct degree or order of differencing gives us the value of 'd', while 'p' is the order of the AR part and 'q' is the order of the MA part.
• For SARIMA, the seasonal period 'F' can be determined by looking at the ACF plot, which is expected to show spikes at multiples of 'F', indicating the presence of seasonality.
Also, for seasonal models the ACF and PACF plots behave a bit differently and will not always continue to decay as the number of lags increases.
We get the ‘p’ value from the PACF and the ‘q’ value from the ACF plot. The following are
the plots at d=1:

Figure Autocorrelation of Differenced Data

Figure Partial Autocorrelation of Differenced Data
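A sketch of generating these plots with statsmodels, assuming the `shoe` DataFrame from earlier:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# ACF and PACF of the first-differenced series: the MA order 'q' is read from
# where the ACF cuts off and the AR order 'p' from where the PACF cuts off.
diffed = shoe["Shoe_Sales"].diff().dropna()

fig, axes = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(diffed, lags=40, ax=axes[0])
plot_pacf(diffed, lags=40, ax=axes[1])
plt.show()
```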

We fit an ARIMA(3, 1, 1) model; these orders were read from the ACF and PACF plots.
Summary:

Figure-ACF/PACF(Summary)
Forecast:

Figure: Forecast graph

Diagnostics:

Figure: Diagnostics
Observations:
Model Type RMSE MAPE
AIC-ARIMA(2,1,3) 184.648 85.73498
AIC-SARIMA(0, 1, 2)(1, 0, 2, 12) 69.03066 26.45588
ACF/PACF-ARIMA(3,1,1) 144.1839 66.91049

SARIMA:

We get the 'p' value from the PACF plot and the 'q' value from the ACF plot of the differenced series (d = 1), with a seasonal frequency of 12. We additionally read off P, D and Q from the same plots by looking for seasonal peaks.

We fit the SARIMA model with orders (3, 1, 1)(2, 0, 4, 12); these values have been found from the ACF and PACF plots. This gives the SARIMA summary, graph and diagnostic results.

Summary:

Graph:

Diagnostics:

Calculations:
Model Type RMSE MAPE
AIC-ARIMA(2,1,3) 184.648 85.73498
AIC-SARIMA(0, 1, 2)(1, 0, 2, 12) 69.03066 26.45588
ACF/PACF-ARIMA(3,1,1) 144.1839 66.91049
ACF/PACF-SARIMA(3,1,1)(2, 0, 4, 12) 109.9242 46.26953

Inference:

• Based on RMSE and MAPE, the AIC-selected SARIMA(0, 1, 2)(1, 0, 2, 12) is the best-performing model; the auto-fitted models are also computationally efficient and give accurate predictions.
• The comparison also takes MAPE into consideration, since it is always a good idea to have more than one accuracy metric.

1.7] Make a forecast for the next 12 months

The graph below shows the forecast for the next 12 months using the chosen model, rebuilt on the entire dataset.

Figure: Optimum model forecast for the next 12 months
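A sketch of rebuilding the chosen SARIMA model on the full series and forecasting the next 12 months, assuming the `shoe` DataFrame from earlier:

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Rebuild the chosen model on the entire shoe sales series and forecast
# the 12 months beyond July 1995.
full_fit = SARIMAX(shoe["Shoe_Sales"], order=(0, 1, 2),
                   seasonal_order=(1, 0, 2, 12)).fit(disp=False)

future = full_fit.get_forecast(steps=12)
print(future.predicted_mean)   # point forecasts
print(future.conf_int())       # confidence intervals
```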

1.8] Insights & Recommendations:

• Sales tend to increase in the second half of the year compared to the first half, and December records the highest shoe sales.
• The spike might be caused by the festive mood of the target market, when gifting (to others or to oneself) is on the rise.
• Sales peaked between 1986 and 1988, which might be due to a variety of reasons ranging from disruptive innovations to lucrative offers.
• An enhanced focus on target-oriented marketing might increase interest in the product, which might in turn increase sales when launching a new variety of shoes.
• Furthermore, a decision can be made on the viability of continuing with inventory of shoes that contribute differently to the bottom line, in order to streamline resources towards the high-yielding products.
• In view of the above, we conclude that by weighing the pros and cons of the various strategies, the company can make a smooth transition towards better revenue and better profits.

Soft Drinks

2.4] Building Different models and checking RMSE

Linear Regression:

Plot of the linear regression model:

Linear Regression

Model Type RMSE


Regression On Time 798.150383

Model Type RMSE
Regression On Time 798.1503
SimpleAverageModel 934.353357929829

Moving Average Forecast:

Figure-11 Trailing Moving Average Forecast


Model Type RMSE
Regression On Time 798.1503
SimpleAverageModel 934.353357929829
MovingAverage(2 pt Trailing) 429.354079

The RMSE value is lowest for the 2-point trailing moving average so far.

Simple Exponential Smoothing: The alpha value (smoothing level) at which the graph is plotted is 0.119.

Figure: Simple Exponential Smoothing

Model Type RMSE


Regression On Time 798.1503
SimpleAverageModel 934.353357929829
MovingAverage(2 pt Trailing) 429.354079
Single Exp. Smoothing Model: Level 0.12 817.697561

Double Exponential Smoothing:

• Double exponential smoothing uses two weights (also called smoothing parameters) to update the level and trend components at each period.
• The alpha value (smoothing level) at which the graph is plotted is 0.124, while the beta (smoothing trend) is 0.11.

Figure: Simple and Double Exponential Smoothing

Model Type                                            RMSE
Regression On Time                                    798.1503
SimpleAverageModel                                    934.353357929829
MovingAverage (2-pt Trailing)                         429.354079
Single Exp Smoothing Model (Level 0.12)               817.697561
Double Exp Smoothing Model (Level 0.12, Trend 0.11)   931.309018

Triple Exponential Model

• Triple exponential smoothing is used to handle time series data containing a seasonal component. This method is based on three smoothing equations: a stationary (level) component, a trend component and a seasonal component. Both the seasonal and trend components can be additive or multiplicative; this is the additive model.
• The alpha value (smoothing level) at which the graph is plotted is 0.15, while the beta (smoothing trend) is 0.039 and the gamma (smoothing seasonal) is 0.262.

Model Type                                                              RMSE
Regression On Time                                                      798.1503
SimpleAverageModel                                                      934.353357929829
MovingAverage (2-pt Trailing)                                           429.354079
Single Exp Smoothing Model (Level 0.12)                                 817.697561
Double Exp Smoothing Model (Level 0.12, Trend 0.11)                     931.309018
Triple Exp Smoothing Model (Level 0.15, Trend 0.04, Seasonality 0.26)   459.51

Among the exponential smoothing models, the triple exponential smoothing method has the lowest RMSE so far.

Prediction of all Models

Model Type                                                              RMSE
Regression On Time                                                      798.1503
SimpleAverageModel                                                      934.353357929829
MovingAverage (2-pt Trailing)                                           429.354079
Single Exp Smoothing Model (Level 0.12)                                 817.697561
Double Exp Smoothing Model (Level 0.12, Trend 0.11)                     931.309018
Triple Exp Smoothing Model (Level 0.15, Trend 0.04, Seasonality 0.26)   459.51
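A sketch of overlaying some of these test-period predictions on the actual soft drink production is shown below; it assumes the `drink` DataFrame from the reading sketch and builds two of the simpler forecasts inline for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

series = drink["SoftDrinkProduction"]  # column name is an assumption
train_sd, test_sd = series[:"1990-12"], series["1991-01":]

# Two illustrative forecasts for the overlay: simple average and 2-point trailing MA.
avg_pred = pd.Series(train_sd.mean(), index=test_sd.index)
ma_pred = series.rolling(window=2).mean().shift(1).loc[test_sd.index]

plt.figure(figsize=(12, 6))
plt.plot(train_sd, label="Train")
plt.plot(test_sd, label="Test")
plt.plot(avg_pred, label="SimpleAverageModel")
plt.plot(ma_pred, label="MovingAverage (2-pt Trailing)")
plt.legend()
plt.title("Soft drink production: predictions vs. actuals on the test period")
plt.show()
```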

2.5] Checking for Stationarity

• The hypothesis in a simple form for the ADF test is:
H0: The time series has a unit root and is thus non-stationary.
H1: The time series does not have a unit root and is thus stationary.
• When the ADF test was applied to the series we got a p-value of 0.756854, which is higher than 0.05; hence we fail to reject the null hypothesis and conclude that the series is not stationary.
• The p-value after first-order differencing is 0.01345 < 0.05, hence we now reject the null hypothesis and conclude that the differenced series is stationary.
• The test statistic value is -3.33, while the number of lags used is 12.
• Now that the data is stationary, we can move on to building the ARIMA and SARIMA models.

Stationarity check before differencing:

After first-order differencing

Figure: Stationarity of soft drink production after first-order differencing

2.6] ARIMA and SARIMA using lowest AIC method

ARIMA:

• We create a grid of all possible (p, d, q) combinations, with the ranges of 'p' and 'q' being (0, 4) and 'd' held constant at 1.
• An ARIMA model is fitted for each of the above combinations and we end up choosing the one with the lowest AIC value.
• The lowest AIC value is shown below:

param        AIC
(2, 0, 2)    2054.66072

• We now fit the model on the training data and forecast on the test set.

ARIMA Summary

GRAPH

Inference:

This is not a good model because its predictions are far from the test data.

SARIMA

We fit a SARIMA model for each of the above combinations and end up choosing the one with the lowest AIC value.

param        seasonal        AIC
(1, 0, 2)    (1, 0, 2, 5)    1865.43

We now fit the model on the training data and forecast on the test set, which gives the SARIMA summary, graph and diagnostic results.

Summary

Graph:

Diagnostics:

Calculation

Inference:

The graph shows that this is also not a good model, because the SARIMA predictions form a straight line.

2.7] ARIMA and SARIMA based on the cut-off points of ACF and
PACF:

ARIMA:

We get the 'p' value from the PACF plot and the 'q' value from the ACF plot. The following are the plots at d = 1:

Figure-19 Autocorrelation of Differenced Data

Figure-20 Partial Autocorrelation of Differenced Data

We then get the ARIMA summary, graph and diagnostic results.


SARIMA:
We then fit the SARIMA model with orders (1, 0, 2)(0, 0, 2, 5); these values have been found from the ACF and PACF plots. This gives the SARIMA summary, graph and diagnostic results.

SUMMARY

Graph

Diagnostics:

2.8] Building the optimum model and 12-month forecast

Forecast of the data with all the models.

Inference:

This is a good model, as its predictions are close to the test data.

2.9] Recommendations & Suggestions

• Production of soft drinks increases in the second half of the year, peaking in the month of December.
• In the monthly as well as the yearly trend, we see that December is the most popular month for soft drink production. Production peaked between 1988 and 1990, which might be because of better product marketing combined with the rising spending power of consumers.
• Furthermore, the manufacturers can optimize their production to meet the rising demand from consumers.
