Conference Paper Corrections

Uploaded by

SHANMUGAM
Predictive Big Data Analytics For Supply Chain

Demand Forecasting⋆

Navadersh S1 and Vengadeswaran S1

Indian Institute of Information Technology Kottayam, Kerala, India


[email protected]
https://www.iiitkottayam.ac.in

Abstract. The increasing attention to Big Data Analytics (BDA) in
Supply Chain Management (SCM) stems from its versatile applications,
encompassing customer behaviour analysis, trend analysis, and demand
prediction. This paper explores the potential of predictive BDA
applications in supply chain demand forecasting and pinpoints gaps in
existing knowledge. A time series predictive big data analytics
approach using SARIMA, Prophet and an ensemble model is proposed. To
improve prediction accuracy, optimal parameters were identified using
manual and grid search methods. In addition, a geographical feature is
extracted using KMeans clustering to understand any underlying
patterns. Finally, an ensemble model is proposed that integrates
heterogeneous models (SARIMA, Prophet) to achieve improved
performance. The experiments were run on a five-node Spark cluster
deployed in the cloud environment. The results show that the proposed
ensemble approach using linear regression achieves the lowest RMSE of
30.76.

Keywords: Predictive Big Data Analytics · Demand Forecasting ·
Geo-Spatial Clustering.

1 Introduction
Supply Chain Management (SCM) is the backbone of the smooth flow of
products, services, and information. However, it faces uncertainties in
capacity, supply, and demand, which add to production, storage, and
delivery costs. The significance of dealing with these uncertainties
cannot be overstated, which underlines the impact of demand forecasting
on supply chain performance [1], [?], [3], [4]. Traditionally, two
approaches have been employed: a forward-looking approach, anticipating
potential demand over the next several years, and a backward-looking
approach, relying on past or ongoing capabilities to respond to demand.
However, traditional solutions, often relying on spreadsheet models and
statistical methods like moving averages, scale poorly to large data
and struggle to address the complexities and uncertainties inherent in
SCM [5]. Historically, statistical analysis techniques like time-series
and regression analysis have been employed for demand forecasting in
SCM [6]. The limitations of conventional methods become apparent in the
context of supply chain demand forecasting, where numerous parameters
influence demand and are often not captured by traditional approaches.
Conventional methods tend to provide only a partial understanding of
demand variations and are less adept at handling the non-linear
behaviours prevalent in supply chain dynamics.

⋆ The research work is supported by IoT Cloud Research Group, Indian
Institute of Information Technology, Kottayam.

However, the evolving landscape of information technologies and
computational efficiencies has brought big data analytics (BDA) to
prominence [7]. BDA not only enables more precise predictions aligned
with customer needs but also contributes to the assessment of SCM
performance, efficiency improvement, reduced reaction time, and
enhanced risk assessment. The advent of big data and high-performance
analytics presents a transformative opportunity, enabling scalable,
efficient and easy data processing for data-driven demand forecasting
and planning [8], [9]. By leveraging different time-series algorithms,
BDA enables the extraction of valuable insights, leading to accurate
demand forecasting models that scale to SCM applications.
This work studies whether KMeans (geospatial) clustering of the dataset
reveals underlying patterns or features that can improve demand
forecasting. For demand forecasting, two time series algorithms are
considered, SARIMA and Prophet, and the work examines which of the two
gives the more accurate forecasts after clustering.

2 Literature Review
A review of literature spanning from 2005 to 2023 reveals a growing trend in
publications related to supply chain demand forecasting, with a focus on BDA
applications. Notable techniques identified in this review include neural net-
works, regression, time-series forecasting (ARIMA), support vector machines,
and decision trees. These techniques showcase the increasing utilization of BDA
in SCM demand forecasting, reflecting a departure from conventional statistical
forecasting approaches. The methodology, merits and demerits of the literature
are discussed in Table 1.
S.No | Title | Proposed Methodology | Merits | Demerits
1 | Improved supply chain management based on hybrid demand forecasts [10] | Neural Networks, Random Forests, Time-Series Forecasting (ARIMA) | Improved accuracy in demand forecasting. | Limited scalability for large-scale data.
2 | Comparative Analysis of Machine Learning Algorithms for Predictive Modeling in Various Domains [11] | Relevance vector machine (KNN), Support Vector Machine, Decision Tree, Genetic Algorithms (GA), LSTM | Contributes to advancing the understanding of machine learning algorithm performance across diverse application domains, facilitating more accurate and robust predictive models. | Time-Series Forecasting (ARIMA, SARIMA) is not considered for seasonal trend data.
3 | Big data analytics in supply chain management: A state-of-the-art literature review [12] | Data-Mining Algorithms, machine learning, predictive modelling | Understanding of the potential benefits and challenges associated with implementing big data analytics solutions in SCM contexts. | Scope of Ensemble Modelling for enhancing the forecasting accuracy.
4 | Demand Forecasting for Textile Products Using Statistical Analysis and Machine Learning Algorithms [13] | Regression models, Decision Trees, and Neural Networks | Improved prediction accuracy. | Widely used but not scalable for large-scale data, and limited capabilities in handling uncertainties.
5 | Daily retail demand forecasting using machine learning with emphasis on calendric special days [14] | Machine Learning Techniques, Optimization | Improved forecast accuracy and resource allocation by capturing the nuanced effects of calendric special days. | Computational intensity due to the use of complex algorithms.
6 | A Comparative Study of Demand Forecasting Models for a Multi-Channel Retail Company: A Novel Hybrid Machine Learning Approach [15] | A hybrid framework (ARIMA and LSTM ML Models) for demand forecasting in multi-channel retail | Improved accuracy through model combination, adaptable to diverse retail scenarios. | Requires hyperparameter tuning, potential complexity in model integration.
7 | An Ensemble-Learning-Based Method for Short-Term Water Demand Forecasting [16] | Ensemble method | Enhanced accuracy and robustness to fluctuating demand patterns. | Complexity in model fusion and potential overfitting with excessive algorithm combinations.

Table 1. Summary of Literature Review

3 Research Objective
An ensemble-based demand forecasting model is proposed. The objectives
are:
– To study the adaptability of big data demand forecasting models
(SARIMA, Prophet) to nonlinear demand variations in the input dataset.
– To extract a geospatial feature using K-Means clustering.
– To identify the optimal parameters for model training using grid
search.
– To design ensemble architectures that integrate heterogeneous models
(SARIMA, Prophet) to achieve superior predictive accuracy and
generalization capabilities across the diverse features considered.
– To harness a big data processing platform (Spark) and the cloud to
run the big data demand forecasting experiments.

4 Methodology
In this work, a time series predictive big data analytics approach is
proposed, using the time series models SARIMA and Prophet together with
an ensemble model. To improve prediction accuracy, optimal parameters
were identified using manual and grid search methods. Fig. 1 depicts a
detailed workflow diagram of the proposed work.

Fig. 1. Methodology Overview

4.1 Dataset Characteristics


The dataset used in this work is IOWA_LIQUOR_SALES [17]. It contains
the spirits purchase information of Iowa Class “E” liquor licensees by
product and date of purchase over 11 years. The dataset is 13.2 GB in
size and contains 29 million rows. It has 24 features, including
information about orders, products, sales, order quantity, spatial
coordinates (longitude and latitude) and shipping details.

4.2 Data Pre-processing


Initially, the IOWA_LIQUOR_SALES dataset is preprocessed by removing null
values using the Pandas library. The following preprocessing steps were done to
improve the accuracy of the models in our proposed work.

Extracting Date Feature: The code extracts various features from the DATE
column, such as year, month, day, and weekday. These features capture different
aspects of time, like seasonal variations and weekly trends. Incorporating date
features enriches the dataset and aids in time-based analysis.
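
The date-feature step can be sketched as follows. This is a minimal illustration rather than the authors' code: the three-row sample and the helper name add_date_features are assumptions, while the column names follow Table 2.

```python
import pandas as pd

# Hypothetical three-row sample; the real dataset has about 29 million rows.
orders = pd.DataFrame({
    "Order_Date": pd.to_datetime(["2021-01-03", "2021-01-04", "2021-02-15"]),
    "Order_Item_Quantity": [12, 7, 20],
})

def add_date_features(df, date_col="Order_Date"):
    """Return a copy of df with calendar features derived from date_col."""
    out = df.copy()
    dates = out[date_col].dt
    out["Year"] = dates.year
    out["Month"] = dates.month
    out["Day"] = dates.day
    out["Weekday"] = dates.dayofweek             # Monday = 0, Sunday = 6
    out["Year_Month"] = dates.to_period("M")     # monthly bucket
    out["Year_Week"] = dates.to_period("W-SUN")  # week ending on Sunday
    return out

features = add_date_features(orders)
print(features[["Order_Date", "Weekday", "Year_Month", "Year_Week"]])
```

The period columns double as grouping keys when the data is later aggregated by month or week.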

Outlier Removal: Potential outliers are identified using the Z-score,
defined as

$$ Z = \frac{X - \mu}{\sigma} \tag{1} $$

where X is the observed value, µ the mean and σ the standard deviation
of the feature. To maintain the distribution, reduce the impact of any
remaining outliers and simplify interpretation, each data point is then
scaled linearly to fit within a specific range using Min-Max Scaling.

Min-Max Scaling transforms features by scaling each value to a range
between 0 and 1. This is done using the equation

$$ X' = \frac{X - X_{min}}{X_{max} - X_{min}} \tag{2} $$
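
Equations (1) and (2) can be applied in sequence as in the sketch below. The cut-off |Z| > 3 is an assumption, since the paper does not state the threshold it used, and the quantities are made-up values.

```python
import numpy as np

def remove_outliers_zscore(x, threshold=3.0):
    """Drop values whose absolute Z-score (Eq. 1) exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return x[np.abs(z) <= threshold]

def min_max_scale(x):
    """Scale values linearly into [0, 1] (Eq. 2)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.zeros_like(x)  # constant feature: nothing to scale
    return (x - x.min()) / span

# Hypothetical order quantities with one obvious outlier (500).
quantities = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 500])

cleaned = remove_outliers_zscore(quantities)  # threshold of 3 is assumed
scaled = min_max_scale(cleaned)
print(cleaned)
print(scaled.round(2))
```

Removing outliers before scaling matters here: if the value 500 stayed in, Min-Max Scaling would compress all ordinary quantities into a narrow band near 0.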

Data Aggregation: The data is aggregated at different time intervals,
such as monthly and weekly. Aggregating data allows for summarizing
information and reducing the dataset’s dimensionality while retaining
essential insights. It provides a broader perspective on trends and
patterns over time.
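
A minimal sketch of weekly and monthly aggregation with pandas is shown below. The sample rows and the helper name aggregate_demand are illustrative assumptions; summing order quantities per period is one plausible aggregation, as the paper does not name the aggregate function.

```python
import pandas as pd

# Hypothetical daily order records.
daily = pd.DataFrame({
    "Order_Date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-09", "2021-02-01"]),
    "Order_Item_Quantity": [5, 3, 4, 10],
})

def aggregate_demand(df, period):
    """Sum order quantities per calendar period ('M' or 'W-SUN')."""
    key = df["Order_Date"].dt.to_period(period)
    return df.groupby(key)["Order_Item_Quantity"].sum()

monthly = aggregate_demand(daily, "M")      # one row per month
weekly = aggregate_demand(daily, "W-SUN")   # weeks ending on Sunday
print(monthly)
print(weekly)
```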

4.3 Feature Engineering


To identify potential features affecting prediction, a subset of the dataset con-
taining only selected features was created, enabling focus on specific variables
of interest. For each selected feature, the Analysis of Variance test is used to
generate F-values and p-values. Features are deemed more significant for analy-
sis based on higher F-values and lower p-values. Subsequently, the dataset was
refined to include only the necessary 10 features for the multivariate experiments
as described in Table 2.
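
The ANOVA screening can be sketched with SciPy's one-way test. The data and the Noise_ID column are hypothetical, and grouping the target by each candidate categorical feature is an assumption about how the test was set up.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical sample: Category_ID drives demand, Noise_ID does not.
df = pd.DataFrame({
    "Order_Item_Quantity": [10, 11, 12, 20, 21, 22, 30, 31, 32],
    "Category_ID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Noise_ID": [1, 2, 3, 1, 2, 3, 1, 2, 3],
})

def anova_for_feature(data, feature, target="Order_Item_Quantity"):
    """One-way ANOVA of the target across the levels of one feature."""
    groups = [g[target].to_numpy() for _, g in data.groupby(feature)]
    result = f_oneway(*groups)
    return float(result.statistic), float(result.pvalue)

f_cat, p_cat = anova_for_feature(df, "Category_ID")
f_noise, p_noise = anova_for_feature(df, "Noise_ID")
print(f"Category_ID: F={f_cat:.2f}, p={p_cat:.4g}")
print(f"Noise_ID:    F={f_noise:.2f}, p={p_noise:.4g}")
```

A feature with a high F-value and a low p-value, such as Category_ID here, separates the target into clearly different groups and is kept; a feature like Noise_ID would be dropped.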
Table 2. Input Features

Column Name | Data Type
Order_Item_Quantity | int64
Cluster | int32
Category_Name | Object
Category_ID | int64
Product_Name | Object
Days_For_Shipping | int64
Order_Date | datetime64[ns]
Year_Month | period[M]
Year_Week | period[W-SUN]

In addition, geographical similarities in demand patterns are
identified by extracting a new feature, Cluster. It helps in
understanding spatial patterns in the distribution of orders and can
aid in optimizing logistics and resource allocation. Before clustering,
the latitude and longitude coordinates in the dataset are standardized
so that both coordinates have the same scale, which is important for
accurate clustering. The standardized geographic data (latitude and
longitude) of orders are then clustered using the K-means algorithm.
The Euclidean distance is used to measure the distance between each
pair of data blocks (d_i, d_j) from the dataset D ((d_i, d_j) ⊆ D).
$$ Dist_{eq}(d_i, d_j) = \sqrt{(x_{d_i} - x_{d_j})^2 + (y_{d_i} - y_{d_j})^2} \tag{3} $$

The optimal number of clusters is identified using the Silhouette score
[18]. It is calculated using the mean intra-cluster distance (a) and
the mean nearest-cluster distance (b) for each sample.

$$ \text{Silhouette Coefficient} = \frac{b - a}{\max(a, b)} \tag{4} $$

The K-means algorithm [19] segregates the geographical coordinates into
clusters, which are then included as a new feature for training the
model.
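
The clustering step can be sketched with scikit-learn as below. The coordinates are synthetic points scattered around three hypothetical locations, and the candidate range of 2 to 6 clusters is an assumption; the paper does not report the range it searched or the number of clusters it selected.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def cluster_coordinates(coords, k_values, seed=0):
    """Standardize (lat, lon), then pick k by the highest silhouette score."""
    scaled = StandardScaler().fit_transform(coords)
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(scaled)
        score = silhouette_score(scaled, labels)  # mean of (b - a) / max(a, b)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_score, best_labels

# Synthetic store locations around three hypothetical (lat, lon) centres.
rng = np.random.default_rng(0)
centers = np.array([[41.6, -93.6], [42.5, -90.7], [41.3, -95.9]])
coords = np.vstack([c + rng.normal(0, 0.03, size=(30, 2)) for c in centers])

best_k, best_score, cluster_feature = cluster_coordinates(coords, range(2, 7))
print(best_k, round(best_score, 3))
```

The returned label array plays the role of the Cluster column in Table 2.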

4.4 Seasonal Decomposition

The seasonal decomposition is performed to gain insights into the
distribution and trends of various variables within the dataset. The
analysis explores the distribution of order item quantities,
investigates temporal trends in total orders, and analyzes the
distribution of orders by categories such as region, category, and
order status. In Fig. 2, neither the seasonal component nor the
residuals are constant over time in the decomposition plot. This
suggests that the data exhibit strong seasonality together with
residual variability.

4.5 Learning Models

The daily order quantities are aggregated and analysed as a time series
to understand patterns and trends. For training, the input data is
split into training and testing sets in an 80-20 ratio.
Model Selection: Based on preliminary studies [10], [11], [12], [13],
[14], SARIMA and Prophet models are used to generate forecasts of
future orders from historical data.

Fig. 2. Decomposition plots for seasonal analysis

SARIMA: For demand forecasting, the SARIMA (Seasonal Auto Regressive
Integrated Moving Average) model is applied to training data that
includes the cluster information. SARIMA is particularly effective for
time-series data because it accounts for seasonality and trends over
time. The model takes into account the historical demand patterns
within each cluster to provide accurate demand forecasts.
Notation: SARIMA(p, d, q)(P, D, Q, s)
– AR(p): Autoregressive component of order p
– I(d): Integrated (differencing) component of order d
– MA(q): Moving average component of order q
– Seasonal AR(P): Seasonal autoregressive component of order P
– Seasonal I(D): Seasonal integrated component of order D
– Seasonal MA(Q): Seasonal moving average component of order Q
– s: Seasonal period

Prophet: Prophet, a robust forecasting tool developed by Facebook, is
used in this work. It handles various sources of uncertainty in
time-series data, including holidays and special events. By applying
Prophet after clustering, the model tailors its forecasting approach to
the specific demand patterns identified within each cluster, leading to
more accurate predictions. Prophet models the series as a sum of
components:

y(t) = g(t) + s(t) + h(t) + ε(t) (5)


– g(t) describes a piecewise-linear trend (or “growth term”),
– s(t) describes the various seasonal patterns,
– h(t) captures the holiday effects,
– ε(t) is a white noise error term.

Ensemble Model: Ensembling in machine learning is a potent strategy
that combines multiple models to improve predictive performance. The
results from the individual SARIMA and Prophet models are integrated
using Weighted Average, Simple Average, and Linear Regression-based
ensemble modelling. The ensemble is built from the two models'
predictions together with the actual order quantities, with the aim of
improving on either model alone. The different ensemble approaches are
compared and their performance is discussed in Section 5.
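
The three combination rules can be sketched with NumPy as follows. The prediction and actual values are made-up numbers, the weight of 0.4 is arbitrary, and fitting the regression by ordinary least squares with an intercept is an assumption; the paper does not describe how its weights or coefficients were obtained.

```python
import numpy as np

def rmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def simple_average(p_sarima, p_prophet):
    return (p_sarima + p_prophet) / 2.0

def weighted_average(p_sarima, p_prophet, w_sarima):
    return w_sarima * p_sarima + (1.0 - w_sarima) * p_prophet

def fit_linear_ensemble(p_sarima, p_prophet, actual):
    """Least-squares fit of actual ~ b0 + b1 * sarima + b2 * prophet."""
    X = np.column_stack([np.ones(len(actual)), p_sarima, p_prophet])
    coef, *_ = np.linalg.lstsq(X, actual, rcond=None)
    return coef

def predict_linear_ensemble(coef, p_sarima, p_prophet):
    return coef[0] + coef[1] * p_sarima + coef[2] * p_prophet

# Hypothetical base-model predictions and observed order quantities.
p_sarima = np.array([100.0, 120.0, 90.0, 130.0, 110.0])
p_prophet = np.array([105.0, 115.0, 95.0, 120.0, 118.0])
actual = np.array([102.0, 118.0, 93.0, 126.0, 115.0])

coef = fit_linear_ensemble(p_sarima, p_prophet, actual)
scores = {
    "simple": rmse(actual, simple_average(p_sarima, p_prophet)),
    "weighted": rmse(actual, weighted_average(p_sarima, p_prophet, 0.4)),
    "linear": rmse(actual, predict_linear_ensemble(coef, p_sarima, p_prophet)),
}
print(scores)
```

On the data it was fitted to, the regression can never score worse than either averaging rule, because both averages are special cases of a linear combination. A fair comparison therefore evaluates all three on held-out data.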

4.6 Model Evaluation


Demand forecasting is a regression problem, so the effectiveness of the
models is evaluated using the following regression error metrics.

Mean squared error: The mean squared error (MSE) is calculated by
squaring the residual error for each data point and then computing the
average:

$$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{6} $$

where y_i is the actual value, ŷ_i the predicted value and n the number
of data points. MSE values can range from 0 to ∞, with smaller values
being preferable.

Root mean squared error: The root mean squared error (RMSE) is the
square root of the MSE, which expresses the error in the same units as
the target variable:

$$ RMSE = \sqrt{MSE} \tag{7} $$

As with MSE, smaller RMSE values are desirable.
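
The metrics can be computed as in the sketch below. MSE and RMSE follow Equations (6) and (7); MAE and MAPE are reported in the results tables but not defined in the text, so their standard definitions are assumed here. The sample values are made up.

```python
import numpy as np

def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and MAPE (in percent) for two equal-length arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = actual - predicted
    mse = float(np.mean(errors ** 2))                 # Eq. (6)
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),                  # Eq. (7)
        "MAE": float(np.mean(np.abs(errors))),        # assumed standard definition
        "MAPE": float(np.mean(np.abs(errors / actual)) * 100),  # requires actual != 0
    }

metrics = regression_metrics([100, 200, 300], [110, 190, 330])
print({name: round(value, 2) for name, value in metrics.items()})
```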

5 Experimental Results and Analysis


For experimentation, a five-node Hadoop cluster was deployed in the
cloud environment. The nodes are installed with Spark 3.3.4. Each node
has a total memory capacity of 3 TB. One of the nodes, serving as the
name node, records the location of all files. The data nodes in the
cluster use DS4v2 instances, each with 8 vCPUs and 28 GB RAM. The
analysis starts by importing the necessary libraries, such as FileIO
from io, SparkSession from pyspark.sql, os, and tensorflow (as tf). A
SparkSession is initialized to work with Apache Spark and is configured
with increased executor memory to handle large-scale data processing
efficiently. The input dataset is then loaded into a Spark DataFrame
and preprocessed by standardizing column names, handling missing
values, and converting data types as needed.

Based on preliminary studies, SARIMA and Prophet models are used to
generate forecasts for future orders from historical data, so both
models are trained in this work. Their performance is evaluated using
the metrics reported in Tables 3 and 4. Fine-tuning the parameters of
the SARIMA and Prophet models could further improve accuracy and
generalization to unseen data. Grid search and manual search are
explored in this work to navigate the high-dimensional parameter space
efficiently and identify optimal configurations.
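
The grid search loop itself is model-agnostic, as the sketch below illustrates. To keep the example dependency-free, the scoring function is a stand-in (simple exponential smoothing with a single parameter alpha) and not SARIMA or Prophet; in the setting described here, the scorer would fit one of those models with the candidate parameters and return its validation error. All names and values are hypothetical.

```python
import itertools
import math

def grid_search(param_grid, score_fn):
    """Try every parameter combination and keep the one with the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, math.inf
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def smoothing_rmse(alpha, series):
    """One-step-ahead RMSE of simple exponential smoothing (stand-in model)."""
    level = series[0]
    squared_errors = []
    for observed in series[1:]:
        squared_errors.append((observed - level) ** 2)
        level = alpha * observed + (1 - alpha) * level
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical steadily rising demand series.
series = [10 + 2 * i for i in range(12)]

best_params, best_score = grid_search(
    {"alpha": [0.1, 0.5, 0.9]},
    lambda alpha: smoothing_rmse(alpha, series),
)
print(best_params, round(best_score, 3))
```

Because the cost grows with the product of the grid sizes, each added hyperparameter multiplies the number of model fits, which is where a distributed platform becomes useful.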

Table 3. Performance comparison of SARIMA with and without
hyperparameter optimization

Metrics | SARIMA | SARIMA [Manual] (0,1,1)(0,1,1,12) | SARIMA [Grid Search] (0,0,0)(0,1,1,12)
MSE | 1520.44 | 1480.83 | 1375.39
RMSE | 43.44 | 39.48 | 37.08
MAE | 36.68 | 32.62 | 30.37
MAPE | 12.99 | 11.99 | 11.44

Fig. 3. Actual vs. Predicted Plot for SARIMA Model with Grid Search

Both manual and grid search methods were employed to determine the
optimal parameters. According to Table 3, the grid search configuration
SARIMA (0, 0, 0)(0, 1, 1, 12) achieved the lowest MSE (1375.39) among
the SARIMA models, ahead of the manually selected
(0, 1, 1)(0, 1, 1, 12) configuration (1480.83). The differences in
performance metrics between the grid search and manual methods were
marginal, which suggests that grid search may be the more practical
choice because it is automated. The comparison of actual and predicted
values for the SARIMA model with grid search is shown in Fig. 3.

Table 4. Performance comparison of PROPHET with and without
hyperparameter optimization

Metrics | PROPHET (Default) | PROPHET [Grid Search]
MSE | 11553.28 | 1333.76
RMSE | 411.42 | 36.52
MAE | 34.18 | 29.59
MAPE | 12.15 | 11.16

Fig. 4. Actual vs. Predicted Plot for Prophet Model with Grid Search

The default configuration of the Prophet model yielded relatively
higher MSE compared to the SARIMA models. After hyperparameter tuning
through grid search, the configuration 'multiplicative', 0.01, 0.01,
0.01, 0.95, 30, TRUE, TRUE, TRUE achieves improved performance,
resulting in lower MSE, RMSE, MAE, and MAPE values (Table 4). The
comparison of Actual and Predicted values for the Prophet Model with
Grid Search is represented in Fig. 4.

Table 5. Performance comparison of the various Ensemble approaches

Metrics | Weighted Average | Simple Average | Linear Regression
MSE | 1341.26 | 1341.60 | 946.64
RMSE | 35.50 | 35.20 | 30.76
MAE | 28.86 | 28.76 | 24.43
MAPE | 10.26 | 10.24 | 8.78

Fig. 5. Actual vs. Predicted Plot for ENSEMBLE Model using Linear Regression

Finally, ensemble models using Weighted Average, Simple Average and
Linear Regression were evaluated. Ensembling combines multiple models
to improve predictive performance: instead of relying on a single
model, it leverages the strengths of several models to make more
accurate predictions. In this work, the predictions from the SARIMA and
Prophet models are combined, and the ensemble model is trained on these
combined predictions together with the actual order quantities. The
results in Table 5 show that the linear regression ensemble achieves
the lowest RMSE (30.76). It performs especially well when the
relationships between the individual model predictions and the target
variable are linear, and it offers interpretability by providing a
coefficient for each model's contribution. The actual vs. predicted
plot for the ensemble model using linear regression is shown in Fig. 5.

6 Conclusion and Future Work

In this work, a time series predictive big data analytics approach is
proposed. Time series models (SARIMA, Prophet) and an ensemble model
are used to predict sales demand from the IOWA_LIQUOR_SALES dataset. In
addition, a geographical feature is extracted using KMeans clustering
and used in the predictive modelling. The experiments were run on a
five-node Spark cluster deployed in the cloud environment. The findings
underscore the efficacy of ensemble modelling in improving forecasting
accuracy, particularly when clustering-based segmentation is integrated
with the SARIMA and Prophet models.
In future studies, ensemble methods such as stacking, boosting, and bagging,
among others, will be investigated. The strengths of multiple models will be com-
bined in novel ways to construct more robust and adaptive ensemble frameworks,
aiming to elevate the predictive performance beyond the baseline established.

References

1. Huang, H., Zhang, Z., & Song, F. (2021). An Ensemble-Learning-Based Method
for Short-Term Water Demand Forecasting. Water Resources Management, 35 (6),
1757–1773.
2. Jahani, H., Jain, R., & Ivanov, D. (2023). Data science and big data analytics: a
systematic review of methodologies used in the supply chain and logistics research.
Annals of Operations Research.
3. Huang, L., Xie, G., Zhao, W., Gu, Y., & Huang, Y. (2023). Regional logistics de-
mand forecasting: a BP neural network approach. Complex & Intelligent Systems,
9 (3), 2297–2312.
4. Hosseinnia Shavaki, F., & Ebrahimi Ghahnavieh, A. (2023). Applications of deep
learning into supply chain management: a systematic literature review and a frame-
work for future research. Artificial Intelligence Review, 56 (5), 4447-4489.
5. Bohanec, M., Borštnar, M. K., & Robnik-Šikonja, M. (2017). Explaining machine
learning models in sales predictions. Expert Systems with Applications, 71, 416-
428.
6. Constante, F., Silva, F., & Pereira, A. (2019). DataCo smart supply chain for big
data analysis.
7. Gholizadeh, H., Tajdin, A., & Javadian, N. (2018). A closed-loop supply chain
robust optimization for disposable appliances. Neural Computing and Applications.
8. Chaturvedi, S., Mishra, N., Govindan, K., & Cheng, T.C.E. (2018). Big data ana-
lytics and application for logistics and supply chain management. Transportation
Research Part E: Logistics and Transportation Review, 114, 343–359.
9. Rejeb, A., Keogh, J.G., & Rejeb, K. (2022). Big data in the food supply chain: a
literature review. Journal of Data, Information and Management, 4 (1), 33–47.
10. Aburto, L., & Weber, R. (2007). Improved supply chain management based on
hybrid demand forecasts. Applied Soft Computing, 7(1), 136-144.
11. Mahmoud, N. (2023). Comparative Analysis of Machine Learning Techniques for
Predictive Modeling in Social and Infrastructural Systems. Emerging Trends in
Machine Intelligence and Big Data, 15(11), 36-42.
12. Nguyen, T., Li, Z. H. O. U., Spiegler, V., Ieromonachou, P., & Lin, Y. (2018). Big
data analytics in supply chain management: A state-of-the-art literature review.
Computers & operations research, 98, 254-264.
13. Lorente-Leyva, L. L., Alemany, M. M. E., Peluffo-Ordóñez, D. H., & Araujo, R.
A. (2021, April). Demand forecasting for textile products using statistical analysis
and machine learning algorithms. In Asian Conference on Intelligent Information
and Database Systems (pp. 181-194). Cham: Springer International Publishing.
14. Huber, J., & Stuckenschmidt, H. (2020). Daily retail demand forecasting using
machine learning with emphasis on calendric special days. International Journal of
Forecasting, 36(4), 1420-1438.
15. Mitra, A., Jain, A., Kishore, A., & Kumar, P. (2022, September). A comparative
study of demand forecasting models for a multi-channel retail company: a novel
hybrid machine learning approach. In Operations research forum (Vol. 3, No. 4, p.
58). Cham: Springer International Publishing.
16. Huang, H., Zhang, Z., & Song, F. (2021). An ensemble-learning-based method for
short-term water demand forecasting. Water Resources Management, 35, 1757-
1773.
17. https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy/data

18. Kodinariya, T. M., & Makwana, P. R. (2013). Review on determining number of
Cluster in K-Means Clustering. International Journal, 1(6), 90-95.
19. Sinaga, K. P., & Yang, M. S. (2020). Unsupervised K-means clustering algorithm.
IEEE access, 8, 80716-80727.
