Optimizing Supply Chain Dynamics Using Machine Learning

The thesis by Mohammad Zubin Siddiqui explores the optimization of supply chain dynamics through machine learning, focusing on demand forecasting and late delivery analysis. It proposes a comprehensive analytic framework that integrates advanced analytics to enhance supply chain efficiency and resilience. The study aims to provide insights for industries to build more adaptive and data-driven supply chain management practices amidst dynamic business environments.

Rochester Institute of Technology

RIT Digital Institutional Repository

Theses

Spring 2024

Optimizing Supply Chain Dynamics using Machine Learning


Mohammad Zubin Siddiqui
[email protected]

Follow this and additional works at: https://repository.rit.edu/theses

Recommended Citation
Siddiqui, Mohammad Zubin, "Optimizing Supply Chain Dynamics using Machine Learning" (2024). Thesis.
Rochester Institute of Technology. Accessed from

This Thesis is brought to you for free and open access by the RIT Libraries. For more information, please contact
[email protected].
Optimizing Supply Chain Dynamics using Machine
Learning

By

Mohammad Zubin Siddiqui

A Thesis Submitted in Partial Fulfillment of the Requirements for the
Degree of Master of Science in Professional Studies: Data Analytics

Department of Graduate Programs & Research

Rochester Institute of Technology


RIT Dubai
Spring 2024

RIT
Master of Science in Professional Studies:
Data Analytics

Graduate Thesis Approval

Student Name: Mohammad Zubin Siddiqui

Graduate Thesis Title: Optimizing Supply Chain Dynamics using Machine Learning

Graduate Thesis Committee:

Name: Dr. Sanjay Modak Date:


Chair of committee

Name: Dr. Ehsan Warriach Date:


Mentor

Acknowledgement
I would like to express my sincere thanks to all those who have contributed to the completion of
this Thesis study.
First & foremost, I extend my sincere appreciation to Prof. Ehsan Warriach for his priceless
direction, support and guidance from the beginning till the end of this process. My research has
been shaped a lot by his expertise & insights, and I have successfully navigated many difficulties
under his guidance.

Also, I want to thank the faculty members of Data Science at Rochester Institute of Technology
for their support and encouragement throughout this journey. Their feedback and suggestions
have helped me a lot in polishing my work and improving my ideas.

Finally, I am truly grateful to my family members and friends who have always stood by me,
providing unending love, motivation and belief in myself during all these years as a student.
Their unwavering backing has been a great source of strength to me; I appreciate them so much.
Thank you very much for everything you have done towards my success.

Abstract
Supply chains face many challenges in today’s business environment. This thesis examines how
advanced analytics can be used to improve supply chain efficiency and resilience, with a
particular focus on two areas: demand forecasting and late delivery analysis.
With data-driven models that forecast demand and pinpoint the reasons behind delayed
shipments, organizations can take appropriate measures to reduce risk and enhance operational
efficiency. A complete analytic framework is proposed that integrates advanced analytics into
supply chain management so that organizations can optimize their operations holistically.
Through a literature review and supporting studies, the thesis examines how predictive
modeling, machine learning and other analytical techniques can be used to enhance performance
across the entire supply chain. It provides valuable information for industries looking to build
resilient yet efficient supply chains in dynamic business settings.
Keywords: Supply Chain Management, Advanced Analytics, Predictive Analysis, Regression
Analysis, Demand Forecasting, Late Delivery Analysis, Classification Models, Data Driven
Decision Making, Inventory Management, Operational Efficiency

Table of Contents
CHAPTER 1: Introduction
1.1 Introduction
1.2 Background Information
1.3 Project Goals
1.3.1 Research Questions
1.4 Research Methodology
1.4.1 Demand Prediction
1.4.2 Late Delivery Analysis
1.5 Limitations of the Study
CHAPTER 2: Literature Review
2.1 Introduction
2.2 Literature Review
2.3 Key Takeaways from the Literature Review
CHAPTER 3: Project / Data Description
CHAPTER 4: Analysis
4.1 Data Preprocessing
4.2 Feature Engineering
4.3 Exploratory Data Analysis
4.3.1 Customer Segment Analysis
4.3.2 Market Analysis
4.3.3 Product Category Analysis
4.3.4 Revenue vs Late Delivery
4.3.5 Delivery Status
4.3.6 Shipping Modes
4.3.7 Delivery Status by Shipping Mode
4.3.8 Payment Method
4.4 Ordinary Least Squares
4.4.1 Linear regression utilizing usual least squares (OLS)
4.4.2 OLS Regression
4.4.3 Linear Regression Equation
CHAPTER 5: Data Modeling
5.1 Order Item Quantity Regression Models
5.2 Classification Models – Late Delivery
CHAPTER 6: Results
CHAPTER 7: Conclusions
7.1 Supply Chain Issues as shown by the Dataset used
7.2 Conclusions
REFERENCES

List of Figures
Fig 1: Heatmap to find out important parameters
Fig 2: Heatmap of Important Parameters
Fig 3: Customer Segment Analysis
Fig 4: Number of Orders per Customer Segments
Fig 5: Market Analysis
Fig 6: Number of Orders per Region
Fig 7: Product Category Analysis
Fig 8: Products & Regions with Highest Profit
Fig 9: Top 10 Products & Regions with most Late Deliveries
Fig 10: Delivery Status
Fig 11: Shipment Modes
Fig 12: Delivery Status by Shipping Mode
Fig 13: Payment Method
Fig 14: Visualization of Linear Regression Model of order_item_total for different predictors - 1
Fig 15: Visualization of Linear Regression Model of order_item_total for different predictors - 2

List of Tables
Table 1: OLS Regression Results
Table 2: Regression Coefficient Table
Table 3: OLS Regression Results for Predictors having p-value < 0.05
Table 4: Regression Coefficient Table for Predictors having p-value < 0.05
Table 5: Comparison between Regression Models
Table 6: Comparison between different Classification Models for Late Delivery

CHAPTER 1: Introduction
1.1 Introduction
Supply chains are the backbone of global commerce because they enable goods and services to
move seamlessly across vast networks. This intricacy, however, brings many challenges, such as
erratic demand patterns and delayed deliveries, that greatly reduce supply chain efficiency.

1.2 Background Information


Supply chains have become more crucial than ever in the interconnected world of trade, where
they ensure that products flow efficiently from production to consumption points. To stay
competitive, supply chains must be able to adapt, remain resilient and optimize their operations.
However, these systems are themselves complex and face a range of interrelated complications
that can disrupt operations and degrade performance.

Fluctuating Demand Patterns:


A central problem in managing supply chains is dealing with the uncertainty of demand.
Consumer behavior, market forces and external factors together make demand highly
unpredictable. Errors in predicting demand can result in inventory imbalances, vulnerability to
stockouts and higher holding costs, posing major sustainability challenges for the organizations
involved.

Late Deliveries:
Customer satisfaction depends on whether goods arrive on time, which makes timely delivery
critical to the success of any business. Late deliveries lead to dissatisfied customers and
increase operational costs through repeat orders and penalties for breaching agreements.
Understanding what causes shipping delays is therefore a necessary part of proactive measures
to enhance supply chain efficiency.
These problems can be addressed with advanced analytics such as machine learning, data
mining and predictive modeling. Organizations can use these tools to identify historical trends
and detect outliers, and the insights gained can help solve some, if not all, of these issues.
Traditional methods, however, tend to treat demand forecasting and late deliveries in isolation
rather than jointly with the rest of the supply chain system, which limits the usefulness of
analytics.

This study seeks to fill this gap by proposing an inclusive analysis framework that
incorporates advanced analytical techniques for predicting demand and evaluating the reasons
behind late deliveries. The approach treats these problems as integral parts of the wider supply
chain, enabling firms to respond better to market changes while strengthening their ability to
cope with disruptions that threaten operational continuity at different points along delivery
paths.

1.3 Project Goals


The main aim of this study is to develop a complete plan that boosts the resilience and
efficiency of supply chains through the strategic use of advanced analytics. The project
concentrates on two key areas, demand prediction and the analysis of delayed deliveries, with the
aim of making supply chain management adaptive and data-driven.

1.3.1 Research Questions


1) Demand Forecasting
● How accurately can future sales be predicted from past sales records?
● Among statistical and machine learning models, which give the most accurate demand
forecasts and thereby support improved inventory control?
2) Late Deliveries
● What are the leading root causes of late deliveries at different points along the
supply chain?
● How can these root causes best be identified and addressed to reduce delays and
improve delivery performance?
3) Data-Driven Insights
● How can data-driven insights be used within the supply chain ecosystem to minimize
on-time delivery failures?

1.4 Research Methodology


Research Approach
To address the goals of supply chain management regarding demand prediction and late delivery
analysis, this study employs both regression and classification models.

1.4.1 Demand Prediction
Regression Models
To predict “Order Item Quantity,” several regression models are used: a Random Forest
Regression Model, a Decision Tree Regression Model and a Linear Regression Model.
These models use historical sales data to estimate future demand, allowing organizations to make
sound supply decisions and to project the total revenue from sales.
Evaluation metrics: Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are used to
measure how accurately these regression models forecast outcomes.
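The two error measures can be illustrated with a minimal sketch in plain Python. The function names and the sample order quantities below are hypothetical and chosen only for illustration; they are not taken from the thesis dataset or code.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference; penalizes large errors more."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical order item quantities (assumption, for illustration only).
actual_qty = [1, 3, 2, 5]
predicted_qty = [2, 3, 4, 5]

mae = mean_absolute_error(actual_qty, predicted_qty)        # (1 + 0 + 2 + 0) / 4 = 0.75
rmse = root_mean_squared_error(actual_qty, predicted_qty)   # sqrt((1 + 0 + 4 + 0) / 4)
print(f"MAE={mae:.3f} RMSE={rmse:.3f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions, and a large gap between the two suggests a few large misses rather than uniformly small ones.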
Handling Object Type Data: Object-type (non-numeric) columns are handled in a way that does
not degrade the performance of the model. Deleting such columns is not advisable because it
could lead to weaker models.
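One common way to keep an object-type column rather than delete it is to map each category to an integer code. The sketch below assumes label encoding purely as an illustration; this passage does not state which encoding the study applies, and the function name and sample shipping-mode values are hypothetical.

```python
def label_encode(values):
    """Map each distinct category to an integer code, in order of first appearance.

    Returns the encoded list and the category-to-code mapping.
    """
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)  # next unused integer
        encoded.append(codes[value])
    return encoded, codes


# Hypothetical object-type column (assumption, for illustration only).
shipping_mode = ["Standard Class", "First Class", "Standard Class", "Same Day"]

encoded, mapping = label_encode(shipping_mode)
print(encoded)  # [0, 1, 0, 2]
print(mapping)
```

Integer codes imply an ordering that the categories may not have. Tree-based models are usually tolerant of this, while for linear models a one-hot encoding is often the safer choice.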

1.4.2 Late Delivery Analysis


Classification Models
Several classification models are used here: Logistic Regression, Linear Discriminant Analysis,
Gaussian Naive Bayes, Support Vector Machines and Random Forest classification. The purpose is
to predict whether a delivery will be late or on time based on its associated features.
These methods consider different delivery attributes to indicate whether a delivery is likely to
arrive behind schedule or within the expected period.
Evaluation metrics: Accuracy, recall and F1 score are used to evaluate how well a given model
classifies deliveries.
Prediction Outcomes Evaluation:
True Positive (TP): A late delivery is correctly predicted as late, so that action can be taken
early enough to minimize the impact.
True Negative (TN): An on-time delivery is correctly predicted as on time.
False Positive (FP): An on-time delivery is wrongly predicted as late, leading to wasted
resources.
False Negative (FN): A late delivery is wrongly predicted as on time, so the chance to detect a
potential delay is missed.
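The four outcomes and the metrics built on them can be illustrated with a minimal sketch in plain Python, treating "late" as the positive class (label 1). The function names and the sample labels are hypothetical; precision is computed only because the F1 score needs it.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN, with `positive` marking a late delivery."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # late, predicted late
        elif p != positive and a != positive:
            tn += 1  # on time, predicted on time
        elif p == positive:
            fp += 1  # on time, wrongly predicted late
        else:
            fn += 1  # late, wrongly predicted on time
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical labels: 1 = late, 0 = on time (assumption, for illustration only).
actual_late = [1, 1, 0, 0, 1, 0, 1, 0]
predicted_late = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(actual_late, predicted_late)  # (3, 3, 1, 1)
print(counts, classification_metrics(*counts))
```

Recall is the metric most directly tied to the False Negative case: it falls whenever a late delivery is predicted as on time.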

The objective of this research design is to improve the efficiency and resilience of supply chains
through accurate demand forecasting and identification of the causes of delayed deliveries using
advanced analytics. The evaluation will provide insights into which regression and classification
models work best, and under what conditions, for these two aspects of supply chain management.

1.5 Limitations of the Study


Although this study sheds light on demand forecasting and late delivery analysis in supply chain
management, it has some limitations.
First, the accuracy of both the demand forecasts and the late delivery predictions depends on
data quality and completeness, among other factors. For example, incomplete or incorrect
information may degrade the performance of the regression and classification models and give
suboptimal results.
Second, supply chains are complex systems whose numerous components interact in different
ways that affect the time taken for goods or services to reach customers (the delivery period).
Developing an all-inclusive model that covers every possible variable is therefore difficult,
given the many factors involved, such as the transportation modes used and the types of products
handled at various stages.
Third, the study has limited external validity: the findings may not generalize beyond the
specific setting used here because the dataset employed may not represent typical situations
elsewhere.
Even with these shortcomings, the study contributes useful knowledge to the supply chain
management field.

CHAPTER 2: Literature Review
2.1 Introduction
The literature review forms a foundation for this research by surveying prior scholarship and
theoretical frameworks on demand forecasting and late delivery analysis in the context of supply
chain management. This section draws on scholarly works such as academic papers, books and
industry reports to examine the theories that guide the methods used in this study. The intention
is to review what has already been done in supply chain analytics, show where more investigation
is needed and identify areas for innovation that build on what has been achieved to date.

2.2 Literature Review


P Anitha & MM Patil (2018) review data analytics in supply chain management and discuss how
it can improve the efficiency, effectiveness and resilience of a supply chain system.
The article classifies data analytics for supply chain management (SCM) into three types:
descriptive, predictive and prescriptive. Descriptive analytics is concerned with understanding
what has happened in the past, predictive analytics forecasts what is likely to happen in the
future, and prescriptive analytics recommends the best course of action. The article then
discusses the challenges and opportunities of using data analytics in SCM. The challenges
include the volume, velocity, variety and veracity of data. The opportunities include enhanced
forecasting, inventory management, transportation management and customer relationship
management.
The article also gives examples of companies that have applied these kinds of analytics in
their operations with positive outcomes. In one case, predictive analytics is used for demand
forecasting to avoid inventory shortages and overstocking, which also saves on storage costs.
In another, descriptive analytics is used to track performance at various points along the supply
chain, enabling firms to pinpoint the areas that most need improvement.
Lastly, the article looks ahead to the future of data analytics in SCM. Based on the trends
observed so far, it anticipates growing adoption across industries and further automation of
supply chain processes, not only in warehouses and distribution centers but across the wider
enterprise.
Govindan, Cheng, Mishra & Shukla (2018) present and analyze opportunities for improvement in
big data analytics and its applications in logistics and supply chain management. They do this
by investigating technology-based tracking strategies, the relationship between data-driven SCM
and financial performance, implementation challenges, and supply chain capability maturity
models that use large datasets.
The advent of Web 2.0, Industry 4.0, the IoT and other digital technologies has generated
considerable discussion of big data and data analysis. Huge volumes of information are now
collected from sources such as ERP systems, distributed manufacturing environments, orders and
shipping logistics, customer buying patterns, social media feeds, product lifecycle operations,
and technology-driven data-capture points such as GPS, RFID tracking, mobile devices and
surveillance video. Organizations therefore have to deal with massive data sets characterized by
the four Vs: volume, velocity, variety and veracity.
According to an IDC report, the big data technology market was projected to grow at a CAGR of
26% over five years, reaching $41.5 billion in revenue by 2018, which implies a growing need to
manage and analyze this information correctly. Ongoing research has already produced tools that
support data-driven decision-making in supply chains, but these have lacked wider applicability
because of their limited scope. Results that can be interpreted in real time could help
enterprises improve their decision-making and supply chain designs while reducing the costs and
risks of managing such complex structures at different levels globally.
Rozados & Tjahjono (2014) address the state of big data analytics in supply chain management
(SCM), covering recent trends and related research. The paper examines the challenges and
opportunities that come with big data analytics in SCM as well as the types of big data analytics
that can enhance SCM performance.
The article identifies several important trends in big data analytics within SCM:
● Increasing volume, velocity and variety of data: As more information is produced by
different sources, organizations involved in SCM must manage this influx efficiently so
that it can be analyzed properly.
● Growing use of cloud-based big data analytics platforms: The flexibility and scalability
of cloud-based systems allow SCM organizations to store, process and analyze big data
sets at an unprecedented scale.
● Adoption of predictive analytics: Predictive models can help forecast demand, optimize
inventory levels and improve supply chain risk management.
● Use of social media data: Social media posts contain valuable information about
customer behavior which, when analyzed, can improve marketing strategies and
increase sales through a better understanding of customers’ needs.
● Use of sensor data: Sensor-generated signals, such as those from RFID tags, provide
real-time information about where goods are located in transit networks, aiding
decisions such as route planning.

The challenges that arise when these analyses are conducted within SCM include:
● Lack of skills and knowledge: Many practitioners lack the skill sets needed to work
effectively with the large volumes and varied types of data commonly encountered in
their industries.
● Systems integration: Integrating the existing systems and applications used for
traditional, smaller-scale analyses with new ones designed to process vast quantities
of data may prove complex and time consuming because of functional and technical
differences between them.
● Data governance and privacy: Clear policies and procedures are needed to govern the
privacy concerns associated with collecting and using massive volumes of data that
contain highly sensitive personal details.

The benefits likely to be achieved through big data analytics in SCM include:
● Better decision making: Organizations can make more informed business decisions based
on insights gained from analyzing large sets of data.
● Cost savings: Businesses can save money by optimizing inventory levels and improving
efficiency along the supply chain, both of which big data analysis makes possible; it
also reduces waste in general.
● Higher customer satisfaction: Information collected from various sources can be used to
personalize marketing campaigns around relevant products and services, targeting the
needs and preferences of each client individually so that customers feel valued, which
leads to repeat purchases.

Examples provided in the article include:
● Forecasting demand: This helps organizations predict what consumers are likely to want
in the coming year, so that they can stock those items adequately while avoiding both
overstocking and stock-outs.
● Optimizing inventory levels: Slow-moving goods can be identified quickly, reducing the
waste caused by holding excess stock. Analytics can also show how best to use the
limited space available in warehouses and other storage facilities while ensuring that
required supplies remain readily accessible at all the locations a company serves.
● Improving customer service: Businesses can resolve issues faster, and even before they
occur. For example, a business could identify patterns, trends and abnormal behavior,
then proactively contact the affected parties through appropriate channels (emails,
phone calls) to ask whether everything is in order or to offer help where necessary.

Boone, Ganeshan, Jain & Sanders (2019) examine the relevance of sales forecasting within
supply chain management and how its accuracy can be enhanced in the big data era through
customer analytics. Sales forecasting refers to predicting future demand for a product or
service. Precise sales forecasts are crucial for supply chain planning and execution because
they enable organizations to optimize inventory levels, production schedules and transportation
costs.
The following issues arise when trying to predict sales volume:
● Supply chain complexity: The system is complicated, with many factors that can
influence demand, such as economic conditions and competitor activities.
● Uncertain future: Future events cannot be predicted with certainty, which makes
accurate demand estimation difficult.
● Data volume, velocity and variety: Forecasters now have access to massive amounts of
data, but collecting, storing and analyzing it all is challenging, and not all of it is
relevant to prediction.

The paper then discusses how consumer analytics can help improve the accuracy of sales
forecasting. Consumer analytics refers to gathering, scrutinizing and interpreting information
about customers in order to identify patterns in their behavior, thereby improving predictions
of what they are likely to buy in a given season.

The paper gives several instances where consumer analytics has been applied to increase the
precision of sales projections:
● Using social media data to track changes in attitude: Social media data helps
forecasters keep track of how people feel about a given product or service. This allows
them to detect shifts in demand that may not show up in traditional purchase records.
● Using website data to monitor customer behavior: Website usage statistics, such as
product views and their duration, offer insights into customer preferences that go
beyond transactional information and can inform stocking decisions in light of current
trends and expected orders.
● Leveraging point-of-sale records to trace revenue patterns: POS systems record what
was bought, where and when. Such data helps identify demand shifts that traditional
sales records may not show.

Ittmann (2015) presents insights into how big data and business analytics can transform
supply chain management (SCM) by giving organizations more visibility into their operations
and enabling them to make better decisions.

In SCM, big data can be used to:


● Track the movement of goods and materials across the supply chain
● Detect patterns or trends in customer demand
● Forecast disruptions within the supply chain
● Optimize stock levels
● Enhance transport efficiency
● Cut down costs

Business analytics can be employed to:


● Analyze big data for insights on supply chain performance
● Develop models that predict future demand and supply levels
● Make informed choices based on facts rather than guesswork, with the aim of enhancing
efficiency or effectiveness at any point in an organization’s supply chain network

The use of big data and business analytics in SCM is still in its infancy but has the
potential to make significant contributions to efficiency and effectiveness throughout the
supply chain.

Jeble, Dubey, Childe, Papadopoulos, Roubaud & Prakash (2017) explain how big data can be used
to tackle sustainability issues in the supply chain. Big data can be used for:
● Tracking the environmental impact of supply chains: Energy consumption, water usage
and hazardous materials are among the things that can be tracked with big data. This
information can also be used to locate the areas where changes are most needed.
● Recognizing patterns and trends in customer demand: Big data analytics platforms such
as Hadoop or Spark can identify patterns and trends in customers’ demand for
sustainable goods and services over time. These insights can then guide the design of
green products that meet customer needs while minimizing their carbon footprint.
● Forecasting supply chain disruptions: Combined with predictive modeling techniques,
such as machine learning algorithms built on Apache Mahout libraries, big data can help
anticipate events likely to disrupt the continuity and sustainability of a supply chain,
for example natural disasters (floods, earthquakes) and political unrest (strikes,
riots). This information can then be used to develop contingency plans that reduce the
negative impact of these threats.
● Optimizing inventory levels: Supply chain management systems integrated with
large-scale optimization tools, built on the Apache Hadoop Distributed File System
(HDFS) and the MapReduce programming paradigm, support stock keeping unit (SKU)
selection based on cost minimization under the service level required of each SKU
across multiple locations, in a networked environment of suppliers, manufacturers,
distributors, retailers and consumers.
● Enhancing transport efficiency: Big data analytics has also proven effective for
sustainability goals by improving the efficiency of transportation networks, for
example through route optimization algorithms that take into account real-time traffic
conditions alongside fuel consumption rates estimated from historical data.

Seyedan & Mafakheri (2020) discuss predictive big data analytics as a potent tool for
supply chain management. It can be used to improve efficiency, optimize inventory levels and
reduce costs. Predictive analytics can also be used to identify and mitigate risks, such as supply
chain disruptions.
Several predictive big data analytics techniques may be employed in supply chain management,
including:

● Time series forecasting: This technique estimates future demand for products or
services by analyzing historical sales volumes measured at regular intervals, e.g.,
monthly sales volumes over a five-year period.
● Clustering: Customers who show similar consumption behavior are grouped together
based on their transactional records, for example records captured using RFID
technology and processed with MapReduce programs on the Apache Hadoop framework
for large-scale parallel clustering.
● Classification: The goal is to predict which of a given set of categories an item
belongs to, given some information about it; e.g., predicting whether a customer will
churn based on his/her past purchasing history, recorded electronically using
machine-readable barcodes alongside other relevant metadata stored in HBase tables
on HDFS and processed with Apache Spark as part of a big-data infrastructure
designed for handling immense volumes of structured/unstructured data …
● Regression: A response variable is modeled as a function of one or more predictor
variables. The key point is that there must be a functional relationship (linear, in
the simplest case) so that a change in the response variable can be expressed
mathematically in terms of changes in the predictor variables.
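
To make the regression idea concrete, the following minimal sketch fits a straight line to a
short monthly sales series by ordinary least squares and extrapolates one month ahead. The
function name fit_line and the sales figures are hypothetical and chosen only for illustration;
they do not come from the reviewed paper.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical monthly sales volumes (illustrative values only).
months = [1, 2, 3, 4, 5, 6]
units_sold = [110, 120, 130, 140, 150, 160]

slope, intercept = fit_line(months, units_sold)
forecast_month_7 = intercept + slope * 7  # extrapolate one period ahead
print(slope, intercept, forecast_month_7)
```

In practice the response would depend on several predictors and the relationship need not be
linear; the single-predictor case is used here only because it can be checked by hand.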

A study by Seifi, Sepehri, Hosseinian-Far & Darvish (2022) analyzes the use of machine
learning (ML) techniques for detecting fraud within the supply chain. The authors discuss the
difficulties and opportunities of ML-based supply chain fraud detection, as well as the kinds of
algorithms that can be employed to detect fraudulent activities.
Supply chain fraud refers to any dishonest act done purposely to gain an unfair advantage during
business transactions at the expense of other stakeholders along the value chain. The
researchers emphasize the financial implications as well as the reputational damage that such
fraud causes to organizations.
The article then turns to the problems faced when trying to identify where fraud occurs within
supply chains. These challenges include the large amounts of data involved, the complexity of
the networks, and the constantly changing forms that fraud takes. According to the authors,
traditional ways of detecting fraudulent transactions are often ineffective because they fail to
address these issues.
The paper introduces ML as a suitable way of detecting fraud committed in the context of supply
chains, because machine learning algorithms are designed to detect patterns in large data sets,
such as anomalies or deviations from the norm, that might indicate some form of cheating. It has
been estimated that supply chain fraud costs businesses US $2 trillion per year.
Several types of machine learning techniques can be employed to detect such fraud, including:
● Supervised learning: Models are trained on a labeled dataset and then classify new
cases as either fraudulent or genuine.
● Unsupervised learning: No pre-labeled examples are required; the algorithm tries to
find clusters that might represent different classes.
● Anomaly detection: Outliers that do not follow expected patterns are identified, since
they could indicate the presence of fraudulent activity.
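
As a rough illustration of the anomaly detection idea, the sketch below flags values that lie
more than a chosen number of standard deviations from the mean (a simple z-score rule). This is
not the method of the reviewed paper; the function name flag_anomalies, the threshold, and the
invoice amounts are all assumptions made for the example.

```python
from statistics import mean, stdev


def flag_anomalies(values, threshold=2.0):
    """Return the indices of values whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical invoice amounts; the last one is deliberately unusual.
invoice_amounts = [100, 102, 98, 101, 99, 103, 97, 100, 500]
suspicious = flag_anomalies(invoice_amounts)
print(suspicious)
```

A flagged transaction is only a candidate for review, not proof of fraud. With few
observations a single extreme value also inflates the standard deviation, so the threshold
has to be chosen with the sample size in mind.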

In their work titled “A Critical Review of Machine Learning Techniques for Supply Chain
Management”, Wenzel et al. (2019) give a general idea of how ML could enhance the efficiency
and effectiveness of SCM. They further highlight some of the challenges that can be faced during
adoption of this technology, such as data quality, privacy, and the need for specialized skills.
However, they still believe that the merits outweigh the demerits and that ML should therefore
become an integral part of future supply chain management systems.
ML applications in SCM:
● Demand forecasting: By predicting what customers will need next, organizations can
plan to meet that demand, thereby optimizing inventory levels as well as production
planning and pricing strategies.
● Inventory management: Algorithms can identify when stocks are running out and
recommend replenishment quantities based on predicted demand patterns over the
period covered.
● Production scheduling: Such an algorithm can help sequence orders so that no
machines or resources sit idle at any given point in time; it also aids in predicting
breakdowns.

Bushuev (2017) discusses what can be done to increase the efficiency of two-stage supply
chains, which are commonly used in distribution channels for goods and services. A two-stage
supply chain consists of a supplier and a retailer: the supplier produces goods and delivers them
to the retailer, which sells them to consumers.
In general, delivery time and cost are the major components that determine the performance of a
two-stage supply chain. Delivery time here means the period between when an item is
shipped by its manufacturer and when it reaches its final destination, such as store shelves or
customers’ homes, while cost refers to all expenses incurred along this process, including
transportation fees.

To improve delivery performance within two-stage supply chains, Bushuev suggests several
strategies such as:
● Setting up a “delivery window” – This is a time frame during which items should
arrive at their destinations. A more specific delivery window reduces variance in
arrival times.
● Introducing penalties for late deliveries – Suppliers who fail to meet their deadlines
can be charged fines based on each unit’s value or any other appropriate criterion.
This encourages them not only to deliver but also do so promptly.
● Enhancing demand forecast accuracy – Accurate estimation smoothes out stock levels
thus enabling faster order fulfillment rates which translate into better services
rendered per given period.
● Creating cooperative frameworks – Establishing mechanisms through which
providers share information while collaborating towards common goals like timely
shipments.

Chu, C. W., and Guoqiang Peter Zhang (2013) compare linear and nonlinear models for
predicting aggregate retail sales. Traditional seasonal forecasting methods were
employed, including time series models and regression models with seasonal dummy variables;
nonlinear models were also applied using neural networks, which serve as universal
approximators of unknown functions.
The research revealed that nonlinear models can improve out-of-sample performance if
handled well, especially when the historical data are adjusted before fitting: a neural
network performs better once seasonality has been removed. Overall, the best forecaster found
by this study was a neural network fed with the deseasonalized series.
The paper concludes that although seasonal dummy variables may help formulate effective
regression models for retail sales projections, they lack robustness, while trigonometric
models do not work for aggregate retail sales forecasting.
According to Keung et al. (2021), machine learning models can be used to forecast shipment
delays and sales. They tried different algorithms, such as Naive Bayes, k-nearest neighbor
(KNN), decision tree, and artificial neural network (ANN).
The work is a case study of a French supermarket chain sourcing its products from Asia. The
authors divided the data into two parts, one tracking shipments and the other covering sales
information. The two datasets span two years, from 2018 to 2020.
The authors trained several machine learning models and checked their accuracy on test sets.
The decision tree model yielded the best results of all these models, both in predicting sales
and in forecasting shipment delays.

This work demonstrates how businesses could use ML systems to enhance the efficiency of
operations management processes within retail organizations.
Tirkolaee and Babaee (2021) examine various machine learning techniques, in which computers
learn from experience, as applied in supply chain management (SCM). They
highlight different areas where ML can significantly improve SCM, including:
● Supplier choice and classification.
● Risk estimation along supply chains.
● Demand prediction & sales estimation.
● Production process optimization.
● Inventory control methods design.
● Transportation planning & execution strategies formulation.
● Environmentally friendly sustainable development interventions planning (Circular
economy thinking).

In their research, Kache, Florian and Seuring (2017) have evaluated the potential of big data
analytics in supply chain management (SCM). The authors identified several challenges as well
as opportunities that arise with the use of big data in SCM.
Challenges
● Data quality and integrity: Big data may not be consistent which can result in
inaccurate forecasting and decision making.
● Data security: Huge volumes of information are prone to cyber-attacks.
● Lack of skilled workforce: Organizations need to train their staff on how to collect,
analyze and interpret big data.
Opportunities
● Better forecasting: Demand forecast accuracy can be improved using big data
analytics thereby enabling firms to optimize inventory levels and reduce costs.
● Enhanced risk management: Big Data Analytics can also help identify risks along the
supply chain such as those caused by natural calamities or political unrests among
others and put in place mitigation measures for them.
● Efficient operations: Transportation routes optimization, warehouse operations
management etcetera can all benefit from big data analytics within the supply chain
milieu.

Overall, this paper suggests that big data analytics has the capacity to make SCM more
efficient, effective, and robust. However, a number of challenges must be resolved before the
full benefits of big data analytics can be realized within SCM.

Naik et al. (2022) propose a system that predicts sales by analyzing different factors affecting
customer satisfaction as well as product quality. Customer reviews, seller profiles, order details,
and product attributes, among others, are used to determine these factors. Sentiment analysis is
performed on customer reviews to identify whether they are positive, negative, or neutral.
Customers with common features are grouped together through k-means
clustering. A machine learning model is then trained on the review sentiment, seller cluster,
product description, and order details to predict customer review scores. Based on this
model, areas where product quality and customer service need improvement can be identified.
Aamer & Ammar (2020) assess the use of machine learning in predicting demand for supply
chain management. The study scoured 1870 papers from Scopus and Web of Science databases.
According to the article, some of the most popular algorithms employed in demand forecasting
are neural networks, artificial neural networks, support vector regression, and support vector
machines. It also notes that big data analytics have become increasingly important in this area.
Lalou et al. (2020) propose an approach to retail sales forecasting based on data analytics,
which they claim can enhance the performance of traditional methods. The authors criticize
current techniques for failing to capture the intricacies of modern retail environments, especially
those that integrate online with physical stores. They present a procedure in which statistical
programming is used together with data analytics to select the most appropriate prediction
models for particular retail networks.
The method is presented through a single case study conducted at a Greek third-party logistics
firm that serves as the intermediary between a large sports goods importer and its physical/online
shop customers, located across five different countries with 129 outlets in total. Using past order
records, the company determines which forecasting model best suits each SKU (Stock Keeping
Unit) within its network and then employs that model to estimate future demand at every
store location for every SKU.
Lalou et al. compared various types of forecasting models, including conventional statistical
methods alongside more sophisticated machine learning approaches. They conclude that no
one-size-fits-all model exists: forecasting accuracy depends on the specific features exhibited
by each SKU together with the store location concerned.
Park, K. J. (2021) examines how machine learning algorithms can be used to determine the
tier (manufacturer, distributor, wholesaler, or retailer) of the supply chain from which a
particular data point comes. The author proposes analyzing the order information of each tier of
the supply chain and using it to train machine learning models that classify new data points.
In this research work, seven different machine learning algorithms are considered for this
purpose: logistic regression, random forest, naive Bayes, decision tree, support vector machine
(SVM), k-nearest neighbor (KNN), and multi-layer perceptron (MLP). Accuracy, the confusion
matrix, precision, recall, and F1-score are used as evaluation metrics to assess how well
these models perform on the task.
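
The sketch below shows how these evaluation metrics follow from the counts of a confusion
matrix. For brevity it treats a two-class problem, whereas the tier classification above has
four classes; the function name classification_metrics and the label vectors are made up for
illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"confusion": (tp, fp, fn, tn), "accuracy": accuracy,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical true and predicted labels.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

For a multi-class task such as tier classification, precision, recall, and F1 are usually
computed per class and then averaged.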
Logistic regression and the multi-layer perceptron are reported as superior to the other models
at correctly classifying the tier from which the information originated, while random forest is
reported to have the best overall performance among all tested classifiers. The authors also state
that the remaining methods were unable to distinguish between tiers accurately.
Al-Saghir, R. (2022) proposes, in a case study on predicting delivery delays, a model that uses
machine learning with historical data to anticipate delays before they occur. It estimates
whether a delivery is going to be on time, early, or late based on predetermined
product delivery data. According to the author, the model can work with any transportation mode
and category of goods, provided the required information is fed into it.
To predict late delivery in Supply Chain 4.0, Aboulouafa, H., & Bahaj, M. (2022, December)
develop a machine learning model. The authors used a random forest classifier and feature
selection to increase the prediction accuracy of the model, which was trained with data from an
aerospace manufacturing business, achieving 97.9% accuracy without feature selection and
99.38% with feature selection.
Here are the major steps involved:
● Data collection: Data for this system comes from the ERP system and includes past
order records such as order cost, quantity, ordered vs delivered date, etc.
● Feature selection: To reduce the number of features used by the model while
increasing its accuracy, the researchers employed the ANOVA F-test (the f_classif()
function) to choose the most relevant ones.
● Model training: They trained a random forest classifier whose target is whether or
not an order will arrive later than expected.
● Model evaluation: Accuracy, i.e., the ratio of correctly classified instances among
all tested instances, was used to assess how well the models performed.
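
The steps above can be sketched with scikit-learn as follows. This is only an illustration
under stated assumptions: the ERP data of the original study is not available, so a synthetic
dataset from make_classification stands in for it, and the number of selected features (k=5)
and the forest size are arbitrary choices rather than the settings used by the authors.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for ERP order records: label 1 = late, 0 = on time.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# ANOVA F-test feature selection followed by a random forest classifier.
model = make_pipeline(
    SelectKBest(score_func=f_classif, k=5),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
n_selected = int(model.named_steps["selectkbest"].get_support().sum())
print(f"accuracy={accuracy:.3f}, selected features={n_selected}")
```

Placing the selector inside the pipeline means that feature scores are computed on the
training split only, which avoids leaking information from the test set into the selection.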

The results obtained during the testing stage indicate that selecting a subset of features
can greatly enhance supply chain management (SCM) performance, especially where deliveries
frequently arrive late and thereby affect other operations within organizations that rely
heavily on SCM.
Mediavilla, M. A., Dietrich, F., & Palm, D. (2022) discuss Artificial Intelligence (AI)
methods for improving demand forecasting in supply chain management (SCM). The authors
argue that today’s markets are too turbulent for traditional statistical techniques to be effective
for this purpose.
The main points covered are the following:

● Problems with conventional methods: The paper recognizes that classical statistical
forecasting strategies do not work well with the complex nature of modern markets.
● AI's impact on demand estimation accuracy: According to the article, AI algorithms
play an increasingly large role in enhancing forecast precision.
● Focus on recent work: The review concentrates specifically on AI approaches
employed from around 2017 to 2021.
● Classification of AI methods: The authors classify the methods based on factors
such as data dimensionality, data volume, and forecasting time horizon.
● Goal (selecting the right method): The classification should help professionals choose
appropriate methods for their SCM needs, depending on the situations they
encounter when trying to predict future demand accurately.

Common Machine Learning Techniques:


● The most commonly used machine learning algorithms for demand forecasting are
neural networks, artificial neural networks (a specific type of neural network),
support vector regression, and support vector machines. These four account for more
than three quarters of all published studies.
● Industry Focus: The study reports that 65% of machine learning applications for
demand forecasting are in industry. Agriculture is one of the sectors receiving the
least attention, at approximately 5%.
● The Rise of Big Data: The article suggests that big data analytics is increasingly
important in supply chain management because it can help improve demand
forecasting through machine learning.

The paper by Cadavid, J. P. U., Lamouri, S., & Grabot, B. (2018) discusses how ML has
been used to improve sales and demand forecasting in SCM. It reviews recent studies and the
advantages of ML over traditional methods, such as improved accuracy.
Here are some main points from their work:
● ML Techniques Used For Sales And Demand Forecasting: According to the paper,
neural networks, including ANNs, have been used alongside other models for sales
prediction. Support vector machines are also mentioned among the frequently
employed forecasting techniques.
● Benefits Of Using Machine Learning: Traditional methods may not always be able to
handle complex data sets containing hidden patterns, since they rely only on
past observations, which can lead firms to poor decisions. ML helps businesses make
better decisions based on accurate forecasts derived from such hidden patterns.

● Applications In Reality: The paper states that demand and sales projections can be
made with machine learning across various sectors, although no specific examples or
industries are provided in the part reviewed here.
● What To Expect Next: Artificial intelligence is mentioned briefly, without going
deeper into its application areas. AI may in time be integrated with ML to
optimize these activities together, increasing the chances of getting more accurate
results; another possibility is integrating large datasets into these models during
their development, which may help address their current weaknesses.

2.3 Key Takeaways from the Literature Review


● Strategic Use of Advanced Analytics: The literature review has highlighted the
strategic nature of employing advanced analytics like machine learning and predictive
modeling in supply chain management towards building resilience as well as
improving efficiency.
● Data-Driven Decision Making: Organizations can be empowered by embracing
data-driven decisions across the supply chain. Insights from past shipment data can
support decision making, minimize stockouts, and prevent disruptions.
● Demand Forecasting Accuracy: Several studies in this area found that demand
levels can be predicted accurately using regression models alongside predictive
analysis, leading to better inventory control and a lower chance of stockouts.
● Late Delivery Prevention: Using classification models to predict late deliveries
improves customer satisfaction while boosting operational efficiency within
organizations' supply chains.
● Continuous Improvement: With big data analytics one can continuously improve
different steps involved in a supply chain process by looking back at historical
performance records, finding areas where there might have been wastage or
underutilization then implementing targeted measures aimed at rectifying those
problems.

CHAPTER 3: Project / Data Description
This study contributes to research on supply chain management by focusing on the strategic use
of advanced analytical methods to optimize supply chain dynamics. It is
meant to give insight into the problems and opportunities that come with using machine learning
algorithms as a way of strengthening resilience and efficiency within supply chains.
The research work involves building predictive models for forecasting demand and analyzing
late deliveries, with an emphasis on their applicability across different contexts within real-world
supply chains. The project attempts to provide practical answers to common problems
experienced in supply chains using a dataset obtained from Kaggle called “DataCo Smart Supply
Chain for Big Data Analysis”.

DataCo Smart Supply Chain for Big Data Analysis:


https://www.kaggle.com/datasets/shashwatwork/dataco-smart-supply-chain-for-big-data-analysis?select=DataCoSupplyChainDataset.csv

The data set holds a large amount of information related to supply chain activities, such as
order details, delivery timings, customer feedback, and other operational metrics
used by the businesses that run these systems. This rich data source forms the
basis on which we train our predictive models and extract insights into what drives these
activities.
For the demand forecasting analysis, quantities will be predicted
using machine learning algorithms including Random Forest Regression, Decision Tree
Regression, and Linear Regression. The performance of these algorithms is evaluated with
criteria such as Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), which
assess how precise the demand forecasts are.
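
As a brief illustration of these two criteria, the sketch below computes MAE (the mean of the
absolute errors) and RMSE (the square root of the mean squared error) for a handful of
forecasts. The demand values are hypothetical and are not taken from the DataCo dataset.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_square_error(actual, predicted):
    """Square root of the average squared difference."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


# Hypothetical actual and forecast order quantities.
actual_demand = [10, 12, 8, 15]
forecast_demand = [11, 10, 8, 18]

mae = mean_absolute_error(actual_demand, forecast_demand)
rmse = root_mean_square_error(actual_demand, forecast_demand)
print(mae, rmse)
```

Because RMSE squares the errors before averaging, it penalizes large individual errors more
heavily than MAE, so reporting both gives a fuller picture of forecast quality.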
In addition, classification models are used for the late delivery analysis:
Logistic Regression, Linear Discriminant Analysis, Gaussian Naive Bayes, Support Vector
Machines, and Random Forest classification. Such predictive models help identify
orders that are likely to arrive late, so that appropriate measures can be taken
early enough to prevent delays and keep the supply chain operating
optimally.
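
A comparison of these five classifiers could be set up along the following lines with
scikit-learn. This is a sketch under assumptions rather than the project's actual code: a
synthetic binary dataset replaces the prepared DataCo features, and all hyperparameters are
library defaults apart from max_iter and the fixed random seeds.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the prepared features; label 1 = late delivery.
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Fit each model on the training split and record its test accuracy.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```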
In this project, we will preprocess data, engineer features, train models, validate them, and
evaluate their performance.

The deliverables of this study will be operational insights and recommendations that
enable managers to improve the efficiency of their operations while minimizing
risks that may lead to customer dissatisfaction. The study is also intended to contribute
knowledge on supply chain optimization through advanced analytics, thereby enriching the
related academic literature.
The Data Dictionary of the Dataset used in our project can be found below:
https://www.kaggle.com/datasets/shashwatwork/dataco-smart-supply-chain-for-big-data-analysis?select=DescriptionDataCoSupplyChain.csv

CHAPTER 4: Analysis
In this study, the three key variables for supply chain management effectiveness are
“Total amount per order”, “Order Item Quantity” and “Late Delivery”.
● The business forecasts the "Total amount per order" (or Sales per Customer) to
identify sales problems and opportunities associated with products early.
● Forecasting "Order Item Quantity" (or Demand) enables the identification of
customer buying patterns as well as the prediction of future market demand for products.
The company can learn which customer segment brings the highest profits, which
product receives the most customer orders, and which market shows high demand.
● Late deliveries reduce customer retention and undermine customers' trust in
the company. As a result, the firm loses customers over time and, with them, its
reputation and revenue. By preventing late deliveries, the organization can
keep clients coming back, thereby increasing their lifetime value
and its own return on investment (ROI).

4.1 Data Preprocessing


The dataset has 53 columns and 180,519 rows.
Data Cleaning -
Some columns (‘Customer Lname’, ‘Customer Zipcode’, ‘Order Zipcode’, and ‘Product
Description’) have missing values. The count row shows the number of missing values in each
column.
The column named “Late Delivery Risk” is binary (0 or 1), where 0 means no delay in delivery
and 1 indicates a delayed delivery. The mean of this feature is 0.548, i.e., about 54.8% of
orders are flagged as late, which may lead to customer dissatisfaction and loss
of profits.
● Handling Missing Values: Missing values in “Customer Lname” were replaced with
“NotDetermined”, missing values in “Customer Zipcode” and “Order Zipcode” were
replaced with 0, and the ‘Product Description’ column, which contains missing
values, was dropped from the dataset.
● Treatment of Customer Last Names: Eight customers had empty last names, which
were filled with “NotDetermined”. A new column called "Customer Full Name"
was then created to combine the customer name fields.
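
The cleaning steps above might look as follows in pandas. The small DataFrame is a
hypothetical stand-in for the real dataset, and the "Customer Fname" column used to build the
full name is an assumption, since the text does not name the columns that were combined.

```python
import pandas as pd

# Tiny hypothetical sample with the same kinds of gaps as described above.
df = pd.DataFrame({
    "Customer Fname": ["Mary", "John", "Ana"],  # assumed column name
    "Customer Lname": ["Smith", None, "Lopez"],
    "Customer Zipcode": [725.0, None, 95125.0],
    "Order Zipcode": [None, 10001.0, None],
    "Product Description": [None, None, None],
})

# Fill missing last names with a placeholder value.
df["Customer Lname"] = df["Customer Lname"].fillna("NotDetermined")

# Replace missing zip codes with 0.
zip_columns = ["Customer Zipcode", "Order Zipcode"]
df[zip_columns] = df[zip_columns].fillna(0)

# Drop the column that carries no usable information.
df = df.drop(columns=["Product Description"])

# Combine first and last name into a single column.
df["Customer Full Name"] = df["Customer Fname"] + " " + df["Customer Lname"]

print(df.isna().sum())
```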

Outlier Handling -
No outliers were removed from the data set. Since our models are non-parametric and
robust to extreme values, we kept all observations so that the models can capture
all underlying patterns within the data.
After these data preparation steps, the shape of the data set changed from (180,519, 53) to
(180,519, 44), and the data is now ready for further analysis.

4.2 Feature Engineering


Data Correlation
A correlation analysis was performed on the data to identify key parameters.

Fig 1: Heatmap to find out important parameters
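
A correlation matrix of the kind shown in such a heatmap can be computed with pandas as
sketched below. The three columns and their values are hypothetical; the real analysis uses
the numeric columns of the dataset, and the resulting matrix is typically passed to a plotting
library to draw the heatmap.

```python
import pandas as pd

# Hypothetical order lines (values chosen only for illustration).
orders = pd.DataFrame({
    "order_item_product_price": [10.0, 20.0, 30.0, 40.0],
    "order_item_quantity": [5, 4, 2, 1],
    "sales": [50.0, 80.0, 60.0, 40.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = orders.corr()
print(corr.round(2))

price_vs_quantity = corr.loc["order_item_product_price", "order_item_quantity"]
```

Values close to +1 or -1 indicate a strong linear relationship between two columns, while
values near 0 indicate little or no linear relationship.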

The heatmap visualization shows clear insights:
Finding Duplicate Columns: Several columns hold similar values under different names.
● Benefit per order and Order Profit per order
● Sales per customer, Sales, and Order Item Total
● Category ID, Product Category ID, Order Customer ID, Order Item Category ID, and
Product card ID

Removing Unwanted Features: Some features were considered irrelevant because they either
contain null values or have low correlation with other variables. These are:
● Product Description
● Product Status

To interpret the relationships represented in the heatmap more clearly, only a few features were
chosen for examination. They are listed below:
● shipping_date_dateorders
● benefit_per_order
● order_date_dateorders
● order_item_discount
● order_item_product_price
● order_item_quantity
● sales
● order_item_total
● late_delivery_risk

Fig 2: Heatmap of Important Parameters
The heatmap visualization shows clear insights:
● Order Item Total has a direct relationship with Order Item Discount, Order Item
Product Price, Order Item Quantity and Sales.
● Sales is directly associated with Order Item Discount and Order Item Product Price.
● There is an inverse relationship between Order Item Quantity and Order Item
Product Price. Analysis of the dataset revealed that products with higher prices were
associated with a single order, while items priced between 10 and 100 had five orders.
This disparity indicates that orders for expensive items are rare compared to orders
for moderately priced, cheaper items.

4.3 Exploratory Data Analysis


The Exploratory Data Analysis (EDA) section is an initial investigation into different aspects of
the data set, which helps us understand what decisions need to be made. We use statistical
methods and graphs to analyze customer segments, markets, product category performance,
and delivery patterns, among others. The analysis seeks to
identify relationships, correlations, or anomalies in the dataset that can
inform business planning.

4.3.1 Customer Segment Analysis
Understanding the customers is key to any successful business strategy. This analysis looks at
different factors, such as age, gender, and race, that can be used to segment our customer
base. The goal of this segmentation is to find common characteristics or behaviors among
customers so that we can understand them better and meet their specific needs through
customized marketing, personalized product development, and tailored service delivery where
necessary.

Fig 3: Customer Segment Analysis

Fig 4: Number of Orders per Customer Segments

4.3.2 Market Analysis
It is important for an organization to understand its environment well in order to make
informed decisions on how best to position itself strategically against the competition. Market
analysis provides knowledge about what is happening in the markets, which makes it possible
to forecast future changes more accurately and to adapt more quickly when those changes
affect customer demand patterns.

Fig 5: Market Analysis

Fig 6: Number of Orders per Region

4.3.3 Product Category Analysis
Looking at individual items within a category gives insight into the sales volume and
profitability of each item relative to the other items in the same category over the period
considered.

Fig 7: Product Category Analysis

4.3.4 Revenue vs Late Delivery


One critical question supply chain managers should pay close attention to is whether their
revenue streams are affected by delays during transport. In other words, what
happens if deliveries are made late, and what impact does that have on customer satisfaction?
This analysis addresses those questions by studying the relationship between the revenue
generated and the number of late shipments recorded
over a specific period. The main purpose is to establish whether there is any direct
connection between the money earned through the sale of goods or provision of services and
the time taken to transport products from seller to buyer.

Fig 8: Products & Regions with Highest Profit

Fig 9: Top 10 Products & Regions with most Late Deliveries

4.3.5 Delivery Status
Tracking the status of delivery ensures that orders are fulfilled on time and customers are
satisfied. In this assessment, we investigate where things stand with deliveries all along the
supply chain, from order processing to destination. We can also establish why some
parcels may have taken longer than others by comparing aspects such as order processing time,
transit time, and percentage delivered in full against industry benchmarks
or best practices. Furthermore, real-time tracking allows for timely intervention
whenever there are delays, preventing dissatisfaction among clients who expect their
goods promptly.

Fig 10: Delivery Status

4.3.6 Shipping Modes


Evaluating the modes of shipping shows how efficient and inexpensive each delivery option is.
It is possible to optimize logistics operations and increase customer satisfaction
by looking at when goods are moved, how much it costs to move them, and what
level of service each mode provides. This kind of analysis also helps us find
out where processes could be made faster or cheaper while shortening transit time.

Fig 11: Shipment Modes

4.3.7 Delivery Status by Shipping Mode


Breaking delivery status down by shipping mode makes the differences in performance between
shipment methods apparent.

Fig 12: Delivery Status by Shipping Mode

4.3.8 Payment Method
Knowledge of current trends and customers’ preferences in paying for products
is useful for improving the checkout experience and facilitating smooth transactions.

Fig 13: Payment Method

Key Takeaways from the Exploratory Data Analysis:


● Evaluation of Consumer Conduct: Customer behavior in the DataCo Company dataset
was examined, and it shows that this category has the largest share of the market,
which means more people in it are interested in buying.

● Forces Behind Market: Sales are highest in Europe, which raises earnings per order
and makes this region the most lucrative for the business. LATAM, however, records
the greatest total order quantity, indicating a strong presence by volume and
large-scale opportunities.

● Favored Goods Categories: Some product categories attract more buyers than others.
Cleats, Women's Apparel, Indoor/Outdoor Games, Cardio Equipment, Shop by Sport,
Men's Footwear and Fishing received the most orders, indicating common demand for
such commodities.

● Shipping Desires: Customers' delivery preferences show in the predominance of
low-cost shipping methods. Standard Class is the most preferred mode, surpassing
all other options for receiving goods bought online.

● Means Of Payment Trends: Payment mode preference does not vary across the regions
studied. Debit transactions dominate, showing a high level of cashless payment,
while cash payments are the least preferred method in every region.

4.4 Ordinary Least Squares


4.4.1 Linear regression from Ordinary Least Squares (OLS)
Linear regression is a method that can predict the Total amount per order (Sales per
customer) by examining its relationship with a set of predictor variables. These variables
might include sales value, order quantity, product price, customer type or market segment. The
ability to accurately forecast sales is important for any firm seeking to meet consumer needs
effectively.
Ordinary Least Squares (OLS) regression is a statistical technique used to find the line of best fit
for relating independent predictors with their dependent response variable. By approximating
relationships between observations, OLS helps organizations make better predictions and
strategic decisions thus reducing chances of running out of supply.
Predictor variables consist of various factors such as ‘order_item_product_price’,
‘order_country’, ‘order_item_discount’, ‘order_profit_per_order’, ‘order_item_quantity’,
‘delivery_status’, ‘customer_country’, ‘customer_state’, ‘order_city’, ‘customer_city’,
‘department_name’, ‘order_state’, ‘order_status’, ‘market’, ‘type’, ‘product_name’,
‘customer_segment’, ‘order_region’, ‘category_name’ and ‘shipping_mode’.
Response variable is Total amount per order, which is expressed as ‘order_item_total’.

4.4.2 OLS Regression


Some dataset columns have object data types which cannot be used directly in a regression
model. One might choose to drop string columns, but it should be noted that customer category,
market, product name, order region, category name and other variables may affect the Total
amount per order. For this reason, all these categories were converted from object type into int
types so they can be included in multiple regression analysis.
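
The conversion from object to integer types can be sketched as follows. This is a minimal illustration in plain Python, assuming each category is mapped to its rank in sorted order; the thesis does not state which encoding was used, and the helper name `encode_labels` and the sample values are hypothetical.

```python
def encode_labels(values):
    """Map each distinct string to an integer code (sorted order, starting at 0)."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# Hypothetical sample of a categorical column (not taken from the dataset).
shipping_mode = ["Standard Class", "First Class", "Second Class", "Standard Class"]
codes, mapping = encode_labels(shipping_mode)
print(codes)    # integer codes usable as a regression input
print(mapping)  # lookup table from label to code
```

Integer codes of this kind impose an arbitrary ordering on the categories, so the sign and size of a coefficient on an encoded column depend on how the codes were assigned.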

OLS Regression Results

Dep. Variable:      order_item_total    R-squared:            0.871
Model:              OLS                 Adj. R-squared:       0.871
Method:             Least Squares       F-statistic:          4034.
Date:               Thu, 02 May 2024    Prob (F-statistic):   0.00
Time:               17:22:52            Log-Likelihood:       -61950.
No. Observations:   11943               AIC:                  1.239e+05
Df Residuals:       11922               BIC:                  1.241e+05
Df Model:           20
Covariance Type:    nonrobust
Table 1: OLS Regression Results

                              coef   std err         t    P>|t|    [0.025    0.975]
Intercept                 -24.1177     3.148    -7.662    0.000   -30.288   -17.948
order_item_product_price    0.9554     0.005   210.001    0.000     0.946     0.964
order_country               0.0080     0.012     0.668    0.504    -0.015     0.031
order_item_discount        -0.3465     0.022   -15.513    0.000    -0.390    -0.303
order_profit_per_order      0.0125     0.004     3.467    0.001     0.005     0.020
order_item_quantity        53.3405     0.354   150.641    0.000    52.646    54.035
delivery_status            -0.3856     0.411    -0.938    0.348    -1.191     0.420
customer_country           -1.7044     1.289    -1.322    0.186    -4.232     0.823
customer_state             -0.0352     0.044    -0.808    0.419    -0.121     0.050
order_city                  0.0004     0.001     0.610    0.542    -0.001     0.002
customer_city              -0.0014     0.003    -0.393    0.694    -0.008     0.005
department_name            -1.2218     0.158    -7.719    0.000    -1.532    -0.912
order_state                -0.0017     0.002    -0.898    0.369    -0.005     0.002
order_status               -0.2760     0.302    -0.915    0.360    -0.868     0.316
market                     -0.4774     0.370    -1.292    0.197    -1.202     0.247
type                        1.3522     0.665     2.032    0.042     0.048     2.657
product_name               -0.1027     0.027    -3.792    0.000    -0.156    -0.050
customer_segment            0.8378     0.629     1.332    0.183    -0.395     2.070
order_region               -0.0608     0.062    -0.982    0.326    -0.182     0.061
category_name              -1.1545     0.037   -30.877    0.000    -1.228    -1.081
shipping_mode               0.6112     0.413     1.481    0.139    -0.198     1.420
Table 2: Regression Coefficient Table

Observations:
● The p-values of the predictor variables were assessed to determine their
significance. A p-value below 0.05 means that the null hypothesis of no association
is rejected at the 5% level, and this served as the basis for keeping a predictor
variable in the model.
● Predictors with p-values below 0.05 show a statistically significant association
with ‘order_item_total’ (response variable). The predictors carried forward are
‘order_item_product_price’, ‘order_item_discount’, ‘order_item_quantity’,
‘order_profit_per_order’, ‘department_name’, ‘market’, ‘product_name’ and
‘category_name’. Of these, ‘market’ (p = 0.197 in Table 2) does not fall below the
threshold.
● Conversely, a p-value above 0.05 for a predictor variable implies no statistically
significant association at the 5% level. The remaining factors, ‘order_country’,
‘customer_country’, ‘customer_state’, ‘order_city’, ‘customer_city’, ‘order_status’,
‘order_state’, ‘type’, ‘customer_segment’, ‘order_region’, ‘shipping_mode’ and
‘delivery_status’, were therefore treated as having no material impact on
‘order_item_total’ (response variable). Of these, ‘type’ (p = 0.042 in Table 2)
falls marginally below the threshold.

The OLS regression model was then re-calibrated by removing the predictor variables set
aside in the previous step, since they did not meet the criterion for inclusion in
subsequent stages of the analysis, such as testing for multicollinearity. A new OLS
regression model was fitted on the remaining predictors.
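
The selection rule described above can be sketched in a few lines of Python. The p-values are copied from Table 2; the variable names `p_values`, `ALPHA`, `kept` and `dropped` are illustrative and are not taken from the thesis code.

```python
# p-values copied from the P>|t| column of Table 2.
p_values = {
    "order_item_product_price": 0.000,
    "order_country": 0.504,
    "order_item_discount": 0.000,
    "order_profit_per_order": 0.001,
    "order_item_quantity": 0.000,
    "delivery_status": 0.348,
    "customer_country": 0.186,
    "customer_state": 0.419,
    "order_city": 0.542,
    "customer_city": 0.694,
    "department_name": 0.000,
    "order_state": 0.369,
    "order_status": 0.360,
    "market": 0.197,
    "type": 0.042,
    "product_name": 0.000,
    "customer_segment": 0.183,
    "order_region": 0.326,
    "category_name": 0.000,
    "shipping_mode": 0.139,
}

ALPHA = 0.05  # significance threshold used in the text

# Keep a predictor only if its p-value is below the threshold.
kept = [name for name, p in p_values.items() if p < ALPHA]
dropped = [name for name, p in p_values.items() if p >= ALPHA]
print("kept:", kept)
print("dropped:", dropped)
```

Applied mechanically to Table 2, this rule keeps 'type' (p = 0.042) and drops 'market' (p = 0.197), whereas the re-fitted model reported in Table 4 includes 'market' and omits 'type'.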
OLS Regression Results

Dep. Variable:      order_item_total    R-squared:            0.871
Model:              OLS                 Adj. R-squared:       0.871
Method:             Least Squares       F-statistic:          1.007e+04
Date:               Thu, 02 May 2024    Prob (F-statistic):   0.00
Time:               17:24:46            Log-Likelihood:       -61961.
No. Observations:   11943               AIC:                  1.239e+05
Df Residuals:       11934               BIC:                  1.240e+05
Df Model:           8
Covariance Type:    nonrobust
Table 3: OLS Regression Results for Predictors having p-value < 0.05

                              coef   std err         t    P>|t|    [0.025    0.975]
Intercept                 -24.7488     1.985   -12.468    0.000   -28.640   -20.858
order_item_product_price    0.9547     0.005   210.134    0.000     0.946     0.964
order_item_discount        -0.3471     0.022   -15.537    0.000    -0.391    -0.303
order_item_quantity        53.6606     0.329   162.994    0.000    53.015    54.306
order_profit_per_order      0.0124     0.004     3.447    0.001     0.005     0.019
department_name            -1.2013     0.158    -7.596    0.000    -1.511    -0.891
market                     -0.4243     0.355    -1.196    0.232    -1.120     0.271
product_name               -0.0977     0.027    -3.617    0.000    -0.151    -0.045
category_name              -1.1575     0.037   -30.992    0.000    -1.231    -1.084
Table 4: Regression Coefficient Table for Predictors having p-value < 0.05
Observations:
According to the OLS regression analysis for predictors having p-value <0.05 -
● The model’s coefficient of determination (R-squared) is determined to be 0.871,
which means that nearly 87.1% of the dependent variable ‘order_item_total’ variance
can be explained by independent variables in the regression model used.
● The adjusted R-squared is also 0.871, which indicates that the explanatory power
is not inflated by the number of predictors in the model.
● The F-statistic, the overall significance measure for the regression model, has
a value of 1.007e+04 with a corresponding p-value of 0.00. This implies that the
predictors are jointly statistically significant at conventional levels, and that these
variables collectively explain a substantial proportion of the variation in
'order_item_total'.
● Coefficients, standard errors, t-values and p-values for each predictor variable
included in this model are shown in Table 4. All listed predictors except 'market'
(p = 0.232) have p < 0.05, which makes them statistically significant for
'order_item_total' prediction.
● The intercept term, with a coefficient of -24.7488, is statistically significant.
It represents the predicted 'order_item_total' when every predictor equals zero
and is therefore best read as a baseline offset rather than as a factor in its own
right.

● Among the predictor variables, ‘order_item_quantity’, ‘order_item_product_price’,
‘order_item_discount’, ‘order_profit_per_order’, ‘department_name’, ‘product_name’
and ‘category_name’ were found to be statistically significant predictors of order
item total, given their low p-values in Table 4.
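
As a quick consistency check on the reported fit statistics, the adjusted R-squared can be recomputed from the rounded values in Table 3. The sketch below uses the standard formula for a model with an intercept; the function name `adjusted_r_squared` is illustrative.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a linear model that includes an intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values from Table 3: R-squared 0.871, 11943 observations, 8 predictors.
adj = adjusted_r_squared(0.871, n_obs=11943, n_predictors=8)
print(round(adj, 3))
```

With almost twelve thousand observations and only eight predictors, the adjustment is tiny, which is why R-squared and adjusted R-squared agree to three decimal places.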

4.4.3 Linear Regression Equation


Linear Regression Equation:
order_item_total = -24.7488 + (0.9547 X {order_item_product_price}) - (0.3471 X
{order_item_discount}) + (53.6606 X {order_item_quantity}) + (0.0124 X
{order_profit_per_order}) - (1.2013 X {department_name}) - (0.4243 X {market}) - (0.0977 X
{product_name}) - (1.1575 X {category_name})
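
To make the equation concrete, the sketch below evaluates it for a single order item. The coefficients are copied from Table 4; the example row is hypothetical, and the categorical predictors are assumed to be supplied as the integer codes produced during preprocessing.

```python
# Coefficients copied from Table 4.
COEFFICIENTS = {
    "order_item_product_price": 0.9547,
    "order_item_discount": -0.3471,
    "order_item_quantity": 53.6606,
    "order_profit_per_order": 0.0124,
    "department_name": -1.2013,
    "market": -0.4243,
    "product_name": -0.0977,
    "category_name": -1.1575,
}
INTERCEPT = -24.7488


def predict_order_item_total(row):
    """Evaluate the fitted linear equation for one order item."""
    return INTERCEPT + sum(COEFFICIENTS[name] * row[name] for name in COEFFICIENTS)


# Hypothetical order item; categorical values are assumed integer codes.
example = {
    "order_item_product_price": 100.0,
    "order_item_discount": 10.0,
    "order_item_quantity": 2,
    "order_profit_per_order": 20.0,
    "department_name": 3,
    "market": 1,
    "product_name": 50,
    "category_name": 10,
}
print(round(predict_order_item_total(example), 2))

# Raising the quantity by one unit shifts the prediction by its coefficient.
more = dict(example, order_item_quantity=3)
print(round(predict_order_item_total(more) - predict_order_item_total(example), 4))
```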
Looking at the regression model and derived equation, several key points can be understood
about the relationship of Total amount per order & Predictor Variables.
Positive Correlation: Total amount per order is positively associated with product price,
quantity and profit per order. When one of these variables goes up, Total amount per order
also goes up on average, with the other predictors held fixed; for example, higher
quantities or profits are associated with larger total amounts per order.
Negative Correlation: In contrast, the coefficients on discount, market, category,
department and product name are negative, so an increase in any of these (encoded)
variables is associated with a decline in total amount per order.
Explanations for these negative relationships are as follows:
● Market Impact: More sales occur where many offers exist, so regions with larger
percentage discounts, e.g., Europe and LATAM (Latin America), record higher sales
figures than others. The large reductions offered by sellers in these areas explain
the negative relationship between market and the dependent variable, Total amount
per order.
● Categories of Products: Certain product categories, such as fishing gear and cleats,
see both high sales volumes and high percentage discounts. A larger markdown in
such a category means more units are purchased at a lower price, which is reflected
in the negative association between category and Total amount per order.
● Departmental Analysis: Where large numbers of items sell below normal prices,
large discounts tend to be on offer. A good example is the Fan Shop department,
where high revenue figures coincide with significant markdowns. Departments thus
respond differently to the pricing strategies adopted by the organization, which
contributes negatively to the dependent variable (Total amount per order).

Therefore, the negative relationship of the market, category and department variables with
Total amount per order is captured by the regression model. The model shows how different
predictors affect this variable, which helps in understanding customer behavior and in making
strategic decisions within the company.
By analyzing the magnitude and sign of each coefficient, we can determine which predictor
variables have the greatest impact on Total amount per order. For example, a coefficient that is
large in absolute size means that a one-unit change in its predictor variable has a stronger
effect on Total amount per order.

Fig 14: Visualization of Linear Regression Model of order_item_total for different predictors – 1

Fig 15: Visualization of Linear Regression Model of order_item_total for different predictors – 2

Model Interpretation:
The regression equation indicates how much each component contributes to the total amount
per order. This helps organizations identify what drives their revenue most and supports
decisions on pricing strategies, discount policy, product mix and departmental splits,
among others.

Business Perspective:
Correlation analysis and regression modeling can be used in business to forecast future sales and
to optimize resource allocation. Judging from the scatter plots, a few variables appear related to
the total amount per order. However, some points lie far from the line of best fit, which implies
that they do not follow this relationship closely.
In particular, the negative signs on the coefficients of “order_item_discount”,
“department_name”, “market”, “product_name” and “category_name” indicate how discounts
affect the overall amount customers spend on a given transaction. Thus, instead of concentrating
all efforts on enticing clients through markdowns alone, it would make sense for management to
speed up shipments and improve delivery plans.
If shipment handling is done strategically and the delivery process is streamlined, customer
needs will be met more effectively, leading to increased ordering frequency and driving up sales
volumes over time. This implies that the company should enhance its capacity to manage
shipments so that customer satisfaction rises, resulting in better purchase rates and ultimately
higher income for the firm.
Therefore, discounts may increase purchases temporarily, but companies need to work on
operational efficiency and a good delivery experience in order to build loyalty among buyers who
will keep ordering, thereby maximizing the profits of the business.

CHAPTER 5: Data Modeling
5.1 Order Item Quantity Regression Models
Estimating the quantity of ordered items is possible through demand forecasting; this enables
companies to make strategic decisions about supply based on future sales and revenue
predictions. These processes rely heavily on regression models which can expose trends, identify
demand signals or discover relationships between variables within large data sets. According to
McKinsey & Company, prediction errors could be reduced by up to 50% with machine learning
backed supply chain solutions.
In this regard different types of regression models like Random Forest regression, Decision Tree
Regression and Linear Regression are applied for predicting “Order Item Quantity”. Models’
predictive performances are evaluated by mean absolute error (MAE) and root mean square error
(RMSE) among other criteria.
Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are two metrics commonly
used to estimate the quality of regression models, especially in forecasting tasks like supply
chain management and demand prediction.
Mean Absolute Error (MAE)
● MAE quantifies the average magnitude of the errors between predictions and
actual values.
● It is computed by averaging the absolute differences between predicted
and true values.
● As it represents the average absolute deviation of predictions from actuals, MAE is easy
to interpret.
● One shortcoming is that it weights every error in proportion to its size, so large
errors are not penalized more heavily than small ones.

Root Mean Square Error (RMSE)


● RMSE measures the square root of the average squared difference between forecast
and realized quantities.
● It is calculated by taking the square root of the mean of the squared differences
between predicted and observed values.
● Compared with MAE, RMSE gives more weight to larger errors because it
squares them before taking the mean.
● RMSE is therefore well suited to evaluating a model when larger errors are
considered more important.
● Like MAE, it is expressed in the units of the target variable, so it indicates roughly
how far off predictions are on average.
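
The two definitions above can be written out directly. The following is a minimal sketch in plain Python; the sample values are hypothetical and were chosen so that one forecast is far off, which shows how RMSE reacts more strongly than MAE to a single large error.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all observations."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


# Hypothetical order quantities and forecasts; the last forecast is far off.
actual = [1, 2, 3, 4, 5]
predicted = [1, 2, 3, 4, 10]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(mae, rmse)
```

Here a single error of 5 units gives an MAE of 1.0 but an RMSE of about 2.24, because the error is squared before averaging.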

Linear Regression:
MAE of Total amount per order is : 0.45873962112754674
RMSE of Total amount per order is : 0.6527457353153101

Decision Tree:
MAE of Total amount per order is : 0.0
RMSE of Total amount per order is : 0.0

Random Forest Regression:


Model parameter used are: RandomForestRegressor(max_depth=10, random_state=40)
MAE of Total amount per order is : 0.0010789542043269447
RMSE of Total amount per order is : 0.012063712999009954

S.No   Regression Model     MAE        RMSE
1      Linear Regression    0.458740   0.652746
2      Decision tree        0.000000   0.000000
3      Random Forest        0.001079   0.012064
Table 5: Comparison between Regression Models

The linear regression model performs moderately well. The MAE and RMSE values imply that,
on average, the predicted values are off from the actual values by about 0.459 units and 0.653
units respectively.

According to the provided metrics, the decision tree model is perfect. It has an error of zero,
meaning it predicts the target variable exactly. However, such a result is highly unusual and
may indicate overfitting, a very small data set, or predictors that determine the target exactly.

The Random Forest model performs extremely well. Its MAE and RMSE values are very low: on
average, predicted values differ from actual ones by approximately 0.001 and 0.012
units respectively, which suggests that the random forest captures the patterns in the data
accurately.

Random Forest clearly beats Linear Regression in terms of MAE and RMSE, and unlike the
Decision Tree its errors are not exactly zero, which makes its result less suspect. It therefore
shows the strongest credible predictive performance among the tested algorithms and is
considered the best performing model for this dataset.

Business Perspective:
From a business standpoint, predicting demand, here Order Item Quantity, is of significant
importance in supply chain management. These forecasts inform strategic decisions: they
shape manufacturing plans, purchasing policies and even methods for reducing capital
expenditure.
Accurate estimates of what consumers will require lead to higher productivity and, in turn,
higher profits, which can be channeled towards meeting customers' needs, raising their
satisfaction and generating further revenue for the organization. Consequently, selecting an
appropriate regression model is instrumental in obtaining precise demand forecasts, which is
crucial for businesses seeking success in dynamic market environments.

5.2 Classification Models – Late Delivery


In this section, models for classification such as Logistic Regression, Linear Discriminant
Analysis, Gaussian Naive Bayes, Support Vector Machines and Random Forest classification
have been used to predict "Late Delivery."
This is important because it helps in forecasting and managing potential delays within the supply
chain. The performance of these models was assessed using Accuracy, Recall and F1 score,
each of which gives a different insight into how well a model performs.
Classification Matrix:
● True Positive (TP) cases indicate correct predictions for "Late Delivery" thus
enabling the supply chain to initiate pre-emptive actions aimed at minimizing
negative effects.
● True Negative (TN) cases represent accurate estimations for on-time deliveries, hence
ensuring operational smoothness.
● False Positive (FP) scenarios are on-time deliveries wrongly predicted as late, which
result in poor resource allocation efficiency and possible wastage.
● False Negative (FN) occurrences are late deliveries wrongly predicted as on time,
so the delay goes unanticipated and no pre-emptive action is taken.

The F1 score serves as the selection criterion for determining the best model. It is a balanced
measure that considers both precision and recall: it reflects how well the model identifies
late deliveries while keeping false alarms low, thereby supporting decision making in SCM.

The F1 score is a well-known metric for classification tasks in machine learning, especially when
working with imbalanced data. It computes the harmonic mean of precision and recall, thereby
giving a fair assessment of how well a model performs.

F1 Score = (2 × Precision × Recall) / (Precision + Recall)

● Precision vs Recall Balance: The F1 score balances precision (the accuracy
of positive predictions) and recall (the ability to catch all positive instances). This
balance is important where the cost of false positives differs significantly from the
cost of false negatives.
● Calculation Of Harmonic Mean: Unlike the arithmetic mean, the F1 score is the
harmonic mean of precision and recall. The harmonic mean gives more weight to the
lower of the two values, so it penalizes models that do poorly on either precision
or recall.
● Model Selection with Optimality: When choosing among classifiers, that is, models
that assign objects to pre-defined categories based on their properties, the model
with the highest F1 score is often preferred over selection by any other single
measure. A high F1 score indicates a good trade-off between false positives and
false negatives.
● Interpretability: The F1 score is a single number that is simple to interpret, which
makes it well suited to comparing different models or tuning parameters within a
single model. It is also easy to understand for stakeholders who lack technical
expertise but take part in decisions when organizations deploy machine learning
solutions.
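
The effect of the harmonic mean can be seen with a small numerical sketch. The precision and recall values below are hypothetical and describe a classifier that is precise but misses most positive cases; the function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical classifier with high precision but poor recall.
precision, recall = 0.9, 0.1
arithmetic_mean = (precision + recall) / 2
f1 = f1_score(precision, recall)
print(arithmetic_mean, f1)
```

The arithmetic mean of 0.5 hides the weak recall, while the F1 score of about 0.18 stays close to the lower of the two values.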

S.No   Classification Model            Accuracy (%)   Recall (%)   F1 (%)      TN     FP    FN    TP
1      Random Forest Classification    99.748814      99.529289    99.764089   1671   9     0     1903
2      Support Vector Machines         96.092660      96.176008    96.327387   1607   73    67    1836
3      Logistic Classification Model   96.343846      96.193952    96.571578   1607   73    58    1856
4      Linear Discriminant Analysis    96.511303      95.966908    96.742247   1602   78    47    1856
5      Gaussian Naive Bayes Model      85.431203      85.832901    86.370757   1407   273   249   1654

Table 6: Comparison between different Classification Models for Late Delivery
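
The scores in Table 6 can be related to its confusion-matrix counts with the standard definitions. The sketch below assumes that late delivery is the positive class and uses the Random Forest row; the function name `classification_metrics` is illustrative and not part of the thesis code.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 as percentages."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # share of predicted-late orders that were late
    recall = tp / (tp + fn)     # share of late orders that were predicted late
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": 100 * accuracy,
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f1": 100 * f1,
    }


# Counts copied from the Random Forest row of Table 6.
rf = classification_metrics(tn=1671, fp=9, fn=0, tp=1903)
for name, value in rf.items():
    print(f"{name}: {value:.6f}")
```

For this row the computed accuracy and F1 reproduce the tabulated 99.748814 and 99.764089. The computed precision, TP / (TP + FP), reproduces the 99.529289 listed under Recall, while recall computed as TP / (TP + FN) is 100 because FN is 0.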

Observations:
Random Forest Classification: This model has the highest accuracy, recall and F1 score
of all the models. It identifies late deliveries with great precision and, with no false
negatives (FN = 0), misses none of them in the evaluated data.
Support Vector Machines (SVM): Although SVM demonstrates relatively high accuracy, its
recall and F1 score are lower than those of the Random Forest model. It detects many late
deliveries correctly, but misses some (FN = 67).
Logistic Classification Model: Its accuracy, recall and F1 score are all slightly better
than those of SVM. However, Logistic Regression produces more false alarms (FP = 73)
than Random Forest (FP = 9).
Linear Discriminant Analysis (LDA): LDA performs at about the same level as the Logistic
model, with slightly higher accuracy and F1 score and slightly lower recall. It still raises
some false alarms (FP = 78) and misses some late deliveries (FN = 47).
Gaussian Naive Bayes Model: Of all the models tested, this one gives the least accurate
results, with the lowest accuracy, recall and F1 score. It does detect many late deliveries,
but it has the most false alarms (FP = 273) and the most missed late deliveries (FN = 249).

Looking at the numbers, the Random Forest Classification model is the best one to use for
classifying whether a delivery will be late or not. It has the highest accuracy, recall and F1 score.

Business Perspective:
For supply chain optimization, late deliveries must be handled efficiently. Accurate late
delivery prediction is important because it helps enterprises take proactive measures before
delays turn into bigger problems, allowing them to allocate resources better and plan
operations more effectively. With these types of models, companies can detect trends and the
factors behind delayed deliveries, such as transport disruptions, inventory deficits and
logistical bottlenecks.
Additionally, these methods enable firms to prioritize resource allocation by directing
attention to the areas where late shipments are most likely. This approach focuses on reducing
disturbances within the supply chain and cutting costly expedited freight or warehousing fees,
while improving overall operational efficiency.
Moreover, once businesses know what leads to service failures caused by lateness, they can
take preventive action against the associated risks. Such actions may include optimizing
transportation routes; improving stock management practices; renegotiating terms with
suppliers so that goods arrive just when needed for production; and investing in technologies
that provide continuously updated information on the location and condition of consignments
in transit.
Ultimately, these classification models give management the insight to make knowledge-based
decisions on how best to minimize the effects of delays in the customer order fulfillment
process, which otherwise lower satisfaction among the clients served. With predictive analytics
supported by machine learning tools, organizations will be able to streamline their SCM
functions, leading not only to improved reliability but also to a competitive advantage in
markets characterized by stiff rivalry.

CHAPTER 6: Results
Ordinary Least Squares (OLS) Regression
An OLS regression examination was done to look at the correlation between different predictor
variables and total amount per order ('order_item_total'). The final linear regression equation is
as follows:

order_item_total = -24.7488 + (0.9547 X {order_item_product_price}) - (0.3471 X
{order_item_discount}) + (53.6606 X {order_item_quantity}) + (0.0124 X
{order_profit_per_order}) - (1.2013 X {department_name}) - (0.4243 X {market}) - (0.0977 X
{product_name}) - (1.1575 X {category_name})

The coefficients indicate the impact of each predictor variable on the total amount per order. For
instance, a unit increase in the order item quantity is associated with an increase of
approximately 53.66 units in the total amount per order. Conversely, an increase in the order item
discount or market is correlated with a decrease in the total amount per order.

Regression Model for Order Item Quantity – Demand Forecasting

Different models, including Random Forest Regression, Decision Tree Regression and Linear
Regression, were used for demand forecasting of order item quantity. Product price,
discount, profit per order and department name, among others, were taken as predictors of
the quantity demanded of an item within a transaction. Each model’s performance was
evaluated using the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE)
metrics. Based on the results, Random Forest Regression is the best performing model for
forecasting Order Item Quantity.

Classification Models for Late Delivery


Models were built to determine whether a delivery would be late or not using Logistic
Regression, Linear Discriminant Analysis (LDA), Gaussian Naive Bayes (GNB), Support
Vector Machines (SVMs) and Random Forest Classification. These models were scored on
accuracy, recall and F1 score, which measure how well each model assigns instances to the
correct one of two categories, late delivery and not late, while taking into account the true
positives, true negatives, false positives and false negatives.

CHAPTER 7: Conclusions
7.1 Supply Chain Issues as shown by the Dataset used
● The analysis of the dataset showed that almost 55% of orders are not delivered on
time. First-class shipments have a late delivery rate of 95%, while Second-Class
orders experience a delay of 77%.
● In terms of delayed deliveries, Second-Class shipments are the most affected
category, which means that supply chain improvements should be concentrated on
First- and Second-class shipping modes.
● It is important to note that only 18% of orders are shipped within the expected
timeline thus timely delivery remains vital for customer satisfaction and retention;
failure to meet this expectation may lead to customer dissatisfaction thereby causing
revenue loss to the company.
● The top ten products together with regions that generate highest profits can be
affected by delays in transportation; however, some strategic interventions can help
deal with such a situation by optimizing methods used during shipping for faster
product dispatching creating awareness among clients about their goods’ ETA
through careful planning for deliveries and also setting up regional depots aimed at
quickening pace while dropping off items within different parts of a given area during
transit hence minimizing time taken between points where goods are loaded or
offloaded along the way. Through these initiatives alone it is possible to prevent
customer complaints related to late delivery, improve satisfaction levels among them
as well as protect income sources belonging to an organization.
● Only about one-fifth of all orders placed in the system arrive when they should,
while the rest do not arrive as scheduled. The size of this share is another reason
why efficiency needs urgent improvement.
● More than half (55%) of all orders fail to meet delivery expectations because of
late shipment, driven mainly by First-Class shipments, which record the highest late
rate (95%), followed by Second-Class shipments (77%).
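
The late-delivery percentages above are per-group rates. A minimal Python sketch of that
calculation is given below. The function name `late_rate_by_mode`, the record layout and the toy
records are assumptions made for illustration; they are not the study's code or data.

```python
from collections import defaultdict


def late_rate_by_mode(orders):
    """Compute the share of late orders for each shipping mode.

    `orders` is an iterable of (shipping_mode, is_late) pairs, where
    is_late is True when the order arrived after its scheduled date.
    """
    totals = defaultdict(int)
    late = defaultdict(int)
    for mode, is_late in orders:
        totals[mode] += 1
        late[mode] += int(is_late)
    return {mode: late[mode] / totals[mode] for mode in totals}


# Toy records (hypothetical, not taken from the DataCo dataset).
sample_orders = [
    ("First Class", True), ("First Class", True),
    ("First Class", True), ("First Class", False),
    ("Standard Class", False), ("Standard Class", True),
    ("Standard Class", False), ("Standard Class", False),
]

by_mode = late_rate_by_mode(sample_orders)
overall = sum(is_late for _, is_late in sample_orders) / len(sample_orders)
print(by_mode)
print(f"overall late rate: {overall:.2f}")
```

Comparing each mode's rate with the overall rate shows which shipping modes contribute
disproportionately to late deliveries.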
7.2 Conclusions
This thesis examined the major supply chain management dynamics and customer satisfaction
within the DataCo Company dataset. Several regression models, together with comprehensive
data analysis methods, were used in this study, yielding a number of significant findings
about how the business operates and serves its clients.
OLS Regression Insights: The Ordinary Least Squares (OLS) regression analysis confirmed, as
expected, that several factors significantly determine the amount spent on an order. The
estimated coefficients indicated that product price, discount percentage, and quantity
purchased, among others, affect the total expenditure per customer purchase. These findings
can inform decisions on optimizing pricing strategies, inventory control systems, and the
allocation of resources in general.
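
To make the idea of OLS coefficients concrete, the sketch below fits a linear model by solving the
normal equations in plain Python. It is an illustration under stated assumptions, not the code
used in this study: the function name `ols_fit`, the feature set (price, discount rate, quantity)
and the synthetic data are all hypothetical, and the totals are generated from a known linear rule
so that the recovered coefficients can be checked. In practice a statistics library would normally
be used, since it also reports standard errors and significance.

```python
def ols_fit(rows, targets):
    """Fit y = b0 + b1*x1 + ... + bk*xk by solving (X'X) b = X'y.

    Returns [b0, b1, ..., bk]. Assumes X'X is non-singular.
    """
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Synthetic (price, discount_rate, quantity) rows; purely illustrative.
features = [
    (10, 0.00, 1), (20, 0.10, 2), (15, 0.20, 1),
    (30, 0.00, 3), (25, 0.15, 2), (12, 0.05, 4),
]
# Totals generated from a known rule: 5 + 2*price - 10*discount + 3*quantity.
totals = [5 + 2 * p - 10 * d + 3 * q for p, d, q in features]

coefs = ols_fit(features, totals)
print([round(c, 4) for c in coefs])
```

Each coefficient is read as the expected change in the order total for a one-unit change in that
feature, holding the other features fixed.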
Models for Demand Forecasting and Classification: The regression model for order item
quantity forecasting and the classification models for late delivery prediction performed
well on all tested metrics, such as accuracy and AUC score, the measures used to compare the
machine learning models against each other. It can therefore be concluded that these models
worked best among the types considered. Random Forest Regression, in combination with
Logistic Regression, was found useful for accurately predicting future trends in demand
while also identifying areas that might experience problems with timely supply.
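
For reference, the two classification metrics named above can be computed as in the following
minimal Python sketch. The function names, the 0.5 decision threshold and the labels and scores
are hypothetical and are not outputs of the models in this study. AUC is computed here from its
pairwise definition: the probability that a randomly chosen positive example receives a higher
score than a randomly chosen negative one, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Assumes binary labels (1 = positive, 0 = negative) and at least
    one example of each class.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = late delivery) and predicted probabilities.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.4, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"accuracy: {acc:.3f}, AUC: {auc:.3f}")
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all
thresholds, which is why the two are usually reported together.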
Implications towards Supply Chain Optimization: The results of this analysis highlight the
importance of data-driven approaches in improving supply chain performance and, with it,
customer satisfaction. Organizations can employ predictive analytics together with
classification methods to deal proactively with issues affecting their supply chains, while
streamlining order fulfillment processes to ensure faster response times throughout the
business operations cycle. Additionally, companies could use pricing strategies derived from
the regression findings to better match products to relevant markets, alongside the overall
efficiency improvement recommendations given earlier.
Recommendations for Further Study: Although the present study shed light on many areas
related to managing customer satisfaction in supply chains, there are still gaps that require
further investigation. For instance, more research is needed on integrating real-time data
sources with advanced predictive models and with the algorithms used for optimizing supply
chains. In addition, external factors such as market trends, economic conditions, and
consumer behavior in a given context could greatly affect the dynamics of these systems and
therefore need thorough study.
To sum up, this investigation adds to existing knowledge by offering practical suggestions
for supply chain management practices and by highlighting service strategies that can help
companies gain a competitive advantage. Businesses operating in today’s dynamic environment,
characterized by high levels of uncertainty and complexity, should embrace data analytics and
machine learning techniques for better decision making. Going forward, organizations will
need to adopt an information-driven approach to improving efficiency along the entire supply
chain, with customer centricity as a key driver of success in ever-changing markets.
