The Realization of A Type of Supermarket Sales Forecast Model & System
Abstract: This paper addresses the problem of supermarket sales forecasting. Because the forecast targets are numerous and the sales series are highly volatile and clearly seasonal, the system adopts a combined forecast method and uses a method-selection algorithm that integrates bagging with statistical significance testing. The system has been implemented, and testing shows that it produces effective forecasts for most of the forecast targets.

Keywords-sales forecast; decision-making support; data mining; ensemble method

I. INTRODUCTION

The research group has developed a supermarket sales forecast system based on data mining technology and statistical analysis to help enterprise managers rationally adjust the structure of commodities and make decisions on purchase, sales, inventory and promotion. Both the requirements and the experimental data for this research come from a large supermarket chain located in Quanzhou City, Fujian Province.

Sales forecasting is a time series forecasting problem, and many methods are available for it. For stationary series, the main methods are the moving average and simple exponential smoothing, while for a linear trend, classical simple linear regression or Holt's model can be used. For a non-linear trend, polynomial regression and the exponential curve model can be used. If the series contains seasonal components, Winters' model or multiple regression with seasonal dummy variables can be tried. For series that combine several components, decomposition-based prediction and the ARIMA model are more appropriate. Other options include the grey prediction model, neural networks and hybrid neural networks [1-3]. For a single, specific forecast object, detailed study can identify the most appropriate method or combination of methods, so that the forecast values fit the observed values as closely as possible. In this system, however, the forecast objects are chosen and entered by the user. Different commodities, different categories, and even different commodities within the same category have different data patterns, and the pattern can also change with the length of the selling period. The test results show that, across all the sales series patterns of the many commodity objects, no single forecast method always produces good results. The solution is to combine forecast methods: set up a library containing several forecast methods, then select and weight the methods. Current research on combined forecasting focuses on the weighting problem [4-5]. [6] proposes a forecast method selection model based on a personnel selection approach, while [7] proposes a forecast model based on linear regression and exponential smoothing. Neither of these models is suitable for supermarket sales data. This paper places suitable methods in the library of forecast methods and, according to the characteristics of supermarket commodity data, designs a method-selection algorithm that integrates bagging with statistical testing. For a given forecast series, the algorithm finds the best-performing model or combination of models, which is then used to produce the forecast and the error estimate.

II. SUPERMARKET SALES FORECAST MODEL

A. The Forecast Methods in the Library

Which methods are more suitable for forecasting commodity sales volume? The choice of a time series forecast method depends on the pattern of the data: whether the data is stationary or non-stationary, and whether it contains trend, seasonal or cyclical components. The amount of historical data and the required forecast horizon also affect the choice. Our study shows that commodity sales series have the following characteristics:

1) Most of the sales series have obvious seasonal or cyclical variation. The selected method should therefore be able to separate seasonal factors.

2) The volatility of the series is generally high. Commodity sales fluctuate significantly; the series is usually not a smooth curve but an irregular zigzag. Therefore, the simple exponential curve and simple linear regression are not suitable for forecasting commodity sales volume.
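The idea of separating seasonal factors from a sales series can be illustrated with a minimal Python sketch. It is not the system's implementation (the system is written in C#), and the function names `seasonal_indices` and `deseasonalize` are hypothetical. The sketch assumes multiplicative seasonality, strictly positive sales values, no trend, and a series that covers whole cycles; a real decomposition method would also have to handle trend.

```python
from statistics import mean


def seasonal_indices(series, period=12):
    """Return one seasonal index per position in the cycle.

    Each index is the mean of the values at that position divided by the
    overall mean (ratio-to-overall-mean, a simplification that ignores trend).
    """
    overall = mean(series)
    return [mean(series[p::period]) / overall for p in range(period)]


def deseasonalize(series, indices):
    """Divide each value by the index of its position in the cycle."""
    period = len(indices)
    return [x / indices[i % period] for i, x in enumerate(series)]


# Hypothetical quarterly series with a purely seasonal pattern.
sales = [10, 20, 30, 40] * 3
idx = seasonal_indices(sales, period=4)
flat = deseasonalize(sales, idx)
```

For this artificial series the deseasonalized values are all equal to the overall mean, which is what "separating the seasonal factor" means here: what remains can be handed to a method that does not model seasonality itself.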
for training the model, while the later part is used as the testing set for assessing the model and providing the error estimate. To keep enough samples for training, the system holds out 5% of the data as the testing set.

2) Increasing the accuracy of the forecast by using bagging [8].

In theory, after the first round of selection, the model with the smallest error would be chosen to forecast the given sample data and produce the final forecast value. Testing shows, however, that if two or more models perform well, it is better for each of them to forecast the given sample data and to take the average of their forecast values as the final forecast. This approach is bagging, an ensemble method commonly used to increase the accuracy of predictors and classifiers.

How, then, are the well-performing models identified? Among the models remaining after the first round of selection, the model with the smallest error serves as the reference, and each of the other models is tested against it for statistical significance, as in Testing 1 of Section IV.

Suppose the fitted models cover k series points. The error at each point can be treated as an independent sample from a probability distribution, and the test statistic generally follows a t distribution with k-1 degrees of freedom, where k is the number of series points. A t-test is conducted under the hypothesis that the two models are "identical", that is, that the difference between their mean errors is "zero". If the hypothesis is rejected, the difference between the two models is statistically significant, and the model with the higher error rate is sifted out. Every model whose difference is not statistically significant is retained. The forecast values of all retained models are then used as inputs to the bagging step to increase the accuracy of the forecast.

When the same data series is forecast using M1 and M2 respectively, the t statistic for the significance test of the error is calculated as:

\[ t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{\mathrm{var}(M_1 - M_2)/k}} \]

In this formula, \overline{err}(M_1) is the average error of model M1, \overline{err}(M_2) is the average error of model M2, and var(M1-M2) is the variance of the difference between the two models:

\[ \mathrm{var}(M_1 - M_2) = \frac{1}{k} \sum_{i=1}^{k} \left[ err(M_1)_i - err(M_2)_i - \left( \overline{err}(M_1) - \overline{err}(M_2) \right) \right]^2 \]

III. THE ANALYSIS AND REALIZATION OF THE SALES FORECAST SYSTEM

A. System Architecture

The architecture of the sales forecast system is shown in Fig. 1:

Figure 1. System architecture of the Sales forecast system.
[Diagram labels: Data; Data Imaged; Data Cache; ETL; Data Warehouse; MS DSO; Data Cube; XMLA; (Application); Library of Forecast methods; The study, testing and use of model; Library of Models; Forecast Objects; Results.]

The procedure is as follows:

1) All heterogeneous data is completely mirrored in the data caching zone.

So as not to place an undue burden on the supermarket's operational database, a data caching zone is necessary. The caching zone starts as a blank database. All table structures and data are created and imported by the data extraction program and are identical to the heterogeneous source data. Extraction is a full extraction carried out by an automatic extraction program, and the extraction frequency can be set manually to daily, weekly or another interval.

2) The establishment of the data warehouse.

The data warehouse SALES is established by combining and summarizing the cached data, computing views, checking integrity, and cleaning and loading. The data warehouse contains tables with a preset structure.

3) Setting up the data cube.

The data in the data warehouse is aggregated into a data cube stored on the Analysis Services server. The star-schema data cube built for the sales forecast model is shown in Fig. 2:
Figure 2. Sales Star
[Diagram contents: fact table sales (time_key, branch_key, item_key, promotion_key, dollars_sold); dimension tables time (time_key, day, day_of_the_week, month, quarter, year), branch (branch_key, branch_name), item (item_key, item_name, category, subcategory) and promotion (promotion_key, promotion_name, protion_type).]

4) The study, testing and use of the forecast model.

The forecast algorithm introduced in Section II is used to train the model and produce the forecast for the forecast object chosen by the client, returning the forecast values and the forecast curve to assist the manager in decision-making. The model, its parameters and its error values are all saved in the model database, to be applied in future forecasts of the same commodity or used as a reference when forecasting other commodities of the same niche.

B. System Developing Platform

The system uses MS SQL Server as the server for the data caching zone and the data warehouse, and Analysis Services as the data cube server. Both the server and client programs are built on the .NET framework and written in C#. The DSO (Microsoft Decision Support Objects) module is used from C# to set up the data cube in Analysis Services. The system uses XMLA (Microsoft XML for Analysis) for interaction between the multidimensional data sets and the mining program in Analysis Services.

IV. ANALYSIS WITH EXAMPLE AND TESTING OF ALGORITHM EFFECTIVENESS

A. Testing 1

The existing monthly sales data of "Darlie Toothpaste 225G" after promotion reduction at one outlet are listed below:

TABLE I. MONTHLY SALES OF "DARLIE TOOTHPASTE 225G"

Year   Sales Vol.
2008    569   585   552   362   439   965  1330  1387   830   369   976  1505
2009   2419   240   472   567   579  1183  1512  1669   809   583   359   322
2010   1177  2352   927   259   242   353   262   274   304   127   162   235
2011    382   599   356   332   270   264   861  1432  1361   658   874   278
2012    357  1155  1185   781   346   297   451   355   394   383   235   198
2013    587   605   412

With the last three data points held out as the testing set, the three prediction methods were fitted to the remaining data. The best model of each method and its fitting error, as provided by the system, are listed below:

TABLE II. MODELS AND MODEL FITTING OF TESTING 1

Model No.   Model Description                        R2     RMSE
M1          Winter's model                           0.67   614.6
M2          Seasonal Decomposition + ARIMA(2,2,1)    0.55   445.98
M3          ARIMA(1,2,1)(1,1,2)12                    0.6    784.18

Generally, for a forecast model to be acceptable and its forecast values to be usable as a reference, its goodness-of-fit R2 should be no less than 0.5. With the R2 threshold set to 0.5, M1, M2 and M3 are all retained after the first round of selection. Among the three, M2 has the smallest RMSE, so M1 and M3 are each tested against M2 for statistical significance. For M1 against M2, the statistic t=0.867 is below the tabular value 2.005 at sig/2=0.025, so there is no statistically significant difference between M1 and M2; the difference between them is stochastic, and M1 is retained. For M3 against M2, the reported statistic t=0.25 is compared with the tabular value 2.005 at sig/2=0.025, and the test finds a statistically significant difference between M3 and M2; the difference between them is not stochastic, and M3 is sifted out.

The error calculated on the testing set is given below:

TABLE III. ERROR ESTIMATION OF TESTING 1

Observed Value   Forecast Value of M1   Forecast Value of M2
587              614                    642
605              620                    697
412              332                    363

The output error is:

RMSE=(RMSE(M1)+RMSE(M2))/2=(49.5+55.1)/2=52.3

The forecast values are given below:

TABLE IV. FORECAST VALUES OF TESTING 1

Time      Model M1's output   Model M2's output   Final output
2013-04   240                 208                 224
2013-05   98                  158                 128
2013-06   347                 109                 253
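The selection and averaging steps applied in Testing 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the system's C# implementation: the function names are hypothetical, the per-point errors passed to `paired_t` are assumed to be whatever error measure the models report at each series point, the critical value is supplied by the caller (for example the tabular value 2.005 used above), and the comparison uses the absolute value of t. The usage lines take the M1 column of Table III and the 2013-04 and 2013-05 rows of Table IV.

```python
import math


def rmse(observed, forecast):
    """Root mean squared error between observed and forecast values."""
    n = len(observed)
    return math.sqrt(sum((f - o) ** 2 for o, f in zip(observed, forecast)) / n)


def paired_t(err_a, err_b):
    """t statistic from the formula in Section II: mean error difference
    divided by sqrt(var(M1 - M2) / k), with the variance taken over k points."""
    k = len(err_a)
    diffs = [a - b for a, b in zip(err_a, err_b)]
    mean_diff = sum(diffs) / k
    var = sum((d - mean_diff) ** 2 for d in diffs) / k
    if var == 0:
        # Degenerate case (assumption): identical differences at every point.
        return 0.0 if mean_diff == 0 else math.copysign(math.inf, mean_diff)
    return mean_diff / math.sqrt(var / k)


def select_models(errors, critical):
    """Keep the model with the lowest mean error plus every model whose
    errors do not differ significantly from it (|t| below the critical value)."""
    best = min(errors, key=lambda name: sum(errors[name]) / len(errors[name]))
    kept = [best]
    for name, err in errors.items():
        if name != best and abs(paired_t(err, errors[best])) < critical:
            kept.append(name)
    return kept


def bagged_forecast(forecasts):
    """Average the forecasts of the retained models point by point."""
    return [sum(values) / len(values) for values in zip(*forecasts)]


observed = [587, 605, 412]  # Table III, observed values
m1 = [614, 620, 332]        # Table III, forecast values of M1
print(round(rmse(observed, m1), 1))              # 49.5
print(bagged_forecast([[240, 98], [208, 158]]))  # [224.0, 128.0]
```

Passing the critical value in as a parameter keeps the sketch free of any statistics library; in practice it would be read from a t table for k-1 degrees of freedom at the chosen significance level.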
B. Testing 2

TABLE VIII. FORECAST VALUES OF TESTING 2
Platform.” Chinese Journal of Computer Technology and Development, Vol. 17, No. 2, Feb. 2007, pp. 27-30.
[8] Hillebrand, Eric, and Marcelo C. Medeiros. “The benefits of bagging for forecast models of realized volatility.” Econometric Reviews, Vol. 29, No. 5, 2010, pp. 571-593.