Retail Analysis With Walmart Data
DESCRIPTION
One of the leading retail chains in the US, Walmart, would like to predict sales and demand accurately. Certain events and holidays impact sales on each day. Sales data are available for 45 Walmart stores. The business faces a challenge due to unforeseen demand and sometimes runs out of stock because of an inappropriate machine learning algorithm. An ideal ML algorithm would predict demand accurately while ingesting factors such as economic conditions, including CPI and the Unemployment Index.
Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. Historical sales data for 45 Walmart stores located in different regions are available.
Dataset Description
This is the historical data that covers sales from 2010-02-05 to 2012-11-01, in the file Walmart_Store_sales. Within this file you will find the following fields: Store, Date, Weekly_Sales, Holiday_Flag, Temperature, Fuel_Price, CPI, and Unemployment.
Holiday Events
Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12
Labour Day: 10-Sep-10, 09-Sep-11, 07-Sep-12
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12
Analysis Tasks
2) Which store has the maximum standard deviation, i.e., whose sales vary the most? Also find the coefficient of variation (standard deviation to mean).
4) Some holidays have a negative impact on sales. Find the holidays that have higher sales than the mean sales in the non-holiday season for all stores together.
5) Provide a monthly and semester view of sales in units and give insights
Statistical Model
1) Linear Regression – Utilize variables like date; restructure dates as sequential numbers, with 1 for 5 Feb 2010 (numbering from the earliest date onward). Hypothesize whether CPI, unemployment, and fuel price have any impact on sales.
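As a hedged illustration of that date restructuring (toy rows standing in for the real file, not the full dataset), the idea is to sort by date and assign 1 to the earliest week:

```python
import pandas as pd

# A minimal sketch of the date-restructuring step: assign 1 to the
# earliest week and count upward. Column names follow the
# Walmart_Store_sales layout described above; the rows are toy values.
df = pd.DataFrame({
    'Date': ['05-02-2010', '12-02-2010', '19-02-2010'],
    'Weekly_Sales': [1643690.90, 1641957.44, 1611968.17],
})
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df = df.sort_values('Date').reset_index(drop=True)
df['week_number'] = df.index + 1  # 1 for 5 Feb 2010, 2 for the next week, ...
print(df['week_number'].tolist())  # [1, 2, 3]
```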
Out[3]:
Store Date Weekly_Sales Holiday_Flag Temperature Fuel_Price CPI Unemployment
Out[4]:
Store Date Weekly_Sales Holiday_Flag Temperature Fuel_Price CPI Unemployment
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6435 entries, 0 to 6434
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Store 6435 non-null int64
1 Date 6435 non-null object
2 Weekly_Sales 6435 non-null float64
3 Holiday_Flag 6435 non-null int64
4 Temperature 6435 non-null float64
5 Fuel_Price 6435 non-null float64
6 CPI 6435 non-null float64
7 Unemployment 6435 non-null float64
dtypes: float64(5), int64(2), object(1)
memory usage: 402.3+ KB
Out[8]: store 0
date 0
weekly_sales 0
holiday_flag 0
temperature 0
fuel_price 0
cpi 0
unemployment 0
dtype: int64
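The cell that produced the Out[8] null counts above is missing from this export; a hedged reconstruction of what it most likely was, shown here on a toy frame rather than the real data:

```python
import pandas as pd

# Hedged reconstruction of the missing In[8] cell: count null values
# per column. A two-column toy frame stands in for walmart_df.
walmart_df = pd.DataFrame({'store': [1, 2], 'weekly_sales': [100.0, 200.0]})
null_counts = walmart_df.isnull().sum()
print(null_counts.tolist())  # [0, 0]
```

On the real data every column reports 0 nulls, as Out[8] shows.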
Maximum sales
In [9]: # groupby stores and get total sales
store_totalweeklysales = walmart_df.groupby('store')['weekly_sales'].sum()
store_totalweeklysales.to_frame()
Out[9]:
weekly_sales
store
1 2.224028e+08
2 2.753824e+08
3 5.758674e+07
4 2.995440e+08
5 4.547569e+07
6 2.237561e+08
7 8.159828e+07
8 1.299512e+08
9 7.778922e+07
10 2.716177e+08
11 1.939628e+08
12 1.442872e+08
13 2.865177e+08
14 2.889999e+08
15 8.913368e+07
16 7.425243e+07
17 1.277821e+08
18 1.551147e+08
19 2.066349e+08
20 3.013978e+08
21 1.081179e+08
22 1.470756e+08
23 1.987506e+08
24 1.940160e+08
25 1.010612e+08
26 1.434164e+08
27 2.538559e+08
28 1.892637e+08
29 7.714155e+07
30 6.271689e+07
31 1.996139e+08
32 1.668192e+08
33 3.716022e+07
34 1.382498e+08
35 1.315207e+08
36 5.341221e+07
37 7.420274e+07
38 5.515963e+07
39 2.074455e+08
40 1.378703e+08
41 1.813419e+08
42 7.956575e+07
43 9.056544e+07
44 4.329309e+07
45 1.123953e+08
In [10]: print("{:.2f}".format(store_totalweeklysales.max()))
# using argmax to get the max sales store index
print(store_totalweeklysales.index[store_totalweeklysales.argmax()])
301397792.46
20
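The cells that computed the per-store standard deviations (Out[11]) and means (Out[13]) did not survive the export. A minimal sketch of how they were likely produced, with toy numbers in place of the real data:

```python
import pandas as pd

# Hedged reconstruction of the missing groupby cells: per-store
# standard deviation and mean of weekly sales (toy data).
walmart_df = pd.DataFrame({
    'store': [1, 1, 1, 2, 2, 2],
    'weekly_sales': [100.0, 110.0, 120.0, 200.0, 260.0, 230.0],
})
store_sales_std = walmart_df.groupby('store')['weekly_sales'].std()
store_sales_mean = walmart_df.groupby('store')['weekly_sales'].mean()
print(store_sales_std.round(2).tolist())   # [10.0, 30.0]
print(store_sales_mean.tolist())           # [110.0, 230.0]
```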
Out[11]:
weekly_sales
store
1 155980.767761
2 237683.694682
3 46319.631557
4 266201.442297
5 37737.965745
6 212525.855862
7 112585.469220
8 106280.829881
9 69028.666585
10 302262.062504
11 165833.887863
12 139166.871880
13 265506.995776
14 317569.949476
15 120538.652043
16 85769.680133
17 112162.936087
18 176641.510839
19 191722.638730
20 275900.562742
21 128752.812853
22 161251.350631
23 249788.038068
24 167745.677567
25 112976.788600
26 110431.288141
27 239930.135688
28 181758.967539
29 99120.136596
30 22809.665590
31 125855.942933
32 138017.252087
33 24132.927322
34 104630.164676
35 211243.457791
36 60725.173579
37 21837.461190
38 42768.169450
39 217466.454833
40 119002.112858
41 187907.162766
42 50262.925530
43 40598.413260
44 24762.832015
45 130168.526635
In [12]: print("{:.2f}".format(store_sales_std.max()))
# using argmax to get the max std.dev store index
print(store_sales_std.index[store_sales_std.argmax()])
317569.95
14
Out[13]:
weekly_sales
store
1 1.555264e+06
2 1.925751e+06
3 4.027044e+05
4 2.094713e+06
5 3.180118e+05
6 1.564728e+06
7 5.706173e+05
8 9.087495e+05
9 5.439806e+05
10 1.899425e+06
11 1.356383e+06
12 1.009002e+06
13 2.003620e+06
14 2.020978e+06
15 6.233125e+05
16 5.192477e+05
17 8.935814e+05
18 1.084718e+06
19 1.444999e+06
20 2.107677e+06
21 7.560691e+05
22 1.028501e+06
23 1.389864e+06
24 1.356755e+06
25 7.067215e+05
26 1.002912e+06
27 1.775216e+06
28 1.323522e+06
29 5.394514e+05
30 4.385796e+05
31 1.395901e+06
32 1.166568e+06
33 2.598617e+05
34 9.667816e+05
35 9.197250e+05
36 3.735120e+05
37 5.189003e+05
38 3.857317e+05
39 1.450668e+06
40 9.641280e+05
41 1.268125e+06
42 5.564039e+05
43 6.333247e+05
44 3.027489e+05
45 7.859814e+05
In [14]: # coefficient of variation = (std.dev / mean) * 100
covariance_std_mean = (store_sales_std / store_sales_mean) * 100
In [15]: covariance_std_mean.to_frame()
Out[15]:
weekly_sales
store
1 10.029212
2 12.342388
3 11.502141
4 12.708254
5 11.866844
6 13.582286
7 19.730469
8 11.695283
9 12.689547
10 15.913349
11 12.226183
12 13.792532
13 13.251363
14 15.713674
15 19.338399
16 16.518065
17 12.552067
18 16.284550
19 13.268012
20 13.090269
21 17.029239
22 15.678288
23 17.972115
24 12.363738
25 15.986040
26 11.011066
27 13.515544
28 13.732974
29 18.374247
30 5.200804
31 9.016105
32 11.831049
33 9.286835
34 10.822524
35 22.968111
36 16.257891
37 4.208412
38 11.087545
39 14.990779
40 12.342978
41 14.817711
42 9.033533
43 6.410363
44 8.179331
45 16.561273
Store numbers 30, 31, 33, 37, 42, 43, and 44 have very low coefficients of variation, i.e., their sales are relatively stable.
In [18]: walmart_df.head()
Out[18]:
store date weekly_sales holiday_flag temperature fuel_price cpi unemployment year month day
In [21]: walmart_df.head()
Out[21]:
store date weekly_sales holiday_flag temperature fuel_price cpi unemployment year month day
Out[22]: 3
In [24]: stores_2012_sales.head()
Out[24]:
store date weekly_sales holiday_flag temperature fuel_price cpi unemployment year month day
In [26]: stores_2012_sales.head()
Out[26]:
store date weekly_sales holiday_flag temperature fuel_price cpi unemployment year month day quartile
In [27]: stores_2012_sales['month'].unique()
Out[27]: array(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep',
'Oct'], dtype=object)
In [29]: stores_2012_sales
Out[29]:
store date weekly_sales holiday_flag temperature fuel_price cpi unemployment year month day quartile
... ... ... ... ... ... ... ... ... ... ... ... ...
Out[30]: 4
In [33]: q2_sales.head(3)
Out[33]: store
1 20978760.12
2 25083604.88
3 5620316.49
Name: weekly_sales, dtype: float64
In [34]: q3_sales.head(3)
Out[34]: store
1 20253947.78
2 24303354.86
3 5298005.47
Name: weekly_sales, dtype: float64
Out[35]:
weekly_sales
store
1 -724812.34
2 -780250.02
3 -322311.02
4 -657571.21
5 -302572.70
6 -666597.68
7 971928.12
8 -170678.25
9 -462785.55
10 -713110.41
11 -271290.51
12 -826064.21
13 -587947.84
14 -3967974.76
15 -343162.04
16 557205.66
17 -132947.88
18 -406429.38
19 -163745.39
20 -632670.34
21 -266997.03
22 -642754.35
23 152606.33
24 292158.81
25 -213930.25
26 520356.34
27 -436301.34
28 -426188.16
29 -454073.36
30 -147612.43
31 -460524.05
32 -92742.10
33 -115380.03
34 -367622.08
35 484108.12
36 -320299.94
37 -96481.13
38 -32436.44
39 500987.77
40 145457.84
41 433901.28
42 -271479.93
43 -168264.19
44 104845.38
45 -809499.45
In [36]: # using argmax to find the store that has maximum sales growth in quarter 3
print(q3_salesgrowth.index[q3_salesgrowth.argmax()])
print("{:.2f}".format(q3_salesgrowth.max()))
7
971928.12
Store number 7 has the highest Q3 sales growth among all stores.
In [37]: # finding mean sales on superbowl for all stores each year
superbowl_sales = walmart_df[(walmart_df['date'] == '12-Feb-10')|(walmart_df['date'] == '11-Feb-11')|(walmart_df['date'] == '10-Feb-12')]
print("{:.2f}".format(superbowl_sales['weekly_sales'].mean()))
1079127.99
In [38]: # finding mean sales on labourday for all stores each year
labourday_sales = walmart_df[(walmart_df['date'] == '10-Sep-10')|(walmart_df['date'] == '09-Sep-11')|(walmart_df['date'] == '07-Sep-12')]
print("{:.2f}".format(labourday_sales['weekly_sales'].mean()))
1042427.29
In [39]: # finding mean sales on thanksgiving for all stores each year
thanksgiving_sales = walmart_df[(walmart_df['date'] == '26-Nov-10')|(walmart_df['date'] == '25-Nov-11')|(walmart_df['date'] == '23-Nov-12')]
print("{:.2f}".format(thanksgiving_sales['weekly_sales'].mean()))
1471273.43
In [40]: # finding mean sales on christmas for all stores each year
christmas_sales = walmart_df[(walmart_df['date'] == '31-Dec-10')|(walmart_df['date'] == '30-Dec-11')|(walmart_df['date'] == '28-Dec-12')]
print("{:.2f}".format(christmas_sales['weekly_sales'].mean()))
960833.11
The Thanksgiving holiday has higher mean sales than any other holiday.
1122887.89
1041256.38
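The two figures above are the overall holiday-week mean (1122887.89) and the non-holiday-week mean (1041256.38); the cell that produced them is missing from the export. A hedged sketch of the likely computation, splitting on holiday_flag with toy data:

```python
import pandas as pd

# Hedged sketch of the missing holiday/non-holiday comparison:
# mean weekly sales where holiday_flag is 1 vs 0 (toy values).
walmart_df = pd.DataFrame({
    'weekly_sales': [100.0, 300.0, 120.0, 140.0],
    'holiday_flag': [1, 1, 0, 0],
})
holiday_mean = walmart_df.loc[walmart_df['holiday_flag'] == 1, 'weekly_sales'].mean()
nonholiday_mean = walmart_df.loc[walmart_df['holiday_flag'] == 0, 'weekly_sales'].mean()
print(holiday_mean, nonholiday_mean)  # 200.0 130.0
```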
In [45]: print("{:.2f}".format(superbowl_sales['weekly_sales'].mean()))
print("{:.2f}".format(labourday_sales['weekly_sales'].mean()))
print("{:.2f}".format(thanksgiving_sales['weekly_sales'].mean()))
print("{:.2f}".format(christmas_sales['weekly_sales'].mean()))
1079127.99
1042427.29
1471273.43
960833.11
In [46]: print("{:.2f}".format(nonholiday_sales['weekly_sales'].mean()))
1041256.38
Only Christmas (960833.11) falls below the non-holiday mean sales (1041256.38), so Christmas has a negative impact on sales; the Super Bowl, Labour Day, and Thanksgiving weeks all exceed the non-holiday mean.
Out[47]: 12
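The aggregation behind Out[47] and the totalmonthly_sales frame is missing from the export. A hedged sketch of the likely monthly grouping, on toy rows (the real data yields 12 month groups, hence Out[47]: 12):

```python
import pandas as pd

# Hedged sketch of the monthly view: total weekly sales per month.
# Two toy months stand in for the real data's twelve.
walmart_df = pd.DataFrame({
    'month': [1, 1, 2],
    'weekly_sales': [100.0, 150.0, 200.0],
})
totalmonthly_sales = walmart_df.groupby('month')[['weekly_sales']].sum()
print(len(totalmonthly_sales))  # 2
```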
In [48]: totalmonthly_sales['weekly_sales'].describe()
Out[48]:
count mean std min 25% 50% 75% max
month
Out[49]:
count mean std min 25% 50% 75% max
month
Out[50]:
store date weekly_sales holiday_flag temperature fuel_price cpi unemployment year month day
Out[51]:
store date weekly_sales holiday_flag temperature fuel_price cpi unemployment year month day
Out[52]:
count mean std min 25% 50% 75% max
month
Out[53]:
store date weekly_sales holiday_flag temperature fuel_price cpi unemployment year month day
Out[54]:
store date weekly_sales holiday_flag temperature fuel_price cpi unemployment year month day
Out[55]:
count mean std min 25% 50% 75% max
month
Out[56]:
store date weekly_sales holiday_flag temperature fuel_price cpi unemployment year month day quartile
Out[57]:
store date weekly_sales holiday_flag temperature fuel_price cpi unemployment year month day quartile
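The cells that built the semester subsets (semester1_2010, semester2_2010, and so on) used in the pie chart of In[59] are missing. A hedged sketch of the likely split, treating months 1-6 as semester 1 and months 7-12 as semester 2, on toy data:

```python
import pandas as pd

# Hedged sketch of the semester split: filter one year, then divide
# its rows by month (1-6 vs 7-12). Toy rows stand in for walmart_df.
walmart_df = pd.DataFrame({
    'year': [2010, 2010, 2010],
    'month': [3, 8, 11],
    'weekly_sales': [100.0, 200.0, 300.0],
})
sales_2010 = walmart_df[walmart_df['year'] == 2010]
semester1_2010 = sales_2010[sales_2010['month'] <= 6]
semester2_2010 = sales_2010[sales_2010['month'] > 6]
print(semester1_2010['weekly_sales'].sum(),
      semester2_2010['weekly_sales'].sum())  # 100.0 500.0
```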
Visualization
In [58]: # Visualizing total sales percentage for each year
list1 = [stores_2010_sales['weekly_sales'].sum(),stores_2011_sales['weekly_sales'].sum(),stores_2012_sales['weekly_sales'].sum()]
labels = '2010 Sales','2011 Sales','2012 Sales'
cmap = plt.get_cmap('YlGnBu')
colors = cmap(np.arange(3)*95)
txt={'weight':'bold'}
plt.figure(figsize=(15,5))
plt.pie(list1,labels=labels,autopct='%.1f%%',colors=colors,textprops=txt)
plt.show()
In [59]: # Visualizing total sales percentage semester wise for different years
list1 = [semester1_2010['weekly_sales'].sum(),semester2_2010['weekly_sales'].sum(),
semester1_2011['weekly_sales'].sum(),semester2_2011['weekly_sales'].sum(),
semester1_2012['weekly_sales'].sum(),semester2_2012['weekly_sales'].sum()]
labels = 'Semester 1 - 2010','Semester 2 - 2010','Semester 1 - 2011','Semester 2 - 2011','Semester 1 - 2012','Semester 2 - 2012'
cmap = plt.get_cmap('Blues')
colors = cmap(np.arange(6)*40)  # six distinct shades, one per wedge
txt={'weight':'bold'}
plt.figure(figsize=(15,5))
plt.pie(list1,labels=labels,autopct='%.1f%%',colors=colors,textprops=txt)
plt.show()
Sales were highest in the 2nd semester of 2011.
These pie charts show the semester-wise sales percentage for each year.
In [62]: # HeatMap of pairwise correlations (upper triangle masked)
m = np.ones_like(walmart_df.drop(columns=['holiday_flag','year']).corr())
m[np.tril_indices_from(m)]=0
labels = ['store','weeklysales','temperature','fuelprice','CPI','unemployment','day']
plt.figure(figsize=(12,6))
sns.heatmap(walmart_df.drop(columns=['holiday_flag','year']).corr(),annot=True,mask=m,cmap='YlGnBu',linewidths=.5,xticklabels=labels)
plt.show()
CPI, Unemployment, and Fuel price do not have any significant impact on weekly_sales.
Statistical Model
For Store 1 – Build prediction models to forecast demand
Out[63]:
store date weekly_sales holiday_flag temperature fuel_price cpi unemployment year month day strdate
In [64]: # Restructuring dates to sequential numbers, since raw dates cannot be used directly in a linear model
dummy = list(range(1, 144))
store1_dataset['dummy_date'] = dummy
In [65]: store1_dataset.head()
Out[65]:
   store       date  weekly_sales  holiday_flag  temperature  fuel_price         cpi  unemployment  year  month  day     strdate  dummy_date
0      1  05-Feb-10    1643690.90             0        42.31       2.572  211.096358         8.106  2010      2    5  2010-02-05           1
1      1  12-Feb-10    1641957.44             1        38.51       2.548  211.242170         8.106  2010      2   12  2010-02-12           2
2      1  19-Feb-10    1611968.17             0        39.93       2.514  211.289143         8.106  2010      2   19  2010-02-19           3
3      1  26-Feb-10    1409727.59             0        46.63       2.561  211.319643         8.106  2010      2   26  2010-02-26           4
4      1  05-Mar-10    1554806.68             0        46.50       2.625  211.350143         8.106  2010      3    5  2010-03-05           5
LinearRegression Model
Out[66]:
store holiday_flag temperature fuel_price cpi unemployment year month day dummy_date weekly_sales
In [67]: # Splitting data into train and test for the linear model
train,test=train_test_split(model_dataset,test_size=0.20,random_state=0)
lr = LinearRegression()
x_train = train.drop(columns=['weekly_sales'])
x_test = test.drop(columns=['weekly_sales'])
y_train = train['weekly_sales']
y_test = test['weekly_sales']
Out[69]: -2364627031.7073054
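The fit and scoring cells behind Out[69] are missing from the export; the pattern is to fit on the training split and call score (R²) on the held-out split. A hedged sketch on synthetic arrays (note that the real model's R² above is strongly negative, i.e., the linear fit performs worse than predicting the mean):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hedged sketch of the missing fit/score step. Synthetic features and
# an almost-linear target stand in for x_train/y_train and x_test/y_test.
rng = np.random.default_rng(0)
coef = np.array([2.0, -1.0, 0.5])
x_train = rng.normal(size=(80, 3))
y_train = x_train @ coef + rng.normal(scale=0.1, size=80)
x_test = rng.normal(size=(20, 3))
y_test = x_test @ coef

lr = LinearRegression()
lr.fit(x_train, y_train)            # learn coefficients from the training split
print(lr.score(x_test, y_test))     # R^2 on the held-out split; near 1 here
```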
Out[73]:
OLS Regression Results
Df Model: 9
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.04e-32. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
              p-value   alpha
CPI             0.089    0.05
Fuel Price      0.562    0.05
Unemployment    0.816    0.05
All three p-values exceed alpha = 0.05, so none of these variables has a statistically significant effect on weekly sales.
In [76]: # Visualizing weekly sales vs fuel price with a regression plot
sns.set(font_scale=1.2,style="white")
sns.lmplot(x='fuel_price',y='weekly_sales',data = store1_dataset,height=6,aspect=2,line_kws={'color':'red'})
plt.title("Fuel price vs Weekly sales")
plt.xlabel("Fuel price")
plt.ylabel("Weekly Sales")
plt.show()
In [77]: # Visualizing weekly sales vs unemployment index with a regression plot
sns.set(font_scale=1.2,style="white")
sns.lmplot(x='unemployment',y='weekly_sales',data = store1_dataset,height=4,aspect=2,line_kws={'color':'red'})
plt.title("Unemployment vs Weekly sales")
plt.xlabel("Unemployment")
plt.ylabel("Weekly Sales")
plt.show()
In [80]: walmart_df.head(10)
Out[80]:
store date weekly_sales holiday_flag temperature fuel_price cpi unemployment year month day dayofweek