Wholesale Customers Data Analysis
import numpy as np
import pandas as pd
import copy
import pylab
import math
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
import warnings
warnings.filterwarnings('ignore')
In [5]:
wholesale_customer_df.head()
Out[5]: Buyer/Spender Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
In [14]:
wholesale_customer_df.isnull().sum()
Out[14]: Buyer/Spender 0
Channel 0
Region 0
Fresh 0
Milk 0
Grocery 0
Frozen 0
Detergents_Paper 0
Delicatessen 0
dtype: int64
In [19]:
wholesale_customer_df.head()
Out[19]: Buyer/Spender Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
In [20]:
wholesale_customer_drop_df = copy.deepcopy(wholesale_customer_df)
wholesale_customer_drop_df
Out[20]: Buyer/Spender Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
In [21]:
del wholesale_customer_drop_df['Buyer/Spender']
In [22]:
wholesale_customer_drop_df
In [23]:
wholesale_customer_drop_df['Region'].value_counts()
Out[23]: Other 316
Lisbon 77
Oporto 47
Name: Region, dtype: int64
In [24]:
wholesale_customer_drop_df['Channel'].value_counts()
Out[24]: Hotel 298
Retail 142
Name: Channel, dtype: int64
In [25]:
def categorical_multi(i, j):
    pd.crosstab(wholesale_customer_drop_df[i], wholesale_customer_drop_df[j]).plot(kind='bar')
    plt.show()
    print(pd.crosstab(wholesale_customer_drop_df[i], wholesale_customer_drop_df[j]))
categorical_multi(i='Channel', j='Region')
Region   Lisbon  Oporto  Other
Channel
Hotel        59      28    211
Retail       18      19    105
EDA
Starting with univariate analysis (each feature individually), before carrying out the bivariate analysis to compare pairs of features and find correlations between them.
In [26]:
print('Descriptive Statistics of our Data:')
wholesale_customer_drop_df.describe().T
In [27]:
print('Descriptive Statistics of our Data including Channel & Region:')
wholesale_customer_drop_df.describe(include='all').T
Out[27]:          count unique    top freq          mean           std   min      25%     50%       75%       max
Channel             440      2  Hotel  298           NaN           NaN   NaN      NaN     NaN       NaN       NaN
Region              440      3  Other  316           NaN           NaN   NaN      NaN     NaN       NaN       NaN
Fresh             440.0    NaN    NaN  NaN  12000.297727  12647.328865   3.0  3127.75  8504.0  16933.75  112151.0
Milk              440.0    NaN    NaN  NaN   5796.265909   7380.377175  55.0  1533.00  3627.0   7190.25   73498.0
Grocery           440.0    NaN    NaN  NaN   7951.277273   9503.162829   3.0  2153.00  4755.5  10655.75   92780.0
Frozen            440.0    NaN    NaN  NaN   3071.931818   4854.673333  25.0   742.25  1526.0   3554.25   60869.0
Detergents_Paper  440.0    NaN    NaN  NaN   2881.493182   4767.854448   3.0   256.75   816.5   3922.00   40827.0
Delicatessen      440.0    NaN    NaN  NaN   1524.870455   2820.105937   3.0   408.25   965.5   1820.25   47943.0
Univariate
In [32]:
def plot_distribution(df, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(hspace=hspace, wspace=wspace)
    rows = math.ceil(float(df.shape[1]) / cols)
    for i, column in enumerate(df.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        if df.dtypes[column] == object:
            g = sns.countplot(y=column, data=df)
            substrings = [s.get_text()[:18] for s in g.get_yticklabels()]
            g.set(yticklabels=substrings)
            plt.xticks(rotation=25)
        else:
            g = sns.distplot(df[column])
            plt.xticks(rotation=25)
plot_distribution(wholesale_customer_drop_df, cols=3)
From the graphs on the distribution of product it seems that we have some outliers in the data,
further deep dive to identify the outlier.
In [34]:
# removing the categorical columns:
products = wholesale_customer_drop_df[wholesale_customer_drop_df.columns[2:]]

def plot_boxplots(df2, cols=3, width=20, height=15):
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width, height))
    rows = math.ceil(float(df2.shape[1]) / cols)
    for i, column in enumerate(df2.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        g = sns.boxplot(df2[column])
        plt.xticks(rotation=25)
plot_boxplots(products)
FYI Evaluator: Outliers are detected but not necessarily removed; it depends on the situation. Here I will assume that the wholesale distributor provided us a dataset with correct data, so I will keep them as is.
Bivariate
In [36]:
sns.set(style="ticks")
g = sns.pairplot(products,corner=True,kind='reg')
g.fig.set_size_inches(15,15)
From the pairplot above, the correlation between "Detergents_Paper" and "Grocery" appears quite strong, meaning that customers who spend on one of these product types tend to spend on the other as well. Applying the Pearson correlation coefficient to confirm this:
In [39]:
corr = products.corr()
sns.heatmap(corr,annot=True)
## There is a strong correlation (0.92) between "Detergents_Paper" and "Grocery".
Out[39]: <AxesSubplot:>
Problem 1:
1.1 Use methods of descriptive statistics to summarize data. Which Region and
which Channel spent the most? Which Region and which Channel spent the
least?
In [40]:
print('Descriptive Statistics of our Data:')
wholesale_customer_drop_df.describe().T
In [41]:
print('Descriptive Statistics of our Data including Channel & Region:')
wholesale_customer_drop_df.describe(include='all').T
Out[41]:          count unique    top freq          mean           std   min      25%     50%       75%       max
Channel             440      2  Hotel  298           NaN           NaN   NaN      NaN     NaN       NaN       NaN
Region              440      3  Other  316           NaN           NaN   NaN      NaN     NaN       NaN       NaN
Fresh             440.0    NaN    NaN  NaN  12000.297727  12647.328865   3.0  3127.75  8504.0  16933.75  112151.0
Milk              440.0    NaN    NaN  NaN   5796.265909   7380.377175  55.0  1533.00  3627.0   7190.25   73498.0
Grocery           440.0    NaN    NaN  NaN   7951.277273   9503.162829   3.0  2153.00  4755.5  10655.75   92780.0
Frozen            440.0    NaN    NaN  NaN   3071.931818   4854.673333  25.0   742.25  1526.0   3554.25   60869.0
Detergents_Paper  440.0    NaN    NaN  NaN   2881.493182   4767.854448   3.0   256.75   816.5   3922.00   40827.0
Delicatessen      440.0    NaN    NaN  NaN   1524.870455   2820.105937   3.0   408.25   965.5   1820.25   47943.0
From the above two describe outputs, we can infer the following:
Channel has two unique values, with "Hotel" the most frequent at 298 of 440 transactions, i.e. 67.7% of transactions come through the "Hotel" channel.
Region has three unique values, with "Other" the most frequent at 316 of 440 transactions, i.e. 71.8% of transactions come from the "Other" region.
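These percentages can be read straight off `value_counts(normalize=True)`; a minimal sketch on a toy frame built from the counts quoted above (the frame itself is illustrative, the real call would use `wholesale_customer_drop_df`):

```python
import pandas as pd

# Toy stand-in for wholesale_customer_drop_df, built from the counts above.
df = pd.DataFrame({'Channel': ['Hotel'] * 298 + ['Retail'] * 142})

# normalize=True returns shares instead of raw counts.
pct = (df['Channel'].value_counts(normalize=True) * 100).round(1)
print(pct)  # Hotel 67.7, Retail 32.3
```

This avoids dividing counts by `len(df)` by hand.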
Fresh has a mean of 12000.3, a standard deviation of 12647.3, a minimum of 3 and a maximum of 112151. Q1 (25%) is 3127.75, Q2 (50%, the median) is 8504 and Q3 (75%) is 16933.75. Range = max - min = 112151 - 3 = 112,148 and IQR = Q3 - Q1 = 16933.75 - 3127.75 = 13,806.
Milk has a mean of 5796.27, a standard deviation of 7380.38, a minimum of 55 and a maximum of 73498. Q1 is 1533, Q2 is 3627 and Q3 is 7190.25. Range = 73498 - 55 = 73,443 and IQR = 7190.25 - 1533 = 5657.25.
Grocery has a mean of 7951.28, a standard deviation of 9503.16, a minimum of 3 and a maximum of 92780. Q1 is 2153, Q2 is 4755.5 and Q3 is 10655.75. Range = 92780 - 3 = 92,777 and IQR = 10655.75 - 2153 = 8502.75.
Frozen has a mean of 3071.93, a standard deviation of 4854.67, a minimum of 25 and a maximum of 60869. Q1 is 742.25, Q2 is 1526 and Q3 is 3554.25. Range = 60869 - 25 = 60,844 and IQR = 3554.25 - 742.25 = 2812.
Detergents_Paper has a mean of 2881.49, a standard deviation of 4767.85, a minimum of 3 and a maximum of 40827. Q1 is 256.75, Q2 is 816.5 and Q3 is 3922. Range = 40827 - 3 = 40,824 and IQR = 3922 - 256.75 = 3665.25.
Delicatessen has a mean of 1524.87, a standard deviation of 2820.11, a minimum of 3 and a maximum of 47943. Q1 is 408.25, Q2 is 965.5 and Q3 is 1820.25. Range = 47943 - 3 = 47,940 and IQR = 1820.25 - 408.25 = 1412.
The IQR is helpful in calculating the 1.5 IQR lower/upper outlier limits for each item.
In [ ]:
##Which Region and which Channel spent the most?
In [42]:
wholesale_customer_spending_df = copy.deepcopy(wholesale_customer_drop_df)
wholesale_customer_spending_df['Spending'] = wholesale_customer_drop_df['Fresh'] + wholesale_customer_drop_df['Milk'] + wholesale_customer_drop_df['Grocery'] + wholesale_customer_drop_df['Frozen'] + wholesale_customer_drop_df['Detergents_Paper'] + wholesale_customer_drop_df['Delicatessen']
wholesale_customer_spending_df
Out[42]: Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicatessen Spending
... ... ... ... ... ... ... ... ... ...
435 Hotel Other 29703 12051 16027 13135 182 2204 73302
437 Retail Other 14531 15488 30243 437 14841 1867 77407
438 Hotel Other 10290 1981 2232 1038 168 2125 17834
In [43]:
regiondf = wholesale_customer_spending_df.groupby('Region')['Spending'].sum()
print(regiondf)
print()
channeldf = wholesale_customer_spending_df.groupby('Channel')['Spending'].sum()
print(channeldf)
Region
Lisbon 2386813
Oporto 1555088
Other 10677599
Channel
Hotel 7999569
Retail 6619931
Highest spending among Regions comes from "Other" and lowest from "Oporto".
Highest spending among Channels comes from "Hotel" and lowest from "Retail".
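The same conclusion can be pulled out programmatically with `idxmax`/`idxmin` on the grouped sums; a sketch on a toy table (column names match the notebook, the numbers are illustrative, and the real call would use `wholesale_customer_spending_df`):

```python
import pandas as pd

# Toy spending table standing in for wholesale_customer_spending_df.
df = pd.DataFrame({'Region': ['Lisbon', 'Oporto', 'Other', 'Other'],
                   'Channel': ['Hotel', 'Retail', 'Hotel', 'Retail'],
                   'Spending': [100, 50, 300, 200]})

region_totals = df.groupby('Region')['Spending'].sum()
channel_totals = df.groupby('Channel')['Spending'].sum()
print(region_totals.idxmax(), region_totals.idxmin())    # Other Oporto
print(channel_totals.idxmax(), channel_totals.idxmin())  # Hotel Retail
```

`idxmax`/`idxmin` return the group label of the largest/smallest total, so no eyeballing of the printed Series is needed.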
In [44]:
sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Fresh", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Fresh')
In [45]:
sns.catplot(x="Channel", y="Fresh", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Fresh')
In [46]:
sns.catplot(x="Region", y="Fresh", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Fresh')
Based on the plots, Fresh items sell more through the Hotel channel.
In [47]:
sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Milk", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Milk')
In [48]:
sns.catplot(x="Channel", y="Milk", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Milk')
In [49]:
sns.catplot(x="Region", y="Milk", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Milk')
In [50]:
sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Grocery", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Grocery')
In [51]:
sns.catplot(x="Channel", y="Grocery", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Grocery')
In [52]:
sns.catplot(x="Region", y="Grocery", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Grocery')
In [53]:
sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Frozen", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Frozen')
In [54]:
sns.catplot(x="Channel", y="Frozen", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Frozen')
In [55]:
sns.catplot(x="Region", y="Frozen", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Frozen')
In [56]:
sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Detergents_Paper", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Detergents_Paper')
In [57]:
sns.catplot(x="Channel", y="Detergents_Paper", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Detergents_Paper')
In [58]:
sns.catplot(x="Region", y="Detergents_Paper", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Detergents_Paper')
In [59]:
sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Delicatessen", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Delicatessen')
In [60]:
sns.catplot(x="Channel", y="Delicatessen", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Delicatessen')
In [61]:
sns.catplot(x="Region", y="Delicatessen", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Delicatessen')
In [62]:
standard_deviation_items = products.std()
standard_deviation_items.round(2)
Out[62]: Fresh 12647.33
Milk 7380.38
Grocery 9503.16
Frozen 4854.67
Detergents_Paper 4767.85
Delicatessen 2820.11
dtype: float64
In [63]:
cv_fresh = np.std(products['Fresh']) / np.mean(products['Fresh'])
cv_fresh
Out[63]: 1.0527196084948245
In [64]:
cv_milk = np.std(products['Milk']) / np.mean(products['Milk'])
cv_milk
Out[64]: 1.2718508307424503
In [65]:
cv_grocery = np.std(products['Grocery']) / np.mean(products['Grocery'])
cv_grocery
Out[65]: 1.193815447749267
In [66]:
cv_frozen = np.std(products['Frozen']) / np.mean(products['Frozen'])
cv_frozen
Out[66]: 1.5785355298607762
In [67]:
cv_detergents_paper = np.std(products['Detergents_Paper']) / np.mean(products['Detergents_Paper'])
cv_detergents_paper
Out[67]: 1.6527657881041729
In [68]:
cv_delicatessen = np.std(products['Delicatessen']) / np.mean(products['Delicatessen'])
cv_delicatessen
Out[68]: 1.8473041039189306
In [69]:
from scipy.stats import variation
variance_items = products.var()
variance_items
Out[69]: Fresh 1.599549e+08
Milk 5.446997e+07
Grocery 9.031010e+07
Frozen 2.356785e+07
Detergents_Paper 2.273244e+07
Delicatessen 7.952997e+06
dtype: float64
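Note that `scipy.stats.variation` (imported above) computes std/mean column-wise with the population std by default, so it reproduces all six per-item CVs in one call; a sketch on toy data (the real call would be `variation(products, axis=0)`):

```python
import numpy as np
import pandas as pd
from scipy.stats import variation

# Toy frame; substitute the real `products` DataFrame.
demo = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 60.0]})

# variation() returns one coefficient of variation per column (axis=0).
cv = pd.Series(variation(demo, axis=0), index=demo.columns)
# Each entry equals np.std(col) / np.mean(col), matching the manual CVs above.
print(cv)
```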
In [71]:
products.describe().T
In [72]:
pylab.style.use('seaborn-pastel')
products.plot.area(stacked=False,figsize=(11,5))
pylab.grid(); pylab.show()
1.4 Are there any outliers in the data? Back up your answer with a
suitable plot/technique with the help of detailed comments.
In [73]:
# Using the boxplot to see the outliers; the black points beyond the whiskers are the outliers in a boxplot graph.
plt.figure(figsize=(15,8))
sns.boxplot(data=products)
Out[73]: <AxesSubplot:>
In [74]:
def plot_distribution(items, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(hspace=hspace, wspace=wspace)
    rows = math.ceil(float(items.shape[1]) / cols)
    for i, column in enumerate(items.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        g = sns.boxplot(items[column])
        plt.xticks(rotation=25)
plot_distribution(products, cols=3)
Yes, there are outliers in all the items across the product range (Fresh, Milk, Grocery, Frozen, Detergents_Paper & Delicatessen). Outliers are detected but not necessarily removed; it depends on the situation. Here I will assume that the wholesale distributor provided us a dataset with correct data, so I will keep them as is.
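The claim that every item has outliers can be backed by counting the points outside the 1.5 IQR fences per column; a sketch on toy columns (the real call would pass `products`):

```python
import pandas as pd

# Toy frame: column A has one extreme point, column B has none.
demo = pd.DataFrame({'A': [1, 2, 3, 4, 100],
                     'B': [5, 6, 7, 8, 9]})

q1, q3 = demo.quantile(0.25), demo.quantile(0.75)
iqr = q3 - q1
# Boolean mask of values beyond either fence, then count per column.
outside = (demo < q1 - 1.5 * iqr) | (demo > q3 + 1.5 * iqr)
print(outside.sum())  # A: 1, B: 0
```

On the real data, a nonzero count for every product column confirms the boxplot reading.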
In [75]:
# visual analysis via histogram
products.hist(figsize=(6,6));