Wholesale Customers Data Analysis

The document analyzes data from a wholesale distributor operating in Portugal. The dataset contains annual spending for 440 large retailers on 6 product varieties across 3 regions and 2 sales channels. The data is analyzed to understand spending patterns and relationships between variables; key variables are Region, Channel, and annual spending on the product categories. The data is loaded, checked for missing values, and stripped of unnecessary columns to prepare it for analysis.


8/30/2021 Wholesale Customers Data Analysis

A wholesale distributor operating in different regions of Portugal has information on the annual spending on several items in its stores across different regions and channels. The data consists of 440 large retailers' annual spending on 6 different varieties of products in 3 different regions (Lisbon, Oporto, Other) and across 2 different sales channels (Hotel, Retail).
Importing the Libraries
In [31]:
import pandas as pd

import numpy as np

import copy

import matplotlib.pyplot as plt

import seaborn as sns

import pylab

import math

%matplotlib inline

import os

import warnings

warnings.filterwarnings('ignore')

Loading the Data


In [4]:
wholesale_customer_df = pd.read_csv('Wholesale+Customers+Data.csv')
wholesale_customer_df

Out[4]:
     Buyer/Spender Channel Region  Fresh   Milk  Grocery  Frozen  Detergents_Paper  Delicatessen
0                1  Retail  Other  12669   9656     7561     214              2674          1338
1                2  Retail  Other   7057   9810     9568    1762              3293          1776
2                3  Retail  Other   6353   8808     7684    2405              3516          7844
3                4   Hotel  Other  13265   1196     4221    6404               507          1788
4                5  Retail  Other  22615   5410     7198    3915              1777          5185
...            ...     ...    ...    ...    ...      ...     ...               ...           ...
435            436   Hotel  Other  29703  12051    16027   13135               182          2204
436            437   Hotel  Other  39228   1431      764    4510                93          2346
437            438  Retail  Other  14531  15488    30243     437             14841          1867
438            439   Hotel  Other  10290   1981     2232    1038               168          2125
439            440   Hotel  Other   2787   1698     2510      65               477            52

440 rows × 9 columns

In [5]:
wholesale_customer_df.head()

Out[5]:
   Buyer/Spender Channel Region  Fresh  Milk  Grocery  Frozen  Detergents_Paper  Delicatessen
0              1  Retail  Other  12669  9656     7561     214              2674          1338
1              2  Retail  Other   7057  9810     9568    1762              3293          1776
2              3  Retail  Other   6353  8808     7684    2405              3516          7844
3              4   Hotel  Other  13265  1196     4221    6404               507          1788
4              5  Retail  Other  22615  5410     7198    3915              1777          5185

Summary of the dataset


In [6]:
wholesale_customer_df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 440 entries, 0 to 439

Data columns (total 9 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Buyer/Spender 440 non-null int64

1 Channel 440 non-null object

2 Region 440 non-null object

3 Fresh 440 non-null int64

4 Milk 440 non-null int64

5 Grocery 440 non-null int64

6 Frozen 440 non-null int64

7 Detergents_Paper 440 non-null int64

8 Delicatessen 440 non-null int64

dtypes: int64(7), object(2)

memory usage: 31.1+ KB

Checking for missing values


In [14]:
wholesale_customer_df.isnull().sum()

Out[14]: Buyer/Spender 0

Channel 0

Region 0

Fresh 0

Milk 0

Grocery 0

Frozen 0

Detergents_Paper 0

Delicatessen 0

dtype: int64
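The per-column null counts above can be collapsed into a single guard; a minimal sketch on a toy frame (the column names are illustrative, not the full dataset):

```python
import pandas as pd

# Toy stand-in for the wholesale frame; isnull().sum().sum() totals all nulls.
toy = pd.DataFrame({"Fresh": [12669, 7057], "Milk": [9656, 9810]})
total_missing = toy.isnull().sum().sum()
print(total_missing)
```

With zero missing values, no imputation or row dropping is needed before the analysis.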

Dropping a column of no use for our analysis


One continuous feature (Buyer/Spender) will be dropped, as it is an identifier of no use for our analysis

In [19]:
wholesale_customer_df.head()

Out[19]:
   Buyer/Spender Channel Region  Fresh  Milk  Grocery  Frozen  Detergents_Paper  Delicatessen
0              1  Retail  Other  12669  9656     7561     214              2674          1338
1              2  Retail  Other   7057  9810     9568    1762              3293          1776
2              3  Retail  Other   6353  8808     7684    2405              3516          7844
3              4   Hotel  Other  13265  1196     4221    6404               507          1788
4              5  Retail  Other  22615  5410     7198    3915              1777          5185

In [20]:
wholesale_customer_drop_df = copy.deepcopy(wholesale_customer_df)

wholesale_customer_drop_df

Out[20]:
     Buyer/Spender Channel Region  Fresh   Milk  Grocery  Frozen  Detergents_Paper  Delicatessen
0                1  Retail  Other  12669   9656     7561     214              2674          1338
1                2  Retail  Other   7057   9810     9568    1762              3293          1776
2                3  Retail  Other   6353   8808     7684    2405              3516          7844
3                4   Hotel  Other  13265   1196     4221    6404               507          1788
4                5  Retail  Other  22615   5410     7198    3915              1777          5185
...            ...     ...    ...    ...    ...      ...     ...               ...           ...
435            436   Hotel  Other  29703  12051    16027   13135               182          2204
436            437   Hotel  Other  39228   1431      764    4510                93          2346
437            438  Retail  Other  14531  15488    30243     437             14841          1867
438            439   Hotel  Other  10290   1981     2232    1038               168          2125
439            440   Hotel  Other   2787   1698     2510      65               477            52

440 rows × 9 columns

In [21]:
del wholesale_customer_drop_df['Buyer/Spender']
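`del` mutates the frame in place; `DataFrame.drop(columns=...)` is the more common idiom and returns a new frame, which is handy when the original should survive. A hedged sketch on toy data:

```python
import pandas as pd

# Dropping an identifier column without touching the original frame.
toy = pd.DataFrame({"Buyer/Spender": [1, 2], "Fresh": [12669, 7057]})
dropped = toy.drop(columns=["Buyer/Spender"])
print(dropped.columns.tolist())
```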

In [22]:
wholesale_customer_drop_df

Out[22]: Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicatessen

0 Retail Other 12669 9656 7561 214 2674 1338

1 Retail Other 7057 9810 9568 1762 3293 1776

2 Retail Other 6353 8808 7684 2405 3516 7844

3 Hotel Other 13265 1196 4221 6404 507 1788

4 Retail Other 22615 5410 7198 3915 1777 5185

... ... ... ... ... ... ... ... ...

435 Hotel Other 29703 12051 16027 13135 182 2204

436 Hotel Other 39228 1431 764 4510 93 2346

437 Retail Other 14531 15488 30243 437 14841 1867

438 Hotel Other 10290 1981 2232 1038 168 2125

439 Hotel Other 2787 1698 2510 65 477 52

440 rows × 8 columns


Performing the Region count


In [23]:
wholesale_customer_drop_df['Region'].value_counts()

Out[23]: Other 316

Lisbon 77

Oporto 47

Name: Region, dtype: int64

Performing the Channel count


In [24]:
wholesale_customer_drop_df['Channel'].value_counts()

Out[24]: Hotel 298

Retail 142

Name: Channel, dtype: int64

In [25]:
def categorical_multi(i, j):
    # kind='bar' assumed; the original line was truncated in the export
    pd.crosstab(wholesale_customer_drop_df[i], wholesale_customer_drop_df[j]).plot(kind='bar')
    plt.show()
    print(pd.crosstab(wholesale_customer_drop_df[i], wholesale_customer_drop_df[j]))

categorical_multi(i='Channel', j='Region')

Region Lisbon Oporto Other

Channel

Hotel 59 28 211

Retail 18 19 105
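The raw counts above are dominated by the large "Other" region; normalizing the crosstab per region makes the channel mix comparable across regions of different sizes. A minimal sketch on toy rows (illustrative values, not the real data):

```python
import pandas as pd

# normalize="columns" turns each region's counts into within-region shares.
toy = pd.DataFrame({
    "Channel": ["Hotel", "Hotel", "Retail", "Hotel", "Retail"],
    "Region":  ["Lisbon", "Other", "Other", "Oporto", "Lisbon"],
})
share = pd.crosstab(toy["Channel"], toy["Region"], normalize="columns")
print(share)
```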

EDA
Starting to explore the data with univariate analysis (each feature individually), before carrying out bivariate analysis, comparing pairs of features to find correlations between them

In [26]:
print('Descriptive Statistics of our Data:')

wholesale_customer_drop_df.describe().T

Descriptive Statistics of our Data:

Out[26]:
                  count          mean           std   min      25%     50%       75%       max
Fresh             440.0  12000.297727  12647.328865   3.0  3127.75  8504.0  16933.75  112151.0
Milk              440.0   5796.265909   7380.377175  55.0  1533.00  3627.0   7190.25   73498.0
Grocery           440.0   7951.277273   9503.162829   3.0  2153.00  4755.5  10655.75   92780.0
Frozen            440.0   3071.931818   4854.673333  25.0   742.25  1526.0   3554.25   60869.0
Detergents_Paper  440.0   2881.493182   4767.854448   3.0   256.75   816.5   3922.00   40827.0
Delicatessen      440.0   1524.870455   2820.105937   3.0   408.25   965.5   1820.25   47943.0

In [27]:
print('Descriptive Statistics of our Data including Channel & Region:')

wholesale_customer_drop_df.describe(include='all').T

Descriptive Statistics of our Data including Channel & Region:

Out[27]:
                  count unique    top freq          mean           std   min      25%     50%       75%       max
Channel             440      2  Hotel  298           NaN           NaN   NaN      NaN     NaN       NaN       NaN
Region              440      3  Other  316           NaN           NaN   NaN      NaN     NaN       NaN       NaN
Fresh             440.0    NaN    NaN  NaN  12000.297727  12647.328865   3.0  3127.75  8504.0  16933.75  112151.0
Milk              440.0    NaN    NaN  NaN   5796.265909   7380.377175  55.0  1533.00  3627.0   7190.25   73498.0
Grocery           440.0    NaN    NaN  NaN   7951.277273   9503.162829   3.0  2153.00  4755.5  10655.75   92780.0
Frozen            440.0    NaN    NaN  NaN   3071.931818   4854.673333  25.0   742.25  1526.0   3554.25   60869.0
Detergents_Paper  440.0    NaN    NaN  NaN   2881.493182   4767.854448   3.0   256.75   816.5   3922.00   40827.0
Delicatessen      440.0    NaN    NaN  NaN   1524.870455   2820.105937   3.0   408.25   965.5   1820.25   47943.0

Univariate
In [32]:
def plot_distribution(df, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(wspace=wspace, hspace=hspace)
    rows = math.ceil(float(df.shape[1]) / cols)
    for i, column in enumerate(df.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        if df.dtypes[column] == object:  # np.object is deprecated; the builtin object works
            g = sns.countplot(y=column, data=df)
            substrings = [s.get_text()[:18] for s in g.get_yticklabels()]
            g.set(yticklabels=substrings)
            plt.xticks(rotation=25)
        else:
            # distplot is deprecated in newer seaborn; sns.histplot(df[column], kde=True) is the replacement
            g = sns.distplot(df[column])
            plt.xticks(rotation=25)

plot_distribution(wholesale_customer_drop_df, cols=3, width=20, height=20, hspace=0.45, wspace=0.5)  # hspace/wspace truncated in the export; values assumed


From the graphs of the product distributions, it seems that we have some outliers in the data; a further deep dive will identify them.

In [34]:
# removing the categorical columns (Channel, Region); the slice end was truncated in the export
products = wholesale_customer_drop_df[wholesale_customer_drop_df.columns[2:]]

# plotting the distribution of each feature
def plot_distribution(df2, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(wspace=wspace, hspace=hspace)
    rows = math.ceil(float(df2.shape[1]) / cols)
    for i, column in enumerate(df2.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        g = sns.boxplot(df2[column])
        plt.xticks(rotation=25)

plot_distribution(products, cols=3, width=20, height=10, hspace=0.45, wspace=0.5)


FYI Evaluator: Outliers are detected but not necessarily removed; it depends on the situation. Here I will assume that the wholesale distributor provided us a dataset with correct data, so I will keep them as is.

Bivariate
In [36]:
sns.set(style="ticks")

g = sns.pairplot(products,corner=True,kind='reg')

g.fig.set_size_inches(15,15)


From the pairplot above, the correlation between the "detergents and paper products" and the
"grocery products" seems to be pretty strong, meaning that consumers would often spend
money on these two types of product. Applying the Pearson correlation coefficient to confirm
this:

In [39]:
corr = products.corr()

sns.heatmap(corr,annot=True)

## There is strong correlation (0.92) between the "detergents and paper products" and the "grocery products"

Out[39]: <AxesSubplot:>
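Rather than reading the coefficient off the heatmap, the exact value can be pulled from the matrix. A hedged sketch, with toy columns standing in for the real Grocery and Detergents_Paper spending:

```python
import pandas as pd

# Series.corr computes the Pearson coefficient by default.
toy = pd.DataFrame({
    "Grocery": [7561, 9568, 7684, 4221, 7198],
    "Detergents_Paper": [2674, 3293, 3516, 507, 1777],
})
r = toy["Grocery"].corr(toy["Detergents_Paper"])
print(round(r, 2))
```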


Problem 1:
1.1 Use methods of descriptive statistics to summarize data. Which Region and
which Channel spent the most? Which Region and which Channel spent the
least?
In [40]:
print('Descriptive Statistics of our Data:')

wholesale_customer_drop_df.describe().T

Descriptive Statistics of our Data:

Out[40]: count mean std min 25% 50% 75% max

Fresh 440.0 12000.297727 12647.328865 3.0 3127.75 8504.0 16933.75 112151.0

Milk 440.0 5796.265909 7380.377175 55.0 1533.00 3627.0 7190.25 73498.0

Grocery 440.0 7951.277273 9503.162829 3.0 2153.00 4755.5 10655.75 92780.0

Frozen 440.0 3071.931818 4854.673333 25.0 742.25 1526.0 3554.25 60869.0

Detergents_Paper 440.0 2881.493182 4767.854448 3.0 256.75 816.5 3922.00 40827.0

Delicatessen 440.0 1524.870455 2820.105937 3.0 408.25 965.5 1820.25 47943.0

In [41]:
print('Descriptive Statistics of our Data including Channel & Region:')

wholesale_customer_drop_df.describe(include='all').T

Descriptive Statistics of our Data including Channel & Region:

Out[41]:
                  count unique    top freq          mean           std   min      25%     50%       75%       max
Channel             440      2  Hotel  298           NaN           NaN   NaN      NaN     NaN       NaN       NaN
Region              440      3  Other  316           NaN           NaN   NaN      NaN     NaN       NaN       NaN
Fresh             440.0    NaN    NaN  NaN  12000.297727  12647.328865   3.0  3127.75  8504.0  16933.75  112151.0
Milk              440.0    NaN    NaN  NaN   5796.265909   7380.377175  55.0  1533.00  3627.0   7190.25   73498.0
Grocery           440.0    NaN    NaN  NaN   7951.277273   9503.162829   3.0  2153.00  4755.5  10655.75   92780.0
Frozen            440.0    NaN    NaN  NaN   3071.931818   4854.673333  25.0   742.25  1526.0   3554.25   60869.0
Detergents_Paper  440.0    NaN    NaN  NaN   2881.493182   4767.854448   3.0   256.75   816.5   3922.00   40827.0
Delicatessen      440.0    NaN    NaN  NaN   1524.870455   2820.105937   3.0   408.25   965.5   1820.25   47943.0

From the above two describe outputs, we can infer the following:

Channel has two unique values, with "Hotel" the most frequent at 298 of 440 records, i.e. 67.7% of transactions come through the "Hotel" channel.

Region has three unique values, with "Other" the most frequent at 316 of 440 records, i.e. 71.8% of transactions come from the "Other" region.

Fresh (440 records) has a mean of 12000.3 and a standard deviation of 12647.3, with a min of 3 and a max of 112151.
Q1 (25%) is 3127.75, Q3 (75%) is 16933.8, and Q2 (50%) is 8504.
Range = max - min = 112151 - 3 = 112,148 and IQR = Q3 - Q1 = 16933.8 - 3127.75 = 13,806.05 (helpful for calculating the 1.5 IQR lower/upper outlier limits).

Milk (440 records) has a mean of 5796.27 and a standard deviation of 7380.38, with a min of 55 and a max of 73498.
Q1 (25%) is 1533, Q3 (75%) is 7190.25, and Q2 (50%) is 3627.
Range = max - min = 73498 - 55 = 73,443 and IQR = Q3 - Q1 = 7190.25 - 1533 = 5,657.25 (helpful for calculating the 1.5 IQR lower/upper outlier limits).

Grocery (440 records) has a mean of 7951.28 and a standard deviation of 9503.16, with a min of 3 and a max of 92780.
Q1 (25%) is 2153, Q3 (75%) is 10655.8, and Q2 (50%) is 4755.5.
Range = max - min = 92780 - 3 = 92,777 and IQR = Q3 - Q1 = 10655.8 - 2153 = 8,502.8 (helpful for calculating the 1.5 IQR lower/upper outlier limits).

Frozen (440 records) has a mean of 3071.93 and a standard deviation of 4854.67, with a min of 25 and a max of 60869.
Q1 (25%) is 742.25, Q3 (75%) is 3554.25, and Q2 (50%) is 1526.
Range = max - min = 60869 - 25 = 60,844 and IQR = Q3 - Q1 = 3554.25 - 742.25 = 2,812 (helpful for calculating the 1.5 IQR lower/upper outlier limits).

Detergents_Paper (440 records) has a mean of 2881.49 and a standard deviation of 4767.85, with a min of 3 and a max of 40827.
Q1 (25%) is 256.75, Q3 (75%) is 3922, and Q2 (50%) is 816.5.
Range = max - min = 40827 - 3 = 40,824 and IQR = Q3 - Q1 = 3922 - 256.75 = 3,665.25 (helpful for calculating the 1.5 IQR lower/upper outlier limits).

Delicatessen (440 records) has a mean of 1524.87 and a standard deviation of 2820.11, with a min of 3 and a max of 47943.
Q1 (25%) is 408.25, Q3 (75%) is 1820.25, and Q2 (50%) is 965.5.
Range = max - min = 47943 - 3 = 47,940 and IQR = Q3 - Q1 = 1820.25 - 408.25 = 1,412 (helpful for calculating the 1.5 IQR lower/upper outlier limits).
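The 1.5 IQR limits referenced for each item above can be computed in one pass; a minimal sketch (the toy values below are illustrative, not the full columns):

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame) -> pd.Series:
    # Fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR; count points outside them.
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    return ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).sum()

toy = pd.DataFrame({"Fresh": [3, 3127, 8504, 16933, 112151],
                    "Milk": [55, 1533, 3627, 7190, 73498]})
counts = iqr_outlier_counts(toy)
print(counts)
```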

In [ ]:
##Which Region and which Channel spent the most?

##Which Region and which Channel spent the least?

In [42]:
wholesale_customer_spending_df = copy.deepcopy(wholesale_customer_drop_df)

wholesale_customer_spending_df['Spending'] = (wholesale_customer_drop_df['Fresh'] + wholesale_customer_drop_df['Milk']
    + wholesale_customer_drop_df['Grocery'] + wholesale_customer_drop_df['Frozen']
    + wholesale_customer_drop_df['Detergents_Paper'] + wholesale_customer_drop_df['Delicatessen'])
wholesale_customer_spending_df

Out[42]: Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicatessen Spending

0 Retail Other 12669 9656 7561 214 2674 1338 34112

1 Retail Other 7057 9810 9568 1762 3293 1776 33266

2 Retail Other 6353 8808 7684 2405 3516 7844 36610

3 Hotel Other 13265 1196 4221 6404 507 1788 27381

4 Retail Other 22615 5410 7198 3915 1777 5185 46100

... ... ... ... ... ... ... ... ... ...

435 Hotel Other 29703 12051 16027 13135 182 2204 73302

436 Hotel Other 39228 1431 764 4510 93 2346 48372

437 Retail Other 14531 15488 30243 437 14841 1867 77407

438 Hotel Other 10290 1981 2232 1038 168 2125 17834

439 Hotel Other 2787 1698 2510 65 477 52 7589

440 rows × 9 columns

In [43]:
regiondf = wholesale_customer_spending_df.groupby('Region')['Spending'].sum()

print(regiondf)

print()

channeldf = wholesale_customer_spending_df.groupby('Channel')['Spending'].sum()

print(channeldf)

Region

Lisbon     2386813
Oporto     1555088
Other     10677599

Name: Spending, dtype: int64

Channel

Hotel     7999569
Retail    6619931

Name: Spending, dtype: int64

The highest spend by Region is from "Other" and the lowest is from "Oporto".

The highest spend by Channel is from "Hotel" and the lowest is from "Retail".
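One caveat: these are totals, and "Other" holds 316 of the 440 customers, so group size alone pushes its sum up. A per-customer mean is a useful companion view; a hedged sketch on toy rows (illustrative values):

```python
import pandas as pd

# A big group can have the largest total while spending less per customer.
toy = pd.DataFrame({
    "Region":   ["Lisbon", "Lisbon", "Other", "Other", "Other", "Other"],
    "Spending": [100, 200, 80, 90, 100, 110],
})
total = toy.groupby("Region")["Spending"].sum()
mean = toy.groupby("Region")["Spending"].mean()
print(total)
print(mean)
```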

1.2 There are 6 different varieties of items that are considered. Describe and comment/explain all the varieties across Region and Channel? Provide a detailed justification for your answer.
In [44]:
## Using the barplot to see behaviour across the channel and region

sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Fresh", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Fresh')

Out[44]: Text(0.5, 1.0, 'Item - Fresh')

In [45]:
sns.catplot(x="Channel", y="Fresh", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Fresh')

Out[45]: Text(0.5, 1.0, 'Item - Fresh')


In [46]:
sns.catplot(x="Region", y="Fresh", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Fresh')

Out[46]: Text(0.5, 1.0, 'Item - Fresh')

Based on the plots, average spending on the Fresh item is higher in the Hotel channel than in the Retail channel
In [47]:
sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Milk", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Milk')

Out[47]: Text(0.5, 1.0, 'Item - Milk')


In [48]:
sns.catplot(x="Channel", y="Milk", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Milk')

Out[48]: Text(0.5, 1.0, 'Item - Milk')

In [49]:
sns.catplot(x="Region", y="Milk", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Milk')

Out[49]: Text(0.5, 1.0, 'Item - Milk')


In [50]:
sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Grocery", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Grocery')

Out[50]: Text(0.5, 1.0, 'Item - Grocery')

In [51]:
sns.catplot(x="Channel", y="Grocery", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Grocery')

Out[51]: Text(0.5, 1.0, 'Item - Grocery')


In [52]:
sns.catplot(x="Region", y="Grocery", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Grocery')

Out[52]: Text(0.5, 1.0, 'Item - Grocery')

In [53]:
sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Frozen", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Frozen')

Out[53]: Text(0.5, 1.0, 'Item - Frozen')


In [54]:
sns.catplot(x="Channel", y="Frozen", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Frozen')

Out[54]: Text(0.5, 1.0, 'Item - Frozen')

In [55]:
sns.catplot(x="Region", y="Frozen", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Frozen')

Out[55]: Text(0.5, 1.0, 'Item - Frozen')


In [56]:
sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Detergents_Paper", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Detergents_Paper')

Out[56]: Text(0.5, 1.0, 'Item - Detergents_Paper')

In [57]:
sns.catplot(x="Channel", y="Detergents_Paper", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Detergents_Paper')

Out[57]: Text(0.5, 1.0, 'Item - Detergents_Paper')


In [58]:
sns.catplot(x="Region", y="Detergents_Paper", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Detergents_Paper')

Out[58]: Text(0.5, 1.0, 'Item - Detergents_Paper')

In [59]:
sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Delicatessen", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Delicatessen')

Out[59]: Text(0.5, 1.0, 'Delicatessen')


In [60]:
sns.catplot(x="Channel", y="Delicatessen", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Delicatessen')

Out[60]: Text(0.5, 1.0, 'Item - Delicatessen')

In [61]:
sns.catplot(x="Region", y="Delicatessen", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Delicatessen')

Out[61]: Text(0.5, 1.0, 'Item - Delicatessen')


1.3 On the basis of a descriptive measure of variability, which item shows the most inconsistent behaviour? Which items show the least inconsistent behaviour?
In [62]:
# using standard deviation to check the measure of variability

standard_deviation_items = products.std()

standard_deviation_items.round(2)

Out[62]: Fresh 12647.33

Milk 7380.38

Grocery 9503.16

Frozen 4854.67

Detergents_Paper 4767.85

Delicatessen 2820.11

dtype: float64

The Fresh item has the highest standard deviation, so it is the most inconsistent.

The Delicatessen item has the smallest standard deviation, so it is the most consistent.

In [ ]:
# Based on the coefficient of variation

In [63]:
cv_fresh = np.std(products['Fresh']) / np.mean(products['Fresh'])

cv_fresh

Out[63]: 1.0527196084948245

In [64]:
cv_milk = np.std(products['Milk']) / np.mean(products['Milk'])

cv_milk

Out[64]: 1.2718508307424503

In [65]:

cv_grocery = np.std(products['Grocery']) / np.mean(products['Grocery'])

cv_grocery

Out[65]: 1.193815447749267

In [66]:
cv_frozen = np.std(products['Frozen']) / np.mean(products['Frozen'])

cv_frozen

Out[66]: 1.5785355298607762

In [67]:
cv_detergents_paper = np.std(products['Detergents_Paper']) / np.mean(products['Detergents_Paper'])
cv_detergents_paper

Out[67]: 1.6527657881041729

In [68]:
cv_delicatessen = np.std(products['Delicatessen']) / np.mean(products['Delicatessen'])
cv_delicatessen

Out[68]: 1.8473041039189306

In [69]:
from scipy.stats import variation

print(variation(products, axis = 0))

[1.05271961 1.27185083 1.19381545 1.57853553 1.65276579 1.8473041 ]
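A note on the small numeric differences between CV styles: `np.std` and `scipy.stats.variation` default to the population formula (ddof=0), while pandas' `.std()` uses the sample formula (ddof=1). A minimal sketch on a toy series:

```python
import numpy as np
import pandas as pd

x = pd.Series([3.0, 1526.0, 8504.0, 60869.0])
cv_population = np.std(x.to_numpy()) / np.mean(x.to_numpy())  # ddof=0, matches scipy.stats.variation
cv_sample = x.std() / x.mean()                                # ddof=1, pandas default
print(cv_population, cv_sample)
```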

The "Fresh" item has the lowest coefficient of variation, so it is the most consistent.

The "Delicatessen" item has the highest coefficient of variation, so it is the most inconsistent.
In [70]:
variance_items = products.var()

variance_items

Out[70]: Fresh 1.599549e+08

Milk 5.446997e+07

Grocery 9.031010e+07

Frozen 2.356785e+07

Detergents_Paper 2.273244e+07

Delicatessen 7.952997e+06

dtype: float64
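As a sanity check, the variances above are exactly the squares of the standard deviations from Out[62] (both use pandas' sample formula, ddof=1); a minimal sketch on toy columns:

```python
import pandas as pd

# var() should equal std() squared, column by column.
toy = pd.DataFrame({"Fresh": [3.0, 8504.0, 112151.0],
                    "Milk": [55.0, 3627.0, 73498.0]})
diff = (toy.var() - toy.std() ** 2).abs().max()
print(diff)
```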

In [71]:
products.describe().T

Out[71]:
                  count          mean           std   min      25%     50%       75%       max
Fresh             440.0  12000.297727  12647.328865   3.0  3127.75  8504.0  16933.75  112151.0
Milk              440.0   5796.265909   7380.377175  55.0  1533.00  3627.0   7190.25   73498.0
Grocery           440.0   7951.277273   9503.162829   3.0  2153.00  4755.5  10655.75   92780.0
Frozen            440.0   3071.931818   4854.673333  25.0   742.25  1526.0   3554.25   60869.0
Detergents_Paper  440.0   2881.493182   4767.854448   3.0   256.75   816.5   3922.00   40827.0
Delicatessen      440.0   1524.870455   2820.105937   3.0   408.25   965.5   1820.25   47943.0

In [72]:
pylab.style.use('seaborn-pastel')

products.plot.area(stacked=False,figsize=(11,5))

pylab.grid(); pylab.show()

1.4 Are there any outliers in the data? Back up your answer with a
suitable plot/technique with the help of detailed comments.
In [73]:
# Using the boxplot to see the outliers. The black points are the outliers in the boxplot graphs
plt.figure(figsize=(15,8))

sns.boxplot(data=products, orient="h", palette="Set2")

Out[73]: <AxesSubplot:>

In [74]:
def plot_distribution(items, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(wspace=wspace, hspace=hspace)
    rows = math.ceil(float(items.shape[1]) / cols)
    for i, column in enumerate(items.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        g = sns.boxplot(items[column])
        plt.xticks(rotation=25)

plot_distribution(products, cols=3, width=20, height=10, hspace=0.45, wspace=0.5)

Yes, there are outliers in all the items across the product range (Fresh, Milk, Grocery, Frozen, Detergents_Paper & Delicatessen). Outliers are detected but not necessarily removed; it depends on the situation. Here I will assume that the wholesale distributor provided us a dataset with correct data, so I will keep them as is.
In [75]:
# visual analysis via histogram

products.hist(figsize=(6,6));

1.5 On the basis of your analysis, what are your recommendations for the business? How can your analysis help the business to solve its problem? Answer from the business perspective.
## As per the analysis, there are inconsistencies in the spending on different items (shown by the coefficient of variation), which should be minimized. Spending through the Hotel and Retail channels differs and should be brought closer to parity, as should spending across the different regions. The business should also focus on items other than "Fresh" and "Grocery".
