Wholesale Customers Data Analysis
import numpy as np
import pandas as pd
import copy
import pylab
import math
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
import warnings
warnings.filterwarnings('ignore')
In [5]:
wholesale_customer_df.head()
Out[5]: Buyer/Spender Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
In [14]:
wholesale_customer_df.isnull().sum()
Out[14]: Buyer/Spender 0
Channel 0
Region 0
Fresh 0
Milk 0
Grocery 0
Frozen 0
Detergents_Paper 0
Delicatessen 0
dtype: int64
In [19]:
wholesale_customer_df.head()
Out[19]: Buyer/Spender Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
In [20]:
wholesale_customer_drop_df = copy.deepcopy(wholesale_customer_df)
wholesale_customer_drop_df
Out[20]: Buyer/Spender Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
In [21]:
del wholesale_customer_drop_df['Buyer/Spender']
In [22]:
wholesale_customer_drop_df
In [23]:
wholesale_customer_drop_df['Region'].value_counts()
Out[23]: Other 316
Lisbon 77
Oporto 47
Name: Region, dtype: int64
In [24]:
wholesale_customer_drop_df['Channel'].value_counts()
Out[24]: Hotel 298
Retail 142
Name: Channel, dtype: int64
In [25]:
def categorical_multi(i, j):
    pd.crosstab(wholesale_customer_drop_df[i], wholesale_customer_drop_df[j]).plot(kind='bar')
    plt.show()
    print(pd.crosstab(wholesale_customer_drop_df[i], wholesale_customer_drop_df[j]))
categorical_multi(i='Channel', j='Region')
Region   Lisbon  Oporto  Other
Channel
Hotel        59      28    211
Retail       18      19    105
EDA
Starting with univariate analysis (each feature individually), before carrying out the bivariate analysis to compare pairs of features and find correlations between them.
In [26]:
print('Descriptive Statistics of our Data:')
wholesale_customer_drop_df.describe().T
In [27]:
print('Descriptive Statistics of our Data including Channel & Region:')
wholesale_customer_drop_df.describe(include='all').T
Out[27]:          count unique    top freq          mean           std   min      25%     50%       75%       max
Channel             440      2  Hotel  298           NaN           NaN   NaN      NaN     NaN       NaN       NaN
Region              440      3  Other  316           NaN           NaN   NaN      NaN     NaN       NaN       NaN
Fresh             440.0    NaN    NaN  NaN  12000.297727  12647.328865   3.0  3127.75  8504.0  16933.75  112151.0
Milk              440.0    NaN    NaN  NaN   5796.265909   7380.377175  55.0  1533.00  3627.0   7190.25   73498.0
Grocery           440.0    NaN    NaN  NaN   7951.277273   9503.162829   3.0  2153.00  4755.5  10655.75   92780.0
Frozen            440.0    NaN    NaN  NaN   3071.931818   4854.673333  25.0   742.25  1526.0   3554.25   60869.0
Detergents_Paper  440.0    NaN    NaN  NaN   2881.493182   4767.854448   3.0   256.75   816.5   3922.00   40827.0
Delicatessen      440.0    NaN    NaN  NaN   1524.870455   2820.105937   3.0   408.25   965.5   1820.25   47943.0
Univariate
In [32]:
def plot_distribution(df, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(hspace=hspace, wspace=wspace)
    rows = math.ceil(float(df.shape[1]) / cols)
    for i, column in enumerate(df.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        if df.dtypes[column] == object:
            g = sns.countplot(y=column, data=df)
            substrings = [s.get_text()[:18] for s in g.get_yticklabels()]
            g.set(yticklabels=substrings)
            plt.xticks(rotation=25)
        else:
            g = sns.distplot(df[column])
            plt.xticks(rotation=25)
plot_distribution(wholesale_customer_drop_df, cols=3)
From the graphs on the distribution of product it seems that we have some outliers in the data,
further deep dive to identify the outlier.
In [34]:
# removing the categorical columns:
products = wholesale_customer_drop_df[wholesale_customer_drop_df.columns[2:]]

def plot_boxplots(df2, cols=3, width=20, height=15):
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width, height))
    rows = math.ceil(float(df2.shape[1]) / cols)
    for i, column in enumerate(df2.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        g = sns.boxplot(df2[column])
        plt.xticks(rotation=25)
plot_boxplots(products)
FYI Evaluator: Outliers are detected but not necessarily removed; it depends on the situation. Here I will assume that the wholesale distributor provided us a dataset with correct data, so I will keep them as is.
Bivariate
In [36]:
sns.set(style="ticks")
g = sns.pairplot(products,corner=True,kind='reg')
g.fig.set_size_inches(15,15)
From the pairplot above, the correlation between "Detergents_Paper" and "Grocery" appears quite strong, meaning that customers who spend on one of these product types tend to spend on the other as well. Applying the Pearson correlation coefficient to confirm this:
In [39]:
corr = products.corr()
sns.heatmap(corr,annot=True)
## There is a strong correlation (0.92) between "Detergents_Paper" and "Grocery".
Out[39]: <AxesSubplot:>
Problem 1:
1.1 Use methods of descriptive statistics to summarize data. Which Region and
which Channel spent the most? Which Region and which Channel spent the
least?
In [40]:
print('Descriptive Statistics of our Data:')
wholesale_customer_drop_df.describe().T
In [41]:
print('Descriptive Statistics of our Data including Channel & Region:')
wholesale_customer_drop_df.describe(include='all').T
Out[41]:          count unique    top freq          mean           std   min      25%     50%       75%       max
Channel             440      2  Hotel  298           NaN           NaN   NaN      NaN     NaN       NaN       NaN
Region              440      3  Other  316           NaN           NaN   NaN      NaN     NaN       NaN       NaN
Fresh             440.0    NaN    NaN  NaN  12000.297727  12647.328865   3.0  3127.75  8504.0  16933.75  112151.0
Milk              440.0    NaN    NaN  NaN   5796.265909   7380.377175  55.0  1533.00  3627.0   7190.25   73498.0
Grocery           440.0    NaN    NaN  NaN   7951.277273   9503.162829   3.0  2153.00  4755.5  10655.75   92780.0
Frozen            440.0    NaN    NaN  NaN   3071.931818   4854.673333  25.0   742.25  1526.0   3554.25   60869.0
Detergents_Paper  440.0    NaN    NaN  NaN   2881.493182   4767.854448   3.0   256.75   816.5   3922.00   40827.0
Delicatessen      440.0    NaN    NaN  NaN   1524.870455   2820.105937   3.0   408.25   965.5   1820.25   47943.0
From the above two describe outputs, we can infer the following:
Channel has two unique values, with "Hotel" the most frequent at 298 of 440 transactions, i.e. 67.7% of transactions come through the "Hotel" channel.
Region has three unique values, with "Other" the most frequent at 316 of 440 transactions, i.e. 71.8% of transactions come from the "Other" region.
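These percentages can be read straight off `value_counts(normalize=True)`; a minimal sketch on a toy frame built from the counts quoted above (the frame itself is illustrative, the real call would use `wholesale_customer_drop_df`):

```python
import pandas as pd

# Toy stand-in for wholesale_customer_drop_df, built from the counts above.
df = pd.DataFrame({'Channel': ['Hotel'] * 298 + ['Retail'] * 142})

# normalize=True returns shares instead of raw counts.
pct = (df['Channel'].value_counts(normalize=True) * 100).round(1)
print(pct)  # Hotel 67.7, Retail 32.3
```

This avoids dividing counts by `len(df)` by hand.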
Fresh has a mean of 12000.3, a standard deviation of 12647.3, a minimum of 3 and a maximum of 112151. Q1 (25%) is 3127.75, Q2 (50%, the median) is 8504 and Q3 (75%) is 16933.75. Range = max - min = 112151 - 3 = 112,148 and IQR = Q3 - Q1 = 16933.75 - 3127.75 = 13,806.
Milk has a mean of 5796.27, a standard deviation of 7380.38, a minimum of 55 and a maximum of 73498. Q1 is 1533, Q2 is 3627 and Q3 is 7190.25. Range = 73498 - 55 = 73,443 and IQR = 7190.25 - 1533 = 5657.25.
Grocery has a mean of 7951.28, a standard deviation of 9503.16, a minimum of 3 and a maximum of 92780. Q1 is 2153, Q2 is 4755.5 and Q3 is 10655.75. Range = 92780 - 3 = 92,777 and IQR = 10655.75 - 2153 = 8502.75.
Frozen has a mean of 3071.93, a standard deviation of 4854.67, a minimum of 25 and a maximum of 60869. Q1 is 742.25, Q2 is 1526 and Q3 is 3554.25. Range = 60869 - 25 = 60,844 and IQR = 3554.25 - 742.25 = 2812.
Detergents_Paper has a mean of 2881.49, a standard deviation of 4767.85, a minimum of 3 and a maximum of 40827. Q1 is 256.75, Q2 is 816.5 and Q3 is 3922. Range = 40827 - 3 = 40,824 and IQR = 3922 - 256.75 = 3665.25.
Delicatessen has a mean of 1524.87, a standard deviation of 2820.11, a minimum of 3 and a maximum of 47943. Q1 is 408.25, Q2 is 965.5 and Q3 is 1820.25. Range = 47943 - 3 = 47,940 and IQR = 1820.25 - 408.25 = 1412.
The IQR is helpful in calculating the 1.5 IQR lower/upper outlier limits for each item.
In [ ]:
##Which Region and which Channel spent the most?
In [42]:
wholesale_customer_spending_df = copy.deepcopy(wholesale_customer_drop_df)
wholesale_customer_spending_df['Spending'] = wholesale_customer_drop_df['Fresh'] + wholesale_customer_drop_df['Milk'] + wholesale_customer_drop_df['Grocery'] + wholesale_customer_drop_df['Frozen'] + wholesale_customer_drop_df['Detergents_Paper'] + wholesale_customer_drop_df['Delicatessen']
wholesale_customer_spending_df
Out[42]: Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicatessen Spending
... ... ... ... ... ... ... ... ... ...
435 Hotel Other 29703 12051 16027 13135 182 2204 73302
437 Retail Other 14531 15488 30243 437 14841 1867 77407
438 Hotel Other 10290 1981 2232 1038 168 2125 17834
In [43]:
regiondf = wholesale_customer_spending_df.groupby('Region')['Spending'].sum()
print(regiondf)
print()
channeldf = wholesale_customer_spending_df.groupby('Channel')['Spending'].sum()
print(channeldf)
Region
Lisbon 2386813
Oporto 1555088
Other 10677599
Channel
Hotel 7999569
Retail 6619931
Highest spending among Regions comes from "Other" and lowest from "Oporto".
Highest spending among Channels comes from "Hotel" and lowest from "Retail".
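The same conclusion can be pulled out programmatically with `idxmax`/`idxmin` on the grouped sums; a sketch on a toy table (column names match the notebook, the numbers are illustrative, and the real call would use `wholesale_customer_spending_df`):

```python
import pandas as pd

# Toy spending table standing in for wholesale_customer_spending_df.
df = pd.DataFrame({'Region': ['Lisbon', 'Oporto', 'Other', 'Other'],
                   'Channel': ['Hotel', 'Retail', 'Hotel', 'Retail'],
                   'Spending': [100, 50, 300, 200]})

region_totals = df.groupby('Region')['Spending'].sum()
channel_totals = df.groupby('Channel')['Spending'].sum()
print(region_totals.idxmax(), region_totals.idxmin())    # Other Oporto
print(channel_totals.idxmax(), channel_totals.idxmin())  # Hotel Retail
```

`idxmax`/`idxmin` return the group label of the largest/smallest total, so no eyeballing of the printed Series is needed.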
In [44]:
sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Fresh", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Fresh')
In [45]:
sns.catplot(x="Channel", y="Fresh", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Fresh')
In [46]:
sns.catplot(x="Region", y="Fresh", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Fresh')
Based on the plots, Fresh items sell more through the Hotel channel.
In [47]:
sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Milk", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Milk')
In [48]:
sns.catplot(x="Channel", y="Milk", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Milk')
In [49]:
sns.catplot(x="Region", y="Milk", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Milk')
In [50]:
sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Grocery", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Grocery')
In [51]:
sns.catplot(x="Channel", y="Grocery", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Grocery')
In [52]:
sns.catplot(x="Region", y="Grocery", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Grocery')
In [53]:
sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Frozen", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Frozen')
In [54]:
sns.catplot(x="Channel", y="Frozen", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Frozen')
In [55]:
sns.catplot(x="Region", y="Frozen", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Frozen')
In [56]:
sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Detergents_Paper", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Detergents_Paper')
In [57]:
sns.catplot(x="Channel", y="Detergents_Paper", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Detergents_Paper')
In [58]:
sns.catplot(x="Region", y="Detergents_Paper", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Detergents_Paper')
In [59]:
sns.set(style="ticks", color_codes=True)
sns.catplot(x="Channel", y="Delicatessen", hue="Region", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Delicatessen')
In [60]:
sns.catplot(x="Channel", y="Delicatessen", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Delicatessen')
In [61]:
sns.catplot(x="Region", y="Delicatessen", kind="bar", ci=None, data=wholesale_customer_drop_df)
plt.title('Item - Delicatessen')
In [62]:
standard_deviation_items = products.std()
standard_deviation_items.round(2)
Out[62]: Fresh 12647.33
Milk 7380.38
Grocery 9503.16
Frozen 4854.67
Detergents_Paper 4767.85
Delicatessen 2820.11
dtype: float64
In [63]:
cv_fresh = np.std(products['Fresh']) / np.mean(products['Fresh'])
cv_fresh
Out[63]: 1.0527196084948245
In [64]:
cv_milk = np.std(products['Milk']) / np.mean(products['Milk'])
cv_milk
Out[64]: 1.2718508307424503
In [65]:
cv_grocery = np.std(products['Grocery']) / np.mean(products['Grocery'])
cv_grocery
Out[65]: 1.193815447749267
In [66]:
cv_frozen = np.std(products['Frozen']) / np.mean(products['Frozen'])
cv_frozen
Out[66]: 1.5785355298607762
In [67]:
cv_detergents_paper = np.std(products['Detergents_Paper']) / np.mean(products['Detergents_Paper'])
cv_detergents_paper
Out[67]: 1.6527657881041729
In [68]:
cv_delicatessen = np.std(products['Delicatessen']) / np.mean(products['Delicatessen'])
cv_delicatessen
Out[68]: 1.8473041039189306
In [69]:
from scipy.stats import variation
variance_items = products.var()
variance_items
Out[69]: Fresh 1.599549e+08
Milk 5.446997e+07
Grocery 9.031010e+07
Frozen 2.356785e+07
Detergents_Paper 2.273244e+07
Delicatessen 7.952997e+06
dtype: float64
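Note that `scipy.stats.variation` (imported above) computes std/mean column-wise with the population std by default, so it reproduces all six per-item CVs in one call; a sketch on toy data (the real call would be `variation(products, axis=0)`):

```python
import numpy as np
import pandas as pd
from scipy.stats import variation

# Toy frame; substitute the real `products` DataFrame.
demo = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 60.0]})

# variation() returns one coefficient of variation per column (axis=0).
cv = pd.Series(variation(demo, axis=0), index=demo.columns)
# Each entry equals np.std(col) / np.mean(col), matching the manual CVs above.
print(cv)
```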
In [71]:
products.describe().T
In [72]:
pylab.style.use('seaborn-pastel')
products.plot.area(stacked=False,figsize=(11,5))
pylab.grid(); pylab.show()
1.4 Are there any outliers in the data? Back up your answer with a
suitable plot/technique with the help of detailed comments.
In [73]:
# Using the boxplot to see the outliers; the black points beyond the whiskers are the outliers in a boxplot graph.
plt.figure(figsize=(15,8))
sns.boxplot(data=products)
Out[73]: <AxesSubplot:>
In [74]:
def plot_distribution(items, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(hspace=hspace, wspace=wspace)
    rows = math.ceil(float(items.shape[1]) / cols)
    for i, column in enumerate(items.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        g = sns.boxplot(items[column])
        plt.xticks(rotation=25)
plot_distribution(products, cols=3)
Yes, there are outliers in all the items across the product range (Fresh, Milk, Grocery, Frozen, Detergents_Paper & Delicatessen). Outliers are detected but not necessarily removed; it depends on the situation. Here I will assume that the wholesale distributor provided us a dataset with correct data, so I will keep them as is.
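The claim that every item has outliers can be backed by counting the points outside the 1.5 IQR fences per column; a sketch on toy columns (the real call would pass `products`):

```python
import pandas as pd

# Toy frame: column A has one extreme point, column B has none.
demo = pd.DataFrame({'A': [1, 2, 3, 4, 100],
                     'B': [5, 6, 7, 8, 9]})

q1, q3 = demo.quantile(0.25), demo.quantile(0.75)
iqr = q3 - q1
# Boolean mask of values beyond either fence, then count per column.
outside = (demo < q1 - 1.5 * iqr) | (demo > q3 + 1.5 * iqr)
print(outside.sum())  # A: 1, B: 0
```

On the real data, a nonzero count for every product column confirms the boxplot reading.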
In [75]:
# visual analysis via histogram
products.hist(figsize=(6,6));