Walmart Business Case Study.ipynb - Colab

The document is a Jupyter notebook analyzing a Walmart dataset with 550,068 entries and 10 columns, focusing on customer demographics and purchasing behavior. Key findings indicate that the majority of users are males aged 26-35 from mid-sized cities, with a significant portion being single and having lived in their city for 1-3 years. The analysis includes visualizations and statistical summaries to explore relationships between demographics and purchase amounts.

Uploaded by

justanswerjunaid


2/17/25, 11:35 PM Walmart_Business_Case_Study.ipynb - Colab

import gdown
!gdown 1lcjtmvtSjco6cnWaJvgJNTGjGJoKieh1

Downloading...
From: https://drive.google.com/uc?id=1lcjtmvtSjco6cnWaJvgJNTGjGJoKieh1
To: /content/walmart-data.csv
100% 23.0M/23.0M [00:00<00:00, 35.3MB/s]

# Importing required libraries -

# For reading & manipulating the data -

import pandas as pd
import numpy as np

# For visualizing the data -

import matplotlib.pyplot as plt
import seaborn as sns

# For statistical functions -

import scipy.stats as stats

# Loading the dataset -

df = pd.read_csv('walmart-data.csv')
df.sample(10)
df.head(10)
df.tail(10)


# Shape of the dataset

df.shape
print('No. of rows:', df.shape[0])
print('No. of columns:', df.shape[1])

No. of rows: 550068

No. of columns: 10

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 10 columns):
 #   Column                      Non-Null Count   Dtype
---  ------                      --------------   -----
 0   User_ID                     550068 non-null  int64
 1   Product_ID                  550068 non-null  object
 2   Gender                      550068 non-null  object
 3   Age                         550068 non-null  object
 4   Occupation                  550068 non-null  int64
 5   City_Category               550068 non-null  object
 6   Stay_In_Current_City_Years  550068 non-null  object
 7   Marital_Status              550068 non-null  int64
 8   Product_Category            550068 non-null  int64
 9   Purchase                    550068 non-null  int64
dtypes: int64(5), object(5)
memory usage: 42.0+ MB

https://colab.research.google.com/drive/1KQYVbLLPQh_jGV_DacOK1Ii8M4marGGo#scrollTo=d0KtEd2SFcOl&printMode=true

Delete Null values and outliers

# Checking for null values:-
df.isnull().sum()
df.isna().sum()

User_ID                       0
Product_ID                    0
Gender                        0
Age                           0
Occupation                    0
City_Category                 0
Stay_In_Current_City_Years    0
Marital_Status                0
Product_Category              0
Purchase                      0
dtype: int64

No null values found

# Checking for duplicates

duplicate_rows = df[df.duplicated()]
print(duplicate_rows.shape[0])

No duplicate rows found

sns.kdeplot(data=df, x='Purchase')
plt.show()


attrs = ['Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status']

sns.set_style("white")

fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(20, 16))

fig.subplots_adjust(top=1.3)
count = 0
for row in range(3):
    for col in range(2):
        sns.boxplot(data=df, y='Purchase', x=attrs[count], hue=attrs[count], ax=axs[row, col], palette='Set3', legend=False)
        axs[row, col].set_title(f"Purchase vs {attrs[count]}", pad=12, fontsize=13)
        count += 1
plt.show()


q1=df.Purchase.quantile(0.25)
q3=df.Purchase.quantile(0.75)
print(q1,q3)
IQR=q3-q1
outliers = df[((df.Purchase<(q1-1.5*IQR)) | (df.Purchase>(q3+1.5*IQR)))]
print("num outliers : ",len(outliers))
print("percent outliers : ",len(outliers)/len(df))

5823.0 12054.0
num outliers : 2677
percent outliers : 0.004866671029763593
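The IQR rule used above can be wrapped in a small helper. The sketch below is illustrative only: the function name `iqr_bounds` and the toy series are assumptions, not part of the original notebook. With the quartiles printed above (5823 and 12054), the IQR is 6231, so the fences are 5823 - 1.5 × 6231 = -3523.5 and 12054 + 1.5 × 6231 = 21400.5. Since purchase amounts are positive, only values above 21400.5 can be flagged.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data (hypothetical values, not from the Walmart file)
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])
lower, upper = iqr_bounds(toy)

# Boolean mask of the rows that fall outside the fences
outlier_mask = (toy < lower) | (toy > upper)
print(lower, upper, toy[outlier_mask].tolist())
```

Keeping the fences in one function avoids repeating the quartile arithmetic when the rows are first counted and then dropped.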

df_clean=df.drop(df[ (df.Purchase > (q3+1.5*IQR)) | (df.Purchase < (q1-1.5*IQR)) ].index)

df_clean.info()


<class 'pandas.core.frame.DataFrame'>
Index: 547391 entries, 0 to 550067
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 547391 non-null int64
1 Product_ID 547391 non-null object
2 Gender 547391 non-null object
3 Age 547391 non-null object
4 Occupation 547391 non-null int64
5 City_Category 547391 non-null object
6 Stay_In_Current_City_Years 547391 non-null object
7 Marital_Status 547391 non-null int64
8 Product_Category 547391 non-null int64
9 Purchase 547391 non-null int64
dtypes: int64(5), object(5)
memory usage: 45.9+ MB

# drop=True discards the old index instead of adding it as an extra 'index' column
df=df_clean.reset_index(drop=True)

Data Exploration

# Statistical summary of the dataset -
df.describe(include='all').T


df['User_ID'].nunique()

5891

df['Product_ID'].nunique()

3631

df['Gender'].value_counts()/len(df)

           count
Gender
M       0.752974
F       0.247026
dtype: float64

Categorical_rows = ['Gender','Age','Stay_In_Current_City_Years','Marital_Status','City_Category']
df[Categorical_rows].melt().groupby(['variable','value'])[['value']].count()*100/len(df)


                                      value
variable                    value
Age                         0-17    2.746117
                            18-25  18.146809
                            26-35  39.946035
                            36-45  19.987358
                            46-50   8.301561
                            51-55   6.976914
                            55+     3.895205
City_Category               A      26.861238
                            B      42.038324
                            C      31.100438
Gender                      F      24.702635
                            M      75.297365
Marital_Status              0      59.051391
                            1      40.948609
Stay_In_Current_City_Years  0      13.525250
                            1      35.229845
                            2      18.521313
                            3      17.319247
                            4+     15.404345

1. Age Distribution The 26-35 age group is the largest segment, accounting for about 40% of purchase records. The 36-45 group follows with 20% and the 18-25 group with 18%, while the three groups aged 46 and above together contribute around 19%. Only about 3% of records come from the 0-17 group.

2. City Category Distribution City Category B has the highest share (42%), followed by C (31%) and A (27%). Together, B and C account for about 73% of purchase records.

3. Gender Distribution About 75% of purchase records belong to male customers and 25% to female customers, so purchases are heavily skewed toward men.

4. Marital Status Unmarried customers (Marital_Status = 0) account for 59% of purchase records, compared with 41% for married customers.

5. Stay Duration in Current City The largest share (35%) comes from customers who have stayed in their current city for 1 year, followed by 2 years (18.5%), 3 years (17%), and 4+ years (15%). Newcomers (0 years) make up the smallest share (13.5%). The customer base therefore mixes recent arrivals and longer-term residents.

Note that these percentages are shares of rows (individual purchases), not shares of the 5,891 unique users.

- Overall Summary

The core customer base consists of males aged 26-35 from City Category B. Most purchases come from unmarried customers and from customers who have lived in their city for 1-3 years. City Categories B and C account for the majority of purchases, while City Category A has the lowest share.
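Because each row is one purchase, frequent buyers weigh more heavily in row-based percentages than occasional ones. A hedged sketch of computing the same kind of breakdown per unique user follows. The toy frame is hypothetical, and the approach assumes that a user's demographic attributes are identical on all of that user's rows.

```python
import pandas as pd

# Hypothetical toy transactions: user 1 buys three times, users 2 and 3 once each
toy = pd.DataFrame({
    "User_ID": [1, 1, 1, 2, 3],
    "Gender":  ["M", "M", "M", "F", "F"],
})

# Share of transactions (what value_counts on the raw rows measures)
row_share = toy["Gender"].value_counts(normalize=True)

# Share of unique users: keep one row per User_ID first
users = toy.drop_duplicates(subset="User_ID")
user_share = users["Gender"].value_counts(normalize=True)

print(row_share.round(2).to_dict())
print(user_share.round(2).to_dict())
```

In the toy data the single male user produces 60% of the rows but is only one third of the users, which is the kind of gap the two views can reveal.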

fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(16, 12))

sns.countplot(data=df, x='Gender', ax=axs[0,0])
sns.countplot(data=df, x='Age', ax=axs[0,1])
sns.countplot(data=df, x='City_Category', ax=axs[1,0])
sns.countplot(data=df, x='Marital_Status', ax=axs[1,1])
plt.show()


sns.barplot(data=df, x='Age', y='Purchase',hue="Gender")

<Axes: xlabel='Age', ylabel='Purchase'>

sns.barplot(data=df, x='Age', y='Purchase',hue="Marital_Status")


<Axes: xlabel='Age', ylabel='Purchase'>

plt.figure(figsize=(10, 8))
sns.countplot(data=df, x='Product_Category')
plt.show()


plt.figure(figsize=(20, 8))
sns.countplot(data=df, x='Age',hue='Marital_Status')
plt.show()


As expected, there are no married individuals in the 0-17 age group.

Up to the age of 45, the number of unmarried individuals is higher than married ones.
After the age of 45, the number of married individuals exceeds the unmarried ones.
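The counts behind a plot like this can also be read off directly with a cross-tabulation, which makes it easy to check in which age groups married customers outnumber unmarried ones. A minimal sketch with `pd.crosstab`, using a hypothetical toy frame rather than the Walmart data:

```python
import pandas as pd

# Hypothetical rows; Marital_Status uses the same 0/1 coding as the dataset
toy = pd.DataFrame({
    "Age": ["0-17", "18-25", "18-25", "46-50", "46-50", "46-50"],
    "Marital_Status": [0, 0, 1, 1, 1, 0],
})

# Rows: age group, columns: marital status (0 = unmarried, 1 = married)
counts = pd.crosstab(toy["Age"], toy["Marital_Status"])

# Flag the age groups where married rows outnumber unmarried rows
married_majority = counts[1] > counts[0]
print(counts)
print(married_majority[married_majority].index.tolist())
```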

plt.figure(figsize=(20, 8))
sns.countplot(data=df, x='Product_Category',hue='Marital_Status')
plt.show()


How does gender affect the amount spent?

# Checking different metrics based on purchase by different genders -
df.groupby('Gender')['Purchase'].describe()

           count         mean          std   min     25%     50%      75%      max
Gender
F       135220.0  8671.049039  4679.058483  12.0  5429.0  7906.0  11064.0  21398.0
M       412171.0  9367.724355  5009.234088  12.0  5852.0  8089.0  12247.0  21399.0

## run it multiple times

df.sample(500).groupby('Gender')['Purchase'].describe()


        count         mean          std    min     25%     50%      75%      max
Gender
F       107.0  8739.476636  4092.670974  743.0  5996.0  8041.0  10551.0  20634.0
M       393.0  8877.592875  4569.020437  136.0  5938.0  7994.0  11729.0  20419.0

sns.displot(x='Purchase', bins=25, kde=True,hue='Gender', data=df )

<seaborn.axisgrid.FacetGrid at 0x7a9b40f3d890>

We can see that the distribution is close to normal.

sns.displot(x='Purchase', data=df, bins=25, hue='Gender')

plt.axvline(x=df['Purchase'].mean(), color='r')
plt.axvline(x=df[df['Gender']=='M']['Purchase'].mean(), color='b')
plt.axvline(x=df[df['Gender']=='F']['Purchase'].mean(), color='g')

plt.show()


The mean purchase amounts for males, females, and the overall dataset are quite similar, with females spending slightly less on average.

Let's dive deeper into the details.

Female Purchases

# Female Purchases
Female_data = df[df["Gender"]=="F"]
print("Female Purchase amount Mean:"+ str(Female_data["Purchase"].mean()))
print("Female Purchase amount SD:"+ str(Female_data["Purchase"].std()))

Female Purchase amount Mean:8671.049038603756

Female Purchase amount SD:4679.058483084425

Let us draw 1,000 random samples of size 300 (with replacement) from the female purchases and calculate the mean of each sample.

female_sample_means=[df[df['Gender']=='F'].sample(300, replace=True)['Purchase'].mean() for i in range(1000)]

sns.displot(female_sample_means,bins=30, kde=True )


<seaborn.axisgrid.FacetGrid at 0x7a9b42b2ce90>

females_sample_means_600=[df[df['Gender']=='F'].sample(600, replace=True)['Purchase'].mean() for i in range(1000)]

sns.displot(females_sample_means_600,bins=30, kde=True )


<seaborn.axisgrid.FacetGrid at 0x7a9b43d43450>

As the sample size increases, the sampling distribution becomes more normally distributed, with reduced deviation.
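The reduced deviation follows a standard rule: the standard deviation of the sample mean (the standard error) is approximately σ/√n, so doubling the sample size from 300 to 600 shrinks the spread by a factor of √2, about 1.41. The sketch below checks this on synthetic right-skewed data. The gamma population, the seed, and the helper name `sample_mean_spread` are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed "population" (not the Walmart data)
population = rng.gamma(shape=2.0, scale=4500.0, size=100_000)
sigma = population.std()

def sample_mean_spread(n: int, reps: int = 1000) -> float:
    """Standard deviation of `reps` sample means, each from n draws with replacement."""
    draws = rng.choice(population, size=(reps, n), replace=True)
    return draws.mean(axis=1).std(ddof=1)

# Compare the simulated spread with the theoretical sigma / sqrt(n)
for n in (300, 600):
    print(n, round(sample_mean_spread(n), 1), round(sigma / np.sqrt(n), 1))
```

The simulated and theoretical columns should agree to within a few percent, with the remaining gap due to using only 1,000 repetitions.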

print("Sample distribution mean for females : ",sum(females_sample_means_600) / len(females_sample_means_600))

Sample distribution mean for females : 8675.96046499999

# Generate bootstrap sample means for females

female_sample_means = [df[df['Gender']=='F'].sample(1000, replace=True)['Purchase'].mean() for i in range(1000)]

# Compute the 95% Confidence Interval

lower_bound = np.percentile(female_sample_means, 2.5)
upper_bound = np.percentile(female_sample_means, 97.5)

# Visualization
plt.figure(figsize=(10, 6))
sns.histplot(female_sample_means, bins=30, kde=True, color="purple")

# Add vertical lines for confidence interval

plt.axvline(lower_bound, color='red', linestyle='dashed', label=f'Lower 95% CI: {lower_bound:.2f}')
plt.axvline(upper_bound, color='red', linestyle='dashed', label=f'Upper 95% CI: {upper_bound:.2f}')
plt.axvline(np.mean(female_sample_means), color='blue', linestyle='solid', label=f'Mean: {np.mean(female_sample_means):.2f}')

# Labels and title

plt.xlabel("Sample Mean of Purchase (Female)")
plt.ylabel("Frequency")
plt.title("Bootstrap Distribution of Female Purchase Means with 95% CI")

plt.legend()
plt.show()

# Print confidence interval

print(f"95% Confidence Interval for Female Purchase Mean: ({lower_bound:.2f}, {upper_bound:.2f})")

95% Confidence Interval for Female Purchase Mean: (8393.90, 8977.83)
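The percentile interval computed above can be packaged as a reusable function. This is a sketch only: `bootstrap_mean_ci` is a hypothetical helper, it resamples with NumPy rather than `DataFrame.sample`, and the synthetic input merely stands in for the female purchase column. Note that the resample size controls the interval width: drawing 1,000 rows per resample, as above, describes the uncertainty of a 1,000-purchase sample mean, not of the mean over all 135,220 female rows.

```python
import numpy as np

def bootstrap_mean_ci(values, n, reps=1000, level=0.95, seed=0):
    """Percentile CI for the mean of a size-n sample drawn with replacement."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # One row per resample; the row means form the bootstrap distribution
    means = rng.choice(values, size=(reps, n), replace=True).mean(axis=1)
    tail = (1.0 - level) / 2.0 * 100.0
    lower, upper = np.percentile(means, [tail, 100.0 - tail])
    return lower, upper

# Synthetic stand-in for a purchase column (hypothetical values)
purchases = np.random.default_rng(1).gamma(shape=3.0, scale=3000.0, size=50_000)
lo, hi = bootstrap_mean_ci(purchases, n=1000)
print(f"95% CI: ({lo:.2f}, {hi:.2f})")
```

Filtering the group once and passing its values in also avoids re-filtering the full DataFrame on every one of the 1,000 iterations.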

Male Purchases

# Male Purchases
Male_data = df[df["Gender"]=="M"]
print("Male Purchase amount Mean:"+ str(Male_data["Purchase"].mean()))
print("Male Purchase amount SD:"+ str(Male_data["Purchase"].std()))

Male Purchase amount Mean:9367.724354697444

Male Purchase amount SD:5009.234087946683

male_sample_means=[df[df['Gender']=='M'].sample(300, replace=True)['Purchase'].mean() for i in range(1000)]

sns.displot(male_sample_means,bins=30, kde=True )


<seaborn.axisgrid.FacetGrid at 0x7a9b46d12610>

males_sample_means_600=[df[df['Gender']=='M'].sample(600, replace=True)['Purchase'].mean() for i in range(1000)]

sns.displot(males_sample_means_600,bins=30, kde=True )


<seaborn.axisgrid.FacetGrid at 0x7a9b47134c90>

print("Sample distribution mean for males : ",sum(male_sample_means) / len(male_sample_means))

Sample distribution mean for males : 9371.82929899999

# Generate bootstrap sample means for males

male_sample_means = [df[df['Gender']=='M'].sample(1000, replace=True)['Purchase'].mean() for i in range(1000)]

# Compute the 95% Confidence Interval

lower_bound = np.percentile(male_sample_means, 2.5)
upper_bound = np.percentile(male_sample_means, 97.5)

# Visualization
plt.figure(figsize=(10, 6))
sns.histplot(male_sample_means, bins=30, kde=True, color="purple")

# Add vertical lines for confidence interval

plt.axvline(lower_bound, color='red', linestyle='dashed', label=f'Lower 95% CI: {lower_bound:.2f}')
plt.axvline(upper_bound, color='red', linestyle='dashed', label=f'Upper 95% CI: {upper_bound:.2f}')
plt.axvline(np.mean(male_sample_means), color='blue', linestyle='solid', label=f'Mean: {np.mean(male_sample_means):.2f}')

# Labels and title

plt.xlabel("Sample Mean of Purchase (Male)")
plt.ylabel("Frequency")
plt.title("Bootstrap Distribution of Male Purchase Means with 95% CI")
plt.legend()
plt.show()

# Print confidence interval
print(f"95% Confidence Interval for Male Purchase Mean: ({lower_bound:.2f}, {upper_bound:.2f})")

95% Confidence Interval for Male Purchase Mean: (9059.41, 9672.62)

The mean of the sampling distribution, which represents the average of all the sample means taken, is very close to the original population
mean.

Females:
Population Mean: 8671
Sample Mean: 8675

Males:
Population Mean: 9367
Sample Mean: 9371

This aligns with the first property of the Central Limit Theorem, which states that the sampling distribution mean approximates the population
mean.

female_sample_means = [df[df['Gender']=='F'].sample(300, replace=True)['Purchase'].mean() for i in range(1000)]
male_sample_means = [df[df['Gender']=='M'].sample(300, replace=True)['Purchase'].mean() for i in range(1000)]

# Compute the 95% Confidence Intervals

female_lower, female_upper = np.percentile(female_sample_means, [2.5, 97.5])
male_lower, male_upper = np.percentile(male_sample_means, [2.5, 97.5])

# Visualization
plt.figure(figsize=(12, 6))
sns.histplot(female_sample_means, bins=30, kde=True, color="purple", label="Female Sample Means", alpha=0.6)
sns.histplot(male_sample_means, bins=30, kde=True, color="blue", label="Male Sample Means", alpha=0.6)

# Add vertical lines for confidence intervals

plt.axvline(female_lower, color='red', linestyle='dashed', label=f'Female 95% CI: {female_lower:.2f} - {female_upper:.2f}')
plt.axvline(female_upper, color='red', linestyle='dashed')
plt.axvline(male_lower, color='green', linestyle='dashed', label=f'Male 95% CI: {male_lower:.2f} - {male_upper:.2f}')
plt.axvline(male_upper, color='green', linestyle='dashed')

# Add means
plt.axvline(np.mean(female_sample_means), color='purple', linestyle='solid', label=f'Female Mean: {np.mean(female_sample_means):.2f}')
plt.axvline(np.mean(male_sample_means), color='blue', linestyle='solid', label=f'Male Mean: {np.mean(male_sample_means):.2f}')

# Labels and title

plt.xlabel("Sample Mean of Purchase")
plt.ylabel("Frequency")
plt.title("Comparison of Male vs. Female Purchase Means with 95% CI")
plt.legend()
plt.show()

# Print confidence intervals

print(f"95% Confidence Interval for Female Purchase Mean: ({female_lower:.2f}, {female_upper:.2f})")
print(f"95% Confidence Interval for Male Purchase Mean: ({male_lower:.2f}, {male_upper:.2f})")


95% Confidence Interval for Female Purchase Mean: (8178.71, 9206.66)

95% Confidence Interval for Male Purchase Mean: (8790.47, 9959.93)
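Comparing intervals by eye is only an informal check. As an alternative that the original notebook does not use, Welch's two-sample t-test can be run directly from the per-gender summary statistics printed earlier (count, mean, std) with `scipy.stats.ttest_ind_from_stats`. The sketch treats every row as an independent observation, which is an assumption: the 547,391 rows come from only 5,891 users, so the resulting p-value is likely overconfident.

```python
from scipy import stats

# Summary statistics copied from df.groupby('Gender')['Purchase'].describe()
female = {"mean": 8671.049039, "std": 4679.058483, "n": 135220}
male = {"mean": 9367.724355, "std": 5009.234088, "n": 412171}

# Welch's t-test (equal_var=False) does not assume equal group variances
result = stats.ttest_ind_from_stats(
    mean1=male["mean"], std1=male["std"], nobs1=male["n"],
    mean2=female["mean"], std2=female["std"], nobs2=female["n"],
    equal_var=False,
)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3g}")
```

With groups this large, even a modest difference in means yields a very small p-value, so the size of the gap (about 700 per purchase) is more informative than the p-value alone.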

Impact of Marital Status on Spending Amount

# Mapping integer values 0/1 to Unmarried/Married -
df['Marital_Status'] = df['Marital_Status'].map({0: 'Unmarried', 1: 'Married'})

# Marital Status -
counts = df['Marital_Status'].value_counts()

plt.figure(figsize=(5, 5))
# Take labels from the same Series as the values so the two cannot get out of order
plt.pie(counts.values, center=(0, 0), radius=1.5, labels=counts.index, autopct='%1.1f%%', pctdistance=0.5)
plt.title('Marital Status')
plt.axis('equal')
plt.show()


# No. of unique users in each category -

df.groupby('Marital_Status')['User_ID'].nunique()

User_ID

Marital_Status

Married 2474

Unmarried 3417

dtype: int64

# Checking different metrics based on purchase by different categories -

df.groupby('Marital_Status')['Purchase'].describe()

                   count         mean          std   min     25%     50%      75%      max
Marital_Status
Married         224149.0  9187.040076  4925.205232  12.0  5833.0  8042.0  12006.0  21398.0
Unmarried       323242.0  9201.581849  4948.327397  12.0  5480.0  8035.0  12028.0  21399.0

# Plotting all the observations -

sns.displot(x='Purchase', data=df, bins=25, hue='Marital_Status')

plt.axvline(x=df['Purchase'].mean(), color='r')
plt.axvline(x=df[df['Marital_Status']=='Married']['Purchase'].mean(), color='g')

plt.axvline(x=df[df['Marital_Status']=='Unmarried']['Purchase'].mean(), color='b')

plt.show()

The distribution of Purchase appears close to normal.

Let us draw 1,000 random samples of size 300 from each marital-status group and calculate the mean of each sample.

Size = 300

Iterations = 1000

unmarried_sample_means=[df[df['Marital_Status']=='Unmarried']['Purchase'].sample(300).mean() for i in range(1000)]

married_sample_means=[df[df['Marital_Status']=='Married']['Purchase'].sample(300).mean() for i in range(1000)]

plt.figure(figsize=(10, 6))

# Plot both distributions

sns.histplot(unmarried_sample_means, bins=30, kde=True, color='blue', label="Unmarried")
sns.histplot(married_sample_means, bins=30, kde=True, color='green', label="Married")

# Add vertical lines for the population mean of each category

plt.axvline(x=df[df['Marital_Status'] == 'Unmarried']['Purchase'].mean(), color='blue', linestyle='dashed', label="Unmarried Mean")
plt.axvline(x=df[df['Marital_Status'] == 'Married']['Purchase'].mean(), color='green', linestyle='dashed', label="Married Mean")


plt.xlabel("Sample Mean")
plt.ylabel("Frequency")
plt.title("Comparison of Sampling Distributions (Married vs. Unmarried)")
plt.legend()
plt.show()

# Mean of the sampling distributions

mean_unmarried = np.mean(unmarried_sample_means)
mean_married = np.mean(married_sample_means)

# Standard deviation of the sampling distributions (Standard Error)

std_unmarried = np.std(unmarried_sample_means, ddof=1) # ddof=1 for sample std dev
std_married = np.std(married_sample_means, ddof=1)

print(f"Unmarried - Mean: {mean_unmarried:.2f}, Standard Deviation: {std_unmarried:.2f}")

print(f"Married - Mean: {mean_married:.2f}, Standard Deviation: {std_married:.2f}")

Unmarried - Mean: 9210.43, Standard Deviation: 276.42

Married - Mean: 9187.20, Standard Deviation: 282.31
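These spreads can be cross-checked against the theoretical standard error σ/√n, using the group standard deviations from the describe() table above and n = 300. A small sketch of that arithmetic follows; the dictionary layout is only for illustration.

```python
import math

n = 300  # sample size used for each sample mean

# Group std from describe() and the empirical spread of the 1,000 sample means
groups = {
    "Unmarried": {"sigma": 4948.327397, "observed_se": 276.42},
    "Married": {"sigma": 4925.205232, "observed_se": 282.31},
}

for name, g in groups.items():
    theoretical_se = g["sigma"] / math.sqrt(n)
    print(f"{name}: theoretical SE = {theoretical_se:.2f}, observed = {g['observed_se']:.2f}")
```

The theoretical values come out at roughly 286 and 284, within a few percent of the observed spreads, which is the agreement expected from 1,000 repetitions.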

Insights from the Sampling Distributions

Mean Comparison:
Unmarried Mean = 9210.43

Married Mean = 9187.20

Observation: The means are very close, indicating that both groups have similar average purchases.

Standard Deviation (Standard Error) Comparison:

Unmarried SE = 276.42

Married SE = 282.31

Observation: The standard errors are also close, suggesting that both groups have similar variability in their spending patterns.

To determine a range where the true mean likely falls for each group, let's calculate the 95% Confidence Interval (CI)

# 95% Confidence Interval Calculation
# std_unmarried and std_married are already standard errors (the standard
# deviation of the sample means), so they are used as the scale directly
# rather than being divided by a square root of the sample size again.

ci_unmarried = stats.norm.interval(0.95, loc=mean_unmarried, scale=std_unmarried)
ci_married = stats.norm.interval(0.95, loc=mean_married, scale=std_married)

print(f"95% CI for Unmarried: ({ci_unmarried[0]:.0f}, {ci_unmarried[1]:.0f})")

print(f"95% CI for Married: ({ci_married[0]:.0f}, {ci_married[1]:.0f})")

95% CI for Unmarried: (8669, 9752)

95% CI for Married: (8634, 9741)

plt.figure(figsize=(8, 6))

# Unmarried Error Bar

plt.errorbar(x='Unmarried',
y=mean_unmarried,
yerr=ci_unmarried[1] - mean_unmarried,
fmt='o', color='blue', label='Unmarried')

# Married Error Bar

plt.errorbar(x='Married',
y=mean_married,
yerr=ci_married[1] - mean_married,
fmt='o', color='green', label='Married')

plt.ylabel('Mean Purchase')
plt.title('Mean Purchase with 95% Confidence Interval')
plt.legend()
plt.show()


plt.figure(figsize=(8, 6))
sns.kdeplot(df[df['Marital_Status'] == 'Unmarried']['Purchase'], label='Unmarried', color='blue', fill=True)
sns.kdeplot(df[df['Marital_Status'] == 'Married']['Purchase'], label='Married', color='green', fill=True)

plt.axvline(mean_unmarried, color='blue', linestyle='dashed', label='Unmarried Mean')

plt.axvline(mean_married, color='green', linestyle='dashed', label='Married Mean')

plt.title('Purchase Distribution by Marital Status')

plt.xlabel('Purchase Amount')
plt.legend()
plt.show()


📌 Summary of Observations & Insights

Throughout this analysis, we explored purchase behavior across different demographic factors using statistical methods and visualizations.
Below is a structured summary of our key findings:

🔍 1. Purchase Behavior by Gender

Boxplot Analysis: Males tend to spend slightly more than females.
Distribution Analysis: Mean purchase values for males and females are close, but females spend slightly less.

📊 Sampling Distribution:
Mean of the sampled female purchases closely matches the population mean (8671 vs. 8675).
Mean of the sampled male purchases closely matches the population mean (9367 vs. 9371).
✅ This confirms the Central Limit Theorem: The sampling distribution mean approximates the population mean.

🔍 2. Purchase Behavior by Marital Status

Married vs. Unmarried Spending:

Unmarried Mean: 9210.43

Married Mean: 9187.20
📌 Standard deviations are close, indicating similar purchase variability.
Distribution Analysis: The spending distribution is approximately normal for both groups.

📊 Sampling Distribution Analysis:

Sampling distribution means for both groups closely match their original population means.
✅ This further validates the Central Limit Theorem.

📊 3. Confidence Interval Analysis

95% Confidence Intervals (CI) were calculated and visualized for both Married and Unmarried groups.
Visualization Insight:

The confidence intervals overlap, suggesting no statistically significant difference in spending between married and unmarried
users.

🚀 Final Takeaways
✔ The Central Limit Theorem holds, as the sampling distribution means approximate population means.
✔ Males spend about 8% more per purchase on average (9368 vs. 8671); the 95% intervals for the two genders overlap at sample size 300 but separate at sample size 1,000.
✔ Marital status has little impact on purchase behavior, as seen from the overlapping confidence intervals.
✔ The overall spending distribution is close to normal, making sampling-based estimations reliable.


No ratings yet
Assignment
14 pages
Measures of Central Tendency
No ratings yet
Measures of Central Tendency
8 pages
Skewness and Kurtosis Original
No ratings yet
Skewness and Kurtosis Original
38 pages
Xi Com First Term Maths QB 2024-25
No ratings yet
Xi Com First Term Maths QB 2024-25
3 pages
DSBDA2
No ratings yet
DSBDA2
6 pages
Statistics - 11th Elite (1)-invert (1)-1 - converted (1)
No ratings yet
Statistics - 11th Elite (1)-invert (1)-1 - converted (1)
32 pages
Construction of A Box-Plot
No ratings yet
Construction of A Box-Plot
9 pages
MMW Module 8 - Measures of Relative Position
No ratings yet
MMW Module 8 - Measures of Relative Position
11 pages
Regresi Panel
No ratings yet
Regresi Panel
25 pages
1.8: Statistics: Example: Find The Mean, Median, Mode, and Range For The
No ratings yet
1.8: Statistics: Example: Find The Mean, Median, Mode, and Range For The
3 pages
AP Stats Final Review Ch1-15 Includes Answers
No ratings yet
AP Stats Final Review Ch1-15 Includes Answers
9 pages
Toreno, James S. (Midterm On Assessment in Learning 2)
No ratings yet
Toreno, James S. (Midterm On Assessment in Learning 2)
11 pages
Excel Notes in Hindi
No ratings yet
Excel Notes in Hindi
45 pages
Assignment 1
No ratings yet
Assignment 1
2 pages
10 - Quantiles or Fractiles-1
No ratings yet
10 - Quantiles or Fractiles-1
26 pages
Statistics 2
No ratings yet
Statistics 2
74 pages
Econ-2042 - Unit 2-W3-5
No ratings yet
Econ-2042 - Unit 2-W3-5
54 pages
Introduction To The Normal Distribution: X CDF (X) P (X) (X-M) (X-M) /M
No ratings yet
Introduction To The Normal Distribution: X CDF (X) P (X) (X-M) (X-M) /M
4 pages
STAT Module
No ratings yet
STAT Module
25 pages
Mean Mode Median
100% (1)
Mean Mode Median
17 pages
MSD - Unit - Ii - MCQ
No ratings yet
MSD - Unit - Ii - MCQ
14 pages
Chapter 4: Standardized Scores and The Normal Distribution
No ratings yet
Chapter 4: Standardized Scores and The Normal Distribution
9 pages
Chapter 3 Sta404
No ratings yet
Chapter 3 Sta404
11 pages
Data Analysis Midterm
No ratings yet
Data Analysis Midterm
25 pages
Chapter 05 - Sampling and Sampling Distribution
No ratings yet
Chapter 05 - Sampling and Sampling Distribution
5 pages
MCQs_Sol
No ratings yet
MCQs_Sol
2 pages
Chapter 17 Correlation and Regression
No ratings yet
Chapter 17 Correlation and Regression
16 pages
Mathematics & Statistics for Management
No ratings yet
Mathematics & Statistics for Management
13 pages
Experiment 1 Lab Report
No ratings yet
Experiment 1 Lab Report
10 pages


# Importing required libraries -

# For reading & manipulating the data -

# For visualizing the data -

# For statistical functions -

# Loading the dataset -

# Shape of the dataset

No. of rows: 550068

Delete Null values and outliers

No null values found

# Checking for duplicates

No duplicate rows found
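
The cells that produced the two messages above are not shown in this extract. A minimal sketch of how such checks are commonly written in pandas follows; the frame `toy` and its values are hypothetical stand-ins for `df`, and the printed messages simply mirror the notebook's output.

```python
import pandas as pd

# Hypothetical stand-in for the Walmart frame; values are made up for illustration.
toy = pd.DataFrame({
    "User_ID": [1, 2, 3, 4],
    "Gender": ["F", "M", "M", "F"],
    "Purchase": [8370, 15200, 1422, 7969],
})

null_counts = toy.isnull().sum()            # missing values per column
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row

print("No null values found" if null_counts.sum() == 0 else null_counts)
print("No duplicate rows found" if n_duplicates == 0 else f"{n_duplicates} duplicate rows")
```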

attrs = ['Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status']

fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(20, 16))

df_clean=df.drop(df[ (df.Purchase > (q3+1.5*IQR)) | (df.Purchase < (q1-1.5*IQR)) ].index)
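
The line above relies on `q1`, `q3`, and `IQR`, whose definitions are not visible in this extract. The sketch below assumes they are the usual 25th and 75th percentiles of `Purchase` and their difference, and applies the same 1.5 × IQR rule to a small hypothetical frame (`toy`) with one planted outlier.

```python
import pandas as pd

# Hypothetical purchase amounts; 99999 is an outlier planted for the demo.
toy = pd.DataFrame({"Purchase": [5000, 6000, 7000, 8000, 9000, 10000, 11000, 99999]})

q1 = toy["Purchase"].quantile(0.25)
q3 = toy["Purchase"].quantile(0.75)
IQR = q3 - q1

# Keep rows inside [q1 - 1.5*IQR, q3 + 1.5*IQR]; this matches dropping rows
# strictly above the upper fence or strictly below the lower fence.
inside = toy["Purchase"].between(q1 - 1.5 * IQR, q3 + 1.5 * IQR)
toy_clean = toy[inside]

print(len(toy), "->", len(toy_clean))
```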

Data Exploration

Age 0-17 2.746117

fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(16, 12))

sns.barplot(data=df, x='Age', y='Purchase',hue="Gender")

<Axes: xlabel='Age', ylabel='Purchase'>

sns.barplot(data=df, x='Age', y='Purchase',hue="Marital_Status")

<Axes: xlabel='Age', ylabel='Purchase'>

As expected, there are no married individuals in the 0-17 age group.

How does gender affect the amount spent?

      count         mean          std   min     25%     50%      75%      max
F  135220.0  8671.049039  4679.058483  12.0  5429.0  7906.0  11064.0  21398.0
M  412171.0  9367.724355  5009.234088  12.0  5852.0  8089.0  12247.0  21399.0

## run it multiple times

   count         mean          std    min     25%     50%      75%      max
F  107.0  8739.476636  4092.670974  743.0  5996.0  8041.0  10551.0  20634.0
M  393.0  8877.592875  4569.020437  136.0  5938.0  7994.0  11729.0  20419.0
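
The two tables above have the shape of per-gender `describe()` output, first on all rows and then on a random sample of 500 rows (107 + 393). The generating cells are not shown, so the following is only a sketch of one way to obtain such tables; `toy` is synthetic data, and the 25/75 gender split is an assumption chosen to resemble the counts above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the cleaned frame; not the real Walmart data.
toy = pd.DataFrame({
    "Gender": rng.choice(["F", "M"], size=5000, p=[0.25, 0.75]),
    "Purchase": rng.integers(12, 21400, size=5000),
})

# Summary statistics per gender on all rows.
full_summary = toy.groupby("Gender")["Purchase"].describe()

# Same summary on a random sample of 500 rows; rerunning this line draws a
# different sample each time, so the numbers shift from run to run.
sample_summary = toy.sample(500).groupby("Gender")["Purchase"].describe()

print(full_summary)
print(sample_summary)
```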

sns.displot(x='Purchase', bins=25, kde=True,hue='Gender', data=df )

We can see that the distribution is close to normal.

sns.displot(x='Purchase', data=df, bins=25, hue='Gender')

Let's dive deeper into the details.

Female Purchases

Female Purchase amount Mean:8671.049038603756

female_sample_means=[df[df['Gender']=='F'].sample(300, replace=True)['Purchase'].mean() for i in range(1000)]

females_sample_means_600=[df[df['Gender']=='F'].sample(600, replace=True)['Purchase'].mean() for i in range(1000)]

print("Sample distribution mean for females : ",sum(females_sample_means_600) / len(females_sample_means_600))

Sample distribution mean for females : 8675.96046499999
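
The repeated-sampling lines above build a sampling distribution of the mean for two sample sizes. As a rough illustration of what to expect, the sketch below uses synthetic right-skewed data (`purchases`, an assumption, not the Walmart column) and a hypothetical helper `sample_means`: the average of the sample means stays close to the overall mean, while their spread shrinks as the sample size grows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed amounts standing in for one gender's purchases.
purchases = pd.Series(rng.gamma(shape=3.0, scale=3000.0, size=20000))

def sample_means(series, n, reps=1000, seed=0):
    """Means of `reps` samples of size `n`, each drawn with replacement."""
    draws = np.random.default_rng(seed).choice(
        series.to_numpy(), size=(reps, n), replace=True
    )
    return draws.mean(axis=1)

means_300 = sample_means(purchases, 300)
means_600 = sample_means(purchases, 600)

print("overall mean:", purchases.mean())
print("mean of sample means (n=300, n=600):", means_300.mean(), means_600.mean())
print("spread of sample means (n=300, n=600):", means_300.std(), means_600.std())
```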

# Generate bootstrap sample means for females

# Compute the 95% Confidence Interval

# Add vertical lines for confidence interval

# Labels and title

# Print confidence interval

95% Confidence Interval for Female Purchase Mean: (8393.90, 8977.83)
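
Only the comments of the bootstrap cell survive in this extract. A minimal sketch of a percentile bootstrap interval for a mean is given below; the data (`female_purchases`) is synthetic, and the sample size of 300 and 1000 repetitions are assumptions, so the printed interval will not match the one reported above.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-in for the female Purchase column.
female_purchases = rng.gamma(shape=3.0, scale=2900.0, size=5000)

# Generate bootstrap sample means (sample size and repetitions are assumptions).
boot_means = np.array([
    rng.choice(female_purchases, size=300, replace=True).mean()
    for _ in range(1000)
])

# Compute the 95% confidence interval from the 2.5th and 97.5th percentiles.
lower, upper = np.percentile(boot_means, [2.5, 97.5])

print(f"95% Confidence Interval for Female Purchase Mean: ({lower:.2f}, {upper:.2f})")
```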

Male Purchases

Male Purchase amount Mean:9367.724354697444

male_sample_means=[df[df['Gender']=='M'].sample(300, replace=True)['Purchase'].mean() for i in range(1000)]

males_sample_means_600=[df[df['Gender']=='M'].sample(600, replace=True)['Purchase'].mean() for i in range(1000)]

print("Sample distribution mean for males : ",sum(male_sample_means) / len(male_sample_means))

Sample distribution mean for males : 9371.82929899999

# Generate bootstrap sample means for males

# Compute the 95% Confidence Interval

# Add vertical lines for confidence interval

# Labels and title

95% Confidence Interval for Male Purchase Mean: (9059.41, 9672.62)

# Compute the 95% Confidence Intervals

# Add vertical lines for confidence intervals

# Labels and title

# Print confidence intervals

95% Confidence Interval for Female Purchase Mean: (8178.71, 9206.66)
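
One way to read the intervals is to check whether the female and male ranges overlap. The sketch below uses the first pair of intervals reported earlier in this section, (8393.90, 8977.83) for females and (9059.41, 9672.62) for males; `intervals_overlap` is a hypothetical helper, not a function from the notebook.

```python
def intervals_overlap(a, b):
    """True when closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

female_ci = (8393.90, 8977.83)  # female interval reported earlier in this section
male_ci = (9059.41, 9672.62)    # male interval reported earlier in this section

print(intervals_overlap(female_ci, male_ci))  # False: female upper bound < male lower bound
```

Non-overlapping intervals are consistent with a real difference in mean spend between the two groups, although overlap alone is not a formal hypothesis test.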

Impact of Marital Status on Spending Amount

# No. of unique users in each category -

# Checking different metrics based on purchase by different categories -

              count         mean          std   min     25%     50%      75%      max
Married    224149.0  9187.040076  4925.205232  12.0  5833.0  8042.0  12006.0  21398.0
Unmarried  323242.0  9201.581849  4948.327397  12.0  5480.0  8035.0  12028.0  21399.0
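
The two steps named in the comments above (unique users per category, then purchase metrics per category) could be sketched as follows. The frame `toy` is hypothetical, and the 0/1 coding of `Marital_Status` with the mapping to "Unmarried"/"Married" is an assumption about how the labels in the table were produced.

```python
import pandas as pd

# Hypothetical rows; the 0/1 coding of Marital_Status is an assumption.
toy = pd.DataFrame({
    "User_ID": [1, 1, 2, 3, 3, 4],
    "Marital_Status": [0, 0, 1, 1, 1, 0],
    "Purchase": [5000, 7000, 9000, 6000, 8000, 12000],
})
toy["Marital_Status"] = toy["Marital_Status"].map({0: "Unmarried", 1: "Married"})

# No. of unique users in each category
unique_users = toy.groupby("Marital_Status")["User_ID"].nunique()

# Purchase metrics by category
summary = toy.groupby("Marital_Status")["Purchase"].describe()

print(unique_users)
print(summary)
```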

# Plotting all the observations -

sns.displot(x='Purchase', data=df, bins=25, hue='Marital_Status')

The distribution of Purchase appears close to normal.

df_clean=df.drop(df[ (df.Purchase > (q3+1.5*IQR)) | (df.Purchase < (q1-1.5*IQR)) ].index)