Walmart Business Case Study.ipynb - Colab
import gdown
!gdown 1lcjtmvtSjco6cnWaJvgJNTGjGJoKieh1
Downloading...
From: https://drive.google.com/uc?id=1lcjtmvtSjco6cnWaJvgJNTGjGJoKieh1
To: /content/walmart-data.csv
100% 23.0M/23.0M [00:00<00:00, 35.3MB/s]
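import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the file downloaded above into a DataFrame
df = pd.read_csv('walmart-data.csv')
df.info()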
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 550068 non-null int64
1 Product_ID 550068 non-null object
2 Gender 550068 non-null object
3 Age 550068 non-null object
4 Occupation 550068 non-null int64
5 City_Category 550068 non-null object
6 Stay_In_Current_City_Years 550068 non-null object
7 Marital_Status 550068 non-null int64
8 Product_Category 550068 non-null int64
9 Purchase 550068 non-null int64
https://colab.research.google.com/drive/1KQYVbLLPQh_jGV_DacOK1Ii8M4marGGo#scrollTo=d0KtEd2SFcOl&printMode=true 1/28
2/17/25, 11:35 PM Walmart_Business_Case_Study.ipynb - Colab
dtypes: int64(5), object(5)
memory usage: 42.0+ MB
User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 0
Product_Category 0
Purchase 0
dtype: int64
sns.kdeplot(data=df, x='Purchase')
plt.show()
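# Flag outliers in Purchase using the 1.5 * IQR rule
q1 = df.Purchase.quantile(0.25)
q3 = df.Purchase.quantile(0.75)
print(q1, q3)
IQR = q3 - q1
outliers = df[(df.Purchase < (q1 - 1.5 * IQR)) | (df.Purchase > (q3 + 1.5 * IQR))]
print("num outliers : ", len(outliers))
# Note: the value printed below is a fraction of rows (about 0.49%), not a percentage
print("percent outliers : ", len(outliers) / len(df))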
5823.0 12054.0
num outliers : 2677
percent outliers : 0.004866671029763593
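The cell that builds `df_clean` is not included in this printout. The row count in the next output (547,391 = 550,068 - 2,677) is consistent with simply dropping the flagged rows, so one plausible reconstruction, shown on a small made-up frame, might look like the sketch below. The helper name `drop_iqr_outliers` and the `toy` data are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def drop_iqr_outliers(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return the rows of `frame` whose `column` value lies within the IQR fences."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive, which is the complement of the strict < / > outlier test
    keep = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[keep]

# Toy data (hypothetical): one extreme purchase among ordinary ones
toy = pd.DataFrame({"Purchase": [5000, 6000, 7000, 8000, 9000, 100000]})
toy_clean = drop_iqr_outliers(toy, "Purchase")
print(len(toy), "->", len(toy_clean))
```

On the real data this would be called as `df_clean = drop_iqr_outliers(df, 'Purchase')`, assuming that is indeed how the notebook filtered the rows.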
<class 'pandas.core.frame.DataFrame'>
Index: 547391 entries, 0 to 550067
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 547391 non-null int64
1 Product_ID 547391 non-null object
2 Gender 547391 non-null object
3 Age 547391 non-null object
4 Occupation 547391 non-null int64
5 City_Category 547391 non-null object
6 Stay_In_Current_City_Years 547391 non-null object
7 Marital_Status 547391 non-null int64
8 Product_Category 547391 non-null int64
9 Purchase 547391 non-null int64
dtypes: int64(5), object(5)
memory usage: 45.9+ MB
# drop=True discards the old index instead of adding it as an extra 'index' column
df = df_clean.reset_index(drop=True)
df['User_ID'].nunique()
5891
df['Product_ID'].nunique()
3631
df['Gender'].value_counts()/len(df)
count
Gender
M 0.752974
F 0.247026
dtype: float64
Categorical_rows = ['Gender','Age','Stay_In_Current_City_Years','Marital_Status','City_Category']
df[Categorical_rows].melt().groupby(['variable','value'])[['value']].count()*100/len(df)
                                       value
variable                    value
Age                         18-25  18.146809
                            26-35  39.946035
                            36-45  19.987358
                            46-50   8.301561
                            51-55   6.976914
                            55+     3.895205
City_Category               A      26.861238
                            B      42.038324
                            C      31.100438
Gender                      F      24.702635
                            M      75.297365
Marital_Status              0      59.051391
                            1      40.948609
Stay_In_Current_City_Years  0      13.525250
                            1      35.229845
                            2      18.521313
                            3      17.319247
                            4+     15.404345
1. Age Distribution The largest share of purchase records (about 40%) comes from the 26-35 age group, making it the largest customer segment. The 36-45 age group follows with 20%. Younger customers (18-25) account for 18%, while older customers (46 and above) together contribute around 19%. Only about 3% of records come from customers under 18.
2. City Category Distribution The highest proportion of records (42%) comes from City Category B. City C contributes 31%, while City A accounts for 27%. Categories B and C together make up about 73% of the customer base; the data does not say what the categories represent, so reading them as mid-sized cities is an assumption.
3. Gender Distribution A much higher percentage of records (75%) comes from male customers, with only 25% from female customers. This suggests that male customers account for most purchases, although it does not by itself give the share of unique users.
4. Marital Status A larger proportion of records (59%) has Marital_Status 0 (single), compared to 41% with Marital_Status 1 (married). This suggests that the product may appeal more to unmarried individuals.
5. Stay Duration in Current City The largest group (35%) has been staying in the current city for 1 year, followed by 18.5% for 2 years, 17% for 3 years, and 15% for 4+ years. A smaller proportion (13.5%) are newcomers (0 years). This points to a mix of settled and recently moved customers.
- Overall Summary
Taken one attribute at a time, the largest segments are male customers, the 26-35 age group, and City Category B. A large proportion of purchases comes from customers who are single and have lived in their city for 1-3 years. City B and C residents form the majority, while City A has the lowest representation.
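The percentages above are computed over purchase records, so a customer who buys often is counted many times. A user-level view can be sketched by keeping one row per `User_ID` before counting. The toy frame below is made up purely to show the difference between the two views.

```python
import pandas as pd

# Hypothetical toy data: user 1 (M) bought three times, user 2 (F) once
toy = pd.DataFrame({
    "User_ID": [1, 1, 1, 2],
    "Gender": ["M", "M", "M", "F"],
})

# Share of purchase records per gender (counts every row)
record_share = toy["Gender"].value_counts(normalize=True)

# Share of unique users per gender: keep one row per user before counting
user_share = toy.drop_duplicates("User_ID")["Gender"].value_counts(normalize=True)

print(record_share["M"], user_share["M"])
```

Here males account for 75% of records but only 50% of users, which is why the two figures should not be used interchangeably.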
plt.figure(figsize=(10, 8))
sns.countplot(data=df, x='Product_Category')
plt.show()
plt.figure(figsize=(20, 8))
sns.countplot(data=df, x='Age',hue='Marital_Status')
plt.show()
plt.figure(figsize=(20, 8))
sns.countplot(data=df, x='Product_Category',hue='Marital_Status')
plt.show()
<seaborn.axisgrid.FacetGrid at 0x7a9b40f3d890>
plt.axvline(x=df['Purchase'].mean(), color='r')
plt.axvline(x=df[df['Gender']=='M']['Purchase'].mean(), color='b')
plt.axvline(x=df[df['Gender']=='F']['Purchase'].mean(), color='g')
plt.show()
The mean purchase amounts for males, females, and the overall dataset are fairly close, with females spending somewhat less on average (about 8,671 versus 9,367 for males).
Let us draw 1,000 random samples of size 300 (with replacement) from each gender's purchases and calculate the mean of each sample.
sns.displot(female_sample_means,bins=30, kde=True )
<seaborn.axisgrid.FacetGrid at 0x7a9b42b2ce90>
sns.displot(females_sample_means_600,bins=30, kde=True )
<seaborn.axisgrid.FacetGrid at 0x7a9b43d43450>
As the sample size increases (here from 300 to 600), the sampling distribution of the mean looks closer to normal and its spread, the standard error, shrinks.
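The effect of sample size on the spread can be checked numerically. The sketch below uses a synthetic, right-skewed population rather than the Walmart data, and the helper `sample_means` is a hypothetical name. It compares the standard deviation of 1,000 sample means at sizes 300 and 600 with the theoretical standard error sigma / sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, right-skewed "purchase" population (made up for illustration)
population = rng.gamma(shape=2.0, scale=4500.0, size=100_000)

def sample_means(values, size, iterations=1000):
    """Means of `iterations` samples of `size` values drawn with replacement."""
    draws = rng.choice(values, size=(iterations, size), replace=True)
    return draws.mean(axis=1)

means_300 = sample_means(population, 300)
means_600 = sample_means(population, 600)

# Observed spread of the sample means at each sample size
print(means_300.std(ddof=1), means_600.std(ddof=1))
# Theoretical standard errors: sigma / sqrt(n)
print(population.std() / np.sqrt(300), population.std() / np.sqrt(600))
```

Doubling the sample size should shrink the spread by a factor of roughly 1/sqrt(2), or about 0.71.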
# Visualization
plt.figure(figsize=(10, 6))
sns.histplot(female_sample_means, bins=30, kde=True, color="purple")
plt.legend()
plt.show()
sns.displot(male_sample_means,bins=30, kde=True )
<seaborn.axisgrid.FacetGrid at 0x7a9b46d12610>
sns.displot(males_sample_means_600,bins=30, kde=True )
<seaborn.axisgrid.FacetGrid at 0x7a9b47134c90>
# Visualization
plt.figure(figsize=(10, 6))
sns.histplot(male_sample_means, bins=30, kde=True, color="purple")
# Print confidence interval
print(f"95% Confidence Interval for Male Purchase Mean: ({lower_bound:.2f}, {upper_bound:.2f})")
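The code that computes `lower_bound` and `upper_bound` is not included in this printout. One common choice, assumed here, is the percentile method: take the 2.5th and 97.5th percentiles of the sample means. The sketch below uses synthetic values as a stand-in for `male_sample_means`.

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for male_sample_means (synthetic values, not the notebook's data)
sample_means = rng.normal(loc=9370.0, scale=290.0, size=1000)

# Percentile method: the middle 95% of the sample means
lower_bound, upper_bound = np.percentile(sample_means, [2.5, 97.5])
print(f"95% interval: ({lower_bound:.2f}, {upper_bound:.2f})")
```

An interval built this way describes how much the mean of a sample of that size varies, so its width depends on the sample size used (300 in this notebook).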
The mean of the sampling distribution, which represents the average of all the sample means taken, is very close to the original population
mean.
Females:
Population Mean: 8671
Sample Mean: 8675
Males:
Population Mean: 9367
Sample Mean: 9371
This is consistent with a standard property of sampling distributions that accompanies the Central Limit Theorem: the mean of the sample means approximates the population mean.
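# Select each gender's purchases once, then draw 1,000 samples of size 300 with replacement
female_purchases = df.loc[df['Gender'] == 'F', 'Purchase']
male_purchases = df.loc[df['Gender'] == 'M', 'Purchase']
female_sample_means = [female_purchases.sample(300, replace=True).mean() for _ in range(1000)]
male_sample_means = [male_purchases.sample(300, replace=True).mean() for _ in range(1000)]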
# Visualization
plt.figure(figsize=(12, 6))
sns.histplot(female_sample_means, bins=30, kde=True, color="purple", label="Female Sample Means", alpha=0.6)
sns.histplot(male_sample_means, bins=30, kde=True, color="blue", label="Male Sample Means", alpha=0.6)
# Add means
plt.axvline(np.mean(female_sample_means), color='purple', linestyle='solid', label=f'Female Mean: {np.mean(female_sample_means):.2f}')
plt.axvline(np.mean(male_sample_means), color='blue', linestyle='solid', label=f'Male Mean: {np.mean(male_sample_means):.2f}')
# Marital Status -
# Take values and labels from the same value_counts() result so each slice gets the right label
counts = df['Marital_Status'].value_counts()
plt.figure(figsize=(5, 5))
plt.pie(counts.values, center=(0, 0), radius=1.5, labels=counts.index, autopct='%1.1f%%', pctdistance=0.5)
plt.title('Marital Status')
plt.axis('equal')
plt.show()
User_ID
Marital_Status
Married 2474
Unmarried 3417
dtype: int64
plt.axvline(x=df['Purchase'].mean(), color='r')
plt.axvline(x=df[df['Marital_Status']=='Married']['Purchase'].mean(), color='g')
plt.axvline(x=df[df['Marital_Status']=='Unmarried']['Purchase'].mean(), color='b')
plt.show()
Size = 300
Iterations = 1000
plt.figure(figsize=(10, 6))
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")
plt.title("Comparison of Sampling Distributions (Married vs. Unmarried)")
plt.legend()
plt.show()
Observation: The means are very close, indicating that both groups have similar average purchases.
Unmarried SE = 276.42
Married SE = 282.31
Observation: The standard errors are also close, suggesting that both groups have similar variability in their spending patterns.
To determine a range where the true mean likely falls for each group, let's calculate the 95% Confidence Interval (CI)
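The cell that computes these intervals is not shown in this printout. A minimal sketch, assuming the usual large-sample normal approximation (mean ± 1.96 × standard error), could look like this. The function name `mean_ci_95` and the `purchases` values are hypothetical.

```python
import math
import statistics

def mean_ci_95(values):
    """Normal-approximation 95% CI for the mean: mean +/- 1.96 * s / sqrt(n)."""
    n = len(values)
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    return mean - 1.96 * se, mean + 1.96 * se

# Hypothetical purchase amounts for one group (300 evenly spaced values)
purchases = [5000 + 50 * i for i in range(300)]
low, high = mean_ci_95(purchases)
print(f"({low:.2f}, {high:.2f})")
```

The factor 1.96 is the normal-distribution quantile for 95% coverage; for small samples a t-based interval would be the more cautious choice.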
plt.figure(figsize=(8, 6))
plt.ylabel('Mean Purchase')
plt.title('Mean Purchase with 95% Confidence Interval')
plt.legend()
plt.show()
plt.figure(figsize=(8, 6))
sns.kdeplot(df[df['Marital_Status'] == 'Unmarried']['Purchase'], label='Unmarried', color='blue', fill=True)
sns.kdeplot(df[df['Marital_Status'] == 'Married']['Purchase'], label='Married', color='green', fill=True)
📊 Sampling Distribution:
Mean of the sampled female purchases closely matches the population mean (8671 vs. 8675).
Mean of the sampled male purchases closely matches the population mean (9367 vs. 9371).
✅ This is consistent with the Central Limit Theorem: the mean of the sampling distribution approximates the population mean.
Married Mean: 9187.20
📌 Standard deviations are close, indicating similar purchase variability.
Distribution Analysis: The spending distribution is approximately normal for both groups.
The confidence intervals overlap, suggesting no clear evidence of a difference in average spending between married and unmarried users.
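Overlap between two intervals is a conservative rule of thumb; a confidence interval for the difference of the two means is a more direct check. The sketch below shows both on made-up numbers, assuming independent groups and a normal approximation. The helper names and all values are illustrative, not results from this dataset.

```python
import math

def intervals_overlap(a, b):
    """True if closed intervals a=(low, high) and b=(low, high) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

def diff_ci_95(mean_a, se_a, mean_b, se_b):
    """95% normal-approximation CI for mean_a - mean_b, assuming independent groups."""
    diff = mean_a - mean_b
    se = math.sqrt(se_a ** 2 + se_b ** 2)
    return diff - 1.96 * se, diff + 1.96 * se

# Made-up group summaries: (mean, standard error)
mean_a, se_a = 115.0, 5.0
mean_b, se_b = 100.0, 5.0

ci_a = (mean_a - 1.96 * se_a, mean_a + 1.96 * se_a)
ci_b = (mean_b - 1.96 * se_b, mean_b + 1.96 * se_b)
diff_low, diff_high = diff_ci_95(mean_a, se_a, mean_b, se_b)

print(intervals_overlap(ci_a, ci_b))  # the two individual intervals overlap
print(diff_low, diff_high)            # yet the difference interval excludes zero
```

In this toy case the individual intervals overlap while the interval for the difference does not contain zero, so overlap alone is best read as "no clear evidence" rather than proof of no difference.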
🚀 Final Takeaways
✔ The results are consistent with the Central Limit Theorem: the sampling distribution means approximate the population means.
✔ Average spending is somewhat higher for males (about 9,367 vs. 8,671 for females), but the difference is not very large.
✔ Marital status has little apparent impact on purchase behavior, as seen from the overlapping confidence intervals.
✔ The overall spending distribution is close to normal, making sampling-based estimations reliable.