ML Practical 4D

Assignment No 04

Name: Bhosale Samir Shamkant Roll no: CO407 Class: BE COMP

Title: Implement K-Means clustering/hierarchical clustering on the sales_data_sample.csv dataset. Determine the number of clusters using the elbow method.

Importing libraries
In [198… import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [199… from sklearn.cluster import KMeans, k_means #For clustering

from sklearn.decomposition import PCA #Linear Dimensionality reduction.

Importing the dataset

In [200… df = pd.read_csv("sales_data_sample.csv")

Preprocessing
In [201… df.head()

Out[201]:
   ORDERNUMBER  QUANTITYORDERED  PRICEEACH  ORDERLINENUMBER    SALES        ORDERDATE   STATUS  QTR_ID
0        10107               30      95.70                2  2871.00   2/24/2003 0:00  Shipped       1
1        10121               34      81.35                5  2765.90    5/7/2003 0:00  Shipped       2
2        10134               41      94.74                2  3884.34    7/1/2003 0:00  Shipped       3
3        10145               45      83.26                6  3746.70   8/25/2003 0:00  Shipped       3
4        10159               49     100.00               14  5205.27  10/10/2003 0:00  Shipped       4

5 rows × 25 columns

In [202… df.shape

Out[202]: (2823, 25)

In [203… df.describe()

Out[203]:
        ORDERNUMBER  QUANTITYORDERED    PRICEEACH  ORDERLINENUMBER         SALES       QTR_ID    MONT…
count   2823.000000      2823.000000  2823.000000      2823.000000   2823.000000  2823.000000  2823.00
mean   10258.725115        35.092809    83.658544         6.466171   3553.889072     2.717676      7.0
std       92.085478         9.741443    20.174277         4.225841   1841.865106     1.203878     3.65
min    10100.000000         6.000000    26.880000         1.000000    482.130000     1.000000     1.00
25%    10180.000000        27.000000    68.860000         3.000000   2203.430000     2.000000     4.00
50%    10262.000000        35.000000    95.700000         6.000000   3184.800000     3.000000     8.00
75%    10333.500000        43.000000   100.000000         9.000000   4508.000000     4.000000    11.00
max    10425.000000        97.000000   100.000000        18.000000  14082.800000     4.000000    12.00

In [204… df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Data columns (total 25 columns):
# Column Non-Null Count Dtype
0 ORDERNUMBER 2823 non-null int64
1 QUANTITYORDERED 2823 non-null int64
2 PRICEEACH 2823 non-null float64
3 ORDERLINENUMBER 2823 non-null int64
4 SALES 2823 non-null float64
5 ORDERDATE 2823 non-null object
6 STATUS 2823 non-null object
7 QTR_ID 2823 non-null int64
8 MONTH_ID 2823 non-null int64
9 YEAR_ID 2823 non-null int64
10 PRODUCTLINE 2823 non-null object
11 MSRP 2823 non-null int64
12 PRODUCTCODE 2823 non-null object
13 CUSTOMERNAME 2823 non-null object
14 PHONE 2823 non-null object
15 ADDRESSLINE1 2823 non-null object
16 ADDRESSLINE2 302 non-null object
17 CITY 2823 non-null object
18 STATE 1337 non-null object
19 POSTALCODE 2747 non-null object
20 COUNTRY 2823 non-null object
21 TERRITORY 1749 non-null object
22 CONTACTLASTNAME 2823 non-null object
23 CONTACTFIRSTNAME 2823 non-null object
24 DEALSIZE 2823 non-null object
dtypes: float64(2), int64(7), object(16)
memory usage: 551.5+ KB
In [205… df.isnull().sum()
Out[205]: ORDERNUMBER 0
QUANTITYORDERED 0
PRICEEACH 0
ORDERLINENUMBER 0
SALES 0
ORDERDATE 0
STATUS 0
QTR_ID 0
MONTH_ID 0
YEAR_ID 0
PRODUCTLINE 0
MSRP 0
PRODUCTCODE 0
CUSTOMERNAME 0
PHONE 0
ADDRESSLINE1 0
ADDRESSLINE2 2521
CITY 0
STATE 1486
POSTALCODE 76
COUNTRY 0
TERRITORY 1074
CONTACTLASTNAME 0
CONTACTFIRSTNAME 0
DEALSIZE 0
dtype: int64

In [206… df.dtypes

Out[206]: ORDERNUMBER int64
QUANTITYORDERED int64
PRICEEACH float64
ORDERLINENUMBER int64
SALES float64
ORDERDATE object
STATUS object
QTR_ID int64
MONTH_ID int64
YEAR_ID int64
PRODUCTLINE object
MSRP int64
PRODUCTCODE object
CUSTOMERNAME object
PHONE object
ADDRESSLINE1 object
ADDRESSLINE2 object
CITY object
STATE object
POSTALCODE object
COUNTRY object
TERRITORY object
CONTACTLASTNAME object
CONTACTFIRSTNAME object
DEALSIZE object
dtype: object
In [207… df_drop = ['ADDRESSLINE1', 'ADDRESSLINE2', 'STATUS','POSTALCODE', 'CITY', 'TERRITORY',
df = df.drop(df_drop, axis=1) #Dropping the categorical unnecessary columns along with c
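
The `df_drop` list is cut off in this export, so the exact list the author typed cannot be read here. The set of dropped columns can still be recovered by comparing the 25 columns listed by `df.info()` with the 13 columns that remain in the `df.isnull().sum()` output after the drop. The sketch below does only that comparison; the variable names `before`, `after` and `dropped` are illustrative, and the order of the result follows `df.info()`, not necessarily the author's original list.

```python
# Column names as listed by df.info() before the drop (25 columns).
before = [
    "ORDERNUMBER", "QUANTITYORDERED", "PRICEEACH", "ORDERLINENUMBER", "SALES",
    "ORDERDATE", "STATUS", "QTR_ID", "MONTH_ID", "YEAR_ID", "PRODUCTLINE",
    "MSRP", "PRODUCTCODE", "CUSTOMERNAME", "PHONE", "ADDRESSLINE1",
    "ADDRESSLINE2", "CITY", "STATE", "POSTALCODE", "COUNTRY", "TERRITORY",
    "CONTACTLASTNAME", "CONTACTFIRSTNAME", "DEALSIZE",
]

# Column names as listed by df.isnull().sum() after the drop (13 columns).
after = [
    "QUANTITYORDERED", "PRICEEACH", "ORDERLINENUMBER", "SALES", "ORDERDATE",
    "QTR_ID", "MONTH_ID", "YEAR_ID", "PRODUCTLINE", "MSRP", "PRODUCTCODE",
    "COUNTRY", "DEALSIZE",
]

# Every column present before but missing afterwards must have been dropped.
dropped = [name for name in before if name not in after]
print(len(dropped))
print(dropped)
```

Under these assumptions, passing `dropped` to `df.drop(columns=dropped)` on the freshly loaded frame should leave the same 13 columns that the notebook shows.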

In [208… df.isnull().sum()

Out[208]: QUANTITYORDERED 0
PRICEEACH 0
ORDERLINENUMBER 0
SALES 0
ORDERDATE 0
QTR_ID 0
MONTH_ID 0
YEAR_ID 0
PRODUCTLINE 0
MSRP 0
PRODUCTCODE 0
COUNTRY 0
DEALSIZE 0
dtype: int64
In [209… df.dtypes
Out[209]: QUANTITYORDERED int64
PRICEEACH float64
ORDERLINENUMBER int64
SALES float64
ORDERDATE object
QTR_ID int64
MONTH_ID int64
YEAR_ID int64
PRODUCTLINE object
MSRP int64
PRODUCTCODE object
COUNTRY object
DEALSIZE object
dtype: object
In [ ]:
# Checking the categorical columns.

In [210… df['COUNTRY'].unique()

Out[210]: array(['USA', 'France', 'Norway', 'Australia', 'Finland', 'Austria', 'UK',
'Spain', 'Sweden', 'Singapore', 'Canada', 'Japan', 'Italy', 'Denmark', 'Belgium', 'Philippines',
'Germany', 'Switzerland', 'Ireland'], dtype=object)

In [211… df['PRODUCTLINE'].unique()

Out[211]: array(['Motorcycles', 'Classic Cars', 'Trucks and Buses', 'Vintage Cars', 'Planes', 'Ships', 'Trains'],
dtype=object)

In [212… df['DEALSIZE'].unique()

Out[212]: array(['Small', 'Medium', 'Large'], dtype=object)

In [213… productline = pd.get_dummies(df['PRODUCTLINE']) #Converting the categorical columns.

Dealsize = pd.get_dummies(df['DEALSIZE'])

In [214… df = pd.concat([df,productline,Dealsize], axis = 1)

In [215… df_drop = ['COUNTRY','PRODUCTLINE','DEALSIZE'] #Dropping Country too as there are a lot

df = df.drop(df_drop, axis=1)

In [216… df['PRODUCTCODE'] = pd.Categorical(df['PRODUCTCODE']).codes #Converting the datatype.
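
The two encodings used above behave differently: `pd.get_dummies` creates one indicator column per category, while `pd.Categorical(...).codes` replaces each value with a single integer index. The dependency-free sketch below imitates both on a few values; the helper names `one_hot` and `category_codes` are made up for illustration, the product codes `"P1"` and `"P2"` are hypothetical, and sorting the categories is an assumption chosen to mirror the alphabetical column order (Large, Medium, Small) seen in the notebook's output.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 indicator per sorted category."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows


def category_codes(values):
    """Replace each value by the position of its category in sorted order."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[value] for value in values]


# A few DEALSIZE values (the three categories reported by unique()).
deal_sizes = ["Small", "Medium", "Large", "Small"]
cats, rows = one_hot(deal_sizes)
print(cats)   # ['Large', 'Medium', 'Small']
print(rows)   # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

# Hypothetical product codes, used only to show the integer encoding.
print(category_codes(["P2", "P1", "P2"]))   # [1, 0, 1]
```

One design point worth keeping in mind: integer codes impose an arbitrary ordering and spacing on the categories, which a distance-based method such as K-Means will treat as meaningful, whereas indicator columns do not.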

In [217… df.drop('ORDERDATE', axis=1, inplace=True) #Dropping the Orderdate as Month is already i

In [218… df.dtypes #All the datatypes are converted into numeric

Out[218]: QUANTITYORDERED int64
PRICEEACH float64
ORDERLINENUMBER int64
SALES float64
QTR_ID int64
MONTH_ID int64
YEAR_ID int64
MSRP int64
PRODUCTCODE int8
Classic Cars uint8
Motorcycles uint8
Planes uint8
Ships uint8
Trains uint8
Trucks and Buses uint8
Vintage Cars uint8
Large uint8
Medium uint8
Small uint8
dtype: object

Plotting the Elbow Plot to determine the number of clusters.

In [219… distortions = [] # Within-cluster sum of squares (inertia) for each k
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(df)
    distortions.append(kmeanModel.inertia_) # Appending the inertia to the distortions list
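
The quantity collected in `distortions` is the inertia: the sum, over all samples, of the squared distance to the closest cluster centre. A minimal pure-Python sketch of that definition follows; the function name `inertia` and the four toy points are illustrative and are not taken from the sales data.

```python
def inertia(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest centre."""
    total = 0.0
    for point in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(point, center))
            for center in centers
        )
    return total


# Toy data: two tight pairs of points, far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

print(inertia(points, [(5.0, 1.0)]))               # 104.0 with a single centre
print(inertia(points, [(0.0, 1.0), (10.0, 1.0)]))  # 4.0 with one centre per pair
```

Adding a well-placed centre can only reduce this sum, which is why the curve in the elbow plot falls as k grows.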

In [220… plt.figure(figsize=(16,8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

As k increases, the inertia decreases.

Observations: An elbow can be observed at k = 3; beyond that point the curve decreases only gradually.
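
Reading the elbow off a plot is a visual judgment. One common heuristic, sketched here as an assumption and not as the method used in this notebook, picks the k at which the curve bends most sharply, measured by the largest second difference of the inertia values. The function name `elbow_k` is hypothetical, the k values are assumed to be equally spaced, and the numbers in the example are made up purely to exercise the function; they are not results from the sales dataset.

```python
def elbow_k(ks, inertias):
    """Return the k where the inertia curve bends most (largest second difference)."""
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need at least three (k, inertia) pairs")
    best_k, best_bend = None, float("-inf")
    # The first and last k have no neighbours on both sides, so skip them.
    for i in range(1, len(ks) - 1):
        drop_before = inertias[i - 1] - inertias[i]
        drop_after = inertias[i] - inertias[i + 1]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Made-up inertia values, for illustration only.
ks = [1, 2, 3, 4, 5]
example_inertias = [1000.0, 600.0, 200.0, 180.0, 170.0]
print(elbow_k(ks, example_inertias))  # 3
```

With the notebook's variables, a call such as `elbow_k(list(K), distortions)` would apply the same rule to the measured curve; the heuristic can disagree with a visual reading when the curve has no sharp bend.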
In [221… X_train = df.values #Returns a numpy array.

In [222… X_train.shape

Out[222]: (2823, 19)

In [223… model = KMeans(n_clusters=3,random_state=2) #Number of cluster = 3
model = model.fit(X_train) #Fitting the values to create a model.
predictions = model.predict(X_train) #Predicting the cluster values (0,1,or 2)
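
For readers who want to see what `fit` and `predict` do conceptually, the following is a minimal sketch of Lloyd's algorithm, the standard K-Means iteration: assign each point to its nearest centre, move each centre to the mean of its points, and repeat until nothing changes. It is a simplified illustration, not scikit-learn's implementation (which, among other things, uses smarter initialisation). The function names and the toy points are assumptions made for the example.

```python
def assign(points, centers):
    """Label each point with the index of its nearest centre."""
    labels = []
    for point in points:
        dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centers):
    """Move each centre to the mean of its assigned points (keep it if empty)."""
    new_centers = []
    for j, center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centers.append(center)
            continue
        dims = range(len(center))
        new_centers.append(tuple(sum(m[d] for m in members) / len(members) for d in dims))
    return new_centers


def kmeans(points, centers, max_iter=100):
    """Alternate assignment and update steps until the centres stop moving."""
    labels = assign(points, centers)
    for _ in range(max_iter):
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
        labels = assign(points, centers)
    return centers, labels


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers, labels = kmeans(points, [(0.0, 0.0), (10.0, 0.0)])
print(centers)  # [(0.0, 1.0), (10.0, 1.0)]
print(labels)   # [0, 0, 1, 1]
```

The result depends on the starting centres, which is why the notebook fixes `random_state` to make its clustering reproducible.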

In [225… unique,counts = np.unique(predictions,return_counts=True)

In [226… counts = counts.reshape(1,3)

In [227… counts_df = pd.DataFrame(counts,columns=['Cluster1','Cluster2','Cluster3'])

In [228… counts_df.head()

Out[228]:
   Cluster1  Cluster2  Cluster3
0      1083      1367       373
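
The `reshape(1,3)` call above hard-codes the number of clusters. As an alternative sketch, assuming only that the labels are available as a sequence of integers, `collections.Counter` from the standard library counts cluster sizes for any k. The helper name `cluster_sizes` is illustrative; the five labels used here are the ones shown for the first five rows in `Out[238]`.

```python
from collections import Counter


def cluster_sizes(labels):
    """Return {cluster label: number of points}, ordered by label."""
    counts = Counter(labels)
    return {label: counts[label] for label in sorted(counts)}


# Cluster labels of the first five rows, as reported in Out[238].
first_five = [1, 1, 0, 0, 0]
print(cluster_sizes(first_five))  # {0: 3, 1: 2}
```

Applied to the full `predictions` array (for example via `predictions.tolist()`), the same helper should reproduce the three counts in the table above.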

Visualization

In [229… pca = PCA(n_components=2) #Converting all the features into 2 columns to make it easy to

In [230… reduced_X = pd.DataFrame(pca.fit_transform(X_train),columns=['PCA1','PCA2']) #Creating a

In [231… reduced_X.head()

Out[231]:
          PCA1       PCA2
0  -682.488323 -42.819535
1  -787.665502 -41.694991
2   330.732170 -26.481208
3   193.040232 -26.285766
4  1651.532874  -6.891196
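
Conceptually, the projection computed by `PCA(n_components=2)` centres the data and projects it onto the two directions of largest variance. The NumPy sketch below shows that idea with a singular value decomposition. It is an approximation of the concept, not scikit-learn's code: the signs of the components may differ from what `PCA` returns, the helper names `pca_fit` and `pca_transform` are hypothetical, and the small matrix `X` is toy data.

```python
import numpy as np


def pca_fit(X, n_components=2):
    """Return the column means and the top principal directions (as rows)."""
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(X, mean, components):
    """Centre X with the fitted mean and project it onto the components."""
    return (X - mean) @ components.T


# Toy data: most of the spread lies along the first column.
X = np.array([
    [0.0, 0.0, 1.0],
    [2.0, 0.1, 1.0],
    [4.0, -0.1, 1.0],
    [6.0, 0.0, 1.0],
])

mean, components = pca_fit(X)
Z = pca_transform(X, mean, components)
print(Z.shape)  # (4, 2)
```

The same `pca_transform` call, applied to cluster centres with the mean and components fitted on the data, corresponds to what `pca.transform(model.cluster_centers_)` does later in the notebook: the centres are mapped into the same two-dimensional space as the points.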

In [232… #Plotting the normal Scatter Plot

plt.figure(figsize=(14,10))
plt.scatter(reduced_X['PCA1'],reduced_X['PCA2'])

Out[232]: <matplotlib.collections.PathCollection at 0x218dc747880>

In [233… model.cluster_centers_ #Finding the centroids. (3 centroids in total. Each array contain

Out[233]: array([[ 3.72031394e+01, 9.52120960e+01, 6.44967682e+00,

4.13868425e+03, 2.72022161e+00, 7.09879963e+00,
2.00379409e+03, 1.13248384e+02, 5.04469067e+01,
3.74884580e-01, 1.15420129e-01, 9.41828255e-02,
8.21791320e-02, 1.84672207e-02, 1.16343490e-01,
1.98522622e-01, 2.08166817e-17, 1.00000000e+00,
-6.66133815e-16],
[ 3.08302853e+01, 7.00755230e+01, 6.67300658e+00,
2.12409474e+03, 2.71762985e+00, 7.09509876e+00,
2.00381127e+03, 7.84784199e+01, 6.24871982e+01,
2.64813460e-01, 1.21433797e-01, 1.29480614e-01,
1.00219459e-01, 3.87710315e-02, 9.21726408e-02,
2.53108998e-01, 6.93889390e-18, 6.21799561e-02,
9.37820044e-01],
[ 4.45871314e+01, 9.98931099e+01, 5.75603217e+00,
7.09596863e+03, 2.71045576e+00, 7.06434316e+00,
2.00389008e+03, 1.45823056e+02, 3.14959786e+01,
5.33512064e-01, 1.07238606e-01, 7.23860590e-02,
2.14477212e-02, 1.07238606e-02, 1.31367292e-01,
1.23324397e-01, 4.20911528e-01, 5.79088472e-01,
5.55111512e-17]])
In [234… reduced_centers = pca.transform(model.cluster_centers_) #Transforming the centroids into

In [235… reduced_centers

Out[235]: array([[ 5.84994044e+02, -4.36786931e+00],
[-1.43005891e+03, 2.60041009e+00],
[ 3.54247180e+03, 3.15185487e+00]])

In [236… plt.figure(figsize=(14,10))
plt.scatter(reduced_X['PCA1'],reduced_X['PCA2'])
plt.scatter(reduced_centers[:,0],reduced_centers[:,1],color='black',marker='x',s=300) #P

Out[236]: <matplotlib.collections.PathCollection at 0x218deb6e220>

In [237… reduced_X['Clusters'] = predictions #Adding the Clusters to the reduced dataframe.

In [238… reduced_X.head()

Out[238]: PCA1 PCA2 Clusters

0 -682.488323 -42.819535 1

1 -787.665502 -41.694991 1

2 330.732170 -26.481208 0

3 193.040232 -26.285766 0

4 1651.532874 -6.891196 0

In [239… #Plotting the clusters

plt.figure(figsize=(14,10))
# taking the cluster number and first column taking the sa
plt.scatter(reduced_X[reduced_X['Clusters'] == 0].loc[:,'PCA1'],reduced_X[reduced_X['Clu
plt.scatter(reduced_X[reduced_X['Clusters'] == 1].loc[:,'PCA1'],reduced_X[reduced_X['Clu
plt.scatter(reduced_X[reduced_X['Clusters'] == 2].loc[:,'PCA1'],reduced_X[reduced_X['Clu

plt.scatter(reduced_centers[:,0],reduced_centers[:,1],color='black',marker='x',s=300)

Out[239]: <matplotlib.collections.PathCollection at 0x218dce9e1f0>

In [ ]:
