End To End Machine Learning Problem

Table of contents

1. Introduction
2. Defining the problem statement
3. Imports
4. EDA
5. K-Means Clustering summary
6. Model Interpretation
7. Benefits of customer segmentation
8. Saving the kmeans clustering model and the data with cluster labels

1. Introduction

Market segmentation is the activity of dividing a broad consumer or business market, normally consisting of existing and potential customers, into sub-groups of consumers based on some type of shared characteristics.

Companies employing customer segmentation operate under the premise that every customer is different and that their marketing efforts would be better served if they targeted specific, smaller groups with messages those consumers would find relevant and that would lead them to buy something. Companies also hope to gain a deeper understanding of their customers' preferences and needs, with the idea of discovering what each segment finds most valuable so they can more accurately tailor marketing materials toward that segment.

Malls and shopping complexes are often involved in a race to increase their customer base and, in turn, their profits. Machine learning can be applied to this task: shopping complexes make use of their customer data and develop ML models to target the right customers.

2. Defining the problem statement

You own the mall and want to understand which customers can easily become targets, so that this insight can be given to the marketing team and the strategy planned accordingly.

3. Imports

    import numpy as np                                     # For mathematical calculations
    import pandas as pd                                    # For data handling (needed for read_csv below)
    import seaborn as sns                                  # For data visualisation
    import matplotlib.pyplot as plt                        # For plotting
    from sklearn.tree import DecisionTreeClassifier        # For building decision tree models
    from sklearn.model_selection import train_test_split   # For splitting datasets into training and testing subsets
    from sklearn.metrics import classification_report      # Provides a summary of various classification metrics
    from sklearn import tree                               # Provides various classes and functions for working with decision trees
    from sklearn import metrics                            # Provides various metrics and evaluation functions for assessing models
    import warnings
    warnings.filterwarnings("ignore")                      # Filters and ignores warnings related to potential future changes

    df = pd.read_csv("Mall_Customers.csv")

    # Show first 5 rows
    df.head()

       CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
    0           1    Male   19                  15                      39
    1           2    Male   21                  15                      81
    2           3  Female   20                  16                       6
    3           4  Female   23                  16                      77
    4           5  Female   31                  17                      40

    df.shape
    (200, 5)

    # Check missing values
    df.isnull().sum()

    CustomerID                0
    Gender                    0
    Age                       0
    Annual Income (k$)        0
    Spending Score (1-100)    0
    dtype: int64

    # Show info about the dataframe
    df.info()

(output: 200 non-null entries in each of the 5 columns; Gender is the only non-numeric column)
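The code for sections 4 (EDA) and 5 (K-Means Clustering summary) does not survive in this copy, but the later sections rely on a fitted model named kmeansmodel and on df no longer containing the CustomerID column. Below is a minimal sketch of what that missing fitting step typically looks like, not the notebook's actual code: the two clustering features, n_clusters=5 (matching the five segments interpreted in section 6), and the random_state are assumptions.

    # Minimal sketch of the missing K-Means step (assumptions noted below).
    from sklearn.cluster import KMeans

    # Later outputs show a 4-column df, so CustomerID is presumably dropped here.
    df = df.drop("CustomerID", axis=1)

    # Assumed clustering features: the two used in the cluster interpretation.
    X_clust = df[["Annual Income (k$)", "Spending Score (1-100)"]]

    # n_clusters=5 matches the five segments in section 6; init, n_init and
    # random_state are assumptions for reproducibility.
    kmeansmodel = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0)
    kmeansmodel.fit(X_clust)

In practice the number of clusters would be chosen with an elbow plot or silhouette scores during EDA; five is taken here because that is the number of segments the notebook goes on to interpret.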
6. Model Interpretation

The K-Means model groups the customers into five segments:

* Cluster 1 -> earning high but spending less. These are people who are unsatisfied or unhappy with the mall's services, yet they have the potential to spend money, so they can be prime targets for the mall. The mall authorities should try to add new facilities to attract these people and meet their needs.
* Cluster 2 (Blue Colour) -> average in terms of earning and spending. These people will not be the prime targets of the shops or mall, but they should still be considered, and other data analysis techniques may be used to increase their spending score.
* Cluster 3 (Green Colour) -> earning high and also spending high. These people are the prime sources of profit. They might be the regular customers of the mall and are convinced by the mall's facilities.
* Cluster 4 (Orange Colour) -> earning less but spending more. These are people who, for some reason, love to buy products frequently even though they have a low income; maybe they are more than satisfied with the mall's services. The shops/mall might not target these people that effectively, but still will not want to lose them.
* Cluster 5 (Pink Colour) -> earning less, spending less. This is quite reasonable, as people with low salaries prefer to buy less; in fact, these are the wise people who know how to spend and save money. The shops/mall will be least interested in people belonging to this cluster.

Segmenting the customers in this way lets the mall identify its target market and plan the marketing strategy accordingly.

7. Benefits of customer segmentation

* It enables companies to target specific groups of customers.
* When a group of customers is sent personalized messages as part of a marketing mix designed around their needs, it is easier for companies to send those customers special offers meant to encourage them to buy more products.
* Furthermore, such personalized messages tend to be more valued and appreciated by the customers who receive them, as opposed to impersonal brand messaging that doesn't acknowledge purchase history or any kind of customer relationship.
* Customer segmentation can also improve customer service and assist in customer loyalty and retention.
* It helps in staying a step ahead of competitors in specific sections of the market, and in identifying new products that existing or potential customers could be interested in, or improving products to meet customer expectations.

8. Saving the kmeans clustering model and the data with cluster labels

    import joblib
    joblib.dump(kmeansmodel, "kmeansmodel.pkl")

    ['kmeansmodel.pkl']

Converting the clustering problem into a 5-class classification problem: the goal is to classify new data points into these pre-defined cluster segments.

    # Creating a target column "Cluster" for storing the cluster segments
    cluster_df = pd.concat([df, pd.DataFrame({'Cluster': kmeansmodel.labels_})], axis=1)
    cluster_df

(output: cluster_df — 200 rows × 5 columns: Gender, Age, Annual Income (k$), Spending Score (1-100), Cluster)

    # sklearn preprocessing for dealing with categorical variables
    from sklearn.preprocessing import LabelEncoder

    # Create a label encoder object
    le = LabelEncoder()
    le_count = 0

    for col in cluster_df:
        if cluster_df[col].dtype == 'object':
            # If 2 or fewer unique categories
            if len(list(cluster_df[col].unique())) <= 2:
                le.fit(cluster_df[col])
                # Transform the data
                cluster_df[col] = le.transform(cluster_df[col])
                # Keep track of how many columns were label encoded
                le_count += 1

    print("%d columns were label encoded." % le_count)

    1 columns were label encoded.

    cluster_df.head()

(output: the first 5 rows, with Gender now encoded as 0 = Female, 1 = Male)

    # Saving clustered customer data for the Streamlit app
    cluster_df.to_csv("Clustered_Mall_Customers.csv")
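Since both artifacts are saved for reuse (e.g. in the Streamlit app mentioned above), here is a brief sketch, not part of the original notebook, of how they could be loaded back; the file names match those used above, and the two-feature input assumes the clustering features from the earlier sketch.

    # Sketch: reloading the saved artifacts in a separate script or app.
    import joblib
    import pandas as pd

    kmeansmodel = joblib.load("kmeansmodel.pkl")                 # fitted clustering model
    clustered_df = pd.read_csv("Clustered_Mall_Customers.csv",
                               index_col=0)                      # data with cluster labels

    # The loaded model can assign a segment to a new customer directly;
    # the income/spending values here are a made-up example point.
    new_point = pd.DataFrame({"Annual Income (k$)": [60],
                              "Spending Score (1-100)": [50]})
    print(kmeansmodel.predict(new_point))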
Using a Decision Tree to train and test the model

    df = pd.DataFrame(cluster_df)

    # Drop the "Cluster" column from the DataFrame
    X = df.drop(["Cluster"], axis=1)
    y = df[["Cluster"]]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

* The separation of X and y is done when preparing data for machine learning tasks, specifically for supervised learning algorithms.
* The reason for dropping the 'Cluster' column and separating it into 'X' and 'y' is to create separate datasets for the independent variables (X) and the target variable (y).
* Independent variables (X) will be used to make predictions or classify data points. These variables have predictive power.
* Target variable (y) is the variable I want to predict or classify. It contains the cluster labels. By assigning the 'Cluster' column to y, I am creating a separate variable that holds the cluster labels for each data point.

* train_test_split: This is a function from the sklearn.model_selection module. It is used to randomly split the dataset into two subsets, the training set and the testing set. It shuffles the data to ensure randomization.
* X and y: These variables represent the input features (X) and the target variable (y). They are the datasets obtained after dropping the target column and separating it from the input features, as explained earlier.
* test_size: This parameter specifies the proportion of the dataset that will be allocated to the testing set. In this case, it is set to 0.3, which means that 30% of the data will be used for testing and the remaining 70% for training.
* X_train, X_test, y_train, y_test: These variables store the resulting datasets after the split. X_train and y_train represent the training data, while X_test and y_test represent the testing data.

    # Training data
    X_train

(output: previews of X_train and X_test — 140 training rows and 60 testing rows across the four feature columns — and the corresponding y_train and y_test cluster labels)

I chose a decision tree because it is not affected by feature scaling.

    # Creating a DecisionTreeClassifier object with the entropy criterion
    model = DecisionTreeClassifier(criterion="entropy")

    # Training the model on the training data
    model.fit(X_train, y_train)

    # Making predictions on the testing data
    y_pred = model.predict(X_test)

    # Confusion Matrix
    print(metrics.confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

(output: the 5×5 confusion matrix, with correct predictions on the diagonal, followed by the classification report: per-class precision, recall, F1-score and support for the 60 test samples, with an overall accuracy of 0.93)

Explanation of confusion matrix results

1. Confusion Matrix:
   * It is a table that allows us to visualize the performance of the algorithm.
   * The confusion matrix above shows the counts of true positive predictions on the diagonal and the counts of false positive predictions in the off-diagonal elements. Each row represents the true labels, and each column represents the predicted labels.
   * For example, the value 7 in the top-left corner of the confusion matrix indicates that 7 instances of class 0 were correctly predicted as class 0.
   * The other values in the confusion matrix represent similar information for the respective classes.
2. Precision:
   * Precision is a measure of the model's ability to correctly predict the positive instances out of the total predicted positive instances.
   * The precision values range from 0 to 1, where a higher value indicates better performance.
   * For example, for class 0, the precision is 0.78, meaning that out of all the instances predicted as class 0, 78% were correct.
3. Recall:
   * Recall (also known as sensitivity or true positive rate) measures the model's ability to correctly identify the positive instances out of the total actual positive instances.
   * The recall values range from 0 to 1, where a higher value indicates better performance.
   * For example, for class 2, the recall is 0.91, indicating that the model identified 91% of the actual instances belonging to class 2.
4. F1-Score:
   * The F1-score is the harmonic mean of precision and recall and provides a balanced measure of a model's performance.
   * The F1-score values range from 0 to 1, where a higher value indicates better performance.
   * For example, for class 4, the F1-score is 1.0, indicating a perfect balance between precision and recall for that class.
5. Support:
   * Support represents the number of instances of each class in the actual dataset.
   * For example, there were 7 instances of class 0, 26 instances of class 1, and so on.
6. Accuracy:
   * Accuracy represents the overall performance of the model and is calculated as the ratio of correct predictions to the total number of predictions.
   * In this case, the model achieved an accuracy of 0.93, indicating that it correctly predicted 93% of the instances in the dataset.
7. Macro Avg and Weighted Avg:
   * Macro average calculates the average performance metrics (precision, recall, and F1-score) across all classes, giving equal weight to each class.
   * Weighted average calculates the average performance metrics, but weights each class by its support (number of instances).
   * In this case, both the macro average and the weighted average have similar values, indicating consistent performance across classes.
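To make the precision, recall and F1 definitions above concrete, here is a small self-contained check on a toy binary problem; the labels are made up for illustration and are unrelated to the mall dataset or the report above.

    # Worked example of the metrics on made-up binary labels.
    from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

    y_true = [0, 0, 0, 1, 1, 1, 1, 1]
    y_pred = [0, 0, 1, 1, 1, 1, 0, 1]

    print(confusion_matrix(y_true, y_pred))
    # [[2 1]    <- rows are true labels, columns are predicted labels
    #  [1 4]]   <- 4 true positives for class 1, 1 false negative

    # Precision for class 1 = TP / (TP + FP) = 4 / (4 + 1) = 0.8
    print(precision_score(y_true, y_pred))   # 0.8
    # Recall for class 1 = TP / (TP + FN) = 4 / (4 + 1) = 0.8
    print(recall_score(y_true, y_pred))      # 0.8
    # F1 = harmonic mean of precision and recall = 2*0.8*0.8 / (0.8 + 0.8) = 0.8
    print(f1_score(y_true, y_pred))          # 0.8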
Saving the Decision tree model for future prediction

    import pickle
    filename = 'final_model.sav'
    pickle.dump(model, open(filename, 'wb'))

    # Load the model from disk
    loaded_model = pickle.load(open(filename, 'rb'))
    result = loaded_model.score(X_test, y_test)
    print(result, 'Accuracy')

    0.9333333333333333 Accuracy

* An accuracy of 0.9333333333333333 (or 93.33%) means that the model correctly predicted the outcome or class for approximately 93.33% of the samples in the evaluation set. In classification tasks, accuracy is a commonly used metric to assess the performance of a model.
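As a closing illustration, the saved decision tree can classify a brand-new customer without re-running K-Means. This is a sketch, not part of the original notebook: the feature order and the encoded Gender value follow the label-encoded cluster_df above, and the customer values are made up.

    # Sketch: assigning a segment to a new customer with the saved tree.
    import pickle
    import pandas as pd

    loaded_model = pickle.load(open('final_model.sav', 'rb'))

    # Feature order matches the label-encoded cluster_df:
    # Gender (0 = Female, 1 = Male), Age, Annual Income (k$), Spending Score (1-100).
    new_customer = pd.DataFrame(
        [[1, 30, 70, 80]],  # hypothetical example customer
        columns=['Gender', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)']
    )
    print(loaded_model.predict(new_customer))  # predicted cluster label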
