End To End Machine Learning Problem

Table of contents

1. Introduction
2. Defining the problem statement
3. Imports
4. EDA
5. K-Means Clustering summary
6. Model Interpretation
7. Benefits of customer segmentation
8. Saving the kmeans clustering model and the data with cluster labels

1. Introduction

Market segmentation is the activity of dividing a broad consumer or business market, normally consisting of existing and potential customers, into sub-groups of consumers based on some type of shared characteristics.

Companies employing customer segmentation operate under the premise that every customer is different and that their marketing efforts would be better served if they targeted specific, smaller groups with messages those consumers would find relevant and that would lead them to buy something. Companies also hope to gain a deeper understanding of their customers' preferences and needs, with the idea of discovering what each segment finds most valuable so they can more accurately tailor marketing materials toward that segment.

Malls and shopping complexes are often involved in a race to increase their customer base and, in turn, their profits. Machine learning can be applied to this task: shopping complexes make use of their customer data and develop ML models to target the right customers.

2. Defining the problem statement

You own the mall and want to understand which customers can easily become targets, so that this insight can be given to the marketing team and the strategy planned accordingly.

3. Imports

    import numpy as np                                     # For mathematical calculations
    import pandas as pd                                    # For data handling (needed for read_csv below)
    import seaborn as sns                                  # For data visualisation
    import matplotlib.pyplot as plt                        # For plotting
    from sklearn.tree import DecisionTreeClassifier        # For building decision tree models
    from sklearn.model_selection import train_test_split   # For splitting datasets into training and testing subsets
    from sklearn.metrics import classification_report      # Provides a summary of various classification metrics
    from sklearn import tree                               # Provides various classes and functions for working with decision trees
    from sklearn import metrics                            # Provides various metrics and evaluation functions for assessing models
    import warnings
    warnings.filterwarnings("ignore")                      # Filters and ignores warnings related to potential future changes

    df = pd.read_csv("Mall_Customers.csv")

    # Show first 5 rows
    df.head()

       CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
    0           1    Male   19                  15                      39
    1           2    Male   21                  15                      81
    2           3  Female   20                  16                       6
    3           4  Female   23                  16                      77
    4           5  Female   31                  17                      40

    df.shape
    (200, 5)

    # Check missing values
    df.isnull().sum()

    CustomerID                0
    Gender                    0
    Age                       0
    Annual Income (k$)        0
    Spending Score (1-100)    0
    dtype: int64

    # Show info about the dataframe
    df.info()

(output: 200 non-null entries in each of the 5 columns; Gender is the only non-numeric column)
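The code for sections 4 (EDA) and 5 (K-Means Clustering summary) does not survive in this copy, but the later sections rely on a fitted model named kmeansmodel and on df no longer containing the CustomerID column. Below is a minimal sketch of what that missing fitting step typically looks like, not the notebook's actual code: the two clustering features, n_clusters=5 (matching the five segments interpreted in section 6), and the random_state are assumptions.

    # Minimal sketch of the missing K-Means step (assumptions noted below).
    from sklearn.cluster import KMeans

    # Later outputs show a 4-column df, so CustomerID is presumably dropped here.
    df = df.drop("CustomerID", axis=1)

    # Assumed clustering features: the two used in the cluster interpretation.
    X_clust = df[["Annual Income (k$)", "Spending Score (1-100)"]]

    # n_clusters=5 matches the five segments in section 6; init, n_init and
    # random_state are assumptions for reproducibility.
    kmeansmodel = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0)
    kmeansmodel.fit(X_clust)

In practice the number of clusters would be chosen with an elbow plot or silhouette scores during EDA; five is taken here because that is the number of segments the notebook goes on to interpret.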
6. Model Interpretation

The K-Means model groups the customers into five segments:

* Cluster 1 -> earning high but spending less. These are people who are unsatisfied or unhappy with the mall's services, yet they have the potential to spend money, so they can be prime targets for the mall. The mall authorities should try to add new facilities to attract these people and meet their needs.
* Cluster 2 (Blue Colour) -> average in terms of earning and spending. These people will not be the prime targets of the shops or mall, but they should still be considered, and other data analysis techniques may be used to increase their spending score.
* Cluster 3 (Green Colour) -> earning high and also spending high. These people are the prime sources of profit. They might be the regular customers of the mall and are convinced by the mall's facilities.
* Cluster 4 (Orange Colour) -> earning less but spending more. These are people who, for some reason, love to buy products frequently even though they have a low income; maybe they are more than satisfied with the mall's services. The shops/mall might not target these people that effectively, but still will not want to lose them.
* Cluster 5 (Pink Colour) -> earning less, spending less. This is quite reasonable, as people with low salaries prefer to buy less; in fact, these are the wise people who know how to spend and save money. The shops/mall will be least interested in people belonging to this cluster.

Segmenting the customers in this way lets the mall identify its target market and plan the marketing strategy accordingly.

7. Benefits of customer segmentation

* It enables companies to target specific groups of customers.
* When a group of customers is sent personalized messages as part of a marketing mix designed around their needs, it is easier for companies to send those customers special offers meant to encourage them to buy more products.
* Furthermore, such personalized messages tend to be more valued and appreciated by the customers who receive them, as opposed to impersonal brand messaging that doesn't acknowledge purchase history or any kind of customer relationship.
* Customer segmentation can also improve customer service and assist in customer loyalty and retention.
* It helps in staying a step ahead of competitors in specific sections of the market, and in identifying new products that existing or potential customers could be interested in, or improving products to meet customer expectations.

8. Saving the kmeans clustering model and the data with cluster labels

    import joblib
    joblib.dump(kmeansmodel, "kmeansmodel.pkl")

    ['kmeansmodel.pkl']

Converting the clustering problem into a 5-class classification problem: the goal is to classify new data points into these pre-defined cluster segments.

    # Creating a target column "Cluster" for storing the cluster segments
    cluster_df = pd.concat([df, pd.DataFrame({'Cluster': kmeansmodel.labels_})], axis=1)
    cluster_df

(output: cluster_df — 200 rows × 5 columns: Gender, Age, Annual Income (k$), Spending Score (1-100), Cluster)

    # sklearn preprocessing for dealing with categorical variables
    from sklearn.preprocessing import LabelEncoder

    # Create a label encoder object
    le = LabelEncoder()
    le_count = 0

    for col in cluster_df:
        if cluster_df[col].dtype == 'object':
            # If 2 or fewer unique categories
            if len(list(cluster_df[col].unique())) <= 2:
                le.fit(cluster_df[col])
                # Transform the data
                cluster_df[col] = le.transform(cluster_df[col])
                # Keep track of how many columns were label encoded
                le_count += 1

    print("%d columns were label encoded." % le_count)

    1 columns were label encoded.

    cluster_df.head()

(output: the first 5 rows, with Gender now encoded as 0 = Female, 1 = Male)

    # Saving clustered customer data for the Streamlit app
    cluster_df.to_csv("Clustered_Mall_Customers.csv")
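Since both artifacts are saved for reuse (e.g. in the Streamlit app mentioned above), here is a brief sketch, not part of the original notebook, of how they could be loaded back; the file names match those used above, and the two-feature input assumes the clustering features from the earlier sketch.

    # Sketch: reloading the saved artifacts in a separate script or app.
    import joblib
    import pandas as pd

    kmeansmodel = joblib.load("kmeansmodel.pkl")                 # fitted clustering model
    clustered_df = pd.read_csv("Clustered_Mall_Customers.csv",
                               index_col=0)                      # data with cluster labels

    # The loaded model can assign a segment to a new customer directly;
    # the income/spending values here are a made-up example point.
    new_point = pd.DataFrame({"Annual Income (k$)": [60],
                              "Spending Score (1-100)": [50]})
    print(kmeansmodel.predict(new_point))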
Using a Decision Tree to train and test the model

    df = pd.DataFrame(cluster_df)

    # Drop the "Cluster" column from the DataFrame
    X = df.drop(["Cluster"], axis=1)
    y = df[["Cluster"]]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

* The separation of X and y is done when preparing data for machine learning tasks, specifically for supervised learning algorithms.
* The reason for dropping the 'Cluster' column and separating it into 'X' and 'y' is to create separate datasets for the independent variables (X) and the target variable (y).
* Independent variables (X) will be used to make predictions or classify data points. These variables have predictive power.
* Target variable (y) is the variable I want to predict or classify. It contains the cluster labels. By assigning the 'Cluster' column to y, I am creating a separate variable that holds the cluster labels for each data point.

* train_test_split: This is a function from the sklearn.model_selection module. It is used to randomly split the dataset into two subsets, the training set and the testing set. It shuffles the data to ensure randomization.
* X and y: These variables represent the input features (X) and the target variable (y). They are the datasets obtained after dropping the target column and separating it from the input features, as explained earlier.
* test_size: This parameter specifies the proportion of the dataset that will be allocated to the testing set. In this case, it is set to 0.3, which means that 30% of the data will be used for testing and the remaining 70% for training.
* X_train, X_test, y_train, y_test: These variables store the resulting datasets after the split. X_train and y_train represent the training data, while X_test and y_test represent the testing data.

    # Training data
    X_train

(output: previews of X_train and X_test — 140 training rows and 60 testing rows across the four feature columns — and the corresponding y_train and y_test cluster labels)

I chose a decision tree because it is not affected by feature scaling.

    # Creating a DecisionTreeClassifier object with the entropy criterion
    model = DecisionTreeClassifier(criterion="entropy")

    # Training the model on the training data
    model.fit(X_train, y_train)

    # Making predictions on the testing data
    y_pred = model.predict(X_test)

    # Confusion Matrix
    print(metrics.confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

(output: the 5×5 confusion matrix, with correct predictions on the diagonal, followed by the classification report: per-class precision, recall, F1-score and support for the 60 test samples, with an overall accuracy of 0.93)

Explanation of confusion matrix results

1. Confusion Matrix:
   * It is a table that allows us to visualize the performance of the algorithm.
   * The confusion matrix above shows the counts of true positive predictions on the diagonal and the counts of false positive predictions in the off-diagonal elements. Each row represents the true labels, and each column represents the predicted labels.
   * For example, the value 7 in the top-left corner of the confusion matrix indicates that 7 instances of class 0 were correctly predicted as class 0.
   * The other values in the confusion matrix represent similar information for the respective classes.
2. Precision:
   * Precision is a measure of the model's ability to correctly predict the positive instances out of the total predicted positive instances.
   * The precision values range from 0 to 1, where a higher value indicates better performance.
   * For example, for class 0, the precision is 0.78, meaning that out of all the instances predicted as class 0, 78% were correct.
3. Recall:
   * Recall (also known as sensitivity or true positive rate) measures the model's ability to correctly identify the positive instances out of the total actual positive instances.
   * The recall values range from 0 to 1, where a higher value indicates better performance.
   * For example, for class 2, the recall is 0.91, indicating that the model identified 91% of the actual instances belonging to class 2.
4. F1-Score:
   * The F1-score is the harmonic mean of precision and recall and provides a balanced measure of a model's performance.
   * The F1-score values range from 0 to 1, where a higher value indicates better performance.
   * For example, for class 4, the F1-score is 1.0, indicating a perfect balance between precision and recall for that class.
5. Support:
   * Support represents the number of instances of each class in the actual dataset.
   * For example, there were 7 instances of class 0, 26 instances of class 1, and so on.
6. Accuracy:
   * Accuracy represents the overall performance of the model and is calculated as the ratio of correct predictions to the total number of predictions.
   * In this case, the model achieved an accuracy of 0.93, indicating that it correctly predicted 93% of the instances in the dataset.
7. Macro Avg and Weighted Avg:
   * Macro average calculates the average performance metrics (precision, recall, and F1-score) across all classes, giving equal weight to each class.
   * Weighted average calculates the average performance metrics, but weights each class by its support (number of instances).
   * In this case, both the macro average and the weighted average have similar values, indicating consistent performance across classes.
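To make the precision, recall and F1 definitions above concrete, here is a small self-contained check on a toy binary problem; the labels are made up for illustration and are unrelated to the mall dataset or the report above.

    # Worked example of the metrics on made-up binary labels.
    from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

    y_true = [0, 0, 0, 1, 1, 1, 1, 1]
    y_pred = [0, 0, 1, 1, 1, 1, 0, 1]

    print(confusion_matrix(y_true, y_pred))
    # [[2 1]    <- rows are true labels, columns are predicted labels
    #  [1 4]]   <- 4 true positives for class 1, 1 false negative

    # Precision for class 1 = TP / (TP + FP) = 4 / (4 + 1) = 0.8
    print(precision_score(y_true, y_pred))   # 0.8
    # Recall for class 1 = TP / (TP + FN) = 4 / (4 + 1) = 0.8
    print(recall_score(y_true, y_pred))      # 0.8
    # F1 = harmonic mean of precision and recall = 2*0.8*0.8 / (0.8 + 0.8) = 0.8
    print(f1_score(y_true, y_pred))          # 0.8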
Saving the Decision tree model for future prediction

    import pickle
    filename = 'final_model.sav'
    pickle.dump(model, open(filename, 'wb'))

    # Load the model from disk
    loaded_model = pickle.load(open(filename, 'rb'))
    result = loaded_model.score(X_test, y_test)
    print(result, 'Accuracy')

    0.9333333333333333 Accuracy

* An accuracy of 0.9333333333333333 (or 93.33%) means that the model correctly predicted the outcome or class for approximately 93.33% of the samples in the evaluation set. In classification tasks, accuracy is a commonly used metric to assess the performance of a model.
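As a closing illustration, the saved decision tree can classify a brand-new customer without re-running K-Means. This is a sketch, not part of the original notebook: the feature order and the encoded Gender value follow the label-encoded cluster_df above, and the customer values are made up.

    # Sketch: assigning a segment to a new customer with the saved tree.
    import pickle
    import pandas as pd

    loaded_model = pickle.load(open('final_model.sav', 'rb'))

    # Feature order matches the label-encoded cluster_df:
    # Gender (0 = Female, 1 = Male), Age, Annual Income (k$), Spending Score (1-100).
    new_customer = pd.DataFrame(
        [[1, 30, 70, 80]],  # hypothetical example customer
        columns=['Gender', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)']
    )
    print(loaded_model.predict(new_customer))  # predicted cluster label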
