Data Mining Project DSBA Clustering Report Final
[email protected]
YQL1358D96
This file is meant for personal use by [email protected] only.
Sharing or publishing the contents in part or full is liable for legal action.
Contents
List of Figures
List of Tables
Problem Statement
Read the data and perform basic analysis such as printing a few rows (head and tail), info, data summary, null values, duplicate values, etc.
Missing Value Treatment
Check if there are any outliers. Do you think treating outliers is necessary for K-Means clustering? Based on your judgement decide whether to treat outliers and if yes, which method to employ. (As an analyst your judgement may be different from another analyst)
Outlier Detection and Treatment using IQR method
Perform z-score scaling and discuss how it affects the speed of the algorithm
Perform clustering
Perform Hierarchical by constructing a Dendrogram using WARD and Euclidean distance
Make Elbow plot (up to n=10) and identify optimum number of clusters for k-means algorithm
Print silhouette scores for up to 10 clusters and identify optimum number of clusters
Profile the ads based on optimum number of clusters using silhouette score and your domain understanding
Conclusion
Appendix
Code
List of Figures
Figure 1: Missing Values Count representation using bar plot
Figure 2: Boxplot for outliers
Figure 3: Boxplots after Outlier Treatment
Figure 4: Dendrogram using WARD and Euclidean distance
Figure 5: Elbow Plot
Figure 6: Silhouette Score Plot
Figure 7: Cluster wise device type total clicks
Figure 8: Cluster wise device type wise total revenue
Figure 9: Cluster wise device type wise total spend
Figure 10: Cluster wise device type wise average CPC, CTR, CPM
List of Tables
Table 1: Data Information
Table 2: First five rows of the dataset (shown here as columns to save space)
Table 3: Data summary stats (for continuous variables only)
Table 4: Summary Table for Outlier Detection and Treatment
Table 5: Scaled data head
Table 6: Proportion of records per label
Table 7: Cluster Profiles: Averages of the features considered
Problem Statement
Read the data and perform basic analysis such as printing a few rows (head and tail), info, data
summary, null values, duplicate values, etc.
Treat missing values in CPC, CTR and CPM using the formula given
Check if there are any outliers
Do you think treating outliers is necessary for K-Means clustering? Based on your judgement
decide whether to treat outliers and if yes, which method to employ. (As an analyst your
judgement may be different from another analyst)
Perform z-score scaling and discuss how it affects the speed of the algorithm
Perform clustering and do the following:
o Make Dendrogram using WARD and Euclidean distance
o Make elbow plot (up to n=10) and identify optimum number of clusters
o Print silhouette scores for up to 10 clusters and identify optimum number of clusters
o Profile the ads based on optimum number of clusters using silhouette score and your
domain understanding.
[Hint: Group the data by clusters and take sum or mean to identify trends in clicks, spend,
revenue, CPM, CTR, & CPC based on Device Type. Make bar plots.]
Read the data and perform basic analysis such as printing a few rows (head and tail), info,
data summary, null values duplicate values, etc.
The dataset contains about 23K records with 19 variables (6 float64, 7 int64, 6 object).
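The basic checks listed in this section can be sketched with pandas as follows. This is only an illustration: the small CSV text is a made-up stand-in (an assumption, not the project file) so that the snippet runs on its own; with the real data, `pd.read_csv` would be pointed at the actual ads file instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the ads data; the real file has ~23K rows and 19 columns.
csv_text = """Device Type,Impressions,Clicks,Spend,CTR
Desktop,323,1,0.0,0.0031
Mobile,285,1,0.0,0.0035
Mobile,285,1,0.0,0.0035
Desktop,355,1,0.0,
Mobile,495,1,0.0,0.0020
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))        # first rows
print(df.tail(2))        # last rows
df.info()                # column dtypes and non-null counts
print(df.describe().T)   # summary statistics for numeric columns

null_counts = df.isnull().sum()            # missing values per column
n_duplicates = int(df.duplicated().sum())  # fully duplicated rows
print(null_counts)
print("Duplicate rows:", n_duplicates)
```

In this toy frame the third row repeats the second and one CTR value is blank, so both the duplicate check and the null check have something to report.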
Dataset Head:
Table 2: First five rows of the dataset (shown here as columns to save space)
0 1 2 3 4
Timestamp 2020-9-2-17 2020-9-2-10 2020-9-1-22 2020-9-3-20 2020-9-4-15
InventoryType Format1 Format1 Format1 Format1 Format1
Ad - Length 300 300 300 300 300
Ad- Width 250 250 250 250 250
Ad Size 75000 75000 75000 75000 75000
Ad Type Inter222 Inter227 Inter222 Inter228 Inter217
Platform Video App Video Video Web
Device Type Desktop Mobile Desktop Mobile Desktop
Format Display Video Display Video Video
Available_Impressions 1806 1780 2727 2430 1218
Matched_Queries 325 285 356 497 242
Impressions 323 285 355 495 242
Clicks 1 1 1 1 1
Spend 0.0 0.0 0.0 0.0 0.0
Fee 0.35 0.35 0.35 0.35 0.35
Revenue 0.0 0.0 0.0 0.0 0.0
CTR 0.0031 0.0035 0.0028 0.002 0.0041
CPM 0.0 0.0 0.0 0.0 0.0
CPC 0.0 0.0 0.0 0.0 0.0
count mean std min 25% 50% 75% max
Spend 23066.0 2706.63 4067.93 0.00 85.18 1425.12 3121.40 26931.87
Fee 23066.0 0.34 0.03 0.21 0.33 0.35 0.35 0.35
Revenue 23066.0 1924.25 3105.24 0.00 55.37 926.34 2091.34 21276.18
CTR 18330.0 0.07 0.08 0.00 0.00 0.08 0.13 1.00
CPM 18330.0 7.67 6.48 0.00 1.71 7.66 12.51 81.56
CPC 18330.0 0.35 0.34 0.00 0.09 0.16 0.57 7.26
The minimum value for several variables is 0, and there are no negative values. CTR, CPM, and CPC are
derived fields and have missing values. Note that the ranges of the variables differ widely.
Each of these 3 variables has 18,330 non-null values out of 23,066 records, i.e. 4,736 missing values.
Their counts are given in Figure 1.
Treat missing values in CPC, CTR and CPM using the formula given
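The formula referred to in this heading is not reproduced in the report text. The sketch below therefore assumes the standard definitions, CTR = Clicks / Impressions x 100, CPM = Spend / Impressions x 1000, and CPC = Spend / Clicks; the helper name `impute_derived` and the toy data are hypothetical. Note that the dataset head shows CTR stored as a plain ratio (1 / 323 is about 0.0031), so the factor of 100 may need to be dropped to match the project data.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): the second row has its derived fields missing.
ads = pd.DataFrame({
    "Impressions": [1000, 2000, 500],
    "Clicks": [10, 50, 5],
    "Spend": [5.0, 20.0, 1.0],
    "CTR": [1.0, np.nan, 1.0],
    "CPM": [5.0, np.nan, 2.0],
    "CPC": [0.5, np.nan, 0.2],
})

def impute_derived(data):
    """Fill missing CTR, CPM and CPC from Clicks, Impressions and Spend."""
    out = data.copy()
    ctr = out["Clicks"] / out["Impressions"] * 100   # assumed: CTR as a percentage
    cpm = out["Spend"] / out["Impressions"] * 1000   # cost per 1000 impressions
    cpc = out["Spend"] / out["Clicks"]               # cost per click
    # fillna only touches the missing entries; existing values are kept as-is.
    out["CTR"] = out["CTR"].fillna(ctr)
    out["CPM"] = out["CPM"].fillna(cpm)
    out["CPC"] = out["CPC"].fillna(cpc)
    return out

imputed = impute_derived(ads)
print(imputed)
```

Rows with zero clicks or zero impressions would produce infinite or undefined ratios and would need separate handling.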
Check if there are any outliers. Do you think treating outliers is necessary for K-Means
clustering? Based on your judgement decide whether to treat outliers and if yes, which
method to employ. (As an analyst your judgement may be different from another analyst)
noise, and the k-means method may be sensitive to it. Noise can drastically change the quality of the
clustering solution and it is important to take this into account in designing algorithms for partition [1].
In this method, any observation that is less than Q1 – 1.5 IQR or more than Q3 + 1.5 IQR is considered
an outlier.
To treat outliers, we defined a function 'treat_outlier' where
The larger values (>upper whisker) are all equated to the 95th percentile value of the
distribution
The smaller values (<lower whisker) are all equated to the 5th percentile value of the
distribution.
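The report's own 'treat_outlier' code is not shown in this section, so the following is only a sketch of a function matching the description above: the whiskers are taken from the 1.5 IQR rule, and values beyond them are replaced with the 5th or 95th percentile. The sample series is made up for illustration.

```python
import numpy as np
import pandas as pd

def treat_outlier(series):
    """Replace values beyond the IQR whiskers with the 5th/95th percentile."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower_whisker = q1 - 1.5 * iqr
    upper_whisker = q3 + 1.5 * iqr
    p05, p95 = series.quantile([0.05, 0.95])
    treated = series.astype(float)
    # mask() replaces the entries where the condition is True.
    treated = treated.mask(series > upper_whisker, p95)
    treated = treated.mask(series < lower_whisker, p05)
    return treated

# Made-up example: the values 1..100 plus one extreme observation.
spend = pd.Series(list(range(1, 101)) + [10000])
treated_spend = treat_outlier(spend)
print(spend.max(), treated_spend.max())
```

Here only the extreme value is changed; it is pulled down to the 95th percentile of the series, and every other value is left untouched.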
Perform z-score scaling and discuss how it affects the performance of the algorithm
We used scikit-learn’s StandardScaler to perform z-score scaling. Table 5 shows the first five rows of the
scaled data (rows transposed as columns).
0 1 2 3 4
Ad - Length -0.364496 -0.364496 -0.364496 -0.364496 -0.364496
Ad- Width -0.432797 -0.432797 -0.432797 -0.432797 -0.432797
Ad Size -0.359227 -0.359227 -0.359227 -0.359227 -0.359227
Available_Impressions -0.569484 -0.569490 -0.569269 -0.569339 -0.569622
Matched_Queries -0.567061 -0.567076 -0.567049 -0.566994 -0.567093
Impressions -0.563943 -0.563958 -0.563931 -0.563875 -0.563975
Clicks -0.719779 -0.719779 -0.719779 -0.719779 -0.719779
Spend -0.722776 -0.722776 -0.722776 -0.722776 -0.722776
Fee 0.487214 0.487214 0.487214 0.487214 0.487214
Revenue -0.676118 -0.676118 -0.676118 -0.676118 -0.676118
CTR -0.978830 -0.973650 -0.982332 -0.992329 -0.965826
CPM -1.220346 -1.220346 -1.220346 -1.220346 -1.220346
CPC -1.083011 -1.083011 -1.083011 -1.083011 -1.083011
Scaling of variables is important for clustering to stabilize the weights of the different variables.
If there is wide discrepancy in the range of variables (refer to Table 3) cluster formation may be
affected by weight differential.
The features contained in a data set may have different units (e.g. feet, kilometers, and hours), which, in
turn, may mean that the variables have different scales. Many machine learning algorithms, in particular
distance-based ones such as K-Means, are sensitive to the scaling of data. If there is wide discrepancy among
the input values, the unscaled model may be unstable, meaning that it may suffer from poor performance during
learning and sensitivity to input values, resulting in higher generalization error. [2]
One of the most common forms of pre-processing consists of a simple linear rescaling of the input
variables.
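A minimal sketch of this rescaling with scikit-learn's StandardScaler is given below. The two-column frame is made up for illustration (an assumption); in the project the scaler would be applied to the numeric columns of the ads data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
features = pd.DataFrame({
    "Impressions": [323.0, 285.0, 355.0, 495.0, 242.0],
    "Fee": [0.35, 0.35, 0.33, 0.21, 0.35],
})

scaler = StandardScaler()
# z = (x - mean) / std, computed column by column (population std, ddof=0).
scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)

print(scaled.T)                 # transposed, as in the report's table
print(scaled.mean().round(6))   # approximately 0 for every column
print(scaled.std(ddof=0))       # approximately 1 for every column
```

After scaling, both columns have mean 0 and unit standard deviation, so neither dominates the Euclidean distances used by the clustering algorithms.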
[2] https://machinelearningmastery.com/
Perform clustering
Perform Hierarchical by constructing a Dendrogram using WARD and Euclidean distance
[reference - https://wheatoncollege.edu/wp-content/uploads/2012/08/How-to-Read-a-Dendrogram-Web-Ready.pdf ]
Keeping the above reference as a base, we can see that the longest (tallest) branch is in blue. If we cut
at the blue branch only, we get just 2 clusters, which is too coarse to be useful for the business. If, however,
the segmentation is made at the tallest red branches, separated by the yellow horizontal line, 5 clusters are
identified. Alternatively, there may be 3 clusters as well, designated by the yellow horizontal line. For this
project, we choose 5 clusters based on the dendrogram.
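The dendrogram construction and the cut into 5 clusters can be sketched with SciPy as below. The synthetic blobs stand in for the scaled ads data (an assumption), and `no_plot=True` is used so that the snippet runs without a plotting backend; dropping that argument draws the tree with matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic stand-in for the scaled data: 5 well-separated groups of 20 points.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0], [10, 10]], dtype=float)
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

# Ward linkage on Euclidean distances, as named in the section heading.
Z = linkage(X, method="ward", metric="euclidean")

# Compute the (truncated) dendrogram layout without drawing it.
tree = dendrogram(Z, no_plot=True, truncate_mode="lastp", p=10)

# Cut the tree so that at most 5 flat clusters remain.
hier_labels = fcluster(Z, t=5, criterion="maxclust")
print(np.bincount(hier_labels)[1:])  # cluster sizes
```

Cutting by `maxclust` is equivalent to drawing a horizontal line across the dendrogram at the height where exactly that many branches are crossed.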
Make Elbow plot (up to n=10) and identify optimum number of clusters for k-means algorithm
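The quantity plotted in an elbow plot is the within-cluster sum of squares (`inertia_` in scikit-learn) for each candidate k. The sketch below computes it for k = 1 to 10 on synthetic data; the data, the `random_state`, and `n_init=10` are assumptions made for illustration, not the project's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled data: 3 well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [6, 0], [0, 6]], dtype=float)
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1)
    km.fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

for k, wss in enumerate(inertias, start=1):
    print(k, round(wss, 2))
```

Plotting `inertias` against k gives the elbow curve; the optimum is read off where the drop in inertia flattens out.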
Print silhouette scores for up to 10 clusters and identify optimum number of clusters
The Average Silhouette Score for 9 clusters is 0.45851
The Average Silhouette Score for 10 clusters is 0.46434
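Scores of this form can be produced with a loop like the one below. It is a sketch on synthetic data with 4 obvious groups (an assumption), so the numbers it prints are unrelated to the project's scores.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data: 4 well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]], dtype=float)
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

scores = {}
for k in range(2, 11):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"The Average Silhouette Score for {k} clusters is {scores[k]:.5f}")

best_k = max(scores, key=scores.get)  # k with the highest average score
print("Best k:", best_k)
```

The silhouette score ranges from -1 to 1, and the k with the highest average score is taken as the candidate optimum.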
Hierarchical Clustering as well as KMeans Clustering were performed. We used the Elbow plot and Silhouette
Score to identify the optimum number of clusters in KMeans, whereas in Hierarchical Clustering a dendrogram
was drawn. In the Hierarchical method, we got 5 clusters, while in KMeans we got 5 (using the elbow plot) and 6
clusters (using the silhouette score).
Discussion (Non-graded)
We can always try alternative approaches to clustering using other linkage types and distance metrics for
an exhaustive study of the data. Please refer to the Monograph for details. We observe that the methods
used in this project yielded similar results, i.e. 5 clusters (n_clusters=5 is also close, with a silhouette
score of 0.518). According to the dendrogram in Figure 4, 3 or 4 clusters may also be considered. But
more than 5 clusters may result in a high degree of fragmentation, where more than one cluster may
have a similar profile. As per the K-Means silhouette score, we got 6 clusters. Hierarchical clustering may also
indicate 6 clusters (see the blue horizontal line in Figure 4). There may be other possibilities as well.
However, the main considerations are:
1. What is the optimal number of clusters that supports your business assumptions or rules about
the market?
2. What is the optimum number of market segments that can be handled in day-to-day operations?
As suggested in the rubric, we will segment the data into 6 clusters as per the silhouette score plot (Figure 6).
Profile the ads based on optimum number of clusters using silhouette score and your domain
understanding
Label Proportion
0 29.66
1 7.61
2 19.56
3 6.21
4 30.32
5 6.62
KMEANS_LABELS 0 1 2 3 4 5
Ad - Length 149.55 316.28 695.17 142.18 418.07 680.94
Ad- Width 558.21 254.54 316.80 571.18 157.14 117.92
Ad Size 75690.15 78364.78 213586.18 75625.96 56445.35 70159.76
Available_Impressions 46582.25 6583616.27 279059.43 843405.75 2070385.26 17858169.00
Matched_Queries 28661.60 3680737.02 147665.19 591156.63 1020575.04 9536142.81
Impressions 21257.39 3600777.01 126758.60 498760.11 980987.72 9181756.42
Clicks 2947.20 8548.28 13904.89 68157.27 3451.11 17394.94
Spend 318.92 4867.49 1224.16 7234.73 1763.33 15373.73
Fee 0.35 0.32 0.35 0.29 0.35 0.24
Revenue 208.48 3326.64 797.23 5205.55 1157.98 11761.38
CTR 15.97 0.24 13.64 13.77 0.39 0.19
CPM 14.71 1.38 11.92 15.13 1.80 1.71
CPC 0.10 0.60 0.09 0.11 0.58 0.92
freq 6842.00 1756.00 4514.00 1433.00 6994.00 1527.00
Observations:
1. Clusters 2 and 5 contain ads that have a higher mean length than the other clusters.
2. Clusters 0 and 3 have ads whose mean width is considerably larger than in the other clusters.
3. Cluster 4 has the smallest mean ad size, while cluster 2 has by far the largest.
4. Available impressions are highest for cluster 5, followed by cluster 1.
5. There is not much difference in Fee, but cluster 5 has very high mean spend and mean revenue
compared to the others.
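A profile table of this kind (feature means per cluster, the record count `freq`, and the proportion of records per label) can be sketched with a pandas groupby. The column name `KMEANS_LABELS` follows the table above; the toy values are made up for illustration.

```python
import pandas as pd

# Toy data (assumption): cluster labels with two of the numeric features.
ads = pd.DataFrame({
    "KMEANS_LABELS": [0, 0, 1, 1, 1, 2],
    "Clicks": [10, 20, 100, 200, 300, 5],
    "Spend": [1.0, 3.0, 10.0, 20.0, 30.0, 0.5],
})

# Mean of every feature within each cluster.
profile = ads.groupby("KMEANS_LABELS").mean()
# Number of records per cluster, aligned on the cluster label.
profile["freq"] = ads["KMEANS_LABELS"].value_counts().sort_index()

# Share of records per label, in percent.
proportion = (
    ads["KMEANS_LABELS"].value_counts(normalize=True).sort_index() * 100
).round(2)

print(profile.T)    # clusters as columns, as in the report's table
print(proportion)
```

Transposing the profile puts clusters in columns, which makes it easier to compare one feature across clusters row by row.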
Discussion (non-graded):
It is possible to start with a larger number of clusters and, based on a comparison of the profiles, reduce the
number of clusters. Note that the opposite is not possible. Cluster plots are investigated to decide
whether two clusters overlap considerably and can therefore be combined. Refer to the Monograph
for details. In this case, using KMeans, we can take 5 or 6 clusters, but what if we take ‘n_clusters’ up to
20 and get a better Silhouette score at, say, 14 or 20 clusters? From a practical point of view, so
many clusters are not usable. Hence, we take the best Silhouette score among a reasonable number of
clusters, say 10, and compare cluster profiles and plots to trim the number of clusters. This is subjective
and depends upon the domain application and the experience of the user.
[Hint: Group the data by clusters and take sum or mean to identify trends in clicks, spend, revenue, CPM,
CTR, & CPC based on Device Type. Make bar plots.]
Using the hint provided in the rubric, we plot bar charts by grouping the data by cluster label and device
type and taking the sum or mean of Clicks, Spend, Revenue, CTR, CPC, and CPM.
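The grouping behind such bar charts can be sketched as follows. The toy frame is made up (an assumption); the column names `KMEANS_LABELS` and `Device Type` follow the tables in this report.

```python
import pandas as pd

# Toy data (assumption): a few ads with cluster label and device type.
ads = pd.DataFrame({
    "KMEANS_LABELS": [0, 0, 1, 1, 1],
    "Device Type": ["Mobile", "Desktop", "Mobile", "Mobile", "Desktop"],
    "Clicks": [10, 5, 100, 50, 20],
    "Revenue": [1.0, 0.5, 8.0, 4.0, 2.0],
    "CPC": [0.10, 0.10, 0.08, 0.08, 0.10],
})

grouped = ads.groupby(["KMEANS_LABELS", "Device Type"])

# Totals for volume metrics, means for rate metrics; one column per device type.
total_clicks = grouped["Clicks"].sum().unstack("Device Type")
total_revenue = grouped["Revenue"].sum().unstack("Device Type")
mean_cpc = grouped["CPC"].mean().unstack("Device Type")

print(total_clicks)
print(total_revenue)
print(mean_cpc)
```

Each resulting table has clusters as rows and device types as columns, so calling `.plot(kind="bar")` on it (with matplotlib installed) gives one group of bars per cluster.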
Figure 7: Comparison of Clusters according to device type (x-axis) and total clicks (y-axis)
Observation: The Mobile segment within Cluster 3 has the maximum number of clicks followed by
Mobile segment within Cluster 2. Only for Cluster 3, desktop segment shows considerable number of
clicks.
Figure 8: Comparison of Clusters according to device type (x-axis) and total revenue (y-axis)
Observations:
The mobile segment within Cluster 5 generates the most revenue and may be considered the best ads.
Similarly, the desktop segment cluster label has highest revenue generated for Desktop Ads.
Figure 9: Comparison of Clusters according to device type (x-axis) and total spend (y-axis)
Observations:
The mobile segment within Cluster 5 shows the highest total spend and may be considered premium ads.
Similarly, the desktop segment cluster label has highest spending done for Desktop Ads.
For the Mobile segment, Clusters 3 and 4 show the next-highest spending after Cluster 5, in that order.
Figure 10: Comparison of Clusters according to device type (x-axis) and average CPC, CTR, CPM
CPM stands for "cost per 1000 impressions". In simple words, CPM is the amount it costs to have
an ad published a thousand times on a website and seen by users. For example, if a website publisher
charges $4.00 CPM, an advertiser must pay $4.00 for every 1,000 impressions of its ad.
CPC stands for Cost Per Click. It is the average amount an advertiser pays each time a user clicks on
its ad. CPC is also a widely used Google AdWords metric that advertisers
use to manage their campaign budgets and performance. Let us say your CPC ads get 2 clicks, one
costing $0.40 and the other $0.20, for a total of $0.60. You would divide $0.60 by 2 (your total number of
clicks) to get an average CPC of $0.30.
CTR, or Click Through Rate, measures the success of online ads as the percentage of people who see the
ad and actually click on it to arrive at the hyperlinked website. For example, if an ad has been clicked
200 times after being served 50,000 times, dividing 200 by 50,000 gives 0.004, and multiplying that result
by 100 gives a click-through rate of 0.4%.
Reference: https://www.publift.com/adteach/what-are-cpm-cpc-cpa-ctr
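The three worked examples above can be checked with a few lines of arithmetic. The function names below are hypothetical helpers introduced only for this illustration.

```python
def cpm(spend, impressions):
    """Cost per 1000 impressions."""
    return spend / impressions * 1000

def cpc(spend, clicks):
    """Average cost per click."""
    return spend / clicks

def ctr_percent(clicks, impressions):
    """Click-through rate as a percentage."""
    return clicks / impressions * 100

# $4.00 spent on 1,000 impressions
example_cpm = cpm(4.00, 1000)
# two clicks costing $0.40 and $0.20
example_cpc = cpc(0.40 + 0.20, 2)
# 200 clicks out of 50,000 impressions
example_ctr = ctr_percent(200, 50000)

print(round(example_cpm, 2), round(example_cpc, 2), round(example_ctr, 2))
```

Rounded to two decimals, these reproduce the $4.00 CPM, $0.30 CPC and 0.4% CTR quoted in the text.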
Observations: According to Figure 10, Clusters 0, 2 and 3 have the highest average CPM. These ads are
probably posted on expensive and most visited websites. Average CTR is also the highest in the same
three clusters. There does not seem to be any considerable difference between the mobile and desktop
segments here.
Selling ads according to CPM puts a ceiling on revenue. If you want to increase your revenue, you have
to spend money on increasing your reach to create more ad opportunities, or push more ads to
the same users, before seeing a return. But if you sell on CTR, revenue is not capped: you can increase
engagement on the same number of impressions per person, or per DAU (daily active user). With
CPM, by contrast, you must stretch to reach more and more people, or degrade your user experience with
more ads per user. Reference: https://blog.taboola.com/ctr-better-cpm-care/
Conclusion
In this project,
1. We learned to impute missing values using a different approach, i.e. using custom formulae.
2. We discussed the effect of outliers on the quality of clustering profiles.
3. We discussed scaling and its effect on the performance of the algorithm.
4. We discussed that clusters need to be revisited if there is too much similarity, or overlap, among
them.
5. We learned about certain digital marketing terms and their significance.
Appendix
Code