
ML 1

The document discusses clustering digital advertising data to segment ads into homogeneous groups. It describes preprocessing the data by treating missing values, identifying and handling outliers, and scaling the data. Various clustering algorithms are applied including hierarchical clustering to identify optimal k for k-means clustering, and k-means is used to cluster the ads and profile them by cluster.

Uploaded by

Janhavi Gupta
Copyright
© All Rights Reserved

MACHINE LEARNING 1
Janhavi Gupta
Table of Contents

Problem 1: Clustering - Digital Ads Data

The ads24x7 is a Digital Marketing company which has now got seed funding of $10 Million.
They are expanding their wings in Marketing Analytics. They collected data from their
Marketing Intelligence team and now wants you (their newly appointed data analyst) to segment
type of ads based on the features provided. Use Clustering procedure to segment ads into
homogeneous groups.

Perform the following in given order:
1.1. Read the data and perform basic analysis such as printing a few rows (head and tail),
info, data summary, null values, duplicate values, etc. (4 marks)
1.2. Treat missing values in CPC, CTR and CPM using the formula given. You may refer
to the Bank_KMeans Solution File to understand the coding behind treating the
missing values using a specific formula. You have to basically create a user defined function
and then call the function for imputing. (4 marks)
1.3. Check if there are any outliers. Do you think treating outliers is necessary for K-Means
clustering? Based on your judgement decide whether to treat outliers and if yes,
which method to employ. (As an analyst your judgement may be different from another
analyst). (3 marks)
1.4. Perform z-score scaling and discuss how it affects the speed of the algorithm. (3 marks)

Perform clustering and do the following:
1.5. Perform Hierarchical by constructing a Dendrogram using WARD and Euclidean
distance. (4 marks)
1.6. Make Elbow plot (up to n=10) and identify optimum number of clusters for k-means
algorithm. (4 marks)
1.7. Print silhouette scores for up to 10 clusters and identify optimum number of clusters.
(4 marks)
1.8. Profile the ads based on optimum number of clusters using silhouette score and your
domain understanding [Hint: Group the data by clusters and take sum or mean to identify
trends in clicks, spend, revenue, CPM, CTR, & CPC based on Device Type. Make bar plots.]
(4 marks)
1.9. Conclude the project by providing summary of your learnings. (3 marks)

Problem 2: PCA

PCA FH (FT): Primary census abstract for female headed households excluding institutional
households (India & States/UTs - District Level), Scheduled tribes - 2011 PCA for Female
Headed Household Excluding Institutional Household.

2.1. Read the data and perform basic checks like checking head, info, summary, nulls, and
duplicates, etc. (4 marks)
2.2. Perform detailed Exploratory analysis by creating certain questions like (i) Which state
has highest gender ratio and which has the lowest? (ii) Which district has the highest & lowest
gender ratio? (Example Questions). Pick 5 variables out of the given 24 variables below for
EDA: No_HH, TOT_M, TOT_F, M_06, F_06, M_SC, F_SC, M_ST, F_ST, M_LIT, F_LIT,
M_ILL, F_ILL, TOT_WORK_M, TOT_WORK_F, MAINWORK_M, MAINWORK_F,
MAIN_CL_M, MAIN_CL_F, MAIN_AL_M, MAIN_AL_F, MAIN_HH_M, MAIN_HH_F,
MAIN_OT_M, MAIN_OT_F (6 marks)
2.3. We choose not to treat outliers for this case. Do you think that treating outliers for this
case is necessary? (1 marks)
2.4. Scale the Data using z-score method. Does scaling have any impact on outliers? Compare
boxplots before and after scaling and comment. (3 marks)
2.5. Perform all the required steps for PCA (use sklearn only) Create the covariance Matrix
Get eigen values and eigen vector. (4 marks)
2.6. Identify the optimum number of PCs (for this project, take at least 90% explained
variance). Show Scree plot. (3 marks)
2.7. Compare PCs with Actual Columns and identify which is explaining most variance. Write
inferences about all the principal components in terms of actual variables. (4 marks)
2.8. Write linear equation for first PC. (2 marks)

Problem 1: Clustering - Digital Ads Data:

The ads24x7 is a Digital Marketing company which has now got seed funding of $10
Million. They are expanding their wings in Marketing Analytics. They collected data
from their Marketing Intelligence team and now wants you (their newly appointed data
analyst) to segment type of ads based on the features provided. Use Clustering
procedure to segment ads into homogeneous groups.

Perform the following in given order:

1.1. Read the data and perform basic analysis such as printing a few
rows (head and tail), info, data summary, null values, duplicate
values, etc.

Solution:

The ads24x7 data has shape (23066, 19), i.e. 23066 rows and 19 columns.

The data has 19 attributes, 6 of object type and 13 floats.

CTR, CPM and CPC have 4736 null-values, remaining variables do not have any null-
values.

Observation

Based on the provided data, it's apparent that the ads company's objective is to enhance the
Click-Through Rate (CTR) while minimizing both the Cost Per Mille (CPM) and Cost Per Click
(CPC). Here's a summary of the data:

• There are a total of 6 categorical and 13 numeric variables in the dataset.

• No duplicate values were found in the dataset.

• The dataset comprises 23066 rows and 19 columns.

• Ad Length ranges from a minimum of 120 to a maximum of 728, with an average Ad width of 337.

• Clicks range from a minimum of 1 to a maximum of 143049; the median number of impressions is 225290.

• The maximum fee observed is 0.35, while the minimum fee is 0.21.

• There's a notable correlation between Spend and Revenue.

• Available Impressions and Matched Queries appear to be correlated.

• Additionally, Available Impressions and Impressions are correlated, and Matched Queries and Impressions are highly correlated.

1.2 Treat missing values in CPC, CTR and CPM using the formula given. You may
refer to the Bank_KMeans Solution File to understand the coding behind treating
the missing values using a specific formula. You have to basically create a user
defined function and then call the function for imputing.

We have found 4736 missing values in CPC, CPM and CTR.


The missing values were imputed by calling user-defined functions that implement the
formulas below:

• CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Total Campaign Spend
refers to the 'Spend' column in the dataset and Number of Impressions refers to the
'Impressions' column in the dataset.

• CPC = Total Cost (spend) / Number of Clicks. Total Cost (spend) refers to the 'Spend'
column in the dataset and Number of Clicks refers to the 'Clicks' column in the dataset.

• CTR = (Total Measured Clicks / Total Measured Ad Impressions) * 100. Total Measured
Clicks refers to the 'Clicks' column in the dataset and Total Measured Ad Impressions
refers to the 'Impressions' column in the dataset.
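
As a rough illustration of the imputation step, the sketch below fills missing CPM, CPC and CTR values from the three formulas. It assumes a pandas DataFrame with columns named 'Spend', 'Impressions', 'Clicks', 'CPM', 'CPC' and 'CTR'. The function name impute_metrics and the toy numbers are made up for this example and are not taken from the Bank_KMeans solution file.

```python
import numpy as np
import pandas as pd


def impute_metrics(df):
    """Return a copy of df with missing CPM, CPC and CTR filled from the formulas."""
    out = df.copy()
    # Compute each metric for every row, then use it only where the value is missing.
    cpm = out["Spend"] / out["Impressions"] * 1000
    cpc = out["Spend"] / out["Clicks"]
    ctr = out["Clicks"] / out["Impressions"] * 100
    out["CPM"] = out["CPM"].fillna(cpm)
    out["CPC"] = out["CPC"].fillna(cpc)
    out["CTR"] = out["CTR"].fillna(ctr)
    return out


# Toy data (hypothetical values): the second row has the three metrics missing.
ads = pd.DataFrame({
    "Spend": [10.0, 20.0],
    "Impressions": [1000.0, 4000.0],
    "Clicks": [50.0, 100.0],
    "CPM": [10.0, np.nan],
    "CPC": [0.2, np.nan],
    "CTR": [5.0, np.nan],
})
imputed = impute_metrics(ads)
print(imputed[["CPM", "CPC", "CTR"]])
```

Rows that already have a value keep it; only the missing entries are replaced by the computed ones.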

1.3 Check if there are any outliers. Do you think treating outliers is
necessary for K-Means clustering? Based on your judgement decide
whether to treat outliers and if yes, which method to employ. (As an
analyst your judgement may be different from another analyst).

Solution

All features except Ad – Length and Ad – Width have outliers, as shown by the box plots
below.

K-means clustering is sensitive to outliers because cluster centroids are means, which
extreme values pull towards themselves. Outlier treatment is therefore necessary, and it
was done with the lower and upper bound (IQR) method, using
lower_range = Q1 - (1.5 * IQR) and upper_range = Q3 + (1.5 * IQR).
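
One possible way to apply these bounds is to cap each value at lower_range and upper_range. The sketch below assumes capping (rather than dropping rows), which the report does not state explicitly. The helper name cap_outliers and the sample values are hypothetical.

```python
import numpy as np


def cap_outliers(values):
    """Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] at those bounds."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_range = q1 - 1.5 * iqr
    upper_range = q3 + 1.5 * iqr
    capped = np.clip(values, lower_range, upper_range)
    return capped, lower_range, upper_range


# Hypothetical feature with one extreme value at the end.
clicks = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])
capped, lower_range, upper_range = cap_outliers(clicks)
print(lower_range, upper_range)  # -3.0 13.0
print(capped)                    # the value 100.0 is capped at 13.0
```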

Box plots of features post outlier treatments:

1.4. Perform z-score scaling and discuss how it affects the speed of
the algorithm.

Solution:

Scaling with the z-score, z = (x - u) / s, where u is the mean and s is the standard
deviation of a feature, is required because some variables range in the hundreds and
thousands while others are single digits. Below is the scaled data:

[[-0.3644957  -0.43279676 -0.3522185  ... -0.87459265 -1.19449791 -1.04256138]
 [-0.3644957  -0.43279676 -0.3522185  ... -0.87013569 -1.19449791 -1.04256138]
 [-0.3644957  -0.43279676 -0.3522185  ... -0.87760581 -1.19449791 -1.04256138]
 ...
 [ 1.43309269 -0.18659865  1.93908609 ...  9.88896203  3.16271759 -0.88461411]
 [-1.13489073  1.29058999 -0.40096966 ...  9.88896203  3.16271759 -0.82143521]
 [ 1.43309269 -0.18659865  1.93908609 ...  4.4904711   3.16271759 -0.7582563 ]]

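Scaling puts all features on a common scale, so that no single large-valued variable
dominates the Euclidean distances. This makes the analysis more consistent and can help
the algorithm converge faster.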
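
A minimal sketch of z-score scaling with NumPy follows. The function name zscore and the two-column toy matrix are assumptions for illustration. It uses the population standard deviation (ddof=0), which is also what sklearn's StandardScaler uses.

```python
import numpy as np


def zscore(X):
    """Scale each column to mean 0 and standard deviation 1: z = (x - u) / s."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    return (X - mean) / std


# Hypothetical data: one column in the hundreds, one in single digits.
X = np.array([[100.0, 1.0],
              [200.0, 2.0],
              [300.0, 3.0]])
Z = zscore(X)
print(Z)
```

After scaling, both columns hold the same values, so neither dominates a Euclidean distance.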

1.5. Perform Hierarchical by constructing a Dendrogram using WARD
and Euclidean distance.
Solution:

A dendrogram was constructed using Ward linkage and Euclidean distance to identify the optimum
number of clusters. From the hierarchical clustering we concluded that the optimum number of
clusters is 5, obtained by cutting the dendrogram at a distance of 200. Please refer to the dendrogram shown below.

The dataframe is now stored in an array.

1.6. Make Elbow plot (up to n=10) and identify optimum number of
clusters for k-means algorithm.

Solution:

The steps are: apply K-means clustering, plot the elbow curve, check the silhouette scores,
decide on the appropriate number of clusters, and profile the clusters. Please refer to the
elbow plot shown below.

From the elbow plot, the appropriate number of clusters for K-means clustering is 5, because
beyond 5 clusters the drop in the within-cluster distance becomes small. The silhouette score
is 0.572 for 5 clusters. The cluster profiling is given below.
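
The values behind an elbow plot can be computed as in the sketch below, which uses sklearn's KMeans and its inertia_ attribute (the within-cluster sum of squares). The data is synthetic with three obvious groups, so the elbow appears at k=3 here, not at the 5 found in the report.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated groups of 50 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in centers])

# Within-cluster sum of squares for k = 1..10.
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss.append(km.inertia_)

for k, value in zip(range(1, 11), wss):
    print(k, round(value, 2))

# Plotting range(1, 11) against wss (for example with matplotlib) gives the elbow curve.
```

The elbow is the k after which adding another cluster yields only a small further drop in wss.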

1.7. Print silhouette scores for up to 10 clusters and identify
optimum number of clusters.
Solution:

Silhouette score = 0.5726186038415385

Since the silhouette score is above 0.5, we can conclude that the clusters are well distinguished.
The 5 clusters that were created have a silhouette score of 0.572.
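
Silhouette scores for a range of k can be printed as in the sketch below, using sklearn's silhouette_score. The data is synthetic with three obvious groups, so the best k here is 3. The loop starts at k=2 because the silhouette score is not defined for a single cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated groups of 50 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in centers])

# Silhouette score for k = 2..10.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))

# The k with the highest silhouette score is a candidate for the optimum.
best_k = max(scores, key=scores.get)
print("best k:", best_k)
```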
1.8. Profile the ads based on optimum number of clusters using
silhouette score and your domain understanding
[Hint: Group the data by clusters and take sum or mean to identify
trends in clicks, spend, revenue, CPM, CTR, & CPC based on Device
Type. Make bar plots.]

Solution:
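
The grouping described in the hint can be sketched with pandas as follows. The cluster label column name 'Clus_kmeans' and all values are hypothetical; 'Device Type', 'Clicks', 'Spend' and 'Revenue' follow the names used in the question.

```python
import pandas as pd

# Toy data (hypothetical values) with a cluster label per ad.
ads = pd.DataFrame({
    "Device Type": ["Mobile", "Mobile", "Desktop", "Mobile", "Desktop", "Desktop"],
    "Clus_kmeans": [0, 0, 0, 1, 1, 1],
    "Clicks": [10, 5, 20, 30, 40, 2],
    "Spend": [1.0, 0.5, 2.0, 3.0, 4.0, 0.2],
    "Revenue": [2.0, 1.0, 3.0, 5.0, 6.0, 0.3],
})

# Sum the metrics per cluster and device type to build the cluster profile.
metrics = ["Clicks", "Spend", "Revenue"]
profile = ads.groupby(["Clus_kmeans", "Device Type"])[metrics].sum()
print(profile)

# profile["Clicks"].unstack().plot(kind="bar") would draw a bar plot (needs matplotlib).
```

Replacing sum() with mean() gives the average per ad, which suits ratio metrics such as CPM, CTR and CPC better than a sum.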

1.9. Conclude the project by providing summary of your learnings.

Based on the clustering analysis, here are actionable insights and recommendations:
1. Maximizing CTR: Focus spending on ads belonging to k-means cluster 0, especially those with
an ad size of 84000 and above. These ads exhibit characteristics associated with higher click-
through rates, suggesting that allocating more resources to this cluster could lead to improved
CTR.

2. Optimizing CPC: Allocate more budget towards ads in k-means cluster 2. This cluster
demonstrates the lowest cost per click, indicating that investing more in these ads can help
minimize CPC and maximize the value obtained from each click.

3. Enhancing CPM Efficiency: Prioritize spending on ads categorized under k-means cluster 3.
This cluster is associated with the best cost per 1000 impressions (CPM), suggesting that
increasing investment in these ads can lead to more efficient spending and better value for
impressions served.
4. Revenue Generation: Be cautious with ads categorized under k-means cluster 4, as they
exhibit the lowest revenue generation potential. Consider reassessing the targeting, content, or
placement strategies for ads in this cluster to improve their performance and maximize revenue.

By aligning spending and resource allocation based on these insights, the ads company can optimize
their advertising efforts to achieve their desired outcomes, whether it's maximizing CTR, minimizing
CPC, improving CPM efficiency, or enhancing revenue generation.

Problem 2: PCA

PCA FH (FT): Primary census abstract for female headed households excluding
institutional households (India & States/UTs - District Level), Scheduled tribes - 2011
PCA for Female Headed Household Excluding Institutional Household.

The Indian Census has the reputation of being one of the best in the world. The first
Census in India was conducted in the year 1872. This was conducted at different points
of time in different parts of the country. In 1881 a Census was taken for the entire
country simultaneously. Since then, Census has been conducted every ten years, without
a break. Thus, the Census of India 2011 was the fifteenth in this unbroken series since
1872, the seventh after independence and the second census of the third millennium and
twenty first century. The census has continued uninterrupted despite several
adversities like wars, epidemics, natural calamities, political unrest, etc. The Census of
India is conducted under the provisions of the Census Act 1948 and the Census Rules,
1990. The Primary Census Abstract, which is an important publication of the 2011 Census, gives
basic information on Area, Total Number of Households, Total Population, Scheduled
Castes, Scheduled Tribes Population, Population in the age group 0-6, Literates, Main
Workers and Marginal Workers classified by the four broad industrial categories,
namely, (i) Cultivators, (ii) Agricultural Laborers, (iii) Household Industry Workers, and
(iv) Other Workers and also Non-Workers. The characteristics of the Total Population
include Scheduled Castes, Scheduled Tribes, Institutional and Houseless Population and
are presented by sex and rural-urban residence. Census 2011 covered 35 States/Union
Territories, 640 districts, 5,924 sub-districts, 7,935 Towns and 6,40,867 Villages.
The data collected has so many variables that it is difficult to find useful details
without using Data Science techniques. You are tasked to perform detailed EDA and
identify the optimum Principal Components that explain the most variance in the data. Use
sklearn only.
• Note: The 24 variables given in the Rubric is just for performing EDA. You will
have to consider the entire dataset, including all the variables for performing
PCA.
Data file - PCA India Data Census.xlsx

2.1. Read the data and perform basic checks like checking head, info, summary, nulls,
and duplicates, etc. (4 marks)
Solution:

The census data set has [640 rows x 61 columns].

Out of the 61 features 59 are integers and 2 are of object type.

Here's a summary of the insights from the data provided:

1. Data Size: The dataset comprises 640 rows and 61 columns.

2. Variable Types: There are 2 categorical variables and 59 numerical variables in the dataset.

3. Household Numbers: The number of households ranges from a minimum of 350 to a maximum
of 310,450.

4. Maximum Male Population: Maharashtra's Mumbai Suburban district has the highest number of
males, with 485,417 individuals.

5. Minimum Male Population: Dibang Valley in Arunachal Pradesh has the lowest number of males,
with only 391 individuals.

6. Average Female Population: The average number of females across all locations is 122,372.

7. We have checked for missing values and we did not find any missing values in the data.
8. We have checked for duplicate values and not found any duplicate values in the data.

2.2. Perform detailed Exploratory analysis by creating certain questions like

(i) Which state has highest gender ratio and which has the lowest?
Solution:

Highest – Lakshadweep

Lowest – Andhra Pradesh

(ii) Which district has the highest & lowest gender ratio? (Example Questions).
Highest – Lakshadweep

Lowest – Krishna

EDA Analysis:
Districts in Uttar Pradesh: Uttar Pradesh has the highest number of districts.

Gender Ratio in Lakshadweep: Lakshadweep boasts the highest gender ratio, with 87% of the
population being female.

Gender Ratio in Krishna District: The Krishna district in Andhra Pradesh has the lowest
gender ratio, with only 43% of the population being female.
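
The percentages above suggest that gender ratio is measured as the female share of the population. Under that assumption, which the report does not spell out, a state-level comparison could be sketched as below. The column name 'State' and all numbers are hypothetical; TOT_M and TOT_F are the variable names listed in the question.

```python
import pandas as pd

# Toy district-level data (hypothetical values).
census = pd.DataFrame({
    "State": ["A", "A", "B"],
    "TOT_M": [100, 300, 50],
    "TOT_F": [300, 300, 150],
})

# Aggregate districts to states, then compute the female share in percent.
by_state = census.groupby("State")[["TOT_M", "TOT_F"]].sum()
total = by_state["TOT_M"] + by_state["TOT_F"]
by_state["female_share"] = by_state["TOT_F"] / total * 100

highest = by_state["female_share"].idxmax()
lowest = by_state["female_share"].idxmin()
print(by_state)
print("highest:", highest, "lowest:", lowest)
```

The same computation on the ungrouped rows gives the district-level answer.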

Pick 5 variables out of the given 24 variables below for EDA: No_HH, TOT_M, TOT_F,
M_06, F_06, M_SC, F_SC, M_ST, F_ST, M_LIT, F_LIT, M_ILL, F_ILL,
TOT_WORK_M, TOT_WORK_F, MAINWORK_M, MAINWORK_F, MAIN_CL_M,
MAIN_CL_F, MAIN_AL_M, MAIN_AL_F, MAIN_HH_M, MAIN_HH_F,
MAIN_OT_M, MAIN_OT_F

2.3. We choose not to treat outliers for this case. Do you think that
treating outliers for this case is necessary?
Solution:

Yes, because PCA is sensitive to outliers: it is driven by variances and covariances, which
extreme values inflate. For details, refer to the code.

2.4. Scale the Data using z-score method. Does scaling have any
impact on outliers? Compare boxplots before and after scaling and
comment.

Solution:

The box plot before scaling is plotted on the new data, which contains only the numerical columns.

The data set is then scaled using the z-score, and the top 5 rows of the scaled dataset are checked:

The data has been scaled; now let's check the outliers of the scaled data.

We have used z-score scaling on the data.

We have treated the outliers before scaling the data by using IQR, lower and upper limit.

2.5. Perform all the required steps for PCA (use sklearn only). Create
the covariance matrix. Get eigenvalues and eigenvectors.
Solution:

Eigen values

Eigen vectors
array([[ 3.00700521e-02,  3.00751392e-02,  1.56432451e-01, ...,
         1.31868671e-01,  1.50219557e-01,  1.31179136e-01],
       [-1.62782525e-01, -1.58821825e-01, -1.28322211e-01, ...,
         5.40694563e-02, -5.44095594e-02, -6.94741471e-02],
       [-2.50129023e-01, -2.59359844e-01, -3.34978669e-02, ...,
        -1.83333910e-03,  1.28955424e-01,  8.67015734e-02],
       ...,
       [-0.00000000e+00, -5.35069897e-17,  1.26198190e-15, ...,
         3.46533555e-02,  5.93811345e-02,  9.54738941e-02],
       [-0.00000000e+00, -6.51467592e-18,  1.95751314e-17, ...,
        -1.32087710e-02, -5.09063434e-02, -1.05647215e-01],
       [-0.00000000e+00,  1.01021105e-16,  1.06916841e-16, ...,
         3.13625945e-03, -2.78785905e-02,  3.91423200e-02]])
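
A minimal sketch of how eigenvalues and eigenvectors can be obtained with sklearn is given below. It runs on synthetic data, not the census file. In sklearn's PCA, explained_variance_ holds the eigenvalues of the covariance matrix and components_ holds the eigenvectors (one per row). The NumPy calls are included only as a cross-check.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 rows, 4 columns, with column 1 strongly tied to column 0.
rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 4))
raw[:, 1] = raw[:, 0] * 2 + rng.normal(scale=0.1, size=200)

# z-score scaling, then PCA with all components kept.
X = StandardScaler().fit_transform(raw)
pca = PCA().fit(X)

cov_matrix = np.cov(X, rowvar=False)   # sample covariance matrix (ddof=1)
eigenvalues = pca.explained_variance_  # eigenvalues, largest first
eigenvectors = pca.components_         # one eigenvector per row

# Cross-check: the eigenvalues of the covariance matrix match sklearn's values.
eig_np = np.linalg.eigvalsh(cov_matrix)[::-1]
print(np.round(eigenvalues, 4))
print(np.round(eig_np, 4))
```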

To achieve 90% explained variance we need 6 components, as the cumulative variance
explained (in percentage) shows.
Cumulative Variance Explained in Percentage:
[ 59.66  73.42  81.08  86.05  89.61  91.69  93.7   94.94  95.73  96.31
  96.83  97.3   97.63  97.93  98.2   98.41  98.61  98.79  98.94  99.07
  99.2   99.31  99.41  99.49  99.56  99.62  99.67  99.71  99.75  99.79
  99.82  99.84  99.87  99.89  99.91  99.92  99.93  99.95  99.95  99.96
  99.97  99.98  99.98  99.98  99.99  99.99  99.99  99.99  99.99 100.
 100.   100.   100.   100.   100.   100.   100.  ]

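2.6. Identify the optimum number of PCs (for this project, take at least
90% explained variance). Show Scree plot.
Solution:

The optimum number of PCs is 6: the first six components explain 91.69% of the variance,
which is above the 90% threshold, and the scree plot flattens after that.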
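
The choice of 6 components can be checked directly from the cumulative percentages reported in this document. The sketch below uses only the first eight reported values; the variable names are illustrative.

```python
import numpy as np

# First eight cumulative explained-variance percentages reported in this document.
cum_var = np.array([59.66, 73.42, 81.08, 86.05, 89.61, 91.69, 93.7, 94.94])

# Smallest number of components whose cumulative variance reaches 90%.
n_pcs = int(np.argmax(cum_var >= 90) + 1)
print(n_pcs)  # 6

# Variance explained by each component on its own (the scree plot values).
individual = np.diff(cum_var, prepend=0.0)
print(np.round(individual, 2))
```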

2.7. Compare PCs with Actual Columns and identify which is explaining most variance.
Write inferences about all the principal components in terms of actual variables.
Solution:

To achieve 90% explained variance we need 6 components, as the cumulative variance
explained (in percentage) shows.
Cumulative Variance Explained in Percentage:
[ 59.66  73.42  81.08  86.05  89.61  91.69  93.7   94.94  95.73  96.31
  96.83  97.3   97.63  97.93  98.2   98.41  98.61  98.79  98.94  99.07
  99.2   99.31  99.41  99.49  99.56  99.62  99.67  99.71  99.75  99.79
  99.82  99.84  99.87  99.89  99.91  99.92  99.93  99.95  99.95  99.96
  99.97  99.98  99.98  99.98  99.99  99.99  99.99  99.99  99.99 100.
 100.   100.   100.   100.   100.   100.   100.  ]

2.8. Write linear equation for first PC.
Solution:

PC1 = (0.03) * StateCode + (0.03) * DistCode + (0.15) * No_HH
+ (0.16) * TOT_M + (0.16) * TOT_F + (0.16) * M_06 + (0.16) * F_06
+ (0.15) * M_SC + (0.15) * F_SC + (0.03) * M_ST + (0.03) * F_ST
+ (0.16) * M_LIT + (0.15) * F_LIT + (0.16) * M_ILL + (0.17) * F_ILL
+ (0.16) * TOT_WORK_M + (0.15) * TOT_WORK_F
+ (0.15) * MAINWORK_M + (0.12) * MAINWORK_F
+ (0.1) * MAIN_CL_M + (0.07) * MAIN_CL_F
+ (0.11) * MAIN_AL_M + (0.07) * MAIN_AL_F
+ (0.13) * MAIN_HH_M + (0.08) * MAIN_HH_F
+ (0.12) * MAIN_OT_M + (0.11) * MAIN_OT_F
+ (0.16) * MARGWORK_M + (0.16) * MARGWORK_F
+ (0.08) * MARG_CL_M + (0.05) * MARG_CL_F
+ (0.13) * MARG_AL_M + (0.11) * MARG_AL_F
+ (0.14) * MARG_HH_M + (0.13) * MARG_HH_F
+ (0.16) * MARG_OT_M + (0.15) * MARG_OT_F
+ (0.16) * MARGWORK_3_6_M + (0.16) * MARGWORK_3_6_F
+ (0.17) * MARG_CL_3_6_M + (0.16) * MARG_CL_3_6_F
+ (0.09) * MARG_AL_3_6_M + (0.05) * MARG_AL_3_6_F
+ (0.13) * MARG_HH_3_6_M + (0.11) * MARG_HH_3_6_F
+ (0.14) * MARG_OT_3_6_M + (0.12) * MARG_OT_3_6_F
+ (0.15) * MARGWORK_0_3_M + (0.15) * MARGWORK_0_3_F + (0.15)
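
To make the equation concrete, the sketch below evaluates the contribution of the first five terms for one row of scaled data. The loadings are the first five coefficients from the equation; the scaled values (all 1.0) and the function name pc1_partial are hypothetical, and a full PC1 score would sum over every variable in the equation.

```python
# First five loadings of PC1, taken from the equation in this section.
loadings = {
    "StateCode": 0.03,
    "DistCode": 0.03,
    "No_HH": 0.15,
    "TOT_M": 0.16,
    "TOT_F": 0.16,
}


def pc1_partial(scaled_row):
    """Weighted sum of the scaled values over the variables in `loadings`."""
    return sum(loadings[name] * scaled_row[name] for name in loadings)


# Hypothetical scaled values: every variable one standard deviation above its mean.
scaled_row = {name: 1.0 for name in loadings}
score = pc1_partial(scaled_row)
print(round(score, 2))  # 0.53
```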

Thank You
