ML 1
LEARNING1
Janhavi Gupta
Problem 1: Clustering - Digital Ads Data:
ads24x7 is a digital marketing company that has received seed funding of $10
million. They are expanding into marketing analytics. They collected data
from their Marketing Intelligence team and now want you (their newly appointed data
analyst) to segment the types of ads based on the features provided. Use a clustering
procedure to segment the ads into homogeneous groups.
1.1. Read the data and perform basic analysis such as printing a few
rows (head and tail), info, data summary, null values, duplicate
values, etc.
Solution:
The data has 19 attributes, 6 of object type and 13 floats.
CTR, CPM and CPC each have 4736 null values; the remaining variables have no null
values.
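The basic checks described above can be sketched as follows. This is a minimal illustration on a hypothetical four-row table, not the actual ads file; only the column names `Impressions`, `Clicks`, `Spend`, `CTR` and `Device Type` are taken from the assignment text, and all values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the ads table (values are illustrative only).
ads = pd.DataFrame({
    "Device Type": ["Mobile", "Desktop", "Mobile", "Mobile"],
    "Impressions": [1000.0, 2500.0, 1000.0, 400.0],
    "Clicks": [10.0, 50.0, 10.0, 8.0],
    "Spend": [2.0, 7.5, 2.0, 1.0],
    "CTR": [1.0, np.nan, 1.0, 2.0],
})

print(ads.head(2))     # first rows
print(ads.tail(2))     # last rows
ads.info()             # column dtypes and non-null counts
print(ads.describe())  # summary statistics for numeric columns

null_counts = ads.isnull().sum()              # nulls per column
n_duplicates = int(ads.duplicated().sum())    # fully duplicated rows
print(null_counts)
print("Duplicate rows:", n_duplicates)
```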
Observation
Based on the provided data, it's apparent that the ads company's objective is to enhance the
Click-Through Rate (CTR) while minimizing both the Cost Per Mille (CPM) and Cost Per Click
(CPC). Here's a summary of the data:
• The maximum fee observed is 0.35, while the minimum fee is 0.21.
1.2 Treat missing values in CPC, CTR and CPM using the formula given. You may
refer to the Bank_KMeans Solution File to understand the coding behind treating
the missing values using a specific formula. You have to basically create a user
defined function and then call the function for imputing.
CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Note that
the Total Campaign Spend refers to the 'Spend' Column in the dataset and the Number
of Impressions refers to the 'Impressions' Column in the dataset.
CPC = Total Cost (spend) / Number of Clicks. Note that the Total Cost (spend)
refers to the 'Spend' Column in the dataset and the Number of Clicks refers to the
'Clicks' Column in the dataset.
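A user-defined imputation function along these lines might look like the sketch below. The CPM and CPC formulas are the ones stated above. The CTR formula (clicks divided by impressions, times 100) is the standard definition and is an assumption here, since it is not reproduced in this report. The function name `impute_ad_metrics` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_ad_metrics(df):
    """Fill missing CTR, CPM and CPC from Clicks, Impressions and Spend."""
    out = df.copy()
    # CPM = (Spend / Impressions) * 1000
    out["CPM"] = out["CPM"].fillna(out["Spend"] / out["Impressions"] * 1000)
    # CPC = Spend / Clicks
    out["CPC"] = out["CPC"].fillna(out["Spend"] / out["Clicks"])
    # Assumption: CTR is a percentage, (Clicks / Impressions) * 100
    out["CTR"] = out["CTR"].fillna(out["Clicks"] / out["Impressions"] * 100)
    return out

# Illustrative data: the second row has all three metrics missing.
sample = pd.DataFrame({
    "Impressions": [1000.0, 2000.0],
    "Clicks": [10.0, 40.0],
    "Spend": [5.0, 8.0],
    "CTR": [1.0, np.nan],
    "CPM": [5.0, np.nan],
    "CPC": [0.5, np.nan],
})
filled = impute_ad_metrics(sample)
print(filled)
```

Only the missing entries are filled; existing CTR, CPM and CPC values are left untouched.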
1.3 Check if there are any outliers. Do you think treating outliers is
necessary for K-Means clustering? Based on your judgement decide
whether to treat outliers and if yes, which method to employ. (As an
analyst your judgement may be different from another analyst).
Solution
All features except Ad – Length and Ad – Width have outliers, as shown by the box plots
below.
K-means clustering is sensitive to outliers because cluster centroids are means, so outlier
treatment is necessary. Outliers were capped at the lower and upper bounds, with
lower_range = Q1 - (1.5 * IQR) and upper_range = Q3 + (1.5 * IQR).
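The capping rule can be sketched as a small helper. This is an illustration under the assumption that values outside the bounds are clipped to the bounds (rather than dropped); the function name `cap_outliers_iqr` and the sample series are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series):
    """Clip values to [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower_range = q1 - 1.5 * iqr
    upper_range = q3 + 1.5 * iqr
    return series.clip(lower=lower_range, upper=upper_range)

# Illustrative column with one extreme value.
spend = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
capped = cap_outliers_iqr(spend)
print(capped.tolist())
```

Here Q1 = 2 and Q3 = 4, so the upper bound is 7 and the value 100 is pulled down to 7 while the other values are unchanged.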
1.4. Perform z-score scaling and discuss how it affects the speed of
the algorithm.
Solution:
Scaling, i.e. z = (x - u) / s, where u is the column mean and s its standard deviation, is
required because some variables are in the hundreds and thousands while others are single
digits. Below is the scaled data:
-0.88461411]
[-1.13489073 1.29058999 -0.40096966 ... 9.88896203 3.16271759
-0.82143521]
[ 1.43309269 -0.18659865 1.93908609 ... 4.4904711 3.16271759
-0.7582563 ]]
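As a sketch of the z-score step, the snippet below scales a small made-up matrix with sklearn's `StandardScaler` and checks it against the formula z = (x - u) / s computed by hand. The data is illustrative; note that `StandardScaler` uses the population standard deviation (ddof = 0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two illustrative features on very different scales.
X = np.array([[100.0, 1.0],
              [200.0, 2.0],
              [300.0, 3.0],
              [400.0, 6.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Same computation by hand: subtract the column mean, divide by the column std.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled)
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the Euclidean distances used by the clustering algorithms.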
1.5. Perform Hierarchical by constructing a Dendrogram using WARD
and Euclidean distance.
Solution:
Construct a dendrogram using Ward linkage and Euclidean distance - Identify the optimum number
of clusters. After performing hierarchical clustering with Ward linkage, we concluded that the
optimum number of clusters is 5, cutting the tree at a distance of 200. Please refer to the
dendrogram shown below.
The dataframe is now stored in an array.
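A minimal sketch of the Ward/Euclidean step with SciPy follows. It runs on synthetic, well-separated points rather than the ads data, so the cluster count (3) and variable names are illustrative only. `no_plot=True` returns the dendrogram layout without drawing it; drop that argument to render the figure with matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Synthetic stand-in for the scaled array: three tight groups of 20 points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_scaled = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Ward linkage on Euclidean distances.
Z = linkage(X_scaled, method="ward", metric="euclidean")

# Truncated dendrogram showing only the last 10 merges.
tree = dendrogram(Z, no_plot=True, truncate_mode="lastp", p=10)

# Cut the tree into a fixed number of clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.unique(labels, return_counts=True))
```

To cut at a height instead of a cluster count, as in the distance-based reading of the dendrogram above, `fcluster(Z, t=<height>, criterion="distance")` can be used.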
1.6. Make Elbow plot (up to n=10) and identify optimum number of
clusters for k-means algorithm.
Solution:
Apply K-means Clustering - Plot the Elbow curve - Check Silhouette Scores - Figure out the
appropriate number of clusters - Cluster Profiling. Please refer to the elbow plot shown below.
Looking at the elbow plot, the appropriate number of clusters for K-means is 5, because beyond
5 clusters the drop in within-cluster distance flattens out. The silhouette score for the
5 clusters is 0.572. Since this is above 0.5, we can conclude that the clusters are reasonably
well separated. The cluster profiling is presented below.
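The elbow and silhouette checks can be sketched as below. The data is synthetic (four well-separated groups), so the best k found here is 4 and says nothing about the ads data; `random_state` and `n_init` are arbitrary choices made for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data: four tight groups of 30 points.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0], [8.0, 8.0]])
X_scaled = np.vstack([c + rng.normal(scale=0.6, size=(30, 2)) for c in centers])

# Within-cluster sum of squares (inertia) for k = 1..10: the elbow curve values.
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_scaled)
    wss.append(km.inertia_)

# Silhouette score needs at least 2 clusters.
sil = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X_scaled)
    sil[k] = silhouette_score(X_scaled, labels)

best_k = max(sil, key=sil.get)
print("WSS:", np.round(wss, 1))
print("Best k by silhouette:", best_k)
```

Plotting `wss` against `range(1, 11)` gives the elbow curve; the elbow is where the curve stops dropping sharply.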
1.8. Profile the ads based on optimum number of clusters using
silhouette score and your domain understanding
[Hint: Group the data by clusters and take sum or mean to identify
trends in clicks, spend, revenue, CPM, CTR, & CPC based on Device
Type. Make bar plots.]
Solution:
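The grouping described in the hint can be sketched as follows. The table is a hypothetical five-row example, and the cluster-label column name `Clus_kmeans` is an assumption; only the metric names come from the hint.

```python
import pandas as pd

# Hypothetical ads table with a K-means label column already attached.
ads = pd.DataFrame({
    "Clus_kmeans": [0, 0, 1, 1, 1],
    "Device Type": ["Mobile", "Desktop", "Mobile", "Mobile", "Desktop"],
    "Clicks": [10, 20, 5, 15, 30],
    "Spend": [2.0, 4.0, 1.0, 3.0, 6.0],
    "Revenue": [3.0, 5.0, 1.5, 4.5, 9.0],
})

# Mean of each metric per cluster and device type.
profile = (
    ads.groupby(["Clus_kmeans", "Device Type"])[["Clicks", "Spend", "Revenue"]]
    .mean()
)
print(profile)
```

Calling `profile["Clicks"].unstack().plot(kind="bar")` (with matplotlib installed) would give one bar group per cluster, split by device type.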
1.9. Conclude the project by providing summary of your learnings.
Based on the clustering analysis, here are actionable insights and recommendations:
1. Maximizing CTR: Focus spending on ads belonging to k-means cluster 0, especially those with
an ad size of 84000 and above. These ads exhibit characteristics associated with higher click-
through rates, suggesting that allocating more resources to this cluster could lead to improved
CTR.
2. Optimizing CPC: Allocate more budget towards ads in k-means cluster 2. This cluster
demonstrates the lowest cost per click, indicating that investing more in these ads can help
minimize CPC and maximize the value obtained from each click.
3. Enhancing CPM Efficiency: Prioritize spending on ads categorized under k-means cluster 3.
This cluster is associated with the best cost per 1000 impressions (CPM), suggesting that
increasing investment in these ads can lead to more efficient spending and better value for
impressions served.
4. Revenue Generation: Be cautious with ads categorized under k-means cluster 4, as they
exhibit the lowest revenue generation potential. Consider reassessing the targeting, content, or
placement strategies for ads in this cluster to improve their performance and maximize revenue.
By aligning spending and resource allocation based on these insights, the ads company can optimize
their advertising efforts to achieve their desired outcomes, whether it's maximizing CTR, minimizing
CPC, improving CPM efficiency, or enhancing revenue generation.
Problem 2: PCA
PCA FH (FT): Primary census abstract for female headed households excluding
institutional households (India & States/UTs - District Level), Scheduled tribes - 2011
PCA for Female Headed Household Excluding Institutional Household.
The Indian Census has the reputation of being one of the best in the world. The first
Census in India was conducted in the year 1872. This was conducted at different points
of time in different parts of the country. In 1881 a Census was taken for the entire
country simultaneously. Since then, Census has been conducted every ten years, without
a break. Thus, the Census of India 2011 was the fifteenth in this unbroken series since
1872, the seventh after independence and the second census of the third millennium and
twenty-first century. The census has continued uninterrupted despite several
adversities like wars, epidemics, natural calamities, political unrest, etc. The Census of
India is conducted under the provisions of the Census Act 1948 and the Census Rules,
1990. The Primary Census Abstract, which is an important publication of the 2011 Census, gives
basic information on Area, Total Number of Households, Total Population, Scheduled
Castes, Scheduled Tribes Population, Population in the age group 0-6, Literates, Main
Workers and Marginal Workers classified by the four broad industrial categories,
namely, (i) Cultivators, (ii) Agricultural Laborers, (iii) Household Industry Workers, and
(iv) Other Workers and also Non-Workers. The characteristics of the Total Population
include Scheduled Castes, Scheduled Tribes, Institutional and Houseless Population and
are presented by sex and rural-urban residence. Census 2011 covered 35 States/Union
Territories, 640 districts, 5,924 sub-districts, 7,935 Towns and 6,40,867 Villages.
The data collected has so many variables that it is difficult to find useful details
without using data science techniques. You are tasked to perform a detailed EDA and
identify the optimum principal components that explain the most variance in the data. Use
sklearn only.
• Note: The 24 variables given in the rubric are just for performing EDA. You will
have to consider the entire dataset, including all the variables, for performing
PCA.
Data file - PCA India Data Census.xlsx
2.1. Read the data and perform basic checks like checking head, info, summary, nulls,
and duplicates, etc. (4 marks)
Solution:
2. Variable Types: There are 2 categorical variables and 59 numerical variables in the dataset.
3. Household Numbers: The number of households ranges from a minimum of 350 to a maximum
of 310,450.
4. Maximum Male Population: Maharashtra's Mumbai Suburban district has the highest number of
males, with 485,417 individuals.
5. Minimum Male Population: Dibang Valley in Arunachal Pradesh has the lowest number of males,
with only 391 individuals.
6. Average Female Population: The average number of females across all locations is 122,372.
7. We have checked for missing values and we did not find any missing values in the data.
8. We have checked for duplicate values and not found any duplicate values in the data.
Highest – Lakshadweep
(ii) Which district has the highest & lowest gender ratio? (Example Questions).
Highest – Lakshadweep
Lowest – Krishna
EDA Analysis:
Districts in Uttar Pradesh: Uttar Pradesh has the highest number of districts.
Gender Ratio in Lakshadweep: Lakshadweep has the highest gender ratio, with 87% of the
population being female.
Gender Ratio in Krishna District: The Krishna district in Andhra Pradesh has the lowest
gender ratio, with only 43% of the population being female.
Pick 5 variables out of the given 24 variables below for EDA: No_HH, TOT_M, TOT_F,
M_06, F_06, M_SC, F_SC, M_ST, F_ST, M_LIT, F_LIT, M_ILL, F_ILL,
TOT_WORK_M, TOT_WORK_F, MAINWORK_M, MAINWORK_F, MAIN_CL_M,
MAIN_CL_F, MAIN_AL_M, MAIN_AL_F, MAIN_HH_M, MAIN_HH_F,
MAIN_OT_M, MAIN_OT_F
2.3. We choose not to treat outliers for this case. Do you think that
treating outliers for this case is necessary?
Solution:
2.4. Scale the Data using z-score method. Does scaling have any
impact on outliers? Compare boxplots before and after scaling and
comment.
Solution:
Box plots are plotted before scaling on the new data, which contains only the numerical columns.
The dataset is then scaled using the z-score, and the top 5 rows of the scaled dataset are checked:
The data has been scaled; now let's check the outliers in the scaled data.
We treated the outliers before scaling the data, using the IQR-based lower and upper limits.
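On the question of whether scaling affects outliers: z-score scaling is a shift and a positive rescale of each column, so the quartiles and the 1.5 * IQR fences move with the data and the same points stay outside them. The sketch below illustrates this on a made-up column; the values 350 and 310450 are the minimum and maximum household counts reported earlier, the rest are hypothetical, and `count_iqr_outliers` is an illustrative helper name.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def count_iqr_outliers(values):
    """Count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(mask.sum())

# Illustrative household counts with one extreme district.
raw = np.array([350.0, 900.0, 1200.0, 1500.0, 2100.0, 310450.0])
scaled = StandardScaler().fit_transform(raw.reshape(-1, 1)).ravel()

before = count_iqr_outliers(raw)
after = count_iqr_outliers(scaled)
print("Outliers before scaling:", before)
print("Outliers after scaling:", after)
```

The box plots before and after scaling therefore have the same shape; only the axis units change.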
2.5. Perform all the required steps for PCA (use sklearn only). Create
the covariance matrix. Get eigenvalues and eigenvectors.
Solution:
Eigen values
Eigen vectors
array([[ 3.00700521e-02,  3.00751392e-02,  1.56432451e-01, ...,  1.31868671e-01,  1.50219557e-01,  1.31179136e-01],
       [-1.62782525e-01, -1.58821825e-01, -1.28322211e-01, ...,  5.40694563e-02, -5.44095594e-02, -6.94741471e-02],
       [-2.50129023e-01, -2.59359844e-01, -3.34978669e-02, ..., -1.83333910e-03,  1.28955424e-01,  8.67015734e-02],
       ...,
       [-0.00000000e+00, -5.35069897e-17,  1.26198190e-15, ...,  3.46533555e-02,  5.93811345e-02,  9.54738941e-02],
       [-0.00000000e+00, -6.51467592e-18,  1.95751314e-17, ..., -1.32087710e-02, -5.09063434e-02, -1.05647215e-01],
       [-0.00000000e+00,  1.01021105e-16,  1.06916841e-16, ...,  3.13625945e-03, -2.78785905e-02,  3.91423200e-02]])
Six components are needed to reach 90% explained variance: the first five explain 89.61% and
the first six explain 91.69%, as the cumulative explained variance below shows.
Cumulative Variance Explained in Percentage:
[ 59.66 73.42 81.08 86.05 89.61 91.69 93.7  94.94 95.73 96.31
  96.83 97.3  97.63 97.93 98.2  98.41 98.61 98.79 98.94 99.07
  99.2  99.31 99.41 99.49 99.56 99.62 99.67 99.71 99.75 99.79
  99.82 99.84 99.87 99.89 99.91 99.92 99.93 99.95 99.95 99.96
  99.97 99.98 99.98 99.98 99.99 99.99 99.99 99.99 99.99 100.
  100.  100.  100.  100.  100.  100.  100. ]
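A sketch of how the covariance matrix, eigenvalues, eigenvectors and cumulative variance can be obtained from sklearn's `PCA` is shown below. It runs on a small synthetic matrix, not the census data, so the numbers it prints are unrelated to the output above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 rows, 4 columns, two of them strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1],
    rng.normal(size=200),
])
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)               # keep all components

cov_matrix = pca.get_covariance()       # covariance of the scaled data
eigen_values = pca.explained_variance_  # eigenvalues, largest first
eigen_vectors = pca.components_         # one eigenvector per row
cum_var_pct = np.cumsum(pca.explained_variance_ratio_) * 100

print("Eigenvalues:", np.round(eigen_values, 3))
print("Cumulative variance (%):", np.round(cum_var_pct, 2))
```

Each row of `pca.components_` is an eigenvector of the covariance matrix, and the matching entry of `explained_variance_` is its eigenvalue.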
2.6. Identify the optimum number of PCs (for this project, take at least
90% explained variance). Show Scree plot.
Solution:
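The 90% threshold can be applied directly to the cumulative variance figures reported in section 2.5. The sketch below uses only the first ten of those values and recovers the per-component percentages that a scree plot would display.

```python
import numpy as np

# First ten cumulative explained-variance percentages reported in section 2.5.
cum_var_pct = np.array([59.66, 73.42, 81.08, 86.05, 89.61,
                        91.69, 93.70, 94.94, 95.73, 96.31])

# Smallest number of components whose cumulative variance reaches 90%.
n_components = int(np.argmax(cum_var_pct >= 90) + 1)

# Per-component variance, i.e. the heights of the scree plot.
individual_pct = np.diff(cum_var_pct, prepend=0.0)

print("Components needed for 90%:", n_components)
print("Per-component variance (%):", np.round(individual_pct, 2))
```

Alternatively, passing a fraction such as `PCA(n_components=0.90)` asks sklearn to keep just enough components to exceed that share of explained variance.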
2.7. Compare PCs with Actual Columns and identify which is explaining most variance.
Write inferences about all the principal components in terms of actual variables.
Solution:
2.8. Write linear equation for first PC.
Solution:
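The first principal component is a weighted sum of the scaled variables, with the weights given by the first row of `pca.components_`. The sketch below shows how such an equation could be assembled; it uses three column names from the EDA list with synthetic values, so the printed coefficients are not the census results.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three census columns (values are made up).
columns = ["No_HH", "TOT_M", "TOT_F"]
rng = np.random.default_rng(1)
size = rng.uniform(100, 1000, size=50)
X = np.column_stack([
    size,
    2.0 * size + rng.normal(scale=20, size=50),
    2.2 * size + rng.normal(scale=20, size=50),
])
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)
loadings = pca.components_[0]           # weights of the first PC

# PC1 = w1 * x1 + w2 * x2 + ... on the scaled variables.
equation = "PC1 = " + " + ".join(
    f"({w:.3f} * {c})" for w, c in zip(loadings, columns)
)
print(equation)

# The same weighted sum computed directly gives the PC1 scores.
pc1_scores = X_scaled @ loadings
```

The sign of the loadings is arbitrary: flipping every coefficient gives an equally valid first component.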
Thank You