The document discusses clustering digital advertising data to segment ads into homogeneous groups. It describes preprocessing the data by treating missing values, identifying and handling outliers, and scaling the data. Various clustering algorithms are applied including hierarchical clustering to identify optimal k for k-means clustering, and k-means is used to cluster the ads and profile them by cluster.


MACHINE LEARNING 1
Janhavi Gupta
Table of Contents

Problem 1: Clustering - Digital Ads Data

The ads24x7 is a Digital Marketing company which has now got seed funding of $10 Million.
They are expanding their wings in Marketing Analytics. They collected data from their
Marketing Intelligence team and now wants you (their newly appointed data analyst) to segment
type of ads based on the features provided. Use Clustering procedure to segment ads into
homogeneous groups.

Perform the following in given order:
1.1. Read the data and perform basic analysis such as printing a few rows (head and tail),
info, data summary, null values duplicate values, etc. (4 marks)
1.2. Treat missing values in CPC, CTR and CPM using the formula given. You may refer
to the Bank_KMeans Solution File to understand the coding behind treating the
missing values using a specific formula. You have to basically create a user defined function
and then call the function for imputing. (4 marks)
1.3. Check if there are any outliers. Do you think treating outliers is necessary for K-Means
clustering? Based on your judgement decide whether to treat outliers and if yes,
which method to employ. (As an analyst your judgement may be different from another
analyst). (3 marks)
1.4. Perform z-score scaling and discuss how it affects the speed of the algorithm. (3 marks)

Perform clustering and do the following:
1.5. Perform Hierarchical by constructing a Dendrogram using WARD and Euclidean
distance. (4 marks)
1.6. Make Elbow plot (up to n=10) and identify optimum number of clusters for k-means
algorithm. (4 marks)
1.7. Print silhouette scores for up to 10 clusters and identify optimum number of clusters.
(4 marks)
1.8. Profile the ads based on optimum number of clusters using silhouette score and your
domain understanding [Hint: Group the data by clusters and take sum or mean to identify
trends in clicks, spend, revenue, CPM, CTR, & CPC based on Device Type. Make bar plots.]
(4 marks)
1.9. Conclude the project by providing summary of your learnings. (3 marks)

Problem 2: PCA

PCA FH (FT): Primary census abstract for female headed households excluding institutional
households (India & States/UTs - District Level), Scheduled tribes - 2011 PCA for Female
Headed Household Excluding Institutional Household.

2.1. Read the data and perform basic checks like checking head, info, summary, nulls, and
duplicates, etc. (4 marks)
2.2. Perform detailed Exploratory analysis by creating certain questions like (i) Which state
has highest gender ratio and which has the lowest? (ii) Which district has the highest & lowest
gender ratio? (Example Questions). Pick 5 variables out of the given 24 variables below for
EDA: No_HH, TOT_M, TOT_F, M_06, F_06, M_SC, F_SC, M_ST, F_ST, M_LIT, F_LIT,
M_ILL, F_ILL, TOT_WORK_M, TOT_WORK_F, MAINWORK_M, MAINWORK_F,
MAIN_CL_M, MAIN_CL_F, MAIN_AL_M, MAIN_AL_F, MAIN_HH_M, MAIN_HH_F,
MAIN_OT_M, MAIN_OT_F (6 marks)
2.3. We choose not to treat outliers for this case. Do you think that treating outliers for this
case is necessary? (1 marks)
2.4. Scale the Data using z-score method. Does scaling have any impact on outliers? Compare
boxplots before and after scaling and comment. (3 marks)
2.5. Perform all the required steps for PCA (use sklearn only) Create the covariance Matrix
Get eigen values and eigen vector. (4 marks)
2.6. Identify the optimum number of PCs (for this project, take at least 90% explained
variance). Show Scree plot. (3 marks)
2.7. Compare PCs with Actual Columns and identify which is explaining most variance. Write
inferences about all the principal components in terms of actual variables. (4 marks)
2.8. Write linear equation for first PC. (2 marks)

Problem 1: Clustering - Digital Ads Data:

Perform the following in given order:

1.1. Read the data and perform basic analysis such as printing a few
rows (head and tail), info, data summary, null values, duplicate
values, etc.

Solution:

The ads24x7 data has 23066 rows and 19 columns.

The data has 19 attributes: 6 of object type and 13 floats.

CTR, CPM and CPC each have 4736 null values; the remaining variables have no null values.

Observation

Based on the provided data, it's apparent that the ads company's objective is to enhance the
Click-Through Rate (CTR) while minimizing both the Cost Per Mille (CPM) and Cost Per Click
(CPC). Here's a summary of the data:

• There are a total of 6 categorical and 13 numeric variables in the dataset.

• No duplicate values were found in the dataset.

• The dataset comprises 23066 rows and 19 columns.

• Ad Length ranges from a minimum of 120 to a maximum of 728, with an average Ad width of 337.

• Clicks range from a minimum of 1 to a maximum of 143049; the median number of Impressions is 225290.

• The maximum fee observed is 0.35, while the minimum fee is 0.21.

• There's a notable correlation between Spends and revenue.

• Available Impressions and Matched Queries appear to be correlated.

• Additionally, Available Impressions and Impressions are correlated, and Matched Queries and Impressions are highly correlated.

1.2 Treat missing values in CPC, CTR and CPM using the formula given. You may
refer to the Bank_KMeans Solution File to understand the coding behind treating
the missing values using a specific formula. You have to basically create a user
defined function and then call the function for imputing.

We found 4736 missing values in each of CPC, CPM and CTR.

The missing values were imputed by calling a user-defined function that applies the formulas
below:

CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Note that
the Total Campaign Spend refers to the 'Spend' Column in the dataset and the Number
of Impressions refers to the 'Impressions' Column in the dataset.
CPC = Total Cost (spend) / Number of Clicks. Note that the Total Cost (spend)
refers to the 'Spend' Column in the dataset and the Number of Clicks refers to the
'Clicks' Column in the dataset.

CTR = (Total Measured Clicks / Total Measured Ad Impressions) x 100. Note that
the Total Measured Clicks refers to the 'Clicks' Column in the dataset and the Total
Measured Ad Impressions refers to the 'Impressions' Column in the dataset.
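
The three formulas can be wrapped in a user-defined function, roughly as sketched below. This is a minimal illustration and not the report's own code: the function name impute_ratio_metrics and the two-row sample frame are made up for the example, while the column names 'Spend', 'Impressions', 'Clicks', CPM, CPC and CTR follow the text above.

```python
import numpy as np
import pandas as pd


def impute_ratio_metrics(df):
    """Fill missing CPM, CPC and CTR from Spend, Impressions and Clicks.

    Hypothetical helper: existing (non-null) values are left untouched.
    """
    out = df.copy()
    cpm = out["Spend"] / out["Impressions"] * 1000   # cost per 1000 impressions
    cpc = out["Spend"] / out["Clicks"]               # cost per click
    ctr = out["Clicks"] / out["Impressions"] * 100   # click-through rate in percent
    out["CPM"] = out["CPM"].fillna(cpm)
    out["CPC"] = out["CPC"].fillna(cpc)
    out["CTR"] = out["CTR"].fillna(ctr)
    return out


# Tiny made-up sample: the second row has all three metrics missing.
ads = pd.DataFrame({
    "Spend": [10.0, 20.0],
    "Impressions": [1000, 4000],
    "Clicks": [50, 100],
    "CPM": [10.0, np.nan],
    "CPC": [0.2, np.nan],
    "CTR": [5.0, np.nan],
})
ads_imputed = impute_ratio_metrics(ads)
print(ads_imputed[["CPM", "CPC", "CTR"]])
```

Using fillna keeps the values that were already present and only fills the gaps, which matches the idea of imputing rather than recomputing every row.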

1.3 Check if there are any outliers. Do you think treating outliers is
necessary for K-Means clustering? Based on your judgement decide
whether to treat outliers and if yes, which method to employ. (As an
analyst your judgement may be different from another analyst).

Solution

All features except Ad – Length and Ad – Width have outliers as shown by the Box plots
below.

K-means clustering is sensitive to outliers because each centroid is a mean, so outlier
treatment is necessary. It was done with the IQR-based lower and upper bound method, using
lower_range = Q1 - (1.5 * IQR) and upper_range = Q3 + (1.5 * IQR).
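
One common way to apply these bounds is to cap values at lower_range and upper_range. The sketch below assumes that capping is the treatment used, which the report does not state explicitly; the function name cap_outliers_iqr and the sample values are made up for illustration.

```python
import pandas as pd


def cap_outliers_iqr(series):
    """Cap values outside the IQR fences (assumed treatment: capping)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower_range = q1 - 1.5 * iqr
    upper_range = q3 + 1.5 * iqr
    # clip replaces anything below/above the fences with the fence value
    return series.clip(lower=lower_range, upper=upper_range)


# Made-up values: 100.0 lies far above the upper fence.
spend = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
capped = cap_outliers_iqr(spend)
print(capped.tolist())
```

Capping keeps every row in the dataset, which matters here because dropping rows would remove ads from the segmentation.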

Box plots of features post outlier treatments:

1.4. Perform z-score scaling and discuss how it affects the speed of
the algorithm.

Solution:

Scaling (i.e. z = (x - u) / s, where u is the feature mean and s its standard deviation) is
required because some variables are in the hundreds and thousands while others are single
digits. Below is the scaled data:

[[-0.3644957 -0.43279676 -0.3522185 ... -0.87459265 -1.19449791
-1.04256138]
[-0.3644957 -0.43279676 -0.3522185 ... -0.87013569 -1.19449791
-1.04256138]
[-0.3644957 -0.43279676 -0.3522185 ... -0.87760581 -1.19449791
-1.04256138]
...
[ 1.43309269 -0.18659865 1.93908609 ... 9.88896203 3.16271759
-0.88461411]
[-1.13489073 1.29058999 -0.40096966 ... 9.88896203 3.16271759
-0.82143521]
[ 1.43309269 -0.18659865 1.93908609 ... 4.4904711 3.16271759
-0.7582563 ]]

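Scaling puts every feature on a comparable scale (mean 0, standard deviation 1), so no single
large-valued feature dominates the Euclidean distances. This can also help the algorithm
converge in fewer iterations.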
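
As a small check of the formula, the sketch below scales a made-up two-column array with scikit-learn's StandardScaler and compares the result with the manual z = (x - u) / s calculation. The array values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: the first column is in the hundreds, the second in single digits.
X = np.array([[100.0, 1.0],
              [200.0, 2.0],
              [300.0, 3.0],
              [400.0, 6.0]])

X_scaled = StandardScaler().fit_transform(X)

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for std().
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, manual))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```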

1.5. Perform Hierarchical by constructing a Dendrogram using WARD
and Euclidean distance.
Solution:

Construct a dendrogram using Ward linkage and Euclidean distance and identify the optimum
number of clusters. After performing hierarchical clustering with Ward linkage, we concluded
that the optimum number of clusters is 5, obtained by cutting the dendrogram at a distance of
200. Please refer to the dendrogram shown below.
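
A sketch of this step with SciPy is given below. It runs on synthetic data with five well-separated groups, not on the ads dataset, so the cut distance of 200 does not carry over; the example asks for 5 clusters directly instead. All variable names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in for the scaled ads data: five tight, well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 20]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Ward linkage on Euclidean distances; Z holds one row per merge.
Z = linkage(X, method="ward", metric="euclidean")

# Cut the tree so that at most 5 flat clusters remain.
labels = fcluster(Z, t=5, criterion="maxclust")
print(np.unique(labels))

# To draw the tree, scipy.cluster.hierarchy.dendrogram(Z) can be called
# with matplotlib installed.
```

On real data, a cut at a chosen height can be expressed with criterion="distance" and t set to that height.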

The dataframe is now stored in an array.

1.6. Make Elbow plot (up to n=10) and identify optimum number of
clusters for k-means algorithm.

Solution:

Apply K-means clustering, plot the elbow curve, check silhouette scores, determine the
appropriate number of clusters, and profile the clusters. Please refer to the elbow plot shown below.

From the elbow plot we can observe that the appropriate number of clusters for K-means
is 5, because beyond 5 clusters the drop in within-cluster distance levels off.

The silhouette score is 0.572 for 5 clusters. The cluster profiling is given below.
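
The elbow values can be collected as sketched below: fit K-means for k = 1 to 10 and record the within-cluster sum of squares (the inertia_ attribute in scikit-learn). The data are synthetic with five groups, used only to illustrate the shape of the curve; the random_state and n_init settings are arbitrary choices for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled ads data: five tight, well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 20]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

wss = []  # within-cluster sum of squares for k = 1..10
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)

for k, value in enumerate(wss, start=1):
    print(k, round(value, 1))

# Plotting range(1, 11) against wss with matplotlib gives the elbow curve.
```

The elbow is the k after which the drop in wss becomes small; for this synthetic data that happens at k = 5.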

1.7. Print silhouette scores for up to 10 clusters and identify
optimum number of clusters.
Solution:

Silhouette score = 0.5726186038415385

Since the silhouette score is above 0.5, we can conclude that the clusters are well distinguished.
The 5 clusters created have a silhouette score of 0.572.
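
To print the score for every candidate k rather than a single value, a loop such as the following can be used. It is a sketch on synthetic data with five groups, so the scores it prints are not the report's; the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled ads data: five tight, well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 20]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

scores = {}
for k in range(2, 11):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))

best_k = max(scores, key=scores.get)
print("best k:", best_k)
```

The k with the highest score is the candidate optimum, which can then be compared with the elbow plot and the dendrogram.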
1.8. Profile the ads based on optimum number of clusters using
silhouette score and your domain understanding
[Hint: Group the data by clusters and take sum or mean to identify
trends in clicks, spend, revenue, CPM, CTR, & CPC based on Device
Type. Make bar plots.]

Solution:
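
The grouping described in the hint can be sketched with pandas as follows. The four-row frame is made up, and the column name "cluster" for the K-means labels is an assumption; "Device Type", "Clicks" and "Spend" follow the names used in the question.

```python
import pandas as pd

# Made-up rows standing in for the ads data with K-means labels attached.
ads = pd.DataFrame({
    "Device Type": ["Mobile", "Desktop", "Mobile", "Desktop"],
    "cluster": [0, 0, 1, 1],
    "Clicks": [10, 20, 30, 50],
    "Spend": [1.0, 2.0, 3.0, 5.0],
})

# Mean of each metric per cluster and device type.
profile = ads.groupby(["cluster", "Device Type"])[["Clicks", "Spend"]].mean()

# Totals per cluster, ignoring device type.
cluster_totals = ads.groupby("cluster")[["Clicks", "Spend"]].sum()

print(profile)
print(cluster_totals)

# With matplotlib installed, profile["Clicks"].unstack().plot(kind="bar")
# draws one bar group per cluster, split by device type.
```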

1.9. Conclude the project by providing summary of your learnings.

Based on the clustering analysis, here are actionable insights and recommendations:
1. Maximizing CTR: Focus spending on ads belonging to k-means cluster 0, especially those with
an ad size of 84000 and above. These ads exhibit characteristics associated with higher click-
through rates, suggesting that allocating more resources to this cluster could lead to improved
CTR.

2. Optimizing CPC: Allocate more budget towards ads in k-means cluster 2. This cluster
demonstrates the lowest cost per click, indicating that investing more in these ads can help
minimize CPC and maximize the value obtained from each click.

3. Enhancing CPM Efficiency: Prioritize spending on ads categorized under k-means cluster 3.
This cluster is associated with the best cost per 1000 impressions (CPM), suggesting that
increasing investment in these ads can lead to more efficient spending and better value for
impressions served.
4. Revenue Generation: Be cautious with ads categorized under k-means cluster 4, as they
exhibit the lowest revenue generation potential. Consider reassessing the targeting, content, or
placement strategies for ads in this cluster to improve their performance and maximize revenue.

By aligning spending and resource allocation based on these insights, the ads company can optimize
their advertising efforts to achieve their desired outcomes, whether it's maximizing CTR, minimizing
CPC, improving CPM efficiency, or enhancing revenue generation.

Problem 2: PCA

PCA FH (FT): Primary census abstract for female headed households excluding
institutional households (India & States/UTs - District Level), Scheduled tribes - 2011
PCA for Female Headed Household Excluding Institutional Household.

The Indian Census has the reputation of being one of the best in the world. The first
Census in India was conducted in the year 1872. This was conducted at different points
of time in different parts of the country. In 1881 a Census was taken for the entire
country simultaneously. Since then, Census has been conducted every ten years, without
a break. Thus, the Census of India 2011 was the fifteenth in this unbroken series since
1872, the seventh after independence and the second census of the third millennium and
twenty first century. The census has continued uninterrupted despite several
adversities like wars, epidemics, natural calamities, political unrest, etc. The Census of
India is conducted under the provisions of the Census Act 1948 and the Census Rules,
1990. The Primary Census Abstract, which is an important publication of the 2011 Census, gives
basic information on Area, Total Number of Households, Total Population, Scheduled
Castes, Scheduled Tribes Population, Population in the age group 0-6, Literates, Main
Workers and Marginal Workers classified by the four broad industrial categories,
namely, (i) Cultivators, (ii) Agricultural Laborers, (iii) Household Industry Workers, and
(iv) Other Workers and also Non-Workers. The characteristics of the Total Population
include Scheduled Castes, Scheduled Tribes, Institutional and Houseless Population and
are presented by sex and rural-urban residence. Census 2011 covered 35 States/Union
Territories, 640 districts, 5,924 sub-districts, 7,935 Towns and 6,40,867 Villages.
The data collected has so many variables that it is difficult to find useful details
without using Data Science techniques. You are tasked to perform detailed EDA and
identify the optimum Principal Components that explain the most variance in the data. Use
sklearn only.
• Note: The 24 variables given in the Rubric are just for performing EDA. You will
have to consider the entire dataset, including all the variables, for performing
PCA.
• Data file - PCA India Data Census.xlsx

2.1. Read the data and perform basic checks like checking head, info, summary, nulls,
and duplicates, etc. (4 marks)
Solution:

The census data set has [640 rows x 61 columns].

Out of the 61 features 59 are integers and 2 are of object type.

Here's a summary of the insights from the data provided:

1. Data Size: The dataset comprises 640 rows and 61 columns.

2. Variable Types: There are 2 categorical variables and 59 numerical variables in the dataset.

3. Household Numbers: The number of households ranges from a minimum of 350 to a maximum
of 310,450.

4. Maximum Male Population: Maharashtra's Mumbai Suburban district has the highest number of
males, with 485,417 individuals.

5. Minimum Male Population: Dibang Valley in Arunachal Pradesh has the lowest number of males,
with only 391 individuals.

6. Average Female Population: The average number of females across all locations is 122,372.

7. We have checked for missing values and we did not find any missing values in the data.
8. We have checked for duplicate values and not found any duplicate values in the data.

2.2. Perform detailed Exploratory analysis by creating certain questions like

(i) Which state has highest gender ratio and which has the lowest?
Solution:

Highest – Lakshadweep

Lowest – Andhra Pradesh

(ii) Which district has the highest & lowest gender ratio? (Example Questions).
Highest – Lakshadweep

Lowest – Krishna

EDA Analysis:
Districts in Uttar Pradesh: Uttar Pradesh has the highest number of districts.

Gender Ratio in Lakshadweep: Lakshadweep boasts the highest gender ratio, with 87% of the
population being female.

Gender Ratio in Krishna District: The Krishna district in Andhra Pradesh has the lowest
gender ratio, with only 43% of the population being female.
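
A state-level comparison of this kind can be sketched as below. The report does not spell out how the gender ratio is computed; since the figures are quoted as the percentage of the population that is female, the example assumes ratio = TOT_F / (TOT_M + TOT_F). The column name "State" and all numbers are placeholders, not census values.

```python
import pandas as pd

# Placeholder rows; "State" is an assumed column name and the counts are made up.
census = pd.DataFrame({
    "State": ["A", "A", "B", "B"],
    "TOT_M": [100, 200, 300, 100],
    "TOT_F": [300, 300, 200, 100],
})

by_state = census.groupby("State")[["TOT_M", "TOT_F"]].sum()

# Assumed definition: share of females in the total population.
by_state["female_share"] = by_state["TOT_F"] / (by_state["TOT_M"] + by_state["TOT_F"])

highest = by_state["female_share"].idxmax()
lowest = by_state["female_share"].idxmin()
print(by_state)
print("highest:", highest, "lowest:", lowest)
```

The same pattern applies at district level by grouping on the district column instead of the state column.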

Pick 5 variables out of the given 24 variables below for EDA: No_HH, TOT_M, TOT_F,
M_06, F_06, M_SC, F_SC, M_ST, F_ST, M_LIT, F_LIT, M_ILL, F_ILL,
TOT_WORK_M, TOT_WORK_F, MAINWORK_M, MAINWORK_F, MAIN_CL_M,
MAIN_CL_F, MAIN_AL_M, MAIN_AL_F, MAIN_HH_M, MAIN_HH_F,
MAIN_OT_M, MAIN_OT_F

2.3. We choose not to treat outliers for this case. Do you think that
treating outliers for this case is necessary?
Solution:

Yes, because PCA is sensitive to outliers. For details, refer to the code.

2.4. Scale the Data using z-score method. Does scaling have any
impact on outliers? Compare boxplots before and after scaling and
comment.

Solution:

Plotting box plot before scaling the new data which contains only numerical columns.

Scaling the dataset using the z-score and checking the top 5 rows of the scaled dataset:

The data has been scaled; now let's check the outliers of the scaled data.

We have used z-score scaling on the data.

We have treated the outliers before scaling the data by using IQR, lower and upper limit.
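
On the question of whether scaling affects outliers: z-score scaling is a linear transformation of each column, so it changes the units on the boxplot axis but not which points fall outside the IQR fences. The sketch below checks this on a made-up column; the helper name count_iqr_outliers is illustrative.

```python
import numpy as np


def count_iqr_outliers(values):
    """Count points outside Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(np.sum(outside))


# Made-up column with one clear outlier (50.0).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 50.0])
z = (x - x.mean()) / x.std()  # z-score scaling

before = count_iqr_outliers(x)
after = count_iqr_outliers(z)
print(before, after)
```

Both counts are the same, so boxplots before and after scaling show the same outlier points on a different scale.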

2.5. Perform all the required steps for PCA (use sklearn only). Create
the covariance matrix. Get eigen values and eigen vectors.
Solution:

Eigen values

Eigen vectors
array([[ 3.00700521e-02,  3.00751392e-02,  1.56432451e-01, ...,  1.31868671e-01,  1.50219557e-01,  1.31179136e-01],
       [-1.62782525e-01, -1.58821825e-01, -1.28322211e-01, ...,  5.40694563e-02, -5.44095594e-02, -6.94741471e-02],
       [-2.50129023e-01, -2.59359844e-01, -3.34978669e-02, ..., -1.83333910e-03,  1.28955424e-01,  8.67015734e-02],
       ...,
       [-0.00000000e+00, -5.35069897e-17,  1.26198190e-15, ...,  3.46533555e-02,  5.93811345e-02,  9.54738941e-02],
       [-0.00000000e+00, -6.51467592e-18,  1.95751314e-17, ..., -1.32087710e-02, -5.09063434e-02, -1.05647215e-01],
       [-0.00000000e+00,  1.01021105e-16,  1.06916841e-16, ...,  3.13625945e-03, -2.78785905e-02,  3.91423200e-02]])

To achieve 90% explained variance we need 6 components, as the cumulative explained
variance (in percent) below shows.
Cumulative Variance Explained in Percentage:
[ 59.66 73.42 81.08 86.05 89.61 91.69 93.7 94.94 95.73 96.31
96.83 97.3 97.63 97.93 98.2 98.41 98.61 98.79 98.94 99.07
99.2 99.31 99.41 99.49 99.56 99.62 99.67 99.71 99.75 99.79
99.82 99.84 99.87 99.89 99.91 99.92 99.93 99.95 99.95 99.96
99.97 99.98 99.98 99.98 99.99 99.99 99.99 99.99 99.99 100.
100. 100. 100. 100. 100. 100. 100. ]
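
The PCA steps can be sketched with scikit-learn as below. The example uses a small synthetic matrix rather than the census data, and the NumPy calls are included only as a cross-check of what scikit-learn returns; all variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric census columns (200 rows, 4 features),
# with the second feature made strongly correlated with the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

cov_matrix = pca.get_covariance()      # covariance matrix of the scaled data
eigen_values = pca.explained_variance_  # eigenvalues, largest first
eigen_vectors = pca.components_         # one eigenvector per row

# Cross-check against NumPy (not needed for the sklearn-only workflow).
np_cov = np.cov(X_scaled, rowvar=False)
np_eigen_values = np.sort(np.linalg.eigvalsh(np_cov))[::-1]

cumulative_pct = np.cumsum(pca.explained_variance_ratio_) * 100
print(np.round(eigen_values, 3))
print(np.round(cumulative_pct, 2))
```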

2.6. Identify the optimum number of PCs (for this project, take at least
90% explained variance). Show Scree plot.
Solution:

The optimum number of PCs is 6: the first six components explain more than 90% of the
variance, and the scree plot flattens after that.
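
The choice of 6 can be read off the cumulative percentages reported in this document: find the first position where the cumulative value reaches 90. The sketch below uses only the first eight reported values.

```python
import numpy as np

# First eight cumulative explained-variance percentages reported in this document.
cumulative_pct = np.array([59.66, 73.42, 81.08, 86.05, 89.61, 91.69, 93.7, 94.94])

# argmax returns the index of the first True; add 1 to turn it into a count.
n_components = int(np.argmax(cumulative_pct >= 90) + 1)
print(n_components)
```

scikit-learn's PCA also accepts a fraction, for example PCA(n_components=0.90), which selects the number of components needed to reach that share of explained variance.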

2.7. Compare PCs with Actual Columns and identify which is explaining most variance.
Write inferences about all the principal components in terms of actual variables.
Solution:

To achieve 90% explained variance we need 6 components, as the cumulative explained
variance (in percent) below shows.
Cumulative Variance Explained in Percentage:
[ 59.66 73.42 81.08 86.05 89.61 91.69 93.7 94.94 95.73 96.31
96.83 97.3 97.63 97.93 98.2 98.41 98.61 98.79 98.94 99.07
99.2 99.31 99.41 99.49 99.56 99.62 99.67 99.71 99.75 99.79
99.82 99.84 99.87 99.89 99.91 99.92 99.93 99.95 99.95 99.96
99.97 99.98 99.98 99.98 99.99 99.99 99.99 99.99 99.99 100.
100. 100. 100. 100. 100. 100. 100. ]

2.8. Write linear equation for first PC.
Solution:

( 0.03 ) * StateCode + ( 0.03 ) * DistCode + ( 0.15 ) * No_HH + ( 0.16 ) *
TOT_M + ( 0.16 ) * TOT_F + ( 0.16 ) * M_06 + ( 0.16 ) * F_06 + ( 0.15 ) *
M_SC + ( 0.15 ) * F_SC + ( 0.03 ) * M_ST + ( 0.03 ) * F_ST + ( 0.16 ) *
M_LIT + ( 0.15 ) * F_LIT + ( 0.16 ) * M_ILL + ( 0.17 ) * F_ILL + ( 0.16 )
* TOT_WORK_M + ( 0.15 ) * TOT_WORK_F + ( 0.15 ) * MAINWORK_M + ( 0.12 ) *
MAINWORK_F + ( 0.1 ) * MAIN_CL_M + ( 0.07 ) * MAIN_CL_F + ( 0.11 ) *
MAIN_AL_M + ( 0.07 ) * MAIN_AL_F + ( 0.13 ) * MAIN_HH_M + ( 0.08 ) *
MAIN_HH_F + ( 0.12 ) * MAIN_OT_M + ( 0.11 ) * MAIN_OT_F + ( 0.16 ) *
MARGWORK_M + ( 0.16 ) * MARGWORK_F + ( 0.08 ) * MARG_CL_M + ( 0.05 ) *
MARG_CL_F + ( 0.13 ) * MARG_AL_M + ( 0.11 ) * MARG_AL_F + ( 0.14 ) *
MARG_HH_M + ( 0.13 ) * MARG_HH_F + ( 0.16 ) * MARG_OT_M + ( 0.15 ) *
MARG_OT_F + ( 0.16 ) * MARGWORK_3_6_M + ( 0.16 ) * MARGWORK_3_6_F + ( 0.17
) * MARG_CL_3_6_M + ( 0.16 ) * MARG_CL_3_6_F + ( 0.09 ) * MARG_AL_3_6_M +
( 0.05 ) * MARG_AL_3_6_F + ( 0.13 ) * MARG_HH_3_6_M + ( 0.11 ) *
MARG_HH_3_6_F + ( 0.14 ) * MARG_OT_3_6_M + ( 0.12 ) * MARG_OT_3_6_F + (
0.15 ) * MARGWORK_0_3_M + ( 0.15 ) * MARGWORK_0_3_F + ( 0.15 )
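
An equation of this form is the dot product of the first row of the PCA loadings with the scaled feature values. The sketch below shows the relationship on synthetic data with placeholder feature names (f1 to f3); it does not reproduce the coefficients listed above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled census data, with placeholder feature names.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
feature_names = ["f1", "f2", "f3"]

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

loadings = pca.components_[0]  # coefficients of the first PC

# PC1 score = sum over features of loading * (scaled value - fitted mean).
pc1_manual = (X_scaled - pca.mean_) @ loadings
pc1_sklearn = pca.transform(X_scaled)[:, 0]

equation = " + ".join(
    f"( {w:.2f} ) * {name}" for w, name in zip(loadings, feature_names)
)
print("PC1 =", equation)
print(np.allclose(pc1_manual, pc1_sklearn))
```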

Thank You
