Data Mining


1. (a) Why is text mining required?(2 marks)
(b) What do you mean by an outlier?(2 marks)
(c) Explain the difference between data mining and data warehousing.(2 marks)
(d) What is an output of Apriori Algorithm?(2 marks)
(e) Differentiate between Clustering and Classification.(2 marks)
(f) What do you understand by predictive data mining?(2 marks)
(g) What is Knowledge Discovery in Databases?(2 marks)
(h) What do you mean by Clustering?(2 marks)
(i) Differentiate between OLTP and OLAP.(2 marks)
(j) What do you mean by outlier detection?(2 marks)
(k) What is Knowledge Discovery in Databases?(2 marks)
(l) Why is data preprocessing required?(2 marks)
(m) How can you check the efficiency of a classifier model?(2 marks)
(n) What do you understand by confidence-based pruning?(2 marks)
(o) Differentiate between Noise and Outlier.(2 marks)
(p) Explain the closed frequent itemset with an example.(2 marks)
(q) What are the different hyper-parameters of the K-means algorithm?(2 marks)
(r) State the difference between Eager learner and Lazy learner algorithms in the
context of classification with suitable examples.(2 marks)
(s) State the key differences between the K-means and K-medoid algorithms.(2
marks)
(t) State the importance of dimensionality reduction in data preprocessing.(2
marks)
2. Differentiate between OLAP and OLTP. Explain different OLAP operations with
the help of examples.(8 marks)
3. Explain the different methods of Data Cleaning and Data Transformation.(8
marks)
4. Write the algorithms for K-means clustering. Compare it with k-nearest neighbor
algorithm.(8 marks)
5. How Decision Trees assist in the process of classification? Explain with the help
of an example.(6+2 marks)
6. (a)What is associative classification in data mining?(2 marks)
(b) Why is associative classification able to achieve higher classification
accuracy than a classical decision tree method?(6 marks)
7.(a) Define information gain. (2 marks)
(b)Define the concept of Splitting attribute and Splitting criterion.(4 marks)
8. Write short notes on the following:
(a) Hierarchical clustering.(4 marks)
(b) Support Vector Machine.(4 marks)
(c) Linear Regression.(4 marks)
(d) K-means clustering.(4 marks)
9. (a) What are the steps of KDD?(3 marks)
(b) What do you understand by nominal attribute? Give a suitable example.(2
marks)
(c) What are the major tasks in data preprocessing?(3 marks)
10. (a) Execute Apriori algorithm in the below given dataset:(6 marks)
Tid Items
1 ACD
2 ABCE
3 BC
4 BE
5 ABCE
6 BCE
(b) Define Gini impurity measure.(2 marks)
11. (a) Explain the confusion matrix for a 2-class problem.(3 marks)
(b) Explain information gain in Decision Tree based classification.(3 marks)
(c) What do you mean by feature selection? State with a suitable example.(3
marks)
12. Consider these 10 points (2, 3, 5, 6, 8, 9, 11, 13, 15, 16). Perform K-means
clustering with K=2 considering the first cluster centers 2 and 15. Now find the
following:
(i) SSE after first iteration.(3 marks)
(ii) Final clusters.(3 marks)
(iii) SSE after the termination of the K-means algorithm.(2 marks)
13. (a) Explain the following terms in the context of association rule mining:
(i) Support of an itemset.(2 marks)
(ii) Frequent closed itemset.(2 marks)
(iii) Lift of a rule.(2 marks)
(b) What is the time complexity of computing the supports for m number of
itemsets in a database of n transactions?(2 marks)
14. (a) Consider the following confusion matrix and compute the values of
precision, recall, false positive rate from it.(6 marks)
                         Predicted not malignant   Predicted malignant
Not actually malignant   100                       10
Actually malignant       5                         50
(b) What are the limitations of K-means clustering?(2 marks)
15. Write short notes on the following:
(a) Divisive hierarchical clustering(4 marks)
(b) Logistic regression(4 marks)
(c) K-nn Classification(4 marks)
(d) Apriori algorithm.(4 marks)
16.(a) Consider the following table.
Tid Items
1 Tea, Milk
2 Tea, Diaper, Fruit, Pizza
3 Milk, Diaper, Fruit, Sugar
4 Tea, Milk, Diaper, Fruit
5 Tea, Milk, Diaper, Sugar
Using association rules, check whether {Milk, Diaper} → {Fruit} is a valid
association rule or not.(5 marks)
Consider minsup=30% and minconf=0.5.
(b) Discuss the working principle of Apriori algorithm.(3 marks)
17. (a) What is semi-supervised learning?(3 marks)
(b) Write the algorithm of K-nearest neighbour method of classification.(4 marks)
(c) Mention some limitations of the KNN algorithm.(1 mark)
18. Consider a binary classification problem where a machine learning model
predicts whether emails are spam or not. The confusion matrix of the model is
given below.
Predicted positive Predicted negative
Actual positive 150 30
Actual negative 20 800
Calculate the following metrics:
(a) Accuracy.(2 marks)
(b) Precision.(2 marks)
(c) Recall.(2 marks)
(d) F1 score.(2 marks)
19. (a) What is the purpose of logistic regression?(2 marks)
(b) Give the steps of logistic regression algorithm.(4 marks)
(c)How do you measure the goodness of logistic regression result?(2 marks)
20. Write short notes:
(a) K-medoids clustering.(4 marks)
(b) Sigmoid function.(4 marks)
(c) FP-Growth algorithm.(4 marks)
(d) Bayes classifier.(4 marks)
1. (a) Why is text mining required? (2 marks)
Text mining is required to extract useful insights, patterns, and trends from
unstructured textual data. It enables businesses and researchers to analyze
customer feedback, social media content, and documents for decision-making and
knowledge discovery.
(b) What do you mean by an outlier? (2 marks)
An outlier is a data point that significantly differs from other observations in a
dataset. It may indicate variability in measurement, errors, or novel insights,
depending on the context.
(c) Explain the difference between data mining and data warehousing. (2 marks)
Data mining involves extracting patterns and knowledge from large datasets using
algorithms and techniques, whereas data warehousing refers to the storage of
structured, integrated data from multiple sources for analysis and reporting.
(d) What is an output of Apriori Algorithm? (2 marks)
The output of the Apriori Algorithm is a set of frequent itemsets and association
rules, which reveal relationships between items in a transactional dataset.
(e) Differentiate between Clustering and Classification. (2 marks)
Clustering is an unsupervised learning method that groups similar data points
based on features, while classification is a supervised learning technique that
assigns predefined labels to data points based on training data.
(f) What do you understand by predictive data mining? (2 marks)
Predictive data mining involves using historical data and machine learning models
to predict future outcomes, such as customer behavior, sales trends, or risk
factors.
(g) What is Knowledge Discovery in Databases? (2 marks)
Knowledge Discovery in Databases (KDD) is the process of discovering useful and
previously unknown information from large datasets. It includes steps like data
preparation, data mining, and interpretation of results.
(h) What do you mean by Clustering? (2 marks)
Clustering is a data mining technique that organizes data points into groups or
clusters such that points in the same cluster are more similar to each other than to
those in other clusters.
(i) Differentiate between OLTP and OLAP. (2 marks)
• OLTP (Online Transaction Processing): Focuses on handling daily
transactional data in real time.
• OLAP (Online Analytical Processing): Focuses on complex analytical queries
and decision support using historical data.
(j) What do you mean by outlier detection? (2 marks)
Outlier detection involves identifying data points that deviate significantly from the
majority of the dataset, which may indicate anomalies, fraud, or errors.
(k) What is Knowledge Discovery in Databases? (2 marks)
Knowledge Discovery in Databases is the process of identifying valid, novel, and
potentially useful patterns from data. It integrates methods from machine
learning, statistics, and databases.
(l) Why is data preprocessing required? (2 marks)
Data preprocessing is required to clean, transform, and prepare raw data for
analysis. It improves data quality by handling missing values, noise, and
inconsistencies, ensuring accurate and meaningful results.
(m) How can you check the efficiency of a classifier model? (2 marks)
The efficiency of a classifier model can be checked using metrics like accuracy,
precision, recall, F1-score, and area under the ROC curve (AUC).
(n) What do you understand by confidence-based pruning? (2 marks)
Confidence-based pruning removes parts of a decision tree that have low
confidence in their predictions to prevent overfitting and improve generalization.
(o) Differentiate between Noise and Outlier. (2 marks)
• Noise: Random error or variance in data that can obscure patterns.
• Outlier: Specific data points that deviate significantly from the overall pattern
in the dataset.
(p) Explain the closed frequent itemset with an example. (2 marks)
A closed frequent itemset is a frequent itemset that has no superset with the same
frequency. For example, if {A, B} appears 5 times and {A, B, C} also appears 5 times,
{A, B} is not closed, but {A, B, C} is.
(q) What are the different hyper-parameters of the K-means algorithm? (2 marks)
Key hyper-parameters of the K-means algorithm include:
• Number of clusters (k)
• Initialization method for centroids
• Maximum number of iterations
• Tolerance for convergence
(r) State the difference between Eager learner and Lazy learner algorithms in the
context of classification with suitable examples. (2 marks)
• Eager Learner: Builds a model during training, e.g., Decision Trees, Neural
Networks.
• Lazy Learner: Delays model building until prediction, e.g., K-Nearest
Neighbors (KNN).
(s) State the key differences between the K-means and K-medoid algorithms. (2
marks)
• K-means: Uses mean as the cluster center, sensitive to outliers.
• K-medoid: Uses actual data points as cluster centers, robust to outliers.
(t) State the importance of dimensionality reduction in data preprocessing. (2
marks)
Dimensionality reduction simplifies data by reducing features, improving
computational efficiency, visualization, and mitigating the curse of dimensionality,
leading to better model performance.
2. Differentiate between OLAP and OLTP. Explain different OLAP operations with
the help of examples. (8 marks)
OLAP (Online Analytical Processing):
• Used for analytical purposes and decision-making.
• Deals with historical data.
• Executes complex queries to aggregate and analyze data.
• Data is stored in a multidimensional format.
• Example: Sales trend analysis over years.
OLTP (Online Transaction Processing):
• Used for transaction-oriented tasks.
• Deals with current, operational data.
• Executes simple, quick queries for day-to-day operations.
• Data is stored in a relational database format.
• Example: Recording a bank deposit transaction.
OLAP Operations:
1. Roll-up: Aggregates data along a hierarchy. Example: Summarizing sales data
from daily to monthly levels.
2. Drill-down: Opposite of roll-up; provides detailed data. Example: Viewing
daily sales details from monthly summaries.
3. Slice: Selects a single dimension. Example: Viewing sales data for a specific
product.
4. Dice: Selects data based on multiple dimensions. Example: Viewing sales data
for a specific region and time.
5. Pivot (Rotation): Reorients the data view. Example: Switching rows and
columns to view sales data by product instead of by region.
3. Explain the different methods of Data Cleaning and Data Transformation. (8
marks)
Data Cleaning Methods:
1. Handling Missing Data: Filling missing values with mean, median, mode, or
using algorithms like k-NN.
2. Removing Outliers: Identifying and removing data points that are far from
the norm.
3. Resolving Inconsistencies: Correcting discrepancies in data formats or values.
4. Removing Duplicate Data: Identifying and eliminating redundant data
records.
Data Transformation Methods:
1. Normalization: Scaling data to a specific range (e.g., [0, 1]).
2. Discretization: Converting continuous data into categorical bins.
3. Encoding: Transforming categorical data into numerical form (e.g., one-hot
encoding).
4. Aggregation: Summarizing data (e.g., calculating monthly averages).
5. Attribute Construction: Creating new attributes based on existing ones.
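A minimal pandas sketch of a few of these cleaning and transformation steps (the column names, toy values, and fill strategies are illustrative assumptions, not part of the original answer):

```python
import pandas as pd

# Toy dataset with a missing value, a duplicate row, and a categorical column
df = pd.DataFrame({
    "age": [25, 32, None, 45, 45],
    "income": [30000, 52000, 41000, 78000, 78000],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Mumbai"],
})

df = df.drop_duplicates()                       # data cleaning: remove duplicate records
df["age"] = df["age"].fillna(df["age"].mean())  # data cleaning: impute missing values with the mean

# Data transformation: min-max normalization of income to [0, 1]
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Data transformation: discretize age into categorical bins
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 100], labels=["young", "middle", "senior"])

# Data transformation: one-hot encode the categorical city column
df = pd.get_dummies(df, columns=["city"])
print(df)
```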
4. Write the algorithms for K-means clustering. Compare it with k-nearest
neighbor algorithm. (8 marks)
K-means Clustering Algorithm:
Algorithm: K-means Clustering
1. Input:
o k: Number of clusters
o Dataset D={x1,x2,…,xn}: Points to be clustered
2. Initialization: Randomly select k initial cluster centroids {c1,c2,…,ck}.
3. Iterative Steps:
o Assignment Step: Assign each point xi to the nearest cluster centroid cj
based on a distance metric (e.g., Euclidean distance):
Cluster(xi) = arg minj ‖xi − cj‖
o Update Step: Recalculate the centroid cj of each cluster by taking the
mean of all points assigned to that cluster:
cj = (1/|Cj|) Σ xi, summed over all points xi assigned to cluster j
4. Convergence Check: Repeat steps 3(a) and 3(b) until:
o Centroids do not change significantly, or
o A maximum number of iterations is reached.
Output: Final cluster centroids and the clusters of points.
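A minimal NumPy sketch of this algorithm (the toy 2-D data, k = 2, and the random seed are illustrative assumptions; it also assumes no cluster becomes empty):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct points from X as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: label each point with the index of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence check: stop when the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5], [1.2, 2.2], [8.5, 9.0]])
centroids, labels = kmeans(X, k=2)
print(centroids, labels)
```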
Comparison with k-Nearest Neighbor (k-NN):
• Purpose: K-means is unsupervised, while k-NN is supervised.
• Usage: K-means groups data into clusters; k-NN classifies data points based
on their nearest neighbors.
• Output: K-means outputs clusters; k-NN outputs class labels.
• Training: K-means requires an iterative training process; k-NN does not
involve explicit training.
5. How do Decision Trees assist in the process of classification? Explain with the help of an example.(6+2 marks)
A decision tree is a supervised learning algorithm used for classification and
regression tasks. It divides the data into subsets based on the feature values,
creating a tree structure where:
• Nodes represent features or attributes.
• Edges represent decision rules.
• Leaves represent the outcome or class label.
The tree is constructed by repeatedly splitting the dataset based on the feature
that provides the most significant separation (using metrics like Gini Impurity,
Information Gain, or Entropy).
Steps in Classification Using Decision Trees:
1. Feature Selection and Splitting:
o Identify the feature that best splits the data into subsets that are as
homogeneous as possible.
2. Tree Construction:
o Recursively split the dataset into smaller subsets until a stopping
condition is met (e.g., all instances belong to one class, or a maximum
depth is reached).
3. Prediction:
o For a new data point, traverse the tree based on its feature values until
reaching a leaf node, which provides the predicted class.
Example: classifying animals with a small decision tree.

Lays Eggs?
├── No → Mammal
└── Yes → Feathers?
    ├── Yes → Bird
    └── No → Produces Milk?
        ├── Yes → Mammal
        └── No → Bird

For a new animal with Lays Eggs = Yes and Feathers = Yes, traversing the tree gives Predicted Class: Bird.
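A short scikit-learn sketch of the same classification task; the tiny training set, its 1 = Yes / 0 = No encoding, and the animal examples are illustrative assumptions, not part of the original answer:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [lays_eggs, has_feathers, produces_milk], encoded as 1 = Yes, 0 = No
X = [
    [0, 0, 1],  # dog      -> Mammal
    [0, 0, 1],  # cow      -> Mammal
    [1, 1, 0],  # sparrow  -> Bird
    [1, 1, 0],  # pigeon   -> Bird
    [1, 0, 1],  # platypus -> Mammal
]
y = ["Mammal", "Mammal", "Bird", "Bird", "Mammal"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)  # split using information gain
clf.fit(X, y)

print(export_text(clf, feature_names=["lays_eggs", "has_feathers", "produces_milk"]))
print(clf.predict([[1, 1, 0]]))  # lays eggs + feathers -> expected prediction: Bird
```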
6(a) What is Associative Classification in Data Mining? (2 marks)
Associative Classification (AC) is a technique in data mining that integrates
association rule mining with classification. It involves discovering strong and
significant association rules from the data and then using these rules to construct a
classifier. The process typically includes two steps:
1. Mining Association Rules: Identify rules of the form A → B, where A
represents a set of conditions (attribute-value pairs) and B is the class label.
2. Classification: Use the mined rules to predict the class label of new data
instances based on their attribute values.
6(b) Why Associative Classification Achieves Higher Accuracy than Classical
Decision Tree Methods? (6 marks)
Associative classification often achieves higher accuracy than classical decision tree
methods due to the following reasons:
1. Use of Strong Rules:
o Associative classification relies on frequent itemsets and selects rules
based on support and confidence thresholds, ensuring that only highly
reliable and significant patterns are used for classification.
2. Global Rule Evaluation:
o In decision trees, splits are determined locally at each node, which may
lead to suboptimal global solutions. Associative classification evaluates
rules globally, allowing it to capture relationships across multiple
features effectively.
3. Handling Complex Relationships:
o Associative classification can identify and use complex multi-attribute
relationships that decision trees might overlook because of their
hierarchical structure and greedy approach.
4. Reduced Overfitting:
o By focusing on strong, frequent patterns, associative classification can
avoid overfitting to noise, which is a common issue in decision trees,
especially when the tree grows too large.
5. Flexibility in Rule Selection:
o Associative classifiers often use pruning strategies to select the most
relevant rules, prioritizing those with higher predictive power. Decision
trees, however, are constrained by their recursive splitting process.
6. Improved Handling of Imbalanced Data:
o Associative classifiers can emphasize rules for minority classes by
adjusting support and confidence thresholds, leading to better
performance on imbalanced datasets.
7(a) Define Information Gain. (2 marks)
Information Gain (IG) is a metric used to measure the reduction in uncertainty or
entropy in a dataset after splitting it based on a particular attribute. It quantifies
how much information a feature provides in classifying a dataset.
7(b) Define Splitting Attribute and Splitting Criterion. (4 marks)
• Splitting Attribute:
The feature or attribute chosen to divide the dataset into smaller subsets.
The goal is to select an attribute that provides the best separation between
classes.
• Splitting Criterion:
A metric or rule used to evaluate and choose the splitting attribute. Common
criteria include:
o Information Gain (reduces uncertainty).
o Gini Index (measures impurity).
o Gain Ratio (balances IG and attribute bias).
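A small Python sketch of how entropy and information gain can be computed for a candidate splitting attribute (the toy "Outlook"/"Play" data is an illustrative assumption):

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG = H(D) - weighted average entropy of the subsets induced by the attribute."""
    total = len(labels)
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(subset) / total * entropy(subset) for subset in subsets.values())
    return entropy(labels) - weighted

# Toy example: does "Outlook" help predict Play = yes/no?
play    = ["no", "no", "yes", "yes", "yes", "no"]
outlook = ["sunny", "sunny", "overcast", "rain", "overcast", "rain"]
print(round(information_gain(play, outlook), 3))  # ~0.667 bits of reduction in entropy
```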
8. Short Notes
(a) Hierarchical Clustering (4 marks)
Hierarchical clustering is an unsupervised learning algorithm used to build a
hierarchy of clusters. It is commonly visualized using a dendrogram.
Types:
1. Agglomerative (Bottom-Up):
o Starts with each data point as its own cluster.
o Merges clusters iteratively based on similarity/distance (e.g., single-
linkage, complete-linkage).
2. Divisive (Top-Down):
o Starts with all data points in one cluster.
o Splits the clusters iteratively until each data point is in its own cluster.
Advantages:
• No need to predefine the number of clusters.
• Captures nested cluster relationships.
Disadvantages:
• Computationally expensive for large datasets.
• Sensitive to noise and outliers.
(b) Support Vector Machine (SVM) (4 marks)
SVM is a supervised learning algorithm used for classification and regression
tasks.
Concept:
• SVM finds the optimal hyperplane that maximally separates data points of
different classes.
• For non-linear data, it uses the kernel trick to transform data into a higher-
dimensional space where a hyperplane can be applied.
Key Components:
1. Support Vectors: Data points closest to the hyperplane, influencing its
position.
2. Kernel Functions: Transform non-linear data (e.g., linear, polynomial, radial
basis function).
Advantages:
• Works well in high-dimensional spaces.
• Effective for small datasets.
Disadvantages:
• Sensitive to the choice of kernel and hyperparameters.
• Computationally intensive for large datasets.
(c) Linear Regression (4 marks)
Linear regression is a supervised learning technique used for predicting a
continuous output based on input features.
Model:
y=β0+β1x1+β2x2+…+βnxn+ϵ
where:
• y is the dependent variable,
• x represents the independent variables,
• β0,β1,…,βn are coefficients,
• ϵ is the error term.
Assumptions:
• A linear relationship exists between x and y.
• Residuals are normally distributed and homoscedastic.
Advantages:
• Simple and interpretable.
• Useful for understanding relationships between variables.
Disadvantages:
• Assumes linearity and independence of errors.
• Poor performance on non-linear data.
(d) K-means Clustering (4 marks)
K-means is an unsupervised learning algorithm used to group data into k clusters.
Steps:
1. Initialize k cluster centroids randomly.
2. Assign each data point to the nearest centroid based on a distance metric
(e.g., Euclidean distance).
3. Recalculate centroids as the mean of all points in each cluster.
4. Repeat steps 2 and 3 until centroids stabilize or a maximum number of
iterations is reached.
Advantages:
• Efficient and easy to implement.
• Scalable for large datasets.
Disadvantages:
• Requires the number of clusters k to be predefined.
• Sensitive to initial centroid positions and outliers.
9. Steps of KDD and Related Concepts
(a) Steps of KDD (Knowledge Discovery in Databases) (3 marks):
1. Data Selection: Identify relevant data from various sources.
2. Preprocessing: Clean and transform the data.
3. Transformation: Convert data into appropriate formats.
4. Data Mining: Apply algorithms to discover patterns.
5. Interpretation/Evaluation: Analyze and validate the patterns.
6. Knowledge Presentation: Represent the findings in understandable formats.
(b) Nominal Attribute (2 marks):
A nominal attribute represents categorical data with no inherent order.
Example: Colors of cars ({Red, Blue, Green}).
(c) Major Tasks in Data Preprocessing (3 marks):
1. Data Cleaning: Handle missing values, outliers, and noise.
2. Data Integration: Combine data from multiple sources.
3. Data Transformation: Normalize or scale data.
4. Data Reduction: Reduce dimensionality or aggregate data.
5. Data Discretization: Convert continuous data into categorical bins.
10(a) Execute the Apriori Algorithm on the Given Dataset (6 marks)
The Apriori algorithm identifies frequent itemsets in a transactional database and
generates association rules based on a minimum support threshold.
Step 1: Generate Frequent 1-itemsets
• Support Calculation:
Support = Count of Item/Total Transactions

Item  Count  Support
A     3      3/6 = 0.50
B     5      5/6 = 0.83
C     5      5/6 = 0.83
D     1      1/6 = 0.17
E     4      4/6 = 0.67

• Retain items with support ≥ threshold: assuming a minimum support threshold of 0.5, the frequent 1-itemsets are
L1 = {A, B, C, E}
Step 2: Generate Frequent 2-itemsets
• Candidate Generation: Pair all items in L1.
• Support Calculation for Each Pair:

Itemset  Count  Support
AB       2      2/6 = 0.33
AC       3      3/6 = 0.50
AE       2      2/6 = 0.33
BC       4      4/6 = 0.67
BE       4      4/6 = 0.67
CE       3      3/6 = 0.50

• Retain pairs with support ≥ threshold: L2 = {AC, BC, BE, CE}


Step 3: Generate Frequent 3-itemsets
• Candidate Generation: Combine itemsets in L2; the only candidate whose 2-item subsets are all frequent is BCE (for example, ACE is pruned because AE is not frequent).
• Support Calculation:

Itemset  Count  Support
BCE      3      3/6 = 0.50

• Retain triplets with support ≥ threshold: L3 = {BCE}

Step 4: Stop When No Larger Frequent Itemsets Can Be Found
The largest frequent itemset is BCE, so the complete set of frequent itemsets is L1 ∪ L2 ∪ L3 = {A, B, C, E, AC, BC, BE, CE, BCE}.
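A brute-force Python check of these supports on the same six transactions (it enumerates every itemset rather than doing Apriori's level-wise candidate generation, so it is only a verification sketch):

```python
from itertools import combinations

transactions = [set("ACD"), set("ABCE"), set("BC"), set("BE"), set("ABCE"), set("BCE")]
items = sorted(set().union(*transactions))
min_support = 0.5
n = len(transactions)

frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        # Count the transactions that contain every item of the candidate
        count = sum(set(candidate) <= t for t in transactions)
        if count / n >= min_support:
            frequent["".join(candidate)] = count / n

print(frequent)
# Frequent itemsets: A, B, C, E, AC, BC, BE, CE and BCE (each with support >= 0.5)
```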
10(b) Define Gini Impurity Measure (2 marks)
Gini impurity measures the likelihood of incorrect classification of a randomly chosen element if it were labeled based on the distribution of class labels in a dataset.
Formula:
Gini = 1 − Σ pi², summed over i = 1, …, n
where pi is the proportion of elements in class i, and n is the number of classes.
Properties:
• Gini impurity ranges from 0 (pure) up to a maximum of 1 − 1/n, which approaches 1 as the number of classes grows.
• 0: Perfectly pure node (all elements belong to one class).
• Higher Gini: Greater impurity or class mix.
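A small Python helper implementing this formula (the example label lists are made up for illustration):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini(["yes", "yes", "yes", "yes"]))  # 0.0 -> perfectly pure node
print(gini(["yes", "yes", "no", "no"]))    # 0.5 -> maximum impurity for two classes
```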
11(a) Confusion Matrix for a 2-Class Problem (3 marks)
The confusion matrix is a table used to evaluate the performance of a
classification algorithm, particularly in binary (2-class) classification problems. It
compares the predicted labels with the actual labels.
For a 2-class problem, the confusion matrix has the following structure:

Actual \ Predicted   Positive (P)          Negative (N)
Positive (P)         True Positive (TP)    False Negative (FN)
Negative (N)         False Positive (FP)   True Negative (TN)

• True Positive (TP): The number of positive instances correctly classified as positive.
• True Negative (TN): The number of negative instances correctly classified as negative.
• False Positive (FP): The number of negative instances incorrectly classified as positive (Type I error).
• False Negative (FN): The number of positive instances incorrectly classified as negative (Type II error).
From this matrix, metrics like accuracy, precision, recall, and F1-score can be
calculated.
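A small Python helper that derives these metrics from the four counts, using the same formulas as answers 14 and 18 below:

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute standard metrics from the four cells of a 2-class confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
    }

# Counts taken from the confusion matrix in question 14(a)
print(classification_metrics(tp=50, fn=5, fp=10, tn=100))
```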
11(b) Information Gain in Decision Tree-Based Classification (3 marks)
Information Gain (IG) is a metric used to choose the best attribute to split the data
at each step in the construction of a decision tree. It measures how much
uncertainty or entropy is reduced when the data is divided based on a particular
attribute.
Steps to calculate Information Gain:
1. Compute the entropy of the entire dataset before the split (initial entropy):
H(D) = −Σ pi log2(pi)
where pi is the probability of class i in the dataset.
2. For each attribute, split the dataset into subsets based on its possible values
and calculate the weighted average entropy for the subsets.
3. Information Gain (IG):
IG = H(D) − Hsplit
where Hsplit is the weighted average entropy after the split.
Interpretation:
• Higher Information Gain indicates that the attribute is better at separating
the data and is chosen for the split.
11(c) Feature Selection (3 marks)
Feature selection is the process of selecting a subset of the most relevant features
(attributes) from the original dataset. This is done to improve the performance of
machine learning algorithms by eliminating irrelevant or redundant features that
do not contribute significantly to the predictive power.
Example:
Consider a dataset with attributes like age, income, education level, gender, and
number of hours worked to predict whether a person will purchase a product.
• Irrelevant features (e.g., gender) can be removed if it has no significant
impact on the target variable (purchase decision).
• Redundant features (e.g., age and number of years worked) can be
combined or removed to streamline the model.
Feature selection techniques include methods like Filter, Wrapper, and Embedded
approaches.
12. K-means Clustering
Given the points: 2, 3, 5, 6, 8, 9, 11, 13, 15, 16.
We are performing K-means clustering with K=2, and initial cluster centers are 2
and 15.
• Cluster 1 center: 2
• Cluster 2 center: 15
For each point, we assign it to the nearest center.

Point   Distance to Center 1 (2)   Distance to Center 2 (15)   Assigned Cluster
2       0                          13                          Cluster 1
3       1                          12                          Cluster 1
5       3                          10                          Cluster 1
6       4                          9                           Cluster 1
8       6                          7                           Cluster 1
9       7                          6                           Cluster 2
11      9                          4                           Cluster 2
13      11                         2                           Cluster 2
15      13                         0                           Cluster 2
16      14                         1                           Cluster 2

After this assignment, the points in Cluster 1 are 2, 3, 5, 6, 8 and the points in Cluster 2 are 9, 11, 13, 15, 16.
Step 3: Recalculate Centroids
• New Center for Cluster 1:
Centroid of Cluster 1=(2+3+5+6+8)/5=24/5=4.8
New Center for Cluster 2:
Centroid of Cluster 2=(9+11+13+15+16)/5=64/5=12.8
The Sum of Squared Errors (SSE) is the sum of the squared distances between
each point and its respective cluster center.
For Cluster 1 (center = 4.8):
SSE1=(2−4.8)2+(3−4.8)2+(5−4.8)2+(6−4.8)2+(8−4.8)2=7.84+3.24+0.04+1.44+10.24
=22.8
For Cluster 2 (center = 12.8):
SSE2=(9−12.8)2+(11−12.8)2+(13−12.8)2+(15−12.8)2+(16−12.8)2=14.44+3.24+0.04
+4.84+10.24=32.8
Thus, the Total SSE after the first iteration is:
SSE = 22.8 + 32.8 = 55.6
12(i) SSE After the First Iteration (Answer)
SSE after first iteration=55.6
Step 5: Final Clusters and Convergence
• After recalculating the centroids and assigning the points, we see that the
assignments would not change in further iterations because the centroids
have already converged.
Final Clusters:
• Cluster 1: 2, 3, 5, 6, 8 (Center = 4.8)
• Cluster 2: 9, 11, 13, 15, 16 (Center = 12.8)
12(ii) Final Clusters (Answer)
• Cluster 1: 2, 3, 5, 6, 8
• Cluster 2: 9, 11, 13, 15, 16
Step 6: Compute SSE After Termination of the Algorithm
Since the centroids did not change after the first iteration, the algorithm has
converged. Therefore, the SSE after the algorithm terminates is the same as the
SSE after the first iteration.
SSE after termination=55.6
12(iii) SSE After the Termination of the K-means Algorithm (Answer)
SSE after termination=55.6
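A short NumPy check of this worked example, starting from the same initial centers 2 and 15:

```python
import numpy as np

points = np.array([2, 3, 5, 6, 8, 9, 11, 13, 15, 16], dtype=float)
centers = np.array([2.0, 15.0])

for _ in range(10):  # more iterations than needed; the loop stabilizes after one update
    # Assign each point to its nearest center
    labels = np.abs(points[:, None] - centers[None, :]).argmin(axis=1)
    # Recompute each center as the mean of its cluster
    new_centers = np.array([points[labels == j].mean() for j in range(2)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

sse = sum(((points[labels == j] - centers[j]) ** 2).sum() for j in range(2))
print(centers, sse)  # expected: centers [4.8, 12.8], SSE = 55.6
```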
13. Association Rule Mining Terms
(a) (i) Support of an Itemset (2 marks)
Support of an itemset is the proportion of transactions in the dataset that contain
a particular itemset. It is calculated as:
Support(X) = (Number of transactions containing itemset X) / (Total number of transactions)
Example:
In a transaction dataset with 100 transactions, if an itemset {A,B} appears in 20
transactions, the support is:
Support({A,B})=20/100=0.2
(a) (ii) Frequent Closed Itemset (2 marks)
A frequent closed itemset is a frequent itemset for which no proper superset has the same support; adding any item to the set strictly decreases its support.
Example:
Consider a dataset where itemset {A,B} appears 30 times, and {A,B,C} also appears 30 times. Then {A,B} is not closed, because its superset {A,B,C} has the same support; {A,B,C} is a frequent closed itemset provided none of its own supersets appears 30 times.

(a) (iii) Lift of a Rule (2 marks)
The Lift of an association rule X ⇒ Y measures the strength of the rule. It is calculated as:
Lift(X ⇒ Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
A lift value greater than 1 indicates that X and Y occur together more often than expected under independence, implying a strong relationship between X and Y.
Example:
If Support({A,B}) = 0.3, Support({A}) = 0.5 and Support({B}) = 0.4, then:
Lift = 0.3 / (0.5 × 0.4) = 1.5, so A and B co-occur more often than random chance would suggest.
(b) Time Complexity of Computing Supports for m Itemsets in n Transactions (2 marks)
The time complexity of computing the supports for m itemsets in n transactions is
O(m * n).
• For each itemset, we need to check if it is present in each of the n
transactions.
• Therefore, for m itemsets, the total time complexity is O(m×n).
14(a) Compute Precision, Recall, and False Positive Rate from the Confusion
Matrix (6 marks)
• True Negative (TN) = 100 (Not malignant, predicted not malignant)
• False Positive (FP) = 10 (Not malignant, predicted malignant)
• False Negative (FN) = 5 (Malignant, predicted not malignant)
• True Positive (TP) = 50 (Malignant, predicted malignant)
Precision:
Precision is the proportion of correct positive predictions (TP) out of all predicted
positives (TP + FP).
Precision=TP/(TP+FP)=50/(50+10)=50/60=0.8333
Recall (Sensitivity or True Positive Rate):
Recall is the proportion of correct positive predictions (TP) out of all actual
positives (TP + FN).
Recall=TP/(TP+FN)=50/(50+5)=50/55=0.9091
False Positive Rate (FPR):
False Positive Rate is the proportion of incorrect negative predictions (FP) out of all
actual negatives (TN + FP).
False Positive Rate=FP/(TN+FP)=10/(100+10)=10/110=0.0909
14(b) Limitations of K-means Clustering (2 marks)
1. Sensitivity to Initial Centroids:
K-means is highly sensitive to the initial placement of centroids. Poor initial
centroids can result in suboptimal clustering and lead to different results on
different runs.
2. Predefined Number of Clusters (K):
K-means requires the user to specify the number of clusters K beforehand.
If K is not chosen correctly, it can lead to poor clustering results.
3. Non-Globular Clusters:
K-means assumes that clusters are spherical (globular) and equally sized,
which can be a limitation when data has non-convex or irregularly shaped
clusters.
15. Short Notes on the Following
(a) Divisive Hierarchical Clustering (4 marks)
Divisive Hierarchical Clustering is a type of hierarchical clustering method that
follows a top-down approach. Unlike agglomerative clustering (which is bottom-
up), divisive clustering starts with all data points in one cluster and recursively
splits the clusters into smaller ones until each data point is in its own individual
cluster.
• Process:
1. Start with a single cluster containing all data points.
2. Choose the best way to split this cluster into two smaller clusters.
3. Recursively apply the splitting process to each cluster.
4. The process continues until each data point forms its own cluster.
• Key Characteristics:
o Considers the global distribution of the data when making the first splits, unlike agglomerative clustering, which only merges locally.
o The splitting strategy depends on distance measures or dissimilarity
matrices.
o A common method for splitting is by maximizing the dissimilarity
between the clusters after each split.
Example:
For a dataset of 100 points, the algorithm begins by treating all points as one
cluster. It then divides this cluster into two, then continues to split each of the new
clusters, and so on, until each point forms its own cluster.
(b) Logistic Regression (4 marks)
Logistic Regression is a statistical model used for binary classification problems,
where the goal is to predict one of two possible outcomes (such as "yes" or "no").
It models the probability of a binary response based on one or more predictor
variables.
• Formula: The logistic regression model applies the logistic (sigmoid) function to a linear combination of the predictors to model the probability p of the positive class (often denoted as 1):
p = 1 / (1 + e^-(β0 + β1x1 + β2x2 + … + βnxn))
where:
o β0 is the intercept.
o β1, β2, ..., βn are the coefficients of the predictor variables x1, x2, ..., xn.
o The logistic function ensures that the output is between 0 and 1, representing a probability.
• Interpretation:
o Probabilistic output: Logistic regression gives the probability of a
sample belonging to a particular class.
o It can be extended to multiclass problems using techniques like one-vs-
all.
• Applications:
Used in fields such as healthcare (predicting the presence or absence of a
disease) and marketing (predicting whether a customer will buy a product).
(c) K-NN Classification (4 marks)
K-Nearest Neighbors (K-NN) is a simple, instance-based classification algorithm. It
classifies a new sample based on the majority class of its K nearest neighbors in
the feature space.
• Process:
1. Choose a value for K (the number of nearest neighbors).
2. For a new data point, calculate the distance (usually Euclidean) from
that point to all points in the training set.
3. Identify the K nearest data points.
4. The class of the new point is determined by the majority class among
the K nearest points.
• Key Features:
o Distance Metric: The most common distance measure is Euclidean
distance, but others such as Manhattan or Minkowski can also be used.
o No Training Phase: K-NN is a lazy learner meaning it doesn’t have an
explicit training phase. It uses the entire dataset for classification during
prediction.
o Choice of K: The value of K influences the classification result. A small
K leads to noise sensitivity, while a large K can lead to over-smoothing.
• Applications:
Used in classification tasks like image recognition, recommender systems,
and document classification.
(d) Apriori Algorithm (4 marks)
The Apriori algorithm is a classic algorithm used in association rule mining to find
frequent itemsets in a transaction dataset and derive association rules. It uses a
breadth-first search approach to explore itemsets and their support in the dataset.
• Process:
1. Generate Candidate Itemsets: Start by identifying all individual items
(1-itemsets) that meet the minimum support threshold.
2. Iterate to Find Frequent Itemsets: In subsequent iterations, generate
candidate itemsets of size k by combining frequent itemsets of size k−1,
and count their support in the database.
3. Prune Non-Frequent Itemsets: Eliminate candidate itemsets that do
not meet the minimum support threshold.
4. Repeat the process until no more frequent itemsets can be found.
• Association Rule Generation: After identifying frequent itemsets, the
algorithm generates association rules based on certain measures like
confidence and lift. A rule A ⇒ B indicates that if itemset A occurs in a
transaction, itemset B is likely to occur as well.
• Key Characteristics:
o Support: The proportion of transactions that contain the itemset.
o Confidence: The likelihood that a rule holds true, given the presence of A.
o Lift: The strength of the association between A and B, comparing
observed support with expected support.
observed support with expected support.
• Applications:
Widely used in market basket analysis, where the goal is to find associations
between items purchased together. For example, if customers often buy
"bread" and "butter" together, a rule such as {bread}⇒{butter} may be
generated.

16(a) Association Rule Validation: {Milk, Diaper} → {Fruit}


• minsup = 30% (minimum support is 30% of transactions)
• minconf = 0.5 (minimum confidence is 50%)
Step 1: Calculate Support for {Milk, Diaper, Fruit}
The support of an itemset is the proportion of transactions in which the itemset
appears. We first need to count how many transactions contain {Milk, Diaper,
Fruit}.
From the table:
• Transaction 3: Contains Milk, Diaper, Fruit
• Transaction 4: Contains Milk, Diaper, Fruit
(Transaction 2 contains Diaper and Fruit but not Milk.)
Thus, {Milk, Diaper, Fruit} appears in 2 transactions.
Total number of transactions = 5.
Support({Milk, Diaper, Fruit}) = 2/5 = 0.4 = 40%
Since 40% is greater than the minimum support of 30%, the itemset {Milk, Diaper, Fruit} is frequent.
Step 2: Calculate Confidence for {Milk, Diaper} → {Fruit}
Confidence is the probability that {Fruit} appears given that {Milk, Diaper} appears, which is calculated as:
Confidence({Milk, Diaper} → {Fruit}) = Support({Milk, Diaper, Fruit}) / Support({Milk, Diaper})
First, we need to calculate the support of {Milk, Diaper}:
• Transaction 1: Contains Milk (but not Diaper)
• Transaction 2: Contains Diaper and Fruit (but not Milk)
• Transaction 3: Contains Milk, Diaper
• Transaction 4: Contains Milk, Diaper
• Transaction 5: Contains Milk, Diaper
Thus, {Milk, Diaper} appears in 3 transactions.
Support({Milk, Diaper}) = 3/5 = 0.6 = 60%
Now, calculate the confidence:
Confidence({Milk, Diaper} → {Fruit}) = 0.4 / 0.6 ≈ 0.67 = 67%
Since 67% is greater than the minimum confidence of 50%, the rule {Milk, Diaper} → {Fruit} is valid.
Conclusion:
The rule {Milk, Diaper} → {Fruit} is valid because both the support (40%) and the confidence (67%) exceed the minimum thresholds.
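A quick Python check of these support and confidence values, with the five transactions from the question written as sets:

```python
transactions = [
    {"Tea", "Milk"},
    {"Tea", "Diaper", "Fruit", "Pizza"},
    {"Milk", "Diaper", "Fruit", "Sugar"},
    {"Tea", "Milk", "Diaper", "Fruit"},
    {"Tea", "Milk", "Diaper", "Sugar"},
]

def support(itemset):
    # Fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent = {"Milk", "Diaper"}
rule_support = support(antecedent | {"Fruit"})
confidence = rule_support / support(antecedent)
print(rule_support, confidence)  # 0.4 and ~0.667, so the rule passes minsup=0.3 and minconf=0.5
```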
16(b) Working Principle of Apriori Algorithm (3 marks)
The Apriori algorithm is a popular algorithm used in association rule mining to
find frequent itemsets and generate association rules. It is based on prior knowledge (the Apriori property) that all subsets of a frequent itemset must also be frequent.
Working Principle:
1. Generate Frequent Itemsets:
o The algorithm begins by identifying individual items that meet the
minimum support threshold (i.e., single-item frequent itemsets).
o It then iteratively combines these frequent itemsets to generate larger
itemsets (e.g., 2-itemsets, 3-itemsets, etc.).
o At each step, the algorithm calculates the support for these itemsets
and prunes those that do not meet the minimum support.
2. Iterative Process:
o Step 1: Start with the 1-itemsets (individual items) and calculate their
support.
o Step 2: Use the frequent 1-itemsets to generate candidate 2-itemsets.
o Step 3: Calculate the support for these 2-itemsets.
o Step 4: Prune 2-itemsets that do not meet the minimum support.
o Step 5: Use the frequent 2-itemsets to generate candidate 3-itemsets,
and repeat the process.
3. Generate Association Rules:
o Once the frequent itemsets are identified, association rules can be
generated from them.
o Each rule is evaluated based on confidence (how often the rule holds
true) and lift (how strong the association is relative to randomness).
4. Pruning:
o Apriori uses the Apriori property: if an itemset is infrequent, then all of
its supersets are also infrequent.
o This property allows for pruning the search space, making the algorithm
more efficient.
17(a) What is Semi-Supervised Learning? (3 marks)
Semi-supervised learning is a type of machine learning that lies between
supervised and unsupervised learning. In this approach, the algorithm is trained on
a dataset that contains a small amount of labeled data and a large amount of
unlabeled data.
• Labeled Data: These are data points for which the correct output (label) is
provided.
• Unlabeled Data: These are data points for which the output label is
unknown.
In semi-supervised learning, the algorithm uses both labeled and unlabeled data to
improve learning accuracy and make predictions. The key idea is that even
unlabeled data contains valuable information that can help the model generalize
better and learn the underlying structure of the data.
17(b) Algorithm for K-Nearest Neighbors (K-NN) Classification (4 marks)
The K-Nearest Neighbors (K-NN) algorithm is a simple, instance-based
classification method. It classifies a new data point based on the majority class of
its K nearest neighbors from the training data.
K-NN Algorithm:
1. Input:
o K: The number of nearest neighbors to consider.
o Training dataset D with known labels.
o A test data point x for classification.
2. Steps:
1. Calculate Distance:
For each point t in the training dataset D, calculate the distance (typically
Euclidean distance) between the test point x and the training point:
d(x, t) = sqrt(Σ (xi − ti)²)
where t is a training point and x is the test point.
2. Sort Distances:
Sort all training points based on their calculated distances from the test point x.
3. Select K Nearest Neighbors:
Select the top K points with the smallest distances.
4. Determine the Majority Class:
Identify the class labels of the K nearest neighbors. The class that appears most
frequently among the K neighbors is the predicted class for the test point.
5. Assign Class Label:
Assign the predicted class label to the test point.
3. Output: The predicted class label for the test point x.
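A minimal Python sketch of this K-NN procedure (the toy 2-D training points, their labels, and K = 3 are illustrative assumptions):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points (Euclidean distance)."""
    distances = sorted(
        (math.dist(x, point), label) for point, label in zip(train_X, train_y)
    )
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]

train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 7), (6, 7)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_X, train_y, x=(2, 2), k=3))  # expected: "A"
```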
17(c) Limitations of K-NN Algorithm (1 mark)
1. Computational Complexity:
K-NN is computationally expensive, especially as the size of the dataset
increases. For each test instance, the algorithm needs to compute the
distance to all training samples, which can be slow.
2. Memory Intensive:
Since K-NN is a lazy learner, it requires storing the entire training dataset in
memory, which can become problematic for large datasets.
18. Confusion Matrix Calculation for a Binary Classification Problem
The given confusion matrix is:

                  Predicted Positive   Predicted Negative
Actual Positive   150                  30
Actual Negative   20                   800

From this matrix, we can define the following terms:
• True Positives (TP): 150 (Actual positive and predicted positive)
• False Positives (FP): 20 (Actual negative but predicted positive)
• False Negatives (FN): 30 (Actual positive but predicted negative)
• True Negatives (TN): 800 (Actual negative and predicted negative)
(a) Accuracy (2 marks)
Accuracy is the proportion of correct predictions (both positive and negative) to
the total number of predictions.
Accuracy= {TP + TN}/{TP + TN + FP + FN}
Substitute the values:
Accuracy={150 + 800}/{150 + 800 + 20 + 30} ={950}/{1000} = 0.95
So, Accuracy = 0.95 or 95%.
(b) Precision (2 marks)
Precision is the proportion of true positive predictions to the total number of
predicted positives.
Precision={TP}/{TP + FP}
Substitute the values:
Precision={150}/{150 + 20} = {150}/{170}=0.8824
So, Precision = 0.8824 or 88.24%.
(c) Recall (2 marks)
Recall (also known as Sensitivity or True Positive Rate) is the proportion of true
positive predictions to the total number of actual positives.
Recall={TP}/{TP + FN}
Substitute the values:
Recall={150}/{150 + 30} ={150}/{180}=0.8333
So, Recall = 0.8333 or 83.33%.
(d) F1 Score (2 marks)
The F1 Score is the harmonic mean of Precision and Recall. It gives a single score
that balances both Precision and Recall.
F1 Score = 2 × {Precision × Recall}/{Precision + Recall}
Substitute the values:
F1 Score = 2 × {0.8824 × 0.8333}/{0.8824 + 0.8333} = {1.4707}/{1.7157} ≈ 0.857
So, F1 Score = 0.857 or 85.7%.
Summary:
• Accuracy: 95%
• Precision: 88.24%
• Recall: 83.33%
• F1 Score: 85.7%
19(a) Purpose of Logistic Regression (2 marks)
The purpose of logistic regression is to model the probability of a binary outcome
or class. It is a statistical method used for classification tasks where the dependent
variable is categorical, typically binary (e.g., 0 or 1, yes or no). Logistic regression
estimates the relationship between the independent variables and the probability
of a certain class by applying a logistic function (sigmoid function) to the linear
combination of input features.
• Binary Classification: The model predicts the likelihood of an instance
belonging to a particular class (e.g., spam or not spam, disease or no
disease).
The output is a probability between 0 and 1, which can be interpreted as the
likelihood of an instance belonging to the positive class.
19(b) Steps of Logistic Regression Algorithm (4 marks)
The steps involved in implementing logistic regression are as follows:
1. Data Preprocessing:
o Collect and prepare the dataset.
o Handle missing data, normalize features (if needed), and encode
categorical variables (if necessary).
2. Model Initialization:
o Initialize the coefficients (weights) θ for the model. This is typically done
by setting them to small random values.
3. Sigmoid Function:
o For each instance, compute the predicted probability ŷ using the sigmoid function:
ŷ = 1 / (1 + e^-(θ0 + θ1x1 + θ2x2 + … + θnxn))
o where ŷ is the predicted probability, θ are the model parameters (weights), and x1, x2, …, xn are the feature values of the input instance.
4. Cost Function (Log-Loss): Compute the logistic loss (also called log-likelihood or binary cross-entropy loss) for the predictions:
J(θ) = −(1/m) Σ [ y(i) log(ŷ(i)) + (1 − y(i)) log(1 − ŷ(i)) ], summed over the m training examples
where m is the number of training examples, y(i) is the actual label, and ŷ(i) is the predicted probability for each training example.
5. Optimization (Gradient Descent):
o Update the model parameters θ by minimizing the cost function using an optimization algorithm, such as gradient descent:
θj := θj − α ∂J(θ)/∂θj
where α is the learning rate and ∂J(θ)/∂θj is the gradient of the cost function with respect to the model parameters.
6. Convergence:
o Repeat steps 4 and 5 until the cost function converges to a minimum or
after a set number of iterations.
7. Prediction:
o Once the model is trained, use the learned coefficients θ to predict the
class label for new data. If y^ > 0.5, classify it as class 1 (positive),
otherwise classify it as class 0 (negative).
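A compact NumPy sketch of these steps using batch gradient descent (the one-feature toy dataset, learning rate, and iteration count are illustrative assumptions, not part of the original answer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: one feature, binary labels (larger x tends to mean class 1)
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

X_b = np.hstack([np.ones((len(X), 1)), X])   # prepend a column of 1s for the intercept θ0
theta = np.zeros(X_b.shape[1])               # step 2: initialize weights
alpha, n_iters = 0.1, 5000                   # learning rate and number of iterations

for _ in range(n_iters):
    y_hat = sigmoid(X_b @ theta)             # step 3: predicted probabilities
    gradient = X_b.T @ (y_hat - y) / len(y)  # gradient of the log-loss (steps 4-5)
    theta -= alpha * gradient                # gradient descent update

probs = sigmoid(X_b @ theta)
print(theta, (probs > 0.5).astype(int))      # step 7: threshold at 0.5 to get class labels
```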
19(c) How to Measure the Goodness of Logistic Regression Result? (2 marks)
To measure the goodness of the logistic regression result, the following metrics
can be used:
1. Accuracy:
o Accuracy measures the proportion of correct predictions (both positive
and negative) to the total number of predictions.
o Formula: Accuracy={TP + TN}/{TP + TN + FP + FN}
2. Confusion Matrix:
o A confusion matrix helps evaluate the classification performance by
showing the counts of true positives, true negatives, false positives, and
false negatives.
20(a) K-medoids Clustering (4 marks)
K-medoids clustering is a variation of K-means clustering where the "mean" is
replaced by an actual data point (called a medoid) as the center of each cluster.
The medoid is the data point within a cluster that minimizes the total dissimilarity
to other points in the cluster. This makes K-medoids more robust to noise and
outliers compared to K-means because it uses actual data points instead of
averages to define the cluster center.
Steps in K-medoids clustering:
1. Initialization: Select k initial medoids (actual data points) randomly.
2. Assignment Step: Assign each data point to the nearest medoid based on a
distance measure (e.g., Manhattan or Euclidean distance).
3. Update Step: For each cluster, select the point that minimizes the sum of
dissimilarities to all other points as the new medoid.
4. Repeat Steps 2 and 3: Continue assigning points to the nearest medoid and
updating the medoids until convergence (when the medoids no longer
change).
Advantages:
• Less sensitive to outliers compared to K-means, as it uses actual points
instead of averages.
Disadvantages:
• Computationally more expensive than K-means because it requires
calculating dissimilarities between all pairs of points in the dataset.
20(b) Sigmoid Function (4 marks)
The sigmoid function is a mathematical function that maps any input value to a
range between 0 and 1, making it useful for modeling probabilities in classification
tasks, particularly in logistic regression and neural networks.
The formula for the sigmoid function is:
σ(x)={1}/{1 + e^{-x}}
Where:
• x is the input value (can be any real number).
• e is the base of the natural logarithm.
Properties of the Sigmoid Function:
• Range: 0≤σ(x)≤1, which makes it ideal for binary classification.
• Shape: The sigmoid curve is S-shaped, approaching 0 when x is very negative,
rising steeply around 0, and approaching 1 as x becomes large.
• Derivative: The derivative of the sigmoid function is σ(x)(1−σ(x)), which is
useful in backpropagation for training neural networks.
Applications:
• Used in binary classification to model probabilities of classes.
• In neural networks, the sigmoid activation function helps introduce non-
linearity to the model.
20(c) FP-Growth Algorithm (4 marks)
The FP-Growth (Frequent Pattern Growth) algorithm is an efficient method for
mining frequent itemsets in transactional databases, used primarily in association
rule mining. It is an improvement over the Apriori algorithm, designed to avoid
generating candidate itemsets and significantly improve computational efficiency.
Key Concepts of FP-Growth:
1. Frequent Pattern Tree (FP-Tree): The algorithm compresses the dataset into
a compact structure called the FP-tree, which maintains itemset information
in a way that allows fast access and mining.
2. Divide-and-Conquer: FP-Growth uses a recursive divide-and-conquer
strategy. It recursively projects the database into smaller subsets and mines
frequent itemsets from these subsets.
3. No Candidate Generation: Unlike Apriori, FP-Growth does not generate
candidate itemsets; instead, it extracts frequent patterns directly from the
FP-tree.
Steps of the FP-Growth Algorithm:
1. Construct the FP-tree: Create an FP-tree by scanning the database to count
the frequency of items, then sort items in each transaction in decreasing
order of frequency.
2. Mining Frequent Itemsets: Extract frequent itemsets from the FP-tree using
a recursive approach to find conditional FP-trees, then mine frequent
itemsets in those conditional trees.
Advantages:
• More efficient than Apriori, especially for large datasets.
• Avoids candidate generation, reducing computational complexity.
Disadvantages:
• Requires additional memory for constructing the FP-tree.
• The algorithm is less efficient when the dataset is sparse.
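As a usage sketch (not part of the original answer), the mlxtend library provides an FP-Growth implementation; assuming mlxtend and pandas are installed, mining the dataset from question 10(a) might look like this:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Transactions from question 10(a)
transactions = [list("ACD"), list("ABCE"), list("BC"), list("BE"), list("ABCE"), list("BCE")]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent itemsets without candidate generation
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))
```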
20(d) Bayes Classifier (4 marks)
The Bayes classifier is a probabilistic model used for classification tasks, based on
Bayes' Theorem. It predicts the probability of a class label given the observed
features of a sample. The classifier works by assuming that the features are
conditionally independent given the class label, which simplifies the computation
of the class probabilities.
Bayes' Theorem:
P(Ck | X) = P(X | Ck) × P(Ck) / P(X)
Where:
• P(Ck∣X) is the posterior probability of class Ck given features X.
• P(X∣Ck) is the likelihood of observing the features X given class Ck.
• P(Ck) is the prior probability of class Ck.
• P(X) is the probability of observing the features X across all classes.
Steps of the Naive Bayes Classification:
1. Compute Prior Probabilities: For each class, compute the probability P(Ck)
based on the frequency of that class in the training data.
2. Compute Likelihoods: For each feature, compute the likelihood P(Xi∣Ck)
based on the conditional distribution of features given the class.
3. Compute Posterior Probabilities: For each class, compute the posterior
probability P(Ck∣X) using Bayes' theorem and select the class with the highest
posterior probability.
Assumptions:
• The Naive Bayes classifier assumes that the features are conditionally
independent given the class label, which simplifies the calculation of
likelihoods.
Advantages:
• Simple and fast, particularly for high-dimensional data.
• Works well with small datasets and when feature independence holds.
Disadvantages:
• Assumes that all features are conditionally independent, which is often not
true in real-world data, leading to potential inaccuracies.
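A brief scikit-learn sketch of a Gaussian Naive Bayes classifier, which carries out the steps described above internally (the two-feature toy dataset is an illustrative assumption):

```python
from sklearn.naive_bayes import GaussianNB

# Toy data: two numeric features per sample, two classes
X = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0],   # class "low"
     [3.1, 0.5], [3.0, 0.7], [2.9, 0.4]]   # class "high"
y = ["low", "low", "low", "high", "high", "high"]

model = GaussianNB()          # estimates priors P(Ck) and per-class feature likelihoods P(Xi|Ck)
model.fit(X, y)
print(model.predict([[1.1, 2.0], [3.2, 0.6]]))  # expected: ['low', 'high']
print(model.predict_proba([[1.1, 2.0]]))        # posterior probabilities P(Ck|X)
```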
