Data Mining 2-5

Q(2) (a) What are the common methods for handling the problem of missing values and noisy data?

(b). For a given number series: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,

33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. Calculate

(i) What is the mean of the data? What is the median? (ii) What is the mode of the data? (iii) Find the first
quartile and the third quartile of the data.

(c) Explain the three general issues that affect the different types of software.

Ans:
(a) Common methods for handling missing values and noisy data include:
1. *Deletion*: Remove rows or columns with missing values. This is typically done when missing values
are few and don't significantly affect the overall dataset.
2. *Imputation*: Fill in missing values with estimated values, such as the mean, median, mode, or
predicted values from regression models (see the short sketch after this list).
3. *Prediction Models*: Use machine learning algorithms to predict missing values based on other
features in the dataset.
4. *Interpolation*: Estimate missing values based on neighboring values. Linear, polynomial, or time-
series interpolation techniques can be used.
5. *Data Transformation*: Convert data into a different representation that is more robust to noise, such
as using logarithms or percentiles.
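
As an illustration, here is a minimal pandas sketch of deletion and mean imputation; the small table and its column names are made up for the example:

import pandas as pd

# Hypothetical table with missing entries (NaN)
df = pd.DataFrame({"age": [25, 30, None, 40], "income": [50, 60, 55, None]})

# 1. Deletion: drop every row that contains a missing value
dropped = df.dropna()

# 2. Imputation: fill each missing value with its column mean
imputed = df.fillna(df.mean())

print(dropped)
print(imputed)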

(b) Calculations for the number series (27 values): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33,
33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70

(i) Mean: sum of all values / total count = 809 / 27 ≈ 29.96. Median: the middle (14th) of the 27 sorted values = 25.
(ii) Mode: the data is bimodal; 25 and 35 each appear four times.
(iii) First quartile (Q1): median of the lower half (the 13 values below the median) = 20.
Third quartile (Q3): median of the upper half (the 13 values above the median) = 35.
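
These values can be verified with a short Python check using the standard statistics module:

import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

print(statistics.mean(ages))            # 809 / 27 ≈ 29.96
print(statistics.median(ages))          # 14th of the 27 sorted values = 25
print(statistics.multimode(ages))       # [25, 35] -- the data is bimodal
print(statistics.quantiles(ages, n=4))  # [20.0, 25.0, 35.0] with the default method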

(c) Three General Issues Affecting Different Types of Software:

Security: Security issues impact all types of software. Vulnerabilities in code can lead to breaches, data
leaks, and unauthorized access. Ensuring secure coding practices, regular security audits, and prompt
patching of vulnerabilities are critical.

Scalability and Performance: Scalability is a challenge across software domains. As user bases grow or
data volumes increase, software must continue to perform well. Proper design, efficient algorithms, and
optimizing resource utilization are essential for maintaining performance.

Compatibility: Compatibility issues arise due to differences in hardware, software platforms, and
versions. Ensuring software runs smoothly across various configurations requires extensive testing and
adaptation. This is especially relevant as technology evolves and new devices/operating systems emerge.

Q(3) (a) Compare and contrast a data warehouse system and an operational database system.
Ans:-
(b) Describe the steps involved in data mining when viewed as a process of knowledge discovery.
Ans:
Here are the steps involved in data mining when viewed as a process of knowledge discovery:
• Business understanding: This step involves understanding the business problem that the data mining
project is trying to solve. It also involves identifying the data that is needed to solve the problem and
the goals of the data mining project.
• Data understanding: This step involves understanding the data that is available for the project,
including its quality, completeness, and format.
• Data preparation: This step involves preparing the data for mining: cleaning it, transforming it, and
selecting the features that will be used.
• Modeling: This step involves using data mining algorithms to build models of the data. These models
can be used to predict future outcomes, identify patterns, or cluster data.
• Evaluation: This step involves evaluating the models built in the modeling step, including their
accuracy, interpretability, and usefulness (a small end-to-end sketch follows this list).
• Deployment: This step involves deploying the models that were built: making them available to users
and integrating them into business processes.
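
To make the preparation, modeling, and evaluation steps concrete, here is a minimal scikit-learn sketch on a toy dataset; the dataset and the decision-tree model are only illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Data understanding / preparation: load a toy dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Modeling: scale the features, then fit a decision tree
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
model.fit(X_train, y_train)

# Evaluation: measure accuracy on the held-out data before any deployment
print(accuracy_score(y_test, model.predict(X_test)))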

Q(4) (a) What is the data warehouse backend process? Explain briefly.


Ans:-
The data warehouse backend process is the set of processes responsible for loading, managing, and
maintaining the data in a data warehouse. It includes the following steps:
• Extracting data from source systems: The first step is to extract data from the various source systems
used by the organization. This data can come from a variety of sources, such as transactional systems,
operational databases, and external data sources.
• Cleaning and transforming data: Once the data is extracted, it needs to be cleaned and transformed.
This involves removing errors and inconsistencies and converting the data into a format that can be
loaded into the data warehouse.
• Loading data into the data warehouse: Once the data is cleaned and transformed, it can be loaded
into the data warehouse. This process typically uses an ETL (extract, transform, load) tool (a minimal
sketch follows this list).
• Managing data in the data warehouse: Once the data is loaded, it needs to be managed. This involves
tasks such as backing up the data, monitoring it for errors, and optimizing the performance of the
data warehouse.
• Maintaining data in the data warehouse: Over time, the data in the data warehouse will need to be
maintained. This involves updating it with new data, removing old data, and correcting any errors.
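
A minimal extract-transform-load sketch using only the Python standard library; the file name sales.csv, its columns, and the SQLite target table are assumptions made for illustration:

import csv
import sqlite3

# Extract: read raw rows from a hypothetical source file
with open("sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: drop records with a missing amount and normalise the region field
clean = [
    {"region": r["region"].strip().upper(), "amount": float(r["amount"])}
    for r in rows
    if r.get("amount")
]

# Load: append the cleaned rows to a warehouse table
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales (region, amount) VALUES (:region, :amount)", clean)
conn.commit()
conn.close()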

The data warehouse backend process is an essential part of any data warehouse implementation. It
ensures that the data in the data warehouse is accurate, consistent, and up-to-date. This allows the data
warehouse to be used to gain insights from data and to make informed business decisions.

(b) Write and explain pseudocode for the Apriori algorithm. Explain the terms (i) support count; (ii)
confidence.
Ans:
The Apriori algorithm is a classic data mining algorithm used for frequent itemset mining and association
rule discovery. It aims to discover associations and correlations between items in a dataset. The
algorithm relies on the Apriori property, which states that if an itemset is frequent, then all of its
subsets must also be frequent.
Algorithm:
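Below is a minimal Python rendering of the pseudocode; it assumes each transaction is given as a set of items and that min_support is an absolute support count:

from itertools import combinations

def apriori(transactions, min_support):
    """Return every itemset whose support count is at least min_support."""
    items = {item for t in transactions for item in t}
    candidates = [frozenset([i]) for i in items]      # C1: candidate 1-itemsets
    frequent = {}
    k = 1
    while candidates:
        # Count the support of each candidate k-itemset
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)                    # Lk: frequent k-itemsets
        # Join step + prune step: build C(k+1) from Lk, keeping only candidates
        # whose every k-subset is frequent (the Apriori property)
        keys = list(survivors)
        k += 1
        candidates = {
            a | b
            for a in keys for b in keys
            if len(a | b) == k
            and all(frozenset(s) in survivors for s in combinations(a | b, k - 1))
        }
    return frequent

# Example: frequent itemsets with support count >= 2 in four made-up baskets
baskets = [{"bread", "butter"}, {"bread", "milk"}, {"bread", "butter", "milk"}, {"milk"}]
print(apriori(baskets, min_support=2))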

(i) Support count: The support count of an itemset is the number of transactions (instances) in the
dataset that contain that itemset. It represents the absolute frequency of the itemset; dividing it by the
total number of transactions gives the relative support, usually expressed as a fraction or percentage.

(ii) Confidence: Confidence measures the strength of the association or correlation between two
itemsets. Specifically, it is the conditional probability that a transaction containing itemset X also
contains itemset Y.

Confidence is defined as: Confidence(X → Y) = Support count(X ∪ Y) / Support count(X)
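
A small worked example with made-up baskets, showing how both measures are read off the data:

baskets = [
    {"bread", "butter"}, {"bread", "milk"}, {"bread", "butter", "milk"},
    {"milk"}, {"bread"},
]

# Support counts: how many baskets contain the itemset
sup_x = sum(1 for b in baskets if {"bread"} <= b)             # 4
sup_xy = sum(1 for b in baskets if {"bread", "butter"} <= b)  # 2

# Confidence(bread -> butter) = support count(X ∪ Y) / support count(X)
print(sup_xy / sup_x)  # 2 / 4 = 0.5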

Q(5) What is cluster analysis? How do we categorize the major clustering methods? Explain each in brief.
Ans:-
Clustering is a type of unsupervised machine learning that involves grouping data points together based
on their similarity. Clustering algorithms can be categorized into two main types: partitional clustering
and hierarchical clustering.
Partitional clustering methods divide the data into a pre-determined number of clusters. Some
popular partitional clustering algorithms include:

• K-means clustering: This algorithm starts by randomly assigning each data point to one of k clusters.
It then iteratively updates the cluster centroids and reassigns each data point to the cluster with the
closest centroid (a short sketch follows this list).
• Expectation-maximization (EM) clustering: This probabilistic algorithm works by iteratively
estimating the parameters of a mixture model.
• Spectral clustering: This algorithm uses the spectrum (eigenvalues) of the data's similarity matrix to
find clusters.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Strictly a density-based
method rather than a partitional one, it groups points that are densely packed together and treats
sparsely scattered points as noise.
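
A minimal k-means sketch with scikit-learn on a made-up set of 2-D points; the data and the choice of k = 2 are only illustrative:

import numpy as np
from sklearn.cluster import KMeans

# Two loose groups of points, one near (0, 0) and one near (5, 5)
points = np.array([[0, 0], [0.5, 0.2], [0.1, 0.8],
                   [5, 5], [5.3, 4.8], [4.9, 5.4]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster index assigned to each point
print(km.cluster_centers_)  # final centroids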

Hierarchical clustering methods build a hierarchy of clusters by successively merging or splitting clusters.
Some popular hierarchical clustering algorithms include:

• Agglomerative hierarchical clustering: This algorithm starts by assigning each data point to its own
cluster. It then repeatedly merges the two most similar clusters until only one cluster is left (a short
sketch follows this list).
• Divisive hierarchical clustering: This algorithm starts by assigning all data points to a single cluster.
It then repeatedly splits the most heterogeneous cluster until each data point is in its own cluster.
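
A minimal agglomerative sketch using SciPy on the same kind of made-up 2-D points; average linkage and the cut into two clusters are assumptions made for illustration:

from scipy.cluster.hierarchy import linkage, fcluster

points = [[0, 0], [0.5, 0.2], [0.1, 0.8], [5, 5], [5.3, 4.8], [4.9, 5.4]]

# Each point starts in its own cluster; the closest clusters are merged step by step
merges = linkage(points, method="average")

# Cut the resulting hierarchy into two flat clusters
print(fcluster(merges, t=2, criterion="maxclust"))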
