Data Mining 2-5

Q(2) (a) What are the common methods for handling the problem of missing values and noisy data?

(b). For a given number series: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,

33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. Calculate

(i) What is the mean of the data? What is the median? (ii) What is the mode of the data? (iii) Find the first
quartile and the third quartile of the data.

(c) Explain the three general issues that affect the different types of software.

Ans:
(a) Common methods for handling missing values and noisy data include:
1. *Deletion*: Remove rows or columns with missing values. This is typically done when missing values
are few and don't significantly affect the overall dataset.
2. *Imputation*: Fill in missing values with estimated values, such as the mean, median, mode, or
predicted values from regression models (see the short sketch after this list).
3. *Prediction Models*: Use machine learning algorithms to predict missing values based on other
features in the dataset.
4. *Interpolation*: Estimate missing values based on neighboring values. Linear, polynomial, or time-
series interpolation techniques can be used.
5. *Data Transformation*: Convert data into a different representation that is more robust to noise, such
as using logarithms or percentiles.
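
As an illustration, here is a minimal pandas sketch of deletion and mean imputation; the small table and its column names are made up for the example:

import pandas as pd

# Hypothetical table with missing entries (NaN)
df = pd.DataFrame({"age": [25, 30, None, 40], "income": [50, 60, 55, None]})

# 1. Deletion: drop every row that contains a missing value
dropped = df.dropna()

# 2. Imputation: fill each missing value with its column mean
imputed = df.fillna(df.mean())

print(dropped)
print(imputed)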

(b) Calculations for the number series (27 values): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33,
33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70

(i) Mean: sum of all values / total count = 809 / 27 ≈ 29.96. Median: the middle (14th) of the 27 sorted values = 25.
(ii) Mode: the data is bimodal; 25 and 35 each appear four times.
(iii) First quartile (Q1): median of the lower half (the 13 values below the median) = 20.
Third quartile (Q3): median of the upper half (the 13 values above the median) = 35.
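
These values can be verified with a short Python check using the standard statistics module:

import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

print(statistics.mean(ages))            # 809 / 27 ≈ 29.96
print(statistics.median(ages))          # 14th of the 27 sorted values = 25
print(statistics.multimode(ages))       # [25, 35] -- the data is bimodal
print(statistics.quantiles(ages, n=4))  # [20.0, 25.0, 35.0] with the default method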

(c) Three General Issues Affecting Different Types of Software:

Security: Security issues impact all types of software. Vulnerabilities in code can lead to breaches, data
leaks, and unauthorized access. Ensuring secure coding practices, regular security audits, and prompt
patching of vulnerabilities are critical.

Scalability and Performance: Scalability is a challenge across software domains. As user bases grow or
data volumes increase, software must continue to perform well. Proper design, efficient algorithms, and
optimizing resource utilization are essential for maintaining performance.

Compatibility: Compatibility issues arise due to differences in hardware, software platforms, and
versions. Ensuring software runs smoothly across various configurations requires extensive testing and
adaptation. This is especially relevant as technology evolves and new devices/operating systems emerge.

Q(3) (a) Compare and contrast a data warehouse system and an operational database system.
Ans:-
(b) Describe the steps involved in data mining when viewed as a process of knowledge discovery.
Ans:
Here are the steps involved in data mining when viewed as a process of knowledge discovery:
• Business understanding: This step involves understanding the business problem that the data mining
project is trying to solve. It also involves identifying the data that is needed to solve the problem and
the goals of the data mining project.
• Data understanding: This step involves understanding the data that is available for the project,
including its quality, completeness, and format.
• Data preparation: This step involves preparing the data for mining: cleaning it, transforming it, and
selecting the features that will be used.
• Modeling: This step involves using data mining algorithms to build models of the data. These models
can be used to predict future outcomes, identify patterns, or cluster data.
• Evaluation: This step involves evaluating the models built in the modeling step, including their
accuracy, interpretability, and usefulness (a small end-to-end sketch follows this list).
• Deployment: This step involves deploying the models that were built: making them available to users
and integrating them into business processes.
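
To make the preparation, modeling, and evaluation steps concrete, here is a minimal scikit-learn sketch on a toy dataset; the dataset and the decision-tree model are only illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Data understanding / preparation: load a toy dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Modeling: scale the features, then fit a decision tree
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
model.fit(X_train, y_train)

# Evaluation: measure accuracy on the held-out data before any deployment
print(accuracy_score(y_test, model.predict(X_test)))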

Q(4) (a) What is the data warehouse backend process? Explain briefly.


Ans:-
The data warehouse backend process is the set of processes responsible for loading, managing, and
maintaining the data in a data warehouse. It includes the following steps:
• Extracting data from source systems: The first step is to extract data from the various source systems
used by the organization. This data can come from a variety of sources, such as transactional systems,
operational databases, and external data sources.
• Cleaning and transforming data: Once the data is extracted, it needs to be cleaned and transformed.
This involves removing errors and inconsistencies and converting the data into a format that can be
loaded into the data warehouse.
• Loading data into the data warehouse: Once the data is cleaned and transformed, it can be loaded
into the data warehouse. This process typically uses an ETL (extract, transform, load) tool (a minimal
sketch follows this list).
• Managing data in the data warehouse: Once the data is loaded, it needs to be managed. This involves
tasks such as backing up the data, monitoring it for errors, and optimizing the performance of the
data warehouse.
• Maintaining data in the data warehouse: Over time, the data in the data warehouse will need to be
maintained. This involves updating it with new data, removing old data, and correcting any errors.
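
A minimal extract-transform-load sketch using only the Python standard library; the file name sales.csv, its columns, and the SQLite target table are assumptions made for illustration:

import csv
import sqlite3

# Extract: read raw rows from a hypothetical source file
with open("sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: drop records with a missing amount and normalise the region field
clean = [
    {"region": r["region"].strip().upper(), "amount": float(r["amount"])}
    for r in rows
    if r.get("amount")
]

# Load: append the cleaned rows to a warehouse table
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales (region, amount) VALUES (:region, :amount)", clean)
conn.commit()
conn.close()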

The data warehouse backend process is an essential part of any data warehouse implementation. It
ensures that the data in the data warehouse is accurate, consistent, and up-to-date. This allows the data
warehouse to be used to gain insights from data and to make informed business decisions.

(b) Write and explain pseudocode for the Apriori algorithm. Explain the terms (i) support count; (ii)
confidence.
Ans:
The Apriori algorithm is a classic data mining algorithm used for frequent itemset mining and association
rule discovery. It aims to discover associations and correlations between items in a dataset. The
algorithm relies on the Apriori property, which states that if an itemset is frequent, then all of its
subsets must also be frequent.
Algorithm:
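Below is a minimal Python rendering of the pseudocode; it assumes each transaction is given as a set of items and that min_support is an absolute support count:

from itertools import combinations

def apriori(transactions, min_support):
    """Return every itemset whose support count is at least min_support."""
    items = {item for t in transactions for item in t}
    candidates = [frozenset([i]) for i in items]      # C1: candidate 1-itemsets
    frequent = {}
    k = 1
    while candidates:
        # Count the support of each candidate k-itemset
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)                    # Lk: frequent k-itemsets
        # Join step + prune step: build C(k+1) from Lk, keeping only candidates
        # whose every k-subset is frequent (the Apriori property)
        keys = list(survivors)
        k += 1
        candidates = {
            a | b
            for a in keys for b in keys
            if len(a | b) == k
            and all(frozenset(s) in survivors for s in combinations(a | b, k - 1))
        }
    return frequent

# Example: frequent itemsets with support count >= 2 in four made-up baskets
baskets = [{"bread", "butter"}, {"bread", "milk"}, {"bread", "butter", "milk"}, {"milk"}]
print(apriori(baskets, min_support=2))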

(i) Support count: The support count of an itemset is the number of transactions (instances) in the
dataset that contain that itemset. It represents the absolute frequency of the itemset; dividing it by the
total number of transactions gives the relative support, usually expressed as a fraction or percentage.

(ii) Confidence: Confidence measures the strength of the association or correlation between two
itemsets. Specifically, it is the conditional probability that a transaction containing itemset X also
contains itemset Y.

Confidence is defined as: Confidence(X → Y) = Support count(X ∪ Y) / Support count(X)
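
A small worked example with made-up baskets, showing how both measures are read off the data:

baskets = [
    {"bread", "butter"}, {"bread", "milk"}, {"bread", "butter", "milk"},
    {"milk"}, {"bread"},
]

# Support counts: how many baskets contain the itemset
sup_x = sum(1 for b in baskets if {"bread"} <= b)             # 4
sup_xy = sum(1 for b in baskets if {"bread", "butter"} <= b)  # 2

# Confidence(bread -> butter) = support count(X ∪ Y) / support count(X)
print(sup_xy / sup_x)  # 2 / 4 = 0.5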

Q(5) What is cluster analysis? How do we categorize the major clustering methods? Explain each in brief.
Ans:-
Clustering is a type of unsupervised machine learning that involves grouping data points together based
on their similarity. Clustering algorithms can be categorized into two main types: partitional clustering
and hierarchical clustering.
Partitional clustering methods divide the data into a pre-determined number of clusters. Some
popular partitional clustering algorithms include:

• K-means clustering: This algorithm starts by randomly assigning each data point to one of k clusters.
It then iteratively updates the cluster centroids and reassigns each data point to the cluster with the
closest centroid (a short sketch follows this list).
• Expectation-maximization (EM) clustering: This probabilistic algorithm works by iteratively
estimating the parameters of a mixture model.
• Spectral clustering: This algorithm uses the spectrum (eigenvalues) of the data's similarity matrix to
find clusters.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Strictly a density-based
method rather than a partitional one, it groups points that are densely packed together and treats
sparsely scattered points as noise.
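
A minimal k-means sketch with scikit-learn on a made-up set of 2-D points; the data and the choice of k = 2 are only illustrative:

import numpy as np
from sklearn.cluster import KMeans

# Two loose groups of points, one near (0, 0) and one near (5, 5)
points = np.array([[0, 0], [0.5, 0.2], [0.1, 0.8],
                   [5, 5], [5.3, 4.8], [4.9, 5.4]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster index assigned to each point
print(km.cluster_centers_)  # final centroids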

Hierarchical clustering methods build a hierarchy of clusters by successively merging or splitting clusters.
Some popular hierarchical clustering algorithms include:

• Agglomerative hierarchical clustering: This algorithm starts by assigning each data point to its own
cluster. It then repeatedly merges the two most similar clusters until only one cluster is left (a short
sketch follows this list).
• Divisive hierarchical clustering: This algorithm starts by assigning all data points to a single cluster.
It then repeatedly splits the most heterogeneous cluster until each data point is in its own cluster.
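
A minimal agglomerative sketch using SciPy on the same kind of made-up 2-D points; average linkage and the cut into two clusters are assumptions made for illustration:

from scipy.cluster.hierarchy import linkage, fcluster

points = [[0, 0], [0.5, 0.2], [0.1, 0.8], [5, 5], [5.3, 4.8], [4.9, 5.4]]

# Each point starts in its own cluster; the closest clusters are merged step by step
merges = linkage(points, method="average")

# Cut the resulting hierarchy into two flat clusters
print(fcluster(merges, t=2, criterion="maxclust"))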
