
UNIT-I

INTRODUCTION TO DATA MINING

ESSAY QUESTIONS

1. Write about the overview of Data Mining.


Ans: Data mining refers to extracting or mining knowledge
from large amounts of data. The term is actually a misnomer;
data mining would more appropriately have been named
knowledge mining, which emphasizes mining knowledge from
large amounts of data. It is the computational process of
discovering patterns in large data sets involving methods at
the intersection of artificial intelligence, machine learning,
statistics, and database systems. The overall goal of the data
mining process is to extract information from a data set and
transform it into an understandable structure for further use.
The key properties of data mining are
 Automatic discovery of patterns
 Prediction of likely outcomes
 Creation of actionable information
 Focus on large datasets and databases
The Scope of Data Mining
Data mining derives its name from the similarity between
searching for valuable business information in a large
database (for example, finding linked products in gigabytes
of store-scanner data) and mining a mountain for a vein of
valuable ore. Both processes require either sifting through
an immense amount of material, or intelligently probing it to
find exactly where the value resides. Given
databases of sufficient size and quality, data mining
technology can generate new business opportunities by
providing these capabilities:
1. Automated prediction of trends and behaviors: Data
mining automates the process of finding predictive
information in large databases. Questions that traditionally
required extensive hands-on analysis can now be answered
directly from the data quickly. A typical example of a
predictive problem is targeted marketing. Data mining uses
data on past promotional mailings to identify the targets most
likely to maximize return on investment in future mailings.
Other predictive problems include forecasting bankruptcy and
other forms of default, and identifying segments of a
population likely to respond similarly to given events.
2. Automated discovery of previously unknown
patterns: Data mining tools sweep through databases and
identify previously hidden patterns in one step. An example of
pattern discovery is the analysis of retail sales data to identify
seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting
fraudulent credit card transactions and identifying anomalous
data that could represent data entry keying errors.
Tasks of Data Mining
Data mining involves six common classes of tasks:
1. Anomaly detection (Outlier/change/deviation
detection): The identification of unusual data records, that
might be interesting or data errors that require further
investigation.
2. Association rule learning (Dependency modelling):
Searches for relationships between variables. For example a
supermarket might gather data on customer purchasing habits.
Using association rule learning, the supermarket can
determine which products are frequently bought together and
use this information for marketing purposes. This is
sometimes referred to as market basket analysis.
3. Clustering: It is the task of discovering groups and
structures in the data that are in some way or another
"similar", without using known structures in the data.
4. Classification: It is the task of generalizing known
structure to apply to new data. For example, an e-mail
program might attempt to classify an e-mail as "legitimate" or
as "spam".
5. Regression: It attempts to find a function which
models the data with the least error.
6. Summarization: Providing a more compact
representation of the data set, including visualization and
report generation.

2. Write about the Architecture of Data mining


Ans: A typical data mining system may have the following
major components:
1. Knowledge Base: This is the domain knowledge that
is used to guide the search or evaluate the interestingness of
resulting patterns. Such knowledge can include concept
hierarchies, used to organize attributes or attribute values into
different levels of abstraction. Knowledge such as user
beliefs, which can be used to assess a pattern’s interestingness
based on its unexpectedness, may also be included. Other
examples of domain knowledge are additional interestingness
constraints or thresholds, and metadata (e.g., describing data
from multiple heterogeneous sources).

2. Data Mining Engine: This is essential to the data
mining system and ideally consists of a set of functional
modules for tasks such as characterization, association and
correlation analysis, classification, prediction, cluster analysis,
outlier analysis, and evolution analysis.
3. Pattern Evaluation Module: This component
typically employs interestingness measures and interacts with the
data mining modules so as to focus the search toward
interesting patterns. It may use interestingness thresholds to
filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module,
depending on the implementation of the data mining method
used. For efficient data mining, it is highly recommended to
push the evaluation of pattern interestingness as deep as
possible into the mining process so as to confine the search to
only the interesting patterns.
4. User interface: This module communicates between
users and the data mining system, allowing the user to interact
with the system by specifying a data mining query or task,
providing information to help focus the search, and
performing exploratory data mining based on the intermediate
data mining results. In addition, this component allows the
user to browse database and data warehouse schemas or data
structures, evaluate mined patterns, and visualize the patterns
in different forms.

3. Explain DM as a step in the process of knowledge
discovery.
Ans: The major reason that data mining has attracted a great
deal of attention in the information industry in recent years is
the wide availability of huge amounts of data and the
imminent need for turning such data into useful information
and knowledge. The information and knowledge gained can
be used for applications ranging from business management,
production control, and market analysis, to engineering design
and science exploration.
Data mining refers to extracting or mining knowledge
from large amounts of data. There are many other terms
related to data mining, such as knowledge mining, knowledge
extraction, data/pattern analysis, data archaeology, and data
dredging. Many people treat data mining as a synonym for
another popularly used term, “Knowledge Discovery in
Databases”, or KDD.
Essential step in the process of knowledge discovery in
databases
Knowledge discovery as a process is depicted in
following figure and consists of an iterative sequence of the
following steps:
 Data cleaning: to remove noise or irrelevant data
 Data integration: where multiple data sources may be
combined
 Data selection: where data relevant to the analysis task
are retrieved from the database
 Data transformation: where data are transformed or
consolidated into forms appropriate for mining by
performing summary or aggregation operations
 Data mining: an essential process where intelligent
methods are applied in order to extract data patterns
 Pattern evaluation: to identify the truly interesting
patterns representing knowledge based on some
interestingness measures
 Knowledge presentation: where visualization and
knowledge representation techniques are used to present
the mined knowledge to the user.
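To make these steps concrete, the following is a minimal sketch of the
cleaning, selection and transformation stages using the pandas library;
the table, column names and values are purely illustrative assumptions
and not part of any standard example.

```python
import pandas as pd

# Hypothetical raw sales records; column names and values are illustrative only.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, None],
    "age":         [34, 51, 51, 29, 41],
    "amount":      [120.0, 80.0, 80.0, None, 150.0],
})

# Data cleaning: drop records with no customer id, remove duplicates,
# and fill the missing amount with the attribute mean.
clean = raw.dropna(subset=["customer_id"]).drop_duplicates()
clean = clean.assign(amount=clean["amount"].fillna(clean["amount"].mean()))

# Data integration would merge further sources here, e.g. with pd.merge().

# Data selection: keep only the attributes relevant to the analysis task.
selected = clean[["age", "amount"]]

# Data transformation: aggregate the amounts by age decade before mining.
selected = selected.assign(age_group=(selected["age"] // 10) * 10)
print(selected.groupby("age_group")["amount"].sum())
```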

4. Discuss about Data Mining Functionalities.


Ans: Data mining functionalities are used to specify the kind
of patterns to be found in data mining tasks. Data mining
tasks can be classified into two categories: descriptive and
predictive.
 Descriptive mining tasks characterize the general
properties of the data in the database.
 Predictive mining tasks perform inference on the current
data in order to make predictions.
Concept/Class Description: Characterization and
Discrimination: Data can be associated with classes or
concepts. For example, in the Electronics store, classes of
items for sale include computers and printers, and concepts of
customers include big spenders and budget spenders.
a) Data characterization: Data characterization is a
summarization of the general characteristics or features of a
target class of data.
b) Data discrimination: Data discrimination is a
comparison of the general features of target class data objects
with the general features of objects from one or a set of
contrasting classes.
Mining Frequent Patterns, Associations, and
Correlations: Frequent patterns are patterns that occur
frequently in data. There are many kinds of frequent patterns,
including itemsets, subsequences, and substructures.
Association analysis: Suppose, as a marketing
manager, you would like to determine which items are
frequently purchased together within the same transactions.
buys(X, “computer”) ⇒ buys(X, “software”)
[support = 1%, confidence = 50%]
Where X is a variable representing a customer.
Confidence=50% means that if a customer buys a computer,
there is a 50% chance that she will buy software as well.
Support=1% means that 1% of all of the transactions
under analysis showed that computer and software were
purchased together.
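As a small illustration of how support and confidence are computed, the
following sketch counts them over a toy list of transactions; the item
names and figures are made up for the example and do not reproduce the
1% / 50% values quoted above.

```python
# Toy transaction data; item names are illustrative.
transactions = [
    {"computer", "software", "mouse"},
    {"computer", "printer"},
    {"software"},
    {"computer", "software"},
    {"printer", "mouse"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / n              # fraction of all transactions containing both items
confidence = both / antecedent  # fraction of computer-buyers who also buy software
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```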
Classification and Prediction: Classification is the
process of finding a model that describes and distinguishes
data classes for the purpose of being able to use the model to
predict the class of objects whose class label is unknown.
“How is the derived model presented?” The derived
model may be represented in various forms, such as
classification (IF-THEN) rules, decision trees, mathematical
formulae, or neural networks.

A decision tree is a flow-chart-like tree structure, where
each node denotes a test on an attribute value, each branch
represents an outcome of the test, and tree leaves represent
classes or class distributions.
A neural network, when used for classification, is
typically a collection of neuron-like processing units with
weighted connections between the units.
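A minimal sketch of learning such a classification model, here using
scikit-learn's decision tree (an assumed, commonly available library);
the attributes, values and class labels are invented for illustration.
The learned tree can be printed as IF-THEN style rules.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training data: [age, income]; class labels are illustrative only.
X = [[25, 30000], [47, 90000], [35, 60000], [52, 110000], [23, 20000], [40, 75000]]
y = ["budget", "big_spender", "budget", "big_spender", "budget", "big_spender"]

# Fit a small decision tree; each internal node tests one attribute value.
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# Render the learned model as IF-THEN style rules.
print(export_text(model, feature_names=["age", "income"]))

# Predict the class of an object whose class label is unknown.
print(model.predict([[30, 85000]]))
```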

Cluster Analysis: Classification and prediction
analyze class-labeled data objects, whereas clustering
analyzes data objects without consulting a known class label.

The objects are grouped based on the principle of
maximizing the intraclass similarity and minimizing the
interclass similarity. That is, clusters of objects are formed so
that objects within a cluster have high similarity in
comparison to one another, but are very dissimilar to objects
in other clusters.
Outlier Analysis: A database may contain data objects
that do not comply with the general behavior or model of the
data. These data objects are outliers. Most data mining
methods discard outliers as noise or exceptions. The analysis
of outlier data is referred to as outlier mining.

5. What are the functional components of a data mining
GUI?
Ans: A data mining query language provides necessary
primitives that allow users to communicate with data mining
systems. However, inexperienced users may find data mining
query languages occurred to use and the syntax difficult to
remember. Instead, users may prefer to communicate with
data mining systems through a graphical user interface (GUI).
In relational database technology, SQL serves as a standard
“core” language for relational systems, on top of which GUIs
can easily be designed. Similarly, a data mining query
language may serve as a “core language” for data mining
system implementation, providing a basis for the
development of GUIs for effective data mining.
A data mining GUI may consist of the following
functional components:
1. Data collection and data mining query
compositions: This component allows the user to specify
task-relevant data sets and to compose data mining queries. It
is similar to GUI’s used for the specification of relational
queries.
2. Presentations of discovered patterns: This
component allows the display of the discovered patterns in
various forms, including tables, graphs, charts, curves and
other visualization techniques.
3. Hierarchy specification and manipulation: This
component allows for concept hierarchy specification, either
manually by the user or automatically (based on analysis of
the data at hand). In addition, this component should allow
concept hierarchies to be modified by the user or adjusted
automatically based on the given data set distribution.
4. Manipulation of data mining primitives: This
component may allow the dynamic adjustment of the data
mining thresholds, as well as the selection, display and
modification of concept hierarchies. It may also allow the
modification of previous data mining queries or conditions.
5. Interactive multilevel mining: This component
should allow roll-up or drill-down operations on discovered
patterns.
6. Other miscellaneous information: This component
may include on-line help manuals, indexed search, debugging,
and other interactive graphical facilities. The design of a
graphical user interface should also take into consideration
different classes of users of a data mining system. In general,
users of data mining systems can be classified into two
categories: business analysts and business executives.
a) Business analysts: Business analysts would like to
have flexibility and convenience in selecting different
portions of the data, manipulating dimensions and levels,
setting mining parameters, and tuning data mining processes.
b) Business Executive: A business executive needs
clear presentation and interpretation of data mining results,
flexibility in viewing and comparing different data mining
results, and easy integration of data mining results into report
writing and presentation processes. A well-designed data
mining system should provide friendly user interfaces for
both kinds of users.

6. What is data mining? Explain the advantages and
disadvantages of data mining.
Ans. The term data mining has been stretched beyond its limits
to apply to any form of data analysis.
According to William J Frawley, Gregory Piatetsky-
Shapiro and Christopher J Matheus, "Data Mining, or
Knowledge Discovery in Databases (KDD), is the nontrivial
extraction of implicit, previously unknown, and potentially
useful information from data. This encompasses a number of
different technical approaches, such as clustering, data
summarization, learning classification rules, finding
dependency networks, analyzing changes, and detecting
anomalies".
According to Marcel Holshemier & Arno Siebes
(1994), "Data mining is the search for relationships and global
patterns that exist in large databases but are 'hidden' among
the vast amount of data, such as a relationship between patient
data and their medical diagnosis. These relationships
represent valuable knowledge about the database and the
objects in the database and, if the database is a faithful mirror,
of the real world registered by the database".
Basically data mining is concerned with the analysis of
data and the use of software techniques for finding patterns
and regularities in sets of data. It is the computer, which is
responsible for finding the patterns by identifying the
underlying rules and features in the data.
Components of Data Mining :
The architecture of a typical data mining system may
have the following major components (figure) :
1) Database, Data Warehouse, or Other Information
Repository : This is one or a set of databases, data
warehouses, spreadsheets, or other kinds of information
repositories. Data cleaning and data integration techniques
may be performed on the data.
2) Database or Data Warehouse Server : The database or
data warehouse server is responsible for fetching the relevant
data, based on the user's data mining request.
3) Knowledge Base : This is the domain knowledge that is
used to guide the search, or. evaluate the interestingness of
resulting patterns. Such knowledge can include concept
hierarchies, used to organize attributes or attribute values into
different levels of abstraction.
4) Data Mining Engine : This is essential to the data
mining system and ideally consists of a set of functional
modules for tasks such as characterization, association,
classification, cluster analysis, and evolution and deviation
analysis.

Figure : Architecture of a Typical Data Mining System


5) Pattern Evaluation Module : This component typically
employs interestingness measures and interacts with the data
mining modules so as to focus the search towards interesting
patterns. It may use interestingness thresholds to filter out
discovered patterns.
For efficient data mining, it is highly recommended to
push the evaluation of pattern interestingness as deep as
possible into the mining process so as to confine the search to
only the interesting patterns.
6) Graphical User Interface : This module communicates
between users and the data mining system, allowing the user
to interact with the system by specifying a data mining query
or task, providing information to help focus the search, and
performing exploratory data mining based on the intermediate
data mining results. In addition, this component allows the
user to browse database and data warehouse schemas or data
structures, evaluate mined patterns, and visualize the patterns
in different forms.
Advantages of Data Mining
1) Automated Prediction of Trends and Behaviors :
Data mining automates the process of finding predictive
information in large databases. Questions that traditionally
required extensive hands-on analysis can now be answered
directly from the data - quickly.
2) Automated Discovery of Previously Unknown
Patterns: Data mining tools sweep through databases and
identify previously hidden patterns in one step.
3) Databases can be Larger in both Depth and Breadth:
The databases can have more columns and rows. Usually,
analysts must often limit the number of variables they
examine when doing hands-on analysis due to time
constraints. Yet, variables that are discarded because they
seem unimportant may carry information about unknown
patterns.
High performance data mining allows users to explore
the full depth of a database, without pre-selecting a subset of
variables. Data mining databases can contain larger samples
(more rows), which yield lower estimation errors and
variance, and allow users to make inferences about small but
important segments of a population.
Disadvantages of Data Mining :
1) Privacy Issues : Personal privacy has always been a
major concern. In recent years, with the widespread use of the
Internet, concerns about privacy have increased
tremendously. Because of privacy issues, some people do not
shop on the Internet. They are afraid that somebody may have
access to their personal information and then use that
information in an unethical way, thus causing them harm.
The selling of personal information may also bring harm
to the customers because one does not know what the other
companies are planning to do with the personal information
that they have purchased.
2) Security Issues : Although companies have a lot of
personal information about us available online, they do not
have sufficient security systems in place to protect that
information.
For example, recently the Ford Motor credit company
had to inform 13,000 of the consumers that their personal
information including Social Security Number, address,
account number and payment history were accessed by
hackers who broke into a database belonging to the Experian
credit reporting agency. This incident illustrated that
companies are willing to disclose and share your personal
information, but they are not taking care of the information
properly. With so much personal information available,
identity theft could become a real problem.
3) Misuse of Information / Inaccurate Information :
Trends obtained through data mining, intended to be used for
marketing or other ethical purposes, may be misused.
Unethical businesses or people may use the information
obtained through data mining to take advantage of vulnerable
people or to discriminate against a certain group of people. In
addition, data mining techniques are not 100 percent accurate;
thus mistakes do happen, which can have serious
consequences.

7. Write about Data and Attribute types in Data Mining.


Ans: Data objects are the essential part of a database. A data
object represents an entity and is described by a group of
attributes of that entity. For example, a sales data object may
represent a customer, a sale or a purchase. When data objects
are stored in a database, they are called data tuples.
Attribute
An attribute is a data field that represents a
characteristic or feature of a data object. For a customer
object, attributes can be customer ID, address, etc. The set of
attributes used to describe a given object is known as an
attribute vector or feature vector.
Type of attributes:
This is the first step of data preprocessing. We
differentiate between the different types of attributes and then
preprocess the data. The attribute types are described below.
1. Qualitative (Nominal (N), Ordinal (O), Binary (B)).
2. Quantitative (Discrete, Continuous)
Qualitative Attributes:
1. Nominal Attributes – related to names: The values
of a nominal attribute are names of things, some kind of
symbols. Values of nominal attributes represent some
category or state, which is why nominal attributes are also
referred to as categorical attributes, and there is no order (rank,
position) among the values of a nominal attribute.

Example:
Attribute          Values
Colours            Black, Brown, White
Categorical data   Lecturer, Professor, Assistant Professor
2. Binary Attributes: Binary data has only 2
values/states. For example, yes or no, affected or unaffected,
true or false.
i) Symmetric: Both values are equally important
(Gender).
ii) Asymmetric: Both values are not equally important
(Result).

3. Ordinal Attributes: Ordinal attributes contain
values that have a meaningful sequence or ranking (order)
between them, but the magnitude between values is not
actually known; the order of values shows what is
important but does not indicate how important it is.

Quantitative Attributes
1. Numeric: A numeric attribute is quantitative because
it is a measurable quantity, represented in integer or real
values. Numerical attributes are of 2 types: interval and ratio.
i) An interval-scaled attribute: It has values whose
differences are interpretable, but the attribute does not have
a true reference point, or what we can call a zero point.
Data can be added and subtracted on an interval scale but
cannot be multiplied or divided. Consider temperature in
degrees Centigrade: if the temperature of one day is
numerically twice that of another day, we cannot say that
one day is twice as hot as the other.
ii) A ratio-scaled attribute: It is a numeric attribute with a
fixed zero point. If a measurement is ratio-scaled, we can speak
of a value as being a multiple (or ratio) of another value. The
values are ordered, and we can also compute the difference
between values; the mean, median, mode, quantile range
and five-number summary can be given.
2. Discrete: A discrete attribute has a finite or countably
infinite set of values; it can be numerical and can also be in
categorical form.
Example: number of students in a class, zip codes.
3. Continuous: A continuous attribute has an infinite
number of states. Continuous data are of float type; there can
be many values between 2 and 3.
Example: height, weight, temperature.

8. Discuss about Statistical Description of Data.


Ans: A measure is distributive if we can partition the dataset
into smaller subsets, compute the measure on the individual
subsets, and then combine the partial results in order to arrive
at the measure’s value on the entire (original) dataset.
 A measure is algebraic if it can be computed by
applying an algebraic function to one or more
distributive measures
 A measure is holistic if it must be computed on the
entire dataset as a whole
Measures of Central Tendency
A measure of central tendency is a single value that
attempts to describe a set of data by identifying the central
position within that set of data. As such, measures of central
tendency are sometimes called measures of central location.
In other words, in many real-life situations, it is helpful
to describe data by a single number that is most representative
of the entire collection of numbers. Such a number is called a
measure of central tendency. The most commonly used
measures are as follows: mean, median, and mode.
a) Mean: The mean, or average, of n numbers is the sum of
the numbers divided by n.
b) Median: The median of a set of numbers is the middle
number when the numbers are written in order. If n is even,
the median is the average of the two middle numbers.
c) Mode: The mode of a set of numbers is the number that
occurs most frequently. If two numbers tie for most frequent
occurrence, the collection has two modes and is called
bimodal.
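A quick sketch of these three measures using Python's built-in
statistics module, on an illustrative list of values:

```python
import statistics

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 110]   # illustrative data

print(statistics.mean(values))    # sum of the numbers divided by n -> about 56.9
print(statistics.median(values))  # middle number of the ordered list -> 52
print(statistics.mode(values))    # most frequently occurring value -> 52
```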
Measures of Dispersion
Measures of dispersion measure how spread out a set of
data is. The two most commonly used measures of dispersion
are the variance and the standard deviation. Rather than
showing how data are similar, they show how the data vary,
that is, their variation, spread, or dispersion.
a) Variance and Standard Deviation: Very different
sets of numbers can have the same mean. You will now study
two measures of dispersion, which give you an idea of how
much the numbers in a set differ from the mean of the set.
These two measures are called the variance of the set and the
standard deviation of the set
b) Percentile: Percentiles are values that divide a
sample of data into one hundred groups containing (as far as
possible) equal numbers of observations.
c) Quartiles: Quartiles are numbers that divide an
ordered data set into four portions, each containing
approximately one-fourth of the data. Twenty-five percent of
the data values come before the first quartile (Q1). The
median is the second quartile (Q2); 50% of the data values
come before the median. Seventy-five percent of the data
values come before the third quartile (Q3).
d) Range: The range of a set of data is the difference
between its largest (maximum) and smallest (minimum)
values. In the statistical world, the range is reported as a
single number, the difference between maximum and
minimum. Sometimes, the range is often reported as “from
(the minimum) to (the maximum),” i.e., two numbers.
e) Five-Number Summary: The Five-Number
Summary of a data set is a five-item list comprising the
minimum value, first quartile, median, third quartile, and
maximum value of the set.
{MIN, Q1, MEDIAN (Q2), Q3, MAX}
f) Box plots: A box plot is a graph used to represent the
range, median, quartiles and interquartile range of a set of
data values.
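These dispersion measures can be computed directly; the sketch below
uses NumPy on illustrative values to produce the variance, standard
deviation, range, quartiles, interquartile range and the five-number
summary that a box plot is drawn from.

```python
import numpy as np

data = np.array([6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49])   # illustrative values

q1, median, q3 = np.percentile(data, [25, 50, 75])
print("variance =", data.var(), "std dev =", data.std())
print("range =", data.max() - data.min())
print("IQR =", q3 - q1)
print("five-number summary:", (data.min(), q1, median, q3, data.max()))
```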
Graphic Displays of Basic Descriptive Data Summaries
a) Histogram: A histogram is a way of summarizing
data that are measured on an interval scale (either discrete or
continuous). It is often used in exploratory data analysis to
illustrate the major features of the distribution of the data in a
convenient form. It divides up the range of possible values in
a data set into classes or groups. For each group, a rectangle is
constructed with a base length equal to the range of values in
that specific group, and an area proportional to the number of
observations falling into that group. This means that the
rectangles might be drawn of non-uniform height.
b) Scatter Plot: A scatter plot is a useful summary of a
set of bivariate data (two variables), usually drawn before
working out a linear correlation coefficient or fitting a
regression line. It gives a good visual picture of the
relationship between the two variables, and aids the
interpretation of the correlation coefficient or regression
model.
c) Loess curve:
It is another important exploratory graphic aid that adds a
smooth curve to a scatter plot in order to provide better
perception of the pattern of dependence. The word loess is
short for “local regression.”
d) Box plot: The picture produced consists of the most
extreme values in the data set (maximum and minimum
values), the lower and upper quartiles, and the median.

9. Write about Data Preprocessing.


Ans: Data in the real world is dirty. It can be incomplete,
noisy and inconsistent in form. Such data need to be
preprocessed in order to help improve the quality of the data,
and the quality of the mining results.
 If there is no quality data, then there will be no quality
mining results. Quality decisions are always based on quality data.
 If there is much irrelevant and redundant information
present or noisy and unreliable data, then knowledge
discovery during the training phase is more difficult.

Incomplete data: lacking attribute values, lacking
certain attributes of interest, or containing only aggregate
data. e.g., occupation=“ ”.
Noisy data: containing errors or outliers data. e.g.,
Salary=“-10”
Inconsistent data: containing discrepancies in codes or
names. e.g., Age=“42” Birthday=“03/07/1997”
Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data
was collected and when it is analyzed.
 Human/hardware/software problems
Noisy data (incorrect values) may come from
 Faulty data collection by instruments
 Human or computer error at data entry
 Errors in data transmission
Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some
linked data)
Major Tasks in Data Preprocessing
1. Data cleaning: Fill in missing values, smooth noisy
data, identify or remove outliers, and resolve inconsistencies
2. Data integration: Integration of multiple databases,
data cubes, or files
3. Data transformation: Normalization and aggregation
4. Data reduction: Obtains reduced representation in
volume but produces the same or similar analytical results
5. Data discretization: Part of data reduction but with
particular importance, especially for numerical data

10. Write about Data Cleaning.


Ans: Incomplete, noisy, and inconsistent data are
commonplace properties of large, real-world databases and
data warehouses. Incomplete data can occur for a number of
reasons. Attributes of interest may not always be available,
such as customer information for sales transaction data. Other
data may not be included simply because it was not
considered important at the time of entry. Relevant data may
not be recorded due to a misunderstanding, or because of
equipment malfunctions. Data that were inconsistent with
other recorded data may have been deleted.
Data cleaning routines work to “clean" the data by
filling in missing values, smoothing noisy data, identifying or
removing outliers, and resolving inconsistencies. Dirty data
can cause confusion for the mining procedure. Although most
mining routines have some procedures for dealing with
incomplete or noisy data, they are not always robust. Instead,
they may concentrate on avoiding overfitting the data to the
function being modelled. Therefore, a useful pre-processing
step is to run your data through some data cleaning routines.
a) Missing Values: If it is noted that there are many
tuples that have no recorded value for several attributes, then
the missing values can be filled in for the attribute by various
methods described below:
1. Ignore the tuple: This is usually done when the class
label is missing (assuming the mining task involves
classification or description). This method is not very
effective, unless the tuple contains several attributes with
missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this
approach is time-consuming and may not be feasible given a
large data set with many missing values.
3. Use a global constant to fill in the missing value:
Replace all missing attribute values by the same constant,
such as a label like “Unknown”, or −∞. If missing values are
replaced by, say, “Unknown”, then the mining program may
mistakenly think that they form an interesting concept, since
they all have a value in common – that of “Unknown”. Hence,
although this method is simple, it is not recommended.
4. Filling the missing value: Use the attribute mean to
fill in the missing value.
5. Use of the class-wise attribute mean: Use the attribute mean
for all samples belonging to the same class as the given tuple.
6. Use the most probable value to fill in the missing
value: This may be determined with inference-based tools
using a Bayesian formalism or decision tree induction.
Methods 3 to 6 bias the data. The filled-in value may
not be correct. Method 6, however, is a popular strategy. In
comparison to the other methods, it uses the most information
from the present data to predict missing values.
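A brief sketch of methods 3 to 5 using pandas (an assumed library); the
table and its column names are hypothetical and chosen only to make the
three imputation strategies visible.

```python
import pandas as pd

# Illustrative data with missing income values (column names are assumptions).
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30000, None, 52000, None, 61000],
})

# Method 3: fill with a global constant (simple, but not recommended).
filled_const = df["income"].fillna(-1)

# Method 4: fill with the overall attribute mean.
filled_mean = df["income"].fillna(df["income"].mean())

# Method 5: fill with the mean of the samples in the same class.
filled_class_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(filled_class_mean)
```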
b) Noisy Data: Noise is a random error or variance in a
measured variable. Given a numeric attribute such as, say,
price, how can the data be “smoothed” to remove the noise?
The following data smoothing techniques describe this.
1. Binning methods: Binning methods smooth a sorted
data value by consulting its “neighborhood”, that is, the values
around it. The sorted values are distributed into a number of
“buckets”, or bins. Because binning methods consult the
neighborhood of values, they perform local smoothing.
2. Clustering: Outliers may be detected by clustering,
where similar values are organized into groups, or “clusters”.
3. Combined computer and human inspection: Outliers
may be identified through a combination of computer and
human inspection. In one application, for example, an
information-theoretic measure was used to help identify
outlier patterns in a handwritten character database for
classification. The measure's value reflected the \surprise"
content of the predicted character label with respect to the
known label. Outlier patterns may be informative (e.g.,
identifying useful data exceptions, such as different versions
of the characters \0" or \7"), or \garbage" (e.g., mislabeled
characters). Patterns whose surprise content is above a
threshold are output to a list. A human can then sort through
the patterns in the list to identify the actual garbage ones.
This is much faster than having to manually search
through the entire database. The garbage patterns can then be
removed from the (training) database.
4. Regression: Data can be smoothed by fitting the data
to a function, such as with regression. Linear regression
involves finding the “best” line to fit two variables, so that one
variable can be used to predict the other. Multiple linear
regression is an extension of linear regression, where more
than two variables are involved and the data are fit to a
multidimensional surface. Using regression to find a
mathematical equation to fit the data helps smooth out the
noise.
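As a small illustration of the binning and regression smoothing
techniques described above, the sketch below smooths a set of sorted
prices by equal-frequency bin means, and separately fits a straight line
to noisy (x, y) observations with NumPy; all values are invented for the
example.

```python
import numpy as np

# --- Smoothing by bin means (equal-frequency bins of size 3) ---
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]        # already sorted, illustrative values
bin_size = 3
smoothed = []
for i in range(0, len(prices), bin_size):
    bin_values = prices[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    # Every value in the bin is replaced by the bin mean.
    smoothed.extend([round(bin_mean, 1)] * len(bin_values))
print(smoothed)            # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]

# --- Smoothing by linear regression ---
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.2, 3.9, 6.1, 8.0, 9.8, 12.3])      # noisy observations
a, b = np.polyfit(x, y, deg=1)                      # fit the "best" line y = a*x + b
y_smoothed = a * x + b                              # replace each value with the fitted value
print(np.round(y_smoothed, 2))
```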
c) Inconsistent data: There may be inconsistencies in
the data recorded for some transactions. Some data
inconsistencies may be corrected manually using external
references. For example, errors made at data entry may be
corrected by performing a paper trace. This may be coupled
with routines designed to help correct the inconsistent use of
codes. Knowledge engineering tools may also be used to
detect the violation of known data constraints. For example,
known functional dependencies between attributes can be
used to find values contradicting the functional constraints.

11. Discuss about Data Integration.


Ans: Data integration is one of the steps of data pre-
processing that involves combining data residing in different
sources and providing users with a unified view of these data.
 It merges the data from multiple data stores (data
sources)
 It includes multiple databases, data cubes or flat files.
 Metadata, Correlation analysis, data conflict detection,
and resolution of semantic heterogeneity contribute
towards smooth data integration.
There are mainly 2 major approaches for data
integration - commonly known as "tight coupling approach"
and "loose coupling approach".
a) Tight Coupling: Here data is pulled over from
different sources into a single physical location through the
process of ETL - Extraction, Transformation and Loading.
The single physical location provides a uniform interface for
querying the data. ETL layer helps to map the data from the
sources so as to provide a uniform data warehouse. This
approach is called tight coupling since in this approach the
data is tightly coupled with the physical repository at the time
of query.
Advantages:
 Independence (Lesser dependency to source systems
since data is physically copied over)
 Faster query processing
 Complex query processing
 Advanced data summarization and storage possible
 High Volume data processing
Disadvantages:
 Latency (since data needs to be loaded using ETL)
 Costlier (data localization, infrastructure, security)
b) Loose Coupling: Here a virtual mediated schema
provides an interface that takes the query from the user,
transforms it in a way the source database can understand and
then sends the query directly to the source databases to obtain
the result. In this approach, the data only remains in the
actual source databases. However, mediated schema contains
several "adapters" or "wrappers" that can connect back to the
source systems in order to bring the data to the front end.
Advantages:
 Data Freshness (low latency - almost real time)
 Higher Agility (when a new source system comes or
existing source system changes - only the corresponding
adapter is created or changed - largely not affecting the
other parts of the system)
 Less costly (a lot of infrastructure cost can be saved
since data localization is not required)
Disadvantages:
 Semantic conflicts
 Slower query response
 High order dependency to the data sources
For example, let's imagine that an electronics company
is preparing to roll out a new mobile device. The marketing
department might want to retrieve customer information from
a sales department database and compare it to information
from the product department to create a targeted sales list. A
good data integration system would let the marketing
department view information from both sources in a unified
way, leaving out any information that didn't apply to the
search.

12. Write about Data Transformation.


Ans: Data transformation can involve the following:
1. Smoothing: Which works to remove noise from the
data
2. Aggregation: Where summary or aggregation
operations are applied to the data. For example, the daily sales
data may be aggregated so as to compute weekly and annual
total amounts.
3. Generalization of the data: Where low-level or
“primitive” (raw) data are replaced by higher-level concepts
through the use of concept hierarchies. For example,
categorical attributes, like street, can be generalized to higher-
level concepts, like city or country.
4. Normalization: Where the attribute data are scaled so
as to fall within a small specified range, such as −1.0 to 1.0, or
0.0 to 1.0.
5. Attribute construction (feature construction): This is
where new attributes are constructed and added from the
given set of attributes to help the mining process.
Normalization
In normalization, data are scaled to fall within a small,
specified range. This is useful for classification algorithms
involving neural networks, and for distance measurements such
as nearest-neighbor classification and clustering. There are 3
methods for data normalization. They are:
1. Min-max normalization
2. z-score normalization
3. Normalization by decimal scaling
1. Min-max normalization: performs a linear
transformation on the original data values. It can be defined
as
v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
where
 v is the value to be normalized
 minA, maxA are the minimum and maximum values of
attribute A
 new_minA, new_maxA define the new normalization range.
2. Z-score normalization / zero-mean normalization:
Values of an attribute A are normalized based on the
mean and standard deviation of A. It can be defined as
v' = (v − meanA) / std_devA
This method is useful when the minimum and maximum values
of attribute A are unknown, or when outliers dominate the
min-max normalization.
3. Normalization by decimal scaling: normalizes by
moving the decimal point of values of attribute A. The
number of decimal places moved depends on the maximum
absolute value of A. A value v of A is normalized to v’ by
computing
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
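The three normalization formulas above can be sketched directly in
NumPy; the attribute values below are illustrative.

```python
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 986.0])   # illustrative attribute values

# 1. Min-max normalization to the new range [0.0, 1.0].
new_min, new_max = 0.0, 1.0
minmax = (values - values.min()) / (values.max() - values.min()) * (new_max - new_min) + new_min

# 2. Z-score (zero-mean) normalization.
zscore = (values - values.mean()) / values.std()

# 3. Decimal scaling: divide by 10**j, with j chosen so that all |v'| < 1.
j = int(np.ceil(np.log10(np.abs(values).max())))
decimal_scaled = values / (10 ** j)

print(minmax, zscore, decimal_scaled, sep="\n")
```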

13. Write about Data Reduction.


Ans: Complex data analysis and mining on huge amounts of
data may take a very long time, making such analysis
impractical or infeasible. Data reduction techniques have been
helpful in analyzing reduced representation of the dataset
without compromising the integrity of the original data and
yet producing the quality knowledge. The concept of data
reduction is commonly understood as either reducing the
volume or reducing the dimensions (number of attributes).
There are a number of methods that facilitate
analyzing a reduced volume or dimension of data and yet
yield useful knowledge. Certain partition-based methods work
on partition of data tuples. That is, mining on the reduced data
set should be more efficient yet produce the same (or almost
the same) analytical results. Strategies for data reduction
include the following.
1. Data cube aggregation, where aggregation operations
are applied to the data in the construction of a data cube.
2. Dimension reduction, where irrelevant, weakly relevant,
or redundant attributes or dimensions may be detected
and removed.
3. Data compression, where encoding mechanisms are
used to reduce the data set size. The methods used for
data compression are wavelet transform and Principal
Component Analysis.
4. Numerosity reduction, where the data are replaced or
estimated by alternative, smaller data representations
such as parametric models (which need store only the
model parameters instead of the actual data e.g.
regression and log-linear models), or nonparametric
methods such as clustering, sampling, and the use of
histograms.
5. Discretization and concept hierarchy generation, where
raw data values for attributes are replaced by ranges or
higher conceptual levels. Concept hierarchies allow the
mining of data at multiple levels of abstraction, and are
a powerful tool for data mining.
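As one concrete illustration of the dimension reduction and data
compression strategies above, the following sketch applies principal
component analysis with scikit-learn (an assumed library) to invented
data containing redundant attributes; the matrix and its construction
are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: 6 tuples described by 4 correlated attributes.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
X = np.hstack([base, base @ np.array([[1.0, 0.5], [0.2, 1.0]])])   # 4 columns, 2 truly informative

# Dimensionality reduction: project onto the 2 strongest principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (6, 2) instead of (6, 4)
print(pca.explained_variance_ratio_)     # variance retained by the kept components
```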

14. Write about Data Discretization.


Ans: Data discretization techniques can be used to divide the
range of a continuous attribute into intervals. Numerous
continuous attribute values are replaced by a small number of
interval labels. This leads to a concise, easy-to-use,
knowledge-level representation of mining results.
1. Top-down discretization: If the process starts by first
finding one or a few points (called split points or cut points)
to split the entire attribute range, and then repeats this
recursively on the resulting intervals, then it is called top-
down discretization or splitting.
2. Bottom-up discretization: If the process starts by
considering all of the continuous values as potential split-
points, removes some by merging neighborhood values to
form intervals, then it is called bottom-up discretization or
merging.
Discretization can be performed rapidly on an attribute
to provide a hierarchical partitioning of the attribute values,
known as a concept hierarchy.
Concept hierarchies
Concept hierarchies can be used to reduce the data by
collecting and replacing low-level concepts with higher-level
concepts.
In the multidimensional model, data are organized into
multiple dimensions, and each dimension contains multiple
levels of abstraction defined by concept hierarchies. This
organization provides users with the flexibility to view data
from different perspectives.
Data mining on a reduced data set means fewer
input/output operations and is more efficient than mining on a
larger data set. Because of these benefits, discretization
techniques and concept hierarchies are typically applied
before data mining, rather than during mining.
Discretization and Concept Hierarchy Generation for
Numerical Data
1. Binning: Binning is a top-down splitting technique
based on a specified number of bins. Binning is an
unsupervised discretization technique.
2. Histogram Analysis: Because histogram analysis
does not use class information, it is an unsupervised
discretization technique. Histograms partition the values of
an attribute into disjoint ranges called buckets.
3. Cluster Analysis: Cluster analysis is a popular data
discretization method. A clustering algorithm can be applied
to discretize a numerical attribute A by partitioning the
values of A into clusters or groups.
Each initial cluster or partition may be further
decomposed into several subclusters, forming a lower level of
the hierarchy.
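A short sketch of unsupervised discretization with pandas (an assumed
library): pd.cut produces equal-width intervals in the spirit of the
binning/histogram methods above, and pd.qcut an equal-frequency
alternative; the ages are illustrative values.

```python
import pandas as pd

# Illustrative continuous attribute values (e.g. ages).
ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70])

# Top-down, unsupervised discretization into 3 equal-width intervals.
equal_width = pd.cut(ages, bins=3)

# Equal-frequency (equal-depth) alternative: similar counts per interval.
equal_depth = pd.qcut(ages, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```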

15. Explain the classification of data mining systems.


Ans: There are many data mining systems available or being
developed. Some are specialized systems dedicated to a given
data source or are confined to limited data mining
functionalities; others are more versatile and comprehensive.
Data mining systems can be categorized according to various
criteria; among other classifications are the following:
1. Classification according to the type of data source
mined: this classification categorizes data mining systems
according to the type of data handled such as spatial data,
multimedia data, time-series data, text data, World Wide
Web, etc.
2. Classification according to the data model drawn
on: this classification categorizes data mining systems based
on the data model involved such as relational database,
object-oriented database, data warehouse, transactional, etc.
3. Classification according to the kind of knowledge
discovered: this classification categorizes data mining
systems based on the kind of knowledge discovered or data
mining functionalities, such as characterization,
discrimination, association, classification, clustering, etc.
Some systems tend to be comprehensive systems offering
several data mining functionalities together.
4. Classification according to mining techniques used:
Data mining systems employ and provide different
techniques. This classification categorizes data mining
systems according to the data analysis approach used such as
machine learning, neural networks, genetic algorithms,
statistics, visualization, database oriented or data warehouse-
oriented, etc. The classification can also take into account the
degree of user interaction involved in the data mining process
such as query-driven systems, interactive exploratory systems,
or autonomous systems. A comprehensive system would
provide a wide variety of data mining techniques to fit
different situations and options, and offer different degrees of
user interaction.

16. Write the procedure of integration of a Data Mining
System with a Database or Data Warehouse System.
Ans: The differences between the following architectures for
the integration of a data mining system with a database or data
warehouse system are as follows.
1. No coupling: The data mining system uses sources
such as flat files to obtain the initial data set to be mined since
no database system or data warehouse system functions are
implemented as part of the process. Thus, this architecture
represents a poor design choice.
2. Loose coupling: The data mining system is not
integrated with the database or data warehouse system beyond
their use as the source of the initial data set to be mined, and
possible use in storage of the results. Thus, this architecture
can take advantage of the flexibility, efficiency and features
such as indexing that the database and data warehousing
systems may provide. However, it is difficult for loose
coupling to achieve high scalability and good performance
with large data sets as many such systems are memory-based.
3. Semitight coupling: Some of the data mining
primitives such as aggregation, sorting or pre computation of
statistical functions are efficiently implemented in the
database or data warehouse system, for use by the data mining
system during mining-query processing. Also, some
frequently used intermediate mining results can be
precomputed and stored in the database or data warehouse
system, thereby enhancing the performance of the data mining
system.
4. Tight coupling: The database or data warehouse
system is fully integrated as part of the data mining system
and thereby provides optimized data mining query processing.
Thus, the data mining subsystem is treated as one functional
component of an information system. This is a highly
desirable architecture as it facilitates efficient
implementations of data mining functions, high system
performance, and an integrated information processing
environment
From the descriptions of the architectures provided
above, it can be seen that tight coupling is the best alternative
without respect to technical or implementation issues.
However, as much of the technical infrastructure needed in a
tightly coupled system is still evolving, implementation of
such a system is non-trivial. Therefore, the most popular
architecture is currently semitight coupling, as it provides a
compromise between loose and tight coupling.

17. What are the major issues in data mining?


Ans: Major issues in data mining concern mining
methodology, user interaction, performance, and diverse data
types.
1. Mining different kinds of knowledge in databases:
The needs of different users are not the same, and different
users may be interested in different kinds of knowledge.
Therefore it is necessary for data mining to cover a broad
range of knowledge discovery tasks.
2. Interactive mining of knowledge at multiple levels of
abstraction: The data mining process needs to be interactive
because it allows users to focus the search for patterns,
providing and refining data mining requests based on returned
results.
3. Incorporation of background knowledge: To guide
the discovery process and to express the discovered patterns,
background knowledge can be used. Background knowledge
may be used to express the discovered patterns not only in
concise terms but at multiple levels of abstraction.
4. Data mining query languages and ad hoc data
mining: Data Mining Query language that allows the user to
describe ad hoc mining tasks, should be integrated with a data
warehouse query language and optimized for efficient and
flexible data mining.
5. Presentation and visualization of data mining
results: Once the patterns are discovered it needs to be
expressed in high-level languages or visual representations. These
representations should be easily understandable by the users.
6. Handling noisy or incomplete data: Data cleaning
methods are required that can handle noise and incomplete
objects while mining the data regularities. If data cleaning
methods are not available, the accuracy of the discovered
patterns will be poor.
7. Pattern evaluation: It refers to the interestingness of
the problem. Many of the patterns discovered may not be
interesting because they either represent common knowledge
or lack novelty.
8. Efficiency and scalability of data mining
algorithms: In order to effectively extract the information
from huge amount of data in databases, data mining algorithm
must be efficient and scalable.
9. Parallel, distributed, and incremental mining
algorithms: The factors such as huge size of databases, wide
distribution of data, and complexity of data mining methods
motivate the development of parallel and distributed data
mining algorithms. These algorithms divide the data into
partitions which are processed in parallel. Then the results
from the partitions are merged. Incremental algorithms
incorporate database updates without having to mine the data
again from scratch.

SHORT ANSWER QUESTIONS

18. Data Mining.


Ans: Data mining refers to extracting or mining knowledge
from large amounts of data. The term is actually a misnomer;
data mining would more appropriately have been named
knowledge mining, which emphasizes mining knowledge from
large amounts of data. It is the computational process of
discovering patterns in large data sets involving methods at
the intersection of artificial intelligence, machine learning,
statistics, and database systems. The overall goal of the data
mining process is to extract information from a data set and
transform it into an understandable structure for further use.
The key properties of data mining are
 Automatic discovery of patterns
 Prediction of likely outcomes
 Creation of actionable information
 Focus on large datasets and databases

19. Knowledge Base


Ans: This is the domain knowledge that is used to guide the
search or evaluate the interestingness of resulting patterns.
Such knowledge can include concept hierarchies, used to
organize attributes or attribute values into different levels of
abstraction. Knowledge such as user beliefs, which can be
used to assess a pattern’s interestingness based on its
unexpectedness, may also be included. Other examples of
domain knowledge are additional interestingness constraints
or thresholds, and metadata (e.g., describing data from
multiple heterogeneous sources).
20. Data Reduction.
Ans: Complex data analysis and mining on huge amounts of
data may take a very long time, making such analysis
impractical or infeasible. Data reduction techniques have been
helpful in analyzing reduced representation of the dataset
without compromising the integrity of the original data and
yet producing the quality knowledge. The concept of data
reduction is commonly understood as either reducing the
volume or reducing the dimensions (number of attributes).
There are a number of methods that facilitate
analyzing a reduced volume or dimension of data and yet
yield useful knowledge. Certain partition-based methods work
on partition of data tuples. That is, mining on the reduced data
set should be more efficient yet produce the same (or almost
the same) analytical results.
21. Statistical Description of Data.
Ans: A measure is distributive if we can partition the dataset
into smaller subsets, compute the measure on the individual
subsets, and then combine the partial results in order to arrive
at the measure’s value on the entire (original) dataset.
 A measure is algebraic if it can be computed by
applying an algebraic function to one or more
distributive measures
 A measure is holistic if it must be computed on the
entire dataset as a whole

22. Time Series Analysis


Ans: Time Series is a sequence of well-defined data points
measured at consistent time intervals over a period of time.
Data collected on an ad-hoc basis or irregularly does not form
a time series. Time series analysis is the use of statistical
methods to analyze time series data and extract meaningful
statistics and characteristics about the data.
Time series analysis helps us understand the
underlying forces leading to a particular trend in the time
series data points, and helps us in forecasting and monitoring
the data points by fitting appropriate models to them.
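A minimal sketch of the descriptive, trend-spotting side of this, using
pandas (an assumed library) to smooth an invented monthly sales series
with a moving average:

```python
import pandas as pd

# Illustrative monthly sales measured at consistent time intervals.
sales = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),
)

# A 3-month moving average smooths short-term fluctuation and exposes the trend.
trend = sales.rolling(window=3).mean()
print(trend.round(1))
```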
Benefits And Applications Of Time Series Analysis
Time series analysis aims to achieve various objectives
and the tools and models used vary accordingly. The various
types of time series analysis include –
1. Descriptive analysis: It is to determine the trend or
pattern in a time series using graphs or other tools. This helps
us identify cyclic patterns, overall trends, turning points and
outliers.
2. Spectral analysis: It is also referred to as frequency
domain and aims to separate periodic or cyclical components
in a time series. For example, identifying cyclical changes in
sales of a product.
3. Forecasting: It is used extensively in business
forecasting, budgeting, etc based on historical trends
4. Intervention analysis: It is used to determine if an
event can lead to a change in the time series, for example, an
employee’s level of performance has improved or not after an
intervention in the form of training – to determine the
effectiveness of the training program.
5. Explanative analysis: It studies the cross correlation
or relationship between two time series and the dependence of
one on another. For example, the study of employee turnover
data and employee training data to determine whether
employee turnover rates depend on employee training
programs over time.

23. Data Discretization.


Ans: Data discretization techniques can be used to divide the
range of a continuous attribute into intervals. Numerous
continuous attribute values are replaced by a small number of
interval labels. This leads to a concise, easy-to-use,
knowledge-level representation of mining results.
1. Top-down discretization: If the process starts by first
finding one or a few points (called split points or cut points)
to split the entire attribute range, and then repeats this
recursively on the resulting intervals, then it is called top-
down discretization or splitting.
2. Bottom-up discretization: If the process starts by
considering all of the continuous values as potential split-
points, removes some by merging neighborhood values to
form intervals, then it is called bottom-up discretization or
merging.
Discretization can be performed rapidly on an attribute
to provide a hierarchical partitioning of the attribute values,
known as a concept hierarchy.
