
Data Mining

B.Tech. IV Year I Semester

Lecture notes by K. ChandraSena Chary

UNIT – I
Introduction to Data Mining
Data Mining – It is the process of extracting knowledge from large volumes of data: the practice of examining large pre-existing databases in order to generate new information, or of uncovering interesting data patterns hidden in large data sets. It is also called KDD (Knowledge Discovery in Databases).
Data Mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use, or to summarize it into useful information - information that can be used to increase revenue or improve the business.
Data mining is mainly used for decision making, by finding correlations or patterns among dozens of fields in large relational databases.
Data, Information, and Knowledge
Data
Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:
 operational or transactional data, such as sales, cost, inventory, payroll, and accounting
 nonoperational data, such as industry sales, forecast data, and macroeconomic data
 metadata - data about the data itself, such as logical database design or data dictionary definitions
Information
The patterns, associations, or relationships among all this data can provide information. For example, analysis of supermarket data can yield information on which products are selling and when.

Knowledge
Information can be converted into knowledge about historical patterns and future trends. For
example, summary information on retail supermarket sales can be analyzed in light of promotional
efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could
determine which items are most susceptible to promotional efforts.

Data Warehouse – In computing, a data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is a database used for reporting and data analysis. Integrating data from one or more disparate sources creates a central repository of data, the data warehouse (DW). Data warehouses store current and historical data and are used for creating trend reports for senior management, such as annual and quarterly comparisons.

Why is Data Mining important?

Wide availability of huge amounts of data and the need for turning such data into useful
information and knowledge.
Information or knowledge can be used for decision making.
It can be viewed as a natural evolution of information technology.
 Data Collection and Database Creation (1960s and earlier ) – Primitive file processing
 DBMS (1970s – early 1980s) – SQL, OLTP (On-Line Transaction Processing), RDBMS,
Network DB, Transaction Mgt., etc...
 Advanced Database Systems (mid-1980s – present) – Object-oriented, object-relational, spatial, temporal, multimedia, etc...

 Web-based Database Systems (1990s-present) – XML-based database systems,
Web mining
 Data Warehousing and Data Mining (late 1980s-present) – KDD and OLAP (On-Line
Analytical Processing)
 New Generation of Integrated Information Systems
We are data rich, but information poor.
Data mining is used for data analysis - searching for knowledge (interesting patterns) in the data.

Knowledge Discovery in Databases (KDD) - Knowledge Discovery in Databases creates the context for developing the tools needed to control the flood of data facing organizations that depend on ever-growing databases of business, manufacturing, scientific, and personal information. It is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

The KDD process is interactive and iterative (with many decisions made by the user), involving numerous
steps, summarized as:

1. Learning the application domain: includes relevant prior knowledge and the goals of the
application
2. Creating a target dataset: includes selecting a dataset or focusing on a subset of variables or data
samples on which discovery is to be performed
3. Data cleaning and preprocessing: includes basic operations, such as removing noise or outliers
if appropriate, collecting the necessary information to model or account for noise, deciding on
strategies for handling missing data fields, and accounting for time sequence information and
known changes, as well as deciding DBMS issues, such as data types, schema, and mapping of
missing and unknown values

4. Data reduction and projection: includes finding useful features to represent the data, depending
on the goal of the task, and using dimensionality reduction or transformation methods to reduce
the effective number of variables under consideration or to find invariant representations for the
data
5. Choosing the function of data mining: includes deciding the purpose of the model derived by the
data mining algorithm (e.g., summarization, classification, regression, and clustering)
6. Choosing the data mining algorithm(s): includes selecting method(s) to be used for searching for
patterns in the data, such as deciding which models and parameters may be appropriate (e.g.,
models for categorical data are different from models on vectors over reals) and matching a
particular data mining method with the overall criteria of the KDD process (e.g., the user may be
more interested in understanding the model than in its predictive capabilities)
7. Data mining: includes searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, clustering, sequence modeling, dependency, and link analysis
8. Interpretation: includes interpreting the discovered patterns and possibly returning to any of the
previous steps, as well as possible visualization of the extracted patterns, removing redundant or
irrelevant patterns, and translating the useful ones into terms understandable by users.
9. Using discovered knowledge: includes incorporating this knowledge into the performance
system, taking actions based on the knowledge, or simply documenting it and reporting it to
interested parties, as well as checking for and resolving potential conflicts with previously
believed (or extracted) knowledge.

Architecture of Data Mining System – Data mining is a step in the process of knowledge discovery – discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories. Based on this view, the architecture of a typical data mining system may have the following major components.

1. Database, data warehouse, World Wide Web, or other information repository: This is one or a
set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data
cleaning and data integration techniques may be performed on the data.
2. Database or data warehouse server: The database or data warehouse server is responsible for
fetching the relevant data, based on the user’s data mining request.

3. Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction, and knowledge such as user beliefs, which can be used to assess a pattern's interestingness.
4. Data mining engine: This is essential to the data mining system and ideally consists of a set of
functional modules for tasks such as characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
5. Pattern evaluation module: This component typically employs interestingness measures and
interacts with the data mining modules so as to focus the search toward interesting patterns. It
may use interestingness thresholds to filter out discovered patterns. For efficient data mining, it
is highly recommended to push the evaluation of pattern interestingness as deep as possible into
the mining process so as to confine the search to only the interesting patterns.
6. User interface: This module communicates between users and the data mining system, allowing
the user to interact with the system by specifying a data mining query or task, providing
information to help focus the search, and performing exploratory data mining based on the
intermediate data mining results. In addition, this component allows the user to browse database
and data warehouse schemas or data structures, evaluate mined patterns, and visualize the
patterns in different forms.

Data Mining – On What Kinds of Data
In principle, data mining is not specific to one type of media or data. Data mining should be
applicable to any kind of information repository. However, algorithms and approaches may differ when
applied to different types of data. Data mining is being put into use and studied for databases, including
relational databases, object-relational databases and object-oriented databases, data warehouses,
transactional databases, unstructured and semi-structured repositories such as the World Wide Web,
advanced databases such as spatial databases, multimedia databases, time-series databases and textual
databases, and even flat files.

1. Flat files: Flat files are actually the most common data source for data mining algorithms,
especially at the research level. Flat files are simple data files in text or binary format with a
structure known by the data mining algorithm to be applied. The data in these files can be
transactions, time-series data, scientific measurements, etc.
2. Relational Databases: Briefly, a relational database consists of a set of tables containing either
values of entity attributes, or values of attributes from entity relationships. Tables have columns
and rows, where columns represent attributes and rows represent tuples. A tuple in a relational
table corresponds to either an object or a relationship between objects and is identified by a set
of attribute values representing a unique key. The most commonly used query language for
relational database is SQL, which allows retrieval and manipulation of the data stored in the
tables, as well as the calculation of aggregate functions such as average, sum, min, max and
count. While data mining can benefit from SQL for data selection, transformation and
consolidation, it goes beyond what SQL could provide, such as predicting, comparing, detecting
deviations, etc.
3. Data Warehouses: A data warehouse is a repository of data collected from multiple (often heterogeneous) data sources and is intended to be used as a whole under the
same unified schema. A data warehouse gives the option to analyze data from different sources
under the same roof. The actual physical structure of a data warehouse may be a relational data
store or a multidimensional data cube. It provides a multidimensional view of data and allows the
precomputation and fast accessing of summarized data.

4. Transaction Databases: A transaction database is a set of records representing transactions, each
with a time stamp, an identifier and a set of items. Associated with the transaction files could also
be descriptive data for the items. Since relational databases do not allow nested tables (i.e. a set
as attribute value), transactions are usually stored in flat files or stored in two normalized
transaction tables, one for the transactions and one for the transaction items. One typical data mining analysis on such data is the so-called market basket analysis, or association rules, in which associations between items occurring together in the same transactions are identified.
5. Multimedia Databases: Multimedia databases include video, images, audio and text media. They
can be stored on extended object-relational or object-oriented databases, or simply on a file
system. Multimedia is characterized by its high dimensionality, which makes data mining even
more challenging. Data mining from multimedia repositories may require computer vision,
computer graphics, image interpretation, and natural language processing methodologies.
6. Spatial Databases: Spatial databases are databases that, in addition to usual data, store
geographical information like maps, and global or regional positioning. Such spatial databases
present new challenges to data mining algorithms.

7. Time-Series Databases: Time-series databases contain time-related data such as stock market data or
logged activities. These databases usually have a continuous flow of new data coming in, which sometimes
causes the need for a challenging real time analysis. Data mining in such databases commonly includes the
study of trends and correlations between evolutions of different variables, as well as the prediction of
trends and movements of the variables in time.

8. World Wide Web: The World Wide Web is the most heterogeneous and dynamic repository available. Data in the World Wide Web is organized in inter-connected documents. These documents can be text, audio, video, raw data, and even applications. Conceptually, the World Wide Web comprises three major
components: The content of the Web, which encompasses documents available; the structure of the Web,
which covers the hyperlinks and the relationships between documents; and the usage of the web,
describing how and when the resources are accessed. Data mining in the World Wide Web, or web mining,
tries to address all these issues and is often divided into web content mining, web structure mining and
web usage mining.

Data Mining Functionalities – What Kinds of Patterns Can Be Mined

Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of patterns to be mined, data mining functions fall into two categories, listed below:

 Descriptive
 Predictive

Descriptive - The descriptive function deals with general properties of existing data in the database. Here
is the list of descriptive functions:

 Class/Concept Description
 Mining of Frequent Patterns
 Mining of Associations
 Mining of Correlations
 Mining of Clusters

Predictive – The predictive mining tasks perform inference on the current data in order to make
predictions.

Functionalities:

1. Class/Concept Description - Class/concept refers to the data being associated with classes or concepts. For example, in a company, classes of items for sale include computers and printers, and concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived in the following two ways:

 Data Characterization - This refers to summarizing the data of the class under study. This class under study is called the Target Class.
 Data Discrimination - This refers to comparing the target class with one or more predefined contrasting groups or classes.

2. Association analysis - Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases and, based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules. Association analysis is commonly used for market basket analysis. An association rule of the form X => Y is interpreted as "database tuples that satisfy the conditions on X are also likely to satisfy the conditions on Y".
E.g. age(X, "20…29") ^ income(X, "20K…29K") => buys(X, "CD Player") [support = 2%, confidence = 60%]
Here X is a variable representing a customer. The rule indicates that, of the customers under study, 2% (support) are 20 to 29 years of age with an income of 20K to 29K and have purchased a CD player. There is a 60% probability (confidence, or certainty) that a customer in this age and income group will purchase a CD player. (A small worked example of computing support and confidence appears after this list of functionalities.)

3. Classification: Classification analysis is the organization of data in given classes. Also known as
supervised classification, the classification uses given class labels to order the objects in the data
collection. Classification approaches normally use a training set where all objects are already
associated with known class labels. The classification algorithm learns from the training set and
builds a model. The model is then used to classify new objects.
4. Clustering: Similar to classification, clustering is the organization of data in classes. However,
unlike classification, in clustering, class labels are unknown and it is up to the clustering algorithm

to discover acceptable classes. Clustering is also called unsupervised classification, because the
classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).
5. Prediction: Prediction has attracted considerable attention given the potential implications of successful
forecasting in a business context. There are two major types of predictions: one can either try to predict
some unavailable data values or pending trends, or predict a class label for some data. The latter is tied to
classification. Once a classification model is built based on a training set, the class label of an object can be
foreseen based on the attribute values of the object and the attribute values of the classes. Prediction, however, more often refers to the forecast of missing numerical values, or of increase/decrease trends in time-related data. The major idea is to use a large number of past values to estimate probable future values.
6. Outlier analysis: Outliers are data elements that cannot be grouped in a given class or cluster.
Also known as exceptions or surprises, they are often very important to identify. While outliers
can be considered noise and discarded in some applications, they can reveal important
knowledge in other domains, and thus can be very significant and their analysis valuable.
7. Evolution and deviation analysis: Evolution and deviation analysis pertain to the study of time-related data that changes over time. Evolution analysis models evolutionary trends in data, which allows characterizing, comparing, classifying, or clustering of time-related data. Deviation analysis, on the other hand, considers differences between measured values and expected values, and attempts to find the cause of the deviations from the anticipated values.
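As mentioned in the association analysis item above, the following is a minimal sketch of how support and confidence are computed. The transaction data and item names are made up purely for illustration:

```python
# Minimal sketch: computing support and confidence for the rule {bread} => {butter}
# over a small, made-up list of market-basket transactions.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
    {"butter", "eggs"},
]

antecedent = {"bread"}
consequent = {"butter"}

n = len(transactions)
both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / n          # fraction of all transactions containing bread AND butter
confidence = both / ante    # fraction of bread-containing transactions that also contain butter

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# support = 40%, confidence = 67%
```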

Classification of Data Mining Systems

There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities, while others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria; among others, the following classifications can be used:

1. Classification according to the type of data source mined: this classification categorizes data
mining systems according to the type of data handled such as spatial data, multimedia data, time-
series data, text data, World Wide Web, etc.

2. Classification according to the data model drawn on: this classification categorizes data mining
systems based on the data model involved such as relational database, object-oriented database,
data warehouse, transactional, etc.
3. Classification according to the kind of knowledge discovered: this classification categorizes data
mining systems based on the kind of knowledge discovered or data mining functionalities, such as
characterization, discrimination, association, classification, clustering, etc. Some systems tend to
be comprehensive systems offering several data mining functionalities together.
4. Classification according to mining techniques used: Data mining systems employ and provide
different techniques. This classification categorizes data mining systems according to the data
analysis approach used such as machine learning, neural networks, genetic algorithms, statistics,
visualization, database-oriented or data warehouse-oriented, etc. The classification can also take
into account the degree of user interaction involved in the data mining process such as query-
driven systems, interactive exploratory systems, or autonomous systems. A comprehensive
system would provide a wide variety of data mining techniques to fit different situations and
options, and offer different degrees of user interaction.

Major Issues in Data Mining - Data mining algorithms embody techniques that have sometimes existed for
many years, but have only lately been applied as reliable and scalable tools that time and again outperform older
classical statistical methods. Before data mining develops into a conventional, mature and trusted discipline, many
still pending issues have to be addressed. The major issues are mining methodology, user interaction,
performance, and diverse data types.

1. Mining methodology and user interaction issues – These reflect the kinds of knowledge mined, the ability to mine knowledge at multiple granularities, the use of domain knowledge, ad hoc mining, and knowledge visualization. They include the following kinds of issues:
Mining different kinds of knowledge in databases. - The needs of different users are not the same, and different users may be interested in different kinds of knowledge. Therefore, data mining should cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction. - The data mining
process needs to be interactive because it allows users to focus the search for patterns,
providing and refining data mining requests based on returned results.
Incorporation of background knowledge. - Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but also at multiple levels of abstraction.
Data mining query languages and ad hoc data mining. - A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results. - Once patterns are discovered, they need to be expressed in high-level languages or visual representations that are easily understandable by users.
Handling noisy or incomplete data. - Data cleaning methods are required that can handle noise and incomplete objects while mining data regularities. Without such methods, the accuracy of the discovered patterns will be poor.
Pattern evaluation. - This refers to the interestingness of discovered patterns: many of the patterns discovered may be uninteresting to the user because they represent common knowledge or lack novelty.

2. Performance Issues – These include efficiency, scalability, and parallelization of data mining algorithms. They include the following kinds of issues:
Efficiency and scalability of data mining algorithms. - In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms incorporate database updates without having to mine the entire data again from scratch.
3. Issues relating to the diversity of database types
Handling of relational and complex types of data. - The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not realistic to expect one system to mine all these kinds of data.
Mining information from heterogeneous databases and global information systems. - Data is available from different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured; therefore, mining knowledge from them adds challenges to data mining.

Data Preprocessing

Definition - What does Data Preprocessing mean?

Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain
behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of
resolving such issues. Data preprocessing prepares raw data for further processing.

Why Preprocessing?

1. Real world data are generally


 Incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
 Noisy: containing errors or outliers
 Inconsistent: containing discrepancies in codes or names
2. Tasks in data preprocessing
 Data Cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies.
 Data Integration: using multiple databases, data cubes, or files.
 Data Transformation: normalization and aggregation.
 Data Reduction: reducing the volume but producing the same or similar analytical results.
 Data Discretization: part of data reduction, replacing numerical attributes with nominal
ones.

Preprocessing Techniques
1. Data Cleaning – Real world data tend to be incomplete, noisy, and inconsistent. Data cleaning
routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct
inconsistencies in the data. The following are methods for data cleaning.

I. Missing Values – when it is noticed that many tuples have no recorded value for several attributes, the missing values can be filled in by the various methods described below.
i. Ignore the tuple – this is usually done when the class label is missing. This method
is not effective, unless the tuple contains several attributes with missing values. It
is especially poor when the percentage of missing values per attribute varies
considerably.
ii. Fill in the missing value manually – in general, this approach is time consuming and may not be feasible given a large data set with many missing values.
iii. Use a global constant to fill in the missing value – replace all missing attribute values by the same constant, such as a label like "Unknown" or "NA".
iv. Use the attribute mean to fill in the missing value – use the attribute mean (or, for nominal attributes, the most common value) to fill in the missing value (e.g., the average salary of all employees).
v. Use the attribute mean for all samples belonging to the same class as the given tuple – use the mean (or most common value) computed over the samples of the same class (e.g., the average salary of employees in the same department).
vi. Use the most probable value to fill in the missing value – this may be determined
with regression or decision tree induction to predict the missing value.
II. Noisy Data – Noise is a random error or variance in a measured variable. The following
are the smoothing techniques.
i. Binning – binning methods smooth a sorted data value by consulting the neighborhood, or values around it. The sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9; therefore, each original value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is then replaced by the closest boundary value. Alternatively, bins may be equal-width, where the interval range of values in each bin is constant. (A short sketch of smoothing by bin means appears at the end of this Data Cleaning subsection.)

ii. Clustering – outliers may be detected by clustering, where similar values are organized into groups, or "clusters". Intuitively, values that fall outside of the set of clusters may be considered outliers.

iii. Combined computer and human inspection – outliers may be identified through a
combination of computer and human inspection.
iv. Regression – data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the best line to fit two variables, so that one variable can be used to predict the other; smoothing is then achieved by fitting the data to the regression function.
III. Inconsistent Data – there may be inconsistencies in the data recorded for some transactions during data entry. These may be corrected manually using external references. There may also be inconsistencies due to data integration, where a given attribute can have different names
in different databases. Inconsistent data can thus be corrected using domain knowledge or expert decisions.
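Referenced in the binning discussion above, the following is a minimal sketch of smoothing by equal-depth bin means. The values 4, 8, and 15 come from the example above; the remaining values and the bin size are assumptions made purely for illustration:

```python
# Minimal sketch: smoothing noisy data by (equal-depth) bin means.
# The values 4, 8, 15 come from the example above; the remaining values are assumed
# purely for illustration.

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    mean = sum(bin_values) / len(bin_values)
    # every value in the bin is replaced by the bin mean
    smoothed.extend([round(mean)] * len(bin_values))

print(smoothed)   # [9, 9, 9, 22, 22, 22, 29, 29, 29]
```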
2. Data Integration and Transformation – Data Integration refers to the merging of data from
multiple data stores. This data may also need to be transformed into forms appropriate for
mining.
I. Data Integration – combines data from multiple sources into a coherent data store. These
sources may include multiple databases, data cubes, or flat files. Issues to be considered during data integration include:
i. Schema integration problem – real-world entities from multiple data sources need to be matched up; this is referred to as the entity identification problem.
ii. Redundancy – an attribute may be redundant if it can be “derived” from another
table.
iii. Detection and resolution of data value conflicts – attribute values from different sources may differ due to differences in representation.
II. Data Transformation – in data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:
i. Smoothing – in which we remove noise from the data using binning, clustering, and regression.
ii. Aggregation – in this, summary or aggregation operations are applied to the data.
For example, the daily sales data may be aggregated so as to compute monthly
and annual total amounts.
iii. Generalization – in which low-level or primitive (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, an attribute like street can be generalized to a higher-level concept like city or country.
iv. Normalization – where the attribute data are scaled so as to fall within a small specified range, such as 0.0 to 1.0. The following are common normalization techniques (a short sketch of all three appears at the end of this Data Transformation subsection).
 Min-max normalization – performs a linear transformation on the original
data. Suppose that minA and maxA are the minimum and maximum values
of an attribute A. Min-max normalization maps a value v of A to v’ in the
range [new_minA, new_maxA] by computing
v  minA
v'  (new _ maxA  new _ minA)  new _ minA
maxA  minA

 z-score normalization (or zero-mean normalization) – the values of an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' by computing
v  meanA
v' 
stand _ devA

 Normalization by decimal scaling – normalizes by moving the decimal point of values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
v. Attribute Construction (or feature construction) – where new attributes are
constructed and added from the given set of attributes to help the mining process.
For example, we may add the attribute area based on the attributes width and
height.
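As noted in the normalization item above, the following is a minimal sketch of the three normalization techniques. The attribute values and the target range are assumptions made purely for illustration:

```python
# Minimal sketch: min-max, z-score, and decimal-scaling normalization
# applied to a small, made-up list of attribute values.
import math

values = [200, 300, 400, 600, 1000]          # assumed attribute values
new_min, new_max = 0.0, 1.0                  # target range for min-max

v_min, v_max = min(values), max(values)
mean = sum(values) / len(values)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
j = len(str(int(max(abs(v) for v in values))))   # smallest j with max(|v|) / 10^j < 1

min_max   = [(v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min for v in values]
z_score   = [(v - mean) / std for v in values]
dec_scale = [v / 10 ** j for v in values]

print(min_max)    # [0.0, 0.125, 0.25, 0.5, 1.0]
print(z_score)    # values centred on 0 with unit (population) standard deviation
print(dec_scale)  # [0.02, 0.03, 0.04, 0.06, 0.1]
```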
3. Data Reduction – techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of original data. That is, mining
on the reduced data set should be more efficient yet produce the same analytical results.
Strategies for data reduction include the following,
I. Data cube aggregation – where aggregation operations are applied to the data in the
construction of a data cube.

II. Dimensionality Reduction – reduces the data set size by detecting and removing irrelevant, weakly relevant, or redundant attributes or dimensions. The following are the techniques:
i. Stepwise forward selection – the procedure starts with an empty set of attributes.
The best of the original attributes is determined and added to the set. At each
subsequent iteration or step, the best of the remaining original attributes is added
to the list.
ii. Stepwise backward elimination – the procedure starts with full set of an
attributes. At each step, it removes the worst attribute remaining in the set.
iii. Combining of forward selection and backward elimination –this procedure
selects the best attributes and removes the worst from among the remaining
attributes.

[Figure: Forms of Data Preprocessing]

III. Data Compression – in data compression, data encoding or transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the compression is called lossless. If we can reconstruct only an approximation of the original data, then the data compression technique is called lossy compression. The following are two lossy data compression techniques:
i. Wavelet Transforms – the discrete wavelet transform is a linear signal processing
technique that, when applied to a data vector D, transforms it to a numerically
different vector, D’ of wavelet coefficients.

 Method:
 Length, L, must be an integer power of 2 (padding with 0s, when necessary)
 Each transform has 2 functions: smoothing, difference
 Applies to pairs of data, resulting in two sets of data of length L/2
 Applies the two functions recursively, until the desired length is reached
ii. Principal Component Analysis
 Given N data vectors from k dimensions, find c <= k orthogonal vectors that can best be used to represent the data.
 The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions).
 Each data vector is a linear combination of the c principal component vectors.
 Works for numeric data only.
 Used when the number of dimensions is large.

IV. Numerosity Reduction – used to reduce the data volume by choosing alternative,
‘smaller’ forms of data representation. These techniques may be parametric or
nonparametric.
i. Parametric methods
 Assume the data fits some model, estimate model parameters, store only
the parameters, and discard the data (except possible outliers). Regression
and log-linear models can be used to approximate the given data.
ii. Nonparametric methods
 Nonparametric methods store reduced representations of the data without assuming a model. These include histograms, clustering, and sampling.
 Histograms – a popular reduction technique that divides the data into buckets and stores the average (or sum) for each bucket.
 Clustering – partition the data set into clusters, and store the cluster representations only. The objects within a cluster are similar to one another and dissimilar to objects in other clusters.
 Sampling – sampling can be used as a data reduction technique since it allows a large data set to be represented by a much smaller random sample of the data (see the sketch after this list).
 Simple random sample without replacement (SRSWOR) of size n:
This is created by drawing n of the N tuples from D (n < N), where
the probability of drawing any tuple in D is 1/N, that is, all tuples are
equally likely.
 Simple random sample with replacement (SRSWR) of size n: this is
similar to SRSWOR, except that each time a tuple is drawn from D, it
is recorded and then replaced. That is, after a tuple is drawn, it is
placed back in D so that it may be drawn again.
 Cluster sample
 Stratified sample
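A minimal sketch of simple random sampling with and without replacement, as referenced in the sampling item above. The data set and sample size are assumptions made purely for illustration:

```python
# Minimal sketch: SRSWOR and SRSWR over a small, made-up data set D.
import random

D = list(range(1, 101))   # assumed data set of N = 100 tuples
n = 10                    # desired sample size

srswor = random.sample(D, n)                      # without replacement: no tuple drawn twice
srswr  = [random.choice(D) for _ in range(n)]     # with replacement: a tuple may be drawn again

print(srswor)
print(srswr)
```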

4. Discretization and Concept Hierarchy Generation
I. Unsupervised discretization - class variable is not used.
i. Equal-interval (equiwidth) binning: split the whole range of values into intervals of equal size.
ii. Equal-frequency (equidepth) binning: use intervals containing an equal number of values.
II. Supervised discretization - uses the values of the class variable.
i. Using class boundaries. Three steps:
 Sort values.
 Place breakpoints between values belonging to different classes.
 If too many intervals, merge intervals with equal or similar class
distributions.
ii. Entropy (information)-based discretization. Example (a short computational sketch appears at the end of this Discretization section):
 Information in a class distribution:
 Denote a set of five values occurring in tuples belonging to two
classes (+ and -) as [+,+,+,-,-]
 That is, the first 3 belong to "+" tuples and the last 2 - to "-" tuples
 Then, Info([+,+,+,-,-]) = -(3/5)*log(3/5)-(2/5)*log(2/5)
(logs are base 2)
 3/5 and 2/5 are relative frequencies (probabilities)
 Ignoring the order of the values, we can use the following notation:
[3,2] meaning 3 values from one class and 2 - from the other.
 Then, Info([3,2]) = -(3/5)*log(3/5)-(2/5)*log(2/5)
 Information in a split (2/5 and 3/5 are weight coefficients):
 Info([+,+],[+,-,-]) = (2/5)*Info([+,+]) + (3/5)*Info([+,-,-])
 Or, Info([2,0],[1,2]) = (2/5)*Info([2,0]) + (3/5)*Info([1,2])

 Method:

 Sort the values;


 Calculate information in all possible splits;
 Choose the split that minimizes information;

 Do not include breakpoints between values belonging to the same class (this
will increase information);
 Apply the same to the resulting intervals until some stopping criterion is
satisfied.
III. Generating concept hierarchies: recursively applying partitioning or discretization methods.
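The following is a minimal sketch of the entropy calculations used in the example above, with the same five values [+,+,+,-,-] and the split [2,0] / [1,2] (written in Python purely for illustration):

```python
# Minimal sketch: information (entropy) of a class distribution and of a split,
# matching the [+,+,+,-,-] example above.
import math

def info(counts):
    """Entropy (base 2) of a class-count distribution, e.g. [3, 2]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(info([3, 2]))                 # Info([+,+,+,-,-])  ~ 0.971

# Information of the split [+,+] | [+,-,-], i.e. [2,0] and [1,2],
# weighted by the fraction of values in each interval (2/5 and 3/5):
split_info = (2/5) * info([2, 0]) + (3/5) * info([1, 2])
print(split_info)                   # ~ 0.551 (lower than 0.971, so this split is informative)
```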

[Figure: Forms of Data Preprocessing]

Discretization and Binarization in Data Mining

It is often necessary to transform a continuous attribute into a categorical attribute (Discretization), and
both continuous and discrete attributes may need to be transformed into one or more binary attributes
(Binarization).

Discretization is a frequently used process in data mining that transforms attributes that are in continuous format into discrete (categorical) attributes.

Binarization, on the other hand, is used to transform both discrete attributes and continuous attributes into binary attributes, as sketched below.
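The following is a minimal sketch of discretization followed by binarization. The attribute values, cut points, and category labels are assumptions made purely for illustration:

```python
# Minimal sketch: discretizing a continuous attribute into categories,
# then binarizing the categories into one binary attribute per category.

ages = [23, 37, 45, 61, 19]                       # assumed continuous attribute
labels = ["young", "middle-aged", "senior"]

def discretize(age):
    # assumed cut points 30 and 50, purely for illustration
    if age < 30:
        return "young"
    elif age < 50:
        return "middle-aged"
    return "senior"

categories = [discretize(a) for a in ages]
print(categories)        # ['young', 'middle-aged', 'middle-aged', 'senior', 'young']

# Binarization: one binary attribute per category (one-hot encoding)
binary = [[1 if c == label else 0 for label in labels] for c in categories]
print(binary)            # [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```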

Measures of Similarity and Dissimilarity- Basics

Similarity

 Numerical measure of how alike two data objects are


o Value is higher when objects are more alike
o Often falls in the range [0,1]

Dissimilarity (e.g., distance)

 Numerical measure of how different two data objects are


 Lower when objects are more alike
 Minimum dissimilarity is often 0
 Upper limit varies
 Proximity refers to a similarity or dissimilarity

Types of Attributes
 There are different types of attributes
o Nominal (examples: ID numbers, eye color, PIN codes)
o Ordinal (examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short})
o Interval (examples: calendar dates, temperatures in Celsius or Fahrenheit)
o Ratio (examples: temperature in Kelvin, length, time, counts)
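A minimal sketch relating a dissimilarity (Euclidean distance) to a similarity value in the range [0,1], for two made-up data objects. The conversion similarity = 1 / (1 + distance) is one common convention, assumed here for illustration:

```python
# Minimal sketch: Euclidean distance as a dissimilarity, and a simple
# similarity in [0, 1] derived from it, for two made-up data objects.
import math

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))   # 0 when objects are identical
similarity = 1 / (1 + distance)                                 # 1 when identical, -> 0 as distance grows

print(distance)     # ~3.742
print(similarity)   # ~0.211
```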
