
UNIT –II Data Preprocessing

Data preprocessing:- An overview, Data cleaning, Data Integration, Data Reduction, Data
transformation and Data discretization.

Data preprocessing is an important step in the data mining process. It refers to the cleaning, transforming,
and integrating of data in order to make it ready for analysis. The goal of data preprocessing is to improve
the quality of the data and to make it more suitable for the specific data mining task.

Some common steps in data preprocessing include:

Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as
missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as
imputation, removal, and transformation.

Data Integration: This involves combining data from multiple sources to create a unified dataset. Data
integration can be challenging as it requires handling data with different formats, structures, and
semantics. Techniques such as record linkage and data fusion can be used for data integration.

Data Transformation: This involves converting the data into a suitable format for analysis. Common
techniques used in data transformation include normalization, standardization, and discretization.
Normalization is used to scale the data to a common range, while standardization is used to transform the
data to have zero mean and unit variance. Discretization is used to convert continuous data into discrete
categories.

Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and feature
extraction. Feature selection involves selecting a subset of relevant features from the dataset, while
feature extraction involves transforming the data into a lower-dimensional space while preserving the
important information.

Data Discretization: This involves dividing continuous data into discrete categories or intervals.
Discretization is often used in data mining and machine learning algorithms that require categorical data.
Discretization can be achieved through techniques such as equal width binning, equal frequency binning,
and clustering.

Data Normalization: This involves scaling the data to a common range, such as between 0 and 1 or -1
and 1. Normalization is often used to handle data with different units and scales. Common normalization
techniques include min-max normalization, z-score normalization, and decimal scaling.

Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the analysis
results. The specific steps involved in data preprocessing may vary depending on the nature of the data
and the analysis goals.

By performing these steps, the data mining process becomes more efficient and the results become more
accurate.

Preprocessing in Data Mining:

Data preprocessing is a data mining technique which is used to transform the raw data into a useful and
efficient format.
Steps Involved in Data Preprocessing:

1. Data Cleaning:
Real-world data can have many irrelevant and missing parts. To handle this, data cleaning is done. It
involves handling of missing data, noisy data, etc.
(a). Missing Data:
This situation arises when some values are absent from the dataset. It can be handled in various ways.
Some of them are:

1. Ignore the tuples:

This approach is suitable only when the dataset is quite large and multiple values
are missing within a tuple.

2. Fill the Missing values:

There are various ways to do this task. You can choose to fill the missing values manually,
by the attribute mean, or by the most probable value.
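
As a rough illustration (assuming the pandas library and hypothetical column names "income" and "city"), missing values can be filled with the attribute mean or the most frequent value:

import pandas as pd

# Hypothetical toy data with missing entries (None becomes NaN)
df = pd.DataFrame({"income": [4000, None, 5200, 4800, None],
                   "city": ["Pune", "Delhi", None, "Delhi", "Delhi"]})

# Fill a numeric attribute with its mean
df["income"] = df["income"].fillna(df["income"].mean())

# Fill a categorical attribute with its most probable (most frequent) value
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)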

(b). Noisy Data:

Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to
faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data set is divided into
segments of equal size, and each segment is handled separately: for example, all values in a
segment can be replaced by the segment mean, or the boundary values of the segment can be used.

2. Regression:
Here the data can be made smooth by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple independent
variables).

3. Clustering:
This approach groups similar data into clusters. Values that fall outside the clusters are treated
as outliers (or the outliers may simply go undetected).
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for mining process. This
involves following ways:

1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining
process.

3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or conceptual levels.

4. Concept Hierarchy Generation:

Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the
attribute "city" can be converted to "country".

3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the dataset
while preserving the important information. This is done to improve the efficiency of data analysis and to
avoid overfitting of the model. Some common steps involved in data reduction are:

Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature
selection is often performed to remove irrelevant or redundant features from the dataset. It can be done
using various techniques such as correlation analysis, mutual information, and principal component
analysis (PCA).

Feature Extraction: This involves transforming the data into a lower-dimensional space while
preserving the important information. Feature extraction is often used when the original features are high-
dimensional and complex. It can be done using techniques such as PCA, linear discriminant analysis
(LDA), and non-negative matrix factorization (NMF).

Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used to
reduce the size of the dataset while preserving the important information. It can be done using techniques
such as random sampling, stratified sampling, and systematic sampling (a small sketch follows this list).

Clustering: This involves grouping similar data points together into clusters. Clustering is often used to
reduce the size of the dataset by replacing similar data points with a representative centroid. It can be
done using techniques such as k-means, hierarchical clustering, and density-based clustering.

Compression: This involves compressing the dataset while preserving the important information.
Compression is often used to reduce the size of the dataset for storage and transmission purposes. It can
be done using techniques such as wavelet compression, JPEG compression, and gzip compression.
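
As a small illustrative sketch of the sampling step above (the data, column names and sample sizes are invented, and pandas is assumed to be available), simple random and stratified sampling can look like this:

import pandas as pd

# Hypothetical data set with a 70/30 split across two segments
df = pd.DataFrame({"value": range(100),
                   "segment": ["A"] * 70 + ["B"] * 30})

# Simple random sampling: keep 10% of the rows
random_sample = df.sample(frac=0.10, random_state=42)

# Stratified sampling: keep 10% of the rows within each segment
stratified_sample = (df.groupby("segment", group_keys=False)
                       .apply(lambda g: g.sample(frac=0.10, random_state=42)))

print(len(random_sample), len(stratified_sample))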

Why preprocess data?

The data to be analyzed by data mining techniques are often incomplete, i.e. lacking attribute values or certain
attributes of interest, or containing only aggregate data.

The data may also be noisy, i.e. containing errors or outlier values.

The data may be inconsistent, i.e. containing discrepancies, for example in the department codes used to categorize items.

Incomplete, noisy and inconsistent data are common in large real-world databases and data warehouses. The data may
be incomplete for many reasons.
a) Attributes of interest may not always be available such as customer information for sales transaction data.

b) Other data may not be included simply, because it was not considered important at the time of entry.

c) Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions.

d) Data that were inconsistent with other recorded data may have been deleted.

e) The recording of the history or of modifications to the data may have been overlooked.

f) Missing data for tuples with missing values for some attributes may need to be inferred.

There are many reasons for noisy data.

a) The instruments used for data collection may be faulty.

b) There may be a human or computer error occurring at data entry.

c) Errors in data transmission can also occur.

d) There may be technology limitations, such as a limited buffer size for coordinating synchronized data
transfer and consumption.

e) The incorrect data may result from inconsistencies in the naming conventions or data codes used, or from
inconsistent formats for input fields such as "date". Duplicate tuples also require data cleaning.

Data Cleaning

Data cleaning is an essential step in the data mining process and is crucial to the construction of a model,
yet it is frequently overlooked. The major problem in quality information management is data quality:
problems with data quality can occur anywhere in an information system, and data cleansing offers a
solution to these issues.

Data cleaning is the process of correcting or deleting inaccurate, damaged, improperly formatted,
duplicated, or insufficient data from a dataset. Even if results and algorithms appear to be correct, they are
unreliable if the data is inaccurate. There are numerous ways for data to be duplicated or incorrectly
labeled when merging multiple data sources.

In general, data cleaning reduces errors and improves the quality of the data. Although it can be a time-
consuming and laborious operation, fixing data mistakes and removing incorrect information must be
done. Data mining itself can help here: data quality mining is a methodology that uses data mining
methods to find and fix data quality issues in large databases, since data mining automatically extracts
intrinsic and hidden information from large data sets. Data cleansing can therefore be supported by a
variety of data mining approaches.

To arrive at a precise final analysis, it is crucial to understand and improve the quality of your data. The
data must be prepared before key patterns can be identified; this is the purpose of exploratory data
mining. Data cleaning enables the user to identify erroneous or missing data before doing business
analysis and gaining insights.

Because data cleaning is so time-consuming, IT personnel are often needed to assist in the initial step of
reviewing the data. However, if your final analysis is inaccurate or you get an erroneous result, poor data
quality is a likely cause.
Steps for Cleaning Data

You can follow these fundamental stages to clean your data even if the techniques employed may vary
depending on the sorts of data your firm stores:

1. Remove duplicate or irrelevant observations

Remove duplicate or pointless observations as well as undesirable observations from your dataset. The
majority of duplicate observations will occur during data gathering. Duplicate data can be produced when
you merge data sets from several sources, scrape data, or get data from clients or other departments. One
of the most important factors to take into account in this procedure is de-duplication. Observations are
deemed irrelevant when they do not pertain to the particular problem you are attempting to analyze.

You might eliminate those useless observations, for instance, if you wish to analyze data on millennial
clients but your dataset also includes observations from earlier generations. This can improve the
analysis's efficiency, reduce deviance from your main objective, and produce a dataset that is easier to
maintain and use.

2. Fix structural errors

Structural errors arise when you measure or transfer data and find odd naming conventions, typos, or
wrong capitalization. These inconsistencies can result in mislabelled categories or classes. For instance,
"N/A" and "Not Applicable" might both appear on any given sheet, but they ought to be analyzed under the
same heading.

3. Filter unwanted outliers

There will frequently be isolated findings that, at first glance, do not seem to fit the data you are
analyzing. Removing an outlier if you have a good reason to, such as incorrect data entry, will improve
the performance of the data you are working with.

However, occasionally the emergence of an outlier will support a theory you are investigating. And just
because there is an outlier, that doesn't necessarily indicate it is inaccurate. To determine the reliability of
the number, this step is necessary. If an outlier turns out to be incorrect or unimportant for the analysis,
you might want to remove it.

4. Handle missing data

Because many algorithms won't tolerate missing values, you cannot simply overlook missing data. There are
a few options for handling it; none of them is ideal, but each can be considered, for example:

Although you can remove observations with missing values, doing so will result in the loss of
information, so proceed with caution.

Imputing missing values based on other observations can also undermine the integrity of the data,
since you are then working from assumptions rather than actual observations.

You may also need to change the way the data are used so that null values can be handled efficiently.

5. Validate and QA

As part of fundamental validation, you ought to be able to respond to the following queries once the data
cleansing procedure is complete:
• Are the data coherent?
• Does the data abide by the regulations that apply to its particular field?
• Does it support or refute your working theory? Does it offer any new information?
• To support your next theory, can you identify any trends in the data?
• If not, is there a problem with the data's quality?

Inaccurate or noisy data can lead to false conclusions, which in turn inform poor company strategy and
decision-making. False conclusions can also result in an embarrassing situation in a reporting meeting
when you discover that your data could not withstand further scrutiny. It is therefore crucial to establish a
culture of quality data in your organization before that point, and to document the tools you use to
support this plan.

Techniques for Cleaning Data

The data should be passed through one of the various data-cleaning procedures available. The procedures
are explained below:

1. Ignore the tuples: This approach is not very practical and is only suitable when a tuple has
several attributes with missing values.
2. Fill in the missing value: This approach is also not always practical or effective, and it can be
time-consuming: the missing value has to be supplied. The most common method is to do this
manually, but other options include using the attribute mean or the most probable value.
3. Binning method: This strategy is easy to understand. The sorted data are smoothed using the
values around them: the data are split into several equal-sized segments, and each segment is
then smoothed by one of the techniques described earlier.
4. Regression: The data are smoothed with the help of a regression function. The regression may be
multiple or linear; multiple regression has more independent variables than linear regression,
which only has one.
5. Clustering: This technique focuses on groups: similar values are grouped into a "group" or
"cluster", and clustering is then used to find the outliers.

Process of Data Cleaning

The data cleaning method for data mining is demonstrated in the subsequent sections.

1. Monitoring the errors: Keep track of the areas where errors seem to occur most frequently; this
makes it simpler to identify and correct inaccurate or corrupt information, which is particularly
important when integrating a new data source with existing management software.
2. Standardize the mining process: Standardize the point of data entry to help lower the likelihood
of duplication.
3. Validate data accuracy: Analyse the data and invest in data cleaning software; AI-based tools
can be used to thoroughly check for accuracy.
4. Scrub for duplicate data: Identify duplicates to save time during analysis. Investing in dedicated
de-duplication tools that can analyse imperfect data in bulk and automate the operation helps
avoid processing the same data repeatedly.
5. Research on data: Before this step, the data need to be vetted, standardized, and checked for
duplicates. Vetted and approved third-party sources can extract data straight from our databases
and help us gather the data and clean it up so that it is reliable, accurate, and comprehensive for
use in business decisions.
6. Communicate with the team: Keeping the team informed helps with client development and
relationship strengthening, and makes it possible to give more focused information to potential clients.

Data Cleansing Tools can be very helpful if you are not confident of cleaning the data yourself or have no
time to clean up all your data sets. You might need to invest in those tools, but it is worth the expenditure.
There are many data cleaning tools in the market. Here are some top-ranked data cleaning tools, such as:

1. OpenRefine
2. Trifacta Wrangler
3. Drake
4. Data Ladder
5. Data Cleaner
6. Cloudingo
7. Reifier
8. IBM InfoSphere QualityStage
9. TIBCO Clarity
10. Winpure

Noisy data:- Noise is a random error or variance in a measured variable. Given a numerical attribute such
as price, how can we "smooth" out the data to remove the noise?

Example:- sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34.

***Binning methods for data smoothing***

Partition into (equal frequency) bins:-

Bin1 : 4,8,15.

Bin2 : 21,21,24.

Bin3 : 25,28,34.

Smoothing by bin means:-

Bin1 : 9,9,9. -- ((4+8+15)/3 = 9)

Bin2 : 22,22,22. -- ((21+21+24)/3 = 22)

Bin3 : 29,29,29. -- ((25+28+34)/3 = 29)

Smoothing by bin boundaries:-

Bin1: 4,4,15.
Bin2: 21,21,24.

Bin3: 25,25,34.

Binning:- This method smooths a sorted data value by consulting its "neighborhood", i.e. the values around it.
Because binning methods consult the neighborhood of values, they perform local smoothing.

In the above example, the data for price are first sorted and then partitioned into equal-frequency bins of size
3 (i.e. each bin contains 3 values). In smoothing by bin means, each value in a bin is replaced by the mean
value of the bin.

For example, from the list 4, 8, 15, 21, 21, 24, 25, 28, 34, the mean of the values 4, 8, 15 in Bin1 is
(4+8+15)/3 = 9, so each original value in this bin is replaced by the value 9. Similarly, smoothing by bin
medians can be employed, in which each bin value is replaced by the bin median.
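
A minimal sketch of the price example above, using plain Python (no external libraries), for smoothing by bin means and by bin boundaries:

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]                       # already sorted
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]        # equal-frequency bins of size 3

# Smoothing by bin means: every value in a bin becomes the bin mean
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: every value becomes the closer of the two bin boundaries
by_bounds = [[min(b) if abs(v - min(b)) <= abs(v - max(b)) else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]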

Regression:- Data can be smoothed by fitting the data to a function such as with regression. Regression is
a statistical technique that relates a dependent variable to one or more independent variables. A
regression model is able to show whether changes observed in the dependent variable are associated with
changes in one or more of the independent variables.

In linear regression we find the "best" line to fit two attributes (or variables) so that one attribute can be
used to predict the other. Formulating a regression analysis helps you predict the effects of the
independent variable on the dependent one.

Example: we can say that age and height can be described using a linear regression model. Since a
person's height increases as age increases, they have a linear relationship.

Multiple linear regression:- It is an extension of linear regression, where more than two attributes are
involved and the data are fit to a multidimensional surface.
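
As a hedged sketch of regression-based smoothing (the age/height values are invented and NumPy is assumed), a least-squares line is fitted and each observed value is replaced by the fitted value:

import numpy as np

age = np.array([10, 12, 14, 16, 18, 20], dtype=float)
height = np.array([138, 145, 155, 158, 170, 172], dtype=float)    # noisy measurements

# Linear regression: height is approximately slope * age + intercept
slope, intercept = np.polyfit(age, height, deg=1)
smoothed = slope * age + intercept                                # fitted (smoothed) values

print(np.round(smoothed, 1))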

3) Clustering:- Values that fall outside of the set of clusters are called outliers, so clustering can be used to
detect outliers.
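
One possible way to use clustering for outlier detection (the data, the number of clusters, and the "tiny cluster" threshold are assumptions, and scikit-learn is assumed to be installed):

import numpy as np
from sklearn.cluster import KMeans

values = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34, 200], dtype=float).reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)
labels, counts = np.unique(km.labels_, return_counts=True)

# Treat members of very small clusters as candidate outliers (the threshold is arbitrary)
small_clusters = labels[counts <= 1]
outliers = values[np.isin(km.labels_, small_clusters)].ravel()
print(outliers)   # the extreme value 200 is expected to end up in a tiny cluster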

Discrepancy detection:- This is the first step in data cleaning as a process. Discrepancies can be caused
by several factors, such as poorly designed data entry forms with many optional fields, human error in data
entry, deliberate errors (e.g. respondents not wanting to give information about themselves) and data
decay (e.g. outdated addresses).

Discrepancies also arise from inconsistent data representations and the inconsistent use of codes, from
errors in the instrumentation devices that record data, and from system errors. Errors can also occur when
the data are used for purposes they were not intended for. Inconsistencies can also result from data
integration (e.g. when a given attribute has different names in different databases).

To detect discrepancies, use any knowledge regarding the properties of the data; such knowledge, or
"data about data", is called metadata.

Look for the inconsistent use of codes and any inconsistent data representation (such as “2024/07/25” and
“25/07/2024” for date).
• Unique rule:- Each value of the given attribute must be different from all other values for that
attribute (i.e. no repeated values are allowed for that attribute).
• Consecutive rule:- There can be no missing values between the lowest and the highest values for
the attribute, and all values must also be unique.
• Null rule:- This rule specifies the use of blanks, question marks, special characters or other strings
that may indicate the null condition, and how the null condition should be recorded. The data
should be examined against the unique, consecutive and null rules (a small sketch of such checks
follows this list).
• Examples:- 1) Store zero (0) for numerical attributes.
• 2) Use a blank for character attributes; any other convention that may be in use (such as "don't know"
or "?") should be transformed to a blank.
• For example, a salesman in a firm or company will have a commission value, while all other
employees will have commission recorded as NULL. Remember that NULL is different from zero (0).
• Some of the reasons for missing values are:-
• 1) The person originally asked to provide a value for the attribute refuses to fill it in, or finds that
the information requested is not applicable (e.g. a driving licence number left blank by individuals
who are not drivers).
• 2) The data entry person does not know the correct value.
• 3) The value is to be provided by a later step of the process.
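
A small sketch of how the unique rule and the null rule might be checked with pandas (the column names and values are hypothetical):

import pandas as pd

df = pd.DataFrame({"emp_id": [101, 102, 102, 104],
                   "commission": [500.0, None, 250.0, None]})

# Unique rule: every value of emp_id should be different from all others
duplicated_ids = df[df["emp_id"].duplicated(keep=False)]

# Null rule: decide how the null condition is recorded and count it explicitly
missing_commission = df["commission"].isna().sum()

print(duplicated_ids)
print("tuples with NULL commission:", missing_commission)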

• Different commercial tools available for discrepancy detection are as follows:-


• Data Scrubbing Tools:- They use simple domain knowledge (e.g. knowledge of postal addresses
and spell checking) to detect errors and make corrections in the data. These tools rely on parsing
and fuzzy matching techniques when cleaning data from multiple sources.
• Data Auditing Tools:- They find discrepancies by analyzing the data to discover rules and
relationships, and by detecting data that violate such conditions. Some data inconsistencies may be
corrected manually using external references.
• For example, errors made at data entry may be corrected by performing a paper trace. Most errors,
however, will require data transformation, which is the second step in the data cleaning process:
once the discrepancies are known, we need to define and apply (a series of) transformations to correct them.

Commercial tools can assist in the data transformation step. Data migration tools allow simple
transformations to be specified, such as to replace the string "gender" by "sex".

• The two-step process of discrepancy detection and data transformation (to correct discrepancies)
iterates. This process is error prone and time consuming.
• Some transformations may introduce more discrepancies, and some nested discrepancies may only be
detected after others have been fixed.
• Any tuples that cannot be automatically handled by a given transformation are typically written to a file
without any explanation of why they failed. The entire data cleaning process also suffers from a
lack of interactivity.
• A publicly available data cleaning tool is Potter's Wheel.
• It integrates discrepancy detection and transformation.
• The tool performs discrepancy checking automatically in the background on the latest transformed
view of the data.

Users can gradually develop and refine transformations as discrepancies are found; this leads to more
effective and efficient data cleaning.
Data Integration:-

Data integration in data mining refers to the process of combining data from multiple sources into a
single, unified view. This can involve cleaning and transforming the data, as well as resolving any
inconsistencies or conflicts that may exist between the different sources. The goal of data integration is to
make the data more useful and meaningful for the purposes of analysis and decision making. Techniques
used in data integration include data warehousing, ETL (extract, transform, load) processes, and data
federation.

Data Integration is a data preprocessing technique that combines data from multiple heterogeneous data
sources into a coherent data store and provides a unified view of the data. These sources may include
multiple data cubes, databases, or flat files.

Data integration is the process of combining data from multiple sources into a cohesive and consistent
view. This process involves identifying and accessing the different data sources, mapping the data to a
common format, and reconciling any inconsistencies or discrepancies between the sources. The goal of
data integration is to make it easier to access and analyze data that is spread across multiple systems or
platforms, in order to gain a more complete and accurate understanding of the data.

Data integration can be challenging due to the variety of data formats, structures, and semantics used
by different data sources. Different data sources may use different data types, naming conventions, and
schemas, making it difficult to combine the data into a single view. Data integration typically involves a
combination of manual and automated processes, including data profiling, data mapping, data
transformation, and data reconciliation.

Issues in Data Integration:

There are several issues that can arise when integrating data from multiple sources, including:

1. Data Quality: Inconsistencies and errors in the data can make it difficult to combine and analyze.
2. Data Semantics: Different sources may use different terms or definitions for the same data,
making it difficult to combine and understand the data.
3. Data Heterogeneity: Different sources may use different data formats, structures, or schemas,
making it difficult to combine and analyze the data.
4. Data Privacy and Security: Protecting sensitive information and maintaining security can be
difficult when integrating data from multiple sources.
5. Scalability: Integrating large amounts of data from multiple sources can be computationally
expensive and time-consuming.
6. Data Governance: Managing and maintaining the integration of data from multiple sources can
be difficult, especially when it comes to ensuring data accuracy, consistency, and timeliness.
7. Performance: Integrating data from multiple sources can also affect the performance of the
system.
8. Integration with existing systems: Integrating new data sources with existing systems can be a
complex task, requiring significant effort and resources.
9. Complexity: The complexity of integrating data from multiple sources can be high, requiring
specialized skills and knowledge.

There are three issues to consider during data integration:

1) Schema Integration.
2) Redundancy Detection.
3) Resolution of data value conflicts.
These are explained in brief below.

1. Schema Integration:

• Integrate metadata from different sources.
• Schema integration and object matching can be tricky: how can equivalent real-world entities from
multiple data sources be matched? This is called the entity identification problem.
• For example, how can the data analyst or the computer be sure that customer_id in one database and
cust_number in another refer to the same attribute?
• Examples of metadata for each attribute include its name, meaning, data type and the range of values
permitted for the attribute, as well as null rules for handling blank, zero or NULL values.

2. Redundancy Detection:

• An attribute (such as annual revenue) may be redundant if it can be derived or obtained from another
attribute or set of attributes.
• Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis: given two attributes, correlation analysis
measures how strongly one attribute implies the other based on the available data.
• For numerical attributes, we can evaluate the correlation between two attributes A and B by
computing the correlation coefficient; the higher the value, the stronger the correlation (a small
sketch follows this list).
• In addition to detecting redundancies between attributes, duplication should also be detected at
the tuple level.
• The use of denormalized tables also causes redundancy, so prefer normalized tables where possible.
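
A minimal sketch of correlation analysis between two numerical attributes (the revenue figures are invented and NumPy is assumed):

import numpy as np

annual_revenue = np.array([120, 150, 180, 210, 240], dtype=float)
monthly_revenue = np.array([10.0, 12.5, 15.0, 17.5, 20.0])

# Pearson correlation coefficient between the two attributes
r = np.corrcoef(annual_revenue, monthly_revenue)[0, 1]
print(round(r, 3))   # 1.0 here, so one attribute can be derived from the other (redundant)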

3. Resolution of data value conflicts:

• This is the third critical issue in data integration.
• For the same real-world entity, attribute values from different sources may differ. This may be due
to differences in representation, scaling or encoding.
• For example, a weight attribute may be stored in metric units in one system and in British imperial
units in another.
• An attribute in one system may also be recorded at a lower level of abstraction than the "same"
attribute in another system.
• When matching attributes from one database to another during integration, special attention must
be paid to the structure of the data, to ensure that any attribute functional dependencies and
referential constraints in the source system match those in the target system.
• Careful integration of data from multiple sources helps reduce and avoid redundancies and
inconsistencies in the resulting data set.
• This in turn improves the accuracy and speed of the subsequent mining process.
Data Transformation :-

Data transformation in data mining refers to the process of converting raw data into a format that is
suitable for analysis and modeling. The goal of data transformation is to prepare the data for data mining
so that it can be used to extract useful insights and knowledge. Data transformation typically involves
several steps, including:

1. Data cleaning: Removing or correcting errors, inconsistencies, and missing values in the data.
2. Data integration: Combining data from multiple sources, such as databases and spreadsheets,
into a single format.
3. Data normalization: Scaling the data to a common range of values, such as between 0 and 1, to
facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset of relevant features
or attributes.
5. Data discretization: Converting continuous data into discrete categories or bins.
6. Data aggregation: Combining data at different levels of granularity, such as by summing or
averaging, to create new features or attributes.
Data transformation is an important step in the data mining process: it helps to ensure that the data are in
a format suitable for analysis and modeling and free of errors and inconsistencies. It can also improve the
performance of data mining algorithms by reducing the dimensionality of the data and by scaling the data
to a common range of values.

The data are transformed in ways that are ideal for mining the data. The data transformation involves
steps that are:

1. Smoothing: Smoothing is a process used to remove noise from the dataset using algorithms. It
highlights the important features present in the dataset and helps in predicting patterns. When collecting
data, noise and other variance can be eliminated or reduced. The idea behind data smoothing is that it
identifies simple changes that help predict trends and patterns, which assists analysts or traders who must
look at a lot of data that would otherwise be difficult to digest, revealing patterns they would not see otherwise.

2. Aggregation: Data collection or aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources to integrate these data sources into a data
analysis description. This is a crucial step since the accuracy of data analysis insights is highly dependent
on the quantity and quality of the data used. Gathering accurate data of high quality and a large enough
quantity is necessary to produce relevant results. The collection of data is useful for everything from
decisions concerning financing or business strategy of the product, pricing, operations, and marketing
strategies. For example, sales data may be aggregated to compute monthly and annual total amounts.

3. Discretization: This is a process of transforming continuous data into a set of small intervals. Most
real-world data mining tasks involve continuous attributes, yet many existing data mining frameworks are
unable to handle them. Even when a data mining task can manage a continuous attribute, its efficiency can
often be improved significantly by replacing the continuous attribute with its discrete values. For example,
ages 1-10 and 11-20 can be replaced by labels such as young, middle aged, and senior.

4. Attribute Construction: Where new attributes are created & applied to assist the mining process from
the given set of attributes. This simplifies the original data & makes the mining more efficient.

5. Generalization: It converts low-level data attributes to high-level data attributes using concept
hierarchy. For Example Age initially in Numerical form (22, 25) is converted into categorical value
(young, old). For example, Categorical attributes, such as house addresses, may be generalized to higher-
level definitions, such as town or country.
6. Normalization: Data normalization involves converting all data variables into a given range.
Techniques that are used for normalization are:

• Min-Max Normalization:
o This transforms the original data linearly.
o Suppose min_A is the minimum and max_A is the maximum value of an attribute A, v is an
original value of A, and v' is the value obtained after normalization.
o Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A]
by computing:
o v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
• Z-Score Normalization:
o In z-score normalization (or zero-mean normalization) the values of an attribute A are
normalized based on the mean of A and its standard deviation.
o A value v of attribute A is normalized to v' by computing:
o v' = (v - mean_A) / std_A
• Decimal Scaling:
o It normalizes the values of an attribute by moving their decimal point.
o The number of decimal places moved depends on the maximum absolute value of attribute A.
o A value v of attribute A is normalized to v' by computing v' = v / 10^j, where j is the
smallest integer such that max(|v'|) < 1.
o Suppose the values of an attribute A vary from -99 to 99; the maximum absolute value is 99.
o To normalize, we divide each value by 100 (i.e. j = 2), so the values become 0.98, 0.97, and so on.
o A combined sketch of the three techniques follows.
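
A combined sketch of the three normalization techniques, assuming NumPy and invented attribute values:

import numpy as np

v = np.array([200, 300, 400, 600, 1000], dtype=float)    # hypothetical attribute values

# Min-max normalization to the new range [0, 1]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization (zero mean, unit variance)
v_zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10**j, where j is the smallest integer with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
v_decimal = v / (10 ** j)

print(v_minmax, v_zscore, v_decimal, sep="\n")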

Data Reduction

Data reduction is a technique used in data mining to reduce the size of a dataset while still preserving the
most important information. This can be beneficial in situations where the dataset is too large to be
processed efficiently, or where the dataset contains a large amount of irrelevant or redundant information.

There are several different data reduction techniques that can be used in data mining,
including:

1. Data Sampling: This technique involves selecting a subset of the data to work with, rather than
using the entire dataset. This can be useful for reducing the size of a dataset while still preserving
the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features in the
dataset, either by removing features that are not relevant or by combining multiple features into a
single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete data by
partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the dataset that are
most relevant to the task at hand.
It is important to note that data reduction involves a trade-off between accuracy and data size: the more
the data are reduced, the less accurate and the less generalizable the resulting model may be.

Methods of data reduction:


These are explained as follows.

1. Data Cube Aggregation:

This technique aggregates data into a simpler form. For example, imagine that the information you
gathered for your analysis of the years 2012 to 2014 includes your company's revenue for every three
months. If you are interested in the annual sales rather than the quarterly figures, you can summarize the
data so that the result shows the total sales per year instead of per quarter. This summarizes, and thereby
reduces, the data.

2. Dimension reduction:
Whenever we come across attributes that are only weakly relevant, we keep only the attributes required for
our analysis. Dimension reduction reduces the data size as it eliminates outdated or redundant features.

• Step-wise Forward Selection –

The selection begins with an empty set of attributes; at each step the best of the remaining original
attributes is added, judged by its relevance to the mining task (for example, using a statistical
measure such as a p-value). A small sketch of forward selection appears after this list.

Suppose there are the following attributes in the data set in which few attributes are redundant.

Initial attribute Set: {X1, X2, X3, X4, X5, X6}


Initial reduced attribute set: { }

Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}

• Step-wise Backward Selection –


This selection starts with the complete set of attributes in the original data and, at each step, it
eliminates the worst remaining attribute in the set.

Suppose there are the following attributes in the data set in which few attributes are redundant.

Initial attribute Set: {X1, X2, X3, X4, X5, X6}


Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }

Step-1: {X1, X2, X3, X4, X5}


Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}

• Combination of Forward and Backward Selection –

It allows us to remove the worst attributes and select the best ones at each step, saving time and making
the process faster.
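
A hedged sketch of step-wise forward selection (scikit-learn is assumed, and the synthetic data are constructed so that attributes X1, X2 and X5 carry the signal):

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                       # attributes X1 ... X6
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 4] + rng.normal(scale=0.1, size=200)

# Forward selection: start from an empty set and greedily add the best attribute
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=3,
                                     direction="forward").fit(X, y)
print(selector.get_support())   # expected True at positions 0, 1 and 4 (X1, X2, X5)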

3. Data Compression:
The data compression technique reduces the size of the files using different encoding mechanisms
(Huffman Encoding & run-length Encoding). We can divide it into two types based on their compression
techniques.
• Lossless Compression –
Encoding techniques (such as run-length encoding) allow a simple and modest reduction in data size.
Lossless data compression uses algorithms that restore the precise original data from the
compressed data (a small sketch follows this list).
• Lossy Compression –
Methods such as the discrete wavelet transform and PCA (principal component analysis)
are examples of this kind of compression. For example, the JPEG image format uses lossy
compression, yet the decompressed image remains visually close to the original. In lossy data
compression, the decompressed data may differ from the original data but are still useful enough to
retrieve information from.
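
A tiny sketch of lossless compression with Python's built-in gzip module (the sample text is made up):

import gzip

raw = ("price,quantity\n" + "19.99,3\n" * 1000).encode("utf-8")
compressed = gzip.compress(raw)
restored = gzip.decompress(compressed)

# Lossless: the restored bytes are identical to the original
print(len(raw), len(compressed), restored == raw)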

4. Numerosity Reduction:
In this reduction technique the actual data are replaced with a mathematical model or a smaller
representation of the data, so that only the model parameters need to be stored rather than the
actual data; alternatively, non-parametric methods such as clustering, histograms, and sampling can be used.
The data volume can thus be reduced by choosing alternative, smaller forms of data representation.
These techniques may be parametric or non-parametric.

Parametric methods:- A model is used to estimate the data, so that only the model parameters are
stored, instead of the actual data.

Example:- Log-linear models.

Non-parametric methods:- These methods store a reduced representation of the data without assuming a model.

Example:- Histograms, clustering and sampling.

** Log-linear models:- They approximate discrete multidimensional probability distributions. Log-linear
models are used to estimate the probability of each point in a multidimensional space, for a set of
discretized attributes, based on a smaller subset of dimensional combinations. This allows a higher-
dimensional data space to be constructed from lower-dimensional spaces. Log-linear models are also
useful for dimensionality reduction and data smoothing.


**Histograms:- Histograms are a popular method of data reduction. A histogram uses binning to approximate
the data distribution. The histogram for an attribute A partitions the data distribution of A into disjoint
subsets, or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are
called singleton buckets.

There are several partitioning rules to determine the buckets and partition the attribute values:

i) Equal width:- In an equal-width histogram, the width of each bucket range is uniform.

ii) Equal frequency (or equi-depth):- The buckets are created so that the frequency of each bucket is
constant (i.e. each bucket contains roughly the same number of contiguous data samples). A small sketch
of equal-width and equal-frequency bucketing follows this list.

iii) V-optimal:- This histogram has the least variance. Histogram variance is a weighted sum of the
original values that each bucket represents, where the bucket weight is equal to the number of values in the
bucket.

iv) MaxDiff:- In this histogram we consider the difference between each pair of adjacent values. A bucket
boundary is established between the pairs having the B-1 largest differences, where B is the number of
buckets specified by the user.
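
A small sketch contrasting equal-width and equal-frequency buckets on the price data used earlier (NumPy is assumed):

import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

# Equal-width buckets: each of the 3 buckets covers the same value range
width_edges = np.linspace(prices.min(), prices.max(), num=4)
width_counts, _ = np.histogram(prices, bins=width_edges)

# Equal-frequency (equi-depth) buckets: each bucket holds roughly the same number of values
freq_edges = np.quantile(prices, [0, 1/3, 2/3, 1])
freq_counts, _ = np.histogram(prices, bins=freq_edges)

print(width_edges, width_counts)
print(freq_edges, freq_counts)
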
5. Discretization & Concept Hierarchy Operation:
Techniques of data discretization are used to divide attributes of a continuous nature into data with
intervals. We replace the many raw values of an attribute with labels for small intervals, so that the
mining results can be presented in a concise and easily understandable way.

• Top-down discretization –
If we first choose one or a few points (so-called breakpoints or split points) to divide the whole
range of the attribute and then repeat this on the resulting intervals until the end, the process is
known as top-down discretization, also called splitting.
• Bottom-up discretization –
If we first consider all of the continuous values as potential split points and then discard some of
them by merging neighbourhood values into intervals, the process is called bottom-up
discretization, also called merging.

Concept Hierarchies:
It reduces the data size by collecting and then replacing the low-level concepts (such as 43 for age) with
high-level concepts (categorical variables such as middle age or Senior).

For numeric data following techniques can be followed:

• Binning –
Binning is the process of changing numerical variables into categorical counterparts. The number
of categorical counterparts depends on the number of bins specified by the user.
• Histogram analysis –
Like the process of binning, the histogram is used to partition the value for the attribute X, into
disjoint ranges called brackets. There are several partitioning rules:
1. Equal Frequency partitioning: Partitioning the values so that each bucket holds roughly the
same number of occurrences from the data set.
2. Equal Width partitioning: Partitioning the values into ranges of fixed width determined by the
number of bins, e.g. 0-20, 21-40, and so on.
3. Clustering: Grouping similar data together.

Discretization in data mining

Data discretization refers to a method of converting a huge number of data values into smaller ones, so
that the evaluation and management of the data become easier. In other words, data discretization is a
method of converting the values of a continuous attribute into a finite set of intervals with minimal loss of
information. There are two forms of data discretization: supervised discretization and unsupervised
discretization. Supervised discretization uses the class information. Unsupervised discretization does not,
and is characterized by the direction in which the operation proceeds: it uses either a top-down splitting
strategy or a bottom-up merging strategy.

Now, we can understand this concept with the help of an example

Suppose we have an attribute Age with the given values:

Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

Table before and after discretization:

Attribute values (Age)       After discretization
1, 5, 4, 9, 7                Child
11, 14, 17, 13, 18, 19       Young
31, 33, 36, 42, 44, 46       Mature
70, 74, 77, 78               Old
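
A brief sketch of the same Age discretization with pandas (the exact cut points 10, 30 and 60 are assumptions chosen to match the table above):

import pandas as pd

ages = [1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77]

labels = pd.cut(pd.Series(ages),
                bins=[0, 10, 30, 60, 120],
                labels=["Child", "Young", "Mature", "Old"])
print(labels.value_counts().sort_index())
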
Some Famous techniques of data discretization

Histogram analysis

A histogram is a plot used to represent the underlying frequency distribution of a continuous data set.
Histograms assist in inspecting the data distribution, for example for outliers, skewness, or approximate
normality.

Binning

Binning refers to a data smoothing technique that helps to group a huge number of continuous values into
smaller values. For data discretization and the development of idea hierarchy, this technique can also be
used.

Cluster Analysis

Cluster analysis is a form of data discretization: a clustering algorithm partitions the values of a numeric
attribute X into clusters, and each resulting cluster is treated as one discrete interval of X.

Data discretization using decision tree analysis

Data discretization can also be carried out with decision tree analysis, which uses a top-down splitting
technique. It is a supervised procedure. To discretize a numeric attribute, the split point that gives the
least entropy is selected first, and the procedure is then applied recursively: the resulting intervals are
divided further, from top to bottom, using the same splitting criterion, yielding a set of disjoint
discretized intervals.
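
A hedged sketch of entropy-based (decision tree) discretization using scikit-learn; the attribute values and class labels are synthetic, and the number of leaves is an arbitrary choice:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
age = rng.integers(1, 80, size=300).reshape(-1, 1).astype(float)
cls = np.digitize(age.ravel(), [25, 55])             # hypothetical 3-way class labels

# Entropy-based splits on one numeric attribute give candidate cut points
tree = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=4).fit(age, cls)
cut_points = sorted(t for t in tree.tree_.threshold if t != -2)   # -2 marks leaf nodes
print(np.round(cut_points, 1))   # use these thresholds as interval boundaries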

Data discretization using correlation analysis

Discretization based on correlation analysis (such as the ChiMerge method) is a supervised, bottom-up
procedure: the best neighbouring intervals are found and then merged step by step into larger intervals
until a stopping criterion is met.

Data discretization and concept hierarchy generation

The term hierarchy represents an organizational structure or mapping in which items are ranked according
to their level of importance. In other words, a concept hierarchy refers to a sequence of mappings from a
set of low-level (more specific) concepts to higher-level, more general concepts. For example, in computer
science there are different kinds of hierarchical systems: a document placed in a folder in Windows, at a
specific place in the tree structure, is a good example of a hierarchical tree model. There are two types of
hierarchy mapping: top-down mapping and bottom-up mapping.

Let's understand this concept hierarchy for the dimension location with the help of an example.

A particular city can be mapped to the country to which it belongs. For example, New Delhi can be mapped
to India, and India can be mapped to Asia.
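
A toy illustration of a concept hierarchy for the location dimension (the mappings are illustrative only):

# city -> country -> continent, as a pair of simple dictionaries
city_to_country = {"New Delhi": "India", "Mumbai": "India", "Tokyo": "Japan"}
country_to_continent = {"India": "Asia", "Japan": "Asia"}

def generalize(city):
    country = city_to_country.get(city)
    return country, country_to_continent.get(country)

print(generalize("New Delhi"))   # ('India', 'Asia')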

Top-down mapping

Top-down mapping generally starts at the top with some general information and ends at the bottom with
the specialized information.

Bottom-up mapping
Bottom-up mapping generally starts at the bottom with some specialized information and ends at the top
with the generalized information.


****************************END OF UNIT 2**********************************
