Data Cleaning and Data Transformation

The quality of the data is critical in analysis, because data that are incomplete, noisy, or inconsistent will affect the results. Data cleaning in data mining is therefore the process of detecting and removing (or correcting) corrupt or incorrect records from a record set, table, or database.
There are several data cleaning methods for handling missing values (a short sketch follows this list):
• Ignore the tuple: this is usually done when the class label is missing. The technique is not very effective unless the tuple contains many attributes with missing values.
• Fill in the missing value manually: this approach works for small data sets with only a few missing values.
• Replace every missing attribute value with a global constant, such as a label like “Unknown” or minus infinity.
• Use the attribute mean to fill in the missing value. For example, if the average customer income is 30,000, that value can be used to replace a missing income value.
• Fill in the missing value with the most probable value.
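The following is a minimal sketch of some of these options using pandas; the DataFrame and its class_label and income columns are hypothetical, not from the original.

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "class_label": ["A", None, "B", "A"],
    "income": [25000, np.nan, 42000, np.nan],
})

# 1. Ignore the tuple: drop rows whose class label is missing
df_dropped = df.dropna(subset=["class_label"])

# 2. Replace missing values with a global constant
df_const = df.fillna({"class_label": "Unknown"})

# 3. Use the attribute mean to fill in missing numeric values
df_mean = df.fillna({"income": df["income"].mean()})
```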
Advantages of data cleansing:
• Accurate and quality data
• Improved decision making
• Streamlined business practices
• Increased productivity
• Increased revenue
Benefits of Data Cleaning
The key benefits of the data cleaning process are as follows:
1. It removes errors and inconsistencies that are inevitable when data from different sources are pushed into one dataset.
2. Data cleanup tools make everyone more efficient, since they can quickly get what they need from the data.
3. Fewer errors mean happier users and fewer frustrated employees.
4. The ability to map the different functions and what the data is intended to do.
Steps to clean the data
1. Monitor Errors: Keep a record and watch for trends in where most errors are coming from, as this will make it much easier to identify and fix incorrect or corrupt data. This is particularly important if you are integrating other solutions with your fleet management software, so that errors do not obstruct the work of other departments.
2. Standardize Your Processes: By standardizing the data entry process you ensure a good point of entry and reduce the risk of duplication.
3. Validate Accuracy: Validate the accuracy of the data once the existing database has been cleaned. Research and invest in data tools that allow you to clean the data in real time; some tools now use artificial intelligence or machine learning to better test for accuracy.
4. Scrub for Duplicate Data: Identify duplicates, since this will save time when analysing the data. Duplicates can be avoided by researching and investing in data cleaning tools (a short sketch of duplicate scrubbing follows these steps).
5. Analyse: After the data has been standardized, validated, and scrubbed for duplicates, use third-party sources to append it. Reliable third-party sources can capture data directly from first-party sites, then clean and compile the data to provide more complete information for business intelligence and analytics.
6. Communicate with the Team: Communicate the new standardized cleaning process to the team. Now that the data has been scrubbed, it is important to keep it clean. This will help to develop and strengthen user segmentation and send more targeted information to users and prospects.
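As a minimal sketch of step 4, duplicates can be scrubbed with pandas; the customer_id and email columns are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@x.com", "b@x.com", "b@x.com", "c@x.com"],
})

# Flag rows that are exact repeats of an earlier row
duplicate_mask = df.duplicated(subset=["customer_id", "email"], keep="first")
print(df[duplicate_mask])

# Keep only the first occurrence of each duplicate group
df_clean = df.drop_duplicates(subset=["customer_id", "email"], keep="first")
```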
What kinds of issues affect the quality of
data?
• Invalid values: some attributes have a known set of valid values, e.g. gender must be “F”
(Female) or “M” (Male). Here it is easy to detect wrong values.
• Formats: the most common issue; values can arrive in different formats, such as a name
written as “Name, Surname” or “Surname, Name”.
• Attribute dependencies: the value of one feature depends on the value of another feature.
For instance, in school data the “number of students” depends on whether the person
“is a teacher?”; if someone is not a teacher, they cannot have any students.
• Uniqueness: duplicated data can appear in features that only allow unique values.
For instance, an ID field cannot have two products with the same identifier.
• Missing values: features in the dataset may have blank or null values.
• Misspellings: values written incorrectly.
• Misfielded values: a field contains the values of another field.
How can I detect and fix these issues?

There are many methods that can be used to find these issues. For example:
• Visualization: visualize the values (or a random sample of them) to see whether they look right.
• Outlier analysis: analyse whether an extreme value could be a human error, e.g. a 200-year-old person in the “age” feature.
• Validation code: write code that checks whether the data is right. For instance, for uniqueness, check whether the length of the data is the same as the length of the vector of unique values (see the sketch below).
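A minimal sketch of such validation code in pandas, assuming hypothetical product_id and age columns:

```python
import pandas as pd

df = pd.DataFrame({"product_id": [101, 102, 102], "age": [34, 200, 28]})

# Uniqueness check: the number of distinct IDs should equal the number of rows
if df["product_id"].nunique() != len(df):
    print("Duplicate product IDs found")

# Outlier check: flag implausible ages for manual review
outliers = df[(df["age"] < 0) | (df["age"] > 120)]
print(outliers)
```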
We can apply many methods to fix the various issues:
• Misspelled data: replace the incorrect value with the most similar valid value (a short sketch follows this list).
• Uniqueness: replace one of the repeated values with a value that is not already present in the feature.
• Missing data: how missing data is handled is a key decision; null values can be replaced with the mean, median, or mode of the feature.
• Formats: ensure values use the same number of decimals and that dates use a consistent format.
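A minimal sketch of fixing misspelled categorical values by replacing them with the most similar valid value, using Python's standard difflib; the list of valid values is hypothetical.

```python
import difflib

valid_values = ["Female", "Male"]

def fix_spelling(value, valid_values, cutoff=0.6):
    """Return the closest valid value, or the original if nothing is similar enough."""
    matches = difflib.get_close_matches(value, valid_values, n=1, cutoff=cutoff)
    return matches[0] if matches else value

print(fix_spelling("Femle", valid_values))    # -> "Female"
print(fix_spelling("unknown", valid_values))  # -> "unknown" (left unchanged)
```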
Data Transformation in Data Mining
Meaning of Data
Transformation:
• In the data transformation process, data are transformed from one format into another format that is more appropriate for data mining.
• It is the process of converting data or information from one format to another, generally from the format of a source system into the required format of a new destination system.
• This process usually involves converting documents, whereas data conversion may involve converting a program from one computer language to another so that it can run on a different platform. Data transformation requires a special program that can read the data's original base language, determine the language into which the information has to be translated for it to be usable by the new program or system, and then carry out the translation.
Data Transformation has two key
phases:
Data Mapping: The mapping of elements from the source base or system to the destination, to capture all transformations that occur.
• This can become more difficult when there are complex transformations, such as many-to-one or one-to-many transformation rules.
Code Generation: The creation of the actual transformation program.
• The resulting data map specification is used to create an executable program that runs on the computer systems (a short sketch follows).
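As a minimal illustration of data mapping, the sketch below maps hypothetical source field names to destination field names and applies simple transformation rules; the field names and rules are assumptions, not from the original.

```python
# Hypothetical mapping from source fields to destination fields,
# each paired with a transformation rule
field_map = {
    "cust_nm": ("customer_name", str.strip),
    "dob":     ("date_of_birth", lambda v: v.replace("/", "-")),
    "amt":     ("amount",        float),
}

def transform_record(source_record):
    """Apply the data map to a single source record."""
    destination = {}
    for src_field, (dst_field, rule) in field_map.items():
        if src_field in source_record:
            destination[dst_field] = rule(source_record[src_field])
    return destination

print(transform_record({"cust_nm": " Aman ", "dob": "2001/05/14", "amt": "99.5"}))
# -> {'customer_name': 'Aman', 'date_of_birth': '2001-05-14', 'amount': 99.5}
```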
Data Transformation Strategies:
1. Smoothing: Smoothing is the process of removing noise from the data.
2. Aggregation: Aggregation is the process where summary or aggregation operations are applied to the data.
3. Generalization: By climbing concept hierarchies, low-level data are replaced with high-level data.
4. Normalization: Normalization scales attribute data so that they fall within a small specified range, such as 0.0 to 1.0 (a short sketch follows this list).
5. Attribute Construction: In attribute construction, new attributes are constructed from the given set of attributes.
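A minimal sketch of min-max normalization to the range 0.0 to 1.0 using pandas; the income column is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"income": [20000, 30000, 45000, 90000]})

# Min-max normalization: rescale values to fall within [0.0, 1.0]
col_min, col_max = df["income"].min(), df["income"].max()
df["income_normalized"] = (df["income"] - col_min) / (col_max - col_min)

print(df)
```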
Is it possible to transform the features to gain more information?
There are many methods that add information for the algorithm (a short sketch follows this list):
Data Binning or Bucketing: a pre-processing technique used to reduce the effects of minor observation errors. The values are split into intervals and replaced by categorical values.
Indicator variables: this technique converts categorical data into Boolean values by creating indicator variables. If a category has more than two values (n), n-1 indicator columns are created.
Centering & Scaling: a feature can be centered by subtracting its mean from all of its values; to scale it, the centered feature is divided by its standard deviation.

Other techniques:
Outliers can be grouped under a single value, or a value can be replaced with the number of times it appears in the feature.
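A minimal sketch of binning, indicator variables, and centering & scaling with pandas; the age, gender, and income columns are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [15, 27, 44, 70],
    "gender": ["F", "M", "F", "M"],
    "income": [12000, 30000, 52000, 41000],
})

# Binning / bucketing: split a numeric feature into intervals with categorical labels
df["age_bin"] = pd.cut(df["age"], bins=[0, 18, 35, 60, 120],
                       labels=["child", "young", "adult", "senior"])

# Indicator variables: n-1 Boolean columns for a categorical feature
df = pd.get_dummies(df, columns=["gender"], drop_first=True)

# Centering & scaling: subtract the mean, then divide by the standard deviation
df["income_scaled"] = (df["income"] - df["income"].mean()) / df["income"].std()

print(df)
```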
Data Transformation types and dimensional attributes
The main function of an Extract, Transform, and Load (ETL) tool is to transform data. The transformation step is the most important stage of building a structured data warehouse. The major transformation types are:
Format revision: A field may contain both numeric and text data types; the data type needs to be standardized (for example, changed to text) to provide values that can be used consistently.
Decoding of fields: In multiple source systems, the same data items are described by a variety of field values.
Many legacy systems are notorious for using cryptic codes to represent business values.
This Extract, Transform, and Load (ETL) transformation type changes such codes into values that make sense to end users (see the sketch below).
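A minimal sketch of decoding cryptic codes into readable values with pandas; the code-to-value mapping and column name are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"cust_status": ["A", "I", "P", "A"]})

# Hypothetical legacy codes decoded into values meaningful to end users
status_codes = {"A": "Active", "I": "Inactive", "P": "Pending"}
df["cust_status"] = df["cust_status"].map(status_codes)

print(df)
```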
Calculated and derived values: Calculating the total cost and the profit margin before the data are stored in the data warehouse is an instance of a calculated value. Storing a user's age separately would be an example of a derived value (see the sketch below).
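A minimal sketch of calculated and derived values in pandas; the column names and the derivation of age from a date of birth are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "quantity": [3, 5],
    "unit_cost": [10.0, 4.0],
    "revenue": [45.0, 30.0],
    "date_of_birth": pd.to_datetime(["1990-06-01", "2001-11-23"]),
})

# Calculated values: computed from other fields before loading into the warehouse
df["total_cost"] = df["quantity"] * df["unit_cost"]
df["profit_margin"] = (df["revenue"] - df["total_cost"]) / df["revenue"]

# Derived value: age derived from date of birth (approximate, in whole years)
df["age"] = (pd.Timestamp.today() - df["date_of_birth"]).dt.days // 365

print(df)
```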
Splitting single fields: In some source systems the first name, middle name, last name, and other values are stored as a single large text field. Storing the individual components of names and addresses in separate fields in the data repository improves operating performance by allowing indexing and analysis of the individual components (see the sketch below).
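A minimal sketch of splitting a single name field into separate components with pandas; the full_name column and the resulting field names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"full_name": ["John Michael Smith", "Mary Ann Jones"]})

# Split the single text field into first, middle, and last components.
# (Real data would need extra rules for names without a middle name.)
df[["first_name", "middle_name", "last_name"]] = df["full_name"].str.split(" ", n=2, expand=True)

print(df)
```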
Merging of information: This type of data transformation in Extract, Transform, and Load (ETL) does not literally mean merging various fields to create a single field. In this case, merging of data means establishing the relationship between various fields, such as product price, description, and package type, and viewing these fields as a single entity.
