Data Preprocessing
Data Preprocessing covers:
• Data cleaning: handling missing values, outliers, duplicates
• Data transformation: scaling, normalization, encoding categorical variables
• Feature selection: selecting relevant features/columns
• Data merging: combining multiple datasets
Data cleaning: Data cleaning is one of the most important parts of machine learning and plays a significant role in building a model.
Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves
identifying and removing any missing, duplicate, or irrelevant data. The goal of data
cleaning is to ensure that the data is accurate, consistent, and free of errors, as
incorrect or inconsistent data can negatively impact the performance of the ML
model. Professional data scientists usually invest a very large portion of their time
in this step because of the belief that “Better data beats fancier algorithms”.
Data cleaning is essential because raw data is often noisy, incomplete, and
inconsistent, which can negatively impact the accuracy and reliability of the insights
derived from it.
Data cleaning involves the systematic identification and correction of errors,
inconsistencies, and inaccuracies within a dataset, encompassing tasks such as
handling missing values, removing duplicates, and addressing outliers. This
meticulous process is essential for enhancing the integrity of analyses, promoting
more accurate modeling, and ultimately facilitating informed decision-making based
on trustworthy and high-quality data.
The first common strategy for dealing with missing data is to delete the rows with
missing values. Typically, any row which has a missing value in any cell gets
deleted. However, this often means many rows will get removed, leading to loss of
information and data. Therefore, this method is typically not used when there are
few data samples.
We can also impute the missing data. This can be based solely on information in the
column that has missing values, or it can be based on other columns present in the
dataset.
Finally, we can use classification or regression models to predict missing values.
• Replace it with a constant value. This can be a good approach when the value is chosen in discussion with a domain expert for the data we are dealing with.
• Replace it with the mean or median. This is a decent approach when the data size is small, but it does add bias.
• Replace it with values derived from information in other columns. (These strategies are illustrated in the sketch below.)
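As a minimal sketch of these strategies, the following hypothetical pandas example drops rows with missing values and imputes the rest with a column statistic or a constant (the column names and values are made up for illustration):

import numpy as np
import pandas as pd

# Hypothetical toy dataset with missing values (NaN / None)
df = pd.DataFrame({
    "age": [25, np.nan, 38, 41, np.nan],
    "salary": [50000, 62000, np.nan, 71000, 58000],
    "city": ["Pune", "Mumbai", None, "Delhi", "Pune"],
})

# Strategy 1: delete rows that contain any missing value
dropped = df.dropna()

# Strategy 2: impute with a statistic computed from the same column
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())            # mean
imputed["salary"] = imputed["salary"].fillna(imputed["salary"].median()) # median

# Strategy 3: replace with a constant agreed with a domain expert
imputed["city"] = imputed["city"].fillna("Unknown")

print(dropped)
print(imputed)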
A data scientist can use several techniques to identify outliers and decide if they are
errors or novelties.
Numeric outlier
This is the simplest nonparametric technique and applies to one-dimensional data. The data is divided into quartiles and the interquartile range (IQR = Q3 − Q1) is computed. The range limits are then set as the upper and lower whiskers of a box plot, typically Q1 − 1.5·IQR and Q3 + 1.5·IQR. The data that falls outside those limits can be removed.
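A minimal sketch of this quartile-based check, assuming a small made-up one-dimensional sample and the usual 1.5 × IQR whisker rule:

import pandas as pd

# Hypothetical one-dimensional sample with an obvious outlier
s = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Whisker limits of a box plot: 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
cleaned = s[(s >= lower) & (s <= upper)]
print(outliers.tolist())  # [95]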
Z-score
This parametric technique indicates how many standard deviations a given data point lies from the sample mean. It assumes a Gaussian distribution (a normal, bell-shaped curve). If the data is not normally distributed, it can first be transformed by scaling to give it a more normal appearance. The z-score of each data point is then calculated as z = (x − μ) / σ, and a cut-off threshold is chosen using a heuristic (rule of thumb), commonly 2 or 3 standard deviations. Data points that lie beyond that threshold are classified as outliers and removed. The z-score is a simple, powerful way to remove outliers, but it is only useful with small to medium data sets and cannot be used when the data does not follow a parametric (approximately normal) distribution.
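A short sketch of z-score filtering on a made-up sample; the cut-off of 2 standard deviations is just one common rule of thumb:

import numpy as np

data = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 25.0])  # 25.0 looks suspicious

# z-score of each point relative to the sample mean and standard deviation
z = (data - data.mean()) / data.std()

# Rule-of-thumb cut-off; 2 or 3 standard deviations are common choices
threshold = 2
cleaned = data[np.abs(z) < threshold]
print(cleaned)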
DBSCAN
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clusters the data based on the density of points, grouping related points together. DBSCAN labels the data as core points, border points, and outliers. Core points lie in dense regions of the data, border points have enough nearby neighbors to be considered part of a cluster, and outliers belong to no cluster at all and can be disregarded from the data.
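A minimal sketch using scikit-learn's DBSCAN on synthetic 2-D data; the eps and min_samples values are assumptions that would need tuning on real data:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Synthetic data: one dense blob plus a few far-away points
blob = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
noise = np.array([[8.0, 8.0], [-7.0, 9.0], [10.0, -6.0]])
X = np.vstack([blob, noise])

# eps and min_samples control what counts as a dense neighborhood
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

# DBSCAN marks points that belong to no cluster with the label -1
outliers = X[labels == -1]
print(outliers)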
Isolation forest
This method is effective for finding both novelties and outliers. It builds binary decision trees using randomly selected features and random split values. The trees together form a forest, and the path lengths needed to isolate each point are averaged. From these, an outlier score between 0 and 1 is calculated for each data point, where values close to 0 indicate normal points and values close to 1 indicate outliers.
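A brief sketch with scikit-learn's IsolationForest on synthetic data; note that scikit-learn reports anomalies through predict (-1 for outliers) and decision_function (lower is more anomalous) rather than a 0-to-1 score, and the contamination value is an assumption:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),  # normal points
               [[6, 6], [-7, 5]]])               # injected anomalies

# contamination is an assumed share of outliers in the data
iso = IsolationForest(contamination=0.01, random_state=42).fit(X)

pred = iso.predict(X)               # +1 = normal, -1 = outlier
scores = iso.decision_function(X)   # lower scores indicate more anomalous points
print(X[pred == -1])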
We can use the box plot, or the box and whisker plot, to explore the dataset and
visualize the presence of outliers. The points that lie beyond the whiskers are
detected as outliers.
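A minimal plotting sketch, reusing the made-up sample from above:

import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 95])

# Points drawn beyond the whiskers (here, 95) are the detected outliers
plt.boxplot(s)
plt.title("Box plot for outlier inspection")
plt.show()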
Handling Outliers
a. Removing Outliers
i) Listwise deletion: Remove entire rows that contain outliers.
ii) Trimming: Remove a fixed percentage (e.g., the top and bottom 1% or 5%) of the most extreme values.
b. Transforming Outliers (illustrated in the sketch after this list)
i) Winsorization: Cap or replace outliers with values at a specified percentile.
ii) Log Transformation: Apply a log transformation to reduce the impact of extreme values.
c. Imputation
Impute outliers with a value derived from statistical measures (mean, median) or more advanced imputation methods.
d. Treating as Anomaly: Treat outliers as anomalies and analyze them
separately. This is common in fraud detection or network security.
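A short sketch of winsorization and a log transformation on a made-up series; the 5th/95th percentile caps are an assumption:

import numpy as np
import pandas as pd

s = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 95])

# Winsorization: cap values at the 5th and 95th percentiles
low, high = s.quantile(0.05), s.quantile(0.95)
winsorized = s.clip(lower=low, upper=high)

# Log transformation: compress the scale so extreme values have less impact
logged = np.log1p(s)  # log1p handles zero values safely

print(winsorized.tolist())
print(logged.round(2).tolist())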
Data Transformation:
Min-Max Scaling:
The objective of Min-Max scaling is to rescale the values of a column to a fixed range, usually [0, 1] or [-1, 1], using x' = (x − min) / (max − min). A drawback of bounding the data to this small, fixed range is that we end up with smaller standard deviations, which suppresses the weight of outliers in our data.
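A minimal sketch using scikit-learn's MinMaxScaler on made-up values:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])

# Scale to the default [0, 1] range; feature_range=(-1, 1) is also possible
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())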
Standardization (Z-Score Normalization):
Standardization is used to compare features that have different units or scales. Each value is rescaled by subtracting the column mean (μ) and dividing by the standard deviation (σ): z = (x − μ) / σ. This transforms your data so that the resulting distribution has a mean of 0 and a standard deviation of 1. Standardization is preferred when there are important outliers in our data and we don't want to remove them and lose their impact, since it does not bound the values to a fixed range.
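A minimal sketch using scikit-learn's StandardScaler on a small made-up height/weight table:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical columns: height (cm) and weight (kg)
X = np.array([[170.0, 65.0], [180.0, 85.0], [160.0, 55.0], [175.0, 72.0]])

# z = (x - mean) / std applied column by column
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
print(X_std.mean(axis=0), X_std.std(axis=0))  # approximately 0 and 1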
2. Normalization
Note: The main difference between normalizing and scaling is that in normalization you are changing the shape of the distribution, while in scaling you are changing the range of your data. Normalizing is a useful method when you know the distribution is not Gaussian: it adjusts the values of your numeric data toward a common, more Gaussian-like distribution, whereas scaling shrinks or stretches the data to fit within a specific range.
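One way to illustrate this difference is to compare a range-changing scaler with a shape-changing transform; the sketch below uses scikit-learn's MinMaxScaler and a Yeo-Johnson PowerTransformer on synthetic skewed data (the choice of transform is an assumption, not the only way to normalize):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=(500, 1))  # clearly non-Gaussian data

# Scaling: changes the range, keeps the shape of the distribution
scaled = MinMaxScaler().fit_transform(skewed)

# Normalizing (here via a Yeo-Johnson power transform): reshapes the distribution
normalized = PowerTransformer(method="yeo-johnson").fit_transform(skewed)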
Encoding Categorical Variables:
1. Label Encoding:
Label encoding assigns a unique integer to each category of a categorical variable. For example, a Height column can be encoded as follows:

Height      Height (encoded)
Tall        0
Medium      1
Short       2
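A minimal sketch reproducing this mapping with pandas; the explicit dictionary is chosen to match the table above, while scikit-learn's LabelEncoder would assign the integers in alphabetical order:

import pandas as pd

df = pd.DataFrame({"Height": ["Tall", "Medium", "Short", "Medium", "Tall"]})

# Explicit mapping reproducing the table above
mapping = {"Tall": 0, "Medium": 1, "Short": 2}
df["Height_encoded"] = df["Height"].map(mapping)
print(df)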
2. One-Hot Encoding:
One-hot encoding creates a new binary (0/1) column for each category of the variable.
Advantages:
• It allows the use of categorical variables in models that require numerical input.
• It can improve model performance by providing more information to the model about the categorical variable.
• It can help to avoid the problem of ordinality, which can occur when a categorical variable has a natural ordering (e.g. “small”, “medium”, “large”).
Consider the following data, where the fruit column has been label encoded and each row also has a price:

fruit     label     price
apple     1         5
mango     2         10
apple     1         15
orange    3         20
The output after applying one-hot encoding on the data is given as follows,
apple   mango   orange   price
1       0       0        5
0       1       0        10
1       0       0        15
0       0       1        20
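A minimal sketch producing this kind of output with pandas get_dummies (scikit-learn's OneHotEncoder is an alternative):

import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "mango", "apple", "orange"],
                   "price": [5, 10, 15, 20]})

# get_dummies creates one binary column per category
one_hot = pd.get_dummies(df, columns=["fruit"], dtype=int)
print(one_hot)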
3. Binary Encoding:
Binary encoding combines elements of label encoding and one-hot encoding. It first
assigns unique integer labels to each category and then represents these labels in
binary form. It’s especially useful when we have many categories, reducing the
dimensionality compared to one hot encoding.
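A small sketch of binary encoding done by hand with pandas and integer bit operations; the fruit values are made up, and libraries such as category_encoders provide a ready-made BinaryEncoder for the same purpose:

import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "mango", "apple", "orange", "banana"]})

# Step 1: assign an integer label to each category (1-based here)
codes = (df["fruit"].astype("category").cat.codes + 1).to_numpy()

# Step 2: write each integer in binary and split the bits into separate columns
n_bits = int(codes.max()).bit_length()
for i in range(n_bits):
    # Most significant bit first
    df[f"fruit_bin_{i}"] = (codes >> (n_bits - 1 - i)) & 1

print(df)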
Feature Selection
A feature is an attribute that has an impact on a problem or is useful for the problem,
and choosing the important features for the model is known as feature selection.
Each machine learning process depends on feature engineering, which mainly
contains two processes, which are Feature Selection and Feature Extraction.
Although feature selection and extraction processes may have the same objective,
both are completely different from each other. The main difference between them is
that feature selection is about selecting the subset of the original feature set, whereas
feature extraction creates new features.
Feature selection is a way of reducing the input variable for the model by using only
relevant data to reduce overfitting in the model.
We can define feature Selection as, "It is a process of automatically or manually
selecting the subset of most appropriate and relevant features to be used in model
building." Feature selection is performed by either including the important features
or excluding the irrelevant features in the dataset without changing them.
1. Wrapper Methods
In wrapper methodology, selection of features is done by considering it as a search
problem, in which different combinations are made, evaluated, and compared with
other combinations. It trains the algorithm by using the subset of features iteratively.
Based on the output of the model, features are added or removed, and the model is trained again with the new feature set.
Some techniques of wrapper methods are:
o Forward Selection
o Backward Elimination
o Recursive Feature Elimination
2. Filter Methods
In filter methodology, features are selected on the basis of statistical measures computed independently of any machine learning algorithm. Some techniques of filter methods are:
o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio
3. Embedded Methods
Embedded methods combine the qualities of filter and wrapper methods by performing feature selection during model training. These methods are also iterative: each iteration is evaluated to find the features that contribute the most to training in that iteration. Some techniques of embedded methods are:
o Regularization (e.g., LASSO)
o Tree-based (Random Forest) feature importance
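As a rough illustration of a filter technique and a wrapper technique side by side, the sketch below applies SelectKBest with the chi-square test and Recursive Feature Elimination to scikit-learn's built-in iris data; the choice of k = 2 features is arbitrary:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True, as_frame=True)

# Filter method: score features with a statistical test (chi-square)
filter_sel = SelectKBest(score_func=chi2, k=2).fit(X, y)
print(X.columns[filter_sel.get_support()].tolist())

# Wrapper method: recursive feature elimination around an estimator
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print(X.columns[wrapper_sel.get_support()].tolist())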
Data Merging
Data merging combines multiple datasets into a single dataset, typically by joining tables on common columns. The main types of joins are:
• Inner Join: Uses a comparison operator to match rows from two tables based on the values in common columns from each table.
• Left Join / Left Outer Join: Returns all the rows from the left table specified in the left outer join clause, not just the rows in which the columns match.
• Right Join / Right Outer Join: Returns all the rows from the right table specified in the right outer join clause, not just the rows in which the columns match.
• Full Outer Join: Returns all the rows in both the left and right tables.
• Cross Join (Cartesian Join): Returns all possible combinations of rows from the two tables.
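A minimal sketch of these join types with pandas merge on two made-up tables (the cross join requires pandas 1.2 or newer):

import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Asha", "Ravi", "Meera"]})
salaries = pd.DataFrame({"emp_id": [2, 3, 4], "salary": [50000, 60000, 55000]})

inner = pd.merge(employees, salaries, on="emp_id", how="inner")  # matching rows only
left = pd.merge(employees, salaries, on="emp_id", how="left")    # all rows from the left table
right = pd.merge(employees, salaries, on="emp_id", how="right")  # all rows from the right table
full = pd.merge(employees, salaries, on="emp_id", how="outer")   # all rows from both tables
cross = pd.merge(employees, salaries, how="cross")               # cartesian product

print(full)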
Chapter Ends…