Lecture # 13: Data Transformation Techniques

The document discusses various data transformation techniques essential for preparing data for analysis, including feature encoding methods like One-Hot, Label, and Binary Encoding. It also covers normalization techniques such as Min-Max Scaling, Z-Score Normalization, and Robust Scaling, highlighting their applications and best practices. Challenges in data transformation and the importance of maintaining data integrity and consistency are also addressed.

Data Transformation and Feature Scaling


Data Science
Feature Encoding - Overview
• Feature encoding is the process of converting categorical data
into a numerical format. It is necessary because most machine
learning algorithms can only operate on numerical values.
One-Hot Encoding
• One-Hot Encoding creates new columns indicating the
presence of each possible value from the original data. It's
ideal for categorical variables where no ordinal relationship
exists.
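As a concrete illustration, here is a minimal sketch of One-Hot Encoding with pandas; the "color" column and its values are invented for the example:

# One-Hot Encoding with pandas (illustrative data).
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# get_dummies creates one indicator column per unique category:
# color_blue, color_green, color_red.
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)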
Label Encoding
• Label Encoding converts each value in a column to a number.
It is useful for ordinal data, i.e., categories that have a
meaningful order (e.g., low < medium < high).
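A minimal sketch of label encoding for an ordinal feature; the "size" column and its ordering are invented for the example. An explicit mapping is used so the codes respect the intended order (scikit-learn's LabelEncoder would assign codes alphabetically instead):

# Label encoding an ordinal column with an explicit mapping (illustrative data).
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# The mapping preserves the intended order small < medium < large.
order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(order)
print(df)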
Binary Encoding
• Binary Encoding first assigns each category an integer code and
then represents that code in binary. Each binary digit becomes one
feature column, so n unique categories need only about ⌈log2(n)⌉
columns, making it efficient for high-cardinality features.
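A minimal sketch of binary encoding built from pandas and numpy primitives; the "city" column is invented for the example (the third-party category_encoders library offers a ready-made BinaryEncoder based on the same idea):

# Binary encoding by hand (illustrative data).
import pandas as pd

df = pd.DataFrame({"city": ["lahore", "karachi", "multan", "lahore"]})

# Step 1: assign each category an integer code (0, 1, 2, ...).
codes = df["city"].astype("category").cat.codes.to_numpy()

# Step 2: write each code in binary, one new column per bit.
n_bits = max(int(codes.max()).bit_length(), 1)
for bit in range(n_bits):
    df[f"city_bit{bit}"] = (codes >> bit) & 1
print(df)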
Handling Mixed Data Types
• Handling mixed data types involves applying different
transformations to different types of data within the same
dataset. Techniques include using different scalers or encoders
based on the nature of the data.
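A minimal sketch of per-column transformations using scikit-learn's ColumnTransformer; the column names and values are invented for the example:

# Different transformations for numeric and categorical columns (illustrative data).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [40000, 55000, 82000],
    "city": ["lahore", "karachi", "lahore"],
})

# Scale the numeric columns and one-hot encode the categorical one.
# (sparse_output=False requires scikit-learn >= 1.2.)
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(sparse_output=False), ["city"]),
])
print(preprocessor.fit_transform(df))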
Challenges in Data Transformation
• Challenges include maintaining data integrity, handling the
transformation of mixed data types, dealing with outliers, and
scaling to large datasets.
Best Practices in Data Transformation
• Ensure consistency in the methods applied, document
transformations clearly, test transformations on a small sample
before applying them to the full dataset, and always keep a copy
of the raw data.
Conclusion
• Data transformation is a foundational step in preparing data
for analysis. It helps improve the quality and efficiency of data
analysis, leading to more accurate insights.
Introduction to Data Transformation
• Data transformation involves converting data from one format
or structure into another. This process is crucial for data
cleaning, data integration, and preparation of data for
analysis. It enhances the data quality and makes it suitable for
specific analytical or operational purposes.
Why Transform Data?
• Data transformation is essential for several reasons: it
normalizes scale, converts data types, handles missing values,
and prepares data for analysis, ensuring that the results of
machine learning models are accurate and reliable. It's crucial
for comparing data that originates from different sources.
Normalization - Concept
• Normalization adjusts the scale of data without distorting
differences in the ranges of values. It brings all the data into a
specific range, usually 0 to 1, making it easier to process by
analytical methods.
Min-Max Scaling
• Min-Max Scaling is a simple method where values are shifted
and rescaled so that they end up ranging from 0 to 1.
• Formula: X' = (X − X_min) / (X_max − X_min)
• It is useful for algorithms that assume data is on the same
scale.
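A minimal sketch of Min-Max Scaling with scikit-learn, using invented sample values:

# Min-Max Scaling: maps each feature to the range [0, 1].
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[170.0], [180.0], [160.0]])

scaler = MinMaxScaler()
print(scaler.fit_transform(X))   # [[0.5], [1.0], [0.0]]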
Z-Score Normalization (Standardization)
• Z-Score Normalization involves rescaling data to have a mean
(μ) of 0 and a standard deviation (σ) of 1. Formula: (X - μ) / σ.
This method standardizes the range of independent variables
or features of data.
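A minimal sketch of Z-Score Normalization with scikit-learn's StandardScaler, using the same invented sample values:

# Z-Score Normalization: rescales each feature to mean 0, std 1.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[170.0], [180.0], [160.0]])

scaler = StandardScaler()
print(scaler.fit_transform(X))
# The mean is 170 and the (population) std is about 8.165,
# so the output is roughly [0.0, 1.225, -1.225].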
Mean Normalization
• Mean Normalization is a technique used to scale data so that
each feature has a mean of zero and falls within a bounded range.
It shifts the data to center it around the mean and then scales it
by the range (the difference between the maximum and minimum
values).
• Formula: X' = (X − μ) / (X_max − X_min)
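scikit-learn has no dedicated transformer for Mean Normalization, so here is a minimal sketch with numpy, using invented sample values:

# Mean Normalization: X' = (X - mean) / (max - min).
import numpy as np

X = np.array([170.0, 180.0, 160.0])

X_scaled = (X - X.mean()) / (X.max() - X.min())
print(X_scaled)   # [0.0, 0.5, -0.5]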
Mean Normalization
When do we use Mean Normalization?
• When we need the data to be centered around zero.
• When the features have different units and we want to make
them comparable.
• When the dataset includes both positive and negative values
(it can be useful for algorithms like k-nearest neighbors or
linear regression).
Mean Normalization - Example

Sample | Height (cm) | Weight (kg)
1      | 170         | 70
2      | 180         | 80
3      | 160         | 60
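Applying the formula to this data as a worked example: for Height, the mean is 170 and the range is 180 − 160 = 20, so the scaled values are (170 − 170)/20 = 0, (180 − 170)/20 = 0.5, and (160 − 170)/20 = −0.5. Weight scales to the same values, since its mean is 70 and its range is also 20.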
What is MaxAbs Scaling?
• MaxAbs Scaling is another technique for scaling data where
each feature is divided by its maximum absolute value, making
the values fall within the range [-1, 1].

When to use MaxAbs Scaling?
• When your data is sparse (i.e., many zero values).
• When your features already have a positive and negative
range, but you want to scale them into the range [−1,1]
without shifting them.
• When you don’t want to change the sign of your data.
Example
Using the Height data above, the maximum absolute value is 180, so
each value is divided by 180:

Sample | Height (Scaled)
1      | 0.944
2      | 1.000
3      | 0.889
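A minimal sketch that reproduces the table above with scikit-learn's MaxAbsScaler:

# MaxAbs Scaling: divides each feature by its maximum absolute value.
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[170.0], [180.0], [160.0]])

scaler = MaxAbsScaler()
print(scaler.fit_transform(X))   # [[0.944...], [1.0], [0.888...]]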
Robust Scaling
• Robust Scaling uses the median and the interquartile range (IQR)
for scaling.
• Formula: X' = (X − median) / IQR, where IQR = Q3 − Q1.
• It is beneficial for datasets with outliers, because the median
and IQR are largely insensitive to extreme values.
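A minimal sketch of Robust Scaling with scikit-learn's RobustScaler; the sample values, including the deliberate outlier, are invented for the example:

# Robust Scaling: centers on the median and scales by the IQR,
# so the outlier (500) barely affects the other scaled values.
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[160.0], [170.0], [180.0], [500.0]])

scaler = RobustScaler()
print(scaler.fit_transform(X))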
