
Presenter’s Name

Dr. Shital Bhatt


Associate Professor
School of Computational and Data Sciences

www.vidyashilpuniversity.com
EDA Techniques: Graphical
and Non-Graphical,
Univariate, Multivariate
 1. Graphical EDA Techniques
 Graphical EDA techniques involve visualizing the data using various charts and
plots. These techniques help in understanding the distribution, trends,
relationships, and outliers in the data.
 a. Univariate Graphical EDA
 Univariate analysis involves analyzing one variable at a time. The primary goal is
to understand the distribution and central tendency of the data.
• Histogram
• Purpose: To show the frequency distribution of a continuous variable.
• Example: A histogram of the ages of a group of people.
• Box Plot
• Purpose: To show the distribution of data based on five summary statistics:
minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
• Example: A box plot showing the distribution of salaries in a company.
• Density Plot
• Purpose: To estimate the probability density function of a continuous variable.
• Example: A density plot showing the distribution of test scores for students.
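A minimal sketch of these three univariate plots with pandas, Matplotlib, and Seaborn, using a hypothetical `age` column (not data from the slides):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data: ages of 500 people
df = pd.DataFrame({"age": np.random.default_rng(0).normal(35, 10, 500).clip(0, 90)})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(df["age"], bins=20)        # histogram: frequency distribution
axes[0].set_title("Histogram")
axes[1].boxplot(df["age"])              # box plot: five-number summary + outliers
axes[1].set_title("Box plot")
sns.kdeplot(df["age"], ax=axes[2])      # density plot: estimated probability density
axes[2].set_title("Density plot")
plt.tight_layout()
plt.show()
```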
 b. Multivariate Graphical EDA
 Multivariate analysis involves analyzing two or more variables
simultaneously to identify relationships and patterns.
• Scatter Plot
• Purpose: To examine the relationship between two continuous variables.
• Example: A scatter plot showing the relationship between advertising spend and
sales revenue.
• Pair Plot
• Purpose: To visualize the pairwise relationships between several continuous
variables.
• Example: A pair plot showing the relationships among variables like height,
weight, and age.
• Heatmap
• Purpose: To visualize correlations between variables using colors.
• Example: A heatmap showing the correlation between different financial metrics
like revenue, profit, and expenses.
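A sketch of the three multivariate plots, assuming a hypothetical DataFrame with `ad_spend`, `sales`, and `visits` columns:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
spend = rng.uniform(1, 100, 200)
df = pd.DataFrame({
    "ad_spend": spend,
    "sales": 5 * spend + rng.normal(0, 50, 200),   # correlated with spend
    "visits": 2 * spend + rng.normal(0, 20, 200),
})

df.plot.scatter(x="ad_spend", y="sales")             # scatter plot of two variables
sns.pairplot(df)                                      # pairwise relationships
plt.figure()
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")   # correlation heatmap
plt.show()
```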
 Non-Graphical EDA Techniques
 Non-Graphical EDA techniques involve analyzing the data using numerical
summaries and statistical methods, without the use of visualizations.
 a. Univariate Non-Graphical EDA
• Summary Statistics
• Purpose: To provide a numerical summary of a single variable.
• Examples:
• Mean: The average value of the data (e.g., the average income of a population).
• Median: The middle value of the data when sorted (e.g., the median house price).
• Mode: The most frequent value in the data (e.g., the most common age in a group).
• Standard Deviation: Measures the spread of the data (e.g., the standard deviation of
test scores).

• Frequency Distribution
• Purpose: To show how frequently each value occurs in the dataset.
• Example: A table showing the frequency of different blood types in a population.
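A short pandas sketch of these univariate summaries, with hypothetical `income` and `blood_type` columns:

```python
import pandas as pd

df = pd.DataFrame({
    "income": [42000, 55000, 61000, 48000, 52000, 250000],
    "blood_type": ["O", "A", "O", "B", "AB", "O"],
})

print(df["income"].mean())      # mean
print(df["income"].median())    # median (robust to the 250000 outlier)
print(df["income"].mode())      # mode (may return several values)
print(df["income"].std())       # standard deviation
print(df["blood_type"].value_counts())   # frequency distribution of a categorical column
```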
 Multivariate Non-Graphical EDA
• Correlation Coefficient
• Purpose: To quantify the relationship between two continuous variables.
• Example: Calculating the correlation between study hours and exam scores.
• Explanation: A correlation coefficient close to +1 indicates a strong
positive relationship, while a coefficient close to -1 indicates a strong
negative relationship.
• Cross-Tabulation (Contingency Table)
• Purpose: To examine the relationship between two categorical variables.
• Example: A table showing the relationship between gender and voting
preference.
• Explanation: The table would display counts or percentages, helping to
identify any patterns or associations between the variables.
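A sketch of both techniques in pandas, assuming small hypothetical columns:

```python
import pandas as pd

df = pd.DataFrame({
    "study_hours": [2, 4, 6, 8, 10, 12],
    "exam_score":  [55, 60, 68, 75, 83, 90],
    "gender":      ["F", "M", "F", "M", "F", "M"],
    "preference":  ["A", "A", "B", "A", "B", "B"],
})

# Pearson correlation coefficient between two continuous variables (close to +1 here)
print(df["study_hours"].corr(df["exam_score"]))

# Contingency table (cross-tabulation) of two categorical variables
print(pd.crosstab(df["gender"], df["preference"]))
```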
Data Pre-Processing –
Numeric and Non-Numeric
Numeric Data Pre-Processing
 Numeric data pre-processing is a fundamental step in data analysis and machine
learning. It involves cleaning, transforming, and organizing numerical
data to ensure it is suitable for analysis or modeling.
 Why Pre-Process Numeric Data?
 Raw data is often messy and inconsistent, containing missing values,
outliers, and features with varying scales. If left untreated, these issues
can lead to inaccurate models, poor predictions, and misleading
insights.
 Pre-processing helps to:
• Standardize the data, making it easier to compare and analyze.
• Normalize the data to ensure that all features contribute equally to the
analysis.
• Handle missing values appropriately to avoid biased results.
• Detect and manage outliers that can distort statistical analyses.
• Improve model performance by preparing the data for machine learning
algorithms.
Key Techniques in Numeric Data Pre-
Processing
a. Handling Missing Values
 Why It Matters: Missing data can lead to biased estimates, reduced
statistical power, and invalid conclusions. It’s essential to address
missing values before proceeding with any analysis.
 Common Strategies:
• Imputation: Replace missing values with statistical measures (mean,
median, or mode).
• Mean Imputation: Suitable for numerical data that is symmetrically
distributed.
• Median Imputation: Useful when the data is skewed.
• Mode Imputation: Typically used for categorical data but can be applied to
numerical data in some cases.
• Dropping: Remove rows or columns with missing values.
• This is practical when the missing data is minimal and doesn’t significantly
affect the dataset's size.
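A minimal pandas sketch of imputation and dropping, using a hypothetical `salary` column with gaps:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [40000, 52000, np.nan, 61000, np.nan, 300000]})

mean_filled   = df["salary"].fillna(df["salary"].mean())    # mean imputation (symmetric data)
median_filled = df["salary"].fillna(df["salary"].median())  # median imputation (skewed data)
dropped       = df.dropna()                                 # drop rows with missing values

print(mean_filled.tolist())
print(median_filled.tolist())
print(len(dropped))
```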
 Normalization
 Why It Matters: Normalization is crucial when the features in the
dataset have different scales. For example, if one feature ranges
between 0 and 1000, and another between 0 and 1, the feature with the
larger scale can dominate the model training, leading to biased
outcomes.
 Normalization scales the numeric data to a specific range, typically [0,
1]. This is particularly useful when different features have different
ranges and you want to bring them to a common scale.

 Common Approaches:
• Min-Max Normalization: Rescales the data to a fixed range, usually [0,
1].
 Standardization
 Why It Matters: Standardization is essential when you expect that your
data should follow a normal distribution. It transforms the data to have a
mean of 0 and a standard deviation of 1, which helps in comparing
features with different units and scales.
 Common Approach: z-score scaling, where each value is transformed as z = (x − mean) / standard deviation; see the sketch below.
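A sketch of min-max normalization and z-score standardization with scikit-learn, on a hypothetical single-column array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[2.0], [4.0], [6.0], [8.0], [1000.0]])  # hypothetical feature values

# Min-max normalization: x' = (x - min) / (max - min), result in [0, 1]
print(MinMaxScaler().fit_transform(x).ravel())

# Standardization: z = (x - mean) / std, result has mean 0 and std 1
print(StandardScaler().fit_transform(x).ravel())
```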
 Outlier Detection and Handling
 Why It Matters: Outliers can skew data analysis, lead to inaccurate
predictions, and affect the model's performance. Identifying and
addressing outliers ensures that the analysis is robust and reliable.
 Common Strategies: the IQR rule (flagging points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR), z-score thresholds, and capping or removing the flagged values (detection and treatment are covered in detail in a later section).
 Binning
 Why It Matters: Binning converts continuous data into categorical data
by dividing it into intervals or "bins." This can reduce the impact of noise
and allow for better data interpretation, especially when dealing with
large datasets.
 Common Approaches:
• Equal-width Binning: Divides the data into bins of equal size.
• Equal-frequency Binning: Each bin contains the same number of data
points.
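A sketch of both binning approaches with pandas, on a hypothetical list of ages:

```python
import pandas as pd

ages = pd.Series([5, 17, 23, 31, 39, 44, 58, 63, 71, 88])

equal_width = pd.cut(ages, bins=4)    # equal-width bins (same range per bin)
equal_freq  = pd.qcut(ages, q=4)      # equal-frequency bins (same count per bin)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```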
 Numeric data pre-processing is a crucial step that involves transforming
raw numerical data into a clean, consistent format that can be effectively
used in data analysis and machine learning. The primary techniques
include:
• Handling Missing Values: Ensures that no data is lost or biased due
to gaps in the dataset.
• Normalization and Standardization: Rescale data to make features
comparable and improve model performance.
• Outlier Detection and Handling: Identifies and mitigates the impact of
extreme values that could skew results.
• Binning: Simplifies continuous data by grouping it into categories,
making it more interpretable.
 Visual Representation
• Histograms can be used to show the distribution of data before and after
normalization or standardization.
• Boxplots are ideal for visualizing outliers.
• Bar charts can represent the distribution of data after binning.
Non-Numeric Data Pre-Processing
 Non-numeric data, also known as categorical or qualitative data, represents
variables that can be divided into different categories but do not have inherent
numerical meaning. Examples include gender, occupation, product type, or
review text. Pre-processing non-numeric data is crucial for preparing it for
machine learning models, as most algorithms require numerical input.
 Why Pre-Process Non-Numeric Data?
 Non-numeric data often contains valuable information that can enhance
the predictive power of models. However, because most machine learning
algorithms work with numerical data, non-numeric data must be converted or
transformed into a numerical format. Pre-processing also involves handling
inconsistencies, encoding categories, and ensuring that the data is in
a format suitable for analysis.
Key Techniques in Non-Numeric Data
Pre-Processing
 Handling Missing Values
 Why It Matters: Just like in numeric data, missing values in non-numeric
data can lead to biased results and poor model performance.
Addressing missing values is crucial for maintaining data integrity.
 Common Strategies:
• Imputation: Replace missing categorical values with the most frequent
category (mode) or a new category (e.g., "Unknown").
• Dropping: If a significant amount of data is missing, you might choose to
remove those rows or columns entirely.
 Encoding Categorical Variables
 Why It Matters: Machine learning algorithms require numerical input,
so categorical variables must be converted into a numerical format. This
process is called encoding.
 Common Encoding Techniques:
• Label Encoding: Assigns a unique integer to each category.
• Example: Categories "Red", "Blue", and "Green" might be encoded as 0, 1,
and 2, respectively.
• One-Hot Encoding: Creates a new binary variable for each category.
Each category is represented by a binary vector.
• Example: A "Color" variable with categories "Red", "Blue", and "Green" would
be split into three binary columns: "Color_Red", "Color_Blue", and
"Color_Green".
 Handling Text Data (Text Pre-Processing)
 Why It Matters: Text data, such as customer reviews or comments, is
unstructured and must be processed before it can be used in
models. Text pre-processing transforms raw text into a format that can be
analyzed.
 Common Steps in Text Pre-Processing:
• Tokenization: Splitting text into individual words or tokens.
• Lowercasing: Converting all text to lowercase to ensure uniformity.
• Removing Stop Words: Removing common words that do not
contribute much meaning (e.g., "and", "the").
• Stemming/Lemmatization: Reducing words to their base or root form
(e.g., "running" to "run").
• Vectorization: Converting text into numerical format using techniques
like Bag of Words (BoW), Term Frequency-Inverse Document Frequency
(TF-IDF), or Word Embeddings.
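A minimal TF-IDF sketch with scikit-learn and two hypothetical reviews; the vectorizer handles tokenization, lowercasing, and stop-word removal, while stemming or lemmatization would need an extra library such as NLTK or spaCy:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "The product is great and the delivery was fast",
    "Terrible product, the delivery was slow",
]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(reviews)     # sparse TF-IDF matrix (documents x terms)

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```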
 Handling Ordinal Data
 Why It Matters: Ordinal data represents categories with a meaningful
order but no consistent difference between them. Examples include
rating scales (e.g., "Poor", "Average", "Good") or education levels (e.g.,
"High School", "Bachelor's", "Master's").
 Common Strategy:
• Ordinal Encoding: Assigns integers to categories based on their order.
• Example: "Poor", "Average", "Good" might be encoded as 1, 2, and 3.
 Feature Engineering for Non-Numeric Data
 Why It Matters: Feature engineering involves creating new features
from existing non-numeric data to improve model performance.
For text data, this might involve creating features like the length of a
review or the presence of specific keywords.
 Common Techniques:
• Creating Binary Features: From categorical data (e.g., a binary
column indicating whether a product is 'High Risk').
• Extracting Features from Text: Such as word count, presence of
certain phrases, or sentiment analysis scores.
 Non-numeric data pre-processing is a vital step in preparing qualitative data
for analysis and modeling. It involves various techniques, including:
• Handling Missing Values: Ensures that the data remains unbiased and
complete.
• Encoding Categorical Variables: Converts categories into numerical
format for use in algorithms.
• Text Pre-Processing: Transforms raw text into a format suitable for
analysis.
• Handling Ordinal Data: Ensures that ordered categories are
represented numerically in a way that reflects their order.
• Feature Engineering: Creates new features from non-numeric data to
improve model performance.
 These techniques are crucial for making non-numeric data compatible with
machine learning models, ensuring that all relevant information is captured
and utilized effectively in analysis.
What Is a Missing Value?
 Missing data is defined as values that are not stored (or not present) for some variables in a dataset. For example, in the Titanic dataset, the columns ‘Age’ and ‘Cabin’ contain missing values.
How is a Missing Value Represented in a Dataset?

 Missing values in a dataset can be represented in various ways, depending on the source of the data and the conventions used. Here are some common representations:

 NaN (Not a Number): In many programming languages and data analysis tools, missing
values are represented as NaN. This is the default for libraries like Pandas in Python.
 NULL or None: In databases and some programming languages, missing values are
often represented as NULL or None. For instance, in SQL databases, a missing value is
typically recorded as NULL.
 Empty Strings: Sometimes, missing values are denoted by empty strings (""). This is
common in text-based data or CSV files where a field might be left blank.
 Special Indicators: Datasets might use specific indicators like -999, 9999, or other
unlikely values to signify missing data. This is often seen in older datasets or specific
industries where such conventions were established.
 Blanks or Spaces: In some cases, particularly in fixed-width text files, missing values
might be represented by spaces or blank fields.
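A small sketch showing how pandas can be told which of these conventions mean "missing" when reading a file (the tokens below are hypothetical):

```python
import io
import pandas as pd

raw = io.StringIO("age,income\n25,50000\n,NULL\n40,-999\n")

# Treat NULL, -999, and blank fields as missing values
df = pd.read_csv(raw, na_values=["NULL", -999])
print(df)
print(df.isna().sum())   # missing-value count per column
```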
Why is Data Missing From the
Dataset?
 There can be multiple reasons why certain values are missing from the data. The reason data is missing affects how it should be handled, so it is necessary to understand why the data could be missing.

 Some of the reasons are listed below:

 Past data might get corrupted due to improper maintenance.
 Observations are not recorded for certain fields due to various reasons.
 There might be a failure in recording the values due to human error.
 The user intentionally did not provide the values.
 Item nonresponse: This means the participant refused to respond.
Types of Missing Values
 Understanding the types of missing values in datasets is crucial for
effectively handling missing data and ensuring accurate analyses:
 1. Missing Completely at Random (MCAR)
 Definition:
• MCAR occurs when the likelihood of a data point being missing is
independent of both the observed and unobserved data. In other words,
the missingness is entirely random and unrelated to any variable in the
dataset.
 Implication:
• If data is MCAR, then the missing data does not introduce any bias, and
any analysis conducted on the remaining data will be unbiased.
 Example:
• In a survey, a few respondents might skip a question due to a technical
glitch in the online form. The likelihood of skipping the question is not
related to the respondents' demographics or any other variables,
making the missingness MCAR.
 2. Missing at Random (MAR)
 Definition:
• MAR occurs when the missingness is related to some observed data but not to the
unobserved data. The probability of a data point being missing can be explained by
other variables in the dataset.
 Implication:
• If data is MAR, the missingness can be accounted for using other observed variables,
making it possible to use methods like imputation without introducing bias.
 Example:
• Suppose respondents with lower income are less likely to disclose their income level in
a survey. Here, the missingness of the income data is related to the income level itself
but can be predicted using other variables such as education level or occupation.
 3. Missing Not at Random (MNAR)
 Definition:
• MNAR occurs when the missingness is related to the unobserved data itself. In this
case, the reason for the missing data is directly related to the value that is missing.
 Implication:
• MNAR is the most challenging type of missing data because the missingness is
related to the unobserved data, introducing potential bias that is difficult to correct.
Traditional methods may not be adequate, and more sophisticated techniques may
be required.
 Example:
• In a medical study, patients with more severe symptoms might be less likely to
attend follow-up appointments, leading to missing data on symptom severity. The
missing data is directly related to the unobserved severity of the symptoms.
 Visualizing the Types of Missing Data
 1. MCAR Visualization:
 A plot of missing values where there is no discernible pattern.
 2. MAR Visualization:
 A plot where missing values are correlated with other observed
variables.
 3. MNAR Visualization:
 A plot where the missing values themselves are likely to have a pattern
based on the unobserved data.
How to Handle Missing Data?
 Missing data is a common headache in any field that deals with datasets. It can arise
for various reasons, from human error during data collection to limitations of data
gathering methods. Luckily, there are strategies to address missing data and minimize
its impact on your analysis.
 Deletion: This involves removing rows or columns with missing values. This is a
straightforward method, but it can be problematic if a significant portion of your data is
missing. Discarding too much data can affect the reliability of your conclusions.
 Imputation: This replaces missing values with estimates. There are various imputation
techniques, each with its strengths and weaknesses. Here are some common ones:
 Mean/Median/Mode Imputation: Replace missing entries with the average (mean),
middle value (median), or most frequent value (mode) of the corresponding column.
This is a quick and easy approach, but it can introduce bias if the missing data is not
randomly distributed.
 K-Nearest Neighbors (KNN Imputation): This method finds the closest data points
(neighbors) based on available features and uses their values to estimate the missing
value. KNN is useful when you have a lot of data and the missing values are scattered.
 Model-based Imputation: This involves creating a statistical model to predict the
missing values based on other features in the data. This can be a powerful technique,
but it requires more expertise and can be computationally expensive.
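A sketch of these strategies with scikit-learn imputers, on a tiny hypothetical matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

print(SimpleImputer(strategy="median").fit_transform(X))  # mean/median/mode imputation
print(KNNImputer(n_neighbors=2).fit_transform(X))         # estimate from the nearest rows
```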
Anomalous/Outlier Data: Significance,
Types, Detection, and Treatment
 1. Significance of Outlier Data
 Outliers are data points that significantly differ from other observations in a
dataset. They can be caused by measurement errors, data entry errors, or
genuine variability in the data. Understanding and handling outliers is crucial
because they can:
• Skew Statistical Analysis: Outliers can heavily influence mean, standard
deviation, and other statistical measures, leading to misleading results.
• Indicate Interesting Insights: In some cases, outliers may represent rare
events or phenomena that are of interest, such as fraud detection in financial
transactions.
• Impact Machine Learning Models: Outliers can negatively affect the
performance of machine learning models, particularly those sensitive to data
distributions, like linear regression.
 Types of Outliers
 Outliers can be classified into several types based on their characteristics
and causes:
• Global Outliers: Data points that deviate significantly from the entire
dataset. For example, in a dataset of people's ages, a value of 150 years
would be a global outlier.
• Contextual Outliers: Data points that are considered outliers within a
specific context but may not be outliers in general. For instance, a
temperature of 30°C is normal in summer but might be an outlier in winter.
• Collective Outliers: A group of data points that collectively deviate from
the expected pattern. For example, a sudden drop in stock prices across
multiple companies might represent a collective outlier.
 Detection of Outliers
 There are various methods to detect outliers, depending on the data
type and distribution:
• Visual Methods:
• Box Plot: Visualizes the distribution of data and highlights potential outliers
as points outside the whiskers. It is useful for detecting global outliers.
• Scatter Plot: Helps identify outliers in bivariate data by plotting one variable
against another. Points that lie far from the main cluster may be outliers.
 Machine Learning Methods:
• Isolation Forest: A tree-based algorithm that isolates outliers by
randomly selecting features and splitting values. Outliers are isolated
faster, indicating their anomalous nature.
• One-Class SVM (Support Vector Machine): A model that learns the
boundary around the normal data and identifies points outside this
boundary as outliers.
 Isolation Forest
 Concept: Isolation Forest is a tree-based algorithm that isolates outliers by randomly
selecting features and then splitting their values. The main idea is that outliers are
easier to isolate because they differ significantly from the majority of the data.
 How It Works:
• The algorithm creates an ensemble of decision trees (a forest) by randomly selecting
a feature and a split value at each node.
• Data points that are outliers will likely be isolated (i.e., separated from other points)
closer to the root of the tree, requiring fewer splits.
• The average path length from the root to the leaf for each data point is calculated
across all trees. Points with shorter average path lengths are more likely to be
outliers.
 Example: Imagine you have a dataset of daily temperatures measured in a city over
a year. Most temperatures range between 15°C and 30°C, but there are a few days
with temperatures as low as -5°C or as high as 40°C. These extreme temperatures
are anomalies.
• The Isolation Forest algorithm would create trees that split based on temperature.
Since the extreme temperatures (-5°C, 40°C) are far from the normal range, they
would be isolated in the first few splits of many trees.
• The algorithm would then flag these points as outliers.
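A sketch of the temperature example with scikit-learn's IsolationForest; the data and contamination rate are hypothetical:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
temps = np.concatenate([rng.uniform(15, 30, 360), [-5, 40, 42]]).reshape(-1, 1)

iso = IsolationForest(contamination=0.01, random_state=0).fit(temps)
labels = iso.predict(temps)           # -1 = outlier, 1 = normal

print(temps[labels == -1].ravel())    # the flagged extreme temperatures
```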
 One-Class SVM (Support Vector Machine)
 Concept: One-Class SVM is a type of SVM designed for unsupervised anomaly
detection. It learns a decision boundary that encompasses the normal data
points. Any new point that falls outside this boundary is considered an anomaly.
 How It Works:
• The One-Class SVM tries to find the maximum margin boundary that best
separates the data from the origin (in a transformed feature space).
• It uses a kernel function (commonly the radial basis function, or RBF) to map
the input data into a higher-dimensional space, where it becomes easier to
draw a boundary around the normal data.
• Points that fall outside this boundary are labeled as outliers.
 Example: Consider a dataset of transaction amounts from a bank account.
Most transactions range from $10 to $1000. However, there are a few
transactions of $10,000 or more, which could indicate fraudulent activity.
• The One-Class SVM would learn the boundary around the normal transaction
amounts ($10 to $1000).
• Transactions far beyond this range, like $10,000, would fall outside the
boundary and be classified as anomalies.
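A sketch of the transaction example with scikit-learn's OneClassSVM; in practice the feature would usually be scaled first, and `nu` (the expected outlier fraction) is a hypothetical choice:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
amounts = np.concatenate([rng.uniform(10, 1000, 500), [10000, 25000]]).reshape(-1, 1)

ocsvm = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale").fit(amounts)
labels = ocsvm.predict(amounts)       # -1 = anomaly, 1 = normal

print(amounts[labels == -1].ravel())
```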
 Treatment of Outliers
 Once outliers are detected, the next step is deciding how to handle them:
• Removal: Simply removing outliers may be appropriate if they result from
data entry errors or other anomalies that don't represent the true
data distribution. However, this should be done cautiously to avoid losing
valuable information.
• Transformation: Transforming data, such as applying a logarithmic scale,
can reduce the impact of outliers and bring them closer to the rest of
the data.
• Capping/Flooring: Setting a threshold beyond which data points are capped
or floored. For example, any data point above the 95th percentile could be
capped at the 95th percentile value.
• Imputation: Replacing outliers with a central tendency measure like the
median or mean, especially in cases where outliers are likely due to
errors.
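A sketch of IQR-based detection plus capping/flooring at the 5th and 95th percentiles, on a hypothetical series:

```python
import pandas as pd

s = pd.Series([12, 15, 14, 13, 16, 15, 14, 250])   # 250 is an outlier

# Detection with the IQR rule
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Treatment: cap/floor at the 5th and 95th percentiles
low, high = s.quantile([0.05, 0.95])
capped = s.clip(lower=low, upper=high)

print(outliers.tolist())
print(capped.round(1).tolist())
```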
Data Splitting Strategies in Machine Learning
 Machine learning algorithms learn patterns from data and use them to make
predictions on new, unseen data. It's important to evaluate the performance of our
models on data that they haven't seen during training. This is where data-splitting
strategies come into play.
 Train, Validation, and Test Data
 Before diving into data splitting strategies, let's first define three important terms:
train data, validation data, and test data.
 Train Data: Train data is the data that we use to train our machine learning model. This
data consists of input features and their corresponding output labels. The model
learns patterns from this data and uses them to make predictions on new, unseen
data.
 Validation Data: Validation data is the data that we use to evaluate the performance of
our model during training. We use this data to tune the model's hyperparameters and
prevent overfitting. The validation data should be representative of the unseen data
that the model will encounter in the real world.
 Test Data: Test data is the data that we use to evaluate the final performance of our
model after training and tuning. This data should be completely separate from the
train and validation data to avoid any data leakage.
Random Train-Test Split
 The simplest data splitting strategy is random splitting, where we
randomly divide our dataset into a training set and a test set. This
method is easy to implement and works well when our dataset is
large enough. However, we run the risk of overfitting to our
training set if our dataset is small, or if the test set is not
representative of the data we want to generalize to.
 Random splitting may result in a biased dataset if the dataset is
imbalanced, meaning that it contains more samples from one class than
the others. In such cases, a stratified random splitting strategy is
recommended, which ensures that the proportion of each class is the same
in both the training and test sets.
Stratified Sampling

 Stratified sampling is a data-splitting strategy that ensures that the proportion of each class in the training and test sets is the same
as that in the original dataset. This approach is particularly useful
when dealing with imbalanced datasets, where one class is
significantly more prevalent than the others. Stratified sampling helps to
ensure that the model is trained and evaluated on a representative
sample of the data, and it can improve the model's overall performance.
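A sketch of a plain random split versus a stratified split with scikit-learn, on a hypothetical imbalanced dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Plain random split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: class proportions preserved in train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(y_tr.mean(), y_te.mean())   # minority share is ~0.1 in both sets
```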
Holdout Validation with
Validation Set
Holdout validation with a validation set is a data-splitting strategy that
involves splitting the dataset into a training set, a validation
set, and a test set. We use the training set to fit the model, the
validation set to tune the model's hyperparameters, and the test set to
evaluate the final model. This strategy is useful when we have a
large dataset and want to avoid the computational cost of cross-
validation.
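A sketch of a three-way holdout split (roughly 60/20/20) built from two calls to train_test_split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # hypothetical data

# First carve out the test set, then split the remainder into train/validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0)   # 0.25 of 80% = 20% overall

print(len(X_train), len(X_val), len(X_test))           # about 600 / 200 / 200
```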
Cross-Validation
 Cross-validation is a data splitting strategy that involves splitting the
dataset into k folds, where k is typically 5 or 10. We then train k
models, each using a different fold as the test set and the
remaining folds as the training set. We then average the
performance of the k models to get an estimate of the model's
performance on unseen data. Cross-validation is useful when
our dataset is small or when we want to tune our model's
hyperparameters.
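A sketch of 5-fold cross-validation with scikit-learn, using the built-in iris dataset and a logistic regression model as stand-ins:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # 5 models, each tested on a held-out fold
print(scores, scores.mean())
```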
Leave-One-Out Cross-Validation
 Leave-one-out cross-validation (LOOCV) is a special case of k-fold cross-
validation where k is set to the number of samples in the dataset. In
other words, each sample in the dataset is used once as the validation
set, while the remaining samples are used for training. LOOCV is a very
robust method for estimating the model's performance but can be
computationally expensive, especially for large datasets.
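The same idea with LeaveOneOut, which fits one model per sample (150 fits for iris):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(scores.mean())   # average accuracy over all leave-one-out folds
```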
Data Transformations
 Data transformations are crucial in data preprocessing, especially when
preparing data for analysis or machine learning. These
transformations help in making the data more suitable for various statistical
methods and models by altering the format, structure, or values of the
data.
 What are Data Transformations?
 Data transformations involve applying mathematical functions to data,
converting it into a more usable form. The primary goals are to:
• Normalize or standardize data to bring different features to a common
scale.
• Reduce skewness to make the data distribution more symmetrical.
• Stabilize variance across data points.
• Handle outliers to mitigate their effect on analyses.
 Log Transformation
Purpose: To handle skewed data and to make distributions more normal.
Log transformation is useful when the data is highly skewed, typically with a
long right tail (positive skewness). It compresses the range of data, reducing the
impact of large values.
• How It Works: It compresses the range of values by applying the
logarithm function (e.g., log(x)). It's particularly effective when the data
contains values spanning several orders of magnitude.
• Use Cases: Often used when dealing with financial data (e.g., income or
sales data), biological data (e.g., gene expression levels), or data
with exponential growth patterns.
• Effect: Reduces skewness, helps stabilize variance, and can handle outliers more effectively.
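A tiny sketch of a log transformation on hypothetical right-skewed incomes; log1p (log(1 + x)) is used so zero values are also handled:

```python
import numpy as np

incomes = np.array([20_000, 35_000, 50_000, 80_000, 1_500_000])  # long right tail
print(np.log1p(incomes).round(2))    # compressed range, reduced skewness
```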
 Reciprocal Transformation
 Reciprocal transformation is used for data with large values, reducing
their impact by inverting them.

 Reciprocal Transformed Data: The transformation compresses large values, helping in situations where large values disproportionately affect the analysis.
 Square Root Transformation
• Purpose: To reduce skewness and stabilize variance in data.
• How It Works: Applies the square root function to each value (e.g., √x). The
square root transformation is less aggressive than log transformation and is
used for moderately skewed data.

• Use Cases: Commonly used for count data (e.g., number of occurrences, frequencies).
• Effect: Makes highly skewed data less skewed and improves normality.
 Power Transformation (Box-Cox and Yeo-Johnson)
• Purpose: To make data more closely follow a normal distribution.
• How It Works:
• Box-Cox Transformation: Applicable to positive values only. It finds the
best power (lambda) to which all data should be raised to minimize
skewness.
• Yeo-Johnson Transformation: Can handle both positive and negative
values, similar to Box-Cox but more flexible.
• Effect: Reduces skewness and makes variances more uniform.
 The Box-Cox transformation can handle both skewed and
heteroscedastic data (where variance is not constant).
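A sketch of both power transformations via scikit-learn's PowerTransformer, on small hypothetical arrays:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

X_pos = np.array([[1.0], [2.0], [3.0], [50.0], [400.0]])    # strictly positive, skewed
print(PowerTransformer(method="box-cox").fit_transform(X_pos).ravel().round(2))

X_any = np.array([[-3.0], [0.0], [2.0], [50.0], [400.0]])   # includes zero and negative
print(PowerTransformer(method="yeo-johnson").fit_transform(X_any).ravel().round(2))
```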
 Normalization (Min-Max Scaling)
• Purpose: To rescale data to a specific range, usually [0, 1].
• Effect: Scales the data to a fixed range, helping to eliminate any
influence of scale on model performance.
 Normalization is used to scale data to a fixed range, typically [0, 1]. This
is essential when dealing with features of different units or scales.
 Standardization (Z-score Normalization)
• Purpose: To rescale data so that it has a mean of 0 and a standard
deviation of 1.
• Effect: Ensures that data is centered around the mean with unit
variance, making it easier to compare features.
 Standardization is used to transform data to have a mean of 0 and a
standard deviation of 1. This transformation is particularly important for
algorithms like k-NN and SVM, which are sensitive to the scale of input
features.
 Scaling
 Scaling refers to transforming features so that they are on a similar
scale, which can be crucial for many machine learning algorithms.
 a. Min-Max Scaling
• Definition: Also known as Normalization, it scales each feature to a
fixed range, typically [0, 1].
• Use Cases: Essential for algorithms like K-Nearest Neighbors, Neural
Networks, and any models where distances are calculated (e.g.,
Euclidean distance).
• Effect: Compresses the range of the data while maintaining the relative
relationships between the data points.
 Z-score Standardization
• Definition: Also known as Standardization, this method transforms
the data to have a mean of 0 and a standard deviation of 1.
• Use Cases: Necessary for algorithms that assume normally distributed
data or require the same scale, such as SVMs, PCA, and linear
regression.
• Effect: Centers the data and removes the effect of scale, making
comparisons across features more meaningful.
 Robust Scaling
• Definition: Similar to Z-score standardization, but it uses the median
and the Interquartile Range (IQR) instead of the mean and standard
deviation.
• Use Cases: Useful when dealing with datasets that contain outliers.
Unlike Z-score, it is not heavily affected by outliers.
• Effect: Scales the data based on median and IQR, making it robust to
outliers.
 Max-Abs Scaling
• Definition: Scales the data by dividing each feature by its maximum
absolute value, preserving the sparsity of the data.
• Use Cases: Particularly useful for sparse data and in algorithms that
require normalization without altering the sparsity.
• Effect: Keeps the range within [-1, 1] without breaking the sparsity
structure.
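A sketch comparing the four scalers on the same hypothetical column with one outlier:

```python
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   RobustScaler, StandardScaler)

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # 100 is an outlier

for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler(), MaxAbsScaler()):
    print(type(scaler).__name__, scaler.fit_transform(X).ravel().round(2))
# RobustScaler (median/IQR) is the least distorted by the outlier
```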
 Why Use Data Transformations and Scaling?
1. Algorithm Performance: Many machine learning algorithms, especially distance-
based ones like K-NN, SVM, and neural networks, perform better when features are
on the same scale.
2. Normalization of Distribution: Transformations like log or square root can help
in normalizing the data distribution, which is essential for many statistical models.
3. Handling Outliers: Robust scaling can mitigate the impact of outliers.
4. Convergence in Gradient Descent: Scaling helps in faster convergence for
gradient-based optimization algorithms by preventing any feature from dominating
the learning process.
 Choosing the Right Technique
• Normalization: When you know the data does not follow a Gaussian distribution
or when using algorithms that do not assume any specific distribution.
• Standardization: When the data follows a Gaussian distribution or when using
algorithms that assume normally distributed data.
• Log or Square Root Transformations: When dealing with skewed data or data
that spans several orders of magnitude.
• Robust Scaling: When the dataset contains outliers and you want to minimize
their influence.
Feature Engineering and Feature
Selection
 Feature Engineering and Feature Selection are critical steps in
preparing data for machine learning models. They help improve
model performance by ensuring the most relevant data is used
while reducing noise and complexity.
Feature Engineering
 Feature Engineering is the process of creating new features or modifying
existing ones from raw data to improve the performance of a machine
learning model. This step involves transforming, combining, or
deriving features that make patterns in data more obvious to
the model.
 Why is Feature Engineering Important?
• Enhances the model’s ability to learn.
• Improves the accuracy and predictive power.
• Reduces overfitting by making the data more relevant.
 Steps in Feature Engineering:
1. Domain Knowledge Utilization: Use your understanding of the data and the
problem domain to create new features that might better represent the
problem.
2. Data Transformation: Apply mathematical transformations like log, square
root, or power transformations to make skewed data more normally distributed.
3. Handling Date/Time Data: Extract features from date/time data, such as the
day of the week, hour of the day, month, or season.
4. Encoding Categorical Variables: Convert categorical data into a numerical
format that can be used by machine learning algorithms, using methods like
one-hot encoding or label encoding.
5. Aggregation and Grouping: Summarize information across different
dimensions, such as mean, sum, count, etc., based on groupings.
6. Interaction Features: Create features that represent interactions between
two or more existing features (e.g., product of two features).
7. Polynomial Features: Generate polynomial terms from existing features to
capture non-linear relationships.
8. Text Data Features: For text data, create features using methods like Bag of Words, Term Frequency-Inverse Document Frequency (TF-IDF), or word embeddings.
 Example of Feature Engineering:
 Imagine you are working with a dataset of house prices. It contains
features like "Area in Square Feet," "Number of Bedrooms," "Age of the
House," and "Distance from City Center."
• Transformation: Apply log transformation to "Area in Square Feet" to
normalize it.
• Encoding: Use one-hot encoding for categorical features like "Type of
House" (Apartment, Detached, Semi-Detached).
• Binning: Convert "Age of the House" into bins (0-5 years, 6-10 years,
etc.).
• Feature Creation: Create a new feature "Price per Square Foot" by
dividing the "Price" by "Area in Square Feet."
Feature Selection
 Feature Selection is the process of selecting the most important and
relevant features from your dataset to improve model performance. The
goal is to remove irrelevant or redundant features, which can help
reduce overfitting, improve accuracy, and decrease training time.
 Why is Feature Selection Important?
• Reduces the risk of overfitting.
• Decreases model complexity and improves interpretability.
• Reduces training time and computational cost.
• Improves the model's generalization on unseen data.
 Types of Feature Selection Techniques:
1. Filter Methods: These are based on the characteristics of the data itself and do not involve any machine learning algorithms. Example techniques:
• Correlation Coefficient: Removing features with high correlation to each other.
• Chi-Square Test: Used for categorical variables to measure the association.
• Variance Threshold: Removing features with low variance.

2. Wrapper Methods: These involve using a machine learning model to evaluate the performance of a subset of features. Example techniques:
• Forward Selection: Start with no features, add features one by one, and keep those that improve the model.
• Backward Elimination: Start with all features, remove them one by one, and keep those that contribute most to the model.
• Recursive Feature Elimination (RFE): A recursive process where features are removed to determine the best subset.

3. Embedded Methods: These methods perform feature selection as part of the model training process. Example techniques:
• Lasso Regression (L1 Regularization): Penalizes the absolute size of the coefficients.
• Decision Trees and Random Forests: Use feature importance scores to rank features.
 Example of Feature Selection:
 Continuing with the house price prediction example:
• Filter Method: Use a correlation matrix to remove highly correlated
features like "Number of Bathrooms" if it is highly correlated with
"Number of Bedrooms."
• Wrapper Method: Apply Recursive Feature Elimination (RFE) with a
linear regression model to select the top 5 features.
• Embedded Method: Use a decision tree algorithm and evaluate the
importance scores of each feature to select the most relevant ones.
Feature Selection
 Feature Selection is the process of selecting a subset of the most
relevant features (variables, predictors) from the data that contribute
the most to predicting the target variable. It is a critical step in machine
learning that focuses on improving the model’s performance by reducing
the feature space.
 Why Feature Selection is Important:
1. Improves Model Accuracy:
Irrelevant or redundant features can reduce the predictive power of the model by
introducing noise. Selecting only the most relevant features helps the model focus on the
true patterns in the data.
2. Reduces Overfitting:
Models with too many features can capture noise in the training data rather than
the underlying data patterns. By reducing the number of features, feature selection helps
to mitigate overfitting and improves the model’s ability to generalize to new, unseen data.
3. Reduces Training Time:
Fewer features mean less computational complexity, which translates to faster
training times, particularly important for large datasets and complex models.
4. Enhances Interpretability:
A model with fewer features is easier to interpret, explain, and deploy, especially in
fields like healthcare or finance, where understanding the decisions made by a model is
crucial.
 Types of Feature Selection Methods
 Feature selection methods can be broadly categorized into three types: Filter Methods,
Wrapper Methods, and Embedded Methods. Each of these methods has different
approaches to evaluating the importance of features.
 1. Filter Methods
 Filter methods use statistical techniques to evaluate the relationship between each
feature and the target variable. They are model-agnostic, meaning they do not depend on
any specific machine learning algorithm.
 Common Filter Methods:
• Correlation Coefficient:
• Measures the linear relationship between each feature and the target variable. Features with low
correlation can be removed. This method is useful for continuous data.
• Chi-Squared Test:
• Measures the dependence between categorical features and the target variable. It evaluates
whether the occurrence of a feature is independent of the occurrence of the target.
• Variance Threshold:
• Removes features with low variance. Features with little to no variance (e.g., almost all values
are the same) are unlikely to be useful in predicting the target variable.
• Mutual Information:
• Measures the dependency between two variables. Features with low mutual information with the
target variable can be removed as they do not provide much information about the target.
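A sketch of two filter methods with scikit-learn, using the iris dataset as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import (SelectKBest, VarianceThreshold,
                                       mutual_info_classif)

X, y = load_iris(return_X_y=True)

# Variance threshold: drop near-constant features (none are dropped for iris)
X_var = VarianceThreshold(threshold=0.1).fit_transform(X)

# Mutual information: keep the 2 features most informative about the target
selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)
print(selector.get_support())   # boolean mask of the selected features
```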
 Advantages of Filter Methods:
• Fast and efficient as they do not involve model training.
• Suitable for high-dimensional datasets.
• Simple to implement and understand.
 Disadvantages of Filter Methods:
• Consider each feature independently, ignoring feature interactions.
• May miss important features that are useful in combination but not
individually.
 2. Wrapper Methods
 Wrapper methods involve evaluating multiple subsets of features and selecting
the subset that produces the best model performance based on a specific
machine learning algorithm. These methods are computationally expensive but often
yield more accurate results than filter methods.
 Common Wrapper Methods:
• Recursive Feature Elimination (RFE):
• Iteratively builds a model and removes the least important feature until the desired number of
features is reached. The importance of features is typically determined by the model coefficients
(e.g., in linear regression) or feature importance scores (e.g., in decision trees).
• Sequential Feature Selection:
• Forward Selection: Starts with an empty set of features and, at each step, adds the single feature that improves the model the most, stopping when no significant improvement is observed.
• Backward Elimination: Starts with all features and, at each step, removes the feature whose removal improves the model the most, stopping when no further improvement is observed.
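A sketch of RFE and forward sequential selection with scikit-learn (SequentialFeatureSelector needs scikit-learn 0.24 or newer); the estimator and feature counts are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Recursive Feature Elimination: drop the weakest feature until 2 remain
rfe = RFE(model, n_features_to_select=2).fit(X, y)
print("RFE:", rfe.support_)

# Forward selection: add one feature at a time using cross-validated scores
sfs = SequentialFeatureSelector(model, n_features_to_select=2,
                                direction="forward").fit(X, y)
print("Forward:", sfs.get_support())
```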
 Advantages of Wrapper Methods:
• Consider the interaction between features.
• Typically yield better performance as they optimize for a specific algorithm.
 Disadvantages of Wrapper Methods:
• Computationally expensive, especially with a large number of features.
• Prone to overfitting, especially with limited data.
3. Embedded Methods
 Embedded methods perform feature selection during the process of model training. These
methods are specific to certain learning algorithms that have built-in feature selection
capabilities.
 Common Embedded Methods:
• LASSO (Least Absolute Shrinkage and Selection Operator):
• A linear regression model that uses L1 regularization, which adds a penalty equal to the absolute
value of the magnitude of coefficients. This penalty causes some coefficients to shrink to zero,
effectively selecting a subset of features.
• Decision Trees and Ensemble Methods:
• Decision trees, Random Forest, and Gradient Boosting methods provide feature importance scores
based on how much each feature contributes to reducing impurity (like Gini impurity or entropy).
• Regularization Methods:
• Models like Ridge Regression (L2 regularization) and Elastic Net (a combination of L1 and L2
regularization) penalize the size of coefficients, effectively shrinking less important features.
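A sketch of embedded selection via SelectFromModel, with a LASSO model and a random forest on the built-in diabetes dataset; the alpha value is an illustrative choice:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# LASSO (L1): features whose coefficients shrink to zero are dropped
lasso_sel = SelectFromModel(Lasso(alpha=0.5)).fit(X, y)
print("Lasso keeps:", lasso_sel.get_support())

# Tree-based importances: keep features above the mean importance
forest_sel = SelectFromModel(
    RandomForestRegressor(n_estimators=100, random_state=0)).fit(X, y)
print("Forest keeps:", forest_sel.get_support())
```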
 Advantages of Embedded Methods:
• Incorporate feature selection as part of the model training, which can improve efficiency.
• Typically result in good performance since feature selection is optimized for a specific
algorithm.
 Disadvantages of Embedded Methods:
• Limited to specific types of models.
• Less flexible compared to filter and wrapper methods.
 Evaluating Feature Selection Methods
 When selecting a feature selection method, consider the following:
1. Type of Data:
Continuous vs. categorical data may require different feature selection techniques
(e.g., Chi-Squared Test for categorical data, Correlation for continuous data).
2. Number of Features:
Filter methods are more suitable for high-dimensional datasets due to their
computational efficiency.
3. Model Type:
Some models (like tree-based models) already have built-in feature selection
mechanisms, making embedded methods a natural choice.
4. Computational Resources:
Wrapper methods can be computationally expensive, especially for large datasets.
5. Interpretability:
Simpler models and methods (like filter methods) may provide more interpretable
results compared to more complex wrapper or embedded methods.