EDA Techniques: Graphical
and Non-Graphical,
Univariate, Multivariate
1. Graphical EDA Techniques
Graphical EDA techniques involve visualizing the data using various charts and
plots. These techniques help in understanding the distribution, trends,
relationships, and outliers in the data.
a. Univariate Graphical EDA
Univariate analysis involves analyzing one variable at a time. The primary goal is
to understand the distribution and central tendency of the data.
• Histogram
• Purpose: To show the frequency distribution of a continuous variable.
• Example: A histogram of the ages of a group of people.
• Box Plot
• Purpose: To show the distribution of data based on five summary statistics:
minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
• Example: A box plot showing the distribution of salaries in a company.
• Density Plot
• Purpose: To estimate the probability density function of a continuous variable.
• Example: A density plot showing the distribution of test scores for students.
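To make these concrete, here is a minimal sketch of the three univariate plots using pandas and matplotlib; the ages Series and its values are made up purely for illustration.

```python
# Hypothetical ages of a group of people, purely for illustration.
import pandas as pd
import matplotlib.pyplot as plt

ages = pd.Series([22, 25, 31, 31, 40, 41, 45, 52, 60, 63, 70])

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
ages.plot.hist(ax=axes[0], bins=5, title="Histogram")   # frequency distribution
ages.plot.box(ax=axes[1], title="Box Plot")             # five-number summary
ages.plot.kde(ax=axes[2], title="Density Plot")         # estimated probability density
plt.tight_layout()
plt.show()
```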
b. Multivariate Graphical EDA
Multivariate analysis involves analyzing two or more variables
simultaneously to identify relationships and patterns.
• Scatter Plot
• Purpose: To examine the relationship between two continuous variables.
• Example: A scatter plot showing the relationship between advertising spend and
sales revenue.
• Pair Plot
• Purpose: To visualize the pairwise relationships between several continuous
variables.
• Example: A pair plot showing the relationships among variables like height,
weight, and age.
• Heatmap
• Purpose: To visualize correlations between variables using colors.
• Example: A heatmap showing the correlation between different financial metrics
like revenue, profit, and expenses.
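A minimal sketch of the three multivariate plots with seaborn, assuming a small made-up DataFrame whose columns (ad_spend, revenue, profit) are illustrative only.

```python
# Hypothetical marketing data; each call below draws its own figure.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "revenue":  [15, 35, 40, 60, 75],
    "profit":   [5, 12, 14, 25, 30],
})

sns.scatterplot(data=df, x="ad_spend", y="revenue")     # two continuous variables
plt.show()

sns.pairplot(df)                                        # pairwise relationships
plt.show()

sns.heatmap(df.corr(), annot=True, cmap="coolwarm")     # correlations as colours
plt.show()
```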
2. Non-Graphical EDA Techniques
Non-Graphical EDA techniques involve analyzing the data using numerical
summaries and statistical methods, without the use of visualizations.
a. Univariate Non-Graphical EDA
• Summary Statistics
• Purpose: To provide a numerical summary of a single variable.
• Examples:
• Mean: The average value of the data (e.g., the average income of a population).
• Median: The middle value of the data when sorted (e.g., the median house price).
• Mode: The most frequent value in the data (e.g., the most common age in a group).
• Standard Deviation: Measures the spread of the data (e.g., the standard deviation of
test scores).
• Frequency Distribution
• Purpose: To show how frequently each value occurs in the dataset.
• Example: A table showing the frequency of different blood types in a population.
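A minimal sketch of these univariate summaries with pandas; the income values are invented for illustration.

```python
# Hypothetical incomes, purely for illustration.
import pandas as pd

income = pd.Series([32_000, 45_000, 45_000, 51_000, 60_000, 120_000])

print(income.mean())          # mean: the average value
print(income.median())        # median: the middle value when sorted
print(income.mode())          # mode: the most frequent value(s)
print(income.std())           # standard deviation: the spread of the data
print(income.value_counts())  # frequency distribution of each value
```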
b. Multivariate Non-Graphical EDA
• Correlation Coefficient
• Purpose: To quantify the relationship between two continuous variables.
• Example: Calculating the correlation between study hours and exam scores.
• Explanation: A correlation coefficient close to +1 indicates a strong
positive relationship, while a coefficient close to -1 indicates a strong
negative relationship.
• Cross-Tabulation (Contingency Table)
• Purpose: To examine the relationship between two categorical variables.
• Example: A table showing the relationship between gender and voting
preference.
• Explanation: The table would display counts or percentages, helping to
identify any patterns or associations between the variables.
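A minimal sketch of both techniques with pandas, assuming a small made-up dataset of study hours, exam scores, gender, and voting preference.

```python
import pandas as pd

# Hypothetical survey data, purely for illustration.
df = pd.DataFrame({
    "study_hours": [2, 4, 6, 8, 10],
    "exam_score":  [55, 60, 72, 80, 91],
    "gender":      ["F", "M", "F", "M", "F"],
    "preference":  ["A", "B", "A", "A", "B"],
})

# Correlation coefficient between two continuous variables (close to +1 here)
print(df["study_hours"].corr(df["exam_score"]))

# Cross-tabulation (contingency table) of two categorical variables
print(pd.crosstab(df["gender"], df["preference"]))
```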
Data Pre-Processing – Numeric and Non-Numeric
Numeric Data Pre-Processing
Numeric data pre-processing is a fundamental step in data analysis and machine
learning. It involves cleaning, transforming, and organizing numerical
data to ensure it is suitable for analysis or modeling.
Why Pre-Process Numeric Data?
Raw data is often messy and inconsistent, containing missing values,
outliers, and features with varying scales. If left untreated, these issues
can lead to inaccurate models, poor predictions, and misleading
insights.
Pre-processing helps to:
• Standardize the data, making it easier to compare and analyze.
• Normalize the data to ensure that all features contribute equally to the
analysis.
• Handle missing values appropriately to avoid biased results.
• Detect and manage outliers that can distort statistical analyses.
• Improve model performance by preparing the data for machine learning
algorithms.
Key Techniques in Numeric Data Pre-Processing
a. Handling Missing Values
Why It Matters: Missing data can lead to biased estimates, reduced
statistical power, and invalid conclusions. It’s essential to address
missing values before proceeding with any analysis.
Common Strategies:
• Imputation: Replace missing values with statistical measures (mean,
median, or mode).
• Mean Imputation: Suitable for numerical data that is symmetrically
distributed.
• Median Imputation: Useful when the data is skewed.
• Mode Imputation: Typically used for categorical data but can be applied to
numerical data in some cases.
• Dropping: Remove rows or columns with missing values.
• This is practical when the missing data is minimal and doesn’t significantly
affect the dataset's size.
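A minimal sketch of imputation and dropping with pandas; the age and salary columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in 'age' and 'salary'.
df = pd.DataFrame({
    "age":    [25, np.nan, 40, 35],
    "salary": [50_000, 60_000, np.nan, 52_000],
})

df["age"] = df["age"].fillna(df["age"].mean())              # mean imputation (symmetric data)
df["salary"] = df["salary"].fillna(df["salary"].median())   # median imputation (skewed data)

df_dropped = df.dropna()   # alternatively, drop any rows that still contain missing values
```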
Normalization
Why It Matters: Normalization is crucial when the features in the
dataset have different scales. For example, if one feature ranges
between 0 and 1000, and another between 0 and 1, the feature with the
larger scale can dominate the model training, leading to biased
outcomes.
Normalization scales the numeric data to a specific range, typically [0,
1]. This is particularly useful when different features have different
ranges and you want to bring them to a common scale.
Common Approaches:
• Min-Max Normalization: Rescales the data to a fixed range, usually [0,
1].
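A minimal sketch of min-max normalization with scikit-learn's MinMaxScaler; the feature values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [250.0], [500.0], [1000.0]])   # a feature on a large scale

scaler = MinMaxScaler()             # applies (x - min) / (max - min)
X_scaled = scaler.fit_transform(X)
print(X_scaled)                     # all values now lie in [0, 1]
```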
Standardization
Why It Matters: Standardization is essential when you expect that your
data should follow a normal distribution. It transforms the data to have a
mean of 0 and a standard deviation of 1, which helps in comparing
features with different units and scales.
Common Approach:
• Z-score Standardization: Subtract the mean and divide by the standard deviation, i.e. z = (x − mean) / standard deviation, so the transformed feature has a mean of 0 and a standard deviation of 1.
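A minimal sketch of z-score standardization with scikit-learn's StandardScaler; the height values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[160.0], [170.0], [175.0], [185.0]])   # e.g. heights in cm (illustrative)

scaler = StandardScaler()              # applies z = (x - mean) / std
X_std = scaler.fit_transform(X)
print(X_std.mean(), X_std.std())       # approximately 0 and 1
```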
Outlier Detection and Handling
Why It Matters: Outliers can skew data analysis, lead to inaccurate
predictions, and affect the model's performance. Identifying and
addressing outliers ensures that the analysis is robust and reliable.
Common Strategies:
• IQR Rule: Flag values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR as outliers.
• Z-score Rule: Flag values whose standardized score lies beyond roughly ±3.
• Treatment: Remove outliers, cap them at a threshold (winsorizing), or apply a transformation (e.g., log) to reduce their influence.
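A minimal sketch of the IQR rule with pandas; the data values are made up so that one obvious outlier stands out.

```python
import pandas as pd

s = pd.Series([12, 14, 15, 15, 16, 17, 18, 95])   # 95 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]   # values flagged by the IQR rule
print(outliers)

s_capped = s.clip(lower, upper)           # capping (winsorizing) instead of removing
```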
Binning
Why It Matters: Binning converts continuous data into categorical data
by dividing it into intervals or "bins." This can reduce the impact of noise
and allow for better data interpretation, especially when dealing with
large datasets.
Common Approaches:
• Equal-width Binning: Divides the data into bins of equal size.
• Equal-frequency Binning: Each bin contains the same number of data
points.
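A minimal sketch of both binning approaches with pandas; the ages and bin edges are illustrative.

```python
import pandas as pd

ages = pd.Series([5, 17, 23, 31, 46, 58, 72])

equal_width = pd.cut(ages, bins=3)    # equal-width binning: 3 bins of equal width
equal_freq = pd.qcut(ages, q=3)       # equal-frequency binning: roughly equal counts per bin

# Binning with explicit, labelled intervals
labelled = pd.cut(ages, bins=[0, 18, 60, 100], labels=["child", "adult", "senior"])
print(labelled)
```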
Numeric data pre-processing is a crucial step that involves transforming
raw numerical data into a clean, consistent format that can be effectively
used in data analysis and machine learning. The primary techniques
include:
• Handling Missing Values: Ensures that no data is lost or biased due
to gaps in the dataset.
• Normalization and Standardization: Rescale data to make features
comparable and improve model performance.
• Outlier Detection and Handling: Identifies and mitigates the impact of
extreme values that could skew results.
• Binning: Simplifies continuous data by grouping it into categories,
making it more interpretable.
Visual Representation
• Histograms can be used to show the distribution of data before and after
normalization or standardization.
• Boxplots are ideal for visualizing outliers.
• Bar charts can represent the distribution of data after binning.
Non-Numeric Data Pre-Processing
Non-numeric data, also known as categorical or qualitative data, represents
variables that can be divided into different categories but do not have inherent
numerical meaning. Examples include gender, occupation, product type, or
review text. Pre-processing non-numeric data is crucial for preparing it for
machine learning models, as most algorithms require numerical input.
Why Pre-Process Non-Numeric Data?
Non-numeric data often contains valuable information that can enhance
the predictive power of models. However, because most machine learning
algorithms work with numerical data, non-numeric data must be converted or
transformed into a numerical format. Pre-processing also involves handling
inconsistencies, encoding categories, and ensuring that the data is in
a format suitable for analysis.
Key Techniques in Non-Numeric Data
Pre-Processing
Handling Missing Values
Why It Matters: Just like in numeric data, missing values in non-numeric
data can lead to biased results and poor model performance.
Addressing missing values is crucial for maintaining data integrity.
Common Strategies:
• Imputation: Replace missing categorical values with the most frequent
category (mode) or a new category (e.g., "Unknown").
• Dropping: If a significant amount of data is missing, you might choose to
remove those rows or columns entirely.
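A minimal sketch of both strategies with pandas; the city column and its values are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"city": ["Bengaluru", np.nan, "Mumbai", "Bengaluru", np.nan]})

mode_filled = df["city"].fillna(df["city"].mode()[0])   # impute with the most frequent category
unknown_filled = df["city"].fillna("Unknown")           # or introduce an explicit "Unknown" category
df_dropped = df.dropna()                                # or drop the incomplete rows entirely
```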
Encoding Categorical Variables
Why It Matters: Machine learning algorithms require numerical input,
so categorical variables must be converted into a numerical format. This
process is called encoding.
Common Encoding Techniques:
• Label Encoding: Assigns a unique integer to each category.
• Example: Categories "Red", "Blue", and "Green" might be encoded as 0, 1,
and 2, respectively.
• One-Hot Encoding: Creates a new binary variable for each category.
Each category is represented by a binary vector.
• Example: A "Color" variable with categories "Red", "Blue", and "Green" would
be split into three binary columns: "Color_Red", "Color_Blue", and
"Color_Green".
Handling Text Data (Text Pre-Processing)
Why It Matters: Text data, such as customer reviews or comments, is
unstructured and must be processed before it can be used in
models. Text pre-processing transforms raw text into a format that can be
analyzed.
Common Steps in Text Pre-Processing:
• Tokenization: Splitting text into individual words or tokens.
• Lowercasing: Converting all text to lowercase to ensure uniformity.
• Removing Stop Words: Removing common words that do not
contribute much meaning (e.g., "and", "the").
• Stemming/Lemmatization: Reducing words to their base or root form
(e.g., "running" to "run").
• Vectorization: Converting text into numerical format using techniques
like Bag of Words (BoW), Term Frequency-Inverse Document Frequency
(TF-IDF), or Word Embeddings.
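A minimal sketch of these steps using scikit-learn's TfidfVectorizer, which lowercases, tokenizes, removes English stop words, and vectorizes in one pass; the two reviews are made up. (Stemming/lemmatization would need an additional library such as NLTK or spaCy and is omitted here.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "The product is great and arrived quickly",
    "Terrible product, the delivery was slow",
]

# Lowercases, tokenizes, drops English stop words, and builds a TF-IDF matrix
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())   # remaining tokens after stop-word removal
print(X.toarray())                          # each row is a document's TF-IDF vector
```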
Handling Ordinal Data
Why It Matters: Ordinal data represents categories with a meaningful
order but no consistent difference between them. Examples include
rating scales (e.g., "Poor", "Average", "Good") or education levels (e.g.,
"High School", "Bachelor's", "Master's").
Common Strategy:
• Ordinal Encoding: Assigns integers to categories based on their order.
• Example: "Poor", "Average", "Good" might be encoded as 1, 2, and 3.
Feature Engineering for Non-Numeric Data
Why It Matters: Feature engineering involves creating new features
from existing non-numeric data to improve model performance.
For text data, this might involve creating features like the length of a
review or the presence of specific keywords.
Common Techniques:
• Creating Binary Features: From categorical data (e.g., a binary
column indicating whether a product is 'High Risk').
• Extracting Features from Text: Such as word count, presence of
certain phrases, or sentiment analysis scores.
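A minimal sketch of these feature-engineering ideas with pandas; the columns and the keyword chosen are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["High Risk", "Low Risk", "High Risk"],
    "review":   ["great value", "poor quality, would not buy", "excellent product"],
})

df["is_high_risk"] = (df["category"] == "High Risk").astype(int)      # binary feature
df["review_length"] = df["review"].str.split().str.len()              # word count per review
df["mentions_poor"] = df["review"].str.contains("poor").astype(int)   # keyword presence
```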
Non-numeric data pre-processing is a vital step in preparing qualitative data
for analysis and modeling. It involves various techniques, including:
• Handling Missing Values: Ensures that the data remains unbiased and
complete.
• Encoding Categorical Variables: Converts categories into numerical
format for use in algorithms.
• Text Pre-Processing: Transforms raw text into a format suitable for
analysis.
• Handling Ordinal Data: Ensures that ordered categories are
represented numerically in a way that reflects their order.
• Feature Engineering: Creates new features from non-numeric data to
improve model performance.
These techniques are crucial for making non-numeric data compatible with
machine learning models, ensuring that all relevant information is captured
and utilized effectively in analysis.
What Is a Missing Value?
Missing data refers to values that are not stored (or not present) for one or more variables in a dataset. In the Titanic dataset, for example, the 'Age' and 'Cabin' columns contain missing values.
How is a Missing Value Represented in a Dataset?
• NaN (Not a Number): In many programming languages and data analysis tools, missing values are represented as NaN. This is the default for libraries like Pandas in Python.
• NULL or None: In databases and some programming languages, missing values are often represented as NULL or None. For instance, in SQL databases, a missing value is typically recorded as NULL.
• Empty Strings: Sometimes, missing values are denoted by empty strings (""). This is common in text-based data or CSV files where a field might be left blank.
• Special Indicators: Datasets might use specific indicators like -999, 9999, or other unlikely values to signify missing data. This is often seen in older datasets or specific industries where such conventions were established.
• Blanks or Spaces: In some cases, particularly in fixed-width text files, missing values might be represented by spaces or blank fields.
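A minimal sketch of normalizing these conventions to NaN when loading data with pandas; 'data.csv' and the list of indicator values are hypothetical and should match the conventions of the actual dataset.

```python
import pandas as pd

# Map the dataset's missing-value conventions to NaN at load time
df = pd.read_csv("data.csv", na_values=["", "NULL", "None", -999, 9999, " "])
print(df.isna().sum())   # number of missing values per column
```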
Why is Data Missing From the
Dataset?
There can be multiple reasons why certain values are missing from the data. The reason the data is missing affects how the missing values should be handled, so it is necessary to understand why the data could be missing.
Feature selection methods fall into three broad groups, each detailed in the sections that follow:
1. Filter Methods: These use statistical measures to score each feature's relationship with the target, independently of any model.
• Example Techniques: Correlation Coefficient, Chi-Squared Test, Variance Threshold.
2. Wrapper Methods: These involve using a machine learning model to evaluate the performance of a subset of features.
• Example Techniques:
• Forward Selection: Start with no features, add them one at a time, and keep those that improve the model.
• Backward Elimination: Start with all features, remove them one at a time, and keep those that contribute most to the model.
• Recursive Feature Elimination (RFE): A recursive process where features are removed to determine the best subset.
3. Embedded Methods: These methods perform feature selection as part of the model training process.
• Example Techniques:
• Lasso Regression (L1 Regularization): Penalizes the absolute size of the coefficients.
• Decision Trees and Random Forests: Use feature importance scores to rank features.
Example of Feature Selection:
Continuing with the house price prediction example:
• Filter Method: Use a correlation matrix to remove highly correlated
features like "Number of Bathrooms" if it is highly correlated with
"Number of Bedrooms."
• Wrapper Method: Apply Recursive Feature Elimination (RFE) with a
linear regression model to select the top 5 features.
• Embedded Method: Use a decision tree algorithm and evaluate the
importance scores of each feature to select the most relevant ones.
Feature Selection
Feature Selection is the process of selecting a subset of the most
relevant features (variables, predictors) from the data that contribute
the most to predicting the target variable. It is a critical step in machine
learning that focuses on improving the model’s performance by reducing
the feature space.
Why Feature Selection is Important:
1. Improves Model Accuracy:
Irrelevant or redundant features can reduce the predictive power of the model by
introducing noise. Selecting only the most relevant features helps the model focus on the
true patterns in the data.
2. Reduces Overfitting:
Models with too many features can capture noise in the training data rather than
the underlying data patterns. By reducing the number of features, feature selection helps
to mitigate overfitting and improves the model’s ability to generalize to new, unseen data.
3. Reduces Training Time:
Fewer features mean less computational complexity, which translates to faster
training times, particularly important for large datasets and complex models.
4. Enhances Interpretability:
A model with fewer features is easier to interpret, explain, and deploy, especially in
fields like healthcare or finance, where understanding the decisions made by a model is
crucial.
Types of Feature Selection Methods
Feature selection methods can be broadly categorized into three types: Filter Methods,
Wrapper Methods, and Embedded Methods. Each of these methods has different
approaches to evaluating the importance of features.
1. Filter Methods
Filter methods use statistical techniques to evaluate the relationship between each
feature and the target variable. They are model-agnostic, meaning they do not depend on
any specific machine learning algorithm.
Common Filter Methods:
• Correlation Coefficient:
• Measures the linear relationship between each feature and the target variable. Features with low
correlation can be removed. This method is useful for continuous data.
• Chi-Squared Test:
• Measures the dependence between categorical features and the target variable. It evaluates
whether the occurrence of a feature is independent of the occurrence of the target.
• Variance Threshold:
• Removes features with low variance. Features with little to no variance (e.g., almost all values
are the same) are unlikely to be useful in predicting the target variable.
• Mutual Information:
• Measures the dependency between two variables. Features with low mutual information with the
target variable can be removed as they do not provide much information about the target.
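A minimal sketch of two filter methods with scikit-learn (a variance threshold and univariate scoring via f_regression); the toy house-price data is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression

X = pd.DataFrame({
    "bedrooms":  [2, 3, 3, 4, 5],
    "bathrooms": [1, 2, 2, 3, 3],
    "noise":     [9, 1, 7, 3, 5],    # unrelated to the target
    "constant":  [7, 7, 7, 7, 7],    # zero variance
})
y = np.array([200, 300, 310, 420, 500])   # made-up house prices

vt = VarianceThreshold(threshold=0.0)     # drops the zero-variance column
X_var = X.loc[:, vt.fit(X).get_support()]

selector = SelectKBest(score_func=f_regression, k=2).fit(X_var, y)
print(X_var.columns[selector.get_support()])   # the two highest-scoring features
```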
Advantages of Filter Methods:
• Fast and efficient as they do not involve model training.
• Suitable for high-dimensional datasets.
• Simple to implement and understand.
Disadvantages of Filter Methods:
• Consider each feature independently, ignoring feature interactions.
• May miss important features that are useful in combination but not
individually.
2. Wrapper Methods
Wrapper methods involve evaluating multiple subsets of features and selecting
the subset that produces the best model performance based on a specific
machine learning algorithm. These methods are computationally expensive but often
yield more accurate results than filter methods.
Common Wrapper Methods:
• Recursive Feature Elimination (RFE):
• Iteratively builds a model and removes the least important feature until the desired number of
features is reached. The importance of features is typically determined by the model coefficients
(e.g., in linear regression) or feature importance scores (e.g., in decision trees).
• Sequential Feature Selection:
• Forward Selection: Starts with an empty set of features and, at each step, adds the feature that improves the model the most, until no significant improvement is observed.
• Backward Elimination: Starts with all features and, at each step, removes the feature whose removal improves the model the most, until no further improvement is observed.
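A minimal sketch of RFE with a linear regression estimator on synthetic data generated by scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 features, only 3 of which actually drive the target
X, y = make_regression(n_samples=100, n_features=8, n_informative=3, random_state=0)

rfe = RFE(estimator=LinearRegression(), n_features_to_select=3)
rfe.fit(X, y)   # iteratively drops the least important feature

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # rank 1 = selected
```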
Advantages of Wrapper Methods:
• Consider the interaction between features.
• Typically yield better performance as they optimize for a specific algorithm.
Disadvantages of Wrapper Methods:
• Computationally expensive, especially with a large number of features.
• Prone to overfitting, especially with limited data.
3. Embedded Methods
Embedded methods perform feature selection during the process of model training. These
methods are specific to certain learning algorithms that have built-in feature selection
capabilities.
Common Embedded Methods:
• LASSO (Least Absolute Shrinkage and Selection Operator):
• A linear regression model that uses L1 regularization, which adds a penalty equal to the absolute
value of the magnitude of coefficients. This penalty causes some coefficients to shrink to zero,
effectively selecting a subset of features.
• Decision Trees and Ensemble Methods:
• Decision trees, Random Forest, and Gradient Boosting methods provide feature importance scores
based on how much each feature contributes to reducing impurity (like Gini impurity or entropy).
• Regularization Methods:
• Models like Ridge Regression (L2 regularization) and Elastic Net (a combination of L1 and L2
regularization) penalize the size of coefficients, effectively shrinking less important features.
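A minimal sketch of two embedded approaches (LASSO coefficients and random-forest feature importances) on synthetic scikit-learn data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# Synthetic data: 6 features, only 2 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=6, n_informative=2, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)                    # the L1 penalty shrinks some coefficients to exactly zero

forest = RandomForestRegressor(random_state=0).fit(X, y)
print(forest.feature_importances_)    # importance scores used to rank features
```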
Advantages of Embedded Methods:
• Incorporate feature selection as part of the model training, which can improve efficiency.
• Typically result in good performance since feature selection is optimized for a specific
algorithm.
Disadvantages of Embedded Methods:
• Limited to specific types of models.
• Less flexible compared to filter and wrapper methods.
Evaluating Feature Selection Methods
When selecting a feature selection method, consider the following:
1. Type of Data:
Continuous vs. categorical data may require different feature selection techniques
(e.g., Chi-Squared Test for categorical data, Correlation for continuous data).
2. Number of Features:
Filter methods are more suitable for high-dimensional datasets due to their
computational efficiency.
3. Model Type:
Some models (like tree-based models) already have built-in feature selection
mechanisms, making embedded methods a natural choice.
4. Computational Resources:
Wrapper methods can be computationally expensive, especially for large datasets.
5. Interpretability:
Simpler models and methods (like filter methods) may provide more interpretable
results compared to more complex wrapper or embedded methods.