EDA, Feature Engineering, Estimation, Inference, and Hypothesis Testing
Exploratory Data Analysis (EDA) is important for several reasons, especially in the context of data science and statistical modeling. It involves analyzing
and visualizing data to understand its main characteristics, uncover patterns, and identify relationships between variables.
Here are some of the key reasons why EDA is a critical step in the data analysis process:
• Understanding Data Structures: EDA helps in getting familiar with the dataset, understanding the number of features, the type of data in each
feature, and the distribution of data points. This understanding is crucial for selecting appropriate analysis or prediction techniques.
• Identifying Patterns and Relationships: Through visualizations and statistical summaries, EDA can reveal hidden patterns and intrinsic
relationships between variables. These insights can guide further analysis and enable more effective feature engineering and model building.
• Detecting Anomalies and Outliers: EDA is essential for identifying errors or unusual data points that may adversely affect the results of your
analysis. Detecting these early can prevent costly mistakes in predictive modeling and analysis.
• Testing Assumptions: Many statistical models assume that data follow a certain distribution or that variables are independent. EDA involves
checking these assumptions. If the assumptions do not hold, the conclusions drawn from the model could be invalid.
• Informing Feature Selection and Engineering: Insights gained from EDA can inform which features are most relevant to include in a model and
how to transform them (scaling, encoding) to improve model performance.
• Optimizing Model Design: By understanding the data’s characteristics, analysts can choose appropriate modeling techniques, decide on the
complexity of the model, and better tune model parameters.
• Facilitating Data Cleaning: EDA helps in spotting missing values and errors in the data, which are critical to address before further analysis to
improve data quality and integrity.
Types of Exploratory Data Analysis
1. Univariate Analysis
Univariate analysis focuses on a single variable to understand its internal structure. It is primarily concerned with describing the data
and finding patterns existing in a single feature. Common techniques include:
Histograms: Used to visualize the distribution of a variable.
Box plots: Useful for detecting outliers and understanding the spread and skewness (the degree to which the data deviates
from a perfectly symmetrical distribution) of the data.
Bar charts: Employed for categorical data to show the frequency of each category.
Summary statistics: Calculations like mean, median, mode, variance, and standard deviation that describe the central tendency
and dispersion of the data.
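A minimal univariate sketch using pandas and Matplotlib; the "age" column and its values below are purely illustrative assumptions:

```python
# Univariate EDA: summary statistics, histogram, and box plot for one variable.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"age": [22, 25, 27, 29, 31, 35, 38, 42, 45, 70]})

print(df["age"].describe())          # mean, std, quartiles (central tendency and dispersion)
print("Skewness:", df["age"].skew()) # how asymmetric the distribution is

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
df["age"].plot.hist(ax=axes[0], bins=5, title="Histogram")   # distribution shape
df["age"].plot.box(ax=axes[1], title="Box plot")             # spread and outliers
plt.tight_layout()
plt.show()
```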
2. Bivariate Analysis
Bivariate analysis is a crucial form of exploratory data analysis that examines the relationship between two variables. It helps uncover
associations, correlations, and dependencies between pairs of variables. Some key techniques used in bivariate analysis:
Scatter Plots: These are one of the most common tools used in bivariate analysis. A scatter plot helps visualize the relationship
between two continuous variables.
Correlation Coefficient: This statistical measure (often Pearson’s correlation coefficient for linear relationships) quantifies the
degree to which two variables are related.
Cross-tabulation: Also known as contingency tables, cross-tabulation is used to analyze the relationship between two categorical
variables. It shows the frequency distribution of categories of one variable in rows and the other in columns, which helps in
understanding the relationship between the two variables.
Line Graphs: In the context of time series data, line graphs can be used to compare two variables over time. This helps in
identifying trends, cycles, or patterns that emerge in the interaction of the variables over the specified period.
Covariance: Covariance is a measure used to determine how much two random variables change together. However, it is
sensitive to the scale of the variables, so it’s often supplemented by the correlation coefficient for a more standardized
assessment of the relationship.
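A bivariate sketch covering the techniques above; the DataFrame and its column names ("hours_studied", "exam_score", "gender", "passed") are illustrative assumptions:

```python
# Bivariate EDA: scatter plot, Pearson correlation, covariance, and cross-tabulation.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "exam_score":    [52, 55, 61, 64, 70, 74, 78, 85],
    "gender":        ["M", "F", "F", "M", "F", "M", "M", "F"],
    "passed":        ["No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"],
})

print(df["hours_studied"].corr(df["exam_score"]))   # Pearson correlation coefficient
print(df[["hours_studied", "exam_score"]].cov())    # covariance (scale-sensitive)
print(pd.crosstab(df["gender"], df["passed"]))      # contingency table of two categoricals

df.plot.scatter(x="hours_studied", y="exam_score")  # visualize the relationship
plt.show()
```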
3. Multivariate Analysis
Multivariate analysis examines the relationships among more than two variables in the dataset. It aims to understand
how variables interact with one another, which is crucial for most statistical modeling techniques. Techniques include:
Pair plots: Visualize relationships across several variables simultaneously to capture a comprehensive view of
potential interactions.
Principal Component Analysis (PCA): A dimensionality reduction technique used to reduce the dimensionality
of large datasets, while preserving as much variance as possible.
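A multivariate sketch using seaborn's pair plot and scikit-learn's PCA; the iris dataset is used only as an illustrative example:

```python
# Multivariate EDA: pairwise relationships and dimensionality reduction with PCA.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species")          # relationships across several variables at once
plt.show()

X = StandardScaler().fit_transform(iris.drop(columns="species"))
pca = PCA(n_components=2)
components = pca.fit_transform(X)          # project onto the first two principal components
print(pca.explained_variance_ratio_)       # variance preserved by each component
```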
Univariate, Bivariate and Multivariate data and its analysis
Univariate: It does not deal with causes and relationships. It does not contain any dependent variable.
Bivariate: It deals with causes and relationships, and analysis is done. It contains only one dependent variable.
Multivariate: It deals with causes and relationships like bivariate analysis, but it contains more than 2 variables.
1. Python Libraries
Pandas: Provides extensive functions for data manipulation and analysis, including data structure
handling and time series functionality.
Matplotlib: A plotting library for creating static, interactive, and animated visualizations in Python.
Seaborn: Built on top of Matplotlib, it provides a high-level interface for drawing attractive and informative
statistical graphics.
Plotly: An interactive graphing library that offers more sophisticated, interactive visualization capabilities.
2. R Packages
ggplot2: A powerful tool for making complex plots from data in a data frame.
dplyr: A grammar of data manipulation, providing a consistent set of verbs that help you solve the most
common data manipulation challenges.
tidyr: Helps to tidy your data. Tidying your data means storing it in a consistent form that matches the
semantics of the dataset with the way it is stored.
Steps for Performing Exploratory Data Analysis
Performing Exploratory Data Analysis (EDA) involves a series of steps designed to help you understand the
data you are working with, uncover underlying patterns, identify anomalies, test hypotheses, and ensure the data
is clean and suitable for further analysis.
Step 1: Understand the Problem and the Data
The first step in any data analysis project is to clearly understand the problem you are trying to solve and the data
you have at your disposal. This involves asking questions such as:
What is the business goal or research question you are trying to address?
What are the variables in the data, and what do they mean?
What are the data types (numerical, categorical, text, etc.)?
Are there any known data quality issues or limitations?
Are there any relevant domain-specific considerations or constraints?
By thoroughly understanding the problem and the data, you can better formulate your analysis approach and avoid making
incorrect assumptions or drawing misguided conclusions.
We engineer features for various reasons, and some of the main reasons include:
•Improve User Experience: The primary reason we engineer features is to enhance the user experience of a product
or service. By adding new features, we can make the product more intuitive, efficient, and user-friendly, which can
increase user satisfaction and engagement.
•Competitive Advantage: Another reason we engineer features is to gain a competitive advantage in the
marketplace. By offering unique and innovative features, we can differentiate our product from competitors and attract
more customers.
•Meet Customer Needs: We engineer features to meet the evolving needs of customers. By analyzing user feedback,
market trends, and customer behavior, we can identify areas where new features could enhance the product’s value
and meet customer needs.
•Increase Revenue: Features can also be engineered to generate more revenue. For example, a new feature that
streamlines the checkout process can increase sales, or a feature that provides additional functionality could lead to
more upsells or cross-sells.
•Future-Proofing: Features can also be engineered to future-proof a product or service, that is, designed and developed so that
the product remains relevant, adaptable, and effective even as technology, market conditions, or user needs change. By
anticipating future trends and potential customer needs, we can develop features that keep the product useful in the long
term.
Processes Involved in Feature Engineering
Feature engineering is an iterative process that requires experimentation and testing to find the best combination of features for a given
problem. The success of a machine learning model largely depends on the quality of the features used in the model.
1. Feature Creation
Feature creation refers to the creation of new features from existing data to help with better predictions.
Examples of feature creation include: one-hot encoding, binning, splitting, and calculated features.
Types of feature creation:
1. Domain-Specific: Creating new features based on domain knowledge, such as features derived from business rules or industry standards.
2. Data-Driven: Creating new features by observing patterns in the data, such as calculating aggregations or creating interaction features.
3. Synthetic: Generating new features by combining existing features or synthesizing new data points.
Benefits of feature creation:
1. Improves Model Performance: By providing additional and more relevant information to the model, feature creation can increase its accuracy and precision.
2. Increases Model Robustness: Additional features can make the model more robust to outliers and other anomalies.
3. Improves Model Interpretability: Well-chosen new features can make the model's predictions easier to understand.
4. Increases Model Flexibility: New features can make the model more flexible in handling different types of data.
2. Feature Transformation
Feature transformation and imputation include steps for replacing missing features or features that are not valid.
Some techniques include: forming Cartesian products of features, non-linear transformations (such as binning numeric
variables into categories), and creating domain-specific features.
Benefits of feature transformation include:
• Improves Computational Efficiency: Many machine learning algorithms, such as k-nearest neighbors, are sensitive to the scale of the features and perform better with scaled features.
• Improves Model Interpretability: Transforming the features to a similar scale can make the model's predictions easier to understand.
Techniques Used in Feature Engineering
Feature engineering is the process of transforming raw data into features that are suitable for machine learning
models. There are various techniques that can be used in feature engineering to create new features by combining or
transforming the existing ones. The following are some of the commonly used feature engineering techniques:
One-Hot Encoding
One-hot encoding is a technique used to transform categorical variables into numerical values that can be used by machine
learning models. In this technique, each category is transformed into a binary value indicating its presence or absence. For
example, consider a categorical variable “Colour” with three categories: Red, Green, and Blue. One-hot encoding would
transform this variable into three binary variables: Colour_Red, Colour_Green, and Colour_Blue, where the value of each variable
would be 1 if the corresponding category is present and 0 otherwise.
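A sketch of the "Colour" example above using pandas (the small DataFrame is illustrative):

```python
# One-hot encoding: each category becomes its own 0/1 column.
import pandas as pd

df = pd.DataFrame({"Colour": ["Red", "Green", "Blue", "Green"]})
encoded = pd.get_dummies(df, columns=["Colour"], dtype=int)
print(encoded)   # columns: Colour_Blue, Colour_Green, Colour_Red
```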
Binning
Binning is a technique used to transform continuous variables into categorical variables. In this technique, the range of values of
the continuous variable is divided into several bins, and each bin is assigned a categorical value. For example, consider a
continuous variable “Age” with values ranging from 18 to 80. Binning would divide this variable into several age groups such as
18-25, 26-35, 36-50, and 51-80, and assign a categorical value to each age group.
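A sketch of the "Age" binning example above with pandas.cut (the ages are illustrative):

```python
# Binning: map a continuous variable into labeled intervals.
import pandas as pd

ages = pd.Series([18, 22, 30, 41, 55, 67, 79])
bins = [17, 25, 35, 50, 80]                     # interval edges
labels = ["18-25", "26-35", "36-50", "51-80"]
print(pd.cut(ages, bins=bins, labels=labels))   # each age mapped to its group
```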
Scaling
The most common scaling techniques are standardization and normalization. Standardization scales the variable so that it has
zero mean and unit variance. Normalization scales the variable so that it has a range of values between 0 and 1.
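A minimal scaling sketch with scikit-learn, contrasting the two techniques on a toy column of values:

```python
# Standardization vs. normalization of a single numeric feature.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [5.0], [10.0], [20.0]])

print(StandardScaler().fit_transform(X))  # zero mean, unit variance
print(MinMaxScaler().fit_transform(X))    # rescaled to the range [0, 1]
```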
Feature Split
Feature splitting is a powerful technique used in feature engineering to improve the performance of machine
learning models. It involves dividing single features into multiple sub-features or groups based on specific criteria.
This process unlocks valuable insights and enhances the model’s ability to capture complex relationships and
patterns within the data.
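A feature-split sketch in pandas; the "full_name" and "timestamp" columns and their values are illustrative assumptions:

```python
# Feature split: divide one raw column into several sub-features.
import pandas as pd

df = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Alan Turing"],
    "timestamp": pd.to_datetime(["2024-01-15 09:30", "2024-06-02 17:45"]),
})

df[["first_name", "last_name"]] = df["full_name"].str.split(" ", expand=True)
df["hour"] = df["timestamp"].dt.hour        # time-of-day sub-feature
df["month"] = df["timestamp"].dt.month      # seasonal sub-feature
print(df)
```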
Variable transformation involves changing the form or structure of variables in a dataset to make them more suitable for analysis,
improve model performance, or address specific statistical requirements. Transformations can help in normalizing data, handling
non-linearity, reducing skewness, or improving interpretability.
1. Scaling Transformations:
   • Normalization (Min-Max Scaling): Scales variables to a fixed range, usually [0, 1]. Useful when variables have different units or scales.
   • Standardization (Z-score Normalization): Converts variables to have a mean of 0 and a standard deviation of 1. Useful for algorithms that assume normally distributed data.
2. Logarithmic Transformation:
   • Purpose: Reduces skewness (skewness measures how far a variable's distribution deviates from the symmetric shape of a normal distribution) and handles exponential growth patterns by converting values to their logarithms. Often used for highly skewed data.
   • Yeo-Johnson Transformation: An extension of the Box-Cox transformation that handles zero and negative values.
3. Binning:
   • Purpose: Converts continuous variables into categorical ones by grouping values into bins or intervals. Useful for simplifying data or handling non-linear relationships.
   • Example: Grouping ages into bins like 0-18, 19-35, 36-50, etc.
4. Polynomial Transformation:
   • Purpose: Adds polynomial terms (squares, cubes) of the original variables to capture non-linear relationships. Useful for polynomial regression models.
5. Categorical Encoding:
   • Purpose: Converts categorical variables into numerical formats for use in algorithms that require numerical input. Common methods include:
      • One-Hot Encoding: Creates binary columns for each category.
      • Label Encoding: Assigns integer values to categories.
6. Rank Transformation:
   • Purpose: Converts values to their ranks (ordinal positions) to handle outliers and make data more robust to non-normal distributions. Ranks are assigned based on the ordering of values.
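A small sketch of two of the transformations above (logarithmic and rank) applied to illustrative, right-skewed income values:

```python
# Log and rank transformations on a skewed variable.
import numpy as np
import pandas as pd

income = pd.Series([20_000, 25_000, 30_000, 40_000, 1_000_000])  # highly skewed

log_income = np.log1p(income)     # log(1 + x): compresses the long right tail
rank_income = income.rank()       # rank transform: robust to the extreme value
print(pd.DataFrame({"income": income, "log": log_income, "rank": rank_income}))
```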
Estimation, Inferences & Hypothesis Testing
In machine learning, estimation, inference, and hypothesis testing are fundamental concepts that help in understanding, evaluating, and
improving models. Here's a breakdown of these concepts and their relevance to machine learning:
Estimation
Estimation refers to the process of determining the parameters of a model based on observed data. In machine learning, this involves fitting a
model to data to approximate underlying relationships or patterns.
Methods:
•Maximum Likelihood Estimation (MLE): This method estimates parameters by maximizing the likelihood function, which measures how likely
the observed data is given the parameters. It is widely used due to its desirable properties like asymptotic unbiasedness and efficiency.
•Bayesian Estimation: This approach incorporates prior beliefs about the parameters (prior distributions) and updates them with the data to form
posterior distributions. Bayesian estimation provides a full probability distribution for the parameters rather than a single estimate.
•Least Squares Estimation: Often used in linear regression, this method minimizes the sum of the squared differences between observed and
predicted values.
Example in Machine Learning: Estimating the parameters (weights) of a neural network through backpropagation and gradient descent.
Interval Estimation: This provides a range of values within which the parameter is expected to fall with a certain level of confidence.
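A small estimation sketch, assuming a simulated normal sample (the parameter values are illustrative): it computes the maximum likelihood estimates of the mean and standard deviation, plus a 95% confidence interval for the mean as an example of interval estimation.

```python
# Point estimation (MLE) and interval estimation for a normal model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=200)   # simulated observations

mu_hat, sigma_hat = stats.norm.fit(data)       # MLE of mean and standard deviation
print(mu_hat, sigma_hat)

ci = stats.t.interval(0.95, df=len(data) - 1,  # 95% confidence interval for the mean
                      loc=data.mean(), scale=stats.sem(data))
print(ci)
```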
Inference
•Inference is about using the estimated parameters to draw broader conclusions, make predictions, or understand the significance of the
model components.
Statistical Inference: This involves making judgments about the model parameters or data generation process. It can include
hypothesis testing, confidence intervals, and understanding the significance of different features.
Techniques in Inference
Hypothesis Testing: Involves testing a hypothesis about a parameter or model structure. For instance, you might test whether a
certain feature significantly contributes to the prediction or whether a model performs better than a baseline.
Confidence Intervals: Provide a range of values for an estimate, indicating the reliability and precision of the estimate.
Model Evaluation Metrics: Techniques like cross-validation, AIC (Akaike Information Criterion), and BIC (Bayesian Information
Criterion) help in assessing model performance and generalizability.
Methods:
•Point Predictions: Using the model to output specific values or classes for given inputs.
•Probabilistic Predictions: Estimating the probability distribution over possible outcomes, such as predicting the probability of a
class in classification problems.
Hypothesis Testing
Hypothesis testing in machine learning involves evaluating whether certain assumptions or hypotheses about the model or data
hold true. This helps in validating the model's performance, understanding feature importance, or comparing different models.
Level of Significance (α)
•The level of significance, denoted as α (alpha), is the threshold you set for deciding whether to reject the null hypothesis. It represents the
probability of making a Type I error, which occurs when you incorrectly reject a true null hypothesis.
•Typical values for α are 0.01, 0.05, or 0.10 (i.e., 1%, 5%, or 10%). For example, an α of 0.05 means you are willing to accept a 5% chance of incorrectly
rejecting the null hypothesis.
Example: Suppose you are comparing two models and perform a statistical test to determine if their performance differences are significant. If
you set α = 0.01, you are willing to accept a 1% chance of incorrectly concluding that there is a significant difference when there isn’t one.
Confidence Level (1 - α)
•The confidence level is the complement of the level of significance. It represents the proportion of times you would avoid a Type I error,
that is, not incorrectly reject a true null hypothesis, if you repeated the study multiple times.
•If your α is 0.05, your confidence level is 95% (i.e., 1 - 0.05 = 0.95). This means that, when the null hypothesis is true, you expect the
procedure to reach the correct decision about 95% of the time.
Example: If you are evaluating the average effect of a feature on the target variable and compute a 95% confidence interval for the effect size,
you can be 95% confident that the true effect size lies within this interval. This helps in understanding the range of possible effects and the
reliability of your estimates.
P-Value
Definition: The p-value measures the probability of obtaining test results at least as extreme as the observed results, assuming that the null
hypothesis is true. It helps determine the statistical significance of the findings.
Example: If you're testing whether a feature significantly improves model performance, you might use a statistical test to compare model
performance with and without that feature. The p-value helps determine if the observed improvement is significant or if it could have occurred
by chance.
One-Tailed vs. Two-Tailed Tests
•One-Tailed Test: Used when you have a specific direction in mind for the effect or difference you are testing. For example, if you are
testing whether a new drug increases recovery rates compared to an old drug, you might use a one-tailed test if you only care about
the new drug being better, not worse.
   • Critical Region: In a one-tailed test, the critical region is located in only one tail of the distribution (either the upper or lower tail,
     depending on the direction of the test).
•Two-Tailed Test: Used when a difference in either direction matters, i.e., you test whether the parameter is simply different from the
hypothesized value, without specifying a direction.
   • Critical Region: In a two-tailed test, the critical region is split between both tails of the distribution, with α/2 in each tail.
Type II error occurs when the null hypothesis is not rejected even though it is false. In other
words, it is a false negative result. This type of error is also known as a beta error. It is
denoted by the symbol beta (β).
For example, imagine a person is taking a medical test for a disease. If the test result is
negative, but the person is actually positive for the disease, then it is a Type II error. This
means that the test incorrectly failed to reject the null hypothesis that the person is disease-
free.
In summary, Type I error is a false positive result and Type II error is a false negative result
in hypothesis testing. It is important to balance these two types of errors when conducting
hypothesis testing and interpreting the results.
The rejection region is the set of values of the test statistic that leads to rejection of the null hypothesis at
some chosen probability level.
The significance level, denoted as α (alpha), determines the probability of rejecting the null hypothesis when it is actually
true. The commonly used values for the significance level are 0.05 or 0.01, but the choice of significance level is
somewhat subjective and can influence the outcome of the test.
These are all fundamental statistical tests used to analyze different types of data and hypotheses. Here’s a brief overview of each:
1. Z-Test: Used to determine if there is a significant difference between sample and population means, or between means of two samples, when
the population variance is known or the sample size is large (typically n > 30).
•Assumptions:
• Data is normally distributed (or sample size is large enough for the Central Limit Theorem to apply).
• Population variance is known or sample size is large.
• Data is independent.
•Types:
• One-Sample Z-Test: Compares the sample mean to a known population mean.
• Two-Sample Z-Test: Compares the means of two independent samples.
• Z-Test for Proportions: Compares sample proportions to a known proportion.
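A one-sample z-test sketch; the simulated sample, the hypothesized mean of 100, and the known population standard deviation of 15 are illustrative assumptions:

```python
# One-sample z-test (population standard deviation assumed known, large sample).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=103, scale=15, size=50)   # illustrative large sample
mu0, sigma = 100, 15                              # hypothesized mean, known population sd

z = (sample.mean() - mu0) / (sigma / np.sqrt(len(sample)))
p_value = 2 * stats.norm.sf(abs(z))               # two-tailed p-value
print(z, p_value)
```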
2. T-Test: Used to compare means when the population variance is unknown and/or the sample size is small (typically n < 30). The t-test is more
robust than the z-test for small sample sizes.
•Assumptions:
• Data is normally distributed (more important for smaller sample sizes).
• Variances of the populations are equal (in some variations).
• Data is independent.
•Types:
• One-Sample T-Test: Compares the sample mean to a known value (usually the population mean).
• Two-Sample T-Test: Compares the means of two independent samples.
• Equal Variances: Assumes that the variances of the two samples are equal.
• Unequal Variances: Does not assume equal variances (Welch’s T-Test).
• Paired Sample T-Test: Compares means from the same group at different times or under different conditions.
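Sketches of the t-test variants with scipy.stats; the two groups of values are illustrative:

```python
# One-sample, two-sample (equal and unequal variances), and paired t-tests.
from scipy import stats

group_a = [23, 25, 28, 30, 27, 26, 29]
group_b = [31, 33, 29, 35, 32, 34, 30]

print(stats.ttest_1samp(group_a, popmean=25))               # one-sample t-test
print(stats.ttest_ind(group_a, group_b))                    # two-sample, equal variances
print(stats.ttest_ind(group_a, group_b, equal_var=False))   # Welch's t-test
print(stats.ttest_rel(group_a, group_b))                    # paired t-test
```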
3. ANOVA (Analysis of Variance): Used to determine if there are significant differences among means of three or more groups.
It tests the null hypothesis that all group means are equal.
•Assumptions:
• Data is normally distributed within each group.
• Variances are equal across groups (homogeneity of variances).
• Observations are independent.
•Types:
• One-Way ANOVA: Tests differences between group means for one independent variable.
• Two-Way ANOVA: Tests differences between group means for two independent variables, and can also assess
interactions between them.
• Repeated Measures ANOVA: Used when the same subjects are measured multiple times under different conditions.
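A one-way ANOVA sketch with scipy.stats.f_oneway; the three groups of scores are illustrative:

```python
# One-way ANOVA: are the means of three groups equal?
from scipy import stats

group1 = [85, 88, 90, 86, 89]
group2 = [78, 82, 80, 79, 81]
group3 = [92, 95, 93, 96, 94]

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f_stat, p_value)   # a small p-value suggests at least one group mean differs
```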
4. Chi-Square Test: Used to determine if there is a significant association between categorical variables or if a sample
distribution fits an expected distribution.
•Assumptions:
• Data are counts or frequencies.
• Observations are independent.
• Expected frequency counts in each cell of the contingency table should be at least 5 for the test to be valid.
•Types:
• Chi-Square Test of Independence: Assesses whether two categorical variables are independent.
• Chi-Square Test of Homogeneity: Tests if different populations have the same distribution of a categorical variable.
• Chi-Square Goodness of Fit Test: Tests if a sample data fits a specific distribution.
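Chi-square sketches for independence and goodness of fit; the contingency table and die-roll counts are illustrative:

```python
# Chi-square test of independence and goodness-of-fit test.
import numpy as np
from scipy import stats

# Independence: e.g., gender (rows) vs. product preference (columns).
table = np.array([[30, 10],
                  [20, 40]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p)

# Goodness of fit: observed die rolls vs. a fair-die expectation.
observed = [18, 22, 16, 14, 12, 18]
expected = [100 / 6] * 6
print(stats.chisquare(observed, f_exp=expected))
```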
Steps involved in Hypothesis Testing
1. Formulate Hypotheses:
State the null hypothesis (H0) and alternative hypothesis (H1) based on the research question.
2. Select a Test Statistic:
Choose an appropriate test statistic based on the type of data and hypotheses being tested (e.g., t-test, chi-square
test).
3. Set Significance Level (α):
Determine the significance level (α) to control the Type I error rate.
4. Collect Sample Data:
Collect data from a representative sample that is relevant to the hypotheses.
5. Compute Test Statistic and P-value:
Calculate the test statistic from the sample data. Use the test statistic to calculate the p-value.
6. Make Decision:
Compare the p-value to the significance level (α):
If p-value < α: Reject the null hypothesis (evidence against H0).
If p-value ≥ α: Fail to reject the null hypothesis (insufficient evidence against H0).
7. Interpret Results:
Draw conclusions based on the decision made:
Rejecting the null hypothesis supports the alternative hypothesis.
Failing to reject the null hypothesis does not provide sufficient evidence to support the alternative hypothesis.
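An end-to-end sketch of the steps above using a two-sample t-test; the two class sections and their scores are illustrative:

```python
# Hypothesis-testing workflow: hypotheses -> statistic -> alpha -> data -> p-value -> decision.
from scipy import stats

# 1-2. H0: the two sections have equal mean scores; test statistic: t.
section_a = [72, 75, 78, 74, 77, 79, 73, 76]
section_b = [68, 70, 71, 69, 72, 67, 70, 71]

alpha = 0.05                                               # 3. significance level
t_stat, p_value = stats.ttest_ind(section_a, section_b)    # 4-5. statistic and p-value

# 6-7. decision and interpretation
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0; the means differ significantly.")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0.")
```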
Parametric Inference vs Nonparametric Inference
Statistical inference can be broadly categorized into parametric inference and nonparametric inference, depending on the assumptions
made about the underlying distribution of the data. These approaches differ in terms of their flexibility, assumptions, and applicability to
different types of data.
Parametric Inference
•Parametric inference assumes that the data follows a specific distribution characterized by a finite number of parameters (e.g., mean
and variance for a normal distribution).
•Parametric methods estimate these parameters from the data and use them to make inferences about the population.
•The most common way of estimating parameters in parametric modeling is through MAXIMUM LIKELIHOOD ESTIMATION (MLE).
Example:
•Suppose you have a dataset of exam scores from a university course. If you assume that the scores are normally distributed, you can
use parametric methods such as the t-test to compare the mean scores of two groups (e.g., students who attended lectures vs. those
who did not).
Nonparametric Inference
•Nonparametric inference makes fewer assumptions about the underlying data distribution and does not rely on specific parameter
estimates.
•Nonparametric methods are based on ranking or ordering data rather than estimating parameters.
Example:
•If you want to compare the median incomes of two populations but do not assume any specific distribution for the income data, you can
use nonparametric methods like the Wilcoxon rank-sum test (Mann-Whitney U test) to assess differences without assuming normality.
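A nonparametric sketch of the income comparison above with the Wilcoxon rank-sum (Mann-Whitney U) test; the income values are illustrative:

```python
# Mann-Whitney U test: compares two samples without assuming normality.
from scipy import stats

incomes_city_a = [32_000, 35_000, 28_000, 40_000, 150_000, 33_000]
incomes_city_b = [45_000, 48_000, 52_000, 47_000, 50_000, 46_000]

u_stat, p_value = stats.mannwhitneyu(incomes_city_a, incomes_city_b,
                                     alternative="two-sided")
print(u_stat, p_value)
```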
Hypothesis Testing ML Applications
1.Model comparison: Hypothesis testing can be used to compare the performance of different machine learning
models or algorithms on a given dataset. For example, you can use a paired t-test to compare the accuracy or error
rate of two models across multiple cross-validation folds to determine if one model performs significantly better than the
other (see the sketch after this list).
2.Feature selection: Hypothesis testing can help identify which features are significantly related to the target variable
or contribute meaningfully to the model’s performance. For example, you can use a t-test, chi-square test, or ANOVA
to test the relationship between individual features and the target variable. Features with significant relationships can
be selected for building the model, while non-significant features may be excluded.
3.Hyperparameter tuning: Hypothesis testing can be used to evaluate the performance of a model trained with
different hyperparameter settings. By comparing the performance of models with different hyperparameters, you can
determine if one set of hyperparameters leads to significantly better performance.
4.Assessing model assumptions: In some cases, machine learning models rely on certain statistical assumptions,
such as linearity or normality of residuals in linear regression. Hypothesis testing can help assess whether these
assumptions are met, allowing you to determine if the model is appropriate for the data.
5.Outlier detection: Hypothesis testing can be used to test the significance of outliers in a dataset and to determine if
they should be removed or retained in the analysis.
6.Data preprocessing: Hypothesis testing can be used to test and validate assumptions about the data, such as the
normality or independence of the variables, before applying ML algorithms.
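A sketch for item 1 above: comparing two models with a paired t-test over the same cross-validation folds. The dataset and the two model choices are illustrative assumptions:

```python
# Model comparison: paired t-test on per-fold cross-validation accuracies.
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)   # same folds for both models

scores_lr = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
scores_rf = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)

t_stat, p_value = stats.ttest_rel(scores_lr, scores_rf)  # paired across folds
print(scores_lr.mean(), scores_rf.mean(), p_value)
```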
Select the type of Hypothesis test
We choose the type of test statistic based on the predictor variable (quantitative or categorical). Below are a few of the
commonly used tests and when to apply them:
• Quantitative predictor, normal distribution: Z-Test (large sample size; population standard deviation known)
• Quantitative predictor, T distribution: T-Test (sample size less than 30; population standard deviation unknown)
• Quantitative predictor, positively skewed distribution: F-Test (when you want to compare 3 or more variables)
• Quantitative predictor, negatively skewed distribution: NA (requires feature transformation to perform a hypothesis test)
• Categorical predictor: Chi-Square test (test of independence; goodness of fit)