EDA, Feature Engineering, Estimation, Inference and Hypothesis Testing

The document discusses Exploratory Data Analysis (EDA) as a vital step in data science, focusing on understanding data characteristics, identifying patterns, and preparing data for modeling. It outlines the importance of EDA, types of analysis (univariate, bivariate, multivariate), tools for EDA, and steps involved in the process. Additionally, it covers feature engineering, emphasizing the creation and transformation of features to enhance machine learning model performance.


Unit 2:

• Exploratory Data Analysis for Machine Learning


• Feature Engineering and variable transformation
• Estimation and Inference
• Hypothesis testing
Exploratory Data Analysis
What is Exploratory Data Analysis (EDA)?
EDA involves a comprehensive range of activities, including data integration, analysis,
cleaning, transformation, and dimension reduction.
Exploratory Data Analysis (EDA) is a crucial initial step in data science projects. It is the method of studying and exploring datasets, through analysis and visualization, to understand their key characteristics, uncover patterns, locate outliers, and identify relationships between variables.
EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.
Why is Exploratory Data Analysis Important?

Exploratory Data Analysis (EDA) is important for several reasons, especially in the context of data science and statistical modeling. It involves analyzing
and visualizing data to understand its main characteristics, uncover patterns, and identify relationships between variables.
Here are some of the key reasons why EDA is a critical step in the data analysis process:

• Understanding Data Structures: EDA helps in getting familiar with the dataset, understanding the number of features, the type of data in each
feature, and the distribution of data points. This understanding is crucial for selecting appropriate analysis or prediction techniques.

• Identifying Patterns and Relationships: Through visualizations and statistical summaries, EDA can reveal hidden patterns and intrinsic
relationships between variables. These insights can guide further analysis and enable more effective feature engineering and model building.

• Detecting Anomalies and Outliers: EDA is essential for identifying errors or unusual data points that may adversely affect the results of your
analysis. Detecting these early can prevent costly mistakes in predictive modeling and analysis.

• Testing Assumptions: Many statistical models assume that data follow a certain distribution or that variables are independent. EDA involves
checking these assumptions. If the assumptions do not hold, the conclusions drawn from the model could be invalid.

• Informing Feature Selection and Engineering: Insights gained from EDA can inform which features are most relevant to include in a model and
how to transform them (scaling, encoding) to improve model performance.

• Optimizing Model Design: By understanding the data’s characteristics, analysts can choose appropriate modeling techniques, decide on the
complexity of the model, and better tune model parameters.

• Facilitating Data Cleaning: EDA helps in spotting missing values and errors in the data, which are critical to address before further analysis to
improve data quality and integrity.
Types of Exploratory Data Analysis
1. Univariate Analysis
Univariate analysis focuses on a single variable to understand its internal structure. It is primarily concerned with describing the data
and finding patterns existing in a single feature. Common techniques include:
 Histograms: Used to visualize the distribution of a variable.
 Box plots: Useful for detecting outliers and understanding the spread and skewness (which quantifies the degree to which the data deviates
from a perfectly symmetrical distribution) of the data.
 Bar charts: Employed for categorical data to show the frequency of each category.
 Summary statistics: Calculations like mean, median, mode, variance, and standard deviation that describe the central tendency
and dispersion of the data.
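
As a brief illustration (not from the original slides), a minimal univariate pass in Python with pandas, Matplotlib, and Seaborn might look like the sketch below; the file path and the column names `age` and `gender` are placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("data.csv")  # placeholder path

# Summary statistics: central tendency and dispersion of one variable.
print(df["age"].describe())
print("Skewness:", df["age"].skew())

# Histogram: distribution of a single numeric variable.
df["age"].plot(kind="hist", bins=30, title="Distribution of age")
plt.show()

# Box plot: spread, skewness, and outliers.
sns.boxplot(x=df["age"])
plt.show()

# Bar chart: frequency of each category of a categorical variable.
df["gender"].value_counts().plot(kind="bar", title="Category frequencies")
plt.show()
```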

2. Bivariate Analysis
Bivariate analysis is a crucial form of exploratory data analysis that examines the relationship between two variables. It helps find
associations, correlations, and dependencies between pairs of variables. Some key techniques used in bivariate analysis include:
 Scatter Plots: These are one of the most common tools used in bivariate analysis. A scatter plot helps visualize the relationship
between two continuous variables.
 Correlation Coefficient: This statistical measure (often Pearson’s correlation coefficient for linear relationships) quantifies the
degree to which two variables are related.
 Cross-tabulation: Also known as contingency tables, cross-tabulation is used to analyze the relationship between two categorical
variables. It shows the frequency distribution of categories of one variable in rows and the other in columns, which helps in
understanding the relationship between the two variables.
 Line Graphs: In the context of time series data, line graphs can be used to compare two variables over time. This helps in
identifying trends, cycles, or patterns that emerge in the interaction of the variables over the specified period.
 Covariance: Covariance is a measure used to determine how much two random variables change together. However, it is
sensitive to the scale of the variables, so it’s often supplemented by the correlation coefficient for a more standardized
assessment of the relationship.
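
A small hedged sketch of these bivariate techniques with pandas; the column names `temperature`, `ice_cream_sales`, `gender`, and `purchased` are illustrative placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # placeholder path

# Scatter plot of two continuous variables.
df.plot(kind="scatter", x="temperature", y="ice_cream_sales")
plt.show()

# Pearson correlation coefficient and covariance.
print("Correlation:", df["temperature"].corr(df["ice_cream_sales"]))
print("Covariance:", df["temperature"].cov(df["ice_cream_sales"]))

# Cross-tabulation (contingency table) of two categorical variables.
print(pd.crosstab(df["gender"], df["purchased"]))
```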
3. Multivariate Analysis
Multivariate analysis examines the relationships between two or more variables in the dataset. It aims to understand
how variables interact with one another, which is crucial for most statistical modeling techniques. Techniques include:
 Pair plots: Visualize relationships across several variables simultaneously to capture a comprehensive view of
potential interactions.
 Principal Component Analysis (PCA): A dimensionality reduction technique used to reduce the dimensionality
of large datasets, while preserving as much variance as possible.
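
A minimal sketch of a pair plot and PCA with Seaborn and scikit-learn, assuming a DataFrame whose numeric columns are the features of interest (the path and columns are placeholders):

```python
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data.csv")           # placeholder path
numeric = df.select_dtypes("number")   # keep numeric columns only

# Pair plot: pairwise relationships across several variables at once.
sns.pairplot(numeric)

# Standardize, then project onto the first two principal components.
X = StandardScaler().fit_transform(numeric)
pca = PCA(n_components=2)
components = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```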
Univariate, Bivariate and Multivariate data and its analysis

Univariate analysis:
• Summarizes only a single variable at a time.
• Does not deal with causes and relationships.
• Does not contain any dependent variable.
• The main purpose is to describe.
• Example: the heights of a group of people.

Bivariate analysis:
• Summarizes two variables.
• Deals with causes and relationships, and analysis is done.
• Contains only one dependent variable.
• The main purpose is to explain.
• Example: temperature and ice cream sales during summer vacation.

Multivariate analysis:
• Summarizes more than two variables.
• Does not deal with causes and relationships, and analysis is done.
• Similar to bivariate, but contains more than two variables.
• The main purpose is to study the relationship among the variables.
• Example: an advertiser wants to compare the popularity of four advertisements on a website; click rates are measured for both men and women, and relationships between the variables can be examined.
Tools for Performing Exploratory Data Analysis
Exploratory Data Analysis (EDA) can be effectively performed using a variety of tools and software, each
offering unique features suitable for handling different types of data and analysis requirements.

1. Python Libraries
 Pandas: Provides extensive functions for data manipulation and analysis, including data structure
handling and time series functionality.
 Matplotlib: A plotting library for creating static, interactive, and animated visualizations in Python.
 Seaborn: Built on top of Matplotlib, it provides a high-level interface for drawing attractive and informative
statistical graphics.
 Plotly: An interactive graphing library for making interactive plots and offers more sophisticated
visualization capabilities.

2. R Packages
 ggplot2: A powerful tool for making complex plots from data in a data frame.
 dplyr: A grammar of data manipulation, providing a consistent set of verbs that help you solve the most
common data manipulation challenges.
 tidyr: Helps to tidy your data. Tidying your data means storing it in a consistent form that matches the
semantics of the dataset with the way it is stored.
Steps for Performing Exploratory Data Analysis
Performing Exploratory Data Analysis (EDA) involves a series of steps designed to help you understand the data you are working with, uncover underlying patterns, identify anomalies, test hypotheses, and ensure the data is clean and suitable for further analysis.
Step 1: Understand the Problem and the Data
The first step in any data analysis project is to clearly understand the problem you are trying to solve and the data you have at your disposal. This involves asking questions such as:
 What is the business goal or research question you are trying to address?
 What are the variables in the data, and what do they mean?
 What are the data types (numerical, categorical, text, etc.)?
 Are there any known data quality issues or limitations?
 Are there any relevant domain-specific considerations or constraints?
By thoroughly understanding the problem and the data, you can better formulate your analysis approach and avoid making incorrect assumptions or drawing misguided conclusions.

Step 2: Import and Inspect the Data

Once you have a clear understanding of the problem and the data, the next step is to import the data into your analysis environment (e.g., Python, R, or a spreadsheet program). Here are a few tasks you can carry out at this stage:
 Load the data into your analysis environment, ensuring that it is imported correctly and without errors or truncation.
 Examine the size of the data (number of rows and columns) to get a sense of its scale and complexity.
 Check for missing values and their distribution across variables, as missing data can notably affect the quality and reliability of your analysis.
 Identify data types and formats for each variable, as this information is necessary for the subsequent data manipulation and analysis steps.
 Look for any apparent errors or inconsistencies in the data, such as invalid values, mismatched units, or outliers, that may indicate quality issues with the data.
Step 3: Handle Missing Data
Here are some techniques you can use to handle missing data:
 Understand the patterns and potential reasons for missing data: Is the data missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR)? Understanding the underlying mechanism can inform the appropriate method for handling missing data.
 Decide whether to remove observations with missing values (listwise deletion) or impute (fill in) missing values.
 Use suitable imputation strategies, such as mean/median imputation, regression imputation, multiple imputation, or machine-learning-based imputation methods like k-nearest neighbors (KNN) or decision trees. The choice of imputation technique should be based on the characteristics of the data and the assumptions underlying each method.
 Consider the impact of missing data: even after imputation, missing data can introduce uncertainty and bias. It is important to acknowledge these limitations and interpret your results with caution.
Handling missing data properly can improve the accuracy and reliability of your analysis and prevent biased or misleading conclusions.
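
A hedged sketch of these options using pandas and scikit-learn imputers (the file path and columns are placeholders):

```python
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

df = pd.read_csv("data.csv")  # placeholder path

# Inspect how many values are missing in each column.
print(df.isna().sum())

# Option 1: listwise deletion (drop rows containing any missing value).
df_complete = df.dropna()

# Option 2: median imputation for numeric columns.
num_cols = df.select_dtypes("number").columns
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])

# Option 3: KNN-based imputation (numeric features only).
# df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])
```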

Step 4: Explore Data Characteristics

After addressing missing data, the next step in the EDA process is to explore the characteristics of your data. This involves examining each variable's distribution, central tendency, and variability, and identifying any potential outliers or anomalies. Understanding the characteristics of your data is critical for choosing appropriate analytical techniques, spotting potential data quality issues, and gaining insights that can inform subsequent analysis and modeling decisions.
Calculate summary statistics (mean, median, mode, standard deviation, skewness, kurtosis, and so on) for numerical variables: these statistics provide a concise overview of the distribution and central tendency of each variable, aiding in the identification of potential issues or deviations from expected patterns.
Step 5: Perform Data Transformation
Data transformation is a critical step in the EDA process because it prepares the data for analysis and modeling.
Here are a few common data transformation techniques (a short code sketch follows this list):
 Scaling or normalizing numerical variables to a standard range (e.g., min-max scaling, standardization).
 Encoding categorical variables for use in machine learning models (e.g., one-hot encoding, which represents categories as binary numerical columns, or label encoding).
 Applying mathematical transformations to numerical variables (e.g., logarithmic, square root) to correct for skewness or non-linearity.
 Creating derived variables or features based on existing variables (e.g., calculating ratios, combining variables).
 Aggregating or grouping records based on specific variables or conditions.
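
A minimal sketch of the transformations above, assuming placeholder columns `income`, `city`, `household_size`, and `region`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv("data.csv")  # placeholder path and columns

# Scaling / standardizing a numeric column.
df["income_minmax"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()
df["income_std"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# One-hot encoding of a categorical column.
df = pd.get_dummies(df, columns=["city"])

# Log transform to reduce right skew (log1p handles zeros safely).
df["income_log"] = np.log1p(df["income"])

# Derived feature and a simple group-wise aggregation.
df["income_per_member"] = df["income"] / df["household_size"]
print(df.groupby("region")["income"].mean())
```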

Step 6: Visualize Data Relationships

Visualization is a powerful tool in the EDA process, as it helps uncover relationships between variables and identify patterns or trends that may not be immediately apparent from summary statistics or numerical outputs.
To visualize data relationships, explore univariate, bivariate, and multivariate analysis.
 Create frequency tables, bar plots, and pie charts for categorical variables: these visualizations can help you understand the distribution of categories and identify any imbalances or unusual patterns.
 Generate histograms, box plots, violin plots, and density plots to visualize the distribution of numerical variables. These visualizations can reveal important information about the shape, spread, and potential outliers in the data.
 Examine the correlation or association among variables using scatter plots, correlation matrices, or statistical tests like Pearson's correlation coefficient or Spearman's rank correlation: understanding the relationships between variables can inform feature selection, dimensionality reduction, and modeling decisions.
Step 7: Handling Outliers
An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Outliers can be caused by measurement or execution errors. The analysis used for outlier detection is referred to as outlier mining. There are many ways to detect outliers, and removing them from a dataframe works the same way as removing any other rows from a pandas DataFrame.
Identify and inspect potential outliers using techniques such as the interquartile range (IQR), Z-scores, or domain-specific rules: outliers can considerably impact the results of statistical analyses and machine learning models, so it is essential to identify and handle them appropriately, as in the sketch below.
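
A short sketch of IQR-based and Z-score-based outlier detection (the column name `income` is a placeholder):

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("data.csv")  # placeholder path

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = df[(df["income"] < lower) | (df["income"] > upper)]
print("IQR outliers:", len(iqr_outliers))

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = stats.zscore(df["income"].dropna())
print("Z-score outliers:", (abs(z) > 3).sum())

# Optionally drop the IQR outliers before further analysis.
df_clean = df.drop(iqr_outliers.index)
```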

Step 8: Communicate Findings and Insights

The final step in the EDA process is to communicate your findings and insights effectively. This includes summarizing your analysis, highlighting key discoveries, and presenting your results clearly and compellingly.
Here are a few tips for effective communication:
 Clearly state the objectives and scope of your analysis.
 Provide context and background information to help others understand your approach.
 Use visualizations and graphics to support your findings and make them more accessible.
 Highlight key insights, patterns, or anomalies discovered during the EDA process.
 Discuss any limitations or caveats related to your analysis.
 Suggest potential next steps or areas for further investigation.
Effective communication is critical for ensuring that your EDA efforts have a meaningful impact and that your insights are understood and acted upon by stakeholders.
Feature Engineering
What is a Feature?
In the context of machine learning, a feature (also known as a variable or attribute) is an individual measurable property or
characteristic of a data point that is used as input for a machine learning algorithm. Features can be numerical, categorical, or
text-based, and they represent different aspects of the data that are relevant to the problem at hand.
For example, in a dataset of housing prices, features could include the number of bedrooms, the square footage, the location,
and the age of the property. In a dataset of customer demographics, features could include age, gender, income level, and
occupation.
The choice and quality of features are critical in machine learning, as they can greatly impact the accuracy and performance of
the model.

What is Feature Engineering?


Feature Engineering is the process of creating new features or transforming existing features to improve the
performance of a machine-learning model. It involves selecting relevant information from raw data and transforming it
into a format that can be easily understood by a model. The goal is to improve model accuracy by providing more
meaningful and relevant information.
Need for Feature Engineering in Machine Learning?

We engineer features for various reasons, and some of the main reasons include:
•Improve User Experience: The primary reason we engineer features is to enhance the user experience of a product
or service. By adding new features, we can make the product more intuitive, efficient, and user-friendly, which can
increase user satisfaction and engagement.

•Competitive Advantage: Another reason we engineer features is to gain a competitive advantage in the
marketplace. By offering unique and innovative features, we can differentiate our product from competitors and attract
more customers.

•Meet Customer Needs: We engineer features to meet the evolving needs of customers. By analyzing user feedback,
market trends, and customer behavior, we can identify areas where new features could enhance the product’s value
and meet customer needs.

•Increase Revenue: Features can also be engineered to generate more revenue. For example, a new feature that
streamlines the checkout process can increase sales, or a feature that provides additional functionality could lead to
more upsells or cross-sells.

•Future-Proofing: Engineering features can also be done to future-proof a product or service, that is, to design and develop
features in a way that ensures the product or service remains relevant, adaptable, and effective in the future, even as
technology, market conditions, or user needs change. By anticipating future trends and potential customer needs, we can
develop features that ensure the product remains relevant and useful in the long term.
Processes Involved in Feature Engineering

Feature engineering in Machine learning consists of mainly 5 processes:


1. Feature Creation,
2. Feature Transformation,
3. Feature Extraction,
4. Feature Selection, and
5. Feature Scaling.

It is an iterative process that requires experimentation and testing to find the best combination of features for a given
problem.

The success of a machine learning model largely depends on the quality of the features used in the model.
1. Feature Creation
Feature creation refers to the creation of new features from existing data to help with better predictions.
Examples of feature creation include: one-hot-encoding, binning, splitting, and calculated features.

Types of Feature Creation:

1.Domain-Specific: Creating new features based on domain knowledge, such as creating features based on
business rules or industry standards.
2.Data-Driven: Creating new features by observing patterns in the data, such as calculating aggregations or
creating interaction features.
3.Synthetic: Generating new features by combining existing features or synthesizing new data points.

Why Feature Creation?

1.Improves Model Performance: By providing additional and more relevant information to the model, feature
creation can increase the accuracy and precision of the model.
2.Increases Model Robustness: By adding additional features, the model can become more robust to outliers
and other anomalies.
3.Improves Model Interpretability: By creating new features, it can be easier to understand the model’s
predictions.
4.Increases Model Flexibility: By adding new features, the model can be made more flexible to handle different
types of data.
2. Feature Transformation
Feature transformation and imputation include steps for replacing missing features or features that are not valid.
Some techniques include: forming Cartesian products of features, non-linear transformations (such as binning numeric
variables into categories), and creating domain-specific features.

Types of Feature Transformation:


1.Normalization: Rescaling the features to have a similar range, such as between 0 and 1, to prevent some features
from dominating others.
2.Scaling: Transforming numerical variables to a similar scale, such as having a standard deviation of 1, so that they can be
compared more easily and the model considers all features equally.
3.Encoding: Transforming categorical features into a numerical representation. Examples are one-hot encoding and
label encoding.
4.Transformation: Transforming the features using mathematical operations to change the distribution or scale of the
features. Examples are logarithmic, square root, and reciprocal transformations.

Why Feature Transformation?


1.Improves Model Performance: By transforming the features into a more suitable representation, the model can
learn more meaningful patterns in the data.
2.Increases Model Robustness: Transforming the features can make the model more robust to outliers and other
anomalies.
3.Improves Computational Efficiency: The transformed features often require fewer computational resources.
4.Improves Model Interpretability: By transforming the features, it can be easier to understand the model’s
predictions.
3. Feature Extraction
Feature extraction involves reducing the amount of data to be processed using dimensionality reduction techniques.
These techniques include: Principal Components Analysis (PCA), Independent Component Analysis (ICA), and Linear
Discriminant Analysis (LDA). This reduces the amount of memory and computing power required, while still accurately
maintaining original data characteristics.

Types of Feature Extraction:


1.Dimensionality Reduction: Reducing the number of features by transforming the data into a lower-dimensional
space while retaining important information. Examples are PCA and t-SNE.
2.Feature Combination: Combining two or more existing features to create a new one. For example, the interaction
between two features.
3.Feature Aggregation: Aggregating features to create a new one. For example, calculating the mean, sum, or count
of a set of features.
4.Feature Transformation: Transforming existing features into a new representation. For example, log transformation
of a feature with a skewed distribution.

Why Feature Extraction?


1.Improves Model Performance: By creating new and more relevant features, the model can learn more meaningful
patterns in the data.
2.Reduces Overfitting: By reducing the dimensionality of the data, the model is less likely to overfit the training data.
3.Improves Computational Efficiency: The transformed features often require fewer computational resources.
4.Improves Model Interpretability: By creating new features, it can be easier to understand the model’s predictions.
4. Feature Selection
Feature selection is the process of selecting a subset of extracted features. This is the subset that is relevant and
contributes to minimizing the error rate of a trained model. Feature importance score and correlation matrix can be
factors in selecting the most relevant features for model training.

Types of Feature Selection:


1.Filter Method: Based on the statistical measure of the relationship between the feature and the target variable.
Features with a high correlation are selected.
2.Wrapper Method: Based on the evaluation of the feature subset using a specific machine learning algorithm. The
feature subset that results in the best performance is selected.
3.Embedded Method: Based on the feature selection as part of the training process of the machine learning
algorithm.

Why Feature Selection?


1.Reduces Overfitting: By using only the most relevant features, the model can generalize better to new data.
2.Improves Model Performance: Selecting the right features can improve the accuracy, precision, and recall of the
model.
3.Decreases Computational Costs: A smaller number of features requires less computation and storage
resources.
4.Improves Interpretability: By reducing the number of features, it is easier to understand and interpret the results
of the model.
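
As one hedged illustration of the filter method, scikit-learn's SelectKBest ranks features by a statistical score against the target; the built-in breast cancer dataset is used here purely for demonstration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# Example data; in practice X and y come from your own dataset.
X, y = load_breast_cancer(return_X_y=True)

# Filter method: keep the 10 features with the highest ANOVA F-scores.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print("Original shape:", X.shape)
print("Reduced shape:", X_selected.shape)
print("Selected feature indices:", selector.get_support(indices=True))
```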
5. Feature Scaling
•Feature Scaling is the process of transforming the features so that they have a similar scale. This is important in
machine learning because the scale of the features can affect the performance of the model.

Types of Feature Scaling:


1.Min-Max Scaling: Rescaling the features to a specific range, such as between 0 and 1, by subtracting the
minimum value and dividing by the range.
2.Standard Scaling: Rescaling the features to have a mean of 0 and a standard deviation of 1 by subtracting the
mean and dividing by the standard deviation.
3.Robust Scaling: Rescaling the features to be robust to outliers by subtracting the median and dividing by the interquartile range.

Why Feature Scaling?


1.Improves Model Performance: By transforming the features to have a similar scale, the model can learn from all
features equally and avoid being dominated by a few large features.
2.Increases Model Robustness (the ability of a computer system to handle errors and incorrect input while it is running): By
transforming the features to be robust to outliers, the model can become more robust to anomalies.

3.Improves Computational Efficiency: Many machine learning algorithms, such as k-nearest neighbors, are
sensitive to the scale of the features and perform better with scaled features.
4.Improves Model Interpretability: By transforming the features to have a similar scale, it can be easier to
understand the model’s predictions.
Techniques Used in Feature Engineering
Feature engineering is the process of transforming raw data into features that are suitable for machine learning
models. There are various techniques that can be used in feature engineering to create new features by combining or
transforming the existing ones. The following are some of the commonly used feature engineering techniques:

One-Hot Encoding
One-hot encoding is a technique used to transform categorical variables into numerical values that can be used by machine
learning models. In this technique, each category is transformed into a binary value indicating its presence or absence. For
example, consider a categorical variable “Colour” with three categories: Red, Green, and Blue. One-hot encoding would
transform this variable into three binary variables: Colour_Red, Colour_Green, and Colour_Blue, where the value of each variable
would be 1 if the corresponding category is present and 0 otherwise.
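
A minimal sketch of this Colour example with pandas:

```python
import pandas as pd

df = pd.DataFrame({"Colour": ["Red", "Green", "Blue", "Red"]})

# One binary column per category: Colour_Blue, Colour_Green, Colour_Red.
encoded = pd.get_dummies(df, columns=["Colour"])
print(encoded)
```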

Binning
Binning is a technique used to transform continuous variables into categorical variables. In this technique, the range of values of
the continuous variable is divided into several bins, and each bin is assigned a categorical value. For example, consider a
continuous variable “Age” with values ranging from 18 to 80. Binning would divide this variable into several age groups such as
18-25, 26-35, 36-50, and 51-80, and assign a categorical value to each age group.
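
A small sketch of the age-binning example using pandas cut:

```python
import pandas as pd

ages = pd.Series([18, 22, 30, 41, 55, 67, 80])

# Bin the continuous Age values into the groups described above.
age_groups = pd.cut(
    ages,
    bins=[17, 25, 35, 50, 80],
    labels=["18-25", "26-35", "36-50", "51-80"],
)
print(age_groups.value_counts())
```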

Scaling
The most common scaling techniques are standardization and normalization. Standardization scales the variable so that it has
zero mean and unit variance. Normalization scales the variable so that it has a range of values between 0 and 1.
Feature Split
Feature splitting is a powerful technique used in feature engineering to improve the performance of machine
learning models. It involves dividing single features into multiple sub-features or groups based on specific criteria.
This process unlocks valuable insights and enhances the model’s ability to capture complex relationships and
patterns within the data.

Text Data Preprocessing


Text data requires special preprocessing techniques before it can be used by machine learning models. Text
preprocessing involves removing stop words, stemming, lemmatization, and vectorization. Stop words are common
words that do not add much meaning to the text, such as “the” and “and”. Stemming involves reducing words to
their root form, such as converting “running” to “run”. Lemmatization is similar to stemming, but it reduces words to
their base form, such as converting “running” to “run”. Vectorization involves transforming text data into numerical
vectors that can be used by machine learning models.
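
A hedged sketch combining NLTK's PorterStemmer with scikit-learn's CountVectorizer for stop-word removal and vectorization (lemmatization with NLTK's WordNetLemmatizer works similarly but requires downloading the WordNet corpus); the example sentences are made up:

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The cats are running in the garden",
    "A cat ran across the running track",
]

# Stemming: reduce words to their root form ("running" -> "run").
stemmer = PorterStemmer()
stemmed = [" ".join(stemmer.stem(word) for word in doc.split()) for doc in docs]

# Stop-word removal and vectorization into a document-term count matrix.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(stemmed)
print(vectorizer.get_feature_names_out())
print(X.toarray())
```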
Feature Engineering Tools
There are several tools available for feature engineering. Here are some popular ones:
1. Featuretools
Featuretools is a Python library that enables automatic feature engineering for structured data. It can extract features
from multiple tables, including relational databases and CSV files, and generate new features based on user-defined
primitives.
2. TPOT
TPOT (Tree-based Pipeline Optimization Tool) is an automated machine learning tool that includes feature engineering
as one of its components. It uses genetic programming to search for the best combination of features and machine
learning algorithms for a given dataset.
3. DataRobot
DataRobot is a machine learning automation platform that includes feature engineering as one of its capabilities. It
uses automated machine learning techniques to generate new features and select the best combination of features
and models for a given dataset.
4. Alteryx
Alteryx is a data preparation and automation tool that includes feature engineering as one of its features. It provides a
visual interface for creating data pipelines that can extract, transform, and generate features from multiple data
sources.
5. H2O.ai
H2O.ai is an open-source machine learning platform that includes feature engineering as one of its capabilities. It
provides a range of automated feature engineering techniques, such as feature scaling, imputation, and encoding, as
well as manual feature engineering capabilities for more advanced users.
Variable Transformation
Variable Transformation

Variable transformation involves changing the form or structure of variables in a dataset to make them more suitable for analysis,
improve model performance, or address specific statistical requirements. Transformations can help in normalizing data, handling
non-linearity, reducing skewness, or improving interpretability.

Types of Variable Transformation

1.Scaling Transformations:
1. Normalization (Min-Max Scaling): Scales variables to a fixed range, usually [0, 1]. Useful when variables have different
units or scales.

2. Standardization (Z-score Normalization): Converts variables to have a mean of 0 and a standard deviation of 1. Useful for
algorithms that assume normally distributed data.

2.Logarithmic Transformation:
1. Purpose: Reduces skewness (skewness measures the deviation of a variable's distribution from a perfectly symmetrical, normal-like shape)
and handles exponential growth patterns by converting values to their logarithms. Often used for highly skewed data.

3.Square Root Transformation:


1. Purpose: Similar to the logarithmic transformation, it reduces skewness but is less aggressive. Used for moderately
skewed data.
4. Power Transformation:
1. Purpose: Applies a power function to the data to stabilize variance and make the data more normally distributed. Includes various
specific transformations.
1. Box-Cox Transformation: Includes both logarithmic and power transformations with a parameter λ that is optimized.

2. Yeo-Johnson Transformation: An extension of Box-Cox that handles zero and negative values.
5. Binning:
1. Purpose: Converts continuous variables into categorical ones by grouping values into bins or intervals. Useful for simplifying data or
handling non-linear relationships.
1. Example: Grouping ages into bins like 0-18, 19-35, 36-50, etc.

6. Polynomial Transformation:
1. Purpose: Adds polynomial terms (squares, cubes) of the original variables to capture non-linear relationships. Useful for polynomial
regression models.

7. Categorical Encoding:
1. Purpose: Converts categorical variables into numerical formats for use in algorithms that require numerical input. Includes various
methods.
1. One-Hot Encoding: Creates binary columns for each category.
2. Label Encoding: Assigns integer values to categories.

8. Rank Transformation:
1. Purpose: Converts values to their ranks (ordinal position) to handle outliers and make data more robust to non-normal distributions.
1. Formula: Assigns ranks based on the ordering of values.
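
A hedged sketch applying several of these transformations to a simulated, right-skewed column named `income` (the data and column name are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=1, size=1000)})

# Logarithmic and square-root transformations to reduce right skew.
df["income_log"] = np.log(df["income"])
df["income_sqrt"] = np.sqrt(df["income"])

# Box-Cox (strictly positive data) and Yeo-Johnson (handles zero/negative values).
df["income_boxcox"] = PowerTransformer(method="box-cox").fit_transform(df[["income"]]).ravel()
df["income_yeojohnson"] = PowerTransformer(method="yeo-johnson").fit_transform(df[["income"]]).ravel()

# Rank transformation to blunt the influence of outliers.
df["income_rank"] = df["income"].rank()

# Compare skewness before and after each transformation.
print(df.skew().round(2))
```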
Estimation, Inferences & Hypothesis Testing
In machine learning, estimation, inference, and hypothesis testing are fundamental concepts that help in understanding, evaluating, and
improving models. Here's a breakdown of these concepts and their relevance to machine learning:

Estimation
Estimation refers to the process of determining the parameters of a model based on observed data. In machine learning, this involves fitting a
model to data to approximate underlying relationships or patterns.

Methods:
•Maximum Likelihood Estimation (MLE): This method estimates parameters by maximizing the likelihood function, which measures how likely
the observed data is given the parameters. It is widely used due to its desirable properties like asymptotic unbiasedness and efficiency.

•Bayesian Estimation: This approach incorporates prior beliefs about the parameters (prior distributions) and updates them with the data to form
posterior distributions. Bayesian estimation provides a full probability distribution for the parameters rather than a single estimate.

•Least Squares Estimation: Often used in linear regression, this method minimizes the sum of the squared differences between observed and
predicted values.
Example in Machine Learning: Estimating the parameters (weights) of a neural network through backpropagation and gradient descent.

There are two main types of estimation:


Point Estimation: This involves providing a single best guess or value for a parameter.

Interval Estimation: This provides a range of values within which the parameter is expected to fall with a certain level of confidence.
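
A small sketch contrasting a point estimate with a 95% interval estimate using SciPy's t distribution; the data are simulated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=100, scale=15, size=50)  # simulated observations

# Point estimate of the population mean: a single best guess.
mean = sample.mean()

# Interval estimate: a 95% confidence interval (t-based, variance unknown).
sem = stats.sem(sample)  # standard error of the mean
ci = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)

print(f"Point estimate: {mean:.2f}")
print(f"95% confidence interval: ({ci[0]:.2f}, {ci[1]:.2f})")
```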
Inference
•Inference is about using these parameters to draw broader conclusions, make predictions, or understand the significance of the
model components.

It can be broadly categorized into:


Predictive Inference: Making predictions about new, unseen data based on the model. For instance, after training a model on
historical data, you use it to predict future events or classify new data points.

Statistical Inference: This involves making judgments about the model parameters or data generation process. It can include
hypothesis testing, confidence intervals, and understanding the significance of different features.

Techniques in Inference
Hypothesis Testing: Involves testing a hypothesis about a parameter or model structure. For instance, you might test whether a
certain feature significantly contributes to the prediction or whether a model performs better than a baseline.
Confidence Intervals: Provide a range of values for an estimate, indicating the reliability and precision of the estimate.
Model Evaluation Metrics: Techniques like cross-validation, AIC (Akaike Information Criterion), and BIC (Bayesian Information
Criterion) help in assessing model performance and generalizability.

Methods:
•Point Predictions: Using the model to output specific values or classes for given inputs.
•Probabilistic Predictions: Estimating the probability distribution over possible outcomes, such as predicting the probability of a
class in classification problems.
Hypothesis Testing
Hypothesis testing in machine learning involves evaluating whether certain assumptions or hypotheses about the model or data
hold true. This helps in validating the model's performance, understanding feature importance, or comparing different models.
Level of Significance (α)
•The level of significance, denoted as α (alpha), is the threshold you set for deciding whether to reject the null hypothesis. It represents the
probability of making a Type I error, which occurs when you incorrectly reject a true null hypothesis.
•Typical values for α are 0.01, 0.05, or 0.10 (i.e., 1%, 5%, or 10%). For example, an α of 0.05 means you are willing to accept a 5% chance of incorrectly
rejecting the null hypothesis.
Example: Suppose you are comparing two models and perform a statistical test to determine if their performance differences are significant. If
you set α = 0.01, you are willing to accept a 1% chance of incorrectly concluding that there is a significant difference when there isn’t one.

Confidence Level (1 - α)
•The confidence level is the complement of the level of significance. It represents the proportion of times you would expect to correctly reject the
null hypothesis if you repeated the study multiple times.
•If your α is 0.05, your confidence level is 95% (i.e., 1 - 0.05 = 0.95). This means you expect to make the correct decision (either rejecting or not
rejecting the null hypothesis) 95% of the time.
Example: If you are evaluating the average effect of a feature on the target variable and compute a 95% confidence interval for the effect size,
you can be 95% confident that the true effect size lies within this interval. This helps in understanding the range of possible effects and the
reliability of your estimates.

P-Value
Definition: The p-value measures the probability of obtaining test results at least as extreme as the observed results, assuming that the null
hypothesis is true. It helps determine the statistical significance of the findings.
Example: If you're testing whether a feature significantly improves model performance, you might use a statistical test to compare model
performance with and without that feature. The p-value helps determine if the observed improvement is significant or if it could have occurred
by chance.
One-Tailed vs. Two-Tailed Tests
•One-Tailed Test: This test is used when you have a specific
direction in mind for the effect or difference you are testing.
For example, if you are testing whether a new drug increases
recovery rates compared to an old drug, you might use a
one-tailed test if you only care about the new drug being
better, not worse.
• Critical Region: In a one-tailed test, the critical
region is located in only one tail of the distribution
(either the upper or lower tail, depending on the
direction of the test).

•Two-Tailed Test: This test is used when you are interested in


deviations in both directions. For example, if you are testing
whether a new drug has a different effect (either better or
worse) compared to an old drug, you would use a two-tailed
test.
• Critical Region: In a two-tailed test, the critical
region is split between both tails of the distribution.
Each tail contains half of the total α level. For
instance, with α = 0.05, each tail would have 0.025
(2.5%) of the significance level.
Type I and Type II errors are two types of errors that can occur in
hypothesis testing.
Type I error occurs when the null hypothesis is rejected even though it is true. In other
words, it is a false positive result. This type of error is also known as a false alarm or an
alpha error. It is denoted by the symbol alpha (α).
For example, imagine a person is taking a pregnancy test. If the test result is positive, but
the person is actually not pregnant, then it is a Type I error. This means that the test
incorrectly rejected the null hypothesis that the person is not pregnant.

Type II error occurs when the null hypothesis is not rejected even though it is false. In other
words, it is a false negative result. This type of error is also known as a beta error. It is
denoted by the symbol beta (β).

For example, imagine a person is taking a medical test for a disease. If the test result is
negative, but the person is actually positive for the disease, then it is a Type II error. This
means that the test incorrectly failed to reject the null hypothesis that the person is disease-
free.

In summary, Type I error is a false positive result and Type II error is a false negative result
in hypothesis testing. It is important to balance these two types of errors when conducting
hypothesis testing and interpreting the results.
The rejection region is the region of values that corresponds to the rejection of the null hypothesis at
some chosen probability level

The significance level denoted as α (alpha),determines the probability of rejecting the null hypothesis when it is actually
true. The commonly used values for the significance level are 0.05 or 0.01, but the choice of significance level is
somewhat subjective and can influence the outcome of the test.
These are all fundamental statistical tests used to analyze different types of data and hypotheses. Here’s a brief overview of each:
1. Z-Test: Used to determine if there is a significant difference between sample and population means, or between means of two samples, when
the population variance is known or the sample size is large (typically n > 30).
•Assumptions:
• Data is normally distributed (or sample size is large enough for the Central Limit Theorem to apply).
• Population variance is known or sample size is large.
• Data is independent.
•Types:
• One-Sample Z-Test: Compares the sample mean to a known population mean.
• Two-Sample Z-Test: Compares the means of two independent samples.
• Z-Test for Proportions: Compares sample proportions to a known proportion.

2. T-Test: Used to compare means when the population variance is unknown and/or the sample size is small (typically n < 30). The t-test is more
robust than the z-test for small sample sizes.
•Assumptions:
• Data is normally distributed (more important for smaller sample sizes).
• Variances of the populations are equal (in some variations).
• Data is independent.
•Types:
• One-Sample T-Test: Compares the sample mean to a known value (usually the population mean).
• Two-Sample T-Test: Compares the means of two independent samples.
• Equal Variances: Assumes that the variances of the two samples are equal.
• Unequal Variances: Does not assume equal variances (Welch’s T-Test).
• Paired Sample T-Test: Compares means from the same group at different times or under different conditions.
3. ANOVA (Analysis of Variance): Used to determine if there are significant differences among means of three or more groups.
It tests the null hypothesis that all group means are equal.
•Assumptions:
• Data is normally distributed within each group.
• Variances are equal across groups (homogeneity of variances).
• Observations are independent.
•Types:
• One-Way ANOVA: Tests differences between group means for one independent variable.
• Two-Way ANOVA: Tests differences between group means for two independent variables, and can also assess
interactions between them.
• Repeated Measures ANOVA: Used when the same subjects are measured multiple times under different conditions.

4. Chi-Square Test: Used to determine if there is a significant association between categorical variables or if a sample
distribution fits an expected distribution.
•Assumptions:
• Data are counts or frequencies.
• Observations are independent.
• Expected frequency counts in each cell of the contingency table should be at least 5 for the test to be valid.
•Types:
• Chi-Square Test of Independence: Assesses whether two categorical variables are independent.
• Chi-Square Test of Homogeneity: Tests if different populations have the same distribution of a categorical variable.
• Chi-Square Goodness of Fit Test: Tests if a sample data fits a specific distribution.
Steps involved in Hypothesis Testing
1. Formulate Hypotheses:
State the null hypothesis (H0) and alternative hypothesis (H1) based on the research question.
2. Select a Test Statistic:
Choose an appropriate test statistic based on the type of data and hypotheses being tested (e.g., t-test, chi-square
test).
3. Set Significance Level (α):
Determine the significance level (α) to control the Type I error rate.
4. Collect Sample Data:
Collect data from a representative sample that is relevant to the hypotheses.
5. Compute Test Statistic and P-value:
Calculate the test statistic from the sample data. Use the test statistic to calculate the p-value.
6. Make Decision:
Compare the p-value to the significance level (α):
If p-value < α: Reject the null hypothesis (evidence against H0).
If p-value ≥ α: Fail to reject the null hypothesis (insufficient evidence against H0).
7. Interpret Results:
Draw conclusions based on the decision made:
Rejecting the null hypothesis supports the alternative hypothesis.
Failing to reject the null hypothesis does not provide sufficient evidence to support the alternative hypothesis.
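
A hedged end-to-end sketch of these steps, using a two-sample t-test on simulated data:

```python
import numpy as np
from scipy import stats

# Step 1: H0: the two group means are equal; H1: they differ (two-tailed).
# Step 3: set the significance level.
alpha = 0.05

# Step 4: collect (here, simulate) sample data.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=53, scale=5, size=40)

# Steps 2 and 5: choose the test statistic and compute it with its p-value.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Steps 6 and 7: decide and interpret.
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the difference in means is statistically significant.")
else:
    print("Fail to reject H0: insufficient evidence of a difference in means.")
```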
Parametric Inference vs Nonparametric Inference
Statistical inference can be broadly categorized into parametric inference and nonparametric inference, depending on the assumptions
made about the underlying distribution of the data. These approaches differ in terms of their flexibility, assumptions, and applicability to
different types of data.

Parametric Inference
•Parametric inference assumes that the data follows a specific distribution characterized by a finite number of parameters (e.g., mean
and variance for a normal distribution).
•Parametric methods estimate these parameters from the data and use them to make inferences about the population.
•The most common way of estimating parameters in parametric modeling is through MAXIMUM LIKELIHOOD ESTIMATION(MLE)
Example:
•Suppose you have a dataset of exam scores from a university course. If you assume that the scores are normally distributed, you can
use parametric methods such as the t-test to compare the mean scores of two groups (e.g., students who attended lectures vs. those
who did not).

Nonparametric Inference
•Nonparametric inference makes fewer assumptions about the underlying data distribution and does not rely on specific parameter
estimates.
•Nonparametric methods are based on ranking or ordering data rather than estimating parameters.
Example:
•If you want to compare the median incomes of two populations but do not assume any specific distribution for the income data, you can
use nonparametric methods like the Wilcoxon rank-sum test (Mann-Whitney U test) to assess differences without assuming normality.
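
A brief sketch contrasting the two approaches on the same simulated, skewed income data: a parametric t-test versus the nonparametric Mann-Whitney U test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
incomes_a = rng.lognormal(mean=10.0, sigma=0.5, size=60)  # skewed, non-normal
incomes_b = rng.lognormal(mean=10.2, sigma=0.5, size=60)

# Parametric: t-test assumes approximately normal data with estimated parameters.
t_stat, t_p = stats.ttest_ind(incomes_a, incomes_b)

# Nonparametric: rank-based Mann-Whitney U test, no distributional assumption.
u_stat, u_p = stats.mannwhitneyu(incomes_a, incomes_b, alternative="two-sided")

print(f"t-test p-value:         {t_p:.4f}")
print(f"Mann-Whitney U p-value: {u_p:.4f}")
```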
Hypothesis Testing ML Applications
1.Model comparison: Hypothesis testing can be used to compare the performance of different machine learning
models or algorithms on a given dataset. For example, you can use a paired t-test to compare the accuracy or error
rate of two models on multiple cross validation folds to determine if one model performs significantly better than the
other.
2.Feature selection: Hypothesis testing can help identify which features are significantly related to the target variable
or contribute meaningfully to the model’s performance. For example, you can use a t-test, chi-square test, or ANOVA
to test the relationship between individual features and the target variable. Features with significant relationships can
be selected for building the model, while non-significant features may be excluded.
3.Hyperparameter tuning: Hypothesis testing can be used to evaluate the performance of a model trained with
different hyperparameter settings. By comparing the performance of models with different hyperparameters, you can
determine if one set of hyperparameters leads to significantly better performance.
4.Assessing model assumptions: In some cases, machine learning models rely on certain statistical assumptions,
such as linearity or normality of residuals in linear regression. Hypothesis testing can help assess whether these
assumptions are met, allowing you to determine if the model is appropriate for the data.
5.Outlier detection: Hypothesis testing can be used to test the significance of outliers in a dataset and to determine if
they should be removed or retained in the analysis.
6.Data preprocessing: Hypothesis testing can be used to test and validate assumptions about the data, such as the
normality or independence of the variables, before applying ML algorithms.
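
As one hedged sketch of the model-comparison use case above, a paired t-test on per-fold cross-validation accuracies (scikit-learn's built-in breast cancer dataset is used purely for illustration):

```python
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Accuracy of two models on the same 10 cross-validation folds.
scores_lr = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=10)
scores_rf = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)

# Paired t-test: are the per-fold accuracy differences significantly non-zero?
t_stat, p_value = stats.ttest_rel(scores_lr, scores_rf)
print(f"Mean accuracy: LR = {scores_lr.mean():.3f}, RF = {scores_rf.mean():.3f}")
print(f"Paired t-test: t = {t_stat:.3f}, p = {p_value:.4f}")
```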
Select the type of Hypothesis test
We choose the type of test statistic based on the predictor variable – quantitative or categorical. Below are a few of the
commonly used test statistics for quantitative data

Type of predictor variable | Distribution type | Desired Test | Attributes
Quantitative | Normal distribution | Z-Test | Large sample size; population standard deviation known
Quantitative | T distribution | T-Test | Sample size less than 30; population standard deviation unknown
Quantitative | Positively skewed distribution | F-Test | When you want to compare 3 or more variables
Quantitative | Negatively skewed distribution | NA | Requires feature transformation to perform a hypothesis test
Categorical | NA | Chi-Square test | Test of independence; goodness of fit
