
Unit I Feature Engineering

What is a Feature?

In the context of machine learning, a feature is an individual measurable property or characteristic of a data point that is used as input for a machine learning algorithm.

Feature Engineering is the process of creating new features or transforming existing features
to improve the performance of a machine-learning model.

Data pre-processing

Data preprocessing is a step in the data mining and data analysis process that takes raw data and transforms it into a format that can be understood and analyzed by computers and machine learning models.

Raw, real-world data in the form of text, images, video, etc., is messy. Not only may it
contain errors and inconsistencies, but it is often incomplete, and doesn’t have a regular,
uniform design.

Machines like to process nice and tidy information – they read data as 1s and 0s. So processing structured data, like whole numbers and percentages, is easy.
However, unstructured data, in the form of text and images, must first be cleaned and formatted before analysis.

Data Preprocessing Importance

Data pre-processing is an important step in the data mining process that involves cleaning
and transforming raw data to make it suitable for analysis.

When using data sets to train machine learning models, you’ll often hear the phrase “garbage in, garbage out”. This means that if you use bad or “dirty” data to train your model, you’ll end up with a bad, improperly trained model that won’t actually be relevant to your analysis.

Good, preprocessed data is even more important than the most powerful algorithms, to the
point that machine learning models trained with bad data could actually be harmful to the
analysis you’re trying to do – giving you “garbage” results.

Pre-processing refers to the cleaning, transforming, and integrating of data in order to make it ready for analysis. The goal of data pre-processing is to improve the quality of the data and to make it more suitable for the specific data mining task.
Some common steps in data pre-processing include:

Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as imputation, removal, and transformation.

Missing data

There are a number of ways to correct for missing data, but the two most common are:

Ignore the tuples: A tuple is an ordered list or sequence of values (in practice, a row of data). If multiple values are missing within tuples, you may simply discard the tuples with that missing information. This is only recommended for large data sets, when a few ignored tuples won’t harm further analysis.

Manually fill in missing data: This can be tedious, but is definitely necessary when working
with smaller data sets.
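
As a rough illustration of these two options, here is a minimal pandas sketch (the column names and values are made up):

import numpy as np
import pandas as pd

# Hypothetical data set with missing values (NaN)
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 31],
    "income": [50000, 62000, np.nan, 58000],
})

# Option 1: ignore (drop) the tuples/rows that contain missing values
dropped = df.dropna()

# Option 2: fill in the missing values, here with each column's mean (simple imputation)
filled = df.fillna(df.mean())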

Noisy data: Data cleaning also includes fixing “noisy” data. This is data that includes
unnecessary data points, irrelevant data, and data that’s more difficult to group together.

Binning: Binning sorts data of a wide data set into smaller groups of more similar data.
It’s often used when analyzing demographics. Income, for example, could be grouped:
$35,000-$50,000, $50,000-$75,000, etc.
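
A minimal pandas sketch of binning incomes into the groups above (the income values are made up):

import pandas as pd

incomes = pd.Series([36000, 48000, 52000, 61000, 74000])

# Sort each income into a labelled bin of similar values
bins = [35000, 50000, 75000]
labels = ["$35,000-$50,000", "$50,000-$75,000"]
groups = pd.cut(incomes, bins=bins, labels=labels)
print(groups.value_counts())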

Regression: Regression analysis is used to smooth large amounts of data and to decide which variables are actually relevant to your analysis. This helps you get a handle on your data, so you’re not overburdened with unnecessary data.

Clustering: Clustering algorithms are used to properly group data, so that it can be
analyzed with like data. They’re generally used in unsupervised learning, when not a lot is
known about the relationships within your data.
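
A minimal sketch of grouping unlabeled data with k-means, assuming scikit-learn is available (the data and the number of clusters are illustrative):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two loose groups of points, with no labels attached
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Group similar points together (unsupervised)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])   # cluster assignment of the first ten points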

If you’re working with text data, for example, some things you should consider when cleaning your data are (a small code sketch follows the list):

 Remove URLs, symbols, emojis, etc., that aren’t relevant to your analysis
 Translate all text into the language you’ll be working in
 Remove HTML tags
 Remove boilerplate email text
 Remove unnecessary blank text between words
 Remove duplicate data
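
A minimal sketch of a few of these steps using Python’s standard library (the regular expressions are deliberately simplified):

import re

def clean_text(text):
    text = re.sub(r"https?://\S+", " ", text)      # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)           # remove HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)    # remove symbols, emojis, etc.
    text = re.sub(r"\s+", " ", text).strip()       # remove unnecessary blank text
    return text

print(clean_text("Great <b>product</b>!! See https://example.com"))
# -> "Great product See"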

After data cleaning, you may realize you have insufficient data for the task at hand. At this
point you can also perform data wrangling or data enrichment to add new data sets and
run them through quality assessment and cleaning again before adding them to your original
data.
Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different
formats, structures, and semantics. Techniques such as record linkage and data fusion can be
used for data integration.
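
In practice, record linkage and data fusion can be involved; as a minimal sketch, two hypothetical sources sharing a key can be combined with pandas:

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cara"]})
orders    = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [120.0, 80.0, 45.0]})

# Combine the two sources into a unified dataset on the common key
unified = customers.merge(orders, on="customer_id", how="left")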

Normalization: The main purpose of data normalization is to minimize or even exclude duplicated data.

It is the process of developing clean data. This includes eliminating redundant and
unstructured data and making the data appear similar across all records and fields.

Normalization changes the observations so that they can be described as a normal distribution.

Scaling: In scaling, you're changing the range of the data. Scaling the variables helps compare different variables on an equal footing.

Normal distribution: Also known as the “bell curve”, this is a specific statistical distribution
where roughly equal observations fall above and below the mean, the mean and the
median are the same, and there are more observations closer to the mean. The normal
distribution is also known as the Gaussian distribution.

After normalization, the shape of the data changes: before normalizing it was almost L-shaped, but afterwards it looks more like a “bell curve”.
With scaling, by contrast, the shape of the data doesn’t change; only the scale on the x-axis changes (for example, from 0-8 to 0-1).
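
A minimal NumPy sketch of min-max scaling, mirroring the 0-to-8 versus 0-to-1 example above (the sample values are made up):

import numpy as np

x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])   # original values range roughly from 0 to 8

# Min-max scaling: the shape stays the same, the range becomes 0 to 1
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)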

Standardization:

Standardization is a technique that transforms data to have a mean of 0 and a standard deviation of 1, making it easier to compare and combine features. Standardization is useful when:

1. Features have different distributions: Standardization helps to transform features with different distributions onto a common scale, making it easier to compare and combine them.

2. The model assumes normality: Some models, like linear regression, assume normality of the features. Standardization puts the features on the scale of a standard normal distribution, making it easier to meet this assumption.

3. Feature importance needs to be evaluated: Standardization scales the features to a common range, making it easier to compare their coefficients.

The main difference between normalization and standardization is that normalization scales
the data to a common range, while standardization transforms the data to have a mean of
0 and a standard deviation of 1.
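
A short scikit-learn sketch contrasting the two transformations (the data are illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [4.0], [8.0]])

# Normalization (min-max): rescales the values to a common range, here [0, 1]
X_norm = MinMaxScaler().fit_transform(X)

# Standardization (z-score): subtract the mean and divide by the standard deviation,
# giving mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

print(X_norm.ravel())               # roughly [0, 0.14, 0.43, 1]
print(X_std.mean(), X_std.std())    # roughly 0.0 and 1.0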
When to choose normalization:

 When features have different units or scales.
 When the model is sensitive to feature scales.
 When the goal is to prevent features with large ranges from dominating the model.

When to choose standardization:

 When features have different distributions.
 When the model assumes normality of the features.
 When the goal is to evaluate the importance of each feature.

PRINCIPAL COMPONENT ANALYSIS (PCA)

It is a tool used to reduce the dimension of the data. It allows us to reduce the dimension of the data without much loss of information. PCA reduces the dimension by finding a few orthogonal linear combinations (principal components) of the original variables with the largest variance. The first principal component captures most of the variance in the data. The second principal component is orthogonal to the first and captures the remaining variance that is left after the first principal component, and so on.
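
A minimal scikit-learn sketch of PCA (the data set is random and only for illustration):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # 100 samples, 5 original variables

# Keep the two orthogonal directions with the largest variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # share of variance captured by each component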

KERNEL PCA: PCA is a linear method. That is, it can only be applied to datasets which are linearly separable. It does an excellent job on datasets which are linearly separable. But if we apply it to non-linear datasets, we might get a result which is not the optimal dimensionality reduction. Kernel PCA uses a kernel function to project the dataset into a higher-dimensional feature space, where it is linearly separable. This is similar to the idea of Support Vector Machines. There are various kernels, such as linear, polynomial, and Gaussian.

Kernel Principal Component Analysis (KPCA) is a technique used in machine learning for
nonlinear dimensionality reduction. It is an extension of the classical Principal Component
Analysis (PCA) algorithm, which is a linear method that identifies the most significant
features or components of a dataset. KPCA applies a nonlinear mapping function to the data
before applying PCA, allowing it to capture more complex and nonlinear relationships
between the data points.
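
A minimal sketch of kernel PCA with a Gaussian (RBF) kernel on a non-linear data set, assuming scikit-learn (the gamma value is illustrative):

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: a data set that is not linearly separable in its original space
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# Map the data into a higher-dimensional feature space via the RBF kernel, then apply PCA
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)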

Objectives of PCA:

It is basically a non-dependent procedure in which it reduces attribute space from a large number of variables to a smaller number of factors.

PCA is basically a dimension reduction process but there is no guarantee that the dimension
is interpretable.

The main task in PCA is to select a subset of variables from a larger set, based on which original variables have the highest correlation with the principal components.
Identifying patterns: PCA can help identify patterns or relationships between variables that
may not be apparent in the original data. By reducing the dimensionality of the data, PCA can
reveal underlying structures that can be useful in understanding and interpreting the
data.

Feature extraction: PCA can be used to extract features from a set of variables that are more
informative or relevant than the original variables. These features can then be used in
modeling or other analysis tasks.

Data compression: PCA can be used to compress large datasets by reducing the number of
variables needed to represent the data, while retaining as much information as possible.

Noise reduction: PCA can be used to reduce the noise in a dataset by identifying and
removing the principal components that correspond to the noisy parts of the data.

Visualization: PCA can be used to visualize high-dimensional data in a lower-dimensional space, making it easier to interpret and understand. By projecting the data onto the principal components, patterns and relationships between variables can be more easily visualized.

Uses of PCA:

 It is used to find interrelations between variables in the data.
 It is used to interpret and visualize data.
 It decreases the number of variables, which makes further analysis simpler.
 It’s often used to visualize genetic distance and relatedness between populations.

Local Binary Patterns (LBP): LBP is a type of visual descriptor used for classification in computer vision (CV).

LBP summarizes texture information by analyzing pixel patterns.

It’s invariant to illumination, translation, and scaling.

Originally proposed in 1994, LBP is a particular case of the Texture Spectrum model
introduced in 1990.

How It Works:

LBP describes the neighbourhood of image elements using binary codes.

It’s commonly used for texture classification, face recognition, and gender recognition.

By studying local properties, LBP identifies characteristics of individual parts of an image.

Image Representation:

Images are represented as matrices of pixels. Each pixel has three color channels (red, green, blue), and their values range from 0 to 255.

LBP operates on local neighborhoods within the image, comparing pixel intensities to create binary patterns.
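
A minimal NumPy sketch of the basic 3x3 LBP operator on a grayscale image, only to show the idea (in practice an existing implementation such as scikit-image's local_binary_pattern would normally be used):

import numpy as np

def lbp_codes(gray):
    """Basic 3x3 LBP: compare each pixel's 8 neighbours with the centre pixel."""
    h, w = gray.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.int32)
    centre = gray[1:-1, 1:-1]
    # Offsets of the 8 neighbours, ordered clockwise from the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes += (neighbour >= centre).astype(np.int32) * (1 << bit)  # one bit per neighbour
    return codes

gray = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)   # toy grayscale image
codes = lbp_codes(gray)
hist, _ = np.histogram(codes, bins=256, range=(0, 256))           # texture feature vector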

Local Binary Pattern (LBP) has several advantages when applied in data science and computer vision:

1. Texture Analysis:

LBP is excellent for texture analysis. It captures local patterns in an image, making it useful
for tasks like texture classification and segmentation.

2. Robustness to Illumination Changes:

LBP is invariant to monotonic illumination changes. It focuses on relative pixel differences, which makes it robust in varying lighting conditions.

3. Simple and Efficient:

LBP is computationally efficient and straightforward to implement. It doesn’t require complex mathematical operations, making it suitable for real-time applications.

4. Face Recognition:

LBP is commonly used in face recognition systems.

It extracts discriminative features from facial images, allowing accurate recognition even
with variations in pose and lighting.

5. Histogram Representation:

LBP generates histograms of local patterns, which can be used as features for machine
learning models.

These histograms capture texture information effectively.

6. Spatial Information:

LBP considers the spatial arrangement of pixels, which helps in capturing local context.

It’s useful for tasks like object detection and tracking.

Remember that LBP is just one of many techniques in data science, but its simplicity and effectiveness make it a valuable tool in various applications.

Sequential Forward Selection (SFS) is a feature selection technique used in data science
and big data analysis. It is a type of greedy algorithm that iteratively adds features to a dataset
to improve the performance of a predictive model. Here’s how it works:
1. Initialization: The selector is initialized with a predictive model, the number of features to select, the scoring metric, and the tolerance for improvement. The set of selected features starts out empty.
2. Model Fitting: For each candidate feature, the selector fits the predictive model on the currently selected features plus that candidate.
3. Model Evaluation: Each candidate model is evaluated with cross-validation using the scoring metric.
4. Feature Addition: The feature that most improves the model’s cross-validation score is added to the selected features set.
5. Iteration: The selector repeats steps 2-4 until the desired number of features has been selected.

This method is particularly useful when dealing with a large number of features, as it
incrementally builds the model based on the most informative features. It involves assessing
new features, evaluating combinations of features, and selecting the optimal subset of
features that best contribute to model accuracy.
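
A minimal sketch using scikit-learn's SequentialFeatureSelector in the forward direction (the estimator, data set, and number of features are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Greedily add one feature at a time, keeping at each step the feature
# that most improves the cross-validation score, until 5 are selected
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,
    direction="forward",
    scoring="accuracy",
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask of the selected features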

Sequential Backward Selection (SBS) is a feature selection technique used in machine learning and data science. It is a type of greedy algorithm that iteratively removes features from a dataset to improve the performance of a predictive model. Here’s how it works:

1. Initialization: The selector is initialized with a predictive model, the number of features to select, the scoring metric, and the tolerance for improvement. The set of selected features starts out containing all features.
2. Model Fitting: For each candidate feature, the selector fits the predictive model on the currently selected features minus that candidate.
3. Model Evaluation: Each candidate model is evaluated with cross-validation using the scoring metric.
4. Feature Removal: The feature whose removal least reduces the model’s cross-validation score is removed from the selected features set.
5. Iteration: The selector repeats steps 2-4 until the desired number of features has been selected.

This method is particularly useful when dealing with a large number of features, as it incrementally prunes the feature set down to the most informative features. It involves assessing the contribution of each feature, evaluating combinations of features, and selecting the subset of features that contributes most to model accuracy.
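
The same scikit-learn selector can run in the backward direction; a minimal sketch under the same assumptions as the forward example above:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Start from all features and greedily remove one at a time, dropping at each
# step the feature whose removal hurts the cross-validation score the least
sbs = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,
    direction="backward",
    scoring="accuracy",
    cv=5,
)
sbs.fit(X, y)
print(sbs.get_support())   # boolean mask of the retained features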
