Unit 1
What is a Feature?
A feature is an individual measurable property or characteristic of the data being observed. Feature Engineering is the process of creating new features or transforming existing features to improve the performance of a machine learning model.
Data pre-processing
Data preprocessing is a step in the data mining and data analysis process that takes raw data and transforms it into a format that can be understood and analyzed by computers and machine learning models.
Raw, real-world data in the form of text, images, video, etc., is messy. Not only may it
contain errors and inconsistencies, but it is often incomplete, and doesn’t have a regular,
uniform design.
Machines like to process nice and tidy information – they read data as 1s and 0s. So calculating structured data, like whole numbers and percentages, is easy. However, unstructured data, in the form of text and images, must first be cleaned and formatted before analysis.
Data pre-processing is an important step in the data mining process that involves cleaning
and transforming raw data to make it suitable for analysis.
When using data sets to train machine learning models, you’ll often hear the phrase “garbage in, garbage out.” This means that if you use bad or “dirty” data to train your model, you’ll end up with a bad, improperly trained model that won’t actually be relevant to your analysis.
Good, preprocessed data is even more important than the most powerful algorithms, to the
point that machine learning models trained with bad data could actually be harmful to the
analysis you’re trying to do – giving you “garbage” results.
Data pre-processing refers to the cleaning, transforming, and integrating of data in order to make it ready for analysis. The goal of data pre-processing is to improve the quality of the data and to make it more suitable for the specific data mining task.
Some common steps in data pre-processing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as imputation, removal, and transformation.
Missing data
There are a number of ways to correct for missing data, but the two most common are:
Ignore the tuples: A tuple is an ordered list of values (in practice, a single record or row in the data set). If multiple values are missing within a tuple, you may simply discard the tuples with that missing information. This
is only recommended for large data sets, when a few ignored tuples won’t harm further
analysis.
Manually fill in missing data: This can be tedious, but is definitely necessary when working
with smaller data sets.
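A minimal pandas sketch of both approaches (the DataFrame and its missing values are made up for illustration):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, np.nan, 47, 31],
    "income": [35000, 52000, np.nan, 61000],
})

# Option 1: ignore (drop) the tuples that contain missing values
dropped = df.dropna()

# Option 2: fill in missing data, here with each column's mean (simple imputation)
filled = df.fillna(df.mean(numeric_only=True))

print(dropped)
print(filled)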
Noisy data: Data cleaning also includes fixing “noisy” data. This is data that includes
unnecessary data points, irrelevant data, and data that’s more difficult to group together.
Binning: Binning sorts data of a wide data set into smaller groups of more similar data.
It’s often used when analyzing demographics. Income, for example, could be grouped:
$35,000-$50,000, $50,000-$75,000, etc.
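A short sketch of binning the income groups above with pandas (the income values are invented for illustration):

import pandas as pd

incomes = pd.Series([38000, 42000, 55000, 61000, 72000])
bins = [35000, 50000, 75000]          # edges for $35,000-$50,000 and $50,000-$75,000
labels = ["35k-50k", "50k-75k"]
income_groups = pd.cut(incomes, bins=bins, labels=labels)
print(income_groups.value_counts())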
Regression: Regression is used to decide which variables will actually apply to your analysis.
Regression analysis is used to smooth large amounts of data. This will help you get a
handle on your data, so you’re not overburdened with unnecessary data.
Clustering: Clustering algorithms are used to properly group data, so that it can be
analyzed with like data. They’re generally used in unsupervised learning, when not a lot is
known about the relationships within your data.
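For illustration, a minimal scikit-learn sketch of grouping unlabeled data with k-means (the synthetic data and the choice of three clusters are assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled data with three loose groups
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # cluster label (0, 1 or 2) assigned to each point
print(labels[:10])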
If you’re working with text data, for example, some things you should consider when
cleaning your data are:
Remove URLs, symbols, emojis, etc., that aren’t relevant to your analysis
Translate all text into the language you’ll be working in
Remove HTML tags
Remove boilerplate email text
Remove unnecessary blank text between words
Remove duplicate data
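A rough sketch of a few of these text-cleaning steps using Python's standard library (the regular expressions are illustrative, not exhaustive, and translation is left out):

import re

def clean_text(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = re.sub(r"[^\w\s.,!?]", " ", text)    # remove symbols, emojis, etc.
    text = re.sub(r"\s+", " ", text)            # remove unnecessary blank text between words
    return text.strip()

print(clean_text("Check <b>this</b> out!! https://example.com   now"))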
After data cleaning, you may realize you have insufficient data for the task at hand. At this
point you can also perform data wrangling or data enrichment to add new data sets and
run them through quality assessment and cleaning again before adding them to your original
data.
Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different
formats, structures, and semantics. Techniques such as record linkage and data fusion can be
used for data integration.
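A minimal pandas sketch of integrating two sources into one unified dataset; the table names and the shared "customer_id" key are hypothetical:

import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 3], "order_total": [120, 75, 40]})
profiles = pd.DataFrame({"customer_id": [1, 2, 4], "region": ["north", "south", "east"]})

# Record linkage here is a simple key join; real-world sources often need fuzzier matching
unified = orders.merge(profiles, on="customer_id", how="left")
print(unified)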
Data Transformation: This is the process of producing clean, consistent data. It includes eliminating redundant and unstructured data and making the data appear similar across all records and fields.
Normalization changes the observations so that they can be described by a normal distribution. Scaling, on the other hand, simply puts the variables on a common scale so that different variables can be compared on an equal footing. Comparing the shape of the data before and after each transformation makes the difference clearer.
Normal distribution: Also known as the “bell curve”, this is a specific statistical distribution where roughly equal observations fall above and below the mean, the mean and the median are the same, and there are more observations closer to the mean. The normal distribution is also known as the Gaussian distribution.
After normalization, the shape of the data changes: a variable that was almost L-shaped before normalizing looks more like a “bell curve” afterwards. After scaling, by contrast, the shape of the data does not change; only the scale on the X-axis changes, for example from a range of 0 to 8 down to a range of 0 to 1.
Standardization:
Standardization transforms the data so that it has a mean of 0 and a standard deviation of 1. Choose standardization when:
Model assumes normality: Some models, like linear regression, assume normality of the data, and standardized features fit this assumption more closely.
The main difference between normalization and standardization is that normalization scales the data to a common range (typically 0 to 1), while standardization transforms the data to have a mean of 0 and a standard deviation of 1.
When to choose normalization: Normalization is a good choice when the data does not follow a normal distribution, or when the algorithm makes no assumption about the distribution of the data.
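A short scikit-learn sketch contrasting the two (the skewed one-feature sample is made up for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [2.5], [3.0], [8.0]])   # one skewed feature

normalized = MinMaxScaler().fit_transform(X)        # values rescaled to the range 0 to 1
standardized = StandardScaler().fit_transform(X)    # mean 0, standard deviation 1

print(normalized.ravel())
print(standardized.ravel(), standardized.mean(), standardized.std())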
KERNEL PCA: PCA is a linear method; that is, it can only be applied effectively to datasets that are linearly separable. It does an excellent job for such datasets, but if we apply it to non-linear datasets, the result may not be the optimal dimensionality reduction. Kernel PCA uses a kernel function to project the dataset into a higher-dimensional feature space, where it becomes linearly separable. The idea is similar to that of Support Vector Machines. There are various kernel functions, such as linear, polynomial, and Gaussian (RBF).
Kernel Principal Component Analysis (KPCA) is a technique used in machine learning for
nonlinear dimensionality reduction. It is an extension of the classical Principal Component
Analysis (PCA) algorithm, which is a linear method that identifies the most significant
features or components of a dataset. KPCA applies a nonlinear mapping function to the data
before applying PCA, allowing it to capture more complex and nonlinear relationships
between the data points.
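A brief scikit-learn sketch: ordinary PCA cannot separate two concentric circles, but KPCA with a Gaussian (RBF) kernel can; the gamma value here is just an illustrative guess:

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Project into a space where the two circles become linearly separable
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)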
Objectives of PCA:
PCA is basically a dimension reduction process, but there is no guarantee that the reduced dimensions are interpretable. The main task in PCA is to select a subset of variables from a larger set, based on which original variables have the highest correlation with the principal components.
Identifying patterns: PCA can help identify patterns or relationships between variables that
may not be apparent in the original data. By reducing the dimensionality of the data, PCA can
reveal underlying structures that can be useful in understanding and interpreting the
data.
Feature extraction: PCA can be used to extract features from a set of variables that are more
informative or relevant than the original variables. These features can then be used in
modeling or other analysis tasks.
Data compression: PCA can be used to compress large datasets by reducing the number of
variables needed to represent the data, while retaining as much information as possible.
Noise reduction: PCA can be used to reduce the noise in a dataset by identifying and
removing the principal components that correspond to the noisy parts of the data.
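A minimal scikit-learn sketch of dimensionality reduction and data compression with PCA, using the built-in iris dataset as an example:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 4 original variables

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)          # compressed to 2 principal components
print(X_reduced.shape)
print(pca.explained_variance_ratio_)      # share of information each component retains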
Local Binary Patterns (LBP): a type of visual descriptor used for classification in computer vision (CV). Originally proposed in 1994, LBP is a particular case of the Texture Spectrum model introduced in 1990.
How It Works:
LBP operates on local neighborhoods within the image, comparing pixel intensities to create binary patterns. It is commonly used for texture classification, face recognition, and gender recognition.
Local Binary Patterns (LBP) have several advantages when applied in data science and computer vision:
1. Texture Analysis:
LBP is excellent for texture analysis. It captures local patterns in an image, making it useful for tasks like texture classification and segmentation.
2. Face Recognition:
It extracts discriminative features from facial images, allowing accurate recognition even with variations in pose and lighting.
3. Histogram Representation:
LBP generates histograms of local patterns, which can be used as features for machine learning models.
4. Spatial Information:
LBP considers the spatial arrangement of pixels, which helps in capturing local context.
Remember that LBP is just one of many techniques in data science, but its simplicity and effectiveness make it a valuable tool in various applications.
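A short sketch of computing LBP codes and their histogram with scikit-image (the built-in "camera" test image and the 8-neighbour, radius-1 setting are just illustrative choices):

import numpy as np
from skimage import data
from skimage.feature import local_binary_pattern

image = data.camera()                               # built-in grayscale test image
P, R = 8, 1                                         # 8 neighbours at radius 1
lbp = local_binary_pattern(image, P, R, method="uniform")

# Histogram of the local patterns, usable as a feature vector for a classifier
hist, _ = np.histogram(lbp.ravel(), bins=np.arange(0, P + 3), density=True)
print(hist)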
Sequential Forward Selection (SFS) is a feature selection technique used in data science and big data analysis. It is a greedy algorithm that iteratively adds features to the model’s input set to improve the performance of a predictive model. Here’s how it works:
1. Initialization: The selector is initialized with a predictive model, the number of features to select, the scoring metric, and the tolerance for improvement.
2. Model Fitting: Starting from an empty set of selected features, the selector fits the predictive model on each candidate subset formed by adding one of the remaining features.
3. Model Evaluation: Each candidate model is evaluated on the training set with cross-validation, using the scoring metric.
4. Feature Addition: The feature that most improves the model’s cross-validation score is added to the selected feature set.
5. Iteration: The selector repeats steps 2-4 until the desired number of features has been selected.
This method is particularly useful when dealing with a large number of features, as it
incrementally builds the model based on the most informative features. It involves assessing
new features, evaluating combinations of features, and selecting the optimal subset of
features that best contribute to model accuracy.
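A minimal sketch using scikit-learn's SequentialFeatureSelector in forward mode; the dataset, the logistic-regression model, and the choice of five features are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Greedily add features one at a time, keeping the one that best improves the CV score
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
sfs = SequentialFeatureSelector(model, n_features_to_select=5,
                                direction="forward", scoring="accuracy", cv=5)
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask of the 5 selected features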