Basics of Feature Engineering
• The features in a data set are also called its dimensions. So a data set having ‘n’ features is called an n-dimensional data set.
• Example: the Iris data set is a 5-dimensional data set – ‘species’ is the class variable, and the other four features are predictor variables (see the sketch below).
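A minimal sketch of the point above, assuming scikit-learn is installed (it bundles the Iris data set): the four predictor features plus the species label give the 5 dimensions.

from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
X = iris.data      # 4 predictor features: sepal/petal length and width
y = iris.target    # 1 class variable: species

print(X.shape)       # (150, 4) -> 4 predictor dimensions
print(y.nunique())   # 3 species; together with X, a 5-dimensional data set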
Introduction to Feature Engineering
• What is feature engineering?
• Feature engineering refers to the process of translating a data set into features such that these features are able to represent the data set more effectively and result in better learning performance.
• Feature engineering is a very important step in pre-processing. It has two major elements: 1. feature transformation and 2. feature subset selection.
• Feature transformation:
• Transforms the data, whether structured or unstructured, into a new set of features that can represent the underlying problem which the machine learning model tries to solve.
• Feature extraction is the process of extracting or creating a new set of features from the original set of features using some functional mapping.
• Feature subset selection: no new features are added.
• The objective of feature selection is to derive a subset of features from the full feature set which is most meaningful in the context of a specific machine learning problem.
FEATURE TRANSFORMATION
• What is feature construction?
• The feature construction process discovers missing information about the relationships between features and augments the feature space by creating additional features.
• Hence, if there are ‘n’ features or dimensions in a data set, after feature construction ‘m’ more features or dimensions may get added. So at the end, the data set will become ‘n + m’ dimensional.
FEATURE CONSTRUCTION
• Feature transformation is used as an effective tool for dimensionality reduction.
• Goals:
• Achieving the best reconstruction of the original features in the data set
• Achieving the highest efficiency in the learning task
• Feature construction: it involves transforming a given set of input features to generate a new set of more powerful features.
• Example: a real estate data set having details of all apartments sold in a specific region – a new feature such as price per unit area could be constructed from the original price and area features (see the sketch below).
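A minimal sketch of feature construction on a hypothetical real estate table; the column names (apartment_area, price) and the values are illustrative assumptions, not from any original data set.

import pandas as pd

# Hypothetical apartment records (columns and values are illustrative).
df = pd.DataFrame({
    "apartment_area": [850.0, 1200.0, 640.0],   # square feet
    "price":          [85000, 132000, 60800],
})

# Constructed feature derived from the two original ones.
df["price_per_sqft"] = df["price"] / df["apartment_area"]

# The feature space grows from n = 2 to n + m = 3 dimensions.
print(df)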
• In PCA, a new set of features is extracted from the original features; the new features are quite dissimilar in nature.
• 1. An ‘n’-dimensional feature space gets transformed into an ‘m’-dimensional feature space, where the dimensions are orthogonal to each other, i.e. completely independent of each other.
• 2. The principal components are generated in order of the variability in the data that they capture. Hence, the first principal component should capture the maximum variability, the second principal component the next highest variability, and so on.
• 3. The sum of the variances of the new features, or principal components, should be equal to the sum of the variances of the original features.
Feature extraction - PCA
• PCA works based on a process called eigenvalue decomposition of the covariance matrix of a data set.
• Below are the steps to be followed (a code sketch follows the list):
• 1. Compute the covariance matrix of the data set.
• 2. Calculate the eigenvalues and eigenvectors of the covariance matrix.
• 3. The eigenvector having the highest eigenvalue represents the direction in which there is the highest variance. So this helps in identifying the first principal component.
• 4. The eigenvector having the next highest eigenvalue represents the direction in which the data has the highest remaining variance, and which is also orthogonal to the first direction. So this helps in identifying the second principal component.
• 5. Like this, identify the top ‘k’ eigenvectors having the top ‘k’ eigenvalues so as to get the ‘k’ principal components.
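A minimal NumPy sketch of the steps above (eigenvalue decomposition of the covariance matrix); the function name pca and the toy data are assumptions for illustration, not a reference implementation.

import numpy as np

def pca(X, k):
    # Step 1: centre the data and compute the covariance matrix.
    X_centred = X - X.mean(axis=0)
    cov = np.cov(X_centred, rowvar=False)

    # Step 2: eigenvalues and eigenvectors of the (symmetric) covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Steps 3-5: order eigenvectors by decreasing eigenvalue, keep the top k.
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]

    # Project the centred data onto the k principal components.
    return X_centred @ components

X = np.random.rand(100, 5)   # toy 5-dimensional data set
Z = pca(X, k=2)
print(Z.shape)               # (100, 2)

Note that the sum of the variances of all principal components equals the sum of the original feature variances, which is property 3 listed earlier.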
FEATURE SUBSET SELECTION - Issues in high-dimensional data
• With a very high number of features, a very high quantity of computational resources and a large amount of time will be required.
• The performance of the model – for both supervised and unsupervised machine learning tasks – also degrades sharply due to unnecessary noise in the data.
• Also, a model built on an extremely high number of features may be very difficult to understand.
• In supervised learning, each predictor variable or feature is expected to contribute information to decide the class.
• In unsupervised learning there is no class or training data; similar items are simply grouped together.
• Certain variables do not contribute any useful information for deciding the similarity or dissimilarity of items. Hence, those variables make no significant information contribution to the grouping process and are irrelevant.
• Other features may be redundant: in the context of a weight prediction problem, for example, Age and Height contribute similar information.
• Now, the question is: how to find out which of the features are irrelevant, or which features have potential redundancy?
FEATURE SUBSET SELECTION - Measures of feature relevance and redundancy
• Measures of feature relevance:
• For supervised learning, mutual information is considered a good measure of the information a feature contributes.
• The higher the value of mutual information of a feature, the more relevant that feature is (see the sketch below).
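A minimal sketch of ranking features by mutual information with the class, assuming scikit-learn is available (mutual_info_classif estimates mutual information for a classification target); the Iris data set is used only as an illustration.

from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

data = load_iris()
mi = mutual_info_classif(data.data, data.target, random_state=0)

# Higher mutual information => more relevant feature.
for name, score in sorted(zip(data.feature_names, mi), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")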
• Measures of feature redundancy:
• 1. Correlation-based measures
• 2. Distance-based measures, and
• 3. Other coefficient-based measures
• Correlation-based similarity measure: correlation is a measure of the linear dependency between two random variables.
• For two random feature variables F1 and F2, the Pearson correlation coefficient is defined as the covariance of F1 and F2 divided by the product of their standard deviations: r(F1, F2) = cov(F1, F2) / (σ(F1) · σ(F2)).
• Correlation values range from +1 to -1.
• Distance-based similarity measure: the Minkowski distance is a generalized distance measure; with order r = 2 it becomes the Euclidean distance, and with r = 1 the Manhattan distance. A sketch of both measures follows.
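A minimal NumPy sketch of the two redundancy measures mentioned above; the two feature vectors F1 and F2 are illustrative.

import numpy as np

F1 = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
F2 = np.array([1.0, 3.0, 5.0, 9.0, 11.0])

# Pearson correlation: cov(F1, F2) / (std(F1) * std(F2)), always in [-1, +1].
r = np.corrcoef(F1, F2)[0, 1]

# Minkowski distance of order p (p = 2: Euclidean, p = 1: Manhattan).
def minkowski(a, b, p):
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

print(f"Pearson correlation: {r:.3f}")
print(f"Euclidean distance : {minkowski(F1, F2, 2):.3f}")
print(f"Manhattan distance : {minkowski(F1, F2, 1):.3f}")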
FEATURE SUBSET SELECTION
• Other similarity measures
• The Jaccard index/coefficient is used as a measure of similarity between two features.
• The Jaccard distance, a measure of dissimilarity between two features, is complementary to the Jaccard index (Jaccard distance = 1 − Jaccard index).
• For two features having binary values, the Jaccard index is measured as J = n11 / (n01 + n10 + n11), where n11 is the number of cases in which both features have value 1, and n01 and n10 are the numbers of cases in which the two features disagree.
FEATURE SUBSET SELECTION
• Other similarity measures: simple matching coefficient (SMC) and cosine similarity
• The simple matching coefficient (SMC), unlike the Jaccard index, also counts the matching zeros: SMC = (n11 + n00) / (n00 + n01 + n10 + n11).
FEATURE SUBSET SELECTION
• Cosine similarity measures the angle between the x and y vectors (see the sketch below).
• Hence, if cosine similarity has a value of 1, the angle between x and y is 0°, which means x and y are the same except for magnitude.
• If cosine similarity is 0, the angle between x and y is 90°; hence they do not share any similarity (in the case of text data, no term/word is common).
• In the worked example on the slide, the angle comes to 43.2°.
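A minimal NumPy sketch of cosine similarity and the corresponding angle, using the ‘x’ and ‘y’ vectors from the exercise at the end of this section.

import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([3.0, 2.0, 0.0, 5.0])
y = np.array([1.0, 0.0, 0.0, 0.0])

cos_sim = cosine_similarity(x, y)            # ~0.487
angle_deg = np.degrees(np.arccos(cos_sim))   # ~60.9 degrees

print(f"cosine similarity: {cos_sim:.3f}")
print(f"angle            : {angle_deg:.1f} degrees")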
FEATURE SUBSET SELECTION
• Overall feature selection process
• Feature selection is the process of selecting a subset of features in a data set. It typically has four steps:
• 1. Generation of possible subsets
• 2. Subset evaluation
• 3. Stop searching based on some stopping criterion
• 4. Validation of the result
• Subset generation: for an ‘n’-dimensional data set, 2^n subsets can be generated. So, as the value of ‘n’ becomes high, finding an optimal subset from all the 2^n candidate subsets becomes intractable.
• Sequential forward selection: start with an empty set and keep adding features.
• Sequential backward elimination: start with the full set and successively remove features (a sketch of sequential forward selection follows the stopping criteria below).
• Stopping criterion:
1. the search completes
2. some given bound (e.g. a specified number of iterations) is reached
3. subsequent addition (or deletion) of a feature does not produce a better subset
4. a sufficiently good subset (e.g. a subset having better classification accuracy than the existing benchmark) is selected
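A minimal sketch of sequential forward selection tying the four steps and the stopping criteria together, assuming scikit-learn; the choice of k-NN as the induction algorithm and 5-fold cross-validation as the evaluation are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))
selected, best_score = [], 0.0

while remaining:
    # Subset generation + evaluation: try adding each remaining feature.
    scores = {f: cross_val_score(KNeighborsClassifier(),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best = max(scores, key=scores.get)

    # Stopping criterion: adding a feature no longer improves the subset.
    if scores[f_best] <= best_score:
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = scores[f_best]

print("selected feature indices:", selected, "score:", round(best_score, 3))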
Feature selection approaches
• There are four types of approaches for feature selection:
• 1. Filter approach
• 2. Wrapper approach
• 3. Hybrid approach = filter (statistical) + wrapper (algorithm)
• 4. Embedded approach
• Filter approach – feature subsets are evaluated using statistical measures, independently of any learning algorithm (see the sketch below).
• Wrapper approach – identification of the best feature subset is done using the induction algorithm as a black box. The feature selection algorithm searches for a good feature subset; since for every candidate subset the learning model is trained and the result is evaluated by running the learning algorithm, the wrapper approach is computationally very expensive.
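A minimal sketch of the filter approach, assuming scikit-learn: features are scored with a statistical measure (here the ANOVA F-value via f_classif) without training the final learning model; the choice of k = 2 is illustrative.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print("scores per feature:", selector.scores_.round(2))
print("reduced shape     :", X_reduced.shape)   # (150, 2)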
ACTIVE LEARNING
2. Compare the Jaccard index and the simple matching coefficient of two features having values (1, 1, 0, 0, 1, 0, 1, 1) and (1, 0, 0, 1, 1, 0, 0, 1). A worked sketch follows below.
5. Find the similarity between two vectors ‘x’ and ‘y’ using cosine similarity.
The ‘x’ vector has values x = {3, 2, 0, 5}.
The ‘y’ vector has values y = {1, 0, 0, 0}.
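A minimal worked sketch for exercise 2 (plain Python, no libraries).

F1 = [1, 1, 0, 0, 1, 0, 1, 1]
F2 = [1, 0, 0, 1, 1, 0, 0, 1]

n11 = sum(a == 1 and b == 1 for a, b in zip(F1, F2))   # both 1
n00 = sum(a == 0 and b == 0 for a, b in zip(F1, F2))   # both 0
n10 = sum(a == 1 and b == 0 for a, b in zip(F1, F2))
n01 = sum(a == 0 and b == 1 for a, b in zip(F1, F2))

jaccard = n11 / (n01 + n10 + n11)             # 3 / 6 = 0.5
smc = (n11 + n00) / (n00 + n01 + n10 + n11)   # 5 / 8 = 0.625

print(f"Jaccard index: {jaccard:.3f}")
print(f"SMC          : {smc:.3f}")

For exercise 5, the cosine similarity sketch shown earlier applies directly: cosine ≈ 0.487, i.e. an angle of about 60.9°.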