Feature Engineering
• Feature Engineering:
• Feature Extraction and Engineering
• Feature Engineering on Numeric Data
• Feature Engineering on Categorical Data
• Feature Engineering on Text Data
• Feature Engineering on Temporal Data
• Feature Engineering on Image Data
• Feature Scaling
• Feature Selection
• Dimensionality Reduction:
• Feature Extraction with Principal Component Analysis
What is a Feature?
• A feature is an attribute of a data set that is used in a machine learning
process.
• The features in a data set are also called its dimensions.
• So a data set having ‘n’ features is called an n-dimensional data set.
• Example: in a sample data set, one attribute is the class variable (dependent
variable) and the remaining attributes are the features.
What is Feature Engineering?
• Feature engineering refers to the process of translating a data set
into features such that these features are able to represent the data
set more effectively and result in a better learning performance.
• Feature engineering is an important pre-processing step for machine
learning. It has two major elements:
1. feature transformation
2. feature subset selection
Why Feature Engineering?
• Better representation of data
• Better performing models
• Essential for model building and evaluation
• More flexibility on data types
How Do You Engineer Features?
• Feature engineering strategies can be applied to the following data types:
• Numeric data, Categorical data
• Text data, Temporal data
• Image data
• Another aspect of feature engineering has recently gained prominence.
Here, the machine itself tries to detect patterns and extract useful data
representations from the raw data, which can then be used as features. This
process is also known as “auto feature generation”.
• Deep Learning has proved to be extremely effective in this area, and neural
network architectures like convolutional neural networks (CNNs),
recurrent neural networks (RNNs), and Long Short-Term Memory
networks (LSTMs) are extensively used for automatic feature engineering and
extraction.
Feature Engineering on Categorical Data
• Any attribute or feature that is categorical in nature represents discrete
values that belong to a specific finite set of categories or classes.
• Category or class labels can be text or numeric in nature. Usually there
are two types of categorical variables—nominal and ordinal.
• Nominal categorical features don’t have any inherent order.
Example: video game genres, weather seasons, country names
• Ordinal categorical features have a specified order.
Example: clothing sizes, education levels
Engineer features from categorical data
• Transforming Nominal features
• Transforming Ordinal features
• Encoding Categorical features
• One-hot encoding (see the sketch after this list)
• Dummy coding scheme
• Effect coding scheme
• Bin-counting scheme
• Feature hashing scheme
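As a hedged illustration, the sketch below shows ordinal mapping, one-hot encoding, and dummy coding with pandas. The columns genre and size, their category values, and the size ordering are invented examples, not taken from the slides.

import pandas as pd

# Hypothetical toy data: 'genre' is nominal, 'size' is ordinal.
df = pd.DataFrame({
    "genre": ["Action", "Sports", "RPG", "Sports"],
    "size":  ["S", "M", "L", "M"],
})

# Ordinal feature: map categories to integers that respect their order.
size_order = {"S": 0, "M": 1, "L": 2}
df["size_encoded"] = df["size"].map(size_order)

# Nominal feature: one-hot encoding creates one binary column per category.
one_hot = pd.get_dummies(df["genre"], prefix="genre")

# Dummy coding: drop one category to avoid redundant (collinear) columns.
dummy = pd.get_dummies(df["genre"], prefix="genre", drop_first=True)

print(pd.concat([df, one_hot], axis=1))
print(dummy)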
Feature Engineering on Text Data
• In the case of unstructured data like text documents, the first challenge is
dealing with the unpredictable nature of the syntax, format, and
content of the documents, which makes it a challenge to extract useful
information for building models.
• The second challenge is transforming these textual representations into
numeric representations that can be understood by Machine Learning
algorithms.
• There exist various feature engineering techniques employed by data
scientists daily to extract numeric feature vectors from unstructured
text.
The typical workflow is to load some sample text documents and apply some basic
pre-processing before extracting features.
Text Pre-processing (a minimal sketch follows the list below)
• Text tokenization and lowercasing
• Removing special characters
• Removing stop words
• Correcting spellings
• Stemming
• Lemmatization
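These pre-processing steps can be sketched with a regular expression and NLTK, assuming the punkt, stopwords, and wordnet resources are available; the sample sentence is an invented example and spelling correction is omitted.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the NLTK resources used below:
# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

def preprocess(text):
    # Lowercase and remove special characters (keep letters, digits, spaces).
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # Tokenize into words.
    tokens = nltk.word_tokenize(text)
    # Remove stop words.
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_words]
    # Stemming chops tokens to a crude root; lemmatization maps them to a dictionary form.
    stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
    stems = [stemmer.stem(t) for t in tokens]
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    return stems, lemmas

print(preprocess("The cats are running quickly over the fences!"))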
Tokenization
Corpus: a collection of text documents.
Tokens: tokens are the basic meaningful units of a sentence or a document.
N-grams: an n-gram is a combination of N consecutive words or characters.
For example, for the sentence “I Love My Phone”, the bigrams (2-grams) are
“I Love”, “Love My”, and “My Phone”.
Tokenization: the process of splitting a text object into smaller units known
as tokens. Examples of tokens can be words, characters, numbers,
symbols, or n-grams.
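As a small illustration, the helper below (a hypothetical function written for this example) generates n-grams from the tokens of the sample sentence.

def ngrams(tokens, n):
    # Return all consecutive groups of n tokens, joined into strings.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I Love My Phone".split()
print(ngrams(tokens, 1))  # unigrams: ['I', 'Love', 'My', 'Phone']
print(ngrams(tokens, 2))  # bigrams:  ['I Love', 'Love My', 'My Phone']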
Create a term frequency matrix where rows are documents and columns
are distinct terms throughout all documents. Count word occurrences in
every text.
Compute inverse document frequency (IDF) using the previously
explained formula.
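A minimal sketch of both steps with scikit-learn follows; the three sample documents are invented, and note that TfidfVectorizer uses a smoothed IDF by default (ln((1 + n) / (1 + df)) + 1), which may differ slightly from the formula referred to above.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
]

# Term frequency matrix: rows are documents, columns are distinct terms.
cv = CountVectorizer()
tf = cv.fit_transform(docs)
print(cv.get_feature_names_out())
print(tf.toarray())

# TF-IDF down-weights terms that appear in many documents.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))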
Chi-square Test:
• The chi-square test is a technique for determining the relationship between
categorical variables.
• The chi-square statistic is calculated between each feature and the
target variable, and the desired number of features with the best
chi-square scores is selected.
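A minimal sketch of chi-square feature selection with scikit-learn's SelectKBest; the Iris data set and k = 2 are arbitrary illustrative choices (chi2 requires non-negative feature values).

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the best chi-square score against the target.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)
print(selector.scores_)        # chi-square statistic per feature
print(selector.get_support())  # boolean mask of the selected features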
Fisher's Score:
• Fisher's score is one of the popular supervised techniques for feature
selection. It ranks the variables by Fisher's criterion in
descending order; we can then select the variables with the largest Fisher's
scores.
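One common formulation of Fisher's score is the between-class variance of a feature divided by its within-class variance. The sketch below computes it manually with NumPy; the fisher_score helper is written for illustration and the Iris data set is an arbitrary example.

import numpy as np
from sklearn.datasets import load_iris

def fisher_score(X, y):
    # Between-class variance divided by within-class variance, per feature.
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        n_c = Xc.shape[0]
        between += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        within += n_c * Xc.var(axis=0)
    return between / within

X, y = load_iris(return_X_y=True)
scores = fisher_score(X, y)
print(np.argsort(scores)[::-1])  # feature indices ranked by descending score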
Missing Value Ratio:
• The missing value ratio of each feature can be evaluated against a
threshold value. Variables whose missing value ratio exceeds
the threshold can be dropped.
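A minimal pandas sketch, assuming an invented data frame and an arbitrary threshold of 0.5:

import numpy as np
import pandas as pd

# Hypothetical data frame with some missing entries.
df = pd.DataFrame({
    "age":    [25, 30, np.nan, 40, np.nan],
    "income": [50000, np.nan, np.nan, np.nan, 62000],
    "city":   ["NY", "LA", "SF", "NY", "LA"],
})

# Fraction of missing values in each column.
missing_ratio = df.isnull().mean()
print(missing_ratio)

# Drop every variable whose missing value ratio exceeds the threshold.
threshold = 0.5
df_reduced = df.loc[:, missing_ratio <= threshold]
print(df_reduced.columns.tolist())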
Embedded methods
• Embedded methods combine the advantages of both filter and
wrapper methods by considering interactions between features while keeping
the computational cost low.
• They are fast to compute, similar to filter methods, but
more accurate than filter methods.
• Some of the techniques are:
• Regularization
• Random forest importance
Regularization:
• Regularization adds a penalty term to the parameters of the
machine learning model to avoid overfitting.
• This penalty is applied to the coefficients; an L1 penalty shrinks some
coefficients exactly to zero.
• Those features with zero coefficients can be removed from the dataset.
• Common regularization techniques for feature selection are L1 regularization
(Lasso) and Elastic Net (combined L1 and L2 regularization).
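A hedged sketch of Lasso-based selection with scikit-learn; the diabetes data set and alpha = 1.0 are arbitrary illustrative choices, and the features are standardized first because L1 penalties are scale-sensitive.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

# L1 (Lasso) regularization shrinks some coefficients exactly to zero.
lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)

# Features whose coefficient was driven to zero can be dropped.
selected = np.flatnonzero(lasso.coef_ != 0)
print("selected feature indices:", selected)
X_reduced = X[:, selected]
print(X_reduced.shape)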
• Random Forest Importance:
• Tree-based feature selection methods provide feature importance scores
that give us a way of selecting features.
• Here, feature importance indicates which features matter most in
model building, i.e. which have the greatest impact on the target variable.
• Random Forest is such a tree-based method: a bagging
algorithm that aggregates a number of decision trees.
• It automatically ranks features by how much they decrease the
impurity (Gini impurity) across all the trees.
• Nodes are arranged according to these impurity values, which allows
pruning of the tree below a specific node.
• The remaining nodes correspond to a subset of the most important features.
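A minimal sketch with scikit-learn's RandomForestClassifier; the Iris data set, 200 trees, and the mean-importance cutoff are illustrative choices, not prescribed by the slides.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
X, y = data.data, data.target

# Importances come from the mean decrease in Gini impurity across all trees.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, imp in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {imp:.3f}")

# Keep the features whose importance exceeds the mean importance.
mask = forest.feature_importances_ > forest.feature_importances_.mean()
print("selected:", np.array(data.feature_names)[mask])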
Filter vs Wrapper vs Embedded methods