5. Feature Engineering

Feature engineering is the process of transforming raw data into a dataset, requiring creativity and domain knowledge from data analysts. Techniques include one-hot encoding, binning, normalization, standardization, and handling missing features, each serving to improve model performance. The document also discusses model training challenges like underfitting and overfitting, and methods to mitigate these issues, such as data augmentation and weight decay.


Basic Practice

Feature Engineering
• The problem of transforming raw data into a dataset is called feature engineering.

• For most practical problems, feature engineering is a labor-intensive process that demands from the data analyst a lot of creativity and, preferably, domain knowledge.

• Everything measurable can be used as a feature.

• The role of the data analyst is to create informative features.

• We say that a model has a low bias when it predicts the training data well.
Feature Engineering

• One-Hot Encoding

• Binning

• Normalization

• Standardization

• Dealing with Missing Features


One-Hot Encoding
• Some learning algorithms only work with numerical feature vectors.

• When some feature in your dataset is categorical, you can transform such a categorical feature into several binary ones.

• By doing so, you increase the dimensionality of your feature vectors.

• You should not transform categorical feature values into ordered numbers:

• this confuses the learning algorithm;

• the algorithm will try to find a regularity where there is none, which may potentially lead to overfitting.
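A minimal sketch of one-hot encoding in Python, assuming a hypothetical categorical feature "color" and pandas being available:

```python
import pandas as pd

# Hypothetical dataset with one categorical feature.
df = pd.DataFrame({"color": ["red", "yellow", "red", "green"]})

# One-hot encode: each category becomes its own binary feature,
# which increases the dimensionality of the feature vector.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(one_hot)   # columns: color_green, color_red, color_yellow
```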
Binning
• In binning, you have a numerical feature but you want to convert it into a categorical one.

• Binning (also called bucketing) is the process of converting a continuous feature into multiple binary features called bins or buckets, typically based on value ranges.

• For example, instead of representing age as a single real-valued feature, the analyst could chop ranges of age into discrete bins:

• all ages between 0 and 5 years old,

• all ages between 6 and 10 years old,

• all ages between 11 and 15 years old,

• and so on.

• In some cases, a carefully designed binning can help the learning algorithm learn using fewer examples,

• because we give a "hint" to the learning algorithm that if the value of a feature falls within a specific range, the exact value of the feature doesn't matter.
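A minimal sketch of binning with NumPy, using the age bins from the example above:

```python
import numpy as np

# Hypothetical ages and bin edges matching the example above.
ages = np.array([3, 7, 12, 4, 9])
bin_edges = [0, 6, 11, 16]                 # bins: [0-5], [6-10], [11-15]

# Index of the bin each age falls into.
bin_index = np.digitize(ages, bin_edges) - 1

# One binary feature per bin.
binned = np.eye(len(bin_edges) - 1)[bin_index]
print(binned)
```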
Normalization
• Normalization is the process of converting an actual range of values, which a numerical feature can take, into a standard range of values,

• typically the interval [−1, 1] or [0, 1].

• Example:

• the natural range of a particular feature is 350 to 1450;

• subtract 350 from every value of the feature, and divide the result by 1100;

• the result is in the range [0, 1].

• More generally, the normalization formula looks like this:

$$\bar{x}^{(j)} = \frac{x^{(j)} - \min^{(j)}}{\max^{(j)} - \min^{(j)}}$$
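A minimal sketch of min-max normalization, using the 350-to-1450 example above:

```python
import numpy as np

# Hypothetical feature values with natural range 350 to 1450.
x = np.array([350.0, 900.0, 1450.0, 700.0])

# Min-max normalization into [0, 1]: subtract the minimum, divide by the range.
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)   # 350 -> 0.0, 1450 -> 1.0
```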
Normalization (2)
• Why do we normalize?

• In practice, it can lead to an increased speed of learning.

• In gradient descent, if x1 is in the range [0, 1000] and x2 is in the range [0, 0.0001], then the derivative with respect to the larger feature will dominate the update.

• It helps to avoid numerical overflow (problems computers have when working with very small or very large numbers).
Standardization
• In standardization (or z-score normalization), the feature values are rescaled so that they have the properties of a standard normal distribution with μ = 0 and σ = 1,

• where μ is the mean (the average value of the feature, averaged over all examples in the dataset) and σ is the standard deviation from the mean.

• Standard scores (or z-scores) of features are calculated as follows:

$$\hat{x}^{(j)} = \frac{x^{(j)} - \mu^{(j)}}{\sigma^{(j)}}$$

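A minimal sketch of standardization (z-score normalization) with NumPy, using hypothetical feature values:

```python
import numpy as np

# Hypothetical feature values.
x = np.array([2.0, 4.0, 6.0, 8.0])

# z-score: subtract the mean, divide by the standard deviation.
x_std = (x - x.mean()) / x.std()
print(x_std.mean(), x_std.std())   # approximately 0 and 1
```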
Normalization vs. Standardization
• When should you use normalization and when standardization?

• Usually, if your dataset is not too big and you have time, you can try both and see which one performs better for your task.

• If you don't have time, as a rule of thumb:

• unsupervised learning algorithms, in practice, more often benefit from standardization than from normalization;

• standardization is also preferred for a feature if the values this feature takes are distributed close to a normal distribution (the so-called bell curve);

• standardization is preferred for a feature if it can sometimes have extremely high or low values (outliers), because normalization would "squeeze" the normal values into a very small range;

• in all other cases, normalization is preferable.
Dealing with Missing Features
• In some examples, values of some features can be missing.

• The typical approaches to dealing with missing values for a feature include:

• removing the examples with missing features from the dataset (that can be done if your dataset is big enough so you can sacrifice some training examples);

• using a data imputation technique.
Data Imputation
• One data imputation technique consists in replacing the missing value of a feature by the average value of this feature in the dataset:

$$\hat{x}^{(j)} \leftarrow \frac{1}{M} \sum_{i=1}^{N} x_i^{(j)}$$

• where M < N is the number of examples in which the value of feature j is present, and the summation excludes the examples in which the value of feature j is absent.

• Another technique is to replace the missing value with a value outside the normal range of values.

• For example, if the normal range is [0, 1], then you can set the missing value to 2 or −1.

• The idea is that the learning algorithm will learn what is best to do when the feature has a value significantly different from regular values.

• Alternatively, you can replace the missing value with a value in the middle of the range.

• For example, if the range for a feature is [−1, 1], you can set the missing value to be equal to 0.
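A minimal sketch of the two imputation techniques above, assuming missing values are encoded as NaN:

```python
import numpy as np

# Hypothetical feature column with missing values encoded as NaN.
x = np.array([0.2, np.nan, 0.5, 0.9, np.nan])

# Technique 1: replace missing values with the average of the observed values.
x_mean = np.where(np.isnan(x), np.nanmean(x), x)

# Technique 2: replace missing values with a value outside the normal range.
x_outside = np.where(np.isnan(x), -1.0, x)

print(x_mean)
print(x_outside)
```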
Data Imputation (2)

• Use the missing value as the target variable for a regression problem.

• If you have a significantly large dataset and just a few features with missing values, you can increase the dimensionality of your feature vectors by adding a binary indicator feature for each feature with missing values.
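A minimal sketch of adding a binary indicator feature for missingness, with a hypothetical NaN-encoded feature:

```python
import numpy as np

# Hypothetical feature with missing values encoded as NaN.
x = np.array([0.2, np.nan, 0.5, np.nan])

# Binary indicator: 1 where the value was missing, 0 otherwise.
missing_indicator = np.isnan(x).astype(float)

# Impute the original feature (here with the mean) and stack the indicator
# as an extra dimension of the feature vector.
x_imputed = np.where(np.isnan(x), np.nanmean(x), x)
features = np.column_stack([x_imputed, missing_indicator])
print(features)
```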
Learning Algorithm Selection
• Explainability

• In-memory vs. out-of-memory

• Number of features and examples

• Categorical vs. numerical features

• Nonlinearity of the data

• Training speed

• Prediction speed
Three Sets
• In practice, we work with three separate sets of data:

• training set,

• validation set,

• test set.

• The validation and test sets are called hold-out sets.

• There's no optimal proportion for splitting the dataset into these three subsets:

• in the past: 70/15/15;

• with big datasets: 95/2.5/2.5.

• We use the validation set to

• choose the learning algorithm,

• find the best values of hyper-parameters.

• We use the test set to assess the model before delivering it to the client or putting it in production.
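A minimal sketch of a 70/15/15 split into training, validation, and test sets (the dataset size and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000                                     # hypothetical dataset size
indices = rng.permutation(N)

n_train = int(0.70 * N)
n_val = int(0.15 * N)

train_idx = indices[:n_train]                # training set
val_idx = indices[n_train:n_train + n_val]   # validation (hold-out) set
test_idx = indices[n_train + n_val:]         # test (hold-out) set
print(len(train_idx), len(val_idx), len(test_idx))   # 700 150 150
```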
Underfitting & Overfitting
• If the model makes many mistakes on the training data, we say that the model has a high bias or that the model underfits.

• In overfitting, the model predicts the training data very well but predicts poorly on the data from at least one of the two hold-out sets.

• Overfitting is also called high variance.

[Figure: examples of underfitting (linear model), good fit (quadratic model), and overfitting (polynomial of degree 15).]
Overfitting
• How can we train a model that's complex enough to model the structure in the data, but prevent it from overfitting? That is, how do we achieve low bias and low variance?

• Our bag of tricks:

• data augmentation,

• reducing the number of parameters,

• weight decay,

• early stopping,

• ensembles (combining the predictions of different models).

• The best-performing models on most benchmarks use some or all of these tricks.
Data Augmentation
• The best way to improve generalization is to collect more data!

• Suppose we already have all the data we're willing to collect. We can augment the training data by transforming the examples. This is called data augmentation.

• Examples (for visual recognition), sketched in code after the figure below:

• translation,

• horizontal or vertical flip,

• rotation,

• smooth warping,

• noise (e.g. flipping random pixels).

• Only warp the training examples, not the test examples.

• The choice of transformations depends on the task (e.g. a horizontal flip makes sense for object recognition, but not for handwritten digit recognition).
Data Augmentation
[Figure: example augmentations of a training image: affine distortion, elastic deformation, noise, horizontal flip, random translation, hue shift.]
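A minimal sketch of data augmentation for images, assuming images are stored as (height, width, channels) arrays with values in [0, 1]; the specific transformations and probabilities are illustrative:

```python
import numpy as np

def augment(image, rng):
    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        image = image[:, ::-1, :]
    # Random translation by up to 2 pixels in each direction.
    shift = rng.integers(-2, 3, size=2)
    image = np.roll(image, shift=tuple(shift), axis=(0, 1))
    # Noise: flip a small random fraction of pixel values.
    noise_mask = rng.random(image.shape) < 0.01
    image = np.where(noise_mask, 1.0 - image, image)
    return image

# Apply only to training examples, never to test examples.
rng = np.random.default_rng(0)
augmented = augment(np.zeros((28, 28, 1)), rng)
```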
Weight Decay
[Figure: training curves showing the relationship between the number of training iterations and the training and test error; an idealized version, and a version accounting for fluctuations in the error caused by stochasticity in the SGD updates.]

• So far, all of the cost functions we've discussed have consisted of the average of some loss function over the training set.

• Often, we want to add another term, called a regularization term, or regularizer, which penalizes hypotheses we think are somehow pathological and unlikely to generalize well.

• The total cost, then, is

$$\mathcal{J}(\theta) = \underbrace{\frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(y(x_i, \theta), t_i)}_{\text{training loss}} + \underbrace{\mathcal{R}(\theta)}_{\text{regularizer}}$$
L2 Regularization
• For instance, suppose we are training a linear regression model with two inputs, x1 and x2, and these inputs are identical in the training set.

[Figure: two sets of weights, Hypothesis A and Hypothesis B, which make the same predictions when the inputs x1 and x2 are identical.]

• The two sets of weights shown in the figure make identical predictions on the training set, so they are equivalent from the standpoint of minimizing the loss.

• However, Hypothesis A is somehow better, because we would expect it to be more stable if the data distribution changes. E.g., suppose we observe the input (x1 = 1, x2 = 0) on the test set; Hypothesis A will predict 1, while Hypothesis B will predict −8. The former is probably more sensible.

• We would like a regularizer to favor Hypothesis A by assigning it a smaller penalty.

• One such regularizer which achieves this is L2 regularization; for a linear model, it is defined as follows:

$$\mathcal{R}_{L2}(\mathbf{w}) = \frac{\lambda}{2} \sum_{j=1}^{D} w_j^2$$

• (The hyperparameter λ is sometimes called the weight cost.)

• L2 regularization tends to favor hypotheses where the norms of the weights are smaller.
Weight Decay
• We've already seen that we can regularize a network by penalizing large weight values, thereby encouraging the weights to be small in magnitude:

$$\mathcal{J}_{\text{reg}} = \mathcal{J} + \mathcal{R} = \mathcal{J} + \frac{\lambda}{2} \sum_j w_j^2$$

• By incorporating the regularization term into the gradient descent update, we get an interesting interpretation:

$$w \leftarrow w - \alpha \left( \frac{\partial \mathcal{J}}{\partial w} + \frac{\partial \mathcal{R}}{\partial w} \right) = w - \alpha \left( \frac{\partial \mathcal{J}}{\partial w} + \lambda w \right) = (1 - \alpha \lambda) w - \alpha \frac{\partial \mathcal{J}}{\partial w}$$

• In each iteration, we shrink the weights by a factor of 1 − αλ.

• For this reason, L2 regularization is also known as weight decay.

• In the academic literature, L2 regularization is also known as ridge regression or Tikhonov regularization.
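A minimal sketch of one weight-decay SGD step for a linear model with squared error; the learning rate and weight cost values are illustrative:

```python
import numpy as np

def sgd_step_with_weight_decay(w, x, t, alpha=0.01, lam=0.1):
    y = x @ w                        # prediction of the linear model
    grad_loss = (y - t) * x          # gradient of 0.5 * (y - t)**2 w.r.t. w
    # Equivalent to w <- (1 - alpha * lam) * w - alpha * grad_loss
    return w - alpha * (grad_loss + lam * w)

w = np.array([1.0, -2.0])
w = sgd_step_with_weight_decay(w, x=np.array([0.5, 0.3]), t=1.0)
print(w)
```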
Weight Decay
• Why do we want the weights to be small? Compare a well-fitting polynomial with an overfitting one:

$$y = 0.1x^5 + 0.2x^4 + 0.75x^3 - x^2 - 2x + 2$$

$$y = -7.2x^5 + 10.4x^4 + 24.5x^3 - 37.9x^2 - 3.6x + 12$$

• The second polynomial overfits; notice that it has really large coefficients.

• Regularizers are sometimes viewed as penalizing the "complexity" of a network, or favoring explanations which are "more likely."
L1 Regularization
• L1 regularization on the model parameters w is defined as

$$\mathcal{R}_{L1}(\mathbf{w}) = \frac{\lambda}{2} \sum_{j=1}^{D} |w_j|$$

• that is, as the sum of the absolute values of the individual parameters.

• In comparison to L2 regularization, L1 regularization results in a solution that is more sparse. Sparsity in this context refers to the fact that some parameters have an optimal value of zero.

• The sparsity property induced by L1 regularization has been used extensively as a feature selection mechanism.

• Notable example: the LASSO model.
L1 Regularization
• L1 penalizes weights equally regardless of the magnitude of those weights.

• L2 penalizes bigger weights more than smaller weights.

• For example, suppose w3 = 100 and w4 = 10.

• By reducing w3 by 1, L1's penalty is reduced by 1. By reducing w4 by 1, L1's penalty is also reduced by 1.

• By reducing w3 by 1, L2's penalty is reduced by 199. By reducing w4 by 1, L2's penalty is reduced by only 19.

• Thus, L2 tends to prefer reducing w3 over w4.

• In general, when a weight wi is already small in magnitude, L2 does not care to reduce it to zero; L2 would rather shrink big weights than eliminate small weights.

• On the other hand, L1 cares about reducing big weights and small weights equally. With L1, the less informative features get reduced, and some features may get eliminated completely, which gives us feature selection.
Early Stopping
• We don't always want to find a global (or even local) optimum of our cost function. It may be advantageous to stop training early.

• Early stopping: monitor performance on a validation set, and stop training when the validation error starts going up.
Early Stopping
• A slight catch: the validation error fluctuates because of stochasticity in the updates.

• Determining when the validation error has actually leveled off can be tricky.
Early Stopping
• Why does early stopping work?

• Weights start out small, so it takes time for them to grow large.

• Therefore, it has a similar effect to weight decay.

• If you are using sigmoidal units and the weights start out small, then the inputs to the activation functions take only a small range of values.

• Therefore, the network starts out approximately linear, and gradually becomes more nonlinear (and hence more powerful).
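A minimal sketch of the early-stopping logic; train_one_epoch, validation_error, get_weights, and set_weights are hypothetical callables for some model, and a patience counter is used because the validation error fluctuates:

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              get_weights, set_weights,
                              patience=10, max_epochs=1000):
    best_error = float("inf")
    best_weights = get_weights()
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch()                  # one pass over the training set
        val_error = validation_error()     # error on the hold-out validation set

        if val_error < best_error:
            best_error = val_error
            best_weights = get_weights()
            epochs_without_improvement = 0
        else:
            # Wait several epochs without improvement before stopping,
            # since the validation error fluctuates.
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break

    set_weights(best_weights)              # restore the best model seen
```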
Ensembles
• If you average the predictions of multiple networks trained independently on separate training sets, this reduces the variance of the predictions, which can lead to lower loss.

• But we may not have separate training sets.

• However, we can try to simulate the effect of independent training sets by somehow injecting variability into the training procedure:

• train on random subsets of the full training data (this procedure is known as bagging);

• train networks with different architectures (e.g. different numbers of layers or units, or a different choice of activation function);

• use entirely different models or learning algorithms.

• The set of trained models whose predictions we are combining is known as an ensemble.

• Ensembles can improve generalization quite a bit, and the winning systems for most machine learning benchmarks are ensembles.

• But they are expensive, and their predictions can be hard to interpret.
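A minimal sketch of bagging; make_model is a hypothetical factory returning an object with fit and predict methods:

```python
import numpy as np

def bagging_ensemble(make_model, X, y, n_models=5, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: a random subset drawn with replacement.
        idx = rng.integers(0, len(X), size=len(X))
        model = make_model()
        model.fit(X[idx], y[idx])
        models.append(model)
    return models

def ensemble_predict(models, X):
    # Average the predictions of all members of the ensemble.
    return np.mean([m.predict(X) for m in models], axis=0)
```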
Performance Assessment
• How can you say how good the model is?

• You use the test set to assess the model

• Model Assessment:

• In regression

• Mean Square Error (MSE)

• In Classification

• Confusion matrix,

• Accuracy,

• Cost-sensitive accuracy,

• Precision/Recall, and

• Area under the ROC curve.


• In regression, if the MSE of the model on the test data is substantially higher than the MSE obtained on the training data, this is a sign of overfitting.

• Regularization or better hyperparameter tuning could solve the problem.

• The meaning of "substantially higher" depends on the problem at hand and has to be decided by the data analyst jointly with the decision maker/product owner who ordered the model.

Confusion Matrix
• The confusion matrix is a table that summarizes how successful the classification model is at predicting examples belonging to various classes.

• One axis of the confusion matrix is the label that the model predicted, and the other axis is the actual label.

• To simplify the illustration, consider a binary classification problem where the model predicts two classes, "spam" and "not_spam":

                      spam (predicted)   not_spam (predicted)
  spam (actual)       23 (TP)            1 (FN)
  not_spam (actual)   12 (FP)            556 (TN)

• TP: True Positive

• FP: False Positive

• FN: False Negative

• TN: True Negative

• This confusion matrix shows that of the 24 examples that actually were spam, the model correctly classified 23 as spam (TP = 23) and incorrectly classified 1 example as not_spam (FN = 1).

• If the confusion matrix shows that the model confuses some classes, you can add more labeled examples of those classes to help the learning algorithm "see" the difference between them; alternatively, you might add additional features the learning algorithm can use to build a model that better distinguishes between them.
Precision/Recall
• The confusion matrix is used to calculate two other performance metrics: precision and recall, the two most frequently used metrics to assess a model.

• Precision is the ratio of correct positive predictions to the overall number of positive predictions:

$$\text{precision} \stackrel{\text{def}}{=} \frac{TP}{TP + FP}$$

• Recall is the ratio of correct positive predictions to the overall number of positive examples in the dataset:

$$\text{recall} \stackrel{\text{def}}{=} \frac{TP}{TP + FN}$$

• To understand the meaning and importance of precision and recall for model assessment, it is often useful to think about the prediction problem as the problem of searching for documents in a database using a query:

• precision is the proportion of relevant documents in the list of all returned documents;

• recall is the ratio of the relevant documents returned by the search engine to the total number of relevant documents that could have been returned.

• In the case of the spam detection problem, we want to have high precision (we want to avoid mistakenly labeling a legitimate message as spam) and we are ready to tolerate lower recall (we tolerate some spam messages in our inbox).

• To assess these metrics for one class of a multiclass problem, you consider all examples of the selected class as positives and all examples of the remaining classes as negatives.
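Precision and recall computed from the spam-detection confusion matrix above:

```python
TP, FP, FN, TN = 23, 12, 1, 556      # values from the confusion matrix above

precision = TP / (TP + FP)           # 23 / 35  ~ 0.66
recall = TP / (TP + FN)              # 23 / 24  ~ 0.96
print(precision, recall)
```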
Accuracy
• Accuracy is given by the number of correctly classified examples divided by the total number of classified examples.

• In terms of the confusion matrix, it is given by:

$$\text{accuracy} \stackrel{\text{def}}{=} \frac{TP + TN}{TP + TN + FP + FN}$$

• Accuracy is a useful metric when errors in predicting all classes are equally important. In the case of spam/not_spam, this may not be the case; for example, you would tolerate false positives less than false negatives:

• a false positive in spam detection is the situation in which your friend sends you an email, but the model labels it as spam and doesn't show it to you;

• a false negative is less of a problem: if your model doesn't detect a small percentage of spam messages, it's not a big deal.
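Accuracy computed from the same confusion matrix:

```python
TP, FP, FN, TN = 23, 12, 1, 556

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 579 / 592 ~ 0.98
print(accuracy)
```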
Cross Validation
• When you don't have a decent validation set to tune your hyper-parameters on, the common technique that can help is called cross-validation.

• It works as follows:

• First, you fix the values of the hyper-parameters you want to evaluate.

• Then you split your training set into several subsets of the same size, which are called folds.

• For five folds F1 to F5, you train five models as follows:

• to train the first model, f1, you use all examples from folds F2, F3, F4, and F5 as the training set and the examples from F1 as the validation set;

• to train the second model, f2, you use the examples from folds F1, F3, F4, and F5 to train and the examples from F2 as the validation set.

• You continue building models iteratively like this and compute the value of the metric of interest on each validation set, from F1 to F5.

• Then you average the five values of the metric to get the final value.
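A minimal sketch of five-fold cross-validation; train_and_evaluate is a hypothetical function that trains a model with the fixed hyper-parameters and returns the metric of interest on the given validation fold:

```python
import numpy as np

def cross_validate(X, y, train_and_evaluate, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, n_folds)     # F1 ... F5

    scores = []
    for k in range(n_folds):
        val_idx = folds[k]                       # one fold for validation
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != k])  # the rest for training
        scores.append(train_and_evaluate(X[train_idx], y[train_idx],
                                         X[val_idx], y[val_idx]))
    # Average the metric over the folds to get the final value.
    return float(np.mean(scores))
```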
