5. Feature Engineering
Feature Engineering
• The problem of transforming raw data into a dataset is called
feature engineering
• One-Hot Encoding
• Binning
• Normalization
• Standardization
• Binning (also called bucketing) is the process of converting a continuous feature into multiple binary
features called bins or buckets, typically based on value range
• For example, instead of representing age as a single real-valued feature, the analyst could chop ranges of age into discrete bins: ages 0 to 5 in one bin, 6 to 10 in the next, and so on
• In some cases, a carefully designed binning can help the learning algorithm to learn using fewer
examples.
• Because we give a “hint” to the learning algorithm that if the value of a feature falls within a specific range, the exact value of the feature doesn’t matter
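• A minimal sketch of binning in Python, assuming numpy; the ages and the bin boundaries are made-up illustrations, not values from the text:

```python
import numpy as np

# Hypothetical ages and bin boundaries; each age becomes a one-hot "bucket".
ages = np.array([18, 23, 37, 45, 62, 71])
edges = [0, 30, 50, 120]                    # assumed bin boundaries

bin_idx = np.digitize(ages, edges[1:-1])    # which bin each age falls into
bins = np.eye(len(edges) - 1)[bin_idx]      # one binary feature per bin
print(bins)                                 # e.g. age 37 -> [0, 1, 0]
```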
Normalization
• Normalization is the process of converting an actual range of values which a numerical feature can take into a standard range of values, typically in the interval [−1, 1] or [0, 1]
• Example: the natural range of a particular feature is 350 to 1450
• Subtract 350 from every value of the feature, and divide the result by 1100
• The result is in the range [0, 1]
• More generally, the normalization formula looks like this:

\bar{x}^{(j)} = \frac{x^{(j)} - \min^{(j)}}{\max^{(j)} - \min^{(j)}}
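• A minimal sketch of min-max normalization in Python (numpy assumed), reusing the 350 to 1450 example; the concrete feature values are made up:

```python
import numpy as np

x = np.array([350.0, 700.0, 1100.0, 1450.0])    # hypothetical feature values

x_norm = (x - x.min()) / (x.max() - x.min())    # (x - min) / (max - min)
print(x_norm)                                   # all values now lie in [0, 1]
```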
Normalization (2)
• Why do we normalize?
• The derivative with respect to a larger feature will dominate the update.
• Additionally, it’s useful to ensure that our inputs are roughly in the same relatively small range, to avoid problems which computers have when working with very small or very large numbers (known as numerical overflow).
Standardization
• In Standardization (or z-score normalization) the feature values are rescaled so that they have the properties of a standard normal distribution with μ = 0 and σ = 1, where μ is the mean (the average value of the feature, averaged over all examples in the dataset) and σ is the standard deviation from the mean
• Standard scores (or z-scores) of features are calculated as follows:

\hat{x}^{(j)} = \frac{x^{(j)} - \mu^{(j)}}{\sigma^{(j)}}
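• A minimal sketch of standardization (z-score normalization) in Python, numpy assumed; the feature values are made up:

```python
import numpy as np

x = np.array([350.0, 700.0, 1100.0, 1450.0])    # hypothetical feature values

x_std = (x - x.mean()) / x.std()                # (x - mu) / sigma
print(x_std.mean(), x_std.std())                # approximately 0 and 1
```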
Normalization vs Standardization
• When you should use normalization and when standardization?
• Usually, if your dataset is not too big and you have time, you can try both and see
which one performs better for your task
• Unsupervised learning algorithms, in practice, more often benefit from standardization than from normalization;
• Standardization is also preferred for a feature if the values this feature takes are distributed
close to a normal distribution (so-called bell curve);
• Standardization is preferred for a feature if it can sometimes have extremely high or low
values (outliers); this is because normalization will “squeeze” the normal values into a very
small range;
Data Imputation Techniques
• Replace the missing value of a feature by the average value of this feature in the dataset:

\hat{x}^{(j)} \leftarrow \frac{1}{M} \sum_{i=1}^{N} x_i^{(j)},

where M < N is the number of examples in which the value of feature j is present, and the summation excludes the examples in which the value of feature j is absent
• Replace the missing value with a value outside the normal range of values
• For example, if the normal range is [0, 1], then you can set the missing value to 2 or −1
• The idea is that the learning algorithm will learn what is best to do when the feature has a value significantly different from regular values
• Replace the missing value by a value in the middle of the range
• For example, if the range for a feature is [−1, 1], you can set the missing value to be equal to 0
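• A minimal sketch of the three imputation options above, assuming numpy and using np.nan to mark missing values; the feature values and the [0, 1] normal range are made up:

```python
import numpy as np

x = np.array([0.2, np.nan, 0.7, 0.4, np.nan])              # hypothetical feature

mean_imputed    = np.where(np.isnan(x), np.nanmean(x), x)  # average of present values
outside_range   = np.where(np.isnan(x), 2.0, x)            # value outside [0, 1]
middle_of_range = np.where(np.isnan(x), 0.5, x)            # middle of the [0, 1] range
print(mean_imputed, outside_range, middle_of_range)
```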
Data Imputation (2)
• Training speed
• Prediction speed
Three Sets
• In practice, we work with three separate sets of data:
• Training set,
• Validation set,
• Test set
• There’s no optimal proportion to split the dataset into these three subsets.
• We use the test set to assess the model before delivering it to the client or putting it in production
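• A minimal sketch of one way to split a dataset, assuming numpy; the 70/15/15 proportions are only an illustration, not a recommended split:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                   # hypothetical dataset size
indices = rng.permutation(n)               # shuffle before splitting

n_train, n_val = int(0.70 * n), int(0.15 * n)
train_idx = indices[:n_train]
val_idx   = indices[n_train:n_train + n_val]
test_idx  = indices[n_train + n_val:]
print(len(train_idx), len(val_idx), len(test_idx))   # 700 150 150
```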
Underfitting & Overfitting
• If the model makes many mistakes on the training data, we say that the model has a high bias or that the model underfits.
• In overfitting, the model predicts very well the training data but poorly the data from at least one of the two holdout sets.
Figure 2: Examples of underfitting (linear model), good fit (quadratic model), and overfitting
(polynomial of degree 15).
Overfitting
• How can we train a model that’s complex enough to model the structure in the data, but prevent it from overfitting? I.e., how to achieve low bias and low variance?
• data augmentation
• weight decay
• early stopping
• The best-performing models on most benchmarks use some or all of these tricks.
Data Augmentation
• The best way to improve generalization is to collect more data!
• Suppose we already have all the data we’re willing to collect. We can augment the training
data by transforming the examples. This is called data augmentation.
• translation
• smooth warping
• The choice of transformations depends on the task. (E.g. horizontal flip for object recognition, but not handwritten digit recognition.)
Data Augmentation
• Examples of transformations: affine distortion, elastic deformation, noise, horizontal flip, random translation, hue shift
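• A minimal sketch of a few of these transformations, assuming numpy and a small random array standing in for an image:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))                         # hypothetical grayscale image

flipped = image[:, ::-1]                             # horizontal flip
dy, dx = rng.integers(-3, 4, size=2)                 # random translation offsets
translated = np.roll(image, shift=(dy, dx), axis=(0, 1))
noisy = image + 0.05 * rng.normal(size=image.shape)  # additive noise
```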
Weight Decay
Figure 4: Training curves, showing the relationship between the number of
training iterations and the training and test error. (left) Idealized version.
(right) Accounting for fluctuations in the error, caused by stochasticity in
the SGD updates.
• So far, all of the cost functions we’ve discussed have
consisted of the average of some loss function over the
training set.
L2 Regularization
For instance, suppose we are training a linear regression model with two inputs, x1 and x2, and these inputs are identical in the training set. The two sets of weights shown in Figure 5 will make identical predictions on the training set, so they are equivalent from the standpoint of minimizing the loss. However, Hypothesis A is somehow better, because we would expect it to be more stable if the data distribution changes. E.g., suppose we observe the input (x1 = 1, x2 = 0) on the test set; in this case, Hypothesis A will predict 1, while Hypothesis B will predict −8. The former is probably more sensible. We would like a regularizer to favor Hypothesis A by assigning it a smaller penalty.
• One such regularizer which achieves this is L2 regularization; for a linear model, it is defined as follows:

R_{L2}(w) = \frac{\lambda}{2} \sum_{j=1}^{D} w_j^2

• (The hyperparameter λ is sometimes called the weight cost.)
• L2 regularization tends to favor hypotheses where the norms of the weights are smaller.
Weight Decay
• We’ve already seen that we can regularize a network by penalizing large weight values, thereby encouraging the weights to be small in magnitude. By adding the L2 penalty, the cost function becomes:

J_{reg} = J + \lambda R = J + \frac{\lambda}{2} \sum_j w_j^2

• By incorporating the regularization term in the gradient descent update, we get an interesting interpretation; the update can be interpreted as weight decay:

w \leftarrow w - \alpha \left( \frac{\partial J}{\partial w} + \lambda \frac{\partial R}{\partial w} \right)
  = w - \alpha \left( \frac{\partial J}{\partial w} + \lambda w \right)
  = (1 - \alpha\lambda)\, w - \alpha \frac{\partial J}{\partial w}

• In each iteration, we shrink the weights by a factor of 1 − αλ.
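• A minimal sketch of the weight-decay update for a toy linear model with squared-error loss; the learning rate alpha and the hyperparameter lam are assumptions:

```python
import numpy as np

def sgd_step_with_weight_decay(w, x, y, alpha=0.1, lam=0.01):
    grad_J = 2 * (w @ x - y) * x                    # gradient of the unregularized loss
    return (1 - alpha * lam) * w - alpha * grad_J   # shrink the weights, then step

w = np.array([1.0, -8.0])          # hypothetical starting point with unequal weights
x, y = np.array([1.0, 1.0]), 2.0   # identical inputs, as in the example above
for _ in range(100):
    w = sgd_step_with_weight_decay(w, x, y)
print(w)                           # the decay term slowly pulls w1 and w2 together
```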
• Similarly, L1 regularization for a linear model is defined as:

R_{L1}(w) = \frac{\lambda}{2} \sum_{j=1}^{D} |w_j|

• In general, when a weight wi has already become small in magnitude, L2 does not care to reduce it to zero; L2 would rather reduce big weights than eliminate small weights.
• On the other hand, L1 cares about reducing big weights and small weights equally. For L1, the less informative features get reduced. Some features may get completely eliminated by L1, thus we have feature selection.
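• A minimal sketch of why that happens, assuming numpy: the penalty gradients below follow the R_L2 and R_L1 definitions given above, for one small and one large weight (the weight values and lam are made up):

```python
import numpy as np

w = np.array([0.01, 5.0])            # one small and one large weight
lam = 0.1

grad_L2 = lam * w                    # d/dw of (lam/2) * sum(w_j^2): proportional to w
grad_L1 = (lam / 2) * np.sign(w)     # d/dw of (lam/2) * sum(|w_j|): same size for all w
print(grad_L2)                       # [0.001 0.5 ]  -> barely pushes the small weight
print(grad_L1)                       # [0.05  0.05]  -> keeps pushing it toward zero
```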
Early Stopping
• We don’t always want to find a global (or even local) optimum of our cost function. It may be advantageous to stop training early.
• A slight catch: validation error fluctuates because of stochasticity in the updates.
• Weights start out small, so it takes time for them to grow large.
• If you are using sigmoidal units, and the weights start out small, then the inputs to the activation functions take only a small range of values.
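• A minimal, self-contained sketch of early stopping on a toy linear-regression problem, assuming numpy; the data, learning rate and patience value are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train, X_val = rng.normal(size=(50, 3)), rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
y_train = X_train @ w_true + 0.1 * rng.normal(size=50)
y_val   = X_val @ w_true + 0.1 * rng.normal(size=20)

w = np.zeros(3)
best_err, best_w, patience, bad_epochs = np.inf, w.copy(), 5, 0
for epoch in range(200):
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= 0.05 * grad                                 # one gradient step
    val_err = np.mean((X_val @ w - y_val) ** 2)      # monitor the validation error
    if val_err < best_err:
        best_err, best_w, bad_epochs = val_err, w.copy(), 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:                       # stop once it stops improving
        break
print(epoch, best_err)                               # best weights are kept in best_w
```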
• However, we can try to simulate the effect of independent training sets by somehow injecting variability into the training procedure
• Train on random subsets of the full training data. This procedure is known as bagging (see the sketch after this list).
• Train networks with different architectures (e.g. different numbers of layers or units, or different choice of activation function).
• The set of trained models whose predictions we are combining is known as an ensemble.
• Ensembles can improve generalization quite a bit, and the winning systems for most machine learning benchmarks are ensembles.
• But they are expensive, and the predictions can be hard to interpret.
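• A minimal sketch of bagging with a toy least-squares model, assuming numpy; the data and ensemble size are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5]) + 0.2 * rng.normal(size=100)

ensemble = []
for _ in range(10):
    idx = rng.choice(len(X), size=len(X), replace=True)   # bootstrap resample
    w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)   # fit one ensemble member
    ensemble.append(w)

x_new = np.array([0.3, -1.0])
prediction = np.mean([x_new @ w for w in ensemble])       # average the members' predictions
print(prediction)
```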
Performance Assessment
• How can you say how good the model is?
• Model Assessment:
• In regression
• In Classification
• Confusion matrix,
• Accuracy,
• Cost-sensitive accuracy,
• Precision/Recall, and
• Area under the ROC curve
Confusion Matrix
• A table that summarizes how successful the classification model is at predicting examples belonging to various classes
• One axis of the confusion matrix is the label that the model predicted, and the other axis is the actual label
• To simplify the illustration, we use a binary classification problem; where necessary, the approach extends to the multiclass case
• In a binary classification problem there are two classes. Let’s say the model predicts two classes: “spam” and “not_spam”
• Example for spam detection:
• TP: True Positive, FP: False Positive, FN: False Negative, TN: True Negative
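• A minimal sketch of counting the four cells of a binary confusion matrix, assuming numpy; the labels are made up, with 1 standing for “spam” and 0 for “not_spam”:

```python
import numpy as np

actual    = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # hypothetical true labels
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # hypothetical model predictions

TP = np.sum((predicted == 1) & (actual == 1))    # spam predicted as spam
FP = np.sum((predicted == 1) & (actual == 0))    # not_spam predicted as spam
FN = np.sum((predicted == 0) & (actual == 1))    # spam predicted as not_spam
TN = np.sum((predicted == 0) & (actual == 0))    # not_spam predicted as not_spam
print(np.array([[TP, FN], [FP, TN]]))            # rows: actual, columns: predicted
```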
Precision/Recall
• Precision and recall are the two most frequently used metrics to assess the model; they are calculated from the confusion matrix.
• Precision is the ratio of correct positive predictions to the overall number of positive predictions:

\text{precision} \stackrel{\text{def}}{=} \frac{TP}{TP + FP}

• Recall is the ratio of correct positive predictions to the overall number of positive examples in the dataset:

\text{recall} \stackrel{\text{def}}{=} \frac{TP}{TP + FN}

• Example:
• To understand the meaning and importance of precision and recall, it is often useful to think about the prediction problem as the problem of research of documents in the database using a query.
• The precision is the proportion of relevant documents in the list of all returned documents.
• The recall is the ratio of the relevant documents returned by the search engine to the total number of the relevant documents that could have been returned.
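• A minimal sketch using the illustrative confusion-matrix counts from the sketch above:

```python
TP, FP, FN = 3, 1, 1           # counts from the confusion-matrix sketch above

precision = TP / (TP + FP)     # correct positive predictions / all positive predictions
recall    = TP / (TP + FN)     # correct positive predictions / all positive examples
print(precision, recall)       # 0.75 0.75
```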
Accuracy
• Accuracy is given by the number of correctly classified examples divided by the total number of classified examples.
• In terms of the confusion matrix, it is given by:

\text{accuracy} \stackrel{\text{def}}{=} \frac{TP + TN}{TP + TN + FP + FN}

• Accuracy is a useful metric when errors in predicting all classes are equally important. In the case of spam/not_spam, this may not be the case.
• For example, you would tolerate false positives less than false negatives. A false positive in spam detection is the situation in which your friend sends you an email, but the model labels it as spam and doesn’t show it to you.
• On the other hand, the false negative is less of a problem: if your model doesn’t detect a small percentage of spam messages, it’s not a big deal.
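• A minimal sketch continuing the same illustrative counts:

```python
TP, TN, FP, FN = 3, 3, 1, 1    # counts from the confusion-matrix sketch above

accuracy = (TP + TN) / (TP + TN + FP + FN)   # correctly classified / all classified
print(accuracy)                              # 0.75
```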
Cross Validation
• When you don’t have a decent validation set to tune your hyper-parameters on, the common technique
that can help you is called cross-validation
• It works as follows:
• You split your training set into several subsets of the same size, which are called folds (say, five folds F1, F2, F3, F4, F5)
• To train the first model, f1, you use all examples from folds F2, F3, F4, and F5 as the training set and the examples from F1 as the validation set
• To train the second model, f2, you use the examples from folds F1, F3, F4, and F5 to train and the examples from F2 as the validation set
• You continue building models iteratively like this and compute the value of the metric of interest on each validation set, from F1 to F5
• Then you average the five values of the metric to get the final value
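• A minimal sketch of five-fold cross-validation on a toy regression task, assuming numpy; the model (ordinary least squares) and the metric (mean squared error) are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

folds = np.array_split(rng.permutation(len(X)), 5)        # F1 ... F5
scores = []
for k in range(5):                                        # model f_{k+1}
    val_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    scores.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
print(np.mean(scores))                                    # average of the five values
```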