Machine and Deep Learning (Nezar a. El-Kady)
CREDITS TO SOURCES:
Main courses: (Pages with white background )
Part1(Machine learning): Machine learning course By Andrew NG.
Part2 (Deep Learning Basics): Course1+Course2 of deep learning
specialization By Andrew NG + Draft version of Machine Learning
Yearning book.
Part3(Computer Vision):Course4 of deep learning specialization by
Andrew NG
Part4(Natural Language Processing):
2018
PART 1
MACHINE LEARNING
Introduction:
What is machine learning ?
1st def.: A field of study that gives the computer the ability to learn without being explicitly programmed. (Arthur Samuel)
2nd def.: Well-posed learning problem: a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. (Tom Mitchell)
Q. Ans.
T: Classifying…
E: Watching…
P: The number
correctly classified…
Note: Clustering is used in Google News to cluster different websites talking about a similar topic into one group holding the different URLs to those websites. (https://news.google.com/?hl=en-US&gl=US&ceid=US:en)
Note: Classification means discrete valued output while Regression means continuous valued
output.
Q. Ans:
The answer will be no. 3, as the first problem is continuous (predicting a quantity of items), while the second problem is discrete: either hacked '1' or not hacked '0'.
Machine learning Intro
What is Machine learning?
1-"The science of getting computers to learn and act like humans do, and improve their learning
over time in autonomous fashion, by feeding them data and information in the form of
observations and real-world interactions."
2-“Machine Learning at its most basic is the practice of using algorithms to parse data, learn
from it, and then make a determination or prediction about something in the world.” – Nvidia
3-“Machine learning is the science of getting computers to act without being explicitly
programmed.” – Stanford
4-“Machine learning algorithms can figure out how to perform important tasks by generalizing
from examples.” – University of Washington
5-“The field of Machine Learning seeks to answer the question “How can we build computer
systems that automatically improve with experience, and what are the fundamental laws that
govern all learning processes?” – Carnegie Mellon University
6-Machine learning research is part of research on artificial intelligence, seeking to provide
knowledge to computers through data, observations and interacting with the world. That
acquired knowledge allows computers to correctly generalize to new settings" – Dr. Yoshua
Bengio, Université de Montréal
We can say that machine learning is a science that tries to imitate human ways of learning and apply them to a machine, so that the machine can learn from the inputs and examples it is given and generalize them to anything similar, even partially; over time the machine learns more and improves its performance. Machine learning happens through 3 basic steps. The first is representation: how do I express the problem, or the thing the machine will learn, as an algorithm the machine can understand and apply? This has many forms. The second step is evaluation (or correction): checking how well the algorithm I used performed; measuring the machine's performance also has different methods, so I can judge how accurate the algorithm I used is. The third stage is optimization: improving the algorithm I am using and adjusting it over time to improve its performance; we can call this a tuning process, done with each epoch (loop).
Unsupervised means that the data fed to your algorithm is unlabeled, i.e. you have no explicit instructions on what to do with this data. So there is no expected output or correct answers here; the algorithm instead tries to find some features to divide the data upon. Unsupervised learning is used in clustering, anomaly detection, association and auto-encoders.
Is there any other types of machine learning process ? If yes, what are they?
There are two other main learning processes: semi-supervised learning and reinforcement learning.
Semi-supervised learning is an intermediate between supervised and unsupervised, where part of the data is labeled while the other part is unlabeled. This method is motivated by the fact that labeling examples is a time-intensive and expensive task, while at the same time extracting features from unlabeled data is difficult. Generative adversarial networks (GANs) use this type of learning.
Reinforcement learning is a method in which an agent interacts with an environment through actions a1, a2, …; these actions produce rewards or punishments r1, r2, …, and the process repeats itself several times; as more rounds of feedback take place, the better the agent's strategy becomes. In this kind of machine learning, AI agents attempt to find the optimal way to accomplish a particular goal, or to improve performance on a specific task. As the agent takes an action that moves it toward the goal, it receives a reward. The overall aim: predict the best next step to take to earn the biggest final reward. Video games are full of reinforcement learning: complete a level and earn a badge; defeat the bad guy in a certain number of moves and earn a bonus; step into a trap and it's game over.
What is the difference between classification and regression?
Mainly, classification means discrete output while regression means continuous output, but let's look at them a bit more deeply:
Both classification and regression are prediction tasks, but classification is more about identifying group membership, so the output takes class labels, while regression has infinitely many possible outputs, or in other words is not classified (non-categorical).
However, in classification we can still use probabilistic models (continuous probability output) such as logistic regression, but we use that probability to assign our output to a specific class or label.
Classification: discrete output value; categorical (unordered); decision boundary; integer value representing the class; evaluated by accuracy.
Regression: continuous output value; non-categorical (ordered); best-fit line; integer or float value; evaluated by mean squared error.
Lec2: (Linear Regression)
Model representation(Univariate):
Ex
We are given the price of housing for some sizes of
housing (dataset) so it's supervised, and since
values are continuous so it's a regression problem.
Notes:
1-Dataset is aka training set.
2- The word hypothesis was used in the
early days of machine learning.
3- h maps from x's to y's.
How do we represent h ?
hθ(x) or h(x) = θ0 + θ1x: linear regression with one variable (univariate linear regression)
Note:
There is a major difference
between n & m where m is
the number of training
examples while n is the
number of features
describing each training
example. Similarly with x(i)
& xj(i) where x(i) is the
training example including
all features of this example
while xj(i) is the value of feature j of the training example i.
For our example, x(2) is the feature vector of the second training example (all of its features), while x3(2) is 2, and n = 4 as each example has 4 features (size, …).
Notes: 1- x1 means feature 1 of all examples (x1(i) for any i). 2- x0 is always equal to 1.
Gradient descent(Multivariate):
We will annotate all θ's (θ0, θ1 …, θn) as θ so
instead of saying J(θ0, θ1 …, θn), we will say
just J(θ).
Remember:
1-x0(i)= x0 = 1
2-Always do simultaneous updates of the values of the θ's (θj for j = 0, 1, 2, …, n).
Feature scaling:
Idea: make sure the features are on a similar scale.
For our previous example, x1 is the size, which ranges from 0 to 2000 feet², and x2 is the number of bedrooms, which ranges from 1 to 5, so each feature differs in its scale, which would greatly affect gradient descent by making it slow. (Notice the contour plot.)
Generally, bring every feature into approximately a -1 ≤ xj ≤ 1 range.
-The easiest way to do feature scaling is to normalize by dividing each feature by its range
(max. scale - min. scale).
-Another way is called mean normalization which is replacing xj with xj - μj to make features
have approximately zero mean(Don't apply to x0=1).
Linear Regression(Model, cost function, gradient descent & feature scaling)
What is linear regression?
Linear regression models a statistical relationship between two continuous variables: one is a predictor (independent variable) and one is the response (dependent variable). The idea is to find the line that best fits the data, meaning the error between predicted and actual values is as small as possible. (Error here means the vertical distance between the point and the regression line, as shown in the figure.)
Notes: 1-It is a statistical relationship because it doesn't give the exact relationship between input and output but tries to get the best-fit function to express this relation. (The opposite of a statistical relationship is a deterministic relationship, which gives the exact relationship between the variables.)
2- Regression differs from correlation where regression is used to predict a dependent variable
based on the value of at least one variable (could be more for multivariate) while correlation
only shows the strength of the association (linear relation) between the two variables.
The residual error (R) for the 3 points is the sum of the squared individual errors: R = (y1 − ŷ1)² + (y2 − ŷ2)² + (y3 − ŷ3)² (the numeric values are shown in the figure).
This method is called Least squares method
Why do we use square ? The reason for squaring the individual residual errors is to prevent
positive residual error from canceling out negative residual errors.
Equation:
r2= SSR/SST (0 ≤ r2 ≤ 1)
Does r-squared really matter? Which is better, the r-squared value or the residual plot? When should each be used?
R-squared doesn't really measure goodness of fit: you could have a very low r² while the model is absolutely correct, and a very high r² while the model is wrong. So you need both the r-squared value and the residual plot, such that checking residual plots comes first, to make sure that predictions are not biased and that your model is correct; then, to compare between correct models, you can use the r-squared value. (Some professors consider it useless, such as professor Cosma Shalizi at Carnegie Mellon University.)
Example:
The fitted line plot shows
that these data follow a nice
tight function and the R-
squared is 98.5%, which
sounds great. However,
look closer to see how the
regression line
systematically over and under-predicts the data (bias) at different points along the curve.
Standard error of estimate (SYX):
This shows the standard deviation of the estimation around the regression line.
So the standard error of the estimate has a few extra advantages, as it tells you straight up how precise the model's predictions are, using the units of the dependent variable. This statistic indicates how far the data points are from the regression line on average. You want lower values of S because they signify that the distances between the data points and the fitted values are smaller. S is also valid for both linear and nonlinear regression models.
Residual analysis advantages:
We have talked before about residuals, which are the errors between the data and the predictions and can be visualized using residual plots; residual plots can also be used to check 4 things known as LINE:
Linearity - Independence of errors -Normality of errors - Equal variance (Homoscedasticity)
Linear regression is a statistics topic which has much more mathematical background to be
discussed but we will be content with what we have said to serve machine learning topics.
Cost Function
Cost function is the function that shows us how well our parameters β0,β1 (some people use
another notation θ0,θ1) have done to best fit the line to the input data. Our ultimate goal is to
minimize the cost function as possible. I mean here by minimizing is to make the cost function
as close as possible to 0. We will use mean squared error (MSE) or what is known by L2 cost
function.
Equation: Normally the equation is J(θ) = (1/m) Σi=1..m (hθ(x(i)) − y(i))²,
but more commonly used is J(θ) = (1/2m) Σi=1..m (hθ(x(i)) − y(i))² (the extra 1/2 simplifies the derivative, as explained before).
ŷ = hβ(x) = hθ(x) = β0 + β1x1 + β2x2 + … = θ0 + θ1x1 + θ2x2 + …
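A minimal NumPy sketch of this cost function (the 1/2m version); the data and names are made up for illustration, and X is assumed to already contain a leading column of ones for x0.

import numpy as np

def compute_cost(X, y, theta):
    # J(theta) = 1/(2m) * sum((h_theta(x^(i)) - y^(i))^2)
    m = len(y)
    errors = X @ theta - y               # prediction error for every example
    return (errors @ errors) / (2 * m)

# Example: scaled house sizes vs. prices (illustrative numbers only)
X = np.array([[1.0, 0.5], [1.0, 1.2], [1.0, 2.0]])   # first column is x0 = 1
y = np.array([150.0, 260.0, 400.0])
print(compute_cost(X, y, theta=np.array([10.0, 190.0])))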
Is there any difference between cost function, loss function and objective function?
usually these terms can't be strictly defined but most commonly used is that:
Loss function is defined for a data point which means measuring penalty for a single training
example.
Cost function is defined for more general on a batch or all training examples of the training set.
Objective function is the most general term to be used for any function to be optimized during
training.
To summarize we can say : loss function is a part of cost function which is a type of objective
function.
L2 is less robust because L2 squares the error, so the model sees a much larger error (e vs e²) than with the L1 norm, making it much more sensitive to outliers. L1 is less stable: if there is an outlier, the L1 fit changes more and more as the point moves further into the outlying region.
Generally, when using regularization (a term added to the cost function to overcome what is called overfitting), L2 gives a better and more computationally efficient regularizer than L1.
Gradient descent
Gradient descent is the most popular optimization strategy to use in machine learning
nowadays. It's used in the training phase to optimize our algorithm to reach the best learning
parameters (θ0, θ1, …, θn). It is an iterative procedure that tweaks the parameters step by step to
minimize a given (typically convex) objective function down to a local minimum.
The word "gradient" means the measure of how much the output of
function changes if you change the inputs a little bit. It simply
measures the change in all weights with regard to the change in
error. You can also think of a gradient as the slope of a function. The
higher the gradient, the steeper the slope and the faster a model can
learn. But if the slope is zero, the model stops learning. Said more mathematically, a gradient is the vector of partial derivatives of the function with respect to its inputs.
Note that the step from x0 to x1 is larger than that from x1 to x2 and so on, because as we get closer to the local minimum the gradient becomes less steep, until it reaches 0 at the exact local minimum.
Equation:
Generally a_new = a_old − α·f′(a_old) (for the cost function this is θj := θj − α·∂J(θ)/∂θj).
So α shouldn't be too high or too low, and you can check how well your learning rate works by plotting the cost function against each iteration (epoch) to make sure it gets lower every iteration until it converges at a low value (near 0).
Notes:
1-The number of iterations (epochs) needed until convergence is hard to estimate (it may take 20 epochs or up to 3 million epochs); generally you decide to stop when you reach the performance needed for your application.
2-α optimally ranges from about 0.001 to 1.
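A short sketch of batch gradient descent for linear regression in NumPy, tying the update rule and the "plot the cost every epoch" advice together; the data and hyperparameters are illustrative, not taken from the course.

import numpy as np

def gradient_descent(X, y, alpha=0.1, epochs=1000):
    # Batch gradient descent with the 1/(2m) MSE cost,
    # so the gradient is (1/m) * X^T (X theta - y).
    m, n = X.shape
    theta = np.zeros(n)
    cost_history = []                           # J(theta) at the start of each epoch
    for _ in range(epochs):
        errors = X @ theta - y
        cost_history.append((errors @ errors) / (2 * m))
        theta -= alpha * (X.T @ errors) / m     # simultaneous update of all theta_j
    return theta, cost_history

X = np.array([[1.0, 0.5], [1.0, 1.2], [1.0, 2.0]])   # first column is x0 = 1
y = np.array([150.0, 260.0, 400.0])
theta, costs = gradient_descent(X, y)
print(theta, costs[-1])                          # cost should decrease epoch after epoch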
More feature scaling methods (1-Standardisation: rescale each feature to have zero mean and unit variance):
2-Mean normalization:
Redistributes features to be between -1 and 1 and have a mean of
0.
3-Min-Max scaling:
Redistributes features to be between 0 and 1.
4-Unit vector:
Scaling is done considering the whole feature vector to be of unit length &
redistributes features to be between 0 and 1.
Notes: 1-Standardisation and mean normalization can be used for algorithms that assume zero-centered data, like Principal Component Analysis (PCA).
2-Min-max scaling and the unit vector technique produce values in the range [0, 1]. This is quite useful when dealing with features with hard boundaries; for example, with image data the colors can only range from 0 to 255.
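A small NumPy sketch of the scaling methods above (standardisation, mean normalization, min-max); the two-feature matrix is illustrative.

import numpy as np

X = np.array([[2104.0, 3], [1600.0, 3], [2400.0, 4], [1416.0, 2]])  # size (ft^2), bedrooms

def standardize(X):
    # zero mean, unit variance per feature (useful before e.g. PCA)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def mean_normalize(X):
    # roughly [-1, 1] with zero mean: (x - mean) / (max - min)
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def min_max_scale(X):
    # [0, 1] range: (x - min) / (max - min)
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(standardize(X))
print(mean_normalize(X))
print(min_max_scale(X))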
Polynomial regression:
Not all problems actually can be solved by linear regression (Actually most can't be solved by
linear regression), so the next way is to use polynomial regression (non-linear regression).
As shown in this example, we have only one feature, which is the size, but the distribution of the data is more likely to be fit by a 3rd-order regression rather than a linear one, so we use the feature, feature², and feature³ to fit a curve to our model. (Remember feature scaling, as each x has a different scale, as shown on the right.)
Normal equation:
A method used to solve for θ analytically.
To do so we only need to apply this equation to get the values of the θ's: θ = (XTX)−1·XTy
Note: in Octave use pinv(X' * X) * X' * y
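An equivalent NumPy one-liner for the Octave snippet above (data shapes are illustrative):

import numpy as np

def normal_equation(X, y):
    # theta = pinv(X' * X) * X' * y
    return np.linalg.pinv(X.T @ X) @ X.T @ y

X = np.array([[1.0, 2104, 3], [1.0, 1600, 3], [1.0, 2400, 4]])   # leading column of ones
y = np.array([400.0, 330.0, 369.0])
print(normal_equation(X, y))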
Lec6:
Logistic regression:
Logistic regression is used mainly for
classification problems which could be a little
confusing that its name is "regression" and used
for "classification" but that is just a name given for
historical reasons.
Logistic regression: 0 ≤ hθ(x) ≤ 1
Hypothesis: g(z) = 1/(1 + e^(−z)), so hθ(x) = g(θTx) = 1/(1 + e^(−θTx)) [Remember: θTx = θ0x0 + θ1x1 + … + θnxn]
Notes: 1- g(z) is called the logistic function or sigmoid function.
2- hθ(x) = P(y=1 | x; θ) = the probability that y=1, given x, parameterized by θ.
3- P(y=1 | x; θ) + P(y=0 | x; θ) = 1.
Decision boundary: if hθ(x) ≥ 0.5 then ypredicted = 1, otherwise ypredicted = 0, noting that hθ(x) ≥ 0.5 when θTx ≥ 0. (hθ(x) = 0.5, i.e. θTx = 0, is the decision boundary equation.)
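A minimal sketch of the sigmoid hypothesis and the 0.5 decision rule; the parameters and examples are made up for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X, threshold=0.5):
    # h_theta(x) = g(theta^T x); predict 1 when h >= 0.5, i.e. when theta^T x >= 0
    return (sigmoid(X @ theta) >= threshold).astype(int)

X = np.array([[1.0, 2.0, 1.5], [1.0, -1.0, -0.5]])   # x0 = 1 included
theta = np.array([-1.0, 1.0, 1.0])
print(predict(theta, X))   # -> [1 0]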
Non-linear decision boundary:
The cost of a single example is −log(hθ(x)) if y = 1 and −log(1 − hθ(x)) if y = 0. You can notice that with this formula we are in convex form and we can penalize prediction errors in an efficient way: if y=1 and hθ=1 (right prediction) then cost = 0 as log(1) = 0, and similarly when y=0 and hθ=0 (right prediction) the cost = 0 as log(1−0) = 0; but in case y=1 and hθ→0 (wrong prediction), cost→∞ since −log(hθ)→∞.
Gradient descent for logistic regression: It's the same update rule as for linear regression.
Normal equations, Polynomial regression and Logistic regression
Normal equations:
It is a closed-form solution which minimizes the sum of squared differences (our optimization objective); it is called the normal equation because if Ax = b then b − Ax is normal to the range of A. Since it is closed-form, it can solve our problem more directly than gradient descent, with no need to iterate until we reach the best values of the learning parameters.
Equation: θ = (XTX)-1.XTy
Polynomial regression:
Polynomial regression means an n-th degree regression instead of only the 1st degree as in linear regression, which means we add some sort of non-linearity to our line so it becomes a curvilinear regression that fits a nonlinear relationship between the value of x and the corresponding conditional mean of y. Higher-order or more complex regressions are used to avoid what is called underfitting (discussed later).
Equation: hθ(x) = θ0 + θ1x + θ2x² + … + θnxⁿ
Generally, polynomial regression is a little bit more common to use than linear regression as
most of the times the data fed to any system will be more likely to be fit by a non-linear
function than a linear function. Noting that don't exaggerate in increasing the order of the
polynomial to avoid overfitting (discussed later).
Logistic regression
Logistic regression is a type of regression used when the dependent variable (response) is categorical, which means it is mainly used for classification (a little confusing, since its name has the word "regression" but it is mostly used for classification, unlike the other regressions we covered). Logistic regression gives a continuous output value which indicates a prediction or probability, and based on this output we classify. For example, consider a scenario where we need to classify whether a tumor is malignant or not. If we use linear regression for this problem, we need to set a threshold upon which classification is done. Say the actual class is malignant, the predicted continuous value is 0.4, and the threshold is 0.5; the data point will then be classified as not malignant, which can lead to serious consequences in real applications.
Logistic regression uses the logistic function, aka the sigmoid function: Sigmoid(z) = 1/(1 + e^(−z)).
The output of logistic regression is bounded between 0 and 1.
Equation:
hθ(x) = 1/(1 + e^(−z)), where z = θ0x0 + θ1x1 + … + θnxn (θ0x0 acts as the bias term)
Mathematical representation: P(Y=1 | x; θ)
Gradient descent:
Why is the cost function used for linear regression not used for logistic regression?
Simply because linear regression uses MSE (mean squared error) as its cost function. If this were used with logistic regression it would lead to a non-convex function that can have many local minima, as shown in the figure, so gradient descent could fall into a local minimum instead of the global minimum.
Elasticnet regression:
It is a hybrid of lasso and ridge regression by adding both L1 and L2 regularization.
Advanced optimization algorithms: (just for knowledge)
1-Conjugate descent 2- BFGS 3-L-BFGS
Advantages: 1-No need to manually pick α (it is picked automatically). 2-Often faster than gradient descent.
Disadvantages: 1-More complex.
Lec7:
Overfitting(high variance):
Overfitting is a modeling error which occurs when a function is too closely fit to a limited set
of data points. Overfitting the model generally takes the form of making an overly complex
model to explain idiosyncrasies in the data under study. So the function is more likely to
memorize the data rather than generalize it. (J(θ)≈0)
Notes:
1-λ should be chosen high enough to eliminate overfitting but not so much high leading to
underfitting problem.
2-θ0 is not penalized. Only θ1,2,3,…,n are penalized.
θ = (XTX + λ·L)−1·XTy, where L is the (n+1)×(n+1) identity matrix with its top-left entry (the one corresponding to θ0) set to 0.
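A hedged NumPy sketch of this regularized normal equation, with L built exactly as described (identity with its top-left entry zeroed so θ0 is not penalized); the data is illustrative.

import numpy as np

def regularized_normal_equation(X, y, lam):
    # theta = (X^T X + lambda * L)^-1 X^T y
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0                     # do not penalize theta_0
    return np.linalg.inv(X.T @ X + lam * L) @ X.T @ y

X = np.array([[1.0, 2104, 3], [1.0, 1600, 3], [1.0, 2400, 4], [1.0, 1416, 2]])
y = np.array([400.0, 330.0, 369.0, 232.0])
print(regularized_normal_equation(X, y, lam=1.0))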
Overfitting, Underfitting and Regularization
Overfitting VS. Underfitting
Most of the poor results given by a machine learning model are the result of either overfitting or underfitting. Overfitting means that the line fitting the input data is extremely accurate on that data, as shown in the figure, which means the model is highly variant (in other words, has low bias) and is almost memorizing the data rather than generalizing from it. Underfitting is the opposite of overfitting, where the model or the line fitting the data is so bad and so far away from fitting the data that the predictions are far off, due to high bias (in other words, low variance).
What is meant by high variance or high bias? What is meant by memorizing and
generalizing?
The word variance means how much a model changes in response to the training data. So if the
model is just keeping tracking the changes in training data and change itself to perfectly match
the data so this is known as highly variant model. In other words we can say it is highly
dependent on the training data. This leads to memorizing data which means that it consider any
data will be given in the test phase to be typical to that of the training phase so here we say that
it is memorizing the data not generalizing on it. As the degree of polynomial of the fitting line
increases the model tends to be memorizing more. Bias on the other hand is the flip side of
variance which means representing the strength of our assumptions we make about our data.
To summarize, bias shows how much you ignore the data and variance shows how dependent
our model is on the data. There will always be a trade-off between them and our goal is to
balance between them.
Overfitting: too much reliance on the training data
Underfitting: a failure to learn the relationships in the training data
High Variance: model changes significantly based on training data
High Bias: assumptions about model lead to ignoring training data
Overfitting and underfitting cause poor generalization on the test set
A validation set for model tuning (cross-validation) can prevent under- and overfitting
More often you will be exposed to overfitting rather than underfitting in real practical problems.
Note: In machine learning we describe the learning of the target function from training data as
inductive learning. Induction refers to learning general concepts from specific examples which
is exactly the problem that supervised machine learning problems aim to solve. This is different
from deduction that is the other way around and seeks to learn specific concepts from general
rules.
How overfitting and underfitting affect the training and testing error?
Overfitting means that the error is so small on the training set and so high on the testing set.
While Underfitting means that the error is too high on training set as well as the testing set.
Note: Assume training set has error of 15% and dev set has error of 16%, thus here we have
underfitting in our algorithm due to the high error in training and no overfitting as our dev set
(which is responsible for parameters' tuning) error is only 1% higher than train set error. (Check
more examples at Tip 21 at the end of part1 (Machine learning))
Regularization
The word regularize means to make things regular or acceptable; regularization is a technique which makes slight modifications to the learning algorithm such that the model generalizes better. This in turn improves the model's performance on unseen data as well. To perform regularization, we modify our cost function by adding a penalty to the sum of squared residuals (RSS) (the mean squared error, MSE).
By adding a penalty to the Cost Function, the values
of the parameters would decrease and thus the
overfitted model gradually starts to smooth
out depending on the magnitude of the penalty added.
λ = ∞: The coefficients θ will all be zero due to the infinite weight on the penalty term.
λ = 0: The objective is non-regularized, giving the same coefficients as normal regression.
0 < λ < ∞: The coefficients shrink according to the weight given to the penalty.
Selection of λ: as shown, if it is too low the effect almost vanishes, and if it is too high the model will most likely underfit, so the optimal λ is not a strict, predefined value; to reach it, you plot the cross-validation error for different values of λ and pick the value that works best for your problem.
In other words, for every value of α, there is some ‘s’ such that the equations (old and new cost
functions) will give the same coefficient estimates.
When p=2, then (6.8) indicates that the lasso coefficient estimates have the smallest RSS out of
all points that lie within the diamond defined by |β1|+ |β2|≤s.
Similarly, the ridge regression estimates have the smallest RSS out of all points that lie within
the circle defined by (β1)²+(β2)²≤s
Now, the above formulations can be used to shed some light on the issue.
The least squares solution is marked as βˆ, while the blue diamond and circle represent the
lasso and ridge regression constraints as explained above.
If ‘s’ is sufficiently large, then the constraint regions will contain βˆ, and so the ridge regression
and lasso estimates will be the same as the least squares estimates. (Such a large value of s
corresponds to α=0 in the original cost function). However, in figure, the least squares estimates
lie outside of the diamond and the circle, and so the least squares estimates are not the same as
the lasso and ridge regression estimates. The ellipses that are centered around βˆ represent
regions of constant RSS.
In other words, all of the points on a given ellipse share a common value of the RSS. As the
ellipses expand away from the least squares coefficient estimates, the RSS increases. The above
equations indicate that the lasso and ridge regression coefficient estimates are given by the first
point at which an ellipse contacts the constraint region.
Since, ridge regression has a circular constraint with no sharp points, this intersection will not
generally occur on an axis, and so the ridge regression coefficient estimates will be exclusively
non-zero.
However, the lasso constraint has corners at each of the axes, and so the ellipse will often
intersect the constraint region at an axis. When this occurs, one of the coefficients will equal
zero. In higher dimensions, many of the coefficient estimates may equal zero simultaneously. In
figure, the intersection occurs at β1=0, and so the resulting model will only include β2.
Are L1, L2 and L1/L2 regularization the only methods to overcome overfitting?
No, there are other methods to face this problem, such as dropout, data augmentation and early stopping, which will be discussed in detail in the deep learning part.
For a deeply probabilistic view of regularization techniques with intensive mathematical proofs, check this link (https://towardsdatascience.com/regularization-an-important-concept-in-machine-learning-5891628907ea)
Regularization for logistic regression:
Remember: J(θ) = −(1/m) Σi=1..m [ y(i) log(hθ(x(i))) + (1 − y(i)) log(1 − hθ(x(i))) ]
Added term: + (λ/2m) Σj=1..n θj²
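A minimal NumPy sketch of this regularized cost and its gradient, following the convention above that θ0 is skipped by the penalty; the toy data is illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, lam):
    # -(1/m) sum[y log h + (1-y) log(1-h)] + (lambda/2m) sum_{j>=1} theta_j^2
    m = len(y)
    h = sigmoid(X @ theta)
    cross_entropy = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    return cross_entropy + (lam / (2 * m)) * np.sum(theta[1:] ** 2)

def gradient(theta, X, y, lam):
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    grad[1:] += (lam / m) * theta[1:]            # regularize all but theta_0
    return grad

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
theta = np.zeros(2)
print(cost(theta, X, y, lam=1.0), gradient(theta, X, y, lam=1.0))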
Lec8:
Images are dealt with as a bunch of pixel intensities, which act as the features of the image. Assume a 50×50 pixel image, which means the image has 2500 pixels (7500 if RGB); this means the number of features representing this image is 2500 (x is the 2500-dimensional vector of pixel intensities). But as we said, most of our problems can't be handled using linear representations, which means we need a non-linear hypothesis; if we were to use, say, all quadratic feature combinations, the result would be millions of features, which is far too many to deal with, and here comes the idea of neural networks.
Neural Network:
-x0 and a0(2) (the bias units) are usually not drawn.
-Layer 1 is called input layer.
-Layer 3 (Last layer) is called output layer.
All layers between first and last layers (Layer 2 in our
case) are called hidden layers.
-ai(j) means "activation" of unit i in layer j.
-θ(j) matrix of weights controlling function mapping from layer j to layer (j+1).
Notes:
1-If the network has sj units in layer j and sj+1 units in layer j+1, then θ(j) will be of dimension sj+1 × (sj + 1).
2- z1(2) = θ10(1)x0 + θ11(1)x1 + … + θ1n(1)xn, and a1(2) = g(z1(2)), and so on.
3-Generally: aunit(layer) = g(zunit(layer)) = g( Σi=0..k θunit,i(layer−1) · ai(layer−1) ), where k is the total number of inputs to this unit. Take into consideration that a(1) = x = the input.
Forward propagation:
This is called forward
propagation as we move
forward from input layer
towards the output layer.
a(1) = x, z(2) = θ(1)a(1), a(2) = g(z(2)), z(3) = θ(2)a(2), hθ(x) = a(3) = g(z(3))
Assume θ10 = −30, θ11 = 20, θ12 = 20. This leads to hθ(x) = g(−30 + 20x1 + 20x2), which is ≈1 only when x1 = x2 = 1, so this assumption succeeds in creating a non-linear function that implements the AND logical function.
Note: This is not the only assumption that solves this problem; there are infinitely many solutions.
Similarly for OR: if you use θ10 = −10, θ11 = 20, θ12 = 20, it will be an OR logical function.
You can apply the same approach to other logical functions (NOT, NOR, NAND directly; XOR and XNOR by combining units across layers).
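A tiny sketch showing how those particular weights implement AND and OR with a single sigmoid unit:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logic_unit(x1, x2, theta):
    # single neuron: g(theta_0 + theta_1*x1 + theta_2*x2)
    return sigmoid(theta[0] + theta[1] * x1 + theta[2] * x2)

and_theta = np.array([-30.0, 20.0, 20.0])
or_theta = np.array([-10.0, 20.0, 20.0])

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(logic_unit(x1, x2, and_theta)), round(logic_unit(x1, x2, or_theta)))
# AND column: 0 0 0 1, OR column: 0 1 1 1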
Multi-class classification using
neural network representation:
We notice that the output layer now is
composed of several nodes rather than 1
node only(binary classification), and each
node is responsible to determine a specific
class by using a probability evaluation for
this node.
Cost function:
Remember, for logistic regression:
J(θ) = −(1/m) Σi [ y(i) log(hθ(x(i))) + (1 − y(i)) log(1 − hθ(x(i))) ] + (λ/2m) Σj θj²
For a neural network, the cost generalizes this by summing the first term over the K output units and the regularization term over all the weights of every layer.
-To perform gradient computation we will use δj(l) to denote error of node j in layer l.
-For each output unit we have δj(4), which equals aj(4) − yj, where yj is the actual value and aj(4) equals (hθ(x))j, the predicted value.
Backpropagation algorithm:
δ(4)=a(4) - y
δ(3)=(θ(3))T δ(4) .* g'(z(3)) =(θ(3))T δ(4) .* (a(3) .*(1-a(3)))
δ(2)=(θ(2))T δ(3) .* g'(z(2)) =(θ(2))T δ(3) .* (a(2) .*(1-a(2)))
Notes: 1-No δ(1) exists as it corresponds to input layer which is features observed.
2- .* is element wise multiplication.
3-remember that g(z) means the sigmoid of variable z.
Now, as we know, gradient descent needs the derivative of the cost function with respect to θ; after some complex mathematical derivation it can be shown that (ignoring regularization) ∂J(θ)/∂θij(l) = aj(l)·δi(l+1).
The proper way to initialize theta is to initialize each θij(l) to a random value in the range [−ϵ, ϵ], where ϵ is a very small number close to zero but not zero.
Neural Networks(NN), Forward propagation, Backpropagation and Weights Initialization
Neural Networks:
A neural network (AKA artificial neural network, ANN) is a computing system made up of a number of simple, highly interconnected processing elements which process information by their dynamic state response to external inputs. ANNs are a rough model of how the human brain is structured. ANNs are organized in layers; each layer is composed of several units called neurons. The neurons in each layer are connected to the neurons of the following and preceding layers through synapses (interconnections), such that the first layer takes the input data (features), which passes forward through these interconnections until it reaches the far end, called the output layer. Neural networks generally consist of 3 parts: 1) neurons (units), 2) weights (interconnections), 3) biases. The layers are mainly an input layer, an output layer and several hidden layers in between. Neural networks have several types, such as convolutional neural networks (CNN), recurrent neural networks, and so on (the types will be discussed in detail in the deep learning part).
Forward propagation:
It is a process of feeding input values to the neural network and getting an output which we call
predicted value. Sometimes we refer to forward propagation as inference. When we feed the input
values to the neural network’s first layer, it goes without any operations. Second layer takes
values from first layer and applies multiplication, addition and activation operations and passes
this value to the next layer. Same process repeats for subsequent layers and finally we get an
output value from the last layer. We can see that it is called forward propagation as we are
moving forward from the input layer to the output layer.
Forward propagation is not sufficient for learning as after
reaching output layer for the first time there will be no
chance to start again with a more optimized parameters and
here comes the idea of backpropagation to move backwards
from the output layer to the input layer and so on till we
reach a well trained model with suitable learning parameters
(weights or θ's).
Equation: aj(n+1) = g( (Σi θji(n) · ai(n)) + bj ), where the subscripts identify the nodes (neurons), the superscript identifies the layer number, and g() is the activation function. This equation is applied for each neuron of each layer.
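A small NumPy sketch of this layer-by-layer computation for a fully connected network with sigmoid activations; the layer sizes and the random weights are made up for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x, weights, biases):
    # a^(n+1) = g(W^(n) a^(n) + b^(n)), applied layer by layer
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations

# 3 inputs -> 4 hidden units -> 1 output (illustrative sizes)
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(4, 3)), rng.normal(scale=0.1, size=(1, 4))]
biases = [np.zeros(4), np.zeros(1)]
x = np.array([0.5, -1.0, 2.0])
print(forward_propagation(x, weights, biases)[-1])   # the network's predicted value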
Backpropagation
Backpropagation is performed after the feedforward pass in a continuous, iterative manner: we calculate the cost function (e.g. MSE), then apply gradient descent by taking the partial derivative with respect to each weight feeding the output layer, then move back to the previous layer with these gradients, and so on, using the chain rule of differential calculus. This process is repeated consecutively until reaching the first (input) layer; at that point we have the gradient of every weight in the neural network, so we apply the update equation of the learning parameters (θnew = θold − α·grad(J w.r.t. θ)) to reduce the error, then again do the feedforward propagation with the new learning parameters, and repeat this cycle of forward then backward propagation until we reach the minimum achievable cost by arriving at a local minimum via gradient descent. (This topic will be discussed in more detail, with an example, in the deep learning part.)
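A compact sketch of this forward-then-backward cycle for one training example, using the delta formulas from the backpropagation notes above; the network sizes, data and learning rate are illustrative, and biases are left out for brevity.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, weights, alpha=0.5):
    # forward pass, keeping every activation
    activations = [x]
    for W in weights:
        activations.append(sigmoid(W @ activations[-1]))

    # backward pass: delta_L = a_L - y, delta_l = W_l^T delta_{l+1} * a_l * (1 - a_l)
    delta = activations[-1] - y
    for l in range(len(weights) - 1, -1, -1):
        a_prev = activations[l]
        grad = np.outer(delta, a_prev)               # dJ/dW for layer l
        if l > 0:                                    # no delta for the input layer
            delta = (weights[l].T @ delta) * a_prev * (1 - a_prev)
        weights[l] -= alpha * grad                   # theta_new = theta_old - alpha * grad
    return weights

rng = np.random.default_rng(1)
weights = [rng.normal(scale=0.1, size=(3, 2)), rng.normal(scale=0.1, size=(1, 3))]
x, y = np.array([1.0, 0.5]), np.array([1.0])
for _ in range(100):
    weights = train_step(x, y, weights)
print(sigmoid(weights[1] @ sigmoid(weights[0] @ x)))   # prediction moves toward y = 1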
Weights initialization
Weights (θ's) should be initialized randomly with values near 0, in (−ϵ, ϵ), but must be non-zero. This works but can cause problems called vanishing and exploding gradients, along with some other issues: if the network is very deep and those weights are very small, the gradients calculated in backpropagation are proportional to these small weights, which means the learning process will take a very long time. The other problem is that the distribution of the output of each neuron has a variance that gets larger with more inputs, so to control this we scale the initial weights by dividing them by √ni, where ni is the number of inputs to the neuron; the weights are then distributed roughly in the range (−1/√ni, 1/√ni).
Weights can also be initialized by something called transfer learning, where we start from weights that were reached in another project with a close objective (discussed in detail in deep learning).
The last method is called Xavier/Glorot initialization.
Note: Normally distributed here means drawn from a distribution with mean 0 and standard deviation 1.
What is Xavier/Glorot initialization and how does it work?
Xavier initialization makes sure that the initial weights are "just right" so that the input signal passes through the layers at a reasonable scale. The values here are also normally distributed, as in the random initialization above, but instead of dividing by √ni (where ni is the number of inputs to the neuron) we divide by √(ni + no), where no is the number of neurons in the next layer. What is the concept behind that? This sort of initialization helps keep the weight matrix neither too much bigger than 1 nor too much smaller than 1, so the gradients neither explode nor vanish. For the mathematical derivation visit (http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization)
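A short sketch contrasting the two schemes described above; it follows the 1/√n-style scaling used in these notes (some references use √(2/(ni + no)) instead).

import numpy as np

rng = np.random.default_rng(0)

def simple_init(n_in, n_out):
    # earlier scheme: normal samples scaled down by sqrt(n_in)
    return rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)

def xavier_init(n_in, n_out):
    # Xavier/Glorot-style: normal samples scaled down by sqrt(n_in + n_out)
    return rng.normal(size=(n_out, n_in)) / np.sqrt(n_in + n_out)

print(simple_init(256, 128).std())   # roughly 1/sqrt(256)
print(xavier_init(256, 128).std())   # roughly 1/sqrt(256 + 128)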
The topic of neural networks is mainly a deep learning topic, so a lot of this material will be discussed in more detail in the deep learning part, but at least the main points discussed here should be known very well.
Remember: The number of input units is determined by the number of features, while the number of output units is determined by the number of classes.
The number of hidden layers is not fixed; it is an architectural choice. Usually we choose the number of hidden units in each layer to be comparable to the number of features (e.g. twice or three times it).
Lec10:
Debugging learning algorithms:
Assume you are testing an implemented regularized linear regression to predict housing prices, but in testing it gives unacceptably large errors in its predictions. What should you try next?
1-Get more training examples. (fixes overfitting)
2-Try smaller sets of features. (fixes overfitting)
3-Try getting additional features (not redundant, of course). (fixes underfitting)
4-Try adding polynomial features. (fixes underfitting)
5-Try adjusting the regularization parameter λ (decreasing λ fixes underfitting; increasing λ fixes overfitting).
Diagnosing bias-variance:
This is the trade-off between using hypotheses of different degrees, where some could cause underfitting and some could cause overfitting.
How to know it's a bias problem (underfit) or a variance problem (overfit) ?
Note: A high value of λ could lead to underfitting and a very low value of λ (≈0) could lead to overfitting. Usually try λ = 0.01, 0.02, 0.04, 0.08, 0.1, 0.2, 0.5, 0.7, 1, …, 10 and choose the value that gives the best Jcv(θ).
Lec11:
Machine learning system design:
Skewed classes:
This means we have many more examples of one class than of the other class. This will certainly cause a problem in evaluating your hypothesis, because it could give you a very low error simply due to the rarity of one of the classes, which, if it were better represented, would produce a higher error.
Precision/Recall:
Lec12:
Support Vector Machine (SVM):
As we know from logistic regression, the cost of a single example is:
Cost(hθ(x), y) = −y·log(hθ(x)) − (1 − y)·log(1 − hθ(x))
So this term has 2 parts:
1- for y=1, the first term −log(hθ(x)) takes control; 2- for y=0, the second term −log(1 − hθ(x)) takes control.
The SVM replaces each of these two smooth log curves with a piecewise-linear cost, cost1(θTx) for y=1 and cost0(θTx) for y=0.
Datasets split, Skewed classes and Evaluation metrics
Data splitting
It is a standard in machine learning to split the data into training and test sets. The reason for
this is very straightforward: if you try and evaluate your system on data you have trained it on,
you are doing something unrealistic. The whole point of a machine learning system is to be able
to work with unseen data: if you know you are going to see all possible values in your training
data, you might as well just use some form of lookup.
However, the two-way split is over-simplistic. Real ML typically involves four phases:
1) Training
2) Development (also known as Validation or Tuning (statistical))
3) Testing (aka Evaluation)
4) Use
The ultimate goal of any ML system is to achieve best possible performance in the actual use
(phase 4), or in other words with unseen data. The evaluation and statistical methods we use are there to make sure you are not fooling yourself into thinking your system is better than it actually is.
K-fold cross-validation
A resampling procedure used to evaluate machine learning models on a limited dataset. If there is not enough data to train your model, removing a part of it for validation is likely to pose an underfitting problem. By reducing the training data, we risk losing
important patterns/ trends in data set, which in turn increases error induced by bias. In K Fold
cross validation, the data is divided into k subsets. Now the holdout method is repeated k times,
such that each time, one of the k subsets is used as the validation set and the other k-1 subsets
are put together to form a training set. The error estimation is averaged over all k trials to get
total effectiveness of our model. As can be seen, every data point gets to be in a validation set
exactly once, and gets to be in a training set k−1 times. This significantly reduces bias as we are
using most of the data for fitting, and also significantly reduces variance as most of the data is
also being used in validation set. Interchanging the training and test sets also adds to the
effectiveness of this method. As a general rule and empirical evidence, K = 5 or 10 is generally
preferred, but nothing’s fixed and it can take any value.
The general procedure is as follows:
1-Shuffle the dataset randomly. 2-Split the dataset into k groups
3-For each unique group:
Take the group as a hold out or test data set
Take the remaining groups as a training data set.
Fit a model on the training set and evaluate it on the test set
Retain the evaluation score and discard the model
4-Summarize the skill of the model using the sample of model evaluation scores.
Importantly, each observation in the data sample is assigned to an individual group and stays in
that group for the duration of the procedure. This means that each sample is given the
opportunity to be used in the hold out set 1 time and used to train the model k-1 times.
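A minimal NumPy sketch of this procedure; model_fn is any scikit-learn-style estimator factory (an illustrative assumption, not part of the original notes).

import numpy as np

def k_fold_cross_validate(model_fn, X, y, k=5, seed=0):
    # shuffle once, split into k folds, train on k-1 folds and score on the held-out fold
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = model_fn()                      # fresh model per fold, discarded afterwards
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(scores)                      # average skill over the k trials

For example, k_fold_cross_validate(lambda: LogisticRegression(), X, y, k=5) should work with scikit-learn's LogisticRegression (the choice of model here is just an assumption), since such estimators expose fit and score.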
Is there any other types of cross-validation ?
There are 2 other types : 1)Stratified K-fold cross validation 2)Leave P-out cross-validation
Stratified K-fold: In some cases, there may be a large imbalance in the response variables. For
example, in dataset concerning price of houses, there might be large number of houses having
high price. Or in case of classification, there might be several times more negative samples than
positive samples. For such problems, a slight variation in the K Fold cross validation technique
is made, such that each fold contains approximately the same percentage of samples of each
target class as the complete set, or in case of prediction problems, the mean response value is
approximately equal in all the folds. This variation is also known as Stratified K Fold.
Leave P-out: This approach leaves p data points out of training data, i.e. if there are n data
points in the original sample then, n-p samples are used to train the model and p points are used
as the validation set. This is repeated for all combinations in which original sample can be
separated this way, and then the error is averaged for all trials, to give overall effectiveness. A
particular case of this method is when p = 1. This is known as Leave one out cross
validation. This method is generally preferred over the previous one because it does not suffer
from the intensive computation, as number of possible combinations is equal to number of data
points in original sample or n.
Skewed classes
Skewed classes (AKA unbalanced classes) means one class is over-represented in the dataset, which could lead to, say, 9x% accuracy in classification. But when you dig deeper you will find that your model is useless, since most of the data belongs to one class, so the high number is just a misleading evaluation. This is called the accuracy paradox, where the accuracy merely reflects the underlying class distribution.
Recall of class x = (examples correctly predicted as class x) / (all examples that actually belong to class x)
For our example:
Precision of class 1 = A/(A+D+G)
Recall of class 1 = A/(A+B+C)
Precision of class 2 = E/(B+E+H)
Recall of class 2 = E/(D+E+F)
and so on …
3) F1 score:
A weighted average of precision and recall, or in other words the harmonic mean of precision and recall. It ranges between 0 and 1; the closer the F1 score is to 1, the better our model performs.
F1_score = 2 · (precision · recall) / (precision + recall)
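A small NumPy sketch computing these three metrics from a confusion matrix laid out as above (rows = actual class, columns = predicted class); the numbers are illustrative.

import numpy as np

def precision_recall_f1(confusion, k):
    tp = confusion[k, k]
    precision = tp / confusion[:, k].sum()      # column sum, e.g. A/(A+D+G) for class 1
    recall = tp / confusion[k, :].sum()         # row sum, e.g. A/(A+B+C) for class 1
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative 3-class confusion matrix [[A, B, C], [D, E, F], [G, H, I]]
confusion = np.array([[50, 3, 2],
                      [4, 40, 6],
                      [1, 5, 45]])
print(precision_recall_f1(confusion, k=0))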
Note: You can assume the 3 previous method as 1 method as they are all related to the same
concept.
There are many more types of evaluation metrics, such as AUC-ROC, Kolmogorov-Smirnov charts, and gain and lift charts.
Analogy to understand precision and recall idea and the trade-off between them.
To explain precision and recall, we have found the fishing example to be helpful. In this
example, we have a pond of fish and we know the total number of fish within. Our goal is to
build a model that catches red fish (we may say that we want to ‘predict’ that we catch red
fish). In our first test we have a model that consists of two fishing poles with bait made from a
recipe based on scientific analysis of what red fish like. The precision metric, is about making
sure your model works accurately (or that the predictions you make are accurate). With our fish
example, this means that the fish caught with the special bait are, in fact, red. The following
test shows great precision—the two fish caught were both red. We were trying to (or predicted
we would) catch red fish, and all of the fish that we caught were red.
In other words: the precision of class x is, out of everything I predicted as class x, how many actually belong to class x; while the recall of class x is, out of all the data that actually belongs to class x, how many I managed to predict as class x.
Unlike logistic regression, the hypothesis here doesn't output a probability but either 1 or 0.
Now, we want to adapt the SVM so it can develop complex non-linear classifiers; this will be done using a technique called the kernel.
Kernel:
Assume you have a training set as shown, it's obvious that
the decision boundary will need non-linear hypothesis to
represent it. So it will need a hypothesis like that
θ0+ θ1x1+ θ2 x2+ θ3 x1 x2+ θ4 x12+ θ5 x22+….
and hθ(x) gives prediction 1 if θ0+ θ1x1+ θ2 x2+ θ3 x1 x2+ θ4
x12+ θ5 x22+…. ≥ 0 and gives 0 otherwise.
Let's try writing it in different notations θ0+ θ1f1+ θ2 f2+ θ3 f3+ θ4 f4+ θ5 f5+….
such that f1=x1, f2=x2, f3=x1x2, f4=x12,… and so on
We know that using a higher-order polynomial is very computationally expensive. So the question is: is there a different/better choice of these features f1, f2, f3, … rather than those higher-order polynomials that still gives us a non-linear decision boundary?
We are going to manually pick some landmarks (l(1), l(2), l(3), …); knowing that I have a training set as in the previous figure (the x's), I am going to define the following:
f1 = similarity(x, l(1)) = exp(−‖x − l(1)‖² / (2σ²)),
f2 = similarity(x, l(2)) = exp(−‖x − l(2)‖² / (2σ²)), … and so on
Note: similarity is also known as gaussian kernel k(x,l(i)).
Now if the training example x ≈ l(i), this means the training example is close to the landmark l(i), which means fi ≈ e^0 ≈ 1; on the other hand, if x is far away from the landmark l(i), then fi ≈ e^(−large number) ≈ 0. (Note: f0 = 1 always.)
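A minimal sketch of these landmark features; the landmarks and σ are made up for illustration.

import numpy as np

def gaussian_kernel(x, landmark, sigma=1.0):
    # f_i = exp(-||x - l^(i)||^2 / (2 sigma^2)): ~1 near the landmark, ~0 far away
    diff = x - landmark
    return np.exp(-(diff @ diff) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
landmarks = [np.array([1.0, 2.0]), np.array([5.0, -3.0])]
print([gaussian_kernel(x, l) for l in landmarks])   # first ~1, second ~0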
Summary:
BUT, this case is not a realistic one, as the data here is visually easily separable by a line with no errors. There are other cases: the data can be linearly separated but with errors, or it can't be separated linearly at all and needs a non-linear decision boundary (a kernel). Let's see those scenarios.
Scenarios for the data that can be linearly separated without and with errors.
(Scenario 1) (Scenario 2) (Scenario 3) (Scenario 4)
Scenario1: Here, we have three hyper-planes (A, B and C). Now, identify the right hyper-plane
to classify star and circle. You need to remember a thumb rule to identify the right hyper-plane:
"Select the hyper-plane which segregates the two classes better". In this scenario, hyper-plane "B" has done this job excellently, because A and C have classification errors while B has none.
Scenario2: Here, we have three hyper-planes (A, B and C) and
all are segregating the classes well. Now, How can we identify
the right hyper-plane? Here, maximizing the distances
between nearest data point (either class) and hyper-plane
will help us decide the right hyper-plane. This distance is called the margin. Looking at this snapshot, you can see that the margin for hyper-plane C is high compared to both A and B. Hence, we name C the right hyper-plane. Another strong reason for selecting the hyper-plane with the higher margin is robustness: if we select a hyper-plane having a low margin, there is a high chance of misclassification.
Till now we have 2 rules : 1)select the one with least errors 2) select the one with max margin.
Scenario3: Use the rules as discussed in previous section to identify the right hyper-plane. Some
of you may have selected the hyper-plane B as it has higher margin compared to A. But, here is
the catch, SVM selects the hyper-plane which classifies the classes accurately prior
to maximizing the margin. Here, hyper-plane B has a classification error and A has classified everything correctly. Therefore, the right hyper-plane is A even though B has the larger margin. But this is not always the case: there is a parameter (C) which controls whether a line like A or B gets selected. It will be discussed in the next question.
Scenario4: I am unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier. As I have already mentioned, the star at the other end is like an outlier for the star class. SVM has a
feature to ignore outliers and find the hyper-plane that has
maximum margin as shown in this figure. Hence, we can
say, SVM is robust to outliers.
What is the objective function of SVM?
Looking at the figure, we have 2 classes (blue and green) and we have found the line that best separates them. This separator line (the red line) has the equation wTx − b = 0 (equivalent to θTx = 0, where θ0 plays the role of −b since x0 = 1), and the 2 lines defining the margin (the dashed lines) have the equations wTx − b = ±1. The equation is vectorized, so w is the normal vector to the separator (hyperplane); therefore the width of the margin is 2/‖w‖, and since our goal is to maximize the margin, this by default means minimizing ‖w‖. The constraint on any point is wTx − b ≥ 1 in case yi = 1, or wTx − b ≤ −1 in case yi = −1, which prevents any point from falling inside the margin. To make a generalized formula for this objective we say: minimize ‖w‖²/2 subject to yi(wTxi − b) ≥ 1 for i = 1, …, n (the number of examples in the dataset). (More details: http://cs229.stanford.edu/notes/cs229-notes3.pdf)
(Squaring is used to remove the ugly square root that comes from the norm, and the 1/2 is used for convenience in the Lagrange-multiplier derivatives used for optimization (not sure).)
We can notice that all these equations hold for data like that shown, where no outliers exist, so the support vectors are clear and the hyperplane is easy to find, giving a hard separation with no misclassification. Let's now see what happens in the case of outliers.
Kernel
Kernel is a mapping function that transforms a
given space into some other higher dimensional
space. In this higher dimensional space it is easier
even by visualization to separate our data as shown
in the figure. As we can see here that the data when
transformed to 3-D it was more easily noticed to be
separable using a hyper-plane which was not
noticed in its 2-D form. (Visualize: https://www.youtube.com/watch?v=3liCbRZPrZA)
This is done by using dot product between the vectors in the co-ordinate space. As it turns out,
while solving the SVM, we only need to know the inner product of vectors in the coordinate
space. Say, we choose a kernel K and P1 and P2 are two points in the original space. K would
map these points to K(P1) and K(P2) in the transformed space. To find the solution using
SVMs, we only need to compute inner product of the transformed points K(P1) and K(P2).
If we denote S as the similarity function in transformed space (as expressed in the terms of the
original space), then: S(P1,P2) = <K(P1),K(P2)>
The idea here is like transforming Cartesian co-ordinate to other form of co-ordinates like polar
or cylindrical but here we are transforming to higher dimensional co-ordinates.
We found that we need not compute the exact transformation of our data, we just need the inner
product of our data in that higher dimensional space. This works like a charm in datasets which
aren’t linearly separable!
The Kernel trick essentially is to define S in terms of original space itself without even defining
(or in fact, even knowing), what the transformation function K is.
Types of kernels
1) Polynomial kernel
2) Gaussian kernel
6) Sigmoid Kernel
There are more functions but that is enough.
Notes:
1-Kernels were tried with logistic regression, but they make it too slow, so they are not used there.
2-As C (= 1/λ): a large C means lower bias and higher variance, while a small C means higher bias and lower variance.
3-For σ²: a large σ² means higher bias and lower variance, while a small σ² means lower bias and higher variance.
4-You can use SVM with no kernel (a linear kernel).
5-If the number of features (n) is high relative to the number of training examples (m), use logistic regression or SVM without a kernel (linear kernel).
6-If n is small and m is intermediate, use SVM with a Gaussian kernel.
7-If n is small and m is large, add more features and use logistic regression.
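In practice the optimization above is rarely coded by hand; a hedged scikit-learn sketch with a Gaussian (RBF) kernel is shown below, where the toy data and the chosen C and gamma are purely illustrative.

import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: the label depends on the distance from the origin,
# so it is not linearly separable in the original space.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# C controls the error/margin trade-off; gamma plays the role of 1/(2 sigma^2)
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.score(X, y))                          # training accuracy
print(clf.predict([[0.1, 0.1], [2.0, 2.0]]))    # expected: inner point -> 0, outer -> 1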
Lec13:
Unsupervised learning (Clustering):
To recall, unsupervised learning means an unlabeled dataset is to be fit with a hypothesis. The hypothesis tries to separate the data using their features, by methods such as clustering.
Cluster assigning:
This means that we compute the distance between each data point and each centroid and assign the data point to the nearest centroid (cluster). Check this
Moving centroids:
This means that we move each centroid to a new place, which is the average position of the data points belonging to that centroid. Check this
Then we repeat these 2 steps until the centroids converge, as shown in the following figures:
Until we reach the final centroid
Algorithm summary:
To avoid the problem of a bad random initialization leading to a poor local optimum, we should run the initialization more than once, collect the cost function result of each run after convergence, compare them, and take the best initialization.
K-means clustering
K-means clustering is the one of the most popular clustering algorithm and mostly used one and
it is a centroid-based algorithm. the objective of K-means is simple: group similar data points
together and discover underlying patterns. To achieve this objective, K-means looks for a fixed
number (k) of clusters in a dataset. The K-means algorithm identifies k centroids, and then allocates every data point to the nearest cluster, while keeping the clusters as compact as possible. If you don't know how many groups you want, that's problematic, because K-means needs a specific number k of clusters in order to be used. So, the first lesson: whenever you have
to optimize and solve a problem, you should know your data and on what basis you want to
group them. Then, you will be able to determine the number of clusters you need.
But, most of the time, we really have no idea what the right number of clusters is, so no worries,
there is a solution for it, that we will discuss it later.
K-means clustering analogy
Imagine you’re opening a small book store. You have a stack of different books, and 3
bookshelves. Your goal is to place similar books on one shelf. What you would do is pick up 3 books, one for each shelf, in order to set a theme for every shelf. These books will now dictate
which of the remaining books will go in which shelf. Every time you pick a new book up from
the stack, you would compare it with those first 3 books, and place this new book on the shelf
that has similar books. You would repeat this process until all the books have been placed.
Once you're done, you might notice that changing the number of bookshelves, and picking different initial books for those shelves (changing the theme of each shelf), would improve how well you've grouped the books. So you repeat the process iteratively in hopes of a better outcome. The K-Means algorithm works just like this.
K-means clustering algorithm
1)Initialize Cluster Centroids (Choose those 3 books to start with)
K centroids should be initialized based on your number of clusters (K). The usual simple method is to randomly choose K data points from the data and place the centroids at their locations, then continue the algorithm as will be shown; but a bad initialization would lead to bad clustering, so we have 2 methods to choose from:
Good method: K-means clustering has a cost function, discussed before, so you can initialize your centroids and run the full algorithm n times (say 100), compute the cost of each run, and just take the run with the lowest cost, which is the best among those trials.
Better method: This method is called K-means++
1-Choose 1 cluster centroid uniformly at random from data points.
2-for each data point x, compute the distance D(x) to the nearest cluster center. Noting
that for the first time we have only 1 cluster center so we just compute the distance
between the points and this centroid but with time we will have more centroids so for
each data point we choose the nearest centroid.
3-Choose a new cluster centroid from among the data points such that the probability of x being chosen is proportional to D(x)². (Note: this means we are more likely to select a data point as the new cluster centroid if it is far away; the squared distance is used to exaggerate this effect.) (Visualize: https://www.coursera.org/lecture/ml-clustering-and-retrieval/smart-initialization-via-k-means-T9ZaG)
4-Repeat 2 and 3 until all cluster centroids have been chosen.
2)Assign data points to Clusters (Place the remaining books one by one)
For each data point, compute the Euclidean distance (L2 distance) between it and all centroids to find the nearest centroid. The data point is then assigned to that nearest centroid.
3)Update Cluster centroids (Start over with 3 different books)
Now, we have new clusters, that need centers. A centroid’s new value is going to be the mean
of all the examples in a cluster.
4)Repeat step 2–3 until the stopping condition is met.
It halts creating and optimizing clusters when either:
The data points assigned to each cluster remain the same (can take too much time)
The centroids remain the same (also time consuming)
The distance of the data points from their centroid is below the threshold you have set
A fixed number of iterations has been reached (too few iterations → poor results, so choose the maximum number of iterations wisely)
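To make steps 1–4 concrete, here is a minimal NumPy sketch (not the course's code); the function name, the data, k and the iteration cap are all assumptions for illustration, and empty clusters are not handled.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Initialize centroids at k randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2) Assign each point to its nearest centroid (Euclidean / L2 distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Update each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4) Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [5, 5]])
centroids, labels = kmeans(X, k=2)
```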
How to evaluate your k-means algorithm result ?
1-Inertia: (i.e. the sum of squared distances to the nearest cluster center) Inertia tells how far away the points within a cluster are, so a small value of inertia is aimed for. Inertia's value starts at zero and goes up. (By the way, this is the cost function explained before.)
2- Silhouette score: (used with any clustering technique) Silhouette score tells how far away the
data points in one cluster are, from the data points in another cluster. The range of silhouette
score is from -1 to 1. Score should be closer to 1 than -1.
Assume the data have been clustered via any technique, such as k-means, into k clusters. For
each datum i let a(i) be the average distance between i and all other data within the same cluster.
We can interpret a(i) as a measure of how well i is assigned to its cluster (the smaller the value,
the better the assignment).
Let b(i) be the smallest average distance from i to all points in any other cluster, of which i is not a member. The cluster with this smallest average dissimilarity is said to be the "neighboring cluster" of i, because it is the next best fit cluster for point i. We now define the silhouette: s(i) = (b(i) − a(i)) / max(a(i), b(i)).
(Visualize: https://www.youtube.com/watch?list=PLmNPvQr9Tf-ZSDLwOzxpvY-HrE0yv-
8Fy&v=5TPldC_dC0s)
3- By plotting silhouette score against each value of K and choosing the k corresponding to the
highest silhouette score.
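As a hedged illustration of both metrics, the scikit-learn snippet below fits k-means for several values of k and prints the inertia and the silhouette score; the synthetic data and the range of k are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(300, 2)
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))
# Pick the k with the highest silhouette score (or use the "elbow" of the inertia curve).
```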
Note: K-Medians is another clustering algorithm related to K-Means, except instead of
recomputing the group center points using the mean we use the median vector of the group.
This method is less sensitive to outliers (because of using the Median) but is much slower for
larger datasets as sorting is required on each iteration when computing the Median vector.
Are there any other algorithms of clustering other than k-means and hierarchical ?
Yes, there are several algorithms as:
Mean shift clustering: Mean shift clustering is a sliding-window-based algorithm that attempts
to find dense areas of data points. It is a centroid-based algorithm meaning that the goal is to
locate the center points of each group/class, which works by updating candidates for center
points to be the mean of the points within the sliding-window. These candidate windows are
then filtered in a post-processing stage to eliminate near-duplicates, forming the final set of
center points and their corresponding groups.
1-To explain mean-shift we will consider a set of points in two-dimensional space like the
above illustration. We begin with a circular sliding window centered at a point C (randomly
selected) and having radius r as the kernel. Mean shift is a hill climbing algorithm which
involves shifting this kernel iteratively to a higher density region on each step until
convergence.
2-At every iteration the sliding window is shifted towards regions of higher density by shifting
the center point to the mean of the points within the window (hence the name). The density
within the sliding window is proportional to the number of points inside it. Naturally, by
shifting to the mean of the points in the window it will gradually move towards areas of higher
point density.
3-We continue shifting the sliding window according to the mean until there is no direction at
which a shift can accommodate more points inside the kernel. Check out the graphic above; we
keep moving the circle until we no longer are increasing the density (i.e number of points in the
window).
4-This process of steps 1 to 3 is done with many sliding windows until all points lie within a
window. When multiple sliding windows overlap the window containing the most points is
preserved. The data points are then clustered according to the sliding window in which they
reside.
In contrast to K-means clustering there is no need to select the number of clusters as mean-shift
automatically discovers this. That’s a massive advantage. The fact that the cluster centers
converge towards the points of maximum density is also quite desirable as it is quite intuitive to
understand and fits well in a naturally data-driven sense. The drawback is that the selection of
the window size/radius “r” can be non-trivial.
(Check this GIF noting that the black dots are the centers of the generated sliding windows and
the red dots are the average that is computed at the end of each iteration to move the center of
the sliding window to it and the grey dots are the data : https://gph.is/2piyT8P)
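A small hedged sketch of mean shift with scikit-learn; the synthetic data is made up, and the window radius (bandwidth) is estimated with estimate_bandwidth rather than chosen by hand, since that choice is the non-trivial part mentioned above.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + [6, 6]])
bandwidth = estimate_bandwidth(X, quantile=0.2)      # plays the role of the radius "r"
ms = MeanShift(bandwidth=bandwidth).fit(X)
print("clusters found:", len(ms.cluster_centers_))   # no need to specify k in advance
```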
Lec14:
Dimensionality reduction:
Dimensionality reduction means to reduce
features representing data by removing
redundant data and compress them to use up
less memory and speed up our learning
algorithm.
As shown in the figure, both feature x1 & x2
represent the length but in different units which
means highly redundant representation. But in
industry we may have thousands of features
representing the data which is hard to track
them like this example to know which features
are redundant.
Back to our example, so we want to reduce
data from 2D (R2) to 1D (R1) so we are
representing our data in z1 instead of x1 & x2 as
shown in the previous figure.
In the following figure, A reduction from 3D
data to 2D data is also shown.
Dimensionality reduction also offers us better data visualization: if we have, for example, 50-D data (50 features), we can reduce it to 2-D data (z1 and z2) and visualize it.
1) Data preprocessing: mean normalization (and feature scaling if needed)
2) Compute the covariance matrix: Sigma = (1/m) · XT X
3) Compute the eigenvectors of matrix Sigma: [U,S,V]=svd(Sigma); "octave" [you can use eig(Sigma)]
4) Take the first k columns of U to form Ureduced
5) z = UreducedT . x
Notes:
1-xapprox(i) is the data projected onto the PCA vector.
2-The formula for choosing k is [(1/m) Σi ||x(i) − xapprox(i)||²] / [(1/m) Σi ||x(i)||²] ≤ 0.01: the numerator is the average squared projection error and the denominator is the total variation in the data (≤ 0.01 means 99% of the variance is retained).
Dimensionality reduction, Principle component analysis (PCA)
Dimensionality reduction
Big data applications have become pervasive in our lives, and in machine learning the final prediction or decision is generally based on some variables known as features; as time goes on, the number of these features keeps growing, up to the range of millions. As the features increase, it becomes harder to visualize the data, since humans can only see in 1, 2 and 3 dimensional space. But visualization is not the only problem: sometimes those features are correlated, which means the data are redundant, and this in turn makes the model more prone to overfitting. The computational cost also increases with the number of features, so here comes the motivation for finding ways to decrease the dimensionality and solve the problems mentioned. To sum up, we can say that dimensionality reduction is finding a low-dimensional representation of the present higher-dimensional data that retains as much information as possible.
PCA should be applied to numerical data; in case we have categorical data, we can convert it to numerical features by assigning a number to each category.
What is variance, covariance, singular value decomposition, eigen values and eigen vectors
mean?
1-Variance: It is a measure of the variability or it simply measures how spread the data set is.
2-Covariance: a measure of the strength and direction of the correlation between two or more sets of random variables, such that if the covariance of x and y is positive then x and y move in the same direction (if x increases, y increases, and vice versa); if the covariance of x and y is negative then x and y move in opposite directions (if x increases, y decreases); and if the covariance is 0 then x and y are uncorrelated. Note that covariance itself is not bounded; its normalized version, the correlation coefficient (covariance divided by the product of the standard deviations), lies in the range -1 to 1, so as that number gets closer to 1 or -1 it indicates a strong relationship between x and y, either positive or negative.
When we say n×n covariance matrix this means that the diagonal contains the variance of each
feature (feature1, feature2, …, feature n) and
the off-diagonal values are the covariance
between the i,j of this place. so covariance
matrix is symmetric.
3-Singular value decomposition (SVD): The intuition behind this is that every matrix has a close matrix of lower rank that can act as the best approximation to the original higher-rank matrix. Note that SVD can be applied to all matrices (rectangular or square), which is a very useful property. The SVD of a matrix A is the factorization of A into 3 matrices U, D & V, where the columns of U and V are orthonormal (orthogonal and normalized) and D is diagonal with positive entries (A = U D VT).
4) Eigenvalues and eigenvectors: If you can draw a line through the three points (0,0), v and Av, then Av is just v multiplied by a number λ; that is, Av = λv. In this case, we call λ an eigenvalue and v an eigenvector. For example, for the matrix shown in the figure, (1,2) is an eigenvector and 5 an eigenvalue.
Algorithm of PCA:
1-Normalize or standardize the data
2-calculate the covariance matrix of these data. (∑)
3-Calculate the eigen vectors of this covariance and sort it according to their eigen values.
4-Take the first K columns (K principal components) of the eigen vector.
5-Multiply this matrix by the original data.
To check that these PCs are right, just calculate the covariance matrix of the newly formed reduced matrix: you will see that the diagonal values are non-zero and all off-diagonal values are (approximately) zero, which means that the principal components are uncorrelated.
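A minimal NumPy sketch of the five steps above, including the check that the covariance of the projected data is (approximately) diagonal; the data, k and variable names are assumptions for illustration.

```python
import numpy as np

X = np.random.randn(200, 5)                     # 200 examples, 5 features
X_std = (X - X.mean(axis=0)) / X.std(axis=0)    # 1) standardize
Sigma = np.cov(X_std, rowvar=False)             # 2) covariance matrix (5x5)
eigvals, eigvecs = np.linalg.eigh(Sigma)        # 3) eigenvalues/eigenvectors
order = np.argsort(eigvals)[::-1]               #    sort by decreasing eigenvalue
k = 2
P = eigvecs[:, order[:k]]                       # 4) first k principal components
Z = X_std @ P                                   # 5) project the data (200 x k)
print(np.round(np.cov(Z, rowvar=False), 3))     # check: off-diagonals ~ 0 (uncorrelated PCs)
X_approx = Z @ P.T                              # approximate reconstruction (see later note)
```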
BUT maybe we have some questions about steps 3 and 4: why do we calculate the eigenvectors of the covariance matrix, and why does taking the first K components give us the principal components?
We want now to transform the original data to another data with a diagonalized covariance so
let's have a look at the mathematical derivation:
Here comes the whole idea: if we can find a P that makes Cy diagonal, then Y = PX will be the new transformed data expressed in principal components. By mathematical theorems, if the eigenvectors of Cx are used as (the rows of) P, this condition is met. Now, if we want to transform the points to k dimensions, we select the first k eigenvectors of the matrix Cx (sorted in decreasing order of eigenvalue), form a matrix from them, and use it as P.
So, if we have n original data points, each of m dimensions, then:
X :m*n
P : k*m
Y = PX : (k*m)(m*n) = (k*n)
Hence, our new transformed matrix has n data points having k dimensions.
This shows that the distribution of the given data controls the dimensionality reduction method
to be used.
Independent component analysis (ICA)
Algorithm
Given a set of observations of random variables x1(t), x2(t)…xn(t), where t is the time or sample
index, assume that they are generated as a linear mixture of independent components: y=Wx,
where W is some unknown matrix. Independent component analysis now consists of estimating
both the matrix W and the yi(t), when we only observe the xi(t).
• Use statistical “latent variables“ system
• Random variable sk instead of time signal
• xj = aj1s1 + aj2s2 + .. + ajnsn, for all j
x = As
• The ICs s are latent variables & are unknown, AND the mixing matrix A is also unknown
• Task: estimate A and s using only the observable random vector x
• Let's assume that the no. of ICs = the no. of observable mixtures, and that A is square and invertible
• So after estimating A, we can compute W = A-1 and hence s = Wx = A-1x
So our main task is to find s, so we have to find W, which is equivalent to A-1, which in turn means that our task now is to find the matrix A, which will solve our whole problem.
Find A → get its inverse A-1 → multiply by the given observations x → then s is computed.
How to get A? I searched a lot and didn't get a direct answer (need help); a sketch of how it is typically done in practice follows.
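For what it's worth, in practice A (equivalently W = A-1) is usually estimated by maximizing the non-Gaussianity / statistical independence of the recovered components; the hedged sketch below simply hands the problem to scikit-learn's FastICA. The sources and the mixing matrix here are made up.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # two independent sources (unknown in reality)
A = np.array([[1.0, 0.5], [0.5, 2.0]])             # unknown mixing matrix (assumed here)
x = s @ A.T                                        # observed mixtures: x = As
ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(x)                       # estimated sources s
W = ica.components_                                # estimated unmixing matrix (~ A^-1 up to scale/permutation)
```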
Factor Analysis (FA)
In the Factor Analysis technique, variables are grouped by their correlations, i.e., all variables in
a particular group will have a high correlation among themselves, but a low correlation with
variables of other group(s). Here, each group is known as a factor. These factors are small in
number as compared to the original dimensions of the data. However, these factors are difficult
to observe. In every factor analysis, there are the same number of factors as there are variables.
Each factor captures a certain amount of the overall variance in the observed variables, and the
factors are always listed in order of how much variation they explain. The eigenvalue is a
measure of how much of the variance of the observed variables a factor explains. Any factor
with an eigenvalue ≥1 explains more variance than a single observed variable. So if the factor for socioeconomic status had an eigenvalue of 2.3, it would explain as much variance as 2.3 observed variables. This factor, which captures most of the variance in those three variables,
could then be used in other analyses. The factors that explain the least amount of variance are
generally discarded.
The variable with the strongest association to the underlying latent variable, Factor 1, is income, with a factor loading of 0.65.
Since factor loadings can be interpreted like standardized regression coefficients, one could also
say that the variable income has a correlation of 0.65 with Factor 1. This would be considered a
strong association for a factor analysis in most research fields.
Two other variables, education and occupation, are also associated with Factor 1. Based on the
variables loading highly onto Factor 1, we could call it “Individual socioeconomic status.”
House value, number of public parks, and number of violent crimes per year, however, have
high factor loadings on the other factor, Factor 2. They seem to indicate the overall wealth
within the neighborhood, so we may want to call Factor 2 “Neighborhood socioeconomic
status.”
Notice that the variable house value also is marginally important in Factor 1 (loading = 0.38).
This makes sense, since the value of a person’s house should be associated with his or her
income.
Random forest
Random forest is the most popular of the feature selection methods (remember that all the dimensionality reduction techniques explained before were of the feature extraction type). But to explain it we should first know an algorithm called the decision tree, as a random forest is a collection of decision trees.
Decision Trees
Decision tree is a type of supervised learning algorithm (having a pre-defined target variable). It
works for both categorical and continuous input and output variables so in other words we say
that it is used for both classification and regression problems. In this technique, we split the
population or sample into two or more homogeneous sets (or sub-populations) based on most
significant splitter / differentiator in input variables.
Terminology:
Root Node: It represents entire population or sample and this further gets divided into
two or more homogeneous sets.
Splitting: It is a process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, then it is called decision
node.
Leaf/Terminal Node: Nodes that do not split are called Leaf or Terminal nodes.
Pruning: When we remove sub-nodes of a decision node, this process is called pruning.
You can say opposite process of splitting.
Branch / Sub-Tree: A sub section of entire tree is called branch or sub-tree.
Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
Algorithm and examples
There are several algorithms that can be used to build a decision tree, but we will talk about only 2 of them.
1)ID3 (Iterative Dichotomiser 3) This uses entropy and information gain as metrics.
2)CART (Classification And Regression Trees) This uses Gini index as metric. (This is
called greedy algorithm as we have an excessive desire to lower the cost)
There are other algorithms depending on other metrics as Chi-square and reduction in variance.
Algorithm:
1-Start with a training data set which we’ll call S. It should have attributes and classification.
2-Determine the best attribute in the dataset. (We will go over the definition of best attribute)
3-Split S into subsets that contain the possible values for the best attribute.
4-Make decision tree node that contains the best attribute.
5-Recursively generate new decision trees by using the subset of data created from step 3 until a
stage is reached where you cannot classify the data further. Represent the class as leaf node.
Best attribute means higher information gain or lower gini score. Let's see each type by
examples.
therefore, IG(outlook)=
Maybe you are asking where the feature Temperature is in our tree; here comes the concept of feature selection for dimensionality reduction: the attribute (feature) Temperature is useless or redundant in this problem, which means it loads the model and consumes computation but doesn't carry much important information. Random forest will use a bunch of decision trees to do this, and it will be discussed in detail later.
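As a hedged aside, the snippet below shows how the "best attribute" score is actually computed with entropy and information gain, using counts in the spirit of the classic play-golf table (the exact numbers here are assumed, not taken from the figure).

```python
import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

H_S = entropy([9, 5])                                   # whole dataset: 9 yes, 5 no -> ~0.940
# Splitting on Outlook: Sunny (2 yes, 3 no), Overcast (4 yes, 0 no), Rain (3 yes, 2 no)
subsets = [[2, 3], [4, 0], [3, 2]]
H_split = sum(sum(c) / 14 * entropy(c) for c in subsets)
IG_outlook = H_S - H_split                              # ~0.247; pick the attribute with the highest IG
```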
NOW, let's go back to the Random forest dimensionality reduction technique to see how it collects decision trees and uses them for feature selection.
Random forest
It is an ensemble of decision trees which is used as a prediction model (like the decision tree algorithm) and is also used for dimensionality reduction through a feature selection approach. Most of the time it is trained with the "bagging" method. The general idea of bagging is that a combination of learning models (decision trees) improves the overall result and gives a more accurate and stable prediction.
Random Forest adds additional randomness to the model, while growing the trees. Instead of
searching for the most important feature while splitting a node, it searches for the best feature
among a random subset of features. This results in a wide diversity that generally results in a
better model. Therefore, in Random Forest, only a random subset of the features is taken into
consideration by the algorithm for splitting a node. You can even make trees more random, by
additionally using random thresholds for each feature rather than searching for the best possible
thresholds (like a normal decision tree does).
Note: If you still don't get what the random forest algorithm is, we can say that it is the same algorithm as the decision tree, with the difference that we take random subsets of features (not hand-selected features) and construct several decision trees, where each tree uses its own randomly selected features; then, when we have input data, we feed it to all the decision trees, take the output of each tree, and average (or majority-vote) the outputs to get the final output.
What are the hyper parameters of random forest ?
The Hyperparameters in random forest are either used to increase the predictive power of the
model or to make the model faster. I will here talk about the hyperparameters of sklearns built-
in random forest function.(Sklearn is a python library built upon SciPy)
1) Increasing the Predictive Power:
Firstly, there is the „n_estimators“ hyperparameter, which is just the number of trees the
algorithm builds before taking the maximum voting or taking averages of predictions. In
general, a higher number of trees increases the performance and makes the predictions more
stable, but it also slows down the computation.
Another important hyperparameter is „max_features“, which is the maximum number of
features Random Forest is allowed to try in an individual tree. Sklearn provides several options,
described in their documentation.
The last important hyperparameter for predictive power is "min_samples_leaf". This determines, as its name already says, the minimum number of samples required to be at a leaf node.
2) Increasing the Models Speed:
The „n_jobs“ hyperparameter tells the engine how many processors it is allowed to use. If it
has a value of 1, it can only use one processor. A value of “-1” means that there is no limit.
„random_state“ makes the model’s output replicable. The model will always produce the same
results when it has a definite value of random_state and if it has been given the same
hyperparameters and the same training data.
Lastly, there is the „oob_score“ (also called oob sampling), which is a random forest cross
validation method. In this sampling, about one-third of the data is not used to train the model
and can be used to evaluate its performance. These samples are called the out of bag samples. It
is very similar to the leave-one-out cross-validation method, but almost no additional
computational burden goes along with it.
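A hedged scikit-learn sketch tying those hyperparameters together and reading off feature_importances_ as the feature-selection signal; the dataset and the specific values are made up.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
rf = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_features="sqrt",   # random subset of features tried at each split
    min_samples_leaf=2,    # minimum samples allowed in a leaf
    n_jobs=-1,             # use all processors
    oob_score=True,        # out-of-bag evaluation
    random_state=42,       # reproducible results
)
rf.fit(X, y)
print("OOB score:", rf.oob_score_)
print("feature importances:", rf.feature_importances_)   # low-importance features can be dropped
```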
=================================================================
In case you just didn't get it: random forests and decision trees are not like PCA, which gives you a set of features to use with any other classification or regression algorithm. They are classification or regression models that have a feature selection method inside them. Thus dimensionality reduction is included in their algorithm, unlike PCA and ICA, which give you a set of new features created from the original features that you can then use in any classification or regression technique, say SVM for example.
How to reconstruct from compressed representation (z)?
In fact we can't get the exact data x back, but we can get xapprox, which is the data projected onto the line and is approximately equal to the exact data:
xapprox = Ureduced . z
Notes:
1-As PCA gives fewer features, the model is less likely to overfit. Take into consideration that if overfitting takes place, the solution is to regularize, not to apply PCA (mind the difference).
2-Don't start with PCA, but use it when the result is not as you expect or the learning process is
slow due to the high number of features.
Lec15
Gaussian (Normal) distribution:
If x is Gaussian distributed with mean μ and variance σ²: x ~ N(μ, σ²)
Equation: p(x; μ, σ²) = (1 / (√(2π)·σ)) · exp(−(x − μ)² / (2σ²))
Note: σ is called the standard deviation, and it determines the width of the Gaussian probability density.
Examples:
Anomaly detection:
The identification of rare items, events or observations which raise suspicions by differing
significantly from the majority of the data.
Example:
Assume features of aircraft engine were given as
follows for training set (red crosses) and then we
have a new engine (new example) so when the
features of this new engine plotted (green crosses)
we found:
1) the plot was in the same region of other
predefined examples then this new engine is OK and can be shipped to the user
2) the plot was outside the normal region (act as outlier) then this new engine is anomaly and
should be checked in details before being shipped to the user.
How to define xtest as an anomaly? We model our given data as p(x), then compute p(xtest) and compare it to ϵ (a very small number): if p(xtest) < ϵ it is flagged as an anomaly, otherwise it is OK.
Algorithm:
Note: If the features are non-Gaussian, you should transform them to be Gaussian, e.g. by taking the log or the square root or whatever transformation is needed. (check lec15.6)
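A minimal sketch of the algorithm under the per-feature Gaussian assumption: estimate μ and σ for each feature from the (normal) training data, multiply the per-feature densities to get p(x), and flag p(x) < ϵ. The training data and the value of ϵ below are made up.

```python
import numpy as np
from scipy.stats import norm

X_train = np.random.randn(500, 2) * [1.0, 2.0] + [5.0, 10.0]   # "normal" engines only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

def p(x):
    # independence assumption: p(x) = product over features of p(x_j; mu_j, sigma_j^2)
    return np.prod(norm.pdf(x, loc=mu, scale=sigma))

epsilon = 1e-4
x_test = np.array([5.2, 10.5])
print("anomaly" if p(x_test) < epsilon else "OK")
```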
Assume we have a machine-monitoring system with 2 features, x1 and x2, distributed as shown by the red crosses, and a new data point (the green cross) which is obviously anomalous. If we plot the values of x1 and x2 on their individual probability distributions, we find that both values are normal values close to other previously plotted values, so each probability is high and the new data point will not be flagged as anomalous.
The multivariate Gaussian distribution does not model each p(xj) separately as in the previous example, but models p(x) all in one go, so μ and ∑ (the covariance) become a vector and a matrix such that μ ∈ Rn and ∑ ∈ Rn×n.
Equation: p(x; μ, ∑) = (1 / ((2π)n/2 |∑|1/2)) · exp(−½ (x − μ)T ∑-1 (x − μ))
Examples:
Parameters fitting for anomaly detection: μ = (1/m) · sum over i of x(i), and ∑ = (1/m) · sum over i of (x(i) − μ)(x(i) − μ)T
Algorithm:
Gaussian Mixture Models (GMM), Anomaly detection & K-nearest neighbours (KNN)
The k-means clustering model explored before is simple and relatively easy to understand, but its simplicity leads to practical challenges in its application. In particular, the non-probabilistic nature of k-means, its use of a simple distance-from-cluster-center rule to assign cluster membership, its restriction to circular clusters, and its trouble with overlapping clusters lead to poor performance in many real-world situations. In this section we will take a look at Gaussian mixture models (GMMs), which can be viewed as an extension of the ideas behind k-means, but can also be a powerful tool for estimation beyond simple clustering.
Note: The Gaussian or normal distribution is also sometimes called a bell curve. Its two parameters are the mean μ and the standard deviation σ. One of the key properties of the normal distribution is that the highest probability is located at the mean, while the probabilities approach zero as we move away from the mean. What makes Gaussians so special is that many dataset distributions turn out to be approximately Gaussian. A famous theorem in statistics, the Central Limit Theorem, states that the sum (or average) of more and more independent samples tends toward a Gaussian distribution, even if the original distribution is not Gaussian. This makes the Gaussian very powerful and versatile!
Note: Values are not outside the normal global range, but are abnormal compared to the
seasonal pattern.
3-Collective anomalies:
A subset of data points within a data set is considered anomalous if those values as a collection
deviate significantly from the entire data set, but the values of the individual data points are not
themselves anomalous in either a contextual or global sense.
Note: In the example, two time series that were discovered to be related to each other, are
combined into a single anomaly. For each time series the individual behavior does not deviate
significantly from the normal range, but the combined anomaly indicated a bigger issue.
A sudden surge in order volume at an ecommerce company, as seen in that company’s hourly
total orders for example, could be a contextual outlier if this high volume occurs outside of a
known promotional discount or high volume period like Black Friday. Could this stampede be
due to a pricing glitch which is allowing customers to pay pennies on the dollar for a product?
A publicly traded company’s stock is never a static thing, even when prices are relatively stable
and there isn’t an overall trend, and there are minute fluctuations over time. If the stock price
remained at exactly the same price (to the penny) for an extended period of time, then that
would be a collective outlier. In fact, this very thing occurred to not one, but several tech
companies on July 3 of this year on the Nasdaq exchange when the stock prices for several
companies – including tech giants Apple and Microsoft – were listed as $123.45.
What are types of anomaly detection techniques?
1-Statistical methods as moving average (AKA rolling average or Low-pass filter) and moving
median (These are not machine-learning based algorithm unlike the upcoming ones)
2-Density-based methods as k-nearest neighbors and gaussian mixture models.
3-Clustering-based methods as K-means.
4-SVM-based methods as SVM (for supervised) and One-class SVM (for unsupervised)
GMM was explained enough in the main course to be used in anomaly detection so we will
discuss here another method in details which is k-nearest neighbours.
K-nearest neighbors (KNN)
K-nearest neighbors is mainly a simple robust versatile classification technique (can be used in
regression but not recommended). It falls in the supervised learning family of algorithms. KNN
is described as non-parametric, instance-based (lazy) algorithm:
Non-parametric: It does not make any assumptions about the underlying data distribution. This is pretty useful, as in the real world most practical data does not obey the typical theoretical assumptions (e.g. Gaussian mixtures, linear separability, etc.). Non-parametric algorithms like KNN come to the rescue here. For example, suppose our data is highly non-Gaussian but the learning model we choose assumes a Gaussian form; in that case, our algorithm would make extremely poor predictions.
Instance-based (lazy): It does not use the training data points to do any generalization. In other words, there is no explicit training phase, or it is very minimal, which means the training phase is pretty fast. Lack of generalization means that KNN keeps all the training data; more exactly, all the training data is needed during the testing phase (well, this is an exaggeration, but not far from the truth). This is in contrast to techniques like SVM, where you can discard all non-support vectors without any problem. Most lazy algorithms – especially KNN – make decisions based on the entire training data set (in the best case a subset of it). The dichotomy is pretty obvious here: there is a non-existent or minimal training phase but a costly testing phase. The cost is in terms of both time and memory: more time may be needed because, in the worst case, all data points take part in the decision, and more memory is needed because we store all the training data.
Since it is non-parametric, does this mean that there are no assumptions at all?
The answer is neither yes nor no: there are no assumptions about how the data is distributed, but there are some conditions required to apply the algorithm, and those conditions are simple and logical:
1-The data should be in a feature space, or in other words in a metric space. The data are therefore scalars or vectors, so we can apply distance-based calculations such as the Euclidean (most popular), Manhattan, Chebyshev and Hamming distances.
2-Since this is supervised learning, the data should be labeled.
3-The number K should be assigned to determine how many neighbors influence the classification result (this will be understood better from the algorithm).
How to choose K? If K is too low (near 1), we increase variance and decrease bias (overfitting); if it is too high, we are more likely to increase bias (underfitting). The best value of K is therefore obtained by cross-validation.
Algorithm
1. Load the training and test data
2. Choose the value of K
3. For each point in test data:
- find the distance (Euclidean distance for example) to all training data points
- store the distances in a list and sort it
- choose the first k points
- assign a class to the test point based on the majority of classes present in the chosen points
4. End
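A minimal NumPy sketch of those four steps (distances, sort, take k, majority vote); the toy dataset and k are made up.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    dists = np.linalg.norm(X_train - x_test, axis=1)         # distance to every training point
    nearest = np.argsort(dists)[:k]                          # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]    # majority class among them

X_train = np.array([[1, 1], [1, 2], [6, 6], [7, 7]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 1]), k=3))  # -> 0
```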
Curse of dimensionality
The higher we go in dimensions, the harder our problems become. This curse is well known in computer science across many related topics, and it was one of the motivations behind the dimensionality reduction we explained. The problem comes from the fact that distance measures lose meaning in higher dimensions. In high dimensions a curious phenomenon arises: the ratio between the nearest and farthest points approaches 1, i.e. the points essentially become uniformly distant from each other. This phenomenon can be observed for a wide variety of distance metrics, but it is more pronounced for the Euclidean metric than, say, the Manhattan metric. The premise of nearest-neighbor search is that "closer" points are more relevant than "farther" points, but if all points are essentially uniformly distant from each other, the distinction is meaningless. So, to end this part: apply a dimensionality reduction method if you are dealing with high-dimensional data, and also try the Manhattan distance, but if the curse persists then KNN is probably not the best-suited classifier for your problem.
Some observations on KNN
1- If we assume that the points are d-dimensional, then the straightforward implementation of finding the k nearest neighbors takes O(dn) time.
2- A simple approach to select k, in case you don't want to apply cross-validation, is to set k = √n (with n the number of training samples).
3- In KNN, k is usually chosen as an odd number if the number of classes is 2.
4- There are also some nice techniques like condensing, search tree and partial distance that try
to reduce the time taken to find the k nearest neighbor.
(kd search tree (only for low and medium dimensional data) :
1-https://www.coursera.org/lecture/ml-clustering-and-retrieval/kd-tree-representation-S0gfp
2-https://www.coursera.org/lecture/ml-clustering-and-retrieval/nn-search-with-kd-trees-6eTzw)
Improvements on KNN
1- Weighted KNN is one of the most popular improvements to the normal KNN algorithm: the class of each of the K neighbors is weighted in proportion to the inverse of its distance from the given test point. This ensures that nearer neighbors contribute more to the final vote than the more distant ones.
2- Choosing the distance metric carefully can help improve accuracy, e.g. using the Hamming distance for text classification or the Manhattan distance for higher-dimensional data.
3- Rescaling the data is very important, either by normalization or standardization (mean normalization).
4- As said before dimensionality reduction is a good step to be done before using KNN
algorithm.
Recommender system:
Assume a problem where we want to construct a recommender system for movies based on previous ratings; how could we construct a learning algorithm for such a system?
For each user j, learn a parameter vector θ(j). Predict that user j rates movie i with (θ(j))T.x(i) stars. Now our goal is to predict the values of those question marks using the formula (θ(j))T.x(i), which is a linear regression formula, so we apply the cost function and gradient descent similarly to what we studied in linear regression, as follows:
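Before the cost function, here is a hedged sketch of just the prediction step (θ(j))T·x(i); the feature values and user parameters below are invented to mimic the romance/action example.

```python
import numpy as np

X = np.array([[1.0, 0.9, 0.0],       # movie features x(i): [bias, romance, action]
              [1.0, 0.1, 1.0]])
Theta = np.array([[0.0, 5.0, 0.0],   # theta(j) for a user who loves romance
                  [0.0, 0.0, 5.0]])  # theta(j) for a user who loves action
predicted = X @ Theta.T              # entry (i, j) = predicted rating of movie i by user j
print(predicted)
```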
Second Approach: Collaborative filtering:
We can notice a different formulation here, where the features themselves are unknown, as it could be hard to hire people to go through each movie and rate how romantic or action-packed it is. But we have θs that tell us how much each user (Alice, Bob, …) loves or hates romantic and action movies, and by some method that collaborates these θs with the ratings given by those users, we will be able to form these features. What happens here is that, for example, Alice and Bob give 5 to romantic movies, which means they love them, and when we check the first movie (Love at Last) we see that both Bob and Alice gave it 5, so most probably this movie is a romantic movie, and so on…
Therefore, we can say the following:
Take care that here we are given the θs and we learn the x(i) (features), so our optimization is:
Algorithm:
Mean normalization:
If a user hasn't given any rating to any movie, the previous method would end up predicting a 0 rating for all movies. Mean normalization solves this problem: instead of using the given ratings Y directly, we use them with each movie's mean rating μ subtracted (Y − μ), as follows:
Lec17
Remember:
As the amount of data in the dataset increases, learning curves become better and the model is less likely to overfit, but at the same time it becomes computationally expensive to apply gradient descent in its normal form on the whole batch, so we have to find better techniques.
Algorithm:
Note: The red path is batch gradient descent, while the magenta path is stochastic gradient descent.
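Since the algorithm itself was shown as a figure, here is a hedged sketch contrasting the two update styles on a toy linear regression; the learning rate, number of epochs and synthetic data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.uniform(0, 2, 200)]          # bias column + one feature
y = 4 + 3 * X[:, 1] + rng.normal(0, 0.5, 200)
alpha, theta_batch, theta_sgd = 0.05, np.zeros(2), np.zeros(2)

for epoch in range(50):
    # Batch GD: one update per epoch, using all m examples
    theta_batch -= alpha * (X.T @ (X @ theta_batch - y)) / len(y)
    # Stochastic GD: one update per (shuffled) example
    for i in rng.permutation(len(y)):
        theta_sgd -= alpha * (X[i] @ theta_sgd - y[i]) * X[i]

print(theta_batch, theta_sgd)   # both head towards [4, 3]; SGD follows a noisier path
```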
Conceptually association rules is a very simple technique. The end result is one or more
statements of the form “if this happened, then the following is likely to happen." In a rule, the
"if" portion is called the antecedent, and the "then" portion is called the consequent.
Support: the fraction of all transactions that contain the item set, e.g. support(A→B) = freq(A, B) / N.
Confidence: how often the consequent appears given the antecedent, confidence(A→B) = freq(A, B) / freq(A).
Lift: the ratio of the observed confidence to what would be expected if A and B were independent, lift(A→B) = confidence(A→B) / support(B); a lift above 1 indicates a useful rule.
====================================================================
Now that we understand how to quantify the importance of association of products within an
item set, the next step is to generate rules from the entire list of items and identify the most
important ones. This is not as simple as it might sound. Supermarkets will have thousands of
different products in store. After some simple calculations, it can be shown that just 10 products
will lead to 57000 rules!! And this number increases exponentially with the increase in number
of items. Finding lift values for each of these will get computationally very very expensive.
How to deal with this problem? How to come up with a set of most important association rules
to be considered? Apriori algorithm comes to our rescue for this. The challenge is the mining of
important rules from a massive number of association rules that can be derived from a list of
items.
===================================================================
Apriori algorithm
Apriori is an algorithm for frequent item set mining and association rule learning over
transactional databases. It proceeds by identifying the frequent individual items in the database
and extending them to larger and larger item sets as long as those item sets appear sufficiently
often in the database. If we have an item set {A,B,C,D} and assume that the items {A,B} occur together frequently ("frequently" means the support exceeds a certain specified threshold), then any subset of {A,B}, e.g. {A} or {B}, must appear in at least as many transactions as {A,B}, so every subset of a frequent item set is itself frequent without checking; conversely, any superset {A,B,…} can appear in at most as many transactions as {A,B}. This is called the anti-monotone property of support: if we drop an item from an item set, the support of the newly generated item set will either stay the same or increase. The Apriori principle is a result of this anti-monotone property. Similarly, when we start calculating the frequency of individual items: for example, if {A,B} does not satisfy our threshold, an item set with any item added to it will never cross the threshold either.
Note: Frequent item sets are the item sets for which support value (fraction of transactions
containing the item set) is above a minimum threshold (Threshold is aka minsup).
Algorithm in steps:
1-Generate all frequent item sets (support ≥ minsup) having only one item.
2-Next, generate item sets of length 2 as all possible combinations of above item sets. Then,
prune the ones for which support value fell below minsup.
3-Generate item sets of length 3 as all possible combinations of length 2 item sets (that
remained after pruning) and perform the same check on support value.
4-Iteratively, we keep increasing the length of item sets by one like this and check for the
threshold at each step until we reach the end.
(Visualize: https://annalyzin.files.wordpress.com/2016/04/association-rules-apriori-tutorial-
explanation.gif
At first we had 4 individual items (tomato, sausage sandwich, guava & grapes); the support of tomato was less than minsup so we removed it, then made all possible combinations from the three remaining items to form item sets of length two, then checked that each item set's support is above minsup (all of them passed), then formed all possible combinations again to form item sets of length 3.)
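A compact, hedged sketch of that generate-and-prune loop; the toy transactions and the minsup value are made up, and support is computed as the fraction of transactions containing the item set (confidence and lift can then be derived from these supports).

```python
transactions = [{"guava", "grapes"}, {"guava", "sausage"}, {"guava", "grapes", "sausage"},
                {"grapes", "sausage"}, {"tomato", "grapes"}]
minsup = 0.4

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level 1: frequent single items
items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
k = 2
while frequent:
    print(k - 1, [set(f) for f in frequent])
    # Generate length-k candidates from the surviving (k-1)-item sets, then prune
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= minsup]
    k += 1
```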
Closed frequent item set: It is a frequent item set for which there exists no superset with the same support as the item set. Consider an item set X: if ALL occurrences of X are accompanied by an occurrence of Y, then X is NOT a closed set.
(In other words: if I have an item set {A,B} that is frequent, i.e. it passes the minsup threshold I set, and then I notice that whenever {A,B} appear together there is always some third item, say C, present with them, then the item set {A,B} is not a closed frequent item set, and vice versa.)
Naïve bayes
Before starting with Naïve Bayes, we have to discuss what is called Bayes' theorem.
Bayes' Theorem
Bayes' theorem finds the probability of an event occurring given the probability of another event that has already occurred. It is stated mathematically as the following equation:
P(A|B) = P(B|A) · P(A) / P(B)
where, P(A|B) : the conditional probability of response variable belonging to a particular value,
given the input attributes. This is also known as the posterior probability.
P(A) : The prior probability of the response variable.
P(B) :The probability of training data or the evidence.
P(B|A) : This is known as the likelihood of the training data.
Noting that A here represents y or class variable and B here represents the x or features or
predictors of size n (x1,x2,…,xn)
Example: A Path Lab is performing a Test of disease say “D” with two results “Positive” &
“Negative.” They guarantee that their test result is 99% accurate: if you have the disease, they
will give test positive 99% of the time. If you don’t have the disease, they will test negative
99% of the time. If 3% of all the people have this disease and test gives “positive” result, what
is the probability that you actually have the disease?
For solving the above problem, we will have to use conditional probability.
Probability of people suffering from disease D: P(D) = 0.03 = 3%
Probability that the test gives a "positive" result given that the patient has the disease: P(Pos | D) = 0.99 = 99%
Probability of people not suffering from disease D: P(~D) = 0.97 = 97% (~D ≡ NOT D)
Probability that the test gives a "positive" result given that the patient does NOT have the disease: P(Pos | ~D) = 0.01 = 1%
To calculate the probability that the patient actually has the disease, i.e. P(D | Pos), we will use Bayes' theorem:
We have all the values of numerator but we need to calculate P(Pos):
P(Pos) = P(D, pos) + P( ~D, pos)
= P(pos|D)*P(D) + P(pos|~D)*P(~D) = 0.99 * 0.03 + 0.01 * 0.97 = 0.0297 + 0.0097 = 0.0394
Let’s calculate, P( D | Pos) = (P(Pos | D) * P(D)) / P(Pos) = (0.99 * 0.03) / 0.0394 = 0.7538
So there is approximately a 75% chance that the patient is actually suffering from the disease.
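The same calculation in a few lines of Python, using only the numbers from the example above:

```python
p_d, p_not_d = 0.03, 0.97
p_pos_given_d, p_pos_given_not_d = 0.99, 0.01
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * p_not_d   # = 0.0394
p_d_given_pos = p_pos_given_d * p_d / p_pos                 # Bayes' theorem
print(round(p_d_given_pos, 4))                              # ~0.7538
```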
Naïve Bayes
The naive Bayes algorithm does that by making an assumption of conditional independence between the training dataset features (predictors). Therefore, instead of computing P(x1, x2, …, xn | y), we can compute P(x1|y) × P(x2|y) × … × P(xn|y) = ∏i P(xi|y). This is a very strong assumption, which is what makes the Bayes algorithm "naive", and it is most unlikely to hold in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well on data where this assumption does not hold. Naïve Bayes is mostly used in natural language processing (NLP).
Equations: ypredicted = argmax over y of [ P(y) · ∏i P(xi | y) ]
We calculate the probability of each class given the attributes and take the class with the highest probability; this is known as Maximum A Posteriori (MAP) estimation.
Note: Assume we have n number of attributes (features) and response is of C classes then the
complexity of our algorithm is O(Cn).
Now our classifier is ready; let's take a test example: today = {Sunny, Hot, Normal, False}.
What should our prediction for playing golf (y) be? y could be Yes or No, so we compute both P(Yes|today) and P(No|today) from the frequency tables and take the highest P according to the MAP rule (the worked numbers are in the figure).
Therefore ypredicted = argmax of (P(Yes|today) & P(No|today)) = YES
Note: argmax returns the index of the highest value, so if Yes is 1 and No is 0 then it returns 1.
What are the types of Naïve bayes classifier?
1-Gaussian Naïve Bayes:
Above, we calculated the probabilities for input values for each class using a frequency. With
real-valued inputs, we can calculate the mean and standard deviation of input values (x) for
each class to summarize the distribution. This means that in addition to the probabilities for
each class, we must also store the mean and standard deviations for each input variable for each
class.
Naive Bayes can be extended to real-valued attributes, most commonly by assuming a Gaussian
distribution. This extension of naive Bayes is called Gaussian Naive Bayes. Other functions can
be used to estimate the distribution of the data,
but the Gaussian (or Normal distribution) is the
easiest to work with because you only need to
estimate the mean and the standard deviation from your training data.
Equation: P(xi | y) = (1 / (√(2π)·σy)) · exp(−(xi − μy)² / (2σy²))
Remember:
Mean: μ = (1/n) · sum of xi, and standard deviation: σ = √((1/n) · sum of (xi − μ)²)
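A hedged scikit-learn sketch: GaussianNB stores exactly these per-class means and variances for each feature and then applies the MAP rule; the synthetic data is made up.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [3, 3]])
y = np.array([0] * 50 + [1] * 50)
gnb = GaussianNB().fit(X, y)
print(gnb.theta_)                  # per-class mean of each feature
print(gnb.predict([[2.5, 2.5]]))   # MAP prediction for a new example
```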
What is the difference between deep learning and machine learning technically?
The major difference between deep learning and machine learning lies in their execution:
1-Dependence on the amount of data: deep learning algorithms need a large amount of data, which is why they don't perform that well when the data is small. On the other hand, classical machine learning algorithms, with their hand-crafted rules and features, win in this situation.
2-Deep learning algorithms depend heavily on high-end machines compared to machine learning algorithms, because deep learning requires GPUs, which are an integral part of how it works. Machine learning can work on low-end machines, but most of the applied features need to be identified by an expert and then hand-coded per domain and data type.
3-Deep learning algorithms try to learn high-level features from the data. This is an extremely distinctive part of deep learning and a major step ahead of classical machine learning: deep learning reduces the task of developing a new feature extractor for every problem.
4-Usually, a deep learning algorithm takes a long time to train. This is because there are so
many parameters in a deep learning algorithm that training them takes longer than usual.
Whereas machine learning comparatively takes much less time to train, ranging from a few
seconds to a few hours.
Course 1 Week1
Introduction:
Example:
Assume you have a feature called size of house and
you want to predict the price given some examples as
shown in the graph. Probably if you are familiar with
linear regression you will go with it to fit these
data.(blue line fitting the data)
As said before that deep learning is actually a step towards mimicking the brain so let us see
how could such a problem be represented in neural networks.
This will be represented as a simple neuron having one input x
(size of house) and having one output y which is the price.
Note: The straight line drawn above doesn't actually exist in neural networks; it is replaced by what is called the ReLU function (discussed in detail in the activation functions part of this chapter) to add some sort of non-linearity. (linear) (ReLU)
A larger network is constructed by having more neurons and stacking them together. The features can also be more than just the size of the house, e.g. adding other features like #bedrooms, zip code and so on…
For the given applications in the preceding image, we use different types of neural networks
that best fit this application:
-For Real estate and online advertising Standard neural network
-For Photo tagging (Photos and videos) Convolutional neural network
-For Speech recognition and Machine translation (time-series apps) Recurrent neural network
-For Autonomous driving (complex advanced application) Custom/ Hybrid NN
Custom or hybrid means that we use a mixture of CNN, RNN and other types.
Supervised learning works with two types of data, called structured data and unstructured data. Structured data is like everything we did in the previous chapter: we have some features, as in our previous example (size, number of bedrooms, …), and the output determines the price. Unstructured data is audio (for speech recognition), images (for image recognition), text (for natural language processing), etc., where the elementary data is different: in images we have pixels, in text we have individual words, and so on… Historically, it has been harder for computers to make sense of unstructured data than structured data, but now, with neural networks, the big data era and today's computational speeds, we are much better at dealing with such data.
Course 1 Week2
Binary classification:
As said before in the machine learning part, binary classification means that you have 2 classes representing positive or negative, such that if the classification is about cats we predict y to be either 1 (cat, positive example) or 0 (non-cat, negative example).
How is an image represented in the computer? An image is stored in the form of three matrices (the R, G and B channels), where each matrix holds the intensity of that color for each pixel. For the first pixel in the image (top left) to be white, we set this pixel to 255 in each of the 3 matrices, noting that each value ranges from 0→255.
If the image is of size 64×64, each of the 3 matrices will be of size 64×64.
To represent those matrices as the feature vector x (the input), we unroll each of the RGB matrices into a column vector and stack them together, ending up with a column vector of size (64×64×3)×1 = 12288×1, as shown below. (Note: notation n = nx = 12288.)
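A tiny sketch of that unrolling (the image here is random numbers standing in for a real photo):

```python
import numpy as np

image = np.random.randint(0, 256, size=(64, 64, 3))   # 64x64 RGB image
x = image.reshape(-1, 1)                              # unroll into one column vector
print(x.shape)                                        # (12288, 1)
```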
Notations:
A single example is expressed as (x, y) where x ∈ Rnx and y ∈ {0,1}
For m training examples: (x(1), y(1)), (x(2), y(2)), …, (x(m), y(m))
X = [x(1) x(2) … x(m)] is an nx×m matrix (each column is one example), and Y = [y(1) y(2) … y(m)] is a 1×m row vector.
Logistic regression: (recap)
Given x, where x ∈ Rnx, we want ŷ = P(y=1 | x)
The parameters will be w ∈ Rnx and b ∈ R.
What are w & b? They are similar to the θ we were using in the machine learning part, where b = θ0·x0 (noting that x0 always equals 1, as we said before) and w ≡ (θ1, …, θn). Hence:
(z = θTx = θ0x0 + θ1x1 + θ2x2 + … + θnxn in machine learning) ≡ (z = wTx + b in deep learning)
In neural networks we stick to the w, b conventional notations rather than θ as it tends to be
easier in explanation so we will use w, b along this chapter as our learning parameters.
Gradient descent:
Intuition behind derivatives:
We know that if f(a) = 3a then f'(a) = 3, but the intuition behind it is: if we have a = 2 then f(a) = 6, and if we move a little bit to the right, as shown in the figure, to a = 2.001, then f(a) = 6.003. So the change in f is triple the change we made in the variable a. The slope (derivative) of f(a) at a = 2 is therefore 0.003/0.001 = 3.
Note: The definition of the derivative doesn't involve moving by 0.001; it involves a much smaller movement, an infinitesimal amount, much smaller than 0.001. We just used 0.001 for easier explanation.
Computational graph:
The computation of a neural network is, as discussed before, a combination of a feed-forward pass followed by backpropagation, repeated iteratively; the computation graph shows why neural networks are organized this way.
Assume a function J of 3 variables: J(a,b,c) = 3(a + bc). How do we compute its derivatives? The forward computation in this example is done in 3 steps: 1- u = bc, 2- v = a + u, 3- J = 3v. Computing the derivatives is done by backpropagation from right to left. Assume we initialize a, b, c as 5, 3, 2 respectively (initialization of the learning parameters); therefore u = 6, v = 11 and J = 33:
As you notice, we have already done the feed-forward computation on the computation graph, so let us do the backpropagation to compute the derivatives.
As said, we will start from right to left: dJ/dv = ?
J = 3v and v = 11, so if we move a little from 11→11.001 then J moves from 33→33.003, so dJ/dv = 3.
Now we have gone 1 step backward from J to v. Next we have 2 steps, one from v to u and one from v to a, so we compute two derivatives (dJ/da and dJ/du).
For dJ/da: as a goes from 5→5.001, v goes from 11→11.001, so J goes from 33→33.003, so dJ/da = 3.
For dJ/du: as u goes from 6→6.001, v goes from 11→11.001, so J goes from 33→33.003, so dJ/du = 3.
Note: dJ/da = (dJ/dv)·(dv/da) and dJ/du = (dJ/dv)·(dv/du); we already have dJ/dv = 3 from the first step, so all we need is dv/da (or dv/du), according to the chain rule.
Remember: our target in gradient descent is computing the derivative of the final output J with respect to the learning parameters a, b & c.
Now we have one of the 3 derivatives needed, dJ/da, and the remaining ones are dJ/db & dJ/dc.
Similarly, using the chain rule, dJ/db = (dJ/dv)·(dv/du)·(du/db). (We know that dJ/dv = 3 & dv/du = 1, so du/db = ?)
As b goes from 3→3.001, u goes from 6→6.002, so du/db = 2 and dJ/db = 3×1×2 = 6.
Similarly with c: dJ/dc = (dJ/dv)·(dv/du)·(du/dc) = 3×1×3 = 9.
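A quick numeric check of this example: nudging each input of J(a,b,c) = 3(a + bc) by 0.001 reproduces the chain-rule results above (the step size and the function are exactly the ones in the example).

```python
def J(a, b, c):
    u = b * c        # u = bc
    v = a + u        # v = a + u
    return 3 * v     # J = 3v

a, b, c, eps = 5, 3, 2, 1e-3
print((J(a + eps, b, c) - J(a, b, c)) / eps)   # ~3  (dJ/da)
print((J(a, b + eps, c) - J(a, b, c)) / eps)   # ~6  (dJ/db = 3*1*c)
print((J(a, b, c + eps) - J(a, b, c)) / eps)   # ~9  (dJ/dc = 3*1*b)
```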
Note: Take care that a ≡ ŷ, as we will continue this chapter using the "a" notation for our outputs.
Remember:
Going back from right to left: dL/da = da = −y/a + (1−y)/(1−a) (using the derivative rules on the loss L(a,y) = −(y log a + (1−y) log(1−a)))
Then back again to z: dL/dz = dz = (dL/da)·(da/dz) = a − y
Then back again to the weights and biases:
dL/dw1 = dw1 = x1·dz & dL/dw2 = dw2 = x2·dz & dL/db = db = dz
Update equations:
w1 := w1 - α dw1
w2 := w2 - α dw2 The 3 equations are updated simultaneously as discussed before.
b := b - α db Note: This equations for a single training example.
For m-training examples:
Notes:
1-We use dw1, dw2 and db with no superscript (i) because they are accumulators over all the examples, while dz(i) has the superscript (i) because it is computed for each example i.
2-All the examples we have given so far have 2 features (n = 2).
3-These are the detailed steps for 1 epoch (loop) of gradient descent, so we repeat them for a specific number of epochs until reaching a (local) minimum.
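The loop-based procedure the notes refer to was shown as a figure; the sketch below is a reconstruction of one epoch of the standard version for 2 features (the function name and learning-rate handling are my own choices).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_epoch(X, y, w, b, alpha):
    m = X.shape[0]
    J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
    for i in range(m):                                   # explicit loop over the m examples
        z = w[0] * X[i, 0] + w[1] * X[i, 1] + b
        a = sigmoid(z)
        J += -(y[i] * np.log(a) + (1 - y[i]) * np.log(1 - a))
        dz = a - y[i]                                    # dz(i) for example i
        dw1 += X[i, 0] * dz                              # accumulators (note 1 above)
        dw2 += X[i, 1] * dz
        db += dz
    J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m
    w = w - alpha * np.array([dw1, dw2])                 # simultaneous update
    b = b - alpha * db
    return w, b, J
```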
There are two weakness points in this implementation. What are they?
The two weakness points lie in the two for-loops used: the loop over the m examples and the loop over the features (not shown above, since we only have 2 features, but in the more realistic case of many features we would have a for-loop over the features: for j = 1 to n: dwj += xj dz(i)). The weakness of using for-loops in a deep learning algorithm is that they take so much run time that they hinder performance, especially in the big data era that offers millions of examples. Here comes the idea of vectorization, to get rid of explicit for-loops.
Vectorization uses the numpy library in Python, or any other built-in vectorization facility in other programming languages. You may notice that the unvectorized version (using for-loops) can take around 300 times as long as the vectorized one.
Libraries like numpy take advantage of parallel SIMD (Single Instruction Multiple Data) instructions on the CPU or GPU, which makes the computation faster.
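A hedged sketch of the vectorized version of the same computation: the loops over examples and features are replaced by matrix/vector operations, which is exactly what makes it so much faster (sizes below are arbitrary).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, n = 100000, 20
X = np.random.randn(m, n)
y = np.random.randint(0, 2, size=m)
w, b, alpha = np.zeros(n), 0.0, 0.1

A = sigmoid(X @ w + b)      # forward pass for all m examples at once
dZ = A - y                  # vector of all dz(i)
dw = X.T @ dZ / m           # replaces the loop over features
db = dZ.mean()              # replaces the loop over examples
w, b = w - alpha * dw, b - alpha * db
```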
And in the graph, we use the output from each node as the weight of the corresponding edge:
Either way, graph or function notation, we get the same answer because these are just two ways
of expressing the same thing.
In this simple example it might be hard to see the advantage of using a computational graph
over function notation. After all, there isn’t anything terribly hard to understand about the
function f(x, y, z) = (x + y) * z. The advantages become more apparent when we reach the scale
of neural networks.
Even relatively “simple” deep neural networks have hundreds of thousands of nodes and edges;
it’s quite common for a neural network to have more than one million edges. Try to imagine the
function expression for such a computational graph… can you do it? How much paper would
you need to write it all down? This issue of scale is one of the reasons computational graphs are
used.
For backpropagation through computation graphs: It is enough what is said in main course.
For more examples :
[Starting from min 6:00 https://www.youtube.com/watch?v=u2OeYrlAx_A]
Course 1 Week3
Neural networks : (recap)
We have discussed this topic before in machine learning part but we want to add some more
info before continuing.
5-Each layer has w (weights) and b (biases) matrices with defined size:
weights: (number of nodes in the layer × number of nodes in the preceding layer)
biases: (number of nodes in the layer × 1)
Example: w[2]:matrix of size (number of nodes in layer 2 × number of nodes in layer 1)
7-a[2] & w[2] are the output and weights of layer 2, equivalent to the a and θ quantities used in the machine learning part's notation.
Note: Every point of the preceding point, go visualize it in the image for better understanding.
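A small sketch checking those shapes on an assumed 2-layer network with 3 input features, 4 hidden units and 1 output unit (the sizes are made up; the two computation steps per unit are shown inline).

```python
import numpy as np

n_x, n_h, n_y = 3, 4, 1
W1, b1 = np.random.randn(n_h, n_x), np.zeros((n_h, 1))   # (nodes in layer x nodes in previous layer)
W2, b2 = np.random.randn(n_y, n_h), np.zeros((n_y, 1))   # biases: (nodes in layer x 1)

x = np.random.randn(n_x, 1)
z1 = W1 @ x + b1                     # step 1 in each unit: compute z
a1 = np.tanh(z1)                     # step 2: apply an activation function
z2 = W2 @ a1 + b2
a2 = 1.0 / (1.0 + np.exp(-z2))       # sigmoid output
print(W1.shape, b1.shape, a2.shape)  # (4, 3) (4, 1) (1, 1)
```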
Remember:
-Inside each unit (neuron), computation are done in 2 steps:
compute z
apply sigmoid function (or any other activation function)
Why don't we use a linear activation function, or even remove the activation step entirely?
Because if we use a linear function in each layer, we end up with an output that is just a linear combination of the input features, which is unlikely to express the data efficiently due to the non-linearity in the data itself. You may use a linear activation in the output layer of regression problems, such as housing price prediction, where y is a real-valued number ranging from 0 to millions of dollars, but generally this is a rare usage.
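As a one-line check of this claim, stacking two layers with linear "activations" collapses into a single linear map:

$$a^{[2]} = W^{[2]}\big(W^{[1]}x + b^{[1]}\big) + b^{[2]} = \big(W^{[2]}W^{[1]}\big)x + \big(W^{[2]}b^{[1]} + b^{[2]}\big) = W'x + b'$$

which is just another linear function of x, however many layers are stacked.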
Derivatives of activation functions:
For sigmoid: g'(z) = g(z)·(1 - g(z))
For tanh: g'(z) = 1 - (g(z))²
For ReLU: g'(z) = 0 for z < 0 and 1 for z > 0 (undefined at z = 0; in practice, take 0 or 1)
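A minimal NumPy sketch of these three activations and their derivatives (choosing 0 for the ReLU derivative at z = 0 is a convention, not part of the math):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    g = sigmoid(z)
    return g * (1.0 - g)              # g'(z) = g(z)(1 - g(z))

def tanh(z):
    return np.tanh(z)

def dtanh(z):
    return 1.0 - np.tanh(z) ** 2      # g'(z) = 1 - g(z)^2

def relu(z):
    return np.maximum(0.0, z)

def drelu(z):
    return (z > 0).astype(float)      # 0 for z <= 0, 1 for z > 0
```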
Now let us discover the 7 most popular non-linear activations in deep learning.
Sigmoid (Logistic) function
Equation: g(z) = 1 / (1 + e⁻ᶻ) , Output range: 0→1
Derivative: g'(z) = g(z)·(1 - g(z))
Advantages:
It is an S-shaped curve that looks like a smoothed step function, with the benefit of being non-linear, so combinations of this function along a stack of neurons will be non-linear as well.
We can notice that between z = -2 and z = 2 the function is steep, meaning that any change in input leads to a significant change in output.
Output is bounded between 0 and 1, so it is normalized and prevents the outputs from blowing up.
Disadvantages:
Vanishing gradient problem.
Outputs are not centered around 0 and range from 0 to 1, so the values passed to the next neuron are all positive, and we do not always want the values going to the next neuron to all have the same sign.
Computationally expensive.
Takes time to converge (Slow convergence).
Vanishing gradient problem:
By taking a look at the gradient (derivative) of the sigmoid, we notice that the regions outside roughly -3 to 3 are pretty flat, which means the gradient is almost zero, which in turn means the network is not learning, or is learning drastically slowly (depending on the value of the gradient). Learning depends mainly on updating the weights and biases, and these updates are based on the derivative and the learning rate, so if the derivative is ≈ 0, no learning takes place (w := w - α·dw, so if dw ≈ 0 then w is not updated; similarly for the bias).
Hyperbolic tangent (Tanh) function
As we notice here that tanh (green) is a scaled modified
version of sigmoid (red)
Equation: g(z) = (eᶻ - e⁻ᶻ) / (eᶻ + e⁻ᶻ) , Output range: -1→1
Derivative: g'(z) = 1 - (g(z))²
Advantages:
It is similar to the sigmoid function, with two additional advantages:
More steep, so more sensitive to changes.
Zero-centered, so it basically solves the problem of the values all being of the same sign, making it easier to model inputs that have strongly negative, neutral, and strongly positive values.
Disadvantages:
Vanishing gradient problem.
Computationally expensive.
Slow convergence (but faster than sigmoid)
Note: tanh(z) = 2·sigmoid(2z) - 1
Rectified linear unit (ReLU) function
The most popular activation function in deep learning.
Equation: g(z) = max(0, z) , Output range: 0→∞
Derivative: g'(z) = 0 for z < 0 and 1 for z > 0 (undefined at z = 0; in practice, take 0 or 1)
Advantages:
It solves the vanishing gradient problem.
Computationally efficient as it involves simpler
mathematical operations.
Faster to converge.
Sparsity, which means it does not activate all the neurons at the same time. What does this mean? If you look at the ReLU function, when the input is negative it is converted to zero and the neuron does not get activated. This means that at any given time only a few neurons are activated, making the network sparse, which makes it efficient and easy to compute and concentrates the learning on the activated neurons, so the network learns better. Sigmoid and tanh, in contrast, cause almost all neurons to fire in an analog way (remember?), which means almost all activations will be processed to describe the output of the network. In other words the activation is dense, and this is costly.
Disadvantages:
Dead neurons problem: if you look at the negative side of the graph of the gradient, the gradient is zero, which means that for activations in that region the gradient is zero and the weights are not updated during backpropagation. This can create dead neurons which never get activated.
ReLU function is used in hidden layers only. Output layer doesn't use it.
Softmax function
This function will be discussed in details at the end of this chapter.
How to choose the right activation function?
Now that we have seen so many activation functions, we need some logic / heuristics to know
which activation function should be used in which situation. Good or bad – there is no rule of
thumb. However depending upon the properties of the problem we might be able to make a
better choice for easy and quicker convergence of the network.
Sigmoid functions and their combinations generally work better in the case of classifiers.
Sigmoids and tanh functions are sometimes avoided due to the vanishing gradient
problem.
ReLU function is a general activation function and is used in most cases these days.
If we encounter a case of dead neurons in our networks the leaky ReLU function is the
best choice.
Sometimes leaky ReLU doesn't help in waking up dead neurons, so parametric ReLU is used instead, which can give more stable training if its parameter is learnt well.
Always keep in mind that ReLU function should only be used in the hidden layers.
The Swish function is still new (released in 2017), but since it gives very good results it can be your second choice after ReLU.
Rule of thumb: You can begin with the ReLU function and then move on to other activation functions if ReLU doesn't provide optimum results.
Course 1 Week4
Deep neural network:
The depth of a neural network is defined by the number of hidden layers it has. As the number of hidden layers increases, the network becomes deeper, and better performance is more likely to be expected from it.
Notations:
1- L defines number of layers of the network.
2-n[ℓ] defines number of nodes in layer ℓ.
Notes:
1-nx denotes the number of input features and is equivalent to n[0].
2-Assume a 4-layer NN; then L = 4, thus n[L] ≡ n[4] ≡ number of nodes in the output layer.
Bias-Variance tradeoff:
High bias is called underfitting and
High variance is called overfitting.
We say that we have high bias (underfitting) if the error on the training set is much higher than the human-level performance error, and we say we have high variance (overfitting) if the development set error is much higher than the training error.
Assume the human-level performance error (a.k.a. Bayes error) ≈ 0%; the high-bias and high-variance cases are then judged relative to it as described above.
Remember: Underfitting means your model is too weak to describe the data, while overfitting means your model fits the training data so well that it can't generalize to unseen data. Neither of them is desirable in your model.
Regularization also appears in the update equation, not only in the cost function:
Remember:
1-dw[l] ≡ ∂J/∂w[l] = (term from backpropagation) + (λ/m)·w[l], so the update w[l] := w[l] - α·dw[l] shrinks the weights a little on every step (weight decay). (Always check notations.)
2-By increasing λ, you are decreasing the weights (≈0), so your network becomes simpler and closer to a linear network, which increases the bias and decreases the variance.
Dropout regularization:
Dropout is a famous regularization technique used with neural networks: it eliminates some neurons from each layer and keeps the others. Removing a node means, by default, removing all of its ingoing and outgoing links, so you end up with a much smaller, simpler network that is less likely to overfit. The set of nodes to be eliminated is changed with each example after computing the backpropagation.
Removing some nodes also helps the kept nodes learn more, especially those that weren't participating enough in the learning process.
Keep probability (p): the parameter that decides the percentage of neurons to be kept, so if the keep probability of layer 2 of the network is 0.8 and this layer has 10 neurons, then 8 of these neurons will be kept and only 2 will be eliminated from the layer.
There are different techniques to apply dropout. Inverted dropout is the most common one; it is similar to the original dropout explained above, keeping some weights and setting others to zero. The one difference is that, during training, inverted dropout scales the kept activations up by the inverse of the keep probability (i.e. divides them by p), so that no rescaling is needed at test time. This makes testing faster.
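A minimal sketch of inverted dropout applied to the activations of one layer (a3 and the keep probability of 0.8 are example values, not fixed choices):

```python
import numpy as np

keep_prob = 0.8                                 # p: fraction of neurons to keep
a3 = np.random.randn(5, 10)                     # example activations of layer 3, shape (n3, m)

d3 = np.random.rand(*a3.shape) < keep_prob      # random mask: True with probability p
a3 = a3 * d3                                    # zero out the dropped neurons
a3 = a3 / keep_prob                             # inverted scaling keeps the expected value of a3 unchanged
```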
Note: Layers with more neurons have bigger weight matrices and are by default more likely to overfit, so we give those layers a lower keep probability to reduce the overfitting they cause.
One of the drawbacks of dropout is that it makes the cost function J no longer well-defined, since on each iteration a different bunch of neurons has been removed, so J can't be plotted against the iterations to check anything useful. The solution is to first apply no dropout at all (keep probability = 1), check that J is monotonically decreasing (i.e. your algorithm is learning well), and then retrain with dropout turned on.
Vanishing/Exploding gradients:
If we assume that each weight matrix is initialized as w[l] = 1.5·I (except the last layer, which has different dimensions), then the output ŷ is a product of all these w's, so ŷ grows roughly like 1.5^L and explodes as more hidden layers L are added. On the other hand, assuming w[l] = 0.5·I, ŷ shrinks roughly like 0.5^L, so it vanishes with an increasing number of hidden layers L.
A solution that helps a lot with these problems is carefully choosing the initialization of the weights. We have discussed this topic before and said to scale the randomly initialized weights by √(1/n[l-1]), where n[l-1] is the number of neurons in the previous layer (the inputs to this neuron), or to use the Xavier initialization discussed before.
Note: It is found that if the activation function is ReLU, it's better to scale the weights by √(2/n[l-1]) instead of √(1/n[l-1]) (He initialization).
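A sketch of this kind of initialization (He scaling √(2/n) for ReLU, √(1/n) otherwise), assuming a simple dictionary of parameters as used throughout these notes:

```python
import numpy as np

def initialize_parameters(layer_dims, activation="relu"):
    """layer_dims = [n_x, n_1, ..., n_L]; returns {'W1':..., 'b1':..., ...}."""
    params = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        # He initialization for ReLU, Xavier-style sqrt(1/n) otherwise
        scale = np.sqrt(2.0 / fan_in) if activation == "relu" else np.sqrt(1.0 / fan_in)
        params["W" + str(l)] = np.random.randn(layer_dims[l], fan_in) * scale
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params
```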
Gradient checking:
Before getting into gradient checking, we want to talk about something.
Numerical approximation of derivative computation:
The derivative, as discussed before, is taken as the height over the width of the triangle drawn between θ and θ+ε, where ε is a small number; but it turns out that taking a bigger triangle, between θ-ε and θ+ε, gives a numerically better approximation and better results (without getting into the calculus details).
Then (J(θ+ε) - J(θ-ε)) / (2ε) is a better approximation of dJ/dθ than (J(θ+ε) - J(θ)) / ε.
This two-sided definition of the derivative is what will be used in gradient checking.
Note: usually use ε = 10⁻⁷.
We apply this derivative rule to J(θ), noting that J(θ) = J(θ1, θ2, …, θL), where θ collects all the parameters, so if the dθapprox computed from the derivative rule is close to the dθ we computed before with backpropagation, our backpropagation was done right. But the question is: how do we know that dθapprox is close to dθ?
The check is ‖dθapprox - dθ‖₂ / (‖dθapprox‖₂ + ‖dθ‖₂): if the result is in the range of 10⁻⁶ or 10⁻⁷ it is OK; if it is much bigger (say around 10⁻³), there is most likely a bug somewhere in computing the gradients.
Notes:
1-Include regularization term in your check.
2-Grad check doesn't work with dropout.
3-Don't use it across your training but just once for debugging.
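A minimal sketch of the check above, assuming J is a function of a flattened parameter vector theta and dtheta holds the backprop gradients reshaped into the same vector:

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Returns the relative difference between numerical and backprop gradients."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy();  theta_plus[i] += eps
        theta_minus = theta.copy(); theta_minus[i] -= eps
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    numerator = np.linalg.norm(dtheta_approx - dtheta)
    denominator = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return numerator / denominator     # ~1e-7 great, ~1e-5 suspicious, ~1e-3 likely a bug
```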
Dropout, Early stopping & Grad checking
Deep learning neural networks are likely to quickly overfit a training dataset with few
examples. Ensembles of neural networks with different model configurations are known to
reduce overfitting, but require the additional computational expense of training and maintaining
multiple models. Dropout can help in this problem by offering a very computationally cheap
and remarkably effective regularization method.
Dropout
Dropout is a regularization technique used in neural networks where the word "dropout" refers
to dropping out some neurons during the training of the neural network. They are “dropped out” randomly. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass and any weight updates are not applied to the neuron
on the backward pass. You can imagine that if neurons are randomly dropped out of the
network during training, that other neurons will have to step in and handle the representation
required to make predictions for the missing neurons. This is believed to result in multiple
independent internal representations being learned by the network. The effect is that the
network becomes less sensitive to the specific weights of neurons. This in turn results in a
network that is capable of better generalization and is less likely to overfit the training data.
This conceptualization suggests that perhaps dropout breaks-up situations where network layers
co-adapt to correct mistakes from prior layers, in turn making the model more robust.
Notes:
1-We found that as a side-effect of doing dropout, the activations of the hidden units become
sparse, even when no sparsity inducing regularizers are present.
2-Dropout can be applied to input layer and hidden layers but not output layer of course.
Inverted dropout
Dropout is not used after training when making a prediction with the fit network. The weights
of the network will be larger than normal because of dropout. Therefore, before finalizing the
network, the weights are first scaled by the chosen dropout rate. The network can then be used
as per normal to make predictions: “If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time.”
The rescaling of the weights can be performed at training time instead, after each weight update at the end of the mini-batch. This is the “inverted dropout” described above and does not require any modification of the weights at test time.
Early stopping
A major challenge in training neural networks is how long to train them. Too little training will
mean that the model will underfit the train and the test sets. Too much training will mean that
the model will overfit the training dataset and have poor performance on the test set.
A compromise is to train on the training dataset but to stop training at the point when
performance on a validation dataset starts to degrade. This simple, effective, and widely used
approach to training neural networks is called early stopping.
During training, the model is evaluated on a holdout validation dataset after each epoch. If the
performance of the model on the validation dataset starts to degrade (e.g. loss begins to increase
or accuracy begins to decrease), then the training process is stopped. The model at the time that
training is stopped is then used and is known to have good generalization performance.
This procedure is called “early stopping” and is perhaps one of the oldest and most widely used
forms of neural network regularization.
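A sketch of a simple patience-based early-stopping loop; train_one_epoch, evaluate, get_weights and set_weights are hypothetical stand-ins for your own training utilities (Keras-style names used only for illustration):

```python
best_val_loss = float("inf")
patience, wait = 5, 0           # stop after 5 epochs with no improvement
best_weights = None

for epoch in range(max_epochs):
    train_one_epoch(model, train_data)            # hypothetical training step
    val_loss = evaluate(model, val_data)          # hypothetical validation evaluation
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_weights = model.get_weights()        # remember the best model so far
        wait = 0
    else:
        wait += 1
        if wait >= patience:
            break                                 # validation loss stopped improving

model.set_weights(best_weights)                   # restore the best model before testing
```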
The answer is NO. Backprop as an algorithm has a lot of details and can be a little bit tricky to implement, and one unfortunate property is that there are many ways to have subtle bugs in backprop, so that if you run it with gradient descent or some other optimization algorithm it could actually look like it's working: your cost function J(θ) may end up decreasing on every iteration of gradient descent even though there is some bug in your implementation of backprop. So J(θ) looks like it is decreasing, but you might just wind up with a neural network that has a higher level of error than you would with a bug-free implementation, and you might never know that this subtle bug was giving you worse performance. So, what can we do about this? There's an idea called gradient checking that eliminates almost all of these problems: a method for numerically checking the derivatives computed by your code to make sure that your implementation is correct. Carrying out the derivative-checking procedure significantly increases your confidence in the correctness of your code.
Course 2 Week2
Mini-batch gradient descent: (recap)
When you try to train on a large dataset, especially in the big-data era, training is just slow and memory consuming. Vectorization, as said before, allows you to compute efficiently on m examples without the need for for loops, but a problem arises when the data is so large that it can't all fit in memory at once. Here comes the role of mini-batch gradient descent to solve what batch gradient descent can't: mini-batch gradient descent splits your data into portions (mini-batches) and then applies gradient descent on the mini-batches consecutively.
Notations: Assume we have 100,000 examples in the training set (x(1)→x(100,000)); we split this data into 100 mini-batches such that each mini-batch has 1,000 training examples, so the notation of the 1st mini-batch, for example, is x{1} ≡ [x(1), x(2), …, x(1000)];
generally x{t} ≡ [x((t-1)·1000+1), …, x(t·1000)] (and similarly y{t} for the labels).
Notes:
1-Mini-batch size = total number of examples / number of mini-batches.
2-Mini-batch gradient descent needs a for loop to loop over the mini-batches, so vectorization is applied within each mini-batch, not on the whole data at once.
Algorithm
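A minimal NumPy sketch of splitting the training set into shuffled mini-batches and looping over them for one epoch (forward_backward and update are hypothetical placeholders for your own vectorized propagation and update steps):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=1000, seed=0):
    """X: (n_x, m), Y: (1, m). Returns a list of (X_batch, Y_batch) tuples."""
    np.random.seed(seed)
    m = X.shape[1]
    perm = np.random.permutation(m)                  # shuffle the examples
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, t:t + batch_size], Y_shuf[:, t:t + batch_size])
            for t in range(0, m, batch_size)]

# One epoch of mini-batch gradient descent:
# for X_t, Y_t in random_mini_batches(X_train, Y_train):
#     grads = forward_backward(params, X_t, Y_t)     # vectorized over the mini-batch
#     params = update(params, grads, learning_rate)
```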
Exponentially weighted averages:
Assume we have noisy daily temperature data over a year (the example used in the course); we now want to compute a local average, or moving average, of this data, which can be done as follows:
v0 = 0 , v1 = 0.9·v0 + 0.1·θ1 , v2 = 0.9·v1 + 0.1·θ2 , …
So what is happening here is that we compute each day's smoothed temperature based on the past several days' temperatures, to smoothen the data and reduce noise.
Equation: vt= β vt-1 + (1-β) θt
vt can now be thought of as approximately averaging over the last ≈ 1/(1-β) days' temperatures (generally over the last 1/(1-β) data points). So we can say that v100, for example, is actually a mix between the real data at day 100 and the data of all preceding 99 days, where closer days have a higher effect (higher coefficient): day 99's temperature has a higher effect than day 98's, which has a higher effect than day 97's, and so on (an exponential decrease in effect).
Note:
1-It is called exponential because, if 1-β = ε, then (1-ε)^(1/ε) ≈ 1/e, so the weights decay exponentially and effectively cover roughly the last 1/(1-β) data points.
2-We use vθ sometimes to denote that this v is averaging over θ.
Algorithm
These weighted averages are not accurate at the very beginning, so we will use a technique called bias correction to give a more accurate computation of these averages.
Bias correction
The problem is that since we initialize v0 = 0, the first values of v will be very low, as shown in the figure (the difference between the green (expected) and purple (actual) curves): v1 = 0.98·v0 + 0.02·θ1, but since v0 = 0, v1 = 0.02·θ1, so if θ1 is 40 degrees, v1 will only equal 0.8 degrees!!! Here comes the role of bias correction: instead of vt we use vt / (1 - βᵗ), which matters for small t and fades away (the denominator → 1) as t grows.
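A small sketch of the exponentially weighted average with bias correction; the five temperature values are made-up example data:

```python
import numpy as np

temps = np.array([40.0, 49.0, 45.0, 44.0, 50.0])   # example daily temperatures
beta = 0.98
v = 0.0
for t, theta_t in enumerate(temps, start=1):
    v = beta * v + (1 - beta) * theta_t
    v_corrected = v / (1 - beta ** t)              # bias correction: undoes the v0 = 0 start
    print(t, round(v, 2), round(v_corrected, 2))   # day 1: v = 0.8 but v_corrected = 40.0
```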
[http://www.ashukumar27.io/exponentially-weighted-average/]
The previous link provides the same topic with the same example, so in case you don't
understand the example by my style of writing, you can try this link for a different way of
explanation of the same example.
Optimization algorithms:
Gradient descent is not the best optimization algorithm for every problem, so there are several optimization algorithms that give better results than plain gradient descent.
Assume we have the contour plot shown, with the optimum at the red point; when running gradient descent we get the oscillations shown, so we can't use a higher learning rate for faster learning without risking divergence. Another point of view: we want to prevent overshooting on the vertical axis (↨) and at the same time take bigger steps on the horizontal axis (↔) for faster learning. To apply this idea we are going to use gradient descent with momentum, which relies mainly on the idea of exponentially weighted averages.
Algorithm
vdw=0 , vdb=0
On iterations:
compute dw, db
vdw = β vdw + (1-β) dw
vdb = β vdb + (1-β) db
w := w - α vdw
b := b - α vdb
This algorithm helps damp out the oscillations and smooth the steps towards the minimum (shown with the red steps), which leads to faster movement in the horizontal direction and a more direct path towards the minimum.
The most common value for the hyper-parameter β is 0.9, which averages over roughly the last 10 iterations. In practice, bias correction is usually not implemented for momentum.
Some references use vdw = β·vdw + dw instead of β·vdw + (1-β)·dw, omitting the (1-β); both work similarly and the choice only affects the best α, but the version with (1-β) is preferred as more intuitive.
Root mean squared propagation (RMSprop)
This algorithm is also used to speed up the optimization process. Similarly, we want to damp the oscillation in the vertical direction and speed up movement in the horizontal direction.
Algorithm
On iteration:
compute dw, db
Sdw = β2 Sdw + (1-β2) dw2 (squaring here is element-wise)
Sdb = β2 Sdb + (1-β2) db2
w := w - α · dw / (√(Sdw) + ϵ)   (we add ϵ to ensure that no division by zero occurs)
b := b - α · db / (√(Sdb) + ϵ)
Adam optimization algorithm (combining momentum and RMSprop)
Algorithm
vdw=0 , vdb=0 ,Sdw=0, Sdb=0
On iteration:
Compute dw , db
vdw = β vdw + (1-β) dw (sometimes we use notation β1 instead of β)
vdb = β vdb + (1-β) db
Sdw = β2 Sdw + (1-β2) dw2
Sdb = β2 Sdb + (1-β2) db2
vdw_corrected = vdw / (1 - β₁ᵗ)
vdb_corrected = vdb / (1 - β₁ᵗ)
Sdw_corrected = Sdw / (1 - β₂ᵗ)
Sdb_corrected = Sdb / (1 - β₂ᵗ)
w := w - α · vdw_corrected / (√(Sdw_corrected) + ϵ)
b := b - α · vdb_corrected / (√(Sdb_corrected) + ϵ)
Note: β (β₁) usually equals 0.9, β₂ usually equals 0.999 and ϵ equals 10⁻⁸.
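A compact sketch of one Adam step for a single parameter array, following the update above (the default hyper-parameters shown are the usual ones):

```python
import numpy as np

def adam_step(w, dw, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter w with gradient dw; v, s are running moments, t >= 1."""
    v = beta1 * v + (1 - beta1) * dw           # momentum-like first moment
    s = beta2 * s + (1 - beta2) * dw ** 2      # RMSprop-like second moment
    v_hat = v / (1 - beta1 ** t)               # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s
```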
Learning rate decay: (recap)
We have talked about this idea of having a
large learning rate (α) at the beginning of
learning process (at the first epochs) then start
to decay it with each new epoch so that we
start with big steps towards local minima and
then by time we decrease those steps as we go
closer to convergence.
The new info to add here is what relation to follow when reducing α:
α = α₀ / (1 + decay_rate × epoch_num), where α₀ is an initial learning rate to start with and decay_rate is a hyper-parameter to tune.
Here the momentum is the same as momentum in classical physics: as we throw a ball down a hill it gathers momentum and its velocity keeps on increasing. The same thing happens with our parameter updates:
It leads to faster and more stable convergence.
Reduced oscillations.
Remember: β makes each parameter update depend on an exponentially weighted average of roughly the last 1/(1-β) previous derivatives, so the oscillating (irrelevant) components cancel out; this reduces unnecessary parameter updates.
Nesterov accelerated gradient descent (NAG)
A slightly different version of the momentum update that has recently been gaining popularity. A researcher named Yurii Nesterov saw a problem with momentum: a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We'd like a smarter ball, a ball that has a notion of where it is going, so that it knows to slow down before the hill slopes up again. What actually happens is that as we reach the minimum, i.e. the lowest point on the curve, the momentum is pretty high and it doesn't know to slow down at that point, which could cause it to miss the minimum entirely and continue moving up. In the method he suggested, we first make a big jump based on our previous momentum, then calculate the gradient and then make a correction, which results in a parameter update. This anticipatory update prevents us from going too fast and missing the minimum, and makes the method more responsive to changes.
∇J(w*) is called the projected gradient, where w* is the projected (look-ahead) weight. This means that for this time step t we have to carry out another forward propagation before we can finally execute the backpropagation. Here's how it is computed:
Update the current weight wt to a projected weight w* using the previous velocity (a look-ahead step in the direction the momentum is already moving).
AdaGrad
Adaptive gradient works on the learning-rate component by dividing the learning rate by the square root of S, the cumulative sum of current and past squared gradients (i.e. up to time t). Note that the gradient component remains unchanged, as in SGD. It simply allows the learning rate α to adapt based on the parameters: it makes big updates for infrequent parameters and small updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data. In other words, Adagrad modifies the general learning rate α at each time step t for every parameter θ(i), based on the past gradients that have been computed for θ(i).
Note: Because it uses a different learning rate for every parameter θ at every time step t, based on the past gradients computed for that parameter, Adagrad is usually presented as a per-parameter update, which is then vectorized.
How does squaring the gradients affect the learning rate, and what is its disadvantage?
For parameters with high gradient values, the squared term will be large, and dividing by a large term makes the update accelerate slowly in that direction. Similarly, parameters with low gradients produce smaller squared terms, so the update accelerates faster in that direction. AdaGrad's main weakness is that its learning rate is always decreasing and decaying. This happens due to the accumulation of the squared gradients in the denominator: since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become so small that the model just stops learning and stops acquiring new knowledge. As the learning rate gets smaller and smaller, the model's ability to learn quickly decreases, which gives very slow convergence and makes training take very long.
This problem of the decaying learning rate is rectified in another algorithm called AdaDelta.
RMSProp
Root mean square prop, or RMSprop, is another adaptive learning rate method that improves on AdaGrad: instead of taking the cumulative sum of squared gradients, we take their exponential moving average. By introducing an exponentially weighted moving average we weigh the recent past more heavily than the distant past.
Note: AdaGrad implies a decreasing learning rate even if the gradients remain constant due to
accumulation of gradients from the beginning of training (same problem mentioned at the end
of AdaGrad part).
AdaDelta
Like RMSprop, AdaDelta is another improvement on AdaGrad, focusing on the learning-rate component, and it removes AdaGrad's decaying-learning-rate problem. AdaDelta is probably short for 'adaptive delta', where delta refers to the difference between the current weight and the newly updated weight. The difference between AdaDelta and RMSprop is that AdaDelta removes the use of the learning-rate parameter completely by replacing it with D, the exponential moving average of squared deltas.
Adam
Adaptive moment estimation, or Adam, is a combination of momentum and RMSprop. It acts
upon
(i) the gradient component by using V, the exponential moving average of gradients (like in
momentum)
(ii) the learning rate component by dividing the learning rate α by square root of S, the
exponential moving average of squared gradients (like in RMSprop).
AdaMax
AdaMax is an adaptation of the Adam optimiser by the same authors, using the infinity norm (hence 'max'). V is the exponential moving average of gradients, and S is the exponential moving average of the past p-norm of gradients, which as p→∞ approaches the max function.
Nadam
Nadam is an acronym for Nesterov and Adam optimiser. The Nesterov component, however, is
a more efficient modification than its original implementation (momentum). Why Nesterov is
better than momentum was discussed before.
AMSGrad
Another variant of Adam is the AMSGrad. This variant revisits the adaptive learning rate
component in Adam and changes it to ensure that the current S is always larger than the
previous time step.
Equation: Ŝt = max(Ŝt-1, St), i.e. S is chosen as the maximum of its previous value and the current one.
Batch normalization:
Batch normalization applies normalization to some or all layers' pre-activations, immediately before applying the activation function, which helps a lot in making the parameters (w, b) of the next layers learn faster. It makes Z controlled by a specific μ, σ instead of being so random, noting that μ & σ could be 0 & 1 or any other values (what controls their values is γ & β).
Algorithm
For the pre-activations z(1), …, z(m) of a layer over one mini-batch:
μ = (1/m) Σᵢ z(i) , σ² = (1/m) Σᵢ (z(i) - μ)²
z_norm(i) = (z(i) - μ) / √(σ² + ϵ) , z̃(i) = γ·z_norm(i) + β
As shown in the figure, applying this algorithm and using z̃(i) instead of z(i) makes the learning process faster.
Notes:
1- Z(i) here is equivalent to Z[l](i); it is written as Z(i) just for simplicity.
2-ϵ is added to the variance to prevent dividing by zero.
3-γ & β here are learnable parameters that are learned along with the others by the optimization algorithm; they are used to let μ & σ take values other than 0 and 1, giving some freedom to the outputs.
4-If γ = √(σ² + ϵ) and β = μ, then z̃(i) ≡ z(i) (it acts as if there is no batch norm applied).
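A minimal NumPy sketch of the batch-norm forward step for one layer and one mini-batch, following the algorithm above:

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Z: (n_l, m) pre-activations for a mini-batch of m examples; gamma, beta: (n_l, 1)."""
    mu = np.mean(Z, axis=1, keepdims=True)        # per-unit mean over the mini-batch
    var = np.var(Z, axis=1, keepdims=True)        # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)        # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta               # learnable rescale and shift
    return Z_tilde, mu, var                       # mu, var also feed the test-time running averages
```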
Why batch norm works?
Reason1: Similar to feature scaling, this helps in making all data centered and closely deviated, rather than having some data ranging from 0 to 1 and other data from 0 to 1000.
Reason2: Batch norm helps in reducing the covariance shift caused by the previous layers.
Assume we have 4 hidden layers in the network. If we cover the first 2 layers and start studying the 3rd layer, we find that the inputs a1[2], …, a4[2] to layer 3 are what gradient descent uses to find w[3], b[3], w[4], b[4], …; but if we uncover the first 2 layers and look again at the 3rd layer, we can see that a1[2], …, a4[2] are themselves affected by the previous weights and suffer from covariance shift. Batch norm controls this shift using its learnable parameters (γ & β). In other words, this reduces the coupling between layers, so that each layer of the network can learn by itself, more independently of the other layers.
Batch norm in fact has a slight regularization effect due to the noise added by computing mean
and variance on mini-batches like dropout. But mainly batch norm is not used for
regularization.
Note: As mini-batch size increases, the regularization effect decreases.
TAKE CARE: Batch normalization handles data one mini-batch at a time, which means the mean and variance are computed on mini-batches, not on the whole data at once. This has an effect on the test phase: at test time we don't process a mini-batch but make a prediction on a single example at a time, and since μ and σ² are computed over a mini-batch (m here is the number of examples in the batch, not the whole training set), applying those equations to a single example at test time would be meaningless. So what do we do?
Since we have mini-batches and each mini-batch has its own μ & σ², for layer ℓ, for example, we have μ{1}[ℓ], μ{2}[ℓ], μ{3}[ℓ], …; to compute the μ[ℓ] used in the test phase we use an exponentially weighted average across the mini-batches, v := β·v + (1-β)·μ{t}[ℓ], continuing until we reach the last mini-batch, and then we use this running average as μ[ℓ] for the test phase. Similarly, we do the same for σ².
This is not the only way to compute μ and σ² for test time; you can use any reasonable method to obtain a single μ and σ² for each layer from those of the mini-batches.
Softmax regression
So far we have talked about logistic regression for binary classification (2 classes, positive or negative); softmax regression is the generalization of logistic regression to c classes. If, for example, you have a network that classifies cats, dogs, baby chicks or others, you now have 4 classes (class 0, class 1, class 2, class 3). In this case the output layer will have 4 neurons (c neurons in general), so we can say n[L] = 4, or c generally. Each neuron of the output layer gives the probability of one of the c classes being the output given the input features x. So for our example, the 4 neuron outputs are (P(cats | X), P(dogs | X), P(baby chicks | X), P(others | X)); ŷ here is an output vector of dimension (4 × 1). Note that the sum of the 4 probabilities of the output layer should equal 1.
Softmax regression is done using a softmax layer placed in the output position; it differs from the usual layers in that it uses the softmax activation function, a[L] = e^(z[L]) / Σⱼ e^(zⱼ[L]) (element-wise exponentiation, normalized by the sum over the c classes), so the flow is as follows:
It is called a softmax classifier because a hardmax would look at the values of z[L] and put 1 on the highest value and zeros for the other c-1 classes, while softmax produces a probability for each class, as shown:
Loss function of softmax regression
ℓ(ŷ, y) = - Σⱼ₌₁..c yⱼ · log(ŷⱼ)
Since y (y_actual) is a vector with 0's for all classes except the right class, which has 1, the loss is actually equal to -log(ŷ_right class), so the way to reduce the cost is to make the prediction for the right class as high as possible (1 is the best), giving the least cost possible.
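A small sketch of the softmax output and its cross-entropy loss for the 4-class example (the z values are made up for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / np.sum(e)

def cross_entropy(y_hat, y):
    return float(-np.sum(y * np.log(y_hat)))

z = np.array([5.0, 2.0, -1.0, 3.0])          # z[L] for the 4 classes
y = np.array([1.0, 0.0, 0.0, 0.0])           # one-hot label: the true class is class 0
y_hat = softmax(z)
print(y_hat.sum(), cross_entropy(y_hat, y))  # probabilities sum to 1; loss = -log(y_hat[0])
```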
Since batch normalization is meant to prevent covariance shift by fixing the distribution of each layer's outputs, why would we use γ and β, which deviate the mean and variance away from 0 and 1? What is the point of batch normalization then?
This is a very hard question; there is some mathematical mystery behind it, involving gradient descent and high-order, complex interactions between multiple layers. For a little bit of intuition behind this idea, visit the following link, in the high-order effects part.
[http://mlexplained.com/2018/01/10/an-intuitive-explanation-of-why-batch-normalization-
really-works-normalization-in-deep-learning-part-1/]
Benefits of batch normalization other than reducing covariance shift and faster training
We can use higher learning rates, because batch normalization makes sure that no activation goes really high or really low. Because of that, things that previously couldn't train will start to train. (This leads to faster training.)
It reduces overfitting because it has a slight regularization effects. Similar to dropout, it
adds some noise to each hidden layer’s activations. Therefore, if we use batch
normalization, we will use less dropout, which is a good thing because we are not going
to lose a lot of information. However, we should not depend only on batch normalization
for regularization; we should better use it together with dropout.
It makes weight initialization easier, as batch norm reduces sensitivity to the initial weights.
It makes more activation functions usable, as some activation functions, such as sigmoid and ReLU, are sensitive to the range of values fed to them, and batch norm regulates those values.
It allows deeper networks thanks to the previously mentioned points, and deeper networks generally produce better results.
Note: Each training iteration will be slower due to the extra steps added, namely the normalization (or standardization) in forward propagation and the new learnable parameters (γ & β) in backpropagation. However, overall training is faster with batch normalization because convergence is faster.
Another main topic
Auto-Encoders
Autoencoder
Autoencoders are a specific type of unsupervised feed-forward neural network where the input is the same as the output. It has three layers: an input layer, a hidden (encoding) layer, and an output (decoding) layer. They work by compressing the input into a latent-space representation and then reconstructing the output from this representation. The encoder and decoder have the following functions:
Encoder: This is the part of the network that compresses the input into a latent-space
representation. It can be represented by an encoding function h=f(x).
Decoder: This part aims to reconstruct the input from the latent-space representation. It can be represented by a decoding function r = g(h).
The autoencoder as a whole can thus be described by the function g(f(x)) = r, where you want r to be as close as possible to the original input x. In other words, the autoencoder's objective is to minimize the reconstruction error between the input and the output. This helps autoencoders learn the important features present in the data: when a representation allows a good reconstruction of its input, it has retained much of the information present in the input.
Note: Although we said in the definition that an autoencoder has 3 layers, some types of autoencoders use more than one hidden layer.
Types of autoencoders
Vanilla autoencoder: The simplest form of autoencoder, which has only 1 hidden layer; the input and output are of the same size and the hidden layer acts as the compressor (coded part) with a smaller size.
Multilayer autoencoder: It has more than 1 hidden layer (usually an odd number), so if for example we use 3:
Input (sizeI) → hidden1 (size1) → hidden2 (size2) → hidden3 (size3) → output (sizeO)
Note: sizeI = sizeO ; size1 = size3 < sizeI ; size2 (coded part) < size1 and sizeI
Convolutional autoencoder: This is used with CNNs and follows the same principle, but uses images (3D volumes) instead of flattened 1D vectors. (Discussed again in the CNN part.)
Regularized autoencoder: uses a loss function that encourages the model to have other properties besides the ability to copy its input to its output. In practice, we usually find two types of regularized autoencoder: 1) sparse autoencoder 2) denoising autoencoder
Sparse autoencoder
Sparse autoencoders are typically used to learn features for another task such as classification.
An autoencoder that has been regularized to be sparse must respond to unique statistical
features of the dataset it has been trained on, rather than simply acting as an identity function. In
this way, training to perform the copying task with a sparsity penalty can yield a model that has
learned useful features as a byproduct.
Another way we can constrain the reconstruction of the autoencoder is to impose a constraint on its loss. We could, for example, add a regularization term to the loss function. Doing this will make our autoencoder learn a sparse representation of the data.
In the hidden layer we add an activity regularizer (L1 or L2 or both) that applies a penalty to the loss function during the optimization phase. As a result, if we are using 1 hidden layer, its representation is now sparser compared to the vanilla autoencoder.
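A hedged tf.keras sketch of a one-hidden-layer (vanilla) autoencoder; adding the L1 activity regularizer on the code layer turns it into the sparse variant. The input/code sizes and x_train are assumed placeholders, not values from these notes:

```python
import tensorflow as tf

input_dim, code_dim = 784, 32                     # assumed sizes (e.g. flattened 28x28 images)
inputs = tf.keras.Input(shape=(input_dim,))
# Sparse variant: the L1 activity regularizer penalizes large code activations
code = tf.keras.layers.Dense(
    code_dim, activation="relu",
    activity_regularizer=tf.keras.regularizers.l1(1e-5))(inputs)        # encoder
outputs = tf.keras.layers.Dense(input_dim, activation="sigmoid")(code)  # decoder

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)   # note: target == input
```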
Denoising autoencoder
Rather than adding a penalty to the loss function, we can obtain an autoencoder that learns something useful by changing the reconstruction-error term of the loss function. This can be done by adding some noise to the input image and making the autoencoder learn to remove it. By this means, the encoder will extract the most important features and learn a more robust representation of the data. The idea behind denoising autoencoders is simple: in order to force the hidden layer to discover more robust features and prevent it from simply learning the identity, we train the autoencoder to reconstruct the input from a corrupted version of it.
The amount of noise to apply to the input takes the form of a percentage. Typically, 30 percent,
or 0.3, is fine, but if you have very little data, you may want to consider adding more.
https://www.youtube.com/watch?v=dFX8k1kXhOw&list=PLkDaE6sCZn6E7jZ9sN_xHwSHOdjUxUW_b
QUESTIONS FROM DRAFT VERSION OF MACHINE
LEARNING YEARNING BOOK BY ANDREW NG
(For full version: https://bit.ly/2LgaTee)
4 Scale drives machine learning progress
Many of the ideas of deep learning (neural networks) have been around for decades. Why are
these ideas taking off now?
Two of the biggest drivers of recent progress have been:
• Data availability. People are now spending more time on digital devices (laptops, mobile
devices). Their digital activities generate huge amounts of data that we can feed to our
learning algorithms.
• Computational scale. We started just a few years ago to be able to train neural
networks that are big enough to take advantage of the huge datasets we now have.
Thus, you obtain the best performance when you (i) Train a very large neural network, so
that you are on the green curve above; (ii) Have a huge amount of data.
Many other details such as neural network architecture are also important, and there has
been much innovation here. But one of the more reliable ways to improve an algorithm’s
performance today is still to (i) train a bigger network and (ii) get more data.
5 Your development and test sets
Let’s return to our earlier cat pictures example: You run a mobile app, and users are
uploading pictures of many different things to your app. You want to automatically find the
cat pictures.
Your team gets a large training set by downloading pictures of cats (positive examples) and
non-cats (negative examples) off of different websites. They split the dataset 70%/30% into
training and test sets. Using this data, they build a cat detector that works well on the
training and test sets.
But when you deploy this classifier into the mobile app, you find that the performance is
really poor!
What happened?
You figure out that the pictures users are uploading have a different look than the website
images that make up your training set: Users are uploading pictures taken with mobile
phones, which tend to be lower resolution, blurrier, and poorly lit. Since your training/test
sets were made of website images, your algorithm did not generalize well to the actual
distribution you care about: mobile phone pictures.
Before the modern era of big data, it was a common rule in machine learning to use a
random 70%/30% split to form your training and test sets. This practice can work, but it’s a
bad idea in more and more applications where the training distribution (website images in
our example above) is different from the distribution you ultimately care about (mobile
phone images).
We usually define:
• Training set — Which you run your learning algorithm on.
• Dev (development) set — Which you use to tune parameters, select features, and
make other decisions regarding the learning algorithm. Sometimes also called the
hold-out cross validation set .
• Test set — which you use to evaluate the performance of the algorithm, but not to make
any decisions regarding what learning algorithm or parameters to use.
Once you define a dev set (development set) and test set, your team will try a lot of ideas,
such as different learning algorithm parameters, to see what works best. The dev and test
sets allow your team to quickly see how well your algorithm is doing.
In other words, the purpose of the dev and test sets are to direct your team toward
the most important changes to make to the machine learning system .
So, you should do the following:
Choose dev and test sets to reflect data you expect to get in the future
and want to do well on.
In other words, your test set should not simply be 30% of the available data, especially if you
expect your future data (mobile phone images) to be different in nature from your training
set (website images).
If you have not yet launched your mobile app, you might not have any users yet, and thus
might not be able to get data that accurately reflects what you have to do well on in the
future. But you might still try to approximate this. For example, ask your friends to take
mobile phone pictures of cats and send them to you. Once your app is launched, you can
update your dev/test sets using actual user data.
If you really don’t have any way of getting data that approximates what you expect to get in the future,
perhaps you can start by using website images. But you should be aware of the
risk of this leading to a system that doesn’t generalize well.
It requires judgment to decide how much to invest in developing great dev and test sets. But
don’t assume your training distribution is the same as your test distribution. Try to pick test examples
that reflect what you ultimately want to perform well on, rather than whatever data
you happen to have for training.
6 Your dev and test sets should come from the
same distribution
You have your cat app image data segmented into four regions, based on your largest
markets: (i) US, (ii) China, (iii) India, and (iv) Other. To come up with a dev set and a test
set, say we put US and India in the dev set; China and Other in the test set. In other words,
we can randomly assign two of these segments to the dev set, and the other two to the test
set, right?
Once you define the dev and test sets, your team will be focused on improving dev set
performance. Thus, the dev set should reflect the task you want to improve on the most: To
do well on all four geographies, and not only two.
There is a second problem with having different dev and test set distributions: There is a
chance that your team will build something that works well on the dev set, only to find that it
does poorly on the test set. I’ve seen this result in much frustration and wasted effort. Avoid
letting this happen to you.
As an example, suppose your team develops a system that works well on the dev set but not
the test set. If your dev and test sets had come from the same distribution, then you would
have a very clear diagnosis of what went wrong: You have overfit the dev set. The obvious
cure is to get more dev set data.
But if the dev and test sets come from different distributions, then your options are less
clear. Several things could have gone wrong:
1. You had overfit to the dev set.
2. The test set is harder than the dev set. So your algorithm might be doing as well as could
be expected, and no further significant improvement is possible.
3. The test set is not necessarily harder, but just different, from the dev set. So what works
well on the dev set just does not work well on the test set. In this case, a lot of your work
to improve dev set performance might be wasted effort.
Working on machine learning applications is hard enough. Having mismatched dev and test
sets introduces additional uncertainty about whether improving on the dev set distribution
also improves test set performance. Having mismatched dev and test sets makes it harder to
figure out what is and isn’t working, and thus makes it harder to prioritize what to work on.
If you are working on a 3rd party benchmark problem, their creator might have specified dev
and test sets that come from different distributions. Luck, rather than skill, will have a
greater impact on your performance on such benchmarks compared to if the dev and test
sets come from the same distribution. It is an important research problem to develop
learning algorithms that are trained on one distribution and generalize well to another. But if
your goal is to make progress on a specific machine learning application rather than make
research progress, I recommend trying to choose dev and test sets that are drawn from the
same distribution. This will make your team more efficient.
7 How large do the dev/test sets need to be?
The dev set should be large enough to detect differences between algorithms that you are trying out. For example, if classifier A has an accuracy of 90.0% and classifier B has an accuracy of 90.1%, then a dev set of 100 examples would not be able to detect this 0.1% difference. Compared to other machine learning problems I’ve seen, a 100 example dev set is small. Dev sets with sizes from 1,000 to 10,000 examples are common. With 10,000 examples, you will have a good chance of detecting an improvement of 0.1%.
For mature and important applications—for example, advertising, web search, and product recommendations—I have also seen teams that are highly motivated to eke out even a 0.01% improvement, since it has a direct impact on the company’s profits. In this case, the dev set could be much larger than 10,000, in order to detect even smaller improvements.
How about the size of the test set? It should be large enough to give high confidence in the overall performance of your system. One popular heuristic had been to use 30% of your data for your test set. This works well when you have a modest number of examples—say 100 to 10,000 examples. But in the era of big data where we now have machine learning problems with sometimes more than a billion examples, the fraction of data allocated to dev/test sets has been shrinking, even as the absolute number of examples in the dev/test sets has been growing. There is no need to have excessively large dev/test sets beyond what is needed to evaluate the performance of your algorithms.
It seems unnatural to derive a single metric by putting accuracy and running time into a
single formula, such as:
Accuracy - 0.5*RunningTime
Here’s what you can do instead: First, define what is an “acceptable” running time. Let's say
anything that runs in 100ms is acceptable. Then, maximize accuracy, subject to your
classifier meeting the running time criteria. Here, running time is a “satisficing
metric”—your classifier just has to be “good enough” on this metric, in the sense that it
should take at most 100ms. Accuracy is the “optimizing metric.”
If you are trading off N different criteria, such as binary file size of the model (which is
important for mobile apps, since users don’t want to download large apps), running time,
and accuracy, you might consider setting N-1 of the criteria as “satisficing” metrics. I.e., you
simply require that they meet a certain value. Then define the final one as the “optimizing”
metric. For example, set a threshold for what is acceptable for binary file size and running
time, and try to optimize accuracy given those constraints.
As a final example, suppose you are building a hardware device that uses a microphone to
listen for the user saying a particular “wakeword,” that then causes the system to wake up.
Examples include Amazon Echo listening for “Alexa”; Apple Siri listening for “Hey Siri”;
Android listening for “Okay Google”; and Baidu apps listening for “Hello Baidu.” You care
about both the false positive rate—the frequency with which the system wakes up even when
no one said the wakeword—as well as the false negative rate—how often it fails to wake up
when someone says the wakeword. One reasonable goal for the performance of this system is to
minimize the false negative rate (optimizing metric), subject to there being no more than
one false positive every 24 hours of operation (satisficing metric).
Once your team is aligned on the evaluation metric to optimize, they will be able to make
faster progress.
11 When to change dev/test sets and metrics
When starting out on a new project, I try to quickly choose dev/test sets, since this gives the
team a well-defined target to aim for.
I typically ask my teams to come up with an initial dev/test set and an initial metric in less
than one week—rarely longer. It is better to come up with something imperfect and get going
quickly, rather than overthink this. But this one week timeline does not apply to mature
applications. For example, anti-spam is a mature deep learning application. I have seen
teams working on already-mature systems spend months to acquire even better dev/test
sets.
If you later realize that your initial dev/test set or metric missed the mark, by all means
change them quickly. For example, if your dev set + metric ranks classifier A above classifier
B, but your team thinks that classifier B is actually superior for your product, then this might
be a sign that you need to change your dev/test sets or your evaluation metric.
There are three main possible causes of the dev set/metric incorrectly rating classifier A
higher:
1. The actual distribution you need to do well on is different from the dev/test sets.
Suppose your initial dev/test set had mainly pictures of adult cats. You ship your cat app,
and find that users are uploading a lot more kitten images than expected. So, the dev/test set
distribution is not representative of the actual distribution you need to do well on. In this
case, update your dev/test sets to be more representative.
2. You have overfit to the dev set. (As noted in the takeaways later, the cure in this case is to get more dev set data.)
3. The metric is measuring something other than what the project needs to optimize.
Suppose that for your cat application, your metric is classification accuracy. This metric
currently ranks classifier A as superior to classifier B. But suppose you try out both
algorithms, and find classifier A is allowing occasional pornographic images to slip through.
Even though classifier A is more accurate, the bad impression left by the occasional
pornographic image means its performance is unacceptable. What do you do?
Here, the metric is failing to identify the fact that Algorithm B is in fact better than
Algorithm A for your product. So, you can no longer trust the metric to pick the best
algorithm. It is time to change evaluation metrics. For example, you can change the metric to
heavily penalize letting through pornographic images. I would strongly recommend picking
a new metric and using the new metric to explicitly define a new goal for the team, rather
than proceeding for too long without a trusted metric and reverting to manually choosing
among classifiers.
It is quite common to change dev/test sets or evaluation metrics during a project. Having an
initial dev/test set and metric helps you iterate quickly. If you ever find that the dev/test sets
or metric are no longer pointing your team in the right direction, it’s not a big deal! Just
change them and make sure your team knows about the new direction.
12 Takeaways: Setting up development and
test sets (Summary of the previous)
• Choose dev and test sets from a distribution that reflects what data you expect to get in
the future and want to do well on. This may not be the same as your training data’s
distribution.
• Choose dev and test sets from the same distribution if possible.
• Choose a single-number evaluation metric for your team to optimize. If there are multiple
goals that you care about, consider combining them into a single formula (such as
averaging multiple error metrics) or defining satisficing and optimizing metrics.
• Machine learning is a highly iterative process: You may try many dozens of ideas before
finding one that you’re satisfied with.
• Having dev/test sets and a single-number evaluation metric helps you quickly evaluate
algorithms, and therefore iterate faster.
• When starting out on a brand new application, try to establish dev/test sets and a metric
quickly, say in less than a week. It might be okay to take longer on mature applications.
• The old heuristic of a 70%/30% train/test split does not apply for problems where you
have a lot of data; the dev and test sets can be much less than 30% of the data.
• Your dev set should be large enough to detect meaningful changes in the accuracy of your
algorithm, but not necessarily much larger. Your test set should be big enough to give you
a confident estimate of the final performance of your system.
• If your dev set and metric are no longer pointing your team in the right direction, quickly
change them: (i) If you had overfit the dev set, get more dev set data. (ii) If the actual
distribution you care about is different from the dev/test set distribution, get new
dev/test set data. (iii) If your metric is no longer measuring what is most important to
you, change the metric.
13 Build your first system quickly, then iterate
You want to build a new email anti-spam system. Your team has several ideas:
• Collect a huge training set of spam email. For example, set up a “honeypot”: deliberately
send fake email addresses to known spammers, so that you can automatically harvest the
spam messages they send to those addresses.
• Develop features for understanding the text content of the email.
• Develop features for understanding the email envelope/header features to show what set
of internet servers the message went through.
• and more.
Even though I have worked extensively on anti-spam, I would still have a hard time picking one of
these directions. It is even harder if you are not an expert in the application area. So don’t start off
trying to design and build the perfect system. Instead, build and train a basic system quickly—perhaps
in just a few days. Even if the basic system is far from the “best” system you can build, it is valuable to
examine how the basic system functions: you will quickly find clues that show you the most promising
directions in which to invest your time. These next few chapters will show you how to read these clues.
As the training set size increases, the dev set error should decrease.
We will often have some “desired error rate” that we hope our learning algorithm will
eventually achieve. For example:
• If we hope for human-level performance, then the human error rate could be the “desired error rate.”
• If our learning algorithm serves some product (such as delivering cat pictures), we might have an intuition about what level of performance is needed to give users a great experience.
• If you have worked on an important application for a long time, then you might have intuition about how much more progress you can reasonably make in the next quarter/year.
Add the desired level of performance to your learning curve:
Looking at the learning curve might therefore help you avoid spending months collecting
twice as much training data, only to realize it does not help.
One downside of this process is that if you only look at the dev error curve, it can be hard to
extrapolate and predict exactly where the red curve will go if you had more data. There is one
additional plot that can help you estimate the impact of adding more data: the training error.
29 Plotting training error
Your dev set (and test set) error should decrease as the training set size grows. But your
training set error usually increases as the training set size grows.
Let’s illustrate this effect with an example. Suppose your training set has only 2 examples:
One cat image and one non-cat image. Then it is easy for the learning algorithms to
“memorize” both examples in the training set, and get 0% training set error. Even if either or
both of the training examples were mislabeled, it is still easy for the algorithm to memorize
both labels.
Now suppose your training set has 100 examples. Perhaps even a few examples are
mislabeled, or ambiguous—some images are very blurry, so even humans cannot tell if there
is a cat. Perhaps the learning algorithm can still “memorize” most or all of the training set,
but it is now harder to obtain 100% accuracy. By increasing the training set from 2 to 100
examples, you will find that the training set accuracy will drop slightly.
Finally, suppose your training set has 10,000
examples. In this case, it becomes even harder
for the algorithm to perfectly fit all 10,000
examples, especially if some are ambiguous or
mislabeled. Thus, your learning algorithm will do
even worse on this training set.
Let’s add a plot of training error to our earlier
figures:
You can see that the blue “training error” curve increases with the size of the training set.
Furthermore, your algorithm usually does better on the training set than on the dev set; thus
the red dev error curve usually lies strictly above the blue training error curve.
Let’s discuss next how to interpret these plots.
Now, you can be absolutely sure that adding more data will not, by itself, be sufficient. Why is that?
Remember our two observations:
• As we add more training data, training error can only get worse. Thus, the blue training error curve
can only stay the same or go higher, and thus it can only get further away from the (green line) level of
desired performance.
• The red dev error curve is usually higher than the blue training error. Thus, there’s almost no way
that adding more data would allow the red dev error curve to drop down to the desired level of
performance when even the training error is higher than the desired level of performance.
Examining both the dev error curve and the training error curve on the same plot allows us to more
confidently extrapolate the dev error curve. Suppose, for the sake of discussion, that the desired
performance is our estimate of the optimal error rate. The figure above is then the standard “textbook”
example of what a learning curve with high avoidable bias looks like: At the largest training set size—
presumably corresponding to all the training data we have—there is a large gap between the training
error and the desired performance, indicating large avoidable bias.
Furthermore, the gap between the training and dev curves is small, indicating small variance.
Previously, we were measuring training and dev set error only at the rightmost point of this
plot, which corresponds to using all the available training data. Plotting the full learning
curve gives us a more comprehensive picture of the algorithms’ performance on different
training set sizes.
I would not bother with either of these techniques unless you have already tried plotting
learning curves and concluded that the curves are too noisy to see the underlying trends. If
your training set is large—say over 10,000 examples—and your class distribution is not very
skewed, you probably won’t need these techniques.
Finally, plotting a learning curve may be computationally expensive: For example, you might
have to train ten models with 1,000, then 2,000, all the way up to 10,000 examples. Training
models with small datasets is much faster than training models with large datasets. Thus,
instead of evenly spacing out the training set sizes on a linear scale as above, you might train
models with 1,000, 2,000, 4,000, 6,000, and 10,000 examples. This should still give you a
clear sense of the trends in the learning curves. Of course, this technique is relevant only if
the computational cost of training all the additional models is significant.
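As a concrete illustration, here is a minimal sketch of plotting such a learning curve with the unevenly spaced training set sizes suggested above. It uses scikit-learn on synthetic data purely as an assumed stand-in for your own model and dataset:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=12000, n_features=20, random_state=0)
X_train, y_train = X[:10000], y[:10000]
X_dev, y_dev = X[10000:], y[10000:]

sizes = [1000, 2000, 4000, 6000, 10000]   # unevenly spaced, as suggested above
train_err, dev_err = [], []
for m in sizes:
    clf = LogisticRegression(max_iter=1000).fit(X_train[:m], y_train[:m])
    train_err.append(1 - clf.score(X_train[:m], y_train[:m]))   # training error on the subset used
    dev_err.append(1 - clf.score(X_dev, y_dev))                 # dev error with that training set size

plt.plot(sizes, train_err, label="training error")
plt.plot(sizes, dev_err, label="dev error")
plt.xlabel("training set size"); plt.ylabel("error"); plt.legend(); plt.show()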
33 Why we compare to human-level
performance
Many machine learning systems aim to automate things that humans do well. Examples
include image recognition, speech recognition, and email spam classification. Learning
algorithms have also improved so much that we are now surpassing human-level
performance on more and more of these tasks.
Further, there are several reasons building an ML system is easier if you are trying to do a
task that people can do well:
1. Ease of obtaining data from human labelers. For example, since people recognize
cat images well, it is straightforward for people to provide high accuracy labels for your
learning algorithm.
2. Error analysis can draw on human intuition. Suppose a speech recognition
algorithm is doing worse than human-level recognition. Say it incorrectly transcribes an
audio clip as “This recipe calls for a pear of apples,” mistaking “pair” for “pear.” You can
draw on human intuition and try to understand what information a person uses to get the
correct transcription, and use this knowledge to modify the learning algorithm.
3. Use human-level performance to estimate the optimal error rate and also set
a “desired error rate.” Suppose your algorithm achieves 10% error on a task, but a person
achieves 2% error. Then we know that the optimal error rate is 2% or lower and the
avoidable bias is at least 8%. Thus, you should try bias-reducing techniques.
Even though item #3 might not sound important, I find that having a reasonable and
achievable target error rate helps accelerate a team’s progress. Knowing your algorithm has
high avoidable bias is incredibly valuable and opens up a menu of options to try.
There are some tasks that even humans aren’t good at. For example, picking a book to
recommend to you; or picking an ad to show a user on a website; or predicting the stock
market. Computers already surpass the performance of most people on these tasks. With
these applications, we run into the following problems:
• It is harder to obtain labels. For example, it’s hard for human labelers to annotate a
database of users with the “optimal” book recommendation. If you operate a website or
app that sells books, you can obtain data by showing books to users and seeing what they
buy. If you do not operate such a site, you need to find more creative ways to get data.
• Human intuition is harder to count on. For example, pretty much no one can
predict the stock market. So if our stock prediction algorithm does no better than random
guessing, it is hard to figure out how to improve it.
• It is hard to know what the optimal error rate and reasonable desired error
rate is. Suppose you already have a book recommendation system that is doing quite
well. How do you know how much more it can improve without a human baseline?
34 How to define human-level performance
Suppose you are working on a medical imaging application that automatically makes
diagnoses from x-ray images. A typical person with no previous medical background besides
some basic training achieves 15% error on this task. A junior doctor achieves 10% error. An
experienced doctor achieves 5% error. And a small team of doctors that discuss and debate
each image achieves 2% error. Which one of these error rates defines “human-level
performance”?
In this case, I would use 2% as the human-level performance proxy for our optimal error
rate. You can also set 2% as the desired performance level because all three reasons from the
previous chapter for comparing to human-level performance apply:
• Ease of obtaining labeled data from human labelers. You can get a team of doctors
to provide labels to you with a 2% error rate.
• Error analysis can draw on human intuition. By discussing images with a team of
doctors, you can draw on their intuitions.
• Use human-level performance to estimate the optimal error rate and also set an
achievable “desired error rate.” It is reasonable to use 2% error as our estimate of the
optimal error rate. The optimal error rate could be even lower than 2%, but it cannot be
higher, since it is possible for a team of doctors to achieve 2% error. In contrast, it is not
reasonable to use 5% or 10% as an estimate of the optimal error rate, since we know these
estimates are necessarily too high.
When it comes to obtaining labeled data, you might not want to discuss every image with an
entire team of doctors since their time is expensive. Perhaps you can have a single junior
doctor label the vast majority of cases and bring only the harder cases to more experienced
doctors or to the team of doctors.
If your system is currently at 40% error, then it doesn’t matter much whether you use a
junior doctor (10% error) or an experienced doctor (5% error) to label your data and provide
intuitions. But if your system is already at 10% error, then defining the human-level
reference as 2% gives you better tools to keep improving your system.
If you set the weighting parameter to 1/40, the algorithm would give equal total weight to the 5,000 mobile images and the 200,000 Internet images. You can also set the parameter to other values, perhaps by tuning it on the dev set. By weighting the additional Internet images less, you don't have to build as massive a neural network to make sure the algorithm does well on both types of tasks. This type of re-weighting is needed only when you suspect the additional data (Internet images) has a very different distribution than the dev/test set, or when the additional data is much larger than the data that came from the same distribution as the dev/test set (the mobile images).
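As a rough sketch of what this re-weighting looks like in code (the arrays and the weighted cross-entropy below are illustrative assumptions, not the book's implementation):

import numpy as np

def weighted_log_loss(y_true, y_pred, weights):
    # weighted binary cross-entropy: each example's loss is scaled by its weight
    eps = 1e-12
    losses = -(y_true * np.log(y_pred + eps) + (1 - y_true) * np.log(1 - y_pred + eps))
    return np.sum(weights * losses) / np.sum(weights)

# weight 1.0 for each of the 5,000 mobile examples, 1/40 for each of the 200,000 Internet examples
weights = np.concatenate([np.ones(5000), np.full(200000, 1.0 / 40)])
y_true = np.random.randint(0, 2, size=205000)          # placeholder labels
y_pred = np.clip(np.random.rand(205000), 0.01, 0.99)   # placeholder predictions
print(weighted_log_loss(y_true, y_pred, weights))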
40 Generalizing from the training set to the
dev set
Suppose you are applying ML in a setting where the training and the dev/test distributions
are different. Say, the training set contains Internet images + Mobile images, and the
dev/test sets contain only Mobile images. However, the algorithm is not working well: It has
a much higher dev/test set error than you would like. Here are some possibilities of what
might be wrong:
1. It does not do well on the training set. This is the problem of high (avoidable) bias on the
training set distribution.
2. It does well on the training set, but does not generalize well to previously unseen data
drawn from the same distribution as the training set. This is high variance.
3. It generalizes well to new data drawn from the same distribution as the training set, but
not to data drawn from the dev/test set distribution. We call this problem data
mismatch , since it is because the training set data is a poor match for the dev/test set
data.
For example, suppose that humans achieve near perfect performance on the cat recognition
task. Your algorithm achieves this:
• 1% error on the training set
• 1.5% error on data drawn from the same distribution as the training set that the algorithm
has not seen
• 10% error on the dev set
In this case, you clearly have a data mismatch problem. To address this, you might try to
make the training data more similar to the dev/test data. We discuss some techniques for
this later.
In order to diagnose to what extent an algorithm suffers from each of the problems 1-3
above, it will be useful to have another dataset. Specifically, rather than giving the algorithm
all the available training data, you can split it into two subsets: The actual training set which
the algorithm will train on, and a separate set, which we will call the “Training dev” set, that
we will not train on.
You now have four subsets of data:
• Training set. This is the data that the algorithm will learn from (e.g., Internet images +
Mobile images). This does not have to be drawn from the same distribution as what we
really care about (the dev/test set distribution).
• Training dev set: This data is drawn from the same distribution as the training set (e.g.,
Internet images + Mobile images). This is usually smaller than the training set; it only
needs to be large enough to evaluate and track the progress of our learning algorithm.
• Dev set: This is drawn from the same distribution as the test set, and it reflects the
distribution of data that we ultimately care about doing well on. (E.g., mobile images.)
• Test set: This is drawn from the same distribution as the dev set. (E.g., mobile images.)
Armed with these four separate datasets, you can now evaluate:
• Training error, by evaluating on the training set.
• The algorithm’s ability to generalize to new data drawn from the training set distribution,
by evaluating on the training dev set.
• The algorithm’s performance on the task you care about, by evaluating on the dev and/or
test sets.
Most of the guidelines in Chapters 5-7 for picking the size of the dev set also apply to the
training dev set.
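A minimal sketch of carving out such a training dev set; the array name and sizes here are illustrative assumptions:

import numpy as np

# Pretend combined training data: Internet + mobile examples (placeholder feature matrix).
X_combined = np.zeros((20000, 100), dtype=np.float32)

rng = np.random.default_rng(0)
idx = rng.permutation(len(X_combined))
n_training_dev = 2000                                # large enough to track progress, much smaller than the training set
training_dev_idx, training_idx = idx[:n_training_dev], idx[n_training_dev:]

X_training_dev = X_combined[training_dev_idx]        # same distribution as the training set, never trained on
X_training = X_combined[training_idx]                # what the algorithm actually trains on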
41 Identifying Bias, Variance, and Data
Mismatch Errors
Suppose humans achieve almost perfect performance (≈0% error) on the cat detection task, and thus
the optimal error rate is about 0%. Suppose you have:
• 1% error on the training set. • 5% error on training dev set. • 5% error on the dev set.
What does this tell you? Here, you know that you have high variance. The variance reduction
techniques described earlier should allow you to make progress. Now, suppose your algorithm
achieves:
• 10% error on the training set. • 11% error on training dev set. • 12% error on the dev set.
This tells you that you have high avoidable bias on the training set. I.e., the algorithm is doing poorly
on the training set. Bias reduction techniques should help. In the two examples above, the algorithm
suffered from only high avoidable bias or high variance. It is possible for an algorithm to suffer from
any subset of high avoidable bias, high variance, and data mismatch. For example:
• 10% error on the training set. • 11% error on training dev set. • 20% error on the dev set.
This algorithm suffers from high avoidable bias and from data mismatch. It does not, however, suffer
from high variance on the training set distribution. It might be easier to understand how the different
types of errors relate to each other by drawing them as entries in a table:
Continuing with the example of the cat image detector, you can see that there are two different distributions of data on the x-axis. On the y-axis, we have three types of error: human-level error, error on examples the algorithm has trained on, and error on examples the algorithm has not trained on. We can fill in the boxes with the different types of errors we identified in the previous chapter.
If you wish, you can also fill in the remaining two boxes in this table:
You can fill in the upper-right box (Human level performance on Mobile Images) by asking some
humans to label your mobile cat images data and measure their error. You can fill in the next box by
taking the mobile cat images (Distribution B) and putting a small fraction of them into the training set so
that the neural network learns on it too. Then you measure the learned model’s error on that subset of
data. Filling in these two additional entries may sometimes give additional insight about what the
algorithm is doing on the two different distributions (Distribution A and B) of data.
By understanding which types of error the algorithm suffers from the most, you will be better
positioned to decide whether to focus on reducing bias, reducing variance, or reducing data
mismatch.
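A tiny sketch that turns the error numbers above into the three gaps (assuming, as in the text, that human-level error is our proxy for the optimal error rate):

def diagnose(human_err, train_err, train_dev_err, dev_err):
    """Decompose the gap between human-level performance and dev error."""
    return {
        "avoidable bias": train_err - human_err,      # training error vs. desired performance
        "variance": train_dev_err - train_err,        # generalization within the training distribution
        "data mismatch": dev_err - train_dev_err,     # training distribution vs. dev/test distribution
    }

print(diagnose(0.00, 0.10, 0.11, 0.20))
# roughly: avoidable bias 0.10, variance 0.01, data mismatch 0.09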
47 The rise of end-to-end learning
Suppose you want to build a system to examine online product reviews and automatically tell you if
the writer liked or disliked that product. For example, you hope to recognize the following review as
highly positive: This is a great mop!
and the following as highly negative: This mop is low quality--I regret buying it.
The problem of recognizing positive vs. negative opinions is called “sentiment classification.”
To build this system, you might build a “pipeline” of two components:
1. Parser: A system that annotates the text with information identifying the most important words.
For example, you might use the parser to label all the adjectives and nouns. You would therefore get
the following annotated text: This is a great (Adjective) mop (Noun)!
2. Sentiment classifier: A learning algorithm that takes as input the annotated text and predicts the
overall sentiment. The parser’s annotation could help this learning algorithm greatly: By giving
adjectives a higher weight, your algorithm will be able to quickly hone in on the important words such
as “great,” and ignore less important words such as “this.”
We can visualize your “pipeline” of two components as follows:
There has been a recent trend toward replacing pipeline systems with a single learning
algorithm. An end-to-end learning algorithm for this task would simply take as input
the raw, original text “This is a great mop!”, and try to directly recognize the sentiment:
Neural networks are commonly used in end-to-end learning systems. The term “end-to-end”
refers to the fact that we are asking the learning algorithm to go directly from the input to
the desired output. I.e., the learning algorithm directly connects the “input end” of the
system to the “output end.”
In problems where data is abundant, end-to-end systems have been remarkably successful.
But they are not always a good choice. The next few chapters will give more examples of
end-to-end systems as well as give advice on when you should and should not use them.
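For concreteness, here is a minimal end-to-end sketch in tf.keras that maps raw review text directly to a sentiment score with no parser stage; the tiny dataset, vocabulary size and layer sizes are all illustrative assumptions:

import tensorflow as tf

texts = tf.constant(["This is a great mop!",
                     "This mop is low quality--I regret buying it."])
labels = tf.constant([1, 0])                      # 1 = positive, 0 = negative

vectorize = tf.keras.layers.TextVectorization(max_tokens=1000, output_sequence_length=16)
vectorize.adapt(texts)                            # learn the vocabulary from the raw text

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(1000, 16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(vectorize(texts), labels, epochs=3)     # raw text in, sentiment out; no hand-built parser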
48 More end-to-end learning examples
Suppose you want to build a speech recognition system. You might build a system with three
components:
So far, we have only described machine learning “pipelines” that are completely linear: the
output is sequentially passed from one stage to the next. Pipelines can be more complex.
For example, here is a simple architecture for an autonomous car:
It has three components: One detects other cars using the camera images; one detects
pedestrians; then a final component plans a path for our own car that avoids the cars and
pedestrians.
Not every component in a pipeline has to be learned. For example, the literature on “robot
motion planning” has numerous algorithms for the final path planning step for the car. Many
of these algorithms do not involve learning.
In contrast, an end-to-end approach might try to take in the sensor inputs and directly
output the steering direction:
Even though end-to-end learning has seen many successes, it is not always the best
approach. For example, end-to-end speech recognition works well. But I’m skeptical about
end-to-end learning for autonomous driving. The next few chapters explain why.
49 Pros and cons of end-to-end learning
Consider the same speech pipeline from our earlier example:
Having more hand-engineered components generally allows a speech system to learn with less data.
The hand-engineered knowledge captured by MFCCs and phonemes “supplements” the knowledge
our algorithm acquires from data. When we don’t have much data, this knowledge is useful.
Now, consider the end-to-end system:
This system lacks the hand-engineered knowledge. Thus, when the training set is small, it might do
worse than the hand-engineered pipeline.
However, when the training set is large, then it is not hampered by the limitations of an MFCC or
phoneme-based representation. If the learning algorithm is a large-enough neural network and if it is
trained with enough training data, it has the potential to do very well, and perhaps even approach the
optimal error rate.
End-to-end learning systems tend to do well when there is a lot of labeled data for “both ends”—the
input end and the output end. In this example, we require a large dataset of (audio, transcript) pairs.
When this type of data is not available, approach end-to-end learning with great caution.
If you are working on a machine learning problem where the training set is very small, most of your
algorithm’s knowledge will have to come from your human insight. I.e., from your “hand engineering”
components.
If you choose not to use an end-to-end system, you will have to decide what are the steps in your
pipeline, and how they should plug together. In the next few chapters, we’ll give some suggestions for
designing such pipelines.
50 Choosing pipeline components: Data
availability
When building a non-end-to-end pipeline system, what are good candidates for the
components of the pipeline? How you design the pipeline will greatly impact the overall
system’s performance. One important factor is whether you can easily collect data to train
each of the components.
For example, consider this autonomous driving architecture:
You can use machine learning to detect cars and pedestrians. Further, it is not hard to obtain
data for these: There are numerous computer vision datasets with large numbers of labeled
cars and pedestrians. You can also use crowdsourcing (such as Amazon Mechanical Turk) to
obtain even larger datasets. It is thus relatively easy to obtain training data to build a car
detector and a pedestrian detector.
In contrast, consider a pure end-to-end approach:
To train this system, we would need a large dataset of (Image, Steering Direction) pairs. It is
very time-consuming and expensive to have people drive cars around and record their
steering direction to collect such data. You need a fleet of specially-instrumented cars, and a
huge amount of driving to cover a wide range of possible scenarios. This makes an
end-to-end system difficult to train. It is much easier to obtain a large dataset of labeled car
or pedestrian images.
More generally, if there is a lot of data available for training “intermediate modules” of a
pipeline (such as a car detector or a pedestrian detector), then you might consider using a
pipeline with multiple stages. This structure could be superior because you could use all that
available data to train the intermediate modules.
Until more end-to-end data becomes available, I believe the non-end-to-end approach is
significantly more promising for autonomous driving: Its architecture better matches the
availability of data.
51 Choosing pipeline components: Task
simplicity
Other than data availability, you should also consider a second factor when picking
components of a pipeline: How simple are the tasks solved by the individual components?
You should try to choose pipeline components that are individually easy to build or learn.
But what does it mean for a component to be “easy” to learn?
Suppose you are building a Siamese cat detector. This is the pure end-to-end architecture:
In contrast, you can alternatively use a pipeline with two steps. The first step (cat detector) detects all the cats in the image.
The second step then passes cropped images of each of the detected cats (one at a time) to a
cat species classifier, and finally outputs 1 if any of the cats detected is a Siamese cat.
Compared to training a purely end-to-end classifier using just labels 0/1, each of the two
components in the pipeline--the cat detector and the cat breed classifier--seem much easier
to learn and will require significantly less data.
Returning to the autonomous driving example: by using this pipeline, you are telling the algorithm that there are 3 key steps to driving: (1)
Detect other cars, (2) Detect pedestrians, and (3) Plan a path for your car. Further, each of
these is a relatively simpler function--and can thus be learned with less data--than the
purely end-to-end approach.
In summary, when deciding what should be the components of a pipeline, try to build a
pipeline where each component is a relatively “simple” function that can therefore be learned
from only a modest amount of data.
END OF PART 2
DEEP LEARNING
BASICS
PART 3
COMPUTER VISION
Computer Vision Introduction
What is computer vision?
1-Computer vision can be defined as “the theory and technology for building artificial systems
that obtain information from images or multi-dimensional data.”
A simpler explanation is that computer vision strives to solve the same problems you can solve
with your very own eyes.
2- Computer Vision is the broad parent name for any computations involving visual content –
that means images, videos, icons, and anything else with pixels involved.
One of the challenges facing computer vision is that inputs can get really big. A 1-megapixel image (which is actually pretty small compared to what cameras output now) already has 1000 × 1000 × 3 = 3,000,000 values (the 3 here is for the three channels composing the image: red, green and blue), i.e. 3,000,000 features for just one training example, which is a very large input. It also means that w[1] would be a matrix of size (n[1] × 3,000,000), i.e. a very large number of learning parameters, which makes the learning process harder in terms of computation, memory, and the amount of data needed to prevent overfitting with so many parameters. Here come convolution operations, which are the fundamental element of computer vision.
The earlier layers of a neural network usually focus on lines and edges; as we go deeper, the layers focus on more complex shapes such as curves, then geometrical shapes such as triangles, circles, rectangles, etc., followed by even more complex shapes. For example, in a face recognition problem the early layers may deal with edges of the face and its components, the intermediate layers with eyes, noses, ears, eyebrows, mouths and so on, and the later layers of the network deal with faces as a whole, as shown in the example:
Let us focus on the earlier stages and see how convolution works on edge detection.
Convolutional operations: (edge detection as motivation example)
Assume we have an example image like this. Edge detection is usually done as: 1) vertical edge detection, 2) horizontal edge detection. So how would we detect edges in an image like this?
Let's take a greyscale image (hence only 1 channel) of size 6 × 6 as an example; as said before, each of those 36 pixels has a value between 0 (black) and 255 (white).
Let us do a vertical edge detection:
We need a matrix called a filter (or kernel), which we will assume here; the filter is slid over the image and the convolution is computed step by step, as shown:
Another example
Notice here that the vertical edge in the original image is translated into a brighter region between the two darker regions, showing that a vertical edge exists there.
If we flip the original image horizontally, such that the dark is on the left and the bright is on the right, here is the result:
The difference between the two results is that the first image had a light-to-dark transition while the second had a dark-to-light transition (moving from left to right (→)).
For horizontal edge detection, we use the transpose of the filter used for vertical edge detection.
What is the best set of numbers to use in the kernel to perform vertical and horizontal detection? It turns out that the Sobel filter is one of the best edge-detecting kernels; it puts more weight on the central pixels, which makes it a bit more robust.
Computer vision researchers wanted to exaggerate this even more, so they also use the Scharr filter:
Note: Edge detection is not only for vertical or horizontal edges; it can also be done for edges at an angle (e.g., 45° edges).
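To make the mechanics concrete, here is a small numpy sketch (the image values and filters are just illustrative) of the vertical edge detection described above; note that it implements cross-correlation, which deep learning calls "convolution", as discussed in the note below the padding section:

import numpy as np

def conv2d(image, kernel):
    f = kernel.shape[0]
    n = image.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)   # element-wise multiply and sum
    return out

image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)     # bright left half, dark right half
vertical = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)   # simple vertical edge filter
sobel_v  = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)   # Sobel puts more weight on the centre row
print(conv2d(image, vertical))   # a bright band in the middle columns marks the vertical edge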
Padding
Motivation behind using padding
One downside of convolution is that it shrinks the result: when we convolved a 6×6 image with a 3×3 filter we ended up with a 4×4 image, so the image shrank from 6×6 to 4×4. If we apply convolution several times, the image ends up being very small. Another downside is that some pixels are involved in the convolution less than others; for example, the top-left pixel of the image is involved only once, while the pixels in the middle of the image are involved several times (twice, three times, or more).
This means that we throw away some information near the edges by using those pixels only a small number of times compared to the others.
Those problems are solved by padding.
Padding is adding a border of pixels around the image to increase the size of the image as
shown:
As we can notice here, we entered a 6×6 image and ended up with a 6×6 image as well, so the image is not shrunk in size; also, the edge pixels of the image are now included more often than before, so we are not throwing that information away.
Note: Padding is done with zeros.
NOW, if we have an n×n image convolved with an f×f filter and padded by P, the result will be (n+2P−f+1) × (n+2P−f+1).
As seen in the previous example, the filter moves 1 step at a time on the left in the case of stride = 1 and 2 steps at a time on the right in the case of stride = 2. With a stride S, the output size becomes (⌊(n+2P−f)/S⌋+1) × (⌊(n+2P−f)/S⌋+1).
Note: Take care that the stride is compatible with the image size, so that the filter does not step outside the image, as shown.
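A one-line helper capturing the output-size formula above (a sketch; integer floor division handles the stride):

def conv_output_size(n, f, p=0, s=1):
    return (n + 2 * p - f) // s + 1        # floor((n + 2P - f) / S) + 1

print(conv_output_size(6, 3))              # 4  (no padding, stride 1)
print(conv_output_size(6, 3, p=1))         # 6  ("same" padding keeps the size)
print(conv_output_size(7, 3, p=0, s=2))    # 3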
Note: In math and signal-processing textbooks, convolution is not done exactly as we did it: the filter is first mirrored (flipped) before being applied. What we did here is, in math terms, called cross-correlation, but in deep learning cross-correlation is called convolution, and this is the convention deep learning researchers use.
Convolution on RGB images
What we were doing was done on a 2-D (greyscale) image, but for RGB images, which are more common, we now have 3-D inputs. So how does convolution deal with them?
This will be done by stacking filters as shown:
Note: The number of channels in the input image must equal the number of channels of the filter.
But how does convolving a 6×6×3 matrix with a 3×3×3 filter end up with a 4×4×1 matrix? At each filter position, all 3×3×3 = 27 element-wise products are summed into a single number, so the output has only one channel.
Multiple filters
In fact we don't always want to find only vertical or horizontal edges; we usually want both, plus many other filters that detect, for example, inclined edges at different angles. That is why we need multiple filters to convolve with, each one having a specific task. So if we have an image and we want both horizontal and vertical edge detection, we convolve the image once with the vertical edge-detector kernel and once with the horizontal one, then stack the outputs together. (The output is now (n−f+1) × (n−f+1) × nc', where nc' is the number of filters used.)
Convolutional neural network
We start with an n×n×3 image (a[0]) and convolve it with nc' kernels of size f×f×3 (w[1]), so we get the outputs w[1].a[0]; we then add a bias term to each of the nc' outputs (giving z[1]) and apply an activation function to them (giving a[1]). We have now moved from a[0] to a[1], which is one layer of a convolutional neural network. Check the example:
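Here is a small numpy sketch of one such layer: a 6×6×3 input convolved with two 3×3×3 filters, plus one bias per filter and a ReLU (the random values are placeholders for learned parameters):

import numpy as np

def conv_single(image, kernel):            # image: (n, n, c), kernel: (f, f, c) -> (n-f+1, n-f+1)
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f, :] * kernel)   # 3*3*3 = 27 products summed to one number
    return out

a0 = np.random.rand(6, 6, 3)                                 # input image a[0]
W1 = [np.random.rand(3, 3, 3), np.random.rand(3, 3, 3)]      # nc' = 2 filters (w[1])
b1 = np.zeros(2)                                             # one bias per filter
z1 = np.stack([conv_single(a0, W1[k]) + b1[k] for k in range(2)], axis=-1)
a1 = np.maximum(z1, 0)                                       # ReLU activation
print(a1.shape)                                              # (4, 4, 2): one 4x4 map per filter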
Exercises
Notations
Simple convolutional neural network
Take care from how sizes differs from one layer to another:
From layer 1 → layer 2: we have 39×39×3 with kernel 3×3×3, P=0, S=1 and nc'=10.
Hence the output equals 37×37×10.
From layer 2 → layer 3: we have 37×37×10 with kernel 5×5×10, P=0, S=2 and nc'=20.
Hence the output equals 17×17×20.
From layer 3 → layer 4: we have 17×17×20 with kernel 5×5×20, P=0, S=2 and nc'=40.
Hence the output equals 7×7×40.
From layer 4 → layer 5: this is just flattening, 7×7×40 = 1960, so it can be fed to the softmax function to perform the prediction of ŷ.
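The same shape walk-through can be sketched in tf.keras (untrained, just to confirm the sizes):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(10, 3, strides=1, padding="valid", activation="relu"),  # 39x39x3 -> 37x37x10
    tf.keras.layers.Conv2D(20, 5, strides=2, padding="valid", activation="relu"),  # -> 17x17x20
    tf.keras.layers.Conv2D(40, 5, strides=2, padding="valid", activation="relu"),  # -> 7x7x40
    tf.keras.layers.Flatten(),                                                     # -> 1960
    tf.keras.layers.Dense(10, activation="softmax"),                               # prediction over 10 classes
])
x = tf.random.normal((1, 39, 39, 3))    # one 39x39x3 input image
print(model(x).shape)                   # (1, 10)
model.summary()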
A CNN is composed of a number of these layers stacked together.
Pooling layers
Pooling layers are mainly used to reduce the size of the representation, speed up computation, and make the network more robust to features.
Pooling has no learnable parameters, unlike convolution, so gradient descent has nothing to learn for it.
Max pooling
Max pooling is to take the highest value of a bunch of pixels in an image to represent this bunch
by this value as shown in the figures
Average pooling
So, for an n×n×nc input, the output of a pooling layer with filter size f and stride s will be (⌊(n−f)/s⌋+1) × (⌊(n−f)/s⌋+1) × nc; the number of channels is unchanged because pooling is applied to each channel independently.
Notes:
1-This is one of the simplest deep neural networks in history.
2-Each conv layer followed by a pooling layer is together counted as one neural network layer.
3-Conv-pool-conv-pool-…-conv-pool-FC-FC-…-FC is a pretty common pattern used in most neural networks.
4-Each neuron in the last pooling layer is connected to all neurons in the first FC layer; that is why it is of size (120,400).
5-Softmax has 10 outputs in this example as it was an example of hand-written digits
recognition.
Convolutional Neural Networks were inspired by research done on the visual cortex of
mammals and how they perceive the world using a layered architecture of neurons in the brain
as said before. So think of this model of the visual cortex as groups of neurons designed
specifically to recognize different shapes. Each group of neurons fires at the sight of an object,
and communicate with each other to develop a holistic understanding of the perceived object.
General architecture (building blocks or structure) of convolutional neural networks
A simple convnet is a sequence of layers where each layer transforms one volume of activations
to another through a differentiable function. Convolutional Neural Networks have a different
architecture than regular Neural Networks. Regular Neural Networks transform a vector input
by putting it through a series of hidden layers. Every layer is made up of a set of neurons, where
each layer is fully connected to all neurons in the layer before. Finally, there is a last layer — the
output layer — that represents the predictions.
Convolutional Neural Networks are a bit different. First of all, the layers are organized in 3
dimensions: width, height and depth (channels). Further, the neurons in one layer do not
connect to all the neurons in the next layer but only to a small region of it (it is known as
sparsity of connections or receptive field). Lastly, the final output will be reduced to a single
vector of probability scores, organized along the depth dimension.
The convnet is mainly composed of 3 types of layers which are:
1-Input layer
2-Convolutional layer (feature extraction/learning part)
3-Pooling layer (aka sub-sampling or downsampling layer) (feature extraction/learning part)
4-Fully connected layers (classification part)
Note: There is non-linearity operations as ReLU done as discussed before. Each Layer may or
may not have learning parameters (e.g. CONV/FC do, RELU/POOL don’t). Each Layer may or
may not have additional hyper-parameters (e.g. CONV/FC/POOL do, RELU doesn’t).
Now, let's go back to visualizing this mathematically. When we have this filter at the top-left corner of the input volume, it is computing multiplications between the filter and the pixel values at that region. Now let's take an example of an image that we want to classify, and let's put our filter at the top-left corner.
To visualize, this video is more than great to see how as we go deeper in the network we start
having more complex features and that feature map of each layer in fact depends on the feature
maps from the previous ones:
[https://www.youtube.com/watch?time_continue=152&v=AgkfIQ4IGaM]
Two main concepts to know about CNN
Local connectivity (aka sparsity of connectivity)
When dealing with high-dimensional inputs such as images, it is impractical to connect neurons
to all neurons in the previous volume as was done in the regular ANN discussed in the previous
chapter. Instead, we will connect each neuron to only a local region of the input volume. The
spatial extent of this connectivity is a hyper-parameter called the receptive field of the neuron
(equivalently this is the filter size). The extent of the connectivity along the depth axis is always
equal to the depth (no. of channels) of the input volume (that is why kernel's number of
channels is always equal to number of channels of the input to this kernel). It is important to
emphasize again this asymmetry in how we treat the spatial dimensions (width and height) and
the depth dimension: The connections are local in space (along width and height), but always
full along the entire depth of the input volume.
To sum up we can say that local connectivity is the concept that each neuron connected only to
a subset of the input image (unlike a neural network where all the neurons are fully connected to
the all neurons in the previous layer).
Note: These 2 concepts are the main reasons or advantages that make convolution useful in
neural networks.
Pooling layer (aka sub-sampling or down sampling layer)
The second secret sauce that has made CNNs very effective is pooling. It is common to
periodically insert a Pooling layer in-between successive Conv layers in a ConvNet architecture.
Pooling is a vector to scalar transformation that operates on each local region of an image, just
like convolutions do, however, unlike convolutions, they do not have filters and do not compute
dot products with the local region, instead they compute the average of the pixels in the region
(Average Pooling) or simply picks the pixel with the highest intensity and discards the rest
(Max Pooling) or sum the pixels in the region (Sum Pooling). Max Pooling is the most used
type among them. For an F×F pooling with stride F, the width and height of each feature map are effectively reduced by a factor of F.
Example (Max Pooling): This is max pooling of size 2×2 and stride=2
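A small numpy sketch of the 2×2, stride-2 max pooling example (the values are made up):

import numpy as np

def max_pool(fmap, f=2, s=2):
    n = fmap.shape[0]
    out_n = (n - f) // s + 1
    out = np.zeros((out_n, out_n))
    for i in range(out_n):
        for j in range(out_n):
            out[i, j] = fmap[i*s:i*s+f, j*s:j*s+f].max()   # keep only the strongest activation in each region
    return out

fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 2],
                 [7, 8, 9, 3],
                 [1, 2, 4, 5]], dtype=float)
print(max_pool(fmap))   # [[6. 5.], [8. 9.]]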
Classic networks
LeNet-5 By Yann-LeCun
LeNet was trained on greyscale images from the MNIST handwritten-digit recognition dataset. Here is the architecture:
Notes:
1-This neural network is pretty small compared to recent neural networks, as it has only 60K parameters, while modern networks have 10 million to 100 million parameters.
2-As we go deeper in this network, the height and width shrink (32→28→14→10→5…).
3-As we go deeper, the number of channels increases (1→6→16).
4-This network used the sigmoid/Tanh activation functions, not ReLU, as ReLU wasn't introduced yet.
5-A sigmoid non-linearity was used after the pooling layer, but this is not done anymore.
6-In case you read its paper, just focus on sections II and III, as recommended by Andrew.
AlexNet By Alex Krizhevsky
Notes:
1-This neural network is similar to LeNet-5 but much bigger, with about 60 million parameters.
2-This network used ReLU activation function instead of sigmoid/Tanh functions.
3-This architecture was introduced in a time when GPU was slower than now so it was trained
on multiple GPUS.
4-This network uses a layer called local response normalization, but this type of layer is not used much recently.
Notes:
1-This network is composed of 16 layers with a total number of learning parameters ≈ 138 million.
2-There is another VGG network composed of 19 layers instead of 16, called VGG-19.
Residual networks (ResNets)
Very deep neural networks are hard to train due to vanishing and exploding gradient problems. Here comes the idea of skip connections, which let the activations from one layer feed a deeper layer in the network; this is the basic idea ResNets rely on.
Residual block
To go from a[l] to a[l+2] you have to go through the main path as follows:
a[l] → z[l+1]=w[l+1].a[l]+b[l+1] → a[l+1] = g(z[l+1]) → z[l+2]=w[l+2].a[l+1]+b[l+2] → a[l+2] = g(z[l+2])
How do we modify this? We take a[l] and add it to z[l+2] using a shortcut path, so that a[l+2] = g(z[l+2] + a[l]), as follows:
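A minimal sketch of a residual block in tf.keras, using Dense layers for simplicity (ResNets actually use convolutional layers; this just shows the shortcut a[l] being added to z[l+2] before the activation):

import tensorflow as tf

def residual_block(a_l, units):
    z1 = tf.keras.layers.Dense(units)(a_l)
    a1 = tf.keras.layers.ReLU()(z1)                 # a[l+1] = g(z[l+1])
    z2 = tf.keras.layers.Dense(units)(a1)           # z[l+2]
    shortcut = tf.keras.layers.Add()([z2, a_l])     # z[l+2] + a[l]
    return tf.keras.layers.ReLU()(shortcut)         # a[l+2] = g(z[l+2] + a[l])

inputs = tf.keras.Input(shape=(64,))
outputs = residual_block(inputs, 64)
model = tf.keras.Model(inputs, outputs)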
Residual network
Did residual networks improve performance compared to plain networks (no residuals)?
Yes. In theory, adding more layers to a network should make the training error decrease, but in practice with plain networks it doesn't help that much: after a certain number of layers the training error starts increasing again. ResNets solved this to a large extent and keep improving as the number of layers of the network increases.
Why do ResNets work?
Assume we have these two networks, where one of them is a big neural network and the other is the same big neural network followed by a residual block. Writing the equation for the second one: a[l+2] = g(z[l+2] + a[l]) = g(w[l+2].a[l+1] + b[l+2] + a[l]).
So if w[l+2] = b[l+2] = 0 and we are using the ReLU activation function, then a[l+2] will be equal to a[l], which means the residual block is acting as just an identity function. In other words, with residual blocks we can add more depth, and if there is nothing more to learn, the network can simply copy the previous output and not hurt performance, unlike plain networks, which have to choose values for the parameters as we go deeper and may end up with worse results if they cannot learn anything useful. Of course, it is not only about not hurting your neural network: the residual blocks can also improve it by learning something more if there is something to be learned. To summarize, a residual block allows the network to either learn something new or act as an identity, so it does not hurt the neural network as it goes deeper.
The obvious problem with the inception module is its computational cost, which is very high compared to other architectures. As an example, if we just estimate the number of operations done in the 5×5 filters alone: we have a 28×28×192 input and a 28×28×32 output, which means we compute 28×28×32 values, each using 5×5×192 operations of the filter, so this ends up as (28×28×32) × (5×5×192) ≈ 120 million operations!
So here comes the idea of introducing a 1×1 conv to perform the 5×5 conv with about one tenth of the operations needed in the original method (instead of ~120 million → ~12 million). This is done as shown (left is the original method and right is the modified one that reduces the computations):
The new method still preserve the input and output dimensions but with less computations.
Computation operations: [(1×1×192) × (28×28×16)] + [(5×5×16) × (28×28×32)]=12.4 mil
So the computational cost decreased to approximately 1/10 of the operations needed in the
original method with preserving the dimensions of the input and output.
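The arithmetic above, written out:

# Multiplications for the 5x5 branch of the inception module
naive      = (28 * 28 * 32) * (5 * 5 * 192)                                   # ~120.4 million
bottleneck = (28 * 28 * 16) * (1 * 1 * 192) + (28 * 28 * 32) * (5 * 5 * 16)   # ~12.4 million
print(naive, bottleneck, round(naive / bottleneck, 1))                        # roughly a 10x reduction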
Note: The layer added before the 5×5 layer is called a bottleneck layer.
The last thing to say before showing the final inception module: as we said before, 1×1 conv helps shrink the number of channels; it is used after the pooling layer to decrease the number of channels from 192 to any smaller number we need, by using n filters each of size 1×1×192.
Now to construct an inception network we will add some inception modules to form it as
follows:
Transfer learning
Rather than making your network learn from scratch with randomly initialized weights, you can make much faster progress if you start from good weights that somebody else ended up with after training their network (of course, both should use the same architecture). These weights are usually obtained by training on well-known datasets such as COCO, ImageNet or Pascal for weeks or even months, so starting from these weights is a very good initialization.
Transfer learning is important in cases like having a small dataset. In fact, transfer learning is almost always used, unless you have an exceptionally large dataset that doesn't need this method.
Important Note: Change the number of classes of the softmax layer to match your task. With a small dataset, freeze all the layers of the network except the softmax layer and start training. For a larger dataset, you can freeze some layers of the network and train the others; the larger your dataset, the more layers you can train, such that if the dataset is large enough you can train the whole network, with the pre-trained weights serving just as an initialization.
Remember: For small dataset, data augmentation is a good option for increasing your data as
discussed before.
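A minimal transfer-learning sketch in tf.keras for the small-dataset case described above (the base model, class count and hyper-parameters are illustrative choices, not a prescription):

import tensorflow as tf

base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False                                  # freeze all pre-trained layers
num_classes = 3                                         # change to match your own task
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(num_classes, activation="softmax"),   # the only trainable layer here
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# With a larger dataset, unfreeze some (or all) of base.layers and fine-tune.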
Networks architecture & Transfer learning
In this part we will talk about the most popular CNN architectures for classification tasks released up to 2018 and the basic concepts they introduced. As we go deeper in this section we see more recent CNN architectures that achieved better results in the ImageNet Large Scale Visual Recognition Competition (ILSVRC); except for LeNet-5, all of them are either winners or runner-ups. Let us start:
LeNet-5 (by Yann LeCun, Leon Bottou, Yosuha Bengio and Patrick Haffner in 1998)
This paper was a milestone for convolutional neural networks, as it proposed most of the main ideas discussed in the previous section and was the first step toward the use of convolutional neural networks in the machine learning field. It was trained for a handwritten-digit classification task.
Architecture
        | Type         | Filter size | No. of filters | Stride | Padding | Trainable parameters | Output size
Input   | Input size = 32×32×1 (grey-scale image)
Layer1  | Conv         | 5×5 | 6   | 1 | 0 | 156   | 28×28×6
Layer2  | AvgPool      | 2×2 | 6   | 2 | 0 | 12    | 14×14×6
Layer3  | Conv         | 5×5 | 16  | 1 | 0 | 1516  | 10×10×16
Layer4  | AvgPool      | 2×2 | 16  | 2 | 0 | 32    | 5×5×16
Layer5  | FC by Conv   | 5×5 | 120 | - | - | 48120 | 1×1×120
Layer6  | FC           | -   | -   | - | - | 10164 | 84
Layer7  | FC (Softmax) | -   | -   | - | - | -     | 10 (digits)
Activation function = Tanh & Total parameters = 60,000
Note: The pooling layers here have trainable parameters because average pooling in LeNet-5 used 1 coefficient and 1 bias term per feature map, but generally, as we said, pooling has no trainable parameters.
Visualize architecture
[https://engmrk.com/wp-content/uploads/2018/09/LeNet-5_Architecture_Explanation.mp4]
Main concepts: convolution, local receptive fields, shared weights, spatial subsampling; all of them were discussed in detail in the previous section.
There was a gap between 1998 and 2010 until the Dan Ciresan Net was introduced, which provided a 9-layer CNN trained on an NVIDIA GTX 280, but we will skip this network and start from the 2012 net.
AlexNet (by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton in 2012)
Architecture
        | Type         | Filter size | No. of filters | Stride | Padding | Trainable parameters | Output size
Input   | Input size = 227×227×3
Layer1  | Conv         | 11×11 | 96  | 4 | 0 | 34944    | 55×55×96
Layer2  | MaxPool      | 3×3   | 96  | 2 | 0 | -        | 27×27×96
Layer3  | Conv         | 5×5   | 256 | 1 | 2 | 614656   | 27×27×256
Layer4  | MaxPool      | 3×3   | 256 | 2 | 0 | -        | 13×13×256
Layer5  | Conv         | 3×3   | 384 | 1 | 1 | 885120   | 13×13×384
Layer6  | Conv         | 3×3   | 384 | 1 | 1 | 1327488  | 13×13×384
Layer7  | Conv         | 3×3   | 256 | 1 | 1 | 884992   | 13×13×256
Layer8  | MaxPool      | 3×3   | 256 | 2 | 0 | -        | 6×6×256
Flattening → 9216
Layer9  | FC           | -     | -   | - | - | 37748736 | 4096
Layer10 | FC           | -     | -   | - | - | 16777216 | 4096
Layer11 | FC (Softmax) | -     | -   | - | - | 4096000  | 1000
Activation function = ReLU & Total parameters ≈ 62.3 million
Notes: 1-Input size is 227 not 224 as mentioned in the paper. It was just a mistake in the paper.
2-This model was trained on 2 GTX 580 3GB GPUS so that is why the model in the figure
above is separated on two pipelines.
3-MaxPooling is done with overlapping, not the normal MaxPooling we described; this will be discussed below.
4-There is a normalization layer used after the first 2 MaxPooling layers, but researchers later found it is not that effective.
5-Dropout was used before each of the FC layers 9 & 10 with keep probability = 0.5
6-This model reached an error of 15.3% on ImageNet.
Main concepts
1-Use of ReLU instead of Tanh (This point was discussed in the previous chapter)
2-Use of dropout to reduce overfitting (This point was discussed in the previous chapter)
3-Data augmentation methods to reduce overfitting (This was discussed in the previous chapter)
4-SGD with momentum was introduced. (This was discussed in the previous chapter)
5-Overlapping MaxPooling to avoid the averaging effects of AvgPool.
We can see from the example that a pooling filter of size 2×3 with stride 2 is used, so the 3rd column in the first pooling window is the same as the 1st column in the second window.
Actually most of the popular networks, if not all, use overlapping MaxPooling, as it turned out to give better results than normal non-overlapping MaxPooling. Geoffrey Hinton said that without overlapping MaxPooling, the network may lose some information in its feature maps.
ZFNet (by Matthew Zeiler and Rob Fergus in 2013)
We can notice that the architecture is the same as AlexNet except for the first conv layer, which uses a 7×7 filter with stride 2 instead of an 11×11 filter with stride 4.
Main concepts
1-Using smaller filter sizes with smaller strides at the first layers.
2-Visualization technique using what is called by deconv net.
Deconvolution
Deconvolution is the inverse process of convolution and works in two steps:
1) Pad by 1 around each pixel of the convolved output matrix.
2) Apply convolution with the transpose of the filter used in the forward convolution.
To visualize [https://i.stack.imgur.com/f2RiP.gif]
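For reference, the transposed-convolution operation described above is available directly in deep learning libraries; a small sketch, with sizes chosen to mirror the earlier 6×6 → 4×4 convolution (here 4×4 → 6×6):

import tensorflow as tf

x = tf.random.normal((1, 4, 4, 1))                    # a 4x4 feature map (batch of 1)
deconv = tf.keras.layers.Conv2DTranspose(filters=1, kernel_size=3, strides=1, padding="valid")
print(deconv(x).shape)                                # (1, 6, 6, 1): upsampled back to 6x6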
In fact, ZFNet is one of the most important networks in CNN history, as it provided great intuition about how CNNs work. The visualization technique proposed in this paper helped us to see that the first layers of a CNN learn low-level features and that more complex features are learned as we go deeper, and it gave good insight for improving network architectures.
VGG-19 (by Karen Simonyan and Andrew Zisserman in 2014)
Simplicity and depth could be the header for this network. This 19-layer network was the runner-up of ILSVRC in 2014 and was able to reach a 7.3% error rate by strictly sticking to 3×3 filters with stride and padding of 1, along with 2×2 pooling with stride 2 (non-overlapping MaxPooling). VGG has another version with only 16 layers, which is the one mentioned in the main course.
Architecture
        | Type         | Filter size | No. of filters | Stride | Padding | Trainable parameters | Output size
Input   | Input size = 224×224×3
Layer1  | Conv         | 3×3 | 64  | 1 | 1 | 1792        | 224×224×64
Layer2  | Conv         | 3×3 | 64  | 1 | 1 | 36928       | 224×224×64
Layer3  | MaxPool      | 2×2 | 64  | 2 | 0 | -           | 112×112×64
Layer4  | Conv         | 3×3 | 128 | 1 | 1 | 73856       | 112×112×128
Layer5  | Conv         | 3×3 | 128 | 1 | 1 | 147584      | 112×112×128
Layer6  | MaxPool      | 2×2 | 128 | 2 | 0 | -           | 56×56×128
Layer7  | Conv         | 3×3 | 256 | 1 | 1 | 295168      | 56×56×256
Layer8  | Conv         | 3×3 | 256 | 1 | 1 | 590080      | 56×56×256
Layer9  | Conv         | 3×3 | 256 | 1 | 1 | 590080      | 56×56×256
Layer10 | Conv         | 3×3 | 256 | 1 | 1 | 590080      | 56×56×256
Layer11 | MaxPool      | 2×2 | 256 | 2 | 0 | -           | 28×28×256
Layer12 | Conv         | 3×3 | 512 | 1 | 1 | 1,180,160   | 28×28×512
Layer13 | Conv         | 3×3 | 512 | 1 | 1 | 2,359,808   | 28×28×512
Layer14 | Conv         | 3×3 | 512 | 1 | 1 | 2,359,808   | 28×28×512
Layer15 | Conv         | 3×3 | 512 | 1 | 1 | 2,359,808   | 28×28×512
Layer16 | MaxPool      | 2×2 | 512 | 2 | 0 | -           | 14×14×512
Layer17 | Conv         | 3×3 | 512 | 1 | 1 | 2,359,808   | 14×14×512
Layer18 | Conv         | 3×3 | 512 | 1 | 1 | 2,359,808   | 14×14×512
Layer19 | Conv         | 3×3 | 512 | 1 | 1 | 2,359,808   | 14×14×512
Layer20 | Conv         | 3×3 | 512 | 1 | 1 | 2,359,808   | 14×14×512
Layer21 | MaxPool      | 2×2 | 512 | 2 | 0 | -           | 7×7×512
Flattening → 25088
Layer22 | FC           | -   | -   | - | - | 102,764,544 | 4096
Layer23 | FC           | -   | -   | - | - | 16,781,312  | 4096
Layer24 | FC (Softmax) | -   | -   | - | - | 8194        | 1000
Activation function = ReLU & Total parameters ≈ 139.5 million
Notes:
1- This network was trained on 4 NVIDIA TITAN Black GPUS for 2-3 weeks.
2-This network is simple and uniform as it repeats itself.
3-The difference between 19 and 16 layer VGG is that instead of stacking 4 consecutive conv
layers we stack only 3 conv layers.
4-It works well in both classification and localization tasks. (localization is discussed later)
Main concepts
1-Having deeper network using smaller kernels (filters).
2-Doubling number of kernels after each MaxPooling layer.
3-Using Mini-batch gradient descent. (This was discussed before in machine learning chapter)
GoogLeNet / Inception-v1 (by Christian Szegedy et al. from Google in 2014)
Architecture
It uses 9 inception modules and over 100 layers, so it is complex but very deep.
You can notice that the architecture is simply a repetition of those inception cells, in addition to something called an auxiliary loss.
#3×3 reduce and #5×5 reduce represent the number of 1×1 conv filters before each 3×3 and 5×5 conv respectively; pool proj represents the number of 1×1 conv filters after MaxPooling.
Main concepts
1-Introduce an idea called Network-in-Network (NiN).
2-Introduce the idea of inception modules (aka inception cells).
3-Usage of RMSProp optimization. (discussed before in previous chapter)
But if we look at the photo on the right, which is the final idea, we find that the 3×3 conv and 5×5 conv are preceded by a 1×1 convolution, and MaxPooling is followed by a 1×1 convolution. Why would they do that? As we said in the Network-in-Network section, 1×1 convolution can be used for dimensionality reduction; if we went along with the naïve idea that the authors first came up with, it would lead to way too many outputs. We would end up with an extremely large depth (channel) dimension for the output volume. The way the authors address this is by adding 1×1 conv operations before the 3×3 and 5×5 layers. You may be asking yourself
“How does this architecture help?”. Well, you have a module that consists of a network in
network layer, a medium sized filter convolution, a large sized filter convolution, and a pooling
operation. The network in network conv is able to extract information about the very fine grain
details in the volume, while the 5x5 filter is able to cover a large receptive field of the input, and
thus able to extract its information as well. You also have a pooling operation that helps to
reduce spatial sizes and combat overfitting. On top of all of that, you have ReLUs after each
conv layer, which help improve the nonlinearity of the network. Basically, the network is able
to perform the functions of these different operations while still remaining computationally
considerate.
Notes:
1-This architecture reduced the trainable parameters to just 4 million with over 100 layers.
2-The 1×1 conv layer added to reduce computations is called a bottleneck layer.
Auxiliary loss (aka auxiliary classifiers)
By adding auxiliary classifiers connected to these intermediate layers, we would expect to encourage discrimination in the lower stages of the classifier, increase the gradient signal that gets propagated back, and provide additional regularization. (I don't understand what is meant by discrimination in earlier stages.)
There are more concepts introduced in the later Inception versions (v2, v3, v4).
ResNet (by Kaiming He et al from Microsoft in 2015)
The residual network introduced a 152-layer network architecture that set new records in classification, localization and detection tasks. It won ILSVRC 2015 with an incredible error rate of 3.6%, which beats even human-level performance. Although it is a very deep neural network, thanks to the idea they brought it has lower complexity than VGG.
As VGG has 16- and 19-layer architectures, ResNet has 18-, 34-, 50-, 101- and 152-layer architectures.
Architecture
Notes:
1-The architecture in figure is 34-layer so if you want 152 layer map it from the table.
2-This model was trained on 8 GPUS.
3-Number of trainable parameters: ResNet-50 → 25.6 million, ResNet-101 → 44.5 million, ResNet-152 → 60.2 million.
Main concept
1-It introduced the idea of residual blocks, which largely eliminates the training-degradation (vanishing gradient) problem as the number of layers increases. (We can now reach 1000 layers by using this idea.)
2-Used the idea of NiN together with the residual network.
Residual block
A residual block is based on a skip connection: we feed the output of two or more successive convolutional layers forward AND also bypass the input to the later layer. F(x) is the output of passing through the two or more successive convolutional layers.
So now, instead of passing F(x) to the next ReLU, we pass F(x) + x to the next ReLU (or any activation function in general).
Note:
The idea of residuals was used in Inception v4 as
shown
ResNeXt (by Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He in 2016)
A modified version that was the runner-up in ILSVRC 2016. It replaces the standard residual block with one that leverages a "split-transform-merge" strategy (i.e. branched paths within a cell), as used in the Inception models. It is similar to the architecture of ResNet except for two things:
1) All 3×3 and smaller 1×1 conv layers (not bigger ones) are doubled in the number of filters
DenseNet (by Gao Huang, Zhuang Liu, Laurens van der Maaten and Kilian Weinberger in 2016)
Architecture
By densely connecting each layer to all previous layers, the deeper layers would have a very large input size.
They controlled the size by making the number of filters added in each layer constant; this number of filters is denoted by k and is called the "growth rate", as it controls how fast the layers grow and prevents the network from growing too wide. Each successive layer will have k more channels than the previous one (as a result of accumulating and concatenating all previous layers into its input).
Notes:
1-They also used a 1x1 convolutional bottleneck layer to reduce the number of feature maps
before the expensive 3x3 convolution.
2-The DenseNet has a modified version with a compression factor θ. It is used to reduce the number of output feature maps: instead of having m feature maps at a certain layer, we will have θ·m. Of course, θ is in the range [0–1], so DenseNets stay the same when θ=1. These DenseNets are called DenseNet-BC.
3-Dense connections have a regularizing effect, which reduces overfitting on tasks with smaller training set sizes.
SENet (by Jie Hu, Li Shen, Samuel Albanie, Gang Sun, Enhua Wu in late 2017)
Squeeze-and-Excitation network was the winner of ILSVRC 2017, achieving an error rate of 2.25% in the classification task. This is the final network to be discussed here.
Architecture
Main concept
Increase channels inter-dependencies by squeeze-excitation module
How are squeezing and excitation done? In other words, what is a squeeze-and-excitation module?
Squeezing is done using average pooling of each feature map: we apply an AvgPool filter of the same size as the feature map, so the result is a single numeric value per feature map; that is why it is called global average pooling. We now have a vector whose length equals the number of feature maps, and each feature map is then weighted by its corresponding value.
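A sketch of a squeeze-and-excitation block in tf.keras (the reduction ratio r=16 in the excitation bottleneck is an assumed, commonly used value):

import tensorflow as tf

def se_block(x, r=16):
    c = x.shape[-1]
    s = tf.keras.layers.GlobalAveragePooling2D()(x)           # squeeze: one number per feature map
    s = tf.keras.layers.Dense(c // r, activation="relu")(s)   # excitation (bottleneck)
    s = tf.keras.layers.Dense(c, activation="sigmoid")(s)     # one weight in [0, 1] per feature map
    s = tf.keras.layers.Reshape((1, 1, c))(s)
    return tf.keras.layers.Multiply()([x, s])                 # re-weight each feature map

inputs = tf.keras.Input(shape=(28, 28, 64))
outputs = se_block(inputs)
model = tf.keras.Model(inputs, outputs)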
Notes:
1-The authors show that by adding SE blocks to ResNet-50 you can expect almost the same accuracy as ResNet-101 delivers, which is impressive for a model requiring only half of the computational cost.
2-Use SE blocks with the later layers, as they are more specialized for each class.
Transfer learning (aka Inductive transfer)
Introduction
Humans have an inherent ability to transfer knowledge across tasks. What we acquire as
knowledge while learning about one task, we utilize in the same way to solve related tasks. The
more related the tasks, the easier it is for us to transfer, or cross-utilize our knowledge. Some
simple examples would be,
Know how to ride a motorbike ⮫ Learn how to ride a car
Know how to play classic piano ⮫ Learn how to play jazz piano
Know math and statistics ⮫ Learn machine learning
In each of the above scenarios, we don’t learn everything from scratch when we attempt to learn
new aspects or topics. We transfer and leverage our knowledge from what we and others have
learnt in the past! Conventional machine learning and deep learning algorithms, so far, have
been traditionally designed to work in isolation. These algorithms are trained to solve specific
tasks. The models have to be rebuilt from scratch once the feature-space distribution
changes. Transfer learning came to put an end to rebuilding from scratch and to make use of what others
have already reached as a starting point. Transfer learning is not a machine learning model or technique; it is
rather a ‘design methodology’ within machine learning.
What is meant by transfer learning?
Transfer learning is a machine learning design methodology where a model trained on one task
is re-purposed on a second related task by reusing the model trained on first task as the starting
point for a model on a second task. It is mostly used in computer vision and natural language
processing tasks. Hence instead of starting the learning process from scratch, you start from
patterns that have been learned from solving a related task.
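A hedged sketch of this methodology in PyTorch, assuming torchvision's pretrained ResNet-18 is available (the 10-class head, learning rate and optimizer are only illustrative; newer torchvision versions use a weights= argument instead of pretrained=True):

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Start from a model pretrained on ImageNet instead of random weights.
backbone = models.resnet18(pretrained=True)

# Freeze the pretrained feature extractor so only the new head is trained.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final classification layer with one for our new, related task (say 10 classes).
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

# Only the new layer's parameters are passed to the optimizer.
optimizer = torch.optim.SGD(backbone.fc.parameters(), lr=1e-3, momentum=0.9)
```

Unfreezing some of the later backbone layers (fine-tuning) is a common next step when the new dataset is large enough.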
y = [Pc, bx, by, bh, bw, c1, c2, c3]ᵀ, where:
Pc is the probability that an object is present, (bx, by, bh, bw) describe the bounding box, and c1, c2, c3 are the class indicators.
Note: If the image is background then Pc = 0 and all the remaining elements will be don't cares (?).
Loss function
ℓ(ŷ, y) = (ŷ1 − y1)² + (ŷ2 − y2)² + … + (ŷ5+c − y5+c)²   if y1 = 1 (object present)
ℓ(ŷ, y) = (ŷ1 − y1)²                                      if y1 = 0 (background)
Note: 5+c represents the length or size of vector y, so if we have 3 classes as in our previous
example then y1 = Pc, y2,…,y5 = (bx, by, bh & bw) and y6, y7, y8 = C1, C2, C3.
Landmark detection
Instead of having bounding box with (bx, by, bh & bw) as an output to localize an object, here we
can have some cases where we need just x & y co-ordinates of some important points in an
image to be recognized, these points are called landmarks.
For example, assume you have a face recognition task where you want to
know where the corner of the right eye is, so this in fact is a single co-ordinate
(x, y) that represents the corner of the eye (red dot in the eye's corner).
So now our final layer will produce co-ordinates lx, ly to represent this
landmark.
Similarly, if you want 4 landmarks that represent the 2 corners of both eyes,
our final layer will have to produce (l1x, l1y), (l2x, l2y), (l3x, l3y), (l4x, l4y) to
represent the 4 landmarks (2 corners of each eye).
Of course landmark detection is not used only for face recognition; it has
many other applications, for example a person's pose detection (walking,
running, bending, kicking, …).
Real-world applications for landmark detection: Among the famous real-world applications is the
augmented reality used in Snapchat filters, which detects landmarks of the face to put the filter on
your face; also face unlock in recent smartphones uses landmark detection.
Sliding window for object detection
We have said before that object detection is mainly having
multiple objects of similar or different classes in an image
and our task is to localize each object and assign class
label to it. The basic idea to do that is to have a dataset of
cropped images that is labeled to a specific class say we
have cars or no cars as a simple example as shown
then train our network on this dataset using convnet
similarly to what we were doing in image classification
tasks (now training phase is over) then apply sliding window of specific size over the test image
and crop this window and pass it to the trained network to see whether there is a car or not in
this cropped window as shown below in the test image:
…
As shown above we have slid a window over the test image and each time we crop this
window and pass it to the convnet to say whether there is a car in the window or not, and if there
is a car it saves the bounding box position relative to the test image as a whole.
But maybe the sliding window size was not right, which is why it couldn't detect the car; that is
why we apply the sliding window several times, each time with a different window size:
This method has a problem of high computational cost, as we are iteratively running several
sliding windows with different sizes, each sliding window moves along the whole image,
and each time the cropped part is passed through a convnet, which is expensive.
Before getting in how to solve this problem, let us conquer another idea then get back to the
solution for this problem that this idea will be used in.
Turning Fully Connected (FC) layer into convolutional layer
Now we want to have a fully connected layer but in form of convolutional layer, so what we do
is that we will convolve the last layer before the required FC layer with n filters each of the
same dimension (and channels of course) of this last layer before required FC layer where n is
the length of FC layer so in our example the layer before the first FC is of dimension 5×5×16 so
we will convolve it with 400 filters each filter is of dimension 5×5×16 so each convolution will
result in a 1×1 result (single value) and since we have 400 filters so the final output will be of
dimension 1×1×400 as shown:
What about the next FC layer ? Similarly, we will look at the layer before it so its size is
1×1×400 and the size needed for FC is also 400 so we will convolve this layer with 400 filters
each of them is of size 1×1×400 so that the output now is of size 1×1×400 as shown:
Similarly with the softmax layer which has 4 outputs (4 classes) so to form this layer we should
convolve the layer before it with 4 filters each of dimension 1×1×400 to end up with an output
of dimension 1×1×4 as shown:
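A small sketch in PyTorch verifying this equivalence, using the 5×5×16 → 400 → 400 → 4 sizes from the example above (PyTorch is channel-first, so 1×1×400 appears as 400×1×1):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 5, 5)                        # the 5x5x16 volume before the first "FC" layer

fc_as_conv1 = nn.Conv2d(16, 400, kernel_size=5)     # 400 filters of size 5x5x16  -> 1x1x400
fc_as_conv2 = nn.Conv2d(400, 400, kernel_size=1)    # 400 filters of size 1x1x400 -> 1x1x400
softmax_as_conv = nn.Conv2d(400, 4, kernel_size=1)  # 4 classes                   -> 1x1x4

out = softmax_as_conv(torch.relu(fc_as_conv2(torch.relu(fc_as_conv1(x)))))
print(out.shape)   # torch.Size([1, 4, 1, 1])  i.e. a 1x1x4 output, just like the FC version
```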
Now, how do we solve the problem of the high computational cost of sliding windows?
This is solved using the OverFeat method, which depends on convolutions. Assume we have input
RGB images of size 14×14 and the same training network we used before as shown:
So now our window is of size 14×14 so if we assume that our test image is of size 16×16 and
we want to apply this 14 by 14 sliding window with stride=2 (means that every time we slide
the window over the testing image by shifting 2 pixels) so by doing that we will find that we
have 4 possible windows to be applied as shown on the test image:
What we were doing is that each of these 4 windows crops the test image and
passes the cropped photo to the convnet to decide whether there is an object of the 4 classes or
not, but this is an expensive computation as we repeat it 4 times, once for each of the 4
windows. Keep in mind that this is a simple example where there are only 4
possible windows: with stride=1 there would be 16 possible windows, and if the image is larger
and the sliding window is smaller there would be many more. What the OverFeat method says
is to apply the 16×16 test image as it is, without any cropping, to the same pre-trained network;
the output will contain all 4 possible outcomes of the 4 windows internally, so the final
output will be 2×2 instead of 1×1 and we get 4 output vectors, where each one represents
the output of one of the windows as shown:
So the vector with red arrow is for the red window, the vector with green arrow is for the green
window and so on … So now we have 4 vectors (representing the 4 windows) each of length
4 (representing the 4 classes) so instead of doing the same path 4 times for each window, now
we do it just one time with the same result.
Similarly, if the test image is 28×28 and we have the same 14×14 sliding window slid
with stride=2, we now have 64 possible windows, so instead of doing 64 passes through the
trained network, each with one of the 64 cropped windows, we now pass the image once and get
the whole output at once as shown:
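To make this concrete, here is a toy fully convolutional network (not the actual OverFeat architecture; the layer sizes are chosen only so that a 14×14 crop maps to a single prediction). Applying the same network to larger test images directly yields the grid of window predictions described above:

```python
import torch
import torch.nn as nn

# Toy fully-convolutional "classifier": trained on 14x14 crops, 4 output classes.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(),    # 14x14 -> 10x10
    nn.MaxPool2d(2),                               # 10x10 -> 5x5
    nn.Conv2d(16, 400, kernel_size=5), nn.ReLU(),  # "FC" as conv: 5x5 -> 1x1
    nn.Conv2d(400, 4, kernel_size=1),              # softmax layer as conv: 4 class scores
)

print(net(torch.randn(1, 3, 14, 14)).shape)  # torch.Size([1, 4, 1, 1]): one window
print(net(torch.randn(1, 3, 16, 16)).shape)  # torch.Size([1, 4, 2, 2]): the 4 windows at once
print(net(torch.randn(1, 3, 28, 28)).shape)  # torch.Size([1, 4, 8, 8]): the 64 windows at once
```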
For each grid cell we will have a label y for training, where y = [Pc, bx, by, bh, bw, c1, c2, c3]ᵀ. As we have 9 grid cells,
the grid cells containing no object center will have Pc = 0 with the rest of the vector as don't cares (?),
while the 2 green grid cells (those containing the two object centers) will have Pc = 1 with the car class set,
assuming that we have 3 classes and class 2 is the cars class. So now the output size is 3×3×8.
So during training we pass an input image, divided into a grid of any size (3×3 in our
example), into the conv network and get an output as shown above, including a more precise
bounding box with any aspect ratio, unlike the sliding window which has a specific aspect
ratio determined by the size of the sliding window applied all over the image.
Notes:
1-There is a problem with this implementation in the case that more than 1 object's center lies in
the same grid cell: only 1 object will be identified and the rest will be neglected.
2-As the grid size gets smaller (instead of 3×3 we can have 19×19) the chance that more than 1
object center lies in the same grid cell decreases.
3- The process on one image is done at once which means that I'm not applying each grid cell to
a convnet to apply classification and localization but the whole image is passed at once as we
have done with the sliding window using convolutional layers.
How bx, by, bh, bw are represented?
1-bx and by are values between 0 and 1 that represent how far each co-ordinate is from the
upper-left corner of the grid cell, relative to the grid cell size (the cell size is always taken as 1). This
means that if bx=0.5 then the x co-ordinate of the center of the bounding box of the object in
this grid cell is half the way from the left side of the grid cell → right side of the grid cell,
and if by=0.7 then y co-ordinate of the center of the bounding box of the object in this grid cell
is 7/10 the way from the upper side of the grid cell → lower side of the grid cell
2- bh and bw can be any value larger than 0 so if bh say equals 0.9 so this means that the height
of the bounding box of the object in this grid cell is 0.9 the height of the grid cell (0.9×height of
the grid cell) and similarly for the width.
So if bw is larger than 1 (say 1.5) then this
means that width of the bounding box is 1.5
the width of the grid cell (1.5× width of the
grid cell).
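A small illustrative helper (the grid size, cell indices and image size are just example values, not part of any specific paper) that converts a grid-cell-relative prediction into absolute box coordinates:

```python
def decode_box(bx, by, bh, bw, row, col, grid_size, img_w, img_h):
    """Convert a prediction relative to grid cell (row, col) into absolute image coordinates."""
    cell_w, cell_h = img_w / grid_size, img_h / grid_size
    center_x = (col + bx) * cell_w           # bx, by are fractions of the cell (0..1)
    center_y = (row + by) * cell_h
    box_w = bw * cell_w                      # bh, bw are multiples of the cell size (can be > 1)
    box_h = bh * cell_h
    # return (x_min, y_min, x_max, y_max)
    return (center_x - box_w / 2, center_y - box_h / 2,
            center_x + box_w / 2, center_y + box_h / 2)

# Example: 3x3 grid on a 300x300 image, object in cell (row=1, col=2) with
# bx=0.5, by=0.7, a box 1.5 cells wide and 0.9 cells high.
print(decode_box(bx=0.5, by=0.7, bh=0.9, bw=1.5, row=1, col=2, grid_size=3, img_w=300, img_h=300))
```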
Note:
All what we have said in this section (Bounding box prediction) is introduced mainly in what is
called by YOLO algorithm that will be discussed later after summing all the ideas introduced in
it.
IoU = (area of intersection between the predicted box and the ground truth box) / (area of their union)
So if IoU = 0 so there is no overlapping between the boxes and if IoU =1 so the predicted box is
perfectly matching with the ground truth (actual) box. We use IoU to tell how much we would
penalize our algorithm by comparing to a threshold (say 0.5) so if IoU is greater than the
threshold then this is a good acceptable predicted boundary box and otherwise (<threshold) then
we should penalize our algorithm for this bad prediction.
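A minimal sketch of computing IoU for two boxes given as (x_min, y_min, x_max, y_max) corners:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)   # 0 if the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ≈ 0.143 (small overlap)
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 (perfect match)
```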
Non-Max suppression
Non-max suppression is used to find only one box for each object. As we said previously, we could use
different sliding windows to find an object, so these sliding windows would in fact detect the same object
several times; more generally, with any other detection technique there could be several boxes detecting
the same object, as shown in the figure. So how could we make our algorithm choose only the best
box for each object and reject the others? Note that the value written on each box is the highest
probability among the classes of the network: if a network has 4 classes (cats, dogs, persons
& cars), each class has its own probability, as said before in the softmax regression discussion,
and the highest probability (called Pc) is the class predicted by the network for this input test
image, so we take the highest probability and write it along with the boundary box. Let us get
back to the non-max suppression algorithm:
Algorithm
1-Discard all boxes associated with low probabilities (Pc < threshold, say 0.6).
2-While there are remaining boxes (boxes with Pc ≥ 0.6), loop over the following:
Take the box with highest probability(Pc) among all the remaining boxes and keep it.
Perform IoU calculation between this kept box and all the remaining boxes and discard
any box of the remaining ones with IoU > threshold (say 0.5) because this means that
most probably these boxes are detecting the same object of the kept box.
Repeat the 2 previous steps again on the non-discarded boxes (IoU<0.5) because these
boxes are most probably belonging to another object in the image.
Example
Now we will work on the example at the top of the page so firstly
we will pick the highest probability among them all (0.9) and keep
it (in white), then now we apply IoU with all the remaining 4 boxes
so we found out that 2 of the remaining boxes(0.6 and 0.7 boxes)
are having high IoU with this box in white so we will discard them.
Now we have remaining 2 non-discarded boxes (on the left) so we
will take the highest among these 2 boxes (0.8 box in white) then
perform IoU with the remaining 1 box (0.7 box on the left) so we
found out that it has high IoU so we will discard it. Eventually we
have no remaining boxes so final output is the 2 white boxes with
0.9 and 0.8 class probability.
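A minimal sketch of the algorithm above, reusing the iou helper from the previous sketch (the 0.6 and 0.5 thresholds are the example values; the box coordinates below are made up for illustration):

```python
def non_max_suppression(boxes, scores, score_thresh=0.6, iou_thresh=0.5):
    """boxes: list of (x_min, y_min, x_max, y_max); scores: list of Pc values."""
    # 1) Discard boxes with low probability.
    candidates = [(s, b) for s, b in zip(scores, boxes) if s >= score_thresh]
    candidates.sort(key=lambda sb: sb[0], reverse=True)    # highest Pc first

    kept = []
    while candidates:
        best_score, best_box = candidates.pop(0)           # 2) keep the highest-Pc box
        kept.append((best_score, best_box))
        # Discard remaining boxes that overlap it too much (most likely the same object).
        candidates = [(s, b) for s, b in candidates if iou(b, best_box) < iou_thresh]
    return kept

boxes = [(10, 10, 60, 60), (12, 12, 58, 58), (100, 40, 160, 90), (105, 45, 158, 92)]
scores = [0.9, 0.6, 0.8, 0.7]
print(non_max_suppression(boxes, scores))   # keeps only the 0.9 and 0.8 boxes
```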
Anchor boxes
We mentioned a problem before and skipped it: each grid cell in the image
can only have 1 object, so if there is more than 1 object the rest will be neglected. Here
comes the idea of anchor boxes, to be able to get all the objects whose centers lie in the
same grid cell.
Assuming having an image like this shown and there are 2 objects
(a person and a car) where both the centers of the 2 objects are in
the same grid cell. Anchor boxes are introduced so that each of
them has a different aspect ratio, so if we assume we have 2
anchor boxes (there could be more as you like), y is now not a single 8-element vector but one such vector stacked per anchor box (16 elements for 2 anchor boxes).
So now each part of the output is detecting objects that mostly can be represented by a box with
the aspect ratio of those anchor boxes so for our example Anchor box 1 will mostly fit the
person and anchor box 2 will mostly fit the car, noting that in each grid cell we can't detect more
object centers than the number of anchor boxes (if the no. of anchor boxes = 2, the
maximum number of object centers to be detected within one grid cell is 2). That is why we
use a finer grid (19×19) with a higher number of anchor boxes (around 5), so that there is
almost no chance of missing any object in the image.
Note: we determine the best fit anchor box based on the highest IoU
between the anchor boxes (purple) and the ground truth box (red)
Output size now (assuming grid cells 3×3) is [3×3× no. of anchor boxes × (5 + no. of classes)]
Our Example:
You Only Look Once (YOLO) Algorithm
In fact most of the topics discussed after the sliding window topic are introduced in the YOLO
algorithm paper so this part we will gather all what we said in a single algorithm called YOLO.
For training:
In training we have images labeled as shown, assuming 3×3 grid cells, 2
anchor boxes and 3 classes, so the output is 9 vectors each of length 16 (2 anchor boxes × (5+3)),
such that for the given training example there will be 8 vectors of just 0's and don't cares (?)
while there will be 1 vector with Pc = 0 for the first anchor box and Pc = 1 for the second, as the ground
truth box is wider than it is tall, so its IoU with anchor box 2 is greater.
You will notice in the bottom right of the image that we can use 19×19 grid cells and also we
can use 5 anchor boxes (5 anchor boxes × (5+3) = 40).
After this image passes through the network we will have an output of size 3×3×2×8 as shown.
Don't forget that in the test phase we will do non-max suppression to discard boxes pointing at the
same object as well as boxes with low probability. Take into consideration
that all grid cells will output boxes (as many as the number of anchor boxes), and
non-max suppression is the technique that takes care of discarding these boxes, especially
since most of them will have low probabilities and some will point to the same object.
Note:
YOLO algorithm is usually used for real-time application as it is a fast run-time algorithm
compared to other algorithms.
Region Proposals
This idea is the main basis of some networks as R-CNN, Fast R-CNN, Faster R-CNN,…
As we said before, sliding windows is somewhat inefficient and time consuming, as we need to
slide the window over the whole image, so most probably a lot of windows will be
background containing no objects, which is a waste of time and resources. What R-CNN
did is apply region proposal: we still have windows, but more
reasonable ones, and to determine these
reasonable windows a segmentation
algorithm is used (right photo).
As we see here, segmentation gives
some blobs with different colors, so the
generated windows are placed to cover those
blobs and check whether they contain an object or not, and the number of windows here is of course much
less than what would be generated by the normal sliding window algorithm.
R-CNN was too slow, so Fast and then Faster R-CNN were introduced.
Localization and detection, Overfeat, Image Segmentation, R-CNN and its variants, SSD,
YOLO and its variants, other algorithms
Localization VS. Detection VS. Instance segmentation VS. Semantic segmentation
We all know the classification task where there will be an object in an image and your network
has to tell the probability of each class of your pre-defined classes concerning this image such
that the class with the highest probability is the predicted class of the object in the image. But
what if we want not just to say what the class of the object in the image is, but also
to detect its position in the image? Is that all? No: what if there are multiple objects in the image
and you want to detect all the objects, determine their positions, and then classify each object? Here
come some terminologies like localization, detection & segmentation.
Localization: it refers to having a single object in the image, so we have an object in the image
and want to detect its position by drawing a bounding box around the object and classify it.
Detection: it refers to having multiple objects in the same image (objects could be of the same
class or different classes) and your task is to determine the position of each object by drawing a
bounding box around each object and classify it.
Instance segmentation: this means detecting each object's outline at a pixel level and classifying it,
such that if we have 3 objects (2 cars and 1 person) we have to detect each of the cars and the
person separately at the pixel level.
Semantic segmentation: it refers to per-pixel classification, which means labeling every single pixel
with the class it belongs to, without distinguishing between different instances of the same class.
Note: Instance means that we are dealing with each object as a different object even if they
belong to the same class, while in detection or semantic segmentation (non-instance based), if we find
2 objects of the same class we put a bounding box around each object (in case of detection) or color
their pixels (in case of semantic segmentation) with the same color. (In semantic segmentation all
balloons are pixel-colored with the same color, in detection all balloons have bounding boxes of the
same color, in instance segmentation each balloon is colored differently as a single instance although
they are all of the same class "balloon".)
Before going deeply with the recent CNN models used in deep learning for object detection as
R-CNN and its variant, SSD, YOLO, etc… Let us think of object detection approaches starting
from the most basic naïve idea till we reach deep learning approaches and their progress till we
reached recent CNN models.
Different approaches for solving object localization and detection
Assume our example that we will cope with along
this part is hypothetically building a pedestrian
detection system for a self-driving car supposing
your car captures an image like this one
What is our target ? creating a bounding box
around these people, so that the system can
pinpoint where in the image the people are, and
then accordingly make a decision as to which path
to take, in order to avoid any mishaps.
Say the classifier found that the upper-right part of the image contains pedestrians, so it will
draw a bounding box including this whole part as a region to be avoided by the car, as shown:
Good as starting point but still we need more precise bounding boxes around the objects.
Approach 2: Increase the number of divisions
Instead of having only 4 patches with fixed sizes to be given to our trained classifier, let us
use several patches with different sizes to be passed, and store each patch's size and
position to act as bounding boxes as shown:
A lot of boxes! But the good thing is that now we have some boxes closer to the two objects we
have than in the previous method, so we are somewhat closer to finding more precise and
accurate bounding boxes.
Fewer bounding boxes than the previous approach and closer to the right bounding boxes
needed. We are a few steps closer to the precise bounding boxes.
Let us take a look on deep learning as a whole in detection problems before going with CNN.
Approach 5: Using deep learning for feature selection and to build an end-to-end approach
Deep learning has so much potential in the object detection space. Can you recommend where
and how we can leverage it for our problem? I have listed a couple of methodologies below:
Instead of taking patches from the original image, we can pass the original image through
a neural network to reduce the dimensions
We could also use a neural network to suggest selective patches
We can train a deep learning algorithm to give predictions as close to the original
bounding box as possible. This will ensure that the algorithm gives tighter and finer
bounding box predictions
Now instead of training different neural networks for solving each individual problem, we can
take a single deep neural network model which will attempt to solve all the problems by itself.
The advantage of doing this, is that each of the smaller components of a neural network will
help in optimizing the other parts of the same
neural network. This will help us in jointly
training the entire deep model.
Our output would give us the best performance
out of all the approaches we have seen so far,
somewhat similar to this image.
We have the normal path for applying classification as we have learnt:
Input image → stack of conv layers → pooling → stack of conv layers → pooling → … → stack
of fully-connected (FC) layers → fully-connected softmax layer.
The softmax layer holds the class scores (probabilities) that determine each class's
probability, such that the class with the highest probability is the predicted class. However, that is
all for classification; what about localization? What modifications do we need to apply to perform
localization, so that we can then generalize it to perform detection and segmentation? Here comes the
first published neural-net-based object localization architecture, called OverFeat, which integrates
classification, localization and detection in one conv network in 2013, noting that AlexNet in
2012 was supposed to do object localization with some modifications but the authors didn't publish the
implementation.
Overfeat (by Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus,
Yann LeCun in 2013/2014)
This was the winner of ILSVRC 2013 in localization task as well as detection task and ranked
fourth in classification task. The main point of this paper is to show that training a
convolutional network to simultaneously classify, locate and detect objects in images can boost
the classification accuracy and the detection and localization accuracy of all tasks. The paper
proposes a new integrated approach to object detection, recognition, and localization with a
single ConvNet.
Back to our algorithm. In case the idea is still not clear, let us
see the steps of the training process, noting that the OverFeat
model architecture is similar to the AlexNet model:
Train a CNN model (similar to AlexNet) on the image classification task.
Then, we replace the top classifier layers by a regression network and train it to predict
object bounding boxes at each spatial location and scale. The regressor is class-specific,
each generated for one image class.
o Input: Images with classification and bounding box.
o Output: (x, y, width, height), 4 values in total, representing co-ordinates of the
center of the box in addition to width and height of this box
o Loss: The regressor is trained to minimize L2 norm between generated bounding
box and the ground truth for each training example.
Computing the Average Precision (AP) for a particular object detection pipeline is essentially a
three step process:
Compute the precision, which is the proportion of predictions that are true positives.
Compute the recall which is the proportion of true positives out of all possible positives.
Average together the maximum precision value across all recall levels in steps of size s.
To compute the precision we first apply our object detection algorithm to an input image. Each
bounding box normally has something called confidence score. The bounding box scores are
then sorted in descending order by their confidence till a certain threshold (differs from
competition to another) and all boxes with confidence scores less than this threshold are
discarded.
We know from a priori knowledge (i.e., it’s a validation/testing example and we therefore know
the total number of objects in the image) there are 5 objects in this image for specific class. We
seek to determine how many “correct” detections our network made. A “correct” prediction
here is one where we have a matching classification with the ground truth label and minimum
IoU of 0.5 (this value is tunable depending on the challenge but 0.5 is a standard value).
Here is where the calculation starts to
become a bit more complicated. We need
to compute the precision at
different recall values (also called “recall
levels” or “recall steps”).
For example, assume after ordering the
boundary boxes according to their
confidence class, this was the output
If we take the box with the highest confidence score (rank #1), it was correct, which means
that it is correctly classified and has IoU > 0.5. So the only prediction
we have made is right, hence precision is 1/1 = 1, and for recall we have detected 1 object out of the 5
objects of this class we know are in the image, so recall = 1/5 = 0.2.
Let us take the second highest confidence score (#rank2) so it was also a correct prediction so
this means that out of the 2 predictions we have done there are 2 correct predictions so precision
= 2/2=1 and recall now is detecting 2 objects out of the 5 objects we need to detect. Hence,
recall= 2/5 =0.4.
Let us now take the third highest confidence score (#rank3) so it was false prediction which
means that it is either false classified or IoU<0.5. This means that out of the 3 predictions we
have so far we have 2 correct ones and 1 false one. So precision is 2/3=0.667 while the recall
now is still getting only 2 objects out of the 5 objects we required to get. It is still 2 as this
prediction (third highest confidence score) was not correct.
Similarly we will go along the first top N predictions (usually 10) and see the precision and
recall of each rank.
You will notice as we go further in the table, the recall is increasing as we are more likely to
find the next object needed till reaching 1 if we found all objects.
Let's now plot precision versus recall of this class (P-R curve) in this image:
We always use a recall axis of 11 steps:
recall = ȓ = (0, 0.1, 0.2, 0.3, …, 0.9, 1).
But this plot is very zigzag-shaped, so
we smooth it by replacing each
precision value with the maximum
precision for any recall ≥ ȓ.
For example, at ȓ=0.4 we look at all
precision values for ȓ=0.4 or higher
(ȓ=0.4, 0.5, 0.6, …, 1) and take the
highest value, therefore for ȓ=0.4 we stick to precision=1. Similarly for ȓ=0.5 we look at all
precision values with ȓ=0.5 or larger, and we find that the highest precision available is 0.57.
Note that at ȓ=0.4 the highest precision is 1, but for ȓ=0.400000001 the highest precision will be
0.57, so there will be a falling edge from 1 to 0.57 at ȓ=0.4. Let us see now the new P-R curve.
Now how do we calculate AP?
AP (average precision) is computed as the
average of the maximum precision at these 11
recall levels:
AP = (1/11) × Σ pinterp(ȓ) over ȓ ∈ {0, 0.1, …, 1}, where pinterp(ȓ) is the maximum precision for any recall ≥ ȓ.
This is close to finding the total area under the green curve and dividing it by 11.
For more precise definition:
The final recall and max precision for each value of recall is shown in this table
So now AP for our example is (5 × 1.0 + 4 × 0.57 + 2 × 0.5) /11=0.7527
We now have our average precision for a single evaluation image and a
single class. We repeat this for every class in the image, then we calculate AP for
all images in the testing/validation dataset. Now we have a list containing the AP for
each class in each image. We do 2 more steps to calculate the mean of AP:
Compute the mean of the APs for each class, giving us a mAP for each
individual class (for many datasets/challenges you’ll want to examine the
mAP class-wise so you can spot if your deep learning object detector is
struggling with a specific class)
mAPclassi = (1/n) × Σj APi,j , where n is the number of images in the dataset.
Note: if an image doesn't contain classi, its AP will be equal 1.
Take the mAPs for each individual class and then average them together, yielding the
final mAP for the dataset: mAP = (1/C) × Σi mAPclassi , where C
is the total number of classes.
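A small sketch of the 11-point interpolated AP computation described above (the correct/incorrect flags below are illustrative values chosen to be consistent with the example table, since the full table came from a figure):

```python
def eleven_point_ap(is_correct, num_gt):
    """is_correct: list of True/False for predictions sorted by descending confidence."""
    precisions, recalls = [], []
    tp = 0
    for i, correct in enumerate(is_correct, start=1):
        tp += correct
        precisions.append(tp / i)          # precision after the top-i predictions
        recalls.append(tp / num_gt)        # recall after the top-i predictions

    ap = 0.0
    for r in [level / 10 for level in range(11)]:   # recall levels 0.0, 0.1, ..., 1.0
        # interpolated precision: max precision at any recall >= r (0 if that recall is never reached)
        p_interp = max((p for p, rec in zip(precisions, recalls) if rec >= r), default=0.0)
        ap += p_interp / 11
    return ap

# 10 ranked predictions, 5 ground-truth objects of this class in the image.
flags = [True, True, False, False, False, True, True, False, False, True]
print(eleven_point_ap(flags, num_gt=5))   # ≈ 0.75, matching the ≈0.7527 computed above
```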
Notes:
1-PASCAL VOC is a popular dataset for object detection. For the PASCAL VOC challenge, a
prediction is positive if IoU > 0.5. However, if multiple detections of the same object are
detected, it counts the first one as a positive while the rest as negatives.
2-For COCO, AP is the average over multiple IoU (the minimum IoU to consider a positive
match). AP@[.5:.95] corresponds to the average AP for IoU from 0.5 to 0.95 with a step size of
0.05. For the COCO competition, AP is the average over 10 IoU levels on 80 categories
(AP@[.50:.05:.95]: start from 0.5 to 0.95 with a step size of 0.05).
3-Averaging over the 10 IoU thresholds rather than only considering one generous threshold of
IoU ≥ 0.5 tends to reward models that are better at precise localization.
Region-based convolution neural network R-CNN and its variants (Fast & Faster R-CNN)
Those papers were done by Ross Girshick and colleagues (at UC Berkeley and Microsoft Research) in 2014, 2015 & 2016.
As we said before, object detection started with the simple idea of sliding windows over the
whole image several times, each time using a different window size and aspect ratio. Sliding
each window size over the whole image has a high computational cost, and we solved that using
the OverFeat method by changing each FC layer into a conv layer, but we still have the main
problem of trying different window sizes and aspect ratios. In R-CNN, the algorithm is split
into 3 main processes: a region proposal
step, a feature extraction step using a CNN, and a classification/regression step.
They introduced the idea of selective search so that
each image needs no more than 2000 proposals or
windows on which to perform convolutional operations
to extract features. In R-CNN,
the CNN architecture used for extracting features
from each of the 2000 proposed regions was
AlexNet. The output of each convnet (a fixed-length
feature vector from each region) is passed to 2
places as we said before a classification network for
classification and regressor for bounding boxes. The
classification network is a set of linear SVMs
classifier that are trained for each class and output classification while the boundary box
regressor is a linear regressor for tightening boundary boxes. Non-max suppression is then used
to suppress bounding boxes that have significant overlap with each other (discussed in the main
course and nothing more to be said in it)
Let us now go deeply in understanding how the image segmentation is done then go back to see
how we will use it in the selective algorithm used for region proposal.
Image segmentation (Felzenszwalb’s Algorithm):
When there exist multiple objects in one image (true for almost every real-world photos), we
need to identify a region that potentially contains a target object so that the classification can be
executed more efficiently.
Felzenszwalb and Huttenlocher (2004) proposed an algorithm for segmenting an image into
similar regions using a graph-based approach. It is also the initialization method for Selective
Search (a popular region proposal algorithm) that we are gonna discuss later.
Say we use an undirected graph G=(V,E) to represent an input image (remember the graph theory
mentioned in the computational graph part of the machine learning chapter).
One vertex vi ∈ V represents one pixel. One edge e=(vi, vj) ∈ E connects
two vertices vi and vj. Its associated weight w(vi, vj) measures the
dissimilarity between vi and vj.
The dissimilarity can be quantified in dimensions like color, location, intensity, etc. The higher
the weight, the less similar two pixels are. A segmentation solution S is a partition of V into
multiple connected components, {C}. Intuitively similar pixels should belong to the same
components while dissimilar ones are assigned to different components.
A connected component (or just component) of an undirected
graph is a subgraph in which any two vertices are connected to each
other by paths, and which is connected to no additional vertices in
the supergraph. For example, the graph shown in the illustration has
three connected components. A vertex with no incident edges is
itself a connected component. A graph that is itself connected has
exactly one connected component, consisting of the whole graph.
The connected components of the graph are taken to be the segments in
the image segmentation (so we will refer to segments and connected components
interchangeably).
Let us now show the algorithm, knowing that it follows a bottom-up procedure: at the
beginning of the algorithm there are no edges, so each vertex is its own connected component.
So if the image is w by h pixels, there are w×h connected components (we will call it n
components). Then we start adding edges between pairs of vertices, so we will have m
edges (e1, e2, …, em), where m equals nC2 (n choose 2), and we weight each of them.
Note: At the beginning as each vertex (pixel) is of different component so point 3 in step 3
won't exist but after we finish the m edges and merge the components related to each other to
form a segment so we end up with some segments (components) we repeat again over these
components to form bigger segments and that is a bottom up approach.(see next part)
How does the selective search algorithm used by R-CNN work?
It is done as follows:
At the initialization stage, apply Felzenszwalb and Huttenlocher’s graph-based image
segmentation algorithm to create regions to start with.
Use a greedy algorithm to iteratively group regions together:
o First the similarities between all neighbouring regions are calculated.
o The two most similar regions are grouped together, and new similarities are
calculated between the resulting region and its neighbours.
The process of grouping the most similar regions (Step 2) is repeated until the whole
image becomes a single region.
Hence start to propose the 2000
region proposals as shown in the
figure
The idea of spatial pyramid pooling that paved the way for Fast R-CNN to be developed
Still, R-CNN was very slow, because running a CNN on the 2000 region proposals generated by
selective search takes a lot of time. SPP-Net tried to fix this. With SPP-net, we calculate the
CNN representation for the entire image only once and use it to calculate the CNN
representation for each patch generated by selective search. This can be done by performing a
pooling type of operation on JUST that section of the feature maps of last conv layer that
corresponds to the region. The rectangular section of conv layer corresponding to a region can
be calculated by projecting the region on conv layer by taking into account the downsampling
happening in the intermediate layers.
There was one more challenge: we need to generate the fixed size of input for the fully
connected layers of the CNN so, SPP introduces one more trick. It uses spatial pooling (aka RoI
pooling) after the last convolutional layer as opposed to traditionally used max-pooling. SPP
layer divides a region of any arbitrary size into a constant number of bins (grids) and max pool
is performed on each of the bins. Since the number of bins is known, we can just concatenate
the SPP outputs to give a fixed length
representation as determined in this
figure
However, there was one big drawback
with SPP net, it was not trivial to
perform back-propagation through
spatial pooling layer. Hence, the
network only fine-tuned the fully
connected part of the network. SPP-
Net paved the way for more popular
Fast RCNN which we will see next.
Notes:
1-256 here determines the number of filters.
2-the number of bins is fixed regardless of the image size.
3-Visualize the RoI pooling [https://i.stack.imgur.com/rJL7D.gif]
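A hedged sketch of this fixed-size pooling idea using PyTorch's adaptive max pooling (the feature-map size, the region location and the 4×4/2×2/1×1 bin grids are illustrative; the point is that the output length does not depend on the region size):

```python
import torch
import torch.nn.functional as F

# A feature map from the last conv layer: 256 filters, arbitrary spatial size.
feature_map = torch.randn(1, 256, 13, 13)

# Suppose a region proposal projects onto rows 2..10 and columns 3..12 of the feature map.
region = feature_map[:, :, 2:10, 3:12]        # shape (1, 256, 8, 9): arbitrary size

# Divide the region into a fixed grid of bins and max-pool each bin.
pooled_4x4 = F.adaptive_max_pool2d(region, output_size=(4, 4))   # (1, 256, 4, 4)
pooled_2x2 = F.adaptive_max_pool2d(region, output_size=(2, 2))   # (1, 256, 2, 2)
pooled_1x1 = F.adaptive_max_pool2d(region, output_size=(1, 1))   # (1, 256, 1, 1)

# Concatenate the flattened bins -> fixed-length vector regardless of the region's size.
spp_vector = torch.cat([p.flatten(start_dim=1) for p in (pooled_4x4, pooled_2x2, pooled_1x1)], dim=1)
print(spp_vector.shape)   # torch.Size([1, 5376]) = 256 * (16 + 4 + 1) values
```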
Fast R-CNN in 2015
Fast RCNN uses the ideas from SPP-net and RCNN and fixes the key problem in SPP-net
i.e. they made it possible to train end-to-end. To propagate the gradients through spatial
pooling, It uses a simple back-propagation calculation which is very similar to max-pooling
gradient calculation with the exception that pooling regions overlap and therefore a cell can
have gradients pumping in from multiple regions.
One more thing that Fast R-CNN did is that it added the bounding box regression to the neural
network training itself. So now the network has two heads: a classification head and a bounding
box regression head. This multitask objective is a salient feature of Fast-RCNN as it no longer
requires training of set of networks independently for classification(SVMs) and
localization(Boundary box regressors). These two changes reduce the overall training time and
increase the accuracy in comparison to SPP net because of the end to end learning of CNN.
Remember: RoI pooling layer is a
type of pooling layer which
performs max pooling on inputs
(here, convnet feature maps) of
non-uniform sizes and produces a
small feature map of fixed size
(say 7x7). The choice of this fixed
size is a network hyper-parameter
and is predefined.
The main function of ROI layer is to reshape inputs with arbitrary size into a fixed length output
because of size constraint in Fully Connected layers as well as speed up the training and test
time and also to train the whole system from end-to-end (in a joint manner).
Summary
Note: We can notice that Fast R-CNN is about 25x faster than R-CNN, Faster R-CNN is 10x
faster than Fast R-CNN and 250x faster than R-CNN.
So far, all the methods discussed handled detection as a classification problem by building a
pipeline where first object proposals are generated and then these proposals are sent to
classification/regression heads. However, there are a few methods that pose detection as a
regression problem. Two of the most popular ones are SSD and YOLO. These detectors are also
called single shot detectors. Let’s have a look at them.
Single-Shot MultiBox Detector (by W Liu from Google in 2015)
Single-shot multi-box detector (SSD) presents an object detection model using a single deep
neural network combining regional proposals and feature extraction. It reached new records in
terms of performance and precision for object detection tasks, scoring over 74% mAP
(mean Average Precision) at 59 frames per second on standard datasets such
as PascalVOC and COCO.
What is meant by its name?
Single Shot: this means that the tasks of object localization and classification are done in
a single forward pass of the network unlike R-CNN and its variants.
MultiBox: this is the name of a technique for bounding box regression developed by
Szegedy et al. (we will briefly cover it shortly)
Detector: The network is an object detector that also classifies those detected objects.
Architecture
SSD’s architecture builds on the venerable VGG-16 architecture, but discards the fully
connected layers.
Algorithm
Concretely, given an input image and a set of ground truth labels, SSD does the following:
Pass the image through a series of convolutional layers, yielding several sets of feature
maps at different scales (e.g. 10x10, then 6x6, then 3x3, etc.)
For each location in each of these feature maps, use a 3x3 convolutional filter to evaluate
a small set of default bounding boxes. These default bounding boxes are essentially
equivalent to Faster R-CNN’s anchor boxes.
For each box, simultaneously predict a) the bounding box offset and b) the class
probabilities
During training, match the ground truth box with these predicted boxes based on IoU.
The best predicted box will be labeled a “positive,” along with all other boxes that have
an IoU with the truth >0.5.
What is the problem of making the RPN inside the network as a single shot & how to solve it?
With the previous two models, the region proposal network ensured that everything we tried to
classify had some minimum probability of being an “object.” With SSD, however, we skip that
filtering step. We classify and draw bounding boxes from every single position in the image,
using multiple different shapes, at several different scales (anchor boxes). As a result, we
generate a much greater number of bounding boxes than the other models, and nearly all of
them are negative examples.
To fix this imbalance, SSD does two things. Firstly, it uses non-maximum suppression to group
together highly-overlapping boxes into a single box. In other words, if four boxes of similar
shapes, sizes, etc. contain the same dog, Non-Max Suppression would keep the one with the
highest confidence and discard the rest. Secondly, the model uses a technique called hard
negative mining to balance classes during training. In hard negative mining, only a subset of the
negative examples with the highest training loss (i.e. false positives) are used at each iteration of
training. SSD keeps a 3:1 ratio of negatives to positives.
MultiBox’s loss function also combined two critical components that made their way into SSD:
Confidence Loss: this measures how confident the network is of the objectness of the
computed bounding box. Categorical cross-entropy is used to compute this loss.
Location Loss: this measures how far away the network’s predicted bounding boxes are
from the ground truth ones from the training set. L2-Norm is used here.
Multibox_loss= confidence loss + α × location loss where α balances the contribution of
location loss.
Why is each output feature map convolved?
SSD predicts bounding boxes after multiple convolutional layers (feature maps) in order to
handle scales as each convolutional layer is able to detect objects of different scale from another
convolutional layer. For example here is SSD in action ↓
In smaller feature maps (e.g. 4x4), each cell covers a larger region of the image, enabling them
to detect larger objects. Region proposal and classification are performed simultaneously:
given p object classes, each bounding box is associated with a (4+p)-dimensional vector that
outputs 4 box offset coordinates and p class probabilities. In the last step, softmax is again used
to classify the object.
What is hard negative mining? why do we need it?
During training, as most of the bounding
boxes will have low IoU and therefore be
interpreted as negative training examples, we
may end up with a disproportionate amount
of negative examples in our training set.
Therefore, instead of using all negative
predictions, it is advised to keep a ratio of
negative to positive examples of around 3:1.
The reason why you need to keep negative
samples is because the network also needs to
learn and be explicitly told what constitutes
an incorrect detection.
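A hedged sketch of hard negative mining over per-box classification losses (tensor names, shapes and loss values are illustrative, not SSD's actual implementation):

```python
import torch

def hard_negative_mining(cls_loss, is_positive, neg_pos_ratio=3):
    """Keep all positive boxes and only the hardest negatives (highest loss), at a 3:1 ratio.

    cls_loss:    (num_boxes,) per-box classification loss
    is_positive: (num_boxes,) boolean mask of boxes matched to a ground truth object
    Returns a boolean mask of the boxes that contribute to the loss."""
    num_pos = int(is_positive.sum())
    num_neg = neg_pos_ratio * max(num_pos, 1)

    neg_loss = cls_loss.clone()
    neg_loss[is_positive] = -1.0                     # exclude positives from the negative ranking
    hardest = torch.topk(neg_loss, k=min(num_neg, len(cls_loss))).indices

    keep = is_positive.clone()
    keep[hardest] = True
    return keep

cls_loss = torch.tensor([0.1, 2.3, 0.05, 1.7, 0.4, 3.0, 0.2, 0.9])
is_positive = torch.tensor([True, False, False, False, False, False, False, False])
print(hard_negative_mining(cls_loss, is_positive))
# keeps the 1 positive box plus the 3 hardest negatives (indices 1, 3, 5)
```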
Summary of SSD operation (For more detailed explanation: https://bit.ly/2wqox7L)
The model takes an image as input which passes through multiple convolutional layers with
different sizes of filter (10x10, 5x5 and 3x3). Feature maps from convolutional layers at
different position of the network are used to predict the bounding boxes. They are processed by
a specific convolutional layers with 3x3 filters called extra feature layers to produce a set of
bounding boxes similar to the anchor boxes of the Fast R-CNN.
Each box has 4 parameters: the coordinates of the center, the width and the height. At the same
time, it produces a vector of probabilities corresponding to the confidence over each class of
object. Non-max suppression and hard negative mining are then used.
You Only Look Once (YOLO) and its versions
Loss function
YOLO predicts multiple bounding boxes per grid cell. To compute the loss for the true positive,
we only want one of them to be responsible for the object. For this purpose, we select the one
with the highest IoU (intersection over union) with the ground truth. This strategy leads to
specialization among the bounding box predictions. Each prediction gets better at predicting
certain sizes and aspect ratios.
YOLO uses sum-squared error between the predictions and the ground truth to calculate loss.
The loss function composes of:
the classification loss.
the localization loss (errors between the predicted boundary box and the ground truth).
the confidence loss (the objectness of the box).
We do not want to weight absolute errors in large boxes and small boxes equally, i.e. a 2-pixel
error should matter less for a large box than for a small box. To partially address this, YOLO predicts the
square root of the bounding box width and height instead of the width and height directly. In addition, to
put more emphasis on boundary box accuracy, we multiply the loss by λcoord (default: 5).
For confidence loss: If an object is detected in the box, the confidence loss
(measuring the objectness of the box) is:
Non-max suppression:
YOLO can make duplicate detections for the same object. To fix this, YOLO applies non-
maximal suppression to remove duplicates with lower confidence. Non-maximal suppression
adds 2-3% in mAP.
Here is one of the possible non-maximal suppression implementation:
Sort the predictions by the confidence scores.
Start from the top scores, ignore any current prediction if we find any previous
predictions that have the same class and IoU > 0.5 with the current prediction.
Repeat step 2 until all predictions are checked.
Note: SSD is a strong competitor to YOLO, and at one point demonstrated higher accuracy
for real-time processing. Compared with region-based detectors, YOLO has higher localization
error and its recall (a measure of how well it locates all objects) is lower. YOLOv2 is the second
version of the YOLO with the objective of improving the accuracy significantly while making it
faster.
Yolov2 (by Joseph Redmon, Ali Farhadi in 25 Dec 2016)
What are the improvements made in this version?
Batch normalization: Add batch normalization in all convolution layers. This removes the
need for dropouts and pushes mAP up 2%.
High resolution classifier: After being trained on 224×224 images, YOLOv2 also uses 448×448
images for fine-tuning the classification network for 10 epochs on ImageNet. This gives
the network time to adjust its filters to work better on higher-resolution input. We then
fine-tune the resulting network on detection. This makes the detector training easier and
moves mAP up by 4%. Higher-resolution images are accepted as input: the YOLO model
uses 448×448 images while YOLOv2 uses 608×608 images, thus enabling the
detection of potentially smaller objects.
Convolutional with anchor boxes: YOLOv2 removes all fully connected layers and uses
anchor boxes to predict bounding boxes. One pooling layer is removed to increase the
resolution of the output. Using anchor boxes we get a small decrease in accuracy: YOLO
uses a 7×7 grid with 2 bounding boxes, which predicts only (7×7×2) = 98 boxes per
image, but with anchor boxes our model predicts 19×19×5 = 1805 anchor boxes per image,
as we use 5 anchor boxes per grid cell. Without anchor boxes our intermediate model gets
69.5 mAP with a recall of 81%; with anchor boxes our model gets 69.2 mAP with a
recall of 88%. Even though the mAP decreases, the increase in recall means that our
model has more room to improve.
Dimension cluster: In many problem domains, the boundary boxes have strong patterns.
For example, in the autonomous driving, the 2 most common boundary boxes will be cars
and pedestrians at different distances. To identify the top-K boundary boxes that have the
best coverage for the training data, we run K-means clustering on the training data to
locate the centroids of the top-K clusters. Since we are dealing with boundary boxes
rather than points, we cannot use the regular spatial
distance to measure datapoint distances. No surprise, we
use IoU. IoU is done between anchor boxes and ground
truth boxes noting that number of clusters determines the
number of anchor boxes. as number of anchors increases,
the accuracy increases then start to settle down as shown
and YOLOv2 settles down with 5 anchors.
Direct location prediction: I didn't understand the idea
Fine-Grained features: Convolution layers decrease the spatial dimension gradually. As
the corresponding resolution decreases, it is harder to detect small objects. Other object
detectors like SSD locate objects from different layers of feature maps. So each layer
specializes at a different scale. YOLO adopts a different approach called passthrough that
brings features from an earlier layer (finer low-level features that helps in detecting
smaller objects) and concatenate it with the normal output before making predictions.
Hence, it takes the earlier layer output that has the shape of 28×28×512 and reshaped it to
14×14×2048. We also have the final output layer of shape 14×14×1024. So we
concatenate the reshaped earlier layer output with final output to form an output layer of
14×14×3072 that now can make good predictions for both small and large objects.
Multi-scale training: For every 10 batches, new image dimensions are randomly chosen.
The image dimensions are {320, 352, …, 608} by step 32.The network is resized and
continue training. This acts as data augmentation and forces the network to predict well
for different input image dimension and scale. In additional, we can use lower resolution
images for object detection at the cost of accuracy. This can be a good tradeoff for speed
on low GPU power devices. At 288 × 288 YOLO runs at more than 90 FPS with mAP
almost as good as Fast R-CNN. At high-resolution YOLO achieves 78.6 mAP on VOC
2007.
Accuracy:
The last thing in this part: the YOLOv2 paper's title is "Better, Faster, Stronger". The
improvements we were talking about relate to the "better" part. For "faster", they developed a
framework called Darknet and built a network called Darknet-19 that can reach 200 FPS. For
"stronger", they combined the COCO and ImageNet datasets by using WordTree to combine classes of
related objects together.
Darknet-19 architecture
With DarkNet, YOLO achieves 72.9% top-1
accuracy and 91.2% top-5 accuracy on ImageNet.
Darknet uses mostly 3 × 3 filters to extract features
and 1 × 1 filters to reduce output channels. It also
uses global average pooling to make predictions. We
replace the last convolution layer (the crossed-out
section) with three 3 × 3 convolutional layers each
outputting 1024 output channels. Then we apply a
final 1 × 1 convolutional layer to convert the 7 × 7 ×
1024 output into 7 × 7 × 125. (5 boundary boxes
each with 4 parameters for the box, 1 objectness
score and 20 conditional class probabilities).
Darknet-19 can work on 45 FPS.
Faster-RCNN gives very good accuracy but low FPS regarding real-time application. SSD gives
better FPS so that we can get real-time application into work but lower accuracy. Then came
YOLO that dives more towards real-time application by giving more FPS but still lower
accuracy than both of Faster R-CNN and SSD and still the trade-off exists between accuracy
and FPS and it will remain. RetinaNet then showed up to give more ideal balance between
accuracy and speed while being simple and open source and hence come into action. Let us see
what it tells.
RetinaNet (Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár in 2017)
Architecture
RetinaNet is a composite network composed of:
A backbone network called Feature Pyramid Net, which is built on top of ResNet and is
responsible for computing convolutional feature maps of an entire image.
a subnetwork responsible for performing object classification using the backbone’s
output (feature pyramid net's output).
a subnetwork responsible for performing bounding box regression using the backbone’s
output (feature pyramid net's output).
What is the feature pyramid net (FPN)?
RetinaNet adopts the Feature Pyramid Network (FPN) as its backbone, which is in turn built on
top of ResNet (ResNet-50, ResNet-101 or ResNet-152) in a fully convolutional fashion. The
fully convolutional nature enables the network to take an image of an arbitrary size and outputs
proportionally sized feature maps at multiple levels in the feature pyramid. The construction of
FPN involves two pathways (Bottom-up pathway & Top-down pathway) which are connected
with lateral connections. They are described as below.
Bottom-up pathway→ Recall that in ResNet, some
consecutive layers may output feature maps of the
same scale; but generally, feature maps of deeper
layers have smaller scales/resolutions. The bottom-
up pathway of building FPN is accomplished by
choosing the last feature map of each group of
consecutive layers that output feature maps of the
same scale. These chosen feature maps will be used
as the foundation of the feature pyramid.
Top-down pathway→ Using nearest neighbor
upsampling, the last feature map from the bottom-up
pathway is expanded to the same scale as the
second-to-last feature map. These two feature maps
are then merged by element-wise addition to form a
new feature map. This process is iterated until each
feature map from the bottom-up pathway has a
corresponding new feature map connected with
lateral connections.
There are altogether five levels in the pyramid (the figures only shows three).
History of feature pyramids till we reached FPN.
In real-world object detection, objects from the
same class may be presented in a wide range of
scales in images. This leads to some decrease in
detection accuracy, especially for small objects.
This is because feature maps from higher levels of
the pyramid are spatially coarser, though
semantically stronger. Therefore, only using the
last feature map of a network to make the
prediction is less ideal as in fig(b).
One solution would be to generate different scales of an image and feed them to the network
separately for prediction as fig(a). This approach is termed “featurized image pyramid” and was
widely adopted before the era of deep learning. However, since each image needs to be fed into
the network multiple times, this approach also introduces a significant increase in test time,
making it impractical for real-time applications.
Another solution would be to simply use multiple feature maps generated by a ConvNet for
prediction as in fig(c), and each feature map would be used to detect objects of different scales.
This is an approach adopted by some detectors like SSD. However, although the approach
requires little extra cost in computation, it is still sub-optimal since the lower feature maps
cannot sufficiently obtain semantical features from the higher ones.
Finally, we turn to FPN. As mentioned, FPN is built in a fully convolutional fashion which can
take an image of an arbitrary size and output proportionally sized feature maps at multiple
levels. Higher-level feature maps contain grid cells that cover larger regions of the image and are
therefore more suitable for detecting larger objects; on the contrary, grid cells from lower-level
feature maps are better at detecting smaller objects(check this figure ↓). With the help of the
top-down pathway and lateral connections , which
do not require much extra computation, every level
of the resulting feature maps can be both
semantically and spatially strong. These feature
maps can be used independently to make
predictions and thus contributes to a model that is
scale-invariant and can provide better performance
both in terms of speed and accuracy.
With this rescaling (the focal loss down-weighting of well-classified examples), the large number of
easily classified examples (mostly background) no longer dominates the loss, and learning can
concentrate on the few interesting cases.
All our discussion about the focal loss so far was on the classification part only. For the
regression part, the loss is the usual L1 or L2 loss between predictions and ground-truth labels
(L1 was used here in RetinaNet as it is more robust to outliers).
Assuming the classification loss is denoted by Lclass and the regression (detection) loss is denoted
by Ldetection, the total loss equation will be: Lclass + λ Ldetection, noting that λ is a hyper-parameter
for balancing the contribution of each task loss in the total loss.
Note: λ also has an impact on the class imbalance as γ so selecting them should be taken care of
as they interact with each other. Generally, as you increase γ, λ should be decreased slightly.
Semantic segmentation
Semantic segmentation is a natural step in the progression from coarse to fine inference:
The origin could be located at classification, which consists of making a prediction for a
whole input.
The next step is localization / detection, which provides not only the classes but also
additional information regarding the spatial location of those classes.
Finally, semantic segmentation achieves fine-grained inference by making dense
predictions, inferring labels for every pixel, so that each pixel is labeled with the class of
its enclosing object or region. [visualize GIF: https://bit.ly/2AxkE3I]
Improvement:
Conditional Random Field (CRF)
post-processing is usually used to
improve the segmentation. CRFs
are graphical models which
‘smooth’ segmentation based on the
underlying image intensities. They
work based on the observation that
similar intensity pixels tend to be
labeled as the same class. CRFs can
boost scores by 1-2%.
Note: In the previous CRF illustration figure, (b) the unary classifier is the segmentation input to
the CRF, and (c, d, e) are variants of CRF, with (e) being the widely used one.
Fully convolutional network FCN (by Jonathan Long, Evan Shelhamer, Trevor Darrell
from UC Berkeley in 2014)
FCNs owe their name to their architecture, which is built only from locally connected layers,
such as convolution, pooling and upsampling. Note that no dense layer (FC) is used in this kind
of architecture. This reduces the number of parameters and computation time. Also, the network
can work regardless of the original image size, without requiring any fixed number of units at
any stage, given that all connections are local.
The final output layer will be the same height and width as the input image, but the number of
channels will be equal to the number of classes. If we’re classifying each pixel as one of fifteen
different classes, then the final output layer will be height x width x 15 classes.
To obtain a segmentation map (output), segmentation networks usually have 2 parts :
Downsampling path : capture semantic/contextual information
Upsampling path : recover spatial information
The downsampling path is used to extract and interpret the context (what), while
the upsampling path is used to enable precise localization (where). Furthermore, to fully recover
the fine-grained spatial information lost in the pooling or downsampling layers, we often use
skip connections to transfer local information from the downsampling path to the upsampling
path, either by concatenating or by summing (as in the feature pyramid network FPN), as sketched below.
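Here is a tiny PyTorch sketch of the two paths with one skip connection merged by summation (concatenation, as in U-Net, would work similarly); the layer widths and input size are assumptions for illustration only:

import torch
from torch import nn
import torch.nn.functional as F

class TinySegNet(nn.Module):
    def __init__(self, n_classes=15):
        super().__init__()
        self.down1 = nn.Conv2d(3, 32, 3, padding=1)          # downsampling path (what)
        self.down2 = nn.Conv2d(32, 64, 3, padding=1)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)     # upsampling path (where)
        self.out = nn.Conv2d(32, n_classes, 1)                # per-pixel class scores

    def forward(self, x):
        d1 = F.relu(self.down1(x))                     # full resolution, 32 channels
        d2 = F.relu(self.down2(F.max_pool2d(d1, 2)))   # half resolution, 64 channels
        u1 = self.up(d2)                               # back to full resolution
        u1 = u1 + d1                                   # skip connection by summation
        return self.out(u1)                            # H x W x n_classes map

y = TinySegNet()(torch.randn(1, 3, 64, 64))            # -> 1 x 15 x 64 x 64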
As explained above, the upsampling paths of the FCN variants are different, since they use
different skip connection layers and strides for the last convolution, yielding different
segmentations. The further back we go to retrieve finer spatial information, the better the
segmentation we get, as shown.
We can notice that the idea is simple, but we still lack one piece of information that is used
extensively in this context, which is upsampling, or more precisely learnable upsampling
(deconvolution or backward convolution).
This is the convolution matrix. Each row defines one convolution operation. If you do not see it,
the below diagram may help. Each row of the convolution matrix is just a rearranged kernel
matrix with zero padding in different places.
The output is a 4×1 matrix that can be reshaped to a 2×2 matrix. If you check the
normal convolution we did at the beginning, you will find that this method
ends up with the same result.
The point is that with the convolution matrix, you can go from 16 (4×4) to 4 (2×2) because the
convolution matrix is 4×16. Hence, if you have a 16×4 matrix, you can go from 4 (2×2) to 16
(4×4).
I think now you got what we will do next. We
just transpose the kernel matrix we have formed
and just multiply it by the 4×1 output to perform
the up-sampling operation.
We have just up-sampled a smaller matrix (2×2)
into a larger one (4×4). The transposed
convolution maintains the 1 to 9 relationship
because of the way it lays out the weights.
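A small numpy sketch of the same idea (my own illustration): build the 4×16 convolution matrix for a 3×3 kernel over a 4×4 input, go 16 → 4 with it, then use its transpose to go 4 → 16. The kernel and input values are arbitrary examples.

import numpy as np

K = np.arange(1, 10).reshape(3, 3).astype(float)   # example 3x3 kernel (values 1..9)
X = np.arange(16).reshape(4, 4).astype(float)      # example 4x4 input

# Build the 4x16 convolution matrix: one row per output position,
# each row is the kernel laid into a zero-padded 4x4 grid, then flattened.
C = np.zeros((4, 16))
for r, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    grid = np.zeros((4, 4))
    grid[i:i + 3, j:j + 3] = K
    C[r] = grid.ravel()

y = C @ X.ravel()        # forward convolution: 16 -> 4, reshape to 2x2
up = C.T @ y             # transposed convolution: 4 -> 16, reshape to 4x4
print(y.reshape(2, 2))
print(up.reshape(4, 4))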
[Figure: encoder-decoder architecture with a DOWN (contracting) path and an UP (expanding) path]
For those who are familiar with traditional convolutional neural networks, the first part (denoted as
DOWN) of the architecture will be familiar. This first part is called the down path, or you may think
of it as the encoder part, where you apply convolution blocks followed by maxpool downsampling to
encode the input image into feature representations at multiple different levels. The second part
of the network consists of upsample and concatenation followed by regular convolution
operations. Upsampling uses transposed convolutions as discussed before.
While upsampling and going deeper in the network, we concatenate the higher-resolution
features from the down part with the upsampled features in order to better localize and learn
representations with the following convolutions. Since upsampling is a sparse operation, we need
a good prior from earlier stages to better represent the localization. A similar idea of combining
matching levels is also seen in FPNs.
By inspecting the figure more carefully, you may notice that the output dimensions (388 × 388) are
not the same as the original input (572 × 572). If you want a consistent size, you may apply
padded convolutions to keep the dimensions consistent across concatenation levels.
Weight-map
Since touching objects are placed close to each other, they are easily merged by the network.
To separate them, a weight map is applied to the output of the network.
To compute the weight map as above, d1(x)
is the distance to the nearest cell border at
position x, d2(x) is the distance to the
second nearest cell border.
Thus, at the border, weight is much higher
as in the figure(Segmentation map on the
left and weight map on the right).
Thus, the cross-entropy function is
penalized at each position by the weight
map, and this helps to force the network to
learn the small separation borders between
touching cells.
The weight map is not related to the
network weights but to the sample weights: during the training, you feed three pieces of
information to the network, 1) the pixel intensities, 2) the pixel labels, 3) the pixel weights.
Using the binary map as on the left, you will be able to determine the borders and hence give them
more weight, and then use this weight in the cross-entropy loss (similar to what we did in the focal loss), as sketched below.
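A minimal numpy sketch of the weighted pixel-wise cross-entropy (an illustration, not the paper's code), assuming probs are per-pixel softmax probabilities, labels are the pixel labels, and w is the precomputed weight map:

import numpy as np

def weighted_cross_entropy(probs, labels, w):
    # probs: H x W x C softmax outputs, labels: H x W integer classes, w: H x W weight map
    h, wd = labels.shape
    p_true = probs[np.arange(h)[:, None], np.arange(wd)[None, :], labels]
    # border pixels get a larger w, hence a larger penalty when misclassified
    return np.mean(w * -np.log(p_true + 1e-12))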
There was something called the overlap-tile strategy mentioned in the paper, but I didn't understand
its idea and can't find a good explanation that gives the intuition behind it.
Before going to the last semantic segmentation technique to be discussed, I want to make
something clear. Upsampling in general has many techniques, not only transposed convolution;
transposed convolution is the only technique -I know- that is learnable. However, there are other
upsampling techniques that are not learnable and just follow a simple algorithm, such as
nearest-neighbor interpolation and bilinear interpolation.
Bilinear interpolation
A single pixel value is calculated as the weighted average of the known values based on distances.
Given a 2×2 matrix, how do we do bilinear interpolation to make it 4×4?
Simply put the known values on the edges like that and do normal linear interpolation on each
axis as follows:
Along the x-axis, between value 1 & 2, we need 2 pixel values,
according to the linear interpolation equation yinterpolated = y1 + (x − x1)(y2 − y1)/(x2 − x1), noting that y
represents the pixel value and x represents the pixel position.
Then the value of the pixel at the second position = 1 + (2 − 1)(2 − 1)/(4 − 1) = 1.333
and the value of the pixel at the third position = 1 + (3 − 1)(2 − 1)/(4 − 1) = 1.666
and similarly we do that for every single position till we finish the whole matrix.
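The 1-D interpolation above can be checked with numpy's np.interp, and a 2-D bilinear upsampling can be done with scipy's zoom (order=1 is bilinear); note that zoom's grid alignment differs slightly from the hand calculation above, so this is only a sketch of the same idea:

import numpy as np
from scipy.ndimage import zoom

# values 1 and 2 placed at positions 1 and 4; interpolate positions 2 and 3
x_known, y_known = [1, 4], [1, 2]
print(np.interp([2, 3], x_known, y_known))   # [1.333..., 1.666...]

# 2-D bilinear upsampling of a 2x2 matrix to 4x4
m = np.array([[1.0, 2.0], [3.0, 4.0]])
print(zoom(m, 2, order=1))                   # 4x4 result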
DilatedNet (by F Yu in late 2015)
DilatedNet introduced 2 new concepts, which are dilated (atrous) convolution and
multi-scale context aggregation.
Dilated convolution
Dilated convolution is a simple idea: it looks exactly the same as the normal convolution operation,
except that we introduce a notation l that defines the gap between the kernel values, known as the
dilation factor, and the gap is filled with zeros, hence has no effect. If l = 1, it is the normal
convolution with no gaps between the values. See the examples:
The importance of dilated convolution is obvious: it increases the receptive field seen by
the network (global view). This increase in receptive field is exponential while the number of
parameters grows only linearly. One general use is image segmentation, where each pixel is labelled
with its corresponding class. In this case, the network output needs to be of the same size as the
input image. A straightforward way to do this is to apply convolutions and then add deconvolution
layers to upsample. However, this introduces many more parameters to learn. Instead, dilated
convolution is applied to keep the output resolution high, and it avoids the need for upsampling.
The figure below shows dilated convolution on 2-D data. Red dots are the inputs to a filter, which
is 3×3 in this example, and the green area is the receptive field captured by each of these inputs.
The receptive field is the implicit area on the initial input captured by each input (unit) of the next
layer (see the sketch below).
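A quick PyTorch sketch of the receptive-field effect: stacking 3×3 convolutions with dilations 1, 2, 4 (stride 1, padding chosen to keep the resolution) keeps the output the same size as the input while the receptive field grows quickly. The channel counts here are assumptions, not from the paper.

import torch
from torch import nn

layers = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1, dilation=1),   # receptive field 3x3
    nn.Conv2d(8, 8, 3, padding=2, dilation=2),   # receptive field 7x7
    nn.Conv2d(8, 8, 3, padding=4, dilation=4),   # receptive field 15x15
)
x = torch.randn(1, 1, 64, 64)
print(layers(x).shape)   # torch.Size([1, 8, 64, 64]) -- resolution kept, no upsampling needed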
Multi-scale context aggregation (The context module)
The context module is designed to increase the performance of dense prediction architectures by
aggregating multi-scale contextual information by using dilated convolution. The module takes
C feature maps as input and produces C feature maps as output. The input and output have the
same form, thus the module can be plugged into existing dense prediction architectures.
The context module architecture:
The context module has 7 layers that apply 3×3 convolutions with different dilation factors. The
dilations are 1, 1, 2, 4, 8, 16, and 1.
The last layer applies 1×1 convolutions to map the number of channels back to that of the
input. Therefore, the input and the output have the same number of channels, and the module can be
inserted into different kinds of convolutional neural networks.
The basic context module has only 1C channels throughout the module, while the large
context module has an increasing number of channels, from 1C at the input to 32C at the 7th layer.
A sketch of the basic module is given below.
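A PyTorch sketch of the basic context module under the description above (C channels throughout, 3×3 dilated convolutions with dilations 1, 1, 2, 4, 8, 16, 1, then a final 1×1 convolution). The padding values and ReLU placement here are my assumptions to keep the spatial size, not the paper's exact padding scheme.

import torch
from torch import nn

def basic_context_module(C):
    dilations = [1, 1, 2, 4, 8, 16, 1]
    layers = []
    for d in dilations:
        layers += [nn.Conv2d(C, C, 3, padding=d, dilation=d), nn.ReLU(inplace=True)]
    layers += [nn.Conv2d(C, C, 1)]   # 1x1 conv maps back to C channels
    return nn.Sequential(*layers)

m = basic_context_module(64)
print(m(torch.randn(1, 64, 32, 32)).shape)   # same shape in, same shape out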
This module can be used with the architecture of any other network, but in their paper VGG-16 is
used as the front-end module. The last two pooling and striding layers are removed entirely,
and the context module is plugged in. The padding of the intermediate feature maps is also
removed; the authors only pad the input feature maps by a width of 33. Zero padding and reflection
padding yielded similar results in their experiments. Also, a weight initialization that considers the
number of input and output channels is used instead of standard random initialization.
====================================================================
That is enough for semantic segmentation; however, there are more recent networks such as
DeepLabv3, RefineNet & PSPNet that can be looked up. In case you want a quick
review of the concepts they add, you can check this: [http://blog.qure.ai/notes/semantic-
segmentation-deep-learning-review]
Semantic segmentation is relatively easier compared to its big brother, instance segmentation.
In instance segmentation, our goal is to not only make pixel-wise predictions for every person,
car or tree but also to identify each entity separately as person 1, person 2, tree 1, tree 2, car 1,
car 2, car 3 and so on. The current state-of-the-art algorithm for instance segmentation is Mask
R-CNN. Before going forward to read about Mask R-CNN, revise Faster R-CNN and the idea of the
RoI pooling layer that was described in SPP-Net and Fast R-CNN.
Mask-RCNN (by Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick in 2017)
Before going deeply into the algorithm, we can say that the actual model we use to solve the
problem of instance segmentation is much simpler than you might think. Instance
segmentation can essentially be solved in 2 steps:
Perform a version of object detection to draw bounding boxes around each instance of a
class
Perform semantic segmentation on each of the bounding boxes
This amazingly simple model actually performs extremely well. It works because, if we assume
step 1 to have high accuracy, then semantic segmentation in step 2 is provided a set of images
which are guaranteed to have only 1 instance of the main class. The job of the model in step 2 is
just to take in an image with exactly 1 main class, and predict which pixels correspond to the
main class (cat/dog/human etc.) and which pixels correspond to the background of the image.
Another interesting fact is that if we are able to solve the multi bounding box problem and
semantic segmentation problem independently, we have also essentially solved the task of
instance segmentation! The good news is that very powerful models have been built to solve
both of these problems, and putting the 2 together is a relatively trivial task.
Mask R-CNN does this by adding a branch to Faster R-CNN that outputs a binary mask that
says whether or not a given pixel is part of an object. The branch (in white in the above image),
as before, is just a Fully Convolutional Network on top of a CNN based feature map. Here are
its inputs and outputs:
Inputs: CNN Feature Map.
Outputs: Matrix with 1s on all locations where the pixel belongs to the object and 0s
elsewhere (this is known as a binary mask).
You may notice that the segmentation part is similar to what we used to see (FCN), but there is a
different term written here called RoI align. What is RoI align? Is it similar to the normal RoI
layer we know from SPP-Net? Let us see.
RoI align
When running the normal RoI layer without modifications on the original Faster R-CNN architecture,
the Mask R-CNN authors realized that the regions of the feature map selected by RoIPool were
slightly misaligned from the regions of the original image. Since pixel-level segmentation
requires much more fine-grained alignment than bounding boxes, Mask R-CNN improves the
RoI pooling layer (named the "RoIAlign layer") so that the RoI can be better and more precisely
mapped to the regions of the original image.
But what was the cause of the misalignment? Simply, it was rounding. Still didn't get it? OK,
let us assume we have an image of size 128×128 which has a region of interest of size 15×15.
After passing through the ConvNet, the resulting feature map is of size 25×25, so the size is
reduced by a factor of 128/25 = 5.12. The RoI should now be reduced by the same factor and
become 15/5.12 ≈ 2.929×2.929. The normal RoI layer did not output 2.929×2.929; instead it
rounded up to the nearest integer, so the region of interest becomes 3×3, which is not accurate,
hence leading to misalignment.
What did the authors of Mask R-CNN do to prevent such misalignment? They removed this rounding
(quantization) by using bilinear interpolation to compute the values at floating-point locations, as in the sketch below.
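As a small illustration (not the actual RoIAlign kernel, which samples several points per bin and pools them), fractional coordinates can be sampled directly with bilinear interpolation instead of being rounded; scipy's map_coordinates with order=1 does exactly that. The RoI position and sample grid here are made-up values.

import numpy as np
from scipy.ndimage import map_coordinates

feature_map = np.random.rand(25, 25)
scale = 25 / 128.0                           # 1 / 5.12, kept as a float, no rounding
# a 15x15 RoI starting at pixel (40, 40) in the original 128x128 image
ys = (40 + np.linspace(0, 15, 3)) * scale    # 3 sample rows at fractional positions
xs = (40 + np.linspace(0, 15, 3)) * scale
yy, xx = np.meshgrid(ys, xs, indexing="ij")
# bilinear sampling (order=1) at the fractional coordinates
samples = map_coordinates(feature_map, [yy.ravel(), xx.ravel()], order=1).reshape(3, 3)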
Loss function
In Faster R-CNN, there were two tasks to penalize, which are classification and localization, but
now we have a new task to be penalized, which is segmentation, so it is a multi-task loss function
that combines the losses of classification, localization and segmentation mask.
We are familiar now for both of classification and localization loss. What about segmentation
mask loss?
It is defined as the average binary cross-entropy loss, only including the k-th mask if the region is
associated with the ground-truth class k:
Lmask = −(1/m²) Σ over 1≤i,j≤m of [ yij log ŷkij + (1 − yij) log(1 − ŷkij) ]
where yij is the label of a cell (i, j) in the true mask, ŷkij is the predicted value of the
same cell in the mask learned for the ground-truth class k, and m×m is the mask dimension
for each RoI and each class (for K total classes).
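A small numpy sketch of the mask loss for a single RoI (my own illustration), assuming pred holds the K per-class m×m sigmoid masks and k is the ground-truth class:

import numpy as np

def mask_loss(pred, true_mask, k):
    # pred: K x m x m sigmoid outputs, true_mask: m x m binary labels, k: ground-truth class
    p = pred[k]
    bce = -(true_mask * np.log(p + 1e-12) + (1 - true_mask) * np.log(1 - p + 1e-12))
    return bce.mean()   # average binary cross-entropy over the m x m cells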
Summary of R-CNN
Course 4 Week 4
This week will be concerned with 2 real-world applications which are face recognition and
neural style transfer.
Face verification VS. Face recognition
Verification
Input Image + Name or ID
Output Whether the image belongs to the claimed person or not.
It is one to one (1:1) problem which means we compare 1 image to 1 ID or name.
Recognition
We have a database of K persons
Input Image
Output ID if the image is any of the K persons
It is one to all (1:K) problem which means we compare 1 image to all K data in the database.
A face recognition system is harder than face verification: a face verification system with 99%
accuracy may be acceptable, but if we use this system as a building block of a face recognition
system with K = 100, then with 1% error per person and K = 100 people we may be wrong
roughly 100 times as often, which is a high error rate. This means that we need the face
verification method to be better than 99% (maybe 99.9% accuracy).
One-shot learning
One of the challenges that face recognition systems face is that we need to solve the one-shot
learning problem, which means that for most face recognition applications you need to recognize
the person given just 1 single image, or in other words given 1 training example of that person's
face, which historically is an almost impossible task for deep learning to solve.
Example
Assume we have a database of 4 persons (K = 4) as shown, where each
person has a photo of his face. Now 2 persons arrive and we want
to check whether those 2 persons are in the database or are intruders; we
find that one of them is actually in the database and the other is not. The
issue here is that we are learning from one example to recognize a
person again, so the learning process will be poor due to the small dataset.
Note: If we supposed that we would train this with a ConvNet, the softmax would
have 5 outputs, not 4, because a "none of the above" choice must be added.
So here comes the question: how do we apply this similarity function? Using a Siamese network.
Siamese network
Usually we input an image (x(1)) to a network composed of convolutional, max pooling and fully
connected layers and then pass the output from the last FC layer to a softmax. Here we will do the
same sequence, except that we will remove the softmax layer (now the FC layer is the last
layer). Assume that the last FC layer is composed of 128 nodes; we will call its output f(x(1)) and
we can think of it as an encoding of x(1). So now the image x(1) is represented by a
vector of 128 values. If you want to build a face recognition system, you want to
compare two images (x(1) & x(2)), so what you do is apply x(2) to the same network with the
same parameters used for x(1).
So now as x(2) is a different
person as shown in the figure
so the resultant vector will be of
different 128 values (call it
f(x(2))) that encodes the x(2).
Now we have two vectors, each of length 128, which are f(x(1)) & f(x(2)), and to apply the similarity
function we compute d(x(1), x(2)) = ||f(x(1)) − f(x(2))||²
Triplet loss
To get good parameters for the network so that it produces a good encoding of the image, we
define and apply gradient descent on what is called the triplet loss, so we need to define the
learning objective of this system:
Our learning objective is to make the distance between the anchor image (original image) and
the positive image to be small and the distance between the anchor image and the negative
image to be as high as possible and from this learning objective the idea of triplet loss arises as
we always look at the 3 images (anchor (A), positive (P) & negative (N)) at a time.
Finally
Triplet loss function
Given 3 images (A, P & N):
ℓ(A, P, N) = max( ||f(A) − f(P)||² − ||f(A) − f(N)||² + α , 0 )
Cost function: J = Σ over i of ℓ(A(i), P(i), N(i)), summed over all training triplets.
Note: You can notice here that every person has only 10 images, so we can train the network
on a relatively small dataset using the triplet loss function and then use the final network for
one-shot learning in your face recognition system.
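A minimal numpy sketch of the triplet loss over a batch (my own illustration), assuming f_a, f_p, f_n are the 128-dimensional encodings of the anchor, positive and negative images:

import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    # squared L2 distances between the encodings
    pos_dist = np.sum((f_a - f_p) ** 2, axis=-1)
    neg_dist = np.sum((f_a - f_n) ** 2, axis=-1)
    # hinge: only penalize when the margin alpha is violated
    return np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0))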
Another formula:
Each hidden unit is dealing with patches of the input image that maximize its activation; assume
each hidden unit is dealing with 9 patches of the input image. Now let us pick one hidden unit from
layer 1. What we see is that its 9 patches are the patches that maximize this unit's activation, which
means that this neuron is dealing with edges inclined to the right. When we picked another unit,
we found that its 9 patches maximize its activation in a different way, meaning that this neuron is
dealing with edges inclined to the left. Repeating this for different neurons, we found that each
neuron is responsible for picking up some low-level feature, such as a left side being green.
So now we understand that hidden units in layer 1 are concerned with low-level features such as
edges with different inclinations, shades, colors and so on…
When we went further through the network for deeper layers we found out that hidden units
now are dealing with larger patches sizes that could formulate higher level features as
curvatures, geometrical shapes and more complex shapes as we go deeper.
[Figure: patches that maximize hidden-unit activations in Layer 2, Layer 3 and Layer 4]
Note: As we go deeper, patches are bigger and concerned shapes are more complex.
Neural style transfer
Neural style transfer allows you to re-create an existing image in a given style by merging them,
in such a way that the generated image is the original image (aka content) drawn in this style, as
shown in the figure below:
Cost function
J(G)= α Jcontent(C,G) + β Jstyle(S,G)
Note: It is a little bit redundant to use 2 hyper-parameters (α, β) when we could use only 1 for
weighting, but the original paper discussing this used 2 parameters, so we will follow that.
Now what the cost function does is measure how similar those images (C, G) are and
penalize them if they are different, with α & β controlling the balance between content and style:
Jcontent(C,G) = (1/2) ||a[l](C) − a[l](G)||², where a[l](C) and a[l](G) are the activations of a chosen hidden layer l for the content and generated images.
Style cost function Jstyle(S,G)
Say we are using layer l's activation to measure style so we will define the style that it is the
correlation between activations across channels in this layer l.
Correlation between activation across channels can be illustrated as shown:
Layer l is a block of dimensions nH, nW, nC, where nH, nW define the
height and width of the image in this layer and nC represents the
number of channels in this layer. Assume nC = 5, so if we
looked at first two channels (Red and yellow) so each activation in red channel has a
corresponding activation in the yellow channel (as those pointed at by arrows) so we look at the
correlation between each 2 corresponding activation along the channel but why does the
correlations between different channels capture the style of the image ? let us see an example:
Assume the red channel represents the neuron that finds the patches
surrounded by a red border (subtle vertical texture) and the yellow channel
represents the neuron that finds the patches surrounded by a yellow border
(orange tint). Saying that there is a high correlation between the activations
of these 2 channels means that whenever a part of the image has those
types of subtle vertical textures, that part of the image will most probably
also have this orange tint, and vice versa for uncorrelated channels. So we can say now that the degree of
correlation measures how often different high level (or intermediate-level) features will occur
together.
Now what we will do is to apply the concept of degree of correlation between channels on the
style image and the generated image to tell how close is the style of the generated image to that
of the input style image. Let us formalize what we have said:
Since we are measuring the correlation between all pairs of channels, G[l](S) is nc[l] × nc[l].
Let ai,j,k[l] = the activation at (i, j, k), where i is the height index, j is the width index and k is the
channel index. Then
Gkk'[l] = Σi Σj ai,j,k[l] · ai,j,k'[l], summing i from 1 to nH[l] and j from 1 to nW[l],
where ai,j,k[l] and ai,j,k'[l] are the corresponding activations in the two channels k and k' in layer l,
k and k' vary from 1 to nc[l], and Gkk' is the "correlation matrix" or "style matrix" or "gram matrix".
This matrix is computed for both the style image S (giving G[l](S)) and the generated image G
(giving G[l](G)), and the style cost for layer l is the normalized squared difference between them:
Jstyle[l](S,G) = (1/(2 nH[l] nW[l] nC[l])²) Σk Σk' ( Gkk'[l](S) − Gkk'[l](G) )².
Note: As we are only multiplying activations to represent correlation, it is not actually the
correlation we know from mathematics; it is mathematically the unnormalized cross-covariance,
but we will use the word correlation by convention.
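A numpy sketch of the gram matrix and the per-layer style cost (my own illustration), assuming a_S and a_G are the nH × nW × nC activations of layer l for the style and generated images:

import numpy as np

def gram(a):
    nH, nW, nC = a.shape
    flat = a.reshape(nH * nW, nC)    # rows = spatial positions, columns = channels
    return flat.T @ flat             # nC x nC "style matrix"

def style_cost_layer(a_S, a_G):
    nH, nW, nC = a_S.shape
    GS, GG = gram(a_S), gram(a_G)
    return np.sum((GS - GG) ** 2) / (2 * nH * nW * nC) ** 2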
The loss decreases as the number of epochs increases.
Remember: J(G) = α Jcontent(C,G) + β Jstyle(S,G)
===============================================
All we have been concerned with so far is 2-D images, using 2-D filters to perform convolutions,
but in fact convolutional networks are used in 1-D and 3-D as well.
Convolution in 1-D is used in fields like signal processing, as with EKG signals, and it is performed
the same way as in 2-D, by sliding a 1-D filter along the (discretized) signal, outputting a signal of
length n − f + 1, similar to what happens in 2-D convolutions.
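For example, a quick numpy check of the n − f + 1 output length for a 1-D "valid" convolution (np.convolve flips the kernel, but the output length is what matters here):

import numpy as np

signal = np.random.randn(100)        # n = 100 samples, e.g. a discretized EKG trace
kernel = np.random.randn(5)          # f = 5
out = np.convolve(signal, kernel, mode="valid")
print(out.shape)                     # (96,) = n - f + 1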
3-D is a little bit harder since we also have depth, as in a CT scan, which takes slices of your body to
represent the depth.
Note: RGB images are not 3-D images; they are just 2-D images with 3 channels.
One shot learning & Improvement on Siamese Network loss function
Motivation behind one shot learning
Humans learn new things with a very small set of examples — e.g. a child can generalize the
concept of a “Dog” from a single picture but a machine learning system needs a lot of examples
to learn its features. In particular, when presented with stimuli, people seem to be able to
understand new concepts quickly and then recognize variations on these concepts in future
percepts. Machine learning as a field has been highly successful at a variety of tasks such as
classification, web search, image and speech recognition. Often times however, these models do
not do very well in the regime of low data. This is the primary motivation behind One Shot
Learning.
One-shot learning (aka few-shot learning or low-shot learning)
One-shot learning can be used for object categorization problem in computer vision. Whereas
most machine learning based object categorization algorithms require training on hundreds or
thousands of images and very large datasets, one-shot learning aims to learn information about
object categories from one, or only a few, training images.
Why not use CNNs as usual to solve CV problems, especially face recognition?
CNNs can't be used for such problems because:
A CNN doesn't work on a small training set.
It is not convenient to retrain the model every time we add a picture of a new person to
the system.
When working with a huge dataset, correctly labeling the data can be costly.
However, we can use a Siamese neural network for face recognition. FaceNet is a Siamese
network which is currently the state of the art for face recognition. These types of networks are used
for Face ID recognition in smartphones. Baidu is using face recognition instead of ID cards to
allow their employees to enter their offices.
Siamese Network
Siamese networks are neural networks containing two or more identical subnetwork
components. It is important that not only the architecture of the sub networks is identical, but
the weights have to be shared among them as well for the network to be called “Siamese”. The
main idea behind Siamese networks is that they can learn useful data descriptors that can be
further used to compare between the inputs of the respective sub networks. Hereby, inputs can
be anything from numerical data (in this case the sub networks are usually formed by fully-
connected layers), image data (with CNNs as sub networks) or even sequential data such
as sentences or time signals (with RNNs as sub networks).
There is nothing more to be said about the basics of Siamese networks beyond the main course,
except for an improvement that will be discussed, so read the triplet loss section from the main
course before continuing to the improvement.
Lossless triplet loss
As you can see in this picture, assume the positive
data marked as green dots are easily separated
from error type 1 (red) and error type 2 (orange),
which are acting as negatives. This is the final model
after training with the normal triplet loss discussed in
the main course.
So what is the problem? It seems to work fine, doesn't it? After some reflection, I realized that
there was a big flaw in the loss function, which is in the max function max(0, loss).
There is a major issue here: every time your loss gets below 0, you lose information, a ton of
information.
First let's look at this function.
It basically does this: it tries to bring the Anchor
(current record) close to the Positive (a record that is in theory similar to the Anchor) and as far as
possible from the Negative (a record that is different from the Anchor).
So as long as the negative distance is further than the positive distance + alpha, there will be no gain
for the algorithm in condensing the positive and the anchor.
Still didn't get the problem? Check this:
Let's pretend that:
Alpha is 0.2
Negative Distance is 2.4
Positive Distance is 1.2
The loss function result will be 1.2 − 2.4 + 0.2 = −1. Then
when we look at max(−1, 0) we end up with 0 as a loss. The
Positive Distance could be anywhere up to 2.2 and the loss
would still be 0. With this reality, it's going to be very
hard for the algorithm to reduce the distance between the
Anchor and the Positive value.
Here are 2 scenarios, A and B. They both represent what the
loss function measures for us. After the max function, both
A and B now return 0 as their loss, which is a clear loss of information. By simply looking, we
can say that B is better than A. In other words, you cannot trust the loss function result.
Solution (Lossless triplet function)
With the title, you can easily guess what the plan is… to make a loss function that will capture
the "lost" information below 0. After some basic geometry, I realized that if you bound the N-
dimensional space where the loss is calculated, you can control this more efficiently. So the first
step was to modify the model: the last layer (the embedding layer) needed to be bounded in its values.
By using a sigmoid activation function instead of a linear one, we can guarantee that each
dimension will be between 0 and 1.
Secondly, instead of the linear cost we use a non-linear cost function, so with this new
non-linearity, our cost function now looks like:
Where N is the number of dimensions (the number of outputs of your network; the number of
features of your embedding) and β is a scaling factor.
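Since the exact formula did not survive in these notes, here is a hedged sketch of one way the described idea could be written: sigmoid embeddings bounded in [0, 1]^N and a −ln(...) non-linearity applied to the positive and negative distance terms. Treat the constants and the exact form as assumptions, not the blog post's verified code.

import numpy as np

def lossless_triplet_loss(f_a, f_p, f_n, N=128, beta=128, eps=1e-8):
    # f_a, f_p, f_n are assumed to come from a sigmoid layer, so each squared
    # distance is bounded by N; -ln maps distances near the bound to large losses
    pos_dist = np.sum((f_a - f_p) ** 2, axis=-1)
    neg_dist = np.sum((f_a - f_n) ** 2, axis=-1)
    pos_term = -np.log(-pos_dist / beta + 1 + eps)          # small when A and P are close
    neg_term = -np.log(-(N - neg_dist) / beta + 1 + eps)    # small when A and N are far
    return np.sum(pos_term + neg_term)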
However, there are bad customers who sell fake wine in order to get money. In this case, the
shop owner has to be able to distinguish between the fake and authentic wines.
You can imagine that initially, the forger might make a lot of mistakes when trying to sell the
fake wine and it will be easy for the shop owner to identify that the wine is not authentic.
Because of these failures, the forger will keep on trying different techniques to simulate the
authentic wines and some will eventually be successful. Now that the forger knows that certain
techniques got past the shop owner's checks, he can start to further improve the fake wines
based on those techniques.
At the same time, the shop owner would probably get some feedback from other shop owners or
wine experts that some of the wines that she has are not original. This means that the shop
owner would have to improve how she determines whether a wine is fake or authentic. The goal
of the forger is to create wines that are indistinguishable from the authentic ones, and the goal
of the shop owner is to accurately tell if a wine is real or not.
This back-and-forth competition is the main idea behind GANs.
There are two major components within GANs: the generator and the discriminator. The shop
owner in the example is known as a discriminator network and is usually a convolutional neural
network (since GANs are mainly used for image tasks) which assigns a probability that the
image is real.
The forger is known as the generative network, and is also typically a convolutional neural
network (with deconvolution layers). This network takes some noise vector and outputs an
image. When training the generative network, it learns which areas of the image to
improve/change so that the discriminator would have a harder time differentiating its generated
images from the real ones.
The generative network keeps producing images that are closer in appearance to the real images
while the discriminative network is trying to determine the differences between real and fake
images. The ultimate goal is to have a generative network that can produce images which are
indistinguishable from the real ones.
Generative Adversarial Networks GAN (by Ian Goodfellow in 2014)
Generative Adversarial Networks take up a game-theoretic approach, unlike a conventional
neural network. The network learns to generate from a training distribution through a 2-player
game. These two adversaries are in constant battle throughout the training process. The 2
players are actually deep neural networks (mostly CNNs, since we are dealing with images, but other
domains such as music & speech are also included) that are called:
Generative network: Called the Generator.
o Given a certain label, tries to predict features.
o EX: Given an email is marked as spam, predicts (generates) the text of the email.
o Generative models learn the distribution of individual classes.
The network has 4 convolutional layers, all followed by Batch Normalization (except for the
output layer) and Rectified Linear unit (ReLU) activations.
It takes as an input a random vector z (drawn from a normal distribution). After reshaping z to
have a 4D shape, we feed it to the generator that starts a series of upsampling layers.
Each upsampling layer represents a transpose convolution operation with strides 2. The stride of
a transpose convolution operation defines the size of the output layer. With “same” padding and
stride of 2, the output features will have double the size of the input layer.
In short, the generator begins with this very deep but narrow input vector. After each transpose
convolution, z becomes wider and shallower. All transpose convolutions use a 5×5 kernel size,
with depths reducing from 512 all the way down to 3, representing an RGB color image.
The final layer outputs a 32×32×3 tensor — squashed between values of -1 and 1 through
the Hyperbolic Tangent (tanh) function.
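A hedged PyTorch sketch of a generator like the one described (random z, a projection and reshape, then four stride-2 transpose convolutions with batch norm and ReLU except at the output, ending in a 32×32×3 tanh image). The starting spatial size and exact layer widths are assumptions for illustration.

import torch
from torch import nn

generator = nn.Sequential(
    # project and reshape the noise vector z (assumed 100-dim) to a 512 x 2 x 2 tensor
    nn.Linear(100, 2 * 2 * 512),
    nn.Unflatten(1, (512, 2, 2)),
    nn.ConvTranspose2d(512, 256, 5, stride=2, padding=2, output_padding=1),  # -> 4x4
    nn.BatchNorm2d(256), nn.ReLU(),
    nn.ConvTranspose2d(256, 128, 5, stride=2, padding=2, output_padding=1),  # -> 8x8
    nn.BatchNorm2d(128), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 5, stride=2, padding=2, output_padding=1),   # -> 16x16
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 5, stride=2, padding=2, output_padding=1),     # -> 32x32x3
    nn.Tanh(),                      # squash pixel values to [-1, 1]
)
img = generator(torch.randn(1, 100))   # -> 1 x 3 x 32 x 32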
Now, the generator by itself produces garbage, as the initial input is a vector of values randomly
drawn from a Gaussian or any other distribution (latent features). So it needs to be trained to
enhance its output image each epoch, but to enhance the output we need the adversarial
network that will guide the generator, so let us see the discriminator now.
Discriminator
The discriminator is much simpler, as it looks like a typical machine and deep learning problem where
we have two outputs to compare and compute the error between them.
GAN builds a discriminator to learn what
constitutes real images, and it provides feedback
to the generator to create paintings that look like
the real Monet paintings. So how is it done
technically? The discriminator looks at real images
(training samples) and generated images separately.
It distinguishes whether the input image to the
discriminator is real or generated. The
output D(X) is the probability that the input x is
real, i.e. P(class of input = real image).
We train the discriminator just like a deep network
classifier. If the input is real, we want D(x)=1. If it
is generated, it should be zero. Through this process, the discriminator identifies features that
contribute to real images. The last activation is sigmoid to tell us the probability of whether the
input image is real or not. So, the output can be any value between 0 and 1.
On the other hand, we want the generator to create images with D(x) = 1. So we can train the
generator by backpropagating this target value all the way back to the generator, i.e. we train the
generator to create images that move towards what the discriminator thinks is real.
Now let us discuss the toughest part, which makes GANs a hard task to ultimately solve: the
objective function of GANs, whose solution is a Nash equilibrium.
For generator: its objective function wants the model to generate images with the highest
possible value of D(x) to fool the discriminator.
Algorithm with losses: usually k = 1, so we do one update for the discriminator and then one for the generator.
You think we have finished? No. This model suffered from the generator's diminished gradient
(vanishing gradients), so we had to understand it and solve it.
Generator diminished gradient
It is observed that optimizing the generator objective function does not work so well. This is
because, when a generated sample is likely to be classified as fake, the model would like to
learn from the gradients, but the gradients turn out to be relatively flat. This makes it difficult for
the model to learn.
In other words, we can say that the discriminator usually wins early against the generator. It is
always easier to distinguish the generated images from real images in early training. That
makes V approach 0, i.e. log(1 − D(G(z))) → 0. The gradient for the generator will also vanish,
which makes the gradient descent optimization very slow. So how do we solve it?
Instead of minimizing the likelihood of the discriminator being correct, we maximize the likelihood
of the discriminator being wrong. Therefore, we perform gradient ascent on the generator. So the GAN
provides an alternative function to backpropagate the gradient to the generator, as shown (a small sketch follows):
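A hedged PyTorch sketch of the two training steps with the alternative (non-saturating) generator objective, assuming D and G are the discriminator and generator networks, D ends with a sigmoid producing one probability per image, and binary cross-entropy is the criterion; the optimizer steps are omitted.

import torch
from torch import nn

bce = nn.BCELoss()

def discriminator_step(D, G, real, z):
    # discriminator: push D(real) -> 1 and D(G(z)) -> 0
    fake = G(z).detach()
    loss = bce(D(real), torch.ones(real.size(0), 1)) + \
           bce(D(fake), torch.zeros(real.size(0), 1))
    return loss

def generator_step(D, G, z):
    # non-saturating trick: instead of minimizing log(1 - D(G(z))),
    # maximize log D(G(z)) by labelling the fakes as real
    fake = G(z)
    return bce(D(fake), torch.ones(z.size(0), 1))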
There are more recent improvements on the cost function, introduced in 2017, called Wasserstein
GANs, but I still didn't get the idea. They propose ideas such as using the Wasserstein distance as the
loss function, with some enhancements to the model such as adding noise to the inputs of the
discriminator and using one-sided label smoothing (instead of using 0 & 1 as labels, they use
0.1 and 0.9 instead).
You can check its explanation here: [https://medium.com/@jonathan_hui/gan-wasserstein-gan-
wgan-gp-6a1a2aa1b490]
END OF PART 3
COMPUTER VISION