
Machine and Deep Learning

Collected by/ Nezar A. El-Kady

CREDITS TO SOURCES:
Main courses (pages with white background):
Part 1 (Machine Learning): Machine Learning course by Andrew Ng.
Part 2 (Deep Learning Basics): Course 1 + Course 2 of the Deep Learning Specialization by Andrew Ng + draft version of the Machine Learning Yearning book.
Part 3 (Computer Vision): Course 4 of the Deep Learning Specialization by Andrew Ng.
Part 4 (Natural Language Processing):

Topics in detail (pages with tan background):

Towards Data Science articles
Medium articles
Analytics Vidhya articles
Machine Learning Mastery articles
DataCamp articles
Questions on Quora, Stack Exchange, Reddit, Stack Overflow, ResearchGate, GitHub
Deep learning course by Geoffrey Hinton
Coursera videos
YouTube videos
Other sites visited only once or twice

2018
PART 1
MACHINE LEARNING
Introduction:
What is machine learning?
1st def.: A field of study that gives the computer the ability to learn without being explicitly programmed. (Arthur Samuel)
2nd def.: A well-posed learning problem where the computer is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. (Tom Mitchell)
Q/Ans (example): T: classifying… E: watching… P: the number correctly classified…

Supervised learning: the process of an algorithm learning from a training dataset (correct answers given), as in classification and regression (input x and output y are given, and we try to find a mapping function from x to y).

Unsupervised learning: the process of an algorithm learning on its own by modeling the underlying structure or distribution of the data (no correct answers given), as in clustering and association (only input x, with no output y to map to).

Note: Clustering is used in Google News to cluster different websites talking about a similar topic into one group containing the URLs of those websites. (https://news.google.com/?hl=en-US&gl=US&ceid=US:en)
Note: Classification means a discrete-valued output while regression means a continuous-valued output.
Q/Ans: The answer is option 3: the first problem is regression because the quantity of items is a continuous-valued output, while the second problem is classification because the output is discrete, either hacked ('1') or not hacked ('0').
Machine learning Intro
What is Machine learning?
1-"The science of getting computers to learn and act like humans do, and improve their learning
over time in autonomous fashion, by feeding them data and information in the form of
observations and real-world interactions."
2-“Machine Learning at its most basic is the practice of using algorithms to parse data, learn
from it, and then make a determination or prediction about something in the world.” – Nvidia
3-“Machine learning is the science of getting computers to act without being explicitly
programmed.” – Stanford
4-“Machine learning algorithms can figure out how to perform important tasks by generalizing
from examples.” – University of Washington
5-“The field of Machine Learning seeks to answer the question “How can we build computer
systems that automatically improve with experience, and what are the fundamental laws that
govern all learning processes?” – Carnegie Mellon University
6-Machine learning research is part of research on artificial intelligence, seeking to provide
knowledge to computers through data, observations and interacting with the world. That
acquired knowledge allows computers to correctly generalize to new settings" – Dr. Yoshua
Bengio, Université de Montréal

Machine learning Basic concepts:


All machine learning algorithms consist of 3 parts:
- Representation (a set of classifiers, or the language the computer understands)
- Evaluation (aka the objective/scoring function)
- Optimization (the search method used to find the highest-scoring classifier; both off-the-shelf and custom optimization methods are used)

In other words, machine learning is a science that tries to imitate human learning methods and apply them to machines, so that the machine can learn from the data and examples fed to it and generalize to anything similar, even partially; and over time the machine learns more and improves its performance. Machine learning is done through the 3 basic steps above. Representation means how I express the problem, or the thing the machine is going to learn, as an algorithm the machine understands and can apply; this has many forms. The second step is evaluation, i.e. checking how well the algorithm I used performed; measuring the machine's performance also has different methods, so that I can judge how accurate the algorithm is. The third step is optimization, i.e. improving the algorithm I am using and adjusting it over time to improve its performance; we can call this a tuning process that is done with every epoch (pass over the data).

What is the difference between supervised and unsupervised ?


Supervised means that the data fed to your algorithm is labeled, which is similar to working under supervision: your supervisor is always judging whether you are on the right track, so for each answer you give you know whether it is right or wrong. Similarly, every output your algorithm produces is compared to the label given with the input data, to make sure the algorithm has labeled the data correctly. Supervised learning is used in classification and regression problems.

Unsupervised means that the data fed to your algorithm is unlabeled, which means there are no explicit instructions on what to do with the data. There is no expected output or correct answer here; instead the algorithm tries to find features along which to divide the data. Unsupervised learning is used in clustering, anomaly detection, association and autoencoders.

Are there any other types of machine learning? If yes, what are they?
There are two other main learning paradigms, which are semi-supervised learning and reinforcement learning.
Semi-supervised learning is intermediate between supervised and unsupervised learning, where part of the data is labeled while the other part is unlabeled. This method is motivated by the fact that labeling examples is a time-intensive and expensive task, while at the same time extracting features from unlabeled data is difficult. Generative adversarial networks (GANs) use this type of learning.

Reinforcement learning (where the learner is called an agent) is a method in which the agent interacts with an environment through actions a1, a2, …, and these actions produce rewards or punishments r1, r2, …; this repeats several times, and as more rounds of feedback take place, the better the agent's strategy becomes. In this kind of machine learning, AI agents attempt to find the optimal way to accomplish a particular goal, or to improve performance on a specific task. As the agent takes actions that move it toward the goal, it receives rewards. The overall aim: predict the best next step to take to earn the biggest final reward. Video games are full of reinforcement learning: complete a level and earn a badge; defeat the bad guy in a certain number of moves and earn a bonus; step into a trap and it's game over.
What is the difference between classification and regression?
Mainly, classification means discrete output while regression means continuous output, but let's look at them in more depth:
Both classification and regression are prediction tasks, but classification is about identifying group membership, which means the output takes class labels, while regression has infinitely many possible outputs, in other words the output is not categorized (non-categorical).
However, in classification we can still use probabilistic models (with a continuous probability output), such as logistic regression, but we use this probability to assign our output to a specific class or label.

Classification:
-Discrete output value
-Categorical (unordered)
-Decision boundary
-Integer value representing the class
-Evaluated by accuracy

Regression:
-Continuous output value
-Non-categorical (ordered)
-Best-fit line
-Integer or float value
-Evaluated by mean squared error
Lec2: (Linear Regression)
Model representation (Univariate):
Ex: We are given the prices of houses for some house sizes (a dataset), so it is supervised learning, and since the values are continuous it is a regression problem.
Notes:
1-The dataset is also known as the training set.
2-The word hypothesis was used in the early days of machine learning.
3-h maps from x's to y's.

(x, y) - one training example.
(x(i), y(i)) - the i-th training example.
x(1) = 2104, x(2) = 1416, … and y(1) = 460, y(2) = 232, …

How do we represent h?
hθ(x) or h(x) = θ0 + θ1x → linear regression with one variable (univariate linear regression)

Cost function (linear regression):

The cost function helps us figure out which line best fits our data.
hθ(x) or h(x) = θ0 + θ1x, so how do we choose the θi's? Which values of the θ's give a better prediction?
Idea: choose θ0, θ1 so that hθ(x) (ypred) is close to y for the training examples (x, y). So our goal is to make (hθ(x) - y)² as close as possible to 0. But we have m training samples (x(i), y(i)), so our goal is to minimize (bring close to 0) the cost function
J(θ0, θ1) = (1/2m) Σi=1..m (hθ(x(i)) - y(i))².
Why the square, why the 1/2, and why the 1/m?
1-The square: it reduces the effect of outliers (extreme cases) and also avoids positive and negative errors canceling each other out during the summation. For example, imagine we have 4 data points and we are trying to find the line that best fits them. Imagine a line that goes exactly through 3 of the points but misses the 4th point by four units. Now imagine a second line that misses ALL of the points by one unit. Which of these is a better fit? If we used the absolute value, the cost function would evaluate to the same value for both cases. If we use the square, the cost function will say the second line is the better fit. So, when we use the square we are assuming it is better to have a line that is as close as possible to ALL the points than one that hits some points exactly and misses others by miles.
2-The 1/2: because we mainly care about the gradient of the cost function, and the 1/2 cancels the 2 that comes from differentiating the square.
3-The 1/m: to make it an average instead of a sum, so its scale doesn't change as more points are added (avoiding nan or inf). Also, by including m, the regularization term (to be added later) will not depend on the dataset size.

Note: The cost function is also called the squared error function.
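As a small illustrative sketch (in Python with NumPy, which is an assumption since the course itself uses Octave; the function name compute_cost and the toy data are made up), the squared error cost can be computed like this:

import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) = (1/2m) * sum((h(x) - y)^2)."""
    m = len(y)
    predictions = theta0 + theta1 * x        # h_theta(x) for every training example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy housing-style data: sizes (x) and prices (y)
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
print(compute_cost(0.0, 0.2, x, y))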

Intuition behind the cost function:

For simplified visualization, assume θ0 = 0, hence hθ(x) = θ1x and J(θ1) = (1/2m) Σ (θ1x(i) - y(i))².
As an example, assume the (x, y) values are {(1,1), (2,2), (3,3)}, so m = 3.
Our goal is to find the value of θ1 that best fits the data.
We start trying values of θ1 such as 0, 0.5, 1, 2, 3, … and calculate J(θ1) for each:
J(0) = (1/6)(1² + 2² + 3²) ≈ 2.33
J(0.5) = (1/6)((0.5-1)² + (1-2)² + (1.5-3)²) ≈ 0.58
J(1) = (1/6)(0² + 0² + 0²) = 0
Then we plot the cost function against the parameter θ1. The best θ1 value is the one that minimizes the cost function, which is θ1 = 1 in our problem. The point at which θ1 gives the lowest J value is called the minimum.
We simplified our problem by assuming θ0 = 0 so that we could draw J(θ) in 2-D; if θ0 ≠ 0, the J(θ) plot will be in 3-D as shown.
Cost function for linear regression is
always of bowl shape (convex function).
For easier illustration we are going to use
contour plot.
Contour plot is a graphical technique which portrays a 3-
dimensional surface in two dimensions. Such
a plot contains contour lines, which are constant z slices.
To draw the contour line for a certain z value, we connect
all the (x, y) pairs, which produce the value z.
For example, the three points marked (x) in magenta have different (θ0, θ1) but the same value of J(θ0, θ1).
As we move in the direction of the arrow, J(θ0, θ1) gets closer to 0, which means a better choice of (θ0, θ1).
Gradient descent(Univariate):
Gradient descent is a first-order iterative optimization
algorithm for finding the minimum of a function. To
find a local minimum of a function using gradient
descent, one takes steps proportional to the negative of
the gradient (or approximate gradient) of the function at
the current point.
Idea: Start with some θ0, θ1, …, θn and keep changing them to reduce J(θ0, θ1, …, θn) until we hopefully end up at a minimum.
Algorithm:
repeat until convergence: θj := θj - α (∂/∂θj) J(θ0, θ1)   (for j = 0 and j = 1)
Notations:
:= → assignment operator (put the R.H.S. into the L.H.S.)
α → learning rate
Note: update θ0 and θ1 simultaneously.
Intuition behind gradient descent:
Derivative role:
Let us assume we have θ1 and J(θ1) and we want to find the value of θ1 which minimizes J using gradient descent. (Let θ1 be initialized at the blue dot on the cost function curve.)
By applying the gradient descent update equation:
θ1 := θ1 - α (d/dθ1) J(θ1)
The derivative is the slope of the tangent (red line), which is a positive value (acute angle with the positive direction of the x-axis), but there is a negative sign, so the updated θ1 moves to the left (negatively) as shown (the new θ1 is the green dot).
By repeating this update, the dot (the θ1 value) keeps moving towards the minimum, at which J(θ1) is smallest (grey dot).
Similarly, if the initial value of θ1 is on the other side of the minimum, the derivative is negative, so the movement is to the right, towards the local minimum, as shown by the purple dots.
Learning rate role:
The learning rate α controls the size of each update to the value of θ1. It can't be too small, or gradient descent will be too slow (baby steps at each update). However, it can't be too large either, as it may overshoot the minimum and fail to converge, or even diverge, as shown in the figure (very small α vs. very large α). (If gradient descent is not working, i.e. J increases with the number of iterations, try decreasing α.)
Note: As we approach a local minimum (slope = 0), gradient descent automatically takes smaller steps, as the derivative value gets closer to 0.
Note: Values usually tried are 0.001, 0.01, 0.1 and 1.

Gradient descent for linear regression:


Remember: hθ(x) = θ0 + θ1x and J(θ0, θ1) = (1/2m) Σ (hθ(x(i)) - y(i))².
Note: The gradient descent we are doing is called batch gradient descent, as each step of gradient descent uses all m training samples.
To check whether gradient descent is working, plot J(θ) against the number of iterations and make sure it decreases (check the figure).
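A minimal sketch of batch gradient descent for univariate linear regression, assuming Python/NumPy (the helper name gradient_descent, the toy data and the fixed number of iterations are illustrative choices, not from the course):

import numpy as np

def gradient_descent(x, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1*x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    history = []
    for _ in range(n_iters):
        error = (theta0 + theta1 * x) - y          # h_theta(x^(i)) - y^(i) for all i
        # Simultaneous update: compute both gradients before updating either theta
        grad0 = error.sum() / m
        grad1 = (error * x).sum() / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
        history.append(np.sum(error ** 2) / (2 * m))  # J should decrease every iteration
    return theta0, theta1, history

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
theta0, theta1, costs = gradient_descent(x, y, alpha=0.1, n_iters=500)
print(theta0, theta1)   # should approach 0 and 1 for this toy data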
Lec4:
Model representation(Multivariate):
Our previous model was based on only one input variable (x) that helps in predicting the output
hθ(x), now we will talk about multiple input variables (multiple features).

Note:
There is a major difference between n and m: m is the number of training examples, while n is the number of features describing each training example. Similarly for x(i) and xj(i): x(i) is training example i, including all of its features, while xj(i) is the value of feature j of training example i.

For our example, x(2) is the column vector of all four feature values of the second training example, x3(2) is 2, and n = 4 as each example has 4 features (size, number of bedrooms, number of floors and age of home).


Hypothesis: hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4

Generally: hθ(x) = θ0x0 + θ1x1 + … + θnxn = θTx, where θ = [θ0; θ1; …; θn] and x = [x0; x1; …; xn] (both of dimension (n+1) × 1).

Notes: 1-x1 means feature 1 of all examples (x1(i) for any i). 2-x0 is always equal to 1.
Gradient descent(Multivariate):
We will denote all the θ's (θ0, θ1, …, θn) as θ, so instead of writing J(θ0, θ1, …, θn) we will just write J(θ).

Remember:
1-x0(i) = x0 = 1.
2-Always do a simultaneous update of the values of the θ's (θj for j = 0, 1, 2, …, n).

Feature scaling:
Idea: make sure the features are on a similar scale.
In our previous example, x1 is the size, which ranges from 0 to 2000 feet², and x2 is the number of bedrooms, which ranges from 1 to 5. Each feature differs in its scale, which would greatly affect gradient descent by making it slow (notice the contour plot).
Generally, get every feature into approximately a -1 ≤ xj ≤ 1 range.

-The easiest way to do feature scaling is to normalize by dividing each feature by its range (max - min).
-Another way is mean normalization, which replaces xj with (xj - μj) so that features have approximately zero mean (don't apply this to x0 = 1).
Linear Regression(Model, cost function, gradient descent & feature scaling)
What is linear regression?
Linear regression models a statistical relationship between two continuous variables: a predictor (independent variable) and a response (dependent variable). The idea is to find the straight line that best fits the data, which means the error between predicted and actual values is as small as possible. (Error here means the vertical distance between a point and the regression line, as shown in the figure.)
Notes: 1-It is a statistical relationship because it doesn't give the exact relationship between input and output, but tries to find the best-fit function to express this relation. (The opposite of a statistical relationship is a deterministic relationship, which gives the exact relationship between the variables.)
2- Regression differs from correlation where regression is used to predict a dependent variable
based on the value of at least one variable (could be more for multivariate) while correlation
only shows the strength of the association (linear relation) between the two variables.

Equations of linear regression

Yi = β0 + β1xi + εi
Note: the random error term ε has a mean of zero.

β0 and β1 are the values that minimize the sum of squared errors between y and ŷ.

β0 is the estimated average value of Y when the value of x is 0.

β1 is the estimated change in the value of Y resulting from a one-unit change in x.
(β0 = ȳ - β1x̄, β1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²)
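A small sketch of these closed-form estimates in Python/NumPy (the function name fit_simple_ols and the toy data are hypothetical):

import numpy as np

def fit_simple_ols(x, y):
    """beta1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), beta0 = ybar - beta1*xbar."""
    x_bar, y_bar = x.mean(), y.mean()
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])
print(fit_simple_ols(x, y))   # slope close to 2, intercept close to 0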

Can we extrapolate with linear regression?


First of all, extrapolation means the process of
estimating, beyond the original observation range,
the value of a variable on the basis of its
relationship with another variable.
Normally we use regression models for extrapolation to be able to predict the response to an input which lies outside the range of the predictor values used to fit the model. But the danger here is that the data could follow a different trend outside this range, as shown:
How is the error (residual) calculated?

In the figure, the blue dots are the data, while the red dots on the regression line are the estimated (predicted) outputs. So the error here is represented as the residual ri, where ri = yi - ŷi.
yi → real output i (the label of the data)
ŷi → estimated output i

The residual error for the 3 points is R = r1² + r2² + r3² (the sum of the squared residuals).
This method is called the least squares method.

Why do we use the square? The reason for squaring the individual residual errors is to prevent positive residual errors from canceling out negative residual errors.

Residuals can be random or non-random, as shown in the following residual plots:

For a non-random pattern, as shown, a non-linear regression is a better fit.


Measures of variation
The total variation is made up of 2 parts: 1) the regression sum of squares and 2) the error sum of squares.
SST (total sum of squares) = SSR (regression sum of squares) + SSE (error sum of squares),
where SST = Σ(Yi - Ȳ)², SSR = Σ(Ŷi - Ȳ)², SSE = Σ(Ŷi - Yi)².
SST measures the variation of the Yi values around the average.
SSR measures the variation of the Ŷi values around the average (attributable to the relation between x and y).
SSE measures the variation due to factors other than the relation between x and y.
Coefficient of determination (r-squared), r²:
It is a statistical measure of how close the data are to the fitted regression line. In other words, it is the percentage of the variation of the response variable (dependent variable) that is explained by the linear model.
R-squared is always between 0 and 100%:
 0% indicates that the model explains none of the variability of the response data around
its mean.(No relation between the response and the predictor)
 100% indicates that the model explains all the variability of the response data around its
mean.(Perfect relation between the response and the predictor)

Equation:
r2= SSR/SST (0 ≤ r2 ≤ 1)
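A short sketch of computing SST, SSR, SSE and r² in Python/NumPy (names and data are illustrative; note that SSR/SST equals 1 - SSE/SST only when ŷ comes from a least-squares fit with an intercept):

import numpy as np

def r_squared(y, y_hat):
    """r^2 = SSR/SST (also 1 - SSE/SST for a least-squares fit)."""
    y_avg = y.mean()
    sst = np.sum((y - y_avg) ** 2)       # total sum of squares
    sse = np.sum((y - y_hat) ** 2)       # error sum of squares
    ssr = np.sum((y_hat - y_avg) ** 2)   # regression sum of squares
    return ssr / sst, 1 - sse / sst

y = np.array([2.0, 4.0, 6.0, 8.0])
y_hat = np.array([2.2, 3.9, 6.1, 7.8])
print(r_squared(y, y_hat))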

Does r-squared really matter? Which is better, the r-squared value or the residual plot? When should it be used?
R-squared doesn't really measure goodness of fit: you could have a very low r² while the model is absolutely correct, and a very high r² while the model is wrong, so you need both the r-squared value and the residual plot. Checking residual plots comes first, to make sure that the predictions are not biased and that your model is correct; then, to compare between correct models, you can use the r-squared value. (Some professors, such as Cosma Shalizi at Carnegie Mellon University, consider it useless.)
Example:
The fitted line plot shows that these data follow a nice tight function and the R-squared is 98.5%, which sounds great. However, look closer to see how the regression line systematically over- and under-predicts the data (bias) at different points along the curve.
Standard error of estimate (SYX):
This shows the standard deviation of the estimation around the regression line.

SYX = √(SSE / (n - 2)), where n is the sample size.

The magnitude of SYX should be judged relative to the magnitude of the Y values.

What is the difference between standard error of estimate and r-squared?


Both of them are good measures of the goodness of fit for regression analysis. However, there are differences:
 The standard error of the regression provides the absolute measure of the typical distance
that the data points fall from the regression line. S is in the units of the dependent
variable.
 R-squared provides the relative measure of the percentage of the dependent variable
variance that the model explains. R-squared can range from 0 to 100%.
An analogy makes the difference very clear. Suppose we’re talking about how fast a car is
traveling.
R-squared is equivalent to saying that the car went 80% faster. That sounds a lot faster!
However, it makes a huge difference whether the initial speed was 20 MPH or 90 MPH.
The increased velocity based on the percentage can be either 16 MPH or 72 MPH, respectively.
One is lame, and the other is very impressive. If you need to know exactly how much faster, the
relative measure just isn’t going to tell you.
The standard error of the regression is equivalent to telling you directly how many MPH faster
the car is traveling. The car went 72 MPH faster. Now that’s impressive!

So the standard error of estimate has a bit of an advantage, as it tells you directly how precise the model's predictions are, in the units of the dependent variable. This statistic indicates how far the data points are from the regression line on average. You want lower values of S because that signifies that the distances between the data points and the fitted values are smaller. S is also valid for both linear and nonlinear regression models.
Residual analysis advantages:
We have talked before about residuals, which are the errors between the data and the predictions and can be visualized using residual plots; residual plots can also be used to check 4 assumptions known as LINE:
Linearity - Independence of errors - Normality of errors - Equal variance (homoscedasticity)

Linear regression is a statistics topic which has much more mathematical background to be
discussed but we will be content with what we have said to serve machine learning topics.
Cost Function
The cost function shows us how well our parameters β0, β1 (some people use the notation θ0, θ1) have done at fitting the line to the input data. Our ultimate goal is to minimize the cost function as much as possible; by minimizing I mean making the cost function as close as possible to 0. We will use the mean squared error (MSE), also known as the L2 cost function.
Equation: Normally the equation is J = (1/m) Σ (ŷ(i) - y(i))², but more commonly used is J = (1/2m) Σ (ŷ(i) - y(i))² (dividing by 2 as explained before), where
ŷ = hβ(x) = hθ(x) = β0 + β1x1 + β2x2 + … = θ0 + θ1x1 + θ2x2 + …

Is there any difference between cost function, loss function and objective function?
Usually these terms are not strictly defined, but the most common usage is:
Loss function: defined for a single data point, measuring the penalty for a single training example.
Cost function: defined more generally, over a batch or over all training examples of the training set.
Objective function: the most general term, used for any function to be optimized during training.

To summarize we can say : loss function is a part of cost function which is a type of objective
function.

What is the difference between L1 and L2 loss function (cost function)?


L1 loss function:
-More robust
-Less stable solution
-Possibly multiple solutions

L2 loss function:
-Not very robust
-Stable solution
-Only one solution

L2 is less robust because L2 squares the error, so the model sees a much larger error (e² vs. e) for an outlier than with the L1 norm, which makes it much more sensitive to outliers. L1 is less stable: if there is an outlier, the L1 fit changes more as the point moves further into the outlying region.
Generally, when using regularization (a term added to the cost function to overcome what is called overfitting), L2 is a better and more computationally efficient regularizer than L1.
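A tiny illustration of the outlier sensitivity described above, assuming Python/NumPy (the data is made up; the last point plays the role of an outlier):

import numpy as np

y     = np.array([1.0, 2.0, 3.0, 10.0])   # the last point is an outlier
y_hat = np.array([1.1, 1.9, 3.2, 4.0])
residual = y - y_hat

l1_loss = np.abs(residual).sum()   # L1: grows linearly with the outlier's error
l2_loss = (residual ** 2).sum()    # L2: the outlier dominates because the error is squared
print(l1_loss, l2_loss)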
Gradient descent
Gradient descent is the most popular optimization strategy in machine learning nowadays. It is used in the training phase to optimize our algorithm to reach the best learning parameters (θ0, θ1, …, θn). It iteratively tweaks the parameters to minimize a given (objective) function, typically a convex one, towards its local minimum.
The word "gradient" means the measure of how much the output of a function changes if you change the inputs a little bit. It simply measures the change in all weights with regard to the change in error. You can also think of a gradient as the slope of a function. The higher the gradient, the steeper the slope and the faster a model can learn. But if the slope is zero, the model stops learning. Said more mathematically, a gradient is a partial derivative with respect to the inputs.
Note that the length from x0 to x1 is larger than that from x1 to x2, and so on; that is because as we get closer to the local minimum the gradient becomes less steep, until it reaches 0 at the exact local minimum.

Note: A convex function is bowl-shaped and has a single (global) minimum.

Equation:
Generally → a_new = a_old - α ∇f(a_old)

For our case → θj := θj - α (∂/∂θj) J(θ) = θj - α (∂/∂θj) ((1/2m) Σ (hθ(x(i)) - y(i))²)

= θj - α ((1/m) Σ (hθ(x(i)) - y(i)) · xj(i))   (applied iteratively for each j)

We can notice here that this equation updates the value of θ by a step towards the steepest descent to the local minimum, which is done using the partial derivative with respect to θ. As shown in the figure, the derivative gives the tangent at the current point, which determines the line along which the point will move; the sign (+ or -) determines the direction, while α determines how big the step in the direction of this tangent will be.
Note:
θj := θj - α ((1/m) Σ (hθ(x(i)) - y(i)) · xj(i)), which for θ0 reduces to θ0 := θ0 - α ((1/m) Σ (hθ(x(i)) - y(i))) since x0(i) = 1.
What is the importance of learning rate (α)?
We can notice from its name that it determines the rate of learning of the parameters θ by
determining how big the steps will be taken towards the local minimum which means by default
that it controls how fast or slow the learning will be (smaller α means slower learning process)
But wait, does this mean that a bigger α is better, since the learning process will be faster?
NO, α shouldn't be too large or too small. But why?
If α is large:
The steps will be too big, and gradient descent may never reach the local minimum because it just bounces back and forth across the convex cost function, like you can see on the left side of the image.
If α is small:
Gradient descent will eventually reach the local minimum, but it may take too much time, like you can see on the right side of the image.

So α shouldn't be too high or too low and you can check how well
your learning rate works by plotting the gradient descent with each
iteration to make sure the cost function gets lower each iteration
(epoch) until it converges at low value (near 0).

Notes:
1-The number of iterations (epochs) needed until convergence is hard to estimate (it may take 20 epochs or up to 3 million epochs); generally you decide to stop when you reach the performance needed for your application.
2-α optimally ranges from about 0.001 to 1.

What is adaptive learning rate ?


An adaptive learning rate means that the α value changes along the iterations (epochs). There are 2 well-known types of adaptive learning rates:
1) Annealing (decaying) learning rate: you start learning at a high LR and decrease it as you get closer to the local minimum. The reason is that you don't want to lose the location of the minimum by oscillating too much with a high LR. It can decay step-wise, with a cosine schedule, or with whatever technique is used to decay its value over time.
2) Cyclic learning rate: after some time, you jump your LR back up to get out of the current local minimum. The reason is that there are several local minima; some are "spiky" and are not good for generalization. The final point should be in a "flat" area.

How are the learning parameters (θ's or β's) updated?

They should be updated simultaneously: first all the gradients (partial derivatives) are computed, then the update lines are executed simultaneously, as shown. If they were not updated simultaneously, the change in the value of θ0 would change the cost function used for the θ1 update, so its derivative would change, and the changes in both θ0 and θ1 would also affect the cost function used in the θ2 update equation, and so on…

Why & How to do feature scaling?


Why? → Most of the time, your dataset will contain features that vary greatly in magnitude, units and range. Since most machine learning algorithms use the Euclidean distance between two data points in their computations, this is a problem.
How? → 4 methods to perform feature scaling:
1-Standardisation:
Redistributes the features to have a mean of 0 and a standard deviation of 1.

2-Mean normalization:
Redistributes features to be between -1 and 1 and have a mean of
0.

3-Min-Max scaling:
Redistributes features to be between 0 and 1.

4-Unit vector:
Scaling is done considering the whole feature vector to be of unit length &
redistributes features to be between 0 and 1.
Notes: 1-Standardisation and Mean Normalization can be used for algorithms that assumes zero
centric data like Principal Component Analysis(PCA).
2- Min-Max Scaling and Unit Vector techniques produces values of range [0,1]. When dealing
with features with hard boundaries this is quite useful. For example, when dealing with image
data, the colors can range from only 0 to 255.
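A minimal sketch of the four scaling methods in Python/NumPy (the sample feature vector is hypothetical):

import numpy as np

x = np.array([2104.0, 1416.0, 1534.0, 852.0])   # e.g. house sizes

standardised = (x - x.mean()) / x.std()              # mean 0, standard deviation 1
mean_norm    = (x - x.mean()) / (x.max() - x.min())  # roughly in [-1, 1], mean 0
min_max      = (x - x.min()) / (x.max() - x.min())   # in [0, 1]
unit_vector  = x / np.linalg.norm(x)                 # whole feature vector scaled to unit length
print(standardised, mean_norm, min_max, unit_vector, sep="\n")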
Polynomial regression:
Not all problems can actually be solved by linear regression (in fact, most can't), so the next option is polynomial regression (non-linear regression).

As shown in this example, we have only one feature, the size, but the distribution of the data is more likely to be fit by a 3rd-order polynomial than by a straight line, so we use the feature, feature² and feature³ to fit a curve to our model. (Remember feature scaling, as each of these terms has a different scale, as shown on the right.)

Note: the polynomial does not have to use integer powers only; for example, a square-root term can be used (e.g. hθ(x) = θ0 + θ1x + θ2√x).

Normal equation:
A method used to solve for θ analytically.

To do so we only need to apply this equation to get the values of the θ's → θ = (XTX)-1XTy
Note: in Octave, use → pinv(X' * X) * X' * y
Ex: build the design matrix X (one row per example, with x0 = 1 prepended) and the output vector y.

Then apply the rule: θ = (XTX)-1XTy
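A small Python/NumPy sketch of the normal equation (the design matrix and outputs below are toy values; pinv is used so the singular case is also handled, mirroring the Octave note above):

import numpy as np

# Design matrix with x0 = 1 prepended to each example (m x (n+1))
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0, 1534.0, 3.0],
              [1.0,  852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])

# theta = (X^T X)^-1 X^T y ; pinv works even if X^T X is non-invertible
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)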


Gradient descent:
Advantages: works well even when n is large (a large number of features).
Disadvantages: 1-need to choose a value for α; 2-needs many iterations to converge.

Normal equation:
Advantages: 1-no need to choose a value for α; 2-doesn't need many iterations.
Disadvantages: 1-needs to compute (XTX)-1, which is O(n³); 2-slow if n is very large.

Normal equation non-invertibility:


What if (XTX) is non-invertible (singular)?
No problem, as Octave has a function called pinv (pseudo-inverse) which works even if the matrix is non-invertible.
The matrix can be non-invertible due to redundant features (linearly dependent columns) or due to having so many features that the number of training examples (m) is less than the number of features (n).

Remember: Classification means that the predicted variable is a discrete-valued number.

Ex: email spam / not spam, tumor malignant / benign, and so on.
Note: Linear regression is not a good classifier; that is why we use what is called logistic regression.

Lec6:
Logistic regression:
Logistic regression is used mainly for classification problems, which could be a little confusing since its name is "regression" while it is used for "classification"; that is just a name given for historical reasons.
Logistic regression: 0 ≤ hθ(x) ≤ 1
Hypothesis: g(z) = 1/(1 + e^(-z)), so hθ(x) = g(θTx) = 1/(1 + e^(-θTx)). [Remember: θTx = θ0x0 + θ1x1 + … + θnxn]
Notes: 1-g(z) = 1/(1 + e^(-z)) is called the logistic function or sigmoid function.
2-hθ(x) = P(y=1|x;θ) = the probability that y = 1, given x, parameterized by θ.
3-P(y=1|x;θ) + P(y=0|x;θ) = 1.

Decision boundary: if hθ(x) ≥ 0.5 then ypredicted = 1, otherwise ypredicted = 0, noting that hθ(x) ≥ 0.5 when θTx ≥ 0. (hθ(x) = 0.5, i.e. θTx = 0, gives the decision boundary equation.)
Non-linear decision boundary:

When the data is distributed such that no straight line can separate the classes, a polynomial hθ(x) can be used to find a non-linear decision boundary, such as the magenta circle drawn in the figure.

Cost function (logistic regression):

Due to the non-linearity of logistic regression, the cost function as we defined it before (squared error) would lead to a non-convex form, as shown in the figure, with many local minima; the optimization could get stuck inside one of those minima while other θ values give a better (lower) J(θ).

We will use this cost function instead:
cost(hθ(x), y) = -log(hθ(x)) if y = 1, and -log(1 - hθ(x)) if y = 0.

You can notice that with this formula we are in convex form and we can penalize prediction errors efficiently: if y = 1 and hθ = 1 (right prediction) then the cost is 0 as log(1) = 0; similarly, when y = 0 and hθ = 0 (right prediction) the cost is 0 as log(1 - 0) = 0; but if y = 1 and hθ → 0 (wrong prediction) then the cost → ∞ as -log(0) → ∞.

Simplified cost function: cost(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x))

Therefore, J(θ) = -(1/m) Σi=1..m [ y(i) log(hθ(x(i))) + (1 - y(i)) log(1 - hθ(x(i))) ]
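A minimal Python/NumPy sketch of the sigmoid hypothesis and this cost function (logistic_cost, the toy design matrix and labels are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)) with h = sigmoid(X @ theta)."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

X = np.array([[1.0, 0.5], [1.0, 2.5], [1.0, -1.0]])   # first column is x0 = 1
y = np.array([0.0, 1.0, 0.0])
print(logistic_cost(np.array([0.0, 1.0]), X, y))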

Gradient descent for logistic regression: It's the same algorithm to that of linear regression.
Normal equations, Polynomial regression and Logistic regression
Normal equations:
It is a closed-form solution which minimizes the sum of squared differences, which is our optimization objective; it is called the normal equation because if Ax = b then b - Ax is normal (orthogonal) to the range of A. Since it is a closed form, it can solve our problem more simply than gradient descent, with no need to iterate until we reach the best values for the learning parameters.
Equation: θ = (XTX)-1XTy

Why do we use gradient descent instead of normal equations ?


Gradient descent is used more often in linear regression than the normal equation, as it is computationally cheaper and faster to find the solution in most cases. The only case where the normal equation is better and cheaper is univariate linear regression; with multivariate regression, which is more common in real problems, the normal equation becomes much more computationally expensive (due to the matrix inversion). Moreover, the normal equation needs memory to store the matrices and, as said, it takes more time, especially for large matrices as in machine learning problems (thousands or millions of features).
The complexity of calculating the inverse of a matrix is O(n³).

Polynomial regression:
Polynomial regression means an n-th degree regression instead of only a 1st-degree one as in linear regression, which means that we add some non-linearity to our line so it becomes a curvilinear regression that fits a nonlinear relationship between the value of x and the corresponding conditional mean of y. Higher-order or more complex regressions are used to avoid what is called underfitting (discussed later).
Equation: hθ(x) = θ0 + θ1x + θ2x² + … + θnx^n

Generally, polynomial regression is a little more common in practice than linear regression, as most of the time the data fed to a system is more likely to be fit by a non-linear function than by a linear one. Note: don't exaggerate in increasing the order of the polynomial, to avoid overfitting (discussed later).
Logistic regression
Logistic regression is a type of regression used when the dependent variable (response) is categorical, which means that it is used mainly for classification (a little confusing, since its name has the word "regression" but it is mostly used for classification, unlike all the other regressions we covered). Logistic regression gives a continuous output value which indicates a probability, and based on this output we classify. For example, consider a scenario where we need to classify whether an email is spam or not. If we use linear regression for this problem, there is a need for setting up a threshold based on which classification can be done. Say the actual class is malignant, the predicted continuous value is 0.4 and the threshold value is 0.5; then the data point will be classified as not malignant, which can lead to serious consequences in real time.
Logistic regression uses the logistic function, aka the sigmoid function → sigmoid(z) = 1/(1 + e^(-z)).
The output of logistic regression is bounded between 0 and 1.
Equation:
hθ(x) = 1/(1 + e^(-z)), where z = θ0x0 + θ1x1 + … + θnxn + bias
Mathematical representation: P(Y=1|x;θ)

What are the types of logistic regression?

1-Binary (dichotomous): the categorical response has 2 possible outcomes (0 or 1).
2-Multinomial (polytomous, nominal, softmax): three or more responses without ordering.
3-Ordinal logistic regression: three or more responses with ordering, such as a movie rating from 0 to 5.

What is the decision boundary of logistic regression?


The decision boundary is defined by the threshold 0.5 in the binary case, while in the case of K classes we use either K-1 models or K models (one vs. rest), which will be discussed in the softmax regression part of the deep learning section. For now we should know that logistic regression can be used for binary classification (0 or 1).

Can logistic regression be used for non-linear binary classification?


Yes, by making the hypothesis equal to a higher-order polynomial regression and then applying the sigmoid.

Is logistic regression a linear or non-linear model?

It is linear, as the output always depends on a weighted summation of the inputs and the parameters (θ). Although the sigmoid activation function looks non-linear, it produces a linear decision surface (boundary).
Why not use linear regression instead of logistic regression for classification?
Simply put, linear regression outputs an unbounded numeric prediction, while logistic regression outputs a probability, which is used for classification by comparing it to the threshold.

Cost function and gradient descent for logistic regression


The cost function here is different from the usual cost function; it is designed to determine how much penalty to apply in case of a classification error.
Cost function: J(θ) = -(1/m) Σ [ y(i) log(hθ(x(i))) + (1 - y(i)) log(1 - hθ(x(i))) ]
How this function penalizes wrong classifications was explained previously.
Gradient descent is similar to that of linear regression.

Gradient descent: θj := θj - α (1/m) Σ (hθ(x(i)) - y(i)) · xj(i) (the same form as in linear regression, but with the sigmoid hypothesis).

Why isn't the cost function used for linear regression used for logistic regression?
Simply because linear regression uses MSE (mean squared error) as its cost function. If this were used with logistic regression, it would lead to a non-convex function that can have many local minima, as shown in the figure, so gradient descent could fall into a local minimum instead of the global minimum.

What are the types of regression ?


We have 7 known types of regression:
1-Linear regression 2- Polynomial regression 3- Logistic regression  Already explained
4-Stepwise regression 5-Ridge regression 6- Lasso regression 7-Elasticnet regression
we will talk quickly about the other types:
Stepwise regression:
Stepwise regression is a variable-selection procedure for the independent variables (Xi), such that at each step of the procedure the X's are evaluated using a set of criteria to see whether they should remain in the model or not. The aim of this modeling technique is to maximize prediction power using the smallest number of predictors (X's). It is used with higher-dimensional datasets.
Ridge regression:
It is used when the data suffers from multicollinearity (the independent variables are highly correlated). In fact, this regression is what is known as regularization, adding a regularizer term that includes the shrinkage parameter λ. (This will be discussed later in detail.)
Lasso regression:
This type of regression is similar to ridge regression except for that it uses L1 regularization
instead of the L2 regularization used in the ridge regressor.

Elasticnet regression:
It is a hybrid of lasso and ridge regression by adding both L1 and L2 regularization.
Advanced optimization algorithms: (just for knowledge)
1-Conjugate gradient 2-BFGS 3-L-BFGS
Advantages: 1-no need to manually pick α (it is picked automatically); 2-often faster than gradient descent.
Disadvantages: 1-more complex.

Multi-class classification (one vs. all or one vs. rest):


Ex: Email foldering → work, friends, family and hobby.
What is done here is that we have ℓ hypotheses hθ(i)(x) (assuming ℓ classes), each with its own parameters, and for a new input we apply the ℓ hypotheses one after the other and then pick the class i that maximizes hθ(i)(x). (Check the figure.)
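A small sketch of the one-vs-all prediction step in Python/NumPy (the trained parameter matrix all_theta below is hypothetical; each row plays the role of one per-class hypothesis):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(all_theta, x):
    """all_theta has one row of parameters per class; pick the class whose classifier is most confident."""
    probs = sigmoid(all_theta @ x)      # h_theta^(i)(x) for every class i
    return int(np.argmax(probs))

all_theta = np.array([[ 1.0, -2.0,  0.5],    # hypothetical trained parameters for 3 classes
                      [-1.0,  1.5, -0.5],
                      [ 0.2,  0.1,  1.0]])
x = np.array([1.0, 0.7, -0.3])               # x0 = 1 plus two features
print(predict_one_vs_all(all_theta, x))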

Lec7:
Overfitting(high variance):
Overfitting is a modeling error which occurs when a function is fit too closely to a limited set of data points. Overfitting generally takes the form of making an overly complex model to explain idiosyncrasies in the data under study, so the function is more likely to memorize the data rather than generalize from it. (J(θ) ≈ 0)

How to address overfitting ?


1-Reduce number of features either manually or by using model selection algorithm (discussed
later)
2-Regularization.
Regularization:
A technique used for tuning the function by adding an additional penalty term in the error
function. The additional term controls the excessively fluctuating function such that the
coefficients don't take extreme values. This process is used to avoid overfitting problem.

Regularization for linear regression:


Added term: (λ/2m) Σj=1..n θj², where λ is the regularization parameter.

Gradient descent after regularization:

We can notice that after adding the regularization term, the gradient descent update now multiplies θj by the term (1 - αλ/m), which means that at every iteration θj gets smaller even if the derivative term reaches zero.

Notes:
1-λ should be chosen high enough to eliminate overfitting but not so much high leading to
underfitting problem.
2-θ0 is not penalized. Only θ1,2,3,…,n are penalized.

Normal equation after regularization:

θ = (XTX + λM)-1XTy, where M is the (n+1)×(n+1) identity matrix with its top-left entry (corresponding to θ0) set to 0.
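A minimal Python/NumPy sketch of the regularized cost and gradient for linear regression, with θ0 left unpenalized as noted above (function and variable names are illustrative):

import numpy as np

def regularized_cost_and_grad(theta, X, y, lam):
    """Squared-error cost plus (lam/2m)*sum(theta_j^2) for j >= 1; theta_0 is not penalized."""
    m = len(y)
    error = X @ theta - y
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    cost = np.sum(error ** 2) / (2 * m) + reg
    grad = (X.T @ error) / m
    grad[1:] += (lam / m) * theta[1:]        # no regularization term for the theta_0 gradient
    return cost, grad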
Overfitting, Underfitting and Regularization
Overfitting VS. Underfitting
Most of the poor results given by a machine learning model are the result of either overfitting or underfitting. Overfitting means that the line fitting the input data matches the training data extremely closely, as shown in the figure, which means the model is highly variant (in other words, has low bias); the model is almost memorizing the data rather than generalizing from it. Underfitting is the opposite of overfitting: the model or line fitting the data is poor and far away from the data, which means predictions are far off, due to high bias (in other words, low variance).

What is meant by high variance or high bias? What is meant by memorizing and
generalizing?
The word variance means how much a model changes in response to the training data. If the model just keeps tracking the changes in the training data and changes itself to match the data perfectly, it is known as a highly variant model; in other words, it is highly dependent on the training data. This leads to memorizing the data, meaning it treats any data given in the test phase as identical to the training data; here we say that it is memorizing the data, not generalizing from it. As the degree of the polynomial of the fitting line increases, the model tends to memorize more. Bias, on the other hand, is the flip side of variance, representing the strength of the assumptions we make about our data.
To summarize, bias shows how much you ignore the data and variance shows how dependent our model is on the data. There will always be a trade-off between them, and our goal is to balance them.
 Overfitting: too much reliance on the training data
 Underfitting: a failure to learn the relationships in the training data
 High Variance: model changes significantly based on training data
 High Bias: assumptions about model lead to ignoring training data
 Overfitting and underfitting cause poor generalization on the test set
 A validation set for model tuning can prevent under and overfitting  Cross-validation
More often you will be exposed to overfitting rather than underfitting in real practical problems.
Note: In machine learning we describe the learning of the target function from training data as
inductive learning. Induction refers to learning general concepts from specific examples which
is exactly the problem that supervised machine learning problems aim to solve. This is different
from deduction that is the other way around and seeks to learn specific concepts from general
rules.
How overfitting and underfitting affect the training and testing error?
Overfitting means that the error is so small on the training set and so high on the testing set.
While Underfitting means that the error is too high on training set as well as the testing set.

Note: Assume training set has error of 15% and dev set has error of 16%, thus here we have
underfitting in our algorithm due to the high error in training and no overfitting as our dev set
(which is responsible for parameters' tuning) error is only 1% higher than train set error. (Check
more examples at Tip 21 at the end of part1 (Machine learning))
Regularization
The word regularize means to make things regular or acceptable; regularization is a technique which makes slight modifications to the learning algorithm such that the model generalizes better, which in turn improves the model's performance on unseen data as well. To perform regularization, we modify our cost function by adding a penalty to the sum of squared residuals (RSS) (mean squared error, MSE).
By adding a penalty to the cost function, the values of the parameters decrease, and thus the overfitted model gradually starts to smooth out, depending on the magnitude of the penalty added.

What are types of regularization techniques?


We mentioned them before: lasso regression (L1 regularization), ridge regression (L2 regularization) and ElasticNet regression (L1/L2 regularization).
Lasso regression: it stands for Least Absolute Shrinkage and Selection Operator and relies on L1 regularization, which adds a multiple of the sum of the absolute values of the coefficients to the optimization objective → objective = RSS + λ(sum of the absolute values of the coefficients (θ's)), where λ is the regularization tuning parameter which balances the emphasis given to minimizing RSS vs. minimizing the sum of the absolute values of the coefficients.
Ridge regression: it relies on L2 regularization, adding a multiple of the sum of the squares of the coefficients → objective = RSS + λ(sum of the squares of the coefficients (θ's)).
ElasticNet regression: this relies on both L1 and L2 regularization and by default has 2 tuning parameters, λ1 and λ2.
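As a quick illustration of these three regularizers, here is a sketch using scikit-learn (an assumption, since the source does not use this library; the synthetic data is made up so that only two features are informative):

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)   # only 2 informative features

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    # Lasso (and to a lesser extent ElasticNet) tends to push useless coefficients to exactly zero
    print(type(model).__name__, np.round(model.coef_, 2))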
What are the possible values of λ? How do we select λ?

λ = ∞: the coefficients θ will all be zero due to the infinite weight on the coefficients.
λ = 0: the objective is unregularized, giving the same coefficients as ordinary regression.
0 < λ < ∞: the coefficients are shrunk according to the weight given to the penalty.
Selection of λ: as shown, if λ is too low the effect almost vanishes, and if it is too high the model will most likely underfit, so the optimal λ is not a strict or predefined value; to find it you need cross-validation, trying different values of λ to reach the optimal value for your problem.

What are the key differences between Lasso and Ridge?

Ridge: it includes all (or none) of the features in the model. Thus, the major advantage of ridge regression is coefficient shrinkage and reducing model complexity.
Lasso: along with shrinking coefficients, lasso performs feature selection as well (remember the 'selection' in the lasso full form?). As we observed earlier, some of the coefficients become exactly zero, which is equivalent to the particular feature being excluded from the model.

How Lasso performs feature selection ?


Before explaining this property, let's look at another way of writing the minimisation objective. One can show that the lasso and ridge regression coefficient estimates solve, respectively, the constrained problems of minimizing RSS subject to Σ|βj| ≤ s (lasso) and Σβj² ≤ s (ridge).

In other words, for every value of α, there is some 's' such that the two formulations (the penalized and the constrained cost functions) will give the same coefficient estimates.
When p = 2, the constrained formulation indicates that the lasso coefficient estimates have the smallest RSS out of all points that lie within the diamond defined by |β1| + |β2| ≤ s.
Similarly, the ridge regression estimates have the smallest RSS out of all points that lie within
the circle defined by (β1)²+(β2)²≤s
Now, the above formulations can be used to shed some light on the issue.
The least squares solution is marked as βˆ, while the blue diamond and circle represent the
lasso and ridge regression constraints as explained above.
If ‘s’ is sufficiently large, then the constraint regions will contain βˆ, and so the ridge regression
and lasso estimates will be the same as the least squares estimates. (Such a large value of s
corresponds to α=0 in the original cost function). However, in figure, the least squares estimates
lie outside of the diamond and the circle, and so the least squares estimates are not the same as
the lasso and ridge regression estimates. The ellipses that are centered around βˆ represent
regions of constant RSS.
In other words, all of the points on a given ellipse share a common value of the RSS. As the
ellipses expand away from the least squares coefficient estimates, the RSS increases. The above
equations indicate that the lasso and ridge regression coefficient estimates are given by the first
point at which an ellipse contacts the constraint region.
Since, ridge regression has a circular constraint with no sharp points, this intersection will not
generally occur on an axis, and so the ridge regression coefficient estimates will be exclusively
non-zero.
However, the lasso constraint has corners at each of the axes, and so the ellipse will often
intersect the constraint region at an axis. When this occurs, one of the coefficients will equal
zero. In higher dimensions, many of the coefficient estimates may equal zero simultaneously. In
figure, the intersection occurs at β1=0, and so the resulting model will only include β2.

When to use Lasso (L1) and when to use Ridge (L2)?


Ridge: In majority of the cases, it is used to prevent overfitting. Since it includes all the
features, it is not very useful in case of exorbitantly high features, say in millions, as it will pose
computational challenges.
Lasso: Since it provides sparse solutions(sparse means the majority of x's components (weights)
are zeros, only few are non-zeros), it is generally the model of choice (or some variant of this
concept) for modeling cases where the features are in millions or more. In such a case, getting a
sparse solution is of great computational advantage as the features with zero coefficients can
simply be ignored.

Are L1, L2 and L1/L2 regularization the only methods to overcome overfitting?
No, there are other methods to address this problem, such as dropout, data augmentation and early stopping, which will be discussed in detail in the deep learning part.

For deeply Probabilistic view of regularization techniques with intensive mathematical proofs
 check this link (https://towardsdatascience.com/regularization-an-important-concept-in-
machine-learning-5891628907ea)
Regularization for logistic regression:
Remember: J(θ) = -(1/m) Σ [ y(i) log(hθ(x(i))) + (1 - y(i)) log(1 - hθ(x(i))) ]
Added term: (λ/2m) Σj=1..n θj²

For gradient descent, it's the same as for linear regression (with the sigmoid hypothesis).

Lec8:
Images are dealt with as a bunch of pixel intensities, which act as the features of the image. Assume a 50×50 pixel image; the image has 2500 pixels (7500 if RGB), which means the number of features representing this image is 2500 (x = the vector of all pixel intensities). But, as we have said, most of our problems can't be handled using linear representations, which means we need a non-linear hypothesis; if we use, say, all quadratic combinations of these features, the result will be millions of features, which is very hard to deal with, and here comes the idea of neural networks.

Neural network representation:


Neuron model:
x0 is called the bias unit (= 1).
x1, x2, x3 are called the input neurons (input layer).
The yellow neuron applies the activation function.
Note: if hθ = 1/(1 + e^(-θTx)), then the neuron uses the logistic (sigmoid) activation function.

Neural Network:
-x0 and a0(2) (the bias units) are usually not drawn.
-Layer 1 is called the input layer.
-Layer 3 (the last layer) is called the output layer.
-All layers between the first and last layers (layer 2 in our case) are called hidden layers.
-ai(j) means the "activation" of unit i in layer j.
-θ(j) is the matrix of weights controlling the function mapping from layer j to layer (j+1).
Notes:
1-If the network has sj units in layer j and sj+1 units in layer j+1, then θ(j) will be of dimension sj+1 × (sj + 1).
2-θ10(1)x0 + θ11(1)x1 + θ12(1)x2 + θ13(1)x3 = z1(2), and so on.
3-Generally: ai(l) = g(zi(l)) = g(Σk θik(l-1) · ak(l-1)), where the sum runs over all inputs k to this unit (including the bias), taking into consideration that a(1) = x = the input.
Forward propagation:
This is called forward propagation as we move forward from the input layer towards the output layer:
a(1) = x, z(l+1) = θ(l)a(l), a(l+1) = g(z(l+1)) (adding the bias unit a0(l+1) = 1 at each hidden layer), until the output layer gives a(L) = hθ(x).
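A minimal Python/NumPy sketch of forward propagation as described above (the network shape and random weights are hypothetical):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, thetas):
    """thetas[l] maps layer l+1 to layer l+2; a bias unit of 1 is prepended at every layer."""
    a = x                                   # a(1) = x
    for theta in thetas:
        a = np.concatenate(([1.0], a))      # add bias unit a0 = 1
        a = sigmoid(theta @ a)              # a(l+1) = g(theta(l) a(l))
    return a                                # output layer activation = h_theta(x)

# Hypothetical 3-4-1 network: theta1 is 4x4 (s2 x (s1+1)), theta2 is 1x5 (s3 x (s2+1))
theta1 = np.random.randn(4, 4) * 0.1
theta2 = np.random.randn(1, 5) * 0.1
print(forward_propagate(np.array([0.5, -1.2, 0.3]), [theta1, theta2]))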

How can neural networks represent complex non-linear hypotheses?

AND Gate:
hθ(x) = g(θ10 · 1 + θ11 · x1 + θ12 · x2)
Our goal is to find values of the θ's such that:
if x1 = 0 and x2 = 0 → hθ(x) = 0
if x1 = 0 and x2 = 1 → hθ(x) = 0
if x1 = 1 and x2 = 0 → hθ(x) = 0
if x1 = 1 and x2 = 1 → hθ(x) = 1

Assume θ10 = -30, θ11 = 20, θ12 = 20. This leads to the truth table shown: the assumption succeeds in creating a non-linear function that implements the logical AND function.
Note: This is not the only assumption that solves this problem; there are infinitely many solutions.
Similarly, for an OR gate, if you use θ10 = -10, θ11 = 20, θ12 = 20, it will implement the logical OR function.
You can apply the same method for any logical function (NOT, XOR, XNOR, NOR, NAND).
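A tiny Python/NumPy check of the AND-gate parameters above (illustrative only):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-30.0, 20.0, 20.0])     # theta_10, theta_11, theta_12 from the AND example
for x1 in (0, 1):
    for x2 in (0, 1):
        h = sigmoid(theta @ np.array([1.0, x1, x2]))
        print(x1, x2, round(h))           # close to 1 only when both inputs are 1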
Multi-class classification using a neural network representation:
We notice that the output layer is now composed of several nodes rather than only one node (binary classification), and each node is responsible for determining a specific class by producing a probability estimate for that class.
Cost function:
Remember, for logistic regression:
J(θ) = -(1/m) Σi [ y(i) log(hθ(x(i))) + (1 - y(i)) log(1 - hθ(x(i))) ] + (λ/2m) Σj θj²
For a neural network:
J(θ) = -(1/m) Σi=1..m Σk=1..K [ yk(i) log((hθ(x(i)))k) + (1 - yk(i)) log(1 - (hθ(x(i)))k) ] + (λ/2m) Σl=1..L-1 Σi=1..sl Σj=1..sl+1 (θji(l))²

m: denotes the number of training examples.
K: denotes the number of classes (output nodes). L: denotes the number of layers.
sl: denotes the number of nodes in the current layer. sl+1: denotes the number of nodes in the next layer.
Lec9:
Backpropagation:
Backpropagation is used to perform the gradient computation needed to minimize the cost function.
Remember: the forward propagation equations are as given above.

-To perform the gradient computation we use δj(l) to denote the "error" of node j in layer l.
-For each output unit we have δj(4) = aj(4) - yj, where yj is the actual value and aj(4) (which equals hθ(x)j) is the predicted value.
Backpropagation algorithm:
δ(4)=a(4) - y
δ(3)=(θ(3))T δ(4) .* g'(z(3)) =(θ(3))T δ(4) .* (a(3) .*(1-a(3)))
δ(2)=(θ(2))T δ(3) .* g'(z(2)) =(θ(2))T δ(3) .* (a(2) .*(1-a(2)))

Notes: 1-No δ(1) exists as it corresponds to input layer which is features observed.
2- .* is element wise multiplication.
3-remember that g(z) means the sigmoid of variable z.
Now, since the gradient needs the derivative of the cost function with respect to θ, after some complex mathematical derivation one arrives at (ignoring regularization):
∂J/∂θij(l) = aj(l) δi(l+1)

Full Backpropagation algorithm for a training set:

Visualization:

Try to visualize this figure and see how forward propagation is done and how δ is computed.
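A minimal Python/NumPy sketch of the backpropagation pass for a single training example in a 3-layer network, following the δ equations above (function name and network shape are illustrative; regularization is omitted):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single_example(x, y, theta1, theta2):
    """Gradients of the (unregularized) cost for one example in a 3-layer network."""
    # Forward pass
    a1 = np.concatenate(([1.0], x))          # add bias unit
    z2 = theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))
    z3 = theta2 @ a2
    a3 = sigmoid(z3)                          # = h_theta(x)
    # Backward pass
    delta3 = a3 - y                           # output-layer error
    delta2 = (theta2.T @ delta3)[1:] * (sigmoid(z2) * (1 - sigmoid(z2)))  # drop the bias component
    grad2 = np.outer(delta3, a2)              # dJ/dtheta2
    grad1 = np.outer(delta2, a1)              # dJ/dtheta1
    return grad1, grad2

theta1 = np.random.randn(4, 4) * 0.1          # 3 inputs (+bias) -> 4 hidden units
theta2 = np.random.randn(1, 5) * 0.1          # 4 hidden units (+bias) -> 1 output
g1, g2 = backprop_single_example(np.array([0.5, -1.2, 0.3]), np.array([1.0]), theta1, theta2)
print(g1.shape, g2.shape)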

Weights (θ) initialization:


The parameters θ should be initialized with some values before starting forward and backward
propagation. In logistic regression you can initialize them to zeros, but in neural networks you can't.
Why isn't zero initialization acceptable?
Because after each update, the parameters corresponding to the inputs feeding every hidden
unit remain identical, which means all of the hidden units compute exactly the same feature,
i.e. a highly redundant representation (the symmetry is never broken).

The usual way to initialize theta is to set each θ_ij^(l) to a random value in the range [-ϵ, ϵ],
where ϵ is a small number close to zero (but not zero).
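A one-line sketch of this idea in NumPy (the layer sizes and the value of ϵ are arbitrary placeholders):

import numpy as np

s_in, s_out, eps = 3, 4, 0.12                 # units feeding into / out of the layer; eps chosen arbitrarily
Theta = np.random.uniform(-eps, eps, size=(s_out, s_in + 1))   # extra column for the bias unit
print(Theta.shape)                            # (4, 4), values in (-0.12, 0.12)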
Neural Networks(NN), Forward propagation, Backpropagation and Weights Initialization
Neural Networks:
A neural network (AKA artificial neural network, ANN) is a computing system made up of a number of simple, highly interconnected processing elements which process information by their dynamic state response to external inputs. ANNs are a rough approximation of how the human brain is structured. An ANN is organized in layers, and each layer is composed of several units called neurons. The neurons in each layer are connected to the neurons of the following and preceding layers through synapses (interconnections), so the first layer holds the input data (features), which pass forward through these interconnections until they reach the far end, called the output layer. Neural networks generally consist of 3 parts: 1) neurons (units), 2) weights (interconnections), 3) biases. The layers are mainly an input layer, an output layer and several hidden layers in between. Neural networks have several types, such as convolutional neural networks (CNN), recurrent neural networks (RNN) and so on… (the types will be discussed in detail in the deep learning part)

What are the neurons, weights and biases ?


A neuron is the basic unit of a neural network. It receives a certain number of inputs (a1, a2, …, an) and a bias
value (b), such that each input is multiplied by a number (a weight w1, w2, …, wn); all these products are then
summed (∑) and the bias value is added. Afterwards, this sum is passed through what is called an
activation function (e.g. the sigmoid function g(z)), and the result leaves the neuron as the output (a_out).
What is the origin of the idea behind these neurons?
The origin of this idea is what is called the perceptron. The perceptron was the precursor to the idea
of the ANN. The difference between the perceptron and multilayer neurons is the output: the
output of the perceptron is always compared to a certain threshold and is then either 0 or 1,
while an ANN node also sums weighted inputs, but instead of producing a binary output of 0
or 1 based on a threshold, it produces a graded value between 0 and 1 based on how
close the input is to the desired category (the "1" value). The network nodes are usually biased
toward favoring the extreme values of 0 or 1 by using a sigmoidal output function. In other words,
we can say that the perceptron is a single-layer neural network acting as a binary classifier. In
addition, the perceptron is not trained with backpropagation, which will be discussed later.

What is the importance of the bias in neural networks?


The bias value helps in shifting the activation function to the left or the right, which is often
critical for successful learning. For example, when using the sigmoid as the activation function,
suppose you need the output to be ≈ 0 when the input is 2: the plain sigmoid of the weighted inputs
would not be able to do that without such a bias. So the bias input always equals 1 and is controlled by its
weight (w_b) to shift the curve. It also ensures that if all inputs are zero, there can still be a non-zero
activation in the neuron.

What is the importance of activation function(transfer function or mapping function ) and


its types?
As we have seen, each neuron computes a linear combination of the weights and the
input features, which by default gives a linear decision boundary. Since in
practice most of our problems are non-linear, we have to add some sort of non-linearity
to our model, which is the main role of the activation function. To sum up, the role of
the activation function in a neural network is to produce a non-linear decision boundary from
linear combinations of the weighted inputs. Activation functions can be the sigmoid, tanh, ReLU
or leaky ReLU functions, which will be discussed in detail later in the deep learning part.
What are the layers of neural networks in details?
A neural network mainly has 3 types of layers, which are the input layer, the output layer and the
hidden layers.
Input layer: The first layer of the neural network, which takes the input signals (features or data)
and just passes them to the first hidden layer, with no weights or bias values associated.
Hidden layers: These are the layers following the input layer, where all the computations explained
before take place. We can say that one hidden layer is a set of neurons stacked vertically.
As the number of hidden layers increases (a deeper network), the model can learn richer representations.
Output layer: The last layer in the network, which receives its input from the last hidden layer;
the number of classes to be classified is determined by the number of neurons in this layer.

Forward propagation:
It is the process of feeding input values to the neural network and getting an output, which we call
the predicted value. Sometimes we refer to forward propagation as inference. When we feed the input
values to the neural network's first layer, they pass through without any operations. The second layer takes
the values from the first layer, applies multiplication, addition and activation operations, and passes
the result to the next layer. The same process repeats for subsequent layers and finally we get an
output value from the last layer. We can see that it is called forward propagation because we are
moving forward from the input layer to the output layer.
Forward propagation alone is not sufficient for learning: after reaching the output layer for the
first time there is no mechanism to go back and improve the parameters. Here comes the idea of
backpropagation, which moves backwards from the output layer towards the input layer, and the
two passes alternate until we reach a well-trained model with suitable learning parameters
(weights, or θ's).
Equation: a^(n+1) = g((∑_i θ_i^(n) · a_i^(n)) + b), where the subscript defines the node (neuron), the
superscript defines the layer number and g() defines the activation function. This equation is
applied for each neuron of each layer.
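A small NumPy sketch of this per-layer computation, vectorized so that each layer is a single matrix product; the layer sizes, weights, biases and input are random placeholders.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
layer_sizes = [3, 5, 4, 2]                    # input, two hidden layers, output (arbitrary sizes)
weights = [rng.standard_normal((m, n)) for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.standard_normal(m) for m in layer_sizes[1:]]

a = rng.standard_normal(layer_sizes[0])       # a^(1) = x, one example
for W, b in zip(weights, biases):
    a = sigmoid(W @ a + b)                    # a^(n+1) = g(W a^(n) + b), layer by layer
print(a)                                      # output-layer activations (2 values in (0, 1))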
Backpropagation
Backpropagation is performed after the feedforward pass in a continuous, iterative manner: we
calculate the cost function (e.g. using MSE), then apply gradient descent by taking the partial derivative with
respect to each weight feeding the output layer, then move back to the previous layer with these
gradients, and so on, using the chain rule of differential calculus. This process is repeated
consecutively until we reach the first (input) layer, at which point we have the gradient of every
weight in the neural network, so we apply the update equation of the learning parameters
(θ_new = θ_old - α·grad(J w.r.t. θ)) to reduce the error, then run feedforward propagation again with
the new learning parameters, and repeat this forward-then-backward cycle until
we reach the lowest cost available, i.e. a local minimum found by gradient
descent. (This topic will be discussed in more detail, with an example, in the deep learning part.)

Weights initialization
Weights (θ's) should be initialized randomly with values near 0, in (-ϵ, ϵ), but they should be non-zero.
Even so, this can cause problems known as vanishing and exploding gradients, along with some other issues:
if the network is very deep and the weights are very small, the gradients calculated in backpropagation are
proportional to these small weights, which means the learning process takes a very long time. Another
problem is that the distribution of each neuron's output has a variance that gets larger with more inputs; to
control this we scale the weights by 1/√(n_i), where n_i is the number of inputs to the neuron, so the weights
end up roughly in the range (-1/√(n_i), 1/√(n_i)).
Weights can also be initialized by something called transfer learning, where we start from weights that
were obtained in another project with a similar objective (discussed in detail in deep learning).
A closely related scaling scheme is called Xavier/Glorot initialization, discussed next.
Note: Normally distributed means having a mean of 0 and a standard deviation of 1.
What is Xavier/Glorot initialization and how it works?
Xavier initialization makes sure that the initial weights are "just right" so that the input
signals propagate reasonably through the layers. The values here are also drawn from a zero-centred
distribution, as in normal random initialization, but instead of scaling by 1/sqrt(n_i), where
n_i is the number of inputs to the neuron, we scale by 1/sqrt(n_i + n_o), where n_o is the
number of neurons in the next layer. What is the concept behind that? This sort of initialization
helps keep the weight matrix neither too much bigger than 1 nor too much smaller than 1, so the
gradients neither explode nor vanish. For the mathematical derivation visit
(http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization)
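A sketch contrasting the two scalings mentioned above (both are simplifications; the exact constants vary between references):

import numpy as np

n_in, n_out = 256, 128                        # fan-in / fan-out of a layer (placeholder sizes)
rng = np.random.default_rng(2)

w_simple = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)          # scale by 1/sqrt(n_i)
w_xavier = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in + n_out)  # Xavier/Glorot-style scaling

print(w_simple.std(), w_xavier.std())         # roughly 1/sqrt(256) vs 1/sqrt(384)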

The topic of neural networks is mainly a deep learning topic, which is why much of this material will be
discussed in more detail in the deep learning part, but at least the main points discussed here
should be known very well.
Remember: The number of input units is determined by the number of features, while the number of output units is
determined by the number of classes.
The number of hidden layers is not fixed; it is an architectural choice. Usually we choose the number of
hidden units in each layer to be comparable to the number of features, e.g. twice or three times that number.

Lec10:
Debugging learning algorithms:
Assume you are testing an implemented regularized linear regression to predict a housing price,
but in testing it gives an unacceptably large errors in its predictions. What should you try next?
1-Get more training examples. → fixes overfitting (high variance)
2-Try smaller sets of features. → fixes overfitting
3-Try getting additional features (not redundant, of course). → fixes underfitting (high bias)
4-Try adding polynomial features. → fixes underfitting
5-Try adjusting the regularization parameter λ: increasing λ fixes overfitting, decreasing λ fixes underfitting.

Evaluation of learning algorithm (hypothesis):


Having a very low training error doesn't mean by default that your learning algorithm is good.
May be it's just overfitting the data in training and will fail to generalize in testing phase.
To evaluate your hypothesis:
Split your training data (dataset) to 3 parts :
1- Training set (~ 60%) 2-Cross-Validation set (~20%) 3-Test set (~20%).
Make sure that the split data are randomly selected (in case the data are sorted). Start training
on the training set, applying forward and backward propagation with different hypotheses, until
you reach a set of thetas for every hypothesis that gives a reasonably low training error; then
compute the cost function (or the misclassification error) on the validation set to choose the
best-fitting model. At last, estimate generalization on the test set
to make sure that the hypothesis picked after validation generalizes well.
Note: The (0/1) misclassification error means counting the number of
misclassified examples in the validation or test set and dividing by the total number of validation
or test examples (m_cv or m_test).
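A sketch of a 60/20/20 split using scikit-learn's train_test_split called twice (shuffling is on by default); the arrays here are placeholders.

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(100, 5), np.random.randint(0, 2, 100)    # placeholder data

# First carve out 20% for the test set, then 25% of the remaining 80% (= 20% overall) for validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_cv, y_train, y_cv = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_cv), len(X_test))    # 60, 20, 20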

Diagnosing bias-variance:
This trade-off between using hypothesis
of different degree where some could
cause underfitting and some could cause
overfitting.
How to know it's a bias problem (underfit) or a variance problem (overfit) ?

The answer is simply :


Bias problem (underfit) → high training error and high validation error (Jcv(θ) ↑ & Jtrain(θ) ↑).
Variance problem (overfit) → low training error and high validation error (Jcv(θ) ↑ & Jtrain(θ) ↓).

Note: A high value of λ could lead to underfitting and a very low value
of λ (≈ 0) could lead to overfitting. Usually try λ =
0.01, 0.02, 0.04, 0.08, 0.1, 0.2, 0.5, 0.7, 1, …, 10 and choose the value that gives the
best Jcv(θ).

Lec11:
Machine learning system design:

Skewed classes:
This means we have many more examples of one class than of the other. This will certainly cause
a problem when evaluating your hypothesis, because a classifier can achieve a very low error simply by
almost always predicting the majority class, while performing poorly on the rare class.
Precision/Recall:

Assume we are using logistic regression on 0/1 classification so we


predict 1 if hθ(x)≥0.5 and predict 0 if hθ(x)<0.5. To get a higher
precision and lower recall we increase 0.5 to 0.7 or 0.8 for an example,
and to get a higher recall and lower precision we decrease 0.5 to 0.4 or
0.3 for example. Generally, we predict 1 if hθ(x)≥threshold where as
the threshold increase the precision increases and recall decrease and
vice versa.

Lec12:
Support Vector Machine (SVM):
As we know from logistic regression, the cost function is as follows:
J(θ) = -(1/m) Σ_{i=1}^{m} [ y^(i) log(hθ(x^(i))) + (1 - y^(i)) log(1 - hθ(x^(i))) ] + (λ/2m) Σ_j θ_j²
So the per-example term has 2 parts:
1- for y = 1, the 1st term takes control: -log(1/(1 + e^(-z)))    2- for y = 0, the 2nd term takes control: -log(1 - 1/(1 + e^(-z)))
(where z = θᵀx^(i))
Datasets split, Skewed classes and Evaluation metrics
Data splitting
It is a standard in machine learning to split the data into training and test sets. The reason for
this is very straightforward: if you try and evaluate your system on data you have trained it on,
you are doing something unrealistic. The whole point of a machine learning system is to be able
to work with unseen data: if you know you are going to see all possible values in your training
data, you might as well just use some form of lookup.
However, the two-way split is over-simplistic. Real ML typically involves four phases:
1) Training
2) Development (also known as Validation or Tuning (statistical))
3) Testing (aka Evaluation)
4) Use
The ultimate goal of any ML system is to achieve the best possible performance in actual use
(phase 4), or in other words on unseen data. The evaluation and statistical methods we
use are there to make sure you are not fooling yourself into thinking your system is better than it actually is.

What is the role of each phase of data?


Training data: The sample of data used to fit the model. The actual dataset that we use to train
the model (weights and biases in the case of Neural Network). The model sees and learns from
this data.
Validation data: The sample of data used to provide an unbiased evaluation of a model fit on the
training dataset while tuning model hyper-parameters. The evaluation becomes more biased as
skill on the validation dataset is incorporated into the model configuration. The validation set is
used to evaluate a given model, but this is for frequent evaluation. We as machine learning
engineers use this data to fine-tune the model hyper-parameters. Hence the model
occasionally sees this data, but never does it “Learn” from this. We use the validation set results
and update higher level hyper-parameters. So the validation set in a way affects a model, but
indirectly.
Testing data: The sample of data used to provide an unbiased evaluation of a final model fit on
the training dataset. The Test dataset provides the gold standard used to evaluate the model. It is
only used once a model is completely trained(using the train and validation sets). The test set is
generally what is used to evaluate competing models (For example on many Kaggle
competitions, the validation set is released initially along with the training set and the actual test
set is only released when the competition is about to close, and it is the result of the model on
the Test set that decides the winner). Many a times the validation set is used as the test set, but it
is not good practice. The test set is generally well curate. It contains carefully sampled data that
spans the various classes that the model would face, when used in the real world.
What is the data split ratio commonly used?
Most often it depends on the use case of each problem, but a common choice is to split the data
into 60% for training and 20% each for validation and testing. Many
a times, people first split their dataset into 2: Train and Test. After this, they keep aside the
Test set, and randomly choose X% of their Train dataset to be the actual Train set and the
remaining (100-X)% to be the Validation set, where X is a fixed number (say 80%); the model is
then iteratively trained and validated on these different sets. There are multiple ways to do this,
and is commonly known as Cross Validation. Basically you use your training set to generate
multiple splits of the Train and Validation sets. Cross validation avoids over fitting and is
getting more and more popular, with K-fold Cross Validation being the most popular method of
cross validation.

K-fold cross-validation
A resampling procedure used to evaluate machine learning models on a limited dataset.
When there is not enough data to train your model, removing a part of it for validation is likely
to pose an underfitting problem: by reducing the training data, we risk losing
important patterns/trends in the data set, which in turn increases the error induced by bias. In K-fold
cross validation, the data is divided into k subsets. Now the holdout method is repeated k times,
such that each time, one of the k subsets is used as the validation set and the other k-1 subsets
are put together to form a training set. The error estimation is averaged over all k trials to get
total effectiveness of our model. As can be seen, every data point gets to be in a validation set
exactly once, and gets to be in a training set k-1times. This significantly reduces bias as we are
using most of the data for fitting, and also significantly reduces variance as most of the data is
also being used in validation set. Interchanging the training and test sets also adds to the
effectiveness of this method. As a general rule and empirical evidence, K = 5 or 10 is generally
preferred, but nothing’s fixed and it can take any value.
The general procedure is as follows:
1-Shuffle the dataset randomly. 2-Split the dataset into k groups
3-For each unique group:
 Take the group as a hold out or test data set
 Take the remaining groups as a training data set.
 Fit a model on the training set and evaluate it on the test set
 Retain the evaluation score and discard the model
4-Summarize the skill of the model using the sample of model evaluation scores.

Importantly, each observation in the data sample is assigned to an individual group and stays in
that group for the duration of the procedure. This means that each sample is given the
opportunity to be used in the hold out set 1 time and used to train the model k-1 times.
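A sketch of the procedure above using scikit-learn's KFold with a placeholder model and data (cross_val_score wraps the same loop in a single call):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = np.random.rand(100, 4), np.random.randint(0, 2, 100)    # placeholder data

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))          # evaluate on the held-out fold
print(np.mean(scores))                                          # average skill over the k folds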
Is there any other types of cross-validation ?
There are 2 other types : 1)Stratified K-fold cross validation 2)Leave P-out cross-validation
Stratified K-fold: In some cases, there may be a large imbalance in the response variables. For
example, in dataset concerning price of houses, there might be large number of houses having
high price. Or in case of classification, there might be several times more negative samples than
positive samples. For such problems, a slight variation in the K Fold cross validation technique
is made, such that each fold contains approximately the same percentage of samples of each
target class as the complete set, or in case of prediction problems, the mean response value is
approximately equal in all the folds. This variation is also known as Stratified K Fold.
Leave P-out: This approach leaves p data points out of training data, i.e. if there are n data
points in the original sample then, n-p samples are used to train the model and p points are used
as the validation set. This is repeated for all combinations in which original sample can be
separated this way, and then the error is averaged for all trials, to give overall effectiveness. A
particular case of this method is when p = 1. This is known as Leave one out cross
validation. This method is generally preferred over the previous one because it does not suffer
from the intensive computation, as number of possible combinations is equal to number of data
points in original sample or n.

What are the limitations of cross-validation?


For cross validation to give some meaningful results, the training set and the validation set are
required to be drawn from the same population. Also, human biases need to be controlled, or
else cross validation will not be fruitful. (Logical )

Skewed classes
Skewed classes (AKA unbalanced classes) means one class is over-represented in the dataset,
which can lead to, say, 9x% classification accuracy. But indeed, when you dig deeper
you will find that such a model can be useless, since most of the data belongs to one class, so the score is
just a misleading evaluation. This is called the accuracy paradox, where the accuracy merely reflects the
underlying class distribution.

How to face the skewed classes problem ?


1) Collect more data: It may seem impractical sometimes, but it is usually the best solution
there is.
2) Oversampling or undersampling: Resampling your dataset means changing it so that it becomes
more balanced:
For oversampling: you can add copies of instances from the under-represented class.
For undersampling: you can delete instances from the over-represented class.
Always try both, not just one of them (a resampling sketch is shown after this list).
3)Using synthetic(‫ )مركبة‬samples: This method generally means to create some data (instances)
that are not natural but synthetic. The most popular synthetic samples generator is SMOTE
(Synthetic Minority Oversampling TEchnique) which create synthetic instances from the minor
class instead of creating just copies. The algorithm selects two or more similar instances (using
a distance measure) and perturbing an instance one attribute at a time by a random amount
within the difference to the neighboring instances.
4) Data augmentation: data augmentation means applying transformations to existing samples to enrich
your minority class. For images it has several techniques, such as scaling, translation, rotation,
flipping, adding salt-and-pepper noise, changing the lighting conditions and perspective transforms.
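A sketch of simple random over-sampling of the minority class with sklearn.utils.resample; under-sampling works the same way with replace=False on the majority class. The 0/1 labels and sizes are made up.

import numpy as np
from sklearn.utils import resample

X = np.random.rand(110, 3)
y = np.array([0] * 100 + [1] * 10)            # skewed: 100 vs 10 examples

X_min, X_maj = X[y == 1], X[y == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)   # copy minority rows

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(np.bincount(y_bal))                     # [100 100] -> balanced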
Evaluation/Performance metrics
Classification accuracy is not the best metric to use especially when working on unbalanced
dataset as it is misleading. You model could give high accuracy score while giving poor results
on other metrics so it is not enough. There are a lot of performance metrics to be used but we
will discuss some types of the most known methods of evaluation.
1) Confusion matrix (AKA contingency table):
An N×N matrix, where N is the number of classes to be classified, in which one of the rows/columns
corresponds to the predicted classes and the other to the actual classes; accuracy is measured by adding
the entries on the diagonal (correct classifications) and dividing by the total number of examples. It is
usually used in classification problems. (Accuracy = (sum of diagonal entries) / (total number of examples);
for the binary case this is (TP + TN) / (TP + TN + FP + FN).)
2) Precision/Recall:
In fact this type is derived from the confusion matrix (associated with the confusion matrix).
Precision measures exactness while recall measures completeness. A trade-off exists between
precision and recall.
Precision of class x = (correct predictions of class x) / (all examples predicted as class x)
Recall of class x = (correct predictions of class x) / (all examples that actually belong to class x)
For our example (a 3×3 confusion matrix with entries A…I, actual classes as rows and predicted classes as columns):
Precision of class 1 = A/(A+D+G)
Recall of class 1 = A/(A+B+C)
Precision of class 2 = E/(B+E+H)
Recall of class 2 = E/(D+E+F)
and so on …

3) F1 score:
A weighted average of precision and recall, or in other words the harmonic mean
of precision and recall. It ranges between 0 and 1. The closer the F1 score is to 1, the
better our model performs.
F1_score = 2 · (precision · recall) / (precision + recall)
Note: You can treat the 3 previous methods as one method, as they are all related to the same
concept.
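A small sketch computing these three metrics per class from a confusion matrix laid out like the A…I example above (rows = actual, columns = predicted); scikit-learn's precision_recall_fscore_support gives the same numbers.

import numpy as np

# rows = actual class, columns = predicted class (3-class example, made-up counts)
cm = np.array([[50,  3,  2],
               [ 5, 40,  5],
               [ 2,  4, 44]])

for k in range(cm.shape[0]):
    tp = cm[k, k]
    precision = tp / cm[:, k].sum()           # column sum: everything predicted as class k
    recall = tp / cm[k, :].sum()              # row sum: everything that is actually class k
    f1 = 2 * precision * recall / (precision + recall)
    print(f"class {k + 1}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")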

4) Mean absolute error (MAE) and mean squared error (MSE):
These are well known and were discussed before, but we want to add that MSE is often preferred as it is easier to
compute its gradient, whereas mean absolute error requires more complicated (linear-programming-style)
tools to compute the gradient. Also, as we take the square of the error, the effect of larger errors becomes
more pronounced than that of smaller errors, hence the model can focus more on the larger errors.
5) Logarithmic loss:
This was explained before in the cost function of logistic regression but shown in binary
classification only. We will add here the formula used for multiclass classification. Logarithmic
Loss or Log Loss, works by penalizing the false classifications. It works well for multi-class
classification. When working with Log Loss, the classifier must assign probability to each class
for all the samples. Suppose there are N samples belonging to M classes; then the Log Loss is
calculated as:
LogLoss = -(1/N) Σ_{i=1}^{N} Σ_{j=1}^{M} y_ij · log(p_ij)
where
y_ij indicates whether sample i belongs to class j or not,
p_ij indicates the probability of sample i belonging to class j.
Log Loss has no upper bound and it exists on the range [0, ∞). Log Loss nearer to 0 indicates
higher accuracy, whereas if the Log Loss is away from 0 then it indicates lower accuracy.
In general, minimizing Log Loss gives greater accuracy for the classifier.
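A direct sketch of the formula above (the clip keeps log(0) out of the sum; scikit-learn's log_loss does the same internally):

import numpy as np

def multiclass_log_loss(y_onehot, probs, eps=1e-15):
    probs = np.clip(probs, eps, 1 - eps)                # avoid log(0)
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

y_onehot = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])                   # y_ij
probs = np.array([[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]])    # p_ij
print(multiclass_log_loss(y_onehot, probs))             # ~0.363; closer to 0 is better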

There are many more types of evaluation metrics, such as AUC-ROC, Kolmogorov-Smirnov charts,
and gain and lift charts.
Analogy to understand precision and recall idea and the trade-off between them.
To explain precision and recall, we have found the fishing example to be helpful. In this
example, we have a pond of fish and we know the total number of fish within. Our goal is to
build a model that catches red fish (we may say that we want to ‘predict’ that we catch red
fish). In our first test we have a model that consists of two fishing poles with bait made from a
recipe based on scientific analysis of what red fish like. The precision metric, is about making
sure your model works accurately (or that the predictions you make are accurate). With our fish
example, this means that the fish caught with the special bait are, in fact, red. The following
test shows great precision—the two fish caught were both red. We were trying to (or predicted
we would) catch red fish, and all of the fish that we caught were red.

There is one small problem here though. We knew


there were a lot more fish in the pond and you might
notice that when we looked closer we also found a lot
more red fish that we didn’t catch. How can we do a
better job of catching more of the red fish?
Here is where our other measure, recall, comes into
play. Recall awards us for catching more of the red
fish. In our first model we didn’t do a very good job
of this. So although our precision was good our recall
was not so good.
Knowing this we decide to develop a new model using a fishing net and a new bait recipe. The
picture below shows the result of our first test with this new model. We caught more of the red
fish! Our recall has definitely improved.

Unfortunately you will notice that we caught a lot of


blue fish in our net as well. We weren’t trying to
catch them but they ended up in our net anyway. We
were trying to (or predicted we would) catch all red
fish and, while we caught more red fish, many of the
fish caught were blue. Our precision has suffered
even though we improved our recall.
This is the fundamental trade-off between precision
and recall. In our model with high precision (most or all of the fish we caught were red) had
low recall (we missed a lot of red fish). In our model with high recall (we caught most of the
red fish), we had low precision (we also caught a lot of blue fish).

In other words: the precision for class X is, out of all the examples I predicted as class X, how many actually belong to class X.

The recall for class X is, out of all the data that actually belongs to class X, how many of them I predicted as class X.

The idea may be a little confusing at first, but it is quite intuitive once it clicks.


The support vector machine differs from logistic regression in that it does not use the log curves but
approximates them with straight-line segments (the red and blue lines in the course figure), which we
call cost1(z) and cost0(z) respectively, noting that z = θᵀx^(i).
In the support vector machine we also remove the 1/m factor, noting that this does not change the optimal
value of theta.
Moreover, we know that our optimization objective is minimizing a cost function of the form A + λB
(after adding regularization), but we will change it to CA + B, so C is comparable to 1/λ.
Now our final form of the cost function for the SVM is:
min_θ C Σ_{i=1}^{m} [ y^(i) cost1(θᵀx^(i)) + (1 - y^(i)) cost0(θᵀx^(i)) ] + (1/2) Σ_{j=1}^{n} θ_j²
Unlike logistic regression, the hypothesis here doesn't output a probability but predicts either 1 or 0 directly.

SVM is called large margin classifier. What does this mean ?

The answer is that the constraints changed from being greater/smaller than 0, as in logistic regression,
to ≥ 1 and ≤ -1 in the SVM, which leads to a large-margin decision boundary, as in the second figure.
We can notice in the 2nd figure that there are many linear hypotheses that can classify these two classes
(magenta, green, black and more), but the SVM picks the black one, which gives the largest possible
margin between the 2 classes to act as a tolerance; this is a result of the mathematical derivation we
did before. But here come 2 questions: what if there are outliers, which is quite common in real
problems? And how does the mathematical derivation we did lead to this choice of large-margin classifier?
For outliers:
Assume you have an outlier like the red cross on the left, so
you have 2 options: either choose the black line
and keep a large margin, leaving the red cross on the wrong
side, or choose the magenta line with a very small margin
but all data on the correct side. Here comes the role of the
constant C: if C is a large number the SVM will choose the
magenta line, and if C is a small number it will choose the black
line. Similarly, if we have more crosses on the left side and more
circles on the right side, as in the following figure, C will control
how the separating line is chosen.

For how the mathematical derivation leads to this choice:


Check lecture 12.3 of the Coursera videos (Andrew NG).

Now, we want to adapt the SVM to be able to develop complex non-linear classifier, this will
be done using a technique called kernel.

Kernel:
Assume you have a training set as shown, it's obvious that
the decision boundary will need non-linear hypothesis to
represent it. So it will need a hypothesis like that
θ0+ θ1x1+ θ2 x2+ θ3 x1 x2+ θ4 x12+ θ5 x22+….
and hθ(x) gives prediction 1 if θ0+ θ1x1+ θ2 x2+ θ3 x1 x2+ θ4
x12+ θ5 x22+…. ≥ 0 and gives 0 otherwise.
Let's try writing it in different notations θ0+ θ1f1+ θ2 f2+ θ3 f3+ θ4 f4+ θ5 f5+….
such that f1=x1, f2=x2, f3=x1x2, f4=x12,… and so on
We know that using a higher order polynomial is very highly computationally expensive. So the
question is : is there a different/better choice of these features f 1,f2,f3,… rather than those higher
order polynomials and gives us a non-linear decision boundary ?
We are going to manually pick some landmarks (l^(1), l^(2), l^(3), …). Knowing that I have a training set as
in the previous figure (the x's), I'm going to define the following:
f1 = similarity(x, l^(1)) = exp(-||x - l^(1)||² / (2σ²)),
f2 = similarity(x, l^(2)) = exp(-||x - l^(2)||² / (2σ²)), … and so on.
Note: similarity is also known as the Gaussian kernel k(x, l^(i)).
Now if the training example x ≈ l^(i), this means that the training example is close to this
landmark l^(i), which means that fi ≈ e^0 ≈ 1; similarly, on the other hand, if x is far away from the
landmark l^(i), this means that fi ≈ e^(-(large number)) ≈ 0. (Note: f0 = 1 always.)
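A sketch of computing these kernel features f for one example x against a few landmarks; the landmark coordinates and σ are made-up values.

import numpy as np

def gaussian_kernel(x, landmark, sigma):
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

landmarks = np.array([[3.0, 2.0], [1.0, 1.0], [5.0, 5.0]])    # l(1), l(2), l(3)
x, sigma = np.array([3.1, 2.1]), 1.0

f = np.array([1.0] + [gaussian_kernel(x, l, sigma) for l in landmarks])   # f0 = 1, then f1..f3
print(f)    # f1 close to 1 (x near l(1)), f3 close to 0 (x far from l(3))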

By noticing this figure, we see that we now have new features f1, f2, f3 instead of x1, x2, x1x2, and by
computing the similarity (kernels) for any point in the training set we are able to predict either 0 or 1,
depending on its distance from the landmarks and the values of the θ's, and this assigns a decision
boundary which is non-linear.

How to choose locations of landmarks?


Put landmark at each training example given.

Summary:

Now instead of having features x^(i), we have f^(i) → the vector of similarities between x^(i) and all m landmarks.


Support Vector Machine (SVM), Kernels and SVM with kernels
Support vector machine (AKA Maximum margin classifier):
SVM is a supervised machine learning linear model for both
classification and regression tasks.
However, SVM is mostly used with classification. SVM
attempts to pass a linearly separable hyper-plane through a
dataset in order to classify the data into two groups. This
hyper-plane is a linear separator for any dimension; it could
be a line (2D), plane (3D), and hyper-plane (4D+). Support
Vectors are simply the co-ordinates of individual
observation. Support Vector Machine is a frontier which best
segregates the two classes (hyper-plane/ line).

What is the theory of SVM or in other words how it works?


We have data here distributed as shown, belonging to
2 classes (red and blue). This data is linearly separable by an
infinite number of lines (e.g. lines A, B and C in the figure), so here comes
the question: which one separates the data best?
Let's take 3 of those infinite lines, as shown in the figure, and
ask again which one fits better. Here comes the
theory behind SVM. SVM says that the best line, plane or
hyper-plane to separate a group of data is the one that
maximizes the margin.
Algorithm: we find the points closest to the line from both
the classes. These points are called support vectors. Now, we
compute the distance between the line and the support
vectors. This distance is called the margin. Our goal is to
maximize the margin. The hyperplane for which the margin
is maximum is the optimal hyperplane.
For our example, Line B is the best line that maximizes the
margin. Thus reduces the error the most.

BUT, this case is not a real case as the data here is visually
easily separable by a line with no errors. There are other cases like if these data can be linearly
separated but with error or even can't be separated linearly and need non-linear separable
decision boundary (Kernel). Let's see those scenarios.
Scenarios for the data that can be linearly separated without and with errors.
(Scenario 1) (Scenario 2) (Scenario 3) (Scenario 4)

Scenario1: Here, we have three hyper-planes (A, B and C). Now, identify the right hyper-plane
to classify star and circle. You need to remember a thumb rule to identify the right hyper-plane:
“Select the hyper-plane which segregates the two classes better”. In this scenario, hyper-plane
“B” has excellently performed this job because A and C has errors in classification while B has
no errors.
Scenario2: Here, we have three hyper-planes (A, B and C) and
all are segregating the classes well. Now, How can we identify
the right hyper-plane? Here, maximizing the distances
between nearest data point (either class) and hyper-plane
will help us to decide the right hyper-plane. This distance is
called as Margin Let’s look at this snapshot: you can see that
the margin for hyper-plane C is high as compared to both A and B. Hence, we name the right
hyper-plane as C. Another lightning reason for selecting the hyper-plane with higher margin is
robustness. If we select a hyper-plane having low margin then there is high chance of miss-
classification.
Till now we have 2 rules : 1)select the one with least errors 2) select the one with max margin.
Scenario3: Use the rules as discussed in previous section to identify the right hyper-plane. Some
of you may have selected the hyper-plane B as it has higher margin compared to A. But, here is
the catch, SVM selects the hyper-plane which classifies the classes accurately prior
to maximizing margin. Here, hyper-plane B has a classification error and A has classified all
correctly. Therefore, the right hyper-plane is A even if B has larger margin. But this is not the
usual case there is a parameter (C) which controls which line to select A or B. It will be
discussed in the next question.
Scenario4: I am unable to segregate the two classes using a
straight line, as one of star lies in the territory of
other(circle) class as an outlier. As I have already mentioned,
one star at other end is like an outlier for star class. SVM has a
feature to ignore outliers and find the hyper-plane that has
maximum margin as shown in this figure. Hence, we can
say, SVM is robust to outliers.
What is the objective function of SVM?
By noticing the figure, we have 2 classes (blue and green) and we
have found the line that best separates these 2 classes. This
separator line (red line) has an equation of wᵀx - b = 0 (equivalent
to θᵀx = 0, where θ0 plays the role of -b, as x0 = 1), and the 2
lines defining the margin (dashed lines) have equations wᵀx - b = ±1. Noting
that the equation is vectorized, w is the normal vector to the
separator (hyperplane); therefore the width of the margin is 2/||w||,
and since our goal is to maximize the margin, this by default means minimizing ||w||. The
constraint on any point is wᵀx_i - b ≥ 1 in case y_i = 1, or wᵀx_i - b ≤ -1 in case y_i = -1,
which prevents a point from falling inside the margin. To make a generalized formula for this
objective we say: minimize (1/2)||w||², subject to y_i (wᵀx_i - b) ≥ 1 for i = 1, …, n (the number of
examples in the data set). (More details: http://cs229.stanford.edu/notes/cs229-notes3.pdf)
(Squaring is used to remove the ugly square root that appears because of the norm, and the 1/2 is a
convenience factor for the derivative-based (Lagrange multiplier) optimization.)
We can notice that all these equations are clean for data like that shown, where no outliers exist, so
the support vectors are clear and the hyperplane is easy to find, giving a hard separation with
no misclassification. Let's now see what happens in the case of outliers.

How SVM be controlled to deal with outliers ?


As shown with the data given that there are some
outliers that would make the linear separation has
some sort of errors. In this case the SVM
wouldn't work with their hard maximum margin
classifier concept so a soft margin was developed
called soft margin classifier (AKA Support vector
classifier) which allow for some misclassification
and also attempted to maximize margin or in
other words make a balance between the error
and the margin. This method is the realistic
method used as it is hard to find data that is
completely linearly separable with no error at all (hard classifiers).
Objectives of hard and soft SVM:
In both of them, maximizing the margin between the support vectors is our goal, which means minimizing
(1/2)||w||² is needed; but with the soft margin there is an added slack term, so the general equation will be:
min (1/2)||w||² + C ∑_i ε_i   subject to   y_i (wᵀx_i - b) ≥ 1 - ε_i for i = 1, …, n and ε_i ≥ 0.
This approach gives a linear penalty on mistakes in
classification, but we can make it a more general formula
and express it in unconstrained (hinge-loss) form as:
min (1/2)||w||² + C ∑_i max(0, 1 - y_i (wᵀx_i - b))
By the way, this formula is, up to scaling, the same as the SVM cost function written earlier with cost1/cost0.
Now, there can be three cases for a point x^(i):

1-It lies beyond the margin (in its correct region of classification) and doesn't contribute to the loss.
2-It lies on the margin; then it is a support vector.
3-It lies inside the margin (or on the wrong side); a penalty is added (linear, i.e. proportional to the amount by which
the data point violates the hard constraint).
Now if we increase C, we are penalizing the errors more. C being equal to infinity is the case
of the hard margin. Your parameter C decides how you want to handle outliers: as C
increases the misclassification decreases but the margin becomes smaller, so try to find a balance.
All our previous talks were about linear separation so let's conquer the non-linear one.

Kernel
Kernel is a mapping function that transforms a
given space into some other higher dimensional
space. In this higher dimensional space it is easier
even by visualization to separate our data as shown
in the figure. As we can see here that the data when
transformed to 3-D it was more easily noticed to be
separable using a hyper-plane which was not
noticed in its 2-D form.(Visualize: https://www.youtube.com/watch?v=3liCbRZPrZA)
This is done by using dot product between the vectors in the co-ordinate space. As it turns out,
while solving the SVM, we only need to know the inner product of vectors in the coordinate
space. Say, we choose a kernel K and P1 and P2 are two points in the original space. K would
map these points to K(P1) and K(P2) in the transformed space. To find the solution using
SVMs, we only need to compute inner product of the transformed points K(P1) and K(P2).
If we denote S as the similarity function in transformed space (as expressed in the terms of the
original space), then: S(P1,P2) = <K(P1),K(P2)>
The idea here is like transforming Cartesian co-ordinate to other form of co-ordinates like polar
or cylindrical but here we are transforming to higher dimensional co-ordinates.
We found that we need not compute the exact transformation of our data, we just need the inner
product of our data in that higher dimensional space. This works like a charm in datasets which
aren’t linearly separable!
The Kernel trick essentially is to define S in terms of original space itself without even defining
(or in fact, even knowing), what the transformation function K is.

Types of kernels
1) Polynomial kernel → k(x, y) = (xᵀy + c)^d
2) Gaussian kernel → k(x, y) = exp(-||x - y||² / (2σ²))
3) Gaussian radial basis function (RBF) → k(x, y) = exp(-γ ||x - y||²)
4) Laplace RBF → k(x, y) = exp(-||x - y|| / σ)
5) Hyperbolic tangent kernel → k(x, y) = tanh(κ xᵀy + c)
6) Sigmoid kernel → k(x, y) = tanh(α xᵀy + c) (essentially the same tanh form)
There are more such functions, but that is enough.

Kernel VS. Kernel Trick(Used in SVM)


Summary
The kernel trick is based on some concepts: you have a dataset, e.g. two classes of 2D data,
represented on a cartesian plane. It is not linearly separable, so for example a SVM could not
find a line that separates the two classes. Now, what you can do it project this data into an
higher dimension space, for example 3D, where it could be divided linearly by a plane.
Now, a basic concept in ML is the dot product. You often do dot products of the features of a
data sample with some weights w, the parameters of your model. Instead of doing explicitly this
projection of the data in 3D and then evaluating the dot product, you can find a kernel
function that simplifies this job by simply doing the dot product in the projected space for you,
without the need to actually compute projections and then the dot product. This allows you to
find a complex non-linear boundary that is able to separate the classes in the dataset. This is a
very intuitive explanation.
To apply the SVM now, it is similar to the normal SVM except that we change x → f.

Notes:
1-Kernels were tried with logistic regression, but they make it too slow, so they are not used there.
2-Since C (= 1/λ): a large C means lower bias and higher variance, while a small C means higher bias
and lower variance.
3-For σ²: a large σ² means higher bias and lower variance, while a small σ² means lower bias and higher
variance.
4-You can use an SVM with no kernel (a linear kernel).
5-If the number of features (n) is large relative to the number of training examples (m), use logistic
regression or an SVM without a kernel (linear kernel).
6-If n is small and m is intermediate, use an SVM with a Gaussian kernel.
7-If n is small and m is large, add more features and then use logistic regression (or a linear-kernel SVM).
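In practice the kernel trick is rarely coded by hand; a library sketch with scikit-learn's SVC, where C and gamma (≈ 1/(2σ²)) play the roles described in the notes above, and the data is a placeholder:

import numpy as np
from sklearn.svm import SVC

X, y = np.random.rand(200, 2), np.random.randint(0, 2, 200)     # placeholder data

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)              # SVM with no kernel (linear kernel)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)         # Gaussian (RBF) kernel

print(linear_svm.score(X, y), rbf_svm.score(X, y))              # training accuracy of each model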
Lec13:
Unsupervised learning (Clustering):
To recall, unsupervised learning means unlabeled dataset to be fit with
a hypothesis. The hypothesis is trying to separate the data using their
features by some methods as clustering.

K-means clustering algorithm:


One of the simplest and popular unsupervised machine
learning algorithms. To achieve this objective, K-
means looks for a fixed number (k) of clusters in a
dataset.” A cluster refers to a collection of data points
aggregated together because of certain similarities and
defined by its centroid. This is done by randomly
initializing K centroid points, depending on how many
clusters you want to separate the given data into; then the
algorithm performs 2 steps iteratively until it converges.
The 2 steps are :
1-cluster assigning 2-moving centroid to the average of data assigned to this cluster

Cluster assigning:
This means that we will compute the distance between
each of the data and each centroid and assign this data to
the nearest centroid (cluster). Check this 

Moving centroids:
This means that we move our centroid to a new place
which is the point at which the average of the data
belonging to this centroid is. Check this 

Then we repeat these 2 steps until the centroids converge, as shown in the sequence of course figures,
until we reach the final centroids.

Algorithm summary:

Cost function (K-means):


Cost function here determines the difference between each data and the centroid of the cluster
that this data is assigned to.

J(c^(1), …, c^(m), μ_1, …, μ_K) = (1/m) Σ_{i=1}^{m} ||x^(i) - μ_{c^(i)}||²

and our goal is to minimize this J (also called the distortion).
Centroid initialization:
-K (number of clusters) < m (number of training examples)
logically 
-Randomly pick K training examples and set the centroids equal
to these K examples as following: 

-The problem with this initialization is that you could initialize
your centroids in places such that the algorithm converges to a
bad (locally optimal) clustering, as shown in the following:

To avoid this problem, we should run the initialization (and the whole algorithm) more than once, collect the cost
function value of each run after convergence, compare them, and keep the run with the lowest cost.

How to choose the number of clusters?


For the given data, some of us may see that it should be divided into 4
clusters (the red ones), and some may see only 2 clusters (the green ones).
So which is better? This depends on the value of the cost function J
relative to the number of clusters. We want an intermediate
answer that doesn't use a high number of clusters and at the same
time gives a relatively low J, so we will use the elbow method.
Unsupervised learning, K-means Clustering and its cost function
Unsupervised learning
Unsupervised means that the data fed to your algorithm is unlabeled, i.e. there are no
explicit instructions on what to do with this data. So there are no expected outputs or correct
answers here; instead, the algorithm tries to find structure in the features and divide the data based on them.
Unsupervised learning is used in Clustering, Anomaly detection, Association and Auto-
encoders. This was explained in details at the beginning but now we are going to elaborate the
first and most important usage of this type of learning which is clustering.
Clustering
Clustering is the task of dividing the population or data points into a number of groups such that
data points in the same groups are more similar to other data points in the same group than
those in other groups. In simple words, the aim is to segregate groups with similar traits and
assign them into clusters. Broadly we can categorize clustering into 2 major types which are
hard clustering and soft clustering.
Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or
not. For example, if we segment a retail store's customers into 10 groups, each customer is put into exactly one
of the 10 groups.
Soft Clustering: In soft clustering, instead of putting each data point into a single cluster, a
probability or likelihood of that data point belonging to each cluster is assigned. In the same retail-store
scenario, each customer is assigned a probability of belonging to each of the 10 clusters.

Why to use clustering ? what is the criteria for good clustering ?


Clustering is very much important as it determines the intrinsic grouping among the unlabeled
data present. There are no criteria for a good clustering. It depends on the user, what is the
criteria they may use which satisfy their need. For instance, we could be interested in finding
representatives for homogeneous groups (data reduction), in finding “natural clusters” and
describe their unknown properties (“natural” data types), in finding useful and suitable
groupings (“useful” data classes) or in finding unusual data objects (outlier detection). This
algorithm must make some assumptions which constitute the similarity of points and each
assumption make different and equally valid clusters.

What are the types of clustering?


1)Hierarchical-based clustering: As the name suggests, these models are based on the notion
that the data points closer in data space exhibit more similarity to each other than the data points
lying farther away. These models can follow two approaches. Agglomerative (bottom up
approach), they start with classifying all data points into separate clusters & then aggregating
them as the distance decreases. Divisive (top down approach), all data points are classified as a
single cluster and then partitioned as the distance increases. Also, the choice of distance
function is subjective. These models are very easy to interpret but lacks scalability for handling
big datasets. Examples of these models are hierarchical clustering algorithm and its variants.
2) Centroid-based clustering: These are iterative clustering algorithms in which the notion of
similarity is derived by the closeness of a data point to the centroid of the clusters. K-Means
clustering algorithm is a popular algorithm that falls into this category. In these models, the no.
of clusters required at the end have to be mentioned beforehand, which makes it important to
have prior knowledge of the dataset. These models run iteratively to find the local optima. It is
also known as Partitioning clustering.
3) Grid-based clustering: In this method the data space are formulated into a finite number of
cells that form a grid-like structure. All the clustering operation done on these grids are fast and
independent of the number of data objects.
4) Density-based clustering: These models search the data space for areas of varied density of
data points in the data space. It isolates various different density regions and assign the data
points within these regions in the same cluster. Popular examples of density models are
DBSCAN and OPTICS.

K-means clustering
K-means clustering is one of the most popular and most widely used clustering algorithms, and
it is a centroid-based algorithm. The objective of K-means is simple: group similar data points
together and discover underlying patterns. To achieve this objective, K-means looks for a fixed
number (k) of clusters in a dataset. the K-means algorithm identifies k number of centroids, and
then allocates every data point to the nearest cluster, while keeping the centroids as small as
possible. If you don’t know how many groups you want, it’s problematical, because K-means
needs a specific number k of clusters in order to use it. So, the first lesson, whenever, you have
to optimize and solve a problem, you should know your data and on what basis you want to
group them. Then, you will be able to determine the number of clusters you need.
But, most of the time, we really have no idea what the right number of clusters is, so no worries,
there is a solution for it, that we will discuss it later.
K-means clustering analogy
Imagine you’re opening a small book store. You have a stack of different books, and 3
bookshelves. Your goal is place similar books in one shelf. What you would do, is pick up 3
books, one for each shelf in order to set a theme for every shelf. These books will now dictate
which of the remaining books will go in which shelf. Every time you pick a new book up from
the stack, you would compare it with those first 3 books, and place this new book on the shelf
that has similar books. You would repeat this process until all the books have been placed.
Once you’re done, you might notice that changing the number of bookshelves, and picking up
different initial books for those shelves (changing the theme for each shelf) would increase how
well you’ve grouped the books. So, you repeat the process iteratively in hopes of a better
outcome.K-Means algorithm works something just like this.
K-means clustering algorithm
1)Initialize Cluster Centroids (Choose those 3 books to start with)
K Centroids should be initialized based on your number of clusters (K) and the usual simple
method is to randomly choose K data points from the data to assign the centroid at their
locations and then continue your algorithm as will be shown but bad initialization would lead to
bad clustering so we have here 2 methods to choose :
 Good method: The k-means clustering has a cost function that was discussed before so
you can initialize your centroids and do the full algorithm for n times (say 100) and
compute the cost for each try and just take the lowest cost which is the best among those
trials.
 Better method: This method is called K-mean ++
1-Choose 1 cluster centroid uniformly at random from data points.
2-for each data point x, compute the distance D(x) to the nearest cluster center. Noting
that for the first time we have only 1 cluster center so we just compute the distance
between the points and this centroid but with time we will have more centroids so for
each data point we choose the nearest centroid.
3-Choose new cluster centroid from amongst the data points such that the probability of x
being chosen proportional to D(x)2. (Note: This means that we are more likely select a
data point as our new cluster centroid if that data point is far away.(dist2 is used for
exaggerating to increase the effect))(Visualize: https://www.coursera.org/lecture/ml-
clustering-and-retrieval/smart-initialization-via-k-means-T9ZaG)
4-Repeat 2 and 3 until all cluster centroids have been chosen.
2)Assign data points to Clusters (Place remaining the books one by one)
Each data point will be performed on by Euclidean distance (L2 distance) between it and all
centroids to find the nearest centroid to this data point. Thus, this data point is now assigned to
this near centroid.
3)Update Cluster centroids (Start over with 3 different books)
Now, we have new clusters, that need centers. A centroid’s new value is going to be the mean
of all the examples in a cluster.
4)Repeat step 2–3 until the stopping condition is met.
It halts creating and optimizing clusters when either:
 The datapoints assigned to specific cluster remain the same (takes too much time)
 Centroids remain the same (time consuming)
 The distance of datapoints from their centroid is minimum (the thresh you’ve set)
 Fixed number of iterations have reached (insufficient iterations → poor results, choose
max iteration wisely)
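A compact NumPy sketch of steps 1-4 above, using plain random initialization and a fixed iteration cap; the data is random and only for illustration.

import numpy as np

rng = np.random.default_rng(0)
X, K = rng.random((300, 2)), 3                                  # placeholder data, 3 clusters

centroids = X[rng.choice(len(X), K, replace=False)]             # step 1: pick K data points as centroids
for _ in range(100):                                            # step 4: cap the number of iterations
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)   # (300, K) distances
    labels = dists.argmin(axis=1)                               # step 2: assign each point to the nearest centroid
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])   # step 3: move centroids
    if np.allclose(new_centroids, centroids):                   # stop when the centroids no longer move
        break
    centroids = new_centroids

inertia = ((X - centroids[labels]) ** 2).sum()                  # the K-means cost (distortion)
print(labels[:10], inertia)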
How to evaluate your k-means algorithm result ?
1-Inertia (i.e. the sum of squared distances to the nearest cluster center): inertia tells how far
away the points within a cluster are from their center. Therefore, a small inertia is aimed for. The range of
inertia's values starts from zero and goes up. (By the way, this is the cost function explained
before.)
2- Silhouette score: (used with any clustering technique) Silhouette score tells how far away the
data points in one cluster are, from the data points in another cluster. The range of silhouette
score is from -1 to 1. Score should be closer to 1 than -1.
Assume the data have been clustered via any technique, such as k-means, into k clusters. For
each datum i let a(i) be the average distance between i and all other data within the same cluster.
We can interpret a(i) as a measure of how well i is assigned to its cluster (the smaller the value,
the better the assignment).
Let b(i) be the smallest average distance from i to all points in any other cluster of which i is not a
member. The cluster with this smallest average dissimilarity is said to be the "neighboring
cluster" of i, because it is the next best fit cluster for point i. We now define the silhouette:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
(Visualize: https://www.youtube.com/watch?list=PLmNPvQr9Tf-ZSDLwOzxpvY-HrE0yv-
8Fy&v=5TPldC_dC0s)

How to determine number of clusters K?


We have 3 main methods to use:
1-By visualizing the data, it would be clear of how many clusters should it be separated.
2- By using what is called the elbow method: draw a
plot of the cost function against each value of K from 1 to, say, 10
and find the "elbow" of the curve; but sometimes (in fact quite often) there will be no
clear elbow.

3- By plotting silhouette score against each value of K and choosing the k corresponding to the
highest silhouette score.
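A sketch of both methods with scikit-learn: inertia_ gives the values for the elbow plot and silhouette_score covers method 3 (placeholder data):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(300, 2)                                      # placeholder data

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))      # look for the elbow / the highest silhouette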
Note: K-Medians is another clustering algorithm related to K-Means, except instead of
recomputing the group center points using the mean we use the median vector of the group.
This method is less sensitive to outliers (because of using the Median) but is much slower for
larger datasets as sorting is required on each iteration when computing the Median vector.

Agglomerative Hierarchal clustering


This type of clustering is often preferred over k-means (the reasons are discussed below).
1-We begin by treating each data point as a single cluster i.e if there are X data points in our
dataset then we have X clusters.
2-We then select a distance metric that measures the distance between two clusters. As an
example we will use average linkage which defines the distance between two clusters to be the
average distance between data points in the first cluster and data points in the second cluster.
We can use any type of distance metric to measure the
distance between two clusters, like single linkage, which
depends on the closest pair of data points between the 2 clusters,
or complete linkage, which depends on the
furthest pair of data points between the 2 clusters, or
centroid (mean) linkage, which depends on the distance between
centroids. The choice of the cluster distance method will
affect the clusters created, as shown in the figure.
3-On each iteration we combine two clusters into one.
The two clusters to be combined are selected as those with the smallest average linkage. I.e.
according to our selected distance metric, these two clusters have the smallest distance between
each other and therefore are the most similar and should be combined.
4-Step 3 is repeated until we reach the root of the tree i.e. we only have one cluster which
contains all data points. In this way we can select how many clusters we want in the end, simply
by choosing when to stop combining the clusters, i.e. when we stop building the tree! In fact it is
not called a tree but a dendrogram, which is built along the way, where the height of each
connecting leg is defined by the distance between the clusters being merged (as the distance
increases, the leg becomes taller). This is hard to explain in words and should be visualized, so it
is worth watching this video, which explains it very well:
(https://www.youtube.com/watch?v=OcoE7JlbXvY)
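As a small illustration (a sketch assuming SciPy, NumPy and matplotlib, with toy random data), average-linkage agglomerative clustering and its dendrogram can be built like this; 'single', 'complete' or 'centroid' can be passed instead of 'average' to change the linkage:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(20, 2)                         # toy data points
Z = linkage(X, method='average')                  # build the merge history (the tree)
dendrogram(Z)                                     # draw the dendrogram
plt.show()
labels = fcluster(Z, t=3, criterion='maxclust')   # "cut" the dendrogram into 3 clusters
print(labels)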

Why Agglomerative hierarchical clustering is preferred more than k-means ?


Because it is both more flexible and has fewer hidden assumptions about the distribution of the
underlying data.
With k-Means clustering, you need to have a sense ahead-of-time what your desired number of
clusters is (this is the 'k' value). Also, k-means will often give unintuitive results if (a) your data
is not well-separated into sphere-like clusters, (b) you pick a 'k' not well-suited to the shape of
your data, i.e. you pick a value too high or too low, or (c) you have weird initial values for your
cluster centroids (one strategy is to run a bunch of k-means algorithms with random starting
centroids and take some common clustering result as the final result).
In contrast, hierarchical clustering has fewer assumptions about the distribution of your data -
the only requirement (which k-means also shares) is that a distance can be calculated for each pair
of data points. Hierarchical clustering typically 'joins' nearby points into a cluster, and then
successively adds nearby points to the nearest group. You end up with a 'dendrogram', or a sort
of connectivity plot. You can use that plot to decide after the fact how many clusters your data
has, by cutting the dendrogram at different heights. Of course, if you need to pre-decide how
many clusters you want (based on some sort of business need) you can do that too. Hierarchical
clustering can be more computationally expensive, but usually produces more intuitive results.

Are there any other algorithms of clustering other than k-means and hierarchical ?
Yes, there are several algorithms as:
Mean shift clustering: Mean shift clustering is a sliding-window-based algorithm that attempts
to find dense areas of data points. It is a centroid-based algorithm meaning that the goal is to
locate the center points of each group/class, which works by updating candidates for center
points to be the mean of the points within the sliding-window. These candidate windows are
then filtered in a post-processing stage to eliminate near-duplicates, forming the final set of
center points and their corresponding groups.
1-To explain mean-shift we will consider a set of points in two-dimensional space (see the GIF
linked below). We begin with a circular sliding window centered at a point C (randomly
selected) and having radius r as the kernel. Mean shift is a hill-climbing algorithm which
involves shifting this kernel iteratively to a higher density region on each step until
convergence.
2-At every iteration the sliding window is shifted towards regions of higher density by shifting
the center point to the mean of the points within the window (hence the name). The density
within the sliding window is proportional to the number of points inside it. Naturally, by
shifting to the mean of the points in the window it will gradually move towards areas of higher
point density.
3-We continue shifting the sliding window according to the mean until there is no direction in
which a shift can accommodate more points inside the kernel. In other words, we keep moving
the circle until we are no longer increasing the density (i.e. the number of points in the
window).
4-This process of steps 1 to 3 is done with many sliding windows until all points lie within a
window. When multiple sliding windows overlap the window containing the most points is
preserved. The data points are then clustered according to the sliding window in which they
reside.
In contrast to K-means clustering there is no need to select the number of clusters as mean-shift
automatically discovers this. That’s a massive advantage. The fact that the cluster centers
converge towards the points of maximum density is also quite desirable as it is quite intuitive to
understand and fits well in a naturally data-driven sense. The drawback is that the selection of
the window size/radius “r” can be non-trivial.
(Check this GIF noting that the black dots are the centers of the generated sliding windows and
the red dots are the average that is computed at the end of each iteration to move the center of
the sliding window to it and the grey dots are the data : https://gph.is/2piyT8P)
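A minimal sketch of mean shift (assuming scikit-learn, with toy random data): the bandwidth plays the role of the window radius r discussed above and can be estimated from the data with a heuristic:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X = np.random.rand(200, 2)                           # toy data
bw = estimate_bandwidth(X, quantile=0.2)             # heuristic choice of the window size/radius
ms = MeanShift(bandwidth=bw).fit(X)
print("clusters found:", len(ms.cluster_centers_))   # number of clusters is discovered, not chosen
print(ms.labels_[:10])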

Density-Based Spatial Clustering of Applications with Noise (DBSCAN):


DBSCAN is a density based clustered algorithm similar to mean-shift, but with a couple of
notable advantages.
1-DBSCAN begins with an arbitrary starting data point that has not been visited. The
neighborhood of this point is extracted using a distance epsilon ε (All points which are within
the ε distance are neighborhood points).
2-If there are a sufficient number of points (according to minPoints) within this neighborhood
then the clustering process starts and the current data point becomes the first point in the new
cluster. Otherwise, the point will be labeled as noise (later this noisy point might become the
part of the cluster). In both cases that point is marked as “visited”.
3-For this first point in the new cluster, the points within its ε distance neighborhood also
become part of the same cluster. This procedure of making all points in the ε neighborhood
belong to the same cluster is then repeated for all of the new points that have been just added to
the cluster group.
4-This process of steps 2 and 3 is repeated until all points in the cluster are determined i.e all
points within the ε neighborhood of the cluster have been visited and labelled.
5-Once we’re done with the current cluster, a new unvisited point is retrieved and processed,
leading to the discovery of a further cluster or noise. This process repeats until all points are
marked as visited. Since at the end of this all points have been visited, each point will have
been marked as either belonging to a cluster or being noise.
DBSCAN poses some great advantages over other clustering algorithms. Firstly, it does not
require a pre-set number of clusters at all. It also identifies outliers as noise, unlike mean-shift,
which simply throws them into a cluster even if the data point is very different. Additionally, it
is able to find arbitrarily sized and arbitrarily shaped clusters quite well.
The main drawback of DBSCAN is that it doesn’t perform as well as others when the clusters
are of varying density. This is because the setting of the distance threshold ε and minPoints for
identifying the neighborhood points will vary from cluster to cluster when the density varies.
This drawback also occurs with very high-dimensional data since again the distance threshold ε
becomes challenging to estimate.
(Check this GIF: https://gph.is/2pugYXT)
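A minimal sketch of DBSCAN (assuming scikit-learn; the two-moons dataset is just a convenient non-spherical toy example): eps is the ε distance, min_samples is the minPoints threshold, and points labelled -1 are the noise points:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # non-spherical toy data
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters:", n_clusters, "noise points:", list(db.labels_).count(-1))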
Elbow method:
The elbow method is a graph where the number of clusters is on the x-axis and the cost function on the y-
axis. The optimal number of clusters is then the number at the elbow of the curve, as in the curve
on the left in the figure. But sometimes (or most of the time) there will be no clear elbow, so in that
case the number of clusters is chosen based on a metric of how well the clustering performs for
the purpose of the problem.

Lec14:
Dimensionality reduction:
Dimensionality reduction means to reduce
features representing data by removing
redundant data and compress them to use up
less memory and speed up our learning
algorithm.
As shown in the figure, both features x1 & x2
represent length but in different units, which
means a highly redundant representation. But in
industry we may have thousands of features
representing the data which is hard to track
them like this example to know which features
are redundant.
Back to our example, so we want to reduce
data from 2D (R2) to 1D (R1) so we are
representing our data in z1 instead of x1 & x2 as
shown in the previous figure.
In the following figure, A reduction from 3D
data to 2D data is also shown.
Dimensionality reduction offers us better data visualization as well. As if we have for example
50-D data (50 features) we can reduce them to 2-D data in z1 and z2 and visualize them.

Principal Component Analysis (PCA):


PCA is the most popular and most commonly used method for dimensionality reduction.
Assume you have data as shown in the figure. We want to reduce the data from 2D to 1D, so we
want to find a line to project all the data onto without changing the position of the data too much
(error).
For the 2 lines we have drawn (red and blue), we can project the data onto either of them to make
the data 1D instead of 2D, so which line should we choose? We would choose the red line because
the distance between the data points and their projections is small, which means a low
projection error (when converting 2D to 1D). So PCA chooses the best vector u(1) to project the
data onto so as to achieve the lowest possible error.
Generally, to reduce from n-dimension to k-dimension, we have to find k-vectors u(1),u(2),…,u(k)
onto which to project the data so as to minimize the projection error.

What is the difference between PCA & linear regression?


PCA is not linear regression.
As shown in the figure, on the left, linear regression fits a line to the data to be able to predict y
given x, minimizing the vertical distance between the real data and the fitted line. In PCA, on the
right, we find a line (or a vector) to project the data onto, minimizing the orthogonal (not vertical)
projection distance, i.e. the projection error.
Algorithm:
1) Data preprocessing (mean normalization and, if needed, feature scaling).
2) Compute the covariance matrix: ∑ = (1/m) Σi x(i) (x(i))T (an n×n matrix).
3) Compute the eigenvectors of matrix ∑: [U,S,V]=svd(Sigma); "octave" [you can use eig(Sigma)]
4) The U matrix is an n×n matrix, so we take its first k columns (Ureduced) to represent the whole
n-dimensional data.
5) z = UreducedT . x

How to choose the number of principal components k?
Choose the smallest value of k such that
( (1/m) Σi ||x(i) − xapprox(i)||² ) / ( (1/m) Σi ||x(i)||² ) ≤ 0.01 (i.e. 99% of the variance is retained).
Easier method: with [U,S,V]=svd(Sigma), choose the smallest k such that
( Σi=1..k Sii ) / ( Σi=1..n Sii ) ≥ 0.99
Notes:
1-xapprox(i) is the data projected onto the PCA vectors.
2-In the first formula, the numerator is the average squared projection error and the denominator is the
total variation in the data.
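As a small sketch of the "easier method" (NumPy only, with a hypothetical standardized data matrix X), the diagonal entries returned by svd can be accumulated until 99% of the variance is retained:

import numpy as np

X = np.random.randn(500, 20)                 # m x n data, assumed already mean-normalized
Sigma = (X.T @ X) / X.shape[0]               # covariance matrix (n x n)
U, S, Vt = np.linalg.svd(Sigma)              # S holds the diagonal entries S11..Snn
retained = np.cumsum(S) / np.sum(S)          # fraction of variance retained for each k
k = int(np.argmax(retained >= 0.99)) + 1     # smallest k with at least 99% variance retained
print(k, retained[k - 1])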
Dimensionality reduction, Principal component analysis (PCA)
Dimensionality reduction
Big data applications have become widespread in our lives, and in machine learning the final
prediction or decision is generally based on some variables known as features; as time goes
on, the number of these features keeps growing, into the range of millions. As these features
increase, it becomes harder to visualize the data, since humans can only see in 1, 2 and 3 dimensional
space, so as the dimensionality increases it is harder to visualize and imagine. But
visualization is not the only problem: sometimes those features are correlated, which means
that the data are redundant, which in turn makes the model more prone
to overfit. Also, the computational cost increases with the number of features dealt
with, so here comes the motivation behind finding ways to decrease the dimensionality in order
to solve the problems mentioned. To sum up, we can say that dimensionality
reduction is finding a low-dimensional representation of the present higher-dimensional data
that retains as much information as possible.

Why dimensionality reduction is useful ?


We have answered this question implicitly in the previous paragraph but let's sum up in this
points and add some other benefits for dimensionality reduction techniques:
 Space required to store the data is reduced as the number of dimensions comes down
 Fewer dimensions lead to less computation/training time
 Some algorithms do not perform well when we have a large number of dimensions, so reducing
these dimensions needs to happen for the algorithm to be useful
 It takes care of multicollinearity by removing redundant features. For example, you have
two variables – ‘time spent on treadmill in minutes’ and ‘calories burnt’. These variables
are highly correlated as the more time you spend running on a treadmill, the more
calories you will burn. Hence, there is no point in storing both as just one of them does
what you require
 It helps in visualizing data. As discussed earlier, it is very difficult to visualize data in
higher dimensions so reducing our space to 2D or 3D may allow us to plot and observe
patterns more clearly

What are the common techniques used for dimensionality reduction ?


Basically we can categorize them to 2 major types which are :
1)feature selection : By only keeping the most relevant variables from the original dataset.
2)dimensionality reduction(AKA feature extraction or transform methods) : By finding a
smaller set of new variables, each being a combination of the input variables, containing
basically the same information as the input variables. This could be linear or non-linear method.
Then each category has some techniques as in the next figure:
In fact these are too many to be discussed in
detail, so we will discuss
principal component analysis,
independent component analysis,
and random forest; the remaining ones
have close concepts that can be
looked up in the future if needed.

What are the use case of each


method?
 Missing Value Ratio: If the dataset has too many missing values, we use this approach to
reduce the number of variables. We can drop the variables having a large number of
missing values in them
 Low Variance filter: We apply this approach to identify and drop constant variables from
the dataset. The target variable is not unduly affected by variables with low variance, and
hence these variables can be safely dropped
 High Correlation filter: A pair of variables having high correlation increases
multicollinearity in the dataset. So, we can use this technique to find highly correlated
features and drop them accordingly
 Random Forest: This is one of the most commonly used techniques which tells us the
importance of each feature present in the dataset. We can find the importance of each
feature and keep the top most features, resulting in dimensionality reduction
 Both Backward Feature Elimination and Forward Feature Selection techniques take a lot
of computational time and are thus generally used on smaller datasets
 Factor Analysis: This technique is best suited for situations where we have highly
correlated set of variables. It divides the variables based on their correlation into different
groups, and represents each group with a factor
 Principal Component Analysis: This is one of the most widely used techniques for
dealing with linear data. It divides the data into a set of components which try to explain
as much variance as possible
 Independent Component Analysis: We can use ICA to transform the data into
independent components which describe the data using less number of components
 ISOMAP: We use this technique when the data is strongly non-linear
 t-SNE: This technique also works well when the data is strongly non-linear. It works
extremely well for visualizations as well
 UMAP: This technique works well for high dimensional data. Its run-time is shorter as
compared to t-SNE
Principal Component Analysis (PCA):
PCA is one of the most widely used methods for dimensionality reduction. It is an
unsupervised method, which means that it doesn't need labeled data. PCA finds
some linearly uncorrelated, orthogonal components called principal components such that
each component is a linear combination of the features of the original data, where those
uncorrelated components preserve as much as possible of the total variance of the data, noting
that variance means how much the data is spread around the mean.
How principal components are organized ?
Principal components are ordered such that the earlier the component, the more variance it
explains and retains. This means that the first principal component
carries the most information (we express the preceding statement
scientifically as follows: it explains the most variance of the original dataset). Then the second
principal component tries to explain the variance of the original dataset that the first component
couldn't capture, then the third component tries to explain the variance of the dataset that the
first 2 components couldn't capture, and so on …
Equations: Z1 = 1st principal component = Φ11X1+Φ21X2+…+ Φn1Xn
Similarly, Z2= 2nd principal component = Φ12X1+Φ22X2+…+ Φn2Xn
and so on with all the principal components …
What is the condition on the data to perform PC analysis ? Why it is important?
 Original data should be normalized and some people are even taking it more serious and
say that it should be standardized(mean normalization) but before discussing why data
should be preprocessed either by normalization or standardization let's know the
difference:
1-Normalization: rescales the values into a range of [0,1]. This might be useful in some
cases where all parameters need to have the same positive scale. However, the outliers
from the data set are lost.
2-Standardization: rescales data to have a mean (μ) of 0 and standard deviation (σ) of 1
(unit variance). For most applications, standardization is better.

Normalization or standardization is needed because the original predictors may have


different scales. For example: Imagine a data set with variables’ measuring units as
gallons, kilometers, light years etc. It is definite that the scale of variances in these
variables will be large. Performing PCA on un-normalized variables will lead to insanely
large loadings for variables with high variance. In turn, this will lead to dependence of a
principal component on the variable with high variance. This is undesirable.

 PCA should also be applied on numerical data, but in case we have categorical data we can
convert it to numerical features by giving a number to each category.
What is variance, covariance, singular value decomposition, eigen values and eigen vectors
mean?
1-Variance: It is a measure of the variability or it simply measures how spread the data set is.
2-Covariance: a measure of the strength and direction of the correlation between two or more
sets of random variables such that if covariance of x, y is positive then x and y are in same
direction so if x increase, y increases and vice versa. If covariance of x, y is negative then x, y
are in opposite direction then if x increases, y decreases and if covariance is 0 then x and y are
un-correlated. The strength of this relationship is determined by the normalized covariance (the
correlation coefficient), which is in the range of -1 to 1, so the closer that number is to 1 or -1, the
stronger the relationship between x and y, either positively or negatively.

When we say n×n covariance matrix, this means that the diagonal contains the variance of each
feature (feature1, feature2, …, feature n) and
the off-diagonal value at position (i, j) is the
covariance between feature i and feature j, so
the covariance matrix is symmetric.
3-Singular value decomposition (SVD): The
intuition behind this is that every matrix has a close matrix of lower rank to it that can act as a
the best approximation to the original higher rank matrix. Note that SVD can be applied to all
matrices (rectangular or square) which is a very good benefit. SVD of say matrix A means the
factorization of A to 3 matrices U,D& V where columns of U and V are orthonormal
(orthogonal and normalized) and D is diagonal with positive entries.(A=UDVT )
4) Eigenvalues and eigenvectors: If you can draw a line through the three
points (0,0), v and Av, then Av is just v multiplied by a number λ; that is, Av=λv. In this case,
we call λ an eigenvalue and v an eigenvector. For example, for a matrix A with A·(1,2) = 5·(1,2),
(1,2) is an eigenvector and 5 an eigenvalue.

Algorithm of PCA:
1-Normalize or standardize the data
2-calculate the covariance matrix of these data. (∑)
3-Calculate the eigen vectors of this covariance and sort it according to their eigen values.
4-Take the first K columns (K principal components) of the eigen vector.
5-Multiply this matrix by the original data.
To check that these PC are right just calculate the covariance matrix of this new formed reduced
matrix and you will see that the diagonal values are non-zeros and all off-diagonal values are
zeros which means that all the principal components are un-correlated.
BUT, maybe we have some questions about step 3 and 4 which are why to calculate the eigen
vectors of the covariance matrix and why taking first K components works as PC?
We now want to transform the original data into new data with a diagonalized covariance, so
let's have a look at the mathematical derivation. Let Y = PX be the transformed data; then
CY = (1/n) Y YT = (1/n) (PX)(PX)T = P ((1/n) X XT) PT = P CX PT
Here comes the whole idea: if we can find a P that makes CY diagonal, then Y will be the
new transformed data expressed in principal components. By mathematical theorems, if the
eigenvectors of CX are used as the rows of P, this condition is met. Now, if we want to transform
points to k dimensions, we select the first k eigenvectors of the matrix CX (sorted
decreasingly according to their eigenvalues), form a matrix with them and use it as P.
So, if we have m dimensional original n data points then
X :m*n
P : k*m
Y = PX : (k*m)(m*n) = (k*n)
Hence, our new transformed matrix has n data points having k dimensions. 
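Here is a from-scratch sketch of the five steps (NumPy only, toy random data, using the row-vector convention Z = X·P, i.e. the transpose of the Y = PX form above):

import numpy as np

def pca(X, k):
    X = (X - X.mean(axis=0)) / X.std(axis=0)     # step 1: standardize the data
    C = np.cov(X, rowvar=False)                  # step 2: covariance matrix (n x n)
    eigvals, eigvecs = np.linalg.eigh(C)         # step 3: eigenvectors (eigh since C is symmetric)
    order = np.argsort(eigvals)[::-1]            # sort decreasingly by eigenvalue
    P = eigvecs[:, order[:k]]                    # step 4: keep the first k principal directions
    return X @ P                                 # step 5: project the data (m x k)

Z = pca(np.random.rand(100, 5), k=2)
print(Z.shape)                                   # (100, 2)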

Independent Component Analysis (ICA)


ICA is based on information-theory and is also one of the most widely used dimensionality
reduction techniques. The major difference between PCA and ICA is that PCA looks for
uncorrelated factors while ICA looks for independent factors, which in turn means that PCA
chooses orthogonal vectors iteratively to explain most of the variance with the first few of them.
ICA on the other hand does not have the orthogonality constraint of PCA but wants to
emphasize statistical independence between components. Another helpful way of thinking
about the difference between PCA and ICA is in the distribution of the data along the new
dimensions that each aims. While PCA aims for distributions that are as Gaussian as possible,
ICA aims for distributions that are as far from Gaussian as possible
Uncorrelated: means the Pearson correlation coefficient = 0, and hence the covariance = 0.
Independent: Two random variables X and Y are independent if the joint probability
distribution P(X, Y) can be written as the product of the two individual distributions:
(P(X, Y)=P(X).P(Y)) & (P(X|Y)=P(X)) & (P(Y|X)=P(Y))
If 2 random variables are independent then they are uncorrelated, but the opposite isn't true.
The only case where we can treat the terms independent and uncorrelated as equivalent is when
the data is Gaussian, as in the case of PCA.
As ICA should have sources of non-gaussian distribution unlike PCA so it is used in the
cocktail party problem (considered as blind-source separation) as shown:

This shows that the distribution of the given data controls the dimensionality reduction method
to be used.

Algorithm
Given a set of observations of random variables x1(t), x2(t)…xn(t), where t is the time or sample
index, assume that they are generated as a linear mixture of independent components: y=Wx,
where W is some unknown matrix. Independent component analysis now consists of estimating
both the matrix W and the yi(t), when we only observe the xi(t).
• Use statistical “latent variables“ system
• Random variable sk instead of time signal
• xj = aj1s1 + aj2s2 + .. + ajnsn, for all j
x = As
• IC‘s s are latent variables & are unknown AND Mixing matrix A is also unknown
• Task: estimate A and s using only the observeable random vector x
• Lets assume that no. of IC‘s = no of observable mixtures
and A is square and invertible
• So after estimating A, we can compute W=A-1 and hence
s = Wx = A-1x
So our main task is to find s, so we have to find W, which is equivalent to A-1, which in turn
means that our task now is to find the matrix A, which will solve our whole problem.
Find A → get its inverse A-1 → multiply by the given observation x → then s is computed.
How to get A? In practice A (or equivalently W) is estimated by optimizing a measure of non-
Gaussianity or statistical independence of the recovered components: for example, the FastICA
algorithm maximizes an approximation of negentropy with a fixed-point iteration, while
infomax / maximum-likelihood approaches minimize the mutual information between the components.
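As a sketch of that pipeline in practice (assuming scikit-learn's FastICA and a simulated two-signal cocktail-party mixture; the mixing matrix below is only used to generate the toy observations):

import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # two independent source signals s
A = np.array([[1.0, 0.5], [0.5, 2.0]])             # "unknown" mixing matrix, used only to simulate x = As
X = S @ A.T                                        # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                       # recovered sources, s = Wx
print(ica.mixing_.shape)                           # the estimated mixing matrix A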
Factor Analysis (FA)
In the Factor Analysis technique, variables are grouped by their correlations, i.e., all variables in
a particular group will have a high correlation among themselves, but a low correlation with
variables of other group(s). Here, each group is known as a factor. These factors are small in
number as compared to the original dimensions of the data. However, these factors are difficult
to observe. In every factor analysis, there are the same number of factors as there are variables.
Each factor captures a certain amount of the overall variance in the observed variables, and the
factors are always listed in order of how much variation they explain. The eigenvalue is a
measure of how much of the variance of the observed variables a factor explains. Any factor
with an eigenvalue ≥1 explains more variance than a single observed variable. So if the factor
for socioeconomic status had an eigenvalue of 2.3 it would explain as much variance as 2.3 of
the three variables. This factor, which captures most of the variance in those three variables,
could then be used in other analyses. The factors that explain the least amount of variance are
generally discarded.

Example to illustrate factor analysis and its loadings


The relationship of each variable to the underlying factor is expressed by the so-called factor
loading. Here is an example of the output of a simple factor analysis looking at indicators of
wealth, with just six variables and two resulting factors.

The variable with the strongest association to the underlying latent variable, Factor 1, is income,
with a factor loading of 0.65.
Since factor loadings can be interpreted like standardized regression coefficients, one could also
say that the variable income has a correlation of 0.65 with Factor 1. This would be considered a
strong association for a factor analysis in most research fields.
Two other variables, education and occupation, are also associated with Factor 1. Based on the
variables loading highly onto Factor 1, we could call it “Individual socioeconomic status.”
House value, number of public parks, and number of violent crimes per year, however, have
high factor loadings on the other factor, Factor 2. They seem to indicate the overall wealth
within the neighborhood, so we may want to call Factor 2 “Neighborhood socioeconomic
status.”
Notice that the variable house value also is marginally important in Factor 1 (loading = 0.38).
This makes sense, since the value of a person’s house should be associated with his or her
income.

Factor analysis methodology explained on an empirical example


Check this series of videos to see step by step how the factor analysis is done:
https://www.youtube.com/playlist?list=PLwJRxp3blEvaOTZfSKXysxRmi6gXJf5gP.

Random forest
This is the most popular of the feature selection methods (remember that all the
dimensionality reduction techniques explained before were of the feature extraction type). But to explain
it we should first know an algorithm called the decision tree, as a random forest is a collection of decision trees.

Decision Trees
Decision tree is a type of supervised learning algorithm (having a pre-defined target variable). It
works for both categorical and continuous input and output variables so in other words we say
that it is used for both classification and regression problems. In this technique, we split the
population or sample into two or more homogeneous sets (or sub-populations) based on most
significant splitter / differentiator in input variables.
Terminologies:
 Root Node: It represents entire population or sample and this further gets divided into
two or more homogeneous sets.
 Splitting: It is a process of dividing a node into two or more sub-nodes.
 Decision Node: When a sub-node splits into further sub-nodes, then it is called decision
node.
 Leaf/Terminal Node: Nodes that do not split are called Leaf or Terminal nodes.
 Pruning: When we remove sub-nodes of a decision node, this process is called pruning.
You can say opposite process of splitting.
 Branch / Sub-Tree: A sub section of entire tree is called branch or sub-tree.
 Parent and Child Node: A node which is divided into sub-nodes is called the parent node of
those sub-nodes, whereas the sub-nodes are the children of the parent node.
Algorithm and examples
There are several algorithms that can be used to build our decision tree, but we will talk about only 2
of them.
1)ID3 (Iterative Dichotomiser 3) This uses entropy and information gain as metrics.
2)CART (Classification And Regression Trees) This uses Gini index as metric. (This is
called greedy algorithm as we have an excessive desire to lower the cost)
There are other algorithms depending on other metrics as Chi-square and reduction in variance.

Algorithm:
1-Start with a training data set which we’ll call S. It should have attributes and classification.
2-Determine the best attribute in the dataset. (We will go over the definition of best attribute)
3-Split S into subsets that contain the possible values of the best attribute.
4-Make decision tree node that contains the best attribute.
5-Recursively generate new decision trees by using the subset of data created from step 3 until a
stage is reached where you cannot classify the data further. Represent the class as leaf node.

Best attribute means higher information gain or lower gini score. Let's see each type by
examples.

Iterative Dichotomiser 3(ID3):


Equations
Entropy(S) = − Σi=1..c pi log2(pi), where c is the number of classes, pi is the proportion of the data with the
ith classification, and S is the dataset.
Information gain(S, A) = Entropy(S) − Σv∈values(A) (|Sv| / |S|) · Entropy(Sv), where the term after the
minus sign is the weighted entropy of the subsets obtained after splitting on attribute A.
Now the goal is to maximize this information gain. The attribute which has the maximum
information gain is selected as the parent node, and the data is successively split on that node.
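A small sketch of these two formulas in code (plain NumPy; the class counts below are the ones used in the play-tennis example that follows):

import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()                     # class proportions, ignoring empty classes
    return -(p * np.log2(p)).sum()

def information_gain(total_counts, splits):
    n = sum(sum(s) for s in splits)
    weighted = sum((sum(s) / n) * entropy(s) for s in splits)   # weighted entropy after the split
    return entropy(total_counts) - weighted

# outlook: sunny [2 yes, 3 no], overcast [4 yes, 0 no], rain [3 yes, 2 no]
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))       # about 0.246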
Example As shown we have 2 classes which is
either go out and play a game of tennis or not and
this depends on 4 features (outlook,temp.,humidity &
wind) so how to start this decision tree or in other
word which attribute (feature) will be the root of the
tree or the main parent of the tree ? We will decide
based on the information gain of each future as
calculated in the following lines.
Firstly, calculate the entropy of the whole dataset, which contains 14 examples, as we will use it
throughout our calculations of the IG of each feature.
Here we have 2 classes, "Yes" and "No", where "Yes" is the output of 9 examples out of 14
while "No" is the output of 5 examples out of 14.
Entropy(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) ≈ 0.94
Now let us calculate the IG for each feature:
For outlook: outlook has 3 possible values, which are sunny, overcast and rain, where sunny has 2
"Yes" and 3 "No", overcast has 4 "Yes" and 0 "No", and rain has 3 "Yes" and 2 "No".
IG(outlook) = Entropy(S) − (5/14)·Entropy(sunny) − (4/14)·Entropy(overcast) − (5/14)·Entropy(rain)
= 0.94 − (5/14)(0.971) − (4/14)(0) − (5/14)(0.971)
Therefore, IG(outlook) ≈ 0.246
Similarly, compute IG(temperature), IG(humidity) & IG(wind).

Now we can see that the highest information
gain is for the outlook attribute, therefore our root
node is outlook.
Then repeat what we have done with each
child of outlook (sunny, overcast and rain, but here
overcast is already done as it is always yes). So
for sunny, calculate the IG for temperature, humidity and
wind to decide the node (it will be humidity), then for rain calculate the IG for temperature and wind
(it will be wind) and so on …
The final decision tree will be as follows:

Maybe you are asking where the feature temperature is in our tree, and here comes the concept
of feature selection for dimensionality reduction: the attribute (feature) temperature
is useless or redundant in this problem, which means that it overloads the model and
consumes computation but doesn't contain much important information. Random forest
uses a bunch of decision trees to do this and will be discussed in detail later.

Classification and regression trees (CART):


Here we use what is called the Gini index, which acts as a cost function to evaluate the different
splits in the dataset.
Equation: G = 1 − Σi=1..K pi², where K is the number of classes.
Note that the maximum value of G is 1 − 1/K (0.5 for two classes) and the minimum value is 0, which is the best.
Let's apply this formula to our example:
For outlook: Gini score = weighted sunny Gini index + weighted overcast Gini index + weighted
rain Gini index = (5/14)(1 − (2/5)² − (3/5)²) + (4/14)(1 − 1² − 0²) + (5/14)(1 − (3/5)² − (2/5)²) = 0.3428

For Temp: 0.4404

For Humidity: 0.36734

For Wind: 0.35422

As the Gini score for outlook is the lowest, the root will be the outlook attribute. Then, similarly to
what we have done before, we will reach the same decision tree.
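A matching sketch for the CART criterion (plain Python), computing the weighted Gini score of a split from its class counts; applied to outlook it reproduces the 0.3428 above:

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)      # Gini index of one branch

def weighted_gini(splits):
    n = sum(sum(s) for s in splits)
    return sum((sum(s) / n) * gini(s) for s in splits)  # weight each branch by its size

# outlook: sunny [2 yes, 3 no], overcast [4 yes, 0 no], rain [3 yes, 2 no]
print(weighted_gini([[2, 3], [4, 0], [3, 2]]))          # about 0.343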
Problem of Decision trees and solution
The main problem facing decision trees is overfitting: if no limit is set on its size, the decision
tree can grow so large that it reaches almost 100% accuracy on the training set. This
problem is solved in one of two ways:
1- set a constraint on the tree size to stop it before becoming too large (Early stopping).
2-Pruning the tree after it becomes too large.

Setting a tree size constraint (depth of the tree)


There is no theoretical calculation for the optimum tree depth; it is all about cross-validation:
calculate accuracy (or any evaluation metric) on both the training set and the validation set and make
sure that your model doesn't give high accuracy in the training phase and low accuracy in the validation
phase. To put it in the form of steps, here is the best approach:
 Choose a number of tree depths to start a for loop (try to cover whole area so try small
ones and very big ones as well)
 Inside a for loop divide your dataset to train/validation (e.g. 70%/30%)
 Each time train your decision tree with that depth on training data and test it on the
validation set, then keep the validation error (you can also keep the training error)
 Plot the validation error (you can combine it with evolution of training error to have a
prettier plot for understanding!)
 Find the global minimum of validation error.
 Then you can narrow your search in a new for loop according to the value you found to
reach a more precise value
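A minimal sketch of this depth-selection loop (assuming scikit-learn and its built-in iris dataset, and using cross-validated accuracy instead of a single train/validation split):

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
for depth in [1, 2, 3, 5, 8, 12, 20]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5)          # validation accuracy on each fold
    print(depth, round(scores.mean(), 3))               # pick the depth with the best mean score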
Pruning the tree:
It involves removing the branches that make use of features having low importance. This way,
we reduce the complexity of the tree and thus increase its predictive power by reducing
overfitting. Pruning trees is a bottom-up approach, so we start from the leaves of the tree.
We use here a cost function:
C(tree) = error on the training set + λ × L(tree), where L(tree) is the number of leaves of the tree.
So we start by removing a branch and putting a leaf in its place, then compute the cost; if this gives a lower cost
then it is a good pruning, if not then try another branch, until a point is reached where
removing any of the nodes or branches would lead to a higher cost, so we have now reached the best version of
this decision tree. This pruning algorithm is called reduced error pruning.
Value of λ can be estimated using K-fold cross validation to decide the best λ to balance
between classification error (accuracy in other words) and complexity of the tree. (Visulaize:
https://www.coursera.org/lecture/ml-classification/optional-tree-pruning-algorithm-wmODB)

NOW, let's go back to the random forest dimensionality reduction technique to see how it collects
decision trees and uses them for feature selection.
Random forest
It is an ensemble of decision trees which is used as a
prediction model, like the decision tree algorithm,
and is also used in dimensionality reduction via a
feature selection approach. Most of the time it is trained
with the "bagging" method. The general idea of the
bagging method is that a combination of learning
models (decision trees) increases the overall result
and gives a more accurate and stable prediction.
Random Forest adds additional randomness to the model, while growing the trees. Instead of
searching for the most important feature while splitting a node, it searches for the best feature
among a random subset of features. This results in a wide diversity that generally results in a
better model. Therefore, in Random Forest, only a random subset of the features is taken into
consideration by the algorithm for splitting a node. You can even make trees more random, by
additionally using random thresholds for each feature rather than searching for the best possible
thresholds (like a normal decision tree does).

What is the difference between decision trees and random forest ?


Like I already mentioned, Random Forest is a collection of Decision Trees, but there are some
differences.
If you input a training dataset with features and labels into a decision tree, it will formulate
some set of rules, which will be used to make the predictions.
For example, if you want to predict whether a person will click on an online advertisement, you
could collect the ads the person clicked in the past and some features that describe his decision.
If you put the features and labels into a decision tree, it will generate some rules. Then you can
predict whether the advertisement will be clicked or not. In comparison, the Random Forest
algorithm randomly selects observations and features to build several decision trees and then
averages the results.
Another difference is that "deep" decision trees might suffer from overfitting. Random Forest
prevents overfitting most of the time, by creating random subsets of the features and building
smaller trees using these subsets. Afterwards, it combines the subtrees. Note that this doesn’t
work every time and that it also makes the computation slower, depending on how many trees
your random forest builds.

Note: If you still don't get what the random forest algorithm is, we can say that it is the same
algorithm as the decision tree, but the difference is that we take random subsets of features (not
hand-selected features) and construct several decision trees, where each tree uses its own randomly
selected features. Then, when we have input data, we feed it to all the decision trees, take the output
of each tree, and average (or majority-vote) them to get the final output.
What are the hyper parameters of random forest ?
The hyperparameters in random forest are either used to increase the predictive power of the
model or to make the model faster. I will talk here about the hyperparameters of sklearn's built-
in random forest function. (Sklearn is a Python library built upon SciPy.)
1) Increasing the Predictive Power:
Firstly, there is the "n_estimators" hyperparameter, which is just the number of trees the
algorithm builds before taking the majority vote or averaging the predictions. In
general, a higher number of trees increases the performance and makes the predictions more
stable, but it also slows down the computation.
Another important hyperparameter is "max_features", which is the maximum number of
features Random Forest is allowed to try in an individual tree. Sklearn provides several options,
described in their documentation.
The last important hyperparameter in this group is "min_samples_leaf". This determines, as its
name already says, the minimum number of samples required to be at a leaf node (its relative,
"min_samples_split", is the minimum number of samples required to split an internal node).
2) Increasing the Model's Speed:
The "n_jobs" hyperparameter tells the engine how many processors it is allowed to use. If it
has a value of 1, it can only use one processor. A value of "-1" means that there is no limit.
"random_state" makes the model's output replicable. The model will always produce the same
results when it has a definite value of random_state and has been given the same
hyperparameters and the same training data.
Lastly, there is the "oob_score" (also called oob sampling), which is a random forest cross-
validation method. In this sampling, about one-third of the data is not used to train the model
and can be used to evaluate its performance. These samples are called the out-of-bag samples. It
is very similar to the leave-one-out cross-validation method, but almost no additional
computational burden goes along with it.
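Putting the pieces together, here is a minimal sketch (assuming scikit-learn and its breast-cancer toy dataset) of random-forest based feature selection: train the forest with some of the hyperparameters above, read feature_importances_, and keep only the top features:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0, oob_score=True)
rf.fit(X, y)
print("out-of-bag score:", round(rf.oob_score_, 3))      # the oob cross-validation estimate
top = np.argsort(rf.feature_importances_)[::-1][:10]     # indices of the 10 most important features
X_reduced = X[:, top]                                    # reduced feature matrix for any other model
print(X_reduced.shape)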

=================================================================
In case you just didn't get it: random forest (or decision trees) is not like PCA, which gives you
a set of features to use with any other classification or regression algorithm. It is a classification
or regression model that has a feature selection method inside it. Thus dimensionality reduction
is included in its algorithm, unlike PCA and ICA, which give you a set of new features created
from the original features that you can then use in any classification or regression technique,
say SVM for example.
How to reconstruct from compressed representation (z)?

In fact we can't get the exact data x back, but we can get xapprox, which is the data projected on the line
and is approximately equal to the exact data:
xapprox = Ureduced . z

Notes:
1-As PCA gives fewer features, it is less likely to overfit. Take into consideration that if
overfitting takes place, the solution is to regularize, not to apply PCA (take care of the
difference).
2-Don't start with PCA; use it only when the result is not as you expect or the learning process is
slow due to the high number of features.

Lec15
Gaussian (Normal) distribution:
If x is distributed Gaussian with mean μ and
variance σ², then x ~ N(μ, σ²).
Equation:
p(x; μ, σ²) = (1 / (√(2π) σ)) · exp(−(x − μ)² / (2σ²))
where μ is the mean and σ² is the variance, estimated from the data as
μ = (1/m) Σi x(i) and σ² = (1/m) Σi (x(i) − μ)²
Note: σ is called the standard deviation, which determines the width of the Gaussian probability
density.

Examples: (figures showing Gaussian densities for different values of μ and σ²)
Anomaly detection:
The identification of rare items, events or observations which raise suspicions by differing
significantly from the majority of the data.
Example:
Assume features of aircraft engine were given as
follows for training set (red crosses) and then we
have a new engine (new example) so when the
features of this new engine plotted (green crosses)
we found:
1) the plot was in the same region of other
predefined examples then this new engine is OK and can be shipped to the user
2) the plot was outside the normal region (act as outlier) then this new engine is anomaly and
should be checked in details before being shipped to the user.

How do we define xtest as an anomaly? We model our given data, p(x), then compute p(xtest) and
compare it to ϵ (a very small number), such that if it is less than ϵ the example is flagged as an anomaly;
otherwise it is OK.

Algorithm:
1-Choose features xj that you think might be indicative of anomalous examples.
2-Fit the parameters μj = (1/m) Σi xj(i) and σj² = (1/m) Σi (xj(i) − μj)² for each feature j.
3-Given a new example x, compute p(x) = Πj=1..n p(xj; μj, σj²) and flag it as an anomaly if p(x) < ϵ.
Note: If the features are non-Gaussian, you should transform them to be Gaussian, either by using
the log or the square root or whatever is needed to do that. (check lec15.6)
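A from-scratch sketch of this algorithm (NumPy only, with standard-normal toy data standing in for the normal examples and a hand-picked ϵ):

import numpy as np

def fit_gaussian(X):                         # X: m x n training set of normal examples
    return X.mean(axis=0), X.var(axis=0)     # mu_j and sigma_j^2 for each feature

def p(x, mu, var):
    d = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.prod(d)                        # p(x) = product over the features of p(x_j)

X_train = np.random.randn(1000, 2)           # toy "normal" examples
mu, var = fit_gaussian(X_train)
epsilon = 1e-3
x_new = np.array([4.0, -4.5])                # a point far from the bulk of the data
print("anomaly" if p(x_new, mu, var) < epsilon else "OK")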

Evaluating an anomaly detection system:
Firstly, fit the model p(x) on a training set assumed to contain (mostly) normal, non-anomalous examples.
Then, on a labeled cross-validation/test set that contains a few anomalous examples, predict y = 1
(anomaly) if p(x) < ϵ and y = 0 otherwise, and evaluate with metrics suited to skewed classes such as
precision/recall and the F1-score; the threshold ϵ can also be chosen on the cross-validation set.
It's a very logical way to evaluate. Nothing new.

Why not use supervised learning instead of anomaly detection? (compare)
Anomaly detection is preferred when we have a very small number of positive (anomalous) examples,
a large number of negative examples, and many different "types" of anomalies, so future anomalies
may look nothing like the ones seen so far. Supervised learning is preferred when there are enough
positive and negative examples for the algorithm to learn what the positive class looks like.
Sometimes anomalies are not detected by the anomaly detection system that uses a separate normal
distribution per feature, so let's define the multivariate Gaussian distribution.
Multivariate Gaussian distribution:
Why would we need the multivariate Gaussian distribution? Or, in other words, when could the
anomaly detection system we have studied fail to find anomalous data? Here is an example.

Assume we have a machine-monitoring system with 2 features x1 and x2, distributed as shown by the
red crosses. A new data point arrives (green cross) which is obviously an anomaly. But if we
plot the values of x1 and x2 on their individual probability distributions, we find that both values
are normal values close to other values seen before, so each individual probability is high and
the new data point will not be flagged as an anomaly.

The multivariate Gaussian distribution will not model each p(xj) separately as in the previous
example, but will model p(x) all in one go, so μ and ∑ (the covariance) become vectors/matrices such
that μ ∈ Rn and ∑ ∈ Rn×n.

Equation: p(x; μ, ∑) = (1 / ((2π)n/2 |∑|1/2)) · exp(−(1/2) (x − μ)T ∑−1 (x − μ))

Examples: (figures showing how changing μ and ∑ changes the center, spread and orientation of the density contours)
Parameter fitting for anomaly detection: μ = (1/m) Σi x(i) and ∑ = (1/m) Σi (x(i) − μ)(x(i) − μ)T
Algorithm: fit μ and ∑ on the training set, then for a new example x compute p(x; μ, ∑) and flag it as an anomaly if p(x) < ϵ.
Gaussian Mixture Models (GMM), Anomaly detection & K-nearest neighbours (KNN)
The k-means clustering model explored before is simple and relatively easy to understand, but
its simplicity leads to practical challenges in its application. In particular, the non-probabilistic
nature of k-means, its use of the simple distance-from-cluster-center rule to assign cluster
membership, its restriction to circular clusters, and its difficulty with overlapping clusters lead to
poor performance in many real-world situations. In this section we will take a look at Gaussian
mixture models (GMMs), which can be viewed as an extension of the ideas behind k-means,
but can also be a powerful tool for estimation beyond simple clustering.

Note: The Gaussian or normal distribution is also called a bell curve sometimes. The two
parameters are called the mean μ and standard deviation σ. One of key properties of the normal
distribution is that the highest probability is located at mean while the probabilities approach
zero as we move away from the mean. What makes Gaussians so special is that it turns out
that many dataset distributions are actually Gaussian. A famous theorem in statistics called
the Central Limit Theorem states that as we sum or average more and more samples from a dataset,
the distribution of that sum (or mean) tends towards a Gaussian, even if the original dataset
distribution is not Gaussian! This makes the Gaussian very powerful and versatile!

Gaussian Mixture Models (GMM)


A gaussian mixture model is, as the name shows, a probabilistic model attempting to find a
mixture of multi dimensional gaussian probability distribution (explained in the main course)
that best models any input data set. Here rather than identifying clusters by “nearest” centroids,
we fit a set of k gaussians to the data. And we estimate gaussian distribution parameters such as
mean and Variance for each cluster and weight of a cluster. After learning the parameters for
each data point we can calculate the probabilities of it belonging to each of the clusters.

Equation: P(X) = Σi=1..k Øi · N(X; μi, ∑i), where N(X; μi, ∑i) represents the ith Gaussian cluster in the data with
mean μi and covariance ∑i, and Øi is its weight.
Constraint: Σi=1..k Øi = 1
After learning the parameters μi, ∑i and Øi,


we learnt k gaussian distribution from the
data distribution that k gaussian distributions
are clusters.
Since we learnt k distributions(cluster) then
for any data point x depending upon its
distance from every distribution we will get
probability of it belong to each distribution.
Algorithm of GMM
Under the hood, a Gaussian mixture model is very similar to k-means: it uses an expectation–
maximization approach (EM algorithm) which qualitatively does the following:
1)Choose starting guesses for the location and shape
2)Repeat until converged:
 E-step: for each point, find weights encoding the probability of membership in each
cluster (unlike k-means which do hard assignment to one of the cluster, This is called soft
assignment as we use probability). Note that we’re not assigning each point to a
Gaussian, we’re simply determining the probability of a particular Gaussian generating a
particular point
 M-step: for each cluster, update its weight, covariance and mean based on all data points,
making use of the weights. To update a weight Ø, we simply sum up the probability that
each point was generated by Gaussian and divide by the total number of points. For a
mean μ, we compute the mean of all points weighted by the probability of that point
being generated by Gaussian. For a covariance ∑, we compute the covariance of all
points weighted by the probability of that point being generated by Gaussian. We do each
of these for each Gaussian. Now we’ve updated the weights, means, and covariances! In
K-Means, the maximization step is analogous to moving the cluster centers.
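A minimal sketch (assuming scikit-learn, with two toy Gaussian blobs) of fitting a GMM by EM and reading the soft assignments described above:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 5])   # two toy blobs
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)
print(gmm.weights_)               # the cluster weights (the phi's)
print(gmm.means_)                 # the cluster means (the mu's)
print(gmm.predict_proba(X[:3]))   # soft assignment: probability of each point under each cluster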
What are the advantages and disadvantages of each of K-means and GMM clustering ?
K-means
Advantages:
 Running time.
 Better for high dimensional data.
 Easy to interpret and implement.
Disadvantages:
 Assumes the clusters are spherical, so it does not work efficiently with complex geometrically
shaped (mostly non-linear) data.
 Hard assignment might lead to mis-grouping.
GMM
Advantages:
 Does not assume clusters to be of any geometry; works well with non-linear geometric
distributions as well.
 Does not bias the cluster sizes to have specific (circular) structures as K-Means does.
Disadvantages:
 Uses all the components it has access to, so initialization of clusters is difficult when the
dimensionality of the data is high.
 Difficult to interpret.
Anomaly detection
Anomaly detection is the identification of data points, items, observations or events that do not
conform to the expected pattern of a given group, called outliers. These anomalies occur very
infrequently but may signify a large and significant threat. Anomaly detection is mostly used in
data preprocessing stages which are the most important stage (according to some experts) in
machine learning and building predictive models. Anomaly detection finds extensive use in a
wide variety of applications such as fraud detection for credit cards, insurance or health care,
intrusion detection for cyber-security, fault detection in safety critical systems, and military
surveillance for enemy activities. Sometimes, Anomaly detection is said to be outlier
detection(not so accurate definition but acceptable).
What are the types of anomalies ?
1-Point anomalies (AKA global outliers):
A data point is considered a global outlier if its value is far outside the entirety of the data set in
which it is found; in other words, a single instance of data is anomalous
if it's too far off from the rest.

2-Contextual anomalies (AKA conditional outlier):


A data point is considered a contextual outlier if its value significantly deviates from the rest of
the data points in the same context. Note that this means that same value may not be considered
an outlier if it occurred in a different context. If we limit our discussion to time series data, the
“context” is almost always temporal, because time series data are records of a specific quantity
over time. It’s no surprise then that contextual outliers are common in time series data.

Note: Values are not outside the normal global range, but are abnormal compared to the
seasonal pattern.
3-Collective anomalies (AKA collective outliers):
A subset of data points within a data set is considered anomalous if those values as a collection
deviate significantly from the entire data set, but the values of the individual data points are not
themselves anomalous in either a contextual or global sense.
Note: In the example, two time series that were discovered to be related to each other, are
combined into a single anomaly. For each time series the individual behavior does not deviate
significantly from the normal range, but the combined anomaly indicated a bigger issue.

A good analogy to understand the difference:


A fist-size meteorite impacting a house in your neighborhood is a global outlier because it’s a
truly rare event that meteorites hit buildings. Your neighborhood getting buried in two feet of
snow would be a contextual outlier if the snowfall happened in the middle of summer and you
normally don’t get any snow outside of winter. Every one of your neighbors moving out of the
neighborhood on the same day is a collective outlier because although it’s definitely not rare
that people move from one residence to the next, it is very unusual that an entire neighborhood
relocates at the same time.

Good business example for different type of anomalies


A banking customer who normally deposits no more than $1000 a month in checks at a local
ATM suddenly makes two cash deposits of $5000 each in the span of two weeks is a global
anomaly because this event has never before occurred in this customer’s history. The time series
data of their weekly deposits would show an abrupt recent spike. Such a drastic change would
raise alarms as these large deposits could imply illicit commerce or money laundering.

A sudden surge in order volume at an ecommerce company, as seen in that company’s hourly
total orders for example, could be a contextual outlier if this high volume occurs outside of a
known promotional discount or high volume period like Black Friday. Could this stampede be
due to a pricing glitch which is allowing customers to pay pennies on the dollar for a product?

A publicly traded company’s stock is never a static thing, even when prices are relatively stable
and there isn’t an overall trend, and there are minute fluctuations over time. If the stock price
remained at exactly the same price (to the penny) for an extended period of time, then that
would be a collective outlier. In fact, this very thing occurred to not one, but several tech
companies on July 3 of this year on the Nasdaq exchange when the stock prices for several
companies – including tech giants Apple and Microsoft – were listed as $123.45.
What are types of anomaly detection techniques?
1-Statistical methods as moving average (AKA rolling average or Low-pass filter) and moving
median (These are not machine-learning based algorithm unlike the upcoming ones)
2-Density-based methods as k-nearest neighbors and gaussian mixture models.
3-Clustering-based methods as K-means.
4-SVM-based methods as SVM (for supervised) and One-class SVM (for unsupervised)
GMM was explained enough in the main course to be used in anomaly detection, so we will
discuss here another method in detail, which is k-nearest neighbors.
K-nearest neighbors (KNN)
K-nearest neighbors is mainly a simple robust versatile classification technique (can be used in
regression but not recommended). It falls in the supervised learning family of algorithms. KNN
is described as non-parametric, instance-based (lazy) algorithm:
 Non-parametric: It does not make any assumptions on the underlying data distribution.
This is pretty useful, as in the real world most practical data does not obey the
typical theoretical assumptions made (e.g. Gaussian mixtures, linearly separable, etc.). Non-
parametric algorithms like KNN come to the rescue here. For example, suppose our data
is highly non-Gaussian but the learning model we choose assumes a Gaussian form. In
that case, our algorithm would make extremely poor predictions.
 Instance-based (lazy): It does not use the training data points to do any generalization. In
other words, there is no explicit training phase or it is very minimal. This means the
training phase is pretty fast . Lack of generalization means that KNN keeps all the training
data. More exactly, all the training data is needed during the testing phase. (Well this is an
exaggeration, but not far from truth). This is in contrast to other techniques like SVM
where you can discard all non support vectors without any problem. Most of the lazy
algorithms – especially KNN – makes decision based on the entire training data set (in the
best case a subset of them).The dichotomy is pretty obvious here – There is a non existent
or minimal training phase but a costly testing phase. The cost is in terms of both time and
memory. More time might be needed as in the worst case, all data points might take point
in decision. More memory is needed as we need to store all training data.

Since it is non-parametric, does this mean that there are no assumptions at all?
The answer is neither yes nor no: there are no assumptions on how the data is distributed, but
there are some conditions for being able to apply the algorithm, and even those
conditions are simple and logical:
1-The data should be in feature space, or in other words in a metric space. Therefore data are
scalar or vectors so we can apply distance-based calculations as Euclidean (most popular),
Manhattan, Chebyshev and Hamming distances' equations.
2-Since this is a supervised learning so data should be labeled.
3-The number K should be assigned to determine the number of neighbors that influence the
classification result (this will be understood better from the algorithm).
How to choose K? If K is too low (near 1) we increase variance and decrease bias (overfitting), and if K is too high we are more likely to increase bias (underfitting), so the best value of K is usually found by cross-validation.
Algorithm
1. Load the training and test data
2. Choose the value of K
3. For each point in test data:
- find the distance (Euclidean distance for example) to all training data points
- store the distances in a list and sort it
- choose the first k points
- assign a class to the test point based on the majority of classes present in the chosen points
4. End
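To make these steps concrete, here is a minimal NumPy sketch of the algorithm (the toy data points and the value of k are hypothetical, just for illustration):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # 1- Euclidean distance from the test point to every training point
    distances = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    # 2- indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # 3- majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy 2-D data: class 0 around (1,1), class 1 around (8,8)
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))   # -> 0
print(knn_predict(X_train, y_train, np.array([8.5, 8.5]), k=3))   # -> 1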

What are the equations of the available distance measurements?
Euclidean: d(x, y) = √( Σi (xi − yi)² )
Manhattan: d(x, y) = Σi |xi − yi|
Chebyshev: d(x, y) = maxi |xi − yi|
Hamming: d(x, y) = number of positions i at which xi ≠ yi (used for categorical/binary features)
Curse of dimensionality
The higher we go in dimensions, the harder our problem becomes. This curse is well known in computer science across many related topics, and it was one of the motivations behind the dimensionality reduction methods we have explained. The problem here comes from the fact that distance measures lose meaning in higher dimensions. In high dimensions, a curious phenomenon arises: the ratio between the nearest and farthest points approaches 1, i.e. the points essentially become uniformly distant from each other. This phenomenon can be observed for a wide variety of distance metrics, but it is more pronounced for the Euclidean metric than, say, the Manhattan distance metric. The premise of nearest neighbor search is that "closer" points are more relevant than "farther" points, but if all points are essentially uniformly distant from each other, the distinction is meaningless. So, to end this part: apply a dimensionality reduction method if you are dealing with high-dimensional data, and also try the Manhattan distance; but if the curse persists, KNN is probably not the best-suited classifier for your problem.
Some observations on KNN
1- If we assume that the points are d-dimensional, then the straightforward implementation of finding the k nearest neighbors takes O(dn) time.
2- A simple heuristic for selecting k is k ≈ √n (with n the number of training samples), in case you don't want to apply cross-validation.
3- In KNN, k is usually chosen as an odd number if the number of classes is 2.
4- There are also some nice techniques like condensing, search tree and partial distance that try
to reduce the time taken to find the k nearest neighbor.
(kd search tree (only for low and medium dimensional data) :
1-https://www.coursera.org/lecture/ml-clustering-and-retrieval/kd-tree-representation-S0gfp
2-https://www.coursera.org/lecture/ml-clustering-and-retrieval/nn-search-with-kd-trees-6eTzw)

Improvements on KNN
1- Weighted KNN is one of the most popular improvements to the normal KNN algorithm: the class of each of the K neighbors is multiplied by a weight proportional to the inverse of the distance from that point to the given test point. This ensures that nearer neighbors contribute more to the final vote than the more distant ones.
2- Choosing the distance metric carefully can help improve accuracy, e.g. using the Hamming distance for text classification or the Manhattan distance for higher-dimensional data.
3- Rescaling the data is very important, either by normalization or standardization (mean normalization).
4- As said before, dimensionality reduction is a good step to perform before using the KNN algorithm.

KNN in anomaly detection


K-NN is not limited to merely predicting groups or values of data points. It can also be used in
detecting anomalies. Identifying anomalies can be the end goal in itself, such as in fraud
detection. Anomalies can also lead you to additional insights, such as discovering a predictor
you previously overlooked.
Given data that has 2 classes which are red wine and
white wine based on 2 features which are percentage
of chloride and sulfur dioxide as shown in the figure,
the simplest approach to detecting anomalies is by
visualizing the data in a plot. In the figure for
instance, we can see immediately which wines
deviate from their clusters. However, simple 2-D
visualizations may not be practical, especially when
you have more than 2 predictor variables to
examine. This is where predictive models such as k-NN come in.
As k-NN uses underlying patterns in the data to make predictions, any errors in these
predictions are thus telltale signs of data points which do not conform to overall trends. In fact,
by this approach, any algorithm that generates a predictive model can be used to detect
anomalies. For instance, in regression analysis, an outlier would deviate significantly from the
best-fit line.
In our wine data, we can examine misclassifications generated from k-NN analysis to identify
anomalies. For instance, it seems that red wines that get misclassified as white wines tend to
have higher-than-usual sulfur dioxide content. One reason could be the acidity of the wine.
Wines with lower acidity require more preservatives. Learning from this, we might consider
accounting for acidity to improve predictions.
While anomalies could be caused by missing predictors, they could also arise due to insufficient data for training the predictive model. The fewer data points we have, the more difficult it is to discern patterns in the data. Hence, it is important to ensure an adequate sample size.
Once anomalies have been identified, they can be removed from the datasets used to train
predictive models. This will reduce noise in the data, thus strengthening the accuracy of
predictive models.
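A small sketch of the misclassification idea described above, using scikit-learn's KNN classifier (assuming scikit-learn is available; the two-feature wine-like data here is randomly generated, not the real dataset):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
# hypothetical (chloride %, sulfur dioxide) measurements for 2 classes: 0 = red, 1 = white
X = np.vstack([rng.normal([0.08, 30], [0.01, 5], size=(50, 2)),
               rng.normal([0.04, 120], [0.01, 20], size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = KNeighborsClassifier(n_neighbors=5).fit(X, y)
pred = model.predict(X)

# points whose predicted class disagrees with their label deviate from their cluster
anomaly_idx = np.where(pred != y)[0]
print("potential anomalies:", anomaly_idx)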
Lec16

Recommender system:

Assume a problem like this where we want to construct a recommender system for movies
based on previous recommendation, so how could we construct a learning algorithm for such
system?

First Approach: Content based recommender system:


Each movie will be described by 2 features x1 and x2, where x1 represents how romantic the movie is (as a percentage) and x2 how much action it contains.

x(1)= , x(2)= , x(3)= , x(4)= , x(5)=  x0 = 1 (always)

For each user j, learn a parameter vector θ(j). Predict that user j rates movie i with (θ(j))T.x(i) stars. Now our goal is to predict the values of those question marks using the formula (θ(j))T.x(i), which is a linear regression formula, so we apply the cost function and gradient descent similarly to what we studied in linear regression, as follows:


Second Approach: Collaborative filtering:
We can notice here a different formulation where the features themselves are unknown, as it could be hard to hire people to watch each movie and say how romantic or action-heavy it is. But we have the θs, which tell us how much each user (Alice, Bob, …) loves or hates romantic and action movies, and by methods that collaborate these θs with the ratings given by those users, we will be able to learn the features. What happens here, for example, is that Alice and Bob give 5 to romantic movies, which means they love them; when we check the first movie (Love at Last) we see that both Alice and Bob gave it 5, so most probably this movie is a romantic movie, and so on…
Therefore, we can say the following:

Take care that here we are given the θs and we learn the x(i) (features), so our optimization objective is:

What collaborative filtering do at the end ?


Basic collaborative filtering initializes the θs with random values, then predicts x, then uses these x to predict new θs, takes these new θs to predict new x, and so on.

But going back and forth between x and θ can be optimized further by minimizing over x and θ simultaneously, gathering both optimization objectives into a single objective, as follows:

Algorithm:
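Since the algorithm slide is not reproduced here, below is a rough NumPy sketch of one simultaneous gradient step on the joint objective (Y holds the ratings, R marks which entries were actually rated; the sizes, λ and learning rate are assumptions, not values from the course):

import numpy as np

def cf_gradient_step(X, Theta, Y, R, lam=0.1, alpha=0.01):
    # prediction error only on the entries that were actually rated (R == 1)
    E = (X @ Theta.T - Y) * R                  # shape (num_movies, num_users)
    X_grad = E @ Theta + lam * X               # gradient w.r.t. movie features
    Theta_grad = E.T @ X + lam * Theta         # gradient w.r.t. user parameters
    return X - alpha * X_grad, Theta - alpha * Theta_grad

# toy sizes: 5 movies, 4 users, 2 features (all hypothetical)
rng = np.random.RandomState(1)
X, Theta = rng.randn(5, 2), rng.randn(4, 2)
Y = rng.randint(0, 6, size=(5, 4)).astype(float)
R = (rng.rand(5, 4) > 0.3).astype(float)       # 1 where a rating exists
for _ in range(200):
    X, Theta = cf_gradient_step(X, Theta, Y, R)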

Mean normalization:
If you have a user that hasn't given any rating to any movie, the previous method will end up predicting a 0 rating for all movies for that user. Mean normalization solves this problem: instead of using the given ratings directly, we subtract each movie's mean rating μ (i.e. use Y − μ), as follows:
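A tiny sketch of the idea, reusing the hypothetical Y and R matrices from the previous snippet:

import numpy as np

def mean_normalize(Y, R):
    # per-movie mean computed only over the entries that were rated
    mu = (Y * R).sum(axis=1) / np.maximum(R.sum(axis=1), 1)
    Y_norm = (Y - mu[:, None]) * R             # keep unrated entries at 0
    return Y_norm, mu

# after learning on Y_norm, predict user j on movie i as theta_j . x_i + mu[i],
# so a user with no ratings gets the movie's mean rating instead of 0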
Lec17
Remember:
As the number of examples in the dataset increases, learning curves become better and the model is less likely to overfit, but at the same time it is computationally expensive to apply gradient descent in its normal (batch) form on the whole dataset, so we have to find better techniques.

Stochastic gradient descent:


The gradient descent we used before is called batch gradient descent, where each update of θ uses a sum over all of the examples (θj := θj − α·(1/m)·Σ(i=1 to m) (hθ(x(i)) − y(i))·xj(i)). Assuming we have a training set of 300,000,000 examples, it is very expensive to do such a summation, as we need to read all 300 million records into the computer's memory just to compute the derivative term. Stochastic gradient descent doesn't have to look at all training examples in every single iteration; it only looks at a single example per iteration.

Algorithm:
Note:
The lines in red is
batch gradient
descent, while the
lines in magenta is
stochastic gradient
descent.

Mini-batch gradient descent:


It is an intermediate stage between batch and stochastic gradient descent, where we use b examples in each iteration and then update the θs. The typical value for b is 10, and the typical range is 2–100.
Algorithm:
Map reduce approach:
This approach is used when the data is so large that one machine cannot handle it; the number of examples would be close to 400,000,000 or more before this approach pays off. What it does is divide the data by the number of machines (if 4 machines are to be used, divide the data into 4 equal parts), pass each part to a machine to compute its partial gradient sum, and then combine the partial sums on a central server to perform the usual update, as follows:
Batch, Mini-batch & Stochastic gradient descent

Batch gradient descent:
- More smooth and predictable
- Takes more time to compute each update (slower)
- More memory usage
- No parallelism can occur
- More likely to get stuck in a local minimum rather than the global one

Stochastic gradient descent:
- More noisy (more variance)
- Shorter time to compute each update (faster)
- Less memory usage
- Parallelism can take place
- Has more chance to be kicked out of local minima and find the global one

Mini-batch gradient descent:
- A balance between both; it is the most used method, taking the performance benefits of SGD while smoothing out some of the noise injected by SGD.
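A minimal NumPy sketch of mini-batch gradient descent for linear regression (the batch size b, learning rate and number of epochs are arbitrary choices); setting b = m recovers batch gradient descent and b = 1 recovers SGD:

import numpy as np

def minibatch_gd(X, y, b=10, alpha=0.01, epochs=50):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        perm = np.random.permutation(m)              # shuffle the examples each epoch
        for start in range(0, m, b):
            idx = perm[start:start + b]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)   # gradient on this mini-batch only
            theta -= alpha * grad
    return theta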
Two other main topics
Association & Naïve bayes
Association
We said in the introduction that unsupervised learning is used in what is called association. Association rule learning is one of the very important concepts of machine learning, being used in market basket analysis.
Example: In a store, all vegetables are placed in the same aisle(‫)ممر‬, all dairy items are placed
together and cosmetics form another set of such groups. Investing time and resources on
deliberate product placements like this not only reduces a customer’s shopping time, but also
reminds the customer of what relevant items (s)he might be interested in buying, thus helping
stores cross-sell in the process. Association rules help uncover all such relationships between
items from huge databases.

Conceptually association rules is a very simple technique. The end result is one or more
statements of the form “if this happened, then the following is likely to happen." In a rule, the
"if" portion is called the antecedent, and the "then" portion is called the consequent.

Inituition example of how association rules work.


Here you see each person and the list of movies they have watched, here represented by
numbers from 1-5. 
As you work your way down that table
the first thing to stand out is that the
first and third users both watched
movies 1 and 4. From this data, the rule
there would be “if somebody watches
movie number 1, then they are likely to
watch movie number 4.” You’ll need to
understand the two terms I snuck in above: movie 1 is the antecedent and movie 4 is the
consequent. Let’s look at this rule in more detail.
How useful is this rule? There are 2 users out of 5 who watched both movies 1 and 4, so we can say that this rule has support of 40% (2 out of 5 users). How confident are we that it's a reliable indicator? Three users watched movie number 1, but only 2 of them also watched number 4, so the confidence in this rule is 67 percent.
Note that if you reverse the rule (or swap the antecedent and consequent, if you prefer) we can also say that "if somebody watched movie number 4 then they are likely to watch movie number 1." However, while the support is still 40 percent, the confidence changes and is now only 50 percent (check the table above to see how that comes about).
How to understand each of support, confidence and new term called "lift"?
Support  For example it might mean that people who watch some obscure 70s documentary
will also watch an equally obscure 80s film. In the movie recommendation space, this would
translate to a niche rule that might not get used very often, but could be quite valuable to that
very small subset of customers. However, if you were using rules to find the optimal placement
of products on the shelves in a supermarket, lots of low support rules would lead to a very
fragmented set of displays. In this kind of application, you might set a threshold for support and
discard rules that didn’t meet that minimum.
Confidence  It is a little easier to understand. If there’s a rule linking two movies but with
very low confidence, then it simply means that most of the time they watch the first movie, they
won’t actually watch the second one. For the purpose of making recommendations or
predictions, you’d much rather have a rule that you were confident about. You could also use a
minimum threshold for confidence and ignore or discard rules below a certain threshold.
Take another look at the first rule from above: if somebody watches movie 1 they will also
watch movie 4. The confidence here is 67 percent which is pretty good. But take a look at the
rest of the table. Four out of 5 users watched movie number 4 anyway. If we know nothing else
about their other movie preferences, we know that there’s an 80 percent chance of them
watching movie 4. So despite that confidence of 67 percent that first rule we found is actually
not useful: somebody who has watched movie 1 is actually less likely to watch movie 4 than
somebody picked at random. Fortunately, there’s a way to take this into account. It’s called
“lift”.
Lift  It is used to measure the performance of the rule when compared against the entire data
set. In the example above, we would want to compare the probability of “watching movie 1 and
movie 4” with the probability of “watching movie 4” occurring in the dataset as a whole. As
you might expect, there’s a formula for lift:
Lift is equal to the probability of the consequent given the antecedent (that’s just the confidence
for that rule) divided by probability of that consequent occurring in the entire data set (which is
the support for the consequent), or more concisely:
Lift = confidence / support(consequent)
In this example, the probability of movie 4, given that movie 1 was watched, is just the
confidence of that first rule: 67 percent or 0.67. The probability of some random person in the
entire dataset (of just 5 users in this simple example) watching movie 4 is 80 percent or 0.8.
Dividing 0.67 by 0.8 gives a lift of approximately 0.84.
In general, if you have a lift of less than 1, it shows a rule that is less predictive than just
picking a user at random which is the case with this rule as I explained in the first paragraph of
this section. If you have a lift of around 1, then it’s indicating two independent events, e.g.,
watching one movie does not influence the likelihood of watching another. Values of lift that
are greater than 1 show that the antecedent does influence finding the consequent. In other
words, here is a rule that is useful.
Equation of each metric (Support, Confidence & Lift)
Assume the antecedent is X and the consequent is Y. Then:
Support(X → Y) = frq(X, Y) / N   (the fraction of all N transactions that contain both X and Y)
Confidence(X → Y) = frq(X, Y) / frq(X) = P(Y | X)
Lift(X → Y) = Confidence(X → Y) / Support(Y) = P(Y | X) / P(Y)
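A small sketch computing the three metrics directly from a list of transactions. The per-user watch lists below are hypothetical, chosen only so that the counts match the movie example above (2 of 5 users watched both 1 and 4, 3 watched movie 1, 4 watched movie 4):

def support(transactions, items):
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, X, Y):
    return support(transactions, set(X) | set(Y)) / support(transactions, X)

def lift(transactions, X, Y):
    return confidence(transactions, X, Y) / support(transactions, Y)

transactions = [{1, 4}, {2, 4}, {1, 3, 4}, {1, 5}, {2, 4}]
print(support(transactions, {1, 4}))       # 0.4  -> 40%
print(confidence(transactions, {1}, {4}))  # 0.666... -> 67%
print(lift(transactions, {1}, {4}))        # ~0.83, i.e. less than 1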

====================================================================
Now that we understand how to quantify the importance of association of products within an
item set, the next step is to generate rules from the entire list of items and identify the most
important ones. This is not as simple as it might sound. Supermarkets will have thousands of
different products in store. After some simple calculations, it can be shown that just 10 products
will lead to 57000 rules!! And this number increases exponentially with the increase in number
of items. Finding lift values for each of these will get computationally very very expensive.
How to deal with this problem? How to come up with a set of most important association rules
to be considered? Apriori algorithm comes to our rescue for this. The challenge is the mining of
important rules from a massive number of association rules that can be derived from a list of
items.
===================================================================

Apriori algorithm
Apriori is an algorithm for frequent item set mining and association rule learning over
transactional databases. It proceeds by identifying the frequent individual items in the database
and extending them to larger and larger item sets as long as those item sets appear sufficiently
often in the database. Suppose our item universe is {A,B,C,D} and assume that the items {A,B} occur together frequently (frequently means their support exceeds a specified threshold). The number of transactions containing a larger item set {A,B,…} (i.e. any superset of {A,B}) can only be less than or equal to the number of transactions containing {A,B}; conversely, every subset of a frequent item set must itself be frequent, without checking. This is called the anti-monotone property of support: if we drop an item from an item set, the support value of the new item set generated will either stay the same or increase. The Apriori principle is a result of this anti-monotone property of support. Similarly, when we start calculating the frequency of individual items: if {A,B} does not satisfy our threshold, an item set with any item added to it will never cross the threshold either.
Note: Frequent item sets are the item sets for which support value (fraction of transactions
containing the item set) is above a minimum threshold (Threshold is aka minsup).
Algorithm in steps:
1-Generate all frequent item sets (support ≥ minsup) having only one item.
2-Next, generate item sets of length 2 as all possible combinations of above item sets. Then,
prune the ones for which support value fell below minsup.
3-Generate item sets of length 3 as all possible combinations of length 2 item sets (that
remained after pruning) and perform the same check on support value.
4-Iteratively, we keep increasing the length of item sets by one like this and check for the
threshold at each step until we reach the end.
(Visualize: https://annalyzin.files.wordpress.com/2016/04/association-rules-apriori-tutorial-
explanation.gif
At first we had 4 individual items (tomato, sausage sandwich, guava & grapes); the support of tomato was less than minsup so we removed it, then we made all possible combinations of the three remaining items to form item sets of length two, then checked that each item set's support is above minsup (all of them passed), then we formed all possible combinations again to form item sets of length 3.)
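A rough sketch of the level-wise generate-and-prune loop described above (the transactions and minsup are made up):

def apriori(transactions, minsup=0.5):
    n = len(transactions)
    support = lambda s: sum(s <= t for t in transactions) / n
    items = {i for t in transactions for i in t}
    # level 1: frequent individual items
    levels = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsup}]
    k = 2
    while levels[-1]:
        prev = levels[-1]
        # candidates of length k built from the surviving (k-1)-item sets
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        levels.append({c for c in candidates if support(c) >= minsup})
        k += 1
    return [s for level in levels for s in level]

transactions = [frozenset(t) for t in [{'bread', 'milk'}, {'bread', 'egg'},
                                       {'bread', 'milk', 'egg'}, {'milk', 'egg'}]]
print(apriori(transactions, minsup=0.5))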

How association rule is generated?


It is generated through 2 steps:
1- Generate most frequent item sets from the whole list of items available using apriori
algorithm discussed.
2-Generate all possible rules from the frequent item sets generated.
How step 2 is done?
If {Bread, Egg, Milk, Butter} is the frequent item set, candidate rules will look like:
(Egg, Milk, Butter → Bread), (Bread, Milk, Butter → Egg), (Bread, Egg → Milk, Butter), (Egg,
Milk → Bread, Butter), (Butter→ Bread, Egg, Milk), so on …
From a list of all possible candidate rules, we aim to identify rules that fall above a minimum
confidence level (minconf). Just like the anti-monotone property of support, confidence of
rules generated from the same itemset also follows an anti-monotone property. It is anti-
monotone with respect to the number of elements in consequent.
This means that confidence of (A,B,C→ D) ≥ (B,C → A,D) ≥ (C → A,B,D).
Example: We start with a frequent item set {a,b,c,d} and start forming rules with just one
consequent. Remove the rules failing to satisfy the
minconf condition. Now, start forming rules using a
combination of consequents from the remaining ones.
Keep repeating until only one item is left on antecedent.
This process has to be done for all frequent item sets.
Note: minimum confidence threshold that we pick up is
completely subjective to the problem at hand.
With these two steps, we have identified a set of association rules which satisfy both the
minimum support and minimum confidence condition. The number of such rules obtained will
vary with the values of minsup and minconf. Now, this subset of rules generated will be
searched for highest values of lift to make business decisions.

Maximal frequent item set and Closed frequent item set


Maximal frequent item set: It is a frequent item set for which none of the immediate supersets are frequent. This is a frequent item set X to which no item y can be added such that {X, y} still remains above the minsup threshold.
(In other words: if I have an item set {A,B} that is frequent, i.e. it passes the minsup threshold I set, and I try to add any item to it and find that no item(s) can be added while keeping the new item set frequent, then the {A,B} item set is called a maximal frequent item set.)

Closed frequent item set: It is a frequent item set for which there exists no superset with the same support as the item set. Consider an item set X: if ALL occurrences of X are accompanied by an occurrence of some item Y, then X is NOT a closed set.
(In other words: if I have an item set {A,B} that is frequent, i.e. it passes the minsup threshold I set, and I find that whenever {A,B} appear together some third item, say C, always appears with them, then the {A,B} item set is NOT a closed frequent item set, and vice versa.)
Naïve bayes
Before starting with Naïve Bayes, we have to discuss what is called Bayes' theorem.

Bayes' Theorem
Bayes' theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes' theorem is stated mathematically as the following equation:
P(A|B) = P(B|A) · P(A) / P(B)

where, P(A|B) : the conditional probability of response variable belonging to a particular value,
given the input attributes. This is also known as the posterior probability.
P(A) : The prior probability of the response variable.
P(B) :The probability of training data or the evidence.
P(B|A) : This is known as the likelihood of the training data.

Thus, we can rewrite the previous equation as:
P(y | x1, x2, …, xn) = P(x1, x2, …, xn | y) · P(y) / P(x1, x2, …, xn)
noting that A here represents y, the class variable, and B represents x, the features or predictors, of size n (x1, x2, …, xn).

Example: A Path Lab is performing a Test of disease say “D” with two results “Positive” &
“Negative.” They guarantee that their test result is 99% accurate: if you have the disease, they
will give test positive 99% of the time. If you don’t have the disease, they will test negative
99% of the time. If 3% of all the people have this disease and test gives “positive” result, what
is the probability that you actually have the disease?
For solving the above problem, we will have to use conditional probability.
Probability of people suffering from Disease D, P(D) = 0.03 = 3%
Probability that the test gives a "positive" result given that the patient has the disease, P(Pos | D) = 0.99 = 99%
Probability of people not suffering from Disease D, P(~D) = 0.97 = 97% (~D ≡ NOT D)
Probability that the test gives a "positive" result given that the patient does not have the disease, P(Pos | ~D) = 0.01 = 1%
For calculating the probability that the patient actually have the disease i.e, P( D | Pos) we will
use Bayes theorem:
We have all the values of numerator but we need to calculate P(Pos):
P(Pos) = P(D, pos) + P( ~D, pos)
= P(pos|D)*P(D) + P(pos|~D)*P(~D) = 0.99 * 0.03 + 0.01 * 0.97 = 0.0297 + 0.0097 = 0.0394
Let’s calculate, P( D | Pos) = (P(Pos | D) * P(D)) / P(Pos) = (0.99 * 0.03) / 0.0394 = 0.7538
So, Approximately 75% chances are there that the patient is actually suffering from disease.
Naïve Bayes
The naive Bayes algorithm works by making an assumption of conditional independence between the training dataset features (predictors). Therefore, instead of computing P(x1, x2, …, xn | y) jointly, we can write P(x1|y) × P(x2|y) × … × P(xn|y) = ∏(i=1 to n) P(xi | y). This is a very strong assumption that makes the Bayes algorithm naïve, and it is most unlikely to hold in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well on data where this assumption does not hold. Naïve Bayes is mostly used in natural language processing (NLP).
Equations:
ŷ = argmax over y of  P(y) · ∏(i=1 to n) P(xi | y)
We calculate the probability of each class given the attributes and take the class with the highest probability; this is known as Maximum A Posteriori (MAP).
Note: Assume we have n attributes (features) and the response has C classes; then the complexity of our algorithm is O(Cn).

How to calculate each of the class probabilities and conditional probabilities?


Calculating Class Probabilities: The class probabilities are simply the frequency of instances
that belong to each class divided by the total number of instances
Calculating Conditional Probabilities: The conditional probabilities are the frequency of each
attribute value for a given class value divided by the frequency of instances with that class
value.
Naïve Bayes example
Recalling our example of playing golf against our four attributes (outlook, wind, humidity &
temperature) that we have done in decision trees part.

Let's apply our formulae manually on this dataset of 14 examples. We need to compute P(xi | yj) for each value of x and each value of y:
-Values of x are sunny, rainy, overcast, hot, cool, mild, high, normal, true & false.
-Values of y are Yes or No.
So each P(x|y) is as shown in the following figure :

Now our classifier is ready. Let's take a test example: today = {Sunny, Hot, Normal, False}.
What should our prediction for playing golf (y) be? y could be Yes or No, so we compute both probabilities given today's data and take the higher one, according to MAP:
P(Yes | today) ∝ P(Sunny|Yes) · P(Hot|Yes) · P(Normal|Yes) · P(False|Yes) · P(Yes), and similarly for P(No | today).
Therefore y_predicted = argmax of (P(Yes|today), P(No|today)) = YES
Note: Argmax returns the index of the highest value so if Yes is 1 and No is 0 then returns 1.
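To make the counting concrete, here is a small sketch of a categorical naive Bayes prediction by direct frequency counting. The five rows below are a made-up mini-table, not the actual 14-example dataset (which is only shown as a figure), so the resulting numbers will differ:

from collections import Counter

def naive_bayes_predict(data, labels, query):
    n = len(labels)
    scores = {}
    for c, count in Counter(labels).items():
        p = count / n                                    # prior P(y = c)
        for j, value in enumerate(query):
            match = sum(1 for x, y in zip(data, labels) if y == c and x[j] == value)
            p *= match / count                           # P(x_j = value | y = c)
        scores[c] = p
    return max(scores, key=scores.get), scores           # MAP decision

# columns: (outlook, temperature, humidity, windy)
data = [('Sunny', 'Hot', 'High', True), ('Rainy', 'Mild', 'Normal', False),
        ('Overcast', 'Cool', 'Normal', True), ('Sunny', 'Hot', 'Normal', False),
        ('Rainy', 'Cool', 'High', True)]
labels = ['No', 'Yes', 'Yes', 'Yes', 'No']
print(naive_bayes_predict(data, labels, ('Sunny', 'Hot', 'Normal', False)))

Note that with such a tiny table some conditional probabilities can be zero; in practice a smoothing technique such as Laplace correction (adding 1 to every count) avoids wiping out the whole product.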
What are the types of Naïve bayes classifier?
1-Gaussian Naïve Bayes:
Above, we calculated the probabilities for input values for each class using a frequency. With
real-valued inputs, we can calculate the mean and standard deviation of input values (x) for
each class to summarize the distribution. This means that in addition to the probabilities for
each class, we must also store the mean and standard deviations for each input variable for each
class.
Naive Bayes can be extended to real-valued attributes, most commonly by assuming a Gaussian
distribution. This extension of naive Bayes is called Gaussian Naive Bayes. Other functions can
be used to estimate the distribution of the data,
but the Gaussian (or Normal distribution) is the
easiest to work with because you only need to
estimate the mean and the standard deviation from your training data.
Equation: P(xi | y) = (1 / √(2πσ²)) · exp( −(xi − μ)² / (2σ²) ), with μ and σ estimated per class.
Remember:
Mean: μ = (1/n) Σi xi   and standard deviation: σ = √( (1/n) Σi (xi − μ)² )
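A short sketch of the Gaussian likelihood computation with a per-class mean and standard deviation (the 1-D measurements below are hypothetical):

import numpy as np

def gaussian_likelihood(x, mu, sigma):
    # P(x | class) under a normal distribution fit to that class
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

class_a = np.array([150.0, 155.0, 160.0])     # e.g. a feature measured for class A
class_b = np.array([175.0, 180.0, 185.0])     # and for class B
x = 172.0
for name, values in [("A", class_a), ("B", class_b)]:
    mu, sigma = values.mean(), values.std()
    print(name, gaussian_likelihood(x, mu, sigma))   # multiply by the class prior for MAP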

2-Multi-nomial Naïve Bayes:


Multi-nomial Naive Bayes is preferred to use on data that is multi-nomially distributed. It is one
of the standard classic algorithms. Which is used in text categorization (classification). Each
event in text classification represents the occurrence of a word in a document.

3-Bernoulli Naïve Bayes:


Bernoulli Naive Bayes is used on the data that is distributed according to multivariate Bernoulli
distributions .i.e., multiple features can be there, but each one is assumed to be a binary-valued
(Bernoulli, Boolean) variable. So, it requires features to be binary valued.

What are the pros and cons of Naïve Bayes?


Advantages
 Naive Bayes is a simple, fast, highly scalable algorithm.
 Naive Bayes can be used for binary and multiclass classification.
 It is a great choice for text classification problems; it's a popular choice for spam email classification.
 It can be easily trained on a small dataset.
Disadvantages
It considers all the features to be unrelated (conditionally independent), so it cannot learn relationships between features.
Tips to improve the power of Naive Bayes Model
Here are some tips for improving the power of a Naive Bayes model:
 If continuous features do not have a normal distribution, we should use a transformation or other methods to convert them into a normal distribution.
 If the test data set has a zero-frequency issue, apply a smoothing technique such as Laplace correction to predict the class of the test data.
 Remove correlated features, as highly correlated features are effectively voted twice in the model, which can lead to over-inflating their importance. (I think this can be done by PCA.)
END OF PART 1
MACHINE LEARNING
PART 2
DEEP LEARNING
BASICS
Machine learning VS. Deep learning
What is Deep learning?
Deep Learning is a subfield of machine learning concerned with algorithms inspired by the
structure and function of the brain that focuses on a family of models called artificial neural
networks (ANN). Both machine and deep learning fall within the field of Artificial intelligence
(AI). In other words, Deep Learning is a new area of Machine Learning research, which has
been introduced with the objective of moving Machine Learning closer to one of its original
goals: Artificial Intelligence. The word “deep” of deep learning is a technical term and refers to
the number of layers or segments in the “network” part of “neural networks.” Deep learning is
currently playing critical roles in developing highly automated systems such as in self-driving
cars and natural language recognition and understanding.

What is the difference between deep learning and machine learning technically?
The major difference between deep learning and machine learning is its execution:
1-Scaling with data size: deep learning algorithms need a large amount of data, which is why they don't perform that well when the data is small. On the other hand, traditional machine learning algorithms, with their handcrafted rules, win in this situation.
2-Deep learning algorithms heavily depend on high-end machines, compared to machine learning algorithms, because deep learning requires GPUs, which are an integral part of its working. Machine learning can work on low-end machines; most of the applied features need to be identified by an expert and then hand-coded as per the domain and data type.
3-Deep learning algorithms try to take high-level features from data. This is an extremely
distinctive part of Deep Learning and a major step ahead of Machine Learning. Therefore, deep
learning reduces the task of developing new feature extractor for each issue.
4-Usually, a deep learning algorithm takes a long time to train. This is because there are so
many parameters in a deep learning algorithm that training them takes longer than usual.
Whereas machine learning comparatively takes much less time to train, ranging from a few
seconds to a few hours.
Course 1 Week1
Introduction:
Example:
Assume you have a feature called size of house and
you want to predict the price given some examples as
shown in the graph. Probably if you are familiar with
linear regression you will go with it to fit these
data.(blue line fitting the data)
As said before that deep learning is actually a step towards mimicking the brain so let us see
how could such a problem be represented in neural networks.
This will be represented as a simple neuron having one input x
(size of house) and having one output y which is the price.

Note: The line drawn up there doesn't actually exist in neural networks; it is replaced by what is called the ReLU function (discussed in detail in the activation functions part of this chapter) to add some sort of non-linearity. (linear) → (ReLU)
A larger network is constructed by having more neurons and stacking them together. Also, the features can be more than just the size of the house, for example by adding other features like #bedrooms, zip code and so on …

These features (1st blue rectangle) by default introduce other higher-level features (red arrows) by combining them: for example, from the size of the house and the number of bedrooms we can identify how many family members could fit in this house, and from these higher-level features we get even higher ones, and so on, until we reach our output, which is the price of the house based on these higher features, as shown in the figure.

In fact these higher-level features are introduced internally by the network, not by us; instead of giving each neuron specific inputs, we give each neuron all the features and let the network decide how to combine them into higher-level features. As we give our network more examples with x and y, the neural network becomes remarkably better at mapping x to y.
Remember: Supervised learning
depends on having both input
features and labels (outputs)
before training our classifier.
The features and labels differ
from an application to another.

For the given applications in the preceding image, we use different types of neural networks
that best fit this application:
-For Real estate and online advertising  Standard neural network
-For Photo tagging (Photos and videos)  Convolutional neural network
-For Speech recognition and Machine translation (time-series apps) Recurrent neural network
-For Autonomous driving (complex advanced application)  Custom/ Hybrid NN
Custom or hybrid means that we use a mixture of CNN, RNN and other types.

Supervised learning deals with two types of data, called structured data and unstructured data. Structured data is like everything we did in the previous chapter: we have some features, as in our previous example (size, number of bedrooms, …), and the output determines the price. Unstructured data is like audio (for speech recognition), images (for image recognition), text (for natural language processing), etc., where the elementary data is different: in images we have pixels, in text we have individual words, and so on …
Historically, it was harder for computers to make sense of unstructured data than structured data, but now, with neural networks, the big data era and higher computational speeds, we are much better at dealing with such data.
Course 1 Week2
Binary classification:
As said before in machine learning part that
binary classification means that you have 2
classes which represents positive or negative
such that if the classification is about cats so we
predict y to be either 1 (cat or positive example)
or 0 (non-cat negative example).

How is an image represented in a computer? An image is stored in the form of three matrices called the RGB channels, where each matrix holds the intensity of that color for each pixel. For the first pixel in the image (top left) to be white, the value of that pixel (top left) in each of the 3 matrices must be 255, noting that each value in each matrix ranges from 0→255.
If the image is of size 64×64, each of the 3 matrices will be of size 64×64.
To represent those matrices as a feature vector x (the input), we unroll each of the RGB matrices into a column vector and stack them together, ending up with a column vector of size (64×64×3)×1 = 12288×1, as shown below: (Note: notation n = n_x = 12288)

Notations:
A single example is expressed as (x, y) where x ∈ ℝ^(n_x), y ∈ {0,1}
For m training examples: {(x(1), y(1)), (x(2), y(2)), …, (x(m), y(m))}
(m ≡ m_train = number of training examples) & (m_test = number of test examples)
X = [x(1) x(2) … x(m)], an n_x × m matrix whose columns are the training examples
Y = [y(1) y(2) … y(m)], a 1 × m row vector of the labels
Logistic regression: (recap)
Given x ∈ ℝ^(n_x), we want ŷ = P(y=1 | x)
The parameters will be w ∈ ℝ^(n_x), b ∈ ℝ
What are w & b? They play the role of the θ we were using in the machine learning part, where:
b ≡ θ0·x0 (noting that x0 always equals 1, as we said before) and w ≡ (θ1, …, θn). Hence:
(z = θTx = θ0x0 + θ1x1 + θ2x2 + … + θnxn in machine learning) ≡ (z = wTx + b in deep learning)
In neural networks we stick to the w, b notation rather than θ, as it tends to be easier for explanation, so we will use w, b throughout this chapter as our learning parameters.

In logistic regression: ŷ = σ(wTx + b) = sigmoid(wTx + b) = σ(z)

Remember: σ(z) = 1 / (1 + e^(−z)) → if z is large then σ(z) ≈ 1/(1+0) = 1, and if z is a large negative number then σ(z) = 1/(1 + e^(|z|)) ≈ 0

Cost function: J(θ) = −(1/m) Σ(i=1 to m) [ y(i) log(hθ(x(i))) + (1 − y(i)) log(1 − hθ(x(i))) ]   [in machine learning]

ℓ(ŷ, y) = −( y log(ŷ) + (1 − y) log(1 − ŷ) )   [in deep learning, for a single example]
For m training examples: J(w, b) = (1/m) Σ(i=1 to m) ℓ(ŷ(i), y(i))
= −(1/m) Σ(i=1 to m) [ y(i) log(ŷ(i)) + (1 − y(i)) log(1 − ŷ(i)) ]
Our goal is to find w, b that minimize J(w, b).
Note:
1-There is no difference in anything except notation, as ŷ here equals σ(wTx + b) while the hypothesis hθ(x) in machine learning was written in terms of θTx. Similarly with J(θ) and ℓ(ŷ, y): both denote the cost, which measures the difference between the predicted and the actual value.
2-ℓ(ŷ, y) is an abbreviation for the loss between y_actual and y_predicted.
3-Loss function ≡ Cost function ≡ Error function.
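A tiny NumPy sketch of the forward computation and the loss defined above (the parameter values and the example are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(y_hat, y, eps=1e-12):
    # cross-entropy loss for a single example (eps avoids log(0))
    return -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

w, b = np.array([0.5, -0.3]), 0.1      # hypothetical parameters (n = 2 features)
x, y = np.array([1.0, 2.0]), 1         # hypothetical example
y_hat = sigmoid(w @ x + b)             # y_hat = sigma(w.T x + b)
print(y_hat, loss(y_hat, y))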

Gradient descent:
Intuition behind derivatives:
We know that if f(a) = 3a then f′(a) = 3, but the intuition behind it is this: if a = 2 then f(a) = 6, and if we move a little bit to the right, as shown in the figure, to a = 2.001, then f(a) = 6.003. Therefore the change in the variable a results in triple as much change in f(a). The slope (derivative) of f(a) at a = 2 is Δf(a)/Δa = 0.003/0.001 = 3.

Note: The definition of derivative doesn't state the moving by 0.001 but it is much smaller
movement called by infinitesimal amount which is much smaller than 0.001. We just used
0.001 for easier explanation.

Computational graph:
The computation of a neural network is, as discussed before, a combination of a feed-forward pass followed by backpropagation, in an iterative manner; the computation graph explains why neural networks are organized this way.
Assume a function J of 3 variables as follows: J(a,b,c) = 3(a + bc). How do we compute its derivatives?
The computation in this example is done in 3 steps:
1- u = bc   2- v = a + u   3- J = 3v  → Computing the derivatives is done by backpropagation from right to left.
Assume we initialize a, b, c as 5, 3, 2 respectively (initialization of the learning parameters). Therefore:
As you can notice, we have already done the feed-forward computation on the computation graph (u = 3×2 = 6, v = 5 + 6 = 11, J = 3×11 = 33), so let us do the backpropagation for computing the derivatives.
As said, we will start from right to left: dJ/dv = ?
J = 3v and v = 11, so if we move a little from 11→11.001, J will move from 33→33.003, so dJ/dv = 3.
Now we have gone 1 step backward from J to v. Next we have 2 steps, one from v to u and one from v to a, so we compute two derivatives (dJ/du & dJ/da).
For dJ/da: as a goes from 5→5.001, v goes from 11→11.001, so J goes from 33→33.003, so dJ/da = 3.
For dJ/du: as u goes from 6→6.001, v goes from 11→11.001, so J goes from 33→33.003, so dJ/du = 3.
Note: dJ/da = (dJ/dv)·(dv/da) and dJ/du = (dJ/dv)·(dv/du); we already have dJ/dv = 3 from the first step, so all we need is to compute dv/da and dv/du, according to the chain rule.
Remember: our target in gradient descent is computing the derivative of the final output J with respect to the learning parameters a, b & c.
Now we have one of the 3 derivatives needed, which is dJ/da, and the remaining ones are dJ/db and dJ/dc.
Similarly, using the chain rule: dJ/db = (dJ/dv)·(dv/du)·(du/db). (We know that dJ/dv = 3 and dv/du = 1, so du/db = ?)
As b goes from 3→3.001, u goes from 6→6.002, so du/db = 2 and dJ/db = 3×1×2 = 6.
Similarly with c: dJ/dc = (dJ/dv)·(dv/du)·(du/dc) = 3×1×3 = 9.

Note: dJ/d(variable) is written in code simply as d(variable); for example dJ/da ≡ da.
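A quick numeric sketch of the same graph, checking the derivatives by nudging each input by 0.001 exactly as in the explanation above:

def J(a, b, c):
    u = b * c          # first node
    v = a + u          # second node
    return 3 * v       # output node

a, b, c, eps = 5, 3, 2, 1e-3
base = J(a, b, c)                            # 33
print((J(a + eps, b, c) - base) / eps)       # dJ/da ~ 3
print((J(a, b + eps, c) - base) / eps)       # dJ/db ~ 6
print((J(a, b, c + eps) - base) / eps)       # dJ/dc ~ 9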

How can the computational graph be implemented with logistic regression?

Note: Take care that a ≡ ŷ, as we will continue this chapter using the "a" notation for our outputs.
Remember:

Gradient descent for logistic regression using computational graphs


Our target is to find (dℓ/dw1, dℓ/dw2, dℓ/db) to be able to update the w's & b (weights and biases).

Going back from right to left: da = dℓ/da = −y/a + (1 − y)/(1 − a)   (using the derivative rules on the loss)
Then back again to z: dz = dℓ/dz = (dℓ/da)·(da/dz) = da · a(1 − a) = a − y
Then back again to the weights and bias:
dw1 = dℓ/dw1 = x1·dz  &  dw2 = dℓ/dw2 = x2·dz  &  db = dℓ/db = dz
Update equations:
w1 := w1 - α dw1
w2 := w2 - α dw2  The 3 equations are updated simultaneously as discussed before.
b := b - α db Note: This equations for a single training example.
For m-training examples:

Notes:
1-We use dw1, dw2 and db with no superscript (i) because they are accumulators over all examples, while dz(i) carries the superscript (i) as it is computed for each example i.
2-All the examples we have given so far have 2 features (n = 2).
3-These are the detailed steps for 1 epoch (pass) of gradient descent, so we repeat them for a specific number of epochs until reaching a minimum.
There are two weakness points in this implementation, What are they?
The two weak points lie in the two for-loops used in the implementation: the loop over the m examples and the loop over the features (not shown above, as we only have 2 features, but in the more realistic case of many features we would have a for-loop over the features: for j = 1 to n: dwj += xj(i) dz(i)). The weakness of using for-loops in a deep learning algorithm is that they take so much run time that they hinder performance, especially in the big data era that offers millions of examples. Here comes the idea of vectorization, to get rid of explicit for-loops.
Vectorization uses numpy functions in Python (or the equivalent built-in vectorized functions in other programming languages). You can notice that the unvectorized version (using for-loops) took roughly 300 times as long as the vectorized one.
Libraries like numpy take advantage of parallelism between the CPU and GPU using SIMD (Single Instruction Multiple Data) instructions, which makes the computation faster.

Rule of thumb: KEEP IN MIND THAT WHENEVER


POSSIBLE, AVOID EXPLICIT FOR-LOOPS.
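As a sketch of what "no explicit for-loops" looks like in practice, here is one fully vectorized gradient-descent loop for logistic regression over all m examples at once (the random data, sizes and learning rate are placeholders):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, m = 3, 1000
X = np.random.randn(n_x, m)                 # each column is one training example
Y = (np.random.rand(1, m) > 0.5) * 1.0
w, b, alpha = np.zeros((n_x, 1)), 0.0, 0.1

for _ in range(100):                        # loop over iterations only
    A = sigmoid(w.T @ X + b)                # (1, m): predictions for all examples at once
    dZ = A - Y                              # (1, m)
    dw = X @ dZ.T / m                       # (n_x, 1): accumulates over examples implicitly
    db = dZ.sum() / m
    w, b = w - alpha * dw, b - alpha * db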
Computational graph
Graph theory
In mathematics, graph theory is the study of graphs,
which are mathematical structures used to model
pairwise relations between objects. A graph in this
context is made up of ordered pair G = (V,E) comprising
a set V of nodes (vertices or points) which are
connected together with set E of edges (arcs or lines). A
graph may be undirected, meaning that there is no
distinction between the two nodes associated with each edge, or its edges may be directed from
one node to another.

Computational graph (AKA data flow graph)


A computational graph is a way to represent a math function in the language of graph theory.
Recall the premise of graph theory: nodes are connected by edges, and everything in the graph
is either a node or an edge. In a computational graph nodes are either input values or functions
for combining these values. Edges receive their weights as the data flows through the graph.
There are two types of nodes:
 Input node: This defines the variables in the original equation to be computed.
 function node: This defines the operations between those variables.
Outbound (outgoing) edges from an input node are weighted with that input value; outbound
nodes from a function node are weighted by combining the weights of the inbound edges using
the specified function. The values that are fed into the nodes and come out of the nodes are
called tensors, which is just a fancy word for a multi-dimensional array.
For example:
For example, consider the relatively simple
expression: f(x, y, x) = (x + y) * z. This is
how we would represent that function as as
computational graph: 
There are three input nodes, labeled X, Y,
and Z. The two other nodes are function
nodes. In a computational graph we generally
compose many simple functions into a more
complex function.
We can do composition in mathematical notation as well, but I hope you'll agree the following isn't as clean as the graph above:
f(x, y, z) = h(g(x, y), z), where g(x, y) = x + y and h(u, z) = u * z
In both of these notations we can compute the answers to each function separately, provided we
do so in the correct order. Before we know the answer to f(x, y, z) first we need the answer to
g(x, y) and then h(g(x, y), z). With the mathematical notation we resolve these dependencies by
computing the deepest parenthetical first; in computational graphs we have to wait until all the
edges pointing into a node have a value before computing the output value for that node.

Let's look at the example of computing f(1, 2, 3): g(1, 2) = 1 + 2 = 3, then h(g(1, 2), 3) = h(3, 3) = 3 * 3 = 9, so f(1, 2, 3) = 9.
And in the graph, we use the output from each node as the weight of the corresponding edge: the edge leaving the addition node carries 3, and the edge leaving the multiplication node carries the final output, 9.
Either way, graph or function notation, we get the same answer because these are just two ways
of expressing the same thing.
In this simple example it might be hard to see the advantage of using a computational graph
over function notation. After all, there isn’t anything terribly hard to understand about the
function f(x, y, z) = (x + y) * z. The advantages become more apparent when we reach the scale
of neural networks.
Even relatively “simple” deep neural networks have hundreds of thousands of nodes and edges;
it’s quite common for a neural network to have more than one million edges. Try to imagine the
function expression for such a computational graph… can you do it? How much paper would
you need to write it all down? This issue of scale is one of the reasons computational graphs are
used.
For backpropagation through computation graphs: It is enough what is said in main course.
For more examples :
[Starting from min 6:00  https://www.youtube.com/watch?v=u2OeYrlAx_A]
Course 1 Week3
Neural networks : (recap)
We have discussed this topic before in machine learning part but we want to add some more
info before continuing.

Representation and notation:


1-This network is said to be 2
layers neural network despite of
having 3 layers as the input layer
is not counted as a layer. This
convention is used even in the
research papers in this field.

2-As said before, output of each


layer is annotated as "a".

3- There will be 3 different scripts to take care of:


 parentheses super script defines the training example number. (x(1)≡training example 1)
 brackets super script defines the layer number. (a[2]≡output of layer 2)
 sub script defines node number. (a4≡output of node 4)
Example: a3[4](7)  This means output of node 3 in layer 4 of the 7th training example.

4-a[0] is the output of input layer which is input features x.

5-Each layer has w (weights) and b (biases) matrices with defined size:
 weights: (number of nodes in the layer × number of nodes in the preceding layer)
 biases: (number of nodes in the layer × 1)
Example: w[2]:matrix of size (number of nodes in layer 2 × number of nodes in layer 1)

6- a[last layer of network] ≡  predicted output of the network.

7- a[2] & W[2] are the output and weights of layer 2; a[2] is a column vector of size (u × 1) and W[2] is of size (u × number of nodes in layer 1), where u is the number of nodes in layer 2.

Note: Every point of the preceding point, go visualize it in the image for better understanding.
Remember:
-Inside each unit (neuron), computation is done in 2 steps:
 compute z
 apply the sigmoid function (or any other activation function)

-These 2 steps in equations (assuming 4 nodes in layer 1) are:
z1[1] = w1[1]T·x + b1[1],  a1[1] = σ(z1[1])
z2[1] = w2[1]T·x + b2[1],  a2[1] = σ(z2[1])
z3[1] = w3[1]T·x + b3[1],  a3[1] = σ(z3[1])
z4[1] = w4[1]T·x + b4[1],  a4[1] = σ(z4[1])

-The equations in vectorized matrix form:
z[1] = W[1]·a[0] + b[1],  a[1] = σ(z[1])
z[2] = W[2]·a[1] + b[2],  a[2] = σ(z[2])

-Matrix sizes (for n_x input features, 4 nodes in layer 1 and 1 node in layer 2):
W[1]: (4, n_x), b[1]: (4, 1), z[1] & a[1]: (4, 1), W[2]: (1, 4), b[2]: (1, 1), z[2] & a[2]: (1, 1)

Note: a[0] is equivalent to x. Take care also that W[i] is the notation for the stacked transposed weight vectors of the neurons in layer i.

Vectorizing across multiple training examples:


We can notice from the image that
instead of for loop across the training
examples, we vectorized all the
equations (similarly to what happens
inside numpy array). Note that as we
go horizontally, we are moving across
different training examples, and as we
go vertically we are going across
different nodes of the layer (in case of
hidden layers or output layer) or
different features (in case of input
layer X).(note that weights w[i] are common across all examples)
Activation functions:
In the forward propagation path, we always used the sigmoid function σ(z) as the activation
function. So we will change the σ(z) to g(z) to express the non-linear activation function
generally. Instead of a= σ(z)  Now generally a=g(z) where g could be any activation function.
Generally, sigmoid function is not the best activation function but in fact it is the simplest and
most basic one so let's see other activation function:

Hyperbolic tangent function (Tanh):
a = tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
This activation function most of the time has better results than using sigmoid as an activation
function due to its zero mean. This makes the learning for next layers easier.
There is an exception that makes using of sigmoid function better than tanh which is in binary
classification so in that case the output layer will have a sigmoid activation function to make the
prediction either 0 or 1 so it can be compared to yactual which is also either 0 or 1.
As we can use different activation function in the same network for different layers so we
assign bracket superscript to the word g  g[i] (z[i]) ≡ activation function of layer i.

Rectified Linear Unit function (ReLU): 


a = max (0,z)
Actually ReLU is a better choice than both sigmoid and tanh, so in case you don't know what to use for your application, just start with ReLU as your activation function; it is the most commonly used one.

Leaky Rectified Linear Unit function (leaky ReLU):


a=max(0.01z,z)
This is supposed to be better than regular ReLU but mostly ReLU is
used in the applications.

Why don't we use linear activation function or even remove the step of activation ?
In fact, if we use a linear function in each layer we will end up having the output as a linear combination of the input features, which is less likely to express the data efficiently due to the non-linearity present in the data itself. You may use a linear activation in the output layer of regression problems, like housing price prediction, where y is a real-valued number ranging from 0 to millions of dollars, but generally this is a rare usage.
Derivatives of activation functions:
For sigmoid: g′(z) = g(z)·(1 − g(z))
For Tanh: g′(z) = 1 − (tanh(z))²
For ReLU: g′(z) = 0 if z < 0, 1 if z > 0 (undefined at z = 0; in practice it is set to 0 or 1)
For leaky ReLU: g′(z) = 0.01 if z < 0, 1 if z > 0
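A compact NumPy sketch of these activations and their derivatives:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z > 0).astype(float)            # derivative taken as 0 at z = 0

def leaky_relu(z):
    return np.maximum(0.01 * z, z)

def leaky_relu_prime(z):
    return np.where(z > 0, 1.0, 0.01)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z), relu_prime(z))               # [0. 0. 2.] [0. 0. 1.]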

Remember: You should initialize weights


by random initialization not zeros to avoid
similar weights in all the following
weights along the network as discussed
before in machine learning.
Activation functions
Our human brain has a huge number of neurons arranged in a hierarchy, a network of neurons interconnected with each other via axons that pass electrical signals from one layer to another across synapses. This is how we humans learn things: whenever we see, hear, feel or think something, an electrical impulse is fired from one neuron to another in the hierarchy, which enables us to learn, remember and memorize things in our daily life since the day we were born.

Activation function (AKA transfer function)


Activation functions are mathematical equations that determine the output of a neural network.
The function is attached to each neuron in the network, and determines whether it should be
activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s
prediction. Activation functions also help normalize the output of each neuron to a range
between 1 and 0 or between -1 and 1. These functions are applied to hidden as well as to the
output layers of a neural network.

Why do we need activation functions in the artificial neural networks?


When we do not have the activation function the weights and bias would simply do a linear
transformation. A linear equation is simple to solve but is limited in its capacity to solve
complex problems and have less power to learn complex functional mappings from data. A
neural network without an activation function is essentially just a linear regression model. The
activation function does the non-linear transformation to the input making it capable to learn
and perform more complex tasks. We would want our neural networks to work on complicated
tasks like language translations and image classifications. Linear transformations would never
be able to perform such tasks. Activation functions make the back-propagation possible since
the gradients are supplied along with the error to update the weights and biases. Without the
differentiable non linear function, this would not be possible.
Also another important feature of a Activation function is that it should be differentiable. We
need it to be this way so as to perform backpropagation optimization strategy while propagating
backwards in the network.

What are the types of activation function?


There are 3 main types of activation functions: 1)Binary step 2)Linear 3)Non-linear
1)Binary step: A binary step function is a threshold-based
activation function. If the input value is above or below a certain
threshold, the neuron is activated and sends exactly the same
signal to the next layer. The problem with a step function is that
it does not allow multi-value outputs—for example, it cannot
support classifying the inputs into one of several categories.
2)Linear: it takes the form: y =linear(x) = cx + b
It takes the inputs, multiplied by the weights for each neuron,
and creates an output signal proportional to the input. In one
sense, a linear function is better than a step function because it
allows multiple outputs, not just yes and no. Output ranges
from -∞ to ∞ unlike non-linear function (will be discussed)
However, a linear activation function has two major problems:
 Not possible to use backpropagation (gradient descent) to train the model—the
derivative of the function is a constant, and has no relation to the input, X. So it’s not
possible to go back and understand which weights in the input neurons can provide a
better prediction.
 All layers of the neural network collapse into one—with linear activation functions, no
matter how many layers in the neural network, the last layer will be a linear function of
the first layer (because a linear combination of linear functions is still a linear function).
So a linear activation function turns the neural network into just one layer.
A neural network with a linear activation function is simply a linear regression model. It has
limited power and ability to handle complexity varying parameters of input data.

3)Non-linear: Modern neural network models use non-linear activation functions.


 They allow backpropagation because they have a derivative function which is related to
the inputs.
 They allow “stacking” of multiple layers of neurons to create a deep neural network.
Multiple hidden layers of neurons are needed to learn complex data sets with high levels
of accuracy.
Non-linear function confines the output between two limits as (0,1) , (-1,1) and so on …

Now let us discover the most 7 popular non-linear activations in deep learning.
Sigmoid (Logistic) function
Equation: g(z) = 1 / (1 + e^(−z)),  Output range: 0→1
Derivative: g′(z) = g(z)·(1 − g(z))
Advantages:
 It is an S-shaped curve which looks like a smoothed step function, but with the benefit of being non-linear, so combinations of this function along a stack of neurons will be non-linear as well.
 We can notice that between -2 and 2 for z the function is so steep meaning that any
change in input will lead to significant change in output.
 Output is bounded between 0 and 1 so normalized and prevent blowing up of outputs.
Disadvantages:
 Vanishing gradient descent problem.
 Outputs are not centered around 0 and it ranges from 0 to 1 thus the values received are
all positive. So not all times would we desire the values going to the next neuron to be all
of the same sign.
 Computationally expensive.
 Takes time to converge (Slow convergence).
Vanishing gradient descent problem:
By taking a look to the gradient (derivative) of the
sigmoid, we would notice that regions other than
the region between -3 and 3 is pretty flat which
means that the gradient is almost zero, which in turn means that the network is not learning, or learning drastically slowly (depending on the value of the gradient), as learning depends mainly on updating the weights and biases, which are updated based on the derivative and the learning rate; so if the derivative is ≈ 0, almost no learning takes place. (w := w − α·dw, so if dw ≈ 0 then w is not updated; similarly with the bias.)
Hyperbolic tangent (Tanh) function
As we notice here that tanh (green) is a scaled modified
version of sigmoid (red) 
Equation: g(z) = tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z)),  Output range: −1→1
Derivative: g′(z) = 1 − (g(z))²
Advantages:
It is similar to sigmoid function with added two more
advantages:
 More steep so more sensitive to changes.
 Zero centered, so it basically solves the problem of all the values being of the same sign, making it easier to model inputs that have strongly negative, neutral, and strongly positive values.
Disadvantages:
 Vanishing gradient descent.
 Computationally expensive.
 Slow convergence (but faster than sigmoid)
Note: tanh(z) = 2·sigmoid(2z) − 1
Rectified linear unit (ReLU) function
The most popular activation function in deep learning.
Equation: g(z) = max(0, z),  Output range: 0→∞
Derivative: g′(z) = 0 for z < 0, 1 for z > 0
Advantages:
 It solves the problem of vanishing gradient descent.
 Computationally efficient as it involves simpler
mathematical operations.
 Faster to converge.
 Sparsity, which means it does not activate all the neurons at the same time. What does this mean? If you look at the ReLU function, when the input is negative it is converted to zero and the neuron does not get activated. This means that at any time only a few neurons are activated, making the network sparse, which makes it efficient and easy to compute and concentrates learning on the activated neurons, so the network learns better. Sigmoid and tanh, in contrast, cause almost all neurons to fire in an analog way (remember?), which means almost all activations will be processed to describe the output of the network. In other words the activation is dense, and this is costly.
Disadvantages:
 Dead neurons problem: if you look at the negative side of the graph of the gradient, the gradient is zero, which means that for activations in that region the gradient is zero and the weights are not updated during back-propagation. This can create dead neurons which never get activated.

ReLU function is used in hidden layers only. Output layer doesn't use it.

Leaky rectified linear unit (Leaky ReLU) function


Equation: g(z) = max (0.01z, z) , Output range: -∞→∞
Derivative: g'(z) = 0.01 for z < 0, and 1 for z > 0
Advantages: Same as ReLU with one added advantage
 Solve the dead neurons problem due to existence of
non zero gradient so neurons recover during training eventually.
Disadvantages:
 Some people have obtained good results with this activation function, but they are not always
consistent for negative-value inputs.
 Slightly slower than ReLU.
Parametric rectified linear unit (PReLU) function
Equation: g(z) = max (az, z) , Output range: -∞→∞
Derivative: g'(z) = a for z < 0, and 1 for z > 0

Advantages: Same as ReLU in addition to:


 "a" here is a trainable parameter. The network also learns the value of ‘a‘ for faster and
more optimum convergence. therefore, possible to perform backpropagation and learn the
most appropriate value of a.
Disadvantages
 May perform differently for different problems.
 Added trainable parameter so more unknowns.

Self-Gated (Swish) function


Equation: g(z) = z × sigmoid(z)
Output range: -∞→∞
Derivative: g'(z) = g(z) + sigmoid(z) (1 - g(z))
Swish is a new, self-gated activation function
discovered by researchers at Google. According to
their paper, it performs better than ReLU with a
similar level of computational efficiency. In
experiments on ImageNet (a well-known dataset)
with identical models running ReLU and Swish, the
new function achieved a top-1 classification accuracy 0.6–0.9% higher.

There is a modified version of swish function as follows:


Equation: g(z) = z × sigmoid(βz), where β is a tunable parameter which can take values other than 0 to help in the learning process.
Note: if β=0 then g(z) is a (scaled) linear function (g(z) = z/2), and if β→∞ the function converges to ReLU. So including β is a way for us to nonlinearly interpolate between the identity and ReLU.

Softmax function
This function will be discussed in details at the end of this chapter.
How to choose the right activation function?
Now that we have seen so many activation functions, we need some logic / heuristics to know
which activation function should be used in which situation. Good or bad – there is no rule of
thumb. However depending upon the properties of the problem we might be able to make a
better choice for easy and quicker convergence of the network.
 Sigmoid functions and their combinations generally work better in the case of classifiers.
 Sigmoids and tanh functions are sometimes avoided due to the vanishing gradient
problem.
 ReLU function is a general activation function and is used in most cases these days.
 If we encounter a case of dead neurons in our networks the leaky ReLU function is the
best choice.
 Sometimes leaky ReLU doesn't help in waking up dead neurons so parametric is used
which can give more stable training if the parameter was learnt well.
 Always keep in mind that ReLU function should only be used in the hidden layers.
 Swish function is still new (released in 2017) but as it gives very good results so it can be
your second choice after ReLU.

Rule of thumb: You can begin with using ReLU function and then move over to other activation
functions in case ReLU doesn’t provide with optimum results.
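To make the formulas above concrete, here is a minimal NumPy sketch of several of these activations and their derivatives (the function names and the sample inputs are just illustrative, not from the course):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)              # g(z)(1 - g(z))

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2      # 1 - g(z)^2

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    return (z > 0).astype(float)    # 0 for z < 0, 1 for z > 0

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

def swish(z):
    return z * sigmoid(z)

z = np.linspace(-5, 5, 11)
print(relu(z), leaky_relu(z), swish(z))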
Course 1 Week4
Deep neural network:
The deepness of a neural network is defined by the number of hidden layers the network has. As the number of hidden layers increases, the network becomes deeper, so better performance is more likely to be expected from this network.

Notations:
1- L defines number of layers of the network.
2-n[ℓ] defines number of nodes in layer ℓ.
Notes:
1-nx denotes number of features and equivalent to n[0].
2-Assume 4 layers NN, then L=4 thus n[L] ≡ n[4] ≡ output layer nodes.

Feed-forward propagation in deep network:

Note that even with the vectorized form we have to use an explicit for loop to loop over the layers from layer 1 → layer L.
Also make sure that your matrix dimensions (sizes) are right.
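As a rough sketch of that loop (assuming parameters are stored in a dict keyed "W1".."WL" and "b1".."bL", with ReLU in the hidden layers and sigmoid at the output; these storage choices are illustrative assumptions, not the course's required layout):

import numpy as np

def forward_propagation(X, parameters, L):
    """X: (n_x, m) inputs; parameters: dict with W1..WL and b1..bL."""
    A = X
    caches = []
    for l in range(1, L + 1):                        # explicit loop over layers 1..L
        W, b = parameters["W" + str(l)], parameters["b" + str(l)]
        Z = W @ A + b                                # Z[l] = W[l] A[l-1] + b[l]
        A = 1 / (1 + np.exp(-Z)) if l == L else np.maximum(0, Z)   # sigmoid at output, ReLU elsewhere
        caches.append((Z, A))
        assert Z.shape == (W.shape[0], X.shape[1])   # check matrix dimensions
    return A, caches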
Summary for forward and backward propagation:

Parameters VS. Hyper-Parameters:


Parameters here means the learnable parameters  the weights and biases (w, b).

Hyper-parameters are parameters defined before the learning process starts, such as:


1-Learning rate α (it is in the update equations of w & b)
2-# iterations aka epochs (How many times you will iterate over your training examples)
3-#hidden layers
4-# nodes (units or neurons) in each hidden layer (n[1],n[2], …)
5-Choice of activation function to be used
6-Momentum terms (discussed later)
7-Mini-batch size (used with mini-batch gradient descent discussed before)
8-Regularization technique (L1,L2, …)
There are more hyper-parameters than those mentioned.
These hyper-parameters, in one way or another, determine what the final values of the weights and biases will be.
Hyper-parameters optimization
Hyper-parameters
In statistics, hyperparameter is a parameter from a prior distribution; it captures the prior belief
before data is observed. In machine learning and deep learning, hyperparameters are settings
that can be tuned to control the behavior of a machine learning algorithm. Conceptually,
hyperparameters can be considered orthogonal to the learning model itself in the sense that,
although they live outside the model, there is a direct relationship between them.

Two basic approaches for selecting hyperparameters


The approaches are: 1)Manual hyperparameter selection 2)Automatic hyperparameter selection
Both approaches are technically viable, and the decision typically represents a tradeoff between the deep understanding of a machine learning model required to select hyperparameters manually versus the high computational cost required by automatic selection algorithms.

1)Manual hyper-parameters selection:


The main objective of manual hyperparameter selection is to tune the effective capacity of a
model to match the complexity of the target task. Imagine that you are training to climb Mount
Everest. During the grueling training process, you want to subject your body to all sorts of
routines so that it can perform on high altitude, low temperature and low barometric pressure
situations. However, you don’t want to push your body to an extreme that it might cause it to
shut down. Similarly, you need to decide to carry enough provisions and tools to use in all sorts
of unexpected situations but you also don’t want to carry too much weight that can affect your
agility on the mountain. In other words, the objective of the training process is to help you
maximize your effective capacity for the task at hand.
Machine learning models also have a notion of effective capacity. In that context, the effective
capacity of a machine learning algorithm is determined by three main factors:
1) The representational capacity of the algorithm or the set of hypotheses that satisfy the
training dataset.
2) The effectiveness of the algorithm to minimize its cost function.
3) The degree to which the cost function and training process minimize the test error.
Sounds confusing? To see how these factors are related, let’s imagine a deep learning algorithm
with many layers and many hidden units. By definition, that type of model has a large
representational capacity because it can easily model complex functions. However, our model
might not be able to learn all those functions based on the constraints of the training set.
Similarly, many of the potential functions might conflict with the model regularization
strategies to minimize the test error. Tuning different hyperparameters is the key to find the
optimal balance between those factors.
2)Automatic Hyperparameter Optimization:
Optimizing hyperparameter manually can be an exhausting endeavor. To address that challenge,
we can use algorithms that automatically infer a potential set of hyperparameters and attempt to
optimize them while hiding that complexity from the developer. Algorithms such as Grid
Search and Random Search have become prominent when it comes to hyperparameter inference
and optimization.

Hyperparameters optimization techniques


There are 3 main techniques used: 1) Grid search 2)Random search 3)Informed search
1)Grid search: (try all combinations)
This approach is not meant to be precise or efficient; it’s a brute force approach. If you have a
set of places in which the perfect combination of parameters can hide, grid search shoots at all
of them and returns the best choice. Let’s assume that I define five different learning rates and
also five different values for the maximum number of features my model is allowed to use.
Then, grid search trains 25 (5 × 5) different models, evaluates all of them and returns the
configuration of hyperparameters that scores highest on my metric of choice. So far so good.
However, as you already expect, the runtime of this approach can be massive. If you want to
check ten different values for three hyperparameters, grid search needs to compute 10 × 10 ×
10, that is 1000(!) models. Let’s add 3-fold cross-validation and we need to train 3000 models!
It also suffers with high-dimensional data, which is the curse of dimensionality discussed before. Although there is no objective threshold for when grid search is no longer viable, the drawbacks of this approach are obvious.
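A brute-force grid search is only a few lines; in this sketch train_and_score is a hypothetical stand-in for "train a model with this hyperparameter combination and return its dev-set score":

import itertools, random

def train_and_score(learning_rate, max_features):
    # stand-in for training a model with these hyperparameters and returning its dev-set score
    random.seed(hash((learning_rate, max_features)) % 10_000)
    return random.random()

learning_rates = [0.3, 0.1, 0.03, 0.01, 0.003]
max_features   = [5, 10, 20, 30, 50]

best_score, best_params = -1.0, None
for lr, mf in itertools.product(learning_rates, max_features):   # 5 x 5 = 25 combinations
    score = train_and_score(lr, mf)
    if score > best_score:
        best_score, best_params = score, (lr, mf)
print(best_params, best_score)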

2)Random search: (try as many combination as possible)


This approach randomly selects a set of combinations and tests them. There are two main
advantages compared to grid search:
 I can define the number of iterations, that is the number of possible combinations that the
search algorithm tests. By doing so, I can decide on the computational budget I have
available. Maybe it’s small because the next presentation of results is close. Maybe my
budget is significant because I leave for the weekend and can keep my machine running
in the meantime. The point is, I am in control now.
 Since the selection of combinations is random, I can use distributions instead of a fixed
set of values. For grid search, I need to define a distinct set of maximum features, e.g. [1,
10, 30, 50]. In randomized search, I can define that the maximum number of features has
to be between 1 and 50 with equal probability for each integer in between.
Both points improve on grid search. However, let’s address the elephant in the room: if the
selection of combinations is random, how likely is it that I find the best model (or at least
something reasonably close to it)? Fortunately, the answer is: very likely. To be clear: if my
grid includes 1000 combinations and I randomly select only 10 of them, I’m in trouble.
However, a research paper showed, that if 5% of combinations in the grid lead to an optimal
result, 60 iterations of randomized search are enough to be more than 95% sure to find one of
these points.
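Random search only changes how the combinations are generated: sample each hyperparameter from a range or distribution for a fixed budget of iterations (this reuses the hypothetical train_and_score stand-in from the grid-search sketch above):

import random

best_score, best_params = -1.0, None
for _ in range(60):                                   # computational budget I control
    lr = 10 ** random.uniform(-4, 0)                  # log-uniform sample between 1e-4 and 1
    mf = random.randint(1, 50)                        # any integer between 1 and 50, equal probability
    score = train_and_score(lr, mf)
    if score > best_score:
        best_score, best_params = score, (lr, mf)
print(best_params, best_score)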

3)Informed search: (Try the most promising combinations)


This time, however, the algorithm won’t test all the combinations (grid search) or pick some of
them by chance (randomized search), but repeatedly reevaluate where to look next. let us
discuss Bayesian optimization which is one of the popular informed search algorithms.
In Bayesian optimization the objective is often treated as a "black-box" function, because it cannot be written as a formula and the derivatives of the function are unknown.
To understand the idea, follow me through a basic example. Let’s assume that I calculated the
first model with a random selection of hyperparameter values. Its accuracy is 67%. Now I
calculate a model with another set of hyperparameters and it produces an accuracy of only 64%.
A Bayesian approach decreases the probability that the values chosen for the second model are
part of the optimal solution. Now it uses the updated probabilities to select a new set of values
for each hyperparameter, see whether it increased or decreased the model’s quality and update
the probabilities. In other words, the algorithm more likely chooses values for the next round
that are related to a higher model performance than their less effective alternatives.
Focusing on the concept of updated probabilities is essential. Promising values are more likely
to be chosen, but there is still some randomness left. This procedure makes sure that the
algorithm stays open to new possibilities.
Course 2 Week1
Datasets / Bias-Variance / Regularization: (recap)
The dataset is split into 3 parts: Training set (60%) / Development set (20%) / Test set (20%).
But in the big-data era, dev and test sets take a smaller share than that: datasets are now in the range of millions of examples, so we don't need that many examples in the dev and test sets, and we increase the portion of the training set.

Bias-Variance tradeoff:
High bias is called underfitting and
High variance is called overfitting.

We say that we have high bias (underfitting) if the error in training set is much higher than the
human-level performance error and we say we have high variance (overfitting) if the
development set error is much higher than training error.
Assume we have human level performance error (AKA Bayes error) ≈ 0% . Therefore, ↓

Remember: Underfitting means your model is so bad to describe the data while overfitting
means that your model is so good on the data that it can't generalize on unseen data. Both of
them are not required to be found in your model.

Although there is a trade-off between bias and variance, sometimes it can happen that your model suffers from both, as shown in this figure 

Basic recipe for learning:
High bias? • Bigger network • Train longer
High variance? • More data • Regularization
Regularization:
It is used for reducing overfitting by adding regularization term either L1 or L2 (L2 is most
common)
For logistic regression: J(w,b) = (1/m) Σi L(ŷ(i), y(i)) + (λ/(2m)) ||w||2²
Note: We discard the regularization term for the bias and just keep it for the weights.
λ → regularization parameter.

For neural networks generally: J(w[1],b[1],…,w[L],b[L]) = (1/m) Σi L(ŷ(i), y(i)) + (λ/(2m)) Σl ||w[l]||F²
where ||w[l]||F² = Σi Σj (wij[l])²  This is called the Frobenius norm (not the L2 norm), noting that w[l] is a matrix of dimension n[l-1] × n[l].

Regularization also appears in the update equation, not only in the cost function: w[l] := w[l] - α dw[l], where dw[l] = (term from backprop) + (λ/m) w[l]. This is why L2 regularization is also called "weight decay".

Remember:
1-dw[l] ≡ ∂J/∂w[l] (Always check notations)
2-By increasing λ, you are decreasing the weights (≈0), so your network becomes simpler and closer to a linear network, which increases the bias and decreases the variance.
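A small sketch of how the Frobenius-norm term enters both the cost and the update (the variable names, and treating the weights as a plain list of matrices, are illustrative assumptions):

import numpy as np

def l2_cost(cross_entropy_cost, weights, lambd, m):
    # J = (1/m) * sum of losses  +  (lambda / 2m) * sum_l ||W[l]||_F^2
    l2_term = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2_term

def l2_weight_update(W, dW_from_backprop, alpha, lambd, m):
    dW = dW_from_backprop + (lambd / m) * W      # extra (lambda/m) W from the regularization term
    return W - alpha * dW                        # "weight decay": shrinks W a little every step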

Dropout regularization:
Dropout is a famous regularization technique used with neural networks; it works by eliminating some neurons from each layer and keeping the others. Removing nodes means by default removing all ingoing and outgoing links from those nodes, so you end up with a much smaller, diminished, simpler network that is less likely to overfit. The set of nodes to be eliminated changes with each example after computing the backpropagation.
Removing some nodes also helps the kept nodes learn more, especially those which weren't participating enough in the learning process.
Keep probability (p) : It is the parameter that decides the percentage of neurons to be kept so if
keep probability of layer 2 of network is 0.8 and this layer has 10 neurons then 8 of these
neurons will be kept and only 2 will be eliminated from the layer.

There are different techniques to apply dropout, such as inverted dropout, which is the most common one. It is similar to the original dropout explained above, so it keeps some activations and sets others to zero. The one difference is that, during the training of a neural network, inverted dropout scales the activations by the inverse of the keep probability, q = 1/p. This makes testing simpler and faster, since no rescaling is needed at test time.
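A minimal sketch of inverted dropout applied to one layer's activations during training; dividing by keep_prob is the "inverted" scaling, so nothing needs to change at test time (shapes and names are illustrative):

import numpy as np

def inverted_dropout(A, keep_prob=0.8):
    # A: activations of one layer, shape (units, m)
    mask = (np.random.rand(*A.shape) < keep_prob)   # 1 with probability keep_prob, else 0
    A = A * mask                                    # shut down the dropped neurons
    A = A / keep_prob                               # scale up so the expected value of A is unchanged
    return A, mask                                  # the mask is reused in backprop for dA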

Note: For layers having more neurons so weight matrix is bigger so by default more likely to
overfit so we give those layers with high number of neurons a lower keep probability to reduce
overfitting caused by those layers.

One of the drawbacks of dropout is that it makes the cost function J no longer well-defined, since on each iteration we have removed a bunch of neurons, so J can't be plotted against the iterations to check something useful. The solution is to first not apply any dropout (keep probability = 1); then, after checking that J is monotonically decreasing (i.e. your algorithm is learning well), retrain again using dropout.

Other techniques of regularization:


 Data augmentation can be used by applying flipping, inverting, rotating, scaling,
translating or changing lighting condition (This topic is discussed and figures shown in
skewed classes part in machine learning part)
 Early stopping means to stop learning process at a
specific iteration (epoch) where the point we stop
is the place where training error and validation
error start to move away as shown in the figure 
Early stopping has a drawback: because it stops early to prevent overfitting, we don't get the best optimized cost function, since we stop gradient descent early. So it couples two tasks: optimizing your cost function (using gradient descent or any other optimization algorithm) and preventing overfitting. This is related to the idea of orthogonalization, which will be discussed in detail later.

Remember: Don't forget to normalize or standardize your input features (normalize the training set) to make your data of similar scales.
Vanishing/Exploding gradient descent:
These problems occur with very deep neural networks. This means that when training deep
network, the derivatives or the slopes becomes either so big or so small (exponentially small).

To illustrate this, assume we have a network as shown, with a lot of hidden layers (a deep network), and for simplicity assume we have a linear activation function (g(z) = z) and bias = 0. So the output y now equals the product of all the weight matrices with the input x: y = w[L] w[L-1] … w[2] w[1] x.

If we assume that each w[l] is initialized as [[1.5, 0], [0, 1.5]] (except the last layer, as it has different dimensions), then the output y will be related to the product of all the w's; in other words y will be on the order of 1.5^L, so it explodes as more hidden layers L are added. On the other hand, assuming w[l] = [[0.5, 0], [0, 0.5]], y will be on the order of 0.5^L, so it vanishes with an increasing number of hidden layers L.

A solution that helps a lot with these problems is carefully choosing the initialization of the weights. We discussed this topic before and said to scale the random weights by √(1/ni), where ni is the number of neurons in the previous layer (inputs to this neuron), or use Xavier initialization discussed before.
Note: It is found that if the activation function is ReLU, it's better to multiply the weights by √(2/ni) instead of √(1/ni).
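A sketch of this scaled initialization (Xavier-style √(1/n) for tanh-like layers, √(2/n) for ReLU), where n_prev is assumed to be the number of units feeding into the layer:

import numpy as np

def init_layer_weights(n_prev, n_curr, activation="relu"):
    scale = np.sqrt(2.0 / n_prev) if activation == "relu" else np.sqrt(1.0 / n_prev)
    W = np.random.randn(n_curr, n_prev) * scale   # keeps the variance of z roughly constant across layers
    b = np.zeros((n_curr, 1))
    return W, b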

Gradient checking:
Before getting into gradient checking, we want to talk about something.
Numerical approximation of derivative computation:
The derivative is computed, as discussed before, by taking the height over the width of the triangle drawn between θ and θ+ε, where ε is a small number. But by experimenting we discover that taking a bigger triangle, between θ-ε and θ+ε, gives a numerically better approximation and better results (without getting into the calculus details).
Then (J(θ+ε) - J(θ-ε)) / (2ε) is a better approximation of the derivative than (J(θ+ε) - J(θ)) / ε.
This new definition of the derivative will be used in gradient checking.
Note: usually use ε = 10^-7

Gradient checking is a debugging method to verify that your backpropagation computation is


correct.
Gradient checking algorithm:
We have W's and b's (W[1],b[1],… , W[L],b[L]) so we gonna reshape them into big vector called θ
such that each matrix w is reshaped into a vector then concatenate them all (W's and b's) to a
vector called θ. Now instead of having J(W,b) we have J(θ).
Similarly with dW[1],db[1],… we going to reshape them and concatenate to form big vector
called dθ.

Now ask yourself: is dθ the gradient (slope) of J(θ)?

We will apply the two-sided derivative rule to each component θi of θ: dθapprox[i] = (J(θ1,…,θi+ε,…) - J(θ1,…,θi-ε,…)) / (2ε), noting that J(θ) = J(θ1,θ2,…). So if the dθapprox computed from the derivative rule is close to the dθ we computed before, our backpropagation was done right. But the question is: how do we know that dθapprox is close to dθ?
The check is ||dθapprox - dθ||2 / (||dθapprox||2 + ||dθ||2), so if the result is in the range of 10^-6 or 10^-7 it is OK; if it is bigger, it is more likely there is a bug somewhere in computing the gradients.

Notes:
1-Include regularization term in your check.
2-Grad check doesn't work with dropout.
3-Don't use it across your training but just once for debugging.
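A sketch of the check itself, assuming cost_fn(theta) returns J for a flattened parameter vector and grad is the flattened dθ produced by your backprop (both are hypothetical inputs here):

import numpy as np

def gradient_check(cost_fn, theta, grad, eps=1e-7):
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):                       # perturb one parameter at a time
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        grad_approx[i] = (cost_fn(plus) - cost_fn(minus)) / (2 * eps)   # two-sided difference
    diff = np.linalg.norm(grad_approx - grad) / (np.linalg.norm(grad_approx) + np.linalg.norm(grad))
    return diff   # ~1e-7 is great, ~1e-5 is suspicious, ~1e-3 almost certainly means a bug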
Dropout, Early stopping & Grad checking

Deep learning neural networks are likely to quickly overfit a training dataset with few
examples. Ensembles of neural networks with different model configurations are known to
reduce overfitting, but require the additional computational expense of training and maintaining
multiple models. Dropout can help in this problem by offering a very computationally cheap
and remarkably effective regularization method.
Dropout
Dropout is a regularization technique used in neural networks where the word "dropout" refers
to dropping out some neurons during the training of the neural network. They are “dropped-out”
randomly. This means that their contribution to the activation of downstream neurons is
temporarily removed on the forward pass and any weight updates are not applied to the neuron
on the backward pass. You can imagine that if neurons are randomly dropped out of the
network during training, that other neurons will have to step in and handle the representation
required to make predictions for the missing neurons. This is believed to result in multiple
independent internal representations being learned by the network. The effect is that the
network becomes less sensitive to the specific weights of neurons. This in turn results in a
network that is capable of better generalization and is less likely to overfit the training data.
This conceptualization suggests that perhaps dropout breaks-up situations where network layers
co-adapt to correct mistakes from prior layers, in turn making the model more robust.
Notes:
1-We found that as a side-effect of doing dropout, the activations of the hidden units become
sparse, even when no sparsity inducing regularizers are present.
2-Dropout can be applied to input layer and hidden layers but not output layer of course. 

Keep probability (p)


A new hyperparameter is introduced that specifies the probability at which outputs of the layer
are dropped out, or inversely, the probability at which outputs of the layer are retained. The
interpretation is an implementation detail that can differ from paper to code library. But we will
stick to the 2nd definition (probability of neuron being retained).
Note: A common value is a probability of 0.5 for retaining the output of each node in a hidden
layer and a value close to 1.0, such as 0.8, for retaining inputs from the visible layer.

Inverted dropout
Dropout is not used after training when making a prediction with the fit network. The weights
of the network will be larger than normal because of dropout. Therefore, before finalizing the
network, the weights are first scaled by the chosen dropout rate. The network can then be used
as per normal to make predictions  "If a unit is retained with probability p during training, the
outgoing weights of that unit are multiplied by p at test time."
The rescaling of the weights can be performed at training time instead, after each weight update
at the end of the mini-batch. This is sometimes called “inverse dropout” and does not require
any modification of weights during training.

Some notes on dropout


 Dropout roughly doubles the number of iterations required to converge. However,
training time for each epoch is less.
 With H hidden units, each of which can be dropped, we have 2^H possible models.
 Use dropout with larger networks as they are more likely to overfit.
 Use dropout with smaller dataset as they are more likely to overfit.
 Use searching methods discussed before with keep probability hyperparameter to reach
the best value.
 Dropout is not likely to be used now with recent convolutional neural networks
architectures and is replaced by batch normalization (discussed at the end of this chapter)
which has better regularization effect with CNNs.

Early stopping
A major challenge in training neural networks is how long to train them. Too little training will
mean that the model will underfit the train and the test sets. Too much training will mean that
the model will overfit the training dataset and have poor performance on the test set.
A compromise is to train on the training dataset but to stop training at the point when
performance on a validation dataset starts to degrade. This simple, effective, and widely used
approach to training neural networks is called early stopping.
During training, the model is evaluated on a holdout validation dataset after each epoch. If the
performance of the model on the validation dataset starts to degrade (e.g. loss begins to increase
or accuracy begins to decrease), then the training process is stopped. The model at the time that
training is stopped is then used and is known to have good generalization performance.
This procedure is called “early stopping” and is perhaps one of the oldest and most widely used
forms of neural network regularization.

If there is an inflection point where the training curve keeps improving while the validation curve starts to degrade, you might be able to use early stopping (stop training).
Gradient checking
There is nothing more to say on this topic beyond the main course, but to put it briefly, gradient checking is a way of debugging your back-prop algorithm. Gradient checking basically carries out the derivative checking procedure. But the question is: why implement a gradient check? Isn't finding that J(θ) decreases with time enough?

The answer is NO, Back prop as an algorithm has a lot of details and can be a little bit tricky to
implement. And one unfortunate property is that there are many ways to have subtle bugs in
back prop. So that if you run it with gradient descent or some other optimizational algorithm, it
could actually look like it's working. And your cost function, J of theta may end up decreasing
on every iteration of gradient descent. But this could prove true even though there might be
some bug in your implementation of back prop. So that it looks J of theta is decreasing, but you
might just wind up with a neural network that has a higher level of error than you would with a
bug free implementation. And you might just not know that there was this subtle bug that was
giving you worse performance. So, what can we do about this? There's an idea called gradient
checking that eliminates almost all of these problems. We describe a method for numerically
checking the derivatives computed by your code to make sure that your implementation is
correct. Carrying out the derivative checking procedure significantly increase your confidence
in the correctness of your code.
Course 2 Week2
Mini-batch gradient descent: (recap)
When you try to train on a large dataset, especially in the big data era, training is just slow and memory consuming. Vectorization, as said before, allows you to efficiently compute on m examples without
the need of for loops but the problem arises when data are so large which in turns make the
memory can't have all this amount of data so here comes the role of mini-batch gradient descent
to solve what batch gradient descent can't. Mini-batch gradient descent depends on splitting
your data into portions (mini-batches) and then apply the gradient descent consequently on the
mini-batches.
Notations: Assume we have 100,000 examples in the training set (x(1)→x(100,000)), so we are going to split this data into 100 mini-batches such that each mini-batch has 1,000 training examples. The notation of the 1st mini-batch, for example, is x{1} ≡ [x(1), x(2), …, x(1000)];
generally, mini-batch t is x{t} ≡ [x((t-1)·1000+1), …, x(t·1000)].

Notes:
1-Mini-batch size= total data / number of mini-batches
2-Mini-batch gradient descent needs for loops to loop over the mini-batches so vectorization is
applied in each loop not on whole data.
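A minimal sketch of building the mini-batches and looping over them (shuffling first is common practice; the commented-out helpers are hypothetical):

import numpy as np

def make_mini_batches(X, Y, batch_size=1000):
    # X: (n_x, m), Y: (1, m)
    m = X.shape[1]
    perm = np.random.permutation(m)                  # shuffle before splitting
    X, Y = X[:, perm], Y[:, perm]
    batches = []
    for t in range(0, m, batch_size):
        batches.append((X[:, t:t + batch_size], Y[:, t:t + batch_size]))   # x{t}, y{t}
    return batches

# one epoch = one pass over all mini-batches, with a gradient step per mini-batch:
# for X_batch, Y_batch in make_mini_batches(X_train, Y_train):
#     grads = forward_and_backward(X_batch, Y_batch)       # hypothetical helper
#     parameters = update_parameters(parameters, grads)    # hypothetical helper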

Algorithm

Cost along iterations


Exponentially weighted averages:
Assume temperatures in London along the year was distributed as follows:

This data looks a little bit noisy so we want now to compute local average or moving average of
this data so this can done as follows:
v0=0 , v1=0.9v0 + 0.1θ1 , v2=0.9v1 + 0.1θ2 , …
So what is happening here is that we compute the temperature of the day based on the past several days' temperatures, to smoothen our data and reduce noise.
Equation: vt = β vt-1 + (1-β) θt
vt can now be thought of as approximately averaging over the last 1/(1-β) days' temperature. So we can say that v100, for example, is actually a mix between the real data at day 100 and the data of all the preceding 99 days, where closer days have a higher effect, so day 99's temperature has a higher effect than day 98's, which has a higher effect than day 97's, and so on, which by default means a higher coefficient (an exponential decrease in effect).

For our given examples if β is 0.9 then we are computing


average over last 10 days (red), if β is 0.98 then we are
computing over last 50 days (green) and if β is 0.5 then
we are averaging over last 2 days(yellow). Take a look at
the figure 
As β decreases, you have a much shorter averaging window, so the result is much noisier and much more susceptible to outliers, but it adapts more quickly to temperature changes (data changes).

Notes:
1-It is called exponential because, if 1-β = ε, then (1-ε)^(1/ε) ≈ 1/e, so the weights decrease exponentially and fall to about 1/e after roughly 1/(1-β) days.
2-We sometimes use vθ to denote that this v is averaging over θ.

Algorithm

These weighted averages are not so accurate at the beginning, so we will use a technique called bias correction to give a more accurate computation of these averages.

Bias correction
The problem is that at the start, as we initialize v0 = 0, the first values of v will be very low, as shown in the figure (the difference between the green (expected) and purple (actual) curves): v1 = 0.98 v0 + 0.02 θ1, but as v0 = 0, v1 = 0.02 θ1, so if θ1 is 40 degrees, v1 will only equal 0.8 degrees !!! Here comes the role of bias correction.

The correction is that instead of using vt = β vt-1 + (1-β) θt directly, we report vt / (1 - β^t), with vt still computed by the same recursion.

So now for the first steps of the average t is small, so 1-β^t has a reasonably small value that scales up the value of vt, and for higher t, 1-β^t becomes close to 1, as we don't need this correction after the initial phase.
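A small sketch of the running average with and without bias correction, using a short list of daily temperatures as the θ values (the numbers are just illustrative):

def ewa(thetas, beta=0.9, correct_bias=True):
    v, out = 0.0, []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta                        # v_t = beta*v_{t-1} + (1-beta)*theta_t
        out.append(v / (1 - beta ** t) if correct_bias else v)   # divide by (1 - beta^t)
    return out

print(ewa([40, 42, 41, 39], beta=0.98))   # without correction the first values would be far too small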

Note: The exponentially weighted average is effective; a plain moving average, by contrast, is not very easy to compute, as it involves holding the past values in a memory buffer and constantly updating the buffer whenever a new observation is read, whereas the exponentially weighted average keeps only a single running value.

[http://www.ashukumar27.io/exponentially-weighted-average/]
The previous link provides the same topic with the same example, so in case you don't
understand the example by my style of writing, you can try this link for a different way of
explanation of the same example.
Optimization algorithms:
Gradient descent is not the best optimization algorithm for all problems so there are several
optimization algorithms that gives better results than normal gradient descent.

Gradient descent with momentum


This algorithm almost always works faster than the standard gradient descent algorithm. The
basic idea here is to compute an exponentially weighted average of gradients then use them in
updating weights and biases.

Assume we have the contour plot shown, with our optimum at the red point. When starting gradient descent we get the oscillations shown, so we can't use a higher learning rate for faster learning without risking divergence.
Another point of view is that we want to
prevent overshooting in vertical axis (↨) and
at same time we want bigger steps on the
horizontal axis (↔) for faster learning. To apply this idea we gonna use grad descent with
momentum which relies mainly on the idea of exponentially weighted averages.

Algorithm
vdw=0 , vdb=0
On iterations:
compute dw, db
vdw = β vdw + (1-β) dw
vdb = β vdb + (1-β) db
w := w - α vdw
b := b - α vdb

This algorithm helps in damping out oscillations and smoothing the steps towards the minima
as shown with red steps so this will lead to faster movement in the horizontal direction and
more straight forward path towards the minima.
The most common value for the hyper-parameter β is 0.9, which averages over roughly the last 10 iterations. In practice, bias correction is not implemented here.

Some references use vdw= β vdw + dw instead of β vdw + (1-β) dw by omitting (1-β) but in fact
both are similar and that will only affect the α chosen but the one with (1-β) is more preferred
intuitively.
Root mean squared propagation (RMSprop)
This algorithm is also used to speed up the optimization process. Similarly, we want to dampen the oscillation in the vertical direction and speed up movement in the horizontal direction.

Algorithm
On iteration:
compute dw, db
Sdw = β2 Sdw + (1-β2) dw2 (squaring here is element-wise)
Sdb = β2 Sdb + (1-β2) db2
w := w - α dw / (√(Sdw) + ϵ)   (we add ϵ to ensure that no division by zero occurs)

b := b - α db / (√(Sdb) + ϵ)

ADAptive Moment estimation (ADAM) optimization


This is a mix between momentum and RMSprop, which turns out to perform better than each of them separately.
Note: In ADAM we need bias correction for momentum.

Algorithm
vdw=0 , vdb=0 ,Sdw=0, Sdb=0
On iteration:
Compute dw , db
vdw = β vdw + (1-β) dw (sometimes we use notation β1 instead of β)
vdb = β vdb + (1-β) db
Sdw = β2 Sdw + (1-β2) dw2
Sdb = β2 Sdb + (1-β2) db2
vdwcorrected = vdw / (1 - β^t)
vdbcorrected = vdb / (1 - β^t)
Sdwcorrected = Sdw / (1 - β2^t)
Sdbcorrected = Sdb / (1 - β2^t)
w := w - α vdwcorrected / (√(Sdwcorrected) + ϵ)
b := b - α vdbcorrected / (√(Sdbcorrected) + ϵ)

Note: β usually equals 0.9, β2 usually equals 0.999 and ϵ equals 10^-8.
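A sketch of a single ADAM update for one parameter matrix, using the values mentioned above for β1, β2 and ϵ (the learning rate default of 0.001 is a common choice, not from these notes):

import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dw            # momentum term (1st moment)
    s = beta2 * s + (1 - beta2) * dw ** 2       # RMSprop term (2nd moment)
    v_hat = v / (1 - beta1 ** t)                # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s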
Learning rate decay: (recap)
Learning rate decay
We have talked about this idea of having a
large learning rate (α) at the beginning of
learning process (at the first epochs) then start
to decay it with each new epoch so that we
start with big steps towards local minima and
then by time we decrease those steps as we go
closer to convergence.

The new info to add here is what relation to follow when reducing α:
α = α0 / (1 + decay_rate × epoch_number)
where α0 is an initial learning rate to start with, while the decay rate is a hyper-parameter to tune.

Other methods of decaying:
1-Exponential decay: α = n^(epoch number) × α0, where n is a number less than 1 (e.g. 0.95).
2-α = (k / √(epoch number)) × α0, where k is a constant.
3-Discrete steps as explained before.
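A small sketch of these decay schedules (α0, the decay rate, the base n = 0.95 and the constant k are illustrative values, and method 2 follows the reconstruction above):

import math

def lr_inverse_decay(epoch, alpha0=0.2, decay_rate=1.0):
    return alpha0 / (1 + decay_rate * epoch)          # alpha = alpha0 / (1 + decay_rate * epoch)

def lr_exponential_decay(epoch, alpha0=0.2, base=0.95):
    return (base ** epoch) * alpha0                   # alpha = n^epoch * alpha0, with n < 1

def lr_sqrt_decay(epoch, alpha0=0.2, k=1.0):
    return (k / math.sqrt(max(epoch, 1))) * alpha0    # alpha = (k / sqrt(epoch)) * alpha0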

Problem of local optima


This problem simply means that when you are trying to optimize some weights to find the optimum of the curve, the search can get stuck in a local optimum rather than the global one, as shown in the figure:

Note: optima generally are points where the derivative equals 0.
Optimization algorithms
This topic is one of the hardest topics ever; it is a field by itself, and big companies like Google, Facebook, Amazon, etc. hire programmers specifically for their expertise in optimization algorithms, so we can't cover everything in this topic, but we will cover as much as possible.
Optimization algorithms
An optimization algorithm is a procedure which is executed iteratively, comparing various solutions until an optimum or a satisfactory solution is found. Optimization algorithms differ according to the type of problem, but
we will stick to the most relevant to neural networks. It starts with defining some kind of loss
function/cost function or generally optimization objective and ends with minimizing (or
maximizing) it using one or the other optimization routine. The choice of optimization
algorithm can make a difference between getting a good accuracy in hours or days.
Types of optimization algorithms
Optimization algorithms in neural networks can be categorized into 3 types :
1)Gradient-based methods using first order derivative as batch and stochastic gradient descent.
The First order derivative (gradient) tells us whether the function is decreasing or increasing at a
particular point. First order Derivative basically give us a line which is Tangential to a point on
its Error Surface.
2)Gradient-based methods using second order derivative (Hessian) as Newton method,
conjugate gradient, scaled conjugate gradient. Since the second derivative is costly to compute,
the second order is not used much .The second order derivative tells us whether the first
derivative is increasing or decreasing which hints at the function’s curvature. Second Order
Derivative provide us with a quadratic surface which touches the curvature of the Error Surface.
3)Search-based methods as genetic algorithms, simulated annealing, …etc. These techniques
usually don't require the function being optimized to be differentiable, they try to find a solution
by sampling from a probability distribution. It is also not used too much in machine learning.
Notes:
1-A derivative is simply defined for a function dependent on single variables , whereas a
Gradient is defined for function dependent on multiple variables.
2-This categorization is not general but popular categorization in machine learning field.

Is second-order gradient optimization better than first-order one?


Not usually. Although the second-order derivative may be a bit costly to find and calculate, the advantage of a second-order optimization technique is that it does not neglect or ignore the curvature of the surface. Secondly, in terms of step-wise performance they are better. First-order optimization techniques, however, are easy to compute and less time consuming, converging pretty fast on large data sets. Second-order techniques are faster only when the second-order derivative is known (or cheap to obtain); otherwise, these methods are always slower and costly to compute in terms of both time and memory.
Batch gradient descent (BGD), Stochastic gradient descent (SGD) & mini-batch gradient
descent were explained in details in machine learning part and a comparison was held between
them so here we will start with the optimized versions of them which enhance the performance
of those normal methods.

Types of gradient-based methods


This is an evolutionary map of how these optimizers evolved from the simple vanilla stochastic gradient descent (SGD), down to the variants of Adam. SGD initially branched out into two main types of optimizers: those which act on (i) the learning rate component, through AdaGrad, and (ii) the gradient component, through momentum. Down the generation line, we see the birth of Adam, which is a combination of momentum and RMSprop, a successor of AdaGrad. This is not an official image, but it is good for visualizing the categories.

Gradient descent with momentum


The high variance oscillations in SGD makes it hard to reach convergence , so a technique
called Momentum was invented which accelerates SGD by navigating along the relevant
direction and softens the oscillations in irrelevant directions. Instead of depending only on the
current gradient to update the weight, gradient descent with momentum replaces the current
gradient with vt (which stands for velocity), the exponential moving average of current and past
gradients (i.e. up to time t).

Equation: Vt = β Vt-1 + (1-β) ∂L/∂wt , with V0 = 0, and the weight is updated as wt+1 = wt - α Vt

Here the momentum is same as the momentum in classical physics , as we throw a ball down a
hill it gathers momentum and its velocity keeps on increasing. The same thing happens with our
parameter updates :
 It leads to faster and stable convergence.
 Reduced Oscillations
Remember: the β term weights the parameter updates towards the relevant (previous) derivatives.
This reduces unnecessary parameter updates.
Nesterov accelerated gradient descent (NAG)
a slightly different version of the momentum update that has recently been gaining popularity.
A researcher named Yurii Nesterov saw a problem with momentum: a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We'd like to have a smarter ball, a ball that has a notion of where it is going, so that it knows to slow down before the hill slopes up again. What actually happens is that as we reach the minimum, i.e. the lowest point on the curve, the momentum is pretty high and it doesn't know to slow down at that point, which could cause it to miss the minimum entirely and continue moving up. In the method he suggested, we first make a big jump based on our previous momentum, then calculate the gradient and then make a correction, which results in a parameter update. This anticipatory update prevents us from going too fast and missing the minimum, and makes the method more responsive to changes.

Equation: Vt = β Vt-1 + (1-β) ∂L/∂w* , where w* = wt - β Vt-1 and V0 = 0, and the weight is updated as wt+1 = wt - α Vt

The gradient evaluated at w*, ∂L/∂w*, is called the projected gradient. This means that for this time step t, we have to carry out another forward propagation before we can finally execute the backpropagation. Here's how it is computed:
 Update the current weight wt to a projected weight w* using the previous velocity: w* = wt - β Vt-1.
 Carry out forward propagation, but using this projected weight.
 Obtain the projected gradient ∂L/∂w*.
 Compute Vt and wt+1 accordingly.
Now we are able to adapt our updates to the slope of our error function and speed up SGD in turn.
In other words: I start with an initialized w and V initialized to 0, so do I just proceed with my w? No: first compute w*, and then do forward propagation (of course, on the very first step w = w*, since V = 0). After getting w* and running the forward pass, go back and compute the gradient in backpropagation, but compute it with respect to w*. Since this is momentum and not plain gradient descent, I don't take that gradient and use it directly in the update equation; instead I first compute Vt and then compute wt+1. Now I have wt+1 and this step's velocity (which becomes Vt-1 for the next step), so I again compute w*, do a forward pass, and so on …
Visual
Why momentum works better than classic stochastic gradient descent
 With Stochastic Gradient Descent we don't compute the exact derivative of our loss
function. Instead, we’re estimating it on a small batch. Which means we’re not always
going in the optimal direction, because our derivatives are ‘noisy’. Just like in my graphs
above. So, exponentially weighed averages can provide us a better estimate which is
closer to the actual derivative than our noisy calculations. This is one reason why
momentum might work better than classic SGD.
 The other reason lies in ravines. A ravine is an area where the surface curves much more steeply in one dimension than in another. Ravines are common near local minima in deep learning, and SGD has trouble navigating them.
the narrow ravine since the negative gradient will point down one of the steep sides rather
than along the ravine towards the optimum. Momentum helps accelerate gradients in the
right direction. This is expressed in the following pictures:

AdaGrad
Adaptive gradient works on the learning rate component by dividing the learning rate by the
square root of S, which is the cumulative sum of current and past squared gradients (i.e. up to
time t). Note that the gradient component remains unchanged like in SGD. It simply allows the learning rate α to adapt based on the parameters. So it makes big updates for infrequent
parameters and small updates for frequent parameters. For this reason, it is well-suited for
dealing with sparse data. In other words we can say that Adagrad modifies the general learning
rate α at each time step t for every parameter θ(i) based on the past gradients that have been
computed for θ(i).

Equation: St = St-1 + (∂L/∂wt)² , with S0 = 0, and the weight is updated as wt+1 = wt - (α / √(St + ϵ)) ∂L/∂wt

Note: AdaGrad uses a different learning rate for every parameter w at every time step t, based on the past gradients that were computed for that parameter; this per-parameter update can then be vectorized over all parameters.
How does squaring the gradients affect the learning rate, and what is the disadvantage?
For parameters with high gradient values, the squared term will be large, and hence dividing by a large term makes the gradient accelerate slowly in that direction. Similarly, parameters with low gradients will produce smaller squared terms, and hence the gradient will accelerate faster in that direction. AdaGrad's main weakness is that its learning rate is always decreasing and decaying. This happens due to the accumulation of the squared gradients in the denominator, since every added term is positive. The accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become so small that the model just stops learning entirely and stops acquiring new knowledge: as the learning rate gets smaller and smaller, the model's ability to learn quickly decreases, which gives very slow convergence and makes training take very long, i.e. learning speed suffers.
This problem of a decaying learning rate is rectified in another algorithm called AdaDelta.

RMSProp
Root mean square prop or RMSprop is another adaptive learning rate that is an improvement of
AdaGrad. Instead of taking the cumulative sum of squared gradients, we take their exponential moving average. By introducing an exponentially weighted moving average we are weighing the recent past more heavily in comparison to the distant past.

Note: AdaGrad implies a decreasing learning rate even if the gradients remain constant due to
accumulation of gradients from the beginning of training (same problem mentioned at the end
of AdaGrad part).

Equation: St = β St-1 + (1-β) (∂L/∂wt)² , with S0 = 0, and the weight is updated as wt+1 = wt - (α / √(St + ϵ)) ∂L/∂wt

AdaDelta
Like RMSprop, AdaDelta is another improvement on AdaGrad, focusing on the learning rate component, and it removes AdaGrad's decaying learning rate problem. AdaDelta is probably short for 'adaptive delta', where delta here refers to the difference between the current weight and the newly updated weight. The difference between AdaDelta and RMSprop is that AdaDelta removes the use of the learning rate parameter completely by replacing it with D, the exponential moving average of squared deltas.

Equation: St = β St-1 + (1-β) (∂L/∂wt)² , Dt = β Dt-1 + (1-β) (Δwt)² , Δwt = - (√(Dt-1 + ϵ) / √(St + ϵ)) ∂L/∂wt , and the weight is updated as wt+1 = wt + Δwt

Visualize all methods: [http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html]


Since we are calculating individual learning rates for each parameter , why not calculate
individual momentum changes for each parameter and store them separately? This is where a new modified technique and improvement comes into play, called Adam.

Adam
Adaptive moment estimation, or Adam, is a combination of momentum and RMSprop. It acts
upon
(i) the gradient component by using V, the exponential moving average of gradients (like in
momentum)
(ii) the learning rate component by dividing the learning rate α by square root of S, the
exponential moving average of squared gradients (like in RMSprop).

Equation: Vt = β1 Vt-1 + (1-β1) ∂L/∂wt , St = β2 St-1 + (1-β2) (∂L/∂wt)² , with V0 & S0 = 0; bias correction: Vtcorrected = Vt / (1-β1^t) , Stcorrected = St / (1-β2^t)

weights are updated as wt+1 = wt - α Vtcorrected / (√(Stcorrected) + ϵ)

Adam works well in practice and compares favorably to other adaptive learning-rate algorithms, as it converges very fast, the learning speed of the model is quite fast and efficient, and it also rectifies problems faced in other optimization techniques such as a vanishing learning rate, slow convergence, or high variance in the parameter updates, which leads to a fluctuating loss function.

AdaMax
AdaMax is an adaptation of the Adam optimiser by the same authors using infinity norms
(hence ‘max’). V is the exponential moving average of gradients, and S is the exponential
moving average of past p-norm of gradients, approximated to the max function as seen below.

Equation: Vt = β1 Vt-1 + (1-β1) ∂L/∂wt , St = max(β2 St-1, |∂L/∂wt|); bias correction: Vtcorrected = Vt / (1-β1^t)

weights are updated as wt+1 = wt - α Vtcorrected / St


What is the benefit of AdaMax over Adam?
Adamax is supposed to be used when you’re using some setup that has sparse parameter
updates (ie word embeddings).
This is because the norm |∂L/∂wt| term is essentially ignored when it's small. This means that parameter updates (weights) are influenced by fewer gradients, and are therefore less susceptible to gradient noise.  I found this answer on Quora, but I didn't understand how the norm term in the equation is ignored when it's small; I hope I can understand it later.

Nadam
Nadam is an acronym for Nesterov and Adam optimiser. The Nesterov component, however, is
a more efficient modification than its original implementation (momentum). Why Nesterov is
better than momentum was discussed before.

Equation: Vt = β1 Vt-1 + (1-β1) ∂L/∂wt , St = β2 St-1 + (1-β2) (∂L/∂wt)² ; bias correction: Stcorrected = St / (1-β2^t)

Weights are updated as wt+1 = wt - (α / (√(Stcorrected) + ϵ)) × (β1 Vt + (1-β1) ∂L/∂wt) / (1-β1^t)

Note: Written in the same expanded form, the Adam optimizer weight update was wt+1 = wt - (α / (√(Stcorrected) + ϵ)) × (β1 Vt-1 + (1-β1) ∂L/∂wt) / (1-β1^t)

The difference is between Vt & Vt-1.

AMSGrad
Another variant of Adam is the AMSGrad. This variant revisits the adaptive learning rate
component in Adam and changes it to ensure that the current S is always larger than the
previous time step.

Equation: Vt = β1 Vt-1 + (1-β1) ∂L/∂wt , St = β2 St-1 + (1-β2) (∂L/∂wt)² ; S choice: St_max = max(St-1_max , St)

Weights are updated as wt+1 = wt - α Vt / (√(St_max) + ϵ)


Course 2 Week3
Batch normalization
We have always talked about feature scaling using different methods as normalization,
standardization, mean normalization ,… and how it affects the training by making the data more
centralized and rounded rather than elongated but what if we want to generalize this idea to the
whole neural network not just the input features? here comes the role of batch normalization.

Batch normalization applies normalization to some or all neurons before applying activation
function immediately which helped a lot in making learning parameters (w, b) of next layer
learn faster. It makes Z controlled by a specific μ, σ instead of being so random noting that μ &
σ could be 0 & 1 or any other value (What controls their value is γ & β).

Algorithm: given the z(i) values of one layer over a mini-batch, compute μ = (1/m) Σi z(i) and σ² = (1/m) Σi (z(i) - μ)², then znorm(i) = (z(i) - μ) / √(σ² + ϵ) and z̃(i) = γ znorm(i) + β.

As shown in the figure, applying this algorithm and using z̃(i) instead of z(i) makes the learning process faster.
Notes:
1- Z(i) here is equivalent to Z[l](i) but used as Z(i) for simplicity not more.
2-ϵ is added to variance to prevent dividing by zero.
3-γ & β here are learning parameters that will learn with the optimization algorithm used and
they are used to let μ & σ have values other than 0 and 1 to give some freedom for the outputs.
4-If γ = √(σ² + ϵ) and β = μ, this will lead to z̃(i) ≡ z(i) (acts as if there is no batch norm applied).
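A sketch of the batch-norm transform for the Z values of one layer over one mini-batch, with γ and β as per-unit column vectors (the shapes are assumptions for illustration):

import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    # Z: (units, m) pre-activations of one layer for one mini-batch
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)       # zero mean, unit variance per unit
    Z_tilde = gamma * Z_norm + beta              # let the network choose the mean/variance it wants
    return Z_tilde, mu, var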
Why batch norm works?
Reason1: Similarly as feature scaling, this helps in making all data centered and deviated
closely rather than having some data ranging from 0 to 1 and others from 0 to 1000.
Reason2: Batch norm helps in reducing covariance shift done by the previous layers.
Assume we have 4 hidden layers in the network. If we cover the first 2 layers and study the 3rd layer, we will find that the inputs a1[2], …, a4[2] to layer 3 are what drive the learning of w[3], b[3], w[4], b[4], … during the gradient descent process. But if we uncover the first 2 layers and look again at the 3rd layer, we can see that a1[2], …, a4[2] are themselves affected by the previous weights and suffer from covariance shift, so batch norm controls this shift using its learning parameters (γ & β). In other words, this reduces the coupling between layers, so that each layer of the network can learn by itself, more independently from the other layers.

Batch norm in fact has a slight regularization effect due to the noise added by computing mean
and variance on mini-batches like dropout. But mainly batch norm is not used for
regularization.
Note: As mini-batch size increases, the regularization effect decreases.

TAKE CARE: Batch normalization handles data one mini-batch at a time, which means that the mean and variance are computed on mini-batches, not on the whole data at once. This has an effect on the test phase: at test time we don't apply the network to a mini-batch but make a prediction on a single example each time, and since μ and σ² are computed over a mini-batch (m here is the number of examples in the batch, not the whole training set), the equations would be meaningless when applied to a single example at test time. So what do we do now?

As we have mini-batches, and each mini-batch has a specific μ & σ², then for layer ℓ, for example, we have μ{1}[ℓ], μ{2}[ℓ], μ{3}[ℓ], … So to compute the μ[ℓ] used in the test phase we use an exponentially weighted average, as follows: v := β v + (1-β) μ{t}[ℓ] across the mini-batches, until we reach v for the last mini-batch, and then use this as my μ for the test phase. Similarly, we do the same steps for σ².

This is not the only way to compute μ and σ2 but you can use any reasonable method to have a
single μ and σ2 for each layer using the ones of each mini-batch.
Softmax regression
We are always talking about logistic regression for binary classes (2 classes) so positive or
negative classes so softmax regression is the generalization of logistic regression for c classes.

If you have for example a network that classifies cats, dogs, baby chicks or others so now we
have 4 classes (class0, class1, class2, class3). In this case the output layer will be of 4 neurons
(c neurons). So we can say n[L]= 4 or c generally. Each neuron of the output layer provides the
probability of a class of the c classes to be the output given input features x. So for our example,
the 4 neuron outputs are (P(cats | X), P(dogs | X), P(baby chicks | X), P(others | X)). ŷ here is an output vector of dimension (4 × 1). Note that the summation of the 4 numbers or probabilities of the output layer should equal 1.
Softmax regression is done using softmax layer put in the output layer that differs from usual
layers that it has softmax activation function so the flow is as follows:

z[L] = w[L] a[L-1] + b[L]  the activation function is t = e^(z[L]) (element-wise), then a[L] = t / Σj tj , i.e. ai[L] = e^(zi[L]) / Σj e^(zj[L])
Note: dimensions of z[L], t & a[L] is c × 1.
The unusual thing about softmax activation function that it takes c×1 input and outputs a c×1
vector also unlike sigmoid and ReLU function which take a single real value and outputs a
single real one.
Assume we have only a softmax layer with no hidden layers so just take x1 & x2 and outputs 3
classes so what would a softmax layer do to this input data with 2 features x 1 & x2 ?
It will act as linear classifier to classify the data into 3 linearly separated regions as shown:

In case c=4 or 5 or 6 so it will be something like that ↓

It is called a softmax classifier because a hardmax would look at the values of z[L] and put a 1 at the highest value and zeros for the other c-1 classes, while softmax produces a probability for each class, as shown: 
Loss function of softmax regression
ℓ(ŷ, y) = - Σj=1..c yj log ŷj
Since y, or yactual, is a vector having 0's for all classes except the right class, which has a 1, the loss function actually equals -(1) log ŷright class, so the way to reduce the cost is to make the prediction of the right class as high as possible (1 is the best) to get the least cost possible.

Derivative with softmax


dz[L] = ŷ - y , where dz[L] is equivalent to ∂J/∂z[L] and its dimension is (c, 1).
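A small sketch tying the softmax activation, its loss and dz[L] together for a single example with c = 4 classes (the z values are made up; subtracting max(z) is just a standard numerical-stability trick):

import numpy as np

def softmax(z):
    t = np.exp(z - np.max(z))          # shift by max(z) for numerical stability
    return t / np.sum(t)

z = np.array([5.0, 2.0, -1.0, 3.0])    # z[L] for c = 4 classes
y = np.array([1.0, 0.0, 0.0, 0.0])     # true class is class 0 (one-hot vector)
y_hat = softmax(z)
loss = -np.sum(y * np.log(y_hat))      # only the true-class term survives: -log(y_hat[0])
dz = y_hat - y                         # dz[L] = y_hat - y
print(y_hat, loss, dz)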
Batch normalization
Batch normalization is one of the reasons why deep learning has made such outstanding
progress in recent years. Batch normalization enables the use of higher learning rates, greatly
accelerating the learning process. It also enabled the training of deep neural networks with
sigmoid activations that were previously deemed too difficult to train due to the vanishing
gradient problem. Based on its success, other normalization methods such as layer
normalization and weight normalization have appeared and are also finding use within the field.
Though batch normalization is now a standard feature of any deep learning framework and can
be used off the shelf, using it naively can lead to difficulties in practice.
Batch normalization
It is a technique to normalize (Standardize) the internal representation of data for faster training.
It is mainly relying on the idea of reducing internal covariance shift.
Covariance shift: It refers to the change in the distribution of the input values to a learning
algorithm. This is a problem that is not unique to deep learning. For instance, if the train and
test sets come from entirely different sources (e.g. training images come from the web while test
images are pictures taken on the iPhone), the distributions would differ. The reason covariance
shift can be a problem is that the behavior of machine learning algorithms can change when the
input distribution changes.
In the context of deep learning, we are particularly concerned with the change in the distribution
of the inputs to the inner nodes within a network. A neural network changes the weights of each
layer over the course of training. This means that the activations (output) of each layer change
as well. Since the activations of a previous layer are the inputs of the next layer, each layer in
the neural network is faced with a situation where the input distribution changes with each step.
This is problematic because it forces each intermediate layer to continuously adapt to its
changing inputs. The basic idea behind batch normalization is to limit covariate shift by
normalizing the activations of each layer (transforming the inputs to be mean 0 and unit
variance). This, supposedly, allows each layer to learn on a more stable distribution of inputs,
and would thus accelerate the training of the network.
Algorithm
Note: In practice, restricting the activations of each layer to be strictly 0 mean and unit variance can
limit the expressive power of the network. Therefore, in practice, batch normalization allows the
network to learn parameters γ and β that can convert the mean and variance to any value that the
network desires.
Note: In fact the batch normalization does standardization not basic normalization so the word
normalize in the algorithm section is in fact standardize.
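A minimal numpy sketch of the batch norm forward step described above; the mini-batch values and the γ, β shapes are assumptions for illustration:

import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    # Z has shape (n_units, m) for a mini-batch of m examples
    mu  = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm  = (Z - mu) / np.sqrt(var + eps)   # standardized: mean 0, unit variance
    Z_tilde = gamma * Z_norm + beta           # learned scale (gamma) and shift (beta)
    return Z_tilde

Z = np.random.randn(4, 64) * 3 + 5            # fake pre-activations: 4 units, 64 examples
gamma = np.ones((4, 1)); beta = np.zeros((4, 1))
print(batch_norm_forward(Z, gamma, beta).mean(axis=1))   # ≈ 0 for each unit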

Since batch normalization is meant to prevent covariate shift from changing the distribution of
each layer's activations, why would we use γ and β, which allow the mean and variance to deviate
from 0 and 1? What is the point of batch normalization then?
This is a hard question; part of the answer lies in the high-order, complex interactions between
multiple layers during gradient descent. For a little bit of intuition behind this idea, visit the
following link, in the high-order effects part.
[http://mlexplained.com/2018/01/10/an-intuitive-explanation-of-why-batch-normalization-
really-works-normalization-in-deep-learning-part-1/]

Benefits of batch normalization other than reducing covariate shift and faster training
 We can use higher learning rates because batch normalization makes sure that no
activation goes really high or really low. As a result, networks that previously
couldn't be trained start to train. (This leads to faster training.)
 It reduces overfitting because it has a slight regularization effect. Similar to dropout, it
adds some noise to each hidden layer's activations. Therefore, if we use batch
normalization, we will use less dropout, which is a good thing because we are not going
to lose a lot of information. However, we should not depend only on batch normalization
for regularization; we should better use it together with dropout.
 Makes weight initialization easier, as batch norm reduces sensitivity to the initial weights.
 Makes more activation functions usable, as some activation functions such as sigmoid and
ReLU are sensitive to the range of values fed to them, and batch norm regulates those values.
 Allows deeper networks due to the previously mentioned points, and deeper networks
generally produce better results.
Note: Each training iteration will be slower due to extra steps added as normalization (or
standardization) in forward propagation and new learning parameters added (γ & β) in the back-
propagation. However, we say the overall training is faster with batch normalization as the
convergence will be faster.
Another main topic
Auto-Encoders

Autoencoder
Autoencoders are a specific type of unsupervised feed-
forward neural networks where the input is the same
as the output. It has three layers: an input layer, a
hidden (encoding) layer, and an output (decoding)
layer. They work by compressing the input into
a latent-space representation, and then reconstructing
the output from this representation. The encoder and
decoder have the following functions:
 Encoder: This is the part of the network that compresses the input into a latent-space
representation. It can be represented by an encoding function h=f(x).
 Decoder: This part aims to reconstruct the input from the latent space representation. It
can be represented by a decoding function x̂ = g(h).
The autoencoder as a whole can thus be described by the function g(f(x)) = x̂, where you
want x̂ to be as close as possible to the original input x. In other words, the autoencoder's objective is to minimize
the reconstruction error between the input and output. This helps autoencoders to learn important
features present in the data. When a representation allows a good reconstruction of its input,
then it has retained much of the information present in the input.

Note: Number of nodes in hidden layer(s) is a hyper-parameter.

Real world example:


We buy a service or an item on the internet. We ensure that the site is secure by checking that they
use the https protocol. We enter our credit card details for the purchase. Our credit card details are
encoded over the network using some encoding algorithm. The encoded credit card details are
decoded to generate the original credit card number for validation.
In our credit card example, we took the credit card details and encoded them using some function.
Later we decoded them using another function to reproduce an output identical to the input. This is
how autoencoders work.
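A minimal autoencoder sketch using the Keras API (tensorflow.keras); the input size of 784 (flattened 28×28 images) and the code size of 32 are assumptions for illustration, not values from the source:

from tensorflow.keras import layers, models

input_dim = 784   # assumed: flattened 28x28 images
code_dim  = 32    # size of the latent-space (coded) layer

inputs  = layers.Input(shape=(input_dim,))
encoded = layers.Dense(code_dim, activation='relu')(inputs)        # encoder: h = f(x)
decoded = layers.Dense(input_dim, activation='sigmoid')(encoded)   # decoder: x_hat = g(h)

autoencoder = models.Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# note that the input is also the target, e.g.
# autoencoder.fit(x_train, x_train, epochs=50, batch_size=256)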
Importance and properties of autoencoders
Autoencoders are mainly data denoising and dimensionality reduction (or compression)
algorithms, often used for data visualization, with a couple of important properties:
 Data-specific: Autoencoders are only able to meaningfully compress data similar to what
they have been trained on. Since they learn features specific for the given training data,
they are different than a standard data compression algorithm like gzip. So we can’t
expect an autoencoder trained on handwritten digits to compress landscape photos.
 Lossy: The output of the autoencoder will not be exactly the same as the input, it will be
a close but degraded representation. If you want lossless compression they are not the
way to go.
 Unsupervised: To train an autoencoder we don’t need to do anything fancy, just throw the
raw input data at it. Autoencoders are considered an unsupervised learning technique
since they don’t need explicit labels to train on. But to be more precise they are self-
supervised because they generate their own labels from the training data.
 With appropriate dimensionality and sparsity constraints, autoencoders can learn data
projections that are more interesting than PCA or other basic techniques, as autoencoders
are much more flexible than PCA. Autoencoders can represent both linear and non-linear
transformations in the encoding, while PCA can only perform linear transformations.
Autoencoders can also be stacked to form a deep learning network thanks to their network
representation.
 Fewer hidden units means more compression of the data being trained on.

Note: Although we said in the definition that an autoencoder has 3 layers, some
types of autoencoders use more than one hidden layer.
Types of autoencoders
 Vanilla autoencoder: The simplest form of autoencoders which has only 1 hidden layer
where the input and output are of the same size and the hidden layer, which acts as the compressor
(coded part), has a smaller size.
 Multilayer autoencoder: It has more than 1 hidden layer (usually an odd number), so if
for example we use 3, then:
Input (sizeI) → hidden1 (size1) → hidden2 (size2) → hidden3 (size3) → output (sizeO)
Note: sizeI = sizeO; (size1 = size3) < sizeI; size2 (coded part) < size1 & sizeI.
 Convolutional autoencoder: This is used with CNN and has same principle but uses
images (3D vectors) instead of flattened 1D vectors. (Discussed again in CNN part)
 Regularized autoencoder: use a loss function that encourages the model to have other
properties besides the ability to copy its input to its output. In practice, we usually find
two types of regularized auto encoder: 1)sparse autoencoder 2)denoising autoencoder
Sparse autoencoder
Sparse autoencoders are typically used to learn features for another task such as classification.
An autoencoder that has been regularized to be sparse must respond to unique statistical
features of the dataset it has been trained on, rather than simply acting as an identity function. In
this way, training to perform the copying task with a sparsity penalty can yield a model that has
learned useful features as a byproduct.
Another way we can constraint the reconstruction of autoencoder is to impose a constraint in its
loss. We could, for example, add a regularization term in the loss function. Doing this will make
our autoencoder learn sparse representation of data.
Notice that in the hidden layer we add an activity regularizer (L1 or L2 or both) that will apply a
penalty to the loss function during the optimization phase. As a result, if we are using 1 hidden
layer, the representation is now sparser compared to the vanilla autoencoder.
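A hedged Keras sketch of adding an L1 activity regularizer to the hidden (coded) layer; the layer sizes and the 1e-5 penalty are illustrative assumptions:

from tensorflow.keras import layers, models, regularizers

input_dim, code_dim = 784, 32
inputs  = layers.Input(shape=(input_dim,))
# the L1 activity regularizer penalizes large activations, pushing most of them toward zero
encoded = layers.Dense(code_dim, activation='relu',
                       activity_regularizer=regularizers.l1(1e-5))(inputs)
decoded = layers.Dense(input_dim, activation='sigmoid')(encoded)
sparse_autoencoder = models.Model(inputs, decoded)
sparse_autoencoder.compile(optimizer='adam', loss='binary_crossentropy')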

Denoising autoencoder
Rather than adding a penalty to the loss function, we can obtain an autoencoder that learns
something useful by changing the reconstruction error term of the loss function. This can be
done by adding some noise of the input image and make the autoencoder learn to remove it. By
this means, the encoder will extract the most important features and learn a more robust
representation of the data. The idea behind denoising autoencoders is simple. In order to force
the hidden layer to discover more robust features and prevent it from simply learning the
identity, we train the autoencoder to reconstruct the input from a corrupted version of it.
The amount of noise to apply to the input takes the form of a percentage. Typically, 30 percent,
or 0.3, is fine, but if you have very little data, you may want to consider adding more.
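A small numpy sketch of the corruption step; x_train here is just a random placeholder standing in for real data scaled to [0, 1]:

import numpy as np

x_train = np.random.rand(1000, 784)   # placeholder for real data scaled to [0, 1]
noise_factor = 0.3                    # roughly 30 percent noise, as suggested above
x_train_noisy = x_train + noise_factor * np.random.normal(size=x_train.shape)
x_train_noisy = np.clip(x_train_noisy, 0.0, 1.0)

# the autoencoder is then trained to map the corrupted input back to the clean one, e.g.
# autoencoder.fit(x_train_noisy, x_train, epochs=50, batch_size=256)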

Stacked denoising autoencoders (SDA)


A kind of denoising autoencoder which uses an unsupervised pre-training mechanism on its layers:
once each layer is pre-trained to conduct feature selection and extraction on the input
from the preceding layer, a second stage of supervised fine-tuning can follow. An SDA is simply
multiple denoising autoencoders strung together. Once the first k layers are trained, we can train
the k+1-th layer because we can now compute the code or latent representation from the layer
below.
Once all layers are pre-trained, the network goes through a second stage of training called fine-
tuning. We use a supervised learning mechanism here to fine-tune, in order to minimise the
prediction error on supervised task. We then train the entire network as we would train
a multilayer perceptron. At this point, we only consider the encoding parts of each auto-
encoder. This stage is supervised, since now we use the target class during training.
NOTATIONS
Course 3 Week1&2
Course 3 actually covers some of the same material as the questions from the Machine Learning
Yearning book by Andrew NG, so reading those questions is more than enough, but in case you
need visualization you can watch the 2 weeks of course 3 that discuss those questions in videos.
https://www.youtube.com/watch?v=dFX8k1kXhOw&list=PLkDaE6sCZn6E7jZ9sN_xHwSHOdjUxUW_b
QUESTIONS FROM DRAFT VERSION OF MACHINE
LEARNING YEARNING BOOK BY ANDREW NG
(For full version: https://bit.ly/2LgaTee)
4 Scale drives machine learning progress
Many of the ideas of deep learning (neural networks) have been around for decades. Why are
these ideas taking off now?
Two of the biggest drivers of recent progress have been:
• Data availability. People are now spending more time on digital devices (laptops, mobile
devices). Their digital activities generate huge amounts of data that we can feed to our
learning algorithms.
• Computational scale. We started just a few years ago to be able to train neural
networks that are big enough to take advantage of the huge datasets we now have.

In detail, even as you accumulate more data, usually the performance of older learning
algorithms, such as logistic regression, “plateaus.” This means its learning curve “flattens
out,” and the algorithm stops improving even as you give it more data:

It was as if the older algorithms didn’t know what to do with all the data we now have.
If you train a small neural network (NN) on the same supervised learning task, you might
get slightly better performance:

Here, by “Small NN” we mean a neural network with only a small number of hidden
units/layers/parameters. Finally, if you train larger and larger neural networks, you can
obtain even better performance:

Thus, you obtain the best performance when you (i) Train a very large neural network, so
that you are on the green curve above; (ii) Have a huge amount of data.
Many other details such as neural network architecture are also important, and there has
been much innovation here. But one of the more reliable ways to improve an algorithm’s
performance today is still to (i) train a bigger network and (ii) get more data.
5 Your development and test sets
Let’s return to our earlier cat pictures example: You run a mobile app, and users are
uploading pictures of many different things to your app. You want to automatically find the
cat pictures.
Your team gets a large training set by downloading pictures of cats (positive examples) and
non-cats (negative examples) off of different websites. They split the dataset 70%/30% into
training and test sets. Using this data, they build a cat detector that works well on the
training and test sets.
But when you deploy this classifier into the mobile app, you find that the performance is
really poor!
What happened?
You figure out that the pictures users are uploading have a different look than the website
images that make up your training set: Users are uploading pictures taken with mobile
phones, which tend to be lower resolution, blurrier, and poorly lit. Since your training/test
sets were made of website images, your algorithm did not generalize well to the actual
distribution you care about: mobile phone pictures.
Before the modern era of big data, it was a common rule in machine learning to use a
random 70%/30% split to form your training and test sets. This practice can work, but it’s a
bad idea in more and more applications where the training distribution (website images in
our example above) is different from the distribution you ultimately care about (mobile
phone images).
We usually define:
• Training set — Which you run your learning algorithm on.
• Dev (development) set — Which you use to tune parameters, select features, and
make other decisions regarding the learning algorithm. Sometimes also called the
hold-out cross validation set .
• Test set — which you use to evaluate the performance of the algorithm, but not to make
any decisions regarding what learning algorithm or parameters to use.
Once you define a dev set (development set) and test set, your team will try a lot of ideas,
such as different learning algorithm parameters, to see what works best. The dev and test
sets allow your team to quickly see how well your algorithm is doing.
In other words, the purpose of the dev and test sets are to direct your team toward
the most important changes to make to the machine learning system .
So, you should do the following:
Choose dev and test sets to reflect data you expect to get in the future
and want to do well on.
In other words, your test set should not simply be 30% of the available data, especially if you
expect your future data (mobile phone images) to be different in nature from your training
set (website images).
If you have not yet launched your mobile app, you might not have any users yet, and thus
might not be able to get data that accurately reflects what you have to do well on in the
future. But you might still try to approximate this. For example, ask your friends to take
mobile phone pictures of cats and send them to you. Once your app is launched, you can
update your dev/test sets using actual user data.
If you really don’t have any way of getting data that approximates what you expect to get in the future,
perhaps you can start by using website images. But you should be aware of the
risk of this leading to a system that doesn’t generalize well.
It requires judgment to decide how much to invest in developing great dev and test sets. But
don’t assume your training distribution is the same as your test distribution. Try to pick test examples
that reflect what you ultimately want to perform well on, rather than whatever data
you happen to have for training.
6 Your dev and test sets should come from the
same distribution
You have your cat app image data segmented into four regions, based on your largest
markets: (i) US, (ii) China, (iii) India, and (iv) Other. To come up with a dev set and a test
set, say we put US and India in the dev set; China and Other in the test set. In other words,
we can randomly assign two of these segments to the dev set, and the other two to the test
set, right?
Once you define the dev and test sets, your team will be focused on improving dev set
performance. Thus, the dev set should reflect the task you want to improve on the most: To
do well on all four geographies, and not only two.
There is a second problem with having different dev and test set distributions: There is a
chance that your team will build something that works well on the dev set, only to find that it
does poorly on the test set. I’ve seen this result in much frustration and wasted effort. Avoid
letting this happen to you.
As an example, suppose your team develops a system that works well on the dev set but not
the test set. If your dev and test sets had come from the same distribution, then you would
have a very clear diagnosis of what went wrong: You have overfit the dev set. The obvious
cure is to get more dev set data.
But if the dev and test sets come from different distributions, then your options are less
clear. Several things could have gone wrong:
1. You had overfit to the dev set.
2. The test set is harder than the dev set. So your algorithm might be doing as well as could
be expected, and no further significant improvement is possible.
3. The test set is not necessarily harder, but just different, from the dev set. So what works
well on the dev set just does not work well on the test set. In this case, a lot of your work
to improve dev set performance might be wasted effort.

Working on machine learning applications is hard enough. Having mismatched dev and test
sets introduces additional uncertainty about whether improving on the dev set distribution
also improves test set performance. Having mismatched dev and test sets makes it harder to
figure out what is and isn’t working, and thus makes it harder to prioritize what to work on.
If you are working on a 3rd party benchmark problem, their creator might have specified dev
and test sets that come from different distributions. Luck, rather than skill, will have a
greater impact on your performance on such benchmarks compared to if the dev and test
sets come from the same distribution. It is an important research problem to develop
learning algorithms that are trained on one distribution and generalize well to another. But if
your goal is to make progress on a specific machine learning application rather than make
research progress, I recommend trying to choose dev and test sets that are drawn from the
same distribution. This will make your team more efficient.
7 How large do the dev/test sets need to be?
The dev set should be large enough to detect differences between algorithms that you are trying out.
For example, if classifier A has an accuracy of 90.0% and classifier B has an accuracy of 90.1%, then a
dev set of 100 examples would not be able to detect this 0.1% difference. Compared to other machine
learning problems I’ve seen, a 100 example dev set is small. Dev sets with sizes from 1,000 to 10,000
examples are common. With 10,000 examples, you will have a good chance of detecting an
improvement of 0.1%. For mature and important applications—for example, advertising, web search,
and product recommendations—I have also seen teams that are highly motivated to eke out even a
0.01% improvement, since it has a direct impact on the company’s profits. In this case, the dev set
could be much larger than 10,000, in order to detect even smaller improvements.
How about the size
of the test set? It should be large enough to give high confidence in the overall performance of your
system. One popular heuristic had been to use 30% of your data for your test set. This works well
when you have a modest number of examples—say 100 to 10,000 examples. But in the era of big data
where we now have machine learning problems with sometimes more than a billion examples, the
fraction of data allocated to dev/test sets has been shrinking, even as the absolute number of
examples in the dev/test sets has been growing. There is no need to have excessively large dev/test
sets beyond what is needed to evaluate the performance of your algorithms.

8 Establish a single-number evaluation metric for your team to optimize
Classification accuracy is an example of a single-number evaluation metric : You run
your classifier on the dev set (or test set), and get back a single number about what fraction
of examples it classified correctly. According to this metric, if classifier A obtains 97%
accuracy, and classifier B obtains 90% accuracy, then we judge classifier A to be superior.
In contrast, Precision and Recall is not a single-number evaluation metric: It gives two
numbers for assessing your classifier. Having multiple-number evaluation metrics makes it
harder to compare algorithms. Suppose your algorithms perform as shown:
Here, neither classifier is obviously superior, so it doesn’t immediately guide you toward
picking one.
During development, your team will try a lot of ideas about algorithm architecture, model parameters,
choice of features, etc. Having a single-number evaluation metric such as accuracy allows you to
sort all your models according to their performance on this metric, and quickly decide what is
working best.
If you really care about both Precision and Recall, I recommend using one of the standard
ways to combine them into a single number. For example, one could take the average of
precision and recall, to end up with a single number. Alternatively, you can compute the “F1 score,”
which is a modified way of computing their average, and works better than simply taking the mean.
Having a single-number evaluation metric speeds up your
ability to make a decision when you are selecting among a
large number of classifiers. It gives a clear preference ranking
among all of them, and therefore a clear direction for progress.
As a final example, suppose you are separately tracking the
accuracy of your cat classifier in four key markets: (i) US, (ii) China, (iii) India, and (iv) Other. This
gives four metrics. By taking an average or weighted average of these four numbers, you end up with a
single number metric. Taking an average or weighted average is one of the most common ways to
combine multiple metrics into one.
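For example, a tiny Python sketch of the two combinations mentioned above; the precision/recall values and the market accuracies and weights are made up:

def f1_score(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

def weighted_average(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(f1_score(0.95, 0.90))                                      # one number per classifier
print(weighted_average([0.97, 0.92, 0.90, 0.88], [4, 3, 2, 1]))  # four market accuracies -> one number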
9 Optimizing and satisficing metrics
Here’s another way to combine multiple evaluation metrics.
Suppose you care about both the accuracy and the running time of a learning algorithm. You
need to choose from these three classifiers:

It seems unnatural to derive a single metric by putting accuracy and running time into a
single formula, such as:
Accuracy - 0.5*RunningTime
Here’s what you can do instead: First, define what is an “acceptable” running time. Let’s say
anything that runs in 100ms is acceptable. Then, maximize accuracy, subject to your
classifier meeting the running time criteria. Here, running time is a “satisficing
metric”—your classifier just has to be “good enough” on this metric, in the sense that it
should take at most 100ms. Accuracy is the “optimizing metric.”
If you are trading off N different criteria, such as binary file size of the model (which is
important for mobile apps, since users don’t want to download large apps), running time,
and accuracy, you might consider setting N-1 of the criteria as “satisficing” metrics. I.e., you
simply require that they meet a certain value. Then define the final one as the “optimizing”
metric. For example, set a threshold for what is acceptable for binary file size and running
time, and try to optimize accuracy given those constraints.
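As a sketch of this selection rule (the classifier names, accuracies, and running times are invented for illustration):

# classifiers as (name, accuracy, running time in ms)
classifiers = [("A", 0.90, 80), ("B", 0.92, 95), ("C", 0.95, 1500)]

MAX_RUNTIME_MS = 100                          # satisficing metric: just has to be "good enough"
feasible = [c for c in classifiers if c[2] <= MAX_RUNTIME_MS]
best = max(feasible, key=lambda c: c[1])      # optimizing metric: accuracy
print(best)                                   # ('B', 0.92, 95)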
As a final example, suppose you are building a hardware device that uses a microphone to
listen for the user saying a particular “wakeword,” that then causes the system to wake up.
Examples include Amazon Echo listening for “Alexa”; Apple Siri listening for “Hey Siri”;
Android listening for “Okay Google”; and Baidu apps listening for “Hello Baidu.” You care
about both the false positive rate—the frequency with which the system wakes up even when
no one said the wakeword—as well as the false negative rate—how often it fails to wake up
when someone says the wakeword. One reasonable goal for the performance of this system is to
minimize the false negative rate (optimizing metric), subject to there being no more than
one false positive every 24 hours of operation (satisficing metric).
Once your team is aligned on the evaluation metric to optimize, they will be able to make
faster progress.
11 When to change dev/test sets and metrics
When starting out on a new project, I try to quickly choose dev/test sets, since this gives the
team a well-defined target to aim for.
I typically ask my teams to come up with an initial dev/test set and an initial metric in less
than one week—rarely longer. It is better to come up with something imperfect and get going
quickly, rather than overthink this. But this one week timeline does not apply to mature
applications. For example, anti-spam is a mature deep learning application. I have seen
teams working on already-mature systems spend months to acquire even better dev/test
sets.
If you later realize that your initial dev/test set or metric missed the mark, by all means
change them quickly. For example, if your dev set + metric ranks classifier A above classifier
B, but your team thinks that classifier B is actually superior for your product, then this might
be a sign that you need to change your dev/test sets or your evaluation metric.
There are three main possible causes of the dev set/metric incorrectly rating classifier A
higher:
1. The actual distribution you need to do well on is different from the dev/test sets.
Suppose your initial dev/test set had mainly pictures of adult cats. You ship your cat app,
and find that users are uploading a lot more kitten images than expected. So, the dev/test set
distribution is not representative of the actual distribution you need to do well on. In this
case, update your dev/test sets to be more representative.

2. You have overfit to the dev set.


The process of repeatedly evaluating ideas on the dev set causes your algorithm to gradually
“overfit” to the dev set. When you are done developing, you will evaluate your system on the
test set. If you find that your dev set performance is much better than your test set
performance, it is a sign that you have overfit to the dev set. In this case, get a fresh dev set.
If you need to track your team’s progress, you can also evaluate your system regularly—say
once per week or once per month—on the test set. But do not use the test set to make any
decisions regarding the algorithm, including whether to roll back to the previous week’s
system. If you do so, you will start to overfit to the test set, and can no longer count on it to
give a completely unbiased estimate of your system’s performance (which you would need if
you’re publishing research papers, or perhaps using this metric to make important business
decisions).

3. The metric is measuring something other than what the project needs to optimize.
Suppose that for your cat application, your metric is classification accuracy. This metric
currently ranks classifier A as superior to classifier B. But suppose you try out both
algorithms, and find classifier A is allowing occasional pornographic images to slip through.
Even though classifier A is more accurate, the bad impression left by the occasional
pornographic image means its performance is unacceptable. What do you do?
Here, the metric is failing to identify the fact that Algorithm B is in fact better than
Algorithm A for your product. So, you can no longer trust the metric to pick the best
algorithm. It is time to change evaluation metrics. For example, you can change the metric to
heavily penalize letting through pornographic images. I would strongly recommend picking
a new metric and using the new metric to explicitly define a new goal for the team, rather
than proceeding for too long without a trusted metric and reverting to manually choosing
among classifiers.
It is quite common to change dev/test sets or evaluation metrics during a project. Having an
initial dev/test set and metric helps you iterate quickly. If you ever find that the dev/test sets
or metric are no longer pointing your team in the right direction, it’s not a big deal! Just
change them and make sure your team knows about the new direction.
12 Takeaways: Setting up development and
test sets (Summary of the previous)
• Choose dev and test sets from a distribution that reflects what data you expect to get in
the future and want to do well on. This may not be the same as your training data’s
distribution.

• Choose dev and test sets from the same distribution if possible.

• Choose a single-number evaluation metric for your team to optimize. If there are multiple
goals that you care about, consider combining them into a single formula (such as
averaging multiple error metrics) or defining satisficing and optimizing metrics.

• Machine learning is a highly iterative process: You may try many dozens of ideas before
finding one that you’re satisfied with.

• Having dev/test sets and a single-number evaluation metric helps you quickly evaluate
algorithms, and therefore iterate faster.

• When starting out on a brand new application, try to establish dev/test sets and a metric
quickly, say in less than a week. It might be okay to take longer on mature applications.

• The old heuristic of a 70%/30% train/test split does not apply for problems where you
have a lot of data; the dev and test sets can be much less than 30% of the data.

• Your dev set should be large enough to detect meaningful changes in the accuracy of your
algorithm, but not necessarily much larger. Your test set should be big enough to give you
a confident estimate of the final performance of your system.

• If your dev set and metric are no longer pointing your team in the right direction, quickly
change them: (i) If you had overfit the dev set, get more dev set data. (ii) If the actual
distribution you care about is different from the dev/test set distribution, get new
dev/test set data. (iii) If your metric is no longer measuring what is most important to
you, change the metric.
13 Build your first system quickly, then iterate
You want to build a new email anti-spam system. Your team has several ideas:
• Collect a huge training set of spam email. For example, set up a “honeypot”: deliberately
send fake email addresses to known spammers, so that you can automatically harvest the
spam messages they send to those addresses.
• Develop features for understanding the text content of the email.
• Develop features for understanding the email envelope/header features to show what set
of internet servers the message went through.
• and more.
Even though I have worked extensively on anti-spam, I would still have a hard time picking one of
these directions. It is even harder if you are not an expert in the application area. So don’t start off
trying to design and build the perfect system. Instead, build and train a basic system quickly—perhaps
in just a few days. Even if the basic system is far from the “best” system you can build, it is valuable to
examine how the basic system functions: you will quickly find clues that show you the most promising
directions in which to invest your time. These next few chapters will show you how to read these clues.

14 Error analysis: Look at dev set examples to evaluate ideas
When you play with your cat app, you notice several examples where it mistakes dogs for
cats. Some dogs do look like cats! A team member proposes incorporating 3rd party software that will
make the system do better on dog images. These changes will take a month, and the team member is
enthusiastic. Should you ask them to go ahead?
Before investing a month on this task, I recommend that you first estimate how much it will actually
improve the system’s accuracy. Then you can more rationally decide if this is worth the month of
development time, or if you’re better off using that time on other tasks. Here’s what you can do:
1. Gather a sample of 100 dev set examples that your system misclassified . I.e., examples
that your system made an error on.
2. Look at these examples manually, and count what fraction of them are dog images.
The process of looking at misclassified examples is called error analysis . In this example, if you find
that only 5% of the misclassified images are dogs, then no matter how much you improve your
algorithm’s performance on dog images, you won’t get rid of more than 5% of your errors. In other
words, 5% is a “ceiling” (meaning maximum possible amount) for how much the proposed project
could help. Thus, if your overall system is currently 90% accurate (10% error), this improvement is
likely to result in at best 90.5% accuracy (or 9.5% error, which is 5% less error than the original 10%
error). In contrast, if you find that 50% of the mistakes are dogs, then you can be more confident that
the proposed project will have a big impact. It could boost accuracy from 90% to 95% (a 50% relative
reduction in error, from 10% down to 5%). This simple counting procedure of error analysis gives you
a quick way to estimate the possible value of incorporating the 3rd party software for dog images. It
provides a quantitative basis on which to decide whether to make this investment.
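The ceiling arithmetic above can be written out as a two-line check, using the numbers from the example:

overall_error = 0.10    # current system: 10% error
dog_fraction  = 0.05    # fraction of misclassified dev examples that are dogs

# best case: fix every dog error, so the error drops by at most that fraction of the errors
best_case_error = overall_error * (1 - dog_fraction)   # 0.095, i.e. at best 9.5% error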
Error analysis can often help you figure out how promising different directions are. I’ve seen many
engineers reluctant to carry out error analysis. It often feels more exciting to just jump in and
implement some idea, rather than question if the idea is worth the time investment. This is a common
mistake: It might result in your team spending a month only to realize afterward that it resulted in
little benefit. Manually examining 100 examples does not take long. Even if you take one minute per
image, you’d be done in under two hours. These two hours could save you a month of wasted effort.
Error Analysis refers to the process of examining dev set examples that your algorithm misclassified,
so that you can understand the underlying causes of the errors. This can help you prioritize projects—
as in this example—and inspire new directions, which we will discuss next. The next few chapters will
also present best practices for carrying out error analyses.
17 If you have a large dev set, split it into two
subsets, only one of which you look at
Suppose you have a large dev set of 5,000 examples in which you have a 20% error rate.
Thus, your algorithm is misclassifying ~1,000 dev images. It takes a long time to manually
examine 1,000 images, so we might decide not to use all of them in the error analysis.
In this case, I would explicitly split the dev set into two subsets, one of which you look at, and
one of which you don’t. You will more rapidly overfit the portion that you are manually
looking at. You can use the portion you are not manually looking at to tune parameters.
Let’s continue our example above, in which the algorithm is misclassifying 1,000 out of
5,000 dev set examples. Suppose we want to manually examine about 100 errors for error
analysis (10% of the errors). You should randomly select 10% of the dev set and place that
into what we’ll call an Eyeball dev set to remind ourselves that we are looking at it with our
eyes. (For a project on speech recognition, in which you would be listening to audio clips,
perhaps you would call this set an Ear dev set instead). The Eyeball dev set therefore has 500
examples, of which we would expect our algorithm to misclassify about 100.
The second subset of the dev set, called the Blackbox dev set , will have the remaining
4500 examples. You can use the Blackbox dev set to evaluate classifiers automatically by
measuring their error rates. You can also use it to select among algorithms or tune
hyperparameters. However, you should avoid looking at it with your eyes. We use the term
“Blackbox” because we will only use this subset of the data to obtain “Blackbox” evaluations
of classifiers.
Why do we explicitly separate the dev set into Eyeball and Blackbox dev sets? Since you will
gain intuition about the examples in the Eyeball dev set, you will start to overfit the Eyeball
dev set faster. If you see the performance on the Eyeball dev set improving much more
rapidly than the performance on the Blackbox dev set, you have overfit the Eyeball dev set.
In this case, you might need to discard it and find a new Eyeball dev set by moving more
examples from the Blackbox dev set into the Eyeball dev set or by acquiring new labeled
data. Explicitly splitting your dev set into Eyeball and Blackbox dev sets allows you to tell when
your manual error analysis process is causing you to overfit the Eyeball portion of your data.
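A minimal sketch of such a split, using the sizes from the example above; numpy indices stand in for the real dev examples:

import numpy as np

dev_indices = np.arange(5000)          # placeholder for the 5,000 dev set examples
rng = np.random.default_rng(0)
rng.shuffle(dev_indices)

eyeball_dev  = dev_indices[:500]       # ~10%: examined by hand for error analysis
blackbox_dev = dev_indices[500:]       # used only for automatic evaluation and tuning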

19 Takeaways: Basic error analysis (Summary)


• When you start a new project, especially if it is in an area in which you are not an expert,
it is hard to correctly guess the most promising directions.
• So don’t start off trying to design and build the perfect system. Instead build and train a basic
system as quickly as possible—perhaps in a few days. Then use error analysis to help you identify the
most promising directions and iteratively improve your algorithm from there.
• Carry out error analysis by manually examining ~100 dev set examples the algorithm
misclassifies and counting the major categories of errors. Use this information to
prioritize what types of errors to work on fixing.
• Consider splitting the dev set into an Eyeball dev set, which you will manually examine,
and a Blackbox dev set, which you will not manually examine. If performance on the
Eyeball dev set is much better than the Blackbox dev set, you have overfit the Eyeball dev
set and should consider acquiring more data for it.
• The Eyeball dev set should be big enough so that your algorithm misclassifies enough
examples for you to analyze. A Blackbox dev set of 1,000-10,000 examples is sufficient
for many applications.
• If your dev set is not big enough to split this way, just use the entire dev set as an Eyeball
dev set for manual error analysis, model selection, and hyperparameter tuning.
20 Bias and Variance: The two big sources of
error
Suppose your training, dev and test sets all come from the same distribution. Then you
should always try to get more training data, since that can only improve performance, right?
Even though having more data can’t hurt, unfortunately it doesn’t always help as much as
you might hope. It could be a waste of time to work on getting more data. So, how do you
decide when to add data, and when not to bother?
There are two major sources of error in machine learning: bias and variance. Understanding
them will help you decide whether adding data, as well as other tactics to improve
performance, are a good use of time.
Suppose you hope to build a cat recognizer that has 5% error. Right now, your training set
has an error rate of 15%, and your dev set has an error rate of 16%. In this case, adding
training data probably won’t help much. You should focus on other changes. Indeed, adding
more examples to your training set only makes it harder for your algorithm to do well on the
training set. (We explain why in a later chapter.)
If your error rate on the training set is 15% (or 85% accuracy), but your target is 5% error
(95% accuracy), then the first problem to solve is to improve your algorithm ’ s performance
on your training set. Your dev/test set performance is usually worse than your training set
performance. So if you are getting 85% accuracy on the examples your algorithm has seen,
there’s no way you’re getting 95% accuracy on examples your algorithm hasn’t even seen.
Suppose as above that your algorithm has 16% error (84% accuracy) on the dev set. We break
the 16% error into two components:
• First, the algorithm’s error rate on the training set. In this example, it is 15%. We think of
this informally as the algorithm’s bias .
• Second, how much worse the algorithm does on the dev (or test) set than the training set.
In this example, it does 1% worse on the dev set than the training set. We think of this
informally as the algorithm’s variance .
Some changes to a learning algorithm can address the first component of error— bias —and
improve its performance on the training set. Some changes address the second
component— variance —and help it generalize better from the training set to the dev/test
sets. To select the most promising changes, it is incredibly useful to understand which of
these two components of error is more pressing to address.
Developing good intuition about Bias and Variance will help you choose effective changes for
your algorithm.
21 Examples of Bias and Variance
Consider our cat classification task. An “ideal” classifier (such as a human) might achieve
nearly perfect performance in this task.
Suppose your algorithm performs as follows:
• Training error = 1%
• Dev error = 11%
What problem does it have? Applying the definitions from the previous chapter, we estimate
the bias as 1%, and the variance as 10% (=11%-1%). Thus, it has high variance . The
classifier has very low training error, but it is failing to generalize to the dev set. This is also
called overfitting .
Now consider this:
• Training error = 15%
• Dev error = 16%
We estimate the bias as 15%, and variance as 1%. This classifier is fitting the training set
poorly with 15% error, but its error on the dev set is barely higher than the training error.
This classifier therefore has high bias , but low variance. We say that this algorithm is
underfitting .
Now, consider this:
• Training error = 15%
• Dev error = 30%
We estimate the bias as 15%, and variance as 15%. This classifier has high bias and high
variance : It is doing poorly on the training set, and therefore has high bias, and its
performance on the dev set is even worse, so it also has high variance. The
overfitting/underfitting terminology is hard to apply here since the classifier is
simultaneously overfitting and underfitting.
Finally, consider this:
• Training error = 0.5%
• Dev error = 1%
This classifier is doing well, as it has low bias and low variance. Congratulations on achieving
this great performance!

22 Compare to the optimal error rate (Bayes error)


In our cat recognition example, the “ideal” error rate—that is, one achievable by an “optimal”
classifier—is nearly 0%. A human looking at a picture would be able to recognize if it contains a cat
almost all the time; thus, we can hope for a machine that would do just as well. Other problems are
harder. For example, suppose that you are building a speech recognition system, and find that 14% of
the audio clips have so much background noise or are so unintelligible that even a human cannot
recognize what was said. In this case, even the most “optimal” speech recognition system might have
error around 14%. Suppose that on this speech recognition problem, your algorithm achieves:
• Training error = 15%
• Dev error = 30%
The training set performance is already close to the optimal error rate of 14%. Thus, there is not much
room for improvement in terms of bias or in terms of training set performance. However, this
algorithm is not generalizing well to the dev set; thus there is ample room for improvement in the
errors due to variance. This example is similar to the third example from the previous chapter, which
also had a training error of 15% and dev error of 30%. If the optimal error rate is ~0%, then a training
error of 15% leaves much room for improvement. This suggests bias-reducing changes might be
fruitful. But if the optimal error rate is 14%, then the same training set performance tells us that
there’s little room for improvement in the classifier’s bias.
For problems where the optimal error rate is far from zero, here’s a more detailed breakdown of an
algorithm’s error. Continuing with our speech recognition example above, the total dev set error of
30% can be broken down as follows (a similar analysis can be applied to the test set error):
• Optimal error rate (“unavoidable bias”) : 14%. Suppose we decide that, even with the best possible
speech system in the world, we would still suffer 14% error. We can think of this as the “unavoidable”
part of a learning algorithm’s bias.
• Avoidable bias : 1%. This is calculated as the difference between the training error and the optimal
error rate.
• Variance : 15%. The difference between the dev error and the training error.
To relate this to our earlier definitions, Bias and Avoidable Bias are related as follows:
Bias = Optimal error rate (“unavoidable bias”) + Avoidable bias
The “avoidable bias” reflects how much worse your algorithm performs on the training set than the
“optimal classifier.”
The concept of variance remains the same as before. In theory, we can always reduce variance to
nearly zero by training on a massive training set. Thus, all variance is “avoidable” with a sufficiently
large dataset, so there is no such thing as “unavoidable variance.”
Consider one more example, where the optimal error rate is 14%, and we have:
• Training error = 15%
• Dev error = 16%
Whereas in the previous chapter we called this a high bias classifier, now we would say that error from
avoidable bias is 1%, and the error from variance is about 1%. Thus, the algorithm is already doing
well, with little room for improvement. It is only 2% worse than the optimal error rate.
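These breakdowns are simple enough to compute directly; a small sketch using the numbers from the examples above:

def error_breakdown(train_error, dev_error, optimal_error):
    avoidable_bias = train_error - optimal_error
    variance = dev_error - train_error
    return avoidable_bias, variance

# speech recognition examples with optimal (Bayes) error of 14%
print(error_breakdown(0.15, 0.30, 0.14))   # ≈ (0.01, 0.15): work on variance
print(error_breakdown(0.15, 0.16, 0.14))   # ≈ (0.01, 0.01): little room left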
We see from these examples that knowing the optimal error rate is helpful for guiding our
next steps. In statistics, the optimal error rate is also called Bayes error rate , or Bayes rate.
How do we know what the optimal error rate is? For tasks that humans are reasonably good
at, such as recognizing pictures or transcribing audio clips, you can ask a human to provide
labels then measure the accuracy of the human labels relative to your training set. This
would give an estimate of the optimal error rate. If you are working on a problem that even
humans have a hard time solving (e.g., predicting what movie to recommend, or what ad to
show to a user) it can be hard to estimate the optimal error rate.
In the section “Comparing to Human-Level Performance” (Chapters 33 to 35), I will discuss
in more detail the process of comparing a learning algorithm’s performance to human-level performance.
In the last few chapters, you learned how to estimate avoidable/unavoidable bias and
variance by looking at training and dev set error rates. The next chapter will discuss how you
can use insights from such an analysis to prioritize techniques that reduce bias vs.
techniques that reduce variance. There are very different techniques that you should apply
depending on whether your project’s current problem is high (avoidable) bias or high variance.

24 Bias vs. Variance tradeoff


You might have heard of the “Bias vs. Variance tradeoff.” Of the changes you could make to most
learning algorithms, there are some that reduce bias errors but at the cost of increasing variance, and
vice versa. This creates a “trade off” between bias and variance.
For example, increasing the size of your model—adding neurons/layers in a neural network, or adding
input features—generally reduces bias but could increase variance. Alternatively, adding
regularization generally increases bias but reduces variance.
In the modern era, we often have access to plentiful data and can use very large neural networks (deep
learning). Therefore, there is less of a tradeoff, and there are now more options for reducing bias
without hurting variance, and vice versa.
For example, you can usually increase a neural network size and tune the regularization method to
reduce bias without noticeably increasing variance. By adding training data, you
can also usually reduce variance without affecting bias. If you select a model architecture that is well
suited for your task, you might also reduce bias
and variance simultaneously. Selecting such an architecture can be difficult.
25 Techniques for reducing avoidable bias
If your learning algorithm suffers from high avoidable bias, you might try the following
techniques:
• Increase the model size (such as number of neurons/layers): This technique reduces
bias, since it should allow you to fit the training set better. If you find that this increases
variance, then use regularization, which will usually eliminate the increase in variance.
• Modify input features based on insights from error analysis : Say your error
analysis inspires you to create additional features that help the algorithm eliminate a
particular category of errors. (We discuss this further in the next chapter.) These new
features could help with both bias and variance. In theory, adding more features could
increase the variance; but if you find this to be the case, then use regularization, which will
usually eliminate the increase in variance.
• Reduce or eliminate regularization (L2 regularization, L1 regularization, dropout):
This will reduce avoidable bias, but increase variance.
• Modify model architecture (such as neural network architecture) so that it is more
suitable for your problem: This technique can affect both bias and variance.
One method that is not helpful:
• Add more training data : This technique helps with variance problems, but it usually
has no significant effect on bias.

27 Techniques for reducing variance


If your learning algorithm suffers from high variance, you might try the following
techniques:
• Add more training data : This is the simplest and most reliable way to address variance, so long as
you have access to significantly more data and enough computational power to process the data.
• Add regularization (L2 regularization, L1 regularization, dropout): This technique reduces variance
but increases bias.
• Add early stopping (i.e., stop gradient descent early, based on dev set error): This technique
reduces variance but increases bias. Early stopping behaves a lot like regularization methods, and
some authors call it a regularization technique.
• Feature selection to decrease number/type of input features: This technique might help with
variance problems, but it might also increase bias. Reducing the number of features slightly (say going
from 1,000 features to 900) is unlikely to have a huge effect on bias. Reducing it significantly (say
going from 1,000 features to 100—a 10x reduction) is more likely to have a significant effect, so long
as you are not excluding too many useful features. In modern deep learning, when data is plentiful,
there has been a shift away from feature selection, and we are now more likely to give all the features
we have to the algorithm and let the algorithm sort out which ones to use based on the data. But when
your training set is small, feature selection can be very useful.
• Decrease the model size (such as number of neurons/layers): Use with caution. This technique
could decrease variance, while possibly increasing bias. However, I don’t recommend this technique
for addressing variance. Adding regularization usually gives better classification performance. The
advantage of reducing the model size is reducing your computational cost and thus speeding up how
quickly you can train models. If speeding up model training is useful, then by all means consider
decreasing the model size. But if your goal is to reduce variance, and you are not concerned about the
computational cost, consider adding regularization instead.
Here are two additional tactics, repeated from the previous chapter on addressing bias:
• Modify input features based on insights from error analysis
• Modify model architecture
These techniques can affect both bias and variance.
28 Diagnosing bias and variance: Learning
curves
We’ve seen some ways to estimate how much error can be attributed to avoidable bias vs.
variance. We did so by estimating the optimal error rate and computing the algorithm’s
training set and dev set errors. Let’s discuss a technique that is even more informative:
plotting a learning curve.
A learning curve plots your dev set error against
the number of training examples. To plot it,
you would run your algorithm using different
training set sizes. For example, if you have
1,000 examples, you might train separate copies
of the algorithm on 100, 200, 300, …, 1000
examples. Then you could plot how dev set error
varies with the training set size. Here is an
example: 

As the training set size increases, the dev set error should decrease.
We will often have some “desired error rate” that we hope our learning algorithm will
eventually achieve. For example:
• If we hope for human-level performance, then the human error rate could be the “desired
error rate.”
• If our learning algorithm serves some product (such as delivering cat pictures), we might
have an intuition about what level of performance is needed to give users a great experience.
• If you have worked on an important application for a long time, then you might have
intuition about how much more progress you can reasonably make in the next quarter/year.
Add the desired level of performance to your learning curve: 

You can visually extrapolate the red “dev error” curve to guess how much closer you could
get to the desired level of performance by adding more data. In the example above, it looks
plausible that doubling the training set size might allow you to reach the desired performance.
But if the dev error curve has “plateaued” (i.e. flattened out), then you can immediately tell
that adding more data won’t get you to your goal:

Looking at the learning curve might therefore help you avoid spending months collecting
twice as much training data, only to realize it does not help.
One downside of this process is that if you only look at the dev error curve, it can be hard to
extrapolate and predict exactly where the red curve will go if you had more data. There is one
additional plot that can help you estimate the impact of adding more data: the training error.
29 Plotting training error
Your dev set (and test set) error should decrease as the training set size grows. But your
training set error usually increases as the training set size grows.
Let’s illustrate this effect with an example. Suppose your training set has only 2 examples:
One cat image and one non-cat image. Then it is easy for the learning algorithms to
“memorize” both examples in the training set, and get 0% training set error. Even if either or
both of the training examples were mislabeled, it is still easy for the algorithm to memorize
both labels.
Now suppose your training set has 100 examples. Perhaps even a few examples are
mislabeled, or ambiguous—some images are very blurry, so even humans cannot tell if there
is a cat. Perhaps the learning algorithm can still “memorize” most or all of the training set,
but it is now harder to obtain 100% accuracy. By increasing the training set from 2 to 100
examples, you will find that the training set accuracy will drop slightly.
Finally, suppose your training set has 10,000
examples. In this case, it becomes even harder
for the algorithm to perfectly fit all 10,000
examples, especially if some are ambiguous or
mislabeled. Thus, your learning algorithm will do
even worse on this training set.
Let’s add a plot of training error to our earlier
figures: 

You can see that the blue “training error” curve increases with the size of the training set.
Furthermore, your algorithm usually does better on the training set than on the dev set; thus
the red dev error curve usually lies strictly above the blue training error curve.
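A minimal sketch of producing such a plot, assuming scikit-learn and matplotlib; the data here is just a random placeholder, so the curves will not look like the ones described:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# placeholder data; replace with your real training and dev sets
X_train, y_train = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)
X_dev,   y_dev   = np.random.rand(300, 20),  np.random.randint(0, 2, 300)

sizes, train_errors, dev_errors = range(100, 1001, 100), [], []
for m in sizes:
    # train on the first m examples only, then measure both errors
    clf = LogisticRegression(max_iter=1000).fit(X_train[:m], y_train[:m])
    train_errors.append(1 - accuracy_score(y_train[:m], clf.predict(X_train[:m])))
    dev_errors.append(1 - accuracy_score(y_dev, clf.predict(X_dev)))

plt.plot(sizes, train_errors, 'b', label='training error')
plt.plot(sizes, dev_errors, 'r', label='dev error')
plt.xlabel('training set size'); plt.ylabel('error'); plt.legend(); plt.show()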
Let’s discuss next how to interpret these plots.

30 Interpreting learning curves: High bias


We previously said that, if your dev error curve plateaus,
you are unlikely to achieve the desired performance just
by adding data. But it is hard to know exactly what an
extrapolation of the red dev error curve will look like. If
the dev set was small, you would be even less certain
because the curves could be noisy. Suppose we add the
training error curve to this plot and get the following:

Now, you can be absolutely sure that adding more data will not, by itself, be sufficient. Why is that?
Remember our two observations:
• As we add more training data, training error can only get worse. Thus, the blue training error curve
can only stay the same or go higher, and thus it can only get further away from the (green line) level of
desired performance.
• The red dev error curve is usually higher than the blue training error. Thus, there’s almost no way
that adding more data would allow the red dev error curve to drop down to the desired level of
performance when even the training error is higher than the desired level of performance.
Examining both the dev error curve and the training error curve on the same plot allows us to more
confidently extrapolate the dev error curve. Suppose, for the sake of discussion, that the desired
performance is our estimate of the optimal error rate. The figure above is then the standard “textbook”
example of what a learning curve with high avoidable bias looks like: At the largest training set size—
presumably corresponding to all the training data we have—there is a large gap between the training
error and the desired performance, indicating large avoidable bias.
Furthermore, the gap between the training and dev curves is small, indicating small variance.
Previously, we were measuring training and dev set error only at the rightmost point of this
plot, which corresponds to using all the available training data. Plotting the full learning
curve gives us a more comprehensive picture of the algorithms’ performance on different
training set sizes.
31 Interpreting learning curves: Other cases
Consider this learning curve:
[Figure: low training error curve with a much higher dev error curve]
Does this plot indicate high bias, high variance, or both?
The blue training error curve is relatively low, and the red dev error curve is much higher than the blue training error. Thus, the bias is small, but the variance is large. Adding more training data will probably help close the gap between dev error and training error.
Now, consider this:
[Figure: both the training error and dev error curves are well above the desired level of performance]
This time, the training error is large, as it is much
higher than the desired level of
performance. The dev error is also much larger
than the training error. Thus, you have
significant bias and significant variance. You will
have to find a way to reduce both bias and
variance in your algorithm.
32 Plotting learning curves
Suppose you have a very small training set of 100 examples. You train your algorithm using a
randomly chosen subset of 10 examples, then 20 examples, then 30, up to 100, increasing
the number of examples by intervals of ten. You then use these 10 data points to plot your
learning curve. You might find that the curve looks slightly noisy (meaning that the values
are higher/lower than expected) at the smaller training set sizes.
When training on just 10 randomly chosen examples, you might be unlucky and have a
particularly “bad” training set, such as one with many ambiguous/mislabeled examples. Or,
you might get lucky and get a particularly “good” training set. Having a small training set
means that the dev and training errors may randomly fluctuate.
If your machine learning application is heavily skewed toward one class (such as a cat
classification task where the fraction of negative examples is much larger than positive
examples), or if it has a huge number of classes (such as recognizing 100 different animal
species), then the chance of selecting an especially “unrepresentative” or bad training set is
also larger. For example, if 80% of your examples are negative examples (y=0), and only
20% are positive examples (y=1), then there is a chance that a training set of 10 examples
contains only negative examples, thus making it very difficult for the algorithm to learn
something meaningful.
If the noise in the training curve makes it hard to see the true trends, here are two solutions:
• Instead of training just one model on 10 examples, instead select several (say 3-10)
different randomly chosen training sets of 10 examples by sampling with replacement10
from your original set of 100. Train a different model on each of these, and compute the
training and dev set error of each of the resulting models. Compute and plot the average
training error and average dev set error.
• If your training set is skewed towards one class, or if it has many classes, choose a
“balanced” subset instead of 10 training examples at random out of the set of 100. For
example, you can make sure that 2/10 of the examples are positive examples, and 8/10 are negative.
More generally, you can make sure the fraction of examples from each class is as close as possible to
the overall fraction in the original training set.
I would not bother with either of these techniques unless you have already tried plotting
learning curves and concluded that the curves are too noisy to see the underlying trends. If
your training set is large—say over 10,000 examples—and your class distribution is not very
skewed, you probably won’t need these techniques.
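If you do need the first of the two techniques above (averaging over several training sets sampled with replacement), a minimal sketch might look like this, again with a generic scikit-learn classifier and placeholder arrays:

import numpy as np
from sklearn.linear_model import LogisticRegression

def avg_errors(X_train, y_train, X_dev, y_dev, m, n_repeats=5, seed=0):
    """Train on several size-m subsets sampled with replacement and return the
    average training and dev error, which smooths out the noise in the curve."""
    rng = np.random.default_rng(seed)
    train_errs, dev_errs = [], []
    for _ in range(n_repeats):
        idx = rng.choice(len(X_train), size=m, replace=True)   # sample with replacement
        model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
        train_errs.append(1.0 - model.score(X_train[idx], y_train[idx]))
        dev_errs.append(1.0 - model.score(X_dev, y_dev))
    return np.mean(train_errs), np.mean(dev_errs)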
Finally, plotting a learning curve may be computationally expensive: For example, you might
have to train ten models with 1,000, then 2,000, all the way up to 10,000 examples. Training
models with small datasets is much faster than training models with large datasets. Thus,
instead of evenly spacing out the training set sizes on a linear scale as above, you might train
models with 1,000, 2,000, 4,000, 6,000, and 10,000 examples. This should still give you a
clear sense of the trends in the learning curves. Of course, this technique is relevant only if
the computational cost of training all the additional models is significant.
33 Why we compare to human-level
performance
Many machine learning systems aim to automate things that humans do well. Examples
include image recognition, speech recognition, and email spam classification. Learning
algorithms have also improved so much that we are now surpassing human-level
performance on more and more of these tasks.
Further, there are several reasons building an ML system is easier if you are trying to do a
task that people can do well:
1. Ease of obtaining data from human labelers. For example, since people recognize
cat images well, it is straightforward for people to provide high accuracy labels for your
learning algorithm.
2. Error analysis can draw on human intuition. Suppose a speech recognition
algorithm is doing worse than human-level recognition. Say it incorrectly transcribes an
audio clip as “This recipe calls for a pear of apples,” mistaking “pair” for “pear.” You can
draw on human intuition and try to understand what information a person uses to get the
correct transcription, and use this knowledge to modify the learning algorithm.
3. Use human-level performance to estimate the optimal error rate and also set
a “desired error rate.” Suppose your algorithm achieves 10% error on a task, but a person
achieves 2% error. Then we know that the optimal error rate is 2% or lower and the
avoidable bias is at least 8%. Thus, you should try bias-reducing techniques.
Even though item #3 might not sound important, I find that having a reasonable and
achievable target error rate helps accelerate a team’s progress. Knowing your algorithm has
high avoidable bias is incredibly valuable and opens up a menu of options to try.
There are some tasks that even humans aren’t good at. For example, picking a book to
recommend to you; or picking an ad to show a user on a website; or predicting the stock
market. Computers already surpass the performance of most people on these tasks. With
these applications, we run into the following problems:
• It is harder to obtain labels. For example, it’s hard for human labelers to annotate a
database of users with the “optimal” book recommendation. If you operate a website or
app that sells books, you can obtain data by showing books to users and seeing what they
buy. If you do not operate such a site, you need to find more creative ways to get data.
• Human intuition is harder to count on. For example, pretty much no one can
predict the stock market. So if our stock prediction algorithm does no better than random
guessing, it is hard to figure out how to improve it.
• It is hard to know what the optimal error rate and reasonable desired error
rate is. Suppose you already have a book recommendation system that is doing quite
well. How do you know how much more it can improve without a human baseline?
34 How to define human-level performance
Suppose you are working on a medical imaging application that automatically makes
diagnoses from x-ray images. A typical person with no previous medical background besides
some basic training achieves 15% error on this task. A junior doctor achieves 10% error. An
experienced doctor achieves 5% error. And a small team of doctors that discuss and debate
each image achieves 2% error. Which one of these error rates defines “human-level
performance”?
In this case, I would use 2% as the human-level performance proxy for our optimal error
rate. You can also set 2% as the desired performance level because all three reasons from the
previous chapter for comparing to human-level performance apply:
• Ease of obtaining labeled data from human labelers. You can get a team of doctors
to provide labels to you with a 2% error rate.
• Error analysis can draw on human intuition. By discussing images with a team of
doctors, you can draw on their intuitions.
• Use human-level performance to estimate the optimal error rate and also set
achievable “desired error rate.” It is reasonable to use 2% error as our estimate of the
optimal error rate. The optimal error rate could be even lower than 2%, but it cannot be
higher, since it is possible for a team of doctors to achieve 2% error. In contrast, it is not
reasonable to use 5% or 10% as an estimate of the optimal error rate, since we know these
estimates are necessarily too high.
When it comes to obtaining labeled data, you might not want to discuss every image with an
entire team of doctors since their time is expensive. Perhaps you can have a single junior
doctor label the vast majority of cases and bring only the harder cases to more experienced
doctors or to the team of doctors.
If your system is currently at 40% error, then it doesn’t matter much whether you use a
junior doctor (10% error) or an experienced doctor (5% error) to label your data and provide
intuitions. But if your system is already at 10% error, then defining the human-level
reference as 2% gives you better tools to keep improving your system.
35 Surpassing human-level performance
You are working on speech recognition and have a dataset of audio clips. Suppose your dataset has
many noisy audio clips so that even humans have 10% error. Suppose your system already achieves
8% error. Can you use any of the three techniques described in Chapter 33 to continue making rapid
progress? If you can identify a subset of data in which humans significantly surpass your system, then
you can still use those techniques to drive rapid progress. For example, suppose your system is much
better than people at recognizing speech in noisy audio, but humans are still better at transcribing
very rapidly spoken speech. For the subset of data with rapidly spoken speech:
1. You can still obtain transcripts from humans that are higher quality than your algorithm’s output.
2. You can draw on human intuition to understand why they correctly heard a rapidly spoken
utterance when your system didn’t.
3. You can use human-level performance on rapidly spoken speech as a desired performance target.
More generally, so long as there are dev set examples where humans are right and your algorithm is
wrong, then many of the techniques described earlier will apply. This is true even if, averaged over the
entire dev/test set, your performance is already surpassing human-level performance.
There are many important machine learning applications where machines surpass human level
performance. For example, machines are better at predicting movie ratings, how long it takes for a
delivery car to drive somewhere, or whether to approve loan applications. Only a subset of techniques
apply once humans have a hard time identifying examples that the algorithm is clearly getting wrong.
Consequently, progress is usually slower on problems where machines already surpass human-level
performance, while progress is faster when machines are still trying to catch up to humans.
36 When you should train and test on
different distributions
Users of your cat pictures app have uploaded 10,000 images, which you have manually
labeled as containing cats or not. You also have a larger set of 200,000 images that you
downloaded off the internet. How should you define train/dev/test sets?
Since the 10,000 user images closely reflect the actual probability distribution of data you
want to do well on, you might use that for your dev and test sets. If you are training a
data-hungry deep learning algorithm, you might give it the additional 200,000 internet
images for training. Thus, your training and dev/test sets come from different probability
distributions. How does this affect your work?
Instead of partitioning our data into train/dev/test sets, we could take all 210,000 images we
have, and randomly shuffle them into train/dev/test sets. In this case, all the data comes
from the same distribution. But I recommend against this method, because about
205,000/210,000 ≈ 97.6% of your dev/test data would come from internet images, which
does not reflect the actual distribution you want to do well on. Remember our
recommendation on choosing dev/test sets:
Choose dev and test sets to reflect data you expect to get in the future
and want to do well on.
Most of the academic literature on machine learning assumes that the training set, dev set
and test set all come from the same distribution. In the early days of machine learning, data
was scarce. We usually only had one dataset drawn from some probability distribution. So
we would randomly split that data into train/dev/test sets, and the assumption that all the
data was coming from the same source was usually satisfied.
But in the era of big data, we now have access to huge training sets, such as cat internet
images. Even if the training set comes from a different distribution than the dev/test set, we
still want to use it for learning since it can provide a lot of information.
For the cat detector example, instead of putting all 10,000 user-uploaded images into the
dev/test sets, we might instead put 5,000 into the dev/test sets. We can put the remaining
5,000 user-uploaded examples into the training set. This way, your training set of 205,000
examples contains some data that comes from your dev/test distribution along with the
200,000 internet images. We will discuss in a later chapter why this method is helpful.
Let’s consider a second example. Suppose you are building a speech recognition system to
transcribe street addresses for a voice-controlled mobile map/navigation app. You have
20,000 examples of users speaking street addresses. But you also have 500,000 examples of
other audio clips with users speaking about other topics. You might take 10,000 examples of
street addresses for the dev/test sets, and use the remaining 10,000, plus the additional
500,000 examples, for training.
We will continue to assume that your dev data and your test data come from the same
distribution. But it is important to understand that different training and dev/test
distributions offer some special challenges.
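As a small illustration of the cat-app split described above (5,000 user images into the training set, 2,500 each into dev and test), here is one way it could be wired up; the array names are hypothetical and not part of the original text:

import numpy as np

def make_splits(user_X, user_y, web_X, web_y, seed=0):
    """Dev/test come only from user (mobile) images, the distribution we care about,
    while the training set mixes the 200,000 internet images with the remaining
    5,000 user images."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(user_X))
    dev_idx, test_idx, train_idx = idx[:2500], idx[2500:5000], idx[5000:]
    X_train = np.concatenate([web_X, user_X[train_idx]])
    y_train = np.concatenate([web_y, user_y[train_idx]])
    return (X_train, y_train,
            user_X[dev_idx], user_y[dev_idx],     # dev set
            user_X[test_idx], user_y[test_idx])   # test set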
37 How to decide whether to use all your data
Suppose your cat detector’s training set includes 10,000 user-uploaded images. This data comes from
the same distribution as a separate dev/test set, and represents the distribution you care about doing
well on. You also have an additional 20,000 images downloaded from the internet. Should you
provide all 20,000+10,000=30,000 images to your learning algorithm as its training set, or discard
the 20,000 internet images for fear of it biasing your learning algorithm?
When using earlier generations of learning algorithms (such as hand-designed computer vision
features, followed by a simple linear classifier) there was a real risk that merging both types of data
would cause you to perform worse. Thus, some engineers will warn you against including the 20,000
internet images. But in the modern era of powerful, flexible learning algorithms—such as large neural
networks—this risk has greatly diminished. If you can afford to build a neural network with a large
enough number of hidden units/layers, you can safely add the 20,000 images to your training set.
Adding the images is more likely to increase your performance. This observation relies on the fact that
there is some x —> y mapping that works well for both types of data. In other words, there exists some
system that inputs either an internet image or a mobile app image and reliably predicts the label, even
without knowing the source of the image.
Adding the additional 20,000 images has the following effects:
1. It gives your neural network more examples of what cats do/do not look like. This is helpful, since
internet images and user-uploaded mobile app images do share some similarities. Your neural
network can apply some of the knowledge acquired from internet images to mobile app images.
2. It forces the neural network to expend some of its capacity to learn about properties that are
specific to internet images (such as higher resolution, different distributions of how the images are
framed, etc.) If these properties differ greatly from mobile app images, it will “use up” some of the
representational capacity of the neural network. Thus there is less capacity for recognizing data drawn
from the distribution of mobile app images, which is what you really care about. Theoretically, this
could hurt your algorithms’ performance.
To describe the second effect in different terms, we can turn to the fictional character Sherlock
Holmes, who says that your brain is like an attic; it only has a finite amount of space. He says that “for
every addition of knowledge, you forget something that you knew before. It is of the highest
importance, therefore, not to have useless facts elbowing out the useful ones.”
Fortunately, if you have the computational capacity needed to build a big enough neural network—
i.e., a big enough attic—then this is not a serious concern. You have enough capacity to learn from
both internet and from mobile app images, without the two types of data competing for capacity. Your
algorithm’s “brain” is big enough that you don’t have to worry about running out of attic space.
But if you do not have a big enough neural network (or another highly flexible learning algorithm),
then you should pay more attention to your training data matching your dev/test set distribution.
If you think you have data that has no benefit,you should just leave out that data for computational
reasons. For example, suppose your dev/test sets contain mainly casual pictures of people, places,
landmarks, animals. Suppose you also have a large collection of scanned historical documents:
[Figure: examples of scanned historical document images]
These documents don’t contain anything resembling a cat. They also look completely unlike your dev/test distribution. There is no point
including this data as negative examples, because the benefit from the
first effect above is negligible—there is almost nothing your neural
network can learn from this data that it can apply to your dev/test set
distribution. Including them would waste computation resources and
representation capacity of the neural network.
38 How to decide whether to include
inconsistent data
Suppose you want to learn to predict housing prices in New York City. Given the size of a
house (input feature x), you want to predict the price (target label y).
Housing prices in New York City are very high. Suppose you have a second dataset of
housing prices in Detroit, Michigan, where housing prices are much lower. Should you
include this data in your training set?
Given the same size x, the price of a house y is very different depending on whether it is in
New York City or in Detroit. If you only care about predicting New York City housing prices,
putting the two datasets together will hurt your performance. In this case, it would be better
to leave out the inconsistent Detroit data.
How is this New York City vs. Detroit example different from the mobile app vs. internet cat
images example?
The cat image example is different because, given an input picture x, one can reliably predict
the label y indicating whether there is a cat, even without knowing if the image is an internet
image or a mobile app image. I.e., there is a function f(x) that reliably maps from the input x
to the target output y, even without knowing the origin of x. Thus, the task of recognition
from internet images is “consistent” with the task of recognition from mobile app images.
This means there was little downside (other than computational cost) to including all the
data, and some possible significant upside. In contrast, New York City and Detroit, Michigan
data are not consistent. Given the same x (size of house), the price is very different
depending on where the house is.
39 Weighting data (VERY IMPORTANT)
Suppose you have 200,000 images from the internet and 5,000 images from your mobile app users. There is a 40:1 ratio between the sizes of these datasets. In theory, so long as you build a huge neural network and train it long enough on all 205,000 images, there is no harm in trying to make the algorithm do well on both internet images and mobile images. But in practice, having 40x as many internet images as mobile app images might mean you need to spend 40x (or more) as much computational resources to model both, compared to if you trained on only the 5,000 images.
If you don’t have huge computational resources, you could give the internet images a much lower weight as a compromise.
For example, suppose your optimization objective is squared error (this is not a good choice for a classification task, but it will simplify our explanation). Thus, our learning algorithm tries to optimize:
min_θ  Σ_(x,y)∈Mobile (h_θ(x) − y)²  +  Σ_(x,y)∈Internet (h_θ(x) − y)²
The first sum above is over the 5,000 mobile images, and the second sum is over the 200,000 internet images. You can instead optimize with an additional parameter β:
min_θ  Σ_(x,y)∈Mobile (h_θ(x) − y)²  +  β · Σ_(x,y)∈Internet (h_θ(x) − y)²
If you set β = 1/40, the algorithm would give equal weight to the 5,000 mobile images and the 200,000 internet images. You can also set the parameter β to other values, perhaps by tuning to the
dev set. By weighting the additional Internet images less, you don’t have to build as massive a neural
network to make sure the algorithm does well on both types of tasks. This type of re-weighting is
needed only when you suspect the additional data (Internet Images) has a very different distribution
than the dev/test set, or if the additional data is much larger than the data that came from the same
distribution as the dev/test set (mobile images).
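A minimal sketch of this re-weighted squared-error objective in NumPy (the prediction arrays stand in for whatever model h you are using; β is the weighting parameter above):

import numpy as np

def weighted_squared_error(pred_mobile, y_mobile, pred_internet, y_internet, beta=1/40):
    """Squared error where the much larger internet portion is down-weighted by beta.
    beta = 1/40 gives the two data sources equal total weight when there are
    40x as many internet examples as mobile examples."""
    mobile_term = np.sum((pred_mobile - y_mobile) ** 2)
    internet_term = beta * np.sum((pred_internet - y_internet) ** 2)
    return mobile_term + internet_term

Many libraries expose the same idea through per-example weights, for example the sample_weight argument that many scikit-learn estimators accept in fit().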
40 Generalizing from the training set to the
dev set
Suppose you are applying ML in a setting where the training and the dev/test distributions
are different. Say, the training set contains Internet images + Mobile images, and the
dev/test sets contain only Mobile images. However, the algorithm is not working well: It has
a much higher dev/test set error than you would like. Here are some possibilities of what
might be wrong:
1. It does not do well on the training set. This is the problem of high (avoidable) bias on the
training set distribution.
2. It does well on the training set, but does not generalize well to previously unseen data
drawn from the same distribution as the training set . This is high variance.
3. It generalizes well to new data drawn from the same distribution as the training set, but
not to data drawn from the dev/test set distribution. We call this problem data
mismatch , since it is because the training set data is a poor match for the dev/test set
data.
For example, suppose that humans achieve near perfect performance on the cat recognition
task. Your algorithm achieves this:
• 1% error on the training set
• 1.5% error on data drawn from the same distribution as the training set that the algorithm
has not seen
• 10% error on the dev set
In this case, you clearly have a data mismatch problem. To address this, you might try to
make the training data more similar to the dev/test data. We discuss some techniques for
this later.
In order to diagnose to what extent an algorithm suffers from each of the problems 1-3
above, it will be useful to have another dataset. Specifically, rather than giving the algorithm
all the available training data, you can split it into two subsets: The actual training set which
the algorithm will train on, and a separate set, which we will call the “Training dev” set, that
we will not train on.
You now have four subsets of data:
• Training set. This is the data that the algorithm will learn from (e.g., Internet images +
Mobile images). This does not have to be drawn from the same distribution as what we
really care about (the dev/test set distribution).
• Training dev set: This data is drawn from the same distribution as the training set (e.g.,
Internet images + Mobile images). This is usually smaller than the training set; it only
needs to be large enough to evaluate and track the progress of our learning algorithm.
• Dev set: This is drawn from the same distribution as the test set, and it reflects the
distribution of data that we ultimately care about doing well on. (E.g., mobile images.)
• Test set: This is drawn from the same distribution as the dev set. (E.g., mobile images.)
Armed with these four separate datasets, you can now evaluate:
• Training error, by evaluating on the training set.
• The algorithm’s ability to generalize to new data drawn from the training set distribution,
by evaluating on the training dev set.
• The algorithm’s performance on the task you care about, by evaluating on the dev and/or
test sets.
Most of the guidelines in Chapters 5-7 for picking the size of the dev set also apply to the
training dev set.
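A small sketch of carving the training dev set out of the combined training data (the dev and test sets would be built separately from mobile-only data, as above); the names and the 5% fraction are just illustrative:

import numpy as np

def split_off_training_dev(train_X, train_y, frac=0.05, seed=0):
    """Hold out a small 'training dev' set from the (internet + mobile) training data;
    it is only used for evaluation, never for training."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(train_X))
    n_td = int(frac * len(train_X))
    td_idx, tr_idx = idx[:n_td], idx[n_td:]
    return train_X[tr_idx], train_y[tr_idx], train_X[td_idx], train_y[td_idx]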
41 Identifying Bias, Variance, and Data
Mismatch Errors
Suppose humans achieve almost perfect performance (≈0% error) on the cat detection task, and thus
the optimal error rate is about 0%. Suppose you have:
• 1% error on the training set.
• 5% error on the training dev set.
• 5% error on the dev set.
What does this tell you? Here, you know that you have high variance. The variance reduction
techniques described earlier should allow you to make progress. Now, suppose your algorithm
achieves:
• 10% error on the training set.
• 11% error on the training dev set.
• 12% error on the dev set.
This tells you that you have high avoidable bias on the training set. I.e., the algorithm is doing poorly
on the training set. Bias reduction techniques should help. In the two examples above, the algorithm
suffered from only high avoidable bias or high variance. It is possible for an algorithm to suffer from
any subset of high avoidable bias, high variance, and data mismatch. For example:
• 10% error on the training set.
• 11% error on the training dev set.
• 20% error on the dev set.
This algorithm suffers from high avoidable bias and from data mismatch. It does not, however, suffer
from high variance on the training set distribution. It might be easier to understand how the different
types of errors relate to each other by drawing them as entries in a table:

                                         Distribution A                Distribution B
                                         (Internet + Mobile images)    (Mobile images)
Human level error                        ≈ 0% ("Human level")
Error on examples the
algorithm has trained on                 Training error: 10%
Error on examples the
algorithm has NOT trained on             Training dev error: 11%       Dev/Test error: 20%

Continuing with the example of the cat image detector, you can see that there are two different distributions of data on the x-axis. On the y-axis, we have three types of error: human level error, error on examples the algorithm has trained on, and error on examples the algorithm has not trained on. We can fill in the boxes with the different types of errors we identified in the previous chapter.
If you wish, you can also fill in the remaining two boxes in this table:
You can fill in the upper-right box (Human level performance on Mobile Images) by asking some
humans to label your mobile cat images data and measure their error. You can fill in the next box by
taking the mobile cat images (Distribution B) and putting a small fraction of it into the training set so
that the neural network learns on it too. Then you measure the learned model’s error on that subset of
data. Filling in these two additional entries may sometimes give additional insight about what the
algorithm is doing on the two different distributions (Distribution A and B) of data.

By understanding which types of error the algorithm suffers from the most, you will be better
positioned to decide whether to focus on reducing bias, reducing variance, or reducing data
mismatch.
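The bookkeeping in this chapter can be wrapped in a tiny helper; the 2% threshold below is an arbitrary illustration, in practice you would simply eyeball the gaps:

def diagnose(human_err, train_err, train_dev_err, dev_err, tol=0.02):
    """Compare the gaps between the four error rates to see which problem dominates."""
    problems = []
    if train_err - human_err > tol:
        problems.append("high avoidable bias (training error far above human/optimal error)")
    if train_dev_err - train_err > tol:
        problems.append("high variance (poor generalization within the training distribution)")
    if dev_err - train_dev_err > tol:
        problems.append("data mismatch (training distribution is a poor match for dev/test)")
    return problems or ["no large gaps; consider improving the dev set or the metric"]

print(diagnose(0.00, 0.10, 0.11, 0.20))   # -> avoidable bias + data mismatch, as in the third example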
47 The rise of end-to-end learning
Suppose you want to build a system to examine online product reviews and automatically tell you if
the writer liked or disliked that product. For example, you hope to recognize the following review as
highly positive: “This is a great mop!”
and the following as highly negative: “This mop is low quality--I regret buying it.”
The problem of recognizing positive vs. negative opinions is called “sentiment classification.”
To build this system, you might build a “pipeline” of two components:
1. Parser: A system that annotates the text with information identifying the most important words. For example, you might use the parser to label all the adjectives and nouns. You would therefore get the following annotated text: “This is a great_Adjective mop_Noun!”
2. Sentiment classifier: A learning algorithm that takes as input the annotated text and predicts the
overall sentiment. The parser’s annotation could help this learning algorithm greatly: By giving
adjectives a higher weight, your algorithm will be able to quickly hone in on the important words such
as “great,” and ignore less important words such as “this.”
We can visualize your “pipeline” of two components as follows:
[Figure: raw text → Parser → Sentiment classifier → sentiment]
There has been a recent trend toward replacing pipeline systems with a single learning
algorithm. An end-to-end learning algorithm for this task would simply take as input
the raw, original text “This is a great mop!”, and try to directly recognize the sentiment:
[Figure: raw text → learning algorithm → sentiment]
Neural networks are commonly used in end-to-end learning systems. The term “end-to-end”
refers to the fact that we are asking the learning algorithm to go directly from the input to
the desired output. I.e., the learning algorithm directly connects the “input end” of the
system to the “output end.”
In problems where data is abundant, end-to-end systems have been remarkably successful.
But they are not always a good choice. The next few chapters will give more examples of
end-to-end systems as well as give advice on when you should and should not use them.
48 More end-to-end learning examples
Suppose you want to build a speech recognition system. You might build a system with three
components:
[Figure: audio → Compute features (MFCC) → Phoneme recognizer → Final recognizer → transcript]
The components work as follows:
1. Compute features: Extract hand-designed features, such as MFCC ( Mel-frequency
cepstrum coefficients) features, which try to capture the content of an utterance while
disregarding less relevant properties, such as the speaker’s pitch.
2. Phoneme recognizer: Some linguists believe that there are basic units of sound called
“phonemes.” For example, the initial “k” sound in “keep” is the same phoneme as the “c”
sound in “cake.” This system tries to recognize the phonemes in the audio clip.
3. Final recognizer: Take the sequence of recognized phonemes, and try to string them
together into an output transcript.
In contrast, an end-to-end system might input an audio clip, and try to directly output the transcript:
[Figure: audio → learning algorithm → transcript]
So far, we have only described machine learning “pipelines” that are completely linear: the output is sequentially passed from one stage to the next. Pipelines can be more complex. For example, here is a simple architecture for an autonomous car:
[Figure: camera images → Detect cars / Detect pedestrians → Plan path → steering]
It has three components: One detects other cars using the camera images; one detects
pedestrians; then a final component plans a path for our own car that avoids the cars and
pedestrians.
Not every component in a pipeline has to be learned. For example, the literature on “robot
motion planning” has numerous algorithms for the final path planning step for the car. Many
of these algorithms do not involve learning.
In contrast, an end-to-end approach might try to take in the sensor inputs and directly output the steering direction:
[Figure: sensor inputs → learning algorithm → steering direction]
Even though end-to-end learning has seen many successes, it is not always the best
approach. For example, end-to-end speech recognition works well. But I’m skeptical about
end-to-end learning for autonomous driving. The next few chapters explain why.
49 Pros and cons of end-to-end learning
Consider the same speech pipeline from our earlier example:
[Figure: audio → Compute features (MFCC) → Phoneme recognizer → Final recognizer → transcript]
Many parts of this pipeline were “hand-engineered”:
• MFCCs are a set of hand-designed audio features. Although they provide a reasonable summary of
the audio input, they also simplify the input signal by throwing some information away.
• Phonemes are an invention of linguists. They are an imperfect representation of speech sounds. To
the extent that phonemes are a poor approximation of reality, forcing an algorithm to use a phoneme
representation will limit the speech system’s performance. These hand-engineered components limit
the potential performance of the speech system.
However, allowing hand-engineered components also has some advantages:
• The MFCC features are robust to some properties of speech that do not affect the content, such as
speaker pitch. Thus, they help simplify the problem for the learning algorithm.
• To the extent that phonemes are a reasonable representation of speech, they can also help the
learning algorithm understand basic sound components and therefore improve its performance.

Having more hand-engineered components generally allows a speech system to learn with less data.
The hand-engineered knowledge captured by MFCCs and phonemes “supplements” the knowledge
our algorithm acquires from data. When we don’t have much data, this knowledge is useful.
Now, consider the end-to-end system:
[Figure: audio → learning algorithm → transcript]
This system lacks the hand-engineered knowledge. Thus, when the training set is small, it might do
worse than the hand-engineered pipeline.
However, when the training set is large, then it is not hampered by the limitations of an MFCC or
phoneme-based representation. If the learning algorithm is a large-enough neural network and if it is
trained with enough training data, it has the potential to do very well, and perhaps even approach the
optimal error rate.
End-to-end learning systems tend to do well when there is a lot of labeled data for “both ends”—the
input end and the output end. In this example, we require a large dataset of (audio, transcript) pairs.
When this type of data is not available, approach end-to-end learning with great caution.
If you are working on a machine learning problem where the training set is very small, most of your
algorithm’s knowledge will have to come from your human insight. I.e., from your “hand engineering”
components.
If you choose not to use an end-to-end system, you will have to decide what are the steps in your
pipeline, and how they should plug together. In the next few chapters, we’ll give some suggestions for
designing such pipelines.
50 Choosing pipeline components: Data
availability
When building a non-end-to-end pipeline system, what are good candidates for the
components of the pipeline? How you design the pipeline will greatly impact the overall
system’s performance. One important factor is whether you can easily collect data to train
each of the components.
For example, consider this autonomous driving architecture:
[Figure: camera images → Detect cars / Detect pedestrians → Plan path]
You can use machine learning to detect cars and pedestrians. Further, it is not hard to obtain
data for these: There are numerous computer vision datasets with large numbers of labeled
cars and pedestrians. You can also use crowdsourcing (such as Amazon Mechanical Turk) to
obtain even larger datasets. It is thus relatively easy to obtain training data to build a car
detector and a pedestrian detector.
In contrast, consider a pure end-to-end approach:
[Figure: image → learning algorithm → steering direction]
To train this system, we would need a large dataset of (Image, Steering Direction) pairs. It is
very time-consuming and expensive to have people drive cars around and record their
steering direction to collect such data. You need a fleet of specially-instrumented cars, and a
huge amount of driving to cover a wide range of possible scenarios. This makes an
end-to-end system difficult to train. It is much easier to obtain a large dataset of labeled car
or pedestrian images.
More generally, if there is a lot of data available for training “intermediate modules” of a
pipeline (such as a car detector or a pedestrian detector), then you might consider using a
pipeline with multiple stages. This structure could be superior because you could use all that
available data to train the intermediate modules.
Until more end-to-end data becomes available, I believe the non-end-to-end approach is
significantly more promising for autonomous driving: Its architecture better matches the
availability of data.
51 Choosing pipeline components: Task
simplicity
Other than data availability, you should also consider a second factor when picking
components of a pipeline: How simple are the tasks solved by the individual components?
You should try to choose pipeline components that are individually easy to build or learn.
But what does it mean for a component to be “easy” to learn?
[Figure: an example of an overexposed image]
Consider these machine learning tasks, listed in order of increasing difficulty:
1. Classifying whether an image is overexposed (like the example above)
2. Classifying whether an image was taken indoor or outdoor
3. Classifying whether an image contains a cat
4. Classifying whether an image contains a cat with both black and white fur
5. Classifying whether an image contains a Siamese cat (a particular breed of cat)
Each of these is a binary image classification task: You have to input an image, and output
either 0 or 1. But the tasks earlier in the list seem much “easier” for a neural network to
learn. You will be able to learn the easier tasks with fewer training examples.
Machine learning does not yet have a good formal definition of what makes a task easy or
hard. With the rise of deep learning and multi-layered neural networks, we sometimes say a
task is “easy” if it can be carried out with fewer computation steps (corresponding to a
shallow neural network), and “hard” if it requires more computation steps (requiring a
deeper neural network). But these are informal definitions.
If you are able to take a complex task, and break it down into simpler sub-tasks, then by
coding in the steps of the sub-tasks explicitly, you are giving the algorithm prior knowledge
that can help it learn a task more efficiently.

Suppose you are building a Siamese cat detector. This is the pure end-to-end architecture:
[Figure: image → learning algorithm → contains a Siamese cat? (0/1)]
In contrast, you can alternatively use a pipeline with two steps. The first step (cat detector) detects all the cats in the image.
[Figure: image → cat detector → cropped images of each detected cat]
The second step then passes cropped images of each of the detected cats (one at a time) to a cat species classifier, and finally outputs 1 if any of the cats detected is a Siamese cat.
[Figure: cropped cat image → cat species classifier → Siamese cat? (0/1)]
Compared to training a purely end-to-end classifier using just labels 0/1, each of the two
components in the pipeline--the cat detector and the cat breed classifier--seem much easier
to learn and will require significantly less data.

As one final example, let’s revisit the autonomous driving pipeline:
[Figure: camera images → Detect cars / Detect pedestrians → Plan path]
By using this pipeline, you are telling the algorithm that there are 3 key steps to driving: (1)
Detect other cars, (2) Detect pedestrians, and (3) Plan a path for your car. Further, each of
these is a relatively simpler function--and can thus be learned with less data--than the
purely end-to-end approach.
In summary, when deciding what should be the components of a pipeline, try to build a
pipeline where each component is a relatively “simple” function that can therefore be learned
from only a modest amount of data.
END OF PART 2
DEEP LEARNING
BASICS
PART 3
COMPUTER VISION
Computer Vision Introduction
What is computer vision?
1-Computer vision can be defined as “the theory and technology for building artificial systems
that obtain information from images or multi-dimensional data.”
A simpler explanation is that computer vision strives to solve the same problems you can solve
with your very own eyes.
2- Computer Vision is the broad parent name for any computations involving visual content –
that means images, videos, icons, and anything else with pixels involved.

How can we divide computer vision into main tasks?


Computer vision concerns mainly object recognition which can be subdivided into three
varieties:
1)Object classification: is where you have several previously learned objects that you want to be
able to recognize in an image. Classifying a portrait photo as having person’s face in it is an
example object classification — you’ve classified that this photo contains a face in it.
In other words we can say it concerns classifying new objects as belonging to one or more of
your trained categories.
2)Object identification: your model will recognize a specific instance of an object – for
example, parsing two faces in an image and tagging one as Tom Cruise and one as Katie
Holmes.
3)Object detection: It is the ability to identify that there’s an object or more in an image and
determines their position in the image.

What are technologies used in this field?
• Face detection: Haar, HOG, MTCNN, MobileNet
• Face recognition: CNN, FaceNet
• Object recognition: AlexNet, VGG-16, InceptionNet, ResNet
• Transfer learning: re-training a big neural network with little resources on a new topic
• Image segmentation: R-CNN and its versions
• GAN

Examples of real-world applications


self-driving cars, autonomous drones, filtering pictures for a picture based website, automatic
tagging of pictures for apps, pedestrian detection, face recognition, gesture recognition, optical
character recognition, augmented reality, digital video fingerprinting, iris recognition, detecting
industrial product defects, people counting, reverse image search, and more.
Course 4 Week 1
Computer Vision
Computer vision using deep learning techniques mainly relies on using convolutional neural
networks which will be the main outline of this course especially week 1.
Computer vision is one of the rapidly advancing areas that helps in many fields as self-driving
cars to figure out surrounding cars and pedestrians to avoid them. It also serve face recognition
tasks as in phones, homes and work places to work better than before.
Computer vision problems:
There are problems like image
classification, object
detection, neural style
transfer, object segmentation,
object localization, object
tracking and much more.

One of the challenges facing computer vision is that inputs can get really big so for a 1 mega
pixel image (which is actually pretty small comparing to what cameras output now) is actually a
1000 × 1000 × 3 = 3,000,000 (note that 3 here determines the 3 channels composing the image
which red, green and blue) features for just 1 training example which is pretty large amount of
data to be stored as one input. This means that w[1] is now an (n[1]×3,000,000) matrix, which means a very large number of learning parameters that will make the learning process harder
on the computation requirements, memory requirements and also in getting much data to
prevent overfitting with such large number of parameters. Here comes the convolution
operations which is the fundamental element of computer vision.
The earlier layers of the neural networks usually focus on lines and edges and as we go further
in the layer we are focusing on more complex shapes as curves then geometrical shapes as
triangles, circles, rectangles, etc followed by more complex shapes. As an example if we are
dealing with a face recognition problem so may be the early layers are dealing with edges of the
face and its components and then intermediate layers are dealing with eyes, noses, ears, eye
brows, mouths and so on, then the latter layers of the network deal with the faces as a whole, as shown in the example:
[Figure: features learned at successive layers: edges, then facial parts, then whole faces]
Let us focus on the earlier stages and see how convolution works on edge detection.
Convolutional operations: (edge detection as motivation example)
Assume having an example
like this, so edge detection is
usually done as:
1) vertical edge detection
2)horizontal edge detection
So how would we detect
edges in image like this?

Let's take a grey scale image (hence only 1 channel) as an example which is of size 6 × 6 and as
said before each pixel of those 36 pixels will have a value between 0 (black) →255 (white).
Let us do a vertical edge detection:
We need a matrix called a filter (or kernel), which we will assume here is a 3×3 filter having the following values:
 1  0  -1
 1  0  -1
 1  0  -1
Now we perform the convolution operation between the image and this filter (kernel), noting that the convolution operation is denoted by an asterisk (*).
How do we apply convolution to matrices?
We take the filter and start at the top-left corner of the image: compute the sum of the element-wise multiplication between the filter and that top-left region, and write the answer in the first position of the result. Then repeat by shifting the kernel to the right until you reach the end of the row, then shift downward and repeat, and so on until you reach the end of the image, as shown below:
[Figure: the 3×3 filter sliding over the 6×6 image, producing a 4×4 output one position at a time]
Another example:
[Figure: a 6×6 image whose left half is bright and right half is dark, convolved with the vertical edge filter; the 4×4 output shows a bright band in its middle columns]
You notice here that the vertical edge in the original image is translated into a brighter region
between the 2 darker regions to show that there is a vertical edge existed here.
If we flipped the original image horizontally, such that the dark is on the left and the bright is on the right, here will be the result:
[Figure: the flipped image convolved with the same filter gives a dark (negative-valued) band at the edge]
The difference between the two images that the first one was light to dark transition while the
second photo was a dark to light transition (moving from left to right (→)).

For horizontal edge detection we use the transpose of the filter used for vertical edge detection:
 1   1   1
 0   0   0
-1  -1  -1
Notes: (convolution properties)


1- Kernels can be any f×f kernel (f is usually an odd number), not just 3×3; the most common sizes are 3×3, 5×5, 7×7 & 9×9.
2- If n×n matrix convolved with f×f matrix, the result will be n-f+1 × n-f+1.
3- Convolution has linear, commutative, associative & distributive properties.
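The sliding multiply-and-sum described above, and the n−f+1 output size, can be reproduced in a few lines of NumPy (a rough sketch, not an optimized implementation; libraries provide much faster versions):

import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' convolution (really cross-correlation, as used in deep learning):
    slide the kernel over the image and sum the element-wise products."""
    n, _ = image.shape
    f, _ = kernel.shape
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(n - f + 1):
        for j in range(n - f + 1):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

# 6x6 image: bright (10) on the left, dark (0) on the right -> vertical edge in the middle
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
vertical_filter = np.array([[1, 0, -1],
                            [1, 0, -1],
                            [1, 0, -1]], dtype=float)
result = conv2d_valid(image, vertical_filter)
print(result.shape)   # (4, 4), i.e. n - f + 1 = 6 - 3 + 1
print(result)         # a band of 30s in the middle columns marks the vertical edge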

What is the best set of numbers to use in the kernel to perform vertical and horizontal detection?
It turned out that the Sobel filter is one of the best edge-detecting kernels; it puts more weight on the center, which makes it more robust.
Vertical:            Horizontal:
 1  0  -1             1   2   1
 2  0  -2             0   0   0
 1  0  -1            -1  -2  -1
But computer vision researchers wanted to exaggerate this even more, so they also use the Scharr filter:
Vertical:            Horizontal:
  3  0   -3            3   10    3
 10  0  -10            0    0    0
  3  0   -3           -3  -10   -3

Are these 2 filters really the best?


No, These were the best in the classical techniques but with the rise of deep learning
era we found out that we don't need to handpick the numbers but in fact we need to
make these filters get learned by learning algorithm to find more robust filters. 

Note: Edge detection is not only vertical or horizontal; it can also be done for inclined edges at other angles (e.g. 45° edges).
Padding
Motivation behind using padding
One down side of the convolution is that it will shrink the result so when entered a 6×6 image to
convolve with a filter of 3×3 size we ended up with a 4×4 image which means that our image
now shrunk from 6×6 to 4×4 which means that if we apply convolution several times on an
image it will end up being very small. Another downside is that some pixels are involved in the convolution less than others: for example, the top-left pixel of the image is involved only once in the convolution, while the pixels in the middle of the image are involved many more times.
This means that we throw away some information near edges by only
using them small number of times compared to the others.
Those problems are solved by padding.

Padding is adding a border of pixels around the image to increase the size of the image as
shown:
As we can notice here that we entered
6×6 image and ended with 6×6 also so
image is not shrunk in size and also now
the edges of the image are included more
often than before so we are not throwing
this information.
Note: Padding is done with zeros.

NOW, if we have n×n matrix convolved with f×f filter and padded by P so the result will be
n+2P-f+1 × n+2P-f+1.

Types of convolutions regarding padding


1)Valid padding: This means no padding. (P=0)
2)Same padding: This means to pad such that the input and output of the convolution are of the same size. (P = (f−1)/2)

Remember: Filters are usually using odd numbers.


Stride
Stride means the step you take when convolving so usually we were taking stride = 1 as default
so the filter was moving 1 step each time as shown:
If stride = 1:                      If stride = 2:
[Figure: the filter moving 1 step at a time (left) versus 2 steps at a time (right)]
As seen in the previous example, the filter moves 1 step each time in the case of stride = 1 (left) and 2 steps each time in the case of stride = 2 (right).

Note: Take care that the stride is compatible with the image size so that the filter does not go outside the image; positions where the filter would cross the border are simply not used (hence the floor in the formula below).

Convolution output size


For an n×n input image using an f×f filter with padding P and stride S, the output result will be:
⌊(n + 2P − f)/S + 1⌋ × ⌊(n + 2P − f)/S + 1⌋
Note: In the Math and signal processing textbooks convolution is not done as we did but they
firstly mirror the filter before applying it and what we did here in math is called cross-
correlation but in deep learning cross-correlation is called convolution and this is the
convention deep learning researchers used to use.
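The output-size formula can be wrapped in a small helper and checked against the earlier examples (a plain Python sketch):

import math

def conv_output_size(n, f, p=0, s=1):
    """Output spatial size of a convolution: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(6, 3))             # 4: 6x6 image, 3x3 filter, valid convolution
print(conv_output_size(6, 3, p=1))        # 6: same padding, p = (f - 1) / 2
print(conv_output_size(7, 3, p=0, s=2))   # 3: stride 2 skips positions that would overflow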
Convolution on RGB images
What we were doing was done on a 2-D (grey scale) image, but for RGB images, which are more common, we now have 3-D inputs, so how does convolution deal with them? This is done by making the filter 3-D as well (one slice per channel), as shown:
[Figure: 6×6×3 image * 3×3×3 filter = 4×4 output]
Note:
Number of channels in
the input image should
equal number of
channels of the filter.

But how does a 6×6×3 matrix convolved with a 3×3×3 filter end up as a 4×4×1 matrix?
Take the 3×3×3 filter and apply it to the image such that each channel of the filter lines up with the corresponding channel of the image, multiply each pair of numbers, then add them all up (the 27 numbers) and put the sum in the corresponding position of the final result.
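A sketch of that 3-D convolution in NumPy (the same sliding as the 2-D case, but each step multiplies and sums all f×f×3 numbers):

import numpy as np

def conv3d_valid(image, kernel):
    """Convolve an (n, n, c) image with an (f, f, c) filter: at every position multiply
    all f*f*c numbers element-wise and sum them into a single number, so the result
    is one (n-f+1, n-f+1) channel."""
    n, _, c = image.shape
    f = kernel.shape[0]
    assert kernel.shape[2] == c, "image and filter must have the same number of channels"
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(n - f + 1):
        for j in range(n - f + 1):
            out[i, j] = np.sum(image[i:i+f, j:j+f, :] * kernel)
    return out

image = np.random.rand(6, 6, 3)
kernel = np.random.rand(3, 3, 3)
print(conv3d_valid(image, kernel).shape)   # (4, 4); stacking several filters gives 4x4xnc'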

Multiple filters
In fact we don't always want to find only vertical or horizontal edges; we usually want both, plus many other filters that detect inclined edges at different angles. That is why we need multiple filters to convolve with, so that each one has a specific task. So if we have an image and we want both horizontal and vertical edge detection, we convolve the image once with the vertical edge detector kernel and once with the horizontal one, then stack the two outputs together. (The output is now (n−f+1) × (n−f+1) × nc′, where nc′ is the number of filters used.)
Convolutional neural network
We start with an n×n×3 image (a[0]) and convolve it with nc′ kernels of size f×f×3 (these play the role of w[1]), so the output of the convolution (analogous to w[1]·a[0]) is of size (n−f+1) × (n−f+1) × nc′. We then add a bias term to each of the nc′ outputs (giving z[1]) and apply an activation function to them (giving a[1]). So now we have moved from a[0] to a[1], which is one layer of a convolutional neural network. Check the example:
[Figure: one CONV layer: the input image convolved with two filters, a bias added to each output, ReLU applied, and the results stacked]
Exercises
[Exercise: count the number of parameters in a CONV layer given the number and size of its filters]
Note: The good thing here is that no matter how big the image is, we still have the same number of parameters as long as we keep the same filter size and number of filters.
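As a concrete sketch of the parameter count (using 10 filters of size 3×3×3 as an illustrative example; each filter has f·f·nc_prev weights plus one bias):

def conv_layer_params(f, n_c_prev, n_filters):
    """Number of learnable parameters in one CONV layer:
    each filter has f*f*n_c_prev weights plus one bias."""
    return (f * f * n_c_prev + 1) * n_filters

print(conv_layer_params(3, 3, 10))    # 280: ten 3x3x3 filters
print(conv_layer_params(5, 10, 20))   # 5020: still independent of the image size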

Notations
[Figure: notation summary for layer l: f[l] = filter size, p[l] = padding, s[l] = stride, nc[l] = number of filters; input is n_H[l−1] × n_W[l−1] × nc[l−1] and output is n_H[l] × n_W[l] × nc[l]]
Simple convolutional neural network
[Figure: 39×39×3 input → CONV(3×3, s=1, 10 filters) → 37×37×10 → CONV(5×5, s=2, 20 filters) → 17×17×20 → CONV(5×5, s=2, 40 filters) → 7×7×40 → flatten (1960) → softmax → ŷ]
Take care how the sizes change from one layer to another:
From layer 1 → layer 2: we have 39×39×3 with a 3×3×3 kernel, P=0, S=1 and nc′=10.
Hence the output equals ⌊(39 + 2×0 − 3)/1⌋ + 1 = 37, i.e. 37×37×10.
From layer 2 → layer 3: we have 37×37×10 with a 5×5×10 kernel, P=0, S=2 and nc′=20.
Hence the output equals ⌊(37 + 2×0 − 5)/2⌋ + 1 = 17, i.e. 17×17×20.
From layer 3 → layer 4: we have 17×17×20 with a 5×5×20 kernel, P=0, S=2 and nc′=40.
Hence the output equals ⌊(17 + 2×0 − 5)/2⌋ + 1 = 7, i.e. 7×7×40.
From layer 4 → layer 5: this is just flattening, 7×7×40 = 1960 units, which are fed to a softmax function to produce the prediction ŷ.
Note: As we go deeper in the network, the spatial size of the image keeps shrinking.
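The same walkthrough can be automated with the output-size formula from earlier (a small sketch; the layer hyperparameters below are the ones from this example):

import math

def conv_output_size(n, f, p=0, s=1):
    return math.floor((n + 2 * p - f) / s) + 1

# (filter size, padding, stride, number of filters) for each CONV layer of the example
layers = [(3, 0, 1, 10), (5, 0, 2, 20), (5, 0, 2, 40)]
n, channels = 39, 3
for f, p, s, n_filters in layers:
    n, channels = conv_output_size(n, f, p, s), n_filters
    print(f"{n}x{n}x{channels}")             # 37x37x10, then 17x17x20, then 7x7x40
print("flattened units:", n * n * channels)  # 1960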

Types of layers in convolutional neural network


1) Convolution layer (CONV)
2)Pooling layer (POOL)
3)Fully connected layer (FC)

Any neural network is composed of a number of these layers to form our CNN.
Pooling layers
Pooling layers is mainly used to reduce the size of the representation, speed computation and
make the network more robust to the features.
Pooling has no learning parameters unlike convolution so gradient descent has nothing to learn.

Max pooling
Max pooling takes the highest value within each patch of pixels and uses it to represent that patch, as shown in the figure:
[Figure: 4×4 input max-pooled with a 2×2 filter and stride 2, giving a 2×2 output]
If the input has many channels, max pooling is applied to each channel independently, so the output has the same number of channels as the input.
Average pooling

Similar to max pooling, but instead of taking the highest value we take the average of the pixels in each patch.
So, for an n×n×nc input, the output of a pooling layer with filter size f and stride S will be:
⌊(n − f)/S + 1⌋ × ⌊(n − f)/S + 1⌋ × nc
Note: We don't usually use padding with pooling layer.
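A minimal NumPy sketch of max pooling (2×2 window, stride 2, applied independently to each channel; note there are no learnable parameters):

import numpy as np

def max_pool(x, f=2, s=2):
    """Max pooling on an (n, n, c) volume; the number of channels is unchanged."""
    n, _, c = x.shape
    out_n = (n - f) // s + 1
    out = np.zeros((out_n, out_n, c))
    for i in range(out_n):
        for j in range(out_n):
            patch = x[i*s:i*s+f, j*s:j*s+f, :]
            out[i, j, :] = patch.max(axis=(0, 1))   # use patch.mean(axis=(0, 1)) for average pooling
    return out

x = np.random.rand(4, 4, 3)
print(max_pool(x).shape)   # (2, 2, 3)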


Neural network example (LeNet-5)
[Figure: a LeNet-5-style network: input image → CONV → POOL → CONV → POOL → FC → FC → softmax]
Notes:
1-This is the simplest deep neural network in the history.
2-Each conv layer followed by pooling layer are together called a neural network layer.
3-Conv-pool-conv-pool-…-conv-pool-FC-FC-…-FC is a pretty common pattern used in most
of neural networks.
4-Each neuron in the last pooling layer is connected to all neuron in the first FC layer that why
it is of size (120,400).
5-Softmax has 10 outputs in this example as it was an example of hand-written digits
recognition.

Notice this table that translates the previous example:
[Table: for each layer, its activation shape, activation size, and number of parameters]
Convolutional neural networks and its layers(Convolutional, Pooling, Fully Connected)
In the past decade, advances made in the field of computer vision are truly amazing and
unprecedented. Machines can now recognize images and frames in videos with an accuracy (98
percent) surpassing humans(97 percent). The machinery behind this amazing feat is inspired by
the functioning of human brain.
Back then neurologists were conducting experiments on cats when they discovered that similar
parts of an image cause similar parts of cat’s brain to become active. In other words, when a cat
looks at circle, the alpha-zone is activated in its brain. When it looks at square, beta-zone is
activated in the brain. Their findings concluded that animal’s brain contain a zone of neurons
which react to the specific characteristics of an image i.e they perceive the environment through
a layered architecture of neurons in the brain. And each and every image passes through some
sort of feature extractor before entering further into the brain.
Inspired by the functioning of brain, mathematicians hoarded to create a system to simulate
different groups of neurons firing for different aspect of an image and communicate with each
other to form a bigger picture.

Convolutional neural networks (CNN)


Convolutional Neural Networks are very similar to ordinary Neural Networks from the previous
chapter: they are made up of neurons that have learnable weights and biases. Each neuron
receives some inputs, performs a dot product and optionally follows it with a non-linearity. The
whole network still expresses a single differentiable score function: from the raw image pixels
on one end to class scores at the other. And they still have a loss function (e.g. SVM/Softmax)
on the last (fully-connected) layer and all the tips/tricks we developed for learning regular
Neural Networks still apply.
So what changes? ConvNet architectures make the explicit assumption that the inputs are uni or
multi-channeled images (unlike neural networks, where the input is a vector), which allows us
to encode certain properties into the architecture using convolutions. These then make the
forward function more efficient to implement and vastly reduce the amount of parameters in the
network.

Convolutional Neural Networks were inspired by research done on the visual cortex of
mammals and how they perceive the world using a layered architecture of neurons in the brain
as said before. So think of this model of the visual cortex as groups of neurons designed
specifically to recognize different shapes. Each group of neurons fires at the sight of an object,
and communicate with each other to develop a holistic understanding of the perceived object.
General architecture (building blocks or structure) of convolutional neural networks
A simple convnet is a sequence of layers where each layer transforms one volume of activations
to another through a differentiable function. Convolutional Neural Networks have a different
architecture than regular Neural Networks. Regular Neural Networks transform a vector input
by putting it through a series of hidden layers. Every layer is made up of a set of neurons, where
each layer is fully connected to all neurons in the layer before. Finally, there is a last layer — the
output layer — that represent the predictions.
Convolutional Neural Networks are a bit different. First of all, the layers are organized in 3
dimensions: width, height and depth (channels). Further, the neurons in one layer do not
connect to all the neurons in the next layer but only to a small region of it (it is known as
sparsity of connections or receptive field). Lastly, the final output will be reduced to a single
vector of probability scores, organized along the depth dimension.
The convnet is mainly composed of the following types of layers:
1-Input layer
2-Convolutional layer (feature extraction/learning part)
3-Pooling layer, aka sub-sampling or down-sampling layer (feature extraction/learning part)
4-Fully connected layers (classification part)
Note: There are non-linearity operations such as ReLU applied as discussed before. Each layer may or may not have learning parameters (e.g. CONV/FC do, RELU/POOL don't). Each layer may or may not have additional hyper-parameters (e.g. CONV/FC/POOL do, RELU doesn't).

Does input layer in CNN differ from that of regular ANN?


Yes, the difference is in the dimensions: the input data is a 4D matrix that includes the number of samples (number of images), the height of each sample (height of each image), the width of each sample (width of each image), and the number of channels (the number of channels refers to the color specification of each image: a colored image corresponds to Red (R), Green (G) and Blue (B) pixels, so it has 3 channels; think of this as three 2-dimensional matrices superposed on top of each other, each corresponding to the intensity of the R, G and B pixels, whereas a gray image has only 1 channel). In our example we will be using a gray image only.
What is convolution mathematically and intuitively in machine learning?
Assume the input is a 32 x 32 x 3 array of pixel values. Now, the best way to explain a conv layer is to
imagine a flashlight that is shining over the top left of the image. Let’s say that the light this
flashlight shines covers a 5 x 5 area. And now, let’s imagine this flashlight sliding across all the
areas of the input image. In machine learning terms, this flashlight is called a filter (or
sometimes referred to as a neuron or a kernel or a feature detector) and the region shining over
is called the receptive field. Now this filter
is also an array of numbers (the numbers
are called learning weights or parameters).
A very important note is that the depth of
this filter has to be the same as the depth of
the input (this makes sure that the math
works out), so the dimensions of this filter
is 5 x 5 x 3. Now, let’s take the first
position the filter is in for example. It would be the top left corner. As the filter is sliding,
or convolving, around the input image, it is multiplying the values in the filter with the original
pixel values of the image (aka computing element wise multiplications). These multiplications
are all summed up (mathematically speaking, this would be 75 multiplications in total). So now
you have a single number. Remember, this number is just representative of when the filter is at
the top left of the image. Now, we repeat this process for every location on the input volume.
(Next step would be moving the filter to the right by 1 unit, then right again by 1, and so on).
Every unique location on the input volume produces a number. After sliding the filter over all
the locations, you will find out that what you’re left with is a 28 x 28 x 1 array of numbers,
which we call an activation map or feature map. The reason you get a 28 x 28 array is that there
are 784 different locations that a 5 x 5 filter can fit on a 32 x 32 input image. These 784
numbers are mapped to a 28 x 28 array.
Let’s say now we use two 5 x 5 x 3 filters instead of one. Then our output volume would be 28
x 28 x 2. By using more filters, we are able to preserve the spatial dimensions better.
Mathematically, this is what’s going on in a convolutional layer.
A step-by-step example was given in the main course and nothing more to be added so in case
you still didn't get the idea of solving convolution so enough reading and it is time to visualize:
Check this GIF[https://commons.wikimedia.org/wiki/File:3D_Convolution_Animation.gif]
If still didn't get it, check this video [https://www.youtube.com/watch?v=XuD4C8vJzEQ]
GIF for multiple filters[http://cs231n.github.io/assets/conv-demo/index.html]
Note: what we are doing here is mathematically called cross-correlation not convolution as
convolution needs flipping before sliding but anyway in computer vision we call it convolution.
Remember: padding is adding a border of zeros around the matrix to avoid shrinking and
increase the effect of edges. Stride defines the step of the sliding window. Both concepts have nothing more to be added beyond the main course, so revise those parts.
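To make the sliding-window computation concrete, here is a minimal NumPy sketch (my own illustration, not from the course) of the single-channel "convolution" (really cross-correlation) described above, with optional zero padding and stride:

import numpy as np

def conv2d_single_channel(image, kernel, stride=1, pad=0):
    """Slide `kernel` over `image` (both 2-D) and return the feature map."""
    if pad > 0:
        image = np.pad(image, pad, mode="constant")   # border of zeros around the matrix
    H, W = image.shape
    f = kernel.shape[0]                               # assume a square filter
    out_h = (H - f) // stride + 1
    out_w = (W - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i*stride:i*stride+f, j*stride:j*stride+f]
            out[i, j] = np.sum(region * kernel)       # element-wise multiply then sum
    return out

# A 32x32 input with a 5x5 filter, stride 1, no padding gives a 28x28 feature map
feature_map = conv2d_single_channel(np.random.rand(32, 32), np.random.rand(5, 5))
print(feature_map.shape)   # (28, 28)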
Convolutional layer as high level perspective
The Conv layer is the first layer and the core building block of a Convolutional Network, and it does most of the computational heavy lifting. However, let's talk about what this convolution is
actually doing from a high level. Each of these filters can be thought of as feature identifiers.
When I say features, I’m talking about things like straight edges, simple colors, and curves.
Think about the simplest characteristics that all images have in common with each other. Let’s
say our first filter is 7 x 7 x 3 and is going to be a curve detector. (In this section, let’s ignore
the fact that the filter is 3 units deep and only consider the top depth slice of the filter and the
image, for simplicity.)
As a curve detector, the filter will have a pixel
structure in which there will be higher
numerical values along the area that is a shape
of a curve (remember, these filters we're talking about are just numbers!).

Now, let’s go back to visualizing this mathematically. When we have this filter at the top left
corner of the input volume, it is computing
multiplications between the filter and pixel values
at that region. Now let’s take an example of an
image that we want to classify, and let’s put our
filter at the top left corner. Remember, what we have to do is multiply the values in the filter with the original pixel values of the image.
Basically, in the input image, if there is a
shape that generally resembles the curve that
this filter is representing, then all of the
multiplications summed together will result in
a large value! 

Now let's see what happens when we move our filter. The value is much lower! This is because there wasn't anything in the image section that responded to the curve detector filter.
Remember, the output of this conv layer is an activation map. So, in the simple case of a one
filter convolution (and if that filter is a curve detector), the activation map will show the areas in
which there are most likely to be curves in the picture. In this example, the top left value of our
26 x 26 x 1 activation map (26 because of the 7x7 filter instead of 5x5) will be 6600. This high
value means that it is likely that there is some sort of curve in the input volume that caused the
filter to activate. The top right value in our activation map will be 0 because there wasn’t
anything in the input volume that caused the filter to activate (or more simply said, there wasn’t
a curve in that region of the original image). Remember, this is just for one filter. This is just a
filter that is going to detect lines that curve outward and to the right. We can have other filters
for lines that curve to the left or for straight edges. The more filters, the greater the depth of the
activation map, and the more information we have about the input volume.

It is evident from what we said that different values of the filter matrix will produce different feature maps for the same input image. In the table shown, we can see the effects of convolving this image with different filters (assuming 3×3 filters). As shown, we can perform operations such as edge detection, sharpening and blurring just by changing the numeric values of our filter matrix before the convolution operation.

As you go through the network and pass through more conv layers, you get activation maps that represent more and more complex features. By the end of the network, you may have some filters that activate when there is handwriting in the image, filters that activate when they see pink objects, etc. Another interesting thing to note is that as you go deeper into the network, the filters begin to have a larger and larger receptive field, which means that they are able to consider information from a larger area of the original input volume (another way of putting it is that they are more responsive to a larger region of pixel space).

To visualize, this video is more than great to see how as we go deeper in the network we start
having more complex features and that feature map of each layer in fact depends on the feature
maps from the previous ones:
[https://www.youtube.com/watch?time_continue=152&v=AgkfIQ4IGaM]
Two main concepts to know about CNN
Local connectivity (aka sparsity of connectivity)
When dealing with high-dimensional inputs such as images, it is impractical to connect neurons
to all neurons in the previous volume as was done in the regular ANN discussed in the previous
chapter. Instead, we will connect each neuron to only a local region of the input volume. The
spatial extent of this connectivity is a hyper-parameter called the receptive field of the neuron
(equivalently this is the filter size). The extent of the connectivity along the depth axis is always
equal to the depth (no. of channels) of the input volume (that is why kernel's number of
channels is always equal to number of channels of the input to this kernel). It is important to
emphasize again this asymmetry in how we treat the spatial dimensions (width and height) and
the depth dimension: The connections are local in space (along width and height), but always
full along the entire depth of the input volume.
To sum up we can say that local connectivity is the concept that each neuron connected only to
a subset of the input image (unlike a neural network where all the neurons are fully connected to
the all neurons in the previous layer).

Parameter sharing (aka shared weights)


Parameter sharing is the sharing of weights by all neurons in a particular feature map. When we take a filter and convolve it with the input image, we compute a dot product (element-wise multiplication followed by summation). This is equivalent to what we were doing with a regular ANN, which is multiplying inputs by weights (parameters) and then adding a bias, so by analogy we can say that the filter's values are the learning weights of the CNN. In a regular ANN, however, every connection between two neurons has its own specific weight representing the relation between those 2 neurons. Here the concept is different: each filter is slid (convolved) across the whole image, which means that the same weights are applied at every position, so all positions share the same parameters. Note that we don't use only 1 filter but many filters, yet each filter is applied with its own weights to the entire input, so all neurons (positions) within each particular feature map share weights (each feature map is the result of convolving with 1 filter, so after convolving with several filters we get several feature maps).
The parameter sharing scheme builds on the idea of local connectivity, and it is used in convolutional layers to reduce the number of parameters and make the computation more efficient.

Note: These 2 concepts are the main reasons or advantages that make convolution useful in
neural networks.
Pooling layer (aka sub-sampling or down sampling layer)
The second secret sauce that has made CNNs very effective is pooling. It is common to
periodically insert a Pooling layer in-between successive Conv layers in a ConvNet architecture.
Pooling is a vector to scalar transformation that operates on each local region of an image, just
like convolutions do, however, unlike convolutions, they do not have filters and do not compute
dot products with the local region, instead they compute the average of the pixels in the region
(Average Pooling) or simply picks the pixel with the highest intensity and discards the rest
(Max Pooling) or sums the pixels in the region (Sum Pooling). Max Pooling is the most used type among them. For an F×F pooling with stride F, it effectively reduces each spatial dimension of the feature maps by a factor of F.

Example (Max Pooling): This is max pooling of size 2×2 and stride=2
First region (red): Max(12, 20, 8, 12) → 20
Second region (yellow): Max(30, 0, 2, 0) → 30
Third region (blue): Max(34, 70, 112, 100) → 112
Fourth region (green): Max(37, 25, 4, 12) → 37
Data is reduced from 16 values to 4 (reduced to a quarter).
The very idea of pooling can seem counter-productive as it leads to loss of information; however, it has proven to be very effective in practice.
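Here is a minimal NumPy sketch (my own illustration) of the 2×2, stride-2 max pooling from the example above, using the same 4×4 values:

import numpy as np

def max_pool(x, f=2, stride=2):
    out_h = (x.shape[0] - f) // stride + 1
    out_w = (x.shape[1] - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride:i*stride+f, j*stride:j*stride+f].max()
    return out

x = np.array([[ 12,  20, 30,  0],
              [  8,  12,  2,  0],
              [ 34,  70, 37,  4],
              [112, 100, 25, 12]])
print(max_pool(x))   # [[ 20.  30.]
                     #  [112.  37.]]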

What is the importance of pooling layer ?


 Reduces the spatial size of the representation to reduce the amount of parameters and
computation in the network hence reduce training time. Generally, the size of images at
each layer in the network is directly proportional to the computational cost (flops) of
each layer. Pooling reduces the dimensions of the image as the layers get deeper, hence,
it helps prevent an explosion of the number of flops a network requires.
 Although size is reduced but important information is retained based on the idea that the
maximum pixel in a region (Max Pooling) represents the most important feature in that
region.
 As it reduces no. of parameters, so it reduces overfitting.
 Makes the network invariant to small transformations, distortions (noise) and translations
in the input image (a small distortion in input will not change the output of Pooling –
since we take the maximum / average value in a local neighborhood).
Notes:
1- The Pooling Layer operates independently on every depth slice (channel) of the input.
2- Many people dislike the pooling operation and proposes to replace it by CONV layer with
larger stride once in a while.
Fully connected layers (FC)
A fully connected layer is simply a feed-forward neural network. Fully connected layers form the last few layers in the network. The input to the first fully connected layer is the output from the final pooling or convolutional layer, which is flattened (the output from the final, and any, pooling or convolutional layer is a 3-dimensional matrix; to flatten it is to unroll all its values into a vector) and then fed into the fully connected layer, and each FC layer passes its output to the next FC layer. After passing through all fully connected layers, the final layer uses the softmax activation function to get the probabilities of the input being in a particular class (classification task). And so, finally, we have the probabilities of the object in the image belonging to the different classes.
Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular neural networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset, similar to what we were doing in the previous chapter.

Learning parameters & Hyper-parameters for each layer in CNN


Conv layer: learning parameters = weights and biases; hyper-parameters = number of filters, filter size, stride, padding.
Pooling layer: learning parameters = none (sometimes a coefficient and bias in average pooling); hyper-parameters = filter size, stride.
FC layer: learning parameters = weights and biases; hyper-parameters = similar hyper-parameters to a regular ANN.
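As a small illustration (my own helper functions, not from the course), this is how the "trainable parameters" numbers in the architecture tables later in this section are computed:

def conv_params(f, n_c_prev, n_filters):
    # each filter has f*f*n_c_prev weights plus 1 bias
    return (f * f * n_c_prev + 1) * n_filters

def fc_params(n_in, n_out):
    # weight matrix plus one bias per output neuron
    return n_in * n_out + n_out

print(conv_params(5, 1, 6))     # 156    -> LeNet-5 first conv layer
print(conv_params(11, 3, 96))   # 34944  -> AlexNet first conv layer
print(fc_params(120, 84))       # 10164  -> LeNet-5 FC layer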
Summary of each layer in CNN
Convolutional neural nets take in images as input and attempt to decipher (decode) different small features of the images (local connections) regardless of their position (spatial invariance), using a series of mathematical operations (layering, pooling) in order to understand the full picture of what's happening. These mathematical operations model an image as a series of numbers, with each number representing pixel intensity (the intensity of the color at a specific position in the picture).
[Figure: summary diagrams of the Conv layer and the Pooling layer]
Course 4 Week 2
We have studied the basic layers that together form convolutional neural networks, and over the past few years research has focused on how to put those layers together to form effective networks. Just as we have learned coding by reading other people's code, we can build convnets (convolutional networks) by studying other examples of effective convnets. It turns out that convnets which work well on one computer vision task often work well on other computer vision tasks. For example, if your network works well on a task of recognizing cats, dogs and people, you can apply the same network to other tasks, such as recognizing cars, by training it on that dataset, and most probably it will work well. So let us start with some famous classic neural networks that were state of the art.

Classic networks
LeNet-5 By Yann-LeCun
LeNet was trained on greyscale images from the MNIST dataset for handwritten digit recognition, and here is the architecture:

Notes:
1-This neural network is pretty small compared to recent neural networks as it has only 60K parameters, while now we have reached 10 million to 100 million parameters, so it is small.
2-As we go deeper in this network, the height and width shrink (32→28→14→10→5→…).
3-As we go deeper, the number of channels increases (1→6→16).
4-This network used the sigmoid/tanh activation function, not ReLU, as ReLU wasn't introduced yet.
5-A sigmoid non-linearity was used after the pooling layer, but now it is not used anymore.
6-In case you will read its paper, just focus on sections II and III, as recommended by Andrew.
AlexNet By Alex Krizhevsky

Notes:
1-This neural network is similar to LeNet-5 but much bigger, with about 60 million parameters.
2-This network used the ReLU activation function instead of sigmoid/tanh functions.
3-This architecture was introduced at a time when GPUs were slower than now, so it was trained on multiple GPUs.
4-This network uses a layer called local response normalization, but this type of layer is not used much recently.

VGG-16 By Simonian & Zisserman


It has all conv layers of size 3×3 with S=1 and same padding, in addition to all max-pooling layers of size 2×2 with S=2. As it is so deep, we represent n similar layers by (×n).

Notes:
1-This network is composed of 16 layers with a total number of learning parameters ≈ 138 million.
2-There is another VGG network composed of 19 layers instead of 16, called VGG-19.
Residual networks (ResNets)
Very deep neural networks are hard to train due to vanishing and exploding gradient problems. Here comes the idea of skip connections, which make activations from one layer feed a deeper layer in the network; this is the basic idea ResNets rely on.
Residual block
To go from a[l] to a[l+2] you have to go through the main path as follows:
a[l] → z[l+1]=w[l+1].a[l]+b[l+1] → a[l+1] = g(z[l+1]) → z[l+2]=w[l+2].a[l+1]+b[l+2] → a[l+2] = g(z[l+2])
How to modify this ?
we are going to take a[l] and add it to output of z[l+2] using shortcut path as follows:

a[l] → z[l+1] = w[l+1].a[l]+b[l+1] → a[l+1] = g(z[l+1]) → z[l+2] = w[l+2].a[l+1]+b[l+2] → a[l+2] = g(z[l+2] + a[l])


Notes:
1-The path made above is called either shortcut or skip connection.
2-Shortcut is connected before applying activation function g().
3-You can make the path added to any following layer z [l+2] or z[l+3] or
… or z[L-1].
4-As we are adding a[l] to z[l+2], they should have the same sizes, so same (conv) padding is used more often than valid (conv) padding in ResNets. In case z[l+2] has a different size, we use g(z[l+2] + ws·a[l]), where ws matches the size of a[l] to z[l+2]; it could be implemented by zero padding or by a weight matrix learned along with the rest of the network.
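A minimal tf.keras sketch of such a residual block (an illustrative sketch with assumed 3×3 filters and same padding, not the exact block from the paper):

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, n_filters):
    shortcut = x                                                # a[l], saved for the skip connection
    y = layers.Conv2D(n_filters, (3, 3), padding="same")(x)     # same padding keeps the sizes equal
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(n_filters, (3, 3), padding="same")(y)     # this plays the role of z[l+2]
    y = layers.Add()([y, shortcut])                             # z[l+2] + a[l]
    return layers.Activation("relu")(y)                         # a[l+2] = g(z[l+2] + a[l])

inputs = tf.keras.Input(shape=(56, 56, 64))                     # example input size (assumed)
outputs = residual_block(inputs, 64)                            # 64 filters so shapes match for the add
model = tf.keras.Model(inputs, outputs)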

Residual network

Did residual networks improve the performance compared to plain networks (no residuals)?
Yes, they did make the performance better. Theoretically, having more layers in the network should make the training error keep decreasing, but practically, with plain networks, it didn't help that much: after a certain number of layers the training error starts to increase again. ResNets solved this to a large extent and keep the training error decreasing as the number of layers of the network increases.
Why ResNets work?
Assume we have two networks, where one of them is a big neural network and the other is the same big neural network followed by a residual block. Writing the equation for the second one: a[l+2] = g(z[l+2] + a[l]) = g(w[l+2].a[l+1] + b[l+2] + a[l]).
So if w[l+2] = b[l+2] = 0, and we are using the ReLU activation function, then a[l+2] will be equal to a[l], which means that the residual block is now acting as just an identity function. In other words, with residual blocks we can make the network deeper, and if there is nothing more to learn, the network has the ability to simply copy the previous output and not hurt performance; it will do at least as well as the simpler network. Plain networks, in contrast, have to choose values for the parameters of every added layer, which may result in worse results if they fail to learn. Of course it is not only about not hurting your neural network: the residual blocks can also improve it by learning something more if there is something to be learned. To summarize, a residual block allows the network to either learn something new or act as an identity, so adding depth does not hurt the neural network.

Network in Network (1×1 convolution)


In fact, if I told you that I'm doing a 1×1 convolution you might laugh, as a 1×1 convolution seems to mean just multiplying each feature (pixel) by a number, so it is just multiplication! This is true for 1 channel, but if we are convolving over several channels then it becomes more meaningful.
If you are convolving a 6×6×1 image with a 1×1 filter, this is a direct multiplication as shown. But if you are convolving, for example, a 6×6×32 image with a 1×1×32 filter, then at each position we are convolving a 1×1×32 slice of the input with the 1×1×32 filter and end up with a single number as the output for this position; we then repeat this for every one of the 36 positions (6×6) to end up with a 6×6 output. As we said before, we don't use only 1 filter but multiple filters, so we end up with 6×6×#filters.

But what is the effect of using such a convolution? In fact, it adds more non-linearity to your system, as we apply an activation function (say ReLU) after this step as usual, so we are building a more complex function. It can also help in shrinking the number of channels if the input to this conv layer has a large number of channels and you want to reduce it to cut down computations.
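A tiny tf.keras sketch (my own illustration) of a 1×1 convolution shrinking the number of channels while adding a ReLU non-linearity:

import tensorflow as tf
from tensorflow.keras import layers

x = tf.keras.Input(shape=(6, 6, 32))
y = layers.Conv2D(filters=16, kernel_size=(1, 1), activation="relu")(x)
print(y.shape)   # (None, 6, 6, 16): height and width unchanged, channels reduced from 32 to 16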
Inception Network
The Inception network relies on the idea of network-in-network in its architecture. It has a more complicated architecture but performs remarkably well. Normally you have to pick the type of each layer when constructing a network: a 3×3 conv layer, a 5×5 conv layer, a 7×7 conv layer, or a pooling layer. Inception asks: why not do them all? Let us see how to construct an inception module, then use several modules to construct a network.
Assume we have a 28×28×192 input. Instead of picking one conv layer with a specific filter size or one pooling layer, we apply all of them: some number of 1×1 filters, 3×3 filters, 5×5 filters, ..., and max pooling, all applied to the input, taking into consideration that the outputs of each must have the same spatial size (28×28). So we use same convolution in the conv layers, and we pad and use stride 1 in max pooling to make sure all outputs are the same size, then stack all the outputs together along the channel dimension. This style of constructing networks does well because we don't need to pick one sort of filter; we let the parameters of all types of filters learn and form whatever combination they need for a better performing network.

The problem with the inception module is obvious: the computational cost, as it is very high compared to other architectures. For example, if we just estimate the number of operations done in the 5×5 filters only: we have a 28×28×192 input and a 28×28×32 output, which means we have 28×28×32 values, and each value is computed using 5×5×192 multiplications of the filter, so this ends up as (28×28×32) × (5×5×192) = 120 million operations!
So here comes the idea of introducing a 1×1 conv to perform the 5×5 conv in about one tenth of the operations needed by the original method (instead of 120 million → 12 million). This is done as shown (left is the original method and right is the modified one to reduce computations).
The new method still preserves the input and output dimensions but with fewer computations.
Computation operations: [(1×1×192) × (28×28×16)] + [(5×5×16) × (28×28×32)] = 12.4 million
So the computational cost decreased to approximately 1/10 of the operations needed by the original method while preserving the dimensions of the input and output.
Note: The 1×1 layer added before the 5×5 layer is called a bottleneck layer.
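A quick arithmetic check of the two numbers above (plain Python, my own sketch):

direct     = (28 * 28 * 32) * (5 * 5 * 192)      # 5x5 conv applied directly to the 192-channel input
bottleneck = (28 * 28 * 16) * (1 * 1 * 192) \
           + (28 * 28 * 32) * (5 * 5 * 16)       # 1x1 bottleneck down to 16 channels, then the 5x5 conv
print(direct, bottleneck)                        # 120,422,400 vs 12,443,648 (roughly one tenth)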

The last thing to say before showing the final inception module is that, as mentioned before, 1×1 conv helps in shrinking the number of channels; this is used after the pooling layer to decrease the number of channels from 192 to any smaller number we need, by using n filters each of size 1×1×192.

Final inception module

Now to construct an inception network we will add some inception modules to form it as
follows:
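Below is a hedged tf.keras sketch of one inception module with the 1×1 bottlenecks; the filter counts here are illustrative assumptions, not values taken from the GoogLeNet tables. Stacking several of these modules (plus pooling and a final classifier) gives an inception-style network.

import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x):
    b1 = layers.Conv2D(64, (1, 1), padding="same", activation="relu")(x)

    b2 = layers.Conv2D(96, (1, 1), padding="same", activation="relu")(x)    # bottleneck before 3x3
    b2 = layers.Conv2D(128, (3, 3), padding="same", activation="relu")(b2)

    b3 = layers.Conv2D(16, (1, 1), padding="same", activation="relu")(x)    # bottleneck before 5x5
    b3 = layers.Conv2D(32, (5, 5), padding="same", activation="relu")(b3)

    b4 = layers.MaxPooling2D((3, 3), strides=1, padding="same")(x)          # stride 1 keeps 28x28
    b4 = layers.Conv2D(32, (1, 1), padding="same", activation="relu")(b4)   # shrink pooled channels

    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])                    # stack all outputs

inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = inception_module(inputs)    # shape (28, 28, 64 + 128 + 32 + 32) = (28, 28, 256)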
Transfer learning
Rather than making your network learn from scratch with randomly initialized weights, you can make much faster progress if you start from good weights that somebody else has ended up with after training their network (of course both should be using the same architecture). These weights are usually obtained by training on well-known datasets such as COCO, ImageNet or Pascal for weeks or maybe months, so starting from these weights is a very good initialization. Transfer learning is important in cases like having a small dataset. In fact, transfer learning is almost always used, except if you have an exceptionally large dataset that doesn't need this method.

Important Note: Change the number of classes of the softmax layer to match your task. In the case of a small dataset, freeze all the layers of the network except for the softmax layer and start training. For a larger dataset you can freeze some layers of the network and train the others; the larger the dataset, the more layers you train, such that if the dataset is large enough you can train the whole network, with the pre-trained weights serving just as an initialization.
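A hedged tf.keras sketch of this note (VGG16 here, plus the dense-layer sizes and the 5 output classes, are just example choices, not a prescription):

import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                               # freeze all pre-trained layers (small-dataset case)

x = layers.Flatten()(base.output)
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(5, activation="softmax")(x)   # 5 = number of classes in *your* task (assumed)

model = tf.keras.Model(base.input, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# With a larger dataset you could instead unfreeze some of the last conv layers:
# for layer in base.layers[-4:]:
#     layer.trainable = True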

Remember: For small dataset, data augmentation is a good option for increasing your data as
discussed before.
Networks architecture & Transfer learning
In this part we will talk about the most popular CNN architectures released up to 2018 for classification tasks and the basic concepts they introduced. Note that as we go deeper in this section we see more recent CNN architectures that have achieved better results in the ImageNet Large Scale Visual Recognition Competition (ILSVRC); except for LeNet-5, all of them are either winners or runners-up. Let us start:

LeNet-5 (by Yann LeCun, Leon Bottou, Yoshua Bengio and Patrick Haffner in 1998)
This paper was state of the art for convolutional neural networks, as it proposed most of the main ideas discussed in the previous section and was the first step toward the evolution of using convolutional neural networks in the machine learning field. It was trained for a handwritten digit classification task.

Architecture

Type Filter size No. of filters Stride Padding Trainable parameters Output size
Input Input size = 32 × 32 ×1 (Grey-scale image)
Layer1 Conv 5×5 6 1 0 156 28×28×6
Layer2 AvgPool 2×2 6 2 0 12 14×14×6
Layer 3 Conv 5×5 16 1 0 1516 10×10×16
Layer4 AvgPool 2×2 16 2 0 32 5×5×16
Layer5 FC by Conv 5×5 120 - - 48120 1×1×120
Layer6 FC - - - - 10164 84
Layer7 GC(Softmax) - - - - - 10 (digits)
Activation function = Tanh & Total parameters =60,000

Note: The pooling layer here has trainable parameters because it is the average-pooling type used in the original paper, which has 1 coefficient and 1 bias term per map; but generally, as we said, there are no trainable parameters in pooling.
Visualize architecture
[https://engmrk.com/wp-content/uploads/2018/09/LeNet-5_Architecture_Explanation.mp4]

Main concepts: convolution, local receptive fields, shared weights, and spatial subsampling, all of which were discussed in detail in the previous section.
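A hedged tf.keras sketch of the table above (note: this re-implementation connects the second conv layer to all 6 input channels and uses plain average pooling, so its parameter count comes out slightly above 60K; the original paper used a sparse connection table and trainable subsampling coefficients):

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 1)),
    layers.Conv2D(6, (5, 5), activation="tanh"),       # 28x28x6
    layers.AveragePooling2D((2, 2), strides=2),        # 14x14x6
    layers.Conv2D(16, (5, 5), activation="tanh"),      # 10x10x16
    layers.AveragePooling2D((2, 2), strides=2),        # 5x5x16
    layers.Conv2D(120, (5, 5), activation="tanh"),     # 1x1x120 (the "FC by Conv" layer)
    layers.Flatten(),
    layers.Dense(84, activation="tanh"),
    layers.Dense(10, activation="softmax"),            # 10 digits
])
model.summary()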
There was a gap between 1998 and 2010 until the Dan Ciresan Net was introduced, which provided a 9-layer CNN trained on an NVIDIA GTX 280, but we will skip this network and start from the 2012 net.

AlexNet (by Alex Krizhevsky in 2012)


This network was state of the art, and some people consider it the real start of deep learning with CNNs for the concepts it introduced. It was the winner of ILSVRC in 2012.

Architecture

Type Filter size No. of filters Stride Padding Trainable parameters Output size
Input Input size = 227×227×3
Layer1 Conv 11×11 96 4 0 34944 55×55×96
Layer2 MaxPool 3×3 96 2 0 - 27×27×96
Layer 3 Conv 5×5 256 1 2 614656 27×27×256
Layer4 MaxPool 3×3 256 2 0 - 13×13×256
Layer5 Conv 3×3 384 1 1 885120 13×13×384
Layer6 Conv 3×3 384 1 1 1327488 13×13×384
Layer7 Conv 3×3 256 1 1 884992 13×13×256
Layer8 MaxPool 3×3 256 2 0 - 6×6×256
Flattening 9216
Layer9 FC - - - - 37748736 4096
Layer10 FC - - - - 16777216 4096
Layer11 FC(Softmax) - - - - 4096000 1000
Activation function = ReLU & Total parameters ≈ 62.3 million

Notes: 1-Input size is 227, not 224 as mentioned in the paper. It was just a mistake in the paper.
2-This model was trained on two GTX 580 3GB GPUs, which is why the model in the figure above is separated into two pipelines.
3-MaxPooling is done with overlapping, not the normal MaxPooling we are used to; this will be discussed below.
4-Local response normalization is used after the first 2 MaxPooling layers, but researchers later found that it is not that effective.
5-Dropout was used before each of the FC layers 9 & 10 with keep probability = 0.5.
6-This model reached a top-5 error of 15.3% on ImageNet.
Main concepts
1-Use of ReLU instead of tanh (this point was discussed in the previous chapter)
2-Use of dropout to reduce overfitting (discussed in the previous chapter)
3-Data augmentation methods to reduce overfitting (discussed in the previous chapter)
4-SGD with momentum was used (discussed in the previous chapter)
5-Overlapping MaxPooling to avoid the averaging effects of AvgPool

Note: Usage of the ReLU activation function sped up training by several times compared to tanh.

What is overlapping MaxPooling ?


Overlapping Max Pool layers are similar to ordinary Max Pool layers, except that the adjacent windows over which the max is computed overlap each other. We are used to moving the pooling filter by a stride equal to the filter size, so if the pooling filter size is 2×2 we move with stride = 2 and the filter sees 4 new values each time to take the max of. With overlapping MaxPooling we use a stride smaller than the filter size, so consecutive pooling windows overlap. See the next example:

We can see from the example that a pooling filter of size 2×3 is used with stride 2, so the 3rd column in the first pooling window is the same as the 1st column in the second window.
Actually many of the popular networks, if not all, use overlapping MaxPooling, as it turned out to give better results than normal non-overlapping MaxPooling. Geoffrey Hinton said that without overlapping MaxPooling, the network may lose some information in its feature maps.

ZFNet (by Matthew Zeiler and Rob Fergus in 2013)


This network can be called a modified version of AlexNet, as it was a fine-tuning of it, but it still developed a key idea that improved the performance and reduced the error, leading to winning ILSVRC in 2013, in addition to developing a visualization technique.
Architecture
Type Filter size No. of filters Stride Padding Trainable parameters Output size
Input Input size = 224×224×3
Layer1 Conv 7×7 96 2 0 14208 110×110×96
Layer2 MaxPool 3×3 96 2 0 - 55×55×96
Layer 3 Conv 5×5 256 2 0 614656 26×26×256
Layer4 MaxPool 3×3 256 2 0 - 13×13×256
Layer5 Conv 3×3 384 1 1 885120 13×13×384
Layer6 Conv 3×3 384 1 1 1327488 13×13×384
Layer7 Conv 3×3 256 1 1 884992 13×13×256
Layer8 MaxPool 3×3 256 2 0 - 6×6×256
Flattening 9216
Layer9 FC - - - - 37748736 4096
Layer10 FC - - - - 16777216 4096
Layer11 FC(Softmax) - - - - 4096000 1000
Activation function = ReLU & Total parameters ≈ 62.3 million

We can notice that the architecture is the same as AlexNet except for the first conv layer that
used 7×7 filter size with stride 2 instead of 11×11 filter size with stride 4.

Main concepts
1-Using smaller filter sizes with smaller strides at the first layers.
2-Visualization technique using what is called by deconv net.

Why does using smaller filter sizes and strides in the first layers help?

The reasoning behind this modification is that a smaller filter size in the first conv layer helps retain a lot of the original pixel information in the input volume. A filter of size 11x11 proved to skip a lot of relevant information, especially as this is the first conv layer.

Visualizing CNN using deconv network


This is the main concept that really makes the ZFNet paper a popular one, as it developed a visualization technique named the Deconvolutional Network, which helps examine different feature activations and their relation to the input space. It is called a "deconvnet" because it maps features back to pixels (the opposite of what a convolutional layer does).
As we should know, a standard step in a deep learning framework is to have a series of (Conv → Rectification (Activation Function) → Pooling). To visualize a deep layer feature, we need a set of deconvnet techniques to reverse the above actions so that we can visualize the feature in the pixel domain.
The reversed order is now Unpooling → Unrectifying → Deconvolution.
Since ReLU is used as the activation function, and ReLU keeps all positive values while making negative values zero, in the reverse operation we just need to apply ReLU again. So now we have 2 steps to discuss, which are unpooling and deconvolution.
Unpooling
The Max-Pooling operation is non-invertible, as we lose the information of all locations other than the one where we find the max value within the filter region. However, we can obtain an approximate inverse by recording the locations of the maxima within each pooling region and, when unpooling, placing each value back at its recorded location while leaving the other positions zeroed.
Note: The max locations are stored using something called max location switches.
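A minimal NumPy sketch (my own illustration) of unpooling with switches: max-pool a 2-D map while recording where each max came from, then place the pooled values back at those recorded locations:

import numpy as np

def max_pool_with_switches(x, f=2):
    h, w = x.shape[0] // f, x.shape[1] // f
    pooled = np.zeros((h, w))
    switches = np.zeros_like(x, dtype=float)             # 1 where the max was found, 0 elsewhere
    for i in range(h):
        for j in range(w):
            region = x[i*f:(i+1)*f, j*f:(j+1)*f]
            r, c = np.unravel_index(region.argmax(), region.shape)
            pooled[i, j] = region[r, c]
            switches[i*f + r, j*f + c] = 1
    return pooled, switches

def unpool(pooled, switches, f=2):
    out = np.kron(pooled, np.ones((f, f)))               # copy each pooled value into its f x f block
    return out * switches                                # keep it only at the recorded max location

x = np.array([[ 12.,  20., 30.,  0.],
              [  8.,  12.,  2.,  0.],
              [ 34.,  70., 37.,  4.],
              [112., 100., 25., 12.]])
pooled, switches = max_pool_with_switches(x)
print(unpool(pooled, switches))   # 20, 30, 112 and 37 return to their original positions, zeros elsewhere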

Deconvolution
Deconvolution is the inverse process of convolution and works on two steps:
 Pad by 1 around each pixel of the convolved output matrix.
 Apply convolution with the filter used in convolution but transposed.
To visualize  [https://i.stack.imgur.com/f2RiP.gif]

The whole deconv process:

In fact, ZFNet is one of the most important network in CNN history as it provided great
intuition of how CNN works. The visualization technique proposed in this paper helped us to
know that first layers of CNN learn low-level features and that as we go deeper complex
features are learned and gave us good insight to improve network architectures.
VGG-19 (by Karen Simonyan and Andrew Zisserman in 2014)
Simplicity and depth could be the headline for this network. This 19-layer network was the runner-up of ILSVRC in 2014 and was able to reach a 7.3% error rate by strictly sticking with 3×3 filters with stride and pad = 1, along with 2×2 pooling with stride 2 (non-overlapping MaxPooling). VGG has another version with only 16 layers, which is mentioned in the main course.

Architecture

Type Filter size No. of filters Stride Padding Trainable parameters Output size
Input Input size = 224×224×3
Layer1 Conv 3×3 64 1 1 1792 224×224×64
Layer2 Conv 3×3 64 1 1 36928 224×224×64
Layer 3 MaxPool 2×2 64 2 0 - 112×112×64
Layer4 Conv 3×3 128 1 1 73856 112×112×128
Layer5 Conv 3×3 128 1 1 147584 112×112×128
Layer6 MaxPool 2×2 128 2 0 - 56×56×128
Layer7 Conv 3×3 256 1 1 295168 56×56×256
Layer8 Conv 3×3 256 1 1 590080 56×56×256
Layer9 Conv 3×3 256 1 1 590080 56×56×256
Layer10 Conv 3×3 256 1 1 590080 56×56×256
Layer11 MaxPool 2×2 256 2 0 - 28×28×256
Layer12 Conv 3×3 512 1 1 1,180,160 28×28×512
Layer13 Conv 3×3 512 1 1 2,359,808 28×28×512
Layer14 Conv 3×3 512 1 1 2,359,808 28×28×512
Layer15 Conv 3×3 512 1 1 2,359,808 28×28×512
Layer16 MaxPool 2×2 512 2 0 - 14×14×512
Layer17 Conv 3×3 512 1 1 2,359,808 14×14×512
Layer18 Conv 3×3 512 1 1 2,359,808 14×14×512
Layer19 Conv 3×3 512 1 1 2,359,808 14×14×512
Layer20 Conv 3×3 512 1 1 2,359,808 14×14×512
Layer21 MaxPool 2×2 512 2 0 - 7×7×512
Flattening 25088
Layer22 FC - - - - 102,764,544 4096
Layer23 FC - - - - 16,781,312 4096
Layer24 FC(Softmax) - - - - 4,097,000 1000
Activation function = ReLU & Total parameters ≈ 143.7 million
Notes:
1-This network was trained on 4 NVIDIA TITAN Black GPUs for 2-3 weeks.
2-This network is simple and uniform, as its structure repeats itself.
3-The difference between the 19- and 16-layer VGG is that instead of stacking 4 consecutive conv layers we stack only 3 conv layers.
4-It works well in both classification and localization tasks. (Localization is discussed later.)

Main concepts
1-Having deeper network using smaller kernels (filters).
2-Doubling number of kernels after each MaxPooling layer.
3-Using Mini-batch gradient descent. (This was discussed before in machine learning chapter)

What is the benefit of having deeper network using smaller kernels?


The VGG networks from Oxford were the first to use much smaller 3×3 filters in each
convolutional layers and also combined them as a sequence of convolutions.
The use of only 3×3 sized filters is quite different from AlexNet’s 11×11 filters in the first layer
or even 7×7 in ZFNet. The authors’ reasoning is that the combination of two 3×3 conv layers
has an effective receptive field of 5×5. This in turn simulates a larger filter while keeping the
benefits of smaller filter sizes.
But what are the benefits of having stacked, smaller-sized filters rather than having large-sized
one?
Multiple non-linear layers increase the depth of the network which enables it to learn more
complex features, we can understand that by observing a network passing by a ReLU function
of each convolutional layer which is better than passing by it only once with larger kernels.
So VGG shows that deeper networks gives better performance, moreover, it decreases number
of parameters concerning each layer.
For example: three 3×3 filters stacked on top of each other with stride 1 have a receptive field of 7, but the number of weights involved is 3×(3²C²) = 27C², compared to 7²C² = 49C² for a single 7×7 kernel (where C is the number of input and output channels).
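A quick check of this comparison in plain Python (C = 256 is an arbitrary choice, biases ignored):

C = 256
stacked_3x3 = 3 * (3 * 3 * C * C)   # three stacked 3x3 conv layers
single_7x7  = 7 * 7 * C * C         # one 7x7 conv layer with the same receptive field
print(stacked_3x3, single_7x7)      # 1,769,472 vs 3,211,264 weights: roughly 45% fewer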

Doubling number of kernels after each MaxPooling layer


The idea is simple: maxpooling is introduced to add robustness to noise and to help make the CNN translation invariant. However, you also don't want to throw away information contained in the image together with the noise. To compensate, after each pooling stage you double the number of channels of the convolutional layers. This means that you have twice as many "high level features", even if each of them contains half as many pixels. If your input image activates one of these high level features, its activation will be passed on to the following layers.
Note: This added flexibility adds a risk of overfitting, which they combat with the usual
techniques as L2 regularization and dropout explained in the previous section
GoogleNet (by Christian Szegedy from GOOGLE in 2014)
In contrast to the idea of keeping things simple, Google kind of threw that out the window with the introduction of the Inception module. GoogLeNet is a 22-layer CNN and was the winner of ILSVRC 2014 with a top-5 error rate of 6.7%. This was one of the first CNN architectures that really strayed from the general approach of simply stacking conv and pooling layers on top of each other in a sequential structure. The authors of the paper also emphasized that this new model places notable consideration on memory and power usage.

Architecture
It uses 9 inception modules and has over 100 layers in total, so it is complex but very deep.

You can notice that the architecture is simply a multiple of those inception cells in addition to
something called auxiliary loss.

#3×3 reduce and #5×5 reduce represent the number of 1×1 convs before each 3×3 & 5×5 conv respectively; pool proj represents the number of 1×1 convs after MaxPooling.
Main concepts
1-Introduce an idea called Network-in-Network (NiN).
2-Introduce the idea of inception modules (aka inception cells).
3-Usage of RMSProp optimization. (discussed before in previous chapter)

What is Network-in-Network (NiN) idea?


Network-in-network (NiN) had the great and simple insight of using 1x1 convolutions to
provide more combinational power to the features of a convolutional layers. The 1x1
convolutions (or network in network layer) provide a method of dimensionality reduction. For
example, let’s say you had an input volume of 100x100x60 (This isn’t necessarily the
dimensions of the image, just the input to any layer of the network). Applying 20 filters of 1x1
convolution would allow you to reduce the volume to 100x100x20. This can be thought of as a
“pooling of features” because we are reducing the depth of the volume, similar to how we
reduce the dimensions of height and width with normal maxpooling layers. Another note is that
these 1x1 conv layers are followed by ReLU units which definitely can’t hurt. 1×1 convolution
are used to spatially combine features across features maps after convolution, so they
effectively use very few parameters, shared across all pixels of these features!
What is inception module?
We can see here that the output of the previous layer
(MaxPool layer) is not passed to only one layer in sequential
manner as usual but passed to 4 different operations (3 conv
operations & 1 MaxPooling) so what is the idea behind this
modification ?
Basically, at each layer of a traditional ConvNet, you have to
make a choice of whether to have a pooling operation or a
conv operation (there is also the choice of filter size). What
an Inception module allows you to do is perform all of these
operations in parallel then concatenate. In fact, this was
exactly the “naïve” idea that the authors came up with.↓

But if we notice the photo on the right which is the final idea we will find that 3×3 conv and
5×5 conv were preceded by 1×1 convolution and also MaxPooling was followed by 1×1
convolution!! Why would they do that !! As we said in the Network-in-Network section that
1×1 convolution can be used for dimensionality reduction as if we moved along with the naïve
idea that authors came up with, It would lead to way too many outputs. We would end up with
an extremely large depth channel for the output volume. The way that the authors address this is
by adding 1x1 conv operations before the 3x3 and 5x5 layers. You may be asking yourself
“How does this architecture help?”. Well, you have a module that consists of a network in
network layer, a medium sized filter convolution, a large sized filter convolution, and a pooling
operation. The network in network conv is able to extract information about the very fine grain
details in the volume, while the 5x5 filter is able to cover a large receptive field of the input, and
thus able to extract its information as well. You also have a pooling operation that helps to
reduce spatial sizes and combat overfitting. On top of all of that, you have ReLUs after each
conv layer, which help improve the nonlinearity of the network. Basically, the network is able
to perform the functions of these different operations while still remaining computationally
considerate.
Notes:
1-This architecture reduced trainable parameters to just 4 million with over 100 layers.
2-The 1×1 conv layer added to reduce calculations is called a bottleneck layer.
Auxiliary loss (aka auxiliary classifiers)
By adding auxiliary classifiers connected to intermediate layers, we would expect to encourage discrimination in the lower stages of the classifier, increase the gradient signal that gets propagated back, and provide additional regularization. (Discrimination in the earlier stages means that even the intermediate feature maps should already be good enough to separate the classes, since a small classifier attached to them is also asked to predict the labels during training.)

Are there any versions of GoogleNet?


Yes, in fact there are several later versions of GoogLeNet (Inception), which are too much to explain in full here. Generally, v2 introduced the use of batch normalization after layer outputs to normalize the response; this helps training because the next layer does not have to learn offsets in the input data and can focus on how best to combine features.
v3 increases the depth and the width at each layer, and uses stacked smaller filters instead of larger filters to give the same receptive field with better results and fewer computations, as VGG introduced. This gives the new inception module shown here.
v3 also introduced decomposing filters into flattened (asymmetric) convolutions to build more complex modules.

There are more concepts introduced in the later versions.
ResNet (by Kaiming He et al from Microsoft in 2015)
The residual network introduced a 152-layer network architecture that set new records in classification, localization and detection tasks. It won ILSVRC in 2015 with an incredible error rate of 3.6%, which beats even human-level performance. Although it is a very deep neural network, thanks to the idea it brought it was able to have lower complexity than VGG.
As VGG has 16- and 19-layer architectures, ResNet has 18-, 34-, 50-, 101- and 152-layer architectures.
Architecture

Notes:
1-The architecture in the figure is the 34-layer one, so if you want the 152-layer one, map it from the table.
2-This model was trained on 8 GPUs.
3-Number of trainable parameters: ResNet-50 → 25.6 million, ResNet-101 → 44.5 million, ResNet-152 → 60.2 million.
Main concepts
1-It introduced the idea of residual blocks, which largely eliminates the degradation problem that appears as the number of layers increases. (We can now reach 1000 layers by using this idea.)
2-It used the idea of NiN together with the residual network.

What was the problem that residual block solved?


When deeper networks start converging, a degradation problem is exposed: with the network depth increasing, accuracy gets saturated and then degrades rapidly. Direct mappings are hard to learn, so let us instead learn an indirect mapping with residuals.

Residual block
The residual block is based on a skip connection: we feed forward the output of two or more successive convolutional layers AND also bypass the input around them to the next layers. F(x) is the output of passing through the two or more successive convolutional layers. So now, instead of passing F(x) to the next ReLU, we pass F(x) + x to the next ReLU (or any activation function in general).

What is the importance of residual block?


You can think of this residual function as a refinement step in which we learn how to adjust the
input feature map for higher quality features. This compares with a "plain" network in which
each layer is expected to learn new and distinct feature maps. In the event that no refinement is
needed, the intermediate layers can learn to gradually adjust their weights toward zero such that
the residual block represents an identity function.

How NiN idea is used along with the residual block?


This was used in ResNet-50, 101 & 152. ResNet with a large
number of layers started to use a bottleneck layer similar to the
Inception bottleneck: This layer reduces the number of features at
each layer by first using a 1x1 convolution with a smaller output
(usually 1/4 of the input), and then a 3x3 layer, and then again a
1x1 convolution back to a larger number of features. As in the case of Inception modules, this keeps the computation low while providing a rich combination of features.

Note:
The idea of residuals was used in Inception v4 as
shown
ResNeXt (by Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He in 2016)
A modified version of ResNet that was the runner-up in ILSVRC 2016. It replaces the standard residual block with one that leverages a "split-transform-merge" strategy (i.e. branched paths within a cell) as used in the Inception models. It is similar to the architecture of ResNet except for two things:
1) All 3×3 and 1×1 convs (not the bigger ones) are doubled in number of filters.
2) It uses a modified residual block, which is the main concept it adds.

What modification did it make to the residual network?


This may look familiar to you as it is very similar to the inception module, they both follow the
split-transform-merge paradigm, except in this variant, the outputs of different paths are merged
by adding them together, while in inception they are depth-concatenated. Another difference is
that in inception, each path is different (1×1, 3×3 and 5×5 convolution) from each other, while
in this architecture, all paths share the same topology.
The authors introduced a hyper-parameter called cardinality (the number of independent paths)
denoted by C, to provide a new way of adjusting the model capacity. Experiments show that
accuracy can be gained more efficiently by increasing the cardinality than by going deeper or
wider. The authors state that compared to Inception, this novel architecture is easier to adapt to
new datasets/tasks, as it has a simple paradigm and only one hyper-parameter to be adjusted,
while Inception has many hyper-parameters (like the kernel size of the convolutional layer of
each path) to tune.
Trimps-soushen network (in 2016)
This was the winner of ILSVRC in 2016 for classification as well as localization with error rate
=2.9% but I can't find a good resource for understanding this network.

DenseNet (by Huang et al. in 2016)


The DenseNet architecture is new; it is a logical extension of ResNet.
The ResNet architecture has a fundamental building block (identity) where you merge (additively) a previous layer into a future layer. The reasoning here is that by adding additive merges we are forcing the network to learn residuals (errors, i.e. the difference between some previous layer and the current one). In contrast, the DenseNet paper proposes concatenating the outputs from the previous layers instead of using summation. Note that the input of each layer consists of the feature maps of all earlier layers. The authors address the problem of vanishing gradients by ensuring maximum information (and gradient) flow.

Architecture

Difference between ResNet and DenseNet by equations?

ResNet uses summation: x_l = H_l(x_{l-1}) + x_{l-1}
DenseNet uses concatenation: x_l = H_l([x_0, x_1, ..., x_{l-1}])
(where H_l is the set of operations of layer l and [ ] denotes concatenation of the feature maps)
What is the importance of densely connecting (concatenating) each layer to all previous layers?
Firstly, it helps against vanishing gradients by letting earlier feature maps (and their gradients) flow directly to deeper layers. Other than tackling the vanishing gradients problem, the authors argue that this architecture also encourages feature reuse, making the network highly parameter-efficient. One simple interpretation of this is that, in ResNet, the output of the identity mapping was added to the next block, which might impede (hinder) information flow if the feature maps of two layers have very different distributions. Therefore, concatenating feature maps can preserve them all and increase the variance of the outputs, encouraging feature reuse.

By densely connecting each layer to all previous layers, the deeper layers would end up with a very large input size!
The authors controlled the size by making the number of filters added in each layer constant. This number of filters is denoted by k and is called the "growth rate", as it controls how fast the layers grow and prevents the network from growing too wide. Each successive layer has k more channels than the previous one (as a result of accumulating and concatenating all previous layers into the input).
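A hedged tf.keras sketch of a dense block with growth rate k (the number of layers and the value of k are illustrative assumptions):

import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=4, k=12):                      # k is the "growth rate"
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(k, (3, 3), padding="same")(y)      # each layer adds k new feature maps
        x = layers.Concatenate(axis=-1)([x, y])              # concatenation, not addition
    return x

inputs = tf.keras.Input(shape=(32, 32, 24))
outputs = dense_block(inputs)     # channels grow 24 -> 36 -> 48 -> 60 -> 72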

Notes:
1-They also used a 1x1 convolutional bottleneck layer to reduce the number of feature maps before the expensive 3x3 convolution.
2-DenseNet has a modified version with a compression factor θ. It is used to reduce the number of output feature maps: instead of having m feature maps at a certain (transition) layer, we will have θ×m. Of course, θ is in the range [0–1], so DenseNets remain the same when θ = 1. These DenseNets are called DenseNet-BC.
3-Dense connections have a regularizing effect, which reduces overfitting on tasks with smaller training set sizes.
SENet (by Jie Hu, Li Shen, Samuel Albanie, Gang Sun, Enhua Wu in late 2017)
The Squeeze-and-Excitation network was the winner of ILSVRC 2017, achieving an error rate of 2.25% in the classification task. This is the final network to be discussed here.

Architecture

Main concept
Increase channels inter-dependencies by squeeze-excitation module

What is meant by increasing channel inter-dependencies?


We know that CNNs use their convolutional filters to extract hierarchical information from images. Lower layers find trivial low-level pieces of context like edges or curves, while upper layers can detect faces, text or other complex geometrical shapes. They extract whatever is necessary to solve a task efficiently. In a standard CNN, each feature map (each channel) of each convolutional layer is weighted equally, which means that every feature map affects the result by the same amount as the other feature maps in the same conv layer. This means that the channels have low inter-dependencies, as all of them have the same weight, or effect, on the output. Here comes the idea of SENet: add a parameter (weight) for each feature map in the conv layer such that this block helps dynamically "excite" feature maps that help classification and suppress feature maps that don't. So SENets are all about replacing this equal weighting by a content-aware mechanism that weights each channel adaptively.

What is the mechanism used to weight each channel?

In its most basic form, this could mean adding a single parameter to each channel and giving it a linear scalar describing how relevant that channel is, as in: w1·featuremap1 + w2·featuremap2 + … + wn·featuremapn. However, this is very basic weighting, so the authors push it a little further. First, they get a global understanding of each channel by squeezing each feature map down to a single numeric value. This results in a vector of size n, where n is equal to the number of convolutional channels (feature maps). Afterwards, this vector is fed through a two-layer neural network, which outputs a vector of the same size that excites some feature maps. These n values can now be used as weights on the original feature maps, scaling each channel based on its importance.

How are squeezing and excitation done? In other words, what is the squeeze-and-excitation module?
Squeezing is done using average pooling of each feature map. We apply an AvgPool filter of the same size as each feature map so that the result is a single numeric value; that is why it is called global average pooling, as it results in a single value for each feature map.

Excitation is done using a two-layer NN as shown (FC → ReLU → FC → Sigmoid). A fully connected layer followed by a ReLU function adds the necessary nonlinearity; its output channel dimension is also reduced by a certain ratio. A second fully connected layer followed by a Sigmoid activation gives each channel a smooth gating value.

Now we have a vector with the same length as the number of feature maps, so we weight each feature map by its corresponding value.
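A hedged tf.keras sketch of the squeeze-and-excitation block described above (the reduction ratio of 16 is a commonly used default, not a requirement):

import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, ratio=16):
    n_channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                        # squeeze: one value per channel
    s = layers.Dense(n_channels // ratio, activation="relu")(s)   # excitation: FC then ReLU
    s = layers.Dense(n_channels, activation="sigmoid")(s)         # FC then sigmoid gate per channel
    s = layers.Reshape((1, 1, n_channels))(s)
    return layers.Multiply()([x, s])                              # re-weight each feature map

inputs = tf.keras.Input(shape=(28, 28, 64))
outputs = se_block(inputs)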

Notes:
1-The authors show that by adding SE-blocks to ResNet-50 you can expect almost the same
accuracy as ResNet-101 delivers. This is impressive for a model requiring only half of the
computational costs.
2-Use SE blocks with the later layers, as they are more specialized for each class.
Transfer learning (aka Inductive transfer)
Introduction
Humans have an inherent ability to transfer knowledge across tasks. What we acquire as
knowledge while learning about one task, we utilize in the same way to solve related tasks. The
more related the tasks, the easier it is for us to transfer, or cross-utilize our knowledge. Some
simple examples would be,
 Know how to ride a motorbike ⮫ Learn how to ride a car
 Know how to play classic piano ⮫ Learn how to play jazz piano
 Know math and statistics ⮫ Learn machine learning
In each of the above scenarios, we don't learn everything from scratch when we attempt to learn new aspects or topics. We transfer and leverage our knowledge from what we and others have learnt in the past! Conventional machine learning and deep learning algorithms, so far, have traditionally been designed to work in isolation. These algorithms are trained to solve specific tasks, and the models have to be rebuilt from scratch once the feature-space distribution changes. Transfer learning came along to end rebuilding from scratch and to make use of what others have already reached as a starting point. Transfer learning is not a machine learning model or technique; it is rather a 'design methodology' within machine learning.
What is meant by transfer learning?
Transfer learning is a machine learning design methodology where a model trained on one task
is re-purposed on a second related task by reusing the model trained on first task as the starting
point for a model on a second task. It is mostly used in computer vision and natural language
processing tasks. Hence instead of starting the learning process from scratch, you start from
patterns that have been learned from solving a related task.

What is the importance of transfer learning?


Imagine you’re trying to build a deep learning model but don’t have much training data. Maybe
you’re trying to identify a rare skin disease and only have 100 images. Meanwhile, someone
else has trained an image recognition model on 100,000 labeled photos of dogs and has
managed to get 96 percent accuracy at classifying different breeds. These tasks don’t seem
related, but that doesn’t mean the dog breed classifier is irrelevant. By training the model on a
ton of images, the weights in the different layers of the dog breed classifier are well tuned to
identify all kinds of useful visual features: edges, shapes, and different intensities of light and
dark pixels to name a few. It’s possible that a number of these features might even help you
identify the rare skin disease. In fact, to find out you’d just need to remove the final layer of the
dog breed classifier (the one that selects which breed has the highest probability based on the
input) and replace it with a new layer that makes a binary decision as to whether or not the skin
disease is present. To sum up, transfer learning has the advantages of making use of small
datasets, decreasing the training time needed, and often enhancing performance as well.
Simple visualization of the process

Transfer learning strategies in deep learning


we have two categories to define strategies, each category has several techniques:
Category1: Freezing vs. fine-tuning
This category defines whether:
 Use the weights of the pre-trained model as it is without modification (Freeze all layers
except for classification (last) layer)
 Use the pre-trained model as a feature extractor, which means you freeze the earlier layers
that learn the low-level and intermediate-level features of the data and only unfreeze the
later layers so they can be fine-tuned on your task's high-level features.
This approach is known as representation learning. The learned representation can then be used
for other problems as well. You simply use the first layers to spot the right representation of
features, but you don't use the later layers of the network because they are too task-specific.
Simply feed data into your network and use one of the intermediate layers as the output layer;
this layer can then be interpreted as a representation of the raw data.
(A code sketch of the freezing approach follows after these two categories.)
Category2: Train a model then reuse VS. Use Pre-trained
 Imagine you want to solve Task A but don’t have enough data to train a Deep Neural
Network. One way around this issue would be to find a related Task B, where you have
an abundance (a lot) of data. Then you could train a Deep Neural Network on Task B and
use this model as starting point to solve your initial Task A. If you have to use the whole
model or only a few layers of it, depends heavily on the problem you are trying to solve.
 Approach 2 would be to use an already pre-trained model. There are a lot of these models
out there, so you have to do a little bit of research. How many layers you reuse and how
many you train again depends, as already said, on your problem, and it is
therefore hard to form a general rule.
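As a concrete illustration of the freezing strategy from Category 1, here is a minimal PyTorch sketch, assuming torchvision's ImageNet-pretrained ResNet-50 and a hypothetical 2-class target task:

    import torch.nn as nn
    from torchvision import models

    # load a model pre-trained on ImageNet (assumed available through torchvision)
    model = models.resnet50(pretrained=True)

    # freeze all the pre-trained layers so their weights are not updated
    for param in model.parameters():
        param.requires_grad = False

    # replace the classification (last) layer with one for our 2-class problem;
    # only this new layer's parameters will be trained
    model.fc = nn.Linear(model.fc.in_features, 2)

If instead you want to fine-tune the later layers (the feature-extractor strategy), you would simply leave requires_grad = True for those layers while keeping the earlier ones frozen.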

Are there many pre-trained models available to be used?


Yes, there are many available pre-trained CNN for the most popular CNN architectures trained
on popular datasets as ImageNet, COCO and others.

Check this link for pre-trained model:[ http://pretrained.ml/]


Oxford VGG Model  [http://www.robots.ox.ac.uk/~vgg/research/very_deep/]
Google Inception Model  [https://github.com/tensorflow/models/tree/master/research/inception]
Microsoft ResNet Model  [https://github.com/KaimingHe/deep-residual-networks]
More pre-trained models  [https://github.com/BVLC/caffe/wiki/Model-Zoo]
Course 4 Week 3
Object localization
We have been seeing throughout all what we have taken that our main concentration was on
image classification where we have c classes and our network is responsible to tell each input
image belongs to which class. Localization, on the other hand, means not only telling which class the input
image belongs to but also giving the bounding box of the single object found in the
image. There is also another task called detection, where we have multiple objects in each
input image; these objects could belong to the same or different classes, so we have to
detect and localize each of them (as long as they belong to a pre-defined class).
To sum up:
1) Image classification: A single object to be classified (assign a class label to it).
2) Classification + Localization: A single object to be classified (assign a class label to it) and to
be localized (put a boundary box around this object).
3) Detection: Multiple objects of similar or different classes each to be classified (assign a class
label to each of them) and each to be localized (put a boundary box around each of them).

How to apply localization ?


We have discussed before that to do classification tasks, we go along the pipeline as
follows: Input image → Convnet → Softmax layer with c classes → predicted class probabilities.
This was the typical path to classify, so what modifications do we need to apply localization as well?
Instead of making the softmax output only the predictions representing the probabilities of each class, we
will add a few more outputs representing the bounding box (bx, by, bh & bw), where bx & by
represent the co-ordinates of the center of the object and bh & bw represent the height and
width of the bounding box respectively. Now, instead of having a dataset containing only class
labels for each training example (supervised learning), we will also have 4 values representing
the bounding box, so now the dataset will contain: [Image + Class label + bounding box of object].
Note: The notation convention to be used along the main course will be that the upper left
corner has co-ordinate of (0,0) and the lower right corner has co-ordinate of (image_width,
image_height) or (1,1) depending on the topic.
Check the following figure to conclude what said:

Defining target label y

y = [Pc, bx, by, bh, bw, C1, C2, …, Cc]ᵀ , where:

 Pc defines whether there is any object (= 1) or just background, i.e. no object (= 0).


 (bx, by, bh & bw) defines the boundary box where (bx, by) are the co-ordinates of the center
of the box and (bh & bw) are the height and width of the box.
 C1, C2, …, Cc are the pre-defined classes to classify the object in the image by putting 1
on the right class and 0s on the remaining.
Notes:
1-There is a single object in the image as it is a localization problem.
2- In our example at the top of the page, the number of classes is 3, not 4, as the background is not
a class; we have C1, C2, C3, while the background is identified by Pc, which tells whether there is an object or
just background.

In our example (the image contains a car): y = [1, bx, by, bh, bw, 0, 1, 0]ᵀ, given that C1 is pedestrian, C2 is car, C3 is motorcycle.

Note: If the image is a background so Pc=0 and all the remaining will be don't cares (?).
Loss function

ℓ(ŷ, y) = (ŷ1 − y1)² + (ŷ2 − y2)² + … + (ŷ5+c − y5+c)²   if y1 = 1 (there is an object)
ℓ(ŷ, y) = (ŷ1 − y1)²   if y1 = 0 (no object, so the remaining components are don't cares)

Note: 5+c represents the length or size of vector y so if we have 3 classes as our previous
example so we have y1=Pc & y2,…, y5=(bx, by, bh & bw) & y6, y7, y8 = C1, C2, C3.

Landmark detection
Instead of having bounding box with (bx, by, bh & bw) as an output to localize an object, here we
can have some cases where we need just x & y co-ordinates of some important points in an
image to be recognized, these points are called landmarks.
For example, assume you have a face recognition task where you want to
know where the corner of the right eye is; this is in fact a single co-ordinate
(x, y) that represents the corner of the eye (the red dot in the eye's corner)
So now our final layer will produce co-ordinates lx, ly to represent this
landmark.

Similarly if you want 4 landmarks that represents the 2 corners of both eyes so
now our final layer will have to produce (l1x, l1y), (l2x, l2y), (l3x, l3y), (l4x, l4y) to
represent the 4 landmarks (2 corners of each eye). 

More generally for face recognition we will have several landmarks to


represent everything in the face (eyes, nose, mouth, cheek points, face edges,
…) so that these landmarks represent the face features so our final layer will
produce (l1x, l1y), (l2x, l2y), …, (lnx, lny) to represent position of each landmark
assuming having n landmarks. 
Note: we have an additional output which tells whether there is a face in the image or not which
is similar to Pc in localization. So our final layer output size will be (n×2)+1 assuming having n
landmarks.

Of course, landmark detection is not used only for face recognition; it has
many other applications, for example a person's pose detection (walking,
running, bending, kicking, …) 

Real-world applications of landmark detection: Among the famous real-world applications is the
augmented reality used in Snapchat filters, which detects landmarks of the face to put the filter on
your face; also, face unlock in recent smartphones uses landmark detection.
Sliding window for object detection
We have said before that object detection is mainly having
multiple objects of similar or different classes in an image
and our task is to localize each object and assign class
label to it. The basic idea to do that is to have a dataset of
cropped images that is labeled to a specific class say we
have cars or no cars as a simple example as shown 
then train our network on this dataset using convnet
similarly to what we were doing in image classification
tasks (now training phase is over) then apply sliding window of specific size over the test image
and crop this window and pass it to the trained network to see whether there is a car or not in
this cropped window as shown below in the test image:

 … 

As shown above, we slide a window over the test image; each time we crop this window and pass
it to the convnet to say whether there is a car in the window or not, and if there is a car we save
the bounding box position relative to the test image as a whole.

But maybe the sliding window size was not right, and that is why it couldn't detect the car; so we
apply the sliding window several times, each time with a different window size:

This method has the problem of high computational cost: we iteratively run several sliding
windows with different sizes, each sliding window moves along the whole image, and each time
the cropped part is passed to a convnet, which is an expensive operation.
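To get a feel for why the cost blows up, here is a tiny sketch that just counts how many crops a naive sliding window produces (the image and window sizes are hypothetical):

    def sliding_window_count(image_size, window_size, stride):
        """Number of crops a naive sliding window produces on a square image."""
        steps = (image_size - window_size) // stride + 1
        return steps * steps

    # a 100x100 test image scanned with three different window sizes and stride 2
    for win in (14, 28, 56):
        print(win, sliding_window_count(100, win, stride=2))
    # prints: 14 1936, 28 1369, 56 529 -- and every single crop is a full convnet forward pass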
Before getting into how to solve this problem, let us cover another idea, then get back to the
solution of this problem, in which that idea will be used.
Turning Fully Connected (FC) layer into convolutional layer

Assume having a network as shown

Now we want to have a fully connected layer but in the form of a convolutional layer. What we do
is convolve the last layer before the required FC layer with n filters, each of the same dimension
(and number of channels, of course) as that layer, where n is the length of the FC layer. In our
example the layer before the first FC is of dimension 5×5×16, so we convolve it with 400 filters,
each of dimension 5×5×16; each convolution gives a 1×1 result (a single value), and since we
have 400 filters the final output is of dimension 1×1×400, as shown:

What about the next FC layer? Similarly, we look at the layer before it: its size is
1×1×400 and the size needed for the FC layer is also 400, so we convolve this layer with 400 filters,
each of size 1×1×400, so that the output is again of size 1×1×400, as shown:

Similarly with the softmax layer, which has 4 outputs (4 classes): to form this layer we
convolve the layer before it with 4 filters, each of dimension 1×1×400, to end up with an output
of dimension 1×1×4, as shown:
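A minimal PyTorch sketch of this FC-to-conv conversion for the 5×5×16 → 400 → 400 → 4 example above (the layer sizes follow the figures; the softmax itself is omitted since it does not change the shapes):

    import torch
    import torch.nn as nn

    fc_as_conv = nn.Sequential(
        nn.Conv2d(16, 400, kernel_size=5),   # first "FC" layer: 400 filters of size 5x5x16
        nn.ReLU(),
        nn.Conv2d(400, 400, kernel_size=1),  # second "FC" layer: 400 filters of size 1x1x400
        nn.ReLU(),
        nn.Conv2d(400, 4, kernel_size=1),    # class-score layer: 4 filters of size 1x1x400
    )

    x = torch.randn(1, 16, 5, 5)             # the 5x5x16 volume before the first FC layer
    print(fc_as_conv(x).shape)               # torch.Size([1, 4, 1, 1]) -> 1x1x4 as in the figure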
Now, how do we solve the problem of the high computational cost of sliding windows?
This is solved using the OverFeat method, which depends on convolutions. Assume we have input
RGB images of size 14×14 and the same training network we used before, as shown:
So now our window is of size 14×14. If we assume that our test image is of size 16×16 and
we want to apply this 14×14 sliding window with stride = 2 (meaning that each time we slide
the window over the test image we shift it by 2 pixels), we find that there are 4 possible
windows to be applied, as shown on the test image:

So what we were doing before is that each of these 4 windows crops the test image and passes
the cropped photo to the convnet to decide whether there is an object of the 4 classes or not.
But this is expensive computation, as we repeat it 4 times, once per window, keeping in mind that
this is a simple example with only 4 possible windows; if stride = 1 there would be 9 possible
windows, and if the image were larger and the sliding window smaller there would be many more.
What the OverFeat method says is to apply the 16×16 test image as it is, without any cropping,
to the same pre-trained network; the output will contain all 4 possible outcomes of the 4 windows
internally, so the final output will be 2×2 instead of 1×1, and we get 4 output vectors, each
representing the output of one of the windows, as shown:

So the vector with the red arrow is for the red window, the vector with the green arrow is for the
green window, and so on. So now we have 4 vectors (representing the 4 windows), each of length
4 (representing the 4 classes); instead of running the same path 4 times, once per window, we now
run it just once with the same result.
Similarly, if the test image is 28×28 and we have the same 14×14 sliding window slid with
stride = 2, we now have 64 possible windows; instead of doing 64 passes through the trained
network, each with one of the 64 cropped windows, we pass the image once and get the whole
output at once, as shown:
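Here is a small sketch of the whole "convolutionalized" network from the 14×14 example (the conv/pool sizes are those assumed in the figures: a 5×5 conv with 16 filters and a 2×2 max pool); feeding larger test images produces all the window outputs in a single forward pass:

    import torch
    import torch.nn as nn

    net = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=5),     # 14x14x3 -> 10x10x16
        nn.MaxPool2d(2),                     # -> 5x5x16
        nn.Conv2d(16, 400, kernel_size=5),   # FC layer as conv -> 1x1x400
        nn.ReLU(),
        nn.Conv2d(400, 400, kernel_size=1),  # FC layer as conv -> 1x1x400
        nn.ReLU(),
        nn.Conv2d(400, 4, kernel_size=1),    # 4 class scores -> 1x1x4
    )

    print(net(torch.randn(1, 3, 14, 14)).shape)  # [1, 4, 1, 1]: one window (the training size)
    print(net(torch.randn(1, 3, 16, 16)).shape)  # [1, 4, 2, 2]: the 4 windows, in one pass
    print(net(torch.randn(1, 3, 28, 28)).shape)  # [1, 4, 8, 8]: the 64 windows, in one pass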

Bounding boxes predictions


Sliding windows is still not a very efficient way of predicting
an accurate bounding box, as there may be no window position
that perfectly matches the object, so we will now discuss how to
make these boxes more accurate and better matched to the object.
What we do is apply grid cells to the image in
our algorithm; say, for example, we apply a 3×3 grid to our
100×100 image, as shown 
Now we apply what we have done in the classification + localization part for each grid cell, so
for each grid cell we will have a training label y, where y = [Pc, bx, by, bh, bw, C1, C2, C3]ᵀ,
and as we have 9 grid cells we will have 9 y's, each representing 1 grid cell.


Each of the 7 yellow grid cells has no object (meaning that no object's center lies in these grid
cells), and the 2 green grid cells are the ones having objects inside them (meaning that an
object's center lies in these grid cells). So the 7 yellow grid cells will have
y = [0, ?, ?, ?, ?, ?, ?, ?]ᵀ, where (?) means don't care, while the 2 green grid cells will have
y = [1, bx, by, bh, bw, 0, 1, 0]ᵀ, assuming that we have 3 classes and class 2 is the car class.
So now the output size is 3×3×8.
So now, during training, we pass an input image (with a grid of any size, 3×3 in our example)
into the conv network and get an output as shown above, including a more precise bounding box
with any aspect ratio, unlike the sliding window, which has a specific aspect ratio determined by
the size of the window used all over the image.

Notes:
1-There is a problem with this implementation in the case that more than 1 object's center lies in
the same grid so only 1 object will be identified and the remaining will be neglected.
2-As the grid size gets smaller (instead of 3×3 we can have 19×19) the chance that more than 1
object center lies in the same grid cell decreases.
3- The process on one image is done at once which means that I'm not applying each grid cell to
a convnet to apply classification and localization but the whole image is passed at once as we
have done with the sliding window using convolutional layers.
How bx, by, bh, bw are represented?
1- bx and by are values between 0 and 1 that represent how far each co-ordinate is from the
upper-left corner of the grid cell, relative to the size of the grid cell (the cell size is always
taken to be 1). This means that if bx = 0.5 then the x co-ordinate of the center of the bounding
box of the object in this grid cell is half the way from the left side of the grid cell → the right
side of the grid cell, and if by = 0.7 then the y co-ordinate of the center of the bounding box is
7/10 of the way from the upper side of the grid cell → the lower side of the grid cell.
2- bh and bw can be any value larger than 0, so if bh equals, say, 0.9 this means that the height
of the bounding box of the object in this grid cell is 0.9 of the height of the grid cell
(0.9 × height of the grid cell), and similarly for the width. So if bw is larger than 1 (say 1.5)
then the width of the bounding box is 1.5 times the width of the grid cell (1.5 × width of the
grid cell).

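A small sketch of this encoding (a hypothetical helper; the box is given in image-relative [0, 1] co-ordinates and a square grid is assumed):

    def encode_box(center_x, center_y, box_w, box_h, grid_size=3):
        """Encode a box given in image-relative [0,1] co-ordinates into grid-relative values."""
        col = int(center_x * grid_size)      # grid cell column containing the box center
        row = int(center_y * grid_size)      # grid cell row containing the box center
        bx = center_x * grid_size - col      # offset of the center inside the cell, in [0, 1)
        by = center_y * grid_size - row
        bw = box_w * grid_size               # width/height measured in units of the cell size
        bh = box_h * grid_size               # (can be larger than 1 if the box exceeds the cell)
        return row, col, bx, by, bh, bw

    # reproduces the values used above: bx=0.5, by=0.7, bh=0.9, bw=1.5 (up to float rounding)
    print(encode_box(0.5, 0.9, 0.5, 0.3))    # (2, 1, 0.5, ~0.7, ~0.9, 1.5)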
Note:
All what we have said in this section (Bounding box prediction) is introduced mainly in what is
called by YOLO algorithm that will be discussed later after summing all the ideas introduced in
it.

Intersection over union (IoU)


This is an evaluation metric for the accuracy of the predicted boundary box by comparing its
overlapping with the actual box (aka ground truth box) during training, validation and testing
phases.
Assume having this image where the ground
truth box is the red box and the predicted box
from your network is the purple one, so how
accurate is this predicted box?

IoU = (area of intersection of the two boxes) / (area of union of the two boxes)

So if IoU = 0 there is no overlap between the boxes, and if IoU = 1 the predicted box perfectly
matches the ground truth (actual) box. We use IoU to decide how much to penalize our algorithm
by comparing it to a threshold (say 0.5): if IoU is greater than the threshold this is a good,
acceptable predicted bounding box, and otherwise (< threshold) we should penalize our algorithm
for this bad prediction.
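A minimal sketch of the IoU computation for two axis-aligned boxes given by their (x1, y1, x2, y2) corners (a hypothetical helper, not tied to any particular framework):

    def iou(box_a, box_b):
        """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
        # corners of the intersection rectangle
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)      # 0 if the boxes do not overlap
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)           # intersection / union

    print(iou((1, 1, 4, 4), (2, 2, 5, 5)))                 # 4 / (9 + 9 - 4) = 0.2857...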
Non-Max suppression
Non-max suppression is used to keep only one box per object.
As we said previously, we could use different sliding windows
to find the object, and these sliding windows would in fact
detect the same object several times; more generally, with any
other detection technique there could be several boxes
detecting the same object, as shown in the figure 
So how can we make our algorithm choose only the best box for each object and reject the
others? Note that the value written on each box is the highest probability among the classes of
the network: if a network has 4 classes (cats, dogs, persons & cars), each class has its own
probability, as said before in the softmax regression discussion, and the highest probability
(called Pc) corresponds to the class predicted by the network for this box, so we take that
highest probability and write it along with the bounding box. Let us get back to the non-max
suppression algorithm:

Algorithm
1- Discard all boxes associated with low probabilities (Pc < threshold, say 0.6).
2- While there are remaining boxes, loop over the following:
 Take the box with the highest probability (Pc) among all the remaining boxes and keep it.
 Compute the IoU between this kept box and all the remaining boxes, and discard any
remaining box with IoU > threshold (say 0.5), because this most probably means those
boxes are detecting the same object as the kept box.
 Repeat the 2 previous steps on the non-discarded boxes (IoU < 0.5), because these boxes
most probably belong to another object in the image.

Example
Now let us work through the example at the top of the page. First
we pick the highest probability among all boxes (0.9) and keep it
(shown in white). Then we compute the IoU between it and the
remaining 4 boxes and find that 2 of them (the 0.6 and 0.7 boxes)
have high IoU with this white box, so we discard them.
Now 2 non-discarded boxes remain (on the left), so we take the
higher of these 2 (the 0.8 box, in white) and compute the IoU with
the remaining 1 box (the 0.7 box on the left); it has high IoU, so we
discard it. Eventually there are no remaining boxes to process, so
the final output is the 2 white boxes with class probabilities
0.9 and 0.8.
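A minimal sketch of the algorithm above (it reuses the hypothetical iou helper from the IoU section; the 0.6 and 0.5 thresholds are the illustrative values used here):

    def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
        """boxes: list of (x1, y1, x2, y2); scores: list of Pc values. Returns kept indices."""
        # step 1: discard boxes with low confidence
        remaining = [i for i, s in enumerate(scores) if s >= score_threshold]
        remaining.sort(key=lambda i: scores[i], reverse=True)   # highest confidence first
        kept = []
        while remaining:
            best = remaining.pop(0)                             # keep the most confident box
            kept.append(best)
            # discard remaining boxes that overlap the kept one too much (same object)
            remaining = [i for i in remaining
                         if iou(boxes[best], boxes[i]) < iou_threshold]
        return kept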
Anchor boxes
We addressed a problem before and skipped it: each grid cell in the image can only hold 1
object, so if there is more than 1 object the rest will be neglected. Here comes the idea of
anchor boxes, which makes it possible to get all the objects whose centers lie in the same
grid cell.
Assume we have an image like the one shown, with 2 objects
(a person and a car) whose centers both lie in the same grid cell.
Anchor boxes are introduced so that each of them has a different
aspect ratio, so if we assume we have 2 anchor boxes (there could
be more, as you like), y is no longer the single 8-value vector
[Pc, bx, by, bh, bw, C1, C2, C3]ᵀ but is doubled to length 16: the first 8 values are for anchor
box 1 and the last 8 values are for anchor box 2.

So now each part of the output detects objects that can mostly be represented by a box with the
aspect ratio of the corresponding anchor box; in our example, anchor box 1 will mostly fit the
person and anchor box 2 will mostly fit the car. Note that in each grid cell we can't have more
object centers than the number of anchor boxes (if the number of anchor boxes = 2, the
maximum number of object centers that can be detected within one grid cell is 2), which is why
we use a small grid-cell size (19×19) with a high number of anchor boxes (around 5), so that
there is almost no chance of missing any object in the image.
Note: we determine the best fit anchor box based on the highest IoU
between the anchor boxes (purple) and the ground truth box (red) 

Output size now (assuming grid cells 3×3) is [3×3× no. of anchor boxes × (5 + no. of classes)]

Our example: 3×3 × 2 anchor boxes × (5 + 3 classes), i.e. an output volume of 3×3×16.
You Only Look Once (YOLO) Algorithm
In fact, most of the topics discussed after the sliding window topic were introduced in the YOLO
algorithm paper, so in this part we gather all we have said into a single algorithm called YOLO.

For training:

In training we have images labeled as shown, assuming 3×3 grid cells, 2
anchor boxes and 3 classes, so the output is 9 vectors, each of length 16 (2 anchor boxes × (5+3)),
such that for the given training example there will be 8 vectors of just 0's and don't cares (?),
while there will be 1 vector with 0 for the first anchor box and 1 for the second, as the ground
truth box is wider than it is tall, so its IoU with anchor box 2 is greater.
You will notice in the bottom right of the image that we can use 19×19 grid cells and also we
can use 5 anchor boxes (5 anchor boxes × (5+3) = 40).
After this image passes through the network we will have an output of (3×3×2×8) as shown.
Don't forget that in the test phase we will do non-max suppression to discard boxes pointing at
the same object, as well as discarding boxes with low probability. Take into consideration that
all grid cells will provide boxes (their number equals the number of anchor boxes), and
non-max suppression is the technique that takes care of discarding these boxes, especially since
most of them will have low probabilities and some will point to the same object.

Example: Assuming we have 2 anchor boxes


First image with all boxes (2 boxes for each grid cells).
Second image is after discarding boxes with low probability using non-max suppression.
Third image is after discarding boxes pointing to the same object using non-max suppression.

 

Note:
The YOLO algorithm is usually used for real-time applications, as it has a fast run time
compared to other algorithms.
Region Proposals
This idea is the main basis of some networks as R-CNN, Fast R-CNN, Faster R-CNN,…
As we said before, sliding windows is somewhat inefficient and time consuming, as we need to
slide the window over the whole image, so most probably a lot of the windows will be
background that doesn't contain any object, and this is a waste of time and resources. What R-
CNN did is apply region proposals: we still
have windows, but fewer and more
reasonable ones, and to determine these
reasonable windows, a segmentation
algorithm is used (right photo) 
As we see here, segmentation gives
some blobs with different colors, and the
windows generated are placed around those
blobs to check whether they contain an object or not; the number of windows here is of course much
less than would be generated by a normal sliding window algorithm.
R-CNN was too slow, so Fast and Faster R-CNN were introduced.
Localization and detection, Overfeat, Image Segmentation, R-CNN and its variants, SSD,
YOLO and its variants, other algorithms
Localization VS. Detection VS. Instance segmentation VS. Semantic segmentation
We all know the classification task: there is an object in an image and your network
has to give the probability of each of your pre-defined classes for this image, such that the
class with the highest probability is the predicted class of the object in the image. But
what if we want not just to say what the class of the object is, but also
to detect its position in the image? Is that all? No: what if there are multiple objects in the image
and you want to detect the position of each object and then classify it? Here
come some terminologies like localization, detection & segmentation.
Localization: it refers to having a single object in the image, so we have one object in the image
and want to find its position by drawing a bounding box around the object and classify it.
Detection: it refers to having multiple objects in the same image (objects could be of the same
class or different classes), and your task is to determine the position of each object by drawing a
bounding box around it and classify it.
Instance segmentation: this means detecting each object's outline at a pixel level and classifying it,
such that if we have 3 objects (2 cars and 1 person) we have to detect each of the cars and the
person separately at pixel level.
Semantic segmentation: it refers to per-pixel classification, which means we are not saying
which instance each object pixel belongs to, but rather which class every single pixel
belongs to.
Note: Instance-based means we deal with each object as a different instance even if they
belong to the same class, while in semantic segmentation or detection (non-instance based),
if we find 2 objects of the same class we put a bounding box of the same color around each
object (in case of detection) or color their pixels with the same color (in case of semantic
segmentation). (In semantic segmentation all balloons are pixel-colored with the same
color; in detection all balloons get bounding boxes of the same color; in instance segmentation each
balloon is colored differently as a separate instance although they are all of the same class "balloon".)
Before going deeply into the recent CNN models used in deep learning for object detection, such
as R-CNN and its variants, SSD, YOLO, etc., let us think about object detection approaches starting
from the most basic naïve idea and follow their progress until we reach the recent CNN models.
Different approaches for solving object localization and detection
Assume the example we will work with along
this part is hypothetically building a pedestrian
detection system for a self-driving car, supposing
your car captures an image like this one 
What is our target? Creating a bounding box
around these people, so that the system can
pinpoint where in the image the people are, and
then accordingly decide which path to take in
order to avoid any mishaps.

Our objective behind doing object detection is twofold:


 To identify what all objects are present in the image and where they’re located
 Filter out the object of attention

Approach1: Naïve way (Divide and conquer)


The simplest approach is to divide our image into four parts, as shown, then pass each part to a
classifier to tell whether it contains a pedestrian or not.

Say the classifier found that the upper-right part of the image contains pedestrians; it will then
draw a bounding box including this whole part as a region to be avoided by the car, as shown:

Good as a starting point, but we still need more precise bounding boxes around the objects.
Approach 2: Increase the number of divisions
Instead of having only 4 patches with fixed sizes to give to our trained classifier, let us take
several patches with different sizes to be passed, and store each patch's size and position to act
as bounding boxes, as shown:

A lot of boxes!! But the good thing is that we now have some boxes closer to the two objects
than with the previous method. So we are somewhat closer to finding more precise and
accurate bounding boxes.

Approach3: Perform structured divisions


for having a more structured division, let us
divide our image into a grid (say
10×10), as shown: 

Define the center of each grid cell, then for each


center take several patches (say 3) of
different aspect ratios as shown:

Pass every patch of every grid cell to the


classifier to get predictions and save the
boundary boxes of the classified patches
and let us see how they look on the image
as shown

Fewer bounding boxes than in the previous approach, and closer to the correct bounding boxes.
We are a few steps closer to precise bounding boxes. 

Approach4: More efficient structured divisions


Instead of having 10×10 grids, what about
20×20 as shown:
Instead of only 3 patches, let there be 9 or even more with
more aspect ratios, so now we have tighter
regions and better bounding boxes.

But wait, this is very computationally expensive, as we now

have more patches to be passed to the trained classifier!!
The solution is that, instead of passing all patches, we can
have an intermediate classifier that leads the way to the place that most probably contains the
object, and only in that part do we take this high number of patches to be passed.
Everything we have described so far lacks the power of deep learning and CNNs; it just gives
intuition about how we could think about solving the object detection problem and how approaches
developed before reaching this point, which is actually like the sliding windows we
have discussed before in the main course. Even the problem of high computational cost of
sliding windows with a CNN is solved nicely by the OverFeat method discussed before,
which generated the idea of using conv layers instead of FC layers to perform detection and
localization (check the main course); the algorithm in general will be discussed after this
part.

Let us take a look on deep learning as a whole in detection problems before going with CNN.

Approach 5: Using deep learning for feature selection and to build an end-to-end approach
Deep learning has so much potential in the object detection space. Can you recommend where
and how we can leverage it for our problem? I have listed a couple of methodologies below:
 Instead of taking patches from the original image, we can pass the original image through
a neural network to reduce the dimensions
 We could also use a neural network to suggest selective patches
 We can reinforce a deep learning algorithm to give predictions as close to the original
bounding box as possible. This will ensure that the algorithm gives more tighter and finer
bounding box predictions
Now instead of training different neural networks for solving each individual problem, we can
take a single deep neural network model which will attempt to solve all the problems by itself.
The advantage of doing this is that each of the smaller components of the neural network will
help in optimizing the other parts of the same
neural network. This will help us in jointly
training the entire deep model.
Our output would give us the best performance
out of all the approaches we have seen so far,
somewhat similar to this image. 
We have the normal path for applying classification as we have learnt:
Input image → stack of conv layers → pooling → stack of conv layers → pooling → … → stack
of fully-connected (FC) layers → fully connected softmax layer.
The softmax layer holds class scores (probabilities) that determine each class probability, such
that the class with the highest probability is the predicted class. However, that is all for
classification; what about localization? What modifications do we need to perform localization,
so that we can then generalize to detection and segmentation? Here comes the first published
neural-net-based object localization architecture, called OverFeat, which integrated
classification, localization and detection in one conv network in 2013, noting that AlexNet in
2012 was supposed to do object localization with some modifications but the authors didn't
describe the implementation.

Overfeat (by Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus &
Yann LeCun in 2013/2014)
This was the winner of ILSVRC 2013 in localization task as well as detection task and ranked
fourth in classification task. The main point of this paper is to show that training a
convolutional network to simultaneously classify, locate and detect objects in images can boost
the classification accuracy and the detection and localization accuracy of all tasks. The paper
proposes a new integrated approach to object detection, recognition, and localization with a
single ConvNet.


If we remove the bounding box regressor, it will be the typical image classification architecture
we knew previously; what has been done here is adding a bounding box regressor for object
localization. Regression is about returning a number instead of a class; in our case we're going
to return 4 numbers (x, y, width, height) that describe a
bounding box. You train this system with an image and a ground truth bounding box, and use
the L2 distance to calculate the loss between the predicted bounding box and the ground truth.
Normally you attach another fully connected layer to the last convolution layer to do
classification, but for OverFeat we also attach a regression part at the last convolution layer to do
localization. The algorithm here is to train as a classifier and then train the localizer regressor, as
follows:
 Do image classification at different locations on regions of multiple scales of the image in
a sliding window fashion so there are FC layers as normal.
 Predict the bounding box locations with a regressor trained on top of the same
convolution layers.
Note: Normal sliding window (not the one used in overfeat) means that we move window over
the image and slide it such that each time we crop the image at the window position so we are
not seeing anything in the image except for this window. Sliding is done across the whole
image from the top left corner until it reaches the bottom right corner, then repeated iteratively
with different window sizes. This was discussed in detail with images in the main course.

Back to our algorithm: in case the idea is not clear yet, let us
see the steps of the training process, noting that the OverFeat
model architecture is similar to the AlexNet model:
 Train a CNN model (similar to AlexNet) on the image classification task.
 Then, we replace the top classifier layers by a regression network and train it to predict
object bounding boxes at each spatial location and scale. The regressor is class-specific,
each generated for one image class.
o Input: Images with classification and bounding box.
o Output: (x, y, width, height), 4 values in total, representing co-ordinates of the
center of the box in addition to width and height of this box
o Loss: The regressor is trained to minimize L2 norm between generated bounding
box and the ground truth for each training example.

At detection time, we proceed similarly: classification followed by localization:


 Perform classification at each location using the pretrained CNN model.
 Predict object bounding boxes on all classified regions generated by the classifier.
 Merge bounding boxes with sufficient overlap from localization and sufficient confidence
of being the same object from the classifier.
Generally we won't spend more time discussing concepts beyond what we have said here, in
addition to the part in the main course about decreasing the computational cost, as it is a
very basic technique for detection; but first, let us define the performance metric for detection.
Performance metric for object localization/detection
Before reading this make sure to revise the
precision and recall concept in machine
learning chapter concerning evaluation
metrics and revise IoU from the main course
in this chapter. Note: IoU 

Now let us start:


mAP, or mean average precision, is a somewhat involved evaluation metric used to evaluate an
object detection algorithm on a specific dataset. It relies on combining the concepts of
precision and recall (for classification) and IoU (for detection), as we will discuss, noting that the
exact mAP calculation can vary between competitions such as COCO, Pascal VOC,
ImageNet, etc. We will start with the average precision (AP) on one image, then take the mean over the dataset.

Computing the Average Precision (AP) for a particular object detection pipeline is essentially a
three step process:
 Compute the precision which is the proportion of true positives.
 Compute the recall which is the proportion of true positives out of all possible positives.
 Average together the maximum precision value across all recall levels in steps of size s.
To compute the precision, we first apply our object detection algorithm to an input image. Each
bounding box normally has something called a confidence score. The bounding boxes are
then sorted in descending order by their confidence, and all boxes with confidence scores less
than a certain threshold (which differs from one competition to another) are
discarded.
We know from prior knowledge (i.e., it's a validation/testing example and we therefore know
the total number of objects in the image) that there are 5 objects of a specific class in this image. We
seek to determine how many “correct” detections our network made. A “correct” prediction
here is one where we have a matching classification with the ground truth label and minimum
IoU of 0.5 (this value is tunable depending on the challenge but 0.5 is a standard value).
Here is where the calculation starts to
become a bit more complicated. We need
to compute the precision at
different recall values (also called “recall
levels” or “recall steps”).
For example, assume that after ordering the
bounding boxes according to their
confidence scores, this was the output:
If we take the box with the highest confidence score (rank #1), it was correct, which means it is
correctly classified and has IoU > 0.5. So the only prediction we have made is right, hence
precision is 1/1 = 1, and for recall we have detected 1 object out of the 5 objects of this class we
know are in the image, so recall = 1/5 = 0.2.
Let us take the second highest confidence score (rank #2); it was also a correct prediction, so
out of the 2 predictions we have made there are 2 correct ones, hence precision = 2/2 = 1, and
recall is now detecting 2 objects out of the 5 objects we need to detect, hence recall = 2/5 = 0.4.
Let us now take the third highest confidence score (rank #3); it was a false prediction, which
means it is either falsely classified or has IoU < 0.5. This means that out of the 3 predictions so
far we have 2 correct ones and 1 false one, so precision is 2/3 = 0.667, while recall is still 2 out
of the 5 objects we are required to get; it stays at 0.4 because this prediction (the third highest
confidence score) was not correct.
Similarly we go along the top N predictions (usually 10) and compute the precision and
recall at each rank.

You will notice that as we go further down the table, the recall increases, as we are more likely to
have found the next needed object, until it reaches 1 if we found all objects.
Let's now plot precision versus recall of this class (the P-R curve) for this image.
We always use recall at 11 steps:
ȓ = (0, 0.1, 0.2, 0.3, …, 0.9, 1).
But this plot is very zigzag-shaped, so
we smooth it by replacing each
precision value with the maximum
precision for any recall ≥ ȓ.
For example, at ȓ = 0.4 we look at all
precision values for ȓ = 0.4 or higher
(ȓ = 0.4, 0.5, 0.6, …, 1) and take the
highest value; therefore for ȓ = 0.4 we stick with precision = 1. Similarly, for ȓ = 0.5 we look at all
precision values with ȓ = 0.5 or larger and find that the highest precision available is 0.57.
Note that at ȓ = 0.4 the highest precision is 1, but for ȓ = 0.400000001 the highest precision will be
0.57, so there is a falling edge from 1 to 0.57 at ȓ = 0.4. Let us now see the new P-R curve.
Now, how do we calculate AP?
AP (average precision) is computed as the average of the maximum (interpolated) precision at
these 11 recall levels:
AP = (1/11) × Σ over ȓ ∈ {0, 0.1, …, 1} of p_interp(ȓ)
This is close to finding the total area under the green curve and dividing it by 11.
For a more precise definition: p_interp(ȓ) = max over r̃ ≥ ȓ of p(r̃), i.e. the interpolated precision
at recall level ȓ is the maximum precision measured at any recall r̃ ≥ ȓ.
The final recall and max precision for each value of recall is shown in this table.
So now the AP for our example is (5 × 1.0 + 4 × 0.57 + 2 × 0.5) / 11 = 0.7527.
We now have our average precision for a single evaluation image for a single class. We repeat
this for every class in the image, and then we calculate the APs for all images in the
testing/validation dataset. Now we have a list containing an AP for each class for each image.
We do 2 more steps to calculate the mean of the APs:
 Compute the mean of the APs of each class over the images, giving us a mAP for each
individual class (for many datasets/challenges you'll want to examine the
mAP class-wise so you can spot if your deep learning object detector is
struggling with a specific class):
mAP(class i) = (1/n) × Σ over the images of AP(class i), where n is the number of images in the dataset.
Note: if an image doesn't contain class i, its AP is taken to be 1.
 Take the mAPs of the individual classes and average them together, yielding the final
mAP for the dataset: mAP = (1/C) × Σ over the classes of mAP(class i), where C
is the total number of classes.
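A minimal sketch of the 11-point interpolated AP described above (a hypothetical helper; recalls and precisions are the per-rank values from a table like the one in the example, and the toy numbers in the call are not the example's full table):

    def eleven_point_ap(recalls, precisions):
        """11-point interpolated average precision for one class on one image."""
        ap = 0.0
        for level in [i / 10 for i in range(11)]:                 # recall levels 0.0, 0.1, ..., 1.0
            # max precision over all ranked points whose recall >= this level (0 if none)
            candidates = [p for r, p in zip(recalls, precisions) if r >= level]
            ap += max(candidates, default=0.0)
        return ap / 11

    # toy per-rank values: recall and precision at each of 4 ranked predictions
    print(eleven_point_ap([0.2, 0.4, 0.4, 0.6], [1.0, 1.0, 0.67, 0.5]))   # ~0.545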

Notes:
1-PASCAL VOC is a popular dataset for object detection. For the PASCAL VOC challenge, a
prediction is positive if IoU > 0.5. However, if multiple detections of the same object are
detected, it counts the first one as a positive while the rest as negatives.
2-For COCO, AP is the average over multiple IoU (the minimum IoU to consider a positive
match). AP@[.5:.95] corresponds to the average AP for IoU from 0.5 to 0.95 with a step size of
0.05. For the COCO competition, AP is the average over 10 IoU levels on 80 categories
(AP@[.50:.05:.95]: start from 0.5 to 0.95 with a step size of 0.05).
3-Averaging over the 10 IoU thresholds rather than only considering one generous threshold of
IoU ≥ 0.5 tends to reward models that are better at precise localization.
Region-based convolutional neural network (R-CNN) and its variants (Fast & Faster R-CNN)
These papers were written by Ross Girshick and colleagues in 2014, 2015 & 2016.
As we said before, object detection started with the simple idea of sliding windows along the
whole image several times, each time with different window sizes and aspect ratios. We also
said that sliding each window size along the whole image has a high computational cost; this
was addressed with the OverFeat method by changing each FC layer to a conv layer, but we
still have the main problem of needing to try different window sizes and aspect ratios. In R-CNN,
the algorithm is split into 3 main processes: a region proposal step, a feature extraction step
using a CNN, and a classification/regression step.
They used the idea of selective search so that each
image needs no more than about 2000 proposals
(windows) on which convolutional operations are
performed to extract features. In R-CNN,
the CNN architecture used for extracting features
from each of the ~2000 proposed regions was
AlexNet. The output of the convnet (a fixed-length
feature vector for each region) is passed to 2
places, as we said before: a classification network for
classification and a regressor for bounding boxes. The
classification network is a set of linear SVM
classifiers, one trained per class, that output the classification, while the bounding box
regressor is a linear regressor for tightening the bounding boxes. Non-max suppression is then used
to suppress bounding boxes that have significant overlap with each other (discussed in the main
course; nothing more to be said about it).
Let us now go deeper into how the image segmentation is done, then go back to see
how it is used in the selective search algorithm for region proposals.
Image segmentation (Felzenszwalb’s Algorithm):
When there exist multiple objects in one image (true for almost every real-world photos), we
need to identify a region that potentially contains a target object so that the classification can be
executed more efficiently.
Felzenszwalb and Huttenlocher (2004) proposed an algorithm for segmenting an image into
similar regions using a graph-based approach. It is also the initialization method for Selective
Search (a popular region proposal algorithm) that we are gonna discuss later.
Say we use an undirected graph G = (V, E) to represent an input image (remember the graph theory
mentioned in the computational graph part of the machine learning chapter).
One vertex vi ∈ V represents one pixel. One edge e = (vi, vj) ∈ E connects
two vertices vi and vj. Its associated weight w(vi, vj) measures the
dissimilarity between vi and vj.
The dissimilarity can be quantified in dimensions like color, location, intensity, etc. The higher
the weight, the less similar two pixels are. A segmentation solution S is a partition of V into
multiple connected components, {C}. Intuitively similar pixels should belong to the same
components while dissimilar ones are assigned to different components.
A connected component (or just component) of an undirected
graph is a subgraph in which any two vertices are connected to each
other by paths, and which is connected to no additional vertices in
the supergraph. For example, the graph shown in the illustration has
three connected components. A vertex with no incident edges is
itself a connected component. A graph that is itself connected has
exactly one connected component, consisting of the whole graph.
The connected components of the graph are taken to be the segments in
the image segmentation (so we will refer to segments and connected components
interchangeably).

There are two approaches to constructing a graph out of an image:


 Grid Graph: Each pixel is only connected with surrounding neighbours (8 other cells in
total). The edge weight is the absolute difference between the intensity values of the
pixels.
 Nearest Neighbor Graph: Each pixel is a point in the feature space (x, y, r, g, b), in which
(x, y) is the pixel location and (r, g, b) is the color values in RGB. The weight is the
Euclidean distance(revise it) between two pixels’ feature vectors. (we use this)
Before going to the algorithm of image segmentation, let us define some key concepts (the central
one is the internal difference Int(C) of a component C, defined as the largest edge weight in its
minimum spanning tree MST(C, E)):
Some additional info to understand the key concepts:
 MST(C, E) ≡ E(C), which is the set of edges that have been added to the graph between
vertices of the connected component C. You will understand what is meant by "added" in the
algorithm part. Note that if C has no edges then the max is 0, so Int(C) = 0.
 The weight w(e) is calculated by the Euclidean distance as follows:
w(e) = w(v1, v2) = sqrt((x1 − x2)² + (y1 − y2)² + (r1 − r2)² + (g1 − g2)² + (b1 − b2)²)

Let us now show the algorithm, knowing that it follows a bottom-up procedure. At the
beginning of the algorithm there are no edges, so each vertex is its own connected component.
So if the image is w by h pixels, there are w×h connected components (we will call this n
components). Then we form edges between every two vertices, so we will have m
edges (e1, e2, …, em), where m equals nC2 (n choose 2), and we weight each of these edges.

Note: At the beginning, since each vertex (pixel) is its own component, point 3 in step 3
does not apply; but after we finish processing the m edges and merge the components related to
each other to form segments, we end up with some segments (components) and we repeat again
over these components to form bigger segments, which is the bottom-up approach. (See the next part.)
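For experimentation, a graph-based segmentation of this kind is available in scikit-image; a small sketch (assuming scikit-image is installed; the scale, sigma and min_size values are just illustrative):

    from skimage import data, segmentation

    img = data.astronaut()                                 # sample RGB image bundled with scikit-image
    segments = segmentation.felzenszwalb(img, scale=100, sigma=0.8, min_size=50)
    print("number of segments:", segments.max() + 1)       # label map: one integer id per pixel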
How does the selective search algorithm used by R-CNN work?
It is done as follows:
 At the initialization stage, apply Felzenszwalb and Huttenlocher’s graph-based image
segmentation algorithm to create regions to start with.
 Use a greedy algorithm to iteratively group regions together:
o First the similarities between all neighbouring regions are calculated.
o The two most similar regions are grouped together, and new similarities are
calculated between the resulting region and its neighbours.
 The process of grouping the most similar regions (Step 2) is repeated until the whole
image becomes a single region.
The resulting merged regions are then
used to propose the ~2000 region
proposals, as shown in the figure

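Selective search itself is available through OpenCV's contrib modules; a minimal sketch of generating proposals with it (assuming opencv-contrib-python is installed and test.jpg is a placeholder path):

    import cv2

    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    img = cv2.imread("test.jpg")             # hypothetical input image
    ss.setBaseImage(img)
    ss.switchToSelectiveSearchFast()         # the "fast" quality/speed trade-off
    rects = ss.process()                     # array of proposals as (x, y, w, h)
    print(len(rects), "region proposals")    # typically on the order of a few thousand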
What are the similarities used to be able to merge regions?


Given two regions (ri, rj), selective search proposed four complementary similarity measures:
 Color similarity
 Texture: Use algorithm that works well for material recognition such as SIFT.
 Size: Small regions are encouraged to merge early.
 Shape: Ideally one region can fill the gap of the other.

Now that ~2000 region proposals have been generated, what do we do next?


AlexNet [3] is used to extract the CNN features.
For each proposal, a 4096-dimensional feature vector is computed by forward propagating a
mean-subtracted 227×227 RGB image through five convolutional layers and two fully
connected layers.
The input has the fixed size of 227×227 while bounding boxes have various shapes and sizes.
So, all pixels in a tight bounding box are warped to 227×227 size.
The feature vector is scored by an SVM trained for each class.
For each class, High IoU (Intersection over Union) overlapping bounding boxes are
rejected since they are bounding the same object.
The predicted bounding box can be further fine-tuned by another bounding box regressor.
Now that we have finished R-CNN, let us see the problems with R-CNN, then introduce the
modifications made to create Fast R-CNN and Faster R-CNN.
What are the problems of R-CNN?
 It still takes a huge amount of time to train the network as you would have to classify
2000 region proposals per image and then pass it to bounding box regressor.
 It cannot be implemented real time as it takes around 40-50 seconds for each test image.
 The selective search algorithm is a fixed algorithm. Therefore, no learning is happening
at that stage. This could lead to the generation of bad candidate region proposals.

The idea of spatial pyramid pooling that paved the way for Fast R-CNN to be developed
Still, R-CNN was very slow, because running a CNN on the 2000 region proposals generated by
selective search takes a lot of time. SPP-Net tried to fix this. With SPP-net, we calculate the
CNN representation of the entire image only once and use it to obtain the CNN
representation of each patch generated by selective search. This can be done by performing a
pooling-type operation on JUST the section of the last conv layer's feature maps that
corresponds to the region. The rectangular section of the conv layer corresponding to a region can
be calculated by projecting the region onto the conv layer, taking into account the downsampling
happening in the intermediate layers.
There was one more challenge: we need to generate a fixed-size input for the fully
connected layers of the CNN, so SPP introduces one more trick. It uses spatial pooling (aka RoI
pooling) after the last convolutional layer, as opposed to the traditionally used max pooling. The SPP
layer divides a region of any arbitrary size into a constant number of bins (a grid), and max pooling
is performed in each bin. Since the number of bins is known, we can just concatenate
the SPP outputs to give a fixed-length
representation, as shown in this
figure 
However, there was one big drawback
with SPP-net: it was not trivial to
back-propagate through the
spatial pooling layer. Hence, only the fully
connected part of the network was fine-tuned. SPP-
Net paved the way for the more popular
Fast R-CNN, which we will see next.

Notes:
1-256 here determines the number of filters.
2-the number of bins is fixed regardless of the image size.
3-Visualize the RoI pooling: [https://i.stack.imgur.com/rJL7D.gif]
Fast R-CNN in 2015
Fast R-CNN uses the ideas from SPP-net and R-CNN and fixes the key problem in SPP-net,
i.e. it makes it possible to train end-to-end. To propagate the gradients through spatial
pooling, it uses a simple back-propagation calculation which is very similar to the max-pooling
gradient calculation, with the exception that pooling regions overlap and therefore a cell can
have gradients flowing in from multiple regions.
One more thing Fast R-CNN did was add the bounding box regression to the neural
network training itself. So now the network has two heads, a classification head and a bounding
box regression head. This multitask objective is a salient feature of Fast R-CNN, as it no longer
requires training a set of networks independently for classification (SVMs) and
localization (bounding box regressors). These two changes reduce the overall training time and
increase the accuracy compared to SPP-net because of the end-to-end learning of the CNN.
Remember: RoI pooling layer is a
type of pooling layer which
performs max pooling on inputs
(here, convnet feature maps) of
non-uniform sizes and produces a
small feature map of fixed size
(say 7x7). The choice of this fixed
size is a network hyper-parameter
and is predefined.
The main function of the RoI layer is to reshape inputs of arbitrary size into a fixed-length output
(because of the size constraint of the fully connected layers), to speed up training and test
time, and to allow training the whole system end-to-end (in a joint manner).
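torchvision ships an RoI pooling operator, so a minimal sketch of this fixed-size-output behaviour looks like the following (the feature-map size, the box co-ordinates and the 7×7 output size are illustrative):

    import torch
    from torchvision.ops import roi_pool

    feature_map = torch.randn(1, 256, 32, 32)              # conv features for one image
    # one region of interest: (batch_index, x1, y1, x2, y2) in feature-map co-ordinates
    rois = torch.tensor([[0, 4.0, 4.0, 20.0, 28.0]])
    pooled = roi_pool(feature_map, rois, output_size=(7, 7))
    print(pooled.shape)                                     # torch.Size([1, 256, 7, 7]) regardless of RoI size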

Problems with Fast R-CNN?


It still uses selective search as a proposal method to find the Regions of Interest, which is a slow
and time consuming process. It takes around 2 seconds per image to detect objects, which is
much better compared to RCNN. But when we consider large real-life datasets, then even a Fast
RCNN doesn’t look so fast anymore. Let's see what Faster R-CNN would do to solve this
problem.
Faster R-CNN in 2016
As we said, the problem now lies in the selective search algorithm, which proposes the ~2000 region
proposals and takes about 2 seconds. The authors here introduced a replacement called the Region
Proposal Network (RPN) to generate the regions of interest.
The algorithm is now as follows:
 We take an image as input and pass it to the ConvNet, which returns the feature maps for
that image.
 A region proposal network is applied on these feature maps. This returns the object
proposals along with their objectness scores.
 A RoI pooling layer is applied on these proposals to bring all the proposals down to the
same size.
 Finally, the proposals are passed to a fully connected layer which has a softmax layer and
a linear regression layer at its top, to classify and output the bounding boxes for the objects.
Now comes the question: how does the RPN work?
 At the last layer of an initial CNN, a 3x3 sliding
window moves across the feature map and maps
it to a lower dimension (e.g. 256-d)
 For each sliding-window location, it
generates multiple possible regions based
on k fixed-ratio anchor boxes (default bounding
boxes)
 Each region proposal consists of a) an “objectness” score for that region and b)
coordinates representing the bounding box of the region
In other words, we look at each location in our last feature map and consider k different boxes
centered around it: a tall box, a wide box, a large box, etc. For each of those boxes, we output
whether or not we think it contains an object, and what the coordinates for that box are. This is
what it looks like at one sliding window location:
The 2k scores represent the softmax probability of each of the k bounding boxes being an
"object." Notice that although the RPN outputs bounding box coordinates, it does not try to
classify any potential objects: its sole job is still proposing object regions. If an anchor box has
an “objectness” score above a certain threshold, that box’s coordinates get passed forward as a
region proposal. Once we have our region proposals, we feed them straight into RoI pooling
layer→ FC→ softmax and bounding box regressor. In a sense, Faster R-CNN = RPN + Fast R-
CNN.
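To make the anchor idea concrete, here is a rough sketch of generating k = 9 anchors per feature-map cell; the stride, scales and ratios below are illustrative assumptions, not necessarily the exact values of the paper.

import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales)*len(ratios) anchors centred on every feature-map
    cell, expressed in input-image coordinates as (x1, y1, x2, y2)."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # cell centre in the image
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)      # keep area ~ s^2, ratio w/h = r
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

# A 38x50 feature map (stride 16, i.e. roughly a 600x800 image) gives 38*50*9 anchors
print(generate_anchors(38, 50).shape)   # (17100, 4)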
What are the problems in Faster R-CNN?
All of the object detection algorithms we have discussed so far use regions to identify the
objects. The network does not look at the complete image in one go, but focuses on parts of the
image sequentially. This creates two complications:
 The algorithm requires many passes through a single image to extract all the objects.
 As there are different systems working one after the other, the performance of the
systems further ahead depends on how the previous systems performed.
Summary
Note: We can notice that Fast R-CNN is about 25x faster than R-CNN, Faster R-CNN is 10x
faster than Fast R-CNN and 250x faster than R-CNN.
So far, all the methods discussed handled detection as a classification problem by building a
pipeline where first object proposals are generated and then these proposals are sent to
classification/regression heads. However, there are a few methods that pose detection as a
regression problem. Two of the most popular ones are SSD and YOLO. These detectors are also
called single shot detectors. Let’s have a look at them.
Single-Shot MultiBox Detector (by W Liu from Google in 2015)
Single-shot multi-box detector (SSD) presents an object detection model using a single deep
neural network combining regional proposals and feature extraction. It reached new records in
terms of performance and precision for object detection tasks, scoring over 74% mAP
(mean Average Precision) at 59 frames per second on standard datasets such
as PascalVOC and COCO.
What is meant by its name?
 Single Shot: this means that the tasks of object localization and classification are done in
a single forward pass of the network unlike R-CNN and its variants.
 MultiBox: this is the name of a technique for bounding box regression developed by
Szegedy et al. (we will briefly cover it shortly)
 Detector: The network is an object detector that also classifies those detected objects.
Architecture
SSD’s architecture builds on the venerable VGG-16 architecture, but discards the fully
connected layers.
Algorithm
Concretely, given an input image and a set of ground truth labels, SSD does the following:
 Pass the image through a series of convolutional layers, yielding several sets of feature
maps at different scales (e.g. 10x10, then 6x6, then 3x3, etc.)
 For each location in each of these feature maps, use a 3x3 convolutional filter to evaluate
a small set of default bounding boxes. These default bounding boxes are essentially
equivalent to Faster R-CNN’s anchor boxes.
 For each box, simultaneously predict a) the bounding box offset and b) the class
probabilities
 During training, match the ground truth box with these predicted boxes based on IoU.
The best predicted box will be labeled a “positive,” along with all other boxes that have
an IoU with the truth >0.5.
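Since this IoU-based matching keeps coming up, here is a minimal IoU helper (plain Python, for illustration only):

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A default box is labelled positive when its IoU with a ground-truth box is > 0.5
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # ~0.143 -> would be a negative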
What is the problem of making the RPN inside the network as a single shot & how to solve it?
With the previous two models, the region proposal network ensured that everything we tried to
classify had some minimum probability of being an “object.” With SSD, however, we skip that
filtering step. We classify and draw bounding boxes from every single position in the image,
using multiple different shapes, at several different scales(Anchor boxes). As a result, we
generate a much greater number of bounding boxes than the other models, and nearly all of the
them are negative examples.
To fix this imbalance, SSD does two things. Firstly, it uses non-maximum suppression to group
together highly-overlapping boxes into a single box. In other words, if four boxes of similar
shapes, sizes, etc. contain the same dog, Non-Max Suppression would keep the one with the
highest confidence and discard the rest. Secondly, the model uses a technique called hard
negative mining to balance classes during training. In hard negative mining, only a subset of the
negative examples with the highest training loss (i.e. false positives) are used at each iteration of
training. SSD keeps a 3:1 ratio of negatives to positives.
What does MultiBox refer to?
The bounding box regression technique of SSD is inspired by Szegedy’s work on MultiBox, a
method for fast class-agnostic bounding box coordinate proposals. Interestingly, in the work
done on MultiBox an Inception-style convolutional network is used.
MultiBox’s loss function also combined two critical components that made their way into SSD:
 Confidence Loss: this measures how confident the network is of the objectness of the
computed bounding box. Categorical cross-entropy is used to compute this loss.
 Location Loss: this measures how far away the network’s predicted bounding boxes are
from the ground truth ones from the training set. L2-Norm is used here.
Multibox_loss= confidence loss + α × location loss where α balances the contribution of
location loss.
Why each output feature map is convolved?
SSD predicts bounding boxes after multiple convolutional layers (feature maps) in order to
handle scales as each convolutional layer is able to detect objects of different scale from another
convolutional layer. For example here is SSD in action ↓
In smaller feature maps (e.g. 4x4), each cell covers a larger region of the image, enabling them
to detect larger objects. Region proposal and classification are performed simultaneously:
given p object classes, each bounding box is associated with a (4+p)-dimensional vector that
outputs 4 box offset coordinates and p class probabilities. In the last step, softmax is again used
to classify the object.
What is hard negative mining? why do we need it?
During training, as most of the bounding boxes will have low IoU and therefore be interpreted as
negative training examples, we may end up with a disproportionate amount of negative examples in
our training set. Therefore, instead of using all negative predictions, it is advised to keep a ratio of
negative to positive examples of around 3:1. The reason why you need to keep negative samples is
because the network also needs to learn and be explicitly told what constitutes an incorrect detection.
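As a rough sketch of hard negative mining, assuming we already have a per-box confidence loss and a positive/negative assignment (the 8732-box count below is only an example figure, commonly quoted for SSD300):

import numpy as np

def hard_negative_mining(conf_loss, positive_mask, neg_pos_ratio=3):
    """Keep only the hardest negatives, at a neg:pos ratio of 3:1.

    conf_loss: per-box confidence loss, shape (N,)
    positive_mask: boolean array, True where the box matched a ground-truth box.
    Returns a boolean mask of boxes (positives + selected negatives) to train on.
    """
    num_pos = int(positive_mask.sum())
    num_neg = neg_pos_ratio * num_pos
    neg_loss = np.where(positive_mask, -np.inf, conf_loss)   # exclude positives
    hardest_neg = np.argsort(-neg_loss)[:num_neg]             # largest-loss negatives
    keep = positive_mask.copy()
    keep[hardest_neg] = True
    return keep

losses = np.random.rand(8732)                    # assumed number of default boxes
pos = np.zeros(8732, dtype=bool); pos[:20] = True
print(hard_negative_mining(losses, pos).sum())   # 20 positives + 60 negatives = 80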
Summary of SSD operation (For more detailed explanation: https://bit.ly/2wqox7L)
The model takes an image as input which passes through multiple convolutional layers with
different sizes of filter (10x10, 5x5 and 3x3). Feature maps from convolutional layers at
different position of the network are used to predict the bounding boxes. They are processed by
a specific convolutional layers with 3x3 filters called extra feature layers to produce a set of
bounding boxes similar to the anchor boxes of the Fast R-CNN.
Each box has 4 parameters: the coordinates of the center, the width and the height. At the same
time, it produces a vector of probabilities corresponding to the confidence over each class of
object. Non-max suppression and hard negative mining are then used.
You Only Look Once (YOLO) and its versions
Let us go directly to how YOLOv1 works. Assume we have this test image.
Firstly, YOLO divides the image into S×S grid cells such that each cell can predict only 1 object.
For each object that is present in the image, one grid cell is said to be "responsible" for predicting
it: the cell into which the center of the object falls. For example, the yellow grid cell below tries to
predict the "person" object whose center (the blue dot) falls inside the grid cell.
Each grid cell predicts a fixed number of boundary boxes (say 2). In this example, the yellow grid
cell makes two boundary box predictions (blue boxes) to locate where the person is.
Hence, for each grid cell:
 it predicts B boundary boxes and each box has one box confidence score,
 it detects one object only regardless of the number of boxes B,
 it predicts C conditional class probabilities (one per class for the likeliness of the object
class).
To evaluate PASCAL VOC, YOLO uses 7×7 grids (S×S), 2 boundary boxes (B) and 20 classes
(C) as shown in the following figure.
Let’s get into more details. Each boundary box contains 5 elements: (x, y, w, h) and a box
confidence score. Formally we define confidence as Pr(Object) × IOU(pred, truth) . If no object
exists in that cell, the confidence score should be zero. Otherwise we want the confidence score
to equal the intersection over union (IOU) between the predicted box and the ground truth so
the confidence reflects the presence or absence of an object of any class. We normalize the
bounding box width w and height h by the image width and height. x and y are offsets to the
corresponding cell. Hence, x, y, w and h are all between 0 and 1. Each cell has 20 conditional
class probabilities. The conditional class probability is the probability that the detected object
belongs to a particular class (one probability per category for each cell). So, YOLO’s prediction
has a shape of (S, S, B×5 + C) = (7, 7, 2×5 + 20) = (7, 7, 30).
Notes: 1- We said in the main course that w & h can be larger than 1 when normalizing to the grid
cell size, but more generally, as explained here, we normalize to the whole image size.
2- This probability is conditioned on the grid cell containing an object, so that if no object is
present in the grid cell, the loss function will not penalize it for a wrong class prediction.
Example of how to calculate box coordinates in a 448x448 image with S=3. Note how the (x,y)
coordinates are calculated relative to the center grid cell.
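A small sketch of that encoding for a single ground-truth box, assuming w and h are normalized by the image size and x, y are offsets within the responsible grid cell (the numbers are made up for illustration):

def encode_yolo_box(box, image_size=448, S=3):
    """Encode a ground-truth box (cx, cy, w, h) in pixels into YOLO targets.

    Returns the responsible grid cell (row, col) and (x, y, w, h), where
    x, y are offsets inside that cell and w, h are normalized by image size.
    """
    cx, cy, w, h = box
    cell = image_size / S                    # e.g. 448 / 3 ~ 149.3 pixels per cell
    col, row = int(cx // cell), int(cy // cell)
    x = (cx - col * cell) / cell             # offset inside the cell, in [0, 1]
    y = (cy - row * cell) / cell
    return (row, col), (x, y, w / image_size, h / image_size)

# A 100x150 box centred at (224, 224) falls in the middle cell of the 3x3 grid
print(encode_yolo_box((224, 224, 100, 150)))
# ((1, 1), (0.5, 0.5, 0.223..., 0.334...))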
Till now we have seen how the image is encoded in YOLO, let us take a look on the
architecture of YOLOv1:
Notes:
1-The architecture was crafted for use in the
Pascal VOC dataset, where the authors used
S=7, B=2 and C=20. This explains why the
final feature maps are 7x7, and also explains
the size of the output (7x7x(2*5+20)). Use
of this network with a different grid size or
different number of classes might require
tuning of the layer dimensions.
2-The authors mention that there is a fast
version of YOLO, with only 9 convolutional
layers called Tiny-YOLO. The table above,
however, display the full version.
3-The sequences of 1x1 reduction layers and
3x3 convolutional layers were inspired by
the GoogleNet (Inception) model.
4-The final layer uses a linear activation
function. All other layers use a leaky RELU.
Loss function
YOLO predicts multiple bounding boxes per grid cell. To compute the loss for the true positive,
we only want one of them to be responsible for the object. For this purpose, we select the one
with the highest IoU (intersection over union) with the ground truth. This strategy leads to
specialization among the bounding box predictions. Each prediction gets better at predicting
certain sizes and aspect ratios.
YOLO uses sum-squared error between the predictions and the ground truth to calculate loss.
The loss function composes of:
 the classification loss.
 the localization loss (errors between the predicted boundary box and the ground truth).
 the confidence loss (the objectness of the box).
For classification loss: if an object is detected, the classification loss at each cell is the squared
error of the class conditional probabilities for each class.
For localization loss: the localization loss measures the errors in the predicted boundary box
locations and sizes. We only count the box responsible for detecting the object.
We do not want to weight absolute errors in large boxes and small boxes equally, i.e. a 2-pixel
error in a large box should not count the same as a 2-pixel error in a small box. To partially address
this, YOLO predicts the square root of the bounding box width and height instead of the width and
height. In addition, to put more emphasis on boundary box accuracy, we multiply the loss by
λcoord (default: 5).
If you don't get the previous paragraph, let us take an example:
Assume 2 ground-truth bounding boxes, one of size 4×4 and the other of size 100×100, and assume
that the predicted bounding box was of size 2×2 in the first case and 98×98 in the second one. The
error in the first case is 2 pixels and the error in the second case is also 2 pixels; however, in the
first case the 2 pixels represent a 50% error (as the ground truth is 4) while in the second case the
2 pixels represent only 2% (as the ground truth is 100). Here comes the role of the square root: it
relates the error to the original size. This is what the authors said:
"our error metric should reflect that small deviations in large boxes matter less than in small
boxes. To partially address this we predict the square root of the bounding box width and height
instead of the width and height directly".
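In numbers, the square-root trick looks like this (a tiny illustration of the example above):

import math
# The same 2-pixel error gives very different penalties once square roots are compared:
small = (math.sqrt(4) - math.sqrt(2)) ** 2      # ~0.343 (4-px ground truth, 2-px prediction)
large = (math.sqrt(100) - math.sqrt(98)) ** 2   # ~0.010 (100-px ground truth, 98-px prediction)
print(small, large)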
For confidence loss: if an object is detected in the box, the confidence loss (measuring the
objectness of the box) is the squared error between the predicted box confidence score and the
ground-truth confidence, summed over the boxes responsible for objects. If an object is not
detected in the box, the confidence loss is the same squared-error term, but computed over the
no-object boxes.
Note: most boxes do not contain any objects. This causes a class imbalance problem, i.e. we train
the model to detect background more frequently than detecting objects. To remedy this, we weight
this no-object confidence loss down by a factor λnoobj (default: 0.5).
The final loss is the sum of the three parts: λcoord × localization loss + confidence loss (boxes
with objects) + λnoobj × confidence loss (boxes without objects) + classification loss.
Non-max suppression:
YOLO can make duplicate detections for the same object. To fix this, YOLO applies non-
maximal suppression to remove duplications with lower confidence. Non-maximal suppression
adds 2-3% in mAP.
Here is one of the possible non-maximal suppression implementation:
 Sort the predictions by the confidence scores.
 Start from the top scores, ignore any current prediction if we find any previous
predictions that have the same class and IoU > 0.5 with the current prediction.
 Repeat step 2 until all predictions are checked.
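A minimal sketch of that greedy procedure for a single class, reusing an IoU helper like the one sketched earlier:

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression for one class.

    boxes: list of (x1, y1, x2, y2); scores: list of confidences.
    Returns the indices of the boxes that are kept.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # keep the box only if it does not overlap too much with an already kept box
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))   # [0, 2] -- the second box overlaps the first one too much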
What are limitations of YOLOv1?
YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only
predicts two boxes and can only have one class. This spatial constraint limits the number of
nearby objects that our model can predict. Our model struggles with small objects that appear in
groups, such as flocks of birds. Since our model learns to predict bounding boxes from data, it
struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model
also uses relatively coarse features for predicting bounding boxes since our architecture has
multiple downsampling layers from the input image.
Note: SSD is a strong competitor for YOLO which at one point demonstrates higher accuracy
for real-time processing. Comparing with region based detectors, YOLO has higher localization
errors and the recall (measure how good to locate all objects) is lower. YOLOv2 is the second
version of the YOLO with the objective of improving the accuracy significantly while making it
faster.
Yolov2 (by Joseph Redmon, Ali Farhadi in 25 Dec 2016)
What are the improvements made in this version?
 Batch normalization: Add batch normalization in all convolution layers. This removes the
need for dropouts and pushes mAP up 2%.
 High resolution classifier: after training on 224×224 images, YOLOv2 also uses 448×448
images for fine-tuning the classification network for 10 epochs on ImageNet. This gives
the network time to adjust its filters to work better on higher resolution input. We then
fine-tune the resulting network on detection. This makes detector training easier and
moves mAP up by 4%. Higher resolution images are also accepted as detection input: the
YOLO model uses 448x448 images while YOLOv2 uses 608x608 images, thus enabling
the detection of potentially smaller objects.
 Convolutional with Anchor Boxes: YOLOv2 removes all fully connected layers and uses
anchor boxes to predict bounding boxes. One pooling layer is removed to increase the
resolution of the output. Using anchor boxes we get a small decrease in accuracy. YOLO
uses (7×7) grid cells with 2 bounding boxes that only predicts(7×7×2)= 98 boxes per
image but with anchor boxes our model predicts 19×19×5=1805 anchor boxes per image
as we use 5 anchor boxes per grid cell. Without anchor boxes our intermediate model gets
69.5 mAP with a recall of 81%. With anchor boxes our model gets 69.2 mAP with a
recall of 88%. Even though the mAP decreases, the increase in recall means that our
model has more room to improve.
 Dimension cluster: In many problem domains, the boundary boxes have strong patterns.
For example, in the autonomous driving, the 2 most common boundary boxes will be cars
and pedestrians at different distances. To identify the top-K boundary boxes that have the
best coverage for the training data, we run K-means clustering on the training data to
locate the centroids of the top-K clusters. Since we are dealing with boundary boxes
rather than points, we cannot use the regular spatial distance to measure datapoint
distances; no surprise, we use IoU. The IoU is computed between the anchor boxes and the
ground-truth boxes, noting that the number of clusters determines the number of anchor
boxes. As the number of anchors increases, the accuracy increases and then starts to settle
down as shown, and YOLOv2 settles on 5 anchors (see the clustering sketch after this list).
 Direct location prediction: instead of predicting unconstrained offsets relative to an anchor
(as the RPN does), YOLOv2 predicts the box center relative to its grid cell and passes the
offsets through a sigmoid so the center stays inside that cell, which makes early training
much more stable.
 Fine-Grained features: Convolution layers decrease the spatial dimension gradually. As
the corresponding resolution decreases, it is harder to detect small objects. Other object
detectors like SSD locate objects from different layers of feature maps. So each layer
specializes at a different scale. YOLO adopts a different approach called passthrough that
brings features from an earlier layer (finer low-level features that helps in detecting
smaller objects) and concatenate it with the normal output before making predictions.
Hence, it takes the earlier layer output that has the shape of 28×28×512 and reshaped it to
14×14×2048. We also have the final output layer of shape 14×14×1024. So we
concatenate the reshaped earlier layer output with final output to form an output layer of
14×14×3072 that now can make good predictions for both small and large objects.
 Multi-scale training: For every 10 batches, new image dimensions are randomly chosen.
The image dimensions are {320, 352, …, 608} by step 32.The network is resized and
continue training. This acts as data augmentation and forces the network to predict well
for different input image dimensions and scales. In addition, we can use lower resolution
images for object detection at the cost of accuracy. This can be a good tradeoff for speed
on low GPU power devices. At 288 × 288 YOLO runs at more than 90 FPS with mAP
almost as good as Fast R-CNN. At high-resolution YOLO achieves 78.6 mAP on VOC
2007.
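Returning to the dimension-cluster bullet above, here is a rough sketch of k-means over box shapes with 1 − IoU as the distance. This is a simplified assumption of how it could be done: boxes are reduced to (width, height) pairs centred at the origin and centroids are updated with the median.

import numpy as np

def wh_iou(wh1, wh2):
    """IoU of two boxes described only by (width, height), both centred at the origin."""
    inter = np.minimum(wh1[0], wh2[0]) * np.minimum(wh1[1], wh2[1])
    return inter / (wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter)

def kmeans_anchors(box_wh, k=5, iters=100, seed=0):
    """Cluster ground-truth box shapes into k anchor shapes using d = 1 - IoU."""
    rng = np.random.default_rng(seed)
    centroids = box_wh[rng.choice(len(box_wh), k, replace=False)]
    for _ in range(iters):
        # Assign every box to the centroid with the smallest 1 - IoU distance.
        dists = np.array([[1 - wh_iou(b, c) for c in centroids] for b in box_wh])
        assign = dists.argmin(axis=1)
        # Recompute each centroid as the median shape of its cluster.
        for j in range(k):
            if (assign == j).any():
                centroids[j] = np.median(box_wh[assign == j], axis=0)
    return centroids

box_wh = np.abs(np.random.randn(1000, 2)) * 50 + 10   # fake ground-truth box shapes
print(kmeans_anchors(box_wh, k=5))                     # 5 anchor (w, h) shapes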
Accuracy:
The last thing in this part is that the YOLOv2 paper's title is "Better, Faster, Stronger". The
improvements we were talking about relate to the "better" part. For "faster", they developed a
framework called Darknet and built a network called Darknet-19 that can reach 200 FPS. For
"stronger", they combined the COCO and ImageNet datasets by using a WordTree to combine
classes of related objects together.
Darknet-19 architecture
With DarkNet, YOLO achieves 72.9% top-1
accuracy and 91.2% top-5 accuracy on ImageNet.
Darknet uses mostly 3 × 3 filters to extract features
and 1 × 1 filters to reduce output channels. It also
uses global average pooling to make predictions. We
replace the last convolution layer (the cross-out
section) with three 3 × 3 convolutional layers each
outputting 1024 output channels. Then we apply a
final 1 × 1 convolutional layer to convert the 7 × 7 ×
1024 output into 7 × 7 × 125. (5 boundary boxes
each with 4 parameters for the box, 1 objectness
score and 20 conditional class probabilities).
Darknet-19 can work on 45 FPS.
YOLOv3 (by Joseph Redmon, Ali Farhadi in April 2018)
YOLO v2 used a custom deep architecture darknet-19, an originally 19-layer network
supplemented with 11 more layers for object detection. With a 30-layer architecture, YOLO v2
often struggled with small object detections. This was attributed to loss of fine-grained features
as the layers downsampled the input. To remedy this, YOLO v2 used an identity mapping,
concatenating feature maps from a previous layer to capture low level features.
However, YOLO v2’s architecture was still lacking some of the most important elements that
are now staple in most of state-of-the art algorithms. No residual blocks, no skip connections
and no upsampling. YOLO v3 incorporates all of these. The speed has been traded off for
boosts in accuracy in YOLO v3. While the earlier variant ran on 45 FPS on a Titan X, the
current version clocks about 30 FPS.
First, YOLO v3 uses a variant of Darknet, which originally has 53 layer network trained on
Imagenet. For the task of detection, 53 more layers are stacked onto it, giving us a 106 layer
fully convolutional underlying architecture for YOLO v3. This is the reason behind the
slowness of YOLO v3 compared to YOLO v2. Here is how the architecture of YOLO now
looks like. YOLOv3 can be named better, stronger but not faster 
What are the improvements in YOLOv3?
 Detection at three scales: Before starting keep your eyes on the architecture while reading
in order not to be confused. The newer architecture boasts of residual skip connections,
and upsampling. The most salient feature of v3 is that it makes detections at three
different scales. YOLO is a fully convolutional network and its eventual output is
generated by applying a 1 x 1 kernel on a feature map. In YOLO v3, the detection is done
by applying 1 x 1 detection kernels on feature maps of
three different sizes at three different places in the
network.
The shape of the detection kernel is 1 x 1 x (B x (5 + C) ).
Here B is the number of bounding boxes a cell on the
feature map can predict, “5” is for the bounding box
attributes and one object confidence, and C is the
number of classes.
In YOLO v3 trained on COCO, B = 3 and C = 80, so
the kernel size is 1 x 1 x 255. The feature map produced
by this kernel has identical height and width of the previous
feature map, and has detection attributes along the depth as
described above.
Before we go further, I’d like to point out that stride of the network, or a layer is defined
as the ratio by which it downsamples the input. In the following examples, I will assume
we have an input image of size 416 x 416.
YOLO v3 makes prediction at three scales, which are precisely given by downsampling
the dimensions of the input image by 32, 16 and 8 respectively.
The first detection is made by the 82nd layer. For the first 81 layers, the image is down
sampled by the network, such that the 81st layer has a stride of 32. If we have an image
of 416 x 416, the resultant feature map would be of size 13 x 13. One detection is made
here using the 1 x 1 detection kernel, giving us a detection feature map of 13 x 13 x 255.
Then, the feature map from layer 79 is subjected to a few convolutional layers before
being up sampled by 2x to dimensions of 26 x 26. This feature map is then depth
concatenated with the feature map from layer 61. Then the combined feature maps is
again subjected a few 1 x 1 convolutional layers to fuse the features from the earlier layer
(61). Then, the second detection is made by the 94th layer, yielding a detection feature
map of 26 x 26 x 255.
A similar procedure is followed again, where the feature map from layer 91 is subjected
to few convolutional layers before being depth concatenated with a feature map from
layer 36. Like before, a few 1 x 1 convolutional layers follow to fuse the information
from the previous layer (36). We make the final detection of the 3 at the 106th layer, yielding a feature
map of size 52 x 52 x 255.
Detections at different layers helps address the issue of detecting small objects, a frequent
complaint with YOLO v2. The upsampled layers concatenated with the previous layers
help preserve the fine grained features which help in detecting small objects.
The 13 x 13 layer is responsible for detecting large objects, whereas the 52 x 52 layer
detects the smaller objects, with the 26 x 26 layer detecting medium objects. Here is a
comparative analysis of different objects picked up in the same image by different layers.
 Choice of anchor box: YOLO v3, in total uses 9 anchor boxes. Three for each scale. If
you’re training YOLO on your own dataset, you should go about using K-Means
clustering to generate 9 anchors. Then, arrange the anchors in descending order of a
dimension. Assign the three biggest anchors for the first scale , the next three for the
second scale, and the last three for the third. YOLO v3 predicts 10x the number of boxes
predicted by YOLO v2. You could easily imagine why it’s slower than YOLO v2. At
each scale, every grid can predict 3 boxes using 3 anchors. Since there are three scales,
the number of anchor boxes used in total are 9, 3 for each scale.
 Change in loss function: The last three terms in YOLO v2 are the squared errors, whereas
in YOLO v3, they’ve been replaced by cross-entropy error terms. In other words, object
confidence (2 terms) and class predictions (1 term) in YOLO v3 are now predicted
through logistic regression.
 No more softmax layer: YOLO v3 now performs multilabel classification for objects
detected in images. Earlier in YOLO, authors used to softmax the class scores and take
the class with maximum score to be the class of the object contained in the bounding box.
This has been modified in YOLO v3. Softmaxing classes rests on the assumption that
classes are mutually exclusive, or in simple words, if an object belongs to one class, then
it cannot belong to the other. This works fine in COCO dataset. However, when we have
classes like Person and Women in a dataset, then the above assumption fails. This is the
reason why the authors of YOLO have refrained from softmaxing the classes. Instead,
each class score is predicted using logistic regression and a threshold is used to predict
multiple labels for an object. Classes with scores higher than this threshold are assigned
to the box.
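A tiny sketch of that multi-label scheme with made-up class scores (an independent sigmoid per class plus a threshold, instead of a softmax argmax):

import numpy as np

def multilabel_classes(logits, class_names, threshold=0.5):
    """Independent sigmoid per class + threshold, instead of a softmax argmax."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits)))
    return [name for name, p in zip(class_names, probs) if p > threshold]

# "person" and "woman" are not mutually exclusive, so both labels can be assigned.
print(multilabel_classes([2.3, 1.1, -3.0], ["person", "woman", "car"]))
# ['person', 'woman']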
Faster-RCNN gives very good accuracy but low FPS regarding real-time application. SSD gives
better FPS so that we can get real-time application into work but lower accuracy. Then came
YOLO that dives more towards real-time application by giving more FPS but still lower
accuracy than both of Faster R-CNN and SSD and still the trade-off exists between accuracy
and FPS, and it will remain. RetinaNet then showed up to give a more ideal balance between
accuracy and speed while being simple and open source, and hence came into action. Let us see
what it offers.
RetinaNet (Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár in 2017)
Architecture
RetinaNet is a composite network composed of:
 A backbone network called Feature Pyramid Net, which is built on top of ResNet and is
responsible for computing convolutional feature maps of an entire image.
 a subnetwork responsible for performing object classification using the backbone’s
output (feature pyramid net's output).
 a subnetwork responsible for performing bounding box regression using the backbone’s
output (feature pyramid net's output).
What is the feature pyramid net (FPN)?
RetinaNet adopts the Feature Pyramid Network (FPN) as its backbone, which is in turn built on
top of ResNet (ResNet-50, ResNet-101 or ResNet-152) in a fully convolutional fashion. The
fully convolutional nature enables the network to take an image of an arbitrary size and outputs
proportionally sized feature maps at multiple levels in the feature pyramid. The construction of
FPN involves two pathways (Bottom-up pathway & Top-down pathway) which are connected
with lateral connections. They are described as below.
Bottom-up pathway → Recall that in ResNet, some consecutive layers may output feature maps of
the same scale; but generally, feature maps of deeper layers have smaller scales/resolutions. The
bottom-up pathway of building the FPN is accomplished by choosing the last feature map of each
group of consecutive layers that output feature maps of the same scale. These chosen feature maps
will be used as the foundation of the feature pyramid.
Top-down pathway → Using nearest neighbor upsampling, the last feature map from the bottom-up
pathway is expanded to the same scale as the second-to-last feature map. These two feature maps
are then merged by element-wise addition to form a new feature map. This process is iterated until
each feature map from the bottom-up pathway has a corresponding new feature map connected
with lateral connections.
There are altogether five levels in the pyramid (the figures only show three).
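A minimal NumPy sketch of a single top-down merge step (nearest-neighbour 2× upsampling plus element-wise addition); in the real network the lateral map first goes through a 1×1 convolution so that channel counts match, which is skipped here for brevity.

import numpy as np

def upsample2x_nearest(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def topdown_merge(top, lateral):
    """One FPN merge step: upsample the coarser map and add the lateral map.

    We assume the channel counts already match (normally a 1x1 lateral
    convolution takes care of that)."""
    return upsample2x_nearest(top) + lateral

p5 = np.random.rand(7, 7, 256)     # coarsest pyramid level
c4 = np.random.rand(14, 14, 256)   # bottom-up map of the next finer scale
p4 = topdown_merge(p5, c4)
print(p4.shape)                     # (14, 14, 256)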
History of feature pyramids till we reached FPN.
In real-world object detection, objects from the
same class may be presented in a wide range of
scales in images. This leads to some decrease in
detection accuracy, especially for small objects.
This is because feature maps from higher levels of
the pyramid are spatially coarser, though
semantically stronger. Therefore, only using the
last feature map of a network to make the
prediction is less ideal as in fig(b).
One solution would be to generate different scales of an image and feed them to the network
separately for prediction as fig(a). This approach is termed “featurized image pyramid” and was
widely adopted before the era of deep learning. However, since each image needs to be fed into
the network multiple times, this approach also introduces a significant increase in test time,
making it impractical for real-time applications.
Another solution would be to simply use multiple feature maps generated by a ConvNet for
prediction as in fig(c), and each feature map would be used to detect objects of different scales.
This is an approach adopted by some detectors like SSD. However, although the approach
requires little extra cost in computation, it is still sub-optimal since the lower feature maps
cannot sufficiently obtain semantical features from the higher ones.
Finally, we turn to FPN. As mentioned, FPN is built in a fully convolutional fashion which can
take an image of an arbitrary size and output proportionally sized feature maps at multiple
levels. Higher level feature maps contain grid cells that cover larger regions of the image and is
therefore more suitable for detecting larger objects; on the contrary, grid cells from lower level
feature maps are better at detecting smaller objects(check this figure ↓). With the help of the
top-down pathway and lateral connections , which
do not require much extra computation, every level
of the resulting feature maps can be both
semantically and spatially strong. These feature
maps can be used independently to make
predictions and thus contributes to a model that is
scale-invariant and can provide better performance
both in terms of speed and accuracy.
What about the classification subnet and box regression subnet?
The classification subnet is a fully convolutional network (FCN) attached to each FPN level.
The subnet consists of four 3×3 convolutional layers with 256 filters, followed by ReLU
activations. Then, another 3×3 convolutional layer with K×A filters is applied, followed by a
sigmoid activation (instead of a softmax activation). The subnet has shared parameters across all
levels. The shape of the output feature map would be (W, H, KA), where W and H are
proportional to the width and height of the input feature map, and K and A are the numbers of
object classes and anchor boxes.
The regression subnet is attached to each feature map of the FPN in parallel to the classification
subnet. The design of the regression subnet is identical to that of the classification subnet,
except that the last convolutional layer is 3×3 with 4A filters. Therefore, the shape of the output
feature map would be (W,H,4A).
Note: Fully conv net (FCN) means network with convolutional layers with no fully-connected
layers. A fully convolutional net tries to learn representations and make decisions based
on local spatial input. Fully-connected layers learn global information not local.
Why KA channels & 4A channels in each subnet, and what do these channels represent?
This part is actually a revision of the idea of anchor boxes, but for RetinaNet. You can skip it.
Let’s suppose that, given an input image, the width
and height of the feature map output by a
classification subnet is 3×3. Then for each one of
these nine grid cells, the RetinaNet
defines A=9 boxes called anchor boxes, each having
different sizes and aspect ratios and covering an area
in the input image. 
Each anchor box is responsible for detecting the
existence of objects from K classes in the area that it
covers. Therefore, each anchor box corresponds to K numbers indicating the class probabilities.
And since there are A bounding boxes per grid, the output feature map of the classification
subnet has KA channels. In addition to detecting the existence of objects, each anchor box is
also responsible for detecting the size and shape of objects (if any). This is done through the
regression subnetwork, which outputs 4 numbers for each anchor box that predict the relative
offset (in terms of center coordinates, width and height) between the anchor box and the ground
truth box. Therefore, the output feature map of the regression subnet has 4A channels.
Matching predictions with ground-truth through anchor boxes
Note that the predictions output by the subnets are stored in output tensors. To calculate the
loss, we would also need to create target tensors, each with the same shape as its corresponding
output tensor and fill them with ground-truth labels at matching positions.
Also note that the match is actually performed between each anchor box and a ground-truth
box. But since each anchor box has an one-to-one relationship with the bounding box
prediction, the match naturally extends to the prediction and the ground-truth.
An anchor box is matched to a ground-truth box if their intersection-over-union (IoU) is greater
than 0.5. When a match is found, the ground-truth labels will be assigned to the target tensor in
the same positions as the corresponding predictions in the output tensor. In case of
classification, a ground-truth label is a length K one-hot encoded vector with a value of 1 in the
corresponding class entry, while all the remaining class entries would be 0. In case of
regression, the ground-truth label is a length-4 vector indicating the offset between the anchor
box and the ground-truth box.
An anchor box is considered to be a background and has no matching ground-truth if its IoU
with any ground-truth box is below 0.4. In this case, the target would be a length K vector with
all 0s. If the anchor box predicts an object, it will be penalized by the loss function. The
regression target could be a vector of any values (typically zeros), but they will be ignored by
the loss function. Finally, an anchor box will also be considered to have no match if its IoU with
any ground-truth box is between 0.4 and 0.5. However, unlike the previous case, both the labels
for classification and regression will be ignored by the loss function.
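A compact sketch of those matching rules (IoU ≥ 0.5 → positive, below 0.4 → background, in between → ignored), assuming an IoU helper like the one sketched earlier:

import numpy as np

def match_anchors(anchors, gt_boxes, pos_thr=0.5, neg_thr=0.4):
    """Label each anchor: index of its matched ground-truth box,
    -1 = background, -2 = ignored (IoU between the two thresholds)."""
    labels = np.full(len(anchors), -1, dtype=int)
    if len(gt_boxes) == 0:
        return labels
    for i, a in enumerate(anchors):
        ious = [iou(a, g) for g in gt_boxes]   # iou() as sketched earlier
        best = int(np.argmax(ious))
        if ious[best] >= pos_thr:
            labels[i] = best                    # positive: classify + regress to this GT box
        elif ious[best] >= neg_thr:
            labels[i] = -2                      # ignored by both losses
    return labels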
Till now we have discussed the architecture, the idea of anchor boxes, and how they match
predictions to ground-truth labels. Now we reach the point the RetinaNet paper is named after,
which is the focal loss, a change to the normal cross-entropy loss function. But before discussing
focal loss, let us see the problem with the normal cross-entropy loss, noting that the cross-entropy
loss in machine learning is the loss for a softmax layer, which is similar to that of the logistic
regression we discussed before.
Problems with normal cross-entropy loss
Methods like SSD or YOLO suffer from an extreme class imbalance: The detectors evaluate
roughly between ten to hundred thousand candidate locations and of course most of these boxes
do not contain an object. Even if the detector easily classifies these large number of boxes as
negatives/background, there is still a problem.
This is the cross-entropy loss function, CE = −Σi yi·log(pi) (and its plot), where i is the index of the
class, yi is the label (1 if the object belongs to class i, 0 otherwise), and pi is the predicted
probability that the object belongs to class i.
Let's say a box contains background and the network is 80% sure that it actually is only
background. In this case y(background) = 1, all other yi are 0 and p(background) = 0.8. You can
see that at 80% certainty that the box contains only background, the loss is −log(0.8) ≈ 0.22.
Let’s say for example that we have ten
actual objects in the image and the network
is not really sure to which class they belong
so that their loss is i.e. ~3. This would give
us around 10×3=30 (all numbers are absolutely fictional and chosen for demonstration purposes
by the way). All the other ~10,000 default boxes are background and the network is 80% sure
that they are just background. This gives us a loss of around 10,000×0.22 = 2200.
You can see the difference ? total loss is 2230 where 2200 is loss for background and only 30 is
for objects. So tell me, what dominates? The few true objects the network actually has
difficulties to classify? Or all the boxes that the network easily classifies as background?
Well, the large number of easily classified examples absolutely dominates the loss and thus the
gradients and therefore overwhelms the few interesting cases the network still has difficulties
with and should learn from.
From this difficulty mentioned, the authors of RetinaNet had a novel idea for the cross-entropy
loss.
Focal loss
Focal loss is the reshaping of cross entropy loss such that it down-weights the loss assigned to
well-classified examples (background and easy objects). The novel focal loss focuses training
on a sparse set of hard examples and prevents the large number of easy negatives from
overwhelming the detector during training.
Equation of focal loss: FL(pt) = −(1 − pt)^γ · log(pt), where pt is the predicted probability for the
true class (an optional balancing factor αt can also be multiplied in).
According to their paper, γ=2 works the best.
Looking at the plot (with γ=2), when the network is pretty sure about a prediction, the loss is now
significantly lower. In our previous example of 80% certainty, the cross-entropy loss had a value of
~0.22 and the focal loss now has a value of only 0.009. For predictions the network is not so sure
about, the loss is reduced by a much smaller factor!
With this rescaling, the large number of easily classified examples (mostly background) does
not dominate the loss anymore and learning can concentrate on the few interesting cases.
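A minimal NumPy sketch of the focal loss for the true-class probability, reproducing the 0.22 vs 0.009 numbers from the example above (the optional α factor is omitted):

import numpy as np

def focal_loss(p_true, gamma=2.0):
    """Focal loss for the true-class probability p_true; gamma=0 gives plain cross-entropy."""
    p_true = np.asarray(p_true, dtype=float)
    return -((1.0 - p_true) ** gamma) * np.log(p_true)

p = 0.8                               # network is 80% sure of the true class
print(focal_loss(p, gamma=0.0))       # ~0.223  (ordinary cross-entropy)
print(focal_loss(p, gamma=2.0))       # ~0.0089 (easy example is heavily down-weighted)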
Note: As γ increases, the loss will focus on much harder examples.
All our discussion of the focal loss so far has been about the classification part only. For the
regression part the loss is as usual: an L1 or L2 loss between predictions and ground-truth labels
(L1 was used here in RetinaNet as it is more robust to outliers).
Assuming the classification loss is denoted by Lclass and the regression (detection) loss is denoted
by Ldetection, the total loss will be Lclass + λ·Ldetection, where λ is a hyper-parameter for
balancing the contribution of each task loss in the total loss.
Note: λ also has an impact on the class imbalance, as γ does, so they should be selected carefully
since they interact with each other. Generally, as you increase γ, λ should be decreased slightly.
Semantic segmentation
Semantic segmentation is a natural step in the progression from coarse to fine inference:
 The origin could be located at classification, which consists of making a prediction for a
whole input.
 The next step is localization / detection, which provide not only the classes but also
additional information regarding the spatial location of those classes.
 Finally, semantic segmentation achieves fine-grained inference by making dense
predictions inferring labels for every pixel, so that each pixel is labeled with the class of
its enclosing object or region. [visualize GIF: https://bit.ly/2AxkE3I]
Popular datasets (to have pixel-wise ground truth labels):
PASCAL VOC →[http://host.robots.ox.ac.uk/pascal/VOC/voc2012/]
Microsoft COCO→[http://cocodataset.org/]
Evaluation metrics: pixel accuracy, mean accuracy, region IoU, and frequency weighted IU.
Approaches:
There were traditional approaches before deep learning, such as Random Forest based methods and
TextonForest, for semantic segmentation. But once CNNs started to spread, they took over the
traditional approaches.
One of the popular initial deep learning approaches was patch classification, where each pixel
was separately classified using a patch of the image around it. The main reason to use patches was
that classification networks usually have fully connected layers and therefore require fixed-size
images.
In 2014, Fully Convolutional Networks (FCN) by Long et al. from Berkeley, popularized CNN
architectures for dense predictions without any fully connected layers. This allowed
segmentation maps to be generated for image of any size and was also much faster compared to
the patch classification approach. Almost all the subsequent state of the art approaches on
semantic segmentation adopted this paradigm.
What is the main problem with CNN architectures apart from FC layers and how to solve?
Apart from fully connected layers, one of the main problems with using CNNs for segmentation
is pooling layers. Pooling layers increase the field of view and are able to aggregate the context
while discarding the ‘where’ information. However, semantic segmentation requires the exact
alignment of class maps and thus, needs the ‘where’ information to be preserved.
Two different types of architectures evolved in the literature to tackle this issue:
1) Encoder-Decoder architecture:
Encoder gradually reduces the
spatial dimension with pooling
layers and decoder gradually
recovers the object details and
spatial dimension. There are usually
shortcut connections from encoder
to decoder to help decoder recover
the object details better. U-Net is a
popular architecture from this class.
2) Dilated/atrous convolution architectures:
Dilated convolution is a convolution operation where 0 is inserted into the kernel to increase the
receptive field. Check the image below.
Improvement:
Conditional Random Field (CRF) post-processing is usually used to improve the segmentation.
CRFs are graphical models which 'smooth' the segmentation based on the underlying image
intensities. They work based on the observation that similar-intensity pixels tend to be labeled as
the same class. CRFs can boost scores by 1-2%.
Note: In the previous CRF illustration figure: (b) Unary classifiers is the segmentation input to
the CRF. (c, d, e) are variants of CRF with (e) being the widely used one.
There are a lot of semantic segmentation networks that could be explained, in addition to other
approaches such as pyramid networks and supervised/weakly supervised methods. However, we
will go through only 3 of them, which are FCN, U-Net & Multi-Scale Context Aggregation by
Dilated Convolutions (DilatedNet).
But you can check recent ideas in semantic segmentation networks released in 2016 & 2017
here[https://towardsdatascience.com/semantic-segmentation-with-deep-learning-a-guide-and-
code-e52fc8958823]

Fully convolutional network FCN (by Jonathan Long, Evan Shelhamer, Trevor Darrell
from UC Berkeley in 2014)
FCNs owe their name to their architecture, which is built only from locally connected layers,
such as convolution, pooling and upsampling. Note that no dense layer (FC) is used in this kind
of architecture. This reduces the number of parameters and computation time. Also, the network
can work regardless of the original image size, without requiring any fixed number of units at
any stage, given that all connections are local.
The final output layer will be the same height and width as the input image, but the number of
channels will be equal to the number of classes. If we’re classifying each pixel as one of fifteen
different classes, then the final output layer will be height x width x 15 classes.
To obtain a segmentation map (output), segmentation networks usually have 2 parts :
 Downsampling path : capture semantic/contextual information
 Upsampling path : recover spatial information
The downsampling path is used to extract and interpret the context (what), while
the upsampling path is used to enable precise localization (where). Furthermore, to fully recover
the fine-grained spatial information lost in the pooling or downsampling layers, we often use
skip connections to transfer local information from downsampling path with upsampling path
either by concatenating or by summing (as in Feature pyramid network FPN)
Downsampling occurs by pooling & convolution; how can we upsample?
A logistical hurdle to overcome in FCNs is that the intermediate layers typically get smaller and
smaller (although often deeper), as striding and pooling reduce the height and width dimensions
of the tensors. FCNs use “deconvolutions”, or essentially backwards convolutions, to upsample
the intermediate tensors so that they match the width and height of the original input image.
Because backward convolution layers are just convolutions, turned around, their weights are
learnable, just like normal convolutional layers.
Architecture
The authors had success converting canonical networks like AlexNet, VGG, and GoogLeNet
into FCNs by replacing their final layers. But there was a consistent problem, which was that
upsampling from the final convolutional tensor seemed to be inaccurate. Too much spatial
information had been lost by all the downsampling in the network. So they combined
upsampling from that final intermediate tensor with upsampling from earlier tensors, to get
more precise spatial information.
There are variants of the FCN architecture, which mainly differ in the spatial precision of their
output. For example, the figures below show the FCN-32, FCN-16 and FCN-8 variants. In the
figures, convolutional layers are represented as vertical lines between pooling layers, which
explicitly show the relative size of the feature maps.
What is the difference between the three variants?
As shown below, these 3 different architectures differ in the stride of the last convolution, and
the skip connections used to obtain the output segmentation maps. We will use the
term downsampling path to refer to the network up to conv7 layer and we will use the term
upsampling path to refer to the network composed of all layers after conv7. It is worth noting
that the 3 FCN architectures share the same downsampling path, but differ in their respective
upsampling paths.
1) FCN-32 : Directly produces the segmentation map from conv7, by using a transposed
convolution layer with stride 32.
2) FCN-16 : Sums the 2x upsampled prediction from conv7 (using a transposed convolution
with stride 2) with pool4 and then produces the segmentation map, by using a transposed
convolution layer with stride 16 on top of that.
3) FCN-8 : Sums the 2x upsampled conv7 (with a stride 2 transposed convolution) with pool4,
upsamples them with a stride 2 transposed convolution and sums them with pool3, and applies a
transposed convolution layer with stride 8 on the resulting feature maps to obtain the
segmentation map.
As explained above, the upsampling paths of the FCN variants are different, since they use
different skip connection layers and strides for the last convolution, yielding different
segmentations. As we go back to retrieve finer spatial information we are getting better
segmentation as shown.
We can notice that the idea is simple, but we are still missing one piece that is used extensively in
this context, which is upsampling, or more precisely learnable upsampling (deconvolution or
backward convolution).
Learnable Upsampling (deconvolution or fractionally-strided conv or transposed convolution)
Upsampling in CNNs is not actually a proper mathematical deconvolution. It is used to upsample
the output of a convnet back to the original image resolution. If we want our network to learn how
to upsample optimally, we can use the transposed convolution. It does not use a predefined
interpolation method; it has learnable parameters.
To understand transposed convolution, we should first look at the convolution operation in a
totally different way from what we are used to (one that gives the same output), and then use this
new view to build the transposed convolution. Let us start:
Assume we have this convolution operation as shown. We used to just slide the kernel over the
input with a specific stride and padding and perform element-wise multiplication, right? Let us
now see the same kernel but in a different organization.
We rearrange the 3×3 kernel into a 4×16 matrix as below:
This is the convolution matrix. Each row defines one convolution operation. If you do not see it,
the below diagram may help. Each row of the convolution matrix is just a rearranged kernel
matrix with zero padding in different places.
Note that we took each kernel row, placed it followed by a 0, then placed the next kernel row
followed by a 0, and so on until the 16 columns are filled. Each following row of the matrix is the
same pattern but shifted by 1 to the right.
The final thing to note is that we want to convolve a 4×4 input and end up with a 2×2 output; that
is why we rearrange the kernel as a 4×16 matrix.
To use the rearranged kernel, we flatten the input matrix (4×4) into a column vector (16×1).
Now we have a 4×16 kernel matrix and a 16×1 input, and we need the output, so we just multiply
the two matrices as shown ↓
The output is 4×1 matrix that can be reshaped to 2×2 matrix. If you check the
normal convolution we did at the beginning you will find that this method
ends up with the same result. 
The point is that with the convolution matrix, you can go from 16 (4×4) to 4 (2×2) because the
convolution matrix is 4×16. Hence, if you have a 16×4 matrix, you can go from 4 (2×2) to 16
(4×4).
I think now you see what we will do next. We just transpose the kernel matrix we have formed and
multiply it by the 4×1 output to perform the up-sampling operation. We have just up-sampled a
smaller matrix (2×2) into a larger one (4×4). The transposed convolution maintains the 1-to-9
relationship because of the way it lays out the weights.
Note: The actual weight values in the matrix do not have to come from the original convolution
matrix. What's important is that the weight layout is transposed from that of the convolution
matrix.
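The whole argument can be checked in a few lines of NumPy; the kernel values below are arbitrary and the 4×16 matrix is built exactly as described above (stride 1, no padding):

import numpy as np

def conv_matrix(kernel, in_size=4, out_size=2):
    """Build the (out_size^2, in_size^2) matrix that performs a stride-1,
    no-padding convolution of a 3x3 kernel over a 4x4 input."""
    C = np.zeros((out_size * out_size, in_size * in_size))
    for oy in range(out_size):
        for ox in range(out_size):
            for ky in range(3):
                for kx in range(3):
                    C[oy * out_size + ox, (oy + ky) * in_size + (ox + kx)] = kernel[ky, kx]
    return C

kernel = np.arange(1, 10).reshape(3, 3).astype(float)
x = np.arange(16).reshape(4, 4).astype(float)

C = conv_matrix(kernel)                   # 4x16 convolution matrix
y = (C @ x.flatten()).reshape(2, 2)       # ordinary convolution: 4x4 -> 2x2
up = (C.T @ y.flatten()).reshape(4, 4)    # transposed convolution: 2x2 -> 4x4
print(y.shape, up.shape)                  # (2, 2) (4, 4)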
U-Net (by Olaf Ronneberger, Philipp Fischer, Thomas Brox in 2015)
U-Net is a very popular end-to-end encoder-decoder network for semantic segmentation. It was
originally invented and first used for biomedical image segmentation.
(U-Net architecture figure: the contracting half is labeled DOWN and the expansive half is labeled UP)
For those who are familiar with traditional convolutional neural networks, the first part of the
architecture (denoted as DOWN) will be familiar. This first part is called "down", or you may
think of it as the encoder part, where you apply convolution blocks followed by maxpool
downsampling to encode the input image into feature representations at multiple different levels.
The second part of the network consists of upsampling and concatenation followed by regular
convolution operations. Upsampling uses transposed convolutions, as discussed before.
While upsampling and going deeper in the network, we concatenate the higher-resolution features
from the down part with the upsampled features in order to better localize and learn
representations with the following convolutions. Since upsampling is a sparse operation, we need a
good prior from earlier stages to better represent the localization. A similar idea of combining
matching levels is also seen in FPNs.
By inspecting the figure more carefully, you may notice that output dimensions (388 × 388) are
not same as the original input (572 × 572). If you want to get consistent size, you may apply
padded convolutions to keep the dimensions consistent across concatenation levels
The main drawback of using a vanilla U-Net approach in the competition (a nuclei-segmentation
task) was the overlapping nuclei. As seen in the image, if you create a binary mask and use it as
your target, U-Net will surely predict something similar to this and you will have a combined mask
for several nuclei which are overlapping or lie very close to each other. This was solved by the
weight map.
Weight-map
Since the touching objects are placed closely to each other, they are easily merged by the network;
to separate them, a weight map is applied to the output of the network.
The weight map is computed as w(x) = wc(x) + w0 · exp(−(d1(x) + d2(x))² / (2σ²)), where d1(x) is
the distance to the border of the nearest cell at position x, d2(x) is the distance to the border of the
second nearest cell, wc is a class-balancing weight, and w0 and σ are constants (the paper uses
w0 = 10 and σ ≈ 5 pixels).
Thus, at the borders between touching cells the weight is much higher, as in the figure
(segmentation map on the left and weight map on the right). 
Thus, the cross-entropy function is weighted at each position by the weight map, and this helps to
force the network to learn the small separation borders between touching cells.
The weight map is not related to the network weights but to the sample weights: during training,
you feed three pieces of information to the network, 1) the pixel intensities, 2) the pixel labels,
3) the pixel weights. Using the binary map as on the left, you are able to determine the borders,
give them more weight, and use this weight in the cross-entropy loss (similar to what we did in
focal loss).
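A rough sketch of how such a weight map could be computed with SciPy's distance transform; this is my reconstruction of the formula above, with the class-balancing term set to 1 for simplicity:

import numpy as np
from scipy.ndimage import distance_transform_edt

def unet_weight_map(instance_masks, w0=10.0, sigma=5.0):
    """Border-emphasis weight map: w(x) = 1 + w0 * exp(-(d1 + d2)^2 / (2*sigma^2)).

    instance_masks: list of boolean (H, W) arrays, one per cell instance.
    d1/d2 are the distances from each pixel to the nearest and second-nearest cell.
    """
    # Distance of every pixel to each cell (zero inside that cell), sorted per pixel.
    dists = np.sort(np.stack([distance_transform_edt(~m) for m in instance_masks]), axis=0)
    d1 = dists[0]
    d2 = dists[1] if len(instance_masks) > 1 else dists[0]
    return 1.0 + w0 * np.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))

# Two touching square "cells" on a 64x64 canvas
m1 = np.zeros((64, 64), bool); m1[20:40, 10:30] = True
m2 = np.zeros((64, 64), bool); m2[20:40, 31:50] = True
w = unet_weight_map([m1, m2])
print(w.max())   # the largest weights appear in the thin gap between the two cells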
The paper also mentions an overlap-tile strategy: because the network uses unpadded convolutions,
the output segmentation map is smaller than the input, so large images are processed tile by tile,
where each output tile is predicted from a larger input window that includes the surrounding
context, and the missing context at the image borders is filled in by mirroring the image. This lets
the network segment arbitrarily large images seamlessly.
Before going to the last semantic segmentation technique to be discussed, I want to make
something clear. Upsampling generally has many techniques, not only transposed convolution;
transposed convolution is the only one -that I know of- which is learnable, while there are other
upsampling techniques that are not learnable and just use a simple algorithm, such as nearest
neighbor interpolation and bilinear interpolation.
Nearest neighbor interpolation
This is the simplest approach to interpolation. Rather than calculating an average value by some
weighting criteria or generating an intermediate value based on complicated rules, this method
simply takes the value of the "nearest" neighboring pixel and copies it, as shown:
Bilinear interpolation
A single pixel value is calculated as the weighted average of the known values based on distances.
Given a small matrix, how do we do bilinear interpolation to make it 4×4? Simply put the known
values on the edges of the 4×4 grid and do normal linear interpolation along each axis as follows:
Along the x-axis, between the values 1 & 2 (at positions 1 and 4) we need 2 new pixel values,
according to the equation y_interpolated = y1 + (x − x1)·(y2 − y1)/(x2 − x1), noting that y
represents the pixel value and x represents the pixel position.
Then the value of the pixel at the second position = 1 + (2 − 1)·(2 − 1)/(4 − 1) = 1.333,
and the value of the pixel at the third position = 1 + (3 − 1)·(2 − 1)/(4 − 1) = 1.667.
Similarly, we do that for every single position (first along one axis, then the other) till we finish
the whole matrix.
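The same calculation in a couple of lines (NumPy's 1-D interp does exactly this linear interpolation; for a full 2-D bilinear upsample you apply it first along one axis and then along the other):

import numpy as np

# Known values 1 and 2 sit at positions 1 and 4; interpolate positions 2 and 3.
known_x = [1, 4]
known_y = [1, 2]
print(np.interp([2, 3], known_x, known_y))   # [1.333... 1.666...]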
DilatedNet (by F Yu in late 2015)
DilatedNet introduced 2 new concepts: dilated (atrous) convolution and multi-scale context
aggregation.
Dilated convolution
Dilated convolution is a simple idea: it looks exactly the same as the ordinary convolution
operation, except that we introduce a parameter l, known as the dilation factor, that defines the gap
between the kernel values; the gap is filled with zeros and hence has no effect. If l=1 it is the
normal convolution with no gaps between the values. See the examples:
The importance of dilated convolution is obvious: it increases the receptive field seen by the
network (a more global view). The receptive field grows exponentially while the number of
parameters grows only linearly. One general use is image segmentation, where each pixel is
labelled with its corresponding class. In this case, the network output needs to have the same size
as the input image. A straightforward way to do this is to apply convolutions and then add
deconvolution layers to upsample; however, that introduces many more parameters to learn.
Instead, dilated convolution is applied to keep the output resolution high, and it avoids the need for
upsampling.
The figure below shows dilated convolution on 2D data. Red dots are the inputs to a filter which is
3x3 in this example, and the green area is the receptive field captured by each of these inputs. The
receptive field is the implicit area on the initial input captured by each input (unit) to the next layer.
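A small sketch showing the effect of the dilation factor on a 3×3 kernel: inserting zeros between the kernel values enlarges the effective receptive field from 3×3 to 5×5 (l = 2) or 7×7 (l = 3) without adding any parameters.

import numpy as np

def dilate_kernel(kernel, l=2):
    """Insert (l - 1) zeros between the values of a 2-D kernel."""
    k = kernel.shape[0]
    size = l * (k - 1) + 1                   # effective receptive field of the dilated kernel
    out = np.zeros((size, size), dtype=kernel.dtype)
    out[::l, ::l] = kernel
    return out

kernel = np.ones((3, 3))
print(dilate_kernel(kernel, l=1).shape)      # (3, 3)  -- ordinary convolution
print(dilate_kernel(kernel, l=2).shape)      # (5, 5)
print(dilate_kernel(kernel, l=3).shape)      # (7, 7)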
Multi-scale context aggregation (The context module)
The context module is designed to increase the performance of dense prediction architectures by
aggregating multi-scale contextual information by using dilated convolution. The module takes
C feature maps as input and produces C feature maps as output. The input and output have the
same form, thus the module can be plugged into existing dense prediction architectures.
The context module architecture:

The context module has 7 layers that apply 3×3 convolutions with different dilation factors; the
dilations are 1, 1, 2, 4, 8, 16, and 1.
The last layer is a 1×1 convolution that maps the number of channels back to the same number as the
input. Therefore, the input and the output have the same number of channels, and the module can be
inserted into different kinds of convolutional neural networks.
The basic context module has only 1C channels throughout the module, while the large
context module has an increasing number of channels, from 1C at the input to 32C at the 7th layer.
This module can be used with other network architectures, but in their paper VGG-16 is
used as the front-end module: the last two pooling and striding layers are removed entirely,
and the context module is plugged in. The padding of the intermediate feature maps is also
removed; the authors only pad the input feature maps by a width of 33, and zero padding and reflection
padding yielded similar results in their experiments. Also, a weight initialization that considers the
number of input and output channels is used instead of standard random initialization.
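Below is a rough PyTorch sketch of the basic (1C) context module as described above: dilations 1, 1, 2, 4, 8, 16, 1 followed by a final 1×1 convolution. The padding and the ReLU placement here are my own simplifications (the paper actually removes intermediate padding and pads the input by 33), so treat it as an illustration rather than the paper's exact recipe.

import torch
import torch.nn as nn

def basic_context_module(C):
    """Basic multi-scale context module: C feature maps in, C feature maps out."""
    dilations = [1, 1, 2, 4, 8, 16, 1]
    layers = []
    for d in dilations:
        # 3x3 convolution with dilation d; padding=d keeps the spatial size.
        layers += [nn.Conv2d(C, C, kernel_size=3, dilation=d, padding=d),
                   nn.ReLU(inplace=True)]
    # Final 1x1 convolution maps back to the same number of channels.
    layers.append(nn.Conv2d(C, C, kernel_size=1))
    return nn.Sequential(*layers)

module = basic_context_module(C=21)          # e.g. 21 classes for PASCAL VOC
feats = torch.randn(1, 21, 64, 64)
print(module(feats).shape)                   # torch.Size([1, 21, 64, 64])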

====================================================================
That is enough for semantic segmentation; however, there are more recent networks such as
DeepLabv3, RefineNet & PSPNet that you can look up. In case you want a quick
review of the concepts they add, you can check this: [http://blog.qure.ai/notes/semantic-
segmentation-deep-learning-review]

Semantic segmentation is relatively easier compared to its big brother, instance segmentation.
In instance segmentation, our goal is to not only make pixel-wise predictions for every person,
car or tree but also to identify each entity separately as person 1, person 2, tree 1, tree 2, car 1,
car 2, car 3 and so on. Current state of the art algorithm for instance segmentation is Mask-
RCNN. Before going forward to read Mask-RCNN, revise faster R-CNN and the idea of RoI
pooling layer that was said in SPP-Net and Fast R-CNN.
Mask-RCNN (by Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick in 2017)
Before going deeply into the algorithm, we can say that the actual model we use to solve the
problem of instance segmentation is actually much simpler than you might think. Instance
segmentation can essentially be solved in 2 steps:
 Perform a version of object detection to draw bounding boxes around each instance of a
class
 Perform semantic segmentation on each of the bounding boxes
This amazingly simple model actually performs extremely well. It works because, if we assume
step 1 to have high accuracy, then semantic segmentation in step 2 is provided with a set of images
which are guaranteed to have only 1 instance of the main class. The job of the model in step 2 is
to just take in an image with exactly 1 main class, and predict which pixels correspond to the
main class (cat/dog/human etc.), and which pixels correspond to the background of an image.
Another interesting fact is that if we are able to solve the multi bounding box problem and
semantic segmentation problem independently, we have also essentially solved the task of
instance segmentation! The good news is that very powerful models have been built to solve
both of these problems, and putting the 2 together is a relatively trivial task.

Mask R-CNN does this by adding a branch to Faster R-CNN that outputs a binary mask that
says whether or not a given pixel is part of an object. The branch (in white in the above image),
as before, is just a Fully Convolutional Network on top of a CNN based feature map. Here are
its inputs and outputs:
 Inputs: CNN Feature Map.
 Outputs: Matrix with 1s on all locations where the pixel belongs to the object and 0s
elsewhere (this is known as a binary mask).

You may notice that the segmentation part is similar to what we used to see (FCN) but there is a
different term written here called RoI align. What is RoI align? Is it similar to the normal RoI
layer we know from SPP-Net? Let us see.
RoI align
When running the normal RoI layer without modifications on the original Faster R-CNN architecture,
the Mask R-CNN authors realized that the regions of the feature map selected by RoIPool were
slightly misaligned from the regions of the original image. Since pixel-level segmentation
requires much more fine-grained alignment than bounding boxes, Mask R-CNN improves the
RoI pooling layer (renamed the "RoIAlign layer") so that the RoI can be better and more precisely
mapped to the regions of the original image.

But what was the cause of the misalignment? Simply, it was rounding up. Still didn't get it?!
OK, let us assume we have an image of size 128×128 which has a region of interest of size 15×15.
After passing through the convnet, the resulting feature map is of size 25×25, so the size is reduced
by a factor of 128/25 = 5.12, and the RoI should therefore be reduced by the same factor and become
2.929×2.929. The normal RoI layer does not output 2.929×2.929 but rounds it up to the nearest integer,
so the region of interest becomes 3×3, which is not accurate, hence leading to misalignment.
What did the authors of Mask R-CNN do to prevent such misalignment? They removed this rounding
(quantization) by using bilinear interpolation to compute the values at floating-point locations.

This alignment helped in boosting up the accuracy.
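A tiny numeric sketch of the misalignment described above (plain arithmetic on the toy numbers from the paragraph, not an actual RoIAlign implementation):

# Image 128x128 with a 15x15 region of interest; feature map is 25x25.
scale = 128 / 25                 # 5.12: how much the convnet shrank the image
roi_on_feature_map = 15 / scale  # 2.9296875: exact RoI size on the feature map

# RoIPool quantizes (rounds) this to an integer grid -> misalignment.
quantized = round(roi_on_feature_map)        # 3

# RoIAlign instead keeps the floating-point value and samples the feature map
# at those fractional locations using bilinear interpolation.
print(scale, roi_on_feature_map, quantized)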


Visualize the whole process as outputs here in the Mask R-CNN visualization part:
[https://medium.com/@jonathan_hui/image-segmentation-with-mask-r-cnn-ebe6d793272]
Better visualization for the architecture

Loss function
In Faster R-CNN, there were two tasks to penalize, which are classification and localization, but
now we have a new task to be penalized, which is segmentation, so it is a multi-task loss function
that combines the losses of classification, localization and segmentation mask: L = Lcls + Lbox + Lmask.
We are now familiar with both the classification and localization losses. What about the segmentation
mask loss?
It is defined as the average binary cross-entropy loss, only including the k-th mask if the region is
associated with the ground-truth class k:

Lmask = −(1/m²) Σ over cells (i, j) of [ y_ij log(ŷ_ij^k) + (1 − y_ij) log(1 − ŷ_ij^k) ]

where y_ij is the label of cell (i, j) in the true mask, ŷ_ij^k is the predicted value of the
same cell in the mask learned for the ground-truth class k, and m×m is the dimension of the mask
generated for each RoI and each class (for K total classes).
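A small sketch of that mask loss for a single RoI (my own illustration; y_true is assumed to be the m×m binary ground-truth mask and y_pred the per-class sigmoid outputs of shape K×m×m, only class k contributing to the loss):

import numpy as np

def mask_loss(y_true, y_pred, k, eps=1e-7):
    """Average binary cross-entropy over the m x m mask of ground-truth class k."""
    p = np.clip(y_pred[k], eps, 1 - eps)           # predicted mask for class k only
    bce = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return bce.mean()                              # average over the m*m cells

m, K = 28, 3
y_true = (np.random.rand(m, m) > 0.5).astype(float)   # toy ground-truth mask
y_pred = np.random.rand(K, m, m)                       # toy sigmoid outputs
print(mask_loss(y_true, y_pred, k=1))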
Summary of R-CNN
Course 4 Week 4
This week will be concerned with 2 real-world applications which are face recognition and
neural style transfer.
Face verification VS. Face recognition
Verification
 Input → Image + Name or ID
 Output → Whether the image belongs to the claimed person or not.
It is a one-to-one (1:1) problem, which means we compare 1 image to 1 ID or name.

Recognition
We have a database of K persons
 Input → Image
 Output → ID if the image is any of the K persons
It is a one-to-all (1:K) problem, which means we compare 1 image to all K entries in the database.

A face recognition system is harder than face verification because, if you have a face verification
system with 99% accuracy, that may be acceptable on its own, but if we use this system as a building
block of a face recognition system with K=100, then we have a 1% chance of error for each of the
100 persons, so the chance of making a mistake is roughly 100 times higher. This means we need the
face verification component to be much better than 99% (maybe 99.9% accuracy).

One-shot learning
One of the challenges that face recognition systems face is that we need to solve the one-shot
learning problem, which means that for most face recognition applications you need to recognize
the person given just 1 single image, or in other words given 1 training example of that person's
face, which historically is almost an impossible task for deep learning to solve.
Example
Assume we have a database of 4 persons (K=4) as shown, where each
person has a photo of his face. Now 2 persons arrive and we want
to check whether those 2 persons are in the database or are intruders. We
find that one of them is actually in the database and the other is not, so
here the case is that we are learning from one example to recognize a
person again, and the learning process will be bad due to the small dataset.
Note: If we supposed that we would train this with a convnet, the softmax
would have 5 outputs, not 4, because a "none of the above" choice would be added.

This problem is solved using learning "Similarity" function.


Learning "similarity" function
d(img1, img2)= degree of difference between images
so if img1 and img2 are of the same person then you need d(img1, img2) to be low (↓)
and if img1 and img2 are of different persons then you need d(img1, img2) to be high (↑)
then if d(img1, img2) < threshold (τ) → same person
and if d(img1, img2) > threshold (τ) → different person
Then in face recognition we just apply similarity function between the new photo and each of
the photos in the database.
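As a toy sketch of how d and the threshold τ would be used for recognition (the database entries, the feature vectors and the threshold value below are made up for illustration; in practice the vectors are the encodings produced by the Siamese network described next):

import numpy as np

def d(v1, v2):
    """Degree of difference: squared Euclidean distance between two encodings."""
    return np.sum((v1 - v2) ** 2)

def recognize(query, database, tau=0.7):
    """Compare the new face encoding against every person in the database."""
    best_name, best_dist = None, float("inf")
    for name, vec in database.items():
        dist = d(query, vec)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < tau else "unknown"

# Toy 128-d vectors standing in for the encodings of the database photos.
database = {"alice": np.random.rand(128), "bob": np.random.rand(128)}
query = database["alice"] + 0.01 * np.random.rand(128)   # a new, noisy photo of alice
print(recognize(query, database))                        # -> "alice"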

So here comes the question: how do we compute this function d? Using a Siamese network.

Siamese network
Usually we input an image (x(1)) to a network composed of convolutional, max pooling and fully
connected layers and then pass the output from the last FC layer to a softmax. Here we will do the
same sequence except that we remove the softmax layer (so now the FC layer is the last
layer). Assume that the last FC layer is composed of 128 nodes; we are going to call it f(x(1)) and
we can think of it as an encoding of x(1). So now the image x(1) is represented by a
vector of 128 values. If you want to build a face recognition system, you want to
compare two images (x(1) & x(2)), so what you do is apply x(2) to the same network with the
same parameters that were used for x(1).
Since x(2) is a different person, as shown in the figure, the resultant vector will consist of a
different 128 values (call it f(x(2))) that encode x(2).

Now we have two vectors each of length 128, which are f(x(1)) & f(x(2)), and to apply the similarity
function we do the following: d(x(1), x(2)) = ||f(x(1)) − f(x(2))||²
Triplet loss
To get good parameters for the network so that it produces a good encoding of the image, we
define and apply gradient descent on what is called the triplet loss, so we need to define the
learning objective of this system:
our learning objective is to make the distance between the anchor image (original image) and
the positive image small, and the distance between the anchor image and the negative
image as high as possible. From this learning objective the idea of the triplet loss arises, as
we always look at 3 images (anchor (A), positive (P) & negative (N)) at a time.

How to formalize the learning objective?


we want: d(A, P) ≤ d(A, N), noting that d() refers to the distance, i.e.

||f(A) − f(P)||² ≤ ||f(A) − f(N)||²

then:
||f(A) − f(P)||² − ||f(A) − f(N)||² ≤ 0
but there will be 2 trivial solutions for this inequality:
 If the network always gives an encoding equal to 0 (f(A)=f(P)=f(N)=0), the difference is
always 0.
 If the network gives equal encodings to all images (f(A)=f(P)=f(N)), the difference is
always 0.
To avoid these 2 cases we add a small positive term (+α) to the L.H.S., so the condition becomes
||f(A) − f(P)||² − ||f(A) − f(N)||² + α ≤ 0. Now α is a hyper-parameter of the network that prevents
trivial solutions; α is called the margin.

Finally, the triplet loss function:
Given 3 images (A, P & N):
ℓ(A, P, N) = max(||f(A) − f(P)||² − ||f(A) − f(N)||² + α, 0)
cost function J = Σ ℓ(A(i), P(i), N(i)) summed over all training triplets i
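A minimal NumPy sketch of this triplet loss on precomputed encodings (f_a, f_p, f_n stand for the 128-dimensional outputs f(A), f(P), f(N) of the Siamese network; the values here are random placeholders):

import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0) for one triplet."""
    pos_dist = np.sum((f_a - f_p) ** 2)
    neg_dist = np.sum((f_a - f_n) ** 2)
    return max(pos_dist - neg_dist + alpha, 0.0)

# Toy 128-d encodings; in practice these come from the shared convnet f(x).
f_a, f_p, f_n = np.random.rand(3, 128)
print(triplet_loss(f_a, f_p, f_n))

# The cost J is just the sum of this loss over all training triplets:
# J = sum(triplet_loss(f(A_i), f(P_i), f(N_i)) for each triplet i)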

How to use the cost function?


Assume you have 10,000 pictures of 1,000 persons. You have to start drawing (sampling) from these
10,000 images sets of triplets, where 2 of them (A, P) are of the same person and 1 (N) is of a
different person, then do gradient descent and repeat this process on different sets until you reach
good parameters to encode every person's picture of the 1,000 persons.

Note: You can notice here that every person has only about 10 images, so we can train the network
with a small dataset using the triplet loss function and then use the final network for one-shot
learning in your facial recognition system.

How to choose your triplets ?


Don't choose A, P, N randomly, because this would lead to an easy satisfaction of the learning
objective and gradient descent won't learn that much. You should choose triplets that are
hard to train on, so that gradient descent is able to learn and push the parameters to make the
two distances (d(A, P), d(A, N)) as far apart as possible. Andrew NG hasn't explained how we could
choose those hard triplets from the dataset, but said that if you are keen on that you could
check the paper FaceNet: A unified embedding for face recognition and clustering.

Modification on Siamese network


Triplet loss is a good approach but not the only one to build a face recognition system. The Siamese
network could be modified so that learning similarity is applied as a binary classification task, by
just adding a logistic regression unit that takes as input the element-wise (absolute) differences
between the outputs of the 2 Siamese networks, such that the output will be 1 if the 2 networks have
the same person as inputs and 0 otherwise:
ŷ = σ( Σ over k of w_k |f(x(1))_k − f(x(2))_k| + b )
Note: The input here is a pair of images to
learn from, which could be of the same or of
different persons.

Another formula:

Note: Both Siamese networks have the same parameters.
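A hedged sketch of this binary-classification variant: take the element-wise absolute differences between the two 128-d encodings and feed them to a single logistic-regression unit (the weights below are random placeholders, not trained values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def same_person_probability(f1, f2, w, b):
    """y_hat = sigmoid( sum_k w_k * |f(x1)_k - f(x2)_k| + b )"""
    diff = np.abs(f1 - f2)            # element-wise difference of the two encodings
    return sigmoid(np.dot(w, diff) + b)

f1, f2 = np.random.rand(2, 128)       # encodings from the two Siamese branches
w, b = np.random.randn(128), 0.0      # logistic-regression parameters (learned in practice)
print(same_person_probability(f1, f2, w, b))   # close to 1 -> same person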


What are deep convnets layers really learning?
Assume we will go deeper in the layers of Alex convnet (↓) to see how layers and neurons see
and deal with the input image.

Each hidden unit responds to patches of the input image that maximize its activation, so
assume each hidden unit is visualized by the 9 patches of the input image that maximize its
activation. Now let us pick one hidden unit from layer 1: the 9 patches that maximize this unit's
activation turn out to be edges inclined to the right, which means that this neuron is detecting
edges inclined to the right. When we picked another unit, we found that its 9 maximizing patches
are edges inclined to the left, so this neuron is detecting edges inclined to the left. Repeating
this for different neurons, we found that each neuron is responsible for picking up one low-level
feature to respond to, such as the left side being green, patches being orange colored, patches
being green, and so on…

So now we understand that hidden units in layer 1 are concerned with low-level features such as
edges with different inclinations, shades, colors and so on…
When we go further through the network to deeper layers, we find that hidden units now deal
with larger patch sizes that form higher-level features such as curvatures, geometrical shapes and
more complex shapes as we go deeper.

Layer 2

Layer 3

Layer 4
Note: As we go deeper, patches are bigger and concerned shapes are more complex.
Neural style transfer
Neural style transfer allows you to recreate an existing image (the content) using a given style, by
merging them in a way such that the generated image is the original content image drawn in this
style, as shown in the figure below:

Cost function
J(G)= α Jcontent(C,G) + β Jstyle(S,G)

Note: It is a little bit redundant to use 2 hyper-parameters (α, β) while we can use only 1 for
weighting but the original paper discussing this was using 2 parameters so we will follow that.

Content cost function Jcontent(C,G)


1- Say you use hidden layer l (this is a lower-case L, not the number 1, and this notation will be
used throughout this part) to compute the content cost. If we use the 1st layer as layer l, it will
force the generated image to be very similar (at the pixel level) to the content image; if you use a
very deep layer, say layer L, this will only require that, for example, if there is a dog in the
content image then there must be a dog somewhere in the generated image, i.e. similarity only at the
high-level-feature level. So we have to pick some layer in the middle (neither too shallow nor too
deep) such that the generated image keeps intermediate features, i.e. the outlines and borders of
the image, without keeping too much detail, so that the style can be applied over it.
2- Use a pretrained convnet (say VGG) such that a[l](C) and a[l](G) are the activations of layer l
(remember this is a lower-case L) for images C & G respectively.

Now what the content cost function does is measure how similar the activations for those images
(C, G) are and penalize them if they are different (with α & β in the overall cost balancing between
content and style):
Jcontent(C,G) = ½ ||a[l](C) − a[l](G)||²
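A small sketch of this content cost on two activation volumes (a_C and a_G stand in for a[l](C) and a[l](G); the ½ scaling is one common convention):

import numpy as np

def content_cost(a_C, a_G):
    """J_content = 1/2 * || a[l](C) - a[l](G) ||^2, element-wise over the volume."""
    return 0.5 * np.sum((a_C - a_G) ** 2)

a_C = np.random.rand(16, 16, 64)     # toy layer-l activations of the content image
a_G = np.random.rand(16, 16, 64)     # toy layer-l activations of the generated image
print(content_cost(a_C, a_G))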
Style cost function Jstyle(S,G)
Say we are using layer l's activations to measure style; we define the style as the
correlation between activations across channels in this layer l.
Correlation between activations across channels can be illustrated as shown:
layer l is a block of dimensions nH × nW × nC, where nH and nW define the
height and width of the feature map in this layer and nC represents the
number of channels in this layer (assume nC=5). If we
look at the first two channels (red and yellow), each activation in the red channel has a
corresponding activation in the yellow channel (as those pointed at by the arrows), so we look at the
correlation between each 2 corresponding activations along the channels. But why do the
correlations between different channels capture the style of the image? Let us see an example:
Assume the red channel represents the neuron that finds the patches
surrounded by a red border (subtle vertical texture) and the yellow channel
represents the neuron that finds the patches surrounded by a yellow border
(orange tint). Saying that there is a high correlation between the activations
of these 2 channels means that whenever a part of the image has this
type of subtle vertical texture, that part of the image will most probably
also have this orange tint, and vice versa when they are uncorrelated. So we can say now that the
degree of correlation measures how often different high-level (or intermediate-level) features occur
together.

Now what we will do is apply the concept of degree of correlation between channels to both the
style image and the generated image, to tell how close the style of the generated image is to that
of the input style image. Let us formalize what we have said:
Since we are measuring the correlation between every pair of channels, G[l](S) is an nc[l] × nc[l] matrix.
Let ai,j,k[l] = the activation at (i, j, k), where i is the height index, j is the width index and k is the channel index. Then
Gkk'[l](S) = Σi Σj ai,j,k[l](S) · ai,j,k'[l](S)   and   Gkk'[l](G) = Σi Σj ai,j,k[l](G) · ai,j,k'[l](G)
where ai,j,k[l] and ai,j,k'[l] are the corresponding activations in the two channels k and k' in layer l,
k and k' vary from 1 to nc[l], and Gkk' is the "correlation matrix" or "style matrix" or "gram matrix".
Note: As we are only multiplying activations to represent correlation, it is actually not the
correlation we know from mathematics but rather the unnormalized cross-covariance; however, we will
use the word correlation by convention.

The style cost function for layer l now is: Jstyle[l](S,G) = (1 / (2 nH[l] nW[l] nC[l])²) Σk Σk' (Gkk'[l](S) − Gkk'[l](G))²
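A NumPy sketch of the gram matrix and the per-layer style cost (the toy activations have shape (nH, nW, nC), and the normalization constant follows the formula above):

import numpy as np

def gram_matrix(a):
    """G[k, k'] = sum over i, j of a[i, j, k] * a[i, j, k']  (shape: nC x nC)."""
    nH, nW, nC = a.shape
    A = a.reshape(nH * nW, nC)       # unroll spatial positions into rows
    return A.T @ A

def style_cost_layer(a_S, a_G):
    nH, nW, nC = a_S.shape
    G_S, G_G = gram_matrix(a_S), gram_matrix(a_G)
    norm = (2 * nH * nW * nC) ** 2
    return np.sum((G_S - G_G) ** 2) / norm

a_S = np.random.rand(16, 16, 64)     # layer-l activations of the style image
a_G = np.random.rand(16, 16, 64)     # layer-l activations of the generated image
print(style_cost_layer(a_S, a_G))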


How is the generated image G generated? The algorithm:
1- Initialize G randomly.
2- Use gradient descent on the pixels of G to minimize J(G), i.e. G := G − (∂J(G)/∂G); the loss
decreases as the number of epochs increases and G gradually becomes the content image rendered in
the given style.
Remember: J(G) = α Jcontent(C,G) + β Jstyle(S,G)

===============================================
All we have been concerned with so far was 2-D images, and we were using 2-D filters with them to
perform convolutions, but in fact convolutional networks are used on 1-D and 3-D data as well.
Convolution in 1-D is used in fields like signal processing, as with EKG signals, and it is performed
the same way as in 2-D by sliding a 1-D filter of length f along the (discretized) signal of length n,
outputting a signal of length n−f+1, similarly to what occurs in 2-D convolutions.
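A quick NumPy sketch of the 1-D case: sliding a filter of length f over a signal of length n gives n − f + 1 output values (my own toy example):

import numpy as np

signal = np.random.rand(100)            # n = 100 samples (e.g. a discretized EKG)
kernel = np.array([1.0, 0.0, -1.0])     # f = 3, a simple edge-like 1-D filter

# 'valid' mode: the filter never slides past the ends of the signal.
out = np.convolve(signal, kernel, mode="valid")
print(len(out))                          # 100 - 3 + 1 = 98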
3-D is a little bit harder, as we also have depth, as in a CT scan which takes slices of your body to
represent the depth.

Note: RGB images are not 3-D volumes; they are just 2-D images with 3 channels.
One shot learning & Improvement on Siamese Network loss function
Motivation behind one shot learning
Humans learn new things with a very small set of examples — e.g. a child can generalize the
concept of a “Dog” from a single picture but a machine learning system needs a lot of examples
to learn its features. In particular, when presented with stimuli, people seem to be able to
understand new concepts quickly and then recognize variations on these concepts in future
percepts. Machine learning as a field has been highly successful at a variety of tasks such as
classification, web search, image and speech recognition. Often times however, these models do
not do very well in the regime of low data. This is the primary motivation behind One Shot
Learning.
One-shot learning (aka few-shot learning or low-shot learning)
One-shot learning can be used for object categorization problem in computer vision. Whereas
most machine learning based object categorization algorithms require training on hundreds or
thousands of images and very large datasets, one-shot learning aims to learn information about
object categories from one, or only a few, training images.
Why not use CNNs as usual to solve CV problems, especially face recognition?
CNNs can't be used directly for such problems because:
 CNN doesn’t work on a small training set
 It is not convenient to retrain the model every time we add a picture of a new person to
the system.
 When working with a huge dataset, correctly labeling the data can be costly.

However, we can use a Siamese neural network for face recognition. FaceNet is a Siamese
network which is currently the state of the art for face recognition. These types of networks are used
for Face ID recognition in smartphones. Baidu is using face recognition instead of ID cards to
allow their employees to enter their offices.
Siamese Network
Siamese networks are neural networks containing two or more identical subnetwork
components. It is important that not only the architecture of the sub networks is identical, but
the weights have to be shared among them as well for the network to be called “Siamese”. The
main idea behind Siamese networks is that they can learn useful data descriptors that can be
further used to compare between the inputs of the respective sub networks. Hereby, inputs can
be anything from numerical data (in this case the sub networks are usually formed by fully-
connected layers), image data (with CNNs as sub networks) or even sequential data such
as sentences or time signals (with RNNs as sub networks).
There is nothing more to be told about the basics of Siamese networks beyond the main course, except
for an improvement that will be discussed, so read the triplet loss part from the main course before
continuing to the improvement.
Lossless triplet loss
Assume, as you can see in this picture, that the positive
data marked as green dots are easily separated
from the type 1 errors (red) and type 2 errors (orange)
that are acting as negatives. This is the final model
after training with the normal triplet loss discussed in
the main course.

So what is the problem? It seems to work fine, doesn't it? After some reflection I realized that
there was a big flaw in the loss function, which lies in the max function max(0, loss).
There is a major issue here: every time your loss gets below 0, you lose information, a ton of
information.
First let's look at this function. It basically does this: it tries to bring the Anchor
(the current record) close to the Positive (a record that is in theory similar to the Anchor) and as far
as possible from the Negative (a record that is different from the Anchor).

Noting that this is the formula of the triplet loss,

so as long as the negative distance is greater than the positive distance + alpha, there will be no
gain for the algorithm in condensing the positive and the anchor any further.
Still didn't get the problem? Check this:
Let's pretend that:
 Alpha is 0.2
 Negative distance is 2.4
 Positive distance is 1.2
The loss function result will be 1.2 − 2.4 + 0.2 = −1. Then
when we look at max(−1, 0) we end up with 0 as the loss. The
positive distance could be anywhere below 2.2 and the clipped loss
would still be the same (0). With this reality, it's going to be very
hard for the algorithm to reduce the distance between the
Anchor and the Positive value.
Here are 2 scenarios, A and B. They both represent what the
loss function measures for us. After the max function, both
A and B return 0 as their loss, which is a clear loss of information. Just by looking, we
can say that B is better than A. In other words, you cannot trust the loss function result.
Solution (Lossless triplet function)
With the title, you can easily guess what the plan is: to make a loss function that will capture
the "lost" information below 0. After some basic geometry, I realized that if you constrain the N-
dimensional space where the loss is calculated, you can control this more efficiently. So the first
step was to modify the model: the last layer (the embedding layer) needed to be controlled in size.
By using a sigmoid activation function instead of a linear one, we can guarantee that each
dimension will be between 0 and 1.

Secondly, instead of the linear cost we use a non-linear cost function, so with this new
non-linearity our cost function now looks like:

where N is the number of dimensions (the number of outputs of your network, i.e. the number of
features of your embedding) and β is a scaling factor.

As you can see, the result speaks for itself:


We now have a very condensed cluster of points, much more so than with the standard triplet loss
result.
Another main topic
Generative Adversarial Network (GAN)
Analogy of Generative Adversarial Network (GAN) from real life example
The easiest way to understand what GANs are is through a simple analogy: Suppose there is a
shop which buys certain kinds of wine from customers which they will later resell.

However, there are bad customers who sell fake wine in order to get money. In this case, the
shop owner has to be able to distinguish between the fake and authentic wines.

You can imagine that initially, the forger might make a lot of mistakes when trying to sell the
fake wine and it will be easy for the shop owner to identify that the wine is not authentic.
Because of these failures, the forger will keep on trying different techniques to simulate the
authentic wines and some will eventually be successful. Now that the forger knows that certain
techniques got past the shop owner's checks, he can start to further improve the fake wines
based on those techniques.
At the same time, the shop owner would probably get some feedback from other shop owners or
wine experts that some of the wines that she has are not original. This means that the shop
owner would have to improve how she determines whether a wine is fake or authentic. The goal
of the forger is to create wines that are indistinguishable from the authentic ones, and the goal
of the shop owner is to accurately tell if a wine is real or not.
This back-and-forth competition is the main idea behind GANs.

Let us then discuss the components of GANs based on this analogy:


Using the example above, we can come up with the architecture of a GAN.

There are two major components within GANs: the generator and the discriminator. The shop
owner in the example is known as a discriminator network and is usually a convolutional neural
network (since GANs are mainly used for image tasks) which assigns a probability that the
image is real.
The forger is known as the generative network, and is also typically a convolutional neural
network (with deconvolution layers). This network takes some noise vector and outputs an
image. When training the generative network, it learns which areas of the image to
improve/change so that the discriminator would have a harder time differentiating its generated
images from the real ones.
The generative network keeps producing images that are closer in appearance to the real images
while the discriminative network is trying to determine the differences between real and fake
images. The ultimate goal is to have a generative network that can produce images which are
indistinguishable from the real ones.
Generative Adversarial Networks GAN (by Ian Goodfellow in 2014)
Generative Adversarial Networks take a game-theoretic approach, unlike a conventional
neural network. The network learns to generate from a training distribution through a 2-player
game. These two adversaries are in constant battle throughout the training process. The 2
players are actually deep neural networks (mostly CNNs, since we are dealing with images, but other
domains such as music & speech are also included) that are called:
 Generative network: Called the Generator.
o Given a certain label, tries to predict features.
o EX: Given an email is marked as spam, predicts (generates) the text of the email.
o Generative models learn the distribution of individual classes.

 Adversarial network: Called the Discriminator.


o Given the features, tries to predict the label.
o EX: Given the text of an email, predicts (discriminates) whether spam or not-spam.
o Discriminative models learn the boundary between classes.

What is the random noise?


First, we sample some noise z using a normal or
uniform distribution. With z as an input, we use a
generator G to create an image x (x=G(z)).
Conceptually, z represents the latent features of the
images generated, for example, the color and the
shape. In Deep learning classification, we don’t control
the features the model is learning. Similarly, in GAN, we don’t control the semantic meaning
of z; we let the training process learn it.

What are generative algorithms generally?


You can group generative algorithms into one of three buckets:
 Given a label, they predict the associated features (Naive Bayes)
 Given a hidden representation, they predict the associated features (Variational
Autoencoders, Generative Adversarial Networks)
 Given some of the features, they predict the rest (Inpainting, Imputation)
So now let us see architecture of one of the popular generator used with GAN called DCGAN.
Deep Convolutional Generative Adversarial Network (DCGAN)
DCGAN is one of the most popular designs for the generator network. It performs multiple
transposed convolutions to upsample z and generate the image x. We can view it as a deep
learning classifier running in the reverse direction. It is mainly composed of convolution layers
without max pooling or fully connected layers. It uses strided convolution and transposed convolution
for the downsampling and the upsampling. It actually follows the idea of FCN.

The network has 4 convolutional layers, all followed by Batch Normalization (except for the
output layer) and Rectified Linear unit (ReLU) activations.
It takes as an input a random vector z (drawn from a normal distribution). After reshaping z to
have a 4D shape, we feed it to the generator that starts a series of upsampling layers.
Each upsampling layer represents a transpose convolution operation with strides 2. The stride of
a transpose convolution operation defines the size of the output layer. With “same” padding and
stride of 2, the output features will have double the size of the input layer.
In short, the generator begins with this very deep but narrow input vector. After each transpose
convolution, z becomes wider and shallower. All transpose convolutions use a 5×5 kernel size,
with depths reducing from 512 all the way down to 3, representing an RGB color image.
The final layer outputs a 32×32×3 tensor — squashed between values of -1 and 1 through
the Hyperbolic Tangent (tanh) function.
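A rough PyTorch sketch in the spirit of the DCGAN generator described above; the exact layer count, the 4×4 starting size and the channel schedule are my own simplifications, not the article's code:

import torch
import torch.nn as nn

class Generator(nn.Module):
    """Project z, then repeatedly double the spatial size with stride-2
    transposed convolutions (5x5 kernels), BatchNorm + ReLU everywhere
    except the tanh output layer."""
    def __init__(self, z_dim=100):
        super().__init__()
        self.project = nn.Linear(z_dim, 4 * 4 * 512)   # reshape target: 4x4x512
        def up(c_in, c_out):
            return nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, 5, stride=2, padding=2, output_padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True))
        self.net = nn.Sequential(
            up(512, 256),                               # 4x4   -> 8x8
            up(256, 128),                               # 8x8   -> 16x16
            nn.ConvTranspose2d(128, 3, 5, stride=2, padding=2, output_padding=1),
            nn.Tanh())                                  # 16x16 -> 32x32x3 in [-1, 1]

    def forward(self, z):
        x = self.project(z).view(-1, 512, 4, 4)
        return self.net(x)

z = torch.randn(16, 100)            # a batch of 16 random latent vectors
print(Generator()(z).shape)         # torch.Size([16, 3, 32, 32])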

Now the generator by itself produces garbage, as the initial input is a vector of randomly chosen
values drawn from a Gaussian or any other distribution (latent features). So it needs to be trained
to enhance its output image each epoch, but to enhance the output we need the adversarial
network that will guide the generator, so let us look at the discriminator now.
Discriminator
The discriminator is much simpler, as it looks like a typical machine/deep learning classification
problem where we have two outputs (prediction and label) to compare and compute the error between.
GAN builds a discriminator to learn what
contributes as real images, and it provides feedback
to the generator to create paintings that look like
the real Monet paintings. So how is it done
technically? The discriminator looks at real images
(training samples) and generated images separately.
It distinguishes whether the input image to the
discriminator is real or generated. The
output D(X) is the probability that the input x is
real, i.e. P(class of input = real image).
We train the discriminator just like a deep network
classifier. If the input is real, we want D(x)=1. If it
is generated, it should be zero. Through this process, the discriminator identifies features that
contribute to real images. The last activation is sigmoid to tell us the probability of whether the
input image is real or not. So, the output can be any value between 0 and 1.

On the other hand, we want the generator to create images with D(x) = 1. So we can train the
generator by backpropagating this target value all the way back to the generator, i.e. we train the
generator to create images that move towards what the discriminator thinks is real.

So you have a double feedback loop:


 The Discriminator is in a feedback loop with the
ground truth of the images (are they real or
fake), which we know.
 The Generator is in a feedback loop with the
Discriminator (did the Discriminator label it
real or fake, regardless of the truth).
There is one catch in this process of training the generator via the GAN. We do not want the
discriminator’s weights to be affected because we are using the discriminator as merely a
classifier. However, we also need to train the discriminator as well so that it can do a good job
as a classifier of real and fake images. So how to do this ? let us check the training procedure.
Training procedure and its problem
We train the discriminator and the generator in turn in a loop as follows:
1. Set the discriminator trainable
2. Train the discriminator with the real MNIST digit images and the images generated by
the generator to classify the real and fake images.
3. Set the discriminator non-trainable
4. Train the generator as part of the GAN. We feed latent samples into the GAN and let the
generator to produce digit images and use the discriminator to classify the image.
The loop should ideally continue until they are both trained well and cannot be improved any
further. But does it actually work? The results of a simple GAN are not outstanding: some of the
generated images look pretty good, but others do not. However, as research in this field goes on,
results keep getting better with improved generator models such as the one we discussed (DCGAN).
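A skeleton of that alternating loop in PyTorch (a hedged sketch of my own; generator, discriminator and real_loader are assumed to be given, with the discriminator outputting a sigmoid probability of shape (batch, 1)). The "set trainable / non-trainable" steps correspond here to which optimizer is stepped and to detaching the generator output:

import torch
import torch.nn as nn

def train_gan(generator, discriminator, real_loader, z_dim=100, epochs=1):
    """Alternating GAN training: one discriminator step, then one generator step."""
    bce = nn.BCELoss()
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))

    for _ in range(epochs):
        for real in real_loader:                      # batches of real images
            b = real.size(0)
            fake = generator(torch.randn(b, z_dim))   # images from latent samples

            # Steps 1-2: train D on real (label 1) and generated (label 0) images;
            # fake.detach() keeps the generator out of this update.
            opt_d.zero_grad()
            loss_d = bce(discriminator(real), torch.ones(b, 1)) + \
                     bce(discriminator(fake.detach()), torch.zeros(b, 1))
            loss_d.backward()
            opt_d.step()

            # Steps 3-4: train G through the (frozen) discriminator; only opt_g
            # is stepped, so D's weights are not affected by this update.
            opt_g.zero_grad()
            loss_g = bce(discriminator(fake), torch.ones(b, 1))
            loss_g.backward()
            opt_g.step()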

Direct tips for training GANs


 Pre-training the Discriminator before you start training the Generator will establish a
clearer gradient.
 When you train the Discriminator, hold the Generator values constant. When you train
the Generator, hold the Discriminator values constant. This gives the networks a better
read on the gradient it must learn by.
 GANs are formulated as a game between two networks and it is important (and difficult!)
to keep them in balance. If either the Generator or Discriminator is too good, it can be
hard for the GAN to learn.
 GANs take a long time to train. On a single GPU a GAN might take hours, on a single
CPU a GAN might take days.
GANs were just invented in 2014 — they are very new! GANs are a promising family of
generative models because, unlike other methods, they produce very clean and sharp images
and learn weights that contain valuable information about the underlying data. However, as
mentioned above it can be difficult to keep the Discriminator and Generator networks in
balance. There is a lot of on-going work to make GAN training more stable.

What could be the importance of GANs?


 Using GAN we can create our own dataset based on examples. Although the quality of
GAN in such tasks is not yet very supreme people have already been employing this
approach in various machine learning tasks.
 Image super-resolution: take a low-resolution image and synthesize a high-resolution
equivalent
 Create art: GANs can be used to create interactive programs that assist the user in
creating realistic images that correspond to rough scenes in the user’s imagination.
 Image-to-image translation: convert one type of photos into another type, or convert
sketches to images.
 Text to image translation: generate realistic image from a text description and vice versa
 Image inpainting: completing an image with missing patches.
 Generating video frames continuation by learning from existing frames
 Chatbots that are more useful and more engaging (this can be thought of as Turing
learning, where each party is trying its hardest to outsmart the other, strengthening both as a
result)

Now let us discuss the toughest part, the one that makes GANs a hard task to ultimately solve, which
is the objective function of GANs and the Nash equilibrium it converges to.

Nash equilibrium (objective function of GANs)


GAN is based on the zero-sum non-cooperative game. In short, if one wins the other loses. A
zero-sum game is also called minimax. Your opponent wants to maximize its actions and your
actions are to minimize them. In game theory, the GAN model converges when the
discriminator and the generator reach a Nash equilibrium.
 For Discriminator: Now, we will go through some simple equations. The discriminator
outputs a value D(x) indicating the chance that x is a real image. Our objective is to
maximize the chance to recognize real images as real and generated images as fake. i.e.
the maximum likelihood of the observed data. To measure the loss, we use cross-
entropy as in most deep learning: p log(q). For a real image, p (the true label for real
images) equals 1. For generated images, we reverse the label (i.e. one minus the label). So
the objective becomes:
max_D V(D) = E_x~pdata [log D(x)] + E_z~pz [log(1 − D(G(z)))]
 For generator: its objective function wants the model to generate images with the highest
possible value of D(x) to fool the discriminator, i.e.
min_G E_z~pz [log(1 − D(G(z)))]

Hence the final minimax equation will be:
min_G max_D V(D, G) = E_x~pdata [log D(x)] + E_z~pz [log(1 − D(G(z)))]

Once both objective functions are defined, they are learned jointly by the alternating gradient
descent. We fix the generator model’s parameters and perform a single iteration of gradient
descent on the discriminator using the real and the generated images. Then we switch sides. Fix
the discriminator and train the generator for another single iteration. We train both networks in
alternating steps until the generator produces good quality images. The following summarizes
the data flow and the gradients used for the backpropagation.

Algorithm with losses: usually k=1, so per iteration we do one step for the discriminator and one step for the generator.

You think we have finished?! No. This model suffered from the generator's diminished gradient
(vanishing gradients), so we have to understand it and solve it.
Generator diminished gradient
It is observed that optimizing the generator objective function does not work so well. This is
because, when a generated sample is likely to be classified as fake, the model would like to
learn from the gradients, but the gradients turn out to be relatively flat. This makes it difficult for
the model to learn.
In other words, we can say that the discriminator usually wins early against the generator: it is
always easier to distinguish the generated images from real images in early training. That
makes V approach 0, i.e. log(1 − D(G(z))) → 0. The gradient for the generator will also vanish,
which makes the gradient descent optimization very slow. So how do we solve it?
Instead of minimizing the likelihood of the discriminator being correct, we maximize the likelihood
of the discriminator being wrong. Therefore, we perform gradient ascent on the generator; the GAN
provides an alternative objective to backpropagate the gradient to the generator, as shown:
max_G E_z~pz [log D(G(z))]   instead of   min_G E_z~pz [log(1 − D(G(z)))]

There are more recent improvements on the cost function, introduced in 2017, called Wasserstein
GANs, but I still haven't fully grasped the idea. They propose ideas such as using the Wasserstein
distance as the loss function, with some enhancements to the model such as adding noise to the
inputs of the discriminator and using one-sided label smoothing (instead of using hard 0 & 1 labels,
a softened value such as 0.9 is used for the real label).
You can check its explanation here: [https://medium.com/@jonathan_hui/gan-wasserstein-gan-
wgan-gp-6a1a2aa1b490]
END OF PART 3
COMPUTER VISION
