Deep Learning
I Introduction
Computer Vision: The goal is to make machines learn to see in a way similar to how humans do, which is why Machine Learning methods are needed to get there. CV is central to robotics: the robot has to understand its environment and what is happening around it. A lot of image and video processing is also done with CV.
Early researchers tried to construct a significant part of a visual system, and this is when the term pattern recognition was coined.
CV is also a core element of other areas such as robotics, NLP, optics and image processing, algorithm optimization, neuroscience, AI, and ML.
Previously, DL was not used for image classification; it only became popular later. Earlier pipelines did preprocessing (e.g. normalizing the colors of images), followed by a feature descriptor, which functioned somewhat like the Hubel-Wiesel experiment in the sense that certain properties, such as the exact position of edges, were not important.
Common feature descriptors are HAAR, HOG, SIFT, and SURF. These descriptors had to be hand-engineered, since most are gradient based. On top of them sit classifiers such as SVMs, random forests, or ANNs, which aggregate the features and output the label.
Instead of feature extraction plus aggregation, we have a magic box that does both for us: deep learning. We no longer have to hand-engineer the feature descriptors; instead, we let the dataset decide what the best possible descriptor is, i.e. the one that gives the best results.
Image Classification Issues:
• Occlusion
• Background clutter: background and foreground (object) have similar colors
• Representation: e.g. a cat drawing vs. a cat photo
The history started around 1940 with the electronic brain. Each cell responds to a certain pattern; the cells accumulate weighted impulses and eventually make a decision.
Around 1960 came the perceptron. Instead of fixed weights, the weights could now be learned: we show the system a couple of examples and hope to learn the parameters of these perceptrons, i.e. the feature-extraction weights and the decision threshold. This was all hardwired.
Then came Adaline (the golden age of deep learning). There was a lot of hype and progress being made. Then, in 1969, people realized the problems with perceptrons, specifically the XOR problem: a linear model (a single perceptron) cannot separate the two classes. The era that followed was called the AI winter.
In 1986, the multi-layer perceptron came to light. Several layers can be trained, i.e. the weights of the multi-layer perceptron are optimized with backpropagation, a gradient-based method.
In 1995 came the SVM. Since it was so successful, it put a halt to deep learning research.
In 2006, Hinton and Ruslan Salakhutdinov developed Deep Belief Networks, and the idea of pretraining came around: you train a neural network and then train it again for a specific task. Pretraining is still one of the most relevant ideas today (for example, transfer learning with ImageNet weights). Despite this, neural networks were still not a mainstream method.
In 2012, the AlexNet architecture (see Section X.2) was the first neural-network-based architecture to win the ImageNet competition, based on the lowest top-5 error.
Definition of top-5 error: give the method an image, ask it which class the image belongs to, and check whether the top five predictions include the correct class.
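As a minimal sketch (assuming NumPy arrays; the function name top5_error and its arguments are illustrative), the metric could be computed like this:

import numpy as np

def top5_error(scores, labels):
    # scores: (N, C) class scores per image; labels: (N,) true class indices
    top5 = np.argsort(scores, axis=1)[:, -5:]          # five highest-scoring classes per image
    correct = np.any(top5 == labels[:, None], axis=1)  # true class among the top five?
    return 1.0 - correct.mean()                        # fraction of images missed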
Several factors explain why this worked in 2012:
• Big Data: models have much more data to learn from today than back then, and the datasets are also available online.
• Better Hardware: not only has the data changed, the hardware has changed as well (e.g. GPUs). Hardware originally developed for rendering images in games is now also used for deep learning, to train models faster.
• Models have become more complex.
Typical applications of deep learning include:
• Object Detection
• Self-Driving Cars
• Gaming (e.g. AlphaGo, AlphaStar)
• Machine Translation
• Automated Text Generation (ChatBots)
• Healthcare (e.g. cancer detection)
Unsupervised Learning: learning from unlabeled data, i.e. without target labels.
Supervised Learning: learning from labeled examples. An underlying assumption is that train and test data come from the same distribution.
Nearest Neighbor Model: a supervised learning method that labels a sample with the majority label of its k nearest neighboring samples. The hyper-parameters to tune in KNN are k and the distance metric (L1 or L2).
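A minimal sketch of this prediction rule, assuming NumPy arrays of training samples and labels (all names are illustrative):

import numpy as np

def knn_predict(X_train, y_train, x_query, k=3, metric="l2"):
    # Label x_query with the majority label of its k nearest training samples.
    diff = X_train - x_query
    if metric == "l1":
        dists = np.abs(diff).sum(axis=1)              # L1 (Manhattan) distance
    else:
        dists = np.sqrt((diff ** 2).sum(axis=1))      # L2 (Euclidean) distance
    nearest = np.argsort(dists)[:k]                   # indices of the k closest samples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority vote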
Cross Validation: split the data into K folds, train on K−1 folds and validate on the held-out fold, rotating through all folds; this is used to choose hyper-parameters such as k.
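A small sketch of how such a K-fold split could be generated (NumPy; the names and the shuffling step are illustrative choices):

import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    # Yield (train_idx, val_idx) pairs for K-fold cross validation.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)                  # shuffle before splitting
    folds = np.array_split(idx, k)
    for i in range(k):
        val_idx = folds[i]                                     # held-out fold
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # remaining K-1 folds
        yield train_idx, val_idx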
Decision boundaries are the boundaries that separate the data into different classes.
Pros and cons of linear decision boundaries: they are simple and cheap to learn, but (as with the XOR problem above) they cannot separate classes that are not linearly separable.
Linear Regression is a supervised learning method that finds a linear model that explains a target $y$ given inputs $x$ with weights $\theta$:
$$\hat{y}_i = \sum_{j=1}^{d} x_{ij}\,\theta_j$$
With an explicit bias term $\theta_0$:
$$\hat{y}_i = \theta_0 + \sum_{j=1}^{d} x_{ij}\,\theta_j \;\Longrightarrow\; \hat{y} = X\theta$$
$x_{ij}$ are the features; $\theta$ are the weights (model parameters); $\theta_0$ is the bias.
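A small sketch of this model in matrix form (NumPy; names are illustrative), where a column of ones is prepended so that the bias $\theta_0$ becomes part of $\theta$:

import numpy as np

def predict(X, theta):
    # Linear model y_hat = X_b @ theta, with a prepended column of ones
    # so that theta[0] plays the role of the bias theta_0.
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])
    return X_b @ theta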
Mean squared error:
$$J(\theta) = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i\theta - y_i)^2$$
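Continuing the sketch above, the mean squared error and a least-squares fit could look as follows; solving the normal equations here is an assumption about how the "least squares estimate" referred to below was obtained:

import numpy as np

def mse(theta, X_b, y):
    # J(theta) = (1/n) * sum_i (x_i theta - y_i)^2, with X_b already containing a bias column
    residuals = X_b @ theta - y
    return np.mean(residuals ** 2)

def fit_least_squares(X_b, y):
    # Least-squares estimate via the normal equations (assumes X_b has full column rank)
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)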
Maximum likelihood estimation (MLE): find the parameter values that maximize the likelihood of the observations given the parameters.
MLE assumes that the training samples are independent and generated by the same distribu-
tion.
What shape does our probability distribution have? Assuming a Gaussian distribution:
$$y_i = \mathcal{N}(x_i\theta, \sigma^2) = x_i\theta + \mathcal{N}(0, \sigma^2)$$
$$p(y_i \mid x_i, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y_i - x_i\theta)^2}$$
So the MLE is the same as the least squares estimate we found previously.
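As a short sketch of why this holds: the negative log-likelihood of the Gaussian model above is
$$-\sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - x_i\theta)^2,$$
and since the first term and the factor $\frac{1}{2\sigma^2}$ do not depend on $\theta$, maximizing the likelihood is the same as minimizing $\sum_{i=1}^{n}(y_i - x_i\theta)^2$, i.e. the least-squares objective $J(\theta)$.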
Sigmoid function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The model output is the sigmoid of the linear score:
$$\hat{y}_i = \sigma(x_i\theta)$$
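A minimal sketch of the sigmoid and the resulting prediction (NumPy; names are illustrative):

import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta):
    # Model output y_hat_i = sigma(x_i theta), i.e. a probability in (0, 1)
    return sigmoid(X @ theta)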
Cross-entropy loss for a single sample with $C$ classes (one-hot targets $y_{i,j}$):
$$L(\hat{y}_i, y_i) = \sum_{j=1}^{C} y_{i,j} \log \hat{y}_{i,j}$$
Cost function (mean of losses for all $n$ samples):
$$C(\theta) = -\frac{1}{n}\sum_{i=1}^{n} L(\hat{y}_i, y_i)$$
This is optimized via gradient descent (there is no closed-form solution).
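A minimal sketch of this optimization for the binary (two-class) special case, using the sigmoid from above; the gradient formula $X^\top(\hat{y} - y)/n$ is the standard one for this cost, and all names are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(theta, X, y):
    # Mean cross-entropy for binary labels y in {0, 1}
    y_hat = sigmoid(X @ theta)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradient_descent(X, y, lr=0.1, steps=1000):
    # Plain gradient descent; for this cost the gradient is X^T (y_hat - y) / n
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        y_hat = sigmoid(X @ theta)
        theta -= lr * (X.T @ (y_hat - y)) / len(y)
    return theta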