What Is Computer Vision?
Some figures and text in these slides are adapted from these references.
A few basics
Vector spaces and geometries
• R^n: the vector space
• A^n: the affine space
• E^n: A^n with the dot product (Euclidean)
• P^n: R^n plus the points at infinity (projective)
• Vector space to affine space: isomorphic, one-to-one correspondence
Pts, lines, parallelism
Numerical computation and linear algebra
• Gaussian elimination
• LU decomposition
• Cholesky (symmetric positive definite): A = LL^T
• orthogonal decomposition
• QR (Gram-Schmidt)
• SVD (the high(est)light of linear algebra!)
– A_{m×n} = U Σ V^T
– row space: the first r columns of V
– null space: the last columns of V
– column space: the first r columns of U
– null space of the transpose (left null space): the last columns of U
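A minimal NumPy sketch (the matrix here is just an illustrative example) of how the SVD exposes these four subspaces:

```python
import numpy as np

# An illustrative rank-2 matrix (4 x 3).
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 2.],
              [2., 1., 3.]])

U, s, Vt = np.linalg.svd(A)      # A = U Σ V^T (full SVD)
r = int(np.sum(s > 1e-10))       # numerical rank

row_space  = Vt[:r].T            # first r columns of V
null_space = Vt[r:].T            # last columns of V:   A @ x = 0
col_space  = U[:, :r]            # first r columns of U
left_null  = U[:, r:]            # last columns of U:   A.T @ y = 0

assert np.allclose(A @ null_space, 0)
assert np.allclose(A.T @ left_null, 0)
```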
Applications
High dimensionality discussion: the curse of dimensionality
• The unit hypercube [0,1]^d
•…
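A small illustrative experiment (not from the slides) showing one face of the curse in [0,1]^d: as d grows, distances to random points concentrate, so 'near' and 'far' neighbors become hard to tell apart:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))            # 500 random points in [0,1]^d
    q = rng.random(d)                   # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    # relative spread of distances shrinks as d grows
    print(d, (dist.max() - dist.min()) / dist.min())
```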
Visual recognition
Image classification and object recognition
• Viewing an image as a one-dimensional vector of features x, thanks to features! (or simply pixels if you would like),
• Image classification and recognition becomes a machine learning application f(x).
• Unsupervised: given examples x_i of a random variable x, implicitly or explicitly we learn the distribution p(x), or some properties of p(x)
– PCA Principal Components Analysis
– K-means clustering
• Given a set of training images and labels, we predict labels of a test image.
• Supervised: given examples x_i and associated values or labels y_i, we learn to predict y from x, usually by estimating p(y|x)
– KNN K-Nearest Neighbors (a minimal sketch follows after this list)
• Distance metric and the (hyper)parameter K
• Slow at test time, due to the curse of dimensionality
– From linear classifiers to nonlinear neural networks
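A minimal brute-force k-NN sketch in NumPy (function and variable names are illustrative), which also makes the test-time cost explicit:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Brute-force k-NN. y_train: non-negative integer class labels.
    Test-time cost is O(N_train * N_test * D) -- slow in high dimensions."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)       # L2 distance metric
        nearest = y_train[np.argsort(d)[:k]]          # labels of the k nearest
        preds.append(np.bincount(nearest).argmax())   # majority vote
    return np.array(preds)
```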
Unsupervised
• K-means clustering
– Partition n data points into k clusters
– NP hard
– solved in practice with heuristics (e.g. Lloyd's algorithm; see the sketch after this list)
• PCA Principal Components Analysis
– Learns an orthogonal, linear transformation of the data
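A minimal sketch of the standard k-means heuristic (Lloyd's algorithm), with illustrative names; it alternates an assignment step and a centroid-update step:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's heuristic for k-means: not optimal (the exact problem is NP-hard)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each center moves to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```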
Supervised
• The cost can also be viewed as a functional, mapping functions to real numbers; we are learning functions f parameterized by theta. By calculus of variations, minimizing the mean squared error gives f(x) = E_{y~p(y|x)}[y]
• The SVM loss is carefully designed and special: the hinge loss, a max-margin loss
• The softmax loss is the cross-entropy between the estimated class probabilities e^{f_{y_i}} / \sum_j e^{f_j} (where f_j are the class scores) and the true class labels; it is also the negative log-likelihood loss L_i = -log( e^{f_{y_i}} / \sum_j e^{f_j} )
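A short NumPy sketch of both losses (names are illustrative; `scores` is the N×C matrix of class scores f and `y` the integer labels):

```python
import numpy as np

def softmax_cross_entropy(scores, y):
    """L_i = -log( exp(f_{y_i}) / sum_j exp(f_j) ), averaged over the batch."""
    s = scores - scores.max(axis=1, keepdims=True)        # numerical stability
    log_probs = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

def svm_hinge_loss(scores, y, delta=1.0):
    """Multiclass max-margin (hinge) loss with margin delta."""
    correct = scores[np.arange(len(y)), y][:, None]
    margins = np.maximum(0.0, scores - correct + delta)
    margins[np.arange(len(y)), y] = 0.0
    return margins.sum(axis=1).mean()
```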
MLE, likelihood and probability discussions
KL divergence, information theory, ….
Gradient-based learning
L1/L2 regularization
L1 regularization has a ‘feature selection’ property, that is, it produces a sparse weight vector, setting many entries exactly to zero.
L2 is usually diffuse, producing many small weights.
L2 usually gives superior performance when we are not concerned with explicit feature selection.
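A tiny illustrative comparison (numbers chosen arbitrarily) of one gradient/proximal step under each penalty, showing why L1 yields exact zeros while L2 only shrinks:

```python
import numpy as np

w = np.array([0.8, 0.05, -0.03, 1.2])   # illustrative weights
lam, lr = 0.1, 0.5

# L2 (weight decay): scales weights toward zero, rarely exactly zero
w_l2 = w - lr * lam * 2 * w

# L1 (soft-thresholding / proximal step): small weights snap to exactly zero
w_l1 = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

print(w_l2)   # all entries shrunk but nonzero
print(w_l1)   # the 0.05 and -0.03 entries become exactly 0
```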
Architecture prior and inductive biases
discussions ...
Hyperparameters and validation
• The hyper-parameter lambda
• Split the training data into two disjoint subsets
– One is to learn the parameters
– The other subset, the validation set, is used to guide the selection of the hyperparameters
A linearly non-separable toy example
• The toy spiral data consists of three classes (blue, red, yellow) that are not linearly separable.
– 300 pts, 3 classes
• Linear classifier fails to learn the toy spiral dataset.
• Neural Network classifier crushes the spiral dataset.
– One hidden layer of width 100
A toy example from cs231n
• The toy spiral data consists of three classes (blue, red, yellow) that are not linearly separable.
– 3 classes, 100 pts for each class
• Softmax linear classifier fails to learn the toy spiral dataset.
– One layer: W is 2×3, plus a bias b
– analytic gradients, 190 iterations, loss from 1.09 to 0.78, 48% training set accuracy
• Neural Network classifier crushes the spiral dataset.
– One hidden layer of width 100: W1 is 2×100 with b1, W2 is 100×3; only a few extra lines of Python code!
– 9000 iterations, loss from 1.09 to 0.24, 98% training set accuracy
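A hedged NumPy sketch of that two-layer network, in the spirit of the cs231n toy example (the hyperparameter values here are illustrative); X is the N×2 spiral data, y the integer labels in {0, 1, 2}:

```python
import numpy as np

def train_two_layer(X, y, h=100, lr=1.0, reg=1e-3, iters=9000, seed=0):
    rng = np.random.default_rng(seed)
    W1 = 0.01 * rng.standard_normal((2, h)); b1 = np.zeros(h)
    W2 = 0.01 * rng.standard_normal((h, 3)); b2 = np.zeros(3)
    n = len(X)
    for it in range(iters):
        hidden = np.maximum(0, X @ W1 + b1)                 # ReLU hidden layer
        scores = hidden @ W2 + b2
        e = np.exp(scores - scores.max(axis=1, keepdims=True))
        probs = e / e.sum(axis=1, keepdims=True)            # softmax
        if it % 1000 == 0:
            loss = -np.log(probs[np.arange(n), y]).mean() \
                   + 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
            print(it, loss)
        # backprop of the cross-entropy loss
        dscores = probs.copy()
        dscores[np.arange(n), y] -= 1
        dscores /= n
        dW2 = hidden.T @ dscores + reg * W2; db2 = dscores.sum(axis=0)
        dhidden = dscores @ W2.T
        dhidden[hidden <= 0] = 0                            # ReLU gradient
        dW1 = X.T @ dhidden + reg * W1;      db1 = dhidden.sum(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2
```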
Generalization
• The ‘optimization’ reduces the training errors (or residual errors before)
• The ‘machine learning’ wants to reduce the generalization error, or the test error, as well.
• The generalization error is the expected value of the error on a new input, estimated from the test set.
– Make the training error small.
– Make the gap between training and test error small.
• Underfitting: the model is not able to obtain a sufficiently low error on the training set.
• Overfitting: the model is not able to narrow the gap between the training error and the test error.
The training data was generated synthetically, by randomly sampling x values and choosing y deterministically
by evaluating a quadratic function. (Left) A linear function fit to the data suffers from
underfitting. (Center) A quadratic function fit to the data generalizes well to unseen points. It does not suffer from
a significant amount of overfitting or underfitting. (Right) A polynomial of degree 9 fit to
the data suffers from overfitting. Here we used the Moore-Penrose pseudoinverse to solve
the underdetermined normal equations. The solution passes through all of the training
points exactly, but we have not been lucky enough for it to extract the correct structure.
The capacity of a model
• The old Occam’s razor: among competing hypotheses, we should choose the “simplest” one.
• The modern VC (Vapnik-Chervonenkis) dimension: the largest possible value of m for which there exists a training set of m different x points that the binary classifier can label arbitrarily.
• The no-free-lunch theorem: there is no universally best learning algorithm and no best form of regularization; they are task specific.
Practice: learning rate
Left: A cartoon depicting the effects of different learning rates. With low learning rates the
improvements will be linear. With high learning rates they will start to look more exponential.
Higher learning rates will decay the loss faster, but they get stuck at worse values of loss
(green line). This is because there is too much "energy" in the optimization and the
parameters are bouncing around chaotically, unable to settle in a nice spot in the optimization
landscape. Right: An example of a typical loss function over time, while training a small
network on CIFAR-10 dataset. This loss function looks reasonable (it might indicate a slightly
too small learning rate based on its speed of decay, but it's hard to say), and also indicates
that the batch size might be a little too low (since the cost is a little too noisy).
Practice: avoid overfitting
• Deeper networks are able to use far fewer units per layer and far fewer parameters, and frequently generalize better to the test set
• But they are harder to optimize!
• Choosing a deep model encodes a very general belief that
the function we want to learn involves composition of
several simpler functions. Or the learning consists of
discovering a set of underlying factors of variation that can
in turn be described in terms of other, simpler underlying
factors of variation.
Curse of dimensionality
• Converting input images into a feature vector loses the spatial neighborhood structure
• The complexity increases to cubic
• Yet the connectivity becomes local to reduce the complexity!
What is a convolution?
A CNN arranges its neurons in three dimensions (width, height, depth). Every layer
of a CNN transforms the 3D input volume to a 3D output volume. In this example,
the red input layer holds the image, so its width and height would be the dimensions
of the image, and the depth would be 3 (Red, Green, Blue channels).
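For concreteness, a minimal NumPy sketch (illustrative; stride 1, no padding) of what a single filter of a convolutional layer computes over such a 3D input volume; like most deep learning libraries, it actually implements cross-correlation:

```python
import numpy as np

def conv2d_single(image, kernel):
    """image: H x W x C, kernel: kH x kW x C -> (H-kH+1) x (W-kW+1) activation map."""
    H, W, C = image.shape
    kH, kW, _ = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product of the filter with a local region, across the full depth
            out[i, j] = np.sum(image[i:i+kH, j:j+kW, :] * kernel)
    return out
```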
Convolutional Neural Networks
LeNet, 1998 (input 32*32, CPU):
• Input 32*32*1
• 7 layers: 2 conv and 4 classification
• 60 thousand parameters
• Only two complete convolutional layers
– Conv, nonlinearities, and pooling as one complete layer
AlexNet, 2012:
• Input 224*224*3
• 8 layers: 5 conv and 3 fully connected classification
• 5 convolutional layers, with layers 3, 4, 5 stacked on top of each other
• Three complete conv layers
• 60 million parameters, insufficient data
• Data augmentation:
– Patches (224 from 256 input), translations, reflections
– PCA to simulate changes in intensity and colors
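A hedged sketch of the crop/reflection part of this augmentation (the PCA-based color jitter is omitted); names are illustrative:

```python
import numpy as np

def random_crop_flip(img, crop=224, rng=np.random.default_rng()):
    """img: 256 x 256 x 3 array -> a random crop x crop x 3 patch, maybe mirrored."""
    H, W, _ = img.shape
    top  = rng.integers(0, H - crop + 1)
    left = rng.integers(0, W - crop + 1)
    out = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        out = out[:, ::-1]          # horizontal reflection
    return out
```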
The motivation of convolutions
The Pooling Layer
Pooling layer down-samples the volume spatially, independently in each depth slice of
the input volume.
Left: the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into
output volume of size [112x112x64]. Notice that the volume depth is preserved.
Right: The most common down-sampling operation is max, giving rise to max
pooling, here shown with a stride of 2. That is, each max is taken over 4 numbers
(little 2x2 square).
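A minimal NumPy sketch of max pooling as described above (illustrative, no padding):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """x: H x W x D volume -> down-sampled volume; the depth D is preserved.
    Each output value is the max over a size x size window in one depth slice."""
    H, W, D = x.shape
    Ho, Wo = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((Ho, Wo, D))
    for i in range(Ho):
        for j in range(Wo):
            win = x[i*stride:i*stride+size, j*stride:j*stride+size, :]
            out[i, j, :] = win.max(axis=(0, 1))
    return out

# e.g. a [224 x 224 x 64] volume pools to [112 x 112 x 64] with size=2, stride=2
```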
The spatial hyperparameters
• Depth
• Stride
• Zero-padding
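These hyperparameters determine the output size. Assuming the usual convention, for input width W, filter size F, zero-padding P and stride S, the output width is (W - F + 2P)/S + 1; a quick check:

```python
def conv_output_size(W, F, S, P):
    """Spatial output size of a conv layer: (W - F + 2P) / S + 1."""
    return (W - F + 2 * P) // S + 1

print(conv_output_size(227, 11, 4, 0))   # 55, e.g. AlexNet's first conv layer
```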
AlexNet 2012
Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example
volume of neurons in the first Convolutional layer. Each neuron in the convolutional layer is
connected only to a local region in the input volume spatially, but to the full depth (i.e. all
color channels). Note, there are multiple neurons (5 in this example) along the depth, all
looking at the same region in the input, forming a ‘depth column’.
Right: The neurons from the Neural Network chapter remain unchanged: They still
compute a dot product of their weights with the input followed by a non-linearity, but their
connectivity is now restricted to be local spatially.
Receptive field
Gabor functions
Gabor-like learned features
CNN architectures and algorithms
CNN architectures
• The conventional linear structure, linear list of layers, feedforward
• Generally a DAG, directed acyclic graph
• ResNet simply adds the input back (residual/skip connections)
• LeNet, 1998
• AlexNet, 2012
• VGGNet, 2014
• ResNet, 2015
VGGNet
• 16 layers
• Only 3*3 convolutions
• 138 million parameters
ResNet
• 152 layers
• ResNet50
Computational complexity
• The memory bottleneck
• GPU, a few GB
Stochastic Gradient Descent
• Transfer learning
• Fine-tuning the CNN
– Keep some early layers
• Early layers contain more generic features, edges, color blobs
• Common to many visual tasks
– Fine-tune the later layers
• More specific to the details of the class
• CNN as feature extractor
– Remove the last fully connected layer
– A kind of descriptor or CNN codes for the image
– AlexNet gives a 4096-dimensional descriptor
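A hedged PyTorch/torchvision sketch of both uses (assuming a recent torchvision; the specific calls are an assumption, not from the slides): freeze the generic early layers for fine-tuning, and drop the last fully connected layer to get the 4096-D ‘CNN codes’:

```python
import torch
import torchvision

# Assumes a recent torchvision; older versions use pretrained=True instead.
model = torchvision.models.alexnet(weights="DEFAULT")

# Transfer learning: freeze the early, generic convolutional layers.
for p in model.features.parameters():
    p.requires_grad = False

# Feature extractor: replace the last fully connected layer by the identity,
# so the network outputs the 4096-D descriptor (the "CNN code") per image.
model.classifier[6] = torch.nn.Identity()

with torch.no_grad():
    codes = model(torch.randn(1, 3, 224, 224))   # shape: (1, 4096)
```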
CNN classification/recognition nets
VisGraph, HKUST
Fully convolutional nets: semantic segmentation
• Classification/recognition nets produce ‘non-spatial’ outputs
– the last fully connected layer has the fixed dimension of the number of classes and throws away the spatial coordinates
Using sliding windows for object detection as classification
Mask R-CNN
Excellent results
End.
Some old notes in 2017 and 2018
Fundamentally from continuous to discrete views … from geometry to recognition
• ‘Simple’ neighborhood from topology
• discrete high order
Classification vs regression
Traditional stereo
• Input image H * W * C
• Argmin over the disparity dimension D
• Output: H*W disparity map
End-to-end deep stereo regression
• Input image H * W * C
• 18 CNN
• H*W*F
• (F features are descriptor vectors for each pixel; we may just correlate or dot-product two descriptor vectors f1 and f2 to produce a score volume of size D*H*W. But F could be further refined in successive convolution layers.)
• Soft argmin over D
• Output: H*W disparity map
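A minimal sketch of the differentiable soft argmin over a matching-cost volume (the names and the cost convention, lower = better, are assumptions here):

```python
import numpy as np

def soft_argmin(cost):
    """cost: D x H x W matching-cost volume -> H x W disparity map.
    Differentiable replacement for argmin over D: the expected disparity
    under the probabilities softmax(-cost) taken along the disparity axis."""
    c = -cost
    c = c - c.max(axis=0, keepdims=True)                # numerical stability
    p = np.exp(c) / np.exp(c).sum(axis=0, keepdims=True)
    d = np.arange(cost.shape[0]).reshape(-1, 1, 1)      # candidate disparities
    return (p * d).sum(axis=0)                          # sub-pixel H x W estimate
```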
Bayesian decision
Precision
Recall
coverage
Reinforcement learning (RL)
General non-linear optimisation
An abstract view
• The input
• The nonlinear transformation: finds a map or a transform to make them linear, but in higher dimensions; it provides a set of features describing the input, or provides a new representation for it
– Can be hand-designed kernels in SVM
– It is the hidden layer of a feedforward network
• The classification with linear models
– Can be an SVM
– Or the output layer of the linear softmax classifier
Deep learning
• The transformation to learn is compositional, with multiple layers
• A CNN is convolutional in each layer
To do
• Add drawing for ‘receptive fields’
• Dot product for vectors, convolution, more specific for 2D? Time invariant, or translation invariant, equivariant
• Convolution, then nonlinear activation, also called the ‘detection stage’, the detector
• Pooling is for efficiency, down-sampling, and handling inputs of variable sizes
Cost or loss functions - bis
• Classification is easier than regression, so always discretize and quantize the output, and convert the task into a classification task!
• One big improvement in modern NN development is that the cross-entropy loss dominates the mean squared error (L2): the mean squared error was popular and good for regression, but is not that good for NNs, because cross-entropy makes a fundamentally more appropriate distribution assumption (not a normal distribution)
• L2 for regression is harder to optimize than the more stable softmax for classification
• L2 is also less robust
Automatic differentiation (algorithmic differentiation), backprop, and its role in the development
• Differentiation: symbolic or numerical (finite differences)
• Automatic differentiation is to compute derivatives algorithmically,
backprop is only one approach to it.
• Its history is related to that of NN and deep learning
• In NNs, there are just more variables in each layer, but the elementary functions are much simpler: add, multiply, and max.
• Even the primitive function in each layer is the simplest one! There are just a lot of them!
Computational graph
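A minimal illustration (not from the slides; the names are made up) of reverse-mode differentiation on a tiny computational graph built from exactly those primitives (multiply, add, max), with a finite-difference check:

```python
# tiny graph: f(x, y, z) = max(x * y + z, 0)   -- an "affine + ReLU" node
def forward(x, y, z):
    a = x * y                 # multiply node
    b = a + z                 # add node
    f = max(b, 0.0)           # max (ReLU) node
    return f, (a, b)

def backward(x, y, z, cache):
    a, b = cache
    df_db = 1.0 if b > 0 else 0.0     # local gradient of the max node
    db_da, db_dz = 1.0, 1.0           # local gradients of the add node
    da_dx, da_dy = y, x               # local gradients of the multiply node
    # chain rule along the graph, from the output back to the inputs
    return df_db * db_da * da_dx, df_db * db_da * da_dy, df_db * db_dz

x, y, z = 2.0, -1.0, 3.0
f, cache = forward(x, y, z)
print(backward(x, y, z, cache))       # analytic gradients: (-1.0, 2.0, 1.0)

eps = 1e-6                            # numerical (finite-difference) check
print([(forward(x + eps, y, z)[0] - f) / eps,
       (forward(x, y + eps, z)[0] - f) / eps,
       (forward(x, y, z + eps)[0] - f) / eps])
```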