
CSCI E-25

Computer Vision
Machine Learning for Vision
Steve Elston

Copyright 2021, Stephen F Elston. All rights reserved.


Machine Learning and Computer Vision
How is machine learning applied in computer vision?
• Machine learning has had a dramatic impact on CV starting in the 1980s
• Many CV applications of machine learning:
– Object recognition
– Object detection
– Image stitching
– Motion models
– Generative models
– …
Machine Learning and Computer Vision
Many ML algorithms are applied to CV problems
• Linear models
– Widely used and flexible class of models
– Our focus for today

• K-nearest neighbours
• Support vector machines
• Tree models and tree ensembles
• Naïve Bayes models
• Deep neural networks – more on these starting next week!
• …
Machine Learning and Computer Vision
Key points for this lesson
• The formulation of linear machine learning models
• Basic machine learning workflow
• Formulation of CV features for machine learning
• The relationship between bias, variance and model capacity
• Theory of binary classifiers
• Theory of multi-class classifiers
Review of Linear Models
Why linear models?
• Understandable and interpretable
• Generalize well, if properly fit
• Highly scalable – computationally efficient
• Can approximate fairly complex functions
• Applied widely to CV problems
• A basis for understanding complex models
– Many non-linear models are locally linear at convergence
– e.g. we can learn a lot about the convergence of DL and RL models from linear
approximations
Review of Linear Models
• Given a feature matrix, A, we wish to compute a linear model to
predict some labels x
• Let A be n x p.
• Then, the model has a vector of p coefficients or weights, b
• We want to compute b so that we minimize errors, e
• The predictive model is then: x = A b + e
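Not from the slides: a minimal NumPy sketch of this formulation, using synthetic data and the slide's notation (feature matrix A, labels x, weights b). The sizes and noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 5                                # n samples, p features
A = rng.normal(size=(n, p))                  # feature matrix A, n x p
b_true = rng.normal(size=p)                  # "true" weights for the synthetic example
x = A @ b_true + 0.1 * rng.normal(size=n)    # labels: x = A b + e

# Least-squares estimate of the weight vector b, then predictions
b_hat, *_ = np.linalg.lstsq(A, x, rcond=None)
x_pred = A @ b_hat
print("fitted weights:", np.round(b_hat, 3))
```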
Review of Linear Models
• How can we compute b so that we minimize the errors, e, for the model?

• Minimize the sum of the squared errors, Σ e^2 = (x - A b)^T (x - A b)


• A straightforward solution would be to invert the model: b = A^-1 x
Review of Linear Models
Solution of x = A b is difficult
• A has dimension n x p, and typically n >> p
• But the p columns of A are often collinear
– The inverse will not exist with collinear columns
– This situation leads to an ill-posed problem

• Finding the inverse directly has high computational complexity


• We will take up massively scalable methods of computing b in a few
weeks
Review of Linear Regression Problem
Solution of x = A b is difficult
• A solution can be found using the normal equations: A^T A b = A^T x

• Where A^T A is proportional to cov(A)

– cov(A) is a dense matrix
– Computing A^T A is computationally intensive

• Computing cov(A)^-1 is computationally efficient


– cov(A) has dimensions p x p
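A sketch of solving the normal equations for a feature matrix A and label vector x (the helper name normal_equation_fit is just illustrative). np.linalg.solve on the p x p system is used rather than forming an explicit inverse, which is cheaper and more stable.

```python
import numpy as np

def normal_equation_fit(A, x):
    """Solve the normal equations A^T A b = A^T x for the weights b."""
    AtA = A.T @ A          # p x p, proportional to cov(A) for centered features
    Atx = A.T @ x
    return np.linalg.solve(AtA, Atx)
```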
What could possibly go wrong?
It is typically the case that A has collinear features
• With collinear features cov(A) is not invertible
• Some features may be statistically independent of the label
– Noninformative
– They add noise to the model
– The inverse of cov(A) is unstable

• We say the solution is ill-posed!


• We must solve a biased approximation
• This process is known as regularization
– We will take up this topic in a few weeks
The L2 Regularization Method
Regularization for Machine Learning
Regularization is essential for complex ML models
• Deep learning models require learning very large numbers of
parameters
 Even with large training datasets there are only a few samples per parameter

• A large number of parameters means a high chance of over-fitting ML models


 Over-fit models learn the training data too well
 Over-fit models do not generalize
 Over-fit models have poor response to input noise

• To prevent over-fitting we apply regularization methods


l2 Regularization
L2 or Euclidean norm regularization is a widely used method
• Over-fit models tend to have parameters (weights) with extreme
values
• One way to regularize models is to limit the values of the parameters
• Limit the L2 norm of the model parameter vector
• We add a small bias term to (greatly) reduce the variance
l2 Regularization
One way to limit the size of the model parameters is to constrain the l2 or
Euclidean norm: ||b||_2^2 = Σ_j b_j^2

• The regularized loss function is then: J(b) = ||x - A b||^2 + λ ||b||_2^2

• Where λ is the regularization hyperparameter


 Large λ increases bias but reduces variance
 Small λ decreases bias and increases variance
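A minimal scikit-learn sketch of l2 (ridge) regularization; the alpha argument plays the role of λ, and the value shown is only illustrative. Features are scaled first so that a single λ penalizes all weights on a comparable scale.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
# Usage (A_train, x_train, A_test are placeholders for your own data):
# ridge_model.fit(A_train, x_train)
# predictions = ridge_model.predict(A_test)
```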
l2 Regularization
How can you gain some intuition about l2 regularization?
[Figure: weight vectors B in the (X1, X2) plane constrained to a circle of constant ||B||_2; points on the circle range from B1 ~ 0 with B2 large to B2 ~ 0 with B1 large]

l2 regularization is considered a soft constraint
L2 Regularization
How can you gain some intuition about l2 regularization?

[Figure: contours of the unregularized loss J(W)_MLE and contours of ||W||_2; starting from the initial W, the constraint on the model parameters binds before the optimization reaches Min_W(J(W)_MLE), which lies in the domain of overfitting]
l2 Regularization
l2 regularization goes by many names
• Euclidean norm regularization
• First published by Andrey Tikhonov in the late 1940s
 Only published in English in 1977
 Is known as Tikhonov regularization

 In the statistics literature often called ridge regression


 In the engineering literature it is referred to as pre-whitening
l2 Regularization

[Image: Plaque commemorating Andrey Tikhonov at the Moscow Institute of Mathematics]
L1 Regularization Methods
l1 Regularization

Regularization can be performed with other norms


• The l1 norm is another common choice
• Conceptually, the l1 norm limits the sum of the absolute values of the
weights: ||b||_1 = Σ_j |b_j|

• The l1 norm is also known as the Manhattan distance or taxi cab


distance, since it is the distance traveled on a grid between two
points.
l1 Regularization
Given the l1 norm of the weights, the loss function becomes: J(b) = ||x - A b||^2 + α ||b||_1

• Where α is the regularization hyperparameter


 Large α increases bias but reduces variance
 Small α decreases bias and increases variance

 The l1 constraint drives some weights to exactly 0


 This behavior leads to the term lasso regularization
 L1 regularization provides a hard constraint on parameters
 In contrast L2 provides soft constraints
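A comparable sketch for l1 (lasso) regularization on synthetic data. The point of the example is the last line: with only a few informative features, the l1 penalty drives most of the fitted weights to exactly zero. The data sizes and alpha value are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 20))
b_true = np.zeros(20)
b_true[:3] = [2.0, -1.5, 0.7]                      # only 3 informative features
x = A @ b_true + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(A, x)
print("nonzero weights:", int(np.sum(lasso.coef_ != 0)))   # most weights driven to 0
```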
l1 Regularization
A diagram helps develop some intuition on l1 regularization:
[Figure: weight vectors B in the (X1, X2) plane constrained to a diamond of constant ||B||_1; the vertices of the diamond lie on the axes, where B1 = 0 or B2 = 0]

L1 regularization is a hard constraint on the weights
L1 Regularization
A diagram helps develop some intuition on l1 regularization:

[Figure: contours of J(W)_MLE and contours of ||W||_1; starting from the initial W, the constrained solution lands on a vertex of the l1 diamond, so some parameters are exactly zero while one keeps a nonzero value, keeping W out of the domain of overfitting]
Bias, Variance and Model Capacity
The Bias-Variance Trade-Off

• Machine learning algorithms learn a function approximation:


f(X, w) = y, where X is the feature vector, w is the parameter vector
and y is the label
• We say that a complex function has high capacity
– High capacity model can approximate complex functions
– High capacity model has large number of parameters or weights
– But, may not generalize well
– May learn the training data too well!
The Bias-Variance Trade-Off
• High capacity models fit training data well
 Overfit models exhibit high variance
 Do not generalize well; exhibit brittle behavior
 Error_training << Error_test

• Low capacity models have high bias


 Generalize well -> low variance
 Do not fit data well

 Regularization adds bias


 Strong regularization adds significant bias
 Weak regularization leads to high variance
The Bias-Variance Trade-Off
• How can we understand the bias-variance trade-off?
• We start with the expected squared prediction error: E[(y - f_hat(X))^2]

Where: y = f(X) + ε, with ε a zero-mean noise term of variance σ^2
The Bias-Variance Trade-Off

• We can expand the error term: E[(y - f_hat(X))^2] = Bias[f_hat(X)]^2 + Var[f_hat(X)] + σ^2

• Increasing bias decreases variance


• Increasing variance decreases bias
• Notice that even if the bias and variance are 0 there is still the
irreducible error, σ^2
The Bias-Variance Trade-Off

[Figure: test error as a function of model capacity; the bias curve falls and the variance curve rises with increasing capacity, so total test error is minimized at an intermediate trade-off point]
Model Capacity, Bias and Variance
Example: capacity, bias and variance for a polynomial model linear
in coefficients

• For a set of data values, fit a
straight line model with 2
parameters
• This model has high bias, since it
does not fit the training data
well
• Variance is limited, since this low
capacity model produces
consistent predictions
Model Capacity, Bias and Variance
Example: capacity, bias and variance for a polynomial model linear
in coefficients

• Fit the same data with a 12th order
polynomial model with 13
parameters
• The model has low bias, since it
fits the training data fairly well
• But the model will have high
variance, since predictions will be
erratic
Model Capacity, Bias and Variance
Example: capacity, bias and variance for a polynomial model linear
in coefficients

• Models of many different
capacities are possible between these
extremes
• We need to find an optimal trade-off
point between bias and variance, as the sketch below illustrates
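A small sketch (with made-up data) contrasting the extremes above: a straight-line fit, a moderate-capacity fit, and a 12th order polynomial fit. Training error falls as capacity grows, but the highest-capacity fit oscillates between samples and generalizes poorly.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 15)
y = np.sin(2.0 * np.pi * x) + 0.2 * rng.normal(size=x.size)   # noisy samples

for degree in (1, 3, 12):                     # low, moderate, and high capacity
    coeffs = np.polyfit(x, y, degree)         # polynomial model, linear in coefficients
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree:2d}: training MSE = {train_mse:.4f}")
```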
Machine Learning Workflow
Machine Learning Workflow
Which types of machine learning models do we use for computer vision
problems?
• Unsupervised: Models applied when ground-truth is unknown
• Supervised: Models trained with known outcomes, or labels
– Regression for predicting numeric values
– Classifiers for predicting categories

• Initially, we focus on supervised classification


Machine Learning Workflow
How do we organize a machine learning workflow?

[Workflow diagram] Define the model and hyperparameters → split the data from the database into train and test sets → train the model on the training set → evaluate the trained model on the test set
Test Model
Machine Learning Workflow

Supervised learning of model parameters using features with known labels

[Workflow diagram] Define the model and hyperparameters → training features with known training labels → model training by minimizing errors against the known labels → trained model with learned model parameters
Machine Learning Workflow

Supervised learning of model parameters using features with known labels

[Workflow diagram] Test features with known test labels → trained model with learned model parameters → evaluate model performance
Machine Learning Workflow
How do we evaluate classification models?
• Classifiers perform a hypothesis test with four possible outcomes:
– True positive (TP): Positive cases are correctly classified
– True negative (TN): Negative cases are correctly classified
– False positive (FP): Negative case erroneously classified as positive
– False negative (FN): Positive case erroneously classified as negative

• These quantities can be organised into a confusion matrix:


Actual Positive Actual Negative
Classified Positive TP FP
Classified Negative FN TN
Machine Learning Workflow
How do we evaluate classification models?
• Accuracy: (TP + TN) / (TP + TN + FP + FN)
– The fraction of all cases classified correctly
• Selectivity or Precision: TP / (TP + FP)
– The fraction of cases classified as positive which are correctly classified
• Sensitivity or Recall: TP / (TP + FN)
– The fraction of positive cases correctly classified
• There is an inherent trade-off between precision and recall – see the workflow sketch below
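A compact scikit-learn sketch of this train/test/evaluate workflow and these metrics, using a synthetic binary problem; the dataset and model choice are only placeholders. Note that scikit-learn's confusion_matrix puts actual classes on the rows, the transpose of the layout in the table above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)   # train on the training split
y_pred = clf.predict(X_test)                       # evaluate on the held-out test split

print(confusion_matrix(y_test, y_pred))            # [[TN, FP], [FN, TP]]
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
```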
Machine Learning with Image Data
Formulating a Computer Vision Machine Learning Model

How do we formulate the machine learning model for computer vision?

• The labels, x, are the categories of the objects


– These are the known labels used to train the model

• A is the feature matrix


– Feature values are in the columns
– There are p columns

• b is the vector of p model parameters or weights


– These values are learned
Formulating a Computer Vision Machine Learning Model
How do we formulate the machine learning model for computer vision?

• A is the feature matrix


– Feature values are in the columns
– There are p columns

• What types of features can we use?


– Gray-scale pixel values
– Color channel pixel values
– Lines
– Corners and interest points
– Texture
–…
How Can We Work With Image Data?
How to prepare image pixel values for machine learning?
• Must flatten the image into a feature vector
• Example: start with a 2-d gray-scale image
• Flatten to a 1-d feature vector
• For a color image, concatenate the channel feature vectors

[Figure: an N x M gray-scale image, with pixels indexed (0,0) ... (N-1,M-1), flattened row by row into a single feature vector of length N·M]
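A NumPy sketch of this flattening step for a gray-scale and a color image; the array shapes are illustrative and the zero-filled images stand in for real pixel data.

```python
import numpy as np

gray = np.zeros((28, 28), dtype=np.float32)          # N x M gray-scale image
gray_features = gray.reshape(-1)                     # flatten row by row -> 784 features

color = np.zeros((28, 28, 3), dtype=np.float32)      # N x M x 3 color image
# Flatten each channel, then concatenate the channel feature vectors.
color_features = np.concatenate([color[..., c].reshape(-1) for c in range(3)])
print(gray_features.shape, color_features.shape)     # (784,) (2352,)
```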
How Can We Work With Image Data?
Images yield high-dimensional feature space
• A single 28x28 gray scale image -> 784 features
• A single 1024x1024x3 color image -> 3 million features!
• Large numbers of features cause convergence problems with
machine learning models
• It is hard to work with high-dimensional spaces
• Details are beyond the scope of our course
• There are more efficient ways to represent images, including:
– Principal component compression
– Extracting higher level features: edges, corners, textures, etc.
How Can We Work With Image Data?
Images yield high-dimensional feature space
• Deep neural networks are machine learning models
• Deep neural networks learn features
• Under the right conditions, learned features improve performance over hand-engineered
features
• Much more on this topic!
Review Binary Classification
Review of Binary Classification
• Binary classification selects most probable category from set {0,1}
• Binary classification is based on the Bernoulli distribution
• The Bernoulli distribution for the probability of success, π, in a single
observation (n = 1) is: P(x | π) = π^x (1 - π)^(1-x), for x in {0, 1}
Review of Binary Classification
We extend the Bernoulli distribution for multiple trials with the
Binomial distribution for k successes in n trials: P(k | n, π) = C(n, k) π^k (1 - π)^(n-k)

Where the Binomial coefficient, C(n, k), pronounced "n choose k", is: C(n, k) = n! / (k! (n - k)!)

The Binomial coefficient tells us the number of ways we can choose k


values from n possibilities
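A quick numerical check of these formulas with SciPy; the values of n, k and π are chosen arbitrarily.

```python
from scipy.special import comb
from scipy.stats import binom

n, k, pi = 10, 3, 0.5
print(comb(n, k, exact=True))    # n choose k: 120 ways to choose k of n
print(binom.pmf(k, n, pi))       # P(k successes in n trials) = C(n,k) pi^k (1-pi)^(n-k)
```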
Review of Binary Classification
• How do we perform classification with the Bernoulli distribution?
• Must transform a numeric value to the set {0,1}
• Use the logistic or sigmoid function to squash the model output
value
Review of Binary Classification
The general logistic function is f(x) = L / (1 + e^(-k(x - x0))); simplifying with k = 1, L = 1 and x0 = 0 gives the sigmoid: σ(x) = 1 / (1 + e^(-x))
Review of Binary Classification
Use logistic regression to transform linear model into binary classifier
• Start with a linear model for the link function (the log odds) for the Binomial
distribution:
λ_i = ln(π_i / (1 - π_i)) = A_i b
• The probability of the positive class is given by the inverse link function:
π_i = 1 / (1 + e^(-λ_i))
• The inverse link function transforms the linear response to the nonlinear


response!
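A short NumPy sketch tying these pieces together: a linear response passed through the inverse link (the sigmoid) and then thresholded to {0, 1}. The weights and feature values are arbitrary illustrative numbers.

```python
import numpy as np

def sigmoid(z):
    """Logistic function with k = 1, L = 1, x0 = 0."""
    return 1.0 / (1.0 + np.exp(-z))

b = np.array([0.8, -1.2, 0.3])      # learned weights (illustrative values)
A_i = np.array([1.0, 0.5, -0.7])    # one feature vector
lam = A_i @ b                        # linear response on the link (log odds) scale
p_positive = sigmoid(lam)            # inverse link: probability of the positive class
label = int(p_positive >= 0.5)       # cut-off at 0.5 gives a {0, 1} decision
print(p_positive, label)
```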
Review of Binary Classification
Example: Cut off for binary classification
• Cut off applied to the cumulative distribution function for the
positive and negative cases
Multi-Class Classification
Multi-Class Classifiers
Many applications of multi-class classifiers
• Images can contain many types of objects
• A binary classifier is not useful for object classification
• We need a multi-class classifier
Multi-Class Classifiers
How can we create multi-class classifiers?
• Approaches:
– A hierarchy of binary classifiers
– Use a multi-class distribution

• Hierarchical binary classifiers


– One vs. many
– One vs. one
Classification with the Categorical Distribution
What is the distribution for multi-class problems?
• Use the Categorical distribution for k categories with probability
mass function: p(x | π) = Π_i π_i^(x_i), where x is a one-hot encoded label

Where the probability mass for each category is: π_i ≥ 0

And the normalization of the probability distribution is: Σ_i π_i = 1


Classification with the Categorical Distribution
How do we create a categorical classifier?
• Use a softmax function for K classes: p(k) = e^(z_k) / Σ_j e^(z_j)

• The normalization, Σ_j e^(z_j), ensures the probabilities sum to 1.0


• Softmax squashes the responses of the linear models to probabilities
• Softmax is used for the response layer in deep learning models
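A minimal NumPy softmax sketch; the shift by the maximum is a standard numerical-stability trick rather than part of the definition above, and the scores are arbitrary.

```python
import numpy as np

def softmax(z):
    """Squash a vector of K linear responses into probabilities that sum to 1."""
    z = z - np.max(z)                  # improves numerical stability
    expz = np.exp(z)
    return expz / np.sum(expz)

scores = np.array([2.0, 1.0, 0.1])     # linear responses for K = 3 classes
probs = softmax(scores)
print(probs, probs.sum(), probs.argmax())   # probabilities, 1.0, most probable class
```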
Classification with the Categorical Distribution
What is the output of softmax?
• One value for each of the K categories
 For example, if we have 10 categories, there are 10 softmax output values
 Take the max as the most probable category

• Label must be one-hot encoded


 Binary value for each possible category
 Only one 1, others 0
Coding Multi-Class Labels
How do you work with multi-class labels?
• Must uniquely code each category using one-hot encoding
• Example: code a label with K = 3 levels, {e1, e2, e3}

Label | One-hot encoding
e1    | (1, 0, 0)
e2    | (0, 1, 0)
e3    | (0, 0, 1)
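A small NumPy sketch of one-hot encoding labels with K = 3 levels; the label names e1, e2, e3 are just the placeholders from the slide.

```python
import numpy as np

labels = ["e1", "e2", "e3", "e2"]                  # example label sequence
categories = sorted(set(labels))                   # ["e1", "e2", "e3"]
index = {c: i for i, c in enumerate(categories)}

one_hot = np.zeros((len(labels), len(categories)), dtype=int)
one_hot[np.arange(len(labels)), [index[c] for c in labels]] = 1
print(one_hot)                                     # each row has exactly one 1
```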
Classification with the Categorical Distribution
• Visualize the categorical distribution as a simplex
• Example: encode 3 possible categories, {e1, e2, e3}
• Each category falls at a vertex of the simplex, (e1,0,0), (0,e2,0) or (0,0,e3), with its probability mass

[Figure: the 3-category probability simplex with corners A, B, C; lattice points such as (4,0,0), (2,1,1) and (1,3,0) show possible count vectors over the three categories]
Multi-Class Logistic Regression
How do we build a multinomial classifier?
1. Start with K classes, {1,2,3,…K}, with labels one-hot encoded
2. The most probable class is: argmax_k P(class = k)
3. The One vs. Rest method uses K-1 classifiers to compute the probabilities P(class = k), for k = 1,…,K-1
4. The probability of the Kth, or pivot, class is: P(class = K) = 1 - Σ_{k=1}^{K-1} P(class = k)
Multi-Class Logistic Regression
Use K-1 linear classifiers
Start with the K-1 log probability ratios against the pivot class:
ln[ P(x_i = e_k) / P(x_i = e_K) ] = b_k · A_i
where
b_k is the vector of coefficients for the kth model
A_i is the ith feature vector
x_i is the ith one-hot encoded label vector
Multi-Class Logistic Regression
Find the K-1 probabilities with the linear classifiers:
P(x_i = e_k) = e^(b_k · A_i) / (1 + Σ_{j=1}^{K-1} e^(b_j · A_i))

• The largest probability is the most likely category


• But there may not be much difference between classes
Multi-Class Logistic Regression
What could possibly go wrong?
• Probability differences between classes can be small
– Sometimes better to consider the top few classes
– Example: report the 5 most probable of 1,000 object classes in an image

• Assumption of well-separated classes


– But class characteristics can overlap and be non-unique

• Assumption of balance in class samples in training data


– Imbalance common – some objects are rare
– Model learning is biased toward the more frequent cases
Multi-Class Logistic Regression
What could possibly go wrong?
• Consider fruit processing with 8 categories
{A_large, A_medium,…, B_small, unmarketable, not_fruit}
• Imbalance in classes
– Example: perhaps A_large is infrequent and B_medium is the most frequent
• Classes not well defined
– How different is A_small from B_small to a computer vision sensor?
• If each classifier is 95% accurate, the overall accuracy of the 7 one vs. rest classifiers is only 0.95^7 ≈ 0.70!
Multi-Class Logistic Regression
How else can we build a multi-class classifier?
• One vs. rest uses K-1 classifiers to compute probabilities
• Could use the One vs. One algorithm instead
• The one vs. one algorithm requires K(K-1)/2 classifiers
Multi-Class Logistic Regression
How else can we build a multi-class classifier?
• The one vs. one algorithm uses K(K-1)/2 logistic regression models, one per pair of classes
• Models are constructed in the usual manner
• Use majority voting of the (K – 1) votes for each class
– Provides some diversity
– Averages out errors in the individual votes
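A scikit-learn sketch comparing the two strategies on a synthetic multi-class problem; both wrap the same base logistic regression model. Note that scikit-learn's one vs. rest fits one classifier per class (K of them) rather than the K-1 with a pivot class described above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print("one vs. rest models:", len(ovr.estimators_))   # K models
print("one vs. one models :", len(ovo.estimators_))   # K(K-1)/2 models
```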
Multi-Class Logistic Regression
How else can we build a multi-class classifier?
• How many classifiers does each approach require?
Columns: number of classes; number of one vs. rest classifiers and their accuracy at a 1% per-classifier error rate; number of one vs. one classifiers and their accuracy at a 1% per-classifier error rate, with no diversity and with normally distributed errors.

Classes | OvR classifiers | OvR accuracy | OvO classifiers | OvO accuracy (no diversity) | OvO accuracy (Normal errors)
2       | 1               | 0.99         | 1               | 0.99                        | 0.99
3       | 2               | 0.98         | 3               | 0.97                        | 0.98
4       | 3               | 0.97         | 6               | 0.94                        | 0.98
10      | 9               | 0.91         | 45              | 0.64                        | 0.93
100     | 99              | 0.37         | 4950            | 0.00                        | 0.30
1000    | 999             | 0.00         | 499500          | 0.00                        | 0.00
Multi-Class Logistic Regression
Compare one vs. rest and one vs. one classifiers
Criteria | One vs. rest classifiers | One vs. one classifiers
Computational efficiency | Few models to compute | More models to compute
Error rate for fixed accuracy of individual classifier, if no diversity | Lower | Higher
Sensitivity to poor class separation | High | High
Sensitivity to imbalance in training data | High | High
Diversity in voting | No | Yes – majority of (K – 1) classifiers for each class
Evaluation of Multi-Class Classifiers
How do we evaluate multi-class classification models?
• Write the multi-class precision and recall as sums on rows and columns of
the multi-class confusion matrix
• Selectivity or Precision for class k: the diagonal element divided by its row sum
– The fraction of cases classified as class k which are correctly classified
• Sensitivity or Recall for class k: the diagonal element divided by its column sum
– The fraction of actual class k cases correctly classified
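A NumPy sketch of these row and column sums, using the confusion matrix orientation from the earlier binary slide (rows = classified, columns = actual); the matrix values are made up for illustration.

```python
import numpy as np

# Rows: classified as class k; columns: actual class k (3-class example).
conf = np.array([[50,  3,  2],
                 [ 4, 45,  6],
                 [ 1,  7, 40]])

diag = np.diag(conf)
precision_per_class = diag / conf.sum(axis=1)   # correct / all classified as that class
recall_per_class    = diag / conf.sum(axis=0)   # correct / all actual cases of that class
print(np.round(precision_per_class, 3))
print(np.round(recall_per_class, 3))
```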
Summary
Key points for this lesson
• The formulation of linear machine learning models
• Basic machine learning workflow
• Formulation of CV features for machine learning
• The relationship between bias, variance and model capacity
• Theory of binary classifiers
• Theory of multi-class classifiers
