Lecture 18
1.1 Definition
Visual tracking is the process of locating a moving object (or multiple objects) over time in a sequence.
1.2 Objective
The objective of tracking is to associate target objects and estimate target state over time in consecutive
video frames.
1.3 Applications
1.4 Challenges
Note that tracking can be a time-consuming process due to the sheer amount of data in video. In addition, tracking relies on object recognition algorithms, which can become more challenging
and prone to failure for the following reasons:
• Variations due to geometric changes like the scale of the tracked object
• Changes due to illumination and other photometric aspects
• Occlusions in the image frame
• Motion that is non-linear
• Blurry or low-resolution videos, which can cause recognition to fail
• Similar objects in the scene
2 Feature Tracking
2.1 Definition
Feature tracking is the detection of visual feature points (corners, textured areas, ...) and tracking
them over a sequence of frames (images).
What kinds of image regions can we detect easily and consistently? We need a way to measure
the “quality” of features from just a single image. Intuitively, we want to avoid smooth regions and edges.
A solution to this problem is to use Harris corners: detecting Harris corners as our key points to
track guarantees low error sensitivity.
Once we detect the features we want to track, we can use optical flow algorithms to solve our motion
estimation problem and track our features.
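As a concrete illustration (not from the original notes), the sketch below scores every pixel of a single grayscale image with the Harris corner response and keeps the strongest responses as candidate features; the function name, thresholds, and OpenCV parameters are illustrative choices.
\begin{verbatim}
# Minimal sketch: pick trackable points from one image via the Harris response.
import cv2
import numpy as np

def harris_keypoints(gray, max_pts=200, quality=0.01):
    """gray: single-channel uint8 image (hypothetical input)."""
    gray_f = np.float32(gray)
    # Harris corner response R at every pixel.
    R = cv2.cornerHarris(gray_f, blockSize=2, ksize=3, k=0.04)
    # Keep pixels whose response is a sizable fraction of the strongest one;
    # smooth regions and edges get low R, so they are rejected automatically.
    ys, xs = np.where(R > quality * R.max())
    order = np.argsort(R[ys, xs])[::-1][:max_pts]
    return np.stack([xs[order], ys[order]], axis=1)  # (N, 2) corner coordinates
\end{verbatim}
The surviving points can then be handed to an optical flow routine (e.g. pyramidal Lucas-Kanade) to obtain their positions in the next frame, as in the example of Section 2.4.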
2.4 Example
1. Detect Harris corners in the first frame.
2. For each Harris corner, use optical flow to estimate its motion to the next frame.
3. Link the resulting motion vectors over successive frames to form a track for each corner.
   • If the patch around the new point differs sufficiently from the old point, we discard
     the point.
4. Introduce new Harris points by applying the Harris detector every 10-15 frames.
5. Track new and old Harris points using steps 2-3
In the following frames from tracking videos, arrows represent the tracked motion of the Harris
corners.
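The loop below is a rough sketch of the steps above, using OpenCV's Harris-based corner detector and pyramidal Lucas-Kanade optical flow; the thresholds, window sizes, and the mean-absolute-difference patch test are illustrative choices, not the notes' exact criteria.
\begin{verbatim}
import cv2
import numpy as np

REDETECT_EVERY = 10    # step 4: add new Harris points every 10-15 frames
MAX_PATCH_DIFF = 25.0  # discard threshold on the mean absolute patch difference
HALF = 7               # half-size of the comparison patch

def patch(img, pt, r=HALF):
    x, y = int(round(pt[0])), int(round(pt[1]))
    return img[max(y - r, 0):y + r + 1, max(x - r, 0):x + r + 1].astype(np.float32)

def track(frames):
    """frames: list of grayscale uint8 images; returns the surviving points."""
    prev = frames[0]
    # Step 1: Harris corners in the first frame (assumed to contain corners).
    pts = cv2.goodFeaturesToTrack(prev, 200, 0.01, 7, useHarrisDetector=True)
    for i, cur in enumerate(frames[1:], start=1):
        kept = []
        if len(pts):
            # Step 2: track every point with Lucas-Kanade optical flow.
            nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, cur, pts, None)
            for p0, p1, ok in zip(pts.reshape(-1, 2), nxt.reshape(-1, 2),
                                  status.ravel()):
                a, b = patch(prev, p0), patch(cur, p1)
                # Step 3: drop the point if tracking failed or its patch changed.
                if ok and a.shape == b.shape and np.abs(a - b).mean() < MAX_PATCH_DIFF:
                    kept.append(p1)
        pts = np.float32(kept).reshape(-1, 1, 2)
        if i % REDETECT_EVERY == 0:  # step 4: refresh the Harris points
            new = cv2.goodFeaturesToTrack(cur, 200, 0.01, 7, useHarrisDetector=True)
            if new is not None:
                pts = np.concatenate([pts, np.float32(new)], axis=0)
        prev = cur
    return pts
\end{verbatim}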
3 2D Transformations
3.1 Types of 2D Transformations
There are several types of 2D transformations. Choosing the correct 2D transformation can depend on
the camera (e.g., placement, movement, and viewpoint) and on the objects in the scene. A number of 2D
transformations are shown in Figure 3.1. In this section we will cover three common transformations:
translation, similarity, and affine.
3.2 Translation
Figure 5: Translation
Translational motion is the motion by which a body shifts from one point in space to another. Assume
we have a simple point m with coordinates (x, y). Applying a translation motion on m shifts it from
(x, y) to (x', y'), where
\[
x' = x + b_1, \qquad y' = y + b_2 \tag{1}
\]
We can write this as a matrix transformation using homogeneous coordinates:
\[
\begin{pmatrix} x' \\ y' \end{pmatrix}
=
\begin{pmatrix} 1 & 0 & b_1 \\ 0 & 1 & b_2 \end{pmatrix}
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \tag{2}
\]
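As a quick numerical check of Eq. (2) (the point and offsets below are arbitrary example values):
\begin{verbatim}
import numpy as np

b1, b2 = 10.0, -2.0
T = np.array([[1.0, 0.0, b1],
              [0.0, 1.0, b2]])      # the 2x3 matrix of Eq. (2)
m = np.array([3.0, 4.0, 1.0])       # the point (x, y) = (3, 4) in homogeneous form
print(T @ m)                        # -> [13.  2.], i.e. (x + b1, y + b2)
\end{verbatim}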
3.4 Affine motion
Affine motion includes scaling, rotation, and translation. We can express this as the following:
\[
x' = a_1 x + a_2 y + b_1, \qquad y' = a_3 x + a_4 y + b_2 \tag{8}
\]
The affine transformation can be described with the following transformation matrix W and parameter vector p:
\[
W = \begin{pmatrix} a_1 & a_2 & b_1 \\ a_3 & a_4 & b_2 \end{pmatrix}, \qquad
p = \begin{pmatrix} a_1 & a_2 & b_1 & a_3 & a_4 & b_2 \end{pmatrix}^T \tag{9}
\]
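A small NumPy sketch of Eqs. (8)-(9): since p stacks the rows of W, the matrix is just p reshaped to 2x3 and applied to a point in homogeneous coordinates (parameter values below are arbitrary).
\begin{verbatim}
import numpy as np

p = np.array([1.1, 0.2, 5.0, -0.1, 0.9, -3.0])  # (a1, a2, b1, a3, a4, b2)
W = p.reshape(2, 3)                 # [[a1, a2, b1], [a3, a4, b2]], as in Eq. (9)
x = np.array([3.0, 4.0, 1.0])       # point (x, y) = (3, 4) in homogeneous form
print(W @ x)                        # -> [9.1  0.3], matching Eq. (8)
\end{verbatim}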
4.1 Objective
Given a video sequence, the objective is to find the sequence of transforms that maps each frame to the next.
The tracker should be able to deal with arbitrary types of motion, including object motion and camera/perspective
motion.
4.2 Approach
This approach differs from the simple KLT tracker in the way it links frames: instead of using
optical flow to link motion vectors and track motion, we directly solve for the relevant transforms
using feature data and linear approximations. This allows us to handle more complex transforms (such as
affine and projective) and to track objects more robustly.
Steps:
1. First, use Harris corner detection to find the features to be tracked.
2. For each feature at location x = [x, y]^T: choose a feature descriptor and use it to create an
   initial template for that feature (likely using nearby pixels): T(x).
3. Solve for the transform p that minimizes the error of the feature description around x_2 =
   W(x; p) (your hypothesis for where the feature's new location is) in the next frame. In other
   words, minimize
\[
\sum_x \left[ T(W(x; p)) - T(x) \right]^2
\]
4. Iteratively reapply this to link frames together, storing the coordinates of the features as the
transforms are continuously applied. This should give you a measure of how objects move
through frames.
5. Just as before, every 10-15 frames introduce new Harris corners to account for occlusion
and "lost" features.
4.3 Math
We can in fact analytically derive an approximation method for finding p (in Step 3). Assume that
you have an initial guess p_0 for p, and p = p_0 + ∆p.
Now,
\[
E = \sum_x \left[ T(W(x; p)) - T(x) \right]^2 = \sum_x \left[ T(W(x; p_0 + \Delta p)) - T(x) \right]^2
\]
But using the Taylor approximation, we see that this error term is roughly equal to:
\[
E \approx \sum_x \left[ T(W(x; p_0)) + \nabla T \frac{\partial W}{\partial p} \Delta p - T(x) \right]^2
\]
To minimize this term, we take the derivative with respect to ∆p, set it equal to 0, and solve for ∆p.
\[
\frac{\partial E}{\partial \Delta p} \approx \sum_x \left[ \nabla T \frac{\partial W}{\partial p} \right]^T \left[ T(W(x; p_0)) + \nabla T \frac{\partial W}{\partial p} \Delta p - T(x) \right] = 0
\]
\[
\Delta p = H^{-1} \sum_x \left[ \nabla T \frac{\partial W}{\partial p} \right]^T \left[ T(x) - T(W(x; p_0)) \right]
\]
where \( H = \sum_x \left[ \nabla T \frac{\partial W}{\partial p} \right]^T \left[ \nabla T \frac{\partial W}{\partial p} \right] \).
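As an illustration of how this update can be implemented for the affine warp of Eq. (8) (a forwards-additive, Lucas-Kanade-style step), the function below computes a single ∆p update. All names are assumptions; T(W(x; p)) is sampled from the next frame with nearest-neighbour interpolation, and ∂W/∂p = [[x, y, 1, 0, 0, 0], [0, 0, 0, x, y, 1]] for the parameter ordering (a1, a2, b1, a3, a4, b2).
\begin{verbatim}
import numpy as np

def lk_affine_step(template, coords, nxt, grad_x, grad_y, p):
    """
    template       : (N,) template intensities T(x)
    coords         : (N, 2) template pixel coordinates x = (x, y)
    nxt            : next frame (2D array); its samples play the role of T(W(x;p))
    grad_x, grad_y : spatial gradients of nxt, e.g. grad_y, grad_x = np.gradient(nxt)
    p              : (6,) current parameters (a1, a2, b1, a3, a4, b2)
    Returns the updated parameter vector p + dp.
    """
    a1, a2, b1, a3, a4, b2 = p
    x, y = coords[:, 0], coords[:, 1]
    # Warp the template coordinates: W(x; p).
    wx = a1 * x + a2 * y + b1
    wy = a3 * x + a4 * y + b2
    xi = np.clip(np.round(wx).astype(int), 0, nxt.shape[1] - 1)
    yi = np.clip(np.round(wy).astype(int), 0, nxt.shape[0] - 1)
    # Image values and gradients at the warped locations (nearest neighbour).
    warped = nxt[yi, xi]
    gx, gy = grad_x[yi, xi], grad_y[yi, xi]
    # Steepest-descent rows: (grad T) * dW/dp, one 6-vector per pixel.
    sd = np.stack([gx * x, gx * y, gx, gy * x, gy * y, gy], axis=1)  # (N, 6)
    H = sd.T @ sd                      # Gauss-Newton approximation of the Hessian
    error = template - warped          # T(x) - T(W(x; p0))
    dp = np.linalg.solve(H, sd.T @ error)  # H assumed invertible (textured patch)
    return p + dp
\end{verbatim}
In practice this step is iterated, p ← p + ∆p, until ∆p becomes small, which corresponds to Steps 3-4 of the approach above.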