Lecture Notes PDF
University of Cambridge
1 / 212
Lecture Topics
1. Overview. Goals of computer vision; why they are so difficult.
2. Pixel arrays, CCD / CMOS image sensors, image coding.
3. Biological visual mechanisms, from retina to visual cortex.
4. Mathematical operations for extracting structure from images.
5. Edge detection operators; gradients; zero-crossings of Laplacian.
6. Multi-resolution. Active Contours. Wavelets as primitives; SIFT.
7. Higher brain visual mechanisms; streaming; reciprocal feedback.
8. Texture, colour, stereo, and motion descriptors. Disambiguation.
9. Lambertian and specular surface properties. Reflectance maps.
10. Shape description. Codons; superquadrics and surface geometry.
11. Perceptual organisation and cognition. Vision as model-building.
12. Lessons from neurological trauma and deficits. Visual illusions.
13. Bayesian inference. Classifiers; probabilistic decision-making.
14. Model estimation. Machine learning and statistical methods.
15. Optical character recognition. Content-based image retrieval.
16. Face detection, face recognition, and facial interpretation.
2 / 212
Aims of this course:
– to introduce the principles, models and applications of computer vision,
as well as some mechanisms used in biological visual systems that might
inspire design of artificial ones. At the end of the course you should:
I understand visual processing from both “bottom-up” (data oriented)
and “top-down” (goals oriented) perspectives;
I be able to decompose visual tasks into sequences of image analysis
operations, representations, algorithms, and inference principles;
I understand the roles of image transformations and their invariances;
I describe detection of features, edges, shapes, motion, and textures;
I describe some key aspects of how biological visual systems work;
I consider ways to try to implement biological visual strategies in
computer vision, despite the enormous differences in hardware;
I be able to analyse the robustness, brittleness, generalisability, and
performance of different approaches in computer vision;
I understand roles of machine learning in computer vision, including
probabilistic inference, discriminative and generative methods;
I understand in depth at least one major vision application domain,
such as face detection, recognition, or interpretation.
3 / 212
Online resources and recommended books
4 / 212
1. Examples of computer vision applications and goals:
6 / 212
(some computer vision applications and goals, con’t)
I 3D assessment of tissue and organs from non-invasive scanning
I automated medical image analysis, interpretation, and diagnosis
8 / 212
(some computer vision applications and goals, con’t)
I robotic manufacturing: manipulation and assembly of parts
I agricultural robots: weeding, harvesting, and grading of produce
9 / 212
(some computer vision applications and goals, con’t)
I anomaly detection; event detection; automated surveillance and
security screening of passengers at airports
10 / 212
1(b). Why the goals of computer vision are so difficult
In many respects, computer vision is an “AI-complete” problem.
Building general-purpose vision machines would entail, or require,
solutions to most of the general goals of artificial intelligence:
11 / 212
(Why the goals of computer vision are so difficult, con’t)
Although vision seems like such an effortless, immediate faculty for
humans and other animals, it has proven to be exceedingly difficult
to automate. Some of the reasons for this include the following:
14 / 212
(Why the goals of computer vision are so difficult, con’t)
Extracting and magnifying the lower-left corner of the previous image
(capturing most of the body of the fourth fox, minus its head) illustrates
the impoverished limits of a purely “data-driven, bottom-up” approach.
I How can edge detection algorithms find and trace this fox’s outline?
Simple methods would meander, finding nonsense edges everywhere.
I Even for humans this is difficult. “Top-down” guidance based on the
entire image is needed, allowing the use of prior knowledge about
the nature of the world and of the things that may populate it.
I Model-driven vision can drive image parsing by setting expectations.
Maybe the three central foxes with their distinctive heads are critical.
15 / 212
(Why the goals of computer vision are so difficult, con’t)
The image of foxes was intentionally noisy, grainy, and monochromatic,
in order to highlight how remarkable is the fact that we (humans) can
easily process and understand the image despite such impoverished data.
How can there possibly exist mathematical operators for such an image
that can, despite its poor quality:
I perform the figure-ground segmentation of the scene (into its
objects, versus background clutter)
I infer the 3D arrangements of objects from their mutual occlusions
I infer surface properties (texture, colour) from the 2D image statistics
I infer volumetric object properties from their 2D image projections
I and do all of this in “real time?” (This matters quite a lot in the
natural world, “red in tooth and claw”, since survival depends on it.)
Here is a video demo showing that computer vision algorithms can infer
3D world models from 2D (single) images, and navigate within them:
http://www.youtube.com/watch?v=VuoljANz4EA .
16 / 212
(Why the goals of computer vision are so difficult, con’t)
Consider now the actual image data of a face, shown as a pixel array with
greyscale value plotted as a function of the (x, y) pixel coordinates. Can you
see the face in this image, or even segment the face from its background,
let alone recognise the face? In this format, the image reveals both the
complexity of the problem and the poverty of the signal data.
17 / 212
(Why the goals of computer vision are so difficult, con’t)
This “counsel of despair” can be given a more formal statement:
18 / 212
(Why the goals of computer vision are so difficult, con’t)
I inferring a 3D shape unambiguously from a 2D line drawing:
19 / 212
I interpreting the mutual occlusions of objects, and stereo disparity
For a chess-playing robot, the task of visually identifying an actual chess
piece in 3D (e.g. a knight, with pose-invariance and “design-invariance”)
is a much harder problem than playing chess! (The latter problem was
solved years ago, and chess-playing algorithms today perform at almost
superhuman skill levels; but the former problem remains barely solved.)
...but enough counsel of despair. Let us begin with understanding what
an image array is.
22 / 212
Data in video streams
Composite video uses a high-frequency “chrominance burst” to encode
colour; or in S-video there are separate “luma” and “chroma” signals; or
there may be separate RGB colour channels. Colour information requires
much less resolution than luminance; some coding schemes exploit this.
A framegrabber or a strobed sampling block in a digital camera contains
a high-speed analogue-to-digital converter which discretises this video
signal into a byte stream, making a succession of frames.
Conventional video formats include NTSC (North American standard):
640×480 pixels, at 30 frames/second (actually there is an interlace of
alternate lines scanned out at 60 “fields” per second); and PAL
(European, UK standard): 768×576 pixels, at 25 frames/second.
Note what a vast flood of data is a video stream, even without HDTV:
768×576 pixels/frame × 25 frames/sec = 11 million pixels/sec. Each
pixel may be resolved to 8 bits in each of the three colour planes, hence
24×11 million = 264 million bits/sec. How can we possibly cope with
this data flux, let alone understand the objects and events it encodes?
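As a quick check of this arithmetic, here is a minimal Python sketch of the raw PAL data rate quoted above (no external libraries; the numbers are exactly those in the text):

# Raw (uncompressed) data rate of a PAL colour video stream.
width, height = 768, 576              # pixels per frame (PAL)
fps = 25                              # frames per second
bits_per_pixel = 3 * 8                # 8 bits in each of the three colour planes

pixels_per_second = width * height * fps
bits_per_second = pixels_per_second * bits_per_pixel

print(f"{pixels_per_second / 1e6:.1f} million pixels/sec")   # about 11.1 million
print(f"{bits_per_second / 1e6:.0f} million bits/sec")       # about 265 million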
23 / 212
Image formats and sampling theory
Images are represented as rectangular arrays of numbers (1 byte each),
sampling the image intensity at each pixel position. A colour image may
be represented in three separate such byte arrays called “colour planes”,
containing red, green, and blue components as monochromatic images.
An image with an oblique edge within it might include this array:
0 0 0 1 1 0
0 0 1 2 10 0
0 1 2 17 23 5
0 3 36 70 50 10
1 10 50 90 47 12
17 23 80 98 85 30
There are many different image formats used for storing and transmitting
images in compressed form, since raw images are large data structures
that contain much redundancy (e.g. correlations between nearby pixels)
and thus are highly compressible. Different formats are specialised for
compressibility, manipulability, or for properties of browsers or devices.
24 / 212
Examples of image formats and encodings
I .jpeg - for compression of continuous-tone and colour images, with
a controllable “quality factor”. Tiles of Discrete Cosine Transform
(DCT) coefficients are quantised, with frequency-dependent depth.
I .jpeg2000 - a superior version of .jpeg implemented with smooth
Daubechies wavelets to avoid block quantisation artifacts.
I .mpeg - a stream-oriented, compressive encoding scheme used
for video and multimedia. Individual image frames are .jpeg
compressed, but an equal amount of temporal redundancy is
removed by inter-frame predictive coding and interpolation.
I .gif - for sparse binarised images; 8-bit colour. Very compressive;
favoured for websites and other bandwidth-limited media.
I .png - using lossless compression, the portable network graphic
format supports 24-bit RGB.
I .tiff - A complex umbrella class of tagged image file formats.
Non-compressive; up to 24-bit colour; randomly embedded tags.
I .bmp - a non-compressive bit-mapped format in which individual
pixel values can easily be extracted. Non-compressive.
25 / 212
(Image formats and sampling theory, con’t)
26 / 212
(Image formats and sampling theory, con’t)
I How much information does an image contain? Bit count does not
relate to optical properties, nor to frequency analysis.
I Nyquist’s Sampling Theorem says that the highest spatial frequency
component of information contained in an image equals one-half the
sampling density of the pixel array.
I Thus a pixel array with 640 columns can represent spatial frequency
components of image structure no higher than 320 cycles/image.
I Likewise, if image frames are sampled in time at 30 per second, then
the highest temporal frequency component of information contained
within a moving sequence is 15 Hertz.
I Increasingly, more complex sensors called RGB-D sensors capture
colour as well as depth information for purposes such as human
activity recognition, tracking, segmentation, and 3D reconstruction
from RGB-D data.
27 / 212
Using second-order pixel statistics to assist segmentation
I Low-level local statistical metrics can be useful for segmentation
(dividing an image into meaningful regions). For example, in the
NIR band (700nm – 900nm) used for iris imaging, it can be difficult
to detect the boundary between the eye’s sclera and the eyelid skin.
I But after computing pixel variance and mean in local (4×4) patches,
imaging their ratio sets the eyelid boundaries (eyelashes) “on FIRE”.
This helps with eye-finding because of the distinctive, iconic, orifice.
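A minimal NumPy sketch of that statistic, computed over non-overlapping 4×4 patches (the variable nir is an assumed greyscale NIR eye image; the small epsilon is only there to avoid division by zero):

import numpy as np

def variance_to_mean_ratio(img, patch=4):
    # Ratio of local variance to local mean over non-overlapping patch x patch tiles.
    h, w = img.shape
    h, w = h - h % patch, w - w % patch                  # crop to a multiple of the patch size
    tiles = img[:h, :w].reshape(h // patch, patch, w // patch, patch)
    mean = tiles.mean(axis=(1, 3))
    var = tiles.var(axis=(1, 3))
    return var / (mean + 1e-6)

# nir = ... greyscale NIR image as a float array ...
# ratio_map = variance_to_mean_ratio(nir)   # large values highlight the eyelash/eyelid boundary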
28 / 212
3. Biological visual mechanisms: retina to visual cortex
29 / 212
Low-level neurobiological mechanisms
I No artificial ‘general purpose’ vision system has yet been built.
I Natural vision systems abound. What can we learn from visual
neuroscience, despite the enormous differences in hardware?
30 / 212
Wetware
I Neurones are sluggish but richly interconnected cells having both
analogue and discrete aspects, with nonlinear, adaptive features.
I Fundamentally they consist of an enclosing membrane that can
separate electrical charge, so a voltage difference generally exists
between the inside and outside of a neurone.
I The membrane is a lipid bilayer that has a capacitance of about
10,000 µFarads/cm^2, and it also has pores that are differentially
selective to different ions (mainly Na+ , K+ , and Cl− ). Seawater
similarity: man came from the ocean, and his blood remains salty.
I These ion species cross the neural membrane through protein pores
acting as discrete conductances (hence as resistors).
I The resistors for Na+ and K+ have the further crucial property that
their resistance is not constant, but voltage-dependent.
I As more positive ions (Na+ ) flow into the neurone, the voltage
becomes more positive on the inside, and this further reduces the
membrane’s resistance to Na+ , allowing still more to enter.
I This catastrophic breakdown in resistance to Na+ constitutes a
nerve impulse. Within about a msec a slower but opposite effect
involving K+ restores the original trans-membrane voltage.
31 / 212
(Wetware, con’t)
I After a refractory period of about 2 msec to restore electro-osmotic
equilibrium, the neurone is ready for action again.
I Nerve impulses propagate down axons at speeds of about 100 m/sec.
I Impulse signalling can be described as discrete, but the antecedent
summations of current flows into a neurone from other neurones at
synapses, triggering an impulse, are essentially analogue events.
I In general, neural activity is fundamentally asynchronous: there is no
master clock on whose edges the events occur.
I Impulse generation time prevents “clocking” faster than ∼ 300 Hz
(– about 10 million times slower than the 3 GHz clock in your PC.)
I Balanced against this sluggishness is massive inter-connectivity:
I Typically in brain tissue, there are about 10^5 neurones / mm^3.
I Each has 10^3 − 10^4 synapses with other neurones within ∼ 3 cm.
I Thus brain tissue has about 3 kilometers of “wiring” per mm^3!
I Not possible to distinguish between processing and communications,
as we do in Computer Science. They are inseparable.
32 / 212
I Human brain has about 10^11 neurones, making about 10^15 synapses.
I About 2/3rds of the brain receives visual input; we are fundamentally
visual creatures. There are at least 30 different visual areas, with
reciprocal connections, of which the primary visual cortex in the
occipital lobe at the rear has been the most extensively studied.
33 / 212
The retina
I The mammalian eye is formed as an extruded ventricle of the brain.
I The retina is about 1 mm thick and it contains about 120 million
light-sensitive photoreceptors, of which only 6 million are cones
(with photopigments specialised for red, green, or blue wavelengths).
The rest are rods which do not discriminate in wavelength bands.
I The visible spectrum of light has wavelength range 400nm - 700nm.
I Rods are specialised for much lower light intensities. They subserve
our “night vision” (hence the absence of perceived colour at night),
and they pool their responses, at the cost of spatial resolution.
I Cones exist primarily near the fovea, in about the central 20◦ where
their responses are not pooled, giving much higher spatial resolution.
I As cones function only at higher light levels, we really have a dual
system with two barely overlapping sensitivity ranges.
I The total dynamic range of human vision (range of light intensities
that can be processed) is a staggering 10^11 to 1. At the lowest level,
we can reliably “see” individual photons (i.e. reliably have a visual
sensation when at most a few photons reach the retina in a burst).
34 / 212
Distributions of photoreceptors
(Figure: distribution of photoreceptors across the retina as a function of perimetric angle relative to the visual axis, and their response as a function of wavelength, 400–680 nm.)
35 / 212
Phototransduction and colour separation
I The most distal neurones in the retina are analogue devices.
I Photoreceptors do not generate impulses but respond to absorption
of photons by hyperpolarisation (increased trans-membrane voltage).
I This happens because in the photo-chemical isomerisation reaction,
11-cis-retinal + hν → all-trans-retinal, a carbon double-bond simply
flips from cis to trans, and this causes a pore to close to Na+ ions.
I As Na+ ions are actively pumped (the Na+ “dark current”), this
increased resistance causes an increased trans-membrane voltage.
I Voltage change is sensed synaptically by bipolar and horizontal cells.
I The three colour-selective classes of cones have cis-retinal embedded
in different opsin molecules. These quantum-mechanically affect the
probability of photon capture as a function of wavelength λ = c/ν.
37 / 212
Retina: not a sensor, but a part of the brain
I Far from acting as a camera, the retina is actually part of the brain.
I Note that there are 120 million photoreceptors, but only 1 million
“output channels” (the axons of the ganglion cells which constitute
the fibres of the optic nerve, sending coded impulses to the brain).
I Actual retinal cross-sections above (with fluorescent dyes and stains)
reveal some of the complexity of retinal networks.
I Already at its first synapse, the retina is performing a lot of spatial
image processing, with temporal processing at the second synapse.
38 / 212
Lateral and longitudinal signal flows in the retina
I The retina is a multi-layered network, containing three nuclear layers
(of neurones) and two plexiform layers (synaptic interconnections).
I Paradoxically, the photoreceptors are at the rear, so light must first
travel through all of the rest of the retina before being absorbed.
I There are two orthogonal directions of signal flow in the retina:
longitudinal (photoreceptors → bipolar cells → ganglion cells); and
lateral (horizontal and amacrine cells, outer/inner plexiform layers).
39 / 212
Centre-surround opponent spatial processing in the retina
40 / 212
“See” your own retinal centre-surround operators!
I There actually are no dark circles at the intersections in this grid.
I But just move your eyes around the image, and you will see brief
round flashes of illusory darkness at those positions, thanks to
surround inhibition in the receptive fields of your retinal neurones.
41 / 212
Summary of image processing and coding in the retina
I Sampling by photoreceptor arrays, with pooling of signals from rods
I Both convergence (“fan-in”) and divergence (“fan-out”) of signals
I Spatial centre-surround comparisons implemented by bipolar cells
(direct central input from photoreceptors, minus surround inhibition
via horizontal cells, in an annular structure having either polarity)
I Temporal differentiation by amacrine cells, for motion detection
I Separate channels for sustained versus transient image information
by different classes of ganglion cells (parvo-cellular, magno-cellular)
I Initial colour separation by “opponent processing” mechanisms
(yellow versus blue; red versus green) sometimes also coupled with
a spatial centre-surround structure, termed “double opponency”
43 / 212
Brain projections and visual cortical architecture
I The right and left visual fields project to different brain hemispheres.
44 / 212
Visual splitting and cortical projections
I You actually have two quasi-independent brains, not one.∗
I The optic nerve from each eye splits into two at the optic chiasm.
I The portion from the nasal half of each retina crosses over to project
only to the contralateral (opposite side) brain hemisphere.
I The optic nerve portion bearing signals from the temporal half of
each eye projects only to the ipsilateral (same side) brain hemisphere.
I Therefore the left-half of the visual world (relative to gaze fixation)
is directly seen only by the right brain, while the right-half of the
visual world is directly seen only by the left brain.
I It is not unreasonable to ask why we don’t see some kind of “seam”
going down the middle...
I Ultimately the two brain hemispheres share all of their information
via a massive connecting bundle of 500 million commissural fibres
called the corpus callosum.
∗
Commissurotomy, a radical surgical “last resort” for epilepsy patients in the 1960s, separated the
hemispheres and allowed two “minds” to exist. [C.E. Marks, Commissurotomy..., MIT Press, 1986.]
45 / 212
What is the thalamus doing with all that feedback?
I The projections to each visual cortex first pass to the 6-layered
lateral geniculate nucleus (LGN), in a polysensory organ of the
midbrain called the thalamus.
I It is an intriguing fact that this “relay station” actually receives
three times more descending (efferent) fibres projecting back down
from the cortex than it receives ascending (afferent) fibres from the eyes.
I Could it be that this signal confluence compares cortical feedback
representing hypotheses about the visual scene, with the incoming
retinal data, in a kind of predictive coding or hypothesis testing
operation? We will return to this theory later.
I Several scientists have proposed that “vision is graphics” (i.e. what
we see is really our own internally generated 3D graphics, modelled
to fit the 2D retinal data, with the model testing and updating
occurring here in the thalamus via this cortical feedback loop).
46 / 212
Interweaving data from the two eyes for stereo vision
I The right-eye and left-eye innervations from each LGN to the
primary visual cortex in the occipital lobe of that hemisphere are
interwoven into “slabs,” or columns, in which neurones receive input
primarily from just one of the eyes. Right and left eyes alternate.
I These ocular dominance columns have a cycle of about 1 mm and
resemble fingerprints in scale and flow (see radiograph below).
I Clearly each hemisphere is trying to integrate together the signals
from both eyes in a way suitable for stereoscopic vision, computing
the relative retinal disparities of corresponding points in the images.
I The disparities reflect the relative positions of the points in depth,
as we will study later in stereo algorithms.
47 / 212
New tuning variable in visual cortex: orientation selectivity
I Orthogonally to the ocular dominance columns in the visual cortical
architecture, there runs a finer scale sequence of orientation columns.
I Neurones in each such column respond only to image structures
(such as bars or edges) in a certain preferred range of orientations.
I Their firing rates (plotted as nerve impulses per second) reveal their
selectivity for stimulus orientation: a “tuning curve”:
48 / 212
Origin of cortical orientation selectivity
I Orientation selectivity might arise from the alignment of isotropic
subunits in the LGN, summated together in their projection to the
primary visual cortex (V1). Both “on” and “off” polarities exist.
49 / 212
“Hypercolumns”
I A 3D block of about 100,000 cortical V1 neurones that includes one
“right/left” cycle of ocular dominance columns, and (orthogonally
organised) about ten orientation columns spanning 360◦ of their
preferred orientations in discrete steps, is called a “hypercolumn”.
I In the third dimension going down, there are six layers in which
neurones vary mainly in the sizes of their receptive fields.
I This block occupies approximately 2 mm^3 of cortical tissue, and it
contains the “neural machinery” to process about a 1◦ patch in the
foveal area of visual space, or about a 6◦ patch in the periphery.
50 / 212
Quadrature phase relationships among paired V1 neurones
I Recording action potentials from two distinct adjacent neurones
simultaneously in the visual cortex of the cat, using a kind of
“double-barrelled” micro-electrode, showed that adjacent simple cells
tuned to the same spatial frequency and orientation tend to differ in
spatial phase by approximately 90°.
I This quadrature phase relationship suggests that such paired cells
represent sine and cosine filters (even and odd symmetry) in their
processing of afferent spatial inputs.
(Figure: excerpt from the original report, with paired recordings and response histograms of 60–150 spikes per second at spatial frequencies of 0.31–0.88 cycles/deg, over 1-second sweeps.)
51 / 212
Summary of spatial image encoding in primary visual cortex
I There seem to be five main “degrees of freedom” in the spatial
structure of cortical receptive field profiles: position in visual space
(two coordinates), orientation preference, receptive field size, and
phase (even or odd symmetry).
I These parameters can be inferred from the boundaries between the
excitatory and inhibitory regions, usually either bipartite or tripartite.
I Plotting how much a neurone is excited or inhibited by light as a
detailed function of stimulus coordinates within its receptive field,
extracts its 2D receptive field profile.
I For about 97% of such neurones studied, these receptive field
profiles could be well described as 2D Gabor wavelets (or phasors).
I In the next slide, several examples of empirically measured profiles
are shown in the top row; an ideal theoretical form of each such 2D
Gabor wavelet (to be defined later) is shown in the middle row; and
the difference between these two functions in the bottom row.
I The differences are statistically insignificant. So, it seems the brain’s
visual cortex discovered during its evolution the valuable properties
of such 2D wavelets for purposes of image coding and analysis!
52 / 212
Cortical encoding of image structure by 2D Gabor wavelets
Residuals
53 / 212
Historical comment
54 / 212
4. Mathematical image operations
I Almost all image processing begins with (2D) convolutions of an
image with small kernel arrays designed for specific purposes.
I Examples include: edge detection, filtering, feature extraction,
motion detection, keypoint identification, texture classification,...
I Conceptual unity: convolution ⇔ filtering ⇔ Fourier operation
I Even differential operators, such as taking derivatives to find edges,
are implemented as convolutions with simple Fourier interpretations.
I Example: applying the Laplacian operator (sum of the image second
derivatives in the vertical and horizontal directions) is equivalent to
simply multiplying the Fourier transform of the image by an isotropic
paraboloid: it is just a type of high-pass filtering.
I Equivalence between convolutions and (computationally simpler)
Fourier domain operations make it faster to perform convolutions in
the Fourier domain if the kernel chosen for the purpose is larger than
(5 × 5), because of the huge efficiency of the Fast Fourier Transform
(FFT) and the fact that convolution is replaced by multiplication.
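A small SciPy sketch of that equivalence (the image and kernel are arbitrary random test arrays; scipy.signal offers both a direct and an FFT-based convolution):

import numpy as np
from scipy.signal import convolve2d, fftconvolve

rng = np.random.default_rng(0)
image = rng.random((256, 256))
kernel = rng.random((9, 9))            # larger than 5 x 5, where the FFT route typically wins

direct = convolve2d(image, kernel, mode="same", boundary="fill")   # spatial-domain convolution
via_fft = fftconvolve(image, kernel, mode="same")                  # multiplication in the Fourier domain

print(np.allclose(direct, via_fft))    # True: the two routes give the same result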
55 / 212
The Fourier perspective on images
I It is therefore useful to regard an image as a superposition of many
2D Fourier components, which are complex exponential plane-waves
having the form: f (x, y) = e^{iπ(µx+νy)}, with complex coefficients.
I Their parameters (µ, ν) can be interpreted as the 2D spatial frequency
√(µ² + ν²) and orientation tan⁻¹(ν/µ) of the plane-wave.
I Adding together a conjugate pair of them makes a real-valued wave.
I Different images simply have different amplitudes (contrasts) and
phases associated with the same universal set of Fourier components.
I Convolutions (filtering operations) just manipulate those amplitudes
and phases, as a function of 2D spatial frequency and orientation.
56 / 212
Convolution Theorem for two-dimensional functions
Let function f (x, y ) have 2D Fourier Transform (2DFT) F (µ, ν), and let
function g (x, y ) have 2DFT G (µ, ν). The convolution of f (x, y ) with
g (x, y ), which is denoted f ∗ g , combines these two functions to generate
a third function h(x, y ) whose value at location (x, y ) is equal to the 2D
integral of the product of the functions f and g after one is flipped and
undergoes a relative shift by amount (x, y ):
h(x, y) = ∫∫ f (α, β) g (x − α, y − β) dα dβ
58 / 212
result(i, j) = Σ_m Σ_n kernel(m, n) · image(i − m, j − n)
Pseudo-code for explicit image convolution with a kernel:
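The sketch below is a plain-Python version of that double sum, assuming zero padding outside the image borders.

def convolve(image, kernel):
    # result(i, j) = sum_m sum_n kernel(m, n) * image(i - m, j - n)
    H, W = len(image), len(image[0])
    M, N = len(kernel), len(kernel[0])
    result = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for m in range(M):
                for n in range(N):
                    y, x = i - m, j - n
                    if 0 <= y < H and 0 <= x < W:   # zero padding outside the image
                        acc += kernel[m][n] * image[y][x]
            result[i][j] = acc
    return result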
60 / 212
Differentiation Theorem
Computing derivatives of an image f (x, y ) is equivalent to multiplying its
2DFT, F (µ, ν), by the corresponding spatial frequency coordinate (× i)
raised to a power equal to the order of differentiation:
(∂^m/∂x^m)(∂^n/∂y^n) f (x, y)   =2DFT⇒   (iµ)^m (iν)^n F (µ, ν)
61 / 212
5. Edge detection
Whether edges are straight, curved, or forming closed boundary contours,
they are very informative for several reasons:
I Edges demarcate the boundaries of objects, or of material properties.
I Objects have parts, which typically make edges where they join.
I The three-dimensional distribution of objects in a scene usually
generates occlusions of some objects by other objects, and these
form occlusion edges which reveal the geometry of the scene.
I Edges can be generated in more abstract domains than luminance.
For example, if some image property such as colour, or a textural
signature, or stereoscopic depth, suddenly changes, it constitutes
an “edge” which is very useful for that domain.
I Aligning edges is a way to solve the stereo correspondence problem.
I A correspondence problem exists also for frames displaced in time.
I Velocity fields, containing information about object trajectories, can
be organized and understood by the movements of edges. Motions
of objects generate velocity discontinuities at their boundaries.
64 / 212
First finite difference operators for detecting edges
convolution with [−1, 1] ⇐= ORIGINAL =⇒ convolution with [−1, 1]T
67 / 212
Problem with noise and clutter in edge detection
Unfortunately, object boundaries of interest are sometimes fragmented,
and can have spurious “clutter” edge points. These problems are not
solved, but traded-off, by applying a threshold to the gradient magnitude:
68 / 212
Humans perform better at image segmentation
69 / 212
Combining 2nd-order differential operators with smoothing
An alternative to the gradient vector field is a second-order differential
operator, combined with smoothing at a specific scale of analysis. An
example of a 2D kernel based on the second finite difference operator is:
-1 2 -1
-1 2 -1
-1 2 -1
Clearly, such an operator will detect edges only in a specific orientation.
It is integrating in the vertical direction, and taking a second derivative
horizontally. In comparison, an isotropic operator such as the Laplacian
(sum of 2nd derivatives in two orthogonal orientations) has no preferred
orientation; that is the meaning of isotropy. A discrete approximation to
the Laplacian operator ∇2 (no smoothing) in just a small (3 x 3) array is:
-1 -2 -1
-2 12 -2
-1 -2 -1
Notice how each of these simple (3 x 3) operators sums to zero when all
of their elements are combined together. Therefore they give no response
to uniform illumination, but respond only to actual image structure.
70 / 212
Scale-specific edge operator: Laplacian of a Gaussian
A popular second-order differential operator for detecting edges at a
specific scale of analysis, with a smoothing parameter σ, is ∇2 Gσ (x, y ):
For a parameterised Gaussian form Gσ (x, y) = (1/(2πσ²)) e^{−(x²+y²)/2σ²}, we have

∇²Gσ (x, y) = (∂²/∂x² + ∂²/∂y²) Gσ (x, y) = ((x² + y² − 2σ²)/(2πσ⁶)) e^{−(x²+y²)/2σ²}
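A short NumPy sketch that samples ∇²Gσ on a grid to build a convolution kernel (the 3σ half-width is an arbitrary choice):

import numpy as np

def laplacian_of_gaussian(sigma, halfwidth=None):
    # Sampled Laplacian-of-Gaussian kernel, using the closed form above.
    if halfwidth is None:
        halfwidth = int(3 * sigma)                    # cover roughly three standard deviations
    y, x = np.mgrid[-halfwidth:halfwidth + 1, -halfwidth:halfwidth + 1].astype(float)
    r2 = x**2 + y**2
    log = (r2 - 2 * sigma**2) / (2 * np.pi * sigma**6) * np.exp(-r2 / (2 * sigma**2))
    return log - log.mean()                           # force zero response to uniform illumination

# kernel = laplacian_of_gaussian(sigma=2.0)
# Edges at this scale lie along the zero-crossings of the image convolved with this kernel.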
71 / 212
Why specify a scale of analysis for edge detection?
I Edges in images are defined at different scales: some transitions in
brightness are gradual, others very crisp. Importantly, at different
scales of analysis, different edge structure emerges.
I Example: an image of a leopard that has been low-pass filtered
(or analyzed at a coarse scale) has edge outlines corresponding to
the overall form of its body.
I At a somewhat finer scale of analysis, image structure may be
dominated by the contours of its “spots.” At a still finer scale,
the relevant edge structure arises from the texture of its fur.
I In summary, non-redundant structure exists in images at different
scales of analysis (or if you prefer, in different frequency bands).
I The basic recipe for extracting edge information from images is to
use a multi-scale family of filters as the image convolution kernels.
I One approach is to apply a single filter to successively downsampled
copies of the original image. A Laplacian pyramid thereby extracts
image structure in successive octave bands of spatial frequencies.
72 / 212
Different image structure at different scales of analysis
75 / 212
Canny edge operator
A computationally more complex approach to edge detection was
developed by Canny, to avoid the spurious edge clutter seen earlier.
It is popular because it is better able to distinguish real edges that
correspond to actual object boundaries.
The Canny edge operator has five main steps (two discussed earlier):
1. Smooth the image with a Gaussian filter to reduce noise.
~ (x, y ) over the image.
2. Compute the gradient vector field ∇I
3. Apply an “edge thinning” technique, non-maximum suppression,
to eliminate spurious edges. A given edge should be represented
by a single point, at which the gradient is maximal.
4. Apply a double threshold to the local gradient magnitude, resulting
in three classes of edge data, labelled strong, weak, or suppressed.
The threshold values are adaptively determined for a given image.
5. Impose a connectivity constraint: edges are “tracked” across the
image; edges that are weak and not connected to strong edges are
eliminated.
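These five steps are packaged in common libraries; a hedged sketch with OpenCV (the filename and the two hysteresis thresholds are illustrative assumptions, not the adaptive values the slide mentions):

import cv2

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)       # assumed input image
blurred = cv2.GaussianBlur(img, (5, 5), sigmaX=1.4)        # step 1: Gaussian smoothing
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)  # steps 2-5: gradient, non-maximum
                                                           # suppression, double threshold, tracking
cv2.imwrite("edges.png", edges)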
76 / 212
Cleaner results using the Canny edge detector
77 / 212
6. Multiscale wavelets for image analysis; active contours
An effective method to extract, represent, and analyse image structure is
to compute its 2D Gabor wavelet coefficients.
78 / 212
(2D Gabor wavelets, con’t)
Two-dimensional Gabor wavelets have the functional form:
f (x, y) = e^{−[(x−x0)²/α² + (y−y0)²/β²]} e^{−i[u0(x−x0) + v0(y−y0)]}
(Figure: a 2D Gabor wavelet plotted over position in degrees, and its 2D Fourier transform plotted over spatial frequency in cycles per degree (CPD).)
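A minimal NumPy sketch of that functional form, sampled on a grid (the parameter values in the usage comment are arbitrary):

import numpy as np

def gabor_wavelet(size, x0, y0, alpha, beta, u0, v0):
    # Complex 2D Gabor wavelet: Gaussian envelope times a complex plane-wave carrier.
    y, x = np.mgrid[0:size, 0:size].astype(float)
    envelope = np.exp(-((x - x0)**2 / alpha**2 + (y - y0)**2 / beta**2))
    carrier = np.exp(-1j * (u0 * (x - x0) + v0 * (y - y0)))
    return envelope * carrier

# g = gabor_wavelet(size=64, x0=32, y0=32, alpha=8, beta=8, u0=0.5, v0=0.0)
# g.real and g.imag form the even- and odd-symmetric (quadrature) pair.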
80 / 212
(2D Gabor wavelets, con’t)
With parameterisation for dilation, rotation, and translation, such 2D
wavelets can form a complete and self-similar basis for representing and
analysing the structure in images.
Here are examples of a wavelet family codebook having five sizes, by
factors of two (thus spanning four octaves), six orientations in 30 degree
increments, and two phases, over a lattice of positions.
81 / 212
(2D Gabor wavelets, con’t)
Self-similarity is reflected in using a generating function. If we take
Ψ(x, y ) to be some chosen generic 2D Gabor wavelet, then we can
generate from this one member, or “mother wavelet”, the self-similar
family of daughter wavelets through the generating function
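One standard way of writing such a generating function (a sketch; the slide's exact parameterisation may differ) dilates the mother wavelet by octaves m, rotates it by θ, and translates it to lattice position (p, q):

Ψ_mpqθ(x, y) = 2^(−2m) Ψ(x′, y′), where
x′ = 2^(−m) [ x cos θ + y sin θ ] − p
y′ = 2^(−m) [ −x sin θ + y cos θ ] − q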
82 / 212
(2D Gabor wavelets, con’t)
The completeness of 2D Gabor wavelets as a basis for image analysis
can be shown by reconstructing a facial image from them, in stages.
Reconstruction of Lena: 25, 100, 500, and 10,000 Two-Dimensional Gabor Wavelets
83 / 212
A “philosophical” comment about Gabor wavelets
I Aristotle defined vision as “knowing what is where.” We have noted
the optimality of 2D Gabor wavelets for simultaneously extracting
structural (“what”) and positional (“where”) information.
I Thus if we share Aristotle’s goal for vision, then we cannot do better
than to base computer vision representations upon these wavelets.
I Perhaps this is why mammalian visual systems appear to have
evolved their use. Currently this is the standard model for how the
brain’s visual cortex represents the information in the retinal image.
I The 2D Gabor framework has also become ubiquitous in Computer
Vision, not only as the “front-end” representation but also as a
general toolkit for solving many practical problems. Thus we have
seen the migration of an idea from neurobiology into mainstream
engineering, mathematical computing, and artificial intelligence.
(Figure: block diagram of 2D Gabor phasor modules applied to a projected image I(x, y), with outputs including Q(x, y), A(x, y), and θ(x, y).)
86 / 212
Detection of facial features using quadrature wavelets
Left panel: original image. Right panel (clockwise from top left): real part after 2D Gabor wavelet
convolution; imaginary part; modulus; and modulus superimposed on the original (faint) image,
illustrating feature localisation.
87 / 212
Edge detection and selection constrained by shape models
I Integro-differential operators for edge detection can be constrained
so that they find only certain specified families of boundary shapes.
I By computing derivatives of contour integrals along shaped paths,
it is possible to find (say) only circular or parabolic boundary shapes.
88 / 212
Parameterised edge selection by voting; Hough transform
I The white boundaries isolated in the previous slide are the curves
whose parameters were found to maximise the blurred derivatives,
with respect to increasing radius, of contour (“path”) integrals:
arg max_(r, x0, y0)  Gσ (r) ∗ (∂/∂r) ∮_(r, x0, y0) I (x, y)/(2πr) ds
89 / 212
Active contours for boundary descriptors
I Detection of edges and object boundaries within images can be
combined with constraints that control some parameters of
admissibility, such as the shape of the contour or its “stiffness,”
or the scale of analysis that is being adopted.
I These ideas have greatly enriched the old subject of edge detection,
whilst also enabling the low-level operators we have considered so far
to be directly integrated with high-level goals about shape, such as
geometry, complexity, classification and smoothness, and also with
theory of evidence and data fusion.
I The image of the eye (next slide) contains three active contours:
two defining the inner and outer boundaries of the iris, and one
defining the boundary between the iris and the lower eyelid. These
must be accurately localised in order for the biometric technology of
iris recognition to work.
I Evidence about the local edge structure is integrated with certain
constraints on the boundary’s mathematical form, to get a “best fit”
that minimises some energy function or other “cost” function.
90 / 212
(Active contours for boundary descriptors, con’t)
Active contours are deformable yet constrained shape models.
The “snakes” in the box show radial edge gradients at the iris
boundaries, and active contour approximations (dotted curves).
91 / 212
(Active contours for boundary descriptors, con’t)
Demonstration: https://www.youtube.com/watch?v=ceIddPk78yA
92 / 212
Scale-Invariant Feature Transform (SIFT)
Goals and uses of SIFT:
I Object recognition with geometric invariance to transformations in
perspective, size (distance), position, and pose angle
I Object recognition with photometric invariance to changes in
imaging conditions like brightness, exposure, quality, wavelengths
I Matching corresponding parts of different images or objects
I “Stitching” overlapping images into a seamless panorama
I 3D scene understanding (despite clutter)
I Action recognition (what transformation has happened...)
93 / 212
(Scale-Invariant Feature Transform, con’t)
The goal is to estimate a homography: to find the rotation, translation,
and scale parameters that best relate the contents of two image frames.
I Various kinds of feature detectors can be used, but they should have
an orientation index and a scale index
I Classic approach of Lowe used extrema (maxima and minima) of
difference-of-Gaussian functions in scale space
I Build a Gaussian image pyramid in scale space by successively
smoothing (at octave blurring scales σ_i = σ_0 2^i) and resampling
I Dominant orientations of features, at various scales, are detected
and indexed by oriented edge detectors (e.g. gradient direction)
I Low contrast candidate points and edges are discarded
I The most stable keypoints are kept, indexed, and stored for
“learning” a library of objects or classes
94 / 212
(Scale-Invariant Feature Transform, con’t)
SIFT performs interpolation to localise candidate keypoints with sub-pixel
accuracy, and discards keypoints with poor contrast or stability. For each
local region (four are highlighted here), an orientation histogram is
constructed from the gradient directions as a keypoint descriptor.
96 / 212
(Scale-Invariant Feature Transform, con’t)
I The bins of the orientation histogram are normalised relative to the
dominant gradient direction in the region of each keypoint, so that
rotation-invariance is achieved
I Matching process resembles identification of fingerprints: compare
relative configurations of groups of minutiae (ridge terminations,
spurs, etc), but search across many relative scales as well
I The best candidate match for each keypoint is determined as its
nearest neighbour in a database of extracted keypoints, using the
Euclidean distance metric
I Algorithm: best-bin-first; heap-based priority queue for search order
I The probability of a match is computed as the ratio of that nearest
neighbour distance, to the second nearest (required ratio > 0.8)
I Searching for keys that agree on a particular model pose is based on
Hough Transform voting, to find clusters of features that vote for a
consistent pose
I SIFT does not account for any non-rigid deformations
I Matches are sought across a wide range of scales and positions;
30 degree orientation bin sizes; octave (factor of 2) changes in scale
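A hedged sketch of this detection-and-matching pipeline with OpenCV (SIFT is exposed as cv2.SIFT_create in OpenCV 4.4 and later; the filenames are placeholders, and the 0.8 threshold is the ratio criterion described above):

import cv2

img1 = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)   # assumed filenames
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)            # keypoints + 128-D descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)                     # Euclidean distance between descriptors
candidates = matcher.knnMatch(des1, des2, k=2)           # nearest and second-nearest neighbours

good = [m for m, n in candidates if m.distance < 0.8 * n.distance]   # Lowe's ratio test
print(f"{len(good)} matches passed the ratio test")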
97 / 212
Summary: philosophy and theology of the SIFT
99 / 212
(7. Parallel functional streams; reciprocal feedback, con’t)
I It is not unreasonable to ask how results from the division-of-labour
get reunified later. (E.g. analysing a bird in flight: how do its form,
colour, texture, and motion information get “put back together”?)
I Reciprocal pairwise connections exist between many such areas,
highlighted here by blue and red pairings of arrows.
See this 3D scan reconstruction of wiring bundles connecting diverse parts of the human brain:
www.bbc.co.uk/news/video_and_audio/must_see/40487049/the-most-detailed-scan-of-the-wiring-of-the-human-brain 100 / 212
Interactions between colour, motion, & texture streams?
102 / 212
(Structure from texture, con’t)
I Quasi-periodicity can be detected best by Fourier-related methods
I The eigenfunctions of Fourier analysis (complex exponentials) are
periodic, with a specific scale (frequency) and wavefront orientation
I Therefore they excel at detecting a correlation distance and direction
I They can estimate the “energy” within various quasi-periodicities
I Texture is also a useful cue to image segmentation, by parsing the
image into local regions which are relatively homogeneous
I Texture also supports figure/ground segmentation by dipole statistics
I The examples below can be segmented (into figure vs ground) either
by their first-order statistics (size of the texture elements), or by
their second-order statistics (dipole orientation)
(Figure: example textures whose figure regions differ from the ground either in element size or in element orientation.)
103 / 212
(Structure from texture, con’t)
104 / 212
(Structure from texture, con’t)
I Resolving textural spectra simultaneously with location information
is limited by the Heisenberg Uncertainty Principle, and this trade-off
is optimised by Gabor wavelets
I Texture segmentation using Gabor wavelets can be a basis for
extracting the shape of an object to recognise it. (Left image)
I Phase analysis of iris texture using Gabor wavelets is a powerful
basis for person identification. (Right image)
105 / 212
(Structure from texture, con’t)
Inferring depth from texture gradients can have real survival value...
106 / 212
8b. Colour information
Two compelling paradoxes are apparent in how humans process colour:
1. Perceived colours hardly depend on the wavelengths of illumination
(colour constancy), even with dramatic changes in the wavelengths
2. But the perceived colours depend greatly on the local context
The brown tile at the centre of the illuminated upper face of the cube,
and the orange tile at the centre of the shadowed front face, are actually
returning the same light to the eye (as is the tan tile lying in front)
107 / 212
(Colour information, con’t)
Colour is a nearly ubiquitous property of surfaces, and it is useful both for
object identification and for segmentation. But inferring colour properties
(“spectral reflectances”) of object surfaces from images seems impossible,
because generally we don’t know the spectrum of the illuminant.
I Let I (λ) be the wavelength composition of the illuminant (the amount
of energy it contains as a function of wavelength λ, across the visible
spectrum from about 400nm to 700nm)
I Let O(λ) be the spectral reflectance of the object at some point
(the fraction of incident light scattered back from its surface there,
as a function of wavelength λ)
I Let R(λ) be the actual wavelength mixture received by the camera at
the corresponding point in the image, say for (400nm < λ < 700nm)
Clearly, R(λ) = I (λ)O(λ). The problem is that we wish to infer the “object
colour” O(λ), its spectral reflectance as a function of wavelength, but we
only know R(λ), the mixture received. So unless we can measure I (λ)
directly, how could this problem of inferring O(λ) from R(λ) possibly be
solved?
108 / 212
(Colour information, con’t)
An algorithm for computing O(λ) from R(λ) was proposed by Dr E Land
(founder of Polaroid Corporation). He named it the Retinex Algorithm
because he regarded it as based on biological vision (RETINa + cortEX).
It is a ratiometric algorithm:
1. Obtain the red/green/blue value (r , g , b) of each pixel in the image
2. Find the maximal values (rmax , gmax , bmax ) across all the pixels
3. Assume that the scene contains some objects that reflect “all” the
red light, others that reflect “all” the green, and others “all” the blue
4. Assume that those are the origins of the values (rmax , gmax , bmax ),
thereby providing an estimate of I (λ)
5. For each pixel, the measured values (r , g , b) are assumed to arise
from actual object spectral reflectance (r /rmax , g /gmax , b/bmax )
6. With this renormalisation, we have discounted the illuminant
7. Alternative variants of the Retinex exist which estimate O(λ) using
only local comparisons across colour boundaries, assuming only local
constancy of the illuminant spectral composition I (λ), rather than
relying on a global detection of (rmax , gmax , bmax )
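A minimal NumPy sketch of the global variant in steps 1–6 above (the input rgb is an assumed floating-point array of shape (H, W, 3)):

import numpy as np

def retinex_global(rgb, eps=1e-6):
    # Discount the illuminant by normalising each channel by its maximum value.
    maxima = rgb.reshape(-1, 3).max(axis=0)        # (r_max, g_max, b_max): the illuminant estimate
    return rgb / (maxima + eps)                    # estimated reflectances, roughly in [0, 1]

# rgb = ... load an image as a float array of shape (H, W, 3) ...
# reflectance = retinex_global(rgb)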
109 / 212
(Colour information, con’t)
Colour assignments are very much a matter of calibration, and of making
assumptions. Many aspects of colour are “mental fictions”.
For example, why does perceptual colour space have a seamless, cyclic
topology (the “colour wheel”), with red fading into violet fading into
blue, when in wavelength terms that is moving in opposite directions
along a line (λ → 700nm red) versus (blue 400nm ← λ)?
113 / 212
(Structure from stereo vision, con’t)
Of course, alternative methods exist for estimating depth. For example,
the “Kinect” gaming device projects an infrared (IR, invisible) laser grid
into the scene, whose resulting pitch in the image sensed by an IR camera
is a cue to depth and shape, as we saw in discussing shape from texture.
Here we consider only depth computation from stereoscopic disparity.
114 / 212
(Structure from stereo vision, con’t)
I If the optical axes of the 2 cameras converge at a point, then objects
in front or behind that point in space will project onto different parts
of the two images. This is sometimes called parallax
I The disparity becomes greater in proportion to the distance of the
object in front, or behind, the point of fixation
I Clearly it depends also on the convergence angle of the optical axes
I Even if the optical axes parallel each other (“converged at infinity”),
there will be disparity in the image projections of nearby objects
I Disparity also becomes greater with increased spacing between the
two cameras, as that is the base of triangulation
115 / 212
(Structure from stereo vision, con’t)
In the simplifying case that the optical axes are parallel, once the
correspondence problem has been solved, plane geometry enables
calculation of how the depth d of any given point depends on:
I camera focal length f
I base distance b between the optical centres of their lenses
I disparities (α, β) in the image projections of some object point (P),
in opposite directions outwards relative to the two optical axes
Namely: d = fb/(α + β)
Imagine the visual world of “hunter spiders” that have got eight eyes...
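A small numeric sketch of the depth formula above (the focal length, baseline, and disparities are made-up values in millimetres):

def stereo_depth(f, b, alpha, beta):
    # Depth of a point from its two image disparities, for parallel optical axes.
    return f * b / (alpha + beta)

# Example with assumed values: f = 8 mm focal length, b = 65 mm baseline,
# disparities alpha = 0.20 mm and beta = 0.15 mm measured on the two sensors.
print(stereo_depth(8.0, 65.0, 0.20, 0.15))   # ~1485.7 mm, i.e. about 1.5 m away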
117 / 212
8d. Optical flow; detecting and estimating motion
I Optical flow is the apparent motion of objects in a visual scene
caused by relative motion between an observer and the scene.
I It assists scene understanding, segmentation, 3D object recognition,
stereo vision, navigation control, collision and obstacle avoidance.
I Motion estimation computes local motion vectors that describe the
transformation between frames in a video sequence. It is a variant
of the correspondence problem, illustrated by this vector field:
118 / 212
Information from motion vision
Few vision applications involve just static image frames. That is basically
vision “off-line;” – but the essence of an effective visual capability is its
real-time use in a dynamic environment. Among the challenges are:
I Need to infer 3D object trajectories from 2D image motion.
I Need to make local measurements of velocity, which may differ in
different image regions in complex scenes with many moving objects.
Thus, a velocity vector field needs to be assigned over an image.
I It may be necessary to assign more than one velocity vector to any
given local image region (as occurs in “motion transparency”).
I Need to disambiguate object motion from contour motion, so that
we can measure the velocity of an object regardless of its form.
I We may need to detect a coherent overall motion pattern across
many small objects or regions separated from each other in space.
I May need complex inferences about form and object identity, from
merely a few moving points. See classic Johansson demonstration of
easily identifiable human activity from just a few sparse points:
http://www.youtube.com/watch?v=r0kLC-pridI - even gender and age:
https://www.youtube.com/watch?v=4E3JdQcmIAg (- is he a Neanderthal?)
119 / 212
Main classes of motion detection and estimation models
I Intensity gradient models: Assume that the local time-derivative in
image intensities at a point is related to the local spatial gradient in
I (x, y , t) image intensities because of some object velocity ~v :
− ∂I (x, y , t)/∂t = ~v · ∇I (x, y , t)
Then the ratio of the local image time-derivative to the spatial
gradient ∇I (x, y , t) is an estimate of the local image velocity
(in the direction of the gradient).
I Dynamic zero-crossing models: Measure image velocity by finding
first the edges and contours of objects (using the zero-crossings of
a blurred Laplacian operator!), and then take the time-derivative of
the Laplacian-Gaussian-convolved image:
− (∂/∂t) [ ∇² Gσ (x, y ) ∗ I (x, y , t) ]
in the vicinity of a Laplacian zero-crossing. The amplitude is an
estimate of speed, and the sign of this quantity determines the
direction of motion relative to the normal to the contour.
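A minimal sketch of the first (intensity gradient) model, pooling the constraint over a whole patch by least squares (frame1 and frame2 are assumed consecutive greyscale frames as NumPy arrays):

import numpy as np

def patch_velocity(frame1, frame2):
    # Least-squares velocity (vx, vy) from  -dI/dt = v . grad(I),  pooled over the patch.
    Iy, Ix = np.gradient(frame1.astype(float))          # spatial gradients (rows, columns)
    It = frame2.astype(float) - frame1.astype(float)    # temporal derivative, one frame apart
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)      # one gradient equation per pixel
    b = -It.ravel()
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v                                            # (vx, vy) in pixels per frame

# v = patch_velocity(frame1, frame2)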
120 / 212
Spatio-temporal spectral methods for motion estimation
I Motion can also be detected and measured by Fourier methods.
I This approach exploits the fact that motion creates a covariance in
the spatial and temporal spectra of the time-varying image I (x, y , t),
whose 3-dimensional (spatio-temporal) Fourier transform is defined:
F (ωx , ωy , ωt ) = ∫∫∫ I (x, y , t) e^{−i(ωx x + ωy y + ωt t)} dx dy dt
121 / 212
Spectral co-planarity theorem
Translational image motion has a 3D spatio-temporal Fourier spectrum
that is non-zero only on a plane through the origin of frequency-space.
Coordinates of the unit normal to this spectral plane correspond to the
speed and direction of motion.
124 / 212
Optical flow: elementary motion detectors in flying insects
(and for other autonomous vehicles like driverless cars?)
125 / 212
9. Surfaces and reflectance maps
How can we infer the shape and reflectance properties of a surface from
measurements of brightness in an image?
126 / 212
(Surfaces and reflectance maps, con’t)
I A Lambertian surface (also called diffusely reflecting, or “matte”)
reflects light equally well in all directions
I Examples of Lambertian surfaces include: snow, non-glossy paper,
ping-pong balls, magnesium oxide, projection screens, ...
I The amount of light reflected from a Lambertian surface depends on
the angle of incidence of the light (by Lambert’s famous cosine law),
but not on the angle of emission (the viewing angle)
I A specular surface is mirror-like. It obeys Snell’s law (the angle of
incidence of light is equal to its angle of reflection from the surface),
and it does not scatter light into other angles
I Most metallic surfaces are specular. But more generally, surfaces lie
somewhere on a continuum between Lambertian and specular
I Special cases arise from certain kinds of dust. The surface of the
moon (called unsurprisingly a lunar surface) reflects light depending
on the ratio of cosines of angle of incidence and angle of emission
I That is why a full moon looks more like a penny than like a sphere;
its brightness does not fade, approaching the boundary (!)
127 / 212
(Surfaces and reflectance maps, con’t)
128 / 212
(Surfaces and reflectance maps, con’t)
The reflectance map is a function φ(i, e, g ) which relates intensities in
the image to surface orientations of objects. It specifies the fraction of
incident light reflected per unit surface area, per unit solid angle, in the
direction of the camera; thus it has units of flux/steradian
It is a function of three variables:
I i is the angle of the illuminant, relative to the surface normal N
I e is the angle of a ray of light re-emitted from the surface
I g is the angle between the emitted ray and the illuminant
129 / 212
(Surfaces and reflectance maps, con’t)
There are many types of reflectance maps φ(i, e, g ), each of which is
characteristic of certain surfaces and imaging environments
130 / 212
(Surfaces and reflectance maps, con’t)
Typically, surfaces have both specular and matte properties. For example,
facial skin may vary from Lambertian (powdered) to specular (oily). The
purpose of powdering one’s face is to specify s and n in this expression:
131 / 212
(Surfaces and reflectance maps, con’t)
Typically there is not just one point source of illumination, but rather a
multitude of sources (such as the extended light source provided by a
bright overcast sky). In a cluttered scene, much of the light received by
objects has been reflected from other objects (and coloured by them...).
One needs almost to think of light not in terms of ray-tracing but in terms
of thermodynamics: a “gas” of photons in equilibrium inside a room
133 / 212
(Surfaces and reflectance maps, con’t)
Sometimes the only consistent solution is to assume simply that the
surface albedo really is different. In this image, tile A is emitting the
same light as tile B. But the requirements of illumination context and
shading make it impossible to see them as having the same albedo
134 / 212
(Surfaces and reflectance maps, con’t)
The inference of a surface shape (a relief map, or an object-centred
description of a surface) from shading information is an inherently
ill-posed problem because the data necessary for the computation is not
known. One has to introduce ancillary assumptions about the surface
material composition, its albedo and reflectance map, the illumination of
the scene and its geometry, before such inferences become possible.
A closed contour can be described by its curvature map θ(s): the curvature at
each position s along the perimeter, θ(s) = ±1/r(s), where the local radius of
curvature r (s) is defined as the limiting radius of the circle that best “fits”
the contour at position s, as arc ∆s → 0.
Curvature sign, +/−, depends on whether the circle is inside, or outside,
the figure. For open contours, other conventions determine the sign. The
figure’s concavities are linked with minima; its convexities with maxima.
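As a small illustration of the curvature-map idea (a sketch only, assuming a closed contour sampled at evenly spaced points; not code from the notes), the signed curvature along a discrete contour can be estimated from finite differences:

```python
import numpy as np

def curvature_map(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Signed curvature theta(s) = 1/r(s) along a sampled contour (x[i], y[i]),
    using kappa = (x' y'' - y' x'') / (x'^2 + y'^2)^(3/2) with finite differences.
    (For a closed contour, periodic differences via np.roll would be the more
    careful choice; edge effects here are small for dense sampling.)"""
    dx  = np.gradient(x, edge_order=2)
    dy  = np.gradient(y, edge_order=2)
    ddx = np.gradient(dx, edge_order=2)
    ddy = np.gradient(dy, edge_order=2)
    return (dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5

# Example: a circle of radius 5, traversed counterclockwise so that the interior
# (the "figure") lies to the left, has constant positive curvature 1/5.
t = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
kappa = curvature_map(5.0 * np.cos(t), 5.0 * np.sin(t))
print(round(float(kappa.mean()), 3))   # approximately 0.2
```

Reversing the direction of traversal flips the sign of the estimated curvature, which is exactly the figure/ground dependence discussed in the following slides.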
136 / 212
(Shape representation and codon shape grammars, con’t)
The purpose of computing shape descriptions like curvature maps θ(s)
(which might result from fitting active contours, for example), is to build
a compact classification grammar for recognising common shapes.
By the Fundamental Theorem of Curves, a curvature map θ(s) together
with a “starting point” tangent t(s0) specifies a shape fully. Some nice
properties of curvature-map descriptions are:
1. The description is position-independent (i.e., object-centred).
2. The description is orientation-independent (rotating the shape in the
plane does not affect its curvature map).
3. The description represents mirror-reversed shapes just by changing
the sign of s, so the perimeter is traversed in the opposite direction:
θ(s) → θ(−s)
4. Scaling property: Changing the size of a shape just scales θ(s) by a
constant (K is reciprocal to the size change factor):
θ(s) → K θ(s)
137 / 212
(Shape representation and codon shape grammars, con’t)
The goal is to construct an hierarchical taxonomy of closed 2D shapes,
based on the extrema of curvature. Their possible combinations are very
restricted by the requirement of closure, leading to a codon grammar of
shapes (analogous to the ordered triples of the nucleotide bases A,G,C,T
which specify the 20 amino acids).
Note that since curvature is a signed quantity (depending on whether the
fitting circle is inside or outside the shape), the minimum and maximum
of curvature may mean the same radius. For open contours, they depend
on sign conventions and the direction of travel. We are interested in the
extrema of curvature: minima, maxima, and zeroes (the inflexion points).
There are just six primitive codon types: all curve segments lying between
minima of curvature must have 0, 1 or 2 points of zero curvature, further
classified by whether a zero is encountered before (“−”) or after (“+”)
reaching the maximum curvature in the chosen direction of traversal.
Dots show zeroes of curvature (inflexions); slashes indicate the minima:
[Figure: the six primitive codon types (0+, 0−, 0∞, 1+, 1−, 2), with an excerpt
from Richards & Hoffman's codon paper reproduced on the slide: the minima of
curvature depend on which side of the contour is taken as “figure” (cf. Rubin's
vase), and the 0− codon can constitute a part boundary.]
138 / 212
(Shape representation and codon shape grammars, con’t)
Note that because curvature is a signed quantity, the loci of its minima
depend on what we take to be “figure” vs “ground”. For open contours
like these face profiles (alternatively Rubin’s Vase profiles), if we regard
“figure” as “to left”, then loci of minima depend on direction of traversal:
139 / 212
(Shape representation and codon shape grammars, con’t)
There are 5 possible Codon Triples, and 9 possible Codon Quads:
140 / 212
(Shape representation and codon shape grammars, con’t)
Constraints on codon strings for closed curves are very strong. While
sequences of (say) 6 codons have 5^6 = 15,625 possible combinations,
these make only 33 generic shapes.
Ordinal relations among singular points of curvature (maxima, minima,
and zeroes) remain invariant under translations, rotations, and dilations.
The inflexion (a zero of curvature) of a 3D curve is preserved under 2D
projection, thereby guaranteeing that the ordinal relations among the
extrema of curvature will also be preserved when projected to an image.
Thus we can acquire a very compact lexicon of elementary shapes, and
we can construct an object classification algorithm as follows:
142 / 212
(Volumetric descriptions of 3D shape, con’t)
Superquadrics represent objects as the unions and/or intersections of
generalized superquadric closed surfaces, which are the loci of points in
(x, y , z)-space that satisfy parametric equations of this form:
Ax^α + By^β + Cz^γ = R
Spheres have (α, β, γ) = (2, 2, 2) and A = B = C . Other examples:
I cylinders: (α, β, γ) = (2, 2, 100) and A = B
I rectangular solids: (α, β, γ) = (100, 100, 100)
I prolate spheroids (shaped like zeppelins): (α, β, γ) = (2, 2, 2) and
(say) A = B but C < (A, B)
I oblate spheroids (shaped like tomatoes): (α, β, γ) = (2, 2, 2) and
(say) A = B but C > (A, B)
Rotations of such objects in 3D produce cross-terms in (xy , xz, yz).
Parameters (A, B, C ) determine object dimensions. Origin-centred.
These simple, parametric models for solids, augmented by Boolean
relations for conjoining them, allow the generation of object-centered,
“volumetric” descriptions of many objects (instead of an image-based
description) by just listing parameters (α, β, γ, A, B, C ) and relations,
rather like the codon descriptors for closed 2D shapes.
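As a small illustration (a sketch only; the parameter values echo the examples above, and the membership test is the only operation needed to conjoin such primitives with Boolean relations):

```python
def inside_superquadric(p, A, B, C, alpha, beta, gamma, R=1.0):
    """True if point p = (x, y, z) satisfies A|x|^alpha + B|y|^beta + C|z|^gamma <= R.
    Absolute values keep large or non-integer exponents well behaved."""
    x, y, z = p
    return A * abs(x)**alpha + B * abs(y)**beta + C * abs(z)**gamma <= R

# A sphere (alpha, beta, gamma) = (2, 2, 2) and a near-cube (100, 100, 100):
sphere = dict(A=1, B=1, C=1, alpha=2,   beta=2,   gamma=2)
box    = dict(A=1, B=1, C=1, alpha=100, beta=100, gamma=100)

p = (0.9, 0.9, 0.0)
print(inside_superquadric(p, **sphere))   # False: outside the unit sphere
print(inside_superquadric(p, **box))      # True: inside the near-cube

# Boolean combination (e.g. union) of two volumetric primitives:
print(inside_superquadric(p, **sphere) or inside_superquadric(p, **box))   # True
```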
143 / 212
11. Vision as model building
I role of context in determining a model
I percepts as hypotheses generated for testing
I rivalrous and paradoxical percepts, and visual
illusions: “bugs” or “features” of a system?
144 / 212
Vision as perceptual inference and hypothesis testing
I Low-level visual percepts, built from extracted features, must be
iteratively compared with high-level models to derive hypotheses
about the visual world
I This iterative cycle of model-building for hypothesis generation and
testing is sometimes called the hermeneutical cycle
I It fits the key anatomical observation that mammalian brains have
massive feedback projections from the visual cortex back down to
the thalamus, meeting the upcoming data stream from the eyes
[Figure: “Vision as a Cycle of Perception”: a bottom-up path (analysis and
recognition: induction) compares signal features with model hypotheses and
estimates likelihoods, while a top-down path (synthesis and verification:
deduction) generates new models and derives expectations; the loop between
the two is the hermeneutical cycle.]
145 / 212
12. Lessons from visual illusions, neural trauma, & deficits
I Normal human vision is often not veridical. Illusions are standard.
I Illusions can reveal top-down processing; the role of expectation; and
interactions between cooperative and competitive neural processes.
I In the “cafe wall illusion” below, all long lines are actually parallel.
I Are illusions features or bugs? Should we design them into systems?
146 / 212
Neurones (in cats) actually respond to illusory contours
“Illusory contours and cortical neuron responses”, Science 224 (1984), pp. 1260-1262.
147 / 212
Illusory contours can even drive some high-level inferences
148 / 212
Lessons from neurological trauma and deficits
Strokes and battlefield injuries sometimes have astonishing consequences,
with aphasias and agnosias indicating highly specialised brain areas.
I Facial prosopagnosia: lost ability to recognise faces. Vision appears
normal otherwise, but faces cease to be represented or processed.
I Achromatopsia: cortical loss of colour vision, but “black-and-white”
(achromatic) vision is apparently normal.
I Astereognosia: loss of ability to perceive three-dimensionality.
I Simultanagnosia: inability to perceive simultaneously more than one
thing at a time (e.g. multiple elements in a display).
I Neglect and hemi-inattention syndromes: one side of any object is
always neglected. Such patients dress themselves only on (say) their
right side, and always bump into things with their left side; and will
draw a clock face with all the numbers 1 - 12 in the right half only.
I Xanthopsia: perception that all objects are covered with gold paint.
What sort of “computer” is the brain, that it can display these types of
faults when traumatised? What do these phenomena reveal about the
nature of the brain’s architecture, data structures, and algorithms?
149 / 212
13. Bayesian inference in vision
It is almost impossible to perform most computer vision tasks in a purely
“bottom-up” fashion. The data are just too impoverished by themselves
to support the task of object recognition.
Bayesian inference combines prior knowledge or belief about the world (object
properties and relationships, the metaphysics of objects, etc...) with empirical
information gathered from incoming image data. This principle is expressed
in the form of a basic rule for relating conditional probabilities in which the
“antecedent” and “consequent” are interchanged. The value of this method
for computer vision is that it provides a framework for continually updating
one's theory of what one is looking at, by integrating continuously incoming
evidence with the best available inference or interpretation so far.
150 / 212
(Bayesian inference in vision, con’t)
The Bayesian view focuses on the use of priors, which allow vision to be
steered heavily by a priori knowledge about the world and the things
which populate it.
For example, probabilistic priors can express the notions that:
I some events, objects, or interpretations are much more probable
than others
I matter cannot just disappear, but it does routinely become occluded
153 / 212
Statistical decision theory
In many applications, we need to perform pattern classification on the
basis of some vector of acquired features from a given object or image.
The task is to decide whether or not this feature vector is consistent with
a particular class or object category. Thus the problem of classification
amounts to a “same / different” decision about the presenting feature
vector, compared with vectors characteristic of certain object classes.
Usually there is some similarity between “different” patterns, and some
dissimilarity between “same” patterns. The four possible combinations of
“ground truths” and decisions creates a decision environment:
1. Hit: Actually same; decision “same”
2. Miss: Actually same; decision “different”
3. False Alarm: Actually different; decision “same”
4. Correct Reject: Actually different; decision “different”
We would like to maximize the probability of outcomes 1 and 4, because
these are correct decisions. We would like to minimize the probability of
outcomes 2 and 3, because these are incorrect decisions
154 / 212
Statistical Decision Theory
[Figure: two overlapping probability distributions of a dissimilarity metric
(Hamming Distance, HD, on a 0.0–1.0 axis), one for “Authentics” and one for
“Imposters”, with a decision criterion marked and the “Rate of Accepting
Imposters” indicated as a shaded area.]
155 / 212
(Statistical decision theory, con’t)
I In the two-state decision problem, the feature vectors or data are
regarded as arising from two overlapping probability distributions
I They might represent the features of two object classes, or they
might represent the similarity scores for “same” vs “different”
I When a decision is made, based upon the observed similarity and
some acceptability threshold, the probabilities of the four possible
outcomes can be computed as the four cumulatives under these
two probability distributions, to either side of the decision criterion
I These four probabilities correspond to the shaded areas in last figure
I The computed error probabilities can be translated directly into a
confidence level which can be assigned to any decision that is made
I Moving the decision criterion (dashed line) has coupled effects:
I Increasing the “Hit” rate also increases the “False Alarm” rate
I Decreasing the “Miss” rate also decreases the “Correct Reject” rate
I These dependencies map out the Receiver Operating Characteristic
I Each point (∗) on the ROC curve (next fig.) represents a particular
choice for the decision criterion, or threshold of acceptance
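A minimal sketch of these ideas (illustrative only: the Gaussian score distributions and their parameters are assumptions, not the empirical curves shown in the figures), computing the four outcome probabilities for one criterion and sweeping the criterion to trace out an ROC:

```python
import numpy as np
from scipy.stats import norm

# Assumed dissimilarity-score distributions (illustrative parameters only):
same = norm(loc=0.30, scale=0.06)   # "authentics": actually same
diff = norm(loc=0.46, scale=0.04)   # "imposters":  actually different
# Decision rule: decide "same" when the dissimilarity score falls below the criterion.

def outcomes(criterion: float):
    hit         = same.cdf(criterion)       # actually same,      decided "same"
    miss        = 1.0 - hit                 # actually same,      decided "different"
    false_alarm = diff.cdf(criterion)       # actually different, decided "same"
    correct_rej = 1.0 - false_alarm         # actually different, decided "different"
    return hit, miss, false_alarm, correct_rej

# Sweeping the criterion maps out the ROC (hit rate versus false alarm rate):
roc = [(outcomes(c)[2], outcomes(c)[0]) for c in np.linspace(0.0, 1.0, 101)]
print(outcomes(0.37))   # one operating point: its four outcome probabilities
print(roc[37])          # the corresponding (false alarm rate, hit rate) point
```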
156 / 212
Receiver Operating Characteristic (“ROC curve”)
[Figure: “Decision Strategies”: the ROC curve plots Hit Rate (0 to 1) against
False Alarm Rate; each point on the strategy curve is one choice of acceptance
criterion. Raising the criterion is more conservative; lowering it is more liberal.]
The decidability of the decision task can be summarised by the index

d′ = |μ2 − μ1| / sqrt[ (σ2² + σ1²) / 2 ]

where μ1, μ2 and σ1², σ2² are the means and variances of the two distributions.
[Figure: empirical iris recognition score distributions on a Hamming Distance
(HD) axis from 0.0 to 1.0: same-eye comparisons have mean 0.089 and standard
deviation 0.042; different-eye comparisons have mean 0.456 and standard
deviation 0.018, giving d′ = 11.36. Fitted theoretical binomial curves cross
over at HD = 0.342, a rate of about 1 in 1.2 million.]
159 / 212
(Statistical decision theory, con’t)
Decidability d′ ≥ 3 is normally considered good. The distributions shown
earlier to illustrate the decision environment had d′ = 2. The empirical ones
for iris recognition (previous figure) had d′ ≈ 11.
Because reliability of pattern recognition depends on the between-class
variance being larger than the within-class variance, R. Fisher defined the
“separation between two distributions” as the ratio of their between-class
variance to their within-class variance. This definition is related to d′.
Another metric is the total area under the ROC curve, which ideally → 1.
Other relevant metrics include the total probability of error for a chosen
decision criterion, as illustrated by the combined shaded areas below:
[Figure: the two class-conditional densities weighted by their priors,
p(x|C1)p(C1) and p(x|C2)p(C2), plotted against x, with decision regions R1
and R2 on either side of a criterion; the shaded areas are the error probabilities.]
160 / 212
Bayesian pattern classifiers
Consider a two-class pattern classification problem, such as OCR (optical
character recognition) involving only two letters, a and b. We compute
some set of features x from the image data, and we wish to build a
Bayesian classifier that will assign a given pattern to one of two classes,
C1 ≡ a or C2 ≡ b, corresponding to the two letter instances.
Suppose we sample a slightly smallish value of x, one for which the likelihood
under C1 is a bit higher than under C2. Would we then say that the letter
class is more likely to have been C1 than C2?
No. As Bayesians we must take into account the baseline rates. Define
the prior probabilities P(C1 ) and P(C2 ) as their two relative frequencies
(summing to 1).
If we had to guess which character had appeared without even seeing it,
we would always just guess the one with the higher prior probability.
For example, since in fact an ‘a’ is about 4 times more frequent than a ‘b’
in English, and these are the only two options in this two-class inference
problem, we would set the priors to P(a) = 0.8 and P(b) = 0.2.
162 / 212
(Bayesian pattern classifiers, con’t)
I For each class separately, we can measure how likely any particular
feature sample value x will be, by empirical observation of examples
I (Note that this requires knowing the “ground truth” of examples)
I This gives us P(x|Ck ) for all the classes Ck
I We get the unconditional probability P(x) of any measurement x by
summing P(x|Ck ) over all the classes, weighted by their frequencies:
P(x) = Σk P(x|Ck) P(Ck)

Bayes’ rule then gives the posterior probability of each class, given the
observed data x:

P(Ck|x) = P(x|Ck) P(Ck) / P(x)
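A minimal sketch of these formulas for the two-letter OCR example (the Gaussian class-conditional likelihoods and their parameters are assumptions for illustration; the priors are those given in the text):

```python
from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

# Assumed class-conditional likelihoods P(x|Ck) for a single feature x,
# and the priors from the text: P(a) = 0.8, P(b) = 0.2.
likelihood = {"a": lambda x: gaussian(x, mu=0.4, sigma=0.1),
              "b": lambda x: gaussian(x, mu=0.6, sigma=0.1)}
prior = {"a": 0.8, "b": 0.2}

def posterior(x):
    evidence = sum(likelihood[c](x) * prior[c] for c in prior)        # P(x)
    return {c: likelihood[c](x) * prior[c] / evidence for c in prior} # P(Ck|x)

x = 0.52                               # an observed feature value
post = posterior(x)
decision = max(post, key=post.get)     # assign to the class with highest posterior
print(post, "->", decision)            # 'a' wins here despite the likelihood favouring 'b'
```

The example illustrates the point above: even when the likelihood slightly favours ‘b’, the much larger prior for ‘a’ can dominate the posterior.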
163 / 212
(Bayesian pattern classifiers, con’t)
Thus we have a principled, formal way to perform pattern classifications
on the basis of available data and our knowledge of class baseline rates,
and how likely the data would be for each of the classes.
We can minimise the total probability of misclassification if we assign
each observation x to the class with the highest posterior probability.
Assign x to class Ck if: P(Ck|x) > P(Cj|x) for all j ≠ k.
Because the costs of the two different types of errors are not always
equal, we may not necessarily want to place our decision criterion at the
point where the two curves cross, even though that would minimise the
total error. If the decision boundary we choose is instead as indicated by
the vertical line, so R1 and R2 are the regions of x on either side of it,
then the total probability of error (which is the total shaded area) is:
P(error) = P(x ∈ R2 , C1 ) + P(x ∈ R1 , C2 )
= P(x ∈ R2 |C1 )P(C1 ) + P(x ∈ R1 |C2 )P(C2 )
= ∫R2 P(x|C1)P(C1) dx + ∫R1 P(x|C2)P(C2) dx
165 / 212
14. Discriminant functions and decision boundaries
167 / 212
15. Discriminative versus generative methods in vision
I Discriminative methods learn a function yk (x) = P(Ck |x) that maps
input features x to class labels Ck . They require large training data
covering all expected kinds of variation. Examples of such methods:
I artificial neural networks
I support vector machines
I boosting methods
I linear discriminant analysis
169 / 212
... of a Bayesian network). On specific (supervised) learning tasks, discriminative
methods usually perform better and are more efficient, but the training data
needs to be large enough to span the expected modes of variation in the data.

Example: convolutional neural network for OCR (LeCun)
Optical Character Recognition systems have many applications:
I postal sorting, bank cheque routing
I automated number plate recognition
I book and manuscript digitisation
I text-to-speech synthesis for the blind
I handwriting recognition for portable device interfaces
Handwritten fonts require methods from Machine Learning to cope with
all writing variations (size, slant, stroke thickness), distortions, and noise.
A classic convolutional NN for OCR was developed by Yann LeCun:
170 / 212
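A minimal LeNet-style sketch in PyTorch (an assumption for illustration: this is not LeCun's exact LeNet-5 and PyTorch is not used in the notes), showing the convolution, subsampling, and fully-connected structure of such an OCR network for 28×28 character images:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNetStyle(nn.Module):
    """Convolutional network for 28x28 single-channel character images (10 classes)."""
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5, padding=2)   # 6 feature maps, 28x28
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)             # 16 feature maps, 10x10
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, n_classes)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)   # -> 6 x 14 x 14
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)   # -> 16 x 5 x 5
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)                            # class scores (logits)

model = LeNetStyle()
logits = model(torch.randn(1, 1, 28, 28))             # one dummy 28x28 image
print(logits.shape)                                    # torch.Size([1, 10])
```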
(Example: convolutional neural network for OCR, con’t)
175 / 212
(Face detection, recognition, and interpretation, con’t)
Classic problem: within-class variation (same person, different conditions)
can exceed the between-class variation (different persons).
Persons who share 50% of their genes (parents and children; full siblings;
double cousins) sometimes look almost identical (apart from age cues):
176 / 212
(Face detection, recognition, and interpretation, con’t)
Classic problem: within-class variation (same person, different conditions)
can exceed the between-class variation (different persons).
...and these are completely unrelated people, in Doppelgänger pairs
(photos by François Brunelle of unrelated doppelgängers):
177 / 212
(Face detection, recognition, and interpretation, con’t)
Classic problem: within-class variation (same person, different conditions)
can exceed the between-class variation (different persons).
[Figure: sample images from the Yale face database under varying illumination.
Subsets are defined by how far (from within 15° up to 75°) the longitudinal and
latitudinal angles of the light source direction lie from the camera axis; in the
original study, classification was performed using a nearest neighbour classifier.]
The Yale database is available for download from http://cvc.yale.edu.
178 / 212
(Face detection, recognition, and interpretation, con’t)
Classic problem: within-class variation (same person, different conditions)
can exceed the between-class variation (different persons).
Effect of variations in pose angle (easy and hard), and distance:
179 / 212
(Face detection, recognition, and interpretation, con’t)
Classic problem: within-class variation (same person, different conditions)
can exceed the between-class variation (different persons).
Changes in appearance over time (sometimes artificial and deliberate)
180 / 212
Paradox of Facial Phenotype and Genotype
Facial appearance (phenotype) of everyone changes over time with age;
but monozygotic twins (identical genotype) track each other as they age.
Therefore at any given point in time, they look more like each other than
they look like themselves at either earlier or later periods in time
181 / 212
(Face detection, recognition, and interpretation, con’t)
Detecting and recognising faces raises all the usual questions encountered
in other domains of computer vision:
183 / 212
(Viola-Jones face detection algorithm, con’t)
Key idea: build a strong classifier from a cascade of many weak classifiers
− all of whom in succession must agree on the presence of a face
I A face (in frontal view) is presumed to have structures that should
trigger various local “on-off” or “on-off-on” feature detectors
I A good choice for such feature detectors are 2D Haar wavelets
(simple rectangular binary alternating patterns)
I There may be 2, 3, or 4 rectangular regions (each +1 or −1) forming
feature detectors fj , at differing scales, positions, and orientations
I Applying Haar wavelets to a local image region only involves adding
and subtracting pixel values (no multiplications; hence very fast)
I A given weak classifier hj (x) consists of a feature fj , a threshold θj
and a polarity pj ∈ ±1 (all determined in training) such that
hj(x) = −pj if fj < θj ;  hj(x) = pj otherwise
I A strong classifier h(x) takes a linear combination of weak classifiers,
using weights αj learned in a training phase, and considers its sign:
h(x) = sign( Σj αj hj(x) )
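A minimal sketch of the weak/strong classifier combination just described (the feature values, thresholds, polarities, and weights here are assumed values, as if produced by AdaBoost training):

```python
def weak_classifier(f_value: float, theta: float, p: int) -> int:
    """h_j(x) = -p_j if f_j < theta_j, else +p_j   (p_j is +1 or -1)."""
    return -p if f_value < theta else p

def strong_classifier(feature_values, thetas, polarities, alphas) -> int:
    """h(x) = sign( sum_j alpha_j * h_j(x) );  +1 means 'face', -1 means 'not face'."""
    total = sum(a * weak_classifier(f, t, p)
                for f, t, p, a in zip(feature_values, thetas, polarities, alphas))
    return 1 if total > 0 else -1

# Assumed example: three Haar-like feature responses for one image window,
# with weak-learner parameters as if learned during training.
features   = [0.12, -0.40, 0.05]
thetas     = [0.10,  0.00, 0.02]
polarities = [ 1,   -1,    1  ]
alphas     = [0.8,   0.3,  0.5]
print(strong_classifier(features, thetas, polarities, alphas))   # +1 or -1
```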
184 / 212
(Viola-Jones face detection algorithm, con’t)
I At a given level of the cascade, a face is “provisionally deemed to
have been detected” at a certain position if h(x) > 0
I Only those image regions accepted by a given layer of the cascade
(h(x) > 0) are passed on to the next layer for further consideration
I A face detection cascade may have 30+ layers, yet the vast majority
of candidate image regions will be rejected early in the cascade.
The currently most popular method is due to Viola and Jones (2004), who
popularised the use of the AdaBoost (“Adaptive Boosting,” formulated by
Freund and Schapire) machine learning algorithm to train a cascade of feature
classifiers for object detection and recognition. Boosting is a supervised machine
learning framework which works by building a “strong classifier” as a combination
of (potentially very simple) “weak classifiers.” A Viola-Jones face detector consists
of classifiers based on simple rectangular features (which can be viewed as
approximating Haar wavelets) and makes use of an image representation known
as the integral image (also called summed area table) to compute such features
very efficiently.
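A minimal sketch of the integral image (summed area table) mentioned above, and of how it lets any rectangular feature sum be computed from four array look-ups (the example image and feature geometry are assumptions for illustration):

```python
import numpy as np

def integral_image(img: np.ndarray) -> np.ndarray:
    """S[r, c] = sum of img[0:r, 0:c]; padded with a leading row/column of zeros."""
    return np.pad(img, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def rect_sum(S: np.ndarray, top: int, left: int, height: int, width: int) -> float:
    """Sum of img[top:top+height, left:left+width] using four look-ups into S."""
    r0, c0, r1, c1 = top, left, top + height, left + width
    return S[r1, c1] - S[r0, c1] - S[r1, c0] + S[r0, c0]

img = np.arange(16, dtype=float).reshape(4, 4)
S = integral_image(img)
# A two-rectangle ("on-off") Haar-like feature: left half minus right half of a 2x4 patch
feature = rect_sum(S, 0, 0, 2, 2) - rect_sum(S, 0, 2, 2, 2)
print(feature, img[0:2, 0:2].sum() - img[0:2, 2:4].sum())   # the two values agree
```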
185 / 212
(Viola-Jones face detection algorithm, con’t)
I Training uses the AdaBoost (“Adaptive Boosting”) algorithm
I This supervised machine learning process adapts the weights αj such
that early cascade layers have very high true accept rates, say 99.8%
(as all must detect a face; hence high false positive rates, say 68%)
I Later stages in the cascade, increasingly complex, are trained to be
more discriminating and therefore have lower false positive rates
I More and more 2D Haar wavelet feature detectors are added to each
layer and trained, until performance targets are met
I The cascade is evaluated at different scales and offsets across an
image using a sliding window approach, to find any (frontal) faces
I With “true detection” probability di in the i-th layer of an N-layer
cascade, the overall correct detection rate is D = ∏_{i=1}^{N} di
I With “erroneous detection” probability ei at the i-th layer, the overall
false positive rate is E = ∏_{i=1}^{N} ei (as every layer must falsely detect)
I Example: if we want no false detections, with 10^5 image subregions
so E < 10^−5, in a 30-layer cascade we train for ei = 10^{−5/30} ≈ 0.68,
which shows why each layer can use such weak classifiers!
I Likewise, to achieve a decent overall detection rate of D = 0.95
requires di = 0.95^{1/30} ≈ 0.9983 (very happy to call things “faces”)
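The per-layer targets in the example above follow directly from the product formulas; a quick check with the values stated on the slide:

```python
N = 30                          # layers in the cascade
E_target, D_target = 1e-5, 0.95
e_i = E_target ** (1.0 / N)     # per-layer false positive rate needed: ~0.68
d_i = D_target ** (1.0 / N)     # per-layer detection rate needed:      ~0.9983
print(round(e_i, 4), round(d_i, 4))
```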
186 / 212
(Viola-Jones face detection algorithm, con’t)
To perform face detection, the trained cascade is evaluated at different scales
and offsets within an image using a sliding window approach. The following
figure illustrates what the sliding window finds in a local group photograph:
187 / 212
2D Appearance-based face recognition: Gabor wavelets
We saw that 2D Gabor wavelets can make remarkably compact codes for
faces, among many other things. In this sequence, even using only about
100 Gabor wavelets, not only the presence of a face is obvious, but also
its gender, rough age, pose, expression, and perhaps even identity:
192 / 212
Three-dimensional approaches to face recognition
Face recognition algorithms now aim to model faces as three-dimensional
objects, even as dynamic objects, in order to achieve invariances for pose,
size (distance), and illumination geometry. Of course, this requires solving
ill-posed problems such as inferring shape from shading, interpreting albedo
versus variations in Lambertian and specular surface properties, and structure
from motion. Performing face recognition in object-based (volumetric) terms,
rather than appearance-based terms, unites vision with model-building and
graphics, and is now a major focus of this field.
To construct a 3D representation of a face (so that, for example, its appearance
can be predicted at different pose angles), it is necessary to extract separately
both a shape model (below right) and a texture model (below left). The term
“texture” here encompasses albedo, colouration, and 2D surface details.
193 / 212
(Three-dimensional approaches to face recognition)
Extracting the 3D shape model can be done by various means:
I laser range-finding, even down to millimetre resolution
I calibrated stereo cameras
I projection of structured IR light (grid patterns whose distortions
reveal shape, as with Kinect)
I extrapolation from multiple images taken from different angles
The size of the resulting 3D data structure can be in the gigabyte range,
and significant time can be required for the computation.
Since the texture model is linked to coordinates on the shape model, it is
possible to “project” the texture (tone, colour, features) onto the shape,
and thereby to generate predictive models of the face in different poses.
Clearly sensors play an important role here for extracting shape models,
but it is also possible to do this even from just a single photograph if
sufficiently strong Bayesian priors are also marshalled, assuming an
illumination geometry and some universal aspects of head and face shape.
194 / 212
(Three-dimensional approaches to face recognition)
“...a method for face recognition across variations in pose, ranging from
frontal to profile views, and across a wide range of illuminations,
including cast shadows and specular reflections. To account for these
variations, the algorithm simulates the process of image formation in 3D
space, using computer graphics, and it estimates 3D shape and texture of
faces from single images. The estimate is achieved by fitting a statistical,
morphable model of 3D faces to images. The model is learned from a set
of textured 3D scans of heads. Faces are represented by model
parameters for 3D shape and texture.”
196 / 212
Face algorithms compared with human performance
The US National Institute for Standards and Technology (NIST) runs
periodic competitions for face recognition algorithms, over a wide range
of conditions. Uncontrolled illumination and pose remain challenging.
In a recent test, three algorithms had ROC curves above (better than)
human performance at non-familiar face recognition (the black curve).
But human performance remains (as of 2018) better on familiar faces.
[Figure: ROC curves (verification rate versus false accept rate) for humans and
seven algorithms on difficult and easy face pairs. Three algorithms (including
NJIT's and CMU's) outperform humans on the difficult face pairs at most or all
operating points, while humans outperform the other four; all but one algorithm
is more accurate than humans on the easy face pairs. Chance performance is
the diagonal; the black curve is human performance on non-familiar faces.]
197 / 212
Major breakthrough with CNNs: deep-learning ‘FaceNet’
Machine learning approaches focused on scale (“Big Data”) are having a
profound impact in Computer Vision. In 2015 Google demonstrated large
reductions in face recognition error rates (by 30%) on two very difficult
databases: YouTube Faces (95%), and Labeled Faces in the Wild (LFW)
database (99.63%), which remain as accuracy records. But when tested
on larger (“MegaFace”) datasets, accuracy fell to about 75%.
198 / 212
(Major breakthrough with CNNs: deep-learning ‘FaceNet’)
I Convolutional Neural Net with 22 layers and 140 million parameters
I Big dataset: trained on 200 million face images, 8 million identities
I 2,000 hours training (clusters); about 1.6 billion FLOPS per image
I Euclidean distance metric (L2 norm) on learned embeddings f(xi);
training minimises a triplet loss with margin α
[Figure: “The Triplet Loss minimizes the distance between an anchor and a
positive, both of which have the same identity, and maximizes the distance
between the anchor and a negative of a different identity.”]
I The embeddings create a compact (128 byte) code for each face
I Simple threshold on Euclidean distances among these embeddings
then gives decisions of “same” vs “different” person
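A minimal sketch of the triplet loss just described (a generic formulation under an assumed margin value; not FaceNet's actual training code, and the toy 2D embeddings stand in for its 128-dimensional ones):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha: float = 0.2) -> float:
    """max( ||f(a) - f(p)||^2 - ||f(a) - f(n)||^2 + alpha, 0 ) for one triplet
    of embedding vectors; alpha is the margin."""
    d_pos = np.sum((anchor - positive) ** 2)   # squared L2 distance, same identity
    d_neg = np.sum((anchor - negative) ** 2)   # squared L2 distance, other identity
    return max(d_pos - d_neg + alpha, 0.0)

# Assumed toy embeddings (FaceNet's are 128-dimensional and L2-normalised):
a = np.array([0.10, 0.90]); p = np.array([0.12, 0.88]); n = np.array([0.80, 0.20])
print(triplet_loss(a, p, n))       # 0.0: this triplet already satisfies the margin

# Verification is then a simple threshold on the embedding distance:
print(np.sum((a - p) ** 2) < 0.5)  # True: decide "same person"
```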
199 / 212
(Major breakthrough with CNNs: deep-learning ‘FaceNet’)
[Figure: FLOPS vs. accuracy trade-off. Different variants of the Convolutional
Neural Net (NN1, NN2, NNS1, NNS2) and model sizes were generated and run,
revealing the trade-off between multiply-adds (FLOPS) and accuracy at a
particular point on the ROC curve (False Accept Rate = 0.001).]
200 / 212
2017 IARPA (US DoD) Face Recognition Competition
I Major NIST test commissioned by several US intelligence agencies
I Search a gallery of “cooperative portrait photos” of ∼ 700K faces...
I ... using non-ideal probes: face photos without quality constraints:
I persons unaware of, and not cooperating with, the image acquisition
I variations in head pose, facial expression, illumination, occlusion
I reduced image resolution (e.g. photos taken from a distance)
I Face image databases were “web-scraped, ground-truthed”
I Competitors: 16 commercial and academic entries, all trained on
vast databases using advanced machine learning algorithms (CNNs)
to learn invariances for pose, expression, illumination, occlusion
I Metrics and benchmarks of the competition:
I 1-to-1 verification with lowest False non-Match Rate (FnMR) when
the False Match Rate (FMR) threshold is set to FMR = 0.001
I identification accuracy: lowest FnMR when FMR = 0.001 while
searching a gallery of ∼ 700,000 images
I identification speed: fastest search of ∼ 700,000 identities while
FnMR remains good
201 / 212
Highlights of 2017 Face Recognition Competition results
I identification accuracy: FnMR = 0.204 achieved at FMR = 0.001
I successful indexing instead of exhaustive search through a gallery:
matches retrieved from 700K image gallery in just 0.6 milliseconds!
(One process, running on a single core of a c. 2016 server-class CPU)
I sub-linear scaling of search time: a 30-fold increase in gallery size
incurs only a 3-fold increase in search duration, for fastest entry
I (obviously humans don’t perform sequential searches through a
memory bank of previously seen faces in order to recognise a face)
I but building the fast-search index on 700,000 images takes 11 hours
202 / 212
[Figure (from NIST IR 8197): DET curves plotting false non-match rate (FNMR)
against false match rate (FMR) for the competing algorithms on the “WILD”
dataset; FNMR at FMR = 0.001 ranged from 0.89 (ibug) down to 0.22 (ntechlab).
Verification uses FNMR(T) and FMR(T) at threshold T; identification in the top
R ranks from a database of size N uses the false negative and false positive
identification rates, FNIR(N, R, T) and FPIR.]
203 / 212
NIST also made DET curves for face versus iris recognition
I Because of much greater entropy, IrisCode FMR was 100,000 x lower
I IrisCode FnMR was also 10 x lower than face recognition algorithms
[Figure: ROC/DET curves for leading face and iris algorithms (miss rate versus
false positive identification rate, aka “false alarm rate”). Identification mode,
N = 1.6M; MBE face test 2010, IREX III iris test 2011; detainee populations,
face = single FBI mugshot, iris = single eye, DoD.]
204 / 212
NISTIR 8197 Face Recognition Competition (Nov. 2017)
205 / 212
Affective computing: interpreting facial emotion
Humans use their faces as visually expressive organs, cross-culturally
206 / 212
Many areas of the human brain are concerned with recognising and
interpreting faces, and social computation is believed to have been ...
207 / 212
Affective computing: classifying identity and emotion
208 / 212
(Affective computing: interpreting facial emotion)
MRI scanning has revealed much about brain areas that interpret facial
expressions. Affective computing aims to classify visual emotions as
articulated sequences using Hidden Markov Models of their generation.
Mapping the visible data to action sequences of the facial musculature
becomes a generative classifier of emotions.
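As a sketch of this generative HMM idea (all models, states, and observation sequences here are assumptions for illustration, not a published FACS classifier): score an observed sequence of quantised facial-action symbols under one HMM per emotion, using the forward algorithm, and pick the most likely model.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Forward algorithm: log P(obs | HMM) for a discrete-observation HMM.
    obs: sequence of observation symbols; pi: initial state probs (K,);
    A: state transition matrix (K, K); B: emission matrix (K, M)."""
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        s = alpha.sum()              # rescale at each step to avoid underflow
        log_p += np.log(s)
        alpha /= s
    return log_p + np.log(alpha.sum())

# Two assumed 2-state emotion models over 3 quantised facial-action symbols:
models = {
    "smile": (np.array([0.8, 0.2]),
              np.array([[0.9, 0.1], [0.2, 0.8]]),
              np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])),
    "frown": (np.array([0.5, 0.5]),
              np.array([[0.6, 0.4], [0.4, 0.6]]),
              np.array([[0.1, 0.3, 0.6], [0.2, 0.6, 0.2]])),
}
obs = [0, 0, 1, 2, 2]                    # an observed action-unit symbol sequence
scores = {name: log_likelihood(obs, *m) for name, m in models.items()}
print(max(scores, key=scores.get), scores)   # most likely emotion model, with scores
```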
209 / 212
Facial Action Coding System (FACS)
I FACS is a taxonomy of human facial expressions
I It specifies 32 atomic facial muscle actions called Action Units (AU)
executed by the seven articulating pairs of facial muscles
I Classically humans display six basic emotions: anger, fear, disgust,
happiness, sadness, surprise, through prototypical expressions
I Ethnographic studies suggest these are cross-cultural universals
I FACS also specifies 14 Action Descriptors (ADs): e.g. head pose,
gaze direction, thrust of the jaw, mandibular actions
I Message judgement decodes meanings of these objective encodings
I Analysis is subtle: e.g. distinguishing polite versus amused smiles
I Promising applications:
I understanding human mental state; attributing feelings
I detecting intentions, detecting deceit
I affective computing; understanding nonverbal communication
I building emotion interfaces
I prediction of human behaviour
210 / 212
(Facial Action Coding System FACS, con’t)
I Pre-processing: face detection; normalisation; facial point tracking
I Feature extraction, e.g. 2D Gabor features (usually 8 orientations,
and 5 to 9 frequencies) are powerful to detect facial landmarks, and
for representing wrinkling and bulging actions
I Appearance-based, geometry, motion, or hybrid approaches
I Spatio-temporal appearance features in video, versus static frames
I AU temporal segmentation, classification, and intensity estimation
I Coding the dynamic evolution between facial displays in videos
I Generative models (used with active appearance models) aim to infer
emotional state by modeling the muscular actions that generated it
I Discriminative methods fit deformable models and train a classifier
I Hidden Markov Models trained on articulated facial expressions
211 / 212
Facial Expression and Analysis: algorithm tests
I Just like the Face Recognition Competitions, there are FERA:
Facial Expression and Analysis Challenges (2011, 2015, 2017)
I Metrics used: Occurrence detection, and Intensity estimation
I Facial action detection measured with varying head pose
I Disappointing results so far (2017), compared to face recognition:
I Occurrence detection accuracy: ∼ 0.57
I Intensity estimation accuracy: ∼ 0.44
I Limitations: training sets were often non-spontaneous expressions;
small datasets; large subject differences; environmental influences
I Building database ‘ground truths’: more than 100 hours of training
is required to become a human expert FACS coder
I Manual scoring: each minute of video requires about an hour
I Facial AU analysis remains an underdeveloped field with many open
issues but enormous potential for more fluid HCI interfaces
212 / 212