Regions of Interest For Accurate Object Detection
1. INTRODUCTION
2. PROPOSED FRAMEWORK
[Figure 1: Block diagram of the proposed framework: Feature Extraction → Unsupervised Clustering (DBScan) → Check Combinations of ROIs at all Spatial Configurations → Verification Stage]
c(x, y) = \sum_{W} \left[ I(x_i, y_i) - I(x_i + \Delta x, y_i + \Delta y) \right]^2 \quad (1),

where I denotes the image function and (x_i, y_i) are the points in the window W (Gaussian) centred on (x, y). The shifted image is approximated by a Taylor expansion truncated to the first-order terms,

I(x_i + \Delta x, y_i + \Delta y) \approx I(x_i, y_i) + \begin{bmatrix} I_x(x_i, y_i) & I_y(x_i, y_i) \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix} \quad (2)
Substituting (2) into (1) we get,
c(x, y) = \begin{bmatrix} \Delta x & \Delta y \end{bmatrix} C(x, y) \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix} \quad (3),
where matrix C(x,y) captures the intensity structure of the
local neighborhood.
Let λ1, λ2 be the eigenvalues of matrix C(x, y). The eigenvalues form a rotationally invariant description. The presence of a key-point depends on the relation between λ1 and λ2. Low values of λ1, λ2 are associated with a flat autocorrelation function, which corresponds to an image area with no abrupt change in any direction. If only one of the eigenvalues is high and the other is low, the autocorrelation function is ridge-shaped; such ridges are associated with transition boundaries between different surfaces. High values of λ1, λ2 are reflected in sharp peaks of the autocorrelation function, so a shift in any direction results in a significant increase, indicating a corner.
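The eigenvalue criterion above is commonly evaluated without an explicit eigendecomposition, using det(C) = λ1·λ2 and trace(C) = λ1 + λ2. The following is a minimal sketch of that idea; the constant k = 0.04 and the 3×3 box window (standing in for the Gaussian W) are illustrative assumptions, not settings stated in the paper:

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner response det(C) - k*trace(C)^2 at every pixel."""
    img = img.astype(float)
    Iy, Ix = np.gradient(img)            # gradients along rows, cols
    def box3(a):                         # 3x3 box window standing in
        p = np.pad(a, 1, mode='edge')    # for the Gaussian window W
        n, m = a.shape
        return sum(p[i:i+n, j:j+m] for i in range(3) for j in range(3)) / 9.0
    # smoothed entries of the structure matrix C(x, y)
    Ixx, Iyy, Ixy = box3(Ix * Ix), box3(Iy * Iy), box3(Ix * Iy)
    # det(C) = l1*l2, trace(C) = l1 + l2: the response is large only
    # when both eigenvalues are large, i.e. at a corner
    return Ixx * Iyy - Ixy ** 2 - k * (Ixx + Iyy) ** 2
```

On a synthetic step image (one bright quadrant), the response peaks at the quadrant corner, stays near zero on flat areas, and goes negative along the edges, matching the three cases discussed above.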
Point detection is followed by spatial filtering to reduce the density of feature points within local neighborhoods. Equation (4) gives a mathematical description of the point-filtering operation. The set S corresponds to the set of locations obtained after an initial selection of interest points.
S = \{\, p \in H : \| p - q \| \geq \varepsilon, \ \forall q \in H \setminus \{p\} \,\} \quad (4),
where p denotes the point under consideration, H the set of points obtained by the Harris corner detector, and ε a parameter denoting the radius of the area being checked. The size of ε is chosen to be significantly smaller than the patch considered for extracting visual descriptors. The underlying idea in selecting the value of ε is to suppress the number of points co-occurring in the same spatial neighborhood while still enabling accurate representation of local image structures.
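As a rough illustration of this filtering step, the sketch below uses a greedy variant that keeps a point only when no already-kept point lies within radius ε; the paper's exact selection rule may differ:

```python
import numpy as np

def suppress_close_points(points, eps):
    """Greedy radius-based filter: keep a point only if every
    previously kept point is at least eps away (eps stands in for
    the paper's epsilon parameter)."""
    kept = []
    for p in points:
        # reject p if it co-occurs with a kept point in the same
        # eps-neighborhood
        if all(np.hypot(p[0] - q[0], p[1] - q[1]) >= eps for q in kept):
            kept.append((p[0], p[1]))
    return kept
```

Because the scan is greedy, the result depends on the input order (e.g. on the Harris response ranking), which is a common way to realize this kind of suppression.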
Keypoint localization is followed by a feature extraction procedure. At this step, we consider local square patches of limited extent (significantly larger than ε) and evaluate colour and texture descriptors to represent local areas. Thus, the three most significant components of the dominant colour are selected to reflect the local colour content, while a statistical descriptor of texture is used to estimate texture variations within image sub-regions. More specifically, at each keypoint location the gray-level co-occurrence matrix is extracted, and features expressing the contrast and homogeneity within these areas are evaluated. The co-occurrence matrix displacement vector is selected so as to represent the spatial arrangement of pixel pairs:
\text{Entropy:} \quad F_2 = -\sum_{i}\sum_{j} P(i, j) \log\left( P(i, j) \right),

\text{Homogeneity:} \quad F_3 = \sum_{i}\sum_{j} \frac{P(i, j)}{1 + |i - j|}, \quad \text{and}

\text{Contrast:} \quad F_4 = \sum_{i}\sum_{j} (i - j)^2 P(i, j),
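A minimal sketch of these texture features, assuming a single displacement vector and a small number of gray levels (the paper does not state the exact quantization):

```python
import numpy as np

def glcm_features(img, levels, d=(0, 1)):
    """Normalized gray-level co-occurrence matrix for displacement d,
    plus the entropy (F2), homogeneity (F3) and contrast (F4) features."""
    P = np.zeros((levels, levels))
    rows, cols = img.shape
    dr, dc = d
    # count co-occurring gray-level pairs at the chosen displacement
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                P[img[r, c], img[r2, c2]] += 1
    P /= P.sum()
    nz = P[P > 0]                      # avoid log(0) in the entropy
    i, j = np.indices(P.shape)
    entropy = -np.sum(nz * np.log(nz))                # F2
    homogeneity = np.sum(P / (1.0 + np.abs(i - j)))   # F3
    contrast = np.sum((i - j) ** 2 * P)               # F4
    return entropy, homogeneity, contrast
```

A constant patch yields zero entropy and contrast with maximal homogeneity, while a checkerboard maximizes contrast for a horizontal displacement, which matches the intuition behind the three features.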
Figure 2: (a) Points of interest (marked in red), (b) clusters derived after applying DBScan (yellow rectangles enclosing interest points), (c) neighborhoods of clusters on the image (represented as blue ellipses), (d) candidate regions (enclosed in the light-green rectangles).
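The clustering step shown in Figure 2(b) relies on DBScan. A compact, illustrative implementation of the basic algorithm is sketched below; the eps and min_pts values in the usage are arbitrary and would in practice be tuned to the keypoint density:

```python
import numpy as np
from collections import deque

def dbscan(points, eps, min_pts):
    """Label 2-D points with cluster ids 0, 1, ... or -1 for noise."""
    n = len(points)
    # pairwise distances and eps-neighborhoods (including the point itself)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(d[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cid = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                     # already labeled, or not a core point
        labels[i] = cid                  # grow a new cluster from core point i
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cid
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])   # j is core: keep expanding
        cid += 1
    return labels
```

Dense groups of interest points become clusters (the yellow rectangles of Figure 2(b)), while isolated points are marked as noise and discarded.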
The feature extraction approach is based on evaluating well-normalized local features of image gradient orientations in a dense grid [7]. The basic idea is that local object appearance and shape can be characterized rather well by the distribution of local intensity gradients. The features are selected to be robust to illumination changes and imaging conditions; this is achieved by encoding the object boundary orientations and discarding information related to the local colour or intensity. The histogram representation also enhances the method's robustness to rotation changes.
In practice, the implementation involves dividing the image window into small spatial regions (cells) and, for each cell, accumulating a local 1-D histogram of gradient orientations over the pixels of the cell.
[Figure 3: Overview of the gradient-feature pipeline: Compute Gradients → Normalize Contrast Over Spatial Regions → Verification Stage (AdaBoost)]
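The cell-histogram accumulation described above can be sketched as follows; the 8-pixel cells and 9 unsigned-orientation bins follow common practice for this family of descriptors [7] and are assumptions, not parameters stated here (block contrast normalization is omitted for brevity):

```python
import numpy as np

def cell_orientation_histograms(img, cell=8, bins=9):
    """Per-cell 1-D histograms of gradient orientation, weighted by
    gradient magnitude (unsigned orientations in [0, 180) degrees)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    h, w = img.shape
    ny, nx = h // cell, w // cell
    hist = np.zeros((ny, nx, bins))
    bin_width = 180.0 / bins
    for cy in range(ny):
        for cx in range(nx):
            sl = np.s_[cy*cell:(cy+1)*cell, cx*cell:(cx+1)*cell]
            # vote each pixel's magnitude into its orientation bin
            b = np.minimum((ang[sl] / bin_width).astype(int), bins - 1)
            for k in range(bins):
                hist[cy, cx, k] = mag[sl][b == k].sum()
    return hist
```

On a vertical step edge, all the gradient energy falls into the 0-degree bin of the cells the edge crosses, illustrating how the histograms encode boundary orientation rather than raw intensity.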
2.4. Classification
For the candidate regions defined in 2.2, we evaluate the
feature set representing the local image content and employ
a cascade of boosted classifiers [13] at the verification stage.
The underlying idea in this approach is that smaller and
therefore more efficient boosted classifiers can be
constructed in order to reject many of the negative areas
while detecting almost all positive instances. Simpler
classifiers are used to reject the majority of regions of
interest before more complex classifiers are called upon to
achieve low false positive rates. The overall form of the
detection process is that of a degenerate decision tree, also
called a cascade (see Fig. 4).
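A minimal sketch of such a degenerate decision tree, assuming each boosted stage is summarized by a scoring function and a recall-oriented threshold (both hypothetical stand-ins for the trained AdaBoost stages of [13]):

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Stage:
    score: Callable[[Sequence[float]], float]  # boosted stage score
    threshold: float                           # tuned for near-100% recall

def cascade_classify(features, stages: List[Stage]) -> bool:
    """Degenerate decision tree: a negative outcome at any stage rejects
    the region immediately; only survivors reach later, costlier stages."""
    for stage in stages:
        if stage.score(features) < stage.threshold:
            return False          # rejected early, no further work spent
    return True                   # passed every stage: candidate detection
```

Because most candidate regions are negatives, the cheap early stages discard the bulk of them, and the expensive late stages run on only a small fraction of the windows, which is the efficiency argument made above.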
[Figure 4: The detection cascade: candidate regions pass through a sequence of classifiers 1, ..., N−1; a negative response (F) at any stage rejects the region immediately, while regions accepted (T) by all stages are passed on for further processing.]

3. EXPERIMENTAL RESULTS
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP} \quad (5)
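These are the standard confusion-matrix definitions underlying the ROC curves reported below; as a trivial helper:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP),
    as in equation (5); inputs are raw confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)
```

ROC curves are then traced by sweeping the final-stage decision threshold and plotting sensitivity against 1 − specificity.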
Figure 5: Results of our object detection system. (a)-(d): person detection under different scales and poses, (e)-(h): detection of cars, (i)-(l): detection of airplanes.
[ROC curves (sensitivity versus 1 − specificity) for the Cars, AirPlanes and Persons classes, together with a plot comparing Our Detector against the HOGsTriggs detector.]

4. ACKNOWLEDGEMENTS

This research work was supported in part by the European Commission under contracts FP6-2005-IST-5 IMAGINATION, FP6-027685 MESH and FP6-26978 X-Media.

5. REFERENCES