EXPERIMENTS WITH PATCH-BASED OBJECT CLASSIFICATION
R. G. J. Wijnhoven 1,2 and P. H. N. de With 2,3
1 Bosch Security Systems B.V., Eindhoven, The Netherlands
2 Tech. Univ. Eindhoven, Eindhoven, The Netherlands
3 LogicaCMG, Eindhoven, The Netherlands
to obtain spatial invariance, the maximum is taken over a local spatial neighborhood around each pixel and the resulting image is sub-sampled. Because of this down-sampling, the number of C1 features is much lower than the number of S1 features. The resulting C1 feature maps for the car image in Figure 2 (33 elements in height at band zero and 12 at band 7) are shown in Figure 3.
Figure 3: C1 feature maps for the S1 responses from Figure 2 (at band 0). The C1 maps are re-scaled for visualization.
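To make the pooling step concrete, the following is a minimal sketch of the C1 computation, assuming S1 responses are stored as a NumPy array with one channel per Gabor orientation; the pool size and sub-sampling stride per band follow the model of Serre et al. [2] and are passed in as parameters here. The function name and data layout are illustrative assumptions, not the authors' code.

```python
import numpy as np

def c1_pool(s1_band, pool_size, stride):
    # s1_band: S1 responses of shape (height, width, orientations).
    # C1 layer: take the maximum over a pool_size x pool_size spatial
    # neighborhood, then sub-sample by evaluating only every `stride` pixels.
    h, w, num_orients = s1_band.shape
    ys = range(0, h - pool_size + 1, stride)
    xs = range(0, w - pool_size + 1, stride)
    out = np.empty((len(ys), len(xs), num_orients))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            out[i, j] = s1_band[y:y + pool_size, x:x + pool_size].max(axis=(0, 1))
    return out
```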
The next layer (S2) in the processing chain of the model applies template matching of image patches onto the C1 feature maps. This can be compared to the simple layer S1, where the filter response is generated for several Gabor filters. The template matching is done for several image patches (prototypes). These patch prototypes are extracted from natural images at a random band and spatial location, at the C1 level. Each prototype contains all four orientations, and prototypes are extracted at four different sizes: 4 × 4, 8 × 8, 12 × 12 and 16 × 16 elements. Hence, a 4 × 4 patch contains 64 C1 elements (4 × 4 positions times four orientations).
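As an illustration of this sampling step, here is a minimal sketch under the assumption that each training image is represented by its list of per-band C1 arrays (as produced by a pooling routine like the one above); the function name and data layout are hypothetical.

```python
import numpy as np

def extract_prototypes(c1_pyramids, num_prototypes, sizes=(4, 8, 12, 16), seed=0):
    # c1_pyramids: one entry per training image; each entry is a list of
    # per-band C1 arrays of shape (height, width, 4 orientations).
    rng = np.random.default_rng(seed)
    prototypes = []
    while len(prototypes) < num_prototypes:
        n = int(rng.choice(sizes))                  # patch size in C1 elements
        pyramid = c1_pyramids[rng.integers(len(c1_pyramids))]
        band = pyramid[rng.integers(len(pyramid))]  # random band
        if band.shape[0] < n or band.shape[1] < n:
            continue                                # band too small for this size
        y = rng.integers(band.shape[0] - n + 1)     # random spatial location
        x = rng.integers(band.shape[1] - n + 1)
        # keep all four orientations of the n x n window
        prototypes.append(band[y:y + n, x:x + n, :].copy())
    return prototypes
```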
The response of a prototype patch P over the C1 feature map C of the input image I is defined by a radial basis function that normalizes the response to the patch size considered, as proposed by Mutch and Lowe [15].
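The formula itself is not restated in the text. In the model of Mutch and Lowe [15], the response to a prototype P at a C1 window X takes a Gaussian form, with a factor α that grows with the patch size n so that larger patches are not penalized for containing more elements; the exact value of α below is an assumption based on [15]:

```latex
R(X, P) = \exp\left(-\,\frac{\lVert X - P \rVert^{2}}{2\sigma^{2}\alpha}\right),
\qquad \alpha = \left(\tfrac{n}{4}\right)^{2} \text{ for an } n \times n \text{ patch.}
```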
Examples of image patches (prototypes) and the corresponding S2 responses are shown in Figure 4 for the car image from Figures 2 and 3. Note that we only show two patch prototypes, each of size 4 × 4 C1 elements.
The last step in the processing architecture is the extraction of the most relevant response. For each patch prototype considered, the maximum patch response over all bands and all spatial locations is stored as the final value in the feature vector. Therefore, the final feature vector has a dimensionality equal to the number of prototype patches used. In our implementation, we used 1,000 prototype patches. Note that by choosing a higher or lower number of C1 patch prototypes, the required computation power can be scaled linearly.
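Combining the two previous steps, a sketch of the C2 computation could look as follows; it reuses the hypothetical data layout from the sketches above and the assumed response form after [15].

```python
import numpy as np

def compute_c2(pyramid, prototypes, sigma=1.0):
    # One C2 value per prototype: the maximum radial-basis response of the
    # prototype over all bands and all spatial locations of the image.
    features = np.full(len(prototypes), -np.inf)
    for i, patch in enumerate(prototypes):
        n = patch.shape[0]
        alpha = (n / 4.0) ** 2          # patch-size normalization (assumed, after [15])
        for band in pyramid:
            if band.shape[0] < n or band.shape[1] < n:
                continue
            for y in range(band.shape[0] - n + 1):
                for x in range(band.shape[1] - n + 1):
                    window = band[y:y + n, x:x + n, :]
                    d2 = float(np.sum((window - patch) ** 2))
                    response = np.exp(-d2 / (2.0 * sigma ** 2 * alpha))
                    features[i] = max(features[i], response)
    return features
```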
In order to classify the resulting C2 feature vector, we use a one-vs-all Support Vector Machine (SVM) classifier with a linear kernel. The SVM with the highest output score defines the output class of the feature vector. The Torch3 library [16] was used for the implementation of the SVM.
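Torch3 [16] is a C++ library; purely as an illustration of the one-vs-all decision rule, the sketch below swaps in scikit-learn's LinearSVC. The class name is hypothetical and this is not the authors' implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

class OneVsAllSVM:
    # One binary linear SVM per class; the SVM with the highest
    # output score defines the predicted class.
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.svms_ = [LinearSVC().fit(X, (y == c).astype(int))
                      for c in self.classes_]
        return self

    def predict(self, X):
        # scores has shape (num_classes, num_samples)
        scores = np.array([svm.decision_function(X) for svm in self.svms_])
        return self.classes_[np.argmax(scores, axis=0)]
```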
4 Dataset

Most available datasets for classification focus on the domain of generic object detection, whereas surveillance-specific datasets have been created for the purpose of object tracking and therefore contain a strictly limited number of different objects. For the purpose of object classification, a high number of different objects is required. Ma and Grimson [3] presented a limited dataset for separating various car types. Since future smart cameras should be able to make a distinction between more object classes, we have created a new dataset.

A one-hour video capture was made at CIF resolution (352 × 288 pixels) from a single, static camera monitoring a traffic crossing. After applying the tracking algorithm proposed by the authors of [4], the resulting object images (of size 10-100 pixels) were manually adjusted where required, to ensure clean blob extraction and to avoid any possible negative interference with the new algorithm. For this reason, redundant images, images of occluded objects and images containing false detections have been removed. Because of the limited time-span of the recording, the scene conditions do not change significantly. The final dataset contains 9,233 object images.

The total object set has been split into the following 13 classes: trailers, cars, city buses, Phileas buses (the name of a specific type of bus), small buses, trucks, small trucks, persons, cleaning cars, bicycles, jeeps, combos and scooters. Some examples of each object class are shown in Figure 5.

In addition to the new surveillance dataset, we used the dataset from Bose and Grimson [5] and the PETS 2001 dataset (Dataset 1, Cameras 1 & 2, testing sets only; the complete dataset is available from http://ftp.pets.rdg.ac.uk/). Results are presented in the next section.

4.1 Orientation separation

The detection performance of the human visual system depends on the 3D viewpoint of the objects learned. Logothetis et al. [17] demonstrate this view-dependence of the visual system and state that detection performance decreases with a deviation in viewpoint angle. When the object rotation increases beyond roughly 30 degrees, the detection performance decreases drastically.

Since some orientation measure is already produced by the tracking algorithm, we can use this knowledge a priori. To be independent of the tracking performance, the object orientations have been manually annotated for all images in the dataset. For each object class of the total set of C object classes, we create N new classes, where N equals the number of quantized orientations. However, note that since the object orientation is given a priori, we use N independent classification problems of C classes each.
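A sketch of how this separation might look in code, reusing the hypothetical OneVsAllSVM class from the previous sketch; the routing by a priori orientation is the point, and the function names are assumptions.

```python
def train_orientation_separated(X, labels, orientation_bins, num_bins):
    # Train N independent C-class classifiers, one per quantized orientation
    # bin, instead of a single classifier over N x C combined classes.
    # X, labels, orientation_bins are assumed to be NumPy arrays.
    models = {}
    for o in range(num_bins):
        mask = orientation_bins == o
        models[o] = OneVsAllSVM().fit(X[mask], labels[mask])
    return models

def predict_with_orientation(models, feature_vector, orientation_bin):
    # The orientation is given a priori (tracking or annotation),
    # so only the matching classifier is consulted.
    return models[orientation_bin].predict(feature_vector.reshape(1, -1))[0]
```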
Figure 4: Patch response of 4 × 4 C1 elements for two example patches. The right 8 images represent the S2 feature maps at each band. The top prototype results in higher responses in the medium bands, whereas the lower prototype gives a higher response in the lower bands.
Table 1: Detection rates for the four-class classification
problem, without and with orientation separation.
Table 2: Confusion matrix of the four-class classification problem for the normal and the orientation-separated datasets (10 training samples, values in %).

in the correct classification. Note that the two cameras have different viewing angles, neither of which is equal to the camera configuration of the training set (see Section 4).
new detector. In contrast, the patch-based approach is a more general approach that generates one feature vector for every object image, after which the SVM classifier is trained to distinguish between the application-specific object classes.

Figure 7: Generic object modeling architecture, containing multiple detectors.
In our view, when aiming at a generic object modeling architecture, we envision a convergence between application-specific techniques and application-independent algorithms, thereby leading to a mixture of both types of approaches. The architecture shown in Figure 7 should be interpreted in this way. For example, in one detector the pixel processing may be generic, whereas in the neighboring detector the pixel processing could be application-specific. The more generic detectors may be re-used for different purposes in several applications.
References

[1] J. Lou, T. Tan, W. Hu, H. Yang, and S. Maybank, "3-D model-based vehicle tracking," IEEE Trans. Image Proc., vol. 14, pp. 1561-1569, October 2005.

[2] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, "Robust object recognition with cortex-like mechanisms," IEEE Trans. Pattern Anal. Machine Intell. (PAMI), vol. 29, pp. 411-426, March 2007.

[3] X. Ma and W. E. L. Grimson, "Edge-based rich representation for vehicle classification," in Proc. IEEE Int. Conf. Computer Vision (ICCV), vol. 2, pp. 1185-1192, October 2005.

[4] S. Muller-Schneiders, T. Jager, H. Loos, and W. Niem, "Performance evaluation of a real time video surveillance system," in Proc. 2nd Joint IEEE Int. Workshop Visual Surv. and Perf. Eval. of Tracking and Surv. (VS-PETS), pp. 137-144, October 2005.

[5] B. Bose and W. E. L. Grimson, "Improving object classification in far-field video," in Proc. IEEE Comp. Vision Pattern Recogn. (CVPR), vol. 2, pp. 181-188, IEEE Comp. Soc., Washington DC, USA, June 2004.

[6] I. Haritaoglu, D. Harwood, and L. Davis, "W4: real-time surveillance of people and their activities," IEEE Trans. Pattern Anal. Machine Intell. (PAMI), vol. 22, pp. 809-830, August 2000.

[7] R. Wijnhoven and P. de With, "3D wire-frame object-modeling experiments for video surveillance," in Proc. 27th Int. Symp. Inform. Theory in Benelux, pp. 101-108, June 2006.

[8] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. IEEE Computer Society Conf. Comp. Vision Pattern Recogn. (CVPR), vol. 1, pp. 511-518, 2001.

[9] A. Mohan, C. Papageorgiou, and T. Poggio, "Example-based object detection in images by components," IEEE Trans. Pattern Anal. Machine Intell. (PAMI), vol. 23, pp. 349-361, April 2001.

[10] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 886-893, June 2005.

[11] Y. Ke and R. Sukthankar, "PCA-SIFT: a more distinctive representation for local image descriptors," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 506-513, 2004.

[12] T. Serre, Learning a Dictionary of Shape-Components in Visual Cortex: Comparison with Neurons, Humans and Machines. PhD thesis, Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory, April 2006.

[13] D. H. Hubel and T. N. Wiesel, "Receptive fields of single neurons in the cat's visual system," Journal of Physiology, vol. 148, pp. 574-591, October 1959.

[14] M. Riesenhuber and T. Poggio, "Models of object recognition," Nature Neuroscience, vol. 3, pp. 1199-1204, 2000.

[15] J. Mutch and D. Lowe, "Multiclass object recognition with sparse, localized features," in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 11-18, June 2006.

[16] R. Collobert, S. Bengio, and J. Mariethoz, "Torch: a modular machine learning software library," tech. rep., Dalle Molle Instit. for Percep. Artif. Intell., Valais, Switzerland, October 2002.

[17] N. K. Logothetis, J. Pauls, and T. Poggio, "Shape representation in the inferior temporal cortex of monkeys," Current Biology, vol. 5, pp. 552-563, March 1995.

[18] P. Viola, M. Jones, and D. Snow, "Detecting pedestrians using patterns of motion and appearance," in Proc. 9th IEEE Int. Conf. Computer Vision (ICCV), vol. 2, pp. 734-741, October 2003.

[19] B. Wu and R. Nevatia, "Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors," in Proc. 10th IEEE Int. Conf. Computer Vision (ICCV), vol. 1, pp. 90-97, IEEE Comp. Soc., Washington DC, USA, 2005.

[20] F. Zuo, Embedded face recognition using cascaded structures. PhD thesis, Technische Universiteit Eindhoven, October 2006.