People Tracking in RGB-D Data With On-line Boosted Target Models
All authors are with the Social Robotics Lab, Department of Computer Science, University of Freiburg, Germany {luber,spinello,arras}@informatik.uni-freiburg.de.
Authorized licensed use limited to: The University of Toronto. Downloaded on October 28,2022 at 02:53:05 UTC from IEEE Xplore. Restrictions apply.

I. Introduction

People detection and tracking is an important and fundamental component for many robots, interactive systems and intelligent vehicles. Popular sensors for this task are cameras and range finders. While both sensing modalities have advantages and drawbacks, their distinction may become obsolete with the availability of affordable and increasingly reliable RGB-D sensors that provide both image and range data.

Many researchers in robotics have addressed the issue of detecting and tracking people in range data. Early works were based on 2D data in which people have been detected using ad-hoc classifiers that find moving local minima in the scan [1], [2]. A learning approach has been taken by Arras et al. [3], where a classifier for 2D point clouds has been trained by boosting a set of geometric and statistical features.

People detection and tracking in 3D range data is a rather new problem with little related work. Navarro et al. [4] collapse the 3D scan into a virtual 2D slice to find salient vertical objects above ground and classify a person by a set of SVM classified features. Bajracharya et al. [5] detect people in point clouds from stereo vision by processing vertical objects and considering a set of geometrical and statistical features of the cloud based on a fixed pedestrian model. Unlike these works that require a ground plane assumption, Spinello et al. [6] overcome this limitation via a voting approach of classified parts and a top-down verification procedure that learns an optimal feature set and volume tessellation.

In the computer vision literature, the problem of detecting, tracking and modeling humans has been extensively studied [7], [8], [9], [10]. A major difference to range-based systems is that the richness of image data makes it straightforward to learn target appearance models. For this reason, visual tracking systems can achieve good results with methods as simple as independent particle filters with nearest-neighbor data association [11]. Dense depth data from stereo are used by Beymer and Konolige [12] to support foreground segmentation in an otherwise vision-based people detection and tracking system. They use a set of binary person templates to detect people in images and demonstrate multi-person tracking with learned appearance-based target models. The works of [13], [14] detect people in intensity images and track them in 3D. In [15] a stereo system for combining intensity images, stereo disparity maps, and optical flow is used to detect people. Multi-modal detection and tracking of people is performed in [16] where a trainable 2D range data and camera system is presented.

This paper advances the state of the art in the following aspects. First, we address the novel problem of detecting and tracking people in RGB-D data. We combine an a priori person detector with an on-line learned person detector and a multi-hypothesis tracker (MHT), able to estimate the motion state of multiple people in 3D. Learning individual target models is a new aspect to range data-based object tracking that usually deals with targets of identical appearance. To this end, we adapt the on-line learning method from Grabner et al. [17] to RGB-D data. We present a novel framework to integrate the two detectors and the tracker that involves a track interpretation feedback to control learning. This enables the system to bridge gaps of misdetections of the a priori detector and handle target occlusions while avoiding drift of the on-line detector. Finally, we give quantitative results using the CLEAR MOT performance metric. Unlike the above mentioned works that integrate multiple sensory modalities, we consider image and range data as equally important cues for detection, tracking, and target model adaptation. We further present a novel integration framework to effectively combine a tracker with on-line learned target classifiers.

The paper is structured as follows: the a priori people detector is briefly summarized in the next section followed by the description of our on-line AdaBoost learning approach for target appearances in RGB-D data in Section III. The integration of this learning procedure into the tracking system is described in Section IV. Section V describes the experiments and gives the results. Section VI concludes the paper.

II. Detection of People in 3D Range Data

In this section we briefly summarize the a priori people detector used in this paper. We rely on a novel RGB-D person detector called Combo-HOD (Combined Histograms of Oriented Depths and Gradients). The method takes inspiration from Histogram of Oriented Gradients (HOG) introduced by Dalal and Triggs [7] and combines the HOG detector in the color image with a novel approach in the depth image called Histograms of Oriented Depths (HOD).

Since RGB-D data contains both color and depth information, the Combo-HOD detector combines the two sensory cues. HOD descriptors are computed in the depth image and HOG descriptors are computed in the color image. They are fused on the level of detections via a weighted mean of the probabilities obtained by a sigmoid fitted to the SVM outputs. HOD includes a depth-informed scale-space search in which the used scales in an image are first collected and then tested for compatibility with the respective depth. This test is made particularly efficient by the use of integral tensors, an extension of integral images over several scales. This strategy dramatically reduces the number of descriptors computed in the image at improved detection rates. For more details, the reader is referred to [18].

The outputs of the detector in each step are the positions and size of all targets in 3D space and the center and size of the bounding boxes in the depth images. They are the observations $z_i(t)$ that constitute the set of $m_k$ observations $Z(t)$ at time index $t$.

III. On-line Boosting

The detector described in the previous section learns a generic person model from a priori labeled data. In this section, we describe the use of on-line boosting to learn target appearance models in RGB-D data, later used to guide data association in the tracking system.

Boosting is a widely used technique to improve the accuracy of learning algorithms. Given training samples $x$ with labels $y$, a strong classifier $H(x)$ is computed as a linear combination of a set of weighted hypotheses called weak classifiers $h(x)$. The discrete AdaBoost algorithm by Freund and Schapire [19] is among the most popular boosting algorithms. The method trains weak classifiers from labeled training samples $(x, y)$, initialized with uniform weights $w_i$ associated to each $x$. Learning is done in rounds where the weights are updated based on the mistakes of the previous weak learner. By increasing the weights of the wrongly classified samples the algorithm focuses on the difficult examples.

On-line boosting, initially proposed by Oza and Russell [20], processes each training instance "on arrival" without the need for storage and reprocessing, and maintains a current hypothesis that reflects all the training samples seen so far. The approach has been applied for object detection while tracking by Grabner et al. [17]. We build upon the latter to develop our on-line people detector in RGB-D data.

A. Updating the Weak Classifiers

Unlike the off-line approach to boosting, the on-line algorithm presents training samples only once and discards them after training. The weak classifiers have thus to be updated in an on-line fashion each time a new training sample is available. As the difficulty of the samples is not known in advance, the computation of the weight distribution of the samples is a critical issue. The basic idea of on-line boosting is that the weight of a sample (called importance $\lambda$ in this context) can be estimated by propagating it through a fixed chain of weak classifiers [20]. If the sample is misclassified, $\lambda$ is increased proportionally to the error of the weak classifier. Therefore, the importance has the same effect as the adapted weight in the off-line approach. The error of the $i$-th weak classifier is estimated from the summed weights of the correctly ($\lambda_i^{corr}$) and wrongly ($\lambda_i^{wrong}$) classified samples,

$$e_i = \frac{\lambda_i^{wrong}}{\lambda_i^{wrong} + \lambda_i^{corr}}. \quad (1)$$

B. On-line-boosting for Feature Selection

For the purpose of learning target models during tracking, Grabner et al. [17] propose feature selectors. The main idea is to apply on-line boosting not directly to
the weak classifiers but to the selectors. A selector $h^{sel}$ selects the best weak classifier from a pool of $M$ weak learners $F$ with 'best' being defined by the lowest error. With the number of selectors $N$ being a fixed parameter, the following procedure is repeated for all selectors when a new sample $(x, y)$ arrives: First, all weak classifiers are updated and the best one, denoted $m^+$, is selected

$$h_n^{sel}(x) = h_{m^+}^{weak}(x) \quad (2)$$

with $m^+ = \arg\min_m (e_{n,m})$ and $e_{n,m}$ defined like Eq. 1 with subscript $n, m$ for $i$. Then, the voting weight $\alpha_n = \frac{1}{2} \cdot \ln\left(\frac{1 - e_n}{e_n}\right)$ is computed where $e_n = e_{n,m^+}$ and the updated importance weight $\lambda$ is propagated to the next selector $h_{n+1}^{sel}$. Similar to AdaBoost, and consistent with Section III-A, $\lambda$ is decreased if $h_n^{sel}$ predicts $x$ correctly and increased otherwise.

The strong classifier is finally obtained by computing the confidence as a linear combination of the $N$ selectors and applying the signum function,

$$\kappa(x) = \sum_{n=1}^{N} \alpha_n \cdot h_n^{sel}(x), \qquad H(x) = \mathrm{sign}(\kappa(x)). \quad (3)$$

Unlike the off-line version, the on-line procedure creates an always-available strong classifier in an any-time fashion. In order to increase the diversity of the classifier pool $F$ and to adapt to appearance changes of the targets, at the end of each iteration, the worst weak classifier is replaced by one randomly chosen from $F$.

C. Features

We take advantage of the richness of RGB-D data by computing three types of features that correspond to the weak classifiers: Haar-like features [21] in the intensity image (converted from the RGB values), Haar-like features in the depth image, and illumination agnostic Lab color features in the RGB image. Lab features are computed by summing up the intensity values in a* (b*) space under the area. The advantage of the Lab color model is that features in a* or b* space can compactly and robustly subsume entire RGB histograms.

A total of $M$ features is computed where the initial number of features is $M/3$ for all types. Given the above mentioned adaptation mechanism, their relative numbers can change to best describe a target dynamically.

The features are computed in rectangular areas sampled with randomized positions and scales in the bounding box associated to each target. This is done once at initialization and then kept fixed over the lifetime of a target (up to the weak features that get replaced). The best ten features of two persons are shown in Fig. 2.

Fig. 2. Bounding boxes of two detected persons in the RGB and depth images. The ten best features of each on-line detector are marked with colored rectangles. Haar-like features calculated on the intensity image are shown in green and Haar-like features computed on the depth image are marked in red. The Lab color features calculated on the RGB image are depicted in blue.

D. On-line Boosting for Tracking

On-line boosting enables a tracker to continuously update a target model to optimally discriminate it from the current background. This is a formulation of tracking as a classification problem [22] which is implemented by a confidence maximization procedure around the current tracking region. The region is obtained as the bounding box of the previous detection. All features within the region are considered the positively labeled foreground samples. The negative samples are obtained by sweeping the bounding box over a local neighborhood. The classifier is then evaluated at each sweep position of this neighborhood yielding a confidence map whose maximum is taken as the new position of the tracking region. The classifier is updated in this region and the process is continued. The evolution of the confidence values over time can be seen in Fig. 5.

Unlike [17] where the new region is bootstrapped from the previous detection, we use the bounding box position of the a priori detector to recenter the on-line detector. This strategy avoids a key problem of on-line adaptation, namely drifting of the model to background, clutter, or other targets.

IV. Integration into the Tracking System

In this section we describe how the on-line detector is integrated into a Kalman filter based multi-hypothesis tracking framework (MHT). For reasons of limited space, we will only discuss the aspects that change in the MHT and refer to [23], [24] for more details.

In short, the MHT algorithm hypothesizes about the target states by considering all statistically feasible assignments between measurements and tracks and all possible interpretations of measurements as false alarms or new tracks and tracks as matched, occluded or obsolete. Thereby, the MHT handles the entire life-cycle of tracks from creation and confirmation to occlusion and deletion.

Formally, let $\xi(t) = (x_t \; y_t \; z_t \; \dot{x}_t \; \dot{y}_t \; \dot{z}_t)^T$ be the filtered state of a track at time $t$ with position and velocity information in 3D and $\Sigma$ its associated $6 \times 6$ covariance. Let $Z(t) = \{z_i(t)\}_{i=1}^{m_t}$ be the set of $m_t$ observations which in our case is the set of detected people in RGB-D data. Observations consist of a 3D position from the a priori detector $z_i(t)$ and a training sample $x_i(t)$ from the on-line detector. The sample $x_i(t)$ is a vector of stacked feature values computed in the rectangular areas within the current tracking region.

Let $\Omega_l(t)$ be the $l$-th hypothesis at time $t$ and $\Omega_{p(l)}^{t-1}$ the parent hypothesis from which $\Omega_l(t)$ was derived. Let
[Figure: system block diagram. Blocks: A Priori Detector, On-line Detector, Multi-Hypothesis Tracker. Labels: sensory data, bounding box positions, detection hypotheses z, detection hypotheses z*, 3D target motion states, measurement and track interpretations (matched, new, occluded). Legend: data, controls.]
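The selector-based on-line boosting of Section III-B (Eqs. 1 to 3) can be illustrated with a minimal sketch. It is not the authors' implementation. The following are assumptions made only for illustration: the weak learner is a nearest-class-mean rule on a single scalar feature (the paper uses Haar-like and Lab features), each selector owns its own pool, the sums $\lambda^{corr}$ and $\lambda^{wrong}$ start at 1, and the importance is rescaled by $1/(2(1-e))$ for a correct and $1/(2e)$ for a wrong prediction, as is common in on-line boosting. All class and parameter names are hypothetical.

```python
import math
import random


class WeakClassifier:
    """Hypothetical weak learner: nearest class mean on one scalar feature."""

    def __init__(self, feature_index):
        self.feature_index = feature_index
        self.mean = {+1: 0.0, -1: 0.0}    # importance-weighted mean per label
        self.weight = {+1: 0.0, -1: 0.0}  # accumulated importance per label
        self.lam_corr = 1.0               # summed importance, correct samples
        self.lam_wrong = 1.0              # summed importance, wrong samples

    def update(self, x, y, lam):
        v = x[self.feature_index]
        self.weight[y] += lam
        self.mean[y] += lam * (v - self.mean[y]) / self.weight[y]

    def predict(self, x):
        v = x[self.feature_index]
        return +1 if abs(v - self.mean[+1]) <= abs(v - self.mean[-1]) else -1

    def error(self):
        # Eq. (1): e = lam_wrong / (lam_wrong + lam_corr)
        return self.lam_wrong / (self.lam_wrong + self.lam_corr)


class Selector:
    """Keeps a pool of weak classifiers and selects the one with lowest error."""

    def __init__(self, num_features, pool_size, rng):
        self.num_features = num_features
        self.rng = rng
        # Assumption: cycle through the features so each one starts in the pool.
        self.pool = [WeakClassifier(m % num_features) for m in range(pool_size)]
        self.best = 0
        self.alpha = 0.0

    def update(self, x, y, lam):
        for h in self.pool:
            h.update(x, y, lam)
            if h.predict(x) == y:
                h.lam_corr += lam
            else:
                h.lam_wrong += lam
        errors = [h.error() for h in self.pool]
        indices = range(len(self.pool))
        self.best = min(indices, key=errors.__getitem__)      # Eq. (2)
        e = errors[self.best]
        self.alpha = 0.5 * math.log((1.0 - e) / e)            # voting weight
        correct = self.pool[self.best].predict(x) == y
        # Replace the worst weak classifier (never the selected one).
        worst = max((m for m in indices if m != self.best),
                    key=errors.__getitem__)
        self.pool[worst] = WeakClassifier(self.rng.randrange(self.num_features))
        # Importance passed on: smaller if correct, larger if wrong.
        return lam / (2.0 * (1.0 - e)) if correct else lam / (2.0 * e)


class OnlineBooster:
    def __init__(self, num_features, num_selectors=5, pool_size=6, seed=0):
        rng = random.Random(seed)
        self.selectors = [Selector(num_features, pool_size, rng)
                          for _ in range(num_selectors)]

    def update(self, x, y):
        lam = 1.0
        for selector in self.selectors:
            lam = selector.update(x, y, lam)

    def confidence(self, x):
        # Eq. (3): kappa(x) = sum_n alpha_n * h_n^sel(x)
        return sum(s.alpha * s.pool[s.best].predict(x) for s in self.selectors)

    def predict(self, x):
        return +1 if self.confidence(x) >= 0.0 else -1
```

Calling `update(x, y)` once per labeled sample keeps the strong classifier available at any time; `confidence(x)` corresponds to $\kappa(x)$ and `predict(x)` to $H(x)$. The initial sums of 1 keep the estimated error strictly between 0 and 1, so the logarithm in the voting weight is always defined.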
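The confidence maximization step of Section III-D can be sketched as follows. This is a simplified illustration, not the authors' code: boxes are `(x, y, w, h)` tuples in pixels, the sweep uses a fixed square neighborhood, and `confidence` stands for any callable that returns the strong classifier confidence $\kappa$ for a box. The function names, `radius` and `step` are hypothetical.

```python
def sweep_confidence(confidence, box, radius=4, step=2):
    """Evaluate the classifier at each sweep position around `box`.

    Returns the box with maximal confidence and the confidence map,
    keyed by the (dx, dy) offset from the current tracking region.
    """
    x, y, w, h = box
    confidence_map = {}
    for dy in range(-radius, radius + 1, step):
        for dx in range(-radius, radius + 1, step):
            confidence_map[(dx, dy)] = confidence((x + dx, y + dy, w, h))
    best_dx, best_dy = max(confidence_map, key=confidence_map.get)
    return (x + best_dx, y + best_dy, w, h), confidence_map


def negative_boxes(box, radius=4, step=2):
    """Shifted copies of `box` that serve as negatively labeled samples."""
    x, y, w, h = box
    return [(x + dx, y + dy, w, h)
            for dy in range(-radius, radius + 1, step)
            for dx in range(-radius, radius + 1, step)
            if (dx, dy) != (0, 0)]
```

In the approach described above the region would additionally be recentered on the a priori detection whenever one is available, which this sketch leaves out.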
Fig. 5. Evolution of the confidence of the on-line detector. The top image shows the confidences over the life cycle of a track. After initialization the values achieve steady state. Person 2 is occluded twice between frames 172 to 185 and frames 192 to 199. Thanks to the feedback from the MHT tracker, the on-line detector pauses its adaptation. This strategy avoids drifting of the model to background, clutter, or other targets. When the person reappears, adaptation is resumed immediately with high confidence.

Fig. 6. Visualization of the 3D point cloud produced by the three Kinect sensors including the positions and trajectories of eight of 31 tracks in the data set. The colored disks mark the current Kalman filter estimates of the target positions, the small dots show their past trajectories. The tracker maintains full 3D estimates as can be seen by the dark blue trajectory of the subject coming down the stairs.

sensor. The data set has been manually annotated to include the bounding box in 2D depth image space, the visibility of subjects (fully visible/partially occluded), and the data association ground truth of the tracks. A total of 3021 instances of people in 1133 frames and 31 tracks have been labeled. The data set will be made available on the laboratory webpage at publication date of this paper.

The sensory setup for data collection is shown in Fig. 4. It consists of three vertically mounted Kinect sensors that jointly extend the field of view to 130° × 50°. Measures have been taken to calibrate the intrinsics and extrinsics of the setup and to guarantee synchronized acquisition of the three images at frame rate.

The parameters of the MHT have been learned from a training data set over 600 frames. The detection probability is set to $p_{det} = 0.99$ and the termination likelihood to $\lambda_{del} = 30$. The average rates of new tracks and false alarms are determined to be $\lambda_{new} = 0.001$ and $\lambda_{fal} = 0.005$, respectively. Further, the maximal number of hypotheses $N_{Hyp}$ is set to 100. The strong classifiers of the targets are based on 50 selectors which are trained with 50 weak hypotheses.

To assess the impact of the on-line boosting on the tracking performance we run the tracker with the a priori detector only to obtain a baseline. All following runs are then compared using the CLEAR MOT metrics [25]. The metric counts three numbers with respect to the ground truth that are incremented at each frame: misses (missing tracks that should exist at a ground truth position, FN), false positives (tracks that should not exist, FP), and mismatches (track identifier switches, ID). The latter value quantifies the ability to deal with occlusion events that typically occur when tracking people. From these numbers, two values are determined: MOTP (avg. metric distance between estimated targets and ground truth) and MOTA (avg. number of times of a correct tracking output with respect to the ground truth). We ignore MOTP as it is based on a metric ground truth of target positions which is unreliable in our data.

A. Results

First, we analyze the confidence values of the strong classifier $H$ and the integration framework in different situations (see Fig. 5). Person 1 traverses the sensor field of view without interference with other targets. After an initialization phase of nearly ten frames, the on-line detector has adapted to its appearance and achieves steady state at a value of around 0.8. Person 2 undergoes two occlusions. During the occlusions the confidence values drop immediately, indicating that the target is no longer visible. As the MHT correctly declares the target as occluded, adaptation of $H$ is paused and resumed with high confidences after the person reappears. We have further investigated the usage statistics of the three feature types of the on-line detector. They are generally used with similar frequency and importance.

We then compare the on-line boosting approach to the baseline using the CLEAR MOT metrics. The results show a clear improvement of all values except for the number of false positives (see Table I). We manually inspected the behavior of the tracker and discuss the insights gained.

The strongest impact of the presented approach is the reduction of the number of missed targets by 50%. This improvement is caused by the on-line observations $z^*$. When the a priori detector fails to detect an existing track in several consecutive frames, the best MHT hypothesis will eventually (and wrongly) declare the track as deleted. When this happens, the miss count (FN) is increased at each frame until the detector finds the target again and creates a new track. This is where the $z^*$ observations come into play by detecting the target from the on-line learned model. Given a $z^*$, the MHT can match the target and correctly continue the track.

This benefit comes at the expense of a delayed deletion of tracks that are incorrectly created from false positives of the a priori detector. In this case, the on-line detector tries to continue the track with the same strategy, leading to an increase of the number of false positives (FP) by 19%. We observed that this happens for recurring false positive detections on static objects to which the on-line detector can adapt particularly well.

The improvement in the number of ID switches (ID) is achieved by the joint likelihood model that guides data association in situations of interacting and thus occluding targets. The fact that this number is not higher is due
to the unscripted behavior of people in our data set. At the particular place of data collection, subjects mainly walked past rather than creating situations that stress the occlusion handling capability of the tracker.

TABLE I
CLEAR MOT results.

                   FN     FP     ID    MOTA
Baseline           1502   168    42    62%
On-line boosting   751    201    32    78%
Improvement        50%    -19%   24%   16%

VI. Conclusions

In this paper we presented a novel 3D people detection and tracking approach in RGB-D data. We combined on-line learning of target appearance models using three types of RGB-D features with multi-hypothesis tracking. We proposed a decisional framework to integrate the on-line person detector, an off-line learned a priori detector and a multi-hypothesis tracker. The framework enables the tracker to support the on-line classifier in training only on the correct samples and to guide data association via a joint motion and appearance likelihood. It also avoids the key problem of on-line adaptation, namely drifting of models to background, clutter, or other targets, by resetting the detection window at the location of the a priori detector and pausing adaptation in case of occlusions. The framework further allows filling gaps of false negatives from the a priori detector with observations of the on-line detectors found by confidence maximization search in 3D space.

The experiments show a clear overall improvement of the tracking performance, particularly in the number of missed tracks and also in the number of identifier switches. They demonstrate that the on-line classifier helps to find the correct observations in cases when the a priori detector fails. This reduces the number of missed tracks by 50%. Further, the joint data association likelihood helps to decrease the number of track identifier switches by 24%. The overall tracking accuracy (MOTA) is improved by 16%.

Future work will focus on the collection and annotation of more RGB-D data sets containing a variety of challenging social situations that stress more aspects of this approach.

Acknowledgment

This work has been supported by the German Research Foundation (DFG) under contract number SFB/TR-8.

References

[1] A. Fod, A. Howard, and M. Matarić, "Laser-based people tracking," in Int. Conf. on Robotics and Automation (ICRA), 2002.
[2] D. Schulz, W. Burgard, D. Fox, and A. Cremers, "People tracking with a mobile robot using sample-based joint probabilistic data association filters," International Journal of Robotics Research (IJRR), vol. 22, no. 2, pp. 99–116, 2003.
[3] K. O. Arras, O. Martínez Mozos, and W. Burgard, "Using boosted features for the detection of people in 2d range data," in Int. Conf. on Robotics and Automation (ICRA), 2007.
[4] L. Navarro-Serment, C. Mertz, and M. Hebert, "Pedestrian detection and tracking using three-dimensional LADAR data," in International Conference on Field and Service Robotics, Cambridge, USA, 2009.
[5] M. Bajracharya, B. Moghaddam, A. Howard, S. Brennan, and L. Matthies, "Results from a real-time stereo-based pedestrian detection system on a moving vehicle," in Workshop on People Detection and Tracking, IEEE ICRA, Kobe, Japan, 2009.
[6] L. Spinello, M. Luber, and K. O. Arras, "Tracking people in 3D using a bottom-up top-down people detector," in Int. Conf. on Robotics and Automation (ICRA), Shanghai, China, 2011.
[7] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. of the IEEE Conf. on Comp. Vis. and Pat. Rec. (CVPR), San Diego, USA, 2005.
[8] B. Leibe, E. Seemann, and B. Schiele, "Pedestrian detection in crowded scenes," in Proc. of the IEEE Conf. on Comp. Vis. and Pat. Rec. (CVPR), San Diego, USA, 2005.
[9] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in Proc. of the IEEE Conf. on Comp. Vis. and Pat. Rec. (CVPR), Anchorage, USA, 2008.
[10] M. Enzweiler and D. Gavrila, "Monocular pedestrian detection: Survey and experiments," IEEE Trans. on Pattern Analysis and Machine Intell. (PAMI), vol. 31, no. 12, 2009.
[11] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. V. Gool, "Online multi-person tracking-by-detection from a single, uncalibrated camera," IEEE Trans. on Pattern Analysis and Machine Intell. (PAMI), vol. 33, no. 9, 2011.
[12] D. Beymer and K. Konolige, "Real-time tracking of multiple people using stereo," in ICCV Workshop on Frame-rate Applications, Kerkyra, Greece, 1999.
[13] A. Ess, B. Leibe, K. Schindler, and L. V. Gool, "Robust multi-person tracking from a mobile platform," IEEE Trans. on Pattern Analysis and Machine Intell. (PAMI), vol. 31, no. 10, pp. 1831–1846, 2009.
[14] B. Leibe, K. Schindler, N. Cornelis, and L. V. Gool, "Coupled object detection and tracking from static cameras and moving vehicles," IEEE Trans. on Pattern Analysis and Machine Intell. (PAMI), pp. 1683–1698, 2008.
[15] M. Enzweiler, A. Eigenstetter, B. Schiele, and D. Gavrila, "Multi-cue pedestrian classification with partial occlusion handling," in Proc. of the IEEE Conf. on Comp. Vis. and Pat. Rec. (CVPR), 2010.
[16] L. Spinello, R. Triebel, and R. Siegwart, "Multiclass multimodal detection and tracking in urban environments," Int. Journal of Robotics Research, vol. 29, no. 12, pp. 1498–1515.
[17] H. Grabner and H. Bischof, "On-line boosting and vision," in Proc. of the IEEE Conf. on Comp. Vis. and Pat. Rec. (CVPR), New York, USA, 2006.
[18] L. Spinello and K. O. Arras, "People detection in RGB-D data," in Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), San Francisco, USA, 2011.
[19] Y. Freund and R. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, 1995.
[20] N. C. Oza and S. Russell, "Online bagging and boosting," in Artificial Intelligence and Statistics, 2001, pp. 105–112.
[21] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," Proc. of the IEEE Conf. on Comp. Vis. and Pat. Rec. (CVPR), vol. 1, pp. 511–518, 2001.
[22] S. Avidan, "Support vector tracking," IEEE Trans. on Pattern Analysis and Machine Intell. (PAMI), vol. 26, no. 8, 2004.
[23] D. B. Reid, "An algorithm for tracking multiple targets," IEEE Transactions on Automatic Control, vol. 24, no. 6, 1979.
[24] I. J. Cox and S. L. Hingorani, "An efficient implementation of Reid's multiple hypothesis tracking algorithm and its evaluation for the purpose of visual tracking," IEEE Trans. Pattern Anal. Mach. Intell. (PAMI), vol. 18, no. 2, pp. 138–150, 1996.
[25] K. Bernardin and R. Stiefelhagen, "Evaluating multiple object tracking performance: the CLEAR MOT metrics," EURASIP Journal on Image and Video Processing, vol. 2008, 2008.
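As a cross-check of the "Improvement" row in Table I, the relative changes can be recomputed from the raw counts. The helper name below is hypothetical. The sketch only uses the FN, FP and ID counts listed in the table and assumes that a positive value means a reduction with respect to the baseline.

```python
def relative_improvement(baseline, new):
    """Reduction of an error count relative to the baseline, in percent."""
    return 100.0 * (baseline - new) / baseline


# Counts taken from Table I: (baseline, on-line boosting).
counts = {"FN": (1502, 751), "FP": (168, 201), "ID": (42, 32)}
improvement = {name: round(relative_improvement(base, new), 1)
               for name, (base, new) in counts.items()}
# improvement == {"FN": 50.0, "FP": -19.6, "ID": 23.8}
```

The values agree with the table up to rounding: 50% for FN, about -19.6% for FP (listed as -19%) and about 23.8% for ID (listed as 24%). The MOTA entry is the difference between 78% and 62%, that is, 16 percentage points.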