MTA2013 - Scene-Adaptive Accurate and Fast Vertical Crowd Counting Via Joint Using Depth and Color Information
DOI 10.1007/s11042-013-1608-4
Abstract Reliable and real-time crowd counting is one of the most important tasks
in intelligent visual surveillance systems. Most previous works count passing
people based on color information alone. Owing to the inherent limitations of color
information for multimedia processing, such methods are inevitably affected by
unpredictable, complex environments (e.g. illumination changes, occlusion, and shadow).
To overcome this bottleneck, we propose a new crowd counting algorithm based on
multimodal joint information processing. In our method, we use color and depth
information together from an ordinary depth camera (e.g. Microsoft Kinect). Specifically,
we first detect the head of each passing or still person in the surveillance region
on depth information, with adaptive modulation to varying scenes. Then, we track
and count each detected head on color information. The characteristic advantage
of our algorithm is that it is scene adaptive: it can be applied directly to all kinds
of different scenes without additional conditions. Based on the proposed approach,
we have built a practical system for robust and fast crowd counting in complicated
scenes. Extensive experimental results show the effectiveness of the proposed method.
1 Introduction
Reliable and real-time crowd counting is one of the critical tasks in intelligent
visual surveillance systems. It is the foundation of many typical vision-based
applications. For example, estimating the crowd density in public places can help
H. Fu · H. Ma (B) · H. Xiao
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia,
Beijing University of Posts and Telecommunications, Beijing, China
e-mail: [email protected]
Multimed Tools Appl
Fig. 1 Typical complex scenes (e.g. varying illumination, heavy occlusion, and obvious shadow)
addressed for crowd counting in this paper. Current methods based only on color information
are not robust in these complicated scenes
approach. In this way, we can count people robustly in the complex environments shown
in Fig. 1. Based on the approach, we have already built a people counting system. Our
objective is to make the system not only highly accurate for people counting in all
kinds of complicated instances (such as changeable illumination, hair colors, heavy
occlusions, shadows, etc.), but also able to run uninterruptedly in real time. Extensive
experimental results demonstrate that our method achieves significant improvement
compared with current state-of-the-art approaches, and that the built system is not only
robust to changeable illumination, hair colors, heavy occlusions, shadows, etc., but
can also run in real time in complex environments.
The rest of this paper is organized as follows: Section 2 describes some approaches
relevant to our work. In Section 3, the proposed method is presented in detail.
Experimental results and analysis are described in Section 4. Finally, Section 5
concludes the work.
2 Related works
3 Our method
The traditional approaches are generally based on color (RGB) information. Because
color information is sensitive to environmental factors such as illumination, we
bring multi-channel information (depth + color) into our algorithm. We capture the
multi-channel information from a common depth camera (Kinect). It captures the
distance of each image pixel to the camera, and its effective range is
800–4,000 mm. Using this depth information, we can define various depth effective
graphs. The ordinary way to define such a graph is to build it in gray
format: the depth range in the practical scene is mapped into the gray space (0–255),
so that the luminance of each pixel in the gray graph stands for its distance to the
camera.
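The paper gives no code, so as an illustration, this gray-format mapping might be sketched as follows. The function name, the clamping of out-of-range readings, and the nearer-is-brighter convention (matching the later observation that the head peak has the maximum depth value) are our assumptions:

```python
def depth_to_gray(depth_mm, d_min=800, d_max=4000):
    """Map a raw depth reading (mm) to an 8-bit intensity.

    Nearer points get higher intensity, so the head peak (closest to an
    overhead camera) is the brightest pixel. Readings outside the
    effective range are clamped. This is an illustrative sketch, not
    the paper's implementation.
    """
    depth_mm = max(d_min, min(d_max, depth_mm))
    # Invert so that a smaller distance gives a brighter pixel.
    return round(255 * (d_max - depth_mm) / (d_max - d_min))
```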
In our method, we propose a color gradient model to define the depth effective
graph. The advantage of this model is that it represents the distance to the camera
more visually and more semantically. The color gradient model can be seen in Fig. 2. The
horizontal axis stands for a person's height above the floor, and the vertical axis
stands for the color channel value. We use three colors (red, green, and blue) to label
the corresponding practical locations. We can see that the color of the floor (0 cm)
is pure blue, and the color gradually transits from pure blue to pure green over the
range 0–90 cm. The color then gradually transits from pure green to pure red over the
range 90–180 cm. Using this color gradient model, we can estimate the rough locations
of people in the captured region directly from the displayed colors.
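As an illustrative sketch (not from the paper), the described blue-green-red gradient could be implemented like this, assuming linear transitions and 8-bit channels:

```python
def height_to_color(h_cm):
    """Map a height above the floor (cm) to an (R, G, B) triple
    following the described gradient: pure blue at 0 cm, pure green at
    90 cm, pure red at 180 cm, with linear transitions in between.
    Heights are clamped to [0, 180]; the linear interpolation is our
    assumption, as the paper only shows the gradient in a figure.
    """
    h = max(0.0, min(180.0, h_cm))
    if h <= 90:
        t = h / 90.0           # blue -> green
        return (0, round(255 * t), round(255 * (1 - t)))
    t = (h - 90) / 90.0        # green -> red
    return (round(255 * t), round(255 * (1 - t)), 0)
```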
Based on the color gradient model, we construct a new depth graph as a color
representation of the input image. As seen in Fig. 3, the left column shows the
captured color image, and the right column shows the graph constructed with
our proposed color gradient model. We label the distances in the constructed
graph according to the practical scenes. The distances can be labeled according to
two rules: the distance of a person to the floor (see (b) in Fig. 3) and the
distance of a person to the camera (see (d) in Fig. 3). In this way, we
can infer that the height of the person at the upper right corner of (b) in Fig. 3 is about
160 cm.
Because the locations of the depth camera and the floor are both fixed when the
surveillance depth camera is mounted in an elevated view, we can use this information to
add a learning procedure to our crowd counting system. Specifically, we first model
the practical scene before the crowd counting procedure to obtain the distance between
the floor and the camera. According to this distance, we can choose the appropriate
initial parameters. During the learning period, we require that nobody appears in the
scene. The detailed algorithm of our scene-adaptive scheme for crowd counting in
different scenes is as follows:
In the algorithm, we choose the projection area of the camera as the specific area
for learning. In our experiments, the specific area is always about 20 × 20. On the
one hand, using a relatively small region saves memory; on the other hand, it ensures
the learning accuracy. In our algorithm, we consider the distance from the camera to
each pixel of the floor to be the same. In a practical scene, the distance may not be
strictly the same for every pixel, but this tiny difference can be ignored
in our algorithm, because the basic unit of our crowd counting is each relatively large
person, so the difference does not influence our task. The median rule
based scene-adaptive scheme refers to the learning procedure: we take the
median value of Dxy as the final floor depth Dfloor . In this way, we can quickly obtain
the Dfloor of each different scene. This is key for our crowd counting algorithm
below. Based on the scene-adaptive scheme, our crowd counting algorithm can be
used in any different scene without further manual intervention.
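The median rule described above can be sketched as follows; the function name and the list-of-lists input format are our illustrative choices, not the paper's:

```python
from statistics import median

def learn_floor_depth(depth_patch):
    """Scene-adaptive learning step: given depth samples D_xy from the
    (empty) learning region under the camera (about 20 x 20 in the
    paper), return their median as the floor depth D_floor.
    `depth_patch` is a 2-D list of depth readings in mm.
    """
    samples = [d for row in depth_patch for d in row]
    return median(samples)
```

Using the median rather than the mean makes the learned floor depth robust to a few outlier readings in the patch.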
Our objective is fast and robust crowd counting in very complicated environments
such as those shown in Fig. 1. In our method, we focus on detecting the head of
every person, because the head is more robust to heavy occlusions than the whole
body obtained via background subtraction. However, accurate head detection is not
readily available, so we address this problem in the depth domain.
As seen in Fig. 4, the original RGB and depth images are shown in Fig. 4a and b,
respectively. From the raw depth map, we can obtain the depth of every point,
i.e., the distance between the pixel position and the camera. In our method,
we detect the head based only on this depth map. Because depth data and pixel data
are connected by corresponding relations, we construct a more visual depth map
representing the objects in color format (see Fig. 4c) from Fig. 4b. The
peak of the head is closest to the camera, so the depth value of the head peak is maximum. For the
Fig. 4 Adaptive head detection in complicated scenes. a raw color data; b raw depth data; c our
constructed depth map in color format; d the contour of the depth data in b, used to remove the
region adjoining persons and the floor and to speed up the head detection task; e final head
detection result
same reason, a person's feet are farthest from the camera, so the depth value of the feet is
minimum. From the camera's top view, we can always find the head peak relative to the
other parts of the body, so we can make the best of this information. However, it is not
obvious how to obtain the whole head region as in Fig. 4c, because the persons
in the surveillance region have different heights, and the same person in different
scenes may have uncertain depth values. This makes the head detection task
difficult.
To overcome this difficulty, we extract each head region in different scenes based
on the former scene-adaptive scheme, which gives us the Dfloor of the current scene.
This is key for distinguishing persons from the floor based on depth values. Through
extensive observation of depth graphs in the color gradient model representation,
we find that each person on the floor appears as a peak when the floor is regarded as
the zero layer. Even in a very crowded scene, each person's head still keeps this peak,
while the adjacent place between the head and the shoulder appears as a valley. We can
use this observation to extract each person's head in the scene by finding each peak.
Although the surveillance scene may change, the true depth range of the depth camera
(Kinect) is fixed, and the intensity range of the pixels in the depth map is also fixed
(0 to 255). Our task is to construct the relationship between the true depth range and
the pixel range, and to map the true head height into the depth image of Fig. 4c to
obtain a threshold for estimating the border of the head region. Our scheme is defined
as follows:
T = (P / D) · H.    (1)
where P is the factual pixel range of intensity, D is the factual depth range of the
depth camera, and H is the factual head height of a person. The head heights of
different people may differ; however, when mapped into the depth image, this small
difference can be ignored. We assume the maximum pixel value in the depth map of
Fig. 4c is m0 , so the head region in Fig. 4c is [m0 − T, m0 ]. The scheme is simple,
and it needs no parameter setup or modification, so it is convenient for practical
systems.
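Equation (1) and the resulting head region can be sketched as follows; the function names are our illustrative choices, while the formulas follow the text:

```python
def head_threshold(pixel_range, depth_range_mm, head_height_mm):
    """Equation (1): T = (P / D) * H, mapping the physical head height
    into the intensity scale of the constructed depth map."""
    return pixel_range / depth_range_mm * head_height_mm

def head_region(m0, t):
    """The head occupies intensities [m0 - T, m0], where m0 is the
    maximum pixel value found in the depth map."""
    return (m0 - t, m0)
```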
To accelerate the method, we do not run our algorithm on the whole image of
Fig. 4c. Instead, we shrink the search range by obtaining the contour of each
person based on Fig. 4b. We detect the contour using a binary threshold on the
depth data to separate the body parts (see Fig. 4d). In this way, we obtain the
final detection result in Fig. 4e. The detailed algorithm for our head region
extraction based on depth information is as follows:
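A rough sketch of this search-narrowing idea, assuming a simple per-pixel threshold against the learned floor level; a real system would use a proper contour extractor (e.g. OpenCV's findContours), and the `margin` tolerance is our illustrative parameter, not a value from the paper:

```python
def person_mask(depth_gray, d_floor_gray, margin=10):
    """Keep only pixels clearly above the learned floor intensity,
    restricting the subsequent peak search to person regions.
    `depth_gray` is a 2-D list of 8-bit intensities (brighter =
    nearer the camera); `d_floor_gray` is the learned floor intensity.
    """
    return [[1 if px > d_floor_gray + margin else 0 for px in row]
            for row in depth_gray]
```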
After the above head detection step in the depth domain (see Fig. 4e), we switch to
processing in the RGB domain (see Fig. 5a) using the contours detected in Fig. 4e. In
the following, we track and count the detected heads in the surveillance region of
Fig. 5a. To improve the processing speed, we do not use traditional Kalman filter [2, 10]
or particle filter [3, 4] based tracking in our algorithm. Instead, we present
a mixture model based tracking method that fuses the matches between frames by
comparing position, shape size, and distance. Because the detected heads are not occluded,
Fig. 5 Representation of the pedestrian tracking and counting results. a tracking each detected head
from Fig. 4e; b counting each head based on the tracking result
our tracking and counting method need not consider splits and merges of
the heads. Our tracking scheme is defined as follows:
Definition 1 We define the area size difference between the head with ID i in frame k and
the head with ID j in frame k + 1 as follows:

Dif(Si(k), Sj(k + 1)) = Si(k) − Sj(k + 1)    (2)
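A minimal sketch of a frame-to-frame match score combining the area difference of Eq. (2) with centroid distance; the weights, the field names, and the use of an absolute value are our illustrative assumptions, not the paper's exact mixture model:

```python
import math

def match_cost(head_a, head_b, w_pos=1.0, w_size=1.0):
    """Score a candidate match between a head in frame k and one in
    frame k+1. Heads are dicts with 'cx', 'cy' (centroid) and 'area'.
    Lower cost means a better match; a tracker would assign each head
    to the candidate with the minimum cost.
    """
    d_pos = math.hypot(head_a['cx'] - head_b['cx'],
                       head_a['cy'] - head_b['cy'])
    d_size = abs(head_a['area'] - head_b['area'])  # magnitude of Eq. (2)
    return w_pos * d_pos + w_size * d_size
```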
In this way, the heads can be tracked robustly. The bidirectional estimation for
each head in our approach is simply based on the positions in consecutive frames
from when the head is first detected in the monitored region. When the head is about
to leave the monitored region, we count the number of entering/exiting people in the
surveillance region bidirectionally. Figure 5b shows the result of head tracking and
counting. A summary of the proposed algorithm is given in Algorithm 3.
Algorithm 3 Proposed algorithm for crowd counting by jointly using depth and color
information
1: Initialization: In = 0, Out = 0
2: Find the maximum depth value m0 in the depth image.
3: Detect the head region according to (1).
4: Track the detected heads using the proposed mixture model based tracker
of (2)–(6).
5: Count each head bidirectionally using the relationship between consecutive
frames.
6: Output: In, Out
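The control flow of Algorithm 3 might be sketched as follows, with `detect_heads` and `track` standing in for the detection and matching steps described above; the interface is our assumption, not the paper's:

```python
def count_crowd(frames, detect_heads, track):
    """High-level sketch of Algorithm 3: detect heads per frame in the
    depth domain, associate them across frames, and update the In/Out
    counters when a tracked head leaves the monitored region.
    `track` returns (ongoing_tracks, finished_tracks), where each
    finished track carries a 'direction' of 'in' or 'out'.
    """
    counts = {'In': 0, 'Out': 0}                 # step 1: initialization
    tracks = []
    for frame in frames:
        heads = detect_heads(frame)              # steps 2-3: head detection
        tracks, finished = track(tracks, heads)  # step 4: matching
        for t in finished:                       # step 5: bidirectional count
            counts['In' if t['direction'] == 'in' else 'Out'] += 1
    return counts                                # step 6: output In, Out
```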
4 Experiments
In this paper, we propose a new way to count crowds by jointly using the color
and depth information from a commodity depth camera, the Microsoft Kinect.
Figure 6 shows our capture system set up in many different scenes. To
demonstrate that the presented scheme is effective for this challenging task, we test our
Fig. 6 Experimental setup. We test the algorithm in many different scenes. The cameras in our
experiments are marked with red ellipses
method in many different scenes. Based on our new scheme for crowd counting,
we have built a practical system. We evaluate its performance in both
indoor and outdoor scenes with complex environments (e.g. heavy occlusions,
shadows, dynamic backgrounds, etc.). Detailed experimental results are given in
the following section. A crowd counting system is typically evaluated on two
aspects: accuracy and real-time performance. So we also compare our system with
systems based on recent state-of-the-art crowd counting methods in both accuracy
and processing speed. All of our experiments are conducted on an Intel E7500
3 GHz dual-core processor with 2 GB RAM.
First, we test our scene-adaptive scheme for crowd counting. As seen in Fig. 7, the
corresponding captured images in these two typical scenes are random. We can see
that the person sizes differ because the mounting heights of the cameras differ.
The key point of our method is to detect each head robustly using the
scene-adaptive head detection scheme. We now give the specific parameters of
our experiments. In both scenes we use the Kinect depth camera. The value of
D is 3,060 mm, because the current Kinect's factual depth range is about 800 mm to
3,860 mm. In the experiments, we set P to 255. In practice, the variation in head
height between people is not significant; we assume H is 170 mm in all of our
experiments. According to (1), we can compute the threshold for detecting the head region
Fig. 7 Two typical surveillance scenes in our experiments. (top) Our experimental setup for
indoor scenes; (bottom) our experimental setup for outdoor scenes
around the maximum depth value in the depth image. Meanwhile, we use the learning
based scene-adaptive scheme to obtain the threshold Dfloor for removing the floor
in different scenes. As seen in Fig. 8, the results demonstrate that our scheme
adapts to different scenes without further modulation.
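Plugging the reported settings into Eq. (1) gives the concrete head-region threshold:

```python
# Worked example with the experimental settings reported above.
P = 255    # factual pixel range of intensity
D = 3060   # factual depth range in mm (about 3,860 - 800)
H = 170    # assumed head height in mm
T = P / D * H
print(round(T, 2))  # prints 14.17
```

So the head region spans roughly the top 14 intensity levels below the maximum pixel m0 in the depth map.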
Next, we evaluate the proposed method in both indoor and outdoor scenes from
many aspects. As seen in Fig. 9, this is a very crowded instance. Traditional
methods based only on color information become very ineffective in this
complicated scene, but our approach remains effective: we can detect each
head region accurately and then count them conveniently. Moreover, we
compare our approach with current state-of-the-art approaches [1, 16]. Because
we add depth information to the RGB information for crowd counting, it is
somewhat unfair to compare with methods based only on RGB information [1, 16].
However, the depth camera (Microsoft Kinect) is now very cheap and can generally
replace current general-purpose cameras. In this paper, we focus on designing a practical
people counting system for complicated environments (see Fig. 1) in both
indoor and outdoor scenes (see Fig. 6), while the evaluation of a good crowd
Fig. 8 Evaluation of our proposed scene-adaptive scheme for crowd counting in different scenes.
(left) One typical indoor scene. (right) One typical outdoor scene
Fig. 9 Test of our algorithm in a very crowded scene. a Input crowded scene with heavy occlusions;
d final head region extraction result through our algorithm
Table 1 Accuracy and speed of our method for crowd counting

                           In        Out       Speed
Building scene   Truth     350       286       Head detection   31.2 FPS
                 Result    338       280       Total algorithm  27.3 FPS
                 Accuracy  96.6 %    97.9 %
Entrance scene   Truth     520       545       Head detection   29.5 FPS
                 Result    509       536       Total algorithm  25.2 FPS
                 Accuracy  97.9 %    98.3 %
Corridor scene   Truth     1,360     1,180     Head detection   33.2 FPS
                 Result    1,334     1,139     Total algorithm  27.8 FPS
                 Accuracy  98.1 %    96.5 %
Export scene     Truth     1,050     997       Head detection   39.5 FPS
                 Result    988       969       Total algorithm  30.2 FPS
                 Accuracy  94.1 %    97.2 %
counting system is its performance in counting accuracy and speed. From this point
of view, we think it can also be compared with [1, 16]. In [1], the authors assume
that at most three persons are in the scene at the same time, which is a restrictive
condition for practical applications. The method of [16] relies on a good
moving-blob segmentation result, which is not common in practical systems. As seen
in Fig. 10, we evaluate our approach against the methods of [1, 16] in an entrance
scene. In both the mid-crowded instance and the heavy-crowded instance, our method
is much more effective than the others. We also show the comparison results of these three methods
5 Conclusions
In this paper, we propose a new method for crowd counting in complicated scenes
by jointly using depth and color information. We evaluate our algorithm from many
aspects, and the experimental results demonstrate that the new scheme achieves
significant improvements in both accuracy and speed. Based on the proposed approach,
we have built a practical system for robust and fast crowd counting in complicated
scenes. Our system shows that jointly using depth and color information is a
promising way to address the challenging task of crowd counting in different scenes.
Acknowledgements This work was supported by the China National Funds for Distinguished
Young Scientists under Grant No.60925010, Natural Science Foundation of China under Grant
No.61272517, The Research Fund for the Doctoral Program of Higher Education of China under
Grant No.20120005130002, the Co-sponsored Project of Beijing Committee of Education, the Funds
for Creative Research Groups of China under Grant No.61121001, and the Program for Changjiang
Scholars and Innovative Research Team in University under Grant No.IRT1049.
References
1. Antic B, Letic D, Culibrk D, Crnojevic V (2009) K-MEANS based segmentation for real-time
zenithal people counting. In: IEEE international conference on image processing, pp 2565–
2568
2. Antoniou C, Ben-Akiva M, Koutsopoulos HN (2007) Nonlinear kalman filtering algorithms
for on-line calibration of dynamic traffic assignment models. IEEE Trans Intell Transport Syst
8:661–670
3. Bouaynaya N, Schonfeld D (2009) On the Optimality of motion-based particle filtering. IEEE
Trans Circuits Syst Video Technol 19:1068–1072
4. Chateau T, Gay-Belille V (2006) Real-time tracking with classifiers. In: European conference
on computer vision, pp 218–231
5. Cong Y, Gong HF, Zhu SC, Tang YD (2009) Flow mosaicking: real-time pedestrian counting
without scene-specific learning. In: IEEE conference on computer vision and pattern recognition,
pp 1093–1100
6. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: IEEE
conference on computer vision and pattern recognition, vol 1, pp 886–893
7. Dollar P, Wojek C, Schiele B, Perona P (2009) Pedestrian detection: a benchmark. In: IEEE
conference on computer vision and pattern recognition, pp 304–311
8. Fu HY, Ma HD, Liu L (2011) Robust human detection with low energy consumption in visual
sensor network. In: IEEE international conference on mobile ad-hoc and sensor networks,
pp 91–97
9. Gavrila DM, Munder S (2007) Multi-cue pedestrian detection and tracking from a moving
vehicle. Int J Comput Vis 73:41–59
10. Kai X, Wei CL, Liu LD (2010) Robust extended kalman filtering for nonlinear systems with
stochastic uncertainties. IEEE Trans Syst Man Cybern 40:399–405
11. Ma HD, Zeng CB, Ling CX (2012) A reliable people counting system via multiple cameras. ACM
Trans Intel Syst Technol 3:1–22
12. Mikolajczyk K, Schmid C, Zisserman A (2004) Human detection based on a probabilistic as-
sembly of robust part detectors. In: Pajdla T, Matas J (eds) European Conference on Computer
Vision, vol 3021. Berlin, Heidelberg, pp 69–82
13. Mu Y, Yan S, Liu Y, Huang T, Zhou B (2008) Discriminative local binary patterns for human
detection in personal album. In: IEEE conference on computer vision and pattern recognition,
pp 1–8
14. Kilambi P, Ribnick E, Joshi AJ, Masoud O, Papanikolopoulos N (2008) Estimating pedestrian
counts in groups. Comp Vision Image Underst 110:43–59
15. Shimada A, Arita D, Taniguchi R (2006) Dynamic control of adaptive mixture of gaussians
background model. In: IEEE international conference on video and signal based surveillance,
pp 20–24
16. Stauffer C, Grimson WEL (1999) Adaptive background mixture models for real-time tracking.
In: IEEE conference on computer vision and pattern recognition, vol 2, pp 246–252
17. Tuzel O, Porikli FM, Meer P (2008) Pedestrian detection via classification on riemannian mani-
folds. IEEE Trans Pattern Anal Mach Intell 30(10):1713–1727
18. Velipasalar S, Tian YL, Hampapur A (2006) Automatic counting of interacting people by using a
single uncalibrated camera. In: IEEE international conference on multimedia and expo, pp 1265–
1268
Huiyuan Fu is currently working towards the Ph.D. degree with the Beijing Key Laboratory of
Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecom-
munications, China. His research interests include multimedia systems and networks, computer
vision, and pattern recognition.
Huadong Ma received the Ph.D. degree in computer science from Institute of Computing Tech-
nology, Chinese Academy of Science (CAS), in 1995, the M.S. degree in computer science from
Shenyang Institute of Computing Technology, CAS in 1990, and the B.S. degree in mathematics
from Henan Normal University, China in 1984. He is a professor at the School of Computer
Science, Beijing University of Posts and Telecommunications, China. His research interests include
multimedia networks and systems, Internet of things and sensor networks. He has published over
100 papers in these fields.
Hongtian Xiao received the M.Sc. degree in computer science from the Beijing Key Laboratory of
Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecom-
munications, China in 2013. His research interests include computer vision and pattern recognition.