MTA2013 - Scene-Adaptive Accurate and Fast Vertical Crowd Counting Via Joint Using Depth and Color Information

The document proposes a new algorithm for crowd counting that uses both depth and color information from an ordinary depth camera like Microsoft Kinect. The algorithm first detects heads in the surveillance region using an adaptive scheme based on depth information. It then tracks and counts the detected heads bidirectionally using color information. The algorithm is scene-adaptive, meaning it can be directly applied to different scenes without additional parameters. Experimental results show the algorithm effectively counts crowds in complex scenes with varying illumination, occlusion and shadows.


Multimed Tools Appl

DOI 10.1007/s11042-013-1608-4

Scene-adaptive accurate and fast vertical crowd counting via joint using depth and color information

Huiyuan Fu · Huadong Ma · Hongtian Xiao

© Springer Science+Business Media New York 2013

Abstract Reliable and real-time crowd counting is one of the most important tasks in intelligent visual surveillance systems. Most previous works count passing people based only on color information. Because color information is inherently sensitive to imaging conditions, such methods are inevitably affected by unpredictable, complex environments (e.g. illumination changes, occlusion, and shadow). To overcome this bottleneck, we propose a new crowd counting algorithm based on multimodal joint information processing. In our method, we use color and depth information together from an ordinary depth camera (e.g. Microsoft Kinect). Specifically, we first detect the head of each passing or still person in the surveillance region on the depth information, with adaptive modulation to varying scenes. Then, we track and count each detected head on the color information. The characteristic advantage of our algorithm is that it is scene-adaptive: it can be applied directly to all kinds of different scenes without additional conditions. Based on the proposed approach, we have built a practical system for robust and fast crowd counting in complicated scenes. Extensive experimental results show the effectiveness of our proposed method.

Keywords Multimodal joint multimedia processing · Crowd counting · Ordinary depth camera · Scene-adaptive scheme · Real-time system

1 Introduction

Reliable and real-time crowd counting is one of the critical tasks in intelligent visual surveillance systems. It is the foundation for many typical vision-based applications. For example, estimating the crowd density in public places can help managers identify unsafe situations and regulate traffic appropriately. As another example, a public museum can control the number of people entering according to real-time people-flow information. Although it is important, robust crowd counting in complex environments is still far from being completely solved.

H. Fu · H. Ma (B) · H. Xiao
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia,
Beijing University of Posts and Telecommunications, Beijing, China
e-mail: [email protected]
Recently, computer vision based methods have been widely applied to crowd counting or people counting [1, 5, 15, 16, 18]. Facing different application needs, people counting systems can be classified into two categories according to the camera view: top-view (also called vertical; typical scenes are entrance/exit gates, museums, teaching buildings, etc.) [1, 15, 16, 18] and squint-view (such as a sideway street) [5]. Each has relative advantages over the other. For example, top-view people counting suffers less occlusion than squint-view methods, but it is difficult for top-view methods to detect the shape features of the body. Conversely, squint-view methods can count people based on effective body features (such as the HOG feature [6]), but they face heavier occlusion. In this paper, we focus on counting people from the top view, as in [1, 15, 16, 18], but we try to address the most complex instances in practical surveillance environments (see Fig. 1). Most previous works [1, 15, 16, 18] count people based only on color (RGB) information, so they are inevitably affected by unpredictable, complicated environments with heavy occlusions, changeable illumination, etc. (see Fig. 1).
In this paper, we propose a new scheme to overcome this bottleneck for crowd counting. Our approach fuses color and depth information from a cheap depth camera (such as the Microsoft Kinect). To the best of our knowledge, this may be the first time depth data has been used for counting people, by co-processing the captured real-time stream of passing people in both color and depth format from the top view. In our approach, we first propose a scene-adaptive head detection scheme for people counting based on depth information. This is the characteristic advantage of our algorithm, because the scheme guarantees that our approach is robust across widely different scenes. Meanwhile, the scene-adaptive scheme is free of parameter modulation, which makes it very convenient to use in unpredicted environments. Moreover, our method detects and counts not only moving persons but also still persons. Then, once we obtain the position of each head in the surveillance region, we track and count the heads bidirectionally on color information. To speed up our algorithm for real-time performance, we propose a simple but very fast mixture model based tracker in our approach. In this way, we can count people robustly in the complex environments shown in Fig. 1. Based on this approach, we have built a people counting system. Our objective is a system that is not only highly accurate for people counting in various complicated instances (such as changeable illumination, hair colors, heavy occlusions, shadows, etc.), but that can also run uninterruptedly in real time. Extensive experimental results demonstrate that our method achieves significant improvement over current state-of-the-art approaches, and that the built system is robust to changeable illumination, hair colors, heavy occlusions, and shadows, and runs in real time in complex environments.

Fig. 1 Typical complex scenes (e.g. varying illumination, heavy occlusion, and obvious shadow) addressed for crowd counting in this paper. Current color-information-only methods are not robust to these complicated scenes
The rest of this paper is organized as follows: Section 2 reviews approaches relevant to our work. Section 3 presents the proposed method in detail. Experimental results and analysis are described in Section 4. Finally, Section 5 concludes the work.

2 Related works

To our knowledge, no existing approach jointly uses multi-modal information (color and depth) for crowd counting. Many works have been proposed for crowd counting or people counting [1, 5, 11, 14–16, 18] (crowd counting can also be seen as people counting in complex scenes with heavy occlusions). Nevertheless, those works are all based only on color information. Generally, the counting problem can be divided into two basic classes based on camera view: top-view [1, 16, 18] and squint-view [5, 11, 14]. Squint-view people counting is often closely based on human detection results [7–9, 12, 13, 17]. In this paper, we focus on top-view people counting in complicated environments such as in Fig. 1. The works most related to ours are [1, 15, 16, 18]; we introduce these four papers in detail below.
The authors of [18] use a two-level hierarchical tracking method to count passing people with a single uncalibrated camera. Their method is useful in simple instances, for example when there is little occlusion. Because they use only a simple size feature to estimate the number of people in a moving blob, their method is not well controlled and not robust in complicated environments.
The authors of [1] use a K-means based method to first segment the multiple passing people and then track them, so their method depends tightly on the segmentation result. However, it is very difficult to segment every pedestrian precisely from the obtained blobs, especially in complex environments. Thus, their method is not effective in scenes such as in Fig. 1.
The authors of [16] propose a typical scheme for moving object detection. Numerous practical applications have demonstrated that it is very useful for detecting moving objects in dynamically changing scenes. However, their method is not well suited to people counting: although people in the surveillance region are usually moving, we cannot assume that they never keep still. Moreover, their method is easily affected by occlusions, so it is not robust in complex scenes.
The authors of [15] use a multi-camera based method to estimate the number of people entering public buildings, but they also emphasize that their method relies on good image segmentation, which is in fact very difficult in complicated environments.

3 Our method

In this section, we first propose a color gradient model based on observation of the depth distributions. Based on this color gradient model, we present our scene-adaptive algorithm in detail for accurate and fast crowd counting via joint use of depth and color information. We then present our algorithm for detecting the head of each passing or still person in the surveillance region, with adaptive modulation to different scenes, based on depth information. Next, we describe our algorithm for bidirectionally tracking and counting each detected head based on color information. A summary of the proposed algorithm is given at the end of the section.

3.1 Color gradient model

Traditional approaches are generally based on color (RGB) information. Because color information is sensitive to environmental factors such as illumination, we bring multi-channel information (depth + color) into our algorithm. We capture the multi-channel information from a common depth camera (Kinect). It captures each pixel's distance to the camera, with an effective range of 800–4,000 mm. Using this depth information, we can define various depth effective graphs. The ordinary way to define a depth effective graph is to build it in grey format: the depth range of the practical scene is mapped into gray space (0–255), so the luminance of each pixel in the gray graph stands for its distance to the camera.

In our method, we propose a color gradient model to define the depth effective graph. The advantage of this model is that it represents the distance to the camera more visually and more semantically. The color gradient model can be seen in Fig. 2. The horizontal axis stands for the distance of a point to the floor, and the vertical axis stands for the color channel value. We use three colors (red, green and blue) to label the corresponding practical locations. The color of the floor (0 cm) is pure blue; the color gradually transitions from pure blue to pure green over the range 0–90 cm, and from pure green to pure red over the range 90–180 cm. Using this color gradient model, we can read off the rough heights of humans in the captured region directly from the displayed colors.
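As an illustration, the blue-to-green-to-red transition described above can be sketched as a piecewise-linear mapping. This is a minimal Python sketch based on our reading of Fig. 2; the authors' system is implemented in C++, and the exact interpolation they use is not specified in the text.

```python
def height_to_color(height_cm, max_height_cm=180.0):
    """Map a height above the floor to an RGB color following the
    blue -> green -> red gradient of the color gradient model."""
    h = max(0.0, min(height_cm, max_height_cm))
    half = max_height_cm / 2.0          # 90 cm transition point
    if h <= half:
        t = h / half                    # 0..1 over 0-90 cm
        return (0, int(round(255 * t)), int(round(255 * (1 - t))))  # blue -> green
    t = (h - half) / half               # 0..1 over 90-180 cm
    return (int(round(255 * t)), int(round(255 * (1 - t))), 0)      # green -> red
```

For example, the floor (0 cm) maps to pure blue, 90 cm to pure green, and 180 cm to pure red, matching the transitions described in the text.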
Based on the color gradient model, we construct a new depth graph as a color representation of the input practical image. As seen in Fig. 3, the left column shows the captured color image, and the right column shows the graph constructed with our proposed color gradient model. We label distances in the constructed graph according to the practical scene. The distances can be labeled according to two rules: the distance of the human to the floor (see (b) in Fig. 3) and the distance of the human to the camera (see (d) in Fig. 3). In this way, we can infer that the height of the person at the upper right corner in (b) in Fig. 3 is about 160 cm.

Fig. 2 Color gradient model representation (horizontal axis: distance to the floor; vertical axis: color channel value). The details can be seen in the text

Fig. 3 Depth graph on color representation of an input practical image. a and c: input data; b and d: constructed depth graph based on our proposed color gradient model

3.2 Scene-adaptive scheme for crowd counting in different scenes

Because the locations of the depth camera and the floor are both fixed when the surveillance depth camera is mounted in an elevated view, we can use this information to add a learning procedure to our crowd counting system. Specifically, before the crowd counting procedure we first model the practical scene to obtain the distance between the floor and the camera. According to this distance, we can choose the appropriate initial parameter. During the learning period, we require that nobody appears in the scene. The detailed scene-adaptive scheme for crowd counting in different scenes is as follows:

Algorithm 1 Median rule based scene-adaptive scheme
1: Capture each pixel's depth value of the specific area at the current frame, and append it to an array Dxy.
2: Get the median value of each specific-area pixel's corresponding array Dxy.
3: Update the median value Dxy.
4: Go to step 1, until the value of Dxy converges.

In the algorithm, we choose the projection area of the camera as the specific area for learning. In our experiments, the specific area is always about 20 × 20. On the one hand, we save RAM by using a relatively small region; on the other hand, we ensure the learning accuracy. In our algorithm, we treat the distance from the camera to each pixel of the floor as the same. In the practical scene, the distance may not be strictly the same for each pixel, but this tiny difference can be ignored in our algorithm, because the basic unit of our crowd counting is each relatively large person, so the difference does not influence our task. The median rule based scene-adaptive scheme is the learning procedure. Finally, we take the median value of Dxy as the final floor depth Dfloor. In this way, we can quickly obtain the Dfloor of each different scene. This is key for our crowd counting algorithm below. Based on the scene-adaptive scheme, our crowd counting algorithm can be used in any scene without further manual intervention.
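The median-rule learning procedure above can be sketched as follows. This is a minimal Python illustration: the parameters `eps` and `min_frames` and the convergence test are our assumptions, since the paper does not give a concrete convergence criterion.

```python
import statistics

def learn_floor_depth(depth_frames, region, eps=1.0, min_frames=5):
    """Estimate the floor depth D_floor for a new, empty scene: pool
    per-pixel depths over a small region (e.g. a 20x20 patch) across
    frames and stop once the region median converges.
    depth_frames: iterable of mappings (x, y) -> depth in mm.
    region: list of (x, y) pixel coordinates."""
    samples = []          # pooled depth samples over the region
    prev_median = None
    for i, frame in enumerate(depth_frames):
        samples.extend(frame[xy] for xy in region)
        med = statistics.median(samples)
        if prev_median is not None and i + 1 >= min_frames and abs(med - prev_median) < eps:
            return med    # converged: take this as D_floor
        prev_median = med
    return prev_median    # ran out of frames; return best estimate
```

Using the median rather than the mean makes the learned floor depth robust to the occasional invalid depth reading, which is in the spirit of the median rule described above.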

3.3 Head region extraction based on depth information

Our objective is fast and robust crowd counting in very complicated environments such as shown in Fig. 1. In our method, we focus on detecting the head of every person, because the head is more robust to heavy occlusions than the whole body obtained via background subtraction. However, accurate head detection is not ready-made; we address this problem in the depth domain.

As seen in Fig. 4, the original RGB and depth images are Fig. 4a and b, respectively. From the raw depth map, we can obtain the depth of every point, i.e. the distance between the pixel position and the camera. In our method, we detect the head based only on this depth map. Because depth data and pixel data are connected by corresponding relations, we build a more visual depth map representing the objects in color format (see Fig. 4c) from Fig. 4b. The peak of the head is closest to the camera, so the depth value of the head peak is maximal. For the same reason, the feet are farthest from the camera, so their depth value is minimal. From the top view of the camera, we can always find the head peak relative to other parts of the body, so we can make the best of this information. However, we do not yet know how to obtain the whole head region as in Fig. 4c, because the persons in the surveillance region have different heights, and the same person in different scenes may have different depth values. This makes the head detection task difficult.

Fig. 4 Adaptive head detection in complicated scenes. a raw color data; b raw depth data; c our constructed depth map in color format; d the contour of the depth data in b, used to remove the region adjoining persons and the floor and to speed up head detection; e final head detection result
To overcome this difficulty, we extract each head region using the former scene-adaptive scheme. We can obtain the Dfloor of the current scene, which is key for distinguishing persons from the floor by depth value. Through extensive observation of depth graphs in the color gradient representation, we find that each person on the floor appears as a peak when the floor is regarded as the zero layer. Even in a very crowded scene, each person's head still keeps this peak, while the adjacent place between the head and the shoulder appears as a valley. We can use this observation to extract each person's head by finding each peak. Although the surveillance scene may change, the true depth range of the depth camera (Kinect) is fixed, and the intensity range of the pixels in the depth map is also fixed (0 to 255). Our task is to construct the relationship between the true depth range and the pixel range, and to map the true head height into the depth image of Fig. 4c to obtain a threshold for estimating the border of the head region. Our scheme is defined as follows:

T = (P / D) · H    (1)

where P is the factual pixel range of intensity, D is the factual depth range of the depth camera, and H is the factual head height of a person. Head heights differ between people, but when mapped into the depth image this small difference can be ignored. Assuming the maximum pixel value in the depth map of Fig. 4c is m0, the head region in Fig. 4c is [m0 − T, m0]. The scheme is simple, yet it needs no parameter setup or modification, so it is convenient for practical systems.
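As a worked example of (1), using the values quoted later in Section 4 (P = 255, D = 3,860 − 800 = 3,060 mm, H = 170 mm); a sketch for illustration, not the authors' code:

```python
def head_threshold(pixel_range, depth_range_mm, head_height_mm):
    """Eq. (1): T = (P / D) * H, mapping the physical head height
    into the intensity scale of the depth map."""
    return pixel_range / depth_range_mm * head_height_mm

# P = 255 intensity levels, D = 3,060 mm, H = 170 mm
T = head_threshold(255, 3060, 170)   # about 14.2 intensity levels
# the head region around the maximum pixel m0 is [m0 - T, m0]
```

So with these values, pixels within roughly 14 intensity levels of a local maximum are taken as belonging to that head.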
In order to accelerate the method, we don’t process our algorithm on the whole
image of Fig. 4c. Instead, we shrink the search range by obtaining the contour of each
person based on Fig. 4b. We detect the contour using a binary threshold value in
terms of depth data to split the part body (see Fig. 4d). In this way, we can obtain our
final detection result in Fig. 4e. The detail algorithm about our head region extraction
based on depth information can be taken as follows:

3.4 Pedestrian tracking and counting based on color information

Fig. 5 The representation of the pedestrian tracking and counting result. a tracking each detected head from Fig. 4e; b counting each head based on the tracking result

Algorithm 2 Head region extraction based on depth information
1: Set a threshold Th for roughly distinguishing persons from the floor, by using the scene-adaptive scheme to obtain the Dfloor of the current scene.
2: if Th < Dfloor then
3: Remove the region adjoining persons and the floor. The result can be seen in Fig. 4c.
4: end if
5: Use T = (P / D) · H to obtain a threshold for estimating the border of the head region.
6: Find the extreme high point in Fig. 4c; assume it is m0.
7: Add m0 to a coordinate array M.
8: In a crowded, shoulder-to-shoulder scene, there may be more than one extreme high point m0 in Fig. 4c.
9: Traverse each coordinate of array M.
10: Calculate the head region as [m0 − T, m0].
11: Fill the head region from each point m0 in color.
12: The result can be seen in Fig. 4e.

After the above head detection step in the depth domain (see Fig. 4e), we switch to the RGB domain (see Fig. 5a) via contour detection on Fig. 4e. In the following, we track and count the detected heads in the surveillance region of Fig. 5a. To improve the processing speed, we do not use a traditional Kalman filter [2, 10] or particle filter [3, 4] based tracking method. Instead, we present a mixture model based tracking method that fuses inter-frame matches on position, shape size and distance. Because the detected heads are not occluded, our tracking and counting method need not consider the split and merge of heads. Our tracking scheme is defined as follows:

Definition 1 We define the area size difference between the head with ID i in frame k and the head with ID j in frame k + 1 as:

Dif(Si(k), Sj(k + 1)) = |Si(k) − Sj(k + 1)|    (2)

Definition 2 We define the centroid distance between the head with ID i in frame k and the head with ID j in frame k + 1 as:

Dis(pi(k), pj(k + 1)) = LDis    (3)

where LDis is given by:

LDis = sqrt((pix(k) − pjx(k + 1))^2 + (piy(k) − pjy(k + 1))^2)    (4)

Definition 3 We define the matching distance between the head with ID L in the current frame and the heads in the predicted area of the next frame, using the mixture model, as:

pL = α · Dis(pL(k), pM(k + 1)) + β · Dif(SL(k), SM(k + 1))    (5)

where α and β are weight coefficients, and M indexes a foreground head in the next frame. The matching matrix is defined as:

match[m, n] =
| p11 p12 p13 · · · p1n |
| p21 p22 p23 · · · p2n |
| p31 p32 p33 · · · p3n |
|  ·    ·    ·       ·  |
| pm1 pm2 pm3 · · · pmn |    (6)

where pij (i = 1...m, j = 1...n) is the matching distance between head i in the current frame and head j in the next frame.
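Putting Definitions 1–3 together, a minimal matching step might look as follows. This Python sketch is for illustration only: the greedy assignment, the head representation, and the values of α, β and the gating cost are our assumptions, not details from the paper.

```python
import math

def match_heads(prev_heads, next_heads, alpha=1.0, beta=0.5, max_cost=50.0):
    """Mixture-model matching per Defs. 1-3: cost = alpha * centroid
    distance + beta * area difference. Greedily assign each head in the
    current frame to its cheapest unmatched head in the next frame.
    Heads are dicts: {'pos': (x, y), 'size': area_in_pixels}."""
    # Matching matrix match[i][j] = p_ij, as in Eq. (6)
    match = [[alpha * math.dist(a['pos'], b['pos'])
              + beta * abs(a['size'] - b['size'])
              for b in next_heads] for a in prev_heads]
    pairs, used = [], set()
    for i, row in enumerate(match):
        j = min((j for j in range(len(row)) if j not in used),
                key=lambda j: row[j], default=None)
        if j is not None and row[j] <= max_cost:
            used.add(j)               # head j is now taken
            pairs.append((i, j))
    return pairs
```

Because the detected heads do not split or merge, a simple per-row minimum over the matching matrix suffices; no Kalman or particle filtering is required, which keeps the tracker fast.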

In this way, the heads can be tracked robustly. The bidirectional direction estimation for each head is simply based on the head's positions in the consecutive frames after it is first detected in the monitored region. When the head is about to leave the monitored region, we update the bidirectional count of entering/exiting people in the surveillance region. Figure 5b shows the result of head tracking and counting. A summary of the proposed algorithm is given in Algorithm 3.

Algorithm 3 Proposed algorithm for crowd counting by jointly using depth and color information
1: Initialization: In = 0, Out = 0
2: Find the maximum depth value in the depth image; denote it m0.
3: Detect the head region according to (1).
4: Track each detected head using the proposed mixture model based tracker, Eqs. (2)–(6).
5: Count each head bidirectionally using the relationship between consecutive frames.
6: Output: In, Out
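Step 5 of Algorithm 3 can be sketched as a line-crossing rule over a head's tracked positions. The virtual counting line and the crossing test below are our assumptions; the paper only states that counting uses the positions in consecutive frames.

```python
def count_crossings(track_ys, line_y):
    """Bidirectional counting sketch: a head whose tracked centroid
    crosses the counting line top-to-bottom counts as 'In', and
    bottom-to-top as 'Out'.
    track_ys: y-coordinates of one head's centroid, frame by frame."""
    ins = outs = 0
    for prev, cur in zip(track_ys, track_ys[1:]):
        if prev < line_y <= cur:      # moved downward across the line
            ins += 1
        elif cur <= line_y < prev:    # moved upward across the line
            outs += 1
    return ins, outs
```

Summing these per-track increments over all tracked heads yields the running In/Out totals output by Algorithm 3.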

4 Experiments

4.1 Experiment setup

In this paper, we propose a new way of crowd counting that jointly uses color and depth information from a commodity depth camera, the Microsoft Kinect. Figure 6 shows our capture setup in many different scenes. To demonstrate that the presented scheme is effective for this challenging task, we test our method in extensive different scenes. Based on the new scheme, we have built a practical system, and we evaluate its performance in both indoor and outdoor scenes with complex environments (e.g. heavy occlusions, shadows, dynamic backgrounds). Detailed experimental results are given in the following section. A crowd counting system is usually evaluated in two respects, accuracy and real-time performance, so we also compare our system with systems based on recent state-of-the-art crowd counting methods in both accuracy and processing speed. All of our experiments are conducted on an Intel E7500 3 GHz dual-core processor with 2 GB RAM.

Fig. 6 Experimental setup. We test the algorithm in many different scenes. The cameras in our experiments are labeled with red ellipses

4.2 Experimental results

First, we test our scene-adaptive scheme for crowd counting. As seen in Fig. 7, the captured images in these two typical scenes are chosen at random. The person sizes differ because the setup heights of the camera differ. The key point of our method is to detect each head robustly using the scene-adaptive head detection scheme. We now give the specific parameters of our experiment. In both scenes, we use the Kinect depth camera. The value of D is 3,060 mm, because the current Kinect's factual depth range is about 800 mm to 3,860 mm. We set P to 255. In practice, the variation in factual head height across people is not salient, so we assume H is 170 mm in all of our experiments. According to (1), we can compute the threshold for detecting the head region around the maximum depth value in the depth image. Meanwhile, we use the learning based scene-adaptive scheme to obtain the threshold Dfloor to remove the floor in different scenes. As seen in Fig. 8, the results demonstrate that the scheme adapts to different scenes without further modulation.

Fig. 7 Two typical surveillance scenes in our experiments. (top) Our experimental setup facing indoor scenes; (bottom) our experimental setup facing outdoor scenes
Next, we evaluate our proposed method in both indoor and outdoor scenes in many respects. Figure 9 shows a very crowded instance. Traditional color-information-only methods become very ineffective in such a complicated scene, but our approach remains effective: we can detect each head region accurately and then count the heads conveniently. Moreover, we compare our approach with current state-of-the-art approaches [1, 16]. Because we add depth information to the RGB information, it is a little inequitable to compare against RGB-only methods [1, 16]. However, the depth camera (Microsoft Kinect) is now very cheap and can generally replace a conventional camera. In this paper, we focus on designing a practical people counting system for complicated environments (see Fig. 1) in both indoor and outdoor scenes (see Fig. 6).

Fig. 8 Evaluation for our proposed scene-adaptive scheme for crowd counting in different scenes.
(left) One typical indoor scene. (right) One typical outdoor scene

Fig. 9 Testing our algorithm in a very crowded scene. a input crowded scene with heavy occlusions; d final head region extraction result from our algorithm

Fig. 10 Comparison of crowd counting precision with two state-of-the-art algorithms in the entrance scene, in both the mid-crowded instance and the heavy-crowded instance

Fig. 11 Comparison of crowd counting precision with two state-of-the-art algorithms in the corridor scene, in both the mid-crowded instance and the heavy-crowded instance



Fig. 12 Evaluation of our method in some abnormal instances. a with a bag; b with occlusion; c with a moving thing and shadow; d with a held bag

Table 1 Test of accuracy and speed of our method for crowd counting

Scene            In (truth / result / accuracy)    Out (truth / result / accuracy)   Speed
Building scene   350 / 338 / 96.6 %                286 / 280 / 97.9 %                Head detection 31.2 FPS; total algorithm 27.3 FPS
Entrance scene   520 / 509 / 97.9 %                545 / 536 / 98.3 %                Head detection 29.5 FPS; total algorithm 25.2 FPS
Corridor scene   1,360 / 1,334 / 98.1 %            1,180 / 1,139 / 96.5 %            Head detection 33.2 FPS; total algorithm 27.8 FPS
Export scene     1,050 / 988 / 94.1 %              997 / 969 / 97.2 %                Head detection 39.5 FPS; total algorithm 30.2 FPS
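The accuracy figures in Table 1 are simply the counted fraction of the ground truth, which can be checked with a one-liner (illustrative):

```python
def counting_accuracy(result, truth):
    """Accuracy as reported in Table 1: percentage of ground-truth
    people that the system counted, rounded to one decimal place."""
    return round(100.0 * result / truth, 1)
```

For example, the building scene gives counting_accuracy(338, 350) = 96.6 for In and counting_accuracy(280, 286) = 97.9 for Out, matching the table.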

The evaluation of a good crowd counting system considers counting accuracy and speed, and from this view we believe the comparison with [1, 16] is fair. In [1], the authors assume that at most three persons are in the scene at the same time, which is a restrictive condition for practical applications. The method of [16] relies on good moving blob segmentation, which is not commonly available in practical systems. As seen in Fig. 10, we compare our approach with the methods of [1, 16] in an entrance scene. In both the mid-crowded instance and the heavy-crowded instance, our method is markedly more effective than the others. We also show the comparison results of these three methods



Fig. 13 Typical overall results evaluating the performance of our proposed crowd counting system in complex scenes. a, b, and c show intermediate processing results of our system, and d shows the final results

in both the mid-crowded and heavy-crowded instances in a corridor scene (see Fig. 11). Thus our method is more effective than [1, 16] for crowd counting in complicated scenes.
Then, we evaluate the performance of our joint RGB and depth method for crowd counting in some abnormal instances. As seen in Fig. 12, these pictures are obtained from the outdoor scene shown at the bottom of Fig. 7. Figure 12a (with a bag), Fig. 12b (with occlusion), Fig. 12c (with a moving thing and shadow) and Fig. 12d (holding a bag) represent various error-prone instances for head counting. The bottom row of Fig. 12 shows the final head detection results. Our system is robust in Fig. 12a–c. In Fig. 12d, however, we also detect the bag, because its height equals that of a head. In this case, we need additional simple features (such as size, shape, etc.) to filter out the bag.
Finally, we run longer experiments in different scenes. Table 1 reports the accuracy and speed of crowd counting in four common daily scenes. All of the results in the building, entrance, corridor and export scenes show that our algorithm runs in real time while achieving very high counting accuracy. Some visual processing results of our approach can be seen in Fig. 13. Based on the proposed algorithm, we have implemented our crowd counting system in the C++ programming language, and the system has been running in our school. The practical results show our system is very useful.

5 Conclusions

In this paper, we propose a new method for crowd counting in complicated scenes by jointly using depth and color information. We evaluate our algorithm in many respects, and the experimental results demonstrate that the new scheme achieves significant improvement in both accuracy and speed. Based on the proposed approach, we have built a practical system for robust and fast crowd counting in complicated scenes. Our built system shows that jointly using depth and color information is a promising way to tackle the challenging task of crowd counting in different scenes.

Acknowledgements This work was supported by the China National Funds for Distinguished
Young Scientists under Grant No.60925010, the Natural Science Foundation of China under Grant
No.61272517, the Research Fund for the Doctoral Program of Higher Education of China under
Grant No.20120005130002, the Co-sponsored Project of Beijing Committee of Education, the Funds
for Creative Research Groups of China under Grant No.61121001, and the Program for Changjiang
Scholars and Innovative Research Team in University under Grant No.IRT1049.


Huiyuan Fu is currently working towards the Ph.D. degree with the Beijing Key Laboratory of
Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecom-
munications, China. His research interests include multimedia systems and networks, computer
vision and pattern recognition.

Huadong Ma received the Ph.D. degree in computer science from Institute of Computing Tech-
nology, Chinese Academy of Science (CAS), in 1995, the M.S. degree in computer science from
Shenyang Institute of Computing Technology, CAS in 1990, and the B.S. degree in mathematics
from Henan Normal University, China in 1984. He is a professor at the School of Computer
Science, Beijing University of Posts and Telecommunications, China. His research interests include
multimedia networks and systems, Internet of things and sensor networks. He has published over
100 papers in these fields.

Hongtian Xiao received the M.Sc. degree in computer science from the Beijing Key Laboratory of
Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecom-
munications, China, in 2013. His research interests include computer vision and pattern recognition.
