

Abstract: This paper proposes a video-based vehicle counting framework that combines deep learning object detection, object tracking, and trajectory processing to obtain traffic flow information. First, a dataset for vehicle object detection (VDD) and a standard dataset for verifying the vehicle counting results (VCD) were established. Object detection was then completed by deep learning with VDD. Using these detections, a matching algorithm was designed to perform multi-object tracking in combination with a traditional tracking method, yielding the trajectories of the moving objects. Finally, a trajectory counting algorithm based on region encoding is proposed: vehicles are counted according to their categories and moving routes to obtain detailed traffic flow information. The results demonstrate that the overall accuracy of our method for vehicle counting reaches more than 90%, and the framework runs at 20.7 frames/s on the VCD. The proposed vehicle counting framework is therefore capable of acquiring reliable traffic flow information and is likely applicable to intelligent traffic control and dynamic signal timing.

The graphical abstract illustrates the video-based vehicle counting framework as a three-component process of object detection, object tracking, and trajectory processing.

Published in: IEEE Access (Volume: 7)

Page(s): 64460 - 64470
DOI: 10.1109/ACCESS.2019.2914254

Date of Publication: 01 May 2019
Publisher: IEEE

Electronic ISSN: 2169-3536

SECTION I.
Introduction
Estimating the number of vehicles in a traffic video sequence is an essential component of intelligent transportation systems (ITS), as it provides valuable traffic flow information. The number of vehicles on the road reflects traffic conditions, such as traffic status, lane occupancy, and congestion levels, which can be utilized for accident warnings, congestion prevention, and automatic navigation [1]. Meanwhile, traffic flow information at different intersections during different time periods is used for dynamic signal timing [2], which can improve traffic flow and provide substantial economic benefits.

In traditional ITS, special sensors such as magnetic coil, microwave, or ultrasonic detectors are the primary tools for counting vehicles. However, these sensors are limited in the detail of the information they can provide, and their installation is costly. Video-based vehicle counting systems have begun to attract attention due to the development of image processing technology and its powerful capabilities. Compared to traditional sensor methods, these systems provide more traffic parameters, such as vehicle category, density, speed, and traffic accidents, at low cost, with simple installation and easy maintenance [3]. The machine vision vehicle counting method is an integrated procedure comprising detection, tracking, and trajectory processing.

The detection of vehicle objects is the first step in obtaining traffic flow information. The purpose of object detection is to obtain the location and classification of the objects in an image, and its primary task is to acquire the features of the objects. In the past, detection mainly relied on hand-crafted features such as SIFT [4], HOG [5], and Haar-like features [6], which were fed into classifiers such as SVM and AdaBoost for training. Felzenszwalb established a deformable part model (DPM) using both HOG and SVM [7]. It performs better when the object undergoes some deformation or scale change; however, it cannot adapt to large rotations, has poor stability, and is slow to compute. Classification frameworks based on deep convolutional neural networks (DCNN) were later reported. Their accuracy greatly improved over traditional classification algorithms, which laid a foundation for deep learning-based object detection research. Ross Girshick et al. used a region proposal method, namely a selective search algorithm, to search all possible regions of the object in regions with CNN (RCNN) [8]. Due to the slow detection speed of RCNN, He et al. proposed spatial pyramid pooling in deep convolutional networks for visual recognition (SPPNet) [9]. Ross Girshick et al. then proposed Fast RCNN [10], adding bounding box regression and a multi-task loss function. Ren added a new region proposal network to the Fast RCNN algorithm and proposed Faster RCNN [11]. The accuracy of Faster RCNN is greatly improved and is rated among the best of all current detection algorithms, but speed is one of its drawbacks. Liu proposed an end-to-end detection algorithm, the single shot multibox detector (SSD) [12], which obtains proposal regions by uniform extraction and greatly enhances detection speed. Most recently, Redmon et al. proposed YOLOv3 [13], using multi-scale prediction and an improved base classification network, with fast detection speed, a low false detection rate, and strong versatility.

There are two main solutions for current object tracking. (1) Detect objects in each frame of the video sequence, complete tracking based on the detection results of consecutive frames, and obtain the trajectory information. (2) Detect objects in the initial frame to obtain their features, search the region to match those features in the subsequent image sequence, and track the objects to get the trajectory information. In the first solution, Meyer et al. proposed a contour-based target detection and tracking method with good results, but it has poor anti-noise ability and a high computational load. Object tracking based on deep learning [14], [15] has higher detection accuracy than traditional algorithms; however, the detection is unstable, and a target may be missed, which leads to tracking failure. The second solution relies less on object detection, which avoids the disadvantage of the first one. The extraction of object features is the key to this approach. One way is to extract feature points from the object; commonly used feature point extraction methods include the Harris corner detector [16] and SIFT [4]. Feature point-based methods [17], [18] can adapt to rotation and illumination changes of the object, but extracting too many feature points makes matching difficult, while extracting too few easily causes false detections. Additionally, the feature extraction process is complicated and time consuming. Another way is to extract features of the object as a whole, including image edges, shapes, textures, color histograms, etc.; the reliability of the object features can be enhanced by combining multiple features. After feature extraction, a similarity measurement is used to relocate the object and achieve tracking. Feature-based tracking algorithms are insensitive to changes in the scale, deformation, and brightness of a moving object. Even if the object is partially occluded, the tracking task can be completed based on the partial feature information. However, they are sensitive to image blur, noise, etc., and the quality of the extracted features depends on the setting of various extraction parameters. In addition, the correspondence between consecutive frames is difficult to determine, which has a negative impact on tracking performance.

The existing vehicle counting approaches can be mainly divided into three categories: regression-based methods [19], cluster-based methods [20], [21], and detection-based methods [22]–[24]. The regression-based method aims to learn a regression function from detection region characteristics. The cluster-based method tracks features of objects to obtain their trajectories, and the trajectories are clustered to count the objects. The detection-based method can be further divided into four categories: (1) the frame difference method [25]; (2) the optical flow method [26]; (3) the background subtraction (BS) method [27], [28]; and (4) the convolutional neural network (CNN) method [29], [30]. The frame difference method is fast and simple; however, only parts of the moving objects are detected through the comparison between moving objects and background in consecutive frames. In contrast, the optical flow method computes the optical flow of the entire picture, resulting in a slower pace. The BS method uses background modeling to find moving objects by comparing the difference between the input image and the background; one of its main weaknesses is that it is difficult to build a good background model in a real scene (a minimal example is sketched after this paragraph). CNN is currently the most popular object detection method. In this method, the objects of interest are annotated in a large sample to build up a dataset, a model is obtained by training on the dataset with a CNN, and objects of interest in an image can then be detected using the trained model. There are, however, some common problems in the counting methods mentioned above: video angle limitations [31], slow calculation speed [32], and an inability to handle complex scenes [33].
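As an illustration of the BS approach, the following minimal sketch uses OpenCV's MOG2 background subtractor; the video path, blur kernel, and area threshold are our own illustrative choices, not from the cited implementations:

```python
import cv2

# Minimal background-subtraction (BS) detector using MOG2, for illustration.
cap = cv2.VideoCapture("traffic.mp4")   # hypothetical input video
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)                       # foreground mask
    mask = cv2.medianBlur(mask, 5)                       # suppress speckle noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) > 400:                     # ignore tiny blobs
            x, y, w, h = cv2.boundingRect(c)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("BS detection", frame)
    if cv2.waitKey(1) == 27:                             # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```

As the survey above notes, such a sketch works only as long as the learned background model matches the scene; lighting changes and stopped vehicles quickly degrade it.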

In this study, we established two datasets, a vehicle counting dataset (VCD) and a vehicle detection dataset (VDD).
Additionally, we proposed a video-based vehicle counting framework including three steps: object detection, object
tracking, and trajectory processing. The VDD was used in the object detection step, and the VCD was used to validate
the framework. Our results showed that the proposed framework is capable of obtaining reliable traffic flow information.

SECTION II.
Dataset
A. Vehicle Counting Dataset (VCD)
In this section, we present the Vehicle Counting Dataset (VCD), introduced for the problem of vehicle counting in real-world conditions. It is used in the Experiment section for validation. The data is publicly accessible and can be downloaded from the following link: https://pan.baidu.com/s/1_iwXh8OijACMb5WTxyCuOg.

The data consists of videos of urban road and intersection scenes recorded using a Miovision camera. The camera is mounted on the roadside and captures vehicles entering or exiting the different roads and intersections. Fig. 1 illustrates nine representative scenes from the dataset. Because these are real-world scenarios, the complexity of the data is apparent from Fig. 1. Videos in VCD were taken from diverse angles and time periods.


FIGURE 1.
Different scenarios captured from VCD.

We divide the videos in the dataset based on the road and intersection types. There are three kinds of scenes in the dataset: crossroad, T-junction, and straight road. For the crossroad, data were collected on six different days; two days each were spent on the T-junction and the straight road. Thus, in total, there are ten different scenes in our dataset. In Fig. 2, we provide the folder structure of the proposed dataset.

FIGURE 2.
Folder structure of VCD.

To describe Fig. 2 in more detail: there is one video file per scene, and each video lasts one hour. Each scene is named after the road or intersection, and each video is named after its acquisition date. For each scene, we also give a screenshot of the video with the region encoding information, as shown in Fig. 3.


FIGURE 3.
Region encoding diagram of the video named 20180205 in VCD.

For each video, we recorded each vehicle entering and exiting the road or intersection, and counted the numbers by vehicle category and moving route. The counting result was made into a label file, placed in the scene folder next to the corresponding video. The label format of the counting result is shown in Table 1.

TABLE 1 Label Diagram of Counting Result Named 20180205

Table 1 takes the crossroad as an example and summarizes the number of vehicles of each type entering and exiting the area for each sub-category. The total number is also provided.

1) Video Information

In Table 2, we summarize the basic attributes of the videos in our dataset. We note that these video attributes, along with the camera parameter details, are also provided in each folder of the dataset.

TABLE 2 Basic Attributes of the Videos in VCD


B. Vehicle Detection Dataset (VDD)
In this section, we describe the VDD (Vehicle Detection Dataset), specifically designed for the task of vehicle object detection. The dataset is publicly available for download at the following URL: https://pan.baidu.com/s/1CHO3fjy00T1qOcN6ereqAQ. VDD contains 6134 RGB images of 720×480 pixels each. Each image has corresponding manually created annotations, including the size of the image, the number of objects in the image, and the bounding box coordinates and classification of each object. There are three categories in the dataset: car, truck, and bus. Fig. 4 shows the categories in the dataset, as well as seven random images from each category.



FIGURE 4.
Category diagram of VDD.

SECTION III.
Proposed Framework
In this section, we introduce the proposed vehicle counting framework. Fig. 5 gives the strategy of the proposed object detection and tracking system. There are three main stages in this framework: object detection, object tracking, and trajectory processing.


FIGURE 5.
Strategy of the vehicle counting framework.

Object detection gives the bounding box and classification of each object in a frame. Object tracking obtains the trajectory of each object while it appears in the video stream; it is completed by a combination of the proposed template matching method and the KCF [17] algorithm. Trajectory processing then produces the detailed counting result using a region encoding method. A high-level sketch of how the three stages compose is given below.
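The following Python skeleton is our own illustration of the data flow, not the authors' code; all three stage functions are placeholders for Sections IV-VI, stubbed here only so the sketch runs end to end:

```python
# Skeleton of the three-stage pipeline. All stage functions are placeholders
# for Sections IV-VI and are stubbed so the sketch is executable.
def detect_objects(frame):
    """Section IV: YOLOv3 detection -> list of (bbox, cls, confidence)."""
    return []

def update_trajectories(active, detections, frame):
    """Section V: extend/spawn trajectories -> (still active, newly completed)."""
    return active, []

def count_by_route(trajectories):
    """Section VI: region-encoding counting -> {(cls, route): count}."""
    return {}

def count_vehicles(video_frames):
    active, finished = [], []          # TS and TA in the paper's notation
    for frame in video_frames:
        detections = detect_objects(frame)
        active, done = update_trajectories(active, detections, frame)
        finished.extend(done)
    return count_by_route(finished)

print(count_vehicles([]))              # -> {} on an empty video
```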

SECTION IV.
Object Detection
The aim of the object detection stage is to identify the category and location of the vehicle objects in a picture. Over roughly the past decade, object detection algorithms for natural images have fallen into two categories: those based on traditional manual features (before 2013) and those based on deep learning (used thereafter). Since the emergence of deep learning, object detection has made a huge breakthrough. The two most important kinds of deep learning detectors are the region proposal-based methods represented by RCNN [8] (Fast RCNN [10], Faster RCNN [11], etc.) and the regression-based methods represented by YOLO (YOLOv3 [13], SSD [12], etc.). The former are superior in accuracy, and the latter are superior in speed.

Because deep learning methods have excellent performance in object detection, this framework selects the YOLOv3 algorithm to complete the detection task. By training on the VDD dataset introduced in Section II, we obtain the detection information of the vehicle objects in each frame of the video sequence. A minimal inference sketch is given below.
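For illustration, a detection step of this kind can be run with OpenCV's DNN module on a Darknet-format YOLOv3 model. The sketch below is our own; the file names are placeholders for a model trained on VDD, and the 0.7 confidence threshold mirrors the ξ used later in Section VII:

```python
import cv2
import numpy as np

# Inference with a Darknet-format YOLOv3 model via OpenCV's DNN module.
# "yolov3-vdd.cfg" and "yolov3-vdd.weights" are hypothetical file names.
net = cv2.dnn.readNetFromDarknet("yolov3-vdd.cfg", "yolov3-vdd.weights")
out_names = net.getUnconnectedOutLayersNames()
classes = ["car", "truck", "bus"]               # the three VDD categories

def detect(frame, conf_thresh=0.7, nms_thresh=0.4):   # conf_thresh mirrors xi
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    boxes, confs, cls_ids = [], [], []
    for out in net.forward(out_names):
        for row in out:                  # row = [cx, cy, bw, bh, obj, scores...]
            scores = row[5:]
            cls = int(np.argmax(scores))
            conf = float(scores[cls])
            if conf > conf_thresh:
                cx, cy, bw, bh = row[:4] * np.array([w, h, w, h])
                boxes.append([int(cx - bw / 2), int(cy - bh / 2),
                              int(bw), int(bh)])
                confs.append(conf)
                cls_ids.append(cls)
    keep = cv2.dnn.NMSBoxes(boxes, confs, conf_thresh, nms_thresh)
    return [(boxes[i], classes[cls_ids[i]], confs[i])
            for i in np.array(keep).flatten()]
```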

Frame_t represents all the information of the t-th frame. We read I_t from the video sequence in a loop; t is the frame number, and I_t is the pixel information of the t-th image, with three attributes: width (W), height (H), and size (S). We use DB_t = {BB_i, i = 1, ..., n} to represent the detection result of I_t. Each BB_i represents the detection result of one object and includes 11 attributes: the left-top (LT = (x, y)), right-bottom (RB = (x, y)), and center (Cen = (x, y)) coordinates; the width (W), height (H), and size (S) of the bounding box; the pixel information of the object bounding box in the current image (Roi); the detection confidence (P); the classification of the object (Cls); the frame number (t); and a predicted bounding box in the next frame (PB). PB has the same attribute structure as BB; its Roi is taken from I_{t+1} by a tracking method, and its Cls and t are kept consistent with the corresponding BB. If no object is detected in I_t, DB_t is null. If a BB in I_t has no prediction result in I_{t+1}, its PB is null. Finally, we bind I_t and DB_t into Frame_t.
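This record maps naturally onto a small data structure. A minimal sketch follows, with field names taken from the notation above; the concrete types and defaults are our own choices:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class BB:
    """One detection result, mirroring the 11 attributes of BB."""
    lt: Tuple[int, int]          # left-top (x, y)
    rb: Tuple[int, int]          # right-bottom (x, y)
    cen: Tuple[int, int]         # center (x, y)
    w: int                       # width of the bounding box
    h: int                       # height of the bounding box
    s: int                       # size (area) of the bounding box
    roi: np.ndarray              # pixels of the box in the current image
    p: float                     # detection confidence
    cls: str                     # object classification
    t: int                       # frame number
    pb: Optional["BB"] = None    # predicted box in frame t+1, may be None

@dataclass
class Frame:
    """Binds the image I_t and its detections DB_t into Frame_t."""
    t: int
    image: np.ndarray
    db: List[BB] = field(default_factory=list)
```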

SECTION V.
Object Tracking
In the object tracking stage, we propose a multi-object tracking method based on detection. Given the input Frame_t, it outputs all trajectories TA = {T_i, i = 1, ..., n} of the vehicle objects in the video, where T_i = {B_j, j = 1, ..., n} and each B_j can be a BB or a PB. TS = {T_i, i = 1, ..., m}, with T_i = {B_j, j = 1, ..., m, last}, is an intermediate variable used to save the trajectories that are still being tracked. Each T_i has three attributes: a timer (Timer) that counts the number of consecutive occurrences of PB in the trajectory, a classification (Cls) that represents the vehicle category of the trajectory, and a length (Len) that represents the number of nodes B in the trajectory.

1) Proposed Matching Method

We propose a matching method to complete the two tasks of creating new trajectories and tracking. First, we calculate the overlap (OL) and distance (Dis) between the bounding boxes of two objects (BB or PB). Because BB and PB share the same attribute structure, we describe both as BB in the following. Formula 1(BB_1, BB_2) is used to calculate OL:

\begin{align}
W &= \min(x_1^{RB}, x_2^{RB}) - \max(x_1^{LT}, x_2^{LT}) \nonumber\\
H &= \min(y_1^{RB}, y_2^{RB}) - \max(y_1^{LT}, y_2^{LT}) \nonumber\\
OL &= \begin{cases} 0 & \text{if } W \le 0 \text{ or } H \le 0 \\[2pt] \dfrac{W \cdot H}{S_1 + S_2 - W \cdot H} & \text{otherwise} \end{cases} \tag{1}
\end{align}

where x_1^{RB} means the attribute x of RB of BB_1, and similarly for x_2^{RB}, x_1^{LT}, and x_2^{LT}; S_1 and S_2 are the attribute S of BB_1 and BB_2. Formula 2(BB_1, BB_2) is used to calculate Dis:

\begin{equation}
Dis = \sqrt{(x_1^{Cen} - x_2^{Cen})^2 + (y_1^{Cen} - y_2^{Cen})^2} \tag{2}
\end{equation}

where x_1^{Cen} means the attribute x of Cen of BB_1, and similarly for x_2^{Cen}, y_1^{Cen}, and y_2^{Cen}. The matching value (MV) is then calculated from OL and Dis in Algorithm 1.

SECTION Algorithm 1
Calculation of the matching value of two bounding boxes.
Require: BB_1, BB_2, I_t
Ensure: MV
OL = Formula 1(BB_1, BB_2)
Dis = Formula 2(BB_1, BB_2)
IDis = sqrt((W ⇐ I_t)^2 + (H ⇐ I_t)^2)
if OL = 0 or Dis ≥ max(W, H ⇐ BB_1) then
  return MV = 0
else
  return MV = 0.8 × OL + 0.2 × Dis/IDis
end if
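Read together, Formulas 1 and 2 and Algorithm 1 amount to the following computation. This sketch follows our reconstruction of the garbled source, with boxes given as (x1, y1, x2, y2) tuples of our own convention; the 0.8/0.2 weights are those of Algorithm 1:

```python
import math

def overlap(b1, b2):
    """Formula 1: intersection-over-union of two (x1, y1, x2, y2) boxes."""
    w = min(b1[2], b2[2]) - max(b1[0], b2[0])
    h = min(b1[3], b2[3]) - max(b1[1], b2[1])
    if w <= 0 or h <= 0:
        return 0.0
    s1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    s2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return w * h / (s1 + s2 - w * h)

def distance(b1, b2):
    """Formula 2: Euclidean distance between the box centers."""
    return math.hypot((b1[0] + b1[2]) / 2 - (b2[0] + b2[2]) / 2,
                      (b1[1] + b1[3]) / 2 - (b2[1] + b2[3]) / 2)

def matching_value(b1, b2, img_w, img_h):
    """Algorithm 1 as we reconstruct it: blend of overlap and distance."""
    ol = overlap(b1, b2)
    dis = distance(b1, b2)
    idis = math.hypot(img_w, img_h)                  # image diagonal (IDis)
    if ol == 0 or dis >= max(b1[2] - b1[0], b1[3] - b1[1]):
        return 0.0
    return 0.8 * ol + 0.2 * dis / idis
```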

2) Select a High-Confidence Tracker

In order to complete multi-object tracking, a high-confidence tracker is also needed to provide additional information. We choose the KCF algorithm [17] as the high-confidence tracker. The KCF method creates a KCF tracker from a BB in I_{t1}, uses the extraction method init() ⇐ KCF to obtain the features of the BB, and finally uses the search method update() ⇐ KCF to find the object in another image I_{t2}. Algorithm 2 shows how we use KCF to support the tracking stage.

SECTION Algorithm 2
Tracking an object using the KCF algorithm.
Require: BB ⇐ I_{t1}, I_{t2}, KCF
Ensure: PB ⇐ BB
initialize init(), update() ⇐ KCF; init(Roi ⇐ BB)
if PB = update(I_{t2}) is not null then
  return PB
else
  return null
end if
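OpenCV ships a KCF implementation exposing exactly this init()/update() pattern. A minimal single-object sketch follows (variable names are ours; depending on the OpenCV build, the constructor may live under cv2.legacy):

```python
import cv2

def track_once(frame_t1, bbox, frame_t2):
    """Algorithm 2 sketch: init KCF on a BB in I_t1, search for it in I_t2.

    bbox is an (x, y, w, h) tuple; returns the predicted box PB or None.
    """
    try:
        tracker = cv2.TrackerKCF_create()            # recent OpenCV builds
    except AttributeError:
        tracker = cv2.legacy.TrackerKCF_create()     # older contrib builds
    tracker.init(frame_t1, bbox)                     # init() <= KCF
    ok, pb = tracker.update(frame_t2)                # update() <= KCF
    return tuple(int(v) for v in pb) if ok else None
```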

3) Find the Next Position of a Trajectory

This part introduces how to find the next position of a trajectory, given the current frame Frame_t and TS as input. First, we use Algorithm 1 to match B_last ⇐ T_i ⇐ TS with each BB ⇐ DB_t, and simultaneously perform Algorithm 2 with B_last ⇐ T_i ⇐ TS to obtain PB ⇐ B_last ⇐ T_i as feedback. Because the number of trajectories and the number of detected objects generally differ, there are three cases after the processing of Algorithm 1. Case 1: B_last ⇐ T_i is matched successfully; the matched BB is added as the next node of T_i. Case 2: matching of B_last ⇐ T_i fails; PB ⇐ B_last is added as the next node of T_i (if PB is null, no new node is added), and the Timer ⇐ T_i is incremented by 1. Case 3: a BB ⇐ DB_t matches no trajectory; a new trajectory T_new is created with this BB as its first node and added to TS. Algorithm 3 is shown as follows:

SECTION Algorithm 3
Tracking objects.
Require: TS, Frame_t
Ensure: TS
initialize: keep the BB_i ⇐ DB_t with P > ξ; I_t ⇐ Frame_t
for all BB_i ⇐ DB_t with B_last ⇐ T_i do
  Algorithm 1(BB_i, B_last)
  Algorithm 2(B_last, I_t)
end for
case 1: B_{n+1} ⇐ T_i = BB_i
case 2: B_{n+1} ⇐ T_i = PB ⇐ B_last; Timer ⇐ T_i += 1
case 3: T_new = {BB_i}
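A compact sketch of the three cases, using greedy matching on the MV and reusing the matching_value helper from the earlier sketch. The trajectory representation and the MV threshold are our own illustrative choices, not the authors' code:

```python
def step(trajectories, detections, img_w, img_h, mv_thresh=0.3):
    """One Algorithm 3 step: extend, coast, or spawn trajectories.

    trajectories: dicts {"nodes": [boxes], "timer": int, "pb": box or None}
    detections:   (x1, y1, x2, y2) boxes of the current frame (P > xi kept)
    """
    unmatched = set(range(len(detections)))
    for traj in trajectories:
        last = traj["nodes"][-1]
        best, best_mv = None, mv_thresh
        for i in unmatched:
            mv = matching_value(detections[i], last, img_w, img_h)
            if mv > best_mv:
                best, best_mv = i, mv
        if best is not None:                 # case 1: detection matched
            traj["nodes"].append(detections[best])
            traj["timer"] = 0                # reset consecutive-PB counter
            unmatched.discard(best)
        else:                                # case 2: fall back to the KCF PB
            if traj["pb"] is not None:
                traj["nodes"].append(traj["pb"])
            traj["timer"] += 1
    for i in unmatched:                      # case 3: spawn a new trajectory
        trajectories.append({"nodes": [detections[i]], "timer": 0, "pb": None})
    return trajectories
```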

SECTION Algorithm 4
Obtaining the complete trajectory of each object.
Require: TS, TA, I_t
Ensure: TS, TA
for all T_i ⇐ TS, with Cen, W, H ⇐ B_last ⇐ T_i do
  if Timer ⇐ T_i > 30 or B_last ⇐ T_i is out of the border of I_t then
    if Len ⇐ T_i > η then
      add T_i to TA
    end if
    delete T_i from TS
  end if
end for
return TS, TA

4) Update Approach

When an object has left the video, we need to move its trajectory from TS to TA. Two conditions are used to judge the departure of an object: Timer ⇐ T_i > 30, or B_last ⇐ T_i is located on the edge of the image.

5) Multi-Object Tracking Method

The whole process of the proposed multi-object tracking method is as follows. First, read the video frames in a loop and obtain the detection result of each frame. Second, use the strategy of finding the next node of each trajectory to continuously track the objects. Finally, apply the update approach, which is necessary to ensure the stability of tracking, and store the complete trajectories of all objects in the video. Algorithm 5 illustrates the whole process.

SECTION Algorithm 5
Obtaining the complete trajectories of all objects.
Require: sequence of Frame_t
Ensure: TA
initialize TS = {}, TA = {}
for all Frame_t do
  TS = Algorithm 3(Frame_t, TS)
  TS, TA = Algorithm 4(TS, TA)
end for
return TA
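Algorithm 5 then reduces to a simple driver loop. The sketch below reuses the step function above and applies an Algorithm 4 style retirement rule; η = 30 and the 30-frame timer follow the values given in Sections V and VII-B, while everything else is our own illustration:

```python
def track_video(frames, detector, img_w, img_h, eta=30, max_coast=30):
    """Algorithm 5 sketch: run Algorithm 3 then Algorithm 4 on every frame."""
    ts, ta = [], []                          # active and completed trajectories
    for frame in frames:
        ts = step(ts, detector(frame), img_w, img_h)        # Algorithm 3
        keep = []
        for traj in ts:                                     # Algorithm 4
            x1, y1, x2, y2 = traj["nodes"][-1]
            left = (traj["timer"] > max_coast or
                    x1 <= 0 or y1 <= 0 or x2 >= img_w or y2 >= img_h)
            if left:
                if len(traj["nodes"]) > eta:  # Len > eta: keep, else discard
                    ta.append(traj)
            else:
                keep.append(traj)
        ts = keep
    return ta
```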

SECTION VI.
Trajectory Processing
Due to the complexity of the scene and the uncertainty of vehicle motion, the trajectories obtained in Section V are cluttered. In order to calculate the traffic flow information of roads and intersections, the moving direction and category of each vehicle object are particularly important. The purpose of trajectory processing is to obtain the direction and category information of each trajectory, and then to classify and count them.

The first step is to confirm the category of each trajectory (where car = 1, bus = 2, and truck = 3): the category of a trajectory is determined by weighting the categories of all its node frames. The specific algorithm is as follows:

SECTION Algorithm 6
Calculating the category of each trajectory from its nodes.
Require: TA
Ensure: TA
initialize B_j ⇐ T_i
for all T_i ⇐ TA do
  Cls ⇐ T_i = num/den
end for
return TA
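The num/den ratio in Algorithm 6 is not fully recoverable from the source; one plausible realization of "weighting all the node frame categories" is a confidence-weighted vote, sketched below under that assumption:

```python
from collections import defaultdict

def trajectory_category(nodes):
    """One plausible reading of Algorithm 6: confidence-weighted vote.

    nodes: (cls, confidence) pairs with cls in {1: car, 2: bus, 3: truck}.
    """
    weight = defaultdict(float)
    for cls, conf in nodes:
        weight[cls] += conf                # accumulate per-class confidence
    return max(weight, key=weight.get)     # the heaviest class wins

print(trajectory_category([(1, 0.9), (3, 0.6), (1, 0.8)]))   # -> 1 (car)
```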

Then, clustering is performed by extracting the coordinates of the starting points of the trajectories, and several intersection areas in the image are obtained. The partition model is shown in Fig. 6. Finally, the vehicle counting result is obtained according to the region encoding model and the areas in which the start and end points of each trajectory fall; a counting sketch follows Fig. 6.


FIGURE 6.
Region encoding of different scenes.
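With the region encoding in place, the counting itself is bookkeeping: each completed trajectory increments one counter keyed by its category and its (entry region, exit region) route. The sketch below is illustrative; region_of is a placeholder for the Fig. 6 partition, and each trajectory is assumed to carry its Algorithm 6 category in a cls field:

```python
from collections import Counter

def region_of(point, regions):
    """Placeholder for the Fig. 6 partition: map a point to a region label."""
    for label, (x1, y1, x2, y2) in regions.items():
        if x1 <= point[0] <= x2 and y1 <= point[1] <= y2:
            return label
    return None

def count_routes(trajectories, regions):
    """Count trajectories per (category, entry region, exit region)."""
    counts = Counter()
    for traj in trajectories:
        start = region_of(traj["nodes"][0][:2], regions)   # first node corner
        end = region_of(traj["nodes"][-1][:2], regions)    # last node corner
        if start and end and start != end:   # count only across two regions
            counts[(traj["cls"], start, end)] += 1
    return counts
```

Requiring start != end is what makes the method robust to broken trajectories, as discussed in Section VII-B.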

SECTION VII.
Experimental Results
A. Detection Result
In this section, the main purpose of the experiment is to choose an optimal detection algorithm for our framework. In order to reach a good detection effect, we chose deep learning methods. With different deep learning algorithms, we trained on the training set of VDD as proposed in Section II and verified on the test set to obtain the results of the different networks in terms of detection speed and accuracy. We selected the most advanced SSD300, SSD512, YOLOv3, and Faster RCNN algorithms to compare their object detection outcomes. We performed the following experiment on a single 1080Ti graphics card. In training, we used a 0.001 learning rate for 40 k iterations, then continued training for 10 k iterations each with 0.0001 and 0.00001. When using the trained model for validation, we determined the precision and speed using thresholds of IoU (Intersection over Union) = 0.5 and ξ = 0.7. Precision and recall were calculated as:

\begin{align}
Recall &= \frac{TruePositive}{TruePositive + FalseNegative}, \nonumber\\
Precision &= \frac{TruePositive}{TruePositive + FalsePositive} \tag{3}
\end{align}

TruePositive denotes a correctly detected target, FalseNegative denotes a target that was not detected, and FalsePositive denotes an erroneous detection.

The resolutions of the input images for Faster RCNN, SSD300, SSD512, and YOLOv3 are 720×480, 300×300, 512×512, and 416×416, respectively. Larger input sizes led to better precision and recall. Faster RCNN exhibited the best precision but the lowest speed. SSD300 and YOLOv3 are both real-time detection methods, but the mean average precision (mAP) of YOLOv3 was 3.1% higher. With its larger input size, SSD512 displayed 1.5% higher mAP than YOLOv3, but its speed was more than three-fold slower (Table 3). Taken together, we recommend the YOLOv3 algorithm to detect objects in the vehicle counting framework.

TABLE 3 Result of Detection on VDD Test


B. Counting Result
Our counting method is based on the vehicle category and moving route. In this paper, there are three vehicle categories and three moving routes at each mouth of an intersection. Therefore, we have 36, 18, and 6 counters for the crossroad, T-junction, and straight road, respectively. We performed the counting experiment using the videos of the VCD dataset with our framework and compared the counting results with the labels. The threshold η in Algorithm 4 means that when the number of nodes of a trajectory is less than η, it is considered an error trajectory and discarded; based on our experience, it is set to 30. The threshold ξ in Algorithm 3 means that when the P of a detection box is higher than ξ, the detection is considered reliable; ξ is set to 0.7, also based on our experience. The errors in the tables below are obtained by:

\begin{equation}
Error = \frac{Framework\ result - Label\ result}{Label\ result} \tag{4}
\end{equation}

TABLE 4 The Label Result of the Video From VCD Named 20180324

TABLE 5 The Framework Result of the Video From VCD Named 20180324


TABLE 6 The Vehicle Counting Error of the Video From VCD Named 20180324

TABLE 7 The Label Result of the Video From VCD Named 20180328

TABLE 8 The Framework Result of the Video From VCD Named 20180328

We selected three different scenarios in the VCD dataset; the experimental results are displayed in Tables 4–12, and the corresponding video screenshots are shown in Figs. 7, 8, and 9. They are the video of the crossroad scene named 20180324, the video of the T-junction scene named 20180328, and the video of the straight road scene named 20180202.

TABLE 9 The Vehicle Counting Error of the Video From VCD Named 20180328

TABLE 10 The Label Result of the Video From VCD Named 20180202


TABLE 11 The Framework Result of the Video From VCD Named 20180202


TABLE 12 The Vehicle Counting Error of the Video From VCD Named 20180202


FIGURE 7.
Region encoding diagram of the video (20180324).


FIGURE 8.
Region encoding diagram of the video (20180328).



FIGURE 9.
Region encoding diagram of the video (20180202).

As shown in Tables 4, 5, and 6 with Fig. 7, the overall error detected at the crossroad was −3.1%. Generally, we found that the larger the sample volume, the lower the absolute value of the error. For example, at road mouths A to D, the counted vehicle numbers were 522, 307, 569, and 176, and the errors detected using the framework were −4.6%, −5.2%, 2.4%, and −12.5%, respectively. The errors at the busier mouths were small and similar in magnitude, indicating that the results obtained using our framework are highly reliable.

The overall error determined at the T-junction was 1.3% (Tables 7, 8, and 9 with Fig. 8). The correlation between sample volume and error rate was similar to that observed at the crossroad, suggesting that the detection results of our framework are remarkably accurate when the sample volume is large enough.

Tables 10, 11, and 12 with Fig. 9 show that the overall error detected on the straight road was −4.9%. In this scenario, the error for trucks was the highest among the three vehicle categories, likely because fewer trucks passed during this time period than other vehicles: only 24 trucks were labeled, whereas the numbers of cars and buses were 2545 and 204, respectively.

Comparing the total errors of the three road scenarios, the overall error at the T-junction was the lowest, whereas the straight road exhibited the highest error. The high error on the straight road was likely due to the video angle, which was fixed for vehicle counting in a single direction. In contrast, at the crossroad and the T-junction, the errors in different directions could be either positive or negative, so the total error was low after the errors from these directions were summed. Comparing the crossroad and the T-junction, the errors at the T-junction were generally lower because the intersection is less complex, which results in fewer camera occlusion problems; occlusion increases both detection and tracking errors. In the T-junction experiment, the error in the directions related to mouth B was high because it was closest to the camera: vehicles there occupied a large proportion of the image, and occlusion was severe. Therefore, the camera angle also plays a significant role in the accuracy of object detection.

The outcomes of the experiments in this study are generally in line with our expectations. Although errors caused by occlusion still exist, the counting method is not significantly affected by them, because a trajectory is counted only when its start point and end point fall in two different areas. If a trajectory breaks or goes missing during the target's movement, the error can be eliminated by this counting rule, thereby achieving higher counting accuracy.
C. Running Time Result
As a video-based framework, both accuracy and speed are important. In this section, a speed experiment is conducted to evaluate the proposed method. Table 13 shows the running time results on the three videos of Section VII-B, and Table 14 gives the running time results of all videos in the VCD, with the average frames per second for the different mouth types. The real-time rate in Table 13 is obtained by:

\begin{equation}
Realtime\ rate = \frac{Framework\ duration}{Video\ duration} \tag{5}
\end{equation}

The smaller the value of the real-time rate, the faster the framework. When the real-time rate is less than or equal to 1, the proposed framework achieves real-time processing.

TABLE 13 Video Running Time Result

TABLE 14 VCD Running Time Result

As can be seen from Table 13, when the vehicle numbers are 1574, 1505, and 2773, the real-time rates are 1.3, 1.2, and 1.4, respectively. As the number of vehicles in the video increases, the running time of the proposed framework also increases, because a large number of vehicles makes the object tracking stage take more time. Therefore, the number of vehicles in the video is positively correlated with the running time of the proposed framework. Table 14 shows that the framework running rates for the different mouth types are 18.9 fps, 20.4 fps, and 22.7 fps for the crossroad, T-junction, and straight road, respectively. The complexity of the video scene thus also affects the speed of the framework.

The experimental results in this section show that the speed of the framework is affected by the number of vehicles in the video and by the complexity of the video scene. The running rate of the proposed framework is 20.7 fps on the VCD. Whether the framework counts traffic flow in real time therefore also depends on the video frame rate.

SECTION VIII.
Conclusion
In this paper, a video-based vehicle counting framework for traffic scenes is proposed and tested in different traffic scenes. The results show that the accuracy of the detection method using YOLOv3 with our dataset is high, reaching 87.6% even in quite complex traffic conditions. Moreover, the framework is capable of counting vehicles by category and moving route with an overall accuracy of more than 90%, and it runs at 20.7 fps on the VCD. In future studies, we will also detect pedestrians and cyclists. The final framework will provide more insights to improve traffic flow in different settings, such as intersections near hospitals and commercial centers.
