Face Tracking With Convolutional Neural Network Heat-Map: February 2018

This document discusses using a convolutional neural network heat-map approach for human face tracking. The method utilizes heat-maps extracted from a shallow CNN trained for face/non-face classification. Multiple CNNs are trained with different pooling sizes in the last layer to generate well-defined heat-maps. Experiments on a face tracking dataset showed encouraging results, demonstrating the effectiveness of using CNN heat-maps for face tracking.


Conference Paper · February 2018
DOI: 10.1145/3184066.3184081


Face Tracking with Convolutional
Neural Network Heat-Map
Nhu-Tai Do, Soo-Hyung Kim, Hyung-Jeong Yang, Guee-Sang Lee, In-Seop Na
School of Electronics and Computer Engineering,
Chonnam National University
77 Yongbong-ro, Buk-gu, Gwangju 500 – 757, Korea
[email protected], [email protected]

ABSTRACT
In this paper, we apply a heat-map approach to human face tracking. We utilize heat-maps extracted from a convolutional neural network (CNN) trained for the face/non-face classification problem. The CNN architecture we build is a shallow network that extracts information meaningful for locating an object. In addition, we train several CNNs with different pool sizes in the last pooling layer to obtain a well-defined heat-map. Experiments on the Visual Object Tracking dataset show very encouraging results, demonstrating the effectiveness of our proposed method.

CCS Concepts
• Computing methodologies~Face Tracking
• Computing methodologies~Convolutional Neural Networks

Keywords
Convolutional neural network; heat-map; face tracking

Figure 1. Rows 1, 2, and 3 in column 1 show images containing faces in multiple cases such as many faces, occlusion, and motion blur. Columns 2 and 3 show the heat-map obtained from the feedback of the shallow CNNs and the mask image identifying the human face based on the specified threshold.

1. INTRODUCTION
Face tracking is a fundamental problem for many applications in image processing and computer vision. It provides facial position information for a wide range of applications such as face recognition, facial expression analysis, and gaze tracking. It remains a challenging task, with demanding requirements in unconstrained environments involving lighting changes, motion blur, pose changes, clutter, and occlusion [1][2].

Most visual object tracking methods (the general problem of which face tracking is a special case) follow one of two approaches: motion-based or appearance-based tracking. Motion-based methods predict the probability of target states based on their previous states, typically with a particle filter [3]. Appearance-based methods divide into two groups: generative and discriminative tracking. Generative tracking tries to represent visual observations of the target object, for example with sparse representation [4], and then finds the optimal position of the object in the image region of interest. Discriminative tracking selects robust features for binary classification and constructs a classifier that distinguishes the tracked object from the background; typically, Ensemble Tracking [5] uses an AdaBoost classifier in conjunction with color features and local histograms.

With the successful application of convolutional neural networks (CNNs) to image classification [6] and object detection [7], robust features extracted from CNN layers have been considered and applied to the tracking problem [8][9]. This addresses the difficulties that tracking-by-detection approaches encounter with deep learning [10], which stem from the fundamental heterogeneity between the tracking and detection problems: the robustness of CNNs in semantic classification leads to an inefficient representation of spatial details such as object location. Moreover, CNNs need large amounts of data to learn object classification and detection, whereas visual object tracking provides only a small number of samples during processing.

This line of work changes how a CNN for object detection is used, treating it as a black box applied to visual object tracking. Heat-maps are extracted at each layer of a CNN pre-trained as a well-known object detection network, such as VGG [11], and the layers are analyzed to discover which ones yield robust features for the tracking problem.
Figure 2. Shallow CNN for binary face/non-face classification, applied in (a) the training process, with a Flatten layer and mean-square-error loss, and (b) the heat-map building process, without the Flatten layer, the loss function, or a fixed input image size. [Diagram: 64 x 64 x 3 input, normalized to -1..1 → Conv 3 x 3 x 10 (ReLU, same padding) → Conv 3 x 3 x 10 (ReLU) → Max Pooling s x s → Dropout 0.25 → Conv 64/s x 64/s x 128 (ReLU, valid padding) → Dropout 0.5 → Conv 1 x 1 x 1 (tanh) → Flatten → output 1.]
CNN features in high-level layers are often very powerful for classifying object semantics under data deformation and corruption, while low-level layers often contain more of the local spatial detail needed to accurately locate the object [8]. Further deep learning methods such as MDNet [12] and FCNT [13] use CNN features derived from pre-trained CNNs and apply various tracking algorithms in the online update and target localization stages to take advantage of these robust features.

In this paper, we propose a method to build a heat-map more suitable for face tracking than for generic visual object tracking. The heat-map and face mask image are illustrated in Figure 1. We build a shallow face CNN for binary face/non-face classification. We then remove the fully connected layer and apply the shallow face CNN to extract robust features from the last layer. Because the network is shallow, these features carry both the semantic classification and the spatial detail needed for object localization.

To improve the accuracy of the features from the shallow face CNN, we experimented with pool-size changes in the max pooling layer near the last layers. We observed that increasing the pool size improves the determination of spatial details but decreases accuracy in determining face/non-face regions in the heat-map. We therefore propose building the heat-map from several shallow networks with different pool sizes. The final heat-map is based on a weighted combination of the feedback from the shallow networks, together with a fixed threshold. From this, we build a mask to quickly identify the object in the target tracking window.

The contribution of this paper has three parts. First, we construct a face/non-face shallow CNN and analyze the effect of the pool size in the max pooling layer on face/non-face classification and face localization. Second, we build a heat-map based on voting from several shallow networks with the same structure but different pool sizes, to increase the accuracy of the heat-map. Finally, we evaluate our method on the Online Object Tracking dataset [14] with encouraging results.

The rest of the paper is organized as follows. Section II describes the shallow face heat-map approach to the face tracking problem. Section III presents the experiments and results, and Section IV concludes.

2. PROPOSED METHOD
2.1 Binary Classification for Faces/Non-Faces
Many CNN architectures for face detection have been proposed in the past decade. Li et al. [15] built three cascaded CNNs of increasing complexity to quickly remove non-face background regions during window sliding and to refine positions through calibration nets. Zhang et al. [16] reduced complexity by using 3 x 3 kernels and proposed multi-task learning that integrates face classification, bounding box regression, and five facial landmarks.

The goal of this paper is to construct a CNN for binary face/non-face classification that avoids the complexity of the VGG object detection network [11] but is strong enough to predict both semantics and face location. The method for building the heat-map from the CNN should also avoid sliding windows when calculating the face-appearance probability density, to reduce processing time.

Therefore, we build a CNN with input image size 64 x 64 so it can recognize large faces, as in Figure 2. The top two convolution layers are 3 x 3 because face/non-face classification is simpler than general object classification; the number of features in these layers is 10 [16]. Next is a max pooling layer of size s x s, used to reduce the size of the image as well as to highlight important features. We also use two dropout layers [17] to further strengthen the ability to prevent overfitting.

The last two layers of the CNN play the role of a fully connected layer. The first has kernel size 64/s with no padding, which reduces the output width and height to 1; its feature depth is 128, connecting to a last convolution layer of size 1 x 1 and depth 1. The depth is 1 because the purpose is binary classification with face (value 1) and non-face (value -1).

Replacing the Dense and Flatten layers with these last two convolution layers, a substitution commonly found in CNNs, lets us reuse the face classification model in the heat-map generation process. In the training and face detection model, the input image size is 64 x 64, so after the max pooling layer it drops to 64/s; with kernel size 64/s and no padding, the image size is reduced to 1 with 128 features in depth. In conjunction with the convolutions, the classifier is trained with a minimum squared loss function and an adaptive learning rate method, with classes -1/1 for non-face and face.
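As a concrete sketch, the architecture described above can be written in Keras, the framework the authors report using in Section 3.1. This is a minimal sketch under stated assumptions: the function name and `heatmap` flag are illustrative, and Adam is assumed as the adaptive learning-rate method; only the layer shapes, dropout rates, and MSE loss come from the text.

```python
# Sketch of the shallow face CNN of Figure 2 (assumptions noted above).
from tensorflow.keras import layers, models

def build_shallow_face_cnn(s=2, heatmap=False):
    """Variant (a) when heatmap=False: fixed 64x64 input, Flatten, MSE loss.
    Variant (b) when heatmap=True: fully convolutional, any input size."""
    input_shape = (None, None, 3) if heatmap else (64, 64, 3)
    net = [
        layers.Input(shape=input_shape),           # input normalized to [-1, 1]
        layers.Conv2D(10, 3, padding="same", activation="relu"),
        layers.Conv2D(10, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(s, s)),     # s x s pooling
        layers.Dropout(0.25),
        # 64/s kernel, valid padding: collapses a 64x64 input to 1x1x128
        layers.Conv2D(128, 64 // s, padding="valid", activation="relu"),
        layers.Dropout(0.5),
        layers.Conv2D(1, 1, activation="tanh"),    # face -> 1, non-face -> -1
    ]
    if not heatmap:
        net.append(layers.Flatten())               # one score per 64x64 image
    model = models.Sequential(net)
    if not heatmap:
        # "Adaptive learning rate method" assumed to be Adam here.
        model.compile(optimizer="adam", loss="mse")
    return model
```

With `heatmap=True` the same stack accepts images of any size, producing the reduced-size heat-map that Section 2.2 describes.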
2.2 Building a Heat-Map
During construction of the heat-map, the model accepts an input image of unspecified size. After the max pooling layer, the image size therefore decreases to (w/s, h/s), where (w, h) are the width and height of the image. The image is further reduced by the 64/s kernel size because of the lack of padding. The final convolution layer, with its 1 x 1 dimension, then gives a probability density image, or heat-map, with every pixel value in [-1, 1]. Each pixel in the heat-map corresponds to a 64 x 64 square in the input image, whose position is calculated from the upper left corner by multiplying by the pool-size value s of the max pooling layer, as shown in Figure 3.

Figure 3. Mapping between the density probability heat-map and the original image by a shallow face CNN.

We build five heat-map models with different pool-size values s = 2, 4, 8, 16, 32, respectively. Every pixel value greater than 0.9 in these heat-maps contributes to the common joint heat-map according to the formula below, with the results shown in Figure 5:

    H(x, y) = Σ_s w_s · H_s(⌊x/s⌋, ⌊y/s⌋) |_(value ≥ 0.9)    (1)

where H is the joint heat-map, H_s is the heat-map of the CNN whose max pooling size is s ∈ {2, 4, 8, 16, 32}, and w_s is the weight of H_s, with respective values 1, 2, 3, 4, 5. After building, the joint heat-map H is normalized into [0, 1].

Figure 5. Heat-map building process by shallow face CNNs with s = 32, 16, 8, 4, 2. The rows correspond to the pool size; the columns correspond to the heat-maps obtained from prediction, thresholding at 0.9, and remapping the heat-map to the size of the original image. Results and the original image are illustrated in Figure 4.

Figure 1 shows positive results in determining the face mask image in cases such as multiple faces, facial occlusion, and facial blur due to motion. However, there are some situations in which the heat-map determines wrong face regions, as shown in Figure 4; these should be addressed by the face detection model.
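The voting of Eq. (1) might be sketched with NumPy as follows. The function and variable names are illustrative, not from the paper, and each per-model heat-map H_s is assumed to be already predicted, with values in [-1, 1].

```python
# Sketch of the joint heat-map of Eq. (1); names are illustrative.
import numpy as np

WEIGHTS = {2: 1, 4: 2, 8: 3, 16: 4, 32: 5}  # w_s for s = 2, 4, 8, 16, 32

def joint_heatmap(heatmaps, height, width, response=0.9):
    """Combine per-model heat-maps into the joint map H of Eq. (1).

    heatmaps: dict mapping pool size s -> 2-D array H_s with values in [-1, 1].
    """
    H = np.zeros((height, width), dtype=np.float64)
    for s, Hs in heatmaps.items():
        # Only responses above 0.9 vote into the joint map.
        votes = np.where(Hs > response, Hs, 0.0)
        # Map each original-image pixel (x, y) to heat-map cell (x//s, y//s).
        ys = np.minimum(np.arange(height) // s, Hs.shape[0] - 1)
        xs = np.minimum(np.arange(width) // s, Hs.shape[1] - 1)
        H += WEIGHTS[s] * votes[np.ix_(ys, xs)]
    if H.max() > 0:
        H /= H.max()  # normalize the joint heat-map into [0, 1]
    return H
```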
2.3 Target Localization
Based on the joint heat-map, we obtain a mask image of the regions in which human faces can appear, using a defined threshold t according to the following formula:

    Mask(x, y) = 1 when H(x, y) ≥ t, 0 otherwise    (2)

In this paper, we set the threshold t to 0.3 to delineate the human face.

2.4 Single Face Tracking
The joint heat-map has been tested for efficiency, in terms of both performance and accuracy, through single face tracking, a task commonly used on mobile platforms. At the beginning of the program, the algorithm uses face detection to find the position of the face and initiate tracking. At a given time t, the algorithm maintains a viewport larger than half the face area being observed, with a minimum size of (128, 128). The center of the subject in the next frame is the highest value in the heat-map corresponding to the viewport, and the size of the region is the maximal region containing the face mask image. If the heat-map cannot obtain a face region, the program calls face detection on the small region to find the face and update the heat-map again, expanding to the whole frame if unsuccessful in the small region.

Figure 4. A situation in which the determined face area is redundant.
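The masking of Eq. (2) and the viewport peak search of Section 2.4 might be sketched as below. The threshold t = 0.3 comes from the paper; the function names and the argmax-based localization are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of Eq. (2) and the viewport localization step (assumptions above).
import numpy as np

def face_mask(H, t=0.3):
    """Eq. (2): binary mask of candidate face regions in the joint heat-map H."""
    return (H >= t).astype(np.uint8)

def locate_in_viewport(H, viewport):
    """Return the (x, y) of the highest heat-map value inside a viewport.

    viewport: (x0, y0, w, h), a window kept around the previous face position.
    """
    x0, y0, w, h = viewport
    window = H[y0:y0 + h, x0:x0 + w]
    iy, ix = np.unravel_index(np.argmax(window), window.shape)
    return x0 + ix, y0 + iy
```

In a full tracker, a failure to find any mask pixel inside the viewport would trigger the fallback face detection described in the text.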
3. EXPERIMENTS AND RESULTS
3.1 Training the Shallow Face CNN Model
In this paper, we used the Keras framework with the TensorFlow backend in a Python 3.5 environment to build the shallow face CNNs. We downloaded WIDER FACE [18] and pre-processed the faces in the training and validation sets, eliminating faces smaller than 128, blurred, or heavily occluded. We collected a total of 29,965 faces for training and 7,682 for validation, resized to width and height 64. To obtain non-face images, we generated backgrounds from WIDER FACE images in regions that do not intersect the face ground truth: 30,197 non-face images for training and 9,419 for testing.

We trained five models differing in pool size, using batch size 128, 1000 epochs, a mean-square-error loss function, and an adaptive learning rate algorithm. The results of the training process are shown in Table 1.

Table 1. Training results of the shallow CNNs
Pool size   Loss    Accuracy
2           0.089   0.967
4           0.071   0.969
8           0.069   0.961
16          0.070   0.957
32          0.114   0.929

3.2 Single Face Tracking
For a quality comparison, we use a measure of the Euclidean distance to the ground truth according to the following formula:

    Error(O, G) = D_E(C_O, C_G) / S_G    (3)

where O is the tracked object, G is the ground truth, D_E(C_O, C_G) is the Euclidean distance between the centers of the target and the ground truth, and S_G is the size of the ground truth window.

We compare the method in this paper against four algorithms in the OpenCV library: Boosting (BT), Median Flow (ML), MIL, and TLD. The results are presented in Table 2.

Table 2. Comparison between the proposed method and other methods in the OpenCV library for face tracking
Video       Heat-Map  BT      ML      MIL     TLD
BlurFace    0.0015    0.0068  0.0035  0.008   0.0025
Boy         0.0029    0.0384  0.0179  0.0184  0.0067
David2      0.0025    0.005   0.0288  0.063   0.0151
DragonBaby  0.0072    0.0173  0.0259  0.0192  0.032
Dudek       0.0007    0.0008  0.0008  0.0011  0.0015
FaceOcc1    0.0006    0.0047  0.0025  0.0015  0.0015
FaceOcc2    0.0009    0.0014  0.006   0.0021  0.0087
FleetFace   0.0012    0.0031  0.0016  0.0015  0.002
Freeman1    0.0045    0.0928  0.0396  0.086   0.0657
Girl        0.0066    0.0179  0.0117  0.0099  0.0112
Jumping     0.006     0.0362  0.009   0.0219  0.0135
Man         0.0019    0.0016  0.0021  0.0073  0.0041
Mhyang      0.0009    0.0014  0.0013  0.0039  0.0038
Trellis     0.001     0.0281  0.0288  0.026   0.0307

The results show that our method is more efficient than the compared methods. The average processing speed of the algorithm is 12 frames/s on the above videos.

4. CONCLUSION
The paper has suggested ideas for adopting a CNN to create and use a heat-map. The proposed heat-map is capable of localizing objects as well as providing semantic information for tracking algorithms. The method achieves initial results, but still requires improvements in the accuracy of the heat-map in determining the object effectively in unconstrained environments.

ACKNOWLEDGMENTS
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2017R1A4A1015559). This work was also partly supported by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-00383, Smart Meeting: Development of Intelligent Meeting Solution based on Big Screen Device).

REFERENCES
[1] M. H. Yang, D. J. Kriegman, and N. Ahuja. Detecting faces in images: A survey. IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 1, pp. 34-58, 2002.
[2] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1442-1468, 2014.
[3] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process., vol. 50, no. 2, pp. 174-188, 2002.
[4] W. Hu, W. Li, X. Zhang, and S. Maybank. Single and multiple object tracking using a multi-feature joint sparse representation. IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 4, pp. 816-833, 2015.
[5] S. Avidan. Ensemble tracking. IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 2, pp. 261-271, 2007.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, pp. 1-9, 2012.
[7] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS, pp. 91-99, 2015.
[8] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang. Hierarchical convolutional features for visual tracking. Proc. IEEE Int. Conf. Comput. Vis., pp. 3074-3082, 2015.
[9] H. Li, Y. Li, and F. Porikli. DeepTrack: Learning discriminative feature representations online for robust visual tracking. IEEE Trans. Image Process., vol. 25, no. 4, pp. 1834-1848, 2016.
[10] L. Wang, T. Liu, G. Wang, K. L. Chan, and Q. Yang. Video tracking using learned hierarchical features. IEEE Trans. Image Process., vol. 24, no. 4, pp. 1424-1435, 2015.
[11] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[12] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. CVPR, pp. 4293-4302, 2016.
[13] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. Proc. IEEE Int. Conf. Comput. Vis., pp. 3119-3127, 2015.
[14] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2411-2418, 2013.
[15] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. CVPR, pp. 5325-5334, 2015.
[16] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett., vol. 23, no. 10, pp. 1499-1503, 2016.
[17] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., vol. 15, pp. 1929-1958, 2014.
[18] S. Yang, P. Luo, C. C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 5525-5533, 2016.
Authors' Background
Name              Title           Research Field                        Personal Website
Nhu-Tai Do        PhD Candidate   Pattern Recognition, Deep Learning
Soo-Hyung Kim     Full Professor  Pattern Recognition, Deep Learning    http://pr.jnu.ac.kr/shkim/
Hyung-Jeong Yang  Full Professor  Pattern Recognition, Deep Learning
Guee-Sang Lee     Full Professor  Pattern Recognition, Deep Learning
In-Seop Na        PhD             Pattern Recognition, Deep Learning