A Deep Face Identification Network Enhanced by Facial Attributes Prediction
matrix direct product, there are no learnable parameters at this layer and, consequently, the chances of overfitting at this layer are low. Furthermore, we argue that, due to the existing correlation between facial attribute features and face features, the output neurons of the fusion layer are simple to interpret and semantically meaningful (i.e., the manifold they lie on is not complex, just high dimensional). Therefore, it is simple for the following layers of the network to decode such meaningful information. Assume that v and u are the feature vectors of the attributes and the face, respectively. The Kronecker product of these two vectors is defined as follows:

$$
u \otimes v =
\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}
\otimes
\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_m \end{bmatrix}
=
\begin{bmatrix} u_1 v_1 \\ u_1 v_2 \\ \vdots \\ u_1 v_m \\ u_2 v_1 \\ u_2 v_2 \\ \vdots \\ u_n v_m \end{bmatrix}
\qquad (1)
$$
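As a concrete illustration of this fusion (a minimal sketch with assumed feature dimensions, not the authors' released code), the Kronecker product of the two feature vectors can be computed directly:

```python
import numpy as np

# Assumed dimensions for illustration; the text does not fix them here.
u = np.random.randn(256)  # face features, e.g. from the last pooling layer of net@1
v = np.random.randn(32)   # attribute features from the prediction branch (pre-softmax)

# np.kron stacks u_i * v for every i, exactly the column vector in (1).
fused = np.kron(u, v)     # shape: (256 * 32,)

# The fusion itself has no learnable parameters: it is a fixed bilinear
# expansion, which is why it adds no overfitting risk of its own.
assert fused.shape == (u.size * v.size,)
```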
4. Training our CNN architecture

In this section, we describe how we train our model. Thousands of images are needed to train such a deep model. For this reason, we initialize the net@1 parameters with a CNN pre-trained on the ImageNet dataset and then fine-tune it as a classifier on the CASIA-WebFace dataset. CASIA-WebFace contains 10,575 subjects and 494,414 images. As far as we know, this is the largest publicly available face image dataset, second only to the private Facebook dataset.

The proposed deep network is a succession of two cascaded networks. net@1 is constructed from 16 layers of convolutional operations on the inputs, intertwined with ReLU non-linearities and five pooling operations. The weights in each convolutional layer form a sequence of 4-d tensors, W ∈ ℝ^{l×c×p×q}, where l, c, p and q are the dimensions of the weights along the filter, channel, and spatial width and height axes, respectively. For notational simplicity, we denote all the weights in net@1 with W1 and the weights in net@2 with W2. W2 is separated into two groups, W2,1 and W2,2, representing all the weights in the first and second branches, respectively.

L1 and L2, given in (2) and (3), are the loss functions designed to perform the attribute prediction and face identification tasks, respectively. We use cross entropy as the network loss function. T, C and X = {x_i}_{i=1}^N denote the number of facial attributes used in the model, the number of classes, and the training samples, respectively. L_i^0 and L_i^j represent the face label and the facial attribute label for attribute j of training sample i, respectively. The functions f and g are the outputs of the network for the attribute prediction and face identification tasks, respectively, and f' and g' are the soft-max functions applied to the f and g outputs. The loss functions in (2) and (3) show how the two branches of net@2 communicate information and update their learning parameters together: the f function (attribute prediction output) takes W1 and W2,1 as input, while the g function (face identification output) takes W1, W2,2 and f as input. Therefore, both attribute prediction and face identification use W1 as shared parameters, and, furthermore, the attribute prediction parameters and W2,2 are both used for face identification.

We use the Adam optimizer [15] to minimize our network's loss functions. Adam is a robust and well-adapted optimizer that can be applied to a variety of non-convex optimization problems in deep neural networks. All Adam parameters are initialized with the values suggested by its authors, and we set the learning rate to 0.001. Each optimization iteration mainly consists of two steps: the first calculates the gradients of the loss functions with respect to the model parameters, and the second updates the biased first-moment estimate and then the model parameters, successively.
$$
L_1(W_1, W_{2,1}, X) = -\sum_{j=1}^{T}\sum_{i=1}^{N} L_i^j \,\log\!\big(f'(f(L_i^j \mid x_i, W_1, W_{2,1}))\big) \qquad (2)
$$

$$
L_2(W_1, W_{2,1}, W_{2,2}, X) = -\sum_{i=1}^{N}\sum_{k=1}^{C} L_{ik}^0 \,\log\!\big(g'(g(L_{ik}^0 \mid x_i, W_1, W_{2,2}, f(x_i, W_1, W_{2,1})))\big) \qquad (3)
$$
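As a rough sketch of how (2) and (3) might be minimized jointly (our illustration under assumed names and shapes such as `model`, `attr_labels` and `id_labels`, with an unweighted sum of the two losses; this is not the authors' released code), one Adam training step in TensorFlow could look like:

```python
import tensorflow as tf

# Adam with the stated learning rate; other hyperparameters keep the
# defaults suggested by its authors [15].
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)       # T binary attributes, loss (2)
cce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)  # C identity classes, loss (3)

def train_step(model, x, attr_labels, id_labels):
    # model(x) is assumed to return (attr_logits, id_logits), where the
    # identification branch internally consumes the attribute output f.
    with tf.GradientTape() as tape:
        attr_logits, id_logits = model(x, training=True)
        l1 = bce(attr_labels, attr_logits)  # attribute prediction loss
        l2 = cce(id_labels, id_logits)      # face identification loss
        loss = l1 + l2                      # both losses update the shared W1
    # Step 1: gradients of the losses w.r.t. all model parameters.
    grads = tape.gradient(loss, model.trainable_variables)
    # Step 2: Adam updates its moment estimates, then the parameters.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return l1, l2
```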
5. Experiment
Figure 2. First and second rows are image samples in the CelebA dataset; third and fourth rows are samples of aligned face images in the MegaFace dataset.

We conducted experiments for two different cases to examine whether our model improves overall performance on the identification and prediction tasks. In the first case, we train and test the model to perform the two tasks separately, in isolation, while in the second case we employ our model to train both tasks jointly. In the second case, however, we predict facial attributes assuming that such information is not available during the testing phase; the outputs of the attribute prediction branch, taken before the soft-max operation, are then fused with the last pooling layer of net@1 by using the Kronecker product. We fuse the face modality with those facial attributes, such as gender and face shape, which remain the same in all images of a person. Experimental results show that our model increases overall performance in face identification as well as attribute prediction in comparison to the first case. We performed our experiments on two GeForce GTX TITAN X 12GB GPUs. We ran our model for 100 epochs using batch normalization (i.e., shifting inputs to zero mean and unit variance) after each convolutional and fully connected layer, before the non-linearity. Batch normalization potentially helps to achieve faster learning as well as higher overall accuracy. Furthermore, batch normalization allows us to use a higher learning rate, which potentially provides another boost in speed. We used TensorFlow to implement our network. The batch size in all experiments is fixed to 128.
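To illustrate the layer ordering just described (a sketch only; the filter count and kernel size are our assumptions), one convolutional block with batch normalization applied before the non-linearity looks like this in Keras:

```python
from tensorflow.keras import layers

def conv_bn_relu(x, filters):
    # Convolution -> batch normalization -> ReLU, matching the ordering
    # described above (BN before the non-linearity).
    x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)  # normalize to zero mean, unit variance
    return layers.ReLU()(x)
```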
5.1. Datasets

We conducted our experiments on the CelebA dataset [19] for facial attribute prediction, and on MegaFace [14], a widely used and well-known face dataset, for face identification.

CelebA is a large-scale, richly annotated face attribute dataset containing more than 200K celebrity images, each of which is annotated with 40 facial attributes. CelebA has about ten thousand identities, with twenty images per identity on average. The dataset is also annotated with five landmarks, and can serve as training and testing sets for facial attribute prediction, face detection, and landmark (or facial part) localization. To compare our method fairly with the other methods, we use the same setup they have used: images of 8,000 identities for training and the remaining 1,000 identities for testing. The train and test sets are available here.¹

¹ http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

MegaFace is a publicly available and very challenging dataset used for evaluating the performance of face recognition algorithms with up to a million distractors (i.e., up to a million people who are not in the test set). MegaFace contains 1M images of 690K individuals with unconstrained pose, expression, lighting, and exposure. MegaFace captures many different subjects rather than many images of a small number of subjects. The gallery set of MegaFace is collected from a subset of Flickr [31]. The probe set used in the challenge consists of two databases, Facescrub [21] and FG-NET [9]. FG-NET contains 975 images of 82 individuals, each with several images spanning ages from 0 to 69.
The Facescrub dataset contains more than 100K face images of 530 people. The MegaFace challenge evaluates the performance of face recognition algorithms while increasing the number of distractors (going from 10 to 1M) in the gallery set. Training set size is important, since it has been shown that face recognition algorithms trained on larger sets tend to perform better at scale. In order to evaluate face recognition algorithms fairly, the MegaFace challenge has two protocols, with large or small training sets. If a training set has more than 0.5M images and 20K subjects, it is considered large; otherwise, it is considered small. We use a small training set, with 0.44M images and 10K subjects. The probe set in our experiments is Facescrub.

5.2. Evaluation metrics

We evaluate the face identification performance of our model on the MegaFace dataset, and the facial attribute prediction performance on the CelebA dataset. The MegaFace dataset is not annotated with facial attributes. Our model, however, predicts facial attributes and then uses them for face identification. To conduct experiments on MegaFace, we restore the model parameters trained on the CelebA dataset, which is annotated with facial attributes as well as identities, and then fine-tune those parameters on MegaFace for the objective of face identification. Our model predicts facial attributes from the first branch of our architecture and employs this auxiliary modality for face identification.
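A minimal sketch of this restore-then-fine-tune step (assuming a Keras model and a hypothetical checkpoint path; the exact training scripts are not given in the text):

```python
import tensorflow as tf

def restore_for_finetuning(model: tf.keras.Model, ckpt_path: str):
    # Restore the parameters learned on CelebA from a hypothetical
    # checkpoint, then return a fresh Adam optimizer for fine-tuning
    # on MegaFace identities.
    ckpt = tf.train.Checkpoint(model=model)
    ckpt.restore(ckpt_path).expect_partial()  # optimizer slots need not match
    return tf.keras.optimizers.Adam(learning_rate=0.001)
```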
Face Identification: We calculate the similarity between each image in the gallery set and a given image from the probe set, and then rank the gallery images based on the obtained similarities. In face identification, the gallery set should contain at least one image of the same identity. We evaluate our model by using rank-1 identification accuracy as well as Cumulative Match Characteristic (CMC) curves. CMC is a rank-based metric indicating the probability that the correct gallery image is found among the top k most similar images from the gallery set.
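For concreteness, rank-1 accuracy and a CMC curve can be computed from a probe-gallery similarity matrix as in the following sketch (variable names and shapes are our assumptions):

```python
import numpy as np

def cmc_curve(similarity, probe_ids, gallery_ids, max_rank=10):
    # similarity: (num_probes, num_gallery) similarity scores.
    # Assumes every probe identity appears in the gallery, as required above.
    order = np.argsort(-similarity, axis=1)   # most similar gallery images first
    ranked_ids = gallery_ids[order]           # gallery identities in ranked order
    hits = ranked_ids == probe_ids[:, None]   # correct-identity mask
    first_hit = hits.argmax(axis=1)           # rank of the first correct match
    # CMC[k-1] = fraction of probes whose match appears in the top k.
    return np.array([(first_hit < k).mean() for k in range(1, max_rank + 1)])

# Rank-1 identification accuracy is the first point of the curve:
# rank1 = cmc_curve(sim, probe_ids, gallery_ids)[0]
```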
Facial Attribute Prediction: We leverage identity facial attributes as an auxiliary modality for improving face identification performance. Identity facial attributes are invariant attributes which remain the same across different images of a person. For example, gender and the shapes of the nose and lips remain the same in different images of a person, whereas attributes such as glasses, mustaches, or beards may or may not be present. We discard such attributes in our model because we look for robust, invariant facial attributes. The identity facial attributes in the CelebA dataset are: narrow eyes, big nose, pointy nose, chubby, double chin, high cheekbones, male, bald, big lips and oval face. We evaluate our attribute predictor by using the accuracy metric.

5.3. Methods for Comparisons

Attribute Prediction: We compare our method with several competitive algorithms, including FaceTracer [17], PANDA [37], ANet+LNet [19] and MT-RBM-PCA [8]. FaceTracer extracts handcrafted features, including color histograms and HOG, from functional face image regions and then concatenates these features to train an SVM classifier for predicting attributes; the functional regions are determined by using ground truth landmarks. PANDA was mainly proposed for body attribute prediction, creating an ensemble of several CNNs, each of which extracts features from a well-aligned human part using poselets; all of the extracted features are then concatenated to train an SVM for body attribute prediction. For our case, however, it is simple to adjust this method for facial attribute prediction such that the face part is aligned using landmark points. In the ANet+LNet method, images of the first 8,000 identities (roughly 162K images) are employed for pre-training and face localization, and images of the next 1,000 identities (roughly 20K images) are used to train an SVM classifier. We use the same testing and training sets to conduct our experiment. Table 1 shows the improvement in identity facial attribute prediction once the model trains both tasks jointly. The results show that joint training contributes most for the attributes of gender, bald, narrow eyes, big lips, big nose, oval face, young, high cheekbones and chubby, respectively.

Face Identification: We compare our method with the existing face identification methods reported on the official MegaFace website². We primarily compare with publicly released methods, for which the details are known: Google FaceNet [24], Center Loss [34], Lightened CNN [35], LBP [1] and the Joint Bayes model [6]. There are several other methods from commercial companies, such as FaceAll, NTechLAB, SIAT MMLAB, BareBonesFR and 3DiVi, the details of which are not yet known to the community. Therefore, we cannot compare these methods with ours fairly; however, we report them to provide a comprehensive list of references on the MegaFace dataset. Fig. 3 shows CMC curves for the different methods; our model covers a larger area under the curve in comparison to the other methods. We also compare our model's performance when it trains facial attribute prediction and face identification jointly versus separately; the results show that our face identifier benefits from joint training. We also compare the performance of the algorithms by rank-1 identification accuracy; Table 2 compares the face identification models on the MegaFace dataset.

² http://megaface.cs.washington.edu/results/facescrub.html
Attribute         FaceTracer   PANDA   LNets+ANet   MT-RBM-PCA   Ours-S   Ours-J
Bald                  89         96        98           98        96.16    98.93
Big Lips              64         67        68           69        69.25    71.69
Big Nose              74         75        78           81        82.35    84.67
Chubby                86         86        91           95        94.22    95.27
High Cheekbones       84         86        88           83        86.61    87.79
Male                  91         97        98           90        95.65    98.61
Narrow Eyes           82         84        81           86        85.45    87.90
Oval Face             64         65        66           73        74.49    75.94
Young                 80         84        87           81        87.12    88.54

Table 1. Comparing attribute prediction models on the CelebA dataset (accuracy, %).
Figure 4. Examples of class activation maps generated from the attribute predictor part of our model. Each row indicates the nose attribute, mouth attribute, eyes attribute and head attribute, respectively. We observe that the highlighted regions are activated by the class activation map algorithm.

Inspired by the work in [39] on class activation maps, we interpret the prediction decisions made by our proposed architecture. Fig. 4 shows the class activation maps for predicting big nose, big lips, narrow eyes and bald, respectively. We can see that our model is triggered by different semantic regions of the image for different predictions. Fig. 4 also shows that, owing to the GAP layer, our model learns to localize the common visual patterns for the same facial attribute. Furthermore, the deep features obtained from our attribute predictor branch can also be used for generic facial attribute localization in any given image, without using any extra information such as a bounding box.
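A brief sketch of the class activation map computation in the spirit of [39] (our illustration; array shapes and names are assumptions, not the authors' code):

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, attr_index):
    # feature_maps: (H, W, K) activations of the last conv layer before GAP.
    # fc_weights:   (K, num_attrs) weights of the classifier after GAP.
    # The CAM for one attribute is the weighted sum of the K feature maps.
    cam = feature_maps @ fc_weights[:, attr_index]  # (H, W)
    cam = np.maximum(cam, 0)                        # keep positive evidence
    return cam / (cam.max() + 1e-8)                 # normalize for display
```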
6. Conclusion

In this paper, we proposed an end-to-end deep network to predict facial attributes and identify face images simultaneously, with better performance. Our model trains these two tasks jointly through a shared CNN feature space, and also fuses the predicted identity attribute modality with the face modality features to improve face identification performance. The model increases both face recognition and facial attribute prediction performance in comparison to the case when the model is trained separately. Experimental results show the superiority of the model in comparison to current face identification models. The model also predicts identity facial attributes.
References

[1] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2037–2041, 2006.
[2] T. Berg and P. N. Belhumeur. POOF: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 955–962, 2013.
[3] L. Best-Rowden, H. Han, C. Otto, B. F. Klare, and A. K. Jain. Unconstrained face recognition: Identifying a person of interest from a media collection. IEEE Transactions on Information Forensics and Security, 9(12):2144–2157, 2014.
[4] L. Bourdev, S. Maji, and J. Malik. Describing people: A poselet-based approach to attribute classification. In 2011 IEEE International Conference on Computer Vision (ICCV), pages 1543–1550. IEEE, 2011.
[5] L. D. Bourdev. Pose-aligned networks for deep attribute modeling, July 26 2016. US Patent 9,400,925.
[6] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In European Conference on Computer Vision, pages 566–579. Springer, 2012.
[7] J. Chung, D. Lee, Y. Seo, and C. D. Yoo. Deep attribute networks. In Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 3, 2012.
[8] M. Ehrlich, T. J. Shields, T. Almaev, and M. R. Amer. Facial attributes classification using multi-task representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 47–55, 2016.
[9] Y. Fu, T. M. Hospedales, T. Xiang, S. Gong, and Y. Yao. Interestingness prediction by robust learning to rank. In European Conference on Computer Vision, pages 488–503. Springer, 2014.
[10] A. Graham. Kronecker Products and Matrix Calculus: With Applications. Mathematics and its Applications. 1981.
[11] M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? Metric learning approaches for face identification. In 2009 IEEE 12th International Conference on Computer Vision, pages 498–505. IEEE, 2009.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] M. M. Kalayeh, B. Gong, and M. Shah. Improving facial attribute prediction using semantic segmentation. arXiv preprint arXiv:1704.08740, 2017.
[14] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The MegaFace benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4873–4882, 2016.
[15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[17] N. Kumar, P. Belhumeur, and S. Nayar. FaceTracer: A search engine for large collections of images with faces. In European Conference on Computer Vision, pages 340–353. Springer, 2008.
[18] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In 2009 IEEE 12th International Conference on Computer Vision, pages 365–372. IEEE, 2009.
[19] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
[20] P. Luo, X. Wang, and X. Tang. A deep sum-product architecture for robust facial attributes analysis. In Proceedings of the IEEE International Conference on Computer Vision, pages 2864–2871, 2013.
[21] H.-W. Ng and S. Winkler. A data-driven approach to cleaning large face datasets. In 2014 IEEE International Conference on Image Processing (ICIP), pages 343–347. IEEE, 2014.
[22] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.
[23] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa. An all-in-one convolutional neural network for face analysis. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 17–24. IEEE, 2017.
[24] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[25] W. R. Schwartz, H. Guo, and L. S. Davis. A robust and scalable approach to face identification. In European Conference on Computer Vision, pages 476–489. Springer, 2010.
[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[27] Y. Sun, D. Liang, X. Wang, and X. Tang. DeepID3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.
[28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[29] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Web-scale training for face identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2746–2754, 2015.
[30] V. Talreja, M. C. Valenti, and N. M. Nasrabadi. Multibiometric secure system based on deep learning. arXiv preprint arXiv:1708.02314, 2017.
[31] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817, 1(8), 2015.
[32] R. Torfason, E. Agustsson, R. Rothe, and R. Timofte. From face images and attributes to attributes. In Asian Conference on Computer Vision, pages 313–329. Springer, 2016.
[33] Z. Wang, K. He, Y. Fu, R. Feng, Y.-G. Jiang, and X. Xue. Multi-task deep neural network for joint face recognition and facial attribute prediction. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pages 365–374. ACM, 2017.
[34] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
[35] X. Wu, R. He, and Z. Sun. A lightened CNN for deep face representation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 4, 2015.
[36] Y. He, L. Chen, and J. Chen. Multi-task relative attribute prediction by incorporating local context and global style information. In R. C. Wilson, E. R. Hancock, and W. A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 131.1–131.12. BMVA Press, September 2016.
[37] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. PANDA: Pose aligned networks for deep attribute modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1637–1644, 2014.
[38] Y. Zhong, J. Sullivan, and H. Li. Face attribute prediction using off-the-shelf CNN features. In 2016 International Conference on Biometrics (ICB), pages 1–7. IEEE, 2016.
[39] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2921–2929. IEEE, 2016.