
A Deep Face Identification Network Enhanced by Facial Attributes Prediction

Fariborz Taherkhani, Nasser M. Nasrabadi, Jeremy Dawson


West Virginia University
[email protected], [email protected], [email protected]
arXiv:1805.00324v1 [cs.CV] 20 Apr 2018

Abstract

In this paper, we propose a new deep framework which predicts facial attributes and leverages them as a soft modality to improve face identification performance. Our model is an end-to-end framework which consists of a convolutional neural network (CNN) whose output is fanned out into two separate branches; the first branch predicts facial attributes while the second branch identifies face images. Contrary to the existing multi-task methods, which only use a shared CNN feature space to train these two tasks jointly, we fuse the predicted attributes with the features from the face modality in order to improve the face identification performance. Experimental results show that our model benefits both face identification and facial attribute prediction performance, especially in the case of identity facial attributes such as gender. We tested our model on two standard datasets annotated with identities and face attributes. Experimental results indicate that the proposed model outperforms most of the existing face identification and attribute prediction methods.

1. Introduction

Deep neural networks, particularly deep Convolutional Neural Networks (CNNs), have provided significant improvement in visual tasks such as face recognition, attribute prediction and image classification [16, 26, 28, 12, 19, 22]. Despite this advancement, designing a deep model that learns different tasks jointly while improving their performance by sharing learned parameters remains a challenging problem.

Providing auxiliary information to a CNN-based face recognition model can improve its recognition performance; however, in some cases such information is available only during training and may not be available during the testing phase. Despite the potential advantages of using auxiliary data, these problems have diminished the popularity and flexibility of using both soft and hard modalities for biometric applications [30].

We propose a model which jointly predicts facial attributes and identifies faces while simultaneously leveraging the predicted facial attributes as an auxiliary modality to improve face identification performance. We also show that when our model is trained jointly to recognize face images and predict facial attributes, its performance on facial attribute prediction increases as well. In other words, in our model the two modalities improve each other's performance once they are trained jointly. We show that soft biometric information such as age and gender, which on its own is not distinctive enough for face identification, nevertheless provides complementary information alongside the primary information, such as the face images.

Despite significant improvements in face recognition performance, it is still an ongoing problem in computer vision [3, 11, 24, 25, 27, 29]. There are a number of approaches in the literature that use facial attributes for biometric applications such as face recognition. For example, Wang et al. [33] propose an attribute-constrained face recognition model for joint facial attribute prediction and face recognition. In this model, the parameters of the network are first updated for attribute prediction, and then the same network is fine-tuned for face recognition. Ranjan et al. [23] add other face-related tasks to improve overall performance: their model is a single multi-task CNN for simultaneous face detection, face alignment, pose estimation, gender recognition, smile detection, age estimation and face recognition.

Facial attributes, as semantic features, can be predicted from face images directly, or from other facial attributes indirectly [32]. Attribute prediction methods are generally classified into local or global approaches. Local methods consist of three steps: first they detect different parts of the object, then they extract features from each part, and finally these features are concatenated to train a classifier [18, 4, 7, 2, 20, 37]. For example, Kumar et al.'s method [18] is based on extracting hand-crafted features from ten facial parts. Zhang et al. [37] extract poselets aligning face parts to predict facial attributes. This method performs poorly if object localization and alignment are not perfect. Global approaches, on the other hand, extract features from the entire image, disregarding object parts, and then train a classifier on the extracted features; these methods perform poorly if
large face variations such as occlusion, pose and lighting are present in the image [19, 13, 20].

Attribute prediction has improved in recent years. Bourdev et al. [5] propose a part-based attribute prediction method which deploys semantic segmentation in order to transfer localization information from the auxiliary task of semantic face parsing to the facial attribute prediction task. Liu et al. [19] use two cascaded CNNs: the first, LNet, is used for face localization, while the second, ANet, is used for attribute description. Zhong et al. [38] first localize face images and then use an off-the-shelf architecture designed for face recognition to describe face attributes at different levels of a CNN. He et al. [36] propose a multi-task framework for relative attribute prediction; the method uses a CNN to learn local context and global style information from the intermediate convolutional and fully connected layers, respectively.

Our network is inspired by multi-task networks, but we fuse the output of the attribute predictor into the face recognition layers, which makes it different from other existing multi-task methods such as Wang et al.'s [33] approach. Our deep CNN model is constructed from two cascaded networks, the second of which consists of two branches used for facial attribute prediction and face identification, respectively. These two branches communicate information by sharing the parameters of the first network in the model, as well as by fusing the attribute branch with the last pooling layer of the face identification branch. In our model, all the parameters (i.e., the parameters of the two cascaded networks) are updated simultaneously in each training step.

The contributions of our work are summarized as follows:

1) We design a new end-to-end CNN architecture that learns to predict facial attributes while simultaneously being trained with the objective of face identification. Our model shares learned parameters to train both tasks and also fuses attribute information with the face modality to improve face identification performance.

2) Contrary to the existing multi-task methods that only use a shared CNN feature space to train these two tasks jointly, our model uses a feature-level fusion approach to leverage facial attributes for improving face identification performance. Furthermore, we observe that our jointly trained network is a more capable face attribute predictor than one trained on facial attributes alone.

The rest of this paper is organized as follows: the CNN architecture is described in section 2, fusion of the attribute and face modalities is described in section 3, model training parameters are described in section 4, and finally, results and concluding remarks are provided in sections 5 and 6, respectively.

2. Deep Joint Facial Attributes Prediction and Face Identification Model

The proposed architecture predicts facial attributes and uses them as an auxiliary modality to recognize face images. The model is constructed from two successive cascaded networks, as shown in Fig. 1. The first network (net@1) uses the VGG-19 structure [26] with identical filter sizes, convolutional layers, and pooling operations. The first network applies filters with a 3×3 receptive field. The convolution stride is set to 1 pixel. To preserve spatial resolution after convolution, the spatial padding is fixed to 1 pixel for all 3×3 convolutional layers. Spatial pooling is performed by four max-pooling layers, placed after the second, fourth, eighth, and twelfth convolutional layers, and one global average pooling (GAP) layer, placed after the sixteenth convolutional layer. Max-pooling is carried out on a 2×2 pixel window with a stride of 2. Each hidden layer is followed by a Rectified Linear Unit (ReLU) [16] activation function. The GAP layer is essential in our model: if it were discarded and replaced by a max-pooling layer, the output of the fusion layer would have a very high dimension when we fuse the face and attribute modalities together. The GAP layer simply takes the average of each feature map obtained from the last convolutional layer. Since no parameter is optimized at the GAP layer, overfitting is prevented at this layer.

The second network (net@2) is divided into two separate branches trained simultaneously while communicating information through the training process. Both branches consist of two fully connected (FC) layers operating on the output of the first network. The first FC layer of each branch (Fc1 and Fc′1 in Fig. 1) consists of 4096 units. The layers following Fc1 and Fc′1 are fully connected layers on which the soft-max operation is conducted. The first branch performs the attribute prediction task, and the output of the last FC layer in this branch, before the soft-max operation, is fused with the GAP layer of net@1 using the Kronecker product [10]. Finally, this fused layer is employed to train the second branch, the face identification task. As shown in Fig. 1, attributes are predicted by net@1 and the parameters of the first branch of net@2, while face images are identified by net@1 and all the parameters in net@2; the overall proposed architecture is shown in Fig. 1.
Figure 1. Proposed CNN architecture; face identification and attribute prediction are trained jointly.
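To make the data flow concrete, here is a minimal, hedged sketch of the two cascaded networks in tf.keras (the paper states the model was implemented in TensorFlow, but this is our reconstruction, not the authors' released code; the 224×224 input, the 10-attribute and 1000-identity head sizes, the per-attribute sigmoid, and all names are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_ATTRS = 10   # identity facial attributes (assumed count, cf. Sec. 5.2)
NUM_IDS = 1000   # number of identity classes (assumed)

inputs = tf.keras.Input(shape=(224, 224, 3))

# net@1: VGG-19-style trunk; max-pools after convs 2, 4, 8 and 12, GAP after conv 16.
x = inputs
for n_convs, filters in [(2, 64), (2, 128), (4, 256), (4, 512)]:
    for _ in range(n_convs):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2, strides=2)(x)
for _ in range(4):  # convs 13-16, followed by global average pooling
    x = layers.Conv2D(512, 3, padding="same", activation="relu")(x)
gap = layers.GlobalAveragePooling2D()(x)            # 512-d face feature

# net@2, branch 1: attribute prediction; its pre-soft-max output is fused below.
fc1 = layers.Dense(4096, activation="relu")(gap)
attr_logits = layers.Dense(NUM_ATTRS)(fc1)
attr_probs = layers.Activation("sigmoid", name="attributes")(attr_logits)

# Kronecker-product fusion: per-sample outer product of the GAP feature and the
# attribute logits, flattened to a 512 * NUM_ATTRS vector (no learnable weights).
fused = layers.Lambda(
    lambda t: tf.reshape(t[0][:, :, None] * t[1][:, None, :],
                         (-1, t[0].shape[1] * t[1].shape[1])))([gap, attr_logits])

# net@2, branch 2: face identification trained on the fused feature.
fc1p = layers.Dense(4096, activation="relu")(fused)
id_logits = layers.Dense(NUM_IDS, name="identity")(fc1p)

model = tf.keras.Model(inputs, [attr_probs, id_logits])
```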

3. Fusion Layer on Facial Attributes and Face Modalities

Previously, feature concatenation has been used as an approach for multimodal fusion. In this work, we use the Kronecker product to fuse the facial attribute features with the face features. Since the Kronecker product of two vectors (i.e., the attribute and face features) is mathematically formed by a matrix direct product, there are no learnable parameters at this layer and, consequently, the chance of overfitting at this layer is low. Furthermore, we argue that, due to the existing correlation between the facial attribute features and the face features, the output neurons of the fusion layer are simple to interpret and are semantically meaningful (i.e., the manifold on which they lie is not complex, merely high dimensional). Therefore, it is simple for the following layers of the network to decode such meaningful information. Assume that v and u are the feature vectors of the attributes and the face, respectively. The Kronecker product of these two vectors is defined as follows:

$$u \otimes v = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix} \otimes \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_m \end{bmatrix} = \begin{bmatrix} u_1 v_1 \\ u_1 v_2 \\ \vdots \\ u_1 v_m \\ u_2 v_1 \\ \vdots \\ u_n v_m \end{bmatrix} \tag{1}$$
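Equation (1) coincides with NumPy's kron for 1-D vectors. The following tiny check (ours, with made-up dimensions) confirms that the fused vector is just the flattened outer product u vᵀ, which is also how it is conveniently computed for a whole batch in a deep-learning framework:

```python
import numpy as np

n, m = 4, 3                      # illustrative sizes: u (face), v (attributes)
u = np.random.randn(n)
v = np.random.randn(m)

fused = np.kron(u, v)            # equation (1): [u1*v1, u1*v2, ..., un*vm]
outer = np.outer(u, v).ravel()   # flattened outer product u v^T

assert np.allclose(fused, outer)
print(fused.shape)               # (12,) -- no learnable parameters involved
```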
4. Training our CNN architecture

In this section, we describe how we train our model. Thousands of images are needed to train such a deep model. For this reason, we initialize the net@1 parameters with a CNN pre-trained on the ImageNet dataset, and then fine-tune it as a classifier using the CASIA-WebFace dataset. CASIA-WebFace contains 10,575 subjects and 494,414 images; as far as we know, it is the largest publicly available face image dataset, second only to the private Facebook dataset.

The proposed deep network is a succession of two cascaded networks. net@1 is constructed from 16 convolutional layers operating on the inputs, intertwined with ReLU non-linearities and five pooling operations. The weights in each convolutional layer form a sequence of 4-d tensors, $W \in \mathbb{R}^{l \times c \times p \times q}$, where l, c, p and q are the dimensions of the weights along the filter, channel, and spatial width and height axes, respectively. For notational simplicity, we denote all the weights in net@1 by W1 and the weights in net@2 by W2. W2 is separated into two groups, W2,1 and W2,2, representing all the weights in the first and second branches, respectively.

L1 and L2, given in (2) and (3), are the loss functions designed to perform the attribute prediction and face identification tasks, respectively. We use cross entropy as our network loss functions. T, C and $X = \{x_i\}_{i=1}^{N}$ denote the number of facial attributes used in the model, the number of classes, and the training samples, respectively. $L_i^0$ and $L_i^j$ represent the face label and the facial attribute label for attribute j of training sample i, respectively. The functions f and g are the outputs of the network for the attribute prediction and face identification tasks, respectively, and f′ and g′ are the soft-max functions applied to the f and g outputs. The loss functions in (2) and (3) show how the two branches of net@2 communicate information and update their learning parameters together. As shown in (2) and (3), the function f (the attribute prediction output) takes W1 and W2,1 as input, while the function g (the face identification output) takes W1, W2,2 and f as input. Therefore, both attribute prediction and face identification use W1 as shared parameters; furthermore, the attribute prediction parameters and W2,2 are used for face identification.

We use the Adam optimizer [15] to minimize our network's loss functions. Adam is a robust and well-adapted optimizer that can be applied to a variety of non-convex optimization problems in deep neural networks. All Adam parameters are initialized following the authors' suggestions; we set the learning rate to 0.001. The optimization algorithm mainly consists of two steps: the first calculates the gradients of the loss functions with respect to the model parameters, and the second updates the biased first moment estimates and then the model parameters, successively.
$$\mathcal{L}_1(W_1, W_{2,1}, X) = -\sum_{j=1}^{T}\sum_{i=1}^{N}\Big[L_i^j \log\big(f'(f(L_i^j \mid x_i, W_1, W_{2,1}))\big) + (1 - L_i^j)\log\big(f'(f(1 - L_i^j \mid x_i, W_1, W_{2,1}))\big)\Big] \tag{2}$$

$$\mathcal{L}_2(W_1, W_{2,1}, W_{2,2}, X) = -\sum_{i=1}^{N}\sum_{k=1}^{C} L_{ik}^0 \log\big(g'(g(L_{ik}^0 \mid x_i, W_1, W_{2,2}, f(x_i, W_1, W_{2,1})))\big) \tag{3}$$

We iterate this algorithm through several epochs over the complete training batches until training error convergence is achieved.
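As a hedged sketch of one joint update (building on the hypothetical two-output `model` from the architecture sketch above; the unweighted sum of the two losses and the specific tf.keras loss classes are our assumptions, not details stated in the paper):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)   # settings used in the paper
attr_loss_fn = tf.keras.losses.BinaryCrossentropy()         # eq. (2): per-attribute labels
id_loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)  # eq. (3)

@tf.function
def train_step(images, attr_labels, id_labels):
    with tf.GradientTape() as tape:
        attr_probs, id_logits = model(images, training=True)
        l1 = attr_loss_fn(attr_labels, attr_probs)  # attribute prediction loss L1
        l2 = id_loss_fn(id_labels, id_logits)       # face identification loss L2
        loss = l1 + l2   # all parameters (W1, W2,1, W2,2) are updated simultaneously
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return l1, l2
```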

5. Experiment

We conducted experiments in two different cases to examine whether our model improves overall performance on the identification and prediction tasks. In the first case, we train and test the model on the two tasks separately, in isolation, while in the second case we employ our model to train both tasks jointly. In the second case, however, we predict facial attributes assuming that such information is not available during the testing phase, and the outputs of the attribute prediction branch before the soft-max operation are fused with the last pooling layer of net@1 using the Kronecker product. We fuse the face modality with those facial attributes, such as gender and face shape, that remain the same in all images of a person. Experimental results show that our model increases overall performance on face identification as well as attribute prediction in comparison to the first case. We performed our experiments on two GeForce GTX TITAN X 12GB GPUs. We ran our model for 100 epochs using batch normalization (i.e., shifting the inputs to zero mean and unit variance) after each convolutional and fully connected layer, before the non-linearity. Batch normalization potentially helps to achieve faster learning as well as higher overall accuracy. Furthermore, batch normalization allows us to use a higher learning rate, which potentially provides another boost in speed. We used TensorFlow to implement our network. The batch size in all experiments is fixed to 128.

Figure 2. First and second rows: image samples from the CelebA dataset; third and fourth rows: samples of aligned face images from the MegaFace dataset.

5.1. Datasets

We conducted our experiments on the CelebA dataset [19] for facial attribute prediction, as well as MegaFace [14], a widely used and well-known face dataset, for face identification.

CelebA is a large-scale, richly annotated face attribute dataset containing more than 200K celebrity images, each of which is annotated with 40 facial attributes. CelebA has about ten thousand identities, with twenty images per identity on average. The dataset is also annotated with five landmarks, and can be used as the training and testing sets for facial attribute prediction, face detection, and landmark (or facial part) localization. To compare our method fairly with the other methods, we use the same setup that they have used: images of 8000 identities for training and the remaining 1000 identities for testing. The train and test sets are available online.1

1 http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

MegaFace is a publicly available and very challenging dataset used for evaluating the performance of face recognition algorithms with up to a million distractors (i.e., up to a million people who are not in the test set). MegaFace contains 1M images of 690K individuals with unconstrained pose, expression, lighting, and exposure. MegaFace captures many different subjects rather than many images of a small number of subjects. The gallery set of MegaFace is collected from a subset of Flickr [31]. The probe set of MegaFace used in the challenge consists of two databases: Facescrub [21] and FG-NET [9]. FG-NET contains 975 images of 82 individuals, each with several images spanning ages from 0 to 69. The Facescrub dataset contains
more than 100K face images of 530 people. The MegaFace challenge evaluates the performance of face recognition algorithms as the number of distractors in the gallery set increases (from 10 to 1M). Training set size is important, since it has been shown that face recognition algorithms trained on larger sets tend to perform better at scale. In order to evaluate face recognition algorithms fairly, the MegaFace challenge has two protocols, with large or small training sets: if a training set has more than 0.5M images and 20K subjects, it is considered large; otherwise, it is considered small. We use a small training set, which has 0.44M images and 10K subjects. The probe set in our experiments is Facescrub.

5.2. Evaluation metrics

We evaluate the face identification performance of our model on the MegaFace dataset, and the facial attribute prediction performance on the CelebA dataset. The MegaFace dataset is not annotated with facial attributes. Our model, however, predicts facial attributes and then uses them for face identification. To conduct experiments on the MegaFace dataset, we restore the model parameters trained on the CelebA dataset, which is annotated with facial attributes as well as identities, and then fine-tune the model parameters on the MegaFace dataset for the face identification objective. Our model predicts facial attributes from the first branch of our architecture and employs this auxiliary modality for face identification.

Face identification: we calculate the similarity between each image in the gallery set and a given image from the probe set, and then rank the gallery images based on the obtained similarities. In face identification, the gallery set should contain at least one image of the same identity. We evaluate our model using rank-1 identification accuracy as well as Cumulative Match Characteristic (CMC) curves. CMC is a rank-based metric indicating the probability that the correct gallery image is found among the top k most similar images from the gallery set.

Facial attribute prediction: we leverage identity facial attributes as an auxiliary modality for improving face identification performance. Identity facial attributes are invariant attributes which remain the same across different images of a person. For example, gender and the shapes of the nose and lips remain the same in different images of a person, whereas attributes such as glasses, mustaches, or beards may or may not be present. We discard such attributes in our model because we look for robust, invariant facial attributes. The identity facial attributes in the CelebA dataset are: narrow eyes, big nose, pointy nose, chubby, double chin, high cheekbones, male, bald, big lips and oval face. We evaluate our attribute predictor using the accuracy metric.

5.3. Methods for Comparisons

Attribute prediction: we compare our method with several competitive algorithms, including FaceTracer [17], PANDA [37], ANet+LNet [19] and MT-RBM-PCA [8]. FaceTracer extracts hand-crafted features, including color histograms and HOG, from functional face image regions and then concatenates these features to train an SVM classifier for predicting attributes; functional regions are determined using ground-truth landmarks. PANDA was originally proposed as an ensemble of several CNNs for body attribute prediction: each CNN in this model extracts features from a well-aligned human part using poselets, and all of the extracted features are then concatenated to train an SVM for body attribute prediction. For our case, it is simple to adapt this method for facial attribute prediction such that the face parts are aligned using landmark points. In the ANet+LNet method, images of the first 8000 identities (roughly 162K images) are employed for pre-training and face localization, and images of the next 1000 identities (roughly 20K images) are used to train an SVM classifier. We use the same testing and training sets to conduct our experiment. We compare our model with the other attribute prediction methods in Table 1, which shows the improvement in identity facial attribute prediction once the model trains both tasks jointly. The results show that joint training contributes most to the attributes of gender, bald, narrow eyes, big lips, big nose, oval face, young, high cheekbones and chubby, in that order.

Face identification: we compare our method with the existing face identification methods reported on the official MegaFace website.2 We primarily compare with publicly released methods, for which the details are known: Google FaceNet [24], Center Loss [34], Lightened CNN [35], LBP [1] and the Joint Bayes model [6]. There are several other methods from commercial companies, such as FaceAll, NTechLAB, SIAT MMLAB, BareBonesFR and 3DiVi, the details of which are not yet known to the community. Therefore, we cannot compare these methods with ours fairly; however, we report them to provide a comprehensive list of references on the MegaFace dataset. Fig. 3 presents the CMC curves of the different methods.

2 http://megaface.cs.washington.edu/results/facescrub.html
| Attribute | FaceTracer | PANDA | LNets+ANet | MT-RBM-PCA | Ours-S | Ours-J |
|---|---|---|---|---|---|---|
| Bald | 89 | 96 | 98 | 98 | 96.16 | 98.93 |
| Big Lips | 64 | 67 | 68 | 69 | 69.25 | 71.69 |
| Big Nose | 74 | 75 | 78 | 81 | 82.35 | 84.67 |
| Chubby | 86 | 86 | 91 | 95 | 94.22 | 95.27 |
| High Cheekbones | 84 | 86 | 88 | 83 | 86.61 | 87.79 |
| Male | 91 | 97 | 98 | 90 | 95.65 | 98.61 |
| Narrow Eyes | 82 | 84 | 81 | 86 | 85.45 | 87.9 |
| Oval Face | 64 | 65 | 66 | 73 | 74.49 | 75.94 |
| Young | 80 | 84 | 87 | 81 | 87.12 | 88.54 |

Table 1. Comparing attribute prediction models on the CelebA dataset (accuracy, %). Ours-S and Ours-J denote our model trained separately and jointly, respectively.

Figure 3. CMC curves of the different methods under the small-training-set protocol with 1M distractors. Please note that the results of the other methods are reported from the official website of the MegaFace dataset.

| Methods | Rels | Protocol | Acc (%) |
|---|---|---|---|
| Google - FaceNet v8 | ✓ | Large | 70.5 |
| NTechLAB - Large | ✗ | Large | 73.3 |
| Faceall Co. - Norm-1600 | ✗ | Large | 64.8 |
| Faceall Co. - FaceAll-1600 | ✗ | Large | 63.98 |
| Lightened CNN | ✓ | Small | 67.11 |
| Center Loss | ✓ | Small | 65.23 |
| LBP | ✓ | Small | 3.02 |
| Joint Bayes | ✓ | Small | 2.33 |
| NTechLAB - Small | ✗ | Small | 58.22 |
| 3DiVi Company | ✗ | Small | 33.71 |
| SIAT-MMLAB | ✗ | Small | 65.23 |
| Barebones FR | ✗ | Small | 59.36 |
| Wang et al. [33] | ✓ | Small | 77.74 |
| PM-Separately (ours) | ✓ | Small | 76.15 |
| PM-Jointly (ours) | ✓ | Small | 78.82 |

Table 2. Comparing face identification models on the MegaFace dataset using the rank-1 identification accuracy metric ("Rels": the method is publicly released, so its details are known).
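Fig. 3 and Table 2 are both rank-based summaries of the same retrieval experiment. As an illustrative sketch only (our code, not the authors' evaluation script; cosine similarity and all names are assumptions), rank-1 accuracy and a CMC curve can be computed from probe and gallery features as follows:

```python
import numpy as np

def cmc_curve(probe_feats, gallery_feats, probe_ids, gallery_ids, max_rank=10):
    """Rank-k identification rates; assumes every probe identity is in the gallery."""
    # Cosine similarity between every probe image and every gallery image.
    p = probe_feats / np.linalg.norm(probe_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sim = p @ g.T                                    # (num_probe, num_gallery)

    order = np.argsort(-sim, axis=1)                 # most similar gallery items first
    matches = gallery_ids[order] == probe_ids[:, None]
    first_hit = matches.argmax(axis=1)               # rank index of first correct match

    cmc = np.array([(first_hit < k).mean() for k in range(1, max_rank + 1)])
    return cmc                                       # cmc[0] is the rank-1 accuracy
```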

Table 2 compares the face identification models on the MegaFace dataset using the rank-1 identification accuracy metric. The results show the superiority of our model. We also observe that the model performance increases by about 2.5% when the model trains attributes and faces jointly, in comparison to the case in which the model is trained separately.

5.4. Further Analysis

Experimental results included in Table 2 show that our model improves face recognition performance by leveraging identity facial attributes. To verify this claim, we conducted experiments for the two cases described earlier. In the second case we emphasize predicting the facial attributes, because in a real face identification scenario such information is not available during the testing phase. To use facial attributes as an auxiliary modality in our proposed model for face identification, we fused this modality with the last pooling layer of the model shown in Fig. 1. The second case of our model, which uses the predicted attributes, outperforms the first case, which does not use any privileged data.

Experimental results show that training the two tasks jointly increases not only face identification performance, but also facial attribute prediction performance, especially on identity facial attributes such as gender. For example, experiments performed on the CelebA dataset indicate that performance on face attributes including narrow eyes, big nose, pointy nose, chubby, double chin, high cheekbones, male, bald, big lips and oval face improves by around 2% on average when the tasks are trained jointly. Moreover, as shown in Table 1, our proposed model outperforms the state-of-the-art methods on identity facial attribute prediction. One intuitive reason for this improvement is that, once our deep CNN model is trained to identify face images, it also learns more accurate face attributes in order to perform better face identification. In other words, these two modalities enhance each other's performance once they are trained jointly.

Table 2 also indicates that using facial attributes as privileged data boosts the model's performance on the face identification task. Our model beats most of the face identification algorithms evaluated in the MegaFace challenge.
Figure 4. Examples of class activation maps generated from the attribute predictor part of our model. The rows correspond to a nose attribute, a mouth attribute, an eyes attribute and a head attribute, respectively. We observe that the highlighted regions are those activated by the class activation map algorithm.

Inspired by the work in [39] on class activation maps, we interpret the prediction decisions made by our proposed architecture. Fig. 4 shows the class activation maps for predicting big nose, big lips, narrow eyes and bald, respectively. We can see that our model is triggered by different semantic regions of the image for different predictions. Fig. 4 also shows that our model, thanks to its GAP layer, learns to localize the common visual patterns of the same facial attribute. Furthermore, the deep features obtained from our attribute predictor branch can also be used for generic facial attribute localization in any given image, without using any extra information such as bounding boxes.
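The following is a rough sketch of how such a map can be computed, following the general recipe of [39] rather than the authors' code; it assumes, for simplicity, a single linear layer mapping the GAP features to the class scores (the actual model interposes a 4096-unit FC layer), and all names and shapes are illustrative:

```python
import numpy as np

def class_activation_map(conv_maps, class_weights, class_idx):
    """Class activation map in the style of [39].

    conv_maps:     (H, W, K) activations of the last conv layer for one image
    class_weights: (K, num_classes) weights of an assumed linear layer after GAP
    class_idx:     index of the attribute whose activation map we want
    """
    w = class_weights[:, class_idx]                     # (K,)
    cam = np.tensordot(conv_maps, w, axes=([2], [0]))   # weighted sum -> (H, W)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()                                # normalize to [0, 1]
    return cam  # upsample to the input size to overlay on the face image
```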
6. Conclusion

In this paper, we proposed an end-to-end deep network that predicts facial attributes and identifies face images simultaneously, with better performance on both tasks. Our model trains these two tasks jointly through a shared CNN feature space, and also fuses the predicted identity attribute modality with the face modality features to improve face identification performance. The model increases both face recognition and face attribute prediction performance in comparison to the case in which the tasks are trained separately. Experimental results show the superiority of the model in comparison to current face identification models. The model also predicts identity facial attributes better than the state-of-the-art models.
References

[1] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2037–2041, 2006.
[2] T. Berg and P. N. Belhumeur. POOF: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 955–962, 2013.
[3] L. Best-Rowden, H. Han, C. Otto, B. F. Klare, and A. K. Jain. Unconstrained face recognition: Identifying a person of interest from a media collection. IEEE Transactions on Information Forensics and Security, 9(12):2144–2157, 2014.
[4] L. Bourdev, S. Maji, and J. Malik. Describing people: A poselet-based approach to attribute classification. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1543–1550. IEEE, 2011.
[5] L. D. Bourdev. Pose-aligned networks for deep attribute modeling, July 26 2016. US Patent 9,400,925.
[6] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In European Conference on Computer Vision, pages 566–579. Springer, 2012.
[7] J. Chung, D. Lee, Y. Seo, and C. D. Yoo. Deep attribute networks. In Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 3, 2012.
[8] M. Ehrlich, T. J. Shields, T. Almaev, and M. R. Amer. Facial attributes classification using multi-task representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 47–55, 2016.
[9] Y. Fu, T. M. Hospedales, T. Xiang, S. Gong, and Y. Yao. Interestingness prediction by robust learning to rank. In European Conference on Computer Vision, pages 488–503. Springer, 2014.
[10] A. Graham. Kronecker Products and Matrix Calculus: With Applications (Mathematics and its Applications). 1981.
[11] M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? Metric learning approaches for face identification. In Computer Vision, 2009 IEEE 12th International Conference on, pages 498–505. IEEE, 2009.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] M. M. Kalayeh, B. Gong, and M. Shah. Improving facial attribute prediction using semantic segmentation. arXiv preprint arXiv:1704.08740, 2017.
[14] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The MegaFace benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4873–4882, 2016.
[15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[17] N. Kumar, P. Belhumeur, and S. Nayar. FaceTracer: A search engine for large collections of images with faces. In European Conference on Computer Vision, pages 340–353. Springer, 2008.
[18] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In Computer Vision, 2009 IEEE 12th International Conference on, pages 365–372. IEEE, 2009.
[19] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
[20] P. Luo, X. Wang, and X. Tang. A deep sum-product architecture for robust facial attributes analysis. In Proceedings of the IEEE International Conference on Computer Vision, pages 2864–2871, 2013.
[21] H.-W. Ng and S. Winkler. A data-driven approach to cleaning large face datasets. In Image Processing (ICIP), 2014 IEEE International Conference on, pages 343–347. IEEE, 2014.
[22] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.
[23] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa. An all-in-one convolutional neural network for face analysis. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 17–24. IEEE, 2017.
[24] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[25] W. R. Schwartz, H. Guo, and L. S. Davis. A robust and scalable approach to face identification. In European Conference on Computer Vision, pages 476–489. Springer, 2010.
[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[27] Y. Sun, D. Liang, X. Wang, and X. Tang. DeepID3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.
[28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[29] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Web-scale training for face identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2746–2754, 2015.
[30] V. Talreja, M. C. Valenti, and N. M. Nasrabadi. Multibiometric secure system based on deep learning. arXiv preprint arXiv:1708.02314, 2017.
[31] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817, 1(8), 2015.
[32] R. Torfason, E. Agustsson, R. Rothe, and R. Timofte. From face images and attributes to attributes. In Asian Conference on Computer Vision, pages 313–329. Springer, 2016.
[33] Z. Wang, K. He, Y. Fu, R. Feng, Y.-G. Jiang, and X. Xue. Multi-task deep neural network for joint face recognition and facial attribute prediction. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pages 365–374. ACM, 2017.
[34] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
[35] X. Wu, R. He, and Z. Sun. A lightened CNN for deep face representation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 4, 2015.
[36] Y. He, L. Chen, and J. Chen. Multi-task relative attribute prediction by incorporating local context and global style information. In R. C. Wilson, E. R. Hancock, and W. A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 131.1–131.12. BMVA Press, September 2016.
[37] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. PANDA: Pose aligned networks for deep attribute modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1637–1644, 2014.
[38] Y. Zhong, J. Sullivan, and H. Li. Face attribute prediction using off-the-shelf CNN features. In Biometrics (ICB), 2016 International Conference on, pages 1–7. IEEE, 2016.
[39] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2921–2929. IEEE, 2016.
